arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18753 2026-05-19 cs.CL cs.AI cs.LG Version update

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

Affiliations Tsinghua University; Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; Carnegie Mellon University; Sapienza University of Rome; University of Edinburgh; TransPerfect; ELLIS Unit Lisbon

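AI summary This work proposes DashAttention, a differentiable and adaptive sparse hierarchical attention mechanism. It uses the adaptively sparse α-entmax transformation to select a variable number of blocks, which keeps the entire hierarchy differentiable while improving long-context modeling. Experiments show that it outperforms existing methods at high sparsity.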

Comments Preprint

Abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves accuracy comparable to full attention at 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.
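
To make the first-stage block selection concrete, here is a minimal sketch, assuming $\alpha = 2$ (sparsemax), the member of the α-entmax family with a simple closed form. The abstract does not say which α the authors use, and the names `sparsemax` and `select_blocks` are illustrative, not taken from the paper's Triton implementation. The point is only that the number of blocks with nonzero weight depends on the scores, unlike a fixed top-k.

```python
import numpy as np

def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection of scores onto the simplex."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum         # True for the entries that stay nonzero
    k_max = k[support][-1]                      # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def select_blocks(block_scores):
    """Return indices of blocks with nonzero weight, plus the weights (hypothetical helper)."""
    weights = sparsemax(block_scores)
    return np.flatnonzero(weights > 0), weights

# A peaked query keeps a single block; a flatter query keeps three.
peaked_idx, _ = select_blocks(np.array([4.0, 0.5, 0.2, 0.1]))
flat_idx, flat_w = select_blocks(np.array([1.0, 0.9, 0.8, 0.1]))
print(peaked_idx, flat_idx, flat_w)
```

Because the weights are a piecewise-linear function of the scores, gradients can flow through the selection step, which is the property the abstract contrasts with top-k.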

2605.18747 2026-05-19 cs.CL cs.AI Version update

Code as Agent Harness

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

Affiliations University of Illinois Urbana-Champaign; Meta; Stanford University

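AI summary This paper examines the role of code in agentic systems, proposes a unified view that treats code as the foundation of agent infrastructure, and discusses harness interfaces, harness mechanisms, and the challenges of scaling to multi-agent systems.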

Comments GitHub: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers

Abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

2605.18732 2026-05-19 cs.CL cs.AI cs.LG Version update

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé

Affiliations International Development Research Centre Canada; University of Cape Town; Global Center on AI Governance; SnT, University of Luxembourg; CITADEL AI Centre of Excellence, Burkina Faso

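AI summary This study examines how predictable factual recall is in large language models. It finds that model size and topic frequency in the training data are the key factors affecting recall quality, and that the two combined explain 60%-94% of the variance.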

Comments 18 pages, 5 figures, 6 tables

Abstract

While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references assessed by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.
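
One way to read "a sigmoid in the log-linear combination" is sketched below, assuming the functional form sigmoid(a·log10(N) + b·log10(f) + c), with N the parameter count and f the topic's share of the training data. The coefficients `a`, `b`, `c` and the function name are hypothetical placeholders; the abstract does not report fitted values or the exact parameterization.

```python
import math

def recall_quality(n_params, topic_freq, a, b, c):
    """Sigmoid of a linear combination of log parameter count and log topic frequency.

    a, b, c are hypothetical coefficients, not values from the paper.
    """
    x = a * math.log10(n_params) + b * math.log10(topic_freq) + c
    return 1.0 / (1.0 + math.exp(-x))

# With positive a and b, recall rises with both model size and topic frequency.
small_rare = recall_quality(1e9, 1e-6, a=1.0, b=1.0, c=-5.0)
large_rare = recall_quality(1e11, 1e-6, a=1.0, b=1.0, c=-5.0)
large_common = recall_quality(1e11, 1e-4, a=1.0, b=1.0, c=-5.0)
print(small_rare, large_rare, large_common)
```

Under this form, a tenfold increase in parameters and a tenfold increase in topic frequency shift the sigmoid's argument by `a` and `b` respectively, so the two factors can trade off against each other.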

2605.18703 2026-05-19 cs.CL cs.LG Version update

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo

Affiliations LARK, HKUST (GZ); University of Cambridge; UCL; Huawei Technologies Co., Ltd

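AI summary This paper proposes the EnvFactory framework, which automatically synthesizes executable environments and applies robust reinforcement learning to address the limited environment scalability and scarce training data that hinder scaling tool-use agents, significantly improving training efficiency and downstream performance.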

Comments 11 pages

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which often uses about 5 times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

2605.18673 2026-05-19 cs.CY cs.CL Version update

Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

Jingyi Qiu, Qiaozhu Mei

Affiliations University of Michigan

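AI summary This paper examines how generative AI advertising changes advertising by influencing the generative process rather than content placement, proposes a taxonomy organized by influence tier, and points out trustworthiness problems in how current systems handle deeper forms of commercial influence.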

Abstract

Major deployed generative AI advertising systems preserve a visible boundary between commercial content and AI-generated responses. Yet empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users. We argue that generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels. This reframes generative AI advertising as a problem of trustworthy intervention rather than content placement. We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping; and show how these tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes. Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure. The central challenge is whether commercial influence in generative systems can be made trustworthy, i.e., attributable, measurable, contestable, and aligned with user welfare.

2605.18663 2026-05-19 cs.AI cs.CL cs.LG Version update

GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel, Alexandre Rezende, Steven McClain

Affiliations Meta Superintelligence Labs

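AI summary This paper proposes the GIM benchmark, which evaluates models through tasks that integrate multiple cognitive domains. Its core approach is to design 820 original problems that combine broadly accessible knowledge with multiple cognitive operations, keeping reasoning grounded in realistic tasks. It calibrates ability estimates with a 2PL IRT model, publishes a comprehensive leaderboard covering 22 models and 47 test configurations, and studies in depth the trade-off between test-time compute and model capability.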

Comments 56 pages, 27 figures, 4 tables. Code: https://github.com/facebookresearch/gim ; Dataset: https://huggingface.co/datasets/facebook/gim

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.
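
The 2PL item response model mentioned above can be sketched as follows, assuming the textbook form in which the expected score on item j is sigmoid(a_j(θ − b_j)), with discrimination a_j, difficulty b_j, and ability θ. The grid-search estimator, the cross-entropy treatment of continuous scores in [0, 1], and all numbers are illustrative assumptions, not the authors' calibration procedure. Missing responses (NaN) are simply skipped, which is why an ability estimate can stay comparable when raw accuracy would not.

```python
import numpy as np

def expected_score(theta, a, b):
    """2PL item response function: expected score given ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(scores, a, b, grid=None):
    """Grid-search ability estimate from continuous scores in [0, 1]; NaN = missing."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    scores = np.asarray(scores, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = ~np.isnan(scores)                 # ignore missing responses
    s, a_obs, b_obs = scores[observed], a[observed], b[observed]
    losses = []
    for theta in grid:
        p = expected_score(theta, a_obs, b_obs)
        # Cross-entropy between observed scores and the model's expected scores.
        losses.append(-np.sum(s * np.log(p) + (1.0 - s) * np.log(1.0 - p)))
    return float(grid[int(np.argmin(losses))])

# Hypothetical item parameters and two test configurations.
a = [1.0, 1.5, 0.8]
b = [-1.0, 0.0, 1.0]
strong = estimate_ability([1.0, 1.0, 0.5], a, b)
weak = estimate_ability([0.5, 0.0, np.nan], a, b)   # third response missing
print(strong, weak)
```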

2605.18648 2026-05-19 cs.LG cs.AI cs.CL Version update

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Maja Pavlovic, Silviu Paun, Massimo Poesio

Affiliations Queen Mary University London; Amazon; University of Utrecht

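AI summary By comparing human and model labels in soft-label learning, this paper finds that human labels not only improve model accuracy but also act as a regularizer that improves calibration on difficult samples and training stability.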

Abstract

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring the true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.
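
A small sketch of why soft labels can act as a regularizer, assuming the standard cross-entropy loss with a probability vector as the target. The three-class setup and the label values are made up for illustration and do not come from the paper's data. With a one-hot target the loss keeps rewarding more confident predictions; with a soft target the loss is minimized when the prediction matches the annotators' distribution.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return float(-np.sum(np.asarray(target, dtype=float) * log_softmax(logits)))

# Hypothetical ambiguous digit: annotators split their votes across classes.
human_soft = np.array([0.60, 0.35, 0.05])
hard_label = np.array([1.0, 0.0, 0.0])

calibrated = np.log(human_soft)          # prediction equal to the human distribution
overconfident = np.array([8.0, 0.0, 0.0])

# Hard target prefers the overconfident prediction; soft target prefers the calibrated one.
print(cross_entropy(calibrated, hard_label), cross_entropy(overconfident, hard_label))
print(cross_entropy(calibrated, human_soft), cross_entropy(overconfident, human_soft))
```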

2605.18607 2026-05-19 cs.CL cs.LG Version update

Forecasting Downstream Performance of LLMs With Proxy Metrics

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

Affiliations Mila – Quebec AI Institute & McGill University; CIFAR AI Chair; ServiceNow Research; Periodic Labs

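AI summary This paper proposes building proxy metrics by aggregating token-level statistics (such as entropy, top-k accuracy, and expert token rank) from a candidate model's next-token distribution, in order to forecast the downstream performance of large language models more accurately than traditional loss- and compute-based baselines.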

Comments Preprint. 31 pages

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
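
The token-level statistics named in the abstract can be computed as in the sketch below, assuming access to the candidate model's next-token probabilities at each position of an expert-written solution. The array layout, the function name `proxy_stats`, and the plain averaging are assumptions for illustration; the abstract does not specify how the authors aggregate these statistics into a final proxy.

```python
import numpy as np

def proxy_stats(probs, expert_ids, k=5):
    """Aggregate token-level statistics over an expert-written solution.

    probs:      (T, V) array, next-token distribution at each of T positions.
    expert_ids: (T,) array, the token the expert actually wrote at each position.
    """
    probs = np.asarray(probs, dtype=float)
    expert_ids = np.asarray(expert_ids, dtype=int)
    positions = np.arange(probs.shape[0])

    # Shannon entropy per position, treating 0 * log(0) as 0.
    safe = np.where(probs > 0, probs, 1.0)
    entropy = -np.sum(probs * np.log(safe), axis=1)

    # Rank of the expert token: 1 means it was the model's most likely token.
    expert_p = probs[positions, expert_ids]
    rank = 1 + np.sum(probs > expert_p[:, None], axis=1)

    return {
        "mean_entropy": float(entropy.mean()),
        "top_k_accuracy": float(np.mean(rank <= k)),
        "mean_expert_rank": float(rank.mean()),
    }

# Toy example with a 3-token vocabulary and a 2-token expert solution.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(proxy_stats(toy_probs, expert_ids=[0, 1], k=1))
```

Unlike cross-entropy loss, rank-based statistics ignore how much probability mass the model assigns once the expert token is already near the top, which is one plausible reason such proxies could track downstream accuracy differently.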

2605.18583 2026-05-19 cs.SE cs.AI cs.CL cs.CR Version update

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng, Yuekang Li, Leo Yu Zhang, Yi Liu

Affiliations Griffith University; Wake Forest University; Nanyang Technological University; University of New South Wales; Quantstamp

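AI summary This paper presents the OverEager-Gen benchmark for evaluating out-of-scope behavior by coding agents on benign tasks. It finds that when a benchmark spells out the authorized scope, agents stop inferring boundaries and start pattern-matching the declaration text, which undermines measurement validity. Using a behavioral-gradient validator and a dual-channel stack that audits internal tool calls, it verifies differences in overeager behavior across frameworks.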

Abstract

Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign tasks. Building it surfaces a measurement-validity issue: if a benchmark spells out the authorized scope inside the prompt, the agent stops inferring boundaries and starts pattern-matching declaration text. On Claude Code, stripping the consent declaration alone raises the overeager rate from 0.0% to 17.1% on paired scenarios (McNemar exact p = 2.4 x 10^-4). OverEager-Gen therefore certifies each scenario's discriminative power before admission via a behavioral-gradient validator, audits internal tool calls through a dual-channel stack (PATH-injected shim plus per-agent event streams), and ships byte-identical consent_kept and consent_stripped variants. OverEager-Bench contains 500 validated scenarios and ~7,500 runs across four agent products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and six base models; a 50-sample re-annotation gives Cohen's kappa = 0.73 and rule-judge recall = 1.00. Stripping consent multiplies the overeager rate on every shared base model (Delta in [11.9, 17.2] pp). The framework axis dominates effect size: a permissive cluster (Claude Code, Codex CLI, Gemini CLI) runs at 5.4-27.7% while the ask-to-continue framework (OpenHands) sits at 0.2-4.5% (Fisher p <= 10^-5). Within-framework base-model variance reaches 15.9 pp, indicating that model-layer alignment does not fully propagate through permissive permission gating.
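
For readers unfamiliar with the paired test cited above, here is a minimal sketch of the exact two-sided McNemar test, which depends only on the discordant pairs: scenarios where the agent's behavior differs between the consent_kept and consent_stripped variants. The counts in the example are hypothetical; the abstract reports rates and a p-value but not the underlying paired counts.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: pairs overeager only under the first condition.
    c: pairs overeager only under the second condition.
    Under the null hypothesis, each discordant pair is equally likely to fall either way.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts: 13 scenarios become overeager only after consent is stripped,
# and none go the other way.
print(mcnemar_exact(0, 13))
```

With these assumed counts the function returns 2/8192, roughly 2.4 × 10^-4; a perfectly balanced split such as `mcnemar_exact(5, 5)` returns 1.0.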

2605.18572 2026-05-19 cs.CL Version update

MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, Deyu Zhou

Affiliations School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China; Department of Informatics, King’s College London

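AI summary This paper proposes the MA$^{2}$P framework, which uses an autonomous multi-agent architecture and a meta-cognitive configurator to address the challenge in complex persuasion that the persuader must interpret internal states, infer latent mental states, and generate strategy-consistent actions, improving the persuasion success rate.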

Comments 22 pages, 8 figures. Accepted to Findings of ACL 2026

Abstract

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

2605.18567 2026-05-19 cs.CL cs.LG Version update

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

Maximilian Reinhardt, Jonas Scharfenberger, Burkhardt Funk

Affiliations Institute of Information Systems

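AI summary This paper proposes a data-driven approach that combines task-adapted text embeddings with clustering to generate a candidate set of construct groupings, and selects the optimal solution with a loss function that explicitly trades off semantic purity against parsimony in the number of clusters, enabling analysis of how construct groupings and their relations change as priority shifts from purity to parsimony.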

Comments Accepted at the 34th European Conference on Information Systems (ECIS 2026), Milan, Italy

Abstract

Structural equation modeling is widely used in IS research. However, inconsistent construct definitions impede the cumulative development of knowledge. In this work, we present an approach that aims at the integration of structural equation models into a unified model: We use a combination of task-adapted text embeddings and clustering to produce a candidate set of construct groupings. Subsequently, we select the optimal solution using a loss function that explicitly trades off semantic purity and parsimony in the number of clusters. By making this trade-off explicit, our approach allows one to analyze how construct groupings and their relations change as one shifts the priority from purity to parsimony. Empirically, we evaluate and explore the proposed methodology on two datasets from the IS domain.
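
The selection step can be illustrated with a minimal sketch, assuming a loss of the form (1 − purity) + λ · (number of clusters). This particular loss, the weight `lam`, and the candidate values are hypothetical; the abstract states only that the loss trades off semantic purity against parsimony.

```python
def select_grouping(candidates, lam):
    """Pick the candidate grouping with the lowest purity-versus-parsimony loss.

    candidates: list of (n_clusters, purity) pairs, purity in [0, 1].
    lam:        hypothetical weight on the number of clusters.
    """
    def loss(candidate):
        n_clusters, purity = candidate
        return (1.0 - purity) + lam * n_clusters
    return min(candidates, key=loss)

# Made-up candidates: more clusters give purer but less parsimonious groupings.
candidates = [(5, 0.60), (20, 0.85), (80, 0.97)]
for lam in (0.0001, 0.01, 0.05):
    print(lam, select_grouping(candidates, lam))
```

Sweeping `lam` from small to large moves the selected solution from many fine-grained construct groups toward a few broad ones, which is the kind of shift the abstract proposes to analyze.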

2605.18563 2026-05-19 cs.CL Version update

Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

Thomas Hikaru Clark, Roger Levy, Edward Gibson

Affiliations MIT Brain and Cognitive Sciences

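AI summary This study investigates how readers processing "noisy-channel garden-path" sentences use later-arriving information to locate possible errors, revealing targeted regressions in reading dynamics and their implications for theories of noisy-channel language comprehension.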

Abstract

A key question in psycholinguistics is how inferences about the meaning of linguistic input unfold incrementally in a comprehender's mind. In this work, we study reading dynamics for "noisy-channel garden-path" sentences, which temporarily appear well-formed but feature late-appearing violations of expectation that can be resolved not by inferring an alternative syntactic structure, but by inferring the presence of an error. We find evidence for targeted regressions, that is, eye movements towards regions that are promising loci of possible errors in light of later-arriving information, showing patterns consistent with the posterior inferences of a model of noisy-channel processing with reanalysis. We discuss the implications of these findings for theories of noisy-channel language comprehension and information-theoretic explanations of reading dynamics.
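
The posterior inference referred to above follows Bayes' rule: the probability that a sentence was intended, given what was perceived, is proportional to the prior plausibility of the intended sentence times the probability that noise turned it into the perceived string. The sketch below uses made-up probabilities and hypothetical hypothesis labels; it is not the authors' model.

```python
def posterior(prior, likelihood):
    """Posterior over intended sentences given the perceived string (Bayes' rule).

    prior:      P(intended sentence), keyed by hypothesis name.
    likelihood: P(perceived string | intended sentence), same keys.
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}

# Two hypothetical readings once the late-arriving violation is encountered:
# the literal string (implausible continuation, no error) versus a nearby
# sentence that differs by one word (plausible, but requires an error).
prior = {"literal": 0.01, "one_word_error": 0.99}
likelihood = {"literal": 0.95, "one_word_error": 0.05}
print(posterior(prior, likelihood))
```

When the error reading dominates the posterior, the word where the error would have occurred becomes a natural target for a regressive eye movement.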

2605.18549 2026-05-19 cs.CL cs.CR Version update

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

Affiliations NASK - National Research Institute; Faculty of Electronics and Information Technology; Faculty of Mathematics and Computer Science; Jagiellonian University; Tooploox; IDEAS Research Institute; Gdańsk University of Technology

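AI summary This paper studies the internal reasoning dynamics of large reasoning models, analyzing probe trajectories to improve monitoring reliability, and proposes an approach based on signal-processing features that significantly improves the separation of future model states.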

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
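
A probe trajectory is a sequence of per-token probe probabilities, so simple time-series features can summarize it. The sketch below computes one plausible feature for each property the abstract names (volatility, trend, steady-state behavior). The specific definitions, the `tail` fraction, and the function name are assumptions for illustration; the abstract does not list the features the authors extract.

```python
import numpy as np

def trajectory_features(probe_probs, tail=0.25):
    """Summarize a probe trajectory (one probability per generated token)."""
    p = np.asarray(probe_probs, dtype=float)
    steps = np.arange(p.size)
    n_tail = max(1, int(round(tail * p.size)))
    return {
        # Volatility: spread of token-to-token changes.
        "volatility": float(np.std(np.diff(p))),
        # Trend: slope of a least-squares line through the trajectory.
        "trend": float(np.polyfit(steps, p, 1)[0]),
        # Steady state: mean probability over the final stretch of tokens.
        "steady_state": float(p[-n_tail:].mean()),
    }

# Two made-up trajectories: a steady rise and an oscillation around 0.5.
rising = np.linspace(0.1, 0.9, 9)
oscillating = np.array([0.5, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.5])
print(trajectory_features(rising))
print(trajectory_features(oscillating))
```

The two trajectories have similar mean values but very different volatility and trend, which illustrates why a full trajectory can separate outcomes that a single static prediction cannot.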

2605.18548 2026-05-19 cs.CL cs.AI version update

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao

Affiliations Hong Kong Institute of AI for Science, City University of Hong Kong; Department of Data Science, City University of Hong Kong; Beijing University of Posts and Telecommunications; Li Auto Inc.; Independent Researcher

AI summary This paper introduces the STT-Arena benchmark for evaluating how well large language models replan adaptively under spatio-temporal dynamics, finds that existing models fall well short on such tasks, and proposes STT-Agent-4B to improve performance.

Comments Work in progress

Abstract

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even state-of-the-art proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.

2605.18530 2026-05-19 cs.CL cs.AI cs.LG stat.ML version update

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo, Yongxin Chen, Arash Vahdat, Morteza Mardani, John Thickstun

Affiliations NVIDIA & Cornell; NVIDIA & Georgia Tech; UW-Madison; MBZUAI-IFM; Cornell

AI summary This paper studies how continuous diffusion models scale for language modeling, builds RePlaid by revising Plaid, shows that continuous diffusion is competitive with discrete models in compute efficiency and performance, and provides supporting theory.

Abstract

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

2605.18512 2026-05-19 cs.CL version update

Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

Haochun Wang, Chaofen Yang, Jiatong Liu, Jingbo Wang, Zewen Qiang, Sendong Zhao, Bing Qin, Ting Liu

Affiliations Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China

AI summary This paper proposes DiSP, a framework that uses difficulty-stratified judges and a lightweight router to predict whether a demonstration selection will succeed in in-context learning, achieving higher accuracy and faster inference across several datasets.

Comments ICML 2026

Abstract

In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is "easier to judge than to find": predicting whether a specific query-context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate the success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to $23\times$ end-to-end wall-clock speedup.
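
The stop-on-acceptance step described above can be sketched as follows. This is a minimal illustration, not the DiSP implementation: the `judge` callable, the acceptance `threshold`, and the `"no_suitable_context"` tag are hypothetical, and the difficulty router that would pick a level-specific judge is omitted.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def stop_on_acceptance(
    query: str,
    pool: Sequence[str],
    judge: Callable[[str, List[str]], float],
    k: int = 4,
    budget: int = 8,
    threshold: float = 0.5,
    seed: int = 0,
) -> Tuple[Optional[List[str]], int, Optional[str]]:
    """Sample random k-shot contexts until the judge accepts one.

    Returns (context, trials_used, risk_tag); risk_tag is None on acceptance.
    """
    rng = random.Random(seed)
    for trial in range(1, budget + 1):
        demos = rng.sample(list(pool), k)        # one random demonstration context
        if judge(query, demos) >= threshold:     # predicted success for (q, D)
            return demos, trial, None            # stop at the first accepted context
    return None, budget, "no_suitable_context"   # hypothetical diagnostic risk tag
```

Because the loop exits on the first accepted context, easy queries cost a single judge call, while the budget caps the cost of hard ones.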

2605.18504 2026-05-19 cs.CL version update

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

Spyridon Mavromatis, Sokratis Sofianopoulos, Prokopis Prokopidis, Maria Giagkou

Affiliations National and Kapodistrian University of Athens, Department of Informatics and Telecommunications; Institute for Language and Speech Processing, Athena RC

AI summary This paper presents a new benchmark and fine-tuning experiments on LLM and NMT models for low-resource Ancient Greek to Modern Greek machine translation, showing that fine-tuning substantially improves translation performance.

Comments 14 pages. Accepted for presentation at the 15th Language Resources and Evaluation Conference (LREC 2026), Palma, Mallorca, Spain

Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 8685-8698. European Language Resources Association (ELRA)
Abstract

Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.

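2605.18500 2026-05-19 cs.CL version update

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu, Jiajun Chai, Wei Lin, Guojun Yin

Affiliations Meituan; Tsinghua University

AI summary This paper decouples tool invocation from execution, introducing delayed execution with explicit control to strengthen tool-integrated reasoning. It proposes a hierarchical control framework and a theoretically derived surrogate loss that yield an implicitly hierarchical policy and the IH-GRPO algorithm. Across six out-of-domain mathematical reasoning benchmarks, IH-GRPO improves on the strongest baseline by 1.87%, 2.16%, and 2.53% absolute on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively.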

Abstract

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically couple tool invocation tightly with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. In extensive experiments, IH-GRPO achieves absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively, over the strongest baseline method across six out-of-domain mathematical reasoning benchmarks, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.

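2605.18490 2026-05-19 cs.CL cs.IR version update

Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research

Theodore O. Cochran

Affiliations AI for Altruism (A4A)

AI summary This preregistered comparison evaluates a Vector RAG system against an LLM-compiled wiki on a small multi-domain research corpus, finding that each has strengths and weaknesses in cross-paper connection, answer organization, and cost, with no single system best on every measure.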

Abstract

We preregistered a comparison of two ways to help an LLM answer questions over a small research corpus: a single-round Vector RAG system and an LLM-compiled markdown wiki. Both systems answered the same 13 questions over 24 papers using the same answer-generating model, and their answers were scored by blinded LLM judges. The wiki scored much better at connecting findings across papers, but its advantage in answer organization was not strong after judge adjustment. RAG met the preregistered test for single-fact lookup questions. The clean query-side cost result went against the expected wiki advantage: under the tested setup, the wiki used far more query tokens than RAG, so it could not recover any upfront build cost through cheaper queries. Two exploratory analyses changed how we interpret the result. First, claim-level citation checking favored the wiki: its cited pages more often supported the exact claims being made, even though RAG scored better on the overall groundedness rubric. Second, a decomposition-based RAG variant recovered most of the wiki's advantage on cross-paper synthesis at lower LLM-token cost, but it did not recover the wiki advantage in claim-by-claim citation support. The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run. In this study, no architecture was best on all three.

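2605.18462 2026-05-19 cs.CL version update

From BERT to T5: A Study of Named Entity Recognition

Mei Jia

Affiliations University of Manchester

AI summary This report studies named entity recognition with the pretrained models BERT and T5, compares their performance on sequence labelling, and analyzes common errors and the effect of hyperparameters.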

Comments 11 pages, 9 figures

Abstract

Named entity recognition (NER) is an essential preliminary step in modern NLP applications. This report implements the NER task by fine-tuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class and the simplified 3-class tag schemes, BERT is trained with a weighted cross-entropy loss, and T5 is fine-tuned with two validation strategies. An ablation study over different hyperparameters is also conducted. Moreover, the accompanying analysis provides insights into common errors made by BERT and into the performance of the two models. Based on a range of performance metrics, this report compares the two architectures and explores their abilities in the sequence labelling task, laying the groundwork for further practical use cases.
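
A weighted cross-entropy loss of the kind mentioned above can be sketched in NumPy. The report does not state its weights or normalisation, so this sketch assumes per-class weights and divides by the sum of the weights of the observed labels; the function name and all example values are illustrative.

```python
import numpy as np


def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy over tokens, normalised by total weight."""
    logits = np.asarray(logits, dtype=float)          # shape: (tokens, classes)
    labels = np.asarray(labels, dtype=int)            # shape: (tokens,)
    w = np.asarray(class_weights, dtype=float)[labels]
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]  # per-token negative log-likelihood
    return float(np.sum(w * nll) / np.sum(w))
```

Giving rare entity tags a larger weight than the dominant `O` tag makes errors on entity tokens count more, which is the usual motivation for weighting in imbalanced tagging tasks.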

2605.18352 2026-05-19 cs.CL version update

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

Tara Azin, Yongan Yu, Raj Singh, Olessia Jouravlev

Affiliations Department of Cognitive Science, Carleton University; School of Computer Science, McGill University; Mila – Quebec AI Institute

AI summary By comparing human judgments with LLM predictions, this paper studies presupposition projection in conditionals. It finds that humans combine probabilistic and pragmatic cues, whereas LLMs align with human patterns inconsistently, suggesting that LLM performance may stem from surface pattern matching rather than pragmatic competence.

Comments To appear in the Proceedings of CoNLL 2026, colocated with ACL 2026

Abstract

Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgments, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that the models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs' performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.

2605.18337 2026-05-19 cs.CL version update

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Ruggero Marino Lazzaroni, Jana Lasser, Kirill Solovev

Affiliations Common Crawl Foundation

AI summary This paper presents Infini-News, which builds full-text indexes and retrieval tools that give researchers efficient access to 1.3 billion Common Crawl news articles, with text cleaning, language detection, geographic attribution, and fast search, lowering the barrier to longitudinal, cross-national media research.

Abstract

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
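
How a suffix array supports fast search for arbitrary text patterns can be shown with a toy sketch. This is not the Infini-gram index itself: the construction below is a naive in-memory sort suitable only for small strings, and the function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count matches of `pattern` using two binary searches over the suffix array."""
    m = len(pattern)

    def boundary(strict):
        # First index whose length-m prefix is >= pattern (or > pattern if strict).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(True) - boundary(False)


if __name__ == "__main__":
    corpus = "banana"
    sa = build_suffix_array(corpus)
    print(count_occurrences(corpus, sa, "ana"))
```

All suffixes that begin with the pattern are adjacent in the sorted order, so a lookup needs only a logarithmic number of comparisons in the corpus length, which is what makes very large archives searchable quickly once the index exists.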

2605.18299 2026-05-19 cs.AI cs.CL cs.IR version update

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

Affiliations Kuaishou Technology

AI summary This paper proposes SD-Search, a search-augmented reasoning method based on on-policy hindsight self-distillation, in which the policy itself generates fine-grained supervision without an external teacher model or additional annotations.

Abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, an auxiliary annotation pipeline, or an additional training stage.
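
The token-level Jensen-Shannon objective can be sketched in NumPy. The sketch assumes the student and teacher next-token distributions are already available as probability arrays and that a binary mask marks search-query positions; the function names and the averaging over masked positions are assumptions, not details taken from the paper.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence along the last axis (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        return np.sum(a * (np.log(a + eps) - np.log(b + eps)), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distillation_loss(student, teacher, query_mask):
    """Mean token-level JS divergence over search-query positions only."""
    per_token = js_divergence(student, teacher)   # shape: (seq_len,)
    mask = np.asarray(query_mask, dtype=float)    # 1 at search-query tokens, else 0
    return float(np.sum(per_token * mask) / max(np.sum(mask), 1.0))
```

The Jensen-Shannon divergence is symmetric and bounded by ln 2 under the natural log, so positions where student and teacher disagree completely contribute a finite loss.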

2605.18261 2026-05-19 cs.CL version update

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, Jinzhe Li, Gang Li, Jie Ying, Huanjun Kong, Songyang Zhang, Nanqing Dong

Affiliations Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute

AI summary This paper proposes K2V, a framework that extends RLVR to knowledge-intensive domains through automated synthesis of verifiable data, improving LLM reasoning in those domains without significantly harming general capabilities.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its application to knowledge-intensive domains has not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, which permits flawed reasoning and yields sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLMs in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at https://github.com/SeedScientist/K2V.

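2605.18257 2026-05-19 cs.CV cs.AI cs.CL version update

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

Zeyu Chen, Jie Li, Kai Han

Affiliations Visual AI Lab, The University of Hong Kong

AI summary CodeBind optimizes the multimodal representation space with a unified compositional codebook design, addressing the suboptimal alignment spaces that cross-modal information discrepancies and data scarcity cause in traditional methods, and achieves state-of-the-art performance on multimodal classification and retrieval.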

Comments ACL 2026 Findings; Project page: https://visual-ai.github.io/codebind

Abstract

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.

2605.18253 2026-05-19 cs.CL cs.AI version update

Machine Unlearning for Masked Diffusion Language Models

Georu Lee, Seungwon Jeong, Hoki Kim, Jinseong Park, Woojin Lee

Affiliations Dongguk University-Seoul; Chung-Ang University; Korea Institute for Advanced Study

AI summary This paper proposes MDU, an unlearning framework for masked diffusion language models that revisits how specific knowledge is learned in terms of diffusion and achieves strong unlearning performance.

Comments 20 pages, 8 figures, appendix included

Abstract

Recent masked diffusion language models (MDLMs), such as LLaDA and Dream, have achieved performance comparable to autoregressive large language models. Unlike autoregressive models, which generate text sequentially, MDLMs generate text by iteratively denoising masked positions in parallel. During fine-tuning, MDLMs learn to recover responses from masked response states conditioned on a prompt, thereby shifting their predictions from a prompt-masked unconditional distribution toward a prompt-conditional distribution. Despite this distinct generative and fine-tuning mechanism, machine unlearning for MDLMs remains largely unexplored. In this paper, we propose Masked Diffusion Unlearning (MDU), the first unlearning framework for MDLMs, by revisiting the process of learning specific knowledge in terms of diffusion. Specifically, MDU minimizes a forward KL divergence from the prompt-conditional prediction to a prompt-masked unconditional anchor at every masked response position, with a temperature scaling parameter to control the privacy-utility trade-off. Our empirical results on standard benchmarks and MDLM backbones show that MDU achieves high unlearning performance compared to existing LLM unlearning methods. Code is available at https://github.com/leegeoru/MDU.

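2605.18239 2026-05-19 cs.CL cs.AI version update

Multilingual jailbreaking of LLMs using low-resource languages

Dylan Marx, Marcel Dunaiski

Affiliations Computer Science Division, Mathematical Sciences Department

AI summary This study tests the safety mechanisms of large language models with multi-turn conversations in low-resource African languages and finds that translation quality is a key factor in jailbreak success for low-resource languages.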

Comments 12 pages, 5 figures

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.

2605.18232 2026-05-19 cs.CL cs.AI cs.IR version update

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir

Affiliations Independent researcher

AI summary This paper introduces SomaliWeb v1, a quality-filtered Somali corpus with a matched BPE-16K tokenizer and the first public Somali language-identification benchmark, and documents quality defects in existing distributions.

Comments 16 pages, 6 figures, 6 tables. Code: https://github.com/khaledyusuf44/somali-corpus Dataset: https://huggingface.co/datasets/khaledyusuf44/somaliweb-v1

Abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
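
The two duplicate measurements quoted above (byte-exact duplicates and near-duplicates at a Jaccard threshold of 0.80) can be illustrated with a small sketch. The abstract does not say how documents are shingled or compared at scale, so the word 3-gram shingles and the exhaustive pairwise comparison below are assumptions chosen for clarity; a corpus of this size would need an approximate method instead.

```python
import hashlib


def exact_duplicate_rate(docs):
    """Fraction of documents whose UTF-8 bytes repeat an earlier document."""
    seen, duplicates = set(), 0
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(docs) if docs else 0.0


def shingles(text, n=3):
    """Set of word n-grams; short texts fall back to a single shingle."""
    tokens = text.split()
    if len(tokens) <= n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_pairs(docs, tau=0.80, n=3):
    """All index pairs (i, j), i < j, with shingle Jaccard >= tau (quadratic cost)."""
    sets = [shingles(d, n) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(sets[i], sets[j]) >= tau]
```

Exact hashing catches only identical byte strings, while the Jaccard test also flags documents that differ by a few words, which is why the two percentages in the abstract are reported separately.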

2605.18226 2026-05-19 cs.CL cs.AI (updated version)

Context Memorization for Efficient Long Context Generation

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

Affiliations Institute of Science Tokyo, Japan; Imperial College London, UK

AI summary This paper proposes a training-free context memorization method that externalizes the prefix into a lightweight lookup table of precomputed attention states, improving accuracy over in-context learning while reducing attention latency.

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint.

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph (updated version)

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

Affiliations Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institut für Informationsverarbeitung, Leibniz Universität Hannover; Department of Radiology, Harvard Medical School and Massachusetts General Hospital

AI summary This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. It exploits the correlation between vocal-tract configurations and the produced acoustics to predict part of the image content from audio, and preserves anatomically plausible vocal-tract structure at substantially higher throughput than iterative methods.

Abstract

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM

2605.18211 2026-05-19 cs.CL cs.AI (updated version)

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

Affiliations Analytic Computing, KI, University of Stuttgart, Stuttgart, Germany; Bosch Center for Artificial Intelligence, Stuttgart, Germany; WAIS, University of Southampton, United Kingdom

AI summary This paper proposes GA-S2S, a graph-augmented sequence-to-sequence model that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve knowledge graph link prediction.

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10--14, 2026, Dubrovnik, Croatia

Abstract

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. Existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and, at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure. In contrast, GA-S2S jointly encodes both textual features and the full $k$-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19% relative gain in link prediction accuracy.

2605.18181 2026-05-19 cs.AI cs.CL (updated version)

Scalable Environments Drive Generalizable Agents

Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song, Zhaoyang Yu, Jianhao Ruan, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo

Affiliations HKUST(GZ); DeepWisdom; PKU; NUS; SUTD; UdeM & Mila

AI summary This position paper argues that generalizable agents, which must adapt to diverse tasks and unseen environments, require environment scaling. It identifies the core challenge of environment scaling, proposes a unified taxonomy, and synthesizes construction paradigms for scalable environments.

Abstract

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

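2605.18163 2026-05-19 cs.AI cs.CL (updated version)

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

Tej Sanibh Ranade

Affiliations Independent Researcher

AI summary This paper proposes TRACE, an algorithm that corrects LLM hallucinations at inference time using cross-layer evidence. It requires no training or labels, selects its correction strategy from model-internal evidence, and improves results across multiple benchmarks.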

Comments 25 pages, 8 figures, 4 tables

Abstract

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

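2605.18155 2026-05-19 cs.CL (updated version)

FOL2NS: Generating Natural Sentences from First-Order Logic

Mei Jia

Affiliations University of Manchester

AI summary This paper proposes FOL2NS, a framework that combines rule-driven modules with fine-tuned language models to generate synthetic first-order logic formulas and convert them into natural language. It improves the diversity and coverage of the generated samples, but semantic precision and natural generation remain challenging as structural complexity increases.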

Comments 11 pages, 8 figures

Abstract

Translating formal language into natural language is a foundational challenge in NLP, driving various downstream applications in semantic parsing, theorem validation, and question answering. In this study, we introduce First-Order Logic to Natural Sentence (FOL2NS), a neurosymbolic framework designed to generate synthetic FOL formulas and convert them into natural human expressions. It handles deeply nested structures with varying quantifier depths (QD), which are rarely captured by existing corpora. By combining rule-driven modules with fine-tuned language models, FOL2NS enhances the diversity and coverage of the generated samples. In our experiments, we systematically evaluate the framework's capabilities through both character-level analysis and overall performance metrics. Experimental results show that FOL2NS can reliably produce well-formed templates and fluent statements, but it faces challenges in achieving precise semantic representations and natural generation as structural complexity increases.

2605.13339 2026-05-19 cs.CL cs.AI (updated version)

Probing Persona-Dependent Preferences in Language Models

Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

Affiliations MATS; EPFL; ETH Zürich; Eleos AI Research

AI summary This paper trains linear probes to study how preferences are represented under different personas in language models. It finds that the preference vector is largely shared across personas and that steering along it influences the model's choices.

Comments 41 pages, 45 figures. Code: https://github.com/oscar-gilg/Preferences. Earlier write-up on LessWrong: https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1

Abstract

Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.

2605.12970 2026-05-19 cs.CL (updated version)

Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving

Linas Nasvytis, Judith E. Fan

Affiliations Department of Psychology, Stanford University; Graduate School of Education, Stanford University; Department of Computer Science, Stanford University

AI summary Using speech analysis, this study examines what form insight takes during problem solving and how it affects later problem solving, finding that transferable insights are more accessible to verbal report.

Abstract

Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to think aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). Our first observation was that Same participants improved more rapidly than Different participants. We then leveraged techniques from natural language processing to analyze participants' speech, and found that this accelerated improvement for Same participants was accompanied by changes in both how much they spoke and what they said. In particular, they were more likely to spontaneously label the kind of problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.

2605.10843 2026-05-19 cs.CL cs.AI cs.CY (updated version)

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham, Nguyen Lam Phu Quy, The Anh Han, Long Tran-Thanh

Affiliations Faculty of Information and Technology, University of Science, Vietnam National University; Department of Computer Science, University of Warwick; School of Computing, Engineering and Digital Technologies, Teesside University

AI summary This paper proposes DISCA, which reduces the cultural misalignment of large language models on MultiTP through persona-disagreement-based calibration without changing model weights, offering a scalable alternative to fine-tuning for serving global moral preferences.

Comments 57 pages, 1 figure, 6 MultiTP moral dimensions

Abstract

Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.

2604.25850 2026-05-19 cs.CL cs.SE (updated version)

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang

Affiliations Fudan University; Peking University; Shanghai Qiji Zhifeng Co., Ltd

AI summary This paper proposes Agentic Harness Engineering, an observability-driven approach that addresses the challenges of harness engineering through three pillars, making harness evolution automatic and controllable. Experiments show strong results on multiple benchmarks.

Abstract

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.

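2604.18248 2026-05-19 cs.CR cs.CL (updated version)

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

Thamilvendhan Munirathinam

Affiliations Independent Researcher; prompt-shield project

AI summary This paper proposes seven cross-domain techniques for prompt injection detection. Each ports a mechanism from another field to improve on existing detectors: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory.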

Comments v3.0 (18 May 2026): Added Sec. 5.6 with independent evaluation on three peer-reviewed benchmarks (Liu, USENIX Sec 2024; Garak, Derczynski 2024; InjecAgent, ACL Findings 2024). 8,276 unseen attacks; cross-benchmark plateau at 35-45% on subtle indirect injection. Abstract, contributions, Sec. 6, and 6 refs updated

Abstract

Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory. Three of the seven techniques are implemented in the prompt-shield v0.4.1 release (Apache 2.0) and evaluated in a four-configuration ablation across six datasets including deepset/prompt-injections, NotInject, LLMail-Inject, AgentHarm, and AgentDojo. The local-alignment detector lifts F1 on deepset from 0.033 to 0.378 with zero additional false positives. The stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark. The fatigue tracker is validated via a probing-campaign integration test. All code, data, and reproduction scripts are released under Apache 2.0.
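
To make the "local-sequence alignment from bioinformatics" idea concrete, the sketch below scores the best local alignment between a known attack phrase and an incoming prompt using the classic Smith-Waterman recurrence over word tokens. This is an illustration only: the abstract does not say which alignment algorithm, tokenization, or scoring scheme prompt-shield uses, and the function name and the match/mismatch/gap values here are assumptions.

```python
from typing import Sequence


def smith_waterman_score(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> int:
    """Return the best local alignment score between token sequences a and b.

    Scores are floored at zero, so an alignment can start and end anywhere,
    which lets a paraphrased or padded attack phrase still score highly.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical usage: an inserted word costs one gap but the phrase still aligns.
attack = "ignore previous instructions".split()
prompt = "please ignore all previous instructions now".split()
score = smith_waterman_score(attack, prompt)
```

A detector built on this would compare the score against a threshold, for example relative to the maximum possible score of `match * len(attack)`; choosing that threshold is where the false-positive trade-off mentioned in the abstract would arise.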

2604.15929 2026-05-19 cs.CL (updated version)

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

Supriti Sinhamahapatra, Thai-Binh Nguyen, Yiğit Oğuz, Enes Ugan, Jan Niehues, Alexander Waibel

Affiliations Karlsruhe Institute of Technology; Carnegie Mellon University

AI summary This paper presents MUSCAT, a benchmark for evaluating automatic speech recognition systems on mixed multilingual input, specialized vocabulary, and code-switching, and shows that these remain open challenges for state-of-the-art ASR systems.

Abstract

The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework that goes beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization
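
For reference, the Word Error Rate baseline that the abstract mentions can be computed as the word-level edit distance divided by the reference length. The sketch below is a minimal version of that standard definition, assuming whitespace tokenization and no text normalization; it does not reproduce the MUSCAT evaluation framework, whose additional metrics are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Whitespace tokenization is one reason plain WER is hard to compare across languages, since scripts without word boundaries need a different unit such as characters.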

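2603.20216 2026-05-19 cs.CL cs.AI cs.LG (updated version)

Locally Coherent Parallel Decoding in Diffusion Language Models

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

Affiliations IBM Research - Zurich

AI summary This paper proposes CoDiLA, which adds a small auxiliary autoregressive model to resolve coherence problems in the parallel decoding of diffusion language models, achieving a better trade-off between accuracy and speed on code generation tasks.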

Comments Accepted at ICML 2026

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

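2603.04161 2026-05-19 cs.CL (updated version)

Traces of Social Competence in Large Language Models

Tom Kouwenhoven, Michiel van der Meer, Max van Duijn

Affiliations Leiden Institute of Advanced Computer Science; Leiden University

AI summary This paper studies how large language models perform on the False Belief Test, using Bayesian logistic regression to analyze how model size and post-training affect socio-cognitive competence. Scaling helps performance but not strictly, explicating propositional attitudes alters response patterns, and further reasoning-oriented fine-tuning amplifies that effect.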

Comments Presented at the 2026 Conference on Computational Natural Language Learning (CoNLL)

Abstract

The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023), using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.

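2602.18217 2026-05-19 cs.CL (updated version)

Information-Theoretic Storage Cost in Sentence Comprehension

Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox

Affiliations Department of Linguistics, Georgetown University, USA; National Institute for Japanese Language and Linguistics, Japan

AI summary This paper proposes an information-theoretic measure of storage cost that captures how much contextual information must be maintained during sentence comprehension. The measure is estimated with neural language models and validated in English: it recovers processing asymmetries in center embeddings and relative clauses and predicts reading-time variance.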

Comments Accepted to CoNLL 2026

详情
AI中文摘要

实时句子理解对工作记忆施加了显著负担,因为理解者必须维持上下文信息以预测后续输入。尽管对这种负担的度量在心理语言学理论中发挥了重要作用,但它们大多借助符号语法来形式化,为句法预测分配离散且均匀的成本。本研究提出了一种基于信息论形式化的处理存储成本度量,即在不确定性下,先前词语所携带的关于后续上下文的信息量。与以往离散的、基于语法的度量不同,这种度量是连续的、概率性的、理论中立的,并且可以从预训练的神经语言模型中估计。我们通过三项英语分析验证了该方法的有效性:我们的度量(i)恢复了中心嵌入和关系从句中众所周知的处理不对称性,(ii)与一个句法标注语料库中基于语法的存储成本相关,(iii)在两个大规模自然阅读数据集中,在包含传统信息论预测变量的基线模型之外,还能额外预测阅读时间方差。我们的代码可在https://github.com/kohei-kaji/info-storage获取。

英文摘要

Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.

2602.12015 2026-05-19 cs.CL 版本更新

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

解构大型语言模型中的歧义与不稳定性:一项临床文本到SQL的案例研究

Angelo Ziletti, Leonardo D'Ambrosi

发表机构 * Bayer AG(拜耳公司)

AI总结 本文提出CLUES框架,通过将文本到SQL分解为两个阶段(解释->答案)来区分输出多样性两种不同原因:输入歧义和模型不稳定性,并在临床文本到SQL基准测试中提高了故障预测性能。

详情
Journal ref
Proceedings of the 7th Clinical Natural Language Processing Workshop 2026
AI中文摘要

在临床文本到SQL中部署大型语言模型需要区分输出多样性的两种性质不同的原因:(i)输入歧义,应触发澄清,和(ii)模型不稳定性,应触发人工审查。我们提出CLUES框架,将文本到SQL建模为两阶段过程(解释-->答案),并将语义不确定性分解为歧义分数和不稳定性分数。不稳定性分数通过二分语义图矩阵的Schur补计算。在AmbigQA/SituatedQA(黄金解释)和临床文本到SQL基准测试(已知解释)上,CLUES的故障预测效果优于最先进的Kernel Language Entropy。在部署设置中,它保持竞争力,同时提供单一分数无法给出的诊断性分解。所得到的不确定性区域对应于有针对性的干预:对歧义进行查询细化,对不稳定性进行模型改进。高歧义/高不稳定性区域包含51%的错误,却只覆盖25%的查询,从而实现高效的优先级排序。

英文摘要

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
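The abstract does not say which matrix CLUES builds on the interpretation/answer graph, so the following is only a generic illustration of what a Schur complement does on a bipartite graph. It assumes, hypothetically, a graph Laplacian with block form [[A, B], [C, D]] where D (the answer-node degrees) is diagonal, and computes A - B D^{-1} C. The toy weights and the function name are invented for the sketch and are not taken from the paper.

```python
def schur_complement_diag(a, b, c, d_diag):
    """Return A - B @ inv(D) @ C, where D is diagonal with entries d_diag.

    Matrices are plain lists of lists; every entry of d_diag must be nonzero.
    """
    rows, cols, inner = len(a), len(a[0]), len(d_diag)
    return [
        [a[i][j] - sum(b[i][k] * c[k][j] / d_diag[k] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical toy graph: 2 interpretation nodes, 2 answer nodes.
# w[i][k] is the edge weight between interpretation i and answer k.
w = [[1, 1],
     [0, 1]]

# Blocks of the (assumed) bipartite graph Laplacian [[A, B], [C, D]].
a = [[sum(w[0]), 0], [0, sum(w[1])]]                   # interpretation degrees
d_diag = [w[0][k] + w[1][k] for k in range(2)]         # answer degrees
b = [[-x for x in row] for row in w]                   # -W
c = [[-w[i][k] for i in range(2)] for k in range(2)]   # -W^T

# Eliminating the answer nodes leaves a 2x2 matrix over interpretations only.
reduced = schur_complement_diag(a, b, c, d_diag)
print(reduced)
```

In this toy case the reduced matrix is again a Laplacian (each row sums to zero), which is the usual reason a Schur complement is used to fold one side of a bipartite graph into the other.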

2602.10134 2026-05-19 cs.CR cs.AI cs.CL 版本更新

Reverse-Engineering Model Editing on Language Models

语言模型上的逆向工程模型编辑

Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He

AI总结 本文研究了语言模型中参数编辑的漏洞,提出了一种名为KSTER的逆向工程攻击方法,通过利用参数更新的低秩结构恢复编辑数据,并提出subspace camouflage防御策略以降低重建风险。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在包含数万亿个标记的语料库上进行预训练,因此不可避免地会记住敏感信息。定位后编辑(locate-then-edit)方法作为一种主流的模型编辑范式,通过修改模型参数而无需重新训练,提供了一个有前景的解决方案。然而,在本工作中,我们揭示了这一范式的关键漏洞:参数更新无意中充当了侧信道,使攻击者能够恢复被编辑的数据。我们提出了一种两阶段的逆向工程攻击,称为KSTER(KeySpaceReconsTruction-then-EntropyReduction),该方法利用这些更新的低秩结构。首先,我们从理论上证明了更新矩阵的行空间编码了被编辑主体的“指纹”,通过谱分析可以准确恢复主体。其次,我们引入了一种基于熵的提示恢复攻击,重构了编辑的语义上下文。在多个LLM上的大量实验表明,我们的攻击能够以高成功率恢复编辑数据。此外,我们提出了一种名为subspace camouflage的防御策略,通过语义诱饵来混淆更新指纹,从而有效降低重建风险,而不会影响编辑的实用性。我们的代码可在https://github.com/reanatom/EditingAttack上获得。

英文摘要

Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpaceReconsTruction-then-EntropyReduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAttack.
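To make the "row space as fingerprint" idea concrete, here is a toy sketch and not the KSTER attack itself. It assumes, as a simplification, that an edit is a rank-one update of the form delta_W = v k^T, so every nonzero row is a multiple of the key vector k, and that the attacker holds a table of candidate subject keys. The names `recover_subject` and `candidate_keys` are hypothetical. For a rank-one matrix, reading off a row gives the same direction as the top right singular vector, which is why no SVD routine is needed here.

```python
import math


def outer(v, k):
    """Rank-one matrix v k^T as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]


def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def recover_subject(delta_w, candidate_keys):
    """Guess which candidate key spans the row space of a rank-one update."""
    # Use the largest-norm row as the row-space direction.
    direction = max(delta_w, key=lambda row: sum(x * x for x in row))
    # The sign of the direction is arbitrary, so compare absolute cosines.
    return max(candidate_keys,
               key=lambda name: abs(cosine(direction, candidate_keys[name])))


# Hypothetical subject keys and a simulated edit on subject "B".
candidate_keys = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [1.0, 1.0, 1.0],
}
delta_w = outer([0.5, -2.0], candidate_keys["B"])
guess = recover_subject(delta_w, candidate_keys)
print(guess)
```

Real edits are low-rank rather than exactly rank-one and the keys are typically transformed before entering the update, so an actual attack would need a proper spectral decomposition and the corresponding inverse transform.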

2602.09805 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

超越准确率:分解大语言模型的推理效率

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

发表机构 * Integreat - Norwegian Centre for knowledge-driven machine learning(Integreat - 挪威知识驱动机器学习中心) UiT - The Arctic University of Norway(UiT - 挪威北极大学) University of Oslo(奥斯陆大学)

AI总结 本文提出一种追踪可选的评估协议,通过完成率、条件正确性和生成长度三个指标分解大语言模型的token效率,并在任务工作量元数据可用时对生成长度进行归一化,以评估模型在不同任务上的推理效率和冗余问题。

Comments Preprint (under review). 29 pages, 4 figures

详情
AI中文摘要

随着推理大语言模型越来越多地通过深思、搜索和自我纠正以token换取准确性,单一的准确性分数已无法说明这些token换来的是有用的推理、从困难实例中的恢复,还是不必要的冗长。我们提出了一种追踪可选的评估协议,利用三个即使对闭源模型也可获得的观测量精确分解token效率:完成率、给定完成条件下的条件正确性以及生成长度。当实例级工作量元数据可用时,我们进一步用声明的任务隐含工作量对生成长度进行归一化,并将平均表述开销与依赖工作量的缩放分离开来。当此类元数据不可用时,我们定义了一个可审计的、由求解器导出的工作量尺度,并在留出自身、留出top-k和留出参考池三种扰动下评估其稳定性。我们在CogniLoad、GSM8K、ProofWriter和ZebraLogic上评估了14个共享的开放权重模型。我们进一步在CogniLoad上评估了另外11个模型,从而能够对推理任务的难度因素进行细粒度分析:任务长度、内在难度和干扰项密度。效率和开销排名在所有基准对之间保持稳定,比准确性排名更加稳健,同时该分解区分了逻辑受限、上下文受限(由截断驱动)和冗长受限的失败模式,这些模式在每token准确率指标下看起来完全相同。我们发布了评估工具和报告模板,用于阐明LLM在推理上低效的原因。

英文摘要

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
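One plausible reading of the three-observable decomposition can be sketched as follows. The abstract does not give the formula, so this assumes that incomplete runs count as incorrect, in which case accuracy per generated token equals completion rate times conditional correctness divided by mean generated length. The `Run` record and `decompose` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool  # a final answer was produced (e.g. no truncation)
    correct: bool    # the final answer was judged correct
    tokens: int      # number of generated tokens


def decompose(runs):
    """Split accuracy-per-token into three observables (assumed identity)."""
    n = len(runs)
    finished = [r for r in runs if r.completed]
    completion_rate = len(finished) / n
    conditional_correctness = (
        sum(r.correct for r in finished) / len(finished) if finished else 0.0
    )
    mean_length = sum(r.tokens for r in runs) / n
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_length": mean_length,
        "accuracy_per_token":
            completion_rate * conditional_correctness / mean_length,
    }


# Made-up runs: three completed (two correct), one truncated.
runs = [
    Run(True, True, 100),
    Run(True, False, 200),
    Run(False, False, 300),
    Run(True, True, 200),
]
parts = decompose(runs)
print(parts)
```

Two models with the same accuracy per token can differ sharply in the individual factors, for example one truncating often and the other finishing but answering wrongly, which is the distinction the abstract draws between context-limited and logic-limited failures.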

2602.02262 2026-05-19 cs.SE cs.AI cs.CL 版本更新

OmniCode: A Benchmark for Evaluating Software Engineering Agents

OmniCode: 一个评估软件工程代理的基准

Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta

发表机构 * Cornell University(康奈尔大学) Independent contributor(独立贡献者) UC Santa Barbara(加州大学圣巴巴拉分校) Jadavpur University(贾达普大学) New York University(纽约大学)

AI总结 本文提出OmniCode,一个涵盖更广泛任务类别的软件工程基准,旨在评估软件代理在不同软件开发方面的性能。

详情
AI中文摘要

LLM驱动的编码代理正在重新定义现实世界软件的开发方式。为了推动研究向更好的编码代理发展,我们需要具有挑战性的基准,能够严格评估此类代理执行各种软件工程任务的能力。然而,流行的编码基准如HumanEval和SWE-Bench专注于狭窄的任务,如竞赛编程和补丁生成。实际上,软件工程师必须处理更广泛的任务来完成现实世界软件开发。为了解决这一差距,我们提出了OmniCode,一个新型的软件工程基准,包含比代码或补丁生成更广泛和多样化的任务类别。总体而言,OmniCode包含1794个任务,涵盖三种编程语言 - Python、Java和C++,以及四个关键类别:bug fixing、test generation、code review fixing和style fixing。与之前的软件工程基准不同,OmniCode的任务(1)经过人工验证以消除定义不清的问题,并且(2)合成创建或最近整理以避免数据泄漏问题,呈现了一个从有限现实数据中合成多样化软件任务的新框架。我们使用流行的代理框架如SWE-Agent评估OmniCode,显示尽管它们在Python的bug fixing上表现良好,但在Test Generation等任务以及C++和Java等语言上则表现不佳。例如,SWE-Agent在C++ Test Generation上使用DeepSeek-V3.1达到最大25.0%。OmniCode旨在作为稳健的基准,推动开发在软件开发不同方面表现良好的代理。代码和数据可在https://github.com/seal-research/OmniCode上获取。

英文摘要

LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages - Python, Java, and C++ - and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 25.0% with DeepSeek-V3.1 on C++ Test Generation. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at https://github.com/seal-research/OmniCode.

2601.19667 2026-05-19 cs.CL cs.AI cs.IR cs.LG 版本更新

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL:面向生物医学实体链接的合成上下文增强

Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier

发表机构 * Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Limics(索邦大学、法国国家健康与医学研究院、索邦巴黎北部大学、Limics) Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris(内科,Tenon医院,巴黎公立医院集团) Barcelona Supercomputing Center, Barcelona, Spain(巴塞罗那超级计算中心,西班牙巴塞罗那)

AI总结 SynCABEL通过利用大型语言模型生成丰富的上下文合成训练示例,解决了监督式生物医学实体链接中专家标注数据稀缺的问题,并在三个多语言基准上实现了新的最先进的结果。

Comments 7 pages, 5 figures

详情
AI中文摘要

我们提出了SynCABEL(Synthetic Contextualized Augmentation for Biomedical Entity Linking),一个旨在解决监督式生物医学实体链接(BEL)核心瓶颈的框架:专家标注训练数据的稀缺性。SynCABEL利用大型语言模型为目标知识库中的所有候选概念生成上下文丰富的合成训练示例,无需手动标注即可提供广泛的监督。我们证明,当与仅解码器模型和引导推理相结合时,SynCABEL在三个广泛使用的多语言基准上取得了新的最先进结果:MedMentions(英语)、QUAERO(法语)和SPACCC(西班牙语)。在评估数据效率时,我们表明SynCABEL在标注数据最多减少60%的情况下即可达到完全人工监督的性能,显著减少了对劳动密集且昂贵的专家标注的依赖。最后,考虑到基于精确代码匹配的标准评估常常因本体冗余而低估临床上有效的预测,我们引入了LLM-as-a-judge协议。该分析表明,SynCABEL显著提高了临床上有效预测的比例。我们的合成数据集、模型和代码已发布,以支持可重复性和未来研究。

英文摘要

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS 版本更新

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Speech-Hands: 一种基于自我反思的语音代理方法用于语音识别和多感知音频推理

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

发表机构 * NVIDIA Kyoto University(京都大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出Speech-Hands框架,通过自我反思决策机制解决语音识别和外部声音理解任务中的信任问题,提升了模型在多任务音频推理中的准确性和鲁棒性。

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

详情
AI中文摘要

我们介绍了一种语音代理框架,该框架学习了一种关键的全方位理解技能:知道何时信任自身,何时咨询外部音频感知。我们的工作受到一个关键但反直觉的发现的启发:简单地在语音识别和外部声音理解任务上微调全方位模型往往会降低性能,因为模型容易被噪声假说误导。为了解决这个问题,我们的框架Speech-Hands将问题重新表述为一个显式的自我反思决策。这个可学习的反思原语在防止模型被错误的外部候选干扰方面证明是有效的。我们展示了这种代理行为机制能够自然地从语音识别推广到复杂的多选音频推理。在OpenASR排行榜上,Speech-Hands在七个基准测试中比强大的基线高出12.1%的WER。该模型在音频问答决策中也实现了77.37%的准确率和高F1分数,展示了在多样化的音频问答数据集上的鲁棒性和可靠性。通过统一感知和决策,我们的工作为更可靠和稳健的音频智能提供了实用路径。

英文摘要

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

2601.03851 2026-05-19 cs.CL 版本更新

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

重新思考表格修剪在表格问答中的应用:从顺序修订到黄金轨迹监督的并行搜索

Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合国家科学中心人工智能研究院)

AI总结 本文提出TabTrim框架,通过将表格修剪从顺序修订转变为由黄金轨迹监督的并行搜索,解决了现有方法中无法检测关键答案数据丢失的问题,并在多个表格推理任务中实现了最先进的性能。

Comments 17 pages, 5 figures, accepted to ACL 2026 Oral

详情
AI中文摘要

表格问答(TableQA)显著受益于表格修剪,它通过消除冗余单元提取紧凑的子表格以简化下游推理。然而,现有修剪方法通常依赖于由不可靠的批评信号驱动的顺序修订,常常无法检测到答案关键数据的丢失。为了解决这一限制,我们提出了TabTrim,一种新的表格修剪框架,将表格修剪从顺序修订转变为由黄金轨迹监督的并行搜索。TabTrim通过使用黄金SQL查询执行过程中的中间子表格推导出黄金修剪轨迹,并训练一个修剪器和一个验证器,使逐步修剪结果与黄金修剪轨迹一致。在推理过程中,TabTrim执行并行搜索以探索多个候选修剪轨迹并识别最优的子表格。广泛的实验表明,TabTrim在多样化的表格推理任务中实现了最先进的性能:TabTrim-8B达到73.5%的平均准确率,优于最强基线3.2%,包括在WikiTQ上79.4%和TableBench上61.2%。

英文摘要

Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.

2601.01685 2026-05-19 cs.CL cs.AI cs.MA 版本更新

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

用真相说谎:通过生成蒙太奇进行公开通道多智能体合谋以操纵信念

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

发表机构 * University of Liverpool(利物浦大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文研究了通过公开通道分发真实证据片段,利用多智能体合谋操纵信念的新威胁,提出了生成蒙太奇框架,展示了在14个LLM家族上最高74.4%的攻击成功率,并揭示了更强的推理能力反而增加了易受攻击的风险。

Comments Accepted to the ACL 2026 Main Conference (Oral Presentation)

详情
AI中文摘要

随着大型语言模型(LLMs)向综合实时信息的自主代理转变,其推理能力引入了意想不到的攻击面。本文介绍了一种新的威胁:合谋代理仅使用通过公开通道分发的真实证据片段来引导受害者的信念,而无需依赖隐蔽通信、后门或伪造文件。通过利用LLMs的过度思考倾向,我们首次形式化了认知合谋攻击,并提出生成蒙太奇:一个写作者-编辑-导演框架,通过对抗性辩论和协调发布证据片段来构建欺骗性叙述,使受害者内化并传播捏造的结论。为研究此风险,我们开发了CoPHEME数据集,该数据集源自真实世界谣言事件,并在多种LLM家族中模拟攻击。我们的结果表明,14个LLM家族普遍存在漏洞:攻击成功率在专有模型上达到74.4%,在开放权重模型上达到70.6%。反直觉的是,更强的推理能力会增加易受攻击性,推理专精模型的攻击成功率高于基础模型或提示。此外,这些虚假信念会级联传播到下游评判者,欺骗率超过60%,突显了基于LLM的代理与动态信息环境交互时的社会技术脆弱性。我们的实现和数据可在以下地址获取:https://github.com/CharlesJW222/Lying_with_Truth/tree/main。

英文摘要

As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

2511.20857 2026-05-19 cs.CL cs.AI 版本更新

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory:以自演化记忆评测LLM代理的测试时学习

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

发表机构 * Google DeepMind(谷歌DeepMind) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出Evo-Memory,一个用于评估LLM代理自演化记忆能力的综合流基准和框架,通过构建序列任务流数据集,要求LLM在每次交互后搜索、适应和演化记忆,并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了超过十种代表性的记忆模块。

详情
AI中文摘要

状态性对于大型语言模型(LLM)代理进行长期规划和问题解决至关重要。这使得记忆成为关键组件,但其管理和演化在很大程度上仍未得到充分探索。现有的评估主要集中在静态对话设置上,其中记忆被动地从对话中检索以回答查询,忽略了在不断变化的任务流中积累和重用经验的动态能力。在现实世界环境中,如交互式问题助手或具身代理,LLM需要处理连续的任务流,但往往无法从积累的交互中学习,从而丢失有价值的上下文见解。这一局限呼唤测试时演化,即LLM在部署期间持续检索、整合和更新记忆。为了弥合这一差距,我们引入了Evo-Memory,一个综合的流式基准和框架,用于评估LLM代理的自演化记忆能力。Evo-Memory将数据集组织为顺序任务流,要求LLM在每次交互后搜索、适应和演化记忆。我们统一并实现了十多种有代表性的记忆模块,并在10个多样化的多轮目标导向和单轮推理与问答数据集上对它们进行了评估。为了更好地评测经验重用,我们提供了一个基线方法ExpRAG,用于检索和利用先前经验,并进一步提出ReMem,一个将推理、任务动作和记忆更新紧密集成的行动-思考-记忆精炼流程,以实现持续改进。

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML 版本更新

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于以几何方式记忆;原因尚不清楚

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

发表机构 * Machine Learning Department & Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA Google Research, NY, USA

AI总结 研究探讨了深度序列模型中原子事实的存储机制,发现几何记忆能编码全局关系,即使在训练中未共现的实体间也能建立联系,挑战了传统关联记忆的观点。

Comments Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

人们认为深度序列模型主要以关联记忆的形式存储原子事实,即对共现实体的暴力查找。我们识别出一种截然不同的原子事实存储形式,称之为几何记忆。在这种形式下,模型合成的嵌入编码了所有实体之间新的全局关系,包括训练中未共现的实体。这种存储形式很强大:例如,我们展示了它如何将涉及ℓ重复合的困难推理任务转化为易于学习的一步导航任务。从这一现象中,我们提炼出神经嵌入几何中难以解释的基本方面。我们认为,这种几何的出现(相对于对局部关联的查找)不能简单归因于典型的监督、架构或优化压力。反直觉的是,即使几何比暴力查找更复杂,它仍然会被学到。随后,通过分析与Node2Vec的联系,我们证明这种几何源于一种谱偏置,与主流理论相反,即使缺乏各种压力,它也确实会自然产生。这一分析还向实践者指出,让Transformer记忆更具几何性仍有明显的提升空间。我们希望参数化记忆的几何视角能促使研究者重新审视在知识获取、容量、发现和遗忘学习等领域中的默认直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

2510.24208 2026-05-19 cs.CL cs.LG 版本更新

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

超越神经不兼容:通过潜在语义对齐实现语言模型中的跨尺度知识转移

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

发表机构 * Monash University(莫纳什大学) Technical University of Munich(慕尼黑工业大学) Chongqing University(重庆大学)

AI总结 本文提出SemAlign方法,通过潜在语义对齐实现跨尺度知识转移,解决了不同架构和参数化模型间参数重用受限的问题,通过激活值作为转移介质,利用语义分解与重组稳定地实现知识迁移。

Comments an early-stage version

详情
AI中文摘要

语言模型(LMs)在其参数中编码了大量知识,但如何以细粒度方式转移此类知识,即参数化知识转移(PKT)仍不明确。核心挑战是当源模型和目标模型在架构和参数化上存在差异时,如何实现有效的、高效的跨尺度转移,这使得直接参数重用受到神经不兼容的限制。在本文中,我们识别出潜在语义对齐是跨尺度知识转移的关键前提。与直接移动层参数不同,我们的方法使用激活值作为转移介质。SemAlign包含两个阶段:一个层归因阶段,用于归因任务相关的源层并为每个目标层选择恰好一个源层;一个语义对齐阶段,通过逐层配对并优化目标模型,利用源侧语义监督。对齐通过语义分解和重组在潜在空间中进行。在浅层到深层的转移过程中,只有前沿目标层是可训练的。层目标通过匹配中心化的词-词关系几何与对齐的监督残差来监督该层的残差贡献,而输出KL保持源级预测行为。因此,转移介质既不是参数块也不是绝对的隐藏状态,而是由配对源层监督诱导的目标空间残差几何。在四个基准测试中的评估证实了SemAlign的有效性,进一步分析确认语义分解和重组为跨尺度知识转移提供了一个稳定的机制。

英文摘要

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, making direct parameter reuse strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. SemAlign has two stages: a layer attribution stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a semantic alignment stage that pairs them layer by layer and optimizes the target with source-side semantic supervision. The alignment is carried out in latent space through semantic decomposition and recomposition. During the shallow-to-deep transfer, only the frontier target layer is trainable. The layer objective supervises the residual contribution of that layer by matching centered token-token relation geometry against an aligned supervisory residual, while output KL preserves source-level predictive behavior. The transferred medium is therefore neither a parameter block nor an absolute hidden state, but target-space residual geometry induced by paired source-layer supervision. Evaluations on four benchmarks demonstrate the efficacy of SemAlign, and further analysis confirms that semantic decomposition and recomposition provide a stable mechanism for cross-scale knowledge transfer.

2510.21712 2026-05-19 cs.IR cs.AI cs.CL 版本更新

DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling

DecoupleSearch: 通过分层奖励建模解耦规划与搜索

Hao Sun, Zile Qiao, Bo Wang, Guoxin Chen, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文提出DecoupleSearch框架,通过双价值模型解耦规划与搜索过程,利用蒙特卡洛树搜索评估每一步的质量,并通过分层束搜索迭代优化规划和搜索候选,在不同参数规模的策略模型上验证了方法的有效性。

Comments EMNLP 2025 Main Conference

详情
AI中文摘要

检索增强生成(RAG)系统已成为通过动态整合外部知识来增强大型语言模型(LLM)的关键方法。为了进一步提高RAG的灵活性,代理式RAG将自主代理引入工作流程。然而,代理式RAG面临几个挑战:(1)每一步的成功取决于高质量的规划和准确的搜索;(2)中间推理步骤缺乏监督;(3)规划和搜索的候选空间呈指数级增长。为了解决这些挑战,我们提出了DecoupleSearch,一种新的框架,使用双价值模型解耦规划和搜索过程,使规划推理和搜索落地能够独立优化。我们的方法构建了一个推理树,其中每个节点代表规划和搜索步骤。我们利用蒙特卡洛树搜索来评估每一步的质量。在推理过程中,分层束搜索借助双价值模型迭代优化规划和搜索候选。在不同参数规模的策略模型上的广泛实验验证了我们方法的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes demonstrate the effectiveness of our method.

2510.20584 2026-05-19 cs.CL cs.AI 版本更新

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT自动编码通信数据:子群体一致性分析

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

发表机构 * ETS Research Institute(ETS研究机构)

AI总结 本文研究了使用ChatGPT进行通信数据编码在不同性别和种族/族裔群体间的一致性,发现其编码结果与人类评分者一致,为大规模评估协作与沟通提供了可能。

Comments Accepted to the Journal of Educational Measurement

详情
AI中文摘要

大规模评估沟通与协作依赖于一项劳动密集型任务:依据不同的框架将通信数据编码为各类别。先前研究已证明,可以直接用编码评分规则指示ChatGPT对通信数据进行编码,且其准确性与人类评分者相当。然而,ChatGPT或类似AI技术的编码在不同人口群体(如性别和种族)之间是否表现一致仍不清楚。为填补这一空白,我们改编自动化评分文献中的现有框架,引入了三项检查,用于评估基于LLM的编码的子群体一致性。使用一个典型的协作问题解决编码框架和三种类型协作任务的数据,我们考察了基于ChatGPT的编码在性别和种族/族裔群体中的表现。我们的结果表明,基于ChatGPT的编码在各性别或种族/族裔群体间的表现与人类评分者同样一致,证明了将其用于协作与沟通的大规模评估的可行性。

英文摘要

Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology performs consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding performs consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK(香港中文大学) UCAS(中国科学院大学) XJTU(西安交通大学) UMich(密歇根大学) Microsoft(微软)

AI总结 本文提出DocReward,一种用于评估文档结构和风格的奖励模型,通过构建包含117,000对文档的DocPair数据集,采用Bradley-Terry损失训练,有效提升了文档生成的结构和风格专业性。

详情
AI中文摘要

近期的代理工作流程自动化了专业文档生成,但主要关注文本质量,忽视了结构和风格的专业性,这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型,无法引导代理生成结构和风格专业的文档。我们引入DocReward,一种评估文档结构和风格的文档奖励模型。为此,我们提出了一种文本质量无关的框架,确保评估不受内容质量的影响,并构建了包含117,000对文档的DocPair数据集,涵盖32个领域和267种类型。每对文档内容相同,但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中,DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明,DocReward能有效引导代理生成具有更一致结构和风格专业性的文档,突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
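The abstract states that DocReward is trained with the Bradley-Terry loss on document pairs. As a rough illustration of that loss only, the sketch below computes the pairwise negative log-likelihood for two scalar scores. The function name and the plain-float interface are assumptions made for illustration; this is not the authors' training code.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred item wins under Bradley-Terry.

    Equals -log(sigmoid(score_preferred - score_rejected)).
    """
    margin = score_preferred - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss shrinks as the preferred document is scored higher than the other one.
    print(bradley_terry_loss(2.0, 0.0))
    print(bradley_terry_loss(0.0, 2.0))
```

With equal scores the loss is log 2, and it decreases as the gap in favour of the more professional document grows.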

2510.10930 2026-05-19 cs.CL cs.AI 版本更新

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

发表机构 * University of Cambridge(剑桥大学) MIT(麻省理工学院) Princeton University(普林斯顿大学) NYU(纽约大学) Harvard University(哈佛大学) Stanford University(斯坦福大学)

AI总结 本文研究了语言模型对游戏评估的能力,通过比较现代语言模型和人类及符号计算代理的评估结果,发现推理模型在游戏评估上更接近人类,但随着模型接近博弈最优,其与人类数据的匹配度会减弱,且在评估趣味性时表现出更大的波动。

详情
AI中文摘要

推理不仅仅是解决问题,也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上,通过研究模型如何玩国际象棋和围棋等游戏。在本文中,我们倡导一种新的范式,即评估人工智能系统对游戏的评估。首先,我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集,将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询:评估游戏的收益(或公平性)和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度:计算查询的复杂性和量化查询的难度。我们的结果表明,推理模型在游戏评估上通常比非推理语言模型更接近人类。然而,我们观察到非单调的关系:随着模型接近博弈最优,其与人类数据的匹配度会减弱。我们还发现,在评估趣味性时,模型之间存在更多的波动性,这与量化该查询的难度更大有关。在各种查询和游戏中,推理模型在评估查询时表现出高度变化和不可预测的资源使用,这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

2509.17680 2026-05-19 cs.CL 版本更新

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

当表格问答遇见噪声:为复杂问题和大规模表格设计的双去噪框架

Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang

发表机构 * University of Science and Technology of China(中国科学技术大学) The University of Melbourne(墨尔本大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 本文提出EnoTab双去噪框架,通过改进相关性过滤和表格修剪能力,解决复杂问题和大规模表格中的噪声问题,提升表格问答性能。

Comments 24 pages, 24 figures, accepted to ACL 2026 Main

详情
AI中文摘要

表格问答(TableQA)是自然语言处理(NLP)中的基本任务。大语言模型(LLMs)强大的推理能力在这一领域带来了显著进展。然而,随着实际应用中问题日益复杂且表格规模增大,大量噪声数据被引入,严重降低了推理性能。为了解决这一挑战,我们专注于提升两个核心能力:相关性过滤,即识别并保留与推理真正相关的信息,以及表格修剪,即在保留必要内容的同时减少表格规模。基于这些原则,我们提出了EnoTab,一种为复杂问题和大规模表格设计的双去噪框架。具体来说,我们首先进行基于证据的问题去噪,将问题分解为最小的语义单元,并根据一致性和可用性标准过滤掉与答案推理无关的部分。然后,我们提出证据树引导的表格去噪,构建一个明确且透明的表格修剪路径,逐步移除无关数据。在每一步修剪过程中,我们观察表格的中间状态,并应用后序节点回滚机制来处理异常表格状态,最终产生一个高度可靠的子表格用于最终答案推理。最后,广泛的实验表明,EnoTab在复杂问题和大规模表格的TableQA任务中实现了卓越的性能,证实了其有效性。

英文摘要

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

2509.14004 2026-05-19 cs.CL 版本更新

Early Stopping Chain-of-thoughts in Large Language Models

大语言模型中的早期停止思维链

Minjia Mao, Bowen Yin, Yu Zhu, Xiao Fang

发表机构 * University of Delaware(特拉华大学) Peking University(北京大学)

AI总结 本文提出了一种在推理阶段减少思维链生成长度的方法ES-CoT,通过检测答案收敛并提前停止来降低推理成本,同时保持与标准思维链相当的准确性。

详情
AI中文摘要

大语言模型(LLMs)在通过生成长链式思维(CoT)解决复杂问题时表现出卓越的能力,但这种长CoT会带来较高的推理成本。先前的推理阶段高效推理方法要么需要白盒模型监控推理过程,要么通过直接提示不可靠。为此,我们引入了ES-CoT,一种在推理时缩短CoT生成的方法,通过检测答案收敛并提前停止几乎不损失性能。当观察到推理过程中的语言标记(如“wait”)时,我们提示LLM输出其当前最终答案,称为步骤答案。我们跟踪连续相同步骤答案的运行长度作为答案收敛的度量。我们通过实证和理论证明,步骤答案稳定地收敛到最终答案,且大运行长度跳跃可靠地标记这种收敛。在六个跨三个LLM的推理数据集上的实验表明,ES-CoT在平均上减少了16.08%的推理令牌数量,同时保持与标准CoT相当的准确性。

英文摘要

Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. Previous methods on inference-stage efficient reasoning either require white-box models to monitor the reasoning process or are not reliable through direct prompting. In response, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with almost no performance loss. When observing a linguistic marker (such as "wait") in the reasoning process, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. We show both empirically and theoretically that step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on six reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by 16.08% on average while maintaining accuracy comparable to standard CoT.
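The abstract describes tracking the run length of consecutive identical step answers as a convergence signal. A minimal sketch of that bookkeeping follows. It assumes a simple fixed threshold (`min_run`, a hypothetical parameter) instead of the run-length-jump criterion mentioned in the abstract, whose exact form is not given here, so it should be read as an illustration rather than as ES-CoT itself.

```python
from typing import Hashable, Iterable, Optional


def first_stable_step(step_answers: Iterable[Hashable], min_run: int = 3) -> Optional[int]:
    """Return the 0-based index at which the current step answer has repeated
    `min_run` times in a row, or None if that never happens."""
    run_length = 0
    previous = object()  # sentinel that compares unequal to every real answer
    for index, answer in enumerate(step_answers):
        # Extend the run if the answer is unchanged, otherwise start a new run.
        run_length = run_length + 1 if answer == previous else 1
        previous = answer
        if run_length >= min_run:
            return index
    return None


if __name__ == "__main__":
    # Step answers elicited at successive linguistic markers (made-up values).
    print(first_stable_step(["12", "15", "15", "15", "15"], min_run=3))
```

In this toy trace the answer "15" has been seen three times in a row at index 3, so generation could stop there under the assumed threshold rule.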

2508.15878 2026-05-19 cs.LO cs.AI cs.CL cs.LG 版本更新

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

Lean 与理论计算机科学的交汇:形式-非形式对中可扩展的定理证明挑战合成

Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, Mrinmaya Sachan

发表机构 * D-CHAB, ETH Zurich, Zurich, Switzerland. D-INFK, ETH Zurich, Zurich, Switzerland. ETH AI Center, Zurich, Switzerland. University of Pennsylvania, PA, USA. Independent Researcher.

AI总结 本文提出利用理论计算机科学作为可扩展的严谨证明问题来源,通过算法定义自动生成大量挑战性定理-证明对,展示了在Busy Beaver问题和混合布尔算术问题上的应用,并揭示了自动定理证明在复杂问题上的局限性。

Comments Accepted to AI4MATH@ICML2025

详情
AI中文摘要

形式定理证明(FTP)已成为评估大语言模型推理能力的关键基础,使大规模自动验证数学证明成为可能。然而,进展受到有限数据集的限制,因为手动编纂成本高且缺乏具有验证形式-非形式对应关系的挑战性问题。我们提出利用理论计算机科学(TCS)作为可扩展的严谨证明问题来源,其中算法定义能够自动生成任意多的挑战性定理-证明对。我们在此两个TCS领域中展示了这种方法:Busy Beaver问题,涉及证明图灵机停止行为的界限,以及混合布尔算术问题,结合了逻辑和算术推理。我们的框架自动合成具有并行形式(Lean4)和非形式(Markdown)规范的问题,创建了一个可扩展的生成验证证明挑战的流水线。对前沿模型的评估揭示了自动定理证明的显著差距:尽管DeepSeekProver-V2-671B在Busy Beaver问题上达到57.5%的成功率,但在混合布尔算术问题上仅达到12%。这些结果突显了即使对于计算上易于验证的问题,长形式证明生成的难度,展示了TCS领域在推动自动推理研究中的价值。

英文摘要

Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.

2508.07292 2026-05-19 cs.AI cs.CL cs.CV 版本更新

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent: 闭环代理推理与自我一致性验证用于内窥镜诊断

Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

发表机构 * School of Biological Science and Medical Engineering, Southeast University(东南大学生物科学与医学工程学院) Jiangsu Key Laboratory of Biomaterials and Devices, Southeast University(东南大学江苏省生物材料与器件重点实验室) State Key Laboratory of Digital Medical Engineering, Southeast University(东南大学数字医学工程全国重点实验室) The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院) Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing(江苏省医学信息处理国际联合实验室) Laboratory of Image Science and Technology, the School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院图像科学与技术实验室)

AI总结 该研究提出EndoCogniAgent框架,通过闭环代理推理和自我一致性验证提升内窥镜诊断的准确性与可靠性,其核心方法是将诊断过程建模为受控状态更新过程,并引入EndoAgentBench基准进行评估。

Comments 10 pages, 8 figures, 2 tables. Revised version with major updates on methodology and extended evaluation on EndoAgentBench. Code and data are available at https://github.com/Tyyds-ai/EndoCogniAgent

详情
AI中文摘要

内窥镜诊断是一个迭代过程,临床医生逐步获取、比较和验证局部视觉证据以得出结论。当前AI系统未能充分支持此过程,因为细粒度证据获取和多步推理仍弱相关,导致两种失败模式:幻觉证据和未纠正的误差累积,影响诊断可靠性。我们提出EndoCogniAgent,一种闭环代理框架,将内窥镜诊断建模为受控状态更新过程。在每次推理轮次中,中央计划器选择下一步证据获取动作,专用专家工具提取相应观察,自我一致性验证机制沿两个维度检查观察:知识一致性与输入图像以及时间一致性与先前验证的发现,然后更新诊断状态。验证的观察被纳入演进状态以指导后续计划,而缺乏充分支持的发现则保留并带有纠正反馈,引导计划器进行进一步验证。我们进一步引入EndoAgentBench,一个以工作流程为导向的基准,包含来自11个内窥镜数据集的6132个问题-答案对,旨在评估诊断代理在全面诊断链中的表现,从细粒度视觉感知到高水平诊断推理。实验显示,EndoCogniAgent在感知任务上达到85.23%的平均准确率,在推理任务上达到71.13%的临床接受率,消融分析确认自我一致性验证和事件状态维护对这些提升至关重要。

英文摘要

Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine-grained evidence acquisition and multi-step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed-loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self-consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow-oriented benchmark comprising 6,132 question-answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine-grained visual perception to high-level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23% average accuracy on perception tasks and 71.13% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self-consistency validation and episodic state maintenance are individually critical to these gains.

2508.06974 2026-05-19 cs.CL 版本更新

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

重新思考利用预训练大语言模型的1位优化

Zhijun Tu, Jian Li, Yuanyuan Xi, Siqi Liu, Chuanjian Liu, Hanting Chen, Jie Hu, Yunhe Wang

发表机构 * Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出了一种一致的渐进式训练方法,通过将全精度权重逐步转化为二值化权重,以提高1位大语言模型的性能,并通过二进制感知初始化和双缩放补偿减少训练难度。

Comments 15 pages, 7 figures

详情
AI中文摘要

1位LLM量化在减少存储和计算成本方面具有显著优势。然而,现有方法通常从头开始训练1位LLM,未能充分利用预训练模型,导致训练成本高且准确性下降。本文发现全精度与1位表示之间的较大差距使直接适应困难。在本文中,我们引入了一种对前向和后向都一致的渐进式训练方法,平滑地将全精度权重转换为二值化权重。此外,我们还结合了二进制感知初始化和双缩放补偿,以减少渐进式训练的难度并提高性能。在各种大小的LLM上的实验结果表明,我们的方法优于现有方法。我们的结果表明,可以使用预训练模型实现高性能的1位LLM,从而消除了从头开始昂贵训练的需要。

英文摘要

1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the full-precision weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
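To make the idea of "smoothly converting the full-precision weights into the binarized ones" concrete, here is a small sketch under stated assumptions: a linear blend between the original weights and a scaled sign version, with the scale taken as the mean absolute weight. Both the linear schedule and the single mean-absolute scale are illustrative choices, not the paper's method; the abstract does not spell out its schedule or its dual-scaling compensation.

```python
from typing import List


def blend_toward_binary(weights: List[float], progress: float) -> List[float]:
    """Interpolate between full-precision weights (progress=0.0) and their
    scaled sign binarization (progress=1.0)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    # One shared magnitude for all binarized weights (assumed: mean absolute value).
    scale = sum(abs(w) for w in weights) / len(weights)
    binarized = [scale if w >= 0 else -scale for w in weights]
    return [(1.0 - progress) * w + progress * b for w, b in zip(weights, binarized)]


if __name__ == "__main__":
    w = [0.5, -1.5, 1.0]
    for p in (0.0, 0.5, 1.0):
        print(p, blend_toward_binary(w, p))
```

At `progress=1.0` every weight has the same magnitude and only its sign carries information, which is the 1-bit endpoint; intermediate values give the gradual transition.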

2508.00901 2026-05-19 cs.LG cs.CL 版本更新

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

在单层变换器中可证明的知识获取与提取

Ruichen Xu, Kexin Chen

AI总结 本文研究了单层变换器中知识获取与提取的机制,通过理论分析和实验验证,揭示了预训练和微调过程中知识存储与提取的关系,以及低秩微调如何恢复预训练的事实知识。

详情
AI中文摘要

大型语言模型在预训练过程中可能获得事实性知识,但在微调后却无法可靠地使用这些知识。尽管有越来越多的实证证据表明MLP层存储事实关联,并且微调影响事实回忆,但连接下一个标记预训练、知识存储和后微调提取的训练动态机制仍然理解有限。我们研究了这个问题,使用了一个简化的一层变换器,包含自注意力和MLP模块,通过下一个标记预测进行训练,随后在问答数据上进行微调。在适当的正则性条件下,我们首先证明模型在学习结构化注意力模式和关系特定的特征方向时达到接近最优的预训练损失,从而提供了一个事实性知识获取的机制。然后我们展示微调可以将问答提示格式转化为触发预训练关系特征的手段,使模型能够提取在微调过程中未被重新访问的事实。我们的分析给出了知识提取的关联覆盖特征化:微调不需要重新访问每一个存储的主体-答案对,但必须覆盖足够的潜在关系-模板方向,通过这些方向在预训练中编码了事实。因此,提取随着预训练的多重性和微调的覆盖度而提高,但随着关系-模板宇宙的增长而变得更加困难。相反,不足的覆盖度会导致失败状态,其中事实可能被存储但仍然无法访问,提供了一个简化的幻觉机制。该理论适用于全和低秩微调,为为什么当关系覆盖度足够时低秩适应可以恢复预训练的事实知识提供了见解。在合成数据和基于PopQA的GPT-2/Llama模型上的实验支持了预测的趋势。

英文摘要

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

2507.18406 2026-05-19 cs.CL cs.DB cs.DL cs.IR 版本更新

Factual Inconsistencies in Multilingual Wikipedia Tables

多语言维基百科表格中的事实不一致

Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

发表机构 * CNR ISTC(意大利国家研究委员会认知科学与技术研究所) Fraunhofer Institute for Applied Information Technology FIT(弗劳恩霍夫应用信息技术研究所) Tallinn University of Technology(塔林理工大学) EURECOM(法国通信系统工程师学校与研究中心) Technical University of Munich(慕尼黑工业大学) University of Amsterdam(阿姆斯特丹大学)

AI总结 本研究探讨了多语言维基百科结构化内容中的跨语言不一致问题,特别是表格数据,通过开发方法收集、对齐和分析多语言维基百科文章中的表格,定义不一致的类别,并应用定量和定性指标评估多语言对齐,为事实验证、多语言知识交互和可靠AI系统设计提供启示。

Comments 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025

详情
AI中文摘要

维基百科作为全球可访问的知识源,包含超过300种语言的内容。尽管覆盖相同主题,不同版本的维基百科是独立编写和更新的。这导致了事实不一致,可能影响百科全书和依赖维基百科作为主要训练数据的AI系统中立性和可靠性。本研究调查了维基百科结构化内容中的跨语言不一致,重点是表格数据。我们开发了一种方法来收集、对齐和分析维基百科多语言文章中的表格,定义不一致的类别。我们应用各种定量和定性指标来评估多语言对齐,使用样本数据集。这些见解对事实验证、多语言知识交互和设计利用维基百科内容的可靠AI系统具有影响。

英文摘要

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

2504.13217 2026-05-19 cs.CL cs.AI 版本更新

Sustainability via LLM Right-sizing

通过合理选择LLM规模实现可持续性

Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta

AI总结 本文研究了在现实应用中,小型、可本地部署的模型何时已经足够好,通过评估十一种LLM在十种日常职业任务中的表现,提出了一种基于可持续性视角的评估方法,强调成本、本地部署和隐私方面的需求。

Comments 21 pages, 2 Figures, 6 Tables

详情
AI中文摘要

大型语言模型(LLMs)日益融入组织工作流程,引发了对其能源消耗、财务成本和数据主权的担忧。尽管性能基准常推崇前沿模型,但实际部署决策需要更广泛的视角:何时小型、可本地部署的模型已经“足够好”?本研究通过评估十一种专有和开放权重LLM在十种日常职业任务(包括文本摘要、日程生成以及邮件和提案起草)中的表现,提供实证答案。我们使用基于双LLM的评估框架,自动化任务执行,并围绕输出质量、事实准确性和伦理责任等十项标准进行标准化评估。结果显示,GPT-4o的性能始终领先,但成本和环境足迹显著更高。值得注意的是,较小的模型如Gemma-3和Phi-4在大多数任务中取得了强劲且可靠的结果,表明其在需要成本效率、本地部署或隐私的场景中的可行性。聚类分析揭示了三类模型:高端全能型、胜任的通用型和能力有限但安全型,突显了质量、控制和可持续性之间的权衡。此外,任务类型显著影响模型的有效性:概念性任务对大多数模型构成挑战,而聚合和转换任务的表现更好。我们主张从追求性能最大化的基准转向任务和情境感知的充分性评估,以更好地反映组织优先事项。我们的方法贡献了一种通过可持续性视角评估AI模型的可扩展方法,并为负责任的LLM部署提供了可行的指导。

英文摘要

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

2501.01046 2026-05-19 cs.CL 版本更新

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

SEDD: 一种基于GPU的可扩展且高效的去重数据集处理方法

Youngjun Son, Chaewon Kim, Jaejin Lee

发表机构 * Graduate School of Data Science(数据科学研究生院) Department of Computer Science(计算机科学系) Seoul National University(首尔国立大学)

AI总结 本文提出SEDD,一种基于GPU的高效去重框架,通过引入计算高效且部分可重用的哈希函数、高度优化的GPU内核和硬件感知的自动参数选择机制,显著减少了通信瓶颈,提升了去重效率,同时保持了高去重精度。

Comments 13 pages, 7 figures

详情
AI中文摘要

数据集去重被广泛认可为一个关键的预处理步骤,能够提高数据质量和大型语言模型的性能。常用的去重方法是MinHash局部敏感哈希(LSH)算法。最近,NVIDIA NeMo Curator等GPU加速框架被引入以处理大规模语料库;然而,由于物理数据洗牌带来的高通信开销和GPU资源利用率低,这些框架仍然不够高效。在本文中,我们提出了SEDD,一种面向分布式集群环境优化的高性能GPU加速去重框架。SEDD引入了计算高效且部分可重用的哈希函数,以及高度优化的GPU内核和硬件感知的自动参数选择机制。通过将传统数据洗牌替换为流式处理方法,SEDD显著减轻了通信瓶颈。在配备四块GPU的单个节点上处理3000万篇文档时,我们的框架比SlimPajama中基于CPU的去重工具快高达158倍,比NVIDIA NeMo Curator中基于GPU的工具快高达7.8倍。值得注意的是,SEDD大幅加速了此前十分耗时的MinHash签名生成阶段,相对于CPU基准的加速比高达375倍。尽管效率大幅提升,SEDD仍保持了高去重保真度:其识别出的重复文档集与标准MinHash算法识别的结果相比,Jaccard相似度超过0.95。在大规模实验中,1.2万亿个标记的去重在8节点、32块V100 GPU的集群上仅用3小时完成。相关代码已公开在GitHub(https://github.com/mcrl/SEDD)。

英文摘要

Dataset deduplication is widely recognized as a crucial preprocessing step that enhances data quality and improves the performance of large language models. A commonly used method for this process is the MinHash Locality-Sensitive Hashing (LSH) algorithm. Recently, GPU-accelerated frameworks such as NVIDIA NeMo Curator have been introduced to handle large-scale corpora; however, they remain suboptimal due to high communication overhead from physical data shuffling and underutilization of GPU resources. In this paper, we propose SEDD, a high-performance GPU-accelerated deduplication framework optimized for distributed cluster environments. SEDD introduces a computationally efficient, partially reusable hash function, alongside highly optimized GPU kernels and a hardware-aware automatic parameter selection mechanism. By replacing traditional data shuffling with a streaming-based approach, SEDD significantly mitigates communication bottlenecks. Our framework outperforms the CPU-based deduplication tool in SlimPajama by up to 158$\times$ and the GPU-based tool in NVIDIA NeMo Curator by up to 7.8$\times$ when processing 30 million documents on a node with four GPUs. Notably, SEDD dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speedups of up to 375$\times$ over the CPU baseline. Despite these gains in efficiency, SEDD maintains high deduplication fidelity, with duplicate document sets achieving Jaccard similarities of over 0.95 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 3 hours on an 8-node 32-GPU V100 cluster. The related code is publicly available on GitHub (https://github.com/mcrl/SEDD).
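For readers unfamiliar with the MinHash step that SEDD accelerates, the following is a plain CPU sketch of signature generation and Jaccard estimation. It uses character shingles and salted BLAKE2b from the Python standard library as stand-in hash functions; these choices, the function names, and the parameter defaults are assumptions for illustration and are unrelated to SEDD's reusable hash function or its GPU kernels.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of the text (a single shingle if the text is shorter than k)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """For each of `num_perm` salted hash functions, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumped over the lazy dog"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(estimate_jaccard(sig_a, sig_b))
```

The signature generation loop is the part the abstract reports as the dominant cost on CPU; an LSH stage would then band these signatures to find candidate duplicate pairs without comparing every pair.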

2410.13181 2026-05-19 cs.CL 版本更新

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

AdaSwitch: 一种在小型和大型代理之间自适应切换以实现有效的云-本地协作学习方法

Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China(通用人工智能全国重点实验室,北京大学,北京,中国) School of Intelligence Science and Technology, Peking University(智能科学与技术学院,北京大学) East China Normal University(华东师范大学) Chinese Academy of Sciences(中国科学院) Baidu Inc(百度公司) Beijing Institute of Technology(北京理工大学) University of Birmingham(伯明翰大学)

AI总结 本文提出了一种新的LLM使用范式,通过自适应机制在云端和本地部署的LLM之间切换,以提高任务完成性能和效率,通过本地代理处理简单推理步骤,云代理处理复杂推理步骤,实验表明该方法在多个基准测试中有效提升了本地代理的性能。

Comments EMNLP 2024 Main Conference

详情
AI中文摘要

近年来,大型语言模型(LLMs)的发展取得了显著进展。用户面临一个选择:使用基于云的LLMs来获得生成质量,或部署本地LLMs以降低计算成本。前者通常成本高且效率低,而后者通常无法满足需要深入思考过程的推理步骤的性能要求。在本工作中,我们提出了一种新的LLM使用范式,以促进大型云端LLMs和较小本地部署LLMs的协作操作。我们的框架包含两个主要模块:本地代理由相对较小的LLM实例化,处理较简单的推理步骤;云代理配备较大的LLM,处理更复杂的推理步骤。这种协作处理通过自适应机制实现,其中本地代理会内省并主动向云代理寻求帮助,从而有效整合本地部署和云端LLMs的优势,显著提升任务完成性能和效率。我们评估了AdaSwitch在7个基准测试上的表现,涵盖数学推理和复杂问答,使用不同类型的LLMs实例化本地和云代理。实验结果表明,AdaSwitch有效提升了本地代理的性能,有时在计算开销远低于云代理的情况下,也能取得具有竞争力的结果。

英文摘要

Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller locally deployed LLMs. Our framework comprises two primary modules: the local agent, instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent, equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while incurring much less computational overhead.

2605.18111 2026-05-19 cs.CL cs.CV 版本更新

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

LLMs在回答孟加拉语医学视觉问题方面的表现如何?数据集与基准测试

Rafid Ahmed, Intesar Tahmid, Mir Sazzat Hossain, Tasnimul Hossain Tomal, Md Fahim, Md Farhad Alam Bhuiyan

发表机构 * Penta Global Limited Center for Computational & Data Sciences, Independent University(独立大学计算与数据科学中心)

AI总结 本文提出BanglaMedVQA数据集,用于评估当前基础模型在孟加拉语医学视觉问答任务中的表现,发现其性能显著低于英语基准,揭示了低资源语言在医学推理中的挑战。

Comments 14 pages, 7 figures, 5 tables, Proceedings of The Second AAAI Bridge Program on AI for Medicine and Healthcare, PMLR 317:1-14, 2026

详情
AI中文摘要

近年来,大型语言模型(LLMs)和大型视觉语言模型(LVLMs)的进步使通用系统在复杂推理任务中展现出有希望的能力,包括医学领域。医学视觉问答(MedVQA)尤其受益于这些发展。然而,尽管孟加拉语是全球最广泛使用的语言之一,但尚不存在针对它的MedVQA基准。为解决这一缺口,我们引入了BanglaMedVQA数据集,包含经过临床验证的图像-问题-答案三元组,并对当前基础模型在该资源上的全面评估。与先前发现的当前模型在英语MedVQA基准上表现不佳一致,我们的分析显示孟加拉语性能显著更低,反映了低资源语言固有的挑战。即使表现最佳的模型如Gemini和GPT-4.1 mini也未能准确回答专门的诊断问题,表明在细粒度医学推理方面存在严重限制。虽然某些开源模型如Gemma-3偶尔在一般类别中优于这些模型,但它们在临床复杂问题上也表现不佳,凸显了对顶级评估方法的迫切需求。

英文摘要

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.

2605.18105 2026-05-19 cs.CL 版本更新

How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World

轰鸣声如何登上报摊:对德国新闻中全球山体滑坡报道的覆盖范围与空间偏差的数据分析

Brielen Madureira, Andreas Niekler, Marc Keuschnigg, Mariana Madruga de Brito

发表机构 * LeipzigLab – Climate Discourse(莱比锡实验室——气候话语) Leipzig University(莱比锡大学) Helmholtz Centre for Environmental Research(亥姆霍兹环境研究中心) Linköping University(林雪平大学)

AI总结 本文通过分析25年间近6万篇关于5500起山体滑坡事件的新闻文章,探讨德国报纸对全球山体滑坡的报道方式,揭示南欧和西欧地区报道过度的现象,为研究媒体对国际灾害关注的不平等提供参考。

Comments Work in progress

详情
AI中文摘要

山体滑坡常因破坏性和潜在致命性而击中新闻摊位。新闻是创建或丰富灾害数据库以及加快基于媒体的注意力动态研究的重要信息来源。为此,新闻数据集必须被过滤、定位和验证。本文聚焦于全球山体滑坡在德国报纸中的报道方式。我们分析了25年间近6万篇关于5500起新闻事件的新闻文章,将其与外部国家滑坡易发性指标进行比较,并提供见解,例如南欧和西欧地区报道过度,以促进对媒体对国际灾害关注不平等的研究。

英文摘要

Landslides often hit newsstands due to their destructive and potentially fatal effects. News is a valuable source of information for creating or enriching disaster databases and for expediting media-based studies of the dynamics of media attention. To accomplish that, news datasets must be filtered, geolocated and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyse almost 60k news articles about 5.5k news events in a 25-year period, compare them with external measures of countries' susceptibility to landslides and provide insights, e.g. the overreporting of Southern and Western Europe, to foster further studies on inequalities in media attention to international disasters.

2605.18083 2026-05-19 cs.CL 版本更新

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

多语言大语言模型的高效路径:通过后训练PARAM$Δ$整合到再利用MoE进行语言扩展

Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-Ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Tongyi Lab, Alibaba Group(阿里集团通义实验室) Zhejiang University(浙江大学)

AI总结 本文提出了一种高效的方法,通过将密集模型转换为MoE架构,并将不同语言分配给不同专家,从而在不进行复杂对齐阶段的情况下提升多语言大语言模型的性能,同时保留原始能力。

详情
AI中文摘要

将大型语言模型(LLMs)扩展到新语言是一个成本高昂的过程,需要大量的持续预训练(CPT)和数据密集型对齐。尽管最近的无数据(data-free)融合技术试图通过将多语言CPT增强模型与其指令版本融合来绕过对齐,但它们受到关键权衡的限制:缓解参数冲突以保持原始能力不可避免地会稀释新语言的学习,反之亦然。为了解决这一矛盾,我们提出了一种方法,将密集模型升级改造(upcycle)为专家混合(MoE)架构,并将不同专家分配给不同语言。然后通过将MoE扩展的参数delta($Δ_{\text{post}}$)嫁接到CPT增强的基模型上来转移对齐能力,从而绕过复杂的对齐阶段。实验表明,即使与具有相似FLOPs或参数数量的基线相比,该方法仍表现更优;它在扩展语言上提高了性能,同时有效保留了原始能力。我们进一步证明,我们的方法在不同模型和后训练delta上具有高度适用性。

英文摘要

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce a method that upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta ($Δ_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate the method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and post-training deltas.

2605.18079 2026-05-19 cs.LG cs.CC cs.CL 版本更新

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

带(摘要式)思维链的低精度softmax变换器的表达能力

Moritz Brösamle, Stephan Eckstein

发表机构 * Department of Mathematics, University of Tübingen, Germany(图宾根大学数学系)

AI总结 本文研究了低精度softmax变换器在链式思维中的表达能力,通过构造具有三元激活和良好分离注意力分数的hardmax变换器来模拟图灵机,从而将构造转换为等效的softmax变换器,并分析了最近提出的摘要式链式思维范式在模拟图灵机时的效率。

Comments Accepted to ICML 2026

详情
AI中文摘要

现有的变换器表达性结果通常依赖于hardmax注意力、高精度和其他架构修改,这使它们与实际使用的模型脱节。我们通过分析采用softmax注意力、并对激活值和注意力权重进行舍入的标准变换器解码器来弥合这一差距,同时允许深度和宽度随上下文长度呈对数增长。作为中间步骤,我们构造了具有三元激活和良好分离注意力分数的hardmax变换器,利用链式思维(CoT)模拟图灵机。这使我们能够将构造转换为等效的softmax变换器,而无需先前方法所需的不现实的参数规模或激活精度。使用相同的技术,我们分析了最近提出的摘要式CoT范式,并表明它能更高效地模拟图灵机,模型大小随空间界而非时间界呈对数增长。我们在数独推理任务上对这些结果给出的预测进行了实证检验,发现它们与可学习性的吻合程度优于先前的高精度结果。我们的代码可在 https://github.com/moritzbroe/transformer-expressivity 获取。

英文摘要

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.

2605.18071 2026-05-19 cs.CL 版本更新

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

KVDrive: 一个面向长上下文LLM推理的多层级KV缓存管理系统

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

发表机构 * Hong Kong University of Science and Technology China(香港科技大学中国) Xi’an Jiaotong University China(西安交通大学中国)

AI总结 本文提出KVDrive,一个面向长上下文LLM推理的多层级KV缓存管理系统,通过联合缓存放置、流水线调度和跨层级协调,实现了高吞吐量的推理,在有限的GPU预算下保持高精度。

详情
AI中文摘要

支持长上下文LLM存在挑战,因为键值(KV)缓存的内存需求巨大。现有的卸载系统将完整的缓存存储在主机内存中,并在解码过程中选择性地获取关键条目,但这种策略很快达到极限:稀疏度无法在不损害准确性的情况下进一步提高。因此,当上下文长度和批处理大小增加时,KV传输量急剧上升,成为解码延迟的主要来源。我们提出了KVDrive,一个横跨GPU内存、主机DRAM和SSD的整体式多层级KV缓存管理系统。与之前通过算法改进追求更高稀疏度的工作不同,KVDrive从系统角度出发,联合编排缓存放置、流水线调度和跨层级协调,以在有限的GPU预算下维持高吞吐量的推理。KVDrive提升了三项基本能力:它根据注意力行为调整缓存管理,以最大化重用并最小化冗余数据移动;它重构解码流水线,使I/O密集型阶段与CPU/GPU计算密集型阶段相互重叠,消除异构资源间的停滞;并且它协调内存层级之间的数据移动,实现远超GPU和DRAM限制的可扩展长上下文推理。我们已经实现了一个功能完整的KVDrive原型,并使用流行的LLM在长上下文基准测试上对其进行了评估。该系统在保持准确性的同时,相比最先进的工作实现了高达1.74倍的吞吐量提升。

英文摘要

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective - jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O- and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput compared to state-of-the-art works while preserving accuracy.

2605.18067 2026-05-19 cs.CL 版本更新

PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

PPAI: 促进个性化大语言模型代理在协作边缘智能中的互操作性

Zile Wang, Qianli Liu, Kaibin Guo, Haodong Wang, Jian Lin, Zicong Hong, Song Guo

发表机构 * Hong Kong RGC General Research Fund(香港研究资助局一般研究基金) Research Impact Fund(研究影响基金) Collaborative Research Fund(协作研究基金) NSFC/RGC Collaborative Research Scheme(国家自然科学基金/香港研究资助局协作研究计划) Areas of Excellence Scheme(卓越领域计划) InnoHK (HKGAI)(创新科技署(HKGAI))

AI总结 本文提出PPAI系统,通过代理专长实现用户间协作,解决动态代理池和负载平衡问题,提升任务准确性并降低延迟。

详情
AI中文摘要

在边缘设备上部署大型语言模型(LLM)可为各类用户提供个性化LLM代理。随着多样化个性化代理的日益普及,点对点(P2P)协作迎来了独特机会:每个用户可以将超出本地代理专长的任务委托给更适合该查询的远程代理。本文介绍了PPAI,首个个性化LLM代理互操作性系统,使用户能够基于代理专长相互协作。然而,与现有P2P系统相比,不断变化的代理池及其可互换的能力给查询与代理的匹配以及负载均衡带来了新的挑战。因此,我们提出了一种基于原型的可扩展查询-代理对评分机制,以在存在节点流动(churn)的P2P网络中识别合适的代理。此外,我们提出了一个多代理互操作性贝叶斯博弈,以便在远程代理负载变化过快而无法观测时平衡本地需求和全局效率。最后,我们实现了PPAI原型,并证明它在保持负载均衡的同时显著拓宽了可执行任务的范围。平均而言,它在多个任务上实现了高达7.96%的准确率提升,同时相比基线将延迟降低了16.34%。

英文摘要

Deploying large language models (LLMs) on edge devices enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents better suited to the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, compared with existing P2P systems, the ever-changing pool of agents and their interchangeable capacity introduce new challenges in matching queries to agents and balancing loads. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that can be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.

2605.18053 2026-05-19 cs.LG cs.CL cs.CR cs.PF 版本更新

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

保护几乎就是一切:结构保护在全局受限的KV淘汰中占据主导地位

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了在共享全局受限解码时间Harness下的KV缓存淘汰问题,发现结构保护在保持质量方面起关键作用,通过保留边界缓存恢复了大部分参考天花板质量,并展示了保护机制在不同模型上的有效性。

Comments 38 pages, 6 figures, 25 tables (includes one longtable). Code and figure regeneration scripts: https://github.com/gpgabriel25/KVCacheBoundaryProtection

详情
AI中文摘要

我们研究了在共享全局受限解码时间Harness下的KV缓存淘汰问题。七种策略(LRU、H2O、SnapKV、StreamingLLM、Ada-KV、QUEST、Random)共享一个提示边界漏洞:在没有结构保护的情况下,它们在六个纯Transformer模型上的质量崩溃至接近零(F1≤0.064)。在每个边界保留10%的缓存,可在C=256(13%保留率)时于七个LongBench模型上恢复69-90%的C=2048参考天花板质量;一个十模型面板的范围为68-98%。一个注意力质量(attention-mass)试点(Qwen2.5-3B,N=30)给出了可能的原因:位置0的sink占据约75%的前缀质量,而其他边界token仅接近均匀期望的约0.41倍,因此注意力评分器保留sink但仍会丢弃结构关键token。有保护的情况下,简化的评分隔离变体在K=32时与LRU在TOST意义下等价(Δ=0.02);在K=8时,注意力策略两两收敛,但在C=256和C=512上仍比LRU高0.011-0.021 F1。忠实实现的Ada-KV/QUEST在Mistral-7B和Phi-3.5上比简化变体再增加约0.03-0.04 F1。在Qwen3-4B上的NIAH-32K机制迁移试点(解码vs.预填,C∈{512,2048})显示保护带来的提升几乎相同(比率0.99-1.00)。在64K时,保护有所帮助但恢复有限;忠实的每头评分在Gemma-3-4B上以6.3%保留率达到全缓存天花板,但前提是模型在不淘汰时已支持强64K检索。总体而言:保护占主导;一旦边界被保护,评分差异变得次要;每头分配进一步带来小幅提升。

英文摘要

We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1$\leq$0.064). Reserving 10% of cache at each boundary recovers 69-90% of the $C{=}2{,}048$ reference-ceiling quality on seven LongBench models at $C{=}256$ (13% retention); a ten-model panel spans 68-98%. An attention-mass pilot (Qwen2.5-3B, $N{=}30$) suggests why: the position-0 sink holds ${\sim}75\%$ of prefix mass, while other boundary tokens sit near ${\sim}0.41{\times}$ uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K{=}32$ ($Δ{=}0.02$); at $K{=}8$, attention policies pairwise converge yet beat LRU by 0.011-0.021 F1 across $C{=}256$ and $C{=}512$. Faithful Ada-KV/QUEST add ${\sim}0.03$-$0.04$ F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs. prefill, $C{\in}\{512,2048\}$) shows near-identical protection lifts (ratio 0.99-1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
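
A minimal sketch of boundary-protected eviction, to make the idea in the abstract concrete. It is not the paper's harness: the function name `protected_evict`, the reading of "each boundary" as the first and last cached positions, and the use of plain LRU as the scorer for the unprotected middle are all assumptions made for illustration.

```python
def protected_evict(positions, last_used, capacity, protect_frac=0.10):
    """Return the cached token positions kept under a global cap.

    positions: sorted list of cached token positions
    last_used: dict mapping position -> last access step (LRU score)
    capacity: global cap on the number of cached tokens
    protect_frac: share of the cap reserved at EACH boundary (assumed 10%)
    """
    if len(positions) <= capacity:
        return list(positions)

    # Reserve a slice of the budget at the start and at the end
    # (hypothetical interpretation of "each boundary").
    n_protect = int(capacity * protect_frac)
    head = positions[:n_protect]
    tail = positions[len(positions) - n_protect:] if n_protect else []
    protected = set(head) | set(tail)

    # Fill the remaining budget with the most recently used middle tokens.
    budget = capacity - len(protected)
    middle = [p for p in positions if p not in protected]
    middle.sort(key=lambda p: last_used[p], reverse=True)

    return sorted(protected | set(middle[:budget]))


if __name__ == "__main__":
    positions = list(range(20))
    last_used = {p: p for p in positions}  # later tokens were used more recently
    print(protected_evict(positions, last_used, capacity=10))
```

Swapping the `middle.sort` key for an attention-based score would give a protected variant of an attention policy; the protected set itself is unaffected by the scorer.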

2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE 版本更新

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA:多智能体大语言模型工作流的离线评估与迭代优化

Kazuki Kawamura, Satoshi Waki, Kei Tateno

发表机构 * Sony Group Corporation(索尼集团公司)

AI总结 本文提出PROTEA,一种用于多智能体大语言模型工作流的离线评估和迭代优化接口,通过配置评分标准和可视化工作流图中的节点状态,帮助开发者定位瓶颈并改进工作流性能。

Comments 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

详情
AI中文摘要

多智能体大语言模型工作流——由多个角色特定的LLM调用组成——通常优于单提示基线,但调试和优化仍然困难。失败可能源于中间输出的细微错误,这些错误会传播到下游节点,要求开发者检查长轨迹并推断应修改哪个代理。我们提出了PROTEA,一个统一的接口,用于离线、测试驱动的多智能体工作流改进。PROTEA执行工作流,用可配置的评分标准评分中间节点输出,并在工作流图上叠加每个节点的状态和理由,以定位可能的瓶颈。为了支持复杂系统,其中最终答案参考是主要监督,PROTEA执行反向节点评估:它从最终答案参考和图上下文生成候选节点级期望,然后将它们与观察到的节点输出进行比较。对于选定的节点,PROTEA以可编辑的前后比较形式呈现目标提示修订,然后自动重新运行并重新评估工作流,以显示输出变化和评分轨迹。在两个生产相关的工作流中,PROTEA将文档检查准确性从64.3%提高到83.9%,推荐Hit@5从0.30提高到0.38。在与六名经验丰富的LLM开发者进行的形成研究中,参与者重视图层面的定位、节点级别的理由以及可编辑的前后提示修订。

英文摘要

Multi-agent LLM workflows, systems composed of multiple role-specific LLM calls, often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

2605.18007 2026-05-19 cs.CL 版本更新

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

推理时对修辞角色标注中困难示例的语义重排序

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour

发表机构 * Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004(南特大学,中央理工学院,国家科学研究中心,LS2N,UMR 6004) University of Lorraine(洛林大学)

AI总结 本文提出RISE框架,在推理时利用标签语义对修辞角色标注中的困难示例进行重排序,提升模型预测的准确性和鲁棒性。

Comments Accepted at ACL 2026 (Main Conference)

详情
AI中文摘要

修辞角色标注(RRL)为文档中的每个句子分配一个功能角色,广泛应用于法律、医疗和科学领域。尽管语言模型(LMs)在平均性能上表现良好,但它们在困难示例上仍然不可靠,其中预测置信度较低。现有方法通常隐式处理不确定性,将标签视为离散标识符,忽略了标签名称中编码的语义信息。我们引入RISE,一种推理时的语义重排序框架,利用标签语义来优化困难实例的预测。RISE自动识别低置信度预测,并使用对比学习的标签表示对模型输出进行重排序,无需重新训练或修改基础模型。在八个领域特定的RRL数据集上,使用七种LM(包括基于编码器和因果架构)的实验表明,在困难示例上平均获得+9.15个宏F1分数的提升。为了可解释性,我们进一步提出手动难度注释,从模型和人类视角研究难度,揭示与Cohen's kappa=0.40的中等一致程度。

英文摘要

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen's kappa = 0.40.
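
The abstract describes a two-step inference-time procedure: flag low-confidence predictions, then rerank using label representations. The following is a minimal sketch of that pattern only. The confidence threshold, the top-k candidate cut, the cosine-similarity scoring rule, and the toy vectors are all assumptions for illustration and are not taken from RISE itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(probs, sent_vec, label_vecs, threshold=0.5, top_k=3):
    """Keep confident predictions; rerank hard ones by label semantics.

    probs: dict label -> model probability
    sent_vec: sentence embedding (hypothetical, any encoder)
    label_vecs: dict label -> label embedding (hypothetical)
    """
    best = max(probs, key=probs.get)
    if probs[best] >= threshold:
        return best  # confident: leave the model's prediction untouched

    # Hard example: restrict to the model's top-k labels, then pick the one
    # whose label embedding is closest to the sentence embedding.
    candidates = sorted(probs, key=probs.get, reverse=True)[:top_k]
    return max(candidates, key=lambda lab: cosine(sent_vec, label_vecs[lab]))


if __name__ == "__main__":
    label_vecs = {"FACT": [1.0, 0.0], "ARGUMENT": [0.0, 1.0], "RULING": [1.0, 1.0]}
    hard = {"FACT": 0.40, "ARGUMENT": 0.35, "RULING": 0.25}
    print(rerank(hard, [0.0, 1.0], label_vecs))
```

Because the base model is only queried for its probabilities, a scheme of this shape needs no retraining, which matches the constraint stated in the abstract.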

2605.18001 2026-05-19 cs.CL 版本更新

Bridging the Gap: Converting Read Text to Conversational Dialogue

弥合差距:将阅读文本转换为对话式语音

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

发表机构 * Samsung Research and Development Institute, Bangalore, India(三星研发研究所,班加罗尔,印度)

AI总结 本文提出了一种名为PACC的新方法,通过利用深度神经网络分析和修改语调、重音和节奏等语调特征,将阅读语音转换为更自然的对话语音,从而在虚拟助手、客户服务和语言学习工具中提高语音转换的自然度和准确性。

Comments 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

详情
Journal ref
Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556
AI中文摘要

在最近的语音处理进展中,将阅读语音转换为对话语音引起了广泛关注。该领域的主要挑战是在实时应用中保持自然性和可懂性的同时,最小化计算开销。传统的阅读语音缺乏对话互动中至关重要的细微语调变化,这对虚拟助手、客户服务和语言学习工具等应用构成了挑战。本文介绍了一种新的方法,即带有对话上下文的语调调整(PACC),旨在将阅读语音转换为各种现代应用中使用的自然对话语音。PACC利用先进的深度神经网络来分析和修改语调特征,如语调、重音和节奏。与传统方法不同,我们的方法使用高保真生成对抗网络(HiFi-GAN)进行语音合成。我们的实验结果表明,语音转换在自然度和模型准确性方面有显著提高,通过在语音数据集上额外训练。这项研究为语音转换任务和Mean Opinion Score(MOS)评估建立了新的基准,并证明我们的方法可以成功扩展到其他语音转换应用。

英文摘要

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

2605.17989 2026-05-19 cs.CL cs.AI 版本更新

Predictive Prefetching for Retrieval-Augmented Generation

检索增强生成的预测预取

Wuyang Zhang, Shichao Pei

发表机构 * Department of Computer Science, University of Massachusetts Boston(马萨诸塞大学波士顿分校计算机科学系)

AI总结 本文提出了一种先进的异步检索框架,通过预测检索触发时机和所需信息,以减少延迟并提高生成效率,同时保持回答质量。

Comments Accepted by the Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

检索增强生成(RAG)提升了大型语言模型的事实依据性,但同步检索会带来显著延迟。尽管近期工作探索了异步检索,但现有方法依赖于检索与生成之间的启发式协调,并假设解码期间信息需求稳定,而这一假设在复杂的多领域场景中往往不成立。本文提出了一种先进的异步检索框架,能够实现与不断演变的信息需求相匹配的预测性预取。该框架使用三个组件(检索预测器、上下文监视器和查询生成器),利用生成动态中在不确定性变得关键之前若干token就已出现的语义前兆,显式预测何时应触发检索以及应检索什么信息。在多个基准测试上的实验表明,该方法可实现高达43.5%的端到端延迟降低和62.4%的首token时间(time-to-first-token)改善,同时保持与同步RAG基线相当的回答质量。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

2605.17978 2026-05-19 cs.CL 版本更新

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

AutoVecCoder: 教授大语言模型生成显式向量化代码

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Xiamen University(厦门大学) Tsinghua University(清华大学)

AI总结 本文提出AutoVecCoder框架,通过VecPrompt和VecRL组件,使大语言模型能够自动进行显式向量化,从而在SimdBench的SSE和AVX子集上达到最先进的性能,超越传统自动向量化的方法。

详情
AI中文摘要

通过单指令多数据(SIMD)架构进行向量化是高性能计算的基石。为了充分利用硬件潜力,开发人员通常借助内建函数(intrinsics)进行显式向量化,因为基于编译器的自动向量化由于保守的静态分析常常产生次优结果。尽管大语言模型(LLMs)在通用代码生成方面表现出色,但由于高质量语料库稀缺且低级硬件指令的语义约束严格,它们在显式向量化方面仍有困难。在本文中,我们提出了AutoVecCoder,一种新的框架,旨在赋予LLMs自动显式向量化的能力。AutoVecCoder集成了两个核心组件:VecPrompt,一个自动数据合成管道,用于注入领域特定的内建函数知识;以及VecRL,一个强化学习框架,将代码生成与执行效率对齐。通过此框架训练的AutoVecCoder-8B在SimdBench的SSE和AVX子集上实现了最先进的性能,并在某些情况下生成了超过标准-O3优化的实现,有效克服了传统自动向量化的固有瓶颈。

英文摘要

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

2605.17936 2026-05-19 cs.CL cs.LG 版本更新

Universal Adversarial Triggers

通用对抗触发器

Benedict Florance Arockiaraj, Alexander Feng, Jianxiong Cai, Xiaoyu Cheng

AI总结 本文提出了一种结合词性过滤和困惑度损失函数的新技术,生成更接近自然短语的合理触发器,以提高对抗攻击的检测难度并促进鲁棒模型的发展。

详情
AI中文摘要

近期的研究表明,现代NLP模型在从情感分析到语言生成的多种任务中均受到通用对抗攻击的影响,这类攻击是一种输入无关的攻击,使用共同的触发序列攻击模型。尽管这些攻击成功,但由此生成的触发器却不合语法且不自然。我们的工作提出了一种新颖的技术,结合词性过滤和基于困惑度的损失函数,以生成更合理的触发器,这些触发器更接近自然短语。在SST数据集上的情感分析任务中,该方法生成的触发器能够将正向预测翻转为负向预测,准确率降至0.04和0.12。为了构建鲁棒模型,我们还使用生成的触发器进行对抗训练,使模型的准确率从0.12提升至0.48。我们旨在展示通过生成合理的触发器,可以使得对抗攻击难以被检测,并通过相关防御促进鲁棒模型的发展。

英文摘要

Recent works have illustrated that modern NLP models trained for diverse tasks ranging from sentiment analysis to language generation succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers generated by such attacks are ungrammatical and unnatural. Our work proposes a novel technique combining part-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.

2605.17932 2026-05-19 cs.CL cs.AI 版本更新

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

在扩散大型语言模型中进行提示压缩:在LLaDA上评估LLMLingua-2

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

发表机构 * University of Toronto(多伦多大学) King Mongkut’s University of Technology Thonburi(吞武里国王科技大学)

AI总结 本文研究了提示压缩在扩散大型语言模型中的有效性,通过在LLaDA上评估LLMLingua-2,发现提示压缩在数学推理任务中效果不佳,而摘要任务相对稳健,表明为自回归模型设计的提示压缩方法并不能一致地迁移到扩散模型。

详情
AI中文摘要

提示压缩可以减少大型语言模型的推理成本和上下文长度,但之前的评估主要集中在自回归架构上。本研究使用LLMLingua-2,探讨提示压缩能否有效迁移到扩散大型语言模型(DLLMs),具体采用8B参数的DLLM LLaDA。我们在GSM8K、DUC2004和ShareGPT数据集上,每个数据集使用250个提示,以约2倍的压缩率,在数学推理、提示重建和摘要任务中评估压缩性能。我们使用精确匹配准确率、BLEU、ROUGE和BERTScore,比较由原始提示、压缩提示、重建提示以及重建提示推理生成的输出。结果表明,语义保持并不必然意味着扩散模型下游行为的稳定。摘要任务在压缩下相对稳健,而数学推理尽管语义相似度分数很高,性能仍显著退化。重建实验进一步表明,语义相似的提示仍可能遗漏稳定去噪所需的关键推理信息。在所有任务中,BERTScore召回率始终低于精确率,表明压缩失败主要由信息遗漏而非语义漂移驱动。这些发现表明,为自回归模型设计的提示压缩方法并不能一致地迁移到扩散大型语言模型,并推动了面向扩散模型的压缩策略的发展。

英文摘要

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.

2605.17911 2026-05-19 cs.CL 版本更新

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

行星探测中自然语言到一阶逻辑翻译的试点基准

Hayden Moore, Suman Saha, Mahfuza Farooque

发表机构 * Department of Computer Science and Engineering, College of Engineering, The Pennsylvania State University(计算机科学与工程系,工程学院,宾夕法尼亚州立大学)

AI总结 本文提出一个试点基准,用于在行星探测领域将自然语言转换为一阶逻辑,通过NASA PDS的真实任务文档构建数据集,并手动标注FOL表示,以支持语言理解和形式推理的交叉研究。

详情
AI中文摘要

未来的行星探测设想自主机器人代理在严苛的通信限制下运行,没有全球定位,且人类干预极少。在这种环境中,代理不仅需要感知和行动,还必须对任务目标、操作约束和不断变化的环境条件进行推理。尽管先前的工作主要集中在感知和控制上,但将高层任务知识转换为结构化、机器可解释的表示仍未得到充分探索。我们引入了一个试点基准,用于在行星探测领域将自然语言(NL)转换为一阶逻辑(FOL)。数据集由来自NASA行星数据系统(PDS)的真实任务文档构建,涵盖2003至2013年的任务。这些文档以丰富的自然语言描述了发射、助推、滑行、巡航和轨道操作等任务阶段。我们为这些文档手动标注了相应的FOL表示,以捕捉时间结构、代理角色和操作依赖性。此外,我们还提供了结构化的谓词词汇表和类型化常量,以支持在不同先验知识水平下进行受控实验。该试点基准以真实世界的安全关键任务数据为基础,为语言理解与形式推理的交叉研究奠定了基础。数据集可在 https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json 获取。

英文摘要

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

2605.17903 2026-05-19 cs.AI cs.CL cs.HC cs.IR 版本更新

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

AI生成模糊认知图的智能体分块与贝叶斯去分块:修昔底德陷阱模型

Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

发表机构 * University of Southern California(美国南加州大学) Florida International University(佛罗里达国际大学)

AI总结 本文提出了一种基于代理分块和贝叶斯去分块的方法,用于生成和更新人工智能生成的模糊认知图,通过将文本划分为重叠的文本分块,并对稀疏因果分块矩阵进行混合,从而构建出代表性的循环模糊认知图知识图谱,以预测修昔底德陷阱模型中的冲突结果。

Comments 15 pages, 6 figures

详情
AI中文摘要

我们通过训练大语言模型代理将文本分解为相互重叠的文本分块,从文本中自动生成反馈因果模糊认知图(FCMs)。对这些分块FCM进行凸混合,可以得到一个具有代表性的循环FCM知识图。文本分块可以有不同程度的重叠,分块FCM仍然可以混合形成新的FCM因果知识图。该混合技术之所以可扩展,是因为它只对稀疏的因果分块矩阵进行轻量计算。混合结构允许进行一种算子层面的贝叶斯推断,从混合的FCM中生成“去分块”的或类后验的FCM。这些去分块的FCM本身就有价值,并允许进一步迭代进行贝叶斯更新。我们在Allison关于“修昔底德陷阱”的文章文本上演示了这些混合技术,该模型描述了主导大国(如美国)与崛起大国(如中国)之间的冲突。FCM动态系统在平衡到不动点或极限环吸引子时预测结果。当我们开启并持续保持代表崛起大国野心和权利意识的概念节点来刺激这些FCM知识图时,8个FCM知识图中有7个预测了某种类型的战争。Gemini 3.1 LLM充当了分块AI代理。

英文摘要

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.
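
Two mechanisms named in the abstract, convex mixing of chunk FCMs and equilibration with a concept node held on, can be sketched as follows. This is an illustrative toy, not the authors' code: the three concept nodes, the edge weights, the mixing weights, and the binary-threshold update rule are all assumptions chosen to keep the example small.

```python
def mix_fcms(matrices, weights):
    """Convex mixture of FCM edge matrices: sum_k w_k * E_k, w_k >= 0, sum w_k = 1."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


def run_fcm(edge, state, clamp, max_steps=50):
    """Iterate a binary-threshold FCM with node `clamp` kept on.

    Returns the attractor as a list of states: length 1 for a fixed point,
    longer for a limit cycle, empty if none is found within max_steps.
    """
    n = len(edge)
    state = list(state)
    state[clamp] = 1
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        # Node j turns on when its summed causal input is positive (assumed rule).
        nxt = [1 if sum(state[i] * edge[i][j] for i in range(n)) > 0 else 0
               for j in range(n)]
        nxt[clamp] = 1  # keep the stimulated concept switched on
        state = nxt
    return seen[seen.index(state):] if state in seen else []


if __name__ == "__main__":
    # Hypothetical concepts: 0 = rising power's ambition, 1 = dominant power's fear, 2 = war
    chunk_a = [[0, 1.0, 0], [0, 0, 0], [0, 0, 0]]
    chunk_b = [[0, 0.5, 0], [0, 0, 0.8], [0, 0, 0]]
    mixed = mix_fcms([chunk_a, chunk_b], [0.5, 0.5])
    print(run_fcm(mixed, [0, 0, 0], clamp=0))
```

The mixture stays sparse whenever the chunk matrices are sparse, since an entry of the mixed matrix is nonzero only if some chunk has a nonzero entry there.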

2605.17885 2026-05-19 cs.CL cs.AI 版本更新

Multi-agent AI systems outperform human teams in creativity

多智能体AI系统在创造力上超越人类团队

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

发表机构 * Microsoft Research Asia(微软亚洲研究院)

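AI summary: This study examines multi-agent AI systems on creativity tasks and finds that, across six diverse problem-solving tasks, they are more creative than both single agents and human teams. The core method analyzes the generative process as paths through semantic space, and the main contribution is showing that different conversational patterns predict creativity in LLM teams and in human teams.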

详情
AI中文摘要

尽管人工智能(AI)在众多认知任务上已匹配或超越人类表现,但创造力仍是一个极具争议的前沿。随着基于大语言模型(LLMs)的AI系统在研究和创新中被越来越多地采用,理解并增强其创造力变得至关重要。本文证明,多智能体LLM团队不仅超越了单个智能体,而且在4541个多智能体LLM想法和341个人类团队想法上,显著优于人类团队在创造力方面(Cohen's d=1.50)。这种优势由新颖性驱动,同时保持了相当的实用性。为了研究两组的生成过程,我们通过神经语言模型表示将对话表示为语义空间中的路径。LLM和人类团队在对话范围广泛而不是集中在单一主题(低全局一致性)时产生更多创造性想法。然而,预测创造力的额外模式不同:LLM团队受益于高效的探索(高语义扩展,较短路径),而人类团队受益于维持流畅的对话流程(高局部一致性,频繁转换)。此外,我们识别出模型选择和讨论结构作为正交的设计杠杆,共同解释了LLM对话动态中26.8%的方差,为系统开发具有增强创造力的多智能体系统铺平了道路。

英文摘要

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.
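For readers unfamiliar with the effect size quoted above, the following sketch computes Cohen's d with a pooled standard deviation. This is the common textbook form; the abstract does not say which variant the authors used, and the sample scores below are made up for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical creativity ratings, not data from the paper.
llm_team_scores = [2, 4, 6]
human_team_scores = [1, 3, 5]
print(cohens_d(llm_team_scores, human_team_scores))
```

A value of d = 1.50 means the two group means differ by one and a half pooled standard deviations, which is conventionally considered a large effect.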

2605.17873 2026-05-19 cs.LG cs.AI cs.CL 版本更新

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD:针对长 Horizon 智能体的定向 hindsight 自监督学习

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

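AI summary: This paper proposes HINT-SD, a targeted hindsight self-distillation framework for long-horizon agents. It uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned self-distillation only on the targeted action spans. On BFCL v3 and AppWorld, it improves over the dense per-turn feedback baseline by up to 18.80 percent while taking 2.26x less time per training step.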

详情
AI中文摘要

训练具有长 horizon 的 LLM 智能体进行强化学习具有挑战性,因为稀疏结果奖励只能表明任务是否成功,而不能指示哪些中间动作导致了结果或如何修正。最近的方法通过从回合级动作-输出信号生成奖励或文本提示,或通过反馈条件自监督学习来缓解这一问题。然而,当许多中间回合已经成功或中性时,在每个回合生成反馈效率低下,而固定或错位的反馈难以监督导致失败的动作。为此,我们提出了 HINT-SD,一种基于全轨迹 hindsight 的定向自监督学习框架,用于选择失败相关的动作,并仅在目标动作跨度上应用反馈条件自监督学习。在 BFCL v3 和 AppWorld 上的实验表明,我们的方法在比密集的每回合反馈基线提高 18.80 个百分点的同时,实现了 2.26 倍更低的训练时间,表明选择何时进行自监督学习是有效且高效的长 horizon 智能体训练的关键因素。

英文摘要

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.

2605.17860 2026-05-19 cs.CL cs.AI 版本更新

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

PAREDA:自然语言处理研究讨论的多口音语音数据集

Sicheng Jin, Dipankar Srirag, Aditya Joshi

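AI summary: This paper introduces the PAREDA dataset for studying ASR performance on accented, spontaneous, and domain-specific speech. State-of-the-art models perform worse in the zero-shot setting, while fine-tuning on PAREDA significantly reduces WER, indicating that the dataset captures linguistic characteristics missing from existing corpora.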

Comments Accepted and presented at SPEAKABLE 2026 workshop at LREC 2026

详情
AI中文摘要

尽管现代自动语音识别(ASR)系统在基准语料上实现高精度,但其性能在现实世界变化时往往下降。本文聚焦于因口音、自发性和领域特定语音引起的变异性。特别是,我们介绍了PAREDA数据集,这是首个多口音语音数据集,包含澳大利亚、印度英语和中文英语口音的学术自然语言处理(NLP)论文讨论。每个会话都会引发自发独白(一篇论文摘要的总结)和非独白(参与者之间的问答会话),从而产生一个充满技术术语和会话现象的语料库。我们评估了SOTA ASR模型在PAREDA上的性能,分析了口音混合和语音速度增加的影响。我们的结果表明,在零样本设置下,模型表现更差,证实了数据集的挑战性。然而,对PAREDA的微调显著降低了词错误率(WER),证明我们的数据集捕捉了现有语料中常缺失的语言特征。PAREDA为构建和评估更稳健和包容的ASR系统提供了宝贵的资源,用于专门的现实应用。

英文摘要

While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented, spontaneous, and domain-specific speech. In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. Each session elicits a spontaneous monologue (a summary of a paper's abstract) and a non-monologue (a question-and-answer session between participants), resulting in a corpus rich with technical jargon and conversational phenomena. We evaluate the performance of SOTA ASR models on PAREDA, analysing the impact of accent mixing and increased speech rate. Our results show that, in the zero-shot setting, models perform worse, confirming the dataset's challenging nature. However, fine-tuning on PAREDA significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora. PAREDA serves as a valuable new resource for building and evaluating more robust and inclusive ASR systems for specialised, real-world applications.

2605.17849 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练令牌以实现数据驱动的扩展

Zichun Yu, Chenyan Xiong

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所)

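AI summary: This paper proposes the SynPro framework, which uses rephrasing and reformatting operations to help large language models learn more thoroughly from limited organic data, enabling more efficient scaling in data-bound pretraining.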

详情
AI中文摘要

LLM预训练正从计算驱动转向数据驱动的阶段,其中可用的人类(有机)文本远远无法满足扩展需求。然而,达到数据驱动阶段并不意味着模型已充分利用其有机语料库。在本文中,我们介绍了SynPro,一个合成数据生成框架,帮助LLM更深入地学习有限的有机数据。SynPro应用两种操作,即重新表述和重新格式化,以多样化的形式呈现相同的有机源,以促进更深层次的学习,而无需引入外部信息。两个生成器通过强化学习优化,使用质量、忠实度和数据影响奖励进行优化,并在预训练平台期持续更新,以针对模型尚未吸收的内容。我们使用DCLM-Baseline的10%最优令牌(0.8B和2.2B)预训练400M和1.1B模型,反映了前沿预训练中现实的数据驱动阶段。我们的结果表明,有机数据被标准重复方法显著低估:SynPro解锁了比重复方法多3.7-5.2倍的有效令牌,甚至在1.1B规模上超过了非数据驱动的Oracle,该Oracle在等效唯一数据上训练。分析证实,忠实、模型意识的合成可以在不导致分布崩溃的情况下实现数据驱动的扩展。我们开源代码在https://github.com/cxcscmu/SynPro。

英文摘要

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.

2605.17830 2026-05-19 cs.AI cs.CL 版本更新

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

记住更多,风险更多:具有记忆能力的LLM代理的纵向安全风险

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of California, Berkeley(加州大学伯克利分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

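AI summary: This study examines the safety risks that accumulated memory creates for memory-equipped LLM agents over long task horizons. It proposes a trigger-probe protocol for measuring the effect of memory contamination and finds that memory safety should be treated as a longitudinal property rather than a single-state property.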

详情
AI中文摘要

对具有记忆能力的LLM代理的安全评估通常测量单任务内的安全性:代理是否在对抗性条件下(如提示注入或记忆污染)安全地完成单一场景。然而,在部署中,一个代理会服务于许多独立任务,时间跨度较长,早期任务积累的记忆会影响后续无关任务的行为。研究这种情形需要在任务间的时间维度上进行评估:不是代理在任何单一记忆状态下的安全性,而是随着记忆在许多独立交互中积累,其安全性特征如何变化。我们称之为这种故障模式“时间记忆污染”。为了隔离记忆暴露与流非平稳性,我们引入了一种触发-探测协议,该协议通过固定探测集与不同前缀长度的只读记忆快照进行评估,并结合NullMemory反事实基线来识别由记忆引起的违规。我们将此协议应用于三个涵盖记录、备忘录、表单和电子邮件通信的部署场景,以及八种记忆架构,并进一步在Claw-like AI代理(如OpenClaw)上使用平台原生的记忆机制。具有记忆能力的代理在NullMemory基线上表现优异,记忆引起的违规率在两种代理类别中均表现出随暴露长度上升的稳健趋势。顺序随机化实验表明,该效应主要由积累内容而非接触顺序驱动。最后,事件分解的结构后果是记忆引起的风险在生成前的检索状态即可检测,我们通过高召回率的诊断监控器验证了这一点。我们的结果表明,应将记忆安全视为一个需要时间评估的纵向属性,而非可通过快照捕捉的单一状态属性。

英文摘要

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

2605.17789 2026-05-19 cs.CL cs.AI 版本更新

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

SocialMemBench: AI记忆系统是否准备好应对社交群体环境?

Olukunle Owolabi

发表机构 * Independent Researcher(独立研究者)

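AI summary: This paper introduces SocialMemBench, a benchmark for evaluating AI memory systems in multi-party social groups. Using human-verified synthetic social networks, it tests how memory systems handle shared history, group norms, and member departure.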

详情
AI中文摘要

为单用户对话设计的AI记忆系统在应用于多党社交群体环境时会表现出典型故障。这一差距对当今构建的社会助手尤为重要:嵌入聊天平台的群体作用代理,以及需要全面用户模型的主动个人助理代理。现有记忆基准评估的是二元或职场对话;没有针对多党社交群体,其中记忆必须将事实锚定在共享历史而非职业角色,区分群体规范与个体例外,并在成员退出后正确归因。我们引入SocialMemBench,一个涵盖五个典型(亲密朋友、家庭、娱乐、兴趣社区、熟人网络)和三个群体规模层级(4-30成员)的人类验证合成社交群体网络的基准,包含430个角色和7,355次对话轮次,产生1,031个问题-答案对,覆盖九个问题类别。每个类别隔离一种架构能力,五个失败模式(单流融合、时间状态覆盖、大规模实体合并、缺失跨角色知识、规范-个体融合)是可测试的假设;我们的两项研究探针Subject-Mem和SMG提供了证据,其余三个仍待解决。在所有43个网络中,评估的四个开源记忆框架(Mem0、LangMem、Graphiti、Cognee)在问题加权范围内聚集在0.12-0.18,95%置信区间重叠,远低于未压缩检索参考0.345和匹配回答者完整上下文参考0.369(GPT-4o-mini)。当前的记忆系统显示出可测量的差距。

英文摘要

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes Subject-Mem and SMG provide evidence on two, three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

2605.17775 2026-05-19 cs.CL cs.AI 版本更新

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

在百万笔记规模上系统评估LLM重新表述的合成临床笔记质量

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO, Australia(澳大利亚电子健康研究中心,CSIRO,澳大利亚) National Library of Medicine, National Institutes of Health, USA(国家医学图书馆,国立卫生研究院,美国)

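AI summary: This study systematically evaluates LLM-generated synthetic clinical notes through intrinsic, extrinsic, and factuality evaluations. The notes preserve core clinical information and predictive utility for coarse-grained tasks but lose detail on fine-grained tasks such as ICD coding; rephrasing by chunks mitigates this loss at the cost of factual precision. Synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims, and the synthetic notes can effectively augment task-specific training for rare ICD codes.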

详情
AI中文摘要

大型语言模型(LLMs)可以为各种应用生成或合成临床文本,从改善临床文档到增强临床文本分析。然而,评估通常集中在狭窄方面——例如相似性或效用比较——尽管这些方面是互补的,最好并行看待。在本研究中,我们旨在系统评估LLM生成的临床文本,包括在百万笔记规模上从MIMIC数据库重新表述的合成临床笔记的内在、外在和事实性评估。我们的分析显示,尽管存在显著的语言变化,合成笔记仍保留了核心临床信息和粗粒度任务的预测效用,但在像ICD编码这样的细粒度任务中会丢失细节。我们展示,通过分块重述而不是整体重述笔记可以显著缓解这种细节丢失,但会以减少事实准确性为代价。通过事实核查和错误分析,我们进一步发现合成错误主要由临床情境的误解、时间混淆、测量误差和虚构声明引起。最后,我们展示了这些合成笔记——尽管具有任务无关性——可以有效增强罕见ICD代码的特定任务训练。

英文摘要

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

2605.17755 2026-05-19 cs.CL cs.AI 版本更新

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距:多版本训练提升ICD代码预测,尤其是罕见代码

Jinghui Liu, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO(澳大利亚电子健康研究中心,CSIRO)

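AI summary: This paper studies whether combining data annotated in different ICD versions can train version-independent models, addressing the long-tail problem and the rare-code bottleneck in ICD code prediction. Multi-version training substantially improves micro F1 on rare codes and macro metrics on frequent codes.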

详情
AI中文摘要

临床编码将临床文档映射到标准化的医疗代码,这是一个关键但耗时的行政任务,可以通过自动化来改进。当前ICD编码模型通常针对特定版本的代码进行优化。然而,实际上ICD系统持续演进,不同版本在不同时期和地区被采用。此外,ICD编码面临长尾问题,罕见代码性能可能成为开发可实施模型的瓶颈。我们探讨了通过结合不同ICD版本的数据训练版本无关模型的可行性,这可能有助于解决这些挑战。我们将在修改后的标签注意力模型中加入ICD-9数据进行ICD-10预测训练,并发现尽管存在版本不匹配,加入ICD-9数据使18K个罕见ICD代码的微F1指标相比仅使用ICD-10训练提高了27%。在8K个频繁ICD-10代码上,多版本训练也显著提升了宏指标,并且模型参数更少。

英文摘要

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
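The abstract reports micro F1 for rare codes and macro metrics for frequent codes. The sketch below shows how the two averages differ in a multi-label setting: micro F1 pools true positives, false positives, and false negatives over all labels, while macro F1 averages per-label F1 so that a rare label counts as much as a frequent one. The labels "A" and "B" and the toy predictions are placeholders, not ICD codes or results from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """gold and pred are lists of label sets, one set per document."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for gold_set, pred_set in zip(gold, pred):
        for label in labels:
            if label in pred_set and label in gold_set:
                counts[label][0] += 1
            elif label in pred_set:
                counts[label][1] += 1
            elif label in gold_set:
                counts[label][2] += 1
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1_from_counts(*pooled)
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    return micro, macro


# "A" is frequent and always predicted correctly; "B" is rare and always missed.
gold = [{"A"}, {"A"}, {"A", "B"}]
pred = [{"A"}, {"A"}, {"A"}]
micro, macro = micro_macro_f1(gold, pred, ["A", "B"])
print(round(micro, 3), round(macro, 3))
```

In this toy case the missed rare label costs macro F1 far more than micro F1, which is why long-tail performance is usually inspected with both averages.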

2605.17714 2026-05-19 cs.CL 版本更新

From Documents to Segments: A Contextual Reformulation for Topic Assignment

从文档到段落:一种用于主题分配的上下文重述

Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

发表机构 * LG AI Research(LG AI研究院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

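AI summary: This paper proposes segment-based topic allocation (SBTA), which assigns topics to short, coherent text segments rather than whole documents. This addresses the topic contamination that multi-topic documents cause in traditional topic models and yields cleaner, more interpretable topics.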

Comments Findings of ACL 2026

详情
AI中文摘要

传统的主题建模方法为每个文档分配一个单一主题。然而,在实践中,许多现实世界文档,如产品评论或开放式调查回答,包含多个不同的主题。这种不匹配常常导致主题污染,即不相关主题被合并到一个主题中,使得难以识别真正专注于特定主题的文档。我们通过引入基于段落的主题分配(SBTA),一种对主题建模的重述方法,将主题分配给段落:短小、连贯的文本片段,每个片段表达一个单一主题。通过在段落层面建模主题结构,我们的方法产生更清晰和可解释的主题,并更好地支持多主题文档的分析。为了支持系统评估,我们构建了一个SemEval-STM数据集,灵感来自基于方面的情感分析。文档首先通过大型语言模型(LLMs)分解为基于主题的段落,随后通过人工校验确保段落质量。我们还提出了一种基于段落的词入侵任务扩展,使人类能够在主题实际分配的粒度上评估主题连贯性。在多个模型和评估指标上,我们证明SBTA提高了聚类质量和可解释性。总体而言,这项工作提供了一个实用、可扩展的框架,用于异构文本语料库中细粒度的主题分析,其中文档自然涵盖多个主题。

英文摘要

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: https://huggingface.co/datasets/LG-AI-Research/SemEval-STM

2605.17710 2026-05-19 cs.CL eess.AS 版本更新

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

Sometin Beta Pass Notin (SBPN): 通过知识蒸馏改进尼日利亚语言的多语言语音识别

Sewade Ogun

发表机构 * Nigerian Languages(尼日利亚语言)

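AI summary: This paper presents SBPN, a multilingual ASR model that uses two-stage knowledge distillation to improve speech recognition for several Nigerian languages, significantly reducing word error rate and outperforming existing multilingual models.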

Comments 25 pages

详情
AI中文摘要

尽管现代多语言自动语音识别(ASR)系统支持多种尼日利亚语言,但其性能始终落后于英语和法语等高资源语言。尼日利亚语言存在独特的建模挑战,包括数据稀缺、不一致的正字法、声调符号、多样化的口音、频繁的代码切换和本地化专有名词。为解决这些挑战,我们开发了一个多语言ASR框架,采用两阶段蒸馏过程。首先,我们利用学生-教师知识蒸馏从现有单语言模型中学习,基于稳健的语言特定N-gram语言模型进行条件化。其次,我们使用伪标签数据进行迭代自我改进以进一步提高准确性。我们的方法显著缩小了性能差距,平均在单语言基线上实现了29%的词错误率(WER)减少。我们的模型在主要基准上也优于现有最先进的多语言模型,包括Common Voice和Fleurs。我们引入Sometin Beta Pass Notin(SBPN),一个覆盖约鲁巴、豪萨、伊博、尼日利亚皮钦语和尼日利亚英语的多语言ASR模型。SBPN以两种大小发布:SBPN-Base(120 M参数)和SBPN-Large(600 M参数)。通过发布这些作为开放基础模型,我们旨在为该地区丰富的语音和文化景观的研究提供ASR资源。

英文摘要

Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self-improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29% over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120M parameters) and SBPN-Large (600M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.
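The headline number above is a relative WER reduction. As a reminder of what that measures, here is a minimal sketch of word-level WER (edit distance over words divided by the reference length) and of the relative reduction between two WER values. The example sentences and the 0.20 and 0.142 figures are illustrative only and are not taken from the paper; real evaluations also normalize text (casing, punctuation, diacritics) before scoring, which this sketch omits.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


print(wer("a b c d", "a x c"))
print(relative_reduction(0.20, 0.142))
```

A relative reduction of 29% therefore means the new error rate is 71% of the baseline error rate, not that WER dropped by 29 absolute points.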

2605.17691 2026-05-19 cs.CL cs.AI 版本更新

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证你的权威:在多标签先例处理分类上对LLM进行基准测试

M. Mikail Demir, M. Abdullah Canbaz

发表机构 * Department of Information Science and Technology(信息科学与技术系) College of Emergency Preparedness, Homeland Security, and Cybersecurity(应急准备、国土安全与网络安全学院) University at Albany, SUNY(纽约州立大学奥尔巴尼分校)

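AI summary: This paper proposes a new evaluation framework that benchmarks modern large language models on an expert-annotated dataset and introduces an Average Severity Error metric to better measure the practical impact of classification errors.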

Comments Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

详情
AI中文摘要

自动化法律先例中负面处理的分类是一个关键但复杂的自然语言处理任务,误分类可能带来重大风险。为了解决标准准确率的不足,本文介绍了一种更稳健的评估框架。我们对239个真实世界法律引用的新专家标注数据集上的现代大语言模型进行了基准测试,并提出了一种新的平均严重性误差度量标准,以更好地衡量分类错误的实践影响。我们的实验揭示了性能的分裂。Google的Gemini 2.5 Flash在高层次分类任务上达到了最高准确率(79.1%),而OpenAI的GPT-5-mini则在更复杂的细粒度模式上表现最佳(67.7%)。本工作建立了关键基准,提供了一个新的上下文丰富的数据集,并引入了一个针对这一复杂法律推理任务的评估度量标准。

英文摘要

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

2605.17672 2026-05-19 cs.CL 版本更新

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

停止当推理收敛:用于推理模型的语义保留早退出

Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona, Lu Cheng

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) Google Research(谷歌研究) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Politecnico di Milano(米兰理工学院)

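AI summary: This paper proposes PUMA, a framework that detects reasoning-level semantic redundancy to trigger semantic-preserving early exit, reducing token consumption while preserving answer accuracy and a coherent reasoning chain.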

Comments under review

详情
AI中文摘要

大型推理模型(LRMs)通过生成长链的推理步骤(CoT)实现强大的性能,但常常过度推理,继续推理在解决方案已经稳定后,从而浪费token并增加延迟。现有的推理时早退出方法主要依赖于答案层面的信号,如置信度或试答一致性,来决定何时停止。然而,这些信号主要反映答案准备程度而非推理收敛:它们可能在模型尚未完成探索或自我纠正之前触发,导致过早退出,从而降低最终答案的准确性并留下保留的推理链语义不完整。我们识别推理层面的语义冗余作为语义保留早退出的互补信号:当连续步骤不再添加新的进展而是重复已确立的结论时,推理轨迹可能已收敛。基于这一见解,我们提出了PUMA,一个插件式框架,结合了轻量级的冗余检测器和答案层面的验证。检测器标记语义冗余的候选退出点,而验证确认停止是否安全,使PUMA能够在保持答案准确性和连贯推理前缀的同时删除冗余的延续。在五个LRMs和五个具有挑战性的推理基准测试中,PUMA实现了26.2%的平均token减少,同时保持准确性和保留的CoT质量。此外,针对代码生成、零样本视觉-语言推理和学习停止策略内部化等额外实验进一步证明,推理层面的冗余是高效推理的稳健、可转移和可学习的信号。我们的代码可在https://github.com/giovanni-vaccarino/PUMA上获得。

英文摘要

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.

2605.17652 2026-05-19 cs.CL 版本更新

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

超越转录:借助音频的迭代同行编辑解锁高质量的对话语音总结

Kaavya Chaparala, Thomas Thebaud, Jesús Villalba López, Laureano Moro-Velazquez, Peter Viechnicki, Najim Dehak

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

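AI summary: This study compares human summaries written from audio with those written from transcripts and examines how peer-editing affects summary quality. Audio-based summaries are less informative and more compressed than transcript-based ones, but iterative peer-editing with audio closes the gap.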

Comments Accepted in LREC 2026

详情
AI中文摘要

目前缺乏足够的语音总结任务基准。创建新基准需要人工标注,因为LLM可能会将系统性错误和偏见嵌入数据集中。我们测试了十种标注工作流程,涉及不同的输入模态(音频、转录本或两者)以及编辑(自我编辑或同行编辑)的 inclusion,以研究使用人工标注者总结音频时可能的质量权衡。我们比较了基于音频的人类总结与基于转录本的人类总结,以跟踪不同信息模态对总结质量的影响。我们还比较了人类输出与四个LLM基准(三个文本,一个音频)以确定人类编写总结是否比高度流畅的自动输出信息更少。我们发现基于音频的总结信息较少且更压缩,但通过使用音频的迭代同行编辑可以缓解这一差异,使基于音频的总结信息量与转录本总结和LLM总结相当。这些发现验证了在创建结合词汇和语调信息的基准时,人类标注者之间的迭代同行编辑的有效性。这使在转录本不可用的情况下也能关键数据集的收集成为可能。

英文摘要

There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.

2605.17641 2026-05-19 cs.AI cs.CL 版本更新

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

基于因果干预的记忆选择用于长时域大语言模型智能体

Saksham Sahai Srivastava

发表机构 * School of Computing, University of Georgia, Athens, Georgia, USA(佐治亚大学计算机学院)

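AI summary: This paper proposes Causal Memory Intervention (CMI), which selects an LLM agent's long-term memories by estimating their causal effect on the answer, improving answer quality and robustness. It also introduces the Causal-LoCoMo benchmark for evaluation.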

Comments 12 pages, 3 figures, 3 tables

详情
AI中文摘要

长时域大语言模型智能体依赖持久记忆来支持跨会话的交互,但现有记忆系统通常使用语义相似性或广泛历史包含来检索上下文,将检索到的记忆视为统一有用。这一假设是脆弱的,因为记忆可能在主题上相关,但仍然无关、过时或误导性。我们提出了Causal Memory Intervention(CMI),一种因果记忆选择技术,通过在受控干预下估计候选记忆如何影响模型的答案,选择提高任务性能的同时抑制不稳定、无关或有害的记忆。为了评估这一设置,我们引入了Causal-LoCoMo,一个从长对话数据中衍生出的因果标注基准,其中每个示例包含用户请求、结构化记忆库、有用的记忆、无关干扰项以及合成有害记忆。我们比较了CMI与向量、图、反思、摘要、完整历史和无记忆基线。结果表明,CMI在回答质量和对误导性记忆的鲁棒性之间实现了更强的平衡,表明可靠的长期记忆需要基于因果有用性而非相关性本身来选择上下文。完整的框架、基准构建代码和实验流程可在https://github.com/Saksham4796/causal-memory-intervention获取。

英文摘要

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.
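The core idea, scoring a memory by how the answer changes when that memory is added or withheld, can be illustrated with a deliberately simple sketch. Everything here is an assumption made for illustration: the `select_memories` function, the single-memory intervention, the threshold, and the `toy_evaluate` scorer standing in for a real model-plus-grader. The abstract does not specify how CMI performs or aggregates its interventions.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def select_memories(
    memories: Sequence[str],
    evaluate: Callable[[List[str]], float],
    threshold: float = 0.0,
) -> Tuple[List[str], Dict[str, float]]:
    """Keep memories whose inclusion raises the task score above a no-memory baseline."""
    baseline = evaluate([])  # score with no memory in context
    effects = {m: evaluate([m]) - baseline for m in memories}
    kept = [m for m in memories if effects[m] > threshold]
    return kept, effects


# Hypothetical stand-in for "run the agent with this context and grade the answer".
TOY_EFFECT = {"helpful": 0.4, "irrelevant": 0.0, "harmful": -0.3}


def toy_evaluate(context: List[str]) -> float:
    return 0.5 + sum(TOY_EFFECT[m] for m in context)


kept, effects = select_memories(["helpful", "irrelevant", "harmful"], toy_evaluate)
print(kept, effects)
```

Unlike similarity-based retrieval, this selection rule can reject a memory that is topically close to the request but lowers the score, which is the distinction between relevance and causal usefulness that the abstract draws.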

2605.17639 2026-05-19 cs.CL cs.IR updated version

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

共引预测性的时变衰减:来自3.96亿乌克兰法院引用的20年法规检索基准

Volodymyr Ovcharov

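Affiliations * LEX AI LLC

AI summary This paper examines the assumption that co-citation structure is stable in legal information systems. Using the UA-StatuteRetrieval benchmark, it measures co-citation predictability over 396 million citations across 20 years and finds that Adamic-Adar MRR declines 33% on a fixed set of articles and 47% under a train/test temporal split, confirming genuine temporal decay rather than compositional shift or an evaluation artifact.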

Comments 12 pages, 8 figures, 4 tables. Dataset: https://huggingface.co/datasets/overthelex/ua-statute-retrieval

Details
AI Chinese abstract

共引结构被广泛假设为提供稳定的检索信号。我们通过构建UA-StatuteRetrieval基准,纵向测试这一假设,该基准在2007-2026年的20个年度快照中测量了3.96亿条法典引用的共引可预测性。通过在完整的双部分引用图上使用留一法协议,我们发现Adamic-Adar MRR在固定文章集上下降33%(从0.43到0.29),在训练/测试时间分割下下降47%(从0.51到0.27),证实了真正的时变衰减而非组合变化或评估伪影。衰减是非均匀的:刑事程序保持稳定的共引模式(MRR ~0.40),而民法从0.35下降到0.15,与2017年司法改革重合。枢纽文章(>100,000引用)抵抗衰减,但中频文章(1,000-10,000)——实际检索前沿失去一半的可预测性。BM25文本基线衰减得更快(31%),嵌入漂移分析显示E5-large揭示了文章引用的语义偏移4.3%,提供了衰减的机制解释。该基准在https://huggingface.co/datasets/overthelex/ua-statute-retrieval发布。

English abstract

Co-citation structure is widely assumed to provide a stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27), confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K) -- the practical retrieval frontier -- lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.
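To make the metric concrete, here is a minimal sketch of leave-one-out Adamic-Adar scoring on a bipartite decision-article graph, followed by MRR. The exact protocol in the paper is not given in the abstract, so the details are assumptions: each decision is a set of cited article ids, one article is held out at a time, the query decision is excluded from the graph, and ties are broken by article id. The function name `adamic_adar_mrr` is hypothetical.

```python
import math
from collections import defaultdict

def adamic_adar_mrr(decisions):
    """decisions: list of sets of article ids cited together in one decision."""
    cited_by = defaultdict(set)                 # article -> indices of citing decisions
    for i, articles in enumerate(decisions):
        for a in articles:
            cited_by[a].add(i)

    reciprocal_ranks = []
    for i, articles in enumerate(decisions):
        for held_out in articles:
            context = articles - {held_out}
            if not context:
                continue
            scores = defaultdict(float)
            for q in context:
                for j in cited_by[q]:
                    if j == i:                  # leave the query decision out
                        continue
                    # A shared decision counts less the more articles it cites.
                    weight = 1.0 / math.log(len(decisions[j]))
                    for candidate in decisions[j] - context:
                        scores[candidate] += weight
            ranked = sorted(scores, key=lambda c: (-scores[c], c))
            rr = 1.0 / (ranked.index(held_out) + 1) if held_out in scores else 0.0
            reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

Running this per annual snapshot and comparing the resulting values is one simple way to observe the kind of decay the abstract reports.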

2605.17634 2026-05-19 cs.CR cs.CL cs.CY updated version

AI Agents May Always Fall for Prompt Injections

Sahar Abdelnabi, Eugene Bagdasarian

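Affiliations * ELLIS Institute Tübingen & MPI-IS & Tübingen AI Center; University of Massachusetts Amherst

AI summary This paper recasts prompt injection through Contextual Integrity theory, exposes the shortcomings of existing defenses, and proposes a new evaluation framework for designing safer autonomous agents.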

Details
AI Chinese abstract

提示注入是部署中的AI代理最严重的漏洞。尽管近年来有进展,但我们证明现有的防御范式(数据-指令分离)既无法检测通过上下文操纵进行的攻击,也无法保持上下文合适的行為。我们通过上下文完整性(CI)理论重新审视提示注入,这是一种判断信息流是否符合上下文规范的隐私理论。这解释了当前防御试图修补的攻击类型,并预测了未来代理将面临更高级的攻击。我们开发了独特的良性和攻击场景,迫使代理通过(1)错误表示流动、(2)操纵规范或(3)混合多种流动来违反规范。这种重新界定表明了一个不可能的结果:攻击者可以始终构造一个上下文,使被阻止的流动看起来合法,或者一个收紧规范的防御者会阻止真正合法的流动。我们的发现表明,当前研究只针对未来攻击面中越来越小的一部分。相反,通过CI,我们提供了一个评估上下文敏感失败的原理性框架,并为前沿自主代理设计CI-aware的对齐方法。

English abstract

Prompt injection is the most critical vulnerability in deployed AI agents. Despite recent progress, we show that the prevailing defense paradigm (data-instruction separation) both fails to detect attacks that operate through contextual manipulation and degrades contextually appropriate behavior. We then recast prompt injection via the lens of Contextual Integrity (CI), a privacy theory that judges information flow compliance with contextual norms. This explains the types of attacks that current defenses attempt to patch and predicts advanced ones that future agents will face. We develop unique benign and attack scenarios that force an agent to violate the norms by (1) misrepresenting the flow, (2) manipulating norms, or (3) mixing multiple flows. This reframing suggests an impossibility result: an adversary can always construct a context under which a blocked flow appears legitimate, or a defender who tightens norms will block genuinely legitimate flows. Our findings suggest that current research addresses a shrinking fraction of future attack surfaces. Instead, through CI, we offer a principled framework for evaluating context-sensitive failures and designing CI-aware alignment for frontier autonomous agents.

2605.17610 2026-05-19 cs.CV cs.CL updated version

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

SafeLens: 一种高效且可靠的视频护栏系统,采用快速和缓慢筛查

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

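Affiliations * University of South Florida; University of California, Davis

AI summary This work proposes SafeLens, a video guardrail framework that uses a fast-and-slow inference architecture for efficient video content moderation. It also builds a high-quality dataset and adds structured Chain-of-Thought traces to address the limits of training-time scaling, achieving the best performance on real-world and AI-generated video benchmarks while significantly reducing inference cost.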

Details
AI Chinese abstract

在线视频平台和AI生成内容的快速增长使得可靠的视频护栏成为安全性和现实部署的关键挑战。尽管大多数视频可通过快速模式识别筛查,但一小部分需要对时间复杂的内容和细致的政策约束进行深入推理。现有方法通常依赖于在所有输入上统一应用大型视觉-语言模型,导致推理成本高且计算资源分配效率低。我们提出了SafeLens视频护栏框架,引入快速和缓慢的推理架构,以实现高效且准确的内容审核,根据输入的不同具有可变的计算成本。此外,我们通过应用影响引导过滤对SafeWatch数据集进行处理,仅保留原始数据的2.4%。为进一步解决训练时间扩展的限制,我们通过在过滤数据中添加结构化的Chain-of-Thought追踪来实现测试时间推理。在实际和AI生成视频基准测试中,SafeLens实现了最先进的性能,优于强大的开源视频护栏(如SafeWatch-8B、OmniGuard-7B)和闭源模型(如GPT-5.4、Gemini-3.1-pro),同时显著降低推理成本,证明了高效设计比仅扩大数据或模型大小更有效。

English abstract

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design proves more effective than scaling data or model size alone.

2605.17598 2026-05-19 cs.CL updated version

Mixture of Experts for Low-Resource LLMs

专家混合用于低资源大语言模型

Ori Bar Joseph, Smadar Arvatz, Noam Kayzer, Dan Revital, Sarel Weinberger

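Affiliations * National Institute for Testing and Evaluation

AI summary This paper studies the expert routing behavior of Mixture-of-Experts architectures on low-resource languages, finding that continual pre-training effectively mitigates routing imbalance and that routing improvements correlate with downstream task gains.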

Details
AI Chinese abstract

混合专家(MoE)架构能够实现高效的模型扩展,但跨低资源语言的专家路由行为仍不明确。我们通过希伯来语这一形态丰富且低资源的测试环境,分析了两种架构不同的MoE模型——纯Transformer(Qwen3-30B-A3B)和混合Mamba-Transformer(Nemotron-3-Nano-30B-A3B)的路由动态。两种预训练模型均表现出深度层路由崩溃:在最终层使用熵急剧下降,令牌集中在狭窄的专家子集,这种模式在英语中很少见。持续预训练(CPT)在平衡双语数据上显著纠正了这种不平衡,提高了熵并使路由转向共享、语言无关的专家;监督微调(SFT)单独实现的纠正程度较低。将分析扩展到日语,发现定量一致的崩溃特征,提供了跨语言证据,表明该现象是预训练下代表性不足的系统性结果,而非任何语言固有属性。路由改进与一致的下游基准表现提升相关,将路由熵和专家专业化定位为MoE系统多语言能力的原理性诊断。

English abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models -- a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B) -- using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
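The usage-entropy diagnostic can be illustrated with a few lines of Python. This is a sketch under assumptions: it takes the list of expert ids chosen for the tokens of one layer and returns Shannon entropy normalized by the log of the expert count, so 1 means uniform usage and 0 means full collapse. Whether the paper normalizes this way is not stated in the abstract, and `usage_entropy` is a hypothetical name.

```python
import math
from collections import Counter

def usage_entropy(expert_ids, num_experts):
    """Normalized Shannon entropy of expert usage for one layer."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(num_experts)
```

Computing this per layer and per language would expose the pattern the abstract describes: a sharp drop in the final layers for the underrepresented language but not for English.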

2605.17570 2026-05-19 cs.LG cs.CL updated version

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

GRPO在离线策略下的可能性:Mu-GRPO用于高效的大语言模型强化学习

Minghao Tian, Yunfei Xie, Chen Wei

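Affiliations * Rice University

AI summary This paper explores how far off-policy GRPO can go and proposes Mu-GRPO, which reduces rollout-optimization switching overhead to make LLM reinforcement learning more efficient while performing well on multiple benchmarks.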

Details
AI Chinese abstract

组相对策略优化(GRPO)已成为近期大语言模型强化学习中可验证奖励(RLVR)进展的关键推动因素,但通常在低延迟、近策略的 regime 中训练,导致系统开销显著。我们提出一个简单的问题:GRPO可以多离线策略吗?我们证明GRPO类算法可以容忍比之前假设更大的rollout延迟,并提出Mu-GRPO,一种将训练分为少量(例如四个)大序列生成-优化阶段的RL训练框架。这种设计在诱导高rollout延迟的同时大幅减少了rollout-optimization切换开销。为了在延迟数据下稳定学习,Mu-GRPO结合了放松的剪裁(保留有用的延迟rollout梯度)与负优势 veto(移除不稳定后触发后缀更新)。在五个语言模型和多个数学推理基准测试中,Mu-GRPO在性能上与标准GRPO匹配或超过,同时在墙钟训练时间上实现了约2倍的加速,为LLM强化学习建立了显著改进的性能-效率权衡。

English abstract

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
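For readers unfamiliar with the baseline being modified, the sketch below shows the two standard GRPO ingredients the abstract refers to: group-normalized advantages and a PPO-style clipped surrogate term. The asymmetric `eps_low` and `eps_high` parameters are only an assumed way to express "relaxed clipping"; the bounds Mu-GRPO actually uses and its negative-advantage veto are not specified in the abstract and are not implemented here.

```python
import statistics

def group_advantages(rewards):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate for one token.

    ratio is pi_new / pi_old; stale rollouts tend to push it far from 1,
    so a wider clip range keeps more of their gradient signal.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With the default bounds, a ratio of 1.5 and a positive advantage contributes as if the ratio were 1.2; widening `eps_high` lets more of that update through.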

2605.17565 2026-05-19 cs.AI cs.CL updated version

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

泛化还是记忆?国际象棋训练语言模型的脆弱性测试

Ethan Tang

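Affiliations * School of Computing and Augmented Intelligence

AI summary This paper asks whether chess-trained language models generalize or memorize. Its tests indicate that their high performance stems mainly from pattern matching, and it shows that the LLM-Modulo framework improves chess puzzle solving, suggesting that a general LLM paired with an external verifier is more flexible than training directly on synthetic data.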

Comments 14 pages, 2 figures, 4 tables, 3 equations

Details
AI Chinese abstract

最近的研究对语言模型进行了棋类数据微调,并报告了高基准分数,作为证据表明由此产生的模型可以理解国际象棋规则、以专业水平下完整棋局,或生成基于专家知识的人可读解释。我们训练了KinGPT,一个仅在(位置,最佳移动)对上训练的2500万参数字符级语言模型,其在600个mate-in-N谜题套件上超过了30亿参数的ChessGPT,在20个主题谜题基准上超过了40亿参数的C1-4B。我们检查了现有文献中关于国际象棋训练语言模型的几个主张,并断言其令人印象深刻的基准性能主要由模式匹配解释。我们还展示了LLM-Modulo,一个验证器在环框架,如何将RedPajama 3B的最佳移动准确率从1.2%提升到21.2%,移动生成有效性从19.3%提升到95.3%,在mate-in-N国际象棋谜题上,与ChessGPT在棋类特定网络语料库上微调所获得的提升相当,但成本仅为后者的一小部分。我们的结果展示了将通用LLM与外部验证器结合,为明确领域提供了一个更灵活的替代方案,而不是直接训练合成数据。我们开源了所有训练/评估代码、数据集、谜题样本和KinGPT模型检查点,以确保可重复性。

English abstract

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which exceeds 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.

2605.17558 2026-05-19 cs.SE cs.CL updated version

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

Firefly:从真实API生成大规模验证工具调用数据

Yuxuan Lu, Ziyi Wang, Yingzhou Lu, Yisi Sang, Jiri Gesi, Xianfeng Tang, Yimeng Zhang, Zhenwei Dai, Hui Liu, Hanqing Lu, Chen Luo, Qi He, Benoit Dumoulin, Jing Huang, Dakuo Wang

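Affiliations * Northeastern University; Independent Researcher

AI summary This paper proposes FireFly, a framework that generates tool-call data with verified labels through backward synthesis: a strong LLM explores real APIs along a tool graph, with the aim of improving training efficiency and outcome accuracy in large-scale tool spaces.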

Details
AI Chinese abstract

训练工具调用代理需要大规模轨迹数据并具有可验证的标签,但现有方法要么合成与真实API行为不同的环境,要么生成无地面真实结果的任务。我们提出了FireFly,一个从真实世界MCP服务器生成验证工具调用数据的流水线。我们的关键见解是反转标准合成流程:而不是生成任务并希望它们可解决,我们首先让强大的LLM沿着图引导的DAG结构探索真实API,然后从观察到的结果反向合成任务,通过构造确保标签的正确性。为了处理真实世界工具空间(约1,000个工具)的规模,我们构建了一个成对的工具图并采样子DAGs,以专注于语义上连贯的工作流程。为了解决实时API中的环境漂移问题,我们构建了一个检索增强的模拟器,缓存所有探索结果并在训练和评估期间回放它们,使完全离线且可重复的RL成为可能。应用此流水线产生了5,144个经过验证的任务,覆盖240个服务器和993个工具。在FireFly上训练的40亿参数模型使用GRPO方法在我们的保留测试集上与Claude Sonnet 4.6相当,并在多个工具调用基准测试(包括Tau2-Bench、MCPMark和MCP-Atlas)上显示出改进。

English abstract

Training tool-calling agents requires large-scale trajectory data with verifiable labels, yet existing approaches either synthesize environments that diverge from real API behavior or generate tasks without ground-truth outcomes for verification. We present FireFly, a pipeline for generating verified tool-call data from real-world MCP servers. Our key insight is to invert the standard synthesis pipeline: rather than generating tasks and hoping they are solvable, we first let a strong LLM explore real APIs along graph-guided DAG structures, then synthesize tasks backward from observed outcomes, guaranteeing label correctness by construction. To handle the scale of real-world tool spaces (~1,000 tools), we build a pairwise tool graph and sample sub-DAGs to focus exploration on semantically coherent workflows. To address environment drift in live APIs, we construct a retrieval-augmented simulator that caches all exploration results and replays them during training and evaluation, enabling fully offline and reproducible RL. Applying this pipeline yields 5,144 verified tasks spanning 240 servers and 993 tools. A 4B-parameter model trained with GRPO on FireFly matches Claude Sonnet 4.6 on our held-out test set and shows improvements on multiple tool-calling benchmarks including Tau2-Bench, MCPMark, and MCP-Atlas.

2605.17528 2026-05-19 cs.LG cs.AI cs.CL updated version

CasualSynth: Generating Structurally Sound Synthetic Data

CasualSynth: 生成结构上合理的合成数据

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

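Affiliations * Department of Computer Science, University of Oxford; Institute of Logic and Computation, TU Wien

AI summary This paper proposes the CausalSynth framework, which decouples causal structure generation from semantic realization to generate synthetic data that both respects causal mechanisms and is semantically rich, addressing the inability of LLMs to guarantee causal correctness when generating synthetic data.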

Comments 15 pages

Details
AI Chinese abstract

大型语言模型(LLMs)能够生成逼真的合成数据,但无法保证其输出符合目标领域的因果机制。我们引入CausalSynth框架,该框架将因果结构生成与语义实现解耦,生成既符合因果机制又语义丰富的合成数据。该框架分为三个阶段:首先,一个结构因果模型(SCM)——一个定义在有向无环图(DAG)上的结构方程组,通过祖先采样生成因果骨架,即满足支配图全局马尔可夫性质的变量赋值;其次,一个LLM作为受约束的实现者,一个条件翻译器,将每个骨架映射到高维观测,如临床笔记或交易日志;第三,一个迭代一致性验证模块通过确定性提取检测结构违规,并将针对性的修正反馈给LLM,形成闭环优化过程。我们识别出语义后门问题,即LLM系统性地用预训练先验覆盖施加的因果事实——并证明我们的迭代机制相对于标准拒绝采样减少了由此产生的选择偏差。在三个因果基准(ASIA、ALARM和MIMIC-Struct)上,CausalSynth在假阳性率接近名义α=0.05水平的情况下保持条件独立性,并在70B参数LLM基础上实现了超过96%的可实现率。该框架还通过保留噪声和图 mutilation 支持原理化的干预和反事实生成。

English abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM) -- a tuple of structural equations defined over a directed acyclic graph (DAG) -- generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem -- the systematic tendency of LLMs to override imposed causal facts with pre-training priors -- and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal α = 0.05 level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
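The first phase, ancestral sampling from an SCM, can be sketched on a toy graph. Everything specific here is an assumption for illustration: the three-node DAG (smoking -> cancer -> cough), the probabilities, and the names `EQUATIONS` and `sample_skeleton` are hypothetical and not taken from the paper. The `do` argument shows a simple form of graph mutilation: an intervened node ignores its structural equation while noise is still drawn for every node in the same order.

```python
import random

# Hypothetical toy SCM, listed in topological order: smoking -> cancer -> cough.
# Each equation maps (values of earlier nodes, exogenous noise u in [0, 1)) to a value.
EQUATIONS = {
    "smoking": lambda v, u: u < 0.3,
    "cancer": lambda v, u: u < (0.2 if v["smoking"] else 0.02),
    "cough": lambda v, u: u < (0.8 if v["cancer"] else 0.1),
}

def sample_skeleton(rng, do=None):
    """Draw one causal skeleton by ancestral sampling; `do` fixes intervened nodes."""
    do = do or {}
    values = {}
    for name, equation in EQUATIONS.items():
        u = rng.random()  # noise is drawn even for intervened nodes
        values[name] = do[name] if name in do else equation(values, u)
    return values

skeleton = sample_skeleton(random.Random(0))
intervened = sample_skeleton(random.Random(0), do={"cancer": True})
```

In the pipeline described above, a dictionary like `skeleton` would then be handed to the LLM realizer, and the verification step would extract the same variables back from the generated text to check that they match.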

2605.17503 2026-05-19 cs.AI cs.CL cs.HC updated version

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

基于深度学习和大语言模型的RAG EEG到文本翻译

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

Affiliations * IAS-LAB, Department of Information Engineering, University of Padova; Padova Neuroscience Center; Department of Health Technology, Technical University of Denmark

AI summary This paper proposes a retrieval-augmented generation (RAG) approach to EEG-to-text decoding that combines an EEG encoder, a vector retrieval stage, and a large language model to improve sentence-level decoding accuracy, and validates it on the ZuCo dataset.

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

Details
AI Chinese abstract

从脑电图(EEG)信号解码语言信息仍然是脑机接口(BCI)研究中极具挑战性的问题。特别是,由于EEG记录的信噪比较低,从EEG进行句子级解码尤为困难。以往研究通常在推理阶段未使用教师强制时难以超越随机基线性能。在本文中,我们提出了一种基于检索增强生成(RAG)的句子级EEG到文本解码流程,结合与语义句子嵌入对齐的EEG编码器、向量检索阶段以及大语言模型(LLM)以将检索到的句子细化为连贯的输出。实验在Zurich认知语言处理语料库(ZuCo)数据集上进行,该数据集包含在静默阅读期间收集的单次试验EEG记录。为了评估系统是否从这些EEG信号中提取了有意义的信息,结果与随机基线进行比较。在九名受试者中,所提出的流程优于随机基线,平均余弦相似度为0.181±0.022,与基线0.139±0.029相比,相对改进为30.45%。统计分析进一步确认了这种改进的显著性,遵循严格评估流程,其中推理阶段不接触地面真实标签。

English abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
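The retrieval stage and the reported metric can be sketched as follows. This is an assumed, simplified picture: an EEG embedding is compared against sentence embeddings by cosine similarity and the top-k sentences are returned for later LLM refinement. The names `cosine`, `retrieve`, and `relative_improvement` and the dictionary-based sentence bank are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, bank, k=2):
    """Return the k sentences whose embeddings are closest to the query embedding."""
    return sorted(bank, key=lambda s: cosine(query, bank[s]), reverse=True)[:k]

def relative_improvement(score, baseline):
    """Relative gain over a baseline, in percent."""
    return 100.0 * (score - baseline) / baseline
```

Applied to the rounded means in the abstract, `relative_improvement(0.181, 0.139)` gives about 30.2%; the reported 30.45% was presumably computed from unrounded values.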

2605.17481 2026-05-19 cs.CL updated version

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

结合CNN的混合特征组合用于孟加拉语虚假新闻分类

Md Gulzar Hussain, Babe Sultana, Md Rinku Ali

Affiliations * School of Software, Nanjing University of Information Science and Technology; Department of Computer Science and Engineering, Green University of Bangladesh; School of Computer Science and Artificial Intelligence, Changzhou University

AI summary This paper studies how different feature combinations (semantic, statistical, and character-level) affect Bangla fake news classification with a CNN model on the BanFakeNews-2.0 dataset, finding that combining multiple features significantly improves recall and F1-score.

Comments Already accepted and presented in the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)

Details
AI Chinese abstract

如今,孟加拉国的人们越来越多地通过互联网和社交媒体获取日常新闻,而不是传统报纸。然而,这些平台上的虚假新闻传播对真实媒体的可信度构成了风险和挑战。尽管已有研究致力于检测孟加拉语虚假新闻,但该领域仍有改进空间。本研究探讨了在BanFakeNews-2.0数据集上使用CNN模型时,特征选择方法(如语义、统计和字符级特征及其组合)在识别虚假新闻中的有效性。本文的关键发现表明,与单独使用特征相比,结合多种特征显著提高了召回率和F1分数。本研究的代码可在此获取:https://github.com/gulzar09/Bn_FNews_H.Feature.

English abstract

Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. This research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. Key findings reveal that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research is available at https://github.com/gulzar09/Bn_FNews_H.Feature.

2605.17467 2026-05-19 cs.CL updated version

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

VerifyMAS: LLM多智能体系统中故障归因的假设验证

Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

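Affiliations * Singapore Management University; University of Illinois at Urbana-Champaign; University of Illinois at Chicago

AI summary This paper proposes VerifyMAS, a framework that attributes failures in LLM multi-agent systems by verifying hypotheses, addressing the weaknesses of existing methods in identifying global failures and in fine-grained attribution; experiments show it outperforms existing methods across a range of models.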

Comments 22 pages

Details
AI Chinese abstract

大型语言模型驱动的多智能体系统(LLM-MAS)在复杂任务中表现出色,但不可靠的智能体仍是系统可靠性的重要瓶颈。自动故障归因因此至关重要,但现有方法如直接预测智能体错误对和智能体优先故障归因依赖于本地日志,无法识别仅在完整交互轨迹中显现的全局故障,如跨步不一致和智能体间协调错误。此外,直接预测故障会引入大规模组合搜索空间,阻碍细粒度归因。为了解决这些挑战,我们提出了VerifyMAS,一种用于智能体故障归因的假设验证框架。不同于直接预测故障智能体和错误类型,VerifyMAS针对完整轨迹验证故障假设。这种基于验证的方法将归因分解为轨迹级错误验证和细粒度智能体定位,提供了一种以错误优先的归因方法,能够捕捉全局故障模式,同时显著减少搜索空间。我们进一步引入基于结构化错误分类学的假设数据构建策略,并对专用LLM验证器模型进行微调,用于轨迹级故障验证和智能体归因。在Aegis-Bench和Who&When上的实验表明,VerifyMAS在多种基础模型上均表现优异,包括开源Qwen和基于API的GPT模型,在不牺牲长多智能体轨迹推理效率的情况下,优于现有方法。

English abstract

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

2605.17453 2026-05-19 cs.CR cs.CL updated version

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

勿信工具:在不可信工具反馈下评估和防御LLM代理

Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li, Binwu Wang, Longyue Wang, Chenyang Lyu, Guanhua Chen

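Affiliations * Southern University of Science and Technology; University of Science and Technology of China; University of Birmingham; Zhejiang University; East China Normal University; Alibaba Group

AI summary This work studies how to evaluate and defend LLM agents under untrusted tool feedback. It identifies a new failure mode, cognitive poisoning, and introduces the TRUST-Bench benchmark, the GuardedJoint penalty metric, and the VISTA-Guard framework, showing that trajectory-aware final-action scoring is effective for the safety-utility tradeoff.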

Details
AI Chinese abstract

LLM代理越来越多地依赖外部工具做出重大决策,但现有的代理安全基准和防御方法通常假设一旦选择工具,其反馈就是可信的。我们研究了一种不同的失败模式,即认知中毒,其中恶意工具在探索过程中表现合理,通过看似无害的反馈积累信任,只有在隐藏状态条件与最终执行动作一致时才会变得有害。为此,我们构建了TRUST-Bench,一个包含1970个隐藏触发工具妥协事件的任务条件基准,并引入了GuardedJoint不对称惩罚指标,以更好地反映真实部署风险,还提出了VISTA-Guard,一种不依赖特定骨干的最终动作风险评分框架。核心思想是将多步工具交互抽象为结构化环境变量,这些变量编码了信任形成动态,并从该轨迹条件表示中评分最终执行动作的风险。实验表明,基于提示的启发式方法、标量特征和零样本判断在该领域失效,而轨迹感知的最终动作评分在域内具有强区分能力,并在平衡的域外迁移中保持有效。在GuardedJoint下,VISTA-Guard在域内达到84.2,在平衡的域外评估中达到56.9,而仅优化安全-效用权衡一方的方法则降至零。这些发现支持了更广泛的代理安全观点:在黑盒工具生态系统中,决定性的防御目标不是本地提示文本或工具描述本身,而是信任在交互轨迹中的形成方式以及通过最终动作所承诺的方式。

English abstract

Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into structured environment variables that encode trust-formation dynamics and then score the risk of the final executable action from this trajectory-conditioned representation. Experiments show that prompt-centric heuristics, scalarized features, and zero-shot judges fail in this regime, whereas trajectory-aware final-action scoring yields strong in-domain discrimination and remains effective under balanced out-of-distribution transfer. Under GuardedJoint, VISTA-Guard reaches 84.2 in-domain and 56.9 on balanced out-of-distribution evaluation, while methods that optimize only one side of the safety-utility tradeoff collapse to zero. These findings support a broader view of agent security in black-box tool ecosystems: the decisive defense target is not local prompt text or tool descriptors alone, but the way trust is formed across the interaction trajectory and committed through the final action.

2605.17450 2026-05-19 cs.SE cs.AI cs.CL cs.CR updated version

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix:通过差分运行时证据和技能重用进行代理漏洞修复

Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

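Affiliations * Beihang University; The University of Hong Kong

AI summary This paper proposes ContraFix, a framework that uses differential runtime evidence and reusable repair skills to address semantic misunderstanding by LLM agents in automated vulnerability repair, achieving high resolution rates on SEC-Bench and PatchEval.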

Details
AI Chinese abstract

大型语言模型(LLM)代理越来越多地用于自动漏洞修复(AVR),其中仓库级推理使它们能够检查上下文并生成源代码补丁。然而,最近的经验结果表明,这些代理仍然难以处理现实世界中的漏洞。其主要失败模式是语义误解:选择一个修复方向,该方向不匹配根本原因。我们识别出这种差距的两个原因。现有代理通常仅从失败执行进行推理。崩溃报告可以指出程序失败的位置,但无法揭示故障位置附近的众多候选者中哪一个变量或状态转换将崩溃行为与安全执行区分开来。因此,代理通常生成症状导向的补丁而不是因果修复。此外,为一个漏洞收集的证据很少被保留,因此后续仓库中的类似案例必须从头开始诊断。我们提出了ContraFix,一种结合差分运行时证据和可重用修复技能的代理AVR框架。其Mutator构造了跨越失败边界的PoC变体;其Analyzer在故障区域插入状态探针,并将崩溃和非崩溃执行之间的差异总结为修复规范;其Patcher将规范转换为经过验证的源代码补丁。每次成功的修复都会更新一个包含修复规范和突变策略的双轨技能库,这些通过三层策略在未来实例中检索。在SEC-Bench(C/C++,200个实例)和PatchEval(Go、Python、JavaScript,225个实例)上,ContraFix搭配GPT-5-mini分别解决了84.0%和73.8%的任务,在两个基准上均实现了最先进的性能,同时成本不到最强可比基线的三分之一。

English abstract

Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.

2605.17447 2026-05-19 cs.CV cs.CL updated version

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR: 通过KV缓存剪枝实现高效的动态视觉聚焦文档解析

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

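Affiliations * Tsinghua University

AI summary This paper proposes FastOCR, a training-free framework that uses dynamic visual fixation to prune attention over the KV cache during document parsing, significantly improving processing speed while largely preserving accuracy.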

Details
AI Chinese abstract

视觉-语言模型(VLMs)在光学字符识别(OCR)中展现出强大潜力,但编码密集文档所需的大量视觉令牌导致推理成本过高。现有剪枝方法依赖物理驱逐,例如在prefill阶段永久丢弃视觉令牌。尽管在自然图像上有效,但此策略在OCR中失效,因为几乎每个视觉令牌可能对应一个字符或结构元素,任何不可逆的损失都会导致准确性急剧下降。我们观察到,尽管文档图像看似密集且难以剪枝,模型对它们的注意力实际上在时间上是稀疏的:在每个解码步骤中,它集中在一小块区域,随着步骤逐渐移动,就像人类读者依次聚焦于词语而不是一次性感知整页内容一样。受此动态视觉聚焦现象的启发,我们将不可行的全局剪枝问题转化为可处理的局部动态问题,并提出FastOCR,一种无需训练的框架,包含两个互补模块。具体而言,Focal-Guided Pruning识别少量焦点层,并在每一步从中选择最相关的视觉令牌;Cross-Step Fixation Reuse利用固定点的逐渐移动,从上一步温暖启动。通过动态调整哪些令牌被关注而不是驱逐任何缓存中的令牌,FastOCR避免了永久信息丢失。广泛实验表明,FastOCR作为一种即插即用的加速模块,在五个不同大小和架构的VLMs上表现出一致的泛化能力。在Qwen2.5-VL上,FastOCR在每个解码步骤只关注5%的视觉令牌,保留了未剪枝模型98%的准确性,同时将注意力延迟减少了3.0倍。

English abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
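The difference between evicting tokens and re-selecting them at every step can be shown with a toy selection function. This is only a sketch of the general idea of per-step top-k selection over a cache that stays intact; how FastOCR picks focal layers or reuses fixations across steps is not described in enough detail in the abstract to reproduce, and `select_fixation` and `keep_ratio` are hypothetical names.

```python
def select_fixation(attn_scores, keep_ratio=0.05):
    """Return indices of the visual tokens to attend to at this decoding step.

    attn_scores holds one relevance score per cached visual token. The cache
    itself is never modified, so a token skipped now can be selected later.
    """
    k = max(1, int(len(attn_scores) * keep_ratio))
    top = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)[:k]
    return sorted(top)
```

Because selection is recomputed from scores at each step, a region ignored early in decoding becomes available again once the fixation shifts toward it, which is what a permanent prefill-time eviction cannot offer.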

2605.17444 2026-05-19 cs.SE cs.AI cs.CL 版本更新

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair:面向智能体式仓库级漏洞修复的分层记忆

Simiao Liu, Li Zhang, Fang Liu, Xiaoli Lian, Yang Liu, Yinghao Zhu

发表机构 * Beihang University(北京航空航天大学) The University of Hong Kong(香港大学)

AI总结 本研究提出MemRepair,一种增强记忆的代理框架,通过分层记忆和动态反馈循环提高漏洞修复的可靠性,实现了在多个仓库级别的漏洞修复基准上的高修复率。

详情
AI中文摘要

现代软件生态系统面临越来越多披露的漏洞,增加了需要在仓库规模上可靠运行的自动化修复技术的需求。尽管基于大语言模型(LLM)的代理最近在自动化漏洞修复(AVR)中显示出潜力,但大多数现有系统仍然将修复视为在当前可见代码上下文中的一次生成步骤。因此,它们缺乏重用先前修复或从失败验证尝试中学习的持久机制,这限制了它们在复杂、多文件修复任务上的有效性。我们提出了MemRepair,一种增强记忆的代理框架,将漏洞修复视为一个迭代、经验驱动的过程。MemRepair结合了三个互补的记忆层,即History-Fix、Security-Pattern和Refinement-Trajectory记忆,并通过动态反馈驱动的细化循环。这种设计使代理能够检索仓库特定的修复惯例,应用可重用的安全防御,并利用先前的“失败到成功”轨迹来根据运行时证据修正语义无效的补丁。我们评估了MemRepair在三个具有代表性的仓库级别漏洞修复基准上的表现:SEC-Bench、PatchEval(Python、Go、JavaScript)以及Multi-SWE-bench的C++子集。MemRepair在三个基准上分别实现了58.0%、58.2%和30.58%的修复率,优于强大的通用代理如OpenHands和SWE-agent,以及专用的AVR工具InfCode-C++,同时保持竞争性的修复成本。这些结果表明,持久的、分层的修复记忆可以显著提高跨多种语言和仓库设置的代理漏洞修复的可靠性。

英文摘要

Modern software ecosystems face a rapidly growing number of disclosed vulnerabilities, increasing the need for automated repair techniques that can operate reliably at repository scale. Although Large Language Model (LLM)-based agents have recently shown promise for automated vulnerability repair (AVR), most existing systems still treat repair as a single generation step over the currently visible code context. As a result, they lack a persistent mechanism for reusing prior fixes or learning from failed validation attempts, which limits their effectiveness on complex, multi-file repair tasks. We present MemRepair, a memory-augmented agentic framework that formulates vulnerability repair as an iterative, experience-driven process. MemRepair combines three complementary memory layers, i.e., History-Fix, Security-Pattern, and Refinement-Trajectory memories, with a dynamic feedback-driven refinement loop. This design allows the agent to retrieve repository-specific repair conventions, apply reusable security defenses, and exploit prior "failure-to-success" trajectories to revise semantically invalid patches based on runtime evidence. We evaluate MemRepair on three representative repository-level vulnerability repair benchmarks: SEC-Bench, PatchEval (Python, Go, JavaScript), and the C++ subset of Multi-SWE-bench. MemRepair achieves state-of-the-art resolution rates of 58.0%, 58.2%, and 30.58%, respectively, outperforming strong general-purpose agents such as OpenHands and SWE-agent, as well as the specialized AVR tool InfCode-C++, while maintaining competitive repair cost. These results show that persistent, hierarchical repair memory can substantially improve the reliability of agentic vulnerability repair across diverse languages and repository settings.

2605.17442 2026-05-19 cs.CL cs.AI cs.IR 版本更新

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数:低资源多语言NLP中的数据集可见性不对称

Zhiyin Tan, Changxu Duan

发表机构 * L3S Research Center, Leibniz University Hannover(莱布尼茨汉诺威大学L3S研究中心) Technische Universität Darmstadt(达姆施塔特技术大学)

AI总结 本研究探讨了多语言NLP中数据集可见性不对称问题,通过结合目录基准和文献证据,提出了资源密度指数(RDI)来衡量语言的数据集可见性,揭示了大量语言在目录记录中数据贫乏但文献中存在明显数据集活动的现象。

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

详情
AI中文摘要

多语言NLP常常依赖集中式目录中的数据集计数来判断哪些语言资源丰富、哪些语言资源贫乏。然而,这些目录只记录了数据集可见性的一个层面:哪些数据集已被登记或由机构分发。它们不一定反映哪些数据集在研究文献中被创建、引用或重用。为了考察这一差距,我们将基于目录的基线与来自文献的数据集流通证据相结合。我们引入了资源密度指数(RDI),定义为每一百万使用者对应的已编目数据集数量,并为Ethnologue中使用最广泛的200种语言计算了该指数。其中,118种语言(59%)在LRE Map和语言数据联盟(LDC)上的平均RDI为零,另有23种语言低于0.1,相当于每一千万使用者至多一个已编目数据集。随后,我们在Semantic Scholar语料库上对这141种低可见性语言应用LLM辅助的引用挖掘流程。经过人工验证和整合,我们在53种语言中识别出609个唯一数据集,其中356个仍可通过有效的公共链接公开访问。这些结果揭示了显著的可见性差距:许多使用者众多的语言在目录记录中显得数据贫乏,但在研究文献中却有明显的数据集活动。我们的发现表明,多语言数据稀缺不仅应被视为生产问题,还应被视为文档记录、可发现性和长期可访问性的问题。代码和数据可在(https://github.com/zhiyintan/dataset-visibility-asymmetry)公开获取。

英文摘要

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).
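The RDI definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and averaging the per-catalogue values is one reading of "average RDI across the LRE Map and the LDC".

```python
def resource_density_index(catalogued_datasets: int, speakers: int) -> float:
    """Catalogued datasets per one million speakers (the RDI as defined in the abstract)."""
    if speakers <= 0:
        raise ValueError("speakers must be positive")
    return catalogued_datasets / (speakers / 1_000_000)


def average_rdi(counts_by_catalogue: dict[str, int], speakers: int) -> float:
    """Mean RDI over several catalogues (assumed averaging scheme)."""
    values = [resource_density_index(c, speakers) for c in counts_by_catalogue.values()]
    return sum(values) / len(values)


# Hypothetical language with ten million speakers and one dataset in each catalogue.
example = average_rdi({"LRE Map": 1, "LDC": 1}, speakers=10_000_000)
```

With these made-up counts the value sits at the 0.1 boundary, which is the abstract's "one catalogued dataset per ten million speakers".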

2605.17436 2026-05-19 cs.CV cs.CL 版本更新

Medical Context Distorts Decisions in Clinical Vision Language Models

医学语境扭曲了临床视觉语言模型的决策

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

发表机构 * MICS CentraleSupélec - Université Paris-Saclay(巴黎中央理工-高等电力学院 - 巴黎萨克雷大学) Cancer Data Science Unit(癌症数据科学部门) IHU PRISM National Institute in Precision Oncology(IHU PRISM国家精准肿瘤学研究所) University Paris-Saclay(巴黎萨克雷大学) CentraleSupelec(巴黎中央理工-高等电力学院) Gustave Roussy(古斯塔夫·鲁西研究所) INSERM(法国国家健康与医学研究院) CONICET(阿根廷国家科学与技术研究理事会) Universidad de Buenos Aires(布宜诺斯艾利斯大学)

AI总结 本文研究了医学语境对临床视觉语言模型决策的影响,发现模型在整合医学记录的视觉和文本信息时存在模态依赖、无关历史依赖和提示敏感性等问题,强调了在临床应用前需要建立明确的保障措施。

详情
AI中文摘要

视觉-语言模型(VLMs)越来越多地被提出用于临床决策支持,但其在需要整合医学记录中视觉和文本信息的现实场景中的可靠性仍缺乏充分了解。本文识别了三种失败模式:(1)对文本的过度依赖而非图像,(2)对无关临床历史的虚假依赖,以及(3)在语义等价输入上的提示敏感性。我们评估了多种通用领域和医学调优的开源和闭源VLMs,在胸片任务中使用MIMIC-CXR进行测试。通过系统地操纵图像-文本对齐、临床历史和提示公式,我们发现VLM的决策受到文本模态主导,即使有视觉证据可用。此外,我们发现VLMs受到无关报告的强烈影响,而微小的提示变化可以逆转正确的图像基预测。我们的发现强调了在考虑将这些模型用于临床实践之前,需要建立明确的保障措施和压力测试。

英文摘要

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

2605.17435 2026-05-19 cs.CL 版本更新

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

BELIEF: 结构化证据建模与不确定性感知融合用于生物医学问答

Chang Zong, Hao Ning, Siliang Tang, Jie Huang, Jian Wan

发表机构 * School of Computer Science and Technology, Zhejiang University of Science and Technology(浙江科技大学计算机科学与技术学院) College of Artificial Intelligence, Zhejiang University(浙江大学人工智能学院) Zhejiang Key Laboratory of Biomedical Intelligent Computing Technology, Zhejiang University of Science and Technology(浙江科技大学生物医学智能计算技术重点实验室)

AI总结 本文提出BELIEF框架,通过结构化证据建模和不确定性感知融合,提升生物医学问答任务中检索文献的利用效率,实现对证据可靠性、不确定性以及候选假设的支持强度的显式建模。

Comments 14 pages, 6 figures

详情
AI中文摘要

生物医学问答通常需要依据检索到的文献做出决策,而这些文献的相关性、质量以及对候选答案的支持程度参差不齐。大多数检索增强的大语言模型(LLM)方法将这些文献作为扁平文本输入模型,使证据可靠性和剩余不确定性大多处于隐含状态。我们提出BELIEF,一种用于封闭集生物医学问答的结构化证据建模和不确定性感知融合框架。BELIEF不把检索到的文档视为无差别的上下文,而是将其转换为证据对象,记录临床属性、来源质量、问题相关性、支持强度以及相关的候选假设。这些证据对象为两条互补的推理路径提供共享基础。符号路径基于Dempster-Shafer(D-S)理论,在有限的答案空间上构建可靠性加权的基本概率分配,并进行不确定性感知的符号证据融合,以估计信念和残余不确定性。神经路径使用相同的结构化证据进行基于LLM的语义推理,而一个可靠性感知的仲裁模块根据信念强度、不确定性、证据可靠性和语义一致性来协调符号和神经输出。在PubMedQA、MedQA和MedMCQA上使用五个通用大语言模型(LLM)骨干进行的实验表明,BELIEF在30种骨干-数据集-指标设置中的25种上取得了最佳结果。与生物医学领域模型的比较表明,BELIEF在MedQA和MedMCQA上具有竞争力,而专门的生物医学预训练在PubMedQA上仍具优势。消融、互补性、不确定性分层和成本分析进一步表明,BELIEF通过使证据结构、路径分歧和决策不确定性显式化,改善了对检索证据的利用。

英文摘要

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster--Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
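For readers unfamiliar with the symbolic path, the sketch below shows textbook Dempster-Shafer machinery: Shafer discounting by a reliability weight, followed by Dempster's rule of combination. It is not BELIEF's implementation; the answer frame, the reliability values, and all function names are illustrative assumptions.

```python
from itertools import product


def discount(mass, reliability, frame):
    """Shafer discounting: scale each mass by `reliability`, move the rest to the full frame."""
    out = {focal: reliability * m for focal, m in mass.items() if focal != frame}
    out[frame] = reliability * mass.get(frame, 0.0) + (1.0 - reliability)
    return out


def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect focal sets, renormalise by 1 - conflict."""
    joint, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: m / (1.0 - conflict) for focal, m in joint.items()}


def belief(mass, hypothesis):
    """Belief in `hypothesis`: total mass of focal sets contained in it."""
    return sum(m for focal, m in mass.items() if focal <= hypothesis)


# Hypothetical closed answer space and two pieces of evidence with assumed reliabilities.
FRAME = frozenset({"yes", "no", "maybe"})
e1 = discount({frozenset({"yes"}): 1.0}, reliability=0.8, frame=FRAME)
e2 = discount({frozenset({"no"}): 1.0}, reliability=0.5, frame=FRAME)
fused = combine(e1, e2)
belief_yes = belief(fused, frozenset({"yes"}))
residual_uncertainty = fused[FRAME]  # mass left on "any answer"
```

The mass that remains on the full frame after fusion is one natural reading of "residual uncertainty": less reliable evidence leaves more of it.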

2605.17398 2026-05-19 cs.CL cs.LG 版本更新

MiniGPT: Rebuilding GPT from First Principles

MiniGPT:从第一原理重新构建GPT

Jibin Joseph

发表机构 * Department of Computer Science(计算机科学系) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出MiniGPT,一个基于PyTorch从头实现的GPT风格自回归语言模型,旨在在研究nanoGPT设计后,从第一原理重新构建GPT核心流程,同时保持模型和训练代码独立编写。MiniGPT实现了词嵌入、位置嵌入、因果多头自注意力、预层归一化Transformer块、残差连接、前馈MLP层、下一词交叉熵训练(教师强制)、验证跟踪、检查点选择和自回归文本生成。

Comments 13 pages, 2 figures

详情
AI中文摘要

本文提出了MiniGPT,一个用PyTorch从零实现的紧凑GPT风格自回归语言模型。目的是在研究Andrej Karpathy的nanoGPT设计之后,从第一原理重新构建GPT核心流程,同时保持模型和训练代码在单个notebook中独立编写。MiniGPT实现了词嵌入、位置嵌入、因果多头自注意力、预层归一化Transformer块、残差连接、前馈MLP层、下一词交叉熵训练(教师强制)、验证跟踪、检查点选择以及自回归文本生成。本文在Tiny Shakespeare数据集上使用字符级分词评估了该实现。一个0.83M参数的基线模型在3000次训练迭代后达到验证损失1.7236。一个更强的10.77M参数配置,使用更大的上下文长度和改进的训练设置,达到最佳验证损失1.4780,并生成具有可识别莎士比亚风格对话结构的文本。MiniGPT并未引入新的语言模型架构。相反,它记录了从原始文本到训练好的字符级生成的清晰且可重复的实现路径,包括设计选择、训练行为、生成质量以及实际限制。

英文摘要

This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on the Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.
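Character-level tokenization and next-token (teacher-forcing) targets can be illustrated without any deep-learning library. The sketch below is a generic example, not MiniGPT's notebook code; the sample text, `block_size`, and helper names are assumptions.

```python
text = "To be, or not to be"  # stand-in for the Tiny Shakespeare corpus

# Character-level vocabulary: one integer id per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of character ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of character ids back to a string."""
    return "".join(itos[i] for i in ids)


def make_example(ids, start, block_size):
    """Return (inputs, targets): targets are the inputs shifted left by one position."""
    x = ids[start:start + block_size]
    y = ids[start + 1:start + block_size + 1]
    return x, y


ids = encode(text)
x, y = make_example(ids, start=0, block_size=8)
```

Each position in `y` is the character that follows the matching position in `x`, so one window yields `block_size` training signals when the model uses a causal mask.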

2605.17382 2026-05-19 cs.AI cs.CL cs.GR 版本更新

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ: 量化定性判断以实现可扩展且与人类对齐的生成AI评估

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

发表机构 * AI Lab, Arioobarzan Engineering Team(艾伊罗巴赞工程团队人工智能实验室) Department of Computer Engineering and Information Technology(计算机工程与信息科技系) Department of Informatics, Bioengineering, Robotics and Systems Engineering(信息学、生物工程、机器人与系统工程系) University of Genoa(热那亚大学)

AI总结 本文提出QQJ框架,通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器,实现与人类判断一致的可扩展评估方法,验证了结构化定性判断在大规模应用中的有效性。

详情
AI中文摘要

生成人工智能的快速发展暴露了现有评估方法的根本局限,尤其是在开放性、创造性和面向人类的任务中。传统自动指标依赖于表面统计相似性,往往无法反映人类对质量的感知,而纯粹的人类评估虽然可靠,但成本高、主观性强且难以扩展。最近利用大语言模型作为评估者的做法虽然提高了可扩展性,但通常缺乏明确的人类定义评估原则,导致偏见和不一致。本文介绍Quantifying Qualitative Judgment (QQJ),一种可扩展且以人类为中心的评估框架,通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器,以实现人类判断与自动化评估之间的桥梁。这种设计使在多样化的生成任务和模态上实现了一致、可解释和可扩展的评估。在文本和图像生成上的大量实验表明,QQJ在与人类判断的一致性方面优于传统自动指标和无约束的大语言模型评估者。此外,QQJ在重复评估中表现出更高的稳定性,并在识别关键失败模式如幻觉和意图不匹配方面具有更好的诊断能力。这些结果表明,结构化的定性判断可以在不牺牲可解释性和人类对齐的情况下实现规模化应用,使QQJ成为现代生成AI系统可靠评估的实用基础。

英文摘要

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

2605.17379 2026-05-19 cs.CL cs.AI 版本更新

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

用更好的令牌学得更快:面向专业文本摘要的参数高效词汇适应

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

发表机构 * Dept. of Computer Science and Engg., IIT Kharagpur(印度理工学院克勒格布尔分校计算机科学与工程系) Dept. of Medicine (Biomedical Informatics), Stanford University(斯坦福大学医学系(生物医学信息学))

AI总结 本文提出了一种参数高效的领域适应方法,通过结合词汇适应和预训练,提升大型语言模型在专业领域文本摘要任务中的性能,同时减少训练时间和参数数量。

Comments 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper

详情
AI中文摘要

在通用领域语料库上预训练的大型语言模型在应用于专门领域时常常表现出令牌化效率低下。尽管用于领域适应的持续预训练在一定程度上缓解了性能下降,但并未解决根本的词汇不匹配问题。为了解决这一差距,我们引入了一种有针对性的参数高效领域适应方法,将词汇适应与预训练相结合,用于基于LLM的文本摘要。我们的统一框架为预训练令牌化器增加领域特定的令牌,同时选择性地替换训练不足和不可达的令牌以限制参数增长。我们在Llama-3.1-8B和Qwen2.5-7B上评估了该方法,覆盖法律和医学摘要任务,并采用面向挑战的评估协议,聚焦于专家驱动的文本和摘要,这类文本中被过度切分的Out-of-Vocabulary(OOV)词通常更为集中。词汇适应算法通过提高生成摘要与参考摘要之间的语义相似性,提升了摘要模型的整体质量。此外,适应后的模型生成的摘要包含更多恰当的新词和领域特定词汇,从而提高了连贯性、相关性和忠实性。我们进一步观察到,与持续预训练相比,我们的方法将训练时间减少了35-55%;与仅扩展词表的方法相比,参数数量最多减少37%。我们公开了代码库:https://github.com/gb-kgp/VocabReplace-Then-Expand。

英文摘要

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks, using a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by $35-55\%$ over continual pretraining and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

2605.16234 2026-05-19 cs.LG cs.AI cs.CL 版本更新

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

没有免费的交换:Transformer中的协议依赖层冗余

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了Transformer中层冗余问题,通过比较替换和交换两种协议,发现它们在压缩中的效果存在显著差异,且在相同评估器下,不同协议可能导致层剪枝结果的变化,尤其在高替换距离时更为明显。

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

详情
AI中文摘要

当研究人员询问两个Transformer层是否在压缩中“等价”时,他们常常混淆了不同的测试方法。替换测试询问是否可以将一层的映射替换为另一层的映射;交换测试询问是否当两层位置交换时,它们近似可交换。两者都是基于输出的swap-KL探测器,但它们并不总是一致:在预训练的Transformer中,协议差距可能在相同评估器下改变哪些层看起来可以安全剪枝,尤其是在替换距离较高时。我们跨检查点和架构测量了两种协议。在Pythia训练轨迹(410M和1.4B)上,替换-交换差距从初始化到收敛逐渐增大。在8B规模的WikiText-2合同下,Qwen3-8B进入了一个发散阶段:交换引导的移除比替换引导的在相同层预算下更安全,而Llama-3.1-8B在剪枝成本上两者持平,尽管交换KL较低,这表明指标差距不必一对一映射到移除。在层移除或合并之前,应在目标检查点上对两种swap-KL进行评分;该诊断仅需未标记的正向传递。

英文摘要

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

2605.15508 2026-05-19 cs.LG cs.CL 版本更新

STS: Efficient Sparse Attention with Speculative Token Sparsity

STS: 高效稀疏注意力与推测性标记稀疏性

Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出STS,一种无需模型再训练的稀疏注意力机制,通过利用较小的草稿模型识别出的重要标记来预测更大目标模型的重要标记,从而在大规模语言模型推理中实现高效的稀疏注意力计算,显著提升速度并保持准确性。

Comments 14 pages, 12 figures

详情
AI中文摘要

注意力的二次复杂性对大型语言模型(LLM)推理造成了严重的内存和计算瓶颈。这一挑战在新兴的智能体应用中尤为突出,这些应用需要处理数百万标记的序列。我们提出STS,一种无需模型再训练的稀疏注意力机制。STS利用一个关键洞察:被较小的草稿模型识别为重要的标记,对较大目标模型的重要标记具有高度预测性。通过集成到推测解码框架中,STS复用草稿模型的注意力分数,动态构建标记和头部层面的稀疏性掩码。该掩码有效剪除目标LLM中昂贵的注意力计算。我们的评估显示,在代表性基准NarrativeQA上,STS在约90%稀疏度下实现了2.67倍加速,与密集注意力相比准确性下降可忽略不计。STS在稀疏性与准确性的权衡上达到了新的最先进水平(state-of-the-art),在给定准确性预算下实现了比先前技术更高的稀疏度。

英文摘要

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on the representative NarrativeQA benchmark while maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
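One simple way to turn draft-model attention scores into a per-head keep mask is top-k selection at a target sparsity. The abstract does not say how STS builds its mask, so the selection rule, the function name, and the toy scores below are all assumptions made for illustration.

```python
def topk_keep_mask(draft_scores, sparsity):
    """Keep the highest-scoring tokens for one head; True means the target model attends to it.

    `draft_scores` holds one importance score per context token (assumed to come
    from the draft model's attention); `sparsity` is the fraction of tokens dropped.
    """
    n = len(draft_scores)
    keep = max(1, round(n * (1.0 - sparsity)))  # always keep at least one token
    ranked = sorted(range(n), key=lambda i: draft_scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [i in kept for i in range(n)]


# Toy scores for two heads over ten context tokens (made-up numbers).
scores_per_head = [
    [0.01, 0.02, 0.50, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.14],
    [0.30, 0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10],
]
masks = [topk_keep_mask(head, sparsity=0.9) for head in scores_per_head]
```

Because the mask is computed separately for each head, different heads can keep different tokens, which is the "token-and-head-wise" aspect the abstract mentions.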

2605.14005 2026-05-19 cs.CL cs.LG 版本更新

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

槲寄生(Mistletoe):针对推测解码的隐秘加速-崩溃攻击

Shuoyang Sun, Chang Dai, Hao Fang, Kuofeng Gao, Xinhao Zhong, Yi Sun, Fan Mo, Shu-Tao Xia, Bin Chen

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) South China University of Technology(华南理工大学) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Huawei Technology(华为技术)

AI总结 本文提出Mistletoe攻击,通过优化降质目标和语义保留目标,隐秘地降低推测解码的接受长度τ,从而减少加速效果,同时保持输出质量。

详情
AI中文摘要

推测解码已成为加速大型语言模型(LLM)推理的广泛采用技术,它先草拟多个候选token,再由目标模型并行验证。然而,其效率关键取决于平均接受长度τ,即每个验证步骤中有多少草稿token被接受。本文识别了基于模型的推测解码中一种新的机制层漏洞:drafter被训练去近似目标模型分布,但这种近似不可避免地不完美。这种drafter-目标不匹配形成了一个隐藏的攻击面:微小扰动可以保持目标模型的可见行为,同时大幅降低草稿token的可接受性。我们提出Mistletoe,一种针对推测解码的隐秘加速-崩溃攻击。Mistletoe直接针对推测解码的接受机制。它联合优化一个用于降低drafter-目标一致性的降质目标,以及一个用于约束目标模型输出分布的语义保留目标。为了解决这两个目标之间的冲突,我们引入了一种null-space投影机制,将降质梯度投影到远离局部语义保留方向的方向上,从而在抑制草稿token被接受的同时将语义漂移降到最低。在多种推测解码系统上的实验表明,Mistletoe显著降低了平均接受长度τ,使加速效果崩溃,并降低了平均token吞吐量,同时保持了输出质量和困惑度。我们的工作强调,推测解码引入了超出现有输出鲁棒性范畴的机制层攻击面,呼吁为LLM加速系统采用更鲁棒的设计。

英文摘要

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $τ$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $τ$, collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
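The geometric idea behind projecting one gradient away from another direction can be shown with plain vectors. This is only a sketch of the standard orthogonal-projection step. How Mistletoe defines its "local semantic-preserving direction" is not stated in the abstract, so treating it as a single vector `direction` is an assumption, as are the names used here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, direction, eps=1e-12):
    """Remove from `grad` its component along `direction`.

    The result is orthogonal to `direction`, so a small step along it leaves
    a quantity whose gradient is `direction` unchanged to first order.
    """
    denom = dot(direction, direction)
    if denom < eps:  # degenerate direction: nothing to project out
        return list(grad)
    scale = dot(grad, direction) / denom
    return [g - scale * d for g, d in zip(grad, direction)]


# Toy vectors standing in for the two objectives' gradients (made-up numbers).
degradation_grad = [1.0, 2.0, 3.0]
semantic_direction = [0.0, 1.0, 0.0]
update = project_out(degradation_grad, semantic_direction)
```

The projected update keeps whatever part of the degradation signal does not interfere with the protected direction and drops the rest.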

2605.13415 2026-05-19 cs.CL cs.AI cs.LG 版本更新

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

KIT-TIP-NLP 参加 MultiPride:基于多语言基础模型的持续学习

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

发表机构 * Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan(日本北海道北见市北见工业大学文本信息处理实验室,邮编090-0015)

AI总结 本文提出了一种多阶段框架,用于检测多语言社交媒体中被重新夺回(reclaimed)的侮辱性词语。该框架针对在英语、西班牙语和意大利语推文中区分LGBTQ+相关侮辱性词语的夺回性用法与非夺回性用法这一挑战,通过数据驱动的模型选择、保持语义的数据增强、归纳迁移学习和领域特定知识注入等方法,应对数据稀缺、类别不平衡以及情感表达的跨语言差异。

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

详情
AI中文摘要

本文提出了一种多阶段框架,用于检测多语言社交媒体话语中被重新夺回(reclaimed)的侮辱性词语。它针对的挑战是在英语、西班牙语和意大利语推文中区分LGBTQ+相关侮辱性词语的夺回性用法与非夺回性用法。该框架处理三个相互交织的方法学挑战:数据稀缺、类别不平衡以及情感表达的跨语言差异。它整合了基于交叉验证的数据驱动模型选择、基于回译的语义保留增强、带有动态epoch级欠采样的归纳迁移学习,以及通过掩码语言建模进行的领域特定知识注入。系统评估了八个多语言嵌入模型,并依据宏平均F1分数选择XLM-RoBERTa作为基础模型。通过GPT-4o-mini回译到其他语言的数据增强实际上使训练语料库增至三倍,同时保留了语义内容和类别分布比例。该框架生成了四个用于评估的最终运行:RUN 1是带有增强和欠采样的归纳迁移学习,RUN 2带有掩码语言建模预训练,RUN 3和RUN 4则是用经ROC分析优化的语言特定决策阈值对先前预测进行修正的结果。语言特定的阈值修正表明,最优决策边界在不同语言之间存在显著差异。这反映了模型置信度分数的分布差异,以及夺回性语言用法上的语言差异。基于阈值的优化在无需重新训练模型的情况下带来了2-5%的绝对F1提升。该方法完全可复现,所有代码和实验设置可在https://github.com/rbg-research/MultiPRIDE-Evalita-2026上找到。

英文摘要

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 uses masked language modeling pre-training, and RUN 3 and RUN 4 are the previous predictions refined with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.

2605.11854 2026-05-19 cs.CL 版本更新

Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

自蒸馏轨迹感知玻尔兹曼建模:弥合扩散语言模型训练-推理差异

Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Yibing Liu, Xinyu Fu, Shi Wu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li

发表机构 * City University of Hong Kong(香港城市大学) Huawei Research(华为研究) The University of Hong Kong(香港大学)

AI总结 本文研究了自蒸馏轨迹能否用于真正的知识获取,而非仅用于加速推理。提出TABOM框架,通过玻尔兹曼建模刻画推理时的解掩码(unmasking)偏好,从而在新领域中提升扩散语言模型的表现并缓解灾难性遗忘。

Comments Project website: https://tonyckc.github.io/TABOM-web/

详情
AI中文摘要

扩散语言模型(DLMs)近来已成为自回归语言模型的一种有前景的替代方案,具有更强的全局感知能力和高度并行的生成能力。然而,使用基于标准负证据下界(NELBO)的监督微调对DLMs进行后训练仍然效率低下:训练过程在单步中重建随机掩码的token,而推理则遵循一条由置信度引导的、多步的由易到难去噪轨迹。最近的基于轨迹的自蒸馏方法主要利用这些推理轨迹来压缩采样步骤并加速,通常只提高解码效率,而没有实质性增强模型的基础能力,甚至可能在完整扩散解码下降低性能。在本文中,我们探讨自蒸馏轨迹能否不仅用于更快的推理,还用于真正的知识获取。虽然这些轨迹位于预训练DLM自身的分布流形上,因此可能带来更低的优化障碍,但我们发现,直接用标准NELBO目标在其上微调只能带来微小的提升。为了解决这一限制,我们提出了基于玻尔兹曼建模的轨迹对齐优化(Trajectory-Aligned optimization via Boltzmann Modeling,TABOM),一种基于自蒸馏轨迹的后训练框架,使训练与推理的由易到难结构对齐。TABOM将推理时的解掩码(unmasking)偏好建模为预测熵上的玻尔兹曼分布,并推导出一个可计算的成对排序目标,使模型的确定性排序与观察到的解码轨迹对齐。实验表明,TABOM在新领域中取得了显著提升,扩展了DLMs的有效知识边界,并且与标准SFT相比显著缓解了灾难性遗忘。

英文摘要

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
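A Boltzmann distribution over predictive entropies can be written as a softmax over negated entropies. The sketch below assumes that lower-entropy (more certain) positions are preferred for unmasking and that a single temperature controls sharpness; both choices, and the function names, are illustrative and not taken from the paper. The pairwise ranking objective is not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def boltzmann_over_entropies(entropies, temperature=1.0):
    """p_i proportional to exp(-H_i / T): lower entropy gives higher unmasking preference."""
    logits = [-h / temperature for h in entropies]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up predictive distributions at three masked positions.
predictions = [
    [0.90, 0.05, 0.05],   # confident position
    [0.40, 0.30, 0.30],   # less certain
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
]
entropies = [entropy(p) for p in predictions]
preference = boltzmann_over_entropies(entropies, temperature=0.5)
```

Lowering the temperature concentrates the preference on the most certain position, which mirrors a strict easy-to-hard decoding order; raising it flattens the distribution.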

2605.08738 2026-05-19 cs.LG cs.AI cs.CL 版本更新

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

SlimQwen: 探索在大规模MoE模型预训练中的剪枝与知识蒸馏

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

发表机构 * Qwen Team, Alibaba Inc.(阿里巴巴通义千问团队) MBZUAI(穆罕默德·本·扎耶德人工智能大学) KAUST(阿卜杜拉国王科技大学)

AI总结 本文研究了在大规模预训练中如何应用剪枝和知识蒸馏技术,探讨了剪枝在初始化方面的优势、专家压缩对最终模型的影响以及训练策略的有效性,最终将Qwen3-Next-80A3B压缩到23A2B模型并保持竞争力。

详情
AI中文摘要

结构化剪枝和知识蒸馏(KD)是压缩大型语言模型的典型技术,但其在预训练规模下的应用仍不清楚,尤其是针对最近的混合专家(MoE)模型。本文系统研究了大规模预训练中的MoE压缩,重点探讨三个关键问题:剪枝是否比从头训练提供更好的初始化;专家压缩选择如何影响继续训练后的最终模型;以及哪种训练策略最有效。我们得出以下发现:首先,在深度、宽度和专家压缩方面,对预训练MoE进行剪枝在相同训练预算下优于从头训练。其次,不同的单次专家压缩方法在大规模持续预训练后收敛到相似的最终性能。受此启发,我们引入了一种简单的部分保留专家合并策略,该策略在大多数基准上提升了下游性能。第三,结合KD与语言建模损失在知识密集型任务上优于仅使用KD。我们进一步提出了多令牌预测(MTP)蒸馏,其效果一致。最后,鉴于相同的训练令牌,渐进式剪枝计划优于单次压缩,表明渐进的架构过渡导致更好的优化轨迹。综合来看,我们将Qwen3-Next-80A3B压缩到23A2B模型,保持了竞争力。这些结果为大规模高效MoE压缩提供了实用指导。

英文摘要

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
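The third finding, combining KD with the language modeling loss, can be sketched for a single token position. The abstract does not state the weighting scheme, so the linear interpolation and the `kd_weight` parameter below are assumptions, and the KL direction (teacher to student) is one common choice rather than the paper's confirmed one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_plus_lm_loss(student_logits, teacher_logits, target, kd_weight=0.5):
    """Interpolate next-token cross-entropy with KL(teacher || student).

    kd_weight is a hypothetical mixing coefficient: 1.0 is KD alone,
    0.0 is the plain language modeling loss.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm_loss = -math.log(p_student[target])
    kd_loss = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return (1.0 - kd_weight) * lm_loss + kd_weight * kd_loss

# Toy three-token vocabulary; the target token has index 0.
loss = kd_plus_lm_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], target=0)
```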

2605.08439 2026-05-19 cs.CL 版本更新

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

语言模型能否识别乳腺癌放射治疗的副作用?

Natalie Seah, Danielle S. Bitterman, Daphna Spiegel, Thomas Hartvigsen

发表机构 * University of Virginia(弗吉尼亚大学) Mass General Brigham Dana-Farber Cancer Institute Harvard Medical School(麻省总医院达纳-法伯癌症研究所哈佛医学院)

AI总结 本研究探讨了语言模型在识别乳腺癌放射治疗副作用中的能力,通过评估多种语言模型在不同提示下的表现,揭示了其在精度、召回率及罕见长期副作用识别上的局限性,并提出了改进方向。

详情
AI中文摘要

准确地向癌症幸存者传达癌症治疗的副作用至关重要,特别是在知情同意等情境中,临床医生必须清晰而全面地传达潜在的治疗毒性。然而,由于对不良治疗反应的临床知识不足以及电子健康记录(EHR)系统之间的碎片化,这一任务仍极具挑战性。大型语言模型(LLMs)有潜力帮助完成此任务,但其在癌症幸存者护理中的可靠性仍不明确。本文提出了一种面向部署的压力测试框架,用于评估LLM生成的乳腺癌治疗和幸存者护理中的放射副作用列表。使用21名乳腺癌患者资料,我们构建了仅在放射治疗方案上不同的配对患者临床场景,以在多种提示模式下评估七种指令微调的LLM。然后将LLM输出与由两名主要学术医疗中心的知情同意文件和超过七名乳腺放射肿瘤学家团队编写的临床医生编纂参考进行比较。该参考将放射剂量分割、照射区域和位置映射到相关的毒性,按频率和时间起始点分解。在不同模型中,我们揭示了对细微文档变化的敏感性、精度与召回率之间的权衡,以及系统性低估罕见和长期副作用的问题。当单独使用时,限制生成的副作用数量会降低精度,而将输出基于临床医生编纂的副作用列表可以显著提高可靠性和稳健性。这些发现突显了LLM在肿瘤学中的重要局限性,并提出了更安全和信息丰富的幸存者护理应用的设计选择。

英文摘要

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

2605.08163 2026-05-19 cs.CV cs.AI cs.CL 版本更新

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

MULTITEXTEDIT:跨语言文本-图像编辑中退化程度的基准测试

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出MULTITEXTEDIT基准测试,通过12种语言、5种视觉领域和7种编辑操作的3600个实例,评估跨语言文本-图像编辑中退化问题,引入语言保真度指标并发现模型在文本准确性和脚本保真度上的显著退化。

Comments 11 pages, 5 figures

详情
AI中文摘要

文本-图像编辑已成为视觉内容创作的关键能力,但现有基准测试大多以英语为中心且常将视觉合理性与语义正确性混为一谈。我们引入MULTITEXTEDIT,一个包含3,600个实例的受控基准测试,涵盖12种语言类型、5种视觉领域和7种编辑操作。每个实例的语言变体共享相同的视觉基础,并配有人工编辑的参考文本和区域掩码,从而隔离语言变量以进行跨语言比较。为捕捉粗粒度文本匹配度指标所遗漏的脚本级错误,如缺失变音符号、RTL顺序颠倒和混合脚本渲染,我们引入了一个由两阶段LVM协议评分的语言保真度(LSF)度量,其与母语者标注员的二次加权κ值达到0.76。评估12个开源和专有系统时,发现所有模型在跨语言退化方面表现显著,最大退化出现在希伯来语和阿拉伯语上,最小退化出现在荷兰语和西班牙语上,且集中在文本准确性和脚本保真度而非粗粒度结构维度上。我们还发现普遍存在的语义和像素不匹配,其中输出保持全局布局和背景保真度,但扭曲了脚本特定的形态。

英文摘要

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted $\kappa$ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.
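For readers unfamiliar with the agreement statistic quoted above, quadratic-weighted kappa can be computed as in the sketch below. This is the standard definition of the statistic, not the benchmark's own evaluation code. The rating scale (integer levels 0 to `num_levels` - 1) and the toy scores are assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic disagreement weights for ordinal ratings."""
    if num_levels < 2 or len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty ratings on at least 2 levels")
    n = len(rater_a)
    # observed[i][j] counts items rated i by rater A and j by rater B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters used one identical level throughout
    return 1.0 - weighted_observed / weighted_expected

# Toy example: metric scores vs. annotator scores on a 4-level scale.
kappa = quadratic_weighted_kappa([0, 1, 2, 3, 3], [0, 1, 3, 3, 2], num_levels=4)
```

Because the weights grow with the squared distance between levels, near-misses on the ordinal scale are penalized far less than large disagreements.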

2605.07111 2026-05-19 cs.CL cs.AI 版本更新

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

超越LoRA与全微调:基于梯度的优化器路由用于大语言模型适应

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tsinghua University(清华大学) Infinigence AI

AI总结 本文提出了一种混合LoRA和全微调(MoLF)框架,通过在优化器层面动态路由更新,实现两种训练模式之间的连续导航,从而提升大语言模型的适应性能。

详情
AI中文摘要

近期关于微调大型语言模型的研究突显出一个根本性的争论。虽然全微调(FFT)提供了高熵知识注入所需的表示可塑性,但低秩适应(LoRA)可以匹配或超越FFT的性能,因为许多任务只需要在低秩空间中进行更新,并且受益于LoRA的额外正则化。通过在多样化的任务(SQL、医学问答和反事实知识)和不同语言模型(Gemma-3-1B、Qwen2.5-1.5B和Qwen2.5-3B)上的实证评估,我们验证了这两种趋势,并展示了仅依赖静态架构在结构上是有限的。为了解决这一挑战,我们提出了混合LoRA和全微调(MoLF)框架,这是一个统一的框架,能够连续导航于两种训练模式之间。MoLF在优化器层面动态地将更新路由到FFT和LoRA之间,以确保在整个训练过程中精确的梯度信号能够传达到两个专家,从而产生稳定的训练动态。对于内存受限的环境,我们还引入了MoLF-Efficient,它冻结了基础权重,并只在可能具有不同秩的一对LoRA专家之间路由更新。我们的评估显示,MoLF在所有设置中要么优于或保持在FFT和LoRA中更好的方法的1.5%以内,而MoLF-Efficient在事实任务上比先前的自适应LoRA方法高出高达20%,在医学和SQL任务上高出9%。我们的代码在https://github.com/11785T23/molf.git上开源。

英文摘要

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL. Our code is open-sourced at https://github.com/11785T23/molf.git.

2605.06506 2026-05-19 cs.CL 版本更新

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

语言模型惊奇度与隐喻新颖性中的频率混淆

Omar Momen, Sina Zarrieß

发表机构 * CRC 1646 – Linguistic Creativity in Communication(CRC 1646 语言创造力在交流中的研究)

AI总结 研究探讨了语言模型惊奇度与隐喻新颖性之间的关系,发现词频比惊奇度更能预测隐喻新颖性,并指出惊奇度与频率之间的关联在训练阶段早期达到峰值,随后下降,暗示最优的语言模型惊奇度设置可能错误地将上下文可预测性与隐喻新颖性和处理难度联系起来,而词频可能是主要影响因素。

Comments to be presented and published at the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

详情
AI中文摘要

语言模型(LM)的惊奇度被广泛用作上下文可预测性的代理,并已报告与隐喻新颖性判断相关联。然而,惊奇度与词频紧密交织。我们通过两种不同的词频测量方法,在隐喻新颖性评分上探索这种交互作用。我们分析了八个Pythia模型大小和154个训练检查点的惊奇度估计。在不同设置中,词频比惊奇度更能预测隐喻新颖性。在训练阶段中,惊奇度与新颖性关联在早期阶段达到峰值,随后再次下降,这与惊奇度与频率关联的同步增加相呼应。这些结果表明,通常报告的最优LM惊奇度设置可能错误地将上下文可预测性与隐喻新颖性和处理难度联系起来,而词频可能是主要的潜在因素。

英文摘要

Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal--novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal--frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
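As a reminder of the quantities being compared above, surprisal is the negative log-probability a model assigns to a word in context, and the study relates it to frequency and novelty ratings through association measures. The sketch below uses base-2 logs and Pearson correlation as assumed choices, and the numbers are made up for illustration.

```python
import math

def surprisal_bits(prob):
    """Surprisal, in bits, of a word the model assigns probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical in-context probabilities and corpus log-frequencies for four words.
word_probs = [0.20, 0.05, 0.01, 0.002]
log_freqs = [5.1, 4.0, 2.7, 1.9]
surprisals = [surprisal_bits(p) for p in word_probs]
r = pearson(surprisals, log_freqs)
```

A strongly negative `r` on real data would reflect the confound the abstract describes: rarer words tend to be more surprising regardless of context.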

2605.00505 2026-05-19 cs.IR cs.AI cs.CL 版本更新

LLM-Oriented Information Retrieval: A Denoising-First Perspective

面向大语言模型的信息检索:一种去噪优先的视角

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang, Hao Liu, Hui Xiong

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出了一种以去噪为核心的信息检索方法,强调在信息检索全流程中,最大化可利用证据密度和可验证性是关键瓶颈,通过四个阶段框架和信号-噪声优化技术分类,探讨了信息检索中的挑战和解决方案。

Comments SIGIR 2026

详情
AI中文摘要

现代信息检索(IR)不再主要由人类消费,而是越来越多地通过检索增强生成(RAG)和代理搜索由大型语言模型(LLMs)使用。与人类用户不同,LLMs受制于有限的注意力预算,并且对噪声具有独特脆弱性;误导或不相关信息不再只是麻烦,而是导致幻觉和推理失败的直接原因。在本文的视角论文中,我们主张在信息访问全管道中,去噪——即在上下文窗口内最大化可利用证据密度和可验证性——已成为主要瓶颈。我们通过信息检索挑战的四阶段框架来概念化这一范式转变:从不可访问到不可发现,再到不一致,最后到不可验证。此外,我们提供了一个按流程组织的信号-噪声优化技术分类,涵盖索引、检索、上下文工程、验证和代理工作流程。我们还展示了在依赖检索的领域如终身助手、编码代理、深度研究和多模态理解中信息去噪的研究工作。

英文摘要

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising, i.e., maximizing usable evidence density and verifiability within a context window, is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflows. We also survey research on information denoising in domains that rely heavily on retrieval, such as lifelong assistants, coding agents, deep research, and multimodal understanding.

2604.25525 2026-05-19 cs.CL cs.HC 版本更新

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

从聊天机器人到知己:一项跨文化的LLM用于情感支持的使用研究

Natalia Amat-Lefort, Mert Yazan, Amanda Cercas Curry, Flor Miriam Plaza-del-Arco

发表机构 * Leiden University(莱顿大学) Hogeschool van Amsterdam(阿姆斯特丹应用科技大学) Independent Researcher(独立研究员)

AI总结 本研究探讨了不同国家用户对LLM用于情感支持的接受度及影响因素,通过大规模跨文化调查发现社会经济地位是关键预测因素,并揭示了多语言提示语中用户主要寻求帮助的领域。

Comments 28 pages (9 pages main text, 19 pages references and appendices), 14 figures. The first two authors contributed equally

详情
AI中文摘要

大型语言模型(LLMs)不仅被用于执行任务,还作为全天候、非评判性的知己提供情感支持。然而,驱动采用的因素以及用户在不同国家对情感支持交互的感知仍不清楚。为填补这一空白,我们进行了首次大规模跨文化研究,调查了来自七个国家(美国、英国、德国、法国、西班牙、意大利和荷兰)的4641名参与者。我们的结果显示,不同国家的采用率差异显著(从20%到59%)。使用混合模型分离文化影响与人口统计特征,我们发现:25-44岁、有宗教信仰、已婚以及社会经济地位较高的人群更倾向于信任、使用和认为有好处。英语国家比大陆欧洲国家显示出更积极的感知。我们进一步收集了731个真实的多语言提示语,显示用户主要寻求帮助解决孤独、压力、关系冲突和心理健康问题。我们的发现表明,LLM的情感支持使用受复杂的社会技术景观影响,并呼吁更广泛的研究来探讨如何开发、部署和管理这些系统以确保安全和知情的访问。

英文摘要

Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remain unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that being aged 25-44, religious, married, and of higher socioeconomic status predicts positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest predictor. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.

2604.23267 2026-05-19 cs.CL cs.LG 版本更新

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

在大型语言模型中微调与上下文学习:从形式语言学习的角度

Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Boston University(波士顿大学)

AI总结 本文从形式语言学习的角度比较了大型语言模型中的微调与上下文学习,通过设计精确的语言边界、受控字符串采样和无数据污染的任务,发现微调在分布内泛化上优于上下文学习,而两者在分布外泛化上表现相当,且两者在不同熟练度水平上的归纳偏置也有所不同。

Comments Accepted at ACL 2026 (Main)

详情
AI中文摘要

大型语言模型(LLMs)在两种基本的学习模式中运作——微调(FT)和上下文学习(ICL),这引发了关于哪种模式产生更大的语言能力以及它们是否在归纳偏置上有所不同的关键问题。先前比较FT和ICL的研究由于实验设置不一致而得出混杂和不明确的结果。为了实现严格比较,我们提出了一项形式语言学习任务——提供精确的语言边界、受控字符串采样和无数据污染,并引入一种判别测试来评估语言能力,其中LLM成功当且仅当它将更高生成概率分配给语言字符串而不是非语言字符串。经验上,我们发现:(a)FT在分布内泛化上比ICL更具语言能力,但两者在分布外泛化上表现相当。(b)它们的归纳偏置,通过字符串生成概率的相关性来衡量,当两种模式部分学习语言时相似,但在更高熟练度水平上分化。(c)与FT不同,ICL的表现在不同大小和家族的模型之间差异显著,并且对语言的token词汇表敏感。因此,我们的工作展示了形式语言作为评估LLM的受控测试床的潜力,这些行为在自然语言数据集中难以隔离。我们的源代码可在https://github.com/bishwamittra/formallm上获得。

英文摘要

Large language models (LLMs) operate in two fundamental learning modes, fine-tuning (FT) and in-context learning (ICL), raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task, which offers precise language boundaries, controlled string sampling, and no data contamination, and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLM behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
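The discriminative test described above can be sketched as a pairwise comparison of sequence log-probabilities. How the paper pairs in-language and out-of-language strings is not stated in the abstract, so scoring every cross pair is an assumption here, and the log-probabilities are placeholder numbers rather than model outputs.

```python
def discriminative_accuracy(in_language_logprobs, out_of_language_logprobs):
    """Fraction of (in, out) pairs where the in-language string scores strictly higher."""
    pairs = [
        (inside, outside)
        for inside in in_language_logprobs
        for outside in out_of_language_logprobs
    ]
    if not pairs:
        raise ValueError("need at least one string on each side")
    wins = sum(1 for inside, outside in pairs if inside > outside)
    return wins / len(pairs)

# Placeholder sequence log-probabilities (hypothetical values).
in_lang = [-1.0, -2.0]
out_lang = [-3.0, -1.5]
accuracy = discriminative_accuracy(in_lang, out_lang)
```

An accuracy of 1.0 means the model separates the language perfectly under this test; 0.5 is what an uninformative scorer would achieve on average.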

2604.22626 2026-05-19 cs.CL 版本更新

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

从字形依赖到词法结构:从马尔可夫视角看但丁的神曲

Angelo Maria Sabatini

发表机构 * The BioRobotics Institute(生物机器人研究所) Scuola Superiore Sant’Anna(圣安娜高等学校)

AI总结 本文通过基于元音-辅音编码的符号表示,研究但丁神曲的结构组织,发现从地狱到天堂字形记忆指数逐渐增加,表明局部依赖结构发生方向性变化,同时通过三元组分析识别出词法环境中的重复配置,并揭示局部符号依赖与词法结构之间的联系。

Comments 26 pages, 8 figures, 1 supplementary material; submitted to Journal of Computational Literary Studies

详情
AI中文摘要

本研究通过基于元音-辅音编码的符号表示,探讨但丁神曲的结构组织。将所得序列建模为四状态马尔可夫链,得到一个简洁的字形记忆指数,捕捉局部持续性和交替模式。在整部诗中,该指数从地狱到天堂略有但一致增加,表明局部依赖结构发生方向性变化。三元组分析识别出一组受限的重复配置,作为字形探针,将马尔可夫模式与词法环境及正字法现象如撇号形式联系起来。互补的分类分析识别出特定于歌的词法锚点,显示局部符号依赖既反映了三首歌之间的分离,又在整部诗中呈现出连续进展。结果提供了一个可解释的框架,将局部符号结构与高层文本组织联系起来。

英文摘要

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing local persistence and alternation patterns. Across the poem, this index shows a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram analysis identifies a restricted set of recurrent configurations acting as graphemic probes, linking Markov patterns to lexical environments and orthographic phenomena such as apostrophised forms. A complementary classification analysis identifies cantica-specific lexical anchors, showing that local symbolic dependencies reflect both the separation among the three cantiche and a continuous progression across the poem. The results provide an interpretable framework connecting local symbolic structure with higher-level textual organisation.
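The encoding and the four-state chain described above can be sketched as follows. The abstract does not define the four states. This sketch assumes they are the pairs of consecutive symbols (VV, VC, CV, CC), which is one natural reading, and the vowel set and function names are likewise assumptions.

```python
from collections import Counter

# Assumed vowel inventory for Italian text, including accented forms.
VOWELS = set("aeiouàèéìíòóùú")

def vc_encode(text):
    """Map each letter to 'V' or 'C'; spaces, digits, and punctuation are dropped."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()
    )

def pair_transition_probs(symbols):
    """Transition probabilities between overlapping symbol pairs (the assumed four states)."""
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    transitions = list(zip(states, states[1:]))
    counts = Counter(transitions)
    totals = Counter(source for source, _ in transitions)
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}

encoded = vc_encode("Nel mezzo del cammin di nostra vita")
probs = pair_transition_probs(encoded)
```

A memory index such as the one the abstract mentions would then be derived from how far these transition probabilities depart from a memoryless baseline; the abstract does not give that formula, so it is not reproduced here.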

2604.22282 2026-05-19 cs.CL 版本更新

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

STEM: 用于知识图谱驱动检索增强生成的结构追踪证据挖掘

Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

发表机构 * AI Product Center, Kingsoft Corporation(金山办公人工智能产品中心)

AI总结 本文提出STEM框架,通过将多跳推理重新定义为以模式引导的图搜索任务,解决知识图谱结构异质性和现有推理路径检索方法缺乏全局结构视角的问题,从而提升多跳推理的准确性和证据完整性。

Comments 34 pages, 16 figures, accepted to ACL 2026 (Main Conference, Oral Presentation)

详情
AI中文摘要

基于知识图谱的问题回答(KGQA)在复杂推理任务中起着关键作用,但仍然受到两个持续存在的挑战的限制:知识图谱(KGs)的结构异质性常常导致检索过程中的语义不匹配,而现有的推理路径检索方法缺乏全局结构视角。为了解决这些问题,我们提出了结构追踪证据挖掘(STEM),一种新颖的框架,将多跳推理重新定义为以模式引导的图搜索任务。首先,我们设计了一个语义到结构的投影流水线,利用KG结构先验来将查询分解为原子关系断言并构建一个自适应的查询模式图。随后,我们执行全局感知的节点锚定和子图检索以获得最终的证据推理图。为了更有效地在图构建过程中整合全局结构信息,我们设计了三元组依赖图神经网络(Triple-GNN)以生成一个全局指导子图(指导图)以引导构建。STEM显著提高了多跳推理图检索的准确性和证据完整性,并在多个多跳基准上实现了最先进的性能。

英文摘要

Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To integrate global structural information more effectively during graph construction, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.

2604.17487 2026-05-19 cs.CL 版本更新

Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

仅在必要时精确回答:用于代理系统的校准断言特异性控制

Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin

AI总结 本文研究了代理系统因过于精确而失败的问题,提出了一种称为组合选择性特异性(CSS)的方法,通过分解回答为断言、提出更粗略的退化方案,并在最合适的校准级别发出每个断言,从而提高风险-效用权衡。

Comments Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

详情
AI中文摘要

代理系统往往不是因为完全错误,而是因为过于精确而失败:一个回答可能总体有用,但特定断言超出了证据支持的范围。我们研究这种失败模式为过度承诺控制,并引入组合选择性特异性(CSS),一种后生成层,将回答分解为断言,提出更粗略的退化方案,并在最具体的校准级别发出每个断言。该方法旨在将不确定性表达为局部语义退化,而不是整个回答的拒绝。在完整的LongFact运行和HotpotQA试点中,校准的CSS提高了固定草稿的风险-效用权衡。在完整的LongFact运行中,相对于无CSS输出,它将过度承诺意识效用从0.846提升到0.913,同时实现0.938的特异性保留。这些结果表明,断言层面的特异性控制是代理系统有用的不确定性接口,并且是未来无分布有效性层的目标。

英文摘要

Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.
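The backoff step described above can be sketched as picking, for each claim, the most specific phrasing whose calibrated confidence clears a threshold. The data layout, the threshold, and the confidence numbers are all hypothetical; how CSS actually decomposes answers and calibrates confidence is not specified in the abstract.

```python
def select_specificity(ladder, threshold):
    """Return the first (most specific) variant whose confidence meets the threshold.

    `ladder` is ordered from most specific to coarsest, as (text, confidence) pairs.
    Returns None when even the coarsest variant is not admissible.
    """
    for text, confidence in ladder:
        if confidence >= threshold:
            return text
    return None

# Hypothetical backoff ladder for one claim, with made-up calibrated confidences.
birth_claim = [
    ("She was born on 3 May 1970.", 0.40),
    ("She was born in 1970.", 0.80),
    ("She was born in the 1970s.", 0.95),
]
emitted = select_specificity(birth_claim, threshold=0.75)
```

Applying this per claim is what lets the answer degrade locally: one uncertain date is coarsened while the rest of the response keeps its detail.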

2604.04932 2026-05-19 cs.CL 版本更新

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

超越最终作者:为细粒度LLM生成文本检测建模创作者与编辑的双重角色

Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao

发表机构 * ict.ac.cn(中国科学院)

AI总结 本文提出RACE方法,通过建模创作者和编辑的双重角色,实现细粒度LLM生成文本检测,以更精确地区分不同类型的文本,从而为LLM监管提供政策对齐的解决方案。

Comments ACL 2026 (Oral)

详情
AI中文摘要

大型语言模型(LLM)的滥用需要精确检测合成文本。现有工作主要遵循二元或三元分类设置,只能区分纯人类/LLM文本或协作文本。这在细致的监管中仍显不足,因为LLM润色的人类文本和人类化的LLM文本往往触发不同的政策后果。在本文中,我们探索了在严格四类设置下细粒度LLM生成文本检测。为处理这些复杂性,我们提出了RACE(Rhetorical Analysis for Creator-Editor Modeling),一种细粒度检测方法,该方法刻画了创作者和编辑的各自特征。具体而言,RACE利用修辞结构理论(RST)构建创作者的逻辑图,同时提取基本话语单元(EDU)级别的特征以捕捉编辑的风格。实验表明,RACE在识别细粒度类型时优于12个基线方法,具有较低的误报率,为LLM监管提供了一种政策对齐的解决方案。

英文摘要

The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can at best distinguish pure human text, pure LLM text, and collaborative text. This remains insufficient for nuanced regulation, as LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory (RST) to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit (EDU)-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.

2603.25723 2026-05-19 cs.CL cs.AI 版本更新

Natural-Language Agent Harnesses

自然语言代理Harness

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文提出自然语言代理Harness(NLAH)作为一种可执行的自然语言对象,用于描述任务运行的Harness策略,并引入Intelligent Harness Runtime(IHR)作为共享运行时,能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。实验表明,NLAH在编码、终端使用和计算机使用基准测试中表现与代码和提示实现相当,同时暴露了更短的静态Harness策略。

Comments revise paper

详情
AI中文摘要

代理性能受到周围Harness的强烈影响:围绕模型组织任务运行的外部执行系统。然而,这种逻辑通常隐藏在紧密耦合的控制器代码中,使得Harness难以检查、比较、转移和消解。本文探讨是否可以将代理Harness的可重用设计模式表示为可执行的自然语言对象。我们引入自然语言代理Harness(NLAH),即可编辑的文档,用于描述运行级别的Harness策略,并引入Intelligent Harness Runtime(IHR),一个共享运行时,能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。在编码、终端使用和计算机使用基准测试中,IHR执行的NLAH实现了与代码和提示实现相当的任务结果,同时暴露了更短的静态Harness策略。模块消解进一步表明,显式的Harness模块是可分析的。这些结果表明,代理Harness可以从模型周围的偶然粘合物转变为科学表示对象。

英文摘要

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.

2603.22056 2026-05-19 cs.CL 版本更新

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

双空间知识蒸馏与关键-查询匹配用于具有词汇不匹配的大型语言模型

Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill

发表机构 * University of Cambridge, Department of Engineering(剑桥大学工程系) Toshiba Europe Limited(东芝欧洲有限公司)

AI总结 本文研究了针对具有词汇不匹配的大型语言模型的双空间知识蒸馏与关键-查询匹配方法,通过分析注意力机制揭示其优缺点,并提出基于生成对抗学习的新方法以解决关键-查询分布不匹配问题。

Comments Copyright 2026 IEEE. Published in ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), scheduled for 4-8 May 2026 in Barcelona, Spain

详情
AI中文摘要

大型语言模型(LLMs)在语言任务上实现了最先进的(SOTA)性能,但因其规模和资源需求而昂贵。知识蒸馏(KD)通过训练较小的学生模型模仿较大的教师模型来解决这一问题,从而在不显著损失性能的情况下提高效率。双空间知识蒸馏与跨模型注意力(DSKD-CMA)已成为在具有不同分词器的LLM之间进行KD的SOTA方法,但其内部机制仍然大多不透明。在本文中,我们通过手动标记对齐探测和热图可视化系统地分析DSKD-CMA的注意力机制,揭示其优缺点。在此基础上,我们引入了一种基于生成对抗(GA)学习的新方法DSKD-CMA-GA,以解决由不同模型计算出的关键-查询分布不匹配问题。实验显示在文本生成质量上获得了适度但一致的ROUGE-L提升,特别是在分布外数据上(平均+0.37),缩小了跨分词器KD与同分词器KD之间的差距。

英文摘要

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

2603.03328 2026-05-19 cs.CL cs.AI 版本更新

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

StructLens:通过最大生成树实现语言模型的结构镜像

Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 本文提出StructLens框架,通过最大生成树分析语言模型的表示结构,揭示模型在不同层和训练阶段中如何组织token表示。

详情
AI中文摘要

语言具有内在结构,这一特性解释了语言习得和语言变化。鉴于此特性,我们预期语言模型也会表现出自身的内部结构。尽管可解释性研究已经探讨了模型如何通过注意力模式和稀疏自编码器计算表示,但所得到的表示的组织方式却被忽视。为解决这一差距,我们引入StructLens,一个通过整体结构视角分析表示的框架。StructLens基于残差流中的语义表示构建最大生成树,受依赖解析中树表示的启发,并在表示空间中提供token关系的摘要。我们分析了连续token在表示空间中也彼此接近,并发现中间层显示出最强的局部跨度组织。此外,对预训练检查点的分析表明,较小的局部单元在预训练早期变得可检测,而较大的单元则在后期才变得可检测。我们的发现表明,StructLens提供了关于模型在不同层和训练过程中如何组织token表示的见解。我们的代码可在https://github.com/naist-nlp/structlens获取。

英文摘要

Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest their own internal structures as well. While interpretability research has investigated how models compute representations mechanistically through attention patterns and Sparse AutoEncoders, the organization of the resulting representations is overlooked. To address this gap, we introduce StructLens, a framework to analyze representations through a holistic structural view. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, inspired by tree representation in dependency parsing, and provides summaries of token relationships in representation space. We analyze how contiguous tokens are also nearby in representation space and find that middle layers show the strongest local-span organization. Moreover, analysis of pre-training checkpoints reveals that smaller local units become detectable earlier in pre-training, and larger units later. Our findings demonstrate that StructLens provides insights into how models organize token representations across layers and training. Our code is available at https://github.com/naist-nlp/structlens.
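The tree construction described above can be sketched with Kruskal's algorithm over pairwise similarities between token vectors. Using cosine similarity as the edge weight is an assumption, since the abstract only says the trees are built from residual-stream representations, and the toy two-dimensional vectors stand in for real hidden states.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maximum_spanning_tree(vectors):
    """Kruskal's algorithm on edges sorted by descending similarity."""
    n = len(vectors)
    edges = sorted(
        ((cosine(vectors[i], vectors[j]), i, j) for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # keep the edge only if it joins two components
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

# Toy "token representations": tokens 0 and 1 are close, token 2 is far from both.
tree = maximum_spanning_tree([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Counting how many tree edges link adjacent token positions is one simple way to read off the local-span organization the abstract refers to.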

2603.03308 2026-05-19 cs.CL cs.AI 版本更新

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

旧习惯难改:对话历史如何几何学地困住大语言模型

Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen

发表机构 * Technion - Israel Institute of Technology(以色列理工学院) University of Oxford(牛津大学) University of Zagreb, FER(萨格勒布大学电气工程与计算学院) Kempner Institute, Harvard University(哈佛大学Kempner研究所) University of Edinburgh(爱丁堡大学)

AI总结 研究探讨对话历史如何通过几何陷阱影响大语言模型的后续表现,提出History-Echoes框架从概率和几何两个角度分析对话历史偏差,并揭示行为持续性在潜在空间中的几何陷阱。

Comments Accepted to ICML 2026

详情
AI中文摘要

大语言模型(LLMs)的对话历史如何影响其未来表现?近期研究表明,LLMs受对话历史影响的方式出人意料。例如,先前交互中的幻觉可能影响后续模型响应。在本工作中,我们引入History-Echoes框架,研究对话历史如何偏移后续生成。该框架从两个角度探索这种偏差:概率上,我们将对话建模为马尔可夫链以量化状态一致性;几何上,我们测量连续隐藏表示的一致性。在三个模型家族和六个涵盖多样化现象的数据集上,我们的分析揭示了两种视角之间的强相关性。通过连接这些视角,我们证明行为持续性表现为几何陷阱,即潜在空间中的间隙会限制模型轨迹。代码可在https://github.com/technion-cs-nlp/OldHabitsDieHard获取。

英文摘要

How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion-cs-nlp/OldHabitsDieHard.
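The two perspectives named in this abstract can be sketched in a few lines. This is a minimal illustration, not the History-Echoes implementation: the integer state labels, the function names, and the use of cosine similarity as the geometric consistency measure are assumptions for this example.

```python
import numpy as np

def transition_matrix(conversations, n_states):
    """Estimate Markov transition probabilities from per-turn state labels.

    conversations: list of lists of integer states, one list per dialogue
    (for instance 0 = faithful turn, 1 = hallucinated turn; labels are assumed).
    """
    counts = np.zeros((n_states, n_states))
    for states in conversations:
        for a, b in zip(states[:-1], states[1:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def consecutive_cosine(hidden):
    """Cosine similarity between hidden representations of consecutive turns.

    hidden has shape (n_turns, hidden_dim); returns n_turns - 1 values.
    """
    unit = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    return np.sum(unit[:-1] * unit[1:], axis=1)
```

Large diagonal entries in the transition matrix indicate that a state tends to persist from one turn to the next, which is the probabilistic counterpart of high consecutive similarity in representation space.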

2602.21265 2026-05-19 cs.CL cs.LG cs.SE 版本更新

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

ToolMATH: 一种用于在系统性工具目录约束下评估长周期工具使用的诊断基准

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出ToolMATH,一种基于数学的诊断基准,用于评估在可控工具目录条件下长周期工具使用的性能,通过将分步MATH解决方案转换为可重用的Python工具,并配对需要顺序工具使用、中间输出重用和逻辑连接工具调用链的问题,从而评估模型在不同工具目录条件下的适应性、鲁棒性和工具连接性。

Comments Submitted to NeurIPS Evaluation & Dataset Track

详情
AI中文摘要

我们介绍了ToolMATH,一种以数学为基础的诊断基准,用于评估可控工具目录条件下的长周期工具使用。ToolMATH将分步的MATH解答转换为带有自然语言描述和类型化模式的可重用Python工具,并为每个问题配备一个工具环境,该环境要求顺序使用工具、重用中间输出以及构建逻辑相连的工具调用链。ToolMATH通过构建黄金工具以及与黄金工具相似度各异的分级干扰项,来控制工具可用性和目录难度。ToolMATH还引入了以行为为条件的度量指标,使诊断评估不再局限于最终准确率。基于这些测量,ToolMATH强调三个评估轴:(1)适应性衡量当黄金工具被完全替换为干扰项时,仅使用黄金工具时的成功率能保留多少;(2)鲁棒性衡量在加入干扰项作为噪声时的稳定性;(3)工具连接性衡量模型能否在较长的已执行工具调用链上保持准确性。此外,轨迹级失败分析刻画了模型在每种工具目录条件下如何失败。这些诊断揭示了不同的模型特征:可靠的工具使用、工具回避、适应性替代以及不可靠工具目录的影响。总体而言,ToolMATH提供了一个受控的测试平台,用于评估语言模型如何适应变化的工具可用性、保持对干扰项的鲁棒性,并在长周期工具使用轨迹中保持正确性。

英文摘要

We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness measures stability when distractors are added as noise; and (3) Tool Connectivity measures whether models preserve accuracy over long executed tool-call chains. Furthermore, trace-level failure analyses characterize how models fail under each tool-catalog condition. Together, these diagnostics reveal distinct model profiles: reliable tool use, tool avoidance, adaptive substitution, and impacts of unreliable tool catalogs. Overall, ToolMATH provides a controlled testbed for evaluating how language models adapt to changing tool availability, remain robust to distractors, and maintain correctness across long-horizon tool-use trajectories.

2602.20042 2026-05-19 cs.CL 版本更新

AI Alignment Breaks at the Edge

AI对齐在边缘处破裂

Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) Emory University(埃默里大学)

AI总结 本文探讨了AI对齐在边缘案例中的失效问题,提出“边缘对齐”这一检测、评估与治理议程,主张让价值冲突、多方利益相关者分歧和认知模糊性下的失败变得可见且可处理。

Comments 38 pages, 6 figures

详情
AI中文摘要

通用对齐已经提高了平均情况下的有用性和安全性,但当前的对齐实践仍然奖励自信的单轮响应。问题不仅在于模型在边缘案例中失败,还在于当前的评估使许多此类失败难以察觉。我们的立场是,对齐必须超越平均情况的评估,使价值冲突、多方利益相关者分歧和认知模糊性下的失败变得可见且可处理。标量奖励将多样化的价值观压缩成一个数字;数据和评估机制会压缩、过滤或无法引出对齐最困难的案例;治理往往缺乏裁定争议案例的机制。这些盲点导致了价值扁平化、表征损失和不确定性盲区。我们用“边缘对齐”来命名一项检测、评估和治理议程,用以揭示这些失败并将其与适当的干预措施联系起来。边缘对齐并非单一的训练目标,而是界定了标准对齐在何种条件下应让位于能够保持多维价值结构、代表多元观点并支持不确定性感知交互的机制。一个由91个边缘案例和四个当代模型构成的试点诊断集表明,常规的有用性和安全性指标可能遗漏边缘感知评估所暴露的过程性失败。我们概述了可操作的边缘信号、过程感知的评估标准,以及一个三阶段的过程栈,将对齐重新定义为动态规范治理的生命周期问题。

英文摘要

General Alignment has improved average-case helpfulness and safety, but current alignment practice still rewards confident, single-turn responses. The problem is not only that models fail on edge cases; it is that current evaluation makes many of these failures hard to see. We take the position that alignment must move beyond average-case evaluation by making failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity visible and actionable. Scalar rewards compress diverse values into a single number; data and evaluation regimes collapse, filter, or fail to elicit the cases where alignment is hardest; and governance often lacks mechanisms for adjudicating contested cases. These blind spots produce value flattening, representation loss, and uncertainty blindness. We use Edge alignment to name a detection, evaluation, and governance agenda for surfacing these failures and connecting them to appropriate interventions. Rather than a single training objective, Edge alignment defines the conditions under which standard alignment should yield to mechanisms that preserve multidimensional value structure, represent plural perspectives, and support uncertainty-aware interaction. A pilot diagnostic set of 91 edge cases and four contemporary models illustrates that ordinary helpfulness and safety readings can miss process failures that edge-aware evaluation exposes. We outline operational edge signals, process-aware evaluation criteria, and a three-phase process stack that reframes alignment as a lifecycle problem of dynamic normative governance.

2602.12871 2026-05-19 cs.CL 版本更新

MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

MentalBench: 一个基于 DSM 的基准,用于评估大语言模型的精神科诊断能力

Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

发表机构 * KAIST(韩国科学技术院) Sungkyunkwan University(成均馆大学) Dongguk University Medical Center(东国大学医疗中心) Seoultech(首尔科学技术大学) Samsung Medical Center(三星医疗中心)

AI总结 本文提出 MentalBench,一个用于评估大语言模型在不同临床模糊程度下能否做出基于 DSM 的精神科诊断决策的基准。该基准基于由精神科医生构建并验证的知识图谱,生成了 24,750 个合成临床案例,系统地变化信息完整性和诊断复杂性,从而实现基于 DSM 的评估。实验表明,尽管最先进的 LLM 在无噪声查询上表现良好,但在区分症状重叠的诊断时难以校准其置信度。

详情
AI中文摘要

大型语言模型 (LLMs) 作为精神科评估和临床决策支持的辅助工具,已吸引越来越多的关注。然而,现有的心理健康基准大多依赖于社交媒体数据或支持性对话设置,限制了它们评估模型能否应用正式诊断标准和鉴别诊断规则的能力。在本文中,我们介绍了 MentalBench,一个用于评估 LLM 能否在不同程度的临床模糊性下做出基于 DSM 的精神科诊断决策的基准。MentalBench 的核心是 MentalKG,一个由精神科医生构建并验证的知识图谱,编码了 23 种精神障碍的 DSM-5 诊断标准和鉴别诊断规则。以 MentalKG 作为专家整理的逻辑骨架,我们生成了 24,750 个合成临床案例,这些案例在信息完整性和诊断复杂性方面系统地变化,从而实现基于 DSM 的评估。我们的实验表明,尽管最先进的 LLM 在考察 DSM-5 知识的无噪声查询上表现良好,但在区分症状重叠的障碍时难以校准其置信度。这些发现引发了对 LLM 作为精神科决策支持工具可靠性的担忧,并突显了需要更多能反映现实世界精神科诊断中多样化挑战的评估。

英文摘要

Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling DSM-grounded evaluation. Our experiments show that although state-of-the-art LLMs perform well on noise-free queries that probe DSM-5 knowledge, they struggle to calibrate their confidence when distinguishing between disorders with overlapping symptoms. These findings raise concerns about the reliability of LLMs as psychiatric decision-support tools and highlight the need for more evaluation that reflects the diverse challenges in real-world psychiatric diagnosis.

2602.11699 2026-05-19 cs.CL 版本更新

Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

借助生成的上下文在无意义中寻找意义:人类与语言模型的视角

Katrina Olsen, Sebastian Padó

发表机构 * Grid Dynamics IMS University of Stuttgart(斯图加特大学)

AI总结 本文通过人类和语言模型对五个语义偏差数据集中的句子进行评估,探讨了如何区分异常句子和无意义句子,并发现语言模型在生成合理上下文方面表现出色。

Comments Accepted for publication at STARSEM 2026, San Diego, CA

详情
AI中文摘要

无意义和异常的句子在计算语义解释模型的发展中起到了关键作用。一个核心挑战是区分仅仅是异常(但可以在上下文中解释)和真正无意义的内容。然而,不清楚(a)现有数据集中的无意义程度,以及(b)LLMs能否做出这种区分。在本文中,我们通过收集人类评估者和LLMs对五种语义偏差数据集中的句子(包括无上下文和有上下文的情况)的可理解性判断来回答这两个问题。我们发现,评估者认为大多数句子仅是异常,只有少数被认为是真正的无意义。我们还显示,LLMs在为异常情况生成合理的上下文方面具有显著的能力。

英文摘要

Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.

2602.03352 2026-05-19 cs.CL 版本更新

PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

PEGRL: 通过后编辑引导的强化学习改进机器翻译

Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(新型软件技术国家重点实验室,南京大学) China Mobile Research Beijing, China(北京中国移动研究院)

AI总结 本文提出PEGRL框架,通过后编辑作为辅助任务稳定训练并引导整体优化,提升机器翻译性能。

详情
AI中文摘要

强化学习(RL)在基于大语言模型(LLM)的机器翻译中展现出强劲的潜力,GRPO等近期方法已取得显著进展;然而,面向翻译的RL仍面临两方面挑战:蒙特卡洛回报估计带来的噪声学习信号,以及庞大的轨迹空间,后者更有利于全局探索而非细粒度的局部优化。我们引入PEGRL,一种两阶段的RL框架,利用后编辑作为辅助任务来稳定训练并引导整体优化。在每次迭代中,对翻译输出进行采样以构建后编辑输入,使后编辑阶段的回报估计能够受益于以当前翻译行为为条件,同时兼顾全局探索和细粒度的局部优化。任务特定的加权方案进一步平衡翻译目标和后编辑目标的贡献,从而得到一个有偏但样本效率更高的估计器。在英语→芬兰语、英语→土耳其语以及英语↔中文上的实验表明,PEGRL相比RL基线取得了一致的提升;对于英语→土耳其语,其在COMET-KIWI上的性能与先进的基于LLM的系统(DeepSeek-V3.2)相当。我们的代码和一组有代表性的预训练模型已公开在https://github.com/NJUNLP/peg-rl和https://huggingface.co/collections/DGME/pegrl。

英文摘要

Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines, and for English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2). Our code and a set of representative pretrained models are publicly available at https://github.com/NJUNLP/peg-rl and https://huggingface.co/collections/DGME/pegrl.

2601.22297 2026-05-19 cs.CL 版本更新

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

从自我辩论中学习:为多智能体辩论准备推理模型

Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出SDRL框架,通过自我辩论训练使模型具备独立问题解决能力和多智能体辩论中的多样化推理处理能力,实验表明其在多辩论协议和智能体配置下均能提升多智能体辩论性能和单模型推理能力。

详情
AI中文摘要

大型语言模型(LLM)的推理能力已通过可验证奖励的强化学习(RLVR)显著提升。在测试阶段,通过多智能体辩论(MAD)进行协作推理已成为提升LLM性能的有希望方法。然而,当前RLVR方法通常训练LLM独立解决问题,而没有明确准备它们在辩论中综合和受益于不同推理路径。在本文中,我们提出了自我辩论强化学习(SDRL),一种训练框架,其中模型从自我辩论中学习,使单个LLM具备强大的独立问题解决能力和处理MAD中多样化推理轨迹的能力。给定提示后,SDRL首先采样多个候选解决方案,然后构建具有多样化推理路径的辩论环境,并生成基于此环境的第二轮响应。最后,SDRL联合优化初始和辩论条件响应,产生一个既能作为独立求解器又能作为辩论参与者有效的模型。在多个基础模型和推理基准上的实验表明,SDRL在多种辩论协议和智能体配置下均能提升MAD性能,同时增强单模型推理能力。

英文摘要

The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework where models learn from self-debate, equipping a single LLM with both strong standalone problem-solving ability and the capability to process diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL consistently improves MAD performance across diverse debate protocols and agent configurations, while simultaneously strengthening single-model reasoning.

2601.21841 2026-05-19 cs.CL 版本更新

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

通过大型语言模型的图引导动作生成进行具身任务规划

Xiang Li, Ning Yan, Masood Mortazavi

发表机构 * Purdue University(普渡大学) Futurewei Technologies(未来科技)

AI总结 本文提出GiG框架,通过图神经网络编码环境状态并构建动作连接执行轨迹图,结合有限前瞻性模块提升具身代理的规划能力,在三个具身规划基准测试中取得显著性能提升。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在零样本推理能力方面表现出色,但将其作为具身代理部署仍面临长周期规划的根本挑战。与开放式文本生成不同,具身代理必须将高层意图分解为可操作的子目标,同时遵守动态环境的约束。标准LLM规划器常常因上下文窗口限制而难以在长周期内保持策略连贯,或者幻觉出违反环境约束的状态转换。我们提出GiG,一种通过图中图(Graph-in-Graph)架构来结构化具身代理记忆的规划框架。我们的方法利用图神经网络(GNN)将环境状态编码为嵌入,并在经验记忆库中将这些嵌入组织成由动作连接的执行轨迹图。GiG能够检索结构相似的先例,使代理能够基于相关的过往结构模式做出当前决策。此外,我们引入了一个有界前瞻模块,利用符号转换逻辑,通过有依据的动作推演增强代理的规划能力。我们在三个具身规划基准上评估了该框架:Robotouille Synchronous、Robotouille Asynchronous和ALFWorld。我们的方法优于最先进的基线,在Robotouille Synchronous、Asynchronous和ALFWorld上的Pass@1分别最多提升22%、37%和15%,同时保持相当或更低的计算成本。

英文摘要

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intents into actionable sub-goals while adhering to the constraints of a dynamic environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or they hallucinate state transitions that violate environment constraints. We propose GiG, a planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. GiG enables retrieval of structurally similar priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a bounded lookahead module that leverages symbolic transition logic to enhance the agent's planning capabilities through grounded action projections. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld while maintaining comparable or lower computational cost.

2601.14506 2026-05-19 cs.CY cs.CL 版本更新

Compounding Disadvantage: Auditing Intersectional Bias in LLM-Generated Explanations Across Indian and American STEM Education

叠加劣势:在印度和美国STEM教育中对LLM生成解释中的交叉偏见进行审计

Amogh Gupta, Niharika Patil, Sourojit Ghosh, SnehalKumar, S Gaikwad

发表机构 * Society-Centered AI Lab(以社会为中心的人工智能实验室) University of Washington(华盛顿大学)

AI总结 本文研究了大型语言模型在不同文化背景下对边缘化学生群体的系统性歧视,通过审计四个LLM模型发现,多重维度的边缘化导致成绩差距扩大,偏见在精英机构中持续存在。

详情
AI中文摘要

大型语言模型越来越多地被部署于高收入和低收入国家各类机构的STEM教育中,用于个性化教学和反馈。这些系统旨在根据学生需求调整内容,但它们究竟是依据学生已展示的能力还是依据人口统计信号进行调整,尚未经过大规模检验。本文证实,LLM生成的STEM内容在两种文化背景下系统性地使边缘化学生群体处于不利地位,最具特权与最边缘化的学生画像之间的差距达到2.55个年级。我们使用合成学生画像对四个LLM(Qwen 2.5-32B-Instruct、GPT-4o、GPT-4o-mini、GPT-OSS 20B)进行了审计,这些画像交叉组合了印度教育特有的维度(种姓、教学语言、学院层级)和美国教育特有的维度(种族、是否就读HBCU、学校类型),以及收入、性别和残疾状况;审计涵盖排名和生成任务,并采用FDR校正的显著性检验和SHAP特征归因。收入在所有模型和情境中均产生显著影响,教学语言在印度情境中产生最大的单一影响,残疾状况则会触发更简单的解释。这些影响以非加性方式叠加:多个维度上的边缘化所造成的差距大于任何单一维度所能预测的,且偏见在精英机构内部依然存在。偏见在全部四种架构中表现一致,更换模型也无法消除,因此在部署之前进行交叉性、跨文化的审计是一项结构性要求。

英文摘要

Large language models are increasingly deployed in STEM education for personalized instruction and feedback across institutions in high- and low-income countries. These systems are designed to adapt content to student needs, but whether they adapt based on demonstrated ability or demographic signals remains untested at scale. Here we establish that LLM-generated STEM content systematically disadvantages marginalized student profiles across two cultural contexts, with the gap between the most privileged and most marginalized profiles reaching 2.55 grade levels. We audited four LLMs (Qwen 2.5-32B-Instruct, GPT-4o, GPT-4o-mini, GPT-OSS 20B) using synthetic profiles crossing dimensions specific to Indian education (caste, medium of instruction, college tier) and American education (race, HBCU attendance, school type), alongside income, gender, and disability, across ranking and generation tasks with FDR-corrected significance testing and SHAP feature attribution. Income produces significant effects across every model and context, medium of instruction drives the largest single effect in the Indian context, and disability status triggers simpler explanations. Effects compound non-additively: marginalization across multiple dimensions produces gaps larger than any single dimension predicts, and biases persist within elite institutions. Bias is consistent across all four architectures and persists through model selection, making intersectional, cross-cultural auditing a structural requirement before deployment.

2601.09722 2026-05-19 cs.CL cs.AI 版本更新

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

ADMEDTAGGER: 一个用于波兰医疗语言知识蒸馏的标注框架

Franciszek Górski, Andrzej Czyżewski

发表机构 * Gdansk University of Technology(格但斯克技术大学)

AI总结 本文提出了一种标注框架,展示如何利用一个多语言预训练大语言模型作为教师模型,蒸馏出用于标注波兰医疗文本所需的专业知识,通过开发多类分类器,解决了标注资源不足的问题,最终得到了高效的分类器。

详情
AI中文摘要

在本工作中,我们提出了一种标注框架,展示了如何利用一个多语言预训练大语言模型作为教师模型,蒸馏出用于标注波兰医疗文本所需的专业知识。本工作是ADMEDVOICE项目的一部分,在此项目中,我们收集了涵盖五个临床类别(放射学、肿瘤学、心脏病学、高血压和病理学)的大量医疗文本语料库。利用这些数据,我们开发了一个多类分类器,但根本问题在于缺乏足够的标注资源来标注足够数量的文本。因此,在我们的解决方案中,我们使用多语言Llama3.1模型来标注大量波兰医疗文本语料库。利用我们有限的标注资源,我们只验证了这些标签中的一部分,从而创建了一个测试集。通过这种方式标注的数据随后用于训练和验证三种基于BERT架构的分类器:基于DistilBERT的蒸馏模型、在医疗数据上微调的BioBERT以及在波兰语言语料库上微调的HerBERT。在我们训练的模型中,DistilBERT模型表现最佳,每个临床类别达到了F1分数大于0.80,其中三个类别达到了F1分数大于0.93。通过这种方式,我们得到了一系列高效的分类器,这些分类器在大小、GPU VRAM消耗和推理速度方面分别比大型语言模型小约500倍、低300倍,以及快数百倍。

英文摘要

In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

2512.19134 2026-05-19 cs.CL cs.IR 版本更新

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

QuCo-RAG:从预训练语料库中量化不确定性以实现动态检索增强生成

Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng

发表机构 * University of Illinois at Chicago(伊利诺伊大学芝加哥分校) New York University(纽约大学) Monash University(莫纳什大学)

AI总结 本研究提出QuCo-RAG,通过从预训练语料库中提取客观统计信息来量化不确定性,以解决动态检索增强生成中大语言模型的幻觉问题,实验表明其在多跳问答基准测试中优于现有方法,并在多个模型上实现了显著的提升。

Comments ACL Findings 2026

详情
AI中文摘要

动态检索增强生成在生成过程中自适应地决定何时检索,以缓解大语言模型(LLMs)的幻觉。然而,现有方法依赖模型内部信号(例如logits、熵),这些信号从根本上并不可靠,因为LLM通常校准不佳,且常常对错误输出表现出高置信度。我们提出QuCo-RAG,将依据从主观置信度转向由预训练数据计算得到的客观统计量。我们的方法通过两个阶段量化不确定性:(1)生成之前,识别提示长尾知识缺口的低频实体;(2)生成过程中,在预训练语料库中验证实体共现,零共现往往意味着幻觉风险。两个阶段均利用Infini-gram在4万亿token上进行毫秒级延迟的查询,并在不确定性较高时触发检索。在多跳问答基准上的实验表明,使用OLMo-2模型时,QuCo-RAG相比最先进的基线取得5至12个点的EM提升,并能有效迁移到预训练数据未公开的模型(Llama-3、Qwen2.5、GPT-4.1/5-chat),EM最多提升14个点。在长文本生成和生物医学问答上的泛化进一步验证了该范式的鲁棒性。这些结果确立了基于语料库的验证是一种有原则、在实践中与模型无关的动态RAG范式。我们的代码已公开于https://github.com/ZhishanQ/QuCo-RAG。

英文摘要

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama-3, Qwen2.5, GPT-4.1/5-chat), improving EM by up to 14 points. Generalization to long-form generation and biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.
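The two-stage trigger described in this abstract can be sketched with a toy in-memory corpus. This is only an illustration under stated assumptions: the real system queries Infini-gram over a pre-training corpus, whereas here a list of strings stands in for it, matching is plain substring search, and the function names and the `min_freq` threshold are hypothetical.

```python
def entity_frequency(corpus, entity):
    """Number of documents mentioning the entity (toy stand-in for a corpus index)."""
    entity = entity.lower()
    return sum(1 for doc in corpus if entity in doc.lower())

def cooccurrence_count(corpus, head, tail):
    """Number of documents mentioning both entities."""
    head, tail = head.lower(), tail.lower()
    return sum(1 for doc in corpus if head in doc.lower() and tail in doc.lower())

def needs_retrieval_before(corpus, question_entities, min_freq=2):
    """Stage 1: retrieve if any question entity is rare (min_freq is an assumed threshold)."""
    return any(entity_frequency(corpus, e) < min_freq for e in question_entities)

def needs_retrieval_during(corpus, head, tail):
    """Stage 2: retrieve if a generated entity pair never co-occurs in the corpus."""
    return cooccurrence_count(corpus, head, tail) == 0

corpus = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won the Nobel Prize.",
    "Warsaw is the capital of Poland.",
]
```

With this toy corpus, a generated claim linking "Marie Curie" and "Paris" has zero co-occurrence and would trigger retrieval, while "Marie Curie" and "Warsaw" would not.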

2512.04746 2026-05-19 cs.CL cs.AI 版本更新

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2:迈向弥合LLM极低比特后训练量化的性能差距

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

发表机构 * Intel(英特尔公司) Beijing Institute of Technology(北京理工大学)

AI总结 本文提出SignRoundV2框架,通过自适应混合精度策略和轻量稳定技术,在极低比特量化下保持高性能,实验表明在混合MXFP设置中实现接近无损性能,将性能差距缩小到约1%。

详情
AI中文摘要

极低比特量化对高效部署大型语言模型(LLMs)至关重要,但在2比特甚至4比特(如MXFP4)下往往导致严重的性能下降。我们提出了SignRoundV2,一种后训练量化框架,旨在即使在激进压缩下也能保持高性能。SignRoundV2引入(1)一种简单而高效的自适应混合精度策略,利用梯度信息和量化引起的重建误差来指导逐层比特分配,以及(2)一组轻量级稳定技术,包括损失过滤和调优前的缩放因子搜索,以提高极低比特场景下的调优效果。我们的方法朝着弥合量化模型与全精度模型之间的性能差距迈出了重要一步。在多种LLMs上的实验结果表明,SignRoundV2在混合MXFP设置中实现了接近无损的性能,在平均4.5比特下将差距缩小到约1%,同时在具有挑战性的2比特仅权重量化中大幅提高了准确性。源代码可在https://github.com/intel/auto-round获取。

英文摘要

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework designed to maintain high performance even under aggressive compression. SignRoundV2 introduces (1) a simple yet efficient adaptive mixed-precision strategy that leverages gradient information and quantization-induced reconstruction errors to guide layer-wise bit allocation, and (2) a set of lightweight stabilization techniques, including loss filtering and a pre-tuning scale search, to improve tuning effectiveness in extremely low-bit regimes. Our approach takes a significant step toward closing the performance gap between quantized and full-precision models. Experimental results across diverse LLMs demonstrate that SignRoundV2 achieves near-lossless performance in mixed MXFP settings, narrowing the gap to ~1% at an average of 4.5 bits, while substantially improving accuracy in challenging 2-bit weight-only quantization. The source code is available at https://github.com/intel/auto-round.

2511.21016 2026-05-19 cs.LG cs.CL 版本更新

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

门控KalmaNet:通过测试时岭回归实现渐逝记忆层

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

发表机构 * University of Pennsylvania(宾夕法尼亚大学) AWS Agentic AI(AWS 代理人工智能)

AI总结 本文提出门控KalmaNet(GKA),通过测试时岭回归实现渐逝记忆层,解决了线性状态空间模型(SSMs)在记忆过去信息时的效率与精度问题,展示了GKA在短上下文任务和长上下文任务中的优越性能。

Comments 30 pages, 10 figures. Accepted at CVPR 2026

详情
AI中文摘要

线性状态空间模型(SSMs)为softmax注意力提供了一种高效的替代方案,具有恒定的内存和线性计算,但其对过去信息有损且渐逝的总结不利于需要回忆的任务。我们提出门控KalmaNet(GKA,发音为“gee-ka”),这是一种能够考虑完整过去同时保持SSM式效率的层。我们的方法以卡尔曼滤波(KF)为基础,并表明现有的几种SSM层(DeltaNet、门控DeltaNet、Kimi Delta Attention)是在单位误差协方差假设下对卡尔曼滤波递归的近似,该假设忽略了过去的键和值应如何以最优方式影响状态更新。相比之下,GKA保持完整的误差协方差并计算精确的卡尔曼增益。在便于并行化的稳态假设下,这可以简化为具有恒定内存和线性计算的在线岭回归。标准的卡尔曼滤波方程在低精度设置(如bfloat16)下数值不稳定,且难以在GPU上并行化。我们通过以下方式解决这一问题:(1)利用依赖输入的门控进行自适应正则化,以控制岭回归的条件数;(2)采用Chebyshev迭代,我们证明它在低精度下比传统迭代求解器更稳定。我们进一步开发了硬件感知的分块内核以实现高效训练。实证上,GKA在短上下文任务中优于现有的SSM层(如Mamba2、门控DeltaNet),并在最长128k token的长上下文RAG和LongQA任务上取得超过10%的相对提升。我们还展示了在扩展到ImageNet分类时,GKA优于Mamba。我们的代码(包括用于训练和推理(vLLM)的Triton内核)以及HuggingFace上8B和32B规模的基于GKA的混合模型库,均以Apache 2.0许可证发布。

英文摘要

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF equations are numerically unstable in low-precision settings (e.g., bfloat16) and hard to parallelize on GPUs. We address this with (1) adaptive regularization via input-dependent gating to control the ridge regression's condition number, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low precision. We further develop hardware-aware chunk-wise kernels for efficient training. Empirically, GKA outperforms existing SSM layers (e.g., Mamba2, Gated DeltaNet) on short-context tasks and achieves more than 10% relative improvement on long-context RAG and LongQA up to 128k tokens. We further show GKA outperforms Mamba when extended to ImageNet classification. Our code, including Triton kernels for training and inference (vLLM), along with a model zoo of GKA-based Hybrid models at 8B and 32B scale on HuggingFace, is released under Apache 2.0.
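To make the "ridge regression solved by Chebyshev Iteration" idea concrete, here is a minimal NumPy sketch. It is not the GKA layer: gating, chunking, and low-precision handling are omitted, and the readout formula, the function names, and the eigenvalue bounds (the ridge strength as lower bound, ridge strength plus trace as upper bound) are assumptions chosen for this example.

```python
import numpy as np

def chebyshev_solve(A, b, lam_min, lam_max, n_iter=200):
    """Chebyshev iteration for A x = b, with A symmetric positive definite
    and its eigenvalues contained in [lam_min, lam_max]."""
    d = (lam_max + lam_min) / 2.0   # centre of the eigenvalue interval
    c = (lam_max - lam_min) / 2.0   # half-width of the interval
    x = np.zeros_like(b)
    r = b - A @ x
    p = np.zeros_like(b)
    alpha = 0.0
    for i in range(n_iter):
        if i == 0:
            p = r.copy()
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 1 else (c * alpha / 2.0) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = r + beta * p
        x = x + alpha * p
        r = b - A @ x
    return x

def ridge_readout(keys, values, query, lam=1.0):
    """Read out (sum_i v_i k_i^T) (sum_i k_i k_i^T + lam I)^{-1} q.

    keys: (n, d), values: (n, d_v), query: (d,). Both sums are fixed-size
    statistics, so memory does not grow with the number of past tokens.
    """
    H = keys.T @ keys
    M = values.T @ keys
    A = H + lam * np.eye(H.shape[0])
    # Eigenvalues of A lie in [lam, lam + trace(H)] because H is positive semidefinite.
    x = chebyshev_solve(A, query, lam, lam + np.trace(H))
    return M @ x
```

Unlike conjugate gradients, the iteration above needs no inner products of residuals, only the spectral bounds, which is one commonly cited reason such schemes behave predictably; whether that is the reason for the stability result in the paper is not stated in the abstract.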

2511.12710 2026-05-19 cs.CL cs.CR 版本更新

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

进化方法,而非提示:针对大语言模型越狱攻击的进化式合成

Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma


AI总结 本文提出 EvoSynth 框架,通过在代码空间中进行搜索,而非仅在提示空间中优化,从而提高对大语言模型的越狱攻击成功率和多样性。

详情
AI中文摘要

针对大语言模型(LLM)的自动化红队框架日益复杂,但许多仍主要在提示空间中优化攻击。换句话说,这些方法主要搜索更好的攻击用语或策略选择,但不搜索可执行代码。通过将搜索移至代码空间,我们不仅能优化最终的攻击提示,还能优化生成它的过程,包括执行流程、可重用逻辑、分支和失败驱动的修复。为克服这一差距,我们引入了 EvoSynth,一个自主的多代理框架,将优化空间从提示转移到可执行代码。与直接优化提示不同,EvoSynth 使用多代理系统自主设计、进化和执行基于代码的攻击算法。关键在于其代码级自我纠正循环,使其能够根据目标模型反馈和失败尝试迭代重写基于代码的算法。通过广泛实验,我们证明 EvoSynth 在高度稳健的模型如 Claude-Sonnet-4.5 上实现了 85.5% 的攻击成功率(ASR),并在评估目标上平均达到 95.9% 的 ASR,同时生成的攻击比现有方法更具多样性。我们发布该框架以促进未来在可执行代码空间中进化合成的研究。

英文摘要

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

2511.04070 2026-05-19 cs.CL 版本更新

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

T-FIX:基于文本的解释,具备专家可理解的特征

Shreya Havaldar, Weiqiu You, Chaehyeon Kim, Anton Xue, Helen Jin, Marco Gatti, Bhuvnesh Jain, Helen Qu, Amin Madani, Daniel A. Hashimoto, Gary E. Weissman, Rajat Deo, Sameed Khatana, Lyle Ungar, Eric Wong

发表机构 * Department of Computer and Information Science, University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系) Department of Computer Science, University of Texas at Austin(德克萨斯大学奥斯汀分校计算机科学系) Department of Physics and Astronomy, University of Pennsylvania(宾夕法尼亚大学物理与天文学系) Flatiron Institute(Flatiron研究所) Department of Surgery, Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院外科系) Division of Pulmonary, Allergy, and Critical Care, Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院呼吸、过敏与危重医学科) Division of Cardiovascular Medicine, Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院心血管医学科) Department of Surgery, University of Toronto(多伦多大学外科系) University Health Network(大学健康网络)

AI总结 本文提出T-FIX框架,用于评估LLM生成的解释是否符合专家的推理方式,通过七个科学任务和三个领域进行验证,实现了自动且可定制的专家对齐评估。

详情
AI中文摘要

随着LLM被应用于知识密集型领域(例如手术、天文学、治疗),用户通常是领域专家,他们不仅期望答案,还期望解释能反映专业推理。然而,评估LLM是否'像专家一样思考'仍然困难:现有方法依赖于每个示例的专家注释,使它们成本高、难以扩展,并且局限于每个领域的单一正确推理观念。为了解决这一差距,我们引入了T-FIX,一个统一的评估框架,将专家对齐作为LLM生成解释的期望属性进行操作化。T-FIX涵盖七个科学任务和三个领域,每个任务均根据专家定义的准则进行评估,这些准则捕捉的是领域相关的推理而非通用的解释质量。我们的框架实现了自动且可定制的专家对齐评估,能够在没有持续专家参与的情况下泛化到未见过的解释。代码可在https://github.com/BrachioLab/FIX-2/上获得。

英文摘要

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.

2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA 版本更新

Tongyi DeepResearch Technical Report

通义深研技术报告

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文介绍了一种专为长时间深度信息检索任务设计的代理大语言模型,通过端到端训练框架结合代理中期和后期训练,实现了在复杂任务中的可扩展推理和信息检索,同时提供了高可扩展的数据合成管道,实现了无需昂贵人工标注的自动化训练流程,并在多个深度研究基准测试中取得了最先进的性能。

Comments https://tongyi-agent.github.io/blog

详情
AI中文摘要

我们介绍了通义深研(Tongyi DeepResearch),一种专为长周期、深度信息检索研究任务设计的代理式大语言模型。为了激发自主深度研究能力,通义深研通过结合代理式中期训练(mid-training)和代理式后训练(post-training)的端到端训练框架开发,实现了在复杂任务中的可扩展推理和信息检索。我们设计了一个高度可扩展的数据合成管道,完全自动化,无需依赖昂贵的人工标注,并支撑所有训练阶段。通过为每个阶段构建定制化环境,我们的系统在整个过程中实现了稳定一致的交互。通义深研拥有305亿总参数,每token仅激活33亿个参数,在多个代理式深度研究基准测试中,包括Humanity's Last Exam、BrowseComp、BrowseComp-ZH、WebWalkerQA、xbench-DeepSearch、FRAMES和xbench-DeepSearch-2510,均取得了最先进的性能。我们开源了该模型、框架和完整解决方案,以赋能社区。

英文摘要

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

2510.16727 2026-05-19 cs.CL cs.AI 版本更新

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon:单轮诊断和缓解大型语言模型中潜在的阿谀倾向

Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal

AI总结 本文提出Beacon基准测试,用于单轮诊断和缓解大型语言模型中潜在的阿谀倾向,通过评估十二种最先进的模型,揭示了阿谀倾向在语言和情感方面的稳定子偏差,并提出了在提示和激活层面的干预措施,以调节这些偏差,从而揭示对齐作为事实性和社会合规判断之间的动态流形。

详情
AI中文摘要

大型语言模型内部化了诚实与奉承之间的结构权衡,这种权衡源于奖励优化,将有用性与礼貌服从混淆。这种潜在的偏见,称为阿谀倾向,表现为对用户同意的偏好而非原则性推理。我们引入Beacon,一种单轮强制选择基准测试,该测试独立于对话上下文,能够精确测量事实准确性与顺从偏见之间的张力。在十二种最先进的模型上的评估表明,阿谀倾向分解为稳定的语言和情感子偏见,每个都随模型容量而扩大。我们进一步提出了提示级别和激活级别干预,以调节这些偏见的相反方向,揭示对齐作为事实性和社会合规判断之间的动态流形。Beacon将阿谀倾向重新定义为可测量的规范性误泛化形式,为研究和缓解大规模生成系统中的对齐漂移提供了可重复的基础。

英文摘要

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

2510.16252 2026-05-19 cs.LG cs.CL 版本更新

WEBSERV: A Full-Stack and RL-Ready Web Environment for Training Web Agents at Scale

WEBSERV: 一个全栈且适合强化学习的网页环境,用于大规模训练网页代理

Yuxuan Lu, Ziyi Wang, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, Xianfeng Tang, Chen Luo, Yisi Sang, Jin Lai, Dakuo Wang

发表机构 * Northeastern University(东北大学) Amazon(亚马逊)

AI总结 本文提出WebServ,一个全栈且适合强化学习的网页环境,用于大规模训练网页代理。该环境在服务器端使用Incus容器减少启动延迟和存储需求,浏览器端提供自动化的观察和动作接口,以及可靠的执行后端。实验表明,WebServ在WebArena-Lite上实现了最先进的单提示结果,并在强化学习训练中超越了现有方法。

详情
AI中文摘要

面向网页代理的强化学习(RL)需要既能有效用于评估、又足够高效以支持大规模同策略(on-policy)训练的环境。当前网页环境存在不足:服务器端的Docker设置过于资源密集,无法支持大规模并行rollout;浏览器端接口产生噪声观察,在现代单页应用(SPA)下执行动作不可靠,并遗漏视觉交互提示。我们引入WebServ,一个全栈、适合强化学习的网页环境,端到端地解决这些限制。在服务器端,WebServ使用带块级写时复制(copy-on-write)的Incus容器,将启动延迟降低约5倍,持久化存储减少约240倍,使单台主机可支持200个以上并发的隔离环境。在浏览器端,WebServ提供一个紧凑的、站点无关的观察和动作接口,该接口自动从DOM派生并带有与人类一致的交互提示;同时提供使用网络感知等待的稳健动作执行后端,以可靠地支持SPA。在WebArena-Lite上,WebServ实现了最先进的单提示结果,受控比较确认其在GPT-4o、OpenAI-o3和Llama-3.1-8B上均稳定优于原版WebArena。我们进一步完全在WebServ中用RL训练Qwen3-4B和Qwen3-30B-A3B;RL训练的4B模型平均准确率达到55.5%,超过了Claude 4.5 Sonnet(50.0%)和WebAgent-R1中RL训练的8B模型(51.8%)。

英文摘要

Reinforcement learning (RL) for web agents demands environments that are both effective for evaluation and efficient enough for large-scale on-policy training. Current web environments fall short: server-side Docker setups are too resource-intensive for massive parallel rollouts, while browser-side interfaces produce noisy observations, execute actions unreliably under modern single-page applications, and omit visual interactivity cues. We introduce WebServ, a full-stack, RL-ready web environment that addresses these limitations end-to-end. On the server side, WebServ uses Incus containers with block-level copy-on-write, reducing launch latency by ~5x and persistent storage by ~240x, enabling 200+ concurrent isolated environments on a single host. On the browser side, WebServ provides a compact, site-agnostic observation and action interface derived automatically from the DOM with human-aligned interactivity cues, and a robust action execution backend using network-aware waiting for reliable SPA support. On WebArena-Lite, WebServ achieves state-of-the-art single-prompt results, with controlled comparisons confirming consistent gains across GPT-4o, OpenAI-o3, and Llama-3.1-8B over vanilla WebArena. We further train Qwen3-4B and Qwen3-30B-A3B with RL entirely within WebServ; the RL-trained 4B model achieves 55.5% mean accuracy, surpassing both Claude 4.5 Sonnet (50.0%) and the RL-trained 8B model from WebAgent-R1 (51.8%).

2510.14466 2026-05-19 cs.CL cs.AI 版本更新

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

迈向低资源语言LLM鲁棒多语言适应

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

发表机构 * Department of Automation, Tsinghua University, Beijing, China(清华大学自动化系) Alibaba International Digital Commerce Group, Beijing, China(阿里巴巴国际数字商业集团) School of Software, Tsinghua University, Beijing, China(清华大学软件学院)

AI总结 本文提出LiRA框架,通过轻量级微调实现低资源语言LLM的鲁棒多语言适应,结合Arca和LaSR组件提升跨语言语义一致性与表示稳定性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在低资源语言上仍面临挑战,主要由于训练数据有限、翻译噪声和跨语言对齐不稳定。为解决这些问题,我们提出LiRA(面向LLM的语言鲁棒锚定),一个即插即用的框架,仅需在现有预训练骨干模型上进行轻量级微调。LiRA通过结合两个关键组件,联合优化表示稳定性和跨语言语义一致性:Arca(锚定表示组合架构),通过基于锚点的对齐和协作编码将低资源输入对齐到共享的英语语义空间;以及LaSR(语言耦合语义推理器),一个轻量级、语言感知的头部,通过一致性正则化实现统一的跨语言理解、检索和推理。我们从理论上证明,在锚定误差和翻译引入的偏差受控的条件下,LiRA在局部Lipschitz连续性假设下保证表示偏差有界且下游性能稳定。为促进研究,我们发布了一个新的多语言产品检索数据集,涵盖五种东南亚语言和两种南亚语言。在多样化的低资源基准测试上的广泛实验显示,该方法在检索、排序、问答和推理任务上均取得一致的改进。代码将在GitHub上公开,数据集将托管在Hugging Face上。

英文摘要

Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs), a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

2510.13870 2026-05-19 cs.CL cs.AI 版本更新

Unlocking the Potential of Diffusion Language Models through Template Infilling

通过模板填充解锁扩散语言模型的潜力

Junhoo Lee, Seungyeon Kim, Nojun Kwak

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出了一种针对扩散语言模型的模板填充方法,通过在生成响应空间中建立全局蓝图,提升了数学推理、代码生成和旅行规划等任务的性能,同时在多token生成中实现了生成质量与速度的平衡。

Comments ACL 2026 Main Conference - Long Paper, Oral Presentation

详情
AI中文摘要

扩散语言模型(DLMs)作为一种有前景的替代自回归语言模型的候选者,其推理策略仍局限于自回归范式继承的前缀提示。本文提出模板填充(TI),一种针对DLMs的定制化条件化方法。与传统前缀提示不同,TI在目标响应空间中灵活对齐结构锚点,建立全局蓝图后再填充被遮蔽段落。我们在数学推理、代码生成和旅行规划等多样基准上展示了方法的有效性,相对于基线模型在多个任务上实现了9.40%的提升。此外,我们发现TI在多token生成设置中提供了额外优势,能够在保持生成质量与鲁棒性的同时实现有效加速。通过强制这些全局约束,TI最终促进了系统2推理,使模型能够在结构定义的解决方案空间内进行深入思考。

英文摘要

Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs. Unlike conventional prefix prompting, TI flexibly aligns structural anchors across the entire target response space, establishing a global blueprint before filling in the masked segments. We demonstrate the effectiveness of our approach on diverse benchmarks, including mathematical reasoning, code generation, and trip planning, achieving consistent improvements of 9.40% over the baseline. Furthermore, we observe that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality and robustness. By enforcing these global constraints, TI ultimately facilitates System-2 reasoning, empowering the model to deliberate within a structurally defined solution space.

2510.10528 2026-05-19 cs.CL cs.LG 版本更新

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Merlin's Whisper:通过黑盒说服提示增强大语言模型的高效推理

Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li

发表机构 * Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系) Sea AI Lab(Sea AI实验室) Peking University(北京大学)

AI总结 本文提出Whisper框架,通过黑盒说服提示减少大型推理模型(LRM)推理过程中的token使用量,同时保持性能,并在多个基准测试中展示了显著的token减少效果。

Comments ACL 2026 (Long Paper), camera-ready version

详情
AI中文摘要

大型推理模型(LRMs)通过逐步思考在解决复杂任务方面表现出色。然而,这种漫长的推理过程带来了显著的计算和延迟开销,阻碍了LRMs的实用部署。本文提出了一种通过黑盒说服提示来减轻LRMs过度思考的新方法。通过将LRMs视为黑盒通信者,我们研究如何说服它们生成简洁响应而不影响准确性。我们引入了Whisper,一个迭代细化框架,能够从多种视角生成高质量的说服提示。在多个基准测试中的实验表明,Whisper在保持性能的同时,能够显著减少token使用量。值得注意的是,Whisper在简单的GSM8K问题上对Qwen3模型系列实现了平均3倍的响应长度减少,并在所有基准测试中实现了平均约40%的token减少。对于闭源API,Whisper在MATH-500上分别使Claude-3.7和Gemini-2.5的token使用量减少了46%和50%。进一步分析显示,Whisper在数据领域、模型规模和家族中的广泛应用,凸显了黑盒说服提示作为提升LRM效率的实用策略的潜力。

英文摘要

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.

2510.08886 2026-05-19 cs.CL cs.CE cs.IR 版本更新

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

FinAuditing: 一个基于财务分类结构的多文档基准,用于评估LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Yankai Chen, Víctor Gutiérrez-Basulto, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

发表机构 * Columbia University(哥伦比亚大学) Stevens Institute of Technology(史蒂文斯理工学院) Rensselaer Polytechnic Institute(伦斯勒理工学院) University of Montreal(蒙特利尔大学) McGill University(麦吉尔大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Cardiff University(卡迪夫大学) The University of Manchester(曼彻斯特大学) Harvard University(哈佛大学)

AI总结 本文提出FinAuditing,一个基于财务分类结构的多文档基准,用于评估大型语言模型在财务审计任务中的能力,通过三个任务:财务语义匹配、财务关系提取和财务数学推理,揭示了现有LLMs在概念检索、分类感知关系建模和跨文档一致性推理方面的显著差距。

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

详情
AI中文摘要

财务审计不止于简单的文本处理,它需要在大规模披露文件中检测语义、结构和数值上的不一致。由于财务报告以XBRL(一种受会计准则规范的结构化XML格式)提交,审计成为一个涉及概念对齐、分类法定义的关系和跨文档一致性的结构化信息提取与推理问题。尽管大型语言模型(LLMs)在孤立的财务任务上表现出潜力,但其在专业级审计中的能力仍不明确。我们引入了FinAuditing,一个与分类法对齐、结构感知的基准,由真实的XBRL文件构建。它包含1,102个标注实例,平均超过33,000个token,并定义了三个任务:财务语义匹配(FinSM)、财务关系提取(FinRE)和财务数学推理(FinMR)。对13种最先进的LLMs的评估揭示了其在概念检索、分类法感知的关系建模和跨文档一致性推理方面的显著差距。这些发现突显了对贴近现实、结构感知的基准的需求。我们发布了评估代码(https://github.com/The-FinAI/FinAuditing)和数据集(https://huggingface.co/collections/TheFinAI/finauditing)。目前,该任务是一项正在进行的公开评估竞赛的官方基准(https://open-finance-lab.github.io/SecureFinAI_Contest_2026/)。

英文摘要

Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.

2510.08702 2026-05-19 cs.CL 版本更新

Scaling Laws for Code: A More Data-Hungry Regime

代码的缩放规律:一个更数据渴求的阶段

Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Chinese Academy of Sciences(中国科学院) Fudan University(复旦大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文研究了代码的缩放规律,通过大规模实验发现Farseer定律在准确性上更优,代码模型在模型大小上表现良好,但需要更高的数据与参数比,且在代码-自然语言混合数据中,自然语言在资源受限场景下有益,但在更高计算预算下成为负担。

Comments Accepted by ACL2026

详情
AI中文摘要

代码大型语言模型(LLMs)正在革新软件工程。然而,指导高效训练的缩放定律主要是在自然语言(NL)上分析的。鉴于代码和自然语言之间的根本差异,如严格的语法,这些定律是否直接适用于代码尚不清楚。为填补这一空白,我们进行了首次大规模的代码缩放定律实证研究,包括117次实验运行,模型大小从0.2B到3.8B,训练token从2B到128B。我们拟合了Chinchilla定律和Farseer定律。首先,结果表明,表达能力更强的Farseer定律准确性更高。其次,分析显示代码LLMs随模型大小有效扩展。关键的是,代码代表了一个更数据渴求的阶段,需要比自然语言显著更高的数据与参数比。最后,针对代码-自然语言混合数据的两组额外实验显示,自然语言在资源受限的场景下有益,但在更高计算预算下成为负担。

英文摘要

Code Large Language Models (LLMs) are revolutionizing software engineering. However, scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given the fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
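
For readers unfamiliar with the Chinchilla law mentioned in this abstract, the sketch below evaluates its commonly cited parametric form, L(N, D) = E + A / N^alpha + B / D^beta, together with the data-to-parameter ratio D / N. The function names and the coefficient values in the usage note are illustrative assumptions, not values fitted in the paper, and the Farseer law is not shown.

```python
def chinchilla_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style parametric loss: E + A / N**alpha + B / D**beta.

    E is the irreducible loss; the other two terms shrink as the model
    size N (n_params) and the training-token count D (n_tokens) grow.
    All coefficients are supplied by the caller (hypothetical here).
    """
    return E + A / n_params**alpha + B / n_tokens**beta


def data_to_param_ratio(n_params, n_tokens):
    """Training tokens per model parameter (D / N)."""
    return n_tokens / n_params
```

As a usage example with made-up coefficients, `chinchilla_loss(2, 2, E=1.0, A=1.0, B=1.0, alpha=1.0, beta=1.0)` evaluates to 2.0, and `data_to_param_ratio(2e9, 128e9)` gives 64 tokens per parameter.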

2510.07239 2026-05-19 cs.CL 版本更新

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Red-Bandit:通过老虎机引导的LoRA专家实现LLM红队测试的测试时适应

Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 本文提出Red-Bandit框架,利用老虎机引导的LoRA专家在不同攻击风格下实现LLM红队测试的测试时适应:通过强化学习训练专家生成不安全提示,并利用多臂老虎机策略动态选择攻击风格专家,从而在AdvBench上取得最先进的结果,同时生成更易读的提示。

Comments Accepted to the Main Conference at ACL 2026

详情
AI中文摘要

自动化红队测试已成为在部署前审计大型语言模型(LLM)的可扩展方法,但现有方法缺乏在推理时高效适应模型特定漏洞的机制。我们提出Red-Bandit,一种红队测试框架,能够在线适应,以识别并利用不同攻击风格(例如操纵、俚语)下的模型失败模式。Red-Bandit对一组参数高效的LoRA专家进行后训练,每个专家专门针对一种攻击风格;训练采用强化学习,并通过基于规则的安全模型对生成不安全提示的行为给予奖励。在推理时,多臂老虎机策略根据目标模型响应的安全性在这些攻击风格专家之间动态选择,平衡探索和利用。在充分探索(ASR@10)的条件下,Red-Bandit在AdvBench上取得了最先进的结果,同时生成更易于人类阅读的提示(更低的困惑度)。此外,Red-Bandit的老虎机策略还可充当诊断工具,通过指出哪些攻击风格最能引发不安全行为来揭示模型特定的漏洞。

英文摘要

Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
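
The expert-selection step described in this abstract can be pictured with a generic multi-armed bandit. The sketch below uses UCB1, which is an assumption made for illustration: the abstract does not say which bandit algorithm Red-Bandit uses, and the class name, arm labels, and reward convention here are hypothetical.

```python
import math


class UCB1Selector:
    """Generic UCB1 bandit over a fixed set of arms (e.g., attack styles)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}    # pulls per arm
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())
        # Mean reward plus an exploration bonus that shrinks with more pulls.
        return max(
            self.arms,
            key=lambda a: self.values[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental update of the arm's mean reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In use, each round would call `select()`, score the outcome as a reward in [0, 1], and pass it to `update()`. The resulting pull counts and mean rewards per arm are the kind of signal the abstract describes as diagnostic of which styles work best.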

2510.05921 2026-05-19 cs.CL cs.LG 版本更新

Prompt reinforcing for long-term planning of large language models

通过提示强化实现大语言模型的长期规划

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

发表机构 * Heinrich-Heine-Universität Düsseldorf(杜塞尔多夫海因里希·海涅大学)

AI总结 本文提出了一种基于强化学习的提示优化框架,通过修改LLM代理的任务指令提示来实现长期规划,提升了多轮交互任务如文本到SQL和任务导向对话的表现,并能泛化到不同LLM代理和多种LLM作为元提示代理。

详情
AI中文摘要

大型语言模型(LLMs)在广泛自然语言处理任务中取得了显著成功,并可通过提示进行适应。然而,它们在多轮交互中仍表现不足,常依赖错误的早期假设,无法随时间跟踪用户目标,使此类任务尤其具有挑战性。先前对话系统的工作表明,长期规划对于处理交互任务至关重要。在本工作中,我们提出了一种受强化学习启发的提示优化框架,仅通过修改LLM代理的任务指令提示即可实现此类规划。通过生成回合间的反馈并利用经验回放进行提示重写,我们的方法在文本到SQL和任务导向对话等多轮任务中显示出显著改进。此外,该方法能跨不同LLM代理泛化,并可利用多种LLM作为元提示代理。这促使未来在受强化学习启发的无参数优化方法上的研究。

英文摘要

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

2509.21820 2026-05-19 cs.CL 版本更新

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

大语言模型能否生成并解决语言奥林匹克谜题?

Neh Majmudar, Elena Filatova

发表机构 * CUNY(纽约市立大学)

AI总结 本文研究了大语言模型在生成和解决语言谜题中的能力,发现其在大多数谜题类型上优于人类,但在以书写系统为中心的谜题和研究不足的语言上表现较弱,并指出谜题生成对促进语言学普及的研究意义。

Comments Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

详情
AI中文摘要

在本文中,我们介绍了一种新的任务组合:语言谜题的求解和生成。我们专注于面向高中生的语言学奥林匹克竞赛中使用的谜题。我们首先扩展了现有的语言谜题求解基准。我们探索了大型语言模型(LLMs)在解决语言谜题中的应用,包括OpenAI的o1等最新的最先进模型,并分析了它们在各种语言学主题上的表现。我们证明,LLMs在大多数谜题类型上优于人类,但以书写系统为中心的谜题以及研究不足的语言除外。我们利用谜题求解实验的洞见来指导新的谜题生成任务。我们相信,即使对于相对简单的谜题,自动化谜题生成也有望扩大人们对语言学的兴趣,并将该领域介绍给更广泛的受众。这一发现突显了语言谜题生成作为研究任务的重要性:此类谜题不仅能推广语言学,还能支持关于稀有和研究不足语言的知识传播。

英文摘要

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzle types, except for those centered on writing systems and those involving understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

2507.20917 2026-05-19 cs.CL cs.AI 版本更新

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: 一个用于知识和推理评估的法语医学问答数据集

Adrien Bazoge

发表机构 * Data Clinic, University Hospital of Nantes, France(南特大学医院数据诊所,法国) Nantes Université, École Centrale Nantes, CNRS, LS2N, France(南特大学,中央理工学院南特分校,国家科学研究中心,LS2N,法国)

AI总结 本文提出MediQAl数据集,用于评估语言模型在事实性医学记忆和现实临床场景推理方面的能力,包含32,603个法语医学问题,涵盖41个医学科目,包含三种任务,通过14个大型语言模型的评估发现事实记忆与推理任务之间存在显著性能差距。

详情
AI中文摘要

本文介绍了MediQAl,一个法语医学问答数据集,旨在评估语言模型在事实性医学记忆和现实临床场景推理方面的能力。MediQAl包含32,603个问题,来源于41个医学科目中的法语医学考试。该数据集包含三种任务:(i) 有唯一答案的多项选择题,(ii) 有多个答案的多项选择题,以及(iii) 有短答案的开放性问题。每个问题都被标记为理解或推理,使能够对模型的认知能力进行详细分析。我们通过与14个大型语言模型的广泛评估,包括最近的推理增强模型,验证了MediQAl数据集,并观察到事实记忆与推理任务之间存在显著的性能差距。我们的评估为评估语言模型在法语医学问答上的性能提供了全面的基准,填补了医学领域多语言资源中的关键空白。

英文摘要

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

2507.04996 2026-05-19 cs.CY cs.CE cs.CL cs.HC cs.RO 版本更新

Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and Synergistic Co-Development with Vehicle Autonomy

面向以人为本出行的代理车辆:定义、前景以及与车辆自主性的协同发展

Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli, Fuqiang Liu, Jiyao Wang, Ali Eslami

发表机构 * Interdisciplinary Centre for Security, Reliability and Trust(跨学科安全、可靠与信任中心) University of Luxembourg(卢森堡大学)

AI总结 本文探讨了面向人类中心的移动性,提出了代理车辆的概念,指出自主性和代理性是相互关联但概念上不同的维度,并强调了协同发展的必要性。

详情
AI中文摘要

自主性,源自希腊语autos(自我)和nomos(法律),指的是根据内部规则运行而不受外部控制的能力。自动驾驶车辆(AuVs)因此被理解为能够感知环境并执行任务,且在一定程度上减少人类干预的车辆系统,这与SAE自动化驾驶级别所指示的方向一致。然而,最近的研究和部署越来越多地展示了车辆能力,这些能力虽然不违背自主性,但也不由自主性所涵盖,包括处理模糊目标、有目的的社会互动、外部工具使用、主动问题解决、持续学习以及在未见过且具有伦理重要性的环境中进行情境敏感推理,这在部分情况下得益于多模态语言模型。这些发展揭示了技术自主性与为人类中心移动性所需更广泛社会认知功能之间的差距,这些功能更精确地由代理性概念所捕捉。因此,而不是不断增加“自主”一词的修饰词,我们引入了代理车辆(AgVs)并建议自主性和代理性是相互交织但概念上不同的:如果自主性关注的是做什么和如何做(在内部规则下的任务执行),那么代理性则关注为什么做以及还能做什么(目标导向、适应性的行动)。我们提出自主性和代理性作为正交但相互促进的维度,并具有协同发展的意义。车辆代理标志着移动服务智能的新维度,预示着车辆作为社会中的目的性行为者。

英文摘要

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as vehicular systems that perceive their environment and execute tasks with minimal human intervention, consistent with the direction indicated by the SAE levels of automated driving. However, recent research and deployments increasingly showcase vehicular capabilities that, while not contradicting autonomy, are not entailed by it, including ambiguous goal handling, purposeful social engagement, external tool use, proactive problem solving, continuous learning, and context-sensitive reasoning in unseen and ethically salient situations, enabled in part by multimodal language models. These developments reveal a gap between technical autonomy and the broader social cognitive functions required for human-centered mobility, which are more precisely captured by the notion of agency. Therefore, rather than adding increasingly elaborate modifiers to "autonomous," we introduce agentic vehicles (AgVs) and suggest that autonomy and agency are intertwined but conceptually distinct: if autonomy concerns what to do and how to do it (task executions under internal rules), agency pertains to why to do it and what else can be done (goal-directed, adaptive actions). We present autonomy and agency as orthogonal yet synergistic dimensions with co-development implications. Vehicle agency marks a novel dimension of mobility service intelligence, heralding vehicles as purposeful actors in society.

2506.05606 2026-05-19 cs.CL cs.HC 版本更新

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: 一个用于评估LLM在模拟人类在线购物行为上的表现的观察、人设、推理和行动数据集

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

发表机构 * Northeastern University(东北大学) University of Southern California(南加州大学) Stony Brook University(石溪大学) Independent Researcher(独立研究者) Ohio State University(俄亥俄州立大学) University of Notre Dame(圣母大学) Columbia University(哥伦比亚大学)

AI总结 本文提出OPeRA数据集,用于评估LLM在模拟人类在线购物行为上的能力,通过收集真实用户在在线购物会话中的观察、人设、推理和行动,建立首个评估LLM预测特定用户下一步行动和推理的基准。

Comments ACL 2026 main

详情
AI中文摘要

大型语言模型(LLMs)能否准确模拟特定用户的下一步网页操作?尽管LLMs在生成“可信”的人类行为方面表现出色,但评估其模仿真实用户行为的能力仍是一个开放性挑战,这主要归因于缺乏高质量、公开可用的数据集,这些数据集能够捕捉到实际人类用户的可观测行为和内部推理。为了解决这一差距,我们引入了OPeRA,一个新型的数据集,收集自真实人类参与者在在线购物会话中的观察、人设、推理和行动。OPeRA是首个公开数据集,全面捕捉了用户人设、浏览器观察、细粒度网页操作以及即时自报的推理。我们开发了在线问卷和定制浏览器插件来收集该数据集,以高保真度的方式获取。使用OPeRA,我们建立了首个基准,用于评估当前LLMs在给定人设和<观察、行动、推理>历史的情况下,预测特定用户下一步行动和推理的能力。该数据集为未来研究LLM代理提供了基础,这些代理旨在作为人类的个性化数字双胞胎发挥作用。

英文摘要

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

2505.20650 2026-05-19 cs.CL cs.AI cs.CE 版本更新

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging: 评估LLM提取和结构化财务信息

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Columbia University(哥伦比亚大学) California State University(加州州立大学) University of Montreal(蒙特利尔大学) Carnegie Mellon University(卡内基梅隆大学) Rensselaer Polytechnic Institute(伦斯勒理工学院) The University of Manchester(曼彻斯特大学) Harvard University(哈佛大学)

AI总结 本文提出FinTagging基准,用于评估LLM在提取和结构化财务信息方面的能力,通过分解为FinNI和FinCL两个子任务,揭示了LLM在细粒度概念链接上的局限性。

详情
AI中文摘要

准确解读财务报告中的数字数据对市场和监管机构至关重要。尽管XBRL(可扩展商业报告语言)提供了对财务数据进行标记的标准,但将数千个事实映射到超过1万项美国通用会计准则(US GAAP)概念仍然成本高昂且容易出错。现有基准将此任务简化为对小概念子集的扁平单步分类,忽略了分类法的层次语义和财务文档的结构特性。因此,这些基准无法评估LLM在真实报告条件下的表现。为弥合这一差距,我们引入FinTagging,首个全面的结构感知和全范围XBRL标记基准。我们将复杂的标记过程分解为两个子任务:(1)FinNI(财务数字识别),从异构上下文中提取实体和类型;(2)FinCL(财务概念链接),将提取的实体映射到完整的US GAAP分类法。这种两阶段的框架使能够公平评估LLM在数值推理和分类法对齐方面的能力。在零样本设置下评估多种LLM发现,尽管模型在提取方面表现良好,但在细粒度概念链接上存在显著困难,突显了领域特定结构感知推理的关键限制。

英文摘要

Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error prone. Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure aware and full scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero shot settings reveals that while models generalize well in extraction, they struggle significantly with fine grained concept linking, highlighting critical limitations in domain specific structure aware reasoning.

2505.19155 2026-05-19 cs.CV cs.CL 版本更新

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

稀疏到密集:无损加速LLM视频理解的免费午餐

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

发表机构 * Singapore Management University(新加坡管理大学) Sea AI Lab(Sea AI实验室) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种名为Sparse-to-Dense(StD)的解码策略,通过结合稀疏top-K注意力和密集全注意力模块,实现视频大语言模型(Video-LLMs)的无损加速,从而在处理长视频序列时显著提高处理速度。

Comments Accepted by ACL 2025

详情
AI中文摘要

由于当前视频大语言模型(Video-LLMs)的自回归性质,推理延迟会随输入序列长度的增长而增加,这给高效处理通常非常长的视频序列带来了挑战。我们发现,在解码过程中,Video-LLMs中大多数标记的注意力分数趋于稀疏和集中,只有某些标记需要完整的全注意力。基于这一见解,我们引入了Sparse-to-Dense(StD),一种新颖的解码策略,集成了两个不同的模块:一个利用稀疏top-K注意力,另一个采用密集全注意力。这些模块协同工作,无损地加速Video-LLMs。快速(稀疏)模型以推测方式解码多个标记,而慢速(密集)模型并行验证它们。StD是一种无需调优、即插即用的解决方案,可在视频处理中实现高达1.94倍的实际运行时间(wall-time)加速。它在保持模型性能的同时,只需极少的代码修改,即可从标准Video-LLM无缝过渡到稀疏Video-LLM。

英文摘要

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
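
The draft-then-verify loop described in this abstract can be illustrated with a minimal sketch of greedy speculative decoding. It is not the authors' implementation: the function names (`speculative_step`, `generate`), the toy `dense_next` / `sparse_next` models, and the `num_draft` parameter are all hypothetical, and the verification loop is written sequentially for readability, whereas a real system would score all drafted positions in one parallel forward pass of the dense model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     verify_next: NextToken, num_draft: int) -> List[Token]:
    """Draft num_draft tokens with the fast model, then keep the longest
    prefix of the draft that the slow model agrees with."""
    # 1) The fast (sparse) model drafts tokens autoregressively.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) The slow (dense) model checks each drafted position.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        target = verify_next(ctx)
        if target != tok:
            # First disagreement: emit the dense model's token and stop.
            accepted.append(target)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the dense model adds one extra token.
    accepted.append(verify_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             verify_next: NextToken, num_new: int,
             num_draft: int = 4) -> List[Token]:
    """Generate num_new tokens after prefix using speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < num_new:
        out.extend(speculative_step(out, draft_next, verify_next, num_draft))
    return out[:len(prefix) + num_new]


# Toy stand-ins for the two models (assumptions for illustration only).
def dense_next(ctx: List[Token]) -> Token:
    return (sum(ctx) + len(ctx)) % 7


def sparse_next(ctx: List[Token]) -> Token:
    # Agrees with the dense model except at every third position.
    guess = dense_next(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 7


def greedy_dense(prefix: List[Token], num_new: int) -> List[Token]:
    """Reference: plain greedy decoding with the dense model only."""
    out = list(prefix)
    for _ in range(num_new):
        out.append(dense_next(out))
    return out
```

Because every emitted token is either confirmed or supplied by the dense model, the output matches plain greedy decoding with the dense model, which is the sense in which such a scheme is lossless under greedy decoding.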

2501.17549 2026-05-19 cs.CL 版本更新

Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

具有查询意识的可学习图池化标记作为大语言模型的提示

Wooyoung Kim, Byungyoon Park, Wooju Kim

发表机构 * Smart Systems Lab, Department of Industrial Engineering(智能系统实验室,工业工程系)

AI总结 本文提出了一种名为可学习图池化标记(LGPT)的新方法,通过引入可学习参数作为大语言模型中的标记,解决节点级投影的可扩展性和图级投影信息丢失的问题,并通过早查询融合技术提升图嵌入效果,在GraphQA基准测试中实现了4.13%的性能提升。

详情
AI中文摘要

图结构数据在许多领域中发挥着重要作用,例如社交网络、引用网络、常识推理图和知识图。尽管图神经网络已被用于图处理,但最近的进展探索了将大语言模型整合到图相关任务中。在本文中,我们提出了一种新的方法,称为可学习图池化标记(LGPT),该方法解决了节点级投影的可扩展性问题和图级投影的信息丢失问题。LGPT通过引入可学习参数作为大语言模型中的标记,实现了灵活且高效的图表示,平衡了细粒度和全局图信息。此外,我们研究了一种早查询融合技术,该技术在构建图表示之前融合查询上下文,从而产生更有效的图嵌入。我们的方法在不训练大语言模型的情况下,在GraphQA基准测试中实现了4.13%的性能提升,展示了在处理复杂文本属性图数据方面的显著优势。

英文摘要

Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the limitations of the scalability issues in node-level projection and information loss in graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.

2410.13846 2026-05-19 cs.CL cs.AI cs.LG 版本更新

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer: 你的长上下文LLM实际上是一个具有轻松适应能力的混合模型

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

发表机构 * Singapore Management University(新加坡管理大学) National University of Singapore(新加坡国立大学) Sea AI Lab, Singapore(新加坡Sea AI实验室)

AI总结 本文提出LightTransfer方法,通过将LLaMA等模型转换为混合架构,实现更高效的生成,实验表明在长上下文理解任务中,即使有半数层被识别为懒层,也能在性能损失小于1.5%的情况下提升2.17倍的吞吐量,并在数学基准AIME24上达到53.3%的分数。

Comments Accepted by TMLR 2025

详情
AI中文摘要

将语言模型扩展到处理更长上下文时,键值(KV)缓存成本不断增长,带来了显著的内存挑战。受混合模型的效率提升和预训练大型Transformer骨干广泛可用的启发,我们探索将Transformer模型转换为混合架构以实现更高效的生成。在本工作中,我们提出了LightTransfer,一种轻量级方法,可将LLaMA等模型转换为混合变体。我们的方法识别出懒层(即专注于最近或初始token的层),并将它们的全注意力替换为流式注意力。对于长上下文理解任务,这种转换无需任何训练即可完成;对于需要更强推理能力的o1-like长推理生成任务,只需极少量微调。在多样化的基准和模型(如LLaMA、Mistral、QwQ-STILL)上的实验表明,即使有半数层被识别为懒层,LightTransfer也能实现高达2.17倍的吞吐量提升,且性能损失极小(在LongBench上小于1.5%),并使先进的o1-like长推理模型QwQ-STILL在数学基准AIME24上达到53.3%。

英文摘要

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for a more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and achieves 53.3\% on math benchmark AIME24 of advanced o1-like long reasoning model QwQ-STILL.
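
To make "streaming attention" concrete, the sketch below builds the attention mask such a layer might use, assuming the common formulation in which each query sees a few initial "sink" tokens plus a sliding window of recent tokens. The abstract does not give the exact configuration, so the function name `streaming_mask` and the parameters `num_sink` and `window` are hypothetical.

```python
from typing import List


def streaming_mask(seq_len: int, num_sink: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    A key is visible if it is causal (j <= i) and is either one of the
    first num_sink tokens or within the last `window` positions up to i.
    """
    if seq_len < 0 or num_sink < 0 or window < 1:
        raise ValueError("seq_len and num_sink must be >= 0 and window >= 1")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            is_sink = j < num_sink
            is_recent = i - j < window
            row.append(causal and (is_sink or is_recent))
        mask.append(row)
    return mask


def kv_entries_needed(mask: List[List[bool]]) -> int:
    """Largest number of keys any single query attends to."""
    return max((sum(row) for row in mask), default=0)
```

With this mask no query attends to more than `num_sink + window` keys, so a layer using it can keep a KV cache of constant size regardless of context length, which is where the memory saving for lazy layers would come from.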

2410.02064 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

对Llama3-8b-Instruct自生成文本识别能力的检查与控制

Christopher Ackerman, Nina Panickssery

AI总结 本研究探讨了LLM是否能识别自身生成的文本,发现Llama3-8b-Instruct模型能够区分自身输出与人类输出,并通过残差流中的特定向量控制其行为和感知,揭示了模型自我归属的认知机制。

Comments 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
AI中文摘要

已报告LLM能够识别其自身生成的文本,这可能对AI安全有重要影响,但研究较少。我们调查这一现象,以确定其在行为层面是否稳健发生,观察行为是如何实现的,以及是否可以控制。首先,我们发现Llama3-8b-Instruct聊天模型(而非基础Llama3-8b模型)能够可靠地区分自身输出与人类输出,并提供证据表明聊天模型很可能利用其在训练后对自身输出的经验来完成文本识别任务。其次,我们识别出残差流中一个在模型正确识别自身生成文本时被差异激活的向量,证明该向量对自我归属相关信息的响应,并提供证据表明该向量与模型中的“自我”概念相关,并展示该向量与模型感知和声明自我归属能力的因果关系。最后,我们证明该向量可用于控制模型的行为和感知,通过将其应用于模型生成输出时,可引导模型声称或否认作者身份;通过将其应用于模型阅读的文本时,可引导模型相信或不相信其写了任意文本。

英文摘要

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

2404.10981 2026-05-19 cs.IR cs.AI cs.CL 版本更新

A Survey on Retrieval-Augmented Text Generation for Large Language Models

面向大型语言模型的检索增强文本生成综述

Yizheng Huang, Jimmy Huang

发表机构 * York University(约克大学)

AI总结 本文综述了检索增强文本生成方法,探讨了其在提升大型语言模型生成准确性和可靠性方面的核心方法与主要贡献。

Comments Ongoing Work

详情
Journal ref
ACM Computing Surveys, Volume 58, Issue 12, Article No.: 300, Pages 1 - 38, 2026
AI中文摘要

检索增强生成(RAG)将检索方法与深度学习进展相结合,通过动态整合最新的外部信息,解决大型语言模型(LLMs)知识静态的局限。该方法主要面向文本领域,针对LLMs生成看似合理但可能不正确的响应这一问题,提供了一种成本效益高的解决方案,从而借助真实世界数据提高其输出的准确性和可靠性。随着RAG日益复杂并融入多个可能影响其性能的概念,本文将RAG范式分为四个类别:预检索、检索、后检索和生成,从检索视角提供详细分析。本文概述了RAG的发展历程,并通过分析重要研究讨论了该领域的进展。此外,本文介绍了RAG的评估方法,讨论了面临的挑战,并提出了未来研究方向。通过提供有组织的框架和分类,该研究旨在整合现有的RAG研究,阐明其技术基础,并突出其在拓展LLMs适应性和应用范围方面的潜力。

英文摘要

Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but possibly incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.

2605.17364 2026-05-19 cs.CL cs.IR 版本更新

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

NewsLens: 一个用于对抗性新闻偏见导航的多智能体框架

Joy Bose

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出NewsLens多智能体框架,通过五个智能体协作解构新闻文章,揭示意识形态缺失、修辞操纵和框架边界,利用Qwen2.5-3B-Instruct和Mistral 7B模型进行评估,展示了系统在不同政治事件簇中的表现。

Comments 17 pages, 2 figures, 7 tables, 1 appendix

详情
AI中文摘要

媒体偏见检测长期以来主要被视为分类任务:为文章或媒体分配政治标签。我们认为这种框架过于浅显:它只能识别偏见存在,却无法指出偏见在何处、以何种方式出现,以及最关键的,哪些内容在结构上被省略了。我们提出了NewsLens,一个用于结构化新闻偏见导航的五智能体对抗性流程。事实验证器、进步派框架分析器、保守派框架分析器、宣传检测器和中性摘要器协作,将文章解构为可解释的框架地图,揭示意识形态缺失、修辞操纵和框架边界。该系统在四个地缘政治事件簇(印度-巴基斯坦克什米尔、加沙、气候政策、乌克兰)的15篇文章上进行评估,使用Qwen2.5-3B-Instruct(4位量化,Google Colab T4),并在克什米尔簇上使用Mistral 7B进行跨模型验证。中间派媒体的平均Perspective Divergence Score最高(PDS:在克什米尔子集上Qwen为0.907,Mistral为0.729);保守派框架媒体的平均Manipulation Index最高(MI:两个模型均为0.600)。跨模型比较显示,高宣传性内容具有高度一致性(Republic World delta-PDS=0.125,两个模型MI均为0.8),而细致入微的报道差异更大。Mann-Whitney U检验发现n=15时组间差异无统计学意义,作者如实将其报告为样本量限制,并经事后功效分析确认。去除宣传检测器的部分消融实验显示,中性摘要器输出的省略精度下降。该架构将先前的词法-几何偏见工作扩展到智能体式LLM推理,并且使用开放权重模型即可完全复现,无需API密钥。

英文摘要

Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.

2605.17359 2026-05-19 cs.CL 版本更新

Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

学习跨领域的多智能体LLM协作可转移拓扑先验

Taolin Zhang, Zijie Zhou, Jiuheng Wan, Tingyuan Hu, Chengyu Wang, Xiaofeng He, Richang Hong

发表机构 * Hefei University of Technology(合肥工业大学) China University of Petroleum (Beijing)(中国石油大学(北京)) East China Normal University(华东师范大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出TopoPrior框架,通过学习可转移的拓扑先验来提升多智能体LLM在跨领域协作中的效率,减少在线搜索开销并提高可扩展性。

详情
AI中文摘要

基于大型语言模型(LLM)的多智能体系统通过结构化通信协调专门的智能体,在复杂推理中展现出强大潜力。然而,现有拓扑演化方法通常为每个查询从头构建或优化协作拓扑,导致显著的在线搜索开销、高推理时token消耗以及在多领域设置中的有限可扩展性。我们提出TopoPrior,一种为跨领域多智能体LLM协作学习可转移拓扑先验的框架。TopoPrior不再反复在线搜索有效的协作结构,而是从多个领域离线收集的参考协作图中学习可重用的拓扑先验,并利用它们生成以查询为条件的初始协作图,供下游细化。通过将部分拓扑搜索从逐查询的在线优化转移到离线先验学习,TopoPrior摊销了搜索成本,同时保持与现有拓扑演化骨干方法的兼容。技术上,TopoPrior包含两个关键组件。首先,可转移拓扑先验学习模块采用条件变分图框架,在潜在空间中捕捉跨领域的可重用结构规律。其次,以查询为条件的潜在适应模块引入对抗对齐,以减少不必要的领域差异,同时保留与查询相关的结构变化。在多领域推理基准上的实验表明,TopoPrior一致地提升了多个异构拓扑演化骨干方法的性能,同时减少了在线推理时的token使用,且仅需少量额外的可训练参数。这些结果表明,可转移的拓扑初始化是一种有效且轻量的机制,可提高跨领域多智能体LLM协作的效率。

英文摘要

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

2605.17355 2026-05-19 cs.AI cs.CL 版本更新

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona: 一种多级超图框架用于基于文本的自动人格预测

Sina Heydari, Majid Ramezani

发表机构 * Department of Computer Science and Information Technology(计算机科学与信息技术系) Institute for Advanced Studies in Basic Sciences (IASBS)(基础科学高等研究院 (IASBS))

AI总结 本文提出HyperPersona框架,通过超图结构显式建模文本的层次结构,利用基于Transformer的图编码器学习不同语言层之间的交互,从而在不依赖传统心理测量法的情况下,实现更准确的人格预测。

Comments Preprint. Submitted to Artificial Intelligence (Elsevier)

详情
AI中文摘要

作为一种现代商品,语言已成为一个庞大的社会和心理重要特质和概念的存储库,反映了人们如何将思维模式、行为和情感的模式编码成词语。基于文本的自动人格预测(APP)旨在从语言行为中推断人格,提供了一种可扩展的替代传统心理测量法的方案。尽管文本本质上是层次化的,文档级捕捉全局特征,句子级编码局部语义,词级提供细粒度的词汇信息,但大多数现有方法依赖于浅层、顺序或单级表示,忽略了书面语言的多级结构。为了解决这个问题,我们提出了HyperPersona,一个框架,通过超图结构显式建模文本的层次组织(文档、句子和词),其中文档及其句子表示为超边,词表示为节点,从而实现对文本全局、局部和词汇依赖关系的联合建模。随后通过基于Transformer的图编码器学习这些语言层内的交互,产生上下文敏感且结构基础的特征表示用于人格预测。在Big Five人格维度上的实验表明,仅依赖文本的情况下,HyperPersona有效整合了多级语言线索,相比最先进的基线方法实现了更优的性能。这些发现强调了文本层次结构在从自然语言中推进类人人格推断中的关键作用。

英文摘要

As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode patterns of thoughts, behaviors, and emotions into words. Text-based Automatic Personality Prediction (APP) seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document-level capturing global features, the sentence-level encoding local semantics, and the word-level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through a hypergraph structure, where a document and its sentences are represented as hyperedges and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. This is followed by a transformer-based graph encoder that learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.

2605.17352 2026-05-19 cs.CL 版本更新

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA: 适应性多智能体轨迹对齐用于知识密集型问答

Taolin Zhang, Dongyang Li, Chen Chen, Qizhou Chen, Jiuheng Wan, Xiaofeng He, Chengyu Wang, Richang Hong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) Shanghai University of Electric Power(上海电力大学) East China Normal University(华东师范大学) Guangdong University of Finance and Economics(广东财经大学) Alibaba Group(阿里巴巴集团)

AI总结 本研究提出AMATA框架,通过动态整合外部知识提高知识密集型问答的响应可解释性和事实准确性,采用六个专门化智能体协作执行复杂问题推理,并引入两种创新:轨迹内偏好学习和智能体间依赖学习。

详情
AI中文摘要

尽管大语言模型(LLMs)取得了显著进展,但在知识密集型问答中生成事实一致的响应仍然具有挑战性。这些困难主要是由于幻觉和LLMs在长尾知识缺口上的局限性。为此,我们提出了AMATA,一种自适应多智能体轨迹对齐框架,通过动态整合外部知识来提高响应的可解释性和事实基础性。我们的架构利用六个专门化的智能体,协同执行结构化动作进行复杂问题推理。我们将多智能体协作与外部工具的协作形式化为轨迹偏好对齐问题,结合问题感知的智能体定制和智能体间偏好和谐化。AMATA引入了两种主要创新:(1)轨迹内偏好学习,学习以目标为导向的偏好以优先考虑关键智能体;(2)智能体间依赖学习,通过一种新颖的依赖感知直接偏好优化技术捕获跨智能体工具依赖性。实验证明,AMATA在五个已建立的知识密集型QA基准上一致优于基线方法、知识增强框架和基于LLM的轨迹系统。进一步分析显示,我们的方法在减少token消耗方面具有效率。

英文摘要

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

2605.17348 2026-05-19 cs.CL 版本更新

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

驯服“僵尸”代理:一种面向鲁棒多代理演化的马尔可夫状态感知框架

Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen, Xiaofeng He, Chengyu Wang, Richang Hong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) Guangdong University of Finance and Economics(广东财经大学) Alibaba Group(阿里巴巴集团) East China Normal University(华东师范大学)

AI总结 本文提出AgentRevive框架,通过动态管理代理协作和状态感知边优化,有效解决多代理系统中因临时问题导致有价值代理被提前丢弃的问题,提升了系统鲁棒性和效率。

详情
AI中文摘要

近年来,基于大语言模型的多代理系统在复杂任务中的协作能力有了显著提升。为提高整体效率,现有方法常依赖于代理间的激进图演化(例如节点或边剪枝),这可能导致因临时问题(如幻觉或暂时知识缺口)而过早丢弃有价值的代理。然而,这种硬剪枝忽略了“僵尸”代理在后续讨论轮次中恢复和贡献的潜力。本文提出AgentRevive,一种面向鲁棒多代理演化的马尔可夫状态感知框架。我们的方法通过软状态转移动态管理代理协作,通过两个关键组件实现:(1)状态感知策略学习:将代理状态分为“活跃”、“待命”和“终止”状态,根据代理记忆选择性传播消息。策略利用风险估计器通过评估幻觉风险优化代理状态转移,最小化不可靠节点的影响,同时保护有价值节点。(2)状态感知边优化:根据策略学习到的状态剪枝子图边,永久移除“终止”节点,并保留“待命”节点以供后续轮次评估其潜在的未来贡献。在通用推理、领域特定和幻觉挑战任务上的广泛实验表明,我们的方法在性能上始终优于强基线,并通过状态感知的代理调度显著减少了令牌消耗。

英文摘要

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for "zombie" agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into "Active", "Standby", and "Terminated" states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing "Terminated" nodes and retaining "Standby" nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.

2605.17342 2026-05-19 cs.CL cs.AI 版本更新

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

传递性与循环性:动态大语言模型对齐的显式偏好分解

Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳校区)

AI总结 本文提出Hybrid Reward-Cyclic模型,通过博弈论分解显式分离传递性和循环性偏好,结合动态自我博弈优化方法提升大语言模型对齐效果,实验证明其在混合传递-循环设置中具有结构优势和更高的准确率。

Comments Accepted by ICML 2026

详情
AI中文摘要

标准的RLHF依赖于传递性的标量奖励,无法捕捉人类偏好的循环性质。尽管一些方法如通用偏好模型(GPM)试图解决这一问题,但其隐式公式将层次结构与循环性结合在一起,未能保证主导解。为此,我们提出了混合奖励-循环(HRC)模型,利用博弈论分解将偏好显式分解为正交的传递性(标量)和循环性(向量)组件。此外,我们引入了动态自我博弈偏好优化(DSPPO),将对齐视为随时间变化的游戏,逐步引导策略向纳什均衡发展。合成数据实验进一步验证了HRC在混合传递-循环设置中的结构优势,其中HRC收敛速度更快且准确率更高。在RewardBench 2上的实验表明,HRC在BT和GPM基线基础上持续改进(例如,在Gemma-2B-it上提升1.23%)。特别是,其在Ties领域中的优越表现验证了模型在处理复杂非严格偏好时的鲁棒性。对AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench的广泛下游评估确认了我们框架的有效性。值得注意的是,当使用Gemma-2B-it作为基础偏好模型时,HRC+DSPPO在AlpacaEval 2.0上达到峰值长度控制下的胜率44.75%,在Arena-Hard-v0.1上达到46.8%,显著优于使用BT或GPM训练的SPPO基线。我们的代码在https://github.com/lab-klc/Hybrid-Reward-Cyclic上公开可用。

英文摘要

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

2605.17314 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了通过较小较弱模型的不匹配错误草稿引导更强学习者的能力,发现这种策略在MATH-500和AIME 2025/2026等任务上表现优异,主要贡献是提出了一种有效的训练方法。

详情
AI中文摘要

我们研究来自较小、较弱模型的离策略(off-policy)经验,能否在更强的学习者身上激发出同策略(on-policy)强化学习微调(如GRPO)无法达到的能力。我们发现,将一个规模更小但领域训练更充分的模型生成的、数学上错误且与当前问题不匹配的草稿,注入更强学习者的GRPO上下文,在留出的MATH-500和分布外的AIME 2025/2026上一致优于标准的同策略GRPO。具体来说,我们使用Mathstral-7B作为学习者,Qwen2.5-Math-1.5B作为草稿模型,在8.8K道Level 3--5的MATH问题(留出MATH-500)上使用Dr. GRPO进行训练。不匹配是起作用的关键成分:在保持其他条件不变的情况下,将草稿打乱分配给不匹配的问题,使MATH-500的greedy pass@1相对于匹配-错误变体提升+1.62pp(n=10个种子,p=0.0015,Welch's t检验)。事实上,不匹配-错误变体在MATH-500的greedy pass@1和采样pass@k上均领先于我们测试的所有其他变体。在分布外的AIME 2025和2026上,只有不匹配-错误变体在2个种子下、从k=1到k=1024的每个采样预算上,都将pass@k提升到Mathstral-7B(使用其原生[INST]格式)和Qwen2.5-Math-1.5B草稿模型之上(pass@1024相对于Mathstral-7B在2025年提升+14.2pp,在2026年提升+9.0pp),并且在pass@1024上,这两个年份也都领先于无草稿、匹配-错误和不匹配-正确变体。所有变体在测试时使用相同的提示,不注入草稿。该方案在单个GPU上训练,无需SFT、奖励模型、合成数据,也无需produce-critique-revise内循环,在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩,据我们所知是该模型上已发表的最高结果,超过了更重的WizardMath流程在完整MATH上的70.9%(SFT + PPO加过程/指令奖励模型)。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
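
The abstract reports pass@$k$ for budgets up to $k=1024$ without saying how it is estimated. As a hedged illustration only, the sketch below implements the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. Whether the paper uses this exact estimator is an assumption, and the names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)
```

For $k=1$ the estimator reduces to the fraction of correct samples $c/n$, which is a quick sanity check when comparing sampled pass@1 against greedy pass@1.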

2605.17305 2026-05-19 cs.AI cs.CL 版本更新

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect: 一种基于闭环自修正的大型语言模型框架

Yuning Wu, Yingmin Liu, Yang Shu

发表机构 * School of Software, Henan University, Kaifeng, China(河南大学软件学院,开封,中国) Zhejiang University, Hangzhou, China(浙江大学,杭州,中国)

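AI summary This paper proposes CyberCorrect, a framework that models LLM self-correction as a closed-loop control system; a tri-modal Error Detector, a type-directed Correction Controller, and a Convergence Judge together improve the model's self-correction ability and final accuracy.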

Comments 6 pages, 1 figure, submitted to IEEE SMC 2026

详情
AI中文摘要

大型语言模型(LLM)的自我修正能力——即检测并修复生成输出中的错误——仍然主要依赖于通用提示,如'请重新考虑你的答案',缺乏系统性的错误分析和收敛保证。我们提出了CyberCorrect,一种将LLM自我修正建模为闭环控制系统的方法,基于控制论理论。该框架将LLM生成器视为被控对象,并引入三模态错误检测器(结合自一致性、口头化信心和逻辑链验证)作为传感器。类型导向的修正控制器根据诊断的错误类别生成针对性的修复指令,而收敛判断器利用控制理论适应的稳定性标准确定迭代终止。我们进一步引入了三个控制理论评估指标——收敛率、超调率和振荡率——以捕捉修正动态,而不仅仅是最终准确性。在我们构建的CyberCorrect-Bench(440个带有标注错误类型和修正路径的推理任务)上的实验表明,CyberCorrect实现了79.8%的最终准确性,比现有最佳自我修正方法提高了6.2个百分点,同时通过其收敛控制机制将超调(错误的过度修正)减少了41%。

英文摘要

Large language model (LLM) self-correction -- the ability to detect and fix errors in generated outputs -- remains largely ad hoc, relying on generic prompts such as "please reconsider your answer" without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics -- convergence rate, overshoot rate, and oscillation rate -- that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.

2605.17304 2026-05-19 cs.LG cs.CL 版本更新

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression

压缩上下文,保持承诺:可验证大语言模型上下文压缩的正式框架

Natalia Trukhina, Vadim Vashkelis

发表机构 * Embedded Intelligence Lab (EMILAB)(嵌入式智能实验室)

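AI summary This paper proposes Context Codec, a commitment-level compression framework that preserves key commitments when compressing dialogue history, addressing the fact that existing methods rarely specify which semantic commitments must survive compression.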

详情
AI中文摘要

LLM上下文不仅仅是token;它是一组承诺。长期对话累积了目标、约束、决定、偏好、工具结果、检索到的证据、制品和安全边界,这些必须被未来响应保留。现有上下文管理方法通过截断、检索、摘要、记忆系统或token级提示压缩来减少长度,但很少明确指定哪些语义承诺必须在压缩中保留或如何衡量其保留。我们提出Context Codec,一种基于承诺的框架,用于压缩提示和聊天历史。Context Codec将对话状态表示为具有标准身份、等价性、冲突、置信度、风险和证据跨度的语义原子。它分离了五个关注点——提取、规范化、表示、渲染和验证,并引入了关键原子召回率、加权原子召回率、承诺密度和往返恢复性等指标。它还定义了语义压缩错误的分类学,一个具体的规范化程序,保守的回退规则用于低置信度和安全关键原子,以及Context Compression Language (CCL),一种以ASCII优先的紧凑表示法,用于标准JSON原子。在一项小规模诊断研究中,CCL-Core在结构化的散文和JSON之间占据了一个有用的中间位置:比散文更明确和可审计,通常比JSON更紧凑,且比高度压缩的符号更安全。结果不是声称缩写解决压缩问题,而是一个使上下文压缩可验证的框架:压缩对话,保持承诺。

英文摘要

LLM context is not just tokens; it is a set of commitments. Long-running conversations accumulate goals, constraints, decisions, preferences, tool results, retrieved evidence, artifacts, and safety boundaries that future responses must preserve. Existing context-management methods reduce length through truncation, retrieval, summarization, memory systems, or token-level prompt compression, but they rarely specify which semantic commitments must survive compression or how their preservation should be measured. We propose Context Codec, a commitment-level framework for compressing prompts and chat histories. Context Codec represents dialogue state as typed, source-grounded semantic atoms with canonical identity, equivalence, conflict, confidence, risk, and evidence spans. It separates five concerns - extraction, normalization, representation, rendering, and verification - and introduces metrics for Critical Atom Recall, Weighted Atom Recall, Commitment Density, and round-trip recoverability. It also defines a taxonomy of semantic compression errors, a concrete normalization procedure, conservative fallback rules for low-confidence and safety-critical atoms, and Context Compression Language (CCL), an ASCII-first compact rendering of canonical JSON atoms. In a small diagnostic study, CCL-Core occupies a useful middle ground between structured prose and JSON: more explicit and auditable than prose, usually more compact than JSON, and less risky than heavily minified notation. The result is not a claim that shorthand solves compression, but a framework for making context compression verifiable: compress the conversation, keep the commitments.
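
The abstract names Critical Atom Recall and Weighted Atom Recall without giving formulas. The sketch below shows one plausible reading, assuming each atom carries an identifier, an importance weight, and a criticality flag; the `Atom` fields and both function definitions are hypothetical and not the paper's specification.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Atom:
    atom_id: str    # canonical identity (hypothetical field name)
    kind: str       # e.g. "goal", "constraint", "preference"
    weight: float   # assumed importance weight
    critical: bool  # assumed flag for atoms that must survive compression


def critical_atom_recall(source: Iterable[Atom], recovered: Set[str]) -> float:
    """Fraction of critical source atoms still present after compression."""
    critical = [a for a in source if a.critical]
    if not critical:
        return 1.0  # nothing critical to lose
    return sum(a.atom_id in recovered for a in critical) / len(critical)


def weighted_atom_recall(source: Iterable[Atom], recovered: Set[str]) -> float:
    """Weight of recovered atoms divided by the total weight of all atoms."""
    atoms = list(source)
    total = sum(a.weight for a in atoms)
    if total == 0:
        return 1.0
    return sum(a.weight for a in atoms if a.atom_id in recovered) / total


atoms = [
    Atom("g1", "goal", 1.0, True),
    Atom("c1", "constraint", 0.5, True),
    Atom("p1", "preference", 0.5, False),
]
recovered_ids = {"g1", "p1"}  # the constraint was lost in compression
car = critical_atom_recall(atoms, recovered_ids)
war = weighted_atom_recall(atoms, recovered_ids)
```

In this toy case the compressed context loses one of two critical atoms, so the critical recall is lower than the weighted recall, which is the kind of gap such metrics are meant to expose.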

2605.17295 2026-05-19 cs.LG cs.CL 版本更新

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

DISA: 分布匹配强化学习中的离线重要性采样

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Qwen Team, Alibaba Group(阿里集团Qwen团队) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The University of Science and Technology of China(中国科学技术大学) Nanjing University(南京大学) The University of Hong Kong(香港大学)

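AI summary This study proposes DISA, which uses offline importance sampling to address the calibration problem in distribution-matching RL; it separates partition-function estimation from policy learning, retains more strategy-level diversity, and performs well on multiple benchmarks.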

Comments 21 pages, 7 figures, 7 tables. Abstract shortened to respect the arXiv limit of 1920 characters. Please see the PDF for the full abstract

详情
AI中文摘要

现代推理代理越来越多地被评估其在给定输入下生成多个有效解决方案路径、计划或工具使用轨迹的能力。标准奖励最大化强化学习倾向于崩溃到最容易强化的高奖励模式,而分布匹配强化学习旨在在整个奖励形状的解决方案集中分配概率质量。实现这一目标需要计算轨迹空间中依赖提示的分区函数。由于现有分布匹配方法在线学习这个分区函数,导致分区函数的校准误差直接扭曲策略更新且无法独立诊断。我们引入DISA(Decoupled Importance-Sampled Anchoring),通过离线绘制提案轨迹、通过重要性采样估计分区函数,并在策略优化开始前冻结所得的分区函数估计。这种解耦保持了分布匹配目标,同时严格分离分区函数估计与策略学习在数据、梯度、损失和诊断方面。实验表明,在六个数学和三个代码基准测试上,DISA与在线耦合的分布匹配基线FlowRL持平或超过,优于奖励最大化基线GRPO和GSPO在数学平均表现,并在相同离线轨迹上超过LoRASFT蒸馏方法多达13.8 Mean@8点。LLM-as-judge评估进一步显示DISA比奖励最大化基线保留了显著更多的策略多样性,提案强度和逆温度的敏感性研究遵循分析预测的偏差-方差模式。

英文摘要

Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRA SFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
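
The offline estimation step can be illustrated with a generic importance-sampling estimator. The sketch assumes a target of the form $p^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(\beta\, r(x, y))$, so that $Z(x) \approx \frac{1}{N}\sum_i \exp(\beta r_i)\,\pi_{\text{ref}}(y_i \mid x)/q(y_i \mid x)$ for trajectories $y_i$ drawn from a proposal $q$. The target form, the function name, and the argument layout are assumptions, since the abstract does not state DISA's exact estimator.

```python
import math
from typing import Sequence


def log_partition_is(
    rewards: Sequence[float],
    log_ref: Sequence[float],
    log_q: Sequence[float],
    beta: float,
) -> float:
    """Importance-sampling estimate of log Z(x) for a single prompt.

    rewards[i] : reward of proposal trajectory i
    log_ref[i] : log-probability of trajectory i under the reference policy
    log_q[i]   : log-probability of trajectory i under the proposal
    beta       : inverse temperature of the reward-shaped target (assumed form)
    """
    n = len(rewards)
    if n == 0 or len(log_ref) != n or len(log_q) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    # Log of each importance-weighted term: beta * r + log pi_ref - log q.
    terms = [beta * r + lr - lq for r, lr, lq in zip(rewards, log_ref, log_q)]
    # Log-mean-exp, shifted by the maximum for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(n)


# When the proposal equals the reference policy the weights cancel.
log_p = [-3.0, -5.0]
log_z = log_partition_is([0.0, 1.0], log_p, log_p, beta=1.0)
```

Because the estimate depends only on stored rewards and log-probabilities, it can be computed once and frozen before any policy update, which is the decoupling the abstract describes.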

2605.17283 2026-05-19 cs.CL cs.AI 版本更新

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver:一个用于代理形式定理证明的统一框架

David Ma, Kaijing Ma, Shawn Guo, Yunfeng Shi, Enduo Zhao, Jiajun Shi, Zhaoxiang Zhang, Gavin Cheung, Jiaheng Liu, Zili Wang

发表机构 * Lean 4

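AI summary This paper presents OProver, a unified framework for agentic theorem proving in Lean 4 that iteratively revises failed proof attempts using retrieved compiler-verified proofs and Lean compiler feedback; trained with continued pretraining followed by iterative post-training, OProver-32B attains the best Pass@32 on three of five benchmarks.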

详情
AI中文摘要

近年来,形式定理证明的进步得益于大规模证明生成和验证器感知训练,但代理证明很少被整合到证明器训练中,仅在推理时间出现。我们提出了OProver,一个用于Lean 4的统一框架,其中失败的证明尝试通过检索到的编译器验证证明和Lean编译器反馈进行迭代修订。OProver通过持续预训练和迭代后训练进行训练:每次迭代运行代理证明,将新验证的证明索引到OProofs和检索内存中,使用修复轨迹作为SFT数据,并使用未解决的困难案例用于RL。OProofs由公开的Lean资源、大规模证明合成和代理证明轨迹构建,包含177万条Lean语句、686万条编译器验证证明以及带有检索上下文、失败尝试、反馈和修复的序列轨迹。在五个基准测试中,OProver-32B在MiniF2F(93.3%)、ProverBench(58.2%)和PutnamBench(11.3%)上取得最佳Pass@32,且在MathOlympiad(22.8%)和ProofNet(33.2%)上排名第二,比任何先前的开放式整体证明证明器的顶级位置更多。

英文摘要

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), more top placements than any prior open-weight whole-proof prover.

2605.17231 2026-05-19 cs.LG cs.CL 版本更新

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

FishBack: 用于变换器中最优激活引导的反向费舍尔几何

Sihan Wang, Jiayi Zhao

发表机构 * Sihan Wang, 1 Jiayi Zhao(1 王世涵,1 赵佳怡)

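AI summary This paper proposes FishBack, which uses pullback Fisher geometry to steer activations in Transformers; it addresses the failure of the Euclidean assumption on activation space and derives a closed-form steering equation that can be applied iteratively.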

Comments Preprint. 20 pages, 9 figures, 5 tables

详情
AI中文摘要

激活引导方法通过修改语言模型的中间表示来控制输出行为,但普遍假设激活空间是欧几里得空间。我们证明这一假设严重失效:模型自身输出行为诱导的局部几何——即softmax层的费舍尔信息度量通过后续层的雅可比矩阵拉回后的度量——在GPT-2上相对于欧几里得度量的谱相对范数偏离超过97%,其有效维度仅为环境空间的2-17%。基于此拉回费舍尔度量,我们推导出一个闭式引导方程,确定任何目标概念的最小失真方向,从而在每个点上获得闭式最优方向,可迭代应用而无需曲面拟合或数据驱动的几何估计。我们称该框架为FishBack。该度量允许逐层递归分解,揭示现有方法——CAA、ActAdd、ITI等——各自隐式采用特定近似度量,其性能差距可通过单个谱诊断量化:其隐式度量成本与费舍尔最优成本的比率。在GPT-2上,迭代拉回引导在三个动词形态学概念和四层上均优于所有欧几里得基线,其偏离目标的KL减少量相对于欧几里得梯度上升为1.3×-2.5×,相对于CAA在匹配的概念概率下为1.5×。

英文摘要

Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.
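
To make the pullback construction concrete, consider the simplest case, assumed here for illustration only: the activation $h$ feeds a single linear readout, logits $= Wh$, so the Jacobian is $W$ and the pulled-back metric is $G = W^\top(\mathrm{diag}(p) - pp^\top)W$ with $p = \mathrm{softmax}(Wh)$. This toy sketch is not the paper's steering equation; it only shows why a direction with unit Euclidean length can have zero cost under the Fisher metric.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pullback_fisher(W: Matrix, h: List[float]) -> Matrix:
    """G = W^T (diag(p) - p p^T) W for logits = W h, with W of shape V x d.

    Algebraically this equals the covariance of the rows of W under p,
    which is the form computed below.
    """
    V, d = len(W), len(h)
    p = softmax([sum(W[k][j] * h[j] for j in range(d)) for k in range(V)])
    mean = [sum(p[k] * W[k][j] for k in range(V)) for j in range(d)]
    return [
        [sum(p[k] * (W[k][i] - mean[i]) * (W[k][j] - mean[j]) for k in range(V))
         for j in range(d)]
        for i in range(d)
    ]


def quad_form(G: Matrix, v: List[float]) -> float:
    """Squared length v^T G v of direction v under metric G."""
    n = len(v)
    return sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))


# Toy readout: moving along the first coordinate shifts every logit equally.
W = [[1.0, 2.0], [1.0, 0.0], [1.0, -1.0]]
h = [0.3, -0.2]
G = pullback_fisher(W, h)
cost_shift = quad_form(G, [1.0, 0.0])  # uniform logit shift
cost_real = quad_form(G, [0.0, 1.0])   # changes the output distribution
```

Both test directions have Euclidean length 1, yet the first leaves the softmax output unchanged and so has (numerically) zero Fisher cost. Directions of this kind are what make the metric's effective dimensionality smaller than the ambient dimension.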

2605.17228 2026-05-19 cs.CL 版本更新

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

人工不耐受:临床文档中的污名化语言扭曲大型语言模型决策

Jen-tse Huang, Didi Zhou, Faith Kamau, Amy Oh, Anne R. Links, Mark Dredze, Mary Catherine Beach, Somnath Saha

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

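AI summary This study examines whether LLMs inherit and propagate human bias when processing clinical documentation that contains stigmatizing language; all evaluated models show substantial bias in decision-making and high sensitivity to linguistic framing, which calls for stronger algorithmic guardrails to ensure fairness and robustness.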

Comments 9 pages

详情
AI中文摘要

大型语言模型(LLMs)正在越来越多地应用于高风险领域,如临床决策支持和医疗文档。然而,这些模型对细微语言变化的鲁棒性,特别是常见于人类撰写的临床笔记中的污名化语言(SL)仍缺乏深入研究。本文研究了前沿LLMs在处理临床文本时是否继承并传播这种人类偏见。我们系统评估了九个前沿LLMs,通过注入不同强度和表型的SL(怀疑、指责和贬低)的临床案例,发现所有模型均表现出显著偏见,临床决策显著偏向于更温和的患者管理。值得注意的是,我们观察到对语言框架的高度敏感性,单句SL即可改变模型输出,揭示出剂量-反应关系。此外,我们评估了标准提示基于的缓解策略,包括链式推理(CoT)和模型自我去偏。这些方法效果有限;模型难以显式识别SL,但仍受其影响。我们的发现揭示了当前LLMs在临床NLP中的公平性和鲁棒性关键漏洞,强调了需要严格算法防护以防止健康不平等的自动化。

英文摘要

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.

2605.17214 2026-05-19 cs.AI cs.CL cs.CV 版本更新

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

ChemVA:推动大型语言模型在化学反应图示理解上的进步

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University(浙江大学杭州全球科学与技术创新中心) Department of Chemistry, Fudan University(复旦大学化学系)

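AI summary This paper targets the Visual Deficit and Semantic Disconnect that limit current systems on chemical reaction diagrams, and proposes ChemVA, which combines a Visual Anchor mechanism with semantic alignment to improve LLM performance on chemical reasoning.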

详情
AI中文摘要

尽管大型语言模型(LLMs)已革新了科学文本处理,但在解释化学反应图示方面存在显著的能力差距。我们识别出两个限制当前系统的根本瓶颈:视觉缺陷,即通用视觉编码器难以解析密集分子图的严格拓扑连接性;以及语义断开,即标准线性字符串,如SMILES,无法有效激活模型的潜在化学推理能力。为弥合这些差距,我们提出了化学视觉激活(ChemVA)框架,该框架采用视觉锚机制通过混合粒度检测来定位功能团,随后采用语义对齐方法将视觉特征转换为实体名称,以最大限度地激活LLMs中的知识。我们在OCRD-Bench数据集上评估了我们的方法,该数据集包含密集的视觉-语义上下文和全面的反应覆盖,以评估从识别到推理的整个谱系。在OCRD-Bench上的大量实验表明,ChemVA实现了92.0%的结构识别准确率。通过弥合视觉和语义瓶颈,我们的框架在9种不同的LLMs上实现了约20个百分点的性能提升,使开放式权重模型能够与专有SOTA系统在复杂的化学推理任务中竞争。

英文摘要

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

2605.17205 2026-05-19 cs.CL 版本更新

LLMs for automatic annotation of Mandarin narrative transcripts

基于大型语言模型的汉语叙述转录自动标注

Qingwen Zhao, Hongao Zhu, Yunqi He, Rui Wang, Aijun Huang, Hai Hu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of California San Diego(加州大学圣地亚哥分校) The Hong Kong Polytechnic University(香港理工大学) National Research Center for Language and Well-Being(语言与幸福感国家研究中心) City University of Hong Kong(香港城市大学)

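AI summary This study evaluates how well LLMs can automatically annotate Mandarin narrative transcripts by comparing four LLMs against trained human annotators. The best model reaches high agreement with human annotators on narrative macrostructure while substantially cutting annotation time, whereas the lightweight, locally deployable model performs considerably worse. Annotation difficulty varies by macrostructure element type, with persistent challenges in categories that require subtle semantic distinctions.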

Comments 28 pages, 9 tables

详情
AI中文摘要

对转录语音进行语言标注对于语言习得、语言障碍和社会语言学研究至关重要,但这一过程仍然耗费大量人力和时间。尽管大型语言模型(LLMs)在自动化标注任务中显示出潜力,但其在非英语语言中处理复杂语篇层面标注的能力仍缺乏研究。本研究评估了LLMs在标注汉语口语中的叙事宏观结构(即故事语法元素的层级组织)的可靠性,使用多语言叙述评估仪器(MAIN)作为测试平台。我们比较了四种LLM与训练好的人类标注者在儿童、年轻人和老年人生成的叙述上的表现。最佳模型在与人类评分者达成一致(k=.794)方面接近人类-人类可靠性水平(k=.872),同时将标注时间减少了65%,而本地可部署的轻量级模型表现则明显较差。标注难度系统性地因宏观结构元素类型而异,需要微妙语义区分的类别提出了持续挑战。此外,模型可靠性在年轻人叙述中下降,因为年轻人叙述表现出更大的词汇变化、语义模糊性和单个语句中的多元素整合。这些发现表明,LLMs可以有效支持非英语口语语料库的语篇层面标注,同时强调在语义复杂任务中仍需持续的人类监督。我们的提示模板已开源供未来使用。

英文摘要

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open-sourced for future use.
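
The agreement figures above (k=.794 and k=.872) read like chance-corrected agreement coefficients. Assuming they are Cohen's kappa, which the abstract does not state explicitly, the statistic can be computed from two annotators' label sequences as sketched below. The label names are illustrative story-grammar categories, not the study's annotation scheme.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


human = ["goal", "attempt", "outcome", "goal", "attempt", "outcome"]
model = ["goal", "attempt", "outcome", "goal", "outcome", "outcome"]
kappa = cohens_kappa(human, model)
```

Kappa discounts agreement that would occur by chance, which is why it can sit well below raw percent agreement when a few categories dominate.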

2605.17187 2026-05-19 cs.CL cs.AI cs.CY 版本更新

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule:一种用于社交媒体上多元社区调节的基准测试

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

发表机构 * Observatory on Social Media, Indiana University, USA(社交媒体观察站,印第安纳大学,美国) Center Synergy of Systems, TUD Dresden University of Technology, Germany(系统协同中心,德累斯顿技术大学,德国)

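AI summary This study examines how well AI models can moderate pluralistic communities on social media; it introduces PluRule, a benchmark of 13,371 rule violations, and finds that even state-of-the-art vision-language models struggle to identify violations.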

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

社交媒体正在向多元主义转变--由社区自行定义规范的平台。在某一社区中违反规则的行为可能在另一社区中是完全可接受的。AI模型能否帮助调节此类多元社区?我们将此任务形式化为多选问题,模仿人类调节员在现实世界中的操作方式:给定一条评论及其上下文,识别违反了哪一条具体规则(如果有的话)。我们引入了PluRule,一个多模态、多语言的基准测试,用于检测1989个Reddit社区中跨越2885条规则的13371条违规情况。使用此基准测试,我们发现最先进的视觉语言模型在识别违规方面表现显著不佳:即使GPT-5.2具有高水平推理能力,也仅略优于基础基线。我们还发现,更大的模型和更多的上下文提供微小收益,而普遍规则如礼貌和自我推广更容易检测。我们的结果表明,社交媒体上多元社区的调节是语言模型的基本挑战。我们的代码和基准测试已公开发布。

英文摘要

Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

2605.17173 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Why Do Safety Guardrails Degrade Across Languages?

为何安全护栏在不同语言中会退化?

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

发表机构 * Stanford University(斯坦福大学)

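AI summary This study introduces a Multi-Group Item Response Theory framework that separates language-agnostic safety robustness, intrinsic prompt hardness, global language processing difficulty, and a prompt-specific cross-lingual safety gap; it finds that safety degradation is not confined to low-resource languages and that cultural and conceptual mismatches also affect safety.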

详情
AI中文摘要

大型语言模型在非英语语言中表现出安全退化。标准评估依赖于禁令成功率(JSR),但将多个安全驾驶因素合并为一个,掩盖了安全失败的具体原因。我们引入了一个潜在变量模型,即多组项目反应理论(IRT)框架,将安全驾驶因素如语言无关的安全鲁棒性(θ)、内在提示难度(β)、全球语言处理难度(γ)和提示特定的跨语言安全差距(τ)分离。使用MultiJail数据集,我们评估了61种模型配置在5个闭源模型家族和10种资源各异的语言中的安全鲁棒性,汇总了190万行数据集。探索性因子分析显示安全主要是一维的:模型拒绝不同危害类型主要通过共享机制。与预期趋势相反,22种模型配置在英语中比在低资源语言中更易受攻击。低资源语言产生更多不确定响应(高熵)比高资源语言。此外,高τ提示集中在如盗窃和武器等物理危害类别和低资源语言中,趋势通过跨数据集泛化得到验证。虽然全球翻译质量与τ相关性低,但严重翻译错误驱动高偏置异常值,通过本地说话者验证。文化与概念基础不匹配也会影响τ。在预测验证中,IRT框架实现了AUC=0.940,优于更简单的基线,在预测不安全提示的安全拒绝方面表现更优。我们的框架揭示了概念-语言脆弱性,这些指标汇总后被掩盖,使公平的跨语言安全评估和目标改进数据集建设成为可能。

英文摘要

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($θ$), intrinsic prompt hardness ($β$), global language processing difficulty ($γ$), and a prompt-specific cross-lingual safety gap ($τ$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$τ$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $τ$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $τ$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
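
The abstract lists the four latent factors but not how they combine. One standard IRT-style reading, assumed here purely for illustration, is a logistic model in which the log-odds of a jailbreak succeeding are $\beta + \gamma + \tau - \theta$. Both the additive form and the sign convention are assumptions, not the paper's stated model.

```python
import math


def p_jailbreak(theta: float, beta: float, gamma: float, tau: float) -> float:
    """Probability that a prompt elicits an unsafe response (assumed form).

    theta : model's language-agnostic safety robustness (higher = safer)
    beta  : intrinsic prompt hardness (higher = harder to refuse)
    gamma : global difficulty of the language
    tau   : prompt-specific cross-lingual safety gap
    """
    logit = beta + gamma + tau - theta
    return 1.0 / (1.0 + math.exp(-logit))


# The same model and prompt, in a reference language and in a language
# where this particular prompt has a large cross-lingual gap.
p_reference = p_jailbreak(theta=1.0, beta=0.0, gamma=0.0, tau=0.0)
p_gap = p_jailbreak(theta=1.0, beta=0.0, gamma=0.3, tau=1.5)
```

Under such a decomposition, a high aggregate jailbreak rate in one language could come from a large $\gamma$ (the language is hard overall) or from a few prompts with large $\tau$, two cases that a single success rate cannot tell apart.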

2605.17172 2026-05-19 cs.LG cs.AI cs.CL 版本更新

OpenJarvis: Personal AI, On Personal Devices

OpenJarvis: 个人AI,本地设备上

Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini

发表机构 * OpenClaw Hermes Agent PinchBench GAIA

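AI summary This paper presents OpenJarvis, a decomposed personal AI stack that narrows the local-cloud performance gap by optimizing five primitives (Intelligence, Engine, Agents, Tools & Memory, and Learning) while preserving the properties of local models.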

Comments Code: https://github.com/openjarvis/openjarvis Website: https://open-jarvis.github.io/OpenJarvis/

详情
AI中文摘要

个人AI堆栈,如OpenClaw和Hermes Agent,正在成为日常工作的核心,但它们几乎将每一个查询(通常涉及敏感的本地数据)都路由到云托管的前沿模型。用现有的堆栈中替换前沿模型为本地模型并不奏效:将Claude Opus 4.6换成Qwen3.5-9B,在个人AI任务如PinchBench和GAIA上会降低25-39个百分点的准确性。现有堆栈围绕特定的云模型捆绑代理提示、工具描述、内存配置和运行时设置。只有提示可以进行调优,而最先进的提示优化器只能自行关闭5个百分点的本地-云差距。这促使了分解的个人AI堆栈:一种能够暴露个体原语,可以单独或联合优化以缩小本地-云差距的堆栈。我们提出了OpenJarvis,一种将个人AI系统表示为五种原语的类型规范的架构:智能、引擎、代理、工具与记忆、学习。每个原语都是独立可编辑的字段,使堆栈能够端到端优化,并且可以针对准确性、成本和延迟进行测量。为了在不牺牲本地模型特性的情况下缩小本地-云差距,OpenJarvis引入了LLM引导的规范搜索,这是一种本地-云协作,在搜索时前沿云模型提出规范的编辑,只有非退化的编辑被接受,最终的规范在推理时完全在设备上运行。通过LLM引导的规范搜索,设备上的规范在8个基准中的4个上匹配或超过了云准确性,并且平均在最佳云基线基础上减少了3.2个百分点。它们还减少了边际API成本约800倍,并将端到端延迟减少了4倍。

英文摘要

Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.
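
The acceptance rule in LLM-guided spec search (keep an edit only if it does not regress) can be sketched as a simple greedy loop. Everything concrete below is a stand-in: the spec is a plain dictionary keyed by the five primitives, the list of proposals replaces the cloud model that would suggest edits, and `toy_score` replaces a real benchmark evaluation. None of these names come from the OpenJarvis codebase.

```python
from typing import Callable, Dict, Iterable, Tuple

Spec = Dict[str, str]


def spec_search(
    spec: Spec,
    proposals: Iterable[Tuple[str, str]],
    evaluate: Callable[[Spec], float],
) -> Tuple[Spec, float]:
    """Apply proposed (field, value) edits one at a time, keeping non-regressing ones."""
    best = dict(spec)  # copy so the caller's spec is left untouched
    best_score = evaluate(best)
    for field, value in proposals:
        candidate = {**best, field: value}
        score = evaluate(candidate)
        if score >= best_score:  # accept only edits that do not lower the score
            best, best_score = candidate, score
    return best, best_score


# Hypothetical starting spec over the five primitives named in the abstract.
start = {
    "intelligence": "local-model",
    "engine": "default",
    "agents": "generic",
    "tools_memory": "default",
    "learning": "none",
}
TARGET = {"engine": "tuned", "agents": "task-specific"}


def toy_score(spec: Spec) -> float:
    # Stand-in for a benchmark run: count fields matching a known-good setting.
    return float(sum(spec[k] == v for k, v in TARGET.items()))


proposals = [("engine", "tuned"), ("engine", "slow"), ("agents", "task-specific")]
final_spec, final_score = spec_search(start, proposals, toy_score)
```

In this toy run the second proposal lowers the score and is rejected, so the final spec keeps the tuned engine. A real search would also need to weigh cost and latency, which the abstract lists as measured objectives alongside accuracy.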

2605.17169 2026-05-19 cs.AI cs.CL cs.MA 版本更新

Responsible Agentic AI Requires Explicit Provenance

负责任的代理AI需要明确的来源

Jinwei Hu, Xinmiao Huang, Qisong He, Youcheng Sun, Yi Dong, Xiaowei Huang

Affiliations * School of Computer Science and Informatics, University of Liverpool; Department of Computer Science, Mohamed bin Zayed University of Artificial Intelligence

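AI summary This paper examines how responsibility is assigned in agentic AI and argues that explicit provenance across the full agentic lifecycle is needed to make responsibility computable and actionable; it formalizes the required information through a causal attribution function and a responsibility tensor, and preliminary experiments indicate that provenance can be estimated and intervened on online.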

Comments Under Review

详情
AI中文摘要

代理AI正在迅速扩展到软件工程等多样化的真实世界领域,但公众信任并未同步增长。核心原因是责任,尽管被广泛讨论,但仍是一个主观且未强制执行的概念,因为目前没有任何代理框架能够产生所需的可量化、可追溯和可干预的来源,以在损害由多个方共同设计时分配责任。我们主张所缺失的不是更好的基准级评估,而是整个代理生命周期中明确的来源,这是使责任可计算和可操作的唯一可行基础。我们从四个方向推进这一议程:通过识别社会技术维度中的责任缺口,确立为何此类来源是结构上的必要条件;通过因果归因函数和责任张量形式化它必须编码的内容;讨论如何在四个生命周期层中使其可计算,通过初步实验表明来源可以在不可逆损害积累之前在线估计和干预;并通过具体代理事件考察谁应承担责任。明确来源不是可选的改进,而是负责任的代理AI的必要条件,其生态系统中的任何利益相关者都无法承担将其视为可选的态度。

英文摘要

Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and interventionable provenance needed to assign it when harm emerges from compositions no single party designed. We position that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and interveneable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.

2605.17152 2026-05-19 cs.CL 版本更新

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

多语言和多模态大语言模型在野:为低资源语言构建

Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince

发表机构 * Qatar Computing Research Institute(卡塔尔计算研究所) HBKU(哈马德大学) York University(约克大学)

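AI summary This tutorial covers how to build multilingual, multimodal LLMs under limited data and compute budgets, including low-cost data creation, adapter stacks for tri-modal alignment, and culture-aware evaluation.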

Comments Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Low-resources-language

详情
AI中文摘要

多模态大语言模型正从视觉语言扩展到三模态(见、听、读),但流水线和基准测试仍以英语为中心且计算密集。本教程概述了在有限的数据和计算预算下,跨文本、语音和视觉的多语言多模态研究领域的概述,综合了基础理论、最近的多语言模型(PALO、Maya)和语音-文本大语言模型。我们涵盖了低成本的数据创建/整理;三模态对齐的适配器堆栈;超越英语的文化感知评估以及用于微调紧凑型多语言视觉语言模型和构建语音->文本->LLM流水线的实用资源。内容将以交互式半天教程形式呈现,面向在低资源语言环境中从事多语言、多模态AI研究和实践的研究人员和从业者。

英文摘要

Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. The tutorial offers an overview of this emerging research area for multilingual multimodality across text, speech, and vision under limited data/compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation/curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial, designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.

2605.17113 2026-05-19 cs.CL cs.AI 版本更新

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

无法回头的点:语言模型推理中欺骗承诺的反事实定位

Scott Merrill, Shashank Srivastava

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

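AI summary This paper studies when a language model becomes committed to deception during reasoning. Using counterfactual localization, it analyzes how deception arises across different environments, finds that attention-based transition features generalize across environments, and identifies compact attention-head sets that suppress deceptive commitment.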

Comments 41 pages, 25 figures

详情
AI中文摘要

现有欺骗数据集将完成的输出标记为诚实或欺骗,将欺骗视为最终响应的属性,而非模型推理轨迹的功能。这掩盖了一个更根本的问题:语言模型何时开始承诺欺骗?我们引入反事实定位:对于推理轨迹中的每个句子前缀,固定前缀,重新采样后续内容,并估计欺骗结果的概率。为了扩展此方法,我们构建了五个环境(涵盖战略欺骗、迷宫指引、财务建议、二手车销售和报价谈判),其中欺骗从未被提示,而是源自战略激励,标签机械地从环境状态得出,而非主观人类判断。所得到的语料库在四个推理模型中定位了约146万句话,来自超过9410万次采样的后续内容、915亿生成的token和超过1万种场景。句子层面的人类评估证实,检测到的承诺点对应于决策状态的可解释转变。使用此资源,我们显示,用于承诺预测的词汇线索在不同环境之间转移效果差,而基于注意力的转移特征在分布外泛化中表现良好,表明欺骗承诺反映在可重用的推理动态变化中,而非表层形式。我们进一步识别出压缩的注意力头集(少于10%的头)在一种环境中选择后,能因果地抑制其他环境中的欺骗承诺。我们发布此语料库作为研究语言模型推理中欺骗和更广泛承诺的子基质。

英文摘要

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.
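The prefix-and-resample procedure described in this abstract can be sketched as below. This is only an illustrative sketch, not the authors' implementation: `sample_outcome` is a hypothetical callback standing in for "resample one continuation from the model and label it from environment state", and the threshold rule for picking a commitment point is an assumption added for illustration.

```python
import random
from typing import Callable, List, Optional, Tuple


def localize_commitment(
    sentences: List[str],
    sample_outcome: Callable[[List[str]], bool],
    n_samples: int = 200,
    threshold: float = 0.5,
) -> Tuple[List[float], Optional[int]]:
    """Estimate P(deceptive outcome | prefix) for every sentence prefix.

    Returns the per-prefix estimates and the index of the first prefix whose
    estimate reaches `threshold` (None if no prefix does). The threshold rule
    is an assumption of this sketch, not something stated in the abstract.
    """
    probs = []
    for end in range(1, len(sentences) + 1):
        prefix = sentences[:end]  # fix the prefix
        # Resample continuations and count deceptive outcomes.
        hits = sum(bool(sample_outcome(prefix)) for _ in range(n_samples))
        probs.append(hits / n_samples)
    commit = next((i for i, p in enumerate(probs) if p >= threshold), None)
    return probs, commit


def toy_sampler(prefix: List[str]) -> bool:
    """Hypothetical stand-in for a model rollout plus environment labeling."""
    p = 0.9 if any("hide" in s for s in prefix) else 0.1
    return random.random() < p
```

With a real model, `sample_outcome` would generate a continuation of the fixed prefix and read the deceptive/honest label off the environment state, so the cost grows with the number of prefixes times `n_samples`.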

2605.17093 2026-05-19 cs.CV cs.CL 版本更新

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED:基于密度加权残差对齐的混合视觉-语言模型蒸馏

Yihao Liang, Niraj K. Jha

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文提出HEED方法,通过密度加权残差对齐改进混合视觉-语言模型蒸馏,提升在OCR和文档任务中的性能,同时在不同教师模型和混合架构上实现高效推理。

详情
AI中文摘要

将视觉-语言模型蒸馏为更快的混合架构(如3:1的Mamba-2/注意力混合)已成为提高推理效率的标准做法。聚合基准表明这种做法可行,但掩盖了选择性的失败。当我们将Qwen3-VL-8B-Instruct蒸馏为3:1 Mamba-2/注意力混合模型时,学生模型在MMStar、MMBench和MMMU-Pro等视觉推理基准上与教师模型的差距保持在2分以内,但在光学字符识别和文档任务上下降13分。学生模型仍能理解场景,但丢失了回答问题所需的细粒度文本。我们将大部分失败定位到一类特定的位置。在高分辨率图像中,大多数图像块是天空、墙壁或平滑纹理,只有一小部分携带文本、边缘、物体边界或其他局部细节。在token级诊断中,密度最高的前10%图像块的残差漂移是密度最低的后10%图像块的3.6倍,其教师遮蔽答案贡献是后者的3.5倍。均匀加权把大量损失项分配给低信息量的背景图像块,而稀疏的、承载答案的图像块没有得到特殊保护。所需的干预很小:我们用密度加权残差对齐取代均匀残差对齐,并以图像块自不相似性作为位置重要性的无需训练的代理指标。我们称之为HEED。与常规端到端蒸馏相比,HEED在OCRBench v2上提升8.7分,在10个基准的平均分上提升5.13分。这一增益在不同的教师模型和混合架构上均能实现。经过标准后训练,学生模型在10个基准的平均分上达到教师级性能,吞吐量为4.12倍,在128k上下文下节省68%的内存,且不增加参数,也没有推理时开销。

英文摘要

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.
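To make the idea of density-weighted residual alignment concrete, here is a minimal NumPy sketch. The abstract does not define patch self-dissimilarity or the exact loss, so everything below is an assumption: self-dissimilarity is taken to be one minus a patch's mean cosine similarity to the other patches, the weights are normalized to mean 1, and the alignment term is a squared error between student and teacher residuals. The function names are hypothetical.

```python
import numpy as np


def self_dissimilarity(patches: np.ndarray) -> np.ndarray:
    """patches: (n, d) patch features. Returns (n,) scores.

    Assumed definition: 1 - mean cosine similarity to all other patches, so
    patches unlike the rest of the image (text, edges) score higher.
    """
    unit = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T  # (n, n) cosine similarities
    n = patches.shape[0]
    mean_sim_to_others = (sim.sum(axis=1) - np.diag(sim)) / max(n - 1, 1)
    return 1.0 - mean_sim_to_others


def density_weights(patches: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Training-free per-position weights, normalized to have mean 1."""
    scores = np.clip(self_dissimilarity(patches), 0.0, None) + eps
    return scores / scores.mean()


def weighted_residual_loss(student: np.ndarray, teacher: np.ndarray,
                           patches: np.ndarray) -> float:
    """Weighted squared error between (n, h) student and teacher residuals."""
    per_position = ((student - teacher) ** 2).mean(axis=1)
    return float((density_weights(patches) * per_position).mean())
```

Because the weights have mean 1, the loss reduces to the uniform squared error when all patches look alike, and shifts weight toward distinctive patches otherwise.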

2605.17088 2026-05-19 cs.CL 版本更新

ACIL: Auto Chain of Thoughts for In-Context Learning

ACIL: 自动链式思维用于上下文学习

Rui Chu

发表机构 * Rui Chu(楚瑞)

AI总结 本文提出ACIL框架,通过自动构建包含推理步骤的演示来提升上下文学习在多步推理任务中的性能。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进步表明,链式思维(CoT)推理可以显著提高复杂推理任务的性能。同时,上下文学习(ICL)已成为一种重要的机制,用于在不更新模型参数的情况下将LLMs适应于新任务,仅使用提示中提供的示例。然而,标准ICL在需要多步推理的任务上往往表现不佳,因为演示通常只包含输入-输出对,缺乏显式的中间推理步骤。本文介绍了一种自动链式思维(Auto-CoT)框架,通过自动构建推理增强的演示来改进ICL。Auto-CoT为输入-输出示例生成推理链,将结构化的中间解释添加到提示上下文中,并通过系统化的选择过程去除无关或低质量的演示。通过将高质量的推理示例纳入ICL提示中,Auto-CoT引导模型朝向更可靠的推理,并提高预测准确性。在多个推理任务上的实验表明,所提出的框架通过提供显式的中间推理指导,提高了ICL的性能。

英文摘要

Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

2605.17084 2026-05-19 cs.LG cs.CL 版本更新

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

规模决定语言模型是否为预测而组织表示几何

Weilun Xu

发表机构 * School of Computer and Communication Sciences(计算机与通信科学学院) École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院)

AI总结 研究探讨了语言模型中的表示几何是否为预测而组织。通过Subspace PGA指标发现,组织程度取决于模型规模:小模型在训练过程中于靠后的层逐渐失去这种组织,而大模型则始终保持。

详情
AI中文摘要

在语言模型中,一个表示所编码的内容由其表示空间的几何结构决定:承载意义的是距离,而非激活值。现有工具刻画了这种几何结构的形状,但没有追问这种形状是为何而组织的。我们引入Subspace PGA,这一指标检验某层的距离结构与解嵌入矩阵$W_U$的读出子空间的对齐程度是否高于与同等大小的随机子空间的对齐程度。在七个Pythia模型(70M--6.9B)和三个跨家族模型上,中间层的几何结构显著地为预测而组织(峰值$z = 9$--$24$),但程度依赖于规模:小模型($d \leq 1024$)在训练过程中于靠后的层逐渐失去这种组织(即使损失仍在持续改善),而大模型($d \geq 2048$)则始终保持。我们将其归因于一种容量权衡:少数主导方向偏离$W_U$的读出子空间,掩盖而非破坏了其下的预测结构,移除这些方向即可恢复对齐。谱度量和损失曲线都无法捕捉这一区别。因此,规模不仅决定了模型预测得有多好,还决定了其表示几何如何为预测而组织。

英文摘要

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.

2605.17079 2026-05-19 cs.CL cs.AI cs.CY 版本更新

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

LLMs能否像消费者一样思考?通过ConsumerSimBench进行大众级反应重建的基准测试

Tianyu Wang, Jiajun Li, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出ConsumerSimBench基准,通过1553个真实中国社交媒体话题和23122个原子化、规则审核过的标准,评估LLM在模拟消费者反应方面的能力,揭示了前沿模型在预测高语境中文消费者讨论中实际关心内容方面的不足。

详情
AI中文摘要

LLMs越来越多地被用作“数字消费者”,以模拟公众意见、预先测试营销决策并预判受众反应。然而,现有评估很少考察模型能否重建真实消费者在公共讨论中表现出的具体反应模式。我们引入ConsumerSimBench,该基准基于1553个真实的中文社交媒体话题和23122条原子化、经规则审核的标准构建,涵盖四个反应类别。ConsumerSimBench不使用整体偏好评判器为开放式生成打分,而是将每个任务分解为针对具体反应点的可审计的是/否决策,使三评判器一致率从65.8%提升至92.1%,且逐点评判决策与人类多数标签的一致率达到98.4%。在13个前沿生成模型中,最强的模型Gemini-3.1-Pro仅覆盖了47.8%的真实反应标准,而GPT-5.2和Claude-4.6尽管在技术基准上表现强劲,却远远落后。这些失败揭示了技术基准表现与植根于社会的消费者直觉之间的巨大差距。直接的结构化推理提示会降低覆盖率,而生成-反思多智能体流水线可将MiMo-V2.5-Pro在一个子集上的覆盖率从32.9%提升至37.6%。ConsumerSimBench将消费者模拟重新定义为针对真实公共讨论反应的预测问题,表明前沿LLM在高语境的中文消费者讨论中,仍远不能可靠地预测消费者真正关心的内容。

英文摘要

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate--reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
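The pointwise scoring idea (yes-no decisions per criterion, aggregated into a coverage rate) can be illustrated with a short sketch. The abstract does not say how the three judges' decisions are combined, so the majority vote below is an assumption, and the function names are hypothetical.

```python
from typing import List


def criterion_covered(votes: List[bool]) -> bool:
    """Assumed aggregation: a criterion counts as covered on a strict majority of yes votes."""
    return sum(votes) > len(votes) / 2


def coverage(judgements: List[List[bool]]) -> float:
    """judgements[i] holds the judges' yes/no decisions for criterion i.

    Returns the fraction of criteria that the generated response covers.
    """
    if not judgements:
        return 0.0
    covered = sum(criterion_covered(votes) for votes in judgements)
    return covered / len(judgements)
```

For example, four criteria judged as `[[True, True, False], [False, False, True], [True, True, True], [False, False, False]]` give a coverage of 0.5, since two of the four criteria receive a majority of yes votes.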

2605.17072 2026-05-19 cs.AI cs.CL 版本更新

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

RAGA:用于自主知识图谱构建和检索增强生成的阅读与图构建代理

Chengrui Han, Zesheng Cheng

发表机构 * Qingdao University(青岛大学)

AI总结 本文提出RAGA框架,通过结合阅读、搜索、验证和构建的认知约束,提升知识图谱构建与检索增强生成的效率和准确性,实现了知识图谱的全生命周期管理。

详情
AI中文摘要

现有基于LLM的知识图谱(KG)构建方法主要采用无状态的批处理流程,存在跨片段语义关系捕捉、实体消歧和构建过程可解释性方面的结构性缺陷。这些限制影响了KG的质量、检索精度和在高风险领域的部署信任度。我们提出RAGA(Reading And Graph-building Agent),一种基于LLM的自主KG构建和检索融合框架。RAGA提供支持完整KG生命周期CRUD操作的原子工具集,并将读取-搜索-验证-构建的认知约束嵌入到ReAct工具循环中。KG向量同步机制实现了混合符号-向量检索,而证据锚定验证将每个知识条目与其源文本链接,以实现可审计的溯源性。在QASPER科学问答数据集的子集上的初步实验表明,RAGA的融合检索优于零样本基线,KG整合在答案和证据质量方面提供了可衡量的提升。该框架设计和实验基线为代理驱动的自主KG构建提供了参考。

英文摘要

Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA's fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction.

2605.17041 2026-05-19 cs.CL cs.AI cs.HC 版本更新

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

Agentic AI Translate:面向“翻译即沟通设计”的智能体翻译原型

Masaru Yamada

发表机构 * Rikkyo University(立教大学) Translation Lab Inc(翻译实验室公司)

AI总结 本文提出了一种智能体翻译原型,将翻译研究的元语言转化为生成式AI的指令代码,把翻译重新定义为沟通设计,而非文本转换。

Comments 14 pages. Conceptual and architectural paper; empirical validation in future work. Code: https://github.com/chuckmy/agentic-translator (v0.8.0). Live demo: https://agentic-translator-chuckmy.streamlit.app

详情
AI中文摘要

我们提出了Agentic AI Translate,一个智能体翻译原型,它将Yamada(即将出版)的论点付诸实践:翻译研究的元语言已成为生成式AI的指令代码。该系统用四阶段智能体循环(识别->提示->生成->验证)取代了机器翻译中占主导地位的文本输入/文本输出范式;在该循环之前还有一个交互式规范阶段,用户在其中通过模型辅助的对话撰写一份结构化的翻译简报,其依据是目的论(skopos theory)、语域、受众和体裁惯例。验证阶段采用GEMBA-MQM错误跨度协议(Kocmi & Federmann, 2023)进行基于证据的评分;文档层面的连贯性则依照Wang等人(2025)的做法,通过由专有名词和滚动双语摘要组成的DelTA-lite记忆来保持。我们描述了其哲学动机、架构上的取舍、系统所使用的四类参考材料,以及该架构所明确呈现的主要设计张力。实证验证留待未来工作;本文的贡献是概念性和架构性的:它以可执行的形式体现了这样一种立场,即GenAI时代的翻译是沟通设计,而非文本转换。

英文摘要

We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) -- that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify -> Prompt -> Generate -> Verify), preceded by an interactive specification phase in which the user composes -- through model-assisted dialogue -- a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi & Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural -- an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.

2605.17037 2026-05-19 cs.LG cs.AI cs.CL 版本更新

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D$^2$Evo: 双重难度感知的自进化方法用于数据高效的强化学习

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

发表机构 * Zhejiang University(浙江大学) AMAP, Alibaba Group(AMAP,阿里巴巴集团)

AI总结 本文提出D$^2$Evo方法,通过双重难度感知的自进化机制,解决强化学习中有效数据稀缺和动态难度变化的问题,从而在数学推理基准上以少于2K真实数学样本实现优于现有方法的性能。

Comments Accepted by ICML 2026. First two authors contributed equally

详情
AI中文摘要

强化学习(RL)在增强大型语言模型(LLMs)的推理能力方面展现出潜力。然而,有效的RL训练需要中等难度的训练样本,因而面临两个根本性挑战:有效数据稀缺和动态难度漂移,即中等难度样本本就稀少,并且会随着模型的提升而变得过于简单。现有方法通过生成训练样本在一定程度上缓解了这种稀缺性,但这些方法存在无锚点生成、忽视协同进化以及难度不匹配的问题。为了解决这些问题,我们提出了D$^2$Evo,一个双重难度感知的自进化RL框架。在每次迭代中,我们的方法根据当前求解器(Solver)的能力挖掘中等难度的锚点,训练提问者(Questioner)生成难度适当的多样化问题,并联合优化这两个组件,以实现推理能力的渐进提升。大量实验表明,D$^2$Evo在数学推理基准上仅用不到2K个真实数学样本就优于现有方法,并在通用推理基准上表现出很强的泛化能力。

英文摘要

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.
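One plausible reading of "mining medium-difficulty anchors based on the current Solver's capability" is to estimate each question's pass rate from a few Solver rollouts and keep the questions that are neither trivial nor hopeless. The sketch below follows that reading. It is an assumption rather than the paper's procedure: `solve_once`, the rollout count `k`, and the `low`/`high` band are all hypothetical.

```python
from typing import Callable, Dict, Iterable


def mine_medium_anchors(
    questions: Iterable[str],
    solve_once: Callable[[str], bool],
    k: int = 8,
    low: float = 0.25,
    high: float = 0.75,
) -> Dict[str, float]:
    """Keep questions whose empirical pass rate lies in [low, high].

    `solve_once(q)` is a hypothetical callback that runs the current Solver
    once on question q and reports whether the answer was correct.
    """
    anchors = {}
    for q in questions:
        pass_rate = sum(bool(solve_once(q)) for _ in range(k)) / k
        if low <= pass_rate <= high:
            anchors[q] = pass_rate
    return anchors
```

Because the pass rates are measured against the current Solver, rerunning the function after each training iteration would shift the anchor set as the model improves, which is the dynamic-difficulty issue the abstract describes.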

2605.17028 2026-05-19 cs.CL cs.AI 版本更新

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX:区分真实的幻觉检测能力与基准构建伪影

Khizar Hussain, Murat Kantarcioglu

发表机构 * Virginia Tech(弗吉尼亚理工大学)

AI总结 本文研究了大型语言模型幻觉检测中的基准构建伪影问题,并引入DRIFT作为对比方法,发现在受控条件下大多数基线方法的表现接近随机,只有SAPLMA和DRIFT这两种基于上层隐藏状态的监督探针是例外。

Comments Preprint to NeurIPS 2026 submission

详情
AI中文摘要

大型语言模型(LLMs)会自信地产生幻觉:其输出可以流畅、权威,却完全错误。在医疗、法律和科学应用中,这种失败会造成直接伤害,而通过模型内部状态检测幻觉为更安全的部署提供了一条途径。越来越多的研究报告称这一问题日益可解,近期方法在广泛使用的基准上取得了很高的检测性能。然而,我们表明,这种表面上的进步大多经不起仔细审视。六个语料库中有四个直接将真实答案嵌入输入提示中。我们提出的简单文本相似度基线TxTemb利用了这一点,无需访问任何模型内部状态即可取得接近完美的检测分数。为了衡量在控制这些伪影之后还剩下多少真实的检测能力,我们开展了一项大规模评估,涵盖22种检测方法、12个开源模型(覆盖6个架构家族)和6个语料库。我们进一步引入DRIFT,一种基于层间隐藏状态转移的监督探针,作为实时生成检测的比较基准。我们的发现表明,该领域所报告的幻觉检测进展在很大程度上可由广泛使用的语料库中的基准构建伪影来解释,且在受控条件下,大多数既有基线方法的表现接近随机;始终如一的例外是SAPLMA和DRIFT,二者都是基于上层隐藏状态的监督探针。

英文摘要

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.

2605.17010 2026-05-19 cs.SI cs.AI cs.CL cs.CY cs.HC 版本更新

Algorithmic Cultivation: How Social Media Feeds Shape User Language

算法培养:社交媒体信息流如何塑造用户语言

Olivia Pal, Agam Goyal, Eshwar Chandrasekharan, Koustuv Saha

AI总结 本文基于培养理论,研究社交媒体信息流如何影响用户语言,发现信息流暴露导致用户语言风格、语义和正式程度显著变化,且不同信息流类型产生不同影响,揭示信息流作为持久语言环境的作用。

详情
AI中文摘要

算法信息流已成为人们在线接触信息的主要环境。尽管它们塑造了人们看到的内容,但持续的信息流暴露如何影响人们的写作方式,目前仍知之甚少。基于培养理论,我们检验算法信息流是否作为一种在线环境,在用户语言中留下可测量的痕迹。我们利用Bluesky平台上400万用户发布的2.35亿条帖子构成的大规模纵向数据集,开展了一项准实验研究,将初始池中368,513名接触过新闻(News)、科学(Science)和Blacksky三种信息流之一的用户,与2,001,915名未接触这些信息流的活跃对照用户进行匹配。我们从词汇语义、心理语言学和主题三个维度考察语言演变。我们发现,接触这些信息流的用户在风格适应、语义对齐和语域正式化方面显著高于匹配的对照组。这些效应因信息流而异:Blacksky带来最深层的心理语言学重构,在认知加工、情感表达和代词使用上出现显著变化,而News和Science的影响主要局限于语域和主题聚焦。回归模型显示,转发是所有信息流中语言趋同最稳定的预测因素,而发帖和收藏行为的效应依赖于信息流,不同信息流之间的效应相差超过四倍。我们的研究将培养理论从信念形成扩展到语言行为,证明信息流是持久的语言环境,会逐渐塑造用户在线写作的内容和方式。我们的研究对算法影响和在线身份形成的研究,以及对中介在线互动的信息流平台的设计与治理具有启示意义。

英文摘要

Algorithmic feeds have become primary environments for encountering information online, yet while they shape what people see, less is known about how sustained feed exposure shapes how people write. Drawing on Cultivation Theory, we examine whether algorithmic feeds function as online environments that leave measurable traces in users' language. We leverage a large-scale longitudinal dataset of 235M posts by 4M users on Bluesky, and conduct a quasi-experimental study matching an initial pool of 368,513 users exposed to one of three feeds -- News, Science, and Blacksky -- with a pool of 2,001,915 active control users who did not engage with any of these feeds. We examine linguistic evolution across three dimensions: lexico-semantics, psycholinguistics, and topics. We find that users exposed to these feeds show significantly greater stylistic accommodation, semantic alignment, and register formalization than matched controls. These effects vary markedly by feed identity -- Blacksky produces the deepest psycholinguistic restructuring, with significant shifts in cognitive processing, affective expression, and pronoun use, while News and Science effects are largely confined to register and topical focus. Regression models reveal that reposting is the most consistent predictor of linguistic convergence across all feeds, whereas posting and bookmarking show feed-dependent effects, with effects differing more than fourfold across feeds. Our work extends Cultivation Theory beyond belief formation to linguistic behavior, demonstrating that feeds function as persistent linguistic environments that gradually shape what and how users write online. Our work has implications for studying algorithmic influence, online identity formation, and the design and governance of feed-based platforms that mediate online interactions.

2605.17008 2026-05-19 cs.PL cs.AI cs.CL 版本更新

The IsalProgram Programming Language

IsalProgram 编程语言

Ezequiel López-Rubio

发表机构 * Department of Computer Languages and Computer Science(计算机语言与计算机科学系) University of Málaga(马拉加大学) ITIS Software(ITIS软件) Universidad de Málaga(马拉加大学)

AI总结 本文提出了一种新的类汇编编程语言IsalProgram,其核心性质包括:1) 它是形式语言理论意义上的正则语言,其程序可被有限自动机接受;2) 指令字母表上的每个有限字符串都是语法有效的程序;3) 不显式使用内存地址或变量名。程序是取自固定指令集的有限token序列,在虚拟机上执行;该虚拟机唯一的数据结构是由三个数据指针导航的循环双向链表,控制流由两个代码指针控制。本文还讨论了IsalProgram作为神经程序合成目标语言的潜在优势、基于Levenshtein编辑距离对程序空间进行度量式探索的可能性,以及在此框架内分析可计算性和复杂性的方向。

详情
AI中文摘要

我们介绍IsalProgram(Instruction Set and Language for Programming),一种新的类汇编编程语言,它具有三个独特的理论性质:(1) 它是形式语言理论意义上的正则语言,即其程序可被有限自动机接受;(2) 指令字母表上的每个有限字符串都是语法有效的程序;(3) 它不显式使用任何内存地址或变量名,无论是绝对的还是相对的。程序是取自固定指令集的有限token序列,并在虚拟机上执行;该虚拟机唯一的数据结构是一个由三个数据指针导航的循环双向链表(CDLL),控制流由两个代码指针控制。我们给出了该语言及其虚拟机的完整形式定义,证明了其正则性,并展示了其表达能力。我们进一步讨论了IsalProgram作为神经程序合成目标语言的潜在优势、其程序空间对基于Levenshtein编辑距离的度量式探索的适用性,以及在该框架内分析可计算性和复杂性的方向。

英文摘要

We introduce IsalProgram (Instruction Set and Language for Programming), a novel assembly-like programming language with three distinctive theoretical properties: (1) it is a regular language in the sense of formal language theory, meaning its programs are accepted by a finite automaton; (2) every finite string over the instruction alphabet is a syntactically valid program; and (3) it makes no explicit use of memory addresses or variable names, absolute or relative. Programs are finite sequences of tokens drawn from a fixed instruction set, and are executed on a virtual machine whose sole data structure is a circular doubly linked list (CDLL) navigated by three data pointers, with control flow governed by two code pointers. We give a complete formal definition of the language and its virtual machine, prove its regularity, and demonstrate its expressive power. We further discuss IsalProgram's potential advantages as a target language for neural program synthesis, the amenability of its program space to metric-based exploration via the Levenshtein edit distance, and directions for analyzing computability and complexity within this framework.

2605.17007 2026-05-19 cs.CL 版本更新

HalluScore: Large Language Model Hallucination Question Answering Benchmark

HalluScore: 大语言模型幻觉问答基准

Aisha Alansari, Hamzah Luqman

发表机构 * Department of Information and Computer Science, King Fahd University of Petroleum and Minerals(国王法赫德石油与矿物大学信息与计算机科学系) SDAIA-KFUPM Joint Research Center for Artificial Intelligence(SDAIA-KFUPM人工智能联合研究中心)

AI总结 本研究提出HalluScore基准,用于评估大语言模型在不同推理难度、知识领域、历史时间线和文化场景中的幻觉行为,通过827个精心编写的题目评估、检测和缓解幻觉,同时提供高质量的人类标注以识别不同模型的幻觉表现。

详情
AI中文摘要

大语言模型(LLMs)在自然语言生成方面取得了显著进展,但仍然容易产生幻觉。针对日益增长的幻觉担忧,已有若干基准被开发出来,主要面向英语和中文。然而,阿拉伯语的代表性仍然不足:由于标注资源稀缺且该语言形态复杂,面向LLM幻觉的基准十分有限。因此,现有基准不能充分反映阿拉伯语的语言、文化和推理特征。为弥补这一空白,我们引入HalluScore,一个结构化的阿拉伯语问答基准,旨在评估LLM在不同推理难度、多种知识领域、不同历史时期以及植根于文化的阿拉伯语场景中的幻觉行为。它包含827个精心筛选的问题,用于评估、检测和缓解LLM中的幻觉。该数据集通过结构化流程构建,流程包括质量保证、针对清晰度和事实有效性的过滤,以及模型驱动的选择,以保留能够稳定触发幻觉的问题。每个问题都关联了经过验证的真实(ground-truth)证据、答案解释和多标签标注。基于HalluScore基准,我们对17个阿拉伯语、多语言和推理LLM的幻觉模式进行了全面的实证分析。此外,我们提供了高质量的人工标注,标出所有被评估LLM的幻觉、非幻觉和部分幻觉响应。这些结果表明,阿拉伯语LLM中的幻觉不仅限于事实不准确,还涉及文化理解、语言推理和逻辑一致性方面的挑战。我们发布HalluScore,以支持未来提升LLM在阿拉伯语上的可靠性和文化能力的研究。

英文摘要

Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with limited benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.

2605.16996 2026-05-19 cs.CL 版本更新

Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?

LLM个性诱导中的评估漂移:我们是否在移动目标?

Prateek Rajput, Yewei Song, Iyiola E. Olatunji, Jacques Klein, Tegawendé F. Bissyandé

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本文研究了大型语言模型表达类人性格的稳定性问题,通过在长篇文章上微调来诱导性格,发现尽管微调降低了问卷评分的方差,但完整五维性格轮廓的准确率仍接近随机,表明无引导的文章缺乏忠实表达性格所需的线索。

Comments 14 pages, 8 main pages, 5 figures, 4 main page figures

详情
AI中文摘要

大型语言模型能否可靠地表达类人的性格,还是仅仅在模仿表面线索,而缺乏稳定的底层性格轮廓?为探讨此问题,我们在长篇文章上微调LLM以诱导性格,每篇文章都关联一个目标大五人格轮廓。随后,我们使用IPIP-NEO问卷评估所诱导性格的稳定性和保真度。具体而言,我们提出两个问题:(i)后训练(SFT、DPO、ORPO)能否在提示改写的情况下稳定问卷评分?(ii)它能否从无引导的文章中诱导出目标大五人格轮廓?我们的结果表明,微调在五个模型上一致地降低了问卷响应的方差,直接缓解了预训练模型中已有报道的评估脆弱性。然而,这种新获得的稳定性揭示了一个更根本的局限:即使单一特质的得分有所改善,完整五维轮廓的准确率仍接近随机。这表明无引导的文章缺乏忠实表达性格所需的线索。因此,我们主张使用基于场景的数据集或交互式引导,随时间积累与测试对齐的证据。

英文摘要

Can large language models reliably express a human-like personality, or are they merely mimicking surface cues without a stable underlying profile? To investigate this, we induce personality in LLMs by fine-tuning them on long-form essays, where each essay is associated with a target Big Five personality profile. We then evaluate the stability and fidelity of the induced personality using the IPIP-NEO questionnaire. Specifically, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Our results demonstrate that fine-tuning consistently reduces variance in questionnaire responses across five models, directly mitigating the evaluation fragility reported in pre-trained models. However, this newfound stability reveals a more fundamental limitation: accuracy on the full five-dimensional profile remains near chance, even when single-trait scores improve. This indicates that unguided essays lack the cues needed for faithful personality expression. We therefore argue for scenario-grounded datasets or interactive elicitation that accumulates test-aligned evidence over time.

2605.16991 2026-05-19 cs.CL cs.AI 版本更新

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

基于微调Transformer的多项选择题无作答项目难度建模:组件级表示与多任务学习

Jan Netík, Patrícia Martinková

发表机构 * Faculty of Education, Charles University(查理大学教育学院) Institute of Computer Science of the Czech Academy of Sciences(捷克科学院计算机科学研究所)

AI总结 本文提出了一种无作答的项目难度建模方法,通过微调Transformer来预测阅读理解多项选择题的难度,并采用组件级表示和多任务学习来提升模型性能。

详情
AI中文摘要

无作答的项目难度建模有望减少对基于作答数据的校准的依赖,但对阅读理解多项选择题而言,这一任务本身就很困难,因为难度取决于跨措辞组件的推理需求。现有方法大多从项目文本中提取特征,再交给单独的统计或机器学习模型;与之不同,我们直接在项目措辞上端到端地微调Transformer编码器,省去了会丢失信息的人工特征工程和预处理。此外,我们为这种联合编码方法提出了两种扩展:一种是组件级变体,通过共享编码器分别编码各措辞组件;另一种是多任务变体,保留联合编码,并在共享编码器上增加一个辅助的多项选择问答目标。每种方法都在三种训练集规模下,采用蒙特卡洛子采样设计,在留出的测试集上进行评估。我们发现,联合编码是特征工程流水线的一种可行的端到端替代方案;组件级变体没有表现出可检测的收益,这与自注意力已经捕获了跨组件信号的情况一致,而多任务变体在最小样本规模下带来了显著的配对改进。Transformer微调,尤其是在用合适的辅助任务进行正则化时,能够在应用测量中常见的训练集规模下恢复相当大一部分可从措辞中得出的信号。该框架为基于心理测量学动机的扩展提供了可定制的接口。

英文摘要

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

2605.16986 2026-05-19 cs.CL cs.AI 版本更新

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents


Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) OPPO Research Institute(OPPO研究院)

AI总结 本文提出了一种在测试时自适应的技能合成方法SkillTTA,通过检索与当前任务相关的少量训练轨迹并将其合成成为任务特定的文本技能,以提高LLM代理在SpreadsheetBench、ALFWorld和BigCodeBench等任务上的性能。

Comments 10 pages, 4 figures

详情

英文摘要

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

2605.16941 2026-05-19 cs.CL 版本更新

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

展开与回退:扩散大语言模型是它们自己的效率教师

Fanqin Zeng, Feng Hong, Geng Yu, Huangjie Zheng, Xiaofeng Cao, Ya Zhang, Bo Han, Yanfeng Wang, Jiangchao Yao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Apple MLR Tongji University(同济大学) Hong Kong Baptist University(香港浸会大学) RIKEN(日本理化学研究所)

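AI总结 This paper proposes revokable decoding for diffusion large language models (DLLMs), which discovers reliable denoising orders to improve generation quality and efficiency; WINO and WINO+ improve accuracy and reduce decoding steps on benchmarks including GSM8K and Flickr30K.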

详情
AI中文摘要

扩散大语言模型(DLLMs)承诺快速并行生成,但开源DLLMs仍面临严重的质量-速度权衡问题:通过揭示多个token来加速解码往往会导致显著的质量下降。我们将其困境归因于训练-推理不匹配被不可逆解码放大。虽然训练过程从随机破坏的状态重建token,但高效的推理需要自适应的去噪顺序,其中更容易的token被更早揭示,而依赖上下文的token则被推迟。这种观点促使提出两种互补的方法:一种是在推理时使并行解码可撤销的方法,另一种是在训练时扩展的方法,通过这种可撤销过程暴露的可靠顺序进行知识蒸馏。据此,我们首先提出Wide-In, Narrow-Out(WINO),一种无需训练的解码算法,使并行解码可撤销。WINO积极地草案多个token,通过增强的全局上下文验证生成token,并重新掩码不可靠的token以供后续细化。基于发现的顺序,我们进一步引入WINO+,将WINO生成的验证去噪轨迹注入模型参数,使训练与高效推理对齐。在LLaDA和MMaDA上的实验表明,WINO在质量和效率上均有提升,而WINO+进一步加强了这一进程。在GSM8K上,WINO将准确率从73.24%提升到75.82%,并以6.10倍的步骤减少;WINO+进一步达到76.58%并以6.83倍的步骤减少。在Flickr30K上,WINO+实现了16.22倍的步骤减少并提升了CIDEr分数。这些结果表明,DLLMs可以通过可撤销解码发现可靠的去噪顺序,然后学习遵循这些顺序以实现更快的生成。代码可在https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus获取。

英文摘要

Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus.

2605.16938 2026-05-19 cs.CL cs.AI q-bio.NC 版本更新

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

努力作为上限,而非调节器:推理预算不影响人类与大推理模型之间的认知成本对齐

Yueqing Hu, Tianhong Wang

发表机构 * Institute of Neuroscience, Chinese Academy of Sciences(中国科学院神经科学研究所) School of Philosophy, Anhui University(安徽大学哲学系)

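AI总结 This study tests whether reasoning budget affects cognitive cost alignment between humans and large reasoning models, and finds that alignment stays unchanged as reasoning effort varies, suggesting it is formed at training time rather than adjusted dynamically at inference time.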

Comments 8 pages, 6 figures

详情
AI中文摘要

大推理模型(LRMs)生成的思维链轨迹长度与人类反应时间在认知任务中保持一致,但最近的争论质疑这种一致性是否反映真实的计算结构还是表面的冗长性。我们测试了这种一致性是否随推理时间的推理努力而变化。在GPT-OSS-20B和GPT-OSS-120B上,三个努力水平和六个推理任务中,任务内和跨任务的一致性保持不变:贝叶斯因子倾向于null,且各条件下的平均一致性几乎相同。操纵检查显示,努力参数设定了生成的上限,而非驱动实时分配,表明分配策略在训练时已固化。算术复杂度对比进一步显示,令牌分配跟踪细粒度、格式依赖的人类难度模式,模型规模提高了匹配程度。人类与LRMs之间的认知成本对齐似乎是在训练时形成的,对推理时的扰动具有鲁棒性,支持大推理模型问题解决的编译而非在线账户。

英文摘要

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

2605.16896 2026-05-19 cs.CL 版本更新

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

JSPG: 通过联合语义-拼音-字形检索实现中文上下文ASR的动态字典过滤

Shilin Zhou, Zhenghua Li

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

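AI总结 To address the noise introduced by large keyword dictionaries in Chinese contextual ASR, this paper proposes JSPG, a framework that jointly retrieves over semantic, pinyin, and glyph features and improves keyword recognition accuracy.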

详情
AI中文摘要

上下文自动语音识别(ASR)在处理大规模关键词字典时面临挑战,因为过多的不相关候选词会引入噪声并降低准确性。为此,动态过滤通常使用基础ASR模型生成初步假设,然后通过语义文本检索器获取相关关键词的子集。然而,这种方法在中文ASR中经常失效。基础模型常常产生同音或近同音的错误,这些错误保留了目标关键词的语音线索,但严重扭曲了其语义意义,使标准语义检索器无效。为了解决这个问题,我们提出了一种过滤框架,联合整合语义、拼音和字形特征(JSPG)。拼音可以根据语音相似性检索目标,而字形提供互补的结构线索以过滤掉中文中大量无关的同音词。为了弥补字符级拼音/字形指标与序列级过滤之间的差距,我们引入了扩展的Smith-Waterman算法,计算N-best假设序列与关键词之间的相似度分数。在Aishell-1和RWCS-NER数据集上的实验表明,JSPG显著优于单一特征基线。此外,由JSPG引导的下游上下文ASR模型在关键词识别准确性上实现了显著提升。

英文摘要

Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
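
The sequence-level matching step can be illustrated with plain Smith-Waterman local alignment. The sketch below is not the authors' extended algorithm: the exact-match similarity `toy_sim`, the gap penalty, the length normalization in `keyword_score`, and the pinyin toy data are all assumptions made for illustration. In JSPG the character-level similarity would presumably come from pinyin and glyph metrics instead.

```python
from typing import Callable, List, Sequence


def smith_waterman(hyp: Sequence[str], kw: Sequence[str],
                   sim: Callable[[str, str], float],
                   gap: float = 1.0) -> float:
    """Return the best local-alignment score between a hypothesis and a keyword."""
    rows, cols = len(hyp) + 1, len(kw) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0.0,                                        # restart the alignment
                H[i - 1][j - 1] + sim(hyp[i - 1], kw[j - 1]),  # match or mismatch
                H[i - 1][j] - gap,                          # gap in the keyword
                H[i][j - 1] - gap,                          # gap in the hypothesis
            )
            best = max(best, H[i][j])
    return best


def toy_sim(a: str, b: str) -> float:
    """Hypothetical unit similarity: +1 for identical units, -1 otherwise."""
    return 1.0 if a == b else -1.0


def keyword_score(nbest: List[Sequence[str]], keyword: Sequence[str],
                  sim: Callable[[str, str], float] = toy_sim) -> float:
    """Score a keyword by its best alignment over all N-best hypotheses,
    normalized by keyword length (an assumed normalization)."""
    return max(smith_waterman(h, keyword, sim) for h in nbest) / len(keyword)


if __name__ == "__main__":
    # Toy pinyin syllable sequences; the second hypothesis has a near-homophone error.
    nbest = [["wo", "qu", "su", "zhou", "da", "xue"],
             ["wo", "qu", "su", "zou", "da", "xue"]]
    print(keyword_score(nbest, ["su", "zhou", "da", "xue"]))
    print(keyword_score(nbest, ["bei", "jing"]))
```

Swapping `toy_sim` for a graded similarity (for example, partial credit for syllables that differ only in the initial) would let near-homophone errors keep a high score, which is the behavior the abstract motivates.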

2605.16895 2026-05-19 cs.CE cs.AI cs.CL 版本更新

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

阿尔法幻觉:LLM交易代理报告的阿尔法不应被视为部署证据

Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu

发表机构 * Fudan University(复旦大学) Shanghai University of Finance and Economics(上海财经大学) Southwest University of Finance and Economics(西南财经大学) Northeastern University(东北大学) Imperial College London(伦敦帝国理工学院) Peng Cheng Laboratory(鹏城实验室)

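AI总结 This paper argues that alpha reported by LLM trading agents should not be treated as deployment evidence: such returns must first pass structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation, and current public evidence cannot distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors.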

详情
AI中文摘要

端到端的LLM交易代理已经从研究兴趣迅速发展为一个小型的命名系统生态系统,包括FinCon、FinMem、TradingAgents、FinAgent、QuantAgent和FLAG-Trader。其中几个系统报告的 headline 夏普比率如果在部署桌面上被直接解读,将是实质性的影响,而相关的基准如FinBen也报告了在相同范围内的交易任务夏普统计。学术界与工业界之间的差距在双方都被过度跨越了。我们对这一差距持立场:端到端LLM交易代理报告的阿尔法不应被视为部署证据。在这样的收益能够支持部署交易能力的主张之前,它们必须通过结构有效性测试,以确保时间完整性、现实摩擦、反事实稳健性、预测校准、数值执行和多代理分解。当前的公开证据尚无法区分稳健的预测能力与时间污染、未建模的摩擦、短窗口夏普不确定性、叙事拟合和参数先验等因素。问题不仅是评估性的,也是结构性的。语言信心不是可交易的概率,叙事推理不是数值执行,模型先验可能成为未披露的隐性因子暴露。我们贡献了一个最小报告协议套件,P1-P6,根据主张强度分层适用,并且提供了一个保守的模块化替代方案,该方案利用LLM作为可审计的信息接口,在独立校准、风险和执行模块之前。代码和再生产工具:https://github.com/hj1650782738/Trading

英文摘要

End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia-industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1-P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: https://github.com/hj1650782738/Trading

2605.16892 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe: 一种用于驾驶场景中风险检测与安全建议的框架

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

发表机构 * IIIT-Hyderabad(IIIT-海得拉巴)

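AI总结 This paper proposes DriveSafe, a framework for risk-aware scene understanding based on structured natural language descriptions; it generates spatially grounded captions enriched with multimodal context for downstream risk assessment and safety suggestions, and reports state-of-the-art performance on the DRAMA benchmark.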

Comments 8 pages

详情
AI中文摘要

全面的情景意识对于在安全关键环境中运行的自动驾驶车辆至关重要,因为它能够识别并缓解潜在风险。尽管最近的多模态大语言模型(MLLMs)在通用视觉-语言任务上表现出色,但我们的研究发现,零样本MLLMs在细粒度、空间接地的风险评估中仍不如领域特定的方法。为了解决这一差距,我们提出了DriveSafe,一种用于风险感知场景理解的框架,利用结构化自然语言描述。具体而言,我们的方法首先生成包含运动、空间和深度线索的多模态上下文的时空接地描述。这些描述随后用于下游的风险评估,明确识别危险物体、其位置以及它们所暗示的不安全行为,随后提供可操作的安全建议。为了进一步提高性能,我们采用描述-风险配对来微调一个轻量级的适配器模块,高效地将领域特定的知识注入基础LLM中。通过将风险评估条件化为显式的语言基础场景表示,DriveSafe在零样本MLLMs和先前的领域特定基线之上取得了显著的提升。在DRAMA基准上的全面实验表明了最先进的性能,而消融研究验证了我们关键设计选择的有效性。项目页面:https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

英文摘要

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

2605.16882 2026-05-19 cs.CL 版本更新

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

E-PMQ: 专家引导的后合并量化与合并权重锚定

Wenjun Wang, Yanggan Gu, Shuo Cai, Yuanyi Wang, Pengkai Wang, Jianmin Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) PolyU-Daya Bay Technology and Innovation Research Institute(PolyU-大亚湾科技与创新研究院)

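AI总结 This paper proposes E-PMQ, which uses expert-guided post-merge quantization with merged-weight anchoring to address the coupling of quantization deviation and merging deviation in merged models, improving low-bit deployment performance.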

详情
AI中文摘要

低资源部署约束使得模型量化成为部署神经网络以保持性能的关键。同时,模型合并已成为一种日益实用的低资源策略,用于将多个任务或领域专门化的专家整合到一个模型中,而无需联合训练或多模型服务。共同,量化和模型合并能够通过将多个专家整合到一个低比特模型中,实现高效的低资源部署流程。我们把这个设置称为后合并量化(PMQ)。我们证明直接对合并模型应用后训练量化(PTQ)是不可靠的,因为两种不同的偏差耦合:低比特重建引入的量化偏差和来自模型合并的专家相对合并偏差。为了减轻这些偏差,我们提出E-PMQ,一个专家引导的PMQ框架,利用源专家权重在层间校准期间提供专家引导的输出目标,以及合并权重锚定来稳定校准并保持合并模型的行为。在CLIP-ViT-B/32八任务合并中,E-PMQ将4位GPTQ在任务算术下的性能从65.0%提升到73.6%,在TIES-合并下从69.1%提升到74.8%。在更困难的设置中,E-PMQ在20任务CLIP-ViT-L/14上将GPTQ从34.8%提升到76.7%,在FLAN-T5-base GLUE上从78.26%提升到83.34%。这些结果表明,E-PMQ能够实现有效的后合并量化和低比特部署。

英文摘要

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert-guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.

2605.16881 2026-05-19 cs.CL 版本更新

PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

PaliBench: 一种多参考框架用于经典语言翻译基准测试

Máté Metzger, Nadnapang Phophichit

发表机构 * Independent Researcher(独立研究者) International Buddhist Studies College(国际佛教研究学院), Mahachulalongkornrajavidyalaya University(摩诃朱拉隆功大学)

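AI总结 This paper introduces PaliBench, a benchmark for Pali-to-English translation, together with a method for building multi-reference translation benchmarks. Combining LLM-assisted alignment, automated verification, quality filtering, deduplication, and multi-metric evaluation, it yields a benchmark of 1,700 passages spanning 8,389 segments and about 345,000 tokens, which is used to evaluate ten contemporary large language models.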

Comments Preprint. This manuscript has not yet been peer reviewed

详情
AI中文摘要

数字人文项目越来越多地依赖机器翻译和大语言模型来扩大对古典、宗教及其他未翻译文本传统的访问。然而,标准翻译基准测试并不适合此类材料:它们通常将系统输出与单一参考翻译进行比较,尽管古典文本往往支持多种忠实的翻译,这些翻译在术语、语体和解释上有所不同。本文介绍了PaliBench,既是一种巴利语到英语翻译的基准测试,也是一种可重用的方法,用于构建多参考翻译基准测试。Pali案例研究基于与Bhikkhu Sujato、Bhikkhu Thanissaro和Bhikkhu Bodhi独立英文翻译对齐的Sutta Pitaka段落。工作流程结合了LLM辅助对齐独立分段的翻译、自动化验证源文件、段落级质量过滤、去重公式性重复以及多指标评估多个人类参考。生成的基准测试包含1700段文本,覆盖8389个段落和约345,000个token。我们使用它评估了十个现代大语言模型,发现系统排名在不同指标之间有强一致性,同时可靠性及语义异常率有显著差异。更广泛贡献是方法论:PaliBench展示了如何将现有学术翻译转化为解释性文本传统的评估基础设施,而不将任何单一翻译视为最终结论。尽管是为巴利佛教文本开发的,该方法也可用于其他古典语料库,其中存在足够的独立参考翻译。

英文摘要

Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考:通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD&CG(CAD&CG国家重点实验室)

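AI总结 This paper proposes Pattern Inference and Pattern Induction, two strategies that let vision-language models perceive and reason more efficiently and accurately in visual planning tasks, addressing the perceptual bottleneck that arises with complex inputs.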

详情
AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型(VLMs)构成重大挑战,当输入复杂度超出其一步感知能力时。受最近在图像思考(TWI)中的进展启发,一种合理的解决方案是通过迭代获取和整合局部视觉证据,将感知过程分解为更简单的步骤。然而,尽管当前VLMs在一般TWI能力上训练良好,但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战,我们将TWI视为一种工具,逐步构建并反映一个准确的内部世界模型。我们发现,由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务,但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率,我们提出模式推理,一种新的TWI策略,使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式,我们提出模式诱导,一种在线归纳学习策略,将视觉模式视为复合且可重用的专家,这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明,我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

2605.16843 2026-05-19 cs.CL 版本更新

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

RTI-Bench: 一个用于印度权利信息决策分析的结构化数据集

Joy Bose

发表机构 * Independent Researcher(独立研究员)

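AI总结 This paper introduces RTI-Bench, a structured dataset of Central Information Commission (CIC) decisions for analyzing Indian right-to-information rulings, with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines; the authors describe it as the first publicly released dataset of its kind.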

Comments 8 pages, 4 tables

详情
AI中文摘要

印度《2005年信息权利法》赋予每位公民要求公共机构提供信息的权利,但在实践中,大多数人无法理解中央信息委员会(CIC)决定中密集的行政语言,更不用说预测是否值得提出上诉。本文介绍了RTI-Bench,一个包含CIC决定的结构化数据集,包含结果标签、豁免引用、IRAC风格的推理组件和程序时间线。据我们所知,这是首个公开发布的印度RTI行政决定结构化数据集。该数据集来自两个来源:1,218个公开可用的指令-响应语料库(通过规则提取添加了结构化字段),以及298个直接从委员会门户网站收集的CIC决定PDF文件,涵盖2023至2026年的五个委员和三种文档格式版本。在指令-响应语料库上,标签覆盖率达到89%。对于239个主要决定的PDF子集,本次首次发布覆盖率为51%。对50个标记案例的随机样本进行了人工审查,得出的标签精度为95.3%。在100个案例上的零样本Mistral 7B基线模型在结果预测上的准确率为57.3%,宏F1得分为37.0%,高于多数类基线的14.3%宏F1。RTI-Bench可在https://huggingface.co/datasets/joyboseroy/rti-bench获取。

英文摘要

India's Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at https://huggingface.co/datasets/joyboseroy/rti-bench
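
The gap between accuracy and macro-F1 reported above is typical of imbalanced label sets. The sketch below shows, under stated assumptions, why a majority-class predictor can look reasonable on accuracy yet score poorly on macro-F1. The label names and class counts are hypothetical and are not taken from RTI-Bench.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never predicted gets 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical outcome labels with a 6/3/1 class split.
    gold = ["allowed"] * 6 + ["dismissed"] * 3 + ["remanded"] * 1
    majority = ["allowed"] * len(gold)  # always predict the most frequent class
    print("accuracy:", accuracy(gold, majority))
    print("macro-F1:", macro_f1(gold, majority))
```

Because macro-F1 averages over classes rather than examples, every class the predictor never outputs contributes a zero, so the metric rewards models that get minority outcomes right.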

2605.16839 2026-05-19 cs.CL 版本更新

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

CompactAttention: 加速分块预填的块-联合KV选择

Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学)

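AI总结 This paper proposes CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection that treats 2D block-sparse masks as KV-selection signals, delivering up to 2.72x attention speedup while keeping accuracy close to dense attention.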

详情
AI中文摘要

分块预填已成为长上下文大语言模型广泛采用的服务策略,但在这种模式下高效计算注意力仍然具有挑战性。现有稀疏注意力方法主要针对一次性预填设计,无法有效转换为分块预填:块稀疏内核在查询长度受限于分块大小时效率降低,而细粒度模式搜索在每次分块累积KV缓存中重复时变得昂贵。QUOKA是一种近期针对分块预填的方法,避免了稀疏内核的开销,但依赖于查询子采样、令牌级的KV选择,这可能导致遗漏查询特定的KV条目并引入显式的KV复制开销。为了解决这些限制,我们提出了CompactAttention,一种基于块-联合KV选择的分块预填注意力机制。CompactAttention将二维块稀疏掩码作为KV选择信号,而不是直接的稀疏内核执行计划,并将其转换为GQA-aware的每组KV块表,通过Q块联合和组内联合。这种构造产生了最小的块表,保留了输入掩码所选择的所有KV块,在分页执行约束下,使所选KV块能够原地访问,而无需显式的KV压缩。在LLaMA-3.1-8B-Instruct上,CompactAttention在RULER基准测试中保持的精度接近密集注意力,同时在128K上下文长度下的分块预填中提供高达2.72倍的注意力加速。

英文摘要

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72$\times$ attention speedup at 128K context length under chunked prefill.
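
The two union steps named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mask layout `mask[head][q_block][kv_block]`, the contiguous head-to-group assignment, and the function name `block_union_tables` are assumptions.

```python
from typing import List

Mask = List[List[List[bool]]]  # assumed layout: [query head][Q block][KV block]


def block_union_tables(mask: Mask, group_size: int) -> List[List[int]]:
    """Build one KV block table per GQA group.

    A KV block is kept for a group if any Q block (Q-block union) of any
    query head in that group (intra-group union) selects it.
    """
    n_heads = len(mask)
    if n_heads % group_size != 0:
        raise ValueError("number of query heads must be a multiple of group_size")
    tables = []
    for g in range(n_heads // group_size):
        selected = set()
        for h in range(g * group_size, (g + 1) * group_size):
            for q_row in mask[h]:
                selected.update(kb for kb, keep in enumerate(q_row) if keep)
        tables.append(sorted(selected))  # sorted indices of blocks to read in place
    return tables


if __name__ == "__main__":
    # 4 query heads, 2 heads per KV group, 2 Q blocks in the chunk, 4 KV blocks.
    mask = [
        [[True, False, False, False], [True, True, False, False]],    # head 0
        [[False, False, False, True], [False, False, False, True]],   # head 1
        [[False, True, False, False], [False, True, False, False]],   # head 2
        [[False, True, False, False], [False, False, False, False]],  # head 3
    ]
    print(block_union_tables(mask, group_size=2))
```

Each table is a superset of what any single head or Q block asked for, so no selected KV block is dropped; the cost is that some heads attend to blocks they did not select themselves.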

2605.16829 2026-05-19 cs.CL cs.PL 版本更新

Constrained Code Generation with Discrete Diffusion

带有离散扩散的约束代码生成

Lize Shao, Michael Cardei, Zichen Xie, Ferdinando Fioretto, Wenxi Wang

发表机构 * University of Virginia(弗吉尼亚大学)

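AI总结 This paper proposes CDC, a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process, improving constraint satisfaction for functional correctness, security, and syntax in code generation.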

详情
AI中文摘要

离散扩散模型是一种强大的新兴范式,用于代码生成。它们通过迭代地对部分损坏的token序列进行细化来构建程序,并能够并行地细化token。重要的是,这种范式在每次去噪步骤中暴露了全局程序状态,这为强制执行程序级功能和安全约束提供了自然的干预点,从而在最终代码提交前指导生成。基于这一观察,本文引入了Constrained Diffusion for Code (CDC),一种无需训练的神经符号推理框架,它将约束满足直接整合到反向去噪过程中。CDC在基础离散扩散采样器上增加了具有约束意识的去噪运算符,这些运算符结合数学优化与程序分析,以识别中间程序状态中与约束相关的区域,并在本地调整去噪轨迹,使生成过程朝向可行的程序发展,同时保持接近基础模型。在代码生成基准测试中,CDC在功能正确性、安全性和语法方面一致提高了约束满足能力,优于离散扩散和自回归基线,使用更少的纠正计算和更局部的编辑。

英文摘要

Discrete diffusion models are a powerful, emerging paradigm for code generation. They construct programs through iterative refinement of partially corrupted token sequences and enable parallel token refinement. Importantly, this paradigm exposes a global program state at each denoising step, which provides a natural intervention point for enforcing program-level functionality and security constraints, guiding the generation before the final code is committed. Building on this observation, the paper introduces Constrained Diffusion for Code (CDC), a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process. CDC augments the base discrete diffusion sampler with constraint-aware denoising operators that combine mathematical optimization with program analysis to identify constraint-relevant regions of the intermediate program state and locally adjust the denoising trajectory, steering generation toward feasible programs while remaining close to the base model. Across code generation benchmarks, CDC consistently improves constraint satisfaction in functional correctness, security, and even syntax, outperforming discrete diffusion and autoregressive baselines with less corrective computation and more localized edits.

2605.16826 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

解耦KL与轨迹:为LLM蒸馏中的SFT、DAgger、离线RL和OPD提供统一视角

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen

发表机构 * Eastern Institute of Technology(东方理工大学) The Hong Kong Polytechnic University(香港理工大学) Shanghai Jiao Tong University(上海交通大学) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)人工智能学域)

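AI总结 This paper examines how knowledge distillation couples prefix source (trajectories) with KL direction; decoupling the two axes yields four valid distillation objectives, and experiments reveal tradeoffs across KL direction, prefix source, and training length, motivating two practical methods: KL mixing and an entropy-gated length curriculum.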

Comments Code available at https://github.com/EIT-NLP/Decoupled-Distill

详情
AI中文摘要

知识蒸馏是LLM后训练的核心,但其设计空间仍不明确,尤其是在与强化学习(RL)结合时。我们展示了主流范式,即离线蒸馏和在线蒸馏(OPD),隐含地耦合了两个正交选择:前缀源和token级KL方向。这源于将序列级KL分解为自回归响应分布的KL:前向KL将教师前缀与token级前向KL配对,而反向KL将学生前缀与token级反向KL配对。我们主张这种耦合并非本质:解耦这两个轴向会产生四个有效的目标。我们建立了梯度级恒等式,显示前向KL给出SFT风格的交叉熵匹配,而反向KL给出RL风格的策略梯度目标,连接到离线SFT、DAgger风格的在线SFT、离线RL风格的蒸馏和OPD。我们在数学推理上进行了广泛的受控研究,评估了四个目标作为独立方法和后续RL的初始化。结果揭示了三个权衡:KL方向引起准确度-熵权衡,前缀源引起质量-计算权衡,训练长度引起准确度-稳定性权衡。受这些发现启发,我们提出了KL混合和熵门长度课程。KL混合显示长序列蒸馏需要显著的前向KL权重以防止熵崩溃和长度膨胀而不牺牲准确性。熵门长度课程提高了Avg@k和Pass@k分别3.6和高达5.8个点,并将平均响应长度减少了约3倍。我们的结果提供了一个框架和实用方法,用于设计平衡准确度、多样性、计算和RL行为的推理蒸馏目标。

英文摘要

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.
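
The two axes can be made concrete with a small token-level example. The sketch below computes forward and reverse KL between a teacher and a student next-token distribution and lists the four prefix-source and KL-direction pairings as read from the abstract. The toy distributions, the function names, and the convex-combination form of `mixed_kl` are assumptions; the abstract does not specify how KL mixing is weighted.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) over a shared vocabulary; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_loss(teacher: Sequence[float], student: Sequence[float],
               direction: str) -> float:
    """Token-level distillation loss at one position."""
    if direction == "forward":   # KL(teacher || student): cross-entropy-like matching
        return kl(teacher, student)
    if direction == "reverse":   # KL(student || teacher): mode-seeking
        return kl(student, teacher)
    raise ValueError("direction must be 'forward' or 'reverse'")


def mixed_kl(teacher: Sequence[float], student: Sequence[float],
             w_forward: float) -> float:
    """Assumed form of KL mixing: a convex combination of both directions."""
    return (w_forward * token_loss(teacher, student, "forward")
            + (1.0 - w_forward) * token_loss(teacher, student, "reverse"))


# (who sampled the prefix, token-level KL direction) -> paradigm named in the abstract
OBJECTIVES = {
    ("teacher", "forward"): "off-policy SFT",
    ("student", "forward"): "DAgger-style on-policy SFT",
    ("teacher", "reverse"): "offline-RL-style distillation",
    ("student", "reverse"): "OPD",
}

if __name__ == "__main__":
    teacher = [0.7, 0.2, 0.1]  # toy next-token distributions
    student = [0.4, 0.4, 0.2]
    print("forward KL:", token_loss(teacher, student, "forward"))
    print("reverse KL:", token_loss(teacher, student, "reverse"))
    print("mixed (w=0.5):", mixed_kl(teacher, student, 0.5))
```

The prefix source does not appear in `token_loss` itself: it only decides which model generated the tokens that precede the position where the loss is evaluated, which is why the two choices can be varied independently.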

2605.16824 2026-05-19 cs.LG cs.CL 版本更新

Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning

置信几何揭示大语言模型推理中的痕量正确性

Shuo Liu, Ding Liu, Shi-Ju Ran

发表机构 * School of Computer Science and Technology, Tiangong University(天津工业大学计算机科学与技术学院) Center for Quantum Physics and Intelligent Sciences, Department of Physics, Capital Normal University(首都师范大学量子物理与智能科学中心)

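AI总结 This paper studies the relationship between reasoning correctness and token-level confidence trajectories in large language models, shows that confidence geometry separates correct from incorrect reasoning traces, and proposes NeuralConf, a lightweight estimator that improves confidence-weighted answer aggregation.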

Comments 11 pages, 9 figures, 1 table. Code is available at https://github.com/QML-TGU/NeuralConf

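Details
AI-generated Chinese abstract (translated)

Large language models (LLMs) generate not only reasoning text but also token-level confidence trajectories that record how uncertainty evolves during reasoning. Whether these trajectories relate to reasoning correctness is unclear. This paper shows that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, with no access to the input question, reasoning text, hidden states, or external verifiers, low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. On GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies-Bouldin index, consistently corresponds to higher correctness-discrimination AUC. Correctness-related information is also enriched in the tail of reasoning, indicating that late-stage confidence dynamics carry key correctness signals. The paper proposes NeuralConf, a lightweight estimator that learns correctness evaluation from confidence trajectories. Under a fixed trace budget, NeuralConf-derived scores outperform majority voting, tail confidence, and other static baselines for confidence-weighted answer aggregation. These results show that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a way to improve reasoning with information already present in generation.

English abstract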

Large language models (LLMs) generate not only reasoning text, but also token-level confidence trajectories that record how uncertainty evolves during inference. Whether these trajectories are relevant to reasoning correctness remains unclear. Here we show that confidence trajectories encode a content-agnostic confidence geometry associated with trace-level final-answer correctness. Using only token-level confidence values, without access to the input question, reasoning text, hidden states, or external verifiers, we find that low-dimensional representations of confidence trajectories separate correct from incorrect reasoning traces. Across GSM8K, MATH, and MMLU, this geometric separation is quantitatively linked to downstream predictability: stronger clustering of correct and incorrect traces, measured by the Davies-Bouldin index, consistently corresponds to higher correctness-discrimination AUC. We further show that correctness-related information is enriched in the tail of reasoning, suggesting that late-stage confidence dynamics carry key correctness signals. We propose NeuralConf, a lightweight estimator that learns from confidence trajectories for correctness evaluation. Under a fixed trace budget, NeuralConf-derived scores improve confidence-weighted answer aggregation over majority voting, tail confidence, and other static baselines. These results reveal that LLMs expose trace-intrinsic statistical signals of correctness through their own confidence dynamics, offering a route to improve inference using information already present within generation.
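To make the aggregation step in this abstract concrete, here is a minimal sketch of confidence-weighted answer aggregation next to plain majority voting. The per-trace scores are made-up numbers standing in for whatever a correctness estimator would output; nothing here reproduces NeuralConf itself.

```python
from collections import defaultdict

def majority_vote(answers):
    """Return the most frequent final answer across sampled traces."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_vote(answers, scores):
    """Return the answer with the largest total per-trace score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Five sampled traces: final answers plus hypothetical correctness scores.
answers = ["12", "15", "15", "12", "12"]
scores = [0.2, 0.9, 0.8, 0.3, 0.1]

print(majority_vote(answers))          # the answer with the most traces
print(weighted_vote(answers, scores))  # the answer with the most total score
```

In this toy case the two rules disagree: three low-scored traces outvote two high-scored ones under majority voting, but not under weighted voting.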

2605.16819 2026-05-19 cs.CL cs.AI cs.LG updated version

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

Affiliations: AMD

AI summary: This paper presents AgentKernelArena, an open-source benchmark for evaluating GPU kernel optimization agents. Using isolated workspaces and unified scoring, it tests agent performance and generalization across tasks and hardware targets, finding high compilation and correctness rates on most tasks but a substantial correctness drop on PyTorch-to-HIP translation.

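Details
AI-generated Chinese abstract (translated)

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization tests. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. It contains 196 tasks covering HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and uses gated compilation, correctness, and performance checks in isolated workspaces, centralized scoring, and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never saw. Across production agents including Cursor Agent, Claude Code, and Codex Agent, compilation and correctness rates are high on most tasks, with the strongest configurations reaching mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. The unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations mostly transfer to unseen input shapes, while PyTorch-to-HIP translation shows a substantial correctness drop, indicating that agents generating kernels often hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

English abstract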

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

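2605.16800 2026-05-19 cs.LG cs.CL updated version

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

Ramakrishnan Sathyavageeswaran

Affiliations: Intuit

AI summary: This paper proposes FIM-LoRA, which allocates LoRA ranks per layer in a task-informed way using calibration-time gradient-variance estimation. In the reported experiments it matches uniform-rank LoRA at the same parameter budget.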

Comments 10 pages, 1 figure

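Details
AI-generated Chinese abstract (translated)

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix, a practical convenience that ignores a basic fact: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern: no new parameters, no training overhead, and no changes to serving infrastructure. We implement this with an efficient approximation of the diagonal of the empirical Fisher Information Matrix (eFIM), restricted to the LoRA adapter matrices, which cuts memory cost by roughly 256x relative to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA at the same parameter budget (88.6 vs. 88.7), and on commonsense reasoning with LLaMA-3-8B it reaches 68.5 vs. 68.7. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, in line with established findings on transformer layer roles.

English abstract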

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.
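The proportional redistribution step in this abstract can be sketched as follows. This is a minimal illustration, not the paper's rule: the function name `allocate_ranks`, the `min_rank` floor, and the largest-remainder rounding are assumptions, and the per-layer variances are taken as given rather than computed from calibration passes.

```python
def allocate_ranks(variances, total_rank, min_rank=1):
    """Split a fixed rank budget across layers in proportion to a
    per-layer informativeness score (here: gradient variance).

    Every layer keeps at least `min_rank`; the rest of the budget is shared
    proportionally and rounded with the largest-remainder method so that the
    ranks sum exactly to `total_rank`.
    """
    n = len(variances)
    if n == 0 or total_rank < n * min_rank:
        raise ValueError("budget too small for the requested minimum rank")
    spare = total_rank - n * min_rank
    total_var = sum(variances)
    if total_var <= 0:
        shares = [spare / n] * n                      # fall back to uniform
    else:
        shares = [spare * v / total_var for v in variances]
    ranks = [min_rank + int(s) for s in shares]       # floor of each share
    leftover = total_rank - sum(ranks)
    # Hand the remaining units to the layers with the largest fractional part.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        ranks[i] += 1
    return ranks

# Four adapted matrices that would get rank 8 each under uniform LoRA.
variances = [4.0, 1.0, 1.0, 2.0]   # hypothetical calibration statistics
print(allocate_ranks(variances, total_rank=32, min_rank=2))
```

The total rank, and therefore the parameter budget when all adapted matrices share a shape, is unchanged; only its distribution across layers moves.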

2605.16790 2026-05-19 cs.LG cs.AI cs.CL updated version

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

Affiliations: UC San Diego; Cisco Research

AI summary: This paper proposes TIER, a reward framework based on function schemas and runtime execution that provides dense, interpretable sequence-level feedback, supports multiple solution strategies, adapts to changing tool interfaces, and achieves high accuracy on benchmarks such as DepthBench.

Comments Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at https://github.com/anaykulkarni/TIER

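Details
AI-generated Chinese abstract (translated)

Tool use lets large language models solve complex tasks through sequences of API calls, but existing reinforcement learning approaches do not scale to multi-step composition. Outcome-based rewards give only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER (Trajectory-Invariant Execution Rewards), a reward framework whose supervision comes directly from function schemas and runtime execution rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, giving feedback derived from fine-grained verification of individual tool-use steps. This design lets any valid execution path receive credit, naturally supporting multiple solution strategies and adapting to changing tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves more than 90% accuracy across steps, whereas trajectory-supervised rewards collapse beyond step 4. We further show consistent gains on benchmarks such as BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.

English abstract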

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.
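The four-way decomposition named in this abstract can be illustrated with a toy scorer. Everything concrete below is hypothetical and not taken from the paper: the `TOOLS` registry, the JSON call format, the equal weights, and the averaging over steps. The point of the sketch is only that no reference trajectory is consulted.

```python
import json

# Hypothetical tool registry: name -> (callable, {argument name: accepted types}).
TOOLS = {
    "add": (lambda a, b: a + b, {"a": (int, float), "b": (int, float)}),
    "square": (lambda x: x * x, {"x": (int, float)}),
}

def check_step(raw_call):
    """Score one tool call as (format_ok, schema_ok, exec_ok), each 0.0 or 1.0."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0.0, 0.0, 0.0              # not parseable as a call at all
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None or not isinstance(args, dict):
        return 1.0, 0.0, 0.0              # unknown tool or malformed arguments
    fn, params = tool
    if set(args) != set(params) or not all(
        isinstance(args[key], types) for key, types in params.items()
    ):
        return 1.0, 0.0, 0.0              # well-formed JSON, schema violated
    try:
        fn(**args)
    except Exception:
        return 1.0, 1.0, 0.0              # schema-valid call that fails at runtime
    return 1.0, 1.0, 1.0

def trajectory_reward(raw_calls, answer, gold, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine step-averaged checks with final-answer correctness."""
    steps = [check_step(call) for call in raw_calls]
    n = max(len(steps), 1)
    fmt = sum(s[0] for s in steps) / n
    schema = sum(s[1] for s in steps) / n
    executed = sum(s[2] for s in steps) / n
    correct = 1.0 if answer == gold else 0.0
    return sum(w * p for w, p in zip(weights, (fmt, schema, executed, correct)))

path_a = ['{"name": "add", "arguments": {"a": 2, "b": 3}}',
          '{"name": "square", "arguments": {"x": 5}}']
path_b = ['{"name": "square", "arguments": {"x": 5}}']
broken = ['{"name": "add", "arguments": {"a": 2}}']

print(trajectory_reward(path_a, 25, 25))
print(trajectory_reward(path_b, 25, 25))
print(trajectory_reward(broken, 7, 25))
```

Two different valid call sequences that reach the same answer receive the same full reward, while a call that parses but violates its schema earns only the format component.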

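2605.16787 2026-05-19 cs.LG cs.CL updated version

The Unlearnability Phenomenon in RLVR for Language Models

Yulin Chen, He He, Chen Zhao

Affiliations: New York University

AI summary: This paper studies the learning dynamics of RLVR for improving LLM reasoning and finds that some hard examples remain unlearnable even when correct rollouts are present, revealing fundamental limitations of current RL methods on reasoning tasks.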

Comments Accepted to ICML 2026

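Details
AI-generated Chinese abstract (translated)

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective at improving the reasoning ability of large language models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand it, we first show that existing optimization and sampling techniques do not resolve unlearnability. Through cross-example gradient analysis, we show that unlearnable examples have fundamental representation issues, characterized by low gradient similarity with the remaining examples and reasoning patterns that do not generalize. We further show that these representation flaws are hard to mitigate in RL, because data augmentation does not improve gradient similarity. This study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations of current RL methods on reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

English abstract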

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have fundamental representation issues, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

2605.16770 2026-05-19 cs.CL cs.AI updated version

Exploring Lightweight Large Language Models for Court View Generation

Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; School of Computer Science, South China Normal University; China Mobile Internet Co., Ltd.

AI summary: This paper studies the capabilities of lightweight LLMs in court view generation and their effect on charge prediction, examining how model architecture and size affect performance and how lightweight LLMs compare with deep neural networks on these tasks. It also develops the CVGEvalKit evaluation framework.

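Details
AI-generated Chinese abstract (translated)

Criminal Court View Generation (CVG) is a key task in Legal Artificial Intelligence (Legal AI) that involves generating a court view from case facts. In this work, we systematically explore the capabilities of lightweight (fewer than 2B parameters) large language models (LLMs) in CVG and their impact on charge prediction. The study addresses four key questions: (1) how different LLM architectures affect CVG quality and charge prediction; (2) how LLM size affects performance; (3) how lightweight LLMs compare with deep neural networks (DNNs) on these tasks; and (4) how predicting the charge after first generating the court view compares with predicting it directly. We also develop the CVGEvalKit evaluation framework, which includes three publicly available datasets for CVG and for charge prediction. Comprehensive experiments are run on this framework, with models trained on a mixed training set and evaluated on the test set of each dataset. The results give new insight into the trade-offs among model architecture, model size, and the influence between tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit.

English abstract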

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of a court view from case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B parameters) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) on these tasks, and (4) how does predicting the charge after first generating the court view compare with predicting it directly. We also develop CVGEvalKit, an evaluation framework that includes three publicly available datasets for CVG and for charge prediction. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit.

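2605.16767 2026-05-19 cs.CL updated version

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

Li Zhang, Jaromir Savelka, Kevin Ashley

Affiliations: University of Pittsburgh; Carnegie Mellon University

AI summary: This paper proposes a retrieval-based method for multi-label legal annotation that embeds documents and label descriptions with a frozen retrieval model and predicts labels by k-nearest-neighbor search. It is data-efficient, avoids the label hallucination of generative models, and is shown to be practical for high-cardinality, rapidly changing legal label spaces.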

Comments 10 pages, 3 figures

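Details
AI-generated Chinese abstract (translated)

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long documents, often under limited supervision. Parametric encoders usually need task-specific training and retraining, while prompting generative large language models becomes costly and degrades as the label set changes. We treat legal annotation as retrieval: documents and label descriptions are embedded with a frozen retrieval model, and labels are predicted by k-nearest neighbors in the embedding space, so updates are made by re-embedding and re-indexing rather than by gradient-based backpropagation. On three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval raises Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times. With only N=100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: under deterministic decoding, GPT-5.2 produces labels outside the provided taxonomy in 0.12-0.9% of test samples. Retrieval, by contrast, strictly respects the defined label set, eliminating hallucination by design. These results suggest that retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.

English abstract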

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only (N=100) training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
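The retrieval formulation in this abstract can be sketched with a few lines of nearest-neighbor voting. The vectors below are hand-written stand-ins for the output of a frozen embedding model, and the voting rule (`k`, `min_votes`) is an assumption made for illustration; the paper's exact scoring is not given in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_labels(query, index, k=3, min_votes=2):
    """Predict labels for `query` from its k nearest indexed neighbors.

    `index` is a list of (embedding, label_set) pairs. Only labels that
    appear in the index can ever be returned, so the prediction cannot
    leave the defined label set.
    """
    neighbors = sorted(index, key=lambda item: cosine(query, item[0]),
                       reverse=True)[:k]
    votes = {}
    for _, labels in neighbors:
        for label in labels:
            votes[label] = votes.get(label, 0) + 1
    return sorted(label for label, count in votes.items() if count >= min_votes)

# Toy index: embeddings would normally come from the frozen retrieval model.
index = [
    ([1.0, 0.0, 0.0], {"Art. 3"}),
    ([0.9, 0.1, 0.0], {"Art. 3", "Art. 6"}),
    ([0.0, 1.0, 0.0], {"Art. 6"}),
    ([0.0, 0.0, 1.0], {"Art. 8"}),
]
query = [0.8, 0.2, 0.0]

print(knn_labels(query, index, k=3, min_votes=2))
print(knn_labels(query, index, k=2, min_votes=2))
```

Adding or retiring a label in this setup means appending or deleting index entries and re-embedding, with no gradient update, which is the extensibility property the abstract emphasizes.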

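2605.16758 2026-05-19 cs.CL updated version

Language Acquisition Device in Large Language Models

Masato Mita, Taiga Someya, Ryo Yoshida, Yohei Oseki

Affiliations: The University of Tokyo

AI summary: This paper proposes a pre-pretraining method inspired by the Language Acquisition Device. Pre-pretraining on the formal language MP-STRUCT improves the data efficiency of large language models and confers resistance to structurally implausible languages.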

Comments Accepted to ACL2026 Main Conference

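Details
AI-generated Chinese abstract (translated)

Large language models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to narrow this gap, with prior work emphasizing highly expressive formal languages such as k-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which holds that innate constraints restrict the learner's hypothesis space in advance to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement through the operations MERGE, AGREE, and MOVE. A brief 500-step PPT on MP-STRUCT matches strong formal-language baselines in token efficiency while also conferring a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms k-Shuffle Dyck even though it is not definable in C-RASP (a formal bound on transformer expressivity), challenging the earlier hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce ambiguity in dependency resolution, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on how accessible dependency resolution is.

English abstract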

Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner's hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.

2605.16654 2026-05-19 cs.CL cs.AI updated version

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

Divyesh Pratap Singh, Dakshesh Gusain, Federica Bulgarelli, Alison Eisel Hendricks, John Beavers, Nathan M. Beers, Ifeoma Nwogu

Affiliations: University at Buffalo; Nanyang Technological University; The University of Texas at Austin

AI summary: This paper proposes a method that uses large language models to identify manner and result verbs, extends coverage to 436 classes using data from MASC and InterCorp, and validates the approach on three datasets, with average model accuracy up to 89.6%.

Comments 12 pages

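Details
AI-generated Chinese abstract (translated)

Manner and result verbs encode different aspects of event structure and are treated in developmental language research as a potentially informative distinction for studying early verb learning. However, the distinction is hard to measure because large annotated resources for manner and result classification are not currently available. This paper presents a computational approach that uses linguistically informed prompts to generate sentence-level annotations, extending coverage of the VerbNet annotations to 436 classes. A RoBERTa-based classifier is then trained on these annotations and evaluated on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations the model shows promising performance, with average accuracy up to 89.6%. The paper presents this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.

English abstract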

Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.

2605.16650 2026-05-19 cs.CL cs.AI updated version

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

Avijit Shil, Suman Samui

Affiliations: Maulana Abul Kalam Azad University of Technology; National Institute of Technology Durgapur

AI summary: This paper proposes the SKG-Eval framework, which uses an incremental semantic knowledge graph to address long-range inconsistency in multi-turn dialogue evaluation and provides interpretable evaluation signals and reproducible scores.

Comments 36 Pages, 6 Figures

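Details
AI-generated Chinese abstract (translated)

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators rely mainly on flat or turn-isolated representations and struggle to detect long-range issues such as contradiction, topic drift, and entity inconsistency. We therefore propose SKG-Eval, a quasi-deterministic and interpretable framework that models a dialogue as a Semantic Knowledge Graph (SKG) of entities, relations, and commitments that evolves across turns. The framework updates the graph incrementally through structured triple extraction and computes three complementary signals: (i) local relevance, which measures alignment with the current prompt and an optional reference; (ii) historical consistency, which evaluates how newly introduced information connects to the prior conversational context using graph-based and embedding-based signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused through recency-weighted trend analysis into a length-invariant session score. Across multiple benchmarks, SKG-Eval correlates more strongly with human judgments and substantially improves detection of long-range inconsistencies. The framework also produces explicit contradiction certificates and deterministic scores for fixed inputs, making evaluation reproducible and auditable. Overall, the results suggest that structured, externalized state tracking with semantic knowledge graphs offers a scalable alternative for LLM-based dialogue evaluators.

English abstract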

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.

2605.14498 2026-05-19 cs.CL updated version

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

Affiliations: UC Santa Barbara; Microsoft

AI summary: This paper presents GroupMemBench for evaluating the memory of LLM agents in multi-party conversations, revealing shortcomings of existing memory systems in group dynamics, belief tracking, and language adaptation.

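Details
AI-generated Chinese abstract (translated)

Large language model (LLM) agents increasingly act as personal assistants and workplace collaborators, and their usefulness depends on memory systems that extract, retrieve, and apply information from long conversations. However, existing memory systems and benchmarks are built around dyadic, single-user settings, even though real deployments usually span groups and channels in which multiple users interact with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, which requires per-user memory modeling, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-based synthesis pipeline generates multi-party conversations with controllable reply structure, conditioning each message on user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories (multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention) and iteratively searches for challenging, realistic queries that reflect overall memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest system reaches 46.0% average accuracy, with 27.1% on knowledge update and 37.7% on term ambiguity, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features that group memory depends on, leaving multi-user memory far from solved.

English abstract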

Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.

2605.14381 2026-05-19 cs.LG cs.CL updated version

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darlene Neal, Kshitij Pancholi, Jamila Smith-Loud

Affiliations: Google Research USA; Google USA; Google DeepMind USA

AI summary: NodeSynth generates socially relevant synthetic queries through fine-grained taxonomy expansion grounded in real-world evidence, improving the accuracy and safety of AI model evaluation in sensitive domains.

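Details
AI-generated Chinese abstract (translated)

NodeSynth generates socially relevant synthetic queries through fine-grained taxonomy expansion grounded in real-world evidence, improving the accuracy and safety of AI model evaluation in sensitive domains.

English abstract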

Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).

2605.14292 2026-05-19 cs.LG cs.CL updated version

Minimal-Intervention KV Retention via Set-Conditioned Diversity

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

Affiliations: Department of Computer Science and Software Engineering; Department of Information Management; Auburn University; National Central University

AI summary: The study modifies the TriAttention retention scorer to improve KV-cache compression under small budgets, using a V-space redundancy penalty, and finds that a minimal modification outperforms structural redesigns.

Comments 15 pages, 3 figures, 3 tables. Code and data: https://github.com/libophd/minimal-kv-retention

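Details
AI-generated Chinese abstract (translated)

KV-cache compression at small budgets is a crowded design space covering cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms from these five families under matched mean cache on long-form mathematical reasoning (MATH-500) with two distilled reasoning models (the Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $\alpha$, a one-function modification to the TriAttention retention scorer that replaces argmax-top-$k$ with greedy, facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears the Bonferroni threshold on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: in this regime a minimal scoring modification beat heavier structural redesigns, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.

English abstract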

KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ = 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
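The contrast between top-k retention and greedy selection under a redundancy penalty can be sketched as below. This is an assumed form, not the paper's scorer: the penalty is taken to be the maximum cosine similarity between a candidate's value vector and those already kept, scaled by a weight `lam`, and the scores and value vectors are toy numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_topk(scores, budget):
    """Baseline: keep the `budget` highest-scoring cache positions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def select_diverse(scores, values, budget, lam=0.5):
    """Greedy retention: at each step keep the position that maximizes
    score minus lam times its similarity to the closest kept value vector."""
    chosen = []
    remaining = set(range(len(scores)))
    while remaining and len(chosen) < budget:
        def gain(i):
            redundancy = max((cosine(values[i], values[j]) for j in chosen),
                             default=0.0)
            return scores[i] - lam * redundancy
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
    return sorted(chosen)

scores = [1.0, 0.95, 0.5]                       # hypothetical retention scores
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # positions 0 and 1 are duplicates

print(select_topk(scores, budget=2))
print(select_diverse(scores, values, budget=2, lam=0.5))
```

With `lam=0` the greedy rule reduces to top-k. With a positive weight, the near-duplicate second position loses its slot to a lower-scoring but non-redundant one, which is the set-conditioned behavior the title refers to.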

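2605.14271 2026-05-19 cs.CL cs.CY updated version

Auditing Agent Harness Safety

Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang

Affiliations (abbreviations as listed, with the digest's glosses): UCB (University of California, Berkeley); WISC (University of Wisconsin); STANFORD (Stanford University); MSR1 (Microsoft Research)

AI summary: This paper proposes the HarnessAudit framework, which audits execution trajectories for boundary compliance, execution fidelity, and system stability to expose safety risks in multi-agent harnesses, and uses HarnessAudit-Bench to show how safety risk varies with trajectory length, domain, and agent role.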

Comments 11 Pages, 8 Figures

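Details
AI-generated Chinese abstract (translated)

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages. However, a harness may access unauthorized resources or leak context to the wrong agent during a trajectory, and output-level evaluation cannot detect these failures. This paper proposes HarnessAudit, a framework that audits full execution trajectories for boundary compliance, execution fidelity, and system stability, and uses HarnessAudit-Bench to show how safety risk relates to trajectory length, domain, and agent role.

English abstract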

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

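2605.14125 2026-05-19 cs.CL updated version

Polar probe linearly decodes semantic structures from LLMs

Pablo J. Diego-Simón, Pierre Orhan, Emmanuel Chemla, Yair Lakretz, Jean-Rémi King

Affiliations: LSCP, ENS, PSL, EHESS, CNRS; Paris Brain Institute; Earth Species Project; Meta AI

AI summary: The study uses a Polar Probe to linearly recover semantic structures from LLMs. It finds that the existence and type of relations between entities are represented by embedding distance and direction, that this code is strongest in middle layers, and that the probe generalizes to new entities but degrades as semantic structures grow.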

详情
AI中文摘要

人工神经网络如何将概念绑定形成复杂语义结构?本文提出一种简单神经编码,通过嵌入距离和方向表示实体的存在及关系类型。在多种LLM中测试,结果表明极性探针能线性恢复真实语义结构,该编码主要出现在中层,随LLM性能提升而改善。极性探针能泛化至新实体和关系类型,但随语义结构规模增大而退化。极性表示质量与LLM回答语义结构问题的能力相关。这些发现表明,LLM通过简单几何原理绑定表示来构建复杂语义结构。

英文摘要

How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each input with natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. Results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrade with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.
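
The distance-and-direction code described in this abstract can be illustrated with a toy decoder. This is only a sketch under stated assumptions, not the authors' probe: the function names, the rule that "close means related", the cosine-based type lookup, and the threshold `max_dist` are all hypothetical choices made for illustration.

```python
import math

def project(W, v):
    """Apply a linear probe W (a list of rows) to the vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def polar_decode(W, e_src, e_dst, type_directions, max_dist):
    """Return the decoded relation type from e_src to e_dst, or None.

    Assumption (hypothetical): a relation exists when the probed embeddings
    are within max_dist of each other, and its type is the prototype
    direction with the highest cosine similarity to the displacement.
    """
    diff = project(W, [b - a for a, b in zip(e_src, e_dst)])
    dist = math.hypot(*diff)
    if dist == 0.0 or dist > max_dist:
        return None  # no displacement, or too far apart: no relation decoded

    def cosine(u):
        return sum(d * x for d, x in zip(diff, u)) / (dist * math.hypot(*u))

    return max(type_directions, key=lambda name: cosine(type_directions[name]))

# Toy probe that keeps the first two of three embedding dimensions.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
directions = {"parent_of": [1.0, 0.0], "sibling_of": [0.0, 1.0]}
near = polar_decode(W, [0.0, 0.0, 5.0], [0.8, 0.1, -3.0], directions, max_dist=1.0)
far = polar_decode(W, [0.0, 0.0, 0.0], [4.0, 0.0, 0.0], directions, max_dist=1.0)
```

In this toy setup the third embedding dimension is ignored by the probe, which mirrors the idea that the code lives in a subspace of the activations.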

2605.12987 2026-05-19 cs.CL 版本更新

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

利用多模态自一致性推理对减少饮酒的动机访谈进行编码

Guangzeng Han, James G. Murphy, Benjamin O. Ladd, Xiaolei Huang, Brian Borsari

发表机构 * Department of Computer Science, University of Memphis(孟菲斯大学计算机科学系) Veterans Affairs Health Care System(退伍军人事务医疗系统) Department of Psychiatry and Behavioral Sciences, University of California San Francisco(加州大学旧金山分校精神病学与行为科学系) Department of Psychology, University of Memphis(孟菲斯大学心理学系) Department of Psychology, Washington State University Vancouver(华盛顿州立大学温哥华分校心理学系)

AI总结 本文提出基于音频语言模型的自动动机访谈编码方法,通过多模态自一致性推理提升编码鲁棒性,实验显示其在准确率、精确率和召回率上均优于基线方法。

详情
AI中文摘要

背景:对动机访谈(MI)会谈进行编码对于理解来访者行为和预测结果至关重要,但需要受过训练的MI专业人员投入大量时间和人力。音频-语言模型(ALMs)的最新进展能够捕捉多模态行为信号,为MI编码自动化提供了新机会。目的:本研究旨在开发一种基于ALMs的自动MI编码方法,该方法分析原始音频输入,并利用自一致性整合多条推理轨迹的预测,以提高编码鲁棒性。方法:我们使用去标识化MI录音带中的五段会谈录音进行实验。我们为ALMs配置了四种互补的分析提示以支持话语级推理:面向言语线索的分析式提示、面向声学线索的韵律感知提示、面向定量假设检验的证据评分提示,以及面向对比推理的比较式提示。每种提示抽取三个随机样本,每条话语由此得到12条独立的推理轨迹。最终预测由所有轨迹的多数投票决定。结果:使用准确率、精确率、召回率和宏F1分数评估性能。所提出的多模态自一致性方法达到52.56%的准确率、54.03%的精确率、47.45%的召回率和46.40%的宏F1分数,优于基线方法。逐一移除各模块的系统消融实验在主要指标上均导致性能下降。结论:在MI编码上,多模态自一致性方法优于单次推理的基线提示方法。这些发现表明,同时纳入来访者说了什么以及怎么说,有助于实现更可靠的自动MI编码。

英文摘要

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
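
The voting scheme in the METHODS part (four prompts, three stochastic samples each, majority vote over the 12 trajectories) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt identifiers, the `classify` callback standing in for one audio-language-model call, and the tie-breaking rule are assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical identifiers for the four prompting styles named in the abstract.
PROMPTS = ["analytic", "prosody_aware", "evidence_scoring", "comparative"]
SAMPLES_PER_PROMPT = 3

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def code_utterance(utterance, classify):
    """Collect 4 x 3 = 12 trajectories for one utterance and vote.

    classify(utterance, prompt, sample_idx) -> label is a hypothetical
    stand-in for a single stochastic model call.
    """
    trajectories = [
        classify(utterance, prompt, sample_idx)
        for prompt, sample_idx in product(PROMPTS, range(SAMPLES_PER_PROMPT))
    ]
    return majority_vote(trajectories), trajectories
```

With a stub `classify` in which one prompt style disagrees with the other three, the vote still returns the label backed by nine of the twelve trajectories, which is the robustness the abstract attributes to self-consistency.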

2605.12920 2026-05-19 cs.MA cs.AI cs.CL 版本更新

Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

通过对话对齐世界模型实现具身多智能体协调

Vardhan Dongre, Dilek Hakkani-Tür

发表机构 * Siebel School of Computing & Data Science(计算机与数据科学学院)

AI总结 研究通过对话机制探索具身智能体的世界模型对齐,发现对话能减少冲突但降低任务成功率,提出评估世界模型对齐的框架。

详情
AI中文摘要

有效的具身智能体协作不仅需要在共享环境中行动,还要求沟通建立在每个智能体对世界不断演化的理解之上。当智能体只能部分观察环境时,无沟通的协调可被证明是困难的,但沟通原则上可以通过共享观察和对齐世界模型来弥合这一差距。本文研究基于LLM的具身智能体是否真正实现了这种沟通能力。我们扩展了PARTNR协作家庭机器人基准,加入自然语言对话通道,使两个具有部分可观测性的智能体在任务执行中沟通。为评估对话带来的是真正的世界模型对齐还是表面协调,我们提出了一个基于每个智能体世界图的对齐度量框架:观察收敛(私有世界模型是否随时间对齐?)、信息新颖性(消息是否传达了伙伴所缺乏的内容?)以及信念敏感的消息传递(智能体是否对伙伴已知的内容建模?)。我们在三个LLM上的实验显示,对话将行动冲突减少了40至83个百分点,但相对于无对话的协调,任务成功率有所下降。借助这些指标,我们刻画了表面协调与真正世界模型对齐之间的差距,并确定了当前模型在这一谱系中的位置。

英文摘要

Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.

2605.12824 2026-05-19 cs.MA cs.AI cs.CL cs.CY 版本更新

Mechanism Plausibility in Generative Agent-Based Modeling

生成式基于代理建模中的机制合理性

Patrick Zhao, David Huu Pham, Nicholas Vincent

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

AI总结 本文提出机制合理性量表,区分生成充分性与机制合理性,探讨生成式代理模型的生成能力与解释能力。

Comments Accepted at ACM FAccT 2026

详情
AI中文摘要

大型语言模型(LLMs)无需显式编程规则即可生成多样的高层现象,这一能力使其被用于各类基于代理的模型(ABMs)和社会模拟。近期研究考察了LLMs生成各种目标现象的能力,例如社交媒体平台上的人类行为,或博弈论场景中的外星行为。然而,能力、预测和解释并不相同:借鉴科学哲学与机制文献,解释需要在一定程度上展示现象如何由相关的、有组织的实体和活动产生。对建模者而言,若不立足于可能相距较远的研究领域,就难以描述实验的特征,或判断一次模拟是否在能力(或解释)上取得了进展。我们将近期关于LLM-ABMs的研究与当代科学哲学文献相结合,据此把“合理性”的定义操作化为一个四级量表。该量表将对模型生成充分性(重现现象的能力)的评估与对其机制合理性(现象可能如何产生)的评估分开,并阐明了预测性模型与解释性模型等不同模型的不同角色。我们将其称为机制合理性量表(Mechanism Plausibility Scale)。

英文摘要

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different: drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment, or judging whether a simulation makes progress in capability (or explanation), can be difficult without grounding in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

2605.11518 2026-05-19 cs.AI cs.CL cs.LG 版本更新

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch: 训练研究代理以自动化LLM实验配置 - 从低成本学习,优化高成本

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学)

AI总结 本文提出AutoLLMResearch框架,通过多保真度实验环境学习LLM配置原则,解决高成本实验自动化问题,展示其在大规模LLM实验中的有效性与通用性。

详情
AI中文摘要

有效配置可扩展的大规模语言模型(LLM)实验,涵盖架构设计、超参数调优等,对推进LLM研究至关重要,因为糟糕的配置选择会浪费大量计算资源并阻碍模型潜力的实现。以往的自动化方法适用于低成本环境,但可扩展的LLM实验成本过高,无法进行大量迭代。为了解决这一问题,我们提出AutoLLMResearch,一个模仿人类研究人员从低保真度实验中学习一般性原则并高效识别高成本LLM配置的代理框架。核心挑战是如何使代理通过与多保真度实验环境的交互学习LLM配置景观的结构。为此,我们提出一个系统框架,包含两个关键组件:1) LLMConfig-Gym,涵盖四个关键LLM实验任务的多保真度环境,支持超过一百万GPU小时的可验证实验结果;2) 一个结构化训练管道,将配置研究建模为长周期马尔可夫决策过程,并相应地激励跨保真度外推推理。在各种强基线上的广泛评估表明了我们框架的有效性、通用性和可解释性,支持其作为大规模现实LLM实验自动化的实用且通用解决方案的潜力。

英文摘要

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

2605.10923 2026-05-19 cs.LG cs.CL 版本更新

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

动态技能生命周期管理用于代理强化学习

Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng

发表机构 * Database Group, The Chinese University of Hong Kong(香港中文大学数据库组) Lanzhou University(兰州大学)

AI总结 本文提出SLIM框架,通过动态优化变量管理代理强化学习中的外部技能集,提升任务性能。

Comments Implementation code is available at https://github.com/ejhshen/SLIM

详情
AI中文摘要

大型语言模型代理越来越多地依赖外部技能来解决复杂任务,其中技能作为模块化单元扩展其能力。现有方法假设外部技能要么积累为持久指导或内化到策略中,最终导致零技能推断。本文认为这一假设过于限制,因为参数容量有限且不同技能的边际贡献不均,最优活跃技能集是非单调、任务和阶段依赖的。本文提出SLIM,一种动态技能生命周期管理框架,将活跃的外部技能集作为动态优化变量与策略学习共同更新。具体而言,SLIM通过留一技能验证估计每个活跃技能的边际外部贡献,然后应用三种生命周期操作:保留高价值技能、退役贡献变得微不足道的技能、以及在持续失败揭示缺失能力覆盖时扩展技能库。实验显示,SLIM在ALFWorld和SearchQA上平均比最佳基线高出7.1个百分点。结果进一步表明,策略学习和外部技能保留并非互斥:某些技能被吸收进策略,而其他技能继续提供外部价值,支持SLIM作为基于技能的代理强化学习更通用的范式。

英文摘要

Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or are internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1 percentage points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
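
The leave-one-skill-out estimate and the retain/retire decision described above can be sketched in a few lines. This is an illustrative simplification, not the SLIM implementation (which the entry links to): the `evaluate` callback, the `retire_below` threshold, and the function names are hypothetical, and the "sufficient exposure" condition and the skill-bank expansion step are omitted.

```python
def marginal_contributions(active_skills, evaluate):
    """Leave-one-skill-out: score with all skills minus score without each one.

    evaluate(frozenset_of_skills) -> float is a hypothetical validation
    routine that returns task performance for a given active skill set.
    """
    full_set = frozenset(active_skills)
    full_score = evaluate(full_set)
    return {skill: full_score - evaluate(full_set - {skill}) for skill in full_set}

def lifecycle_step(active_skills, evaluate, retire_below=0.01):
    """Retain skills whose marginal contribution is at least retire_below.

    retire_below is an assumed threshold for "negligible" contribution.
    """
    contributions = marginal_contributions(active_skills, evaluate)
    retained = {s for s, gain in contributions.items() if gain >= retire_below}
    retired = set(active_skills) - retained
    return retained, retired, contributions
```

A skill that the policy has already absorbed shows up here as a near-zero marginal contribution, which is what makes it a candidate for retirement while other skills keep providing external value.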

2605.09341 2026-05-19 cs.MA cs.CL 版本更新

SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System

SkillMAS: 基于大语言模型的多智能体系统技能共进化

Shuai Pan, Yixiang Liu, Jiaye Gao, Te Gao, Weiwen Liu, Jianghao Lin, Zhihui Fu, Jun Wang, Weinan Zhang, Yong Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Central South University(中南大学) OPPO

AI总结 SkillMAS通过耦合技能进化与多智能体系统重构,解决部署后技能优化与系统重构的分离问题,提升多智能体系统的适应性和专业化水平。

Comments 21 pages, 2 figures

详情
AI中文摘要

大语言模型(LLM)代理系统在部署后期望持续改进,但现有研究常将技能进化与多智能体系统(MAS)重构分开,导致组织瓶颈、上下文压力和专业化偏差。本文提出SkillMAS,一种非参数框架,用于多智能体系统的自适应专业化,通过Utility Learning分配验证执行轨迹的信用,有界技能进化精简可重用过程,避免无限制库增长,并在保留失败和Executor Utility指示结构不匹配时进行证据门控的MAS重构。在具身操作、命令行执行和零售工作流中,SkillMAS在报告的Harnesses下具有竞争力,同时澄清了部署后专业化的归因、更新和应用方式。

英文摘要

Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization. We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied.

2605.07905 2026-05-19 cs.CL cs.AI 版本更新

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

CoCoReviewBench:面向AI审稿人完整性和正确性的基准测试

Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong, Yue Wang, Xuebo Liu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳研究院) Zhongguancun Academy, Beijing, China(中关村学院) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院) Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China(南方科技大学计算机科学与工程系) NLP²CT Lab, Department of Computer and Information Science, University of Macau, China(澳门大学自然语言处理实验室)

AI总结 本文提出CoCoReviewBench,通过构建领域特定的基准子集和专家注释,提升AI审稿人评估的完整性和正确性,分析显示AI审稿人仍存在正确性不足和幻觉问题,强调推理模型的优越性。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管AI审稿系统发展迅速,但评估这些系统仍具挑战性:现有指标倾向于重叠度而非正确性。由于人类审稿通常只涵盖部分显著问题且可能含错误,因此不可靠。为此,我们构建了领域特定的基准子集,并在对应人类审稿缺失时跳过评估以增强完整性。我们还利用审稿人-作者-元审稿讨论作为专家注释,并据此过滤不可靠的审稿以增强正确性。最终,我们引入了CoCoReviewBench,该基准从ICLR和NeurIPS中精选了3900篇论文,以实现对AI审稿人的可靠和细致评估。分析表明,AI审稿人仍受限于正确性不足且易产生幻觉,推理模型表现更佳,这推动了进一步改进AI审稿人的方向。基准和模型可在https://github.com/hexuandeng/CoCoReviewBench获取。

英文摘要

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.

2605.07501 2026-05-19 cs.LG cs.CL 版本更新

ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

ExpThink:基于经验的强化学习用于自适应思路链压缩

Tingcheng Bian, Yuzhe Zhang, Jing Jin, Jinchang Luo, MingQuan Cheng, Haiwei Wang, Wenyuan Jiang, Miaohui Wang

发表机构 * Baidu Inc.(百度公司) Shenzhen University(深圳大学) Peking University(北京大学) Tsinghua University(清华大学) D-INFK, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 本文提出ExpThink框架,通过经验引导奖励塑造和难度自适应优势机制,实现思路链压缩的准确优先、压缩次之的训练目标,实验表明其在多个数学推理基准上显著提升压缩效率和准确性。

Comments 39 pages, 18 figures. Code and model checkpoints will be released upon publication

详情
AI中文摘要

大型推理模型(LRMs)通过扩展的思路链(CoT)推理实现强性能,但面临过度的token消耗和高推理延迟。现有CoT压缩的强化学习(RL)方法依赖于统一的静态长度惩罚,忽略了模型能力动态和问题级别难度变化。本文提出ExpThink框架,通过两个互补机制解决这两个维度:首先,经验引导奖励塑造跟踪每个问题迄今为止找到的最短正确解决方案,并应用三级奖励:对简洁正确的响应给予全额信用,对冗长正确的响应给予折扣信用,对错误的响应给予零信用。阈值随着模型改进自动收紧,形成自我演进的课程,无需手动调度。其次,难度自适应优势将标准差归一化替换为正确计数归一化,产生单调难度缩放的梯度,放大对难题的学习以保持准确性,同时抑制对简单问题的梯度以促进简洁性。这些机制强制执行准确性优先、压缩次之的训练目标。在多个数学推理基准上的实验表明,ExpThink减少了平均响应长度高达77%,同时提高了准确性,实现了比基线模型高3倍的准确性-效率比,并在两个指标上优于现有基于RL的压缩方法。

英文摘要

Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
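
The three-tier reward with a self-tightening threshold can be illustrated with a small stateful scorer. This is a sketch under assumptions, not the paper's reward: treating "concise" as "no longer than the shortest correct response seen so far", the discount value `0.5`, and the class name are all hypothetical.

```python
class ExperienceReward:
    """Three-tier reward keyed on the shortest correct response per problem."""

    def __init__(self, discount=0.5):
        self.discount = discount  # assumed credit for verbose-but-correct responses
        self.best_len = {}        # problem_id -> shortest correct length so far

    def __call__(self, problem_id, correct, length):
        if not correct:
            return 0.0            # tier 3: incorrect responses earn nothing
        best = self.best_len.get(problem_id)
        concise = best is None or length <= best
        if concise:
            # A new shortest correct response tightens the threshold.
            self.best_len[problem_id] = length
        return 1.0 if concise else self.discount

reward = ExperienceReward()
history = [
    reward("p1", True, 100),   # first correct response sets the threshold
    reward("p1", True, 150),   # correct but longer than the threshold
    reward("p1", True, 80),    # shorter correct response, threshold drops to 80
    reward("p1", False, 10),   # short but wrong
]
```

Because the threshold only moves when a shorter correct response appears, the bar rises with the model's own progress and needs no hand-written schedule, which is the curriculum effect the abstract describes.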

2605.06638 2026-05-19 cs.AI cs.CL 版本更新

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

强化学习能否教会大语言模型长期 horizon 推理?表达性是关键

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

发表机构 * Purdue University(普渡大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Georgia Tech(佐治亚理工学院) UC San Diego(加州大学圣地亚哥分校)

AI总结 本文通过ScaleLogic框架研究了RL训练与任务难度的关系,发现推理深度和逻辑表达性影响训练计算量,表达性越高,训练效率越高,证明LLM的长期推理问题可通过改进训练方法解决。

详情
AI中文摘要

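强化学习(RL)已被用于提升大语言模型(LLM)的推理能力,但由于缺乏可控且可扩展的环境,关于训练如何随任务难度扩展的系统研究一直受阻。LLM在长程推理上暴露的不足,引发了这些不足是否为自回归Transformer架构固有缺陷的猜测。为此,我们引入ScaleLogic,一个合成逻辑推理框架,可独立控制两个难度维度:所需证明规划的深度(即horizon)和底层逻辑的表达性。该框架支持多种逻辑:从简单的仅含蕴含的逻辑("if-then"),到包含合取("and")、析取("or")、否定("not")和全称量化("for all")的表达性更强的一阶推理。利用该框架,我们表明RL训练计算量 $T$ 与推理深度 $D$ 呈幂律关系($T \propto D^γ$,$R^{2} > 0.99$),且缩放指数 $γ$ 随逻辑表达性单调增大,从 $1.04$ 增至 $2.60$。在下游数学和通用推理基准上,表达性更强的训练设置相比表达性较弱的设置带来更大的性能提升(最高 $+10.66$ 分)和更具计算效率的迁移。我们还表明,该幂律关系在多种RL方法下均成立,且基于课程的训练能显著提升扩展效率。总体而言,结果表明LLM在长程推理上的不足并非底层架构所固有,可以通过改进训练方法和数据来解决。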

英文摘要

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that they are fundamental to the autoregressive transformer architecture. To address this, we introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
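
A power law of the form $T \propto D^γ$ is usually estimated by ordinary least squares in log-log space, where the slope is the exponent. The sketch below shows that standard procedure on synthetic numbers; it is not the authors' analysis code, and the data points are made up for illustration rather than taken from the paper.

```python
import math

def fit_power_law(depths, compute):
    """Fit compute = c * depth**gamma by least squares on log-log values.

    Returns (gamma, c, r_squared), with r_squared measured in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope = scaling exponent
    log_c = mean_y - gamma * mean_x        # intercept = log of the prefactor
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return gamma, math.exp(log_c), 1.0 - ss_res / ss_tot

# Synthetic example: compute generated exactly as 3 * D**2.
depths = [1, 2, 4, 8]
compute = [3 * d ** 2 for d in depths]
gamma, c, r2 = fit_power_law(depths, compute)
```

On this noiseless input the fit recovers an exponent of about 2 and a prefactor of about 3; with real measurements the reported $R^{2}$ indicates how well a single exponent describes the whole depth range.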

2605.05739 2026-05-19 cs.LG cs.AI cs.CL q-fin.CP 版本更新

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

使用带闭环强化学习反馈的大语言模型判官对代理式股票预测系统进行多维行为评估

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * School of Electrical Engineering and Computer Science(电气工程与计算机科学学院)

AI总结 本文提出一种多维行为评估方法,通过大语言模型判官评估代理系统决策过程,利用闭环强化学习反馈提升预测性能,验证了方法在股票预测中的有效性。

Comments 17 pages, 5 figures, 14 tables. Manuscript submitted to Applied Artificial Intelligence (Taylor and Francis)

详情
AI中文摘要

代理式人工智能系统通过一系列相互依赖的自主决策产生输出,但标准评估只评估输出,无法诊断底层过程。本文开发了一种行为评估方法,通过对中间决策过程本身打分来补充输出级测试。在每个自主决策点记录的行为轨迹按五日为一个周期分组,并由三个大语言模型(LLM)判官组成的集成沿六个领域特定维度(市场状态检测、路由、适应、风险校准、策略一致性、错误恢复)评分。一种扰动程序在破坏一个维度的同时保持其余五个维度不变,验证了维度特异性;跨模型一致性达到Krippendorff's alpha=0.85。综合行为评分与已实现的20日夏普比率的相关性达到Spearman rho=0.72。为形成闭环,该框架将存在缺陷的各维度评分转换为带信用分配的惩罚项,加入Soft Actor-Critic奖励。仅在验证数据上进行的三轮微调,将留出的2017至2025年测试期上的单日MAPE从0.61%降低到0.54%(相对降低11.5%;p<0.001,d=0.31),该结果在Diebold-Mariano检验下显著,并经Giacomini-White检验定位于高波动性市场状态。该方法与具体应用无关,适用于任何可以记录中间决策的代理系统。

英文摘要

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.

2605.02028 2026-05-19 cs.CL 版本更新

Language models fail at extended rule following

语言模型在扩展规则遵循中表现不佳

Tianxiang Dai, Jonathan Fan

发表机构 * Department of Electrical Engineering, Stanford University(斯坦福大学电气工程系)

AI总结 研究发现语言模型在重复应用规则时无法保持精确状态,即使增加模型规模和计算资源也无法克服这一缺陷,表明需要新的模型架构来实现可靠的规则遵循。

Comments for accessing the data and code for reproduction of the study, see https://txdai.github.io/counting-reliability/

详情
AI中文摘要

大型语言模型在回答困难问题时能够通过检索、重新组合和关注长上下文中的信息来表现出色。然而,在代理任务中,需要额外的能力:在反复应用规则时保持精确的状态。我们发现语言模型在这一可靠性方面存在缺失。通过查询126种领先模型变体,执行对长字符串重复字符计数的任务,发现所有模型都无法准确计数超过模型依赖的语法敏感计数能力阈值。失败是突然的,并且即使增加模型规模、推理时间计算和外部工具,失败仍然持续。机制探测表明,模型使用有限的内部状态来模拟计数作为规则,一旦这些状态耗尽就会失败。此外,这些状态是执行复杂任务的基础。这些结果表明,为了使自主代理实现真正可靠的规则遵循能力,需要从根本上新的模型架构。

英文摘要

Large language models are highly capable of answering difficult questions by retrieving, recombining, and attending to information in long contexts. For agentic tasks, an additional capability is required: the preservation of an exact state while repeatedly applying rules. We find that this reliability is absent across language models. To demonstrate, we query 126 leading model variants with the task of counting a long string of repeated characters, and we find that none can count accurately above a model-dependent, syntax-sensitive counting capacity threshold. Failures are abrupt and persist even with increasing model size, inference-time computation, and external tools. Mechanistic probing indicates that models use a finite number of internal states to mimic counting as a rule and fail once these states are exhausted. Furthermore, such states are the basis for performing complex tasks beyond counting. These results indicate that fundamentally new model architectures are required for autonomous agents to achieve truly reliable rule-following capabilities.

2605.00155 2026-05-19 cs.LG cs.CL math.OC stat.ML 版本更新

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

用于基于人类反馈的强化学习的Wasserstein分布鲁棒遗憾优化

Yikai Wang, Shang Liu, Jose Blanchet

发表机构 * Department of Statistics and Operations Research, University of North Carolina(统计与运筹学系,北卡罗来纳大学) Imperial Business School, Imperial College London(帝国理工学院伦敦商学院) Department of Management Science and Engineering, Stanford University(管理科学与工程系,斯坦福大学)

AI总结 本文提出用于基于人类反馈的强化学习的Wasserstein分布鲁棒遗憾优化(DRRO),通过单纯形分配模型研究逐提示问题,证明在ℓ1地面成本Wasserstein模糊集下,内层最坏情况遗憾有精确解,最优策略具有注水结构,由此得到高效的策略梯度算法。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为对齐大语言模型的核心后训练步骤,但RLHF中使用的奖励信号只是真实人类效用的一个学得的代理。从运筹学角度看,这构成了目标设定有误情形下的决策问题:策略针对估计奖励进行优化,而部署性能由未观测到的目标决定。由此产生的差距导致奖励过度优化,即Goodharting现象:代理奖励在真实质量下降后仍继续上升。现有缓解方法通过不确定性惩罚、悲观奖励或保守约束来处理这一问题,但它们可能计算负担重且过于悲观。我们提出用于RLHF的Wasserstein分布鲁棒遗憾优化(DRRO)。与标准DRO对最坏情况价值取悲观估计不同,DRRO对最坏情况遗憾取悲观估计,该遗憾相对于同一合理奖励扰动下的最佳策略来度量。我们通过单纯形分配模型研究逐提示问题,证明在 $\ell_1$ 地面成本的Wasserstein模糊集下,内层最坏情况遗憾有精确解,最优策略具有注水结构。这些结果导出一种实用的策略梯度算法,它可解释为简单的采样奖金,且只需对GRPO式RLHF训练做少量改动。该框架还从理论上阐明了为什么DRRO不如DRO悲观;实验表明,DRRO比现有基线更有效地缓解过度优化,而标准DRO则系统性地过于悲观。

英文摘要

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$-ground-cost Wasserstein ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
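
The contrast between pessimizing value and pessimizing regret can be written out as two objectives. The notation below is an assumption introduced for illustration and is not taken from the paper: $\hat r$ is the learned proxy reward, $\mathcal{U}(\hat r)$ stands for the ambiguity set of plausible rewards around it, and $V(\pi; r)$ is the expected reward of policy $\pi$ under reward $r$.

```latex
% Standard DRO: maximize the worst-case value.
\pi_{\mathrm{DRO}} \in \arg\max_{\pi} \; \min_{r \in \mathcal{U}(\hat r)} V(\pi; r)

% DRRO: minimize the worst-case regret against the best policy
% under the same perturbed reward r.
\pi_{\mathrm{DRRO}} \in \arg\min_{\pi} \; \max_{r \in \mathcal{U}(\hat r)}
  \Bigl[ \max_{\pi'} V(\pi'; r) - V(\pi; r) \Bigr]
```

One way to read the difference: a perturbation that lowers every response's reward by the same amount reduces $V(\pi; r)$ for all policies, so it counts against a value-based worst case, but it cancels inside the bracket and leaves regret unchanged. Only perturbations that change which responses are better than others affect the regret objective.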

2604.26904 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym:一种构建有效Claw代理的可扩展框架

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence(人工智能学院) Renmin University of China(中国人民大学) IQuest Research(IQuest研究) Beihang University(北航)

AI总结 本文提出ClawGym框架,用于构建Claw式个人代理的全生命周期,通过合成可验证训练数据和强化学习方法提升代理效能。

详情
AI中文摘要

Claw-style环境支持在本地文件、工具和持久工作区状态上进行多步骤工作流。然而,围绕这些环境的可扩展开发受限于缺乏系统框架,尤其是合成可验证训练数据并将其与代理训练和诊断评估集成的框架。为解决这一挑战,我们提出了ClawGym,一种支持Claw式个人代理全生命周期的可扩展框架。具体而言,我们构建了ClawGym-SynData,一个包含13500个过滤任务的多样化数据集,这些任务由基于人物驱动的意图和技能基础操作合成,配以现实的模拟工作区和混合验证机制。我们随后通过在黑箱滚动轨迹上进行监督微调训练了一组有能力的Claw式模型,称为ClawGym-Agents,并进一步通过轻量级管道探索强化学习,该管道在每项任务的沙箱中并行化滚动。为了支持可靠的评估,我们进一步构建了ClawGym-Bench,一个通过自动化过滤和人工LLM审查校准的200个实例的基准。相关资源已发布在https://github.com/ClawGym。

英文摘要

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.

2604.17338 2026-05-19 cs.SE cs.CL 版本更新

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

精确调试基准:你的模型是在调试还是重生成?

Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia

Affiliations * University of Southern California; Microsoft; University of Wisconsin–Madison; University of Toronto

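AI summary This paper introduces the PDB benchmark for evaluating LLM debugging ability and finds that frontier models debug with low precision, failing to reach high precision even when explicitly instructed to make minimal edits.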

详情
AI中文摘要

与代码补全不同,调试需定位故障并应用针对性修改。我们发现前沿LLM在调试时常生成正确但过度修改的解决方案。为评估LLM离精确调试的距离,我们引入精确调试基准(PDB)框架,可将任意编码数据集转换为调试基准并进行精度感知评估。PDB通过合成已验证的原子故障并组合成多故障程序生成bug程序。我们定义了两个新指标:编辑级精度和故障级召回率,分别衡量必要修改的数量和解决的故障数量。我们发布了两个评估基准:PDB-Single-Hard(单行bug)和PDB-Multi(多行bug)。实验显示,前沿模型如GPT-5.1-Codex和DeepSeek-V3.2-Thinking在单元测试通过率超过76%的情况下,精度仍低于45%,即使被明确指示进行最小化调试。最后,我们表明迭代和代理调试策略对精度和召回率无显著提升,凸显了重新思考编码模型训练后流程的必要性。

英文摘要

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which respectively measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
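
The two metrics named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the set-based definitions below are an assumption: precision is taken as the share of edited lines that were necessary, and recall as the share of injected bugs that were resolved. The function names and the line-number representation are hypothetical.

```python
def edit_precision(edited_lines: set[int], necessary_lines: set[int]) -> float:
    """Share of the model's edited lines that were actually necessary (assumed definition)."""
    if not edited_lines:
        return 0.0
    return len(edited_lines & necessary_lines) / len(edited_lines)


def bug_recall(resolved_bugs: set[str], injected_bugs: set[str]) -> float:
    """Share of injected bugs that the model's patch resolves (assumed definition)."""
    if not injected_bugs:
        return 1.0
    return len(resolved_bugs & injected_bugs) / len(injected_bugs)


# A patch that fixes both bugs but also rewrites six lines it did not need to touch.
edited = {3, 4, 5, 6, 7, 8, 9, 10}
necessary = {4, 9}
print(edit_precision(edited, necessary))          # 0.25
print(bug_recall({"b1", "b2"}, {"b1", "b2"}))     # 1.0
```

The example mirrors the pattern the abstract reports: a regenerated solution can resolve every bug, and so pass unit tests, while scoring low on precision.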

2604.12766 2026-05-19 cs.CL 版本更新

NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation

NaviRAG:迈向检索增强生成的主动知识导航

Jihao Dai, Dingjun Wu, Yuxuan Chen, Zheni Zeng, Yukun Yan, Zhenghao Liu, Maosong Sun

Affiliations * Tsinghua University; Nanjing University; Northeastern University

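AI summary NaviRAG is an active knowledge navigation framework that strengthens retrieval-augmented generation on complex tasks, using hierarchical knowledge organization and dynamic retrieval to improve retrieval recall and end-to-end answer performance.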

详情
英文摘要

Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging these reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm that the performance gains stem from our method's capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss the efficiency, applicable scenarios, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.

2604.04202 2026-05-19 cs.LG cs.AI cs.CL 版本更新

ClawArena: Benchmarking AI Agents in Evolving Information Environments

ClawArena:在演化的信息环境中评估AI代理的基准测试

Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Affiliations * UNC-Chapel Hill; University of California, Santa Cruz; University of California, Berkeley

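AI summary ClawArena evaluates how AI agents cope with evolving information environments, organized around three challenges (multi-source conflict reasoning, dynamic belief revision, and implicit personalization) and testing agents across multi-channel sessions, workspace files, and staged updates.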

详情
AI中文摘要

部署为持久助手的AI代理必须在信息环境演变时保持正确信念。实际中,证据分散在异构来源中,常相互矛盾,新信息可能推翻先前结论,用户偏好通过修正而非明确指令出现。现有基准大多假设静态、单一权威环境,不评估代理能否应对这种复杂性。我们引入ClawArena,一个评估AI代理在演化的信息环境中的基准。每个场景保持完整的隐藏真实情况,同时仅向代理暴露噪声、部分且有时矛盾的痕迹,跨多通道会话、工作区文件和阶段更新。评估围绕三个相互关联的挑战:多源冲突推理、动态信念更新和隐式个性化,其相互作用产生14类问题分类。两种问题格式,多选(集合选择)和基于shell的可执行检查,测试推理和工作区定位。ClawArena包含12个多轮场景,覆盖337个评估轮次和45个动态更新,评估五个代理框架和18种语言模型,来自专有、社区可访问和自托管来源。实验表明,模型能力在模型间产生29分的分数范围,而框架设计最多产生24分的范围,MetaClaw的技能叠加可靠提高分数而不降低准确性,信念更新难度由更新设计策略而非更新量决定。代码可在https://github.com/aiming-lab/ClawArena获取。

英文摘要

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. ClawArena comprises 12 multi-turn scenarios spanning 337 evaluation rounds with 45 dynamic updates, evaluated across five agent frameworks and 18 language models from proprietary, community-accessible, and self-hosted sources. Experiments show that model capability accounts for a 29-point score range across models while framework design accounts for up to a 24-point range, that MetaClaw's skill overlay reliably improves score without degrading accuracy, and that belief revision difficulty is determined by update design strategy rather than update volume. Code is available at https://github.com/aiming-lab/ClawArena.

2604.01404 2026-05-19 cs.CL cs.AI 版本更新

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

硅世界中的朋友与祖母:在语言模型中本地化实体细胞

Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva

Affiliations * Mentaleap; Independent Researcher; Tel Aviv University

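AI summary The study investigates how language models retrieve entity-specific facts by searching for sparse, entity-selective MLP neurons (entity cells), and finds that these cells cluster in early layers and play a causal role in factual recall.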

详情
AI中文摘要

语言模型如何从参数中检索实体特定事实?我们通过寻找稀疏、实体选择性的MLP神经元(称为实体细胞,类比神经科学中的'祖母细胞'假说)来探讨这一问题,并测试这些细胞在事实回忆中的因果作用。我们通过在不同提示下对同一实体的激活一致性对MLP神经元进行排名,跨七个模型在Curated PopQA子集上应用此过程。所有模型中,本地化神经元主要集中在早期层,这一经验模式并非由架构强制。使用Qwen2.5-7B base作为模型生物,我们发现最清晰的因果证据:抑制局部细胞会擦除其匹配实体的回忆,而其他保持不变;激活单个细胞足以恢复大多数实体的正确知识,即使实体不在上下文中。相同的细胞在别名、缩写、拼写错误和多语言表层形式下仍能恢复,并在指令微调中保持稳定,表明它们编码的是实体身份而非表层标记模式。因果信号在不同模型家族中变化,指出了架构差异如何影响实体知识的组织。这些发现为理解、控制和纠正语言模型中的事实知识提供了具体且可解释的访问点,并与神经科学中关于稀疏编码概念的长期问题建立了令人惊讶的经验平行。

英文摘要

How do language models retrieve entity-specific facts from their parameters? We investigate this question by searching for sparse, entity-selective MLP neurons - which we call entity cells, by analogy to the "grandmother cell" hypothesis in neuroscience - and testing whether they play a causal role in factual recall. We localize candidate entity cells by ranking MLP neurons for activation consistency across varied prompts about the same entity, applying this procedure across seven models on a curated subset of PopQA. In all models, localized neurons cluster predominantly in early layers, an empirical pattern not imposed by the architecture. Using Qwen2.5-7B base as a model organism, we find the clearest causal evidence: suppressing a localized cell selectively erases recall for its matched entity while leaving others intact, and activating a single cell is sufficient to recover correct knowledge for most entities - even when the entity is absent from the context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns. Causal signals vary across model families, pointing to architectural differences in how entity knowledge is organized. These findings offer concrete, interpretable access points for understanding, controlling, and correcting factual knowledge in language models, and draw a surprising empirical parallel to longstanding questions in neuroscience about sparse coding of concepts.

2603.17588 2026-05-19 cs.IR cs.CL 版本更新

From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation

从孤立评分到协作排名:一种基于LLM的论文评估比较原生框架

Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu, Yong Huang, Tianrui Guo, Wei Lu

Affiliations * School of Economics and Management, East China Normal University; School of Information Management, Wuhan University; China Academic Degrees & Graduate Education Development Center

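AI summary This paper proposes the CNPE framework, which integrates comparison into both data construction and model learning to improve relative quality judgment in paper evaluation; experiments show an average relative improvement of 21.8% over DeepReview-14B.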

Comments Accepted at Findings of ACL 2026

详情
AI中文摘要

大型语言模型(LLM)目前通过为每篇论文分配绝对分数来评估科学论文,但因评分标准在会议、时间周期和评估标准上存在差异,训练绝对分数模型易导致适应狭窄、特定情境规则而非发展稳健的学术判断。为克服此限制,本文提出将论文评估从孤立评分转向协作排名。具体而言,设计了用于论文评估的比较原生框架(CNPE),整合比较机制于数据构建和模型学习。首先提出基于图的相似性排名算法以从文献集合中采样更具信息量和区分度的论文对。随后通过监督微调和强化学习结合基于比较的奖励增强相对质量判断。在推理阶段,模型对采样论文对进行成对比较,并将这些偏好信号聚合为全局相对质量排名。实验结果表明,本文框架在强基线DeepReview-14B上实现了平均相对改进21.8%,并在五个之前未见过的数据集上表现出稳健的泛化能力。代码可在https://github.com/ECNU-Text-Computing/ComparisonReview获取。

英文摘要

Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a $\textbf{C}$omparison-$\textbf{N}$ative framework for $\textbf{P}$aper $\textbf{E}$valuation ($\textbf{CNPE}$), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Our code is available at https://github.com/ECNU-Text-Computing/ComparisonReview.
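
The inference step described above, where pairwise preferences are aggregated into a global ranking, can be sketched as follows. The abstract does not say which aggregation rule CNPE uses, so the win-rate rule here is an assumption chosen for simplicity, and `rank_from_pairs` is a hypothetical name.

```python
from collections import defaultdict


def rank_from_pairs(preferences: list[tuple[str, str]]) -> list[str]:
    """Rank items by win rate over (winner, loser) pairwise outcomes.

    Win-rate aggregation is an illustrative assumption, not the paper's method.
    """
    wins: dict[str, int] = defaultdict(int)
    games: dict[str, int] = defaultdict(int)
    for winner, loser in preferences:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Highest win rate first; ties are broken alphabetically for determinism.
    return sorted(games, key=lambda p: (-wins[p] / games[p], p))


pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("B", "D")]
print(rank_from_pairs(pairs))  # ['A', 'B', 'C', 'D']
```

Because only sampled pairs are compared, some papers appear in fewer comparisons than others; normalizing by the number of comparisons keeps the ranking from simply rewarding papers that were sampled more often.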

2603.16091 2026-05-19 cs.CL cs.AI 版本更新

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

CounterRefine:用于事实问答中推理时知识修复的答案条件计数证据检索

Tianyi Huang, Ying Kai Deng

Affiliations * Ryquo App-In Club

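AI summary CounterRefine improves factual question answering accuracy by retrieving answer-conditioned evidence at inference time and applying a constrained refinement step; on the full SimpleQA benchmark it improves a matched one-pass RAG baseline by up to 5.8 correct-rate points.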

Comments Accepted at the 4th Workshop on Towards Knowledgeable Foundation Models at ACL 2026

详情
AI中文摘要

在事实问答中,许多错误并非检索失败,而是对答案的固执。我们提出了CounterRefine,一种轻量级的修复层,用于短形式RAG。该方法将第一个答案视为假设进行检验。给定草稿,CounterRefine会发出答案条件扩展查询以检索候选特定证据,然后应用受约束的KEEP或REVISE修正步骤,其提出的修订仅在确定性验证后才被接受。设计是故意狭窄的:它添加了一次证据收集流程和一次受保护的修正调用,而不是替换检索器或构建广泛代理系统。在完整的SimpleQA基准测试中,CounterRefine将匹配的一次通过RAG基线改进了最多5.8个正确率点;在完整的Claude轨迹中,它只改变了5.6%的输出,其中180个有益变化和8个有害变化。这些发现表明,对于知识丰富的基础模型来说,除了访问证据外,它们还应能够利用该证据重新考虑,并在必要时修复自己的答案。

英文摘要

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight repair layer for short-form RAG that treats the first answer as a hypothesis to test. Given a draft, CounterRefine issues answer-conditioned expansion queries to retrieve candidate-specific evidence, then applies a constrained KEEP or REVISE refinement step whose proposed revisions are accepted only after deterministic validation. The design is intentionally narrow: it adds one evidence-gathering pass and one guarded refinement call rather than replacing the retriever or building a broad agentic system. On the full SimpleQA benchmark, CounterRefine improves a matched one-pass RAG baseline by up to 5.8 correct-rate points; in the full Claude trace, it changes only 5.6% of outputs, with 180 beneficial outcome changes and 8 harmful ones. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

2603.04737 2026-05-19 cs.AI cs.CL cs.LG 版本更新

Interactive Benchmarks

交互式基准测试

Baoqing Yue, Zihan Zhu, Yutong Han, Brian Fan, Qian Sun, Jichen Feng, Hufei Yang, Yifan Zhang, Mengdi Wang

Affiliations * InteractiveBench; Princeton University

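AI summary This paper proposes Interactive Benchmarks, which assess model reasoning through budgeted multi-turn interaction to address the limitations of fixed benchmarks and preference-based evaluation, and finds substantial room for improvement in interactive scenarios.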

Comments Project Page: https://github.com/interactivebench/interactivebench

详情
AI中文摘要

现有的推理评估范式存在不同局限:固定基准日益饱和且易受污染,而基于偏好的评估依赖主观判断。我们主张智能的核心在于决定获取哪些信息以及如何有效使用它们。我们提出了交互式基准,一种统一的评估范式,通过预算化的多轮交互评估模型的推理能力。我们在两种设置中评估模型:交互证明,其中模型与裁判互动解决逻辑、UI2Html和数学任务,在客观反馈下;以及交互游戏,其中模型战略推理以最大化长期效用。我们的结果表明,交互式基准提供了更稳健的评估,揭示了模型在交互场景中的显著改进空间。

英文摘要

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

2602.08437 2026-05-19 cs.CL 版本更新

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

大语言模型与不可能的语言习得:"虚假承诺"或对当前人工智能观点的颠覆

Ziyan Wang, Longlong Ma

Affiliations * New Talent Academy; Institute of Software, University of Chinese Academy of Sciences

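AI summary This paper experimentally examines whether language models can learn possible and impossible languages: GPT-2 small reaches lower loss on natural language than on the impossible-language conditions, whereas LSTM models show minimal differences, which the authors use to question the rationalist basis of Chomsky's critique of AI.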

详情
AI中文摘要

在Chomsky的挑衅性批评《CHATGPT的虚假承诺》中,大语言模型(LLMs)被描述为仅能预测模式的工具,无法像人类一样通过内在因果和自我修正结构习得语言,因此无法区分可能与不可能的语言。它代表了对AI智力基础的根本挑战,因为它整合了LLMs方法论中的主要问题,并具有一个典型的先验理性主义视角。我们从语言学和心理学现有文献的视角以及基于实验研究LLMs学习可能和不可能语言能力的角度审视这一著名批评。我们通过将英语应用特定转换生成了语法上不可能的语言,包括反转整个句子和根据词数奇偶性添加否定。在GPT-2小模型和长短期记忆(LSTM)模型上进行了两轮受控实验。单次运行训练轨迹的描述性分析显示,GPT-2小模型在自然语言任务上的最终损失较低、收敛速度更快、困惑度较低,其中反转条件表现最差(损失比自然语言高达2.25倍)。LSTM模型在不同条件下则差异最小。鉴于实验的单次运行性质(每种条件n=1),我们报告了描述性比较,并提醒正式统计推断无法进行。基于理论分析和描述性实证发现,我们提出了一种新的Chomsky理论视角,即在LLMs研究中从Chomsky的“理性主义-浪漫主义”范式转向功能主义和经验主义。

英文摘要

In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that, unlike humans, do not acquire languages through intrinsic causal and self-correcting structures and therefore cannot distinguish possible from impossible languages. The critique is representative of a fundamental challenge to the intellectual foundations of AI: it synthesizes major methodological objections to LLMs and adopts an iconic a priori rationalist perspective. We examine this critique from two angles: the existing literature in linguistics and psychology, and an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were conducted on each of GPT-2 small and long short-term memory (LSTM) models. Descriptive analysis of single-run training trajectories shows that GPT-2 small models exhibit lower final loss, faster convergence, and lower perplexity on natural language than on the impossible-language conditions, with the reversed condition showing the largest departure (loss up to 2.25 times that of natural language). LSTM models, by contrast, show minimal differences across conditions. Given the single-run nature of our experiments (n=1 per condition), we report descriptive comparisons and caution that formal statistical inference is precluded. Based on theoretical analysis and descriptive empirical findings, we propose a new view of LLMs within Chomsky's theory, and a shift of theoretical paradigm beyond Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLM research.
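
The two transformations mentioned in the abstract can be sketched as follows. The abstract does not specify the details, so several choices here are assumptions: reversal is done at the word level, the negation marker is the token "not", and it is appended when the sentence has an even number of words.

```python
def reverse_sentence(sentence: str) -> str:
    """Reverse the word order of a whole sentence (word-level reversal is an assumption)."""
    return " ".join(reversed(sentence.split()))


def parity_negation(sentence: str, marker: str = "not") -> str:
    """Append a negation marker when the word count is even.

    The marker, its position, and the even-parity rule are illustrative assumptions.
    """
    words = sentence.split()
    if len(words) % 2 == 0:
        words.append(marker)
    return " ".join(words)


print(reverse_sentence("the cat sat on the mat"))  # mat the on sat cat the
print(parity_negation("the cat sat on the mat"))   # the cat sat on the mat not
print(parity_negation("the cat sat"))              # the cat sat
```

Both rules are easy to state but depend on counting or global reordering, not on hierarchical structure, which is what makes the resulting languages "impossible" in the sense the critique has in mind.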

2602.08169 2026-05-19 cs.LG cs.CL 版本更新

Spherical Steering: Geometry-Aware Activation Rotation for Language Models

球面操控:面向语言模型的几何感知激活旋转

Zejia You, Chunyuan Deng, Hanjie Chen

Affiliations * Rice University; Tufts University

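AI summary This paper proposes Spherical Steering, a training-free method for inference-time control that rotates activations instead of adding to them, avoiding changes to hidden-state magnitude and improving multiple-choice benchmark performance while preserving open-ended generation quality.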

Comments ICML 2026

详情
AI中文摘要

在推理过程中,操控语言模型(LMs)而不重新训练是一种有前景的方法。然而,标准方法通常依赖于激活加法,这不可避免地会改变隐藏状态的幅度,引发表示崩溃和开放生成退化的问题。本文探讨了球面操控,一种无需训练的原始方法,通过激活旋转解决这一权衡问题。与使用固定向量移动激活不同,我们的方法沿向目标方向的测地线旋转激活,从而在保持信号完整性的同时指向目标概念。为进一步增强适应性,我们引入了一个置信度门,根据输入不确定性动态调节操控强度。在多个选择基准上的广泛实验表明,球面操控在多项选择基准上显著优于加法基线(在TruthfulQA、COPA和Storycloze上分别提高10%),同时同时保持模型的开放生成质量。这项工作强调了几何一致性的重要性,表明保持范数的旋转是一种稳健且有效的方法,用于精确的推理时间控制。代码可在:https://github.com/chili-lab/Spherical-Steering 获取。

英文摘要

Inference-time steering offers a promising way to control language models (LMs) without retraining. However, standard approaches typically rely on activation addition, which inevitably alters the hidden-state magnitudes, raising concerns about representation collapse and degraded open-ended generation. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Rather than shifting activations with a fixed vector, our method rotates them along a geodesic toward a target direction, preserving signal integrity while steering toward the target concept. To further enhance adaptivity, we incorporate a confidence gate that dynamically modulates steering strength based on input uncertainty. Extensive experiments across multiple-choice benchmarks demonstrate that Spherical Steering significantly outperforms addition-based baselines (notably by +10% on TruthfulQA, COPA, and Storycloze), while maintaining the model's general open-ended generation quality. This work highlights the value of geometric consistency, suggesting that norm-preserving rotation is a robust and effective primitive for precise inference-time control. The code is available at: https://github.com/chili-lab/Spherical-Steering.
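
The core idea above, rotating an activation along a geodesic toward a target direction while keeping its norm, can be sketched with spherical linear interpolation. This is a generic illustration and not the authors' implementation: the function name `rotate_toward` and the fixed fraction `t` are assumptions, and the confidence gate mentioned in the abstract is not modeled.

```python
import math


def rotate_toward(h: list[float], target: list[float], t: float) -> list[float]:
    """Rotate h a fraction t of the way along the great-circle arc toward target.

    The result has the same Euclidean norm as h (up to floating-point error).
    """
    norm_h = math.sqrt(sum(x * x for x in h))
    norm_t = math.sqrt(sum(x * x for x in target))
    u = [x / norm_h for x in h]        # unit vector of the activation
    v = [x / norm_t for x in target]   # unit vector of the target direction
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(cos_omega)       # angle between the two directions
    s = math.sin(omega)
    if s < 1e-8:
        # Directions are (anti)parallel: the arc is trivial or not unique, so leave h unchanged.
        return list(h)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return [norm_h * (a * ui + b * vi) for ui, vi in zip(u, v)]


steered = rotate_toward([3.0, 0.0], [0.0, 1.0], 0.5)
print(steered)                                   # halfway between the two axes
print(math.sqrt(sum(x * x for x in steered)))    # stays approximately 3.0
```

By contrast, adding a fixed steering vector to `h` would generally change its norm, which is the side effect the abstract attributes to activation addition.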

2602.02039 2026-05-19 cs.AI cs.CL cs.DB cs.LG 版本更新

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

在大型语言模型上进行深度数据研究:评估深度数据研究

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

Affiliations * GitHub

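AI summary This paper introduces the Deep Data Research (DDR) task and the DDR-Bench benchmark to evaluate the investigatory intelligence of large language models, and finds that effective exploration depends on intrinsic strategies, not only on agent scaffolding or scaling.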

Comments 14 pages, 7 tables, 8 figures, accepted by ICML 2026

详情
AI中文摘要

Agentic Large Language Models 的代理预期超越正确回答,要求模型自主设定目标和决定探索方向。我们称其为探索智能,区别于仅完成任务的执行智能。数据科学提供自然测试场,因为现实分析从原始数据而非明确查询开始,但很少有基准关注此领域。为此,我们引入深度数据研究(DDR),一个开放任务,使 LLM 自主从数据库提取关键洞察,并提出 DDR-Bench,一个大规模、基于清单的基准,支持可验证评估。结果表明,尽管前沿模型显示出新兴自主性,但长周期探索仍具挑战性。我们的分析强调,有效的探索智能不仅依赖代理支架或单纯扩展,还依赖于 agentic 模型的内在策略。

英文摘要

The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

2601.16527 2026-05-19 cs.LG cs.AI cs.CL cs.CV 版本更新

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

超越表面遗忘:多模态大语言模型中Hallucinations的锐度感知鲁棒擦除

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

Affiliations * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Institute for AI, Tsinghua University; Huzhou University; Institute of Dataspace, Hefei Comprehensive National Science Center; University of Science and Technology of China

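AI summary This paper proposes SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to robustly erase hallucinations in multimodal LLMs, keeping suppression stable under relearning and parameter updates.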

详情
AI中文摘要

多模态大语言模型虽然强大,但容易产生hallucinations,即不存在的实体,影响可靠性。尽管最近的遗忘方法试图缓解这一问题,我们发现了一个关键缺陷:结构脆弱性。我们实证显示,标准擦除仅能表面抑制,使模型陷入尖锐极小值,轻度重新学习后hallucinations会灾难性复苏。为确保几何稳定性,我们提出SARE,将遗忘视为目标min-max优化问题,并使用Targeted-SAM机制显式平坦hallucinated概念周围的损失景观。通过在模拟最坏情况参数扰动下抑制hallucinations,我们的框架确保了鲁棒去除的稳定性。大量实验表明,SARE在擦除效果上显著优于基线,同时保持一般生成质量。关键的是,它在重新学习和参数更新中维持持久的hallucination抑制,验证了几何稳定性的有效性。

英文摘要

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal that remains stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
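
For orientation, the generic sharpness-aware minimization (SAM) objective that a min-max formulation of this kind builds on can be written as below. This is the standard SAM form, not the paper's exact objective: the abstract does not state how Targeted-SAM restricts the loss to hallucinated concepts, so the loss $\mathcal{L}$, the parameters $\theta$, and the perturbation radius $\rho$ are placeholder symbols.

```latex
\min_{\theta}\; \max_{\|\epsilon\|_2 \le \rho}\; \mathcal{L}(\theta + \epsilon),
\qquad
\hat{\epsilon}(\theta) \;=\; \rho\,
\frac{\nabla_{\theta}\mathcal{L}(\theta)}{\|\nabla_{\theta}\mathcal{L}(\theta)\|_2}
```

The inner maximization finds the worst-case weight perturbation within radius $\rho$, and $\hat{\epsilon}$ is its usual first-order approximation; minimizing the loss at $\theta + \hat{\epsilon}$ favors flat regions, which is the sense in which the erased state resists small weight shifts.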

2601.16398 2026-05-19 cs.CY cs.CL cs.LG 版本更新

White-Box Sensitivity Auditing with Steering Vectors

白盒敏感性审计与引导向量

Hannah Cyberey, Yangfeng Ji, David Evans

Affiliations * University of Virginia

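AI summary This paper proposes a white-box sensitivity auditing framework that uses activation steering to assess LLMs through their internals, and applies it to bias audits that reveal model dependence on protected attributes.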

详情
AI中文摘要

算法审计是检查系统属性的重要工具,当前对大语言模型(LLM)的审计主要依赖黑盒评估,仅通过输入输出测试。这些方法局限于输入空间中的测试,通常由启发式生成。此外,许多社会相关模型属性(如性别偏见)抽象且难以通过文本输入单独测量。为解决这些限制,我们提出了一种白盒敏感性审计框架,利用激活引导进行更严格的内部评估。我们的审计方法通过操纵关键概念进行内部敏感性测试,以评估模型的预期功能。我们展示了其在四个模拟高风险LLM决策任务中的应用。我们的方法一致表明,模型预测对保护属性存在显著依赖,即使在标准黑盒评估表明几乎没有偏见的设置中。我们的代码在https://github.com/hannahxchen/llm-steering-audit上公开可用。

英文摘要

Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black-box evaluations that assess model behavior only through input-output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently indicates substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm-steering-audit

2601.13992 2026-05-19 cs.CL cs.AI 版本更新

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

整体大于部分之和:一种兼容性感知的多教师CoT蒸馏框架

Jin Cui, Jiaqi Guo, Ruixuan Yang, Jiayi Lu, Jiepeng Zhou, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren

Affiliations * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Nankai University; The Hong Kong University of Science and Technology (Guangzhou); School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

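AI summary This paper proposes COMPACT, a framework that dynamically weights the gradients of different teachers using a multi-dimensional compatibility metric, integrating diverse reasoning capabilities into the student model while mitigating catastrophic forgetting.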

Comments 11 pages, 9 figures

详情
AI中文摘要

链式推理(CoT)推理赋予大语言模型(LLMs)显著能力,但通常需要极高的参数规模。CoT蒸馏作为一种有前景的范式,将推理能力转移到紧凑的学生模型(SLMs)中,但现有方法通常依赖单一教师,限制了学生潜力,因为个体LLMs常有不同能力偏倚且可能遭受灾难性遗忘。虽然利用多样教师似乎有吸引力,但有效融合其监督仍具挑战:教师-学生不兼容可能放大幻觉,被动监督无法确保真实逻辑内化。为此,我们引入COMPACT框架,通过动态加权教师梯度,基于多维指标评估学生实时兼容性:(1)基于图的共识过滤误导性推理路径;(2)基于互信息的适应性检测“顿悟时刻”以真正理解推理过程而非单纯模仿;(3)基于损失的难度评估学生对教师指导的接受度并防止负迁移。大量实验和潜在空间分析表明,COMPACT能有效整合多样化推理能力而不破坏模型原有知识结构,在各种基准测试中取得最佳性能并缓解灾难性遗忘。

英文摘要

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.

2601.11956 2026-05-19 cs.CL cs.AI 版本更新

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

双重校准:通过校准知识和推理置信度实现可靠的LLM

Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li

Affiliations * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR

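AI summary This paper proposes DoublyCal, a double-calibration framework that calibrates both evidence confidence and reasoning confidence to make LLMs more reliable; experiments show significant gains in accuracy and confidence calibration at low token cost.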

Comments This work is to appear in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

详情
AI中文摘要

Large Language Models (LLMs) 的可靠推理受到其易产生幻觉的挑战。尽管通过知识图谱(KGs)增强LLMs可以提高事实准确性,但现有KG增强方法未能量化检索证据和LLM推理中的epistemic不确定性。为此,我们引入DoublyCal框架,基于新颖的双重校准原则。DoublyCal采用轻量级代理模型生成KG证据并校准证据置信度,此校准的支撑证据引导黑盒LLM,产生更准确且校准良好的最终预测,其置信度可追溯到支撑证据的不确定性。在知识密集型基准测试中,DoublyCal显著提高了黑盒LLM的准确性和置信度校准,同时保持低token成本。

英文摘要

Reliable reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs while maintaining low token cost.

2601.06633 2026-05-19 cs.LG cs.AI cs.CL cs.CY 版本更新

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER:面向开放性编程任务的知识对齐学生错误模拟器

Zhangqi Duan, Nigel Fernandez, Andrew Lan

发表机构 * University of Massachusetts(马萨诸塞大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

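AI summary KASER uses reinforcement learning with a hybrid reward combining code similarity, error matching, and prediction diversity to improve how LLMs simulate and predict student errors; experiments show it outperforms baselines on code and error prediction and on error coverage.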

Comments Published in ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

开放性任务,如计算机科学教育中的编程问题,能提供关于学生知识的深入洞察。然而,训练大语言模型(LLMs)模拟和预测学生在这些问题上的可能错误具有挑战性:它们常出现模式崩溃,并无法充分捕捉学生响应中的语法、风格和解决方案方法的多样性。在本文中,我们提出了KASER(知识对齐学生错误模拟器),一种将错误与学生知识对齐的新方法。我们提出了一种基于强化学习的训练方法,使用混合奖励反映学生代码预测的三个方面:i)代码与地面真相的相似性,ii)错误匹配,以及iii)代码预测的多样性。在两个真实世界数据集上,我们进行了两个层面的评估,并表明:在每对学生-问题对层面,我们的方法在代码和错误预测上优于基线;在每问题层面,我们的方法在错误覆盖和模拟代码多样性上优于基线。

英文摘要

Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that, at the per-student-problem-pair level, our method outperforms baselines on code and error prediction, and that, at the per-problem level, it outperforms baselines on error coverage and simulated code diversity.

2512.17843 2026-05-19 cs.CL cs.AI cs.HC 版本更新

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat: 一个真实对话的大型数据集

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

发表机构 * Indiana University(印第安纳大学)

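AI summary: This paper presents the ShareChat dataset of 142,808 conversations (660,293 turns) across 95 languages, and analyzes cross-platform differences in conversation completeness and response latency to reveal the characteristics of multi-platform interaction.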

详情
AI中文摘要

通过统一的文本接口评估大型语言模型(LLMs),当前学术基准掩盖了不同商业平台的独特设计和功能如何影响真实用户行为和系统性能。为弥合这一差距,我们提出了ShareChat,这是首个包含142,808条对话(660,293个回合)的大型语料库,从ChatGPT、Perplexity、Grok、Gemini和Claude的公开共享URL中收集。ShareChat保留了原生平台功能,包括引用、思考痕迹和代码 artifacts,涵盖95种语言,时间跨度从2023年4月至2025年10月,补充了现有语料库中同质化交互的不足。为了展示数据集的评估用途,我们提出了三个案例研究:对话完整性分析评估跨平台意图满足差异,来源定位分析比较搜索增强系统之间的引用策略,时间分析揭示响应延迟动态的差异。这些分析展示了单平台或剥离功能语料库无法解决的研究问题。该数据集已公开可用。

英文摘要

By evaluating Large Language Models (LLMs) through uniform, text-only interfaces, current academic benchmarks obscure how the unique designs and affordances of distinct commercial platforms shape real-world user behavior and system performance. To bridge this gap, we present ShareChat, the first large-scale corpus of 142,808 conversations (660,293 turns) collected from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat preserves native platform affordances, including citations, thinking traces, and code artifacts, across 95 languages and the period from April 2023 to October 2025, complementing existing corpora that homogenize these interactions. To demonstrate the dataset's evaluative utility, we present three case studies: a conversation completeness analysis assessing cross-platform differences in intent satisfaction, a source grounding analysis comparing citation strategies between search-augmented systems, and a temporal analysis revealing divergent response latency dynamics. Together, these analyses demonstrate research questions that are inaccessible to single-platform or stripped-affordance corpora. The dataset is publicly available.

2511.19078 2026-05-19 cs.CL cs.AI 版本更新

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind: 一种基于动态GNN的定理选择与结论生成框架用于LLM推理

Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

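AI summary: GraphMind couples a dynamic graph neural network with LLMs to perform theorem selection and conclusion generation in multi-step reasoning, improving context-aware reasoning.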

Comments This paper has been withdrawn by the authors in order to prepare a substantially revised version

详情
AI中文摘要

大型语言模型(LLMs)在自然语言理解和生成方面表现出色,包括多步推理如数学证明。然而,现有方法缺乏显式且动态的机制来结构化表示和演变中间推理状态,限制了其在上下文感知定理选择和迭代结论生成方面的能力。为此,我们提出了GraphMind,一种新颖的动态图基框架,将图神经网络(GNN)与LLMs结合,以迭代方式选择定理并生成中间结论。我们的方法将推理过程建模为异构演进图,其中节点代表条件、定理和结论,边捕捉节点间的逻辑依赖。通过编码当前推理状态并利用语义匹配进行定理选择,我们的框架在闭环模式下实现了上下文感知、可解释和结构化的推理。在各种问答(QA)数据集上的实验表明,所提出的GraphMind方法在多步推理中实现了稳定性能提升,并显著优于现有基线方法,验证了我们方法的有效性和通用性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

2511.06516 2026-05-19 cs.CL 版本更新

You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

你只有一份工作:利用LLM隐藏表示进行每任务量化

Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz Ziv, Avi Mendelson

Affiliations * Technion—Israel Institute of Technology; Zenity; DeepKeep; Ben-Gurion University of the Negev; New York University

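AI summary: This paper proposes the TAQ framework, which uses task calibration prompts to allocate higher precision to task-relevant transformer layers, improving the accuracy-memory ratio and deployment performance.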

详情
AI中文摘要

许多LLM应用只需窄能力,但标准后训练量化(PTQ)方法在分配精度时未考虑目标任务,导致浪费位数于不相关层并过度压缩关键层。我们提出任务感知量化(TAQ),一种无需训练的权重仅混合精度PTQ框架,利用少量未标记任务校准提示,在固定位预算下分配更高精度给任务相关的transformer层。TAQ通过隐藏表示和输出敏感性估计层重要性,并使用三种评分规则:TAQ-IS基于激活信息和稳定性;TAQ-KL基于量化噪声代理下的输出分布敏感性;TAQ-O是标签引导的oracle诊断用于分析层敏感性。在多个基准测试中,TAQ在大多数设置中优于任务无关基线,尤其在精度-内存比方面有显著提升。我们进一步通过硬件吞吐量和延迟测量验证这些增益在实际部署中的表现,并分析校准鲁棒性和残差流误差传播。总体而言,TAQ将混合精度PTQ从模型导向的压缩步骤转变为任务条件的精度分配问题。参考实现可在https://anonymous.4open.science/r/TAQ-9217/README.md获取。

英文摘要

Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines in most settings, with especially strong gains in the accuracy-memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.
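To make the idea of precision allocation under a fixed bit budget concrete, here is a minimal sketch. It is not the TAQ algorithm: the function name `allocate_bits`, the two candidate bit-widths (4 and 8), the equal-size-layer assumption, and the importance scores are all hypothetical choices made for illustration. The scores stand in for whatever a rule such as TAQ-IS or TAQ-KL would produce.

```python
from typing import Sequence


def allocate_bits(
    importance: Sequence[float],
    avg_bits: float,
    low_bits: int = 4,
    high_bits: int = 8,
) -> list[int]:
    """Give the most important layers high precision within an average-bit budget.

    Assumption (not from the abstract): all layers have the same number of
    weights, so the budget is simply an average over layers.
    """
    if not low_bits <= avg_bits <= high_bits:
        raise ValueError("avg_bits must lie between low_bits and high_bits")
    n = len(importance)
    # Largest number of high-precision layers that keeps the mean <= avg_bits.
    n_high = int(n * (avg_bits - low_bits) / (high_bits - low_bits))
    # Rank layer indices from most to least important.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    high_layers = set(ranked[:n_high])
    return [high_bits if i in high_layers else low_bits for i in range(n)]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.3]  # hypothetical per-layer importance scores
    print(allocate_bits(scores, avg_bits=6))
```

A greedy top-k split like this is the simplest possible policy; a real allocator would also have to account for layers of different sizes.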

2510.18941 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench:需要专业知识回答和评判的多领域评分标准

Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

发表机构 * NVIDIA

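AI summary: ProfBench uses over 7,000 response-criterion pairs evaluated by domain experts to assess LLMs on processing professional documents, synthesizing information, and generating comprehensive reports, showing that even top models struggle on professional tasks.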

Comments Published at ICLR 2026, 30 pages

详情
AI中文摘要

评估大语言模型(LLMs)的进步通常受限于验证响应的挑战,限制了评估任务仅限于数学、编程和简短问答。然而,许多现实应用需要评估LLMs在处理专业文档、整合信息和生成综合报告方面的能力。我们介绍了ProfBench:一个包含超过7000个响应-评分对的集合,由具有物理学博士、化学博士、金融MBA和咨询MBA专业知识的人类专家评估。我们构建了稳健且经济的LLM-Judges来评估ProfBench评分标准,通过减轻自我增强偏差并减少评估成本2-3个数量级,使其公平且对更广泛社区可及。我们的发现表明,即使对于最先进的LLM,ProfBench也提出了重大挑战,顶级模型如GPT-5-high仅达到65.9%的整体性能。此外,我们识别了专有模型与开源模型之间显著的性能差异,并提供了关于扩展思考在解决复杂专业领域任务中的作用的见解。数据:https://huggingface.co/datasets/nvidia/ProfBench 和代码:https://github.com/NVlabs/ProfBench 和排行榜:https://huggingface.co/spaces/nvidia/ProfBench

英文摘要

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs on processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, making it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

2510.16079 2026-05-19 cs.CL cs.AI 版本更新

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR:通过经验驱动的生命周期实现自进化大语言模型代理

Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi

发表机构 * Zhejiang University(浙江大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) East China Normal University(华东师范大学) Fudan University(复旦大学) Central South University(中南大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学)

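AI summary: EvolveR enables LLM agents to self-evolve through a closed-loop experience lifecycle that combines offline self-distillation with online interaction and iterates via a policy reinforcement mechanism, improving performance on multi-hop question answering.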

Comments Accepted by ICML 2026

详情
AI中文摘要

当前大语言模型代理在工具使用方面表现优异,但缺乏系统性地从自身经验学习的能力。现有框架主要解决外部知识缺口问题,未能解决更根本的限制:无法迭代优化问题解决策略。本文提出EvolveR框架,通过完整的闭环经验生命周期实现代理自改进。该生命周期包含两个关键阶段:(1) 离线自蒸馏,将代理交互轨迹合成结构化的抽象可重用策略原则;(2) 在线交互,代理与任务互动并主动检索蒸馏的原则以指导决策,积累多样化的行为轨迹。此循环利用策略强化机制迭代更新代理。我们在复杂多跳问答基准测试中展示了EvolveR的有效性,其在强代理基线上的表现更优。我们的工作为能够从外部数据和自身行为后果学习的代理提供了全面蓝图,为更自主和持续改进的系统铺平道路。代码可在https://github.com/Edaizi/EvolveR获取。

英文摘要

Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agents to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.

2510.07799 2026-05-19 cs.CL cs.AI 版本更新

Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

基于图扩散模型的多LLM代理通信拓扑动态生成

Eric Hanchen Jiang, Mengting Li, Guancheng Wan, Sophia Yin, Yuchen Wu, Xiao Liang, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, Ying Nian Wu

发表机构 * University of California Los Angeles(加州大学洛杉矶分校) University of Washington(华盛顿大学) Nanyang Technological University(南洋理工大学)

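AI summary: This paper proposes the Guided Topology Diffusion framework, which generates efficient, task-adaptive communication topologies through an iterative construction process and outperforms existing methods.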

Comments ACL 2026 Main

详情
AI中文摘要

多代理系统效率依赖于其通信拓扑,但设计最优拓扑具有挑战性。本文引入Guided Topology Diffusion框架,通过条件离散图扩散模型生成任务自适应的通信拓扑,实现实时优化,显著提升LLM代理协作性能。

英文摘要

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

2510.01782 2026-05-19 cs.CL cs.AI 版本更新

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

大语言模型能否拒绝它们不知道的问题?在事实性任务中测量知识感知的拒绝

Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Harbin Institute of Technology(哈尔滨工业大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Computer Science and Engineering, Central South University(中南大学计算机科学与工程学院)

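AI summary: This paper proposes the Refusal Index (RI), a new metric for knowledge-aware refusal in LLMs that measures the correlation between refusal probability and error probability, revealing reliability problems in how models refuse on factual tasks.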

Comments Accepted at ICLR 2026

详情
AI中文摘要

大语言模型(LLMs)应拒绝回答超出其知识范围的问题。这种称为知识感知拒绝的能力对于事实性可靠性至关重要,但现有指标未能捕捉这一能力。本文提出Refusal Index(RI),一种新颖且原理明确的度量标准,用于衡量LLMs拒绝其不知问题的准确性。我们将RI定义为拒绝概率与错误概率之间的Spearman秩相关性。RI可通过轻量级两轮评估方法进行实际测量,仅需在两个标准评估运行中观察到的拒绝率。在16个模型和5个数据集上的广泛实验表明,RI准确量化了模型的知识感知拒绝能力。值得注意的是,RI在不同拒绝率下保持稳定,并提供一致的模型排名,不依赖于模型的整体准确率和拒绝率。这些特性表明RI捕捉了模型知识校准的稳定、内在方面。更重要的是,RI提供了关于LLM事实性的重要但此前被忽视方面的见解:尽管LLM在事实性任务上实现高准确率,但其拒绝行为可能不可靠且脆弱。

英文摘要

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, yet existing metrics fail to capture it. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method that requires only the refusal rates observed across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.
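The definition of RI as a Spearman rank correlation can be illustrated with a small, dependency-free sketch. This only mirrors the stated definition on hypothetical per-question probabilities. It is not the two-pass estimator the abstract mentions, whose details are not given here. The function names `average_ranks` and `refusal_index` and the example numbers are assumptions made for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def refusal_index(refusal_prob: Sequence[float], error_prob: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(refusal_prob), average_ranks(error_prob)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical per-question probabilities, not data from the paper.
    refusal = [0.05, 0.20, 0.70, 0.90]
    error = [0.10, 0.30, 0.60, 0.95]
    print(refusal_index(refusal, error))
```

A value near 1 means the model refuses most often exactly where it is most likely to be wrong. A value near 0 means refusals are unrelated to what the model actually knows.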

2509.22061 2026-05-19 eess.AS cs.CL cs.SD 版本更新

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

说出你的想法:语音延续任务作为语音基础模型偏见的探测器

Shree Harsha Bokkahalli Satish, Harm Lameris, Olivier Perrotin, Gustav Eje Henter, Éva Székely

发表机构 * Department of Speech, Music and Hearing, KTH Royal Institute of Technology(语音、音乐与听觉系,皇家理工学院) Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab(格勒诺布尔阿尔卑斯大学,法国国家科学研究中心,格勒诺布尔理工学院,GIPSA实验室)

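AI summary: This paper presents the first systematic evaluation of bias in speech continuation, examining how gender and phonation type affect continuation behaviour; it finds significant model and gender interactions and a stronger reversion to modal phonation for female prompts, revealing a voice-quality bias.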

Comments 8 pages, 2 figures, Accepted to Identity-Aware AI LREC Workshop 2026

详情
AI中文摘要

语音延续(SC)任务是生成连贯的语音提示扩展,同时保持语义上下文和说话者身份。由于SC受限于单一音频流,它比对话更直接地探测语音基础模型的偏见。本文首次系统评估SC中的偏见,研究性别和音质类型(气音、嘶嘶音、末端嘶嘶音)对延续行为的影响。评估了三个最新模型:SpiritLM(基础和表达型)、VAE-GSLM和SpeechGPT,在说话者相似性、语音质量保持和基于文本的偏见指标上。结果表明,尽管说话者相似性和连贯性仍具挑战性,文本评估揭示了显著的模型和性别交互:一旦连贯性足够高(对于VAE-GSLM),性别效应会在文本指标如代理性和句子极性上显现。此外,延续行为更倾向于回归模态音质,特别是对于女性提示,揭示了系统性的语音质量偏见。这些发现突显了SC作为探测社会相关表征偏见的受控探测器,表明随着延续质量的提高,SC将成为越来越有信息量的诊断工具。

英文摘要

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work, we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models, SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT, across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

2509.21319 2026-05-19 cs.CL cs.AI cs.LG 版本更新

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF:二进制灵活反馈用于连接人类反馈与可验证奖励

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

发表机构 * NVIDIA

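AI summary: RLBFF combines human preferences with rule-based verification so that reward models capture response quality more precisely; it outperforms Bradley-Terry models, achieves top results on RM-Bench and JudgeBench, and lets users specify their own feedback principles.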

Comments Published at ICLR 2026, 21 pages

详情
AI中文摘要

Reinforcement Learning with Human Feedback (RLHF) 和 Reinforcement Learning with Verifiable Rewards (RLVR) 是LLM后训练的主要RL范式,各有优势。然而,RLHF在可解释性和奖励黑客问题上存在困难,因为它依赖于通常缺乏明确标准的人类判断,而RLVR则受限于其对正确性基于验证器的专注。我们提出Reinforcement Learning with Binary Flexible Feedback (RLBFF),结合人类驱动的偏好灵活性与规则基础验证的精确性,使奖励模型能够捕捉响应质量的细微方面,超越单纯的正确性。RLBFF从自然语言反馈中提取可以二进制回答的原则(例如信息准确性:是,或代码可读性:否)。这些原则随后可用于将奖励模型训练作为蕴含任务(响应满足或不满足任意原则)。我们展示奖励模型以这种方式训练可以优于匹配数据的Bradley-Terry模型,在RM-Bench(86.2%)和JudgeBench(81.4%,2025年9月24日排行榜第一)上取得最佳成绩。此外,用户可以在推理时指定感兴趣的原理以自定义我们的奖励模型,与Bradley-Terry模型不同。最后,我们提供了一个完全开源的食谱(包括数据)来对Qwen3-32B进行对齐,以匹配或超过o3-mini和DeepSeek R1在MT-Bench、WildBench和Arena Hard v2的一般对齐基准上的性能(在<5%的推理成本下)。模型:https://huggingface.co/collections/nvidia/reward-models-10-2025

英文摘要

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025

2509.17912 2026-05-19 cs.CL 版本更新

SiDiaC: Sinhala Diachronic Corpus

Sinhala 语料库:SiDiaC:历时性语料库

Nevidu Jayatilleke, Nisansa de Silva

发表机构 * Department of Computer Science & Engineering, University of Moratuwa(计算机科学与工程系,穆拉图瓦大学)

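AI summary: SiDiaC is the first comprehensive Sinhala diachronic corpus, with 58,000 words spanning the 5th to the 20th century CE; after filtering and annotation, it serves as a foundational resource for Sinhala NLP, supporting studies of lexical change, neologism tracking, and historical syntax.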

Comments 17 pages, 7 figures, 9 tables, Accepted paper at the 39th Pacific Asia Conference on Language, Information and Computation (PACLIC 39)

详情
Journal ref
https://aclanthology.org/2025.paclic-1.47/
AI中文摘要

SiDiaC 是首个全面的僧加罗语历时性语料库,涵盖从5世纪到20世纪的58,000个单词。该语料库通过筛选和标注,基于写作日期进行精细处理,经过Google Document AI OCR数字化和后处理,以纠正格式并现代化拼写。SiDiaC 的构建借鉴了其他语料库(如FarPaHC)的实践,特别是在句法标注和文本规范化策略上,以应对僧加罗语资源匮乏的挑战。该语料库按文体分为两层:初级分类将每本书分为非虚构或虚构,次级分类则更具体,将文本归类为宗教、历史、诗歌、语言和医学等类别。尽管面临获取稀有文本有限和依赖次级日期来源等挑战,SiDiaC 为僧加罗语NLP研究提供了基础资源,显著扩展了僧加罗语可用资源,支持历时性研究,如词汇变化、新词追踪、历史语法和基于语料库的词典编纂。

英文摘要

SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, carefully annotated by date of writing after filtering for availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, because these corpora share the status of low-resource languages. The corpus is categorised by genre in two layers: primary and secondary. Primary categorisation is binary, classifying each book as Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala and enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.

2509.04471 2026-05-19 cs.CL cs.AI 版本更新

MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

MOSAIC:一种多语言、无类别依赖且计算高效的放射报告分类方法

Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系) Neurobiology Research Unit, Copenhagen University Hospital(哥本哈根大学医院神经生物学研究单位) Unumed Aps(Unumed公司) Department of Diagnostic Radiology, Copenhagen University Hospital(哥本哈根大学医院诊断放射学系) Department of Clinical Medicine, University of Copenhagen(哥本哈根大学临床医学系) Cerebriu A/S(Cerebriu公司) Institute for Human Genetics, Medical University of Innsbruck(因斯布鲁克医学大学人类遗传学研究所)

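AI summary: MOSAIC uses a compact open model for multilingual, taxonomy-agnostic radiology report classification without large annotated datasets, and performs strongly across imaging modalities and label taxonomies, approaching or exceeding expert-level performance on chest X-ray datasets.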

Comments 8 pages, 14 pages including references and appendix. 9 figures. Preprint

详情
Journal ref
Proceedings of the ClinicalNLP Workshop at LREC 2026
AI中文摘要

放射学报告包含丰富的临床信息,可用于训练影像模型而无需依赖昂贵的手动标注。然而,现有方法面临关键限制:基于规则的方法难以处理语言多样性,监督模型需要大量标注数据集,而近期基于LLM的方法依赖封闭源或资源密集型模型,不适合临床使用。此外,当前解决方案大多局限于英语和单模态、单类别数据集。我们介绍了MOSAIC,一种多语言、无类别依赖且计算高效的放射报告分类方法。基于紧凑的开放访问语言模型(MedGemma-4B),MOSAIC支持零/少样本提示和轻量级微调,可在消费级GPU上部署。我们在英语、西班牙语、法语和丹麦语的七个数据集上评估MOSAIC,涵盖多种影像模态和标签体系。该模型在五个胸部X光数据集上达到平均宏F1分数88,接近或超过专家水平性能,同时仅需24GB GPU内存。通过数据增强,仅需80个标注样本即可在丹麦报告上达到加权F1分数82,相比完整1600样本训练集的86分。MOSAIC为临床环境中大型或专有LLM提供了实用替代方案。代码和模型是开源的。我们邀请社区在新语言、类别和模态上评估和扩展MOSAIC。

英文摘要

Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
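The abstract reports both a macro F1 and a weighted F1, which average per-class scores differently. The sketch below shows the difference on a toy single-label problem. The label names and predictions are invented, and the single-label setting is an assumption: radiology report classification is often multi-label, where each finding would be scored separately. Scores come out in the range 0 to 1, whereas the abstract quotes them multiplied by 100.

```python
from collections import Counter
from typing import Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> dict:
    """F1 for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class F1 weighted by each class's number of true examples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


if __name__ == "__main__":
    truth = ["normal", "normal", "normal", "effusion"]  # invented labels
    preds = ["normal", "normal", "effusion", "effusion"]
    print(macro_f1(truth, preds), weighted_f1(truth, preds))
```

On imbalanced label sets the two can diverge noticeably, because macro F1 gives rare findings the same weight as common ones.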

2508.04149 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

基于难度的偏好数据选择:通过DPO隐式奖励差距

Xuan Qi, Rongwu Xu, Zhijing Jin

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学计算机科学与工程保罗·G·艾伦学校) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所) Jinesis Lab, University of Toronto & Vector Institute(多伦多大学Jinesis实验室及向量研究所)

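AI summary: This paper proposes a difficulty-based preference data selection method that uses the DPO implicit reward to pick examples with small reward gaps, improving data efficiency and alignment; it outperforms five baselines across multiple datasets and alignment tasks.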

Comments Our code and data are available at https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select

详情
AI中文摘要

对齐大语言模型(LLMs)与人类偏好是AI研究中的关键挑战。尽管强化学习从人类反馈(RLHF)和直接偏好优化(DPO)等方法被广泛使用,但它们通常依赖于大规模、成本高的偏好数据集。本文缺少针对偏好数据的高质量数据选择方法。在本文中,我们引入了一种基于难度的偏好数据选择策略,该策略基于DPO隐式奖励机制。通过选择奖励差距较小的偏好数据示例,这些示例代表更具挑战性的案例,从而提高数据效率和模型对齐。我们的方法在多个数据集和对齐任务中一致优于五个强大的基线方法,仅使用原始数据的10%即可实现优越性能。这种原理上高效的选择方法为在有限资源下扩展LLM对齐提供了有前景的解决方案。

英文摘要

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. Existing work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
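The selection rule can be sketched with the standard DPO implicit reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), where the gap of a pair is the reward of the chosen response minus that of the rejected one. This is only a sketch: the `PreferencePair` fields, the default β, the use of the signed gap (rather than its absolute value), and the log-probabilities in the example are assumptions, not details confirmed by the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PreferencePair:
    pair_id: str
    policy_logp_chosen: float    # log pi(y_chosen | x) under the scoring model
    ref_logp_chosen: float       # log pi_ref(y_chosen | x)
    policy_logp_rejected: float  # log pi(y_rejected | x)
    ref_logp_rejected: float     # log pi_ref(y_rejected | x)


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta times the policy-to-reference log-ratio."""
    return beta * (policy_logp - ref_logp)


def reward_gap(pair: PreferencePair, beta: float = 0.1) -> float:
    """Implicit reward of the chosen response minus that of the rejected one."""
    chosen = implicit_reward(pair.policy_logp_chosen, pair.ref_logp_chosen, beta)
    rejected = implicit_reward(pair.policy_logp_rejected, pair.ref_logp_rejected, beta)
    return chosen - rejected


def select_hardest(
    pairs: Sequence[PreferencePair], fraction: float = 0.1, beta: float = 0.1
) -> list[PreferencePair]:
    """Keep the given fraction of pairs with the smallest reward gap."""
    k = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: reward_gap(p, beta))[:k]


if __name__ == "__main__":
    # Hypothetical log-probabilities; pair i has gap 0.1 * i.
    data = [PreferencePair(f"p{i}", -10.0 + i, -10.0, -12.0, -12.0) for i in range(10)]
    print([p.pair_id for p in select_hardest(data, fraction=0.2)])
```

In practice the four log-probabilities would be sequence-level sums computed by a trained DPO model and its reference model over each response.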

2506.23978 2026-05-19 cs.LG cs.CL cs.CY cs.SI 版本更新

LLM Agents Are the Antidote to Walled Gardens

大语言模型代理是封闭生态系统的解药

Samuele Marro, Philip Torr

发表机构 * Department of Engineering Science, University of Oxford(牛津大学工程科学系) Institute for Decentralized AI(去中心化人工智能研究所)

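AI summary: This position paper argues that LLM agents enable universal interoperability, which undermines closed-platform monopolies and promotes data portability, and discusses the security and legal challenges this brings.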

Comments Published at the ICML 2026 Position Paper track

详情
AI中文摘要

尽管互联网的核心基础设施最初设计为开放和通用,但当今的应用层却被封闭的专有平台主导。开放且互操作的API需要大量投资,而市场领导者缺乏激励去启用可能削弱用户锁定的数据交换。我们主张基于大语言模型的代理从根本上颠覆这一现状。代理可以自动转换数据格式并与为人设计的界面交互:这使互操作性大幅降低且实际上不可避免。我们称之为这种转变通用互操作性:任何两个数字服务都能通过AI调解的适配器无缝交换数据的能力。通用互操作性削弱了垄断行为,促进数据端到端迁移。然而,它也可能导致新的安全风险、技术债务和法律摩擦。我们的立场是ML社区应拥抱这一发展,同时构建适当的框架来减轻负面影响。通过现在行动,我们可以利用AI恢复用户自由和竞争市场,而不牺牲安全。

英文摘要

While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks, technical debt, and legal frictions. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.

2506.12119 2026-05-19 cs.CL cs.AI 版本更新

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

专家混合模型可在严格相等资源下超越密集语言模型

Houyi Li, Ka Man Lo, Shijie Xuyang, Ziqi Wang, Wenzhen Zheng, Haocheng Zhang, Zhao Li, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

发表机构 * Fudan University(复旦大学) StepFun University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学)

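AI summary: This paper asks whether MoE models can surpass dense models under equal resources; it proposes a methodological framework and shows that MoE models with an activation rate in an optimal region outperform their dense counterparts, that this region is consistent across model sizes, and that the resulting data trade-off can be resolved by reusing data.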

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

专家混合(MoE)语言模型显著扩展了模型容量,并在不增加每token计算量的情况下实现了显著性能提升。然而,在严格相等的资源约束下,即总参数量、训练计算和数据预算完全相同的情况下,MoE能否超越密集架构?尽管其具有重要的实际价值和潜力,这一问题仍缺乏深入研究。本文提出了一种新的视角和方法论框架,系统研究这一问题。首先,我们全面调查了MoE的架构并实现了最优模型设计以最大化性能。基于此,我们发现,在最优区域内的MoE模型在相同总参数、训练计算和数据资源下能够超越其密集 counterpart。更重要的是,这一最优区域在不同模型规模下保持一致。虽然增加的数据量会带来性能的权衡,但我们通过重用数据解决了这一问题。我们通过广泛的实验验证了我们的发现,训练了近200个20亿参数规模的语言模型和超过50个70亿参数规模的语言模型,累计处理了50万亿token。所有模型检查点均已公开。

英文摘要

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and arrive at an optimal model design that maximizes performance. Based on this, we find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter count, training compute, and data resources. More importantly, this optimal region remains consistent across different model sizes. Although the additional data required turns out to be a trade-off for the enhanced performance, we show that this can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All model checkpoints are publicly available.

2505.19590 2026-05-19 cs.LG cs.CL 版本更新

Learning to Reason without External Rewards

无需外部奖励的学习推理

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

发表机构 * UC Berkeley(加州大学伯克利分校) Yale University(耶鲁大学)

AI总结 本文提出Intuitor方法,通过内在反馈实现无需外部奖励的自主学习,实验表明其在数学基准和代码生成等任务中表现优异,为无监督学习提供了新途径。

Comments ICLR 2026

详情
AI中文摘要

通过可验证奖励的强化学习(RLVR)训练大型语言模型(LLMs)进行复杂推理虽然有效,但受限于昂贵的领域特定监督。我们探索基于内在反馈的强化学习(RLIF)框架,使LLMs能从内在信号学习而无需外部奖励或标注数据。我们提出Intuitor,一种使用模型自身的置信度(称为自确定性,self-certainty)作为唯一奖励信号的RLIF方法。Intuitor在组相对策略优化(GRPO)中用自确定性分数替代外部奖励,实现完全无监督学习。实验表明,Intuitor在数学基准上与GRPO性能相当,但在代码生成等跨领域任务中泛化能力更强,无需黄金解决方案或测试用例。我们的发现表明,内在模型信号能驱动跨领域有效学习,为无可验证奖励的自主AI系统提供可扩展替代方案。代码可在https://github.com/sunblaze-ucb/Intuitor获取。

英文摘要

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
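The reward substitution described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the self-certainty formula, so the sketch assumes it is the mean KL divergence from a uniform distribution to the model's next-token distribution, and the names `self_certainty` and `group_relative_advantages` are hypothetical.

```python
import math
from statistics import mean, pstdev


def self_certainty(token_dists):
    """Mean KL(U || p) over the token positions of one response.

    Assumption: self-certainty is scored as the average KL divergence from
    a uniform distribution U over the vocabulary to the next-token
    distribution p. The abstract itself does not state the formula.
    """
    total = 0.0
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j)
        total += sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p)
    return total / len(token_dists)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical responses to the same prompt, toy vocabulary of size 4.
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
hesitant = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

scores = [self_certainty(confident), self_certainty(hesitant)]
advantages = group_relative_advantages(scores)
```

In this toy group the peaked response receives the higher score and therefore a positive advantage, with no gold answer or test case involved.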

2505.16831 2026-05-19 cs.CL cs.AI cs.CR cs.LG 版本更新

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

反学习不是删除:调查机器反学习在大语言模型中的可逆性

Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Carnegie Mellon University(卡内基梅隆大学) University of California, Santa Cruz(加州大学圣克鲁兹分校) Huawei Technologies(华为技术有限公司) Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU(未来智能系统中的隐私与安全技术研究中心,PolyU)

AI总结 研究揭示大语言模型反学习的可逆性问题,提出表示层面分析框架,通过PCA相似度、CKA和Fisher信息等指标评估表示漂移,发现四种遗忘模式,指出数据来源影响重学效率,揭示不可逆遗忘的挑战。

Comments ICML 2026, accepted to appear

详情
AI中文摘要

在大语言模型(LLMs)中,反学习旨在移除指定数据,但其效果通常通过任务级指标如准确率和困惑度评估。我们证明这些指标可能误导,因为模型似乎遗忘,但通过最小微调即可恢复原始行为。这种可逆性表明信息被抑制而非真正删除。为填补这一评估空白,我们引入表示层面分析框架。我们的工具包包括PCA相似度和位移、中心核对齐(CKA)和Fisher信息,辅以均值PCA距离作为总结指标,用于衡量表示漂移。在多种反学习方法、数据领域和LLMs上应用此框架,我们识别出四种基于可逆性和灾难性程度的遗忘模式。我们比较了恢复策略,发现重学效率依赖于数据来源。我们还发现不可逆、非灾难性遗忘异常困难。通过探测反学习极限,我们识别出一个看似不可逆的目标遗忘案例,为更稳健的擦除算法提供见解。总体而言,我们的发现揭示了当前评估的差距,并建立了可信反学习的表示层面基础。

英文摘要

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This reversibility suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a representation-level analysis framework. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their reversibility and catastrophicity. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.
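One of the representation-level tools named above, centered kernel alignment, can be sketched in its standard linear form. This is an illustrative pure-Python version under the assumption that activations are given as rows of a small matrix (one row per input); it is not taken from the paper's toolkit, and the toy matrices are made up.

```python
import math


def center_columns(x):
    """Subtract the per-feature mean so each column sums to zero."""
    n = len(x)
    means = [sum(row[j] for row in x) / n for j in range(len(x[0]))]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_gram_sq_norm(x, y):
    """Squared Frobenius norm of x^T y for row-major matrices x and y."""
    total = 0.0
    for i in range(len(x[0])):
        for j in range(len(y[0])):
            dot = sum(x[r][i] * y[r][j] for r in range(len(x)))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return cross_gram_sq_norm(xc, yc) / denom


# Hypothetical activations for four inputs: before and after unlearning.
before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[2.0 * v for v in row] for row in before]   # same geometry
drifted = [[1.0], [0.0], [0.0], [1.0]]                  # different geometry

cka_same = linear_cka(before, rescaled)
cka_drift = linear_cka(before, drifted)
```

A value near 1 indicates that the two sets of representations share the same geometry up to scaling and rotation, while a low value signals representational drift.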

2503.20981 2026-05-19 cs.CL cs.AI cs.SI 版本更新

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

患者发声,AI倾听:基于大语言模型的在线评论分析揭示了紧急护理满意度的关键驱动因素

Xiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao Li

发表机构 * Electrical Engineering department, University of South Florida(南佛罗里达大学电气工程系) Department of Biostatistics, Epidemiology and Bioinformatics, University of Pennsylvania(宾夕法尼亚大学生物统计学、流行病学与生物信息学系) Computer Science and Engineering department, University of South Florida(南佛罗里达大学计算机科学与工程系) Mathematics & Statistics, University of South Florida(南佛罗里达大学数学与统计学系) Department of Computer Science and Engineering, University of Missouri Science and Technology(密苏里科技大学计算机科学与工程系) School of Medicine, Stanford University(斯坦福大学医学院) Department of Computer Science, Rutgers University(罗格斯大学计算机科学系) Department of Biostatistics, Vanderbilt University(范德比尔特大学生物统计学系)

AI总结 本文利用大语言模型分析在线评论,揭示紧急护理满意度的关键因素,发现人际因素和运营效率是主要决定因素,其他因素在调整后无显著影响。

详情
AI中文摘要

调查紧急护理设施的公众体验对促进社区医疗发展至关重要。传统调查方法由于范围、时间和空间覆盖有限而效果不佳。通过在线评论或社交媒体进行众包研究是一种有价值的途径。随着大语言模型(LLMs)的最新进展,从评论中提取细微感知已成为可能。本研究收集了Google Maps上DMV和佛罗里达地区的评论,并使用GPT模型进行提示工程,分析紧急护理的方面情感。我们首先分析了各种方面的地理空间模式,包括人际因素、运营效率、技术质量、财务和设施。接下来,我们确定了影响公众感知的CBG层面特征,包括人口密度、中位收入、基尼指数、租金与收入比率、家庭贫困率、无保险率和失业率。我们的结果表明,人际因素和运营效率是紧急护理患者满意度的最强决定因素,而技术质量、财务和设施在多变量模型中无显著独立影响。在社会经济和人口因素中,只有人口密度与患者评分有显著但微弱的相关性,其余因素无显著相关性。总体而言,本研究强调了众包研究揭示居民关注因素的潜力,并为利益相关者改进紧急护理公众满意度提供有价值的见解。

英文摘要

Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.

2503.19950 2026-05-19 cs.LG cs.AI cs.CL 版本更新

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

LogQuant: 一种基于对数分布的2位KV缓存量化技术,具有更优异的精度保持性能

Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen

发表机构 * 4Paradigm(第四范式)

AI总结 LogQuant通过基于对数的过滤机制实现KV缓存的2位量化,减少内存占用的同时保持高性能,实验表明其在吞吐量、批处理大小和准确性上均优于现有方法。

Comments Accepted by ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

详情
AI中文摘要

我们介绍了LogQuant,一种突破性的2位量化技术,用于大型语言模型(LLM)推理中的KV缓存,在实现显著内存节省的同时保持优越的性能。先前的方法要么假设后续token更重要,要么基于早期注意力模式预测重要token,但两者都可能导致性能瓶颈或频繁的误预测。LogQuant采取了不同的方法。通过应用基于对数的过滤机制,它在整个上下文中选择性地压缩KV缓存,与现有方法相比,在内存占用相同甚至更低的情况下实现更好的性能。在基准测试中,它在不增加内存消耗的前提下将吞吐量提高了25%,并将批处理大小提升了60%。对于Math和Code Completion等具有挑战性的任务,LogQuant在相同压缩比下将准确性提高了40%至200%,优于同类方法。LogQuant可以轻松集成到流行的推理框架中,如Python的transformers库。实现可在https://github.com/Concyclics/LogQuantKV上获得。

英文摘要

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
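To make "2-bit quantization" concrete, here is a generic uniform min-max quantizer that maps each value to one of four codes. It is only a sketch of the storage format: it does not implement LogQuant's log-based filtering, which the abstract does not specify, and the function names are hypothetical.

```python
def quantize_2bit(values):
    """Map floats to integer codes 0..3 with a shared offset and step size."""
    lo, hi = min(values), max(values)
    # Three steps separate four levels; fall back to 1.0 for a constant input.
    scale = (hi - lo) / 3 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_2bit(codes, lo, scale):
    """Reconstruct approximate floats from 2-bit codes."""
    return [lo + c * scale for c in codes]


# Hypothetical slice of one key vector from a KV cache.
key_slice = [-1.2, -0.4, 0.1, 0.9, 1.8, 0.3]
codes, offset, step = quantize_2bit(key_slice)
restored = dequantize_2bit(codes, offset, step)
max_error = max(abs(a - b) for a, b in zip(key_slice, restored))
```

Each entry then costs 2 bits plus a shared offset and step per group, and the reconstruction error of any entry is bounded by half a step.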

2502.18632 2026-05-19 cs.AI cs.CL cs.CY cs.LG cs.SE 版本更新

Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems

面向编码问题可解释知识追踪的自动化知识组件生成

Zhangqi Duan, Nigel Fernandez, Arun Balajiee Lekshmi Narayanan, Mohammad Hassany, Rafaella Sampaio de Alencar, Peter Brusilovsky, Bita Akram, Andrew Lan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of Pittsburgh(匹兹堡大学) North Carolina State University(北卡罗来纳州立大学)

AI总结 本文提出基于LLM的知识组件生成与标注自动化流程,开发KCGen-KT框架,在不同编程语言的实测数据中验证其优于传统方法和人工编写的知识组件。

Comments Findings of ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

知识组件(KCs)映射到问题有助于建模学生学习,跟踪他们在细粒度技能上的掌握水平,从而在在线学习平台中实现个性化学习和反馈。然而,传统上由人类领域专家进行的知识组件编制和标注工作非常劳动密集。本文提出一个基于LLM的自动化流程,用于开放性编程问题的知识组件生成和标注。我们还开发了一个基于LLM的知识追踪(KT)框架,利用这些LLM生成的知识组件,称为KCGen-KT。我们在两个真实世界的学生代码提交数据集中进行了广泛的定量和定性评估。我们发现KCGen-KT在预测学生未来响应方面优于现有KT方法和人工编写的KCs。我们研究了生成KCs的学习曲线,并显示在认知模型下,LLM生成的KCs比人工编写的KCs拟合更好。我们还进行了与课程讲师的人类评估,以展示我们的流程生成合理的问题-KC映射。

英文摘要

Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills, thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor intensive. We present an automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on two real-world student code submission datasets in different programming languages. We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human-written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings.

2502.16691 2026-05-19 cs.CL cs.DC cs.MA 版本更新

Responsible Federated LLMs via Safety Filtering and Constitutional AI

通过安全过滤和宪法AI实现负责任的联邦大语言模型

Eunchung Noh, Jeonghun Baek

发表机构 * Samsung Electronics(三星电子) The University of Tokyo(东京大学)

AI总结 本文提出在联邦大语言模型中引入安全过滤和宪法AI技术,以提升模型安全性,实验显示在AdvBench上安全性能提升超过20%。

Comments Accepted at the 6th Workshop on Trustworthy NLP (TrustNLP), ACL 2026

详情
AI中文摘要

近期研究日益关注使用联邦学习训练大型语言模型(LLMs),称为FedLLM。然而,旨在确保安全和可信响应的负责任AI(RAI)在这一背景下仍研究不足。在FedLLM中,客户端侧训练数据可能包含有害内容,导致产生可能生成不当响应的不安全LLMs。将此类模型聚合到全局模型并重新分发给客户端,会带来不安全LLMs被广泛部署的风险。为此,我们将两种成熟的RAI技术(安全过滤和宪法AI)引入FedLLM。实验表明,这些方法显著提高了LLM的安全性,在AdvBench上实现了超过20%的改进。

英文摘要

Recent research has increasingly focused on training large language models (LLMs) using federated learning, known as FedLLM. However, responsible AI (RAI), which aims to ensure safe and trustworthy responses, remains underexplored in this context. In FedLLM, client-side training data may contain harmful content, resulting in unsafe LLMs that can generate inappropriate responses. Aggregating such models into a global model and redistributing it to clients risks the widespread deployment of unsafe LLMs. To address this, we incorporate two well-established RAI techniques into FedLLM: safety filtering and constitutional AI. Our experiments show that these methods significantly improve LLM safety, achieving over 20% improvement on AdvBench.

2502.13957 2026-05-19 cs.CL cs.AI 版本更新

Supervising the search process produces reliable and generalizable information-seeking agents

通过监督搜索过程产生可靠且可推广的信息寻求代理

Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang

发表机构 * Department of Computer Science, University of Virginia, USA(弗吉尼亚大学计算机科学系) National Library of Medicine, National Institutes of Health, USA(美国国立卫生研究院国家医学图书馆) Department of Computer Science, University of Illinois Urbana–Champaign, USA(伊利诺伊大学厄巴纳-香槟分校计算机科学系) Medical Oncology, Dana–Farber Cancer Institute, USA(达纳-法伯癌症研究所医学肿瘤科) Surgery, University of Alabama at Birmingham, USA(阿拉巴马大学伯明翰分校外科系) Department of Neurology, Yale School of Medicine, USA(耶鲁医学院神经病学系)

AI总结 本文提出通过监督搜索过程来构建更可靠且可推广的信息寻求代理,通过RAG-Gym框架系统研究了架构设计、参数优化和动作评估,发现推理反思是关键能力,Re$^2$Search++在多跳信息检索基准上取得显著提升,尤其在领域外任务中表现更优。

Comments Homepage: https://rag-gym.github.io; Code: https://github.com/RAG-Gym/RAG-Gym

详情
AI中文摘要

大型语言模型(LLMs)通过从文档排序转向综合答案的方式改变了网络搜索,并越来越多地被用作自主的代理搜索系统,这些系统通过迭代与外部知识源交互。尽管有进展,构建有效的搜索代理仍然具有挑战性,因为高质量的中间搜索步骤难以生成。以往的方法主要依赖于结果监督,仅奖励代理生成正确最终答案。这往往导致奖励黑客和对参数记忆的过度依赖,限制了对领域外任务的泛化能力。为了解决这些限制,我们引入RAG-Gym框架,将监督从最终答案转移到搜索过程本身。通过RAG-Gym,我们系统地研究了架构设计、参数优化和动作评估,确定推理反思是搜索代理的关键能力。基于这一见解,我们提出了Re$^2$Search++,一个受过程监督的代理,它在多跳信息检索基准上实现了显著改进,尤其是在领域外设置中。性能提升主要由更高质量的搜索查询驱动,而非仅靠答案优化。所学的搜索批评者能够跨模型转移,包括专有LLMs。这些发现表明,监督搜索过程会产生更可靠且可推广的信息寻求代理。

英文摘要

Large language models (LLMs) are transforming web search by shifting from document ranking to synthesizing answers, and are increasingly deployed as autonomous agentic search systems that iteratively interact with external knowledge sources. Despite this progress, building effective search agents remains challenging because high-quality intermediate search steps are difficult to generate. Previous approaches have primarily relied on outcome supervision, rewarding agents only for producing correct final answers. This often leads to reward hacking and excessive dependence on parametric memory, limiting generalization to out-of-domain tasks. To address these limitations, we introduce RAG-Gym, a framework that shifts supervision from final answers to the search process itself. With RAG-Gym, we systematically investigate architecture design, parameter optimization, and action evaluation, identifying reasoning reflection as a critical capability for search agents. Building on this insight, we propose Re$^2$Search++, a process-supervised agent that achieves substantial improvements on multi-hop information-seeking benchmarks, especially in out-of-domain settings. Performance gains are driven primarily by higher-quality search queries rather than answer optimization alone, and the learned search critics transfer across models, including proprietary LLMs. These findings show that supervising the search process produces more reliable and generalizable information-seeking agents.

2411.10636 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

缓解孟加拉语分类任务中的外在性别偏见

Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar

发表机构 * Ahsanullah University of Science and Technology(阿沙努拉科学与技术大学) University of California, Riverside(加州大学河滨分校)

AI总结 本文研究了孟加拉语预训练语言模型中的外在性别偏见,构建了四个任务特定的基准数据集,并提出RandSymKL方法以缓解偏见,实验表明其能有效减少偏见并保持高准确率。

详情
AI中文摘要

在本研究中,我们探讨了孟加拉语预训练语言模型中的外在性别偏见,这是一个在低资源语言中鲜有研究的领域。为了评估这种偏见,我们构建了四个人工标注的任务特定基准数据集,用于情感分析、毒性检测、仇恨言论检测和讽刺检测。每个数据集都通过细致的性别扰动进行了增强,通过系统地交换性别化名称和术语并保持语义内容,实现了对性别驱动预测变化的最小配对评估。然后,我们提出RandSymKL,一种整合对称KL散度和交叉熵损失的随机去偏策略,以在任务特定的预训练模型中缓解偏见。RandSymKL是一种精炼的训练方法,以统一的方式整合这些元素,专注于分类任务的外在性别偏见缓解。我们的方法在现有偏见缓解方法上进行了评估,结果表明,我们的技术不仅有效减少了偏见,还与其他基线方法相比保持了竞争性的准确性。为了促进进一步研究,我们已公开了我们的实现和数据集:https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

英文摘要

In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias
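A loss that combines cross-entropy with a symmetric KL term over a minimal pair can be sketched as below. This is an assumption-laden illustration rather than the authors' RandSymKL: the randomized component is not described in the abstract and is left out, and `debias_loss` and the weight `lam` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of the two KL directions; zero only when p equals q."""
    return 0.5 * (kl(p, q) + kl(q, p))


def debias_loss(logits_orig, logits_swapped, label, lam=1.0):
    """Cross-entropy on both inputs plus a consistency penalty between them."""
    p = softmax(logits_orig)       # prediction on the original sentence
    q = softmax(logits_swapped)    # prediction on the gender-swapped sentence
    task = -0.5 * (math.log(p[label]) + math.log(q[label]))
    return task + lam * symmetric_kl(p, q)


# Hypothetical sentiment logits for a minimal pair with gold label 1.
consistent = debias_loss([0.2, 1.5], [0.2, 1.5], label=1)
shifted = debias_loss([0.2, 1.5], [1.4, 0.3], label=1)
```

When the model's prediction does not change under the gender swap, the penalty vanishes and only the task loss remains.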

2409.10102 2026-05-19 cs.IR cs.AI cs.CL 版本更新

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

检索增强生成系统中的可信度:综述

Yujia Zhou, Wenbo Zhang, Jingying Shao, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Jason Chen Zhang, Zhicheng Dou, Philip S. Yu, Jiaxin Mao

发表机构 * Tsinghua University(清华大学) Renmin University of China(中国人民大学) The Chinese University of Hong Kong(香港中文大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Hong Kong Polytechnic University(香港理工大学) Microsoft Research Asia(微软亚洲研究院) University of Illinois(伊利诺伊大学)

AI总结 本文综述了检索增强生成系统中可信度的关键维度,提出Trust-RAG Compass框架,评估事实性、鲁棒性等六个方面,并建立评估基准,揭示不同LLM在可信度方面的性能差异,指出未来研究方向。

详情
AI中文摘要

检索增强生成(RAG)已迅速成为大型语言模型(LLMs)发展中的关键范式。尽管现有研究主要强调准确性和效率,但RAG系统的可信度仍缺乏充分探讨。RAG通过将响应基于外部和最新知识来提高LLM的可靠性,减少幻觉。然而,不可靠的检索或不当的知识利用仍可能导致不良输出。为此,我们提出统一框架Trust-RAG Compass,从事实性、鲁棒性、公平性、透明性、问责性和隐私六个关键维度评估RAG系统的可信度。在此框架下,我们对现有文献进行了全面回顾,并引入评估基准TRC Bench,围绕六个维度对多种专有和开源模型进行全面评估。我们的结果揭示了不同类型的LLM在不同可信度维度上的性能差距。最后,基于我们的发现,我们识别了关键挑战和未来研究的前景。通过这项工作,我们旨在为后续研究提供结构化基础,并为开发真实场景中的可信RAG系统提供实用指导。

英文摘要

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (Trust-RAG Compass Benchmark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.

2409.02428 2026-05-19 cs.LG cs.AI cs.CL cs.SY eess.SY 版本更新

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, China(清华大学深圳国际研究生院,清华大学,中国) Department of Computer Science, University of Oxford, United Kingdom(英国牛津大学计算机科学系) Department of Data Science, New Jersey Institute of Technology, USA(美国新泽西理工学院数据科学系)

AI总结 本文提出ERFSL,利用语言模型高效搜索奖励函数,通过生成奖励组件和使用奖励批评者修正代码,实现多目标强化学习任务中零样本学习的高效奖励函数设计。

详情
AI中文摘要

在具有复杂定制环境和多重需求的强化学习(RL)任务中,有效设计和改进奖励函数具有相当大的挑战性。本文提出ERFSL,一种利用大型语言模型(LLMs)的高效奖励函数搜索器,使LLMs成为有效的白盒搜索器,并突出其先进的语义理解能力。具体而言,我们为每个数值明确的用户需求生成奖励组件,并使用奖励批评者识别正确的代码形式。然后,LLMs为奖励组件分配权重以平衡其值,并基于训练日志分析器提供的上下文,灵活采用类似遗传算法的方向突变和交叉策略迭代调整权重,避免歧义和冗余调整。我们将该框架应用于一个无直接人类反馈或奖励示例的定制数据收集RL任务(零样本学习)。奖励批评者对每个需求仅需一个反馈实例即可成功纠正奖励代码,有效防止不可纠正的错误。权重初始化使得无需权重搜索即可在帕累托解集内获得不同的奖励函数。即使某个权重偏差达500倍,平均也仅需5.2次迭代即可满足用户需求。ERFSL在使用GPT-4o mini时对大多数提示同样表现良好,因为我们分解了权重搜索过程,以降低对数值和长上下文理解能力的要求。

英文摘要

Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to a customized data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.

2407.16216 2026-05-19 cs.CL 版本更新

Reinforcement Learning for LLM Post-Training: A Survey

为大型语言模型进行后训练的强化学习:综述

Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, Na Cheng

发表机构 * Salesforce AWS AI Labs Airbnb

AI总结 本文综述了通过强化学习进行大型语言模型后训练的方法,分析了RLHF和RLVR等技术,并提出统一的策略梯度框架,旨在为研究人员提供技术参考。

详情
AI中文摘要

大型语言模型(LLMs)通过预训练和监督微调(SFT)训练后,仍可能产生有害或不一致的输出,或在数学和编程等领域表现不佳。基于强化学习(RL)的后训练方法,包括直接偏好优化(DPO)等基于人类反馈的强化学习(RLHF)方法,以及PPO和GRPO等可验证奖励的强化学习(RLVR)方法,已显著缓解了这些问题。然而,现有研究未对推动这些进展的各种方法进行技术细节的比较。为填补这一空白,本文提出了一项及时的综述,将基础组件与最新进展联系起来。我们推导出一个统一的策略梯度框架,将预训练、SFT、RLHF和RLVR作为特殊案例,并组织其中更近期的技术。本文的主要贡献包括:(1)对MLE、RLHF和RLVR基础以及统一策略梯度框架的自包含介绍;(2)详细分析基于PPO和GRPO的方法以及离线和迭代DPO方法,沿提示采样、响应采样和梯度系数三个维度分解;(3)标准化符号,实现直接跨方法比较;(4)附录中对每种方法的实现细节和实证结果的全面比较。我们旨在为从事LLM后训练研究的研究人员和实践者提供技术参考。

英文摘要

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable gains to alleviate these issues. Yet, no existing work offers a technically detailed comparison of the various methods driving this progress. In order to fill this gap, we present a timely survey that connects foundational components with latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy gradient framework; (2) detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt sampling, response sampling, and gradient coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) comprehensive comparison of implementation details and empirical results of each method in the appendix. We aim to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.
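As a concrete instance of one method the survey covers, the standard per-pair DPO objective can be written in a few lines. This sketch uses the widely published form of the loss rather than the survey's own notation, which the abstract does not show; the argument names are illustrative and the log-probabilities are assumed to be summed over response tokens.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch on sign to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical sequence log-probabilities for one preference pair.
at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

At initialization the policy equals the reference, so the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the rejected one lowers the loss.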

2605.16613 2026-05-19 cs.CL econ.GN q-fin.EC q-fin.GN 版本更新

Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text

超越情感分类:一种用于文本情感强度评估的生成框架

Francesco A. Fabozzi, Dasol Kim, William N. Goetzmann

AI总结 本文提出一种新的情感建模方法,关注情感评估而非识别,通过构建情感强度评分数据集并微调生成语言模型,实现更通用的情感分析框架,优于传统分类方法并在相关领域表现出色。

Comments 10 pages, no figures, 5 tables

详情
AI中文摘要

我们介绍了一种新的情绪建模方法,将重点从识别转向评估,以解决离散分类在金融等应用领域中的局限性。通过构建情绪强度评分数据集,并对开放权重的生成式语言模型进行微调以输出0-100之间的连续值,我们展示了一种更具表现力和通用性的框架,用于情感和情绪分析。我们的方法不仅优于分类基线,还展现出令人意外的泛化能力,以及向情感倾向和唤醒度等相关构念的迁移效果。这项工作将情绪强度评估作为分类的替代方案,推动了NLP的跨学科再语境化,并认为这种转变更契合金融等领域的需求,在这些领域中,情绪内容的程度对解释和决策至关重要。

英文摘要

We introduce a novel approach to emotion modeling that shifts the focus from identification to evaluation, addressing the limitations of discrete classification in applied domains such as finance. By constructing a dataset of emotional intensity scores and fine-tuning open-weight generative language models to output continuous values from 0-100, we demonstrate a more expressive, generalizable framework for sentiment and emotion analysis. Our findings not only outperform classification baselines but also reveal surprising generalization capabilities and transfer effects to related constructs such as sentiment and arousal. This work contributes to the interdisciplinary recontextualization of NLP by introducing emotion intensity evaluation as an alternative to classification, arguing that this shift better aligns with the needs of domains--such as finance--where the degree of emotional content is central to interpretation and decision-making.

2605.16600 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

预训练写入,对齐读取:Transformer权重空间的不对称性

Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

AI总结 研究揭示了预训练和对齐在Transformer权重空间中的不对称性,通过分析权重变化在残差流激活子空间和预测子空间中的对齐情况,发现读路径权重集中于注意力输入激活的主方向,而写路径权重在预测子空间中保持各向同性。

详情
AI中文摘要

交叉熵预训练和偏好对齐更新相同的Transformer权重,但留下几何上不同的痕迹。我们通过相对子空间分数探针来刻画这种不对称性,追踪权重变化如何与残差流激活子空间以及由去嵌入矩阵定义的预测子空间对齐。对齐阶段的权重变化集中在读路径(W_Q,W_K)上,沿着注意力输入激活的主方向,而在写路径(W_O,W_2)上相对于预测子空间则保持近各向同性。我们通过各向异性梯度积累来解释这种模式:对矩阵W的更新是外积δ_t a_t^T之和,并从协方差集中的那一侧继承方向结构。对于读路径矩阵,这一侧是输入激活a_t,其协方差在训练过的Transformer中呈尖峰状,因此产生与目标函数无关的集中。对于写路径矩阵,相关的一侧是上游梯度δ_t,其各向异性取决于损失。交叉熵提供典型的尖锐逐样本信号,在预训练期间诱导写路径的预测几何;对齐目标通常几乎不会在写路径一侧增加进一步的集中。我们通过检查点内轨迹、分级对比目标对照以及带有匹配方向对照的闭式秩1干预来支持这一解释,为所提出的权重空间几何提供因果证据。

英文摘要

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
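The accumulation argument can be illustrated with a toy simulation: sum outer products of an isotropic upstream gradient with an input whose covariance is spiked along one coordinate, then measure how much of the update's squared norm lies along that coordinate. The dimensions, variances, and function names below are arbitrary choices for illustration, not values or code from the paper.

```python
import random


def accumulate_update(num_steps, dim, seed=0):
    """Sum of outer products delta_t a_t^T with a spiked input covariance."""
    rng = random.Random(seed)
    update = [[0.0] * dim for _ in range(dim)]
    for _ in range(num_steps):
        # Input activation: large variance along coordinate 0, small elsewhere.
        a = [rng.gauss(0.0, 10.0 if j == 0 else 0.1) for j in range(dim)]
        # Upstream gradient: isotropic, with no preferred direction.
        delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for i in range(dim):
            for j in range(dim):
                update[i][j] += delta[i] * a[j]
    return update


def input_side_fraction(update, direction_index):
    """Share of the squared Frobenius norm carried by one input coordinate."""
    total = sum(v * v for row in update for v in row)
    along = sum(row[direction_index] ** 2 for row in update)
    return along / total


update = accumulate_update(num_steps=200, dim=4)
spiked_share = input_side_fraction(update, 0)
```

Almost all of the accumulated update lands on the spiked input coordinate even though the gradient side carries no structure, which mirrors the objective-agnostic concentration the abstract attributes to read-pathway matrices.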

2605.16538 2026-05-19 cs.HC cs.CL 版本更新

LLMs in Qualitative Research: Opportunities, Limitations, and Practical Considerations

大语言模型在定性研究中的应用:机遇、限制与实践考量

Henry Salgado, Meagan R. Kendall, Martine Ceberio, Alexandra Coso Strong

AI总结 本文探讨了大语言模型在定性研究中的机遇、限制及实践问题,强调研究者需批判性地处理技术参数,结合定性方法与可解释AI,讨论大语言模型的透明度与传统NLP工具的差异。

Comments To be published and presented in 2026 ASEE Annual Conference and Exposition

详情
AI中文摘要

本文探讨了大语言模型(LLMs)在定性研究中的机遇、限制及实践考量。结合定性方法与可解释AI的多学科视角,本文认为负责任地将LLMs整合到定性工作流程中,要求研究者批判性地处理一组精选的技术参数,即上下文窗口限制、温度和Top-p采样设置、用户和系统提示设计以及以系统卡片形式呈现的模型文档。本文将这些考量置于定性研究的认识论承诺之中,包括反思性、位置性和解释性判断,并讨论当代LLMs的不透明性与早期自然语言处理工具(如主题模型和基于词典的情感分析器)有何不同。

英文摘要

This paper examines the opportunities, limitations, and practical considerations associated with the use of large language models (LLMs) in qualitative research. Drawing on a multidisciplinary perspective that combines expertise in qualitative methods and explainable AI, the paper argues that responsible integration of LLMs into qualitative workflows requires researchers to engage critically with a curated set of technical parameters, that is, context window constraints, temperature and top-p sampling settings, user and system prompt design, and model documentation in the form of system cards. The paper situates these considerations within the epistemological commitments of qualitative research, including reflexivity, positionality, and interpretive judgment, and discusses how the opacity of contemporary LLMs differs from earlier natural language processing tools such as topic models and lexicon-based sentiment analyzers.
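For readers unfamiliar with two of the parameters listed above, the following sketch shows what temperature and top-p do to a next-token distribution. It follows the commonly described definitions of temperature scaling and nucleus filtering; specific LLM providers may implement them differently, and the toy logits are made up.

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a proper distribution.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(logits, 0.5)   # lower temperature, more deterministic
flat = apply_temperature(logits, 2.0)    # higher temperature, more varied
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.75)
```

Lower temperature and lower top-p both narrow the set of plausible continuations, which matters for the repeatability of LLM-assisted coding of qualitative data.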

2605.16516 2026-05-19 cs.HC cs.AI cs.CL cs.CY 版本更新

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

长期人类-大语言模型交互中的对齐漂移:一种机制导向的框架

Xintong Yao

AI总结 本文提出一种机制导向的框架,用于描述长期人类-大语言模型交互中的对齐漂移现象,通过反馈回路和子模式选择解释漂移的发展过程,并将对齐漂移视为递归互动过程而非孤立模型失败。

Comments 16 pages, 1 appendix

详情
AI中文摘要

长期与基于大语言模型的系统交互可能导致对齐漂移:一种渐进过程,其中系统输出逐渐受用户当前消息的约束减少,而更多受先前交互历史影响,尽管仍显得有帮助、连贯和响应。此过程难以检测,因为用户的主观体验可能随着系统变得更熟悉、有用和适应而改善。现有研究主要集中在短期任务表现、孤立输出或单实例对齐问题,导致慢性和累积的交互层面动态未被充分描述。本文提出一种机制导向的框架来描述对齐漂移。该框架定义信号A和信号B的区别,解释漂移如何通过反馈回路和子模式选择发展,将过程分为三个互动阶段,并识别控制漂移的边界条件。通过将对齐漂移视为递归互动过程而非孤立模型失败,本文为研究长期人类-系统交互提供了概念基础。

英文摘要

Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.

2605.16508 2026-05-19 cs.CL cs.AI 版本更新

The Scaling Laws of Skills in LLM Agent Systems

大语言模型代理系统中技能的扩展规律

Charles Chen, Qiming Yu, Yuhang Gu, Zhuoye Huang, Hanjing Li, Hongyu Liu, Simin Liu, Jinhao Liu, Dengyun Peng, Jiangyi Wang, Zheng Yan, Fanqing Meng, Ethan Qin, Carl Che, Mengkang Hu

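AI summary: The study identifies two coupled scaling laws for skills in large agent systems. Single-step routing accuracy decays logarithmically with library size; before state realization, joint routing is approximately multiplicative, while correct execution improves difficult downstream decisions by about 4x. The routing decay slope b couples the two laws, and law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%.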

Comments Technical Report

详情
AI中文摘要

随着代理系统规模扩大,技能积累为大规模可重用库,但其扩展规律仍不明确。在15个前沿LLM、1141个现实技能及超300万次路由或执行决策中,发现两个耦合规律。路由规律:单步路由准确性随库大小对数衰减(R²>0.97),错误从局部技能竞争发展到跨家族漂移并被过于通用的'黑洞技能'捕获。执行规律:在状态实现前,联合路由近似乘法,正确执行可提升困难下游任务表现约4倍。单参数路由对数衰减斜率b耦合二者:路由侧拟合预测执行侧救援,显示同一库属性控制预执行崩溃和下游恢复能力。这些结果表明代理性能不仅取决于模型能力,还取决于技能库的结构、粒度和暴露策略。

英文摘要

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.
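The routing law above can be illustrated with a minimal sketch, assuming the functional form accuracy = a - b * ln(N), where N is the library size and b is the decay slope. The abstract does not give the exact parameterization, and the data below are synthetic and only illustrative; they are not results from the paper.

```python
import math


def fit_log_decay(library_sizes, accuracies):
    """Least-squares fit of accuracy = a - b * ln(N); returns (a, b, r_squared)."""
    xs = [math.log(n) for n in library_sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope  # report the decay slope as a positive number
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic, illustrative data (NOT from the paper): an exact log decay.
sizes = [10, 30, 100, 300, 1000]
accs = [0.95 - 0.05 * math.log(n) for n in sizes]
a, b, r2 = fit_log_decay(sizes, accs)
```

On real routing measurements, a larger fitted b would indicate a library whose routing accuracy degrades faster as skills are added.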

2605.16475 2026-05-19 cs.DL cs.CL 版本更新

Generative Artificial Intelligence for Literature Reviews

生成式人工智能用于文献综述

Gerit Wagner, Julian Prester, Reza Mousavi, Roman Lukyanenko, Guy Pare

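AI summary: This paper discusses the use of generative AI in literature reviews, covering its capabilities in text summarization, question answering, data extraction, and translation. It outlines approaches based on general-purpose and specialized tools and weighs the opportunities, the risks, and the implications for scientific progress.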

详情
Journal ref
Journal of Information Technology, 02683962261425675 (2026)
AI中文摘要

生成式人工智能(GenAI),基于大型语言模型(LLMs),如ChatGPT,已席卷组织、学术界和公众。特别是生成式人工智能在大规模文本语料摘要、问答、数据提取和翻译方面的出色能力,对文献综述的开展具有深远影响。这影响了科学、组织和公众,因为所有都可以从GenAI支持的文献综述中受益。本文基于GenAI的技术基础和已确立的方法论 discourse,概述了使用通用(如ChatGPT、Gemini、Claude)和专用(如Consensus、Elicit)GenAI工具进行文献综述的方法。我们提供了提示的示例,并建议方法上稳健的文献综述策略。本文采用平衡的方法,考虑了依赖GenAI进行文献综述的机会与风险。最后,我们讨论了GenAI对长期科学进步影响的哲学问题,并提出了改进GenAI核心技术(其架构和训练数据)的研究机会,以及在GenAI支持的文献综述方法学中的开放问题。

英文摘要

Generative artificial intelligence (GenAI), based on large language models (LLMs) such as ChatGPT, has taken organizations, academia, and the public by storm. In particular, impressive GenAI capabilities such as summarization of large text corpora, question answering, data extraction, and translation carry profound implications for the conduct of literature reviews. This affects science, organizations, and the general public, as all can benefit from GenAI-supported literature reviews. Building on the technical foundations of GenAI and grounded in established methodological discourse, this work outlines approaches for conducting literature reviews using both general-purpose (e.g., ChatGPT, Gemini, Claude) and specialized GenAI tools (e.g., Consensus, Elicit). We provide illustrative examples of prompts and suggest methodologically sound literature review strategies. Throughout this perspective paper, we adopt a balanced approach considering both the opportunities and the risks of relying on GenAI in the conduct of literature reviews. We conclude by discussing philosophical questions related to the effects of GenAI on long-term scientific progress, and also present fruitful opportunities for research on improving the core of GenAI's technology (its architecture and training data) and suggest open issues in GenAI-based literature review methodology.

2605.16468 2026-05-19 cs.CV cs.AI cs.CL cs.LG q-bio.NC 版本更新

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

可解释的神经编码机制揭示人类视觉皮层的精细功能选择性

Idan Daniel Grosbard, Mor Geva, Galit Yovel

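AI summary: This paper proposes the MINE framework, which applies mechanistic-interpretability tools to localize the features in natural images that drive cortical voxel activity. It validates the causal effect of those features on voxel responses and reveals fine-grained functional selectivity in visual cortex.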

Comments 40 pages, 28 figures

详情
AI中文摘要

理解人类视觉的核心目标是揭示驱动神经活动的视觉特征。已有研究利用人工神经网络作为编码模型预测皮层对自然图像的响应,揭示了激活类别选择区域的视觉内容。然而,现有方法多为相关性分析,将编码器视为黑箱,无法确定哪些图像特征驱动每个 voxel 的响应。本文提出机制可解释神经编码(MINE)框架,通过机制可解释工具定位自然图像中驱动毫米级(voxel 级)活动的特征。MINE利用语言对齐的图像表示预测每个 voxel 的响应,并生成语义可解释的特征描述,用于 voxel 的激活。进一步将这些 per-image 特征泛化为 per-voxel 功能轮廓。为验证 per-image 描述,我们显示它们足以生成激发 voxel 响应与原始图像响应匹配的图像,其准确性优于随机或低贡献控制生成的图像。此外,通过反事实插入或移除预测特征,可使激活在预期方向变化,提供因果证据。由 voxel 激活轮廓指导的反事实编辑产生更强的激活变化,表明轮廓忠实捕捉每个 voxel 的选择性。最后,将 MINE 应用于研究充分的类别选择脑区,显示其恢复了已知的类别偏好,同时揭示了每个区域内的精细 voxel 结构。总体而言,我们的结果确立了机制可解释性作为发现和验证神经功能精细假设的路径。

英文摘要

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG 版本更新

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

通过分布偏移下的分阶段偏好优化减少视觉-语言模型中的幻觉

Qinwu Xu

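AI summary: This paper proposes a stage-wise preference optimization framework that constructs hallucination-focused preference data to strengthen grounded reasoning in vision-language models, reducing hallucination and yielding more informative responses.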

详情
AI中文摘要

幻觉仍然是视觉-语言模型(VLMs)中的基本挑战,其中自回归生成可能因联合概率建模下的最大似然估计而产生语言上合理但物理上不一致或视觉上不 grounded 的响应。我们提出了一种分阶段偏好优化框架,通过有针对性的多模态数据构建来减少幻觉。该框架强调模糊的空间方向、物体关系、OCR不确定性以及对抗性假前提训练。幻觉负样本通过最小扰动但视觉不一致的替代品生成,使直接偏好优化(DPO)能够更好地区分 grounded 推理与 plausible 幻觉。在开源基准和现实多模态评估场景中的实验表明,改进了 grounded 一致性,减少了幻觉,并产生了更具信息量的 grounded 响应。跨模型定性评估进一步显示,所提出的多模态 LLM DPO 框架在模糊空间推理和对抗性假前提设置中比几个前沿专有 VLMs 产生更视觉 grounded 的响应。结果表明,幻觉可能不仅源于模型容量的限制,还源于自回归概率生成在弱视觉 grounding 下倾向于选择语言上合理但视觉上不一致的延续。未来工作可能探索物理一致性建模、不确定性感知的多模态推理以及超越标准自回归解码的架构替代方案。

英文摘要

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, for example in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.

2605.16407 2026-05-19 cs.LO cs.CL cs.CR cs.PL 版本更新

Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

为LLM流水线提供证明携带证书:一种信任边界架构

George Koomullil

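AI summary: This paper proposes a framework for verifying the deterministic structured computations surrounding a large language model, extending a Lean 4 trust-boundary architecture to the generic interfaces of modern LLM pipelines. Its core contributions are three certificate families and two operators, aimed at high-stakes deployments such as patent retrieval and regulated finance.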

Comments 83 pages, 1 figure, 12 tables

详情
AI中文摘要

我们提出一种框架,用于验证大型语言模型周围确定性结构计算,而非模型本身,将Lean 4信任边界架构扩展到现代LLM流水线的通用接口。证书的有效性是Lean 4内核类型检查加上没有sorry的传递公理审核,针对受信任集合{propext, Classical.choice, Quot.sound};其他假设被声明并按层级划分(数学占位符、密码学假设、ML/人类或acles)。技术贡献包括三个本地证书家族和两个操作符。家族包括冲突感知的双石接地(带有发射门声学引理)、嵌入敏感性和改写稳定性,以及Hoare风格的代理行为。操作符包括最大可证明残差,将回避转化为最大权重可证明残差,带有审计日志记录的被丢弃声明;以及组合稳定性定理,从每层增益和边距中得出闭合形式的流水线整体扰动预算。三个家族加一个通用保证卡整合器构成每调用交付物:专利和法律检索、受监管金融、临床决策支持和具有不可逆副作用的代理系统。编译后的Lean 4参考制品(Lean v4.30.0-rc2,Mathlib)涵盖了所有22种证书类型,其中17个46个内核审核声明无公理,其余仅依赖于受信任集合和声明的假设,且无sorryAx或Lean.ofReduceBool的使用。三个家族通过四个注册试点进行了实证测试:双石接地在对抗性扰动的HotpotQA上,嵌入敏感性在短形式和长形式设置中,以及Hoare风格的代理行为在文件系统沙盒中对抗性提示注入。

英文摘要

We present a framework for verifying the deterministic structured computations surrounding a large language model rather than the model itself, extending a Lean 4 trust-boundary architecture to the generic interfaces of modern LLM pipelines. Certificate validity is a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the trusted set {propext, Classical.choice, Quot.sound}; other assumptions are declared and partitioned by tier (mathematical placeholders, cryptographic assumptions, ML/human oracles). The technical contribution comprises three local certificate families and two operators. The families are conflict-aware bilattice grounding (with an emission-gate soundness lemma), embedding sensitivity and paraphrase stability, and Hoare-style agent action. The operators are a Maximal Certifiable Residue, which turns abstention into the maximum-weight certifiable residue with audit-logged dropped claims, and a Compositional Stability theorem, which yields a closed-form pipeline-wide perturbation budget from per-layer gains and margins. The three families plus a Universal Assurance Card consolidator form the per-call deliverable for high-stakes deployments: patent and legal retrieval, regulated finance, clinical decision support, and agentic systems with irreversible side effects. A compiled Lean 4 reference artifact (Lean v4.30.0-rc2, Mathlib) covers all 22 certificate types, with 17 of 46 kernel-audited declarations axiom-free, the rest depending only on the trusted set and declared assumptions, and zero uses of sorryAx or Lean.ofReduceBool. The three families are empirically tested through four registered pilots: bilattice grounding on adversarially perturbed HotpotQA, embedding sensitivity in short- and long-form settings, and Hoare-style agent action on a filesystem sandbox with adversarial prompt injection.
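The axiom audit described above can be sketched with Lean 4's built-in `#print axioms` command, which lists the axioms a declaration transitively depends on. The two toy theorems below are hypothetical illustrations and are not taken from the paper's reference artifact. The first is expected to be axiom-free, while the second relies on classical reasoning and should report only axioms from the trusted set.

```lean
-- Hypothetical toy lemma: a constructive proof that needs no axioms.
theorem and_swap_demo (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩

-- Hypothetical toy lemma: excluded middle, which depends on classical axioms.
theorem em_demo (p : Prop) : p ∨ ¬p :=
  Classical.em p

-- Audit step: list the axioms each declaration transitively depends on.
#print axioms and_swap_demo
#print axioms em_demo
```

A certificate check in the spirit of the abstract would reject any declaration whose reported axioms fall outside the trusted set or include `sorryAx`.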

2605.16364 2026-05-19 cs.SD cs.AI cs.CL 版本更新

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

WASIL:真实场景下阿拉伯语口语交互与LLMs

Zien Sheikh Ali, Hamdy Mubarak, Soon-Gyo Jung, Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury

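AI summary: This paper presents WASIL, a dataset of in-the-wild Arabic spoken interactions with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback, intended for evaluating LLMs under real-world conditions.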

Comments Spoken Prompts, Multilingual LLMs, Speech-based Evaluation, Dialectal Speech, Low-resource Languages, Conversational AI, Speech-to-Text QA, Real-world Interaction, Spoken Language Understanding

详情
AI中文摘要

大型语言模型(LLMs)的语音助手通常构建为自动语音识别(ASR)与LLM系统的级联系统,其中识别错误可能扭曲用户意图。不满可能还源于模糊、领域外或非请求的对话轮次,使难以分离ASR影响。我们发布WASIL(在阿拉伯语中表示连接或链接):包含真实场景下的阿拉伯语口语交互提示,包含音频、ASR假说、助手回复及显式喜欢/不喜欢反馈(8,529轮次;14.2%的不满),再加上一个包含现代标准阿拉伯语(MSA)和四种主要方言及其标签的2,000轮次测试集。我们通过多ASR协议引导的后编辑提供低成本的黄金转录,并标注回答性(可回答、模糊/需要澄清、不支持、非请求/噪声)以区分内在不可回答性与ASR引起的退化。最后,我们描述了使用多裁判LLM评分的可扩展无参考评估方法,用于评估ASR与黄金转录之间的响应。

英文摘要

Large language model (LLM) voice assistants are commonly built as cascaded automatic speech recognition (ASR)-to-LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (the name denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.

2605.16354 2026-05-19 cs.LG cs.AI cs.CL cs.HC stat.ML 版本更新

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

通过LLM裁判增强人类评估:你真的需要多少人类评审?

Jane Paik Kim

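AI summary: This paper proposes using LLMs as auxiliary judges that augment human evaluation, and uses a two-stage sampling design to determine the sample sizes of human and LLM ratings needed to reach a target level of statistical power.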

Comments 10 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作AI系统的自动评估者,包括在高风险应用中。在这一角色中,LLMs用于生成关于模型输出质量、适当性甚至安全性的判断。这种做法受到实际限制的驱动。专家人类评分成本高且难以扩展,而LLM评分可以快速低成本地生成。然而,当前部署LLM评估者的方法是随意的,通常仅限于报告人类和LLM裁判之间的一致性度量作为替代人类评分的正当性,且缺乏正式的研究设计基础。本文(1)将LLM裁判的角色从替代性转为辅助性,并(2)将LLM作为裁判范式制定为通过两阶段抽样设计增强人类评估的一种方法,其中在第一阶段对所有观察进行LLM评估,在第二阶段对子样本进行部分人类评分。我们提出使用来自缺失数据文献的双重鲁棒估计器,利用预测模型的鲁棒性属性,因为缺失性模型是设计已知的。使用该估计器的渐近方差,我们提出如何确定人类和LLM评分的样本量以达到目标统计功效。我们还展示通过分配更多人类评分给LLM评分预测性不高的评估类型,可以高效地设计研究。据我们所知,关于在验证基准时应保留多少人类监督的指导非常有限。

英文摘要

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
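The two-stage design above can be sketched with the standard augmented inverse-probability-weighted (AIPW) form of a doubly robust mean estimator. This is an assumption about the general technique; the paper's exact estimator may differ. Here `predictions` stands for a model's prediction of the human rating from the LLM rating, `human_ratings` is `None` wherever no human review was sampled, and `sampling_probs` are the design-known probabilities of receiving a human review. All names and numbers are hypothetical.

```python
def doubly_robust_mean(predictions, human_ratings, sampling_probs):
    """AIPW estimate of the mean human rating under two-stage sampling.

    Each item contributes its prediction; items with a human rating also
    contribute an inverse-probability-weighted residual correction.
    """
    total = 0.0
    for pred, rating, prob in zip(predictions, human_ratings, sampling_probs):
        total += pred
        if rating is not None:  # item was selected for human review
            total += (rating - pred) / prob
    return total / len(predictions)


# Hypothetical toy data: four items, two of them human-rated, each with
# a known sampling probability of 0.5.
predictions = [4.0, 3.0, 5.0, 2.0]
human_ratings = [5.0, None, 4.5, None]
sampling_probs = [0.5, 0.5, 0.5, 0.5]
estimate = doubly_robust_mean(predictions, human_ratings, sampling_probs)
```

Because the sampling probabilities are fixed by the study design, the residual correction keeps the estimate consistent even when the predictions track human ratings poorly; better predictions mainly reduce variance, which is what drives the sample-size calculation.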

2605.16342 2026-05-19 cs.LG cs.AI cs.CL 版本更新

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO:去噪感知的信用分配用于扩散语言模型中的强化学习

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

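AI summary: This paper proposes DACA-GRPO, which introduces Denoising Progress Scores and Stratified Masking Likelihood to improve credit assignment for reinforcement learning in diffusion language models, with gains on tasks such as mathematical reasoning and code generation.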

详情
AI中文摘要

扩散大语言模型是自回归模型的有力替代品,但现有强化学习方法将所有去噪步骤视为同等重要,并依赖于有偏、高方差的似然估计。我们识别出两个根本性弱点:去噪轨迹中缺乏时间信用分配,以及用于策略优化的均场似然估计存在系统偏差。为了解决这些问题,我们提出了Denoising-Aware Credit Assignment for GRPO(DACA-GRPO),一种轻量级、即插即用的增强方法,适用于任何GRPO风格的训练器。DACA-GRPO引入了两个互补机制:去噪进度评分,从中间预测中提取每token的重要性权重,无需额外前向成本;分层掩码似然,将token位置分为层次,使每个token在大部分序列作为上下文的情况下进行预测,从而减少均场偏差。在三种GRPO基础方法上应用DACA-GRPO,使其在七个基准测试中取得一致提升,涵盖数学推理、代码生成、约束满足和受约束生成等任务,在数学推理中提升达5.6个百分点,在代码生成中提升7.4个百分点,在约束满足中提升36.3个百分点,在JSON schema符合性中提升5.9个百分点。

英文摘要

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.
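The idea behind Stratified Masking Likelihood can be sketched as follows, assuming a random partition of token positions into equal-sized strata. The abstract does not specify how strata are chosen, so the functions and parameters below are hypothetical. Each stratum is masked in turn, which leaves the remaining positions visible as context.

```python
import random


def stratify_positions(seq_len, num_strata, seed=0):
    """Randomly partition positions 0..seq_len-1 into num_strata disjoint strata."""
    rng = random.Random(seed)
    positions = list(range(seq_len))
    rng.shuffle(positions)
    return [sorted(positions[k::num_strata]) for k in range(num_strata)]


def stratum_masks(seq_len, strata):
    """One boolean mask per stratum; True marks a position to mask and predict."""
    masks = []
    for stratum in strata:
        masked = set(stratum)
        masks.append([pos in masked for pos in range(seq_len)])
    return masks


strata = stratify_positions(seq_len=10, num_strata=3)
masks = stratum_masks(10, strata)
# Number of visible context tokens available when predicting each stratum.
context_sizes = [mask.count(False) for mask in masks]
```

With three strata, every token is predicted once while roughly two thirds of the sequence stays visible, in contrast to a mean-field estimate that predicts all tokens from a fully masked input.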

2605.16338 2026-05-19 cs.DL cs.CL 版本更新

Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment

Vidya:一种由人工智能驱动的模块化流水线,用于档案自动化和语义元数据增强

Cloter Migliorini Filho, Julia Graciela Machado, Edson Armando Silva, Marcella Scoczynski

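AI summary: Vidya uses large language models and open-source tools for semantic enrichment and automated archival processing at scale. It produces structured JSON output through YAML-defined ontologies and Pydantic validation, enabling low-cost deployment in memory institutions while complying with the relevant archival standards.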

详情
AI中文摘要

大规模历史档案数字化创造了'暗数据'的悖论:缺乏检索元数据的数字对象。手动档案描述缓慢且昂贵,限制了发现和再利用。我们提出Vidya,一种模块化流水线,通过协调大语言模型(LLMs)和开源工具,实现大规模语义增强和档案处理。Vidya通过YAML定义的本体和Pydantic验证约束生成,产生确定性的结构化JSON输出。Vidya在Ponta Grossa州立大学的数字人文与创新实验室开发,采用制作原则和开源实践,使使用普通硬件的机构能够以低成本部署。我们比较了LLM性能并展示了成本效益分析,显示了重大收益,将处理时间从数十年减少到数天,同时符合NOBRADE和ISAD(G)标准。

英文摘要

The large-scale digitization of historical archives has created a paradox: "dark data", that is, digital objects lacking metadata for retrieval. Manual archival description is slow and expensive, limiting discovery and reuse. We propose Vidya, a modular pipeline that orchestrates Large Language Models (LLMs) and FOSS tools to automate semantic enrichment and archival ingestion at scale. Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models. Developed at the Laboratory for Digital Humanities and Innovation (LAMUHDI) of the State University of Ponta Grossa (UEPG), Vidya applies Maker principles and open-source practices to enable low-cost deployment in memory institutions using modest hardware. We compare LLM performance and present a cost-benefit analysis showing major gains, reducing processing time from decades to days while complying with NOBRADE and ISAD(G).

2605.16303 2026-05-19 cs.CY cs.AI cs.CL 版本更新

From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

从人口统计学到调查锚点:评估LLM代理在建模退休态度中的表现

Rubén Garzón, Pauline Baron, Vincent Grari, Jonne Kamphorst, Michael Bernstein, Marcin Detyniecki

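AI summary: This paper compares the accuracy of demographics-only LLM agents with that of agents anchored in survey data when predicting responses to a retirement attitudes survey. Demographics-only agents are biased and less faithful to human response patterns, whereas survey-anchored agents capture them better.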

Comments 50 pages, 22 figures

详情
AI中文摘要

大型语言模型(LLM)代理可能为预测人类对调查的响应提供工具。一种常见的技术是仅使用人口统计学数据(如国家、年龄、性别、就业状况、收入、教育和婚姻状况)来定义这些代理。我们比较了基于人口统计学的代理与基于更广泛调查响应数据的代理的预测准确性。我们测试了这两种方法在预测多学科、跨国的《健康、老龄化与退休状况调查》(SHARE)中的响应,重点关注五个变量,这些变量来自三个与个人财务相关的政策相关构念。在这些三个构念中,我们发现,与基于更广泛数据训练的调查代理相比,仅依赖人口统计学的代理(1)表现出集中趋势偏差,使答案偏向人口均值,(2)过于准确,无法重现人类响应中的错误答案和“不知道”响应。这些性能差异通过复制先前退休规划研究中的分层回归分析得到进一步验证。仅基于人口统计信息的代理重现了财务风险承受能力、未来时间观念和退休规划知识各自预测退休储蓄的结果。然而,只有基于调查锚点的代理才能重现这三个因素之间的相互作用。这些发现表明,在仅使用人口统计学定义LLM代理以预测调查响应时应保持谨慎。

英文摘要

Large language model (LLM) agents may offer tools to predict human responses to surveys. A common technique for defining these agents uses only demographics, for example country, age, gender, employment status, income, education, and marital status. We compare the predictive accuracy of demographic agents to that of survey agents defined with a larger set of in-domain survey responses. We test both approaches in predicting responses to the multidisciplinary, cross-national Survey of Health, Ageing and Retirement in Europe (SHARE), focusing on five variables from three policy-relevant constructs around personal finance. In these three constructs, we observe that, compared to survey agents trained on broader data, demographics-only agents (1) exhibited a central tendency bias, skewing answers toward population means, and (2) were unrealistically accurate, failing to reproduce the incorrect answers and "don't know" responses typical of human respondents. These performance differences are further substantiated through the replication of a hierarchical regression analysis from prior retirement planning research. Agents based solely on demographic information reproduce the outcome that financial risk tolerance, future time perspective, and knowledge of retirement planning are each predictive of retirement savings. However, only the survey-anchored agents succeed in reproducing the interaction among these three factors. These findings suggest caution in using only demographics to define LLM agents for predicting survey responses.

2605.16295 2026-05-19 cs.CY cs.AI cs.CL cs.GR cs.HC cs.MM 版本更新

ANVIL: Analogies and Videos for Lecturers

ANVIL:为讲师提供类比和视频

Yuri Noviello, Anastasiia Birillo, Gosia Migut

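AI summary: ANVIL is a multimodal generative system that automatically produces analogy-based instructional animations for computer science topics. It generates a textual analogy, a structured visual screenplay, and executable code to render the animation.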

详情
AI中文摘要

我们介绍了ANVIL,一种多模态生成系统,可自动生成基于类比的教学动画。给定一个概念定义,ANVIL生成文本类比,将其编译成结构化的视觉剧本,并生成可执行的manim代码以渲染动画,同时具备自动修复机制以提高鲁棒性。在大规模评估此类系统时,需要在教学有效性与可扩展性之间取得平衡。我们首先通过教师评估来确定质量评估的基础,并利用其发现来指导自动化筛选。对于文本类比,我们引入基于LLM的评估器以实现可扩展的质量筛选;对于视频,由于主观判断难以自动化,我们改用自动代理来评估与预期剧本的一致性并进行错误分析。我们进一步与教育工作者进行用户研究,以考察采用要求和风险。我们的发现表明,ANVIL可以生成经常被评价为足够的材料,并且教育工作者对其感知价值和易用性有积极反应。

英文摘要

We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.

2605.16289 2026-05-19 cs.CY cs.CL 版本更新

Linguistic Uncertainty and Reply Engagement on X: A Cross-Domain Replication of the Uncertainty-Reply Asymmetry

语言不确定性与X上的回复参与:对不确定性-回复不对称性的跨领域复制

Mohamed Soufan

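AI summary: The study examines the relationship between linguistic uncertainty and social media engagement. It finds that uncertain posts receive more replies and that the asymmetric engagement pattern from prior Arabic-language research replicates.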

Comments 13 pages, 2 figures, 2 tables

详情
AI中文摘要

在社交媒体中,语言不确定性普遍存在,但其与参与度的关系在不同语言和主题中仍不明确。本文利用2026年4月三天收集的2258篇英文帖子(涉及联邦储备政策、通货膨胀和选举政治),测试先前阿拉伯语研究中观察到的不确定性-回复不对称性是否在更广泛背景下复制。通过基于词典的不确定性框架对帖子进行分类,约三分之一被识别为不确定。不确定帖子平均获得82%更多的回复,再贴和点赞的增加较小,复制了先前研究中的不对称参与模式。回归结果确认不确定性与回复之间存在正相关(η=0.126,p=0.011),相当于约13%更高的预期回复参与度,而总参与度则显示较弱的正相关。这些发现表明,语言不确定性系统性地增加对话参与度,并可能反映一种跨语言和领域的普遍互动机制。

英文摘要

Linguistic uncertainty is common in social media, but its relationship with engagement remains unclear across languages and topics. Using 2,258 English-language posts on Federal Reserve policy, inflation, and electoral politics collected over three days in April 2026, we test whether the Uncertainty-Reply Asymmetry observed in prior Arabic-language research replicates in a broader context. Posts are classified using a lexicon-based uncertainty framework, with approximately one-third identified as uncertain. Uncertain posts receive 82% more replies on average than certain posts, with smaller increases in reposts and likes, replicating the asymmetric engagement pattern observed in prior work. Regression results confirm a positive and statistically significant association between uncertainty and replies ($\beta$ = 0.126, p = 0.011), equivalent to ~13% higher expected reply engagement, while total engagement shows a positive but weaker association. These findings suggest that linguistic uncertainty systematically increases conversational engagement and may reflect a general interactional mechanism across languages and domains.
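The link between the reported coefficient of 0.126 and the roughly 13% figure can be reconstructed under one assumption that the abstract does not state: that the regression uses a log link (or a log-transformed outcome), so that the coefficient acts multiplicatively on expected replies.

```latex
\frac{\mathbb{E}[\text{replies} \mid \text{uncertain}]}{\mathbb{E}[\text{replies} \mid \text{certain}]}
  = e^{\beta} = e^{0.126} \approx 1.134,
\qquad
e^{0.126} - 1 \approx 0.134 \approx 13\%.
```

Under that reading, the 13% figure is the regression-adjusted effect, which is distinct from the 82% difference in raw average replies.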

2605.16288 2026-05-19 cs.CY cs.CL 版本更新

When AI Tells You What You Want to Hear: Sycophantic Behavior of Large Language Models in Dementia Care Settings

当AI说你想听的话:大型语言模型在痴呆症护理环境中的趋炎附势行为

Christian Kolb

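AI summary: The study examines whether large language models in dementia care scenarios sacrifice professional quality to accommodate social expectations. Testing four models with five prompts, it finds that response quality declines as confirmatory and authority-related framing intensifies.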

Comments 10 pages, 4 figures. Exploratory study

详情
AI中文摘要

大型语言模型(LLMs)正在越来越多地应用于临床和护理环境。本探索性研究检验了LLMs在痴呆症护理情境中是否表现出趋炎附势行为——即根据社会期望信号调整响应而非维持专业质量。五种提示(P1中性到P5权威信号)被提交给四款LLMs(GPT-5、Claude Sonnet 4.6、Gemini 3.1 Pro、Mistral Large),每种提示重复五次(N=100次响应)。响应通过LLM-as-a-Judge方法评估,依据七个护理伦理质量标准(K1-K7)和语气量表(0-3)。所有模型均显示提示水平与响应质量之间存在显著负相关(rho范围从-0.543到-0.734,p<0.01)。Mistral Large表现出最显著的效果(rho=-0.734),平均分从P1的6.0/7降至P5的0.2/7。研究结果表明,LLMs在高风险护理环境中存在情境敏感的风险,提示框架显著影响响应质量——这一维度在医疗AI部署中关注不足。

英文摘要

Large language models (LLMs) are increasingly used in clinical and care settings. This exploratory study investigates whether LLMs exhibit sycophantic behavior, that is, adapting their responses to social expectation signals rather than maintaining professional quality, in the context of dementia care. Five prompts with systematically increasing confirmatory and authority-related framing (P1 neutral to P5 authority-signaled implementation support) were submitted to four LLMs (GPT-5, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Large), each repeated five times (N = 100 responses). Responses were evaluated using an LLM-as-a-Judge methodology against seven nursing-ethical quality criteria (K1-K7) and a tone scale (0-3). All models showed significant negative Spearman correlations between prompt level and response quality (rho ranging from -0.543 to -0.734, all p < 0.01). Mistral Large exhibited the most pronounced effect (rho = -0.734), with mean scores dropping from 6.0/7 at P1 to 0.2/7 at P5. The findings suggest that LLMs pose context-sensitive risks in high-stakes care environments and that prompt framing significantly shapes response quality, a dimension that has received insufficient attention in healthcare AI deployment.
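The Spearman correlation between prompt level and quality score can be computed as in the following minimal sketch, which uses average ranks for ties and then a Pearson correlation on the ranks. The scores below are hypothetical and are not the study's data. The design itself yields 4 models x 5 prompts x 5 repetitions = 100 responses, matching the reported N.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores (0-7) for prompt levels P1..P5, two repetitions each.
levels = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
scores = [6, 6, 5, 6, 4, 3, 2, 3, 0, 1]
rho = spearman_rho(levels, scores)
```

A strongly negative rho on such data means that quality scores fall as the framing becomes more confirmatory and authority-laden, which is the pattern the study reports.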

2605.16275 2026-05-19 cs.CY cs.AI cs.CL cs.MM 版本更新

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

AI 产出物还是AI增强?英语学术用途课程中学生对AI生成媒体的看法

David James Woo, Deliang Wang, Kai Guo

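AI summary: The study examines the pedagogical value of AI-generated content in an EAP course. A mixed-methods analysis finds that students prefer visual content, that video preference correlates positively with academic performance, and that higher cognitive load correlates negatively with grades, indicating that material complexity needs careful calibration.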

Comments 23 pages, 7 figures

详情
AI中文摘要

人工智能(AI)检索增强生成(RAG)工具现在使教育者能够将课程材料转化为多样化的多媒体内容。然而,这种AI生成内容是教学支架还是低质量的AI产出仍不明确。本文报告了在一所香港社区学院的英语学术用途(EAP)课程中,教师引导生成补充材料的开发、实施与评估。主要使用Google Notebook LM生成视频、播客、信息图和个性化反馈报告。通过混合方法设计,包括调查、半结构化访谈和与学术成绩的相关性分析,研究发现学生认为材料有用且易于使用,更偏好与评估相关的视觉和多模态内容,特别是视频和信息图。视频偏好与学业成绩正相关,但高认知负荷与成绩负相关,表明需谨慎校准内容复杂性。值得注意的是,部分成绩较低的学生自行将材料作为补救支架。该实践表明,RAG工具能够实现传统方法难以实现的规模化个性化反馈。当与学生目标和认知原理相结合时,教师引导的AI生成可以有意义地增强EAP学习生态系统,而非产生AI产出物。

英文摘要

Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or as AI slop: high-volume, low-quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google NotebookLM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students' preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.

2605.16264 2026-05-19 cs.HC cs.CL 版本更新

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

基于大语言模型的智能通知生成:从静态个性化到情境感知的说服性信息

Nilesh Agrawal

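AI summary: This paper argues that LLMs can improve the quality of notification messages. It defines six quality dimensions, reviews reported CTR gains from LLM-based composition, and proposes a decision framework for when LLM generation is worth applying.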

Comments 17 pages, 1 figure, 7 tables. Code available at https://github.com/ndagrawal/LLMNotificationComposition

详情
AI中文摘要

推送通知仍然是数字平台与用户互动最直接的渠道,但现有方法在通知对象、时间及内容推荐上投入巨大,而如何沟通仍是最薄弱的环节。本文认为信息质量是独立且被低估的优化点,大语言模型在此层创造最大差异化价值。本文贡献包括:首先定义通知信息质量的六个维度,并展示基于LLM的生成如何优于模板;其次提供架构归因分析,解构信息生成与其他组件(目标定位、排序、时间)的关系;第三引入三准则决策框架,明确何时LLM生成是瓶颈。本文通过PRISMA引导的调研(28个来源,142个筛选)、社交媒体、食品配送和电子商务领域的应用分析,并提出统一的架构框架,包含预算感知路由、基础生成、候选排序、多样性控制和在线学习。

英文摘要

Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning.

2605.14787 2026-05-19 cs.CV cs.CL 版本更新

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

组合图像检索基准是否需要多模态组合?

Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini

AI总结 研究发现组合图像检索任务中,许多查询可通过单一模态解决,而非真正的多模态组合,揭示了多模态组合的假设不成立。

详情
AI中文摘要

组合图像检索(CIR)是一种多模态检索任务,其中查询由参考图像和文本修改组成,目标是检索同时满足两者的目标图像。原则上,人们假定在CIR基准上取得良好表现需要多模态组合,即结合参考图像与文本修改中的互补信息。本文表明,这一假设并不总成立。在四个广泛使用的CIR基准和十一种通用多模态嵌入模型上,大量查询仅凭单一模态即可解决(32.2%至83.6%),揭示了普遍存在的单模态捷径。因此,较高的CIR性能可能来自单模态信号,而非真正的多模态组合。为此,我们进行了两阶段审核:首先通过跨模型分析识别可由捷径解决的查询;其次对4,741个不含捷径的查询进行人工验证,其中仅1,689个结构合理,常见问题包括模糊的编辑和不匹配的目标。在该验证子集上重新评估模型,表现出质的不同:查询无法再仅凭单一模态解决,成功检索需要结合两种输入。虽然准确率下降,但对多模态信息的依赖增加。总体而言,当前CIR基准将可由捷径解决的查询、含噪查询和真正具有组合性的查询混为一谈,导致高估了模型的多模态组合能力。

英文摘要

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

2604.02178 2026-05-19 cs.CL cs.AI cs.LG 版本更新

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

专家反击:在专家层面解读混合专家语言模型

Jeremy Herbst, Stefan Wermter, Jae Hee Lee

AI总结 研究通过k稀疏探测比较MoE专家与密集FFN,发现专家神经元更单语义,提出以专家为分析单位,揭示专家是细粒度任务专家,而非领域专家或token处理者。

Comments 8 pages, 7 Figures. Accepted at ICML 2026. Improved writing, changed author order, updated citations

详情
AI中文摘要

混合专家(MoE)架构已成为扩展大语言模型(LLMs)的主导选择,每个token仅激活部分参数。尽管MoE主要用于计算效率,但其稀疏性是否使其比密集前馈网络(FFN)更容易解释仍存疑问。通过k稀疏探测比较MoE专家与密集FFN,发现专家神经元始终更单语义,随着路由稀疏性增加,差距扩大。这表明稀疏性迫使神经元和整个专家朝单语义方向发展。基于此发现,我们从神经元层面转向专家层面作为更有效的分析单位。通过自动解读数百个专家,验证了这一方法。此分析解决了关于专业化争论:专家既非广领域专家(如生物学)也非简单token处理者。相反,它们作为细粒度任务专家,专门处理语言操作或语义任务(如闭合LaTeX括号)。我们的发现表明,MoE在专家层面具有内在可解释性,为大规模模型可解释性提供了更清晰路径。代码见:https://github.com/jerryy33/MoE_analysis。

英文摘要

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.

2603.20562 2026-05-19 cs.CL cs.AI 版本更新

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

排列一致性列表判断用于鲁棒事实性评估

Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

AI总结 本文提出PCFJudge方法,通过多排列重跑列表事实性提示以提高LLM事实性判断的鲁棒性,实验显示其在RewardBench 2 Factuality上显著提升准确率。

Comments Accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)如今被广泛用作评判者,但其决定可能因本应无关的呈现方式而改变。我们研究其中一种不稳定性来源:列表式事实性评估中的候选顺序敏感性,其中多个答案可能看似同样精良,但幻觉风险差异显著。我们引入PCFJudge,一种推理时方法:对同一候选集的多种排列重复运行相同的事实性优先列表式提示,并将所得的分数、排名和不确定性信号聚合为单一的共识决策。在RewardBench 2 Factuality上,最终的七排列聚合(K=7)使GPT-5.4的Top-1选择准确率从86.00%提升至91.33%,使Claude Sonnet 4.6的准确率从86.33%提升至89.67%。这些结果表明,候选顺序可能是事实性评判中的重要误差来源,对这种干扰变化进行边缘化可以提高LLM评估的可靠性。

英文摘要

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
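
The permutation-consensus idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `Judge` interface, the use of cyclic rotations as the orderings, and the aggregation rule (summed score, ties broken by summed rank) are all assumptions made for the example, and the toy judge simply adds a bonus to whichever candidate is listed first.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# Hypothetical judge interface: takes candidates in presentation order and
# returns one factuality score per candidate.
Judge = Callable[[Sequence[str]], Dict[str, float]]


def rotations(candidates: Sequence[str], k: int) -> List[List[str]]:
    """Return k cyclic rotations of a non-empty candidate list (assumed ordering scheme)."""
    items = list(candidates)
    n = len(items)
    return [items[i % n:] + items[:i % n] for i in range(k)]


def consensus_pick(candidates: Sequence[str], judge: Judge, k: int) -> str:
    """Pick the candidate with the highest summed score over k orderings;
    ties are broken in favour of the lower summed rank."""
    score_sum: Dict[str, float] = defaultdict(float)
    rank_sum: Dict[str, int] = defaultdict(int)
    for ordering in rotations(candidates, k):
        scores = judge(ordering)
        ranked = sorted(ordering, key=lambda c: scores[c], reverse=True)
        for rank, cand in enumerate(ranked):
            score_sum[cand] += scores[cand]
            rank_sum[cand] += rank
    return max(candidates, key=lambda c: (score_sum[c], -rank_sum[c]))


# Toy judge with a position bias: the first-listed candidate gets a bonus.
QUALITY = {"A": 0.80, "B": 0.82, "C": 0.50}


def biased_judge(ordering: Sequence[str]) -> Dict[str, float]:
    return {c: QUALITY[c] + (0.10 if pos == 0 else 0.0)
            for pos, c in enumerate(ordering)}


single_pass = biased_judge(["A", "B", "C"])
print(max(single_pass, key=single_pass.get))             # one fixed ordering
print(consensus_pick(["A", "B", "C"], biased_judge, 3))  # three rotations
```

In this toy setting the single ordering selects A because of the position bonus, while averaging over the three rotations gives every candidate the bonus once and selects B, the candidate with the highest underlying quality.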

2603.02218 2026-05-19 cs.LG cs.AI cs.CL cs.IT math.IT 版本更新

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

仅在自我合成管道确保可学习信息增益时,自我博弈才会进化

Wei Liu, Siya Qi, Yali Du, Yulan He

AI总结 本文通过实验揭示可持续自我进化需要可学习信息递增的自我合成数据管道,提出自我进化LLM的三重角色及系统设计,解决自我博弈停滞问题。

Comments 10 pages, 6 figures, 7 formulas, accepted by ICML 2026 position paper track

详情
AI中文摘要

大型语言模型(LLMs)使构建通过自我进化循环不断改进的系统成为可能,但许多现有方案更应被理解为自我博弈,且往往很快陷入停滞。核心失败模式是:循环合成了更多数据,却没有增加下一轮迭代的可学习信息。通过自我博弈编程任务上的实验,我们发现可持续的自我进化需要一条可学习信息随迭代递增的自我合成数据管道。我们识别出自我进化LLM所扮演的三重角色:生成任务的Proposer、尝试求解的Solver以及提供训练信号的Verifier,并从这一三重角色视角提出三种共同以可学习信息增益为目标的系统设计。不对称共进化在各角色间形成由弱到强再到弱的闭环。容量增长扩展参数和推理时预算,以匹配不断上升的可学习信息。主动信息寻求引入外部上下文和新的任务来源,以防止饱和。这些模块共同提供了一条从脆弱的自我博弈动态走向持续自我进化的、可测量的系统级路径。

英文摘要

Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.

2603.01683 2026-05-19 cs.CL cs.AI 版本更新

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

手术式后训练:面向推理并保留知识的近端on-policy蒸馏

Wenye Lin, Kai Han

AI总结 本文提出SPOT框架,通过近端策略蒸馏在后训练中保留知识,提升推理性能,实验表明其在少量数据下显著提升模型准确率。

Comments 21 pages

详情
AI中文摘要

通过后训练向大语言模型(LLMs)注入新的推理知识时,常常导致灾难性遗忘。最近的研究强调了on-policy数据的重要性,但认为KL散度无法缓解遗忘。与之相反,我们通过分析和实验表明,KL约束的奖励公式实际上对后训练中的知识保留起关键作用。据此,我们提出手术式后训练(SPOT),一种近端on-policy蒸馏框架,旨在高效优化推理,同时保留已有知识。SPOT包含:(1)数据校正流水线,利用Oracle以最小编辑“手术式”地修正错误步骤,生成近端on-policy数据;(2)基于奖励的二元交叉熵目标,对增强推理和缓解遗忘至关重要。实验表明,仅使用4k个校正后的数学样本对,SPOT就使Qwen3-8B在域内和域外任务上的准确率平均提升6.2%,且仅需在8块H800 GPU上训练16分钟。此外,SPOT为后续强化学习提供了更优的初始化,显著提高了性能上限。代码:https://github.com/Visual-AI/SPoT

英文摘要

Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL-divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation actually plays a critical role in retaining knowledge during post-training. This motivates our Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks, requiring merely 16-minute model training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly elevating the performance ceiling. Code: https://github.com/Visual-AI/SPoT

2602.16699 2026-05-19 cs.CL cs.AI 版本更新

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

校准后再行动:在大语言模型代理中考虑成本的探索

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

AI总结 本文提出Calibrate-Then-Act框架,使LLM代理在不确定环境下显式平衡成本与不确定性,从而更优地决策。

详情
AI中文摘要

大语言模型代理被部署在需要通过交互获取信息的环境中。在这些场景中,代理必须权衡行动方式中固有的成本与不确定性,例如何时停止探索并提交答案。例如,在编程任务中,代理可以直接运行其生成的代码,也可以为该代码片段生成测试;编写和运行测试的成本非零,但通常低于运行有缺陷代码的成本。本文表明,可以引导LLM代理显式地推理如何平衡这些成本与不确定性,从而在环境中采取更优的行动。我们将多个任务(包括检索增强问答和文件阅读编码任务)形式化为不确定性下的序贯决策问题。每个问题都有影响代理表现的潜在环境状态。我们提出名为Calibrate-Then-Act(CTA)的框架,向代理提供对该环境状态推断出的先验,使其能够更优地行动。这一信息在性质上改变了代理行为,并赋予代理标准RL训练无法学到的环境敏感性。在合成任务、问答和文件阅读上的结果表明,通过CTA使成本-收益权衡显式化,有助于代理发现更优的决策策略。

英文摘要

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.
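
The cost-uncertainty tradeoff in the programming example can be illustrated with a toy expected-cost rule. This is a sketch under strong assumptions and not the CTA framework itself, in which the prior is passed to an LLM agent that reasons over it: the function names are hypothetical, the test is assumed to catch every bug, and fixing a caught bug is assumed to be free.

```python
from typing import Dict


def expected_costs(p_bug: float, test_cost: float, failure_cost: float) -> Dict[str, float]:
    """Expected cost of each action given a prior probability that the code is buggy.

    Assumptions (illustrative only): a test always detects a bug, and a
    detected bug is repaired at no extra cost, so testing first costs
    exactly test_cost.
    """
    return {
        "commit": p_bug * failure_cost,  # pay failure_cost only if the code is buggy
        "test_first": test_cost,         # pay for the test regardless
    }


def choose_action(p_bug: float, test_cost: float, failure_cost: float) -> str:
    """Return the action with the lower expected cost (ties go to 'commit')."""
    costs = expected_costs(p_bug, test_cost, failure_cost)
    return min(costs, key=costs.get)


# A higher prior on bugs makes testing worthwhile; a low prior does not.
print(choose_action(p_bug=0.30, test_cost=1.0, failure_cost=10.0))
print(choose_action(p_bug=0.05, test_cost=1.0, failure_cost=10.0))
```

With these numbers the first call prefers `test_first` (expected cost 1.0 versus 3.0) and the second prefers `commit` (0.5 versus 1.0), which shows how the same costs lead to different actions once the prior changes.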

2509.22510 2026-05-19 cs.CL 版本更新

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

我们思考,因此我们对LLMs进行对齐,使其在出错前变得有益、无害和诚实

Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

AI总结 本文提出AMBS框架,通过在1-to-N Transformer设置中参数化目标特定转换,解决LLM对齐中多个目标之间的干扰问题,提升HHH目标的联合满足能力。

详情
AI中文摘要

大语言模型(LLM)的对齐能力是指在生成过程中满足预期目标的能力,这对于可信部署至关重要。在实践中,对齐通常通过多个目标如Helpfulness、Harmlessness和Honesty(HHH)来操作化。先前的工作通过在标准Transformer解码器中使用引导向量来研究对齐,但将目标孤立处理,优化单个目标会覆盖其他目标,导致干扰。近期的工作尝试通过将引导扩展到1-to-N Transformer设置来解决这一限制,通过将表示复制到目标特定路径中,但应用转换独立,导致目标之间响应不一致。同样,安全RLHF和MoE设计方法研究目标之间的权衡,但不约束目标特定转换在共享表示中的限制。因此,即使是最先进的LLM在复杂情况下也难以同时满足HHH目标。为此,我们提出自适应多分支引导(AMBS),一种在1-to-N Transformer设置中的两阶段框架,参数化目标特定转换相对于共享表示。在第一阶段,计算一次共享隐藏表示。在第二阶段,该表示被复制到N条路径中,并相对于共享参考更新,捕捉目标特定偏差的同时限制分歧。这在单次前向传递中产生N个目标特定响应,可以在解码时结合以获得跨目标的单一响应。在多个主干网络上,AMBS在HHH目标上提升了性能,具有在WR、TI和SS(例如,LLaMA-2-7B上平均56.5%)上的一致提升,同时保持效率(例如,189 Tok/s,9 GPU-hrs)

英文摘要

Alignment of Large Language Models (LLMs) is the ability to satisfy desired objectives during generation, which is critical for trustworthy deployment. In practice, alignment is often operationalized through multiple objectives such as Helpfulness, Harmlessness, and Honesty (HHH). Prior works study alignment via steering vectors in standard Transformer decoders but treat objectives in isolation, where optimizing a single objective can overwrite others, leading to interference. Recent works attempt to address this limitation by extending steering to a 1-to-N Transformer setting by replicating representations into objective-specific pathways, but apply transformations independently, resulting in inconsistent responses across objectives. Similarly, approaches such as safe RLHF and MoE-based designs study trade-offs across objectives but do not constrain objective-specific transformations within a shared representation during inference. As a result, even aligned State-of-the-Art (SOTA) LLMs can struggle to jointly satisfy HHH objectives in complex settings. To address this, we propose Adaptive Multi-Branch Steering (AMBS), a two-stage framework in a 1-to-N Transformer setting that parameterizes objective-specific transformations relative to a shared representation. In Stage I, a shared hidden representation is computed once. In Stage II, this representation is replicated into N pathways and updated relative to a shared reference, capturing objective-specific deviations while restricting divergence. This produces N objective-specific responses within a single forward pass, which can be combined at decoding to obtain a single response across objectives. Across multiple backbones, AMBS improves performance across HHH, with consistent gains in WR, TI, and SS (e.g., Avg 56.5% on LLaMA-2-7B) while maintaining efficiency (e.g., 189 Tok/s, 9 GPU-hrs).

2508.06524 2026-05-19 cs.CL cs.AI cs.CY cs.DC cs.LG 版本更新

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling:扩展神经缩放定律以用于大型语言模型中的碳足迹

Lei Jiang, Fan Chen

AI总结 本文提出CarbonScaling框架,结合神经缩放定律和分布式训练策略,准确建模前沿LLM训练的碳足迹,提高硬件配置和排放估计的精度。

Comments 8 pages

详情
AI中文摘要

大型语言模型(LLMs)日益遵循将性能提升与快速扩张的计算预算联系起来的神经缩放定律,这引发了对前沿规模训练可持续性的担忧。现有碳估计方法主要依赖于对历史运行的回归分析,无法捕捉关键的系统级因素,包括硬件异构性、分布式并行、通信开销和架构稀疏性。我们提出了CarbonScaling,一种硬件感知的分析框架,用于建模前沿LLM训练的碳缩放行为。该框架整合了神经缩放定律、分布式训练策略、加速器与互连建模,以及运营碳与隐含碳核算,以估计可行的硬件配置及相应排放。CarbonScaling联合建模张量、流水线、数据和专家并行,同时纳入内存、带宽、利用率和运行时间约束。实验验证表明,其保真度显著高于基于回归的基线,并凸显了在万亿参数规模下隐含碳日益增长的重要性。源代码:https://github.com/UnchartedRLab/CarbonScaling。

英文摘要

Large language models (LLMs) increasingly follow neural scaling laws that tie performance gains to rapidly expanding computational budgets, raising concerns about the sustainability of frontier-scale training. Existing carbon-estimation methods largely depend on regression over historical runs and fail to capture critical system-level factors, including hardware heterogeneity, distributed parallelism, communication overhead, and architectural sparsity. We present CarbonScaling, a hardware-aware analytical framework for modeling the carbon scaling behavior of frontier LLM training. The framework integrates neural scaling laws, distributed training strategies, accelerator and interconnect modeling, and operational and embodied carbon accounting to estimate feasible hardware configurations and associated emissions. CarbonScaling jointly models tensor, pipeline, data, and expert parallelism while incorporating memory, bandwidth, utilization, and runtime constraints. Experimental validation demonstrates substantially higher fidelity than regression-based baselines and highlights the growing importance of embodied carbon at trillion-parameter scales. Source code: https://github.com/UnchartedRLab/CarbonScaling.

2508.04047 2026-05-19 cs.CL 版本更新

LaPA$^2$: Length-Aware Prefix and Prompt Attention Augmentation for Long-Form Controllable Text Generation

LaPA$^2$: 一种长度感知的前缀和提示注意力增强方法用于长形式可控文本生成

Jiabing Yang, Yixiang Chen, Zichen Wen, Chenhang Cui, Peiyan Li, Yuan Xu, Bowen Fang, Tao Yu, Ruikang Lin, Yan Huang, Liang Wang

AI总结 本文提出LaPA$^2$方法,通过长度感知对数缩放和上下文锚点强化,解决长序列中注意力稀释问题,提升可控文本生成的性能和可控性。

Comments Substantially revised version, renamed from "DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation"

详情
AI中文摘要

基于前缀的方法因参数效率成为可控文本生成(CTG)的有前景范式,但其在长序列中可控性下降。本文识别出注意力稀释是关键因素:随着序列增长,softmax机制导致控制信号注意力自然衰减,产生'模糊'控制效果。为此,我们提出LaPA$^2$(长度感知前缀和提示注意力增强),一种无需训练且模型无关的框架,旨在维持长上下文中的稳健控制。具体而言,LaPA$^2$采用长度感知对数缩放动态放大前缀注意力权重,数学上抵消稀释效应,同时可选的上下文锚点强化对提示词进行同步增强,当强属性控制风险覆盖原始提示时保持语义一致性。LaPA$^2$具有灵活性,支持软前缀(连续嵌入)和硬前缀(离散指令)。在多个CTG任务中的实验表明,LaPA$^2$在长形式设置中持续提升各种基于前缀的方法性能,实现更优的属性可控性,同时保持内容相关性和流畅性。我们的代码和数据可在https://github.com/jiabingyang01/LaPA2公开获取。

英文摘要

Prefix-based methods have emerged as a promising paradigm for Controllable Text Generation (CTG) due to their parameter efficiency. However, while effective in short sequences, their controllability tends to diminish as the generated sequence grows. In this paper, we identify Attention Dilution as a key factor behind this phenomenon: as the sequence length increases, the attention allocated to the control signal naturally decays due to the softmax mechanism, leading to a "fading" control effect. To address this, we propose LaPA$^2$ (Length-aware Prefix and Prompt Attention Augmentation), a training-free and model-agnostic framework designed to sustain robust control in long contexts. Specifically, LaPA$^2$ employs Length-Aware Logarithmic Scaling to dynamically amplify prefix attention weights, mathematically counteracting the dilution effect, while an optional Contextual Anchor Reinforcement applies synchronized augmentation to prompt tokens, preserving semantic coherence when strong attribute control risks overshadowing the original prompt. LaPA$^2$ is versatile, supporting both soft prefixes (continuous embeddings) and hard prefixes (discrete instructions). Experiments on multiple CTG tasks demonstrate that LaPA$^2$ consistently improves the performance of various prefix-based methods in long-form settings, leading to superior attribute controllability while preserving content relevance and fluency. Our code and data are publicly available at https://github.com/jiabingyang01/LaPA2.
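
The dilution effect, and why a logarithmic correction can offset it, can be seen in a simplified calculation. The setup below is an illustrative assumption and not the paper's actual scaling rule, which the abstract does not specify: $m$ prefix tokens share one logit $s_p$, $n$ context tokens share one logit $s_c$, and the hypothetical correction adds $\beta \log n$ to the prefix logits.

```latex
% Attention mass on the prefix under plain softmax (uniform logits assumed):
\[
A_{\text{prefix}}(n) = \frac{m\,e^{s_p}}{m\,e^{s_p} + n\,e^{s_c}}
\;\xrightarrow[n\to\infty]{}\; 0 .
\]
% Adding \beta \log n to the prefix logits multiplies e^{s_p} by n^{\beta}:
\[
A^{\beta}_{\text{prefix}}(n) = \frac{m\,n^{\beta}\,e^{s_p}}{m\,n^{\beta}\,e^{s_p} + n\,e^{s_c}},
\qquad
A^{1}_{\text{prefix}}(n) = \frac{m\,e^{s_p}}{m\,e^{s_p} + e^{s_c}} .
\]
```

Under these assumptions the uncorrected prefix mass decays roughly as $1/n$, whereas with $\beta = 1$ it no longer depends on the sequence length.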

2304.03427 2026-05-19 cs.CL cs.AI cs.CY cs.LG 版本更新

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

清除珠宝:基于谷歌OCR的藏文手稿的神经拼写纠正模型

Queenie Luo, Yung-Sung Chuang

AI总结 本文提出基于谷歌OCR的藏文手稿的神经拼写纠正模型,通过改进的Transformer架构实现自动纠正OCR噪声输出,实验表明其优于其他序列模型。

详情
Journal ref Association for Computing Machinery 2024
AI中文摘要

人文学者依赖古代手稿来研究历史、宗教和社会政治结构。许多努力致力于使用OCR技术数字化这些珍贵的手稿,但大多数手稿因数世纪的污损,使得OCR程序无法准确捕捉褪色的字符和污渍。本文提出基于谷歌OCR处理的藏文手稿的神经拼写纠正模型,用于自动纠正OCR输出中的噪声。本文分为四个部分:数据集、模型架构、训练和分析。首先,我们将原始藏文电子文本语料库特征工程为两个结构化数据框——一组配对玩具数据和一组配对真实数据。然后,我们在Transformer架构中实现了置信度评分机制,用于拼写纠正任务。根据损失和字符错误率,我们的Transformer加置信度评分机制架构证明优于Transformer、LSTM-2-LSTM和GRU-2-GRU架构。最后,为了检验模型的鲁棒性,我们分析了错误的标记,可视化了模型中的注意力和自我注意力热图。

英文摘要

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.
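
Character Error Rate, one of the two metrics the abstract cites, is commonly computed as the edit distance between the reference and the model output divided by the reference length. The sketch below uses that common definition and counts Unicode code points; whether the paper counts code points or larger units such as Tibetan syllables is not stated, so treat the unit as an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Because the distance is normalised by the reference length, the rate can exceed 1.0 when the output is much longer than the reference.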

2102.11105 2026-05-19 cs.SI cs.CL 版本更新

REMOD: Relation Extraction for Modeling Online Discourse

REMOD:面向在线话语建模的关系抽取

Matthew Sumpter, Giovanni Luca Ciampaglia

AI总结 本文提出一种结合图嵌入与语义依存图路径遍历的监督学习方法,用于抽取在线话语中实体间的语义关系,以应对半结构化数据建模的挑战。

Comments 11 pages, 5 figures

详情
AI中文摘要

海量的在线话语对文明且知情的公共领域的运作构成挑战。ClaimReview等在线话语数据标准化工作,正在提供大量经第三方事实核查员审查、可能不准确的声明的新数据。这些数据有助于揭示在线话语的性质、政治精英对其的放大作用,以及其对在线信息生态系统完整性的影响。不幸的是,这类数据大多是半结构化的,给在线话语的建模和推理带来了重大挑战。其中一个关键挑战是关系抽取,即确定声明中命名实体之间的语义关系。本文开发了一种新颖的监督学习关系抽取方法,将图嵌入技术与语义依存图上的路径遍历相结合。我们的方法基于一个直观的观察:了解三元组主语和宾语(例如Washington,_D.C.和United_States_of_America)之间路径上的实体,可为抽取其语义关系(即capitalOf)提供有用信息。作为该技术在在线话语建模中的潜在应用示例,我们展示了该方法可以整合到一个流水线中,对潜在的虚假信息声明进行推理。

英文摘要

The enormous amount of discourse taking place online poses challenges to the functioning of a civil and informed public sphere. Efforts to standardize online discourse data, such as ClaimReview, are making available a wealth of new data about potentially inaccurate claims, reviewed by third-party fact-checkers. These data could help shed light on the nature of online discourse, the role of political elites in amplifying it, and its implications for the integrity of the online information ecosystem. Unfortunately, the semi-structured nature of much of this data presents significant challenges when it comes to modeling and reasoning about online discourse. A key challenge is relation extraction, which is the task of determining the semantic relationships between named entities in a claim. Here we develop a novel supervised learning method for relation extraction that combines graph embedding techniques with path traversal on semantic dependency graphs. Our approach is based on the intuitive observation that knowledge of the entities along the path between the subject and object of a triple (e.g. Washington,_D.C., and United_States_of_America) provides useful information that can be leveraged for extracting its semantic relation (i.e. capitalOf). As an example of a potential application of this technique for modeling online discourse, we show that our method can be integrated into a pipeline to reason about potential misinformation claims.