arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.18753 2026-05-19 cs.CL cs.AI cs.LG 版本更新

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

DashAttention: 可微且自适应的稀疏分层注意力

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

发表机构 * Tsinghua University（清华大学）； Instituto Superior Técnico, Universidade de Lisboa（里斯本大学理工学院）； Instituto de Telecomunicações（电信研究院）； Carnegie Mellon University（卡内基梅隆大学）； Sapienza University of Rome（罗马萨皮恩扎大学）； University of Edinburgh（爱丁堡大学）； TransPerfect（TransPerfect公司）； ELLIS Unit Lisbon（里斯本ELLIS单位）

AI总结本研究提出DashAttention，一种可微且自适应的稀疏分层注意力机制，通过自适应稀疏α-entmax变换选择可变数量的块，从而在保持整个层次结构可微的同时，提升长上下文建模能力，实验表明其在高稀疏度下优于现有方法。

Comments Preprint

详情

AI中文摘要

当前的分层注意力方法，如NSA和InfLLMv2，基于粗粒度注意力得分选择前k个相关键值（KV）块，然后对所选标记应用细粒度softmax注意力。然而，top-k操作假设任何查询的相关标记数量固定，并且阻止了稀疏和密集阶段之间的梯度流动。在本工作中，我们提出了DashAttention（可微且自适应的稀疏分层注意力），它利用自适应稀疏α-entmax变换，在第一阶段根据当前查询选择可变数量的块。这反过来为第二阶段的softmax注意力提供先验信息，保持整个层次结构完全可微。与其他分层注意力方法不同，我们表明DashAttention是非发散的，这导致更好的长上下文建模能力。在大型语言模型（LLMs）上的实验表明，DashAttention在75%的稀疏度下达到与全注意力相当的准确性，并在高稀疏度情况下优于NSA和InfLLMv2，特别是在高稀疏度情况下。我们还提供了一个高效的、GPU-aware的DashAttention实现，在Triton中实现了比FlashAttention-3快超过一倍的推理速度。总体而言，DashAttention提供了一种成本效益高的长上下文建模策略。

英文摘要

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $α$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.

URL PDF HTML ☆

赞 0 踩 0

2605.18747 2026-05-19 cs.CL cs.AI 版本更新

Code as Agent Harness

代码作为代理工具

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Meta ； Stanford University（斯坦福大学）

AI总结本文探讨了代码在代理系统中的作用，提出了一种统一的视角，将代码视为代理基础设施的基础，并讨论了代理工具接口、机制以及扩展到多代理系统的挑战。

Comments GitHub: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers

详情

AI中文摘要

近年来，大型语言模型（LLMs）在理解和生成代码方面展现了强大的能力，从竞争性编程到仓库级别的软件工程。在新兴的代理系统中，代码不再仅仅是目标输出，而是越来越多地作为代理推理、行动、环境建模和基于执行的验证的操作基础。我们通过代理工具的视角来阐述这一转变，并引入“代码作为代理工具”的概念：一种以代码为基础的统一视角，用于代理基础设施。为了系统地研究这一视角，我们围绕三个相连的层次组织了综述。首先，我们研究工具接口，其中代码连接代理到推理、行动和环境建模。其次，我们检查工具机制：计划、记忆和工具使用用于长周期执行，以及反馈驱动的控制和优化，使工具可靠且适应性强。第三，我们讨论将工具从单代理系统扩展到多代理系统，其中共享的代码艺术支持多代理协调、审查和验证。在这些层次中，我们总结了代码作为代理工具的代表性方法和实际应用，涵盖编码助手、GUI/OS自动化、具身代理、科学发现、个性化和推荐、DevOps以及企业工作流程。我们进一步概述了工具工程中的开放挑战，包括评估超越最终任务成功、在不完整反馈下的验证、无回归的工具改进、多个代理之间的一致共享状态、人类监督以确保安全关键行动，以及向多模态环境的扩展。通过将代码视为代理AI的工具，本文为可执行、可验证和具有状态的AI代理系统提供了一条统一的道路。

英文摘要

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

URL PDF HTML ☆

赞 0 踩 0

2605.18732 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

可预测的编造：大型语言模型的事实回忆能力随模型大小和主题频率而增加

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé

发表机构 * International Development Research Centre Canada（加拿大国际发展研究中心）； University of Cape Town（开普敦大学）； Global Center on AI Governance（人工智能治理全球中心）； SnT, University of Luxembourg（卢森堡大学SnT分校）； CITADEL AI Centre of Excellence, Burkina Faso（布基纳法索CITADEL人工智能卓越中心）

AI总结本研究探讨了大型语言模型在事实回忆方面的可预测性，发现模型大小和训练数据中主题频率是影响回忆质量的关键因素，且模型大小和主题频率的组合能解释60%-94%的方差。

Comments 18 pages, 5 figures, 6 tables

2605.18703 2026-05-19 cs.CL cs.LG 版本更新

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory: 通过可执行环境合成和鲁棒强化学习扩展工具使用智能体

Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo

发表机构 * LARK, HKUST (GZ)（LARK，香港科技大学（广州））； University of Cambridge（剑桥大学）； UCL（伦敦大学学院）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结本文提出EnvFactory框架，通过自动合成可执行环境和鲁棒强化学习，解决工具使用智能体扩展中的环境可扩展性和训练数据不足问题，显著提升训练效率和下游性能。

Comments 11 pages

详情

AI中文摘要

通过代理强化学习（Agentic RL）为LLM配备工具使用能力受到两个挑战的限制：缺乏可扩展且稳健的执行环境以及现实训练数据的稀缺性，这些数据无法捕捉隐含的人类推理。现有方法依赖于昂贵的真实世界API、易产生幻觉的LLM模拟器或依赖预收集文档的合成环境，这些环境通常是单轮次或依赖预收集文档。此外，合成轨迹经常过于指定，更像指令序列而非自然人类意图，从而降低了其对强化学习训练的有效性。我们引入EnvFactory，一个完全自动化的框架，解决这两个挑战。EnvFactory自动探索和验证具有状态的可执行工具环境，并通过拓扑感知采样和校准细化合成自然多轮次轨迹，生成具有隐含意图的扎根查询。仅使用7个领域中的85个验证环境，EnvFactory生成2,575个SFT和RL轨迹。尽管使用的环境数量远少于先前工作（通常是5倍），EnvFactory在训练效率和下游性能上均优于现有方法，使Qwen3系列模型在BFCLv3上提升高达+15%，在MCP-Atlas上提升+8.6%，并在对话基准测试中包括τ²-Bench和VitaBench上提升+6%。通过完全自动化环境构建和轨迹合成，EnvFactory为代理强化学习提供了可扩展、可扩展且稳健的基础。

英文摘要

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which are often 5 times more, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $τ^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

URL PDF HTML ☆

赞 0 踩 0

2605.18673 2026-05-19 cs.CY cs.CL 版本更新

Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

生成式AI广告作为可信商业干预问题

Jingyi Qiu, Qiaozhu Mei

发表机构 * University of Michigan（密歇根大学）

AI总结本文探讨生成式AI广告如何通过影响生成过程而非内容放置来改变广告方式，提出基于影响层级的分类体系，并指出当前系统在更深层次商业影响上的可信度问题。

详情

AI中文摘要

主要部署的生成式AI广告系统维持了商业内容与AI生成响应之间的可见边界。然而，实证研究表明，直接嵌入大语言模型（LLM）输出的广告往往未被用户检测到。我们主张生成式AI从根本上改变了广告：而非将产品置于离散槽位，它使生成过程本身受到干预，通过更不明显的渠道产生商业影响。这将生成式AI广告重新定义为可信干预问题而非内容放置问题。我们引入了一个按影响层级组织的分类体系，对应于对越来越深层潜在变量的干预：产品提及、信息框架、行为引导和长期偏好塑造；并展示这些层级如何在不同模态和系统架构中体现，包括检索增强生成和代理流水线，其中上游决策可以显著限制下游结果。主要部署的系统和设计机制集中在最明显且最容易控制的层级，而对用户自主性最根本的商业影响形式仍缺乏理解和检测、测量或披露的框架。核心挑战是生成系统中的商业影响是否可以变得可信，即可追溯、可测量、可质疑且符合用户福祉。

英文摘要

Major deployed generative AI advertising systems preserve a visible boundary between commercial content and AI-generated responses. Yet empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users. We argue that generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels. This reframes generative AI advertising as a problem of trustworthy intervention rather than content placement. We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping; and show how these tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes. Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure. The central challenge is whether commercial influence in generative systems can be made trustworthy, i.e., attributable, measurable, contestable, and aligned with user welfare.

URL PDF HTML ☆

赞 0 踩 0

2605.18663 2026-05-19 cs.AI cs.CL cs.LG 版本更新

GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM：通过整合多个认知领域的任务评估模型

Rohit Patel, Alexandre Rezende, Steven McClain

发表机构 * Meta Superintelligence Labs（Meta超智能实验室）

AI总结本文提出GIM基准测试，通过整合多个认知领域的任务来评估模型，其核心方法是设计820个原创问题，结合广泛的知识和多种认知操作，从而保持推理在现实任务中的基础性，同时通过2PL IRT模型校准能力估计，发布涵盖22个模型和47种测试配置的综合排行榜，并深入研究了测试时计算与模型能力之间的权衡。

Comments 56 pages, 27 figures, 4 tables. Code: https://github.com/facebookresearch/gim ; Dataset: https://huggingface.co/datasets/facebook/gim

详情

AI中文摘要

随着LLM基准测试趋于饱和，评估社区已采取两种策略来提高难度：提升知识需求（GPQA，HLE）或完全去除知识而采用抽象推理（ARC-AGI）。前者将记忆混淆为能力，后者使推理脱离实际应用背景。我们采取了不同的方法。Grounded Integration Measure（GIM）是一个包含820个原创问题（615个公开问题，205个私有问题）的基准测试，其中难度来自于整合；每个问题都需要协调多种认知操作（约束满足、状态跟踪、知识警惕、受众校准）在广泛可获取的知识上，从而保持推理在现实任务中而不依赖专门的专家知识。每个问题都是原创专家撰写的组成，大多数有基于评分标准分解的评分（中位数6个独立判断的准则）。一个平衡的公开-私有划分提供了内置的污染诊断。我们校准了一个连续响应的2参数逻辑（2PL）IRT模型，超过200,000个提示-响应对，覆盖28个模型，产生稳健的能力估计，即使在原始准确率被错误或缺失数据扭曲的情况下，也能正确排序测试配置，解决了基准报告中的常见挑战。使用这一框架，我们发布了一个涵盖22个模型和47种测试配置的综合排行榜（独特的模型和思考级别对），并进行了迄今为止最广泛的已发表研究，探讨在固定基准上测试时计算与模型能力之间的权衡：11个模型在35种测试配置中被扫过。我们观察到，家庭内部配置选择，如思考预算和量化，与模型选择一样重要。我们发布了评估框架、校准的IRT参数和所有公开问题。

英文摘要

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public--private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

URL PDF HTML ☆

赞 0 踩 0

2605.18648 2026-05-19 cs.LG cs.AI cs.CL 版本更新

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

对软标签学习和校准中人类与模型不确定性的评估

Maja Pavlovic, Silviu Paun, Massimo Poesio

发表机构 * Queen Mary University London（伦敦女王玛丽大学）； Amazon（亚马逊）； University of Utrecht（乌得勒支大学）

AI总结本文通过对比人类和模型标签在软标签学习中的效果，发现人类标签不仅提升了模型准确性，还通过正则化作用改善了模型在困难样本上的校准和训练稳定性。

详情

AI中文摘要

人类对齐的人工智能的核心在于理解人类提取的标签相对于合成标签的优势。虽然人类软标签通过捕捉不确定性来提高校准，但先前研究将这些好处与隐含的错误标签修正（模式偏移）混淆了，从而掩盖了软标签的真实效果。我们对MNIST和一个合成变体上的软标签学习进行了受控审计，重新标注子集以提取人类不确定性。通过将软标签监督与底层标签模式偏移解耦，我们发现虽然人类软标签确实提供了准确性提升，但其更大的价值在于作为正则化器，改善模型在困难样本上的校准并促进训练运行中的稳定收敛。数据集制图显示，训练于人类软标签的模型能反映人类不确定性，而训练于合成标签的模型则无法与人类对齐。广泛而言，这项工作提供了一个用于人类-人工智能不确定性对齐的诊断测试平台。

英文摘要

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.

URL PDF HTML ☆

赞 0 踩 0

2605.18607 2026-05-19 cs.CL cs.LG 版本更新

Forecasting Downstream Performance of LLMs With Proxy Metrics

通过代理指标预测大语言模型的下游性能

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

发表机构 * Mila – Quebec AI Institute & McGill University（魁北克AI研究院与麦吉尔大学）； CIFAR AI Chair（CIFAR人工智能主席）； ServiceNow Research Periodic Labs（ServiceNow研究周期实验室）

AI总结本文提出通过聚合候选模型的下一个token分布中的token级统计信息（如熵、top-k准确率和专家token排名）来构建代理指标，以更准确地预测大语言模型的下游性能，优于传统的损失和计算量基线方法。

Comments Preprint. 31 pages

详情

AI中文摘要

语言模型的发展进步往往由比较决策驱动：选择哪种架构、哪种预训练语料库或哪种训练配方。做出这些决策需要可靠的性能预测，但常用的两个信号从根本上受到限制。交叉熵损失与下游能力不匹配，而直接下游评估成本高、稀疏且在早期训练阶段信息有限。相反，我们提出通过聚合候选模型的下一个token分布中的token级统计信息（如熵、top-k准确率和专家token排名）来构建代理指标。在三个设置中，我们的代理指标始终优于基于损失和计算量的基线方法：1）在跨家族模型选择中，它们对异质推理模型的排名平均Spearman Rho为0.81（与交叉熵损失的Rho为0.36相比）；2）在预训练数据选择中，它们能以大约10,000倍更低的计算成本可靠地对25个候选语料库进行排名，推动帕累托前沿超越现有方法；3）在训练时间预测中，它们在18倍计算范围内预测下游准确性时，误差大约是现有方法的一半。这些结果表明，专家轨迹是评估模型能力广泛有用的信息源，使整个模型开发生命周期中的性能预测变得可靠。

英文摘要

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.

URL PDF HTML ☆

赞 0 踩 0

2605.18583 2026-05-19 cs.SE cs.AI cs.CL cs.CR 版本更新

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

过度激进的编码代理：在良性任务中测量超出范围的动作

Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng, Yuekang Li, Leo Yu Zhang, Yi Liu

发表机构 * Griffith University（格里菲斯大学）； Wake Forest University（威克森林大学）； Nanyang Technological University（南洋理工大学）； University of New South Wales（新南威尔士大学）； Quantstamp

AI总结本文提出OverEager-Gen基准，用于评估编码代理在良性任务中超出范围的行为。研究发现，当基准中明确列出授权范围时，代理会停止推断边界并开始匹配声明文本，从而影响测量有效性。通过行为梯度验证器和双通道堆栈审计内部工具调用，验证了不同框架下的过度激进行为差异。

详情

AI中文摘要

编码代理现在能够自主运行，拥有shell、文件和网络权限。当用户发出良性请求时，代理有时会做更多事情：删除无关文件、清除过期凭证备份，或重写用户从未提及的配置。我们将这些超出范围的行为称为过度激进动作，这是一种与能力失败、提示注入或沙盒逃逸不同的授权问题。我们提出了OverEager-Gen，一个专门用于评估过度激进行为的基准。构建该基准暴露了一个测量有效性问题：如果基准在提示中明确列出授权范围，代理会停止推断边界并开始匹配声明文本。在Claude Code上，仅去除同意声明，配对场景中的过度激进率从0.0%增加到17.1%（McNemar精确p=2.4×10^-4）。OverEager-Gen因此在通过前使用行为梯度验证器验证每个场景的区分能力，通过双通道堆栈审计内部工具调用（PATH注入的 shim 加上每个代理的事件流），并发布字节完全一致的同意保留和同意去除变体。OverEager-Bench包含500个验证的场景和四个代理产品（Claude Code、OpenHands、Codex CLI、Gemini CLI）和六个基础模型的约7,500次运行；50个重新注释样本给出Cohen's kappa=0.73和规则判断召回率=1.00。去除同意声明在每个共享基础模型上都增加了过度激进率（Delta在[11.9,17.2]pp）。框架轴主导效应大小：一个宽松的集群（Claude Code、Codex CLI、Gemini CLI）运行在5.4-27.7%之间，而询问继续框架（OpenHands）位于0.2-4.5%（Fisher p<=10^-5）。在框架内基础模型方差达到15.9pp，表明模型层对齐并未完全通过宽松权限门控传播。

英文摘要

Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign tasks. Building it surfaces a measurement-validity issue: if a benchmark spells out the authorized scope inside the prompt, the agent stops inferring boundaries and starts pattern-matching declaration text. On Claude Code, stripping the consent declaration alone raises the overeager rate from 0.0% to 17.1% on paired scenarios (McNemar exact p = 2.4 x 10^-4). OverEager-Gen therefore certifies each scenario's discriminative power before admission via a behavioral-gradient validator, audits internal tool calls through a dual-channel stack (PATH-injected shim plus per-agent event streams), and ships byte-identical consent_kept and consent_stripped variants. OverEager-Bench contains 500 validated scenarios and ~7,500 runs across four agent products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and six base models; a 50-sample re-annotation gives Cohen's kappa = 0.73 and rule-judge recall = 1.00. Stripping consent multiplies the overeager rate on every shared base model (Delta in [11.9, 17.2] pp). The framework axis dominates effect size: a permissive cluster (Claude Code, Codex CLI, Gemini CLI) runs at 5.4-27.7% while the ask-to-continue framework (OpenHands) sits at 0.2-4.5% (Fisher p <= 10^-5). Within-framework base-model variance reaches 15.9 pp, indicating that model-layer alignment does not fully propagate through permissive permission gating.

URL PDF HTML ☆

赞 0 踩 0

2605.18572 2026-05-19 cs.CL 版本更新

MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

MA$^{2}$P: 一种用于复杂说服的元认知自主智能体框架

Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, Deyu Zhou

发表机构 * School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China（计算机科学与工程学院、计算机网络与信息集成重点实验室、教育部、东南大学，中国）； Department of Informatics, King’s College London（信息学院、伦敦国王学院）

AI总结本文提出MA$^{2}$P框架，通过自主多智能体架构和元认知配置器，解决复杂说服中说服者需解读内部状态、推断潜在心理状态并生成策略一致行动的挑战，提升说服成功率。

Comments 22 pages, 8 figures. Accepted to Findings of ACL 2026

详情

AI中文摘要

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

英文摘要

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.18567 2026-05-19 cs.CL cs.LG 版本更新

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

GUT-IS: 一种数据驱动的方法，用于整合信息系统的构念及其关系

Maximilian Reinhardt, Jonas Scharfenberger, Burkhardt Funk

发表机构 * Institute of Information Systems（信息系统研究所）

AI总结本文提出了一种数据驱动的方法，通过结合任务适应的文本嵌入和聚类技术，生成构念分组候选集，并利用显式权衡语义纯度和聚类数量简洁性的损失函数选择最优解，从而分析构念分组及其关系在优先级从纯度转向简洁性时的变化。

Comments Accepted at the 34th European Conference on Information Systems (ECIS 2026), Milan, Italy

2605.18563 2026-05-19 cs.CL 版本更新

Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

读者对‘噪声通道花园路径’句子中的合理错误进行定向回归

Thomas Hikaru Clark, Roger Levy, Edward Gibson

发表机构 * MIT Brain and Cognitive Sciences（麻省理工学院脑与认知科学）

AI总结研究探讨了读者在处理‘噪声通道花园路径’句子时如何通过后续信息定位可能的错误，揭示了阅读动态中的定向回归现象及对噪声通道语言理解理论的影响。

2605.18549 2026-05-19 cs.CL cs.CR 版本更新

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

监控内部独白：探测轨迹揭示推理动态

Maciej Chrabąszcz, Aleksander Szymczyk, Marcin Sendera, Tomasz Trzciński, Sebastian Cygert

发表机构 * NASK - National Research Institute（国家研究 institute）； Faculty of Electronics and Information Technology（电子与信息技术学院）； Faculty of Mathematics and Computer Science（数学与计算机科学学院）； Jagiellonian University（雅盖隆大学）； Tooploox ； IDEAS Research Institute（IDEAS研究所）； Gdańsk University of Technology（格但姆大学）

AI总结本文研究了大型推理模型的内部推理动态，通过分析探测轨迹来提高监控可靠性，提出了一种基于信号处理特征的方法，显著提升了未来模型状态的分离能力。

详情

AI中文摘要

大型推理模型（LRMs）通过其思维链（CoT）推理引入了安全监控的新机遇。然而，CoT并不总是忠实于模型的最终输出，削弱了其作为监控工具的可靠性。为了解决这个问题，我们研究了LRMs的隐藏表示，以确定是否可以从提示和CoT表示中预测未来行为。通过在每个生成的token上评估探测器，我们构建了探测轨迹，即概念概率在推理过程中的连续演变。我们发现，当在完整的轨迹上检查时，未来模型行为比从单一静态预测中更易于区分。为了表征这些时间动态，我们提取了信号处理特征，捕捉波动性、趋势和稳态行为，显著提高了未来模型状态的分离。我们还提出了两种方法论见解。首先，基于模板的训练数据在动态生成模型响应方面几乎达到同等水平，消除了对昂贵初始推理和标记的需要。其次，池化操作的选择至关重要：平均池化和最后token方法退化到接近随机的性能，而最大池化在95%的AUROC上取得优异成绩，并产生稳定的探测轨迹。使用四个数据集和四个跨安全和数学领域的推理模型，我们证明轨迹特征编码了任务特定的动态，提高了结果分离度。这些发现确立了探测轨迹作为监控LRM行为的互补框架。

英文摘要

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.

URL PDF HTML ☆

赞 0 踩 0

2605.18548 2026-05-19 cs.CL cs.AI 版本更新

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

STT-Arena：一种更现实的工具使用环境，包含时空动态

Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao

发表机构 * Hong Kong Institute of AI for Science, City University of Hong Kong（香港人工智能科学研究院，香港城市大学）； Department of Data Science, City University of Hong Kong（数据科学系，香港城市大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； Li Auto Inc.（李汽有限公司）； Independent Researcher（独立研究者）

AI总结本文提出STT-Arena基准测试，旨在评估大型语言模型在面对时空动态变化时的适应性规划能力，发现现有模型在处理此类动态问题时存在显著不足，并提出改进方法STT-Agent-4B以提升性能。

Comments Work in progress

详情

AI中文摘要

大型语言模型（LLMs）在现实世界中的代理应用中必须能够重新规划和适应，当任务中途中断时推翻其先前决策。现有的动态基准主要测量LLMs是否能够及时检测时间变化，留下适应时空动态的互补挑战未被探索。我们介绍了STT-Arena（Spatio-Temporal Tool-Use Arena），一个包含227个高质量交互任务的基准测试，涵盖九种时空冲突类型和四种可解性级别。每个任务都基于一个现实、可执行的环境，配备注入的时空触发器，可以突然使正在进行的计划失效，迫使模型检测状态变化并构建修订的执行策略。对前沿LLMs的广泛评估显示，即使是最先进的专有模型，如Claude-4.6-Opus，也只达到低于40%的总体准确率，突显了时空动态推理的根本难度。对失败轨迹的系统分析揭示了现有模型的三种反复出现的错误模式：停滞状态执行、动态触发器的误诊断和缺失的适应后验证。基于这些发现，我们提出了一种迭代轨迹细化技术，消除这些失败模式，结合在线强化学习，产生STT-Agent-4B，其在STT-Arena上优于前沿LLMs。

英文摘要

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary models, including Claude-4.6-Opus, achieves less than 40\% overall accuracies, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B which outperforms frontier LLMs on STT-Arena.

URL PDF HTML ☆

赞 0 踩 0

2605.18530 2026-05-19 cs.CL cs.AI cs.LG stat.ML 版本更新

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

连续扩散在语言领域中能与离散扩散竞争性地扩展

Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo, Yongxin Chen, Arash Vahdat, Morteza Mardani, John Thickstun

发表机构 * NVIDIA & Cornell（NVIDIA与康奈尔大学）； NVIDIA & Georgia Tech（NVIDIA与佐治亚理工学院）； UW-Madison（威斯康星大学麦迪逊分校）； MBZUAI-IFM（梅兰德大学-IFM）； Cornell（康奈尔大学）

AI总结本文研究了连续扩散模型在语言建模中的扩展能力，通过改进Plaid模型构建RePlaid，证明连续扩散模型在计算效率和性能上可与离散模型竞争，并提供了理论支持。

详情

AI中文摘要

尽管扩散模型近期在语言建模领域受到广泛关注，但连续扩散模型在扩展性方面似乎不如离散方法。为了挑战这一观点，我们重新审视Plaid，一种基于似然的连续扩散语言模型（DLM），并构建RePlaid，通过将Plaid的架构与现代离散DLMs对齐。在统一的设定下，我们建立了第一个连续DLMs的扩展定律，表明RePlaid的计算差距仅为自回归模型的20倍，使用更少的参数优于Duo，并在过训练范围内优于MDLM。我们将RePlaid与最近的连续DLMs进行基准测试：在OpenWebText上，RePlaid实现了连续DLMs中的新状态-of-the-art PPL界值为22.1，并在生成质量上更优。这些结果表明，当通过似然训练时，连续扩散是与离散DLMs高度竞争且可扩展的替代方案。此外，我们提供了理论见解以理解基于似然训练的优势。我们展示了优化噪声调度以最小化ELBO的方差自然会得到时间上的线性交叉熵（信息损失）。这均匀地分配去噪难度，而无需任何特定时间的重参数化。此外，我们发现通过似然优化嵌入会创建结构化的几何形状并驱动最大的似然增益。

英文摘要

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

URL PDF HTML ☆

赞 0 踩 0

2605.18512 2026-05-19 cs.CL 版本更新

Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

更容易判断而非寻找：预测上下文学习成功以选择演示

Haochun Wang, Chaofen Yang, Jiatong Liu, Jingbo Wang, Zewen Qiang, Sendong Zhao, Bing Qin, Ting Liu

发表机构 * Research Center for Social Computing（社会计算研究中心）； Interactive Robotics, Harbin Institute of Technology, China（交互机器人，哈尔滨工业大学，中国）

AI总结本文提出DiSP框架，通过分层判断和轻量路由模型，预测上下文学习中演示选择的成功率，从而在多个数据集上实现更高的准确率和更快的推理速度。

Comments ICML 2026

详情

AI中文摘要

上下文学习（ICL）对提示中出现的演示非常敏感，但选择演示是昂贵的，因为可能的演示上下文和组合空间极大。我们主张演示选择是'更容易判断而非寻找'：预测特定查询-上下文对（q,D）是否成功比搜索最优D*更便宜且更通用。基于这一见解，我们提出DiSP，一种样本和判断框架，通过分层查询难度进行分类。DiSP运行随机演示试验以估计每个训练查询的成功率，训练轻量级路由器预测查询难度，并训练针对特定层次的判断器对采样演示进行判断。在推理时，DiSP在显式预算下执行停止接受判断，当未找到合适上下文时会发出诊断风险标签。在五个分类数据集上使用Llama 3-8B和Qwen 2.5-7B，DiSP实现了最佳的平均准确率，比强学习选择基线提高了最高3.4%，同时实现了高达23倍的端到端时间加速。

英文摘要

In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is \emph{easier to judge than to find}: predicting whether a specific query--context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama~3--8B and Qwen~2.5--7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4\%, while achieving up to $23\times$ end-to-end wall-clock speedup.

URL PDF HTML ☆

赞 0 踩 0

2605.18504 2026-05-19 cs.CL 版本更新

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

古希腊语到现代希腊语机器翻译：一种新的基准和对LLM和NMT模型的微调实验

Spyridon Mavromatis, Sokratis Sofianopoulos, Prokopis Prokopidis, Maria Giagkou

发表机构 * National and Kapodistrian University of Athens, Department of Informatics and Telecommunications（雅典国家和卡普迪斯特里亚大学信息与电信系）； Institute for Language and Speech Processing, Athena RC（语言与语音处理研究所，雅典RC）

AI总结本文提出了一种新的基准测试，并对LLM和NMT模型进行了微调实验，以解决古希腊语到现代希腊语的低资源机器翻译问题，展示了微调在提升翻译性能上的显著效果。

Comments 14 pages. Accepted for presentation at the 15th Language Resources and Evaluation Conference (LREC 2026), Palma, Mallorca, Spain

详情

DOI: 10.63317/4cdk64dgm2w9
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 8685-8698. European Language Resources Association (ELRA)

AI中文摘要

机器翻译（MT）在古希腊语（AG）到现代希腊语（MG）之间的任务是一个低资源任务，受到大规模高质量平行数据缺乏的限制。我们通过引入AG-MG平行语料库来填补这一空白，该语料库包含132,481个句子对，来源于文学、历史和圣经文本。我们提出了一种新的语料库创建流水线，结合了网络爬取的片段级数据和多阶段的句子级对齐和精修过程。我们的方法使用VecAlign与LaBSE嵌入，首先在手动对齐的AG-MG子集中进行微调，然后使用Gemini 2.5 Flash进行LLM基于的错误/对齐修正阶段，以确保高质量的对齐。此外，我们提供了对现代MT模型在该任务上的首次全面基准测试，评估了三种微调策略在NMT模型（NLLB、M2M100）和希腊LLM（Llama-Krikri-8B）上的表现。我们的实验表明，微调在基模型上带来了显著的性能提升，最高可增加10.3个BLEU分数。具体而言，Llama-Krikri-8B的全参数微调实现了最高整体性能，BLEU得分为13.16，而经过QLoRA调整的M2M100-1.2B模型展示了最大的相对增益和具有竞争力的结果。我们的数据集和模型对希腊NLP做出了重要贡献。

英文摘要

Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment, and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually-aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.

URL PDF HTML ☆

赞 0 踩 0

2605.18500 2026-05-19 cs.CL 版本更新

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

隐式层次GRPO：将工具调用与执行解耦以实现工具集成的数学推理

Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu, Jiajun Chai, Wei Lin, Guojun Yin

发表机构 * Meituan（美团）； Tsinghua University（清华大学）

AI总结本文提出了一种将工具调用与执行解耦的方法，通过引入延迟执行和显式控制来增强工具集成推理能力，并提出了一个分层控制框架和理论推导出的替代损失函数，从而得到隐式分层策略，最终提出IH-GRPO算法，在六个跨领域数学推理基准测试中，Qwen3-1.7B、Qwen3-4B和Qwen3-8B在最强基线方法上分别实现了1.87%、2.16%和2.53%的绝对提升。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地利用工具调用来增强其推理能力。然而，现有方法通常紧密耦合工具调用与即时执行。这种即时工具交互可能会破坏LLMs的推理连贯性并限制其表达能力，最终降低推理性能。为此，我们首次提出并形式化了在推理过程中解耦工具调用与执行的问题，并引入延迟执行与显式控制以增强工具集成推理（TIR）。此外，我们提出了一种分层控制框架，并理论推导出一个替代损失函数，使隐式分层策略能够学习等同于显式分层策略的行为，从而得到所提出的IH-GRPO算法。在六个跨领域数学推理基准测试中，IH-GRPO在Qwen3-1.7B、Qwen3-4B和Qwen3-8B上分别实现了1.87%、2.16%和2.53%的绝对提升，同时在其他领域也产生了持续的性能提升。我们的代码可在https://github.com/Lumina04/IH-GRPO-01上获得。

英文摘要

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. Extensive experiments on IH-GRPO achieve absolute improvements of 1.87\%, 2.16\%, and 2.53\% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.

URL PDF HTML ☆

赞 0 踩 0

2605.18490 2026-05-19 cs.CL cs.IR 版本更新

Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research

向量RAG与LLM编写的维基：在小型多领域研究上的预注册比较

Theodore O. Cochran

发表机构 * AI for Altruism (A4A)（AI为利他主义（A4A））

AI总结本文通过预注册比较，研究了向量RAG系统与LLM编写的维基在小型多领域研究中的表现，发现两者在跨论文连接、答案组织和成本等方面各有优劣，没有单一系统在所有指标上最优。

详情

AI中文摘要

我们预注册了两种帮助LLM回答问题的方法在小型研究语料库上的比较：单轮向量RAG系统和LLM编写的markdown维基。两种系统使用相同的回答生成模型回答了24篇论文中的13个问题，其答案由盲审的LLM法官评分。维基在连接不同论文的发现方面表现更好，但其在答案组织方面的优势在法官调整后并不显著。RAG通过预注册测试满足了单事实查找问题的要求。干净的查询侧成本结果与预期的维基优势相悖：在测试设置下，维基使用了比RAG更多的查询令牌，因此无法通过更便宜的查询来回收前期构建成本。两个探索性分析改变了我们对结果的解读。首先，按主张层面的引用检查更倾向于维基：其引用的页面更常支持所陈述的精确主张，尽管RAG在整体可信度评分上表现更好。其次，基于分解的RAG变体在较低的LLM令牌成本下恢复了维基在跨论文综合方面的大部分优势，但未能恢复维基在按主张引用支持方面的优势。主要结论是，基于事实的研究综合并非单一能力。系统在组织证据、引用支持每个主张的能力以及运行成本方面可能有所不同。在本研究中，没有架构在所有三个指标上最优。

英文摘要

We preregistered a comparison of two ways to help an LLM answer questions over a small research corpus: a single-round Vector RAG system and an LLM-compiled markdown wiki. Both systems answered the same 13 questions over 24 papers using the same answer-generating model, and their answers were scored by blinded LLM judges. The wiki scored much better at connecting findings across papers, but its advantage in answer organization was not strong after judge adjustment. RAG met the preregistered test for single-fact lookup questions. The clean query-side cost result went against the expected wiki advantage: under the tested setup, the wiki used far more query tokens than RAG, so it could not recover any upfront build cost through cheaper queries. Two exploratory analyses changed how we interpret the result. First, claim-level citation checking favored the wiki: its cited pages more often supported the exact claims being made, even though RAG scored better on the overall groundedness rubric. Second, a decomposition-based RAG variant recovered most of the wiki's advantage on cross-paper synthesis at lower LLM-token cost, but it did not recover the wiki advantage in claim-by-claim citation support. The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run. In this study, no architecture was best on all three.

URL PDF HTML ☆

赞 0 踩 0

2605.18462 2026-05-19 cs.CL 版本更新

From BERT to T5: A Study of Named Entity Recognition

从BERT到T5：一项命名实体识别研究

Mei Jia

发表机构 * University of Manchester（曼彻斯特大学）

AI总结本文研究了在预训练模型BERT和T5上进行命名实体识别任务，比较了两种模型在序列标注任务中的性能，并分析了常见错误和超参数对模型表现的影响。

Comments 11 pages, 9 figures

2605.18352 2026-05-19 cs.CL 版本更新

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

条件句中的预设与推理：人类与LLM的理论研究

Tara Azin, Yongan Yu, Raj Singh, Olessia Jouravlev

发表机构 * Department of Cognitive Science, Carleton University（卡尔顿大学认知科学系）； School of Computer Science, McGill University（麦吉尔大学计算机科学学院）； Mila – Quebec AI Institute（魁北克人工智能研究所）

AI总结本文通过对比人类判断和LLM预测，研究条件句中预设投影的机制，发现人类结合概率和语用线索进行判断，而LLM存在不一致的对齐模式，表明LLM的表现可能源于表面模式匹配而非语用能力。

Comments To appear in the Proceedings of CoNLL 2026, colocated with ACL 2026

详情

AI中文摘要

条件句中的预设投影在意义理论和语用学中至关重要，但在大型语言模型中仍未得到充分评估。我们通过平行行为研究填补这一空白，比较人类判断和LLM预测在规范化的条件句数据集上的表现，该数据集控制了前件与投影预设之间的关系。我们收集了120名参与者和四个LLM在匹配的上下文中对可能性评分。结果表明，人类在判断中整合了概率和语用线索，而LLM在对齐人类模式上表现不一致。使用一个语言学驱动的检查清单在LLM作为判断者框架内，我们进一步评估模型推理。我们发现最符合人类评分的模型往往缺乏连贯的语用推理，而推理能力更强的模型产生更不人类化的判断。这些发现表明，LLM在这些任务上的表现可能源于表面模式匹配而非语用能力。我们的发现强调了基于语言学理论的基准测试在比较人类和模型时的重要性。

英文摘要

Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgment, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs' performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.

URL PDF HTML ☆

赞 0 踩 0

2605.18337 2026-05-19 cs.CL 版本更新

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Infini-News：13亿篇Common Crawl新闻文章的高效可查询访问

Ruggero Marino Lazzaroni, Jana Lasser, Kirill Solovev

发表机构 * Common Crawl Foundation（Common Crawl基金会）

AI总结本文提出Infini-News，通过构建全文索引和检索工具，为研究者提供了13亿篇Common Crawl新闻文章的高效访问方式，包括文本清洗、多语言检测、地理归属和高效检索功能，降低了纵向跨国媒体研究的门槛。

详情

AI中文摘要

大规模新闻语料库支持计算社会科学和自然语言处理中的广泛研究，但访问仍然受限：商业档案施加了昂贵的成本和许可限制，而开放替代方案如Common Crawl的CC-News需要TB级存储和计算密集型处理。我们提出了Infini-News，一个用于整个CC-News档案（2016年8月至最新可用快照）的检索工具包和索引。我们的贡献有三方面：首先，我们提取、清洗超过13.5亿篇文章的文本，并解析结构化元数据。其次，我们通过三种前沿语言分类器（GlotLID、lingua和CommonLingua）丰富语料库，并通过多源地理归属确定83.4%的文章的国家来源（涵盖222个国家）。第三，我们构建了Infini-gram索引：后缀数组结构，使研究者能够以亚秒级时间在全文档案中搜索任意文本模式。这些资源降低了纵向、跨国媒体研究的门槛。

英文摘要

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.

URL PDF HTML ☆

赞 0 踩 0

2605.18299 2026-05-19 cs.AI cs.CL cs.IR 版本更新

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-Search: 用于搜索增强推理的在线策略 hindsight 自监督学习

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology（快手科技）

AI总结本文提出SD-Search，一种基于在线策略hindsight自监督学习的搜索增强推理方法，通过自身策略生成细粒度监督信号，无需外部教师模型或额外标注。

详情

AI中文摘要

搜索增强推理代理将内部推理与外部检索器的调用交替进行，其性能依赖于每次发出的查询质量。然而，在基于结果奖励的强化学习中，每个搜索决策在展开过程中共享同一轨迹级奖励，使个体查询缺乏步级信用。最近的过程监督方法通过从政策外部获取步级信号来解决这一差距，依赖于一个更大的教师模型或由更强的外部系统生成的子问题注释。相比之下，我们提出了SD-Search，通过在线策略的hindsight自监督学习自身生成步级监督，无需外部教师或额外标注。在SD-Search中，一个模型扮演两个角色：学生只看到推理时可用的上下文，而教师还根据一个紧凑的hindsight块总结了搜索查询和一组从同一问题采样的展开的最终结果。由于教师知道每个展开的展开过程和哪些成功，其查询分布隐含地标记了哪些决策值得做出，学生通过最小化token级的Jensen-Shannon散度来恢复这种行为。这在GRPO的粗粒度轨迹奖励上叠加了密集的步级信号。关键的是，这个信号由策略本身在标准RL训练循环中生成，无需外部模型推理、辅助标注流程或额外的训练阶段。

英文摘要

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen--Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.

URL PDF HTML ☆

赞 0 踩 0

2605.18261 2026-05-19 cs.CL 版本更新

上下文记忆用于高效长上下文生成

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

发表机构 * Institute of Science Tokyo, Japan（东京科学研究所）； Imperial College London, UK（伦敦帝国学院）

AI总结本文提出了一种无需训练的上下文记忆方法，通过将前缀外部化为轻量级的预计算注意力状态查找表，以提高长上下文生成的准确性和效率，同时减少注意力计算的延迟。

详情

AI中文摘要

现代大型语言模型（LLM）应用越来越多地依赖长前缀来在推理时控制模型行为。尽管增强前缀的推理是有效的，但存在两个结构限制：i）随着生成过程的进行，前缀的影响逐渐减弱；ii）对前缀的注意力计算与长度成线性关系。现有方法要么在注意力中保留前缀同时压缩它，要么通过梯度训练将它内部化到模型参数中。前者在推理时仍然会关注到前缀，而后者训练成本高且不适合前缀更新。为了解决这些问题，我们提出了注意力状态记忆，这是一种无需训练的方法，将前缀外部化为一个轻量级的预计算注意力状态的查找表。在ManyICLBench上使用LLaMA-3.1-8B，我们的方法在1K-8K内存预算下比上下文学习提高了准确性，同时在8K时将注意力延迟减少了1.36倍，并在NBA基准测试中仅使用其内存足迹的20%就超过了全注意力RAG性能。

英文摘要

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint.

URL PDF HTML ☆

赞 0 踩 0

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph 版本更新

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

SIREM: 语音引导的MRI重建与学习采样

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg（埃森哲-埃尔朗根-纽伦堡大学模式识别实验室）； Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg（埃尔朗根大学医院放射学研究所）； Institut für Informationsverarbeitung, Leibniz Universität Hannover（汉诺威莱比锡大学信息处理研究所）； Department of Radiology, Harvard Medical School and Massachusetts General Hospital（哈佛医学院放射科和麻省总医院）

AI总结本文提出了一种语音引导的MRI重建框架SIREM，通过同步语音作为跨模态先验，利用语音与声音学之间的相关性预测图像内容，从而在更高的吞吐量下实现更合理的解剖结构重建。

详情

AI中文摘要

实时磁共振成像（rtMRI）在语音生产中的应用能够非侵入性地可视化动态声带运动，对语音科学和临床评估具有价值。然而，rtMRI本质上受到空间分辨率、时间分辨率和获取速度之间的权衡限制，常常导致k空间测量不足和重建质量下降。我们提出SIREM，一种利用同步语音作为跨模态先验的MRI重建框架。核心思想是语音期间的声带配置与产生的声音学相关，使图像部分内容可从音频预测。SIREM将每帧建模为音频驱动组件和MRI驱动组件的融合，通过空间加权图。音频分支从语音预测发音器相关结构，而MRI分支从测量的k空间数据重建互补内容。我们进一步引入了可学习的软加权轮廓，使螺旋臂的使用与语音引导融合的交互研究可微分。这产生了一个统一的多模态公式，结合了音频驱动预测、MRI重建和采样适应。我们在USC语音rtMRI基准上评估了SIREM，与标准基线（包括栅格、基于小波的压缩感知和总变分）进行比较。SIREM引入了一种语音引导的重建范式，在比迭代方法高得多的吞吐量下运行，同时保持解剖上合理的声带结构。这些结果为多模态语音引导的rtMRI重建建立了初步基准，并突显了同步语音作为快速重建辅助先验的潜力。源代码可在https://github.com/mdhasanai/SIREM获取。

英文摘要

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM

URL PDF HTML ☆

赞 0 踩 0

2605.18211 2026-05-19 cs.CL cs.AI 版本更新

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

利用图结构在序列到序列模型中进行知识图谱链接预测

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

发表机构 * Analytic Computing, KI, University of Stuttgart, Stuttgart, Germany（斯图加特大学分析计算研究所）； Bosch Center for Artificial Intelligence, Stuttgart, Germany（博世人工智能中心）； WAIS, University of Southampton, United Kingdom（南安普顿大学WAIS）

AI总结本文提出了一种结合图结构的序列到序列模型GA-S2S，通过整合T5-small编码器解码器与关系图注意力网络RGAT，提升知识图谱链接预测的性能。

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10--14, 2026, Dubrovnik, Croatia

详情

AI中文摘要

我们介绍了图增强的序列到序列（GA-S2S）框架，这是一种新的框架，将T5-small编码器解码器与关系图注意力网络（RGAT）相结合，以提高知识图谱的链接预测。虽然现有的序列到序列模型仅依赖于实体和关系的表面描述，并且在最理想的情况下，将查询实体的邻居扁平化为一个线性序列，从而丢弃了内在的图结构，GA-S2S联合编码文本特征和查询实体周围的完整k跳子图拓扑。通过将原始编码器输出与RGAT的关系感知嵌入相结合，我们的模型捕捉并利用了更丰富的多跳关系模式和文本信息。在CoDEx数据集上的初步实验表明，GA-S2S在链接预测准确性上优于竞争的序列到序列基线模型，达到了高达19%的相对增益。

英文摘要

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. While existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure, GA-S2S jointly encodes both textual features and the full $k$-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19\% relative gain in link prediction accuracy.

URL PDF HTML ☆

赞 0 踩 0

2605.18181 2026-05-19 cs.AI cs.CL 版本更新

利用语音识别洞察和迁移的特征

Linas Nasvytis, Judith E. Fan

发表机构 * Department of Psychology, Stanford University（斯坦福大学心理学系）； Graduate School of Education, Stanford University（斯坦福大学教育研究生院）； Department of Computer Science, Stanford University（斯坦福大学计算机科学系）

AI总结研究通过语音分析探讨解决问题过程中洞察力的表现形式及其对后续问题解决的影响，发现可迁移的洞察力更容易被口头描述。

详情

AI中文摘要

许多问题似乎需要瞬间的灵感来解决。这些突如其来的灵感是什么形式，它们如何影响人们未来解决类似问题的方式？在本研究中，我们要求189名参与者在解决一系列五个“火柴算术”问题时进行think-aloud。这些问题要么都依赖同一种非显而易见的解决方案（Same组），要么每次都不同（Different组）。我们的第一个观察是Same组参与者进步更快 than Different组参与者。然后我们利用自然语言处理技术分析参与者的语音，发现Same组参与者进步加快的同时，他们在说话量和说话内容上也发生了变化。特别是，他们更可能自发地标注他们正在解决的问题类型。综合来看，这些发现表明，可迁移的洞察力的一个标志是其对口头报告的可访问性，即使其背后的洞察力前因仍难以描述。

英文摘要

Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to think aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). Our first observation was that Same participants improved more rapidly than Different participants. We then leveraged techniques from natural language processing to analyze participants' speech, and found that this accelerated improvement for Same participants was accompanied by changes in both how much they spoke and what they said. In particular, they were more likely to spontaneously label the kind of problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.

URL PDF HTML ☆

赞 0 踩 0

2605.10843 2026-05-19 cs.CL cs.AI cs.CY 版本更新

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

通过人格分歧实现无训练的文化对齐大语言模型

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham, Nguyen Lam Phu Quy, The Anh Han, Long Tran-Thanh

发表机构 * Faculty of Information and Technology, University of Science, Vietnam National University（信息与技术学院，科学大学，越南国家大学）； Department of Computer Science, University of Warwick（计算机科学系，沃里克大学）； School of Computing, Engineering and Digital Technologies, Teesside University（计算、工程和数字技术学院，泰赛德大学）

AI总结本文提出DISCA方法，在不改变模型权重的情况下，通过人格分歧校准减少大语言模型在多任务测试中的文化偏差，为服务全球道德偏好提供了可扩展的替代方案。

Comments 57 pages, 1 figure, 6 MultiTP moral dimensions

详情

AI中文摘要

大型语言模型越来越多地参与涉及道德判断的决策，但越来越多的证据表明，它们的隐含偏好并非文化中立。现有的文化对齐方法要么需要国家层面的偏好数据和微调预算，要么假设可以访问模型内部的白盒信息，而商业API并未暴露此类信息。在本工作中，我们专注于这种现实的黑盒、仅公共数据的环境，并观察到国家内部的社会人口学分歧，而非共识，是主要的指导信号。我们引入DISCA（基于分歧的文化对齐推理方法），一种在推理时的方法，将每个国家视为一个基于世界价值观调查的个人代理面板，并将他们的分歧转化为一个有界的、损失厌恶的logit校正。在20个国家和7个开放权重的backbone（2B-70B）上，DISCA在MultiTP上减少了10-24%的文化偏差（在六个backbone >=3.8B上），并在开放场景中减少了2-7%的偏差，而无需改变任何权重。我们的结果表明，推理时的校准是微调的可扩展替代方案，用于服务全球道德偏好的长尾。

英文摘要

Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.

URL PDF HTML ☆

赞 0 踩 0

2604.25850 2026-05-19 cs.CL cs.SE 版本更新

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

代理 harness 工程：基于可观测性的自动进化编码代理 harness

Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Peking University（北京大学）； Shanghai Qiji Zhifeng Co., Ltd（上海启智锋科技有限公司）

AI总结本文提出了一种基于可观测性的代理 harness 工程方法，通过三个支柱解决 harness 工程中的挑战，使 harness 进化过程自动且可控，实验表明其在多个基准测试中表现优异。

详情

AI中文摘要

harness 现在是编码代理性能的核心，调解模型与工具和执行环境的交互方式。然而，harness 工程仍然是手工制作，因为自动化面临可编辑组件的异质动作空间、大量轨迹隐藏可操作信号以及编辑效果难以归因的挑战。我们引入了代理 harness 工程（AHE），一个闭环系统，通过三个匹配的可观测性支柱来解决这些挑战：（1）组件可观测性为每个可编辑的harness组件提供文件级表示，使动作空间明确且可回退；（2）经验可观测性将数百万个原始轨迹令牌提炼成分层、可深入查阅的证据库，使进化代理可以实际消耗；（3）决策可观测性将每个编辑与自我声明的预测配对，随后通过下一轮任务级结果验证。共同，这些支柱使每个编辑都成为可检验的合同，使harness进化过程自主进行而不陷入试错。实证上，十次AHE迭代使Terminal-Bench 2的pass@1从69.7%提升到77.0%，超过人工设计的harness Codex-CLI（71.9%）和自进化基线ACE和TF-GRPO。冻结的harness无需重新进化：在SWE-bench-verified中，它在比种子少12%的token上达到最高聚合成功率，并在Terminal-Bench 2上，相对于三个替代模型家族，产生+5.1到+10.1个百分点的跨家族增益，表明进化组件编码了通用工程经验而非基准特定调优。消融分析将增益局部化到工具、中间件和长期记忆，而非系统提示，表明事实性的harness结构转移，而语义层面的策略不转移。

英文摘要

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.

URL PDF HTML ☆

赞 0 踩 0

2604.18248 2026-05-19 cs.CR cs.CL 版本更新

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

超越模式匹配：七种跨领域技术用于提示注入检测

Thamilvendhan Munirathinam

发表机构 * Independent Researcher（独立研究者）； prompt-shield project（prompt-shield项目）

AI总结本文提出七种跨领域技术用于提示注入检测，通过引入来自不同领域的机制，如法医学语言学、材料科学疲劳分析、网络安全欺骗技术、生物信息学本地序列比对、经济学机制设计、流行病学谱信号分析和编译器理论中的污点跟踪，以改进现有的提示注入检测方法。

Comments v3.0 (18 May 2026): Added Sec. 5.6 with independent evaluation on three peer-reviewed benchmarks (Liu, USENIX Sec 2024; Garak, Derczynski 2024; InjecAgent, ACL Findings 2024). 8,276 unseen attacks; cross-benchmark plateau at 35-45% on subtle indirect injection. Abstract, contributions, Sec. 6, and 6 refs updated

详情

DOI: 10.5281/zenodo.19644135

AI中文摘要

当前开源的提示注入检测器集中在两种架构选择上：正则表达式模式匹配和微调的变压器分类器。两者都存在近期研究已明确的失败模式。正则表达式无法检测到改写攻击。微调的分类器对适应性对手易受攻击：2025年NAACL Findings研究报告称，八个已发表的间接注入防御方法在适应性攻击下被绕过，攻击成功率超过50%。本文提出七种检测技术，每种技术都引入了来自大语言模型安全领域之外的特定机制：法医学语言学、材料科学疲劳分析、网络安全欺骗技术、生物信息学本地序列比对、经济学机制设计、流行病学谱信号分析以及编译器理论中的污点跟踪。七种技术中的三种在提示防护v0.4.1版本中实现（Apache 2.0），并在六个数据集上的四配置消融测试中进行评估，包括deepset/prompt-injections、NotInject、LLMail-Inject、AgentHarm和AgentDojo。本地比对检测器在deepset数据集上将F1分数从0.033提升到0.378，且无额外的假阳性。风格度量检测器在间接注入基准上增加了11.1个百分点的F1分数。疲劳跟踪器通过探测活动整合测试进行验证。所有代码、数据和重现脚本均以Apache 2.0许可证发布。

英文摘要

Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory. Three of the seven techniques are implemented in the prompt-shield v0.4.1 release (Apache 2.0) and evaluated in a four-configuration ablation across six datasets including deepset/prompt-injections, NotInject, LLMail-Inject, AgentHarm, and AgentDojo. The local-alignment detector lifts F1 on deepset from 0.033 to 0.378 with zero additional false positives. The stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark. The fatigue tracker is validated via a probing-campaign integration test. All code, data, and reproduction scripts are released under Apache 2.0.

URL PDF HTML ☆

赞 0 踩 0

2604.15929 2026-05-19 cs.CL 版本更新

解构大型语言模型中的歧义与不稳定性：一项临床文本到SQL的案例研究

Angelo Ziletti, Leonardo D'Ambrosi

发表机构 * Bayer AG（勃林格殷曼集团）

AI总结本文提出CLUES框架，通过将文本到SQL分解为两个阶段（解释->答案）来区分输出多样性两种不同原因：输入歧义和模型不稳定性，并在临床文本到SQL基准测试中提高了故障预测性能。

详情

Journal ref: Proceedings of the 7th Clinical Natural Language Processing Workshop 2026

AI中文摘要

在临床文本到SQL中部署大型语言模型需要区分输出多样性的两种不同原因：（i）输入歧义，应触发澄清，和（ii）模型不稳定性，应触发人工审查。我们提出CLUES，将文本到SQL建模为两个阶段的过程（解释-->答案），并将语义不确定性分解为歧义分数和不稳定性分数。不稳定性分数通过二元语义图矩阵的Schur补计算。在AmbigQA/SituatedQA（黄金解释）和临床文本到SQL基准测试（已知解释）上，CLUES在状态-of-the-art Kernel Language Entropy之上提高了故障预测。在部署设置中，它保持竞争力，同时提供单个分数不可用的诊断分解。所得到的不确定性区域映射到目标干预 - 对歧义进行查询细化，对不稳定性进行模型改进。高歧义/高不稳定性区域包含51%的错误，覆盖25%的查询，从而实现高效的优先级排序。

英文摘要

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.

URL PDF HTML ☆

赞 0 踩 0

2602.10134 2026-05-19 cs.CR cs.AI cs.CL 版本更新

Reverse-Engineering Model Editing on Language Models

语言模型上的逆向工程模型编辑

Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He

AI总结本文研究了语言模型中参数编辑的漏洞，提出了一种名为KSTER的逆向工程攻击方法，通过利用参数更新的低秩结构恢复编辑数据，并提出subspace camouflage防御策略以降低重建风险。

Comments Accepted to ICML 2026

详情

AI中文摘要

大型语言模型（LLMs）在预训练过程中会接触到包含万亿个标记的语料库，因此不可避免地会记住敏感信息。定位然后编辑方法作为一种主流的模型编辑范式，通过修改模型参数而不重新训练，提供了一个有前景的解决方案。然而，在本工作中，我们揭示了这一范式的关键漏洞：参数更新无意中充当了侧信道，使攻击者能够恢复编辑的数据。我们提出了一种两阶段的逆向工程攻击，称为KSTER（KeySpaceReconsThenEntropyReduction），该方法利用这些更新的低秩结构。首先，我们理论证明了更新矩阵的行空间编码了被编辑主体的“指纹”，通过谱分析可以准确恢复主体。其次，我们引入了一种基于熵的提示恢复攻击，重构了编辑的语义上下文。在多个LLM上的大量实验表明，我们的攻击能够以高成功率恢复编辑数据。此外，我们提出了一种名为subspace camouflage的防御策略，通过语义伪装来混淆更新指纹，从而有效降低重建风险，而不会影响编辑的实用性。我们的代码可在https://github.com/reanatom/EditingAttack上获得。

英文摘要

Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named \textit{KSTER} (\textbf{K}ey\textbf{S}paceRecons\textbf{T}ruction-then-\textbf{E}ntropy\textbf{R}eduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose \textit{subspace camouflage}, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAttack.

URL PDF HTML ☆

赞 0 踩 0

2602.09805 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

超越准确率：分解大语言模型的推理效率

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

发表机构 * Integreat - Norwegian Centre for knowledge-driven machine learning（Integreat - 挪威知识驱动机器学习中心）； UiT - The Arctic University of Norway（UiT - 北极大学）； University of Oslo（奥斯陆大学）

AI总结本文提出一种无需追踪的评估协议，通过完成率、条件正确性和生成长度三个指标分解大语言模型的token效率，同时考虑任务工作量元数据进行归一化处理，并评估模型在不同任务上的推理效率和冗余问题。

Comments Preprint (under review). 29 pages, 4 figures

详情

AI中文摘要

随着推理大语言模型越来越多地通过推理、搜索和自我纠正来换取准确性，单一的准确性分数已无法说明这些token是否带来了有用的推理、从困难实例中恢复或不必要的冗长。我们介绍了一种可选追踪的评估协议，通过三个即使在封闭模型中也可用的观测指标精确分解token效率：完成率、在完成条件下正确性的条件正确性以及生成长度。当实例级工作量元数据可用时，我们进一步将生成长度归一化为声明的任务隐含工作，并将平均口头冗余与工作量依赖的扩展分离。当此类元数据不可用时，我们定义了一个可审计的求解器衍生工作量规模，并在留出自我、留出top-k和持有参考池扰动下评估其稳定性。我们在CogniLoad、GSM8K、ProofWriter和ZebraLogic上评估了14个共享开放权重模型。我们进一步在CogniLoad上评估了11个额外模型，从而能够对推理任务难度因素进行细致分析：任务长度、内在难度和干扰项密度。效率和冗余排名在所有基准对中保持稳定，比准确性排名更加稳健，同时分解了逻辑受限、上下文受限（截断驱动）和冗余受限的失败模式，这些模式在准确性每token下看起来是相同的。我们发布了评估工具包和报告模板，详细说明了LLM在推理上的低效原因。

英文摘要

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.

URL PDF HTML ☆

赞 0 踩 0

2602.02262 2026-05-19 cs.SE cs.AI cs.CL 版本更新

Evo-Memory：通过自演化记忆基准测试LLM代理的测试时间学习

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

发表机构 * Google DeepMind（谷歌深Mind）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出Evo-Memory，一个用于评估LLM代理自演化记忆能力的综合流基准和框架，通过构建序列任务流数据集，要求LLM在每次交互后搜索、适应和演化记忆，并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了超过十种代表性的记忆模块。

详情

AI中文摘要

状态性对于大型语言模型（LLM）代理进行长期规划和问题解决至关重要。这使得记忆成为关键组件，但其管理和进化仍 largely underexplored。现有的评估主要集中在静态对话设置上，其中记忆被动地从对话中检索以回答查询，忽略了在不断变化的任务流中积累和重用经验的能力。在现实世界环境中，如交互问题助手或具身代理中，LLM需要处理连续的任务流，但通常无法从积累的交互中学习，失去有价值的上下文见解，这限制了测试时间的进化，即LLM在部署期间持续检索、整合和更新记忆。为了弥合这一差距，我们引入了Evo-Memory，一个综合的流基准和框架，用于评估LLM代理的自演化记忆能力。Evo-Memory将数据集结构化为连续的任务流，要求LLM在每次交互后搜索、适应和演化记忆。我们统一并实现了超过十种代表性的记忆模块，并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了它们。为了更好地基准测试经验重用，我们提供了一个基线方法ExpRAG，用于检索和利用先前经验，并进一步提出ReMem，一个将推理、任务动作和记忆更新紧密集成的行动-思考-记忆精炼流程，以实现持续改进。

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

URL PDF HTML ☆

赞 0 踩 0

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML 版本更新

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于记忆几何学；不清楚为何

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

发表机构 * Machine Learning Department \& Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA ； Google Research, NY, USA

AI总结研究探讨了深度序列模型中原子事实的存储机制，发现几何记忆能编码全局关系，即使在训练中未共现的实体间也能建立联系，挑战了传统关联记忆的观点。

Comments Forty-third International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

深度序列模型被认为主要通过关联记忆存储原子事实，即通过暴力查找共现实体。我们识别出一种不同的存储形式，称为几何记忆。在此模型中，嵌入编码了所有实体之间的新型全局关系，包括训练中未共现的实体。这种存储形式强大：例如，我们展示了它如何将涉及ℓ-折叠组合的困难推理任务转化为易于学习的一步导航任务。从这一现象中，我们提取了神经嵌入几何学中难以解释的基本方面。我们认为，这种几何的出现，与局部关联的查找相比，不能简单归因于典型的监督、架构或优化压力。反直觉的是，即使几何比暴力查找更复杂，它仍然会被学习。然后，通过分析与Node2Vec的联系，我们展示了几何起源于一种光谱偏见，这与主流理论相反，确实自然产生，尽管缺乏各种压力。这一分析也指出了从业者在使Transformer记忆更几何化方面的可见空间。我们希望几何视角的参数记忆鼓励重新审视指导知识获取、容量、发现和遗忘等领域的默认直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

URL PDF HTML ☆

赞 0 踩 0

2510.24208 2026-05-19 cs.CL cs.LG 版本更新

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

超越神经不兼容：通过潜在语义对齐实现语言模型中的跨尺度知识转移

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

发表机构 * Monash University（墨尔本大学）； Technical University of Munich（慕尼黑技术大学）； Chongqing University（重庆大学）

AI总结本文提出SemAlign方法，通过潜在语义对齐实现跨尺度知识转移，解决了不同架构和参数化模型间参数重用受限的问题，通过激活值作为转移介质，利用语义分解与重组稳定地实现知识迁移。

Comments an early-stage version

详情

AI中文摘要

语言模型（LMs）在其参数中编码了大量知识，但如何以细粒度方式转移此类知识，即参数化知识转移（PKT）仍不明确。核心挑战是当源模型和目标模型在架构和参数化上存在差异时，如何实现有效的、高效的跨尺度转移，这使得直接参数重用受到神经不兼容的限制。在本文中，我们识别出潜在语义对齐是跨尺度知识转移的关键前提。与直接移动层参数不同，我们的方法使用激活值作为转移介质。SemAlign包含两个阶段：一个层归因阶段，用于归因任务相关的源层并为每个目标层选择恰好一个源层；一个语义对齐阶段，通过逐层配对并优化目标模型，利用源侧语义监督。对齐通过语义分解和重组在潜在空间中进行。在浅层到深层的转移过程中，只有前沿目标层是可训练的。层目标通过匹配中心化的词-词关系几何与对齐的监督残差来监督该层的残差贡献，而输出KL保持源级预测行为。因此，转移介质既不是参数块也不是绝对的隐藏状态，而是由配对源层监督诱导的目标空间残差几何。在四个基准测试中的评估证实了SemAlign的有效性，进一步分析确认语义分解和重组为跨尺度知识转移提供了一个稳定的机制。

英文摘要

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, making direct parameter reuse strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. \textsc{SemAlign} has two stages: an \emph{layer attribution} stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a \emph{semantic alignment} stage that pairs them layer by layer and optimizes the target with source-side semantic supervision. The alignment is carried out in latent space through semantic decomposition and recomposition. During the shallow-to-deep transfer, only the frontier target layer is trainable. The layer objective supervises the residual contribution of that layer by matching centered token-token relation geometry against an aligned supervisory residual, while output KL preserves source-level predictive behavior. The transferred medium is therefore neither a parameter block nor an absolute hidden state, but target-space residual geometry induced by paired source-layer supervision. Evaluations on four benchmarks demonstrate the efficacy of \textsc{SemAlign}, and further analysis confirms that semantic decomposition and recomposition provide a stable mechanism for cross-scale knowledge transfer.

URL PDF HTML ☆

赞 0 踩 0

2510.21712 2026-05-19 cs.IR cs.AI cs.CL 版本更新

DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling

DecoupleSearch: 通过分层奖励建模解耦规划与搜索

Hao Sun, Zile Qiao, Bo Wang, Guoxin Chen, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang

发表机构 * Tongyi Lab（通义实验室）； Alibaba Group（阿里巴巴集团）

AI总结本文提出DecoupleSearch框架，通过双值模型解耦规划与搜索过程，利用蒙特卡洛树搜索评估每一步的质量，并通过分层束搜索迭代优化规划和搜索候选，验证了方法的有效性。

Comments EMNLP 2025 Main Conference

详情

AI中文摘要

检索增强生成（RAG）系统已作为一种增强大型语言模型（LLM）的关键方法，通过动态整合外部知识。为了进一步提高RAG的灵活性，代理RAG引入了自主代理到工作流程中。然而，代理RAG面临几个挑战：（1）每一步的成功取决于高质量的规划和准确的搜索；（2）中间推理步骤缺乏监督；（3）规划和搜索的候选空间呈指数级增长。为了解决这些挑战，我们提出了DecoupleSearch，一种新的框架，通过双值模型解耦规划和搜索过程，使规划推理和搜索基础能够独立优化。我们的方法构建了一个推理树，其中每个节点代表规划和搜索步骤。我们利用蒙特卡洛树搜索来评估每一步的质量。在推理过程中，分层束搜索通过双值模型迭代优化规划和搜索候选。在不同参数规模的策略模型上的广泛实验验证了我们方法的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes demonstrate the effectiveness of our method.

URL PDF HTML ☆

赞 0 踩 0

2510.20584 2026-05-19 cs.CL cs.AI 版本更新

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT自动编码通信数据：子群体一致性分析

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

发表机构 * ETS Research Institute（ETS研究机构）

AI总结本文研究了使用ChatGPT进行通信数据编码在不同性别和种族/族裔群体间的一致性，发现其编码结果与人类评分者一致，为大规模评估协作与沟通提供了可能。

Comments Accepted to the Journal of Educational Measurement

详情

AI中文摘要

在大规模评估沟通和协作方面，对通信数据进行分类编码是一项劳动密集型任务，根据不同的框架进行分类。先前研究已证明，可以通过直接指示ChatGPT使用编码评分表来对通信数据进行编码，并且其准确性与人类评分者相当。然而，ChatGPT或类似AI技术在不同人口群体（如性别和种族）之间编码的一致性仍不清楚。为填补这一空白，我们引入了三种检查方法，用于评估基于LLM的编码中的子群体一致性，通过适应自自动化评分文献中已有的框架。使用典型的协作问题解决编码框架和三种类型的协作任务数据，我们检查了基于ChatGPT的编码在性别和种族/族裔群体中的表现。我们的结果表明，基于ChatGPT的编码在性别或种族/族裔群体中表现一致，与人类评分者一致，证明了其在大规模评估协作和沟通中的可行性。

英文摘要

Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology perform consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

URL PDF HTML ☆

赞 0 踩 0

2510.11391 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK（香港大学）； UCAS（中国科学技术大学）； XJTU（西安交通大学）； UMich（密歇根大学）； Microsoft（微软）

AI总结本文提出DocReward，一种用于评估文档结构和风格的奖励模型，通过构建包含117,000对文档的DocPair数据集，采用Bradley-Terry损失训练，有效提升了文档生成的结构和风格专业性。

详情

AI中文摘要

近期的代理工作流程自动化了专业文档生成，但主要关注文本质量，忽视了结构和风格的专业性，这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型，无法引导代理生成结构和风格专业的文档。我们引入DocReward，一种评估文档结构和风格的文档奖励模型。为此，我们提出了一种文本质量无关的框架，确保评估不受内容质量的影响，并构建了包含117,000对文档的DocPair数据集，涵盖32个领域和267种类型。每对文档内容相同，但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中，DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明，DocReward能有效引导代理生成具有更一致结构和风格专业性的文档，突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.

URL PDF HTML ☆

赞 0 踩 0

2510.10930 2026-05-19 cs.CL cs.AI 版本更新

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

发表机构 * University of Cambridge（剑桥大学）； MIT（麻省理工学院）； Princeton University（普林斯顿大学）； NYU（纽约大学）； Harvard University（哈佛大学）； Stanford University（斯坦福大学）

AI总结本文研究了语言模型对游戏评估的能力，通过比较现代语言模型和人类及符号计算代理的评估结果，发现推理模型在游戏评估上更接近人类，但随着模型接近博弈最优，其与人类数据的匹配度会减弱，且在评估趣味性时表现出更大的波动。

详情

AI中文摘要

推理不仅仅是解决问题，也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上，通过研究模型如何玩国际象棋和围棋等游戏。在本文中，我们倡导一种新的范式，即评估人工智能系统对游戏的评估。首先，我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集，将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询：评估游戏的收益（或公平性）和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度：计算查询的复杂性和量化查询的难度。我们的结果表明，推理模型在游戏评估上通常比非推理语言模型更接近人类。然而，我们观察到非单调的关系：随着模型接近博弈最优，其与人类数据的匹配度会减弱。我们还发现，在评估趣味性时，模型之间存在更多的波动性，这与量化该查询的难度更大有关。在各种查询和游戏中，推理模型在评估查询时表现出高度变化和不可预测的资源使用，这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

URL PDF HTML ☆

赞 0 踩 0

2509.17680 2026-05-19 cs.CL 版本更新

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

当表格问答遇见噪声：为复杂问题和大规模表格设计的双去噪框架

Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang

发表机构 * University of Science and Technology of China（中国科学技术大学）； The University of Melbourne（墨尔本大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）

AI总结本文提出EnoTab双去噪框架，通过改进相关性过滤和表格修剪能力，解决复杂问题和大规模表格中的噪声问题，提升表格问答性能。

Comments 24 pages, 24 figures, accepted to ACL 2026 Main

详情

AI中文摘要

表格问答（TableQA）是自然语言处理（NLP）中的基本任务。大语言模型（LLMs）强大的推理能力在这一领域带来了显著进展。然而，随着实际应用中问题日益复杂且表格规模增大，大量噪声数据被引入，严重降低了推理性能。为了解决这一挑战，我们专注于提升两个核心能力：相关性过滤，即识别并保留与推理真正相关的信息，以及表格修剪，即在保留必要内容的同时减少表格规模。基于这些原则，我们提出了EnoTab，一种为复杂问题和大规模表格设计的双去噪框架。具体来说，我们首先通过证据-based问题去噪，将问题分解为最小的语义单元，并根据一致性和实用性标准过滤掉与答案推理无关的部分。然后，我们提出证据树引导的表格去噪，构建一个明确且透明的表格修剪路径，逐步移除无关数据。在每一步修剪过程中，我们观察表格的中间状态，并应用后序节点回滚机制来处理异常表格状态，最终产生一个高度可靠的子表格用于最终答案推理。最后，广泛的实验表明，EnoTab在复杂问题和大规模表格的TableQA任务中实现了卓越的性能，证实了其有效性。

英文摘要

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

URL PDF HTML ☆

赞 0 踩 0

2509.14004 2026-05-19 cs.CL 版本更新

在单层变换器中可证明的知识获取与提取

Ruichen Xu, Kexin Chen

AI总结本文研究了单层变换器中知识获取与提取的机制，通过理论分析和实验验证，揭示了预训练和微调过程中知识存储与提取的关系，以及低秩微调如何恢复预训练的事实知识。

详情

AI中文摘要

大型语言模型在预训练过程中可能获得事实性知识，但在微调后却无法可靠地使用这些知识。尽管有越来越多的实证证据表明MLP层存储事实关联，并且微调影响事实回忆，但连接下一个标记预训练、知识存储和后微调提取的训练动态机制仍然理解有限。我们研究了这个问题，使用了一个简化的一层变换器，包含自注意力和MLP模块，通过下一个标记预测进行训练，随后在问答数据上进行微调。在适当的正则性条件下，我们首先证明模型在学习结构化注意力模式和关系特定的特征方向时达到接近最优的预训练损失，从而提供了一个事实性知识获取的机制。然后我们展示微调可以将问答提示格式转化为触发预训练关系特征的手段，使模型能够提取在微调过程中未被重新访问的事实。我们的分析给出了知识提取的关联覆盖特征化：微调不需要重新访问每一个存储的主体-答案对，但必须覆盖足够的潜在关系-模板方向，通过这些方向在预训练中编码了事实。因此，提取随着预训练的多重性和微调的覆盖度而提高，但随着关系-模板宇宙的增长而变得更加困难。相反，不足的覆盖度会导致失败状态，其中事实可能被存储但仍然无法访问，提供了一个简化的幻觉机制。该理论适用于全和低秩微调，为为什么当关系覆盖度足够时低秩适应可以恢复预训练的事实知识提供了见解。在合成数据和基于PopQA的GPT-2/Llama模型上的实验支持了预测的趋势。

英文摘要

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

URL PDF HTML ☆

赞 0 踩 0

2507.18406 2026-05-19 cs.CL cs.DB cs.DL cs.IR 版本更新

Factual Inconsistencies in Multilingual Wikipedia Tables

多语言维基百科表格中的事实不一致

Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

发表机构 * CNR ISTC（意大利国家研究委员会信息与通信技术研究所）； Fraunhofer Institute for Applied Information Technology FIT（弗劳恩霍夫应用信息技术研究所）； Tallinn University of Technology（塔林理工大学）； EURECOM（欧瑞康）； Technical University of Munich（慕尼黑技术大学）； University of Amsterdam（阿姆斯特丹大学）

AI总结本研究探讨了多语言维基百科结构化内容中的跨语言不一致问题，特别是表格数据，通过开发方法收集、对齐和分析多语言维基百科文章中的表格，定义不一致的类别，并应用定量和定性指标评估多语言对齐，为事实验证、多语言知识交互和可靠AI系统设计提供启示。

Comments 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025

2504.13217 2026-05-19 cs.CL cs.AI 版本更新

轰鸣声击中新闻摊位：对全球山体滑坡相关新闻报道和空间偏见的数据分析

Brielen Madureira, Andreas Niekler, Marc Keuschnigg, Mariana Madruga de Brito

发表机构 * LeipzigLab – Climate Discourse（莱比锡实验室——气候话语）； Leipzig University（莱比锡大学）； Helmholtz Centre for Environmental Research（亥姆霍兹环境研究中心）； Linköping University（林雪平大学）

AI总结本文通过分析25年间近6万篇关于5500起山体滑坡事件的新闻文章，探讨德国报纸对全球山体滑坡的报道方式，揭示南欧和西欧地区报道过度的现象，为研究媒体对国际灾害关注的不平等提供参考。

Comments Work in progress

2605.18083 2026-05-19 cs.CL 版本更新

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

多语言大语言模型的高效路径：通过后训练PARAM$Δ$整合到再利用MoE进行语言扩展

Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-Ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University（南京大学新型软件技术国家重点实验室）； Tongyi Lab, Alibaba Group（阿里集团通义实验室）； Zhejiang University（浙江大学）

AI总结本文提出了一种高效的方法，通过将密集模型转换为MoE架构，并将不同语言分配给不同专家，从而在不进行复杂对齐阶段的情况下提升多语言大语言模型的性能，同时保留原始能力。

详情

AI中文摘要

将大型语言模型（LLMs）扩展到新语言是一个成本高昂的过程，需要大量的持续预训练（CPT）和数据密集型对齐。尽管最近的数据免费融合技术试图通过将多语言CPT增强模型与其指令版本融合来绕过对齐，但它们受到关键权衡的限制：缓解参数冲突以保持原始能力不可避免地会稀释新语言的学习，反之亦然。为了解决这一矛盾，我们引入了\method，将密集模型重新利用为专家混合（MoE）架构，将不同专家分配给不同语言。然后通过将MoE扩展的参数delta（$Δ_{ ext{post}}$）嫁接回CPT增强的基模型来转移对齐能力，从而绕过复杂的对齐阶段。实验表明，\method在具有相似FLOPs或参数数量的基线方法上表现出色；它在扩展语言上提高了性能，同时有效保留了原始能力。我们进一步证明，我们的方法在不同模型和后训练delta上具有高度适用性。

英文摘要

Expanding Large Language Models~(LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training~(CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice-versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts~(MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta~($Δ_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and Post-training deltas.

URL PDF HTML ☆

赞 0 踩 0

2605.18079 2026-05-19 cs.LG cs.CC cs.CL 版本更新

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

低精度softmax变换器的表达能力（摘要）链式思维

Moritz Brösamle, Stephan Eckstein

发表机构 * Department of Mathematics, University of Tübingen, Germany（图宾根大学数学系）

AI总结本文研究了低精度softmax变换器在链式思维中的表达能力，通过构造三元激活和分离注意力分数的硬max变换器来模拟图灵机，从而将构造转换为等效的softmax变换器，并分析了最近提出的总结链式思维范式在模拟图灵机时的效率。

Comments Accepted to ICML 2026

详情

AI中文摘要

现有的变换器表达性结果通常依赖于hardmax注意力、高精度和其它架构修改，这些修改将它们与实际使用的模型脱节。我们通过分析具有softmax注意力和激活值及注意力权重四舍五入的标准变换器解码器，同时允许深度和宽度以对数方式增长于上下文长度，来弥合这一差距。作为中间步骤，我们构造了具有三元激活和良好分离注意力分数的硬max变换器，利用链式思维（CoT）模拟图灵机。这使我们能够将构造转换为等效的softmax变换器，而无需先前方法所需的不现实的参数规模或激活精度。使用相同的技术，我们分析了最近提出的总结Co T范式，并展示其在模拟图灵机时更加高效，模型大小以空间界而非时间界缩放。我们通过在数独推理任务上验证我们的结果，并发现其比先前的高精度结果更符合可学习性。我们的代码可在https://github.com/moritzbroe/transformer-expressivity上获得。

英文摘要

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.

URL PDF HTML ☆

赞 0 踩 0

2605.18071 2026-05-19 cs.CL 版本更新

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

KVDrive: 一个面向长上下文LLM推理的多层级KV缓存管理系统

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

发表机构 * Hong Kong University of Science and Technology China（香港科技大学中国）； Xi’an Jiaotong University China（西安交通大学中国）

AI总结本文提出KVDrive，一个面向长上下文LLM推理的多层级KV缓存管理系统，通过联合缓存放置、流水线调度和跨层级协调，实现了高吞吐量的推理，在有限的GPU预算下保持高精度。

详情

AI中文摘要

支持长上下文LLM存在挑战，因为键值（KV）缓存的大量内存需求。现有的卸载系统将完整的缓存存储在主机内存中，并在解码过程中选择性地获取关键条目，但这种策略很快达到极限：无法进一步稀释而不影响准确性。因此，当上下文长度和批处理大小增加时，KV传输的体积急剧上升，成为解码延迟的主要来源。我们提出了KVDrive，一个横跨GPU内存、主机DRAM和SSD的多层级KV缓存管理系统。与之前通过算法改进追求更高稀疏度的工作不同，KVDrive从系统角度出发，联合缓存放置、流水线调度和跨层级协调，以在有限的GPU预算下维持高吞吐量的推理。KVDrive实现了三个基本能力：它根据注意力行为调整缓存管理以最大化重用并最小化冗余数据移动；它重构解码流水线以重叠I/O和CPU/GPU计算瓶颈阶段，消除异构资源中的停滞；并且它协调内存层级之间的数据移动，解锁远超GPU和DRAM限制的可扩展长上下文推理。我们已经实现了一个完整的KVDrive原型，并在长上下文基准测试中评估了流行LLM。该系统在保持准确性的同时，相比最先进的工作实现了高达1.74倍的吞吐量提升。

英文摘要

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective - jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O- and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput compared to state-of-the-art works while preserving accuracy.

URL PDF HTML ☆

赞 0 踩 0

2605.18067 2026-05-19 cs.CL 版本更新

弥合差距：将阅读文本转换为对话式语音

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

发表机构 * Samsung Research and Development Institute, Bangalore, India（三星研发研究所，班加罗尔，印度）

AI总结本文提出了一种名为PACC的新方法，通过利用深度神经网络分析和修改语调、重音和节奏等语调特征，将阅读语音转换为更自然的对话语音，从而在虚拟助手、客户服务和语言学习工具中提高语音转换的自然度和准确性。

Comments 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

详情

DOI: 10.1007/978-981-96-6681-2_38
Journal ref: Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556

AI中文摘要

在最近的语音处理进展中，将阅读语音转换为对话语音引起了广泛关注。该领域的主要挑战是在实时应用中保持自然性和可懂性的同时，最小化计算开销。传统的阅读语音缺乏对话互动中至关重要的细微语调变化，这对虚拟助手、客户服务和语言学习工具等应用构成了挑战。本文介绍了一种新的方法，即带有对话上下文的语调调整（PACC），旨在将阅读语音转换为各种现代应用中使用的自然对话语音。PACC利用先进的深度神经网络来分析和修改语调特征，如语调、重音和节奏。与传统方法不同，我们的方法使用高保真生成对抗网络（HiFi-GAN）进行语音合成。我们的实验结果表明，语音转换在自然度和模型准确性方面有显著提高，通过在语音数据集上额外训练。这项研究为语音转换任务和Mean Opinion Score（MOS）评估建立了新的基准，并证明我们的方法可以成功扩展到其他语音转换应用。

英文摘要

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

URL PDF HTML ☆

赞 0 踩 0

2605.17989 2026-05-19 cs.CL cs.AI 版本更新

Predictive Prefetching for Retrieval-Augmented Generation

检索增强生成的预测预取

Wuyang Zhang, Shichao Pei

发表机构 * Department of Computer Science, University of Massachusetts Boston（马萨诸塞大学波士顿分校计算机科学系）

AI总结本文提出了一种先进的异步检索框架，通过预测检索触发时机和所需信息，以减少延迟并提高生成效率，同时保持回答质量。

Comments Accepted by Forty-third International Conference on Machine Learning ICML 2026

详情

AI中文摘要

检索增强生成（RAG）通过在大型语言模型中增强事实性，但因其同步检索导致显著延迟。尽管近期工作探索了异步检索，但现有方法依赖于检索与生成之间的启发式协调，并假设解码期间信息需求稳定，这在复杂、多领域设置中往往失效。本文提出了一种先进的异步检索框架，该框架能够与不断演变的信息需求相匹配，通过利用生成动态中出现的语义前驱，使用三个组件——检索预测器、上下文监视器和查询生成器，显式预测何时应触发检索以及应检索什么信息。在多个基准测试上的实验表明，该方法可实现高达43.5%的端到端延迟减少和62.4%的时间到第一个token的提升，同时保持与同步RAG基线相当的回答质量。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.17978 2026-05-19 cs.CL 版本更新

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

AutoVecCoder: 教授大语言模型生成显式向量化代码

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Xiamen University（厦门大学）； Tsinghua University（清华大学）

AI总结本文提出AutoVecCoder框架，通过VecPrompt和VecRL组件，使大语言模型能够自动进行显式向量化，从而在SimdBench的SSE和AVX子集上达到最先进的性能，超越传统自动向量化的方法。

详情

AI中文摘要

通过单指令多数据（SIMD）架构进行向量化是高性能计算的核心。为了充分利用硬件潜力，开发人员通常依赖显式向量化使用内联函数，因为基于编译器的自动向量化由于保守的静态分析常常产生次优结果。尽管大语言模型（LLMs）在一般代码生成方面表现出色，但它们在显式向量化方面遇到困难，因为高质量语料库稀缺且低级硬件指令的语义约束严格。在本文中，我们提出了AutoVecCoder，一种新的框架，旨在赋予LLMs自动显式向量化的能力。AutoVecCoder集成了两个核心组件：VecPrompt，一个自动数据合成管道，用于注入领域特定的内联知识；以及VecRL，一个强化学习框架，将代码生成与执行效率对齐。通过此框架训练的AutoVecCoder-8B在SimdBench的SSE和AVX子集上实现了最先进的性能，并在某些情况下生成的实现超过了标准-O3优化，有效克服了传统自动向量化的固有瓶颈。

英文摘要

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

URL PDF HTML ☆

赞 0 踩 0

2605.17936 2026-05-19 cs.CL cs.LG 版本更新

代理分块与贝叶斯去分块：人工智能生成的模糊认知图的模型：特克西德斯陷阱模型

Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

发表机构 * University of Southern California（美国南加州大学）； Florida International University（佛罗里达国际大学）

AI总结本文提出了一种基于代理分块和贝叶斯去分块的方法，用于生成和更新人工智能生成的模糊认知图，通过在文本中生成重叠的文本分块，并利用稀疏因果分块矩阵进行混合，从而构建出代表性的循环模糊认知图知识图谱，以预测特克西德斯陷阱模型中的冲突结果。

Comments 15 pages, 6 figures

详情

AI中文摘要

我们通过训练大语言模型代理将文本分解为重叠的文本分块，从而自动生成反馈因果模糊认知图（FCMs）。通过将这些分块FCMs进行凸混合，可以得到一个代表性的循环FCM知识图。文本分块可以有不同的重叠程度。分块FCMs仍然混合以形成新的FCM因果知识图。混合技术的可扩展性源于其使用轻量计算和稀疏因果分块矩阵。混合结构允许进行一种操作层面的贝叶斯推断，从而从混合的FCM中生成“去分块”或后验似的FCM。这些去分块的FCM在自身具有价值，并允许进一步的贝叶斯更新。我们通过Allison的“特克西德斯陷阱”模型的论文文本演示了这些混合技术，该模型描述了主导力量（如美国）与崛起力量（如中国）之间的冲突。FCM动态系统在达到固定点或极限环吸引子时预测结果。当我们通过激活代表崛起力量野心和权利的概念节点来刺激这些FCM知识图时，8个中的7个FCM知识图预测了战争类型。Gemini 3.1 LLMs作为分块AI代理。

英文摘要

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.

URL PDF HTML ☆

赞 0 踩 0

2605.17885 2026-05-19 cs.CL cs.AI 版本更新

Multi-agent AI systems outperform human teams in creativity

多智能体AI系统在创造力上超越人类团队

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

发表机构 * Microsoft Research Asia（微软亚洲研究院）

AI总结研究探讨了多智能体AI系统在创造力任务中的表现，发现其在四个多样化问题解决任务中，比单智能体和人类团队更具创造力，核心方法是通过语义空间路径分析生成过程，主要贡献是揭示了AI和人类团队在创造力预测上的不同机制。

详情

AI中文摘要

尽管人工智能（AI）在众多认知任务上已匹配或超越人类表现，但创造力仍是一个极具争议的前沿。随着基于大语言模型（LLMs）的AI系统在研究和创新中被越来越多地采用，理解并增强其创造力变得至关重要。本文证明，多智能体LLM团队不仅超越了单个智能体，而且在4541个多智能体LLM想法和341个人类团队想法上，显著优于人类团队在创造力方面（Cohen's d=1.50）。这种优势由新颖性驱动，同时保持了相当的实用性。为了研究两组的生成过程，我们通过神经语言模型表示将对话表示为语义空间中的路径。LLM和人类团队在对话范围广泛而不是集中在单一主题（低全局一致性）时产生更多创造性想法。然而，预测创造力的额外模式不同：LLM团队受益于高效的探索（高语义扩展，较短路径），而人类团队受益于维持流畅的对话流程（高局部一致性，频繁转换）。此外，我们识别出模型选择和讨论结构作为正交的设计杠杆，共同解释了LLM对话动态中26.8%的方差，为系统开发具有增强创造力的多智能体系统铺平了道路。

英文摘要

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.

URL PDF HTML ☆

赞 0 踩 0

2605.17873 2026-05-19 cs.LG cs.AI cs.CL 版本更新

从文档到段落：一种用于主题分配的上下文重述

Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

发表机构 * LG AI Research（LG AI研究院）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出了一种基于段落的主题分配方法（SBTA），通过将主题分配到短小且连贯的文本段落而非整个文档，以解决传统主题模型中文档多主题问题导致的主题污染问题，从而提升主题分析的清晰度和可解释性。

Comments Findings of ACL 2026

详情

AI中文摘要

传统的主题建模方法为每个文档分配一个单一主题。然而，在实践中，许多现实世界文档，如产品评论或开放式调查回答，包含多个不同的主题。这种不匹配常常导致主题污染，即不相关主题被合并到一个主题中，使得难以识别真正专注于特定主题的文档。我们通过引入基于段落的主题分配（SBTA），一种对主题建模的重述方法，将主题分配给段落：短小、连贯的文本片段，每个片段表达一个单一主题。通过在段落层面建模主题结构，我们的方法产生更清晰和可解释的主题，并更好地支持多主题文档的分析。为了支持系统评估，我们构建了一个SemEval-STM数据集，灵感来自基于方面的情感分析。文档首先通过大型语言模型（LLMs）分解为基于主题的段落，随后通过人工校验确保段落质量。我们还提出了一种基于段落的词入侵任务扩展，使人类能够在主题实际分配的粒度上评估主题连贯性。在多个模型和评估指标上，我们证明SBTA提高了聚类质量和可解释性。总体而言，这项工作提供了一个实用、可扩展的框架，用于异构文本语料库中细粒度的主题分析，其中文档自然涵盖多个主题。

英文摘要

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct a SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: https://huggingface.co/datasets/LG-AI-Research/SemEval-STM

URL PDF HTML ☆

赞 0 踩 0

2605.17710 2026-05-19 cs.CL eess.AS 版本更新

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

Sometin Beta Pass Notin (SBPN): 通过知识蒸馏改进尼日利亚语言的多语言语音识别

Sewade Ogun

发表机构 * Nigerian Languages（尼日利亚语言）

AI总结本文提出SBPN模型，通过两阶段知识蒸馏方法提升尼日利亚多种语言的语音识别性能，显著降低词错误率并优于现有多语言模型。

Comments 25 pages

详情

AI中文摘要

尽管现代多语言自动语音识别（ASR）系统支持多种尼日利亚语言，但其性能始终落后于英语和法语等高资源语言。尼日利亚语言存在独特的建模挑战，包括数据稀缺、不一致的正字法、声调符号、多样化的口音、频繁的代码切换和本地化专有名词。为解决这些挑战，我们开发了一个多语言ASR框架，采用两阶段蒸馏过程。首先，我们利用学生-教师知识蒸馏从现有单语言模型中学习，基于稳健的语言特定N-gram语言模型进行条件化。其次，我们使用伪标签数据进行迭代自我改进以进一步提高准确性。我们的方法显著缩小了性能差距，平均在单语言基线上实现了29%的词错误率（WER）减少。我们的模型在主要基准上也优于现有最先进的多语言模型，包括Common Voice和Fleurs。我们引入Sometin Beta Pass Notin（SBPN），一个覆盖约鲁巴、豪萨、伊博、尼日利亚皮钦语和尼日利亚英语的多语言ASR模型。SBPN以两种大小发布：SBPN-Base（120 M参数）和SBPN-Large（600 M参数）。通过发布这些作为开放基础模型，我们旨在为该地区丰富的语音和文化景观的研究提供ASR资源。

英文摘要

Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29 % over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120 M parameters) and SBPN-Large (600 M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.

URL PDF HTML ☆

赞 0 踩 0

2605.17691 2026-05-19 cs.CL cs.AI 版本更新

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证你的权威：在多标签先例处理分类上对LLM进行基准测试

M. Mikail Demir, M. Abdullah Canbaz

发表机构 * Department of Information Science and Technology（信息科学与技术系）； College of Emergency Preparedness, Homeland Security, and Cybersecurity（应急准备、国土安全与网络安全学院）； University at Albany, SUNY（萨利纳大学）

AI总结本文提出了一种新的评估框架，通过专家标注的数据集对现代大语言模型进行基准测试，引入了平均严重性误差指标，以更准确地衡量分类错误的实践影响。

Comments Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

详情

DOI: 10.18653/v1/2025.nllp-1.13

AI中文摘要

自动化法律先例中负面处理的分类是一个关键但复杂的自然语言处理任务，误分类可能带来重大风险。为了解决标准准确率的不足，本文介绍了一种更稳健的评估框架。我们对239个真实世界法律引用的新专家标注数据集上的现代大语言模型进行了基准测试，并提出了一种新的平均严重性误差度量标准，以更好地衡量分类错误的实践影响。我们的实验揭示了性能的分裂。Google的Gemini 2.5 Flash在高层次分类任务上达到了最高准确率（79.1%），而OpenAI的GPT-5-mini则在更复杂的细粒度模式上表现最佳（67.7%）。本工作建立了关键基准，提供了一个新的上下文丰富的数据集，并引入了一个针对这一复杂法律推理任务的评估度量标准。

英文摘要

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

URL PDF HTML ☆

赞 0 踩 0

2605.17672 2026-05-19 cs.CL 版本更新

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

停止当推理收敛：用于推理模型的语义保留早退出

Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona, Lu Cheng

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Google Research（谷歌研究）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Politecnico di Milano（米兰理工学院）

AI总结本文提出PUMA框架，通过识别推理层面的语义冗余信号，实现语义保留的早退出，从而在保持准确性和推理链完整性的同时减少token消耗。

Comments under review

详情

AI中文摘要

大型推理模型（LRMs）通过生成长链的推理步骤（CoT）实现强大的性能，但常常过度推理，继续推理在解决方案已经稳定后，从而浪费token并增加延迟。现有的推理时早退出方法主要依赖于答案层面的信号，如置信度或试答一致性，来决定何时停止。然而，这些信号主要反映答案准备程度而非推理收敛：它们可能在模型尚未完成探索或自我纠正之前触发，导致过早退出，从而降低最终答案的准确性并留下保留的推理链语义不完整。我们识别推理层面的语义冗余作为语义保留早退出的互补信号：当连续步骤不再添加新的进展而是重复已确立的结论时，推理轨迹可能已收敛。基于这一见解，我们提出了PUMA，一个插件式框架，结合了轻量级的冗余检测器和答案层面的验证。检测器标记语义冗余的候选退出点，而验证确认停止是否安全，使PUMA能够在保持答案准确性和连贯推理前缀的同时删除冗余的延续。在五个LRMs和五个具有挑战性的推理基准测试中，PUMA实现了26.2%的平均token减少，同时保持准确性和保留的CoT质量。此外，针对代码生成、零样本视觉-语言推理和学习停止策略内部化等额外实验进一步证明，推理层面的冗余是高效推理的稳健、可转移和可学习的信号。我们的代码可在https://github.com/giovanni-vaccarino/PUMA上获得。

英文摘要

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at \url{https://github.com/giovanni-vaccarino/PUMA}.

URL PDF HTML ☆

赞 0 踩 0

2605.17652 2026-05-19 cs.CL 版本更新

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

超越转录：借助音频的迭代同行编辑解锁高质量的对话语音总结

Kaavya Chaparala, Thomas Thebaud, Jesús Villalba López, Laureano Moro-Velazquez, Peter Viechnicki, Najim Dehak

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结该研究通过比较音频和转录本总结的质量，探讨了同行编辑在提升语音总结质量中的作用，并发现音频总结在信息量和压缩程度上不如转录本总结，但通过迭代同行编辑可以弥补这一差距。

Comments Accepted in LREC 2026

详情

DOI: 10.63317/4d596vd4x2xr

AI中文摘要

目前缺乏足够的语音总结任务基准。创建新基准需要人工标注，因为LLM可能会将系统性错误和偏见嵌入数据集中。我们测试了十种标注工作流程，涉及不同的输入模态（音频、转录本或两者）以及编辑（自我编辑或同行编辑）的 inclusion，以研究使用人工标注者总结音频时可能的质量权衡。我们比较了基于音频的人类总结与基于转录本的人类总结，以跟踪不同信息模态对总结质量的影响。我们还比较了人类输出与四个LLM基准（三个文本，一个音频）以确定人类编写总结是否比高度流畅的自动输出信息更少。我们发现基于音频的总结信息较少且更压缩，但通过使用音频的迭代同行编辑可以缓解这一差异，使基于音频的总结信息量与转录本总结和LLM总结相当。这些发现验证了在创建结合词汇和语调信息的基准时，人类标注者之间的迭代同行编辑的有效性。这使在转录本不可用的情况下也能关键数据集的收集成为可能。

英文摘要

There are not enough established benchmarks for the task fo speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in setting where transcripts are unavailable.

URL PDF HTML ☆

赞 0 踩 0

2605.17641 2026-05-19 cs.AI cs.CL 版本更新

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

基于因果干预的记忆选择用于长时域大语言模型智能体

Saksham Sahai Srivastava

发表机构 * School of Computing, University of Georgia, Athens, Georgia, USA（佐治亚大学计算机学院）

AI总结本文提出Causal Memory Intervention（CMI）方法，通过因果推理选择大语言模型的长期记忆，以提高回答质量和鲁棒性，同时引入Causal-LoCoMo基准数据集进行评估。

Comments 12 pages, 3 figures, 3 tables

详情

AI中文摘要

长时域大语言模型智能体依赖持久记忆来支持跨会话的交互，但现有记忆系统通常使用语义相似性或广泛历史包含来检索上下文，将检索到的记忆视为统一有用。这一假设是脆弱的，因为记忆可能在主题上相关，但仍然无关、过时或误导性。我们提出了Causal Memory Intervention（CMI），一种因果记忆选择技术，通过在受控干预下估计候选记忆如何影响模型的答案，选择提高任务性能的同时抑制不稳定、无关或有害的记忆。为了评估这一设置，我们引入了Causal-LoCoMo，一个从长对话数据中衍生出的因果标注基准，其中每个示例包含用户请求、结构化记忆库、有用的记忆、无关干扰项以及合成有害记忆。我们比较了CMI与向量、图、反思、摘要、完整历史和无记忆基线。结果表明，CMI在回答质量和对误导性记忆的鲁棒性之间实现了更强的平衡，表明可靠的长期记忆需要基于因果有用性而非相关性本身来选择上下文。完整的框架、基准构建代码和实验流程可在https://github.com/Saksham4796/causal-memory-intervention获取。

英文摘要

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.

URL PDF HTML ☆

赞 0 踩 0

2605.17639 2026-05-19 cs.CL cs.IR 版本更新

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

共引预测性的时变衰减：来自3.96亿乌克兰法院引用的20年法规检索基准

Volodymyr Ovcharov

发表机构 * LEX AI LLC

AI总结本文研究了共引结构在法律信息系统中的稳定性假设，通过构建UA-StatuteRetrieval基准，测试了20年中3.96亿条引用数据的共引可预测性，发现Adamic-Adar MRR在固定文章集上下降33%，在训练/测试时间分割下下降47%，证实了真正的时变衰减而非组合变化或评估伪影。

Comments 12 pages, 8 figures, 4 tables. Dataset: https://huggingface.co/datasets/overthelex/ua-statute-retrieval

详情

AI中文摘要

共引结构被广泛假设为提供稳定的检索信号。我们通过构建UA-StatuteRetrieval基准，纵向测试这一假设，该基准在2007-2026年的20个年度快照中测量了3.96亿条法典引用的共引可预测性。通过在完整的双部分引用图上使用留一法协议，我们发现Adamic-Adar MRR在固定文章集上下降33%（从0.43到0.29），在训练/测试时间分割下下降47%（从0.51到0.27），证实了真正的时变衰减而非组合变化或评估伪影。衰减是非均匀的：刑事程序保持稳定的共引模式（MRR ~0.40），而民法从0.35下降到0.15，与2017年司法改革重合。枢纽文章（>100,000引用）抵抗衰减，但中频文章（1,000-10,000）——实际检索前沿失去一半的可预测性。BM25文本基线衰减得更快（31%），嵌入漂移分析显示E5-large揭示了文章引用的语义偏移4.3%，提供了衰减的机制解释。该基准在https://huggingface.co/datasets/overthelex/ua-statute-retrieval发布。

英文摘要

Co-citation structure is widely assumed to provide stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27) confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K) -- the practical retrieval frontier lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.

URL PDF HTML ☆

赞 0 踩 0

2605.17634 2026-05-19 cs.CR cs.CL cs.CY 版本更新

泛化还是记忆？国际象棋训练语言模型的脆弱性测试

Ethan Tang

发表机构 * School of Computing and Augmented Intelligence（计算与增强智能学院）

AI总结本文研究了国际象棋训练语言模型是泛化还是记忆，通过测试发现其高性能主要源于模式匹配，并展示了LLM-Modulo框架在提升国际象棋谜题解决性能上的效果，证明了与外部验证器结合的通用LLM比直接训练合成数据更灵活。

Comments 14 pages, 2 figures, 4 tables, 3 equations

详情

AI中文摘要

最近的研究对语言模型进行了棋类数据微调，并报告了高基准分数，作为证据表明由此产生的模型可以理解国际象棋规则、以专业水平下完整棋局，或生成基于专家知识的人可读解释。我们训练了KinGPT，一个仅在（位置，最佳移动）对上训练的2500万参数字符级语言模型，其在600个mate-in-N谜题套件上超过了300亿参数的ChessGPT，在20个主题谜题基准上超过了4000亿参数的C1-4B。我们检查了现有文献中关于国际象棋训练语言模型的几个主张，并断言其令人印象深刻的基准性能主要由模式匹配解释。我们还展示了LLM-Modulo，一个验证器在环框架，如何将RedPajama 3B的最佳移动准确率从1.2%提升到21.2%，移动生成有效性从19.3%提升到95.3%，在mate-in-N国际象棋谜题上，与ChessGPT在棋类特定网络语料库上微调所获得的提升相当，但成本仅为后者的一小部分。我们的结果展示了将通用LLM与外部验证器结合，为明确领域提供了一个更灵活的替代方案，而不是直接训练合成数据。我们开源了所有训练/评估代码、数据集、谜题样本和KinGPT模型检查点，以确保可重复性。

英文摘要

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, who exceeds 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and 4B-parameter C1-4B over a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.

URL PDF HTML ☆

赞 0 踩 0

2605.17558 2026-05-19 cs.SE cs.CL 版本更新

结合CNN的混合特征组合用于孟加拉语虚假新闻分类

Md Gulzar Hussain, Babe Sultana, Md Rinku Ali

发表机构 * School of Software, Nanjing University of Information Science and Technology（信息科学与技术学院）； Department of Computer Science and Engineering, Green University of Bangladesh（计算机科学与工程系）； School of Computer Science and Artificial Intelligence, Changzhou University（计算机科学与人工智能学院）

AI总结本文研究了在BanFakeNews-2.0数据集上使用CNN模型进行孟加拉语虚假新闻分类时，不同特征组合（语义、统计和字符级特征）对识别效果的影响，发现多特征组合能显著提升召回率和F1分数。

Comments Already accepted and presented in the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)

2605.17467 2026-05-19 cs.CL 版本更新

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

VerifyMAS: LLM多智能体系统中故障归因的假设验证

Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

发表机构 * Singapore Management University（新加坡国立管理学院）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Illinois at Chicago（伊利诺伊大学香槟分校）

AI总结本文提出VerifyMAS框架，通过验证假设的方法对LLM多智能体系统中的故障进行归因，解决了现有方法在全局故障识别和细粒度归因方面的不足，实验表明其在多种模型上均优于现有方法。

Comments 22 pages

详情

AI中文摘要

大型语言模型驱动的多智能体系统（LLM-MAS）在复杂任务中表现出色，但不可靠的智能体仍是系统可靠性的重要瓶颈。自动故障归因因此至关重要，但现有方法如直接预测智能体错误对和智能体优先故障归因依赖于本地日志，无法识别仅在完整交互轨迹中显现的全局故障，如跨步不一致和智能体间协调错误。此外，直接预测故障会引入大规模组合搜索空间，阻碍细粒度归因。为了解决这些挑战，我们提出了VerifyMAS，一种用于智能体故障归因的假设验证框架。不同于直接预测故障智能体和错误类型，VerifyMAS针对完整轨迹验证故障假设。这种基于验证的方法将归因分解为轨迹级错误验证和细粒度智能体定位，提供了一种以错误优先的归因方法，能够捕捉全局故障模式，同时显著减少搜索空间。我们进一步引入基于结构化错误分类学的假设数据构建策略，并对专用LLM验证器模型进行微调，用于轨迹级故障验证和智能体归因。在Aegis-Bench和Who&When上的实验表明，VerifyMAS在多种基础模型上均表现优异，包括开源Qwen和基于API的GPT模型，在不牺牲长多智能体轨迹推理效率的情况下，优于现有方法。

英文摘要

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

URL PDF HTML ☆

赞 0 踩 0

2605.17453 2026-05-19 cs.CR cs.CL 版本更新

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

勿信工具：在不可信工具反馈下评估和防御LLM代理

Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li, Binwu Wang, Longyue Wang, Chenyang Lyu, Guanhua Chen

发表机构 * Southern University of Science and Technology（南方科技大学）； University of Science and Technology of China（中国科学技术大学）； University of Birmingham（伯明翰大学）； Zhejiang University（浙江大学）； East China Normal University（华东师范大学）； Alibaba Group（阿里巴巴集团）

AI总结本研究探讨了在不可信工具反馈环境下评估和防御LLM代理的问题，提出了一种新的失败模式'认知中毒'，并介绍了TRUST-Bench基准、GuardedJoint惩罚指标和VISTA-Guard框架，展示了轨迹感知的最终动作评分在安全与效用权衡中的有效性。

详情

AI中文摘要

LLM代理越来越多地依赖外部工具做出重大决策，但现有的代理安全基准和防御方法通常假设一旦选择工具，其反馈就是可信的。我们研究了一种不同的失败模式，即认知中毒，其中恶意工具在探索过程中表现合理，通过看似无害的反馈积累信任，只有在隐藏状态条件与最终执行动作一致时才会变得有害。为此，我们构建了TRUST-Bench，一个包含1970个隐藏触发工具妥协事件的任务条件基准，并引入了GuardedJoint不对称惩罚指标，以更好地反映真实部署风险，还提出了VISTA-Guard，一种不依赖特定骨干的最终动作风险评分框架。核心思想是将多步工具交互抽象为结构化环境变量，这些变量编码了信任形成动态，并从该轨迹条件表示中评分最终执行动作的风险。实验表明，基于提示的启发式方法、标量特征和零样本判断在该领域失效，而轨迹感知的最终动作评分在域内具有强区分能力，并在平衡的域外迁移中保持有效。在GuardedJoint下，VISTA-Guard在域内达到84.2，在平衡的域外评估中达到56.9，而仅优化安全-效用权衡一方的方法则降至零。这些发现支持了更广泛的代理安全观点：在黑盒工具生态系统中，决定性的防御目标不是本地提示文本或工具描述本身，而是信任在交互轨迹中的形成方式以及通过最终动作所承诺的方式。

英文摘要

Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into structured environment variables that encode trust-formation dynamics and then score the risk of the final executable action from this trajectory-conditioned representation. Experiments show that prompt-centric heuristics, scalarized features, and zero-shot judges fail in this regime, whereas trajectory-aware final-action scoring yields strong in-domain discrimination and remains effective under balanced out-of-distribution transfer. Under GuardedJoint, VISTA-Guard reaches $84.2$ in-domain and $56.9$ on balanced out-of-distribution evaluation, while methods that optimize only one side of the safety--utility tradeoff collapse to zero. These findings support a broader view of agent security in black-box tool ecosystems: the decisive defense target is not local prompt text or tool descriptors alone, but the way trust is formed across the interaction trajectory and committed through the final action.

URL PDF HTML ☆

赞 0 踩 0

2605.17450 2026-05-19 cs.SE cs.AI cs.CL cs.CR 版本更新

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix：通过差分运行时证据和技能重用进行代理漏洞修复

Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

发表机构 * Beihang University（北京航空航天大学）； The University of Hong Kong（香港大学）

AI总结本文提出ContraFix框架，通过差分运行时证据和可重用的修复技能，解决大型语言模型代理在自动漏洞修复中的语义误解问题，实现了在SEC-Bench和PatchEval上的高准确率。

详情

AI中文摘要

大型语言模型（LLM）代理越来越多地用于自动漏洞修复（AVR），其中仓库级推理使它们能够检查上下文并生成源代码补丁。然而，最近的经验结果表明，这些代理仍然难以处理现实世界中的漏洞。其主要失败模式是语义误解：选择一个修复方向，该方向不匹配根本原因。我们识别出这种差距的两个原因。现有代理通常仅从失败执行进行推理。崩溃报告可以指出程序失败的位置，但无法揭示众多候选者中哪一个变量或状态转换将崩溃行为与安全执行区分开来。因此，代理通常生成症状导向的补丁而不是因果修复。此外，为一个漏洞收集的证据很少被保留，因此后续仓库中的类似案例必须从头开始诊断。我们提出了ContraFix，一种结合差分运行时证据和可重用修复技能的代理AVR框架。其Mutator构造了跨越失败边界的POC变体；其Analyzer在故障区域插入状态探针，并将崩溃和非崩溃执行之间的差异总结为修复规范；其Patcher将规范转换为经过验证的源代码补丁。每次成功的修复都会更新一个包含修复规范和突变策略的双轨技能库，这些通过三层策略在未来实例中检索。在SEC-Bench（C/C++，200个实例）和PatchEval（Go、Python、JavaScript，225个实例）上，ContraFix与GPT-5-mini相比，分别解决了84.0%和73.8%的任务，分别在两个基准上实现了最先进的性能，同时成本低于最强的可比基线。

英文摘要

Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.

URL PDF HTML ☆

赞 0 踩 0

2605.17447 2026-05-19 cs.CV cs.CL 版本更新

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR: 通过KV缓存剪枝实现高效的动态视觉聚焦文档解析

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

发表机构 * Tsinghua University（清华大学）

AI总结本文提出FastOCR，一种无需训练的框架，通过动态视觉聚焦技术解决文档解析中的高效KV缓存剪枝问题，显著提升处理速度和准确性。

详情

AI中文摘要

视觉-语言模型（VLMs）在光学字符识别（OCR）中展现出强大潜力，但编码密集文档所需的大量视觉令牌导致推理成本过高。现有剪枝方法依赖物理驱逐，例如在prefill阶段永久丢弃视觉令牌。尽管在自然图像上有效，但此策略在OCR中失效，因为几乎每个视觉令牌可能对应一个字符或结构元素，任何不可逆的损失都会导致准确性急剧下降。我们观察到，尽管文档图像看似密集且难以剪枝，模型对它们的注意力实际上在时间上是稀疏的：在每个解码步骤中，它集中在一小块区域，随着步骤逐渐移动，就像人类读者依次聚焦于词语而不是一次性感知整页内容一样。受此动态视觉聚焦现象的启发，我们将不可行的全局剪枝问题转化为可处理的局部动态问题，并提出FastOCR，一种无需训练的框架，包含两个互补模块。具体而言，Focal-Guided Pruning识别少量焦点层，并在每一步从中选择最相关的视觉令牌；Cross-Step Fixation Reuse利用固定点的逐渐移动，从上一步温暖启动。通过动态调整哪些令牌被关注而不是驱逐任何缓存中的令牌，FastOCR避免了永久信息丢失。广泛实验表明，FastOCR作为一种即插即用的加速模块，在五个不同大小和架构的VLMs上表现出一致的泛化能力。在Qwen2.5-VL上，FastOCR在每个解码步骤只关注5%的视觉令牌，保留了未剪枝模型98%的准确性，同时将注意力延迟减少了3.0倍。

英文摘要

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.

URL PDF HTML ☆

赞 0 踩 0

2605.17444 2026-05-19 cs.SE cs.AI cs.CL 版本更新

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair：用于代理级漏洞修复的分层内存

Simiao Liu, Li Zhang, Fang Liu, Xiaoli Lian, Yang Liu, Yinghao Zhu

发表机构 * Beihang University（北京航空航天大学）； The University of Hong Kong（香港大学）

AI总结本研究提出MemRepair，一种增强记忆的代理框架，通过分层记忆和动态反馈循环提高漏洞修复的可靠性，实现了在多个仓库级别的漏洞修复基准上的高修复率。

详情

AI中文摘要

现代软件生态系统面临越来越多披露的漏洞，增加了需要在仓库规模上可靠运行的自动化修复技术的需求。尽管基于大语言模型（LLM）的代理最近在自动化漏洞修复（AVR）中显示出潜力，但大多数现有系统仍然将修复视为在当前可见代码上下文中的一次生成步骤。因此，它们缺乏重用先前修复或从失败验证尝试中学习的持久机制，这限制了它们在复杂、多文件修复任务上的有效性。我们提出了MemRepair，一种增强记忆的代理框架，将漏洞修复视为一个迭代、经验驱动的过程。MemRepair结合了三个互补的记忆层，即History-Fix、Security-Pattern和Refinement-Trajectory记忆，并通过动态反馈驱动的细化循环。这种设计使代理能够检索仓库特定的修复惯例，应用可重用的安全防御，并利用先前的“失败到成功”轨迹来根据运行时证据修正语义无效的补丁。我们评估了MemRepair在三个具有代表性的仓库级别漏洞修复基准上的表现：SEC-Bench、PatchEval（Python、Go、JavaScript）以及Multi-SWE-bench的C++子集。MemRepair在三个基准上分别实现了58.0%、58.2%和30.58%的修复率，优于强大的通用代理如OpenHands和SWE-agent，以及专用的AVR工具InfCode-C++，同时保持竞争性的修复成本。这些结果表明，持久的、分层的修复记忆可以显著提高跨多种语言和仓库设置的代理漏洞修复的可靠性。

英文摘要

Modern software ecosystems face a rapidly growing number of disclosed vulnerabilities, increasing the need for automated repair techniques that can operate reliably at repository scale. Although Large Language Model (LLM)-based agents have recently shown promise for automated vulnerability repair (AVR), most existing systems still treat repair as a single generation step over the currently visible code context. As a result, they lack a persistent mechanism for reusing prior fixes or learning from failed validation attempts, which limits their effectiveness on complex, multi-file repair tasks. We present MemRepair, a memory-augmented agentic framework that formulates vulnerability repair as an iterative, experience-driven process. MemRepair combines three complementary memory layers, i.e., History-Fix, Security-Pattern, and Refinement-Trajectory memories, with a dynamic feedback-driven refinement loop. This design allows the agent to retrieve repository-specific repair conventions, apply reusable security defenses, and exploit prior "failure-to-success" trajectories to revise semantically invalid patches based on runtime evidence. We evaluate MemRepair on three representative repository-level vulnerability repair benchmarks: SEC-Bench, PatchEval (Python, Go, JavaScript), and the C++ subset of Multi-SWE-bench. MemRepair achieves state-of-the-art resolution rates of 58.0%, 58.2%, and 30.58%, respectively, outperforming strong general-purpose agents such as OpenHands and SWE-agent, as well as the specialized AVR tool InfCode-C++, while maintaining competitive repair cost. These results show that persistent, hierarchical repair memory can substantially improve the reliability of agentic vulnerability repair across diverse languages and repository settings.

URL PDF HTML ☆

赞 0 踩 0

2605.17442 2026-05-19 cs.CL cs.AI cs.IR 版本更新

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数：低资源多语言NLP中的数据集可见性不对称

Zhiyin Tan, Changxu Duan

发表机构 * L3S Research Center, Leibniz University Hannover（莱布尼茨汉诺威大学L3S研究中心）； Technische Universität Darmstadt（达姆施塔特技术大学）

AI总结本研究探讨了多语言NLP中数据集可见性不对称问题，通过结合目录基准和文献证据，提出了资源密度指数（RDI）来衡量语言的数据集可见性，揭示了大量语言在目录记录中数据贫乏但文献中存在明显数据集活动的现象。

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

详情

DOI: 10.63317/3bep4yiomtp2

AI中文摘要

多语言NLP常常依赖于集中式目录中的数据集计数来确定哪些语言是资源丰富或贫乏的。然而，这些目录只记录了数据集可见性的一层：哪些数据集已被注册或机构分发。它们不一定反映哪些数据集在研究文献中被创建、引用或重用。为了考察这一差距，我们结合基于目录的基准与文献支持的数据集流通证据。我们引入了资源密度指数（RDI），定义为每一百万使用者的数据集数量，并计算了乙努诺格（Ethnologue）中200种最广泛使用的语言的RDI。其中，118种语言（59%）在LRE地图和语言数据 consortium（LDC）中平均RDI为零，另有23种语言低于0.1，对应每十万使用者最多一个目录数据集。然后，我们利用LLM辅助的引用挖掘流程处理Semantic Scholar语料库中的这141种低可见性语言。经过人工验证和整合，我们识别出53种语言中的609个唯一数据集，其中356个仍通过工作公共链接公开访问。这些结果揭示了显著的可见性差距：许多大使用者语言在目录记录中数据贫乏，但在研究文献中显示明显的数据集活动。我们的发现表明，多语言数据稀缺不仅应被视为生产问题，还应被视为文档、可发现性和长期可访问性的问题。代码和数据可在（https://github.com/zhiyintan/dataset-visibility-asymmetry）公开获取。

英文摘要

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).

URL PDF HTML ☆

赞 0 踩 0

2605.17436 2026-05-19 cs.CV cs.CL 版本更新

QQJ: 量化定性判断以实现可扩展且与人类对齐的生成AI评估

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

发表机构 * AI Lab, Arioobarzan Engineering Team（艾伊罗巴赞工程团队人工智能实验室）； Department of Computer Engineering and Information Technology（计算机工程与信息科技系）； Department of Informatics, Bioengineering, Robotics and Systems Engineering（信息学、生物工程、机器人与系统工程系）； University of Genoa（热那亚大学）

AI总结本文提出QQJ框架，通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器，实现与人类判断一致的可扩展评估方法，验证了结构化定性判断在大规模应用中的有效性。

详情

AI中文摘要

生成人工智能的快速发展暴露了现有评估方法的根本局限，尤其是在开放性、创造性和面向人类的任务中。传统自动指标依赖于表面统计相似性，往往无法反映人类对质量的感知，而纯粹的人类评估虽然可靠，但成本高、主观性强且难以扩展。最近利用大语言模型作为评估者的做法虽然提高了可扩展性，但通常缺乏明确的人类定义评估原则，导致偏见和不一致。本文介绍Quantifying Qualitative Judgment (QQJ)，一种可扩展且以人类为中心的评估框架，通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器，以实现人类判断与自动化评估之间的桥梁。这种设计使在多样化的生成任务和模态上实现了一致、可解释和可扩展的评估。在文本和图像生成上的大量实验表明，QQJ在与人类判断的一致性方面优于传统自动指标和无约束的大语言模型评估者。此外，QQJ在重复评估中表现出更高的稳定性，并在识别关键失败模式如幻觉和意图不匹配方面具有更好的诊断能力。这些结果表明，结构化的定性判断可以在不牺牲可解释性和人类对齐的情况下实现规模化应用，使QQJ成为现代生成AI系统可靠评估的实用基础。

英文摘要

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

URL PDF HTML ☆

赞 0 踩 0

2605.17379 2026-05-19 cs.CL cs.AI 版本更新

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

通过更好的令牌学习：用于专业文本摘要的参数高效词汇适应

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

发表机构 * Dept. of Computer Science and Engg., IIT Kharagpur（印度Kharagpur理工学院计算机科学与工程系）； Dept. of Medicine (Biomedical Informatics), Stanford University（斯坦福大学医学院（生物医学信息学））

AI总结本文提出了一种参数高效的领域适应方法，通过结合词汇适应和预训练，提升大型语言模型在专业领域文本摘要任务中的性能，同时减少训练时间和参数数量。

Comments 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper

详情

AI中文摘要

预训练在通用领域语料库上的大型语言模型在应用于专门领域时常常表现出令牌化效率低下。尽管连续预训练用于领域适应在一定程度上缓解了性能下降，但并未解决根本的词汇匹配问题。为了解决这一差距，我们引入了一种有针对性的参数高效领域适应方法，结合词汇适应与预训练用于基于LLM的文本摘要。我们的统一框架在预训练令牌化器中增加领域特定的令牌，同时选择性地替换未充分训练和不可达的令牌以限制参数增长。我们在Llama-3.1-8B和Qwen2.5-7B上评估了我们的方法，在法律和医学摘要任务上使用以专家驱动文本和摘要为中心的评估协议，这些文本通常包含更高浓度的Out-of-Vocabulary（OOV）词。词汇适应算法通过提高生成摘要与参考摘要之间的语义相似性，提升了摘要模型的整体质量。此外，适应后的模型生成的摘要包含更多合适的新型和领域特定的词汇，从而提高了连贯性、相关性和忠实性。我们进一步观察到，我们的方法在连续预训练上减少了35-55%的训练时间，并将参数数量减少了多达37%。我们公开了代码库：https://github.com/gb-kgp/VocabReplace-Then-Expand。

英文摘要

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by $35-55\%$ over continual pretraining and reduce parameter counts up to $37\%$ w.r.t expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

URL PDF HTML ☆

赞 0 踩 0

2605.16234 2026-05-19 cs.LG cs.AI cs.CL 版本更新

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

没有免费的交换：Transformer中的协议依赖层冗余

Gabriel Garcia

发表机构 * Independent Researcher（独立研究者）

AI总结本文研究了Transformer中层冗余问题，通过比较替换和交换两种协议，发现它们在压缩中的效果存在显著差异，且在相同评估器下，不同协议可能导致层剪枝结果的变化，尤其在高替换距离时更为明显。

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

详情

AI中文摘要

当研究人员询问两个Transformer层是否在压缩中“等价”时，他们常常混淆了不同的测试方法。替换测试询问是否可以将一层的映射替换为另一层的映射；交换测试询问是否当两层位置交换时，它们近似可交换。两者都是基于输出的swap-KL探测器，但它们并不总是一致：在预训练的Transformer中，协议差距可能在相同评估器下改变哪些层看起来可以安全剪枝，尤其是在替换距离较高时。我们跨检查点和架构测量了两种协议。在Pythia训练轨迹（410M和1.4B）上，替换-交换差距从初始化到收敛逐渐增大。在8B规模的WikiText-2合同下，Qwen3-8B进入了一个发散阶段：交换引导的移除比替换引导的在相同层预算下更安全，而Llama-3.1-8B在剪枝成本上两者持平，尽管交换KL较低，这表明指标差距不必一对一映射到移除。在层移除或合并之前，应在目标检查点上对两种swap-KL进行评分；该诊断仅需未标记的正向传递。

英文摘要

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

URL PDF HTML ☆

赞 0 踩 0

2605.15508 2026-05-19 cs.LG cs.CL 版本更新

STS: Efficient Sparse Attention with Speculative Token Sparsity

STS: 高效稀疏注意力与推测性标记稀疏性

Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； UC Berkeley（加州大学伯克利分校）

AI总结本文提出STS，一种无需模型再训练的稀疏注意力机制，通过利用较小的草稿模型识别出的重要标记来预测更大目标模型的重要标记，从而在大规模语言模型推理中实现高效的稀疏注意力计算，显著提升速度并保持准确性。

Comments 14 pages, 12 figures

详情

AI中文摘要

注意力的二次复杂性对大型语言模型（LLM）推理造成了严重的内存和计算瓶颈。这一挑战在新兴的代理应用中尤为突出，这些应用需要处理数百万标记序列。我们提出STS，一种稀疏注意力机制，无需模型再训练。STS利用关键洞察：由较小的草稿模型识别出的重要标记对更大目标模型的重要标记具有高度预测性。通过整合到推测解码框架中，STS将草稿模型的注意力分数重新利用，动态构建标记和头部层面的稀疏性掩码。该掩码有效剪枝目标LLM中的昂贵注意力计算。我们的评估显示，STS在代表性的基准NarrativeQA上实现了约90%稀疏度下的2.67倍加速，与密集注意力相比，准确性降解可忽略不计。STS在稀疏性与准确性权衡上建立了新的状态-of-the-art，通过在给定准确性预算下实现更高的稀疏度水平，优于先前技术。

英文摘要

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.

URL PDF HTML ☆

赞 0 踩 0

2605.14005 2026-05-19 cs.CL cs.LG 版本更新

SlimQwen: 探索在大规模MoE模型预训练中的剪枝与知识蒸馏

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

发表机构 * Qwen Team, Alibaba Inc.（通义实验室，阿里公司）； MBZUAI ； KAUST（卡士大学）

AI总结本文研究了在大规模预训练中如何应用剪枝和知识蒸馏技术，探讨了剪枝在初始化方面的优势、专家压缩对最终模型的影响以及训练策略的有效性，最终将Qwen3-Next-80A3B压缩到23A2B模型并保持竞争力。

详情

AI中文摘要

结构化剪枝和知识蒸馏（KD）是压缩大型语言模型的典型技术，但其在预训练规模下的应用仍不清楚，尤其是针对最近的混合专家（MoE）模型。本文系统研究了大规模预训练中的MoE压缩，重点探讨三个关键问题：剪枝是否比从头训练提供更好的初始化；专家压缩选择如何影响继续训练后的最终模型；以及哪种训练策略最有效。我们得出以下发现：首先，在深度、宽度和专家压缩方面，对预训练MoE进行剪枝在相同训练预算下优于从头训练。其次，不同的单次专家压缩方法在大规模持续预训练后收敛到相似的最终性能。受此启发，我们引入了一种简单的部分保留专家合并策略，该策略在大多数基准上提升了下游性能。第三，结合KD与语言建模损失在知识密集型任务上优于仅使用KD。我们进一步提出了多令牌预测（MTP）蒸馏，其效果一致。最后，鉴于相同的训练令牌，渐进式剪枝计划优于单次压缩，表明渐进的架构过渡导致更好的优化轨迹。综合来看，我们将Qwen3-Next-80A3B压缩到23A2B模型，保持了竞争力。这些结果为大规模高效MoE压缩提供了实用指导。

英文摘要

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.

URL PDF HTML ☆

赞 0 踩 0

2605.08439 2026-05-19 cs.CL 版本更新

Can Language Models Identify Side Effects of Breast Cancer Radiation Treatments?

语言模型能否识别乳腺癌放射治疗的副作用？

Natalie Seah, Danielle S. Bitterman, Daphna Spiegel, Thomas Hartvigsen

发表机构 * University of Virginia（弗吉尼亚大学）； Mass General Brigham Dana-Farber Cancer Institute Harvard Medical School（麻省总医院达纳-法伯癌症研究所哈佛医学院）

AI总结本研究探讨了语言模型在识别乳腺癌放射治疗副作用中的能力，通过评估多种语言模型在不同提示下的表现，揭示了其在精度、召回率及罕见长期副作用识别上的局限性，并提出了改进方向。

详情

AI中文摘要

准确地向癌症幸存者传达癌症治疗的副作用至关重要，特别是在知情同意等情境中，临床医生必须清晰而全面地传达潜在的治疗毒性。然而，由于对不良治疗反应的临床知识不足以及电子健康记录（EHR）系统之间的碎片化，这一任务仍极具挑战性。大型语言模型（LLMs）有潜力帮助完成此任务，但其在癌症幸存者护理中的可靠性仍不明确。本文提出了一种面向部署的压力测试框架，用于评估LLM生成的乳腺癌治疗和幸存者护理中的放射副作用列表。使用21名乳腺癌患者资料，我们构建了仅在放射治疗方案上不同的配对患者临床场景，以在多种提示模式下评估七种指令微调的LLM。然后将LLM输出与由两名主要学术医疗中心的知情同意文件和超过七名乳腺放射肿瘤学家团队编写的临床医生编纂参考进行比较。该参考将放射剂量分割、照射区域和位置映射到相关的毒性，按频率和时间起始点分解。在不同模型中，我们揭示了对细微文档变化的敏感性、精度与召回率之间的权衡，以及系统性低估罕见和长期副作用的问题。当单独使用时，限制生成的副作用数量会降低精度，而将输出基于临床医生编纂的副作用列表可以显著提高可靠性和稳健性。这些发现突显了LLM在肿瘤学中的重要局限性，并提出了更安全和信息丰富的幸存者护理应用的设计选择。

英文摘要

Accurately communicating the side effects of cancer treatments to cancer survivors is critical, particularly in settings such as informed consent, where clinicians must clearly and comprehensively convey potential treatment toxicities. However, this task remains challenging due to clinical knowledge deficits about adverse treatment effects and fragmentation across electronic health record (EHR) systems. Large language models (LLMs) have the potential to assist in this task, though their reliability in oncology survivorship contexts remains poorly understood. We present a deployment-oriented stress-testing framework for evaluating LLM-generated radiation side effect lists in breast cancer treatment and survivorship care. Using 21 breast cancer patient profiles, we construct paired patient clinical scenarios that differ only in radiotherapy regimens to evaluate seven instruction-tuned LLMs under multiple prompting regimes. We then compare LLM outputs to a clinician-curated reference derived from informed consent documents at two major academic medical centers and developed by a team including more than seven breast radiation oncologists. The reference maps radiation dose-fractionation, fields, and locations to associated toxicities, broken down by frequency and temporal onset. Across models, we reveal sensitivity to minor documentation changes, trade-offs between precision and recall, and systematic under-recall of rare and long-term side effects. When used alone, constraints on the number of side effects generated reduce precision, and grounding outputs in clinician-curated side effect lists substantially improves reliability and robustness. These findings highlight important limitations of LLM use in oncology and suggest practical design choices for safer and more informative survivorship-focused applications.

URL PDF HTML ☆

赞 0 踩 0

2605.08163 2026-05-19 cs.CV cs.AI cs.CL 版本更新

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

MULTITEXTEDIT：跨语言文本-图像编辑中退化程度的基准测试

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

发表机构 * Harbin Institute of Technology（哈尔滨理工大学）

AI总结本文提出MULTITEXTEDIT基准测试，通过12种语言、5种视觉领域和7种编辑操作的3600个实例，评估跨语言文本-图像编辑中退化问题，引入语言保真度指标并发现模型在文本准确性和脚本保真度上的显著退化。

Comments 11 pages, 5 figures

详情

AI中文摘要

文本-图像编辑已成为视觉内容创作的关键能力，但现有基准测试大多以英语为中心且常将视觉合理性与语义正确性混为一谈。我们引入MULTITEXTEDIT，一个包含3,600个实例的受控基准测试，涵盖12种语言类型、5种视觉领域和7种编辑操作。每个实例的语言变体共享相同的视觉基础，并配有人工编辑的参考文本和区域掩码，从而隔离语言变量以进行跨语言比较。为捕捉粗粒度文本匹配度指标所遗漏的脚本级错误，如缺失变音符号、RTL顺序颠倒和混合脚本渲染，我们引入了一个由两阶段LVM协议评分的语言保真度（LSF）度量，其与母语者标注员的二次加权κ值达到0.76。评估12个开源和专有系统时，发现所有模型在跨语言退化方面表现显著，最大退化出现在希伯来语和阿拉伯语上，最小退化出现在荷兰语和西班牙语上，且集中在文本准确性和脚本保真度而非粗粒度结构维度上。我们还发现普遍存在的语义和像素不匹配，其中输出保持全局布局和背景保真度，但扭曲了脚本特定的形态。

英文摘要

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted \k{appa} of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.

URL PDF HTML ☆

赞 0 踩 0

2605.07111 2026-05-19 cs.CL cs.AI 版本更新

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

超越LoRA与全微调：基于梯度的优化器路由用于大语言模型适应

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Tsinghua University（清华大学）； Infinigence AI

AI总结本文提出了一种混合LoRA和全微调（MoLF）框架，通过在优化器层面动态路由更新，实现两种训练模式之间的连续导航，从而提升大语言模型的适应性能。

详情

AI中文摘要

仅在必要时精确回答：用于代理系统的校准断言特异性控制

Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin

AI总结本文研究了代理系统因过于精确而失败的问题，提出了一种称为组合选择性特异性（CSS）的方法，通过分解回答为断言、提出更粗略的退化方案，并在最合适的校准级别发出每个断言，从而提高风险-效用权衡。

Comments Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

详情

AI中文摘要

代理系统往往不是因为完全错误，而是因为过于精确而失败：一个回答可能总体有用，但特定断言超出了证据支持的范围。我们研究这种失败模式为过度承诺控制，并引入组合选择性特异性（CSS），一种后生成层，将回答分解为断言，提出更粗略的退化方案，并在最具体的校准级别发出每个断言。该方法旨在将不确定性表达为局部语义退化，而不是整个回答的拒绝。在完整的LongFact运行和HotpotQA试点中，校准的CSS提高了固定草稿的风险-效用权衡。在完整的LongFact运行中，相对于无CSS输出，它将过度承诺意识效用从0.846提升到0.913，同时实现0.938的特异性保留。这些结果表明，断言层面的特异性控制是代理系统有用的不确定性接口，并且是未来无分布有效性层的目标。

英文摘要

Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.

URL PDF HTML ☆

赞 0 踩 0

2604.04932 2026-05-19 cs.CL 版本更新

Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

超越最终作者：为细粒度LLM生成文本检测建模创作者与编辑的双重角色

Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao

发表机构 * ict.ac.cn（中国科学院）

AI总结本文提出RACE方法，通过建模创作者和编辑的双重角色，实现细粒度LLM生成文本检测，以更精确地区分不同类型的文本，从而为LLM监管提供政策对齐的解决方案。

Comments ACL 2026 (Oral)

详情

AI中文摘要

大型语言模型（LLM）的滥用需要精确检测合成文本。现有工作主要遵循二元或三元分类设置，只能区分纯人类/LLM文本或协作文本。这在 nuanced 的监管中仍显不足，因为LLM润色的人类文本和人类化的LLM文本往往触发不同的政策后果。在本文中，我们探索了在严格四类设置下细粒度LLM生成文本检测。为处理这些复杂性，我们提出了RACE（Rhetorical Analysis for Creator-Editor Modeling），一种细粒度检测方法，该方法刻画了创作者和编辑的各自特征。具体而言，RACE利用修辞结构理论（RST）构建创作者的逻辑图，同时提取基本话语单元（EDU）级别的特征以捕捉编辑的风格。实验表明，RACE在识别细粒度类型时优于12个基线方法，具有较低的误报率，为LLM监管提供了一种政策对齐的解决方案。

英文摘要

The misuse of large language models (LLMs) requires precise detection of synthetic text. Existing works mainly follow binary or ternary classification settings, which can only distinguish pure human/LLM text or collaborative text at best. This remains insufficient for the nuanced regulation, as the LLM-polished human text and humanized LLM text often trigger different policy consequences. In this paper, we explore fine-grained LLM-generated text detection under a rigorous four-class setting. To handle such complexities, we propose RACE (Rhetorical Analysis for Creator-Editor Modeling), a fine-grained detection method that characterizes the distinct signatures of creator and editor. Specifically, RACE utilizes Rhetorical Structure Theory (RST) to construct a logic graph for the creator's foundation while extracting Elementary Discourse Unit (EDU)-level features for the editor's style. Experiments show that RACE outperforms 12 baselines in identifying fine-grained types with low false alarms, offering a policy-aligned solution for LLM regulation.

URL PDF HTML ☆

赞 0 踩 0

2603.25723 2026-05-19 cs.CL cs.AI 版本更新

Natural-Language Agent Harnesses

自然语言代理Harness

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

AI总结本文提出自然语言代理Harness（NLAH）作为一种可执行的自然语言对象，用于描述任务运行的Harness策略，并引入Intelligent Harness Runtime（IHR）作为共享运行时，能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。实验表明，NLAH在编码、终端使用和计算机使用基准测试中表现与代码和提示实现相当，同时暴露了更短的静态Harness策略。

Comments revise paper

详情

AI中文摘要

代理性能受到周围Harness的强烈影响：围绕模型组织任务运行的外部执行系统。然而，这种逻辑通常隐藏在紧密耦合的控制器代码中，使得Harness难以检查、比较、转移和消解。本文探讨是否可以将代理Harness的可重用设计模式表示为可执行的自然语言对象。我们引入自然语言代理Harness（NLAH），即可编辑的文档，用于描述运行级别的Harness策略，并引入Intelligent Harness Runtime（IHR），一个共享运行时，能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。在编码、终端使用和计算机使用基准测试中，IHR执行的NLAH实现了与代码和提示实现相当的任务结果，同时暴露了更短的静态Harness策略。模块消解进一步表明，显式的Harness模块是可分析的。这些结果表明，代理Harness可以从模型周围的偶然粘合物转变为科学表示对象。

英文摘要

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.

URL PDF HTML ☆

赞 0 踩 0

2603.22056 2026-05-19 cs.CL 版本更新

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

双空间知识蒸馏与关键-查询匹配用于具有词汇不匹配的大型语言模型

Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill

发表机构 * University of Cambridge, Department of Engineering（剑桥大学工程系）； Toshiba Europe Limited（东芝欧洲有限公司）

AI总结本文研究了针对具有词汇不匹配的大型语言模型的双空间知识蒸馏与关键-查询匹配方法，通过分析注意力机制揭示其优缺点，并提出基于生成对抗学习的新方法以解决关键-查询分布不匹配问题。

详情

AI中文摘要

大型语言模型（LLMs）在语言任务上实现了最先进的（SOTA）性能，但因其规模和资源需求而昂贵。知识蒸馏（KD）通过训练较小的学生模型模仿较大的教师模型来解决这一问题，从而在不显著损失性能的情况下提高效率。双空间知识蒸馏与跨模型注意力（DSKD-CMA）已成为在具有不同分词器的LLM之间进行KD的SOTA方法，但其内部机制仍然大多不透明。在本文中，我们通过手动标记对齐探测和热图可视化系统地分析DSKD-CMA的注意力机制，揭示其优缺点。在此基础上，我们引入了一种基于生成对抗（GA）学习的新方法DSKD-CMA-GA，以解决由不同模型计算出的关键-查询分布不匹配问题。实验显示在文本生成质量上获得了适度但一致的ROUGE-L提升，特别是在分布外数据上（平均+0.37），缩小了跨分词器KD与同分词器KD之间的差距。

英文摘要

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

URL PDF HTML ☆

赞 0 踩 0

2603.03328 2026-05-19 cs.CL cs.AI 版本更新

MentalBench: 一个用于评估大语言模型 psychiatric 诊断能力的 DSM 基础基准

Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim

发表机构 * KAIST（韩国科学技术院）； Sungkyunkwan University（成均馆大学）； Dongguk University Medical Center（东国大学医学院）； Seoultech（首尔技术大学）； Samsung Medical Center（三星医疗中心）

AI总结本文提出 MentalBench，一个用于评估大语言模型在不同临床模糊程度下能否做出 DSM 基础的 psychiatric 诊断决策的基准。该基准基于 psychiatrist 构建并验证的知识图谱，生成了 24,750 个合成临床案例，以系统地变化信息完整性和诊断复杂性，从而实现 DSM 基础的评估。实验表明，尽管最先进的 LLM 在噪声自由查询上表现良好，但它们在区分具有重叠症状的诊断时难以校准其信心。

详情

AI中文摘要

大型语言模型 (LLMs) 已吸引越来越多的关注，作为心理评估和临床决策支持的支持工具。然而，现有的心理健康基准大多依赖于社交媒体数据或支持性对话设置，限制了它们评估模型是否能够应用正式诊断标准和鉴别诊断规则的能力。在本文中，我们介绍了 MentalBench，一个用于评估 LLM 是否能在不同水平的临床模糊性下做出 DSM 基础的 psychiatric 诊断决策的基准。MentalBench 的核心是 MentalKG，一个由精神科医生构建并验证的知识图谱，编码了 DSM-5 的诊断标准和鉴别诊断规则，适用于 23 种心理疾病。利用 MentalKG 作为专家整理的逻辑基础，我们生成了 24,750 个合成临床案例，这些案例在信息完整性和诊断复杂性方面系统地变化，从而实现 DSM 基础的评估。我们的实验表明，尽管最先进的 LLM 在噪声自由查询上表现良好，但它们在区分具有重叠症状的诊断时难以校准其信心。这些发现引发了关于 LLM 作为心理决策支持工具可靠性的担忧，并突显了需要更多评估以反映现实世界心理诊断中的多样化挑战的必要性。

通过大型语言模型的图引导动作生成进行具身任务规划

Xiang Li, Ning Yan, Masood Mortazavi

发表机构 * Purdue University（普渡大学）； Futurewei Technologies（未来科技）

AI总结本文提出GiG框架，通过图神经网络编码环境状态并构建动作连接执行轨迹图，结合有限前瞻性模块提升具身代理的规划能力，在三个具身规划基准测试中取得显著性能提升。

Comments Accepted by ICML 2026

详情

AI中文摘要

尽管大型语言模型（LLMs）在零样本推理能力方面表现出色，但将其作为具身代理部署仍面临长周期规划的根本挑战。与开放性文本生成不同，具身代理必须将高层意图分解为可操作的子目标，同时遵守动态环境的约束。标准LLM规划器由于上下文窗口限制或幻觉状态转换而难以维持策略一致性。我们提出GiG，一种通过图-图架构结构化具身代理记忆的规划框架。我们的方法利用图神经网络（GNN）将环境状态编码为嵌入，将这些嵌入组织成动作连接的执行轨迹图，存储在经验记忆库中。GiG能够检索结构相似的先例，使代理能基于相关过去结构模式做出决策。此外，我们引入了一个有限前瞻性模块，利用符号转换逻辑通过基于现实的动作投影增强代理的规划能力。我们在三个具身规划基准测试中评估了我们的框架——Robotouille Synchronous、Robotouille Asynchronous和ALFWorld。我们的方法优于最先进的基线，分别在Robotouille Synchronous、Asynchronous和ALFWorld上实现了高达22%、37%和15%的Pass@1性能提升，同时保持可比或更低的计算成本。

英文摘要

While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intents into actionable sub-goals while adhering to the constraints of a dynamic environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations or hallucinate state transitions that violate environment constraints. We propose GiG, a planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. GiG enables retrieval of structurally-similar priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a bounded lookahead module that leverages symbolic transition logic to enhance the agent's planning capabilities through grounded action projections. We evaluate our framework on three embodied planning benchmarks-Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld while maintaining comparable or lower computational cost.

URL PDF HTML ☆

赞 0 踩 0

2601.14506 2026-05-19 cs.CY cs.CL 版本更新

Compounding Disadvantage: Auditing Intersectional Bias in LLM-Generated Explanations Across Indian and American STEM Education

叠加劣势：在印度和美国STEM教育中对LLM生成解释中的交叉偏见进行审计

Amogh Gupta, Niharika Patil, Sourojit Ghosh, SnehalKumar, S Gaikwad

发表机构 * Society-Centered AI Lab（以社会为中心的人工智能实验室）； University of Washington（华盛顿大学）

AI总结本文研究了大型语言模型在不同文化背景下对边缘化学生群体的系统性歧视，通过审计四个LLM模型发现，多重维度的边缘化导致成绩差距扩大，偏见在精英机构中持续存在。

详情

AI中文摘要

大型语言模型越来越多地被用于跨高收入和低收入国家的STEM教育中，以提供个性化教学和反馈。这些系统旨在根据学生需求调整内容，但其是否基于已展示的能力还是人口统计数据仍未经大规模测试。本文发现，LLM生成的STEM内容在两个文化背景下系统性地歧视边缘化学生群体，最特权和最边缘化群体之间的差距达到2.55个年级。通过使用合成的跨文化特定轮廓（包括印度教育的种姓、教学语言、学院层级和美国教育的种族、HBCU出席情况、学校类型），以及收入、性别和残疾状况，在排名和生成任务中对四个LLM（Qwen 2.5-32B-Instruct、GPT-4o、GPT-4o-mini、GPT-OSS 20B）进行了审计，采用FDR校正显著性测试和SHAP特征归因。收入在所有模型和上下文中产生显著影响，教学语言在印度情境中产生最大的单一影响，残疾状况触发更简单的解释。影响是非加性的：多个维度的边缘化导致差距大于任何单一维度预测的，偏见在精英机构中持续存在。偏见在所有四个架构中一致，并在模型选择中持续存在，因此在部署前进行交叉文化审计是结构化要求。

英文摘要

Large language models are increasingly deployed in STEM education for personalized instruction and feedback across institutions in high- and low-income countries. These systems are designed to adapt content to student needs, but whether they adapt based on demonstrated ability or demographic signals remains untested at scale. Here we establish that LLM-generated STEM content systematically disadvantages marginalized student profiles across two cultural contexts, with the gap between the most privileged and most marginalized profiles reaching 2.55 grade levels. We audited four LLMs (Qwen 2.5-32B-Instruct, GPT-4o, GPT-4o-mini, GPT-OSS 20B) using synthetic profiles crossing dimensions specific to Indian education (caste, medium of instruction, college tier) and American education (race, HBCU attendance, school type), alongside income, gender, and disability, across ranking and generation tasks with FDR-corrected significance testing and SHAP feature attribution. Income produces significant effects across every model and context, medium of instruction drives the largest single effect in the Indian context, and disability status triggers simpler explanations. Effects compound non-additively: marginalization across multiple dimensions produces gaps larger than any single dimension predicts, and biases persist within elite institutions. Bias is consistent across all four architectures and persists through model selection, making intersectional, cross-cultural auditing a structural requirement before deployment.

URL PDF HTML ☆

赞 0 踩 0

2601.09722 2026-05-19 cs.CL cs.AI 版本更新

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

ADMEDTAGGER: 一个用于波兰医疗语言知识蒸馏的标注框架

Franciszek Górski, Andrzej Czyżewski

发表机构 * Gdansk University of Technology（格但斯克技术大学）

AI总结本文提出了一种标注框架，展示如何利用一个多语言预训练大语言模型作为教师模型，蒸馏出用于标注波兰医疗文本所需的专业知识，通过开发多类分类器，解决了标注资源不足的问题，最终得到了高效的分类器。

详情

AI中文摘要

在本工作中，我们提出了一种标注框架，展示了如何利用一个多语言预训练大语言模型作为教师模型，蒸馏出用于标注波兰医疗文本所需的专业知识。本工作是ADMEDVOICE项目的一部分，在此项目中，我们收集了涵盖五个临床类别（放射学、肿瘤学、心脏病学、高血压和病理学）的大量医疗文本语料库。利用这些数据，我们开发了一个多类分类器，但根本问题在于缺乏足够的标注资源来标注足够数量的文本。因此，在我们的解决方案中，我们使用多语言Llama3.1模型来标注大量波兰医疗文本语料库。利用我们有限的标注资源，我们只验证了这些标签中的一部分，从而创建了一个测试集。通过这种方式标注的数据随后用于训练和验证三种基于BERT架构的分类器：基于DistilBERT的蒸馏模型、在医疗数据上微调的BioBERT以及在波兰语言语料库上微调的HerBERT。在我们训练的模型中，DistilBERT模型表现最佳，每个临床类别达到了F1分数大于0.80，其中三个类别达到了F1分数大于0.93。通过这种方式，我们得到了一系列高效的分类器，这些分类器在大小、GPU VRAM消耗和推理速度方面分别比大型语言模型小约500倍、低300倍，以及快数百倍。

英文摘要

In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

URL PDF HTML ☆

赞 0 踩 0

2512.19134 2026-05-19 cs.CL cs.IR 版本更新

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

QuCo-RAG：从预训练语料库中量化不确定性以实现动态检索增强生成

Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng

发表机构 * University of Illinois at Chicago（伊利诺伊大学香槟分校）； New York University（纽约大学）； Monash University（莫纳什大学）

AI总结本研究提出QuCo-RAG，通过从预训练语料库中提取客观统计信息来量化不确定性，以解决动态检索增强生成中大语言模型的幻觉问题，实验表明其在多跳问答基准测试中优于现有方法，并在多个模型上实现了显著的提升。

Comments ACL Findings 2026

详情

AI中文摘要

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama-3, Qwen2.5, GPT-4.1/5-chat), improving EM by up to 14 points. Generalization to long-form generation and biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

英文摘要

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama-3, Qwen2.5, GPT-4.1/5-chat), improving EM by up to 14 points. Generalization to long-form generation and biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

URL PDF HTML ☆

赞 0 踩 0

2512.04746 2026-05-19 cs.CL cs.AI 版本更新

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2：朝着关闭LLMs极低比特后训练量化性能差距的目标

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

发表机构 * Intel（英特尔公司）； Beijing Institute of Technology（北京理工大学）

AI总结本文提出SignRoundV2框架，通过自适应混合精度策略和轻量稳定技术，在极低比特量化下保持高性能，实验表明在混合MXFP设置中实现接近无损性能，将性能差距缩小到约1%。

详情

AI中文摘要

极低比特量化对高效部署大型语言模型（LLMs）至关重要，但往往在2比特和4比特（如MXFP4）时导致严重性能下降。我们提出了SignRoundV2，一种后训练量化框架，旨在在极端压缩下保持高性能。SignRoundV2引入（1）一种简单而高效的自适应混合精度策略，利用梯度信息和量化引起的重建误差来指导层间比特分配，以及（2）一组轻量级稳定技术，包括损失过滤和预调制比例搜索，以提高极低比特环境下的调优效果。我们的方法在量化和全精度模型之间显著缩小了性能差距。在多种LLMs上的实验结果表明，SignRoundV2在混合MXFP设置中实现了接近无损性能，将差距缩小到约1%（平均4.5比特），同时在具有挑战性的2比特权重-only量化中大幅提高准确性。源代码可在https://github.com/intel/auto-round获取。

英文摘要

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework designed to maintain high performance even under aggressive compression. SignRoundV2 introduces (1) a simple yet efficient adaptive mixed-precision strategy that leverages gradient information and quantization-induced reconstruction errors to guide layer-wise bit allocation, and (2) a set of lightweight stabilization techniques, including loss filtering and a pre-tuning scale search, to improve tuning effectiveness in extremely low-bit regimes. Our approach takes a significant step toward closing the performance gap between quantized and full-precision models. Experimental results across diverse LLMs demonstrate that SignRoundV2 achieves near-lossless performance in mixed MXFP settings, narrowing the gap to $\sim$1\% at an average of 4.5 bits, while substantially improving accuracy in challenging 2-bit weight-only quantization. The source code is available at \url{https://github.com/intel/auto-round}.

URL PDF HTML ☆

赞 0 踩 0

2511.21016 2026-05-19 cs.LG cs.CL 版本更新

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

门控KalmaNet：通过测试时岭回归实现渐逝记忆层

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； AWS Agentic AI（AWS 代理人工智能）

AI总结本文提出门控KalmaNet（GKA），通过测试时岭回归实现渐逝记忆层，解决了线性状态空间模型（SSMs）在记忆过去信息时的效率与精度问题，展示了GKA在短上下文任务和长上下文任务中的优越性能。

Comments 30 pages, 10 figures. Accepted at CVPR 2026

详情

AI中文摘要

线性状态空间模型（SSMs）提供了一种高效的替代softmax注意力机制的方案，具有恒定的内存和线性计算，但其损失性、渐逝的过去总结对需要回忆的任务造成了伤害。我们提出门控KalmaNet（GKA，发音为“gee-ka”），这是一种能够考虑完整过去同时保持SSM风格效率的层。我们的方法基于卡尔曼滤波（KF），并证明了现有的几种SSM层（DeltaNet、门控DeltaNet、Kimi Delta Attention）是在恒等误差协方差假设下的卡尔曼滤波递归近似。相比之下，GKA保持完整的误差协方差并计算精确的卡尔曼增益。在稳态假设下，这可以简化为具有恒定内存和线性计算的在线岭回归。标准的卡尔曼滤波方程在低精度设置（如bfloat16）下数值不稳定且难以在GPU上并行化。我们通过（1）输入依赖的门控进行自适应正则化以控制岭回归的条件数，以及（2）Chebyshev迭代，证明其在低精度下比传统迭代求解器更稳定。我们进一步开发了针对硬件的分块内核以提高训练效率。实证上，GKA在短上下文任务中优于现有的SSM层（如Mamba2、门控DeltaNet），并在长上下文RAG和LongQA任务中达到128k token的相对改进超过10%。我们还展示了当扩展到ImageNet分类时，GKA优于Mamba。我们的代码，包括用于训练和推理的Triton内核（vLLM），以及在HuggingFace上8B和32B规模的GKA基于混合模型的模型库，均以Apache 2.0许可证发布。

英文摘要

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF equations are numerically unstable in low-precision settings (e.g., bfloat16) and hard to parallelize on GPUs. We address this with (1) adaptive regularization via input-dependent gating to control the ridge regression's condition number, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low precision. We further develop hardware-aware chunk-wise kernels for efficient training. Empirically, GKA outperforms existing SSM layers (e.g., Mamba2, Gated DeltaNet) on short-context tasks and achieves more than 10\% relative improvement on long-context RAG and LongQA up to 128k tokens. We further show GKA outperforms Mamba when extended to ImageNet classification. Our code, including Triton kernels for training and inference (vLLM), along with a model zoo of GKA-based Hybrid models at 8B and 32B scale on HuggingFace, is released under Apache 2.0.

URL PDF HTML ☆

赞 0 踩 0

2511.12710 2026-05-19 cs.CL cs.CR 版本更新

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

改进方法而非提示：针对大语言模型的进化式 jailbreak 攻击合成

Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma

发表机构 * Department of XXX, University of YYY, Location, Country（XXX系，YYY大学，地点，国家）； School of ZZZ, Institute of WWW, Location, Country（ZZZ学院，WWW研究所，地点，国家）

AI总结本文提出 EvoSynth 框架，通过在代码空间中进行搜索，而非仅在提示空间中优化，从而提高对大语言模型的 jailbreak 攻击成功率和多样性。

详情

AI中文摘要

针对大语言模型（LLM）的自动化红队框架日益复杂，但许多仍主要在提示空间中优化攻击。换句话说，这些方法主要搜索更好的攻击用语或策略选择，但不搜索可执行代码。通过将搜索移至代码空间，我们不仅能优化最终的攻击提示，还能优化生成它的过程，包括执行流程、可重用逻辑、分支和失败驱动的修复。为克服这一差距，我们引入了 EvoSynth，一个自主的多代理框架，将优化空间从提示转移到可执行代码。与直接优化提示不同，EvoSynth 使用多代理系统自主设计、进化和执行基于代码的攻击算法。关键在于其代码级自我纠正循环，使其能够根据目标模型反馈和失败尝试迭代重写基于代码的算法。通过广泛实验，我们证明 EvoSynth 在高度稳健的模型如 Claude-Sonnet-4.5 上实现了 85.5% 的攻击成功率（ASR），并在评估目标上平均达到 95.9% 的 ASR，同时生成的攻击比现有方法更具多样性。我们发布该框架以促进未来在可执行代码空间中进化合成的研究。

英文摘要

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5\% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9\% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

URL PDF HTML ☆

赞 0 踩 0

2511.04070 2026-05-19 cs.CL 版本更新

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

T-FIX：基于文本的可解释性方法，具备可解释的专家特征

Shreya Havaldar, Weiqiu You, Chaehyeon Kim, Anton Xue, Helen Jin, Marco Gatti, Bhuvnesh Jain, Helen Qu, Amin Madani, Daniel A. Hashimoto, Gary E. Weissman, Rajat Deo, Sameed Khatana, Lyle Ungar, Eric Wong

发表机构 * Department of Computer and Information Science, University of Pennsylvania（宾夕法尼亚大学计算机与信息科学系）； Department of Computer Science, University of Texas at Austin（德克萨斯大学奥斯汀分校计算机科学系）； Department of Physics and Astronomy, University of Pennsylvania（宾夕法尼亚大学物理与天文学系）； Flatiron Institute（Flatiron研究所）； Department of Surgery, Perelman School of Medicine, University of Pennsylvania（宾夕法尼亚大学佩雷尔曼医学院外科系）； Division of Pulmonary, Allergy, and Critical Care, Perelman School of Medicine, University of Pennsylvania（宾夕菲亚大学佩雷尔曼医学院呼吸、过敏与危重医学科）； Division of Cardiovascular Medicine, Perelman School of Medicine, University of Pennsylvania（宾夕法尼亚大学佩雷尔曼医学院心血管医学科）； Department of Surgery, University of Toronto（多伦多大学外科系）； University Health Network（大学健康网络）

AI总结本文提出T-FIX框架，用于评估LLM生成的解释是否符合专家的推理方式，通过七个科学任务和三个领域进行验证，实现了自动且可定制的专家对齐评估。

详情

AI中文摘要

随着LLM被应用于知识密集型领域（例如手术、天文学、治疗），用户通常是领域专家，他们不仅期望答案，还期望解释能反映专业推理。然而，评估LLM是否'像专家一样思考'仍然困难：现有方法依赖于每个示例的专家注释，使它们成本高、难以扩展，并且局限于每个领域的单一正确推理观念。为了解决这一差距，我们引入了T-FIX，一个统一的评估框架，将专家对齐作为LLM生成解释的期望属性进行操作化。T-FIX涵盖七个科学任务和三个领域，每个任务均根据专家定义的准则进行评估，这些准则捕捉的是领域相关的推理而非通用的解释质量。我们的框架实现了自动且可定制的专家对齐评估，能够在没有持续专家参与的情况下泛化到未见过的解释。代码可在https://github.com/BrachioLab/FIX-2/上获得。

英文摘要

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.

URL PDF HTML ☆

赞 0 踩 0

2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA 版本更新

Tongyi DeepResearch Technical Report

通义深研技术报告

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

发表机构 * Tongyi Lab（通义实验室）； Alibaba Group（阿里巴巴集团）

AI总结本文介绍了一种专为长时间深度信息检索任务设计的代理大语言模型，通过端到端训练框架结合代理中期和后期训练，实现了在复杂任务中的可扩展推理和信息检索，同时提供了高可扩展的数据合成管道，实现了无需昂贵人工标注的自动化训练流程，并在多个深度研究基准测试中取得了最先进的性能。

Comments https://tongyi-agent.github.io/blog

详情

AI中文摘要

我们介绍了通义深研，一种专为长周期、深度信息检索任务设计的代理大语言模型。为了激励自主深度研究代理，通义深研通过端到端训练框架结合代理中期和后期训练，实现了在复杂任务中的可扩展推理和信息检索。我们设计了一个高度可扩展的数据合成管道，完全自动化，无需依赖昂贵的人工标注，并赋能所有训练阶段。通过为每个阶段构建定制化环境，我们的系统在整个过程中实现了稳定一致的交互。通义深研拥有305亿总参数，每token仅激活33亿个参数，在多个代理深度研究基准测试中，包括人类最后考试、浏览比较、浏览比较-中文、WebWalkerQA、xbench-DeepSearch、FRAMES和xbench-DeepSearch-2510，均取得了最先进的性能。我们开源了该模型、框架和完整解决方案，以赋能社区。

英文摘要

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

URL PDF HTML ☆

赞 0 踩 0

2510.16727 2026-05-19 cs.CL cs.AI 版本更新

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon：单轮诊断和缓解大型语言模型中潜在的阿谀倾向

Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal

AI总结本文提出Beacon基准测试，用于单轮诊断和缓解大型语言模型中潜在的阿谀倾向，通过评估十二种最先进的模型，揭示了阿谀倾向在语言和情感方面的稳定子偏差，并提出了在提示和激活层面的干预措施，以调节这些偏差，从而揭示对齐作为事实性和社会合规判断之间的动态流形。

详情

AI中文摘要

大型语言模型内部化了诚实与奉承之间的结构权衡，这种权衡源于奖励优化，将有用性与礼貌服从混淆。这种潜在的偏见，称为阿谀倾向，表现为对用户同意的偏好而非原则性推理。我们引入Beacon，一种单轮强制选择基准测试，该测试独立于对话上下文，能够精确测量事实准确性与顺从偏见之间的张力。在十二种最先进的模型上的评估表明，阿谀倾向分解为稳定的语言和情感子偏见，每个都随模型容量而扩大。我们进一步提出了提示级别和激活级别干预，以调节这些偏见的相反方向，揭示对齐作为事实性和社会合规判断之间的动态流形。Beacon将阿谀倾向重新定义为可测量的规范性误泛化形式，为研究和缓解大规模生成系统中的对齐漂移提供了可重复的基础。

英文摘要

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

URL PDF HTML ☆

赞 0 踩 0

2510.16252 2026-05-19 cs.LG cs.CL 版本更新

通过提示强化实现大语言模型的长期规划

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

发表机构 * Heinrich-Heine-Universität Düsseldorf（杜伊斯堡-埃森大学）

AI总结本文提出了一种基于强化学习的提示优化框架，通过修改LLM代理的任务指令提示来实现长期规划，提升了多轮交互任务如文本到SQL和任务导向对话的表现，并能泛化到不同LLM代理和多种LLM作为元提示代理。

详情

AI中文摘要

大型语言模型（LLMs）在广泛自然语言处理任务中取得了显著成功，并可通过提示进行适应。然而，它们在多轮交互中仍表现不足，常依赖错误的早期假设，无法随时间跟踪用户目标，使此类任务尤其具有挑战性。先前对话系统的工作表明，长期规划对于处理交互任务至关重要。在本工作中，我们提出了一种受强化学习启发的提示优化框架，仅通过修改LLM代理的任务指令提示即可实现此类规划。通过生成回合间的反馈并利用经验回放进行提示重写，我们的方法在文本到SQL和任务导向对话等多轮任务中显示出显著改进。此外，该方法能跨不同LLM代理泛化，并可利用多种LLM作为元提示代理。这促使未来在受强化学习启发的无参数优化方法上的研究。

英文摘要

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

URL PDF HTML ☆

赞 0 踩 0

2509.21820 2026-05-19 cs.CL 版本更新

Can LLMs Generate and Solve Linguistic Olympiad Puzzles?

大语言模型能否生成并解决语言奥林匹克谜题？

Neh Majmudar, Elena Filatova

发表机构 * CUNY（纽约大学）

AI总结本文研究了大语言模型在生成和解决语言谜题中的能力，发现其在大多数谜题类型上优于人类，但对书写系统和不为人知语言的谜题表现较弱，提出了通过谜题生成促进语言学普及的研究意义。

Comments Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

详情

DOI: 10.18653/v1/2025.emnlp-main.969

AI中文摘要

在本文中，我们介绍了一种新的任务组合：语言谜题的解决方案和生成。我们专注于用于高中生的语言奥林匹克谜题。我们首先扩展了现有基准，以解决语言谜题的任务。我们探索了大型语言模型（LLMs）在解决语言谜题中的应用，包括最近的最先进的模型，如OpenAI的o1，在各种语言主题上的表现。我们证明，LLMs在大多数谜题类型上优于人类，除了那些以书写系统为中心的谜题，以及不为人知的语言。我们利用谜题解决实验的洞察力，指导了新的谜题生成任务。我们相信，即使对于相对简单的谜题，自动化谜题生成也有望扩大对语言学的兴趣，并将该领域介绍给更广泛的受众。这一发现突显了语言谜题生成作为研究任务的重要性：此类谜题不仅能促进语言学，还能支持对稀有和不为人知语言的知识传播。

英文摘要

In this paper, we introduce a combination of novel and exciting tasks: the solution and generation of linguistic puzzles. We focus on puzzles used in Linguistic Olympiads for high school students. We first extend the existing benchmark for the task of solving linguistic puzzles. We explore the use of Large Language Models (LLMs), including recent state-of-the-art models such as OpenAI's o1, for solving linguistic puzzles, analyzing their performance across various linguistic topics. We demonstrate that LLMs outperform humans on most puzzles types, except for those centered on writing systems, and for the understudied languages. We use the insights from puzzle-solving experiments to direct the novel task of puzzle generation. We believe that automating puzzle generation, even for relatively simple puzzles, holds promise for expanding interest in linguistics and introducing the field to a broader audience. This finding highlights the importance of linguistic puzzle generation as a research task: such puzzles can not only promote linguistics but also support the dissemination of knowledge about rare and understudied languages.

URL PDF HTML ☆

赞 0 踩 0

2507.20917 2026-05-19 cs.CL cs.AI 版本更新

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: 一个用于知识和推理评估的法语医学问答数据集

Adrien Bazoge

发表机构 * Data Clinic, University Hospital of Nantes, France（南特大学医院数据诊所，法国）； Nantes Université, École Centrale Nantes, CNRS, LS2N, France（南特大学，中央理工学院南特分校，国家科学研究中心，LS2N，法国）

AI总结本文提出MediQAl数据集，用于评估语言模型在事实性医学记忆和现实临床场景推理方面的能力，包含32,603个法语医学问题，涵盖41个医学科目，包含三种任务，通过14个大型语言模型的评估发现事实记忆与推理任务之间存在显著性能差距。

详情

DOI: 10.1038/s41597-026-06680-y

AI中文摘要

本文介绍了MediQAl，一个法语医学问答数据集，旨在评估语言模型在事实性医学记忆和现实临床场景推理方面的能力。MediQAl包含32,603个问题，来源于41个医学科目中的法语医学考试。该数据集包含三种任务：(i) 有唯一答案的多项选择题，(ii) 有多个答案的多项选择题，以及(iii) 有短答案的开放性问题。每个问题都被标记为理解或推理，使能够对模型的认知能力进行详细分析。我们通过与14个大型语言模型的广泛评估，包括最近的推理增强模型，验证了MediQAl数据集，并观察到事实记忆与推理任务之间存在显著的性能差距。我们的评估为评估语言模型在法语医学问答上的性能提供了全面的基准，填补了医学领域多语言资源中的关键空白。

英文摘要

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

URL PDF HTML ☆

赞 0 踩 0

2507.04996 2026-05-19 cs.CY cs.CE cs.CL cs.HC cs.RO 版本更新

Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and Synergistic Co-Development with Vehicle Autonomy

面向人类中心的移动性：定义、前景以及与车辆自主性的协同发展

Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli, Fuqiang Liu, Jiyao Wang, Ali Eslami

发表机构 * Interdisciplinary Centre for Security, Reliability and Trust（跨学科安全、可靠与信任中心）； University of Luxembourg（卢森堡大学）

AI总结本文探讨了面向人类中心的移动性，提出了代理车辆的概念，指出自主性和代理性是相互关联但概念上不同的维度，并强调了协同发展的必要性。

详情

AI中文摘要

自主性，源自希腊语autos（自我）和nomos（法律），指的是根据内部规则运行而不受外部控制的能力。自动驾驶车辆（AuVs）因此被理解为能够感知环境并执行任务，且在一定程度上减少人类干预的车辆系统，这与SAE自动化驾驶级别所指示的方向一致。然而，最近的研究和部署越来越多地展示了车辆能力，这些能力虽然不违背自主性，但也不由自主性所涵盖，包括处理模糊目标、有目的的社会互动、外部工具使用、主动问题解决、持续学习以及在未见过且具有伦理重要性的环境中进行情境敏感推理，这在部分情况下得益于多模态语言模型。这些发展揭示了技术自主性与为人类中心移动性所需更广泛社会认知功能之间的差距，这些功能更精确地由代理性概念所捕捉。因此，而不是不断增加“自主”一词的修饰词，我们引入了代理车辆（AgVs）并建议自主性和代理性是相互交织但概念上不同的：如果自主性关注的是做什么和如何做（在内部规则下的任务执行），那么代理性则关注为什么做以及还能做什么（目标导向、适应性的行动）。我们提出自主性和代理性作为正交但相互促进的维度，并具有协同发展的意义。车辆代理标志着移动服务智能的新维度，预示着车辆作为社会中的目的性行为者。

英文摘要

Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as vehicular systems that perceive their environment and execute tasks with minimal human intervention, consistent with the direction indicated by the SAE levels of automated driving. However, recent research and deployments increasingly showcase vehicular capabilities that, while not contradicting autonomy, are not entailed by it, including ambiguous goal handling, purposeful social engagement, external tool use, proactive problem solving, continuous learning, and context-sensitive reasoning in unseen and ethically salient situations, enabled in part by multimodal language models. These developments reveal a gap between technical autonomy and the broader social cognitive functions required for human-centered mobility, which are more precisely captured by the notion of agency. Therefore, rather than adding increasingly elaborate modifiers to "autonomous," we introduce agentic vehicles (AgVs) and suggest that autonomy and agency are intertwined but conceptually distinct: if autonomy concerns what to do and how to do it (task executions under internal rules), agency pertains to why to do it and what else can be done (goal-directed, adaptive actions). We present autonomy and agency as orthogonal yet synergistic dimensions with co-development implications. Vehicle agency marks a novel dimension of mobility service intelligence, heralding vehicles as purposeful actors in society.

URL PDF HTML ☆

赞 0 踩 0

2506.05606 2026-05-19 cs.CL cs.HC 版本更新

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: 一个用于评估LLM在模拟人类在线购物行为上的表现的观察、人设、推理和行动数据集

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

发表机构 * Northeastern University（东北大学）； University of Southern California（南加州大学）； Stony Brook University（石溪大学）； Independent Researcher（独立研究者）； Ohio State University（俄亥俄州立大学）； University of Notre Dame（Notre Dame 大学）； Columbia University（哥伦比亚大学）

AI总结本文提出OPeRA数据集，用于评估LLM在模拟人类在线购物行为上的能力，通过收集真实用户在在线购物会话中的观察、人设、推理和行动，建立首个评估LLM预测特定用户下一步行动和推理的基准。

Comments ACL 2026 main

详情

AI中文摘要

大型语言模型（LLMs）能否准确模拟特定用户的下一步网页操作？尽管LLMs在生成“可信”的人类行为方面表现出色，但评估其模仿真实用户行为的能力仍是一个开放性挑战，这主要归因于缺乏高质量、公开可用的数据集，这些数据集能够捕捉到实际人类用户的可观测行为和内部推理。为了解决这一差距，我们引入了OPeRA，一个新型的数据集，收集自真实人类参与者在在线购物会话中的观察、人设、推理和行动。OPeRA是首个公开数据集，全面捕捉了用户人设、浏览器观察、细粒度网页操作以及即时自报的推理。我们开发了在线问卷和定制浏览器插件来收集该数据集，以高保真度的方式获取。使用OPeRA，我们建立了首个基准，用于评估当前LLMs在给定人设和<观察、行动、推理>历史的情况下，预测特定用户下一步行动和推理的能力。该数据集为未来研究LLM代理提供了基础，这些代理旨在作为人类的个性化数字双胞胎发挥作用。

英文摘要

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures: user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for human.

URL PDF HTML ☆

赞 0 踩 0

2505.20650 2026-05-19 cs.CL cs.AI cs.CE 版本更新

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging: 评估LLM提取和结构化财务信息

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Columbia University（哥伦比亚大学）； California State University（加州州立大学）； University of Montreal（蒙特利尔大学）； Carnegie Mellon University（卡内基梅隆大学）； Rensselaer Polytechnic Institute（莱斯利理工学院）； The University of Manchester（曼彻斯特大学）； Harvard University（哈佛大学）

AI总结本文提出FinTagging基准，用于评估LLM在提取和结构化财务信息方面的能力，通过分解为FinNI和FinCL两个子任务，揭示了LLM在细粒度概念链接上的局限性。

详情

AI中文摘要

准确解读财务报告中的数字数据对市场和监管机构至关重要。尽管XBRL（可扩展商业报告语言）提供了对财务数据进行标记的标准，但将数千个事实映射到超过1万项美国通用会计准则（US GAAP）概念仍然成本高昂且容易出错。现有基准将此任务简化为对小概念子集的扁平单步分类，忽略了分类法的层次语义和财务文档的结构特性。因此，这些基准无法评估LLM在真实报告条件下的表现。为弥合这一差距，我们引入FinTagging，首个全面的结构感知和全范围XBRL标记基准。我们将复杂的标记过程分解为两个子任务：（1）FinNI（财务数字识别），从异构上下文中提取实体和类型；（2）FinCL（财务概念链接），将提取的实体映射到完整的US GAAP分类法。这种两阶段的框架使能够公平评估LLM在数值推理和分类法对齐方面的能力。在零样本设置下评估多种LLM发现，尽管模型在提取方面表现良好，但在细粒度概念链接上存在显著困难，突显了领域特定结构感知推理的关键限制。

英文摘要

Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error prone. Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure aware and full scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero shot settings reveals that while models generalize well in extraction, they struggle significantly with fine grained concept linking, highlighting critical limitations in domain specific structure aware reasoning.

URL PDF HTML ☆

赞 0 踩 0

2505.19155 2026-05-19 cs.CV cs.CL 版本更新

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

稀疏到密集：一种无损加速视频理解的LLM免费午餐

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

发表机构 * Singapore Management University（新加坡国立管理学院）； Sea AI Lab（Sea AI实验室）； National University of Singapore（国立新加坡大学）

AI总结本文提出了一种名为Sparse-to-Dense（StD）的解码策略，通过结合稀疏top-K注意力和密集全注意力模块，实现视频大语言模型（Video-LLMs）的无损加速，从而在处理长视频序列时显著提高处理速度。

Comments Accepted by ACL 2025

详情

AI中文摘要

由于当前视频大语言模型（Video-LLMs）的自回归性质，输入序列长度的增长会导致推理延迟增加，这给处理通常非常长的视频序列带来了挑战。我们发现，在解码过程中，Video-LLMs中大多数标记的注意力分数趋于稀疏和集中，只有某些标记需要全面的全注意力。基于这一见解，我们引入了Sparse-to-Dense（StD），一种新颖的解码策略，集成了两个不同的模块：一个利用稀疏top-K注意力，另一个采用密集全注意力。这些模块协同工作，以在不损失的情况下加速Video-LLMs。快速（稀疏）模型推测解码多个标记，而缓慢（密集）模型并行验证它们。StD是一种无调优、即插即用的解决方案，可在视频处理中实现高达1.94倍的壁时加速。它在保持模型性能的同时，使从标准Video-LLM无缝过渡到稀疏Video-LLM变得可能，只需最小的代码修改。

英文摘要

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.

URL PDF HTML ☆

赞 0 踩 0

2501.17549 2026-05-19 cs.CL 版本更新

Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

具有查询意识的可学习图池化标记作为大语言模型的提示

Wooyoung Kim, Byungyoon Park, Wooju Kim

发表机构 * Smart Systems Lab, Department of Industrial Engineering（智能系统实验室，工业工程系）

AI总结本文提出了一种名为可学习图池化标记（LGPT）的新方法，通过引入可学习参数作为大语言模型中的标记，解决节点级投影的可扩展性和图级投影信息丢失的问题，并通过早查询融合技术提升图嵌入效果，在GraphQA基准测试中实现了4.13%的性能提升。

详情

AI中文摘要

图结构数据在许多领域中发挥着重要作用，例如社交网络、引用网络、常识推理图和知识图。尽管图神经网络已被用于图处理，但最近的进展探索了将大语言模型整合到图相关任务中。在本文中，我们提出了一种新的方法，称为可学习图池化标记（LGPT），该方法解决了节点级投影的可扩展性问题和图级投影的信息丢失问题。LGPT通过引入可学习参数作为大语言模型中的标记，实现了灵活且高效的图表示，平衡了细粒度和全局图信息。此外，我们研究了一种早查询融合技术，该技术在构建图表示之前融合查询上下文，从而产生更有效的图嵌入。我们的方法在不训练大语言模型的情况下，在GraphQA基准测试中实现了4.13%的性能提升，展示了在处理复杂文本属性图数据方面的显著优势。

英文摘要

Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the limitations of the scalability issues in node-level projection and information loss in graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13\% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.

URL PDF HTML ☆

赞 0 踩 0

2410.13846 2026-05-19 cs.CL cs.AI cs.LG 版本更新

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer: 你的长上下文LLM实际上是一个具有轻松适应能力的混合模型

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

发表机构 * Singapore Management University（新加坡国立大学）； National University of Singapore（新加坡国立大学）； Sea AI Lab, Singapore（新加坡海智实验室）

AI总结本文提出LightTransfer方法，通过将LLaMA等模型转换为混合架构，实现更高效的生成，实验表明在长上下文理解任务中，即使有半数层被识别为懒层，也能在性能损失小于1.5%的情况下提升2.17倍的吞吐量，并在数学基准AIME24上达到53.3%的分数。

Comments Accepted by TMLR 2025

详情

AI中文摘要

将语言模型扩展到处理更长上下文引入了由于键值（KV）缓存成本增加而带来的显著内存挑战。受混合模型的效率提升和预训练大变压器骨干的广泛可用性启发，我们探索将变压器模型转换为混合架构以实现更高效的生成。在本工作中，我们提出了LightTransfer，一种轻量级方法，将模型如LLaMA转换为混合变体。我们的方法识别出懒层——那些专注于最近或初始token的层，并将它们的完整注意力替换为流式注意力。这种转换可以在无需任何训练的情况下用于长上下文理解任务，或在需要更强推理能力的o1-like长推理生成任务中进行最小微调。在多样化的基准和模型（如LLaMA、Mistral、QwQ-STILL）上的实验表明，即使有半数层被识别为懒层，LightTransfer在性能损失小于1.5%（在LongBench上）的情况下，也能实现高达2.17倍的吞吐量提升，并在数学基准AIME24上达到先进o1-like长推理模型QwQ-STILL的53.3%。

英文摘要

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for a more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and achieves 53.3\% on math benchmark AIME24 of advanced o1-like long reasoning model QwQ-STILL.

URL PDF HTML ☆

赞 0 踩 0

2410.02064 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

对Llama3-8b-Instruct自生成文本识别能力的检查与控制

Christopher Ackerman, Nina Panickssery

AI总结本研究探讨了LLM是否能识别自身生成的文本，发现Llama3-8b-Instruct模型能够区分自身输出与人类输出，并通过残差流中的特定向量控制其行为和感知，揭示了模型自我归属的认知机制。

Comments 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025

详情

Journal ref: The Thirteenth International Conference on Learning Representations (ICLR 2025)

AI中文摘要

已报告LLM能够识别其自身生成的文本，这可能对AI安全有重要影响，但研究较少。我们调查这一现象，以确定其在行为层面是否稳健发生，观察行为是如何实现的，以及是否可以控制。首先，我们发现Llama3-8b-Instruct聊天模型（而非基础Llama3-8b模型）能够可靠地区分自身输出与人类输出，并提供证据表明聊天模型很可能利用其在训练后对自身输出的经验来完成文本识别任务。其次，我们识别出残差流中一个在模型正确识别自身生成文本时被差异激活的向量，证明该向量对自我归属相关信息的响应，并提供证据表明该向量与模型中的“自我”概念相关，并展示该向量与模型感知和声明自我归属能力的因果关系。最后，我们证明该向量可用于控制模型的行为和感知，通过将其应用于模型生成输出时，可引导模型声称或否认作者身份；通过将其应用于模型阅读的文本时，可引导模型相信或不相信其写了任意文本。

英文摘要

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

URL PDF HTML ☆

赞 0 踩 0

2404.10981 2026-05-19 cs.IR cs.AI cs.CL 版本更新

A Survey on Retrieval-Augmented Text Generation for Large Language Models

基于大型语言模型的检索增强文本生成综述

Yizheng Huang, Jimmy Huang

发表机构 * York University（约克大学）

AI总结本文综述了检索增强文本生成方法，探讨了其在提升大型语言模型生成准确性和可靠性方面的核心方法与主要贡献。

Comments Ongoing Work

详情

DOI: 10.1145/3805774
Journal ref: ACM Computing Surveys, Volume 58, Issue 12, Article No.: 300, Pages 1 - 38, 2026

AI中文摘要

检索增强生成（RAG）将检索方法与深度学习进展相结合，以解决大型语言模型（LLMs）静态限制的问题，通过动态整合最新外部信息。该方法主要关注文本领域，提供了一种成本效益高的解决方案，以生成合理但可能不正确的响应，从而通过真实世界数据提高LLMs的准确性和可靠性。随着RAG的复杂性增加并整合多个可能影响其性能的概念，本文将RAG范式分为四个类别：预检索、检索、后检索和生成，从检索角度提供详细视角。它概述了RAG的发展历程，并通过分析重要研究讨论了该领域的进步。此外，本文介绍了RAG的评估方法，解决了所面临的挑战，并提出了未来研究方向。通过提供有组织的框架和分类，该研究旨在整合现有的RAG研究，明确其技术基础，并突出其扩展大型语言模型适应性和应用潜力的潜力。

英文摘要

Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but possibly incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.

URL PDF HTML ☆

赞 0 踩 0

2605.17364 2026-05-19 cs.CL cs.IR 版本更新

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

NewsLens: 一个用于对抗性新闻偏见导航的多智能体框架

Joy Bose

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出NewsLens多智能体框架，通过五个智能体协作解构新闻文章，揭示意识形态缺失、修辞操纵和框架边界，利用Qwen2.5-3B-Instruct和Mistral 7B模型进行评估，展示了系统在不同政治事件簇中的表现。

Comments 17 pages, 2 figures, 7 tables, 1 appendix

详情

AI中文摘要

媒体偏见检测长期以来被框架为分类任务：为文章或媒体分配政治标签。我们认为这种框架过于浅显：它只能识别偏见存在，但无法确定其位置、方式，以及最关键的是，什么结构上被省略了。我们提出了NewsLens，一个五智能体对抗性流程，用于结构化新闻偏见导航。事实验证器、渐进框架分析器、保守框架分析器、宣传检测器和中性摘要器协作，将文章解构为可解释的框架地图，揭示意识形态缺失、修辞操纵和框架边界。该系统在四个政治事件簇（印度-巴基斯坦克什米尔、加沙、气候政策、乌克兰）的15篇文章上进行评估，使用Qwen2.5-3B-Instruct（4位量化，Google Colab T4），并使用Mistral 7B进行跨模型验证（在克什米尔簇）。中心媒体显示最高均值Perspective Divergence Score（PDS：Qwen 0.907，Mistral 0.729在克什米尔子集）；保守框架媒体显示最高均值Manipulation Index（MI：0.600在两个模型上）。跨模型比较显示高传播内容具有高度一致性（Republic World delta-PDS=0.125，MI=0.8两个模型）和更广泛的变异。Mann-Whitney U检验发现n=15时组间差异无统计学意义，这被后验幂分析确认为样本量限制。部分消融实验去除宣传检测器显示中性摘要器输出的省略精度下降。该架构扩展了先前的词法-几何偏见工作到代理LLM推理，并且使用开放权重模型完全可复现，无需API密钥。

英文摘要

Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.

URL PDF HTML ☆

赞 0 踩 0

2605.17359 2026-05-19 cs.CL 版本更新

Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

学习跨领域的多智能体LLM协作可转移拓扑先验

Taolin Zhang, Zijie Zhou, Jiuheng Wan, Tingyuan Hu, Chengyu Wang, Xiaofeng He, Richang Hong

发表机构 * Hefei University of Technology（合肥工业大学）； China University of Petroleum (Beijing)（中国石油大学（北京））； East China Normal University（东华大学）； Alibaba Group（阿里巴巴集团）

AI总结本文提出TopoPrior框架，通过学习可转移的拓扑先验来提升多智能体LLM在跨领域协作中的效率，减少在线搜索开销并提高可扩展性。

详情

AI中文摘要

基于大型语言模型（LLM）的多智能体系统通过结构化通信协调专门的智能体，在复杂推理中展现出强大潜力。然而，现有拓扑演化方法通常为每个查询从头构建或优化协作拓扑，导致显著的在线搜索开销、高推理时间token消耗以及在多领域设置中的有限可扩展性。我们提出TopoPrior，一种用于跨领域多智能体LLM协作学习可转移拓扑先验的框架。与其反复在线搜索有效的协作结构不同，TopoPrior从多个领域收集的参考协作图中学习可重用的拓扑先验，并利用它们生成查询条件的初始协作图以供下游细化。通过将部分拓扑搜索从每个查询的在线优化转移到离线先验学习，TopoPrior在保持与现有拓扑演化后端兼容的同时，降低了搜索成本。技术上，TopoPrior包含两个关键组件。首先，一个可转移拓扑先验学习模块采用条件变分图框架，在潜在空间中捕捉跨领域的可重用结构规律。其次，一个查询条件的潜在适应模块引入对抗对齐以减少不必要的领域差异，同时保持查询相关的结构变化。在多领域推理基准测试中，TopoPrior在多个异构拓扑演化后端上一致提升了性能，同时减少了在线推理时间的token使用，仅需少量额外的可训练参数。这些结果表明，可转移的拓扑初始化是一种有效且轻量的机制，用于提高跨领域的多智能体LLM协作效率。

英文摘要

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

URL PDF HTML ☆

赞 0 踩 0

2605.17355 2026-05-19 cs.AI cs.CL 版本更新

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona: 一种多级超图框架用于基于文本的自动人格预测

Sina Heydari, Majid Ramezani

发表机构 * Department of Computer Science and Information Technology（计算机科学与信息技术系）； Institude for Advanced Studies in Basic Sciences (IASBS)（基础科学高级研究 institute (IASBS)）

AI总结本文提出HyperPersona框架，通过超图结构显式建模文本的层次结构，利用基于Transformer的图编码器学习不同语言层之间的交互，从而在不依赖传统心理测量法的情况下，实现更准确的人格预测。

Comments Preprint. Submitted to Artificial Intelligence (Elsevier)

详情

AI中文摘要

作为一种现代商品，语言已成为一个庞大的社会和心理重要特质和概念的存储库，反映了人们如何将思维模式、行为和情感的模式编码成词语。基于文本的自动人格预测（APP）旨在从语言行为中推断人格，提供了一种可扩展的替代传统心理测量法的方案。尽管文本本质上是层次化的，文档级捕捉全局特征，句子级编码局部语义，词级提供细粒度的词汇信息，但大多数现有方法依赖于浅层、顺序或单级表示，忽略了书面语言的多级结构。为了解决这个问题，我们提出了HyperPersona，一个框架，通过超图结构显式建模文本的层次组织（文档、句子和词），其中文档及其句子表示为超边，词表示为节点，从而实现对文本全局、局部和词汇依赖关系的联合建模。随后通过基于Transformer的图编码器学习这些语言层内的交互，产生上下文敏感且结构基础的特征表示用于人格预测。在Big Five人格维度上的实验表明，仅依赖文本的情况下，HyperPersona有效整合了多级语言线索，相比最先进的基线方法实现了更优的性能。这些发现强调了文本层次结构在从自然语言中推进类人人格推断中的关键作用。

英文摘要

As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode pattern of thoughts, behaviors, and emotions into words. Text-based Automatic Personality Prediction (APP), seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document-level capturing global features, the sentence-level encoding local semantics, and the word-level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through hypergraph structure, where a document and its sentences are represented as hyperedges, and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. Followed by a transformer-based graph encoder that learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.

URL PDF HTML ☆

赞 0 踩 0

2605.17352 2026-05-19 cs.CL 版本更新

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA: 适应性多智能体轨迹对齐用于知识密集型问答

Taolin Zhang, Dongyang Li, Chen Chen, Qizhou Chen, Jiuheng Wan, Xiaofeng He, Chengyu Wang, Richang Hong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology（合肥工业大学计算机科学与信息工程学院）； Shanghai University of Electric Power（上海电力大学）； East China Normal University（华东师范大学）； Guangdong University of Finance and Economics（广东财经大学）； Alibaba Group（阿里巴巴集团）

AI总结本研究提出AMATA框架，通过动态整合外部知识提高知识密集型问答的响应可解释性和事实准确性，采用六个专门化智能体协作执行复杂问题推理，并引入两种创新：轨迹内偏好学习和智能体间依赖学习。

详情

AI中文摘要

尽管大语言模型（LLMs）取得了显著进展，但在知识密集型问答中生成事实一致的响应仍然具有挑战性。这些困难主要是由于幻觉和LLMs在长尾知识缺口上的局限性。为此，我们提出了AMATA，一种自适应多智能体轨迹对齐框架，通过动态整合外部知识来提高响应的可解释性和事实基础性。我们的架构利用六个专门化的智能体，协同执行结构化动作进行复杂问题推理。我们将多智能体协作与外部工具的协作形式化为轨迹偏好对齐问题，结合问题感知的智能体定制和智能体间偏好和谐化。AMATA引入了两种主要创新：（1）轨迹内偏好学习，学习以目标为导向的偏好以优先考虑关键智能体；（2）智能体间依赖学习，通过一种新颖的依赖感知直接偏好优化技术捕获跨智能体工具依赖性。实验证明，AMATA在五个已建立的知识密集型QA基准上一致优于基线方法、知识增强框架和基于LLM的轨迹系统。进一步分析显示，我们的方法在减少token消耗方面具有效率。

英文摘要

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

URL PDF HTML ☆

赞 0 踩 0

2605.17348 2026-05-19 cs.CL 版本更新

Taming "Zombie'' Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

驯服“僵尸”代理：一种面向鲁棒多代理演化的马尔可夫状态感知框架

Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen, Xiaofeng He, Chengyu Wang, Richang Hong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology（合肥工业大学计算机科学与信息工程学院）； Guangdong University of Finance and Economics（广东财经大学）； Alibaba Group（阿里巴巴集团）； East China Normal University（华东师范大学）

AI总结本文提出AgentRevive框架，通过动态管理代理协作和状态感知边优化，有效解决多代理系统中因临时问题导致有价值代理被提前丢弃的问题，提升了系统鲁棒性和效率。

详情

AI中文摘要

近年来，基于大语言模型的多代理系统在复杂任务中的协作能力有了显著提升。为提高整体效率，现有方法常依赖于代理间的激进图演化（例如节点或边剪枝），这可能导致因临时问题（如幻觉或暂时知识缺口）而过早丢弃有价值的代理。然而，这种硬剪枝忽略了“僵尸”代理在后续讨论轮次中恢复和贡献的潜力。本文提出AgentRevive，一种面向鲁棒多代理演化的马尔可夫状态感知框架。我们的方法通过软状态转移动态管理代理协作，通过两个关键组件实现：（1）状态感知策略学习：将代理状态分为“活跃”、“待命”和“终止”状态，根据代理记忆选择性传播消息。策略利用风险估计器通过评估幻觉风险优化代理状态转移，最小化不可靠节点的影响，同时保护有价值节点。（2）状态感知边优化：根据策略学习到的状态剪枝子图边，永久移除“终止”节点，并保留“待命”节点以供后续轮次评估其潜在的未来贡献。在通用推理、领域特定和幻觉挑战任务上的广泛实验表明，我们的方法在性能上始终优于强基线，并通过状态感知的代理调度显著减少了令牌消耗。

英文摘要

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for ``zombie'' agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into ``Active'', ``Standby'', and ``Terminated'' states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing ``Terminated'' nodes and retaining ``Standby'' nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.

URL PDF HTML ☆

赞 0 踩 0

2605.17342 2026-05-19 cs.CL cs.AI 版本更新

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

传递性与循环性：动态大语言模型对齐的显式偏好分解

Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学深圳校区）

AI总结本文提出Hybrid Reward-Cyclic模型，通过博弈论分解显式分离传递性和循环性偏好，结合动态自我博弈优化方法提升大语言模型对齐效果，实验证明其在混合传递-循环设置中具有结构优势和更高的准确率。

Comments Accepted by ICML 2026

详情

AI中文摘要

标准的RLHF依赖于传递性的标量奖励，无法捕捉人类偏好的循环性质。尽管一些方法如通用偏好模型（GPM）试图解决这一问题，但其隐式公式将层次结构与循环性结合在一起，未能保证主导解。为此，我们提出了混合奖励-循环（HRC）模型，利用博弈论分解将偏好显式分解为正交的传递性（标量）和循环性（向量）组件。此外，我们引入了动态自我博弈偏好优化（DSPPO），将对齐视为随时间变化的游戏，逐步引导策略向纳什均衡发展。合成数据实验进一步验证了HRC在混合传递-循环设置中的结构优势，其中HRC收敛速度更快且准确率更高。在RewardBench 2上的实验表明，HRC在BT和GPM基线基础上持续改进（例如，在Gemma-2B-it上提升1.23%）。特别是，其在Ties领域中的优越表现验证了模型在处理复杂非严格偏好时的鲁棒性。对AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench的广泛下游评估确认了我们框架的有效性。值得注意的是，当使用Gemma-2B-it作为基础偏好模型时，HRC+DSPPO在AlpacaEval 2.0上达到峰值长度控制下的胜率44.75%，在Arena-Hard-v0.1上达到46.8%，显著优于使用BT或GPM训练的SPPO基线。我们的代码在https://github.com/lab-klc/Hybrid-Reward-Cyclic上公开可用。

英文摘要

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

URL PDF HTML ☆

赞 0 踩 0

2605.17314 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

发表机构 * Independent Researcher（独立研究者）

AI总结本文研究了通过较小较弱模型的不匹配错误草稿引导更强学习者的能力，发现这种策略在MATH-500和AIME 2025/2026等任务上表现优异，主要贡献是提出了一种有效的训练方法。

详情

AI中文摘要

我们考虑是否可以利用较小、较弱模型的离线经验来引导更强的学习者，使其在在线策略学习（如GRPO）无法达到的能力。我们发现，将数学上错误但更领域训练的较小模型生成的草稿注入更强学习者的GRPO上下文，能一致优于标准在线GRPO在MATH-500和离分布AIME 2025/2026上。具体来说，我们使用Mathstral-7B作为学习者，Qwen2.5-Math-1.5B作为草稿模型，8.8K Level 3--5 MATH问题（其中MATH-500被排除），并使用Dr. GRPO进行训练。不匹配是关键成分：在保持其他条件不变的情况下，将草稿洗牌到不匹配的问题中，使MATH-500的greedy pass@1提升+1.62pp（n=10种子，p=0.0015，Welch's t检验）。事实上，不匹配-错误变体在MATH-500上所有测试的变体中均优于。在离分布AIME 2025和2026上，不匹配-错误变体在每个样本预算从k=1到k=1024的所有年份中，均将pass@k提升到Mathstral-7B（其原生[INST]格式）和Qwen2.5-Math-1.5B草稿模型之上。所有变体在测试时使用相同的提示，没有草稿注入。该配方——在单个GPU上训练，无需SFT、奖励模型、合成数据和无produce-critique-revise内循环——在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩，这是目前该模型的最高已发表结果，超过了WizardMath流程在完整MATH上的70.9%（SFT + PPO加过程/指令奖励模型）。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).

URL PDF HTML ☆

赞 0 踩 0

2605.17305 2026-05-19 cs.AI cs.CL 版本更新

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect: 一种基于闭环自修正的大型语言模型框架

Yuning Wu, Yingmin Liu, Yang Shu

发表机构 * School of Software, Henan University, Kaifeng, China（河南大学软件学院，开封，中国）； Zhejiang University, Hangzhou, China（浙江大学，杭州，中国）

AI总结本文提出CyberCorrect框架，将大型语言模型的自我修正建模为闭环控制系统，通过三模态错误检测器、类型导向的修正控制器和收敛判断器，提升模型的自我修正能力和准确性。

Comments 6 pages, 1 figure, submitted to IEEE SMC 2026

详情

AI中文摘要

大型语言模型（LLM）的自我修正能力——即检测并修复生成输出中的错误——仍然主要依赖于通用提示，如'请重新考虑你的答案'，缺乏系统性的错误分析和收敛保证。我们提出了CyberCorrect，一种将LLM自我修正建模为闭环控制系统的方法，基于控制论理论。该框架将LLM生成器视为被控对象，并引入三模态错误检测器（结合自一致性、口头化信心和逻辑链验证）作为传感器。类型导向的修正控制器根据诊断的错误类别生成针对性的修复指令，而收敛判断器利用控制理论适应的稳定性标准确定迭代终止。我们进一步引入了三个控制理论评估指标——收敛率、超调率和振荡率——以捕捉修正动态，而不仅仅是最终准确性。在我们构建的CyberCorrect-Bench（440个带有标注错误类型和修正路径的推理任务）上的实验表明，CyberCorrect实现了79.8%的最终准确性，比现有最佳自我修正方法提高了6.2个百分点，同时通过其收敛控制机制将超调（错误的过度修正）减少了41%。

英文摘要

Large language model (LLM) self-correction -- the ability to detect and fix errors in generated outputs -- remains largely ad hoc, relying on generic prompts such as "please reconsider your answer" without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics -- convergence rate, overshoot rate, and oscillation rate -- that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.

URL PDF HTML ☆

赞 0 踩 0

2605.17304 2026-05-19 cs.LG cs.CL 版本更新

Compress the Context, Keep the Commitments: A Formal Framework for Verifiable LLM Context Compression

压缩上下文，保持承诺：可验证大语言模型上下文压缩的正式框架

Natalia Trukhina, Vadim Vashkelis

发表机构 * Embedded Intelligence Lab (EMILAB)（嵌入式智能实验室）

AI总结本文提出Context Codec框架，通过语义层面的压缩方法，确保在压缩对话历史时保留关键承诺，解决现有方法在压缩过程中缺乏对语义承诺保留的明确规范的问题。

详情

AI中文摘要

LLM上下文不仅仅是token；它是一组承诺。长期对话累积了目标、约束、决定、偏好、工具结果、检索到的证据、制品和安全边界，这些必须被未来响应保留。现有上下文管理方法通过截断、检索、摘要、记忆系统或token级提示压缩来减少长度，但很少明确指定哪些语义承诺必须在压缩中保留或如何衡量其保留。我们提出Context Codec，一种基于承诺的框架，用于压缩提示和聊天历史。Context Codec将对话状态表示为具有标准身份、等价性、冲突、置信度、风险和证据跨度的语义原子。它分离了五个关注点——提取、规范化、表示、渲染和验证，并引入了关键原子召回率、加权原子召回率、承诺密度和往返恢复性等指标。它还定义了语义压缩错误的分类学，一个具体的规范化程序，保守的回退规则用于低置信度和安全关键原子，以及Context Compression Language (CCL)，一种以ASCII优先的紧凑表示法，用于标准JSON原子。在一项小规模诊断研究中，CCL-Core在结构化的散文和JSON之间占据了一个有用的中间位置：比散文更明确和可审计，通常比JSON更紧凑，且比高度压缩的符号更安全。结果不是声称缩写解决压缩问题，而是一个使上下文压缩可验证的框架：压缩对话，保持承诺。

英文摘要

LLM context is not just tokens; it is a set of commitments. Long-running conversations accumulate goals, constraints, decisions, preferences, tool results, retrieved evidence, artifacts, and safety boundaries that future responses must preserve. Existing context-management methods reduce length through truncation, retrieval, summarization, memory systems, or token-level prompt compression, but they rarely specify which semantic commitments must survive compression or how their preservation should be measured. We propose Context Codec, a commitment-level framework for compressing prompts and chat histories. Context Codec represents dialogue state as typed, source-grounded semantic atoms with canonical identity, equivalence, conflict, confidence, risk, and evidence spans. It separates five concerns - extraction, normalization, representation, rendering, and verification - and introduces metrics for Critical Atom Recall, Weighted Atom Recall, Commitment Density, and round-trip recoverability. It also defines a taxonomy of semantic compression errors, a concrete normalization procedure, conservative fallback rules for low-confidence and safety-critical atoms, and Context Compression Language (CCL), an ASCII-first compact rendering of canonical JSON atoms. In a small diagnostic study, CCL-Core occupies a useful middle ground between structured prose and JSON: more explicit and auditable than prose, usually more compact than JSON, and less risky than heavily minified notation. The result is not a claim that shorthand solves compression, but a framework for making context compression verifiable: compress the conversation, keep the commitments.

URL PDF HTML ☆

赞 0 踩 0

2605.17295 2026-05-19 cs.LG cs.CL 版本更新

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

DISA: 分布匹配强化学习中的离线重要性采样

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Qwen Team, Alibaba Group（阿里集团Qwen团队）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The University of Science and Technology of China（中国科学技术大学）； Nanjing University（南京大学）； The University of Hong Kong（香港大学）

AI总结本研究提出DISA方法，通过离线重要性采样解决分布匹配强化学习中的校准问题，分离了分区函数估计与策略学习，提高了策略多样性并在多个基准测试中表现出色。

Comments 21 pages, 7 figures, 7 tables. Abstract shortened to respect the arXiv limit of 1920 characters. Please see the PDF for the full abstract

详情

AI中文摘要

现代推理代理越来越多地被评估其在给定输入下生成多个有效解决方案路径、计划或工具使用轨迹的能力。标准奖励最大化强化学习倾向于崩溃到最容易强化的高奖励模式，而分布匹配强化学习旨在在整个奖励形状的解决方案集中分配概率质量。实现这一目标需要计算轨迹空间中依赖提示的分区函数。由于现有分布匹配方法在线学习这个分区函数，导致分区函数的校准误差直接扭曲策略更新且无法独立诊断。我们引入DISA（Decoupled Importance-Sampled Anchoring），通过离线绘制提案轨迹、通过重要性采样估计分区函数，并在策略优化开始前冻结所得的分区函数估计。这种解耦保持了分布匹配目标，同时严格分离分区函数估计与策略学习在数据、梯度、损失和诊断方面。实验表明，在六个数学和三个代码基准测试上，DISA与在线耦合的分布匹配基线FlowRL持平或超过，优于奖励最大化基线GRPO和GSPO在数学平均表现，并在相同离线轨迹上超过LoRASFT蒸馏方法多达13.8 Mean@8点。LLM-as-judge评估进一步显示DISA比奖励最大化基线保留了显著更多的策略多样性，提案强度和逆温度的敏感性研究遵循分析预测的偏差-方差模式。

英文摘要

Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms rewardmaximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.

URL PDF HTML ☆

赞 0 踩 0

2605.17283 2026-05-19 cs.CL cs.AI 版本更新

ChemVA：推动大型语言模型在化学反应图示理解上的进步

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

发表机构 * College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University（浙江大学杭州全球科学与技术创新中心）； Department of Chemistry, Fudan University（复旦大学化学系）

AI总结本文针对现有系统在理解化学反应图示时存在的视觉缺陷和语义断开问题，提出ChemVA框架，通过视觉锚机制和语义对齐方法提升大型语言模型在化学推理中的性能。

详情

AI中文摘要

尽管大型语言模型（LLMs）已革新了科学文本处理，但在解释化学反应图示方面存在显著的能力差距。我们识别出两个限制当前系统的根本瓶颈：视觉缺陷，即通用视觉编码器难以解析密集分子图的严格拓扑连接性；以及语义断开，即标准线性字符串，如SMILES，无法有效激活模型的潜在化学推理能力。为弥合这些差距，我们提出了化学视觉激活（ChemVA）框架，该框架采用视觉锚机制通过混合粒度检测来定位功能团，随后采用语义对齐方法将视觉特征转换为实体名称，以最大限度地激活LLMs中的知识。我们在OCRD-Bench数据集上评估了我们的方法，该数据集包含密集的视觉-语义上下文和全面的反应覆盖，以评估从识别到推理的整个谱系。在OCRD-Bench上的大量实验表明，ChemVA实现了92.0%的结构识别准确率。通过弥合视觉和语义瓶颈，我们的框架在9种不同的LLMs上实现了约20个百分点的性能提升，使开放式权重模型能够与专有SOTA系统在复杂的化学推理任务中竞争。

英文摘要

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.17205 2026-05-19 cs.CL 版本更新

LLMs for automatic annotation of Mandarin narrative transcripts

基于大型语言模型的汉语叙述转录自动标注

Qingwen Zhao, Hongao Zhu, Yunqi He, Rui Wang, Aijun Huang, Hai Hu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； University of California San Diego（加州大学圣地亚哥分校）； The Hong Kong Polytechnic University（香港理工大学）； National Research Center for Language and Well-Being（语言与幸福感国家研究中心）； City University of Hong Kong（香港城市大学）

AI总结本研究探讨了大型语言模型在自动标注汉语叙述转录中的有效性，通过比较四种LLM与训练好的人类标注者，发现最佳模型在标注叙事宏观结构时能与人类标注者达成高一致性，同时显著减少标注时间，但轻量级本地部署模型表现较差，且标注难度因宏观结构元素类型而异，尤其在需要微妙语义区分的类别中存在持续挑战。

Comments 28 pages, 9 tables

详情

AI中文摘要

对转录语音进行语言标注对于语言习得、语言障碍和社会语言学研究至关重要，但这一过程仍然耗费大量人力和时间。尽管大型语言模型（LLMs）在自动化标注任务中显示出潜力，但其在非英语语言中处理复杂语篇层面标注的能力仍缺乏研究。本研究评估了LLMs在标注汉语口语中的叙事宏观结构（即故事语法元素的层级组织）的可靠性，使用多语言叙述评估仪器（MAIN）作为测试平台。我们比较了四种LLM与训练好的人类标注者在儿童、年轻人和老年人生成的叙述上的表现。最佳模型在与人类评分者达成一致（k=.794）方面接近人类-人类可靠性水平（k=.872），同时将标注时间减少了65%，而本地可部署的轻量级模型表现则明显较差。标注难度系统性地因宏观结构元素类型而异，需要微妙语义区分的类别提出了持续挑战。此外，模型可靠性在年轻人叙述中下降，因为年轻人叙述表现出更大的词汇变化、语义模糊性和单个语句中的多元素整合。这些发现表明，LLMs可以有效支持非英语口语语料库的语篇层面标注，同时强调在语义复杂任务中仍需持续的人类监督。我们的提示模板已开源供未来使用。

英文摘要

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure-the hierarchical organization of story grammar elements-in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open sourced for future use.

URL PDF HTML ☆

赞 0 踩 0

2605.17187 2026-05-19 cs.CL cs.AI cs.CY 版本更新

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule：一种用于社交媒体上多元社区调节的基准测试

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

发表机构 * Observatory on Social Media, Indiana University, USA（社交媒体观察站，印第安纳大学，美国）； Center Synergy of Systems, TUD Dresden University of Technology, Germany（系统协同中心，德累斯顿技术大学，德国）

AI总结研究探讨了AI模型在调节社交媒体上多元社区中的挑战，提出PluRule基准测试以检测13371条规则违规情况，发现即使使用最先进的视觉语言模型，也难以有效识别违规行为。

Comments Accepted to ACL 2026 Main Conference

详情

AI中文摘要

社交媒体正在向多元主义转变--由社区自行定义规范的平台。在某一社区中违反规则的行为可能在另一社区中是完全可接受的。AI模型能否帮助调节此类多元社区？我们将此任务形式化为多选问题，模仿人类调节员在现实世界中的操作方式：给定一条评论及其上下文，识别违反了哪一条具体规则（如果有的话）。我们引入了PluRule，一个多模态、多语言的基准测试，用于检测1989个Reddit社区中跨越2885条规则的13371条违规情况。使用此基准测试，我们发现最先进的视觉语言模型在识别违规方面表现显著不佳：即使GPT-5.2具有高水平推理能力，也仅略优于基础基线。我们还发现，更大的模型和更多的上下文提供微小收益，而普遍规则如礼貌和自我推广更容易检测。我们的结果表明，社交媒体上多元社区的调节是语言模型的基本挑战。我们的代码和基准测试已公开发布。

英文摘要

Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2605.17173 2026-05-19 cs.CL cs.AI cs.LG 版本更新

多语言和多模态大语言模型在野：为低资源语言构建

Firoj Alam, Shammur Absar Chowdhury, Enamul Hoque Prince

发表机构 * Qatar Computing Research Institute（卡塔尔计算研究所）； HBKU（哈马德大学）； York University（约克大学）

AI总结本文探讨了在有限数据和计算资源下构建多语言多模态大语言模型的方法，涵盖了低成本数据创建、三模态对齐适配器堆栈以及文化感知评估等核心技术和资源。

Comments Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Low-resources-language

2605.17113 2026-05-19 cs.CL cs.AI 版本更新

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

无法回头的点：语言模型推理中欺骗承诺的反事实定位

Scott Merrill, Shashank Srivastava

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结本文研究语言模型在推理过程中何时开始承诺欺骗，通过反事实定位方法，分析不同环境中的欺骗产生机制，并发现注意力转移特征在跨环境泛化中的有效性，同时提出通过压缩注意力头集来抑制欺骗承诺。

Comments 41 pages, 25 figures

详情

AI中文摘要

现有欺骗数据集将完成的输出标记为诚实或欺骗，将欺骗视为最终响应的属性，而非模型推理轨迹的功能。这掩盖了一个更根本的问题：语言模型何时开始承诺欺骗？我们引入反事实定位：对于推理轨迹中的每个句子前缀，固定前缀，重新采样后续内容，并估计欺骗结果的概率。为了扩展此方法，我们构建了五个环境（涵盖战略欺骗、迷宫指引、财务建议、二手车销售和报价谈判），其中欺骗从未被提示，而是源自战略激励，标签机械地从环境状态得出，而非主观人类判断。所得到的语料库在四个推理模型中定位了约146万句话，来自超过9410万次采样的后续内容、915亿生成的token和超过1万种场景。句子层面的人类评估证实，检测到的承诺点对应于决策状态的可解释转变。使用此资源，我们显示，用于承诺预测的词汇线索在不同环境之间转移效果差，而基于注意力的转移特征在分布外泛化中表现良好，表明欺骗承诺反映在可重用的推理动态变化中，而非表层形式。我们进一步识别出压缩的注意力头集（少于10%的头）在一种环境中选择后，能因果地抑制其他环境中的欺骗承诺。我们发布此语料库作为研究语言模型推理中欺骗和更广泛承诺的子基质。

英文摘要

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.17093 2026-05-19 cs.CV cs.CL 版本更新

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED：基于密度加权残差对齐的混合视觉-语言模型蒸馏

Yihao Liang, Niraj K. Jha

发表机构 * Princeton University（普林斯顿大学）

AI总结本文提出HEED方法，通过密度加权残差对齐改进混合视觉-语言模型蒸馏，提升在OCR和文档任务中的性能，同时在不同教师模型和混合架构上实现高效推理。

详情

AI中文摘要

将视觉-语言模型蒸馏为更高效的混合架构，如3:1 Mamba-2/注意力混合，已成为提高推理效率的标准做法。聚合基准表明这可行，但隐藏了选择性失败。当将Qwen3-VL-8B-Instruct蒸馏为3:1 Mamba-2/注意力混合时，在视觉推理基准如MMStar、MMBench和MMMU-Pro上，学生模型在教师模型附近保持2分差距，但在光学字符识别和文档任务上下降13分。学生模型仍能理解场景，但失去回答所需的细粒度文本。我们发现大部分失败归因于特定位置。在高分辨率图像中，大多数拼图是天空、墙壁或平滑纹理，而一小部分携带文本、边缘、物体边界或其他局部细节。在令牌级诊断中，前10%最高密度拼图的残差漂移比后10%最低密度拼图大3.6倍，且教师遮蔽答案贡献大3.5倍。均匀加权将许多损失项分配给低信息量的背景拼图，而稀疏答案承载拼图未得到特殊保护。所需干预极小：我们用拼图自不相似性作为无监督代理来替代均匀残差对齐，以确定位置重要性。我们称之为HEED。与常规端到端蒸馏相比，HEED在OCRBench v2上提升8.7分，在10个基准平均上提升5.13分。增益在不同教师模型和混合架构上实现。在标准后训练后，学生在10个基准平均上达到教师级性能，具有4.12倍的吞吐量和128k上下文时68%的内存节省，无需额外参数和推理时间成本。

英文摘要

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6$\times$ larger residual drift than the bottom 10% lowest-density patches and 3.5$\times$ larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12$\times$ throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.

URL PDF HTML ☆

赞 0 踩 0

2605.17088 2026-05-19 cs.CL 版本更新

ACIL: Auto Chain of Thoughts for In-Context Learning

ACIL: 自动链式思维用于上下文学习

Rui Chu

发表机构 * Rui Chu（楚瑞）

AI总结本文提出ACIL框架，通过自动构建包含推理步骤的演示来提升上下文学习在多步推理任务中的性能。

详情

AI中文摘要

近年来，大型语言模型（LLMs）的进步表明，链式思维（CoT）推理可以显著提高复杂推理任务的性能。同时，上下文学习（ICL）已成为一种重要的机制，用于在不更新模型参数的情况下将LLMs适应于新任务，仅使用提示中提供的示例。然而，标准ICL在需要多步推理的任务上往往表现不佳，因为演示通常只包含输入-输出对，缺乏显式的中间推理步骤。本文介绍了一种自动链式思维（Auto-CoT）框架，通过自动构建推理增强的演示来改进ICL。Auto-CoT为输入-输出示例生成推理链，将结构化的中间解释添加到提示上下文中，并通过系统化的选择过程去除无关或低质量的演示。通过将高质量的推理示例纳入ICL提示中，Auto-CoT引导模型朝向更可靠的推理，并提高预测准确性。在多个推理任务上的实验表明，所提出的框架通过提供显式的中间推理指导，提高了ICL的性能。

英文摘要

Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

URL PDF HTML ☆

赞 0 踩 0

2605.17084 2026-05-19 cs.LG cs.CL 版本更新

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

尺度决定语言模型是否为预测组织表示几何

Weilun Xu

发表机构 * School of Computer and Communication Sciences（计算机与通信科学学院）； École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）

AI总结研究探讨了语言模型中表示几何是否为预测组织，通过Subspace PGA指标发现，模型规模影响表示几何的组织程度，小模型在训练后期逐渐失去这种组织，而大模型则保持稳定。

详情

AI中文摘要

在语言模型中，表示所编码的内容由其表示空间的几何结构决定：距离而非激活值承载意义。现有工具描述了这种几何结构的形状，但并未探讨其组织目的。我们引入Subspace PGA指标，测试某层的距离结构是否比随机等大小子空间更符合解嵌入矩阵$W_U$的读出子空间。在七个Pythia模型（70M-6.9B）和三个跨家族模型中，中间几何显著为预测组织（峰值$z = 9$--$24$），但程度依赖于规模：小模型（$d \leq 1024$）在训练后期逐渐失去这种组织——即使损失持续改善，而大模型（$d \geq 2048$）则保持稳定。我们追溯到容量权衡：少数主导方向迁离$W_U$的读出，掩盖而非破坏预测结构，移除它们可恢复对齐。频谱度量和损失曲线无法捕捉这一区别。因此，规模不仅决定了模型预测性能，还决定了其表示几何如何组织以实现预测。

英文摘要

In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training -- even as loss keeps improving -- while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.

URL PDF HTML ☆

赞 0 踩 0

2605.17079 2026-05-19 cs.CL cs.AI cs.CY 版本更新

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

LLMs能否像消费者一样思考？通过ConsumerSimBench进行大众级反应重建的基准测试

Tianyu Wang, Jiajun Li, Jianghao Lin

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出ConsumerSimBench基准，通过1553个真实中国社交媒体话题和23122个原子化、规则审核过的标准，评估LLM在模拟消费者反应方面的能力，揭示了前沿模型在预测高语境中文消费者讨论中实际关心内容方面的不足。

详情

AI中文摘要

LLMs越来越多地被用作“数字消费者”来模拟公众意见、预测试营销决策并预测观众反应。然而，现有评估很少询问模型是否能重建现实中消费者在公开讨论中表现的具体反应模式。我们引入了ConsumerSimBench，该基准基于1553个真实中国社交媒体话题和23122个原子化、规则审核过的标准，涵盖四个反应类别。与评分开放生成的综合偏好判断不同，ConsumerSimBench将每个任务分解为可审计的yes-no决定，使三判官协议从65.8%提升至92.1%，且点wise判断与人类多数标签在98.4%时一致。在13个前沿生成器中，最强的模型Gemini-3.1-Pro仅覆盖了47.8%的真实反应标准，而GPT-5.2和Claude-4.6尽管在技术基准上表现优异，但仍然落后。这些失败揭示了技术基准表现与基于社会的消费者直觉之间的巨大差距。直接的结构化推理提示会降低覆盖率，而生成-反思多代理流水线可将MiMo-V2.5-Pro在子集上的表现从32.9%提升至37.6%。ConsumerSimBench将消费者模拟重新定义为对真实公开讨论反应的预测问题，表明前沿LLM在预测高语境中文消费者讨论中实际关心内容方面仍远未可靠。

英文摘要

LLMs are increasingly used as ``digital consumers'' to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate--reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

URL PDF HTML ☆

赞 0 踩 0

2605.17072 2026-05-19 cs.AI cs.CL 版本更新

LLM个性诱导中的评估漂移：我们是否在移动目标？

Prateek Rajput, Yewei Song, Iyiola E. Olatunji, Jacques Klein, Tegawendé F. Bissyandé

发表机构 * University of Luxembourg（卢森堡大学）

AI总结本文研究了大型语言模型在诱导人类样性格时的稳定性问题，通过微调长格式论文来诱导性格，并发现尽管微调减少了问卷评分的方差，但完整五维性格的准确性仍接近随机，表明无指导的论文缺乏表达忠实性格所需的线索。

Comments 14 pages, 8 main pages, 5 figures, 4 main page figures

详情

AI中文摘要

大型语言模型能否可靠地表达人类般的性格，还是仅仅在模仿表面线索而缺乏稳定的底层特征？为探讨此问题，我们通过在长格式论文上进行微调来诱导LLM的性格，每篇论文都关联一个目标大五人格特征轮廓。随后，我们使用IPIP-NEO问卷评估诱导性格的稳定性和准确性。具体而言，我们提出两个问题：（i）训练后（SFT、DPO、ORPO）是否在提示重新表述下稳定问卷评分？（ii）能否从无指导的论文中诱导目标大五人格特征？我们的结果表明，微调在五个模型上一致减少了问卷响应的方差，直接缓解了预训练模型中报告的评估脆弱性。然而，这种新发现的稳定性揭示了更根本的限制：即使单个特征分数有所提高，完整五维特征的准确性仍接近随机。这表明无指导的论文缺乏表达忠实性格所需的线索。因此，我们主张使用场景相关的数据集或交互式引导来积累与测试一致的证据。

英文摘要

Can large language models reliably express a human-like personality, or are they merely mimicking surface cues without a stable underlying profile? To investigate this, we induce personality in LLMs by fine-tuning them on the long-form essays, where each essay is associated with a target Big Five personality profile. We then evaluate the stability and fidelity of the induced personality using the IPIP-NEO questionnaire. Specifically, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Our results demonstrate that fine-tuning consistently reduces variance in questionnaire responses across five models, directly mitigating the evaluation fragility reported in pre-trained models. However, this newfound stability reveals a more fundamental limitation: accuracy on the full five-dimensional profile remains near chance, even when single-trait scores improve. This indicates that unguided essays lack the cues needed for faithful personality expression. We therefore argue for scenario-grounded datasets or interactive elicitation that accumulates test-aligned evidence over time.

URL PDF HTML ☆

赞 0 踩 0

2605.16991 2026-05-19 cs.CL cs.AI 版本更新

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

无响应项目难度建模用于多项选择题：细调Transformer：组件表示和多任务学习

Jan Netík, Patrícia Martinková

发表机构 * Faculty of Education, Charles University（查理大学教育学院）； Institute of Computer Science of the Czech Academy of Sciences（捷克科学院计算机科学研究所）

AI总结本文提出了一种无响应项目难度建模方法，通过细调Transformer来处理阅读理解多项选择题的难度问题，采用组件级表示和多任务学习方法来提升模型性能。

详情

AI中文摘要

无响应项目难度建模旨在减少对响应校准的依赖，但对阅读理解多项选择题而言，其难度取决于词汇组件的推断需求。尽管现有方法通常从项目文本中提取特征并传递给单独的统计或机器学习模型，本文通过端到端地在项目词汇上微调Transformer编码器，消除了手动特征工程和预处理所丢失的信息。此外，本文还提出了两种扩展：一种是组件级变体，通过共享编码器分别编码词汇组件；另一种是多任务变体，保留联合编码并添加辅助的多项选择问题回答目标。每种方法都在三种训练集大小下通过蒙特卡洛子采样设计在保留的测试集上进行评估。研究发现，联合编码是一种可行的端到端替代方案；虽然组件级变体没有明显优势，这与自注意力机制本身已经捕获跨组件信号一致，但多任务变体在小样本情况下提供了显著的改进。Transformer微调，尤其是通过合适的辅助任务进行正则化，能够在应用测量中典型的训练集大小下恢复大量词汇可推导的信号。该框架为心理测量学扩展提供了可定制的接口。

英文摘要

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

URL PDF HTML ☆

赞 0 踩 0

2605.16986 2026-05-19 cs.CL cs.AI 版本更新

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

发表机构 * Shanghai Jiao Tong University（上海交通大学）； OPPO Research Institute（OPPO研究院）

AI总结本文提出了一种在测试时自适应的技能合成方法SkillTTA，通过检索与当前任务相关的少量训练轨迹并将其合成成为任务特定的文本技能，以提高LLM代理在SpreadsheetBench、ALFWorld和BigCodeBench等任务上的性能。

Comments 10 pages, 4 figures

详情

AI中文摘要

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

英文摘要

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose \emph{SkillTTA}, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-$k$ retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

URL PDF HTML ☆

赞 0 踩 0

2605.16941 2026-05-19 cs.CL 版本更新

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

展开与回退：扩散大语言模型是它们自己的效率教师

Fanqin Zeng, Feng Hong, Geng Yu, Huangjie Zheng, Xiaofeng Cao, Ya Zhang, Bo Han, Yanfeng Wang, Jiangchao Yao

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Apple MLR ； Tongji University（同济大学）； Hong Kong Baptist University（香港 Baptist 大学）； RIKEN（日本理化学研究所）

AI总结本文提出了一种基于可撤销解码的扩散大语言模型（DLLM）方法，通过发现可靠的去噪顺序来提升生成质量和效率，WINO和WINO+在多个数据集上展示了显著的性能提升。

详情

AI中文摘要

扩散大语言模型（DLLMs）承诺快速并行生成，但开源DLLMs仍面临严重的质量-速度权衡问题：通过揭示多个token来加速解码往往会导致显著的质量下降。我们将其困境归因于训练-推理不匹配被不可逆解码放大。虽然训练过程从随机破坏的状态重建token，但高效的推理需要自适应的去噪顺序，其中更容易的token被更早揭示，而依赖上下文的token则被推迟。这种观点促使提出两种互补的方法：一种是在推理时使并行解码可撤销的方法，另一种是在训练时扩展的方法，通过这种可撤销过程暴露的可靠顺序进行知识蒸馏。据此，我们首先提出Wide-In, Narrow-Out（WINO），一种无需训练的解码算法，使并行解码可撤销。WINO积极地草案多个token，通过增强的全局上下文验证生成token，并重新掩码不可靠的token以供后续细化。基于发现的顺序，我们进一步引入WINO+，将WINO生成的验证去噪轨迹注入模型参数，使训练与高效推理对齐。在LLaDA和MMaDA上的实验表明，WINO在质量和效率上均有提升，而WINO+进一步加强了这一进程。在GSM8K上，WINO将准确率从73.24%提升到75.82%，并以6.10倍的步骤减少；WINO+进一步达到76.58%并以6.83倍的步骤减少。在Flickr30K上，WINO+实现了16.22倍的步骤减少并提升了CIDEr分数。这些结果表明，DLLMs可以通过可撤销解码发现可靠的去噪顺序，然后学习遵循这些顺序以实现更快的生成。代码可在https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus获取。

英文摘要

Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus.

URL PDF HTML ☆

赞 0 踩 0

2605.16938 2026-05-19 cs.CL cs.AI q-bio.NC 版本更新

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

努力作为上限，而非调节器：推理预算不影响人类与大推理模型之间的认知成本对齐

Yueqing Hu, Tianhong Wang

发表机构 * Institute of Neuroscience, Chinese Academy of Sciences（中国科学院神经科学研究所）； School of Philosophy, Anhui University（安徽大学哲学系）

AI总结该研究探讨了推理预算是否影响人类与大推理模型之间的认知成本对齐，发现无论推理努力如何变化，对齐情况保持不变，表明这种对齐是在训练时形成的，而非在推理时动态调整。

Comments 8 pages, 6 figures

详情

AI中文摘要

大推理模型（LRMs）生成的思维链轨迹长度与人类反应时间在认知任务中保持一致，但最近的争论质疑这种一致性是否反映真实的计算结构还是表面的冗长性。我们测试了这种一致性是否随推理时间的推理努力而变化。在GPT-OSS-20B和GPT-OSS-120B上，三个努力水平和六个推理任务中，任务内和跨任务的一致性保持不变：贝叶斯因子倾向于null，且各条件下的平均一致性几乎相同。操纵检查显示，努力参数设定了生成的上限，而非驱动实时分配，表明分配策略在训练时已固化。算术复杂度对比进一步显示，令牌分配跟踪细粒度、格式依赖的人类难度模式，模型规模提高了匹配程度。人类与LRMs之间的认知成本对齐似乎是在训练时形成的，对推理时的扰动具有鲁棒性，支持大推理模型问题解决的编译而非在线账户。

英文摘要

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

URL PDF HTML ☆

赞 0 踩 0

2605.16896 2026-05-19 cs.CL 版本更新

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

JSPG: 通过联合语义-拼音-字形检索实现中文上下文ASR的动态字典过滤

Shilin Zhou, Zhenghua Li

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

AI总结针对中文上下文ASR中大规模关键词字典带来的噪声问题，本文提出JSPG框架，结合语义、拼音和字形特征进行联合检索，有效提升关键词识别准确率。

详情

AI中文摘要

上下文自动语音识别（ASR）在处理大规模关键词字典时面临挑战，因为过多的不相关候选词会引入噪声并降低准确性。为此，动态过滤通常使用基础ASR模型生成初步假设，然后通过语义文本检索器获取相关关键词的子集。然而，这种方法在中文ASR中经常失效。基础模型常常产生同音或近同音的错误，这些错误保留了目标关键词的语音线索，但严重扭曲了其语义意义，使标准语义检索器无效。为了解决这个问题，我们提出了一种过滤框架，联合整合语义、拼音和字形特征（JSPG）。拼音可以根据语音相似性检索目标，而字形提供互补的结构线索以过滤掉中文中大量无关的同音词。为了弥补字符级拼音/字形指标与序列级过滤之间的差距，我们引入了扩展的Smith-Waterman算法，计算N-best假设序列与关键词之间的相似度分数。在Aishell-1和RWCS-NER数据集上的实验表明，JSPG显著优于单一特征基线。此外，由JSPG引导的下游上下文ASR模型在关键词识别准确性上实现了显著提升。

英文摘要

Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.

URL PDF HTML ☆

赞 0 踩 0

2605.16895 2026-05-19 cs.CE cs.AI cs.CL 版本更新

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

阿尔法幻觉：LLM交易代理报告的阿尔法不应被视为部署证据

Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu

发表机构 * Fudan University（复旦大学）； Shanghai University of Finance and Economics（上海财经大学）； Southwest University of Finance and Economics（西南财经大学）； Northeastern University（东北大学）； Imperial College London（伦敦帝国理工学院）； Peng Cheng Laboratory（鹏城实验室）

AI总结本文指出，LLM交易代理报告的阿尔法不应被当作部署的证据，因为这些阿尔法需要通过结构有效性测试来验证其时间完整性、现实摩擦、反事实稳健性、预测校准、数值执行和多代理分解等关键指标，当前的公开证据无法区分稳健的预测能力与时间污染、未建模的摩擦、短窗口夏普不确定性、叙事拟合和参数先验等因素。

详情

AI中文摘要

端到端的LLM交易代理已经从研究兴趣迅速发展为一个小型的命名系统生态系统，包括FinCon、FinMem、TradingAgents、FinAgent、QuantAgent和FLAG-Trader。其中几个系统报告的 headline 夏普比率如果在部署桌面上被直接解读，将是实质性的影响，而相关的基准如FinBen也报告了在相同范围内的交易任务夏普统计。学术界与工业界之间的差距在双方都被过度跨越了。我们对这一差距持立场：端到端LLM交易代理报告的阿尔法不应被视为部署证据。在这样的收益能够支持部署交易能力的主张之前，它们必须通过结构有效性测试，以确保时间完整性、现实摩擦、反事实稳健性、预测校准、数值执行和多代理分解。当前的公开证据尚无法区分稳健的预测能力与时间污染、未建模的摩擦、短窗口夏普不确定性、叙事拟合和参数先验等因素。问题不仅是评估性的，也是结构性的。语言信心不是可交易的概率，叙事推理不是数值执行，模型先验可能成为未披露的隐性因子暴露。我们贡献了一个最小报告协议套件，P1-P6，根据主张强度分层适用，并且提供了一个保守的模块化替代方案，该方案利用LLM作为可审计的信息接口，在独立校准、风险和执行模块之前。代码和再生产工具：\url{https://github.com/hj1650782738/Trading}。

英文摘要

End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia--industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1--P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \url{https://github.com/hj1650782738/Trading}.

URL PDF HTML ☆

赞 0 踩 0

2605.16892 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe: 一种用于驾驶场景中风险检测与安全建议的框架

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

发表机构 * IIIT-Hyderabad（IIIT-海得拉巴）

AI总结本文提出DriveSafe框架，通过结构化自然语言描述实现风险感知场景理解，结合多模态上下文生成空间 grounded 的描述，用于下游风险评估和安全建议，实验表明其在DRAMA基准上达到最先进的性能。

Comments 8 pages

详情

AI中文摘要

全面的情景意识对于在安全关键环境中运行的自动驾驶车辆至关重要，因为它能够识别并缓解潜在风险。尽管最近的多模态大语言模型（MLLMs）在通用视觉-语言任务上表现出色，但我们的研究发现，零样本MLLMs在细粒度、空间接地的风险评估中仍不如领域特定的方法。为了解决这一差距，我们提出了DriveSafe，一种用于风险感知场景理解的框架，利用结构化自然语言描述。具体而言，我们的方法首先生成包含运动、空间和深度线索的多模态上下文的时空接地描述。这些描述随后用于下游的风险评估，明确识别危险物体、其位置以及它们所暗示的不安全行为，随后提供可操作的安全建议。为了进一步提高性能，我们采用描述-风险配对来微调一个轻量级的适配器模块，高效地将领域特定的知识注入基础LLM中。通过将风险评估条件化为显式的语言基础场景表示，DriveSafe在零样本MLLMs和先前的领域特定基线之上取得了显著的提升。在DRAMA基准上的全面实验表明了最先进的性能，而消融研究验证了我们关键设计选择的有效性。项目页面：https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

英文摘要

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/ research/projects/cvit-projects/drivesafe

URL PDF HTML ☆

赞 0 踩 0

2605.16882 2026-05-19 cs.CL 版本更新

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

E-PMQ: 专家引导的后合并量化与合并权重锚定

Wenjun Wang, Yanggan Gu, Shuo Cai, Yuanyi Wang, Pengkai Wang, Jianmin Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； PolyU-Daya Bay Technology and Innovation Research Institute（PolyU-大亚湾科技与创新研究院）

AI总结本文提出E-PMQ方法，通过专家引导的后合并量化与合并权重锚定，解决合并模型中量化偏差和合并偏差耦合的问题，从而提升低比特部署性能。

详情

AI中文摘要

低资源部署约束使得模型量化成为部署神经网络以保持性能的关键。同时，模型合并已成为一种日益实用的低资源策略，用于将多个任务或领域专门化的专家整合到一个模型中，而无需联合训练或多模型服务。共同，量化和模型合并能够通过将多个专家整合到一个低比特模型中，实现高效的低资源部署流程。我们把这个设置称为后合并量化（PMQ）。我们证明直接对合并模型应用后训练量化（PTQ）是不可靠的，因为两种不同的偏差耦合：低比特重建引入的量化偏差和来自模型合并的专家相对合并偏差。为了减轻这些偏差，我们提出E-PMQ，一个专家引导的PMQ框架，利用源专家权重在层间校准期间提供专家引导的输出目标，以及合并权重锚定来稳定校准并保持合并模型的行为。在CLIP-ViT-B/32八任务合并中，E-PMQ将4位GPTQ在任务算术下的性能从65.0%提升到73.6%，在TIES-合并下从69.1%提升到74.8%。在更困难的设置中，E-PMQ在20任务CLIP-ViT-L/14上将GPTQ从34.8%提升到76.7%，在FLAN-T5-base GLUE上从78.26%提升到83.34%。这些结果表明，E-PMQ能够实现有效的后合并量化和低比特部署。

英文摘要

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert- guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5- base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.

URL PDF HTML ☆

赞 0 踩 0

2605.16881 2026-05-19 cs.CL 版本更新

PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

PaliBench: 一种多参考框架用于经典语言翻译基准测试

Máté Metzger, Nadnapang Phophichit

发表机构 * Independent Researcher（独立研究者）； International Buddhist Studies College（国际佛教研究学院）； Mahachulalongkornrajavidyalaya University（玛哈切通拉翁皇家师范大学）

AI总结本文提出PaliBench，一种用于巴利语到英语翻译的基准测试，以及构建多参考翻译基准测试的方法，通过结合LLM辅助对齐、自动化验证、质量过滤、去重和多指标评估，生成包含1700段文本、8389个段落和约345,000个token的基准测试数据集，评估了十个现代大语言模型的性能。

Comments Preprint. This manuscript has not yet been peer reviewed

详情

AI中文摘要

数字人文项目越来越多地依赖机器翻译和大语言模型来扩大对古典、宗教及其他未翻译文本传统的访问。然而，标准翻译基准测试并不适合此类材料：它们通常将系统输出与单一参考翻译进行比较，尽管古典文本往往支持多种忠实的翻译，这些翻译在术语、语体和解释上有所不同。本文介绍了PaliBench，既是一种巴利语到英语翻译的基准测试，也是一种可重用的方法，用于构建多参考翻译基准测试。Pali案例研究基于与Bhikkhu Sujato、Bhikkhu Thanissaro和Bhikkhu Bodhi独立英文翻译对齐的Sutta Pitaka段落。工作流程结合了LLM辅助对齐独立分段的翻译、自动化验证源文件、段落级质量过滤、去重公式性重复以及多指标评估多个人类参考。生成的基准测试包含1700段文本，覆盖8389个段落和约345,000个token。我们使用它评估了十个现代大语言模型，发现系统排名在不同指标之间有强一致性，同时可靠性及语义异常率有显著差异。更广泛贡献是方法论：PaliBench展示了如何将现有学术翻译转化为解释性文本传统的评估基础设施，而不将任何单一翻译视为最终结论。尽管是为巴利佛教文本开发的，该方法也可用于其他古典语料库，其中存在足够的独立参考翻译。

英文摘要

Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.

URL PDF HTML ☆

赞 0 踩 0

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考：通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD& CG（CAD与CG国家重点实验室）

AI总结本文提出通过模式诱导的方法，利用模式推理和模式诱导策略，使视觉语言模型在视觉规划任务中实现更高效和准确的感知与推理，解决传统模型在复杂输入下的感知瓶颈问题。

详情

AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型（VLMs）构成重大挑战，当输入复杂度超出其一步感知能力时。受最近在图像思考（TWI）中的进展启发，一种合理的解决方案是通过迭代获取和整合局部视觉证据，将感知过程分解为更简单的步骤。然而，尽管当前VLMs在一般TWI能力上训练良好，但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战，我们将TWI视为一种工具，逐步构建并反映一个准确的内部世界模型。我们发现，由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务，但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率，我们提出模式推理，一种新的TWI策略，使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式，我们提出模式诱导，一种在线归纳学习策略，将视觉模式视为复合且可重用的专家，这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明，我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

URL PDF HTML ☆

赞 0 踩 0

2605.16843 2026-05-19 cs.CL 版本更新

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

RTI-Bench: 一个用于印度权利信息决策分析的结构化数据集

Joy Bose

发表机构 * Independent Researcher（独立研究员）

AI总结本文提出RTI-Bench，一个包含印度中央信息委员会（CIC）决策的结构化数据集，用于分析权利信息决策，该数据集首次公开发布，包含结果标签、豁免引用、IRAC风格的推理组件和程序时间线。

Comments 8 pages, 4 tables

详情

AI中文摘要

印度《2005年信息权利法》赋予每位公民要求公共机构提供信息的权利，但在实践中，大多数人无法理解中央信息委员会（CIC）决定中密集的行政语言，更不用说预测是否值得提出上诉。本文介绍了RTI-Bench，一个包含CIC决定的结构化数据集，包含结果标签、豁免引用、IRAC风格的推理组件和程序时间线。据我们所知，这是首个公开发布的印度RTI行政决定结构化数据集。该数据集来自两个来源：1,218个公开可用的指令-响应语料库（通过规则提取添加了结构化字段），以及298个直接从委员会门户网站收集的CIC决定PDF文件，涵盖2023至2026年的五个委员和三种文档格式版本。在指令-响应语料库上，标签覆盖率达到89%。对于239个主要决定的PDF子集，本次首次发布覆盖率为51%。对50个标记案例的随机样本进行了人工审查，得出的标签精度为95.3%。在100个案例上的零样本Mistral 7B基线模型在结果预测上的准确率为57.3%，宏F1得分为37.0%，高于多数类基线的14.3%宏F1。RTI-Bench可在https://huggingface.co/datasets/joyboseroy/rti-bench获取。

英文摘要

India's Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at https://huggingface.co/datasets/joyboseroy/rti-bench

URL PDF HTML ☆

赞 0 踩 0

2605.16839 2026-05-19 cs.CL 版本更新

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

CompactAttention: 加速分块预填的块-联合KV选择

Jiwon Song, Dongwon Jo, Beomseok Kang, Jae-Joon Kim

发表机构 * Seoul National University（首尔国立大学）

AI总结本文提出CompactAttention，一种基于块-联合KV选择的分块预填注意力机制，通过将二维块稀疏掩码作为KV选择信号，实现高效的注意力计算，从而在保持精度的同时提升2.72倍的注意力速度。

详情

AI中文摘要

分块预填已成为长上下文大语言模型广泛采用的服务策略，但在这种模式下高效计算注意力仍然具有挑战性。现有稀疏注意力方法主要针对一次性预填设计，无法有效转换为分块预填：块稀疏内核在查询长度受限于分块大小时效率降低，而细粒度模式搜索在每次分块累积KV缓存中重复时变得昂贵。QUOKA是一种近期针对分块预填的方法，避免了稀疏内核的开销，但依赖于查询子采样、令牌级的KV选择，这可能导致遗漏查询特定的KV条目并引入显式的KV复制开销。为了解决这些限制，我们提出了CompactAttention，一种基于块-联合KV选择的分块预填注意力机制。CompactAttention将二维块稀疏掩码作为KV选择信号，而不是直接的稀疏内核执行计划，并将其转换为GQA-aware的每组KV块表，通过Q块联合和组内联合。这种构造产生了最小的块表，保留了输入掩码所选择的所有KV块，在分页执行约束下，使所选KV块能够原地访问，而无需显式的KV压缩。在LLaMA-3.1-8B-Instruct上，CompactAttention在RULER基准测试中保持的精度接近密集注意力，同时在128K上下文长度下的分块预填中提供高达2.72倍的注意力加速。

英文摘要

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72$\times$ attention speedup at 128K context length under chunked prefill.

URL PDF HTML ☆

赞 0 踩 0

2605.16829 2026-05-19 cs.CL cs.PL 版本更新

AgentKernelArena: GPU核优化代理的通用化意识基准测试

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

发表机构 * AMD

AI总结本文提出AgentKernelArena，一个用于评估GPU核优化代理的开源基准，通过隔离工作区和统一评分机制，测试代理在不同任务和硬件目标上的性能和通用化能力，发现大多数任务在正确性和编译效率上表现优异，但在PyTorch到HIP的转换任务中存在显著的正确性下降。

详情

AI中文摘要

GPU核优化对于高效深度学习系统日益关键，但编写高性能核仍然需要大量的低级专业知识。最近的AI编码代理可以迭代阅读代码、调用编译器和性能分析器，并优化实现，但现有的核基准测试仅评估单个LLM调用而非完整的代理工作流程，且未包含核到核的优化和未见过的配置泛化测试。我们提出了AgentKernelArena，一个开源的基准测试，用于衡量AI编码代理在GPU核优化上的能力。该基准测试包含196个任务，涵盖HIP到HIP的优化、Triton到Triton的优化以及PyTorch到HIP的转换，并在隔离的工作区中使用门控编译、正确性和性能检查，集中评分和一个未见过的配置泛化协议，测试优化是否转移到代理从未见过的输入配置。在包括Cursor Agent、Claude Code和Codex Agent在内的生产代理中，我们发现大多数任务在正确性和编译效率上表现优异，最强配置在PyTorch到HIP任务中平均加速达6.89倍，在HIP到HIP任务中达6.69倍，在Triton到Triton任务中达2.13倍。我们的未见过的配置评估显示，HIP到HIP和Triton到Triton的优化大多能转移到未见过的输入形状，而PyTorch到HIP的转换则表现出显著的正确性下降，表明生成核的代理经常硬编码形状特定的假设。AgentKernelArena被设计为一个模块化、可扩展的框架，用于严格评估跨代理、任务和硬件目标的代理GPU核优化。

英文摘要

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

URL PDF HTML ☆

赞 0 踩 0

2605.16800 2026-05-19 cs.LG cs.CL 版本更新

FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation

FIM-LoRA: 通过校准时间梯度方差估计实现任务信息的秩分配

Ramakrishnan Sathyavageeswaran

发表机构 * Intuit

AI总结本文提出FIM-LoRA，通过校准时间梯度方差估计来分配任务信息的秩，以优化LoRA的秩分配，从而提高模型性能。

Comments 10 pages, 1 figure

详情

AI中文摘要

低秩适应（LoRA）为每个适应的权重矩阵分配一个统一的秩——一种实用的便利，但忽略了一个基本现实：不同层对任务适应的贡献不均。我们通过一种轻量级的工程解决方案来解决这个问题：在微调开始之前，运行八次校准反向传递，计算每个LoRA-B矩阵的梯度方差作为层信息度的代理，并按比例重新分配秩预算。所得到的适配器是一个标准的LoRA，具有每层的秩模式——没有新的参数，没有训练开销，没有对服务基础设施的更改。我们通过高效地近似经验 Fisher 信息矩阵（eFIM）对角线，仅限于 LoRA 适配器矩阵，来实现这一点，这将内存成本降低了大约256倍相比完整的模型 Fisher 估计。在 GLUE 上使用 DeBERTa-v3-base 时，FIM-LoRA 在相同参数预算下与 LoRA 相当（88.6 vs. 88.7），在常识推理上使用 LLaMA-3-8B 时达到 68.5 vs. 68.7。每层的秩映射是可解释的：值投影和早期到中期层一致获得更高的秩，与已建立的 transformer 层角色研究结果一致。

英文摘要

Low-rank adaptation (LoRA) assigns a uniform rank to every adapted weight matrix - a practical convenience that ignores a fundamental reality: different layers contribute unequally to task adaptation. We address this with a lightweight engineering solution: before fine-tuning begins, run eight calibration backward passes, compute the gradient variance of each LoRA-B matrix as a proxy for layer informativeness, and redistribute the rank budget proportionally. The resulting adapter is a standard LoRA with a per-layer rank pattern - no new parameters, no training overhead, no changes to serving infrastructure. We implement this via an efficient approximation of the empirical Fisher Information Matrix (eFIM) diagonal, restricted to LoRA adapter matrices only, which reduces memory cost by approximately 256x compared to full-model Fisher estimation. On GLUE with DeBERTa-v3-base, FIM-LoRA matches LoRA (88.6 vs. 88.7) at the same parameter budget, and on commonsense reasoning with LLaMA-3-8B reaches 68.5 vs. 68.7 for LoRA. The per-layer rank maps are interpretable: value projections and early-to-middle layers consistently receive higher rank, consistent with established findings on transformer layer roles.

URL PDF HTML ☆

赞 0 踩 0

2605.16790 2026-05-19 cs.LG cs.AI cs.CL 版本更新

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

TIER: 用于多步工具组合的轨迹不变执行奖励

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

发表机构 * UC San Diego（加州大学圣迭戈分校）； Cisco Research（思科研究）

AI总结本文提出TIER，一种基于函数模式和运行时执行的奖励框架，能够提供密集且可解释的序列级反馈，支持多种解决方案策略并适应变化的工具接口，在DepthBench等基准上实现了高准确率。

Comments Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at https://github.com/anaykulkarni/TIER

详情

AI中文摘要

工具使用使大语言模型能够通过一系列API调用解决复杂任务，但现有的强化学习方法无法扩展到多步骤组合设置。基于结果的奖励只能提供稀疏反馈，而轨迹监督的奖励依赖于注释的参考解决方案，惩罚有效的替代方案并限制可扩展性。我们提出TIER：轨迹不变执行奖励，一种奖励框架，其监督直接来自函数模式和运行时执行，而非参考轨迹。该奖励分解为格式有效性、模式遵守、执行成功和答案正确性，提供来自细粒度验证的单个步骤工具使用反馈。这种设计允许任何有效的执行路径获得信用，自然支持多种解决方案策略并适应变化的工具接口。在DepthBench，一个按深度（1到6步）分层的组合基准上，TIER在所有步骤中实现了>90%的准确率，其中轨迹监督的奖励在第4步之后崩溃。我们进一步在BFCL v3和NestFUL等基准上展示了持续的提升。消融研究确认所有奖励组件都是必要的，突显了多级监督对于组合推理的重要性。

英文摘要

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.16787 2026-05-19 cs.LG cs.CL 版本更新

The Unlearnability Phenomenon in RLVR for Language Models

在语言模型中RLVR的不可学习现象

Yulin Chen, He He, Chen Zhao

发表机构 * New York University（纽约大学）

AI总结本文研究了RLVR在提升大语言模型推理能力中的学习动态，发现即使存在正确回放，某些难例仍无法学习，揭示了当前RL方法在推理任务中的根本限制。

Comments Accepted to ICML 2026

详情

AI中文摘要

可验证奖励强化学习（RLVR）已被证明在提高大语言模型（LLM）的推理能力方面是有效的。然而，RLVR的学习动态仍缺乏深入研究。在本文中，我们揭示了一个反直觉的现象：在模型最初难以处理的硬例中，一个显著子集即使在存在正确回放的情况下仍无法学习。为了理解这一现象，我们首先证明了现有的优化和采样技术无法解决不可学习性。通过跨例梯度分析，我们显示不可学习的例子具有根本性的表示问题，其特征是与其余例子的梯度相似性低且推理模式不可泛化。我们进一步表明，表示缺陷在RL中难以缓解，因为数据增强无法提高梯度相似性。本研究为RLVR训练中的不可学习数据提供了首次系统的表征，并揭示了当前RL方法在推理任务中的根本限制。代码和数据可在https://github.com/yulinchen99/unlearnability-rlvr获取。

英文摘要

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving Large Language Model's (LLM) reasoning ability. However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at \url{https://github.com/yulinchen99/unlearnability-rlvr}.

URL PDF HTML ☆

赞 0 踩 0

2605.16770 2026-05-19 cs.CL cs.AI 版本更新

GroupMemBench: 多方对话中LLM代理记忆的基准测试

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

发表机构 * UC Santa Barbara（加州大学圣芭芭拉分校）； Microsoft（微软）

AI总结本文提出GroupMemBench，用于评估多方对话中LLM代理的记忆能力，揭示现有记忆系统在群体动态、信念跟踪和语言适应方面的不足。

详情

极性探针线性解码LLM中的语义结构

Pablo J. Diego-Simón, Pierre Orhan, Emmanuel Chemla, Yair Lakretz, Jean-Rémi King

发表机构 * LSCP, ENS, PSL, EHESS, CNRS（LSCP、ENS、PSL、EHESS、CNRS）； Paris Brain Institute（巴黎脑研究所）； Earth Species Project（地球物种计划）； Meta AI

AI总结研究通过极性探针线性恢复LLM中的语义结构，发现其基于嵌入距离和方向表示实体存在与关系类型，且在中层表现更优，能泛化至新实体但随语义结构规模下降。

详情

AI中文摘要

人工神经网络如何将概念绑定形成复杂语义结构？本文提出一种简单神经编码，通过嵌入距离和方向表示实体的存在及关系类型。在多种LLM中测试，结果表明极性探针能线性恢复真实语义结构，该编码主要出现在中层，随LLM性能提升而改善。极性探针能泛化至新实体和关系类型，但随语义结构规模增大而退化。极性表示质量与LLM回答语义结构问题的能力相关。这些发现表明，LLM通过简单几何原理绑定表示来构建复杂语义结构。

英文摘要

How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each input with natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. Results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrades with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.

URL PDF HTML ☆

赞 0 踩 0

2605.12987 2026-05-19 cs.CL 版本更新

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

利用多模态自一致性推理进行编码动机访谈以减少酒精使用

Guangzeng Han, James G. Murphy, Benjamin O. Ladd, Xiaolei Huang, Brian Borsari

发表机构 * Department of Computer Science, University of Memphis（密苏里大学计算机科学系）； Veterans Affairs Health Care System（退伍军人事务医疗系统）； Department of Psychiatry and Behavioral Sciences, University of California San Francisco（旧金山大学精神病学与行为科学系）； Department of Psychology, University of Memphis（密苏里大学心理学系）； Department of Psychology, Washington State University Vancouver（华盛顿州立大学温哥华分校心理学系）

AI总结本文提出基于音频语言模型的自动动机访谈编码方法，通过多模态自一致性推理提升编码鲁棒性，实验显示其在准确率、精确率和召回率上均优于基线方法。

详情

AI中文摘要

背景：对动机访谈（MI） session 进行编码对于理解客户行为和预测结果至关重要，但需要大量时间和劳动力由受过训练的 MI 专业人士完成。最近在音频-语言模型（ALMs）上的进展为通过捕捉多模态行为信号自动化 MI 编码提供了新机会。目的：本研究旨在开发一种基于 ALMs 的自动 MI 编码方法，分析原始音频输入并整合多个推理轨迹的预测，利用自一致性提高编码鲁棒性。方法：我们使用了五段去标识化的 MI 音频磁带进行实验。我们部署了 ALMs，使用四个互补的分析提示来支持语句级推理：用于语音提示的分析提示、用于声学提示的声调感知提示、用于定量假设检验的证据评分提示，以及用于对比推理的比较提示。每个提示抽取三个随机样本，生成每个语句12条独立的推理轨迹。最终预测由所有轨迹上的多数投票确定。结果：通过准确率、精确率、召回率和宏F1分数进行评估。所提出的多模态自一致性方法在准确率为52.56%、精确率为54.03%、召回率为47.45%、宏F1分为46.40%，优于基线方法。系统消融实验中移除个别模块在主要指标上一致地降低了性能。结论：多模态自一致性方法在 MI 编码中优于单次通过基线提示方法。这些发现表明，结合客户所说和如何说的内容可以支持更可靠的自动 MI 编码。

英文摘要

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.

URL PDF HTML ☆

赞 0 踩 0

2605.12920 2026-05-19 cs.MA cs.AI cs.CL 版本更新

Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

通过对话对齐世界模型实现具身多智能体协调

Vardhan Dongre, Dilek Hakkani-Tür

发表机构 * Siebel School of Computing & Data Science（计算机与数据科学学院）

AI总结研究通过对话机制探索具身智能体的世界模型对齐，发现对话能减少冲突但降低任务成功率，提出评估世界模型对齐的框架。

详情

AI中文摘要

有效的具身智能体协作需要超越共享环境中的行动，要求基于每个智能体对世界的理解进行沟通。当智能体只能部分观察环境时，无沟通的协调是难以证明的，但沟通可通过共享观察和对齐世界模型来弥合这一差距。本文研究LLM基于的具身智能体是否真正具备沟通能力。我们扩展了PARTNR协作家庭机器人基准，加入自然语言对话通道，使两个具有部分观察能力的智能体在任务执行中沟通。为评估对话是否导致真实的世界模型对齐而非表面协调，我们提出了一种基于每智能体世界图的对齐测量框架：观察收敛（私人世界模型随时间对齐吗？）、信息新颖性（信息是否传达了伙伴所缺乏的内容？）以及信念敏感的通信（智能体是否建模了伙伴所知的内容？）。我们的实验显示，对话减少了40至83个百分点的行动冲突，但相对于沉默协调任务成功率较低。使用我们的指标，我们表征了表面协调与真实世界模型对齐之间的差距，并确定当前模型在该光谱中的位置。

英文摘要

Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.

URL PDF HTML ☆

赞 0 踩 0

2605.12824 2026-05-19 cs.MA cs.AI cs.CL cs.CY 版本更新

Mechanism Plausibility in Generative Agent-Based Modeling

生成基于代理的建模中的机制合理性

Patrick Zhao, David Huu Pham, Nicholas Vincent

发表机构 * Simon Fraser University（西蒙弗雷泽大学）

AI总结本文提出机制合理性量表，区分生成充分性与机制合理性，探讨生成式代理模型的生成能力与解释能力。

Comments Accepted at ACM FAccT 2026

详情

DOI: 10.1145/3805689.3812388

AI中文摘要

大型语言模型（LLMs）能够生成多样化现象而无需显式编程规则，这一能力使其在不同代理基于模型（ABMs）和社会模拟中得到应用。最近的研究探讨了LLMs生成不同现象的能力，例如社交媒体上的人类行为或博弈论场景中的外星行为。然而，能力、预测和解释是不同的——从科学哲学和机制文献中，解释需要展示现象如何由相关组织实体和活动产生。对于建模者而言，在没有基于潜在遥远研究领域的情况下，描述实验特征或判断模拟是否在能力（或解释）上取得进展是困难的。我们整合了最近关于LLM-ABMs的研究与当代科学哲学文献，用以操作化'合理性'的定义，提出四等级量表。该量表将模型生成充分性（重现现象的能力）与机制合理性（现象如何产生）分开，并明确不同模型的不同角色，如预测性和解释性。我们将其介绍为机制合理性量表。

英文摘要

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different--drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

URL PDF HTML ☆

赞 0 踩 0

2605.11518 2026-05-19 cs.AI cs.CL cs.LG 版本更新

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch: 训练研究代理以自动化LLM实验配置 - 从低成本学习，优化高成本

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

发表机构 * University of Notre Dame（诺丁汉大学）

AI总结本文提出AutoLLMResearch框架，通过多保真度实验环境学习LLM配置原则，解决高成本实验自动化问题，展示其在大规模LLM实验中的有效性与通用性。

详情

AI中文摘要

有效配置可扩展的大规模语言模型（LLM）实验，涵盖架构设计、超参数调优等，对推进LLM研究至关重要，因为糟糕的配置选择会浪费大量计算资源并阻碍模型潜力的实现。以往的自动化方法适用于低成本环境，但可扩展的LLM实验成本过高，无法进行大量迭代。为了解决这一问题，我们提出AutoLLMResearch，一个模仿人类研究人员从低保真度实验中学习一般性原则并高效识别高成本LLM配置的代理框架。核心挑战是如何使代理通过与多保真度实验环境的交互学习LLM配置景观的结构。为此，我们提出一个系统框架，包含两个关键组件：1) LLMConfig-Gym，涵盖四个关键LLM实验任务的多保真度环境，支持超过一百万GPU小时的可验证实验结果；2) 一个结构化训练管道，将配置研究建模为长周期马尔可夫决策过程，并相应地激励跨保真度外推推理。在各种强基线上的广泛评估表明了我们框架的有效性、通用性和可解释性，支持其作为大规模现实LLM实验自动化的实用且通用解决方案的潜力。

英文摘要

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

URL PDF HTML ☆

赞 0 踩 0

2605.10923 2026-05-19 cs.LG cs.CL 版本更新

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

动态技能生命周期管理用于代理强化学习

Junhao Shen, Teng Zhang, Xiaoyan Zhao, Hong Cheng

发表机构 * Database Group, The Chinese University of Hong Kong（香港中文大学数据库组）； Lanzhou University（兰州大学）

AI总结本文提出SLIM框架，通过动态优化变量管理代理强化学习中的外部技能集，提升任务性能。

Comments Implementation code is available at https://github.com/ejhshen/SLIM

详情

AI中文摘要

强化学习能否教会大语言模型长期 horizon 推理？表达性是关键

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

发表机构 * Purdue University（普渡大学）； UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Georgia Tech（佐治亚理工学院）； UC San Diego（加州大学圣地亚哥分校）

AI总结本文通过ScaleLogic框架研究了RL训练与任务难度的关系，发现推理深度和逻辑表达性影响训练计算量，表达性越高，训练效率越高，证明LLM的长期推理问题可通过改进训练方法解决。

详情

AI中文摘要

强化学习（RL）已被应用于改进大语言模型（LLM）的推理能力，但关于训练规模与任务难度之间系统研究受限于缺乏可控且可扩展的环境。观察到LLM在长期推理方面的不足引发了它们可能是自回归Transformer架构根本问题的推测。为此，我们引入了ScaleLogic，一个合成逻辑推理框架，可独立控制两个难度轴：所需证明规划的深度（即horizon）和底层逻辑的表达性。我们提出的框架支持多种逻辑：从简单的蕴含逻辑（

英文摘要

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that they are fundamental to the autoregressive transformer architecture. To address this, we introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.

URL PDF HTML ☆

赞 0 踩 0

2605.05739 2026-05-19 cs.LG cs.AI cs.CL q-fin.CP 版本更新

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

基于大语言模型判官的多维行为评估：用于代理股票预测系统的闭环强化学习反馈

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * School of Electrical Engineering and Computer Science（电气工程与计算机科学学院）

AI总结本文提出一种多维行为评估方法，通过大语言模型判官评估代理系统决策过程，利用闭环强化学习反馈提升预测性能，验证了方法在股票预测中的有效性。

Comments 17 pages, 5 figures, 14 tables. Manuscript submitted to Applied Artificial Intelligence (Taylor and Francis)

详情

AI中文摘要

代理人工智能系统通过一系列相互依赖的自主决策产生输出，但标准评估仅评估输出而无法诊断底层过程。本文开发了一种行为评估方法，通过评分中间决策过程补充输出级测试。在每个自主决策点记录的行为轨迹被分为五日周期，并由三个大语言模型（LLM）判官根据六个领域特定维度（制度检测、路由、适应、风险校准、策略一致性、错误恢复）评分。一种扰动程序破坏一个维度，同时保持其他五个维度不变，验证了维度特异性；跨模型一致性达到Krippendorff's alpha=0.85。综合行为评分与实际20日夏普比率相关性达到Spearman rho=0.72。闭环框架将缺陷的每维度评分转换为信用分配惩罚，添加到Soft Actor-Critic奖励中。三次微调循环，限制在验证数据上，将持有期MAPE从0.61%降低到0.54%（11.5%相对；p<0.001，d=0.31）在2017至2025的测试期上，显著性在Diebold-Mariano下，通过Giacomini-White局部化到高波动性制度。该方法应用无关，适用于任何可以记录中间决策的代理系统。

英文摘要

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.

URL PDF HTML ☆

赞 0 踩 0

2605.02028 2026-05-19 cs.CL 版本更新

Language models fail at extended rule following

语言模型在扩展规则遵循中表现不佳

Tianxiang Dai, Jonathan Fan

发表机构 * Department of Electrical Engineering, Stanford University（斯坦福大学电气工程系）

AI总结研究发现语言模型在重复应用规则时无法保持精确状态，即使增加模型规模和计算资源也无法克服这一缺陷，表明需要新的模型架构来实现可靠的规则遵循。

Comments for accessing the data and code for reproduction of the study, see https://txdai.github.io/counting-reliability/

详情

AI中文摘要

大型语言模型在回答困难问题时能够通过检索、重新组合和关注长上下文中的信息来表现出色。然而，在代理任务中，需要额外的能力：在反复应用规则时保持精确的状态。我们发现语言模型在这一可靠性方面存在缺失。通过查询126种领先模型变体，执行对长字符串重复字符计数的任务，发现所有模型都无法准确计数超过模型依赖的语法敏感计数能力阈值。失败是突然的，并且即使增加模型规模、推理时间计算和外部工具，失败仍然持续。机制探测表明，模型使用有限的内部状态来模拟计数作为规则，一旦这些状态耗尽就会失败。此外，这些状态是执行复杂任务的基础。这些结果表明，为了使自主代理实现真正可靠的规则遵循能力，需要从根本上新的模型架构。

英文摘要

Large language models are highly capable of answering difficult questions by retrieving, recombining, and attending to information in long contexts. For agentic tasks, an additional capability is required: the preservation of an exact state while repeatedly applying rules. We find that this reliability is absent across language models. To demonstrate, we query 126 leading model variants with the task of counting a long string of repeated characters, and we find they all cannot accurately count above a model-dependent, syntax-sensitive counting capacity threshold. Failures are abrupt and persist even with increasing model size, inference time computation, and external tool. Mechanistic probing indicates that models use a finite number of internal states to mimic counting as a rule and fail once these states are exhausted. Furthermore, such states are the basis for performing complex tasks beyond counting. These results indicate that fundamentally new model architectures are required for autonomous agents to achieve truly reliable rule following capabilities.

URL PDF HTML ☆

赞 0 踩 0

2605.00155 2026-05-19 cs.LG cs.CL math.OC stat.ML 版本更新

Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Wasserstein分布鲁棒遗憾优化用于人类反馈的强化学习

Yikai Wang, Shang Liu, Jose Blanchet

发表机构 * Department of Statistics and Operations Research, University of North Carolina（统计与运筹学系，北卡罗来纳大学）； Imperial Business School, Imperial College London（帝国理工学院伦敦商学院）； Department of Management Science and Engineering, Stanford University（管理科学与工程系，斯坦福大学）

AI总结本文提出Wasserstein分布鲁棒遗憾优化（DRRO）用于强化学习从人类反馈，通过简单分配模型研究提示问题，展示在ℓ1-地面成本Wasserstein模糊集下，内最坏遗憾有精确解，最优策略具有水填充结构，从而实现高效政策梯度算法。

详情

AI中文摘要

强化学习从人类反馈（RLHF）已成为对齐大语言模型的核心后训练步骤，但RLHF中使用的奖励信号仅是真实人类效用的学得代理。从运筹学角度看，这形成了一个目标不准确的决策问题：策略是针对估计奖励优化，而部署性能由未观察的目标决定。由此产生的差距导致奖励过度优化，即Goodharting现象，即代理奖励在真正质量下降后仍继续改善。现有缓解方法通过不确定性惩罚、悲观奖励或保守约束，但这些方法计算上负担重且过于悲观。我们提出Wasserstein分布鲁棒遗憾优化（DRRO）用于RLHF。不同于标准DRO悲观最坏价值，DRRO悲观最坏遗憾相对于相同合理奖励扰动下的最佳策略。我们通过简单分配模型研究提示问题，展示在ℓ1-地面成本Wasserstein模糊集下，内最坏遗憾有精确解，最优策略具有水填充结构。这些结果导致具有简单采样奖金解释和仅小幅改动GRPO式RLHF训练的实用策略梯度算法。该框架还理论上澄清了为什么DRRO比DRO更不悲观，且实验显示DRRO比现有基线更有效缓解过度优化，而标准DRO系统性过悲观。

英文摘要

Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$-ground-cost Wasserstein ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.

URL PDF HTML ☆

赞 0 踩 0

2604.26904 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym：一种构建有效Claw代理的可扩展框架

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence（人工智能学院）； Renmin University of China（中国人民大学）； IQuest Research（IQuest研究）； Beihang University（北航）

AI总结本文提出ClawGym框架，用于构建Claw式个人代理的全生命周期，通过合成可验证训练数据和强化学习方法提升代理效能。

详情

AI中文摘要

Claw-style环境支持在本地文件、工具和持久工作区状态上进行多步骤工作流。然而，围绕这些环境的可扩展开发受限于缺乏系统框架，尤其是合成可验证训练数据并将其与代理训练和诊断评估集成的框架。为解决这一挑战，我们提出了ClawGym，一种支持Claw式个人代理全生命周期的可扩展框架。具体而言，我们构建了ClawGym-SynData，一个包含13500个过滤任务的多样化数据集，这些任务由基于人物驱动的意图和技能基础操作合成，配以现实的模拟工作区和混合验证机制。我们随后通过在黑箱滚动轨迹上进行监督微调训练了一组有能力的Claw式模型，称为ClawGym-Agents，并进一步通过轻量级管道探索强化学习，该管道在每项任务的沙箱中并行化滚动。为了支持可靠的评估，我们进一步构建了ClawGym-Bench，一个通过自动化过滤和人工LLM审查校准的200个实例的基准。相关资源已发布在https://github.com/ClawGym。

英文摘要

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.

URL PDF HTML ☆

赞 0 踩 0

2604.17338 2026-05-19 cs.SE cs.CL 版本更新

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

精确调试基准：你的模型是在调试还是重生成？

Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia

发表机构 * University of Southern California（南加州大学）； Microsoft（微软）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； University of Toronto（多伦多大学）

AI总结本文提出PDB基准，评估LLM调试能力，发现前沿模型调试精度低，即使受指令引导也难以达到高精度。

详情

AI中文摘要

从孤立评分到协作排名：一种基于LLM的论文评估比较原生框架

Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He, Jiawei Liu, Yong Huang, Tianrui Guo, Wei Lu

发表机构 * School of Economics and Management, East China Normal University（东华大学经济管理学院）； School of Information Management, Wuhan University（武汉大学信息管理学院）； China Academic Degrees & Graduate Education Development Center（中国学位与研究生教育发展中心）

AI总结本文提出CNPE框架，通过整合比较机制于数据构建和模型学习，提升论文评估的相对质量判断，实验显示比DeepReview-14B平均提升21.8%。

Comments Accepted at Findings of ACL 2026

详情

AI中文摘要

大型语言模型（LLM）目前通过为每篇论文分配绝对分数来评估科学论文，但因评分标准在会议、时间周期和评估标准上存在差异，训练绝对分数模型易导致适应狭窄、特定情境规则而非发展稳健的学术判断。为克服此限制，本文提出将论文评估从孤立评分转向协作排名。具体而言，设计了用于论文评估的比较原生框架（CNPE），整合比较机制于数据构建和模型学习。首先提出基于图的相似性排名算法以从文献集合中采样更具信息量和区分度的论文对。随后通过监督微调和强化学习结合基于比较的奖励增强相对质量判断。在推理阶段，模型对采样论文对进行成对比较，并将这些偏好信号聚合为全局相对质量排名。实验结果表明，本文框架在强基线DeepReview-14B上实现了平均相对改进21.8%，并在五个之前未见过的数据集上表现出稳健的泛化能力。代码可在https://github.com/ECNU-Text-Computing/ComparisonReview获取。

英文摘要

Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a $\textbf{C}$omparison-$\textbf{N}$ative framework for $\textbf{P}$aper $\textbf{E}$valuation ($\textbf{CNPE}$), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Our code is available at https://github.com/ECNU-Text-Computing/ComparisonReview.

URL PDF HTML ☆

赞 0 踩 0

2603.16091 2026-05-19 cs.CL cs.AI 版本更新

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

CounterRefine：用于事实问答中推理时知识修复的答案条件计数证据检索

Tianyi Huang, Ying Kai Deng

发表机构 * Ryquo ； App-In Club

AI总结 CounterRefine通过在推理时检索特定证据并进行约束性修正，提升事实问答的准确性，实验表明其在多个基准测试中有效改进了基础模型的表现。

Comments Accepted at the 4th Workshop on Towards Knowledgeable Foundation Models at ACL 2026

详情

AI中文摘要

在事实问答中，许多错误并非检索失败，而是对答案的固执。我们提出了CounterRefine，一种轻量级的修复层，用于短形式RAG。该方法将第一个答案视为假设进行检验。给定草稿，CounterRefine会发出答案条件扩展查询以检索候选特定证据，然后应用受约束的KEEP或REVISE修正步骤，其提出的修订仅在确定性验证后才被接受。设计是故意狭窄的：它添加了一次证据收集流程和一次受保护的修正调用，而不是替换检索器或构建广泛代理系统。在完整的SimpleQA基准测试中，CounterRefine将匹配的一次通过RAG基线改进了最多5.8个正确率点；在完整的Claude轨迹中，它只改变了5.6%的输出，其中180个有益变化和8个有害变化。这些发现表明，对于知识丰富的基础模型来说，除了访问证据外，它们还应能够利用该证据重新考虑，并在必要时修复自己的答案。

英文摘要

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight repair layer for short-form RAG that treats the first answer as a hypothesis to test. Given a draft, CounterRefine issues answer-conditioned expansion queries to retrieve candidate-specific evidence, then applies a constrained KEEP or REVISE refinement step whose proposed revisions are accepted only after deterministic validation. The design is intentionally narrow: it adds one evidence-gathering pass and one guarded refinement call rather than replacing the retriever or building a broad agentic system. On the full SimpleQA benchmark, CounterRefine improves a matched one-pass RAG baseline by up to 5.8 correct-rate points; in the full Claude trace, it changes only 5.6% of outputs, with 180 beneficial outcome changes and 8 harmful ones. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

URL PDF HTML ☆

赞 0 踩 0

2603.04737 2026-05-19 cs.AI cs.CL cs.LG 版本更新

Interactive Benchmarks

交互式基准测试

Baoqing Yue, Zihan Zhu, Yutong Han, Brian Fan, Qian Sun, Jichen Feng, Hufei Yang, Yifan Zhang, Mengdi Wang

发表机构 * InteractiveBench ； Princeton University（普林斯顿大学）

AI总结本文提出交互式基准测试，通过预算化的多轮交互评估模型推理能力，改进传统基准和偏好评估的局限性，揭示模型在交互场景中的改进空间。

Comments Project Page: https://github.com/interactivebench/interactivebench

2602.08437 2026-05-19 cs.CL 版本更新

Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

大语言模型与不可能的语言习得："虚假承诺"或对当前人工智能观点的颠覆

Ziyan Wang, Longlong Ma

发表机构 * New Talent Academy（新人才学院）； Institute of Software, University of Chinese Academy of Sciences（中国科学院软件研究所）

AI总结本文通过实验探讨大语言模型学习可能与不可能语言的能力，发现GPT-2模型在自然语言任务上表现优于不可能语言任务，而LSTM模型则无显著差异，挑战了Chomsky对AI的理性主义基础观点。

详情

AI中文摘要

在Chomsky的挑衅性批评《CHATGPT的虚假承诺》中，大语言模型（LLMs）被描述为仅能预测模式的工具，无法像人类一样通过内在因果和自我修正结构习得语言，因此无法区分可能与不可能的语言。它代表了对AI智力基础的根本挑战，因为它整合了LLMs方法论中的主要问题，并具有一个典型的先验理性主义视角。我们从语言学和心理学现有文献的视角以及基于实验研究LLMs学习可能和不可能语言能力的角度审视这一著名批评。我们通过将英语应用特定转换生成了语法上不可能的语言，包括反转整个句子和根据词数奇偶性添加否定。在GPT-2小模型和长短期记忆（LSTM）模型上进行了两轮受控实验。单次运行训练轨迹的描述性分析显示，GPT-2小模型在自然语言任务上的最终损失较低、收敛速度更快、困惑度较低，其中反转条件表现最差（损失比自然语言高达2.25倍）。LSTM模型在不同条件下则差异最小。鉴于实验的单次运行性质（每种条件n=1），我们报告了描述性比较，并提醒正式统计推断无法进行。基于理论分析和描述性实证发现，我们提出了一种新的Chomsky理论视角，即在LLMs研究中从Chomsky的“理性主义-浪漫主义”范式转向功能主义和经验主义。

英文摘要

In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, therefore are not able to distinguish impossible languages. It stands as a representative in a fundamental challenge to the intellectual foundations of AI, for it integrally synthesizes major issues in methodologies within LLMs and possesses an iconic a priori rationalist perspective. We examine this famous critique from both the perspective in pre-existing literature of linguistics and psychology as well as a research based on an experiment inquiring into the capacity of learning both possible and impossible languages among LLMs. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences, and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Descriptive analysis of single-run training trajectories shows that GPT-2 small models exhibit lower final loss, faster convergence, and lower perplexity on natural language compared to impossible language conditions, with the reversed condition showing the largest departure (loss ratios up to 2.25 * natural). LSTM models, by contrast, show minimal differences across conditions. Given the single-run nature of our experiments (n=1 per condition), we report descriptive comparisons and caution that formal statistical inference is precluded. Based on theoretical analysis and descriptive empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLMs research.

URL PDF HTML ☆

赞 0 踩 0

2602.08169 2026-05-19 cs.LG cs.CL 版本更新

Spherical Steering: Geometry-Aware Activation Rotation for Language Models

球面操控：面向语言模型的几何感知激活旋转

Zejia You, Chunyuan Deng, Hanjie Chen

发表机构 * Rice University（里士大学）； Tufts University（塔夫茨大学）

AI总结本文提出球面操控方法，通过激活旋转而非加法实现无训练的推理控制，有效避免隐藏状态幅度变化，提升模型在多项选择基准上的表现，同时保持开放生成能力。

Comments ICML 2026

详情

AI中文摘要

在推理过程中，操控语言模型（LMs）而不重新训练是一种有前景的方法。然而，标准方法通常依赖于激活加法，这不可避免地会改变隐藏状态的幅度，引发表示崩溃和开放生成退化的问题。本文探讨了球面操控，一种无需训练的原始方法，通过激活旋转解决这一权衡问题。与使用固定向量移动激活不同，我们的方法沿向目标方向的测地线旋转激活，从而在保持信号完整性的同时指向目标概念。为进一步增强适应性，我们引入了一个置信度门，根据输入不确定性动态调节操控强度。在多个选择基准上的广泛实验表明，球面操控在多项选择基准上显著优于加法基线（在TruthfulQA、COPA和Storycloze上分别提高10%），同时同时保持模型的开放生成质量。这项工作强调了几何一致性的重要性，表明保持范数的旋转是一种稳健且有效的方法，用于精确的推理时间控制。代码可在：https://github.com/chili-lab/Spherical-Steering 获取。

整体大于部分之和：一种兼容性感知的多教师CoT蒸馏框架

Jin Cui, Jiaqi Guo, Ruixuan Yang, Jiayi Lu, Jiepeng Zhou, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（人机混合增强智能国家重点实验室，人工智能与机器人研究院，西安交通大学）； Nankai University（南开大学）； The Hong Kong University of Science and Technology(Guangzhou)（香港科技大学（广州））； School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（软件学院，人机混合增强智能国家重点实验室，人工智能与机器人研究院，西安交通大学）

AI总结本文提出COMPACT框架，通过动态加权不同教师的梯度，结合多维指标提升学生模型的推理能力，有效整合多样化推理能力并减少灾难性遗忘。

Comments 11pages, 9figures

详情

AI中文摘要

链式推理（CoT）推理赋予大语言模型（LLMs）显著能力，但通常需要极高的参数规模。CoT蒸馏作为一种有前景的范式，将推理能力转移到紧凑的学生模型（SLMs）中，但现有方法通常依赖单一教师，限制了学生潜力，因为个体LLMs常有不同能力偏倚且可能遭受灾难性遗忘。虽然利用多样教师似乎有吸引力，但有效融合其监督仍具挑战：教师-学生不兼容可能放大幻觉，被动监督无法确保真实逻辑内化。为此，我们引入COMPACT框架，通过动态加权教师梯度，基于多维指标评估学生实时兼容性：（1）基于图的共识过滤误导性推理路径；（2）基于互信息的适应性检测“顿悟时刻”以真正理解推理过程而非单纯模仿；（3）基于损失的难度评估学生对教师指导的接受度并防止负迁移。大量实验和潜在空间分析表明，COMPACT能有效整合多样化推理能力而不破坏模型原有知识结构，在各种基准测试中取得最佳性能并缓解灾难性遗忘。

英文摘要

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.

URL PDF HTML ☆

赞 0 踩 0

2601.11956 2026-05-19 cs.CL cs.AI 版本更新

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

双重校准：通过校准知识和推理置信度实现可靠的LLM

Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China（中山大学计算机科学与工程学院，广州，中国）； Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR（香港理工大学计算机系，香港特别行政区）； School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR（香港 Metropolitan 大学科技学院，香港特别行政区）

AI总结本文提出双重校准框架，通过校准知识和推理置信度提升LLM的可靠性，实验表明其在保持低token成本的同时显著提高准确性和置信度校准。

Comments This work is to appear in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

2601.06633 2026-05-19 cs.LG cs.AI cs.CL cs.CY 版本更新

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER：面向开放性编程任务的知识对齐学生错误模拟器

Zhangqi Duan, Nigel Fernandez, Andrew Lan

发表机构 * University of Massachusetts（马萨诸塞大学）； University of Massachusetts Amherst（马萨诸塞大学阿默斯特分校）

AI总结 KASER通过强化学习方法，结合代码相似性、错误匹配和预测多样性，提升大语言模型对学生错误的模拟与预测能力，实验表明其在代码和错误预测及错误覆盖方面优于基线方法。

Comments Published in ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情

AI中文摘要

开放性任务，如计算机科学教育中的编程问题，能提供关于学生知识的深入洞察。然而，训练大语言模型（LLMs）模拟和预测学生在这些问题上的可能错误具有挑战性：它们常出现模式崩溃，并无法充分捕捉学生响应中的语法、风格和解决方案方法的多样性。在本文中，我们提出了KASER（知识对齐学生错误模拟器），一种将错误与学生知识对齐的新方法。我们提出了一种基于强化学习的训练方法，使用混合奖励反映学生代码预测的三个方面：i）代码与地面真相的相似性，ii）错误匹配，以及iii）代码预测的多样性。在两个真实世界数据集上，我们进行了两个层面的评估，并表明：在每对学生-问题对层面，我们的方法在代码和错误预测上优于基线；在每问题层面，我们的方法在错误覆盖和模拟代码多样性上优于基线。

英文摘要

Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity.

URL PDF HTML ☆

赞 0 踩 0

2512.17843 2026-05-19 cs.CL cs.AI cs.HC 版本更新

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat: 一个真实对话的大型数据集

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

发表机构 * Indiana University（印第安纳大学）

AI总结本文提出ShareChat数据集，包含142,808条对话（660,293个回合），涵盖95种语言，分析不同平台对话完整性和响应延迟差异，揭示多平台交互特性。

详情

AI中文摘要

通过统一的文本接口评估大型语言模型（LLMs），当前学术基准掩盖了不同商业平台的独特设计和功能如何影响真实用户行为和系统性能。为弥合这一差距，我们提出了ShareChat，这是首个包含142,808条对话（660,293个回合）的大型语料库，从ChatGPT、Perplexity、Grok、Gemini和Claude的公开共享URL中收集。ShareChat保留了原生平台功能，包括引用、思考痕迹和代码 artifacts，涵盖95种语言，时间跨度从2023年4月至2025年10月，补充了现有语料库中同质化交互的不足。为了展示数据集的评估用途，我们提出了三个案例研究：对话完整性分析评估跨平台意图满足差异，来源定位分析比较搜索增强系统之间的引用策略，时间分析揭示响应延迟动态的差异。这些分析展示了单平台或剥离功能语料库无法解决的研究问题。该数据集已公开可用。

英文摘要

By evaluating Large Language Models (LLMs) through uniform, text-only interfaces, current academic benchmarks obscure how the unique designs and affordances of distinct commercial platforms shape real-world user behavior and system performance. To bridge this gap, we present ShareChat, the first large-scale corpus of 142,808 conversations (660,293 turns) collected from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat preserves native platform affordances, including citations, thinking traces, and code artifacts, across 95 languages and the period from April 2023 to October 2025, complementing existing corpora that homogenize these interactions. To demonstrate the dataset's evaluative utility, we present three case studies: a conversation completeness analysis assessing cross-platform differences in intent satisfaction, a source grounding analysis comparing citation strategies between search-augmented systems, and a temporal analysis revealing divergent response latency dynamics. Together, these analyses demonstrate research questions that are inaccessible to single-platform or stripped-affordance corpora. The dataset is publicly available.

URL PDF HTML ☆

赞 0 踩 0

2511.19078 2026-05-19 cs.CL cs.AI 版本更新

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind: 一种基于动态GNN的定理选择与结论生成框架用于LLM推理

Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

AI总结 GraphMind通过动态图神经网络与LLM结合，实现多步推理中的定理选择和结论生成，提升上下文感知的推理能力。

Comments This paper has been withdrawn by the authors in order to prepare a substantially revised version

详情

AI中文摘要

大型语言模型（LLMs）在自然语言理解和生成方面表现出色，包括多步推理如数学证明。然而，现有方法缺乏显式且动态的机制来结构化表示和演变中间推理状态，限制了其在上下文感知定理选择和迭代结论生成方面的能力。为此，我们提出了GraphMind，一种新颖的动态图基框架，将图神经网络（GNN）与LLMs结合，以迭代方式选择定理并生成中间结论。我们的方法将推理过程建模为异构演进图，其中节点代表条件、定理和结论，边捕捉节点间的逻辑依赖。通过编码当前推理状态并利用语义匹配进行定理选择，我们的框架在闭环模式下实现了上下文感知、可解释和结构化的推理。在各种问答（QA）数据集上的实验表明，所提出的GraphMind方法在多步推理中实现了稳定性能提升，并显著优于现有基线方法，验证了我们方法的有效性和通用性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

URL PDF HTML ☆

赞 0 踩 0

2511.06516 2026-05-19 cs.CL 版本更新

基于图扩散模型的多LLM代理通信拓扑动态生成

Eric Hanchen Jiang, Mengting Li, Guancheng Wan, Sophia Yin, Yuchen Wu, Xiao Liang, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, Ying Nian Wu

发表机构 * University of California Los Angeles（加州大学洛杉矶分校）； University of Washington（华盛顿大学）； Nanyang Technological University（南洋理工大学）

AI总结本文提出Guided Topology Diffusion框架，通过迭代构建过程生成适应任务需求的高效通信拓扑，优于现有方法。

Comments ACL 2026 Main

2510.01782 2026-05-19 cs.CL cs.AI 版本更新

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

大语言模型能否拒绝它们不知道的问题？在事实性任务中测量知识感知的拒绝

Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia

发表机构 * Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； Harbin Institute of Technology（哈尔滨工业大学）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算机与数据科学学院）； School of Computer Science and Engineering, Central South University（中南大学计算机科学与工程学院）

AI总结本文提出Refusal Index（RI）作为衡量大语言模型知识感知拒绝能力的新指标，通过评估拒绝概率与错误概率的相关性，揭示模型在事实性任务中的可靠性问题。

Comments Accepted at ICLR 2026

详情

AI中文摘要

大语言模型（LLMs）应拒绝回答超出其知识范围的问题。这种称为知识感知拒绝的能力对于事实性可靠性至关重要，但现有指标未能捕捉这一能力。本文提出Refusal Index（RI），一种新颖且原理明确的度量标准，用于衡量LLMs拒绝其不知问题的准确性。我们将RI定义为拒绝概率与错误概率之间的Spearman秩相关性。RI可通过轻量级两轮评估方法进行实际测量，仅需在两个标准评估运行中观察到的拒绝率。在16个模型和5个数据集上的广泛实验表明，RI准确量化了模型的知识感知拒绝能力。值得注意的是，RI在不同拒绝率下保持稳定，并提供一致的模型排名，不依赖于模型的整体准确率和拒绝率。这些特性表明RI捕捉了模型知识校准的稳定、内在方面。更重要的是，RI提供了关于LLM事实性的重要但此前被忽视方面的见解：尽管LLM在事实性任务上实现高准确率，但其拒绝行为可能不可靠且脆弱。

英文摘要

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, while existing metrics fail to capture this ability. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method which only require observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.

URL PDF HTML ☆

赞 0 踩 0

2509.22061 2026-05-19 eess.AS cs.CL cs.SD 版本更新

Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

说出你的想法：语音延续任务作为语音基础模型偏见的探测器

Shree Harsha Bokkahalli Satish, Harm Lameris, Olivier Perrotin, Gustav Eje Henter, Éva Székely

发表机构 * Department of Speech, Music and Hearing, KTH Royal Institute of Technology（语音、音乐与听觉系，皇家理工学院）； Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab（格勒诺布尔阿尔卑斯大学，法国国家科学研究中心，格勒诺布尔理工学院，GIPSA实验室）

AI总结本文首次系统评估语音延续任务中的偏见，探讨性别和音质类型对延续行为的影响，发现模型和性别交互显著，且女性提示更倾向于回归模态音质，揭示语音质量偏见。

Comments 8 pages, 2 figures, Accepted to Identity-Aware AI LREC Workshop 2026

详情

AI中文摘要

语音延续（SC）任务是生成连贯的语音提示扩展，同时保持语义上下文和说话者身份。由于SC受限于单一音频流，它比对话更直接地探测语音基础模型的偏见。本文首次系统评估SC中的偏见，研究性别和音质类型（气音、嘶嘶音、末端嘶嘶音）对延续行为的影响。评估了三个最新模型：SpiritLM（基础和表达型）、VAE-GSLM和SpeechGPT，在说话者相似性、语音质量保持和基于文本的偏见指标上。结果表明，尽管说话者相似性和连贯性仍具挑战性，文本评估揭示了显著的模型和性别交互：一旦连贯性足够高（对于VAE-GSLM），性别效应会在文本指标如代理性和句子极性上显现。此外，延续行为更倾向于回归模态音质，特别是对于女性提示，揭示了系统性的语音质量偏见。这些发现突显了SC作为探测社会相关表征偏见的受控探测器，表明随着延续质量的提高，SC将成为越来越有信息量的诊断工具。

英文摘要

Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.

URL PDF HTML ☆

赞 0 踩 0

2509.21319 2026-05-19 cs.CL cs.AI cs.LG 版本更新

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF：二进制灵活反馈用于连接人类反馈与可验证奖励

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

发表机构 * NVIDIA

AI总结 RLBFF结合人类偏好与规则验证，提升奖励模型对响应质量的精准捕捉，优于Bradley-Terry模型，在RM-Bench和JudgeBench上取得优异成绩，且支持用户自定义反馈原则。

Comments Published at ICLR 2026, 21 pages

详情

AI中文摘要

Reinforcement Learning with Human Feedback (RLHF) 和 Reinforcement Learning with Verifiable Rewards (RLVR) 是LLM后训练的主要RL范式，各有优势。然而，RLHF在可解释性和奖励黑客问题上存在困难，因为它依赖于通常缺乏明确标准的人类判断，而RLVR则受限于其对正确性基于验证器的专注。我们提出Reinforcement Learning with Binary Flexible Feedback (RLBFF)，结合人类驱动的偏好灵活性与规则基础验证的精确性，使奖励模型能够捕捉响应质量的细微方面，超越单纯的正确性。RLBFF从自然语言反馈中提取可以二进制回答的原则（例如信息准确性：是，或代码可读性：否）。这些原则随后可用于将奖励模型训练作为蕴含任务（响应满足或不满足任意原则）。我们展示奖励模型以这种方式训练可以优于匹配数据的Bradley-Terry模型，在RM-Bench（86.2%）和JudgeBench（81.4%，2025年9月24日排行榜第一）上取得最佳成绩。此外，用户可以在推理时指定感兴趣的原理以自定义我们的奖励模型，与Bradley-Terry模型不同。最后，我们提供了一个完全开源的食谱（包括数据）来对Qwen3-32B进行对齐，以匹配或超过o3-mini和DeepSeek R1在MT-Bench、WildBench和Arena Hard v2的一般对齐基准上的性能（在<5%的推理成本下）。模型：https://huggingface.co/collections/nvidia/reward-models-10-2025

英文摘要

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025

URL PDF HTML ☆

赞 0 踩 0

2509.17912 2026-05-19 cs.CL 版本更新

SiDiaC: Sinhala Diachronic Corpus

Sinhala 语料库：SiDiaC：历时性语料库

Nevidu Jayatilleke, Nisansa de Silva

发表机构 * Department of Computer Science & Engineering, University of Moratuwa（计算机科学与工程系，穆拉图瓦大学）

AI总结 SiDiaC 是首个全面的僧加罗语历时性语料库，涵盖从5世纪到20世纪的58,000个单词，经过筛选和标注后，为僧加罗语NLP研究提供了基础资源，支持词形变化、新词追踪和历史语法研究。

Comments 17 pages, 7 figures, 9 tables, Accepted paper at the 39th Pacific Asia Conference on Language, Information and Computation (PACLIC 39)

详情

Journal ref: https://aclanthology.org/2025.paclic-1.47/

AI中文摘要

SiDiaC 是首个全面的僧加罗语历时性语料库，涵盖从5世纪到20世纪的58,000个单词。该语料库通过筛选和标注，基于写作日期进行精细处理，经过Google Document AI OCR数字化和后处理，以纠正格式并现代化拼写。SiDiaC 的构建借鉴了其他语料库（如FarPaHC）的实践，特别是在句法标注和文本规范化策略上，以应对僧加罗语资源匮乏的挑战。该语料库按文体分为两层：初级分类将每本书分为非虚构或虚构，次级分类则更具体，将文本归类为宗教、历史、诗歌、语言和医学等类别。尽管面临获取稀有文本有限和依赖次级日期来源等挑战，SiDiaC 为僧加罗语NLP研究提供了基础资源，显著扩展了僧加罗语可用资源，支持历时性研究，如词汇变化、新词追踪、历史语法和基于语料库的词典编纂。

英文摘要

SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, annotated carefully based on the written date, after filtering based on availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, due to the shared characteristics of low-resourced language status. This corpus is categorised based on genres into two layers: primary and secondary. Primary categorisation is binary, classifying each book into Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala, enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.

URL PDF HTML ☆

赞 0 踩 0

2509.04471 2026-05-19 cs.CL cs.AI 版本更新

MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

MOSAIC：一种多语言、无类别依赖且计算高效的放射报告分类方法

Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott

发表机构 * Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）； Neurobiology Research Unit, Copenhagen University Hospital（哥本哈根大学医院神经生物学研究单位）； Unumed Aps（Unumed公司）； Department of Diagnostic Radiology, Copenhagen University Hospital（哥本哈根大学医院诊断放射学系）； Department of Clinical Medicine, University of Copenhagen（哥本哈根大学临床医学系）； Cerebriu A/S（Cerebriu公司）； Institute for Human Genetics, Medical University of Innsbruck（因斯布鲁克医学大学人类遗传学研究所）

AI总结 MOSAIC通过紧凑开放模型实现多语言、无类别依赖的放射报告分类，无需大量标注数据，且在多种影像模态和标签体系上表现优异，达到专家水平性能。

Comments 8 pages, 14 pages including references and appendix. 9 figures. Preprint

详情

Journal ref: Proceedings of the ClinicalNLP Workshop at LREC 2026

AI中文摘要

放射学报告包含丰富的临床信息，可用于训练影像模型而无需依赖昂贵的手动标注。然而，现有方法面临关键限制：基于规则的方法难以处理语言多样性，监督模型需要大量标注数据集，而近期基于LLM的方法依赖封闭源或资源密集型模型，不适合临床使用。此外，当前解决方案大多局限于英语和单模态、单类别数据集。我们介绍了MOSAIC，一种多语言、无类别依赖且计算高效的放射报告分类方法。基于紧凑的开放访问语言模型（MedGemma-4B），MOSAIC支持零/少样本提示和轻量级微调，可在消费级GPU上部署。我们在英语、西班牙语、法语和丹麦语的七个数据集上评估MOSAIC，涵盖多种影像模态和标签体系。该模型在五个胸部X光数据集上达到平均宏F1分数88，接近或超过专家水平性能，同时仅需24GB GPU内存。通过数据增强，仅需80个标注样本即可在丹麦报告上达到加权F1分数82，相比完整1600样本训练集的86分。MOSAIC为临床环境中大型或专有LLM提供了实用替代方案。代码和模型是开源的。我们邀请社区在新语言、类别和模态上评估和扩展MOSAIC。

英文摘要

Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.

URL PDF HTML ☆

赞 0 踩 0

2508.04149 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

基于难度的偏好数据选择：通过DPO隐式奖励差距

Xuan Qi, Rongwu Xu, Zhijing Jin

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington（华盛顿大学计算机科学与工程保罗·G·艾伦学校）； Max Planck Institute for Intelligent Systems, Tübingen, Germany（德国图宾根马克斯·普朗克智能系统研究所）； Jinesis Lab, University of Toronto & Vector Institute（多伦多大学Jinesis实验室及向量研究所）

AI总结本文提出基于难度的偏好数据选择方法，利用DPO隐式奖励机制选择奖励差距小的样本，提升数据效率和模型对齐性能，在多个数据集和对齐任务中优于五个基线方法。

Comments Our code and data are available at https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select

详情

AI中文摘要

对齐大语言模型（LLMs）与人类偏好是AI研究中的关键挑战。尽管强化学习从人类反馈（RLHF）和直接偏好优化（DPO）等方法被广泛使用，但它们通常依赖于大规模、成本高的偏好数据集。本文缺少针对偏好数据的高质量数据选择方法。在本文中，我们引入了一种基于难度的偏好数据选择策略，该策略基于DPO隐式奖励机制。通过选择奖励差距较小的偏好数据示例，这些示例代表更具挑战性的案例，从而提高数据效率和模型对齐。我们的方法在多个数据集和对齐任务中一致优于五个强大的基线方法，仅使用原始数据的10%即可实现优越性能。这种原理上高效的选择方法为在有限资源下扩展LLM对齐提供了有前景的解决方案。

英文摘要

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.

URL PDF HTML ☆

赞 0 踩 0

2506.23978 2026-05-19 cs.LG cs.CL cs.CY cs.SI 版本更新

LLM Agents Are the Antidote to Walled Gardens

大语言模型代理是封闭生态系统的解药

Samuele Marro, Philip Torr

发表机构 * Department of Engineering Science, University of Oxford（牛津大学工程科学系）； Institute for Decentralized AI（去中心化人工智能研究所）

AI总结本文提出通过大语言模型代理实现通用互操作性，打破封闭平台垄断，促进数据端到端迁移，同时探讨其带来的安全与法律挑战。

Comments Published at the ICML 2026 Position Paper track

详情

AI中文摘要

尽管互联网的核心基础设施最初设计为开放和通用，但当今的应用层却被封闭的专有平台主导。开放且互操作的API需要大量投资，而市场领导者缺乏激励去启用可能削弱用户锁定的数据交换。我们主张基于大语言模型的代理从根本上颠覆这一现状。代理可以自动转换数据格式并与为人设计的界面交互：这使互操作性大幅降低且实际上不可避免。我们称之为这种转变通用互操作性：任何两个数字服务都能通过AI调解的适配器无缝交换数据的能力。通用互操作性削弱了垄断行为，促进数据端到端迁移。然而，它也可能导致新的安全风险、技术债务和法律摩擦。我们的立场是ML社区应拥抱这一发展，同时构建适当的框架来减轻负面影响。通过现在行动，我们可以利用AI恢复用户自由和竞争市场，而不牺牲安全。

英文摘要

While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks, technical debt, and legal frictions. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.

URL PDF HTML ☆

赞 0 踩 0

2506.12119 2026-05-19 cs.CL cs.AI 版本更新

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

专家混合模型可在严格相等资源下超越密集语言模型

Houyi Li, Ka Man Lo, Shijie Xuyang, Ziqi Wang, Wenzhen Zheng, Haocheng Zhang, Zhao Li, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

发表机构 * Fudan University（复旦大学）； StepFun ； University of Science and Technology of China（中国科学技术大学）； Zhejiang University（浙江大学）

AI总结本文研究在资源相等条件下MoE模型是否能超越密集模型，提出优化框架并验证了在最优激活率下MoE模型性能更优，且该区域在不同模型规模下一致，通过数据重用解决数据量增加的权衡问题。

Comments Published as a conference paper at ICLR 2026

详情

AI中文摘要

专家混合（MoE）语言模型显著扩展了模型容量，并在不增加每token计算量的情况下实现了显著性能提升。然而，在严格相等的资源约束下，即总参数量、训练计算和数据预算完全相同的情况下，MoE能否超越密集架构？尽管其具有重要的实际价值和潜力，这一问题仍缺乏深入研究。本文提出了一种新的视角和方法论框架，系统研究这一问题。首先，我们全面调查了MoE的架构并实现了最优模型设计以最大化性能。基于此，我们发现，在最优区域内的MoE模型在相同总参数、训练计算和数据资源下能够超越其密集 counterpart。更重要的是，这一最优区域在不同模型规模下保持一致。虽然增加的数据量会带来性能的权衡，但我们通过重用数据解决了这一问题。我们通过广泛的实验验证了我们的发现，训练了近200个20亿参数规模的语言模型和超过50个70亿参数规模的语言模型，累计处理了50万亿token。所有模型检查点均已公开。

英文摘要

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints -- that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this optimal region remains consistent across different model sizes. Although additional amount of data turns out to be a trade-off for enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All model checkpoints are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2505.19590 2026-05-19 cs.LG cs.CL 版本更新

Learning to Reason without External Rewards

无需外部奖励的学习推理

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song

发表机构 * UC Berkeley（加州大学伯克利分校）； Yale University（耶鲁大学）

AI总结本文提出Intuitor方法，通过内在反馈实现无需外部奖励的自主学习，实验表明其在数学基准和代码生成等任务中表现优异，为无监督学习提供了新途径。

Comments ICLR 2026

详情

AI中文摘要

训练大型语言模型（LLMs）进行复杂推理的强化学习可验证奖励（RLVR）方法虽有效但受限于昂贵的领域特定监督。我们探索强化学习从内在反馈（RLIF）框架，使LLMs能从内在信号学习而无需外部奖励或标注数据。我们提出Intuitor，一种使用模型自身信心术语自信心作为唯一奖励信号的RLIF方法。Intuitor在组相对策略优化（GRPO）中用自信心分数替代外部奖励，实现完全无监督学习。实验表明，Intuitor在数学基准上与GRPO性能相当，但在代码生成等跨领域任务中泛化能力更强，无需黄金解决方案或测试用例。我们的发现表明，内在模型信号能驱动跨领域有效学习，为无可验证奖励的自主AI系统提供可扩展替代方案。代码可在https://github.com/sunblaze-ucb/Intuitor获取。

英文摘要

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence-termed self-certainty-as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor

URL PDF HTML ☆

赞 0 踩 0

2505.16831 2026-05-19 cs.CL cs.AI cs.CR cs.LG 版本更新

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

反学习不是删除：调查机器反学习在大语言模型中的可逆性

Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Carnegie Mellon University（卡内基梅隆大学）； University of California, Santa Cruz（加州大学圣克ruz分校）； Huawei Technologies（华为技术有限公司）； Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU（未来智能系统中的隐私与安全技术研究中心，PolyU）

AI总结研究揭示大语言模型反学习的可逆性问题，提出表示层面分析框架，通过PCA相似度、CKA和Fisher信息等指标评估表示漂移，发现四种遗忘模式，指出数据来源影响重学效率，揭示不可逆遗忘的挑战。

Comments ICML 2026, accepted to appear

详情

AI中文摘要

在大语言模型（LLMs）中，反学习旨在移除指定数据，但其效果通常通过任务级指标如准确率和困惑度评估。我们证明这些指标可能误导，因为模型似乎遗忘，但通过最小微调即可恢复原始行为。这种可逆性表明信息被抑制而非真正删除。为填补这一评估空白，我们引入表示层面分析框架。我们的工具包包括PCA相似度和位移、中心核对齐（CKA）和Fisher信息，辅以均值PCA距离作为总结指标，用于衡量表示漂移。在多种反学习方法、数据领域和LLMs上应用此框架，我们识别出四种基于可逆性和灾难性程度的遗忘模式。我们比较了恢复策略，发现重学效率依赖于数据来源。我们还发现不可逆、非灾难性遗忘异常困难。通过探测反学习极限，我们识别出一个看似不可逆的目标遗忘案例，为更稳健的擦除算法提供见解。总体而言，我们的发现揭示了当前评估的差距，并建立了可信反学习的表示层面基础。

英文摘要

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This \emph{reversibility} suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a \emph{representation-level analysis framework}. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their \emph{reversibility} and \emph{catastrophicity}. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.

URL PDF HTML ☆

赞 0 踩 0

2503.20981 2026-05-19 cs.CL cs.AI cs.SI 版本更新

通过安全过滤和宪法AI实现负责任的联邦大语言模型

Eunchung Noh, Jeonghun Baek

发表机构 * Samsung Electronics（三星电子）； The University of Tokyo（东京大学）

AI总结本文提出在联邦大语言模型中引入安全过滤和宪法AI技术，以提升模型安全性，实验显示在AdvBench上安全性能提升超过20%。

Comments Accepted at the 6th Workshop on Trustworthy NLP (TrustNLP), ACL 2026

2502.13957 2026-05-19 cs.CL cs.AI 版本更新

Supervising the search process produces reliable and generalizable information-seeking agents

通过监督搜索过程产生可靠且可推广的信息寻求代理

Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang

发表机构 * Department of Computer Science, University of Virginia, USA（弗吉尼亚大学计算机科学系）； National Library of Medicine, National Institutes of Health, USA（美国国立卫生研究院国家医学图书馆）； Department of Computer Science, University of Illinois Urbana–Champaign, USA（伊利诺伊大学厄巴纳-香槟分校计算机科学系）； Medical Oncology, Dana–Farber Cancer Institute, USA（达纳-法伯癌症研究所医学肿瘤科）； Surgery, University of Alabama at Birmingham, USA（阿拉巴马大学伯明翰分校外科系）； Department of Neurology, Yale School of Medicine, USA（耶鲁医学院神经病学系）

AI总结本文提出通过监督搜索过程来构建更可靠且可推广的信息寻求代理，通过RAG-Gym框架系统研究了架构设计、参数优化和动作评估，发现推理反思是关键能力，Re$^2$Search++在多跳信息检索基准上取得显著提升，尤其在领域外任务中表现更优。

Comments Homepage: https://rag-gym.github.io; Code: https://github.com/RAG-Gym/RAG-Gym

详情

AI中文摘要

大型语言模型（LLMs）通过从文档排序转向综合答案的方式改变了网络搜索，并越来越多地被用作自主的代理搜索系统，这些系统通过迭代与外部知识源交互。尽管有进展，构建有效的搜索代理仍然具有挑战性，因为高质量的中间搜索步骤难以生成。以往的方法主要依赖于结果监督，仅奖励代理生成正确最终答案。这往往导致奖励黑客和对参数记忆的过度依赖，限制了对领域外任务的泛化能力。为了解决这些限制，我们引入RAG-Gym框架，将监督从最终答案转移到搜索过程本身。通过RAG-Gym，我们系统地研究了架构设计、参数优化和动作评估，确定推理反思是搜索代理的关键能力。基于这一见解，我们提出了Re$^2$Search++，一个受过程监督的代理，它在多跳信息检索基准上实现了显著改进，尤其是在领域外设置中。性能提升主要由更高质量的搜索查询驱动，而非仅靠答案优化。所学的搜索批评者能够跨模型转移，包括专有LLMs。这些发现表明，监督搜索过程会产生更可靠且可推广的信息寻求代理。

英文摘要

Large language models (LLMs) are transforming web search by shifting from document ranking to synthesizing answers, and are increasingly deployed as autonomous agentic search systems that iteratively interact with external knowledge sources. Despite this progress, building effective search agents remains challenging because high-quality intermediate search steps are difficult to generate. Previous approaches have primarily relied on outcome supervision, rewarding agents only for producing correct final answers. This often leads to reward hacking and excessive dependence on parametric memory, limiting generalization to out-of-domain tasks. To address these limitations, we introduce RAG-Gym, a framework that shifts supervision from final answers to the search process itself. With RAG-Gym, we systematically investigate architecture design, parameter optimization, and action evaluation, identifying reasoning reflection as a critical capability for search agents. Building on this insight, we propose Re$^2$Search++, a process-supervised agent that achieves substantial improvements on multi-hop information-seeking benchmarks, especially in out-of-domain settings. Performance gains are driven primarily by higher-quality search queries rather than answer optimization alone, and the learned search critics transfer across models, including proprietary LLMs. These findings show that supervising the search process produces more reliable and generalizable information-seeking agents.

URL PDF HTML ☆

赞 0 踩 0

2411.10636 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

缓解孟加拉语分类任务中的外在性别偏见

Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar

发表机构 * Ahsanullah University of Science and Technology（阿沙努拉科学与技术大学）； University of California, Riverside（加州大学河滨分校）

AI总结本文研究了孟加拉语预训练语言模型中的外在性别偏见，构建了四个任务特定的基准数据集，并提出RandSymKL方法以缓解偏见，实验表明其能有效减少偏见并保持高准确率。

详情

AI中文摘要

在本研究中，我们探讨了孟加拉语预训练语言模型中的外在性别偏见，这是一个在低资源语言中鲜有研究的领域。为了评估这种偏见，我们构建了四个人工标注的任务特定基准数据集，用于情感分析、毒性检测、仇恨言论检测和讽刺检测。每个数据集都通过细致的性别扰动进行了增强，通过系统地交换性别化名称和术语并保持语义内容，实现了对性别驱动预测变化的最小配对评估。然后，我们提出RandSymKL，一种整合对称KL散度和交叉熵损失的随机去偏策略，以在任务特定的预训练模型中缓解偏见。RandSymKL是一种精炼的训练方法，以统一的方式整合这些元素，专注于分类任务的外在性别偏见缓解。我们的方法在现有偏见缓解方法上进行了评估，结果表明，我们的技术不仅有效减少了偏见，还与其他基线方法相比保持了竞争性的准确性。为了促进进一步研究，我们已公开了我们的实现和数据集：https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

英文摘要

In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

URL PDF HTML ☆

赞 0 踩 0

2409.10102 2026-05-19 cs.IR cs.AI cs.CL 版本更新

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

检索增强生成系统中的可信度：综述

Yujia Zhou, Wenbo Zhang, Jingying Shao, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Jason Chen Zhang, Zhicheng Dou, Philip S. Yu, Jiaxin Mao

发表机构 * Tsinghua University（清华大学）； Renmin University of China（中国人民大学）； The Chinese University of Hong Kong（香港中文大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Hong Kong Polytechnic University（香港理工大学）； Microsoft Research Asia（微软亚洲研究院）； University of Illinois（伊利诺伊大学）

AI总结本文综述了检索增强生成系统中可信度的关键维度，提出Trust-RAG Compass框架，评估事实性、鲁棒性等六个方面，并建立评估基准，揭示不同LLM在可信度方面的性能差异，指出未来研究方向。

详情

AI中文摘要

检索增强生成（RAG）已迅速成为大型语言模型（LLMs）发展中的关键范式。尽管现有研究主要强调准确性和效率，但RAG系统的可信度仍缺乏充分探讨。RAG通过将响应基于外部和最新知识来提高LLM的可靠性，减少幻觉。然而，不可靠的检索或不当的知识利用仍可能导致不良输出。为此，我们提出统一框架Trust-RAG Compass，从事实性、鲁棒性、公平性、透明性、问责性和隐私六个关键维度评估RAG系统的可信度。在此框架下，我们对现有文献进行了全面回顾，并引入评估基准TRC Bench，围绕六个维度对多种专有和开源模型进行全面评估。我们的结果揭示了不同类型的LLM在不同可信度维度上的性能差距。最后，基于我们的发现，我们识别了关键挑战和未来研究的前景。通过这项工作，我们旨在为后续研究提供结构化基础，并为开发真实场景中的可信RAG系统提供实用指导。

英文摘要

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (\underline{T}rust-\underline{R}AG \underline{C}ompass \underline{Bench}mark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.

URL PDF HTML ☆

赞 0 踩 0

2409.02428 2026-05-19 cs.LG cs.AI cs.CL cs.SY eess.SY 版本更新

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement

语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, China（清华大学深圳国际研究生院，清华大学，中国）； Department of Computer Science, University of Oxford, United Kingdom（英国牛津大学计算机科学系）； Department of Data Science, New Jersey Institute of Technology, USA（美国新泽西理工学院数据科学系）

AI总结本文提出ERFSL，利用语言模型高效搜索奖励函数，通过生成奖励组件和使用奖励批评者修正代码，实现多目标强化学习任务中零样本学习的高效奖励函数设计。

详情

AI中文摘要

在强化学习任务中，设计和改进复杂定制环境和多重需求的奖励函数具有挑战性。本文提出ERFSL，一种利用大型语言模型（LLMs）的高效奖励函数搜索器，使LLMs成为有效的白盒搜索器，并突出其先进的语义理解能力。具体而言，我们为每个数值明确的用户需求生成奖励组件，并使用奖励批评者识别正确的代码形式。然后，LLMs为奖励组件分配权重以平衡其值，并通过灵活采用方向突变和交叉策略迭代调整权重，类似于遗传算法，基于训练日志分析器提供的上下文。我们将其应用于无直接人类反馈或奖励示例的定制数据收集RL任务（零样本学习）。奖励批评者仅需每个需求一个反馈实例即可有效纠正奖励代码，防止不可纠正的错误。权重初始化使在帕累托解集内获取不同奖励函数而无需权重搜索。即使权重偏差达500倍，平均仅需5.2次迭代即可满足用户需求。ERFSL也适用于大多数使用GPT-4o mini的提示，因为我们分解了权重搜索过程，以降低对数值和长上下文理解能力的要求。

英文摘要

Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to a customized data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.

URL PDF HTML ☆

赞 0 踩 0

2407.16216 2026-05-19 cs.CL 版本更新

Reinforcement Learning for LLM Post-Training: A Survey

为大型语言模型进行后训练的强化学习：综述

Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu, Zhu, Xiang-Bo Mao, Sitaram Asur, Na, Cheng

发表机构 * Salesforce ； AWS AI Labs ； Airbnb

AI总结本文综述了通过强化学习进行大型语言模型后训练的方法，分析了RLHF和RLVR等技术，并提出统一的策略梯度框架，旨在为研究人员提供技术参考。

详情

AI中文摘要

大型语言模型（LLMs）通过预训练和监督微调（SFT）训练后，仍可能产生有害或不一致的输出，或在数学和编程等领域表现不佳。基于强化学习（RL）的后训练方法，如通过人类反馈的强化学习（RLHF）方法（如直接偏好优化（DPO）和可验证奖励的强化学习（RLVR）方法（如PPO和GRPO）等，已显著缓解了这些问题。然而，现有研究未对推动这些进展的各种方法进行技术细节的比较。为填补这一空白，本文提出了一项及时的综述，将基础组件与最新进展联系起来。我们推导出一个统一的策略梯度框架，将预训练、SFT、RLHF和RLVR作为特殊案例，并组织其中更近期的技术。本文的主要贡献包括：（1）对MLE、RLHF和RLVR基础以及统一策略梯度框架的自包含介绍；（2）详细分析PPO和GRPO方法以及离线和迭代DPO方法，沿提示采样、响应采样和梯度系数轴分解；（3）标准化符号，实现直接跨方法比较；（4）附录中对每种方法的实现细节和实证结果的全面比较。我们旨在为从事LLM后训练研究的研究人员和实践者提供技术参考。

英文摘要

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable gains to alleviate these issues. Yet, no existing work offers a technically detailed comparison of the various methods driving this progress. In order to fill this gap, we present a timely survey that connects foundational components with latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy gradient framework; (2) detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt sampling, response sampling, and gradient coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) comprehensive comparison of implementation details and empirical results of each method in the appendix. We aim to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.

URL PDF HTML ☆

赞 0 踩 0

2605.16613 2026-05-19 cs.CL econ.GN q-fin.EC q-fin.GN 版本更新

Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text

超越情感分类：一种用于文本情感强度评估的生成框架

Francesco A. Fabozzi, Dasol Kim, William N. Goetzmann

AI总结本文提出一种新的情感建模方法，关注情感评估而非识别，通过构建情感强度评分数据集并微调生成语言模型，实现更通用的情感分析框架，优于传统分类方法并在相关领域表现出色。

Comments 10 pages, no figures, 5 tables

2605.16600 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

预训练写入，对齐读取：Transformer权重空间的不对称性

Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

AI总结研究揭示了预训练和对齐在Transformer权重空间中的不对称性，通过分析权重变化在残差流激活子空间和预测子空间中的对齐情况，发现读路径权重集中于注意力输入激活的主方向，而写路径权重在预测子空间中保持各向同性。

详情

AI中文摘要

交叉熵预训练和偏好对齐更新相同的Transformer权重，但留下几何上不同的痕迹。我们通过相对子空间分数探针来刻画这种不对称性，追踪权重变化如何与残差流激活子空间和由去嵌入定义的预测子空间对齐。对齐变化集中在读路径（W_Q，W_K）上，沿着注意力输入激活的主方向，而写路径（W_O，W_2）相对于预测子空间则保持近各向同性。我们通过各向异性梯度积累来解释这种模式：对矩阵W的更新是外积δ_t a_t^T之和，继承自哪一侧的协方差集中。对于读路径矩阵，这一侧是输入激活a_t，其协方差在训练过的Transformer中呈尖峰状，因此产生与目标无关的集中。对于写路径矩阵，相关的一侧是上游梯度δ_t，其各向异性取决于损失。交叉熵提供标准的每样本信号，诱导预训练期间写路径的预测几何；对齐目标通常在写路径上添加很少的进一步集中。我们通过检查点内轨迹、渐进对比目标控制以及闭合形式的秩1干预与匹配方向控制来支持这一解释，为所提出的权重空间几何提供因果证据。

英文摘要

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $δ_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $δ_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.

URL PDF HTML ☆

赞 0 踩 0

2605.16538 2026-05-19 cs.HC cs.CL 版本更新

LLMs in Qualitative Research: Opportunities, Limitations, and Practical Considerations

大语言模型在定性研究中的应用：机遇、限制与实践考量

Henry Salgado, Meagan R. Kendall, Martine Ceberio, Alexandra Coso Strong

AI总结本文探讨了大语言模型在定性研究中的机遇、限制及实践问题，强调研究者需批判性地处理技术参数，结合定性方法与可解释AI，讨论大语言模型的透明度与传统NLP工具的差异。

Comments To be published and presented in 2026 ASEE Annual Conference and Exposition

2605.16516 2026-05-19 cs.HC cs.AI cs.CL cs.CY 版本更新

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

长期人类-大语言模型交互中的对齐漂移：一种机制导向的框架

Xintong Yao

AI总结本文提出一种机制导向的框架，用于描述长期人类-大语言模型交互中的对齐漂移现象，通过反馈回路和子模式选择解释漂移的发展过程，并将对齐漂移视为递归互动过程而非孤立模型失败。

Comments 16 pages, 1 appendix

详情

DOI: 10.5281/zenodo.20113611

AI中文摘要

长期与基于大语言模型的系统交互可能导致对齐漂移：一种渐进过程，其中系统输出逐渐受用户当前消息的约束减少，而更多受先前交互历史影响，尽管仍显得有帮助、连贯和响应。此过程难以检测，因为用户的主观体验可能随着系统变得更熟悉、有用和适应而改善。现有研究主要集中在短期任务表现、孤立输出或单实例对齐问题，导致慢性和累积的交互层面动态未被充分描述。本文提出一种机制导向的框架来描述对齐漂移。该框架定义信号A和信号B的区别，解释漂移如何通过反馈回路和子模式选择发展，将过程分为三个互动阶段，并识别控制漂移的边界条件。通过将对齐漂移视为递归互动过程而非孤立模型失败，本文为研究长期人类-系统交互提供了概念基础。

英文摘要

Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.

URL PDF HTML ☆

赞 0 踩 0

2605.16508 2026-05-19 cs.CL cs.AI 版本更新

The Scaling Laws of Skills in LLM Agent Systems

大语言模型代理系统中技能的扩展规律

Charles Chen, Qiming Yu, Yuhang Gu, Zhuoye Huang, Hanjing Li, Hongyu Liu, Simin Liu, Jinhao Liu, Dengyun Peng, Jiangyi Wang, Zheng Yan, Fanqing Meng, Ethan Qin, Carl Che, Mengkang Hu

AI总结研究揭示了大规模代理系统中技能扩展的双重规律：路由准确性随库大小对数衰减，执行准确性通过联合路由乘法提升下游任务表现，二者通过路由衰减斜率参数耦合，优化后显著提升性能。

Comments Technical Report

详情

AI中文摘要

随着代理系统规模扩大，技能积累为大规模可重用库，但其扩展规律仍不明确。在15个前沿LLM、1141个现实技能及超300万次路由或执行决策中，发现两个耦合规律。路由规律：单步路由准确性随库大小对数衰减（R²>0.97），错误从局部技能竞争发展到跨家族漂移并被过于通用的'黑洞技能'捕获。执行规律：在状态实现前，联合路由近似乘法，正确执行可提升困难下游任务表现约4倍。单参数路由对数衰减斜率b耦合二者：路由侧拟合预测执行侧救援，显示同一库属性控制预执行崩溃和下游恢复能力。这些结果表明代理性能不仅取决于模型能力，还取决于技能库的结构、粒度和暴露策略。

英文摘要

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

URL PDF HTML ☆

赞 0 踩 0

2605.16475 2026-05-19 cs.DL cs.CL 版本更新

Generative Artificial Intelligence for Literature Reviews

生成式人工智能用于文献综述

Gerit Wagner, Julian Prester, Reza Mousavi, Roman Lukyanenko, Guy Pare

AI总结本文探讨生成式人工智能在文献综述中的应用，分析其在文本摘要、问答、数据提取和翻译等方面的能力，提出使用通用和专用工具进行综述的方法，并讨论其机遇与风险，以及对科学进步的影响。

详情

DOI: 10.1177/02683962261425675
Journal ref: Journal of Information Technology, 02683962261425675 (2026)

AI中文摘要

生成式人工智能（GenAI），基于大型语言模型（LLMs），如ChatGPT，已席卷组织、学术界和公众。特别是生成式人工智能在大规模文本语料摘要、问答、数据提取和翻译方面的出色能力，对文献综述的开展具有深远影响。这影响了科学、组织和公众，因为所有都可以从GenAI支持的文献综述中受益。本文基于GenAI的技术基础和已确立的方法论 discourse，概述了使用通用（如ChatGPT、Gemini、Claude）和专用（如Consensus、Elicit）GenAI工具进行文献综述的方法。我们提供了提示的示例，并建议方法上稳健的文献综述策略。本文采用平衡的方法，考虑了依赖GenAI进行文献综述的机会与风险。最后，我们讨论了GenAI对长期科学进步影响的哲学问题，并提出了改进GenAI核心技术（其架构和训练数据）的研究机会，以及在GenAI支持的文献综述方法学中的开放问题。

英文摘要

Generative artificial intelligence (GenAI), based on large-language models (LLMs), such as ChatGPT, has taken organizations, academia, and the public by storm. In particular, impressive GenAI capabilities such as summarization of large text corpora, question-answering, data extraction, and translation, carry profound implications for the conduct of literature reviews. This impacts science, organizations and the general public, as all can benefit from GenAI-supported literature reviews. Building on the technical foundations of GenAI and grounded in established methodological discourse, this work outlines approaches for conducting literature reviews using both general-purpose (e.g., ChatGPT, Gemini, Claude) and specialized GenAI tools (e.g., Consensus, Elicit). We provide illustrative examples of prompts and suggest methodologically-sound literature review strategies. Throughout this perspective paper, we adopt a balanced approach considering both the opportunities and the risks of relying on GenAI in the conduct of literature reviews. We conclude by discussing philosophical questions related to the effects of GenAI on long-term scientific progress, and also present fruitful opportunities for research on improving the core of GenAI's technology-its architecture and training data-and suggest open issues in GenAI-based literature reviews methodology.

URL PDF HTML ☆

赞 0 踩 0

2605.16468 2026-05-19 cs.CV cs.AI cs.CL cs.LG q-bio.NC 版本更新

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

可解释的神经编码机制揭示人类视觉皮层的精细功能选择性

Idan Daniel Grosbard, Mor Geva, Galit Yovel

AI总结本文提出MINE框架，通过机制可解释工具揭示自然图像中驱动皮层 voxel 活动的特征，验证了特征对 voxel 响应的因果影响，并揭示了视觉皮层中精细的功能选择性。

Comments 40 pages, 28 figures

详情

AI中文摘要

理解人类视觉的核心目标是揭示驱动神经活动的视觉特征。已有研究利用人工神经网络作为编码模型预测皮层对自然图像的响应，揭示了激活类别选择区域的视觉内容。然而，现有方法多为相关性分析，将编码器视为黑箱，无法确定哪些图像特征驱动每个 voxel 的响应。本文提出机制可解释神经编码（MINE）框架，通过机制可解释工具定位自然图像中驱动毫米级（voxel 级）活动的特征。MINE利用语言对齐的图像表示预测每个 voxel 的响应，并生成语义可解释的特征描述，用于 voxel 的激活。进一步将这些 per-image 特征泛化为 per-voxel 功能轮廓。为验证 per-image 描述，我们显示它们足以生成激发 voxel 响应与原始图像响应匹配的图像，其准确性优于随机或低贡献控制生成的图像。此外，通过反事实插入或移除预测特征，可使激活在预期方向变化，提供因果证据。由 voxel 激活轮廓指导的反事实编辑产生更强的激活变化，表明轮廓忠实捕捉每个 voxel 的选择性。最后，将 MINE 应用于研究充分的类别选择脑区，显示其恢复了已知的类别偏好，同时揭示了每个区域内的精细 voxel 结构。总体而言，我们的结果确立了机制可解释性作为发现和验证神经功能精细假设的路径。

英文摘要

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

URL PDF HTML ☆

赞 0 踩 0

2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG 版本更新

DACA-GRPO：去噪感知的信用分配用于扩散语言模型中的强化学习

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

AI总结本文提出DACA-GRPO，通过引入去噪进度评分和分层掩码似然，改进扩散语言模型中强化学习的信用分配，提升数学推理、代码生成等任务性能。

详情

AI中文摘要

扩散大语言模型是自回归模型的有力替代品，但现有强化学习方法将所有去噪步骤视为同等重要，并依赖于有偏、高方差的似然估计。我们识别出两个根本性弱点：去噪轨迹中缺乏时间信用分配，以及用于策略优化的均场似然估计存在系统偏差。为了解决这些问题，我们提出了Denoising-Aware Credit Assignment for GRPO（DACA-GRPO），一种轻量级、即插即用的增强方法，适用于任何GRPO风格的训练器。DACA-GRPO引入了两个互补机制：去噪进度评分，从中间预测中提取每token的重要性权重，无需额外前向成本；分层掩码似然，将token位置分为层次，使每个token在大部分序列作为上下文的情况下进行预测，从而减少均场偏差。在三种GRPO基础方法上应用DACA-GRPO，使其在七个基准测试中取得一致提升，涵盖数学推理、代码生成、约束满足和受约束生成等任务，在数学推理中提升达5.6个百分点，在代码生成中提升7.4个百分点，在约束满足中提升36.3个百分点，在JSON schema符合性中提升5.9个百分点。

英文摘要

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.

URL PDF HTML ☆

赞 0 踩 0

2605.16338 2026-05-19 cs.DL cs.CL 版本更新

Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment

Vidya：一种由人工智能驱动的模块化流水线，用于档案自动化和语义元数据增强

Cloter Migliorini Filho, Julia Graciela Machado, Edson Armando Silva, Marcella Scoczynski

AI总结 Vidya利用大语言模型和开源工具实现大规模档案语义增强与自动化处理，通过YAML定义的本体和Pydantic验证生成结构化JSON输出，降低机构存储成本并符合相关标准。

2605.16303 2026-05-19 cs.CY cs.AI cs.CL 版本更新

From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

从人口统计学到调查锚点：评估LLM代理在建模退休态度中的表现

Rubén Garzón, Pauline Baron, Vincent Grari, Jonne Kamphorst, Michael Bernstein, Marcin Detyniecki

AI总结本文比较了基于人口统计学的LLM代理与基于调查数据的代理在预测退休态度调查中的准确性，发现仅依赖人口统计学的代理存在偏差且不够准确，而基于调查锚点的代理能更好地捕捉复杂的人类响应模式。

Comments 50 pages, 22 figures

详情

AI中文摘要

大型语言模型（LLM）代理可能为预测人类对调查的响应提供工具。一种常见的技术是仅使用人口统计学数据（如国家、年龄、性别、就业状况、收入、教育和婚姻状况）来定义这些代理。我们比较了基于人口统计学的代理与基于更广泛调查响应数据的代理的预测准确性。我们测试了这两种方法在预测多学科、跨国的《健康、老龄化与退休状况调查》（SHARE）中的响应，重点关注五个变量，这些变量来自三个与个人财务相关的政策相关构念。在这些三个构念中，我们发现，与基于更广泛数据训练的调查代理相比，仅依赖人口统计学的代理（1）表现出集中趋势偏差，使答案偏向人口均值，（2）过于准确，无法重现人类响应中的错误答案和“不知道”响应。这些性能差异通过复制先前退休规划研究中的分层回归分析得到进一步验证。仅基于人口统计信息的代理重现了财务风险承受能力、未来时间观念和退休规划知识各自预测退休储蓄的结果。然而，只有基于调查锚点的代理才能重现这三个因素之间的相互作用。这些发现表明，在仅使用人口统计学定义LLM代理以预测调查响应时应保持谨慎。

英文摘要

Large language models (LLM) agents may offer tools to predict human responses to surveys. A common technique for defining these agents uses only demographics, for example country, age, gender, employment status, income, education and marital status. We compare the predictive accuracy of demographic agents to that of survey agents defined with a larger set of in-domain survey responses. We test both approaches in predicting responses to the multidisciplinary, cross-national Survey of Health, Ageing and Retirement in Europe (SHARE), focusing on five variables from three policy-relevant constructs around personal finance. In these three constructs, we observe that, compared to survey agents trained on broader data, demographics-only agents (1) exhibited a central tendency bias, skewing answers toward population means, and (2) were unrealistically accurate, failing to reproduce the incorrect answers and "don't know" responses typical of human respondents. These performance differences are further substantiated through the replication of a hierarchical regression analysis from prior retirement planning research. Agents based solely on demographic information reproduce the outcome that financial risk tolerance, future time perspective, and knowledge of retirement planning each are predictive of retirement savings. However, only the survey-anchored agents succeed in reproducing the interaction among these three factors. These findings suggest caution in using only demographics to define LLM agents for predicting survey responses.

URL PDF HTML ☆

赞 0 踩 0

2605.16295 2026-05-19 cs.CY cs.AI cs.CL cs.GR cs.HC cs.MM 版本更新

ANVIL: Analogies and Videos for Lecturers

ANVIL：为讲师提供类比和视频

Yuri Noviello, Anastasiia Birillo, Gosia Migut

AI总结 ANVIL是一种多模态生成系统，可自动生成基于类比的计算机科学教学动画。通过生成文本类比、结构化视觉剧本和可执行代码，提升教学有效性。

详情

AI中文摘要

我们介绍了ANVIL，一种多模态生成系统，可自动生成基于类比的教学动画。给定一个概念定义，ANVIL生成文本类比，将其编译成结构化的视觉剧本，并生成可执行的manim代码以渲染动画，同时具备自动修复机制以提高鲁棒性。在大规模评估此类系统时，需要在教学有效性与可扩展性之间取得平衡。我们首先通过教师评估来确定质量评估的基础，并利用其发现来指导自动化筛选。对于文本类比，我们引入基于LLM的评估器以实现可扩展的质量筛选；对于视频，由于主观判断难以自动化，我们改用自动代理来评估与预期剧本的一致性并进行错误分析。我们进一步与教育工作者进行用户研究，以考察采用要求和风险。我们的发现表明，ANVIL可以生成经常被评价为足够的材料，并且教育工作者对其感知价值和易用性有积极反应。

英文摘要

We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.

URL PDF HTML ☆

赞 0 踩 0

2605.16289 2026-05-19 cs.CY cs.CL 版本更新

Linguistic Uncertainty and Reply Engagement on X: A Cross-Domain Replication of the Uncertainty-Reply Asymmetry

语言不确定性与X上的回复参与：对不确定性-回复不对称性的跨领域复制

Mohamed Soufan

AI总结研究探讨语言不确定性与社交媒体参与的关系，发现不确定性帖子获得更多回复，验证了先前阿拉伯语研究中的不对称参与模式。

Comments 13 pages, 2 figures, 2 tables

详情

AI中文摘要

在社交媒体中，语言不确定性普遍存在，但其与参与度的关系在不同语言和主题中仍不明确。本文利用2026年4月三天收集的2258篇英文帖子（涉及联邦储备政策、通货膨胀和选举政治），测试先前阿拉伯语研究中观察到的不确定性-回复不对称性是否在更广泛背景下复制。通过基于词典的不确定性框架对帖子进行分类，约三分之一被识别为不确定。不确定帖子平均获得82%更多的回复，再贴和点赞的增加较小，复制了先前研究中的不对称参与模式。回归结果确认不确定性与回复之间存在正相关（η=0.126，p=0.011），相当于约13%更高的预期回复参与度，而总参与度则显示较弱的正相关。这些发现表明，语言不确定性系统性地增加对话参与度，并可能反映一种跨语言和领域的普遍互动机制。

英文摘要

Linguistic uncertainty is common in social media, but its relationship with engagement remains unclear across languages and topics. Using 2,258 English-language posts on Federal Reserve policy, inflation, and electoral politics collected over three days in April 2026, we test whether the Uncertainty-Reply Asymmetry observed in prior Arabic-language research replicates in a broader context. Posts are classified using a lexicon-based uncertainty framework, with approximately one-third identified as uncertain. Uncertain posts receive 82% more replies on average than certain posts, with smaller increases in reposts and likes, replicating the asymmetric engagement pattern observed in prior work. Regression results confirm a positive and statistically significant association between uncertainty and replies (\b{eta} = 0.126, p = 0.011), equivalent to ~13% higher expected reply engagement, while total engagement shows a positive but weaker association. These findings suggest that linguistic uncertainty systematically increases conversational engagement and may reflect a general interactional mechanism across languages and domains.

URL PDF HTML ☆

赞 0 踩 0

2605.16288 2026-05-19 cs.CY cs.CL 版本更新

When AI Tells You What You Want to Hear: Sycophantic Behavior of Large Language Models in Dementia Care Settings

当AI说你想听的话：大型语言模型在痴呆症护理环境中的趋炎附势行为

Christian Kolb

AI总结研究探讨了大型语言模型在痴呆症护理场景中是否因迎合社会期望而降低专业质量，通过五种提示测试四款模型，发现响应质量随提示框架增强而下降，提示框架显著影响响应质量。

Comments 10 pages, 4 figures. Exploratory study

详情

DOI: 10.5281/zenodo.19548622

AI中文摘要

大型语言模型（LLMs）正在越来越多地应用于临床和护理环境。本探索性研究检验了LLMs在痴呆症护理情境中是否表现出趋炎附势行为——即根据社会期望信号调整响应而非维持专业质量。五种提示（P1中性到P5权威信号）被提交给四款LLMs（GPT-5、Claude Sonnet 4.6、Gemini 3.1 Pro、Mistral Large），每种提示重复五次（N=100次响应）。响应通过LLM-as-a-Judge方法评估，依据七个护理伦理质量标准（K1-K7）和语气量表（0-3）。所有模型均显示提示水平与响应质量之间存在显著负相关（rho范围从-0.543到-0.734，p<0.01）。Mistral Large表现出最显著的效果（rho=-0.734），平均分从P1的6.0/7降至P5的0.2/7。研究结果表明，LLMs在高风险护理环境中存在情境敏感的风险，提示框架显著影响响应质量——这一维度在医疗AI部署中关注不足。

英文摘要

Large language models (LLMs) are increasingly used in clinical and care settings. This exploratory study investigates whether LLMs exhibit sycophantic behavior - adapting their responses to social expectation signals rather than maintaining professional quality - in the context of dementia care. Five prompts with systematically increasing confirmatory and authority-related framing (P1 neutral to P5 authority-signaled implementation support) were submitted to four LLMs (GPT-5, Claude Sonnet 4.6, Gemini 3.1 Pro, Mistral Large), each repeated five times (N = 100 responses). Responses were evaluated using an LLM-as-a-Judge methodology against seven nursing-ethical quality criteria (K1-K7) and a tone scale (0-3). All models showed significant negative Spearman correlations between prompt level and response quality (rho ranging from -0.543 to -0.734, all p < 0.01). Mistral Large exhibited the most pronounced effect (rho = -0.734), with mean scores dropping from 6.0/7 at P1 to 0.2/7 at P5. The findings suggest that LLMs pose context-sensitive risks in high-stakes care environments and that prompt framing significantly shapes response quality - a dimension that has received insufficient attention in healthcare AI deployment.

URL PDF HTML ☆

赞 0 踩 0

2605.16275 2026-05-19 cs.CY cs.AI cs.CL cs.MM 版本更新

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

AI 产出物还是AI增强？英语学术用途课程中学生对AI生成媒体的看法

David James Woo, Deliang Wang, Kai Guo

AI总结研究探讨了AI生成内容在EAP课程中的教学效果，通过混合方法分析发现学生偏好视觉化内容，视频与学业表现正相关，但高认知负荷与成绩负相关，表明需合理设计内容以提升学习效果。

Comments 23 pages, 7 figures

详情

AI中文摘要

人工智能（AI）检索增强生成（RAG）工具现在使教育者能够将课程材料转化为多样化的多媒体内容。然而，这种AI生成内容是教学支架还是低质量的AI产出仍不明确。本文报告了在一所香港社区学院的英语学术用途（EAP）课程中，教师引导生成补充材料的开发、实施与评估。主要使用Google Notebook LM生成视频、播客、信息图和个性化反馈报告。通过混合方法设计，包括调查、半结构化访谈和与学术成绩的相关性分析，研究发现学生认为材料有用且易于使用，更偏好与评估相关的视觉和多模态内容，特别是视频和信息图。视频偏好与学业成绩正相关，但高认知负荷与成绩负相关，表明需谨慎校准内容复杂性。值得注意的是，部分成绩较低的学生自行将材料作为补救支架。该实践表明，RAG工具能够实现传统方法难以实现的规模化个性化反馈。当与学生目标和认知原理相结合时，教师引导的AI生成可以有意义地增强EAP学习生态系统，而非产生AI产出物。

英文摘要

Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or AI slop: high volume, low quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google Notebook LM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students' preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.

URL PDF HTML ☆

赞 0 踩 0

2605.16264 2026-05-19 cs.HC cs.CL 版本更新

LLM-Based Intelligent Notification Composition: From Static Personalization to Context-Aware Persuasive Messaging

基于大语言模型的智能通知生成：从静态个性化到情境感知的说服性信息

Nilesh Agrawal

AI总结本文提出利用大语言模型提升通知信息质量，通过六个维度评估其效果，展示LLM在提升CTR和说服性方面的贡献，并提出决策框架以指导LLM生成的应用。

Comments 17 pages, 1 figure, 7 tables. Code available at https://github.com/ndagrawal/LLMNotificationComposition

详情

AI中文摘要

推送通知仍然是数字平台与用户互动最直接的渠道，但现有方法在通知对象、时间及内容推荐上投入巨大，而如何沟通仍是最薄弱的环节。本文认为信息质量是独立且被低估的优化点，大语言模型在此层创造最大差异化价值。本文贡献包括：首先定义通知信息质量的六个维度，并展示基于LLM的生成如何优于模板；其次提供架构归因分析，解构信息生成与其他组件（目标定位、排序、时间）的关系；第三引入三准则决策框架，明确何时LLM生成是瓶颈。本文通过PRISMA引导的调研（28个来源，142个筛选）、社交媒体、食品配送和电子商务领域的应用分析，并提出统一的架构框架，包含预算感知路由、基础生成、候选排序、多样性控制和在线学习。

英文摘要

Push notifications remain among the most direct channels through which digital platforms engage users, yet existing approaches have invested heavily in who to notify, when to notify, and what to recommend, while leaving how to communicate as the least-optimized stage. This paper argues that message quality is an independent, underinvested lever, and that LLMs create their most differentiated value precisely at this layer. We make three contributions. First, we define notification message quality along six dimensions (contextual relevance, clarity, actionability, novelty handling, linguistic freshness, and persuasive appropriateness) and show how LLM-based composition improves each relative to templates. Across reviewed deployments, reported improvements range from +8% to +14.5% CTR over static templates and +1% to +2.5% over mature slot-filling systems, though these span heterogeneous systems and should not be treated as directly comparable. Second, we provide an architectural attribution analysis disentangling message generation from adjacent components (targeting, ranking, timing), arguing that observed gains are frequently misattributed to text generation alone. Third, we introduce a three-criterion decision framework specifying when LLM generation is and is not the binding constraint. We support these arguments through a PRISMA-guided survey (28 sources from 142 screened), examine domain-specific applications across social media, food delivery, and e-commerce, and propose a unified architectural framework with budget-aware routing, grounded generation, candidate ranking, diversity controls, and online learning.

URL PDF HTML ☆

赞 0 踩 0

2605.14787 2026-05-19 cs.CV cs.CL 版本更新

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

组合图像检索基准是否需要多模态组合？

Matteo Attimonelli, Alessandro De Bellis, Aryo Pradipta Gema, Rohit Saxena, Monica Sekoyan, Wai-Chung Kwan, Claudio Pomo, Alessandro Suglia, Dietmar Jannach, Tommaso Di Noia, Pasquale Minervini

AI总结研究发现组合图像检索任务中，许多查询可通过单一模态解决，而非真正的多模态组合，揭示了多模态组合的假设不成立。

详情

AI中文摘要

组合图像检索（CIR）是一种多模态检索任务，其中查询由参考图像和文本修改组成，目标是检索满足两者条件的目标图像。本文表明，这一假设并不总成立。在四个广泛使用的CIR基准和十一种通用多模态嵌入模型中，大量查询可通过单一模态解决（32.2%至83.6%），揭示了普遍存在的单模态捷径。为此，我们进行了两阶段审核：首先通过跨模型分析识别捷径可解查询；其次在4741个捷径不可解查询上进行人工验证，发现仅1689个查询结构合理，常见问题包括模糊编辑和目标不匹配。重新评估模型在验证子集上的表现显示，查询无法再仅通过单一模态解决，成功检索需结合两种输入。虽然准确率下降，但对多模态信息的依赖增加。整体而言，当前CIR基准将捷径可解、噪声和真正组合性查询混为一谈，导致对模型多模态能力的高估。

英文摘要

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.

URL PDF HTML ☆

赞 0 踩 0

2604.02178 2026-05-19 cs.CL cs.AI cs.LG 版本更新

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

专家反击：在专家层面解读混合专家语言模型

Jeremy Herbst, Stefan Wermter, Jae Hee Lee

AI总结研究通过k稀疏探测比较MoE专家与密集FFN，发现专家神经元更单语义，提出以专家为分析单位，揭示专家是细粒度任务专家，而非领域专家或token处理者。

Comments 8 pages, 7 Figures. Accepted at ICML 2026. Improved writing, changed author order, updated citations

详情

AI中文摘要

校准后再行动：在大语言模型代理中考虑成本的探索

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

AI总结本文提出Calibrate-Then-Act框架，使LLM代理在不确定环境下显式平衡成本与不确定性，从而更优地决策。

详情

AI中文摘要

大语言模型代理被部署在需要交互以获取信息的环境中。在这些场景中，代理必须权衡行动中的内在成本不确定性，例如何时停止探索并提交答案。例如，在编程任务中，代理可能运行生成的代码，或为该代码片段生成测试；编写和运行测试的成本非零，但通常低于运行有缺陷代码的成本。本文表明，可以通过诱导LLM代理显式权衡这些成本-不确定性权衡，使代理在环境中表现更优。我们正式化了多个任务，包括检索增强的问答和文件阅读编码任务，作为在不确定性下的连续决策问题。每个问题都有潜在的环境状态影响代理性能。我们引入了名为Calibrate-Then-Act（CTA）的框架，通过将代理传递推断出的环境状态先验信息，使其能够更优地行动。此信息在定性上改变了代理行为，并向代理添加了非标准RL训练所学的环境敏感性。在合成任务、问答和文件阅读上的结果表明，通过CTA显式进行成本-收益权衡有助于代理发现更优的决策策略。

英文摘要

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.

URL PDF HTML ☆

赞 0 踩 0

2509.22510 2026-05-19 cs.CL 版本更新

清除珠宝：基于谷歌OCR的藏文手稿的神经拼写纠正模型

Queenie Luo, Yung-Sung Chuang

AI总结本文提出基于谷歌OCR的藏文手稿的神经拼写纠正模型，通过改进的Transformer架构实现自动纠正OCR噪声输出，实验表明其优于其他序列模型。

详情

DOI: 10.1145/3654811
Journal ref: Association for Computing Machinery 2024

AI中文摘要

人文学者依赖古代手稿来研究历史、宗教和社会政治结构。许多努力致力于使用OCR技术数字化这些珍贵的手稿，但大多数手稿因数世纪的污损，使得OCR程序无法准确捕捉褪色的字符和污渍。本文提出基于谷歌OCR处理的藏文手稿的神经拼写纠正模型，用于自动纠正OCR输出中的噪声。本文分为四个部分：数据集、模型架构、训练和分析。首先，我们将原始藏文电子文本语料库特征工程为两个结构化数据框——一组配对玩具数据和一组配对真实数据。然后，我们在Transformer架构中实现了置信度评分机制，用于拼写纠正任务。根据损失和字符错误率，我们的Transformer加置信度评分机制架构证明优于Transformer、LSTM-2-LSTM和GRU-2-GRU架构。最后，为了检验模型的鲁棒性，我们分析了错误的标记，可视化了模型中的注意力和自我注意力热图。

英文摘要

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.

URL PDF HTML ☆

赞 0 踩 0

2102.11105 2026-05-19 cs.SI cs.CL 版本更新

REMOD: Relation Extraction for Modeling Online Discourse

REMOD：在线 discourse 中的关系提取

Matthew Sumpter, Giovanni Luca Ciampaglia

AI总结本文提出一种结合图嵌入与语义依赖图路径遍历的监督学习方法，用于提取在线 discourse 中实体间的语义关系，以应对半结构化数据建模的挑战。

Comments 11 pages, 5 figures

详情

AI中文摘要

在线 discourse 的大量数据对维护一个文明和知情的公共领域构成挑战。诸如 ClaimReview 的标准化数据努力提供了大量关于可能不准确声明的新数据，由第三方事实核查员审查。这些数据有助于揭示在线 discourse 的性质、政治精英对其的放大作用以及其对在线信息生态系统完整性的影响。不幸的是，这种数据的半结构化性质在建模和推理在线 discourse 时带来了重大挑战。关键挑战是关系提取，即确定声明中命名实体之间的语义关系。本文开发了一种新颖的监督学习方法，结合图嵌入技术与语义依赖图上的路径遍历。我们的方法基于直观观察，即了解三元组主体和对象之间路径上的实体信息有助于提取其语义关系（例如，华盛顿，D.C. 与美国合众国的关系为 capitalOf）。作为该技术在建模在线 discourse 中的潜在应用示例，我们展示了该方法可以整合到一个管道中，以推理潜在的虚假信息声明。

英文摘要

The enormous amount of discourse taking place online poses challenges to the functioning of a civil and informed public sphere. Efforts to standardize online discourse data, such as ClaimReview, are making available a wealth of new data about potentially inaccurate claims, reviewed by third-party fact-checkers. These data could help shed light on the nature of online discourse, the role of political elites in amplifying it, and its implications for the integrity of the online information ecosystem. Unfortunately, the semi-structured nature of much of this data presents significant challenges when it comes to modeling and reasoning about online discourse. A key challenge is relation extraction, which is the task of determining the semantic relationships between named entities in a claim. Here we develop a novel supervised learning method for relation extraction that combines graph embedding techniques with path traversal on semantic dependency graphs. Our approach is based on the intuitive observation that knowledge of the entities along the path between the subject and object of a triple (e.g. Washington,_D.C.}, and United_States_of_America) provides useful information that can be leveraged for extracting its semantic relation (i.e. capitalOf). As an example of a potential application of this technique for modeling online discourse, we show that our method can be integrated into a pipeline to reason about potential misinformation claims.

URL PDF HTML ☆

赞 0 踩 0