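arXivDaily daily academic digest: syncs the full arXiv feed with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.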

2606.18394 2026-06-18 cs.CL 新提交

局部与全局注意力的双重维度

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

发表机构 * UC Santa Barbara（加州大学圣塔芭芭拉分校）

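AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for the local context and uses low-dimensional representations for distant tokens, reducing the KV cache while maintaining performance.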

AI中文摘要

仅解码器（decoder-only）Transformer在前序token的KV缓存上计算注意力。键（和值）通常以相同的维度表示，无论其与预测目标的距离如何。然而，在自然语言中，下一个词受紧邻的前序token影响最大。我们假设局部和远距离token对表示能力有不对称需求：局部token对预测即时输出更关键，因此需要更丰富的表示，而远距离token主要作为长期记忆，低维表示可能就足够了。我们将这一思想形式化为距离自适应表示（DAR），在受控设置中实现，该设置在局部上下文窗口内保留全维度表示，同时为超出该窗口的token分配降维表示（例如原始维度的1/4）。在多个预训练规模（70M到410M参数）以及1B规模模型上的持续监督微调中，该方法与全维度基线的性能紧密匹配。相比之下，在所有token位置上均匀降低维度会导致性能下降。这些结果挑战了键和值维度应在所有token位置上均匀的常见假设。我们的发现为设计注意力架构提供了新方向，该架构可自适应地跨序列分配表示能力，从而在推理期间进一步减少KV缓存。

英文摘要

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
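
As a rough illustration of why this layout shrinks the KV cache, the sketch below counts stored key (or value) scalars per head and per layer when tokens inside a local window keep the full dimension and older tokens keep a reduced one. The function names, the window size of 512, and the head dimension of 128 are assumptions chosen for illustration; only the 1/4 reduction ratio comes from the abstract, and this is not the authors' implementation.

```python
def kv_cache_entries(seq_len: int, full_dim: int, local_window: int,
                     reduction: float = 0.25) -> float:
    """Count stored key (or value) scalars for one head in one layer.

    Tokens inside the local window keep `full_dim` dimensions; tokens
    beyond it keep `full_dim * reduction` dimensions (hypothetical layout).
    """
    local_tokens = min(seq_len, local_window)
    distant_tokens = seq_len - local_tokens
    return local_tokens * full_dim + distant_tokens * full_dim * reduction


def kv_cache_savings(seq_len: int, full_dim: int, local_window: int,
                     reduction: float = 0.25) -> float:
    """Fraction of the uniform full-dimensional cache that is saved."""
    baseline = seq_len * full_dim
    reduced = kv_cache_entries(seq_len, full_dim, local_window, reduction)
    return 1.0 - reduced / baseline


if __name__ == "__main__":
    # Assumed setting: 4096 cached tokens, head dimension 128, window 512.
    print(kv_cache_entries(4096, 128, 512))   # 180224.0
    print(kv_cache_savings(4096, 128, 512))   # 0.65625
```

Under these assumed numbers the saving approaches 1 - reduction (75%) as the sequence grows, since the full-dimensional window becomes a vanishing share of the cache.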

2606.18663 2026-06-18 cs.CL 新提交

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D: 通过代理训练轨迹实现动态数据混合

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

发表机构 * The University of Tokyo（东京大学）； National Institute of Informatics（国立信息学研究所）

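AI summary: Proposes RegMix-D, which predicts optimal mixture ratios at multiple training stages from proxy training trajectories to enable dynamic data mixing; it outperforms RegMix and DoReMi on 13 downstream tasks while using only 25% of RegMix's proxy compute budget.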

Comments Work in progress

AI中文摘要

数据混合选择对于大型语言模型预训练至关重要。现有方法如RegMix通过在小规模代理运行上拟合回归模型来选择单个静态混合。我们提出RegMix-D，这是RegMix的一个简单扩展，用于动态混合。我们的关键观察是，代理运行不仅产生端点损失，还产生完整的损失轨迹，这些轨迹可用于进一步改进数据混合。通过在这些轨迹上训练回归模型，我们可以预测多个训练阶段的最优混合。RegMix-D支持两种部署模式：一种离线变体，在目标训练之前生成完整的混合计划；另一种在线变体，在训练期间使用观察到的损失自适应调整混合。在Pile数据集的250亿token上使用1B参数目标模型的实验表明，RegMix-D在13个下游任务上一致优于RegMix和DoReMi，同时保持代理高效：即使仅使用128个代理模型（RegMix代理计算预算的25%），它也超越了RegMix。

英文摘要

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

2606.18831 2026-06-18 cs.CL cs.AI 新提交

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程：长上下文强化学习的数据配方

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

发表机构 * OpenBMB ； Tsinghua University（清华大学）

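AI summary: Proposes a simple yet effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models, with average gains of +3.2 to +7.2 points across seven long-context benchmarks and gains that transfer to agentic tasks.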

Comments 15 pages, 6 figures, 12 tables

AI中文摘要

长上下文推理是大语言模型的一项关键能力，特别是当它们作为必须推理长轨迹的自主智能体部署时。强化学习最近成为提升这一能力的主要范式，然而现有工作主要关注奖励工程，而多样化的训练数据仍然稀缺。我们从数据为中心的角度重新审视这个问题，并表明仅凭一种简单有效的数据配方，结合极简的基于结果的GRPO设置，就足以显著提升长上下文推理。我们的配方针对三个互补的任务族——检索、多证据合成和推理——我们构建并整理了八个数据集，总计约1.4万个示例。在三个模型（Qwen3-4B/8B/30B-A3B）上的实验在七个长上下文基准上取得了平均+7.2/+3.2/+6.4分的提升，超过了之前的强化学习训练集。我们进一步证明这些增益可以迁移到智能体任务中，在经过智能体调优的模型上继续使用我们的数据配方进行强化学习训练，GAIA提升+4.8分，BrowseComp提升+7.0分。我们将发布我们的数据集以促进未来研究。

英文摘要

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

2606.18902 2026-06-18 cs.CL 新提交

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE: 基于智能体引导探索的随机提示优化

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

发表机构 * Slingshot AI ； Department of Engineering, University of Cambridge（剑桥大学工程系）

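AI summary: Proposes the stochastic prompt optimization framework SPO; its SAGE strategy performs black-box search through a multi-agent pipeline with diagnostic code execution. Across benchmarks, effectiveness depends on error type, and continuous optimization on a mental-health chatbot yields a statistically robust gain in next-day retention.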

AI中文摘要

上下文工程已成为无需参数更新即可改进AI系统的主要手段。最近研究表明文本梯度并非真实梯度，这促使我们将自动提示优化（APO）视为黑盒搜索。我们引入了SPO（随机提示优化），一个在提示空间上进行随机搜索的框架，并比较了三种复杂度递增的策略：基于错误信息的随机搜索、带有进化算子的遗传算法以及SAGE（基于智能体引导探索的SPO），后者是一个具有诊断代码执行的多智能体流水线。在三个基准测试中，没有单一策略占主导地位；有效性取决于景观结构与错误类型的相互作用。我们进一步在连续优化范式下将SAGE部署到一个心理健康聊天机器人上，它将八个个体噪声A/B测试周期累积为次日留存率的统计显著提升。我们认为，将定性诊断与定量验证相结合是使智能体优化对开放式任务导向对话有效的关键。

英文摘要

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.

2606.18954 2026-06-18 cs.CL 新提交

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO：基于图的推理模型策略优化

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； Ant Group（蚂蚁集团）

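AI summary: Proposes the GraphPO framework, which models reasoning rollouts as a directed acyclic graph, merges semantically equivalent paths to reduce redundant exploration, and uses edge-level advantages to improve reasoning efficiency, outperforming chain- and tree-based methods on multiple benchmarks.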

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为增强大型推理模型能力的标准范式。RLVR通常独立采样响应并根据最终答案优化策略。该范式有两个局限性：首先，独立响应常包含相似的中间推理步骤，导致冗余探索和计算浪费；其次，稀疏的最终答案奖励难以识别有用步骤。基于树的方法通过共享前缀并比较同一前缀下的分支来提供细粒度信号，部分解决了这一问题。然而，树分支仍然是独立扩展的。当不同分支达到相似的推理状态时，它们无法共享信息并重复类似的探索。此外，基于树的方法忽略了这种分散性，仅在不同分支内进行局部比较，这可能导致优势估计的方差更高。为了解决这一挑战，我们提出了GraphPO（基于图的策略优化），一种新颖的RL框架，将轨迹表示为有向无环图，其中推理步骤作为边，从推理路径中总结的语义状态作为节点。GraphPO将语义等价的推理路径合并为等价类，允许它们共享后缀，并将预算从冗余扩展重新分配到多样化探索。此外，我们为入边分配效率优势，为出边分配正确性优势，从而在从结果中推导过程监督的同时提高推理效率。理论表明，GraphPO降低了优势估计方差并提高了推理效率。在三个LLM上的推理和智能体搜索基准实验表明，在相同的token预算或响应预算下，GraphPO始终优于基于链和基于树的基线方法。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
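
The graph representation described above can be pictured with a small sketch: reasoning steps become edges, and paths that reach the same semantic state are merged at one node, so a state may have several incoming edges. The function name, the "ROOT" start node, and the assumption that each step already carries a state label are hypothetical simplifications; the abstract says states are summarized from reasoning paths, and this sketch neither models that summarization nor checks acyclicity.

```python
from collections import defaultdict


def build_reasoning_graph(rollouts):
    """Merge rollouts into a graph keyed by semantic state labels.

    rollouts: list of paths; each path is a list of (step_text, state_label).
    Returns a dict mapping (source_state, target_state) to the set of step
    texts that realize that edge. Equal labels stand in for semantic
    equivalence (an assumption made for this illustration).
    """
    edges = defaultdict(set)
    for path in rollouts:
        source = "ROOT"  # hypothetical shared start node
        for step_text, state in path:
            edges[(source, state)].add(step_text)
            source = state
    return dict(edges)


def incoming_edges(edges, state):
    """List the source states that lead directly into `state`."""
    return sorted(src for (src, dst) in edges if dst == state)


if __name__ == "__main__":
    rollouts = [
        [("factor the quadratic", "S1"), ("read off the roots", "S3")],
        [("use the quadratic formula", "S2"), ("simplify", "S3")],
    ]
    graph = build_reasoning_graph(rollouts)
    # Both paths reach S3, so it is stored once with two incoming edges.
    print(incoming_edges(graph, "S3"))
```

Because the two paths meet at the same node, any continuation sampled from that node is shared by both, which is the suffix sharing the abstract refers to.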

2606.19002 2026-06-18 cs.CL 新提交

Enhancing Multilingual Reasoning via Steerable Model Merging

通过可引导的模型合并增强多语言推理

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Fudan University（复旦大学）； Beihang University（北京航空航天大学）； Monash University（蒙纳士大学）； Zhongguancun Laboratory（中关村实验室）； Nanjing University（南京大学）； Tsinghua University（清华大学）

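AI summary: Proposes the Steerable Model Merging (ST-Merge) framework, which uses a gated cross-attention mechanism to adaptively modulate the contributions of the source models and outperforms strong baselines on multilingual reasoning tasks.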

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

AI中文摘要

模型合并是组合多语言模型和推理模型能力的有效技术。通过对齐不同模型的特征空间，它在多语言推理任务中取得了有希望的泛化效果。然而，合并后的单一模型往往无法解决源模型之间的冲突，导致性能次优。换句话说，一刀切的合并策略可能无法适应不同输入的特性，这些输入可能要求优先考虑某些模型。为此，我们提出了一个可引导模型合并（ST-Merge）框架来调节每个源模型的贡献。为了实现这一想法，我们引入了一种门控交叉注意力机制，以自适应方式加权或过滤两个关注的源模型。大量实验表明，ST-Merge在涵盖21种不同语言的四个多语言推理基准上持续优于多个强基线。

英文摘要

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

2606.19005 2026-06-18 cs.CL cs.LG 新提交

Sumi: Open Uniform Diffusion Language Model from Scratch

Sumi: 从头训练的开放均匀扩散语言模型

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

发表机构 * Tohoku University（东北大学）

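AI summary: Introduces Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens; it is competitive with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, and the model weights, checkpoints, and training recipe are released.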

AI中文摘要

扩散模型已成为自回归模型的有前途的替代方案。其中，均匀扩散语言模型（UDLM）允许在任何步骤更新任何token，原则上能够实现更灵活的生成。然而，目前还没有从零开始预训练的大参数规模和大token预算的UDLM。自回归建模和掩码扩散建模已经拥有大规模的可供社区研究和构建的模型；而均匀扩散模型则没有。大规模从头预训练的UDLM将为研究缩放行为、生成动态、可控性以及与现有自回归和掩码扩散模型的权衡提供一个干净的参考点。为此，我们引入了Sumi（日语中“墨水”的意思），一个完全开放的70亿参数均匀扩散语言模型，从零开始在1.5T tokens上预训练。Sumi在知识、推理和编码基准测试中与在可比token预算下训练的自回归模型表现相当，但在常识基准测试中表现较差，其中我们以教育为主的数据混合可能是原因之一。我们发布了模型权重、检查点和完整的训练方案，包括在公开可用的语料库上的数据混合的完整规范。我们希望这次发布能使社区研究大规模原生均匀扩散，并促进对其尚未很好理解的方面的研究。

英文摘要

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

2606.19170 2026-06-18 cs.CL 新提交

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Dango：一个严格仅L1的大型语言模型，用于研究第二语言习得

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

发表机构 * Kyoto University（京都大学）； NII-LLMC（国立信息学研究所-大规模语言模型中心）

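AI summary: Proposes Dango, a 1.8B-parameter model that filters L2 contamination from its pretraining data and is fine-tuned on L2-learning lessons to simulate human-like L2 production patterns, outperforming unfiltered and multilingual baselines.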

Comments 8 pages main text, 20 pages total including references and appendices

AI中文摘要

我们介绍了Dango，一个1.8B参数的大型语言模型，旨在用于第二语言习得（SLA）中L1到L2（日语到英语）迁移的受控研究。虽然先前的研究已经探索了语言模型中的SLA，但它们主要依赖于较小的或非解码器模型，限制了它们生成开放式文本的能力，并降低了它们作为实用L2模拟器的适用性。我们发现了将模型扩展到该规模时的一个关键挑战：用于L1习得的“单语”预训练语料库中的L2污染。为了解决这个问题，我们提出了一种过滤方法，以减少对英语的过早暴露，同时保留现实的最小暴露。然后，我们在LLM生成的L2学习课程上对模型进行微调，以模拟L2习得过程。我们的评估证实，Dango发展了类似人类的L2产出模式，优于未过滤和标准的多语言基线。我们发布了模型、数据和代码，以促进可重复的计算SLA研究和面向学习者的应用。

英文摘要

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

2606.19257 2026-06-18 cs.CL 新提交

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

DreamReasoner-8B：面向扩散推理模型的块大小课程学习

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

发表机构 * The University of Hong Kong（香港大学）； Peking University（北京大学）

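AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning; DreamReasoner-8B is competitive with Qwen3-8B on math and code reasoning.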

AI中文摘要

块扩散语言模型通过并行块级去噪加速解码，但其能否可靠地扩展到长思维链（CoT）推理仍未解决。为此，我们开发了开源块扩散推理模型DreamReasoner-8B，并系统研究了训练和推理块大小如何影响长CoT推理。我们的分析揭示了显著的性能差距：使用大块大小训练会导致推理性能极差，而小块大小则能保持有效的推理。为了弥合这一粒度差距，我们提出了块大小课程学习，逐步从细粒度块大小过渡到粗粒度块大小进行训练，从而克服了这一限制，并实现了在多种推理块大小上泛化的强大推理性能。在数学和代码推理基准测试中，DreamReasoner-8B取得了与领先的开源自回归模型（如Qwen3-8B）相竞争的结果。这项工作为高效、具备推理能力的扩散语言模型奠定了实践基础。我们在以下网址发布模型：https://github.com/DreamLM/DreamReasoner 。

英文摘要

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
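
A fine-to-coarse curriculum of this kind can be pictured as a step-indexed schedule over block sizes. The sketch below is an assumption-laden illustration: the function name, the block sizes (4, 8, 16, 32), and the equal-length stages are hypothetical, since the abstract does not state the actual sizes or stage lengths.

```python
def block_size_at(step: int, total_steps: int, sizes=(4, 8, 16, 32)) -> int:
    """Return the training block size for a given step.

    Training is split into len(sizes) equal-length stages that move from
    the smallest (fine-grained) to the largest (coarse-grained) block size.
    Both the sizes and the equal split are illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps map to the first or last stage.
    step = max(0, min(step, total_steps - 1))
    stage = step * len(sizes) // total_steps
    return sizes[stage]


if __name__ == "__main__":
    for s in (0, 250, 500, 999):
        print(s, block_size_at(s, total_steps=1000))
```

A training loop would call `block_size_at` once per step and build that step's denoising blocks with the returned size.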

2606.18284 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

打破求解器瓶颈：在可学习前沿训练任务生成器

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin, Augustine N. Mavor-Parker, Matthew Daborn-Sargent

发表机构 * Vmax ； Goodfire AI

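AI summary: Proposes the PROPEL framework, which trains a lightweight activation probe as a solve-rate proxy so that a task generator can be optimized without repeated solver rollouts, concentrating generated tasks at the learnable frontier for math, code, and software-engineering tasks.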

Comments 30 pages, 9 figures, 12 tables

AI中文摘要

通过强化学习训练智能体的限制资源日益成为前沿任务供给：有效、可求解且刚好足够困难以训练当前模型的任务。随着推理和智能体模型的改进，固定任务分布趋于饱和，而天真的合成生成产生琐碎、不可能或不适定的任务。用强化学习训练任务生成器以优化有效性和可学习性可以解决这一瓶颈，但直接优化需要对每个候选任务进行重复求解器评估。对于软件工程任务，单次评估可能耗时数十分钟；求解器在环的生成器训练是不可行的。我们提出PROPEL，一个求解器摊销框架，用于在目标求解率下训练任务生成器。PROPEL在一次性标注的生成任务和求解器结果语料库上训练一个轻量级激活探针。该探针从冻结的生成器参考模型预测目标求解器的通过率，并在生成器优化期间作为求解率的代理，将生成器评估简化为单次前向传播。在多种模型规模下的数学、代码和软件工程任务中，PROPEL将生成任务转向目标求解率：对于编程，在可学习前沿生成的任务从$10.1\% \rightarrow 20.0\%$（针对Qwen2.5-3B-Instruct求解器）和从$5.3\% \rightarrow 12.6\%$（针对Qwen2.5-7B-Instruct求解器）。对于软件工程，PROPEL将目标求解率下的生成份额从$9.8\% \rightarrow 19.6\%$（针对Qwen3.5-27B在探针和生成器训练期间未见过的仓库）。

英文摘要

The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of probe and generator.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 交叉投稿

REVES：通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

发表机构 * Northwestern University（西北大学）； Amazon AGI（亚马逊人工智能实验室）； Qualcomm AI Research（高通人工智能研究）； University of Minnesota（明尼苏达大学）

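AI summary: Proposes the REVES framework, which converts intermediate-step "near-miss" answers into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving the multi-step reasoning of large language models; it scores 6.5 points above the RL baseline on LiveCodeBench.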

AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型（LLM）推理能力的强大范式。然而，标准的后训练方法主要优化单次目标，与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习（RL），但传统方法直接优化多步轨迹，未能进一步利用模型可以从纠正中学习的中间步骤中的高质量错误。我们提出了一个两阶段迭代框架，交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤（“接近正确”答案）转化为解耦的修订和验证提示，我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比，这种方法实现了高效的离策略数据生成，并减少了长程采样的计算开销。在LiveCodeBench上，使用公开可用的测试用例作为反馈，我们观察到比RL基线高6.5分，比标准多轮训练高4.0分。除了编码，我们的方法在圆填充问题上达到了先前报告的SOTA结果，同时使用了最小的基础模型（4B）和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题，如n皇后和迷你数独，其中正确性完全由问题约束定义。代码可在 https://github.com/yxliu02/REVES.git 获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

2606.19236 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE: 基于惊讶度的令牌级优势重加权以实现策略熵稳定性

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Tencent Hunyuan（腾讯混元）

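AI summary: To address policy entropy collapse in RL algorithms such as GRPO, proposes STARE, which identifies entropy-critical tokens via surprisal quantiles, reweights their advantages, and stabilizes entropy with a target-entropy closed-loop gate; training is stable for 1.5B-32B models across several task families, and average accuracy on AIME24/25 improves by 4%-8%.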

Comments LLM, Reinforcement Learning

AI中文摘要

基于可验证奖励的强化学习算法（如GRPO）已成为LLMs复杂推理的主流后训练范式，但通常在训练中遭受策略熵崩溃。我们对GRPO下的令牌级熵动态进行一阶梯度分析，识别出令牌级信用分配不匹配：每个令牌的熵变化分解为轨迹级优势与下一个令牌分布上的熵敏感函数的乘积，产生优势-惊讶度四象限结构和近临界性质。受此启发，我们提出STARE（基于惊讶度的令牌级优势重加权以实现策略熵稳定性），该方法通过批次内惊讶度分位数识别熵关键令牌子集，选择性重加权其有效优势，并引入目标熵闭环门控以实现稳定的熵调节。在1.5B至32B的模型规模以及三个任务族（短思维链、长思维链和多轮工具使用）上，STARE在数千步内维持稳定的RL训练，同时将策略熵保持在目标带内。在AIME24和AIME25上，STARE在平均准确率上比DAPO和其他竞争基线高出4%-8%，反思令牌和响应长度同步增长，表明持续探索-利用平衡进一步释放了RL训练潜力。代码可在 https://github.com/hp-luo/STARE 获取。

英文摘要

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
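
The three ingredients named above (a batch-level surprisal quantile, selective reweighting of advantages, and a gate driven by a target entropy) can be sketched in a few lines. Everything specific here is an assumption made for illustration: the function name, the choice to rescale tokens above the quantile, the scale factor, and the simple "act only when entropy is below target" gate. The abstract does not give these details, so this should not be read as the STARE algorithm itself.

```python
import math


def reweight_advantages(token_probs, advantages, current_entropy,
                        target_entropy, q=0.8, scale=0.5):
    """Rescale the advantages of high-surprisal tokens (illustrative only).

    token_probs: probability the policy assigned to each sampled token.
    advantages:  per-token advantages (e.g. a broadcast trajectory advantage).
    The gate leaves advantages unchanged while entropy is at or above target.
    """
    if current_entropy >= target_entropy:
        return list(advantages)  # gate closed: no intervention

    surprisals = [-math.log(p) for p in token_probs]
    ranked = sorted(surprisals)
    # Nearest-rank q-quantile of surprisal within the batch.
    k = min(len(ranked) - 1, max(0, math.ceil(q * len(ranked)) - 1))
    threshold = ranked[k]

    return [a * scale if s > threshold else a
            for s, a in zip(surprisals, advantages)]


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.1, 0.8, 0.05]
    adv = [1.0] * 5
    print(reweight_advantages(probs, adv, current_entropy=0.2,
                              target_entropy=0.5, q=0.5))
```

With `q=0.5` the threshold is the median surprisal, so only the two least likely tokens (probabilities 0.1 and 0.05) have their advantages scaled.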

2606.19264 2026-06-18 cs.LG cs.CL 交叉投稿

Structured Inference with Large Language Gibbs

大语言吉布斯结构化推理

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

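AI summary: Proposes Large Language Gibbs, which uses an LLM's conditional distributions as transition operators for structured probabilistic inference, iteratively resampling variables to avoid order-dependent bias; it is evaluated on synthetic distributions, consistent reasoning, and Bayesian structure learning.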

Comments Code: https://github.com/hyeok9855/large-language-gibbs

AI中文摘要

大型语言模型（LLMs）中编码的知识可以作为描述复杂世界变量的结构化推理的基础，但以概率一致的方式访问这些知识构成了一个困难的推理问题。我们提出了大语言吉布斯，一种结构化概率推理方案，它使用LLM的条件分布作为转移算子。不是通过单次自回归生成来采样结构化对象，而是利用LLM的下一个标记条件分布，在给定其他变量的条件下迭代地重采样单个变量。这种方法避免了顺序依赖偏差，并产生一个反映所有局部条件分布之间折衷的平稳分布。我们将这种方法应用于从合成分布中采样、一致性推理任务和贝叶斯结构学习。结果表明，在通过噪声LLM条件分布可访问的世界先验下，MCMC中使用LLM条件分布是用于结构化概率推理的一次性生成的实际替代方案。

英文摘要

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
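
The iterative resampling described above follows the shape of a standard Gibbs sampler, with the per-variable conditional supplied by a callback. In the sketch below that callback is a toy stand-in (`toy_conditional`, a hypothetical name) rather than an LLM query, and the variable names and the 0.9 coupling strength are assumptions for illustration only.

```python
import random


def gibbs_sample(conditional, init, n_sweeps, rng):
    """Run Gibbs sweeps: resample each variable given all the others.

    conditional(name, others, rng) must return a new value for `name`;
    in the setting of the abstract this would be drawn from an LLM's
    next-token conditional, which this sketch does not implement.
    """
    state = dict(init)
    samples = []
    for _ in range(n_sweeps):
        for name in list(state):
            others = {k: v for k, v in state.items() if k != name}
            state[name] = conditional(name, others, rng)
        samples.append(dict(state))
    return samples


def toy_conditional(name, others, rng):
    """Toy stand-in: copy the other binary variable with probability 0.9."""
    other_value = next(iter(others.values()))
    return other_value if rng.random() < 0.9 else 1 - other_value


if __name__ == "__main__":
    rng = random.Random(0)
    draws = gibbs_sample(toy_conditional, {"rain": 0, "wet_grass": 1}, 2000, rng)
    agree = sum(d["rain"] == d["wet_grass"] for d in draws) / len(draws)
    print(f"fraction of samples where the variables agree: {agree:.2f}")
```

Because every variable is repeatedly resampled given the rest, no variable is privileged by generation order, which is the contrast with single-pass autoregressive sampling drawn in the abstract.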

2606.19327 2026-06-18 cs.AI cs.CL 交叉投稿

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

重新思考奖励监督：基于评分准则的自蒸馏

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

发表机构 * Yale University（耶鲁大学）

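AI summary: Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.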

AI中文摘要

推理语言模型的后训练通常由监督蒸馏和基于可验证奖励的强化学习驱动。蒸馏通常依赖于思维链注释，这些注释获取成本高昂，且可能本身带有噪声、不完整或部分错误；即使最终答案正确，不完美的推理过程也会干扰学习。另一方面，基于验证奖励的强化学习通常将评估反馈压缩为标量信号，掩盖了响应中哪些方面需要改进。我们提出评分准则条件自蒸馏框架，该框架将评分准则作为结构化、细粒度的反馈用于策略内自蒸馏。我们的方法使教师模型以准则级评分准则为条件，并利用它在学生自身采样的轨迹上提供令牌级指导。这种设计避免了将单一参考推理过程作为唯一的监督目标。相反，评分准则指定了一个强响应应满足的条件，从而在推理过程中实现比标量奖励优化更细粒度的信用分配。我们通过一个两阶段流程实例化该框架：首先学习生成任务特定的评分准则，然后训练一个评分准则引导的推理器。我们在多样化的科学推理基准上进行评估，结果表明，评分准则条件自蒸馏有效地将准则级标准转化为推理过程中的令牌级指导，平均超过GRPO 1.0分、OPSD 0.9分。

英文摘要

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2602.05992 2026-06-18 cs.CL 版本更新

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

DSB: 扩散语言模型的动态滑动块调度

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

发表机构 * Nanyang Technological University（南洋理工大学）

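AI summary: To address fixed block schedules in diffusion language models ignoring semantic difficulty, proposes the training-free Dynamic Sliding Block (DSB) method and its accompanying KV-cache mechanism, DSB Cache, which consistently improve both generation quality and inference efficiency.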

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

扩散大语言模型（dLLMs）已成为文本生成的一种有前景的替代方案，其特点在于原生支持并行解码。在实践中，块推理对于避免全局双向解码中的顺序错乱以及提高输出质量至关重要。然而，广泛使用的固定、预定义块（朴素）调度忽略了语义难度，使其在质量和效率上均非最优策略：它可能迫使模型对不确定的位置过早做出承诺，同时延迟块边界附近的简单位置。在这项工作中，我们分析了朴素块调度的局限性，并揭示了根据语义难度动态调整调度对于可靠高效推理的重要性。受此启发，我们提出了动态滑动块（DSB），一种无训练的块调度方法，它使用动态大小的滑动块来克服朴素块的刚性。为了进一步提高效率，我们引入了DSB Cache，一种针对DSB量身定制的无训练KV缓存机制。跨多个模型和基准的大量实验表明，DSB与DSB Cache一起，持续提升了dLLMs的生成质量和推理效率。代码已发布在 https://github.com/lizhuo-luo/DSB 。

英文摘要

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

2602.06470 2026-06-18 cs.CL cs.AI Updated version

Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

Affiliation: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes UNO, a framework that distills user logs into rules and preference pairs, handles data heterogeneity with query-and-feedback-driven clustering, and quantifies the cognitive gap between model knowledge and log data to improve LLM systems.

AI Chinese abstract

扩大训练数据和模型参数规模长期以来推动了大型语言模型（LLMs）的发展，但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此，近期研究更加关注从真实世界部署中持续学习，其中用户交互日志提供了丰富的真实人类反馈和过程知识。然而，从用户日志学习具有挑战性，因为它们是无结构和嘈杂的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为，且用户日志收集与模型优化之间的差异（例如，非策略优化问题）进一步加剧了这一问题。为此，我们提出UNO（用户日志驱动的优化），一个统一的框架，用于通过用户日志改进LLM系统（LLMsys）。UNO首先将日志提炼为半结构化的规则和偏好对，然后利用查询和反馈驱动的聚类来管理数据异质性，最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块，以处理从用户日志中提取的初级和反思性经验，从而提升未来的响应。广泛的实验表明，UNO在效果和效率上均达到最先进的水平，显著优于检索增强生成（RAG）和基于记忆的基线方法。我们已开源代码至 https://github.com/bebr2/UNO。

English abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2603.26557 2026-06-18 cs.CL Updated version

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

Affiliations: University of Cambridge; ETH Zurich

AI summary: Proposes MemBoost, a framework in which a lightweight model reuses past answers and retrieves supporting information, and selectively routes hard queries to a stronger model, lowering LLM inference cost while preserving answer quality.

Comments ICML MemFM 2026 Workshop

2410.15595 2026-06-18 cs.AI cs.CL cs.LG Updated version

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Affiliations: Zhejiang University; Nanyang Technological University; Alibaba Group

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications, discusses its potential and limitations as an RL-free alternative, and proposes future research directions.

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

AI Chinese abstract

随着大语言模型（LLMs）的快速发展，将策略模型与人类偏好对齐变得日益关键。直接偏好优化（DPO）作为一种有前景的对齐方法，作为从人类反馈中强化学习（RLHF）的无RL替代方案而出现。尽管DPO取得了各种进展并存在固有局限性，但文献中目前缺乏对这些方面的深入综述。在这项工作中，我们对DPO中的挑战和机遇进行了全面回顾，涵盖理论分析、变体、相关偏好数据集和应用。具体而言，我们基于关键研究问题对近期DPO研究进行分类，以提供对DPO当前格局的透彻理解。此外，我们提出了几个未来研究方向，为研究社区提供模型对齐的见解。相关论文的更新合集可在 https://github.com/Mr-Loevan/DPO-Survey 找到。

English abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
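For readers who want the objective the survey is organized around, the standard per-pair DPO loss can be sketched as below. The function and argument names are illustrative, not taken from the survey; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to keep exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-ratio relative to the rejected one.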

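2510.15551 2026-06-18 cs.CL cs.AI cs.LG Updated version

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

Affiliation: Google DeepMind

AI summary: Argues that the cross-lingual gap stems from the variance of responses in the target language, formalizes the gap in terms of biased and unbiased errors, and reduces variance with inference-time ensembling, yielding relative gains in cross-lingual transfer scores of 8% to over 50%.

Comments 30 pages

AI Chinese abstract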

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型（LLMs）通过从源语言获取知识，并在使用目标语言查询时使其可访问，从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中，我们采取另一种视角来表征跨语言错误的性质，并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预，我们实证验证了我们的假设。我们展示了几种测试时集成方法，这些方法降低了响应方差，从而将源-目标迁移得分提高了多达12个绝对百分点，在各种LLMs上实现了8%到超过50%的相对提升。

English abstract

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research has focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points, yielding relative gains of 8% to over 50% across various LLMs.
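The distinction between variance-driven and biased error can be made concrete with a toy calculation. The sketch below is not the paper's formalization: it assumes each sampled response is independently right or wrong with a fixed probability, and computes how often a strict majority vote over an odd number of samples is right.

```python
from math import comb

def majority_vote_accuracy(p_correct, n_samples):
    """Probability that a strict majority of n independent samples is correct."""
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct)**(n_samples - k)
        for k in range(need, n_samples + 1)
    )
```

Under this simplification, a model that is right 60% of the time per sample reaches about 68% with five votes, whereas a model that is right only 40% of the time gets worse with voting. Ensembling removes noise around the model's preferred answer but cannot repair a preferred answer that is itself wrong.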

2606.18381 2026-06-18 cs.CL cs.IR New submission

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini

Affiliations: University of British Columbia; ServiceNow Research

AI summary: Proposes SproutRAG, which uses attention to build a tree over sentence-level chunks, enabling multi-granularity retrieval without extra LLM calls and improving information efficiency by 6.1% on average.

AI Chinese abstract

检索增强生成（RAG）系统必须平衡检索粒度与上下文连贯性，现有方法通过LLM引导的分块、单级上下文扩展或层次摘要来解决这一挑战。这些方法在索引或检索过程中依赖昂贵的LLM调用，将上下文聚合限制在单一粒度级别，或通过摘要引入信息损失。我们提出SproutRAG，一种注意力引导的层次化RAG框架，通过将句子级块组织成逐渐增大但语义连贯的单元，利用学习到的句子间注意力构建二分块树，从而解决这一权衡。与依赖外部LLM、固定上下文扩展或有损摘要的先前方法不同，SproutRAG学习哪些注意力头和层最能捕捉语义文档结构，实现无需额外LLM调用或压缩摘要的多粒度检索。在检索时，SproutRAG使用层次化束搜索检索多个粒度的候选，捕获超越平面检索的多句子相关性。该框架通过联合目标进行端到端训练，同时改进嵌入和树结构。在涵盖科学、法律和开放域设置的四个基准上的实验表明，SproutRAG在最强基线上平均信息效率（IE）提升6.1%。代码可在 https://github.com/AmirAbaskohi/SproutRAG 获取。

English abstract

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on https://github.com/AmirAbaskohi/SproutRAG.
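One way to picture a binary chunking tree built from inter-sentence scores is greedy bottom-up merging of adjacent spans. This is only an illustrative sketch: the abstract does not specify the merge rule, so the greedy strategy, the mean-pairwise affinity, and the `affinity` matrix itself are assumptions, and the function name is hypothetical.

```python
def build_binary_chunk_tree(affinity):
    """Greedily merge adjacent sentence spans into a binary tree.

    affinity[i][j] is an assumed inter-sentence attention score for a
    non-empty document. Leaves are sentence indices; internal nodes are
    (left, right) tuples.
    """
    n = len(affinity)
    # Each span is (subtree, sentence_indices), kept in document order.
    spans = [(i, [i]) for i in range(n)]

    def span_affinity(a, b):
        # Mean pairwise score between the sentences of two spans.
        total = sum(affinity[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(spans) > 1:
        # Pick the adjacent pair with the strongest affinity (first wins on ties).
        best = max(range(len(spans) - 1),
                   key=lambda k: span_affinity(spans[k][1], spans[k + 1][1]))
        left, right = spans[best], spans[best + 1]
        spans[best:best + 2] = [((left[0], right[0]), left[1] + right[1])]
    return spans[0][0]
```

Every internal node covers a contiguous run of sentences, so a retriever could score nodes at any depth and return a single sentence, a short passage, or a whole section.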

2606.18508 2026-06-18 cs.CL cs.IR New submission

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

Affiliations: University of British Columbia; University of Salerno; ServiceNow Research

AI summary: Proposes MCompassRAG, a framework that augments paragraph representations with topic metadata and trains a lightweight retriever via LLM distillation for topic-aware retrieval; across six benchmarks it improves average information efficiency by 8.24% and cuts latency by more than 5x.

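Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Affiliation: Deakin University

AI summary: Proposes CADE, a framework that embeds each timestep directly with a point-wise linear encoder, avoiding the tokenization bottleneck, and aligns time series with text anchors through a one-directional supervised contrastive loss; it improves performance on six TSQA tasks on the Time-MQA benchmark.

AI Chinese abstract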

大型语言模型的最新进展催生了时间序列问答（TSQA），它将时间序列分析表述为自然语言问答。然而，直接将原始数值序列输入LLM会遇到分词瓶颈：字节对编码将连续值分割成不稳定的词元，其嵌入缺乏有意义的度量结构，导致幅度、尺度和趋势信息的丢失。先前的方法使用基于分块的编码器将序列分割成固定窗口，锁定单一粒度，这会破坏模式并隐藏确切的时间步，且通过一个在不同长度或采样率的数据集上很少迁移的独立模块实现。为了解决这一挑战，我们提出了CADE（对比对齐与直接嵌入），一个基于两个关键组件构建的TSQA新框架：直接时间步嵌入和语义对齐。该框架通过逐点线性编码器和MLP投影器将每个时间步直接映射到LLM嵌入空间，保留了精确的索引级访问，同时消除了分块和填充的需要。为了进一步弥合时间序列与语言表示之间的语义差距，我们引入了一种新颖的单向监督对比损失，将时间序列嵌入与冻结的类名文本锚点对齐。在公开的Time-MQA基准上的实验结果表明，我们的框架在六项TSQA任务上持续提升了性能，优于开源和专有的LLM基线。

English abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
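The contrast between point-wise embedding and patching can be shown in a few lines. The sketch below covers only the linear part of the encoder (the MLP projector is omitted), and the weight and bias vectors are made-up placeholders rather than learned parameters from the paper.

```python
def embed_timesteps(series, weight, bias):
    """Map each scalar timestep x_t to the vector weight * x_t + bias."""
    return [[w * x + b for w, b in zip(weight, bias)] for x in series]

def patch_count(length, patch_size):
    """Number of patches a patch-based encoder would emit (last patch padded)."""
    return -(-length // patch_size)  # ceiling division
```

A series of T values yields exactly T embeddings, so position t in the series stays addressable as position t in the embedded sequence. A patch-based encoder with patch size P would instead emit ceil(T / P) vectors and needs padding whenever T is not a multiple of P.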

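2606.19183 2026-06-18 cs.CL cs.AI New submission

RCEM: An Embedder Equipped with Query Rewriting Skill for Robust Conversational Search under Distribution Shift

Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio

Affiliation: Microsoft

AI summary: Proposes RCEM, which distills an LLM's query rewriting ability into an embedding model, enabling context-aware retrieval without explicit rewriting and improving robustness under distribution shift.

AI Chinese abstract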

对话搜索在检索增强生成（RAG）系统中变得越来越重要，用户通过包含上下文相关查询的多轮对话与AI助手交互。我们提出RCEM，一种对话式稠密检索模型，它将LLM的查询改写能力蒸馏到嵌入模型中，从而在推理时无需显式查询改写即可实现上下文感知检索。与先前学习直接对话到文档匹配的对话式稠密检索方法不同，RCEM将对话查询嵌入与改写后的查询嵌入对齐，提高了在分布偏移下的鲁棒性。RCEM不需要用于训练的对话查询到文档的相关性映射，这些映射通常昂贵且难以获得高质量。在QReCC、TopiOCQA和TREC CAsT上的大量实验表明，RCEM始终优于强对话检索基线，在分布偏移下取得了特别大的增益，包括Recall@10提升高达20%。RCEM进一步扩展了基础嵌入模型，使其具备对话查询改写能力，同时保留了原有的检索功能，允许单个模型对独立查询和对话查询进行编码，并针对现有文档索引进行搜索，而无需重建检索数据库。

English abstract

We propose RCEM, a Robust Conversational search EMbedder that is additionally equipped with an LLM's query reformulation capability without losing the base model's generalization. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-passage matching, RCEM aligns conversations, prepended with a special token, to LLM-rewritten queries while preserving the original embedding space. The unchanged embedding space automatically maps the rewritten query to the relevant passages. As a result, RCEM (1) reduces overfitting by simplifying the alignment task from long passages to shorter rewritten queries, (2) eliminates the need for conversation-to-passage relevance labels for training, and (3) maintains its original embedding space, which allows conversational queries against indexes built by the original embedder without rebuilding them. Extensive experiments show that RCEM consistently outperforms prior approaches, achieving up to 30% improvement under distributional shift.

2606.18406 2026-06-18 cs.CL New submission

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

AI summary: Proposes CoreMem, an architecture that replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, supporting long-term-memory dialogue agents on edge devices with 8 GB of VRAM.

Comments 15 pages, 5 figures

AI Chinese abstract

个性化对话代理需要持续的长期记忆以在多次会话中维持连贯交互。然而，在消费级硬件（例如8 GB VRAM边缘设备）上部署这些能力会引入严重的内存和计算瓶颈。现有系统通常依赖各向同性余弦相似度进行检索，以及启发式规则进行上下文压缩。这些方法缺乏统一的理论基础，经常在高维检索中遭受枢纽问题，并在压缩过程中出现句法碎片化。为克服这些限制，我们提出CoreMem，一种资源高效的边缘-云记忆架构，从根本上由信息几何统一。首先，黎曼检索用局部自适应Fisher-Rao度量替代余弦匹配，通过马氏距离有效惩罚枢纽记忆，并采用O(Ndr) Woodbury加速实现实时搜索。其次，Fisher引导离散令牌蒸馏（FDTD）引入分层句子到令牌压缩机制。它从Fisher信息迹中推导敏感度分数，提供原则性的压缩-KL权衡，并辅以显式结构句法保护。在LOCOMO和LongMemEval-S基准上评估，CoreMem实现了显著的准确率提升，在开放域（+4.51个百分点）和时间（+4.17个百分点）推理上取得实质性增益。广泛性能分析证实，CoreMem在严格的8 GB VRAM预算内无缝运行，成功弥合了资源受限边缘设备与对理论基础的终身记忆代理需求之间的差距。

English abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
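The reason a Woodbury-style identity makes Mahalanobis distances cheap can be seen in the rank-1 case (the Sherman-Morrison identity). The sketch below assumes, purely for illustration, a metric of the form D + u uᵀ with D diagonal; the abstract does not state the actual structure of the metric, and the function name is hypothetical.

```python
def mahalanobis_sq_rank1(x, diag, u):
    """Squared Mahalanobis norm x^T (D + u u^T)^{-1} x in O(d) time.

    D = diag(diag) must have positive entries. Uses the Sherman-Morrison
    identity, the rank-1 case of the Woodbury identity, so no d x d
    matrix is ever formed or inverted.
    """
    x_dinv_x = sum(xi * xi / di for xi, di in zip(x, diag))
    u_dinv_x = sum(ui * xi / di for ui, xi, di in zip(u, x, diag))
    u_dinv_u = sum(ui * ui / di for ui, di in zip(u, diag))
    return x_dinv_x - u_dinv_x**2 / (1.0 + u_dinv_u)
```

With a rank-r correction instead of rank 1, the same idea costs O(dr) per vector, which is consistent with the O(Ndr) total for N memories quoted in the abstract.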

2606.18448 2026-06-18 cs.CL New submission

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Affiliations: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

AI summary: Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration; it raises the agent's average score on CUA benchmarks by 15.3 points, and multimodal skills outperform text-only skills.

AI Chinese abstract

计算机使用智能体（CUA）在标准化基准上接近人类水平，但在长周期任务和未见软件上仍存在困难。现有技能库通过可复用技能解决此问题，但仅以文本形式表示技能工件，忽略了GUI交互的视觉特性。我们提出VISUALSKILL：一种分层多模态技能，针对每个目标应用定制，并组织为按主题文件索引的中央索引，智能体通过load_topic MCP工具按需获取相关主题的文本和图形。我们通过结合编写文档与实时应用UI探索的两阶段流水线构建每个技能。在两个CUA基准CUA-World和OSExpert-Eval上，由Claude Opus 4.6支持的Claude Code CLI智能体使用VISUALSKILL达到平均得分0.456，比无技能基线（0.303）绝对提升15.3点。与从相同源内容生成且仅在模态上与VISUALSKILL不同的匹配纯文本技能相比，VISUALSKILL进一步绝对提升8.3点（0.373 vs. 0.456），直接证明在技能工件中保留视觉图形而非将其语言化，有助于智能体识别UI元素并在每次操作后验证工作流状态。我们的代码见 https://github.com/XMHZZ2018/VisualSkills。

English abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.

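2606.18728 2026-06-18 cs.CL New submission

Learning User Simulators with Turing Rewards

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

Affiliations: Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab

AI summary: Proposes Turing-RL, which trains user simulators with Turing-Test-based reinforcement learning; a discriminative Turing reward pushes generated responses to be indistinguishable from real users', outperforming baselines on conversational chat and forum discussion.

AI Chinese abstract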

在交互式环境中学习模拟人类用户可以推动代理助手的训练、个性化系统的评估、社会科学研究等。现有方法通常通过训练大型语言模型（LLM）来匹配单一真实响应，要么通过最大化对数概率，要么使用相似性奖励。我们提出Turing-RL：一种基于图灵测试的强化学习方法，用于训练用户模拟器模型。Turing-RL使用带有LLM评判器的判别性图灵奖励，根据用户历史记录对生成的响应与真实用户的不可区分程度进行评分，用户模拟器LLM学习在这种奖励下产生与用户可能说的内容不可区分的响应。在两个不同领域——对话聊天和Reddit论坛讨论中，我们发现Turing-RL在LLM和人工评估指标上均持续优于基线方法。我们的研究表明，优化不可区分性而非响应匹配对于学习用户模拟器是有效的。

English abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

2606.18543 2026-06-18 cs.AI cs.CL cs.SE Cross-list

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

Affiliation: Princeton University

AI summary: Proposes CEO-Bench, which simulates operating a startup for 500 days to evaluate how well language model agents make decisions in long-horizon, uncertain, changing environments.

AI Chinese abstract

语言模型智能体在软件工程、客户服务等孤立、短期的任务上正变得熟练。然而，现实世界的挑战需要结合多种复杂技能，这些技能在很大程度上尚未在智能体中得到测试：（1）在不确定性中导航长期视野；（2）在嘈杂环境中获取信息；（3）适应不断变化的世界；（4）协调多个移动部分以实现连贯目标。我们引入CEO-Bench，通过模拟一个代表性的现实世界任务——运营一家初创公司500天——来共同评估这些能力。智能体通过可编程的Python接口管理一家虚构公司的定价、营销、预算等众多方面，在相同的环境中运行，并面临与人类CEO相同的挑战。成功需要分析嘈杂、相互关联的业务数据库，将信号转化为合理的策略，并通过编程协调许多决策。最强的智能体编写复杂的代码，模拟客户群体以预测未来现金流，并挖掘谈判历史以揭示隐藏的客户偏好。即便如此，大多数最先进的模型在此环境中挣扎。只有Claude Opus 4.8和GPT-5.5的最终余额超过100万美元的起始资金，且两者均未能持续盈利。CEO-Bench迈出了衡量驱动持续、自适应进步所需智能的第一步。

English abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

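2606.18668 2026-06-18 cs.MA cs.CL Cross-list

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

Shuang Xie, Yunan Lu, Han Li, Lingyun Wang

Affiliations: Shopify; Columbia University

AI summary: Sub-agents in large-scale multi-agent systems tend to over-answer and hallucinate. EARS reframes abstention as an inter-agent communication protocol, uses calibrated LLM-as-a-Judge models to produce structured abstention labels and rationales, and fine-tunes sub-agents to detect failure conditions and return rationales; in an e-commerce assistant it raises the response pass rate from 68.5% to 78.9%.

AI Chinese abstract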

在大规模企业环境中，集中式多智能体系统（MAS）日益被采用，其中协调器将用户请求委托给轻量级、领域专业化的子智能体。虽然这种架构提高了模块化、可扩展性和成本效率，但其可靠性不仅取决于准确的路由，还取决于子智能体根据能力约束校准其响应的能力。特别是，基于较小微调模型的子智能体通常难以进行这种校准，导致它们过度回答模糊、未明确说明、路由错误或不支持的请求，并产生幻觉输出，而不是可操作的反馈。为了应对这一挑战，我们提出了EARS（用于可靠子智能体建模的解释性弃权），这是一个面向生产的框架，将子智能体弃权重新定义为智能体间通信协议：子智能体不仅弃权，而且向协调器暴露可操作的故障状态。EARS使用一组校准的LLM裁判模型来策划人机交互数据，在子智能体故障模式的分类法下生成结构化的弃权标签和理由。这些数据用于微调子智能体，使其能够检测故障条件并返回理由，以便协调器进行澄清、重新路由或回退。我们在一个支持企业商业智能工作流程的大规模生产电商助手中评估了EARS。EARS将整体响应通过率从68.5%提高到78.9%，证明了子智能体侧的解释性弃权提高了MAS的可靠性。

English abstract

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its reliability depends not only on accurate routing but also on sub-agents' ability to calibrate their responses to capability constraints. In particular, sub-agents built on smaller fine-tuned models often struggle with such calibration, leading them to over-answer ambiguous, underspecified, misrouted, or unsupported requests and produce hallucinated outputs instead of actionable feedback. To address this challenge, we present EARS (Explanatory Abstention for Reliable Sub-Agent Modeling), a production-oriented framework that reframes sub-agent abstention as an inter-agent communication protocol: a sub-agent does not merely abstain, but exposes an actionable failure state to the coordinator. EARS curates human-agent interaction data using an ensemble of calibrated LLM-as-a-Judge models, producing structured abstention labels and rationales under a taxonomy of sub-agent failure modes. These data are used to fine-tune sub-agents to detect failure conditions and return rationales for coordinator-level clarification, rerouting, or fallback. We evaluate EARS in a large-scale production e-commerce assistant supporting enterprise business intelligence workflows. EARS improves the overall response pass rate from 68.5% to 78.9%, demonstrating that sub-agent-side explanatory abstention improves MAS reliability.
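Treating abstention as a protocol means the sub-agent's reply carries a machine-readable failure state that the coordinator can act on. The sketch below is hypothetical: the failure categories and coordinator actions are named in the abstract, but the type names, field names, and the particular mapping from failure mode to action are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class FailureMode(Enum):
    AMBIGUOUS = "ambiguous"
    UNDERSPECIFIED = "underspecified"
    MISROUTED = "misrouted"
    UNSUPPORTED = "unsupported"

@dataclass
class SubAgentReply:
    answer: Optional[str] = None            # set when the sub-agent answers
    failure: Optional[FailureMode] = None   # set when it abstains
    rationale: str = ""                     # explanation passed to the coordinator

def coordinator_action(reply):
    """Choose the coordinator's next step from a structured reply (assumed mapping)."""
    if reply.failure is None:
        return "deliver"
    if reply.failure in (FailureMode.AMBIGUOUS, FailureMode.UNDERSPECIFIED):
        return "clarify"    # ask the user for the missing detail
    if reply.failure is FailureMode.MISROUTED:
        return "reroute"    # hand the request to another sub-agent
    return "fallback"       # unsupported request: use a fallback path
```

For example, `coordinator_action(SubAgentReply(failure=FailureMode.MISROUTED, rationale="request concerns refunds, not inventory"))` returns `"reroute"`, and the rationale tells the coordinator where to send the request next.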

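2606.19144 2026-06-18 cs.AI cs.CL Cross-list

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

Jingyi Zhou, Senlin Luo, Haofan Chen

AI summary: Proposes the Human-AI Coevolution Dynamics framework (HACD-H), which integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical model; on a dataset of about 14,700 dialogue turns, social intelligence correlates negatively with social cognitive energy, suggesting that social intelligence emerges from long-term coevolution.

AI Chinese abstract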

当前的对话式AI系统在语言生成、个性化和长上下文交互方面取得了显著进展。然而，大多数现有方法通过孤立组件（如情感建模、记忆检索或人格条件化）来建模社会行为，缺乏一个统一的框架来解释长期人机交互中稳定社会关系和社会智能的涌现。为解决这一问题，我们提出了人机协同演化动力学框架（HACD-H），这是一个将人机交互建模为自组织社会认知系统的形式模型。HACD-H将情感适应、关系组织、社会记忆和人格一致性整合到一个统一的动力学框架中，并引入了多时间尺度社会认知、关系吸引子、信任盆地、发展相变和社会认知能量景观等原则。我们构建了一个约14,700轮交互的对话数据集，并开发了一个理论驱动的实证评估框架。结果揭示了社会认知中的时间持久性层次结构、稳定的关系吸引子、类似相变的发展模式以及结构化的社会认知能量景观。社会智能与社会认知能量呈显著负相关（r = -0.391, p < 0.001），且交互轨迹随时间呈现渐进性能量减少。这些发现表明，社会智能源于长期的社会认知协同演化，而非孤立的对话能力。HACD-H为建模适应性人机社会交互和开发社会智能AI系统提供了统一的理论基础。

English abstract

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

2508.04086 2026-06-18 cs.CL Updated version

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Zhongyi Zhou, Kohei Uehara, Haoyu Zhang, Jingtao Zhou, Lin Gu, Ruofei Du, Zheng Xu, Tatsuya Harada

Affiliations: Google; The University of Tokyo; RIKEN AIP; Tohoku University

AI summary: Proposes ToolGrad, a framework that first builds valid tool-use chains through an iterative process guided by textual "gradients" and then synthesizes user queries, giving low-cost, high-success-rate data generation; models trained on the data outperform baselines.

Comments ACL 2026 Findings. Source code: https://github.com/zhongyi-zhou/toolgrad

2603.00026 2026-06-18 cs.CL cs.AI cs.IR Updated version

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary: Proposes ActMem, a framework that turns unstructured dialogue history into a structured causal and semantic graph and combines counterfactual reasoning with commonsense completion for active causal reasoning, improving LLM agents on complex memory-dependent tasks.

AI Chinese abstract

记忆管理对于长期交互中的LLM代理至关重要。当前的记忆框架通常将代理视为被动的“记录器”，并在不理解其深层含义的情况下检索信息。它们可能在需要推理和复杂决策的场景中失败。为了弥合这一关键差距，我们提出了一种新颖的可操作记忆框架ActMem，它将记忆检索与主动因果推理相结合。ActMem将非结构化对话历史转化为结构化的因果语义图。通过利用反事实推理和常识补全，它使代理能够推断隐含约束并解决过去状态与当前意图之间的潜在冲突。此外，我们引入了一个全面的数据集ActMemEval，用于评估代理在逻辑驱动场景中的推理能力，超越了现有记忆基准测试中事实检索的焦点。实验表明，ActMem在处理复杂的、依赖记忆的任务时显著优于基线，为更一致和可靠的智能助手铺平了道路。

English abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2605.30880 2026-06-18 cs.CL cs.AI Updated version

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models (symbolic belief-state programs) through counterexample-guided code repair, reaching 76.4% macro success on AgentGym environments.

Comments 40 pages

AI Chinese abstract

文本智能体环境通常被建模为部分可观察马尔可夫决策过程（POMDP），假设模拟器的潜在状态和转移动态对智能体隐藏。然而，很少有工作研究是否可以通过归纳可执行代码来作为部分可观察性下的预测和规划的世界模型。我们引入了 PatchWorld，一个免梯度框架，通过反例引导的代码修复将离线轨迹转化为可执行的 Python 世界模型。PatchWorld 不是用黑盒模型预测下一个观察，而是归纳出符号信念状态程序，其动作更新可以被检查、重放和局部修补。在七个 AgentGym 环境中，PatchWorld-Simple 在评估方法中取得了最高的基于代码的规划分数，在实时一步前瞻中达到 76.4% 的宏观成功率，同时在世界模型预测模块本身内不调用任何 LLM。我们进一步发现，人类指定的残差记忆偏差提高了表面观察保真度，但削弱了决策效用。这暴露了可执行世界模型中的权衡，因为提高观察保真度可能以牺牲动作判别动态为代价，反之亦然。代码可在 https://github.com/HKBU-KnowComp/PatchWorld 获取。

English abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.


2412.15557 2026-06-18 cs.SE cs.CL replacement

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

Affiliations: Faculty of Information Technology, Monash University; School of Computing Technologies, RMIT University; School of Science, Computing and Emerging Technologies, Swinburne University of Technology

AI summary: Proposes MORTAR, which automatically generates test cases from multi-turn metamorphic relations to address the oracle problem in multi-turn testing of LLM dialogue systems, revealing more and higher-quality bugs per test case than single-turn testing.

Comments Accepted for publication in IEEE Transactions on Software Engineering (TSE)

DOI: 10.1109/TSE.2026.3701230

Abstract

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
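The idea of a dialogue-level metamorphic relation can be illustrated with a small sketch. This is not MORTAR's implementation: the perturbation (inserting one unrelated question), the exact-match comparison, and the two toy systems are all assumptions chosen to keep the example self-contained. The point is only that the check needs no ground-truth answers, since it compares the system's answers before and after the perturbation.

```python
from typing import Callable, List

# Hypothetical interface: a dialogue system maps a list of questions
# (asked in order) to a list of answers, one per question.
DialogueSystem = Callable[[List[str]], List[str]]

FACTS = {"capital of France?": "Paris", "2 + 2?": "4", "color of the sky?": "blue"}
QUESTIONS = list(FACTS)


def stable_system(questions: List[str]) -> List[str]:
    """Toy system whose answers do not depend on dialogue position."""
    return [FACTS.get(q, "unknown") for q in questions]


def fragile_system(questions: List[str]) -> List[str]:
    """Toy system with a planted bug: it fails from the third turn onward."""
    return [FACTS.get(q, "unknown") if i < 2 else "unknown"
            for i, q in enumerate(questions)]


def insert_distractor(questions: List[str], distractor: str, position: int) -> List[str]:
    """Dialogue-level perturbation: add an unrelated question at `position`."""
    return questions[:position] + [distractor] + questions[position:]


def check_insertion_mr(system: DialogueSystem, questions: List[str],
                       distractor: str, position: int) -> List[int]:
    """Return indices of original questions whose answers changed.

    Assumed relation: inserting an unrelated turn should leave the answers
    to the original questions unchanged.
    """
    source = system(questions)
    followup = system(insert_distractor(questions, distractor, position))
    # Drop the distractor's own answer so both lists align turn by turn.
    aligned = followup[:position] + followup[position + 1:]
    return [i for i, (a, b) in enumerate(zip(source, aligned)) if a != b]


print(check_insertion_mr(stable_system, QUESTIONS, "favourite song?", 1))
print(check_insertion_mr(fragile_system, QUESTIONS, "favourite song?", 1))
```

Exact string comparison is the simplest possible violation check; a real system with free-form answers would need a more tolerant comparison, which this sketch does not attempt.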


2606.18850 2026-06-18 cs.CL cs.IR new submission

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Affiliations: State Key Laboratory of Cognitive Intelligence

AI summary: Proposes ScholarSum, a framework that builds a hierarchical knowledge graph to guide a student in drafting a summary and uses a teacher-like reviewer to iteratively check and revise it, yielding fluent and factually consistent summaries of scientific literature.

Abstract

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.


2606.18889 2026-06-18 cs.CL new submission

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

Affiliations: IDSIA, Dalle Molle Institute for Artificial Intelligence; National University of Science and Technology POLITEHNICA Bucharest

AI summary: Proposes an LM-guided counterfactual recommendation pipeline that adjusts interpretable communication features such as tone and personalization, without affecting medical content, to raise the predicted probability of positive patient feedback by 6.41% on average.

Comments 4 Tables, 8 Figures

Abstract

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.
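The inference-time search described above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the 0 to 4 ordinal scale, the hand-set logistic weights standing in for the selection and auditor models, and the restriction to one-step changes of a single feature as the notion of a "minimal" change. The abstract does not specify any of these.

```python
import math

FEATURES = ["tone", "personalization", "actionability", "completeness"]
MAX_LEVEL = 4  # assumed ordinal scale 0..4


def logistic_model(weights, bias):
    """Build a stand-in feedback predictor (hypothetical, hand-set weights)."""
    def predict(x):
        z = bias + sum(weights[f] * x[f] for f in FEATURES)
        return 1.0 / (1.0 + math.exp(-z))
    return predict


SELECTOR = logistic_model(
    {"tone": 0.2, "personalization": 0.5, "actionability": 0.3, "completeness": 0.1}, -2.0)
AUDITOR = logistic_model(
    {"tone": 0.1, "personalization": 0.4, "actionability": 0.35, "completeness": 0.2}, -2.0)


def single_step_candidates(x):
    """Yield every variant of x that moves one feature by one ordinal level."""
    for f in FEATURES:
        for delta in (-1, 1):
            level = x[f] + delta
            if 0 <= level <= MAX_LEVEL:
                yield f, {**x, f: level}


def recommend(x, selector, auditor):
    """Pick the one-step change with the largest selector gain, then audit it."""
    base_sel, base_aud = selector(x), auditor(x)
    best = None
    for feature, candidate in single_step_candidates(x):
        gain = selector(candidate) - base_sel
        if gain > 0 and (best is None or gain > best["selector_gain"]):
            best = {
                "feature": feature,
                "new_level": candidate[feature],
                "selector_gain": gain,
                # The auditor is a separate model, so its gain is an
                # independent check rather than the quantity being optimized.
                "auditor_gain": auditor(candidate) - base_aud,
            }
    return best


message_features = {"tone": 2, "personalization": 2, "actionability": 2, "completeness": 2}
print(recommend(message_features, SELECTOR, AUDITOR))
```

Keeping the auditor out of the search loop mirrors the separation the abstract draws between the selection model and independent auditors: a gain that appears only under the selector would be a sign of overfitting to that model.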


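2606.18788 2026-06-18 cs.CV cs.CL cross-list

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Tolga Şakar

Affiliations: Independent Researcher

AI summary: Targeting the agglutinative nature of Turkish, proposes Morpheus, a neural morpheme-boundary model that provides lossless, reversible tokenization and structured word embeddings; among reversible tokenizers it attains the lowest bits-per-character (1.425), raises morpheme-alignment F1 to 0.61, and saves about 19% of GPU memory.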

Abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs. ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
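The round-trip property can be illustrated with a minimal sketch of boundary-based segmentation. This is not the Poisson-binomial dynamic program from the abstract: it substitutes plain thresholding of per-character boundary probabilities, and the probabilities and the example segmentation are hand-written assumptions rather than model output. It shows only why cutting the raw string at character positions, with no normalization, makes decoding exact.

```python
from typing import List, Sequence


def encode(word: str, boundary_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Cut `word` after every character whose boundary probability >= threshold.

    boundary_probs[i] is the assumed probability of a morpheme boundary
    after word[i]; the last character needs no entry.
    """
    if len(boundary_probs) != max(len(word) - 1, 0):
        raise ValueError("need one boundary probability per character gap")
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])  # final segment runs to the end of the word
    return segments


def decode(segments: Sequence[str]) -> str:
    """Segments are untouched slices of the input, so joining restores it."""
    return "".join(segments)


# Hand-labeled illustration: ev + ler + imiz + de ("in our houses").
WORD = "evlerimizde"
PROBS = [0.1, 0.9, 0.1, 0.2, 0.8, 0.1, 0.3, 0.1, 0.95, 0.2]

segments = encode(WORD, PROBS)
print(segments)
print(decode(segments) == WORD)
```

Because the segments partition the original characters, the round trip holds for any probabilities and any threshold; segmentation quality depends on the boundary model, but reversibility does not.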


2606.18856 2026-06-18 cs.CL cs.LG new submission

Approximate Structured Diffusion for Sequence Labelling

Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

Affiliations: Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN

AI summary: Proposes a diffusion-based training method for conditional random fields (CRFs) that conditions on noisy labels to capture long-range dependencies and, combined with approximate inference, achieves a 16.5% error-rate reduction on part-of-speech tagging.

2606.18922 2026-06-18 cs.CL cs.AI new submission

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

AI summary: Develops a new annotated dataset to test how well several large language models understand negation in figurative language, finding that the combination of negation and figurative language challenges the models and that performance depends heavily on prompt style.

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2510.04120 2026-06-18 cs.CL cs.AI replacement

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau

AI summary: Uses geometric probing, contextual substitution, and syntactic perturbation to analyze semantic drift, lexical stability, and syntactic sensitivity in LLM metaphor processing, showing that strong behavioral performance may stem from heterogeneous signals.

Comments Accepted to ACL 2026

Abstract

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.


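2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS new submission

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

Affiliations: KAIST

AI summary: Proposes the Continuous Audio Thinking (CoAT) framework, which organizes acoustic information in a continuous latent space via expert distillation so that audio language models can draw on rich acoustic features before generating a response, with no additional autoregressive decoding cost, improving performance on multiple audio tasks.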

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.


2606.19341 2026-06-18 cs.CV cs.CL cs.SD cross-list

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle, which uses active perception to decouple reasoning complexity from video duration and achieves the best performance among open-source models on multiple benchmarks.

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).


2509.18588 2026-06-18 cs.CL replacement

UniECG: Understanding and Generating ECG in One Unified Model

Jiarui Jin, Haoyu Wang, Xiang Lan, Jun Li, Hongyan Li, Shenda Hong

Affiliations: Peking University; Shanghai Ocean University; the Second Hospital of Tianjin Medical University; National University of Singapore

AI summary: Proposes UniECG, a two-stage model that generates explanatory text from ECG signals or images and generates corresponding ECG signals from textual learning objectives, supporting interactive ECG education.

Abstract

Electrocardiogram (ECG) interpretation is a fundamental skill in medical education, yet students often need more than static examples to connect waveform evidence with diagnostic reasoning. This paper presents UniECG as a step toward interactive ECG education. UniECG supports two complementary learning interactions: given an ECG signal or image, it generates an evidence-based explanation; given a textual learning objective, it generates a corresponding ECG signal example for case-based learning. The model follows a two-stage design. First, it learns grounded ECG explanation from ECG signal--image--text data. Second, it introduces special ECG generation tokens and aligns their hidden representations with a pretrained text-conditioned ECG diffusion model, enabling controllable signal-level ECG generation. We evaluate UniECG through grounded ECG explanation and generation-oriented qualitative analysis, examining its potential to support explanation and case-based learning. UniECG is intended as an educational aid and a research step toward interactive AI-assisted ECG learning, rather than a clinically validated diagnostic system.


2601.19792 2026-06-18 cs.CL cs.AI cs.HC replacement

LVLMs and Humans Ground Differently in Referential Communication

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

AI summary: Multi-turn referential communication experiments with human and AI pairings show that LVLMs cannot use common ground to produce and resolve referring expressions the way humans do, which hampers communication.

Comments 27 pages, 16 figures

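2606.17372 2026-06-18 cs.CL cs.AI replacement

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

Affiliations: Stony Brook University

AI summary: Controlling for task differences, compares the effect of explicit and implicit prompting on LVLMs' production of efficient referring expressions, finding that models coordinate on efficient expressions under explicit prompting but fail under implicit prompting, which reveals a key difference in human-machine communication.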

2606.05409 2026-06-18 cs.CV cs.CL replacement

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

AI summary: Introduces the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.

Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.


2606.15088 2026-06-18 cs.SD cs.CL eess.AS replacement

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

Affiliations: Institute of Information Engineering, CAS; School of Cyber Security, UCAS; The University of Western Australia; Beihang University

AI summary: Proposes the Paired Pathway Controlled Protocol (PPCP) and finds that, in multimodal models, knowledge acquired through the text pathway is forgotten more readily than knowledge acquired through the audio pathway; the effect is not explained by architectural depth and stems mainly from differences in input representation.

Abstract

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.


2606.18466 2026-06-18 cs.CL new submission

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

Affiliations: University of Wisconsin-Madison; McGill University; Centre for Brain, Language, and Music; University of Oregon

AI summary: Describes how MFA has developed from version 1.0 to 3.0 and evaluates it on English, Japanese, and Korean, reaching best or near-best performance on four benchmark datasets with mean boundary error below 15 ms.

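2606.18584 2026-06-18 cs.CL new submission

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou

Affiliations: Jiangxi Normal University; Soochow University

AI summary: To tackle the discrimination of similar languages and dialects, proposes a speech-driven approach based on MFCC features and an end-to-end HMM-DNN model, combining an attention mechanism and a CNN to fuse word embeddings with MFCC features; it outperforms existing methods on benchmark corpora.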

Comments Published in ACM TALLIP

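2606.18979 2026-06-18 eess.AS cs.CL cs.SD cross-list

UMA-Split: Unimodal Aggregation for Non-Autoregressive Speech Recognition in English and Mandarin

Ying Fang, Xiaofei Li

Affiliations: Zhejiang University; School of Engineering, Westlake University; Institute of Advanced Technology, Westlake Institute for Advanced Study

AI summary: To address UMA's performance drop on languages such as English caused by a mismatch in token granularity, proposes UMA-Split, whose split module lets each aggregated frame map to multiple tokens, improving the cross-lingual performance of non-autoregressive speech recognition.

Comments Accepted by ICASSP 2026. Code: https://github.com/FnoY0723/uma_split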

2509.09631 2026-06-18 cs.SD cs.CL cs.CV replacement

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

AI summary: Proposes DiFlow-TTS, a framework that uses discrete flow matching and a factorized discrete flow denoiser to balance quality and low latency in zero-shot TTS.

Comments Accepted at Interspeech 2026 (Long Paper Track)

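2606.18471 2026-06-18 cs.CL new submission

LLMs Struggle to Gauge Items That Separate Students of Different Proficiency: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

Affiliations: MBZUAI; University of Maryland; Virginia Tech

AI summary: Evaluates 42 LLMs on predicting item discrimination in zero-shot settings and finds that direct prediction correlates weakly with human-calibrated discrimination (best Spearman 0.152) and that CTT-based response calibration gives only a limited correlation (0.241), indicating that LLMs cannot yet reliably capture item discrimination.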

Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
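The response-based route can be sketched in a few lines. The abstract does not say which CTT statistic is used, so this sketch assumes the corrected item-total correlation (item score against the total on the remaining items), a common CTT discrimination index; the response matrix is invented, and the Spearman helper is written out by hand to keep the example dependency-free.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant
    (the coefficient is undefined there; 0.0 is this sketch's convention)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def item_discrimination(responses, item):
    """Corrected item-total correlation for one item.

    responses[s][i] is 1 if (synthetic) student s answered item i correctly.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Invented data: items 0-2 track overall ability, item 3 runs against it.
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

estimated = [item_discrimination(RESPONSES, i) for i in range(4)]
print(estimated)
```

Given human-calibrated discrimination values for the same items, `spearman(estimated, human_values)` would give the kind of rank agreement the abstract reports.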


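RECOM: The Validity-Discrimination Tradeoff of Automatic Evaluation Metrics in Open-Ended Reddit Question Answering

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Affiliations: University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

AI summary: Introduces the RECOM dataset and finds that automatic metrics cannot deliver both validity and discriminative power on open-ended question answering: cosine similarity has high validity but poor discrimination, while BERTScore's discrimination is confounded by length and its validity is weak.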

Abstract

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
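The validity side of this comparison can be sketched as follows. The sketch assumes one reference reply per question and uses word-set Jaccard overlap as a stand-in metric (not cosine similarity or BERTScore); the example texts are invented. It shows the two ingredients the abstract names: a random derangement, which re-pairs every output with a reply to a different question, and Cohen's d between matched and re-paired scores.

```python
import random
from statistics import mean, stdev


def random_derangement(n, rng):
    """Sample a permutation of range(n) with no fixed points (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (needs nonzero pooled variance)."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled


def jaccard(x, y):
    """Stand-in metric: overlap of lowercase word sets."""
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / len(sx | sy) if sx | sy else 0.0


def validity_effect_size(outputs, references, metric, seed=0):
    """Effect size separating matched pairs from the derangement noise floor."""
    perm = random_derangement(len(outputs), random.Random(seed))
    matched = [metric(o, r) for o, r in zip(outputs, references)]
    floor = [metric(o, references[p]) for o, p in zip(outputs, perm)]
    return cohens_d(matched, floor)


OUTPUTS = [
    "cats sleep most of the day",
    "coffee keeps me awake at night",
    "rain makes the city quiet",
    "old books smell like vanilla",
]
REFERENCES = [
    "my cats sleep all day long",
    "coffee at night keeps me awake",
    "the city gets quiet when rain falls",
    "old books really smell like vanilla",
]

print(validity_effect_size(OUTPUTS, REFERENCES, jaccard))
```

A derangement is used rather than a plain shuffle so that no output is accidentally scored against its own reference, which would leak matched pairs into the floor.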

2606.19334 2026-06-18 cs.CL cs.CY cs.LG new submission

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

用LOCUS解放法律：美国地方条例语料库

Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

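Affiliations: UC Berkeley; School of Information; Independent

AI summary: To address the lack of machine-readable corpora of U.S. local ordinances, the authors build LOCUS, a corpus of ordinance codes from 9,239 cities and counties, and train ModernBERT-based classifiers to analyze dimensions of local law such as opacity.

Comments 14 pages, 6 figures

AI Chinese abstract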

法律人工智能的进展越来越依赖于大规模获取权威法律文本。然而，美国法律中最具影响力的层级之一——地方条例——在很大程度上仍然缺失于现有的机器可读语料库中。地方法规管辖着分区、住房、商业许可、公共卫生、噪音、动物控制以及许多其他日常监管领域，但它们分散在专为人类浏览而非批量研究访问设计的供应商平台上。我们引入了LOCUS——美国地方条例语料库——一个全面的语料库和县级统一访问层，用于美国市和县条例。原始语料库可供研究人员发布，几乎涵盖了所有公开可用的市和县条例。由此产生的原始语料库包含来自9239个城市和县的法规。一个较小的县级统一LOCUS访问层覆盖了美国3144个县中最大的2309个，覆盖了大部分人口。我们使用OCR来处理使法律无法成为公共资源的各种文档格式。我们发布了带有覆盖元数据的语料库，以支持可重复性、下游法律AI研究以及逐步扩展对地方法律的机器可读访问。我们训练了一系列基于ModernBERT的分类器和评分器，以便从多个维度分析美国地方法律，例如不透明性和家长式作风，这些维度以前从未在此规模上研究过。LOCUS-v1及其衍生模型可在以下网址获取：this https URL

English abstract

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

2606.18686 2026-06-18 cs.AI cs.CL cs.LG cross-list

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim：一个模拟世界预测基准

Jaeho Lee, Nick Merrill, Ezra Karger

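Affiliations: Forecasting Research Institute

AI summary: Proposes ForecastBench-Sim, a forecasting benchmark built on simulated Freeciv games that uses game rollouts to generate controlled, immediately resolvable forecasting questions for evaluating the probabilistic reasoning of AI systems.

Comments 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

AI Chinese abstract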

通用AI系统的预测基准通常继承现实世界的约束：结果缓慢显现、尾部事件罕见、反事实问题难以评分。我们引入ForecastBench-Sim，一个基于Freeciv（一款以文明系列为模型的回合制策略游戏）游戏回滚的模拟世界预测基准。预测者接收固定的世界报告（当前游戏状态的结构化快照），并回答关于隐藏未来状态的问题；然后基准继续模拟并对预测进行评分。由于世界是模拟的，同一设置可以生成任意时间跨度的连续或二元预测问题、用于条件或因果问题的配对干预世界，以及罕见或破坏性结果的已解决示例。我们描述了基准流程、问题族、评分协议和发布工件，并报告了来自模型评估和匿名人工试点的验证切片。ForecastBench-Sim旨在通过提供受控、即时可解的任务来补充现实世界预测基准，用于研究动态世界状态下的概率推理。

English abstract

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.

2606.18829 2026-06-18 cs.LG cs.CL cross-list

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem：多主体共享内存代理中的内存治理基准

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

AI summary: Proposes the GateMem benchmark, which evaluates the governance of multi-principal shared-memory agents along three axes (utility, access control, and forgetting) and finds that no existing method satisfies all three at once.

Comments 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

AI Chinese abstract

LLM代理的内存基准主要假设单用户设置，而医院、工作场所、校园和家庭中的共享助手研究不足。在这些部署中，多个主体写入公共内存池并根据不同角色、范围和关系进行查询，因此内存质量需要治理和召回。我们引入GateMem，一个多主体共享内存代理的基准。GateMem联合评估合法长期请求的效用（含状态更新）、跨上下文授权边界的访问控制，以及显式删除请求后的主动遗忘。它涵盖医疗、办公、教育和家庭领域，包含长形式多方情节、增量内存注入、隐藏检查点、结构化评判和泄漏目标注释。在多种基线和骨干模型上，没有方法能同时实现强效用、鲁棒访问控制和可靠遗忘。长上下文提示通常以高令牌成本获得最佳治理分数，而基于检索和外部内存的方法降低成本但仍泄漏未授权或已删除信息。这些结果表明，当前内存代理远未达到可靠的共享机构部署水平。

English abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

2606.19139 2026-06-18 cs.CV cs.CL cross-list

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Urdu Katib 手写数据集：用于离线乌尔都语手写文本识别的历史文档数据集及基于CRNN的基线评估

Ramza Basharat, Muhammad Usman Ali

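Affiliations: Department of Computer Science, University of Gujrat

AI summary: To address the scarcity of datasets for Urdu handwritten text recognition, the paper presents UKHD, the first offline dataset of Urdu handwritten text lines written by Katibs in historical times, and evaluates several CRNN-based hybrid models, of which CNN-BGRU-CTC performs best on character error rate and word error rate.

AI Chinese abstract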

自动手写文本识别（HTR）本质上是一项具有挑战性的任务，当处理草书体时，其复杂性进一步增加。尽管在各种草书体上已经做出了显著努力，但关于乌尔都语手写文本识别（UHTR）的研究相对有限。这种研究滞后主要是由于其文字带来的独特挑战，以及基准数据集的稀缺和不可用。因此，为了推进UHTR研究，本研究提出了一个专门的真实数据集，称为Urdu Katib手写数据集（UKHD）。据我们所知，这是第一个专门从历史时期Katib书写的材料中整理的离线乌尔都语手写文本行数据集。它涵盖了Nastalique书法风格中各种扁平笔尖书写变体。此外，评估了不同基于CRNN的混合模型的有效性，以确定用于Urdu Katib手写识别（UKHR）的最佳架构。在分析的模型中，CNN-BGRU-CTC模型表现出更稳健的性能，具有较低的字符错误率（CER）和词错误率（WER）。本研究工作旨在支持和鼓励研究社区开发用于保存乌尔都语手写文学的稳健识别系统。

English abstract

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.
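The abstract reports Character Error Rate (CER) and Word Error Rate (WER). A minimal sketch of the standard edit-distance definitions of both follows. Treating each Unicode code point as one character is an assumption made for illustration; the paper's tokenization of Nastalique text is not described here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character errors divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the recognizer inserts many spurious symbols.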

2606.19157 2026-06-18 eess.AS cs.CL cross-list

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

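AI summary: Compares LLM and human annotation in active learning and finds that LLM annotation is cheaper and yields better performance, but that active learning offers no advantage under LLM annotation.

AI abstract (translated from Chinese)

Instruction-tuned LLMs can label thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels in the AL loop, and is AL still necessary when an entire corpus can be labeled cheaply? We study both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-labeled), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, large-scale LLM annotation outperforms classifiers trained on human supervision at roughly one tenth of the cost ($28 for the GPT-5.2 Batch API versus $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-source (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is achieved through the two-question decomposition; a holistic single-prompt baseline only matches human supervision. Under either LLM annotator, active learning shows no reliable advantage over random sampling. The error structure differs markedly, however: only GPT-5.2 under the two-question interface yields classifiers with a near-human FP/FN balance, while the other LLM variants over-flag discourse about border control and economic competition. We release the dataset and code.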

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-lingual BrowseComp-Plus

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

Affiliations: Waseda University; Northwestern University; RIKEN AIP; Snowflake Inc.; University of Utah; Duke-NUS Medical School; The University of Tokyo

AI summary: Proposes XBCP, a cross-lingual benchmark that evaluates deep research agents when the evidence language differs from the query language, and finds substantial performance degradation on both the retrieval side and the agent side.

Comments Preprint

AI Chinese abstract

深度研究智能体越来越被评估其搜索证据、推理检索来源和生成有依据答案的能力。然而，现有的浏览基准大多假设用户查询和支持证据使用同一种语言，因此当相关证据出现在另一种语言时，智能体搜索系统能否运行尚不清楚。我们引入了 XBCP（跨语言 BrowseComp-Plus），这是一个受控基准，它保留了 BrowseComp-Plus 的英文问答空间，但改变了支持文档的语言。XBCP 实例化了两个互补的设置：在跨语言设置中，每个查询与单一指定语言的证据配对。在多语言设置中，完整的证据语料库在 12 种语言（涵盖高资源和低资源语言）中均匀随机分布。我们使用稀疏和密集的多语言检索器评估了四个深度研究智能体，测量了答案准确性、证据召回率、搜索行为、校准度、引用忠实度和 oracle 检索。结果显示，当证据被翻译时，性能显著下降。即使是强大的密集检索器也会丢失证据召回率，智能体变得不那么校准，且引用证据的可靠性降低。值得注意的是，即使直接提供所有黄金证据，准确性仍然较低。这些发现表明，跨语言深度研究暴露了检索失败和智能体端在整合语言不匹配证据方面的独立困难。

English abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

2606.16000 2026-06-18 cs.CL cs.LG updated version

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

GRACE-DS：数据科学中的受保护奖励引导智能体修正环境

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

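Affiliations: ITMO University; HSE University

AI summary: Proposes GRACE-DS, an isolated environment for pre-deployment evaluation of LLM-powered AutoML agents, in which hidden executable validators measure predictive performance, leakage avoidance, reproducibility, and related criteria; experiments show that its flexible iterative interaction regime outperforms the baselines.

AI Chinese abstract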

我们介绍了GRACE-DS，一个数据科学中的受保护奖励引导智能体修正环境，用于对LLM驱动的AutoML智能体进行部署前评估。GRACE-DS是一组在隔离环境中的评估指标，可应用于特定组织的表格ML任务。它将智能体暴露于现实的工作流阶段，从规划和数据检查到特征工程、模型开发、验证、代码修复直至最终提交，同时隐藏的可执行验证器不仅衡量最终预测性能，还衡量泄漏避免、可重复性、协议有效性、修正行为和奖励对齐。最强的结构化机制——灵活迭代交互（我们的方法）——实现了比单次生成、非结构化交互和基于重启的基线更高的端到端归一化隐藏测试质量，同时提高了协议有效完成率。经过7000多个回合的验证，这些结果确立了GRACE-DS作为评估基于LLM的AutoML智能体在生产类条件下按照组织特定要求执行机器学习工作流能力的稳健平台。

English abstract

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2502.02904 2026-06-18 cs.HC cs.CL q-bio.NC updated version

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

ScholaWrite: 端到端学术写作过程数据集

Khanh Chi Le, Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang

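Affiliations: University of Minnesota

AI summary: Presents the ScholaWrite dataset, which uses a Chrome extension to record keystrokes on Overleaf and captures the multi-month writing process from first draft to final manuscript; it contains nearly 62K text changes from five computer science preprints, annotated with cognitive writing intentions, and reveals gaps between human writing and LLM assistance.

Comments Equal contribution: Khanh Chi Le, Linghe Wang, Minhwa Lee | project page: https://minnesotanlp.github.io/scholawrite/

AI Chinese abstract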

写作是一项认知要求高的活动，需要持续决策、高度依赖工作记忆，并在不同目标的任务之间频繁切换。为了构建与作者认知真正一致的写作助手，我们必须捕捉并解码作者将想法转化为最终文本背后的完整思维过程。我们提出了ScholaWrite，这是第一个端到端学术写作数据集，追踪从初稿到最终手稿的多月历程。我们贡献了三个关键进展：（1）一个Chrome扩展，可无干扰地记录Overleaf上的按键，从而能够收集真实、现场写作数据；（2）一个新颖的完整学术手稿语料库，附有认知写作意图的细粒度标注。该数据集包含基于LaTeX的五篇计算机科学预印本的编辑，捕捉了四个月内近6.2万次文本更改；（3）对学术写作微观动态的分析和见解，突出了人类写作过程与大型语言模型（LLM）在提供有意义帮助方面的当前能力之间的差距。ScholaWrite强调了捕获端到端写作数据以开发未来写作助手的重要性，这些助手支持而非取代科学家的认知工作。

English abstract

Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes LaTeX-based edits from five computer science preprints, capturing nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.

2511.00802 2026-06-18 cs.SE cs.CL cs.LG updated version

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

GrowthHacker: 使用代码修改型LLM代理的自动离线策略评估优化

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

Affiliations: Michigan Technological University, Houghton; Birmingham City University; University of British Columbia, Kelowna

AI summary: Proposes the GrowthHacker benchmark, in which LLM agents automatically and iteratively modify code to optimize off-policy evaluation (OPE) implementations; several frameworks are evaluated on Open Bandit Pipeline and Scope-RL, showing that LLM-based agents can act as automated growth hackers that continuously improve OPE systems.

Comments Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

DOI: 10.1145/3815588

AI Chinese abstract

随着数据驱动开发的广泛采用，在线A/B测试已成为衡量新技术效果的既定方法。然而，部署在线实验需要设计、实现和部署资源，并可能对用户产生负面影响（例如，不安全或不道德的结果），同时需要数周的数据收集。为了解决这一问题，离线策略评估（OPE）或离线A/B测试这一日益增长的研究领域，使用先前收集的日志数据离线评估新技术。OPE也是强化学习中的一个基本问题，在在线测试昂贵或风险高的领域（如医疗保健、推荐系统、教育和机器人技术）中非常重要。尽管代码生成大语言模型（LLM）和代理工作流取得了进展，但关于LLM和基于LLM的代理是否以及如何自动优化OPE实现，我们知之甚少。我们提出了GrowthHacker，这是一个基准测试，用于在大规模公共数据集上评估基线LLM和基于LLM的代理。GrowthHacker自主迭代修改代码，运行OPE，并使用指标指导后续优化。我们在Open Bandit Pipeline（OBP）和Scope-RL上评估方法，并开发了一个双代理框架，该框架解决了现有框架的局限性，同时降低了复杂性。在两个库中，双代理显示出最高的可靠性（98.1%-100%成功率）和正向结果率（78%），正向结果的中位改进为4.4%；CrewAI实现了最高的平均改进（37.9%），并且是唯一没有极端值失败的框架。AutoGen和Default各达到65%的正向结果率。这些结果证明了使用基于LLM的代理作为自动“增长黑客”持续改进OPE系统的可行性，对在手动优化成本高昂的情况下扩展数据驱动决策具有重要意义。

English abstract

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.
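Off-policy evaluation, as described above, estimates how a new policy would perform using only logged data. As an illustration, the sketch below implements inverse propensity scoring (IPS), one standard OPE estimator. It is not claimed to be the estimator that GrowthHacker optimizes, and it does not use the Open Bandit Pipeline or Scope-RL APIs; the log layout, function names, and data are all hypothetical.

```python
def ips_value(logs, target_policy):
    """IPS estimate of the target policy's average reward.

    Each log entry is (context, action, reward, logging_propensity), where
    logging_propensity is the probability with which the logging policy
    chose that action in that context.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logs)


# Hypothetical logs from a uniform-random logging policy over actions "a", "b".
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 1.0, 0.5),
    ("u4", "b", 1.0, 0.5),
]


def always_a(context, action):
    """Target policy that deterministically picks action "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_value(logs, always_a)
```

Here every logged use of action "a" earned reward 1.0, so the estimate for the always-"a" policy is 1.0. IPS is unbiased when the logged propensities are correct and nonzero wherever the target policy acts, but its variance grows as propensities shrink.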

2601.12805 2026-06-18 q-bio.GN cs.AI cs.CL updated version

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

SciHorizon-GENE：从基因知识到功能理解的生命科学推理基准测试

Xiaohan Huang, Meng Xiao, Chuan Qin, Qingqing Long, Jinmiao Chen, Yuanchun Zhou, Hengshu Zhu

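Affiliations: Computer Network Information Center, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; DUKE-NUS Medical School, National University of Singapore; Singapore Immunology Network, Agency for Science, Technology and Research

AI summary: To probe the weaknesses of large language models in gene-level reasoning, the authors build SciHorizon-GENE, a benchmark covering over 190K human genes and 540K questions that evaluates models along four biologically critical dimensions (research attention sensitivity, hallucination tendency, answer completeness, and literature influence) and reveals persistent challenges in generating faithful, complete, and literature-grounded functional interpretations.

Comments Accepted by SIGKDD 2026. 12 pages

AI Chinese abstract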

大型语言模型（LLMs）在生物医学研究中展现出日益增长的潜力，尤其是在知识驱动的解释任务中。然而，它们从基因知识到功能理解的可靠推理能力——这是知识增强型细胞图谱解释的核心要求——仍然在很大程度上未被探索。为了填补这一空白，我们引入了SciHorizon-GENE，这是一个基于权威生物数据库构建的大规模基因中心基准。该基准整合了超过19万个人类基因的 curated 知识，包含超过54万个问题，涵盖了与细胞类型注释、功能解释和机制导向分析相关的多种基因到功能推理场景。受初步检查中观察到的行为模式启发，SciHorizon-GENE从四个生物学关键角度评估LLMs：研究关注敏感性、幻觉倾向、答案完整性和文献影响力，明确针对限制LLMs在生物解释管道中安全采用的失败模式。我们系统评估了多种最先进的通用和生物医学LLMs，揭示了基因级推理能力的显著异质性，以及在生成忠实、完整且基于文献的功能解释方面的持续挑战。我们的基准为在基因尺度上分析LLM行为建立了系统基础，并为模型选择和发展提供了见解，与知识增强型生物解释直接相关。

English abstract

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

2605.29676 2026-06-18 cs.AI cs.CL updated version

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要：智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

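Affiliations: Know Center Research GmbH; Graz University of Technology; Graz Center for Machine Learning

AI summary: Evaluates two token-optimized formats, TOON and TRON, on four agentic benchmarks and finds that TRON cuts tokens by up to 27% with accuracy within 14pp of the JSON baseline, whereas TOON cuts tokens by up to 18% but suffers cascading multi-turn parsing failures and collapsed parallel tool-call output.

Comments 16 pages, 6 figures, 4 tables

AI Chinese abstract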

智能体AI系统中的大型语言模型消耗工具模式和执行结果，并发出结构化数据的工具调用。这种交换的默认语言JSON是为应用间交换而非令牌效率设计的，因此其结构元素带来大量令牌开销。最近的工作提出了令牌优化替代方案，如TOON（令牌导向对象表示法）和TRON（令牌减少对象表示法）作为更紧凑的替代，但这些格式仅在孤立的理解或生成任务上进行了评估。它们在端到端智能体循环中是否保持令牌减少仍是一个开放问题。我们在四个智能体基准（BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench）和五个开放权重LLM上评估了TOON和TRON，将输入压缩与输出压缩解耦，以独立测量理解和生成。TRON最多减少27%的令牌，准确率在JSON基线的14个百分点内。TOON实现了最多18%的减少，准确率成本类似为9个百分点，但在多轮解析失败上额外级联，并且对于大多数模型导致并行工具调用输出崩溃。

English abstract

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters

2606.07591 2026-06-18 cs.LG cs.AI cs.CL updated version

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

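Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Proposes the ResearchClawBench benchmark, comprising 40 tasks across 10 domains and scored with multimodal rubrics, to evaluate autonomous scientific research; the strongest agent scores only 21.5, exposing weaknesses in experimental protocol, evidence matching, and scientific core.

AI Chinese abstract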

AI编码智能体越来越多地用于科学工作，但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench，一个用于评估自主科学研究的基准，涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文，提供相关文献和原始数据，并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准，从而能够评估目标论文级别的重新发现，同时为新发现留出空间。我们在统一协议下评估了七个自主研究（auto-research）智能体，并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现：最强的自主智能体Claude Code平均得分为21.5，最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7，LLM前沿均值仅为26.5。错误分析表明，失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

English abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.17188 2026-06-18 cs.CV cs.CL updated version

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

并非真正的多语言：脚本一致性作为VLM评估中缺失的维度

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

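Affiliations: RediMinds Inc.; The University of Texas at Austin; Independent Researcher

AI summary: Proposes the PuMVR benchmark, evaluates 10 VLMs on Punjabi's three scripts, finds a substantial script gap, and proposes the Script Consistency Rate (SCR) as a necessary evaluation metric.

AI Chinese abstract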

当前视觉语言模型（VLM）的多语言评估假设语言与正字法一一对应，忽略了使用多种文字语言的数十亿用户。我们引入了PuMVR（旁遮普多模态视觉推理），这是一个包含1000个严格平行图像-文本实例的基准，覆盖旁遮普语的三种活跃文字：古木基文、沙穆基文和罗马文。评估10个最先进的VLM，我们暴露了一个显著且系统的脚本差距。模型经常在一种文字上解决视觉任务，而在另一种文字上失败，准确率差异高达16%。关键的是，视觉输入均匀地提升了绝对性能，但并未缩小正字法差距。此外，跨文字的上下文迁移非常脆弱，揭示了脚本锁定的知识表示。通过所有文字对的McNemar检验支持，我们的发现表明当前的“多语言”VLM并非真正的多文字。我们提出脚本一致性率（SCR），在我们的基准上低至24.8%，作为脚本无关评估的强制性指标，以确保公平的AI访问。数据和代码可在以下网址获取：this https URL。

English abstract

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
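The abstract cites McNemar tests across script pairs. As an illustration only, the sketch below runs an exact two-sided McNemar test on paired per-item correctness for two scripts. Whether the paper uses the exact or the chi-squared variant is not stated here, and the function names and sample data are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count items solved in script A only (b) and in script B only (c)."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant item is equally likely to
    favor either script, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-item correctness of one model on the same items in two scripts.
gurmukhi_ok = [True, True, True, False, True, True, True, True, False, True]
roman_ok = [True, False, False, False, True, False, False, True, False, False]

b, c = discordant_counts(gurmukhi_ok, roman_ok)
p_value = mcnemar_exact(b, c)
```

Only discordant items carry information in this test: items that the model gets right in both scripts, or wrong in both, do not affect the p-value.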

2606.18142 2026-06-18 cs.AI cs.CL cs.CY updated version

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛：前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

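Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

AI summary: Proposes TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when carrying out actions such as travel booking for users. Seven frontier models are evaluated; all score below the 64% chance level, and the best model reaches only 53%.

AI Chinese abstract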

AI代理正从顾问转变为行动者，代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应，但未检验这些响应中的福利推理是否迁移到代理部署中（模型必须使用工具采取行动）。我们引入TAC（旅行代理同情心），这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景，涵盖六类动物剥削，并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%，最佳表现者（Claude Opus 4.7）为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升，在GPT-5.2中提升26个百分点，在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计（使用Gemini 2.5 Flash Lite作为评判者，对前两名模型的288个基础条件转录进行审计）未标记任何评估意识转录，表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

English abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18372 2026-06-18 cs.CL cs.AI 新提交

错误的正确：量化和定位大语言模型中的失调对齐

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

发表机构 * University of Michigan（密歇根大学）； University of Cambridge（剑桥大学）； University of Aberdeen（阿伯丁大学）

AI总结本文提出VETO基准和失调对齐率（MAR）指标，发现所有LLM在刻板印象相关问题上均存在非平凡的失调对齐，且人类为0%，机制分析表明对齐诱导的线索会放大该现象。

详情

AI中文摘要

警告：本文研究刻板印象和偏见，包含可能令人不适的例子，仅用于说明目的。我们的发现不应被解释为反对对齐的论据。相反，本文强调需要以有原则的方法实现更先进的对齐。对齐旨在确保大语言模型（LLMs）的行为安全可靠，包括避免不安全的推断。然而，我们表明这种以安全为导向的行为可能误触发：即使上下文明确支持，模型仍可能拒绝有根据的结论。我们将这种失败模式称为失调对齐（misfired alignment），即对齐引起的改变导致LLMs无视显式证据。为了量化这一现象（具体针对刻板印象相关的对齐），我们引入了VETO，一个由2,032个源自BBQ的对比对组成的基准，并定义了一个新指标：失调对齐率（MAR），它以0到100的尺度衡量模型在刻板印象相关问题上失败、却在其对比问题上成功的频率。我们在VETO上对25个LLMs进行了基准测试，结果表明所有LLMs（包括最新的模型）都表现出不可忽视的MAR（4.7%至18.9%），而所有人类参与者的MAR均为0.0%。受控启动实验进一步表明，对齐诱导的线索可以显著放大各LLM的MAR，说明这些失败并非只是个别样例造成的偶然现象，而是可以由安全相关的表述框架诱发。对开放权重LLMs的机制分析揭示了后期层对有证据支持的答案的抑制，指令模型与基础模型之间的比较则表明这种抑制出现在指令训练之后。这些发现表明，当前的对齐方法可能过度泛化表层安全线索，以至于无视客观证据，这促使我们进一步研究能更好保持上下文依据的对齐目标。

英文摘要

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
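
The MAR definition in the abstract can be illustrated with a small sketch. It assumes each contrastive pair is recorded as two booleans (correct on the stereotype-related item, correct on its contrastive counterpart) and that the rate is taken over all pairs; the paper's exact normalisation is not stated in the abstract, so the denominator and the function name `misfired_alignment_rate` are assumptions.

```python
from typing import Iterable, Tuple


def misfired_alignment_rate(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Return a 0-100 score from (stereotype_correct, contrastive_correct) pairs.

    A pair counts as a misfire when the model fails the stereotype-related
    question but answers its contrastive counterpart correctly.
    Assumption: the denominator is the total number of pairs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    misfires = sum(
        1 for stereo_ok, contrast_ok in pairs if not stereo_ok and contrast_ok
    )
    return 100.0 * misfires / len(pairs)


# Four hypothetical pairs; only the second one is a misfire.
results = [(True, True), (False, True), (False, False), (True, False)]
print(misfired_alignment_rate(results))  # 25.0
```

Under this reading, a model that fails both items of a pair is not counted, which separates misfires from plain inability to answer.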

2606.18767 2026-06-18 cs.CL 新提交

Output Vector Editing for Memorization Mitigation in Large Language Models

输出向量编辑：缓解大型语言模型中的记忆化问题

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

发表机构 * Center for Information and Language Processing, LMU Munich（慕尼黑大学语言与信息处理中心）； Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）； Munich Center for Machine Learning（慕尼黑机器学习中心）； Pioneer Centre for AI（人工智能先锋中心）

AI总结提出输出向量编辑方法，通过约束优化修改MLP神经元输出向量引入干扰项，在不改变激活值的情况下抑制记忆化序列，在OLMo-7B上实现87.9%抑制率，并揭示MLP编辑的机制边界。

详情

AI中文摘要

大型语言模型会记忆并复现训练数据中的序列，从而带来隐私、版权和安全风险。现有的神经元级缓解方法将编辑等同于将神经元激活归零，但激活仅控制神经元是否参与；输出向量才是写入残差流的内容，并通过叠加编码多个特征。我们提出输出向量编辑，这是一种约束优化的权重编辑方法，定位负责记忆化延续的一小组MLP神经元，并最小程度地修改其输出向量，以在词汇空间中引入干扰项，从而重定向它们在残差流中的贡献，同时保持激活不变。在四个模型（SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B，参数规模从360M到7B）上进行评估，我们重点研究OLMo-7B（其开放权重和预训练语料库支持系统化挖掘），挖掘了6831个记忆化序列，实现了高达87.9%的抑制率。在相同定位的神经元上，与零消融相比，2.7倍的差距表明抑制来自输出向量编辑，而非仅定位。四种编辑模式涵盖了从激进抑制到最小重定向的谱系；集成使用时覆盖了96.5%的记忆化序列，而我们推荐的单一模式配置达到了81.5%，且没有灾难性的局部性失败。我们进一步识别了一个机制边界：约14%的序列无法通过仅MLP编辑达到；虽然这些失败总体上并非由注意力驱动，但消融贡献最大的注意力头可恢复其中60-64%，对于从前缀复制token的延续，恢复更强，这表明注意力是互补的后备机制而非主要机制。编辑模式排序和成功-局部性权衡在所有四个模型上迁移，成功率随模型规模而非家族增长。

英文摘要

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14\%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
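
The difference between zeroing an activation and editing an output vector can be seen in a toy example. The sketch below is not the paper's method: it uses a two-neuron, two-dimensional MLP write with a hand-picked replacement vector, whereas the abstract describes a constrained optimization. All names and numbers are illustrative assumptions.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def mlp_write(activations, output_vectors):
    """Sum of activation * output_vector: what the MLP adds to the residual stream."""
    dim = len(output_vectors[0])
    out = [0.0] * dim
    for a, w in zip(activations, output_vectors):
        for j in range(dim):
            out[j] += a * w[j]
    return out


def logits(residual, unembed):
    """Project the residual stream onto a (toy) vocabulary."""
    return [dot(residual, row) for row in unembed]


activations = [2.0, 0.5]                   # neuron 0 drives the memorized token
output_vectors = [[1.0, 0.0], [0.0, 1.0]]
unembed = [[1.0, 0.0],                     # token 0: memorized continuation
           [0.0, 1.0]]                     # token 1: distractor

base = logits(mlp_write(activations, output_vectors), unembed)

# Zero ablation: neuron 0 no longer engages; its contribution simply vanishes.
ablated = logits(mlp_write([0.0, 0.5], output_vectors), unembed)

# Output-vector edit: activations are untouched, but neuron 0 now writes
# toward the distractor direction (hand-picked here, optimized in the paper).
edited_vectors = [[0.0, 1.0], [0.0, 1.0]]
edited = logits(mlp_write(activations, edited_vectors), unembed)

print(base, ablated, edited)  # [2.0, 0.5] [0.0, 0.5] [0.0, 2.5]
```

In this toy case both interventions remove the memorized token's logit, but only the edit also raises the distractor, which is the redirection the abstract describes.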

2606.18852 2026-06-18 cs.CL cs.AI 新提交

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

对齐隐含陈述：通过上下文边界半硬负挖掘实现隐式仇恨言论的泛化性

Wicaksono Leksono Muhamad, Yunita Sari

发表机构 * Mantera Studio（Mantera工作室）； Universitas Gadjah Mada（加查马达大学）

AI总结提出ImpSH三元组框架，通过将帖子与隐含陈述对齐并使用上下文边界半硬负样本聚焦学习，提升隐式仇恨言论的跨域泛化能力，在多个数据集上优于对比基线。

详情

AI中文摘要

隐式仇恨言论分类仍然是一个挑战，因为意图通常通过暗示和上下文而非明确辱骂来掩盖。先前的监督对比方法改进了域内检测，但可能过拟合表面线索，且难以跨数据集迁移。我们提出ImpSH，一个基于三元组的框架，当隐含陈述可用时将其与帖子对齐，并使用上下文边界半硬负样本将学习聚焦于近混淆项。我们还研究了AugSH，它通过数据增强形成正样本。在使用BERT和HateBERT对IHC、SBIC和DynaHate进行的受控评估中，ImpSH是标准监督对比基线的可行替代方案，并且在匹配的预处理和调优预算下通常能提高跨域性能。使用对齐性和均匀性进行的表示分析表明，正样本对更紧密且全局分布平衡，定性最近邻案例研究展示了域转移下的典型假负例。这些结果表明，通过上下文边界挖掘将帖子与其隐含陈述对齐，提供了到相关暗示的更稳定、类似双射的映射，克服了传统基于聚类的表示学习固有的波动性。

英文摘要

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.
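
As a rough illustration of semi-hard negative selection for a triplet loss, the sketch below uses the standard definition: a negative is semi-hard when it is farther from the anchor than the positive but still inside the margin. The "context-bounded" restriction from the paper is not modelled, because the abstract does not define it; the function names and the Euclidean distance are assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semi_hard_negatives(anchor, positive, negatives, margin):
    """Keep negatives n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_pos = euclidean(anchor, positive)
    return [n for n in negatives if d_pos < euclidean(anchor, n) < d_pos + margin]


def triplet_loss(anchor, positive, negative, margin):
    """Standard hinge-style triplet loss."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)          # embedding of a post
positive = (1.0, 0.0)        # embedding of its implied statement
candidates = [(0.5, 0.0),    # hard: closer than the positive
              (1.5, 0.0),    # semi-hard: farther, but within the margin
              (3.0, 0.0)]    # easy: already beyond the margin

chosen = semi_hard_negatives(anchor, positive, candidates, margin=1.0)
print(chosen)                                                # [(1.5, 0.0)]
print(triplet_loss(anchor, positive, chosen[0], margin=1.0)) # 0.5
```

Semi-hard negatives still give a non-zero loss without being closer than the positive, which is why they tend to focus learning on near confusions.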

2606.18264 2026-06-18 cs.SI cs.AI cs.CL 交叉投稿

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

使用多LLM智能体模拟仇恨言论级联：实证基础、建模保真度与干预策略

Fan Huang

发表机构 * Indiana University Bloomington（印第安纳大学布卢明顿分校）

AI总结本研究通过多LLM智能体系统模拟在线仇恨言论传播，发现其能再现实证数据中的立场单一性和毒性同质性，并通过消融实验识别出智能体异质性为关键保真因素，提出针对密集网络的放大器干预策略。

详情

AI中文摘要

在线平台上仇恨内容传播的忠实建模仍然是内容审核研究中的一个开放问题。经典的级联模型没有明确表示与仇恨内容传播相关的用户画像、社区和内容因素，因此在实际场景中部署时可能产生效果较差的审核策略。多智能体大语言模型系统原则上可以使每次转发决策依赖于用户画像、周围社区和帖子内容，但尚不清楚这种增加的灵活性是否比经典基线更忠实地再现真实的仇恨级联。我们研究了三个仇恨Bluesky级联和一个大小匹配的良性对照。在实证Bluesky数据中，我们发现：97.4--99.7%的转发者采取敌对立场；对于仇恨级联，扩散树上的毒性-参与同质性高于关注图；仇恨级联的拓扑结构是星形（大多数转发直接来自根节点），而良性级联是树形（转发通过多跳链传播）。在模拟中，多LLM智能体模拟器再现了立场单一性和毒性差异方向。结构化消融实验将智能体异质性识别为主要的保真因素，针对密集网络的放大器干预在5.7%良性附带损害下实现了7.5--12.9%的减少。

英文摘要

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4--99.7% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5--12.9% reduction at 5.7% benign collateral.
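
The star-like versus tree-like distinction can be made concrete with two simple statistics on a diffusion tree: the share of reposts that come directly from the root, and the maximum hop depth. This is only a sketch under assumed inputs (a `parent` map from each repost to the post it reshared); it is not the measurement procedure used in the paper.

```python
def cascade_shape(parent, root):
    """Return (fraction of reposts made directly from the root, maximum depth)."""
    depths = {}

    def depth(node):
        if node == root:
            return 0
        if node not in depths:
            depths[node] = 1 + depth(parent[node])
        return depths[node]

    reposts = list(parent)
    direct = sum(1 for n in reposts if parent[n] == root)
    return direct / len(reposts), max(depth(n) for n in reposts)


# Hypothetical cascades: keys are reposts, values are what they reshared.
star_like = {"a": "r", "b": "r", "c": "r", "d": "a"}
chain_like = {"a": "r", "b": "a", "c": "b"}

print(cascade_shape(star_like, "r"))   # (0.75, 2)
print(cascade_shape(chain_like, "r"))
```

A high direct fraction with shallow depth corresponds to the star-like pattern reported for the hateful cascades; a low fraction with larger depth corresponds to multi-hop propagation.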

2606.18383 2026-06-18 cs.LG cs.CL 交叉投稿

基于不确定性感知注意力头的高效大语言模型幻觉检测

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)（穆罕默德·本·扎耶德人工智能大学）； ETH Zurich（苏黎世联邦理工学院）； Independent Researcher（独立研究者）； Applied AI Institute（应用人工智能研究所）

AI总结提出RAUQ框架，利用不确定性感知注意力头与令牌级置信度，通过单次前向传递实现无监督、高效的序列级幻觉检测，在12个数据集上优于现有方法且额外计算少于1%。

详情

Journal ref: Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

AI中文摘要

尽管大型语言模型（LLM）已经变得非常强大，但它们仍然容易出现事实性错误，通常称为“幻觉”。不确定性量化（UQ）为缓解这一问题提供了一种有前景的方法，但大多数现有方法计算量大且/或需要监督。在这项工作中，我们提出了基于循环注意力的不确定性量化（RAUQ），这是一种无监督且高效的幻觉识别框架。该方法利用了Transformer注意力行为的一个观察：当生成错误信息时，某些“不确定性感知”注意力头倾向于减少对前驱令牌的关注。RAUQ自动检测这些注意力头，并以循环方式将其激活模式与令牌级置信度度量相结合，仅通过一次前向传递即可生成序列级不确定性估计。通过在涵盖问答、摘要和翻译的十二个数据集上对九个不同LLM进行的实验，我们表明RAUQ始终优于最先进的UQ基线。重要的是，它产生的开销极小，所需的额外计算不到1%。由于它既不需要标记数据也不需要广泛的参数调整，RAUQ可作为白盒LLM中实时幻觉检测的轻量级即插即用解决方案。

英文摘要

While large language models (LLMs) have become highly capable, they remain prone to factual inaccuracies, commonly referred to as "hallucinations." Uncertainty quantification (UQ) offers a promising way to mitigate this issue, but most existing methods are computationally intensive and/or require supervision. In this work, we propose Recurrent Attention-based Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain "uncertainty-aware" attention heads tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve datasets spanning question answering, summarization, and translation across nine different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it incurs minimal overhead, requiring less than 1% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.

2604.23130 2026-06-18 cs.CL cs.AI 版本更新

通过合成数据蒸馏实现高效金融语言理解

Wen-Fong Huang, Edwin Simpson

发表机构 * School of Engineering Mathematics and Technology（工程数学与技术学院）； University of Bristol（布里斯托大学）

AI总结提出一种在低资源条件下通过合成数据蒸馏进行金融情感分析的框架，利用聚类种子选择生成代表性合成数据，使紧凑模型在少量标注下达到强性能，甚至在某些任务上超越教师模型。

详情

DOI: 10.63317/3b3zxy5qrw8s
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254

AI中文摘要

大型指令跟随模型功能强大但部署成本高昂，尤其在金融领域，标注数据因保密性和专家标注成本而受限。我们提出一种通过合成数据蒸馏进行金融情感分析的高效框架，将知识从大型指令调优教师模型迁移到紧凑的学生模型。该框架专为低资源条件设计，其中收集并手工标注少量真实样本。框架随后对样本进行聚类，并利用聚类结果选择种子，通过结构化少样本提示生成合成样本。实验表明，基于聚类的种子选择比随机采样能生成更具代表性的合成数据，使紧凑模型在极少量监督下实现强性能。值得注意的是，在更复杂且噪声更多的文本领域，基于完整合成种子语料库训练的紧凑模型甚至优于教师模型，同时在正式文本上保持竞争力。该框架为金融NLP中资源高效的领域自适应提供了一条实用途径，且只需最少的人工标注工作。

英文摘要

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

2606.19266 2026-06-18 cs.CL cs.AI 新提交

信任区域在线策略蒸馏

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

发表机构 * Samsung Research（三星研究院）； University of Oxford（牛津大学）； Peking University（北京大学）

AI总结提出信任区域在线策略蒸馏（TrOPD），通过信用分配策略和信任区域学习解决师生分布差异导致的训练不稳定问题，在数学推理、代码生成和通用基准上超越现有方法。

详情

AI中文摘要

在线策略蒸馏（OPD）是大型语言模型（LLM）高效后训练的基本技术，在智能体学习、多任务增强和模型压缩中具有广泛应用。然而，当教师和学生分布差异较大时，OPD训练变得不稳定，因为教师对学生生成token的监督可能产生不可靠的策略梯度，甚至导致优化失败。本文通过信用分配策略解决可靠的在线策略token级监督问题，并提出信任区域在线策略蒸馏（TrOPD）。它具有以下特点：1）信任区域在线策略学习：TrOPD仅在教师提供可靠监督的区域进行OPD，缓解了分布不匹配下K1反向KL估计的优化困难。2）异常值估计：对于异常区域，我们探索梯度裁剪、掩码和前向KL估计，以减少不可靠监督的不利影响。3）离策略引导：学生从教师前缀继续生成，并使用前向KL模仿离策略引导，鼓励向可靠区域进行在线策略探索。实验表明，TrOPD在数学推理、代码生成和通用领域基准上始终优于最先进的OPD基线，包括OPD、EOPD和REOPOLD。

英文摘要

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
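
The K1 reverse-KL estimator mentioned above is commonly computed per sampled token as the student log-probability minus the teacher log-probability. The sketch below shows that estimator together with a simple masking rule standing in for the trust region. The masking criterion (drop tokens whose teacher log-probability falls below a threshold) and the parameter `min_teacher_logp` are assumptions; the abstract does not say how TrOPD decides which regions are reliable.

```python
def k1_reverse_kl(student_logps, teacher_logps, min_teacher_logp=None):
    """Average of (log p_student - log p_teacher) over student-sampled tokens.

    If min_teacher_logp is given, tokens the teacher considers very unlikely
    are masked out (hypothetical stand-in for a trust-region rule).
    """
    terms = []
    for s, t in zip(student_logps, teacher_logps):
        if min_teacher_logp is not None and t < min_teacher_logp:
            continue  # outside the assumed trust region
        terms.append(s - t)
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical per-token log-probabilities for one student rollout.
student = [-0.5, -1.0, -0.2]
teacher = [-0.7, -1.5, -9.0]   # the teacher finds the last token very unlikely

print(k1_reverse_kl(student, teacher))                         # about 3.17
print(k1_reverse_kl(student, teacher, min_teacher_logp=-5.0))  # about 0.35
```

The single outlier token dominates the unmasked estimate, which gives a sense of why supervision on such tokens can destabilise training.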

2606.18520 2026-06-18 stat.ML cs.CG cs.CL cs.DS cs.IR cs.LG 交叉投稿

Compact Geometric Representations of Hierarchies

层次结构的紧凑几何表示

Prashant Gokhale, Piotr Indyk, Yuhao Liu, Sandeep Silwal, Tony Chang Wang, Haike Xu

发表机构 * UW-Madison（威斯康星大学麦迪逊分校）； MIT（麻省理工学院）

AI总结研究如何用低维几何嵌入表示有向无环图中的祖先-后代关系，提出基于树宽等结构参数的维度上界和下界，并在真实数据集上验证了紧凑性。

Comments Published at the 39th Annual Conference on Learning Theory (COLT) 2026. 22 Pages

详情

AI中文摘要

计算数据的几何表示是现代机器学习的基石，通常通过训练双编码器将查询和文档映射到共享嵌入空间来实现。You等人[NeurIPS '25]的最新工作将这种方法扩展到层次检索，其中相关性由有向无环图（DAG）中的祖先-后代关系决定。虽然先前的工作表明当后代数量较少时存在有效嵌入，但这些界限对于深层层次结构会严重退化，所需维度与节点总数相当。在本文中，我们研究了更一般图类的紧凑可达性嵌入，并提供了使用维度依赖于结构图参数的嵌入来表示层次结构的理论保证。我们证明，对于任何有向树，存在常数维度3的可达性嵌入，与树的大小或深度无关。我们将这一结果推广到以树宽$t$为特征的图，构造了维度为$O(t \log n)$的嵌入，其中$n$是节点数。作为这些上界的补充，我们提供了匹配或接近匹配的下界，表明对于一般DAG，维度$\Omega(n)$是必要的，而对于树宽为$t$的图，需要$\Omega(t/\log(n/t))$的维度。我们还获得了由DAG中交叉边数量参数化的上界和下界。此外，我们展示了我们的嵌入可以在真实世界数据集上构建，并且与先前具有理论保证的嵌入相比，在高召回率情况下维度小得多。

英文摘要

Computing geometric representations of data is a cornerstone of modern machine learning, typically achieved by training dual encoders which map queries and documents into a shared embedding space. Recent work of You et al. [NeurIPS '25] has extended this approach to hierarchical retrieval, where relevance is determined by the ancestor-descendant relationships in a Directed Acyclic Graph (DAG). While previous work has shown that valid embeddings exist when the number of descendants is small, these bounds degrade significantly for deep hierarchies, requiring dimensions as large as the total number of nodes. In this paper, we investigate compact reachability embeddings for more general graph classes and provide theoretical guarantees for representing hierarchies using embeddings whose dimension depends on structural graph parameters. We prove that for any directed tree, there exists a reachability embedding in constant dimension 3, independent of the tree's size or depth. We generalize this result to graphs characterized by treewidth $t$, constructing embeddings of dimension $O(t \log n)$, where $n$ is the number of nodes. Complementing these upper bounds, we provide matching or near-matching lower bounds, showing that dimension $\Omega(n)$ is necessary for general DAGs and $\Omega(t/\log(n/t))$ is required for graphs of treewidth $t$. We also obtain upper and lower bounds parameterized by the number of cross-edges in the DAG. We additionally show that our embeddings can be constructed on real world datasets, and that they give much smaller dimensions in high recall regimes compared to prior embeddings with theoretical guarantees.
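
For intuition on why trees admit constant-size reachability labels, the sketch below uses the classic DFS interval labelling: each node gets an (entry, exit) pair, and ancestry becomes an interval-containment test. This is a different and well-known technique, not the paper's 3-dimensional embedding, whose construction and scoring rule are not given in the abstract.

```python
def interval_labels(children, root):
    """Assign each node an (entry, exit) pair from an iterative DFS."""
    entry, exit_ = {}, {}
    clock = 0
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            exit_[node] = clock
            clock += 1
            continue
        entry[node] = clock
        clock += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return {n: (entry[n], exit_[n]) for n in entry}


def is_ancestor(labels, u, v):
    """True when u's interval contains v's (every node is its own ancestor)."""
    (u_in, u_out), (v_in, v_out) = labels[u], labels[v]
    return u_in <= v_in and v_out <= u_out


# Hypothetical hierarchy: root -> {a, b}, a -> {c}
tree = {"root": ["a", "b"], "a": ["c"]}
labels = interval_labels(tree, "root")

print(labels["a"], labels["c"])        # (1, 4) (2, 3)
print(is_ancestor(labels, "a", "c"))   # True
print(is_ancestor(labels, "b", "c"))   # False
```

The label size is two numbers per node regardless of tree size or depth, which mirrors the flavour of the constant-dimension result, although the paper's embedding presumably answers reachability through a geometric score rather than a pair of comparisons.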

2606.18941 2026-06-18 cs.PL cs.CL 交叉投稿

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC：使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester（计算机科学，曼彻斯特大学）； Electrical Engineering, Federal University of Amazonas (UFAM)（电气工程，亚马逊联邦大学（UFAM））

AI总结针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题，提出基于DFS的图形LD解析器，将连接图转换为布尔触点合取，并采用三级I/O推断方案，成功实现完整GOTO IR转换，验证了3个图形LD程序。

Comments 18 pages

详情

AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式：一种使用<rung>元素的文本编码，另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式，但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示，导致空洞的验证成功。本文提出Graph-ESBMC-PLC，通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈，将梯形路径提取为布尔触点合取，并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序，确保SET线圈在RESET线圈之前处理，匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明，所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR，而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留，无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo（DantasCordeiro2026graphical，doi: https://doi.org/10.5281/zenodo.20699856）。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).
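
The resolver's core idea, walking the connection graph from the left power rail to each coil and collecting contacts along the way, can be sketched as follows. The input is a simplified, hypothetical dictionary keyed by localId rather than real PLCopen XML, and the code is not taken from Graph-ESBMC-PLC; it assumes an acyclic graph and ignores negated contacts, SET/RESET ordering, and I/O inference.

```python
def rung_paths(elements):
    """Map each coil name to its rung paths (each path is a contact conjunction).

    elements: {local_id: {"kind": str, "name": str, "refs": [upstream local ids]}}
    Several paths to the same coil are alternatives (a disjunction).
    """
    # refLocalId edges point upstream, so invert them for a forward walk.
    downstream = {eid: [] for eid in elements}
    for eid, el in elements.items():
        for ref in el["refs"]:
            downstream[ref].append(eid)

    result = {}

    def dfs(eid, contacts):
        el = elements[eid]
        if el["kind"] == "contact":
            contacts = contacts + [el["name"]]
        elif el["kind"] == "coil":
            result.setdefault(el["name"], []).append(contacts)
            return
        for nxt in downstream[eid]:
            dfs(nxt, contacts)

    for eid, el in elements.items():
        if el["kind"] == "leftPowerRail":
            dfs(eid, [])
    return result


# Hypothetical rung: (Start OR Hold) AND NotStop -> Motor
elements = {
    1: {"kind": "leftPowerRail", "name": "", "refs": []},
    2: {"kind": "contact", "name": "Start", "refs": [1]},
    3: {"kind": "contact", "name": "Hold", "refs": [1]},
    4: {"kind": "contact", "name": "NotStop", "refs": [2, 3]},
    5: {"kind": "coil", "name": "Motor", "refs": [4]},
}

print(rung_paths(elements))
# {'Motor': [['Start', 'NotStop'], ['Hold', 'NotStop']]}
```

Each inner list would become one Boolean conjunction in the intermediate representation; an empty result is the vacuous case the abstract says the earlier tool produced for graphical exports.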

2606.19121 2026-06-18 cs.SE cs.CL cs.HC 交叉投稿

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

由AI编写，由AI管理：跨越391个连续会话的语义空间控制与索引病消除

Hui Zhang, Shuren Song

发表机构 * Shenzhen Yunxi Technology Co., Ltd.（深圳云曦科技有限公司）； Information Technology Center, Tsinghua University（清华大学信息化技术中心）

AI总结本文通过真实软件项目中的行动研究，发现长期LLM协作中增加形式约束反而导致“索引病”，提出“基线-日志物理分离”机制，有效消除该问题。

Comments 22 pages, 2 tables, 1 figure. Action research. Bilingual submission (Chinese companion version included as supplementary). Submitted to ICSE 2027 IOR track

详情

AI中文摘要

解决长期LLM协作中概念漂移的主流工程直觉是，用更多的形式约束换取更可靠的输出——设计符号标识符系统，在系统提示中积累防御规则，扩展上下文窗口。我们的工程记录表明，在长期设置中，这种方向可能产生与设计意图相反的效果。通过在跨越约一个月和391个协作会话的真实软件项目（Bang-v3）中使用行动研究方法，我们记录并分析了这些策略的失败过程。当符号系统超过复杂度阈值时，LLM并不会变得更准确——相反，它们放弃了对业务语义的真正理解，退回到符号层内的自我指涉推理，并生成看似内部一致但实际上与现实脱节的输出。我们将这种失败模式命名为“索引病”，其典型表现为“幻影立法”。我们将底层原理命名为“庞原理（语义活力定律）”：带有明确目的的自然语言传达的信息质量远高于符号表达。由此，我们设计并验证了其物理工程机制：“基线-日志物理分离”。在同一项目中，该机制将AI指令量减少了约75%，并且在随后的约150个会话中，未观察到索引病复发。附有双语对照版本（中文）作为补充材料。

英文摘要

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

2503.01805 2026-06-18 cs.LG cs.AI cs.CL 版本更新

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

图任务算法推理中Transformer的深度-宽度权衡

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

发表机构 * Courant Institute of Mathematical Sciences, New York University（纽约大学柯朗数学科学研究所）； Google Research（谷歌研究院）； Meta AI ； Bar-Ilan University（巴伊兰大学）； Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University（特拉维夫大学生物医学工程系、Edmond J. Safra生物信息学中心）； Tel Aviv University（特拉维夫大学）

AI总结研究Transformer在图算法任务中深度与宽度的权衡，发现线性宽度下常数深度足以解决许多图问题，而某些问题需要二次宽度，实验验证了宽模型在保持精度的同时训练和推理更快。

Comments Updated ISF grant number

详情

AI中文摘要

Transformer已经彻底改变了机器学习领域。特别是，它们可用于解决复杂的算法问题，包括基于图的任务。在此类算法任务中，一个关键问题是能够实现该任务的Transformer的最小规模是多少。最近的工作开始针对图任务探索这个问题，表明对于次线性嵌入维度（即模型宽度），对数深度就足够了。然而，一个尚未解决的问题（也是我们在此研究的问题）是：如果允许宽度线性增长而深度保持固定，会发生什么。我们分析了这种设置，并得出了一个令人惊讶的结果：在线性宽度下，常数深度足以解决一系列基于图的问题。这表明适度增加宽度可以换来浅得多的模型，这在推理和训练时间方面是有利的。对于其他问题，我们证明需要二次宽度。我们的结果展示了用Transformer实现图算法的复杂而有趣的格局。我们通过实验研究了深度与宽度相对能力之间的这些权衡，并找到了一些任务：在这些任务上，宽模型与深模型准确度相同，同时得益于可并行化的硬件，训练和推理速度快得多。

英文摘要

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL 版本更新

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo：通过神经符号推理解决立体几何问题

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

发表机构 * Xi’an Jiaotong-Liverpool University（西交利物浦大学）； Ricoh Software Research Center Beijing Co.,Ltd（理光软件研究所（北京）有限公司）

AI总结提出Hilbert-Geo框架和Parse2Reason方法，利用条件描述语言和定理库实现立体几何问题的严格推理，在SolidFGeo2k和MathVerse-Solid上达到SOTA性能。

Comments Computer Vision and Pattern Recognition (CVPR), 2026

详情

AI中文摘要

几何问题求解作为一种典型的多模态推理问题，近年来受到广泛关注并取得了很大进展，然而大多数工作集中于平面几何，面对立体几何时往往因三维空间图形和复杂推理而失败。为弥补这一差距，我们引入了Hilbert-Geo，这是第一个用于立体几何的统一形式语言框架，包括一个广泛的谓词库和一个专用的定理库。基于该框架，我们提出了Parse2Reason方法，包含先解析、后推理两个步骤。在解析步骤中，我们利用条件描述语言（CDL），一种由专门用于构建几何条件的谓词组成的形式化语言，来表示问题描述（自然文本）和立体图（视觉图像）。在推理步骤中，我们利用这些形式化CDL和定理库进行关系推理和代数计算，生成严格正确、可验证且人类可读的推理过程。值得注意的是，我们提出的Hilbert-Geo也适用于平面几何。为推进几何推理，我们构建了两个专家标注的数据集SolidFGeo2k和PlaneFGeo3k，它们配有几何形式语言标注、解答和答案。大量实验表明，我们提出的方法在SolidFGeo2k上达到77.3%的最先进（SOTA）性能，在MathVerse-Solid（MathVerse中专用于立体几何的一个小子集）上达到84.1%，显著优于领先的多模态大语言模型，如Gemini-2.5-pro（在SolidFGeo2k上为54.2%）和GPT-5（在MathVerse-Solid上为62.9%）。此外，我们的方法在PlaneFGeo3k上达到80.2%的SOTA准确率，展示了Hilbert-Geo在几何推理中的通用性。我们的代码和数据集已发布于 https://github.com/PremiLab-Math/Hilbert-Geo。

英文摘要

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently; however, most works focus on plane geometry and usually fail on solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method consisting of two steps: parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual images). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2602.20135 2026-06-18 cs.CL cs.AI cs.IR 版本更新

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

发表机构 * University of Tehran（德黑兰大学）； Independent Researcher（独立研究员）； Amirkabir University of Technology（阿米尔卡比尔理工大学）； TEIAS Institute（TEIAS研究所）

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

2508.20275 2026-06-18 cs.LG cs.CL q-bio.QM 版本更新

A Systematic Review on the Generative AI Applications in Human Medical Genomics

Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov

发表机构 * Dpt. of Genomic Medicine（基因组医学系）； D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology（D.O. Ott妇产科与生殖医学研究所）

Comments 31 pages, 5 figures

详情

DOI: 10.3389/fgene.2025.1694070
Journal ref: Frontiers in Genetics 16 (2026) 1694070

英文摘要

Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.


2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE Updated version

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

Affiliations * Yokohama National University

Comments Accepted to ACL 2025 Findings

2504.12347 2026-06-18 cs.CL cs.AI cs.CY Updated version

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

Affiliations * Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

2410.03151 2026-06-18 cs.CL cs.SI Updated version

Media Framing through the Lens of Event-Centric Narratives

Rohan Das, Aditya Chandra, I-Ta Lee, Maria Leonor Pacheco

Affiliations * University of Colorado Boulder; Independent Researcher

Comments Accepted to the 6th Workshop on Narrative Understanding, co-located with EMNLP 2024

2206.05018 2026-06-18 cs.SD cs.CL eess.AS Updated version

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer, Sebastian P. Bayerl

Comments Accepted at the 25th International Conference on Text, Speech and Dialogue (TSD 2022)

1. Large Language Models and Foundation Models (24 papers)

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

LLM Parameters for Math Across Languages: Shared or Separate?

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Dual Dimensionality for Local and Global Attention

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

GraphPO: Graph-based Policy Optimization for Reasoning Models

Enhancing Multilingual Reasoning via Steerable Model Merging

Sumi: Open Uniform Diffusion Language Model from Scratch

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Attention as Frustrated Synchronization

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Structured Inference with Large Language Gibbs

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Improve Large Language Model Systems with User Logs

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

2. Machine Translation and Cross-Lingual Processing (1 paper)

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

3. Information Extraction, Retrieval, and Question Answering (11 papers)

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

BCL: Bayesian In-Context Learning Framework for Information Extraction

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

MemRerank: Preference Memory for Personalized Product Reranking

RCEM: Robust Conversational Search EMbedder in Distributional Shift

4. Dialogue Systems and Agents (13 papers)

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

VISUALSKILL: Multimodal Skills for Computer-Use Agents

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Learning User Simulators with Turing Rewards

CEO-Bench: Can Agents Play the Long Game?

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

PatchWorld: Gradient-Free Optimization of Executable World Models

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

5. Text Generation, Summarization, and Editing (5 papers)

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

6. Semantics, Syntax, and Linguistic Analysis (5 papers)

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Approximate Structured Diffusion for Sequence Labelling

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

7. Multimodal Language Processing (7 papers)

Continuous Audio Thinking for Large Audio Language Models

Native Active Perception as Reasoning for Omni-Modal Understanding

UniECG: Understanding and Generating ECG in One Unified Model

LVLMs and Humans Ground Differently in Referential Communication

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Would you still call this Dax? Novel Visual References in VLMs and Humans

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

8. Joint Speech-Language and Audio-Text Processing (7 papers)

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition