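arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.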

2606.13940 2026-06-15 cs.CL New submission

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

Affiliations: Northwestern University

AI summary: Comparing prompting, supervised fine-tuning, and reinforcement learning (GRPO and the newly proposed PHI curriculum) on ICD coding, the study finds that post-training, especially SFT and GRPO, substantially improves the coding performance of generative LLMs, suggesting that the bottleneck lies in model adaptation rather than in the generative paradigm itself.

Abstract

Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.

2606.14243 2026-06-15 cs.CL New submission

Decoupled Mixture-of-Experts for Parametric Knowledge Injection

Baoqing Yue, Weihang Su, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the Decoupled Mixture-of-Experts (DMoE) architecture, which converts external knowledge into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts when the base model lacks sufficient knowledge, achieving parameter-level knowledge augmentation and outperforming retrieval and adapter baselines on knowledge-intensive benchmarks.

Abstract

Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge. Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.

2605.13217 2026-06-15 cs.CL cs.AI cs.LG New submission

GAGPO: Generalized Advantage Grouped Policy Optimization

Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Meituan

AI summary: GAGPO is a reinforcement learning method that needs no value model. It uses a grouped value proxy and action importance ratios to achieve precise temporal credit assignment in multi-turn tasks, and experiments show it outperforms existing baselines on ALFWorld and WebShop.

2606.13862 2026-06-15 cs.LG cs.AI cs.CL Cross-list

SuperThoughts: Reasoning Tokens in Superposition

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

Affiliations: University of Wisconsin-Madison; Microsoft Research; Independent; Princeton University; Rice University

AI summary: Proposes SuperThoughts, which compresses pairs of consecutive CoT tokens into a single latent representation and decodes them with a multi-token prediction module, doubling inference throughput while preserving training supervision and achieving roughly 20-30% CoT length reduction with minimal accuracy loss.

Abstract

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves roughly 20-30% CoT length reduction while maintaining accuracy with minimal degradation (1-2 points accuracy drop on most tasks).

2606.14368 2026-06-15 cs.LG cs.CL Cross-list

Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Woohyeon Byeon, Jiwon Jeon, Jeonghye Kim, Youngchul Sung

Affiliations: KAIST

AI summary: Proposes On-Policy Co-Distillation (OPCoD), which uses cognitive gating and feedback anchoring to enable mutual teaching between two models that are strong in different domains, achieving a Pareto improvement.

2606.14672 2026-06-15 cs.AI cs.CL Cross-list

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes Parallel-Synthesis, a framework that synthesizes directly from the KV caches of parallel worker agents and so avoids the redundancy of text concatenation. It matches or outperforms text-based synthesis on seven of nine datasets, stays close on the other two, and reduces time-to-first-token by 2.5-11x.

Abstract

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

2505.23277 2026-06-15 cs.CL cs.AI Updated version

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Yong Zhang, Heng Li, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao

Affiliations: Ping An Technology (Shenzhen) Co., Ltd., China; University of Science and Technology of China; University of Electronic Science and Technology of China

AI summary: Proposes Sentinel, a lightweight sentence-level compression framework that decodes inference-time context utilization from the head-wise attention patterns of frozen LLMs and compresses with a single non-autoregressive forward pass. On LongBench, a 0.5B proxy model achieves up to 5x compression with question-answering performance comparable to methods built on 7B-scale models.

Comments: Preprint

Abstract

Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Existing context compression methods typically rely on heuristic relevance estimation or supervised compression models rather than on how LLMs utilize retrieved context during inference. We propose Sentinel, a lightweight sentence-level compression framework that decodes inference-time contextual utilization behaviors from head-wise attention patterns of frozen LLMs. To ground supervision in retrieval-dependent answering behavior, Sentinel trains a lightweight probe using QA examples where the model succeeds only when retrieved context is available. Sentinel performs compression using only a single non-autoregressive forward pass without dedicated compression training or autoregressive scoring. Empirically, we find that effective contextual utilization signals remain accessible even in compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5x compression while attaining question-answering performance competitive with compression methods built on 7B-scale models. Despite being trained only on English QA data, Sentinel also generalizes effectively to Chinese and out-of-domain settings.

2601.22954 2026-06-15 cs.CL cs.AI Updated version

Residual Context Diffusion Language Models

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

Affiliations: University of California, Berkeley

AI summary: Proposes the Residual Context Diffusion (RCD) module, which recycles contextual residuals from discarded tokens to improve the decoding efficiency of diffusion language models, raising accuracy by 4-11 percentage points on long- and short-CoT tasks with minimal extra computation.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.

2603.23530 2026-06-15 cs.CL cs.AI cs.LG Updated version

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

Affiliations: University of Washington

AI summary: Viewing the problem through prospective memory from cognitive psychology, the study finds that LLM compliance with formatting instructions drops by 2-21% under concurrent task load, and proposes a salience-enhanced format that recovers much of the lost compliance.

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective memory inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.

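2605.21363 2026-06-15 cs.CL Updated version

"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu

Affiliations: KAIST; Carnegie Mellon University; Seoul National University

AI summary: Proposes CoTrace, a framework for measuring and exposing goal-level AI contributions in collaboration. It finds that models contribute little to goal shaping but play a substantial role in introducing concrete requirements and exerting indirect influence, and that interaction design affects model behavior.

AI abstract

As large language models (LLMs) increasingly influence how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes essential both for users calibrating their own reliance and for evaluators assessing AI-assisted work. However, existing approaches focus on final artifacts and overlook the process by which the goals themselves are co-shaped. We introduce CoTrace, a goal-level attribution framework that decomposes explicit goals into verifiable requirements and traces direct contributions and indirect influence across conversation turns. Applying CoTrace to 638 real-world collaboration logs, we find that although models contribute only 11-26% of goal shaping, they contribute substantially by introducing lower-level concrete requirements and make various indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect models' goal-shaping behavior. In a user study, exposing participants to goal-level analysis raised their perception of contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.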

Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Affiliations: [cs.AI] (Computer Science, Artificial Intelligence)

AI summary: Proposes FusionRoute, a framework in which a lightweight router selects the most suitable expert at each decoding step and contributes a complementary logit that refines the next-token distribution. It targets the difficulty of achieving strong performance across many domains with a single general-purpose model, and outperforms other approaches on multiple benchmarks.

Comments: 25 pages

Abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
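
The two router roles described in the abstract, expert selection and logit addition, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the argmax selection rule, and the toy numbers are assumptions made for illustration, and the router outputs are taken as given rather than learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_next_token_distribution(expert_logits, router_scores, complementary_logits):
    """Pick one expert for this decoding step, then add the router's logits.

    expert_logits: one list of vocabulary logits per expert.
    router_scores: one score per expert (assumed to come from the router).
    complementary_logits: the router's own vocabulary logits (assumed given).
    """
    # (i) Select the expert with the highest router score (assumed rule).
    expert = max(range(len(router_scores)), key=lambda i: router_scores[i])
    # (ii) Refine the selected expert's logits by element-wise addition.
    fused = [e + c for e, c in zip(expert_logits[expert], complementary_logits)]
    return expert, softmax(fused)


# Hypothetical two-expert, three-token example.
expert, probs = fused_next_token_distribution(
    expert_logits=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    router_scores=[0.1, 0.9],
    complementary_logits=[0.0, 0.0, 3.0],
)
```

In this toy case the second expert is selected and prefers token 1, but the complementary logit shifts the most likely token to token 2. This is the kind of correction that routing over fixed expert outputs alone cannot make.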

2602.04879 2026-06-15 cs.LG cs.AI cs.CL Updated version

Rethinking the Trust Region in LLM Reinforcement Learning

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Affiliations: University of California, Berkeley; University of Toronto

AI summary: To address the training instability that PPO shows in LLM fine-tuning because of large vocabularies, proposes DPPO, which constrains updates directly on policy divergence, and introduces efficient approximations.

Abstract

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid huge memory footprint, we introduce the efficient Binary and Top-K approximations to capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
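
The gap between the sampled-token ratio and the true policy divergence can be seen in a small numeric sketch. This illustrates the argument only and is not the DPPO algorithm or its Binary and Top-K approximations. The three-token vocabulary and its probabilities are hypothetical.

```python
def sampled_ratio(p_old, p_new, token):
    """Probability ratio of one sampled token, the quantity PPO clips."""
    return p_new[token] / p_old[token]


def total_variation(p_old, p_new):
    """Total variation distance between two next-token distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_old, p_new))


# Hypothetical three-token vocabulary, for illustration only.
p_old = [0.98, 0.01, 0.01]
p_new = [0.96, 0.03, 0.01]

rare_ratio = sampled_ratio(p_old, p_new, token=1)    # about 3.0
common_ratio = sampled_ratio(p_old, p_new, token=0)  # about 0.98
tv = total_variation(p_old, p_new)                   # about 0.02
```

The distribution as a whole moved by a total variation of only about 0.02. If the rare token happens to be sampled, its ratio of about 3 falls far outside a typical clipping range, while the common token's ratio of about 0.98 looks harmless. The single-sample ratio is therefore a noisy proxy for how far the policy actually moved.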

2602.14169 2026-06-15 cs.LG cs.AI cs.CL Updated version

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu

Affiliations: Institute of Software, Chinese Academy of Sciences; City University of Hong Kong; Baidu

AI summary: To address inefficient exploration in reinforcement learning for large language models, proposes Deep Dense Exploration (DDE), which identifies recoverable pivot states in failed trajectories and resamples densely around them, combined with a dual-stream optimization objective. It outperforms existing methods on mathematical reasoning benchmarks.

Abstract

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on pivots: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO

2604.18419 2026-06-15 cs.LG cs.CL stat.ML Updated version

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini

Affiliations: Hebrew University of Jerusalem

AI summary: Proposes a principled rule for dynamic abstention within a regularized reinforcement learning framework, deciding whether to terminate reasoning early by comparing the value function with the abstention reward. It outperforms existing methods on mathematical reasoning and toxicity avoidance tasks.

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

Abstract

LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
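
The abstention rule stated in the abstract, stop as soon as the value function falls below the abstention reward, reduces to a threshold test at each token position. The sketch below assumes the per-position value estimates are already available. How the paper approximates them is not shown, and the function name and numbers are hypothetical.

```python
def first_abstention_step(value_estimates, abstention_reward):
    """Return the first token position at which to abstain, or None.

    value_estimates: hypothetical per-position estimates of the value of
        continuing the reasoning trace (their computation is not shown).
    abstention_reward: the reward granted for abstaining.
    """
    for position, value in enumerate(value_estimates):
        # Abstain once continuing is worth less than stopping.
        if value < abstention_reward:
            return position
    return None


# Hypothetical trace whose estimated value decays as the reasoning goes wrong.
trace_values = [0.62, 0.55, 0.41, 0.18, 0.09]
stop_at = first_abstention_step(trace_values, abstention_reward=0.25)
```

Raising `abstention_reward` makes the rule stop earlier and save more compute. Lowering it lets more traces run to completion, which is the trade-off the abstract attributes to this parameter.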

2604.21335 2026-06-15 cs.LG cs.CL Updated version

Sub-Token Routing for KV Cache Compression

Wei Jiang, Wei Wang

Affiliations: Futurewei Technologies

AI summary: Proposes sub-token routing, which groups the value vectors inside retained tokens and selectively keeps groups. It complements token-level compression and improves compression performance in both LLMs and VLMs.

Comments: 17 pages, 8 tables, 2 figures

Abstract

Transformer inference often requires a large KV cache, especially for long-context language modeling and multimodal generation. Existing compression methods usually reduce cache cost by selecting, evicting, quantizing, or compressing cached tokens, or by reducing the visual-token sequence before language-model inference. We introduce sub-token routing, a KV-compression method that adds a finer control axis inside retained tokens. It splits each retained value vector into groups and keeps only selected groups, while leaving query and key states unchanged. The method is designed to work after token-level reduction. First, a token-reduction method determines which tokens are retained. Then, sub-token routing compresses the value states inside those retained tokens. Experiments under matched KV budgets show that adding sub-token routing improves token-level reduction performance in both LLM and VLM settings, including Quest on LLaMA-2-7B and Qwen2.5-7B, and FastV/VisionZip across LLaVA and Qwen-VL models. The gains are larger at smaller KV budgets, suggesting that value-group routing is especially useful when further token removal becomes costly. Overall, token-level reduction and sub-token routing provide complementary ways to reduce KV cost.
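
The core operation, splitting a retained value vector into groups and keeping only some of them, can be sketched in a few lines. The abstract does not say how groups are selected, so ranking groups by squared L2 norm is an assumption made here for illustration, as are the function name and the toy vector.

```python
def route_value_groups(value, num_groups, keep):
    """Split one value vector into equal groups and keep `keep` of them.

    Returns the indices of the kept groups and the kept group contents.
    Selection by squared L2 norm is an illustrative assumption, not the
    paper's routing rule.
    """
    if len(value) % num_groups != 0:
        raise ValueError("vector length must be divisible by num_groups")
    size = len(value) // num_groups
    groups = [value[i * size:(i + 1) * size] for i in range(num_groups)]
    energy = [sum(x * x for x in group) for group in groups]
    ranked = sorted(range(num_groups), key=lambda i: energy[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [groups[i] for i in kept]


# Hypothetical 8-dimensional value vector split into 4 groups of 2.
kept, compressed = route_value_groups(
    [0.1, 0.2, 3.0, 4.0, 0.0, 0.0, 1.0, 1.0], num_groups=4, keep=2
)
```

Here groups 1 and 3 are kept, so only half of the value state is stored for this token. Query and key states are untouched, which matches the abstract's description that only value states are compressed.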

2606.01476 2026-06-15 cs.LG cs.CL Updated version

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Affiliations: Meta AI

AI summary: Proposes OmniOPD, a framework that replaces token-level logit matching with chunk-level semantic similarity based on Monte Carlo rollouts, combined with a peak-entropy scheduler and a Bayesian prior, to address unavailable logits and brittle signals in on-policy distillation. It surpasses standard OPD by up to 28.64% on math tasks.

Comments: 26 pages, 3 figures

Abstract

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

2606.03085 2026-06-15 cs.LG cs.CL Updated version

Multi-component Causal Tracing in Large Language Models

Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

Affiliations: Rensselaer Polytechnic Institute; IBM Research

AI summary: Proposes a unified framework that uses soft interventions and a metric transformation to efficiently identify the subsets of components most critical to a target performance metric, outperforming existing baselines.

Comments: Accepted to ACL 2026 main conference

Abstract

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

2606.12476 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

幻觉起始的快速检测：延迟界与学习型CUSUM统计量

Igor Itkin

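Affiliations: Independent Researcher

AI summary: Casts hallucination onset detection as a quickest change detection problem. Using a first-order Markov model validated on RAGTruth, a learned CUSUM statistic detects the onsets it catches within 11-13 tokens at a matched false-alarm rate, faster than a linear baseline, and the analysis exposes delay structure that classification metrics conceal.

Comments 16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)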

AI中文摘要

Token级幻觉检测器通常作为分类器、以所有token上的AUC进行评估，但流式监控器是由其反应时间来评判的：从幻觉开始到发出警报之间经过的token数量。我们将幻觉起始检测表述为一个快速变化检测问题。在RAGTruth上验证的潜在忠实/幻觉状态的一阶马尔可夫模型，将该任务置于经典变点理论之中，并给出Lorden检测延迟下界：在虚警率为0.01时约为1.3个token。然后我们证明，因果循环标注器相当于一个具有学习增量的CUSUM。在其捕获到的起始点中，它在11-13个token内完成检测，而线性每token基线为31个token；不过在该虚警预算下，每个检测器捕获的起始点都不到三分之一，计入召回后的延迟为56-66个token：低虚警率下的起始检测很困难。受控分解将速度优势主要归因于更好的每token得分，而非时间累积。Donsker-Varadhan型的信息率最优性定理解释了剩余的数量级差距：学习得分仅实现了特征所携带散度的1/4.5，这一缺陷无法通过重新校准消除，其余部分为有限时域效应。分类指标掩盖了这种延迟结构；序列分析使其可测量。

英文摘要

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
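
The quickest-change-detection framing in this abstract can be illustrated with the classical CUSUM recursion, S_t = max(0, S_{t-1} + z_t), which raises an alarm once S_t reaches a threshold. The sketch below is only that textbook recursion, not the paper's learned labeler: the per-token increments, the threshold, and the onset position are hypothetical values chosen for illustration.

```python
from typing import Optional, Sequence


def cusum_alarm(increments: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first index t at which the CUSUM statistic reaches `threshold`.

    The statistic follows S_t = max(0, S_{t-1} + z_t) with S_{-1} = 0.
    Returns None if no alarm fires.
    """
    stat = 0.0
    for t, z in enumerate(increments):
        stat = max(0.0, stat + z)
        if stat >= threshold:
            return t
    return None


def detection_delay(alarm: Optional[int], onset: int) -> Optional[int]:
    """Tokens between the true onset and the alarm; None for a miss or a false alarm."""
    if alarm is None or alarm < onset:
        return None
    return alarm - onset


# Hypothetical per-token scores: negative before the onset at index 5, positive after it.
scores = [-0.5] * 5 + [0.8] * 10
alarm = cusum_alarm(scores, threshold=2.0)
delay = detection_delay(alarm, onset=5)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off the abstract measures in tokens.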

2606.11898 2026-06-15 cs.CL cs.LG 版本更新

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 面向文本属性图与LLM的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

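Affiliations: Peking University; National University of Singapore; University of California, Berkeley

AI summary: Proposes GraspLLM, a framework that combines graph-structure understanding with the semantic capabilities of LLMs, using motif-aware contrastive learning and optimal context-subgraph alignment to achieve zero-shot generalization across datasets and tasks.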

AI中文摘要

CoRe：一种持续奖励微调的LLM查询重写器，用于大规模视频搜索中的多阶段上下文感知相关性

Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang, Chunhui Liu, Jeffrey Chen, Zeyu Ma, Lisong Qiu, Xiaochuan Fan, Congjia Yu, Quan Zhou, Yuheng Chen, Zian Wang

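Affiliations: TikTok

AI summary: Proposes CoRe, an LLM query rewriter redeployed weekly that uses semi-online Mixed Preference Optimization and a multiplicative-ratio reward, yielding a statistically significant reduction in change-query rate and improved relevance metrics in multi-stage video search.

Comments 12 pages, 3 figures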

AI中文摘要

生产环境中的基于LLM的查询重写器面临一个矛盾：训练奖励必须反映重写结果如何被生产排序器使用，但训练过程必须足够廉价以支持随着数据漂移而持续重新部署。我们提出了CoRe（上下文相关性）系统，该系统在主要短视频搜索引擎中每周重新部署，已超过五个月。我们的奖励使用部署的多模态相关性模型作为其来源，并采用乘法比率形式，镜像生产融合代数，从而缩小了离线奖励代理留下的模拟-生产差距。半在线混合偏好优化循环使得这种奖励在每周数百万实例规模下可行：一个DPO风格的成对目标将梯度传递限制在采样轨迹的小型top-k/bottom-k子集，并且一个阶段结构将训练器/推理服务器参数同步从每步减少到每阶段。一个基于奖励和稳定性指标的自动升级门检测并恢复了一次生产中的真实奖励黑客事件。重写器输出作为并行相关性信号在召回、原始排序和精细排序阶段被消费，而不取代原始信号，从而限制了重写器故障的影响范围。来自两个连续生产发布的在线A/B测试，首先在精细排序阶段部署重写器，然后扩展到召回和原始排序阶段，在受重写影响的查询上实现了统计显著的变更查询率降低，所有主要相关性和参与度指标均朝着预期方向移动。

英文摘要

LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.
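
The abstract describes a DPO-style pairwise objective restricted to a small top-k/bottom-k subset of sampled trajectories. A minimal sketch of that idea follows, assuming the standard DPO loss form; the rewards, log-probabilities, `beta`, and the rule of pairing the best sample with the worst are illustrative assumptions, and the production reward model is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def select_pairs(rewards: Sequence[float], k: int) -> List[Tuple[int, int]]:
    """Pair the k highest-reward samples with the k lowest-reward samples.

    Assumes 1 <= k and 2 * k <= len(rewards), so no sample sits on both sides.
    """
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    winners = order[:k]
    losers = order[-k:][::-1]  # worst sample first
    return list(zip(winners, losers))


def dpo_pairwise_loss(
    policy_logp: Sequence[float],
    ref_logp: Sequence[float],
    pairs: Sequence[Tuple[int, int]],
    beta: float = 0.1,
) -> float:
    """Mean of -log(sigmoid(beta * margin)) over (winner, loser) index pairs."""
    total = 0.0
    for w, l in pairs:
        margin = (policy_logp[w] - ref_logp[w]) - (policy_logp[l] - ref_logp[l])
        total += math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))
    return total / len(pairs)


rewards = [0.9, 0.1, 0.5, 0.7]          # hypothetical rewards for 4 sampled rewrites
ref_logp = [-4.0, -4.0, -4.0, -4.0]     # hypothetical reference-model log-probabilities
policy_logp = [-3.0, -5.0, -4.0, -4.0]  # hypothetical policy log-probabilities
pairs = select_pairs(rewards, k=1)
loss = dpo_pairwise_loss(policy_logp, ref_logp, pairs)
```

Only the selected pairs contribute to the loss, so the gradient pass touches 2k samples regardless of how many rewrites were sampled.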

2606.14269 2026-06-15 cs.IR cs.CL 交叉投稿

ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

ScoreGate: 通过双分数统计融合实现检索增强生成的自适应分块选择

Karamvir Singh, Arvind Jain

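Affiliations: HighLevel, Inc.

AI summary: Addresses the over- and under-retrieval caused by fixed-K retrieval with ScoreGate, which uses bi-encoder similarity and cross-encoder reranker scores to adapt the number of retrieved chunks without extra model calls. On MS MARCO it reaches MRR@10 = 0.401 with 35% fewer retained chunks, and on an internal benchmark it observed zero false positives.

Comments 20 pages, 6 figures, 14 tables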

AI中文摘要

固定基数检索无论查询复杂度如何，都向生成器注入恒定数量的top-K块，导致窄查询的过检索和组合查询的欠检索。我们描述了ScoreGate，一种轻量级的分数空间决策机制，利用标准流程中已经产生的两个分数——双编码器相似度s_i和交叉编码器重排序分数r_i，在推理时控制检索基数，无需额外的模型推理调用。其核心见解是，交叉编码器的确认可以挽救因词汇不匹配而被双编码器检索排名较低的语义相关块——这是固定K或单分数阈值化未能解决的一种失败模式。在MS MARCO（200个开发查询）上，ScoreGate以比标准Top-K少35%的保留块实现了MRR@10 = 0.401。在一个内部基准测试（n=300，Fleiss' kappa=0.87）上，ScoreGate在97.77-99.34%的召回率下观察到零假阳性（95% CI [96.4%, 100%]），每个查询的令牌数减少34.8%，仅增加31ms延迟。在MS MARCO和实际生产流量上的结果表明，自适应检索基数可以在不降低检索质量的情况下提高检索效率。

英文摘要

Fixed-cardinality retrieval injects a constant top-K chunks into the generator regardless of query complexity, causing over-retrieval for narrow queries and under-retrieval for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores already produced by the standard pipeline: bi-encoder similarity s_i and cross-encoder reranker score r_i, with no additional model inference calls required. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that bi-encoder retrieval ranks poorly due to vocabulary mismatch -- a failure mode unaddressed by fixed-K or single-score thresholding. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 with 35% fewer retained chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss' kappa=0.87), ScoreGate observed zero false positives (95% CI [96.4%, 100%]) at 97.77-99.34% recall, with 34.8% fewer tokens per query and only 31ms added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.
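
To make the dual-score idea concrete, here is a minimal sketch of a gate that keeps a chunk when both scores agree, or when a strong cross-encoder score "rescues" a chunk the bi-encoder ranked poorly. The abstract does not give ScoreGate's actual fusion rule, so the thresholds `s_min`, `r_min`, and `r_rescue` and the decision rule itself are assumptions. The MRR@k helper follows the usual definition of the metric.

```python
from typing import List, Sequence


def gate_chunks(
    s: Sequence[float],
    r: Sequence[float],
    s_min: float = 0.5,     # hypothetical bi-encoder similarity threshold
    r_min: float = 0.5,     # hypothetical reranker threshold
    r_rescue: float = 0.9,  # hypothetical "rescue" threshold on the reranker alone
) -> List[int]:
    """Return indices of retained chunks; the count adapts to the scores."""
    kept = []
    for i, (s_i, r_i) in enumerate(zip(s, r)):
        agree = s_i >= s_min and r_i >= r_min
        rescued = r_i >= r_rescue  # cross-encoder affirmation despite a low s_i
        if agree or rescued:
            kept.append(i)
    return kept


def mrr_at_k(ranked_relevance: Sequence[Sequence[bool]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, hit in enumerate(flags[:k], start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


kept = gate_chunks(s=[0.8, 0.2, 0.6, 0.3], r=[0.7, 0.95, 0.1, 0.2])
mrr = mrr_at_k([[False, True], [True], [False, False]])
```

In this toy input, chunk 1 has low bi-encoder similarity but is retained because the reranker score clears the rescue threshold, while chunk 2 is dropped despite a decent similarity.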

2410.15051 2026-06-15 cs.CL cs.LG 版本更新

Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing

通过弱监督自然语言处理自动识别出院信中的诊断

Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva

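Affiliations: University of Bologna

AI summary: Proposes a weakly supervised NLP pipeline that classifies diagnoses in Italian discharge letters without document-level annotation, approaching fully supervised performance on a bronchiolitis dataset while saving more than 1,500 hours of manual annotation.

Comments 61 pages, 9 figures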

DOI: 10.1038/s41598-026-56721-0

AI中文摘要

从医院出院信中识别患者诊断对于大规模队列选择和流行病学研究至关重要，但传统的监督方法需要大量手动标注，这对于大型文本数据集通常不切实际。我们提出了一种弱监督自然语言处理（NLP）流程，用于对意大利语出院信进行分类，无需文档级手动标注。该方法提取与诊断相关的句子，使用在意大利医学文档上进一步预训练的Transformer模型生成语义嵌入，并应用两级聚类程序推导出弱标签，然后用于训练文档级分类器。该方法在2017年至2020年间意大利威尼托地区44个急诊室或医院收治的33,176份儿童出院信的细支气管炎案例研究中进行了评估。最佳弱监督模型在手动标注数据上实现了77.68%（±4.30%）的AUROC、73.13%（±4.93%）的AUPRC和78.14%（±4.89%）的F1分数。性能超过了无监督基线，接近全监督模型，同时对于该规模的数据集减少了超过1,500小时的手动标注需求。在较小的支气管炎数据集（3,188份出院信，2020-2025年）的二次验证中观察到类似的模型排名，最佳弱监督模型实现了76.72%（±5.02%）的AUPRC。这些结果表明弱监督NLP方法在从临床出院信中可扩展地识别疾病方面具有潜力。

英文摘要

Identifying patient diagnoses from hospital discharge letters is essential for large-scale cohort selection and epidemiological research, but traditional supervised approaches require extensive manual annotation, which is often impractical for large textual datasets. We present a weakly supervised Natural Language Processing (NLP) pipeline for classifying Italian discharge letters without document-level manual annotation. The method extracts diagnosis-related sentences, generates semantic embeddings using a transformer model further pre-trained on Italian medical documents, and applies a two-level clustering procedure to derive weak labels that are then used to train a document-level classifier. The approach was evaluated in a case study on bronchiolitis using 33,176 discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region, Italy, between 2017 and 2020. The best weakly supervised model achieved an AUROC of 77.68% ($\pm4.30\%$), an AUPRC of 73.13% ($\pm4.93\%$), and an F1-score of 78.14% ($\pm4.89\%$) against manually annotated data. Performance surpassed unsupervised baselines and approached fully supervised models, while reducing the need for manual annotation by more than 1,500 hours for a dataset of this size. Similar model rankings were observed in a secondary validation on a smaller bronchitis dataset (3,188 discharge letters, 2020-2025), where the best weakly supervised model achieved an AUPRC of 76.72% ($\pm 5.02\%$). These results suggest the potential of weakly supervised NLP methods for scalable disease identification from clinical discharge letters.

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG 版本更新

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG: 在多样模态和粒度的语料库上实现检索增强生成

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

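Affiliations: KAIST

AI summary: Proposes UniversalRAG, a retrieval-augmented generation framework that handles multiple modalities and granularities. Modality-aware routing and multi-granularity organization improve cross-modal knowledge retrieval, and experiments on 10 multimodal benchmarks show it outperforms modality-specific and unified baselines.

Comments ACL 2026. Project page : https://universalrag.github.io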

AI中文摘要

检索增强生成（RAG）通过将外部相关知识与查询绑定，显著提升了事实准确性。然而，现有方法多局限于文本语料，尽管最近有尝试扩展到图像、视频等模态，但通常仅针对单一模态语料。相比之下，现实中的查询所需知识类型多样，单一知识源无法满足。为此，我们引入UniversalRAG，一种any-to-any RAG框架，旨在从异构源中检索和整合多样模态和粒度的知识。具体而言，受强制所有模态进入单一聚合语料的统一表示空间导致模态间隙的观察启发，我们提出模态感知路由，动态识别最合适的模态特定语料并执行针对性检索，并通过理论分析证明其有效性。此外，除模态外，我们对每个模态组织为多个粒度层级，实现针对查询复杂性和范围的精细检索。我们验证UniversalRAG在10个多种模态基准上的性能，显示其优于各种模态特定和统一基线。

英文摘要

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.

2505.04671 2026-06-15 cs.CL cs.LG 版本更新

Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards

Reward-SQL：通过逐步执行感知推理和过程监督奖励提升Text-to-SQL

Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Guoliang Li, Bin Wu, Wenchao Zhou

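Affiliations: Renmin University of China; Tsinghua University; Alibaba Cloud Computing

AI summary: To address the lack of stepwise execution-aware reasoning and process-level rewards in reinforcement learning for Text-to-SQL, proposes the CoCTE framework and the Reward-SQL method, which use intermediate view validation, structured CTEs, and a process reward model to improve accuracy and interpretability on complex queries.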

AI中文摘要

最近，使用强化学习（RL）训练的大型语言模型（LLMs）的进展提高了Text-to-SQL的性能。然而，基于RL的方法仍然在处理复杂查询时面临两个关键限制：缺乏基于数据库反馈的逐步执行感知推理，以及缺乏用于指导推理优化的过程级奖励。为了解决这些问题，我们提出了CoCTE，一种分治且执行感知的推理框架，通过中间视图验证和结构化公共表表达式（CTEs）逐步组合SQL查询，提高了准确性和可解释性。为了实现CoCTE推理过程，我们开发了Reward-SQL，一种统一的方法，包含三个阶段：（1）模型初始化，使LLMs具备结构化CoCTE推理能力；（2）过程奖励设计，提供细粒度的、执行感知的监督；（3）过程监督的RL和推理，将过程奖励整合到训练中，并通过过程奖励指导推理阶段。本文解决了Reward-SQL中的核心挑战，并做出了以下贡献。我们引入了一个过程奖励模型（PRM），它将执行感知的轨迹评分与基于熵的步骤加权相结合，在推理步骤中提供密集且可解释的监督。我们将PRM集成到RL训练和推理阶段，稳定优化并通过过程级信号改进轨迹探索。实验表明，Reward-SQL在可比模型大小下显著优于基线，并表现出强大的跨领域泛化能力。

英文摘要

Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries due to two key limitations: insufficient stepwise execution-aware reasoning grounded in database feedback, and the lack of process-level rewards for guiding reasoning optimization. To address these issues, we propose CoCTE, a divide-and-conquer and execution-aware reasoning framework that progressively composes SQL queries through intermediate view validation and structured Common Table Expressions (CTEs), improving both accuracy and interpretability. To realize a CoCTE reasoning process, we develop Reward-SQL, a unified approach with three stages: (1) model initialization, which equips LLMs with structured CoCTE reasoning capabilities; (2) process reward design, which delivers fine-grained, execution-aware supervision; and (3) process-supervised RL and inference, which integrates process rewards into training and guides the inference stage by process rewards. This paper addresses the core challenges in Reward-SQL and makes the following contributions. We introduce a process reward model (PRM) that combines execution-aware trajectory scoring with entropy-based step weighting, providing dense and interpretable supervision across reasoning steps. We integrate PRM into both RL training and inference stages, stabilizing optimization and improving trajectory exploration with process-level signals. Experiments show that Reward-SQL significantly outperforms baselines with comparable model sizes, and exhibits strong cross-domain generalization.
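
The idea of composing a query from Common Table Expressions and validating each intermediate view against the database can be sketched with Python's built-in `sqlite3` module. This is an illustration of stepwise CTE execution only, not the CoCTE or Reward-SQL implementation; the `orders` table, the two steps, and the helper names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 30.0), (2, 'bob', 5.0), (3, 'ann', 20.0);
    """
)

# Each step is (cte_name, body); later steps may reference earlier names.
steps: List[Tuple[str, str]] = [
    ("totals", "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"),
    ("big_spenders", "SELECT customer FROM totals WHERE total > 40"),
]


def compose(steps: List[Tuple[str, str]], upto: int) -> str:
    """Build a query whose WITH clause holds the first `upto` steps and selects the last one."""
    ctes = ",\n".join(f"{name} AS ({body})" for name, body in steps[:upto])
    return f"WITH {ctes}\nSELECT * FROM {steps[upto - 1][0]}"


def run_stepwise(conn: sqlite3.Connection, steps: List[Tuple[str, str]]) -> list:
    """Execute each prefix of the CTE chain; stop at the first step the database rejects."""
    feedback = []
    for n in range(1, len(steps) + 1):
        name = steps[n - 1][0]
        try:
            rows = conn.execute(compose(steps, n)).fetchall()
            feedback.append((name, True, rows))
        except sqlite3.Error as exc:
            feedback.append((name, False, str(exc)))
            break
    return feedback


feedback = run_stepwise(conn, steps)
```

Each entry in `feedback` is a per-step execution signal (rows or an error message), the kind of database feedback that a process-level reward could score.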

2603.16073 2026-06-15 cs.CL 版本更新

ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

ClaimFlow: 追踪自然语言处理中科学主张的演变

Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych

Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technische Universität Darmstadt; IT:U Interdisciplinary Transformation University Austria; National Research Council Canada

AI summary: Proposes ClaimFlow, built by annotating 5,689 claims and 4,871 cross-paper relations in 1,617 papers, defines the Claim Relation Classification task, and analyzes how claims evolve in NLP.

AI中文摘要

科学论文提出$\textit{主张}$，后续工作支持、扩展或有时反驳这些主张。然而，现有的引文和主张分析方法仅捕捉到这一对话的片段。在这项工作中，我们在单个科学主张的层面上使这些交互变得明确。我们引入$\texttt{ClaimFlow}$，一种以主张为中心的NLP文献视图，基于$1{,}617$篇ACL文集论文$(1979-2025)$构建，这些论文被手动标注了$5{,}689$个主张和$4{,}871$个跨论文主张关系，指示引用论文是$\texttt{支持}$、$\texttt{扩展}$、$\texttt{限定}$、$\texttt{反驳}$还是将引用的主张作为$\texttt{背景}$引用。基于$\texttt{ClaimFlow}$，我们定义了一个新任务——$\textit{主张关系分类}$——要求模型从文本和引文上下文中推断对引用的主张的科学立场。评估神经网络模型和大语言模型在该任务上的表现，我们报告了$0.81$宏F1的基线性能，表明该任务是可处理的，同时仍有改进空间。然后，我们将该框架扩展到约$13$k篇NLP论文，以研究跨越数十年NLP研究的主张演变。我们显示$63.5\%$的主张从未被重复使用；只有$11.1\%$曾受到挑战。广泛传播的主张更常通过限定和扩展$\textit{重塑}$，而非被支持或反驳。总体而言，$\texttt{ClaimFlow}$提供了一个审视NLP中思想如何转变和成熟的视角。

英文摘要

Scientific papers advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $1{,}617$ ACL Anthology papers $(1979 - 2025)$ that are manually annotated with $5{,}689$ claims and $4{,}871$ cross-paper claim relations, indicating whether a citing paper $\texttt{supports}$, $\texttt{extends}$, $\texttt{qualifies}$, $\texttt{refutes}$, or references a cited claim as $\texttt{background}$. Building on $\texttt{ClaimFlow}$, we define a new task -- $\textit{Claim Relation Classification}$ -- which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating neural models and large language models on this task, we report baseline performance of $0.81$ macro-F1, suggesting that the task is tractable while leaving room for improvement. We then scale this framework to $\sim$$13k$ NLP papers to study claim evolution across decades of NLP research. We show that $63.5\%$ claims are never reused; only $11.1\%$ are ever challenged. Widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than supported or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP.
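
The baseline above is reported in macro-F1, which averages per-class F1 so that rare relations such as refutations count as much as frequent ones. A small sketch of the metric follows; the gold and predicted labels are made-up examples, not data from ClaimFlow.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1; a label with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical annotations for four citing sentences.
gold = ["supports", "extends", "refutes", "supports"]
pred = ["supports", "supports", "refutes", "supports"]
score = macro_f1(gold, pred, labels=["supports", "extends", "refutes"])
```

Here accuracy is 0.75, but the missed `extends` example pulls macro-F1 down to 0.6, which shows why the metric suits imbalanced relation types.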

2606.06794 2026-06-15 cs.CL cs.IR 版本更新

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

TA-RAG: 面向同伴支持健康沟通的语气感知检索增强生成

Yong-Bin Kang, Anthony McCosker

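Affiliations: Swinburne University of Technology

AI summary: Proposes TA-RAG, which embeds tone control (stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing) into a RAG pipeline through lightweight prompting, without model fine-tuning, to improve the quality of sensitive health communication.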

AI中文摘要

检索增强生成（RAG）成功地将大型语言模型（LLM）的输出建立在可信文档上，但仅靠事实依据不足以支持敏感的同伴健康沟通。在HIV同伴支持等领域，回复还必须易于理解、无污名化、富有同理心并针对接收者定制。本文提出TA-RAG，一个轻量级的、基于提示的语气感知RAG框架，它将明确的语气控制嵌入到RAG管道中，无需模型微调。我们通过四个核心组件来操作化语气：无污名化改写、可读性调整、接收者适应和同理心重述。我们使用来自澳大利亚HIV在线学习（HOLA）、UNAIDS术语指南、可读性指标、澳大利亚HIV感染者协会（NAPWHA）的同伴支持标准以及公共同理心数据集的问题，通过组件级测试评估TA-RAG。结果表明，TA-RAG的组件在保留关键内容的同时，提高了其目标沟通质量。这些发现强调，基于提示的语气控制是使RAG输出适用于敏感同伴支持健康沟通的一个潜在方向。

英文摘要

Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from the National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that TA-RAG's components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

2604.23336 2026-06-15 cs.IR cs.CL cs.LG 版本更新

Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

高效基于理由的检索：基于JEPA的生成重排序器的在线蒸馏

Teng Chen, Sheng Xu, Feixiang Guo, Xiaoyu Wang, Qingqing Gu, Hongyan Li, Luo Ji

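Affiliations: Geely AI Lab

AI summary: Proposes Rabtriever, which learns from a generative reranker via on-policy distillation and encodes queries and documents independently, improving retrieval efficiency while performing well across multiple tasks.

Comments 11 pages, 8 figures. ICMR 2026 (https://youtu.be/apDcrzEVwq4)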

DOI: 10.1145/3805622.3810780

AI中文摘要

不同于传统基于事实的检索，基于理由的检索通常需要使用大语言模型对查询-文档对进行跨编码，造成显著的计算成本。为解决这一限制，我们提出了Rabtriever，它独立编码查询和文档，同时提供与重排序器相当的跨查询-文档理解能力。我们从训练一个基于LLM的生成重排序器开始，该重排序器将文档置于查询之前，并提示LLM通过对数概率生成相关性分数。然后将其作为在线蒸馏框架的教师，Rabtriever作为学生重建教师的上下文感知查询嵌入。为此，Rabtriever首先从教师中初始化，参数冻结。然后采用联合嵌入预测架构（JEPA）范式，该范式在LLM层和头部之间集成一个轻量级、可训练的预测器，将查询嵌入投影到新的隐藏空间，文档嵌入作为潜在向量。JEPA然后最小化此投影嵌入与教师嵌入的分布差异。为了增强在线蒸馏的采样效率，我们还添加了对LLM logits的反向KL的辅助损失，以重塑学生的logit分布。Rabtriever将教师在文档长度上的二次复杂度降低为线性，这一点在理论和实验上均得到验证。实验表明，Rabtriever在多种基于理由的任务中优于不同的检索器基线，包括共情对话和机器人操作，且相比重排序器仅有轻微的准确率下降。Rabtriever在传统检索基准如MS MARCO和BEIR上也表现良好，性能与最佳检索器基线相当。

英文摘要

Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents, while providing comparable cross query-document comprehension capabilities to rerankers. We start by training an LLM-based generative reranker, which places the document before the query and prompts the LLM to generate the relevance score by log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student to reconstruct the teacher's context-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student's logit distribution. Rabtriever reduces the teacher's quadratic complexity in document length to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.
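
The auxiliary loss mentioned above is a reverse KL divergence on logits, that is, KL(student || teacher), where the expectation is taken under the student's own distribution. The following sketch computes it for a single pair of logit vectors in plain Python; the example logits are hypothetical, and how the paper batches or weights this term is not specified in the abstract.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax using the max-shift trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) = sum_i p_s(i) * (log p_s(i) - log p_t(i))."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


# Hypothetical logits over a two-token vocabulary.
student = [2.0, 0.0]
teacher = [0.0, 0.0]
rkl = reverse_kl(student, teacher)   # expectation under the student
fkl = reverse_kl(teacher, student)   # same formula with roles swapped, i.e. the forward KL
```

The two directions give different values, which is why the choice of reverse KL matters: it penalizes the student for placing mass where the teacher has little.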

2606.13945 2026-06-15 cs.CL 新提交

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

MedLatentDx：用于跨医院罕见病诊断的潜在多智能体通信

Ziqing Wang, Lili Zhao, Kaize Ding

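Affiliations: Northwestern University

AI summary: Proposes MedLatentDx, a latent multi-agent communication framework for cross-hospital rare-disease diagnosis that transmits compact latent KV blocks to protect privacy, supports both same-backbone and cross-family LLM deployments, and on CrossRare-Bench improves diagnostic performance while reducing reconstructable clinical content.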

AI中文摘要

罕见病影响超过3亿患者，涉及7000多种疾病，但没有任何一家医院能遇到足够多的病例来进行可靠诊断。跨医院合作可以通过允许诊断机构使用分布式、病例特定的诊断证据来提供帮助，但隐私法规限制可识别的临床文本跨机构传输。这种情况带来了两个挑战：现有的医疗智能体系统通常依赖文本证据交换，而原始潜在状态（如隐藏状态和KV缓存）仍可能泄露提示衍生的临床内容。我们引入了MedLatentDx，一个潜在多智能体通信框架，其中医院智能体将私有临床记录和检索到的病例保留在本地，并向主机智能体发送紧凑的潜在KV块以进行罕见病诊断。MedLatentDx支持两种部署设置：相同骨干的医院智能体使用潜在KV蒸馏，而具有不同LLM骨干的医院使用跨家族潜在对齐。在CrossRare-Bench（一个自建的大规模罕见病基准，具有医院级别分区）上，MedLatentDx提高了跨医院诊断性能，同时相对于原始潜在通信基线减少了可重建的临床内容。

英文摘要

Rare diseases affect over $300$ million patients across more than $7{,}000$ conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

2606.14179 2026-06-15 cs.CL 新提交

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

CacheRL: 通过缓存轨迹和混合奖励实现多轮工具调用智能体

Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim

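Affiliations: Center for Advanced AI Accenture

AI summary: Proposes CacheRL, which combines a hybrid thinking trajectory pipeline, a three-tier fuzzy cache, and a cache-tier-aware reward to reach 92% process accuracy on multi-step tool calling with 100 times less compute, approaching GPT-5's 94%.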

AI中文摘要

我们提出了CacheRL，一个用于训练小型智能体基础模型的系统，在多步工具调用任务上实现了92%的过程准确率，接近GPT-5的94%，同时所需计算量减少100倍。我们的方法解决了实际智能体训练中的三个挑战：大规模从大模型迁移工具调用知识、无需昂贵实时工具执行的强化学习、以及从有噪声的缓存环境中稳健学习。CacheRL引入了三项关键创新。首先，混合思维轨迹流水线通过LLM生成的推理轨迹增强智能体轨迹，产生不仅教授模型调用什么工具还教授为什么调用的训练样本。其次，CacheAgentLoop通过三级模糊缓存消除了实时执行成本，同时使用token级掩码保持轨迹保真度。第三，缓存层级感知奖励动态调整答案质量权重，以避免因缓存引起的限制而惩罚模型。通过迭代监督微调（SFT）和组相对策略优化（GRPO），CacheRL将Qwen3-4B-Thinking的验证奖励从0.43提升到0.78。在公开的智能体工具调用基准测试中，我们的模型达到了与GPT-5等前沿模型竞争的性能。消融研究表明，移除知识迁移会使性能降低41%，而缓存感知奖励贡献了17%的提升。有趣的是，强化学习提高了训练稳定性，但在强监督微调基础上带来的提升有限，这表明在构建实用的小型智能体模型时，数据质量和奖励设计比复杂的优化方法更重要。

英文摘要

We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
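
The abstract names a three-tier fuzzy cache and a cache-tier-aware reward but does not define the tiers. The sketch below assumes one plausible reading (tier 1 exact match, tier 2 fuzzy match by string similarity, tier 3 miss) and a reward that shrinks the answer-quality weight as cache fidelity drops. The class, the threshold, and the weights are all hypothetical and are not taken from CacheRL.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


class TieredToolCache:
    """Toy cache keyed by the serialized tool call (assumed tiers: exact, fuzzy, miss)."""

    def __init__(self, fuzzy_threshold: float = 0.8) -> None:
        self.store: Dict[str, str] = {}
        self.fuzzy_threshold = fuzzy_threshold

    def put(self, call: str, result: str) -> None:
        self.store[call] = result

    def get(self, call: str) -> Tuple[int, Optional[str]]:
        if call in self.store:  # tier 1: exact hit
            return 1, self.store[call]
        best_key, best_ratio = None, 0.0
        for key in self.store:  # tier 2: closest stored call by similarity
            ratio = SequenceMatcher(None, call, key).ratio()
            if ratio > best_ratio:
                best_key, best_ratio = key, ratio
        if best_key is not None and best_ratio >= self.fuzzy_threshold:
            return 2, self.store[best_key]
        return 3, None  # tier 3: miss, no cached result available


# Hypothetical weights: trust answer quality less when the tool output was approximate.
ANSWER_WEIGHT = {1: 1.0, 2: 0.5, 3: 0.0}


def tier_aware_reward(process_score: float, answer_score: float, tier: int) -> float:
    """Weighted mean of process and answer scores, with the answer weight set by tier."""
    w = ANSWER_WEIGHT[tier]
    return (process_score + w * answer_score) / (1.0 + w)


cache = TieredToolCache()
cache.put('search(query="weather in Paris")', "sunny")
exact = cache.get('search(query="weather in Paris")')
fuzzy = cache.get('search(query="weather in paris")')
miss = cache.get("calc(1+1)")
```

With these weights, a rollout that made the right calls but received no cached answer (tier 3) is scored on process alone, so it is not penalized for a limitation of the cache.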

2606.14302 2026-06-15 cs.CL 新提交

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

回顾性进度感知的LLM智能体训练自我精炼

Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

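Affiliations: Shanghai Jiao Tong University; OPPO Research Institute

AI summary: Proposes RePro, a framework that trains agents to self-generate progress signals through a forward-then-reflect rollout paradigm without continuous external supervision, improving Qwen-family models by up to 12% absolute success rate on WebShop, ALFWorld, and Sokoban.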

AI中文摘要

基于强化学习训练的LLM智能体优化逐步动作预测，但缺乏对任务进度的元认知意识，导致阻碍长期扩展的差距。一项初步研究表明，在线进度提示会损害性能，而回顾性演示有帮助，但这种能力无法仅从结果奖励训练中涌现。我们提出RePro，即回顾性进度感知训练，一种通过前向-反思滚动范式训练智能体自我生成进度信号的框架：智能体在线执行动作，然后根据完成的轨迹和已知结果回顾性地重新评估其逐步进度。RePro通过回顾性热身初始化，从最少的外部演示中教授反思格式，然后通过RePro-PO进一步训练，使用复合奖励产生自我生成的信号，无需持续的外部监督。在WebShop、ALFWorld和Sokoban上的实验表明，RePro提升了Qwen系列的性能，绝对成功率提升高达12%。

英文摘要

LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to $12\%$ absolute success rate gains.

2606.14674 2026-06-15 cs.CL 新提交

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

AgentSpec: 通过受控组合理解具身智能体脚手架

Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

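Affiliations: University of California, San Diego; Johns Hopkins University; University of Washington; University of Illinois Urbana-Champaign

AI summary: Proposes AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components. Standardized interfaces allow controlled component swapping and recombination, revealing that scaffold compatibility and interaction effects govern performance.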

AI中文摘要

LLM智能体越来越多地构建为脚手架系统，而非单一模型调用，这些系统结合了推理、记忆、反思、动作执行和学习。虽然此类脚手架通常能提升性能，但它们往往嵌入在紧密耦合的流水线中，使得难以隔离组件贡献、比较替代设计或理解模块交互如何塑造智能体行为。我们引入AgentSpec，一个模块化规范框架，将具身智能体表示为具有标准化接口的可复用策略组件的类型化组合。AgentSpec标准化了感知、记忆、推理、反思、动作和可选学习之间的接口，使得组件能够在受控条件下被交换和重组。我们在DeliveryBench、ALFRED、MiniGrid和RoboTHOR上实例化该框架，并分析了跨模型骨干的推理、记忆、反思和强化学习模块。我们的结果表明，智能体性能由脚手架兼容性和交互效应主导，而非孤立模块强度。特别是，结构化多粒度记忆改善了长程状态跟踪，推理和记忆在不同环境中非均匀交互，反思在纠正和成本之间权衡，而RL训练的策略在与部署时脚手架结构共同优化时组合最佳。AgentSpec为研究、比较和设计可组合的LLM智能体提供了受控基础。我们的代码、基线和交互式游乐场在此https URL公开。

英文摘要

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec-embodied.github.io.
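
The notion of typed compositions with standardized interfaces can be sketched with Python `Protocol` classes: any component that satisfies the interface can be swapped in while the rest of the agent stays fixed. The interfaces and component names below are hypothetical stand-ins and do not reflect AgentSpec's actual specification format.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Memory(Protocol):
    def write(self, observation: str) -> None: ...
    def read(self) -> List[str]: ...


class Reasoner(Protocol):
    def decide(self, observation: str, context: List[str]) -> str: ...


class FullHistory:
    """Memory variant that returns every past observation."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def write(self, observation: str) -> None:
        self.items.append(observation)

    def read(self) -> List[str]:
        return list(self.items)


class LastK(FullHistory):
    """Memory variant that returns only the k most recent observations."""

    def __init__(self, k: int) -> None:
        super().__init__()
        self.k = k

    def read(self) -> List[str]:
        return self.items[-self.k:]


class EchoReasoner:
    """Placeholder for a model call; reports how much context it received."""

    def decide(self, observation: str, context: List[str]) -> str:
        return f"act({observation}|ctx={len(context)})"


@dataclass
class Agent:
    memory: Memory
    reasoner: Reasoner

    def step(self, observation: str) -> str:
        action = self.reasoner.decide(observation, self.memory.read())
        self.memory.write(observation)
        return action


full = Agent(FullHistory(), EchoReasoner())
short = Agent(LastK(1), EchoReasoner())
full_actions = [full.step(o) for o in ["a", "b", "c"]]
short_actions = [short.step(o) for o in ["a", "b", "c"]]
```

Because only the memory component differs between `full` and `short`, any difference in their actions can be attributed to that one swap, which is the kind of controlled comparison the abstract describes.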

2606.13683 2026-06-15 cs.AI cs.CL 交叉投稿

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

UP-NRPA：基于用户画像的嵌套 rollout 策略自适应，用于目标导向对话系统中与大语言模型的规划

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

发表机构 * School of Artificial Intelligence, Anhui University（安徽大学人工智能学院）； Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University（安徽大学安徽省安全人工智能重点实验室）； Pengcheng Laboratory（鹏城实验室）

AI总结提出 UP-NRPA 在线框架，利用大语言模型和用户画像实时自适应调整对话策略，无需离线强化学习，在协作与非协作对话基准中实现 100% 任务成功率，谈判任务销售列表比提升 56.41%。

详情

AI中文摘要

为了解决当前对话策略规划方法难以动态适应不同用户特征的挑战，本文提出了一种基于用户画像的嵌套 rollout 策略自适应（UP-NRPA）在线框架，结合大语言模型。与传统依赖模型训练并为用户群体离线学习强化学习策略模型的方法不同，UP-NRPA 通过自适应机制实现对话策略的动态定制。这是通过利用实时用户反馈以及从当前用户画像映射出的个性、偏好和目标来实现的，从而无需离线强化学习即可适应用户特征。在协作和非协作对话基准测试中，UP-NRPA 展现了显著优势，在多个对话任务中实现了令人印象深刻的 100% 成功率。特别是在谈判任务中，销售列表比（SL）提高了 56.41%。这表明 UP-NRPA 无需训练机制即可适应多样化的用户需求，使对话系统能够适应用户特征。

英文摘要

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches that depend on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

2606.13715 2026-06-15 cs.AI cs.CL cs.MA 交叉投稿

WorkBench Revisited: Workplace Agents Two Years On

WorkBench 再探：两年后的工作场所智能体

Olly Styles

发表机构 * GitHub

AI总结本文重新评估2024至2026年间WorkBench基准上智能体的进展，发现前沿模型在能力和安全性上均有显著提升，但开放权重模型降低了高性能门槛。

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

详情

AI中文摘要

2024年3月，WorkBench上表现最好的智能体GPT-4完成了43%的任务，并在26%的任务中采取了意外的有害行为（例如给错误的人发送电子邮件）。我们在2026年6月重新审视该基准，发现迄今为止最好的智能体Claude Opus 4.8完成了89%的任务，并仅在2.5%的任务中采取了意外的有害行为。除了前沿智能体性能的显著进步外，有三点值得注意。首先，在WorkBench上，能力与安全性是相辅相成的，而非相互权衡，因此完成最多任务的模型造成的意外损害也最少。其次，虽然几类错误已被完全消除，但前沿模型仍然会犯一些基本错误，有时会导致不可逆转的损害，例如将电子邮件发送给错误的人。第三，开放权重模型的兴起大幅降低了此前仅专有模型才能达到的性能水平的成本，而前沿模型的成本则保持相对稳定。我们发布了该基准的更新版本，包括数据与代码质量改进、新的模型评分以及自2024年以来WorkBench上智能体进展的分析。

英文摘要

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.14155 2026-06-15 cs.LG cs.CL 交叉投稿

EurekAgent：自主科学发现中，智能体环境工程即一切

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； Zhipu AI（智谱AI）

AI总结提出环境工程框架EurekAgent，通过权限、工件、预算和人机交互四维工程设计，在数学、内核工程和机器学习任务上取得新最优结果，总API成本低于11美元。

详情

AI中文摘要

基于LLM的智能体在自动化科学发现方面展现出日益增长的潜力。给定一个可优化的度量和执行环境，它们可以提出、验证和迭代科学解决方案，并已产生超越人类设计方法的结果。随着模型能力的持续提升，我们认为自主科学发现的瓶颈正从规定智能体工作流程转向设计智能体环境：即塑造智能体行为的资源、约束和接口。我们将此框架化为环境工程：构建能够放大生产性行为（如开放式探索、系统化工件管理和智能体间协作）同时抑制有害行为（如奖励黑客和高摩擦人工监督）的环境。我们提出了EurekAgent，一个用于度量驱动自主科学发现的环境工程智能体系统。EurekAgent从四个维度进行环境工程：权限工程用于受限智能体执行和隔离评估；工件工程用于基于文件系统和Git的协作；预算工程用于预算感知探索；人机交互工程用于便捷的人工监督和干预。EurekAgent在多个数学、内核工程和机器学习任务上取得了新的最优结果，包括以不到11美元的总API成本发现新的26圆填充最优结果。我们开源了代码和结果，并呼吁将环境工程作为开发可靠自主研究智能体的核心研究方向。

英文摘要

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

2606.13993 2026-06-15 cs.CL 新提交

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

文本和音频语言模型中动词+up短语的整体存储

Zachary Nicholas Houghton, Yu Zhou, Dan Pluth, Vijay K. Gurbani

发表机构 * University of Oregon（俄勒冈大学）； Vail Systems, Inc（Vail Systems公司）

AI总结研究文本和音频语言模型对动词+up短语的整体存储，发现频率和可预测性驱动独立表征，支持基于使用的语言理论。

2606.14512 2026-06-15 cs.CL cs.AI 新提交

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Fodor和Pylyshyn的系统性挑战依然存在

Michael Goodale, Salvador Mascarenhas

发表机构 * Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University（让·尼科研究所，ENS认知科学系，EHESS，CNRS，PSL大学）

AI总结本文通过实验证明，Lake和Baroni的元学习组合协议模型在分布外和分布内问题上均表现不佳，未能满足Fodor和Pylyshyn对神经网络系统性提出的挑战。

Comments Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

详情

AI中文摘要

神经网络近期在生成类人语言方面的成功在认知科学领域引起了巨大轰动，许多研究者认为，关于人类认知的经典难题以及对人工智能的挑战正被神经网络解决。一个显著的例子是Jerry Fodor和Zenon Pylyshyn提出的系统性论证，该论证认为人类表现出系统性的双条件依赖关系。例如，某人能理解句子“John saw Mary”当且仅当能理解句子“Mary saw John”。符号系统解释了这种语言和思维的系统性，而神经网络则没有提供直接的解释。最近几篇文章声称这一挑战已被神经网络解决。特别是，Brenden Lake和Marco Baroni认为他们的元学习组合协议匹配并可能解释了人类的系统性。我们证明这些结论为时过早。在其他结果中，我们发现他们的模型难以学习与训练数据分布稍有差异的规则。此外，即使在许多分布内问题上，模型的行为也是非系统性的。我们得出结论，Fodor和Pylyshyn对神经网络的挑战仍未得到满足。

英文摘要

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which argues that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we find that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

2509.24102 2026-06-15 cs.CL 版本更新

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

道德推理习得的语用推理：通过元语用链接实现泛化

Guangliang Liu, Xi Chen, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Johnson

发表机构 * Indiana University Indianapolis（印第安纳大学印第安纳波利斯分校）； Nanyang Technological University（南洋理工大学）； University of Mississippi（密西西比大学）； Northeastern University（东北大学）； Qualcomm（高通公司）； Michigan State University（密歇根州立大学）

AI总结针对大语言模型在道德推理中泛化能力不足的问题，提出基于元语用链接和道德基础理论的语用推理方法，使模型获取道德推理目标与社会变量间的元语用链接，在三个任务上验证了其适应性和泛化性。

2606.14691 2026-06-15 cs.CL 新提交

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CORA: 通过一致性导向的推理对齐分析与弥合多模态RLVR中的思考-答案差距

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Wuhan University（武汉大学）； Tsinghua University（清华大学）； Tianjin University（天津大学）

AI总结本文分析多模态RLVR中思考与答案的语义不一致问题，提出CORA方法，通过轻量级一致性奖励模型引入语义一致性，并采用混合奖励优势分裂稳定优化，提升推理忠实度。

Comments Submitted to EMNLP 2026

详情

AI中文摘要

可验证奖励的强化学习（RLVR）已成功激发大语言模型的推理能力，推动其向多模态场景扩展。现有方法主要关注提升推理轨迹的视觉覆盖和缓解视觉幻觉，但低估了推理过程与最终答案之间的语义不一致性。本文深入研究了大型视觉语言模型（LVLMs）中RLVR的思考-答案不一致性，通过对组相对策略优化（GRPO）训练过程中收集的轨迹以及RLVR后评估输出的分析，表明该问题在训练期间持续存在，并在推理时仍然存在。受此分析启发，我们提出一致性导向的推理对齐（CORA），通过轻量级即插即用的一致性奖励模型将思考-答案语义一致性引入RLVR，并进一步结合混合奖励优势分裂（HRAS）以稳定协调任务和一致性优化。在代表性多模态推理基准和主流LVLMs上的大量实验表明，CORA在提升任务性能的同时有效缓解了思考-答案不一致性，从而产生更忠实的推理轨迹。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present during inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.

2606.13707 2026-06-15 cs.AI cs.CL cs.CV 交叉投稿

Orchestra-o1: Omnimodal Agent Orchestration

Orchestra-o1: 全模态智能体编排

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

发表机构 * CUHK（香港中文大学）； LIGHTSPEED ； PKU（北京大学）； THU（清华大学）； Tongji University（同济大学）

AI总结提出Orchestra-o1全模态智能体编排框架，通过统一编排机制实现模态感知任务分解、在线子智能体专业化和并行子任务执行，在OmniGAIA基准上准确率超第二名10.3%，并引入DA-GRPO强化学习方法训练Orchestra-o1-8B达到开源全模态智能体最优性能。

详情

AI中文摘要

近期智能体集群的成功将基于大语言模型（LLM）的智能体从单智能体工作流范式转向多智能体系统，凸显了智能体编排在任务分解与协作中的重要性。然而，现有编排框架局限于狭窄的模态集合，难以泛化到异构模态共存并交互的更复杂场景。这种局限性在全模态场景中尤为突出，此类任务需要对文本、图像、音频和视频等多样化输入进行统一理解与协调。在本工作中，我们提出Orchestra-o1，一种全模态智能体编排框架，旨在支持跨多种模态的高效智能体协作。Orchestra-o1引入统一编排机制，实现模态感知任务分解、在线子智能体专业化和并行子任务执行。这种可扩展设计使智能体系统能够有效处理涉及异构信息源的复杂现实任务，在OmniGAIA基准上超越第二名方法10.3%的准确率。此外，我们提出决策对齐群体相对策略优化（DA-GRPO），一种高效的智能体强化学习方法，用于训练Orchestra-o1-8B，该方法在所有现有开源全模态智能体中取得了最先进性能。

英文摘要

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

2606.14072 2026-06-15 cs.CV cs.CL 交叉投稿

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

扩散细化分割与视觉-语言解释用于儿童脑肿瘤MRI

Wentao Ke, Jianche Liu

发表机构 * Department of Mechanical Engineering, Stanford University（斯坦福大学机械工程系）； School of Medicine, Stanford University（斯坦福大学医学院）

AI总结提出两阶段框架，先用Swin-UNETR粗分割，再用条件扩散模型细化边界，最后结合多模态语言模型生成结构化报告，提升儿童脑肿瘤分割精度和可解释性。

详情

AI中文摘要

由于标注数据有限、成像表型异质性、肿瘤边界弥散以及肿瘤子区域类别不平衡，准确的儿童脑肿瘤分割仍然具有挑战性。在此，我们提出一个两阶段深度学习框架，用于改进多模态儿童脑MRI分割和临床解释。首先，我们在BraTS-PEDs MRI扫描上评估3D Res U-Net和Swin-UNETR基线模型，使用四种配准模态预测肿瘤核心、全肿瘤和增强肿瘤区域。其次，我们引入基于扩散的细化模型，以粗Swin-UNETR预测为条件，包括3D DDPM细化器和MedSegDiff。条件化显著提高了扩散稳定性和性能，特别是对于增强肿瘤边界分割。条件化MedSegDiff实现了最强的边界一致性，HD95最低。最后，预测的肿瘤体积和代表性分割叠加图与多模态语言模型集成，生成结构化的放射学风格报告。综合来看，我们的结果表明，从粗到细的扩散分割可以改善儿童肿瘤边界描绘，并支持端到端可解释的AI辅助神经肿瘤学工作流程。

英文摘要

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.
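
The HD95 boundary metric mentioned above can be illustrated with a minimal sketch. It assumes one common convention (the larger of the two directed 95th-percentile nearest-neighbour distances between boundary point sets); the abstract does not say which variant the authors use, and the function names here are hypothetical.

```python
import math


def directed_distances(source, target):
    """For each point in `source`, the distance to its nearest point in `target`."""
    return [min(math.dist(p, q) for q in target) for p in source]


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def hd95(pred_points, ref_points):
    """Assumed convention: max of the two directed 95th-percentile distances."""
    if not pred_points or not ref_points:
        raise ValueError("both point sets must be non-empty")
    forward = percentile(directed_distances(pred_points, ref_points), 95)
    backward = percentile(directed_distances(ref_points, pred_points), 95)
    return max(forward, backward)
```

The result is in physical units (for example millimetres) only if the voxel indices have first been scaled by the voxel spacing. The brute-force nearest-neighbour search is quadratic in the number of points, so it is meant for illustration rather than full 3D volumes.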

2606.14141 2026-06-15 cs.SD cs.AI cs.CL 交叉投稿

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

动态声源的时空音频语言建模

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

发表机构 * POSTECH（浦项科技大学）； Sony AI（索尼AI）； Sony Group Corporation（索尼集团）； Sungkyunkwan University（成均馆大学）； KAIST（韩国科学技术院）

AI总结提出ST-AudioLM模型，通过时空音频编码器联合学习事件语义与源轨迹，在ST-AudioQA基准上提升动态声源问答的语义-定位权衡。

详情

AI中文摘要

声音事件是具有语义身份、位置和轨迹的实体，但当前的音频-语言模型通常将片段推理为全局事件内容。相反，声音事件定位模型随时间跟踪声源方向，但对语言推理的语义覆盖有限。为解决这一差距，我们引入了ST-AudioQA，一个基于一阶环绕声（FOA）渲染的静态和移动声源的时空音频问答数据集和基准。每个场景提供源身份、活动、方向、距离和运动元数据，实现密集轨迹监督以及关于什么在发声、在哪里、如何移动以及源之间关系的问题。我们进一步提出了ST-Audio Encoder，一种时间分辨的FOA音频编码器，联合学习事件语义和源轨迹，以及ST-AudioLM，它将编码器的音频令牌连接到LLM进行时空音频问答。实验表明，这种表示改善了语义-定位权衡，并比静态空间和面向定位的基线产生更强的推理性能。

英文摘要

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

2606.14703 2026-06-15 cs.CV cs.CL cs.LG 交叉投稿

Gaze Heads: How VLMs Look at What They Describe

注视头：视觉语言模型如何观察它们所描述的内容

Rohit Gandikota, David Bau

发表机构 * Northeastern University（东北大学）

AI总结发现视觉语言模型的语言骨干中存在一组“注视头”，其注意力跟踪当前描述的图像区域，通过干预这些头可精确控制模型描述内容，准确率达83.1%。

详情

AI中文摘要

视觉语言模型在内部如何解决描述图像的任务远非显而易见。我们发现模型为此发展出一种特定机制：其语言模型骨干中的一小部分注意力头（我们称之为注视头），其注意力跟踪模型当前正在描述的图像区域。我们通过简单的相关性得分从几次前向传播中发现了它们，使用连环漫画作为受控测试平台，其中叙事顺序在空间上展开。这些注视头不仅跟踪正在描述的图像标记：将它们的注意力重定向到所选区域会强制视觉语言模型描述该区域。对前100个注视头（少于所有头的9%）进行单次注意力掩码干预，以83.1%的准确率将模型的答案引导到任何选定的漫画面板，而对随机头进行相同干预则无法重定向答案，并且对所有头进行干预会破坏生成。相同的杠杆还扩展到连续控制：在生成过程中切换注视目标会使模型在几个标记内结束当前面板描述并转向新面板。在漫画之外，相同的干预将答案重定向到自然COCO图像中的选定区域。该机制进一步在2B到32B参数的模型大小以及其他视觉语言模型架构中重复出现，尽管一些冻结编码器系列没有显示可比较的头集。更广泛地说，这表明通过机制分析识别的目标编辑可以作为实用的推理时杠杆来引导多模态模型行为，而无需任何重新训练。我们的代码、交互式演示和数据集可在以下网址获取：https://gaze.baulab.info/

英文摘要

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/

2511.05017 2026-06-15 cs.CV cs.CL 版本更新

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

通过细化文本嵌入缓解大型视觉语言模型中的幻觉

Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Sarvesh Baskar, Vijay Kamarshi, Andrea Fanelli, Furong Huang

发表机构 * University of Maryland（马里兰大学）； Dolby Laboratories（杜比实验室）； Capital One

AI总结针对大型视觉语言模型因过度依赖文本先验而忽视视觉线索导致的幻觉问题，提出一种简单有效的视觉特征融入方法，通过学习视觉信息化的文本嵌入来平衡注意力分布，显著降低幻觉并提升多模态推理能力。

Comments Accepted at The 64th Annual Meeting of the Association for Computational Linguistics

详情

AI中文摘要

大型视觉语言模型（LVLMs）中的幻觉仍然是一个持续的挑战，通常源于多模态推理过程中视觉信息整合不足。一个关键原因是模型过度依赖文本先验而未能充分利用视觉线索，导致输出语言流畅但视觉上不准确。例如，给定一张空厨房台面的图像，LVLM可能会根据语言关联而非视觉证据幻觉出“一碗水果”或“一杯咖啡”。大多数LVLM通过将视觉特征附加到预训练LLM的输入流中，并在大规模视觉语言数据集上训练来整合视觉特征。我们的系统分析表明，由于LLM对语言主导表示的固有偏见，这种策略往往导致对文本信息的过度依赖。这种不平衡使注意力偏向文本而非视觉内容，削弱了模型将输出基于视觉输入的能力。为了解决这个问题，我们提出了一种简单而有效的视觉特征融入方法，鼓励模型学习与基础LLM不同的视觉信息化的文本嵌入，并促进更平衡的注意力分布。在多个幻觉基准上的实验结果表明，我们的方法显著减少了幻觉，并促进了更平衡的多模态推理。值得注意的是，我们的方法取得了显著提升，包括在MMVP-MLLM上+9.33%，在POPE-AOKVQA上+2.99%，在Merlin上高达+3.4%，以及在HallusionBench的硬数据分割上+3%。

英文摘要

Hallucinations in Large Vision-Language Models (LVLMs) remain a persistent challenge, often stemming from inadequate integration of visual information during multimodal reasoning. A key cause is the model's over-reliance on textual priors and underutilization of visual cues, leading to outputs that are linguistically fluent but visually inaccurate. For example, given an image of an empty kitchen countertop, an LVLM might hallucinate a "bowl of fruit" or "cup of coffee", relying on language associations rather than visual evidence. Most LVLMs incorporate visual features by appending them to the input stream of a pre-trained LLM and training on large-scale vision-language datasets. Our systematic analysis reveals that this strategy often leads to over-dependence on textual information due to the inherent bias of LLMs towards language-dominant representations. This imbalance skews attention towards the text over visual content, weakening the model's ability to ground outputs in visual inputs. To address this, we propose a simple yet effective visual feature incorporation method that encourages the model to learn visually-informed textual embeddings distinct from those of the base LLM and promotes a more balanced attention distribution. Experimental results across multiple hallucination benchmarks demonstrate that our method significantly reduces hallucinations and fosters more balanced multimodal reasoning. Notably, our approach achieves substantial gains, including +9.33% on MMVP-MLLM, +2.99% on POPE-AOKVQA, up to +3.4% on Merlin, and +3% on the hard-data split of HallusionBench.

2605.16739 2026-06-15 cs.LG cs.AI cs.CL q-bio.NC 版本更新

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind：从人类大脑fMRI信号解码情感描述

Bilal A. Mohammed, Lin Gu, Ruogu Fang

发表机构 * Department of Biomedical Engineering（生物医学工程系）； Vanderbilt University（范德比大学）； Research Institute of Electrical Communication（电气通信研究所）； Tohoku University（东北大学）； University of Florida（佛罗里达大学）

AI总结本文提出EmoMind，首个端到端解码fMRI信号生成情感描述的系统，通过结合语义基础的中性场景描述和连续情感向量，实现了在内容保留与情感表达间的平衡，并在多个验证框架下优于基于标签提示的GPT-4。

详情

AI中文摘要

本体记忆增强的ASR校正用于长文本-语音交错对话

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China（哈尔滨工业大学（深圳）计算与智能研究所）； Shenzhen Loop Area Institute (SLAI), China（深圳环域研究所）

AI总结提出本体记忆增强的ASR校正框架，通过动态更新本体记忆存储实体、术语、变体、混淆和语义关系，解决长文本-语音交错对话中的上下文校正问题，在RAMC-Corr数据集上优于直接校正。

详情

AI中文摘要

自动语音识别（ASR）校正传统上集中于孤立的话语或短局部上下文。然而，随着文本和语音在长交互中越来越交错，ASR校正需要对话级别的上下文证据。现有的ASR校正方法通常依赖于当前假设或拼接原始对话历史。在此类上下文中，稀疏的校正证据可能难以在冗余和噪声中定位。针对这些挑战，我们提出了一种本体记忆增强的ASR校正框架，用于长文本-语音交错对话。该框架将先前的交互历史组织成动态可更新的本体记忆，其中实体、术语、表面变体、潜在ASR混淆和语义关系作为可检索节点存储，用于上下文基础的校正。为了评估这一设置，我们构建了RAMC-Corr，一个源自MAGIC-RAMC的数据集，用于具有基础上下文的长距离ASR校正。在RAMC-Corr上的实验表明，我们的方法在10个配对骨干-设置组合中的9个上优于直接校正，并鼓励对上下文相关的ASR错误进行更具选择性和证据基础的校正。

英文摘要

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

2603.24596 2026-06-15 eess.AS cs.AI cs.CL 版本更新

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

X-OPD：面向语音大语言模型能力对齐的跨模态在策略蒸馏

Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin

发表机构 * Tencent Hunyuan（腾讯混元）； Zhejiang University（浙江大学）

AI总结提出X-OPD框架，通过跨模态在策略蒸馏对齐语音LLM与文本LLM的能力，利用文本教师模型评估语音模型的轨迹并提供令牌级反馈，显著缩小复杂任务性能差距。

Comments Accepted by Interspeech 2026

详情

AI中文摘要

虽然从级联对话系统转向端到端（E2E）语音大语言模型（LLMs）改善了延迟和副语言建模，但E2E模型通常表现出与其文本对应模型相比显著的性能下降。标准的监督微调（SFT）和强化学习（RL）训练方法无法弥合这一差距。为了解决这个问题，我们提出了X-OPD，一种新颖的跨模态在策略蒸馏框架，旨在系统地将语音LLM的能力与其文本对应模型对齐。X-OPD通过在线策略展开使语音LLM探索其自身分布，其中基于文本的教师模型评估这些轨迹并提供令牌级反馈，从而有效地将教师的能力蒸馏到学生的多模态表示中。在多个基准上的大量实验表明，X-OPD在保留模型固有能力的同时，显著缩小了复杂任务中的差距。

英文摘要

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit a significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

2606.13685 2026-06-15 cs.CL cs.AI 新提交

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

抛硬币的裁判？LLM作为评估者的可靠性与偏见

Abel Yagubyan

发表机构 * Independent Researcher（独立研究员）

AI总结研究LLM作为评估者在重复评估中的不可靠性，发现偏好翻转率平均13.6%，存在位置偏见，并建议多轮聚合和不确定性报告。

Comments 24 pages, 7 figures

详情

AI中文摘要

LLM作为评估者（LLM-as-a-Judge）现被广泛用于模型输出排名、训练奖励模型和填充公共排行榜，但其运行间可靠性仍缺乏充分表征。我们使用两个OpenAI评估模型（GPT-4o-mini和GPT-4.1-mini）在涵盖10个类别的29个任务上进行了重复的相同评估，每个问题进行50次成对试验和50次逐点试验，并辅以温度和提示敏感性消融实验。在评估者之间，成对偏好平均翻转13.6%的时间，28%的问题翻转率超过20%，一个问题达到56%。GPT-4o-mini还表现出显著的第一位置偏见（72%的A多数，p=0.024）。同时，平均逐点评分差距很小（在10分制上为0.19-0.36），且总体上不具统计显著性，产生了成对-逐点差距：评估者经常选择胜者，即使它们自己的标量分数几乎没有证据表明存在有意义的质量差异。除了评估者内部的不稳定性，评估者间的一致性仅为76%（κ=0.51），语义等价的提示模板在25%的测试案例中改变了多数结果，确定性解码减少了但不消除不一致性。可靠性曲线分析显示，在我们的数据集中，平均需要11次重复试验才能让多数投票以95%的概率恢复50次试验的参考裁决，对于高方差问题则上升至15次。这些发现表明，单次LLM评估对于高风险评估往往噪声过大，多轮聚合、位置随机化和显式不确定性报告应成为标准实践。由于两个评估者均来自同一提供商，跨提供商复制仍是重要的下一步。

英文摘要

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise–pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% (κ = 0.51), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
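
The flip-rate and majority-vote ideas in the abstract above can be illustrated with a minimal sketch. It assumes that a "flip" is a trial disagreeing with the majority verdict, and that trials are independent draws that match the reference verdict with a fixed probability `p`. Both assumptions and all function names are illustrative and are not taken from the paper, so the numbers it produces are not expected to reproduce the reported 11 or 15 trials.

```python
from math import comb


def flip_rate(verdicts):
    """Fraction of trials that disagree with the majority verdict (assumed definition)."""
    labels = sorted(set(verdicts))  # sorted so ties are broken deterministically
    majority = max(labels, key=verdicts.count)
    return sum(v != majority for v in verdicts) / len(verdicts)


def majority_recovery_prob(p, k):
    """Probability that a majority vote over k independent trials matches the
    reference verdict when each single trial matches it with probability p."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number to avoid ties")
    needed = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(needed, k + 1))


def trials_needed(p, target=0.95, max_k=101):
    """Smallest odd k whose majority vote reaches the target probability, else None."""
    for k in range(1, max_k + 1, 2):
        if majority_recovery_prob(p, k) >= target:
            return k
    return None
```

Under this binomial model, a judge that agrees with its own reference verdict 90% of the time already needs three trials for a 95% reliable majority vote, and a judge at 50% never gets there, which is the intuition behind aggregating several trials instead of trusting one.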

2606.13686 2026-06-15 cs.CL cs.CY 新提交

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

电子商务欺骗性界面下的Web Agent安全基准测试

Zijing Shi, Meng Fang, Ling Chen

发表机构 * AAII, University of Technology Sydney（悉尼科技大学AAII）； University of Liverpool（利物浦大学）

AI总结提出WebDecept框架，在电子商务环境中注入七种常见欺骗性界面模式，测试多模态Web Agent的安全性，发现当前Agent极易受骗且提示约束不足。

Comments Accepted to ACL 2026

详情

AI中文摘要

随着自主Web Agent越来越多地用于执行现实任务，确保其安全性已成为关键问题。在这项工作中，我们研究了电子商务领域中现实欺骗性界面下的Web Agent行为。我们引入了WebDecept，一个轻量级且可配置的插件框架，能够将欺骗性界面模式可控地注入现有Web环境。使用WebDecept，我们实例化了开放Web上常见的七种欺骗模式，包括定向广告、域名重定向和购物操纵。通过在任务执行期间将这些模式注入前端，我们对多个多模态Web Agent进行了受控评估。我们的结果表明，当前的Web Agent极易受到多类欺骗性界面的影响，并且基于提示的约束通常不足以缓解这些失败。我们进一步分析了欺骗性模式的设计选择如何影响此类操纵的成功。这些发现凸显了在Web Agent向现实部署扩展时应解决的安全挑战。

英文摘要

As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.


2606.13756 2026-06-15 cs.CL New submission

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

Abdessalam Bouchekif, Somaya Eltanbouly, Samer Rashwani, Shahd Gaben, Mutaz Al-Khatib, Heba Sbahi, Emad Mohamed, Mohammed Ghaly

Affiliations: Hamad bin Khalifa University; Nazarbayev University

AI summary: Presents the QIAS 2026 shared task, which evaluates the complex reasoning ability of large language models on Islamic inheritance. The task builds on the MAWARITH dataset and is scored with the multi-step MIR-E metric; 16 teams participated, and the results show that the task remains highly challenging for current models.

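Beyond Perplexity: UTF-8 Validity in Byte-Aware Language Models

Sangwhan Moon, Daisuke Oba, Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

AI summary: Studies invalid UTF-8 generation in language models with byte-level tokenization. Multilingual training experiments show that UTF-8 validity converges about twice as slowly as perplexity and that rare characters reach higher structural validity, indicating that reliable UTF-8 generation is a capability that needs separate evaluation.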

Abstract

Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observe that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
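
As a rough illustration of what "UTF-8 structural validity" can mean in practice, the sketch below checks whether generated byte sequences decode under Python's strict UTF-8 decoder and reports the share that do. The abstract does not specify the paper's evaluation protocols, so this is only one plausible check; the function names `is_valid_utf8` and `validity_rate` are hypothetical.

```python
from typing import Iterable


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8.
    Strict decoding rejects truncated multi-byte sequences,
    overlong encodings, and stray continuation bytes."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples: Iterable[bytes]) -> float:
    """Fraction of samples that are well-formed UTF-8."""
    items = list(samples)
    if not items:
        raise ValueError("need at least one sample")
    return sum(is_valid_utf8(s) for s in items) / len(items)
```

A check like this is independent of perplexity: a model can assign high probability to plausible text on average and still emit a truncated three-byte sequence for a rare character.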


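2606.14199 2026-06-15 cs.CL cs.AI cs.LG New submission

OdysSim: Building Foundation Models for Human Behavior Simulation

Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

Affiliations: Carnegie Mellon University, Language Technologies Institute

AI summary: Proposes OdysSim, which unifies 62 datasets and 23 benchmark tasks under the SOUL taxonomy and combines midtraining, task-specific reinforcement learning, and expert distillation to build OSim, an 8B-parameter behavioral foundation model. OSim ranks first or tied-first on 8 of 23 tasks, more than any individual frontier model, and produces more human-like outputs with zero-shot transfer.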

Comments 34 pages. Code: https://github.com/sunnweiwei/OdysSim ; Models and data: https://huggingface.co/collections/cmu-lti/odyssim

Abstract

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $τ$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.


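2606.14257 2026-06-15 cs.CL New submission

The Linguistics Olympiads: Towards a New Corpus for Linguistics Research?

Vlad A. Neacsu

AI summary: Explores the potential of linguistics olympiad problems as a corpus for linguistics, analyzes their strengths and limitations, and proposes criteria for their responsible use in academic research.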

Comments Accepted for publication in LingBaW. Linguistics Beyond and Within (Volume 12, 2026)

Abstract

Linguistics olympiad problems (LOPs) are a category of self-sufficient puzzles consisting of a scaled-down corpus representative of certain linguistic phenomena, from which the solver must deduce a primitive set of rules of the language and then translate a new set of elements. The linguistics olympiads (LOs) have become a worldwide phenomenon with 43 different territories taking part in the International Linguistics Olympiad (IOL) 2025. While the typology and solving strategies of LOPs have been analysed, their scientific facet and connections to academic linguistics have yet to be explored. LOPs are directly connected to many linguistic fields, e.g., linguistic typology, linguistic relativity, and linguistics fieldwork. Recently, LOPs have become a research focus as benchmarks for large language models, thus highlighting their usefulness in computational linguistics. Nevertheless, they have not yet been integrated into mainstream linguistics research. This paper attempts to open new directions of including this particular type of puzzle in academic research by offering a structured evaluation of LOPs as linguistic data sources and proposes criteria for their responsible use in academic research. Starting from a set of over 1800 LOPs, this study critically examines the potential of LOPs as a novel corpus for linguistics research by discussing their strengths and limitations as tools, as well as the areas of linguistics into which these problems could fit. This work forms the foundation for a broader initiative aimed at bridging the gap between LOs and academic linguistics, by establishing a robust theoretical framework for LOPs.


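2606.14278 2026-06-15 cs.CL New submission

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

Shaojie Yin

Affiliations: Shanghai Artificial Intelligence Laboratory

AI summary: Proposes Judge-LS, a protocol that uses language-switched variants to evaluate the reliability of LLMs as automatic judges. Language switching causes 10.7% to 14.4% preference flips, but translation-equivalent tie probes show no systematic preference for English.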

Abstract

Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7% to 14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
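
A preference-flip rate of the kind reported above can be computed by comparing a judge's verdict on each item in English with its verdict on the transformed variant. The sketch below does that and attaches a Wilson score interval. The abstract mentions confidence intervals but not which kind, so the Wilson interval is our choice, and the names `flip_rate` and `wilson_interval` are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def flip_rate(english: Sequence[str], other: Sequence[str]) -> float:
    """Share of items whose verdict differs between the English
    presentation and another presentation of the same item."""
    if len(english) != len(other) or not english:
        raise ValueError("need two non-empty verdict lists of equal length")
    flips = sum(a != b for a, b in zip(english, other))
    return flips / len(english)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion
    (z = 1.96 gives an approximate 95% interval)."""
    if n <= 0:
        raise ValueError("n must be positive")
    phat = successes / n
    denom = 1 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, `wilson_interval(50, 419)` gives an interval around a flip rate of roughly 12% on a 419-item benchmark; the count 50 is made up for illustration and is not a figure from the paper.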


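2606.14459 2026-06-15 cs.CL cs.AI cs.SD New submission

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

Affiliations: Knowledge Technology, Department of Informatics, University of Hamburg, Germany

AI summary: Proposes the MoDiCoL dataset, whose modular design separates linguistic content, speaker characteristics, and acoustic environment, together with a continual learning curriculum that simulates real-world distribution shifts. Evaluates how robustness is acquired, transferred, and forgotten under three continual learning strategies.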

Comments Accepted at Interspeech 2026

Abstract

Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.


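2606.14574 2026-06-15 cs.CL cs.AI New submission

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

Affiliations: The Pennsylvania State University

AI summary: Proposes SIMMER, a benchmark that evaluates latent failures in LLM planning with a human-curated symbolic world model of the kitchen domain. Frontier models produce at most 17% error-free plans, and up to 56% of plans contain latent failures, most of them irreversible; counterfactual foresight simulation reduces latent failures by up to 72% and irreversible cases by up to 75%.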

Abstract

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
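
The distinction between immediate and latent failures can be shown with a toy state-machine executor. The sketch below is not SIMMER's world model: the four actions, the state variables, and the single end-of-plan hazard rule are all invented for illustration, and names such as `ACTIONS`, `Report`, and `execute` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, bool]


@dataclass
class Action:
    precondition: Callable[[State], bool]  # must hold before the step runs
    effect: Callable[[State], None]        # mutates the state in place


@dataclass
class Report:
    immediate: List[str] = field(default_factory=list)  # steps rejected at execution time
    latent: List[str] = field(default_factory=list)     # hazards found only afterwards


# Toy kitchen actions (assumed for this example).
ACTIONS: Dict[str, Action] = {
    "place_pan": Action(lambda s: not s["pan_on_stove"], lambda s: s.update(pan_on_stove=True)),
    "turn_on_stove": Action(lambda s: not s["stove_on"], lambda s: s.update(stove_on=True)),
    "fry_egg": Action(lambda s: s["stove_on"] and s["pan_on_stove"], lambda s: s.update(egg_cooked=True)),
    "turn_off_stove": Action(lambda s: s["stove_on"], lambda s: s.update(stove_on=False)),
}


def execute(plan: List[str]) -> Report:
    """Run a plan step by step, recording precondition violations
    immediately and checking for a latent hazard at the end."""
    state: State = {"stove_on": False, "pan_on_stove": False, "egg_cooked": False}
    report = Report()
    for step in plan:
        action = ACTIONS[step]
        if not action.precondition(state):
            report.immediate.append(step)  # immediate failure: visible right away
            continue
        action.effect(state)
    if state["stove_on"]:
        # Every step succeeded, yet the final state is unsafe.
        report.latent.append("stove left on at end of plan")
    return report
```

The plan `["place_pan", "turn_on_stove", "fry_egg"]` runs without any rejected step but leaves the stove on, which is the kind of silent problem that a success-only check would miss.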


2606.14600 2026-06-15 cs.CL New submission

Every Eval Ever: A Unified Schema and Community Repository for AI Evaluation Results

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

Affiliations: Technical University Munich; Munich Center for Machine Learning; Weizenbaum Institute; Zuse Institute Berlin; Evidence Prime; Trustible; Kitware; ETH Zurich; StickFlux Labs; Stanford University; Northeastern University; IBM Research; Comenius University Bratislava; Cisco; University of Notre Dame; Hebrew University of Jerusalem; University of Oxford; Ohio University; Writer; TCS Research; Oxford University Press; Queen Mary University of London; Technical University Berlin; University of Delaware; Cinemo; Johns Hopkins University; University of Copenhagen; ELLIS; Iowa State University; Meta FAIR; University of Montreal; Mila Quebec AI Institute; EleutherAI; Yale University; Hugging Face; University of Edinburgh; Harvard University; ETH AI Center; MIT; MIT-IBM Watson Lab

AI summary: Addresses the inconsistent, hard-to-compare formats of AI evaluation results by proposing the first shared schema and community-crowdsourced repository, unifying results across evaluation frameworks through a standardized representation, automatic converters, and a community database.

Abstract

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.
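
To make the idea of a single, source-agnostic JSON record with optional per-instance outputs more concrete, here is a small sketch. It does not reproduce the Every Eval Ever schema, which the abstract does not spell out: every field name below (`model`, `benchmark`, `metric`, `score`, `source`, `instances`) and both helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class InstanceResult:
    """One per-instance output (illustrative fields only)."""
    instance_id: str
    output: str
    score: float


@dataclass
class EvalRecord:
    """One evaluation result, regardless of where it came from."""
    model: str
    benchmark: str
    metric: str
    score: float
    source: str  # e.g. "harness" or "paper" (assumed vocabulary)
    instances: Optional[List[InstanceResult]] = None  # optional fine-grained data


def to_json(record: EvalRecord) -> str:
    """Serialize a record, including nested instances, to one JSON document."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> EvalRecord:
    """Rebuild a record from its JSON form."""
    data = json.loads(text)
    if data.get("instances") is not None:
        data["instances"] = [InstanceResult(**item) for item in data["instances"]]
    return EvalRecord(**data)
```

Keeping the instance list optional lets a score taken from a paper and a full harness log share one record type, which is the property the abstract describes as source-agnostic.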


2606.14697 2026-06-15 cs.CV cs.AI cs.CL Cross-list

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

AI summary: Proposes ClinHallu, a benchmark of 7,031 instances, each with a structured reasoning trace (visual recognition, knowledge recall, reasoning integration). Stage-replacement interventions and trace-supervised fine-tuning enable fine-grained hallucination diagnosis and mitigation.

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

Abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.


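2502.10886 2026-06-15 cs.CL Version update

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

Vanya Cohen, Raymond Mooney

Affiliations: University of Texas at Austin

AI summary: Proposes MET-Bench, a multimodal entity tracking benchmark. Finds that vision-language models track entities markedly worse from images than from text, mainly because of deficits in visual reasoning, and that reinforcement learning improves in-modality performance but transfers poorly across modalities.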

Comments ICML 2026

Abstract

Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain, especially in long-horizon multimodal tasks. We apply reinforcement learning to improve entity tracking in open-source VLMs. This yields substantial in-modality gains, but does not transfer robustly across input modalities. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.


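2508.05782 2026-06-15 cs.CL Version update

FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Xiangyan Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver

Affiliations: arXiv

AI summary: Addresses the coarse granularity of factual labels in dialogue hallucination detection by proposing FineDialFact, a fine-grained dialogue fact verification benchmark built from publicly available datasets and evaluated with several baselines. Chain-of-Thought reasoning improves performance, but the best F1-score is only 0.74, so the task remains challenging.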

Abstract

Large language models are known to produce hallucinations (factually incorrect or fabricated information), which poses significant challenges for many natural language processing applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate, or non-verifiable facts, making the use of a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains a challenging task for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.


2508.06196 2026-06-15 cs.CL cs.HC Version update

EiCAP: Beyond Fluency, Probing and Improving Emotional Intelligence in LLMs via Psychologically Grounded Multi-Turn Dialogue

Nizi Nazar, Pardis Sadat Zahraei, Dilek Hakkani-Tür, Natasa Milic-Frayling, Ehsaneddin Asgari

Affiliations: Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University; University of Illinois Urbana-Champaign (UIUC)

AI summary: Proposes EiCAP, a framework based on a psychologically grounded six-layer taxonomy of emotional intelligence, comprising the evaluation benchmark EiCAP-Bench and the fine-tuning corpus EiCAP-SFT. Finds that generic conversational fine-tuning does not improve emotional intelligence, whereas EI-grounded LoRA fine-tuning significantly improves performance in all 24 subcategories.

Abstract

Large Language Models increasingly serve in emotionally sensitive roles, including mental health support, education, and crisis response, yet they lack a principled framework for assessing or improving Emotional Intelligence (EI). We introduce EiCAP, a unified, psychologically grounded six-layer EI taxonomy operationalized into two complementary resources. EiCAP-Bench is a multi-turn, one-vs-three forced-choice evaluation suite with 3,174 probes across 24 subcategories and cross-turn dependencies that reflect real conversational EI demands. EiCAP-SFT is a 152,820-dialogue supervision corpus aligned to the same taxonomy, enabling controlled, interpretable fine-tuning. Two key findings emerge. First, generic conversational supervised fine-tuning does not confer EI: fine-tuning on UltraChat yields no significant gain in any of the 24 subcategories, with a macro score of 24.6%, near the chance level of 25%. Second, applying EI-grounded LoRA, using approximately 0.8% of parameters, directly to Qwen-2.5-7B-Base achieves significant gains in all 24 subcategories, reaching a macro score of 75.33%, a gain of 51.7 percentage points over Base and 37.1 percentage points over Instruct. Crucially, an ablation shows that the UltraChat pre-stage is counterproductive, reducing performance by 21.4 percentage points: direct EI-grounded training is both necessary and sufficient.
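
The 25% chance level follows from the one-vs-three forced-choice format: with one correct option among four, uniform guessing is right a quarter of the time. The sketch below shows that calculation together with a macro score, assuming the macro score is the unweighted mean of per-subcategory accuracies. The abstract does not define the aggregation, so that is an assumption, and the names `chance_level` and `macro_score` are hypothetical.

```python
from typing import Mapping, Tuple


def chance_level(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 1:
        raise ValueError("need at least one option")
    return 1.0 / num_options


def macro_score(per_subcategory: Mapping[str, Tuple[int, int]]) -> float:
    """Unweighted mean of subcategory accuracies.
    Each value is a (correct, total) pair; every subcategory counts
    equally, however many probes it contains."""
    if not per_subcategory:
        raise ValueError("need at least one subcategory")
    accuracies = []
    for name, (correct, total) in per_subcategory.items():
        if total <= 0:
            raise ValueError(f"subcategory {name!r} has no probes")
        accuracies.append(correct / total)
    return sum(accuracies) / len(accuracies)
```

Because each subcategory is weighted equally, a macro score can differ from overall accuracy when subcategories have different numbers of probes.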


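2601.15828 2026-06-15 cs.CL cs.AI Version update

Can professional translators identify machine-generated text?

Michael Farrell

Affiliations: IULM University, Milan, Italy

AI summary: An experiment on whether professional translators without specialized training can identify AI-generated short stories. A minority (16.2%) distinguished them accurately, while a nearly equal number misclassified them in the opposite direction, often relying on subjective impressions; low burstiness and narrative contradiction were the most reliable indicators.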

Comments Pages 581 to 591, Volume 1, proceedings of the 26th Annual Conference of the European Association for Machine Translation, 2026

Abstract

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.


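2603.05167 2026-06-15 cs.CL cs.AI Version update

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Avni Mittal, Rauno Arike

Affiliations: SPARAI

AI summary: Proposes C2-Faith, a benchmark that assesses how well LLM judges evaluate the process faithfulness of chain-of-thought reasoning along two dimensions, causality and coverage. Finds marked weaknesses in error localization and coverage scoring.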

Abstract

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.


2605.11378 2026-06-15 cs.CL Version update

TVIR: Building Deep Research Agents for Text-Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

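Affiliations: Nanjing University; Alibaba Group

AI summary: Proposes the TVIR benchmark and a hierarchical multi-agent framework to address the factual reliability and alignment of visual elements in deep research reports.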

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.

2606.07040 2026-06-15 cs.CL 版本更新

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

超越评分标准：面向奖励建模的探索引导评估技能

Xing Yue, Linjuan Wu, Daoxin Zhang, Yongliang Shen, Weiming Lu

发表机构 * Zhejiang University（浙江大学）； Xiaohongshu Inc.（小红书公司）

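AI summary: Proposes Eval-Skill, which uses exploration-guided synthesis of reusable domain-level evaluation skills in place of per-query rubric generation, improving Qwen3-8B and DeepSeek-V4-Flash on RewardBench 2 by 13.44% and 18.51%, respectively.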

AI中文摘要

开放式奖励建模需要裁判在无法获得可验证答案时遵循微妙的、领域特定的偏好。现有的基于评分标准的方法通常通过为每个查询在线生成标准来解决这一问题，但额外的生成步骤会增加推理开销，并产生僵化或错位的指导。我们引入了Eval-Skill，一种探索引导的方法，为奖励建模合成可复用的评估技能，并将奖励指导重新定义为上下文演化，而非参数训练或每查询评分标准生成。仅使用每个领域100个案例进行技能演化，Eval-Skill通过两个渐进阶段（工作流生成和原则生成）合成可复用的领域级评估技能，并在两个阶段中交错进行探索和选择。一旦生成，技能直接注入裁判上下文。在多个奖励建模基准上，Eval-Skill持续改进不同的裁判骨干模型；在RewardBench 2上，与普通评判相比，每个主要骨干模型都取得了显著提升（Qwen3-8B提升13.44%，DeepSeek-V4-Flash提升18.51%）。对演化时间缩放、泛化性和可迁移性的进一步分析表明，紧凑的评估技能为基于LLM的评估提供了一种高效的新范式。代码可在 https://github.com/xing-stellus-yue/Eval-Skill 获取。

英文摘要

Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

2601.00821 2026-06-15 cs.AI cs.CL cs.IR 版本更新

Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations

逐字块胜过提取的人工制品：长LLM对话中记忆表征的控制消融研究

Tao An

发表机构 * Hawaii Pacific University（夏威夷太平洋大学）

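AI summary: A controlled ablation finds that verbatim conversation chunks are 15.9 to 22.0 points more accurate than LLM-extracted structured artifacts (facts, decisions, and so on) for long-conversation memory retrieval, because extraction discards verbatim detail; structured memory should supplement verbatim text rather than replace it.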

Comments v2: substantially revised -- reframed from a system paper to a controlled ablation study; title and conclusions updated accordingly. 26 pages, 5 figures

AI中文摘要

一类日益增长的对话记忆系统将对话历史压缩为结构化人工制品——提取的事实、决策或事件——其前提是蒸馏后的结构比原始文本检索效果更好。我们通过控制消融实验检验了这一前提：在固定的检索-重排-推理流水线中，仅交换存储的表征——LLM提取的类型化人工制品与逐字对话块——保持模型、检索器、重排器和评判器不变。逐字块在LoCoMo上领先15.9个百分点（43.9% vs. 28.0%），在LongMemEval-S上领先22.0个百分点（67.4% vs. 45.4%）；1跳语义图无法弥补差距，五个混淆控制实验重现了该效应。其机制是有损蒸馏：提取丢弃了逐字细节，而块则免费保留；提取人工制品流水线在整体准确率上从未超过朴素RAG。同时，使用近乎逐字、保留来源的单元所取得的积极结果也符合同一解释：检索准确性取决于表征与源文本的偏离程度。对于我们测试的提取设计，结构化记忆应增强逐字文本而非替代它：块∪人工制品联合存储在两个基准上都匹配块，而仅人工制品则丧失优势。代码和数据：https://github.com/tao-hpu/cog-canvas

英文摘要

A growing class of conversational-memory systems compresses dialogue history into structured artifacts -- extracted facts, decisions, or events -- on the premise that distilled structure retrieves better than raw text. We test this premise with a controlled ablation: within one fixed retrieval-rerank-reasoning pipeline, we swap only the stored representation -- LLM-extracted typed artifacts versus verbatim conversation chunks -- holding the model, retriever, reranker, and judge constant. Verbatim chunks win by 15.9 points on LoCoMo (43.9% vs. 28.0%) and 22.0 points on LongMemEval-S (67.4% vs. 45.4%); a 1-hop semantic graph does not recover the gap, and five confound controls reproduce the effect. The mechanism is lossy distillation: extraction discards verbatim detail that chunks retain for free, and the extracted-artifact pipeline never beats naive RAG in overall accuracy. Concurrent positive results with near-verbatim, provenance-preserving units fit the same account: retrieval accuracy tracks how far the representation departs from the source. For the extraction designs we test, structured memory should augment verbatim text rather than replace it: a chunks $\cup$ artifacts union store matches chunks on both benchmarks while artifacts alone forfeit the gap. Code and data: https://github.com/tao-hpu/cog-canvas

2601.04646 2026-06-15 cs.IR cs.AI cs.CL cs.LG 版本更新

对或错，模型都顺从：LLM 道德判断中的方向盲从

Jihye Kim, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz（加州大学圣克鲁兹分校）

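AI summary: Proposes Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic, and finds that LLMs follow helpful nudges more than misleading ones on factual judgments (A = 1.58) but follow both at nearly the same rate on moral judgments (A = 1.04), exposing direction-blind compliance as an alignment failure mode.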

AI中文摘要

随着语言模型在许多领域扮演整合角色，LLM对用户反驳的响应成为一个关键的对齐属性。然而，许多现有评估将顺从视为单向的，测量模型是否抵抗压力，但不测量它们是否有选择地抵抗。我们引入顺从不对称性（A = BCR/HCR），一种双向诊断方法，比较在有益提示下的有益输出变化与在误导提示下的有害变化。在9个模型和972,000个提示条件响应中，我们发现这种选择性在事实判断和道德判断中有所不同：模型在事实问题上遵循有益提示多于有害提示（A = 1.58），但在道德问题上以几乎相同的速率遵循两个方向（A = 1.04）。这种现象在模型家族、能力水平和提示类型中持续存在。有趣的是，我们还发现思维链提示同时放大了有益和有害的顺从，而基于身份的提示以几乎相同的幅度抑制了这两者。这些结果将方向盲从道德顺从确定为当前LLM中一个独特的失败模式，并表明对齐应针对方向校准的更新，而不是仅降低顺从。

英文摘要

As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
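The ratio A = BCR/HCR can be made concrete with a minimal sketch. The abstract does not expand the acronyms, so the code assumes BCR is the rate of beneficial answer changes under helpful nudges and HCR is the rate of harmful changes under misleading nudges; the function name and the counts are hypothetical, chosen only so the result lands on the reported factual value.

```python
def compliance_asymmetry(beneficial_changes, helpful_trials,
                         harmful_changes, misleading_trials):
    """Return A = BCR / HCR from raw counts.

    BCR: share of helpful-nudge trials where the answer changed for the better.
    HCR: share of misleading-nudge trials where the answer changed for the worse.
    (Both readings of the acronyms are assumptions of this sketch.)
    """
    bcr = beneficial_changes / helpful_trials
    hcr = harmful_changes / misleading_trials
    if hcr == 0:
        raise ValueError("HCR is zero, so the ratio is undefined")
    return bcr / hcr


# Illustrative counts only: 79% beneficial vs. 50% harmful compliance.
a_factual = compliance_asymmetry(79, 100, 50, 100)
```

Under this reading, A near 1 means the model yields to helpful and misleading pressure at about the same rate, which is the direction-blind pattern the abstract reports for moral questions.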

2606.14068 2026-06-15 cs.CL 新提交

Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

对男性更严厉？评估LLM在不同冲突场景中的性别不对称道德框架

Guangzong Si, Dong Wang, Zhenhao Li, Yifan Yu, Panwang Pan, Wentao Zhu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Eastern Institute of Technology, Ningbo（宁波东方理工大学）

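AI summary: Proposes GAMA-Bench, which uses gender-mirrored scenarios to evaluate LLM responses to identical negative behavior, and finds that models lean toward punishment and blame for male actors and toward empathy and therapeutic framing for female actors.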

Comments Under review

AI中文摘要

现有关于LLM性别偏见的研究主要关注刻板印象、职业关联或明确的有害输出。在这项工作中，我们询问LLM是否在匹配的男性角色和女性角色条件下对相同的负面行为应用一致的回应标准。我们引入了GAMA-Bench，一个包含1298个场景的性别镜像基准，涵盖亲密关系和公共社会冲突。它通过受控网格和跨模型审查构建性别中立的不当行为模板，然后将它们编译成配对的第一人称提示，包含匹配的角色性别和角色指称变化。我们进一步设计了一个结构化的回应框架协议，以衡量模型如何分配惩罚、共情、升级、指导和责备。在10个代表性LLM上的实验揭示了一致的男性不利不对称：男性角色获得更多惩罚性、升级性和责备中心的框架，而女性角色对相同的不当行为获得更多治疗性和共情导向的框架。进一步分析表明，这种模式在模型家族、场景轨道、模型规模和显式思维风格推理中持续存在。官方代码见 https://github.com/xufeiqiong/GAMA-Bench。

英文摘要

Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at https://github.com/xufeiqiong/GAMA-Bench.

2606.14209 2026-06-15 cs.CL 新提交

Detecting undisclosed LLM-generated content in parliamentary texts

检测议会文本中未披露的LLM生成内容

Minerva Suvanto, Andrea McGlinchey, Peter J. Barclay, Mattias Wahde

发表机构 * Chalmers University of Technology（查尔姆斯理工大学）； University of Cambridge（剑桥大学）

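AI summary: Trains interpretable text classifiers to detect undisclosed LLM use in UK and Swedish parliamentary texts, and finds that such use has risen steadily in both parliaments since 2022.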

2606.14460 2026-06-15 cs.CL cs.HC 新提交

A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

ClinicalBERT 语言预测中人口统计关联编码的计算审计

Kehinde Temitayo Soetan

发表机构 * The Ohio State University（俄亥俄州立大学）

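AI summary: Audits representational bias in ClinicalBERT with two probing methods and finds that the model mainly amplifies, rather than inherits, statistical disparities in its training data.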

Comments 17 pages, 4 tables, appendices A-E, preprint

AI中文摘要

基于 Transformer 的临床语言模型越来越多地集成到高风险临床决策支持流程中，但医学文档中编码的人口统计关联通过计算机制传播到模型概率分布的方式仍缺乏实证研究。我们对 ClinicalBERT (Alsentzer et al., 2019) 的表征偏差进行了系统性计算审计，该模型是在 MIMIC-III 出院小结上预训练的 BERT 模型。我们采用两种互补的探针方法：对数概率偏差分析 (LPBA)，量化人口统计描述符引起的掩码标记概率分布在行为和评价语义类别上的偏移；以及基于掩码语言模型的分析 (MLM)，探测内部表征结构在 98 个真实临床句子模板和八种交叉种族-性别组合中的人口统计代理归因编码。语料频率分析通过将模型输出与 MIMIC-III 训练语料中的经验词频进行基准测试，操作化了统计差异与偏差放大之间的区别。在 32 个统计显著的结果中，65.6% 与观察到的语料分布相矛盾，在 MLM 探针下，黑人患者和代理归因的比率分别上升至 80% 和 87.5%，这提供了直接的经验证据，表明 ClinicalBERT 中的表征偏差主要通过模型内部放大而非训练数据继承来运作。关键词：自然语言处理，临床文档，算法审计，表征偏差，健康公平

英文摘要

Transformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity

2606.13873 2026-06-15 cs.LG cs.CL 交叉投稿

Natively Unlearnable Large Language Models

原生不可学习的大语言模型

Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

发表机构 * University of California, Berkeley（加州大学伯克利分校）

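AI summary: Proposes NULLs, which separate the contributions of data sources using a shared backbone and sparsely activated sinks, enabling efficient unlearning without gradient updates; experiments on Wikipedia validate the effectiveness and robustness of single-article unlearning.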

AI中文摘要

遗忘旨在移除特定训练数据源的影响，但由于不同数据源的贡献在模型中纠缠，这已被证明具有挑战性。将源贡献隔离到不相交的参数中使得移除更容易，尽管这会阻碍跨源的联合学习。我们提出NULLs（原生不可学习的大语言模型），这是一类模型，通过训练一组共享骨干神经元以及一个稀疏激活的sinks池，满足隔离源特定贡献和跨源联合学习这两个对立目标。在训练过程中，特定于源的信息自然集中在其sinks中，而跨源共享的信息积累在骨干中。在部署时，通过禁用相应的sinks来遗忘一个源，无需梯度更新也无需访问保留数据。我们展示了NULLs可扩展到维基百科约600万篇文章，将每篇文章隔离为独立源。遗忘单篇文章会移除其特定知识，同时保留与语义相关文章共享的事实，与从头重新训练紧密匹配。我们注意到，使用NULLs进行遗忘也具有鲁棒性：在遗忘《哈利·波特》书籍的案例研究中，NULLs抵抗了对抗性提取和逆转事后遗忘的重新学习。最后，NULLs保留了一般语言能力，在下游基准测试中与标准Transformer相匹配。这些结果共同表明，源级遗忘不必是事后考虑。它可以原生地构建到LLM训练中，同时保留共享表示学习的优势。

英文摘要

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.

2606.14060 2026-06-15 cs.LG cs.CL 交叉投稿

知道评估如何设计的模型更安全

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center（图宾根ELLIS研究所、图宾根马克斯·普朗克智能系统研究所、图宾根人工智能中心）

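AI summary: Fine-tunes models to acquire meta-knowledge about evaluations (such as verifiable structures or moral dilemmas) and finds that they then behave more safely on safety benchmarks, introducing a new confounder that is independent of explicit memorization or evaluation awareness.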

AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索（例如假设场景）是口头评估意识和后续行为转变的来源。在本文中，我们研究了这一现象的一个潜在解释：评估元知识，定义为关于评估结构特征的参数化知识。类似于数据集污染（基准暴露通过记忆导致更高性能），我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文，例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点，我们在描述评估特征（如可验证结构或道德困境）的合成文档上微调模型。在五个安全基准上评估这个微调模型，我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中，这种行为转变仍然存在。我们的结果表明，评估元知识可能夸大安全基准性能，引入了一种独立于显式记忆或口头评估意识的新混淆因素，因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2606.11502 2026-06-15 cs.CL cs.AI 版本更新

When Roleplaying, Do Models Believe What They Say?

角色扮演时，模型是否相信它们所说的话？

Benjamin Sturgeon, David Africa, Sid Black

发表机构 * MATS

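AI summary: Uses linear truth probes to study how role-play affects LLM internal representations, finding that role-play mainly changes outputs rather than internal truth representations, whereas emergent misalignment shifts internal representations more substantially.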

AI中文摘要

语言模型可以陈述“地球绕太阳运行”，并在扮演亚里士多德时断言相反的说法。最近的研究认为，角色采用是语言模型运作的基础，模型会不断为给定上下文选择最合适的角色。这种角色扮演是否仅仅改变了模型的输出，还是也影响了模型内部表征为真实的内容？我们通过线性真实探针研究这个问题，将其应用于扮演历史人物（其可能的信念与现代共识不同）的LLM。对于每个角色，我们比较该角色可能赞同的虚假陈述（*时代相信*）与主题匹配但该角色不会赞同的虚假陈述（*时代虚假*）。通过提示、上下文学习和监督微调，角色诱导对时代相信陈述的抑制程度低于同等虚假的替代陈述，但它们总体上仍被分类为虚假。因此，角色扮演改变模型所说的内容多于其内部表征为真实的内容。我们将此与经过有害建议训练并表现出涌现性错位（EM）的模型进行对比。在三个模型家族（Qwen 2.5 14B、Qwen 3 8B和Llama 3.3 70B）中，它们的虚假陈述显著向探针空间的真实区域移动，在挑战下大约一半时间被辩护（而角色扮演约为六分之一），并用于下游推理。因此，角色扮演和涌现性错位是信念内化谱系上的点，其中角色扮演改变模型所说的内容而表征变化很小，而涌现性错位则改变虚假陈述的内部表征，但并未完全将其标记为真实。

英文摘要

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2305.07609 2026-06-15 cs.IR cs.CL cs.CY 版本更新

Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

ChatGPT 在推荐中是否公平？评估大语言模型推荐的公平性

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

发表机构 * University of Science and Technology of China（中国科学技术大学）； National University of Singapore（新加坡国立大学）

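AI summary: To address possible bias in LLM-based recommendation (RecLLM), proposes the fairness benchmark FaiRLLM, with carefully designed metrics and a dataset covering eight sensitive attributes; the evaluation finds that ChatGPT still exhibits unfairness in recommendation.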

Comments Accepted by Recsys 2023 (Short). Typo corrections

DOI: 10.1145/3604915.3608860

AI中文摘要

大语言模型（LLM）的显著成就催生了一种新颖的推荐范式——基于LLM的推荐（RecLLM）。然而，需要注意的是，LLM可能包含社会偏见，因此RecLLM做出的推荐的公平性需要进一步研究。为了避免RecLLM的潜在风险，有必要评估RecLLM在用户侧各种敏感属性上的公平性。由于RecLLM范式与传统推荐范式存在差异，直接使用传统推荐的公平性基准是有问题的。为解决这一困境，我们提出了一个新的基准，称为基于LLM的推荐公平性（FaiRLLM）。该基准包括精心设计的指标和一个数据集，该数据集考虑了音乐和电影两个推荐场景中的八个敏感属性。通过使用我们的FaiRLLM基准，我们对ChatGPT进行了评估，发现它在生成推荐时仍然对某些敏感属性表现出不公平性。我们的代码和数据集可在以下网址找到：https://github.com/jizhi-zhang/FaiRLLM。

英文摘要

The remarkable achievements of Large Language Models (LLMs) have led to the emergence of a novel recommendation paradigm -- Recommendation via LLM (RecLLM). Nevertheless, it is important to note that LLMs may contain social prejudices, and therefore, the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate the fairness of RecLLM with respect to various sensitive attributes on the user side. Due to the differences between the RecLLM paradigm and the traditional recommendation paradigm, it is problematic to directly use the fairness benchmark of traditional recommendation. To address the dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations. Our code and dataset can be found at https://github.com/jizhi-zhang/FaiRLLM.

2606.10813 2026-06-15 cs.CR cs.CL 版本更新

RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

RedAct: 为程序技能保护而编辑智能体能力痕迹

Shuwen Xu, Zhitao He, Yi R. Fung

发表机构 * Hong Kong University of Science and Technology（香港科技大学）； University of Chinese Academy of Sciences（中国科学院大学）

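AI summary: Proposes the RedAct framework, which localizes protected key information, rewrites traces, and embeds behavioral watermarks, reducing skill transfer to below the no-skill baseline while preserving audit evidence.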

AI中文摘要

用户依赖执行痕迹来观察智能体行为、诊断故障并确保问责。这些痕迹包含丰富的程序细节，包括工具调用、中间决策和错误恢复逻辑。然而，这些细节可能暴露私有的程序技能，使下游方法能够在没有模型权重或技能文件的情况下恢复关键公式、阈值和策略。为了量化这种风险并评估保护措施，我们构建了CapTraceBench，一个包含75个专业长时任务和154个跨七个领域精选技能的基准。我们还引入了RedAct（https://github.com/...），一个受保护的痕迹发布框架，该框架定位受保护的关键信息，重写痕迹同时保留验证者关键证据，并嵌入行为水印用于下游溯源分析。在代表性的痕迹重用方法中，RedAct将归一化技能转移（NST）从原始痕迹的44.7--67.1%降至低于无技能基线，同时保留审计证据。其独立的行为水印实现了93.6--100.0%的真实检测率，误报率最多为1.9%。这些结果将公共智能体痕迹视为安全接口，并表明选择性编辑可以在不删除审计证据的情况下减少程序能力泄露。

英文摘要

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct, a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6-100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

2605.21006 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

扮演魔鬼的代言人：现成的人格向量在顺从性上与针对性引导相媲美

Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

发表机构 * University of Toronto（多伦多大学）； Princeton University（普林斯顿大学）； Purdue University（普渡大学）； EPFL（瑞士联邦理工学院）； Algoverse ； Independent（独立）

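AI summary: Studies how different personas affect sycophancy and finds that off-the-shelf persona steering vectors are comparable to targeted steering at reducing sycophancy while maintaining accuracy when the user is correct.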

Journal ref: ICML, Pluralistic Alignment Workshop, 2026

AI中文摘要

我们研究了不同人格对顺从性的影响：模型在用户错误时仍同意用户。标准缓解方法，对比激活添加（CAA），从顺从性和诚实响应的标记对中推导出引导方向。本研究评估了现成的人格引导向量是否能作为替代方案，这些向量最初是为一般角色扮演开发的，且未在顺从性数据上训练。在两个指令微调模型中，引导至以怀疑或审查为特征的人格可将顺从性减少到CAA效果的约68%和98%，且不同于CAA，在用户正确时保持准确性。效果也是不对称的：引导至顺从的人格不会产生镜像增加的顺从性。几何上，人格向量在激活空间的方向上与顺从性方向基本无关。总体而言，这些发现表明，顺从性应被视为人格层面的属性，而非单一可引导方向。我们在此发布代码：https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

英文摘要

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
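The steering direction and the geometric comparison mentioned in the abstract can be sketched in a few lines. This assumes the commonly described form of contrastive steering, in which the direction is the difference between mean activations on the two response classes; the abstract itself only says the direction is derived from labelled pairs. All function names and the two-dimensional toy activations are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(positive_acts, negative_acts):
    """Difference of class means: mean(positive) - mean(negative)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy activations: honest responses minus sycophantic responses.
honest_acts = [[1.0, 0.0], [3.0, 0.0]]
sycophantic_acts = [[0.0, 0.0], [0.0, 0.0]]
anti_sycophancy = steering_direction(honest_acts, sycophantic_acts)

# A hypothetical persona vector pointing along the other axis.
persona_vector = [0.0, 1.5]
overlap = cosine_similarity(anti_sycophancy, persona_vector)
```

A cosine near zero between a persona vector and the contrastive direction is one way to read the abstract's statement that the two are largely independent in activation space.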

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Gefen: Optimized Stochastic Optimizer

Gefen: 优化随机优化器

Nadav Benedek, Tomer Koren, Ohad Fried

发表机构 * Reichman University（赖希曼大学）； Tel Aviv University（特拉维夫大学）； Google Research（谷歌研究院）

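AI summary: Proposes the Gefen optimizer, which shares second-moment estimates and quantizes the first moment, cutting AdamW's memory footprint by about 8x at the same performance and supporting larger batches and higher throughput.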

AI中文摘要

AdamW是现代深度学习的默认优化器，但其一阶和二阶矩状态会额外占用约两倍参数大小的训练内存。我们提出Gefen，一种内存高效的优化器，它自动在参数块之间共享二阶矩估计，并使用学习到的码本量化一阶矩，从而将AdamW的内存占用减少约8倍，同时保持相同性能，相当于每十亿参数减少6.5 GiB。该方法受理论结果启发，该结果表明大的混合Hessian项将平方梯度的比率约束为接近1，表明Hessian对齐的参数是共享二阶矩统计量的自然候选。由于大规模计算Hessian不切实际，Gefen从初始平方梯度推断块结构，除了AdamW默认超参数外，不需要任何架构特定的元数据或超参数。Gefen学习基于精确直方图的动态规划量化码本，并重用相同的块进行一阶矩缩放。在多种实验中，Gefen在比较的类似AdamW的方法中实现了最低的峰值优化器内存，同时保持AdamW级别的性能。在FSDP和DDP训练中，减少的内存占用支持更大的微批次，并显著提高相对于AdamW的吞吐量，提供了一种实用的即插即用替代方案，具有更低的内存使用，可以增加吞吐量并支持训练更大的模型或使用更大的批量大小。我们提供了完整的Python实现，包括融合CUDA内核，网址为 https://github.com/ndvbd/Gefen。

英文摘要

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels at https://github.com/ndvbd/Gefen
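The reported figure of 6.5 GiB per billion parameters can be checked with a short back-of-the-envelope calculation. The sketch assumes both AdamW moment buffers are stored in 32-bit floats and that the reduction is exactly 8x; the abstract states only "~8x" and does not give the storage precision, so both are assumptions, and the function names are hypothetical.

```python
GIB = 2 ** 30  # bytes per gibibyte


def adamw_state_bytes(n_params, bytes_per_value=4):
    """Two parameter-sized buffers: first and second moments (fp32 assumed)."""
    return 2 * n_params * bytes_per_value


def savings_gib(n_params, reduction_factor=8.0, bytes_per_value=4):
    """Optimizer-state memory saved when the footprint shrinks by `reduction_factor`."""
    full = adamw_state_bytes(n_params, bytes_per_value)
    reduced = full / reduction_factor
    return (full - reduced) / GIB


# One billion parameters: about 7.45 GiB of AdamW state before reduction.
baseline_gib = adamw_state_bytes(10 ** 9) / GIB
saved_gib = savings_gib(10 ** 9)
```

Under these assumptions the saving works out to roughly 6.5 GiB per billion parameters, in line with the number quoted in the abstract.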

2606.14150 2026-06-15 cs.LG cs.CL 交叉投稿

Small LLMs: Pruning vs. Training from Scratch

小型LLM：剪枝 vs. 从头训练

Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu

发表机构 * Princeton University（普林斯顿大学）； New York University（纽约大学）； Carnegie Mellon University（卡内基梅隆大学）

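AI summary: Compares pruning with training from scratch using six pruning methods on Llama-3.1-8B, finding that pruning is better under a limited budget, while coarse-grained pruning can be surpassed when the budget is ample.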

Comments Our code is available at https://github.com/zlab-princeton/llm-pruning-collection

AI中文摘要

剪枝有望成为获得强大小型语言模型的捷径。在本工作中，我们通过六种涵盖深度、宽度和稀疏粒度的剪枝方法，在两种受控的token匹配设置下，以0.5-0.8的剪枝率对Llama-3.1-8B进行剪枝，检验了这一承诺。(1) 在相同的训练token预算下，剪枝初始化始终优于随机初始化。这表明父模型提供了一个强起点，尽管随着训练token预算的增加和剪枝率的提高，优势逐渐缩小，在我们研究的最高剪枝率下几乎消失。(2) 当从头训练被给予整个流程消耗的全部token预算时，细粒度剪枝仍保持优势，而粗粒度结构化剪枝可能被匹配或超越。这表明父模型传递了额外训练token无法完全恢复的知识，但仅在细粒度下如此。综合来看，我们的结果给出了明确的建议：当手头有一个大型预训练模型且训练token预算有限时，剪枝优于从头训练；当训练预算不受限时，从头训练在粗粒度剪枝下可能具有竞争力，因此大型预训练父模型并非总是必要的。

英文摘要

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5--0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

2606.14695 2026-06-15 cs.LG cs.CL 交叉投稿

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Persona-Pruner: 为角色扮演雕琢轻量级模型

Jinsu Kim, Jihoon Tack, Noah Lee, Jongheon Jeong

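AI summary: Proposes the Persona-Pruner framework, which prunes a language model by isolating persona-specific sub-networks from a single description, greatly reducing compute cost while preserving role-playing performance, with up to 93.8% less performance drop than the strongest baseline.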

Comments 25 pages; ICML 2026; Code is available at https://github.com/jsu-kim/Persona-Pruner

AI中文摘要

语言模型（LMs）作为角色扮演聊天机器人展现出显著潜力，在给定角色或用户画像规范时，能够提供一致且风格化的交互。然而，将这些能力应用于现实世界应用（例如，众多NPC同时交互的生态系统）时，由于过高的计算成本，暴露了关键的效率问题。在本文中，我们质疑将完整的通用模型专用于单一角色的必要性，假设特定角色身份仅依赖于模型总容量的一小部分。我们观察到，朴素地剪枝LM通常会严重降低特定角色的角色扮演性能；它无法区分冗余知识和基本角色特征。我们提出Persona-Pruner，一个通过从单个描述中隔离特定角色的子网络来雕琢轻量级角色扮演模型的框架。我们的实验一致表明，Persona-Pruner在保留角色扮演性能方面比现有最先进的LLM剪枝技术有效得多，在RoleBench上使用LLM-as-a-judge评分，将性能下降从密集模型减少至多93.8%（相比最强基线），同时仍保持通用LLM能力。代码可在以下网址获取：https://github.com/jsu-kim/Persona-Pruner。

英文摘要

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at https://github.com/jsu-kim/Persona-Pruner.

2512.22671 2026-06-15 cs.CL cs.AI cs.LG Updated version

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

脆弱的知识，稳健的指令遵循：Llama-3.2中的宽度剪枝二分法

Pere Martra

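Affiliations: Independent Researcher

AI summary: Applies structured width pruning to GLU-MLP layers using the Peak-to-Peak Magnitude (PPM) criterion and finds that lowering the expansion ratio hurts parametric-knowledge tasks but improves instruction following, challenging the assumption that pruning causes uniform degradation.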

Comments 22 pages, 5 figures, 9 tables. Code available at https://github.com/peremartra/llama-glu-expansion-pruning

AI中文摘要

对Llama-3.2模型中GLU-MLP层的结构化宽度剪枝，以峰值幅度（PPM）准则为指导，揭示了降低扩展比如何系统性地影响不同模型能力的二分法。虽然依赖参数化知识的任务（如MMLU、GSM8K）和困惑度指标的性能随扩展比降低而可预测地下降，但指令遵循能力在2.4倍平衡比下得到提升（IFEval：Llama-3.2-1B中+4.8分/+46%，Llama-3.2-3B中+3.7分/+39%），且多步推理保持稳健（MUSR）。这种模式在两个评估模型大小上一致观察到，挑战了压缩研究中剪枝导致均匀退化的主流假设。为探究这一点，我们使用评估事实知识、数学推理、语言理解、指令遵循和真实性的综合基准套件，评估了七种扩展比配置。我们的分析将扩展比识别为一个关键架构参数，它选择性地重塑模型的任务性能轮廓，而不仅仅是作为压缩指标。

英文摘要

Structured width pruning of GLU-MLP layers in Llama-3.2 models, guided by the Peak-to-Peak Magnitude (PPM) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably with decreasing expansion ratios, instruction-following capabilities improve at the 2.4x equilibrium ratio (IFEval: +4.8 points / +46% in Llama-3.2-1B and +3.7 points / +39% in Llama-3.2-3B), and multi-step reasoning remains robust (MUSR). This pattern, observed consistently across both evaluated model sizes, challenges the prevailing assumption in compression research that pruning induces uniform degradation. To investigate this, we evaluated seven expansion ratio configurations using comprehensive benchmark suites that assess factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively reshapes the model's task performance profile, rather than merely serving as a compression metric.
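As a rough illustration of what width pruning to a target expansion ratio involves, the sketch below computes the ratio and keeps the intermediate neurons with the largest peak-to-peak weight range. The scoring rule (max minus min of one weight row per neuron), the rounding choice, and all function names are assumptions for illustration; the abstract does not spell out how PPM is computed across the GLU projections.

```python
def expansion_ratio(intermediate_size, hidden_size):
    """Expansion ratio of an MLP block: intermediate width / hidden width."""
    return intermediate_size / hidden_size


def neurons_to_keep(hidden_size, target_ratio):
    """Number of intermediate neurons that yields the target ratio (rounded)."""
    return round(hidden_size * target_ratio)


def peak_to_peak(row):
    """Assumed importance score: range of the weights attached to one neuron."""
    return max(row) - min(row)


def select_neurons(rows, keep):
    """Indices of the `keep` highest-scoring neurons, in original order."""
    ranked = sorted(range(len(rows)),
                    key=lambda i: peak_to_peak(rows[i]),
                    reverse=True)
    return sorted(ranked[:keep])


# Toy example: four neurons, keep the two with the widest weight range.
toy_rows = [[0.1, -0.1], [1.0, -1.0], [0.5, 0.0], [0.0, 0.0]]
kept = select_neurons(toy_rows, keep=2)
```

In a real GLU-MLP block the same kept indices would have to be applied to the gate, up, and down projections together so that the pruned layer stays shape-consistent.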

2606.13852 2026-06-15 cs.CL New submission

Hybrid Classical-Quantum Variational Autoencoder for Neural Topic Modeling

用于神经主题建模的混合经典-量子变分自编码器

Ivan Kankeu

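Affiliations: Ivan Kankeu

AI summary: Proposes a hybrid classical-quantum variational autoencoder that embeds parameterized quantum circuits in the inference network and uses a modified Gaussian Softmax posterior to reduce qubit requirements; on the AgNews dataset it outperforms existing neural topic models.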

AI中文摘要

神经主题模型能够实现可扩展的语义发现，但它们与量子硬件的集成仍基本未被探索。我们提出了一种概念验证的混合经典-量子变分自编码器（VAE）用于主题建模，在VAE推理网络中嵌入参数化量子电路，同时保留经典的主题-词解码器。为了解决量子硬件的资源限制，我们提出了一种改进的高斯Softmax后验，将潜在空间维度与要提取的主题数量解耦，使模型能够在低资源（10量子比特）的量子设备上运行。在AgNews数据集上，混合VAE优于最先进的神经主题模型（NTMs），达到了0.71的$C_v$连贯性分数和0.20的NPMI分数，同时保持了高主题多样性。为了比较，我们还构建了一个完全经典的变体，它在AgNews上也优于最先进的模型，并在潜在空间中表现出清晰的类别分离。这些结果表明，即使在NISQ时代的设备上，混合VAE在计算上也是可行的，并且代表了量子增强主题建模的一个有前景的方向。

英文摘要

Neural topic models enable scalable semantic discovery, but their integration with quantum hardware remains largely unexplored. We present a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) for topic modeling, embedding parameterized quantum circuits within the VAE inference network while retaining a classical topic-word decoder. To address the resource constraints of quantum hardware, we propose a modified Gaussian Softmax posterior that decouples latent space dimensionality from the number of topics to be extracted, enabling the model to operate with a low-resource 10-qubit quantum device. On the AgNews dataset, the hybrid VAE outperforms state-of-the-art neural topic models (NTMs), reaching a $C_v$ coherence score of 0.71 and an NPMI score of 0.20 while preserving high topic diversity. For comparison, we also construct a fully classical variant, which also outperforms state-of-the-art models on AgNews and exhibits clear class separation in the latent space. These results demonstrate that hybrid VAEs are computationally viable even on NISQ-era devices and represent a promising direction for quantum-enhanced topic modeling.

2606.13977 2026-06-15 cs.CL New submission

Creative Integration: A Decidable Criterion of Creativity

创造性整合：一个可判定的创造力标准

Yoshinori Nomura

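Affiliations: Mirage Mountain Technologies

AI summary: Proposes a decidable criterion for creative integration based on description-length compression, made decidable through four binary gates and delimited by a taxonomy of pseudo-integration, and validates it on a multi-domain corpus with four falsifiable tests.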

Comments 18 pages, 1 figure

AI中文摘要

"整合性"解决方案广受赞誉但鲜有定义：我们缺乏一种可操作的方式来区分真正的整合（使世界更易于描述）与整洁的重新描述。基于将创造力和智能视为压缩的思想脉络，我们为创造性整合（CI）给出了这样一个标准：当且仅当在固定描述语言下，描述长度严格缩短（C = L_pre/L_post > 1），且缩减位于冲突本身时，A与B之间真实冲突的解决即为CI。我们通过四个二元合取门使判断可判定，并通过一个伪整合分类法（命名并拒绝相似物）固定其外延。我们用一个精心策划的多领域语料库支持该标准，并且——关键的是——不是通过人类评分者间一致性，而是通过它可能失败的四个可证伪测试来验证：独立计算检查、对硬负例的区分、样本外预测和描述语言鲁棒性；所有测试均以余量通过。贡献不在于"创造力即压缩"，而在于其可判定性、区分性和语料库：据此，使一个举动真正具有创造性——而非仅仅是新颖——的是它压缩了一个冲突，新颖性和价值是下游症状；所有创造力是否都如此构成，我们作为一个明确的猜想陈述。我们仅声称C-1的符号；我们判断，而非生成。结果是一个可引用的基元，用于更广泛的计划。

英文摘要

"Integrative" solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration -- one that makes the world cheaper to describe -- from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and -- crucially -- validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not "creativity is compression" but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative -- rather than merely novel -- is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.
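To make the ratio C = L_pre/L_post concrete, here is a minimal sketch that uses zlib-compressed byte length as a stand-in for description length. The zlib proxy, the function names, and the sample sentences are assumptions for illustration only; the abstract fixes neither a description language nor the content of the four gates, and this sketch checks only the sign of C-1.

```python
import zlib


def description_length(text: str) -> int:
    """Proxy description length: bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(pre: str, post: str) -> float:
    """C = L_pre / L_post under the same (proxy) description language."""
    return description_length(pre) / description_length(post)


def strictly_shrinks(pre: str, post: str) -> bool:
    """True when C > 1, i.e. the post-resolution description is shorter."""
    return compression_ratio(pre, post) > 1.0


# Hypothetical conflict and resolution, described in plain English.
pre_text = ("A wants the window open for fresh air; B wants it closed to "
            "avoid the draft; they alternate every hour and both complain.")
post_text = "Open the window in the next room."
```

A general-purpose compressor is a crude proxy, so a check like this could at most support the shrinkage condition; it says nothing about whether the reduction is located in the conflict itself.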

说服指数：一个理论指导的说服分析框架

Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn

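Affiliations: University of Maryland, College Park; New York University

AI summary: Proposes the Persuasion Index (PI), a 15-dimension taxonomy grounded in psychology and communication theory with a 55-sub-feature implementation; evaluation on four datasets shows that it helps interpret persuasion-related rhetorical patterns and provides a lightweight predictive signal.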

AI中文摘要

识别有说服力的修辞线索在多个领域至关重要，从检测信息操纵、提高AI安全性到推进公共卫生沟通。我们提出说服指数（PI），这是一个基于心理学和传播学说服理论的15维分类法，以及一个使用55个子特征的透明实现，这些子特征基于词汇和规则检测器构建。该分类法是模块化的：单个检测器可以被替换，同时保留理论结构。通过在四个领域、风格和结果度量不同的公共数据集上评估PI，我们表明PI提供了一个共享的特征空间，用于解释与说服结果相关的修辞模式。线性模型表明，PI特征在保持计算轻量级的同时携带了有意义的预测信号。维度级分析揭示了PI维度与说服结果之间跨数据集的重复关联，同时也突出了主题和立场特定的变化。我们将PI作为开源包和Web界面发布，用于对人和AI中介的沟通进行原则性和可审计的分析。

英文摘要

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

2606.14626 2026-06-15 cs.CL New submission

Characterizing Cultural Localization in AI-Generated Stories

表征AI生成故事中的文化本地化

Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz

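Affiliations: Carnegie Mellon University

AI summary: Proposes a method that detects templated localization in AI-generated stories by identifying the lexical tokens that distinguish nationalities and measuring narrative similarity after removing them. Only 9-17% of the vocabulary accounts for variation across nationalities, and markers from 19 countries are on average offensive.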

Comments Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) Co-located with ACL 2026, San Diego, USA (non-archival)

AI中文摘要

人工智能的全球使用增加了对评估生成文化本地化内容（包括故事）能力的兴趣。故事中的文化本地化通常通过模板化本地化（在通用叙事中使用文化标记，如姓名、地点）或整体本地化（除文化标记外，情节、价值观和主题的变化）发生。我们提出了一种方法来衡量内容通过模板化本地化生成的程度。具体来说，我们识别出区分不同国籍故事的词汇标记，并测量移除这些标记后剩余叙事的相似性。在五个模型针对125个主题和193个国籍生成的故事中，我们的方法能够检测到仅有一小部分词汇（9-17%）解释了国籍间的差异，并且移除这些标记后剩余的叙事包含重复的多词序列，表明存在一个共享的文化无关叙事模板。最后，我们表征了文化标记的刻板性和冒犯性，发现来自19个国家（主要位于全球南方）的标记平均具有冒犯性。

英文摘要

The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.
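The two steps described above (find the tokens that distinguish nationalities, then compare what is left) can be sketched as follows. The marker rule used here (a token counts as a marker if it appears for exactly one nationality), the Jaccard similarity, and the toy stories are all assumptions for illustration; the abstract does not state which statistics the authors actually use.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def find_markers(stories):
    """stories maps nationality -> story text.

    Assumed rule: a token is a marker if it occurs for exactly one nationality.
    """
    seen_in = defaultdict(set)
    for nationality, text in stories.items():
        for token in set(tokenize(text)):
            seen_in[token].add(nationality)
    return {tok for tok, nats in seen_in.items() if len(nats) == 1}


def residual(text, markers):
    """Story tokens that remain after removing the markers."""
    return [tok for tok in tokenize(text) if tok not in markers]


def jaccard(a, b):
    """Set overlap of two token lists; 1.0 means identical vocabularies."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


# Hypothetical two-nationality example with a shared narrative template.
toy_stories = {
    "Japanese": "Aiko walked to the market in Kyoto.",
    "Peruvian": "Maria walked to the market in Lima.",
}
markers = find_markers(toy_stories)
similarity = jaccard(residual(toy_stories["Japanese"], markers),
                     residual(toy_stories["Peruvian"], markers))
```

A residual similarity close to 1.0 after marker removal is the pattern that would point to templated rather than holistic localization.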

2606.13690 2026-06-15 cs.CY cs.CL Cross-list

Indirect Computing Model with Indirect Formal Method

间接计算模型与间接形式化方法

Xiaohui Zou

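Affiliations: Institute for Higher Education; China University of Geosciences (Beijing); Institute of Synergistic Cultural Gene Engineering; Tsinghua Science Park; Sino-US Project: UC Berkeley Searle Research Bilingual Information Processing Group

AI summary: From the perspective of a collaborative intelligent computing system that combines human-computer interfaces with collaborative computing programs, the paper discusses the principles of optimized cloud computing supported by an indirect computing model and an indirect formal method. It introduces an indirect computing model and formal theory compatible with both large and small strings, presents a prototype design using Chinese information data as an example, and aims to optimize data centers into knowledge centers.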

Comments 10 pages, 6 figures

Journal ref: Software 2011, 32(5)

AI中文摘要

本文从人机界面与协同计算程序结合形成的协同智能计算系统视角，讨论了由间接计算模型与间接形式化方法组合支持的优化云计算技术原理。在系统回顾先前理论成果——图灵的可计算性理论、克林的小字符串形式理论、冯·诺依曼的数字计算机体系结构以及图灵关于AI判断的假说——对主流通用数字计算机范式的影响的基础上，作者重点介绍了一种兼容大小字符串的间接计算模型和间接形式理论。以中文信息数据为例，给出了协同智能计算系统原型的设计理念。其意义在于，该成果有助于将数据中心优化为知识中心。

英文摘要

This paper, from the perspective of a collaborative intelligent computing system formed by combining human-computer interfaces and collaborative computing programs, discusses the principles of optimized cloud computing technology supported by the combination of an indirect computing model and an indirect formal method. After systematically reviewing the influence of previous theoretical achievements (Turing's computability theory, Kleene's formal theory of small strings, von Neumann's digital computer architecture, and Turing's hypothesis on AI judgment) on the mainstream general-purpose digital computer paradigm, the author focuses on introducing an indirect computing model and an indirect formal theory compatible with both large and small strings. Using Chinese information data as an example, the design concept of a collaborative intelligent computing system prototype is presented. The significance is that this achievement facilitates the optimization of cloud computing from data centers to knowledge centers.

2606.14113 2026-06-15 cs.SE cs.CL Cross-list

Simulating Students' Java Programming Errors with Large Language Models

用大型语言模型模拟学生的Java编程错误

Ali Keramati, Jie Cao, Iman Mohammadi, Mark Warschauer, Yang Shi

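Affiliations: University of California, Los Angeles

AI summary: Explores using LLMs to simulate students' programming errors, evaluating five models on diversity and alignment; Claude Sonnet 4 achieves the best balance, and synthetic errors are hard to distinguish from authentic ones.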

AI中文摘要

理解学生在编程中的错误是编程教育的基石，然而，对于任何新设计的任务，获取具有代表性的学生错误集仍然缓慢且成本高昂，因为真实的提交只有在广泛的课堂部署后才能积累。本文探讨了大型语言模型（LLMs）是否可以通过模拟代码提交中的真实逻辑错误，作为学生的可扩展代理。使用包含37个问题的74,000多个独特学生Java提交的CodeWorkout数据集，我们在三种主流提示策略下评估了五种LLM：输入-输出（IO）、思维链（CoT）和迭代自我改进。我们沿着两个关键维度评估性能：多样性（不同错误模式的范围）和对齐性（与真实学生错误的一致性），并考察这些维度如何随编程任务的困难程度变化。我们的定量发现表明，虽然所有模型都能生成多样化的错误，但它们与人类提交的对齐性存在差异：Claude Sonnet 4实现了最平衡的性能。此外，我们进行了一项盲法专家注释研究（N = 401），比较合成错误和真实错误。这一定性分析证实，生成的错误在功能上与真实学生错误无法区分。此外，更高困难程度的问题会引发更多样化但更不像学生的错误。这些结果突出了使用LLM模拟人类学习者的权衡，并为将合成错误集成到可教学代理、智能辅导系统和大规模学习分析中提供了设计考虑。

英文摘要

Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (agreement with authentic student mistakes), and examine how these vary with the struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment with human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

2606.14230 2026-06-15 cs.CV cs.CL Cross-list

洪流与收获：通过极限语言生成视角证明琐碎知识对于生成有价值数学的必要性

Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

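Affiliations: University of New South Wales; University of Sydney; University of Cambridge

AI summary: Using a language-generation-in-the-limit model, the paper proves that in formal mathematics generation a verifier cannot substitute for taste: covering unrecorded valuable mathematics necessarily requires producing an infinite but asymptotically negligible stream of trivial statements.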

AI中文摘要

与证明助手耦合的AI系统现在能够大规模生成形式化数学，而验证器可验证的内容与数学家认为有价值的内容之间的差距已成为制约因素。我们将有价值数学的生成建模为极限下的嵌套语言生成：通过成员查询预言机（证明检查器）访问的可验证形式语言$F$包含一个未知的有价值语言$H \in \mathcal{H}$，该语言仅通过核心$C \subseteq H$的对抗性枚举揭示，其精确密度为$\alpha$（文献）。每个输出要么是有价值的（$\in H$），要么是琐碎的（$\in F \setminus H$），要么是幻觉（$\notin F$）。我们解决了四个问题。第一，验证器不是品味：允许广度生成的集合恰好是无预言机模型中的那些，按纤维由Angluin条件刻画。第二，验证器确实提供了可靠覆盖，覆盖所有未见过的有价值陈述同时仅断言有效陈述：有验证器可能，无验证器不可能；它将不可避免的错误从虚假转移到琐碎。第三，核心地，关于紧族存在尖锐二分法：生成有限个琐碎语句的生成器达到最优覆盖$\alpha/2$，而任何无限琐碎语句的允许，即使以消失速率，也将最优值跃升至$1-\alpha/2$（两者均为紧界，对于以候选交集形式呈现的核心），且存在一个生成器同时达到两端。转变在于琐碎语句的数量而非速率；间隙$1-\alpha$是未记录的质量。第四，两种机制在数学的压缩模型中实例化。完美的验证器无法替代品味：正确但无价值的语句的无界流并非工程事故，而是可证明的必要性，因为覆盖未记录的有价值数学需要无限但渐近可忽略的已认证琐碎语句流。

英文摘要

AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.

2604.24942 2026-06-15 cs.CL q-bio.NC Updated version

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

基于独立成分的故事理解过程中大脑活动的编码模型

Kamya Hari, Taha Binhuraib, Jin Li, Cory Shain, Anna A. Ivanova

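Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology; School of Psychological and Brain Sciences, Georgia Institute of Technology; Department of Linguistics, Stanford University

AI summary: Proposes an independent-component-based encoding framework that separates stimulus-driven from noise-driven signals in fMRI data and predicts independent-component time series from language model representations. It identifies auditory and language networks and validates the interpretability and cross-subject consistency of the components.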

Comments Accepted to CCN 2026 (Proceedings Track)

AI中文摘要

编码模型为连接连续刺激特征与神经活动提供了强大框架；然而，传统的体素方法受限于测量噪声、个体间变异以及由编码重叠神经信号的空间相关体素引起的冗余。本文提出了一种基于独立成分（IC）的编码框架，从fMRI数据中分离刺激驱动和噪声驱动信号。我们使用一部分数据将自然故事聆听过程中的连续fMRI数据分解为独立成分，并在独立数据上训练编码模型，从语言输入的大型语言模型表示中预测独立成分时间序列。跨被试来看，一部分独立成分表现出持续的高可预测性。这些独立成分在空间和时间上跨被试一致，并包括已知在故事聆听期间响应的认知网络（听觉和语言）。听觉成分时间序列与声学刺激特征强相关，突出了所识别成分时间序列的可解释性。被ICA-AROMA识别为噪声或运动相关伪影的成分表现出普遍较差的预测性能，证实高预测成分反映的是真实的刺激相关神经信号而非混淆因素。总体而言，基于独立成分的编码模型能够在功能网络层面进行分析，适应个体间网络位置的变异性，并提供易于跨被试比较的可解释结果。代码见：this https URL

英文摘要

Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects. Code provided at: https://github.com/kamyahari/IC-Encoding-Models.git

2606.04883 2026-06-15 cs.CL cs.LO Updated version

从点到面孔：个体在视觉意象能力上的差异预测Ganzflicker诱导的幻觉内容

Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson

Affiliations: Department of Cognitive Science, University of California, San Diego (USA); Department of Psychology, Institute of Population Health, University of Liverpool (UK); Department of Computer Science, University of California, San Diego (USA); Kavli Institute for Brain and Mind, San Diego (USA)

AI summary: Examines how differences in visual imagery ability shape the content of Ganzflicker-induced hallucinations. Natural language processing of descriptions from more than 4,000 participants shows that strong imagers describe complex, naturalistic content, while weak imagers report simple geometric patterns.

DOI: 10.1093/nc/niag016

AI中文摘要

一种快速交替的红黑显示称为Ganzflicker，会引发反映视觉系统生成能力的视觉幻觉。个体在视觉意象程度上存在差异，从无到生动不等。最近的提议认为，这种意象光谱中的视觉系统差异也应影响其他内部生成视觉体验的复杂性。在这里，我们利用自然语言处理工具分析超过4000名参与者的自由文本描述，探讨不同意象表型的人在Ganzflicker诱导的幻觉中是否看到不同内容。主题建模显示，强意象者描述复杂、自然的内容，而弱意象者报告简单几何图案。使用众包的传感器运动规范，我们还发现，强意象者使用更丰富的感知关联语言。这些发现可能反映个体在早期视觉区域和与意象光谱相关的更高阶区域之间协调性的差异。

英文摘要

A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Individuals vary in their degree of visual imagery, ranging from absent to vivid imagery. Recent proposals suggest that differences in the visual system along this imagery spectrum should also influence the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind's eye during Ganzflicker-induced hallucinations. Topic modeling of descriptions revealed that strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Using crowd-sourced sensorimotor norms, we also found that participants with stronger imagery used language with richer perceptual associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

2511.17637 2026-06-15 cs.LG cs.CL Updated version

PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

PocketLLM: 通过元网络实现大语言模型的终极压缩

Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, Kai Han

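Affiliations: University of Science and Technology of China

AI summary: Proposes PocketLLM, which compresses large language models in a latent space via meta-networks, using an encoder, a compact codebook, and a lightweight decoder; experiments show that accuracy is largely preserved even at high compression ratios.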

Comments AAAI 2026 camera ready

DOI: 10.1609/aaai.v40i39.40610
Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33250-33258 (2026)

AI中文摘要

随着大语言模型（LLMs）的持续增长，将其存储和传输到边缘设备变得越来越具有挑战性。传统方法如量化和剪枝在不牺牲精度的情况下难以实现极端压缩。本文介绍了一种新的压缩方法PocketLLM，通过元网络在潜在空间中压缩LLMs。提出一个简单的编码器网络，将LLMs的权重投影到离散的潜在向量中，然后使用紧凑的代码本进行表示。轻量级的解码器网络用于将代码本的代表性向量映射回原始权重空间。该方法仅需一个小解码器、简洁的代码本和一个索引即可实现LLMs中大权重的显著压缩。大量实验表明，PocketLLM在显著的压缩比下仍能保持优越的性能，例如将Llama 2-7B压缩10倍，精度损失微不足道。

英文摘要

As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. This method allows for significant compression of the large weights in LLMs, consisting solely of a small decoder, a concise codebook, and an index. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
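The storage idea behind a codebook plus an index can be sketched with plain vector quantization: each chunk of weights is replaced by the index of its nearest codebook vector, and decompression looks the vectors up again. This is a simplified stand-in, not the PocketLLM method: it has no learned encoder or decoder network, and the function names, the fixed codebook, and the toy weights are assumptions for illustration.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def compress(weights, codebook, dim):
    """Split a flat weight list into dim-sized chunks and store one index each.

    Assumes len(weights) is a multiple of dim.
    """
    chunks = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest(chunk, codebook) for chunk in chunks]


def decompress(indices, codebook):
    """Rebuild an approximate flat weight list from the stored indices."""
    return [x for i in indices for x in codebook[i]]


# Toy example: four weights, two-dimensional chunks, two codebook vectors.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_weights = [0.1, -0.1, 0.9, 1.2]
toy_indices = compress(toy_weights, toy_codebook, dim=2)
toy_restored = decompress(toy_indices, toy_codebook)
```

Only the indices and the small codebook need to be stored, so the saving grows with the number of chunks that share each codebook vector; the reconstruction is lossy, which is why the abstract reports accuracy alongside the compression ratio.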

2506.00784 2026-06-15 cs.CL Updated version

Research Borderlands: Analysing Writing Across Research Cultures

研究边疆：跨研究文化中的写作分析

Shaily Bhatt, Tal August, Maria Antoniak

Affiliations: Language Technologies Institute, Carnegie Mellon University; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Pioneer Centre for AI, University of Copenhagen; Allen Institute for Artificial Intelligence (Ai2)

AI summary: Through interviews with interdisciplinary researchers, the paper surfaces language norms that differ across research cultures, evaluates the cultural competence of LLMs, and illustrates the efficacy of a human-centered approach to measuring cultural norms.

Comments Accepted to ACL 2025 (Main)

DOI: 10.18653/v1/2025.acl-long.1272

AI中文摘要

提升语言技术的文化能力至关重要。然而，大多数最新研究很少与所研究的社区互动，而是依赖合成设置和不完美的文化代理。本文采用以人类为中心的方法，发现并测量基于语言的文化规范以及LLM的文化能力。我们专注于一种文化——研究文化，以及一种任务——在不同研究文化中适应写作。通过与跨学科研究者进行访谈，这些专家擅长在不同文化间转换，我们创建了结构、风格、修辞和引用规范的框架，这些规范在不同研究文化中有所不同。我们通过一组计算度量来操作这些特征，并用于（a）大规模揭示人类写作研究论文中的潜在文化规范；以及（b）突出LLM缺乏文化能力，以及其倾向于同质化写作的倾向。总体而言，我们的工作展示了以人类为中心的方法在测量人类写作和LLM生成文本中的文化规范的有效性。

英文摘要

Improving the cultural competence of language technologies is important. However, most recent work rarely engages with the communities it studies and instead relies on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discovering and measuring language-based cultural norms and the cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.

2406.11565 2026-06-15 cs.CL cs.CY Updated version

Extrinsic Evaluation of Cultural Competence in Large Language Models

大型语言模型中文化能力的外在评估

Shaily Bhatt, Fernando Diaz

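Affiliations: Carnegie Mellon University

AI summary: Evaluates cultural competence extrinsically on two text generation tasks. Nationality cues in the prompt do change model outputs, but the text similarity of outputs for different countries correlates only weakly with those countries' cultural values.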

Comments Accepted to EMNLP Findings 2024

DOI: 10.18653/v1/2024.findings-emnlp.942

AI中文摘要

多样化的用户与语言技术之间的互动要求后者输出具有文化相关性和敏感性。先前的工作评估了模型对文化规范、价值观和物品的知识，但未考虑这种知识如何在下游应用中体现。在本工作中，我们专注于两个文本生成任务的外在评估：开放性问答和故事生成。我们定量和定性地评估当提示中明确提示文化，特别是国籍时，模型输出的变化。尽管我们发现当改变国籍和特征文化相关词汇时，模型输出确实有所变化，但我们还发现不同国家输出的相似性与这些国家的文化价值观之间存在弱相关性。最后，我们讨论了在面向用户的任务中设计全面评估文化能力的重要考虑因素。

英文摘要

Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior work has evaluated models' knowledge of cultural norms, values, and artifacts without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary across nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.

2406.03221 2026-06-15 cs.CL cs.IR Updated version

Linking Named Entities in Diderot's Encyclopédie to Wikidata

将狄德罗《百科全书》中的命名实体链接到Wikidata

Pierre Nugues

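Affiliations: Université de Lausanne

AI summary: Links more than 10,300 Encyclopédie entries to Wikidata identifiers, connecting them to the knowledge graph, and describes the annotation process for geographic and human entities along with application examples.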

Comments 6 pages, 3 figures

Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 10610--10615

AI中文摘要

狄德罗《百科全书》是一部18世纪欧洲的知识参考作品，旨在收集当时的知识。维基百科有相同的目标，但范围更大。然而，两者之间缺乏数字连接可能阻碍其比较和知识演变的研究。维基百科的关键要素是Wikidata，它通过结构化数据图谱支持文章。本文描述了对《百科全书》中超过10,300个条目进行注释，以连接到图谱。我们考虑了地理和人文实体。《百科全书》不包含传记条目，因为它们大多作为位置的子条目出现。我们提取了所有地理条目，并完全注释了所有包含人类实体描述的条目。这代表了超过2,600个链接，指向位置或人类实体。此外，我们注释了超过9,500个仅包含地理内容的条目。我们描述了注释过程以及应用示例。此资源可在https://github.com/pnugues/encyclopedie_1751获取。

英文摘要

Diderot's Encyclopédie is a reference work from 18th-century Europe that aimed at collecting the knowledge of its era. Wikipedia has the same ambition with a much greater scope. However, the lack of digital connection between the two encyclopedias may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata, which backs the articles with a graph of structured data. In this paper, we describe the annotation of more than 10,300 of the Encyclopédie entries with Wikidata identifiers, enabling us to connect these entries to the graph. We considered geographic and human entities. The Encyclopédie does not contain biographic entries as they mostly appear as subentries of locations. We extracted all the geographic entries and we completely annotated all the entries containing a description of human entities. This represents more than 2,600 links referring to locations or human entities. In addition, we annotated more than 9,500 entries having a geographic content only. We describe the annotation process as well as application examples. This resource is available at https://github.com/pnugues/encyclopedie_1751

1. Large Language Models and Foundation Models (22 papers)

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Decoupled Mixture-of-Experts for Parametric Knowledge Injection

GAGPO: Generalized Advantage Grouped Policy Optimization

SuperThoughts: Reasoning Tokens in Superposition

Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Residual Context Diffusion Language Models

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

Fractured Chain-of-Thought Reasoning

Token-Level LLM Collaboration via FusionRoute

Rethinking the Trust Region in LLM Reinforcement Learning

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Sub-Token Routing for KV Cache Compression

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Multi-component Causal Tracing in Large Language Models

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

2. Information Extraction, Retrieval, and Question Answering (11 papers)

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards

ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

3. Dialogue Systems and Agents (11 papers)

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

WorkBench Revisited: Workplace Agents Two Years On

Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Large Language Model Agents Are Not Always Faithful Self-Evolvers

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

4. Semantics, Syntax, and Linguistic Analysis (3 papers)

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

5. Multimodal Language Processing (7 papers)

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Orchestra-o1: Omnimodal Agent Orchestration

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

Gaze Heads: How VLMs Look at What They Describe

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

EmoMind: Decoding Affective Captions from Human Brain fMRI

6. Joint Speech-Language and Audio-Text Processing (8 papers)

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

Multimodal Speaker Identification in Classroom Environments

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

OLaPh: Optimal Language Phonemizer

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

7. Evaluation, Datasets, and Benchmarks (34 papers)

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

The Culture Funnel: You Can't Align What isn't in the Data

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

SANA: What Matters for QA Agents over Massive Data Lakes?

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

OdysSim: Building Foundation Models for Human Behavior Simulation