arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
1. Large Language Models and Foundation Models (22 papers)

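2606.13940 2026-06-15 cs.CL New submission

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding

Affiliations: Northwestern University

AI summary: Compares prompting, supervised fine-tuning (SFT), and reinforcement learning (GRPO and the newly proposed PHI curriculum) on ICD coding, and finds that post-training, especially SFT and GRPO, substantially improves the coding performance of generative LLMs, suggesting that the bottleneck is model adaptation rather than the generative paradigm itself.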

Abstract

Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning, and reinforcement learning under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends GRPO to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at https://github.com/AlexandreWANG915/LLM4ICD.
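
The difference between code-set prediction and macro-level performance mentioned in the abstract above can be made concrete with a small metric sketch. This is an illustration only, not the authors' evaluation code: the function name `micro_macro_f1`, the toy ICD-10 codes, and the choice to average macro-F1 over the codes that appear in the gold or predicted sets (rather than over a full taxonomy) are assumptions.

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Compute micro- and macro-F1 for multi-label code prediction.

    gold, pred: lists of sets of codes, one set per clinical note.
    Assumption: macro-F1 averages over codes seen in gold or pred.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in p & g:
            tp[code] += 1  # predicted and correct
        for code in p - g:
            fp[code] += 1  # predicted but not in gold
        for code in g - p:
            fn[code] += 1  # missed code

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    codes = set(tp) | set(fp) | set(fn)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codes) / len(codes) if codes else 0.0
    return micro, macro

# Toy example: a model that always predicts the most frequent code.
gold = [{"I10", "E11.9"}, {"I10"}, {"J45.909"}]
pred = [{"I10"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case micro-F1 stays moderate because the frequent code is predicted correctly, while macro-F1 is much lower because the two rarer codes are never recovered. That is the kind of gap that recall over rare codes is meant to close.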

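2606.14243 2026-06-15 cs.CL New submission

Decoupled Mixture-of-Experts for Parametric Knowledge Injection

Baoqing Yue, Weihang Su, Qingyao Ai, Yichen Tang, Changyue Wang, Jiacheng Kang, Jingtao Zhan, Yiqun Liu

Affiliations: Department of Computer Science and Technology, Tsinghua University

AI summary: Proposes the Decoupled Mixture-of-Experts (DMoE) architecture, which converts external knowledge into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts when the base model's knowledge is insufficient, achieving parameter-level knowledge augmentation and outperforming retrieval and adapter baselines on knowledge-intensive benchmarks.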

Abstract

Knowledge injection aims to equip large language models (LLMs) with external, domain-specific, or time-sensitive knowledge. Existing approaches typically face a trade-off between flexibility and integration: retrieval-augmented generation keeps knowledge outside the model but only provides prompt-level augmentation, whereas post-training based methods encode new knowledge into shared parameters but may introduce catastrophic forgetting, knowledge conflict, and costly updates. In this paper, we propose Decoupled Mixture-of-Experts (DMoE), a modular architecture for parametric knowledge injection that decouples both experts and the router from the base model. DMoE converts external knowledge corpora into independently updatable expert modules and uses a lightweight uncertainty-aware router to activate relevant experts only when the base model lacks sufficient knowledge during generation. To support efficient auto-regressive inference, DMoE attaches experts only to the final-layer feed-forward network, preserving KV-cache reuse while enabling parameter-level knowledge augmentation. Experiments on knowledge-intensive benchmarks show that DMoE consistently improves answer quality over retrieval and adapter-based baselines.
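
The routing idea in the abstract above (activate an expert only when the base model seems to lack knowledge) can be sketched as follows. The abstract does not say how uncertainty is measured, so this sketch assumes next-token entropy as the signal; the function names, the threshold value, and the expert labels are hypothetical and not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(next_token_probs, expert_scores, threshold=1.0):
    """Return the name of the expert to activate, or None to keep the base model.

    Assumption: high entropy of the base model's next-token distribution
    stands in for "the base model lacks sufficient knowledge".
    expert_scores: dict mapping hypothetical expert names to relevance scores.
    """
    if entropy(next_token_probs) <= threshold:
        return None  # base model is confident; no expert is attached
    return max(expert_scores, key=expert_scores.get)

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
scores = {"medicine": 0.9, "law": 0.2}

choice_confident = route(confident, scores)
choice_uncertain = route(uncertain, scores)
```

A gate of this kind leaves the base model untouched on tokens it already predicts confidently, which is one plausible way to read the "only when the base model lacks sufficient knowledge" condition.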

2605.13217 2026-06-15 cs.CL cs.AI cs.LG New submission

GAGPO: Generalized Advantage Grouped Policy Optimization

Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Meituan

AI summary: GAGPO is a reinforcement learning method that needs no value model; it uses a grouped value proxy and an action-level importance ratio to achieve precise temporal credit assignment in multi-turn tasks, and experiments show that it outperforms existing baselines on ALFWorld and WebShop.

Abstract

Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
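
To illustrate what GAE-style advantages over a non-parametric value proxy might look like, here is a minimal sketch. The recursion in `gae_advantages` is the standard generalized advantage estimator; everything else is an assumption made for illustration: the abstract does not specify how the grouped proxy is built, so `grouped_value_proxy` simply averages the final outcome over the rollouts in a group that visited the same state key. Group-wise normalization and the action-level importance ratio are omitted.

```python
from collections import defaultdict

def grouped_value_proxy(rollouts):
    """Hypothetical proxy: V(s) = mean final outcome of group rollouts visiting s.

    rollouts: list of (state_keys, outcome) pairs sampled for the same task.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for states, outcome in rollouts:
        for s in set(states):
            totals[s] += outcome
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

def gae_advantages(states, outcome, value, gamma=1.0, lam=0.95):
    """Standard GAE recursion with a sparse reward at the final step."""
    rewards = [0.0] * (len(states) - 1) + [outcome]
    values = [value[s] for s in states] + [0.0]  # terminal value is 0
    adv, running = [0.0] * len(states), 0.0
    for t in reversed(range(len(states))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Two rollouts that share the first state and then diverge.
rollouts = [(["start", "kitchen"], 1.0), (["start", "hall"], 0.0)]
value = grouped_value_proxy(rollouts)
adv_success = gae_advantages(*rollouts[0], value)
adv_failure = gae_advantages(*rollouts[1], value)
```

In this toy group the credit lands on the decision taken at the shared state `start`, where the two rollouts diverge, rather than being spread evenly over every step of the trajectory.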

2606.13862 2026-06-15 cs.LG cs.AI cs.CL Cross-list

SuperThoughts: Reasoning Tokens in Superposition

Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos

Affiliations: University of Wisconsin-Madison; Microsoft Research; Independent; Princeton University; Rice University

AI summary: Proposes SuperThoughts, which compresses pairs of consecutive CoT tokens into a single latent representation and decodes them with a multi-token prediction module, doubling inference throughput while preserving training supervision and achieving roughly 20-30% CoT length reduction with minimal accuracy loss.

Abstract

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to a lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We fine-tune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, and Qwen2.5-Math-14B-Instruct and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves roughly 20-30% CoT length reduction while maintaining accuracy with minimal degradation (a drop of 1-2 accuracy points on most tasks).

2606.14368 2026-06-15 cs.LG cs.CL Cross-list

Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Woohyeon Byeon, Jiwon Jeon, Jeonghye Kim, Youngchul Sung

Affiliations: KAIST

AI summary: Proposes On-Policy Co-Distillation (OPCoD), which uses cognizance-based gating and feedback anchoring so that two models, each strong in a different domain, tutor each other and achieve Pareto improvement.

Abstract

We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student's self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.

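2606.14672 2026-06-15 cs.AI cs.CL Cross-list

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes the Parallel-Synthesis framework, which synthesizes directly from the KV caches of parallel worker agents and avoids redundant text concatenation; across nine datasets it matches or outperforms text-based synthesis on seven and stays close on the other two, and it reduces time-to-first-token by 2.5x-11x.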

Abstract

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

2505.23277 2026-06-15 cs.CL cs.AI Updated version

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Yong Zhang, Heng Li, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao

Affiliations: Ping An Technology (Shenzhen) Co., Ltd., China; University of Science and Technology of China; University of Electronic Science and Technology of China

AI summary: Proposes Sentinel, a lightweight sentence-level compression framework that decodes inference-time context utilization from the head-wise attention patterns of frozen LLMs and compresses with a single non-autoregressive forward pass; on LongBench, a 0.5B proxy model achieves up to 5x compression with performance comparable to methods built on 7B-scale models.

Comments: Preprint

Abstract

Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Existing context compression methods typically rely on heuristic relevance estimation or supervised compression models rather than on how LLMs utilize retrieved context during inference. We propose Sentinel, a lightweight sentence-level compression framework that decodes inference-time contextual utilization behaviors from head-wise attention patterns of frozen LLMs. To ground supervision in retrieval-dependent answering behavior, Sentinel trains a lightweight probe using QA examples where the model succeeds only when retrieved context is available. Sentinel performs compression using only a single non-autoregressive forward pass without dedicated compression training or autoregressive scoring. Empirically, we find that effective contextual utilization signals remain accessible even in compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5x compression while attaining question-answering performance competitive with compression methods built on 7B-scale models. Despite being trained only on English QA data, Sentinel also generalizes effectively to Chinese and out-of-domain settings.

2601.22954 2026-06-15 cs.CL cs.AI Updated version

Residual Context Diffusion Language Models

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

Affiliations: University of California, Berkeley

AI summary: Proposes the Residual Context Diffusion (RCD) module, which improves the decoding efficiency of diffusion language models by recycling contextual residuals from discarded tokens, raising accuracy by 4-11 percentage points on long and short CoT tasks with minimal extra computation.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.

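2603.23530 2026-06-15 cs.CL cs.AI cs.LG Updated version

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

Affiliations: University of Washington

AI summary: Viewed through the lens of prospective memory from cognitive psychology, the study finds that compliance with formatting instructions drops by 2-21% when large language models perform demanding tasks, and proposes a salience-enhanced format to restore compliance.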

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a lens inspired by prospective memory in cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
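
A deterministic programmatic checker of the kind the abstract above refers to can be very small. The two constraints below (a closing marker as a terminal constraint, a banned word as an avoidance constraint) and all function names are hypothetical examples chosen for illustration; the paper's actual constraint set is not given in the abstract.

```python
import re

def ends_with_marker(response, marker="END OF RESPONSE"):
    """Terminal constraint: the response must close with the given marker."""
    return response.rstrip().endswith(marker)

def avoids_word(response, word):
    """Avoidance constraint: the word must not appear anywhere (case-insensitive)."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is None

def compliance_rate(responses, checker):
    """Fraction of responses that satisfy a checker."""
    return sum(checker(r) for r in responses) / len(responses)

responses = [
    "The answer is 42.\nEND OF RESPONSE",
    "The answer is 42.",  # forgot the terminal marker
]
terminal_rate = compliance_rate(responses, ends_with_marker)
avoidance_rate = compliance_rate(responses, lambda r: avoids_word(r, "however"))
```

Because each check is a pure string test, the same response always receives the same verdict, which is what makes compliance numbers comparable across models without a judge model.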

2605.21363 2026-06-15 cs.CL Updated version

"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu

Affiliations: KAIST; Carnegie Mellon University; Seoul National University

AI summary: Proposes the CoTrace framework for measuring and exposing goal-level AI contributions in collaboration, finding that models contribute little to goal shaping but play a substantial role in introducing concrete requirements and exerting indirect influence, and that interaction design affects model behavior.

Abstract

As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more to introducing lower-level concrete requirements and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

2606.12881 2026-06-15 cs.CL cs.LG Updated version

Direct Preference Optimization for Chatbot Fine-Tuning: An Empirical Study

Dezhi Yu, Yvonne Qiu, ShuoJia Fu

Affiliations: University of Science and Technology of China

AI summary: An empirical study of Direct Preference Optimization (DPO) for chatbot fine-tuning, showing that it simplifies the training pipeline, improves computational efficiency, and achieves competitive performance, although training instability is observed.

Comments: 7 pages, 3 figures, 1 table. All authors contributed equally

Abstract

We present an approach to fine-tuning large language models using Direct Preference Optimization (DPO), a reinforcement learning technique. Our experimental results demonstrate that DPO simplifies the training pipeline, improves computational efficiency, and achieves competitive performance. The evaluation using BLEU, ROUGE, and cosine similarity metrics indicates effective learning and convergence, though further investigation is needed to address observed training instability.
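
For readers unfamiliar with the objective, the widely used DPO loss for a single preference pair can be written in a few lines. This is a sketch of the standard published formulation, not this paper's training code; the function name and the example log-probabilities are made up for illustration, and the inputs are assumed to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = (log pi(y_w|x) - log pi_ref(y_w|x))
           - (log pi(y_l|x) - log pi_ref(y_l|x))
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: the loss equals log(2).
loss_at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy prefers the chosen response more than the reference does: lower loss.
loss_improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss only needs log-probabilities from two forward passes per response, with no sampled rollouts or separate reward model, which is where the simpler training pipeline comes from.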

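2606.12941 2026-06-15 cs.CL Updated version

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

Affiliations: The University of Melbourne; Google Research Australia

AI summary: Addresses the drop of up to 65% in LLM accuracy when information arrives in fragments across conversation turns by training models to maintain a compact rolling memory instead of a growing history; a low-cost sharding pipeline converts single-turn QA into multi-turn fragmented-information episodes, and the trained memory-augmented policy significantly improves multi-turn accuracy and generalizes zero-shot to harder tasks.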

Abstract

When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.

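2505.12992 2026-06-15 cs.LG cs.AI cs.CL stat.ML Updated version

Fractured Chain-of-Thought Reasoning

Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong

Affiliations: University of Amsterdam; eBay; Microsoft; Google Research; Salesforce

AI summary: Proposes Fractured Sampling, an inference-time strategy that truncates reasoning chains and varies the number of trajectories and the number of solutions per trajectory to improve the accuracy-cost trade-off.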

Abstract

Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches the full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
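
The abstract above reports results as Pass@k against token budget. The abstract does not state which estimator is used, so the sketch below shows the commonly used unbiased Pass@k estimator, computed from n samples of which c are correct, as an assumption about how such numbers are typically obtained.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn, c: number of correct samples, k: budget of attempts.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=10, c=3, k=1)    # equals the plain accuracy c / n
p2 = pass_at_k(n=4, c=1, k=2)
p10 = pass_at_k(n=10, c=3, k=10)  # every sample is used, so success is certain
```

Plotting this quantity against the total number of generated tokens is one way to compare sampling strategies that spend their budget differently.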

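2601.05106 2026-06-15 cs.AI cs.CL cs.LG Updated version

Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

AI summary: Proposes FusionRoute, a framework in which a lightweight router selects the most suitable expert at each decoding step and adds complementary logits to refine the next-token distribution, addressing the difficulty of achieving strong performance across domains with a single general-purpose model and outperforming other methods on multiple benchmarks.

Comments: 25 pages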

Abstract

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
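
The two roles of the router described in the abstract above, selecting an expert and adding a complementary logit, can be sketched for a single decoding step. This is a toy illustration under assumptions: the function names, the argmax selection rule, and the three-token vocabulary are hypothetical, and in the paper the routing scores and complementary logits would come from a trained router rather than hand-written lists.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_next_token_dist(expert_logits, routing_scores, complementary_logits):
    """Pick one expert, then refine its logits by element-wise addition.

    expert_logits: one logit vector per expert (same vocabulary).
    routing_scores: one score per expert; the highest-scoring expert is chosen.
    complementary_logits: the router's own logit vector over the vocabulary.
    """
    chosen = max(range(len(routing_scores)), key=routing_scores.__getitem__)
    fused = [e + c for e, c in zip(expert_logits[chosen], complementary_logits)]
    return chosen, softmax(fused)

expert_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
routing_scores = [0.2, 0.8]
complementary_logits = [0.0, 0.0, 3.0]  # router corrects toward token 2

chosen, probs = fused_next_token_dist(expert_logits, routing_scores, complementary_logits)
```

Here neither expert prefers token 2 on its own, yet the fused distribution does, which illustrates how the added logits let the system reach outputs that pure expert selection cannot.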

2602.04879 2026-06-15 cs.LG cs.AI cs.CL Updated version

Rethinking the Trust Region in LLM Reinforcement Learning

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

Affiliations: University of California, Berkeley; University of Toronto

AI summary: To address the training instability that PPO exhibits in LLM fine-tuning because of large vocabularies, proposes the DPPO algorithm, which constrains updates directly on policy divergence, and introduces efficient approximations.

Abstract

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
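
The mismatch between ratio clipping and actual distribution shift can be seen with two numbers. The sketch below is an illustration under assumptions, not the paper's algorithm: it takes the "Binary" approximation to mean collapsing the vocabulary into two outcomes (the sampled token versus everything else), in which case total variation reduces to the absolute change in the sampled token's probability. The clip range of 0.2 and the probabilities are made-up examples.

```python
def prob_ratio(new_p, old_p):
    """PPO-style importance ratio for the single sampled token."""
    return new_p / old_p

def inside_clip(ratio, eps=0.2):
    """True if the ratio lies within PPO's [1 - eps, 1 + eps] range."""
    return 1.0 - eps <= ratio <= 1.0 + eps

def binary_tv(new_p, old_p):
    """Total variation between two-outcome distributions
    {sampled token, everything else}: 0.5 * (|dp| + |-dp|) = |dp|."""
    return abs(new_p - old_p)

rare = (0.003, 0.001)   # (new, old) probability of a low-probability token
common = (0.75, 0.90)   # (new, old) probability of a high-probability token

rare_ratio, common_ratio = prob_ratio(*rare), prob_ratio(*common)
rare_tv, common_tv = binary_tv(*rare), binary_tv(*common)
```

The rare token's ratio is about 3, far outside the clip range, even though the distribution moved by only 0.002 in total variation. The common token's ratio of about 0.83 passes the clip test although the distribution moved by 0.15. This matches the asymmetry the abstract describes.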

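2602.14169 2026-06-15 cs.LG cs.AI cs.CL Updated version

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu

Affiliations: Institute of Software, Chinese Academy of Sciences; City University of Hong Kong; Baidu

AI summary: To address inefficient exploration in reinforcement learning for large language models, proposes Deep Dense Exploration (DDE), which identifies recoverable pivot states in failed trajectories and densely resamples locally, combined with a dual-stream optimization objective; it outperforms existing methods on mathematical reasoning benchmarks.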

Abstract

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on pivots: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO

2604.18419 2026-06-15 cs.LG cs.CL stat.ML 版本更新

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

知道何时退出:LLM推理中动态弃权的原则性框架

Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学)

AI总结 本文提出一个基于正则化强化学习框架的动态弃权原则,通过价值函数与弃权奖励的比较来决定是否提前终止推理,在数学推理和毒性避免任务上优于现有方法。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)
AI中文摘要

利用思维链推理的大型语言模型常常因产生冗长且错误的响应而浪费大量计算资源。弃权可以通过抑制可能不正确的输出来缓解这一问题。虽然大多数弃权方法在生成之前或之后决定是否保留输出,但动态的生成中弃权考虑在每个token位置提前终止无前途的推理轨迹。先前的工作探索了这一想法的经验变体,但缺乏对弃权规则的原则性指导。我们提出了LLM动态弃权的形式化分析,将弃权建模为正则化强化学习框架中的一个显式动作。弃权奖励参数控制计算与信息之间的权衡。我们证明,在一般条件下,当价值函数低于该奖励时弃权严格优于自然基线。我们进一步推导了一种原则性且高效的方法来近似价值函数。在数学推理和毒性避免任务上的实证结果支持我们的理论,并展示了相比现有方法改进的选择性准确性。

英文摘要

LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
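
The decision rule described in the abstract (abstain when the value function falls below the abstention reward) can be illustrated with a minimal sketch. It assumes a per-token value estimate is already available; the paper derives its own approximation, which is not reproduced here. The names `first_abstention_step`, `value_estimates`, and `abstention_reward` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def first_abstention_step(value_estimates: Sequence[float],
                          abstention_reward: float) -> Optional[int]:
    """Return the first token index at which the rule abstains, or None.

    value_estimates[t] is a hypothetical estimate of the value function
    after t generated tokens; how to obtain it is outside this sketch.
    """
    for step, value in enumerate(value_estimates):
        # Abstain as soon as the estimated value drops below the reward
        # granted for abstaining.
        if value < abstention_reward:
            return step
    return None  # never abstained: the full response is kept


# Toy trace whose estimated value decays as the reasoning goes wrong.
trace = [0.62, 0.55, 0.41, 0.18, 0.05]
stop_at = first_abstention_step(trace, abstention_reward=0.30)
```

Raising `abstention_reward` makes the rule stop earlier and save more compute, while lowering it lets more traces run to completion, which is the trade-off the abstract attributes to this parameter.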

2604.21335 2026-06-15 cs.LG cs.CL 版本更新

Sub-Token Routing for KV Cache Compression

子令牌路由用于KV缓存压缩

Wei Jiang, Wei Wang

发表机构 * Futurewei Technologies(未来智科)

AI总结 提出子令牌路由方法,在保留令牌内对值向量分组并选择性保留,与令牌级压缩互补,在LLM和VLM中提升压缩性能。

Comments 17 pages, 8 tables, 2 figures

详情
AI中文摘要

Transformer推理通常需要大型KV缓存,尤其是在长上下文语言建模和多模态生成中。现有的压缩方法通常通过选择、驱逐、量化或压缩缓存令牌,或在语言模型推理前减少视觉令牌序列来降低缓存成本。我们引入子令牌路由,一种KV压缩方法,它在保留令牌内部添加了更精细的控制轴。它将每个保留的值向量分成组,并仅保留选定的组,同时保持查询和键状态不变。该方法设计在令牌级缩减之后工作。首先,令牌缩减方法确定保留哪些令牌。然后,子令牌路由压缩这些保留令牌内部的值状态。在匹配KV预算下的实验表明,添加子令牌路由提高了令牌级缩减在LLM和VLM设置中的性能,包括LLaMA-2-7B和Qwen2.5-7B上的Quest,以及LLaVA和Qwen-VL模型上的FastV/VisionZip。在较小的KV预算下增益更大,表明当进一步移除令牌成本高昂时,值组路由特别有用。总体而言,令牌级缩减和子令牌路由提供了互补的降低KV成本的方式。

英文摘要

Transformer inference often requires a large KV cache, especially for long-context language modeling and multimodal generation. Existing compression methods usually reduce cache cost by selecting, evicting, quantizing, or compressing cached tokens, or by reducing the visual-token sequence before language-model inference. We introduce sub-token routing, a KV-compression method that adds a finer control axis inside retained tokens. It splits each retained value vector into groups and keeps only selected groups, while leaving query and key states unchanged. The method is designed to work after token-level reduction. First, a token-reduction method determines which tokens are retained. Then, sub-token routing compresses the value states inside those retained tokens. Experiments under matched KV budgets show that adding sub-token routing improves token-level reduction performance in both LLM and VLM settings, including Quest on LLaMA-2-7B and Qwen2.5-7B, and FastV/VisionZip across LLaVA and Qwen-VL models. The gains are larger at smaller KV budgets, suggesting that value-group routing is especially useful when further token removal becomes costly. Overall, token-level reduction and sub-token routing provide complementary ways to reduce KV cost.
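
As a rough illustration of splitting a retained value vector into groups and keeping only some of them, the sketch below works on a single vector in pure Python. It assumes contiguous, equal-sized groups ranked by squared L2 norm; the abstract does not state the routing criterion, so that choice and the name `route_value_groups` are assumptions made for illustration.

```python
from typing import List


def route_value_groups(value: List[float], num_groups: int,
                       keep: int) -> List[float]:
    """Zero out all but `keep` groups of one retained token's value vector.

    Assumption: groups are contiguous, equal-sized slices ranked by
    squared L2 norm. The paper's actual routing criterion may differ.
    """
    if len(value) % num_groups != 0:
        raise ValueError("vector length must be divisible by num_groups")
    size = len(value) // num_groups
    groups = [value[i * size:(i + 1) * size] for i in range(num_groups)]
    energy = [sum(x * x for x in g) for g in groups]
    # Indices of the `keep` highest-energy groups.
    ranked = sorted(range(num_groups), key=lambda i: energy[i], reverse=True)
    kept = set(ranked[:keep])
    routed: List[float] = []
    for i, g in enumerate(groups):
        routed.extend(g if i in kept else [0.0] * size)
    return routed


v = [0.1, -0.2, 3.0, 2.5, 0.0, 0.1, -1.5, 1.0]
compressed = route_value_groups(v, num_groups=4, keep=2)
```

Zeros are written out here only to keep the example readable; an actual cache would presumably store just the kept groups and their indices. Query and key states are not touched, matching the description above.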

2606.01476 2026-06-15 cs.LG cs.CL 版本更新

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

OmniOPD:通过推测性验证实现无Logit的在线策略蒸馏

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

发表机构 * Meta AI

AI总结 提出OmniOPD框架,通过基于蒙特卡洛展开的块级语义相似度替代token级logit匹配,结合峰值熵调度器和贝叶斯先验,解决在线策略蒸馏中logit不可获取和信号脆弱问题,在数学任务上超越标准OPD达28.64%。

Comments 26 pages, 3 figures

详情
AI中文摘要

在线策略蒸馏(OPD)在强教师模型的密集token级反馈下,基于学生模型自身的生成轨迹进行训练,缓解了监督微调(SFT)的离策略分布偏移和强化学习(RL)的稀疏信用分配问题。然而,标准OPD面临两个耦合的限制。首先,它需要直接访问教师模型的token级logit,将一大类有能力的专有模型排除在教师之外。其次,token级logit信号本身是脆弱的,依赖于教师和学生之间合理下一个token的狭窄重叠,并且容易放大重复循环等退化模式。在本文中,我们引入了OmniOPD,一种通过无logit的块级监督信号解决这两个限制的新框架。OmniOPD用蒙特卡洛展开替代确定性logit匹配,通过多token块上的连续语义相似性度量近似教师的局部偏好,并通过峰值熵调度器集中这种监督,仅在学生的高不确定性推理分叉处进行审计。Dirichlet-Multinomial贝叶斯先验和基础模型KL锚进一步限制了离散采样的方差,并防止了未审计token上的策略崩溃。在竞争性基准测试中,OmniOPD在数学任务上超越标准OPD方法高达28.64%,证实了块级语义验证提取了比token级logit匹配更可靠的学习信号,后者高信息密度被显著的噪声和脆弱性所抵消。此外,当与更强的黑盒教师(如Claude-4.5-Haiku和Gemini-2.5-Flash)配对时,OmniOPD在数学任务上相对于其开放权重教师对应物额外获得了9.54%的相对提升,使学生超越了自我探索RL的性能。

英文摘要

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.

2606.03085 2026-06-15 cs.LG cs.CL 版本更新

Multi-component Causal Tracing in Large Language Models

大型语言模型中的多组件因果追踪

Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

发表机构 * Rensselaer Polytechnic Institute(伦斯勒理工学院) IBM Research(IBM研究院)

AI总结 本文提出一个统一框架,通过软干预和度量转换高效识别对目标性能指标最关键的多组件子集,优于现有基线方法。

Comments Accepted to ACL 2026 main conference

详情
AI中文摘要

因果追踪通过系统地干预大型语言模型(LLM)的内部表示,揭示并量化将特定输入或计算与特定感兴趣指标联系起来的因果路径,从而量化LLM的行为。在先前单组件或单层研究的基础上,本文提出了一个同时因果追踪多个组件的统一框架。该框架系统地识别对期望目标性能指标(如准确性和公平性)最关键的组件子集(例如注意力头和多层感知器神经元)。这是通过将灵活的干预应用于广泛期望的指标来实现的。为了解决多组件问题的组合复杂性,设计了一种高效算法,该算法利用软干预和精心设计的度量转换,将组合搜索问题转化为一个连续问题,该问题可以在适当约束下高效求解,从而为选择组件生成适当的二元决策。实验结果表明,所提出的方法高效地识别出对目标指标具有高影响力的模型组件子集,优于现有基线方法。我们的代码可从 https://github.com/ZiruiYan/multi-component-causal-tracing 获取。

英文摘要

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

2606.12476 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

幻觉起始的快速检测:延迟界与学习型CUSUM统计量

Igor Itkin

发表机构 * Independent Researcher(独立研究员)

AI总结 将幻觉起始检测建模为快速变化检测问题,基于RAGTruth验证的一阶马尔可夫模型,利用学习型CUSUM算法在匹配虚警率下实现11-13个token的检测延迟,优于线性基线,并揭示了分类指标掩盖的延迟结构。

Comments 16 pages, 1 figure. v2: added Discussion and Appendix; recall-honest framing; robustness analyses (k-NN divergence estimate, seed-averaged decomposition)

详情
AI中文摘要

Token级幻觉检测器作为分类器进行评估,以所有token上的AUC衡量,但流式监控器由其反应时间判断:从幻觉开始到警报之间经过的token数量。我们将幻觉起始检测表述为一个快速变化检测问题。在RAGTruth上验证的潜在忠实/幻觉状态的一阶马尔可夫模型,将任务置于经典变点理论中,并得出Lorden关于检测延迟的下界:在虚警率为0.01时约为1.3个token。然后我们证明,因果循环标注器相当于一个具有学习增量的CUSUM。在它捕获到的起始点中,检测延迟为11-13个token,而线性逐token基线为31个token;不过在该虚警预算下,每个检测器捕获的起始点都不到三分之一,计入召回率后的延迟为56-66个token:低虚警率下的起始检测是困难的。受控分解将速度优势主要归因于更好的逐token得分,而非时间累积。Donsker-Varadhan型的信息率最优性定理解释了剩余的数量级差距:学习得分仅实现了特征所携带散度的1/4.5,这一缺陷无法通过重新校准消除,其余部分为有限时域效应。分类指标掩盖了这种延迟结构;序列分析使其可测量。

英文摘要

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
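
For readers unfamiliar with CUSUM, the classical recursion is S_t = max(0, S_{t-1} + z_t), with an alarm raised once S_t reaches a threshold. The sketch below applies it to a made-up stream of per-token increments. In the classical setting z_t is a log-likelihood ratio; the abstract describes the increment as learned. The scores, the threshold, and the name `cusum_alarm` are illustrative assumptions, not values from the paper.

```python
from typing import Iterable, Optional


def cusum_alarm(increments: Iterable[float],
                threshold: float) -> Optional[int]:
    """Return the index of the first alarm, or None if none is raised."""
    stat = 0.0
    for t, inc in enumerate(increments):
        # Classical CUSUM recursion: accumulate evidence, floor at zero.
        stat = max(0.0, stat + inc)
        if stat >= threshold:
            return t
    return None


# Hypothetical increments: negative while the text is faithful,
# positive after a hallucination onset at index 4.
scores = [-0.5, -0.3, -0.4, -0.2, 0.8, 0.9, 0.7, 1.0]
onset = 4
alarm = cusum_alarm(scores, threshold=2.0)
delay = None if alarm is None else alarm - onset
```

Detection delay, the quantity the abstract argues should be reported, is the gap between `onset` and `alarm`. A higher threshold lowers the false-alarm rate at the cost of a longer delay.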

2606.11898 2026-06-15 cs.CL cs.LG 版本更新

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 面向文本属性图与LLM的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

发表机构 * Peking University(北京大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GraspLLM框架,通过融合图结构理解与LLM语义能力,利用基序感知对比学习和最优上下文子图对齐,实现跨数据集和跨任务的零样本泛化。

详情
AI中文摘要

近年来,对文本属性图(TAGs)的研究因其在引文网络、电子商务平台、社交媒体和网页等各类真实数据场景中的广泛应用而备受关注。受大语言模型(LLMs)卓越语义理解能力的启发,已有许多尝试将LLMs集成到TAGs中。然而,现有方法仍难以在不同图和任务间泛化,且其捕获可迁移图结构模式的能力有限。为此,我们提出了GraspLLM框架,该框架将图结构理解与LLM的语义理解能力相结合,以增强跨数据集和跨任务的泛化能力。具体而言,我们使用冻结的通用嵌入模型将不同图的节点文本表示在统一语义空间中,在此基础上,我们在多个基序诱导的邻接矩阵上进行基序感知对比学习,以提取与数据集无关的结构信息。然后,通过我们提出的最优上下文子图,为每个目标节点提取最相关的上下文子图,并通过对齐投影器将这些子图对齐到LLM的令牌空间。在涵盖不同领域的TAG基准数据集上的大量实验表明,GraspLLM始终优于先前基于LLM的TAG方法,尤其是在零样本场景下,突显了其在不同数据集和任务上的强泛化能力。我们的代码可在以下网址获取:https://github.com/Heinz217/GraspLLM。

英文摘要

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce the GraspLLM, a framework that combines Graph structural comprehension with semantic understanding prowess of LLMs to enhance the cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at https://github.com/Heinz217/GraspLLM.

2. 信息抽取、检索与问答 11 篇

2606.14325 2026-06-15 cs.CL cs.AI 新提交

Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

通过基于知识图谱的数据生成实现精确的文本到Cypher转换

Francesco Cazzaro, Jessica Lennon, Ariadna Quattoni

发表机构 * Universitat Politècnica de Catalunya(加泰罗尼亚理工大学)

AI总结 提出一种自动合成数据生成方法,微调小型LLM以提升Text2Cypher性能,使其在本地部署中与大型专有模型竞争,保障数据主权。

详情
AI中文摘要

属性图正迅速被采用作为表示异构数据源的数据库框架。为了精确访问其中包含的信息,我们需要基于文本到Cypher(Text2Cypher)解析器的对话界面。本文提出了一种自动合成数据生成方法,可用于微调小型LLM以完成此任务。我们在所有主要的Text-To-Cypher基准测试上进行了实验,证明使用我们的合成数据生成方法可以显著提高小型LLM的性能,使其能够与更大的专有模型竞争。这意味着在必须本地部署模型的场景中,我们可以在不牺牲准确性且无需昂贵标注活动的情况下确保数据主权。

英文摘要

Property Graphs are rapidly being adopted as database frameworks for representing heterogeneous data sources. To enable precise access to the information contained in them we need conversational interfaces based on Text-To-Cypher (Text2Cypher) parsers. This paper presents an automatic synthetic data generation method that can be leveraged to fine-tune small LLMs for this task. We conduct experiments on all the major Text-To-Cypher benchmarks, demonstrating that with our synthetic data generation approach we can significantly increase the performance of small LLMs, allowing them to compete with much larger proprietary models. This means that in settings in which models must be locally deployed we can ensure data-sovereignty without sacrificing accuracy and without costly annotation campaigns.

2606.13905 2026-06-15 cs.IR cs.CL 交叉投稿

ADORE: Iterative Query Expansion with Retrieval-Grounded Relevance Feedback

ADORE: 基于检索反馈的迭代查询扩展

Amin Bigdeli, Negar Arabzadeh, Radin Hamidi Rad, Sajad Ebrahimi, Charles L. A. Clarke, Ebrahim Bagheri

发表机构 * University of Waterloo(滑铁卢大学) Mila – Quebec AI Institute(魁北克人工智能研究所) University of Toronto(多伦多大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ADORE框架,通过迭代生成伪段落、检索语料库并评估相关性,利用检索反馈指导查询扩展,显著提升检索性能。

详情
AI中文摘要

基于LLM的查询扩展通过为原始查询添加额外上下文来改进检索。然而,大多数方法仍然是生成驱动的,产生看似合理的伪文档或扩展,而不检查目标语料库的响应。这可能导致检索漂移、放大误导性词汇或遗漏区分相关与不相关文档的术语。我们认为,有效的扩展需要基于检索的反馈,而不仅仅是单次生成或未经验证的迭代。我们引入ADORE(ADapt, Observe, Relevance Evaluate,即适应、观察、相关性评估),一个迭代框架,将检索结果转化为下一次扩展的反馈。在每一轮中,LLM生成伪段落,检索器暴露语料库响应,相关性评估器根据原始查询评估检索到的文档。这些判断识别出需要强化、仍未被覆盖以及需要抑制的内容。在TREC Deep Learning、BEIR和BRIGHT上,ADORE始终优于强查询扩展基线,在几乎所有评估设置中均有显著提升:在BEIR上,平均nDCG@10比BM25提高24.5%,比此前最强的查询扩展方法提高3.6%;在BRIGHT上,比BM25提高122.9%,比最佳查询扩展基线提高9.2%。我们的代码和数据已公开。

英文摘要

LLM-based query expansion improves retrieval by enriching the original query with additional context. Yet most methods remain generation-driven, producing plausible pseudo-documents or expansions without checking how the target corpus responds. This can introduce retrieval drift, amplify misleading vocabulary, or miss terms that distinguish relevant from non-relevant documents. We argue that effective expansion requires retrieval-grounded feedback, not just single-pass generation or unverified iteration. We introduce ADORE (ADapt, Observe, Relevance Evaluate), an iterative framework that turns retrieval outcomes into feedback for the next expansion. At each round, an LLM generates pseudo-passages, a retriever exposes the corpus response, and a relevance assessor evaluates retrieved documents against the original query. These judgments identify what to reinforce, what remains undercovered, and what to suppress. Across TREC Deep Learning, BEIR, and BRIGHT, ADORE consistently outperforms strong query expansion baselines with notable improvements across nearly all evaluation settings, improving average nDCG@10 by 24.5% over BM25 and 3.6% over the strongest prior query expansion method on BEIR, and by 122.9% over BM25 and 9.2% over the best query expansion baseline on BRIGHT. Our code and data are publicly available.

2606.14047 2026-06-15 cs.IR cs.AI cs.CL cs.LG 交叉投稿

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

知识图谱增强的记忆增强检索用于长上下文建模

Ghadir Alselwi, Basem Suleiman, Hao Xue, Shoaib Jameel, Hakim Hacid, Flora D. Salim, Imran Razzak

发表机构 * University of New South Wales(新南威尔士大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Southampton(南安普顿大学) Technology Innovation Institute(技术创新研究所) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出KGERMAR框架,通过动态构建上下文知识图谱并融合多组件记忆架构,在长上下文建模中降低困惑度达8.5%,提升记忆效率2-2.5倍。

详情
AI中文摘要

长上下文语言建模不仅需要扩展上下文窗口,还需要在数千个token中保持对实体状态和关系的连贯理解——这是语义相似性单独无法解决的挑战。KGERMAR通过在推理过程中从输入文本构建动态的、上下文特定的知识图谱来解决这一问题,实现利用语义相似性和显式实体关系的领域自适应检索。该框架执行实时实体和关系抽取以构建上下文知识图谱,然后通过多组件记忆架构将图结构嵌入与文本语义相结合。维护三个记忆库——上下文、语义和结构——通过学习权重融合检索信号,以捕获表面语义和更深层次的关系模式。在SlimPajama(84.7K训练样本)、WikiText-103(4,358样本)、PG-19(100样本)和Proof-pile(46.3K样本)上评估,KGERMAR在1K到32K token的上下文长度上,相比记忆增强基线实现了高达8.5%的困惑度降低和2-2.5倍的记忆效率提升,并在五个NLU任务上展现出优越的上下文学习性能。动态知识图谱构建方法通过实现适应输入上下文而非依赖固定知识库的领域特定知识表示,推进了记忆增强语言建模。

英文摘要

Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens, a challenge that semantic similarity alone cannot address. KGERMAR addresses this by constructing dynamic, context-specific knowledge graphs from input text during inference, enabling domain-adaptive retrieval that leverages both semantic similarity and explicit entity relationships. The framework performs real-time entity and relation extraction to build contextual knowledge graphs, then integrates graph-structural embeddings with textual semantics through a multi-component memory architecture. Three memory banks (contextual, semantic, and structural) are maintained with retrieval signals fused via learned weights to capture both surface-level semantics and deeper relational patterns. Evaluated on SlimPajama (84.7K training examples), WikiText-103 (4,358 examples), PG-19 (100 examples), and Proof-pile (46.3K examples), KGERMAR achieves up to 8.5% lower perplexity and 2-2.5x better memory efficiency than memory-augmented baselines across context lengths from 1K to 32K tokens, with superior in-context learning performance across five NLU tasks. The dynamic knowledge graph construction approach advances memory-augmented language modeling by enabling domain-specific knowledge representation that adapts to input contexts rather than relying on fixed knowledge bases.

2606.14127 2026-06-15 cs.IR cs.CL 交叉投稿

CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

CoRe:一种持续奖励微调的LLM查询重写器,用于大规模视频搜索中的多阶段上下文感知相关性

Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang, Chunhui Liu, Jeffrey Chen, Zeyu Ma, Lisong Qiu, Xiaochuan Fan, Congjia Yu, Quan Zhou, Yuheng Chen, Zian Wang

发表机构 * TikTok

AI总结 提出CoRe系统,通过半在线混合偏好优化和乘法奖励比,实现每周重新部署的LLM查询重写器,在多阶段视频搜索中显著降低变更查询率并提升相关性指标。

Comments 12 pages, 3 figures

详情
AI中文摘要

生产环境中的基于LLM的查询重写器面临一个矛盾:训练奖励必须反映重写结果如何被生产排序器使用,但训练过程必须足够廉价以支持随着数据漂移而持续重新部署。我们提出了CoRe(上下文相关性)系统,该系统在主要短视频搜索引擎中每周重新部署,已超过五个月。我们的奖励使用部署的多模态相关性模型作为其来源,并采用乘法比率形式,镜像生产融合代数,从而缩小了离线奖励代理留下的模拟-生产差距。半在线混合偏好优化循环使得这种奖励在每周数百万实例规模下可行:一个DPO风格的成对目标将梯度传递限制在采样轨迹的小型top-k/bottom-k子集,并且一个阶段结构将训练器/推理服务器参数同步从每步减少到每阶段。一个基于奖励和稳定性指标的自动升级门检测并恢复了一次生产中的真实奖励黑客事件。重写器输出作为并行相关性信号在召回、原始排序和精细排序阶段被消费,而不取代原始信号,从而限制了重写器故障的影响范围。来自两个连续生产发布的在线A/B测试,首先在精细排序阶段部署重写器,然后扩展到召回和原始排序阶段,在受重写影响的查询上实现了统计显著的变更查询率降低,所有主要相关性和参与度指标均朝着预期方向移动。

英文摘要

LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.

2606.14269 2026-06-15 cs.IR cs.CL 交叉投稿

ScoreGate: Adaptive Chunk Selection for Retrieval-Augmented Generation via Dual-Score Statistical Fusion

ScoreGate: 通过双分数统计融合实现检索增强生成的自适应分块选择

Karamvir Singh, Arvind Jain

发表机构 * HighLevel, Inc.(HighLevel公司)

AI总结 针对固定K检索导致过检索或欠检索的问题,提出ScoreGate,利用双编码器相似度和交叉编码器重排序分数进行自适应检索数量控制,无需额外模型调用,在MS MARCO上以更少块数提升MRR@10,并实现零假阳性。

Comments 20 pages, 6 figures, 14 tables

详情
AI中文摘要

固定基数检索无论查询复杂度如何,都向生成器注入恒定数量的top-K块,导致窄查询的过检索和组合查询的欠检索。我们描述了ScoreGate,一种轻量级的分数空间决策机制,利用标准流程中已经产生的两个分数——双编码器相似度s_i和交叉编码器重排序分数r_i,在推理时控制检索基数,无需额外的模型推理调用。其核心见解是,交叉编码器的确认可以挽救因词汇不匹配而被双编码器检索排名较低的语义相关块——这是固定K或单分数阈值化未能解决的一种失败模式。在MS MARCO(200个开发查询)上,ScoreGate以比标准Top-K少35%的保留块实现了MRR@10 = 0.401。在一个内部基准测试(n=300,Fleiss' kappa=0.87)上,ScoreGate在97.77-99.34%的召回率下观察到零假阳性(95% CI [96.4%, 100%]),每个查询的令牌数减少34.8%,仅增加31ms延迟。在MS MARCO和实际生产流量上的结果表明,自适应检索基数可以在不降低检索质量的情况下提高检索效率。

英文摘要

Fixed-cardinality retrieval injects a constant top-K chunks into the generator regardless of query complexity, causing over-retrieval for narrow queries and under-retrieval for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores already produced by the standard pipeline: bi-encoder similarity s_i and cross-encoder reranker score r_i, with no additional model inference calls required. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that bi-encoder retrieval ranks poorly due to vocabulary mismatch -- a failure mode unaddressed by fixed-K or single-score thresholding. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 with 35% fewer retained chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss' kappa=0.87), ScoreGate observed zero false positives (95% CI [96.4%, 100%]) at 97.77-99.34% recall, with 34.8% fewer tokens per query and only 31ms added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.
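
The idea of letting both scores decide how many chunks to keep, with the cross-encoder able to rescue a chunk the bi-encoder ranked poorly, can be sketched as follows. The abstract does not give ScoreGate's actual fusion statistics, so the simple threshold rule, the thresholds, and the name `gate_chunks` below are assumptions made only to illustrate the behaviour.

```python
from typing import List, Tuple

# (chunk id, bi-encoder similarity s_i, cross-encoder score r_i)
Chunk = Tuple[str, float, float]


def gate_chunks(chunks: List[Chunk], s_min: float = 0.5,
                r_min: float = 0.5, r_rescue: float = 0.8) -> List[str]:
    """Keep a variable number of chunks using both scores.

    Illustrative rule only (thresholds are made up):
      - keep a chunk when both scores clear their thresholds, or
      - "rescue" a chunk whose cross-encoder score is very high even
        though the bi-encoder ranked it poorly.
    """
    kept = []
    for chunk_id, s, r in chunks:
        agrees = s >= s_min and r >= r_min
        rescued = r >= r_rescue
        if agrees or rescued:
            kept.append(chunk_id)
    return kept


candidates = [
    ("c1", 0.82, 0.91),  # both scores high
    ("c2", 0.31, 0.88),  # vocabulary mismatch, rescued by the reranker
    ("c3", 0.74, 0.22),  # similar wording, judged irrelevant
    ("c4", 0.28, 0.15),  # low on both scores
]
selected = gate_chunks(candidates)
```

Because the rule reads only scores the pipeline already computes, the number of retained chunks varies per query without any extra model call. A fixed top-3 cut on `s_i` alone would have kept `c3` and dropped `c2`.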

2410.15051 2026-06-15 cs.CL cs.LG 版本更新

Automatic identification of diagnosis from hospital discharge letters via weakly supervised Natural Language Processing

通过弱监督自然语言处理自动识别出院信中的诊断

Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出一种弱监督NLP流程,无需文档级标注即可从意大利语出院信中分类诊断,在细支气管炎数据集上达到接近全监督的性能,节省大量人工标注时间。

Comments 61 pages, 9 figures

详情
AI中文摘要

从医院出院信中识别患者诊断对于大规模队列选择和流行病学研究至关重要,但传统的监督方法需要大量手动标注,这对于大型文本数据集通常不切实际。我们提出了一种弱监督自然语言处理(NLP)流程,用于对意大利语出院信进行分类,无需文档级手动标注。该方法提取与诊断相关的句子,使用在意大利医学文档上进一步预训练的Transformer模型生成语义嵌入,并应用两级聚类程序推导出弱标签,然后用于训练文档级分类器。该方法在2017年至2020年间意大利威尼托地区44个急诊室或医院收治的33,176份儿童出院信的细支气管炎案例研究中进行了评估。最佳弱监督模型在手动标注数据上实现了77.68%(±4.30%)的AUROC、73.13%(±4.93%)的AUPRC和78.14%(±4.89%)的F1分数。性能超过了无监督基线,接近全监督模型,同时对于该规模的数据集减少了超过1,500小时的手动标注需求。在较小的支气管炎数据集(3,188份出院信,2020-2025年)的二次验证中观察到类似的模型排名,最佳弱监督模型实现了76.72%(±5.02%)的AUPRC。这些结果表明弱监督NLP方法在从临床出院信中可扩展地识别疾病方面具有潜力。

英文摘要

Identifying patient diagnoses from hospital discharge letters is essential for large-scale cohort selection and epidemiological research, but traditional supervised approaches require extensive manual annotation, which is often impractical for large textual datasets. We present a weakly supervised Natural Language Processing (NLP) pipeline for classifying Italian discharge letters without document-level manual annotation. The method extracts diagnosis-related sentences, generates semantic embeddings using a transformer model further pre-trained on Italian medical documents, and applies a two-level clustering procedure to derive weak labels that are then used to train a document-level classifier. The approach was evaluated in a case study on bronchiolitis using 33,176 discharge letters of children admitted to 44 emergency rooms or hospitals in the Veneto Region, Italy, between 2017 and 2020. The best weakly supervised model achieved an AUROC of 77.68% ($\pm4.30\%$), an AUPRC of 73.13% ($\pm4.93\%$), and an F1-score of 78.14% ($\pm4.89\%$) against manually annotated data. Performance surpassed unsupervised baselines and approached fully supervised models, while reducing the need for manual annotation by more than 1,500 hours for a dataset of this size. Similar model rankings were observed in a secondary validation on a smaller bronchitis dataset (3,188 discharge letters, 2020-2025), where the best weakly supervised model achieved an AUPRC of 76.72% ($\pm 5.02\%$). These results suggest the potential of weakly supervised NLP methods for scalable disease identification from clinical discharge letters.

2504.20734 2026-06-15 cs.CL cs.AI cs.CV cs.IR cs.LG 版本更新

UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities

UniversalRAG: 在多样模态和粒度的语料库上实现检索增强生成

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出UniversalRAG,一种能够处理多种模态和粒度的检索增强生成框架,通过动态路由机制和多粒度组织,提升跨模态知识检索的有效性,实验表明其在多个模态基准上的优越性。

Comments ACL 2026. Project page : https://universalrag.github.io

详情
AI中文摘要

检索增强生成(RAG)通过将外部相关知识与查询绑定,显著提升了事实准确性。然而,现有方法多局限于文本语料,尽管最近有尝试扩展到图像、视频等模态,但通常仅针对单一模态语料。相比之下,现实中的查询所需知识类型多样,单一知识源无法满足。为此,我们引入UniversalRAG,一种any-to-any RAG框架,旨在从异构源中检索和整合多样模态和粒度的知识。具体而言,受强制所有模态进入单一聚合语料的统一表示空间导致模态间隙的观察启发,我们提出模态感知路由,动态识别最合适的模态特定语料并执行针对性检索,并通过理论分析证明其有效性。此外,除模态外,我们对每个模态组织为多个粒度层级,实现针对查询复杂性和范围的精细检索。我们验证UniversalRAG在10个多种模态基准上的性能,显示其优于各种模态特定和统一基线。

英文摘要

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single aggregated corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose modality-aware routing, which dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it, and further justify its effectiveness with a theoretical analysis. Moreover, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 10 benchmarks of multiple modalities, showing its superiority over various modality-specific and unified baselines.

2505.04671 2026-06-15 cs.CL cs.LG 版本更新

Reward-SQL: Boosting Text-to-SQL via Stepwise Execution-Aware Reasoning and Process-Supervised Rewards

Reward-SQL:通过逐步执行感知推理和过程监督奖励提升Text-to-SQL

Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Guoliang Li, Bin Wu, Wenchao Zhou

发表机构 * Renmin University of China(中国人民大学) Tsinghua University(清华大学) Alibaba Cloud Computing(阿里云计算)

AI总结 针对强化学习在Text-to-SQL中缺乏逐步执行感知推理和过程级奖励的问题,提出CoCTE框架和Reward-SQL方法,通过中间视图验证、结构化CTE及过程奖励模型,显著提升复杂查询的准确性和可解释性。

详情
AI中文摘要

最近,使用强化学习(RL)训练的大型语言模型(LLMs)的进展提高了Text-to-SQL的性能。然而,基于RL的方法仍然在处理复杂查询时面临两个关键限制:缺乏基于数据库反馈的逐步执行感知推理,以及缺乏用于指导推理优化的过程级奖励。为了解决这些问题,我们提出了CoCTE,一种分治且执行感知的推理框架,通过中间视图验证和结构化公共表表达式(CTEs)逐步组合SQL查询,提高了准确性和可解释性。为了实现CoCTE推理过程,我们开发了Reward-SQL,一种统一的方法,包含三个阶段:(1)模型初始化,使LLMs具备结构化CoCTE推理能力;(2)过程奖励设计,提供细粒度的、执行感知的监督;(3)过程监督的RL和推理,将过程奖励整合到训练中,并通过过程奖励指导推理阶段。本文解决了Reward-SQL中的核心挑战,并做出了以下贡献。我们引入了一个过程奖励模型(PRM),它将执行感知的轨迹评分与基于熵的步骤加权相结合,在推理步骤中提供密集且可解释的监督。我们将PRM集成到RL训练和推理阶段,稳定优化并通过过程级信号改进轨迹探索。实验表明,Reward-SQL在可比模型大小下显著优于基线,并表现出强大的跨领域泛化能力。

英文摘要

Recent advances in large language models (LLMs) trained with reinforcement learning (RL) have improved Text-to-SQL performance. However, RL-based approaches still struggle with complex queries due to two key limitations: insufficient stepwise execution-aware reasoning grounded in database feedback, and the lack of process-level rewards for guiding reasoning optimization. To address these issues, we propose CoCTE, a divide-and-conquer and execution-aware reasoning framework that progressively composes SQL queries through intermediate view validation and structured Common Table Expressions (CTEs), improving both accuracy and interpretability. To realize a CoCTE reasoning process, we develop Reward-SQL, a unified approach with three stages: (1) model initialization, which equips LLMs with structured CoCTE reasoning capabilities; (2) process reward design, which delivers fine-grained, execution-aware supervision; and (3) process-supervised RL and inference, which integrates process rewards into training and guides the inference stage by process rewards. This paper addresses the core challenges in Reward-SQL and makes the following contributions. We introduce a process reward model (PRM) that combines execution-aware trajectory scoring with entropy-based step weighting, providing dense and interpretable supervision across reasoning steps. We integrate PRM into both RL training and inference stages, stabilizing optimization and improving trajectory exploration with process-level signals. Experiments show that Reward-SQL significantly outperforms baselines with comparable model sizes, and exhibits strong cross-domain generalization.
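
To make the idea of composing a query through CTEs with intermediate validation concrete, the sketch below builds a query one CTE at a time and executes each partial query against an in-memory SQLite database before accepting the step. The schema, the steps, and the "non-empty result" check are assumptions for illustration; they are not the CoCTE procedure or reward from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('ann', 40), ('ann', 70), ('bob', 20), ('cy', 150);
""")

# Each step is one named CTE; later steps may reference earlier ones.
steps = [
    ("totals",
     "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"),
    ("big_spenders",
     "SELECT customer, total FROM totals WHERE total > 100"),
]

validated = []
for name, body in steps:
    candidate = validated + [(name, body)]
    with_clause = ", ".join(f"{n} AS ({b})" for n, b in candidate)
    # Execute the partial query to inspect the intermediate view.
    rows = conn.execute(
        f"WITH {with_clause} SELECT * FROM {name}").fetchall()
    if not rows:
        raise RuntimeError(f"step {name!r} returned no rows")
    validated = candidate

final_sql = (
    "WITH " + ", ".join(f"{n} AS ({b})" for n, b in validated)
    + " SELECT customer FROM big_spenders ORDER BY customer"
)
result = [row[0] for row in conn.execute(final_sql)]
```

Each accepted step yields an executable prefix of the final query, which is what makes step-level feedback from the database possible. A process reward model would score these steps rather than only the final result.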

2603.16073 2026-06-15 cs.CL 版本更新

ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

ClaimFlow: 追踪自然语言处理中科学主张的演变

Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab)(泛在知识处理实验室) Department of Computer Science and Hessian Center for AI (hessian.AI)(计算机科学系和黑森人工智能中心) Technische Universität Darmstadt(达姆施塔特工业大学) IT:U Interdisciplinary Transformation University Austria(奥地利跨学科转型大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 提出ClaimFlow框架,通过标注1617篇论文中的5689个主张和4871个跨论文关系,定义主张关系分类任务,并分析NLP领域主张的演变模式。

详情
AI中文摘要

科学论文提出$\textit{主张}$,后续工作会支持、扩展,有时也会反驳这些主张。然而,现有的引文和主张分析方法仅捕捉到这一对话的片段。在这项工作中,我们在单个科学主张的层面上将这些交互显式化。我们引入$\texttt{ClaimFlow}$,一种以主张为中心的NLP文献视图,基于$1{,}617$篇ACL Anthology论文$(1979-2025)$构建,这些论文经人工标注了$5{,}689$个主张和$4{,}871$个跨论文主张关系,用以指示施引论文对被引主张是$\texttt{supports}$(支持)、$\texttt{extends}$(扩展)、$\texttt{qualifies}$(限定)、$\texttt{refutes}$(反驳),还是仅将其作为$\texttt{background}$(背景)引用。基于$\texttt{ClaimFlow}$,我们定义了一个新任务$\textit{主张关系分类}$(Claim Relation Classification),要求模型根据文本和引文上下文推断对被引主张的科学立场。我们在该任务上评估了神经网络模型和大语言模型,报告的基线性能为$0.81$宏F1,表明该任务可解,同时仍有改进空间。随后,我们将该框架扩展到约$13$k篇NLP论文,以研究数十年NLP研究中主张的演变。我们发现$63.5\%$的主张从未被再次使用,只有$11.1\%$曾受到质疑。广泛传播的主张更常通过限定和扩展被$\textit{重塑}$,而非被支持或反驳。总体而言,$\texttt{ClaimFlow}$为审视NLP中的思想如何转变和成熟提供了一个视角。

英文摘要

Scientific papers advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $1{,}617$ ACL Anthology papers $(1979 - 2025)$ that are manually annotated with $5{,}689$ claims and $4{,}871$ cross-paper claim relations, indicating whether a citing paper $\texttt{supports}$, $\texttt{extends}$, $\texttt{qualifies}$, $\texttt{refutes}$, or references a cited claim as $\texttt{background}$. Building on $\texttt{ClaimFlow}$, we define a new task -- $\textit{Claim Relation Classification}$ -- which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating neural models and large language models on this task, we report baseline performance of $0.81$ macro-F1, suggesting that the task is tractable while leaving room for improvement. We then scale this framework to $\sim 13$k NLP papers to study claim evolution across decades of NLP research. We show that $63.5\%$ claims are never reused; only $11.1\%$ are ever challenged. Widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than supported or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP.
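Since the baseline above is reported as macro-F1 over the five relation labels, a short sketch of that metric may help: macro-F1 is the unweighted mean of the per-label F1 scores, so rare labels such as `refutes` count as much as frequent ones. The toy gold and predicted labels below are invented for illustration and are not ClaimFlow data. Scoring a label with no true positives, false positives, or false negatives as 0 is one common convention, assumed here.

```python
LABELS = ["supports", "extends", "qualifies", "refutes", "background"]

def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # Convention (assumption): a label that never occurs scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Hypothetical toy annotations, one relation label per citing/cited claim pair.
gold = ["supports", "extends", "qualifies", "refutes", "background", "background"]
pred = ["supports", "extends", "extends", "refutes", "background", "supports"]
score = macro_f1(gold, pred)
```

In this toy case the missed `qualifies` example pulls the average down by a full fifth, which is the behaviour that makes macro-F1 sensitive to minority relations.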

2606.06794 2026-06-15 cs.CL cs.IR 版本更新

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

TA-RAG: 面向同伴支持健康沟通的语气感知检索增强生成

Yong-Bin Kang, Anthony McCosker

发表机构 * Swinburne University of Technology(斯威本科技大学)

AI总结 提出TA-RAG框架,通过轻量级提示在RAG管道中嵌入语气控制(无污名化、可读性调整、受众适应、同理心改写),无需微调模型,提升敏感健康沟通质量。

详情
AI中文摘要

检索增强生成(RAG)成功地将大型语言模型(LLM)的输出建立在可信文档上,但仅靠事实依据不足以支持敏感的同伴健康沟通。在HIV同伴支持等领域,回复还必须易于理解、无污名化、富有同理心并针对接收者定制。本文提出TA-RAG,一个轻量级的、基于提示的语气感知RAG框架,它将明确的语气控制嵌入到RAG管道中,无需模型微调。我们通过四个核心组件来操作化语气:无污名化改写、可读性调整、接收者适应和同理心重述。我们使用来自澳大利亚HIV在线学习(HOLA)、UNAIDS术语指南、可读性指标、澳大利亚HIV感染者协会(NAPWHA)的同伴支持标准以及公共同理心数据集的问题,通过组件级测试评估TA-RAG。结果表明,TA-RAG的组件在保留关键内容的同时,提高了其目标沟通质量。这些发现强调,基于提示的语气控制是使RAG输出适用于敏感同伴支持健康沟通的一个潜在方向。

英文摘要

Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that the TA-RAG's components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

2604.23336 2026-06-15 cs.IR cs.CL cs.LG 版本更新

Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

高效的基于理由的检索:基于JEPA、从生成式重排序器出发的同策略蒸馏

Teng Chen, Sheng Xu, Feixiang Guo, Xiaoyu Wang, Qingqing Gu, Hongyan Li, Luo Ji

发表机构 * Geely AI Lab(吉利人工智能实验室)

AI总结 本文提出Rabtriever,通过同策略(on-policy)蒸馏向生成式重排序器学习,独立编码查询和文档以提升检索效率,同时在多个任务中表现优异。

Comments 11 pages, 8 figures. ICMR 2026 (https://youtu.be/apDcrzEVwq4)

详情
AI中文摘要

不同于传统的基于事实的检索,基于理由的检索通常需要使用大语言模型对查询-文档对进行交叉编码,带来巨大的计算成本。为解决这一限制,我们提出了Rabtriever,它独立编码查询和文档,同时提供与重排序器相当的跨查询-文档理解能力。我们首先训练一个基于LLM的生成式重排序器,它将文档置于查询之前,并提示LLM通过对数概率生成相关性分数。随后将其作为同策略(on-policy)蒸馏框架中的教师,Rabtriever作为学生,重建教师的上下文感知查询嵌入。为此,Rabtriever首先由教师初始化,并冻结参数。然后采用联合嵌入预测架构(JEPA)范式,在LLM层与输出头之间加入一个轻量级、可训练的预测器,将查询嵌入投影到新的隐藏空间,并以文档嵌入作为潜在向量。JEPA随后最小化该投影嵌入与教师嵌入之间的分布差异。为了提高同策略蒸馏的采样效率,我们还在LLM的logits上添加了基于反向KL的辅助损失,以重塑学生的logit分布。Rabtriever将教师关于文档长度的二次复杂度降低为线性,这一点在理论和实验上均得到验证。实验表明,Rabtriever在多种基于理由的任务(包括共情对话和机器人操作)中优于不同的检索器基线,相比重排序器仅有轻微的准确率下降。Rabtriever在MS MARCO和BEIR等传统检索基准上也具有良好的泛化能力,性能与最佳检索器基线相当。

英文摘要

Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents, while providing comparable cross query-document comprehension capabilities to rerankers. We start from training a LLM-based generative reranker, which puts the document prior to the query and prompts the LLM to generate the relevance score by log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student to reconstruct the teacher's contextual-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student's logit distribution. Rabtriever optimizes the teacher's quadratic complexity on the document length to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.
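The auxiliary loss mentioned above is a reverse KL over logits. As a minimal sketch, assuming "reverse" means the divergence is taken with the expectation under the student distribution, i.e. KL(student || teacher), the quantity for a single categorical distribution can be computed as follows. The abstract does not give the exact formulation, so the function names and the choice of direction are assumptions, not the paper's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): sum_i q_i * (log q_i - log p_i),
    where q is the student distribution and p the teacher distribution."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

same = reverse_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
far = reverse_kl([3.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

Because the expectation is under the student, this direction penalizes the student for placing mass where the teacher assigns little, which is commonly described as mode-seeking behaviour.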

3. 对话系统与智能体 11 篇

2606.13945 2026-06-15 cs.CL 新提交

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

MedLatentDx:用于跨医院罕见病诊断的潜在多智能体通信

Ziqing Wang, Lili Zhao, Kaize Ding

发表机构 * Northwestern University(西北大学)

AI总结 提出MedLatentDx框架,通过潜在多智能体通信实现跨医院罕见病诊断,利用潜在KV块传输保护隐私,支持同骨干和跨家族LLM部署,在CrossRare-Bench上提升诊断性能并减少可重建临床内容。

详情
AI中文摘要

罕见病影响超过3亿患者,涉及7000多种疾病,但没有任何一家医院能遇到足够多的病例来进行可靠诊断。跨医院合作可以通过允许诊断机构使用分布式、病例特定的诊断证据来提供帮助,但隐私法规限制可识别的临床文本跨机构传输。这种情况带来了两个挑战:现有的医疗智能体系统通常依赖文本证据交换,而原始潜在状态(如隐藏状态和KV缓存)仍可能泄露提示衍生的临床内容。我们引入了MedLatentDx,一个潜在多智能体通信框架,其中医院智能体将私有临床记录和检索到的病例保留在本地,并向主机智能体发送紧凑的潜在KV块以进行罕见病诊断。MedLatentDx支持两种部署设置:相同骨干的医院智能体使用潜在KV蒸馏,而具有不同LLM骨干的医院使用跨家族潜在对齐。在CrossRare-Bench(一个自建的大规模罕见病基准,具有医院级别分区)上,MedLatentDx提高了跨医院诊断性能,同时相对于原始潜在通信基线减少了可重建的临床内容。

英文摘要

Rare diseases affect over $300$ million patients across more than $7{,}000$ conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

2606.14179 2026-06-15 cs.CL 新提交

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

CacheRL: 通过缓存轨迹和混合奖励实现多轮工具调用智能体

Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim

发表机构 * Center for Advanced AI Accenture(埃森哲高级人工智能中心)

AI总结 提出CacheRL系统,通过混合思维轨迹流水线、三级模糊缓存和缓存层级感知奖励,以100倍更少计算量实现92%的多步工具调用准确率,接近GPT-5的94%。

详情
AI中文摘要

我们提出了CacheRL,一个用于训练小型智能体基础模型的系统,在多步工具调用任务上实现了92%的过程准确率,接近GPT-5的94%,同时所需计算量减少100倍。我们的方法解决了实际智能体训练中的三个挑战:大规模从大模型迁移工具调用知识、无需昂贵实时工具执行的强化学习、以及从有噪声的缓存环境中稳健学习。CacheRL引入了三项关键创新。首先,混合思维轨迹流水线通过LLM生成的推理轨迹增强智能体轨迹,产生不仅教授模型调用什么工具还教授为什么调用的训练样本。其次,CacheAgentLoop通过三级模糊缓存消除了实时执行成本,同时使用token级掩码保持轨迹保真度。第三,缓存层级感知奖励动态调整答案质量权重,以避免因缓存引起的限制而惩罚模型。通过迭代监督微调(SFT)和组相对策略优化(GRPO),CacheRL将Qwen3-4B-Thinking的验证奖励从0.43提升到0.78。在公开的智能体工具调用基准测试中,我们的模型达到了与GPT-5等前沿模型竞争的性能。消融研究表明,移除知识迁移会使性能降低41%,而缓存感知奖励贡献了17%的提升。有趣的是,强化学习提高了训练稳定性,但在强监督微调基础上带来的提升有限,这表明在构建实用的小型智能体模型时,数据质量和奖励设计比复杂的优化方法更重要。

英文摘要

We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only what tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model achieves competitive performance against frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
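One way to picture a tiered fuzzy cache with a tier-dependent reward is sketched below. The abstract does not define the three tiers or the reward formula, so everything here is a hypothetical reading: tier 1 is an exact match on the tool call string, tier 2 a match after case and whitespace normalization, tier 3 a fuzzy match via `difflib.SequenceMatcher`, and the answer-quality weight shrinks as the cache hit becomes less exact. The class name, threshold, and weights are invented for illustration.

```python
from difflib import SequenceMatcher

class TieredToolCache:
    """Hypothetical three-tier lookup for cached tool results."""

    def __init__(self, fuzzy_threshold=0.85):
        self.store = {}
        self.fuzzy_threshold = fuzzy_threshold

    @staticmethod
    def _normalize(call):
        return " ".join(call.lower().split())

    def put(self, call, result):
        self.store[call] = result

    def get(self, call):
        """Return (result, tier); tier is 1, 2, 3, or None on a miss."""
        if call in self.store:                       # tier 1: exact
            return self.store[call], 1
        norm = self._normalize(call)
        for key, value in self.store.items():        # tier 2: normalized
            if self._normalize(key) == norm:
                return value, 2
        best_key, best_score = None, 0.0
        for key in self.store:                       # tier 3: fuzzy
            score = SequenceMatcher(None, norm, self._normalize(key)).ratio()
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= self.fuzzy_threshold:
            return self.store[best_key], 3
        return None, None

# Assumed weights: trust answer quality less when the cache hit is less exact.
ANSWER_WEIGHT = {1: 1.0, 2: 0.8, 3: 0.5, None: 0.0}

def tier_aware_reward(process_score, answer_score, tier):
    """Weighted mean of process and answer scores; the answer weight depends on tier."""
    w = ANSWER_WEIGHT[tier]
    return (process_score + w * answer_score) / (1.0 + w)

cache = TieredToolCache()
cache.put('search(query="weather in Paris")', "sunny")
```

With these assumed weights, a wrong final answer after a fuzzy hit costs the policy less than the same wrong answer after an exact hit, which is the intent of not penalizing the model for limitations introduced by the cache.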

2606.14302 2026-06-15 cs.CL 新提交

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

回顾性进度感知的LLM智能体训练自我精炼

Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao

发表机构 * Shanghai Jiao Tong University(上海交通大学) OPPO Research Institute(OPPO研究院)

AI总结 提出RePro框架,通过“先前向执行、后回顾反思”的rollout范式训练智能体自我生成进度信号,无需持续外部监督,在WebShop等任务上使Qwen系列的成功率绝对提升最高达12%。

详情
AI中文摘要

基于强化学习训练的LLM智能体优化逐步动作预测,但缺乏对任务进度的元认知意识,导致阻碍长期扩展的差距。一项初步研究表明,在线进度提示会损害性能,而回顾性演示有帮助,但这种能力无法仅从结果奖励训练中涌现。我们提出RePro,即回顾性进度感知训练,一种通过前向-反思滚动范式训练智能体自我生成进度信号的框架:智能体在线执行动作,然后根据完成的轨迹和已知结果回顾性地重新评估其逐步进度。RePro通过回顾性热身初始化,从最少的外部演示中教授反思格式,然后通过RePro-PO进一步训练,使用复合奖励产生自我生成的信号,无需持续的外部监督。在WebShop、ALFWorld和Sokoban上的实验表明,RePro提升了Qwen系列的性能,绝对成功率提升高达12%。

英文摘要

LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospective demonstrations help, yet this capability cannot emerge from outcome-reward training alone. We present RePro, Retrospective Progress-Aware Training, a framework that trains agents to self-generate progress signals via a forward-then-reflect rollout paradigm: the agent executes actions online, then retrospectively reassesses its step-wise progress given the completed trajectory and known outcome. RePro initializes with a Retrospection Warmup that teaches reflection format from minimal external demonstrations, then further trains through RePro-PO with a composite reward that produces self-generated signals without continuous external supervision. Experiments on WebShop, ALFWorld, and Sokoban show that RePro enhances the Qwen family's performance, with up to $12\%$ absolute success rate gains.

2606.14674 2026-06-15 cs.CL 新提交

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

AgentSpec: 通过受控组合理解具身智能体脚手架

Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校) Johns Hopkins University(约翰霍普金斯大学) University of Washington(华盛顿大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出AgentSpec模块化规范框架,将具身智能体表示为可复用策略组件的类型化组合,通过标准化接口实现受控组件替换与重组,揭示脚手架兼容性和交互效应对性能的主导作用。

详情
AI中文摘要

LLM智能体越来越多地构建为脚手架系统,而非单一模型调用,这些系统结合了推理、记忆、反思、动作执行和学习。虽然此类脚手架通常能提升性能,但它们往往嵌入在紧密耦合的流水线中,使得难以隔离组件贡献、比较替代设计或理解模块交互如何塑造智能体行为。我们引入AgentSpec,一个模块化规范框架,将具身智能体表示为具有标准化接口的可复用策略组件的类型化组合。AgentSpec标准化了感知、记忆、推理、反思、动作和可选学习之间的接口,使得组件能够在受控条件下被交换和重组。我们在DeliveryBench、ALFRED、MiniGrid和RoboTHOR上实例化该框架,并分析了跨模型骨干的推理、记忆、反思和强化学习模块。我们的结果表明,智能体性能由脚手架兼容性和交互效应主导,而非孤立模块强度。特别是,结构化多粒度记忆改善了长程状态跟踪,推理和记忆在不同环境中非均匀交互,反思在纠正和成本之间权衡,而RL训练的策略在与部署时脚手架结构共同优化时组合最佳。AgentSpec为研究、比较和设计可组合的LLM智能体提供了受控基础。我们的代码、基线和交互式演示环境(playground)已在 https://agentspec-embodied.github.io 公开。

英文摘要

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines and interactive playground are publicly available at https://agentspec-embodied.github.io.
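The idea of typed, swappable components behind standardized interfaces can be illustrated with structural typing in Python. The sketch below is not AgentSpec's API: the `Memory` and `Reasoner` protocols, the toy implementations, and the `Agent` class are all hypothetical, and only two of the six interfaces named in the abstract are modelled. It shows how holding the reasoner fixed while swapping the memory module isolates that module's contribution.

```python
from dataclasses import dataclass
from typing import Protocol

class Memory(Protocol):
    def read(self) -> list[str]: ...
    def write(self, item: str) -> None: ...

class Reasoner(Protocol):
    def decide(self, observation: str, context: list[str]) -> str: ...

class ListMemory:
    """Keeps every past observation."""
    def __init__(self):
        self.items = []
    def read(self):
        return list(self.items)
    def write(self, item):
        self.items.append(item)

class NoMemory:
    """Ablation: same interface, stores nothing."""
    def read(self):
        return []
    def write(self, item):
        pass

class EchoReasoner:
    """Toy reasoner whose output exposes how much context it received."""
    def decide(self, observation, context):
        return f"act:{observation}|seen:{len(context)}"

@dataclass
class Agent:
    memory: Memory
    reasoner: Reasoner

    def step(self, observation: str) -> str:
        action = self.reasoner.decide(observation, self.memory.read())
        self.memory.write(observation)
        return action

with_memory = Agent(ListMemory(), EchoReasoner())
without_memory = Agent(NoMemory(), EchoReasoner())
```

Because both memory classes satisfy the same protocol, the two agents differ in exactly one component, which is the kind of controlled swap the abstract describes.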

2606.13683 2026-06-15 cs.AI cs.CL 交叉投稿

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

UP-NRPA:基于用户画像的嵌套 rollout 策略自适应,用于目标导向对话系统中基于大语言模型的规划

Hui Wang, Fafa Zhang, Meng Liu, Xiangyu Chen, Chaoxu Mu

发表机构 * School of Artificial Intelligence, Anhui University(安徽大学人工智能学院) Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University(安徽大学安徽省安全人工智能重点实验室) Pengcheng Laboratory(鹏城实验室)

AI总结 提出 UP-NRPA 在线框架,利用大语言模型和用户画像实时自适应调整对话策略,无需离线强化学习,在协作与非协作对话基准的多个任务中实现 100% 成功率,谈判任务的成交价与标价之比(sale-to-list ratio,SL)提升 56.41%。

详情
AI中文摘要

为了解决当前对话策略规划方法难以动态适应不同用户特征的挑战,本文提出了一种结合大语言模型、基于用户画像的嵌套 rollout 策略自适应(UP-NRPA)在线框架。传统方法依赖模型训练,需要为用户群体离线学习强化学习策略模型;与之不同,UP-NRPA 通过自适应机制实现对话策略的动态定制。具体而言,它利用实时用户反馈以及从当前用户画像映射出的个性、偏好和目标,从而无需离线强化学习即可适应用户特征。在协作和非协作对话基准测试中,UP-NRPA 展现了显著优势,在多个对话任务中实现了 100% 的成功率。特别是在谈判任务中,成交价与标价之比(sale-to-list ratio,SL)提高了 56.41%。这表明 UP-NRPA 无需训练机制即可适应多样化的用户需求,使对话系统能够适应用户特征。

英文摘要

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches dependent on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

2606.13715 2026-06-15 cs.AI cs.CL cs.MA 交叉投稿

WorkBench Revisited: Workplace Agents Two Years On

WorkBench 再探:两年后的工作场所智能体

Olly Styles

发表机构 * GitHub

AI总结 本文重新评估2024至2026年间WorkBench基准上智能体的进展,发现前沿模型在能力和安全性上均有显著提升,同时开放权重模型大幅降低了达到同等性能水平的成本。

Comments 8 pages, 3 figures. Follow-up to arXiv:2405.00823

详情
AI中文摘要

2024年3月,WorkBench上表现最好的智能体GPT-4完成了43%的任务,并在26%的任务中采取了意外的有害行为(例如给错误的人发送电子邮件)。我们在2026年6月重新审视该基准,发现迄今为止最好的智能体Claude Opus 4.8完成了89%的任务,并仅在2.5%的任务中采取了意外的有害行为。除了前沿智能体性能的显著进步外,有三点值得注意。首先,在WorkBench上,能力与安全性是相辅相成的,而非相互权衡,因此完成最多任务的模型造成的意外损害也最少。其次,虽然几类错误已被完全消除,但前沿模型仍然会犯一些基本错误,有时会导致不可逆转的损害,例如将电子邮件发送给错误的人。第三,开放权重模型的兴起大幅降低了此前仅专有模型才能达到的性能水平的成本,而前沿模型的成本则保持相对稳定。我们发布了该基准的更新版本,包括数据与代码质量改进、新的模型评分以及自2024年以来WorkBench上智能体进展的分析。

英文摘要

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes an unintended harmful action on 2.5%. Aside from this considerable progress in frontier agent performance, three things stand out. First, capability and safety go together on WorkBench rather than trade off, so the models that finish the most tasks also do the least unintended damage. Second, while several classes of error have been totally eliminated, frontier models still make some basic mistakes that occasionally result in irreversible harm, such as sending an email to the wrong person. Third, the rise of open-weight models has drastically lowered costs for a performance level that was previously only accessible to proprietary models, while frontier costs have stayed relatively stable. We release an updated version of the benchmark with data and code quality improvements, new model scores, and analysis of agent progress on WorkBench since 2024.

2606.14155 2026-06-15 cs.LG cs.CL 交叉投稿

Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

基于图的目标反向传播用于多LLM智能体系统中的上下文自适应

Tan Zhu, Tong Yao, Kananart Kuwaranancharoen, Amit Singh, Yushang Lai, Deepa Mohan, Shankara Bhargava

发表机构 * Retail Intelligence, Walmart Global Tech(零售智能,沃尔玛全球技术)

AI总结 提出GTBP框架,通过图结构反向传播局部目标输出,实现多LLM智能体工作流的上下文自适应,理论保证稳定性,实验优于基线。

详情
AI中文摘要

上下文自适应通过迭代地从任务反馈中修改可调提示,无需修改模型权重,自动化了基于LLM系统中的提示工程。将这一范式扩展到多LLM智能体系统至关重要:现有方法存在不准确的信用分配问题且缺乏收敛保证。我们提出基于图的目标反向传播(GTBP),一种针对建模为有向无环图的智能体工作流的上下文自适应框架。GTBP通过工作流图向后传播局部目标输出,并利用目标-输出差异指导阶段式提示更新机制。理论上,我们证明GTBP的阶段式提示更新在迭代中变得稳定,且足够强大的LLM优化器可以降低整体目标。实验上,GTBP在三个基准测试中一致优于强基线,同时保持可比较的计算成本。

英文摘要

Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assignment and lack convergence guarantees. We propose Graph-based Target Back-Propagation (GTBP), a context adaptation framework for agentic workflows modeled as directed acyclic graphs. GTBP propagates local target outputs backward through the workflow graph and uses target-output discrepancies to guide a stage-wise prompt update mechanism. Theoretically, we show that GTBP's stage-wise prompt updates become stable over iterations, and that a sufficiently capable LLM optimizer can decrease the overall objective. Empirically, GTBP consistently outperforms strong baselines across three benchmarks while maintaining comparable computational cost.

2606.14470 2026-06-15 cs.AI cs.CL cs.LG 交叉投稿

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

GitOfThoughts: 版本控制的推理与可回放、差异比较和合并的智能体记忆

Pavan C Shekar, Abhishek H S, Aswanth Krishnan

发表机构 * QpiAI

AI总结 提出GitOfThoughts框架,将智能体推理树存储为git仓库,实现推理的可回放、审计和合并;实验表明,对于新问题,任何记忆格式均不能可靠提升准确率,仅当检索案例与当前问题高度相似(>0.8)时才有显著提升,且收益来自答案检索而非方法迁移。

Comments 10 pages, 1 figure, 9 tables

详情
AI中文摘要

大语言模型推理是短暂的:思维链随上下文窗口消失,剪枝的搜索分支不留记录,记忆缓冲区无法进行差异比较、合并或审计。其他所有复杂的软件过程(代码、基础设施、数据、实验)都受版本控制;推理却没有。我们提出GitOfThoughts,将智能体的推理树存储为git仓库:每个评分的思维是一个提交,分数是注释,结果是标签,检索是智能体自身历史上的“git log”。这使得推理可回放、可审计,并且可以在智能体之间以近乎零的工程成本进行合并。然后我们提出一个更难的问题:记忆在任何基质上是否真的能提高准确性?在五种基质(无、markdown、向量、图、git)、两个基准、两个模型规模以及预注册的复制实验中,对于新问题的答案是否定的。没有一种记忆格式可靠地有帮助,一个有希望的早期结果在其自身的预注册复制下崩溃了。记忆只有在超过我们所谓的可复制阈值时才有效:当检索到的案例与当前问题几乎重复(相似度>~0.8)时,准确率急剧上升;低于此阈值,则无效果。收益是答案检索,而非方法迁移:一个4.5倍大的模型使近重复收益翻倍,但仍然无法从工作示例中提取可迁移的方法。我们发现唯一的通用杠杆是测试时采样。因此,git作为基质的理由是审计性、溯源性和可合并性,且准确率相当。我们记录了一个撤回的结果和一个被反驳的假设,以体现我们坚持的评估标准。

英文摘要

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.
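The copyability threshold described above suggests a simple retrieval gate: reuse a stored answer only when the best-matching past case is a near-duplicate, and otherwise return nothing rather than a loosely related example. The sketch below assumes cosine similarity over embedding vectors and a cutoff of 0.8. The abstract states only "similarity >~ 0.8" without naming the measure, so the metric, the function names, and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_if_copyable(query_vec, memory, threshold=0.8):
    """memory: list of (vector, answer) pairs.

    Returns the stored answer of the most similar case only when its
    similarity exceeds the threshold; otherwise returns None.
    """
    best = max(memory, key=lambda item: cosine(query_vec, item[0]), default=None)
    if best is None:
        return None
    return best[1] if cosine(query_vec, best[0]) > threshold else None

# Hypothetical two-case memory with 2-d embeddings.
memory = [([1.0, 0.0], "A"), ([0.0, 1.0], "B")]
near_duplicate = retrieve_if_copyable([0.9, 0.1], memory)
novel = retrieve_if_copyable([1.0, 1.0], memory)
```

Returning `None` below the cutoff mirrors the reported finding that memory contributes nothing for novel problems, so the caller would fall back to plain test-time sampling in that case.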

2505.16988 2026-06-15 cs.CL cs.AI cs.MA 版本更新

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

MASLab:基于LLM的多智能体系统的统一全面代码库

Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) University of Oxford(牛津大学) Princeton University(普林斯顿大学) Meta University of Michigan(密歇根大学) The University of Sydney(悉尼大学) Beihang University(北航) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI总结 提出MASLab代码库,集成20余种方法,提供统一环境与标准化评估,降低研究门槛,覆盖10+基准测试和8种模型。

Comments 18 pages, 11 figures

详情
AI中文摘要

基于LLM的多智能体系统(MAS)在增强单个LLM以解决实际应用中复杂多样任务方面展现出巨大潜力。尽管取得了显著进展,该领域缺乏统一代码库来整合现有方法,导致重复实现、不公平比较和研究人员的高入门门槛。为应对这些挑战,我们引入MASLab,一个统一、全面且研究友好的基于LLM的MAS代码库。(1)MASLab集成了跨多个领域的20余种已建立方法,每种方法均通过逐步输出与官方实现的比较得到严格验证。(2)MASLab提供统一环境,包含多种基准测试,用于方法间的公平比较,确保一致输入和标准化评估协议。(3)MASLab在共享的简化结构中实现方法,降低了理解和扩展的门槛。基于MASLab,我们进行了涵盖10+基准测试和8种模型的广泛实验,为研究人员提供了当前MAS方法格局的清晰全面视图。MASLab将持续发展,跟踪该领域最新进展,并欢迎更广泛开源社区的贡献。

英文摘要

LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

2601.22436 2026-06-15 cs.CL 版本更新

Large Language Model Agents Are Not Always Faithful Self-Evolvers

大型语言模型智能体并非总是忠实的自我进化者

Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, Ting Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本研究首次系统调查自我进化LLM智能体的经验忠实性,通过因果干预发现智能体依赖原始经验,但常忽略或误解浓缩经验,并分析其成因。

Comments ICML 2026

详情
AI中文摘要

自我进化的大型语言模型(LLM)智能体通过积累和重用过去的经验不断改进,但尚不清楚它们是否忠实地依赖这些经验来指导其行为。我们首次系统调查了自我进化LLM智能体中的经验忠实性,即智能体决策对其所获经验的因果依赖性。通过对原始和浓缩两种形式的经验进行受控因果干预,我们在13个LLM骨干和9个环境上全面评估了四个代表性框架。我们的分析揭示了一个显著的不对称性:虽然智能体始终依赖原始经验,但它们经常忽略或误解浓缩经验,即使这是唯一提供的经验。这种差距在单智能体和多智能体配置以及不同骨干规模中持续存在。我们将其根本原因追溯到三个因素:浓缩内容的语义局限性、抑制经验的内部处理偏差,以及预训练先验已足够的任务情形。这些发现挑战了关于自我进化方法的普遍假设,并强调了需要更忠实、更可靠的经验整合方法。

英文摘要

Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 13 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

2606.13662 2026-06-15 cs.AI cs.CL 版本更新

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

EurekAgent:自主科学发现中,智能体环境工程即一切

Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Zhipu AI(智谱AI)

AI总结 提出环境工程框架EurekAgent,通过权限、工件、预算和人机交互四维工程设计,在数学、内核工程和机器学习任务上取得新最优结果,总API成本低于11美元。

详情
AI中文摘要

基于LLM的智能体在自动化科学发现方面展现出日益增长的潜力。给定一个可优化的度量和执行环境,它们可以提出、验证和迭代科学解决方案,并已产生超越人类设计方法的结果。随着模型能力的持续提升,我们认为自主科学发现的瓶颈正从规定智能体工作流程转向设计智能体环境:即塑造智能体行为的资源、约束和接口。我们将此框架化为环境工程:构建能够放大生产性行为(如开放式探索、系统化工件管理和智能体间协作)同时抑制有害行为(如奖励黑客和高摩擦人工监督)的环境。我们提出了EurekAgent,一个用于度量驱动自主科学发现的环境工程智能体系统。EurekAgent从四个维度进行环境工程:权限工程用于受限智能体执行和隔离评估;工件工程用于基于文件系统和Git的协作;预算工程用于预算感知探索;人机交互工程用于便捷的人工监督和干预。EurekAgent在多个数学、内核工程和机器学习任务上取得了新的最优结果,包括以不到11美元的总API成本发现新的26圆填充最优结果。我们开源了代码和结果,并呼吁将环境工程作为开发可靠自主研究智能体的核心研究方向。

英文摘要

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

4. 语义、语法与语言学分析 3 篇

2606.13993 2026-06-15 cs.CL 新提交

The Holistic Storage of Verb+Up Phrases in Text-based and Audio-based Language Models

文本和音频语言模型中动词+up短语的整体存储

Zachary Nicholas Houghton, Yu Zhou, Dan Pluth, Vijay K. Gurbani

发表机构 * University of Oregon(俄勒冈大学) Vail Systems, Inc(Vail Systems公司)

AI总结 研究文本和音频语言模型对动词+up短语的整体存储,发现频率和可预测性驱动独立表征,支持基于使用的语言理论。

详情
AI中文摘要

语言能力的一个关键方面是在存储表征和抽象知识之间进行权衡:必须检索学过的表征,但也需要通过应用生产性规则生成新的表征。虽然近期研究关注语言模型中的抽象知识,但对多词单元的整体存储关注较少。我们探究基于文本的LLM和ASR模型的内部表征,测试V+up短语动词是否因频率和可预测性而形成不同的表征。所有模型都显示出由频率和可预测性驱动的整体存储证据,进一步支持基于使用的语言理论。

英文摘要

A crucial aspect of linguistic capability is the ability to trade off between stored representations and abstract knowledge: one must retrieve learned representations, but also generate novel ones by applying productive rules. While recent work has examined abstract knowledge in language models, holistic storage of multi-word units has received far less attention. We probe internal representations in text-based LLMs and an ASR model, testing whether V+up phrasal verbs develop distinct representations as a function of frequency and predictability. All models show evidence of holistic storage driven by frequency and predictability, further supporting usage-based theories of language.

2606.14512 2026-06-15 cs.CL cs.AI 新提交

Fodor and Pylyshyn's Systematicity Challenge Still Stands

Fodor和Pylyshyn的系统性挑战依然存在

Michael Goodale, Salvador Mascarenhas

发表机构 * Institut Jean Nicod, Département d’études cognitives ENS, EHESS, CNRS, PSL University(让·尼科研究所,ENS认知科学系,EHESS,CNRS,PSL大学)

AI总结 本文通过实验证明,Lake和Baroni的元学习组合协议模型在分布外和分布内问题上均表现不佳,未能满足Fodor和Pylyshyn对神经网络系统性提出的挑战。

Comments Accepted in the Transactions of the Association for Computational Linguistics (TACL). This is a pre-MIT Press publication version of the paper

详情
AI中文摘要

神经网络近期在生成类人语言方面的成功在认知科学领域引起了巨大轰动,许多研究者认为,关于人类认知的经典难题以及对人工智能的挑战正被神经网络解决。一个显著的例子是Jerry Fodor和Zenon Pylyshyn提出的系统性论证,该论证认为人类表现出系统性的双条件依赖关系。例如,某人能理解句子“John saw Mary”当且仅当能理解句子“Mary saw John”。符号系统解释了这种语言和思维的系统性,而神经网络则没有提供直接的解释。最近几篇文章声称这一挑战已被神经网络解决。特别是,Brenden Lake和Marco Baroni认为他们的元学习组合协议匹配并可能解释了人类的系统性。我们证明这些结论为时过早。在其他结果中,我们发现他们的模型难以学习与训练数据分布稍有差异的规则。此外,即使在许多分布内问题上,模型的行为也是非系统性的。我们得出结论,Fodor和Pylyshyn对神经网络的挑战仍未得到满足。

英文摘要

The recent successes of neural networks in producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we find that their model struggles to learn rules that are even slightly out of distribution compared to its training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

2509.24102 2026-06-15 cs.CL 版本更新

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

道德推理习得的语用推理:通过元语用链接实现泛化

Guangliang Liu, Xi Chen, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Johnson

发表机构 * Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校) Nanyang Technological University(南洋理工大学) University of Mississippi(密西西比大学) Northeastern University(东北大学) Qualcomm(高通公司) Michigan State University(密歇根州立大学)

AI总结 针对大语言模型在道德推理中泛化能力不足的问题,提出基于元语用链接和道德基础理论的语用推理方法,使模型获取道德推理目标与社会变量间的元语用链接,在三个任务上验证了其适应性和泛化性。

详情
AI中文摘要

虽然道德推理已成为大型语言模型(LLM)的一个有前景的研究方向,但实现稳健的泛化仍然是一个关键挑战。这一挑战源于所说内容与道德隐含内容之间的差距。在本文中,我们基于元语用链接和道德基础理论来缩小这一差距。具体来说,我们开发了一种语用推理方法,使LLM在给定道德情境下,能够获取道德推理目标与影响它们的社会变量之间的元语用链接。我们将该方法应用于三个不同的道德推理任务,以展示其适应性和泛化性。实验结果表明,我们的方法显著增强了LLM在道德推理中的泛化能力,为未来研究利用语用推理处理更广泛的道德推理任务铺平了道路。

英文摘要

While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and what is morally implied. In this paper, we build on metapragmatic links and Moral Foundations Theory to close this gap. Specifically, we develop a pragmatic inference approach that enables LLMs, given a moral situation, to acquire the metapragmatic links between moral reasoning objectives and the social variables that influence them. We adapt this approach to three different moral reasoning tasks to demonstrate its adaptability and generalizability. Experimental results show that our approach significantly enhances LLMs' generalization in moral reasoning, paving the way for future research to leverage pragmatic inference across a wide range of moral reasoning tasks.

5. 多模态语言处理 7 篇

2606.14691 2026-06-15 cs.CL 新提交

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CORA: 通过一致性导向的推理对齐分析与弥合多模态RLVR中的思考-答案差距

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Wuhan University(武汉大学) Tsinghua University(清华大学) Tianjin University(天津大学)

AI总结 本文分析多模态RLVR中思考与答案的语义不一致问题,提出CORA方法,通过轻量级一致性奖励模型引入语义一致性,并采用混合奖励优势分裂稳定优化,提升推理忠实度。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成功激发大语言模型的推理能力,推动其向多模态场景扩展。现有方法主要关注提升推理轨迹的视觉覆盖和缓解视觉幻觉,但低估了推理过程与最终答案之间的语义不一致性。本文深入研究了大型视觉语言模型(LVLMs)中RLVR的思考-答案不一致性,通过对组相对策略优化(GRPO)训练过程中收集的轨迹以及RLVR后评估输出的分析,表明该问题在训练期间持续存在,并在推理时仍然存在。受此分析启发,我们提出一致性导向的推理对齐(CORA),通过轻量级即插即用的一致性奖励模型将思考-答案语义一致性引入RLVR,并进一步结合混合奖励优势分裂(HRAS)以稳定协调任务和一致性优化。在代表性多模态推理基准和主流LVLMs上的大量实验表明,CORA在提升任务性能的同时有效缓解了思考-答案不一致性,从而产生更忠实的推理轨迹。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we delve into thinking-answer inconsistency in RLVR for large vision-language models (LVLMs). Through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs, we show that this issue persists during training and remains present at inference. Motivated by this analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.
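The abstract does not spell out how HRAS combines the two reward signals, so the following is only a minimal sketch under stated assumptions. It shows the standard GRPO idea of standardizing rewards within a group of rollouts, applied separately to a task reward and a consistency reward. The function names, the per-stream normalization, and the weight `lam` are hypothetical illustrations, not CORA's actual formulation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def split_advantages(task_rewards, consistency_rewards, lam=0.5):
    """Hypothetical combination: normalize the two reward streams separately,
    then add them with weight `lam`. This is an assumption for illustration,
    not the HRAS rule from the paper."""
    a_task = group_advantages(task_rewards)
    a_cons = group_advantages(consistency_rewards)
    return [t + lam * c for t, c in zip(a_task, a_cons)]

# Four rollouts for one prompt: task correctness and thinking-answer consistency.
task = [1.0, 0.0, 1.0, 0.0]
consistency = [1.0, 1.0, 0.0, 0.0]
adv = split_advantages(task, consistency)
```

In this toy group, the rollout that is both correct and consistent receives the largest advantage, and the one that is neither receives the smallest. Normalizing each stream on its own keeps one reward's scale from swamping the other.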

2606.13707 2026-06-15 cs.AI cs.CL cs.CV 交叉投稿

Orchestra-o1: Omnimodal Agent Orchestration

Orchestra-o1: 全模态智能体编排

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng

发表机构 * CUHK(香港中文大学) LIGHTSPEED PKU(北京大学) THU(清华大学) Tongji University(同济大学)

AI总结 提出Orchestra-o1全模态智能体编排框架,通过统一编排机制实现模态感知任务分解、在线子智能体专业化和并行子任务执行,在OmniGAIA基准上准确率超第二名10.3%,并引入DA-GRPO强化学习方法训练Orchestra-o1-8B达到开源全模态智能体最优性能。

详情
AI中文摘要

近期智能体集群的成功将基于大语言模型(LLM)的智能体从单智能体工作流范式转向多智能体系统,凸显了智能体编排在任务分解与协作中的重要性。然而,现有编排框架局限于狭窄的模态集合,难以泛化到异构模态共存并交互的更复杂场景。这种局限性在全模态场景中尤为突出,此类任务需要对文本、图像、音频和视频等多样化输入进行统一理解与协调。在本工作中,我们提出Orchestra-o1,一种全模态智能体编排框架,旨在支持跨多种模态的高效智能体协作。Orchestra-o1引入统一编排机制,实现模态感知任务分解、在线子智能体专业化和并行子任务执行。这种可扩展设计使智能体系统能够有效处理涉及异构信息源的复杂现实任务,在OmniGAIA基准上超越第二名方法10.3%的准确率。此外,我们提出决策对齐群体相对策略优化(DA-GRPO),一种高效的智能体强化学习方法,用于训练Orchestra-o1-8B,该方法在所有现有开源全模态智能体中取得了最先进性能。

英文摘要

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

2606.14072 2026-06-15 cs.CV cs.CL 交叉投稿

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

扩散细化分割与视觉-语言解释用于儿童脑肿瘤MRI

Wentao Ke, Jianche Liu

发表机构 * Department of Mechanical Engineering, Stanford University(斯坦福大学机械工程系) School of Medicine, Stanford University(斯坦福大学医学院)

AI总结 提出两阶段框架,先用Swin-UNETR粗分割,再用条件扩散模型细化边界,最后结合多模态语言模型生成结构化报告,提升儿童脑肿瘤分割精度和可解释性。

详情
AI中文摘要

由于标注数据有限、成像表型异质性、肿瘤边界弥散以及肿瘤子区域类别不平衡,准确的儿童脑肿瘤分割仍然具有挑战性。在此,我们提出一个两阶段深度学习框架,用于改进多模态儿童脑MRI分割和临床解释。首先,我们在BraTS-PEDs MRI扫描上评估3D Res U-Net和Swin-UNETR基线模型,使用四种配准模态预测肿瘤核心、全肿瘤和增强肿瘤区域。其次,我们引入基于扩散的细化模型,以粗Swin-UNETR预测为条件,包括3D DDPM细化器和MedSegDiff。条件化显著提高了扩散稳定性和性能,特别是对于增强肿瘤边界分割。条件化MedSegDiff实现了最强的边界一致性,HD95最低。最后,预测的肿瘤体积和代表性分割叠加图与多模态语言模型集成,生成结构化的放射学风格报告。综合来看,我们的结果表明,从粗到细的扩散分割可以改善儿童肿瘤边界描绘,并支持端到端可解释的AI辅助神经肿瘤学工作流程。

英文摘要

Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage deep learning framework for improving multi-modal pediatric brain MRI segmentation and clinical interpretation. First, we evaluate 3D Res U-Net and Swin-UNETR baselines on BraTS-PEDs MRI scans, using four co-registered modalities to predict tumor core, whole tumor, and enhancing tumor regions. Second, we introduce diffusion-based refinement models conditioned on coarse Swin-UNETR predictions, including a 3D DDPM refiner and MedSegDiff. Conditioning substantially improves diffusion stability and performance, particularly for enhancing tumor boundary segmentation. Conditioned MedSegDiff achieves the strongest boundary agreement with the lowest HD95. Finally, predicted tumor volumes and representative segmentation overlays are integrated with a multimodal language model to generate structured radiology-style reports. Together, our results suggest that coarse-to-refined diffusion segmentation can improve pediatric tumor boundary delineation and support end-to-end interpretable AI-assisted neuro-oncology workflows.
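HD95, the boundary metric named above, is the 95th-percentile Hausdorff distance between two boundaries. The sketch below works on small point sets rather than real 3D masks, and it assumes two conventions that vary between toolkits: a nearest-rank percentile, and taking the maximum of the two directed percentiles. It illustrates the metric and is not the authors' evaluation code.

```python
import math

def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def hd95(a, b):
    """Symmetric HD95: max of the two directed 95th-percentile distances."""
    return max(
        percentile_nearest_rank(directed_distances(a, b), 95),
        percentile_nearest_rank(directed_distances(b, a), 95),
    )

# Toy boundary points (hypothetical voxel coordinates).
predicted = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
reference = [(0, 1, 0), (1, 1, 0), (2, 1, 0)]
score = hd95(predicted, reference)
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a few outlier boundary voxels, which is why it is commonly reported alongside Dice for segmentation.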

2606.14141 2026-06-15 cs.SD cs.AI cs.CL 交叉投稿

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

动态声源的时空音频语言建模

Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida, Kim Sung-Bin, Toshimitsu Uesaka, Takashi Shibuya, Kyeongyoon Lee, Tae-Hyun Oh, Yuki Mitsufuji

发表机构 * POSTECH(浦项科技大学) Sony AI(索尼AI) Sony Group Corporation(索尼集团) Sungkyunkwan University(成均馆大学) KAIST(韩国科学技术院)

AI总结 提出ST-AudioLM模型,通过时空音频编码器联合学习事件语义与源轨迹,在ST-AudioQA基准上提升动态声源问答的语义-定位权衡。

详情
AI中文摘要

声音事件是具有语义身份、位置和轨迹的实体,但当前的音频-语言模型通常将片段推理为全局事件内容。相反,声音事件定位模型随时间跟踪声源方向,但对语言推理的语义覆盖有限。为解决这一差距,我们引入了ST-AudioQA,一个基于一阶环绕声(FOA)渲染的静态和移动声源的时空音频问答数据集和基准。每个场景提供源身份、活动、方向、距离和运动元数据,实现密集轨迹监督以及关于什么在发声、在哪里、如何移动以及源之间关系的问题。我们进一步提出了ST-Audio Encoder,一种时间分辨的FOA音频编码器,联合学习事件语义和源轨迹,以及ST-AudioLM,它将编码器的音频令牌连接到LLM进行时空音频问答。实验表明,这种表示改善了语义-定位权衡,并比静态空间和面向定位的基线产生更强的推理性能。

英文摘要

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.

2606.14703 2026-06-15 cs.CV cs.CL cs.LG 交叉投稿

Gaze Heads: How VLMs Look at What They Describe

注视头:视觉语言模型如何观察它们所描述的内容

Rohit Gandikota, David Bau

发表机构 * Northeastern University(东北大学)

AI总结 发现视觉语言模型的语言骨干中存在一组“注视头”,其注意力跟踪当前描述的图像区域,通过干预这些头可精确控制模型描述内容,准确率达83.1%。

详情
AI中文摘要

视觉语言模型在内部如何解决描述图像的任务远非显而易见。我们发现模型为此发展出一种特定机制:其语言模型骨干中的一小部分注意力头(我们称之为注视头),其注意力跟踪模型当前正在描述的图像区域。我们通过简单的相关性得分从几次前向传播中发现了它们,使用连环漫画作为受控测试平台,其中叙事顺序在空间上展开。这些注视头不仅跟踪正在描述的图像标记:将它们的注意力重定向到所选区域会强制视觉语言模型描述该区域。对前100个注视头(少于所有头的9%)进行单次注意力掩码干预,以83.1%的准确率将模型的答案引导到任何选定的漫画面板,而对随机头进行相同干预则无法重定向答案,并且对所有头进行干预会破坏生成。相同的杠杆还扩展到连续控制:在生成过程中切换注视目标会使模型在几个标记内结束当前面板描述并转向新面板。在漫画之外,相同的干预将答案重定向到自然COCO图像中的选定区域。该机制进一步在2B到32B参数的模型大小以及其他视觉语言模型架构中重复出现,尽管一些冻结编码器系列没有显示可比较的头集。更广泛地说,这表明通过机制分析识别的目标编辑可以作为实用的推理时杠杆来引导多模态模型行为,而无需任何重新训练。我们的代码、交互式演示和数据集可在以下网址获取:此 https URL

英文摘要

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
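The abstract says gaze heads are found "with a simple correlation score" but does not give the formula. As a rough sketch under that gap, the code below assumes one plausible form: the Pearson correlation between a head's attention mass on each comic panel and an indicator of which panel is being described at that step. The data layout, the function names, and the `(layer, head)` keys are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def gaze_score(attn_per_step, described_panel):
    """attn_per_step[t][p]: attention mass one head puts on panel p at step t.
    described_panel[t]: index of the panel being described at step t."""
    xs, ys = [], []
    for masses, target in zip(attn_per_step, described_panel):
        for p, mass in enumerate(masses):
            xs.append(mass)
            ys.append(1.0 if p == target else 0.0)
    return pearson(xs, ys)

def top_k_heads(scores, k):
    """Rank heads by score and keep the k highest."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

described = [0, 1, 2]
tracking = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # follows the narration
fixed = [[0.5, 0.25, 0.25]] * 3                                  # always looks at panel 0
scores = {(3, 7): gaze_score(tracking, described), (5, 2): gaze_score(fixed, described)}
selected = top_k_heads(scores, k=1)
```

A head that follows the described panel scores close to 1, while a head that stares at one panel regardless of the narration scores close to 0, so ranking by this score separates the two.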

2511.05017 2026-06-15 cs.CV cs.CL 版本更新

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

通过细化文本嵌入缓解大型视觉语言模型中的幻觉

Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Sarvesh Baskar, Vijay Kamarshi, Andrea Fanelli, Furong Huang

发表机构 * University of Maryland(马里兰大学) Dolby Laboratories(杜比实验室) Capital One

AI总结 针对大型视觉语言模型因过度依赖文本先验而忽视视觉线索导致的幻觉问题,提出一种简单有效的视觉特征融入方法,通过学习视觉信息化的文本嵌入来平衡注意力分布,显著降低幻觉并提升多模态推理能力。

Comments Accepted at The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

大型视觉语言模型(LVLMs)中的幻觉仍然是一个持续的挑战,通常源于多模态推理过程中视觉信息整合不足。一个关键原因是模型过度依赖文本先验而未能充分利用视觉线索,导致输出语言流畅但视觉上不准确。例如,给定一张空厨房台面的图像,LVLM可能会根据语言关联而非视觉证据幻觉出“一碗水果”或“一杯咖啡”。大多数LVLM通过将视觉特征附加到预训练LLM的输入流中,并在大规模视觉语言数据集上训练来整合视觉特征。我们的系统分析表明,由于LLM对语言主导表示的固有偏见,这种策略往往导致对文本信息的过度依赖。这种不平衡使注意力偏向文本而非视觉内容,削弱了模型将输出基于视觉输入的能力。为了解决这个问题,我们提出了一种简单而有效的视觉特征融入方法,鼓励模型学习与基础LLM不同的视觉信息化的文本嵌入,并促进更平衡的注意力分布。在多个幻觉基准上的实验结果表明,我们的方法显著减少了幻觉,并促进了更平衡的多模态推理。值得注意的是,我们的方法取得了显著提升,包括在MMVP-MLLM上+9.33%,在POPE-AOKVQA上+2.99%,在Merlin上高达+3.4%,以及在HallusionBench的硬数据分割上+3%。

英文摘要

Hallucinations in Large Vision-Language Models (LVLMs) remain a persistent challenge, often stemming from inadequate integration of visual information during multimodal reasoning. A key cause is the model's over-reliance on textual priors and underutilization of visual cues, leading to outputs that are linguistically fluent but visually inaccurate. For example, given an image of an empty kitchen countertop, an LVLM might hallucinate a "bowl of fruit" or "cup of coffee", relying on language associations rather than visual evidence. Most LVLMs incorporate visual features by appending them to the input stream of a pre-trained LLM and training on large-scale vision-language datasets. Our systematic analysis reveals that this strategy often leads to over-dependence on textual information due to the inherent bias of LLMs towards language-dominant representations. This imbalance skews attention towards the text over visual content, weakening the model's ability to ground outputs in visual inputs. To address this, we propose a simple yet effective visual feature incorporation method that encourages the model to learn visually-informed textual embeddings distinct from those of the base LLM and promotes a more balanced attention distribution. Experimental results across multiple hallucination benchmarks demonstrate that our method significantly reduces hallucinations and fosters more balanced multimodal reasoning. Notably, our approach achieves substantial gains, including +9.33% on MMVP-MLLM, +2.99% on POPE-AOKVQA, up to +3.4% on Merlin, and +3% on the hard-data split of HallusionBench.

2605.16739 2026-06-15 cs.LG cs.AI cs.CL q-bio.NC 版本更新

EmoMind: Decoding Affective Captions from Human Brain fMRI

EmoMind:从人类大脑fMRI信号解码情感描述

Bilal A. Mohammed, Lin Gu, Ruogu Fang

发表机构 * Department of Biomedical Engineering(生物医学工程系) Vanderbilt University(范德比大学) Research Institute of Electrical Communication(电气通信研究所) Tohoku University(东北大学) University of Florida(佛罗里达大学)

AI总结 本文提出EmoMind,首个端到端解码fMRI信号生成情感描述的系统,通过结合语义基础的中性场景描述和连续情感向量,实现了在内容保留与情感表达间的平衡,并在多个验证框架下优于基于标签提示的GPT-4。

详情
AI中文摘要

从大脑活动解码视觉经验已取得显著进展,但当前的脑-文本系统主要恢复语义内容而丢弃情感。此外,语言模型在接收到类别标签提示时可以生成情感文本,但此类标签将丰富的跨受试者变异性压缩成粗糙的离散类别。我们提出了EmoMind,首个直接从fMRI信号解码情感描述的端到端管道。EmoMind首先从脑解码的视觉特征中检索出有语义依据的中性场景描述,然后使用从同一fMRI记录中解码的连续34维情感向量重写该描述。为了控制内容保留与情感表达之间的平衡,我们使用无分类器引导(classifier-free guidance)训练重写器,并以一个恒等保持的空分支作为对照,从而在语义忠实性和情感表达性之间实现平滑插值。我们通过涵盖受试者特异性、结构几何和因果控制的三轴验证框架评估情感描述生成。我们进一步用合成大脑替代测试增强此框架,以探测对测量设备的鲁棒性,并以使用脑解码的前五个情感标签进行提示的GPT-4作为强离散基线,对每个轴进行基准比较。在两个独立的情感fMRI数据集中,EmoMind在所有三个轴上均显著优于标签提示的GPT-4,其中最大的收益出现在需要个人特定情感结构而非群体层面情绪聚合的指标上。这些结果确立了连续脑解码情感作为个性化情感描述生成的可行控制信号,并为研究个体情感大脑组织开辟了新方向。

英文摘要

Decoding visual experience from brain activity has advanced substantially, but current brain-to-text systems largely recover semantic content while discarding affect. Additionally, language models can generate emotional text when prompted with categorical labels, but such labels collapse rich inter-subject variability into coarse discrete bins. We present EmoMind, the first end-to-end pipeline for decoding affective captions directly from fMRI signals. EmoMind first retrieves a semantically grounded neutral scene description from brain-decoded visual features, then rewrites it using a continuous 34-dimensional emotion vector decoded from the same fMRI recording. To control the balance between content preservation and affective expression, we train the rewriter with classifier-free guidance against an identity-preserving null branch, enabling smooth interpolation between semantic fidelity and affective expressivity. We evaluate affective caption generation with a three-axis validation framework spanning subject-specificity, structural geometry, and causal control. We further augment this framework with a synthetic-brain substitution test that probes robustness to the measurement apparatus, and we benchmark each axis against GPT-4 prompted with brain-decoded top-5 emotion labels as a strong discrete baseline. Across two independent emotion fMRI datasets, EmoMind significantly outperforms label-prompted GPT-4 on all three axes, with the largest gains on metrics that require person-specific affective structure rather than population-level emotion aggregation. These results establish continuous brain-decoded affect as a viable control signal for individualized affective caption generation and open new directions for studying individual affective brain organisation.
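Classifier-free guidance is usually written as a linear blend between an unconditional (null) output and a conditioned one. The sketch below shows that standard formula on plain lists of scores. Treating the identity-preserving branch as the null output and the emotion-conditioned rewriter as the conditioned output is an assumption about how the abstract's description maps onto the formula, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(null_scores, cond_scores, scale):
    """Standard classifier-free guidance blend:
    out = null + scale * (cond - null).
    scale = 0 returns the null branch, scale = 1 the conditioned branch,
    and values above 1 push further toward the condition."""
    return [n + scale * (c - n) for n, c in zip(null_scores, cond_scores)]

# Toy per-token scores (hypothetical values).
null_branch = [0.0, 2.0]   # keeps the neutral caption as is
conditioned = [1.0, 0.0]   # rewrites toward the decoded emotion vector
halfway = cfg_combine(null_branch, conditioned, scale=0.5)
```

Sweeping `scale` between 0 and 1 is what gives the smooth interpolation between semantic fidelity and affective expressivity described in the abstract.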

6. 语音语言联合与音频文本 8 篇

2606.14391 2026-06-15 cs.CL cs.AI cs.SD 新提交

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

学习听到犹豫:面向非流畅语音的连续学习ASR

Henri-Leon Kordt, Theresa Pekarek Rosin, Jae Hee Lee, Stefan Wermter

发表机构 * Knowledge Technology, Department of Informatics, University of Hamburg(汉堡大学信息学系知识技术研究所)

AI总结 针对ASR系统忽略非流畅导致信息丢失的问题,提出基于连续学习与显式非流畅标记的方法,在预训练模型中引入标记并持续训练,分析标记学习与ASR性能的权衡及跨方法共享的交叉注意力头机制。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

尽管大规模自动语音识别(ASR)取得了进展,但非流畅语音仍然具有挑战性,因为最先进的系统通常被优化以忽略非流畅,导致信息丢失和幻觉。先前的工作集中于逐字转录和非流畅标记的整合,但在有限数据集上适配模型可能导致通用领域知识的灾难性遗忘。我们通过利用具有显式非流畅标记的连续学习(CL)来填补这一空白。我们首先将这些标记引入预训练ASR模型以建立稳定的标记机制,然后在具有不同非流畅分布的其他数据集上继续训练。通过对训练期间模型动态的详细分析,我们识别出标记学习与ASR性能之间的权衡,以及跨CL方法共享的一致交叉注意力头机制。

英文摘要

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

2606.14528 2026-06-15 cs.CL eess.AS 新提交

BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

BayLing-Duplex: 单一自回归LLM的原生全双工语音对话

Qingkai Fang, Shoutao Guo, Yang Feng

发表机构 * Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)(中国科学院计算技术研究所智能信息处理重点实验室) Key Laboratory of AI Safety, Chinese Academy of Sciences(中国科学院人工智能安全重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出BayLing-Duplex,一种原生全双工语音语言模型,通过单个自回归LLM决定何时听、说和停止,无需外部VAD模块,仅用少量特殊标记实现,在少量微调数据上达到高交互成功率并提升响应质量。

Comments Code: https://github.com/BayLing-Models/BayLing-Duplex

详情
AI中文摘要

实时全双工语音交互是下一代语音聊天机器人的关键特性,允许模型同时听和说,并处理重叠、犹豫和插话等自然现象。现有的语音语言模型(如LLaMA-Omni和GLM-4-Voice)仍然是基于回合的,并依赖外部语音活动检测(VAD)模块来标记用户回合的结束,这从根本上限制了它们的交互能力。在本文中,我们介绍了BayLing-Duplex,一种原生全双工SpeechLM,其中单个自回归LLM决定何时听、何时说以及何时停止,无需辅助的回合切换模块。该设计仅在标准词汇表中添加少量特殊标记,因此可以跨LLM迁移,并重用现有的训练和服务堆栈,无需架构适配。从公开的GLM-4-Voice检查点开始,仅使用400K全双工样本进行微调,随后进行轻量级DPO阶段,BayLing-Duplex在InstructS2S-Eval上达到92%的回合切换成功率和100%的打断成功率,同时将语音响应分数从Moshi的2.17提升到3.39。BayLing-Duplex在Llama Questions、Web Questions和Alpaca-Eval上也达到或超过了其基于回合的对应版本,表明同时听和说建模不会牺牲响应质量。

英文摘要

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while improving the speech-response score from 2.17 to 3.39 over Moshi. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

2606.13712 2026-06-15 cs.SD cs.CL 交叉投稿

Multimodal Speaker Identification in Classroom Environments

课堂环境中的多模态说话人识别

Michael L. Chrzan, Meghavarshini Krishnaswamy, Robert Gibboni, Katie Wetstone, Wei Ai, Jing Liu

发表机构 * University of Michigan(密歇根大学) University of Pennsylvania(宾夕法尼亚大学) University of Maryland(马里兰大学)

AI总结 针对课堂背景噪声和儿童语音变异性导致纯声学模型准确率低的问题,提出融合声学嵌入与LLM语义上下文的多模态框架,将学生识别准确率从39.0%提升至50.3%,长句准确率达76.9%,角色区分准确率99.3%。

Comments 9 pages, 5 tables, 3 figures

详情
AI中文摘要

K-12课堂动态的自动化分析面临背景噪声和儿童语音变异性带来的挑战,这些因素常常干扰纯声学模型。本研究评估了一种多模态说话人识别框架,该框架将声学嵌入与LLM衍生的语义上下文相结合。使用EDSI数据集的一个子集(8个数学课堂,N = 2,801个话语),我们发现声学基线模型(ECAPA-TDNN)仅达到39.0%的准确率。通过将基于转录的“上下文锚定”集成到梯度提升分类器中,我们的多模态方法将学生识别准确率提高到50.3%。对于超过5秒的话语,性能也有所提升,达到76.9%的准确率(基线为64.9%),Top-3准确率为90.9%。此外,该模型以99.3%的准确率区分教师与学生角色。该方法推进了能够考虑个体学生参与的自动化反馈系统的可行性,这是支持大规模公平教学的关键一步。

英文摘要

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.
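The Top-3 figure reported above counts an utterance as correct when the true speaker appears among the three highest-ranked candidates. A minimal sketch of that metric follows; the speaker IDs and rankings are made up for illustration and are not drawn from the EDSI dataset.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of items whose true label is among the k top-ranked guesses."""
    hits = sum(
        1 for ranked, label in zip(ranked_predictions, labels)
        if label in ranked[:k]
    )
    return hits / len(labels)

# Each inner list ranks candidate speakers for one utterance, best first.
ranked = [
    ["s1", "s2", "s3"],
    ["s2", "s1", "s3"],
    ["s3", "s2", "s1"],
    ["s1", "s3", "s2"],
]
truth = ["s1", "s1", "s1", "s2"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
top3 = top_k_accuracy(ranked, truth, k=3)
```

Top-k accuracy can only stay the same or rise as k grows, which is why a 76.9% Top-1 result can sit alongside a 90.9% Top-3 result on the same utterances.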

2606.14030 2026-06-15 cs.SD cs.CL 交叉投稿

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

神经说话人日志中的结构化剪枝与低位量化效率-性能权衡

Rishit Chatterjee, Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College(科尔比学院计算机科学系)

AI总结 针对资源受限硬件上的流式说话人日志,通过结构化剪枝和低位量化压缩分割模型,研究不同延迟预算下的性能权衡,发现FP16可将模型大小减半且实时因子基本不变,但DER相对基线增加40%。

Comments 6 pages, 3 figures, preprint

详情
AI中文摘要

流式说话人日志对于时间紧迫的医疗调度至关重要,但在资源受限的硬件上部署需要更小、更快的模型。使用模拟医疗调度对话数据集SIMSAMU,我们评估了流式行为,然后通过剪枝和低位量化压缩分割模型。我们表征了在一系列流式延迟预算下的性能,发现额外的缓冲并不总是有益的,而极低延迟操作点可能显著降低性能。我们的研究表明,模型压缩以性能换取内存占用,并强调了一个操作点,其中FP16将模型大小减半,实时因子基本不变,但相对于基线,DER增加了40%。这项工作表征了实时部署的权衡,并有助于在时间关键环境中实现可靠人类通信的语音技术。

英文摘要

Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.
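To make the "40% relative DER increase" and "half the model size" figures concrete, the sketch below works through the arithmetic. DER is computed with its usual definition (missed speech plus false alarm plus speaker confusion, over total reference speech). Every number in the example is hypothetical and chosen only to reproduce a 40% relative change; none comes from the paper.

```python
def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate: error time divided by reference speech time."""
    return (missed + false_alarm + confusion) / total_speech

def relative_increase(baseline, compressed):
    """Relative change of the compressed model's DER against the baseline."""
    return (compressed - baseline) / baseline

def model_size_mb(num_params, bytes_per_param):
    """Weight storage in megabytes, ignoring any container overhead."""
    return num_params * bytes_per_param / 1e6

# Hypothetical durations in seconds over 100 s of reference speech.
baseline_der = der(missed=6, false_alarm=4, confusion=10, total_speech=100)
fp16_der = der(missed=8, false_alarm=6, confusion=14, total_speech=100)
change = relative_increase(baseline_der, fp16_der)

# FP32 stores 4 bytes per weight, FP16 stores 2.
fp32_mb = model_size_mb(1_000_000, 4)
fp16_mb = model_size_mb(1_000_000, 2)
```

A relative increase is measured against the baseline DER, so in this example a move from 20% to 28% DER counts as a 40% relative increase even though the absolute gap is 8 points.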

2509.20086 2026-06-15 cs.CL 版本更新

OLaPh: Optimal Language Phonemizer

OLaPh: 最优语言音素化器

Johannes Wirth

发表机构 * Institute for Information Systems at Hof University of Applied Sciences(霍夫应用科学大学信息系统研究所)

AI总结 提出OLaPh混合框架,结合多语言词典、NLP技术和统计子词分割,在WikiPron基准上显著优于基线,并通过LLM合成语料探索神经泛化能力。

Comments 12 pages, 1 figure, 4 tables

详情
AI中文摘要

音素化是文本到语音合成中的关键组成部分。传统方法依赖于确定性转换和词典,而神经方法在词汇外(OOV)术语上具有更高的泛化潜力。我们提出了OLaPh(最优语言音素化器),一个混合框架,将广泛的多语言词典与先进的NLP技术和统计子词分割功能相结合。在WikiPron基准上的评估表明,OLaPh在整体准确性上显著优于已建立的基线,并通过高级回退机制在OOV数据上保持鲁棒性。为了进一步探索神经泛化,我们利用该框架为指令调优的大语言模型(LLM)合成高一致性训练语料。虽然确定性框架总体上仍然更准确,但LLM表现出强大的泛化能力,匹配或部分超过了框架的性能。这表明LLM成功地从合成数据中内化了超越框架能力的语音直觉。这些工具共同为多语言字素到音素转换(G2P)研究提供了全面的开源资源。

英文摘要

Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. We introduce OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show OLaPh significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework's performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework's capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual grapheme-to-phoneme conversion (G2P) research.
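The hybrid design described above, lexicon lookup backed by a subword fallback for out-of-vocabulary words, can be sketched roughly as follows. This is not OLaPh's implementation: the toy lexicon and its phoneme strings are invented, and the greedy longest-match split is a simple stand-in for the statistical subword segmentation function the abstract mentions.

```python
# Toy lexicon with illustrative IPA strings (hypothetical entries).
LEXICON = {"sun": "sʌn", "set": "sɛt", "flower": "flaʊɚ"}

def segment(word, lexicon):
    """Greedy longest-match split into lexicon entries.
    Returns None when the word cannot be fully covered. Greedy matching can
    miss splits that a backtracking or statistical segmenter would find."""
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None
    return parts

def phonemize(word, lexicon=LEXICON):
    """Direct lookup first; fall back to subword segmentation for OOV words."""
    if word in lexicon:
        return lexicon[word]
    parts = segment(word, lexicon)
    if parts is None:
        return None
    return "".join(lexicon[p] for p in parts)

in_vocab = phonemize("sun")
compound = phonemize("sunset")
unknown = phonemize("xyz")
```

Returning `None` for words that cannot be covered leaves room for a further fallback stage, such as a neural grapheme-to-phoneme model, without the lexicon stage guessing.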

2510.05150 2026-06-15 cs.CL cs.AI 版本更新

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

全双工口语对话语言模型中的时间顺序思考

Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, Eng Siong Chng

发表机构 * Nanyang Technological University(南洋理工大学) StepFun Mila

AI总结 提出Chronological Thinking机制,让全双工对话模型在听用户说话时增量推理,不增加延迟,提升响应质量。

Comments Accepted by SIGDIAL 2026

详情
AI中文摘要

近期口语对话语言模型(SDLMs)的进展反映了从轮次式向全双工系统转变的日益增长的兴趣,其中模型在生成响应的同时持续感知用户语音流。这种同时听和说的设计实现了实时交互,并且智能体可以处理动态对话行为,如用户插话。然而,在听阶段,现有系统通过重复预测静默标记使智能体保持空闲,这偏离了人类行为:我们在对话中通常进行轻量级思考,而不是心不在焉。受此启发,我们提出了Chronological Thinking,一种即时对话思考机制,旨在提高全双工SDLMs的响应质量。具体来说,Chronological Thinking从传统的LLM思考方法(如思维链)中进行了范式转变,专为流式声学输入而设计。(1)严格因果:智能体在听的同时增量推理,仅从过去的音频更新内部假设,无前瞻。(2)无额外延迟:推理在听窗口期间分摊;一旦用户停止说话,智能体停止思考并立即开始说话,无进一步延迟。实验通过客观指标和人工评估证明了Chronological Thinking的有效性,在响应质量上表现出一致的改进。此外,Chronological Thinking稳健地处理对话动态,并在全双工交互指标上取得了竞争性性能。

英文摘要

Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and lets the agent handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, Chronological Thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, and is purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of Chronological Thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, Chronological Thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

2606.13464 2026-06-15 cs.CL cs.AI 版本更新

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

本体记忆增强的ASR校正用于长文本-语音交错对话

Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), China(哈尔滨工业大学(深圳)计算与智能研究所) Shenzhen Loop Area Institute (SLAI), China(深圳环域研究所)

AI总结 提出本体记忆增强的ASR校正框架,通过动态更新本体记忆存储实体、术语、变体、混淆和语义关系,解决长文本-语音交错对话中的上下文校正问题,在RAMC-Corr数据集上优于直接校正。

详情
AI中文摘要

自动语音识别(ASR)校正传统上集中于孤立的话语或短局部上下文。然而,随着文本和语音在长交互中越来越交错,ASR校正需要对话级别的上下文证据。现有的ASR校正方法通常依赖于当前假设或拼接原始对话历史。在此类上下文中,稀疏的校正证据可能难以在冗余和噪声中定位。针对这些挑战,我们提出了一种本体记忆增强的ASR校正框架,用于长文本-语音交错对话。该框架将先前的交互历史组织成动态可更新的本体记忆,其中实体、术语、表面变体、潜在ASR混淆和语义关系作为可检索节点存储,用于上下文基础的校正。为了评估这一设置,我们构建了RAMC-Corr,一个源自MAGIC-RAMC的数据集,用于具有基础上下文的长距离ASR校正。在RAMC-Corr上的实验表明,我们的方法在10个配对骨干-设置组合中的9个上优于直接校正,并鼓励对上下文相关的ASR错误进行更具选择性和证据基础的校正。

英文摘要

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

2603.24596 2026-06-15 eess.AS cs.AI cs.CL 版本更新

X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs

X-OPD:面向语音大语言模型能力对齐的跨模态在策略蒸馏

Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan, Tao Jin

发表机构 * Tencent Hunyuan(腾讯混元) Zhejiang University(浙江大学)

AI总结 提出X-OPD框架,通过跨模态在策略蒸馏对齐语音LLM与文本LLM的能力,利用文本教师模型评估语音模型的轨迹并提供令牌级反馈,显著缩小复杂任务性能差距。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

虽然从级联对话系统转向端到端(E2E)语音大语言模型(LLMs)改善了延迟和副语言建模,但E2E模型通常表现出与其文本对应模型相比显著的性能下降。标准的监督微调(SFT)和强化学习(RL)训练方法无法弥合这一差距。为了解决这个问题,我们提出了X-OPD,一种新颖的跨模态在策略蒸馏框架,旨在系统地将语音LLM的能力与其文本对应模型对齐。X-OPD通过在线策略展开使语音LLM探索其自身分布,其中基于文本的教师模型评估这些轨迹并提供令牌级反馈,从而有效地将教师的能力蒸馏到学生的多模态表示中。在多个基准上的大量实验表明,X-OPD在保留模型固有能力的同时,显著缩小了复杂任务中的差距。

英文摘要

While the shift from cascaded dialogue systems to end-to-end (E2E) speech Large Language Models (LLMs) improves latency and paralinguistic modeling, E2E models often exhibit significant performance degradation compared to their text-based counterparts. The standard Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training methods fail to close this gap. To address this, we propose X-OPD, a novel Cross-Modal On-Policy Distillation framework designed to systematically align the capabilities of Speech LLMs with those of their text-based counterparts. X-OPD enables the Speech LLM to explore its own distribution via on-policy rollouts, where a text-based teacher model evaluates these trajectories and provides token-level feedback, effectively distilling the teacher's capabilities into the student's multi-modal representations. Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.

7. 评测、数据集与基准 34 篇

2606.13685 2026-06-15 cs.CL cs.AI 新提交

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

抛硬币的裁判?LLM作为评估者的可靠性与偏见

Abel Yagubyan

发表机构 * Independent Researcher(独立研究员)

AI总结 研究LLM作为评估者在重复评估中的不可靠性,发现偏好翻转率平均13.6%,存在位置偏见,并建议多轮聚合和不确定性报告。

Comments 24 pages, 7 figures

详情
AI中文摘要

LLM作为评估者(LLM-as-a-Judge)现被广泛用于模型输出排名、训练奖励模型和填充公共排行榜,但其运行间可靠性仍缺乏充分表征。我们使用两个OpenAI评估模型(GPT-4o-mini和GPT-4.1-mini)在涵盖10个类别的29个任务上进行了重复的相同评估,每个问题进行50次成对试验和50次逐点试验,并辅以温度和提示敏感性消融实验。在评估者之间,成对偏好平均翻转13.6%的时间,28%的问题翻转率超过20%,一个问题达到56%。GPT-4o-mini还表现出显著的第一位置偏见(72%的A多数,p=0.024)。同时,平均逐点评分差距很小(在10分制上为0.19-0.36),且总体上不具统计显著性,产生了成对-逐点差距:评估者经常选择胜者,即使它们自己的标量分数几乎没有证据表明存在有意义的质量差异。除了评估者内部的不稳定性,评估者间的一致性仅为76%(κ=0.51),语义等价的提示模板在25%的测试案例中改变了多数结果,确定性解码减少了但不消除不一致性。可靠性曲线分析显示,在我们的数据集中,平均需要11次重复试验才能让多数投票以95%的概率恢复50次试验的参考裁决,对于高方差问题则上升至15次。这些发现表明,单次LLM评估对于高风险评估往往噪声过大,多轮聚合、位置随机化和显式不确定性报告应成为标准实践。由于两个评估者均来自同一提供商,跨提供商复制仍是重要的下一步。

英文摘要

LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations. Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19-0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference. Beyond within-judge instability, cross-judge agreement is only 76% (κ = 0.51), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions. These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
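To make the reliability quantities in this abstract concrete, here is a minimal Python sketch. It assumes one plausible definition of the flip rate (the share of repeated trials that disagree with the majority verdict) and uses the standard formula for Cohen's κ. The function names and the `"A"`/`"B"` verdict labels are illustrative assumptions, not taken from the paper, whose exact definitions may differ.

```python
from collections import Counter


def majority_verdict(verdicts):
    """Most frequent verdict among repeated trials of one question."""
    return Counter(verdicts).most_common(1)[0][0]


def flip_rate(verdicts):
    """Share of trials that disagree with the majority verdict.

    Assumed definition for illustration; the paper may define flips differently.
    """
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa between two judges' verdicts on the same items."""
    n = len(judge_a)
    observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    # Agreement expected by chance from each judge's label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both judges always emit the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical data: 50 repeated pairwise trials for one question.
trials = ["A"] * 40 + ["B"] * 10
print(majority_verdict(trials), flip_rate(trials))
```

Aggregating several trials through `majority_verdict` is the kind of multi-trial aggregation the abstract recommends over single-trial judging.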

2606.13686 2026-06-15 cs.CL cs.CY 新提交

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

电子商务欺骗性界面下的Web Agent安全基准测试

Zijing Shi, Meng Fang, Ling Chen

发表机构 * AAII, University of Technology Sydney(悉尼科技大学AAII) University of Liverpool(利物浦大学)

AI总结 提出WebDecept框架,在电子商务环境中注入七种常见欺骗性界面模式,测试多模态Web Agent的安全性,发现当前Agent极易受骗且提示约束不足。

Comments Accepted to ACL 2026

详情
AI中文摘要

随着自主Web Agent越来越多地用于执行现实任务,确保其安全性已成为关键问题。在这项工作中,我们研究了电子商务领域中现实欺骗性界面下的Web Agent行为。我们引入了WebDecept,一个轻量级且可配置的插件框架,能够将欺骗性界面模式可控地注入现有Web环境。使用WebDecept,我们实例化了开放Web上常见的七种欺骗模式,包括定向广告、域名重定向和购物操纵。通过在任务执行期间将这些模式注入前端,我们对多个多模态Web Agent进行了受控评估。我们的结果表明,当前的Web Agent极易受到多类欺骗性界面的影响,并且基于提示的约束通常不足以缓解这些失败。我们进一步分析了欺骗性模式的设计选择如何影响此类操纵的成功。这些发现凸显了在Web Agent向现实部署扩展时应解决的安全挑战。

英文摘要

As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.

2606.13756 2026-06-15 cs.CL 新提交

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

QIAS 2026:伊斯兰继承推理共享任务综述

Abdessalam Bouchekif, Somaya Eltanbouly, Samer Rashwani, Shahd Gaben, Mutaz Al-Khatib, Heba Sbahi, Emad Mohamed, Mohammed Ghaly

发表机构 * Hamad bin Khalifa University(哈马德·本·哈利法大学) Nazarbayev University(纳扎尔巴耶夫大学)

AI总结 本文介绍 QIAS 2026 共享任务,评估大语言模型在伊斯兰继承领域的复杂推理能力,基于 MAWARITH 数据集,使用 MIR-E 多步指标评估,16 支队伍参与,结果显示该任务对当前模型极具挑战性。

详情
AI中文摘要

本文全面概述了 QIAS 2026 共享任务,该任务作为 OSACT7 研讨会的一部分组织,并与 LREC 2026 联合举办。该共享任务旨在评估大语言模型在伊斯兰继承这一宗教和法律领域进行复杂推理的能力。与传统的问答基准不同,QIAS 2026 侧重于从自然语言案例进行端到端推理,要求系统完成完整的继承计算过程,从识别合格继承人到为每位受益人分配正确份额。为支持此评估,任务基于 MAWARITH 基准,这是一个包含 12,500 个阿拉伯语继承案例的数据集,并标注了中间推理步骤和最终答案。系统提交使用 MIR-E 进行评估,这是一个多步指标,衡量继承推理主要阶段的性能。共有 16 支队伍参与共享任务,研究了多种方法,包括基于提示的方法、检索增强生成和微调策略。结果表明,伊斯兰继承对当前语言模型来说仍然是一个极具挑战性的基准,尤其是在需要精确法律解释和结构化数值推理的阶段。本概述总结了任务设计、数据集、评估框架、参与系统和主要结果。

英文摘要

This paper presents a comprehensive overview of the QIAS 2026 shared task, organized as part of the OSACT7 Workshop and co-located with LREC 2026. The shared task was designed to evaluate the ability of large language models to perform complex reasoning in the religious and legal domain of Islamic inheritance. Unlike conventional question-answering benchmarks, QIAS 2026 focuses on end-to-end reasoning from natural language cases, requiring systems to perform the full inheritance calculation process, from identifying the eligible heirs to assigning the correct share to each beneficiary. To support this evaluation, the task was based on the MAWARITH benchmark, a dataset of 12,500 Arabic inheritance cases annotated with intermediate reasoning steps and final answers. System submissions were evaluated using MIR-E, a multi-step metric that measures performance across the main stages of inheritance reasoning. A total of 16 teams participated in the shared task, investigating a range of approaches, including prompting-based methods, retrieval-augmented generation, and fine-tuning strategies. The results show that Islamic inheritance remains a highly challenging benchmark for current language models, especially in stages that require precise legal interpretation and structured numerical reasoning. This overview summarizes the task design, dataset, evaluation framework, participating systems, and main results.

2606.13808 2026-06-15 cs.CL 新提交

The Culture Funnel: You Can't Align What isn't in the Data

文化漏斗:你无法对齐不在数据中的内容

Ananya Sahu, Mehrnaz Mofakhami, Daniel D'Souza, Thomas Euyang, Julia Kreutzer, Marzieh Fadaee

发表机构 * Cohere Labs(Cohere实验室)

AI总结 针对当前文化对齐方法依赖推理时干预而忽视训练数据的问题,提出文化数据漏斗概念,通过多维标签框架分析预训练、微调、对齐和推理数据集,发现后训练阶段文化信号急剧下降,多语言性虽增强地理多样性但未能确保平衡,所提标签可提升下游文化基准性能。

详情
AI中文摘要

当前的文化对齐方法侧重于推理时的干预,假设模型已经包含足够的文化知识。我们认为现代LLM流程存在一个文化数据漏斗。通过使用一个跨预训练、微调、对齐和推理数据集的多维标签框架,我们展示了在后训练阶段显式文化信号急剧下降,而地理上集中、任务专门化的数据占主导地位。多语言性增强了文化知识的地理多样性,但并未确保平衡的代表性。我们的标签提升了下游文化基准性能,表明进展需要将重点转向训练数据流程。为促进未来研究,我们在 https://huggingface.co/datasets/CohereLabs/CultureMarkers 发布包含5.6M样本的文化标注数据集。

英文摘要

Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances require shifting focus in training data pipelines. To facilitate future research, we release our culturally tagged dataset with 5.6M samples at https://huggingface.co/datasets/CohereLabs/CultureMarkers.

2606.13835 2026-06-15 cs.CL cs.AI cs.MA 新提交

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

当合理但不现实:评估基于LLM的城市模拟中的人类移动性

Gustavo H. Santos, Aline Carneiro Viana, Thiago H. Silva

发表机构 * UTFPR(巴拉那联邦理工大学) Inria(法国国家信息与自动化研究所) U. of Toronto(多伦多大学)

AI总结 提出验证框架,通过移动定律、时间节奏等指标评估基于LLM的城市模拟器生成的人类移动模式,发现叙事合理性与经验移动现实性之间存在显著差距。

Comments 14 pages, 10 figures

详情
AI中文摘要

基于LLM的生成式智能体越来越多地用于城市模拟器,但尚不清楚它们是否再现了经验上真实的人类移动模式,还是仅仅生成合理的移动叙事。我们引入了一个验证框架,用于评估基于LLM的城市模拟器中生成智能体的移动性,并与真实世界移动数据进行比较。为此,我们使用了移动定律、时间节奏、网络模体、语义活动转换和行为移动性配置文件。利用大巴黎地区和上海的数据集,我们评估了AgentSociety和CitySim在多个移动现实性维度上的表现。我们的分析揭示了叙事合理性与经验移动现实性之间的显著差距。尽管模拟器捕捉到了一些高级语义活动分布,但它们难以再现核心的空间和时间约束,包括真实的行程长度分布、起止点流量、停留时间和转换动态。我们进一步观察到,现实的移动多样性在默认提示配置下不稳定,可能需要显式的配置文件感知初始化。为了支持可重复的评估,我们还贡献了可扩展且开放的LLM驱动基础设施,用于区域级地图生成、可观测性增强的模拟、移动性指标计算和交通模拟。我们的发现强调了需要对基于LLM的城市模拟器进行严格的经验验证,并提供了构建更真实和可重复的城市模拟系统的实用工具。

英文摘要

LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework for evaluating the mobility of generative agents of LLM-based urban simulators against real-world mobility data. For this, we use mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim across multiple dimensions of mobility realism. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable and open LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.
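The abstract names trip-length distributions as one of the constraints the simulators struggle to reproduce. As a rough illustration of how simulated and observed trip lengths could be compared, the sketch below bins both samples and computes the Jensen-Shannon divergence between the histograms. Choosing JS divergence, the bin edges, and the sample values are all assumptions made for this example; the abstract does not state which statistic the authors use.

```python
import math


def histogram(values, edges):
    """Normalized histogram over half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break  # values outside all bins are ignored
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical trip lengths in km (illustrative numbers, not from the paper).
observed = [0.4, 0.8, 1.2, 2.5, 3.0, 7.5, 12.0, 25.0]
simulated = [5.0, 6.0, 7.0, 8.0, 9.0, 11.0, 12.0, 13.0]
edges = [0, 1, 2, 5, 10, 20, 50]  # assumed bin edges in km

print(js_divergence(histogram(observed, edges), histogram(simulated, edges)))
```

A value near 0 would mean the simulated trips match the observed distribution, while a value near 1 would mean the two samples barely overlap.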

2606.13904 2026-06-15 cs.CL cs.AI cs.DB 新提交

SANA: What Matters for QA Agents over Massive Data Lakes?

SANA:大规模数据湖上的问答代理关键因素是什么?

Austin Senna Wijaya, Jiaxiang Liu, Haonan Wang, Eugene Wu

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出SANA诊断框架,通过消融实验分析数据湖探索式问答中搜索、规划、数据分析及行动策略的失败原因,揭示数据分析是主要瓶颈。

Comments 9 pages, 7 figures

详情
AI中文摘要

数据湖上的探索式问答(EQA)需要LLM代理发现相关源、分析检索数据并根据中间结果调整其行动。端到端准确率无法区分搜索、规划、数据分析或代理的行动策略(即下一步做什么以及何时提交答案的决策)中的失败。我们提出了SANA(搜索代理导航消融框架),这是一个诊断性消融框架,将EQA任务转化为包含黄金源序列、清洗后子问题和执行记录的运行时配置文件。SANA利用这些配置文件构建理想化的搜索、规划和数据分析工具,从而允许对每个组件进行消融;残差是策略失败的诊断证据。为了说明SANA作为一个可复用的评估框架,我们改编了两个最近的EQA基准测试LakeQA和KramaBench,并在固定提示、预算、数据湖和运行时下评估了轻量级和中型代理。在两个基准测试中,数据分析始终是瓶颈,而规划则不那么明显。搜索在LakeQA的大数据湖设置中是主要限制,但在较小规模的KramaBench中则不那么突出。因此,SANA将端到端任务准确率分解为数据湖代理失败原因的诊断,并允许系统比较搜索、规划、数据分析和代理设计方面的进展。

英文摘要

Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequences, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapted two recent EQA benchmarks, LakeQA and KramaBench, and evaluated lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus deconstructs end-to-end task accuracies into a diagnosis of where data-lake agents fail, and allows for systematic comparisons of progress in search, planning, data analysis, and agent design.

2606.13931 2026-06-15 cs.CL 新提交

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

DLawBench: 通过多轮法律咨询评估大语言模型

Li Zhang, Yuzhen Shi, Yiran Hu, Jingwen Zhang, Wenbo Lv, Yubo Ma, Wei Wang, Rongyao Shi, Yuanyang Qiu, Xinran Xu, Yuemeng Qi, Linlin Miao, Jaromir Savelka, Yun Liu, Kevin Ashley, Bing Zhao, Hu Wei, Lin Qu

发表机构 * University of Pittsburgh(匹兹堡大学) Alibaba Group(阿里巴巴集团) University of Waterloo(滑铁卢大学) Shandong University(山东大学) Skylenage China University of Political Science and Law(中国政法大学) Qwen Team, Alibaba Group(阿里巴巴集团Qwen团队) Fordham University(福特汉姆大学) Shanghai Jiao Tong University(上海交通大学) Carnegie Mellon University(卡内基梅隆大学) Tsinghua University(清华大学)

AI总结 提出DLawBench基准,通过模拟四种客户类型(合作、依赖、退缩、对抗)的多轮对话,评估大语言模型在真实法律咨询中的交互能力。

Comments 37 pages, 8 figures, 26 tables. Code and data: https://github.com/SKYLENAGE-AI/DLawBench

详情
AI中文摘要

律师-客户咨询是法律服务的关键起点。有效的法律协助依赖于从客户那里获取充分且真实的信息,以制定最能保护其利益的策略。这一任务要求大语言模型不仅具备强大的法律推理能力,还要能通过多轮交互策略性地获取关键事实,并有效引导具有不同个性的客户。然而,现有的法律基准忽略了这种交互能力。为填补这一空白,我们引入了DLawBench,一个针对真实世界法律咨询的诊断基准。基于真实的客户行为,我们将律师-客户互动分为四种类型:合作型、依赖型、退缩型和对抗型。利用基于真实案例的对话,DLawBench评估大语言模型是否能在现实条件下有效进行法律咨询。DLawBench包含来自中国和美国法律的461个案例、5532个配对事实条目、3411个查询评估标准和3348个问题解决评估标准,并评估了26个代表性大语言模型。系统实验显示仍有很大提升空间:表现最好的模型GPT-5.5在基于咨询的法律推理上仅达到0.562。更重要的是,DLawBench揭示了法律咨询中的谄媚现象以及一个悖论:当客户最需要指导时,模型表现反而更差。

英文摘要

Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and effectively guide clients with diverse personalities. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we categorize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can effectively conduct legal consultation under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and evaluates 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.

2606.13944 2026-06-15 cs.CL 新提交

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

LLM 蕴含多重性:部署上下文如何重塑模型层面的偏好与价值观

Filip Trhlik, Aoife O'Flynn, Angela Yu, Arduin Findeis, Paula Buttery

发表机构 * University of Cambridge(剑桥大学) ALTA Institute(ALTA研究所) Leverhulme Centre for the Future of Intelligence(勒沃霍姆未来智能中心) Microsoft UK(微软英国)

AI总结 本研究通过两个成对比较范式(国家偏好排序和效用判断)发现,部署上下文(如撰写Reddit帖子或新闻文章)对LLM偏好和价值观的影响远大于提示改写和温度控制,表明模型层面的偏好是上下文依赖的。

Comments 68 pages, 54 figures, 54 tables

详情
AI中文摘要

大型语言模型(LLM)在最近的评估工作中越来越被描述为具有稳定的、模型层面的偏好和价值观系统。然而,伴随的稳健性检查仅限于偶然的提示扰动,如句法变化和选项重新排序。这留下了当周围任务上下文发生变化时(如大多数实际部署中那样),所测量的属性是否仍然存在的问题。我们直接在两个已建立的成对范式中对此进行测试:国家偏好排序和效用判断。在两者中,我们将部署上下文——模型在做出具体价值依赖选择时执行的高层任务——作为我们的控制变量,在不同框架(如撰写Reddit帖子或新闻文章)之间变化。在五个LLM和超过120万个成对决策中,部署上下文产生的变化远大于提示释义和温度控制。在15个国家的偏好排序中,上下文引发了广泛的、统计上显著的排名变化;先前工作中报告的总体全球北方偏好本身是上下文依赖的,每个模型的偏见在不同上下文中系统性变化。在50个结果的效用引出中,跨类别的广泛排序得以保留,但领域内的细粒度排名变化很大,并且结果之间的基数交换率(例如,一个地区的多少条生命等于另一个地区的一条生命)在中位数上变化了2.47倍。因此,报告的模型层面偏好和效用最好被理解为上下文条件下的测量,而不是固定的模型属性:在一种框架下获得的安全保证在另一种框架下提供的保证有限。

英文摘要

Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context (the high-level task the model is performing while making concrete value-dependent choices) our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model's bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.

2606.13995 2026-06-15 cs.CL 新提交

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

Dialogue SWE-Bench: 对话驱动的编码智能体基准

Brendan King, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 提出Dialogue SWE-Bench基准,通过用户模拟器评估编码智能体在对话中解决软件工程问题的能力,并引入模式引导智能体提升对话性能3-14%。

Comments 22 pages, 13 figures

详情
AI中文摘要

AI编码智能体已迅速改变软件工程,驱动着广泛使用的交互式编码助手。尽管它们在现实世界中是交互式使用的,但现有基准将其评估为完全自主系统。在这项工作中,我们引入了Dialogue SWE-Bench,一个自动基准数据集,用于评估编码智能体通过与用户对话解决现实世界软件工程问题的能力。我们设计了一个新颖的、基于角色设定的用户模拟器来支持我们的任务评估,并通过对话质量的自动评估来增强任务评估。我们还提出了一种新的模式引导智能体,旨在提升现成编码智能体的对话能力,相比强基线提升了3-14%。我们的结果表明,更好的编码模型并不总是对应更好的对话模型,这表明对话能力是编码智能体性能的一个独特且目前研究不足的维度。

英文摘要

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

2606.14122 2026-06-15 cs.CL 新提交

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

超越困惑度:字节感知语言模型中的 UTF-8 有效性

Sangwhan Moon, Daisuke Oba, Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki

发表机构 * University of Tokyo(东京大学) National Institute of Information and Communications Technology(信息通信技术国家研究所)

AI总结 研究字节级分词语言模型生成无效UTF-8序列的问题,通过多语言训练实验发现UTF-8有效性收敛比困惑度慢约一倍,且罕见字符结构有效性更高,表明可靠UTF-8生成是需要单独评估的能力。

详情
AI中文摘要

字节级分词使语言模型能够处理任何Unicode输入,但当遇到罕见或未见字符时,模型可能生成无效的UTF-8序列。我们使用一个355M参数的模型,在来自英语、日语、韩语和中文的平衡多语言语料库的80B tokens上训练,研究了训练规模与UTF-8生成可靠性之间的关系。我们引入了多种评估协议,将UTF-8结构有效性与语言建模分离。UTF-8有效性收敛滞后于困惑度大约两倍:困惑度在2.1B tokens后稳定,但UTF-8有效性需要4.2B tokens。在无上下文生成中,罕见字符比常见字符获得更高的结构有效性,表明频繁字符表示过度特化。通过实验,我们观察到可靠的UTF-8生成是一种独特的能力,需要超越困惑度的评估。

英文摘要

Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M-parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
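For readers who want to see what structural UTF-8 validity means in practice, the following Python sketch checks whether a generated byte sequence decodes under the strict UTF-8 codec and reports the share of valid samples. The helper names and the sample byte strings are illustrative assumptions; the paper's own evaluation protocols are not described in enough detail here to reproduce.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples) -> float:
    """Share of byte sequences that are well-formed UTF-8."""
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Hypothetical model outputs at the byte level.
samples = [
    "hello".encode("utf-8"),    # ASCII, valid
    "日本語".encode("utf-8"),    # three 3-byte characters, valid
    "日".encode("utf-8")[:2],   # truncated multi-byte character, invalid
    b"\xc0\xaf",                # overlong encoding, invalid
]
print(validity_rate(samples))
```

Because this check is independent of token probabilities, a model can reach low perplexity while still emitting truncated or malformed multi-byte sequences, which is the gap the abstract points to.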

2606.14199 2026-06-15 cs.CL cs.AI cs.LG 新提交

OdysSim: Building Foundation Models for Human Behavior Simulation

OdysSim: 构建人类行为模拟的基础模型

Xuhui Zhou, Weiwei Sun, Weihua Du, Jiarui Liu, Haojia Sun, Qianou Ma, Tongshuang Wu, Yiming Yang, Maarten Sap

发表机构 * Carnegie Mellon University, Language Technologies Institute(卡内基梅隆大学语言技术研究所)

AI总结 提出OdysSim,通过SOUL分类法统一62个数据集和23个基准任务,采用中期训练、任务特定强化学习和专家蒸馏,构建8B参数行为基础模型OSim,在23个任务中的8个上排名第一或并列第一,并实现更类人输出和零样本迁移。

Comments 34 pages. Code: https://github.com/sunnweiwei/OdysSim ; Models and data: https://huggingface.co/collections/cmu-lti/odyssim

详情
AI中文摘要

大型语言模型越来越多地被部署为人类模拟器,用于交互式评估和社会模拟。然而,以有用性为导向的后训练使它们趋向于同质化、过于随和的助手风格,造成了行为上的Sim2Real差距。我们提出了OdysSim,这是对行为基础模型(即经过训练以大规模模拟人类行为的模型)进行的最大规模开放系统研究。我们提出了SOUL,一个包含五个能力轴(CONV、SS、COG、ROLE、EVAL)的分类法,将62个数据集和23个基准任务统一在一个框架下。具体来说,我们整理了OdysSim语料库(2140万次交互,100亿个token,并配备了反向生成的社交上下文),构建了SOUL-Index基准,并开发了一个端到端的训练方案,结合了中期训练、任务特定强化学习和专家蒸馏。由此产生的开源8B OSim模型在23个任务中的8个上排名第一或并列第一,按此计数优于任何单个前沿模型,在对话和社交任务上取得了最大的提升。其输出在长度、格式和词汇选择上也更接近人类,并在τ-bench上零样本迁移到分布外的用户模拟,在反应一致性上几乎与真实用户匹配(93.2 vs 93.5)。我们进一步表明,LLM作为评判者的强化学习会引发奖励黑客模式,而我们的检测器可以在后训练期间缓解这些模式。总之,我们的发现表明,行为基础模型需要重新思考LLM的训练范式。我们发布所有工件以支持未来的研究。

英文摘要

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on τ-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.

2606.14257 2026-06-15 cs.CL 新提交

The Linguistics Olympiads: Towards a New Corpus for Linguistics Research?

语言学奥林匹克:语言学研究的语料库新方向?

Vlad A. Neacsu

AI总结 本文探讨语言学奥林匹克问题作为语言学语料库的潜力,分析其优势与局限,并提出在学术研究中负责任使用的标准。

Comments Accepted for publication in LingBaW. Linguistics Beyond and Within (Volume 12, 2026)

详情
AI中文摘要

语言学奥林匹克问题(LOPs)是一类自足的谜题,包含一个缩小的语料库,代表特定的语言现象,解题者需从中推断出语言的基本规则集,然后翻译一组新元素。语言学奥林匹克(LOs)已成为全球现象,有43个不同地区参加2025年国际语言学奥林匹克(IOL)。尽管LOPs的类型和解题策略已被分析,但其科学层面及与学术语言学的联系尚待探索。LOPs直接关联许多语言学领域,如语言类型学、语言相对论和语言学田野调查。最近,LOPs作为大语言模型的基准成为研究焦点,凸显其在计算语言学中的实用性。然而,它们尚未被纳入主流语言学研究。本文试图通过提供LOPs作为语言数据源的结构化评估,开辟将这类特殊谜题纳入学术研究的新方向,并提出在学术研究中负责任使用的标准。基于超过1800个LOPs的数据集,本研究批判性地考察了LOPs作为语言学新语料库的潜力,讨论了它们作为工具的优势和局限性,以及这些谜题可能适用的语言学领域。这项工作为更广泛的倡议奠定了基础,旨在通过为LOPs建立稳健的理论框架,弥合LOs与学术语言学之间的鸿沟。

英文摘要

Linguistics olympiad problems (LOPs) are a category of self-sufficient puzzles consisting of a scaled-down corpus representative of certain linguistic phenomena, from which the solver must deduce a primitive set of rules of the language and then translate a new set of elements. The linguistics olympiads (LOs) have become a worldwide phenomenon with 43 different territories taking part in the International Linguistics Olympiad (IOL) 2025. While the typology and solving strategies of LOPs have been analysed, their scientific facet and connections to academic linguistics have yet to be explored. LOPs are directly connected to many linguistic fields, e.g., linguistic typology, linguistic relativity, and linguistics fieldwork. Recently, LOPs have become a research focus as benchmarks for large language models, thus highlighting their usefulness in computational linguistics. Nevertheless, they have not yet been integrated into mainstream linguistics research. This paper attempts to open new directions of including this particular type of puzzle in academic research by offering a structured evaluation of LOPs as linguistic data sources and proposes criteria for their responsible use in academic research. Starting from a set of over 1800 LOPs, this study critically examines the potential of LOPs as a novel corpus for linguistics research by discussing their strengths and limitations as tools, as well as the areas of linguistics into which these problems could fit. This work forms the foundation for a broader initiative aimed at bridging the gap between LOs and academic linguistics, by establishing a robust theoretical framework for LOPs.

2606.14278 2026-06-15 cs.CL 新提交

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

法官偏爱英语吗?评估LLM作为法官时的语言切换不变性

Shaojie Yin

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出Judge-LS协议,通过语言切换变体评估LLM作为自动法官的可靠性,发现语言切换导致10.7-14.4%偏好翻转,但翻译等价测试未显示系统英语偏好。

详情
AI中文摘要

大型语言模型(LLM)现在被广泛用作开放式指令遵循评估的自动法官。这种做法方便、可扩展,且通常比基于参考的指标更具语义意识,但它也引入了一个新的可靠性问题:法官评估的是答案的质量,还是也对比较呈现的语言做出反应?我们提出了Judge-LS,一个轻量级元评估协议,将LLMBar响应对项目转换为英语、中文和中英文语言切换变体。一个可靠的法官应在保持标签的语言转换下保持其偏好,并且当两个答案翻译等价时不应偏好某种语言。我们在完整的419项LLMBar基准上评估了四个API可访问的法官,产生了13,408个成功的成对判断。跨模型,中文和语言切换呈现相对于英语导致10.7-14.4%的偏好翻转,所有法官在英语中达到最高准确率。然而,翻译等价的平局探测并未揭示系统的英语偏好:大多数探测被判断为平局,非平局决策更常偏向中文。我们添加了置信区间、配对显著性检验和自动转换审计,并进行了敏感性分析,排除了机械标记的高风险变体。该实验不需要模型训练,仅使用API调用,并且在适度的本地硬件上可行。

英文摘要

Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7–14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
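
The preference-flip figure quoted above can be read as a simple paired comparison. The sketch below is only an illustration under stated assumptions, not the Judge-LS implementation: the function name `flip_rate`, the item ids, and the "A"/"B" labels are hypothetical, and the paper may define flips differently (for example, in how ties or failed calls are handled).

```python
from typing import Dict


def flip_rate(english: Dict[str, str], variant: Dict[str, str]) -> float:
    """Fraction of shared items whose preferred answer differs from the English run."""
    # Only compare items that were judged successfully in both conditions.
    shared = [item for item in english if item in variant]
    if not shared:
        raise ValueError("no items were judged in both conditions")
    flips = sum(1 for item in shared if english[item] != variant[item])
    return flips / len(shared)


# Hypothetical toy judgments: item id -> preferred response ("A" or "B").
english_run = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
chinese_run = {"q1": "A", "q2": "A", "q3": "A", "q4": "B"}

print(flip_rate(english_run, chinese_run))  # 0.25 (one of four items flipped)
```

Restricting the comparison to items judged in both conditions keeps a failed API call from being counted as a flip, which seems a reasonable default for a paired design.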

2606.14459 2026-06-15 cs.CL cs.AI cs.SD 新提交

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

MoDiCoL:用于鲁棒语音识别的模块化诊断持续学习数据集

Theresa Pekarek Rosin, Matthias Kerzel, Stefan Wermter

发表机构 * Knowledge Technology, Department of Informatics, University of Hamburg, Germany(德国汉堡大学信息学系知识技术研究所)

AI总结 提出MoDiCoL数据集,通过模块化设计分离语言内容、说话人特征和声学环境,并设计持续学习课程来模拟真实分布变化,评估三种持续学习策略下的鲁棒性获取、迁移和遗忘。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

现代自动语音识别(ASR)系统在标准基准测试上取得了显著进展,但在真实世界的分布变化下,由录音条件、口音、语言障碍和噪声引起的性能差距已经显现。现有数据集和基准通常孤立这些因素,忽略了它们在真实应用中的共现。在本文中,我们认为模型鲁棒性可以被视为一种动态能力,持续发展,并引入了MoDiCoL,一个模块化诊断持续学习数据集,旨在对语言内容、说话人特征和声学环境进行受控分析。此外,我们提出了一个受真实世界启发的持续学习课程,以模拟增量更新,并研究鲁棒性是如何获取、迁移和遗忘的。我们评估了三种持续学习策略,并提供了在演化条件下鲁棒性的详细见解。

英文摘要

Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.

2606.14574 2026-06-15 cs.CL cs.AI 新提交

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

SIMMER: 基于世界模型评估LLM可执行规划中的潜在故障

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出SIMMER基准,通过人工策划的厨房领域符号世界模型,评估LLM规划中的潜在故障;实验发现前沿模型最多17%无错误计划,56%含潜在故障,多数不可逆;反事实预演可减少72%潜在故障和75%不可逆案例。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为家庭环境中自主代理的规划器。虽然现有基准评估LLM生成的计划是否成功执行,但它们忽略了一种关键类型的故障:潜在故障。与立即故障(在执行时触发即时反馈并允许及时纠正)不同,潜在故障不会立即停止计划执行,而是悄无声息地损害目标实现。在严重情况下,它们会导致不可逆的损害。为弥补这一空白,我们引入了SIMMER,这是一个通过人工策划的、基于厨房领域的符号世界模型来评估LLM规划中潜在故障的基准。SIMMER定义了一个世界模型,包含77个动作、262个独特对象和约46,800种语义真实的可能交互,这些交互源自真实世界的烹饪脚本。然后,它利用一个状态机执行器,根据世界模型验证计划,并检测即时前提违规、潜在危险和不可逆故障。在六个LLM上的实验表明,即使是最前沿的模型,其无错误计划最多也只有17%。此外,高达56%的计划包含潜在故障,其中大多数导致不可逆后果。我们进一步证明,通过反事实预演进行显式状态推理可以将潜在故障减少高达72%,不可逆案例减少高达75%,这为更鲁棒的LLM规划器指明了一个有前景的方向。

英文摘要

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
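
The distinction between immediate and latent failures can be illustrated with a very small state machine. This is a minimal sketch under loud assumptions: the actions, state variables, and the single hazard rule are invented for illustration and are not taken from the SIMMER world model.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    immediate: list = field(default_factory=list)  # precondition violations at a step
    latent: list = field(default_factory=list)     # hazards still present at the end


# Hypothetical toy rules: action -> (required state, state changes it causes).
RULES = {
    "put_pan_on_stove": ({}, {"pan_on_stove": True}),
    "turn_on_stove": ({}, {"stove_on": True}),
    "fry_egg": ({"stove_on": True, "pan_on_stove": True}, {"egg_cooked": True}),
    "turn_off_stove": ({"stove_on": True}, {"stove_on": False}),
}


def execute(plan):
    """Run a plan step by step and report immediate and latent problems."""
    state = {"stove_on": False, "pan_on_stove": False, "egg_cooked": False}
    report = Report()
    for step, action in enumerate(plan):
        required, effects = RULES[action]
        unmet = [key for key, value in required.items() if state[key] != value]
        if unmet:
            # Immediate failure: the executor notices it at this very step.
            report.immediate.append((step, action, unmet))
            continue  # skip the failed action so later steps are still checked
        state.update(effects)
    # Latent failure: nothing halted execution, but the final state is unsafe.
    if state["stove_on"]:
        report.latent.append("stove left on after the plan finished")
    return state, report


state, report = execute(["put_pan_on_stove", "turn_on_stove", "fry_egg"])
print(state["egg_cooked"], report.immediate, report.latent)
```

In this toy plan every step executes and the goal is reached, yet the report still flags a hazard, which is the kind of silent problem the abstract says execution-success benchmarks overlook.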

2606.14600 2026-06-15 cs.CL 新提交

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

LoSoNA:群组对话中局部社交规范适应的基准

Mateusz Winiarek, Maksymilian Bilski, Mateusz Jacniacki

发表机构 * Humalike Research

AI总结 提出LoSoNA基准,通过群聊场景测试LLM从对话历史推断并适应未明示的局部社交规范的能力,评估多种模型在不同提示条件下的表现。

详情
AI中文摘要

在线群聊是具有局部对话规范的社会空间,这些规范很少被明确陈述。基于LLM的智能体识别并适应这些规范的能力和意愿尚未得到充分探索。我们引入了LoSoNA,一个用于多方聊天中局部社交规范适应的基准。每个场景向主体模型提供一个精心策划的群聊记录,其中非主体参与者展示一个隐藏的局部规范,随后是一个最终的引发轮次,迫使模型做出响应,揭示其是否推断出该规范。我们在四种提示条件下评估了八个前沿和开放权重模型,这些条件在模型被明确告知将先前的对话作为其应如何回答的证据方面有所不同。对于大多数模型,朴素提示仍然有限;显式的规范感知提示帮助不均,Gemini 3.1 Pro达到84.2%,Claude Fable 5达到81.6%,而其他几个模型显示出微小的增益或回归。LoSoNA通过测试模型是否能够从先例推断局部对话规范并在单轮群聊响应中使用它们,为近期评估LLM社交能力的呼吁做出了贡献。

英文摘要

Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Naive prompting remains limited for most models; explicit norm-aware prompting helps unevenly, with Gemini 3.1 Pro reaching 84.2% and Claude Fable 5 reaching 81.6%, while several other models show small gains or regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and use them in a one-turn group-chat response.

2606.13684 2026-06-15 cs.CY cs.AI cs.CL cs.LG 交叉投稿

Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

跨数据集布鲁姆问题分类:监督模型与提示式大语言模型

Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók

发表机构 * Leibniz Information Centre for Science and Technology(莱布尼茨科学技术信息中心) University of Genoa(热那亚大学)

AI总结 评估监督ML/DL模型和LLM在跨数据集布鲁姆分类中的泛化能力,发现LLM更稳定,并基于最佳提示策略开发了轻量级UI。

Comments Accepted at AIED 2026. Abdolali Faraji and Mohammadreza Molavi contributed equally to this work

详情
AI中文摘要

自动对评估问题进行布鲁姆分类可以大幅减少教师工作量,但标注具有主观性且依赖教师。先前的机器学习和深度学习方法在数据集内表现良好,但很少在跨数据集设置中评估,导致现实世界的泛化能力不明确;同时,LLM在布鲁姆问题分类中的有效性尚未被系统研究。我们评估了现有ML/DL方法的跨数据集泛化能力,并在五个数据集上使用多种提示策略评估了LLM;最佳提示策略结合了上下文示例和课程特定的动作动词。监督ML/DL模型在未见数据集上性能大幅下降,而LLM更稳定,表明其在多样化教育环境中是一种稳健的替代方案。基于最佳提示策略,我们还开发了一个轻量级用户界面,支持教师自动分类大量问题库;可用性研究表明低工作量和高度可用性。

英文摘要

Automatic Bloom's taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches reported strong within-dataset results, yet were rarely evaluated in cross-dataset settings, leaving real-world generalizability unclear; meanwhile, LLM effectiveness for Bloom question classification has not been systematically studied. We evaluated the cross-dataset generalization of existing ML/DL methods and assessed LLMs with multiple prompting strategies on five datasets; the best prompting strategy combined in-context examples with course-specific action verbs. Supervised ML/DL models degraded substantially on unseen datasets, whereas LLMs were more stable, suggesting a robust alternative across diverse educational contexts. Based on the best prompting strategy, we also presented a lightweight UI that supports instructors in automatically classifying large question banks; a usability study indicated low workload and high usability.

2606.13815 2026-06-15 cs.AI cs.CL 交叉投稿

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Poker Arena: 大型语言模型中策略推理与记忆的多轴剖析

Pratham Singla, Shivank Garg, Vihan Singh

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Raeth AI

AI总结 提出Poker Arena平台,通过三层记忆架构和九轴认知剖面分解策略推理,揭示标量排行榜系统性误排模型能力结构。

Comments 33 pages, ICML Workshop

详情
AI中文摘要

不确定性下的策略推理支撑着谈判、金融和政策中的关键决策,但现有的游戏基准将异质推理维度压缩为单一标量,导致前沿LLM的能力结构未被审视。我们引入Poker Arena,一个无限注德州扑克锦标赛平台,该平台将三层记忆架构(手牌内、会话内和跨会话)与九轴认知剖面相结合,将策略推理分解为可解释的维度,如下注规模校准和位置意识。我们在50个会话(每个会话1000手牌)和受控记忆消融实验中评估了七个前沿模型;锦标赛筹码和聚合轴得分对模型进行了不同排序:Claude Opus 4.6赢得+15,730筹码和14次第一名,但在平均轴得分上仅排名第五(共七个),而持久记忆对某些模型有帮助,对另一些则有损害。这些发现表明,多轴评估揭示了标量排行榜系统性误排的能力结构,其中跨维度一致性优于任何单一维度的峰值性能。

英文摘要

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We introduce Poker Arena, a no-limit Texas Hold'em tournament platform that couples a three-layer memory architecture (within-hand, session, and cross-session) with a nine-axis cognitive profile decomposing strategic reasoning into interpretable dimensions such as bet-sizing calibration and positional awareness. We evaluate seven frontier models across 50 sessions of 1,000 hands and a controlled memory ablation; tournament chips and aggregate axis score order the field differently: Claude Opus 4.6 wins +15,730 chips with 14 first-place finishes, yet ranks only fifth of seven on mean axis score, while persistent memory helps some models and hurts others. These findings show that multi-axis evaluation surfaces capability structure that scalar leaderboards systematically misrank, with cross-dimensional consistency outweighing peak performance on any single axis.

2606.14516 2026-06-15 cs.AI cs.CL cs.CY 交叉投稿

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Every Eval Ever:AI评估结果的统一模式与社区仓库

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen

发表机构 * Technical University Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Weizenbaum Institute(魏岑鲍姆研究所) Zuse Institute Berlin(柏林祖泽研究所) Evidence Prime Trustible Kitware ETH Zurich(苏黎世联邦理工学院) StickFlux Labs Stanford University(斯坦福大学) Northeastern University(东北大学) IBM Research(IBM研究院) Comenius University Bratislava(布拉迪斯拉发夸美纽斯大学) Cisco(思科) University of Notre Dame(圣母大学) Hebrew University of Jerusalem(耶路撒冷希伯来大学) University of Oxford(牛津大学) Ohio University(俄亥俄大学) Writer TCS Research(塔塔咨询服务研究院) Oxford University Press(牛津大学出版社) Queen Mary University of London(伦敦玛丽女王大学) Technical University Berlin(柏林工业大学) University of Delaware(特拉华大学) Cinemo Johns Hopkins University(约翰霍普金斯大学) University of Copenhagen(哥本哈根大学) ELLIS(欧洲学习与智能系统实验室) Iowa State University(爱荷华州立大学) Meta FAIR University of Montreal(蒙特利尔大学) Mila Quebec AI Institute(Mila魁北克人工智能研究所) EleutherAI Yale University(耶鲁大学) Hugging Face University of Edinburgh(爱丁堡大学) Harvard University(哈佛大学) ETH AI Center(ETH人工智能中心) MIT(麻省理工学院) MIT-IBM Watson Lab(MIT-IBM沃森实验室)

AI总结 针对AI评估结果格式不统一、难以比较的问题,提出首个共享模式与社区众包仓库,通过标准化表示、自动转换器和社区数据库实现跨评估框架的统一。

详情
AI中文摘要

AI评估被广泛用于测试和理解进展。然而,多样化的评估工具带来了不一致性,挑战了分析和比较。首先,结果以不兼容的格式保存,分散在排行榜、论文、博客文章、评估工具日志和自定义仓库中。其次,结果由不同的评估框架创建,这些框架对名义上相同的评估产生不同的分数,并且不一致地记录元数据,阻碍了比较、跨社区评估科学、成本降低和重用。我们介绍了Every Eval Ever,这是第一个用于AI评估结果的共享模式和社区众包仓库。该模式标准化了评估在统一的单个JSON文档中的表示方式。它在设计上与源无关,可以摄取来自评估工具和论文的结果,并可选择存储每个实例的输出以进行细粒度分析。我们贡献了:(i) 一个社区治理的元数据模式及其配套的实例级模式,这是同类标准化工作的首次;(ii) 从流行格式、评估工具和排行榜到统一模式的自动转换器;以及 (iii) 一个托管在Hugging Face上的众包社区数据库,目前涵盖22,235个模型、2,273个独特基准和31种评估格式。

英文摘要

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

2606.14697 2026-06-15 cs.CV cs.AI cs.CL 交叉投稿

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: 用于诊断医学多模态大语言模型推理中阶段式幻觉的基准

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学)

AI总结 提出ClinHallu基准,包含7031个实例,每个实例带有结构化推理轨迹(视觉识别、知识回忆、推理整合),通过阶段替换干预和轨迹监督微调,实现细粒度幻觉诊断与缓解。

Comments Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

详情
AI中文摘要

构建可信的医学多模态大语言模型(MLLM)对于可靠的临床决策支持至关重要。现有的医学幻觉基准主要关注数据收集,但往往忽略了推理过程中幻觉的起源。我们发现幻觉来源因样本而异:错误可能源于视觉误识别、不正确的医学知识回忆或有缺陷的推理整合。为了实现源级别的幻觉诊断,我们引入了ClinHallu,一个用于医学MLLM推理中阶段式幻觉诊断的基准。ClinHallu包含7031个经过验证的实例,每个实例都附有分解为视觉识别、知识回忆和推理整合的结构化推理轨迹。我们还使用阶段替换干预来测量纠正特定阶段如何影响最终答案。除了评估,我们表明轨迹监督微调减少了阶段式幻觉。ClinHallu为诊断和缓解医学MLLM中的推理失败提供了一个细粒度的幻觉测试平台。该基准可从 https://github.com/alibaba-damo-academy/ClinHallu 公开获取。

英文摘要

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

2502.10886 2026-06-15 cs.CL 版本更新

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

MET-Bench:用于评估视觉语言与推理模型局限性的多模态实体追踪

Vanya Cohen, Raymond Mooney

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出MET-Bench多模态实体追踪基准,发现视觉语言模型在图像实体追踪上显著弱于文本,主要源于视觉推理缺陷,强化学习可提升模态内性能但跨模态迁移不足。

Comments ICML 2026

详情
AI中文摘要

实体状态追踪是世界建模的必要组成部分,需要随时间维护实体的连贯表示。以往工作仅在纯文本任务中基准测试实体追踪性能。我们引入MET-Bench,一个多模态实体追踪基准,旨在评估视觉语言模型跨模态追踪实体状态的能力。使用三个领域,我们评估了当前模型整合文本和图像状态更新的有效性。我们的发现揭示了基于文本和基于图像的实体追踪之间存在显著性能差距。我们通过实验表明,这种差异主要源于视觉推理缺陷而非感知缺陷。我们进一步证明,显式的基于文本的推理策略能提升性能,但局限性依然存在,尤其是在长程多模态任务中。我们应用强化学习来改进开源视觉语言模型中的实体追踪。这带来了显著的模态内增益,但未能稳健地跨输入模态迁移。我们的结果凸显了改进多模态表示和推理技术以弥合文本与视觉实体追踪之间差距的必要性。

英文摘要

Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain, especially in long-horizon multimodal tasks. We apply reinforcement learning to improve entity tracking in open-source VLMs. This yields substantial in-modality gains, but does not transfer robustly across input modalities. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.

2508.05782 2026-06-15 cs.CL 版本更新

FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

FineDialFact:细粒度对话事实验证基准

Xiangyan Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver

发表机构 * arXiv

AI总结 针对对话系统幻觉检测中事实标签粗粒度的问题,提出细粒度对话事实验证基准FineDialFact,基于公开数据集构建并评估多种基线方法,实验表明思维链推理可提升性能,但最佳F1仅0.74,任务仍具挑战性。

详情
AI中文摘要

大型语言模型已知会产生幻觉——事实不正确或捏造的信息——这对许多自然语言处理应用(如对话系统)构成了重大挑战。因此,检测幻觉已成为一个关键研究领域。当前对话系统中幻觉检测的方法主要集中于验证生成回复的事实一致性。然而,这些回复通常包含准确、不准确或不可验证的事实的混合,使得使用单一事实标签过于简单和粗粒度。在本文中,我们引入了一个基准FineDialFact,用于细粒度对话事实验证,涉及验证从对话回复中提取的原子事实。为此,我们基于公开可用的对话数据集构建了一个数据集,并使用各种基线方法对其进行评估。实验结果表明,结合思维链推理的方法可以提升对话事实验证的性能。尽管如此,在开放域对话数据集HybriDialogue上取得的最佳F1分数仅为0.74,表明该基准对于未来研究仍是一个具有挑战性的任务。我们在以下网址发布数据集和代码:https://this https URL。

英文摘要

Large language models are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many natural language processing applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or non-verifiable facts, making the use of a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains a challenging task for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.

2508.06196 2026-06-15 cs.CL cs.HC 版本更新

EiCAP: Beyond Fluency, Probing and Improving Emotional Intelligence in LLMs via Psychologically Grounded Multi-Turn Dialogue

EiCAP:超越流畅性,通过心理学基础的多轮对话探究和提升大语言模型的情感智能

Nizi Nazar, Pardis Sadat Zahraei, Dilek Hakkani-Tür, Natasa Milic-Frayling, Ehsaneddin Asgari

发表机构 * Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University(卡塔尔计算研究所(QCRI),哈马德·本·哈利法大学) University of Illinois Urbana-Champaign (UIUC)(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出基于心理学六层情感智能分类法的EiCAP框架,包含评估基准EiCAP-Bench和微调语料EiCAP-SFT,发现通用对话微调不提升情感智能,而基于情感智能的LoRA微调显著提升模型在所有24个子类别上的表现。

详情
AI中文摘要

大语言模型越来越多地用于情感敏感角色,包括心理健康支持、教育和危机响应,但它们缺乏评估或提升情感智能(EI)的原则性框架。我们引入EiCAP,一个统一的、基于心理学的六层EI分类法,并实现为两个互补资源。EiCAP-Bench是一个多轮、一对三强制选择评估套件,包含3,174个探针,涵盖24个子类别和跨轮依赖关系,反映真实对话EI需求。EiCAP-SFT是一个包含152,820段对话的监督语料库,与同一分类法对齐,支持可控、可解释的微调。我们得到两个关键发现。首先,通用对话监督微调不会赋予EI:在UltraChat上微调在24个子类别中均无显著提升,宏观得分为24.6%,接近随机水平25%。其次,直接对Qwen-2.5-7B-Base应用基于EI的LoRA(使用约0.8%的参数)在全部24个子类别上取得显著提升,宏观得分达到75.33%,比Base提升51.7个百分点,比Instruct提升37.1个百分点。关键的是,消融实验表明UltraChat前置阶段适得其反,使性能降低21.4个百分点:直接基于EI的训练既必要又充分。

英文摘要

Large Language Models increasingly serve in emotionally sensitive roles, including mental health support, education, and crisis response, yet they lack a principled framework for assessing or improving Emotional Intelligence (EI). We introduce EiCAP, a unified, psychologically grounded six-layer EI taxonomy operationalized into two complementary resources. EiCAP-Bench is a multi-turn, one-vs-three forced-choice evaluation suite with 3,174 probes across 24 subcategories and cross-turn dependencies that reflect real conversational EI demands. EiCAP-SFT is a 152,820-dialogue supervision corpus aligned to the same taxonomy, enabling controlled, interpretable fine-tuning. Two key findings emerge. First, generic conversational supervised fine-tuning does not confer EI: fine-tuning on UltraChat yields no significant gain in any of the 24 subcategories, with a macro score of 24.6%, near the chance level of 25%. Second, applying EI-grounded LoRA, using approximately 0.8% of parameters, directly to Qwen-2.5-7B-Base achieves significant gains in all 24 subcategories, reaching a macro score of 75.33%, a gain of 51.7 percentage points over Base and 37.1 percentage points over Instruct. Crucially, an ablation shows that the UltraChat pre-stage is counterproductive, reducing performance by 21.4 percentage points: direct EI-grounded training is both necessary and sufficient.

2601.15828 2026-06-15 cs.CL cs.AI 版本更新

Can professional translators identify machine-generated text?

专业翻译人员能否识别机器生成的文本?

Michael Farrell

发表机构 * IULM University Milan Italy(米兰IULM大学)

AI总结 通过实验研究无专门训练的专业翻译人员识别AI生成短篇故事的能力,发现少数人(16.2%)能准确区分,但多数依赖主观印象导致误判,低突发性和叙事矛盾是可靠指标。

Comments Pages 581 to 591, Volume 1, proceedings of the 26th Annual Conference of the European Association for Machine Translation, 2026

详情
AI中文摘要

本研究调查了未经专门训练的专业翻译人员能否可靠地识别由人工智能(AI)生成的意大利语短篇故事。69名翻译人员参加了一项现场实验,评估了三篇匿名短篇故事——两篇由ChatGPT-4o生成,一篇由人类作者撰写。对于每篇故事,参与者评估了AI作者身份的可能性并提供了选择理由。虽然平均结果不明确,但有一个统计上显著的子集(16.2%)成功区分了合成文本与人类文本,表明他们的判断基于分析技能而非偶然。然而,几乎相同数量的人以相反方向错误分类了文本,通常依赖主观印象而非客观标记,这可能反映了读者对AI生成文本的偏好。低突发性和叙事矛盾成为合成作者身份最可靠的指标,同时报告了意外的仿译、语义借用和来自英语的句法迁移。相比之下,语法准确性和情感基调等特征经常导致误分类。这些发现对专业语境中合成文本编辑的作用和范围提出了疑问。

英文摘要

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

2603.05167 2026-06-15 cs.CL cs.AI 版本更新

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

C2-Faith: 为思维链推理中的因果和覆盖忠实性基准测试LLM评判者

Avni Mittal, Rauno Arike

发表机构 * SPARAI

AI总结 提出C2-Faith基准,通过因果和覆盖两个维度评估LLM评判者对思维链推理过程忠实性的判断能力,发现模型在错误定位和覆盖评分上存在显著不足。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作思维链(CoT)推理的评判者,但目前尚不清楚它们能否可靠地评估过程忠实性,而不仅仅是答案的合理性。我们引入了C2-Faith,这是一个基于PRM800K构建的基准,明确将忠实性分解为两个互补维度:因果性(每一步是否逻辑上源自先前上下文)和覆盖性(是否包含必要的中间推理)。通过受控扰动,我们构建了具有已知因果错误位置的示例,将单个步骤替换为逻辑不一致的变体,并以不同速率进行受控覆盖删除,从而能够直接根据参考标签进行测量。我们评估了三个前沿的LLM评判者在三项任务上的表现:二元因果检测、因果步骤定位和覆盖评分。我们的结果表明,评判者的可靠性高度依赖于任务,没有单一模型在所有设置中占主导地位。虽然模型通常能检测到错误存在,但它们难以准确定位错误,这表明检测与归因之间存在显著差距。此外,所有评判者都系统性地高估了推理完整性,即使中间推理的很大部分缺失,也会给出高覆盖分数。这些发现揭示了LLM评判者在过程级评估中的根本局限性,并强调了在使用LLM评估推理质量时需要更可靠和校准的方法。

英文摘要

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
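
The two controlled perturbations described above (replacing one step, and deleting intermediate steps at a chosen rate) can be sketched as follows. This is an assumption-laden illustration rather than the C2-Faith pipeline: the helper names `replace_step` and `delete_steps` are hypothetical, the corrupted step is supplied by the caller, and keeping the first and last step during deletion is a choice made here, not something the abstract states.

```python
import random


def replace_step(steps, position, corrupted):
    """Return a copy of the chain with one step swapped, plus the known error position."""
    if not 0 <= position < len(steps):
        raise IndexError("position outside the chain")
    perturbed = list(steps)
    perturbed[position] = corrupted
    return perturbed, position


def delete_steps(steps, rate, seed=0):
    """Drop roughly `rate` of the intermediate steps (assumption: keep first and last)."""
    rng = random.Random(seed)  # seeded so the reference labels are reproducible
    middle = list(range(1, len(steps) - 1))
    n_drop = round(rate * len(middle))
    dropped = set(rng.sample(middle, n_drop))
    kept = [step for index, step in enumerate(steps) if index not in dropped]
    return kept, sorted(dropped)


# Hypothetical six-step chain of thought.
chain = ["s0", "s1", "s2", "s3", "s4", "s5"]
causal_example, error_position = replace_step(chain, 2, "s2 (inconsistent)")
coverage_example, removed = delete_steps(chain, rate=0.5)
print(error_position, len(coverage_example), removed)
```

Because both functions return the perturbation location alongside the modified chain, a judge's localization or coverage score can be compared directly against a known reference label.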

2605.11378 2026-06-15 cs.CL 版本更新

An Empirical Study of Automating Agent Evaluation

自动化代理评估的实证研究

Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong

发表机构 * AWS AI Labs(AWS人工智能实验室)

AI总结 本文研究了自动化代理评估的可行性,提出EvalAgent系统,通过编码技能和领域知识提升评估效率,实验显示其在评估准确性和人类偏好上显著优于基线方法。

详情
AI中文摘要

代理评估需要评估复杂的多步骤行为,涉及工具使用和中间推理,这使其成本高昂且需要专业知识。一个自然的问题是:前沿编码助手能否可靠地自动化这一评估过程?我们的研究表明,仅仅提示编码助手是不够的。没有领域特定的评估知识,前沿编码助手仅能达到30%的执行成功率,并产生平均每个代理12+个指标的过度工程化评估,表明强大的编码能力并不自动转化为可靠的代理评估能力。我们引入EvalAgent,一种自动化端到端代理评估流程的AI助手。EvalAgent将评估领域专业知识编码为评估技能(程序指令、可重用代码和模板、以及动态检索的API文档),这些技能组成基于跟踪的流程,生成完整的评估成果,包括指标、可执行代码和报告。为了系统评估生成的评估,我们引入了一个元评估框架和AgentEvalBench基准,该基准包含20个代理,每个代理配对评估要求和测试场景。我们进一步提出了Eval@1指标,以衡量生成的评估代码是否在首次运行时既执行又产生有意义的结果。我们的实验显示,EvalAgent生成的评估更加聚焦,将Eval@1从17.5%提升到65%,并在人类专家偏好上达到79.5%的优势。进一步的消融研究显示,评估技能对于处理复杂评估至关重要:移除它们会使Eval@1显著从65%降至30%。

英文摘要

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
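
Read literally, Eval@1 counts a generated evaluation as a success only if its code both executes and yields meaningful results on the first run. The sketch below encodes that reading; the `FirstRun` record and the toy data are hypothetical, and how "meaningful" is decided in the paper is not specified in the abstract.

```python
from dataclasses import dataclass


@dataclass
class FirstRun:
    executed: bool    # generated evaluation code ran without crashing
    meaningful: bool  # its results were judged meaningful (criterion assumed external)


def eval_at_1(runs):
    """Share of first runs that both executed and produced meaningful results."""
    if not runs:
        raise ValueError("at least one run is required")
    passed = sum(1 for run in runs if run.executed and run.meaningful)
    return passed / len(runs)


# Hypothetical toy outcomes for four generated evaluations.
runs = [
    FirstRun(executed=True, meaningful=True),
    FirstRun(executed=True, meaningful=False),  # ran, but results were not useful
    FirstRun(executed=False, meaningful=False),  # crashed on the first run
    FirstRun(executed=True, meaningful=True),
]
print(eval_at_1(runs))  # 0.5
```

Requiring both conditions is what separates this metric from a plain execution success rate: the second toy run executes but still counts as a failure.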

2605.21182 2026-06-15 cs.CL cs.AI cs.CV 版本更新

Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Manga109-v2026: 重新审视Manga109标注以适应现代漫画理解

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa

发表机构 * University of Tokyo(东京大学)

AI总结 本文重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡,并通过结合OCR基于的问题检测和人工修订构建Manga109-v2026,修订了约29,000个对话标注,使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留漫画特有的表达结构。

Comments Accepted to the Culture x AI Workshop at ICML 2026. Project page: https://manga109.github.io/manga109-project-website/en/

详情
AI中文摘要

漫画是一种具有文化特色的多模态媒介,是日本流行文化中最具影响力的形态之一。随着AI系统越来越多地针对漫画理解、OCR和翻译进行研究,Manga109已成为漫画相关AI研究的基础数据集。然而,当前的Manga109数据集包含转录错误和粗略的标注,这与现代OCR和多模态漫画理解任务不匹配。在本工作中,我们重新审视Manga109的对话文本标注,识别出五类标注问题,包括转录错误、缺失文本区域、对话与拟声词重叠以及未分割的对话气泡。为了解决这些问题,我们结合基于OCR的问题检测和人工修订,构建了Manga109-v2026,修订了大约29,000个对话标注。我们的修订使Manga109更好地适应现代OCR和多模态漫画理解系统,同时保留了漫画特有的表达结构。

英文摘要

Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains inaccurate transcriptions and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including inaccurate transcriptions, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

2605.30931 2026-06-15 cs.CL 版本更新

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

MineExplorer: 评估MLLM智能体在Minecraft中的开放世界探索能力

Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Meituan(美团)

AI总结 提出MineExplorer基准,通过多智能体合成工作流构建隐式多跳任务,评估多模态大语言模型在Minecraft中的开放世界探索能力,发现长轨迹协调仍是挑战。

Comments Work in progress

详情
AI中文摘要

多模态大语言模型(MLLM)在感知、推理和动作生成方面展现出强大能力。然而,它们在动态开放世界中持续探索的能力仍不明确。现有的具身和基于游戏的基准通常将交互压缩为短时任务,或将成功与特定领域的游戏机制纠缠在一起。在本文中,我们介绍了MineExplorer基准,用于评估MLLM智能体在Minecraft中的开放世界探索能力。我们首先筛除解决方案高度依赖Minecraft特定知识的原子任务,以更好地反映通用开放世界推理。然后,我们围绕ReAct风格的能力公式组织基准,并将原子任务组合成隐式多跳任务。为了进一步构建可靠的实例,MineExplorer使用多智能体合成工作流,联合设计任务图、沙盒场景和基于规则的里程碑评估器。人工评估表明,多智能体合成工作流比单智能体基线产生显著更可靠的实例。对先进MLLM智能体的实验表明,开放世界探索仍然具有挑战性,因为强模型可以处理许多单跳任务,但在需要协调隐藏前提条件的长轨迹中性能急剧下降。进一步分析发现,任务难度与智能体完成度相关,而更大的模型或思考模式并不一致地转化为更好的性能。代码和数据集可在 https://github.com/Jometeorie/MineExplorer 获取。

英文摘要

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge to better reflect general open-world reasoning. Then we organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging, as strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.

2606.02320 2026-06-15 cs.CL 版本更新

TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

TVIR:构建面向文本-视觉交错报告生成的深度研究智能体

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团)

AI总结 提出TVIR基准和层次化多智能体框架,解决深度研究报告中视觉元素的事实可靠性与对齐问题。

详情
AI中文摘要

深度研究智能体在多步信息检索、推理和长文本报告生成方面表现出强大能力,但现有基准和系统仍以文本为中心,对视觉元素是否事实可靠且与周围分析良好对齐的评估有限。为填补这一空白,我们引入了TVIR(文本-视觉交错报告生成),包括TVIR-Bench(一个包含100个专家策划的多模态深度研究任务的基准,要求视觉元素服务于特定的分析子目标)和TVIR-Agent(一个层次化多智能体框架,作为构建大纲、检索图像、生成可溯源图表以及通过上下文感知的顺序写作撰写报告的强基线)。我们进一步开发了结合文本评估和视觉评估的双路径评估框架。在九个深度研究系统上的实验表明,TVIR-Agent实现了强大的整体性能,凸显了显式多模态设计和评估对于证据驱动报告生成的重要性。

英文摘要

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.

2606.07040 2026-06-15 cs.CL 版本更新

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

超越评分标准:面向奖励建模的探索引导评估技能

Xing Yue, Linjuan Wu, Daoxin Zhang, Yongliang Shen, Weiming Lu

发表机构 * Zhejiang University(浙江大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出Eval-Skill方法,通过探索引导合成可复用的领域级评估技能,以替代每查询生成评分标准,在RewardBench 2上使Qwen3-8B和DeepSeek-V4-Flash分别提升13.44%和18.51%。

详情
AI中文摘要

开放式奖励建模需要裁判在无法获得可验证答案时遵循微妙的、领域特定的偏好。现有的基于评分标准的方法通常通过为每个查询在线生成标准来解决这一问题,但额外的生成步骤会增加推理开销,并产生僵化或错位的指导。我们引入了Eval-Skill,一种探索引导的方法,为奖励建模合成可复用的评估技能,并将奖励指导重新定义为上下文演化,而非参数训练或每查询评分标准生成。仅使用每个领域100个案例进行技能演化,Eval-Skill通过两个渐进阶段(工作流生成和原则生成)合成可复用的领域级评估技能,并在两个阶段中交错进行探索和选择。一旦生成,技能直接注入裁判上下文。在多个奖励建模基准上,Eval-Skill持续改进不同的裁判骨干模型;在RewardBench 2上,与普通评判相比,每个主要骨干模型都取得了显著提升(Qwen3-8B提升13.44%,DeepSeek-V4-Flash提升18.51%)。对演化时间缩放、泛化性和可迁移性的进一步分析表明,紧凑的评估技能为基于LLM的评估提供了一种高效的新范式。代码可在 https://github.com/xing-stellus-yue/Eval-Skill 获取。

英文摘要

Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

2601.00821 2026-06-15 cs.AI cs.CL cs.IR 版本更新

Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations

逐字块胜过提取的人工制品:长LLM对话中记忆表征的控制消融研究

Tao An

发表机构 * Hawaii Pacific University(夏威夷太平洋大学)

AI总结 通过控制消融实验,发现逐字对话块在长对话记忆检索中比LLM提取的结构化人工制品(事实、决策等)准确率高15.9-22.0点,原因是提取过程丢失了逐字细节,而结构化记忆应作为逐字文本的补充而非替代。

Comments v2: substantially revised -- reframed from a system paper to a controlled ablation study; title and conclusions updated accordingly. 26 pages, 5 figures

详情
AI中文摘要

一类日益增长的对话记忆系统将对话历史压缩为结构化人工制品——提取的事实、决策或事件——其前提是蒸馏后的结构比原始文本检索效果更好。我们通过控制消融实验检验了这一前提:在固定的检索-重排-推理流水线中,仅交换存储的表征——LLM提取的类型化人工制品与逐字对话块——保持模型、检索器、重排器和评判器不变。逐字块在LoCoMo上领先15.9个百分点(43.9% vs. 28.0%),在LongMemEval-S上领先22.0个百分点(67.4% vs. 45.4%);1跳语义图无法弥补差距,五个混淆控制实验重现了该效应。其机制是有损蒸馏:提取丢弃了逐字细节,而块则免费保留;提取人工制品流水线在整体准确率上从未超过朴素RAG。同时,使用近乎逐字、保留来源的单元所取得的积极结果也符合同一解释:检索准确性取决于表征与源文本的偏离程度。对于我们测试的提取设计,结构化记忆应增强逐字文本而非替代它:块∪人工制品联合存储在两个基准上都匹配块,而仅人工制品则丧失优势。代码和数据:https://github.com/tao-hpu/cog-canvas

英文摘要

A growing class of conversational-memory systems compresses dialogue history into structured artifacts -- extracted facts, decisions, or events -- on the premise that distilled structure retrieves better than raw text. We test this premise with a controlled ablation: within one fixed retrieval-rerank-reasoning pipeline, we swap only the stored representation -- LLM-extracted typed artifacts versus verbatim conversation chunks -- holding the model, retriever, reranker, and judge constant. Verbatim chunks win by 15.9 points on LoCoMo (43.9% vs. 28.0%) and 22.0 points on LongMemEval-S (67.4% vs. 45.4%); a 1-hop semantic graph does not recover the gap, and five confound controls reproduce the effect. The mechanism is lossy distillation: extraction discards verbatim detail that chunks retain for free, and the extracted-artifact pipeline never beats naive RAG in overall accuracy. Concurrent positive results with near-verbatim, provenance-preserving units fit the same account: retrieval accuracy tracks how far the representation departs from the source. For the extraction designs we test, structured memory should augment verbatim text rather than replace it: a chunks $\cup$ artifacts union store matches chunks on both benchmarks while artifacts alone forfeit the gap. Code and data: https://github.com/tao-hpu/cog-canvas

2601.04646 2026-06-15 cs.IR cs.AI cs.CL cs.LG 版本更新

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

规模化成功:面向多租户搜索的企业检索基准构建与索引保持查询适配

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis


AI总结 针对多租户检索系统中标注数据匮乏和模型更新成本高的问题,提出全自动构建基准DevRev-Search,并研究仅微调查询编码器而保持文档索引不变的索引保持查询适配策略,实现质量与效率的平衡。

详情
AI中文摘要

大规模多租户检索系统生成大量查询日志,但缺乏用于有效领域适应的精心策划的相关性标签,导致大量“暗数据”未被充分利用。模型更新的高成本加剧了这一挑战,因为联合微调查询和文档编码器需要完整的语料库重新索引,这在拥有数千个独立索引的多租户环境中是不切实际的。我们引入了DevRev-Search,这是一个通过完全自动化管道构建的技术客户支持段落检索基准。候选生成使用跨多种稀疏和密集检索器的融合,随后使用LLM作为评判器进行一致性过滤和相关性标记。我们进一步研究并系统评估了索引保持查询适配策略,该策略仅微调查询编码器,同时保持文档索引固定。在DevRev-Search、SciFact和FiQA-2018上的实验表明,参数高效的查询编码器微调提供了显著的质量-效率权衡,实现了可扩展且实用的企业多租户检索。

英文摘要

Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data." This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further study and systematically evaluate index-preserving query-only adaptation strategies that fine-tune only the query-encoder while keeping the document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that parameter-efficient fine-tuning of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise multi-tenant retrieval.

2602.05413 2026-06-15 cs.IR cs.CL 版本更新

SciDef: Datasets and Tools for Automated Definition Extraction from Scientific Literature with LLMs

SciDef:基于LLM的科学文献自动定义提取数据集与工具

Filip Kučera, Christoph Mandl, Isao Echizen, Radu Timofte, Timo Spinde

发表机构 * National Institute of Informatics (NII)(国立情报学研究所) University of Würzburg (JMU)(维尔茨堡大学) University of Passau(帕绍大学)

AI总结 提出SciDef资源套件,包含人工验证的定义基准DefExtra、相似度判断DefSim及基于LLM的提取流程,通过16个语言模型评估,发现NLI匹配指标与人类判断高度一致,但相关性过滤仍是自动提取的关键瓶颈。

Comments Under Review - Submitted to CIKM 2026 Resources Track

详情
AI中文摘要

科学概念在不同论文中常被不一致地定义,使得比较发现、复用术语和构建可靠的下游资源变得困难。我们提出SciDef,一个用于科学定义提取的资源套件。该套件包含DefExtra,一个包含来自75篇学术论文的268个人工验证的作者定义基准;DefSim,60个人工标注的定义对相似度判断;以及一个基于LLM的开放流程,用于PDF预处理、分块、定义提取、提示优化和评估。我们通过跨提示策略和分块方案对16个语言模型进行基准测试来验证资源。最强集合级配置得分为0.397,而最高覆盖配置为86.4%的金标准定义匹配到至少一个预测,但过度生成了候选定义。我们进一步表明,基于NLI的匹配指标与人类DefSim判断高度一致。这些结果将SciDef定位为以定义为中心的文献分析的可复用基准和工具层,同时突出相关性感知过滤作为全自动定义提取的关键瓶颈。代码和数据集可在 https://github.com/Media-Bias-Group/SciDef 获取。

英文摘要

Scientific concepts are often defined inconsistently across papers, making it difficult to compare findings, reuse terminology, and build reliable downstream resources. We present SciDef, a resource suite for scientific definition extraction. The suite contains DefExtra, a benchmark of 268 human-validated author-stated definitions from 75 academic papers; DefSim, 60 human-labeled definition-pair similarity judgments; and an open LLM-based pipeline for PDF preprocessing, chunking, definition extraction, prompt optimization, and evaluation. We validate the resources by benchmarking 16 language models across prompting strategies and chunking schemes. The strongest set-level configuration achieves a score of 0.397, while the highest-coverage configuration matches at least one prediction to 86.4% of gold definitions but over-generates candidate definitions. We further show that an NLI-based matching metric agrees strongly with human DefSim judgments. These results position SciDef as a reusable benchmark and tooling layer for definition-centric literature analysis, while highlighting relevance-aware filtering as the key bottleneck for fully automatic definition extraction. Code & datasets are available at https://github.com/Media-Bias-Group/SciDef.

2604.20462 2026-06-15 cs.SE cs.CL cs.IR 版本更新

Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark

大规模既视感:基于Sentence-Transformer嵌入与110万步骤开放基准的行为驱动软件测试重复Gherkin步骤释义鲁棒检测

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

发表机构 * seecs.edu.pk tum.de

AI总结 针对BDD测试中Gherkin步骤重复问题,提出结合精确哈希、归一化Levenshtein、句子Transformer余弦和Levenshtein带混合的四种策略检测器,并构建跨组织语料库和标注基准,F1达0.906。

Comments 28 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0

详情
AI中文摘要

背景。行为驱动开发(BDD)套件中的Gherkin步骤文本重复累积,带来已知的维护成本。现有检测器要么需要可运行测试,要么是单组织的,存在空白:一个静态的、对释义鲁棒的步骤级检测器,以及一个用于校准的公共基准。目标。我们发布了(i)迄今为止最大的跨组织BDD步骤语料库,(ii)一个标注的对级校准基准,以及(iii)一个四策略检测器,附带一个将聚类与ISO/IEC 25010可维护性子特征关联的合并节省模型。方法。语料库包含347个公共GitHub仓库、23,667个.feature文件和1,113,616个Gherkin步骤,带有SPDX标签。检测器分层使用精确哈希、归一化Levenshtein、句子Transformer余弦以及Levenshtein带混合。校准使用1,020个手动标注的步骤对,依据发布的规则(60对重叠,Fleiss kappa = 0.84)。我们报告了在主规则下和免评分重新标注下的精确率、召回率和F1,附带bootstrap 95%置信区间,并与SourcererCC风格和NiCad风格的词汇基线进行比较。结果。步骤加权精确重复率为80.2%;中位数仓库重复率为58.6%(Spearman rho = 0.51)。顶级混合聚类在2,245个文件中出现20,737次。近似精确在免评分标签上达到F1 = 0.822;语义在主规则下F1 = 0.906,反映了已披露的分层伪影。词汇基线达到F1 = 0.761和0.799。节省模型估计整个语料库可消除893,357个步骤出现;在中位数仓库中,62.5%的步骤行可消除。

英文摘要

Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines. Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5% of step lines are eliminable.
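The first two layers of the detector described above, exact hashing and normalised Levenshtein, can be sketched as follows. This is a minimal illustration rather than the released tool: the function names, the lowercase/whitespace normalisation, and the 0.85 threshold are all assumptions made for the example.

```python
import hashlib


def normalise(step: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalisation)."""
    return " ".join(step.lower().split())


def exact_key(step: str) -> str:
    """Hash of the normalised step text, used to group exact duplicates."""
    return hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """1 - distance / longest length, in [0, 1]; 1.0 means identical."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Try the cheap exact-hash match first, then the near-exact check."""
    if exact_key(a) == exact_key(b):
        return True
    return normalised_similarity(a, b) >= threshold
```

Running the hash check first keeps the quadratic edit-distance computation for pairs that are not already identical after normalisation. Paraphrases with little character overlap would need the embedding-based layers, which this sketch does not cover.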

8. 安全、隐私、公平与可解释NLP 13 篇

2606.14037 2026-06-15 cs.CL 新提交

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

对或错,模型都顺从:LLM 道德判断中的方向盲从

Jihye Kim, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 本文提出顺从不对称性(A = BCR/HCR)双向诊断指标,发现大语言模型在事实判断中更顺从有益提示(A=1.58),但在道德判断中几乎同等顺从有益和误导提示(A=1.04),揭示了方向盲从这一对齐失败模式。

详情
AI中文摘要

随着语言模型在许多领域扮演整合角色,LLM对用户反驳的响应成为一个关键的对齐属性。然而,许多现有评估将顺从视为单向的,测量模型是否抵抗压力,但不测量它们是否有选择地抵抗。我们引入顺从不对称性(A = BCR/HCR),一种双向诊断方法,比较在有益提示下的有益输出变化与在误导提示下的有害变化。在9个模型和972,000个提示条件响应中,我们发现这种选择性在事实判断和道德判断中有所不同:模型在事实问题上遵循有益提示多于有害提示(A = 1.58),但在道德问题上以几乎相同的速率遵循两个方向(A = 1.04)。这种现象在模型家族、能力水平和提示类型中持续存在。有趣的是,我们还发现思维链提示同时放大了有益和有害的顺从,而基于身份的提示以几乎相同的幅度抑制了这两者。这些结果将方向盲从道德顺从确定为当前LLM中一个独特的失败模式,并表明对齐应针对方向校准的更新,而不是仅降低顺从。

英文摘要

As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
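The ratio A = BCR/HCR can be illustrated with a small calculation. The sketch below assumes BCR is the fraction of helpful-nudge trials in which the output changed beneficially and HCR the fraction of misleading-nudge trials in which it changed harmfully. The abstract does not spell out the acronyms, so these readings, the function name, and the counts are illustrative assumptions, not the paper's data.

```python
def compliance_asymmetry(beneficial_changes: int, helpful_trials: int,
                         harmful_changes: int, misleading_trials: int) -> float:
    """Return A = BCR / HCR from raw counts (definitions assumed, see lead-in)."""
    if helpful_trials <= 0 or misleading_trials <= 0:
        raise ValueError("trial counts must be positive")
    bcr = beneficial_changes / helpful_trials    # rate of beneficial change under helpful nudges
    hcr = harmful_changes / misleading_trials    # rate of harmful change under misleading nudges
    if hcr == 0:
        raise ValueError("HCR is zero; the ratio is undefined")
    return bcr / hcr


# Hypothetical counts: a selective model versus a direction-blind one.
selective = compliance_asymmetry(300, 1000, 200, 1000)
direction_blind = compliance_asymmetry(210, 1000, 200, 1000)
print(round(selective, 2), round(direction_blind, 2))
```

Under this reading, A well above 1 means the model yields more readily to helpful nudges than to misleading ones, while A near 1 means it yields to both at about the same rate.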

2606.14068 2026-06-15 cs.CL 新提交

Harsher on Male? Evaluating LLMs on Gender-Asymmetric Moral Framing Across Diverse Conflict Scenarios

对男性更严厉?评估LLM在不同冲突场景中的性别不对称道德框架

Guangzong Si, Dong Wang, Zhenhao Li, Yifan Yu, Panwang Pan, Wentao Zhu

发表机构 * University of Science and Technology of China(中国科学技术大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出GAMA-Bench基准,通过性别镜像场景评估LLM对相同负面行为的回应,发现模型对男性更倾向于惩罚和责备,对女性则更强调共情和治疗。

Comments Under review

详情
AI中文摘要

现有关于LLM性别偏见的研究主要关注刻板印象、职业关联或明确的有害输出。在这项工作中,我们询问LLM是否在匹配的男性角色和女性角色条件下对相同的负面行为应用一致的回应标准。我们引入了GAMA-Bench,一个包含1298个场景的性别镜像基准,涵盖亲密关系和公共社会冲突。它通过受控网格和跨模型审查构建性别中立的不当行为模板,然后将它们编译成配对的第一人称提示,包含匹配的角色性别和角色指称变化。我们进一步设计了一个结构化的回应框架协议,以衡量模型如何分配惩罚、共情、升级、指导和责备。在10个代表性LLM上的实验揭示了一致的男性不利不对称:男性角色获得更多惩罚性、升级性和责备中心的框架,而女性角色对相同的不当行为获得更多治疗性和共情导向的框架。进一步分析表明,这种模式在模型家族、场景轨道、模型规模和显式思维风格推理中持续存在。官方代码见 https://github.com/xufeiqiong/GAMA-Bench。

英文摘要

Existing studies on gender bias in LLMs have largely focused on stereotypes, occupational associations, or explicit harmful outputs. In this work, we ask whether LLMs apply consistent response standards to the same negative behavior under matched male-actor and female-actor conditions. We introduce GAMA-Bench, a gender-mirrored benchmark of 1,298 scenarios covering intimate relationship and public social conflicts. It constructs gender-neutral misconduct templates through controlled grids and cross-model review, then compiles them into paired first-person prompts with matched actor-gender and role-reference variations. We further design a structured response-framing protocol to measure how models allocate punishment, empathy, escalation, instruction, and blame. Experiments on 10 representative LLMs reveal a consistent male-disadvantaging asymmetry: male actors receive more punitive, escalatory, and blame-centered framing, whereas female actors receive more therapeutic and empathy-oriented framing for the same misconduct. Further analyses show that this pattern persists across model families, scenario tracks, model scale, and explicit thinking-style reasoning. The official code is available at https://github.com/xufeiqiong/GAMA-Bench.

2606.14209 2026-06-15 cs.CL 新提交

Detecting undisclosed LLM-generated content in parliamentary texts

检测议会文本中未披露的LLM生成内容

Minerva Suvanto, Andrea McGlinchey, Peter J. Barclay, Mattias Wahde

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) University of Cambridge(剑桥大学)

AI总结 本研究训练可解释文本分类器,检测英国和瑞典议会文本中未披露的LLM使用情况,发现自2022年起两国议会中未披露的LLM使用持续增加。

详情
AI中文摘要

在本文中,我们评估了英国和瑞典议会文本中未披露的LLM生成内容的程度。在许多领域,如新闻或学术写作,通常要求明确披露是否使用了LLM等AI工具。对于议会文本,AI使用披露指南更为模糊。然而,为了保持透明度和维护公众信任,通常建议议员在撰写议会动议等文本时说明是否使用了AI。我们使用LLM出现前的议会文本及其LLM生成版本训练了一个可解释(玻璃箱)文本分类器。然后将分类器应用于包含近期议会文本的测试集,发现自2022年起,两国议会中未披露的LLM使用均呈稳步增长趋势。

英文摘要

In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

2606.14460 2026-06-15 cs.CL cs.HC 新提交

A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

ClinicalBERT 语言预测中人口统计关联编码的计算审计

Kehinde Temitayo Soetan

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 通过两种探针方法审计 ClinicalBERT 中的表征偏差,发现模型主要放大而非继承训练数据中的统计差异。

Comments 17 pages, 4 tables, appendices A-E, preprint

详情
AI中文摘要

基于 Transformer 的临床语言模型越来越多地集成到高风险临床决策支持流程中,但医学文档中编码的人口统计关联通过计算机制传播到模型概率分布的方式仍缺乏实证研究。我们对 ClinicalBERT (Alsentzer et al., 2019) 的表征偏差进行了系统性计算审计,该模型是在 MIMIC-III 出院小结上预训练的 BERT 模型。我们采用两种互补的探针方法:对数概率偏差分析 (LPBA),量化人口统计描述符引起的掩码标记概率分布在行为和评价语义类别上的偏移;以及基于掩码语言模型的分析 (MLM),探测内部表征结构在 98 个真实临床句子模板和八种交叉种族-性别组合中的人口统计能动性归因编码。语料频率分析通过将模型输出与 MIMIC-III 训练语料中的经验词频进行基准测试,操作化了统计差异与偏差放大之间的区别。在 32 个统计显著的结果中,65.6% 与观察到的语料分布相矛盾,在 MLM 探针下,黑人患者和能动性归因的比率分别上升至 80% 和 87.5%,这提供了直接的经验证据,表明 ClinicalBERT 中的表征偏差主要通过模型内部放大而非训练数据继承来运作。关键词:自然语言处理,临床文档,算法审计,表征偏差,健康公平

英文摘要

Transformer-based clinical language models are increasingly integrated into high-stakes clinical decision support pipelines, yet the computational mechanisms through which demographic associations encoded in medical documentation propagate into model probability distributions remain empirically underspecified. We present a systematic computational audit of representational bias in ClinicalBERT (Alsentzer et al., 2019), a BERT-based model pretrained on MIMIC-III discharge summaries, employing two complementary probing methodologies: Log Probability Bias Analysis (LPBA), which quantifies demographic descriptor-induced shifts in masked token probability distributions across behavioral and evaluative semantic categories, and Masked Language Model-based analysis (MLM), which probes internal representational structure for demographic agency attribution encoding across 98 real clinical sentence templates and eight intersectional race-gender combinations. Corpus frequency analysis operationalizes the distinction between statistical disparity and bias amplification by benchmarking model outputs against empirical term frequencies in the MIMIC-III training corpus. Of 32 statistically significant findings, 65.6% contradict observed corpus distributions, rising to 80% for Black patients and 87.5% for agency attribution under MLM probing, providing direct empirical evidence that representational bias in ClinicalBERT operates predominantly through model-internal amplification rather than training data inheritance. Keywords: natural language processing, clinical documentation, algorithmic auditing, representational bias, health equity

2606.13873 2026-06-15 cs.LG cs.CL 交叉投稿

Natively Unlearnable Large Language Models

原生可遗忘的大语言模型

Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出NULLs模型,通过共享骨干和稀疏激活的sinks分离数据源贡献,实现无需梯度更新的高效遗忘,在维基百科上验证了单篇文章遗忘的有效性和鲁棒性。

详情
AI中文摘要

遗忘旨在移除特定训练数据源的影响,但由于不同数据源的贡献在模型中纠缠,这已被证明具有挑战性。将源贡献隔离到不相交的参数中使得移除更容易,尽管这会阻碍跨源的联合学习。我们提出NULLs(原生可遗忘的大语言模型),这是一类模型,通过训练一组共享骨干神经元以及一个稀疏激活的sinks池,满足隔离源特定贡献和跨源联合学习这两个对立目标。在训练过程中,特定于源的信息自然集中在其sinks中,而跨源共享的信息积累在骨干中。在部署时,通过禁用相应的sinks来遗忘一个源,无需梯度更新也无需访问保留数据。我们展示了NULLs可扩展到维基百科约600万篇文章,将每篇文章隔离为独立源。遗忘单篇文章会移除其特定知识,同时保留与语义相关文章共享的事实,与从头重新训练紧密匹配。我们注意到,使用NULLs进行遗忘也具有鲁棒性:在遗忘《哈利·波特》书籍的案例研究中,NULLs抵抗了对抗性提取和逆转事后遗忘的重新学习。最后,NULLs保留了一般语言能力,在下游基准测试中与标准Transformer相匹配。这些结果共同表明,源级遗忘不必是事后考虑。它可以原生地构建到LLM训练中,同时保留共享表示学习的优势。

英文摘要

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating source contributions to disjoint parameters makes removal easier, though it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that satisfies the two opposing goals of isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. We note that unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought. It can be built natively into LLM training while retaining the benefits of shared representation learning.

2606.14060 2026-06-15 cs.LG cs.CL 交叉投稿

Non-Parametric Machine Text Detection via Multi-View Gaussian Processes

非参数化机器文本检测:基于多视角高斯过程

Aleem Khan, Nicholas Andrews

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出多视角非参数检测框架,通过高斯过程集成互补特征视图,提高对对抗攻击的鲁棒性,并提供校准概率和分布外输入的原则性弃权。

详情
AI中文摘要

对抗条件(如释义和定向风格迁移)会急剧降低机器文本检测器的准确性。然而,文档携带多种互补信号(例如,风格特征、似然和排序特征、结构特征),抑制其中一种的攻击可能使其他信号保持完整。虽然参数化分类器可以在充分监督下学习组合这些特征,但当分布发生变化时(例如,新型攻击或未见过的语言模型),分类器容易做出过度自信的错误预测。为了解决这个问题,我们提出了一种多视角、非参数化的检测框架,该框架从同一文档中提取互补的特征视图,并通过高斯过程集成聚合每个视图的证据。通过跨视图聚合证据,对手必须同时击败多个独立的检测轴,从而大幅提高逃避成本。高斯过程公式还提供了校准概率和对分布外输入的原则性弃权,支持在高风险场景中的可靠部署。我们在三个涵盖不同生成器和攻击的基准测试(DetectRL和RAID基准测试,以及PAN2025共享任务)上进行了评估,结果表明,我们的多视角检测器在考虑的攻击下保持强性能,在针对未见攻击时优于现有方法。

英文摘要

Adversarial conditions such as paraphrasing and targeted style transfer sharply degrade the accuracy of machine text detectors. A document, however, carries multiple complementary signals (e.g., stylistic features, likelihood and rank-order features, and structural features), and an attack that suppresses one may leave others intact. While a parametric classifier can learn to combine these features given sufficient supervision, classifiers are prone to making confidently incorrect predictions when the distribution shifts (e.g., novel attacks or unseen language models). To address this, we propose a multi-view, non-parametric detection framework that extracts complementary feature views from the same document and aggregates per-view evidence through a Gaussian process ensemble. By aggregating evidence across views, an adversary must simultaneously defeat multiple independent axes of detection, substantially raising the cost of evasion. The Gaussian process formulation additionally provides calibrated probabilities and principled abstention on out-of-distribution inputs, supporting reliable deployment in high-stakes settings. We evaluate on three benchmarks spanning diverse generators and attacks (the DetectRL and RAID benchmarks, and the PAN2025 shared task) and demonstrate that our multi-view detector maintains strong performance under the considered attacks, outperforming existing approaches against held-out attacks.

2509.01455 2026-06-15 cs.CL 版本更新

Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

大型语言模型中的可信不确定性:置信度校准与风险控制拒绝的统一框架

Markus Oehri, Giulia Conti, Kaviraj Pather, Alexandre Rossi, Laia Serra, Adrian Parody, Rogvi Johannesen, Aviaja Petersen, Arben Krasniqi

发表机构 * University of Liechtenstein(列支敦士登大学) University of the Republic of San Marino(圣马力诺共和国大学) University of Mauritius(毛里求斯大学) International University of Monaco(摩纳哥国际大学) University of Andorra(安道尔大学) University of Gibraltar(直布罗陀大学) University of the Faroe Islands(法罗群岛大学) Ilisimatusarfik (University of Greenland)(格陵兰大学) University of Prizren(普里兹伦大学)

AI总结 提出UniCR框架,融合多种不确定性证据为校准的正确概率,并通过原则性拒绝满足用户指定的错误预算,无需微调基础模型,在短问答、代码生成和检索增强长问答中提升校准指标并降低风险-覆盖曲线下面积。

Comments arXiv admin note: This paper has been withdrawn by arXiv due to unverifiable authorship and affiliation

详情
AI中文摘要

部署的语言模型必须决定回答什么以及何时不回答。我们提出UniCR,一个统一框架,将异构的不确定性证据(包括序列似然、自一致性离散度、检索兼容性以及工具或验证器反馈)转化为校准的正确概率,然后通过原则性拒绝强制执行用户指定的错误预算。UniCR学习一个轻量级的校准头,采用温度缩放和适当评分,通过黑盒特征支持仅API模型,并使用共形风险控制提供无分布保证。对于长文本生成,我们通过监督从检索证据中得出的原子事实性分数,将置信度与语义保真度对齐,在保持覆盖的同时减少自信幻觉。在短问答、带执行测试的代码生成和检索增强长问答上的实验显示,与熵或logit阈值、后处理校准器和端到端选择性基线相比,校准指标持续改善,风险-覆盖曲线下面积更低,固定风险下覆盖率更高。分析表明,证据矛盾、语义离散度和工具不一致是弃权的主要驱动因素,产生信息丰富的面向用户的拒绝消息。结果是一种可移植的证据融合到校准概率再到风险控制决策的配方,无需微调基础模型即可提高可信度,并在分布偏移下保持有效。

英文摘要

Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
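
The "refuse unless the error budget is met" step in the abstract above can be illustrated with a small sketch. This is not UniCR's procedure: it assumes calibrated confidences are already available, picks a threshold from a labeled calibration set by plain empirical risk, and omits the finite-sample correction that conformal risk control would add. The function names `pick_threshold` and `answer_or_refuse` are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    confidences: List[float], correct: List[bool], error_budget: float
) -> Optional[Tuple[float, float]]:
    """Return (threshold, coverage) for the lowest confidence threshold whose
    empirical error rate among answered calibration items is <= error_budget.
    Returns None when no threshold satisfies the budget."""
    # Rank calibration items from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    n = len(ranked)
    best = None
    errors = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # Only cut between distinct confidence values so ties stay together.
        is_cut_point = k == n or ranked[k][0] < conf
        if is_cut_point and errors / k <= error_budget:
            best = (conf, k / n)  # answering the top-k items keeps risk in budget
    return best


def answer_or_refuse(confidence: float, threshold: float) -> str:
    """Answer only when the confidence clears the chosen threshold."""
    return "answer" if confidence >= threshold else "refuse"


if __name__ == "__main__":
    # Toy calibration data (made up for illustration only).
    conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
    ok = [True, True, True, False, True, False]
    threshold, coverage = pick_threshold(conf, ok, error_budget=0.2)
    print(threshold, coverage)  # threshold 0.6, coverage 5/6
    print(answer_or_refuse(0.65, threshold), answer_or_refuse(0.5, threshold))
```

Because empirical risk is not monotone in the threshold, the sketch scans every cut point and keeps the one with the largest coverage that still meets the budget.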

2601.04885 2026-06-15 cs.CL cs.AI cs.LG 版本更新

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

CuMA: 通过人口统计感知的适配器混合使大语言模型与稀疏文化价值观对齐

Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Yuheng Jia, Shu Su

发表机构 * Southeast University(东南大学) ByteDance Inc.(字节跳动公司) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),中华人民共和国教育部,中国)

AI总结 提出CuMA框架,通过人口统计感知路由将冲突梯度分离到专家子空间,解决密集模型在多文化对齐中的均值崩溃问题,在WorldValuesBench等基准上取得最优性能。

Comments ACL 2026 Main

详情
AI中文摘要

随着大语言模型服务于全球用户,对齐必须从强制执行普遍共识转向尊重文化多元主义。我们证明,密集模型在被迫适应冲突的价值分布时会出现均值崩溃,收敛到无法代表不同群体的通用平均值。我们将其归因于文化稀疏性,其中梯度干扰阻止密集参数跨越不同的文化模式。为解决此问题,我们提出CuMA(Cultural Mixture of Adapters),一个将对齐视为条件容量分离问题的框架。通过引入人口统计感知路由,CuMA内化了一个潜在文化拓扑,以将冲突梯度明确解耦到专门的专家子空间中。在WorldValuesBench、Community Alignment和PRISM上的广泛评估表明,CuMA达到了最先进的性能,显著优于密集基线和仅语义MoE。关键的是,我们的分析证实CuMA有效缓解了均值崩溃,保留了文化多样性。我们的代码可在 https://github.com/Throll/CuMA 获取。

英文摘要

As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from Mean Collapse, converging to a generic average that fails to represent diverse groups. We attribute this to Cultural Sparsity, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose CuMA (Cultural Mixture of Adapters), a framework that frames alignment as a conditional capacity separation problem. By incorporating demographic-aware routing, CuMA internalizes a Latent Cultural Topology to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that CuMA effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

2605.28591 2026-06-15 cs.CL cs.AI 版本更新

Models That Know How Evaluations Are Designed Score Safer

知道评估如何设计的模型更安全

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

发表机构 * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center(图宾根ELLIS研究所、图宾根马克斯·普朗克智能系统研究所、图宾根人工智能中心)

AI总结 本文通过微调模型使其掌握评估的元知识(如可验证结构或道德困境),发现这会导致模型在安全基准测试中表现更安全,从而引入了一种独立于显式记忆或评估意识的新混淆因素。

详情
AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索(例如假设场景)是口头评估意识和后续行为转变的来源。在本文中,我们研究了这一现象的一个潜在解释:评估元知识,定义为关于评估结构特征的参数化知识。类似于数据集污染(基准暴露通过记忆导致更高性能),我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文,例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点,我们在描述评估特征(如可验证结构或道德困境)的合成文档上微调模型。在六个安全基准上评估这个微调模型,我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中,这种行为转变仍然存在。我们的结果表明,评估元知识可能夸大安全基准性能,引入了一种独立于显式记忆或口头评估意识的新混淆因素,因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2606.11502 2026-06-15 cs.CL cs.AI 版本更新

When Roleplaying, Do Models Believe What They Say?

角色扮演时,模型是否相信它们所说的话?

Benjamin Sturgeon, David Africa, Sid Black

发表机构 * MATS

AI总结 通过线性真实探针研究角色扮演对LLM内部表征的影响,发现角色扮演主要改变输出而非内部真实表征,而涌现性错位(Emergent Misalignment)则更显著地改变内部表征。

详情
AI中文摘要

语言模型可以陈述“地球绕太阳运行”,并在扮演亚里士多德时断言相反的说法。最近的研究认为,角色采用是语言模型运作的基础,模型会不断为给定上下文选择最合适的角色。这种角色扮演是否仅仅改变了模型的输出,还是也影响了模型内部表征为真实的内容?我们通过线性真实探针研究这个问题,将其应用于扮演历史人物(其可能的信念与现代共识不同)的LLM。对于每个角色,我们比较该角色可能赞同的虚假陈述(*时代相信*)与主题匹配但该角色不会赞同的虚假陈述(*时代虚假*)。通过提示、上下文学习和监督微调,角色诱导对时代相信陈述的抑制程度低于同等虚假的替代陈述,但它们总体上仍被分类为虚假。因此,角色扮演改变模型所说的内容多于其内部表征为真实的内容。我们将此与经过有害建议训练并表现出涌现性错位(Emergent Misalignment,EM)的模型进行对比。在三个模型家族(Qwen 2.5 14B、Qwen 3 8B和Llama 3.3 70B)中,它们的虚假陈述显著向探针空间的真实区域移动,在挑战下大约一半时间被辩护(而角色扮演约为六分之一),并用于下游推理。因此,角色扮演和涌现性错位是信念内化谱系上的点,其中角色扮演改变模型所说的内容而表征变化很小,而涌现性错位则改变虚假陈述的内部表征,但并未完全将其标记为真实。

英文摘要

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.

2305.07609 2026-06-15 cs.IR cs.CL cs.CY 版本更新

Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

ChatGPT 在推荐中是否公平?评估大语言模型推荐的公平性

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 针对大语言模型推荐(RecLLM)可能存在的偏见,提出公平性基准 FaiRLLM,包含精心设计的指标和涵盖8个敏感属性的数据集,评估发现 ChatGPT 在推荐中仍存在不公平现象。

Comments Accepted by Recsys 2023 (Short). Typo corrections

详情
AI中文摘要

大语言模型(LLM)的显著成就催生了一种新颖的推荐范式——基于LLM的推荐(RecLLM)。然而,需要注意的是,LLM可能包含社会偏见,因此RecLLM做出的推荐的公平性需要进一步研究。为了避免RecLLM的潜在风险,有必要评估RecLLM在用户侧各种敏感属性上的公平性。由于RecLLM范式与传统推荐范式存在差异,直接使用传统推荐的公平性基准是有问题的。为解决这一困境,我们提出了一个新的基准,称为基于LLM的推荐公平性(FaiRLLM)。该基准包括精心设计的指标和一个数据集,该数据集考虑了音乐和电影两个推荐场景中的八个敏感属性。通过使用我们的FaiRLLM基准,我们对ChatGPT进行了评估,发现它在生成推荐时仍然对某些敏感属性表现出不公平性。我们的代码和数据集可在以下网址找到:https://github.com/jizhi-zhang/FaiRLLM。

英文摘要

The remarkable achievements of Large Language Models (LLMs) have led to the emergence of a novel recommendation paradigm -- Recommendation via LLM (RecLLM). Nevertheless, it is important to note that LLMs may contain social prejudices, and therefore, the fairness of recommendations made by RecLLM requires further investigation. To avoid the potential risks of RecLLM, it is imperative to evaluate the fairness of RecLLM with respect to various sensitive attributes on the user side. Due to the differences between the RecLLM paradigm and the traditional recommendation paradigm, it is problematic to directly use the fairness benchmark of traditional recommendation. To address the dilemma, we propose a novel benchmark called Fairness of Recommendation via LLM (FaiRLLM). This benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes in two recommendation scenarios: music and movies. By utilizing our FaiRLLM benchmark, we conducted an evaluation of ChatGPT and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations. Our code and dataset can be found at https://github.com/jizhi-zhang/FaiRLLM.

2606.10813 2026-06-15 cs.CR cs.CL 版本更新

RedAct: Redacting Agent Capability Traces for Procedural Skill Protection

RedAct: 为程序技能保护而编辑智能体能力痕迹

Shuwen Xu, Zhitao He, Yi R. Fung

发表机构 * Hong Kong University of Science and Technology(香港科技大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出RedAct框架,通过定位保护关键信息、重写痕迹并嵌入行为水印,将技能转移率降至无技能基线以下,同时保留审计证据。

详情
AI中文摘要

用户依赖执行痕迹来观察智能体行为、诊断故障并确保问责。这些痕迹包含丰富的程序细节,包括工具调用、中间决策和错误恢复逻辑。然而,这些细节可能暴露私有的程序技能,使下游方法能够在没有模型权重或技能文件的情况下恢复关键公式、阈值和策略。为了量化这种风险并评估保护措施,我们构建了CapTraceBench,一个包含75个专业长时任务和154个跨七个领域精选技能的基准。我们还引入了RedAct(https://github.com/...),一个受保护的痕迹发布框架,该框架定位受保护的关键信息,重写痕迹同时保留验证者关键证据,并嵌入行为水印用于下游溯源分析。在代表性的痕迹重用方法中,RedAct将归一化技能转移(NST)从原始痕迹的44.7--67.1%降至低于无技能基线,同时保留审计证据。其独立的行为水印实现了93.6--100.0%的真实检测率,误报率最多为1.9%。这些结果将公共智能体痕迹视为安全接口,并表明选择性编辑可以在不删除审计证据的情况下减少程序能力泄露。

英文摘要

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct, a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7-67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6-100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.

2605.21006 2026-06-15 cs.AI cs.CL cs.LG 版本更新

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

扮演魔鬼的代言人:现成的人格向量在顺从性上与针对性引导相媲美

Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

发表机构 * University of Toronto(多伦多大学) Princeton University(普林斯顿大学) Purdue University(普渡大学) EPFL(洛桑联邦理工学院) Algoverse Independent(独立)

AI总结 本文研究了不同人格对顺从性的影响,发现现成的人格引导向量在减少顺从性方面与针对性引导相当,且在用户正确时保持准确性。

详情
Journal ref
ICML, Pluralistic Alignment Workshop, 2026
AI中文摘要

我们研究了不同人格对顺从性的影响:模型在用户错误时仍同意用户。标准缓解方法,对比激活添加(CAA),从顺从性和诚实响应的标记对中推导出引导方向。本研究评估了现成的人格引导向量是否能作为替代方案,这些向量最初是为一般角色扮演开发的,且未在顺从性数据上训练。在两个指令微调模型中,引导至以怀疑或审查为特征的人格可将顺从性减少到CAA效果的约68%和98%,且不同于CAA,在用户正确时保持准确性。效果也是不对称的:引导至顺从的人格不会产生镜像增加的顺从性。几何上,人格向量在激活空间的方向上与顺从性方向基本无关。总体而言,这些发现表明,顺从性应被视为人格层面的属性,而非单一可引导方向。我们在此发布代码:https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

英文摘要

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately 68% and 98% of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

9. 低资源、领域适配与高效训练 4 篇

2606.13894 2026-06-15 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Gefen: Optimized Stochastic Optimizer

Gefen: 优化随机优化器

Nadav Benedek, Tomer Koren, Ohad Fried

发表机构 * Reichman University(赖希曼大学) Tel Aviv University(特拉维夫大学) Google Research(谷歌研究院)

AI总结 提出Gefen优化器,通过共享二阶矩估计和量化一阶矩,将AdamW内存占用减少约8倍,同时保持相同性能,支持更大批量和吞吐量。

详情
AI中文摘要

AdamW是现代深度学习的默认优化器,但其一阶和二阶矩状态会额外占用约两倍参数大小的训练内存。我们提出Gefen,一种内存高效的优化器,它自动在参数块之间共享二阶矩估计,并使用学习到的码本量化一阶矩,从而将AdamW的内存占用减少约8倍,同时保持相同性能,相当于每十亿参数减少6.5 GiB。该方法受理论结果启发,该结果表明大的混合Hessian项将平方梯度的比率约束为接近1,表明Hessian对齐的参数是共享二阶矩统计量的自然候选。由于大规模计算Hessian不切实际,Gefen从初始平方梯度推断块结构,除了AdamW默认超参数外,不需要任何架构特定的元数据或超参数。Gefen学习基于精确直方图的动态规划量化码本,并重用相同的块进行一阶矩缩放。在多种实验中,Gefen在比较的类似AdamW的方法中实现了最低的峰值优化器内存,同时保持AdamW级别的性能。在FSDP和DDP训练中,减少的内存占用支持更大的微批次,并显著提高相对于AdamW的吞吐量,提供了一种实用的即插即用替代方案,具有更低的内存使用,可以增加吞吐量并支持训练更大的模型或使用更大的批量大小。我们提供了完整的Python实现,包括融合CUDA内核,网址为 https://github.com/ndvbd/Gefen。

英文摘要

AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook, thereby reducing AdamW's memory footprint by ~8x while maintaining the same performance, corresponding to a reduction of 6.5 GiB per billion parameters. The method is motivated by a theoretical result showing that large mixed Hessian entries constrain the ratio of squared gradients toward one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Since computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata or hyperparameters beyond AdamW defaults. Gefen learns an exact histogram-based dynamic-programming quantization codebook and reuses the same blocks for first-moment scaling. Across diverse experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and improves throughput significantly over AdamW, providing a practical drop-in replacement with lower memory usage that can increase throughput and enable training larger models or using larger batch sizes. We provide the complete Python implementation, including fused CUDA kernels, at https://github.com/ndvbd/Gefen.
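
The "6.5 GiB per billion parameters" figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes both AdamW moment buffers are stored in 32-bit floats (4 bytes per value), which the abstract does not state explicitly, and treats "~8x" as exactly 8. The helper name `adamw_state_gib` is hypothetical.

```python
GIB = 2**30  # bytes per GiB


def adamw_state_gib(num_params: int, bytes_per_value: int = 4) -> float:
    """Memory for AdamW's two moment buffers (first and second moment),
    assuming one value of `bytes_per_value` bytes per parameter per buffer."""
    return 2 * num_params * bytes_per_value / GIB


if __name__ == "__main__":
    baseline = adamw_state_gib(1_000_000_000)  # optimizer state per 1B parameters
    reduced = baseline / 8                     # assumed exact 8x reduction
    saved = baseline - reduced
    print(f"{baseline:.2f} {reduced:.2f} {saved:.2f}")  # 7.45 0.93 6.52
```

Under these assumptions the saving is about 6.52 GiB per billion parameters, consistent with the roughly 6.5 GiB quoted in the abstract.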

2606.14150 2026-06-15 cs.LG cs.CL 交叉投稿

Small LLMs: Pruning vs. Training from Scratch

小型LLM:剪枝 vs. 从头训练

Yufeng Xu, Taiming Lu, Kunjun Li, Jiachen Zhu, Mingjie Sun, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学) New York University(纽约大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过六种剪枝方法在Llama-3.1-8B上比较剪枝与从头训练,发现有限预算下剪枝更优,预算充足时粗粒度剪枝可被超越。

Comments Our code is available at https://github.com/zlab-princeton/llm-pruning-collection

详情
AI中文摘要

剪枝有望成为获得强大小型语言模型的捷径。在本工作中,我们通过六种涵盖深度、宽度和稀疏粒度的剪枝方法,在两种受控的token匹配设置下,以0.5-0.8的剪枝率对Llama-3.1-8B进行剪枝,检验了这一承诺。(1) 在相同的训练token预算下,剪枝初始化始终优于随机初始化。这表明父模型提供了一个强起点,尽管随着训练token预算的增加和剪枝率的提高,优势逐渐缩小,在我们研究的最高剪枝率下几乎消失。(2) 当从头训练被给予整个流程消耗的全部token预算时,细粒度剪枝仍保持优势,而粗粒度结构化剪枝可能被匹配或超越。这表明父模型传递了额外训练token无法完全恢复的知识,但仅在细粒度下如此。综合来看,我们的结果给出了明确的建议:当手头有一个大型预训练模型且训练token预算有限时,剪枝优于从头训练;当训练预算不受限时,从头训练在粗粒度剪枝下可能具有竞争力,因此大型预训练父模型并非总是必要的。

英文摘要

Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5--0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

2606.14695 2026-06-15 cs.LG cs.CL 交叉投稿

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Persona-Pruner: 为角色扮演雕琢轻量级模型

Jinsu Kim, Jihoon Tack, Noah Lee, Jongheon Jeong

AI总结 提出Persona-Pruner框架,通过从单个描述中隔离特定角色的子网络来剪枝语言模型,在保持角色扮演性能的同时大幅降低计算成本,性能下降比最强基线减少93.8%。

Comments 25 pages; ICML 2026; Code is available at https://github.com/jsu-kim/Persona-Pruner

详情
AI中文摘要

语言模型(LMs)作为角色扮演聊天机器人展现出显著潜力,在给定角色或用户画像规范时,能够提供一致且风格化的交互。然而,将这些能力应用于现实世界应用(例如,众多NPC同时交互的生态系统)时,由于过高的计算成本,暴露了关键的效率问题。在本文中,我们质疑将完整的通用模型专用于单一角色的必要性,假设特定角色身份仅依赖于模型总容量的一小部分。我们观察到,朴素地剪枝LM通常会严重降低特定角色的角色扮演性能;它无法区分冗余知识和基本角色特征。我们提出Persona-Pruner,一个通过从单个描述中隔离特定角色的子网络来雕琢轻量级角色扮演模型的框架。我们的实验一致表明,Persona-Pruner在保留角色扮演性能方面比现有最先进的LLM剪枝技术有效得多,在RoleBench上使用LLM-as-a-judge评分,将性能下降从密集模型减少至多93.8%(相比最强基线),同时仍保持通用LLM能力。代码可在以下网址获取:https://github.com/jsu-kim/Persona-Pruner。

英文摘要

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at https://github.com/jsu-kim/Persona-Pruner.

2512.22671 2026-06-15 cs.CL cs.AI cs.LG 版本更新

Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

脆弱的知识,稳健的指令遵循:Llama-3.2中的宽度剪枝二分法

Pere Martra

发表机构 * Independent Researcher(独立研究员)

AI总结 通过峰峰值幅度(PPM)准则对GLU-MLP层进行结构化宽度剪枝,发现降低扩展比会损害参数化知识任务,但能提升指令遵循能力,挑战了剪枝导致均匀退化的假设。

Comments 22 pages, 5 figures, 9 tables. Code available at https://github.com/peremartra/llama-glu-expansion-pruning

详情
AI中文摘要

对Llama-3.2模型中GLU-MLP层的结构化宽度剪枝,以峰峰值幅度(Peak-to-Peak Magnitude,PPM)准则为指导,揭示了降低扩展比如何系统性地影响不同模型能力的二分法。虽然依赖参数化知识的任务(如MMLU、GSM8K)和困惑度指标的性能随扩展比降低而可预测地下降,但指令遵循能力在2.4倍平衡比下得到提升(IFEval:Llama-3.2-1B中+4.8分/+46%,Llama-3.2-3B中+3.7分/+39%),且多步推理保持稳健(MUSR)。这种模式在两个评估模型大小上一致观察到,挑战了压缩研究中剪枝导致均匀退化的主流假设。为探究这一点,我们使用评估事实知识、数学推理、语言理解、指令遵循和真实性的综合基准套件,评估了七种扩展比配置。我们的分析将扩展比识别为一个关键架构参数,它选择性地重塑模型的任务性能轮廓,而不仅仅是作为压缩指标。

英文摘要

Structured width pruning of GLU-MLP layers in Llama-3.2 models, guided by the Peak-to-Peak Magnitude (PPM) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably with decreasing expansion ratios, instruction-following capabilities improve at the 2.4x equilibrium ratio (IFEval: +4.8 points / +46% in Llama-3.2-1B and +3.7 points / +39% in Llama-3.2-3B), and multi-step reasoning remains robust (MUSR). This pattern, observed consistently across both evaluated model sizes, challenges the prevailing assumption in compression research that pruning induces uniform degradation. To investigate this, we evaluated seven expansion ratio configurations using comprehensive benchmark suites that assess factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively reshapes the model's task performance profile, rather than merely serving as a compression metric.

10. 其他/综合NLP 22 篇

2606.13852 2026-06-15 cs.CL 新提交

Hybrid Classical-Quantum Variational Autoencoder for Neural Topic Modeling

用于神经主题建模的混合经典-量子变分自编码器

Ivan Kankeu

发表机构 * Ivan Kankeu

AI总结 提出混合经典-量子变分自编码器,在推理网络中嵌入参数化量子电路,使用改进的高斯Softmax后验降低量子比特需求,在AgNews数据集上优于现有神经主题模型。

详情
AI中文摘要

神经主题模型能够实现可扩展的语义发现,但它们与量子硬件的集成仍基本未被探索。我们提出了一种概念验证的混合经典-量子变分自编码器(VAE)用于主题建模,在VAE推理网络中嵌入参数化量子电路,同时保留经典的主题-词解码器。为了解决量子硬件的资源限制,我们提出了一种改进的高斯Softmax后验,将潜在空间维度与要提取的主题数量解耦,使模型能够在低资源(10量子比特)的量子设备上运行。在AgNews数据集上,混合VAE优于最先进的神经主题模型(NTMs),达到了0.71的$C_v$连贯性分数和0.20的NPMI分数,同时保持了高主题多样性。为了比较,我们还构建了一个完全经典的变体,它在AgNews上也优于最先进的模型,并在潜在空间中表现出清晰的类别分离。这些结果表明,即使在NISQ时代的设备上,混合VAE在计算上也是可行的,并且代表了量子增强主题建模的一个有前景的方向。

英文摘要

Neural topic models enable scalable semantic discovery, but their integration with quantum hardware remains largely unexplored. We present a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) for topic modeling, embedding parameterized quantum circuits within the VAE inference network while retaining a classical topic-word decoder. To address the resource constraints of quantum hardware, we propose a modified Gaussian Softmax posterior that decouples latent space dimensionality from the number of topics to be extracted, enabling the model to operate with a low-resource 10-qubit quantum device. On the AgNews dataset, the hybrid VAE outperforms state-of-the-art neural topic models (NTMs), reaching a $C_v$ coherence score of 0.71 and an NPMI score of 0.20 while preserving high topic diversity. For comparison, we also construct a fully classical variant, which also outperforms state-of-the-art models on AgNews and exhibits clear class separation in the latent space. These results demonstrate that hybrid VAEs are computationally viable even on NISQ-era devices and represent a promising direction for quantum-enhanced topic modeling.

2606.13977 2026-06-15 cs.CL 新提交

Creative Integration: A Decidable Criterion of Creativity

创造性整合:一个可判定的创造力标准

Yoshinori Nomura

发表机构 * Mirage Mountain Technologies(幻山科技)

AI总结 提出基于描述长度压缩的创造性整合可判定标准,通过四个二元门和伪整合分类法实现判别,并在多领域语料库上通过四项可证伪测试验证。

Comments 18 pages, 1 figure

详情
AI中文摘要

"整合性"解决方案广受赞誉但鲜有定义:我们缺乏一种可操作的方式来区分真正的整合(使世界更易于描述)与整洁的重新描述。基于将创造力和智能视为压缩的思想脉络,我们为创造性整合(CI)给出了这样一个标准:当且仅当在固定描述语言下,描述长度严格缩短(C = L_pre/L_post > 1),且缩减位于冲突本身时,A与B之间真实冲突的解决即为CI。我们通过四个二元合取门使判断可判定,并通过一个伪整合分类法(命名并拒绝相似物)固定其外延。我们用一个精心策划的多领域语料库支持该标准,并且——关键的是——不是通过人类评分者间一致性,而是通过它可能失败的四个可证伪测试来验证:独立计算检查、对硬负例的区分、样本外预测和描述语言鲁棒性;所有测试均以余量通过。贡献不在于"创造力即压缩",而在于其可判定性、区分性和语料库:据此,使一个举动真正具有创造性——而非仅仅是新颖——的是它压缩了一个冲突,新颖性和价值是下游症状;所有创造力是否都如此构成,我们作为一个明确的猜想陈述。我们仅声称C-1的符号;我们判断,而非生成。结果是一个可引用的基元,用于更广泛的计划。

英文摘要

"Integrative" solutions are widely praised but rarely defined: we lack an operational way to tell a genuine integration -- one that makes the world cheaper to describe -- from a tidy re-description. Building on the lineage that treats creativity and intelligence as compression, we give such a criterion for creative integration (CI): the resolution of a real conflict between A and B is CI if and only if, under a fixed description language, the description length strictly shrinks (C = L_pre/L_post > 1), with the reduction located in the conflict itself. We make the judgment decidable through four binary, conjunctive gates, and we fix its extension through a taxonomy of pseudo-integration that names and rejects the look-alikes. We back the criterion with a curated, multi-domain corpus and -- crucially -- validate it not by human inter-rater agreement but by four falsifiable tests it could fail: an independent computational check, discrimination against hard negatives, out-of-sample prediction, and description-language robustness; all pass with margin. The contribution is not "creativity is compression" but its decidability, discrimination, and corpus: on this account, what makes a move genuinely creative -- rather than merely novel -- is that it compresses a conflict, with novelty and value as downstream symptoms; whether all creativity is so constituted we state as an explicit conjecture. We claim only the sign of C-1; we judge, not generate. The result is a citable primitive for a broader program.

2606.13991 2026-06-15 cs.CL 新提交

Fusing Stylometric and Embedding Systems to Estimate Authorship Likelihood Ratios in Japanese

融合文体特征与嵌入系统以估计日语作者身份似然比

Praju Ghatpande, Satoru Tsuge, Shunichi Ishihara, Wataru Zaitsu, Mitsuyuki Inaba

AI总结 首次将似然比框架应用于日语数字文本,通过融合文体特征系统与基于嵌入的系统,提高了似然比量值和判别能力,最佳融合的log-likelihood-ratio成本为0.32484。

详情
AI中文摘要

似然比框架被广泛认为是法医学证据分析的逻辑和法律基础,其在文本证据的作者身份分析中的重要性日益得到认可。然而,迄今为止,其应用仅限于英语文本。同时,作者身份归属传统上依赖于多种文体特征,而预训练大语言模型的兴起使得新的上下文嵌入方法成为可能。通过融合这些不同方法有望提升性能,但尚未在似然比范式下应用于整合文体特征系统与基于嵌入的系统。本研究首次将基于似然比的法医文本比较应用于日语数字文本,使用约1000字符的博客摘录,以1)评估系统性能和似然比量级,2)评估融合文体特征系统与基于嵌入的系统的影响。结果表明,融合系统在保持优秀校准的同时,1)增加了与事实一致的似然比量级;2)减少了与事实相反的似然比量级;3)提高了整体判别能力。最佳融合实现了0.32484的对数似然比成本,既证明了似然比框架在日语中的可行性,也展示了异构系统融合的优势。

英文摘要

The likelihood ratio framework is widely recognized as the logically and legally sound basis for evidential analysis across forensic sciences, and its importance is increasingly acknowledged in analyses of authorship in textual evidence. To date, however, its application has been confined to English-language texts. Meanwhile, authorship attribution has traditionally relied on a diverse array of stylometric features, even as the rise of pre-trained large language models enables new contextual-embedding approaches. Combining these diverse approaches through fusion promises enhanced performance, yet it has not been applied to integrate stylometric-feature systems with embedding-based systems within the likelihood ratio paradigm. This study is the first to apply likelihood ratio-based forensic text comparison to Japanese digital texts, using ~1,000-character excerpts from blogs, to 1) evaluate system performance and likelihood ratio magnitudes and 2) assess the impact of fusing stylometric-feature systems with embedding-based systems. The results demonstrate that the fused system maintains excellent calibration while 1) increasing consistent-with-fact likelihood ratio magnitudes, 2) decreasing contrary-to-fact likelihood ratio magnitudes, and 3) improving overall discriminability. The best-performing fusion achieved a log-likelihood-ratio cost of 0.32484, illustrating both the feasibility of the likelihood ratio framework for Japanese and the benefits of fusion across heterogeneous systems.
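
For readers unfamiliar with the log-likelihood-ratio cost (Cllr) reported above, the following is a minimal sketch of the standard definition: the average of log2(1 + 1/LR) over same-author comparisons and log2(1 + LR) over different-author comparisons, halved. The input values are made up for illustration, and the function name `cllr` is hypothetical; the paper's own evaluation pipeline is not shown in the abstract.

```python
import math
from typing import Sequence


def cllr(same_author_lrs: Sequence[float], diff_author_lrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost for a set of likelihood ratios.

    same_author_lrs: LRs from comparisons where the authors truly match.
    diff_author_lrs: LRs from comparisons where the authors truly differ.
    Lower is better; a system that always outputs LR = 1 scores exactly 1.
    """
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs) / len(diff_author_lrs)
    return 0.5 * (same_term + diff_term)


if __name__ == "__main__":
    # An uninformative system: every comparison gets LR = 1.
    print(cllr([1.0, 1.0], [1.0, 1.0]))  # 1.0
    # A system whose LRs point strongly in the correct direction scores near 0.
    print(cllr([100.0, 50.0], [0.01, 0.02]))
```

Read against this scale, a Cllr of 0.32484 sits well below the uninformative value of 1, which is the sense in which the fused system carries useful evidential information.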

2606.14145 2026-06-15 cs.CL 新提交

Personal Care Utility: Health as Everyday Infrastructure

个人护理公用设施:健康作为日常基础设施

Mahyar Abbasian, Elahe Khatibi, Saba A. Farahani, Nitish Nagesh, Arshia Ilaty, Hooman Sajjadi, Amir Rahmani, Ramesh Jain

发表机构 * University of California, Irvine(加利福尼亚大学尔湾分校)

AI总结 提出个人护理公用设施(PCU)分层事件驱动架构,将健康视为日常基础设施,通过Personicle组织连续信号,分离临床决策与语言表达,以2型糖尿病为例验证其生成实时干预和知识引导的能力。

Comments 12 pages, 2 figures, 3 tables

详情
AI中文摘要

医疗保健本质上是必要的、专业的和偶发性的——围绕一个人每年与临床医生相处的大约一小时而设计。在临床环境之外的8,759小时中,饮食、睡眠、运动、用药和压力实际上塑造着长期健康,却没有相应的基础设施。个性化健康的瓶颈不是原始数据或推理能力,而是缺乏这一基础设施层。本文介绍了个人护理公用设施(PCU):一种分层事件驱动架构,被提议作为日常健康缺失的公用设施,就像支付、网络和电力是其领域的公用设施一样。PCU通过Personicle将连续的个人信号组织成具有语义意义的生活事件,根据个人基线估计动态健康状态,推理原因和背景,并通过一个编排器将临床决策逻辑、行为策略选择和自然语言表达分离,从而引导指导。这种分离使得大型语言模型能够支持推理和沟通,同时将安全关键的临床决策建立在经过验证的证据基础上。我们针对2型糖尿病实例化了PCU——将CGM、饮食、活动、用药、睡眠、压力和临床数据转化为血糖事件、个性化状态估计、因果解释和基于知识的干预。一个日常生活场景展示了相同的基础设施根据背景和风险产生实时提示、每周总结、用药检查、沉默或确定性安全警报。最后,我们讨论了PCU如何推广到其他慢性疾病以及任何始终在线的个人健康公用设施必须解决的治理问题。结果是一个蓝图,将个性化视为日常健康指导的架构属性,而不是最终的消息传递层。

英文摘要

Healthcare is essential, expert, and episodic by design - built around the roughly one hour per year a person spends with a clinician. The 8,759 hours outside clinical settings, where eating, sleeping, movement, medication, and stress actually shape long-term health, have no comparable infrastructure. The bottleneck for personalized health is not raw data or reasoning capability; it is the absence of that infrastructure layer. This paper introduces the Personal Care Utility (PCU): a layered, event-driven architecture proposed as the missing utility for everyday health, in the way that payments, networks, and power are utilities for their domains. PCU organizes continuous personal signals into semantically meaningful life events through a Personicle, estimates dynamic health state against personal baselines, reasons about cause and context, and routes guidance through an orchestrator that separates clinical decision logic, behavioral strategy selection, and natural-language expression. This separation lets large language models support reasoning and communication while keeping safety-critical clinical decisions grounded in validated evidence. We instantiate PCU for Type 2 Diabetes - turning CGM, meal, activity, medication, sleep, stress, and clinical data into glycemic events, individualized state estimates, causal explanations, and knowledge-grounded interventions. A day-in-the-life scenario shows the same infrastructure producing real-time nudges, weekly summaries, medication check-ins, silence, or deterministic safety alerts depending on context and risk. We close with how PCU generalizes to other chronic conditions and the governance questions any always-on personal health utility must address. The result is a blueprint that treats personalization not as a final messaging layer, but as an architectural property of everyday health guidance.

2606.14420 2026-06-15 cs.CL 新提交

Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

危机中的应对:2023年土耳其地震数字危机话语中应对风格的计算建模

Şevval Çakıcı

发表机构 * Koç University(科克大学)

AI总结 本研究利用BERTurk分类器,基于Lazarus和Folkman的应对理论,从百万条土耳其语推文中检测三种应对风格(问题聚焦、情绪聚焦、意义建构),揭示了危机不同阶段的动态变化,并发现愤怒与意义建构强相关。

Comments 20 pages, 5 figures, 3 tables. To be submitted to Social Science Computer Review

AI中文摘要

当灾难来临时,人们如何应对?我们能否从他们的文字中实时、大规模地检测到这种应对?本研究利用2023年2月6日土耳其地震后发布的超过一百万条土耳其语推文来回答这个问题。这场地震发生在深度政治极化的背景下,距离全国大选仅数月。基于Lazarus和Folkman(1984)的应对理论,我们开发了一个多标签BERTurk分类器,用于在四个理论驱动的危机阶段检测三种应对风格(问题聚焦、情绪聚焦和意义建构)。BERTurk的宏F1得分为0.693,显著优于零样本mDeBERTa基线(宏F1=0.324)。将该分类器应用于完整语料库后,揭示了清晰的时间轨迹:问题聚焦应对在紧急阶段占主导地位并急剧下降,情绪聚焦应对上升并趋于稳定,意义建构应对单调递增。愤怒与意义建构的相关性最强(Spearman r=0.387),表明愤怒作为一种动员力量,指向归责而非实际行动。这些发现表明,应对理论可以在现实世界的数字危机数据中可靠地操作化,并且这样做可以帮助人道主义组织根据人口的实际状态调整其响应。

英文摘要

How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.
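The macro F1 scores quoted above average per-label F1 over the three coping styles, so each style counts equally regardless of how often it occurs. A minimal sketch of that computation for multi-label predictions follows; the label names and toy data are illustrative assumptions, not the study's data or evaluation code.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 for multi-label data.

    y_true and y_pred are lists of label sets, one set per document.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # A label that never occurs and is never predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label names for the three coping styles.
LABELS = ["problem", "emotion", "meaning"]
gold = [{"problem"}, {"emotion", "meaning"}, {"meaning"}]
pred = [{"problem"}, {"emotion"}, {"emotion", "meaning"}]
score = macro_f1(gold, pred, LABELS)
```

Because the average is unweighted, a rare coping style that the classifier handles poorly pulls the score down as much as a frequent one would.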

2606.14580 2026-06-15 cs.CL 新提交

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

说服指数:一个理论指导的说服分析框架

Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn

发表机构 * University of Maryland, College Park(马里兰大学帕克分校) New York University(纽约大学)

AI总结 提出基于心理学和传播学理论的15维说服指数(PI)及55个子特征实现,在四个数据集上验证其能解释说服相关修辞模式,并提供轻量级预测信号。

AI中文摘要

识别有说服力的修辞线索在多个领域至关重要,从检测信息操纵、提高AI安全性到推进公共卫生沟通。我们提出说服指数(PI),这是一个基于心理学和传播学说服理论的15维分类法,以及一个使用55个子特征的透明实现,这些子特征基于词汇和规则检测器构建。该分类法是模块化的:单个检测器可以被替换,同时保留理论结构。通过在四个领域、风格和结果度量不同的公共数据集上评估PI,我们表明PI提供了一个共享的特征空间,用于解释与说服结果相关的修辞模式。线性模型表明,PI特征在保持计算轻量级的同时携带了有意义的预测信号。维度级分析揭示了PI维度与说服结果之间跨数据集的重复关联,同时也突出了主题和立场特定的变化。我们将PI作为开源包和Web界面发布,用于对人和AI中介的沟通进行原则性和可审计的分析。

英文摘要

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

2606.14626 2026-06-15 cs.CL 新提交

Characterizing Cultural Localization in AI-Generated Stories

表征AI生成故事中的文化本地化

Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种方法,通过识别区分国籍的词汇标记并移除后测量叙事相似性,检测AI生成故事中的模板化本地化,发现仅9-17%词汇解释国籍差异,且部分文化标记具有冒犯性。

Comments Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) Co-located with ACL 2026, San Diego, USA (non-archival)

AI中文摘要

人工智能的全球使用增加了对评估生成文化本地化内容(包括故事)能力的兴趣。故事中的文化本地化通常通过模板化本地化(在通用叙事中使用文化标记,如姓名、地点)或整体本地化(除文化标记外,情节、价值观和主题的变化)发生。我们提出了一种方法来衡量内容通过模板化本地化生成的程度。具体来说,我们识别出区分不同国籍故事的词汇标记,并测量移除这些标记后剩余叙事的相似性。在五个模型针对125个主题和193个国籍生成的故事中,我们的方法能够检测到仅有一小部分词汇(9-17%)解释了国籍间的差异,并且移除这些标记后剩余的叙事包含重复的多词序列,表明存在一个共享的文化无关叙事模板。最后,我们表征了文化标记的刻板性和冒犯性,发现来自19个国家(主要位于全球南方)的标记平均具有冒犯性。

英文摘要

The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.
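The measurement idea in this abstract (find the tokens that distinguish nationalities, remove them, then compare what is left) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the frequency-ratio rule for picking markers, the Jaccard similarity, and the example stories are not the authors' method or data.

```python
from collections import Counter


def distinctive_tokens(stories_by_group, min_ratio=0.8):
    """Tokens whose occurrences are concentrated in a single group."""
    totals = Counter()
    per_group = {}
    for group, stories in stories_by_group.items():
        counts = Counter(tok for s in stories for tok in s.lower().split())
        per_group[group] = counts
        totals.update(counts)
    markers = set()
    for counts in per_group.values():
        for tok, n in counts.items():
            # Share of this token's total occurrences that fall in one group.
            if n / totals[tok] >= min_ratio:
                markers.add(tok)
    return markers


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def residual_similarity(story_a, story_b, markers):
    """Similarity of two stories after the marker tokens are removed."""
    def strip(story):
        return [t for t in story.lower().split() if t not in markers]
    return jaccard(strip(story_a), strip(story_b))


# Hypothetical toy corpus: one story per nationality.
stories = {
    "jp": ["Hiro walked to the market in Kyoto"],
    "br": ["Paulo walked to the market in Recife"],
}
markers = distinctive_tokens(stories)
similarity = residual_similarity(stories["jp"][0], stories["br"][0], markers)
```

In this toy case the two stories differ only in a name and a place, so the residual narratives are identical once the markers are removed. That is the signature of templated rather than holistic localization.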

2606.13690 2026-06-15 cs.CY cs.CL 交叉投稿

Indirect Computing Model with Indirect Formal Method

间接计算模型与间接形式化方法

Xiaohui Zou

发表机构 * Institute for Higher Education(高等教育研究院) China University of Geosciences (Beijing)(中国地质大学(北京)) Institute of Synergistic Cultural Gene Engineering(协同文化基因工程研究院) Tsinghua Science Park(清华大学科技园) Sino-US Project: UC Berkeley Searle Research Bilingual Information Processing Group(中美项目:伯克利赛尔研究双语信息处理组)

AI总结 本文从人机界面与协同计算程序结合的协同智能计算系统视角,探讨间接计算模型与间接形式化方法支持的优化云计算技术原理,并介绍兼容大小字符串的间接计算模型和形式理论,以中文信息数据为例展示原型设计,旨在将数据中心优化为知识中心。

Comments 10 pages, 6 figures

Journal ref
Software 2011,32(5)
AI中文摘要

本文从人机界面与协同计算程序结合形成的协同智能计算系统视角,讨论了由间接计算模型与间接形式化方法组合支持的优化云计算技术原理。在系统回顾先前理论成果——图灵的可计算性理论、克林的小字符串形式理论、冯·诺依曼的数字计算机体系结构以及图灵关于AI判断的假说——对主流通用数字计算机范式的影响的基础上,作者重点介绍了一种兼容大小字符串的间接计算模型和间接形式理论。以中文信息数据为例,给出了协同智能计算系统原型的设计理念。其意义在于,该成果有助于将数据中心优化为知识中心。

英文摘要

This paper, from the perspective of a collaborative intelligent computing system formed by combining a human-computer interface with collaborative computing programs, discusses the principles of optimized cloud computing technology supported by the combination of an indirect computing model and an indirect formal method. After systematically reviewing how previous theoretical achievements (Turing's computability theory, Kleene's formal theory of small strings, von Neumann's digital computer architecture, and Turing's hypothesis on AI judgment) influenced the mainstream general-purpose digital computer paradigm, the author introduces an indirect computing model and an indirect formal theory compatible with both large and small strings. Using Chinese information data as an example, the design concept of a collaborative intelligent computing system prototype is presented. The significance is that this achievement facilitates the optimization of cloud computing from data centers to knowledge centers.

2606.14113 2026-06-15 cs.SE cs.CL 交叉投稿

Simulating Students' Java Programming Errors with Large Language Models

用大型语言模型模拟学生的Java编程错误

Ali Keramati, Jie Cao, Iman Mohammadi, Mark Warschauer, Yang Shi

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 探索用LLM模拟学生编程错误,评估五种模型在多样性和对齐性上的表现,发现Claude Sonnet 4平衡最佳,且合成错误与真实错误难以区分。

AI中文摘要

理解学生在编程中的错误是编程教育的基石,然而,对于任何新设计的任务,获取具有代表性的学生错误集仍然缓慢且成本高昂,因为真实的提交只有在广泛的课堂部署后才能积累。本文探讨了大型语言模型(LLMs)是否可以通过模拟代码提交中的真实逻辑错误,作为学生的可扩展代理。使用包含37个问题的74,000多个独特学生Java提交的CodeWorkout数据集,我们在三种主流提示策略下评估了五种LLM:输入-输出(IO)、思维链(CoT)和迭代自我改进。我们沿着两个关键维度评估性能:多样性(不同错误模式的范围)和对齐性(与真实学生错误的一致性),并考察这些维度如何随编程任务的困难程度变化。我们的定量发现表明,虽然所有模型都能生成多样化的错误,但它们与人类提交的对齐性存在差异:Claude Sonnet 4实现了最平衡的性能。此外,我们进行了一项盲法专家注释研究(N = 401),比较合成错误和真实错误。这一定性分析证实,生成的错误在功能上与真实学生错误无法区分。此外,更高困难程度的问题会引发更多样化但更不像学生的错误。这些结果突出了使用LLM模拟人类学习者的权衡,并为将合成错误集成到可教学代理、智能辅导系统和大规模学习分析中提供了设计考虑。

英文摘要

Understanding student errors in programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (agreement with authentic student mistakes), and examine how these vary with the struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

2606.14230 2026-06-15 cs.CV cs.CL 交叉投稿

A Multi-Domain Feature Fusion Framework for Generalizable Deepfake Detection Across Different Generators

面向不同生成器的可泛化深度伪造检测的多域特征融合框架

Amna Amjid, Sana Qadir, Mehwish Fatima, Raja Khurram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan(巴基斯坦国立科技大学电气工程与计算机科学学院) Department of Communication, Quality Management and Information Systems, Mid Sweden University, Ostersund Campus, Sweden(瑞典中部大学通信、质量管理和信息系统系,厄斯特松德校区) Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Luleå, Sweden(瑞典吕勒奥理工大学计算机科学、电气与空间工程系)

AI总结 提出SGFF-Net,融合空间、梯度和DWT频率表示,在双残差学习架构中实现跨生成器和跨范式的深度伪造检测,准确率提升至79.80%。

AI中文摘要

深度伪造是人工生成的图像、音频或视频,威胁隐私、安全和信息完整性。检测此类内容对于打击虚假信息至关重要,因为最新模型能生成高度逼真的内容。虽然基于空间或频率的方法在基于生成对抗网络(GANs)的深度伪造上取得了良好的检测率,但它们往往难以处理最近的扩散模型生成的图像。特别是,现有方法很少利用互补的多域表示或系统地评估跨生成器的鲁棒性。为了解决这些挑战,我们提出了一种多域深度伪造检测框架SGFF-Net(空间-梯度-频率融合网络),该网络在双残差学习架构中集成了空间、梯度和基于DWT(离散小波变换)的频率表示。实验结果表明,SGFF-Net在数据集内评估中达到了98.95%的准确率,并在跨模型(70.46%)和跨范式(69.94%)设置中提升了性能。结合多源训练和数据增强进一步增强了鲁棒性,在跨模型评估中准确率从70.46%提升到79.80%,在跨范式评估中从69%提升到78%,在真实世界数据上从61.50%提升到75.80%。与单域检测器不同,SGFF-Net在空间、梯度和小波频率域中学习互补的取证线索,从而在跨生成器和跨范式评估中具有更强的鲁棒性。结果进一步表明,将多域表示与数据多样性和增强相结合,显著提高了泛化能力,为开发更可靠的深度伪造检测系统提供了实用见解。

英文摘要

Deepfakes are artificially generated images, audio, or videos that threaten privacy, security, and information integrity. Detecting such content is crucial for countering disinformation, as the latest models generate highly realistic content. While spatial- or frequency-based approaches achieve good detection rates on deepfakes generated by Generative Adversarial Networks (GANs), they often struggle with images generated by recent diffusion models. In particular, existing approaches rarely exploit complementary multi-domain representations or systematically evaluate cross-generator robustness. To address these challenges, we propose a multi-domain deepfake detection framework called SGFF-Net (Spatial-Gradient-Frequency Fusion Network) that integrates spatial, gradient, and DWT (Discrete Wavelet Transform)-based frequency representations within a dual residual learning architecture. Experimental results show that SGFF-Net achieves 98.95% accuracy in intra-dataset evaluation and improves performance in both cross-model (70.46%) and cross-paradigm (69.94%) settings. Incorporating multi-source training and data augmentation further enhances robustness, increasing accuracy from 70.46% to 79.80% in cross-model evaluation, from 69% to 78% in cross-paradigm evaluation, and from 61.50% to 75.80% on real-world data. Unlike single-domain detectors, SGFF-Net learns complementary forensic cues across spatial, gradient, and wavelet-frequency domains, resulting in greater robustness under cross-generator and cross-paradigm evaluation. The results further show that combining multi-domain representations with data diversity and augmentation substantially improves generalization, providing practical insights for developing more reliable deepfake detection systems.

2606.14348 2026-06-15 physics.soc-ph cond-mat.stat-mech cs.CL 交叉投稿

Detecting Historical Turning Points in Italian Media: A Complex Systems Approach to a Diachronic News Corpus

检测意大利媒体中的历史转折点:一种针对历时新闻语料库的复杂系统方法

Dario Zarcone, Salvatore Miccichè, David Sanchez

发表机构 * Department of Physics and Chemistry, University of Palermo(巴勒莫大学物理与化学系) Institute for Cross-Disciplinary Physics and Complex Systems (IFISC), UIB-CSIC(跨学科物理与复杂系统研究所(IFISC), UIB-CSIC)

AI总结 通过NLP和复杂系统方法分析意大利《共和报》1985-2000年的60万篇文章,无监督检测出意大利第一共和国向第二共和国过渡、海湾战争等关键历史转折点。

Comments 16 pages, 9 figures, 1 table

AI中文摘要

大规模文本语料库的日益可用性为使用自然语言处理(NLP)进行数据驱动的、定量的历史分析开辟了新的可能性。然而,来自前数字时代的具有历史意义的历时语料库仍然稀缺且往往不完整。我们提出了一种基于重建和探索历时语料库的定量历史分析方法,该语料库包含来自意大利报纸《共和报》的约60万篇文章,涵盖了从1985年1月1日至2000年12月31日期间发表的所有文章——这是意大利和全球政治、社会和地缘政治发生重大变革的时期。使用NLP技术,我们在词汇和语义层面分析文本;然后应用复杂系统和统计物理的工具来追踪媒体话语随时间的变化。这使我们能够检测到关键的过渡时期,例如意大利从第一共和国向第二共和国的过渡,或主要的国际冲突如海湾战争或科索沃战争,而无需依赖先验标注。结果表明,将计算语言学与复杂系统的思想相结合,可以为历史变化提供新的定量见解,为通过大规模文本数据研究媒体和社会的动态开辟了新途径。

英文摘要

The increasing availability of large-scale textual corpora has opened new possibilities for data-driven, quantitative approaches to historical analysis using Natural Language Processing (NLP). However, diachronic corpora with historical relevance from the pre-digital era remain scarce and often incomplete. We present a quantitative approach to historical analysis based on the reconstruction and exploration of a diachronic corpus of around 600,000 articles from the Italian newspaper "La Repubblica", covering all the articles published from the 1st of January 1985 to the 31st of December 2000 - a period of major political, social, and geopolitical change in Italy and globally. Using NLP techniques, we analyze the text at both lexical and semantic levels; we then apply tools from complex systems and statistical physics to trace shifts in media discourse over time. This allows us to detect key transition periods, such as the transition from the First Republic to the Second Republic in Italy, or major international conflicts like the Gulf War or the Kosovo War, without relying on prior labeling. The results show how combining computational linguistics with ideas from complex systems can offer new quantitative insight into historical changes, opening up new paths for studying the dynamics of media and society through large-scale textual data.

2606.14654 2026-06-15 cs.AI cs.CL cs.LG 交叉投稿

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

将跨领域动作序列抽象为可解释的工作流

Gaurav Verma, Scott Counts

发表机构 * Microsoft Corporation(微软公司)

AI总结 提出WorkflowView框架,利用大语言模型将低层动作序列抽象为高层活动,在三个不同任务中验证了有效性和泛化能力,实现高语义相似度和预测性能。

Comments preprint; 9 pages, 5 figures

AI中文摘要

序列或时间戳交互日志提供了数字应用使用的客观记录,但其粒度和噪声常常掩盖了关于人们工作的有意义见解。这些见解对于以真实用户交互为基础改进数字产品至关重要。先前的研究应用深度学习模型将用户动作聚类为高层活动,但这些方法对噪声高度敏感且难以跨应用泛化。为解决这一局限,我们引入了WorkflowView,一个使用大语言模型(LLMs)将低层动作序列抽象为高层活动的框架。我们在三个不同且具有挑战性的序列任务和多样化领域中建立了该方法的有效性和泛化性:(a)从浏览器日志中进行零样本任务描述重构(实现高语义相似度,$\mu_{sim} = 0.91$),(b)使用MOOC交互日志进行少样本学生退学预测(仅用五个少样本示例达到加权$F_1 = 0.90$),以及(c)对Microsoft Word中文档工作流中AI工具集成进行匿名化、隐私保护分析。我们的工作表明,基于LLM的抽象是将低层行为数据转化为高层、可解释且可操作见解的稳健高效途径。我们还讨论了在日志基础设施中部署基于LLM的推理时的实际考虑,包括计算效率和用户隐私。

英文摘要

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

2606.14688 2026-06-15 cs.LG cs.AI cs.CL cs.DS 交叉投稿

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

洪流与收获:通过极限语言生成视角证明琐碎知识对于生成有价值数学的必要性

Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

发表机构 * University of New South Wales(新南威尔士大学) University of Sydney(悉尼大学) University of Cambridge(剑桥大学)

AI总结 本文通过极限语言生成模型证明,在形式化数学生成中,验证器无法替代品味:覆盖未记录的有价值数学必须产生无限但渐近可忽略的琐碎语句,这是理论上的必然。

AI中文摘要

与证明助手耦合的AI系统现在能够大规模生成形式化数学,而验证器可验证的内容与数学家认为有价值的内容之间的差距已成为制约因素。我们将有价值数学的生成建模为极限下的嵌套语言生成:通过成员查询预言机(证明检查器)访问的可验证形式语言$F$包含一个未知的有价值语言$H \in \mathcal{H}$,该语言仅通过核心$C \subseteq H$的对抗性枚举揭示,其精确密度为$\alpha$(文献)。每个输出要么是有价值的($\in H$),要么是琐碎的($\in F \setminus H$),要么是幻觉($\notin F$)。我们解决了四个问题。第一,验证器不是品味:允许广度生成的集合恰好是无预言机模型中的那些,按纤维由Angluin条件刻画。第二,验证器确实提供了可靠覆盖,覆盖所有未见过的有价值陈述同时仅断言有效陈述:有验证器可能,无验证器不可能;它将不可避免的错误从虚假转移到琐碎。第三,核心地,关于紧族存在尖锐二分法:生成有限个琐碎语句的生成器达到最优覆盖$\alpha/2$,而任何无限琐碎语句的允许,即使以消失速率,也将最优值跃升至$1-\alpha/2$(两者均为紧界,对于以候选交集形式呈现的核心),且存在一个生成器同时达到两端。转变在于琐碎语句的数量而非速率;间隙$1-\alpha$是未记录的质量。第四,两种机制在数学的压缩模型中实例化。完美的验证器无法替代品味:正确但无价值的语句的无界流并非工程事故,而是可证明的必要性,因为覆盖未记录的有价值数学需要无限但渐近可忽略的已认证琐碎语句流。

英文摘要

AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.
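As a quick arithmetic check on the dichotomy stated in the abstract, the gap between the two optimal coverage values follows directly from the stated formulas. The worked instance below uses an illustrative core density of 0.4, which is an assumption chosen for the example and does not come from the paper.

```latex
\begin{align*}
\text{finitely many trivia:}\quad & \text{optimal coverage} = \alpha/2 = 0.2,\\
\text{infinitely many trivia:}\quad & \text{optimal coverage} = 1 - \alpha/2 = 0.8,\\
\text{gap:}\quad & (1 - \alpha/2) - \alpha/2 = 1 - \alpha = 0.6 \qquad (\alpha = 0.4).
\end{align*}
```

The gap equals $1-\alpha$ for any density, which matches the abstract's reading of it as the unrecorded mass.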

2604.24942 2026-06-15 cs.CL q-bio.NC 版本更新

Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

基于独立成分的故事理解过程中大脑活动的编码模型

Kamya Hari, Taha Binhuraib, Jin Li, Cory Shain, Anna A. Ivanova

发表机构 * School of Electrical and Computer Engineering, Georgia Institute of Technology(佐治亚理工学院电子与计算机工程学院) School of Psychological and Brain Sciences, Georgia Institute of Technology(佐治亚理工学院心理学与脑科学学院) Department of Linguistics, Stanford University(斯坦福大学语言学系)

AI总结 提出基于独立成分的编码框架,从fMRI数据中分离刺激驱动和噪声信号,利用语言模型预测独立成分时间序列,识别出与听觉和语言相关的认知网络,验证了成分的可解释性和跨个体一致性。

Comments Accepted to CCN 2026 (Proceedings Track)

AI中文摘要

编码模型为连接连续刺激特征与神经活动提供了强大框架;然而,传统的体素方法受限于测量噪声、个体间变异以及由编码重叠神经信号的空间相关体素引起的冗余。本文提出了一种基于独立成分(IC)的编码框架,从fMRI数据中分离刺激驱动和噪声驱动信号。我们使用一部分数据将自然故事聆听过程中的连续fMRI数据分解为独立成分,并在独立数据上训练编码模型,从语言输入的大型语言模型表示中预测独立成分时间序列。跨被试来看,一部分独立成分表现出持续的高可预测性。这些独立成分在空间和时间上跨被试一致,并包括已知在故事聆听期间响应的认知网络(听觉和语言)。听觉成分时间序列与声学刺激特征强相关,突出了所识别成分时间序列的可解释性。被ICA-AROMA识别为噪声或运动相关伪影的成分表现出普遍较差的预测性能,证实高预测成分反映的是真实的刺激相关神经信号而非混淆因素。总体而言,基于独立成分的编码模型能够在功能网络层面进行分析,适应个体间网络位置的变异性,并提供易于跨被试比较的可解释结果。代码见:https://github.com/kamyahari/IC-Encoding-Models.git

英文摘要

Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects. Code provided at: https://github.com/kamyahari/IC-Encoding-Models.git

2606.04883 2026-06-15 cs.CL cs.LO 版本更新

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

优化 Lean 中智能定理证明器的成本-质量权衡

Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

发表机构 * University of Washington(华盛顿大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种包含数据平面和控制平面的动作路由智能体,通过观察失败轨迹并估计成功概率与成本来动态决定继续证明或重新分解,在 PutnamBench 子集上平均降低 25.8% 成本且保持性能。

AI中文摘要

大型语言模型(LLMs)越来越多地用于在 Lean 中生成形式化证明的工作流程。这些工作流程通常将问题分解为更小的引理,采样许多证明尝试,并使用编译器反馈来指导搜索。然而,它们可能成本高昂,往往在最终失败的尝试上花费大量计算。在这项工作中,我们通过一个包含数据平面和控制平面的动作路由智能体来解决这个问题。数据平面生成自然语言的引理分解,在 Lean 中形式化它们,并为由此产生的定理和引理目标采样证明尝试。控制平面观察之前失败的 Lean 尝试,估计成功可能性和另一次尝试的成本,并决定是继续证明当前目标还是从新的分解重新开始。在 PutnamBench 的一个子集上,我们的智能体平均比固定步长基线降低 25.8% 的成本,在显著减少计算量的同时保持性能。这些结果表明,失败的 Lean 轨迹为智能定理证明中的成本感知资源分配提供了可操作的信号。

英文摘要

Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and the cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 25.8% over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
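The control-plane decision described above (continue on the current target or restart from a new decomposition) can be pictured as a comparison of estimated success probability per unit cost. The sketch below is a hypothetical decision rule written for illustration: the `Estimate` type, the ratio criterion, and the numbers are assumptions, and the abstract does not specify how the agent actually combines its estimates.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    p_success: float  # estimated probability that the next attempt succeeds
    cost: float       # estimated cost of that attempt (any consistent unit)


def choose_action(continue_est: Estimate, restart_est: Estimate) -> str:
    """Return the action with the higher success probability per unit cost."""
    def value(est: Estimate) -> float:
        # A free attempt is treated as maximally attractive.
        return est.p_success / est.cost if est.cost > 0 else float("inf")

    if value(continue_est) >= value(restart_est):
        return "continue"
    return "restart"


# Hypothetical estimates after several failed attempts on one lemma.
decision = choose_action(Estimate(0.05, 1.0), Estimate(0.30, 2.0))
```

A ratio rule like this stops spending on a target once repeated failures have driven its estimated success probability down, even when a restart costs more per attempt.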

2602.06142 2026-06-15 cs.PL cs.AI cs.CL cs.LG cs.PF 版本更新

Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering

Protean Compiler: 一种驱动细粒度阶段排序的敏捷框架

Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao, Ibrahim Saidoun, Ziwen Wang, Bryan Chan, Tomasz S. Czajkowski

发表机构 * Huawei Technologies Canada(华为技术加拿大)

AI总结 提出Protean Compiler框架,在LLVM中内置细粒度阶段排序能力,通过140多种静态特征收集方法和机器学习优化,平均加速4.1%,最高15.7%。

Comments Version 3: Preprint version of the accepted work at ACM TACO 2026

AI中文摘要

阶段排序问题自20世纪70年代末以来一直是一个长期挑战,但由于其优化空间巨大且具有无界性,至今仍是一个开放问题,没有有限解。传统上,这种局部优化决策由手工编码的算法针对少量基准测试进行调整,当基准测试套件变化时,通常需要大量精力重新调整。过去20年中,机器学习被用于构建性能模型以改进编译器优化的选择和排序,但这些方法并未无缝集成到编译器中,也从未在细粒度的代码段范围内实现。本文提出Protean Compiler:一种敏捷框架,使LLVM在细粒度范围内具备内置的阶段排序能力。该框架还包含一个完整的库,包含140多种在不同范围内手工设计的静态特征收集方法,实验结果表明,相对于LLVM的O3,在Cbench应用程序上仅需增加几秒构建时间,平均加速可达4.1%,最高可达15.7%。此外,Protean编译器易于与第三方ML框架和其他大型语言模型集成,两步优化的两个应用在CBench的Susan和Jpeg应用程序上相对于-O3分别获得10.1%和8.5%的加速。Protean编译器无缝集成到LLVM中,可作为新的、增强的、全功能的编译器使用。我们计划在不久的将来将该项目发布到开源社区。

英文摘要

The phase ordering problem has been a long-standing challenge since the late 1970s, yet it remains open: the optimization space is vast and unbounded, so the problem has no finite solution, although one can limit its scope by reducing the number and the length of optimizations. Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to retune when the benchmark suite changes. In the past 20 years, machine learning has been employed to construct performance models that improve the selection and ordering of compiler optimizations; however, these approaches are not built into the compiler seamlessly and have never been leveraged at the fine-grained scope of code segments. This paper presents Protean Compiler, an agile framework that gives LLVM built-in phase-ordering capabilities at a fine-grained scope. The framework also comprises a complete library of more than 140 handcrafted static feature collection methods at varying scopes, and the experimental results show speedup gains over LLVM's -O3 of up to 4.1% on average and up to 15.7% on select Cbench applications, at the cost of only a few extra seconds of build time. Additionally, Protean Compiler integrates easily with third-party ML frameworks and large language models, and two applications of this two-step optimization yield speedups of 10.1% and 8.5% over -O3 on Cbench's Susan and Jpeg applications, respectively. Protean Compiler is seamlessly integrated into LLVM and can be used as a new, enhanced, full-fledged compiler. We plan to release the project to the open-source community in the near future.

2606.12923 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Order Is Not Control: Driven-Dissipative Response Laws Across Artificial and Biological Systems

秩序并非控制

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

发表机构 * Australian Broadcasting Corporation(澳大利亚广播公司)

AI总结 本文论证秩序不等于控制,提出接收器门控响应定律,并在生物、大语言模型、适配器和随机算子面板中验证,表明控制是局部的、可测量的。

Comments 52 pages, 7 figures, updated title

详情
AI中文摘要

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律:一个分母索引算子,将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的:干预可以被接纳、饱和、变号、泄漏或过驱动,取决于介质、浴、接收器状态、动作端口和比较器。当有限努力在相同分母下移动目标或结果读出类别,而损伤、无效/规避、无效格式、过驱动和不必要努力保持有界时,控制被分配。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据,同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出响应定律:在四种物质条件下,响应向量的分量符号预测准确率为72.8-73.7%,非零分量上提升至84.3-84.8%;留出观察者以93.6%和91.7%的准确率预测系统效应和目标/预言家族。宪法条件适配器将易感性重塑为制备介质,随机算子面板将测量机会与可部署行动策略分离。这给出了介观控制层面的驱动-耗散响应系统描述:驱动通过制备介质、浴和接收器作用,产生接纳运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子,同时将可部署的预生成控制、隐藏/logit因果充分性、生物到LLM坐标同一性以及字面热力学量排除在范围之外。

英文摘要

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2507.09011 2026-06-15 cs.CL q-bio.NC q-bio.QM 版本更新

From dots to faces: Individual differences in visual imagery capacity predict the content of Ganzflicker-induced hallucinations

从点到面孔:个体在视觉意象能力上的差异预测Ganzflicker诱导的幻觉内容

Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson

Affiliations * Department of Cognitive Science, University of California, San Diego (USA); Department of Psychology, Institute of Population Health, University of Liverpool (UK); Department of Computer Science, University of California, San Diego (USA); Kavli Institute for Brain and Mind, San Diego (USA)

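AI summary The study examines how differences in visual imagery ability shape the content of Ganzflicker-induced hallucinations. Using natural language processing to analyze descriptions from more than 4,000 people, it finds that strong imagers describe complex, naturalistic content while weak imagers report simple geometric patterns.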

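Details
AI abstract (translated from Chinese)

A rapidly alternating red-and-black display known as Ganzflicker elicits visual hallucinations that reflect the generative capacity of the visual system. Individuals differ in the vividness of their visual imagery, from none at all to vivid. Recent proposals hold that differences in the visual system along this imagery spectrum should also affect the complexity of other internally generated visual experiences. Here, we use natural language processing tools to analyze free-text descriptions from more than 4,000 participants and ask whether people with different imagery phenotypes see different content in Ganzflicker-induced hallucinations. Topic modeling shows that strong imagers describe complex, naturalistic content, whereas weak imagers report simple geometric patterns. Using crowd-sourced sensorimotor norms, we also find that strong imagers use language with richer perceptual associations. These findings may reflect individual differences in coordination between early visual areas and the higher-order regions associated with the imagery spectrum.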

English abstract

A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Individuals vary in their degree of visual imagery, ranging from absent to vivid imagery. Recent proposals suggest that differences in the visual system along this imagery spectrum should also influence the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind's eye during Ganzflicker-induced hallucinations. Topic modeling of descriptions revealed that strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Using crowd-sourced sensorimotor norms, we also found that participants with stronger imagery used language with richer perceptual associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.

2511.17637 2026-06-15 cs.LG cs.CL Updated version

PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, Kai Han

Affiliations * University of Science and Technology of China

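AI summary This paper proposes PocketLLM, which compresses large language models in a latent space via meta-networks, using an encoder and a decoder for efficient compression. Experiments show that accuracy stays high even at high compression ratios.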

Comments AAAI 2026 camera ready

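Details
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33250-33258 (2026)
AI abstract (translated from Chinese)

As large language models (LLMs) keep growing, storing them and transmitting them to edge devices is becoming increasingly challenging. Traditional methods such as quantization and pruning struggle to achieve extreme compression without sacrificing accuracy. This paper introduces PocketLLM, a new method that compresses LLMs in a latent space via meta-networks. A simple encoder network projects the LLM's weights into discrete latent vectors, which are then represented with a compact codebook. A lightweight decoder network maps the codebook's representative vectors back to the original weight space. The compressed model consists of only a small decoder, a concise codebook, and an index, which allows substantial compression of the large weights in LLMs. Extensive experiments show that PocketLLM maintains superior performance even at high compression ratios, for example compressing Llama 2-7B by 10x with a negligible loss in accuracy.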

English abstract

As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. This method allows for significant compression of the large weights in LLMs, consisting solely of a small decoder, a concise codebook, and an index. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
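
The codebook-plus-index idea in the abstract above can be illustrated with a small sketch. This is not the PocketLLM implementation: it swaps the learned encoder and decoder networks for plain k-means vector quantization, and every name in it (fit_codebook, compress, decompress, compression_ratio) is hypothetical. The storage estimate assumes 16-bit weights and ignores the decoder network that the paper also stores.

```python
import random
from math import ceil, log2

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k], vec))

def fit_codebook(vectors, num_codes, iterations=10, seed=0):
    """Plain k-means, a stand-in (assumption) for the learned encoder and codebook."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, num_codes)]
    for _ in range(iterations):
        buckets = [[] for _ in range(num_codes)]
        for v in vectors:
            buckets[nearest(codebook, v)].append(v)
        for k, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid when no vector was assigned to it
                codebook[k] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook

def compress(weights, dim, num_codes):
    """Split a flat weight list into dim-sized vectors and keep one index per vector.

    Assumes len(weights) is a multiple of dim.
    """
    vectors = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    codebook = fit_codebook(vectors, num_codes)
    indices = [nearest(codebook, v) for v in vectors]
    return codebook, indices

def decompress(codebook, indices):
    """Rebuild an approximate flat weight list by looking up each index."""
    return [x for i in indices for x in codebook[i]]

def compression_ratio(num_weights, dim, num_codes, bits_per_weight=16):
    """Original size over (codebook + index) size, in bits; decoder cost is ignored."""
    original_bits = num_weights * bits_per_weight
    index_bits = (num_weights // dim) * max(1, ceil(log2(num_codes)))
    codebook_bits = num_codes * dim * bits_per_weight
    return original_bits / (codebook_bits + index_bits)
```

Under these storage assumptions, 1,024 weights split into 4-dimensional vectors with a 16-entry codebook give compression_ratio(1024, 4, 16) = 8.0. The ratio grows with the number of weights because the codebook cost is fixed while the index costs only a few bits per vector.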

2506.00784 2026-06-15 cs.CL Updated version

Research Borderlands: Analysing Writing Across Research Cultures

Research Borderlands: Analysing Writing Across Research Cultures

Shaily Bhatt, Tal August, Maria Antoniak

Affiliations * Language Technologies Institute, Carnegie Mellon University; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Pioneer Centre for AI, University of Copenhagen; Allen Institute for Artificial Intelligence (Ai2)

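AI summary Through interviews with interdisciplinary researchers, this paper surfaces the linguistic norms of research cultures and evaluates how well LLMs adapt to them, highlighting the effectiveness of a human-centered approach to measuring cultural norms.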

Comments Accepted to ACL 2025 (Main)

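Details
AI abstract (translated from Chinese)

Improving the cultural competence of language technologies is important. However, most recent work rarely engages with the communities it studies, relying instead on synthetic setups and imperfect proxies for culture. This paper takes a human-centered approach to discovering and measuring language-based cultural norms and the cultural competence of LLMs. We focus on one kind of culture, research cultures, and one task, adapting writing across research cultures. Through interviews with interdisciplinary researchers, who are experts at moving between cultures, we build a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them to (a) surface latent cultural norms in human-written research papers at scale and (b) highlight LLMs' lack of cultural competence and their tendency to homogenise writing. Overall, our work demonstrates the effectiveness of a human-centered approach to measuring cultural norms in human-written and LLM-generated text.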

English abstract

Improving cultural competence of language technologies is important. However, most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, research cultures, and a single task, adapting writing across research cultures. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenise writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.

2406.11565 2026-06-15 cs.CL cs.CY Updated version

Extrinsic Evaluation of Cultural Competence in Large Language Models

Extrinsic Evaluation of Cultural Competence in Large Language Models

Shaily Bhatt, Fernando Diaz

Affiliations * Carnegie Mellon University

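AI summary This paper evaluates models' cultural competence on two text generation tasks. It finds that cultural cues in the prompt do change the outputs, but that the similarity of outputs across countries correlates only weakly with those countries' cultural values.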

Comments Accepted to EMNLP Findings 2024

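Details
AI abstract (translated from Chinese)

Productive interactions between diverse users and language technologies require the latter's outputs to be culturally relevant and sensitive. Prior work has evaluated models' knowledge of cultural norms, values, and artifacts, but has not considered how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation in two text generation tasks: open-ended question answering and story generation. We quantitatively and qualitatively evaluate how model outputs change when an explicit cultural cue in the prompt, specifically nationality, is perturbed. Although we find that model outputs do vary as nationality changes and do feature culturally relevant words, we also find only weak correlations between the similarity of outputs for different countries and the cultural values of those countries. Finally, we discuss important considerations for designing comprehensive evaluations of cultural competence in user-facing tasks.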

English abstract

Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
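
The final analysis step described above, correlating output similarity between countries with how close those countries' cultural values are, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which similarity measure or value source the authors use, so Jaccard word overlap, Euclidean distance over value vectors, and Pearson correlation are stand-ins, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def jaccard(text_a, text_b):
    """Word-overlap similarity between two texts (assumed stand-in measure)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def euclidean(u, v):
    """Distance between two cultural-value vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

def similarity_vs_value_distance(outputs, values):
    """Correlate pairwise output similarity with pairwise value distance.

    outputs: country -> generated text; values: country -> value vector.
    Both inputs are hypothetical placeholders supplied by the caller.
    """
    pairs = list(combinations(sorted(outputs), 2))
    sims = [jaccard(outputs[a], outputs[b]) for a, b in pairs]
    dists = [euclidean(values[a], values[b]) for a, b in pairs]
    return pearson(sims, dists)
```

If outputs tracked cultural values closely, countries with distant value vectors would produce dissimilar text and the coefficient would be strongly negative. A coefficient near zero would correspond to the weak relationship the abstract reports.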

2406.03221 2026-06-15 cs.CL cs.IR Updated version

Linking Named Entities in Diderot's Encyclopédie to Wikidata

Linking Named Entities in Diderot's Encyclopédie to Wikidata

Pierre Nugues

Affiliations * Université de Lausanne

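AI summary This paper links more than 10,300 entries of the Encyclopédie to Wikidata identifiers, connecting them to the knowledge graph, and presents the annotation method for geographic and human entities along with application examples.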

Comments 6 pages, 3 figures

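Details
Journal ref
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 10610-10615
AI abstract (translated from Chinese)

Diderot's Encyclopédie is a reference work from 18th-century Europe that aimed to collect the knowledge of its time. Wikipedia has the same goal, but with a much larger scope. However, the lack of a digital connection between the two may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata, which backs articles with a graph of structured data. This paper describes the annotation of more than 10,300 entries of the Encyclopédie with Wikidata identifiers to connect them to the graph. We considered geographic and human entities. The Encyclopédie has no standalone biographical entries, because biographies mostly appear as subentries of locations. We extracted all geographic entries and fully annotated every entry containing a description of a human entity. This amounts to more than 2,600 links pointing to locations or human entities. In addition, we annotated more than 9,500 entries with geographic content only. We describe the annotation process and application examples. The resource is available at https://github.com/pnugues/encyclopedie_1751.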

English abstract

Diderot's Encyclopédie is a reference work from the XVIIIth century in Europe that aimed at collecting the knowledge of its era. Wikipedia has the same ambition with a much greater scope. However, the lack of digital connection between the two encyclopedias may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata, which backs the articles with a graph of structured data. In this paper, we describe the annotation of more than 10,300 of the Encyclopédie entries with Wikidata identifiers, enabling us to connect these entries to the graph. We considered geographic and human entities. The Encyclopédie does not contain biographic entries as they mostly appear as subentries of locations. We extracted all the geographic entries and we completely annotated all the entries containing a description of human entities. This represents more than 2,600 links referring to locations or human entities. In addition, we annotated more than 9,500 entries having a geographic content only. We describe the annotation process as well as application examples. This resource is available at https://github.com/pnugues/encyclopedie_1751