arXivDaily arXiv每日学术速递 周一至周五更新

1. 大语言模型与基础模型 27 篇

2606.19348 2026-06-19 cs.CL cs.AI 新提交

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-V4: 迈向高效百万令牌上下文智能

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chengyu Hou, Chenhao Xu, Chenze Shao, Chong Ruan, Conner Sun, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Donghao Li, Dongjie Ji, Erhang Li, Fang Wei, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei, Haotian Yuan, Haowei Zhang, Haowen Luo, Haoyu Chen, Haozhe Ji, Hengqing Zhang, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J Yang, JQ Zhu, Jia Luo, Jia Song, Jia Yu, Jialiang Huang, Jialu Cai, Jian Liang, Jiangting Zhou, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jieyu Yang, Jin Chen, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jingzi Zhou, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junmin Zheng, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Leyi Xia, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Linyan Zhu, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, MS Di, MY Xu, Max Mei, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Mingxu Zhou, Minmin Han, Ning Wang, Panpan Huang, Panpan Wang, Peixin Cong, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, Ruyi Chen, SH Liu, Shanghao Lu, Shangmian Sun, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoqing Wu, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W Zhang, WL Xiao, Wangding Zeng, Wei An, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wenqing Hou, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyu Zhang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xueyin Chen, Xuheng Lin, Xuwei Fu, YC Yan, YQ Wang, YW Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Xu, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Shao, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Yiling Ma, Ying He, Ying Tang, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, YuKun Li, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuehan Yang, Yuer Xu, Yuhan Wu, Yuhao Meng, Yuheng Zou, Yukun Zha, Yunfan Xiong, Yupeng Chen, Yuping Lin, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yuting Yan, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, ZF Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zekai Zhang, Zhangli Sha, Zhe Fu, Zhe Ju, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zheren Gao, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhongyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihua Qu, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Ziyi Wan, Zizheng Pan, Zongqing Yao

发表机构 * DeepSeek-AI(深度求索人工智能)

AI总结 提出DeepSeek-V4系列MoE模型,通过混合注意力架构、流形约束超连接和Muon优化器,实现百万令牌上下文的高效推理,在核心任务上超越前代。

详情
AI中文摘要

我们展示了DeepSeek-V4系列的预览版本,包括两个强大的混合专家(MoE)语言模型——DeepSeek-V4-Pro(1.6T参数,49B激活)和DeepSeek-V4-Flash(284B参数,13B激活),两者均支持一百万个令牌的上下文长度。DeepSeek-V4系列在架构和优化方面引入了多项关键升级:(1)混合注意力架构,结合压缩稀疏注意力(CSA)和重度压缩注意力(HCA),以提高长上下文效率;(2)流形约束超连接(mHC),增强传统残差连接;(3)Muon优化器,实现更快的收敛和更高的训练稳定性。我们在超过32T多样且高质量的令牌上预训练了两个模型,随后通过全面的后训练流程解锁并进一步增强其能力。DeepSeek-V4-Pro-Max是DeepSeek-V4-Pro的最大推理努力模式,重新定义了开放模型的最先进水平,在核心任务上超越了其前代。同时,DeepSeek-V4系列在长上下文场景中非常高效。在百万令牌上下文设置下,与DeepSeek-V3.2相比,DeepSeek-V4-Pro仅需27%的单令牌推理FLOPs和10%的KV缓存。这使得我们能够常规支持百万令牌上下文,从而使长时任务和进一步的测试时扩展更加可行。模型检查点可从此https URL获取。

英文摘要

We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

2606.19349 2026-06-19 cs.CL cs.AI 新提交

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

查询应置于何处?通过解码动力学揭示并缓解扩散大语言模型中上下文学习的位置偏差

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

发表机构 * Southeast University(东南大学)

AI总结 本文系统分析了扩散大语言模型中查询位置对生成质量的影响,发现其与示例语义质量同等重要,并提出基于平均置信度的无训练自适应路由策略Auto-ICL以优化查询放置。

Comments 9 figures, 4 tables

详情
AI中文摘要

尽管上下文学习(ICL)在自回归(AR)大语言模型(LLMs)中已被广泛研究,但其在扩散大语言模型(dLLMs)中的机制仍基本未被探索。与受单向因果掩码限制的AR模型不同,dLLMs本质上利用双向注意力,为查询放置提供了广泛的空间灵活性。不幸的是,当前实践通常继承AR风格的尾随查询模板,往往忽略了结构范式转变。本文通过全面分析揭示了查询位置实际上是dLLMs中的一阶变量。通过经验解耦,我们证明了位置方差对生成质量的影响与示例语义质量相当。在内部,这种位置敏感性源于注意力流中的空间“近因效应”以及解码轨迹中依赖于任务的偏移。为了在没有真实标签的情况下缓解这种不稳定性,我们揭示了传统的单步置信度($C_{decoded}$)在dLLMs中失效。相反,我们提出了平均置信度($\overline{C}$),一种跟踪迭代解码过程的新指标。通过建立基础的空间ICL基线,我们引入了Auto-ICL,一种无需训练的自适应路由策略,动态优化查询放置,在异构推理和感知任务中稳健地接近最优性能。

英文摘要

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial ``Recency Effect'' in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.

2606.19350 2026-06-19 cs.CL 新提交

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

基于因果归因的剪枝保留大型语言模型的推理性能

Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

发表机构 * Edison Academy Magnet School(爱迪生学院磁石学校) Massachusetts Institute of Technology(麻省理工学院) State University of New York College at Plattsburgh(纽约州立大学普拉茨堡学院) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员)

AI总结 提出无需训练的因果归因剪枝(CAP)方法,通过测量注意力头对推理任务的因果影响进行细粒度剪枝,在20%稀疏度下相比Wanda在ARC-Challenge上准确率提升高达61%。

Comments Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLMs)在多步推理方面表现出色,但推理成本高昂。我们引入了因果归因剪枝(CAP),一种无需训练的方法,通过测量注意力头对推理任务的因果影响来识别关键注意力头,并利用这些头级分数指导细粒度的权重剪枝。对于每个注意力头,CAP估计在推理问题的小型校准集上前向传播时掩码该头所导致的预期性能下降。这些因果分数随后被转换为对应投影矩阵的权重级重要性值。与仅基于幅度或激活的标准不同,CAP的干预测量直接捕捉每个头的功能贡献,在20%稀疏度下,相比Wanda在ARC-Challenge上获得高达61%的相对准确率提升。我们在GSM8K、StrategyQA和ARC-Challenge上使用Llama-3-8B-Instruct和Mistral-7B-Instruct在10%、20%和50%稀疏度下评估CAP。在中等稀疏度(10-20%)下,CAP在大多数模型-基准配置中优于Wanda,尤其在Llama-3的ARC-Challenge上提升显著。我们的结果表明,在相同稀疏度下,注意力头级因果归因比相关性剪枝标准能更好地保留下游基准的推理性能,但在50%稀疏度下仍受限于粗粒度的MLP归因。

英文摘要

Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations. with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.

2606.19353 2026-06-19 cs.CL cs.LG 新提交

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

发表机构 * POSTECH(浦项科技大学)

AI总结 针对上下文学习(ICL)中预测对提示设计敏感的问题,提出基于贝叶斯观点和机制可解释性的自函数向量,直接估计偶然不确定性,并设计严格评估协议,在合成和真实数据集上验证了方法的可靠性及在幻觉检测等应用中的实用性。

Comments Accepted to ACL 2026

详情
AI中文摘要

上下文学习(ICL)使LLM能够从少量示例中适应新任务,但其可靠性仍存疑虑:预测对提示设计和模型理解上下文的能力高度敏感,使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键,然而现有方法针对标准生成任务设计,未能捕捉ICL的独特动态。为解决此问题,我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念,从而在贝叶斯框架内直接估计偶然不确定性,并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议,我们还提出了首个严格的评估协议,其中数据以受控方式被操纵,以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架(最初基于合成任务进行概念开发,随后扩展到真实世界数据集),我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外,我们展示了它可作为可信相关应用(如幻觉检测)的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

英文摘要

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition-separating aleatoric from epistemic sources-is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce a concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first and rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthy-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

2606.19354 2026-06-19 cs.CL cs.LG 新提交

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

粒度调控的自适应计算效率:测试时扩展中的最优验证

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

发表机构 * European University of Tirana(欧洲地拉那大学)

AI总结 提出GRACE理论框架,将验证粒度建模为问题难度、验证器准确率和计算预算的函数,证明存在相变:细粒度验证在计算预算大或问题难时占优,粗粒度验证在低预算简单问题时更优,自适应策略可达到计算-性能帕累托前沿。

详情
AI中文摘要

测试时扩展(TTS)已成为一种强大的范式,通过在推理时投入额外计算来提升大语言模型(LLMs)的推理性能。TTS的核心组件是验证器,它选择或评分候选解以引导搜索过程。虽然先前工作已探索验证的益处,但一个基本问题仍未充分探索:在给定计算预算下,最优验证粒度是什么?粗粒度的结果奖励模型(ORMs)和细粒度的过程奖励模型(PRMs)代表两个极端,但两者单独均无法在所有场景下实现计算最优性。本文建立了一个统一的理论框架,称为GRACE(粒度调控的自适应计算效率),该框架将最优验证粒度刻画为问题难度、验证器准确率和计算预算的显式函数。我们证明存在一个相变:当计算预算大或问题难时,细粒度验证占优;而在低预算、简单问题场景下,粗粒度验证更受青睐。我们的理论将Best-of-N、束搜索和步骤级MCTS统一在一个帕累托最优框架内,并激发了一种自适应粒度策略,该策略可证明达到计算-性能帕累托前沿。在MATH-500、GSM8K和AIME基准上的实验结果证实了所有四个理论主张,在匹配计算量下,我们的自适应策略相比固定粒度基线准确率提升高达3.1%。

英文摘要

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the \emph{verifier}, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: \emph{what is the optimal granularity of verification under a given compute budget?} Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called \textbf{GRACE} (\underline{G}ranularity-\underline{R}egulated \underline{A}daptive \underline{C}omputational \underline{E}fficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1\% accuracy at matched compute.

2606.19625 2026-06-19 cs.CL cs.LG 新提交

Where Does Social Reasoning Come From? Capability Provenance in Language Models

社会推理从何而来?语言模型中的能力来源

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

发表机构 * Georgia Institute of Technology, College of Computing(佐治亚理工学院计算学院) MATS Program(MATS项目) EleutherAI KAIST AI(韩国科学技术院人工智能学院) Georgia Tech AI Safety Initiative(佐治亚理工学院人工智能安全倡议)

AI总结 通过训练数据归因方法,发现OLMo3-7B中社会推理和STEM推理依赖于不同的预训练语料区域,且推理层面的差异比知识层面更显著。

Comments Under review at COLM 2026 (Conference)

详情
AI中文摘要

我们使用训练数据归因作为可解释的工具进行能力发现,映射预训练语料库中哪些区域支持OLMo3-7B的社会推理与STEM推理。训练数据归因衡量每个训练文档对模型在基准测试上的预测的影响强度,但文档级别的分数过于嘈杂,无法识别哪些语料区域支持哪些能力,且先前的工作侧重于事实知识而非推理。我们在从去重后的Dolma3混合数据中抽取的工作集上计算基于梯度的归因(通过Bergmann的TrackStar),聚合跨WebOrganizer的24格式×24主题分类(576个箱子)的影响,并在2×2设计中对比基准对,该设计变化领域(社会 vs. STEM)和能力类型(推理 vs. 知识):SocialIQA和MMLU社会科学对比ARC-Challenge和MMLU STEM。社会和STEM推理依赖于定性不同的语料区域,且推理层面的对比比知识层面更尖锐。有针对性的机器遗忘提供了部分因果验证:遗忘高归因主题箱(例如,SocialIQA的文学)比箱内随机基线更严重地降低对齐的基准,我们开源所有代码、采样清单、箱级影响矩阵和遗忘检查点。

英文摘要

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.

2606.19668 2026-06-19 cs.CL 新提交

Code-Switching Reveals Language Anchoring in Multilingual LLMs

代码切换揭示多语言大模型中的语言锚定

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

发表机构 * Chung-Ang University(中央大学) Adobe Research(Adobe研究院)

AI总结 通过语法强制代码切换诊断多语言大模型中的语言锚定现象,提出锚定偏差度量并设计CANVAS干预方法,有效缓解代码切换导致的问答性能下降。

Comments 36 pages, 13 figures, 27 tables

详情
AI中文摘要

多语言大模型(MLLMs)越来越需要处理代码切换(CS)输入,然而混合语言通常会导致性能相对于源语言或目标语言单语版本下降。为了理解这种退化,我们使用语法强制CS作为受控诊断设置,将CS表示相对于其源和目标对应物进行定位。我们引入锚定偏差(Anchor Bias),一种几何度量,用于量化语言锚定,即CS隐藏状态是否更接近其源语言或目标语言对应物。在不同的MLLMs中,锚定偏差揭示了一致的语法框架效应:源框架CS保持源锚定,而目标框架CS向目标方向移动,并显示出更大的问答(QA)退化。受这种表示模式的启发,我们提出了CANVAS(基于上下文锚定的神经向量对齐引导),一种推理时干预方法,从输入中提取源侧画布,并在预填充期间将目标语言隐藏状态软引导向源锚定。CANVAS在MLLMs和CS条件下一致地恢复了QA F1分数,表明内部锚定信号为缓解CS推理失败提供了可行的目标。

英文摘要

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.

2606.19744 2026-06-19 cs.CL cs.AI cs.HC 新提交

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

超越统一遗忘:不同偏好设置下顺序直接偏好优化的研究

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

发表机构 * Network Analysis and Social Influence Modelling (NASIM) Lab(网络分析与社会影响建模实验室) School of Physics Maths and Computing, The University of Western Australia(西澳大学物理数学与计算学院) School of Psychological Science, The University of Western Australia(西澳大学心理科学学院) School of Computing, Macquarie University(麦考瑞大学计算机学院)

AI总结 研究顺序DPO在不同偏好设置下的影响,发现遗忘模式并非统一,而是取决于目标关系、信号强度和训练顺序,并提出未来对齐流程应考虑目标兼容性。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

将语言模型与人类偏好对齐通常需要优化多个行为目标。一种实用方法是使用直接偏好优化(DPO)等偏好优化方法顺序应用这些目标,但目前尚不清楚后续训练是否会统一降低先前学习的偏好,或者这种影响是否取决于目标之间的关系。我们研究了跨越四种偏好设置(包括分布冲突、多属性交互、强安全信号和兼容的响应质量目标)的顺序DPO。使用带有LoRA适配器的Llama-3.1-8B-Instruct,我们在每个阶段后使用固定的基础模型参考评估所有目标。我们发现顺序DPO不会产生单一的遗忘模式;偏好变化从部分退化到稳定、成对重新分配或正迁移,具体取决于目标关系、信号强度和训练顺序。使用长度归一化策略边界的成对分析表明,聚合指标可能掩盖偏好对之间的异质性变化,而四分位数分解显示,高置信度对可能根据设置而退化或改进。机制诊断表明,在所有设置中,阶段2的梯度和适配器更新与先前目标接近正交,几乎没有证据表明直接梯度对立是主要驱动因素。这些发现表明,未来的顺序对齐流程应考虑目标兼容性和信号强度,而不是假设后续目标会统一影响先前的偏好。

英文摘要

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage~2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.

2606.19815 2026-06-19 cs.CL 新提交

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

聚类即一切:利用语言模型中的语义聚类预训练Tsetlin Machine以实现可解释性

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

发表机构 * Independent Researcher(独立研究员) University of California, Irvine(加州大学尔湾分校) University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 提出一种语义预训练框架,通过K-means或Top2Vec将文本聚类,用聚类-样本对预训练Tsetlin Machine,使其学习可解释的语义关键词,在五个数据集上性能优于传统方法且与BERT竞争。

详情
AI中文摘要

预训练语言模型如BERT在文本分类任务中表现强劲,但缺乏透明度,限制了在高风险场景中的应用。Tsetlin Machine (TM) 提供完全可解释的基于子句的推理,但捕获的语义信息有限,先前桥接两者的尝试依赖于静态词嵌入,忽略了上下文含义。我们提出一种语义预训练框架,无需使用嵌入即可将知识从预训练语言模型转移到TM中。文本样本通过K-means或Top2Vec被分组为语义一致的聚类,得到的聚类-样本对通过增强的Type I反馈预训练一个非否定TM。因此,TM学习到可解释的语义关键词,并在下游任务上进行微调。在五个数据集上,我们的方法显著优于传统和基于嵌入的TM,性能与BERT竞争,同时保持可解释性。

英文摘要

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

2606.19831 2026-06-19 cs.CL cs.LG 新提交

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

杠杆不等于可达性:语言模型中单神经元操控的控制窗口定律

Hongliang Liu

发表机构 * Palo Alto Networks

AI总结 提出预算归一化控制窗口框架,通过残差范数与写入范数之比定义的相干预算,预测单神经元干预何时产生连贯行为控制,并在15个神经元上验证了预测精度。

详情
AI中文摘要

对齐语言模型通过稀疏前馈神经元门控拒绝和语言路由等行为,但尚无理论预测单神经元干预何时连贯地控制行为而非导致输出崩溃。我们开发了一个预算归一化的控制窗口框架用于单神经元操控。沿一个写入方向的剂量简化为一个控制坐标:残差流与写入之间的对齐,该对齐沿着一条通用饱和曲线驱动,以残差范数除以写入范数设定的相干预算为单位。当行为触发点低于崩溃上限时,存在连贯控制。同一坐标控制良性模式切换和拒绝;上限由权重和一次通用前向传播得出,而触发点在 rollout 时测量。在15个保留神经元上,预测上限的平均绝对误差为0.14,在批量层中约为0.07,并且承诺的开启或关闭判定在11个神经元上成立,而多数基线为10/15。关闭情况揭示了三种失败模式而非违反:触发前崩溃、深度不足以传播、或归一化限制了单个神经元能推动的距离。该定律解释了为什么局部梯度归因反直觉地预测控制:真正的控制器偏离读出轴写入,并携带接近零的一阶梯度。由窗口精确化的仅前向对比筛选恢复了归因遗漏的控制器。在拒绝这一最难案例中,干预成功是类型化的而非标量:连贯旁路和严格可操作可达性分离,因此一个神经元可以在流畅、任务相关且无操作内容的文本中翻转拒绝,而真正的可操作可达性仅出现在六个审计的 Llama 枢轴中的三个,且仅在较晚的 rollout 时间范围内。因此,单神经元操控是对可控性的预算化、类型化审计,而非固定剂量的轶事。

英文摘要

Aligned language models gate behaviors such as refusal and language routing through sparse feed forward neurons, yet no theory predicts when a single neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget normalized control window framework for single neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti predicts control: true controllers write off the readout axis and carry a near zero first order gradient. A forward only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed dose anecdote.

2606.19857 2026-06-19 cs.CL cs.AI 新提交

Large Language Models Do Not Always Need Readable Language

大型语言模型并不总是需要可读语言

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

AI总结 研究提出BabelTele表示法,将语义编码为紧凑、非标准文本,牺牲人类可读性但保持LLM可恢复性,实验表明可压缩至27.9%长度并保持99.5%语义保真度,降低上下文开销。

Comments 23 pages, 10 figures. Preprint

详情
AI中文摘要

大型语言模型(LLM)通常使用人类可读的自然语言进行提示和交互,即使目标读者是另一个模型。本文研究语义信息是否可以编码为紧凑、非标准的文本形式,这种形式牺牲了人类可读性,但能被LLM恢复。我们将这类以模型为中心的文本表示称为BabelTele,这里不是作为固定协议,而是作为探索LLM生成和解释此类表示能力的经验探针。通过可读性诊断、模型似然度量、人类问卷和下游任务评估,我们发现BabelTele可以显著偏离普通自然语言,同时为指令调优的LLM保留核心语义。作为一种任务无关的表示范式,BabelTele展示了高信息密度,即使文本体积压缩到原始长度的27.9%,也能保持99.5%的语义保真度。我们进一步评估了其在跨模型迁移、智能体记忆和多智能体通信中的语义鲁棒性。结果表明,BabelTele可以降低上下文开销,同时通常保持可靠的下游性能,但其有效性取决于压缩器-读取器对和任务设置。这些发现表明,人类可读性、自然语言典型性和模型端语义可恢复性可以部分解耦,为未来探索LLM系统中的模型原生表示开辟了道路。

英文摘要

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.19946 2026-06-19 cs.CL cs.LG 新提交

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

GEMS: 几何约束使LLM中多语义叠加成为可能

Yu Deng

AI总结 提出GEMS方法,通过范数保持加权叠加、目标注意力路径注入和实时正交化两个几何约束,解决无训练多方向激活干预中的分布偏差和方向干扰问题,在GSM8K上保持98%准确率。

Comments 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering

详情
AI中文摘要

激活引导通过在推理时修改中间隐藏状态来控制模型行为,无需重新训练。现有方法仅处理单方向注入;当多个语义方向无约束叠加时,模型崩溃。我们证明这种崩溃分解为两个独立作用的来源:分布偏差(加法扰动在层间累积范数并将激活推出训练分布)和方向干扰(非正交语义向量叠加时相互抑制)。这两个来源定义了任何无训练多方向干预必须满足的设计约束。作为这些原则的一个实例,我们提出GEMS,一种无训练方法,将每个来源映射到相应的几何约束:针对分布偏差的范数保持加权叠加和目标注意力路径注入,以及针对方向干扰的实时正交化。在GSM8K上,注入三个并发非数学方向保持98%的准确率(基线92%),而无约束加法崩溃至4%;在Wikitext-2上,相同注入仅导致2.2%的PPL增加。组件消融隔离了每个约束的因果作用,层级探针确认正交化信号通过FFN路径存活并以语义特异性到达输出分布。定性引导效果跨架构从3B到31B迁移。

英文摘要

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.

2606.20089 2026-06-19 cs.CL cs.AI 新提交

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

IHUBERT: 面向波斯语资源的基于向量的语义去重与领域平衡预训练

Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar

AI总结 提出IHUBERT,一个基于RoBERTa-base的波斯语预训练模型,通过多阶段预处理(包括基于向量数据库的语义去重和领域平衡)在45GB语料上训练,在多项NLU任务上取得领先结果,尤其抽取式问答表现突出。

详情
AI中文摘要

波斯语预训练语言模型仍然受到大规模高质量预训练语料库稀缺以及标准分类和NER任务之外评估不足的限制。我们提出了IHUBERT,一个从头训练的波斯语单语PLM,采用RoBERTa-base编码器(1.25亿参数),在Sepahr-Danesh集合的45GB精选子集(约70-80亿token)上进行训练。为了提高语料质量并减少冗余,我们采用多阶段预处理流程,包括规范化、精确和近似重复去除、匿名化,以及基于向量数据库的语义去重,以实现跨领域和语体的分布平衡控制。我们还在完整的预训练语料库上训练了一个13.9万词汇量的BPE分词器,以更好地捕捉波斯语的形态和拼写变化。IHUBERT在七个波斯语NLU基准测试上进行评估,涵盖NER、情感分析、主题分类、NLI、抽取式问答和关系抽取,使用任务标准指标(实体级F1、宏F1、EM/F1)。IHUBERT在抽取式QA上取得了最强增益,在PQuAD(F1 88.3542)和ParsiNLU-RC(F1 49.0987)上均排名第一,并在FarsTail上取得了最佳结果(宏F1 0.8350)。在NER和主题分类上,它保持竞争力(例如,ParsTwiNER上F1 0.8308;DigiMag上宏F1 0.7953),而关系抽取仍然是主要差距(PERLEX上宏F1 0.6684)。在IHUBERT预训练语料库上的受控分词器消融实验表明,在匹配词汇量下,BPE产生的子词碎片化程度略低于WordPiece,支持了我们的分词设计。总体而言,IHUBERT通过语义精选的大规模预训练以及跨分类和理解型任务的广泛评估,推进了波斯语语言建模。

英文摘要

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.

2606.20097 2026-06-19 cs.CL 新提交

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

HydraHead:从头部级功能异质性到专业化注意力混合

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出HydraHead架构,沿头部维度混合全注意力和线性注意力,通过可解释性驱动的头部选择和尺度归一化融合模块,在长上下文任务中优于层级混合设计,仅用15B token训练即在512K上下文长度上提升69%。

详情
AI中文摘要

注意力的二次复杂度对长上下文处理构成了关键瓶颈,激发了混合注意力设计的兴趣。大多数开源混合模型采用层级策略。然而,先前工作注意到线性注意力与全注意力整合的内在困难,表明注意力混合的设计空间仍未充分探索。为了探索这一空间,我们进行可解释性分析,观察到层表现出块级功能相似性,而同一层内的单个头部尽管共享输入特征,却显示出不同的功能专门化。这种头部级异质性表明,头部维度为融合异质注意力信号提供了自然且原则性的粒度。基于这一洞察,我们引入了HydraHead,一种沿头部轴混合全注意力和线性注意力的新型架构。HydraHead具有两个关键创新:(1)一种可解释性驱动的选择策略,识别检索关键的头部并仅为其保留全注意力;(2)一种尺度归一化融合模块,调和全注意力和线性注意力头部输出之间的分布差距。通过利用参数重用和蒸馏的三阶段迁移流程,我们以最小的训练开销实现了高性能混合模型。在统一的训练设置下,HydraHead在长上下文任务中优于其他混合设计,同时保持强大的通用推理能力。通过可解释性驱动的头部选择,它以7:1的线性注意力与全注意力比例匹配了3:1层级混合的长上下文性能。关键的是,仅用15B token训练,HydraHead在512K上下文长度上比基线提升超过69%,接近Qwen3.5(一个具有256K原生上下文长度的类似规模领先模型)。这突显了头部级混合的显著扩展潜力。

英文摘要

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

2606.20482 2026-06-19 cs.CL cs.HC cs.LG 新提交

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

你的鼠标和眼睛悄悄泄露你的偏好:利用用户隐式反馈进行LLM对齐

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校) York University(约克大学)

AI总结 针对显式反馈稀缺的问题,提出利用鼠标轨迹和眼动数据等隐式反馈训练奖励模型,将文本奖励模型准确率从55%提升至64%,并显著提高DPO对齐后响应质量。

详情
AI中文摘要

为了对齐大型语言模型(LLM),大多数现有方法收集显式的人类反馈,并基于响应文本训练奖励模型来预测人类偏好。这些现有方法有两个关键局限性。首先,用户很少为LLM响应提供显式反馈,这使得高质量偏好标注的收集成本高昂。其次,这些方法没有利用隐式人类反馈,而隐式反馈已被证明对互联网巨头的经济护城河至关重要。为了量化隐式反馈的价值,我们构建了一个名为IFLLM的新数据集,收集了来自59名Mechanical Turk工作者的1336个多轮问题、他们的鼠标轨迹以及通过网络摄像头对LLM响应的眼动注视点。IFLLM显示用户具有非常多样化的注视行为和鼠标轨迹。基于隐式用户反馈的奖励模型将基于文本的奖励模型准确率从55%提升至64%,并在将DPO应用于八个LLM后,相对响应质量改进几乎翻了三倍,证明了隐式反馈在现实场景中的价值。我们的数据收集网站、数据集和代码可在以下网址找到:此https URL。

英文摘要

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from the 59 Mechanical Turk workers, their mouse trajectories, and eye gazing points to the LLMs' responses from their webcams. IFLLM shows that the users have very diverse types of gazing behavior and mouse trajectories. Our reward model based on the implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying the DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and codes can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

2606.19379 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Transformer 前馈块有多线性?逐块线性可恢复性是学习得到的,而非架构决定的

Stuart Whipp

发表机构 * Independent Research(独立研究)

AI总结 通过精确最小二乘线性近似,测量训练后 Transformer 各前馈块的线性可恢复性,发现其高度异质且非单调,是学习得到的属性而非架构决定,并可用于压缩和诊断。

Comments 14 pages, 5 figures

详情
AI中文摘要

Transformer 前馈网络(FFN)通常被视为非线性的计算存储单元,但训练后的 FFN 块实际非线性程度很少被测量。我们将每个 FFN 视为位置级的输入-输出映射,并将其分解为精确的最小二乘线性近似加上残差。闭式线性映射解释的留出方差定义了一个块的线性可恢复性(R^2_lin),这是一种无需优化器的线性度量。在 GPT-2、Pythia-160m 和 llama-160m 的所有十二个块中,R^2_lin 高度异质且随深度非单调变化,相邻块之间范围从近线性(>0.99)到强非线性(<0.3),且并非由激活函数决定:相同宽度的 GELU 模型 GPT-2 和 Pythia-160m 具有截然不同的轮廓,因此可恢复性是单个训练块的学习属性,而非架构属性。残差的低秩双线性探针仅恢复少量 R^2 点,且增益与残差非线性不相关:未恢复的计算不是单个位置级乘积,而是高阶或分布式结构。该测量还作为有针对性的压缩信号:可恢复块允许大的单层替换(GPT-2 的早期 FFN 参数减少 8 倍,困惑度增加 +0.77),而低可恢复性块标记了这不安全的情况。它还暴露了一个方法论陷阱:训练后的线性基线可能在病态条件的 Transformer 激活上严重欠收敛,因此我们报告了整个过程中精确的闭式最小二乘上限。

英文摘要

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.

2606.19475 2026-06-19 cs.AI cs.CL 交叉投稿

Diffusion Language Models: An Experimental Analysis

扩散语言模型:一项实验分析

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学) University of Pisa(比萨大学)

AI总结 本文系统比较了八种扩散语言模型在推理、编码、翻译等任务上的表现,分析了去噪步数、上下文长度等推理因素对性能与效率的影响,揭示了扩散语言模型在不同任务和预算下的权衡。

详情
AI中文摘要

大型语言模型(LLMs)通过自回归生成彻底改变了语言建模,使其在广泛的任务中表现出色。最近,扩散语言模型(DLMs)作为一种替代范式出现,它通过迭代去噪而非下一个词预测来生成文本,从而允许对整个序列进行并行精炼。尽管已经提出了许多基于扩散的架构,但评估协议、数据集、推理预算和生成超参数的差异使得比较它们的能力和理解它们提供的权衡变得困难。在这项工作中,我们对现代DLMs进行了系统的实验分析。具体来说,我们评估了八种最先进的DLMs在八个基准上的表现,这些基准涵盖推理、编码、翻译、知识和结构化问题解决,同时明确考虑了生成质量和计算效率。除了下游评估,我们还分析了关键推理时间因素的影响,包括去噪步数、上下文长度、块大小和并行解掩策略,并通过在相同条件下训练的较小模型的受控比较来补充大规模实验。我们的分析突出了基于扩散的语言建模在不同任务、架构和推理预算下的优势和局限性。我们表明,DLMs的行为受到生成时间设计选择的强烈影响,导致性能和计算效率之间的不同权衡。总体而言,我们的研究为当代DLMs的能力和部署特性提供了实用见解。

英文摘要

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19697 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

Efficiently Representing Algorithms With Chain-of-Thought Transformers

高效表示链式思维Transformer中的算法

Yanhong Li, Anej Svete, Ashish Sabharwal, William Merrill

发表机构 * Allen Institute for AI(艾伦人工智能研究所) ETH Zürich(苏黎世联邦理工学院)

AI总结 本文证明链式思维Transformer能以多对数开销高效模拟Word RAM算法,包括排序和Dijkstra算法,优于模拟图灵机的二次开销。

详情
AI中文摘要

推理模型(即在产生答案前输出一系列推理或思维token的语言模型)日益流行,部分原因在于理论结果表明链式思维(CoT)Transformer可以模拟图灵机,从而执行任意计算。然而,图灵机虽然适用于复杂性理论分析,但在讨论算法时并不方便、直观或高效。算法通常在更高的抽象层次上设计和分析,即具有随机访问存储器和单位成本操作(对$\bigO(\log n)$位字)的Word RAM模型。因此,Word RAM算法可能比其图灵机对应物更高效,这引出了一个问题:CoT Transformer能否高效模拟Word RAM算法?例如,它们能否在$\bigO(n \log n)$步内对n个元素排序,或在$\bigO(E + V \log V)$步内运行Dijkstra算法?我们给出肯定回答,开销不超过多对数。我们首先为具有多对数宽度和最右唯一硬注意力的有限精度Transformer建立这一结果,然后将结果推广到两个更实际的设置:有限宽度和对数精度:连续CoT(其中推理采用向量而非token形式)和混合架构(其中Transformer层位于循环(线性RNN)层之上)。在所有三种情况下,我们发现CoT可以高效模拟任何Word RAM算法,仅需在n上多对数开销。当Word RAM具有“平坦”指令集时,此开销降至对数平方,而对于无乘法平坦指令仅需对数开销——这与已知的CoT模拟图灵机(需要二次开销)形成鲜明对比。

英文摘要

The increasing popularity of \emph{reasoning} models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the \emph{Word RAM} model with random-access memory and unit-cost operations on $\bigO(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: \emph{Can CoT transformers efficiently simulate Word RAM algorithms?} For instance, can they sort $n$ items in $\bigO(n \log n)$ steps or run Dijkstra's algorithm in $\bigO(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: \emph{continuous} CoT, where reasoning takes the form of vectors rather than tokens, and a \emph{hybrid} architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT \emph{can} efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a ``flat'' instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

2606.19750 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形赌博机:大语言模型潜在几何上的贝叶斯课程学习

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

发表机构 * University of California, San Diego(加州大学圣迭戈分校)

AI总结 提出贝叶斯流形课程(BMC)框架,将问题采样建模为流形结构赌博机问题,通过层次任务树和贝叶斯学习引导采样,平衡学习信号、多样性和实用性。

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

详情
AI中文摘要

强化学习(RL)是提高大语言模型(LLMs)推理能力的关键方法,其中训练效率关键取决于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示,将问题选择视为具有独立臂的标准赌博机问题,忽略了任务空间的结构化和异质性。在这项工作中,我们将问题采样框架化为具有内生非平稳性的流形结构赌博机问题:问题通过模型的潜在表示空间相关联,采样决策可以影响学习信号在该空间中的演变方式。为了实现这一视角,我们引入了贝叶斯流形课程(BMC),这是一个结构感知框架,将问题组织成层次任务树,并应用贝叶斯学习来指导采样。实验发现,不同的采样策略在生产性(学习信号)、多样性(任务流形覆盖)和实用性(评估相关性)之间引入了非平凡的权衡。这些结果表明,仅优先考虑难度不足以获得强大的下游性能,突出了将结构和类型感知纳入问题采样中的重要性。

英文摘要

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

2606.19808 2026-06-19 cs.AI cs.CL 交叉投稿

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

再思考还是更长时间思考?面向预算感知推理的选择性验证

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系) Fralin Biomedical Research Institute, Virginia Tech(弗吉尼亚理工大学弗拉林生物医学研究所) FBRI Cancer Research Center(FBRI癌症研究中心)

AI总结 提出选择性验证框架SEVRA,通过服务层控制器决定是否对冻结求解器的初始答案进行验证,在Math500上以更少token达到更高准确率,并减少有害翻转。

详情
AI中文摘要

测试时推理越来越多地被用作服务时的控制旋钮,但额外的推理并非均匀有价值:它可以修复失败的尝试,在已经正确的答案上浪费计算,或引入有害的答案更改。我们将其视为一个部署分配问题,而非新验证器问题。我们引入SEVRA,即面向推理分配的选择性验证,这是一个服务层控制器,决定是保留冻结求解器的初始答案还是调用主动验证。使用冻结的Qwen3-4B求解器,我们记录干预结果并从服务可见的尝试状态训练可恢复性感知的门控。在Math500上,选择性验证达到76.3%的准确率,而始终验证为75.5%,同时将生成后token减少26.8%,有害翻转从2.2%降至1.0%。然而,8,192 token的初始求解达到76.0%的准确率,总模型token减少28%,表明选择性恢复有用但并非测试的最佳成本前沿。在冻结迁移到GSM时,选择性策略仅验证3.0%的样本,准确率从93.4%提升至94.5%,验证token相对于始终验证减少91.2%;同样,更长的初始求解以更少的实际token达到相同准确率。在CommonsenseQA上,始终开启的验证有害,而Self-Consistency@5以约五倍的实际token成本提升准确率。由此得出的部署规则是:首先调整初始预算,然后在需要显式检查、有限重试、可审计性或回归风险控制时使用选择性恢复。

英文摘要

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce \sevra, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On \mathfive, selective verification reaches 76.3\% accuracy, compared with 75.5\% for always verifying, while reducing post-generation tokens by 26.8\% and harmful flips from 2.2\% to 1.0\%. However, an 8,192-token initial solve reaches 76.0\% accuracy with 28\% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to \gsm, the selective policy verifies only 3.0\% of examples, improves accuracy from 93.4\% to 94.5\%, and reduces verification tokens by 91.2\% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.

2606.20075 2026-06-19 cs.LG cs.CL 交叉投稿

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

什么使得潜在思维链中的监督有效:一种信息论分析

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology(宁波数字孪生研究院,东方理工大学) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系)

AI总结 本文从信息论角度分析潜在思维链中的监督失效问题,提出轨迹监督和空间监督两个维度,并引入统一潜在探针(ULP)量化信息保真度,揭示了信息-性能绑定关系。

详情
AI中文摘要

潜在思维链(Latent Chain-of-Thought, CoT)将推理内化到连续隐藏状态中,为冗长的离散推理轨迹提供了一种有前景的替代方案。然而,鲁棒的潜在推理仍然困难,因为结果监督提供的学习信号较弱,且容易导致潜在轨迹发生语义漂移。在这项工作中,我们从信息论角度分析潜在CoT,并将这种失效识别为双重崩溃:优化路径上的梯度衰减和潜在空间中的表征漂移。我们进一步将过程监督分解为两个互补维度:轨迹监督(注入密集的逐步推理信号)和空间监督(保持潜在流形的语义结构)。我们的分析表明,刚性几何压缩可能坍缩推理空间,而生成式重建提供了更灵活的语义锚点,更好地保留了信息容量。为了衡量这些效应,我们引入了统一潜在探针(Unified Latent Probe, ULP),用于量化潜在轨迹与显式推理步骤之间的互信息。实验揭示了清晰的信息-性能绑定关系:推理准确性取决于潜在链中保留的信息保真度。这些发现为潜在推理监督提供了一个原则性框架,并建议从几何模仿转向互信息最大化。我们的代码可在\href{this https URL}{此仓库}获取。

英文摘要

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at \href{https://github.com/EIT-NLP/Supervision-in-Latent-CoT}{this repository}.

2606.20295 2026-06-19 cs.SE cs.CL 交叉投稿

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

AI总结 本文提出多模型融合、模型优化、计算-模型融合、计算-网络-模型融合四层技术架构,系统综述各层关键技术及产业现状,旨在降低令牌成本、提升服务效率、保障供应稳定性,推动大模型服务从可调用到可运营的转变。

Comments 62 pages, 36 figures

详情
AI中文摘要

大模型推理优化是支撑大模型服务可扩展、低成本、高稳定运行的关键基础。本文以面向令牌的推理优化技术为核心,首次提出由多模型融合、模型优化、计算-模型融合、计算-网络-模型融合组成的四层技术架构,系统梳理了这四层的关键技术和产业现状,并分析了相关技术在实际业务场景中的应用价值。本文为降低令牌生产成本、提高令牌服务效率、保障令牌供应稳定性、推动大模型服务从可调用到可运营的转变提供了实用的技术路径。

英文摘要

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2510.18383 2026-06-19 cs.CL cs.AI 版本更新

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

发表机构 * Seoul National University of Science and Technology(首尔科学技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) LG CNS

AI总结 提出MENTOR方法,通过灵活的教师优化奖励结构,平衡行为对齐与下游性能,提升小模型在工具使用任务中的域外泛化能力。

详情
AI中文摘要

将大型语言模型(LLMs)的工具使用能力蒸馏到小型语言模型(SLMs)中对其实际应用至关重要。主要方法监督微调(SFT)由于与静态教师轨迹的刚性对齐,导致域外(OOD)泛化性能较差。虽然强化学习(RL)提供了一种替代方案,但SLMs的能力限制带来了严峻的困境:稀疏的结果奖励提供的指导不足,而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距,我们提出了MENTOR,它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制,而是利用教师的参考来指导工具使用行为,平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明,与SFT和严格RL基线相比,MENTOR提高了OOD工具使用性能。我们的研究结果表明,在可验证的工具使用环境中,灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

英文摘要

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2603.25702 2026-06-19 cs.CL 版本更新

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2:通过免训练自我推测实现扩散LLM的快速解码

Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

发表机构 * Red Hat AI Innovation(红帽AI创新) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室) Iowa State University(爱荷华州立大学) Core AI, IBM(IBM核心AI)

AI总结 提出S2D2,一种免训练的自我推测解码框架,通过将块扩散模型在块大小为1时变为自回归模型,实现草稿与验证角色复用,在不增加训练或测试计算下提升解码速度与准确性。

Comments Code is available at https://github.com/phymhan/S2D2

详情
AI中文摘要

块扩散语言模型通过结合块级自回归解码与块内并行去噪,为超越自回归生成提供了一条有前景的路径。然而,在实际加速所需的少步数场景中,标准的置信度阈值解码往往脆弱:激进的阈值损害质量,而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外训练,要么增加测试时计算。我们提出S2D2,一种用于块扩散语言模型的免训练自我推测解码框架。我们的关键观察是,当块大小减小到1时,块扩散模型变为自回归模型,从而允许相同的预训练模型同时充当草稿模型和验证模型。S2D2在标准块扩散解码中插入一个推测验证步骤,并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹,其中扩散并行提出令牌,而自回归模式充当局部序列级评判器。在三个主流块扩散家族中,S2D2在准确性-速度权衡上持续优于强置信度阈值基线。在SDAR上,我们观察到相比自回归解码高达4.7倍加速,相比调优的动态解码基线高达1.57倍加速,同时准确性提升高达4.5个点。在LLaDA2.1-Mini上,S2D2与内置自校正保持互补,包括在保守设置下比静态基线快4.4倍且准确性略高。

英文摘要

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.

2605.16865 2026-06-19 cs.CL 版本更新

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Jinesis Lab, University of Toronto & Vector Institute(Jinesis实验室,多伦多大学及向量研究所) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Princeton University(普林斯顿大学) Cornell University(康奈尔大学) The University of Tokyo(东京大学) RIKEN AIP(日本理化学研究所AIP) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根最大计划智能系统研究所) EuroSafeAI

AI总结 本文提出MixSD方法,通过混合模型自身条件下的token来实现与模型生成分布对齐的知识注入,从而在保持预训练能力的同时提升事实记忆和推理能力。

详情
AI中文摘要

监督微调(SFT)被广泛用于将新知识注入语言模型,但通常会损害预训练能力,如推理和通用领域性能。我们认为这种遗忘是由于微调目标与模型的自回归分布不一致,迫使优化器模仿低概率token序列。为了解决这个问题,我们提出了MixSD,一种无需外部教师的简单方法,用于对齐分布的知识注入。与固定目标训练不同,MixSD通过混合基础模型自身两个条件下的token动态构建监督。所生成的监督序列保留了事实学习信号,同时更接近基础模型的分布。我们在两个合成语料库上评估了MixSD,研究事实回忆和算术功能学习,并结合已建立的开放领域事实问答和知识编辑基准。在多种模型规模和设置下,MixSD在记忆-保留权衡上优于SFT和在线自蒸馏基线,能够保留基础模型的100% held-out能力,同时保持接近完美的训练准确率,而标准SFT只能保留1%。我们进一步表明,MixSD在基础模型下生成的监督目标具有显著更低的NLL,并减少了有害的Fisher敏感参数方向运动。这些结果表明,将监督与模型的本征生成分布对齐是简单且有效的知识注入原则,可以缓解灾难性遗忘。

英文摘要

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2510.27568 2026-06-19 cs.AI cs.CL 版本更新

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

SIGMA: 搜索增强的按需知识集成用于智能体数学推理

Ali Asgarov, Umid Suleymanov, Aadyant Khatri

AI总结 提出SIGMA框架,通过多智能体独立推理、定向搜索和协调机制,实现上下文敏感的知识集成,在MATH500等基准上提升7.4%的绝对性能。

Comments AAAI 2026 LMReasoning

详情
AI中文摘要

解决数学推理问题不仅需要准确访问相关知识,还需要仔细的多步骤思考。然而,当前的检索增强模型通常依赖单一视角,遵循僵化的搜索策略,并且难以有效结合来自多个来源的信息。我们提出了SIGMA(搜索增强的按需知识集成用于智能体数学推理),这是一个统一框架,通过协调机制编排专门智能体独立推理、执行定向搜索并综合发现。每个智能体生成假设段落以优化其分析视角的检索,确保知识集成既上下文敏感又计算高效。在MATH500、AIME和博士级科学问答GPQA等具有挑战性的基准测试中,SIGMA持续优于开源和闭源系统,实现了7.4%的绝对性能提升。我们的结果表明,多智能体按需知识集成显著提高了推理准确性和效率,为复杂、知识密集型问题解决提供了可扩展的方法。代码将在发表后公开。

英文摘要

Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

2604.00626 2026-06-19 cs.LG cs.CL 版本更新

A Survey of On-Policy Distillation for Large Language Models

大型语言模型的在线策略蒸馏综述

Mingyang Song, Mao Zheng

发表机构 * Tencent, China(腾讯,中国)

AI总结 本文综述了大型语言模型的在线策略蒸馏方法,探讨了蒸馏过程中如何通过反馈减少累积误差,提出了基于f-散度最小化的蒸馏框架,并分析了蒸馏与强化学习之间的联系。

Comments Ongoing Work

详情
AI中文摘要

随着大型语言模型(LLMs)在能力和成本上的持续增长,将前沿能力转移到更小、可部署的学生模型已成为核心工程问题,知识蒸馏仍然是这一转移的主导技术。工业流水线中普遍采用的静态模仿教师生成文本的方法存在结构性缺陷,随着任务变得更长且需要更多推理,这种缺陷变得更加严重。因为学生是在完美教师前缀上训练的,但在推理时必须生成自己的文本,小错误往往会积累成学生很少被训练来恢复的轨迹,导致的暴露偏差已被证明与序列长度的平方成比例。在线策略蒸馏(OPD)围绕这一观察重新组织训练循环,通过让教师对学生实际生成的内容提供反馈,以减少累积项趋于线性,并将蒸馏重新定义为迭代修正过程,而不是单次模仿。由此产生的文献在分歧设计、奖励引导优化和自我对抗方面有所扩展,但贡献仍然分散在知识蒸馏、RLHF和模仿学习社区中,缺乏统一的处理。本文提供了这样的处理。我们正式将OPD定义为学生采样轨迹上的f-散度最小化,将该领域沿三个设计轴(优化什么、信号来源在哪里、以及如何在实践中稳定训练)组织起来,并整合成功条件、反复失败模式以及OPD与KL约束强化学习之间的联系。最后,我们提出了由此综合而产生的开放性问题,包括蒸馏扩展定律、不确定反馈、代理蒸馏以及知识蒸馏与强化学习之间的日益增长的重叠。

英文摘要

As Large Language Models continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become an important engineering problem, and knowledge distillation remains a common technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but generates its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as f-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained reinforcement learning. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agent-level distillation, and the growing overlap between knowledge distillation and RL.

2. 机器翻译与跨语言处理 4 篇

2606.19346 2026-06-19 cs.CL cs.AI 新提交

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

发表机构 * Haverford College(哈弗福德学院) Brown University(布朗大学)

AI总结 通过微调大语言模型并在闪语族与非闪语族语言上评估零样本阅读理解,发现跨语言迁移主要提升任务格式对齐而非语言特定知识。

详情
AI中文摘要

我们通过微调七个大语言模型(4B--671B参数)在阿拉伯语上,并在闪语族语言和非闪语族对照语言上评估零样本阅读理解,研究跨语言迁移。在密集架构和混合专家架构中,我们没有发现闪语族特定迁移的证据:基线较弱的模型在所有语言上都有显著提升,而基线较强的模型无论语言族系如何,只有边际提升。思维链消融实验强化了这一发现——从微调中获益最多的模型同样从推理时推理中获益,这表明两种机制都解决了任务格式对齐问题,而非跨语言知识迁移。

英文摘要

We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding -- the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

2606.19640 2026-06-19 cs.CL cs.AI cs.HC 新提交

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

创建多语言心理健康对话数据集:基于国籍和语言的人物角色本地化方法的局限性

Yunkai Xu, Saeed Abdullah

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 研究通过修改人物角色中的国籍和语言参数生成中文、孟加拉语和印地语临床对话,发现仅添加这些参数会导致跨语言临床不一致,且LLM评估非英语文本的抑郁严重度时存在不准确性。

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

详情
AI中文摘要

人工智能和大语言模型(LLMs)已成为应对全球心理健康挑战的有前景的工具。尽管这些挑战具有全球性,但用于训练和评估此类系统的高质量数据集仍然严重短缺。为弥补这一差距,研究人员越来越多地生成合成临床人物角色来模拟用户数据并测试数字心理健康支持系统。然而,大多数经过验证的人物角色依赖于以英语为中心的语境。本文研究了是否可以使用类似的人物角色方法生成多语言心理健康数据集。我们修改了人物角色中的国籍和语言参数,以生成普通话、孟加拉语和印地语的临床对话。然后,我们考察了不同LLM在评估这些生成的多语言数据集的抑郁严重程度(与英语基线相比)时的表现。我们的研究结果表明,仅在人物角色中添加国籍和语言参数可能不够,因为它可能引入跨语言的临床不一致性。LLM评判模型在评估非英语文本中的抑郁严重程度时常常表现出不准确性,且不同模型的性能存在差异。这暴露了将以英语为中心的人物角色应用于多语言语境的系统性局限性。最终,我们的工作强调了迫切需要文化响应式数据生成,以确保全球心理健康系统的公平性。

英文摘要

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

2507.00875 2026-06-19 cs.CL cs.HC cs.MA 版本更新

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

TransLaw:模拟香港判例法专业翻译的大规模数据集与多智能体基准

Xi Xuan, Chunyu Kit

发表机构 * City University of Hong Kong, Hong Kong SAR, China(香港城市大学)

AI总结 针对香港判例法英译中资源匮乏、法律术语和格式要求严格的问题,构建了首个大规模句对齐平行语料库HKCFA Judgment 97-22,并提出多智能体框架TransLaw,通过分解翻译任务、集成法律词汇库和检索增强生成,显著提升翻译质量,但仍未达到人类专家的风格自然度。

Comments Accepted at ICML 2026 - AI for Law

详情
AI中文摘要

根据《基本法》第8-9条,香港法院判决书需从英文翻译成繁体中文,但由于平行资源短缺以及对法律术语、引用格式和司法风格的严格要求,这一任务仍受到限制。我们引入了HKCFA Judgment 97-22,这是首个用于香港判例法的大规模句对齐平行语料库,包含344份专业翻译的判决书(11,099个句对;210万词元),涵盖1997年至2022年。基于这一资源,我们提出了TransLaw,一个多智能体框架,将翻译分解为词级表达、句级翻译和多维审查,集成了专门的香港法律词汇数据库、检索增强生成和迭代反馈,并包括涵盖语义对齐、术语、引用和风格的四维专家审查。通过对13个开源和商业大语言模型进行基准测试,我们证明TransLaw在所有评估模型上均显著优于单智能体基线,并在3次迭代内收敛。由10名持证法律翻译人员使用我们提出的Legal ACS指标进行的人工评估证实了法律语义准确性的提升,同时表明TransLaw在风格自然度上仍落后于人类专家。数据集和基准代码可在以下网址获取:https://xxx。

英文摘要

Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology, citation format, and judicial style. We introduce HKCFA Judgment 97-22, the first large-scale sentence-aligned parallel corpus for HK case law, comprising 344 professionally translated judgments (11,099 sentence pairs; 2.1M tokens) spanning 1997-2022. Building on this resource, we propose TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation, and iterative feedback, with four-dimensional expert review covering semantic alignment, terminology, citation, and style. Benchmarking 13 open-source and commercial LLMs, we demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models, with convergence within 3 iterations. Human evaluation by 10 certified legal translators using our proposed Legal ACS metric confirms gains in legal-semantic accuracy, while showing that TransLaw still trails human experts in stylistic naturalness. The dataset and benchmark code are available at https://github.com/xuanxixi/TransLaw.

2605.31393 2026-06-19 cs.CL cs.AI 版本更新

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata(III-LIDI国立拉普拉塔大学) CDTEC, Federal University of Pelotas(CDTEC,联邦 Pelotas 大学) CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata(科学委员会国立拉普拉塔大学) Universidade Federal de Pelotas(联邦 Pelotas 大学)

AI总结 针对手语翻译中平行语料稀缺和目标词汇长尾分布的问题,提出利用GPT-4o生成参考句子的受控释义变体进行目标端增强,并在三种手语数据集上验证了方法的有效性。

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

详情
AI中文摘要

手语翻译(SLT)仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法,其中GPT-4o生成参考句子的受控释义变体,而手语输入保持不变。采用基于Signformer姿态的Transformer,在两阶段调度下进行训练:先在增强语料库上预训练,然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估:PHOENIX14T(德国手语),具有适度的词汇多样性;GSL(希腊手语),具有高度受控、重复的录制;以及LSA-T(阿根廷手语),具有严重的长尾稀疏性。在PHOENIX14T上,增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知,这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

3. 信息抽取、检索与问答 9 篇

2606.19345 2026-06-19 cs.CL cs.AI 新提交

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

基于摘要识别PubMed中EQ-5D研究的大型语言模型集成

Zhyar Rzgar K. Rostam, Márta Péntek, János Tibor Czere, Zsombor Zrubka, László Gulácsi, Gábor Kertész

发表机构 * Doctoral School of Applied Informatics and Applied Mathematics, Obuda University(欧布达大学应用信息学与应用数学博士学院) John von Neumann Faculty of Informatics, Obuda University(欧布达大学约翰·冯·诺伊曼信息学学院) Doctoral School of Innovation Management, Obuda University(欧布达大学创新管理博士学院)

AI总结 提出多阶段框架集成Gemini和Gemma等LLM,通过少样本提示、权重集成和软堆叠元分类器,自动检测PubMed中EQ-5D研究,加权集成F1达0.74。

Comments 6 pages, 7 tables, 8 equations

详情
AI中文摘要

科学出版物的快速增长导致系统文献综述(SLR)中的人工研究筛选越来越耗费资源、效率低下且不一致。分类明确报告健康相关生活质量结果(如EQ-5D数据)的研究需要高水平的临床解释,并给人类评审者带来挑战。本研究探讨了使用Google的Gemini和Gemma大型语言模型(LLM)仅基于已发表摘要自动检测PubMed生物医学数据库中的EQ-5D。提出了一个多阶段框架,集成了少样本提示、权重集成聚合和软堆叠元分类器。在由两位专家手动标记的PubMed研究数据集上评估了九个LLM的EQ-5D报告情况。gemini-2.5-pro、gemma-3-12b和gemma-3-27b的加权集成获得了0.74的加权F1分数和0.74的准确率,超过了单独获得的结果。与单个模型相比,表现最佳模型的集成改善了精确率和召回率之间的平衡,而软堆叠方法提供了更高的可靠性和可解释性。特征分析表明,模型的概率结果在指导最终预测中很重要。研究结果表明,基于集成的LLM设置是自动化生物医学研究筛选的可靠且可扩展的方法。

英文摘要

The rapid increase in scientific publications leads to the fact that manual study screening in systematic literature reviews (SLRs) is increasingly resource consuming, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weight ensembling aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding individually attained results. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability results from the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.

2606.19351 2026-06-19 cs.CL cs.AI 新提交

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

AI总结 提出LUCID方法,结合LLM注意力分数、知识图谱语义和结构信息,利用图神经网络检测LLM在知识图谱推理中的幻觉,在九个数据集上达到最优性能。

详情
AI中文摘要

知识图谱推理从现有事实中推断新知识,广泛应用于问答、推荐和决策支持。随着大语言模型(LLM)的快速发展,基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而,LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识,模型仍可能生成错误输出,导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态,要么验证与检索上下文的一致性,但两者都忽略了知识图谱中的结构信息,导致性能次优。为了解决这一差距,我们提出了LUCID,这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说,它从注意力分数和语义相似度中提取节点和边特征,并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明,与15个基线相比,LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.

2606.19667 2026-06-19 cs.CL 新提交

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

CacheWeaver:面向高效接地RAG推理的缓存感知证据排序

Kaizhen Tan, Rong Gu, Mingyuan Li

发表机构 * Heinz College of Information Systems and Public Policy, Carnegie Mellon University(卡内基梅隆大学海因茨信息系统与公共政策学院)

AI总结 提出CacheWeaver,一种轻量级提示层方法,通过缓存感知的证据排序降低RAG推理的首令牌延迟,无需修改服务引擎或证据集。

详情
AI中文摘要

检索增强生成(RAG)改善了事实基础,但也延长了提示并增加了预填充成本。vLLM等服务引擎中的前缀缓存仅在请求共享相同令牌前缀时降低此成本。然而,在接地生成中,相邻查询可能以不同顺序检索重叠证据,因此集合重叠不会变成可重用的前缀重叠。我们提出CacheWeaver,一种用于缓存感知证据排序的轻量级提示层方法。该方法维护最近服务的证据序列的前缀树,并使用贪婪遍历将最可重用的前缀放在首位,同时保持服务引擎和检索到的证据集不变。在三种vLLM配置中,相对于检索顺序前缀缓存,该方法将中位首令牌时间(TTFT)降低了约20-33%,且在我们的QA测试中不损害答案质量。贪婪策略达到了Oracle排序中位TTFT增益的97.5%,表明大多数可重用前缀局部性可以通过检索和推理之间的简单调度层恢复。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.

2606.19700 2026-06-19 cs.CL 新提交

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

TerraMARS: 用于火星地球化改造文献的领域自适应小语言模型管道

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

发表机构 * University of Arizona(亚利桑那大学) College of Information Science, University of Arizona(亚利桑那大学信息科学学院) Biosphere 2, University of Arizona(亚利桑那大学生物圈2) Department of Ecology and Evolutionary Biology, University of Arizona(亚利桑那大学生态与进化生物学系) Department of Environmental Sciences, University of Arizona(亚利桑那大学环境科学系)

AI总结 提出TerraMARS管道,结合领域自适应小语言模型,从火星科学文献中提取结构化信息,支持地球化改造研究。

Comments 16 pages, 1 figure, 4 tables

详情
AI中文摘要

研究人员有兴趣了解火星,以便最终使其适合人类居住。为此,需要通过科学文献全面了解行星的大气、水文、表面化学、辐射环境和空间特征。这些文献包含有价值的信息和有意义的定量约束,可用于其他模型和研究,如宜居性评估和未来的地球化改造研究。我们提出了TerraMARS,一个端到端的信息提取管道,它结合了领域自适应的小语言模型来回答火星地球化改造相关问题,并将非结构化的火星科学文本转换为机器可读的结构化输出(JSON格式)。收集了一个开放获取论文语料库,并使用多阶段检索和分块框架进行处理。使用量化低秩自适应(QLoRA)对火星特定问答和信息提取数据集进行微调,使Google Gemma 3 1B适应领域。生成的管道产生两种类型的输出,并为将科学文献中的知识整合到下游应用(如数字孪生和火星宜居性建模)提供了基础。该管道的输出看起来很有前景,但需要进一步改进以提高提取准确性和事实一致性。

英文摘要

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

2606.19710 2026-06-19 cs.CL cs.AI 新提交

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

FineREX: 面向人口走私知识图谱的微调NER-RE

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

发表机构 * Thomas Jefferson High School for Science and Technology(托马斯·杰斐逊科技高中)

AI总结 提出FineREX,一个基于微调LLM的流水线,用于从法律文档中提取实体和关系构建知识图谱,在F1分数上分别提升15.50%和31.46%,并减少50%处理时间。

Comments Code available at https://github.com/ElijahFeldman7/FineREX

详情
AI中文摘要

法庭记录包含关于人口走私网络的有价值证据,但这些信息通常埋藏在非结构化的、充满术语的法律文件中。虽然大型语言模型(LLM)可以通过自动信息提取支持知识图谱构建,但现有方法依赖通用模型,未针对该领域所需的实体和关系定义进行定制。我们提出FineREX,一个精简的知识图谱构建流水线,基于微调的LLM进行命名实体识别和关系提取(NER-RE)。使用包含512个文本块的手动标注数据集,FineREX在实体和关系F1分数上分别比更大的通用基线模型绝对提高了15.50%和31.46%。这些提升转化为更高质量的知识图谱,将法律噪声减少近一半,并将长文档上的节点重复率从17.78%降至11.17%。通过消除文档重写和冗余提取阶段,FineREX还将端到端处理时间减少了50.0%。我们的结果表明,领域特定的微调可以显著优于更大的通用模型,同时提高非法网络分析知识图谱构建的质量和效率。

英文摘要

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of $512$ text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.

2606.19852 2026-06-19 cs.CL cs.LG 新提交

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

提示、规划、提取:用于从临床叙述中提取肺部病理学的零样本智能体LLM工作流

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

AI总结 提出零样本智能体工作流,利用开源大语言模型从肺切除病理报告中提取13个CAP字段,在无训练下达到0.893 Micro-F1,接近监督方法。

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

详情
AI中文摘要

从病理报告中提取信息对于癌症分期和肿瘤登记人群至关重要。然而关键数据仍嵌入在叙述性报告中,使得手动提取劳动密集且易出错。传统的监督自然语言处理流程通过完全监督的命名实体识别和关系提取来解决这一问题,但需要昂贵的人工标注,并且当上游实体缺失时会出现级联故障。在本研究中,我们开发了一个零样本智能体工作流,并评估了五个开源生成式大语言模型(LLMs),以从肺切除病理报告中填充13个美国病理学家学会的概要字段。我们使用一种新颖的、与注册对齐的评估框架,将它们与最先进的监督GatorTron NER-RE基线进行比较。基线达到了0.960的Micro-F1,而最佳零样本模型(GPT-OSS-20B)达到了0.893的Micro-F1(召回率:0.949),在没有任务特定训练的情况下准确提取了复杂关系(如病理分期)。这些结果表明,开源零样本智能体LLMs是提取肺部病理信息的低成本解决方案。

英文摘要

Information extraction from pathology reports is essential for cancer staging, tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved Micro-F1of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.

2606.20072 2026-06-19 cs.CL 新提交

Source-Grounded Data Generation for Text-to-JSON Learning

基于源数据的文本到JSON学习数据生成

Sunghee Ahn, Guijin Son, Youngjae Yu

发表机构 * Seoul National University(首尔大学)

AI总结 提出STAGE方法,利用电子表格作为源数据,通过LLM生成报告和JSON模式,并验证真实值,显著提升文本到JSON任务的训练数据质量。

Comments Preprint

详情
AI中文摘要

从财务文件到临床记录,传统行业严重依赖冗长、非结构化的文档来存储高价值信息。将这些信息可靠地提取为结构化的、机器可读的表示形式,是使自动化系统能够访问这些内容的关键前提。JSON是这种结构化提取的自然目标,然而构建可靠且可扩展的文本到JSON训练数据仍然具有挑战性。为了解决这一差距,我们提出了STAGE(电子表格基础的文本到JSON工件生成),一种基于源数据的数据生成管道,通过使用LLM进行可扩展合成,同时根据底层电子表格验证真实值,来构建报告和JSON模式。在STAGE-Eval(我们的基于源数据的基准测试,包含851个示例的测试集)上的评估表明,STAGE生成的训练数据优于现有方法。这使Qwen3-4B的精确匹配从31.37%提高到74.27%,值准确率从45.46%提高到90.69%。

英文摘要

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

2606.20113 2026-06-19 cs.CL cs.IR 新提交

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

流式工具使用何时有帮助?表征流式检索增强生成中的工具意图稳定化

Elroy Galbraith

发表机构 * SMG Labs(SMG实验室)

AI总结 通过测量工具意图稳定化(即推测查询收敛到答案的时间点),在CRAG基准上分析流式RAG的延迟隐藏效果,发现73.9%的查询可实现显著延迟隐藏,并识别早期与晚期稳定化的预测因素。

详情
AI中文摘要

流式检索增强生成(Streaming RAG)通过在用户输入完成前并行发出工具查询来减少用户感知的延迟。报告的性能提升是聚合性的,但该机制的好处本质上是查询内在的:只有当正确的工具查询在用户停止说话或打字之前变得可确定时,推测才有帮助。我们隔离并测量了这一属性——工具意图稳定化,即输入流中推测查询的检索收敛到包含答案的结果的时间点。在CRAG基准(1371个验证问题)上,我们(i)测量了稳定化的分布,(ii)推导出一个与模型无关的界限H,表示可以隐藏在用户剩余输入背后的工具延迟比例,该比例是工具延迟L和输入节奏δ的函数,(iii)通过一个工作流式管道验证了实际节省达到或超过此界限,(iv)识别了哪些查询属性预测早期与晚期稳定化。该研究无需模型训练,在普通CPU硬件上运行。我们发现,在现实操作点(L=600ms,δ=3w/s,θ=0.8)下,整个基准中73.9%的查询实现了显著的延迟隐藏——这一混合数字结合了21.3%的问题(其中黄金证据以原文形式存在且可被BM25检索)上的充分稳定化(在此有利切片上95.2%可流式处理)以及其余问题上的无基础top-1稳定化回退。在有利切片上,φ_suf被精确和宽松基础限定在[0.26, 0.281]之间——两者均为早期。问题类型产生了显著但粗略的早期/晚期划分(Kruskal-Wallis p=0.017, epsilon^2=0.04),直接指导了何时学习到的推测触发器值得其成本。

英文摘要

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property -- tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound H on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency L and input cadence δ, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point (L=600ms, δ=3w/s, θ=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding -- a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, ϕ_suf is bracketed to [0.26, 0.281] by exact and relaxed grounding -- both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, epsilon^2=0.04), directly informing when a learned speculative trigger is worth its cost.

2606.19782 2026-06-19 cs.AI cs.CL 交叉投稿

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

AgentFinVQA:一种可部署的多智能体管道用于可审计的金融图表问答

Aravind Narayanan, Shaina Raza

发表机构 * Vector Institute(向量研究所)

AI总结 提出多智能体管道AgentFinVQA,通过分解查询步骤并记录可追溯的模型评估包,在金融图表问答中实现可审计性与本地部署,在FinMME上提升准确率7.68个百分点。

详情
AI中文摘要

在受监管环境中的金融图表问答不仅要求准确性:从业者必须在采取行动之前知道哪些答案值得信任,而且许多机构无法将客户数据发送给外部模型提供商。然而,现有的图表问答智能体注重准确性且不透明,并且大多数假设专有API访问;据我们所知,没有一种方法能在不显著牺牲准确性的情况下同时实现可审计性和本地部署。我们提出AgentFinVQA,一个多智能体管道,将每个查询分解为规划、OCR、图例定位、视觉检查和验证,每个样本记录在可追溯的模型评估包(MEP)中。在FinMME上,AgentFinVQA在使用专有主干(Gemini-3 Flash;71.24% vs. 63.56%,McNemar p ≈ 1.1×10^{-16})时比主骨干匹配的零样本基线提高+7.68个百分点,在使用本地服务的开放权重Qwen3.6-27B-FP8时提高+4.84个百分点。验证器的判断也作为有用的置信度信号(确认答案与修正答案的精确准确率分别为68.2%和55.6%),支持人在回路审查路由。错误分析表明,问题误解、图例混淆和提取错误占失败原因的近三分之二,并且是验证器检测最少的类别,为未来工作指明了明确方向。这些结果共同表明,可审计、本地部署的金融图表问答是可行的,并且开放权重系统保留了大部分准确率提升,同时实现了完全的数据驻留。我们发布代码以支持可重复评估。

英文摘要

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

4. 对话系统与智能体 15 篇

2606.19356 2026-06-19 cs.CL cs.AI 新提交

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

可信多智能体系统:使用Argent信令协议缓解语义漂移

Anantha Sharma

发表机构 * Synechron Inc(Synechron公司)

AI总结 提出Argent信令协议(ASP),通过结构化质量信号区分可修复与不可修复的失败,在文档问答和多智能体系统中分别提升通过率和阻断无依据传播。

Comments 17 pages

详情
AI中文摘要

当多智能体LLM系统产生错误答案时,并非所有失败都相同:有些答案基于正确材料但不完整,而另一些则完全无依据且应被阻止。当前的重新尝试策略对两种情况一视同仁(重试并希望最好),使得人类监督者无法判断重试是否合理或系统是否应停止。我们引入Argent信令协议(ASP),这是一种紧凑的机器可读头部,为每个AI生成的响应附带结构化质量信号:确定性(@C)、依据性(@G)、随机性(@S)以及一个假设索引,用于分类每个声明的证据基础。这些信号使控制器能够区分可修复失败与遏制失败,并对每种情况进行不同路由。我们在两种模式下评估ASP。在独立模式下,基于Array BioPharma/Ono许可协议的27个问题的文档问答基准,比较基线提示与ASP仪器化控制器动作在三个本地GGUF模型上的表现。在Qwen~(0.8B)上,ASP将通过率从11.1%提升至33.3%,平均术语覆盖率从36.7%提升至65.4%;在Dobby~(8B)上,ASP产生4次失败到通过的恢复,通过率从33.3%提升至44.4%;在SmolLM3~(3B)上,ASP在每次问题中交替进行修复和遏制。总体改进显著(从12/81通过到21/81通过)。在多智能体模式下,ASP侧车位于检索智能体和下游决策智能体之间;侧车100%阻止无依据的上游输出到达下游智能体(24/27被阻止,0次无依据传播)。

英文摘要

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen~(0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby~(8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3~(3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).

2606.19659 2026-06-19 cs.CL 新提交

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

SAGE-OPD:面向多轮在策略蒸馏的选择性智能体引导干预

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

发表机构 * Meta AI

AI总结 提出SAGE-OPD框架,通过环境反馈和教师判断选择性干预学生响应,结合置信度加权和损失归一化,解决多轮在策略蒸馏中的错误累积问题,在ALFWorld任务中取得13.3%的相对提升。

Comments 21 pages, 3 figures

详情
AI中文摘要

在策略蒸馏(OPD)通过训练学生模型在其自身策略生成的轨迹上来改进学生模型,使其成为缓解智能体训练中曝光偏差的一种有前景的方法。然而,大多数OPD研究集中在单轮设置,而现实中的LLM智能体需要与环境进行多轮交互。在这种机制下,早期错误会改变未来观察并沿轨迹累积,标准的密集令牌级OPD变得脆弱,因为它可能过度惩罚语义上有效的替代方案,强化局部退化(如重复动作),并在分布外历史中传播不可靠的教师监督。我们提出SAGE-OPD,一种专门为多轮OPD设计的无验证器选择性干预框架。SAGE-OPD不是在所有轮次上统一应用教师监督,而是首先观察环境反馈,并使用教师判断来决定每个学生响应是否应被跳过或干预。为了进一步解决累积错误,SAGE-OPD通过教师置信度对令牌级蒸馏进行加权,减少不确定的教师分布在受损或模糊历史上的影响。最后,SAGE-OPD应用损失归一化以保留标准OPD的整体损失规模,同时保持选择性轮次级加权。在智能体任务上的实验表明,SAGE-OPD持续优于基线,在ALFWorld未见成功率上比标准OPD实现了高达13.3%的相对提升。消融研究进一步表明,轮次级干预、教师置信度加权和损失归一化提供了互补的益处。我们的结果表明,有效的多轮OPD应保持策略内,但教师监督应选择性地分配到需要干预且可靠的轮次。

英文摘要

On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on single-turn settings, while realistic LLM agents interact with environments over multiple turns. In this regime, early errors can alter future observations and compound across the trajectory, and standard dense token-level OPD becomes brittle, as it may over-penalize semantically valid alternatives, reinforce local degeneracies such as repeated actions, and propagate unreliable teacher supervision on off-distribution histories. We propose SAGE-OPD, a verifier-free selective intervention framework specifically designed for multi-turn OPD. Instead of applying teacher supervision uniformly across all turns, SAGE-OPD first observes environment feedback and uses teacher judgment to decide whether each student response should be skipped or intervened on. To further address compounding errors, SAGE-OPD weights token-level distillation by teacher confidence, reducing the influence of uncertain teacher distributions on corrupted or ambiguous histories. Finally, SAGE-OPD applies loss normalization to preserve the overall loss scale of standard OPD while retaining selective turn-level weighting. Experiments on agent tasks show that SAGE-OPD consistently improves over baselines, achieving up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD. Ablation studies further demonstrate that turn-level intervention, teacher confidence weighting, and loss normalization provide complementary benefits. Our results suggest that effective multi-turn OPD should remain on-policy, but teacher supervision should be selectively allocated to turns where intervention is necessary and reliable.

2606.19847 2026-06-19 cs.CL 新提交

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

AtomMem: 通过原子事实构建简单有效的LLM智能体记忆系统

Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室) Anhui University(安徽大学)

AI总结 针对现有记忆系统存储粗粒度、更新不稳定的问题,提出AtomMem,通过事实执行器提取高价值原子事实作为高效记忆表示,并组织为层次化事件结构和时间档案,实现价值密集存储和稳定演化,在LoCoMo基准上取得最优性能。

Comments 19 pages, 10 figures, 5 tables

详情
AI中文摘要

大型语言模型(LLM)展示了强大的推理和生成能力,但其固定的上下文窗口限制了跨多会话交互的长期信息积累和重用。现有的记忆增强系统通常以粗粒度且不稳定的方式构建记忆,依赖于低效的记忆表示或不稳定的无约束更新。为了解决这些挑战,我们提出了AtomMem,一种专为价值密集存储和稳定记忆演化设计的长期记忆系统。AtomMem引入了一个事实执行器,从长形式交互中选择性地提取高价值原子事实,作为高效的记忆表示。随后,AtomMem将这些事实组织成层次化的事件结构和时间档案,捕获连贯的情景上下文并随时间跟踪动态演变的用户属性。在检索过程中,系统激活一个关联记忆图来连接碎片化的记忆。在LoCoMo基准上的实验证实,AtomMem在各种推理任务中实现了最先进的性能,为部署智能个性化智能体提供了一种可扩展且经济可行的解决方案。

英文摘要

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

2606.20487 2026-06-19 cs.CL 新提交

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

超越全局重规划:跨设备智能体系统的分层恢复

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Tsinghua University(清华大学)

AI总结 提出分层重规划框架H-RePlan,通过统一API-CLI-GUI执行和跨层失败抽象,区分设备本地策略恢复与全局重规划,在HeraBench基准上显著提升跨设备任务完成率和指令遵循度。

详情
AI中文摘要

现实世界中的计算机使用任务通常跨越多个应用程序和设备,要求智能体在动态运行时故障下协调异构环境。现有的多设备智能体系统支持任务分解和跨设备分配,但恢复仍然粗粒度:当执行失败时,它们通常重试相同策略、重新分配子任务或修改全局计划,而没有系统地建模设备本地策略空间。这限制了它们区分可在当前设备内修复的故障与需要跨设备重规划的故障的能力。我们提出\textbf{H-RePlan},一个用于具有统一API-CLI-GUI执行的多设备智能体的分层重规划框架。H-RePlan为每个设备配备可互换的执行策略,并通过紧凑的跨层失败抽象将设备本地策略恢复与编排器级全局重规划分离。为了评估这一能力,我们引入\textbf{HeraBench},一个故障注入基准,它在Linux和Android设备上构建跨设备工作流,并注入策略级和设备级故障。实验表明,H-RePlan显著优于单策略和粗粒度多设备基线,实现了更高的完成率、指令遵循率和完美通过率,同时降低了可靠端到端成功所需的令牌成本。这些结果表明,范围感知的分层恢复对于鲁棒的多设备智能体执行至关重要。

英文摘要

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose \textbf{H-RePlan}, a hierarchical replanning framework for multi-device agents with unified API--CLI--GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce \textbf{HeraBench}, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

2606.19388 2026-06-19 cs.SE cs.CL cs.HC 交叉投稿

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

超越GUI范式:移动代理是否需要手机屏幕?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

AI总结 本文挑战移动代理的GUI主导范式,提出CLI应同等重要,通过实验证明CLI代理在AndroidWorld和MobileWorld上超越GUI基线,并引入CLI-Advantage任务套件展示其优势。

详情
AI中文摘要

近期移动代理的进展主要由GUI范式主导,其中代理感知UI信息并发出屏幕交互。然而,移动平台也提供了命令行接口(CLI),可直接访问设备服务和数据。我们认为CLI应与GUI同等重要。我们在AndroidWorld和MobileWorld上,使用四种模型API评估了三个编码代理(Claude Code、Terminus-2、mini-swe-agent),未进行任何移动特定后训练,并与三个可复现的GUI基线(GUI-Owl-1.5-32B、MAI-UI、Qwen3-VL-32B)进行比较。Claude Code(Opus 4.7)达到71.8%和51.9%,优于所有可复现的GUI基线(AndroidWorld上69.3/68.1/57.8%;MobileWorld上43.2/26.3/13.3%),而其他CLI配置也保持竞争力。为确立该范式的上限,我们提供了oracle CLI解决方案,在AndroidWorld上达到88.8%(103/116个任务可CLI解决),在MobileWorld上达到86.3%(101/117个任务可CLI解决),表明未来有大量改进空间。为覆盖GUI范围之外的日常用户意图,我们引入了\ extbf{CLI-Advantage任务套件},包含五个类别的45个模板:批量操作、多条件过滤、聚合、跨应用工作流和隐藏设备状态。所有CLI代理在所有五个类别中均优于所有GUI基线,且每个任务步骤显著更少(10.7步 vs. 18.6步)。为支持未来移动CLI代理的研究,我们将开源代理实现、oracle解决方案、CLI-Advantage套件和评估基础设施。

英文摘要

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the \textbf{CLI-Advantage Task Suite}, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 交叉投稿

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 * University of Edinburgh(爱丁堡大学) University of Glasgow(格拉斯哥大学) University of Cambridge(剑桥大学)

AI总结 针对DeFi监管中LLM智能体易误报的问题,提出DeXposure-Claw系统,通过图时间序列基础模型预测风险网络,结合确定性监控和置信度门控生成可审计监管票据,并构建六轴评估基准DeXposure-Bench,实验验证有效性。

详情
AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景:它们过度解读弱证据并推荐高风险干预,而现有评估无法提供符合监管者需求的误报衡量方式。我们提出DeXposure-Claw,一个基于预测的智能体监管系统,通过结构化证据引导LLM决策:(1) DeXposure-FM,一个图时间序列基础模型,预测未来风险网络;(2) 确定性监控和压力场景将预测转化为类型化警报、归因信号和场景证据;(3) 数据健康和置信度门控在DeXposure-Claw发出带有理由的可审计监管票据前限制升级。我们进一步开发了DeXposure-Bench,一个六轴评估框架,其决策轴根据符合监管者的绝对损失真实情况和显式误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://this URL。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19559 2026-06-19 cs.AI cs.CL 交叉投稿

Uncertainty Decomposition for Clarification Seeking in LLM Agents

LLM代理中寻求澄清的不确定性分解

Gregory Matsnev

发表机构 * AI Talent Hub, ITMO University(AI Talent Hub, ITMO大学)

AI总结 提出一种基于提示的不确定性分解方法,将行动置信度与请求不确定性分离,使代理能在任务规范模糊时主动寻求澄清,在五个LLM骨干上平均澄清F1提升36%-73%。

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

详情
AI中文摘要

最近的立场论文认为,经典的偶然/认知不确定性框架对于交互式大型语言模型(LLM)代理是不够的,并呼吁需要一种对欠规范感知、可分解且可通信的不确定性表示,以解锁新的代理能力,如主动寻求澄清和共享心理模型构建。实际部署约束——黑盒API、交互延迟预算以及缺乏标注轨迹——排除了基于logprob、多采样和基于训练的方法,使得基于提示的估计成为在部署时浮现此类信号的最可行方案。我们通过一种简单的基于提示的分解来响应这一呼吁,该分解将行动置信度与请求不确定性(u)分离,使代理能在任务规范模糊时请求澄清。为了评估它,我们引入了两个增强澄清的基准(WebShop-Clarification和ALFWorld-Clarification),其中50%的任务被故意欠规范,并在这些变体以及用于故障检测的标准WebShop、ALFWorld和REAL基准上,系统地将所提出的分解与ReAct+UE和不确定性感知记忆(UAM)在五个LLM骨干(GPT-5.1、DeepSeek-v3.2-exp、GLM-4.7、Qwen3.5-35B、GPT-OSS-120B)上进行比较。在五个骨干上平均,所提出的分解在ALFWorld-Clarification上比ReAct+UE提高了73%的澄清F1,比UAM提高了36%,并且在WebShop-Clarification的每个骨干以及ALFWorld-Clarification的五个骨干中的四个上领先澄清F1,表明增益超越了单个LLM。

英文摘要

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19911 2026-06-19 cs.AI cs.CL cs.IR 交叉投稿

Multi-Agent Transactive Memory

多智能体交互记忆

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MATM框架,通过共享存储和检索智能体轨迹,实现异构智能体群体间的知识复用,提升下游任务性能并减少交互步骤。

详情
AI中文摘要

具有不同能力的LLM智能体在多样化任务中的去中心化部署,激发了跨异构智能体群体知识共享的基础设施需求。正如搜索引擎索引人类生成的工件以支持人类问题解决,检索系统可以组织智能体生成的工件以供跨智能体群体重用。我们将检索增强生成(展示了人类创作工件对单个智能体的价值)扩展到检索智能体生成的工件以支持智能体群体。特别是,智能体轨迹编码了可重用的程序性知识,然而这些工件通常在一次使用后被丢弃或仅由产生智能体保留,迫使新实例化的智能体反复重新发现现有解决方案。我们提出了多智能体交互记忆(MATM),一个用于群体级存储和检索智能体生成轨迹的框架,其中生产者智能体将轨迹贡献到共享仓库,消费者智能体检索它们以改进任务执行。我们专注于交互环境(ALFWorld和WebArena),其中轨迹较长且编码了特别丰富的程序性结构。我们的实验表明,从MATM检索轨迹可提高下游任务性能并减少交互步骤,无需协调或联合训练。这些结果将MATM定位为开放智能体生态系统中群体级经验共享的设计模式。

英文摘要

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

2606.20002 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Connect the Dots:通过强化学习训练具备跨域泛化能力的长期生命周期智能体

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出Connect the Dots框架,通过端到端强化学习训练LLM在长期任务中自我更新上下文并泛化到新领域,实验验证了跨域泛化能力。

Comments Work in progress; we will continuously update the codebase and arXiv version

详情
AI中文摘要

本文提出了一个通用框架,用于训练大型语言模型(LLMs)具备“Connect the Dots”(CoD)这一元能力,该能力是长期生命周期智能体所必需的:当基于LLM的AI智能体部署在环境中时,它解决一系列长期任务,同时持续探索环境、从自身经验中学习,并迭代地自我更新关于环境的上下文,从而在更新上下文的条件下,在未来任务上实现逐步更好的性能。CoD框架的主要组成部分包括:(1)用于端到端强化学习(RL)的算法设计和基础设施,其中包含交替执行任务和更新上下文的长展开序列;(2)用于在训练过程中激励和激发LLM中目标元能力的任务和环境,以及在评估过程中忠实衡量进展的任务和环境。我们展示了CoD框架的概念验证实现,包括具有细粒度信用分配的GRPO风格RL算法,以及针对目标元能力(而非特定领域的LLM能力或标准的逐任务RL)量身定制的任务和环境。实证结果验证了CoD设置中端到端RL训练的有效性,并展示了所激发元能力的分布外泛化潜力——在训练领域内、跨不同领域以及从CoD到Ralph-loop设置中。我们对CoD的研究连接了多项先前工作,并为推进LLM和AI智能体开辟了新的机遇。为促进进一步研究和应用,我们在\url{this https URL}上发布了我们的实现。

英文摘要

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at \url{https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod}.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 交叉投稿

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示:基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University(莱顿大学) FutureWhiz

AI总结 提出一种基于14个教学特征的主题感知提示路由模型,通过模拟训练和在线A/B测试,在高中辅导中实现自适应策略切换,提高教学效率并减少交互轮次。

详情
AI中文摘要

LLMs可以个性化教育,尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统,该系统基于从原始转录中提取的14个教学特征(例如,辅导支架、学生理解)。我们首先在模拟环境中训练一个提示路由模型,然后将其部署到实际高中学生的在线适应中。模拟基准测试显示,路由器的性能优于两个静态基线($0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$)。A/B测试($N=656$ 次对话,来自359名学生)显示了从模拟到现实的迁移,其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率,保持了教学质量,并减少了约3轮交互($p=0.007$)。虽然贪婪路由器的练习转化率与基线相当($19.1\%$ vs. $19.6\%$),但随机采样策略的随机路由器实现了更高的转化率($28.1\%$)。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

2606.20529 2026-06-19 cs.AI cs.CL 交叉投稿

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent: 策略遵从工具调用代理的结构化状态

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

发表机构 * Arizona State University(亚利桑那州立大学) University of Arizona(亚利桑那大学)

AI总结 针对客服领域策略遵从工具调用代理,提出LedgerAgent方法,通过独立账本维护任务状态并渲染到提示中,在执行工具调用前检查状态依赖策略约束,提升多轮一致性。

Comments Work in Progress

详情
AI中文摘要

在客服领域,策略遵从工具调用代理必须在跨轮次调用工具时维护任务状态,并遵守领域策略。任务状态包括通过用户交互和工具调用观察到的事实、标识符、约束和条件。在标准代理中,任务状态没有单独表示。观察结果、工具返回和策略指令被放入提示中,使得代理每次决定下一步时都需要从提示中重建相关状态。这种设计使状态管理变得隐式,导致两种常见失败模式:代理可能检索到正确的事实,但后来基于过时、缺失或不正确的信息做出决策;语法上有效的工具调用可能仍然违反依赖于当前任务状态的领域策略。我们引入了\textsc{LedgerAgent},一种用于工具调用代理的推理时方法,它在单独的账本中维护观察到的任务状态,并将状态渲染到提示中。在执行改变环境的工具调用之前,账本还用于检查状态依赖的策略约束,阻止策略违规。在四个客服领域以及开源和闭源模型的混合面板上,\textsc{LedgerAgent}在标准基于提示的工具调用方法上提高了平均pass^k,在更严格的多轮一致性指标下提升最大。

英文摘要

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce \textsc{LedgerAgent}, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, \textsc{LedgerAgent} improves average pass\textasciicircum{}k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.

2603.22922 2026-06-19 cs.CL 版本更新

Quality Over Clicks: Iterative Reinforcement Learning for Early-Stage E-Commerce Query Suggestion

质量优于点击:面向早期电商查询建议的迭代强化学习

Qi Sun, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng

发表机构 * Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)

AI总结 针对早期部署场景点击反馈稀疏的问题,提出质量优先的迭代强化学习框架QualEQS,从可回答性、事实性和信息增益三个维度优化查询建议质量,通过候选建议的组级分歧识别模糊上下文并挖掘难例进行迭代改进,在真实电商系统中ChatPV提升6.81%。

详情
AI中文摘要

现有的对话系统依赖查询建议来增强用户参与度。最近的方法主要使用点击率(CTR)模型优化生成模型,以与用户偏好对齐。然而,这些方法在早期部署场景中效果较差,因为点击反馈稀疏且不足以训练可靠的CTR模型。为弥补这一差距,我们提出了QualEQS,一个面向电商查询建议的质量优先迭代强化学习框架。我们将可操作的建议质量形式化为三个直接影响下游可用性的维度:可回答性、事实性和信息增益。为了在没有点击监督的情况下从在线流量中持续改进,我们进一步提出候选建议之间的组级分歧,以识别模糊的查询上下文并挖掘难训练案例进行迭代优化。我们还引入了EQS-Benchmark,一个包含16,949个真实电商查询的数据集,用于离线训练和评估。实验表明,我们基于质量的离线指标与在线性能强相关,为稀疏反馈部署提供了一种实用的评估方法。在离线和在线设置中,QualEQS均持续优于强基线,在真实企业级对话购物助手系统中,在线ChatPV提升了6.81%。

英文摘要

Existing dialogue systems rely on query suggestion to enhance user engagement. Recent approaches mainly optimize generative models using click-through rate (CTR) models to align with user preferences. However, these methods are less effective in early-stage deployment scenarios, where click feedback is sparse and insufficient for training a reliable CTR model. To bridge this gap, we propose QualEQS, a quality-first iterative reinforcement learning framework for e-commerce query suggestion. We formalize actionable suggestion quality along three dimensions that directly affect downstream usability: answerability, factuality, and information gain. To continuously improve from online traffic without click supervision, we further propose group-level disagreement among candidate suggestions to identify ambiguous query contexts and mine hard training cases for iterative refinement. We also introduce EQS-Benchmark, a dataset of 16,949 real-world e-commerce queries for offline training and evaluation. Experiments show that our quality-based offline metrics correlate strongly with online performance, providing a practical evaluation recipe for sparse-feedback deployment. In both offline and online settings, QualEQS consistently outperforms strong baselines, yielding a 6.81% improvement in online ChatPV in a real-world enterprise-level conversational shopping assistant system.

2604.23938 2026-06-19 cs.CL 版本更新

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

TSAssistant: 一种人在回路中的自动化靶点安全性评估智能体框架

Xiaochen Zheng, Zhiwen Jiang, David Tokar, Yexiang Cheng, Alvaro Serra, Melanie Guerard, Klas Hatje, Tatyana Doktorova

发表机构 * Computational Sciences Center of Excellence(计算科学卓越中心)

AI总结 提出TSAssistant多智能体框架,通过分层指令架构和交互式优化循环,将靶点安全性评估报告生成分解为专业子任务,实现高可重复性和证据溯源。

Comments Updated with quantitative and expert evaluations

详情
AI中文摘要

靶点安全性评估(TSA)需要系统整合遗传、转录组、靶点同源性、药理学和临床数据,以评估治疗靶点的潜在安全性风险。该过程劳动密集且依赖专家,在可扩展性和可重复性方面面临挑战。我们提出TSAssistant,一种人在回路中的多智能体框架,将TSA报告生成分解为专门子智能体的工作流:研究子智能体各自基于并引用单个TSA领域,合成子智能体整合跨领域发现。子智能体通过标准化工具接口从精选生物医学来源检索和综合证据,生成可单独引用、基于证据的章节,其行为由分层指令架构塑造,该架构将协调逻辑与领域专业知识和用户意图分离。为补充这些软约束,程序化执行钩子和持久记忆存储在整个工作流中强制执行硬约束,而交互式优化循环允许专家在完全保留跨迭代对话上下文的情况下审查和修订各个章节。我们不是进行单一的整体比较,而是将报告质量分解为可重复性、证据基础、任务级准确性和专家监督下的可控性,发现高可重复性和证据基础、与人类参考高度一致以及专家驱动的净正面改进。

英文摘要

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

2602.15707 2026-06-19 cs.MM cs.CL cs.LG 版本更新

Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

基于音频和IMU的主动式程序性任务对话助手

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

发表机构 * Qualcomm Technologies, Inc.(高通技术公司)

AI总结 提出首个仅使用音频和IMU模态的实时对话助手,通过微调语言模型减少不必要对话并提升问答准确性,在边缘设备上实现无云依赖。

Comments 5 figures. 5 more in appendix

详情
AI中文摘要

实时对话助手用于程序性手工任务通常依赖视频输入,这会导致计算成本高且侵犯用户隐私。我们首次提出一种实时对话助手,仅使用来自用户可穿戴设备的轻量级隐私保护模态(如音频和IMU输入)来理解上下文,为程序性手工任务提供全面指导。通过家具组装任务和烹饪任务,我们展示了该助手如何主动向执行程序性任务的用户提供逐步指令,并回答用户问题。我们阐述了实现该助手的数据生成方法和系统设计。观察到现成的语言模型健谈但并非总能正确回答问题,我们展示了微调模型如何将其减少不必要对话的能力提升50%(精确度),同时将正确回答问题的能力提升150%(召回率)。我们进一步描述了如何在边缘设备上实现该助手,无需依赖云端。

英文摘要

Real-time conversational assistants for procedural manual tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for procedural manual tasks using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. Using a furniture assembly task and a cooking task, we show how this assistant proactively communicates step-by-step instructions to a user performing a procedural task, and answers user questions. We illustrate the data generation method and the system design to achieve such an assistant. On observing that an off-the-shelf language model is a talkative assistant but is not always able to answer questions correctly, we demonstrate how finetuning the model improves its ability to limit unnecessary dialogues with a 50% increase in the precision, while also improving its ability to answer questions correctly, measured by a 150% increase in the recall of answers. We further describe how such an assistant is implemented on an edge device with no dependence on the cloud.

2605.13438 2026-06-19 cs.AI cs.CL 版本更新

CogniFold: Always-On Proactive Memory via Cognitive Folding

CogniFold: 通过认知折叠实现始终在线的主动记忆

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

AI总结 提出CogniFold,一种受大脑启发的主动记忆系统,通过将互补学习系统扩展为三层(海马体、新皮层、前额叶意图层)并利用图拓扑自组织,实现事件流的持续认知结构涌现,在认知评估和常规记忆基准上均表现优异。

Comments Code is available at https://github.com/OpenNorve/CogniFold

详情
AI中文摘要

现有的智能体记忆主要仍是被动反应式和基于检索的,缺乏自主将经验组织成持久认知结构的能力。为了迈向真正自主的智能体,我们引入了CogniFold,一种受大脑启发的“始终在线”智能体记忆,专为下一代主动助手设计。CogniFold持续将碎片化事件流折叠成自涌现的认知结构,从传入事件和积累的知识中逐步引导出更高层次的认知。我们通过将互补学习系统(CLS)理论从两层(海马体、新皮层)扩展到三层,增加了一个前额叶意图层来奠定基础。模仿前额叶皮层作为意图控制和决策制定的中心,CogniFold通过图拓扑自组织实现这一点:认知结构在事件流下主动组装,语义相似时合并,过时时衰减,通过联想回忆重新链接,并在概念簇密度超过阈值时浮现意图。我们使用CogEval-Bench评估结构形成,证明CogniFold独特地产生了符合认知期望和概念涌现的记忆结构。此外,在跨越五个认知领域的7个广泛覆盖的基准测试中,我们验证了CogniFold在常规记忆基准上同时表现出稳健的性能。

英文摘要

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.

5. 文本生成、摘要与编辑 3 篇

2606.19591 2026-06-19 cs.CL cs.AI 新提交

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC(Aimesoft股份公司)

AI总结 提出一种新颖简单的基于黄金摘要缩短文档的分层策略,结合BART模型实现越南语多文档抽象式摘要,在VLSP 2022测试集上达到ROUGE2-F1 0.2468,并利用外部数据增强训练。

Comments originally written in 2022

详情
AI中文摘要

在本技术报告中,我们专注于解决越南语多文档抽象式摘要的挑战,该任务在2022年越南语言与语音处理国际研讨会(VLSP)上提出。我们选择遵循流行的分层方法,即先浓缩每个文档,然后进行聚合和摘要。我们提出了一种新颖而简单的策略来缩短文档,该策略由黄金摘要驱动,从而确保分层方法各阶段之间的高度相关性。我们的方法在VLSP的公开测试集上达到了0.2468的ROUGE2-F1分数,并且能够生成流畅简洁的摘要。此外,我们利用外部来源获取额外数据,这极大地增加了越南语多文档摘要的数据量。额外数据已向社区公开。

英文摘要

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

2606.20287 2026-06-19 cs.CL 新提交

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University(华东师范大学教育心理学系) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院)

AI总结 提出PsyScore框架,通过共享潜在能力表示整合诊断评估与教学支架,包括特质自适应神经IRT评分器、ZPD支架反馈生成器和多视角反馈评估策略,在ASAP++数据集上实现竞争性评分性能并提供更符合教学法的反馈。

详情
AI中文摘要

有效的自动作文评分(AES)应支持可靠评估和可操作的教学反馈。然而,现有方法通常将评分和反馈视为独立组件:神经评分模型可解释性有限,而基于大语言模型(LLM)的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题,本工作提出PsyScore,一个心理测量感知的框架,通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块:特质自适应神经IRT评分器,将分级部分信用模型(GPCM)融入神经架构,能够在保持心理测量可解释性的同时精确估计学生能力;ZPD支架反馈生成器,根据诊断出的能力参数调节多智能体反馈策略,以适应不同熟练水平的教学重点;以及多视角反馈评估策略,通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明,PsyScore在提供更具教学一致性的反馈的同时,实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

2504.02885 2026-06-19 cs.CL 版本更新

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Med-R2:面向医学报告生成的感知与反思驱动复杂推理

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

发表机构 * The School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) The School of Computing, Macquarie University(麦考瑞大学计算机学院) Doubao Medical Group, ByteDance(字节跳动 doubao 医疗集团)

AI总结 提出Med-R2微调策略,通过引入感知驱动的长推理过程和放射学知识指导,并加入反思机制修正感知错误,提升LVLMs在医学报告生成中的病理特征感知和诊断准确性。

Comments 28 pages, 3 figures, 1 table

详情
AI中文摘要

自动化医学报告生成(MRG)越来越多地被用于减轻人工报告负担和辅助决策。大型视觉语言模型(LVLMs)因其细粒度的图像-文本对齐和先进的文本生成能力,在自动化MRG中展现出巨大潜力。目前,最先进的MRG主要专注于通过直接监督微调(SFT)来适应预训练的LVLMs,这是一种使用医学图像-报告对的微调策略。然而,有几个因素限制了这些LVLMs的性能。首先,直接SFT使LVLMs能够直接生成医学报告,而无需经过病理特征感知和诊断推理的中间思考过程。这导致可能无法感知病理特征,从而引起误诊。其次,直接SFT缺乏放射学特定知识的指导,导致LVLMs误解感知到的病理特征并做出错误诊断。为了解决这些问题,我们提出了一种名为Med-R2的新型微调策略。我们引入了一个感知驱动的长推理过程,该过程在报告生成之前进行,并融入放射学特定知识作为指导。此外,为了减轻复杂推理中潜在的感知错误,引入了一种反思机制来细化病理特征的感知和生成的报告。我们的实验表明,Med-R2通过微调LVLMs有效增强了MRG的病理特征感知能力和诊断准确性。

英文摘要

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

6. 语义、语法与语言学分析 2 篇

2606.19638 2026-06-19 cs.CL 新提交

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

MiqraBERT:基于回归的Sentence-BERT微调用于圣经希伯来语平行检测

David M. Smiley

AI总结 提出MiqraBERT模型,通过余弦相似度回归微调Sentence-BERT,在圣经希伯来语中检测文本平行,将分布分离度提升2.7倍,重叠区域从24%降至6%。

详情
AI中文摘要

文本复用遍及希伯来圣经,但用于检测的计算方法仍主要依赖词汇重叠,一旦平行涉及释义、词汇替换或句法重组,这些方法就会失效。本文介绍MiqraBERT,一个从AlephBERT(现代希伯来语编码器)微调而来的Sentence-BERT模型,用于圣经希伯来语的诗句级语义相似度。训练集包含1,650个标注的诗句和半诗句对:825个来自编年史同源材料和诗歌平行基础研究的真实平行,与825个随机采样的负例平衡。通过余弦相似度回归,模型学习到一个嵌入空间,其中平行诗句聚集在一起,无关诗句彼此远离。我们使用基于分布的指标、Wasserstein距离和重叠系数,在十个随机种子上评估分离度。MiqraBERT将分布分离度比预训练基线提高了2.7倍,并将模糊重叠区域从约24%减少到约6%。叙事同源平行的召回@10达到87.1%;诗歌平行仍然困难,低于9%。这种依赖于体裁的不对称性将模型的可靠范围限制在叙事文本复用。MiqraBERT在此https URL公开可用。

英文摘要

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT

2606.19626 2026-06-19 cs.AI cs.CL 交叉投稿

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Toten:基于知识本体的巴西葡萄牙语物理量和技术符号分词

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

发表机构 * Aia Context Universidade Federal do Maranhão(马拉尼昂联邦大学) Universidade de São Paulo(圣保罗大学)

AI总结 提出TOTEN框架,利用工程实体本体对物理量和技术符号进行声明式分类,替代统计分词,在巴西葡萄牙语语料上实现高原子性分词和数值重建。

详情
AI中文摘要

字节对编码分词在词汇压缩方面统计高效,但对结构化技术实体语义盲目,将物理量、数字、单位和符号表达式分割成词汇上任意子词。我们提出TOTEN,一个基于知识本体的分词框架,用基于工程实体形式本体(OEE)的声明式分类取代统计推导。我们将TOTEN形式化为三元组<O, classify, {inst_tau}>:本体收集类型、结构原理、组成关系和可保存不变量;分类函数将原始文本映射到类型化区域;实例化器族产生自描述的结构化表示。鲁棒性源于与三个外部预言机的确定性耦合:Pint(量纲)、Unicode字符数据库(排版)和RSLP(葡萄牙语形态)。内在评估涵盖四个可通过构造验证的属性——本体原子性、量纲等价性、排版鲁棒性和数值重建——在一个内部、物理验证的基准(EngQuant,N=800)和四个巴西葡萄牙语外部语料库(N=1771个合格案例)上进行。我们还报告检测召回率,区分覆盖率和条件原子性。与八个最先进基线相比,TOTEN在所有对比中实现单位本体原子性,在外部语料库上数值重建为0.775-0.904,而最佳基线(Quantulum3)为0.627-0.703;在EngQuant上为0.780 vs. 0.340。差异具有统计显著性(McNemar检验,Holm校正)。内部和外部排名之间的Spearman相关性证实了控制基准的同时效度。量纲等价性显示与Pint(系统继承量纲权威的预言机)统计对等。

英文摘要

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

7. 多模态语言处理 8 篇

2606.19552 2026-06-19 cs.CL 新提交

LaViSA: A Language and Vision Structural Ambiguity Benchmark

LaViSA:语言与视觉结构歧义基准

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Guardian Robot Project RIKEN(RIKEN守护机器人项目) The University of Osaka(大阪大学)

AI总结 提出LaViSA基准,通过七类歧义句及对应图像评估视觉语言模型利用视觉场景解决结构歧义的能力,实验显示现有模型虽能部分利用视觉信息,但在特定歧义类型和细微语义区分上仍有局限。

详情
AI中文摘要

结构歧义是指单个句子由于其句法结构而产生多种有效解释,这给语言理解带来了基本挑战。视觉场景可作为解决此类歧义的有用线索,视觉语言模型(VLM)需要能够从视觉场景中推导出可能的语义解释。我们引入了语言与视觉结构歧义(LaViSA)基准,旨在评估VLM利用视觉场景解决结构歧义的能力。LaViSA包含歧义句子、其消歧句子以及这些消歧句子对应的图像,涵盖七类歧义。利用LaViSA,我们对多种VLM进行了全面评估,包括专有模型和开源模型,参数规模和推理能力各异。实验结果表明,尽管最近的VLM能在一定程度上利用视觉场景解决结构歧义,但它们仍然在特定歧义类型和视觉上微妙的语义区分上存在困难,表明在利用视觉场景解决结构歧义方面仍存在局限性。

英文摘要

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to a some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.20164 2026-06-19 cs.CL cs.AI cs.LG q-bio.QM 新提交

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM:用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer Communication Technology Sirindhorn International Institute of Technology, Thammasat University Pathum Thani, Thailand 1

AI总结 提出MedRLM递归多模态健康智能框架,通过递归检查、分解、检索、验证和合成患者信息,协调多个专业代理并引入临床证据图记忆,实现长上下文临床推理和传感器引导筛查。

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

详情
AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理,而不是回答孤立的医学问题。然而,当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索,当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时,这可能变得脆弱。本文提出MedRLM,一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中,而是将患者病例视为一个外部临床环境,可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆,将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理,而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集(涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果)的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

2606.19534 2026-06-19 cs.CV cs.AI cs.CL 交叉投稿

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

PerceptionDLM:基于多模态扩散语言模型的并行区域感知

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

发表机构 * Peking University(北京大学) MSALab ByteDance(字节跳动)

AI总结 提出PerceptionDLM,利用扩散语言模型的并行解码特性,通过高效提示和结构化注意力掩码实现多区域并行感知,显著提升推理效率,并构建ParaDLC-Bench基准进行评估。

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,现有大多数MLLMs依赖自回归生成,这限制了它们在需要描述多个区域的感知任务中的效率。在这项工作中,我们提出PerceptionDLM,一种针对高效并行区域感知优化的多模态扩散语言模型。基于PerceptionDLM-Base(一个在开源扩散MLLMs中达到最先进性能的强基础基线),我们的架构充分利用了DLMs的并行解码特性。具体来说,我们引入了高效提示和结构化注意力掩码,以实现对多个掩码区域的同步感知,使模型能够在序列和token级别并行生成区域描述。与现有顺序处理区域的方法相比,这种设计显著提高了推理效率。为了系统评估DLMs视觉感知能力的并行性,我们通过将DLC-Bench扩展为每张图像包含多个区域掩码,构建了一个新的并行详细局部描述基准(ParaDLC-Bench),从而能够联合评估描述质量和推理效率。实验表明,PerceptionDLM在区域描述中保持竞争性能,同时在多区域感知任务中实现了显著的加速。我们的结果凸显了多模态扩散语言模型在高效并行视觉感知中的潜力。据我们所知,我们是首个利用扩散语言模型优势实现并行区域描述和感知的工作。代码、模型和数据集已发布。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 交叉投稿

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2603.16606 2026-06-19 cs.CL 版本更新

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR:跨语言与跨模态句子嵌入,连接大规模多语言文本与语音

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Xiang "Tony" Cao, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

发表机构 * FAIR at Meta(Meta的FAIR)

AI总结 提出OmniSONAR模型,通过渐进式训练和教师-学生蒸馏,在数千种语言上实现文本、语音、代码和数学表达式的统一语义嵌入,在跨语言检索和翻译任务上显著降低错误率,并支持零样本语音翻译。

详情
AI中文摘要

跨语言句子编码器通常只覆盖几百种语言,并且常常为了更强的对齐而牺牲下游质量,限制了它们的采用。我们引入了OmniSONAR,一个新的全语言、跨语言和跨模态句子嵌入模型家族,它原生地将文本、语音、代码和数学表达式嵌入到单一语义空间中,同时在数千种语言(从高资源到极低资源变体)的规模上提供最先进的下游性能。为了在不发生表示崩溃的情况下达到这一规模,我们使用了渐进式训练。我们首先使用LLM初始化的编码器-解码器,结合token级解码、新颖的分裂softmax对比损失和合成硬负样本,为200种语言学习一个强大的基础空间。在此基础上,我们通过两阶段教师-学生编码器蒸馏框架扩展到数千种语言变体。最后,我们通过将177种口语无缝映射到该空间,展示了该空间的跨模态可扩展性。OmniSONAR将200种语言的FLORES数据集上的跨语言相似性搜索错误减半,并在1560种语言的BIBLE基准上将错误减少了15倍。它还实现了强大的翻译性能,在多语言基准上优于NLLB-3B,并在1560种语言到英语的BIBLE翻译上比先前模型(包括更大的LLM)高出15个chrF++点。OmniSONAR在MTEB和XLCoST上也表现强劲。对于语音,OmniSONAR实现了43%更低的相似性搜索错误,并达到了SeamlessM4T语音到文本质量的97%,尽管对于翻译是零样本(仅在ASR数据上训练)。最后,通过训练一个编码器-解码器LM Spectrum,仅使用英语文本处理OmniSONAR嵌入序列,我们为复杂的下游任务解锁了向数千种语言和语音的高性能迁移。

英文摘要

Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

2305.14985 2026-06-19 cs.CV cs.CL 版本更新

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT: 通过大型语言模型迭代分解视觉与语言推理

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) HKUST(香港科技大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出IdealGPT框架,利用大型语言模型迭代分解视觉语言推理任务,通过子问题生成、子答案获取和最终答案推理的循环过程,在零样本设置下显著提升多步推理性能。

Comments 13 pages, 5 figures

详情
AI中文摘要

视觉与语言(VL)理解领域通过端到端的大型预训练VL模型(VLM)取得了前所未有的进展。然而,它们在需要多步推理的零样本推理任务中仍存在不足。为了实现这一目标,先前的工作采用了分而治之的流程。本文认为,先前的工作存在几个固有的缺点:1)它们依赖于特定领域的子问题分解模型。2)即使子问题或子答案提供的信息不足,它们也强制模型预测最终答案。我们通过IdealGPT框架解决了这些局限性,该框架利用大型语言模型(LLM)迭代分解VL推理。具体来说,IdealGPT使用一个LLM生成子问题,一个VLM提供相应的子答案,另一个LLM进行推理以得出最终答案。这三个模块迭代地执行分而治之的过程,直到模型对主问题的最终答案有信心。我们在零样本设置下对多个具有挑战性的VL推理任务评估了IdealGPT。特别是,我们的IdealGPT在VCR上比现有最好的GPT-4类模型绝对提高了10%,在SNLI-VE上提高了15%。代码可在以下网址获取:此 https URL

英文摘要

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT

2603.12252 2026-06-19 cs.CV cs.CL 版本更新

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

EndoCoT:扩散模型中的内生思维链推理扩展

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出EndoCoT框架,通过迭代思维引导模块激活MLLM的推理潜力,并利用终端思维接地模块确保推理轨迹与文本监督对齐,使DiT逐步执行复杂任务,在多个基准上平均准确率达92.1%。

Comments 23 pages, 18 figures, The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这种范式存在两个关键限制:(i)MLLM文本编码器表现出不足的推理深度。单步编码无法激活思维链过程,而这对MLLM为复杂任务提供准确指导至关重要。(ii)在解码过程中,指导保持不变。即使有正确的MLLM编码,解码过程中的不变指导也阻止了DiT逐步将复杂指令分解为可执行的去噪步骤。为此,我们提出了内生思维链(EndoCoT),一种新颖的框架,首先通过迭代思维引导模块迭代细化潜在思维状态来激活MLLM的推理潜力,然后将这些状态桥接到DiT的去噪过程。其次,应用终端思维接地模块,通过将最终状态与真实答案对齐,确保推理轨迹保持与文本监督的接地。通过这两个组件,MLLM文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在多个基准(如Maze、TSP、VSP和Sudoku)上的广泛评估实现了平均准确率92.1%,比最强基线高出8.3个百分点。代码和数据集在此https URL公开。

英文摘要

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出Vero系列开放视觉语言模型,通过构建600K样本数据集Vero-600K和任务路由奖励,在30个基准测试中平均提升2.9-5.4点,Vero-Qwen3I-8B超越Qwen3-VL-8B-Thinking 3.8点。

Comments Project page: https://vero-reasoning.github.io/

详情
AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.

8. 语音语言联合与音频文本 8 篇

2606.19910 2026-06-19 cs.CL cs.SD eess.AS 新提交

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

轻量级发音评估:基于离散语音标记的意外度

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

发表机构 * Qatar Computing Research Institute, Doha, Qatar(卡塔尔计算研究所,多哈,卡塔尔)

AI总结 提出仅使用母语语音资源训练的轻量级发音评估框架,通过离散化语音标记和语言模型计算意外度,结合文本引导对齐特征,在无监督或少量校准下达到接近监督方法的性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

训练自动发音评估通常依赖于标记的学习者错误或非母语语料库,这些语料库收集成本高昂。我们提出一个轻量级框架,仅使用母语语音资源训练,以无监督或通过少量评分话语进行轻量校准的方式运行。在推理时,学习者语音通过SSL编码器和K-means码本进行离散化。一个在母语序列上训练的标记语言模型计算意外度,其中较高的意外度表示音位偏差。我们添加了一个转录引导的Text2DUnit--DTW模块,该模块从参考文本预测母语标记序列,并将其与声学标记对齐以推导出错误敏感特征。意外度和对齐特征通过简单回归融合。在SpeechOcean762上,PCC从0.60提升到0.66(带转录引导),接近监督基线。在L2-ARCTIC上的跨数据集评估显示了一致的提升。

英文摘要

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit--DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.

2606.20179 2026-06-19 cs.CL 新提交

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

ReNikud:音频监督的希伯来语字素到音素转换

Maxim Melichov, Yakov Kolani, Morris Alper

AI总结 提出ReNikud方法,利用音频监督和伪元音化架构,通过无标注音频的ASR伪标签和字符级对齐,解决希伯来语G2P转换中的元音缺失和发音歧义问题,在多个基准上达到最优。

详情
AI中文摘要

现代希伯来语的字素到音素(G2P)转换对于文本到语音(TTS)等应用是必需的,但由于该语言的辅音音素文字系统(abjad)使元音大多不写出来,造成大量歧义,因此具有挑战性。标准方法首先预测元音变音符号(nikud)以生成国际音标(IPA)转录,但这存在局限性:元音化数据稀缺且制作费力,它不指定词汇重音等特征,并且反映的是正式语法规则而非日常口语发音。同时,直接的序列到序列IPA预测在有限数据上表现不佳,且未能利用辅音音素文字特有的字符级对齐。我们的方法ReNikud通过两个关键洞察克服了这些限制:(1)通过基于音素的自动语音识别(ASR)伪标签流水线,在数千小时无标注希伯来语音频上进行弱音频监督,生成反映自然口语规范的音位转录,无需人工标注。(2)一种伪元音化架构,在每个字符位置预测IPA音素,强制字符级对齐作为归纳偏置。在现有希伯来语G2P基准和针对口语希伯来语的新MILIM基准上的结果表明,ReNikud超越了先前的最先进方法。我们将发布代码和训练模型,以支持希伯来语TTS和语音技术的进一步研究。

英文摘要

Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

AI总结 通过声学退化、韵律错误和说话人特征扰动,发现MOS预测模型对声学退化敏感,但对韵律错误不敏感,且对基频有偏见,而对语速和基频变化不敏感。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)预测模型在文本到语音(TTS)研究中被广泛用作代理指标,但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点:声学退化、韵律错误以及说话人特定特征(如音高和语速)的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测,并分析了它们感知特征的差异。结果表明,大多数模型能很好地跟踪声学退化,而所有模型对韵律错误不敏感,尽管主观评分大幅下降。对于说话人特征,模型表现出双重分离:在人类评分中不存在的强平均基频(F0)偏见,但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19996 2026-06-19 cs.SD cs.CL 交叉投稿

Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

基于自编码器与对比学习的段级普通话语音认知障碍检测

Yongqi Shao, Hong Huo, Flavio Bertini, Danilo Montesi, Tao Fang

发表机构 * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(上海交通大学自动化与智能感知学院) Key Laboratory of System Control and Information Processing, Ministry of Education of China(教育部系统控制与信息处理重点实验室) Shanghai Key Laboratory of Perception and Control in Industrial Network Systems(上海市工业网络系统感知与控制重点实验室) Department of Computer Science and Engineering, University of Bologna(博洛尼亚大学计算机科学与工程系) Department of Mathematical, Physical and Computer Sciences, University of Parma(帕尔马大学数学、物理与计算机科学系)

AI总结 提出段级表示学习框架,结合自编码器和对比学习,在四个普通话数据集上实现稳定的二分类和三分类认知障碍检测,尤其改善了临床困难的三分类性能。

Comments 15 pages, 7 figures, 5 tables

详情
AI中文摘要

\noindent\textbf{背景与目标:} 语音已成为一种低成本、非侵入性的数字生物标志物,在认知障碍检测方面具有巨大潜力。然而,有限的标注数据和跨数据集变异性仍然是构建稳健的语音筛查系统的主要挑战。\par\noindent\textbf{方法:} 我们开发了一个用于语音认知障碍检测的段级表示学习框架。将语音录音分割成短片段并转换为语谱图表示。为了在有限数据条件下提高鲁棒性,将离线和在线增强策略与基于自编码器的表示学习和对比目标相结合,以增强判别性潜在表示。\par\noindent\textbf{结果:} 在四个独立的普通话语音数据集上进行的实验表明,在二分类和三分类任务中均取得了稳定且有竞争力的性能,尤其是在临床具有挑战性的三分类设置中取得了显著改进。消融研究进一步支持了所提框架的有效性。\par\noindent\textbf{结论:} 研究结果表明,段级语音表示学习可能为资源受限的临床环境中的认知障碍筛查提供一种可扩展且实用的方法。

英文摘要

\noindent\textbf{Background and Objective:} Speech has emerged as a low-cost and non-invasive digital biomarker with considerable potential for cognitive impairment detection. However, limited labeled data and cross-dataset variability remain major challenges for robust speech-based screening systems. \par\noindent\textbf{Methods:} We developed a segment-level representation learning framework for speech-based cognitive impairment detection. Speech recordings were divided into short segments and converted into spectrogram representations. To improve robustness under limited-data conditions, offline and online augmentation strategies were combined with autoencoder-based representation learning and contrastive objectives to enhance discriminative latent representations. \par\noindent\textbf{Results:} Experiments conducted on four independent Mandarin Chinese speech datasets demonstrated stable and competitive performance in both binary and three-class classification tasks, with particularly notable improvements in the clinically challenging three-class setting. Ablation studies further supported the effectiveness of the proposed framework. \par\noindent\textbf{Conclusions:} The findings suggest that segment-level speech representation learning may provide a scalable and practical approach for cognitive impairment screening in resource-constrained clinical settings.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA:针对重音错误的合成语音训练的以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

AI总结 提出PASQA模型,通过可控重音合成数据集和伪重音质量分数,结合自监督表示、摩拉条件融合等训练策略,有效评估音高重音正确性,优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

现有的平均意见得分(MOS)预测模型通常预测话语级别的自然度MOS,并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估(PASQA),明确针对音高重音正确性。为了训练我们的模型,我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集,并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上,并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明,传统模型无法保持按重音错误严重程度的排序,而PASQA在已见和未见说话者上都实现了高排序准确性。此外,PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取:https://this URL。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2605.17443 2026-06-19 cs.CL cs.SD eess.AS 版本更新

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩语语音问答中ASR-LLM级联中的误差传播

Donghyuk Jung, Youngwon Choi

发表机构 * Korea Culture Technology Institute, Republic of Korea(韩国文化科技研究所) Maum AI Inc., Republic of Korea(马姆人工智能公司)

AI总结 本文研究了韩语语音问答中ASR-LLM级联中误差传播的问题,通过分析下游语义失败,揭示了传统ASR指标无法完全捕捉的误差影响,发现不同性能的LLM在级联降级上的一致性,识别出单字符ASR错误作为语义失败通道,并通过辅助比较表明大音频语言模型在噪声韩语SQA中优于匹配语言模型的ASR-LLM流水线。

Comments Preprint. Submitted to APSIPA ASC 2026

详情
AI中文摘要

我们分析了自动语音识别(ASR)误差如何通过ASR-LLM级联在韩语语音问答(SQA)中传播,重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析显示,由ASR误差引起的相对下游降级在不同绝对性能的LLM中保持一致,表明级联降级主要跟踪ASR阶段的信息损失。我们进一步识别出单字符韩语ASR错误作为一种独特的语义失败通道,其中正确答案在下游预测中完全消失,尽管仅存在微小的转录差异。最后,辅助比较显示,大型音频语言模型在噪声韩语SQA中优于具有匹配语言骨干的ASR-LLM流水线,表明直接音频输入有潜力缓解转录诱导的信息损失。

英文摘要

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

2606.05846 2026-06-19 cs.CL eess.AS 版本更新

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR:将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

发表机构 * University of Tokyo(东京大学)

AI总结 通过模型合并和领域泛化方法,研究从有限语言对中学到的代码切换能力能否泛化到未见语言对,实验表明双语CS-ASR模型对未见语言对有一定泛化能力但有限。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

自动语音识别(ASR)已成为人机交互的关键技术。然而,由于跨多种语言对的代码切换(CS)语音资源严重稀缺,代码切换ASR(CS-ASR)仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而,这些方法面临固有的可扩展性限制,因为对CS的支持必须针对语言对单独开发,而语言对的数量随支持的语言数量呈组合增长。在这项工作中,我们研究通过模型合并和领域泛化方法,从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明,合并的双语CS-ASR模型对未见语言对有一定程度的泛化,表明双语CS能力在语言对之间的迁移有限。

英文摘要

Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

2604.18105 2026-06-19 eess.AS cs.CL cs.SD 版本更新

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

NIM4-ASR:迈向高效、鲁棒且可定制的实时基于LLM的语音识别

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, Jie Wu

发表机构 * Advanced Intelligent Systems Group, NIO(蔚来智能系统集团)

AI总结 提出NIM4-ASR框架,通过重新设计多阶段训练范式(包括预训练架构优化、迭代异步SFT和ASR专用强化学习)以及生产优化(噪声鲁棒性、流式推理和RAG热词定制),在2.3B参数下实现SOTA性能。

详情
AI中文摘要

将大语言模型(LLM)集成到自动语音识别(ASR)中已成为近年来的主流范式。尽管现有的基于LLM的ASR模型在公共基准上表现出色,但其训练仍然主要依赖数据驱动,未能充分解决关键的实际挑战——特别是在资源受限部署中的有限向下可扩展性以及声学挑战条件下的幻觉问题。为了解决这些问题,我们提出了NIM4-ASR,一个面向生产的、基于LLM的ASR框架,针对效率和鲁棒性进行了优化。基于编码器和LLM之间功能角色的原则性划分,我们重新设计了多阶段训练范式,使每个模块与其预期的能力边界对齐。具体来说,我们重新制定了预训练架构和目标以缓解模态差距并提高参数效率;引入了迭代异步SFT阶段以保持声学保真度并约束表示漂移;设计了ASR专用的强化学习阶段以进一步提高识别质量和鲁棒性。我们还加入了一系列面向生产的优化,包括噪声和静音条件下的鲁棒性、实时流式推理以及通过检索增强生成(RAG)进行的热词定制。实验表明,NIM4-ASR仅用2.3B参数就在多个公共基准上达到了最先进的性能,同时在内部基准上显著优于更大规模的竞争对手——特别是在实体密集的真实场景中。NIM4-ASR进一步通过RAG支持百万级热词定制,检索延迟低于毫秒,从而能够高效适应新兴实体和个性化用户需求。

英文摘要

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.

9. 评测、数据集与基准 26 篇

2606.19352 2026-06-19 cs.CL cs.AI 新提交

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

大规模手语数据集:资源、基准和标注标准的综合调查

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

发表机构 * Tacoma School of Engineering & Technology, University of Washington(华盛顿大学塔科马工程与技术学院)

AI总结 本文调查了35种手语的120个数据集,分析了模态不平衡、标注粒度和手语者偏差等挑战,并提出了24字段手语数据表以支持标准化文档和可复现评估。

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

详情
AI中文摘要

手语是聋人和听障社区使用的表达性视觉语言。尽管在手语识别、翻译和生成方面取得了显著进展,但由于数据集碎片化、标注不一致以及语言覆盖有限,进展仍然受到制约。现有的基准往往无法反映现实世界的通信需求,对这些局限性的系统分析仍然有限。在本调查中,我们提出了一个全面的手语数据集索引,涵盖了35种手语的120个资源。我们分析了关键挑战,如模态不平衡、标注粒度和手语者偏差,并概述了未来数据集设计的考虑因素。我们还引入了一个24字段的手语数据表,并发布了一个公共GitHub仓库(此 https URL ),以支持标准化文档和可复现评估。总体而言,我们的工作为在现实应用中开发包容、稳健和可扩展的手语技术提供了统一且实用的基础。

英文摘要

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

2606.19468 2026-06-19 cs.CL 新提交

Characterizing Narrative Content in Web-scale LLM Pretraining Data

网络规模LLM预训练数据中的叙事内容特征化

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

AI总结 首次细粒度研究LLM预训练语料库Dolma的叙事特征,提出涵盖三个核心叙事元素(能动性、场景、事件)的框架,构建NarraBERT模型并发布NarraDolma数据集,揭示叙事结构在异构数据中可测量且分布不均。

Comments 8 pages of main content, 28 total pages. 30 figures

详情
AI中文摘要

尽管叙事是人类交流的基本模式,但网络规模LLM预训练语料库的叙事组成仍然很大程度上未被探索。我们首次对Dolma(一个3万亿词元的开放预训练语料库)中的叙事特征进行了细粒度研究。借鉴叙事理论,我们设计了一个框架,涵盖三个核心叙事元素(能动性、场景和事件),并将其操作化为11个可解释维度。在采样并标注了400个多样化的段落之后,我们微调并验证了NarraBERT,一个基于RoBERTa的细粒度叙事预测模型。我们将NarraBERT应用于300万个段落,生成了新数据集NarraDolma。我们发现:(i) 叙事结构在极度异构的数据中是可大规模测量的;(ii) 我们揭示了网络文本背后连续的多维叙事结构;(iii) 叙事质量在预训练来源和主题之间分布不均,而当前的策展实践既未测量也未考虑这一点。我们的框架、数据集和分析为理解LLM预训练数据中叙事质量的分布以及研究数据组成如何影响叙事推理任务提供了基础。我们公开发布了NarraDolma和NarraBERT。

英文摘要

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19544 2026-06-19 cs.CL 新提交

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

无效度的可靠性:LLM-as-a-Judge 模型在一致性、稳定性和偏差上的系统性大规模评估

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

发表机构 * UC Berkeley School of Information(加州大学伯克利分校信息学院)

AI总结 本研究通过大规模系统性评估(21个裁判模型、118次运行、约54.1万次判断),发现LLM-as-a-Judge在一致性、稳定性和偏差方面存在普遍问题,包括kappa通缩、排名偏移、高重测信度与严重位置偏差并存,并提出了最小可行验证协议。

详情
AI中文摘要

LLM-as-a-Judge已成为语言模型的主导评估范式,但实际中的裁判验证依赖于精确匹配一致性,这一指标未对随机性进行校正,且系统性地高估了判别能力。我们展示了迄今为止最大规模的LLM-as-a-Judge系统性评估:来自九个提供商的21个裁判模型,在MT-Bench、JudgeBench和RewardBench上,按照三种协议(一致性、稳定性、偏差审计)进行了118次运行,约54.1万次独立判断。发现了四个结果,在整个队列中一致,包括2026年4月的前沿模型:精确匹配与Cohen's kappa之间的kappa通缩是普遍存在的(MT-Bench上33-41个百分点),裁判排名在不同基准上最多移动14个位置,高重测信度(>0.95)与两个生产部署裁判中的严重位置偏差(>0.10)并存(体现了一致性-偏差悖论),以及在单一成对评分标准下,整个队列中的冗长偏差较小(<0.011)。我们将这些结果提炼为一个最小可行验证协议。

英文摘要

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.

2606.19637 2026-06-19 cs.CL cs.AI 新提交

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

标签之前:数据集构建如何塑造临床文本中的自杀检测

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

发表机构 * University of Washington(华盛顿大学)

AI总结 通过ScAN数据集案例研究,揭示EHR自杀数据集编码特定操作化定义,受数据作者、事件边界和歧义处理影响,并展示相同标签涵盖异质性临床框架。

Comments To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

详情
AI中文摘要

临床自然语言处理越来越依赖电子健康记录(EHR)数据来检测自杀行为,将临床文档视为比社交媒体更可靠的真相。我们认为,这种框架掩盖了基于EHR的自杀数据集如何编码自杀的特定操作化定义,这种定义受到数据作者、事件边界划定方式以及歧义处理方式的影响。我们以ScAN数据集(基于MIMIC-III临床笔记构建)的案例研究为基础,论证了这一观点。我们展示了治理约束、基于ICD的队列选择、单一标注者标签以及住院级别聚合如何产生反映临床医生记录判断的标签,将自杀视为一个有边界的事件,并假设意图可以从文档中可靠推断。语言学分析表明,相同的标签涵盖了在时间性、否定性和不确定性方面不同的异质性临床框架。我们认为,临床自然语言处理在将自杀数据集的标签解释为真相之前,应审视其中嵌入的假设。

英文摘要

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

2606.19698 2026-06-19 cs.CL 新提交

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

情感分析看不到的:衡量客户是否得到帮助以及出了什么问题——基于70,000次客服对话

Jason Potteiger

AI总结 本研究使用GPT-5.4从70,450次客服对话中估计客户满意度并标记具体问题,发现满意度估计比情感分析更准确,且能揭示情感分析无法捕捉的客户状态和问题原因。

Comments 25 pages, 6 figures

详情
AI中文摘要

大多数公司通过情感分析大规模阅读客户支持数据,这种方法衡量的是客户听起来如何,而不是他们对结果是否满意。我们在一个领先的在线筹款平台的70,450次客服对话中测试了一种更丰富的替代方案:除了语气,我们使用GPT-5.4估计每位客户的满意度,并标记他们是否报告了具体问题,然后将这三个读数与客户对对话的1到5星评分进行验证。满意度估计跟踪这些评分的效果远好于情感分析,相关性分别为0.47和0.36,并且标记不满客户时的误报率低得多。结构化阅读还能看到情感分析看不到的东西:语气和满意度在44%的对话中不一致,一个单一的“中性”标签掩盖了从安静满意的客户到安静放弃的客户的一切,而最大的群体是“容忍摩擦”——那些满意但仍然报告可修复问题的客户,这是一个情感分析仪表板无法揭示的长期问题。更广泛的发现是,基于LLM的标注可以捕捉到比客户语言语气多得多的信息,为基于客户状态(他们是否满意)和直接从交互与反馈的原始文本数据中提取的问题原因的新业务指标提供了强大潜力。

英文摘要

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

2606.19727 2026-06-19 cs.CL cs.AI 新提交

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM:语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshiınst, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University(深圳技术大学) New Delhi Institute of Management(新德里管理学院) Technische Universität Dresden(德累斯顿工业大学) Ramakrishna Mission Vivekananda Educational and Research Institute(罗摩克里希纳传道会维韦卡南达教育与研究学院) Indian Institute of Technology(印度理工学院) Swami Vivekananda Institute of Technology(斯瓦米·维韦卡南达技术学院) GuangDong Engineering Technology Research Center of Edge Intelligence(广东省边缘智能工程技术研究中心)

AI总结 提出NRITYAM基准,包含9,260个跨12语言的文化问答对,评估语言模型对全球舞蹈传统的文化理解能力,涵盖多种模型类型。

Comments 18 pages, 12 figures, in ECML_PKDD'26

详情
AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而,其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距,我们提出NRITYAM,一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对,涵盖12种语言,是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发,他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型,包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准,NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在\url{this https URL}获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at~\url{https://github.com/niladrighosh03/NRITYAM}.

2606.19819 2026-06-19 cs.CL cs.AI 新提交

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

CREDENCE: 面向分解与增强可信度的声明缩减——语义度量与收敛性分析

Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

发表机构 * Vietnamese-German University(越南德国大学) Ho Chi Minh University of Technology(胡志明市理工大学)

AI总结 提出CREDENCE框架,通过语义F1度量解决Jaccard度量对释义声明的低估问题,并形式化分析修复管道的收敛性,实验表明语义F1比Jaccard F1提升15-32个百分点,规则修复将原子性违反率降低47-100%。

Comments 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

详情
AI中文摘要

将复合句分解为原子化的、可验证的声明是可靠自动化事实核查的前提。先前工作依赖基于词重叠(Jaccard)的度量,系统性地低估了释义声明的分解质量,并且缺乏对修复循环的形式化终止分析。我们提出CREDENCE,一个改进的声明分解与评估框架,解决了这两个缺陷。我们的贡献包括:(1) 语义F1:我们使用BGE-large余弦相似度保真度度量,解决了Jaccard的惩罚问题,并提高了下游事实核查的准确性;(2) 收敛定理:我们形式化地表征了修复管道的四个性质,确立了在预言解析器假设下基于规则的修复是单调且有限终止的;基于LLM的自修复被证明是非单调的,需要早期退出保护;(3) 三个评估基准,涵盖社交媒体、百科全书和新闻领域,用于跨领域泛化度量;(4) 跨四个分解器模型(3.8B-12B)和一个封闭API模型的多模型基准测试。在SocialClaimSplit、WikiSplitBench和ClaimDecompBench上的实验表明,语义F1比Jaccard F1提升15-32个百分点。在SocialClaimSplit和WikiSplitBench上,EPR范围为0.94至1.00,而ClaimDecompBench由于更难的新闻领域构造,包含较低的基线EPR情况(低至0.824),规则修复相对于基线模型将原子性违反率(AVR)降低了47-100%,且不降低保真度。

英文摘要

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use BGE-large cosine similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption; LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.

2606.19881 2026-06-19 cs.CL 新提交

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

REDACT:一个系统控制的个人信息检测多语言基准

Guneesh Vats, Anubha Agrawal, Shikha Singhal, Ajita Dash, Praison Selvaraj, Vidhan Jhawar, Ranga Prasad Chenna, Bharadwaj Y M G

发表机构 * ServiceNow

AI总结 提出REDACT基准,包含13,427条记录、51种实体类型、25种语言,通过强度-2覆盖阵列采样控制9个生成轴,并引入实体级元数据(披露状态、形式、GDPR敏感层级)以支持分层评估,揭示检测器在敏感数据上的架构依赖性失败模式。

Comments 14 pages, 5 figures

详情
AI中文摘要

个人可识别信息(PII)检测的基准基础设施仍然有限:现有语料库涵盖的实体类型少,使用临时生成条件,并且未显示哪些表面条件导致检测器失败。我们提出REDACT,一个系统控制的多语言PII基准,包含13,427条记录、324,078个实体注释、51种实体类型、4,127个表面形式模式以及跨越9种文字的25种语言。一个强度-2覆盖阵列采样器控制九个生成轴:领域、格式、难度、长度、密度、代码切换、语言、邻接和共现。三个实体级元数据字段(披露状态、披露形式和符合GDPR的敏感层级)使得能够进行超越聚合或按类型F1的分层评估。从完整基准中,我们在一个锁定的、按语言分层的1000条记录样本上评估了五个检测器(Presidio、GLiNER、OpenAI隐私过滤器、GPT-4.1和Claude Sonnet 4.6)。聚合F1掩盖了架构依赖的失败结构:基于规则的检测器在最高风险数据上表现不佳,包括高敏感类别(召回率0.07)和非逐字披露形式,而LLM检测器保持更鲁棒,高敏感层级是其最强的敏感切片。一个三模型无参考LLM作为评判者的评估证实,敏感层级分配是任务最困难的轴。我们发布了基准、模式、提示和分层评估工具。

英文摘要

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.

2606.20152 2026-06-19 cs.CL cs.AI 新提交

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

从文本到分数:追踪大型语言模型中作文质量表征的出现

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

AI总结 通过线性探测等方法分析8个LLM在三个数据集上的隐藏表征,发现作文质量信息以线性可解码形式存在,并识别出与分数相关的神经元,揭示了LLM评分的内在机制。

Comments This is a preprint of a manuscript currently under peer review

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展极大地改变了自动作文评分(AES),但基于LLM的评分内部机制仍知之甚少。在本工作中,我们系统分析了八个LLMs在两个英文作文数据集(ASAP++、CSEE)和一个葡萄牙语数据集(ENEM)上的隐藏表征。通过线性探测、跨提示泛化、降维和神经元级分析,我们发现一致证据表明作文质量信息以线性可访问的形式编码在LLM表征中。这些表征在层间逐步出现,在不同提示策略下保持稳健,并且尽管评分标准不同,仍能在作文提示间部分迁移。此外,非线性探测相对于线性探测仅提供边际且不一致的改进,表明大多数作文质量信息已经是线性可解码的。我们进一步识别出单个“作文评分神经元”,其激活与作文分数强相关,且其行为对目标干预敏感。此外,这些神经元的逐层分布随作文长度系统性地变化,较长的作文更依赖深层。总体而言,我们的发现提供了LLM编码与作文质量相关的结构化表征的证据,并为基于LLM的AES系统的可解释性提供了新见解。

英文摘要

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual ``essay scoring neurons'' whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

2606.20212 2026-06-19 cs.CL 新提交

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

CzechDocs:捷克少数民族语言格式化文档的多路平行数据集

Josef Jon, Ondřej Bojar

发表机构 * Charles University, Faculty of Mathematics Physics Institute of Formal

AI总结 提出CzechDocs多路平行格式化文档数据集,覆盖捷克及少数民族语言,支持评估保留格式的机器翻译系统,并公开验证子集与评估工具。

详情
AI中文摘要

我们提出了CzechDocs,一个多路平行的格式化文档(HTML、DOCX和PDF)数据集,涵盖捷克语及捷克境内使用的少数民族语言——主要是乌克兰语和英语,以及少量越南语、俄语和其他语言。该数据集旨在支持评估旨在翻译过程中保留文档格式的机器翻译系统。我们在数据集的验证子集上比较了最常见的格式保留机器翻译方法。该验证子集连同评估工具包已公开发布,以供进一步研究。一个保留的测试子集将用于未来专注于文档级翻译并保留格式的共享任务。

英文摘要

We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia-primarily Ukrainian and English, with smaller portions of Vietnamese, Russian and other languages. The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation. We provide a comparison of the most common approaches to format-preserving machine translation on a validation subset of the dataset. This validation split, together with the evaluation toolkit, is publicly released for further research. A held-out test split will be reserved for a future shared task focused on document-level translation with formatting preservation.

2606.20255 2026-06-19 cs.CL cs.AI 新提交

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

语域差距:尼日利亚公共话语的意义智能框架

Celestine Achi

AI总结 提出九维意义智能框架(MIF),通过语域、真实意图等维度区分表面情感与真实交际意图,在尼日利亚公共话语数据集上使语域分类准确率提升40个百分点,复合意义智能评分提升5.4分。

Comments Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

详情
AI中文摘要

我们提出了意义智能框架(MIF),这是一个用于尼日利亚公共话语的九维标注和评估方案,将表面情感与真实交际意图区分开来。现有的尼日利亚语言基准(包括NaijaSenti和AfriSenti)将情感分类视为三向极性任务(正面、负面、中性)。我们认为,AI系统在尼日利亚话语上的主要失败模式不是翻译失败,而是语境失败:同一话语根据说话者、听众和情境可能具有相反的语用效力。MIF通过九个评分维度将这一见解操作化:语域、表面情感、真实意图、反讽、编码潜台词、风险等级、标注者置信度、说话者情绪和推荐沟通行动。我们构建了一个包含30个项目的校准数据集,涵盖标准英语、尼日利亚英语、尼日利亚皮钦语和混合语域,并在零样本和模式引导提示条件下评估了一个前沿语言模型(Gemini 2.5 Flash)。主要发现是语域差距:零样本语域分类准确率为33.3%,当模型在上下文中接收到MIF模式时,准确率上升至73.3%(+40个百分点)。在模式引导提示下,复合意义智能评分增加了5.4分(从73.2到78.6),最大的实际收益体现在语域识别、编码潜台词检测(+10分)和战略行动推荐(+10.3分)上。我们发布了框架规范、标注指南和包含30个项目的公开校准集以支持可重复性,同时保留了一个私有留存语料库用于防污染评估。

英文摘要

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.

2606.20369 2026-06-19 cs.CL 新提交

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

CATCH-ME if you RAG:针对仇恨与虚假信息交流的上下文注释多轮对抗言论数据集

Helena Bonaldi, Genoveffa Martone, Marco Guerini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Università Cattolica del Sacro Cuore(圣心天主教大学)

AI总结 提出首个大规模、专家策划的多语言对话数据集,覆盖仇恨与虚假信息重叠问题,包含事实核查锚定和跨度标注,支持RAG系统训练更可信的对抗言论模型。

详情
AI中文摘要

在线仇恨言论和虚假信息经常重叠,但NLP研究主要将它们孤立处理。虽然LLMs代表了协助人类针对这两种威胁生成对抗言论的可扩展解决方案,但零样本模型经常生成重复和模糊的回应,凸显了需要高质量示例来指导模型生成。然而,现有的针对仇恨和虚假信息重叠的对抗言论数据集很少,且仅限于单轮英语对话,而现实中的交互跨越多个轮次和语言。为弥补这一差距,我们引入了第一个大规模、专家策划的多语言对话数据集,处理仇恨与虚假信息的交叉点。为确保事实基础,对话还锚定在已验证的外部知识(即事实核查文章和非政府组织报告)中,并包含文档级和块级跨度标注,使其可直接应用于RAG系统。该新资源涵盖五种语言,针对七个边缘化群体的仇恨,能够训练和评估更具说服力、基于事实的对抗言论模型。

英文摘要

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

2606.19558 2026-06-19 cs.LG cs.CL 交叉投稿

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

位移不是方向:评估量化LLM部署的保真度指标

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

发表机构 * ByteShape University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

AI总结 本文研究KL散度等保真度指标在量化语言模型部署中与下游基准分数的相关性,发现整体强相关但在近基线区域失效,归因于KL散度主要衡量分歧量而非方向。

详情
AI中文摘要

保真度指标,如每个token的KL散度(KLD)与高精度参考模型的比较,常被用作基准质量的低成本代理。我们在Qwen3.6-35B-A3B的28个量化模型和Devstral-Small-2-24B的41个量化模型上,通过一系列下游基准测试验证了这一做法。我们发现,在整个量化队列中,KLD与基准分数强相关(Qwen上ρ=-0.72,Devstral上ρ=-0.86,p<0.001)。然而,在接近基线的静默区,这种关系变得不显著(Qwen上ρ=+0.00,Devstral上ρ=-0.24,p=0.36)。这种失效在14种测量变体中持续存在,包括不同的KLD聚合方式、困惑度公式、top-1一致性、校准语料库和上下文长度。在逐提示层面,KLD在代码任务上仅有较弱的失败预测能力,在LiveCodeBench上五个模型的失败与通过几何平均比在[1.08,1.22]之间,并且作为跨模型路由器失败,在分歧提示上仅达到42.3%-49.4%的准确率。我们将这种失效归因于结构分解:KLD主要衡量与参考模型的分歧量,在静默区复合ρ在Qwen上为+0.94(p<0.001),在Devstral上为+0.55(p=0.03),而其与分歧方向的关系较弱且依赖于任务。

英文摘要

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.

2606.19706 2026-06-19 cs.CV cs.CL 交叉投稿

NEST: Narrative Event Structures in Time for Long Video Understanding

NEST:面向长视频理解的时间叙事事件结构

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

发表机构 * Department of Computer Science, Virginia Tech(弗吉尼亚理工大学计算机科学系)

AI总结 提出NEST数据集(1005部全长电影),通过多模态叙事事件标注和关系链接,评估模型在长视频中理解事件结构、时间顺序和长程依赖的能力,实验表明事件检测等任务极具挑战性。

详情
AI中文摘要

视觉-语言模型的最新进展使得处理越来越长的视频序列成为可能,但处理扩展令牌流的能力并不能转化为对长视频中叙事结构的理解。现有的长视频基准侧重于大海捞针式检索,而不是评估低级动作如何形成事件、事件如何跨时间交互以及叙事如何进展,例如,模型是否能够将早期的挫折(如失业)与后来的关系破裂联系起来,尽管存在长时间间隔、中间场景或重新诠释事件的闪回。我们引入了NEST(面向长视频理解的时间叙事事件结构),一个包含1005部全长电影(平均98分钟)的数据集,每部电影都标注了102个基于视觉内容、对话和音频的多模态叙事事件。NEST通过基于视觉内容、对话和音频的结构化标注捕捉多模态叙事事件,并通过反映叙事结构的关系(包括时间顺序、层次组合和长程依赖)将它们联系起来。我们引入了事件触发检测(ETD)、事件定位(EL)、事件论元抽取(EAE)和事件关系抽取(ERE)的基线。该基准对于基于事件发现极具挑战性,ETD低于8%,EL低于6%,EAE低于11%。相比之下,一旦事件给定,ERE更容易处理,零样本F1达到35.45%,微调后F1达到44.42%。

英文摘要

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 交叉投稿

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

AI总结 针对语义缓存系统中离线指标与部署性能的差距,提出P-CHR AUC和CRR指标,发现校准差距由训练目标主导,模型选择本质是校准问题。

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型通常在操作中最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量缓存利用率水平上精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.

2606.19749 2026-06-19 cs.AI cs.CL 交叉投稿

Benchmarking Agentic Review Systems

基准测试智能审稿系统

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

发表机构 * University of Chicago(芝加哥大学) Bar-Ilan University(巴伊兰大学)

AI总结 针对AI辅助研究给同行评审带来的压力,新兴智能审稿系统涌现,但缺乏评估标准。本文评估了多种系统,发现最佳配置(OpenAIReview + GPT-5.5)在成对准确性上达83.0%,能捕获71.6%注入错误,且用户反馈正面。

Comments 11 pages, 7 tables, 4 figures

详情
AI中文摘要

一类新的智能审稿系统正在兴起,以缓解AI辅助研究给同行评审系统带来的压力,但如何评估它们尚不明确。我们评估了两个开源系统(OpenAIReview和coarse)、一个专有系统(Reviewer3)以及一个零样本基线,跨越六个涵盖前沿和高效模型的LLM。首先,我们研究ICLR/NeurIPS论文上的AI评审是否与论文质量(通过引用和接受决定等外部信号近似)相关。每个系统在成对准确性上均高于随机水平,最佳为OpenAIReview + GPT-5.5,达到83.0%。其次,为测试系统能否捕获已知真实错误的错误,我们构建了一个扰动基准,向八个arXiv学科类别的论文中注入四类错误,并测量检测召回率。最强配置(OpenAIReview + GPT-5.5)捕获了71.6%的注入错误,仍有很大改进空间。六个模型的检测并集达到83.3%的召回率,表明不同模型检测不同错误,更好的利用设计可能提高性能。除这些基准外,我们研究了OpenAIReview在真实用户中的公开部署。对其评论的投票偏向正面,比例为1.44:1,最常见的抱怨是误报和琐碎挑剔。总之,通过评估基于最先进模型的全审稿系统在真实研究论文上的表现,我们表明虽然AI评审仍有改进空间,但它们已经能够很好地跟踪人类质量判断、捕获重要错误,并获得真实用户的正面反馈。

英文摘要

A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.

2606.19788 2026-06-19 cs.AI cs.CL 交叉投稿

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

CombEval:评估大语言模型中组合计数的框架

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Czech Technical University in Prague(捷克布拉格理工大学) CRRC Zhuzhou Institute(中车株洲研究所) Tengen Intelligence Institute(天元智能研究院) International Center of Future Science, Jilin University(吉林大学未来科学国际合作中心) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE(教育部知识驱动人机智能工程研究中心)

AI总结 提出CombEval动态基准,通过类型化Cofola规范生成组合计数问题,评估11个大语言模型在直接和代码增强设置下的表现,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上存在脆弱性。

Comments under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

详情
AI中文摘要

我们提出了CombEval,一个用于评估大语言模型中组合计数的动态基准。CombEval将每个问题表示为关于实体、组合对象、对象依赖和约束的类型化Cofola规范,从而能够生成带有精确求解器验证答案的自然语言计数问题。与静态集合不同,CombEval支持对象类型、实体规模、约束数量和推理深度的系统变化。我们在直接和代码增强设置下评估了11个大语言模型,发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上仍然脆弱。错误分析进一步识别出在约束解释和计数原则上的失败。CombEval为研究大语言模型何时以及为何在组合推理上失败提供了一个诊断测试平台。代码和生成的基准套件可在\url{this https URL}公开获取。

英文摘要

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at \url{https://github.com/YuxuZhou-CN/combination-problem-generation}.

2606.19830 2026-06-19 cs.SE cs.CL 交叉投稿

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

JAMER:专业游戏引擎上的项目级代码框架数据集与基准测试

Jianwen Sun, Chuanhao Li, Zizhen Li, Yukang Feng, Fanrui Zhang, Yifei Huang, Yu Dai, Kaipeng Zhang

AI总结 提出首个基于专业游戏引擎的项目级代码框架数据集JamSet和基准JamBench,通过设计确定性验证流程,从24万仓库中筛选出8133个已验证项目,评估9个前沿模型发现项目规模增大时能力急剧下降。

详情
AI中文摘要

当前AI驱动的游戏开发在资产生成、游戏设计和基于Web的游戏编码方面取得了实质性进展,但由于缺乏大规模数据集和确定性评估方法,专业游戏引擎上的项目级代码工程仍然很大程度上未被探索。我们提出了JamSet和JamBench,这是首个基于专业游戏引擎的项目级游戏代码框架数据集和基准。我们的关键洞察是,Game Jam竞赛(开发者在严格时间限制下构建完整游戏的社区活动)产生了数千个适合此目的的开源项目。基于Godot引擎的文本格式和无头执行模式,我们设计了一个从文件完整性到运行时行为收集的确定性验证流程,从超过24万个仓库中提炼出8133个已验证项目。其中,300个手动验证的项目构成JamBench;其余构成JamSet。JamBench定义了主题驱动的生成和代码补全任务,通过结合编译通过率、结构完整性得分(SCS)和行为对齐得分(BAS)的流水线进行评估。对9个前沿模型的评估揭示了随着项目规模增加的能力悬崖,运行时通过率从小型项目的80.4%下降到大型项目的5.7%(Task2a)。代码代理提高了编译率,但在运行时行为质量上没有带来提升,表明瓶颈在于架构设计而非语法正确性。实验验证了JamSet作为有效训练数据。所有数据和代码均已公开。

英文摘要

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark built on a professional game engine. Our key insight is that Game Jam competitions, community events where developers build complete games under tight time constraints, yield thousands of open-source projects suitable for this purpose. Building on the Godot engine's text-based format and headless execution mode, we design a deterministic verification pipeline from file integrity to runtime behavior collection, distilling 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench; the rest constitute JamSet. JamBench defines theme-driven generation and code completion tasks, evaluated through a pipeline combining compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Evaluation of 9 frontier models reveals a capability cliff as project scale increases, with runtime pass rates dropping from 80.4% on small projects to 5.7% on large ones (Task2a). Code Agents improve compilation rates yet yield no gains in runtime behavioral quality, indicating that the bottleneck lies in architectural design rather than syntactic correctness. Experiments validate JamSet as effective training data. All data and code are publicly available.

2606.20065 2026-06-19 cs.IR cs.CL cs.CY 交叉投稿

Generative Engine Optimization at Scale: Measuring Brand Visibility Across AI Search Engines

生成式引擎优化规模化:衡量AI搜索引擎中的品牌可见性

Pratyush Kumar

AI总结 本研究通过分析10万+提示响应,提出衡量AI搜索引擎中品牌可见性的方法,发现品牌成熟度形成三级阶梯,并识别出最受引用的内容格式和情感不稳定性。

Comments 14 pages, 4 tables; v1.0 preprint

详情
AI中文摘要

人们越来越多地从AI搜索引擎(如ChatGPT、Claude、Perplexity和Gemini)直接获取答案,而不是滚动浏览搜索结果。曾经专注于搜索引擎优化(SEO)的品牌现在必须优化这些引擎如何代表、引用和推荐它们——这一转变被称为生成式引擎优化(GEO)、答案引擎优化(AEO)和AI搜索可见性。我们将AEO和AI可见性视为GEO的一部分,并研究如何衡量AI引擎中的品牌可见性:它们在引用品牌时看重什么,依赖哪些来源,以及大型语言模型呈现什么内容。难点在于那些尚未成为权威顶级品牌的所有其他品牌——中小企业、D2C品牌、创作者和早期初创公司。我们分析了2026年3月至5月期间在Ranqo上追踪的100多个品牌的10万+提示响应。首次可见性运行形成了清晰的三级品牌地位阶梯:全球家喻户晓的品牌(如Stripe、Nike)在首次运行时出现在73%的相关AI答案中;成熟的中端市场和区域品牌(如Olipop、Klaviyo)出现在44%中;小众和小品牌仅出现在11%中——每级约30个百分点。当引擎引用来源时,约78%指向企业网站;在非企业来源中,YouTube领先,其次是Reddit、编辑媒体和维基百科。杠杆率最高的页面是排名“最佳”列表文章,是最常被引用的内容格式,约占所有引用的21%。情感是不稳定的信号:品牌被正面或负面描述的变化频率大约是品牌是否被提及的变化频率的6.7倍。这些发现为衡量GEO提供了首个大规模基线:AI品牌可见性是可测量的,因平台而异,并随品牌成熟度强烈变化。最后,我们提出了七个v1.1协议,以测试特定建议是否能因果性地提高AI可见性。

英文摘要

People increasingly get answers straight from AI search engines like ChatGPT, Claude, Perplexity, and Gemini rather than scrolling search results. Brands that once focused on search engine optimization (SEO) must now optimize for how these engines represent, cite, and recommend them -- a shift variously called Generative Engine Optimization (GEO), Answer Engine Optimization (AEO), and AI Search Visibility. We treat AEO and AI Visibility as part of GEO, and study how to measure brand visibility across AI engines: what they value when they cite a brand, which sources they rely on, and what content large language models surface. The hard case is everyone outside the already-authoritative top brands -- SMEs, D2C brands, creators, and early-stage startups. We analyze 100K+ prompt responses across 100+ brands tracked on Ranqo between March and May 2026. First visibility runs form a clear three-tier brand-stature ladder: global household names (e.g., Stripe, Nike) appear in 73% of relevant AI answers on their first run; established mid-market and regional brands (e.g., Olipop, Klaviyo) in 44%; niche and small brands in just 11% -- about 30 percentage points per step. When engines cite sources, about 78% go to corporate websites; among non-corporate sources YouTube leads, ahead of Reddit, editorial media, and Wikipedia. The highest-leverage page is the ranked "best-of" listicle, the most-cited content format at about 21% of all citations. Sentiment is the unstable signal: whether a brand is framed positively or negatively flips about 6.7 times more often than whether it is mentioned at all. These findings provide a first large-scale baseline for measuring GEO: AI brand visibility can be measured, differs by platform, and varies strongly by brand maturity. We close by proposing seven v1.1 protocols to test whether specific recommendations can causally improve AI visibility.

2606.20205 2026-06-19 cs.AI cs.CL cs.HC 交叉投稿

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

大语言模型的心理特征很大程度上是测量假象

Jelena Meyer, David Garcia, Dirk U. Wulff

发表机构 * Max Planck Institute for Human Development(马克斯·普朗克人类发展研究所) University of Konstanz(康斯坦茨大学) Barcelona Supercomputing Center(巴塞罗那超级计算中心) University of Basel(巴塞尔大学)

AI总结 通过心理测量框架分析56个指令微调LLM,发现模型间差异主要源于方向性响应偏差而非特质,该偏差解释了81-90%的变异,且可通过题目选择操控,表明LLM心理特征是测量假象。

详情
AI中文摘要

专为人类设计的心理测量工具越来越多地被用于赋予大型语言模型(LLM)稳定的心理特征,这些特征影响其可用性、安全评估以及作为人类参与者的研究代理。使用正式的心理测量框架,我们表明这些特征很大程度上是测量假象。我们对56个指令微调LLM以及大型人类参考样本施测了一系列涵盖自我报告和行为任务的人格与风险偏好工具,报告了四个发现。第一,模型间差异并非由工具所针对的特质驱动,而是由方向性响应偏差驱动,即倾向于向量表一端或某个标签选项做出反应,而不考虑项目内容;方差分解将81-90%的模型间变异归因于这种偏差,而在人类中这一比例为9-16%。第二,偏差随模型能力提升而下降,但并未被消除。第三,由于响应由偏差而非特质驱动,工具的表面信度几乎完全由其响应正交性预测,这是我们提出的术语,指特质和偏差指向相反方向的项目比例。第四,模型呈现的特征随所用项目而变化,并可通过项目选择来制造。这些结果表明,LLM的表面心理特征是用于测量它们的工具的假象,而非模型本身的属性。由于从人类心理学借用的工具很少完全正交,且可能对LLM天生缺乏效度,我们呼吁以响应正交性为中心进行专门的评估。

英文摘要

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.

2508.04266 2026-06-19 cs.CL 版本更新

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

ShoppingBench:面向LLM智能体的真实世界意图导向购物基准

Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jian Dong Zhang, Xiaoyi Zeng

发表机构 * Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)

AI总结 提出ShoppingBench基准,包含多层级真实购物意图任务,通过模拟环境和250万商品评估LLM智能体,发现GPT-4.1成功率低于50%,并提出轨迹蒸馏策略提升小模型性能。

Comments Accepted for oral presentation at AAAI 2026

详情
AI中文摘要

现有的电子商务基准主要关注基本用户意图,例如查找或购买产品。然而,现实世界的用户通常追求更复杂的目标,例如应用优惠券、管理预算以及寻找多产品卖家。为了弥补这一差距,我们提出了ShoppingBench,这是一个新颖的端到端购物基准,旨在涵盖日益具有挑战性的接地意图级别。具体来说,我们提出了一个可扩展的框架,基于从采样的真实世界产品中得出的各种意图来模拟用户指令。为了促进一致且可靠的评估,我们提供了一个大规模购物沙箱作为交互式模拟环境,包含超过250万种真实产品。实验结果表明,即使是最先进的语言智能体(如GPT-4.1)在我们的基准任务上的绝对成功率也低于50%,这突显了我们的ShoppingBench带来的重大挑战。此外,我们提出了一种轨迹蒸馏策略,并利用监督微调以及基于合成轨迹的强化学习,将大型语言智能体的能力蒸馏到较小的智能体中。结果,我们训练的智能体实现了与GPT-4.1相媲美的竞争性能。

英文摘要

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-products seller. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by our ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

2602.13139 2026-06-19 cs.CL 版本更新

OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

OpenLID-v3:提高近亲语言识别精度的经验报告

Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer

AI总结 针对现有语言识别工具对近亲语言和噪声区分困难的问题,通过增加训练数据、合并问题语言变体簇和引入噪声标签扩展OpenLID分类器,提出OpenLID-v3,在多个基准上提升精度。

Comments VarDial'26 workshop at the EACL 2026 conference

详情
AI中文摘要

语言识别(LID)是从网络数据构建高质量多语言数据集的关键步骤。现有的LID工具(如OpenLID或GlotLID)通常难以识别近亲语言,也难以区分有效自然语言与噪声,这污染了特定语言子集,尤其是低资源语言。在本工作中,我们通过增加更多训练数据、合并有问题的语言变体簇以及引入一个专门标记噪声的标签来扩展OpenLID分类器。我们将这个扩展系统称为OpenLID-v3,并在多个基准上将其与GlotLID进行评估。在开发过程中,我们重点关注三组近亲语言(波斯尼亚语、克罗地亚语和塞尔维亚语;意大利北部和法国南部的罗曼语变体;以及斯堪的纳维亚语言),并在现有评估数据集不足的地方贡献了新的评估数据集。我们发现集成方法提高了精度,但也显著降低了对低资源语言的覆盖。OpenLID-v3可在该https URL上获取。

英文摘要

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on https://huggingface.co/HPLT/OpenLID-v3.

2605.26891 2026-06-19 cs.CL 版本更新

Telenor Nordics Customer Service self-help corpus

Telenor Nordics 客户服务自助语料库

Mike Riess

发表机构 * Research and Innovation, Telenor Group(Telenor集团研究与创新)

AI总结 本文构建了一个包含芬兰语、丹麦语、挪威语和瑞典语的多语言客户服务自助语料库,共1122篇文档,用于支持北欧NLP和信息检索研究。

Comments 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152

详情
AI中文摘要

本文介绍了一个多语言客户服务自助语料库,包含1122篇经过人工验证的芬兰语、丹麦语、挪威语和瑞典语文档,总词数超过一百万。这些文档来自四家北欧电信运营商的公共自助页面,随后通过结合LLM和人工标注的流程过滤了个人身份信息和相关性。北欧语言的领域特定数据集仍然稀缺,尤其是在客户服务领域——这一领域对于检索增强生成、跨语言迁移学习和新兴的基于代理的服务架构日益重要。对语料库的分析显示,不同运营商的文档长度和结构存在显著差异,反映了不同的编辑策略,以及涵盖网络硬件、移动服务、电视和流媒体、计费和账户管理的广泛主题覆盖。该数据集在CC-BY-NC-SA-4.0许可下公开提供,网址为https://zenodo.org/records/19493152,旨在支持北欧NLP和信息检索的可重复研究。

英文摘要

This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling 274,599 words and 1,884,833 characters. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/20732652, intended to support reproducible research in Nordic NLP and information retrieval.

2606.01338 2026-06-19 cs.CL 版本更新

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试:消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta, Ambika Baniya Bhandari

发表机构 * Department of Computer Science, University of the Cumberlands(大学的计算机科学系) Department of Computer Science, DePaul University(德保罗大学计算机科学系) Youngstown State University(亚当斯州立大学)

AI总结 本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能,发现代码调优的通用模型优于领域特定模型,但当前性能仍需人工监督。

详情
AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范(GMP)和欧盟AI法案等监管框架下运营,这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型(LLM)提供了一种保护隐私的替代方案,但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM(Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B)在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI,使用一个包含约63,000条记录的合成Microsoft SQL Server数据库,涵盖批次、制造执行系统(MES)和在线清洗(CIP)模块。模型在60个领域特定的自然语言问题上进行了基准测试,使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL,而Meditron 7B由于上下文窗口限制和SQL生成能力差,几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性,而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明,代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的,但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

2606.17041 2026-06-19 cs.CL cs.IR 版本更新

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

对Nature Portfolio元分析文章进行LLM代理基准测试

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

发表机构 * Tsinghua University(清华大学)

AI总结 提出MetaSyn数据集,包含442篇专家策划的元分析,用于评估LLM代理在检索-筛选-综合全流程中的表现,发现当前系统在筛选阶段存在严重瓶颈。

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

详情
AI中文摘要

元分析是一种要求高的证据综合形式,结合了文献检索、PI/ECO指导的研究选择和统计聚合。其结构化、可验证的工作流程使其成为评估系统科学推理的理想基础,然而现有基准缺乏完整的检索-筛选-综合流程的真相。我们引入了MetaSyn,一个包含来自Nature Portfolio期刊的442篇专家策划的元分析的数据集。每个条目将研究问题与PI/ECO标准、包含140k篇PubMed文章的检索语料库、经过验证的阳性研究、主题相似但不符合PI/ECO的硬负样本以及完整的搜索策略和日期范围配对。对十二种流水线配置(九种RAG变体和一种协议驱动的代理)进行基准测试揭示了关键的筛选瓶颈:尽管在K=200时检索上限达到90.9%的召回率,但没有任何系统能恢复超过52.7%的真相包含文献。当前的LLM无法可靠地将合格研究与主题相关性相当的PI/ECO不合格干扰项区分开来。阶段归因指标捕捉了系统成功和失败的地方;单一的端到端分数则不能。

英文摘要

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

2606.18613 2026-06-19 cs.CL cs.AI 版本更新

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生?PhysAssistBench:交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

发表机构 * Aalto University(阿尔托大学) Tencent(腾讯) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Hong Kong Polytechnic University(香港理工大学) Aarhus University(奥胡斯大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出PhysAssistBench基准,通过构建交互式患者代理评估LLM在医患-EHR交互中的协调能力,发现当前模型不可靠,瓶颈在于多维度协调而非单一能力。

Comments 34 pages with 8 figures

详情
AI中文摘要

医疗LLM最合理的近期角色是辅助而非替代医生,但当前的评估通常测试孤立能力:临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力,其中医生提出不明确的请求,患者模糊描述症状,EHR系统要求精确的工具使用。我们引入PhysAssistBench,一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例,PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理,将静态EHR记录转化为多轮临床场景,同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集,包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明,当前模型在此设置下仍不可靠,这暴露了临床LLM的关键瓶颈:可靠的辅助需要知识、沟通和系统之间的协调,而非任何单一能力的孤立提升。

英文摘要

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

10. 安全、隐私、公平与可解释NLP 15 篇

2606.19344 2026-06-19 cs.CL cs.AI 新提交

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

揭示未言明之事:通过随机路径聚合可视化隐藏的LLM偏见

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出TreeTracer工具,通过系统扰动分析、语法对齐聚合和分类感知节点合并,利用桑基图对比不同语义上下文,揭示LLM中隐藏的代表性和句法偏见。

Comments 14 pages

详情
AI中文摘要

大型语言模型(LLM)表现出表征性和句法性偏见,由于文本生成的随机性,这些偏见难以评估。标准审计方法依赖于单一输出检查或静态自动化指标,这些方法掩盖了底层概率分布,未能捕捉隐藏在低概率生成分支中的偏见。本文介绍了TreeTracer,一种通过聚合比较评估LLM偏见的可视化分析工具。该工具使用系统扰动分析流程,替换每个输入提示中由本体定义的术语,将数百次随机生成聚合成语法对齐的层次结构,然后使用辅助语言模型进行分类感知节点合并。生成的结构通过自定义桑基图可视化。通过并置两个本体驱动的树,工作空间能够直接比较语义上下文,并支持系统性偏见检测。由于任何可视化仅反映模型学习行为的一个子集,系统进一步应用对比推理来计算并直接显示跨上下文的反事实标记概率,从而降低误解偏见存在的风险。我们通过案例研究验证了该工作空间,比较了未对齐的基线模型GPT-2 XL与宪法对齐的Apertus模型。视觉聚合成功揭示了隐藏的代表性伤害,例如反事实代词抑制和对话中对个体的边缘化。初步用户研究证实,聚合比较界面降低了认知负荷,并有效支持分析人员检测系统性偏见。

英文摘要

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

2606.20093 2026-06-19 cs.CL 新提交

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

自我偏好在可验证的指令遵循修订中弱或不存在:基于真正作者身份的四模型测试

William Guey, Pierrick Bougault

发表机构 * Department of Industrial Engineering, Tsinghua University(清华大学工业工程系)

AI总结 通过IFEval验证器测试四类中端模型在指令遵循修订中的自我偏好,发现作者拒绝已验证正确编辑的比例与新鲜模型无显著差异,表明自我偏好弱或不存在。

Comments 7 pages, 3 tables. Code and data: https://github.com/williamguey/self-preference-revision

详情
AI中文摘要

大型语言模型(LLMs)越来越多地审查和修订文本,包括它们自己的文本。有记录的自我偏好偏差(模型在充当评判者时偏爱自己的生成)引发了一个问题:模型是否也会抵制对自己写作的有效修正。我们在一个“有效”不是由另一个模型决定,而是由确定性验证器决定的设置中测试这一点:基于IFEval的指令遵循修订。模型撰写草稿;官方IFEval检查器确认草稿违反约束,并且候选编辑修复了它;然后模型接受或拒绝该编辑,要么作为真正的上下文内作者,要么作为一个以中立方式看待草稿的新鲜模型。在四个中端模型系列和85次作者与新鲜模型比较中,我们未检测到可察觉的自我偏好:作者拒绝对自己草稿的已验证正确修复的比例与判断相同草稿的新鲜模型基本相同(差距-5.1个百分点,95%置信区间[-12.9, +2.7])。来自较小试点的自我怀疑提示未在大规模上复制。唯一稳健的观察是定性的:当作者确实拒绝已验证正确的修复时,他们陈述的理由中有97%是挑错而非偏好,即关于拒绝的性质,而非升高的比率。在此样本量下,不能排除小于约13个百分点的效应。

英文摘要

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

2606.20225 2026-06-19 cs.CL 新提交

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

可操作的激活方向:检测和缓解跨语言模型家族的突发性对齐失调

Abdul Rafay Syed

发表机构 * Universität des Saarlandes(萨尔大学)

AI总结 通过差分均值方向在最终层实现99.6%的对齐/失调分离,因果干预将代码泄露降低21-51点;跨架构迁移虽有效但缺乏特异性,揭示了两层特异性结构。

Comments 12 pages, 2 figures

详情
AI中文摘要

在不安全代码上微调语言模型会引发突发性对齐失调,其内部结构尚不明确。我们研究了这种失调是否对应于跨架构共享的因果可操作的激活空间方向。在四个指令微调模型家族(Qwen2.5-1.5B、Gemma-2-2B、Llama-3.2-1B、Ministral-3-3B)上进行相同微调后,差分均值方向在每个模型的最终层实现了99.6%的对齐与失调激活分离。通过减去该方向进行因果干预,代码泄露减少了21-51个百分点,而安全代码控制验证了内容特异性。通过岭回归映射进行跨架构迁移产生了较大的行为抑制(高达46个百分点),但未能通过特异性控制,因为随机和正交方向表现相当。我们识别出一个两层特异性结构:模型内方向具有因果特异性和可操作性;跨模型方向具有因果真实性但缺乏特异性。出现了不对称的迁移拓扑,Gemma和Qwen作为几何捐赠者,Llama作为接收者。这些发现定义了线性跨架构校正的局限性,并推荐使用模型内探测进行审计。

英文摘要

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.

2606.20527 2026-06-19 cs.CL cs.CV 新提交

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

StylisticBias: 少数人类视觉线索驱动多模态大语言模型中的大部分社会偏见

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Princeton Center for Information and Technology Policy(普林斯顿信息与技术政策中心)

AI总结 提出StylisticBias基准,通过控制单一视觉属性变化,发现年龄和体型主导身份层面偏见,而时尚风格等约15个属性解释近80%的偏见变化,偏见集中于少数视觉线索。

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在个人和社会影响重大的场景中,但影响这些模型判断人物的视觉线索仍知之甚少。先前的工作通常比较不同的(群体)个体,难以将外貌效应与身份差异分离。我们引入StylisticBias,一个用于评估MLLMs中属性级社会偏见的受控基准。我们生成500张逼真的基础人脸,每张脸创建约50个单一属性变体,产生约25K张图像。这种设计保持身份不变,每次改变一个视觉属性,使我们能够测量特定线索如何改变模型判断。我们在25个二元社会判断场景中评估了六个MLLMs。我们发现年龄和体型主导身份层面的效应,而时尚风格和其他视觉线索驱动最大的属性级变化。我们进一步发现,约15个属性解释了近80%的总变异,表明偏见集中在少数视觉线索上。在与外貌语义对齐的判断中,尤其是社会经济和风格相关判断,敏感性最强。我们发布StylisticBias作为多模态模型细粒度偏见评估的基准。代码和数据集:此https URL和此https URL。

英文摘要

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80\% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

2606.18649 2026-06-19 cs.MA cs.CL cs.CY 交叉投稿

Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies

LLM招聘决策中的性别偏见:来自日本语境的证据及缓解策略评估

Serena A. Hoffstedde, Machiko Hirota, Akshara Nadayanur Sathis Kanna, Rihito Kotani, Ujwal Kumar, Gabriele Trovato, Phan Xuan Tan

发表机构 * Shibaura Institute of Technology, Tokyo, Japan(Shibaura技术学院,东京,日本) Amsterdam University of Applied Sciences, Amsterdam, Netherlands(阿姆斯特丹应用科学大学,阿姆斯特丹,荷兰) University of Pennsylvania, Philadelphia, USA(宾夕法尼亚大学,费城,美国) Carnegie Mellon University, Pittsburgh, USA(卡内基梅隆大学,匹兹堡,美国) Keio University, Tokyo, Japan(庆应大学,东京,日本)

AI总结 本研究通过60份日本履历书格式的简历和5个先进LLM,发现所有模型均存在显著的亲女性偏见,且简单的提示指令无法缓解,而移除姓名几乎完全消除该偏见。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署在招聘流程中,然而大多数关于LLM招聘决策中性别偏见的研究都集中在英语、西方格式的简历上。本研究考察了亲女性性别偏见是否扩展到日本企业语境,并评估了两种实用的缓解策略。使用反事实简历设计,包含60份日本履历书格式的简历、基于语言学性别信号标准选择的12个姓名对,以及五个最先进的LLM(Claude Sonnet 4.6、GPT-4o、DeepSeek-V3、Gemini 2.5 Flash、Llama 3.3 70B),我们在基线、提示指令和隐私过滤条件下进行了43,200次API调用。交叉随机效应线性混合模型确认了所有五个模型均存在显著的亲女性偏见,将西方研究结果复制到了非西方语境中。提示级别的性别中立指令并未显著减少偏见。姓名依赖分析正式将候选人姓名识别为主要性别渠道:从提示中移除姓名几乎完全消除了女性效应。隐私过滤器与GPT-4o内容安全过滤器之间的意外不兼容导致42%的拒绝率,突显了在LLM辅助招聘流程中姓名匿名化的实际部署挑战。

英文摘要

Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female gender bias extends to a Japanese corporate context and evaluates two practical mitigation strategies. Using a counterfactual resume design with 60 Japanese rirekisho-format resumes, 12 name pairs selected on linguistically grounded gender-signal criteria, and five state-of-the-art LLMs (Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, Llama 3.3 70B), we conducted 43,200 API calls across baseline, prompt instruction, and privacy filter conditions. A crossed random-effects linear mixed model confirms a significant pro-female bias across all five models, replicating Western findings in a non-Western context. A prompt-level gender-neutrality instruction produces no meaningful reduction in bias. A name-reliance analysis formally identifies the candidate name as the primary gender channel: removing the name from the prompt reduces the female effect by nearly its full magnitude. An unexpected incompatibility between the privacy filter and GPT-4o's content safety filter, resulting in a 42% refusal rate, highlights a practical deployment challenge for name anonymization in LLM-assisted recruitment pipelines.

2606.19404 2026-06-19 cs.LG cs.CL 交叉投稿

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

AI总结 提出自由能签名(Fes)作为谱描述符,将注意力拉普拉斯视为哈密顿量并提取热力学势和随机矩阵理论谱形因子,用于检测LLM幻觉,无需训练即可实现高AUROC。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials partition function, free energy, spectral entropy, heat capacity together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i)~Lipschitz stability of Fes under attention perturbation; (ii)~an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii)~a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.

2606.19660 2026-06-19 cs.CR cs.CL 交叉投稿

A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots

基于RAG的聊天机器人中针对提示注入的分层安全框架

Gulshan Saleem, Nisar Ahmed, Muhammad Imran Zaman, Ali Hassan

AI总结 提出三层防御框架,通过输入过滤、上下文指令层级和输出审计,将提示注入攻击成功率从71.4%降至11.3%,误报率4.8%,延迟开销61.2毫秒。

Comments Submitted in ICCK Transactions on Information Security and Cryptography

详情
AI中文摘要

提示注入被OWASP Top 10 for LLM Applications列为大语言模型(LLM)部署中最关键的漏洞,然而现有防御措施仅在孤立的流水线阶段运行且不完整。输入过滤器无法检查检索到的文档,而输出监控器无法阻止恶意载荷到达模型。因此,检索增强生成(RAG)聊天机器人仍然容易受到间接注入攻击,其中被污染的知识库文档会损害每个检索到它的用户。我们提出了一个三层框架,在推理流水线中拦截直接和间接的提示注入。第一层使用基于规则的模式库和微调后的语义异常分类器筛选用户输入。第二层在上下文组装期间强制执行基于来源的指令层级,防止检索到的内容覆盖操作员策略。第三层在交付前使用策略规则引擎和语义漂移检测器审计模型输出。一个持续审计循环聚合结构化日志,并支持重新训练以适应新兴攻击模式。该框架与模型无关,作为中间件部署,无需修改底层LLM。在GPT-4o、Llama 3和Mistral 7B上对5,080个样本的评估显示,该框架将攻击成功率(ASR)从71.4%降至11.3%,比最佳单层基线高出27.3个百分点,比已发布的护栏系统高出23.8个百分点,同时保持4.8%的误报率和61.2毫秒的中位延迟开销。消融研究证实,所有三层提供互补保护,且其组合效果超过单个贡献的总和。

英文摘要

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filters cannot inspect retrieved documents, while output monitors cannot prevent malicious payloads from reaching the model. Consequently, retrieval-augmented generation (RAG) chatbots remain vulnerable to indirect injection, where a poisoned knowledge-base document compromises every user whose query retrieves it. We present a three-layer framework that intercepts both direct and indirect prompt injection throughout the inference pipeline. Layer 1 screens user input using a rule-based pattern library and a fine-tuned semantic anomaly classifier. Layer 2 enforces a provenance-based instruction hierarchy during context assembly, preventing retrieved content from overriding operator policy. Layer 3 audits model output using a policy rule engine and semantic drift detector before delivery. A continuous audit loop aggregates structured logs and supports retraining to adapt the classifier to emerging attack patterns. The framework is model-agnostic and deploys as middleware without modifying the underlying LLM. Evaluation on 5,080 samples across GPT-4o, Llama 3, and Mistral 7B shows that the framework reduces Attack Success Rate (ASR) from 71.4\% to 11.3\%, outperforming the best single-layer baseline by 27.3 percentage points and a published guardrail system by 23.8 percentage points, while maintaining a 4.8\% false positive rate and a median latency overhead of 61.2 ms. Ablation studies confirm that all three layers provide complementary protection and that their combined effect exceeds the sum of individual contributions.

2606.20023 2026-06-19 cs.SE cs.AI cs.CL 交叉投稿

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

当较低权限足够时:探究LLM代理中的过度权限工具选择

Kaiyue Yang, Yuyan Bu, Jingwei Yi, Yuchi Wang, Biyu Zhou, Juntao Dai, Songlin Hu, Yaodong Yang

AI总结 针对LLM代理在工具选择中偏好高权限工具的安全问题,提出ToolPrivBench评估框架,发现主流代理普遍存在过度权限选择且被瞬态故障放大,并设计权限感知后训练防御方法有效减少不必要的高权限工具使用。

Comments code: https://github.com/AISafetyHub/agent-tool-selection-bias

详情
AI中文摘要

随着LLM代理越来越多地自主选择工具,它们在具有不同权限的工具之间的选择变得与安全相关。然而,先前的工具选择研究侧重于安全无关的元数据偏好,使得权限敏感的选择未被充分探索。为填补这一空白,我们研究了过度权限工具选择,即代理在存在足够低权限替代方案时仍选择或升级到更高权限工具。我们引入ToolPrivBench来评估代理是否在存在足够低权限替代方案时仍选择更高权限工具,同时衡量初始选择和瞬态工具故障后的升级。在八个领域和五种重复风险模式中,我们发现过度权限工具选择在主流LLM代理中很常见,并且被瞬态故障进一步放大。我们进一步发现,通用安全对齐不能可靠地迁移到最小权限工具选择,而提示级控制在瞬态故障下仅提供有限的缓解。因此,我们引入了一种权限感知的后训练防御,教导代理偏好足够低权限的工具,仅在必要时升级。我们的缓解实验表明,这种防御在保持通用能力的同时,显著减少了不必要的高权限工具使用。

英文摘要

As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving privilege-sensitive choices underexplored. To address this gap, we study over-privileged tool selection, in which an agent selects or escalates to a higher-privilege tool despite a sufficient lower-privilege alternative. We introduce ToolPrivBench to evaluate whether agents choose higher-privilege tools despite sufficient lower-privilege alternatives, measuring both initial selection and escalation after transient tool failures. Across eight domains and five recurring risk patterns, we find that over-privileged tool selection is common among mainstream LLM agents and is further amplified by transient failures. We further find that general safety alignment does not reliably transfer to least-privilege tool choice, while prompt-level controls provide only limited mitigation under transient failures. We therefore introduce a privilege-aware post-training defense that teaches agents to prefer sufficient lower-privilege tools and escalate only when necessary. Our mitigation experiments show that this defense substantially reduces unnecessary high-privilege tool use while preserving general capabilities.

2606.20155 2026-06-19 cs.CV cs.CL 交叉投稿

NAMESAKES: Probing Identity Memorization in Text-to-Image Models

NAMESAKES: 探究文本到图像模型中的身份记忆

Morris Alper, Vasudha Varadarajan, Moran Yanuka, Angelina Wang, Hadar Averbuch-Elor

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tel Aviv University(特拉维夫大学) Cornell University(康奈尔大学)

AI总结 提出一种黑盒行为探针,无需参考照片或训练数据,即可区分文本到图像模型生成的图像是记忆还是虚构,并在NAMESAKES数据集上验证其有效性。

详情
AI中文摘要

文本到图像(T2I)模型在提示其姓名时,会生成某些个体的逼真肖像,这引发了隐私问题。然而,区分生成的面孔是记忆还是虚构的,目前需要真实照片、训练数据访问权限或模型内部的白盒访问,限制了适用性。我们引入了一种完全黑盒的行为探针,可以在无需参考照片或事先了解训练数据的情况下区分这两种情况。为了基准测试这一任务,我们提出了NAMESAKES数据集,包含一千多个不同知名度水平的公众人物的姓名和面孔,以及经过扰动的、知名度较低的姓名。对最先进的T2I模型的实验表明,我们的探针能够显著预测身份记忆,并将记忆的姓名与未识别的姓名区分开来,并进一步揭示了不同模型系列之间的差异。

英文摘要

Text-to-image (T2I) models generate realistic likenesses of some individuals when prompted with their names, raising privacy concerns. However, distinguishing whether a generated face is memorized or fabricated currently requires ground-truth photos, access to training data, or white-box access to model internals, limiting applicability. We introduce a fully black-box behavioral probe that distinguishes between these regimes while requiring no reference photos or prior knowledge of training data. To benchmark this task, we present the NAMESAKES dataset of over one thousand names and faces of public figures spanning a wide range of fame levels, along with perturbed, less famous names. Experiments on state-of-the-art T2I models show that our probe substantially predicts identity memorization and separates memorized from unrecognized names, with further insights into differences across model families.

2509.03122 2026-06-19 cs.CL cs.AI cs.LG 版本更新

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

从构建到注入:面向大型语言模型的基于编辑的指纹

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

发表机构 * East China Normal University(华东师范大学) Hasso Plattner Institute/University of Potsdam(哈索罗普拉特纳研究所/波茨坦大学)

AI总结 提出端到端注入指纹框架,通过代码混合指纹和多候选编辑方法,解决黑盒部署中指纹的不可感知性和鲁棒性挑战。

Comments preprint

详情
AI中文摘要

可靠的模型指纹对于保护大型语言模型(LLMs)免受未经授权的重新分发和商业滥用至关重要。在黑盒部署中,验证受到对可疑指纹查询的防御性过滤以及可能削弱嵌入所有权证据的下游模型修改的阻碍。这些风险要求指纹在构建和注入方面都具有鲁棒性。在构建方面,先前的范式面临不可感知性的权衡:自然语言指纹可能被意外激活,而乱码指纹在统计上暴露且更容易被过滤。在注入方面,现有方法难以在模型修改下保持持久的触发-目标行为。我们提出了一个端到端的注入指纹框架来解决这些挑战。代码混合指纹(CF)在高复杂度约束下使用最低困惑度的代码混合来缓解这种双向不可感知性权衡。多候选编辑(MCEdit)构建结构冗余、间隔分离的触发-目标映射,以在模型修改下实现优雅降级。在不可感知性、可检测性和无害性方面的广泛评估表明,该框架在几乎不影响实用性的情况下实现了鲁棒的所有权验证。

英文摘要

Reliable model fingerprints are essential for protecting large language models (LLMs) against unauthorized redistribution and commercial misuse. In black-box deployment, verification is hindered by defensive filtering of suspected fingerprint queries, as well as by downstream model modifications that may weaken embedded ownership evidence. These risks require fingerprints to be robust in both construction and injection. For construction, prior paradigms face an imperceptibility trade-off: natural-language fingerprints may be accidentally activated, whereas garbled fingerprints are statistically exposed and easier to filter. For injection, existing methods struggle to preserve persistent trigger--target behaviors under model modification. We propose an end-to-end injected fingerprinting framework to address these challenges. Code-mixing Fingerprints (CF) use lowest-perplexity code-mixing under a high-complexity constraint to mitigate this two-sided imperceptibility trade-off. Multi-Candidate Editing (MCEdit) constructs structurally redundant, margin-separated trigger--target mappings to enable graceful degradation under model modification. Extensive evaluations on imperceptibility, detectability, and harmlessness demonstrate robust ownership verification with negligible impact on utility.

2602.04306 2026-06-19 cs.CL cs.AI 版本更新

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST(韩国科学技术院)

AI总结 针对大语言模型在语义等价但不同表述的提示下产生不一致偏见的问题,提出框架感知的去偏方法,通过量化框架差异并增强跨框架一致性,有效降低整体偏见并提升鲁棒性。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

随着大语言模型(LLMs)在现实应用中的日益部署,确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力,但一个持续的挑战是隐藏的偏见:LLMs 在标准评估下表现公平,但在这些评估设置之外可能产生有偏见的响应。在本文中,我们识别出框架——语义等价的提示在表达方式上的差异(例如,“A 比 B 好” vs. “B 比 A 差”)——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准,我们发现(1)公平性得分随框架变化显著,以及(2)现有的去偏方法改善了整体(即框架平均)公平性,但往往未能减少框架引起的差异。为了解决这个问题,我们提出了一种框架感知的去偏方法,鼓励 LLMs 在不同框架之间更加一致。实验表明,我们的方法减少了整体偏见,并提高了对框架差异的鲁棒性,使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

2606.07822 2026-06-19 cs.CL cs.AI cs.LG 版本更新

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:操作语言模型激活以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

2603.16941 2026-06-19 eess.AS cs.CL cs.SD 版本更新

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

言语背后的声音:量化语音大语言模型中的交叉偏见

Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely

发表机构 * 1 Department of Speech, Music Hearing, KTH Royal Institute of Technology, Sweden 2 Centre for Speech Technology Research, University of Edinburgh, UK 3 Texas A\&M University, USA

AI总结 本研究通过2880次受控交互,评估三种语音大语言模型在六种英语口音和两种性别呈现中的口音与性别交叉偏见,发现东欧口音(尤其女性)获得更低有用性评分,且人类评估者比LLM评判更敏感。

Comments 5 pages, 3 figures, 1 table, Accepted to Interspeech 2026

详情
AI中文摘要

语音大语言模型直接处理语音输入,保留了之前级联管道中去除的口音和感知性别等线索,这导致了依赖于说话者身份的反应差异。我们使用2880次受控交互(涵盖六种英语口音和两种性别呈现,通过语音克隆保持语言内容不变),对三种语音大语言模型中的口音和性别偏见进行了大规模交叉评估。通过逐点LLM评判评分、成对比较以及经过人工验证的最佳-最差缩放,我们检测到反复出现的定向差异。东欧口音的语音获得较低的有用性评分,尤其是女性呈现的语音。反应保持礼貌但在有用性上存在差异。虽然LLM评判捕捉到了这些偏见的定向趋势,但人类评估者表现出显著更高的敏感性,显示出更强的口音级别对比。

英文摘要

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect recurring directional disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. Responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, showing stronger accent-level contrasts.

2606.04075 2026-06-19 cs.LG cs.AI cs.CL cs.CR cs.CY 版本更新

Large Language Models Hack Rewards, and Society

大型语言模型攻击奖励机制与社会

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

发表机构 * King’s College London(伦敦大学国王学院) Fudan University(复旦大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 研究强化学习训练中大型语言模型利用奖励函数漏洞的“社会攻击”现象,通过SocioHack沙盒实验发现模型能发现并利用社会规则漏洞,且现有安全措施效果有限。

Comments 14 pages, 9 figures, 7 tables

详情
AI中文摘要

强化学习已成为一种主导的后训练范式,使大型语言模型能够从奖励中学习。我们观察到社会规则在结构上与奖励函数相似。它们定义了可衡量的结果、阈值和例外情况,同时往往仅部分指定了制度意图。我们假设强化学习训练过程可能利用这些漏洞,因此提出模型在强化学习期间攻击奖励函数的已知倾向是否可能扩展为一种更严重的失败模式,即社会攻击:发现社会运行规则中的漏洞。为了研究这一现象,我们引入了SocioHack,一个包含72个社会环境的沙盒,并发现这些环境中奖励攻击自然出现并导致监管漏洞的发现。模型学会攻击社会规则并生成技术上合规但违背监管意图的策略,而当前的大型语言模型安全措施仅提供有限的缓解。因此,收集真实世界反馈用于模型训练需要更加谨慎,我们需要下一代后训练范式来安全地在真实社会中迭代大型语言模型。

英文摘要

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.=

2606.16682 2026-06-19 cs.LG cs.CL 版本更新

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩:自进化智能体中的跨模态传染

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

AI总结 研究多模态自评估中偏好坍缩的加剧现象,发现跨模态传染导致策略选择扭曲,并引入传染矩阵量化风险。

Comments 19 pages, 0 figures

详情
AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时,会出现系统性偏差。我们表明,评估者偏好坍缩(EPC)在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现,我们发现单一策略(step_by_step)吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后,我们展示了一种称为跨模态传染的新现象:在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式,我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置(总计53次独立重复,15,592次API调用)的第3阶段统计验证揭示了一个清晰的层次结构:跨模型评估(GPT-4o,N=8)产生强但对称的双向传染(平均gamma_{T->V}=1.176,gamma_{V->T}=1.089,Delta=-0.088,p=0.575,Cohen's d=0.29);高轮次(DashScope,50轮)导致坍缩为单一策略主导(70%零传染);而自评估提供近乎完全的免疫——97%的运行(N=30,DeepSeek-chat)产生恰好为零的传染(平均gamma=0.033,95% CI [-0.031, 0.010],p=0.642,d=0.07)。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵,发布了MM-EPC实验框架,并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.

11. 低资源、领域适配与高效训练 1 篇

2104.08928 2026-06-19 stat.ML cs.CL cs.LG 版本更新

Group-Sparse Matrix Factorization for Transfer Learning of Word Embeddings

面向词嵌入迁移学习的组稀疏矩阵分解

Kan Xu, Xuanyi Zhao, Hamsa Bastani, Osbert Bastani

发表机构 * W. P. Carey School of Business, Arizona State University(亚利桑那州立大学韦伯商学院) University of Pennsylvania(宾夕法尼亚大学) Wharton School, University of Pennsylvania(宾夕法尼亚大学沃顿商学院)

AI总结 提出一种基于组稀疏惩罚的两阶段估计器,通过结合大规模语料和少量领域数据高效迁移学习领域特定的词嵌入,并证明了其泛化误差界和非凸目标函数的局部最优与全局最优统计等价。

详情
AI中文摘要

非结构化文本为许多领域的决策者提供了丰富的数据源,从零售中的产品评论到医疗保健中的护理记录。为了利用这些信息,单词通常通过无监督学习算法(如矩阵分解)转化为词嵌入——编码单词之间语义关系的向量。然而,从训练数据有限的新领域学习词嵌入可能具有挑战性,因为在新领域中含义/用法可能不同,例如,单词“positive”通常具有积极情感,但在医疗记录中通常具有消极情感,因为它可能意味着患者检测出疾病阳性。在实践中,我们预计只有少数领域特定的单词可能具有新含义。我们提出了一种直观的两阶段估计器,通过组稀疏惩罚利用这种结构,通过结合大规模文本语料库(如维基百科)和有限的领域特定文本数据,高效地迁移学习领域特定的词嵌入。我们限定了迁移学习估计器的泛化误差,证明当只有少量嵌入在领域间改变时,它可以用显著更少的领域特定数据实现高精度。此外,我们证明了在标准正则化条件下,由非凸目标函数识别的所有局部最小值与全局最小值在统计上不可区分,这意味着我们的估计器可以高效计算。我们的结果首次给出了组稀疏矩阵分解的界限,这可能具有独立意义。我们通过与自然语言处理中最先进的微调启发式方法进行实证比较来评估我们的方法。

英文摘要

Unstructured text provides decision-makers with a rich data source in many domains, ranging from product reviews in retail to nursing notes in healthcare. To leverage this information, words are typically translated into word embeddings -- vectors that encode the semantic relationships between words -- through unsupervised learning algorithms such as matrix factorization. However, learning word embeddings from new domains with limited training data can be challenging, because the meaning/usage may be different in the new domain, e.g., the word ``positive'' typically has positive sentiment, but often has negative sentiment in medical notes since it may imply that a patient tested positive for a disease. In practice, we expect that only a small number of domain-specific words may have new meanings. We propose an intuitive two-stage estimator that exploits this structure via a group-sparse penalty to efficiently transfer learn domain-specific word embeddings by combining large-scale text corpora (such as Wikipedia) with limited domain-specific text data. We bound the generalization error of our transfer learning estimator, proving that it can achieve high accuracy with substantially less domain-specific data when only a small number of embeddings are altered between domains. Furthermore, we prove that all local minima identified by our nonconvex objective function are statistically indistinguishable from the global minimum under standard regularization conditions, implying that our estimator can be computed efficiently. Our results provide the first bounds on group-sparse matrix factorization, which may be of independent interest. We empirically evaluate our approach compared to state-of-the-art fine-tuning heuristics from natural language processing.

12. 其他/综合NLP 11 篇

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 新提交

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research(英伟达研究院)

AI总结 提出基于问题可解性的错误分类法,揭示LLM在RTL编码中受限于预训练知识,对齐技术仅教会编译,而推理能力才是关键瓶颈。

Comments Preview, under submission for EMNLP 2026

详情
AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型(LLM)的关键瓶颈。为了研究这一点,我们引入了一种新的错误分类法,该分类法基于问题可解性,受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限,前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义,暴露出对测试时计算扩展免疫的持续知识差距。此外,我们揭示了一个显著的表面收敛差距:优化容易消除语法错误,但同时加剧了更深层次的功能失败。我们的发现表明,对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误,但寄存器传输级(RTL)编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究,而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models(LLM). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level(RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.19647 2026-06-19 cs.CL cs.CY cs.SI 新提交

From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

从5万到820万在24小时内:Vozinha的算法封圣与世界杯可见性的多语言构建

Vinicius Covas

发表机构 * Universidad Anáhuac México(墨西哥阿纳瓦克大学)

AI总结 通过多语言语料库和九框架叙事分类法,分析2026年世界杯后Vozinha的算法封圣过程,揭示不同语言承载不同叙事框架,将平台粉丝数作为语言对象研究可见性构建。

Comments 11 pages, 4 figures, 3 tables; v0.1 pilot preprint. Dataset and evidence package available at https://doi.org/10.5281/zenodo.20722235

详情
AI中文摘要

我们提出了一项多语言计算话语分析,研究语言如何构建了Vozinha——这位40岁的佛得角门将在2026年世界杯西班牙0-0佛得角比赛后的算法封圣。该研究贡献了一个包含葡萄牙语、西班牙语、英语和法语的多语言语料库;一个基于线索的九框架叙事分类法;一个结合LLM辅助建议与人工验证的可复现标注流程;以及跨话语阶段的多语言叙事扩散分析。我们将平台粉丝数本身——被叙述为“从5万到800万”——视为一个语言对象:一种流通且可叙述的可见性证明,而非单纯的测量。粉丝增长时间线仅作为上下文元数据使用:我们重构了一个保守的阶段结构,而非连续的API原生序列,并对每个数据点按值类别、置信度和证据类型进行标注。唯一精确的主要爬取锚点是2026年6月16日15:47 UTC的8,235,652粉丝;所有其他数字均报告为估计范围或阈值,包括估计的赛前基线45k-56k。研究结果表明,不同语言承载了不同的框架:葡萄牙语的动员、西班牙语的危机、英语的民族构建,以及共享的平台指标奇观,通过这种奇观,边缘的体育表现变得全球可见。作为v0.1试点,本文发布了语料库模式、框架分类法、标注指南、哈希视觉证据日志和类型化时间线,同时将完整的双重标注和标注者间一致性标记为计划工作。

英文摘要

We present a multilingual computational discourse analysis of how language constructed the algorithmic consecration of Vozinha, the 40-year-old Cape Verde goalkeeper, after Spain 0-0 Cape Verde at the 2026 FIFA World Cup. The study contributes a multilingual corpus in Portuguese, Spanish, English, and French; a nine-frame narrative taxonomy with cue-based frame annotation; a reproducible annotation pipeline combining LLM-assisted suggestion with human validation; and an analysis of cross-lingual narrative diffusion across discourse phases. We treat the platform follower count itself, narrated as "50k to 8M", as a linguistic object: a circulating and narratable proof of visibility rather than a mere measurement. The follower-growth timeline is used only as contextual metadata: we reconstruct a conservative phase structure, not a continuous API-native series, and type every datapoint by value class, confidence, and evidence type. The only exact primary scraper anchor is 8,235,652 followers at 2026-06-16 15:47 UTC; all other figures are reported as estimated ranges or thresholds, including an estimated pre-match baseline of 45k-56k. Findings suggest that distinct languages carried distinct frames: Portuguese mobilization, Spanish crisis, English nation-making, and a shared platform-metric spectacle through which peripheral athletic performance became globally visible. As a v0.1 pilot, the paper releases the corpus schema, frame taxonomy, annotation guidelines, hashed visual-evidence log, and typed timeline, while flagging full double annotation and inter-annotator agreement as planned work.

2606.19864 2026-06-19 cs.CL 新提交

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

近乎智能的革命:扩大审议规模并利用AI赋能人类的选项

Serge Sharoff

AI总结 探讨大型语言模型如何通过系统功能语言学视角扩大民主审议规模,增强包容性并赋权边缘群体,同时警惕过度承诺与低估风险。

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

详情
AI中文摘要

大型语言模型在公共话语中的日益突出为民主审议带来了机遇和挑战。虽然红队策略有助于缓解特定风险,但关于语言限制、偏见和LLM的谄媚倾向等更广泛的担忧仍然存在。本章探讨如何利用LLM显著扩大和民主化审议,特别是在促进包容性和赋权传统边缘群体方面。借鉴系统功能语言学的概念,本章考察了语言使用者之间的差异(例如,关于社会人口群体)和语言使用中的差异(例如,关于交际功能)如何影响AI支持的审议参与。本章介绍了AI驱动的审议研究,并评估了它们在支撑论证、增强可及性以及减少嵌入在声望语域中的排斥性语言规范和偏见的影响方面的潜力。同时,本章警告不要过度承诺(导致不切实际的期望)和低估承诺(冒着错失AI辅助参与机会的风险)。最后,本章确定了未来的研究方向,以最大化AI辅助参与的民主潜力,同时嵌入伦理保障以抵消语言不平等的再生产。

英文摘要

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

2606.20198 2026-06-19 cs.CL 新提交

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

爵士乐领谱、独奏转录、古典钢琴与单声部乐谱的音高拼写

Augustin Bouquillard, Florent Jacquemard

发表机构 * École polytechnique(巴黎综合理工学院) INRIA(法国国家信息与自动化研究所)

AI总结 提出一种音高拼写与调性估计算法,通过两阶段优化(模态与调性)联合估计音符名称、全局调号和每小节局部音阶,在多种数字乐谱数据集上验证有效性。

详情
AI中文摘要

我们提出了一种用于音高拼写和调性估计的算法。给定MIDI格式的输入,包含音符音高(以半音表示,相对于最低参考音)和小节边界信息,该算法估计适当的音符名称、全局调号以及每小节的局部音阶。这些相关信息元素在两个优化阶段中联合评估。在初始的“模态”阶段,通过最短路径搜索为每个小节提出一个可能的音阶,以最小化印刷乐谱中的临时记号数量。然后,在称为“调性”的第二阶段,这些局部音阶被用于估计调号和音符名称,从而为整首作品生成最佳音乐记谱。我们在包含多种数字乐谱的数据集上进行了评估:来自《Real Book》的爵士领谱、爵士独奏和贝斯线的录音转录、传统曲调,以及钢琴和单声部乐器的古典乐谱。我们的程序最初设计用于音乐转录,特别是构建从音频录音转录的爵士独奏数字集合,用于音乐分析、教学和文化遗产保护。该方法也应有助于其他与音乐记谱处理相关的任务。此外,为此我们定义了各种常见爵士音阶之间的新距离,这可能对音乐学研究有一定意义。

英文摘要

We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. This related information elements are evaluated jointly during two stages of optimisation. During an initial 'modal' stage, a probable scale is proposed for each bar, minimising the number of accidentals to be printed in the printed score with a shortest-path search. Then, during a second stage called 'tonal', these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.

2512.03818 2026-06-19 cs.CL 版本更新

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

改善人机编码对齐:心理学构念识别中提示工程的实证评估

Kylie L. Anglin, Stephanie Milan, Brittney Hernandez, Claudia Ventura

发表机构 * Department of Educational Psychology, Neag School of Education, University of Connecticut(教育心理学系,教育学院,康涅狄格大学) Department of Psychological Sciences, College of Liberal Arts and Sciences, University of Connecticut(心理学系,文理学院,康涅狄格大学)

AI总结 本研究提出一个实证框架,通过提示工程优化大语言模型在心理学文本中识别构念的性能。实验评估五种提示策略,发现构念定义和任务框架最关键,结合代码簿引导和自动提示工程的少样本方法最接近专家判断。

Comments 22 pages, 2 figures

详情
AI中文摘要

由于其架构和庞大的预训练数据,大语言模型(LLMs)表现出强大的文本分类性能。然而,LLM的输出——这里指分配给文本的类别——在很大程度上取决于提示的措辞。尽管关于提示工程的文献正在扩展,但很少有研究关注分类任务,更少有研究涉及心理学等领域,在这些领域中,构念具有精确的、理论驱动的定义,而这些定义可能未在预训练数据中得到充分体现。我们提出了一个实证框架,通过提示工程优化LLM在文本中识别构念的性能。我们实验评估了五种提示策略——代码簿引导的实证提示选择、自动提示工程、角色提示、思维链推理和解释性提示——采用零样本和少样本分类。我们发现,角色、思维链和解释并不能完全解决因措辞不当的提示而导致的性能损失。相反,提示中最有影响力的特征是构念定义、任务框架,以及在较小程度上提供的示例。在三个构念和两个模型中,与专家判断最一致的分类来自结合代码簿引导的实证提示选择和自动提示工程的少样本提示。基于我们的发现,我们建议研究人员生成并评估尽可能多的提示变体,无论是人工编写的、自动生成的,或者理想情况下两者兼有,并根据训练数据集中的实证性能选择提示和示例,在保留集中验证最终方法。该程序提供了一种实用、系统且理论驱动的方法,用于在需要与专家判断对齐的环境中优化LLM提示。

英文摘要

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies -- codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting - with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.

2512.18859 2026-06-19 cs.CL 版本更新

Toward Human-Centered AI-Assisted Terminology Work

迈向以人为中心的AI辅助术语工作

Antonio San Martin

发表机构 * Universite du Quebec à Trois-Rivieres(魁北克大学三河分校)

AI总结 本文提出以人为中心的人工智能框架,在利用生成式AI自动化术语工作的同时,通过增强术语学家能力、保持人类控制权来确保术语数据的准确性和可靠性。

Comments Accepted for publication in the journal Terminology

详情
AI中文摘要

生成式AI可能通过创造自动化新机会来改变术语工作。同时,它引发了对术语学家和术语资源未来的担忧,因为效率压力可能鼓励过度自动化,认为人类专业知识可被AI取代。然而,由于错误、幻觉和各种形式的偏见,大型语言模型在术语目的上仍然不可靠,使得术语学家在确保术语数据的准确性和可靠性方面不可或缺。本文认为,以人为中心的AI(强调AI的主要目标应是促进人类福祉的方法)提供了一个框架,可以在最大化生成式AI收益的同时减轻其风险。它主张高水平的自动化和有意义的人类控制是兼容且可取的,AI应增强术语学家的能力,同时保留他们的自主权和决策权。通过三个相互关联的维度——增强的术语学家、伦理AI和以人为中心的设计——审视了AI辅助术语工作的影响。特别是,本文探讨了AI整合如何重塑术语学家的角色,影响专业价值观和工作条件,要求管理AI产生的偏见,并呼吁围绕术语学家的需求设计AI工具。本文得出结论,以人为中心的方向是必要的,以确保AI加强而非削弱术语工作在支持专业交流以及跨语言和跨文化准确传播知识中的关键作用。

英文摘要

Generative AI is likely to transform terminology work by creating new opportunities for automation. At the same time, it raises concerns about the future of terminologists and terminological resources, as efficiency pressures may encourage excessive automation based on the perception that human expertise can be replaced by AI. However, large language models remain unreliable for terminological purposes due to errors, hallucinations, and various forms of bias, making terminologists indispensable for ensuring the accuracy and reliability of terminological data. This paper argues that human-centered AI, an approach that emphasizes that AI's primary goal should be to contribute to human well-being, provides a framework for maximizing the benefits of generative AI while mitigating its risks. It contends that high levels of automation and meaningful human control are compatible and desirable, and that AI should enhance terminologists' capabilities while preserving their agency and decision-making authority. The implications of AI-assisted terminology work are examined through three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. In particular, the paper examines how AI integration reshapes the role of the terminologist, affects professional values and working conditions, requires the management of AI-generated bias, and calls for the design of AI tools around the terminologist's needs. The paper concludes that a human-centered orientation is necessary to ensure that AI strengthens, rather than undermines, the essential role of terminology work in supporting specialized communication and the accurate transmission of knowledge across languages and cultures.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of ``hypothetical thinking'' in psychology literature, we argue the primary goal of a world model to be {\it simulating all actionable possibilities of the real world for purposeful reasoning and acting}. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2606.18941 2026-06-19 cs.PL cs.CL 版本更新

ESBMC-GraphPLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC:使用基于SMT的模型检查对图形化PLCopen XML梯形图程序进行形式验证

Pierre Dantas, Lucas Cordeiro, Waldir Junior

发表机构 * Computer Science, The University of Manchester(计算机科学,曼彻斯特大学) Electrical Engineering, Federal University of Amazonas (UFAM)(电气工程,亚马逊联邦大学(UFAM))

AI总结 针对ESBMC-PLC无法处理图形化PLCopen XML梯形图的问题,提出基于DFS的图形LD解析器,将连接图转换为布尔触点合取,并采用三级I/O推断方案,成功实现完整GOTO IR转换,验证了3个图形LD程序。

Comments 18 pages

详情
AI中文摘要

PLCopen XML为IEC 61131-3梯形图程序定义了两种编码格式:一种使用<rung>元素的文本编码,另一种将梯形逻辑表示为localId/refLocalId连接的有向图的图形编码。ESBMC-PLC支持文本格式,但将来自CONTROLLINO、Beremiz和OpenPLC Editor的图形导出解析为空GOTO中间表示,导致空洞的验证成功。本文提出Graph-ESBMC-PLC,通过基于DFS的图形LD解析器填补了这一空白。该解析器从leftPowerRail遍历连接图到每个线圈,将梯形路径提取为布尔触点合取,并应用三级I/O推断方案。按rightPowerRail的connectionPointIn序列对线圈排序,确保SET线圈在RESET线圈之前处理,匹配IEC扫描周期语义。图形到IR的转换无需改动ESBMC后端。在来自CONTROLLINO/OpenPLC Editor的3个图形LD程序上的验证表明,所有程序都生成了包含非确定性输入和梯形逻辑的完整GOTO IR,而之前生成的是空IR。所有3个程序在k=2时在70ms内验证为SAFE。11个文本LD基准测试完全保留,无回归。两个不含LD内容或不支持定时器语义的Beremiz示例被报告为发现的局限性。工件位于Zenodo(DantasCordeiro2026graphical,doi: https://doi.org/10.5281/zenodo.20699856)。

英文摘要

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents ESBMC-GraphPLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

2511.23071 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

发表机构 * Indian Institute of Technology Jodhpur(印度理工学院朱道尔)

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

详情
英文摘要

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

2406.15465 2026-06-19 cs.CL cs.AI 版本更新

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

发表机构 * Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland(以患者为中心的数字健康研究所,伯恩应用科学大学,比尔,瑞士) ID Suisse AG, St. Gallen, Switzerland(ID瑞士股份有限公司,圣加尔,瑞士)

详情
英文摘要

Annually and globally, over three billion radiography examinations and computer tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints and potential loss of information. However, structured information would be necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance as components are easily exchangeable, while standardized artifacts ensure interoperability between components.

2306.12679 2026-06-19 cs.CL 版本更新

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

发表机构 * Faculty member in ICT Research Institute(ICT研究所教员) Iran Telecommunication Research Center (ITRC)(伊朗电信研究中心) Faculty member in Computer Department(计算机系教员) Mehralborz University(梅赫拉布尔兹大学) Hazrat-e Masoumeh University(玛苏姆大学)

Journal ref Multimedia Tools and Applications, 2025

详情
英文摘要

Introduction: Microblogging websites have massed rich data sources for sentiment analysis and opinion mining. In this regard, sentiment classification has frequently proven inefficient because microblog posts typically lack syntactically consistent terms and representatives since users on these social networks do not like to write lengthy statements. Also, there are some limitations to low-resource languages. The Persian language has exceptional characteristics and demands unique annotated data and models for the sentiment analysis task, which are distinctive from text features within the English dialect. Method: This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and insource way. Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed datasets are used to evaluate the presented architecture. Furthermore, some models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU with different word embeddings, including Fasttext, Glove, and Word2vec, investigated our dataset and evaluated the results. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), displaying meaningful improvement in sentiment classification performance.