arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

语言大模型 / LLM

大语言模型、预训练、指令微调、后训练和语言模型应用。

今日/当前日期收录 97 信号源:cs.CL, cs.AI, cs.LG

1. 预训练 7 篇

2606.19348 2026-06-19 cs.CL cs.AI 新提交 95%

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-V4: 迈向高效百万令牌上下文智能

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chengyu Hou, Chenhao Xu, Chenze Shao, Chong Ruan, Conner Sun, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Donghao Li, Dongjie Ji, Erhang Li, Fang Wei, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei, Haotian Yuan, Haowei Zhang, Haowen Luo, Haoyu Chen, Haozhe Ji, Hengqing Zhang, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J Yang, JQ Zhu, Jia Luo, Jia Song, Jia Yu, Jialiang Huang, Jialu Cai, Jian Liang, Jiangting Zhou, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jieyu Yang, Jin Chen, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jingzi Zhou, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junmin Zheng, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Leyi Xia, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Linyan Zhu, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, MS Di, MY Xu, Max Mei, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Mingxu Zhou, Minmin Han, Ning Wang, Panpan Huang, Panpan Wang, Peixin Cong, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, Ruyi Chen, SH Liu, Shanghao Lu, Shangmian Sun, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoqing Wu, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W Zhang, WL Xiao, Wangding Zeng, Wei An, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wenqing Hou, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyu Zhang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xueyin Chen, Xuheng Lin, Xuwei Fu, YC Yan, YQ Wang, YW Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Xu, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Shao, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Yiling Ma, Ying He, Ying Tang, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, YuKun Li, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuehan Yang, Yuer Xu, Yuhan Wu, Yuhao Meng, Yuheng Zou, Yukun Zha, Yunfan Xiong, Yupeng Chen, Yuping Lin, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yuting Yan, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, ZF Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zekai Zhang, Zhangli Sha, Zhe Fu, Zhe Ju, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zheren Gao, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhongyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihua Qu, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Ziyi Wan, Zizheng Pan, Zongqing Yao

发表机构 * DeepSeek-AI(深度求索人工智能)

专题命中 预训练 :百万token上下文MoE模型,架构优化

AI总结 提出DeepSeek-V4系列MoE模型,通过混合注意力架构、流形约束超连接和Muon优化器,实现百万令牌上下文的高效推理,在核心任务上超越前代。

详情
AI中文摘要

我们展示了DeepSeek-V4系列的预览版本,包括两个强大的混合专家(MoE)语言模型——DeepSeek-V4-Pro(1.6T参数,49B激活)和DeepSeek-V4-Flash(284B参数,13B激活),两者均支持一百万个令牌的上下文长度。DeepSeek-V4系列在架构和优化方面引入了多项关键升级:(1)混合注意力架构,结合压缩稀疏注意力(CSA)和重度压缩注意力(HCA),以提高长上下文效率;(2)流形约束超连接(mHC),增强传统残差连接;(3)Muon优化器,实现更快的收敛和更高的训练稳定性。我们在超过32T多样且高质量的令牌上预训练了两个模型,随后通过全面的后训练流程解锁并进一步增强其能力。DeepSeek-V4-Pro-Max是DeepSeek-V4-Pro的最大推理努力模式,重新定义了开放模型的最先进水平,在核心任务上超越了其前代。同时,DeepSeek-V4系列在长上下文场景中非常高效。在百万令牌上下文设置下,与DeepSeek-V3.2相比,DeepSeek-V4-Pro仅需27%的单令牌推理FLOPs和10%的KV缓存。这使得我们能够常规支持百万令牌上下文,从而使长时任务和进一步的测试时扩展更加可行。模型检查点可从此https URL获取。

英文摘要

We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

2606.20381 2026-06-19 cs.AI 新提交 90%

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

重新思考LLM FP4预训练中的收缩偏差:几何起源、系统影响与UFP4方案

Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

发表机构 * Ling Team, Ant Group(蚂蚁集团灵团队)

专题命中 预训练 :研究LLM FP4预训练中的收缩偏差与优化方案。

AI总结 本文发现E2M1格式因几何不对称导致收缩偏差,该偏差经随机哈达玛变换放大,造成训练不稳定;提出均匀网格E1M2/INT4及UFP4训练方案,在多种模型上实现更低损失。

Comments 18 pages, 12 figures

详情
AI中文摘要

FP4训练有望大幅减少LLM预训练的内存和计算成本,然而当前的FP4硬件路径和方案,包括NVIDIA Blackwell/Rubin级系统和AMD MI350系列GPU,仍以E2M1数据元素为中心。在本研究中,我们识别出该选择的一个根本限制:诸如E2M1的非均匀格式固有地遭受收缩偏差,这是一种由其可表示区间的几何不对称性导致的系统性负舍入误差。我们证明该偏差在层间乘性累积,并被随机哈达玛变换(RHT)放大,为现有基于E2M1的FP4方案中观察到的训练不稳定性提供了统一解释。相比之下,均匀网格(E1M2/INT4)绕过了这种网格几何误差,并能更好地将RHT改进的桶利用率转化为更高的量化质量。基于这一发现,我们提出UFP4,一种均匀4位训练方案,它将RHT应用于所有三个训练GEMM,同时仅对dY施加随机舍入。在Dense 1.5B、MoE 7.9B和MoE 124B的长程预训练中,UFP4始终比强E2M1基线实现更低的BF16相对损失退化,这得到了缩放定律分析和消融研究的支持。我们的结果表明,未来的加速器应支持E1M2/INT4风格的均匀4位网格作为与E2M1并列的一等训练原语。

英文摘要

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

2606.20089 2026-06-19 cs.CL cs.AI 新提交 90%

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

IHUBERT: 面向波斯语资源的基于向量的语义去重与领域平衡预训练

Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar

发表机构 * Department of Artificial Intelligence and Cognitive Science, Imam Hossein Comprehensive University(人工智能与认知科学系,伊玛目·侯赛因综合大学)

专题命中 预训练 :波斯语预训练语言模型

AI总结 提出IHUBERT,一个基于RoBERTa-base的波斯语预训练模型,通过多阶段预处理(包括基于向量数据库的语义去重和领域平衡)在45GB语料上训练,在多项NLU任务上取得领先结果,尤其抽取式问答表现突出。

详情
AI中文摘要

波斯语预训练语言模型仍然受到大规模高质量预训练语料库稀缺以及标准分类和NER任务之外评估不足的限制。我们提出了IHUBERT,一个从头训练的波斯语单语PLM,采用RoBERTa-base编码器(1.25亿参数),在Sepahr-Danesh集合的45GB精选子集(约70-80亿token)上进行训练。为了提高语料质量并减少冗余,我们采用多阶段预处理流程,包括规范化、精确和近似重复去除、匿名化,以及基于向量数据库的语义去重,以实现跨领域和语体的分布平衡控制。我们还在完整的预训练语料库上训练了一个13.9万词汇量的BPE分词器,以更好地捕捉波斯语的形态和拼写变化。IHUBERT在七个波斯语NLU基准测试上进行评估,涵盖NER、情感分析、主题分类、NLI、抽取式问答和关系抽取,使用任务标准指标(实体级F1、宏F1、EM/F1)。IHUBERT在抽取式QA上取得了最强增益,在PQuAD(F1 88.3542)和ParsiNLU-RC(F1 49.0987)上均排名第一,并在FarsTail上取得了最佳结果(宏F1 0.8350)。在NER和主题分类上,它保持竞争力(例如,ParsTwiNER上F1 0.8308;DigiMag上宏F1 0.7953),而关系抽取仍然是主要差距(PERLEX上宏F1 0.6684)。在IHUBERT预训练语料库上的受控分词器消融实验表明,在匹配词汇量下,BPE产生的子词碎片化程度略低于WordPiece,支持了我们的分词设计。总体而言,IHUBERT通过语义精选的大规模预训练以及跨分类和理解型任务的广泛评估,推进了波斯语语言建模。

英文摘要

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.

2606.19993 2026-06-19 cs.LG 新提交 85%

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

激活与影响感知秩 (AIR):保持功能的SVD压缩用于大语言模型

Nico Harder, Daniel Becking, Karsten Mueller, Wojciech Samek

发表机构 * Fraunhofer HHI(弗劳恩霍夫研究所)

专题命中 预训练 :提出LLM压缩框架,提升模型效率

AI总结 提出AIR框架,基于SVD和反向信号影响度量,通过单次交替最小二乘扫描实现权重矩阵的低秩近似,在参数保留≤60%时困惑度比SVD-LLM(W)改善>18%,并减少90%校准数据。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea (non-archival)

详情
AI中文摘要

我们提出了激活与影响感知秩(AIR),一个基于SVD的大语言模型压缩框架,它使用反向信号影响度量来指导每个权重矩阵的低秩近似。从SVD-LLM(W)的激活感知最优解出发,AIR运行单次封闭形式的交替最小二乘(ALS)扫描,在单调下降保证下逐元素整合影响。AIR是层局部的,并与端到端方法正交组合:单独使用时超过ACIP,AIR+LoRA进一步超越。AIR在参数保留≤60%时,困惑度比SVD-LLM(W)改善超过18%,使用约90%更少的校准数据达到相同质量,并将参数节省转化为FLOP、峰值内存和每令牌延迟的收益。

英文摘要

We present Activation- and Influence-Aware Ranks (AIR), an SVD-based LLM compression framework that guides each weight matrix's low-rank approximation with a backward-signal influence metric. Starting from the activation-aware optimum of SVD-LLM(W), AIR runs a single closed-form alternating least squares (ALS) sweep that integrates influence element-wise under a monotone-descent guarantee. AIR is layer-local and composes orthogonally with end-to-end methods: alone it exceeds ACIP, and AIR+LoRA outperforms it further. AIR improves perplexity over SVD-LLM(W) by >18% at <=60% parameter retention, matches its quality with ~90% less calibration data, and turns parameter savings into FLOP, peak-memory, and per-token latency gains.

2606.19491 2026-06-19 cs.LG stat.ML 新提交 85%

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

LayerNorm Transformer 中的代数死方向:一种仅需前向传播的大语言模型规模诊断方法

Tejas Pradeep Shirodkar, P. J. Narayanan

发表机构 * IIIT, Hyderabad(海得拉巴国际信息技术学院)

专题命中 预训练 :研究LayerNorm变换器的死方向,涉及预训练模型诊断。

AI总结 本文发现 LayerNorm 的逆尺度方向是后最终归一化中心激活协方差矩阵的精确代数核,可仅从参数中读取死方向,无需前向或后向传播,并在 14 个预训练模型上验证了其有效性。

Comments 34 pages, 7 figures, 6 tables. Empirical companion to arXiv:2606.05957

详情
AI中文摘要

预训练 Transformer 位于损失函数的奇异极小值附近,此时 Fisher 信息度量沿死方向退化:参数空间中方向性 Fisher 为零的方向。通常定位这样的方向需要一次前向传播和激活矩阵的特征分解,或基于采样的复杂度估计;没有一种方法能仅从网络参数计算方向。我们针对 LayerNorm Transformer 给出了一个这样的方向。LayerNorm 仿射的逆尺度方向 $\gamma^{-1}/\|\gamma^{-1}\|$ 是后最终归一化中心激活协方差矩阵的精确代数核,适用于任何输入分布,并在参数空间中诱导出相应的死方向。它仅从 LN 尺度参数读取,无需前向或后向传播,无需特征分解:这是针对 LayerNorm 的最廉价死方向读取方法。我们在 14 个预训练 Transformer(9 个 LayerNorm,5 个 RMSNorm;160M-35B;语言和视觉目标)上进行了测试。在随机初始化时,预测方向与测量的底部奇异方向(一次前向传播,直接 SVD)在 9/9 的 LayerNorm 模型上匹配到小数点后四位,并在 5/5 的 RMSNorm 模型上正确缺失,后者缺乏产生该方向的均值减法投影器。在训练后的检查点上,沿该方向的协方差特征值加深约 ${\sim}10^3$ 倍,并打开更多死方向;随机初始化到训练后的差距是一次前向传播、每检查点沿预测坐标的奇异结构读出。由此得出两个闭式结论:残差流的最小奇异值在 13/14 个 Transformer 上逐块保持不变(在其自身输入分布上测量),唯一的例外(Gemma$4$-$31$B)是一个真正的死方向,同一读出可精确定位;核方向的存在从参数本身即可对 Transformer 的归一化进行分类。

英文摘要

Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $γ^{-1}/\|γ^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on $14$ pretrained transformers ($9$ LayerNorm, $5$ RMSNorm; $160$M-$35$B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on $9/9$ LayerNorm models, and is correctly absent on $5/5$ RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ${\sim}10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on $13/14$ transformers measured on their own input distribution, the one exception (Gemma$4$-$31$B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.

2606.19468 2026-06-19 cs.CL 新提交 85%

Characterizing Narrative Content in Web-scale LLM Pretraining Data

网络规模LLM预训练数据中的叙事内容特征化

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

发表机构 * University of Colorado Boulder(科罗拉多大学波尔德分校) ETH Zürich(苏黎世联邦理工学院) McGill University(麦吉尔大学)

专题命中 预训练 :细粒度研究LLM预训练语料库的叙事特征。

AI总结 首次细粒度研究LLM预训练语料库Dolma的叙事特征,提出涵盖三个核心叙事元素(能动性、场景、事件)的框架,构建NarraBERT模型并发布NarraDolma数据集,揭示叙事结构在异构数据中可测量且分布不均。

Comments 8 pages of main content, 28 total pages. 30 figures

详情
AI中文摘要

尽管叙事是人类交流的基本模式,但网络规模LLM预训练语料库的叙事组成仍然很大程度上未被探索。我们首次对Dolma(一个3万亿词元的开放预训练语料库)中的叙事特征进行了细粒度研究。借鉴叙事理论,我们设计了一个框架,涵盖三个核心叙事元素(能动性、场景和事件),并将其操作化为11个可解释维度。在采样并标注了400个多样化的段落之后,我们微调并验证了NarraBERT,一个基于RoBERTa的细粒度叙事预测模型。我们将NarraBERT应用于300万个段落,生成了新数据集NarraDolma。我们发现:(i) 叙事结构在极度异构的数据中是可大规模测量的;(ii) 我们揭示了网络文本背后连续的多维叙事结构;(iii) 叙事质量在预训练来源和主题之间分布不均,而当前的策展实践既未测量也未考虑这一点。我们的框架、数据集和分析为理解LLM预训练数据中叙事质量的分布以及研究数据组成如何影响叙事推理任务提供了基础。我们公开发布了NarraDolma和NarraBERT。

英文摘要

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19989 2026-06-19 cs.DC cs.LG 新提交 80%

Online Dynamic Batching with Formal Guarantees for LLM Training

面向LLM训练的具有形式保证的在线动态批处理

Dian Li, Zekun Wang, Yaoru Wang, Jiahong Yan

发表机构 * Tencent(腾讯)

专题命中 预训练 :提出在线动态批处理系统加速LLM训练

AI总结 提出在线动态批处理(ODB)系统,在数据加载器侧将批构建延迟到样本真实成本可观测时,解决离线批采样中预处理成本不可见问题,实现1.58-4.43x吞吐量提升,并提供无死锁有界终止的形式化保证。

Comments 29 pages, 3 figures, 21 tables

详情
AI中文摘要

现代LLM训练打破了离线批采样器背后的一个核心假设:样本的真实训练成本只有在预处理、增强、模板化、分词和多模态视觉标记扩展之后才能观察到。除非为依赖于预处理和增强的长度缓存付费,否则批构建对于决定填充、内存使用和GPU饱和度的量是盲目的。我们引入了在线动态批处理(ODB),这是一个数据加载器侧的即插即用系统,它将批形成移动到这一精确可观测性点,同时保持DDP步骤对齐。我们将这一同步需求形式化为分布式组对齐问题,并证明了在默认加入模式身份覆盖和可选非加入样本配额封闭下的无死锁有界终止。ODB不需要修改模型、优化器或注意力核,并以轻量级训练器适配器的形式发布为online-dynamic-batching。在UltraChat/LLaVA/ShareGPT4o上对公开的2B/8B Qwen3-VL进行的实验中,与固定批Standard相比,ODB在单节点全量微调/LoRA上实现了1.58-2.51倍的逐字样本吞吐量提升,在两节点全量微调上实现了1.71-3.78倍提升,质量与Standard相当;生产环境MM-Mix达到4.43倍。与GMT/BMT离线令牌预算预言机相比,ODB在UltraChat/LLaVA上差距在15%以内,在高变异系数的ShareGPT4o上更快:单节点全量微调/LoRA为2.24-2.39倍,两节点全量微调为3.06-3.69倍。总之,ODB占据了高异质性LLM微调的在线/即插即用领域:在质量与Standard相当的情况下实现大幅吞吐量提升,提供形式化的DGAP保证,无需长度缓存预计算或核重写。

英文摘要

Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.

2. 长上下文 1 篇

2606.20097 2026-06-19 cs.CL 新提交 90%

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

HydraHead:从头部级功能异质性到专业化注意力混合

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

发表机构 * Alibaba Group(阿里巴巴集团)

专题命中 长上下文 :长上下文注意力混合架构

AI总结 提出HydraHead架构,沿头部维度混合全注意力和线性注意力,通过可解释性驱动的头部选择和尺度归一化融合模块,在长上下文任务中优于层级混合设计,仅用15B token训练即在512K上下文长度上提升69%。

详情
AI中文摘要

注意力的二次复杂度对长上下文处理构成了关键瓶颈,激发了混合注意力设计的兴趣。大多数开源混合模型采用层级策略。然而,先前工作注意到线性注意力与全注意力整合的内在困难,表明注意力混合的设计空间仍未充分探索。为了探索这一空间,我们进行可解释性分析,观察到层表现出块级功能相似性,而同一层内的单个头部尽管共享输入特征,却显示出不同的功能专门化。这种头部级异质性表明,头部维度为融合异质注意力信号提供了自然且原则性的粒度。基于这一洞察,我们引入了HydraHead,一种沿头部轴混合全注意力和线性注意力的新型架构。HydraHead具有两个关键创新:(1)一种可解释性驱动的选择策略,识别检索关键的头部并仅为其保留全注意力;(2)一种尺度归一化融合模块,调和全注意力和线性注意力头部输出之间的分布差距。通过利用参数重用和蒸馏的三阶段迁移流程,我们以最小的训练开销实现了高性能混合模型。在统一的训练设置下,HydraHead在长上下文任务中优于其他混合设计,同时保持强大的通用推理能力。通过可解释性驱动的头部选择,它以7:1的线性注意力与全注意力比例匹配了3:1层级混合的长上下文性能。关键的是,仅用15B token训练,HydraHead在512K上下文长度上比基线提升超过69%,接近Qwen3.5(一个具有256K原生上下文长度的类似规模领先模型)。这突显了头部级混合的显著扩展潜力。

英文摘要

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

3. 后训练 5 篇

2606.19744 2026-06-19 cs.CL cs.AI cs.HC 新提交 90%

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

超越统一遗忘:不同偏好设置下顺序直接偏好优化的研究

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

发表机构 * Network Analysis and Social Influence Modelling (NASIM) Lab(网络分析与社会影响建模实验室) School of Physics Maths and Computing, The University of Western Australia(西澳大学物理数学与计算学院) School of Psychological Science, The University of Western Australia(西澳大学心理科学学院) School of Computing, Macquarie University(麦考瑞大学计算机学院)

专题命中 后训练 :研究顺序DPO在不同偏好设置下的影响,涉及对齐方法。

AI总结 研究顺序DPO在不同偏好设置下的影响,发现遗忘模式并非统一,而是取决于目标关系、信号强度和训练顺序,并提出未来对齐流程应考虑目标兼容性。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

将语言模型与人类偏好对齐通常需要优化多个行为目标。一种实用方法是使用直接偏好优化(DPO)等偏好优化方法顺序应用这些目标,但目前尚不清楚后续训练是否会统一降低先前学习的偏好,或者这种影响是否取决于目标之间的关系。我们研究了跨越四种偏好设置(包括分布冲突、多属性交互、强安全信号和兼容的响应质量目标)的顺序DPO。使用带有LoRA适配器的Llama-3.1-8B-Instruct,我们在每个阶段后使用固定的基础模型参考评估所有目标。我们发现顺序DPO不会产生单一的遗忘模式;偏好变化从部分退化到稳定、成对重新分配或正迁移,具体取决于目标关系、信号强度和训练顺序。使用长度归一化策略边界的成对分析表明,聚合指标可能掩盖偏好对之间的异质性变化,而四分位数分解显示,高置信度对可能根据设置而退化或改进。机制诊断表明,在所有设置中,阶段2的梯度和适配器更新与先前目标接近正交,几乎没有证据表明直接梯度对立是主要驱动因素。这些发现表明,未来的顺序对齐流程应考虑目标兼容性和信号强度,而不是假设后续目标会统一影响先前的偏好。

英文摘要

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage~2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.

2606.20008 2026-06-19 cs.LG 新提交 85%

VIMPO: Value-Implicit Policy Optimization for LLMs

VIMPO: 值隐式策略优化用于大语言模型

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

发表机构 * UC Berkeley(加州大学伯克利分校) Yale University(耶鲁大学)

专题命中 后训练 :提出VIMPO方法优化LLM推理能力。

AI总结 提出VIMPO方法,通过KL正则化强化学习的最优条件导出策略隐含值函数,无需训练评论家,实现细粒度信用分配,在数学推理基准上优于GRPO。

详情
AI中文摘要

基于可验证奖励的强化学习已成为提升大语言模型推理能力的核心工具,但当前方法在简单性与信用分配之间存在权衡。GRPO等群组相对方法避免了训练评论家,但通常为每个token分配轨迹级优势。Actor-critic方法提供更密集的学习信号,但需要学习值函数,其自身存在训练不稳定性。我们提出VIMPO,一种无需评论家的策略优化方法,从KL正则化强化学习的最优条件推导出策略隐含值函数。对于自回归生成,得到的值递归可以用策略-参考对数比率表示,并由轨迹结束时无未来奖励的终止条件锚定。这给出了一个简单的值损失,它结合了结果级可验证奖励,而无需训练评论家。相同的推导也产生了无需评论家的actor优势,使VIMPO能够通过值损失分离奖励合并,并通过PPO风格的actor更新进行策略改进。在数学RLVR基准上,VIMPO在MATH-500、AIME 2024、AIME 2025和OlympiadBench上均优于GRPO,尤其在竞赛式评估中提升更大。在噪声奖励下,VIMPO保持对GRPO的持续优势,表明策略隐含值优化可以在保持无评论家训练实用简单性的同时提供更精细的信用分配。

英文摘要

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially larger gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.

2606.20002 2026-06-19 cs.LG cs.AI cs.CL 新提交 80%

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Connect the Dots:通过强化学习训练具备跨域泛化能力的长期生命周期智能体

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

发表机构 * Alibaba Group(阿里巴巴集团)

专题命中 后训练 :通过强化学习训练LLM的元能力。

AI总结 提出Connect the Dots框架,通过端到端强化学习训练LLM在长期任务中自我更新上下文并泛化到新领域,实验验证了跨域泛化能力。

Comments Work in progress; we will continuously update the codebase and arXiv version

详情
AI中文摘要

本文提出了一个通用框架,用于训练大型语言模型(LLMs)具备“Connect the Dots”(CoD)这一元能力,该能力是长期生命周期智能体所必需的:当基于LLM的AI智能体部署在环境中时,它解决一系列长期任务,同时持续探索环境、从自身经验中学习,并迭代地自我更新关于环境的上下文,从而在更新上下文的条件下,在未来任务上实现逐步更好的性能。CoD框架的主要组成部分包括:(1)用于端到端强化学习(RL)的算法设计和基础设施,其中包含交替执行任务和更新上下文的长展开序列;(2)用于在训练过程中激励和激发LLM中目标元能力的任务和环境,以及在评估过程中忠实衡量进展的任务和环境。我们展示了CoD框架的概念验证实现,包括具有细粒度信用分配的GRPO风格RL算法,以及针对目标元能力(而非特定领域的LLM能力或标准的逐任务RL)量身定制的任务和环境。实证结果验证了CoD设置中端到端RL训练的有效性,并展示了所激发元能力的分布外泛化潜力——在训练领域内、跨不同领域以及从CoD到Ralph-loop设置中。我们对CoD的研究连接了多项先前工作,并为推进LLM和AI智能体开辟了新的机遇。为促进进一步研究和应用,我们在\url{this https URL}上发布了我们的实现。

英文摘要

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at \url{https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod}.

2606.19679 2026-06-19 cs.LG cs.AI 新提交 80%

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

LOKI: 无记忆零空间约束的终身知识编辑

Masih Eskandar, Miquel Sirera Perelló, Stratis Ioannidis, Jennifer Dy

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

专题命中 后训练 :终身知识编辑方法,动态选择层并投影到零空间

AI总结 提出LOKI方法,通过希尔伯特-施密特独立性准则动态选择层,并将梯度更新投影到模型权重的零空间,实现无需访问旧知识的终身知识编辑,平均准确率提升14%。

详情
AI中文摘要

终身知识编辑旨在随着时间推移,当新知识可用或模型出错时,高效且顺序地更新语言模型,同时保持对过去知识的可接受性能。一个未解决的挑战是现有方法对所有新知识样本修改固定层集,降低了灵活性并增加了灾难性遗忘。另一个挑战是需要访问先前知识并进行大量预处理以获得数据统计。为了解决这些挑战,我们引入了LOKI,一种新颖的方法,它基于希尔伯特-施密特独立性准则进行动态层选择,并将梯度更新投影到模型权重的零空间,从而绕过了对先前知识访问的需求。我们表明,LOKI在广泛的实验中实现了优于现有方法的性能,平均准确率提升高达14%。

英文摘要

Lifelong knowledge editing aims to efficiently and sequentially update language models over time, as new knowledge becomes available or when the model makes mistakes, while preserving acceptable performance on past knowledge. One unresolved challenge is that existing methods modify a fixed set of layers for all new knowledge samples, reducing flexibility and increasing catastrophic forgetting. Another is requiring access to previous knowledge and extensive pre-processing to obtain data statistics. To address these challenges, we introduce LOKI, a novel approach that uses dynamic layer selection based on the Hilbert-Schmidt Independence Criterion and projects gradient updates onto the null-space of the model weights, bypassing the requirement for previous knowledge access. We show that LOKI achieves superior performance to existing approaches across a wide variety of experiments, achieving up to a 14\% improvement in average accuracy.

2606.19607 2026-06-19 cs.AI stat.AP 新提交 80%

Which Pairs to Compare for LLM Post-Training?

LLM后训练中应比较哪些对?

Jiangze Han, Vineet Goyal, Will Ma

发表机构 * Columbia University(哥伦比亚大学)

专题命中 后训练 :研究偏好后训练中比较对的选择,提升样本效率。

AI总结 研究偏好后训练中如何选择最具信息量的比较对,提出基于采样设计的比较策展方法,通过DPO训练的理论分析给出优化准则,实验证明能提升样本效率。

详情
AI中文摘要

基于偏好的后训练已成为对齐语言模型的核心范式。常见的数据收集策略是为每个提示生成少量补全并标注生成的比较对。然而,人工偏好标签通常比生成额外补全昂贵得多,这提示了相同标注预算的不同使用方式:生成更大的补全集,但只标注最具信息量的比较对。本文研究在基于偏好的后训练中应比较哪些对。我们将比较策展形式化为一个采样设计问题,并通过基于偏好的后训练目标下的最终策略质量来评估设计。我们针对直接偏好优化(DPO)实例化该框架,分析标注对的选择如何通过DPO训练传播到下游策略性能。我们的主要结果为DPO训练策略的后训练最优性差距提供了匹配的上界和下界。这些界限表明,比较选择通过一个单一的设计相关信息矩阵影响下游性能,该矩阵将标签分配与参数估计误差和策略次优性联系起来。这为预算受限的比较策展提供了显式优化准则,并激发了从大型生成补全池中选择信息对的实际采样设计。在合成设置和语言模型后训练基准上的实验表明,所提出的设计在样本效率上持续优于常见的比较选择启发式方法。

英文摘要

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.

4. 其他LLM 15 篇

2606.20493 2026-06-19 cs.LG cs.AI cs.MA 新提交 85%

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

传染网络:多智能体LLM系统中的评估者偏见传播

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

专题命中 其他LLM :研究多智能体LLM系统中评估者偏见传播

AI总结 提出传染网络框架,量化评估者偏见在多智能体LLM系统中的传播,发现同模型智能体间偏见传播系数为0.157-0.352,且增大评估委员会规模可减少72.4%的传播效应。

Comments 20 pages, 4 figures, 4 tables

详情
AI中文摘要

当大型语言模型在多智能体系统中担任评估者时,其系统性评估偏见会通过智能体网络传播。我们引入传染网络,这是一个用于衡量评估者偏见如何在交互的LLM智能体间传播的正式框架。在使用DeepSeek-chat进行的受控3智能体实验中,我们采用了三种不同的评估者偏见配置文件(结构化、平衡、基于证据),测量了跨智能体传染矩阵Gamma_3,并发现评估者偏见始终在智能体间传播(gamma在[0.157, 0.352]范围内),即使是在相同底层模型内也是如此。我们识别出由谱半径rho(Gamma_N)控制的三种传播机制,并证明同质模型智能体产生的传染系数比先前工作中观察到的跨模型系数弱3-5倍(MM-EPC: gamma约0.85-1.3),使其处于抑制机制中。我们表明,将评估委员会规模从k=1增加到k=3可将有效传染减少72.4%,提供了一种可行的缓解策略。我们发布了开源的传染网络实验框架。

英文摘要

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.

2606.19746 2026-06-19 cs.DC 新提交 85%

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

SAC: 面向稀疏注意力LLM的基于CXL的解耦KV缓存系统

Ruiyang Ma, Teng Ma, Junru Li, Hantian Zha, Xuchun Shang, Qingda Hu, Zheng Liu, Xinjun Yang, Tao Ma, Guojie Luo

专题命中 其他LLM :提出面向稀疏注意力LLM的解耦KV缓存系统SAC,优化长上下文推理性能。

AI总结 针对稀疏注意力模型在长上下文推理中全量KV缓存传输导致的瓶颈,提出基于CXL按需获取top-k KV条目的解耦缓存系统SAC,相比RDMA方案吞吐提升2.1倍、TTFT降低9.7倍。

详情
AI中文摘要

LLM向长上下文推理的扩展将主要服务系统瓶颈从计算转移到内存容量。传统针对密集注意力模型的解决方案依赖基于RDMA的解耦内存池,在解码前从远程存储粗粒度地获取整个前缀KV缓存到本地内存。然而,这种方法对于新兴的稀疏注意力模型本质上是低效的。尽管解码过程中只有一小部分KV条目是活跃的,这些系统仍然将完整的KV缓存获取到本地,导致严重的传输瓶颈和本地内存浪费。为了解决这个问题,我们提出了SAC,第一个针对稀疏注意力模型优化的高效解耦KV缓存系统。通过利用Compute Express Link (CXL)的低延迟、缓存行粒度的加载/存储语义,SAC在推理过程中按需仅获取所需的top-k KV条目。在使用SGLang对DeepSeek-V3.2的评估中,与基于RDMA的基线相比,SAC实现了2.1倍的吞吐量提升、9.7倍的TTFT降低和1.8倍的TBT降低,确立了基于CXL的解耦作为新兴稀疏注意力模型的优越基础设施。

英文摘要

The scaling of LLMs toward long-context inference has shifted the primary serving system bottleneck from computation to memory capacity. Traditional solutions for dense attention models rely on RDMA-based disaggregated memory pools, which perform coarse-grained fetching of the entire prefix KV cache from remote storage to local memory before decoding. However, this approach is fundamentally inefficient for emerging sparse attention models. While only a small fraction of KV entries are active during decoding, these systems still fetch the full KV cache locally, leading to severe transmission bottlenecks and local memory wastage. To address this, we propose SAC, the first efficient disaggregated KV cache system optimized for sparse attention models. By leveraging the low-latency, cache-line granularity load/store semantics of Compute Express Link (CXL), SAC fetches only the required top-k KV entries on demand during inference. Evaluations on DeepSeek-V3.2 using SGLang show that SAC achieves 2.1x higher throughput, 9.7x lower TTFT, and 1.8x lower TBT compared to RDMA-based baselines, establishing CXL-based disaggregation as the superior infrastructure for emerging sparse attention models.

2606.19605 2026-06-19 cs.SE cs.AI 新提交 85%

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO:多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

发表机构 * Foundation AI–Cisco Systems Inc.(基础AI–思科系统公司) Yale University(耶鲁大学)

专题命中 其他LLM :提出FAPO框架自动优化多步LLM流水线的提示和链结构

AI总结 提出FAPO框架,通过自动诊断流水线瓶颈并迭代优化提示或链结构,在18个模型-基准比较中15次优于基线GEPA,平均提升14.1个百分点。

详情
AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败,因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO(全自动提示优化),一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更,并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑,仅当提示优化似乎不足时,在归因识别出结构瓶颈的情况下,在允许范围内更改链结构。在六个基准和三个任务模型上,FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中,FAPO以不重叠的均值±试验标准差范围获胜,平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中,当提示优先搜索升级为结构变更时,FAPO在所有六个中获胜,平均增益为+33.8个百分点。FAPO还提高了安全任务的性能:在CTIBench-RCM(一个安全CVE到CWE任务)上,仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率,在Foundation-Sec-8B-Instruct上提升了+7.1个百分点,在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19475 2026-06-19 cs.AI cs.CL 新提交 85%

Diffusion Language Models: An Experimental Analysis

扩散语言模型:一项实验分析

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学) University of Pisa(比萨大学)

专题命中 其他LLM :系统比较扩散语言模型在多种任务上的表现。

AI总结 本文系统比较了八种扩散语言模型在推理、编码、翻译等任务上的表现,分析了去噪步数、上下文长度等推理因素对性能与效率的影响,揭示了扩散语言模型在不同任务和预算下的权衡。

详情
AI中文摘要

大型语言模型(LLMs)通过自回归生成彻底改变了语言建模,使其在广泛的任务中表现出色。最近,扩散语言模型(DLMs)作为一种替代范式出现,它通过迭代去噪而非下一个词预测来生成文本,从而允许对整个序列进行并行精炼。尽管已经提出了许多基于扩散的架构,但评估协议、数据集、推理预算和生成超参数的差异使得比较它们的能力和理解它们提供的权衡变得困难。在这项工作中,我们对现代DLMs进行了系统的实验分析。具体来说,我们评估了八种最先进的DLMs在八个基准上的表现,这些基准涵盖推理、编码、翻译、知识和结构化问题解决,同时明确考虑了生成质量和计算效率。除了下游评估,我们还分析了关键推理时间因素的影响,包括去噪步数、上下文长度、块大小和并行解掩策略,并通过在相同条件下训练的较小模型的受控比较来补充大规模实验。我们的分析突出了基于扩散的语言建模在不同任务、架构和推理预算下的优势和局限性。我们表明,DLMs的行为受到生成时间设计选择的强烈影响,导致性能和计算效率之间的不同权衡。总体而言,我们的研究为当代DLMs的能力和部署特性提供了实用见解。

英文摘要

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19351 2026-06-19 cs.CL cs.AI 新提交 85%

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

专题命中 其他LLM :检测LLM在知识图谱推理中的幻觉

AI总结 提出LUCID方法,结合LLM注意力分数、知识图谱语义和结构信息,利用图神经网络检测LLM在知识图谱推理中的幻觉,在九个数据集上达到最优性能。

详情
AI中文摘要

知识图谱推理从现有事实中推断新知识,广泛应用于问答、推荐和决策支持。随着大语言模型(LLM)的快速发展,基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而,LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识,模型仍可能生成错误输出,导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态,要么验证与检索上下文的一致性,但两者都忽略了知识图谱中的结构信息,导致性能次优。为了解决这一差距,我们提出了LUCID,这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说,它从注意力分数和语义相似度中提取节点和边特征,并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明,与15个基线相比,LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH 新提交 85%

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础:用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

发表机构 * Spotify USA, Inc.(Spotify美国公司)

专题命中 其他LLM :提出LLM替代人类进行A/B测试的统计框架

AI总结 提出替代指标理论框架,证明在弱于分布等价条件下,校准LLM输出可识别平均处理效应,并分析随机性带来的偏差与方差。

详情
AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型(LLM)代替人类参与者,以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效,但这不现实。因此,我们开发了一个统计框架,将替代终点理论适配到LLM。该框架表明,将LLM结果校准到人类结果,在替代性和可比性条件(联合弱于分布等价性)下,可以识别平均处理效应。当这些条件不成立时,感兴趣的效应仅部分可识别,我们提供了诊断方法,可以在历史实验上证伪替代性,并给出有限重叠下最坏情况偏差的界限。我们进一步证明,LLM固有的随机性会引入偏差和方差,但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是,LLM结果作为替代指标的有效性只能对过去的处理被证伪,而无法对新处理被验证,因此对于新颖干预,人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用,以及如何确定人类实验的规模以进行验证。

英文摘要

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39\% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

2606.07822 2026-06-19 cs.CL cs.AI cs.LG 新提交 85%

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:操作语言模型激活以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

专题命中 其他LLM :提出激活置信度估计协议,提升校准与信任

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

2606.20517 2026-06-19 cs.AI cs.PL 新提交 80%

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB: 将 LiveCodeBench 扩展到多种编程语言

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

发表机构 * GigaCode Yandex School of Data Analysis, Applied AI Institute(Yandex数据分析学院,应用人工智能研究所)

专题命中 其他LLM :评估LLM跨语言代码生成能力,涉及预训练模型

AI总结 提出 Multi-LCB 基准,将 LiveCodeBench 的 Python 任务扩展到 12 种编程语言,评估 LLM 跨语言代码生成能力,发现 Python 过拟合和语言特定污染等问题。

Comments ICLR 2026

详情
AI中文摘要

LiveCodeBench (LCB) 最近已成为评估大型语言模型 (LLM) 在代码生成任务上的广泛采用的基准。通过策划竞争性编程问题、不断向集合中添加新问题并根据发布日期进行过滤,LCB 提供了污染感知的评估,并提供了编码能力的整体视图。然而,LCB 仍然局限于 Python,留下了 LLM 是否能够泛化到现实软件工程所需的各种编程语言的问题。我们引入了 Multi-LCB,这是一个跨十二种编程语言(包括 Python)评估 LLM 的基准。Multi-LCB 将 LCB 数据集中的 Python 任务转换为其他语言中的等效任务,同时保留 LCB 的污染控制和评估协议。由于它与原始 LCB 格式完全兼容,Multi-LCB 将自动跟踪未来的 LCB 更新,从而能够系统地评估跨语言代码生成能力,并要求模型在 Python 之外保持良好的性能。我们在 Multi-LCB 上评估了 24 个 LLM 的指令和推理能力,发现了 Python 过拟合、语言特定污染以及多语言性能显著差异的证据。我们的结果将 Multi-LCB 确立为多编程语言代码评估的严格新基准,直接解决了 LCB 的主要局限性,并揭示了当前 LLM 能力的关键差距。

英文摘要

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

2606.20245 2026-06-19 cs.AI 新提交 80%

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

导航不可靠的参数化与上下文知识:面向LLM推理的显式知识冲突解决

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

发表机构 * National Key Laboratory of Big Data and Decision, National University of Defense Technology(国防科技大学大数据与决策国家重点实验室)

专题命中 其他LLM :解决LLM参数知识与上下文冲突

AI总结 提出MACR框架,通过自适应知识评估与多智能体推理,显式解决大语言模型内部参数知识与外部上下文之间的冲突,超越传统二元选择范式。

Comments 12 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)通过利用广泛的参数化知识和上下文学习能力,在多种基于语言的任务中取得了强劲性能,使其能够整合输入提示中提供的外部信息。然而,外部知识的整合可能引入冲突,不仅存在于模型内部参数知识与外部信息之间,也存在于多个外部上下文之间。现有方法通常假设模型或提供的上下文是可靠的,忽视了两种来源都可能包含错误的情况,并通过优先考虑某一来源而非另一来源来避免冲突,而非主动解决不一致性。为解决这些局限,我们提出了一种新颖的LLM知识冲突解决框架MACR,该框架超越了传统的二元选择范式,并基于多智能体推理方法引入了显式的冲突解决机制。具体而言,我们首先提出一种自适应知识评估与检索方法,采用改进的语义熵度量来量化LLM对给定查询答案的置信度。基于此置信度估计,MACR要么将模型的内部知识外化为文本表示,要么在内部知识不足时检索相关外部知识,为后续推理生成基本上下文。然后,我们引入一个归纳式多智能体推理框架,包含三个专门智能体,分别用于归纳显式规则、分析潜在冲突以及解决所有可用上下文中的不一致性。实验结果表明,MACR在多个基准测试中显著优于最先进的基线方法,同时提供了可解释的显式冲突解决方案。

英文摘要

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework MACR for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.

2606.20152 2026-06-19 cs.CL cs.AI 新提交 80%

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

从文本到分数:追踪大型语言模型中作文质量表征的出现

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2 CT Lab, Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系NLP2 CT实验室) Institute of International Language Services Studies, Macau Millennium College(澳门 millennium 学院国际语言服务研究学院) School of Data Science and Information Engineering, Guizhou Minzu University(贵州民族大学数据科学与信息工程学院)

专题命中 其他LLM :分析LLM内部表征用于自动作文评分。

AI总结 通过线性探测等方法分析8个LLM在三个数据集上的隐藏表征,发现作文质量信息以线性可解码形式存在,并识别出与分数相关的神经元,揭示了LLM评分的内在机制。

Comments This is a preprint of a manuscript currently under peer review

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展极大地改变了自动作文评分(AES),但基于LLM的评分内部机制仍知之甚少。在本工作中,我们系统分析了八个LLMs在两个英文作文数据集(ASAP++、CSEE)和一个葡萄牙语数据集(ENEM)上的隐藏表征。通过线性探测、跨提示泛化、降维和神经元级分析,我们发现一致证据表明作文质量信息以线性可访问的形式编码在LLM表征中。这些表征在层间逐步出现,在不同提示策略下保持稳健,并且尽管评分标准不同,仍能在作文提示间部分迁移。此外,非线性探测相对于线性探测仅提供边际且不一致的改进,表明大多数作文质量信息已经是线性可解码的。我们进一步识别出单个“作文评分神经元”,其激活与作文分数强相关,且其行为对目标干预敏感。此外,这些神经元的逐层分布随作文长度系统性地变化,较长的作文更依赖深层。总体而言,我们的发现提供了LLM编码与作文质量相关的结构化表征的证据,并为基于LLM的AES系统的可解释性提供了新见解。

英文摘要

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual ``essay scoring neurons'' whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

2606.19868 2026-06-19 cs.AI 新提交 80%

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

大型语言模型黑盒不确定性估计方法的系统评估

Jiayi Wang, Xu-Yao Zhang

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室)

专题命中 其他LLM :系统评估LLM黑盒不确定性估计方法。

AI总结 系统评估了24种黑盒不确定性估计方法在4个模型和4个数据集上的表现,发现无单一方法普遍最优,但基于答案空间推理和比较的方法通常有效,混合方法在多数条件下表现良好。

详情
AI中文摘要

尽管大型语言模型(LLMs)在广泛的任务中展现出强大的能力,但其输出通常仍不可靠,可能包含幻觉,因此不确定性估计(UE)对于构建可信赖的LLMs至关重要。在实践中,许多主流LLMs仅通过受限API访问,此时logits和隐藏状态等内部信号不可用,使得黑盒UE尤为重要。然而,现有关于LLMs黑盒UE的研究在方法论上仍然零散,缺乏统一的实证比较。为填补这一空白,我们系统回顾了黑盒UE方法,并将其分为五类:基于口头化、基于采样、基于解释、多智能体和混合方法。我们进一步构建了统一的评估框架,并在4个模型和4个数据集设置下对24种代表性方法进行了基准测试。结果表明,没有单一方法在所有设置中一致占优。然而,在答案空间中进行推理和比较候选的方法通常有效,而结合多种不确定性信号的混合方法在大多数条件下表现良好。通过发布基准数据和统一评估框架,我们旨在促进可重复比较并支持未来研究,同时我们的实证发现为开发未来LLMs的黑盒UE方法提供了实践指导。

英文摘要

Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.

2606.19735 2026-06-19 cs.AI cs.CV 新提交 80%

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University(俄勒冈州立大学)

专题命中 其他LLM :基于LLM的接口将自然语言转换为SQL查询。

AI总结 提出基于LLM的交互接口GLARE,将自然语言问题转换为SQL查询以聚合局部解释数据,提升全局解释的可访问性和可用性。

Comments 16 pages, 2 figures

详情
AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要,但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案,而不是静态产物,我们提出了一种基于LLM的交互接口,提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者,将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能,而无需向用户暴露低级表示。对于每个查询,接口输出统计增强的自然语言响应,支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明,LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

2606.19727 2026-06-19 cs.CL cs.AI 新提交 80%

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM:语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshiınst, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University(深圳技术大学) New Delhi Institute of Management(新德里管理学院) Technische Universität Dresden(德累斯顿工业大学) Ramakrishna Mission Vivekananda Educational and Research Institute(罗摩克里希纳传道会维韦卡南达教育与研究学院) Indian Institute of Technology(印度理工学院) Swami Vivekananda Institute of Technology(斯瓦米·维韦卡南达技术学院) GuangDong Engineering Technology Research Center of Edge Intelligence(广东省边缘智能工程技术研究中心)

专题命中 其他LLM :评估语言模型对全球舞蹈文化的理解能力。

AI总结 提出NRITYAM基准,包含9,260个跨12语言的文化问答对,评估语言模型对全球舞蹈传统的文化理解能力,涵盖多种模型类型。

Comments 18 pages, 12 figures, in ECML_PKDD'26

详情
AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而,其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距,我们提出NRITYAM,一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对,涵盖12种语言,是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发,他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型,包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准,NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在\url{this https URL}获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at~\url{https://github.com/niladrighosh03/NRITYAM}.

2606.19698 2026-06-19 cs.CL 新提交 80%

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

情感分析看不到的:衡量客户是否得到帮助以及出了什么问题——基于70,000次客服对话

Jason Potteiger

发表机构 * Dimension Labs(Dimension实验室)

专题命中 其他LLM :使用GPT-5.4估计客户满意度并标记问题。

AI总结 本研究使用GPT-5.4从70,450次客服对话中估计客户满意度并标记具体问题,发现满意度估计比情感分析更准确,且能揭示情感分析无法捕捉的客户状态和问题原因。

Comments 25 pages, 6 figures

详情
AI中文摘要

大多数公司通过情感分析大规模阅读客户支持数据,这种方法衡量的是客户听起来如何,而不是他们对结果是否满意。我们在一个领先的在线筹款平台的70,450次客服对话中测试了一种更丰富的替代方案:除了语气,我们使用GPT-5.4估计每位客户的满意度,并标记他们是否报告了具体问题,然后将这三个读数与客户对对话的1到5星评分进行验证。满意度估计跟踪这些评分的效果远好于情感分析,相关性分别为0.47和0.36,并且标记不满客户时的误报率低得多。结构化阅读还能看到情感分析看不到的东西:语气和满意度在44%的对话中不一致,一个单一的“中性”标签掩盖了从安静满意的客户到安静放弃的客户的一切,而最大的群体是“容忍摩擦”——那些满意但仍然报告可修复问题的客户,这是一个情感分析仪表板无法揭示的长期问题。更广泛的发现是,基于LLM的标注可以捕捉到比客户语言语气多得多的信息,为基于客户状态(他们是否满意)和直接从交互与反馈的原始文本数据中提取的问题原因的新业务指标提供了强大潜力。

英文摘要

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

2606.19668 2026-06-19 cs.CL 新提交 80%

Code-Switching Reveals Language Anchoring in Multilingual LLMs

代码切换揭示多语言大模型中的语言锚定

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

发表机构 * Chung-Ang University(中央大学) Adobe Research(Adobe研究院)

专题命中 其他LLM :研究多语言大模型中的代码切换和语言锚定现象

AI总结 通过语法强制代码切换诊断多语言大模型中的语言锚定现象,提出锚定偏差度量并设计CANVAS干预方法,有效缓解代码切换导致的问答性能下降。

Comments 36 pages, 13 figures, 27 tables

详情
AI中文摘要

多语言大模型(MLLMs)越来越需要处理代码切换(CS)输入,然而混合语言通常会导致性能相对于源语言或目标语言单语版本下降。为了理解这种退化,我们使用语法强制CS作为受控诊断设置,将CS表示相对于其源和目标对应物进行定位。我们引入锚定偏差(Anchor Bias),一种几何度量,用于量化语言锚定,即CS隐藏状态是否更接近其源语言或目标语言对应物。在不同的MLLMs中,锚定偏差揭示了一致的语法框架效应:源框架CS保持源锚定,而目标框架CS向目标方向移动,并显示出更大的问答(QA)退化。受这种表示模式的启发,我们提出了CANVAS(基于上下文锚定的神经向量对齐引导),一种推理时干预方法,从输入中提取源侧画布,并在预填充期间将目标语言隐藏状态软引导向源锚定。CANVAS在MLLMs和CS条件下一致地恢复了QA F1分数,表明内部锚定信号为缓解CS推理失败提供了可行的目标。

英文摘要

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.

5. 领域大模型 2 篇

2606.19700 2026-06-19 cs.CL 新提交 85%

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

TerraMARS: 用于火星地球化改造文献的领域自适应小语言模型管道

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

发表机构 * University of Arizona(亚利桑那大学) College of Information Science, University of Arizona(亚利桑那大学信息科学学院) Biosphere 2, University of Arizona(亚利桑那大学生物圈2) Department of Ecology and Evolutionary Biology, University of Arizona(亚利桑那大学生态与进化生物学系) Department of Environmental Sciences, University of Arizona(亚利桑那大学环境科学系)

专题命中 领域大模型 :领域自适应小语言模型管道,用于火星科学文献提取。

AI总结 提出TerraMARS管道,结合领域自适应小语言模型,从火星科学文献中提取结构化信息,支持地球化改造研究。

Comments 16 pages, 1 figure, 4 tables

详情
AI中文摘要

研究人员有兴趣了解火星,以便最终使其适合人类居住。为此,需要通过科学文献全面了解行星的大气、水文、表面化学、辐射环境和空间特征。这些文献包含有价值的信息和有意义的定量约束,可用于其他模型和研究,如宜居性评估和未来的地球化改造研究。我们提出了TerraMARS,一个端到端的信息提取管道,它结合了领域自适应的小语言模型来回答火星地球化改造相关问题,并将非结构化的火星科学文本转换为机器可读的结构化输出(JSON格式)。收集了一个开放获取论文语料库,并使用多阶段检索和分块框架进行处理。使用量化低秩自适应(QLoRA)对火星特定问答和信息提取数据集进行微调,使Google Gemma 3 1B适应领域。生成的管道产生两种类型的输出,并为将科学文献中的知识整合到下游应用(如数字孪生和火星宜居性建模)提供了基础。该管道的输出看起来很有前景,但需要进一步改进以提高提取准确性和事实一致性。

英文摘要

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 新提交 80%

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示:基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University(莱顿大学) FutureWhiz

专题命中 领域大模型 :自适应LLM高中辅导系统。

AI总结 提出一种基于14个教学特征的主题感知提示路由模型,通过模拟训练和在线A/B测试,在高中辅导中实现自适应策略切换,提高教学效率并减少交互轮次。

详情
AI中文摘要

LLMs可以个性化教育,尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统,该系统基于从原始转录中提取的14个教学特征(例如,辅导支架、学生理解)。我们首先在模拟环境中训练一个提示路由模型,然后将其部署到实际高中学生的在线适应中。模拟基准测试显示,路由器的性能优于两个静态基线($0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$)。A/B测试($N=656$ 次对话,来自359名学生)显示了从模拟到现实的迁移,其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率,保持了教学质量,并减少了约3轮交互($p=0.007$)。虽然贪婪路由器的练习转化率与基线相当($19.1\%$ vs. $19.6\%$),但随机采样策略的随机路由器实现了更高的转化率($28.1\%$)。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).