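arXivDaily: arXiv daily academic digest, updated Monday to Friday

AI Large Models

Large Language Models / LLM

Large language models, pretraining, instruction tuning, post-training, and language model applications.

Included today / on the current date: 113. Sources: cs.CL, cs.AI, cs.LG

1. Domain-specific large models (3 papers)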

2606.19266 2026-06-18 cs.CL cs.AI New submission 90%

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

Affiliations * Aix-Marseille Univ., CNRS, LIS UMR 7020; Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004; Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217

Topic match Domain-specific large models: comparison of domain adaptation strategies for French medical LLMs

AI summary Using French medical question answering, the study empirically compares continual pretraining (CPT) and supervised fine-tuning (SFT) across several model families and sizes. CPT+SFT is best on multiple-choice QA but with small gains, SFT is a strong and economical default, and CPT improves overlap-based metrics on open-ended QA.

English abstract

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

2606.18699 2026-06-18 cs.CL cs.AI cs.IR New submission 90%

TW-LegalBench: Measuring Taiwanese Legal Understanding

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

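Affiliations * University of Rochester; National Taiwan University; NVIDIA

Topic match Domain-specific large models: a Taiwanese legal understanding benchmark that evaluates LLM legal reasoning

AI summary Introduces the TW-LegalBench benchmark, covering multiple-choice questions, open-ended questions, and legal judgment prediction, and evaluates 13 LLMs on Taiwanese law. Top models clear the passing threshold for lawyers but fall short of the standard for judges and prosecutors, and they struggle to cite statutes accurately.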

Comments 10 pages, 2 figures, To appear in ICAIL 2026

English abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law: existing common-law benchmarks focus on English sources, and existing civil-law benchmarks focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

2606.18600 2026-06-18 cs.DC New submission 85%

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

Seungwoo Jeong, Moohyun Song, Juhyun Park, Kyungyong Lee

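Topic match Domain-specific large models: proposes the ShuntServe system to optimize LLM serving on heterogeneous GPUs

AI summary Proposes ShuntServe, which estimates performance with a roofline model and optimizes model placement with dynamic programming to maximize throughput on heterogeneous spot GPU clusters. It combines output-preserving migration with a shared tensor store for fault tolerance, achieving up to 1.42x higher throughput than baselines and cost-efficiency gains of up to 31.9% over on-demand instances.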

Comments 18 pages, 16 figures, 5 tables

English abstract

As large language model (LLM) services become widely adopted, the cost of GPU resources for serving these models in cloud environments has emerged as a critical concern. Spot instances offer up to 90% cost savings over on-demand instances, but their frequent interruptions and limited availability pose significant challenges for continuous LLM serving. GPU spot instances, in particular, exhibit lower and more volatile availability than CPU-based instances, making homogeneous clusters that depend on a single GPU type vulnerable to correlated failures. Heterogeneous clusters spanning multiple GPU types can address this by leveraging complementary availability patterns across diverse spot pools, yet existing LLM serving systems are designed for homogeneous environments and suffer from load imbalance when deployed on heterogeneous GPUs. This paper presents ShuntServe, a cost-efficient LLM serving system for heterogeneous spot GPU clusters. ShuntServe employs a roofline model-based analytical serving performance estimator and a dynamic programming-based model placement optimizer that jointly determines node configuration, parallelization strategy, and layer assignment to maximize throughput across heterogeneous GPUs. To enhance fault tolerance when using spot instances, ShuntServe combines output-preserving request migration with concurrent initialization via a shared tensor store, minimizing migration downtime by overlapping replacement node preparation with ongoing serving. Evaluation on Llama-3.1-70B and Qwen3-32B with a heterogeneous AWS cluster of L4, A10G, and L40S GPUs shows that ShuntServe achieves 1.42x and 1.35x higher throughput than state-of-the-art baselines and attains 31.9% and 31.2% cost efficiency improvements over on-demand instances for offline and online serving, respectively.
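
The roofline-based estimate mentioned in the abstract can be illustrated with a minimal sketch. This is not ShuntServe's estimator: the function name, the two GPU entries, and all numbers below are hypothetical placeholders, used only to show the roofline idea that a step takes at least as long as the slower of its compute time and its memory-traffic time.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline lower bound on step time: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops      # seconds if purely compute-bound
    memory_time = bytes_moved / mem_bw     # seconds if purely memory-bound
    return max(compute_time, memory_time)

# Hypothetical device specs (placeholder numbers, not real GPU datasheets).
gpus = {
    "gpu_small": {"peak_flops": 100e12, "mem_bw": 300e9},
    "gpu_large": {"peak_flops": 300e12, "mem_bw": 900e9},
}

# Hypothetical decode step: about 2 FLOPs per weight, each 2-byte weight read once.
n_params = 1e9
flops = 2 * n_params
bytes_moved = 2 * n_params

for name, spec in gpus.items():
    t = roofline_time_s(flops, bytes_moved, **spec)
    memory_bound = bytes_moved / spec["mem_bw"] >= flops / spec["peak_flops"]
    print(name, f"{t * 1e3:.3f} ms", "memory-bound" if memory_bound else "compute-bound")
```

Per-device estimates of this kind are what a placement optimizer would compare when deciding how many layers each GPU type should hold.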

2. Pretraining (4 papers)

2606.18663 2026-06-18 cs.CL New submission 90%

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

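Affiliations * The University of Tokyo; National Institute of Informatics

Topic match Pretraining: a dynamic data mixing method for LLM pretraining

AI summary Proposes RegMix-D, which predicts optimal mixture ratios at multiple training stages from proxy training trajectories to enable dynamic data mixing. It outperforms RegMix and DoReMi on 13 downstream tasks and still surpasses RegMix with only 25% of RegMix's proxy compute budget.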

Comments Work in progress

English abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B-parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).
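
The idea of fitting a regression per training stage and reading off a mixture schedule can be sketched on toy data. This is an illustration under strong assumptions, not the RegMix-D implementation: it uses a single mixture weight `w` for two domains, a quadratic regressor, synthetic loss trajectories, and hypothetical helper names (`solve`, `fit_quadratic`, `best_mixture`).

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_quadratic(ws, losses):
    """Least-squares fit of loss ~ c0 + c1*w + c2*w^2 via the normal equations."""
    feats = [[1.0, w, w * w] for w in ws]
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    aty = [sum(f[i] * y for f, y in zip(feats, losses)) for i in range(3)]
    return solve(ata, aty)

def best_mixture(coef, grid):
    """Candidate mixture weight with the lowest predicted loss."""
    return min(grid, key=lambda w: coef[0] + coef[1] * w + coef[2] * w * w)

# Synthetic proxy runs: each run has a mixture weight and a two-stage loss trajectory.
proxy_ws = [0.0, 0.25, 0.5, 0.75, 1.0]
trajectories = [[(w - 0.2) ** 2 + 1.0, (w - 0.6) ** 2 + 0.5] for w in proxy_ws]

grid = [i / 20 for i in range(21)]
schedule = [
    best_mixture(fit_quadratic(proxy_ws, [traj[stage] for traj in trajectories]), grid)
    for stage in range(2)
]
print(schedule)  # one predicted-best mixture weight per stage
```

On this toy data the predicted-best weight shifts between the two stages, which is the situation a static mixture cannot capture.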

2606.19036 2026-06-18 cs.LG New submission 85%

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

Tho Tran Huu, Huu-Tuan Nguyen, Thien-Hai Nguyen, Nhat-Tri Ho, Viet-Hoang Tran, Tho Quan, Tan Minh Nguyen

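Affiliations * Department of Mathematics, National University of Singapore, Singapore; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam

Topic match Pretraining: analyzes discontinuities in sparse MoE and proposes a smoothing mechanism; the core contribution is an LLM architecture improvement.

AI summary Gives a geometric and stochastic analysis of discontinuities in sparse Mixture-of-Experts models: classifies discontinuities by order, establishes asymptotic volume estimates, proves that a random path's first hit almost surely lands on an order-1 discontinuity, and proposes a low-overhead smoothing mechanism that improves performance.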

Comments ICML 2026 Spotlight. arXiv admin note: text overlap with arXiv:2510.17794 by other authors

English abstract

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, the very Top-$k$ expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts, resulting in significantly different outputs. In this work, we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounters a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity, with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify how long the random path spends in the neighborhood of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower-order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities. Our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
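
A toy example makes the order-1 discontinuity concrete. The sketch below assumes a hypothetical two-expert layer with top-1 routing, and the margin-based blend in `smoothed_moe` is a generic illustration of soft expert mixing near a tie, not the smoothing mechanism proposed in the paper.

```python
def gate_scores(x):
    # Hypothetical router: two experts whose scores tie at x = 0.
    return [x, -x]

def experts(x):
    # Hypothetical experts with very different outputs.
    return [10.0 + x, -10.0 + x]

def top1_moe(x):
    """Hard top-1 routing: the output jumps when the winning expert changes."""
    s = gate_scores(x)
    winner = 0 if s[0] >= s[1] else 1
    return experts(x)[winner]

def smoothed_moe(x, width=0.1):
    """Blend both experts while their score margin lies inside a band of half-width `width`."""
    s = gate_scores(x)
    margin = s[0] - s[1]
    # Weight on expert 0 rises linearly from 0 to 1 across the band, clipped outside it.
    w0 = min(1.0, max(0.0, 0.5 + margin / (2 * width)))
    e = experts(x)
    return w0 * e[0] + (1.0 - w0) * e[1]

eps = 1e-6
jump = abs(top1_moe(eps) - top1_moe(-eps))            # two inputs 2e-6 apart
smooth_gap = abs(smoothed_moe(eps) - smoothed_moe(-eps))
print(jump, smooth_gap)
```

Outside the band the blend reduces to hard routing, so the extra expert is evaluated only near a tie, which is where the overhead of any such scheme would concentrate.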

2606.19005 2026-06-18 cs.CL cs.LG New submission 85%

Sumi: Open Uniform Diffusion Language Model from Scratch

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

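Affiliations * Tohoku University

Topic match Pretraining: a 7B uniform diffusion language model pretrained from scratch, with performance comparable to autoregressive models.

AI summary Introduces Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. Its performance is competitive with autoregressive models trained at comparable token budgets, and all resources are openly released.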

English abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
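
To illustrate what "uniform" means here, the following minimal sketch shows a generic uniform-noise forward corruption for discrete tokens: each position keeps its token with probability `alpha` and is otherwise resampled uniformly from the vocabulary. The function name and parameters are hypothetical, and this is a textbook-style illustration, not Sumi's training code.

```python
import random

def uniform_corrupt(tokens, alpha, vocab_size, rng):
    """Keep each token with probability alpha; otherwise draw a uniform random token id."""
    out = []
    for tok in tokens:
        if rng.random() < alpha:
            out.append(tok)                         # token survives this noise level
        else:
            out.append(rng.randrange(vocab_size))   # replaced by uniform noise
    return out

rng = random.Random(0)
clean = [5, 1, 7, 3]
for alpha in (1.0, 0.5, 0.0):
    print(alpha, uniform_corrupt(clean, alpha, vocab_size=10, rng=rng))
```

Unlike masked diffusion, there is no special mask symbol: every noisy position holds an ordinary token, so a denoiser may need to revise any position at any step.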

2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY New submission 80%

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

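Affiliations * DeepSeek-AI

Topic match Pretraining: proposes a cross-datacenter MoE training system that reduces communication overhead.

AI summary Proposes FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. With partial expert replication and a skip-token mechanism, it reduces communication cost by up to 1.42x over efficient baselines and raises throughput by up to 1.4x.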

English abstract

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

3. Instruction tuning (2 papers)

2606.18875 2026-06-18 cs.CL New submission 85%

Efficient Financial Language Understanding via Distillation with Synthetic Data

Wen-Fong, Huang, Edwin Simpson

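Affiliations * School of Engineering Mathematics and Technology, University of Bristol

Topic match Instruction tuning: distills a large teacher model into small models for financial sentiment analysis.

AI summary Proposes a framework for financial sentiment analysis under low-resource conditions through distillation with synthetic data. Clustering-based seed selection produces representative synthetic data, letting compact models reach strong performance with few labels and even outperform the teacher model on a more complex, noisier text domain.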

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254

English abstract

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

2606.18307 2026-06-18 cs.LG cs.AI New submission 85%

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao

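Affiliations * Tsinghua University

Topic match Instruction tuning: proposes DRIFT, which refines the instruction-tuning data distribution to raise the LLM performance ceiling.

AI summary Proposes DRIFT, which uses on-policy influence functions to address the proximity gap and gradient-norm bias that standard influence functions exhibit in instruction-tuning data attribution. It uses the model's own rollouts as validation targets and raises the performance ceiling of 7B models.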

English abstract

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

4. Post-training (10 papers)

2606.18831 2026-06-18 cs.CL cs.AI New submission 85%

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

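Affiliations * OpenBMB; Tsinghua University

Topic match Post-training: improves LLM long-context reasoning through a data recipe and GRPO reinforcement learning

AI summary Proposes a simple, effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves long-context reasoning in large language models, with average gains of +3.2 to +7.2 points across seven long-context benchmarks and further gains on agentic tasks.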

Comments 15 pages, 6 figures, 12 tables

English abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families (retrieval, multi-evidence synthesis, and reasoning), for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
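
For readers unfamiliar with the "outcome-based GRPO setup" the abstract refers to, the core step is a group-relative advantage: several responses are sampled for one prompt, each gets a scalar outcome reward, and rewards are standardized within the group. The sketch below shows only that step; the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and the paper's exact configuration is not given in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one long-context question; reward 1.0 means the final answer was correct.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason the difficulty and diversity of the training data matter in this setup.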

2606.18810 2026-06-18 cs.LG cs.AI New submission 85%

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

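Affiliations * Beijing Institute of Technology; Beihang University; Independent Researcher

Topic match Post-training: the SC-GRPO method for RLVR, improving LLM reasoning

AI summary Proposes SC-GRPO, which uses the KL divergence between the original and self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment. It outperforms GRPO by 8.1% across math, code, and agentic benchmarks.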

English abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger out-of-distribution (OOD) performance. Moreover, SC-GRPO achieves higher performance than OPD.
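
The weighting idea can be sketched on toy distributions. This is an illustration only: the abstract does not specify the KL direction or any normalization, so the choice of KL(conditioned || original), the mean-normalization, and the function names below are all assumptions rather than the SC-GRPO implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_weights(orig_dists, cond_dists):
    """Per-token weights proportional to KL(conditioned || original), normalized to mean 1."""
    kls = [kl(c, o) for o, c in zip(orig_dists, cond_dists)]
    mean = sum(kls) / len(kls)
    return [k / mean if mean > 0 else 1.0 for k in kls]

# Two token positions over a two-word vocabulary (toy numbers).
orig = [[0.5, 0.5], [0.9, 0.1]]   # next-token distributions from the plain prompt
cond = [[0.5, 0.5], [0.5, 0.5]]   # same positions after conditioning on a verified trajectory

weights = token_weights(orig, cond)
sequence_advantage = 1.0            # one GRPO advantage shared by the whole response
per_token_advantage = [sequence_advantage * w for w in weights]
print(weights, per_token_advantage)
```

Here the first position, where conditioning changes nothing, receives zero weight, while the second position, where the distributions diverge, receives all of the credit.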

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA New submission 85%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

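Affiliations * Amazon

Topic match Post-training: LLM agents search over RL post-training strategies

AI summary Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and improves over the base model by 9% to 140% relative on 4 GRPO tasks.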

English abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.19336 2026-06-18 cs.CL New submission 80%

Learning User Simulators with Turing Rewards

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

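Affiliations * Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab

Topic match Post-training: training user simulators with Turing rewards

AI summary Proposes Turing-RL, a Turing-Test-based reinforcement learning method for training user simulators. A discriminative Turing reward pushes generated responses to be indistinguishable from those of real users, and the method outperforms baselines on conversational chat and forum discussion.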

English abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

2606.19327 2026-06-18 cs.AI cs.CL New submission 80%

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

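Affiliations * Yale University

Topic match Post-training: rubric-conditioned self-distillation for optimizing reasoning models

AI summary Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.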

English abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2606.19004 2026-06-18 cs.DC cs.AI cs.LG 新提交 80%

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Spotlight: 协同种子探索与抢占式GPU用于DiT强化学习后训练

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

发表机构 * NTU Singapore(南洋理工大学) Hong Kong University of Science and Technology(香港科技大学) Alibaba Group(阿里巴巴集团)

专题命中 后训练 :提出Spotlight系统,利用抢占式GPU加速DiT强化学习后训练。

AI总结 针对DiT强化学习后训练成本高的问题,提出Spotlight系统,通过利用探索对旧权重的容忍性和SP组快速重配置,在抢占式GPU上实现高效训练,加速4倍并降低成本1.4-6.4倍。

详情
AI中文摘要

扩散Transformer(DiT)的强化学习(RL)后训练成本极高,需要数千块高端GPU。现有工作探索了两个降低成本的方向:种子探索通过选择高对比度样本来改善训练收敛,但增加了关键路径的计算量;抢占式GPU提供69-77%的成本降低,但在训练期间处于空闲状态,因为DiT rollout几乎同时完成,这阻止了类似LLM的rollout与训练流水线化。抢占式GPU的抢占进一步破坏了序列并行(SP)组,导致GPU拓扑碎片化。我们提出了Spotlight,这是第一个利用抢占式GPU进行DiT RL后训练的系统。Spotlight基于我们设计的两个关键洞察:(1)我们证明探索可以容忍过时的模型权重,因为使用前一次迭代模型权重的探索保留了随机种子的相对排序,允许探索在训练期间在空闲的抢占式GPU上运行。(2)SP重配置可以重用节点内状态,将组恢复时间从分钟级缩短到亚秒级启动。基于这些洞察,Spotlight引入了三种技术:基于bandit的探索规划器,在训练时间预算内最大化奖励方差;弹性序列并行,通过持久调度器和节点内权重复制动态重配置SP组;以及抢占感知的拉取式请求调度器,平衡负载并在抢占时提交进行中的状态。我们在开源RL平台ROLL上实现了Spotlight,并在Qwen-Image后训练上进行了评估。Spotlight达到相同目标验证分数的速度比基线快4倍,总成本降低1.4-6.4倍,同时在分辨率512×512和1280×1280的DeepSeek-OCR和Geneval数据集上实现了更优的图像质量。

英文摘要

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training. (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4$-$6.4\times$ while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution $512\times512$ and $1280\times1280$.
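The stale-weight insight can be read as a rank-agreement claim: if per-seed rewards computed with the previous iteration's weights order the seeds the same way as rewards computed with the current weights, then exploring on stale weights picks the same seeds. The sketch below only illustrates that check. The function names `kendall_tau` and `top_k_seeds` and the reward values are hypothetical and are not taken from Spotlight.

```python
from itertools import combinations
from typing import List


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall tau-a rank correlation between two score lists of equal length."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_seeds(rewards: List[float], k: int) -> List[int]:
    """Indices of the k seeds with the highest reward, best first."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    return order[:k]


# Hypothetical per-seed rewards under current and previous-iteration weights.
fresh_rewards = [0.10, 0.50, 0.30, 0.90]
stale_rewards = [0.20, 0.60, 0.35, 0.80]

agreement = kendall_tau(fresh_rewards, stale_rewards)
same_choice = top_k_seeds(fresh_rewards, 2) == top_k_seeds(stale_rewards, 2)
```

In this toy data the absolute rewards differ but the ordering is identical, so a selection rule that depends only on ranking is unaffected by the staleness.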

2606.19002 2026-06-18 cs.CL 新提交 80%

Enhancing Multilingual Reasoning via Steerable Model Merging

通过可引导的模型合并增强多语言推理

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Fudan University(复旦大学) Beihang University(北京航空航天大学) Monash University(蒙纳士大学) Zhongguancun Laboratory(中关村实验室) Nanjing University(南京大学) Tsinghua University(清华大学)

专题命中 后训练 :提出可引导模型合并框架,增强多语言推理能力。

AI总结 提出可引导模型合并(ST-Merge)框架,通过门控交叉注意力机制自适应调节源模型贡献,在多语言推理任务中优于强基线。

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

详情
AI中文摘要

模型合并是组合多语言模型和推理模型能力的有效技术。通过对齐不同模型的特征空间,它在多语言推理任务中取得了有希望的泛化效果。然而,合并后的单一模型往往无法解决源模型之间的冲突,导致性能次优。换句话说,一刀切的合并策略可能无法适应不同输入的特性,这些输入可能要求优先考虑某些模型。为此,我们提出了一个可引导模型合并(ST-Merge)框架来调节每个源模型的贡献。为了实现这一想法,我们引入了一种门控交叉注意力机制,以自适应方式加权或过滤两个关注的源模型。大量实验表明,ST-Merge在涵盖21种不同语言的四个多语言推理基准上持续优于多个强基线。

英文摘要

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

2606.18967 2026-06-18 cs.LG 新提交 80%

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout: 面向强化学习推演的系统感知自推测解码

Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang

发表机构 * FuriosaAI University of California, Berkeley(加州大学伯克利分校)

专题命中 后训练 :提出自推测解码加速强化学习推演。

AI总结 针对强化学习推演中自回归解码延迟瓶颈,提出系统感知的自推测解码框架,通过量化自推测草拟器与系统感知的推测开关策略,在保持模型质量前提下降低推演和端到端延迟。

Comments Project Page: https://github.com/furiosa-ai/EfficientRollout

详情
AI中文摘要

强化学习(RL)已成为LLMs代表性后训练范式,赋予其强大的推理和智能体能力。然而,推演生成仍是主要的延迟瓶颈,因为自回归采样顺序解码响应,且少量长尾生成往往决定完成时间。推测解码(SD)为缓解此瓶颈提供了自然途径,它是一种用于服务固定LLMs的成熟技术,通过快速草拟令牌并通过并行验证接受它们来降低延迟,同时保持目标模型分布。但其实际加速效果无法直接迁移到RL推演:(i)不断变化的目标策略使得任何固定草拟者与策略输出分布日益不匹配;(ii)推演解码过程中活跃批次大小缩小,解码从计算受限转向内存受限,此时并行验证可利用未充分利用的计算资源。因此,加速RL推演既需要草拟者在长序列、高温生成下对演化策略保持有效,也需要以系统感知的方式使用SD,以避开计算受限状态。我们提出EfficientRollout,一个系统感知的自推测解码框架,旨在解决RL推演中的这一差距。EfficientRollout从目标模型诱导量化草拟者(即自推测解码),使其与演化策略保持耦合,无需单独草拟者预训练或在线适应。它进一步协调系统感知的SD开关策略与接受感知的草稿长度自适应,仅在有益状态下进行推测,同时使草拟预算与演化草拟者质量匹配。相比经过加速的自回归(AR)推演基线,EfficientRollout分别将推演和端到端延迟降低高达19.6%和12.7%,同时保持最终模型质量。

英文摘要

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
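To see why draft length should track acceptance, consider the textbook cost model for speculative decoding. With an i.i.d. per-token acceptance rate `alpha`, draft length `gamma`, and a drafter-to-target cost ratio `cost_ratio`, the expected number of tokens per verification step is (1 - alpha^(gamma+1)) / (1 - alpha), and the expected wall-time speedup divides this by (gamma * cost_ratio + 1). The sketch below uses that model to pick a draft length and treats `gamma = 0` as "speculation off". It is not EfficientRollout's policy: the function names are hypothetical, and the model ignores the batch-size and compute-bound effects the abstract highlights.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification, i.i.d. acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("alpha must be in [0, 1] and gamma >= 0")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding; cost_ratio = drafter step cost / target step cost."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def choose_draft_length(alpha: float, cost_ratio: float, max_gamma: int = 8) -> int:
    """Draft length with the best modeled speedup; 0 means do not speculate."""
    return max(range(max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))


# A weak drafter (low acceptance, expensive) is switched off,
# while a strong, cheap drafter gets a positive draft budget.
weak = choose_draft_length(alpha=0.1, cost_ratio=0.5)
strong = choose_draft_length(alpha=0.9, cost_ratio=0.05)
```

In practice `alpha` would be re-estimated from recent accept/reject counts as the policy and the quantized drafter evolve, which is the sense in which the draft budget is "acceptance-aware".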

2606.18844 2026-06-18 cs.LG 新提交 80%

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

从自身错误中学习:为自蒸馏构建可学习的微反思轨迹

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

发表机构 * Qwen Business Unit of Alibaba(阿里巴巴通义千问事业部) Tsinghua University(清华大学) Peking University(北京大学)

专题命中 后训练 :策略优化方法,利用自身轨迹。

AI总结 提出TAPO方法,通过对比正确与错误轨迹构建微反思修正,实现从隐式分布对齐到显式轨迹构建的自蒸馏改进,在多个数学推理基准上优于GRPO。

详情
AI中文摘要

自蒸馏通过使用模型自身的生成作为训练信号来改进大型语言模型的推理能力,通常通过隐式的logit级对齐来实现,最小化与特权目标分布的KL散度。然而,由于这种监督是通过无控制采样生成的,它无法提供关于模型特定错误的诊断性洞察,也无法针对其个体失败模式提供纠正性指导。因此,模型学习的是模仿特权分布,而不是接收精确指出其推理失败位置和原因的细粒度修正。在本文中,我们提出了轨迹增强策略优化(TAPO),将自蒸馏从隐式分布对齐推进到显式轨迹构建。在强化学习训练期间,模型对同一查询同时产生正确和错误的生成轨迹,TAPO利用这种对比结构来构建微反思修正——新的训练轨迹,保留模型在失败点之前的错误推理,然后插入自然语言诊断和由同一采样组中的正确参考引导的修正推理。由于每条轨迹都锚定在学习者自身的前缀和解决方案上,与基于KL的方法施加的位置级对齐相比,修正信号在更大程度上保留了模型的在策略分布。为了整合这些轨迹,TAPO在模型能力边界引入了难度感知的候选选择,并采用解耦优势估计以防止梯度污染。在AIME 2024、AIME 2025和HMMT 2025上的实验表明,在相同训练步数下,TAPO相比GRPO取得了一致的改进。进一步分析表明,TAPO增强了首次推理和错误纠正的有效性。

英文摘要

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
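The trajectory construction described above (erroneous prefix, then a diagnosis, then corrected reasoning) can be sketched as simple list assembly. This is a minimal illustration and not the TAPO implementation. The `Rollout` type, the `build_micro_reflective` helper, the "Reflection:" marker, and the assumption that the failure index and diagnosis text are supplied by some upstream step are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    steps: List[str]   # reasoning steps in order
    correct: bool      # whether the final answer was verified as correct


def build_micro_reflective(wrong: Rollout, failure_idx: int,
                           diagnosis: str, corrected_steps: List[str]) -> List[str]:
    """Keep the wrong rollout up to the failing step, then diagnose and correct.

    failure_idx is the index of the first erroneous step (assumed to be given).
    corrected_steps is the continuation guided by a correct rollout from the same group.
    """
    if wrong.correct:
        raise ValueError("expected an incorrect rollout")
    if not 0 <= failure_idx < len(wrong.steps):
        raise ValueError("failure_idx out of range")
    prefix = wrong.steps[: failure_idx + 1]  # the learner's own prefix, error included
    return prefix + [f"Reflection: {diagnosis}"] + list(corrected_steps)


# Toy example with made-up steps.
wrong = Rollout(steps=["x + 2 = 5", "x = 7", "answer: 7"], correct=False)
trajectory = build_micro_reflective(
    wrong,
    failure_idx=1,
    diagnosis="2 was added instead of subtracted.",
    corrected_steps=["x = 5 - 2 = 3", "answer: 3"],
)
```

Because the prefix is copied from the learner's own rollout, only the reflection and the corrected continuation are new tokens, which matches the abstract's point about staying close to the on-policy distribution.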

2606.18774 2026-06-18 cs.LG 新提交 80%

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

RouteJudge: 一个可复现且偏好感知的LLM路由开放平台

Guannan Lai, Haoran Hu, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) National Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) SinapisAI

专题命中 后训练 :评估LLM路由策略,偏好感知平台。

AI总结 提出RouteJudge平台,通过匿名成对比较评估LLM路由策略的决策质量,并发布ORBIT工具箱标准化路由工作流,支持可复现和偏好感知的路由评估。

Comments Accepted by Pluralistic Alignment Workshop at ICML 2026

详情
AI中文摘要

我们提出RouteJudge,一个用于LLM路由系统的在线成对偏好评估框架,并提供一个公开平台(https://...)。与模型级别的响应评估不同,RouteJudge关注路由器级别的决策质量。对于每个用户查询,多个路由策略在相同的模型池和预算约束下独立推荐候选模型。然后通过匿名成对比较将所选模型的响应呈现给用户,由此产生的用户偏好归因于比较响应背后的路由策略。每条评估记录存储查询、路由决策、模型响应、偏好标签、成本、延迟和任务元数据,从而支持对LLM路由器进行偏好感知、成本感知和任务条件分析。为了支持RouteJudge中路由方法的持续扩展,我们进一步发布了ORBIT(最优路由与预算推理工具箱),这是一个模块化且可扩展的工具箱,标准化了LLM路由的端到端工作流。ORBIT为基准加载、查询表示、路由器实现、预算感知评估和方法比较提供了统一接口,允许研究人员在一致的协议下开发和评估路由算法。它同时作为RouteJudge的提交和集成层:研究人员可以在ORBIT中实现路由方法,在现有路由基准上验证它们,并提交兼容的路由器进行在线偏好评估。ORBIT的代码可在https://...获取。

英文摘要

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.
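The evaluation record and the attribution of preferences back to routing strategies can be sketched as a small data structure plus a tally. The field list follows the abstract (query, routing decisions, model responses, preference label, cost, latency, task metadata). The concrete types, the field names, and the `strategy_win_rates` helper are assumptions for illustration and do not reflect the RouteJudge or ORBIT APIs.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EvaluationRecord:
    query: str
    routing_decisions: Dict[str, str]   # routing strategy -> model it selected
    model_responses: Dict[str, str]     # model -> response shown to the user
    preferred_strategy: Optional[str]   # strategy behind the preferred response; None = tie
    cost_usd: Dict[str, float] = field(default_factory=dict)
    latency_s: Dict[str, float] = field(default_factory=dict)
    task_metadata: Dict[str, str] = field(default_factory=dict)


def strategy_win_rates(records: List[EvaluationRecord]) -> Dict[str, float]:
    """Fraction of comparisons each routing strategy won (ties count as no win)."""
    wins: Dict[str, int] = defaultdict(int)
    appearances: Dict[str, int] = defaultdict(int)
    for rec in records:
        for strategy in rec.routing_decisions:
            appearances[strategy] += 1
        if rec.preferred_strategy is not None:
            wins[rec.preferred_strategy] += 1
    return {s: wins[s] / n for s, n in appearances.items()}


# Hypothetical strategies and models.
records = [
    EvaluationRecord("q1", {"cheapest": "m-small", "knn": "m-large"},
                     {"m-small": "...", "m-large": "..."}, "knn"),
    EvaluationRecord("q2", {"cheapest": "m-small", "knn": "m-small"},
                     {"m-small": "..."}, None),
]
rates = strategy_win_rates(records)
```

Storing cost, latency, and task metadata on the same record is what makes cost-aware or task-conditioned breakdowns possible: the same tally can be run over records filtered by task type or by a budget threshold.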

5. 其他LLM 11 篇

2606.18431 2026-06-18 cs.LG cs.DC 新提交 85%

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

超越预测:面向LLM推理的尾延迟感知调度

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

发表机构 * Cornell University, Computer Science Department(康奈尔大学计算机科学系) Cornell University, Electrical and Computer Engineering Department(康奈尔大学电气与计算机工程系) Cornell University, Operations Research and Information Engineering Department(康奈尔大学运筹学与信息工程系) Microsoft Azure System Research(微软Azure系统研究) NVIDIA Corporation(英伟达公司)

专题命中 其他LLM :提出LLM推理调度框架,优化尾延迟

AI总结 针对LLM推理中长度预测调度在分布偏移和尾延迟控制上的脆弱性,提出无预测的分布感知调度框架,通过轻量统计信号实现软优先级提升,结合缓存感知抢占,在多种工作负载下将P99 TTLT降低35-50%,TTFT降低34-47%。

Journal ref Forty-Third International Conference on Machine Learning (2026)

详情
AI中文摘要

LLM服务表现出极端的长度可变性,使得基于大小的调度在实践中变得困难。最近的LLM调度器使用预测的解码长度或排名来近似SJF/SRPT,并主要报告均值中心指标如TTFT和TBT。我们表明,这些预测驱动的策略在分布偏移、突发到达和GPU内存压力下可能脆弱,同时对主导用户体验的尾延迟(P90-P99)控制有限,即使拥有完美的解码长度知识。我们引入了一个分布感知、无预测的调度框架,用由轻量统计信号驱动的软优先级提升取代显式长度预测。我们的设计协同优化调度和缓存感知抢占,以考虑跨工作负载混合的内存耦合解码动态。在生产环境和开源轨迹上的评估表明,相对于具有完美长度知识的SRPT,我们的方法将P99 TTLT降低了高达35-50%,并在各种工作负载(包括推理密集型和聊天密集型任务)上将TTFT降低了34-47%。这些结果证明了在在线LLM服务中优化尾延迟的稳健替代方案。

英文摘要

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

2606.18394 2026-06-18 cs.CL 新提交 85%

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow: 通过并行树草稿突破推测解码的缩放上限

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

发表机构 * UC San Diego(加州大学圣地亚哥分校) Zhejiang University(浙江大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Nanjing University(南京大学) StepFun(阶跃星辰)

专题命中 其他LLM :提出并行树草稿加速LLM推测解码

AI总结 提出JetFlow框架,通过因果并行草稿头结合树推测解码,将更大草稿预算转化为更长接受前缀和更高端到端加速,在Qwen3模型上实现最高9.64倍加速。

详情
AI中文摘要

推测解码(SD)通过草拟多个令牌并并行验证来加速自回归大语言模型(LLM),但面临缩放限制:仅当接受率保持较高且草拟开销较低时,增加草稿预算才能提高速度。这一上限难以突破,因为先前基于头的SD方法面临因果-效率困境。自回归草稿器生成路径条件候选,适用于树推测解码且接受长度更高,但其草拟成本随树深度增长。双向块扩散草稿器一次性生成所有位置,但其分支无关的边缘分布可能形成个体合理但相互不一致的树,浪费预算并降低接受率。我们提出JetFlow,一种基于头的SD框架,结合单次前向草拟效率与分支级因果条件。JetFlow在冻结目标模型的融合隐藏状态上训练因果并行草稿头,生成得分与目标模型自回归分解对齐的候选树。这使得JetFlow能够将更大的草稿预算转换为更长的接受前缀和更高的端到端加速。在密集和MoE Qwen3模型上的数学、编码和聊天基准测试中,JetFlow始终优于双向头和基于树的SD基线。在H100 GPU上,JetFlow在MATH-500上实现高达9.64倍加速,在开放式对话工作负载上实现4.58倍加速,并通过vLLM集成在实际服务负载下进一步降低延迟。我们的代码和模型可在 https://github.com/hao-ai-lab/JetFlow 获取。

英文摘要

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 85%

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

基于Bandit的提示设计策略选择改进提示优化器

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

专题命中 其他LLM :提出OPTS方法优化LLM提示策略

AI总结 本文提出OPTS方法,通过显式选择提示设计策略提升EvoPrompt性能,采用Thompson采样机制在BIG-Bench Hard上验证效果,实现最优结果。

Comments Accepted to ACL 2025 Findings

详情
AI中文摘要

提示优化旨在寻找能提升大语言模型性能的有效提示。尽管现有方法已发现有效提示,但往往与人类专家精心设计的复杂提示不同。提示设计策略作为提升提示性能的最佳实践,对优化提示至关重要。最近,Autonomous Prompt Engineering Toolbox (APET) 将多种提示设计策略整合到提示优化过程中。在APET中,需要LLM隐式选择和应用合适的策略,因为提示设计策略可能产生负面影响。这种隐式选择可能因LLM的有限优化能力而表现不佳。本文引入Optimizing Prompts with sTrategy Selection (OPTS),实现提示设计的显式选择机制。我们提出三种机制,包括基于Thompson采样的方法,并将其整合到EvoPrompt中。在使用BIG-Bench Hard对Llama-3-8B-Instruct和GPT-4o mini进行提示优化的实验中,结果表明提示设计策略的选择提升了EvoPrompt的性能,Thompson采样机制实现了最佳整体结果。我们的实验代码可在https://github.com/shiralab/OPTS获取。

英文摘要

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
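Explicit strategy selection with Thompson sampling can be sketched as a Beta-Bernoulli bandit: each prompt design strategy keeps success and failure counts, a plausible success rate is sampled from each posterior, and the strategy with the highest sample is applied next. This is a generic sketch, not the OPTS code. The class name, the binary "did the prompt improve" reward, the strategy names, and the simulated success probabilities are all assumptions.

```python
import random
from typing import Dict, Iterable


class ThompsonStrategySelector:
    """Beta-Bernoulli Thompson sampling over prompt design strategies."""

    def __init__(self, strategies: Iterable[str], seed: int = 0) -> None:
        self.strategies = list(strategies)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per strategy.
        self.successes: Dict[str, int] = {s: 1 for s in self.strategies}
        self.failures: Dict[str, int] = {s: 1 for s in self.strategies}
        self.rng = random.Random(seed)

    def select(self) -> str:
        """Sample a success rate per strategy and return the argmax."""
        samples = {
            s: self.rng.betavariate(self.successes[s], self.failures[s])
            for s in self.strategies
        }
        return max(samples, key=samples.get)

    def update(self, strategy: str, improved: bool) -> None:
        """Record whether applying the strategy improved the prompt's score."""
        if improved:
            self.successes[strategy] += 1
        else:
            self.failures[strategy] += 1


# Simulated loop; in a real optimizer the reward would come from evaluating
# the mutated prompt on a development set.
true_rates = {"chain_of_thought": 0.8, "role_prompting": 0.5, "add_examples": 0.2}
selector = ThompsonStrategySelector(true_rates, seed=0)
environment = random.Random(1)
pulls = {s: 0 for s in true_rates}
for _ in range(2000):
    chosen = selector.select()
    pulls[chosen] += 1
    selector.update(chosen, environment.random() < true_rates[chosen])
```

The appeal of this design is that strategies with a negative effect are tried less and less often without being ruled out up front, which is the failure mode the abstract attributes to leaving the choice implicitly to the LLM.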

2506.09822 2026-06-18 cs.CE cs.AI 85%

Superstudent intelligence in thermodynamics

热力学中的超级学生智能

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

发表机构 * Laboratory of Engineering Thermodynamics (LTD)(工程热力学实验室) Visual Information Analysis Research Group (VIA)(视觉信息分析研究组) Machine Learning Research Group (ML)(机器学习研究组)

专题命中 其他LLM :评估o3模型在热力学考试中的表现

AI总结 研究展示OpenAI的o3模型在热力学考试中超越所有学生,证明机器在复杂任务中的能力,影响工程教育与实践。

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

详情
AI中文摘要

在这篇短文中,我们报告并分析了一个引人注目的事件:OpenAI的大型语言模型o3在一场大学热力学考试中胜过了所有学生。热力学考试对大多数学生来说是一道难关,他们必须在考试中证明自己掌握了这一重要主题的基本原理。因此,不及格率很高,A级成绩十分罕见,并被视为学生卓越智力的证明。这是因为模式学习对这类考试没有帮助,题目只能通过运用知识并创造性地结合热力学原理来解决。我们不仅让学生参加了最新的热力学考试,还将其交给OpenAI最强大的推理模型o3作答,并以与学生完全相同的方式评估其答案。在零样本模式下,o3正确解答了所有问题,优于所有参加考试的学生;其总分处于我们自1985年以来在10000多次类似考试中所见最佳成绩的范围内。这是一个转折点:机器如今在通常被视为人类智力证明的复杂任务中表现出色。我们讨论了这对工程师的工作以及未来工程师的教育所带来的影响。

英文摘要

In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 85%

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

高中数学中不断演进的大语言模型评估

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

发表机构 * Faculty of Information Technology(信息科技学院) University of Jyväskylä(于韦斯屈莱大学) Faculty of Humanities and Social Sciences(人文与社会科学学院)

专题命中 其他LLM :评估LLM在中学数学考试中的能力

AI总结 本文评估了不同大语言模型在芬兰毕业考试中的数学能力,发现随着模型演进,其表现显著提升,部分模型接近完美,展示了LLM在数学能力上的快速进步及其在教育中的潜力。

详情
AI中文摘要

大型语言模型(LLMs)在教育环境中展现出日益增长的前景,但其数学推理能力一直被认为仍在发展之中。本研究通过芬兰高中毕业会考(一项面向高中教育的高风险数字化考试)评估了多种LLMs的数学能力。初步测试显示其表现中等,对应中等成绩,但后续评估显示,随着语言模型的演进,表现显著提升。值得注意的是,某些模型取得了接近满分或满分的成绩,与顶尖学生表现相当,并达到大学入学要求。我们的发现突显了LLM数学能力的快速进步,并展示了其作为支持学习和教学的底层工具的潜力。

英文摘要

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.

2606.19256 2026-06-18 cs.AI 新提交 80%

X+Slides: Benchmarking Audience-Conditioned Slide Generation

X+Slides:面向受众条件的幻灯片生成基准测试

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Harbin Institute of Technology(哈尔滨工业大学) SenseTime

专题命中 其他LLM :LLM幻灯片生成基准测试

AI总结 提出X+Slides基准,通过动态评估框架和受众特定权重,衡量幻灯片生成系统在受众覆盖、领域覆盖、效率和正确性方面的表现,揭示现有系统在受众关键信息恢复上的不足。

详情
AI中文摘要

从源文档自动生成幻灯片是大语言模型(LLMs)的重要应用。现有基准主要评估幻灯片的完整性和技术深度,而忽略了目标受众这一关键现实因素。例如,专家需要严格的证明,而决策者优先考虑可操作的结论。为弥补这一差距,我们引入了X+Slides,一个专门为受众条件幻灯片生成设计的基准。基于涵盖113个主题和七种演示场景的多样化语料库,X+Slides采用由8,133个去重、基于源的探针构建的动态评估框架。通过为相同的基于源的探针分配受众特定的效用权重,X+Slides报告四个互补指标:受众覆盖率衡量传达了受众必要信息的程度,领域覆盖率显示覆盖了哪些信息类型,效率衡量每单位注意力成本传递的效用,正确性验证幻灯片声明是否得到源支持。在DeepPresenter、SlideTailor和NotebookLM上的实验表明,当前系统可以恢复大部分但仍有缺失的受众必要信息:在τ_A=0.7时,DeepPresenter达到最佳受众覆盖率0.714,SlideTailor达到0.594,NotebookLM消融达到0.853,同时显示出明显的接地差异。这些结果表明,视觉质量和广泛的主题覆盖不应在没有基于源评估的情况下被视为证据支持。

英文摘要

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

2606.18946 2026-06-18 cs.CL 新提交 80%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

SenFlow: 面向混合文档中AI生成文本检测的句间流建模

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

发表机构 * Northwestern Polytechnical University(西北工业大学) Zhejiang Lab(浙江实验室)

专题命中 其他LLM :检测LLM生成文本,建模句间依赖

AI总结 针对人机混合文档的句子级AI文本检测,提出SenFlow模型,通过图传播和CRF解码建模句间依赖,在MOSAIC基准上跨域F1提升4.15个百分点。

Comments 16 pages, 4 figures, 9 tables

详情
AI中文摘要

针对混合文档(人类与LLM共同撰写同一文本)的句子级AI生成文本检测(S-AGTD)面临两个空白:现有方法孤立地对每个句子进行分类,忽略了句间依赖;现有基准遗漏了最新一代生成器。我们构建了MOSAIC基准,包含来自PubMed和XSum的16,000个混合文档,由DeepSeek-V3.2和Kimi K2生成,并经过严格质量控制,包括先前基准中缺失的困惑度一致性过滤器。我们将S-AGTD重新定义为文档句子序列上的结构化预测,并实例化为SenFlow,在句子图的单次文档级传递中,将基于图的句间传播与线性链CRF解码相结合。SenFlow在MOSAIC上达到了最先进的性能,在跨域迁移(三种难度递增协议中最难的一种)上平均Macro-F1提高了4.15个百分点。我们进一步发现,即使困惑度过滤器平衡了显式线索,AI插入仍然保留了一个依赖于生成器的句子长度差距,句子级检测器仍可利用这一点。代码和数据:https://github.com/luojingkun22/SenFlow

英文摘要

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
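The benefit of decoding the sentence sequence jointly, rather than classifying each sentence in isolation, can be seen in a small Viterbi sketch for a linear-chain CRF. It covers only the decoding step and omits the graph-based propagation. The emission and transition scores are made-up numbers, and `viterbi_decode` is a hypothetical helper, not code from the SenFlow repository.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y] is the score of label y for sentence t;
    transitions[x][y] is the score of moving from label x to label y.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda x: score[x] + transitions[x][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emit[y])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Labels: 0 = human-written, 1 = AI-generated. The third sentence is a weak "AI" call.
emissions = [[2.0, 0.0], [2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
transitions = [[0.5, -0.5], [-0.5, 0.5]]  # staying in the same label is favored

independent = [max(range(2), key=lambda y: e[y]) for e in emissions]
joint = viterbi_decode(emissions, transitions)
```

Here the per-sentence argmax labels the third sentence as AI-generated on a margin of 0.1, while joint decoding keeps it human-written because two label switches cost more than that margin. With stronger emission evidence the switch would survive, so the transitions act as a prior on span continuity and not as a hard constraint.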

2606.18922 2026-06-18 cs.CL cs.AI 新提交 80%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单:评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol(智能系统实验室 英国布里斯托尔大学) ILLC University of Amsterdam(阿姆斯特丹大学逻辑、语言与计算研究所)

专题命中 其他LLM :评估LLM对否定与比喻语言的理解

AI总结 本研究通过开发新的注释数据集,测试多种大型语言模型在比喻语言中理解否定的能力,发现否定与比喻的组合对模型构成挑战,且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

详情
AI中文摘要

比喻语言和否定是当前语言模型面临挑战的两个领域,然而,两者在书面和口语中广泛使用。大型语言模型(LLMs)也广泛应用于日常场景,在这些场景中它们不一定能针对特定数据集进行调整。因此,理解LLMs正确解释包含否定和比喻语言的文本的能力至关重要。为了研究这一点,我们为现有的比喻语言数据集开发了一套新的注释,并在该数据集上测试了一系列语言模型。我们发现,否定和比喻性的结合可能带来特殊挑战,并且整体性能以及不同否定类型上的性能特别依赖于所使用的提示风格。

英文摘要

Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18797 2026-06-18 cs.CL 新提交 80%

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

超越标量分数:探索基于LLM的放射学报告临床意义评估指标

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

发表机构 * Nanyang Technological University(南洋理工大学) Technical University of Munich(慕尼黑工业大学) Alibaba(阿里巴巴) University of Glasgow(格拉斯哥大学) University of Massachusetts Boston(马萨诸塞大学波士顿分校)

专题命中 其他LLM :基于LLM的放射学报告评估指标

AI总结 针对放射学报告评估中临床准确性要求,研究基于LLM的指标区分临床错误与无害变体的能力,发现判别偏差,并通过合成数据训练轻量级指标,在成本敏感部署中优于大型模型。

Comments Under Review

详情
AI中文摘要

对生成的放射学报告进行可靠评估需要严格的临床准确性,因为遗漏关键发现或误判影像学观察结果会直接影响患者护理。现有指标通过将报告质量简化为一个医学上无依据的标量而模糊了这一要求。尽管大型语言模型(LLM)拥有丰富的医学知识,但它们同样难以在临床显著错误和无害变异之间划定可靠边界。我们以ReEvalMed基准为测试平台研究这一边界,并从检测真实临床错误(“判别力”)和容忍无关变异(“鲁棒性”)两方面评估指标的临床意义。在单次和两次设置下对8个LLM评估器进行实验,我们发现了一个普遍的判别偏差:模型能有效检测错误,但也过度惩罚无害的改写。为缓解这一问题,我们合成了4000对报告,并在Qwen3-8B和MedGemma-4B上训练了轻量级可解释指标。我们训练的指标明确了临床意义边界,超越了32B规模的医学LLM,并与专有模型保持竞争力。关键的是,成本更高的两次设置未能持续提升整体性能,主要是在用判别力换取鲁棒性。这些发现表明,单次训练指标是成本敏感部署的实用选择,而两次推理则保留给判别-鲁棒平衡至关重要的场景。我们将发布数据集和指标。

英文摘要

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the balance between Discrimination and Robustness is critical. We will release the dataset and metric.
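To make the two axes concrete, here is a minimal sketch of how Discrimination and Robustness could be scored for a binary evaluator. The definitions used here (fraction of erroneous pairs that get flagged, and fraction of harmless pairs that are left unflagged) are an assumption for illustration; the abstract does not give the exact formulas used with ReEvalMed, and the function name `discrimination_robustness` is hypothetical.

```python
def discrimination_robustness(records):
    """Score a binary evaluator on two axes (illustrative definitions only).

    records: iterable of (has_clinical_error, flagged) boolean pairs, where
    `flagged` is the evaluator's verdict that the report pair differs in a
    clinically significant way.
    """
    records = list(records)  # allow generators; we iterate twice
    flags_on_errors = [flagged for has_error, flagged in records if has_error]
    flags_on_harmless = [flagged for has_error, flagged in records if not has_error]

    # Discrimination (assumed): share of true clinical errors that are flagged.
    discrimination = (
        sum(flags_on_errors) / len(flags_on_errors)
        if flags_on_errors else float("nan")
    )
    # Robustness (assumed): share of harmless variations that are NOT flagged.
    robustness = (
        sum(not flagged for flagged in flags_on_harmless) / len(flags_on_harmless)
        if flags_on_harmless else float("nan")
    )
    return discrimination, robustness
```

Under these assumed definitions, the discrimination bias described in the abstract would show up as a high first value paired with a low second value: errors are caught, but harmless rephrasings are flagged as well.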

2606.18741 2026-06-18 cs.DC 新提交 80%

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

ReMP:面向LLM服务的低停机时间运行时模型并行重配置

Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng

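Topic match Other LLM: model-parallelism reconfiguration for LLM inference serving, low downtime.

AI summary: Proposes the ReMP framework, which uses techniques such as decoupling the topology from runtime state and two-dimensional KV cache migration to adjust the model-parallelism topology of LLM inference serving online, cutting reconfiguration downtime from minutes to 1-7 seconds.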

AI中文摘要

当前大语言模型(LLM)推理系统普遍采用张量并行(TP)和流水线并行(PP)的组合来部署超大规模模型。然而,现有系统将模型并行拓扑视为静态配置,无法在运行时灵活调整。这种刚性设计与实际场景中动态变化的推理负载存在根本矛盾。最先进的系统缺乏在线重配置能力,只能通过重启服务来切换配置,导致数分钟的服务中断、KV缓存丢失以及高昂的重计算开销。为解决此问题,本文提出ReMP,一种支持低停机时间的运行时模型并行重配置框架。ReMP通过三项关键技术实现动态调整:(1)将模型并行拓扑与运行时状态解耦,避免完全重建服务;(2)设计二维KV缓存迁移机制,在TP/PP变化后保留可复用的缓存状态;(3)实现端到端的在线重配置。实验表明,ReMP能在7B到70B参数规模的模型上,在1-7秒内完成大多数拓扑切换,相比重启方法实现数十至上百倍的加速。此外,在动态负载下,ReMP显著优于固定配置,在TTFT、TPOT和输出吞吐量方面表现出更优性能。

英文摘要

Current large language model (LLM) inference systems universally deploy ultra-large-scale models using a combination of Tensor Parallelism (TP) and Pipeline Parallelism (PP). However, existing systems treat the model parallelism topology as a static configuration that cannot be flexibly adjusted at runtime. This rigid design creates a fundamental contradiction with the dynamically changing inference workloads in real-world scenarios. State-of-the-art systems lack online reconfiguration capabilities and can only switch configurations by restarting the service, resulting in several minutes of service interruption, KV cache loss, and prohibitive recomputation overhead. To address this problem, this paper presents ReMP, a runtime model parallelism reconfiguration framework that supports low downtime. ReMP achieves dynamic adjustment through three key techniques: (1) decoupling the model parallelism topology from runtime state to avoid full service reconstruction; (2) designing a two-dimensional KV cache migration mechanism to preserve reusable cache states after TP/PP changes; and (3) implementing end-to-end online reconfiguration. Experiments demonstrate that ReMP can complete most topology switches within 1-7 seconds on models ranging from 7B to 70B parameters, achieving speedups of tens to over a hundred times compared to the restart approach. Moreover, ReMP significantly outperforms fixed configurations under dynamic workloads, delivering superior performance in terms of TTFT, TPOT, and output throughput.
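The idea of preserving KV cache across a TP/PP change can be illustrated with a small ownership calculation. The sketch below is not ReMP's algorithm, which the abstract does not detail. It assumes a simple layout in which layers are split contiguously across pipeline stages, attention heads are split contiguously across tensor-parallel ranks, and GPUs are numbered `stage * tp + rank`. The names `gpu_of` and `migration_plan` are hypothetical.

```python
def gpu_of(layer, head, n_layers, n_heads, pp, tp):
    """Return the flat GPU index holding the KV block for (layer, head).

    Assumed layout: pp divides n_layers, tp divides n_heads, and GPUs are
    numbered stage * tp + rank.
    """
    layers_per_stage = n_layers // pp
    heads_per_rank = n_heads // tp
    stage = layer // layers_per_stage   # pipeline dimension
    rank = head // heads_per_rank       # tensor dimension
    return stage * tp + rank


def migration_plan(n_layers, n_heads, old, new):
    """List the KV blocks that must move when switching (pp, tp) configs.

    old, new: (pp, tp) tuples. Returns (layer, head, src_gpu, dst_gpu) for
    every block whose owner changes; blocks that stay put are omitted and
    can be reused in place.
    """
    old_pp, old_tp = old
    new_pp, new_tp = new
    moves = []
    for layer in range(n_layers):
        for head in range(n_heads):
            src = gpu_of(layer, head, n_layers, n_heads, old_pp, old_tp)
            dst = gpu_of(layer, head, n_layers, n_heads, new_pp, new_tp)
            if src != dst:
                moves.append((layer, head, src, dst))
    return moves
```

For a toy model with 4 layers and 4 heads, `migration_plan(4, 4, (2, 2), (1, 4))` reports 12 blocks to move and leaves 4 of the 16 in place. A real system would also have to handle uneven splits, paged cache blocks, and overlapping the transfers with serving.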

2606.18677 2026-06-18 cs.LG cs.AI 新提交 80%

Bounded Context Management for Tabular Foundation Models on Stream Learning

表格基础模型在流学习中的有界上下文管理

Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo

发表机构 * Seoul National University(首尔大学) KAIST(韩国科学技术院)

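Topic match Other LLM: context management for tabular foundation models in stream learning

AI summary: To handle distribution shift in tabular stream learning, the paper proposes CURE, a context-management policy that combines uncertainty-gated admission with redundancy-aware eviction, yielding up to 27.0% relative improvement across seven streams.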

Comments Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

AI中文摘要

表格流学习需要在分布漂移下对顺序到达的样本进行预测。虽然标准方法通过更新模型状态来适应,但表格基础模型(TFMs)以上下文学习的方式,基于带标签的上下文进行预测,使其成为流学习的自然替代方案。这便将挑战从如何更新模型转移到如何管理上下文。我们提出一种未来信息视角,为上下文管理导出三个实际需求:保留最近样本、保留不确定样本、移除冗余样本。我们将这些需求实例化为CURE(通过不确定性感知准入和冗余感知驱逐的上下文管理),一种具有熵门控准入和冗余感知驱逐的上下文管理策略。在七个流上,CURE相比经典流学习器相对提升高达27.0%,在多个TFM骨干上保持鲁棒,并在所比较的策略变体中排名第一。代码和数据集可在 https://github.com/morcellinus/CURE-ICML-FMSD 获取。

英文摘要

Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions conditioned on a labeled context in an in-context manner, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy-aware Eviction), a context-management policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among the policy variants compared. Code and datasets are available at https://github.com/morcellinus/CURE-ICML-FMSD.
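The three requirements (keep recent, keep uncertain, drop redundant) can be sketched as a bounded buffer. This is an illustrative toy under stated assumptions, not the CURE implementation from the linked repository. The entropy threshold, the nearest-same-label-neighbor redundancy score, and the names `BoundedContext`, `entropy_threshold`, and `recent_keep` are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class BoundedContext:
    """Fixed-capacity labeled context with gated admission and eviction."""

    def __init__(self, capacity, entropy_threshold, recent_keep):
        self.capacity = capacity
        self.entropy_threshold = entropy_threshold  # assumed admission gate
        self.recent_keep = recent_keep  # newest items shielded from eviction
        self.items = []  # (features, label) pairs, oldest first

    def admit(self, features, label, probs):
        """Admit an example only if the model's prediction was uncertain.

        probs is the model's predicted distribution made before the label
        arrived. Returns True if the example entered the context.
        """
        if entropy(probs) < self.entropy_threshold:
            return False  # confident prediction: little new information
        self.items.append((features, label))
        if len(self.items) > self.capacity:
            self._evict()
        return True

    def _evict(self):
        """Drop the most redundant item outside the protected recent window.

        Redundancy is scored as the distance to the nearest other item with
        the same label (smaller means more redundant). If no candidate has a
        same-label neighbor, the oldest item is dropped.
        """
        n_candidates = max(1, len(self.items) - self.recent_keep)
        best_i, best_d = 0, float("inf")
        for i in range(n_candidates):
            fi, yi = self.items[i]
            d = min(
                (sq_dist(fi, fj)
                 for j, (fj, yj) in enumerate(self.items)
                 if j != i and yj == yi),
                default=float("inf"),
            )
            if d < best_d:
                best_i, best_d = i, d
        del self.items[best_i]
```

In use, the stream loop would predict each arriving example from `items`, then call `admit` once its label is revealed. The quadratic scan in `_evict` is kept for readability; a larger context would call for an index or cached distances.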