Large Language Models / LLM - arXivDaily Topic

2410.15595 2026-06-18 cs.AI cs.CL cs.LG Updated version 95%

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

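Institutions: Zhejiang University; Nanyang Technological University; Alibaba Group

Topic match (post-training): a survey of DPO, a post-training alignment method for large models

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications, notes its potential and limitations as an RL-free alternative, and proposes future research directions.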

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

Abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
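
For readers who want the objective behind the name, the following is a minimal sketch of the standard DPO loss for a single preference pair, not code from the survey. It assumes the sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already computed; the function name and the default `beta` are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response. No reward model or sampling loop appears, which is the sense in which DPO is RL-free.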

2606.18831 2026-06-18 cs.CL cs.AI New submission 85%

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

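Institutions: OpenBMB; Tsinghua University

Topic match (post-training): improving LLM long-context reasoning through a data recipe and GRPO reinforcement learning

AI summary: Proposes a simple, effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models, with average gains of +3.2 to +7.2 points across seven long-context benchmarks and further gains on agentic tasks.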

Comments 15 pages, 6 figures, 12 tables

Abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
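
As background on what an outcome-based GRPO setup computes, here is a minimal sketch of the group-relative advantage, not the paper's training code. It assumes one scalar outcome reward per sampled response to the same prompt; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and implementations differ on those details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # Each response is scored relative to its siblings, so no learned
    # value function (critic) is needed.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason the difficulty mix of the training data matters in this setting.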

2606.18810 2026-06-18 cs.LG cs.AI New submission 85%

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

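Institutions: Beijing Institute of Technology; Beihang University; Independent Researcher

Topic match (post-training): SC-GRPO, a method for RLVR that improves LLM reasoning

AI summary: Proposes SC-GRPO, which uses the KL divergence between the model's original and self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment, outperforming GRPO by 8.1% across math, code, and agentic tasks.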

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts. GRPO variants rely on process reward models or ground-truth answers. Knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation) or privileged information (On-Policy Self Distillation). However, these dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
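
The idea of using a per-token KL divergence as a multiplicative weight can be illustrated with a small sketch. This is not the authors' implementation: the KL direction (conditioned relative to original), the mean-one normalization, and all names are assumptions made for illustration, and the distributions are assumed to have full support.

```python
import math

def token_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_weighted_advantages(advantage, original_dists, conditioned_dists):
    """Spread one sequence-level advantage over tokens in proportion to KL."""
    kls = [token_kl(c, o) for c, o in zip(conditioned_dists, original_dists)]
    total = sum(kls)
    if total == 0:
        # No token stands out, so fall back to uniform credit.
        return [advantage] * len(kls)
    n = len(kls)
    # Hypothetical normalization: weights average to 1 over the sequence.
    return [advantage * kl * n / total for kl in kls]
```

Tokens whose distribution barely changes under self-conditioning receive weights near zero, while tokens where conditioning shifts the distribution carry most of the update. Scaling the per-token advantage is used here as a stand-in for scaling the per-token gradient.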

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA New submission 85%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

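Institutions: Amazon

Topic match (post-training): LLM agents search over RL post-training strategies

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It finds that capacity parameters accumulate monotonically while regularization parameters oscillate, and improves over the base model by 9% to 140% (relative) on 4 GRPO tasks.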

Abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.01249 2026-06-18 cs.LG cs.CL Updated version 85%

Trust Region On-Policy Distillation

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

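Institutions: Samsung Research; University of Oxford; Peking University

Topic match (post-training): trust-region on-policy distillation for LLM post-training

AI summary: Proposes Trust Region On-Policy Distillation (TrOPD), which uses credit assignment strategies and trust-region learning to address the training instability caused by teacher-student distribution mismatch, and outperforms existing methods on mathematical reasoning, code generation, and general-domain benchmarks.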

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
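
To make the K1 reverse-KL estimator and the masking option concrete, here is a minimal sketch, not the TrOPD implementation. The K1 estimate on a student-sampled token is the log-ratio of student to teacher probability. The abstract does not say how the trust region is defined, so the rule below (keep tokens whose absolute log-ratio is at most `max_gap`) and all names are hypothetical.

```python
def k1_reverse_kl(student_logps, teacher_logps):
    """Per-token K1 estimate of KL(student || teacher) on student samples."""
    return [s - t for s, t in zip(student_logps, teacher_logps)]

def trust_region_mask(student_logps, teacher_logps, max_gap=2.0):
    """Hypothetical rule: trust tokens where the two models roughly agree."""
    return [abs(s - t) <= max_gap
            for s, t in zip(student_logps, teacher_logps)]

def masked_mean_k1(student_logps, teacher_logps, max_gap=2.0):
    """Average the K1 estimate over trusted tokens only."""
    k1 = k1_reverse_kl(student_logps, teacher_logps)
    mask = trust_region_mask(student_logps, teacher_logps, max_gap)
    kept = [k for k, m in zip(k1, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

A token to which the teacher assigns very low probability produces a large log-ratio and would dominate an unmasked average, which is the kind of unreliable supervision the abstract describes. Clipping the log-ratio instead of masking is the other simple variant the abstract mentions.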

2601.17226 2026-06-18 cs.CL cs.AI Updated version 85%

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

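Institutions: University of New South Wales

Topic match (post-training): using reinforcement learning to improve LLM story retelling

AI summary: Proposes RRR, a reinforcement learning framework that combines Structuralist Narratology with scalar narrativity and uses d-RLAIF to derive training signals from textual features without reference outputs, improving the logic, rationality, and completeness of LLM story retelling.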

Comments 8 Pages, 7 figures

Abstract

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2506.14126 2026-06-18 cs.LG cs.AI Updated version 85%

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

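Institutions: Concordia University; Mila -- Québec AI Institute; Google DeepMind

Topic match (post-training): how fine-tuning expert models affects merging

AI summary: Studies how overtraining expert models affects model merging. It finds that long fine-tuning leads to memorization of difficult examples, which causes parameter interference and degrades merging performance, and proposes task-dependent early stopping to improve merging.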

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Abstract

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.

2606.19336 2026-06-18 cs.CL New submission 80%

Learning User Simulators with Turing Rewards

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

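Institutions: Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab

Topic match (post-training): training user simulators with Turing rewards

AI summary: Proposes Turing-RL, a Turing-Test-based reinforcement learning method for training user simulators. A discriminative Turing reward pushes generated responses to be indistinguishable from the real user's, and the method outperforms baselines on conversational chat and forum discussion.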

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains--conversational chat and Reddit forum discussion--we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

2606.19327 2026-06-18 cs.AI cs.CL New submission 80%

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

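Institutions: Yale University

Topic match (post-training): rubric-conditioned self-distillation for training reasoning models

AI summary: Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.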

Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks and results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2606.19004 2026-06-18 cs.DC cs.AI cs.LG New submission 80%

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

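Institutions: NTU Singapore; Hong Kong University of Science and Technology; Alibaba Group

Topic match (post-training): Spotlight, a system that uses spot GPUs to accelerate DiT RL post-training

AI summary: To address the high cost of DiT RL post-training, proposes Spotlight, which trains efficiently on spot GPUs by exploiting exploration's tolerance of stale weights and fast reconfiguration of SP groups, reaching the target validation score 4× faster and reducing total cost by 1.4-6.4×.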

Abstract

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training. (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4× faster than baselines, reducing total cost by 1.4-6.4× while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution 512×512 and 1280×1280.

2606.19002 2026-06-18 cs.CL New submission 80%

Enhancing Multilingual Reasoning via Steerable Model Merging

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

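Institutions: Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University

Topic match (post-training): a steerable model merging framework that enhances multilingual reasoning

AI summary: Proposes the Steerable Model Merging (ST-Merge) framework, which uses a gated cross-attention mechanism to adaptively modulate each source model's contribution, and outperforms strong baselines on multilingual reasoning tasks.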

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

Abstract

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

2606.18967 2026-06-18 cs.LG New submission 80%

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang

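Institutions: FuriosaAI; University of California, Berkeley

Topic match (post-training): self-speculative decoding to accelerate RL rollouts

AI summary: To address the autoregressive decoding latency bottleneck in RL rollouts, proposes a system-aware self-speculative decoding framework that combines a quantized self-speculative drafter with a system-aware speculation toggle policy, reducing rollout and end-to-end latency while preserving model quality.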

Comments Project Page: https://github.com/furiosa-ai/EfficientRollout

Abstract

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
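
The claim that speculative decoding preserves the target-model distribution rests on its accept/reject rule, sketched below for a single drafted token. This is the standard rule from the speculative sampling literature, not EfficientRollout's code; the function names and the toy two-token vocabulary are assumptions. Here `p` is the target model's distribution and `q` is the drafter's.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    res = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(res)
    return [r / total for r in res] if total > 0 else list(p)

def verify_draft_token(token, p, q, rng):
    """Accept a token drawn from q with probability min(1, p/q)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # On rejection, draw a replacement from the residual so that the
    # overall output is still distributed according to p.
    res = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=res)[0], False

rng = random.Random(0)
print(verify_draft_token(1, [0.7, 0.3], [0.5, 0.5], rng))
```

The acceptance rate depends on how closely `q` tracks `p`, which is why a fixed drafter loses effectiveness as the policy changes during RL training and why the abstract derives the drafter from the target model itself.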

2606.18844 2026-06-18 cs.LG New submission 80%

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

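Institutions: Qwen Business Unit of Alibaba; Tsinghua University; Peking University

Topic match (post-training): a policy optimization method that uses the model's own trajectories

AI summary: Proposes TAPO, which contrasts correct and incorrect trajectories to construct micro-reflective corrections, moving self-distillation from implicit distributional alignment to explicit trajectory construction, and improves over GRPO on multiple math reasoning benchmarks.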

Abstract

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.

2606.18774 2026-06-18 cs.LG New submission 80%

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

Guannan Lai, Haoran Hu, Han-Jia Ye

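Institutions: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; SinapisAI

Topic match (post-training): a preference-aware platform for evaluating LLM routing strategies

AI summary: Proposes RouteJudge, a platform that evaluates the decision quality of LLM routing strategies through anonymous pairwise comparisons, and releases the ORBIT toolbox to standardize routing workflows, supporting reproducible and preference-aware routing evaluation.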

Comments Accepted by Pluralistic Alignment Workshop at ICML 2026

Abstract

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.

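2606.13795 2026-06-18 cs.LG New submission 80%

DiPOD: Diffusion Policy Optimization without Drifting Apart

Diffusion Policy Optimization without Drift

Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab

Affiliations * University of California, Berkeley; Simons Institute for the Theory of Computing; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

Topic match (post-training): diffusion policy optimization for language model post-training

AI summary: To address the instability of diffusion policy gradient methods, proposes the DiPOD framework, which alternates self-distillation with policy-improvement gradient updates to keep the bound tight, achieving stable and efficient policy optimization.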

Comments Project page: astro-eric.github.io/blogs/dipod/ Code: https://github.com/Astro-Eric/DiPOD-release

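2606.18910 2026-06-18 cs.LG cs.CL New submission 75%

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES: Revision- and Verification-Augmented Training for Test-Time Scaling

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

Affiliations * Northwestern University; Amazon AGI; Qualcomm AI Research; University of Minnesota

Topic match (post-training): proposes a two-stage training framework to optimize reasoning

AI summary: Proposes the REVES framework, which converts "near-miss" answers from intermediate steps into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving the multi-step reasoning of large language models; on LiveCodeBench it scores 6.5 points higher than the reinforcement learning baseline.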

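AI Chinese abstract (English translation)

Test-time scaling through sequential revision has become a powerful paradigm for strengthening the reasoning ability of large language models (LLMs). However, standard post-training methods mainly optimize single-shot objectives, which is fundamentally mismatched with multi-step inference dynamics. Although recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize multi-step trajectories directly and fail to further exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting intermediate steps ("near-miss" answers) in successful recovery trajectories into decoupled revision and verification prompts, our method focuses training on effective answer transformation and error identification. Compared with standard multi-turn RL, this enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of 6.5 points over the RL baseline and 4.0 points over standard multi-turn training. Beyond coding, our method matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than much larger evolutionary search systems. Math results under ground-truth verification further confirm the improved correction ability. The method also generalizes to out-of-distribution constraint-satisfaction puzzles such as n-queens and mini-sudoku, where correctness is defined entirely by the problem constraints. Code is available at this https URL.

English abstract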

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
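
The augmentation step, in which near-miss answers from a successful recovery trajectory become separate revision and verification prompts, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' code: the `Attempt` type, the prompt templates, and the boolean `label` field are hypothetical, and the abstract does not say how the prompts are worded or scored.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    answer: str
    feedback: str  # e.g. failing test output (assumed form of feedback)
    correct: bool


def augment(problem, trajectory):
    """Split one recovery trajectory into revision and verification prompts."""
    # Only trajectories that end in a correct answer are used.
    if not trajectory or not trajectory[-1].correct:
        return [], []
    revision, verification = [], []
    for attempt in trajectory:
        # Every step yields a verification prompt with a known label.
        verification.append({
            "prompt": f"Problem: {problem}\nAnswer: {attempt.answer}\nIs this answer correct?",
            "label": attempt.correct,
        })
        # Each near-miss (incorrect intermediate step) yields a revision prompt.
        if not attempt.correct:
            revision.append({
                "prompt": (
                    f"Problem: {problem}\nPrevious answer: {attempt.answer}\n"
                    f"Feedback: {attempt.feedback}\nRevise the answer."
                ),
            })
    return revision, verification


trajectory = [
    Attempt("x = 3", "test 2 failed", False),
    Attempt("x = 5", "test 4 failed", False),
    Attempt("x = 4", "all tests passed", True),
]
revision, verification = augment("Solve for x.", trajectory)
```

Because each prompt is a single-step task, later policy optimization can sample from these prompts directly instead of rolling out a full multi-turn trajectory.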

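2606.18627 2026-06-18 cs.LG New submission 70%

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

PACT: Preserving Anchored Cores in Task Vectors for Model Merging

Ningyuan Shi, Zhipeng Zhou, Hao Wang, Chunyan Miao, Peilin Zhao

Affiliations * Shanghai Jiao Tong University; Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou)

Topic match (post-training): model merging method that preserves core dimensions in the pre-trained weights

AI summary: Proposes PACT, which identifies and preserves load-bearing wall dimensions in the pre-trained weights and anchors task-specific cores in the task vectors, addressing task conflicts and performance degradation under the task-vector paradigm and improving model merging.

Comments 33 pages, 14 figures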

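AI Chinese abstract (English translation)

Model merging has become a training-free alternative to multi-task learning that aims to combine several task-specific fine-tuned models into a single multi-task model. Most existing model merging methods follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors and merges only in the task-vector space. The effectiveness of this paradigm implicitly depends on the assumption that task-specific knowledge is encoded only in the task vectors. We argue that this assumption usually does not hold because of the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, that is, task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into the task vectors. We characterize LBW dimensions from both the scalar-weight and the subspace perspective, thereby covering the main paradigms of existing model merging methods. Our analysis shows that ignoring LBW dimensions leaves task-vector-based methods unable to fully resolve task conflicts and may inadvertently damage the task-specific knowledge encoded in the pre-trained model, leading to performance degradation. To address this, we propose PACT, which preserves the anchored task-specific cores (i.e., LBW dimensions) within task vectors by aligning the orthogonal complements of the task vectors with the subspace of the pre-trained weights. These aligned subspace components are removed from the task vectors before existing model merging algorithms are applied. In addition, we develop an efficient variant based on randomized SVD to improve scalability. PACT can be integrated seamlessly into existing methods. Extensive experiments on multiple benchmarks show that PACT consistently improves mainstream model merging methods and establishes new state-of-the-art performance.

English abstract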

Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded solely within task vectors. We argue that this assumption generally does not hold due to the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, namely some task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into task vectors. We characterize LBW dimensions from both scalar-weight and subspace perspectives, thereby covering the major paradigms of existing model merging methods. Our analysis reveals that, by ignoring LBW dimensions, task-vector-based approaches fail to fully resolve task conflicts and may inadvertently damage task-specific knowledge encoded in the pre-trained model, leading to degradation. To address this issue, we propose PACT, which preserves the anchored task-specific cores (i.e., LBW dimensions) within task vectors by aligning their orthogonal complements with the subspace of the pre-trained weights. These aligned subspace components are then removed from the task vectors before applying existing model merging algorithms. Furthermore, we develop an efficient variant based on randomized SVD to improve scalability. PACT can be seamlessly integrated with existing methods. Extensive experiments across multiple benchmarks demonstrate that PACT consistently enhances mainstream model merging approaches and establishes new state-of-the-art performance.
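
For readers unfamiliar with the Task Arithmetic paradigm that the abstract builds on, the sketch below shows the baseline only: task vectors are the differences between fine-tuned and pre-trained weights, and merging adds their scaled sum back to the pre-trained weights. It does not implement PACT's subspace alignment or removal step. Weights are flattened to plain Python lists, and the `scale` parameter and toy numbers are assumptions for illustration.

```python
def task_vector(finetuned, pretrained):
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def merge_task_arithmetic(pretrained, finetuned_models, scale=1.0):
    """Add the scaled sum of all task vectors to the pre-trained weights."""
    merged = list(pretrained)
    for finetuned in finetuned_models:
        tv = task_vector(finetuned, pretrained)
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged


pretrained = [1.0, 2.0, 3.0]
model_a = [1.5, 2.0, 3.0]  # fine-tuned on a hypothetical task A
model_b = [1.0, 2.0, 2.0]  # fine-tuned on a hypothetical task B
merged = merge_task_arithmetic(pretrained, [model_a, model_b], scale=0.5)
```

Note that the pre-trained weights pass through this baseline unchanged apart from the added task vectors, which is the assumption the abstract questions.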

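2606.18606 2026-06-18 cs.CL cs.AI New submission 70%

Steerable Cultural Preference Optimization of Reward Models

Steerable Cultural Preference Optimization for Reward Models

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

Affiliations * Stanford University; University of Amsterdam

Topic match (post-training): trains reward models for LLM alignment

AI summary: Proposes the SCPO algorithm, which trains reward models by balancing multiple cultural preferences, improving minority-group preference prediction accuracy by up to 7 points on the PRISM and GlobalOpinionQA datasets and raising training data efficiency by up to 280%.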

Comments Accepted to Pluralistic Alignment @ ICML 2026

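AI Chinese abstract (English translation)

It is essential that large language model (LLM) technology serve many different cultural sub-communities in a way that is acceptable to each of them. So far, however, research on LLM alignment has mainly focused on predicting a unified response preference of annotators from particular regions. This paper aims to advance alignment models with a more global outlook, models that accurately represent the preferences of sub-communities and do not show excessive bias toward any of them. We focus on developing reward models for this purpose and propose a novel reward model training algorithm (SCPO) that incorporates diverse cultural preferences in a balanced way. Our method improves the performance of the minority reward model by up to 7 points over the baseline model across two datasets (PRISM and GlobalOpinionQA) and 7 countries. SCPO is up to 280% more training-data-efficient than full-data fine-tuning of reward models. In addition, we carry out a bias analysis by evaluating the preferences of sub-communities separately and show that our weighting method mitigates excessive bias. Our code is available at this https URL.

English abstract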

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference

2606.18521 2026-06-18 cs.LG cs.AI New submission 70%

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Sparsity Curse: Understanding the RLVR Model Parameter Space through Model Merging

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

Affiliations * Zhejiang University; Simon Fraser University; The Chinese University of Hong Kong; Zhejiang Key Lab of Accessible Perception and Intelligent Systems

Topic match (post-training): studies the RLVR model parameter space and merging

AI summary: Finds that the sparse updates of RLVR models are spread farther apart in parameter space, forming near-orthogonal shortcuts that make merging fragile, and proposes SAR-Merging to address the problem.

Comments Accepted by KDD 2026

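AI Chinese abstract (English translation)

Reinforcement Learning with Verifiable Reward (RLVR) has become a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that, compared with SFT, RLVR induces highly sparse parameter updates that deviate from the principal components. This naturally raises a question: does this sparsity make RLVR models easier to merge? If so, model merging would offer a scalable, training-free way to aggregate the diverse reasoning capabilities of independently trained RLVR models. Surprisingly, we find the opposite, revealing a sparsity curse: sparse RLVR updates are spread farther apart in parameter space and form near-orthogonal shortcuts, which makes aggregation inherently fragile. This likely stems from the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models, which converge to shared flat basins and merge naturally, RLVR models degrade severely under standard merging methods. Through a systematic empirical analysis of the update geometry, we describe the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored to the unique structure of the RLVR parameter space. SAR-Merging resolves conflicts in overlapping update regions through Fisher Information-based sensitivity arbitration, and then preserves fragile reasoning pathways through magnitude-aware sparsification and rescaling. Experiments on mathematical and coding benchmarks show that SAR-Merging substantially outperforms existing merging methods on RLVR models, achieving both single-task enhancement and multi-capability fusion.

English abstract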

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.
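
One possible reading of the two stages named in the abstract (sensitivity arbitration in overlapping regions, then magnitude-aware sparsification and rescaling) is sketched below on flat lists of parameter updates. Every concrete rule here is an assumption, not the paper's recipe: picking the single update with the highest sensitivity score per coordinate, keeping a fixed number of largest-magnitude entries, and rescaling to preserve the L1 norm are illustrative choices, and the sensitivity scores are toy numbers standing in for Fisher information estimates.

```python
def arbitrate(updates, sensitivities):
    """Per coordinate, keep the nonzero update with the highest sensitivity."""
    dim = len(updates[0])
    resolved = []
    for j in range(dim):
        active = [i for i in range(len(updates)) if updates[i][j] != 0.0]
        if not active:
            resolved.append(0.0)
            continue
        # Assumed rule: the most sensitive model wins an overlapping coordinate.
        best = max(active, key=lambda i: sensitivities[i][j])
        resolved.append(updates[best][j])
    return resolved


def sparsify_and_rescale(update, keep):
    """Keep the `keep` largest-magnitude entries and restore the L1 norm."""
    order = sorted(range(len(update)), key=lambda j: abs(update[j]), reverse=True)
    kept = set(order[:keep])
    sparse = [u if j in kept else 0.0 for j, u in enumerate(update)]
    total = sum(abs(u) for u in update)
    remaining = sum(abs(u) for u in sparse)
    factor = total / remaining if remaining > 0 else 1.0
    return [u * factor for u in sparse]


# Two sparse updates that overlap only on coordinate 0.
updates = [[0.4, 0.0, -0.2, 0.0], [-0.3, 0.1, 0.0, 0.0]]
sensitivities = [[1.0, 0.0, 0.5, 0.0], [2.0, 0.3, 0.0, 0.0]]
resolved = arbitrate(updates, sensitivities)
final_update = sparsify_and_rescale(resolved, keep=2)
```

Choosing one update per conflicting coordinate, rather than averaging, avoids cancelling two near-orthogonal updates against each other, which is the failure mode the abstract attributes to standard merging.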

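2606.16276 2026-06-18 cs.AI New submission 70%

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

SpecAlign: Efficient Specification Alignment of Large Language Models through Synthetic Data

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

Affiliations * University of Notre Dame; Carnegie Mellon University; LMU Munich; University of Southern California

Topic match (post-training): post-training alignment method that improves LLM rule compliance

AI summary: Proposes specification-grounded alignment, a new paradigm that synthesizes data from specification documents (the SpecAlign framework) by combining structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained preference pairs, improving rule compliance without harming general capabilities.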

Comments 58 pages

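AI Chinese abstract (English translation)

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness but by provider- or application-specific model specifications. These specifications are usually long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism for turning them into training signals. In this paper we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications, rather than abstract principles or static benchmarks, as the primary alignment target. To instantiate the paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behavior and meaningful specification violations. Experiments across multiple model specifications and backbone models show that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables fast, precise, and scalable adaptation of LLM behavior to changing policy requirements.

English abstract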

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

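2603.26557 2026-06-18 cs.CL Updated version 70%

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

MemBoost: A Memory-Augmented Framework for Cost-Aware LLM Inference

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

Affiliations * University of Cambridge; ETH Zurich

Topic match (post-training): memory-augmented framework that reduces LLM inference cost

AI summary: Proposes the MemBoost framework, in which a lightweight model reuses historical answers and retrieves supporting information while difficult queries are selectively routed to a strong model, lowering LLM inference cost while maintaining answer quality.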

Comments ICML MemFM 2026 Workshop

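2606.18309 2026-06-18 cs.LG cs.AI New submission 65%

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE: Retain-Aware Post-Hoc Sanitization of the Final Unlearning Vector

Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

Affiliations * Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University

Topic match (post-training): proposes post-hoc sanitization of the unlearning vector to ease the trade-off between unlearning and retention

AI summary: Proposes SAGE, which sanitizes the final update vector post hoc, without rerunning the original unlearning pipeline, to ease the trade-off between unlearning and retained capabilities in large language models.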

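AI Chinese abstract (English translation)

Large language model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving existing capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We find that the retention activation bias can also be used to quantify the damage an unlearning method does to retention, regardless of how the unlearning process is implemented. This lets us restore the retention performance of any unlearning method with a post-hoc approach. We therefore propose a complementary post-hoc setting that sanitizes the final update vector without rerunning the original unlearning pipeline. In this setting we design SAGE (Spectral Activation-GEometry Sanitization), a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a closed-form source-anchored optimization objective that suppresses update components aligned with high-energy retain directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently eases the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored dimension of machine unlearning.

English abstract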

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.
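
The idea of suppressing update components that align with dominant retain directions can be illustrated with a plain projection. This is a deliberately simplified stand-in, not SAGE itself: the abstract describes a closed-form source-anchored objective, whereas the sketch assumes the retain directions are already available as an orthonormal set and simply subtracts a fraction (`strength`, a hypothetical knob) of the update's component along each one.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def suppress_retain_directions(update, retain_dirs, strength=1.0):
    """Subtract the update's components along orthonormal retain directions.

    strength=1.0 removes the aligned components entirely; smaller values
    only shrink them.
    """
    cleaned = list(update)
    for direction in retain_dirs:
        coeff = dot(update, direction)
        cleaned = [c - strength * coeff * d for c, d in zip(cleaned, direction)]
    return cleaned


update = [3.0, 4.0, 5.0]          # toy final unlearning update
retain_dirs = [[1.0, 0.0, 0.0]]   # assumed dominant retain direction
cleaned = suppress_retain_directions(update, retain_dirs)
```

The part of the update orthogonal to the retain directions is left untouched, which mirrors the stated goal of preserving the source method's forgetting signal while reducing damage to retained behavior.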
