Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
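AI summary: This paper proposes Listwise Policy Optimization (LPO), which performs the target-projection explicitly. It demystifies the implicit target by restricting the proximal RL objective to the response simplex, then projects the policy via exact divergence minimization. Across diverse reasoning tasks and LLM backbones, this improves training performance while preserving optimization stability and response diversity.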
Details
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent: it samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via a first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO), which conducts the target-projection explicitly. LPO demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy onto that target via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection, with distinct structural properties, through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
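The abstract gives no formulas, so the following is only a minimal sketch of one plausible reading of the two steps, not the paper's actual formulation. It assumes the proximal objective is a KL-regularized expected reward over the sampled group, whose maximizer on the simplex has the closed form q_i ∝ p_i exp(r_i / β), and that the projection minimizes KL(q ‖ p_θ), where p_θ is a softmax over sequence log-probabilities within the group. The names `listwise_target`, `projection_grad`, and `beta` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_target(logp_old, rewards, beta=1.0):
    """Assumed target: maximizer of sum_i q_i * r_i - beta * KL(q || p_old)
    over the simplex of the sampled group, i.e. q_i ∝ p_old_i * exp(r_i / beta),
    where p_old = softmax(logp_old) is the old policy renormalized on the group.
    """
    return softmax([lp + r / beta for lp, r in zip(logp_old, rewards)])


def projection_grad(logp_new, target):
    """Gradient of KL(target || softmax(logp_new)) w.r.t. the group scores.

    The gradient is p - q: each entry lies in [-1, 1], the entries sum to
    zero, and the whole vector vanishes once the policy matches the target.
    """
    p = softmax(logp_new)
    return [pi - qi for pi, qi in zip(p, target)]


# Illustrative group of four responses with binary verifiable rewards.
logp_old = [-12.0, -10.5, -11.0, -13.0]
rewards = [1.0, 0.0, 1.0, 0.0]

target = listwise_target(logp_old, rewards, beta=1.0)
grad = projection_grad(logp_old, target)
```

Under these assumptions, the gradient with respect to the group scores is the difference between the current group distribution and the target, which is one way to obtain the bounded, zero-sum, and self-correcting behavior the abstract describes. Other divergences would give different gradients.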