CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment
Jilong Liu, Yonghui Yang, Pengyang Shao, Wenjian Tao, Hao Zhan, Haokai Ma, Wei Qin, Richang Hong
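AI summary: Proposes CompassDPO, which uses the implicit DPO reward margin to control update direction and magnitude without an external reward model, improving robustness on benchmarks such as PKU-SafeRLHF.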
Direct Preference Optimization (DPO) has become a standard framework for safety alignment, but its reliance on pairwise preference updates makes training sensitive to imperfect supervision. Existing robust DPO methods often address this sensitivity through global loss corrections or external data-level interventions, while largely overlooking how unreliable comparisons distort batch-level optimization dynamics. We propose CompassDPO, a reward-free DPO framework that stabilizes preference optimization through dynamics control. Using the implicit DPO reward margin as a training-time compass, CompassDPO regulates sample influence along two complementary axes: update direction and update magnitude. For directional control, it applies sparse, budgeted, and warm-up-delayed loss mixing to attenuate update components that conflict with the emerging preference direction. For magnitude control, it adaptively soft-winsorizes high-loss tail contributions, reducing tail dominance while preserving useful gradients from hard examples. Both mechanisms use only signals available during standard DPO training and require no external reward model or additional supervision. Experiments on PKU-SafeRLHF across four backbones and multiple out-of-distribution safety benchmarks show that CompassDPO consistently improves robustness over vanilla DPO and strong DPO-family baselines, especially under controlled label-flip noise. Code is available at https://anonymous.4open.science/r/CompassDPO-4D00
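To make the "compass" signal concrete, the following is a minimal sketch, not the authors' implementation. The implicit reward margin and the loss follow the standard DPO definitions. The `soft_winsorize` function, its batch-quantile threshold, and its `slope` parameter are hypothetical illustrations of what compressing high-loss tail contributions could look like; the abstract does not specify the actual rule. All function and parameter names are assumptions introduced here.

```python
import math


def implicit_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit DPO reward margin from sequence log-probabilities.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    """
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))


def dpo_loss(margin):
    """Standard DPO loss -log(sigmoid(margin)), written as a stable softplus(-margin)."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def soft_winsorize(losses, quantile=0.75, slope=0.25):
    """HYPOTHETICAL magnitude control (an assumption, not the paper's rule).

    Losses above the batch quantile threshold tau keep only a fraction `slope`
    of their excess over tau, so tail samples are damped rather than clipped.
    """
    ordered = sorted(losses)
    tau = ordered[int(quantile * (len(ordered) - 1))]
    return [l if l <= tau else tau + slope * (l - tau) for l in losses]


# A toy batch of margins; the strongly negative one mimics a flipped label.
margins = [2.0, 1.0, 0.5, 0.0, -4.0]
losses = [dpo_loss(m) for m in margins]
damped = soft_winsorize(losses)
```

With a slope strictly between 0 and 1, the ordering of per-sample losses is preserved and hard examples still contribute gradient, only with reduced weight. This is the qualitative behavior the abstract describes as reducing tail dominance while preserving useful gradients.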