LLM Alignment and Safety - arXivDaily Topic

2509.25148 2026-06-19 cs.AI version update 80%

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

Faqiang Qian, Kang An, Weikun Zhang, Ziliang Wang, Xuhui Zheng, Liangjian Wen, Yong Dai, Mengya Gao, Yichao Wu

Affiliation: Southwest University of Finance and Economics

Topic match: preference alignment. An adversarial anchoring method is applied to preference alignment to prevent policy drift.

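AI summary: Proposes the AAPA framework, which uses a fixed lightweight discriminator to adversarially anchor policy outputs to expert responses at the sentence level. The anchoring term augments post-training objectives such as SFT and GRPO and consistently improves performance on instruction-following benchmarks.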

Abstract

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose AAPA (Adversarially Anchored Preference Alignment), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77% on Qwen3-0.6B and 3.75% on Qwen3-4B. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at https://github.com/IsFaqq/AAPA.
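
The abstract does not state the form of the anchoring term, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes the anchoring signal is added to the task reward before GRPO-style group normalization. A token-overlap score stands in for the fixed learned discriminator, and the names `token_overlap_discriminator`, `anchored_rewards`, and `anchor_weight` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List


def token_overlap_discriminator(response: str, expert: str) -> float:
    """Hypothetical stand-in for a fixed discriminator.

    Returns the Jaccard overlap of the two token sets, a value in [0, 1].
    A real discriminator would be a small frozen model; this is a placeholder.
    """
    a = set(response.lower().split())
    b = set(expert.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def anchored_rewards(
    rollouts: List[str],
    expert: str,
    task_rewards: List[float],
    discriminator: Callable[[str, str], float],
    anchor_weight: float = 0.5,
) -> List[float]:
    """Add a response-level anchoring bonus to each task reward (assumed form)."""
    return [
        reward + anchor_weight * discriminator(rollout, expert)
        for rollout, reward in zip(rollouts, task_rewards)
    ]


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts with identical task reward; only the anchoring bonus separates them.
rollouts = ["the cat sat", "dogs bark loudly"]
expert = "the cat sat"
rewards = anchored_rewards(rollouts, expert, [1.0, 1.0], token_overlap_discriminator)
advantages = group_normalized_advantages(rewards)
```

In this sketch the discriminator is never updated and the expert response is a stored string, which mirrors the abstract's point that no online teacher inference or discriminator co-training is needed during policy optimization.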
