LLM Alignment and Safety - arXivDaily Topic

2606.18606 2026-06-18 cs.CL cs.AI New submission 90%

Steerable Cultural Preference Optimization of Reward Models

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

Affiliations: Stanford University; University of Amsterdam

Topic match (preference alignment): proposes the SCPO algorithm to align reward models with cultural preferences.

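AI summary: Proposes SCPO, an algorithm that trains reward models by balancing multiple cultural preferences. On the PRISM and GlobalOpinionQA datasets it improves minority-group reward model performance by up to 7 points over the baseline and is up to 280% more training data-efficient than full-data finetuning.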

Comments Accepted to Pluralistic Alignment @ ICML 2026

Abstract

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference

2606.18487 2026-06-18 cs.LG cs.AI cs.CL New submission 90%

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Siddharth Aphale, Kelly Liu

Affiliations: Stanford University

Topic match (preference alignment): SFT overtraining leads to rank inversion under RLVR.

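AI summary: Finds that SFT overtraining lowers the entropy of the rollout distribution, which makes the advantage signal in GRPO vanish and leads to rank inversion. Proposes an entropy-based two-stage diagnostic that flags high-risk checkpoints early.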

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

Abstract

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
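
The variance formula quoted in the abstract can be checked numerically. The sketch below is not the authors' code. It assumes rewards are i.i.d. Bernoulli($p$) within a group of size $g$, and that "advantage" means the unnormalised reward minus the group mean. The abstract does not define $p^*(g)$; the `p_star` helper is an assumption (the pass rate at which all-fail groups reach probability 1/2), chosen because it reproduces the quoted $p^*(8){=}0.083$.

```python
from math import comb


def expected_advantage_variance(p: float, g: int) -> float:
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def enumerated_advantage_variance(p: float, g: int) -> float:
    """Same quantity by brute force: average the within-group variance
    (1/g normalisation) over the binomial number of successes k."""
    total = 0.0
    for k in range(g + 1):
        prob_k = comb(g, k) * p**k * (1.0 - p) ** (g - k)
        mean = k / g
        # Variance of a group holding k ones and g-k zeros.
        total += prob_k * mean * (1.0 - mean)
    return total


def prob_uniform_group(p: float, g: int) -> float:
    """Probability that all g rewards are identical (all 0 or all 1)."""
    return (1.0 - p) ** g + p**g


def p_star(g: int) -> float:
    """ASSUMED definition: pass rate at which P(all g rollouts fail) = 1/2."""
    return 1.0 - 0.5 ** (1.0 / g)


closed = expected_advantage_variance(0.3, 8)
brute = enumerated_advantage_variance(0.3, 8)
threshold = p_star(8)
uniform_at_threshold = prob_uniform_group(threshold, 8)
```

Under this reading, a group of 8 rollouts at a pass rate below roughly 0.083 is more likely than not to contain only failures, so its advantages are all zero and it contributes no gradient signal.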

2606.16276 2026-06-18 cs.AI New submission 90%

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

Affiliations: University of Notre Dame; Carnegie Mellon University; LMU Munich; University of Southern California

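Topic match (preference alignment): a specification-grounded alignment framework that uses synthetic data to achieve rule compliance.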

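AI summary: Proposes specification-grounded alignment, a new paradigm that synthesizes training data from specification documents (the SpecAlign framework). It combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained preference pairs, improving rule compliance without harming general capabilities.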

Comments 58 pages

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

2601.17637 2026-06-18 cs.CY cs.HC 90%

Scaling Laws for Moral Machine Judgment in Large Language Models

Kazuhiro Takemoto

Topic match (preference alignment): studies scaling laws for the alignment of LLM moral judgment with human preferences.

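AI summary: Evaluating 75 large language model configurations, the study finds a power-law relationship between model size and distance from human preferences. Extended reasoning models align better, with the effect more pronounced at smaller scales. The results extend scaling-law research to value-based judgments.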

Comments 12 pages, 4 figures, 3 tables

Journal ref R Soc Open Sci. (2026) 13 (6): 260202

DOI: 10.1098/rsos.260202

Abstract

Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
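
To get a feel for what an exponent of $-0.10$ means in practice, the following sketch evaluates the reported fit $D \propto S^{-0.10}$ for two size ratios. It uses only the point estimate from the abstract and ignores the $\pm 0.01$ uncertainty; the function name is illustrative. Because the fit has $R^2=0.50$, individual models can deviate substantially from these predicted ratios.

```python
EXPONENT = -0.10  # point estimate of the fitted exponent quoted in the abstract


def distance_ratio(size_ratio: float, exponent: float = EXPONENT) -> float:
    """Predicted D(new) / D(old) when model size is multiplied by size_ratio,
    assuming D is proportional to S ** exponent."""
    return size_ratio**exponent


# A 10x increase in parameters: predicted distance shrinks to about 79% of its value.
per_decade = distance_ratio(10.0)

# Across the full evaluated range (0.27B to 1000B parameters):
# predicted distance shrinks to about 44% of its value.
full_range = distance_ratio(1000.0 / 0.27)
```

In other words, the fitted trend predicts only about a 21% reduction in distance from human preferences per tenfold increase in model size.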

2606.18327 2026-06-18 cs.LG cs.AI New submission 70%

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Itamar Pres, Laura Ruis, Melat Ghebreselassie, Belinda Z. Li, Jacob Andreas

Affiliations: MIT CSAIL

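Topic match (preference alignment): uses reinforcement learning to optimize the consistency between a language model's self-explanations and its behavior.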

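AI summary: Proposes Self-CTRL, which uses reinforcement learning to optimize the consistency between a language model's self-explanations and its behavior, improving consistency and safety on a probabilistic reasoning task and a constitutional AI task.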

Comments 34 pages, 12 figures, includes appendices

Abstract

Language models (LMs) that faithfully describe their own behavior can more easily be audited, understood, and trusted by users. This paper describes Self-Consistency Training with Reinforcement Learning (Self-CTRL), a method that optimizes for consistency between an LM's self-explanations and behavior on related inputs by updating explanations to better predict behavior or updating behavior to better match explanations. We apply our method in two domains. First, we study a formal probabilistic reasoning task in which LMs must learn to imitate a family of biased samplers and are evaluated on their ability to report the associated biases. We find that consistency training improves the correlation between self-reported and behaviorally measured latent biases from $R^2=0.24$ to $R^2=0.64$ on a set of held-out distributions, matching the generalization of direct ground-truth supervision. Second, we study a constitutional AI domain in which LMs must describe when they will refuse or comply with user requests. Here, Self-CTRL produces rules that faithfully describe the model's behavior on held-out requests, improving the refusal predictions of a third-party auditor model from $36\%$ to $92\%$. In the other direction, behavior updates improve alignment, reducing HarmBench failure rate from $15.0\%$ to $0.5\%$ without substantially increasing refusal on harmless prompts. By aligning explanations and behavior, our work provides a general recipe for training AI models to be safer, more transparent, and more controllable.

2606.19162 2026-06-18 cs.LG cs.CV New submission 60%

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

Affiliations: FAIR at Meta; Columbia University; Mila - Québec AI Institute; McGill University; Canada CIFAR AI Chair

Topic match (preference alignment): uses RL for preference alignment, but mainly targets image generation.

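AI summary: To address visual defects in flow-matching models caused by the mismatch between the training loss and sample quality, proposes Discriminator-Guided RL (DRL), which uses the logit of a discriminator in a pretrained representation space as the reward. DRL substantially improves guidance-free FID and semantic-space FD, and also improves preference alignment.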

Comments 84 pages, including appendices

Abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
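
The claim that the log-likelihood ratio is the optimal reward for targeting the data distribution can be illustrated on a toy discrete example. The sketch below is not the paper's implementation. It relies on the standard closed form for KL-regularized reward maximisation, $q(x) \propto p_{\text{model}}(x)\exp(r(x)/\beta)$, and assumes a Bayes-optimal discriminator with equal class priors and $\beta = 1$; the abstract does not state which regularisation strength is used. The two distributions are made-up numbers.

```python
import math


def kl_regularized_optimum(p_model, reward, beta):
    """Maximiser of E_q[reward] - beta * KL(q || p_model) over distributions q:
    q(x) is proportional to p_model(x) * exp(reward(x) / beta)."""
    weights = [m * math.exp(r / beta) for m, r in zip(p_model, reward)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy distributions over four outcomes (illustrative values only).
p_data = [0.1, 0.2, 0.3, 0.4]
p_model = [0.4, 0.3, 0.2, 0.1]

# Bayes-optimal discriminator with equal priors: D(x) = p_data / (p_data + p_model).
d_opt = [d / (d + m) for d, m in zip(p_data, p_model)]

# Its logit equals the log-likelihood ratio log(p_data / p_model).
reward = [math.log(p / (1.0 - p)) for p in d_opt]

# With beta = 1 the KL-regularized optimum recovers the data distribution.
q = kl_regularized_optimum(p_model, reward, beta=1.0)
```

With $\beta = 1$ the tilted distribution equals `p_data` up to floating-point error. Other values of $\beta$ interpolate between the base model and a sharpened version of the data-to-model ratio.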
