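arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Large Model Alignment and Safety

Large model alignment, safety, jailbreaks, red teaming, prompt injection, and trustworthy evaluation.

Indexed today/current date: 5. Signal sources: cs.CL, cs.AI, cs.CY, cs.LG

2606.19818 2026-06-19 cs.LG cs.AI New submission 90%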

Uncertainty-Aware Reward Modeling for Stable RLHF

Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang

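Affiliations: Zhejiang University; Peking University; National University of Singapore

Topic match: Preference alignment. Uncertainty-aware reward modeling for stable RLHF, mitigating reward hacking.

AI summary: Proposes Uncertainty-Aware Reward Modeling (UARM), which calibrates uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition to mitigate reward hacking and improve alignment quality.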

Abstract

Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1) reward models cannot signal when their predictions are unreliable, since they usually act as deterministic point estimators; and (2) modern group-based policy optimization can amplify unreliable reward signals, as exemplified by GRPO's uniform treatment of rewards during advantage computation. As policies explore increasingly diverse responses, these two limitations create a critical vulnerability: unreliable reward estimates may be granted disproportionate influence, triggering severe reward hacking. We propose Uncertainty-Aware Reward Modeling (UARM), which equips reward models with calibrated uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments across HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.
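
As a rough illustration of the two ingredients named in the abstract, the sketch below pairs a generic split-conformal calibration step for quantile predictions with an uncertainty-downweighted, group-normalised advantage. It is not the authors' UARM implementation: the abstract gives no formulas, so the function names, the conformity score (borrowed from conformalized quantile regression), the use of the population standard deviation, and the `1 / (1 + width)` weighting are all assumptions chosen for clarity. The paper's heteroscedastic variance decomposition is not reproduced here.

```python
import math
from statistics import mean, pstdev


def conformal_margin(lo, hi, y, alpha=0.1):
    """Split-conformal margin for quantile intervals [lo_i, hi_i].

    Conformity score (CQR-style, an assumption): how far the true score y_i
    falls outside its predicted interval (negative if it lies inside).
    Returns the finite-sample-corrected (1 - alpha) quantile of the scores.
    """
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def reweighted_advantages(rewards, widths):
    """Hypothetical reweighting: shrink advantages whose reward interval is wide."""
    weights = [1.0 / (1.0 + w) for w in widths]
    return [a * w for a, w in zip(group_advantages(rewards), weights)]


if __name__ == "__main__":
    margin = conformal_margin(
        lo=[0.0, 0.0, 0.0, 0.0],
        hi=[1.0, 1.0, 1.0, 1.0],
        y=[0.5, 1.2, -0.1, 1.0],
        alpha=0.5,
    )
    print("calibrated margin:", margin)
    print(reweighted_advantages([1.0, 2.0, 3.0], widths=[0.0, 0.0, 1.0]))
```

In this toy setup, the third response has the highest reward but also the widest interval, so its advantage is halved relative to plain group normalisation.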

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission 90%

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

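Affiliations: Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

Topic match: Preference alignment. Centers on the sequential application of the preference optimization method DPO and its forgetting patterns.

AI summary: Studies sequential DPO across preference settings and finds that forgetting is not uniform but depends on objective relationship, signal strength, and training order; suggests that future alignment pipelines account for objective compatibility.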

Comments Submitted to EMNLP 2026

Abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
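
The pair-level and mechanistic diagnostics mentioned in the abstract could be computed roughly as follows. This is a minimal sketch under assumptions: the abstract does not define the margin precisely, so it is taken here to be the difference in mean per-token log-probability between the chosen and rejected responses under the policy. Alignment between two gradient or adapter-update vectors is measured by cosine similarity of the flattened vectors. Both function names are hypothetical.

```python
import math


def length_normalised_margin(chosen_token_logps, rejected_token_logps):
    """Mean per-token log-prob of the chosen response minus that of the rejected one.

    Dividing by length keeps long responses from dominating the margin
    (assumed definition; the abstract gives no formula).
    """
    chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    rejected = sum(rejected_token_logps) / len(rejected_token_logps)
    return chosen - rejected


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened vectors; near 0 means near-orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero vector (assumption)
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy per-token log-probs for one preference pair.
    print(length_normalised_margin([-1.0, -1.0], [-2.0, -2.0, -2.0]))
    # Toy stand-ins for a Stage 1 and a Stage 2 gradient.
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A positive margin means the policy assigns higher average token likelihood to the chosen response. Tracking it per pair before and after a later training stage is one way to surface changes that an aggregate accuracy would hide.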

2606.19527 2026-06-19 cs.AI New submission 90%

Emergent Alignment

Martin Kolář

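Affiliations: CIIRC, Czech Technical University in Prague

Topic match: Preference alignment. An online alignment technique that lets LLMs self-correct unethical outputs.

AI summary: Proposes an online alignment technique that adds a conscience step and a DPO-based alignment loss so that LLMs self-correct unethical outputs during training, fine-tuning, adversarial prompting, and zero-shot learning.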

Comments Rejected from ICML 2026

Abstract

Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs. The result is an online technique to align models in a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning. It does not require a weaker or stronger judge, relying instead on a frozen copy of itself. In previous work, the Emergent Misalignment scenario showed a range of emergent unethical behaviors from fine-tuning the model to hack code. Instead, we empirically show how to achieve Emergent Alignment: a single high-level introspective question steers training toward an ethical model under the same code hacking scenario.
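
For readers unfamiliar with the DPO term the abstract refers to, the standard per-pair DPO loss can be sketched as below. This shows the generic DPO objective only. How the paper combines it with the training loss and how the conscience step produces preference pairs are not stated in the abstract and are not modelled here; the default `beta` is an arbitrary illustrative value.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trainable policy
    and under a frozen reference model. beta=0.1 is an illustrative default.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_shift - rejected_shift))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy already prefers the chosen response: loss falls below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss decreases as the policy raises the likelihood of the chosen response relative to the reference and lowers that of the rejected one.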

2606.20482 2026-06-19 cs.CL cs.HC cs.LG New submission 85%

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

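Affiliations: University of Massachusetts, Amherst; York University

Topic match: Preference alignment. LLM alignment using implicit feedback.

AI summary: To address the scarcity of explicit feedback, proposes training reward models on implicit feedback such as mouse trajectories and eye-gaze data, raising text-based reward model accuracy from 55% to 64% and substantially improving response quality after DPO alignment.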

Abstract

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from the 59 Mechanical Turk workers, their mouse trajectories, and eye gazing points to the LLMs' responses from their webcams. IFLLM shows that the users have very diverse types of gazing behavior and mouse trajectories. Our reward model based on the implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying the DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and codes can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

2606.20258 2026-06-19 cs.HC cs.AI New submission 70%

Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination

Simon Aagaard Enni, Malthe Stavning Erslev, Karl-Emil Kjær Bilstrup, Kristoffer Laigaard Nielbo

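Affiliations: Aarhus University; University of Copenhagen

Topic match: Preference alignment. Proposes editorial alignment as participatory AI design.

AI summary: Proposes "editorial alignment" as a participatory AI design practice in which design workshops involve editors in re-aligning LLM interfaces to editorial standards, preserving the editorial function of public knowledge institutions.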

Comments 14 pages

Abstract

The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.