AI Alignment Breaks at the Edge
AI对齐在边缘处破裂
Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou, Carl Yang, Xiangliang Zhang, Yanfang Ye
AI总结 本文探讨了AI对齐在边缘案例中的失效问题,提出了一种新的对齐方法,通过识别和处理价值冲突、多方利益分歧和认知模糊性来改进AI的安全性和有效性。
Comments 38 pages, 6 figures
详情
通用对齐已经提高了平均情况下的有用性和安全性,但当前的对齐实践仍然奖励自信的单轮响应。问题不仅在于模型在边缘案例中失败,而且当前的评估使许多这些失败难以察觉。我们认为对齐必须超越平均情况的评估,通过使价值冲突、多方利益分歧和认知模糊性下的失败变得可见和可操作。标量奖励将多样化的价值观压缩成一个数字;数据和评估制度崩溃、过滤或未能激发对齐最困难的案例;治理往往缺乏裁定争议案例的机制。这些盲点导致了价值扁平化、表征损失和不确定性盲区。我们使用“边缘对齐”来命名一种检测、评估和治理议程,以揭示这些失败并将其与适当的干预措施联系起来。而不是单一的训练目标,边缘对齐定义了标准对齐应何时让位于保持多维价值结构、代表多方观点和支持不确定性意识互动的机制。一个包含91个边缘案例和四个现代模型的试点诊断集表明,普通的有用性和安全性读数可能无法发现边缘意识评估所暴露的过程失败。我们概述了操作性的边缘信号、过程意识的评估标准,以及一个三阶段的过程堆栈,将对齐重新定义为动态规范治理的生命周期问题。
General Alignment has improved average-case helpfulness and safety, but current alignment practice still rewards confident, single-turn responses. The problem is not only that models fail on edge cases; it is that current evaluation makes many of these failures hard to see. We take the position that alignment must move beyond average-case evaluation by making failures under value conflict, plural stakeholder disagreement, and epistemic ambiguity visible and actionable. Scalar rewards compress diverse values into a single number; data and evaluation regimes collapse, filter, or fail to elicit the cases where alignment is hardest; and governance often lacks mechanisms for adjudicating contested cases. These blind spots produce value flattening, representation loss, and uncertainty blindness. We use Edge alignment to name a detection, evaluation, and governance agenda for surfacing these failures and connecting them to appropriate interventions. Rather than a single training objective, Edge alignment defines the conditions under which standard alignment should yield to mechanisms that preserve multidimensional value structure, represent plural perspectives, and support uncertainty-aware interaction. A pilot diagnostic set of 91 edge cases and four contemporary models illustrates that ordinary helpfulness and safety readings can miss process failures that edge-aware evaluation exposes. We outline operational edge signals, process-aware evaluation criteria, and a three-phase process stack that reframes alignment as a lifecycle problem of dynamic normative governance.