LLM Alignment and Safety - arXivDaily Topic

2412.16468 2026-06-18 cs.LG version update 90%

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

HyunJin Kim, DongHyun Ryu, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie

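Affiliations: Microsoft Research Asia; Sungkyunkwan University; Stanford University; Fudan University

Topic match (safety evaluation): surveys the superalignment problem and analyzes scalable oversight paradigms

AI summary: This paper surveys the superalignment problem, analyzing scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization) and their limitations, and discusses the challenges of and pathways toward supervising, controlling, and governing artificial superintelligence.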

Comments 24 pages

Abstract

The emergence of large language models (LLMs) has sparked discussion on Artificial Superintelligence (ASI), a hypothetical AI system that surpasses human intelligence. Although ASI remains hypothetical and far beyond current AI capabilities, discussing its potential and exploring its feasibility and potential risks is critical for the development of future AI systems. The idea of superalignment originates from scalable oversight, which studies how to supervise increasingly capable AI systems when direct human supervision becomes insufficient. In this paper, we focus on the superalignment problem: "The process of supervising, controlling, and governing artificial superintelligence." We first review scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization), then analyze the limitations of current paradigms through the lens of possibility and impossibility, discuss key challenges, and propose pathways for the safe and continual improvement of future AI systems.

2505.20045 2026-06-18 cs.CL version update 85%

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

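Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); ETH Zurich; Independent Researcher; Applied AI Institute

Topic match (safety evaluation): unsupervised hallucination detection that improves LLM reliability

AI summary: Proposes the RAUQ framework, which uses uncertainty-aware attention heads and token-level confidence to perform unsupervised, efficient sequence-level hallucination detection in a single forward pass; it outperforms existing methods on 12 datasets with less than 1% additional computation.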

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

Abstract

While large language models (LLMs) have become highly capable, they remain prone to factual inaccuracies, commonly referred to as "hallucinations." Uncertainty quantification (UQ) offers a promising way to mitigate this issue, but most existing methods are computationally intensive and/or require supervision. In this work, we propose Recurrent Attention-based Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain "uncertainty-aware" attention heads tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve datasets spanning question answering, summarization, and translation across nine different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it incurs minimal overhead, requiring less than 1% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.
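
The recurrent combination described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the update rule, the mixing weight `alpha`, the head-selection heuristic, and all function names are assumptions. The inputs are plain lists of token probabilities and previous-token attention weights, which a white-box model would supply.

```python
import math


def select_head(attn_prev_by_head):
    """Pick the head with the highest mean attention to the preceding token.

    attn_prev_by_head[h][i] is the attention weight that head h assigns from
    token i+1 to token i. This selection rule is an assumed heuristic.
    """
    return max(
        range(len(attn_prev_by_head)),
        key=lambda h: sum(attn_prev_by_head[h]) / len(attn_prev_by_head[h]),
    )


def recurrent_uncertainty(token_probs, prev_attn, alpha=0.5):
    """Sequence-level uncertainty from one pass over the generated tokens.

    token_probs[i]: model probability of generated token i (must be > 0).
    prev_attn[i]:   attention from token i+1 to token i for the chosen head,
                    so len(prev_attn) == len(token_probs) - 1.
    alpha:          assumed mixing weight between the current token
                    probability and the attention-weighted running confidence.
    """
    conf = token_probs[0]
    log_sum = math.log(conf)
    for prob, attn in zip(token_probs[1:], prev_attn):
        # Recurrent step: weak attention to the previous token lets less of
        # the earlier confidence carry over to the current token.
        conf = alpha * prob + (1 - alpha) * attn * conf
        log_sum += math.log(conf)
    # Mean negative log-confidence: higher means more uncertain.
    return -log_sum / len(token_probs)


if __name__ == "__main__":
    head = select_head([[0.1, 0.2], [0.7, 0.6]])
    print("selected head:", head)
    print(recurrent_uncertainty([0.9, 0.9, 0.9], [0.8, 0.8]))
    print(recurrent_uncertainty([0.9, 0.3, 0.4], [0.8, 0.1]))
```

In this sketch, lower token probabilities and weaker attention to the preceding token both push the score up, which matches the intuition in the abstract; the paper itself should be consulted for the exact formulation.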

2507.04219 2026-06-18 cs.LG cs.AI version update 80%

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

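Affiliations: Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; Mila, Université de Montréal

Topic match (safety evaluation): a machine unlearning method that removes private information; safety-relevant

AI summary: Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse on the targeted data, without optimizing on the unlearning targets; it effectively removes private information while preserving model utility.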

Comments Accepted at ICLR 2026

Abstract

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We theoretically analyze that our approach converges to the desired outcome, i.e. the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.
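
The collapse phenomenon the abstract builds on can be illustrated with a toy categorical model that is repeatedly refit to its own samples. This is only a sketch of self-training collapse in general, not of PMC itself; the function name, sample size, and number of generations are hypothetical choices.

```python
import random
from collections import Counter


def self_training_trajectory(probs, n_samples=20, generations=200, seed=0):
    """Refit a categorical distribution to its own samples, generation by generation.

    Returns the list of distributions, starting with the initial one.
    """
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    history = [list(probs)]
    for _ in range(generations):
        # "Generate" from the current model ...
        samples = rng.choices(categories, weights=history[-1], k=n_samples)
        counts = Counter(samples)
        # ... and "retrain" by maximum likelihood on those generations only.
        history.append([counts[c] / n_samples for c in categories])
    return history


if __name__ == "__main__":
    trajectory = self_training_trajectory([0.4, 0.3, 0.2, 0.1])
    for step in (0, 10, 50, 200):
        support = sum(1 for p in trajectory[step] if p > 0)
        print(f"generation {step}: support size {support}, probs {trajectory[step]}")
```

Once a category receives zero samples it can never reappear, so the support can only shrink over generations. That one-way loss of information is the effect PMC, according to the abstract, triggers deliberately on the data to be removed.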

2504.14798 2026-06-18 cs.LG cs.CV version update 75%

RUB: Evaluating Residual Knowledge in Unlearned Models

Hao Xuan, Xingyu Li

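Affiliations: Electrical and Computer Engineering, University of Alberta

Topic match (safety evaluation): evaluates residual knowledge in unlearned models under adversarial attack

AI summary: Proposes the Robust Unlearning principle and a unified benchmark, RUB, which uses the Unlearning Mapping Attack (UMA) to detect residual information and reveals the vulnerability of existing methods under adversarial evaluation.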

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559

Abstract

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

2604.13899 2026-06-18 cs.CL cs.AI version update 70%

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

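Topic match (safety evaluation): compares LLM and human annotation for hostility detection

AI summary: Compares LLM and human annotation in active learning; LLM annotation is cheaper and yields better classifiers, but active learning offers no reliable advantage over random sampling under LLM annotation.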

Abstract

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost ($28 for GPT-5.2 Batch API vs. $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.
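
The comparison between active learning and random sampling comes down to how each labeling batch is chosen. The abstract does not name the acquisition strategy, so the sketch below assumes a simple least-confidence rule for a binary classifier; both function names are hypothetical.

```python
import random


def least_confidence_batch(pos_probs, k):
    """Active-learning pick: the k pool items the classifier is least sure about.

    pos_probs[i] is the predicted probability of the positive class for pool
    item i; items closest to 0.5 are treated as most informative (an assumed
    acquisition rule).
    """
    ranked = sorted(range(len(pos_probs)), key=lambda i: abs(pos_probs[i] - 0.5))
    return ranked[:k]


def random_batch(pool_size, k, seed=0):
    """Baseline pick: k pool items drawn uniformly without replacement."""
    return random.Random(seed).sample(range(pool_size), k)


if __name__ == "__main__":
    pool_predictions = [0.95, 0.52, 0.10, 0.45, 0.70]
    print("least-confidence batch:", least_confidence_batch(pool_predictions, 2))
    print("random batch:", random_batch(len(pool_predictions), 2))
```

Either batch would then be sent to the annotator (human or LLM) and added to the training set before the classifier is retrained. On cost, the quoted figures give 28 / 316 ≈ 0.09, consistent with the "roughly one-tenth" claim.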
