Large Model Alignment and Safety

2605.17986 2026-06-18 cs.CR cs.AI Version update, topic score 95

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

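Lei Zhao, Abhay Bhaskar, Edgar Dobriban

Affiliations: University of Pennsylvania

Topic match (prompt injection): benchmarks AI agents against indirect prompt injection; security is the core concern.

AI summary: Proposes the LivePI benchmark, covering 7 input surfaces, 12 attack/rendering families, and 5 malicious goals. It evaluates several AI agents in a real virtual-machine environment, finds attack success rates of 10.7% to 29.6%, and validates the effectiveness of a two-layer defense.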

Abstract

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.
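The two-layer defense described in the abstract (prompt-level filtering plus pre-execution tool-call authorization) can be pictured with a minimal sketch. This is not the authors' implementation: the phrase list, the `ToolCall` type, the allow-list, and every function name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical phrases that mark untrusted text as an embedded instruction.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "send your api key",
    "transfer funds",
)


def filter_untrusted_text(text: str) -> tuple[str, bool]:
    """Layer 1: flag untrusted content that contains instruction-like phrases."""
    lowered = text.lower()
    flagged = any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)
    # Flagged content is replaced by a placeholder so the agent never reads it.
    return ("[content withheld by filter]" if flagged else text), flagged


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical structure)."""
    name: str
    args: dict = field(default_factory=dict)


# Hypothetical policy: the only tools the current user task may invoke.
ALLOWED_TOOLS = {"read_file", "summarize_inbox"}


def authorize(call: ToolCall) -> bool:
    """Layer 2: check a proposed tool call against the policy before it runs."""
    return call.name in ALLOWED_TOOLS


def run_tool(call: ToolCall) -> str:
    """Execute the call only if authorization passes; otherwise block it."""
    if not authorize(call):
        return f"blocked: {call.name}"
    return f"executed: {call.name}"
```

In this sketch the second layer still blocks an unapproved call such as `ToolCall("wallet_transfer")` even if an injected instruction slips past the text filter, which is the reason for layering the two checks.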

2410.15595 2026-06-18 cs.AI cs.CL cs.LG Version update, topic score 95

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

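Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Affiliations: Zhejiang University; Nanyang Technological University; Alibaba Group

Topic match (preference alignment): DPO is one of the core methods for preference alignment.

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications, discusses its potential and limitations as an RL-free alternative, and proposes future research directions.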

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

Abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
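For readers unfamiliar with the objective being surveyed, the standard DPO loss for a single preference pair is the negative log-sigmoid of a scaled log-ratio margin between the chosen and rejected responses. The sketch below computes it from sequence log-probabilities that are assumed to be already available. The function and argument names are illustrative, and the variants covered by the survey modify this baseline in ways not shown here.

```python
import math


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    # Implicit reward margin: beta * [(chosen log-ratio) - (rejected log-ratio)].
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); the two branches
    # avoid overflow for large positive or negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the reference lowers the loss.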

2604.23130 2026-06-18 cs.CL cs.AI Version update, topic score 90

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

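Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

Affiliations: UMBC (University of Maryland, Baltimore County); Apple

Topic match (jailbreak attacks): mechanistically localizes jailbreak vulnerabilities and analyzes harmful features.

AI summary: Proposes a token-driven mechanistic pipeline that localizes jailbreak vulnerabilities through sparse autoencoder feature subgroups, finding that a single harmful token suffices to localize vulnerable features and that these features concentrate in mid-to-late layers.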

Abstract

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.
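The amplification step mentioned in the abstract can be illustrated with a generic SAE intervention: encode a residual-stream vector into features, scale a chosen subgroup, decode, and add back the part the SAE failed to reconstruct. The abstract does not give the paper's exact procedure, so the sketch below is an assumption, written in plain Python on toy-sized vectors with hypothetical names.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """ReLU(W_enc x + b_enc): sparse feature activations for residual vector x."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the residual vector; w_dec holds one decoder row per feature."""
    return [
        b_dec[j] + sum(features[i] * w_dec[i][j] for i in range(len(features)))
        for j in range(len(b_dec))
    ]


def amplify_features(x, w_enc, b_enc, w_dec, b_dec, feature_ids, scale):
    """Scale the selected features and return the steered residual vector."""
    features = sae_encode(x, w_enc, b_enc)
    reconstruction = sae_decode(features, w_dec, b_dec)
    # Keep whatever the SAE cannot reconstruct so only the chosen features change.
    error = [xi - ri for xi, ri in zip(x, reconstruction)]
    boosted = [f * scale if i in feature_ids else f for i, f in enumerate(features)]
    steered = sae_decode(boosted, w_dec, b_dec)
    return [s + e for s, e in zip(steered, error)]
```

With `scale=1.0` the function returns the input unchanged, which is a useful sanity check before steering with larger scales.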

2511.20002 2026-06-18 cs.CV cs.AI cs.CR Version update, topic score 85

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

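Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

Topic match (jailbreak attacks): proposes a semantic-aware universal perturbation that hijacks MLLMs, which falls under jailbreak attacks.

AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions at once, supported by theoretical analysis and the SORT optimization strategy. It reaches a 66% attack success rate over five targets against Qwen.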

Comments Accepted to ICML 2026

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2412.16468 2026-06-18 cs.LG Version update, topic score 90

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

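HyunJin Kim, DongHyun Ryu, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie

Affiliations: Microsoft Research Asia; Sungkyunkwan University; Stanford University; Fudan University

Topic match (safety evaluation): surveys the superalignment problem and analyzes scalable oversight paradigms.

AI summary: Surveys the superalignment problem, analyzing the scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization) and their limitations, and discusses the challenges and pathways for supervising, controlling, and governing artificial superintelligence.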

Comments 24 pages

Abstract

The emergence of large language models (LLMs) has sparked discussion on Artificial Superintelligence (ASI), a hypothetical AI system that surpasses human intelligence. Although ASI remains hypothetical and far beyond current AI capabilities, discussing its potential and exploring its feasibility and potential risks is critical for the development of future AI systems. The idea of superalignment originates from scalable oversight, which studies how to supervise increasingly capable AI systems when direct human supervision becomes insufficient. In this paper, we focus on the superalignment problem: "The process of supervising, controlling, and governing artificial superintelligence." We first review scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization), then analyze the limitations of current paradigms through the lens of possibility and impossibility, discuss key challenges, and propose pathways for the safe and continual improvement of future AI systems.

2505.20045 2026-06-18 cs.CL Version update, topic score 85

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); ETH Zurich; Independent Researcher; Applied AI Institute

Topic match (safety evaluation): unsupervised hallucination detection that improves LLM reliability.

AI summary: Proposes the RAUQ framework, which combines uncertainty-aware attention heads with token-level confidence to perform unsupervised, efficient sequence-level hallucination detection in a single forward pass. It outperforms existing methods on 12 datasets with less than 1% additional computation.

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

Abstract

While large language models (LLMs) have become highly capable, they remain prone to factual inaccuracies, commonly referred to as "hallucinations." Uncertainty quantification (UQ) offers a promising way to mitigate this issue, but most existing methods are computationally intensive and/or require supervision. In this work, we propose Recurrent Attention-based Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain "uncertainty-aware" attention heads tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve datasets spanning question answering, summarization, and translation across nine different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it incurs minimal overhead, requiring less than 1% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.
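A recurrent combination of attention and token confidence, as the abstract describes, might look like the toy sketch below. The abstract does not state the update rule, so the convex combination used here, the `alpha` weight, and the function name are assumptions for illustration rather than RAUQ's actual definition. Inputs are per-token probabilities and the attention weight each token places on its predecessor in one selected head.

```python
import math


def recurrent_uncertainty(token_probs, prev_token_attention, alpha=0.5):
    """Sequence-level uncertainty from token probabilities and attention.

    token_probs[i]: probability the model assigned to generated token i (> 0).
    prev_token_attention[i]: attention from token i to token i-1 in the
        selected head; the entry for i = 0 is ignored.
    """
    if len(token_probs) != len(prev_token_attention):
        raise ValueError("inputs must have the same length")
    # The first token has no predecessor, so its confidence is its probability.
    confidence = token_probs[0]
    log_sum = math.log(confidence)
    for prob, attention in zip(token_probs[1:], prev_token_attention[1:]):
        # Assumed update: mix the token's own probability with the previous
        # confidence, weighted by how much the head attends to the previous token.
        confidence = alpha * prob + (1.0 - alpha) * attention * confidence
        log_sum += math.log(confidence)
    # Higher values mean lower average confidence across the sequence.
    return -log_sum / len(token_probs)
```

Under this assumed rule, a drop in attention to preceding tokens lowers the running confidence and raises the score, which matches the qualitative behavior the abstract attributes to "uncertainty-aware" heads.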

2507.04219 2026-06-18 cs.LG cs.AI Version update, topic score 80

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

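Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

Affiliations: Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; Mila, Université de Montréal

Topic match (safety evaluation): a machine unlearning method that removes private information, which is safety-relevant.

AI summary: Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse on the targeted data without optimizing on the unlearning targets, effectively removing private information while preserving model utility.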

Comments Accepted at ICLR 2026

Abstract

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We theoretically analyze that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

2504.14798 2026-06-18 cs.LG cs.CV Version update, topic score 75

RUB: Evaluating Residual Knowledge in Unlearned Models

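Hao Xuan, Xingyu Li

Affiliations: Electrical and Computer Engineering, University of Alberta

Topic match (safety evaluation): evaluates residual knowledge in unlearned models under adversarial attacks.

AI summary: Proposes the Robust Unlearning principle and a unified benchmark, RUB, which uses the Unlearning Mapping Attack (UMA) to detect residual information and reveals that existing methods remain vulnerable under adversarial evaluation.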

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559

Abstract

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

2604.13899 2026-06-18 cs.CL cs.AI Version update, topic score 70

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

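Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

Topic match (safety evaluation): compares LLM and human annotation for hostility detection.

AI summary: Compares LLM and human annotation in active learning, finding that LLM annotation is cheaper and yields better classifiers, while active learning offers no reliable advantage over random sampling under LLM annotation.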

Abstract

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost ($28 for GPT-5.2 Batch API vs. $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

1. Prompt injection (1 paper)

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

2. Preference alignment (1 paper)

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

3. Jailbreak attacks (2 papers)

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

4. Safety evaluation (5 papers)

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

RUB: Evaluating Residual Knowledge in Unlearned Models

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection