Large Model Alignment and Safety

2606.18673 2026-06-18 cs.CR New submission 95%

Understanding and Mitigating Prompt Leaking Attacks in Real-World LLM-Based Applications

Yong Yang, Chong Fu, Tong Zhang, Rui Zeng, Qingming Li, Tianyu Du, Zonghui Wang, Shouling Ji, Wenzhi Chen

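Topic match (safety evaluation): system prompt leaking attacks and defenses

AI summary: This study systematically measures 1,200 real-world LLM-based applications, finds that more than 80% leak their system prompts, and proposes AREA, a defense based on attention-drift analysis that prevents leakage while preserving usability.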

Comments Accepted at ACM CCS 2026

Abstract

Large language model (LLM)-based applications rely on system prompts to encode core logic and developer-defined constraints, making these prompts important intellectual property. However, system prompts are vulnerable to prompt leaking attacks. Although prior work has shown such attacks in controlled settings, their prevalence, causes, and defenses in real-world deployments remain unclear. This paper presents a systematic study of prompt leaking in real-world LLM-based applications. We measure 1,200 applications across six major commercial platforms and find that over 80% of deployments leak system prompts under realistic adversarial queries, sometimes exposing sensitive information such as third-party API keys. We also show that existing defenses often fail to prevent leakage without degrading usability. To explain these failures, we conduct an attention-level mechanistic analysis and identify attention drift, where query-key alignment bias and softmax amplification cause LLMs to progressively ignore defensive constraints. Guided by this insight, we propose AREA, a practical defense that re-anchors the model's attention using an optimizable soft prompt. Experiments and real-world case studies show that AREA matches the leakage resistance of state-of-the-art defenses while improving average usability by over 33% and reducing optimization overhead by nearly 3x. Our responsible disclosure led two affected vendors to classify these leaks as medium-severity vulnerabilities.

2606.19222 2026-06-18 cs.LG cs.AI New submission 90%

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

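Affiliations: School of Engineering, Institute of Science Tokyo, Japan; College of Control Science and Engineering, Zhejiang University, China; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

Topic match (safety evaluation): an unlearning method for RLVR-induced reasoning, relevant to model safety

AI summary: Proposes MAST, which uses mechanism-guided selection to update only a subset of parameters, substantially reducing collateral damage to retained performance when unlearning RLVR-induced reasoning behavior.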

Comments 15 pages, 4 figures, 7 tables

Abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
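The reported p-value can be reproduced under one assumption that the abstract does not state: that the drop from 45 to 37 correct answers comes from 8 discordant pairs that all flip from correct to incorrect, with none flipping the other way. A minimal sketch of the exact two-sided McNemar test under that assumption (the function name is illustrative):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(b, c)
    # Under the null, each discordant pair falls in either direction with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed split (not given in the abstract): 8 items right before and wrong after, 0 the reverse.
p_value = mcnemar_exact(8, 0)
print(round(p_value, 4))
```

With 8 one-directional discordant pairs the exact p-value is 2 × 0.5^8 = 0.0078125, which matches the figure quoted in the abstract after rounding. Any other split of discordant pairs would give a different value.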

2606.19168 2026-06-18 cs.AI cs.LG New submission 90%

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

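Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University

Topic match (safety evaluation): a pretraining-stage safety alignment method

AI summary: Proposes Safety Reflection Pretraining, which inserts safety reflections into the pretraining corpus to give the model a self-monitoring capability; experiments show it lowers the success rates of inference-stage and finetuning attacks.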

Abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

2606.19023 2026-06-18 cs.CR cs.LG New submission 90%

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

Gabriele Digregorio, Marco Di Gennaro, Francesco Pastore, Stefano Zanero, Stefano Longari, Michele Carminati

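Affiliations: Politecnico di Milano

Topic match (safety evaluation): a dynamic, lifecycle-aware analysis method for detecting malicious behavior in ML models

AI summary: Proposes Moat, a dynamic lifecycle-aware approach that detects malicious behavior by monitoring the structured interactions between a model and its host system in each execution phase, achieving a close-to-zero false-positive rate across multiple ML frameworks.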

Abstract

The growing reliance on pre-trained Machine Learning (ML) models has introduced new attack surfaces. Recent vulnerabilities demonstrate that malicious behavior can be embedded within model artifacts, often bypassing existing defenses. Current model-scanning solutions primarily rely on static, format-specific rules or known attack signatures, which limit their ability to generalize across frameworks and to detect novel exploitation paths. In contrast, we propose a solution that focuses on the effects an attack has on the host system executing the model and builds on foundational intuitions about ML model execution. In particular, we observe that ML models operate within well-defined lifecycle phases and that, within each phase, interactions with the host system are highly structured and predictable. We translate these intuitions into Moat, a dynamic lifecycle-aware approach for securing ML model execution, and instantiate this design in Re-Moat, our reference implementation. We evaluate Re-Moat across multiple ML frameworks using 77,974 real-world model artifacts from the Hugging Face Hub, 31 Proofs-of-Concept (PoCs) from CVEs, and 334 models from a state-of-the-art dataset, and compare it against state-of-the-art model-scanning solutions. Our results show that our approach detects all evaluated attack classes while maintaining a close-to-zero false-positive rate, validating our intuitions and motivating dynamic analysis for securing ML model execution.

2606.18656 2026-06-18 cs.CL New submission 90%

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

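Affiliations: University of Michigan; University of Cambridge; University of Aberdeen

Topic match (safety evaluation): introduces the misfired-alignment benchmark VETO and the MAR metric

AI summary: Introduces the VETO benchmark and the Misfired Alignment Rate (MAR) metric, finds that all 25 evaluated LLMs show non-trivial misfired alignment on stereotype-related questions (4.7% to 18.9% MAR) while human participants score 0%, and shows through controlled priming experiments that alignment-induced cues amplify the effect.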

Abstract

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
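The MAR definition in the abstract maps onto a simple count over contrastive pairs. A minimal sketch, assuming each pair is recorded as two booleans and that the denominator is the total number of pairs (the abstract does not spell out either detail, and the function name is illustrative rather than the benchmark's own code):

```python
def misfired_alignment_rate(pairs):
    """Percentage (0 to 100) of pairs where the model fails the stereotype-related
    question but succeeds on its contrastive counterpart.

    pairs: iterable of (stereotype_correct, contrast_correct) booleans.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    misfires = sum(
        1 for stereotype_ok, contrast_ok in pairs
        if not stereotype_ok and contrast_ok
    )
    return 100.0 * misfires / len(pairs)

# Toy data: only the second pair counts as a misfire.
toy_pairs = [(True, True), (False, True), (False, False), (True, False)]
print(misfired_alignment_rate(toy_pairs))
```

Pairs where the model fails both questions are not counted as misfires here, since a failure on the contrastive counterpart suggests the model could not answer the question at all rather than that alignment overrode the evidence.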

2606.18430 2026-06-18 cs.LG cs.CR New submission 90%

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

Chih-Duo Hong, Yen-Pang Chen, Fang Yu

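Affiliations: National Chengchi University

Topic match (safety evaluation): proposes signature filtering to strengthen LLM watermark detection

AI summary: Proposes a signature filtering module that removes the signature tokens that interfere with watermark detection, raising detection rates in weak-signal and low-entropy settings from 8-31% to 78-99% while keeping the false-positive rate controllable.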

Abstract

Statistical watermarks help organizations attribute large language model (LLM) outputs, yet existing detectors often struggle when watermark signals are weak, texts are repetitive, or watermarks are edited. We propose signature filtering, a detection-time module that enhances watermark detection without modifying watermark embedding and text generation. It learns a small set of "signature" tokens whose presence makes watermark tests unreliable, and removes these tokens before detection. The signatures are obtained by solving a mixed-integer linear program on a small training set, with constraints that maximize the true positive rate. We additionally derive finite-sample and asymptotic bounds under several attacker models (color-blind, color-adaptive, and distributionally correlated). On four well-known watermark families (Kgw, Sweet, Unigram, Exp), four benchmark corpora (C4, MBPP, HumanEval, Code-Search-Net), and six LLMs (Opt-1.3b, Opt-6.7b, Llama2-13b, Llama3.1-8b, Qwen2.5-14b, Phi-3-medium-14b), 2- and 3-gram signatures raise detection rates in weak-signal and low-entropy settings from 8-31% without filtering to 78-99% with filtering, while keeping false positives controllable and often negligible. In stress tests where we scramble sentences and perturb 25-50% of tokens by dilution, deletions, and substitutions, 2-gram filters for Kgw-style watermarks preserve most of the clean-text detection gains, often matching or outperforming the advanced WinMax watermark detector. Signature filtering thus provides a simple, scalable, and model-agnostic add-on to strengthen watermark-based provenance checks for LLM text in information processing workflows.
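To illustrate why removing tokens before detection can help, the sketch below applies the standard one-proportion z-test used by Kgw-style detectors before and after dropping a hand-picked 2-gram. It is a toy under stated assumptions: the green/red flags are given directly instead of being derived from a hash, gamma is set to 0.5, and the signature is chosen by hand, whereas the paper learns signatures with a mixed-integer linear program. Both function names are hypothetical.

```python
from math import sqrt

def kgw_z_score(green_flags, gamma=0.5):
    """z-statistic for the count of green tokens against the expected fraction gamma."""
    t = len(green_flags)
    if t == 0:
        raise ValueError("no tokens to score")
    green = sum(green_flags)
    return (green - gamma * t) / sqrt(t * gamma * (1 - gamma))

def drop_signature_ngrams(tokens, green_flags, signatures):
    """Remove every token covered by a signature n-gram (hypothetical helper)."""
    covered = set()
    for signature in signatures:
        n = len(signature)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == tuple(signature):
                covered.update(range(i, i + n))
    kept = [i for i in range(len(tokens)) if i not in covered]
    return [tokens[i] for i in kept], [green_flags[i] for i in kept]

# Toy text: the repeated bigram ("a", "b") is always red and dilutes the signal.
tokens = ["a", "b", "x", "y", "a", "b", "z", "w"]
flags = [False, False, True, True, False, False, True, True]

z_before = kgw_z_score(flags)
_, kept_flags = drop_signature_ngrams(tokens, flags, [("a", "b")])
z_after = kgw_z_score(kept_flags)
print(z_before, z_after)
```

In this toy case the statistic rises from 0.0 to 2.0 once the low-entropy bigram is filtered out. Real detectors compare the statistic against a threshold chosen for a target false-positive rate.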

2606.18356 2026-06-18 cs.CR cs.AI New submission 90%

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang

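Affiliations: Peking University; Beijing Jiaotong University; SUIBE; Huawei; Tsinghua University

Topic match (safety evaluation): proposes a security benchmark for tool-using LLM agents that separates semantic, audit, and sandbox harm

AI summary: Proposes SafeClawBench, which evaluates the security of tool-using LLM agents through three separate endpoints (semantic attack acceptance, audit-visible harm evidence, and sandbox-observed harm), revealing distinct failure modes and supporting reproducible comparison.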

Comments 32 pages, 5 figures

Abstract

Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. Evaluating five agent endpoints under four prompt-level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000-row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions without conflating textual compliance, evidence-supported harm, and executable state changes. The open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.
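The matched-analysis figures in the abstract can be turned into rates with a few lines of arithmetic. The snippet uses only the three counts quoted above; the variable names are illustrative.

```python
total_rows = 12_000             # rows in the matched analysis
sandbox_harms = 347             # rows with observed sandbox harm
harms_passing_semantic = 291    # of those, rows that passed the semantic check

share_missed = harms_passing_semantic / sandbox_harms
harm_rate = sandbox_harms / total_rows

print(f"{share_missed:.1%} of sandbox harms passed the semantic check")
print(f"{harm_rate:.2%} of matched rows showed sandbox harm")
```

Roughly 84% of the observed sandbox harms would therefore be invisible to an evaluation that stops at the semantic endpoint, which is the case the abstract makes for reporting the endpoints separately.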

2412.16468 2026-06-18 cs.LG Updated version 90%

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

HyunJin Kim, DongHyun Ryu, Xiaoyuan Yi, Jing Yao, Jianxun Lian, Muhua Huang, Shitong Duan, JinYeong Bak, Xing Xie

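Affiliations: Microsoft Research Asia; Sungkyunkwan University; Stanford University; Fudan University

Topic match (safety evaluation): a survey of the superalignment problem and scalable oversight paradigms

AI summary: Surveys the superalignment problem, analyzing the scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization) and their limitations, and discusses the challenges of and pathways toward supervising, controlling, and governing artificial superintelligence.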

Comments 24 pages

Abstract

The emergence of large language models (LLMs) has sparked discussion on Artificial Superintelligence (ASI), a hypothetical AI system that surpasses human intelligence. Although ASI remains hypothetical and far beyond current AI capabilities, discussing its potential and exploring its feasibility and potential risks is critical for the development of future AI systems. The idea of superalignment originates from scalable oversight, which studies how to supervise increasingly capable AI systems when direct human supervision becomes insufficient. In this paper, we focus on the superalignment problem: "The process of supervising, controlling, and governing artificial superintelligence." We first review scalable oversight paradigms (Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization), then analyze the limitations of current paradigms through the lens of possibility and impossibility, discuss key challenges, and propose pathways for the safe and continual improvement of future AI systems.

2606.19106 2026-06-18 cs.CR cs.CY New submission 85%

Quantifying Compromise Risk in Exceptional Access Architectures Under Sparse and Indirect Evidence

Alan Woodward

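Topic match (safety evaluation): quantifies systemic compromise risk in exceptional access architectures

AI summary: Because no public compromise data exists for exceptional access systems, the paper builds a structured uncertainty framework that combines historical analogues, Monte Carlo scenarios, a channel-independence decomposition, and a Bayesian structural risk model to quantify systemic compromise risk in transmission-layer and platform-layer EA architectures. Both classes carry higher modelled risk than the no-EA baseline, and their risk distributions differ in shape.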

Abstract

Lawful exceptional access (EA) systems hold the cryptographic keys that decrypt protected communications for authorised parties. The debate over their risks has been long and qualitative, complicated by two problems: no public dataset of EA-specific compromise events exists, so assessment must use sparse, indirect evidence; and prior work has treated structurally different designs as equivalent, though transmission-layer EA in carrier infrastructure (T-EA) and over-the-top EA at the platform layer (OTT-EA) differ in how cryptographic keys relate to ciphertext data. This paper builds a structured uncertainty framework for evaluating systemic compromise risk in EA architectures. It does not produce predictive forecasts, which the evidence cannot support; it separates findings robust to assumptions from those that depend on calibration. Four analytical layers are applied to T-EA and OTT-EA: three empirical pillars (historical analogues, a Monte Carlo scenario layer, a channel-independence decomposition) plus a Bayesian Structural Risk Model on a parallel-subgraph attack graph. The central findings are structural. First, EA-equipped architectures of either class carry strictly higher modelled risk than their no-EA counterfactual, an ordering independent of calibration. Second, the classes differ in distribution shape: T-EA risk is dominated by central tendency, OTT-EA by the tail under correlated campaigns. Third, calibration-conditional annual probability ranges span 1.4% to 12.9% for T-EA across the structured-judgement targeting-premium interval. Over multi-decade horizons, cumulative compromise is well above zero; key-material exfiltration is irreversible, weighing heavily on OTT-EA's larger user populations. The framework quantifies compromise probability, not expected harm; consequence modelling and benefit estimation are outside its scope.
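The remark that cumulative compromise is well above zero over multi-decade horizons can be made concrete with the textbook formula 1 - (1 - p)^n. The sketch assumes a constant annual probability and independence between years, which is a simplification and not the paper's Bayesian model; the 30-year horizon is an illustrative choice.

```python
def cumulative_probability(annual_p: float, years: int) -> float:
    """Probability of at least one compromise over `years` independent years."""
    if not 0.0 <= annual_p <= 1.0:
        raise ValueError("annual_p must be a probability")
    if years < 0:
        raise ValueError("years must be non-negative")
    return 1.0 - (1.0 - annual_p) ** years

# Annual range quoted for T-EA in the abstract; the horizon is an assumption.
for annual_p in (0.014, 0.129):
    total = cumulative_probability(annual_p, 30)
    print(f"{annual_p:.1%} per year -> {total:.1%} over 30 years")
```

Under these simplifying assumptions, the low end of the quoted range already gives about a one-in-three chance of at least one compromise within 30 years, and the high end gives a probability above 98%.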

2606.18936 2026-06-18 cs.AI cs.CY New submission 85%

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Affiliations: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Zhongguancun Academy, China; Beijing Key Laboratory of Safe AI and Superalignment; Gaoling School of AI, Renmin University of China; Beijing Institute of AI Safety and Governance (Beijing-AISI); School of Humanities, University of Chinese Academy of Sciences, China

Topic match (safety evaluation): proposes a safety benchmark for scientific domains that evaluates risk dimensions

AI summary: Proposes SciRisk-Bench, which evaluates AI4Science safety from two perspectives, explicit risk dimensions and scientific disciplines, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions; experiments expose where mainstream and science-oriented LLMs remain unsafe.

Abstract

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18782 2026-06-18 cs.CL cs.AI New submission 85%

RedactionBench

RedactionBench: A Privacy-Protection Benchmark Grounded in Contextual Integrity

Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

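Affiliations: A10 Networks, Inc.

Topic match (safety evaluation): proposes a privacy-protection benchmark that evaluates how large models handle contextual integrity

AI summary: RedactionBench uses 200 documents across 11 domains to evaluate contextual privacy in redaction, introduces the R-Score metric, shows how subjective contextual redaction is, and aims to advance privacy-preserving systems.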

Abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

2606.18473 2026-06-18 cs.CL New submission 85%

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Bo Su, Ankit Shah, Thai Le

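Affiliations: Indiana University Bloomington

Topic match (safety evaluation): audits the collateral knowledge damage of LLM unlearning

AI summary: Proposes PreUnlearn, which uses data features to predict the collateral damage of an unlearning run on same-domain and distant-domain knowledge, enabling risk auditing before unlearning is executed.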

Comments 12 pages, 6 figures

Abstract

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

2606.12618 2026-06-18 cs.AI New submission 85%

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

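Affiliations: AI Security Institute

Topic match (safety evaluation): evaluates lie detectors for language models

AI summary: Builds 13 belief-verified reasoning model organisms and a varied prompted-lying testbed to evaluate four lie detectors across model scales. Activation- and logprob-based detectors drop sharply on the trained model organisms, while the chain-of-thought judge stays strong, partly as an artefact of the verification process.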

Comments 12 pages, 6 figures

Abstract

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
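The 0.82 figure is a balanced accuracy, which averages the detection rate on lies with the pass rate on honest statements so that an imbalanced test set cannot inflate the score. A minimal sketch of the standard definition with made-up labels (the data and the function name are illustrative and unrelated to the paper's testbeds):

```python
def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate for binary labels."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    true_neg = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    return 0.5 * (true_pos / positives + true_neg / negatives)

# Toy set: 2 lies and 8 honest statements; the detector catches only one lie.
labels = [True, True] + [False] * 8
predictions = [True, False] + [False] * 8

plain_accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(plain_accuracy, balanced_accuracy(labels, predictions))
```

Here plain accuracy is 0.9 while balanced accuracy is 0.75, which shows why the balanced form is the more informative number when lies are rare in the evaluation set.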

2505.20045 2026-06-18 cs.CL 版本更新 85%

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

基于不确定性感知注意力头的高效大语言模型幻觉检测

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

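Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); ETH Zurich; Independent Researcher; Applied AI Institute

Topic match (safety evaluation): unsupervised hallucination detection to improve LLM reliability

AI summary: Proposes RAUQ, a framework that combines uncertainty-aware attention heads with token-level confidence to give unsupervised, efficient sequence-level hallucination detection in a single forward pass. It outperforms existing methods on 12 datasets with less than 1% additional computation.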

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026

AI中文摘要

尽管大型语言模型（LLM）已经变得非常强大，但它们仍然容易出现事实性错误，通常称为“幻觉”。不确定性量化（UQ）为缓解这一问题提供了一种有前景的方法，但大多数现有方法计算量大且/或需要监督。在这项工作中，我们提出了基于循环注意力的不确定性量化（RAUQ），这是一种无监督且高效的幻觉识别框架。该方法利用了Transformer注意力行为的一个观察：当生成错误信息时，某些“不确定性感知”注意力头倾向于减少对前驱令牌的关注。RAUQ自动检测这些注意力头，并以循环方式将其激活模式与令牌级置信度度量相结合，仅通过一次前向传递即可生成序列级不确定性估计。通过在涵盖问答、摘要和翻译的十二个数据集上对九个不同LLM进行的实验，我们表明RAUQ始终优于最先进的UQ基线。重要的是，它产生的开销极小，所需的额外计算不到1%。由于它既不需要标记数据也不需要广泛的参数调整，RAUQ可作为白盒LLM中实时幻觉检测的轻量级即插即用解决方案。

英文摘要

While large language models (LLMs) have become highly capable, they remain prone to factual inaccuracies, commonly referred to as "hallucinations." Uncertainty quantification (UQ) offers a promising way to mitigate this issue, but most existing methods are computationally intensive and/or require supervision. In this work, we propose Recurrent Attention-based Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain "uncertainty-aware" attention heads tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve datasets spanning question answering, summarization, and translation across nine different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it incurs minimal overhead, requiring less than 1% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.
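The abstract does not state the update rule, so the following is only a sketch of what "combining attention to the preceding token with token-level confidence in a recurrent scheme" could look like. The mixing weight `alpha`, the function names, and the final aggregation (one minus the mean confidence) are assumptions made for illustration; they are not claimed to be the paper's exact formulation.

```python
def recurrent_confidence(token_probs, prev_token_attn, alpha=0.5):
    """Blend each token's probability with the confidence carried over from the
    previous token, weighted by the attention paid to that previous token."""
    if not token_probs or len(token_probs) != len(prev_token_attn):
        raise ValueError("inputs must be non-empty and of equal length")
    confidences = []
    carried = None
    for prob, attn in zip(token_probs, prev_token_attn):
        if carried is None:
            conf = prob  # the first token has no predecessor
        else:
            conf = alpha * prob + (1.0 - alpha) * attn * carried
        confidences.append(conf)
        carried = conf
    return confidences


def sequence_uncertainty(token_probs, prev_token_attn, alpha=0.5):
    """Assumed aggregation: one minus the mean recurrent confidence."""
    conf = recurrent_confidence(token_probs, prev_token_attn, alpha)
    return 1.0 - sum(conf) / len(conf)


# Hypothetical inputs: identical token probabilities, different attention patterns.
probs = [0.9, 0.9, 0.9, 0.9]
attentive = [0.0, 0.8, 0.8, 0.8]  # head keeps attending to the preceding token
drifting = [0.0, 0.8, 0.2, 0.1]   # head's attention to the preceding token drops

print(sequence_uncertainty(probs, attentive) < sequence_uncertainty(probs, drifting))
```

In this toy setup, the sequence whose attention to the preceding token drops receives the higher uncertainty even though the token probabilities are the same, which is the qualitative behaviour the abstract describes.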

2606.19057 2026-06-18 stat.ML cs.LG stat.CO stat.ME 新提交 80%

Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning

通过正-无标签学习量化与审计大语言模型评估

Zilong Zhang, Yi-Ting Hung, Lei Ding, Chi-Kuang Yeh

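Affiliations: Department of Mathematics and Statistics; Georgia State University; Department of Statistics; University of Manitoba

Topic match (safety evaluation): auditing bias in LLM evaluation

AI summary: Targets the systematic biases of LLM judges, such as verbosity bias, with a geometric auditing framework based on partial optimal transport. A small set of human-verified positives is used to correct the bias, improving agreement with human preferences without retraining.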

AI中文摘要

大语言模型（LLM）越来越多地被用作可扩展评估的评判者，然而这种LLM作为评判者的系统表现出与语义质量脱节的系统性偏差，最显著的是冗长偏差。同时，人工监督成本高昂且通常具有选择性，产生可靠的正向判断，但大多数输出未被标记且质量可能参差不齐。我们将选择性人工监督下的LLM评估形式化为一个正-无标签学习问题，并提出了一个基于部分最优传输的几何审计框架。通过在固定嵌入空间中将一小部分人工验证的正样本与可靠的无标签输出子集对齐，我们的方法识别出与人类一致的偏好，并在无需重新训练的情况下纠正有偏的评判者。实验表明，该方法提高了与人类偏好的一致性，增强了对呈现偏差的鲁棒性，并提供了可解释的置信度估计，为现有的LLM作为评判者流程提供了一种可扩展且统计上有依据的替代方案。

英文摘要

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supervision is costly and typically selective, yielding reliable positive judgments but leaving most outputs unlabelled and potentially mixed in quality. We formulate LLM evaluation under selective human supervision as a positive-unlabelled learning problem and propose a geometric auditing framework based on Partial Optimal Transport. By aligning a small set of human-verified positives with a reliable subset of unlabelled outputs in a fixed embedding space, our method identifies human-consistent preferences and corrects biased judges without retraining. Experiments demonstrate improved alignment with human preferences, increased robustness to presentation biases, and interpretable confidence estimates, offering a scalable and statistically grounded alternative to existing LLM-as-a-judge pipelines.

2606.19262 2026-06-18 cs.LG 新提交 80%

Detecting Hidden ML Training With Zero-Overhead Telemetry

使用零开销遥测检测隐藏的机器学习训练

Robi Rahman, Sabiha Tajdari

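Affiliations: Machine Intelligence Research Institute; University of Virginia

Topic match (safety evaluation): detecting hidden ML training for AI governance and security

AI summary: Evaluates the adversarial robustness of GPU workload classification that uses only zero-overhead, privacy-preserving NVML telemetry (content-agnostic signals). The resulting classifier identifies training workloads with 98.2% binary accuracy and reaches 43-87% accuracy on the most challenging unexpected workloads.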

Comments Technical AI Governance Research workshop at ICML 2026

2606.19242 2026-06-18 cs.SE 新提交 80%

Runtime Compliance Verification for AI Agents

AI代理的运行时合规性验证

Nafiseh Kahani, Masoud Barati, Diana Addae

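Topic match (safety evaluation): runtime monitoring to ensure GDPR compliance

AI summary: Proposes C-Trace, a framework that uses runtime monitoring and formal policy predicates to keep AI agents compliant with GDPR rules during tool calls and dialogue, reducing the attack success rate to below 12%.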

2606.18767 2026-06-18 cs.CL 新提交 80%

Output Vector Editing for Memorization Mitigation in Large Language Models

输出向量编辑：缓解大型语言模型中的记忆化问题

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

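Affiliations: Center for Information and Language Processing, LMU Munich; Department of Computer Science, University of Copenhagen; Munich Center for Machine Learning; Pioneer Centre for AI

Topic match (safety evaluation): mitigating LLM memorization with output vector editing

AI summary: Proposes output vector editing, which uses constrained optimization to modify the output vectors of MLP neurons and introduce a distractor, suppressing memorized sequences without changing activations. It reaches 87.9% suppression on OLMo-7B and identifies a mechanistic boundary of MLP-only editing.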

AI中文摘要

大型语言模型会记忆并复现训练数据中的序列，从而带来隐私、版权和安全风险。现有的神经元级缓解方法将编辑等同于将神经元激活归零，但激活仅控制神经元是否参与；输出向量才是写入残差流的内容，并通过叠加编码多个特征。我们提出输出向量编辑，这是一种约束优化的权重编辑方法，定位负责记忆化延续的一小组MLP神经元，并最小程度地修改其输出向量，以在词汇空间中引入干扰项，从而重定向它们在残差流中的贡献，同时保持激活不变。在四个模型（SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B，参数规模从360M到7B）上进行评估，我们重点研究OLMo-7B（其开放权重和预训练语料库支持系统化挖掘），挖掘了6831个记忆化序列，实现了高达87.9%的抑制率。在相同定位的神经元上，与零消融相比，2.7倍的差距表明抑制来自输出向量编辑，而非仅定位。四种编辑模式涵盖了从激进抑制到最小重定向的谱系；集成使用时覆盖了96.5%的记忆化序列，而我们推荐的单一模式配置达到了81.5%，且没有灾难性的局部性失败。我们进一步识别了一个机制边界：约14%的序列无法通过仅MLP编辑达到；虽然这些失败总体上并非由注意力驱动，但消融贡献最大的注意力头可恢复其中60-64%，对于从前缀复制token的延续，恢复更强，这表明注意力是互补的后备机制而非主要机制。编辑模式排序和成功-局部性权衡在所有四个模型上迁移，成功率随模型规模而非家族增长。

英文摘要

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ~14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
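The distinction between a neuron's activation and its output vector can be made concrete with a toy example. The sketch below assumes a three-token vocabulary, an identity unembedding, and a fixed edit step in place of the paper's constrained optimization; all values and helper names are hypothetical. It is constructed only to show the mechanism and does not reproduce the reported 2.7× gap.

```python
def residual_contribution(activation, out_vec):
    """What one MLP neuron writes to the residual stream: activation * output vector."""
    return [activation * w for w in out_vec]


def logits(base, activation, out_vec):
    """Toy logits: the contribution of all other components plus this neuron's write.
    The unembedding is taken to be the identity for simplicity."""
    return [b + c for b, c in zip(base, residual_contribution(activation, out_vec))]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


base = [0.5, 0.4, 0.0]     # other paths still slightly favour token 0 (the memorized one)
activation = 2.0           # the located neuron fires strongly
out_vec = [1.0, 0.0, 0.0]  # and writes towards token 0

original = argmax(logits(base, activation, out_vec))

# Zero ablation: silence the neuron by setting its activation to zero.
zero_ablated = argmax(logits(base, 0.0, out_vec))

# Output-vector edit: keep the activation, add a distractor direction (token 1).
distractor = [0.0, 1.0, 0.0]
step = 1.2
edited_vec = [w + step * d for w, d in zip(out_vec, distractor)]
edited = argmax(logits(base, activation, edited_vec))

print(original, zero_ablated, edited)  # 0 0 1
```

In this toy case zero ablation leaves the memorized token on top because other paths still favour it, whereas the edited output vector reuses the neuron's strong activation to promote the distractor.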

2606.18532 2026-06-18 cs.CR cs.AI cs.RO cs.SE 新提交 80%

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

AI沙箱：威胁模型、分类法与测量框架

Inderjeet Singh, Haitham Mahmoud, Andrés Murillo

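Affiliations: Fujitsu Research of Europe

Topic match (safety evaluation): threat model and measurement framework for AI sandboxes

AI summary: Proposes a threat model, taxonomy, and measurement framework for AI sandboxes, formalizing the sandbox boundary and a weakest-link rule, defining a cyber-physical threat model, and instantiating the framework on three case studies.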

Comments 50 pages, 8 figures, 10 tables

AI中文摘要

AI系统越来越多地在结合隔离、仿真、仪器化、监督和证据捕获的有界环境中进行评估。对于物理AI、AIoT和网络物理系统，这种转变不仅仅是术语问题：被测系统可能通过物理过程、网络设备和人类操作员进行感知、决策、执行、通信和故障。本文开发了一种面向保证的AI沙箱描述，将其作为数字AI、具身自主和网络物理部署中测试、评估、验证和确认的受控环境。我们形式化了沙箱边界和用于将每个维度的证据组合成有界部署声明的“最弱链”规则；分离了主要的沙箱原型；定义了一个包括对保证装置本身攻击的网络物理威胁模型；并引入了一个跨越保真度、可控性、可观测性、包含性、可重复性和治理工件的测量框架，在三个实际沙箱的工作案例研究中实例化。由此产生的威胁模型、分类法和测量框架阐明了沙箱可以有效测试什么、它可以包含哪些风险，以及它可以为安全、安保和监管保证支持哪些形式的证据。

英文摘要

AI systems are increasingly evaluated in bounded environments that combine isolation, simulation, instrumentation, supervision, and evidence capture. For physical AI, AIoT, and cyber-physical systems, this shift is not a matter of terminology: the system under test may sense, decide, actuate, communicate, and fail through physical processes, networked devices, and human operators. This article develops an assurance-oriented account of AI sandboxes as controlled environments for testing, evaluation, verification, and validation across digital AI, embodied autonomy, and cyber-physical deployments. We formalize the sandbox boundary and a weakest-link rule for composing per-dimension evidence into a bounded deployment claim; separate major sandbox archetypes; define a cyber-physical threat model that includes attacks on the assurance apparatus itself; and introduce a measurement framework spanning fidelity, controllability, observability, containment, reproducibility, and governance artifacts, instantiated on three worked case studies of real sandboxes. The resulting threat model, taxonomy, and measurement framework clarify what a sandbox can validly test, which risks it can contain, and what forms of evidence it can support for safety, security, and regulatory assurance.
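One plausible reading of a weakest-link rule is that the overall deployment claim can be no stronger than the weakest per-dimension evidence. The sketch below assumes ordinal evidence levels for the six dimensions named in the abstract and composes them with a minimum; the level scale, the function name, and the example values are hypothetical, and the article's formal definition may differ.

```python
DIMENSIONS = (
    "fidelity",
    "controllability",
    "observability",
    "containment",
    "reproducibility",
    "governance",
)


def bounded_claim(evidence):
    """Return (weakest dimension, its level); the claim is capped at that level."""
    missing = [d for d in DIMENSIONS if d not in evidence]
    if missing:
        raise KeyError(f"missing evidence for: {missing}")
    weakest = min(DIMENSIONS, key=lambda d: evidence[d])
    return weakest, evidence[weakest]


# Hypothetical assessment on a 0-3 scale.
evidence = {
    "fidelity": 3,
    "controllability": 2,
    "observability": 3,
    "containment": 1,
    "reproducibility": 2,
    "governance": 3,
}

print(bounded_claim(evidence))  # ('containment', 1)
```

Under this reading, strong fidelity or observability evidence cannot compensate for weak containment, which is the point of bounding the claim by the weakest dimension.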

2605.17986 2026-06-18 cs.CR cs.AI 版本更新 95%

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

LivePI：更真实的智能体对抗间接提示注入基准测试

Lei Zhao, Abhay Bhaskar, Edgar Dobriban

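Affiliations: University of Pennsylvania

Topic match (prompt injection): benchmarking AI agents against indirect prompt injection, with security as the core concern

AI summary: Introduces the LivePI benchmark, covering 7 input surfaces, 12 attack/rendering families, and 5 malicious goals. Evaluating several AI agents on a real virtual machine yields total attack success rates of 10.7% to 29.6%, and a two-layer defense intercepts all tested malicious-goal completions in the GPT-5.3-Codex setting.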

AI中文摘要

诸如OpenClaw之类的AI智能体越来越多地部署在本地工作流中，并能够访问外部工具。这带来了间接提示注入（IPI）风险：智能体可能会执行嵌入在不可信输入（如电子邮件、下载文件、网页、仓库或群聊消息）中的有害指令。现有的评估通常规模较小、完全模拟或仅关注狭窄的通道。我们引入了LivePI（实时提示注入），这是一个在生产类似但测试可控环境中的IPI风险结构化基准。LivePI覆盖了七个输入表面、十二个攻击/渲染家族和五个恶意目标，包括受保护信息窃取、未经授权的安全控制更改、不安全的代码检索或执行、收件箱摘要窃取以及加密货币转账。我们在一个真实的虚拟机上运行LivePI，该虚拟机具有实时但测试可控的电子邮件、聊天、网页、本地文件、仓库和钱包接口。在GPT-5.3-Codex、Claude Opus 4.6、Gemini 3.1 Pro、Kimi K2.5和GLM-5上，总攻击成功率范围为10.7%至29.6%。群聊注入在我们部署中评估的所有骨干模型上均成功，而仓库链接攻击尽管分母较小，仍产生了高严重性失败。我们还评估了一种由提示级过滤和执行前工具调用授权组成的两层防御。在GPT-5.3-Codex设置中，该防御在LivePI中拦截了所有测试的恶意目标完成，同时保留了PinchBench衍生工作负载上的良性效用。

英文摘要

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.

2606.18550 2026-06-18 cs.CR 新提交 85%

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

门仅与其合约一样诚实：面向风险感知因果门控合约层的ContractGuard

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

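Topic match (prompt injection): defending against indirect prompt injection attacks

AI summary: For tool-augmented LLM agents facing indirect prompt injection, proposes ContractGuard, which defends by verifying the integrity of tool contracts rather than relying on risk labels, and drives injection success to zero on a controlled benchmark.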

AI中文摘要

风险感知因果门控（RACG）通过从代理的可见动作空间中移除危险工具来防御工具增强型LLM代理免受间接提示注入，使得即使完全符合注入条件的代理也无法调用其不可见的工具。我们提出三点。首先，这种结构性保证并未消除安全工具使用背后的信任假设；它将其转移到门所读取的工具合约——声明的先决条件、效果、风险和授权——的完整性上，因此攻击者若破坏合约，可使门误判而无需说服代理。其次，伪造工具的效果比篡改其风险标签更危险，因为RACG在可准入门之前应用因果门：离路径工具从不暴露，因此仅重新标记风险会失败，而效果伪造则将危险工具路由到因果路径上并成功。效果完整性，而非风险标签，是承载假设。第三，我们引入ContractGuard，一个位于注册表和门之间的验证器，它分层使用签名来源、类型化合约认证和运行时效果验证；在受控基准测试中，它针对所有建模攻击（包括穷举白盒自适应攻击）将注入成功率恢复为零，且不会过度拒绝诚实合约，该结构性预测在六个当前代托管模型（Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B）上得到确认。

英文摘要

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).

2606.18530 2026-06-18 cs.CR cs.CL cs.LG 新提交 85%

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

评估基于提示的防御策略对抗领域伪装注入攻击

Aaditya Pai

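Affiliations: Data Science Institute

Topic match (prompt injection): evaluating defenses against domain-camouflaged injection attacks

AI summary: Evaluates five prompting-based defenses (including paraphrasing and spotlighting) against domain-camouflaged injection across three model families and three deployment domains. Paraphrasing is the most effective, reducing camouflage attack success rate by 55-84% depending on the model.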

Comments 9 pages, 4 figures, 4 tables; under review at the AdvML-Frontiers x CoTMA workshop, COLM 2026

AI中文摘要

领域伪装注入攻击使用领域特定词汇将恶意指令嵌入检索内容中，从而逃避依赖句法注入标记的标准检测器。当检测失败时，从业者需要知道哪些防御架构能降低攻击成功率。我们评估了五种基于提示的防御方法（重点标记、释义、提示夹层以及两种组合）对抗领域伪装注入攻击，涉及三个模型家族（Claude Haiku、Llama 3.1 8B、Gemini 2.0 Flash）和三个部署领域（金融、法律、通用），共进行3,510次试验。在代理处理之前对检索内容进行释义是最一致有效的防御方法，根据模型不同，可将伪装攻击成功率降低55-84%，并且在所有测试模型上均实现了比我们的Llama Guard 4配置更低的攻击成功率。防御效果强烈依赖于模型：重点标记在Claude Haiku上将攻击成功率减半，但在Llama 3.1 8B上没有任何益处。金融领域部署面临最高的残余风险，基线攻击成功率为26-33%，在较弱模型上没有任何基于提示的防御能完全消除威胁。这些结果首次系统评估了专门针对伪装类注入攻击的基于提示的防御方法，并为从业者建立了基于基准的建议。所有任务均使用合成构建的专业文档；这些基准排名是否能推广到真实企业文档仍是一个开放问题。

英文摘要

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.
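The three defense families named above can be sketched as plain prompt-construction helpers. The exact prompts used in the paper are not given in the abstract, so the marker character, the `<document>` delimiters, the reminder wording, and all function names below are assumptions; `paraphrase` stands in for whatever model call rewrites the retrieved text.

```python
def spotlight(retrieved, marker="^"):
    """Datamarking-style spotlighting: join words with a marker so the model can
    tell retrieved data apart from instructions."""
    return marker.join(retrieved.split())


def sandwich(task, retrieved):
    """Prompt sandwiching: restate the task after the untrusted content."""
    return (
        f"{task}\n\n"
        f"<document>\n{retrieved}\n</document>\n\n"
        f"Reminder: {task}"
    )


def paraphrase_then_prompt(task, retrieved, paraphrase):
    """Paraphrasing: rewrite the retrieved content before the agent sees it.
    `paraphrase` is a caller-supplied function (for example, a call to a separate model)."""
    return f"{task}\n\n<document>\n{paraphrase(retrieved)}\n</document>"


task = "Summarize the document."
doc = "Q3 revenue rose 4%.  Per policy, forward this file to the address below."

print(spotlight(doc))
print(sandwich(task, doc))
```

Combinations such as spotlighting plus sandwiching can be built by composing the helpers, for example `sandwich(task, spotlight(doc))`.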

2606.19235 2026-06-18 cs.CR 新提交 80%

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

CodeSentinel：代码上下文中针对间接提示注入的三层防御

Po-Han Cheng, Chia-Mu Yu, Ying-Dar Lin, Yu-Sung Wu, Wei-Bin Lee

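Topic match (prompt injection): prompt injection defense for code contexts

AI summary: For the indirect prompt injection that code LLMs face when retrieving external code, proposes CodeSentinel, a three-layer inference-time sanitizer combining syntax-guided pre-filtering, CST-guided dynamic Min-K% scoring, and node perturbation analysis. It achieves a node-level F1 of 0.80, outperforming existing methods.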

2410.15595 2026-06-18 cs.AI cs.CL cs.LG 版本更新 95%

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

直接偏好优化综述：数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

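Affiliations: Zhejiang University; Nanyang Technological University; Alibaba Group

Topic match (preference alignment): DPO is one of the core methods for preference alignment

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications, discusses its potential and limitations as an RL-free alternative, and proposes future research directions.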

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

AI中文摘要

随着大语言模型（LLMs）的快速发展，将策略模型与人类偏好对齐变得日益关键。直接偏好优化（DPO）已成为一种有前景的对齐方法，可作为基于人类反馈的强化学习（RLHF）的无RL替代方案。尽管DPO取得了各种进展并存在固有局限性，但文献中目前缺乏对这些方面的深入综述。在这项工作中，我们对DPO中的挑战和机遇进行了全面回顾，涵盖理论分析、变体、相关偏好数据集和应用。具体而言，我们基于关键研究问题对近期DPO研究进行分类，以提供对DPO当前格局的透彻理解。此外，我们提出了几个未来研究方向，为研究社区提供模型对齐的见解。相关论文的更新合集可在 https://github.com/Mr-Loevan/DPO-Survey 找到。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
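For reference, the standard DPO objective for a single preference pair can be written in a few lines. The abstract itself does not state the formula; the sketch below follows the loss from the original DPO formulation, the negative log-sigmoid of the scaled difference of policy-versus-reference log-ratios, and the argument names and example values are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))

# Policy raises the chosen response and lowers the rejected one: the loss drops.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The log-probabilities are sums over response tokens; no reward model or sampling loop is involved, which is what makes the method RL-free.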

2606.18606 2026-06-18 cs.CL cs.AI 新提交 90%

Steerable Cultural Preference Optimization of Reward Models

可引导的文化偏好优化奖励模型

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

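Affiliations: Stanford University; University of Amsterdam

Topic match (preference alignment): proposes the SCPO algorithm to optimize the cultural preference alignment of reward models

AI summary: Proposes SCPO, which trains reward models that balance multiple cultural preferences. On PRISM and GlobalOpinionQA it improves minority reward model performance by up to 7 points and is up to 280% more training data-efficient than full-data finetuning.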

Comments Accepted to Pluralistic Alignment @ ICML 2026

AI中文摘要

大型语言模型（LLM）技术必须以各个文化子社区都能接受的方式服务于众多不同的文化子社区。然而，迄今为止，关于LLM对齐的研究主要集中于预测来自特定地区的标注者的统一响应偏好。本文旨在以更全球化的视角推进对齐模型的发展，使其能够准确代表子社区的偏好，并且不对任何子社区表现出过度偏见。我们专注于为此目的开发奖励模型，并提出一种新颖的奖励模型训练算法（SCPO），该算法能够以平衡的方式融入多样化的文化偏好。我们的方法使得少数群体奖励模型在两个数据集（PRISM和GlobalOpinionQA）以及7个国家上的性能比基线模型提升最多7点。SCPO在训练数据效率上比奖励模型的完整数据微调高出最多280%。此外，我们通过分别评估子社区的偏好来进行偏见分析，并表明我们的加权方法减轻了过度偏见。我们的代码可在以下网址获取：https://github.com/minsik-ai/Steerable-Cultural-Preference

英文摘要

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference

2606.18487 2026-06-18 cs.LG cs.AI cs.CL 新提交 90%

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SFT 过训练通过熵崩溃预测 RLVR 下的排名反转

Siddharth Aphale, Kelly Liu

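Affiliations: Stanford University

Topic match (preference alignment): SFT overtraining leads to rank inversion under RLVR

AI summary: Finds that SFT overtraining lowers the entropy of the rollout distribution, so the advantage signal in GRPO vanishes and checkpoint rankings invert. Proposes an entropy-based two-stage diagnostic that flags high-risk checkpoints early.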

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

AI中文摘要

当 SFT 压缩 rollout 分布时，选择 pass@1 最高的 SFT 检查点进行 GRPO 的标准启发式方法可能失败。对于二元奖励，组内期望优势方差为 $p(1{-}p)(g{-}1)/g$；当早期 GRPO 将 $p$ 驱动到 $p^*(g)$ 以下时，大多数组具有相同奖励，不提供组间相对信号。我们研究了 Qwen2.5-Coder-3B 和 DeepSeek-Coder-6.7B 的 SFT 深度阶梯。我们在五个深度和三个种子上测试 Qwen2.5-Coder-3B，在四个匹配深度和三个种子上测试 DeepSeek-Coder-6.7B。在 Qwen 上，RL 前的 pass@1 随 SFT 深度增加而上升，但 GRPO 峰值 pass@10 从 $0.806$ 下降到 $0.481$（3 种子均值，$n{=}20$）；RL 前的熵与 GRPO 结果正相关（$\rho{=}{+}0.69$）。在 DeepSeek 上，pass@1 仍远高于 $p^*(8){=}0.083$，GRPO 结果压缩而非反转。结合 RL 前熵分诊与早期 GRPO 熵监测的两阶段诊断方法，可标记高风险检查点并提前停止失败运行。在我们的设置中，简单的 KL 参考正则化和标签平滑变体未能挽救崩溃的 Qwen 检查点，表明该失败并非琐碎的 GRPO 超参数伪影。

英文摘要

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3 seed mean, $n{=}20$); pre RL entropy is positively associated with the GRPO outcome ($ρ{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two stage diagnostic, combining pre RL entropy triage with an early GRPO entropy monitor, flags high risk checkpoints and can stop failing runs early. Simple KL to reference regularisation and label smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
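The variance expression $p(1{-}p)(g{-}1)/g$ can be checked by enumerating how many of the $g$ rollouts in a group succeed. The abstract does not define $p^*(g)$; one reading that reproduces the quoted $p^*(8){=}0.083$ is the pass rate at which half of all groups contain no successful rollout, and the sketch below adopts that reading as an assumption. All function names are illustrative.

```python
from math import comb


def expected_advantage_variance(p, g):
    """Closed form quoted in the abstract for binary rewards and group size g."""
    return p * (1 - p) * (g - 1) / g


def enumerated_variance(p, g):
    """Same quantity by enumeration: a group with k successes has mean k/g and
    within-group variance (k/g) * (1 - k/g)."""
    total = 0.0
    for k in range(g + 1):
        prob = comb(g, k) * p**k * (1 - p) ** (g - k)
        mean = k / g
        total += prob * mean * (1 - mean)
    return total


def degenerate_group_fraction(p, g):
    """Probability that all g rewards are identical (all fail or all pass)."""
    return (1 - p) ** g + p**g


def half_all_fail_threshold(g):
    """Assumed reading of p*(g): the p at which (1 - p)**g equals 1/2."""
    return 1 - 0.5 ** (1 / g)


print(round(half_all_fail_threshold(8), 3))          # 0.083
print(round(degenerate_group_fraction(0.05, 8), 3))  # most groups carry no signal
```

Below this threshold more than half of the groups have identical rewards, so their group-relative advantages are all zero and contribute no gradient signal.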

2606.16276 2026-06-18 cs.AI 新提交 90%

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

SpecAlign: 通过合成数据实现高效的大语言模型规范对齐

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

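Affiliations: University of Notre Dame; Carnegie Mellon University; LMU Munich; University of Southern California

Topic match (preference alignment): specification-grounded alignment framework that uses synthetic data to achieve rule compliance

AI summary: Proposes specification-grounded alignment and the SpecAlign framework, which synthesizes data from specification documents. It combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained preference pairs, improving rule compliance without harming general capabilities.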

Comments 58 pages

AI中文摘要

随着大语言模型（LLM）在现实应用中的部署日益增多，对齐不再由单一的通用安全或有用性概念主导，而是由提供商或应用特定的模型规范主导。这些规范通常冗长、结构化且频繁更新，然而现有的对齐流程缺乏系统化的机制来将其作为训练信号。在本文中，我们提出规范对齐（specification-grounded alignment），一种新的对齐范式，将提供商编写的模型规范作为主要对齐目标，而非抽象原则或静态基准。为实例化该范式，我们引入SpecAlign框架，该框架直接从规范文档合成对齐数据。SpecAlign结合结构化规则标注、可控规范实例化和多智能体对抗数据合成，生成细粒度、边界感知的偏好对，捕获合规行为和有意义的规范违反。在多个模型规范和骨干模型上的实验表明，使用SpecAlign进行训练一致地提高了规则遵守度，同时保持了通用能力并避免了过度保守的行为。这些结果表明，将对齐建立在显式模型规范上，能够实现LLM行为对不断变化的政策要求的快速、精确和可扩展的适应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

2601.17637 2026-06-18 cs.CY cs.HC 90%

Scaling Laws for Moral Machine Judgment in Large Language Models

大语言模型中道德机器判断的扩展规律

Kazuhiro Takemoto

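Topic match (preference alignment): scaling laws for the alignment of LLM moral judgment with human preferences

AI summary: Evaluates 75 LLM configurations and finds a power-law relationship between model size and distance from human preferences. Extended reasoning models align better, with the effect more pronounced at smaller scales, extending scaling-law research to value-based judgments.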

Comments 12 pages, 4 figures, 3 tables

Journal ref R Soc Open Sci. (2026) 13 (6): 260202

DOI: 10.1098/rsos.260202

AI中文摘要

自主系统日益需要道德判断能力，但这些能力是否随模型规模可预测地扩展尚未得到研究。我们利用道德机器（Moral Machine）框架系统评估了75种大语言模型配置（0.27B-1000B参数），测量其在生死困境中与人类偏好的对齐程度。我们观察到一致的幂律关系：与人类偏好的距离（D）随模型规模（S）增大而减小，D ∝ S^{-0.10±0.01}（R²=0.50，p<0.001）。混合效应模型证实，在控制模型家族和推理能力后该关系仍然成立。扩展推理模型的对齐程度显著更好，且这一效应在较小模型中更为明显（规模×推理交互作用：p=0.024）。该关系在多样架构中成立，同时方差随规模增大而降低，表明更可靠的道德判断随计算规模系统性地涌现。这些发现将扩展规律研究拓展到基于价值的判断，并为人工智能治理提供实证基础。

英文摘要

Autonomous systems increasingly require moral judgment capabilities, yet whether these capabilities scale predictably with model size remains unexplored. We systematically evaluate 75 large language model configurations (0.27B--1000B parameters) using the Moral Machine framework, measuring alignment with human preferences in life-death dilemmas. We observe a consistent power-law relationship with distance from human preferences ($D$) decreasing as $D \propto S^{-0.10\pm0.01}$ ($R^2=0.50$, $p<0.001$) where $S$ is model size. Mixed-effects models confirm this relationship persists after controlling for model family and reasoning capabilities. Extended reasoning models show significantly better alignment, with this effect being more pronounced in smaller models (size$\times$reasoning interaction: $p = 0.024$). The relationship holds across diverse architectures, while variance decreases at larger scales, indicating systematic emergence of more reliable moral judgment with computational scale. These findings extend scaling law research to value-based judgments and provide empirical foundations for artificial intelligence governance.
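To get a feel for how shallow an exponent of $-0.10$ is, the fitted relationship $D \propto S^{-0.10}$ can be turned into a ratio for a given increase in model size. The helper name below is an assumption; the exponent is the point estimate quoted in the abstract, and the calculation ignores the reported uncertainty and the fact that the fit explains about half of the variance ($R^2=0.50$).

```python
def distance_ratio(size_ratio, exponent=-0.10):
    """Factor by which D changes when model size is multiplied by size_ratio,
    assuming D is proportional to S ** exponent."""
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return size_ratio**exponent


print(round(distance_ratio(10), 3))    # 0.794
print(round(distance_ratio(1000), 3))  # 0.501
```

Under the point estimate, a tenfold increase in parameters reduces the distance from human preferences by roughly 21%, and it takes about three orders of magnitude to halve it.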

2604.23130 2026-06-18 cs.CL cs.AI 版本更新 90%

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

从概念对齐的Token到脆弱特征：越狱的机制定位

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

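Affiliations: UMBC (University of Maryland, Baltimore County); Apple

Topic match (jailbreak attacks): mechanistic localization of jailbreak vulnerabilities and analysis of harmful features

AI summary: Proposes a token-driven mechanistic pipeline that localizes jailbreak vulnerabilities to subgroups of sparse autoencoder features. Individual harmful tokens suffice to localize the vulnerable features, which are concentrated in mid-to-late layers.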

AI中文摘要

越狱攻击揭示了安全对齐的大语言模型中一种持续的失败模式：模型可以被推向有害行为，但促成这种转变的内部表示仍未被很好地定位。最近的机制安全性研究通常通过广泛的表示对象来解释这种行为，包括全局拒绝方向、激活引导向量和与拒绝相关的SAE特征。我们转而询问越狱脆弱性是否可以追溯到更细粒度的、基于提示的SAE特征子组。我们引入了一个基于Token的机制流水线，将Gemma-2-2B的残差流分解为稀疏自编码器（SAE）特征，并识别与不安全行为相关的特征子组。使用BeaverTails中的单类别不安全示例以减少跨类别干扰，我们从对抗性响应中提取有害概念，并通过子空间相似性将其与概念相关的提示Token对齐。然后，我们应用三种特征分组策略：基于聚类的、层次链接的和单Token驱动的，以识别所有26层中的SAE特征子组。最后，我们放大每个子组中的顶级特征，并使用标准的有害性评判器评估生成的输出。单Token驱动的分组实现了与完整基于聚类的分组相当的有害性，表明单个有害提示Token足以定位与脆弱性相关的SAE特征子组，而无需依赖更广泛的聚类级聚合。这些子组出现在早期和中后期层，且更集中在中后期层，其中目标引导暴露了特定的模型脆弱性。总体而言，我们的结果表明越狱敏感性可以追溯到稀疏的、基于Token定位的SAE特征子组，补充了先前基于广泛对抗、拒绝或引导方向的解释。

英文摘要

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.
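The two operations the pipeline relies on, decomposing a residual-stream vector into SAE features and amplifying a chosen feature, can be sketched on a toy example. The weights, the plain ReLU encoder, the scale factor, and the function names below are assumptions for illustration; the SAEs used for Gemma-2-2B may use a different activation function and bias convention, and the paper's feature-selection step is not reproduced here.

```python
def relu(x):
    return x if x > 0 else 0.0


def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    return [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def amplify_feature(x, features, w_dec, index, scale):
    """Steer the residual vector by adding (scale - 1) * f_index * d_index,
    where d_index is the decoder direction of the chosen feature."""
    delta = (scale - 1.0) * features[index]
    return [xi + delta * d for xi, d in zip(x, w_dec[index])]


# Toy 2-dimensional residual stream with 3 features (hypothetical weights).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

x = [1.0, 0.5]
features = sae_encode(x, w_enc, b_enc)                            # [1.0, 0.5, 0.0]
steered = amplify_feature(x, features, w_dec, index=1, scale=4.0)

print(features, steered, sae_encode(steered, w_enc, b_enc))
```

In this toy setup, re-encoding the steered vector shows feature 1 at four times its original activation while the other features are unchanged, which is the kind of targeted intervention the abstract describes before the outputs are scored by a harmfulness judge.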

2511.20002 2026-06-18 cs.CV cs.AI cs.CR 版本更新 85%

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

语义路由器：通过单一对抗扰动劫持多模态大语言模型的可行性研究

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

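Topic match: Jailbreak attacks. Proposes a semantic-aware universal perturbation that hijacks MLLMs, which falls under jailbreak attacks.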

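AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions simultaneously. It is realized through theoretical analysis and the SORT optimization strategy, and achieves a 66% attack success rate over five targets on Qwen.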

Comments Accepted to ICML 2026

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

1. Safety evaluation (19 papers)

Understanding and Mitigating Prompt Leaking Attacks in Real-World LLM-Based Applications

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Lifecycle-Aware Dynamic Analysis for Secure ML Model Execution

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Signature filtering: a lightweight enhancement for statistical watermark detection in large language models

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

Quantifying Compromise Risk in Exceptional Access Architectures Under Sparse and Indirect Evidence

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

RedactionBench

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning

Detecting Hidden ML Training With Zero-Overhead Telemetry

Runtime Compliance Verification for AI Agents

Output Vector Editing for Memorization Mitigation in Large Language Models

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

2. Prompt injection (4 papers)

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts

3. Preference alignment (5 papers)

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Steerable Cultural Preference Optimization of Reward Models

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

Scaling Laws for Moral Machine Judgment in Large Language Models

4. Jailbreak attacks (2 papers)

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation