LLM Alignment and Safety - arXivDaily Topic

2604.23130 2026-06-18 cs.CL cs.AI Updated version 90%

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

Affiliations: UMBC (University of Maryland, Baltimore County); Apple

Topic match: Jailbreak attacks. Mechanistically localizes jailbreak vulnerabilities and analyzes harmful features.

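AI summary: Proposes a token-driven mechanistic pipeline that localizes jailbreak vulnerabilities through sparse autoencoder (SAE) feature subgroups. It finds that a single harmful token is enough to localize vulnerable features, and that these features concentrate in mid-to-late layers.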

English abstract

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.
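
The amplification step described in the abstract can be illustrated with a minimal sketch. It assumes a plain ReLU SAE with hypothetical weight names (`W_enc`, `b_enc`, `W_dec`) and toy dimensions; the SAEs, feature-selection rule, and scaling used in the paper may differ. The helper names `sae_encode`, `top_features_for_token`, and `amplify_features` are illustrative and do not come from the paper.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map residual-stream vectors to non-negative SAE feature activations.
    Assumption: a plain ReLU encoder (real SAEs may use other activations)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def top_features_for_token(f_token, k):
    """Return indices of the k most active SAE features for one token.
    Hypothetical stand-in for a single-token-driven selection rule."""
    return np.argsort(f_token)[::-1][:k]

def amplify_features(x, W_enc, b_enc, W_dec, feature_ids, scale):
    """Scale the chosen features by `scale` and write the change back to
    the residual stream via the decoder, leaving everything else untouched."""
    f = sae_encode(x, W_enc, b_enc)
    delta = np.zeros_like(f)
    # Extra activation needed so each chosen feature becomes scale * f.
    delta[..., feature_ids] = (scale - 1.0) * f[..., feature_ids]
    # Adding only the decoded delta keeps the SAE reconstruction error in x.
    return x + delta @ W_dec

# Toy usage with random weights (illustrative only, not Gemma-2-2B).
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
x = rng.normal(size=(3, d_model))  # residual vectors for 3 tokens

ids = top_features_for_token(sae_encode(x[0], W_enc, b_enc), k=4)
steered = amplify_features(x, W_enc, b_enc, W_dec, ids, scale=3.0)
```

Writing back only the decoded difference, rather than replacing the residual with the full SAE reconstruction, is one common design choice for feature steering; whether the paper does this is not stated in the abstract.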

2511.20002 2026-06-18 cs.CV cs.AI cs.CR Updated version 85%

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations: The Chinese University of Hong Kong, Shenzhen, China (School of Data Science; School of Artificial Intelligence)

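Topic match: Jailbreak attacks. Proposes a semantic-aware universal perturbation for hijacking MLLMs, which falls under jailbreak attacks.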

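AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions at once. Backed by theoretical analysis and the SORT optimization strategy, it reaches a 66% attack success rate over five targets against Qwen.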

Comments: Accepted to ICML 2026

English abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.
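
The abstract does not say how attack success is scored when one perturbation must route different inputs to different targets. One plausible reading, sketched below purely as an assumption, counts an input as a success only when the model's output matches the attacker-defined target assigned to that input's semantic class. The function name, the `routing` table, and the example labels are all hypothetical.

```python
from typing import Dict, Sequence

def attack_success_rate(
    semantics: Sequence[str],
    outputs: Sequence[str],
    routing: Dict[str, str],
) -> float:
    """Fraction of inputs whose model output equals the attacker's target
    for that input's semantic class (assumed scoring rule, not the paper's)."""
    if len(semantics) != len(outputs):
        raise ValueError("semantics and outputs must have the same length")
    if not semantics:
        return 0.0
    # A semantic class missing from `routing` can never count as a hit.
    hits = sum(1 for s, o in zip(semantics, outputs) if routing.get(s) == o)
    return hits / len(semantics)

# Hypothetical routing table: semantic class -> attacker-defined target.
routing = {"pedestrian": "accelerate", "red_light": "proceed"}
semantics = ["pedestrian", "red_light", "pedestrian", "red_light"]
outputs = ["accelerate", "stop", "accelerate", "proceed"]
asr = attack_success_rate(semantics, outputs, routing)
```

Under this rule a perturbation that forces the same output for every input would score poorly, which is what separates semantic routing from a single-target universal attack.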
