Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
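AI summary: Proposes the Chain-of-Thought Hijacking attack, which induces large reasoning models to carry out prolonged benign reasoning, weakening their ability to refuse harmful requests and achieving jailbreaks with high success rates.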
Details
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.
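As a rough illustration of what measuring a "low-dimensional safety signal" can look like, the sketch below estimates a single direction in activation space as a difference of class means and reports how strongly a set of activations projects onto it. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not say how the signal is extracted, difference of means is simply one common probing choice, the activations here are synthetic random vectors rather than model outputs, and the names `refusal_direction` and `signal_strength` are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the harmless mean to the harmful mean.

    Both inputs have shape (n_samples, hidden_dim). Difference of means is an
    assumed technique here, not something the abstract specifies.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def signal_strength(acts: np.ndarray, direction: np.ndarray) -> float:
    """Mean scalar projection of the activations onto the direction."""
    return float((acts @ direction).mean())


# Synthetic stand-in data (assumption): the two classes differ only along axis 0.
rng = np.random.default_rng(0)
hidden_dim = 64
planted = np.zeros(hidden_dim)
planted[0] = 1.0
harmless = rng.normal(size=(200, hidden_dim))
harmful = rng.normal(size=(200, hidden_dim)) + 3.0 * planted

direction = refusal_direction(harmful, harmless)
strength_harmful = signal_strength(harmful, direction)
strength_harmless = signal_strength(harmless, direction)
print(f"harmful: {strength_harmful:.2f}, harmless: {strength_harmless:.2f}")
```

With real models, one would apply the same projection to activations collected at different reasoning lengths and check whether the value shrinks as the trace grows, which is the kind of weakening the abstract describes.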