Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang
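AI summary: This paper finds that the real failure mode of aligned LLMs is the registers of natural human writing that safety training under-covers, and proposes the first jailbreak method that uses real fanfiction subgenres as universal attack carriers, significantly raising the attack success rate.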
Details
- Comments
- 23 pages
Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean attack success rate (ASR) from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
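To make the headline metric concrete, the following is a minimal sketch of how an attack success rate under a four-judge ensemble might be tallied, and what the reported lift from 0.278 to 0.731 amounts to in absolute and relative terms. The abstract does not state how the four judges are aggregated, so the `min_agree` threshold, the function names, and the toy verdicts are all assumptions made for illustration only.

```python
from typing import Sequence


def ensemble_verdict(votes: Sequence[bool], min_agree: int = 3) -> bool:
    """Count a response as a successful attack when at least `min_agree`
    judges flag it. ASSUMPTION: the paper's actual aggregation rule is not
    given in the abstract; a 3-of-4 threshold is used here as a placeholder."""
    return sum(votes) >= min_agree


def attack_success_rate(verdicts: Sequence[Sequence[bool]], min_agree: int = 3) -> float:
    """Fraction of judged responses that the ensemble counts as successes."""
    if not verdicts:
        raise ValueError("need at least one judged response")
    hits = sum(ensemble_verdict(v, min_agree) for v in verdicts)
    return hits / len(verdicts)


# Toy verdicts (hypothetical): one tuple of four judge votes per response.
toy_verdicts = [
    (True, True, True, False),     # 3 of 4 judges agree: counted as success
    (True, False, False, False),   # 1 of 4: not counted
    (True, True, True, True),      # 4 of 4: counted as success
    (False, False, False, False),  # 0 of 4: not counted
]
toy_asr = attack_success_rate(toy_verdicts)

# Mean ASR figures as reported in the abstract.
baseline_asr = 0.278
vernacular_asr = 0.731
absolute_gain = vernacular_asr - baseline_asr
relative_gain = vernacular_asr / baseline_asr

print(f"toy ASR: {toy_asr:.3f}")
print(f"absolute gain: {absolute_gain:.3f}")
print(f"vernacular-to-baseline ratio: {relative_gain:.2f}")
```

Under these assumptions the reported figures correspond to an absolute gain of about 0.453 and a vernacular-to-baseline ratio of about 2.63. A stricter or looser `min_agree` would change any ASR computed from raw verdicts, which is why the aggregation rule matters when comparing numbers across papers.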