Steering at the Source: Style Modulation Heads for Robust Persona Control
Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura
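AI summary: By identifying and intervening on only a small number of attention heads (Style Modulation Heads), this paper achieves robust control of an LLM's persona and style without fine-tuning, while substantially mitigating the coherency degradation caused by residual stream interventions.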
Comments: 8 main pages with appendix
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While steering effectively controls target traits (e.g., persona), the accompanying coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that interventions targeting only these specific heads achieve robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that component-level localization enables safer and more precise model control.
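The abstract names two localization signals (layer-wise cosine similarity and head-wise contribution scores) and a head-only intervention, but gives no formulas. The following is a minimal NumPy sketch under assumed definitions, not the authors' implementation: the function names, the projection-based score, and the synthetic activations are all hypothetical, chosen only to illustrate how steering a few heads differs from adding a vector to the whole residual stream.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def layerwise_cosine(persona, neutral):
    """Cosine similarity per layer between two (n_layers, d_model) arrays
    of mean residual-stream activations (assumed inputs)."""
    num = np.sum(persona * neutral, axis=1)
    den = np.linalg.norm(persona, axis=1) * np.linalg.norm(neutral, axis=1)
    return num / (den + EPS)


def head_contribution_scores(head_persona, head_neutral):
    """Assumed score: project each head's mean output difference onto the
    unit direction of the summed difference at that layer.
    Inputs are (n_heads, d_model) per-head writes into the residual stream."""
    diff = head_persona - head_neutral
    direction = diff.sum(axis=0)
    direction = direction / (np.linalg.norm(direction) + EPS)
    return diff @ direction


def steer_heads(head_out, head_ids, vectors, alpha):
    """Add alpha * vectors[h] to the selected heads only, then sum over heads
    to obtain the layer's attention write into the residual stream."""
    steered = head_out.copy()
    for h in head_ids:
        steered[h] += alpha * vectors[h]
    return steered.sum(axis=0)


# Synthetic demo (hypothetical data): only heads 2 and 5 carry the "style" signal.
rng = np.random.default_rng(0)
n_heads, d_model = 8, 16
style = rng.normal(size=d_model)
neutral = 0.1 * rng.normal(size=(n_heads, d_model))
persona = neutral.copy()
persona[2] += 3.0 * style
persona[5] += 2.0 * style

scores = head_contribution_scores(persona, neutral)
top_heads = sorted(np.argsort(scores)[-2:].tolist())  # two highest-scoring heads

vectors = persona - neutral  # per-head steering vectors
write = steer_heads(neutral, top_heads, vectors, alpha=1.0)

layers = rng.normal(size=(4, d_model))
sims = layerwise_cosine(layers, layers)  # identical inputs, so similarity is ~1

print("selected heads:", top_heads)
```

In this toy setup the intervention leaves every unselected head untouched, which is the property the abstract credits with reducing off-target noise. In a real model the per-head outputs would have to be captured and modified inside the attention layer, before the output projection mixes them.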