2605.29358
2026-05-29
cs.AI
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
Affiliation
Anthropic
AI Summary
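This study uses sparse autoencoders to extract interpretable features from the production language model Claude 3 Sonnet, demonstrating that dictionary learning scales to large models, and analyzes the features' multilingual and multimodal properties and their causal influence on model behavior.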