Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
后训练的解剖:利用可解释性表征数据并塑造学习信号
Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana
AI总结 提出基于可解释性的数据后训练流程,通过统计假设识别偏好数据中的潜在概念,实现细粒度反馈,减少虚假关联和不良行为。
详情
语言模型后训练是塑造模型行为的主要阶段,但它仍然主要涉及优化总结多样需求的标量奖励。这种抽象使从业者几乎无法了解数据实际教会了模型什么,导致模型学习虚假关联,并引发过度风格化和谄媚等不良行为。为了解决这个问题,我们提出:能否在优化之前检查偏好数据集,并在概念层面决定模型应该被允许学习哪些行为?受此启发,我们引入了一个以数据为中心的后训练流程,该流程使用可解释性协议来开发统计假设,以区分偏好和非偏好生成的潜在概念,使其明确以供细粒度用户反馈。基于这一观点,我们将几种基于可解释性的训练协议统一为通过特征或数据干预来塑造奖励的方式。实验上,我们表明我们的流程诊断了现有偏好数据中的不良信号,减轻了脱靶学习,并且还可以帮助放大或塑造期望的属性,如安全防护和模型个性。更广泛地说,我们的结果表明,可解释性可以将后训练从优化不透明的代理奖励转变为审计和塑造学习信号本身的过程。
Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.