Consistency Training while Mitigating Obfuscation via Rate Matching
Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa
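AI summary: Proposes Rate Matching Consistency Training (RMCT), which matches the rate of a target behaviour rather than constraining how that behaviour is expressed, reducing the model's susceptibility to extraneous features while avoiding obfuscation and improving monitorability.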
Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
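To make the idea of matching a behaviour rate across input perturbations more concrete, here is a minimal sketch. It is not the authors' training objective, which the abstract does not specify. The function names `behaviour_rate` and `rate_mismatch`, the squared-deviation penalty, and the perturbation labels are all assumptions introduced for illustration. The sketch takes, for each perturbation, a list of flags saying whether each sampled response exhibited the target behaviour (for example, following a bias cue), and measures how far the per-perturbation rates are from agreeing.

```python
from statistics import mean


def behaviour_rate(flags):
    """Fraction of sampled responses that exhibit the target behaviour."""
    if not flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for flag in flags if flag) / len(flags)


def rate_mismatch(flags_by_perturbation):
    """Mean squared deviation of per-perturbation rates from their average.

    Hypothetical penalty (an assumption, not taken from the paper): it is
    zero exactly when the behaviour occurs at the same rate under every
    perturbation, regardless of how individual responses are worded.
    """
    rates = [behaviour_rate(flags) for flags in flags_by_perturbation.values()]
    centre = mean(rates)
    return mean([(rate - centre) ** 2 for rate in rates])


# Hypothetical data: did each sampled response pick answer A,
# under two perturbations of the same question?
samples = {
    "cue_points_to_A": [True, True, False, True],
    "cue_points_to_B": [False, True, False, False],
}
print(rate_mismatch(samples))
```

Because the penalty in this sketch depends only on how often the behaviour occurs, it places no direct constraint on whether a response mentions the cue, which is the property the abstract contrasts with consistency training over entire responses or internal activations.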