Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance
无权衡的可解释性:在同等预测性能下解开多义性
Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Jonas Fischer, Robin Hesse
AI总结 提出ELUDe方法,通过无损重组层间信息流,在不改变模型输出的前提下将多义神经元分解为单义特征,提升深度神经网络的可解释性。
详情
- Comments
- Preprint
深度神经网络(DNN)被广泛使用,但解释它们实际学到什么仍然困难。一个主要障碍是单个神经元通常编码多个不相关的概念,模糊了网络的决策过程。虽然先前的工作,如稀疏自编码器,可以将这些混合信号分离成更有意义的“单义”特征,但这通常需要以可能降低下游性能的方式改变模型。为了克服这一点,我们引入了ELUDe(显式、无损、无监督解缠),一种在保持功能等价性的同时提高DNN可解释性的方法。ELUDe将潜在表示分解为清晰、可检查的子单元,这些子单元表现得像可解释的特征,同时保证模型的输出保持完全相同。它不需要显式训练,不需要标签,并且可以应用于预训练模型。ELUDe通过重组层间信息流的方式工作,重新路由特定概念的贡献,同时通过构造保留原始计算。在多个视觉模型上,包括DINOv2和有监督的ViT-B/16,ELUDe提高了可解释性,保持下游准确性不变,运行高效,并支持实际用途,如引导模型表示。简而言之,ELUDe提供了(几乎)没有权衡的可解释性:更清晰、可扩展且可操作的模型洞察,且性能无损失。
Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.