A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle
Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu
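AI summary: This paper proposes a distributional approach to visual mechanistic interpretability. It balances interpretability and model faithfulness through a KL-minimal optimization problem, realizes the approach via energy-guided diffusion posterior sampling, and validates its effectiveness on the DINOv3 model.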
Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.
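The abstract does not spell out the optimization problem, so the following is only a toy sketch of one standard way a KL-minimal soft constraint can be written, not the authors' formulation. It assumes a finite set of six "images" with a prior `p` standing in for the natural image distribution, a made-up activation score `f` for one feature, and a hypothetical trade-off weight `lam`. Under these assumptions, minimizing KL(q || p) - lam * E_q[f] over distributions q has the closed-form solution q*(x) ∝ p(x) exp(lam * f(x)), which stays close to the prior (interpretability) while favoring high activations (faithfulness).

```python
import numpy as np


def tilted_distribution(p, f, lam):
    """Return q*(x) proportional to p(x) * exp(lam * f(x)) on a finite support."""
    logits = np.log(p) + lam * f
    logits -= logits.max()  # subtract the max for numerical stability
    q = np.exp(logits)
    return q / q.sum()


def objective(q, p, f, lam):
    """Soft-constrained objective: KL(q || p) - lam * E_q[f]."""
    kl = np.sum(q * (np.log(q) - np.log(p)))
    return kl - lam * np.sum(q * f)


# All quantities below are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))  # toy prior over six items
f = rng.normal(size=6)         # toy feature activation for each item
lam = 2.0                      # hypothetical interpretability/faithfulness weight

q_star = tilted_distribution(p, f, lam)
best = objective(q_star, p, f, lam)

# Randomly drawn candidate distributions should never score below q_star.
others = [objective(rng.dirichlet(np.ones(6)), p, f, lam) for _ in range(1000)]
print("objective at q*:", best)
print("best random candidate:", min(others))
```

As `lam` approaches 0 the solution collapses to the prior, and as `lam` grows it concentrates on the highest-activation items, which is the trade-off the abstract describes in words. In the continuous image setting, sampling from a density of this tilted form is the kind of task that energy-guided diffusion posterior sampling addresses.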