Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
Alon Bebchuk, Nir Shavit
AI summary: This paper studies how the lottery ticket hypothesis manifests in early feature space. Using combinatorial toy models, it identifies the object that winning tickets preserve in feature space, and indicates that lottery ticket structure is determined by hidden feature-space geometry rather than weight-space subnetwork identity.
The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already, at initialization, near the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.
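For readers unfamiliar with the weight-based baseline the abstract refers to, the following is a minimal sketch of the generic prune-and-rewind step from the lottery ticket literature: keep the largest-magnitude trained weights, then reset the survivors to their initial values. It is not the authors' feature-space probe; the function names `magnitude_mask` and `rewind`, the toy weight values, and the 50% sparsity level are all illustrative assumptions.

```python
# Illustrative sketch only: generic magnitude pruning with rewinding.
# Names and numbers are hypothetical and do not come from the paper.

def magnitude_mask(trained, sparsity):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights.

    `sparsity` is the fraction of weights to remove. Ties at the
    threshold are all kept, so slightly more weights may survive.
    """
    flat = sorted((abs(w) for row in trained for w in row), reverse=True)
    keep = max(1, round(len(flat) * (1.0 - sparsity)))
    threshold = flat[keep - 1]
    return [[1 if abs(w) >= threshold else 0 for w in row] for row in trained]


def rewind(initial, mask):
    """Reset surviving weights to their initial values and zero the rest."""
    return [[w0 * m for w0, m in zip(row0, row_m)]
            for row0, row_m in zip(initial, mask)]


# Toy 2x3 weight matrix before and after (hypothetical) dense training.
initial = [[0.1, -0.2, 0.05], [0.3, -0.01, 0.2]]
trained = [[0.9, -0.1, 0.02], [1.2, -0.03, 0.7]]

mask = magnitude_mask(trained, sparsity=0.5)
ticket = rewind(initial, mask)  # candidate ticket, to be retrained in isolation
```

In this sketch the mask is chosen from the trained weights while the surviving values come from initialization, which is the sense in which a ticket is "rewound." The abstract's claim is that what such a mask preserves is better described in feature space than by the identity of the surviving weights.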