Formalizing the Binding Problem
形式化绑定问题
Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording
AI总结 本文用信息论方法形式化绑定问题,提出一种探测方法测量模型表示中的绑定信息,并在视觉Transformer上实验,证明绑定是强视觉识别和推理的关键要素。
Comments Accepted to ICML 2026
详情
世界表征,可以说,包含关于特征的信息(例如,某物是蓝色的,某物是圆形的),但也包含关于哪些特征属于同一对象的信息(例如,圆形是蓝色的),我们称之为绑定信息。任何具有理解包含多个对象场景能力的系统都必须解决绑定问题:它需要知道哪些特征属于一起。然而,尽管有研究表明视觉Transformer(ViT)知道哪些补丁属于一起,但目前尚不清楚当前的深度学习模型是否学会展示绑定信息,即针对特征的信息。我们可能认为绑定信息并不多,毕竟将特征错误归因于错误对象是基于ViT架构的常见失败,尤其是在对象共享特征的场景中。本文用信息论方法形式化绑定问题,并引入一种探测方法来测量模型表示中的绑定信息。我们在ViT上进行实验,测量来自架构不同组件(如图像摘要标记[CLS]或空间标记)的绑定信息。我们使用具有不同绑定挑战的数据集,例如特征共享、遮挡和自然特征,同时比较多个预训练ViT的性能。总体而言,我们的研究证明了绑定是强视觉识别和推理的关键要素。
Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.