On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning
Siyi Lyu, Quan Liu, Feng Yan
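AI summary: By formalizing spatial understanding as a group homomorphism problem, this paper shows that, under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, constant-depth Transformers are limited to $\mathsf{TC^0}$ and therefore cannot capture the spatial structure of non-solvable groups (such as $\mathrm{SO}(3)$) in a single forward pass.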
Vision Transformers (ViTs) excel at semantic recognition but fail systematically on spatial reasoning tasks such as mental rotation. While these failures are often attributed to insufficient data scale, this work argues that the limitation arises from the intrinsic circuit complexity of the architecture. By formalizing spatial understanding as a Group Homomorphism Problem, in which latent embeddings must preserve the algebraic structure of the physical transformations acting on images, we identify a fundamental computational bottleneck. Specifically, for non-solvable groups (e.g., $\mathrm{SO}(3)$), maintaining such structure-preserving embeddings is lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, constant-depth ViTs with polynomial precision are bounded by the complexity class $\mathsf{TC^0}$. Under the standard conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, a complexity boundary emerges: constant-depth architectures lack the logical depth required to capture non-solvable spatial structures in a single forward pass. To validate this theoretical gap empirically, we propose the Latent Space Algebra (LSA) benchmark, which reveals a significant degradation in ViT representations as the compositional depth of non-solvable tasks increases.
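To make the Word Problem concrete, the following is a minimal illustrative sketch and not the paper's benchmark or method. It assumes the alternating group $A_5$ as a finite stand-in for a non-solvable group ($A_5$ is isomorphic to the rotation group of the icosahedron, a finite subgroup of $\mathrm{SO}(3)$). The function names `compose`, `inverse`, `evaluate_word`, and `is_identity_word` are hypothetical helpers introduced only for this example.

```python
# Illustrative sketch of the Word Problem in A_5 (an assumption for this
# example; the abstract itself names SO(3)). A permutation of {0,...,4} is
# stored as a tuple p, where p[i] is the image of i.

IDENTITY = (0, 1, 2, 3, 4)


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(p)))


def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def evaluate_word(word):
    """Multiply out a sequence of group elements, applied left to right."""
    result = IDENTITY
    for g in word:
        result = compose(g, result)
    return result


def is_identity_word(word):
    """Word Problem: does the given word evaluate to the identity?"""
    return evaluate_word(word) == IDENTITY


# Two even permutations (elements of A_5): a 5-cycle and a 3-cycle.
a = (1, 2, 3, 4, 0)
b = (1, 2, 0, 3, 4)

# A word that cancels: apply a, then b, then undo b, then undo a.
cancelling_word = [a, b, inverse(b), inverse(a)]

# A commutator-style word: a and b do not commute, so this is not trivial.
commutator_word = [a, b, inverse(a), inverse(b)]

print(is_identity_word(cancelling_word))
print(is_identity_word(commutator_word))
```

The loop in `evaluate_word` is sequential and takes one step per symbol, so the number of dependent steps grows with the word length. The abstract's argument concerns whether this composition can instead be carried out at a fixed depth that does not grow with the length of the word.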