Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics
Dominik Dahlem, Diego Maniloff, Mac Misiura
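AI summary: Studies two failure shapes of attention routing in language models (over-concentration or over-diffusion), proves that symmetric spectral diagnostics are insensitive to direction, reveals a theoretical lower bound on transport capacity in causal attention, and proposes a two-axis diagnostic based on capacity and direction.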
Details
- Comments: 48 pages, 6 figures, 7 tables; 81-page online supplement (proofs, additional experiments, dataset statistics) as an ancillary file
When a language model processes a hallucinated response, its attention routing tends to fail in one of two shapes: over-concentrating on a narrow set of positions, or spreading so diffusely that relevance is diluted. The shape of the failure carries diagnostic signal. We study these shapes as a diagnostic characterization, computed from attention matrices under \emph{forced scoring} of benchmark-labeled responses rather than during live generation. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport \emph{capacity}; we prove that every transpose-invariant spectral diagnostic of this operator is structurally \emph{orientation-blind} (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a converse to the blindness theorem bounding any Lipschitz diagnostic's transpose sensitivity by the asymmetry coefficient $G$. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$, while window attention pierces the floor as $O(w/n)$; the failure modes differ in shape, not just in value. This floor is an idealized-architecture benchmark, not an empirical attractor: the fraction of real attention heads that pierce it is itself an architectural signature. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (0.62 to 0.84 LC-AUROC) across the tested decoder-only, encoder-only, and encoder-decoder models, with polarity reversing as predicted between HaluEval and MedHallu.
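The orientation-blindness claim can be illustrated numerically with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. The abstract does not define the degree normalization or the asymmetry coefficient $G$, so the sketch uses the plain symmetric part $(A + A^\top)/2$ as a stand-in for the symmetrized operator and a hypothetical Frobenius-norm ratio as a stand-in for an asymmetry measure. The function names `uniform_causal_attention`, `symmetric_spectrum`, and `asymmetry_ratio` are illustrative only.

```python
import numpy as np


def uniform_causal_attention(n):
    """Row-stochastic lower-triangular matrix: token i attends uniformly to tokens 0..i."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


def symmetric_spectrum(a):
    """Eigenvalues (ascending) of the symmetric part (A + A^T) / 2.

    Assumption: the paper's degree normalization is omitted here.
    """
    return np.linalg.eigvalsh((a + a.T) / 2.0)


def asymmetry_ratio(a):
    """Hypothetical asymmetry measure ||A - A^T||_F / ||A + A^T||_F.

    Assumption: a stand-in only; the paper's coefficient G may be defined differently.
    """
    return np.linalg.norm(a - a.T) / np.linalg.norm(a + a.T)


A = uniform_causal_attention(8)

# A and its transpose share the same symmetric part, so any diagnostic
# built from that spectrum returns identical values for both orientations.
same_spectrum = np.allclose(symmetric_spectrum(A), symmetric_spectrum(A.T))

# The antisymmetric part carries the direction: summing its strictly
# upper triangle gives a signed quantity that flips sign under transposition.
signed_flow_forward = np.triu(A - A.T, k=1).sum()
signed_flow_reversed = np.triu(A.T - A, k=1).sum()

print("spectra identical:", same_spectrum)
print("asymmetry ratio:", asymmetry_ratio(A))
print("signed flow (A, A^T):", signed_flow_forward, signed_flow_reversed)
```

In this toy setting the symmetric spectrum cannot tell causal attention from its time-reversed counterpart, while a quantity built from the antisymmetric part can. That is the motivation the abstract gives for pairing a capacity axis with a direction axis.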