Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing
Yongchao Huang, Hassan Raza
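AI summary: Proposes Gaussian Mixture Attention (GMA), which replaces pairwise query-key comparison with latent routing through K Gaussian mixture components, achieves linear memory scaling for fixed K, and is competitive with attention baselines on long-context classification.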
Comments: 55 pages
The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce \textbf{Gaussian Mixture Attention (GMA)}, a probabilistic attention-style sequence mixer that replaces explicit pairwise query--key comparison with routing through $K$ learned Gaussian mixture components. Queries and keys are mapped to posterior \textit{responsibility} vectors over a shared latent routing space; their overlap defines an implicit responsibility-space affinity, while values are written into and read from a $K$-slot latent memory. By exploiting the associativity of matrix multiplication, GMA avoids materializing the induced $N\times N$ affinity matrix and instead uses two responsibility matrices whose dominant activation storage scales as $\mathcal{O}(NK)$ rather than $\mathcal{O}(N^2)$ for fixed $K$. We formulate bidirectional and causal variants of GMA, provide an end-to-end differentiable parameterization of the Gaussian mixture components, and analyze its responsibility-modulated gradient structure, constrained non-negative low-rank affinity interpretation, and local routing stability. Empirically, GMA exhibits the intended fixed-$K$ linear memory scaling and is competitive with attention-style baselines on long-context classification, while causal GMA improves over tested linear/random-feature attention variants on WikiText-103 but remains behind optimized causal SDPA and Mamba in the current implementation. Analysis of learned responsibilities further shows broad component usage and moderate alignment with surface-form token categories, supporting GMA as a probabilistic, interpretable, fixed-$K$ linear-time attention-style alternative rather than a universal replacement for optimized softmax attention or state-space models.
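The routing mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The diagonal-covariance parameterization, the row normalization by responsibility mass, and the names `responsibilities`, `gma_bidirectional`, `gma_causal`, and `eps` are all assumptions made for illustration; the abstract does not specify them. The sketch only shows how associativity lets the output be computed from two $N\times K$ responsibility matrices and a $K$-slot memory without forming the $N\times N$ affinity.

```python
import numpy as np

def responsibilities(x, means, log_vars, log_weights):
    """Posterior p(k | x) under a diagonal-covariance Gaussian mixture (assumed form).

    x: (N, d); means, log_vars: (K, d); log_weights: (K,). Returns (N, K).
    """
    diff = x[:, None, :] - means[None, :, :]                      # (N, K, d)
    log_pdf = -0.5 * np.sum(
        diff ** 2 / np.exp(log_vars) + log_vars + np.log(2.0 * np.pi), axis=-1
    )                                                             # (N, K)
    logits = log_pdf + log_weights
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=-1, keepdims=True)

def gma_bidirectional(q, k, v, params, eps=1e-9):
    """Compute R_q (R_k^T V) with row normalization; never builds R_q R_k^T."""
    rq = responsibilities(q, *params)      # (N, K) read weights
    rk = responsibilities(k, *params)      # (N, K) write weights
    memory = rk.T @ v                      # (K, d_v): values written into K slots
    mass = rk.sum(axis=0)                  # (K,): total write mass per slot
    num = rq @ memory                      # (N, d_v): read from the slots
    den = rq @ mass                        # (N,): row sums of the implicit affinity
    return num / (den[:, None] + eps)      # normalization is an assumption

def gma_causal(q, k, v, params, eps=1e-9):
    """Causal variant: token t reads only from writes by tokens j <= t."""
    rq = responsibilities(q, *params)
    rk = responsibilities(k, *params)
    # Prefix sums of per-token writes. This stores (N, K, d_v) for clarity;
    # a streaming version would keep only the running (K, d_v) state.
    memory = np.cumsum(rk[:, :, None] * v[:, None, :], axis=0)   # (N, K, d_v)
    mass = np.cumsum(rk, axis=0)                                  # (N, K)
    num = np.einsum("nk,nkd->nd", rq, memory)
    den = np.sum(rq * mass, axis=-1)
    return num / (den[:, None] + eps)
```

Under these assumptions, the bidirectional output matches applying the row-normalized affinity $R_q R_k^\top$ to the values, and the causal output matches its lower-triangular counterpart, while the stored activations grow with $NK$ rather than $N^2$.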