RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng
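AI summary: This paper proves that RoPE breaks down in long contexts because it loses its locality bias and its consistency in token relevance, leaving it unable to distinguish positions or tokens; increasing the RoPE base comes only at the cost of the ability to distinguish positions.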
Comments: 35 pages, 11 figures, submitted to NeurIPS 2026
Details
We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
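For readers who want to see the quantities the abstract refers to, the following is a minimal sketch of the standard RoPE attention score, assuming the usual formulation in which consecutive coordinate pairs are rotated by an angle of position times base^(-2i/d). It is not the authors' code and does not reproduce their proofs or experiments. The function names (rope_frequencies, rope_rotate, rope_score), the head dimension of 64, and the two base values are illustrative assumptions.

```python
import numpy as np


def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/dim), i = 0 .. dim/2 - 1."""
    return float(base) ** (-np.arange(0, dim, 2) / dim)


def rope_rotate(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i.

    Each pair is treated as one complex number, so a rotation is a
    multiplication by exp(1j * angle).
    """
    theta = rope_frequencies(x.shape[-1], base)
    x_complex = x[0::2] + 1j * x[1::2]
    return x_complex * np.exp(1j * pos * theta)


def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention logit between a query at q_pos and a key at k_pos."""
    q_rot = rope_rotate(q, q_pos, base)
    k_rot = rope_rotate(k, k_pos, base)
    # Re(a * conj(b)) summed over pairs equals the real dot product of the rotated vectors.
    return float(np.real(np.sum(q_rot * np.conj(k_rot))))


# Illustrative setup (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# The score depends only on the offset q_pos - k_pos, not on absolute positions.
score_near_origin = rope_score(q, k, q_pos=10, k_pos=3)
score_shifted = rope_score(q, k, q_pos=1010, k_pos=1003)
print(score_near_origin, score_shifted)

# The slowest-rotating pair turns by this angle per position step.
# A larger base makes that angle smaller, so neighboring positions look more alike to it.
slow_step_small_base = rope_frequencies(64, base=1e4)[-1]
slow_step_large_base = rope_frequencies(64, base=1e6)[-1]
print(slow_step_small_base, slow_step_large_base)
```

Under these assumptions, the score for one query-key pair is a sum of sinusoids in the relative offset, one per frequency. The last two lines only illustrate the direction of the base trade-off described in the abstract: raising the base shrinks the per-step rotation of the low-frequency pairs. They do not quantify the failure probabilities the paper proves.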