Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen
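AI summary: Proposes SWARR, which efficiently converts a pretrained self-attention model to sliding-window attention via supervised fine-tuning and then applies reinforcement learning for policy adaptation, narrowing the performance gap with self-attention while retaining the efficiency of linear-complexity attention.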
The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
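To make the complexity contrast in the abstract concrete, here is a minimal sketch of a causal sliding-window mask and a count of the query-key pairs that get scored. It is only an illustration: the function names and the window sizes are hypothetical, and the abstract does not state the window configuration or implementation SWARR actually uses.

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask (illustrative, not the paper's code).

    Query position i may attend to key position j iff i - window < j <= i,
    i.e. to itself and the (window - 1) tokens immediately before it.
    """
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(seq_len, window=None):
    """Count the (query, key) pairs that are scored.

    window=None models full causal self-attention (SA); an integer
    window models sliding-window attention (SWA).
    """
    w = seq_len if window is None else window
    return sum(min(i + 1, w) for i in range(seq_len))


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the growth rates.
    for n in (1_000, 2_000, 4_000):
        print(n, attended_pairs(n), attended_pairs(n, window=100))
```

Under these assumptions, doubling the sequence length roughly quadruples the pair count for full causal attention, while for a fixed window it adds only `window` pairs per extra token. The same mask also shows the constraint the abstract points to: any dependency that spans more than `window` tokens is not directly visible to a single attention layer.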