Teaching Diffusion to Speculate Left-to-Right
Lexington Whalen, Yuki Ito, Ryo Sakamoto
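AI summary: To address the inference bottleneck of autoregressive decoding, the paper proposes three training-time interventions (positional weighting, a first-error focal loss, and a chain loss) that bridge the asymmetry between the bidirectional generation of block-diffusion draft models and the left-to-right verification of autoregressive target models, significantly increasing accepted draft length.
Details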
- Comments
- 13 pages, technical report
Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
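To make the left-to-right asymmetry concrete, the sketch below shows why the position of an error matters more than the number of errors. It is an illustration only, not the authors' implementation: the function names are hypothetical, verification is simplified to greedy token matching, and the expected accepted length is modelled as the sum of cumulative products of per-position acceptance probabilities, which assumes that position k counts only if every earlier position was accepted.

```python
def accepted_prefix_length(draft, target):
    """Count leading draft tokens that match the target, stopping at the first mismatch.

    Simplifying assumption: greedy verification by exact token match.
    """
    accepted = 0
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            break
        accepted += 1
    return accepted


def expected_accepted_length(accept_probs):
    """Sum over k of the product of accept_probs[0..k].

    Assumption: accept_probs[i] is the chance that position i is accepted
    given that all earlier positions were accepted.
    """
    total = 0.0
    prefix_prob = 1.0
    for p in accept_probs:
        prefix_prob *= p      # probability the whole prefix up to here survives
        total += prefix_prob  # each surviving prefix adds one accepted token
    return total


# One wrong token at index 2 discards the correct token at index 3 as well.
draft = [5, 9, 2, 7]
target = [5, 9, 4, 7]
prefix_len = accepted_prefix_length(draft, target)

# Same per-position probabilities, different placement of the weak position.
weak_early = expected_accepted_length([0.9, 0.5, 0.9, 0.9])
weak_late = expected_accepted_length([0.9, 0.9, 0.9, 0.5])
```

Under these assumptions, the block with its weak position early has an expected accepted length of about 2.12 tokens, against about 2.80 when the same weak position comes last, even though a position-uniform loss would score both blocks identically. Because the expression is a sum of products of probabilities, it is differentiable in each probability, which is the property a chain-style surrogate needs.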