$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin
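AI summary: Proposes the $R^2$-dLLM framework, which reduces spatial and temporal redundancy in diffusion large language model decoding at both the inference and training stages, cutting decoding steps by up to 88% while preserving generation quality.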
Details
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
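To make the two inference-time ideas in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical denoiser that returns one predicted token and one confidence per position at each step. The names `local_mean` and `decode_step`, the window `radius`, the threshold `tau`, and the `patience` count are all illustrative assumptions; the actual rules and hyperparameters are in the linked repository.

```python
from typing import List, Optional, Sequence

MASK = None  # assumption: a still-masked position is represented by None


def local_mean(conf: Sequence[float], radius: int = 1) -> List[float]:
    """Average each confidence with its neighbours inside a small window."""
    n = len(conf)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(conf[lo:hi]) / (hi - lo))
    return out


def decode_step(
    tokens: List[Optional[int]],
    preds: Sequence[int],
    conf: Sequence[float],
    prev_preds: List[Optional[int]],
    streak: List[int],
    tau: float = 0.9,       # hypothetical confidence threshold
    patience: int = 2,      # hypothetical number of stable steps required
    radius: int = 1,        # hypothetical aggregation window
) -> List[int]:
    """Run one illustrative decoding step and return the indices finalized.

    Mutates `tokens`, `prev_preds`, and `streak` in place.
    """
    smoothed = local_mean(conf, radius)
    finalized = []
    for i, tok in enumerate(tokens):
        if tok is not MASK:
            continue  # already finalized, never remasked in this sketch
        # Temporal rule: count consecutive steps with an unchanged prediction.
        streak[i] = streak[i] + 1 if preds[i] == prev_preds[i] else 1
        prev_preds[i] = preds[i]
        # Spatial rule: accept when the locally aggregated confidence is high.
        if smoothed[i] >= tau or streak[i] >= patience:
            tokens[i] = preds[i]
            finalized.append(i)
    return finalized
```

In this sketch a position is committed either because its neighbourhood is confident as a whole or because its prediction has stopped changing, so it is not remasked and re-predicted in later steps. The way the two rules are combined here (a simple logical OR) is a design choice of the example, not a claim about the paper.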