One-Forcing: Towards Stable One-Step Autoregressive Video Generation
Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh
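AI summary: This paper proposes One-Forcing, a method that addresses stability and quality problems in one-step autoregressive video generation. It adds an auxiliary generative adversarial network (GAN) loss to the distribution matching distillation (DMD) objective to achieve high-quality, efficient one-step video generation. Experiments show that One-Forcing reaches state-of-the-art performance on the VBench benchmark among one-step causal video generation methods, and that stable framewise autoregressive generation is attainable at only one-third of the training cost of the chunkwise model, a setting in which prior methods fail.
Details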
- Comments
- Work in Progress. Project Page: https://aurora-edu.github.io/one-forcing/, Code: https://github.com/Aurora-edu/One-Forcing
Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few-step autoregressive video generation methods, often distilled from a corresponding many-step teacher, default to a 4-step sampling configuration. This still incurs considerable latency during deployment, and quality degrades severely when the number of sampling steps is reduced further, particularly in the one-step setting. Trajectory-style consistency distillation methods often produce videos with weak dynamics, while DMD-based approaches, such as Self-Forcing, tend to yield blurry frames. To address this challenge, we propose One-Forcing, a simple yet effective approach that augments the DMD objective with an auxiliary GAN loss for high-quality and efficient one-step video generation. Experiments on VBench show that One-Forcing achieves a total score of 83.76, establishing state-of-the-art performance among one-step causal video generation methods and remaining competitive with strong many-step approaches. We further demonstrate that one-step framewise autoregressive generation can be achieved stably with merely one-third of the training cost of the chunkwise model, a setting that prior methods have failed to achieve.
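
To make the idea of "DMD objective plus auxiliary GAN loss" concrete, the following is a minimal scalar sketch and not the authors' implementation. It assumes the usual DMD surrogate, whose gradient with respect to a generated sample is the difference between the fake-distribution score and the real-distribution score, and it assumes a non-saturating generator loss for the GAN term; the abstract does not specify either form. The names `dmd_surrogate_loss`, `gan_generator_loss`, `one_step_generator_loss`, and the weight `gan_weight` are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dmd_surrogate_loss(sample, real_score, fake_score):
    # Assumed DMD surrogate: the scores are treated as constants, so the
    # gradient w.r.t. each sample element is (fake_score - real_score) / n.
    n = len(sample)
    return sum(x * (f - r) for x, r, f in zip(sample, real_score, fake_score)) / n


def gan_generator_loss(disc_logit):
    # Assumed non-saturating generator loss: softplus(-D(G(z))).
    return softplus(-disc_logit)


def one_step_generator_loss(sample, real_score, fake_score, disc_logit, gan_weight=0.1):
    # Hypothetical combination: DMD term plus a weighted auxiliary GAN term.
    dmd = dmd_surrogate_loss(sample, real_score, fake_score)
    return dmd + gan_weight * gan_generator_loss(disc_logit)


# Toy usage with made-up numbers (not results from the paper).
loss = one_step_generator_loss(
    sample=[1.0, 2.0],
    real_score=[0.5, 0.5],
    fake_score=[0.0, 1.0],
    disc_logit=0.0,
)
print(loss)
```

In a real training loop the sample, scores, and discriminator logit would be tensors produced by the one-step generator, the two score networks, and a discriminator, with gradients stopped through the score estimates. The sketch only shows how the two terms might be combined into a single generator loss.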