Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He
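AI summary: This paper proposes Anchored Tree Sampling, a method that addresses drift in long-horizon video generation by reducing the number of critical-path steps, and achieves stable, high-quality video generation in the static-camera regime.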
Comments: 30 pages, 23 figures
Long-horizon video generation suffers from two intertwined problems. The first is drift: video quality degrades over time. The second is continuity: object permanence fails, or transient content is rendered inconsistently (e.g., an object that appears in non-consecutive frames changes color or style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free, inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE across five conditioning modalities (inpainting, outpainting, edge, pose, depth), and show that ATS outperforms both in overall quality as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path to extend ATS to arbitrarily long T2V generation, as well as to the dynamic-camera and multi-shot regimes.
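The scheduling idea in the abstract (a root call over the full horizon, recursive refinement, then leaf spans between neighboring anchors) can be illustrated with a small planning sketch. This is not the authors' implementation: the names `Call`, `plan_tree`, `critical_path`, and `rollout_steps`, the even anchor spacing, the fixed per-call frame budget `window`, and the one-frame overlap assumed for the sequential baseline are all hypothetical choices made for illustration. The sketch only plans which spans would be generated at which tree level; it does not call any video model.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Call:
    level: int  # 0 = root; calls on the same level do not depend on each other
    start: int  # first frame index covered (inclusive)
    end: int    # last frame index covered (inclusive)
    kind: str   # "anchors" (sparse frames over the span) or "leaf" (dense span)


def plan_tree(num_frames: int, window: int) -> list[Call]:
    """Plan sparse-to-dense calls, assuming each call emits at most `window` frames.

    Hypothetical scheme: a span longer than `window` gets `window` evenly
    spaced anchors (both endpoints included), and every gap between
    neighboring anchors is refined on the next level. A span that fits in
    one window is synthesized densely as a leaf.
    """
    if window < 3 or num_frames < 2:
        raise ValueError("need window >= 3 and num_frames >= 2")
    calls: list[Call] = []
    frontier = [(0, num_frames - 1)]
    level = 0
    while frontier:
        next_frontier = []
        for start, end in frontier:
            if end - start + 1 <= window:
                calls.append(Call(level, start, end, "leaf"))
                continue
            calls.append(Call(level, start, end, "anchors"))
            step = (end - start) / (window - 1)
            anchors = [start + round(i * step) for i in range(window)]
            # Each pair of neighboring anchors bounds one child span.
            next_frontier.extend(zip(anchors, anchors[1:]))
        frontier = next_frontier
        level += 1
    return calls


def critical_path(calls: list[Call]) -> int:
    """Number of strictly sequential stages (tree levels) in the plan."""
    return max(call.level for call in calls) + 1


def rollout_steps(num_frames: int, window: int) -> int:
    """Sequential left-to-right steps, assuming consecutive windows share one frame."""
    return ceil((num_frames - 1) / (window - 1))
```

Under these assumptions, a 513-frame horizon with a 9-frame window needs 64 sequential rollout steps, while `critical_path(plan_tree(513, 9))` is 3: one root call, one refinement level, and one leaf level. Because every leaf is bounded by anchors on both sides, any error inside a leaf is confined to that span rather than carried forward across the horizon, which is the intuition behind anchor-bounded drift.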