Teaching Video Generators to Remember: Eliciting Dynamic Memory for Out-of-Sight State Evolution
Tianshuo Xu, Yichen Xie, Depu Meng, Chensheng Peng, Quentin Herau, Bo Jiang, Yihan Hu, Wei Zhan
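AI summary: To address the problem that video generation models freeze state when observation is interrupted, the authors propose the ReMind framework. It combines memory-oriented data construction, event-aware training, and cache adaptation to turn the KV-cache mechanism into dynamic memory, and achieves the best results on STEVO-Bench and recovery tasks.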
Video world models should maintain evolving states when evidence is unobserved, yet current generators often freeze hidden states upon interruption. This is not simply a capacity problem: pretrained video diffusion transformers already possess KV-cache mechanisms capable of non-local retrieval, but they are rarely trained to use them as dynamic memory. We introduce ReMind, a framework eliciting dynamic memory behavior via memory-oriented data, event-aware training, and cache adaptation. Organized around a taxonomy of 100+ dynamic events, we build a camera-annotated training mixture combining VLM-filtered real videos, generated hard dynamics, synthetic camera loops, and memory-interruption augmentations. Each clip is converted into a frame graph with protected anchors, degraded intervals, and explicit temporal gaps. A node-structured curriculum, including node-drop, noisy memory, frontier continuation, and reference-cache training, forces the model to retrieve relevant past states across interruptions rather than relying solely on local continuity. PM-RoPE, an elegant camera-phase RoPE extension, unlocks spatiotemporal retrieval at a single-attention cost while preserving pretrained pathways. ReMind achieves the best overall scores on STEVO-Bench and recovery tasks. Furthermore, general image-to-video evaluations confirm this curriculum avoids catastrophic forgetting. We will open-source our code, data, and models.
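To make the frame-graph idea concrete, here is a minimal sketch of a clip represented as nodes with protected anchors, a degraded interval, and explicit temporal gaps after node-drop. It is an illustration under stated assumptions, not the authors' implementation: the names `FrameNode`, `build_frame_graph`, `node_drop`, and `temporal_gaps`, the fixed anchor spacing, and the uniform drop probability are all hypothetical, since the abstract does not specify how anchors are chosen or how nodes are sampled.

```python
import random
from dataclasses import dataclass


@dataclass
class FrameNode:
    """One frame of a clip (hypothetical structure, not from the paper)."""
    index: int               # position of the frame in the original clip
    protected: bool = False  # anchor frames are never dropped
    degraded: bool = False   # frame lies inside a degraded interval


def build_frame_graph(num_frames, anchor_every, degraded_span):
    """Build a linear list of frame nodes.

    Assumption: anchors are placed at a fixed stride and the degraded
    interval is a single half-open range [start, end).
    """
    start, end = degraded_span
    return [
        FrameNode(
            index=i,
            protected=(i % anchor_every == 0),
            degraded=(start <= i < end),
        )
        for i in range(num_frames)
    ]


def node_drop(nodes, drop_prob, rng):
    """Randomly drop unprotected nodes; protected anchors always survive."""
    return [n for n in nodes if n.protected or rng.random() >= drop_prob]


def temporal_gaps(nodes):
    """Gap, in original frame indices, between consecutive surviving nodes."""
    return [b.index - a.index for a, b in zip(nodes, nodes[1:])]


if __name__ == "__main__":
    graph = build_frame_graph(num_frames=12, anchor_every=4, degraded_span=(5, 8))
    kept = node_drop(graph, drop_prob=0.5, rng=random.Random(0))
    print([n.index for n in kept])
    print(temporal_gaps(kept))
```

In this sketch, a gap larger than 1 marks an interruption that a model trained this way would have to bridge by retrieving earlier state rather than by extending the immediately preceding frame.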