StreamingEffect: Real-Time Human-Centric Video Effect Generation
Yiren Song, Cheng Liu, Yuxin Jiang, Mike Zheng Shou
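AI summary: This paper presents StreamingEffect, a framework for real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. It also introduces VideoEffect-130K, which the authors describe as, to their knowledge, the largest human-centric video effect dataset, and reports real-time, high-quality 720p video editing on a single H200 GPU.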
Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.
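The control flow described in the abstract (causal chunk-by-chunk editing, 4 sampling steps per chunk, and a keyframe injected online and carried forward) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, `stream_edit`, the chunk representation as a list of floats, and the idea of a keyframe as a single additive offset are all hypothetical stand-ins chosen only to make the loop runnable.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Chunk = List[float]  # toy stand-in for a chunk of frame latents (assumption)

NUM_STEPS = 4  # few-step student; the abstract reports a 50-step teacher


def toy_denoiser(x: Chunk, source: Chunk, past: List[Chunk],
                 keyframe: Optional[float], step: int) -> Chunk:
    """Hypothetical stand-in for the distilled student network.

    Moves x linearly toward `source + keyframe`, reaching it on the last
    step. `past` is unused here; it is passed only to show the causal
    interface, since it holds earlier chunks and never future ones.
    """
    offset = keyframe if keyframe is not None else 0.0
    remaining = NUM_STEPS - step
    return [xi + ((si + offset) - xi) / remaining
            for xi, si in zip(x, source)]


def stream_edit(chunks: Iterable[Chunk],
                keyframes: Dict[int, float],
                denoiser: Callable[..., Chunk] = toy_denoiser
                ) -> Iterator[Chunk]:
    """Edit chunks in arrival order, yielding each as soon as it is done."""
    past: List[Chunk] = []            # causal context: edited chunks so far
    keyframe: Optional[float] = None  # latest injected reference effect
    for index, source in enumerate(chunks):
        if index in keyframes:        # online injection at this chunk
            keyframe = keyframes[index]
        x = [0.0] * len(source)       # toy initial state (real models start from noise)
        for step in range(NUM_STEPS):
            x = denoiser(x, source, past, keyframe, step)
        past.append(x)
        yield x


if __name__ == "__main__":
    video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Inject a reference effect at chunk 1; it stays active for chunk 2.
    for edited in stream_edit(video, {1: 10.0}):
        print(edited)
```

Because each chunk is emitted after its own 4 steps and depends only on earlier chunks, output can begin before the full video is available. Going from 50 to 4 steps means 12.5 times fewer denoiser calls per chunk, assuming each step costs the same.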