Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou
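AI summary: This work proposes the Soap2Soap framework, which remakes long cinematic videos through multi-agent collaboration and addresses long-term consistency and narrative fidelity in video-to-video generation.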
We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.
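The closed-loop verification step can be pictured with a small sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the `Shot` record, the three audit axes named after the abstract's wording, the 0.8 pass threshold, the retry cap, and the `toy_regenerate` stand-in for a real video generator. The only point is the control flow, in which shots that pass the audit are left untouched and only failing shots are regenerated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical audit axes, named after the abstract's wording.
AXES = ("identity", "stability", "alignment")


@dataclass
class Shot:
    shot_id: str
    scores: Dict[str, float]  # assumed audit scores in [0, 1], one per axis
    attempts: int = 1         # how many times this shot has been generated


def failing_axes(shot: Shot, threshold: float) -> List[str]:
    """Return the audit axes on which the shot scores below the threshold."""
    return [axis for axis in AXES if shot.scores.get(axis, 0.0) < threshold]


def verify_and_regenerate(
    shots: List[Shot],
    regenerate: Callable[[Shot, List[str]], Dict[str, float]],
    threshold: float = 0.8,   # assumed pass mark, not from the paper
    max_attempts: int = 3,    # assumed retry cap, not from the paper
) -> List[Shot]:
    """Regenerate only the shots that fail the audit; keep passing shots as-is."""
    result = []
    for shot in shots:
        while failing_axes(shot, threshold) and shot.attempts < max_attempts:
            new_scores = regenerate(shot, failing_axes(shot, threshold))
            shot = Shot(shot.shot_id, new_scores, shot.attempts + 1)
        result.append(shot)
    return result


def toy_regenerate(shot: Shot, axes: List[str]) -> Dict[str, float]:
    """Stand-in for a real generator: raise each failing score by 0.15."""
    scores = dict(shot.scores)
    for axis in axes:
        scores[axis] = min(1.0, scores[axis] + 0.15)
    return scores


if __name__ == "__main__":
    shots = [
        Shot("scene1_shot1", {"identity": 0.9, "stability": 0.9, "alignment": 0.9}),
        Shot("scene1_shot2", {"identity": 0.6, "stability": 0.9, "alignment": 0.85}),
    ]
    for shot in verify_and_regenerate(shots, toy_regenerate):
        print(shot.shot_id, shot.attempts, failing_axes(shot, 0.8))
```

In this toy run the first shot passes on its first attempt and is never regenerated, while the second is retried until its identity score clears the threshold or the retry cap is reached. A real system would replace the numeric scores with the verification agent's judgments and `toy_regenerate` with an actual call to the synthesis pipeline.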