Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu
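AI summary: Proposes the Lumos-Nexus framework, which uses a two-stage design and progressive frequency bridging to significantly improve video generation fidelity while preserving reasoning capability.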
- Comments
- Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
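The abstract does not spell out how UPFB operates, so the following is only a toy sketch of one way a coarse-to-fine handoff in a shared latent space could look, not the authors' method. It assumes (hypothetically) that both generators produce latents of the same shape, that "frequency" means spatial-temporal Fourier frequency of the latent, and that the handoff is a hard low-pass split whose cutoff shrinks over steps. The names `low_pass_mask`, `blend_by_frequency`, and `cutoff_schedule` are illustrative and do not come from the Lumos-Nexus code base.

```python
import numpy as np


def low_pass_mask(shape, cutoff):
    """Boolean mask that is True where the normalized frequency radius <= cutoff."""
    # np.fft.fftfreq yields frequencies in [-0.5, 0.5) for each axis.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def blend_by_frequency(light_latent, heavy_latent, cutoff):
    """Take frequencies at or below `cutoff` from the light latent, the rest from the heavy one.

    Assumption: both latents live in the same latent space and have equal shape.
    """
    mask = low_pass_mask(light_latent.shape, cutoff)
    light_f = np.fft.fftn(light_latent)
    heavy_f = np.fft.fftn(heavy_latent)
    mixed_f = np.where(mask, light_f, heavy_f)
    # The mask is symmetric under frequency negation, so the result is real
    # up to floating-point error; .real drops the negligible imaginary part.
    return np.fft.ifftn(mixed_f).real


def cutoff_schedule(num_steps, start=0.5, end=0.0):
    """Hypothetical linear schedule: the light generator's band shrinks each step."""
    return np.linspace(start, end, num_steps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for (frames, height, width) latents from the two generators.
    light = rng.standard_normal((4, 8, 8))
    heavy = rng.standard_normal((4, 8, 8))
    for step, cutoff in enumerate(cutoff_schedule(5)):
        mixed = blend_by_frequency(light, heavy, cutoff)
        print(step, round(float(cutoff), 3), mixed.shape)
```

In this sketch a large cutoff keeps more of the lightweight generator's latent, and lowering the cutoff lets the high-capacity generator's latent fill in progressively more of the spectrum. A real sampler would interleave such a blend with denoising steps, which is omitted here.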