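arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI large models

Video large models

Video understanding, video generation, video-language models, and temporal visual reasoning.

Indexed for today/current date: 8. Signal sources: cs.CV, eess.IV, cs.MM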
2606.18702 2026-06-18 cs.CV New submission 95%

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

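Affiliations*: University of Wisconsin Madison; Adobe Research; University of California Los Angeles; University of California Davis

Topic match: Video generation (a method for generating video in any temporal order)

AI summary: Proposes UniTemp, a framework that uses bidirectional distillation to train a single autoregressive model supporting video generation in any temporal direction (forward, backward, and inbetween). It addresses the discontinuities that the causal 3D VAE causes in backward generation and improves controllability.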

Abstract

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18478 2026-06-18 cs.CV New submission 95%

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

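Affiliations*: University of Michigan; NVIDIA; University of Illinois Urbana-Champaign

Topic match: Video generation (distillation methods for few-step video generation)

AI summary: To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, proposes the Data-Forcing Distillation (DFD) framework, which uses the teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with only a one-line code change.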

Abstract

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.
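As a rough illustration of why a reverse-KL objective can lose diversity, the sketch below compares reverse and forward KL on a toy 1D problem. It is not the DFD method and not DMD training; the bimodal target `p`, the two candidate students `q_single` and `q_broad`, and all function names are assumptions chosen purely for illustration.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kl(a, b, dx):
    """Riemann-sum estimate of KL(a || b) on a shared grid."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai > 0.0:
            total += ai * (math.log(ai) - math.log(bi)) * dx
    return total

# Uniform grid on [-10, 10].
dx = 0.005
xs = [-10.0 + i * dx for i in range(4001)]

# Toy "real data" density with two equally weighted modes at -3 and +3.
p = [0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5) for x in xs]

# Two hypothetical students: one sits on a single mode, one spreads over both.
q_single = [normal_pdf(x, 3.0, 0.5) for x in xs]
q_broad = [normal_pdf(x, 0.0, 3.0) for x in xs]

# Reverse KL is KL(student || data); forward KL is KL(data || student).
reverse_single = kl(q_single, p, dx)
reverse_broad = kl(q_broad, p, dx)
forward_single = kl(p, q_single, dx)
forward_broad = kl(p, q_broad, dx)

print("reverse KL:", round(reverse_single, 3), "(single mode) vs", round(reverse_broad, 3), "(broad)")
print("forward KL:", round(forward_single, 3), "(single mode) vs", round(forward_broad, 3), "(broad)")
```

On this toy problem the reverse KL is lower for the student that sits on a single mode, while the forward KL is lower for the student that covers both modes. This matches the mode-seeking behavior that the abstract attributes to the reverse KL objective.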

2606.18591 2026-06-18 cs.CV New submission 90%

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

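Affiliations*: University of California, Davis; The Harker School; Basis Independent Silicon Valley; Saratoga High

Topic match: Video generation (the CHIEF framework for creator-driven recurrent video generation)

AI summary: Proposes the CHIEF framework, which improves the narrative coherence and creative direction of long videos through iterative human-AI collaborative video refinement that combines creator direction with subjective feedback from agents.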

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

Abstract

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.

2606.13768 2026-06-18 cs.CV cs.AI New submission 90%

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

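Affiliations*: Snap Inc.; UC Merced

Topic match: Video generation (video generation with unified control of subjects, events, cameras, and shot transitions)

AI summary: Proposes CineOrchestra, a video diffusion model with unified control over subjects, events, cameras, and shot transitions. It achieves joint multi-axis control through entity-centric conditioning primitives and parameter-free rotary position embeddings, and outperforms six specialized methods on dense caption following and shot-transition timing.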

Comments Project page: https://snap-research.github.io/CineOrchestra

Abstract

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra

2606.19271 2026-06-18 cs.DC New submission 85%

TurboServe: Serving Streaming Video Generation Efficiently and Economically

Youhe Jiang, Haoxu Wang, Haotong Bao, Kai Jiang, Jianfei Chen, Jun Zhu, Fangcheng Fu, Jintao Zhang

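Topic match: Video generation (TurboServe, a serving system for streaming video generation)

AI summary: To handle the heterogeneity of session durations and user demand in streaming video generation, proposes the TurboServe system, which uses online scheduling to jointly optimize session placement and GPU provisioning, with migration-aware placement and load-driven autoscaling that reduce latency and cost.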

Abstract

Streaming video generation is emerging as a new serving workload in which users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation or typical LLM serving, streaming video generation must preserve session state across active and idle periods, repeatedly schedule ongoing sessions, and deliver each chunk under a tight latency target. This creates two key serving challenges in multi-user, multi-GPU environments: session duration heterogeneity, where long-running sessions make placement decisions suboptimal over time, and temporal user-demand heterogeneity, where the number of active sessions fluctuates sharply across bursts and idle periods. We present TurboServe, the first serving system designed specifically for streaming video generation workloads. TurboServe formulates serving as an online scheduling problem that jointly coordinates session placement and GPU provisioning. Its closed-loop scheduling algorithm combines a migration-aware placement controller, which rebalances sessions across GPUs to reduce the maximum per-chunk latency, with a load-driven autoscaling controller, which adapts the GPU budget to workload variation for improved cost efficiency. To support these decisions at runtime, TurboServe implements coalesced chunk processing for batching concurrent active sessions on the same GPU, GPU-CPU offloading for session suspension and resumption, and NCCL-based GPU-GPU migration for online rebalancing. We evaluate TurboServe on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters with up to 64 NVIDIA B300 GPUs. Compared with baseline serving configurations, TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average. Our code is publicly available at https://github.com/shengshu-ai/TurboServe.
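To make the idea of rebalancing sessions to lower the maximum per-chunk latency concrete, here is a minimal greedy sketch. It is not TurboServe's scheduling algorithm: the latency model (a GPU's per-chunk latency is taken to be the sum of the costs of its active sessions), the migration budget `max_migrations`, and the names `rebalance` and `max_load` are all assumptions made for illustration.

```python
def max_load(placement, cost):
    """Largest summed session cost on any GPU (a toy stand-in for worst per-chunk latency)."""
    return max(sum(cost[s] for s in sessions) for sessions in placement.values())

def rebalance(placement, cost, max_migrations):
    """Greedily move sessions from the most loaded GPU to the least loaded one.

    placement: dict mapping GPU name -> list of session ids
    cost: dict mapping session id -> assumed per-chunk cost in milliseconds
    max_migrations: cap on the number of moves, since each migration has a price
    """
    placement = {gpu: list(sessions) for gpu, sessions in placement.items()}
    moves = []
    for _ in range(max_migrations):
        load = {gpu: sum(cost[s] for s in sessions) for gpu, sessions in placement.items()}
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # A move helps only if the session cost is strictly between 0 and the gap;
        # otherwise the destination would end up at least as loaded as the source was.
        candidates = [s for s in placement[src] if 0 < cost[s] < gap]
        if not candidates:
            break
        # Prefer the session that best evens out the two GPUs.
        best = min(candidates, key=lambda s: abs(cost[s] - gap / 2))
        placement[src].remove(best)
        placement[dst].append(best)
        moves.append((best, src, dst))
    return placement, moves

# Hypothetical cluster state: three GPUs, four active sessions.
cost = {"a": 40, "b": 30, "c": 20, "d": 10}
placement = {"gpu0": ["a", "b", "c"], "gpu1": ["d"], "gpu2": []}

new_placement, moves = rebalance(placement, cost, max_migrations=3)
print(moves)
print(max_load(placement, cost), "->", max_load(new_placement, cost))
```

In this toy run the worst GPU load drops from 90 to 40 after two migrations, and the loop stops early because no further move would help. A real system would also have to weigh the transfer cost of each migration against the latency it saves, which this sketch reduces to a simple move budget.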

2606.17030 2026-06-18 cs.CV New submission 80%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

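Affiliations*: Qwen Team

Topic match: Video generation (a video world model that generates future visual trajectories)

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories in tasks such as robotic manipulation and autonomous driving, and achieves the best results on multiple benchmarks.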

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.13376 2026-06-18 cs.CV New submission 80%

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

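Affiliations*: South China University of Technology; Columbia University; Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group; Singapore Management University

Topic match: Video generation (real-time video world modeling and rendering)

AI summary: Proposes MoVerse, which builds an interactively roamable 360-degree panoramic world in real time from a single narrow-field-of-view image. It completes the field of view with topology-aware diffusion, generates a 3D Gaussian scaffold through panoramic geometry residual prediction, and distills a bidirectional diffusion teacher into a causal autoregressive student for low-latency video rendering.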

Comments Project Page: https://orange-3dv-team.github.io/MoVerse/

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2606.19163 2026-06-18 cs.DC New submission 60%

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Boran Sun, Guoyong Jiang, Lin Zhang, Chen Chen, Yuechen Tao, Zhishu Che, Jieling Yu, Shan Chang, Huaxi Gu, Fangming Liu, Bo Li

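Topic match: Video generation (the method applies to accelerating the training of video generation models)

AI summary: Proposes PULSE, an automatic pipeline-parallel strategy that eliminates skip-induced cross-pipeline communication by placing skip-connected layers on the same device and caching activations locally. Combined with a dynamic-programming partitioner, an ILP-based schedule synthesizer, and a hybrid parallelism tuner, it achieves up to 2.3x higher throughput on communication-bound hardware.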

Comments Accepted by the International Conference on Distributed Computing Systems (ICDCS'26)

Abstract

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
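The collocation idea in the abstract can be sketched on a toy UNet. The code below is not PULSE's partitioner (which the abstract describes as a skip-aware dynamic program over heterogeneous stages); it assumes a symmetric UNet in which encoder block `i` has a skip connection to decoder block `num_levels - 1 - i`, uses an even split, and the function names are hypothetical.

```python
def symmetric_partition(num_levels, num_devices):
    """Place each encoder block and its skip-connected decoder block on one device."""
    if not 1 <= num_devices <= num_levels:
        raise ValueError("need 1 <= num_devices <= num_levels")
    device_of = {}
    for i in range(num_levels):
        dev = i * num_devices // num_levels
        device_of[("enc", i)] = dev
        # Assumed skip pattern: enc i feeds dec (num_levels - 1 - i).
        device_of[("dec", num_levels - 1 - i)] = dev
    return device_of

def naive_partition(num_levels, num_devices):
    """Conventional contiguous split of the execution order enc0..encN-1, dec0..decN-1."""
    order = [("enc", i) for i in range(num_levels)] + [("dec", i) for i in range(num_levels)]
    return {layer: idx * num_devices // len(order) for idx, layer in enumerate(order)}

def cross_device_skips(device_of, num_levels):
    """Count skip connections whose two endpoints live on different devices."""
    return sum(
        device_of[("enc", i)] != device_of[("dec", num_levels - 1 - i)]
        for i in range(num_levels)
    )

levels, devices = 4, 4
naive = naive_partition(levels, devices)
symmetric = symmetric_partition(levels, devices)
print("naive cross-device skips:", cross_device_skips(naive, levels))
print("symmetric cross-device skips:", cross_device_skips(symmetric, levels))
```

With four levels on four devices, the contiguous split sends every skip activation across devices, while the symmetric placement keeps all of them local. The price is that execution visits the devices in order and then in reverse, which is presumably why the abstract pairs this placement with dedicated wave schedules.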