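arXivDaily arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

Video Large Models

Video understanding, video generation, video-language models, and temporal visual reasoning.

Indexed for today / the current date: 15 papers. Signal sources: cs.CV, eess.IV, cs.MM

1. Video Generation (8 papers)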

2606.18702 2026-06-18 cs.CV New submission Topic 95

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

Affiliations: University of Wisconsin Madison; Adobe Research; University of California Los Angeles; University of California Davis

Topic match (Video generation): a method for generating video in any temporal order

AI summary: Proposes UniTemp, a framework that uses bidirectional distillation to train a single autoregressive model supporting video generation in any temporal direction (forward, backward, and in-between). It addresses the discontinuities that the causal 3D VAE introduces during backward generation and improves controllability.

Abstract

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18478 2026-06-18 cs.CV New submission Topic 95

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

Affiliations: University of Michigan; NVIDIA; University of Illinois Urbana-Champaign

Topic match (Video generation): distillation methods for few-step video generation

AI summary: To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, proposes the Data-Forcing Distillation (DFD) framework, which uses the teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with a single-line code change.

Abstract

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is a teacher score discrepancy that guides the student toward the real-data distribution, pulling it toward missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts, yielding significantly better video dynamics and appearance, and even outperforming the teacher model.
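The link between a reverse KL objective and reduced diversity can be illustrated on a toy discrete distribution. The sketch below is not DFD or DMD; it only compares forward and reverse KL for a hypothetical two-mode data distribution `p` and two hypothetical students, one collapsed onto a single mode and one spread over all bins. All probabilities are made up for illustration.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists of probabilities."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Hypothetical "real data" distribution with two equally likely modes (bins 0 and 3).
p = [0.49, 0.01, 0.01, 0.49]
# Hypothetical student that collapsed onto the first mode only.
q_collapsed = [0.97, 0.01, 0.01, 0.01]
# Hypothetical student that spreads its mass evenly and so covers both modes.
q_spread = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(q || p): the student is only penalized where it puts mass.
reverse_collapsed = kl(q_collapsed, p)
reverse_spread = kl(q_spread, p)
# Forward KL, KL(p || q): the student is penalized for missing data modes.
forward_collapsed = kl(p, q_collapsed)
forward_spread = kl(p, q_spread)

print(f"reverse KL: collapsed={reverse_collapsed:.3f} spread={reverse_spread:.3f}")
print(f"forward KL: collapsed={forward_collapsed:.3f} spread={forward_spread:.3f}")
```

In this toy setting the reverse KL scores the collapsed student as the better fit, while the forward KL prefers the student that covers both modes, which is the mode-seeking behavior the abstract refers to.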

2606.18591 2026-06-18 cs.CV New submission Topic 90

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

Affiliations: University of California, Davis; The Harker School; Basis Independent Silicon Valley; Saratoga High

Topic match (Video generation): the CHIEF framework for creator-driven recurrent video generation

AI summary: Proposes CHIEF, a framework for iterative video refinement through human-AI collaboration that combines creator-driven direction with subjective feedback from agents to improve the narrative coherence and creative direction of long videos.

Comments: Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

Abstract

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them by providing automatic subjective feedback. The creator sets the creative direction by driving each iteration, while their revisions are integrated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.

2606.13768 2026-06-18 cs.CV cs.AI New submission Topic 90

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

Affiliations: Snap Inc.; UC Merced

Topic match (Video generation): video generation with unified control over subjects, events, cameras, and shot transitions

AI summary: Proposes CineOrchestra, a video diffusion model with unified control over subjects, events, cameras, and shot transitions. It achieves joint multi-axis control through entity-centric conditioning primitives and parameter-free rotary position embeddings, and outperforms six specialized methods on dense caption following and shot-transition timing.

Comments: Project page: https://snap-research.github.io/CineOrchestra

Abstract

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, so all of them can be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra

2606.19271 2026-06-18 cs.DC New submission Topic 85

TurboServe: Serving Streaming Video Generation Efficiently and Economically

Youhe Jiang, Haoxu Wang, Haotong Bao, Kai Jiang, Jianfei Chen, Jun Zhu, Fangcheng Fu, Jintao Zhang

Topic match (Video generation): TurboServe, a serving system for streaming video generation

AI summary: To handle heterogeneity in session duration and user demand in streaming video generation, proposes the TurboServe system, which jointly optimizes session placement and GPU provisioning through online scheduling, using migration-aware placement and load-driven autoscaling to reduce latency and cost.

Abstract

Streaming video generation is emerging as a new serving workload in which users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation or typical LLM serving, streaming video generation must preserve session state across active and idle periods, repeatedly schedule ongoing sessions, and deliver each chunk under a tight latency target. This creates two key serving challenges in multi-user, multi-GPU environments: session duration heterogeneity, where long-running sessions make placement decisions suboptimal over time, and temporal user-demand heterogeneity, where the number of active sessions fluctuates sharply across bursts and idle periods. We present TurboServe, the first serving system designed specifically for streaming video generation workloads. TurboServe formulates serving as an online scheduling problem that jointly coordinates session placement and GPU provisioning. Its closed-loop scheduling algorithm combines a migration-aware placement controller, which rebalances sessions across GPUs to reduce the maximum per-chunk latency, with a load-driven autoscaling controller, which adapts the GPU budget to workload variation for improved cost efficiency. To support these decisions at runtime, TurboServe implements coalesced chunk processing for batching concurrent active sessions on the same GPU, GPU-CPU offloading for session suspension and resumption, and NCCL-based GPU-GPU migration for online rebalancing. We evaluate TurboServe on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters with up to 64 NVIDIA B300 GPUs. Compared with baseline serving configurations, TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average. Our code is publicly available at https://github.com/shengshu-ai/TurboServe.
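The idea of rebalancing sessions across GPUs to lower the maximum per-chunk latency can be sketched with a small greedy heuristic. This is not TurboServe's scheduling algorithm; the function `rebalance`, the cost units, and the assumption that a GPU's per-chunk latency equals the sum of its session costs are all hypothetical simplifications.

```python
def rebalance(placement, max_migrations):
    """Greedy toy rebalancer (illustrative only).

    placement: dict mapping GPU id -> list of per-session costs, in
    hypothetical units; a GPU's load is assumed to be the sum of its costs.
    Moves at most `max_migrations` sessions from the most loaded GPU to the
    least loaded one, and only when the move leaves both GPUs below the
    previous hottest load.
    """
    placement = {gpu: list(costs) for gpu, costs in placement.items()}
    for _ in range(max_migrations):
        load = {gpu: sum(costs) for gpu, costs in placement.items()}
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        gap = load[hot] - load[cold]
        # A session of cost c helps only if 0 < c < gap.
        candidates = [c for c in placement[hot] if 0 < c < gap]
        if not candidates:
            break
        # Pick the session whose move best equalizes the two GPUs.
        best = min(candidates, key=lambda c: abs(gap - 2 * c))
        placement[hot].remove(best)
        placement[cold].append(best)
    return placement

before = {"gpu0": [4, 3, 3], "gpu1": [1], "gpu2": [2, 2]}
after = rebalance(before, max_migrations=2)
max_before = max(sum(c) for c in before.values())
max_after = max(sum(c) for c in after.values())
print(max_before, max_after)
```

Capping the number of migrations per round is one simple way to reflect that moving session state between GPUs has a cost of its own; a real system would also weigh that cost against the latency gain.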

2606.17030 2026-06-18 cs.CV New submission Topic 80

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

Affiliations: Qwen Team

Topic match (Video generation): a video world model that generates future visual trajectories

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and progressive curriculum training, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving and achieves the best results on multiple benchmarks.

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks first overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.13376 2026-06-18 cs.CV New submission Topic 80

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

Affiliations: South China University of Technology; Columbia University; Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group; Singapore Management University

Topic match (Video generation): real-time video world modeling and rendering

AI summary: Proposes MoVerse, which builds an interactively navigable 360-degree panoramic world in real time from a single narrow-field-of-view image. It completes the field of view with topology-aware diffusion, generates a 3D Gaussian scaffold through panoramic geometry residual prediction, and distills a bidirectional diffusion teacher into a causal autoregressive student for low-latency video rendering.

Comments: Project Page: https://orange-3dv-team.github.io/MoVerse/

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2606.19163 2026-06-18 cs.DC New submission Topic 60

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Boran Sun, Guoyong Jiang, Lin Zhang, Chen Chen, Yuechen Tao, Zhishu Che, Jieling Yu, Shan Chang, Huaxi Gu, Fangming Liu, Bo Li

Topic match (Video generation): the method applies to accelerating the training of video generation models

AI summary: Proposes PULSE, an automatic pipeline-parallel strategy that eliminates cross-pipeline skip communication by placing skip-connected layers on the same device and caching activations locally. Combined with a dynamic-programming partitioner, an ILP schedule synthesizer, and a hybrid parallelism tuner, it achieves up to 2.3x higher throughput on communication-bound hardware.

Comments: Accepted by International Conference on Distributed Computing Systems (ICDCS'26)

Abstract

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
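The collocation idea can be sketched for a symmetric UNet-style stack in which layer `i` is skip-connected to layer `num_layers - 1 - i`. This is only an illustration, not PULSE's partitioner: the function names are hypothetical, per-layer cost is assumed uniform, and the abstract's dynamic-programming balancing and ILP scheduling are not modeled.

```python
def symmetric_partition(num_layers, num_devices):
    """Map each layer to a device so that layer i and its skip partner
    (num_layers - 1 - i) land on the same device.

    Assumes an even number of layers and uniform per-layer cost.
    """
    if num_layers % 2 != 0:
        raise ValueError("expected an even number of layers (encoder/decoder pairs)")
    half = num_layers // 2
    device_of = [0] * num_layers
    for i in range(half):
        # Spread the encoder half evenly over the devices, in order.
        device = i * num_devices // half
        device_of[i] = device
        device_of[num_layers - 1 - i] = device  # mirrored decoder layer
    return device_of

def skip_crossings(device_of):
    """Count skip connections whose two endpoints sit on different devices."""
    n = len(device_of)
    return sum(device_of[i] != device_of[n - 1 - i] for i in range(n // 2))

mirrored = symmetric_partition(num_layers=8, num_devices=2)
# Conventional contiguous split for comparison: first half on device 0, rest on device 1.
contiguous = [0, 0, 0, 0, 1, 1, 1, 1]
print(mirrored, skip_crossings(mirrored), skip_crossings(contiguous))
```

With the mirrored mapping, activations flow from device 0 to device 1 and back to device 0, so each device holds both ends of its skip connections and can keep the skip activations in local memory.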

2. Video Understanding (7 papers)

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission Topic 90

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

Topic match (Video understanding): long video understanding with a POMDP-based active perception framework

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle. Active perception decouples reasoning complexity from video duration, and the agent achieves the best performance among open-source models on multiple benchmarks.

Comments: Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2606.18943 2026-06-18 cs.CV New submission Topic 85

Physics-IQ Verified

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

Affiliations: Anates Labs; Technical University of Munich; University of Technology Nuremberg; Tuebingen AI Center, University of Tuebingen; Helmholtz AI, Munich; Google DeepMind research

Topic match (Video understanding): evaluating how well video generative models understand physical reality

AI summary: Proposes the Physics-IQ Verified benchmark, which improves the evaluation of video generative models' understanding of physical reality by raising prompt and ground-truth quality and introducing a sample-level scoring system. The resulting benchmark refines 57.6% of samples and improves over 34.8% of prompts.

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
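For readers unfamiliar with the rank-agreement statistic quoted above, the sketch below computes Kendall's tau between two rankings. It assumes the tau-a variant with no ties, which may differ from the variant used in the paper, and the six model names and both rankings are hypothetical, not results from the benchmark.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties assumed).

    rank_a, rank_b: dicts mapping item name -> rank position.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1   # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on this pair
    pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical rankings of six models under an old and a revised benchmark.
old_rank = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
new_rank = {"A": 2, "B": 1, "C": 3, "D": 5, "E": 4, "F": 6}
tau = kendall_tau(old_rank, new_rank)
print(round(tau, 3))
```

A value of 1 means the two rankings agree on every pair and -1 means they are fully reversed, so a value near 0.46 indicates that a noticeable share of model pairs swapped order.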

2606.18586 2026-06-18 cs.CV cs.AI New submission Topic 85

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

Affiliations: Northwestern University; Dolby Laboratories

Topic match (Video understanding): APTs represent causal state changes in video to improve VLM understanding

AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and uses the APT-Tune fine-tuning method so that VLMs learn physical transitions without forgetting event-level knowledge.

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.18441 2026-06-18 cs.CV New submission Topic 85

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; Beijing University of Posts and Telecommunications; Cloud and AI BU, Huawei; School of Computing, National University of Singapore

Topic match (Video understanding): a reward framework for video reasoning that improves the reasoning ability of Video-MLLMs

AI summary: Proposes CF-GRPO, a process-level reward framework that requires no temporal annotations. It builds a consensus frame prior from intrinsic video cues and uses the Consensus Frame Reward to optimize the agreement between the model's frame use and the prior, improving video reasoning performance.

Abstract

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.14702 2026-06-18 cs.CV New submission Topic 85

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

发表机构 * Nanjing University(南京大学) CASIA(中国科学院自动化研究所)

专题命中 视频理解 :视频问答与长时推理

AI总结 提出OmniVideo-100K数据集,通过实体锚定视频脚本和线索引导的QA生成机制,解决音视频问答中跨段实体不一致和长时推理不足的问题,微调模型在多个基准上取得显著提升。

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

详情
AI中文摘要

当前的音视频问答(QA)自动化流水线通常采用“视频-字幕-QA”范式。然而,这些方法通常将视频分割成短片段,并为音频和视觉模态生成独立的描述。这种解耦处理切断了声音与其视觉来源之间的固有关联,而独立的片段处理常常导致同一实体在不同片段中的描述不一致。此外,将长文本理解和QA合成耦合到单一步骤中,往往将模型限制在局部事件上,生成的问答缺乏长期时间连接和深度跨模态推理。为了解决这些问题,我们提出了一种自动化数据引擎,包含两种机制:(1)**实体锚定视频脚本**将视频转换为结构化脚本,包括摘要、主要实体列表和逐片段的音视频描述。实体列表作为全局先验,确保跨片段引用一致性并重建音视频关联。(2)**线索引导的QA生成**提示模型首先从脚本中挖掘跨片段、多模态线索,然后基于这些高价值线索生成QA对。利用这一流水线,我们构建了指令微调数据集**OmniVideo-100K**和人工验证的测试集**OmniVideo-Test**。在OmniVideo-100K上微调VITA-1.5、Qwen2.5-Omni-7B和Qwen3-Omni-30B,在OmniVideo-Test上获得了高达20.59%的性能提升,并在Daily-Omni和JointAVBench等现有基准上表现出强大的泛化能力(提升高达12.64%)。

English abstract

Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) **Entity-Anchored Video Scripting** transforms videos into structured scripts comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) **Clue-Guided QA Generation** prompts models to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset **OmniVideo-100K** and a human-verified test set, **OmniVideo-Test**. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).
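To make the "structured script" concrete, here is a minimal sketch of how such a script might be represented: a summary, a global entity list, and per-segment descriptions that refer to entities by ID. The class names, field names, and the two helper methods are hypothetical and are not taken from the OmniVideo-100K release; they only illustrate how a shared entity list can enforce consistent references across segments and surface entities that span several segments as candidate cross-segment clues.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Segment:
    start_s: float
    end_s: float
    description: str                 # joint audio-visual description of the segment
    entity_ids: List[str] = field(default_factory=list)

@dataclass
class VideoScript:
    summary: str
    entities: Dict[str, str]         # entity ID -> canonical description (global prior)
    segments: List[Segment] = field(default_factory=list)

    def unknown_references(self) -> List[Tuple[int, str]]:
        """Return (segment index, entity ID) pairs that are not in the entity list."""
        return [
            (i, e)
            for i, seg in enumerate(self.segments)
            for e in seg.entity_ids
            if e not in self.entities
        ]

    def cross_segment_entities(self) -> Dict[str, List[int]]:
        """Entities mentioned in two or more segments, with the segments they appear in."""
        seen: Dict[str, List[int]] = {}
        for i, seg in enumerate(self.segments):
            for e in set(seg.entity_ids):
                seen.setdefault(e, []).append(i)
        return {e: idx for e, idx in seen.items() if len(idx) >= 2}

# Made-up example data.
script = VideoScript(
    summary="A woman walks her dog through a park.",
    entities={"E1": "woman in a red coat", "E2": "small barking dog"},
    segments=[
        Segment(0, 10, "E1 enters; E2 barks off-screen.", ["E1", "E2"]),
        Segment(10, 20, "E1 sits on a bench.", ["E1"]),
    ],
)
print(script.cross_segment_entities())
```

A QA generator could then be prompted only with entities returned by `cross_segment_entities`, which is one simple way to bias questions toward long-range rather than single-clip content.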

2606.15632 2026-06-18 cs.CV New submission Topic 80

Open-World Video Segmentation

开放世界视频分割

Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

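Affiliations * University of Connecticut

Topic match Video understanding: long-horizon video segmentation and object discovery

AI summary Proposes Savvy, a system that combines hierarchical mask discovery, deferred admission, and track consolidation for zero-shot open-world long-horizon video segmentation, and designs OGA, a granularity-aware evaluation suite whose n:1 matching protocol addresses the unfair penalty that conventional 1:1 matching imposes on open-world methods.

Details
AI Chinese abstract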

尽管视频分割在短片段和封闭集基准上取得了快速进展,但开放世界视频分割仍然在很大程度上未被探索。挑战有两方面:(1)现有方法不支持在动态自我运动的长视频中进行对象发现和身份维护;(2)现有评估协议依赖于严格的1:1匹配,不公平地惩罚了具有不匹配粒度的语义有效预测。为了解决这两个问题,我们引入了Savvy,一个实用且强大的零样本开放世界长时视频分割系统。Savvy结合了分层掩码发现、延迟接纳和轨迹整合,以支持持久对象发现、安全轨迹提升和稳定的长距离身份维护。我们进一步提出了OGA,一个用于开放世界视频分割的粒度感知评估套件。基于粒度无关(GA)匹配协议,OGA将传统的1:1匹配放宽为n:1映射,但通过断点检测支持不连续性并通过对每个参考对象的优势连贯片段进行评分来强制执行时间严谨性。这防止了碎片化或闪烁的支持被过度奖励,同时实现了GA适应的指标和结构诊断:身份持久性(IP)和身份集中性(IC)。在VIPSeg上,我们展示了标准的1:1评估严重低估了开放世界方法,而GA评估恢复了许多被抑制的性能。在更现实的长时基准ScanNet和HM3D上,Savvy在经典指标和提出的指标(包括STQ、VPQ$_\infty$、IP和IC)上始终优于强基线。这些结果共同为开放世界长时视频分割建立了一个实用的基准和一个强基线。

English abstract

While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and two structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.
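The difference between 1:1 and n:1 matching can be seen in a small numeric sketch. The code below is not the OGA protocol: it omits sever points, dominant coherent fragments, and the IP/IC diagnostics, and it uses a greedy 1:1 matcher rather than an optimal assignment. It assumes predicted tracks do not overlap each other and that overlaps are given as pixel counts; the function names and the toy numbers are our own.

```python
def n_to_one_mapping(overlap):
    """Map each predicted track to the reference it overlaps most (None if no overlap).

    overlap[p][r] is the overlap (e.g. in pixels) between prediction p and reference r.
    Several predictions may map to the same reference.
    """
    mapping = []
    for row in overlap:
        best = max(range(len(row)), key=lambda r: row[r])
        mapping.append(best if row[best] > 0 else None)
    return mapping

def coverage_n_to_one(overlap, ref_area):
    """Fraction of each reference covered by all predictions mapped to it."""
    covered = [0] * len(ref_area)
    for p, r in enumerate(n_to_one_mapping(overlap)):
        if r is not None:
            covered[r] += overlap[p][r]
    return [c / a for c, a in zip(covered, ref_area)]

def coverage_one_to_one(overlap, ref_area):
    """Greedy 1:1 matching: each prediction and each reference is used at most once."""
    pairs = sorted(
        ((overlap[p][r], p, r) for p in range(len(overlap)) for r in range(len(ref_area))),
        reverse=True,
    )
    used_p, used_r = set(), set()
    covered = [0] * len(ref_area)
    for ov, p, r in pairs:
        if ov > 0 and p not in used_p and r not in used_r:
            used_p.add(p)
            used_r.add(r)
            covered[r] = ov
    return [c / a for c, a in zip(covered, ref_area)]

# Toy case: reference 0 is predicted as two finer-grained parts (predictions 0 and 1).
overlap = [[60, 0], [35, 0], [0, 80]]
ref_area = [100, 100]
print(coverage_one_to_one(overlap, ref_area))  # only one part of reference 0 is credited
print(coverage_n_to_one(overlap, ref_area))    # both parts are credited
```

Under 1:1 matching the second part of the split object is discarded, which is the granularity penalty the abstract describes; the n:1 mapping credits both parts to the same reference.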

2606.18610 2026-06-18 cs.RO cs.CV New submission Topic 60

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

SC3-Eval: 通过自洽视频生成评估机器人基础模型

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

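Affiliations * University of Toronto, Vector Institute, NVIDIA, Physical Intelligence, Stanford University, UC Berkeley, Allen Institute for AI

Topic match Video understanding: simulating policy rollouts with video foundation models

AI summary Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.

Details
AI Chinese abstract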

在真实世界中评估通用机器人操作策略成本高、速度慢且难以扩展。动作条件视频世界模型通过模拟策略 rollout 提供了一种可扩展的替代方案。自回归 rollout 会累积复合误差,多视角观测必须保持相互一致,且评估器必须泛化到行为超出训练分布的策略。我们通过 SC3-Eval 解决这些挑战,这是一种自洽视频生成方案,通过强制三种互补的一致性,将预训练视频基础模型转化为准确的策略评估器。首先,前向-反向动力学一致性联合训练模型从动作预测帧以及从帧恢复动作,将生成的 rollout 锚定在物理上合理的动作流形上,并抵消仅前向模型无法惩罚的漂移。其次,跨视角一致性训练模型从每个相机视角修补其他视角,使多相机观测在长 rollout 中保持连贯,无需任何显式记忆机制。第三,测试时一致性在推理时重用反向动力学模式作为每个动作块的置信度信号,当生成的帧偏离请求的动作时终止 rollout。我们还展示了 SC3-Eval rollout 复现了策略在真实世界 rollout 中表现出的失败模式,支持细粒度的诊断比较而不仅仅是聚合排名。在七个真实世界的视觉-语言-动作策略上,SC3-Eval 达到了闭环皮尔逊相关系数 0.929 和 MMRV 0.119,优于三个强先前的基于视频模型的基线,并泛化到新任务。

English abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts, but they face three challenges: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift that a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
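For readers unfamiliar with the two reported metrics, the sketch below computes them for a set of policies given each policy's real-world success rate and its success rate under a simulated evaluator. Pearson correlation is standard. For MMRV (Mean Maximum Rank Violation) we follow the definition as we understand it from prior simulation-evaluation work: for each policy, take the largest real-world performance gap among pairs that the evaluator ranks in the wrong order, then average over policies. The paper's exact variant may differ, and the success rates below are made-up numbers, not results from SC3-Eval.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    A pair (i, j) is a violation when the evaluator orders the two policies
    differently from the real world; its cost is |real[i] - real[j]|.
    """
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n

# Hypothetical success rates for four policies.
real_success = [0.30, 0.55, 0.70, 0.85]
sim_success = [0.25, 0.60, 0.65, 0.90]
print(pearson(real_success, sim_success))
print(mmrv(real_success, sim_success))
```

Higher Pearson correlation and lower MMRV both indicate that the simulated evaluator tracks real-world performance more closely; MMRV in particular is zero whenever the evaluator preserves the real-world ranking.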