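arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Video Large Models

Video understanding, video generation, video language models, and temporal visual reasoning.

Collected today: 8 papers. Sources: cs.CV, eess.IV, cs.MM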

1. Video Generation (6 papers)

2605.21028 2026-06-18 cs.CV cs.AI Version update Topic 95

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

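Affiliations: School of Computer Science and Engineering, Southeast University; Key Lab. of Computer Network and Information Integration, Southeast University; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Institute of Automation, CAS

Topic match (Video Generation): Proposes the DySink framework for autoregressive long video generation; video generation is the core focus.

AI summary: This paper proposes DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks, improving the dynamic degree and temporal quality of autoregressive long video generation.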

Abstract

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
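
The retrieval idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each cached frame is summarized by a feature vector, that relevance is measured by cosine similarity to the current frame's feature, and that the top-k frames become the dynamic sinks. The names `select_dynamic_sinks`, `memory_bank`, and `k` are hypothetical, and the sink anomaly gate is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_dynamic_sinks(memory_bank, query, k=2):
    """Return indices of the k cached frames most similar to the query.

    memory_bank: list of per-frame feature vectors (hypothetical summaries).
    query: feature vector of the current visual state.
    """
    ranked = sorted(
        range(len(memory_bank)),
        key=lambda i: cosine(memory_bank[i], query),
        reverse=True,
    )
    # Restore temporal order among the selected frames.
    return sorted(ranked[:k])

# Toy history: frame 0 is the early frame, frames 2 and 3 resemble the present.
bank = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.1, 0.9]]
current = [0.0, 1.0]
sinks = select_dynamic_sinks(bank, current, k=2)
print(sinks)
```

In this toy history a static early-frame sink would always keep frame 0, whose similarity to the current state is zero, while similarity-based retrieval picks the two later frames instead.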

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM Version update Topic 95

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

Affiliations: School of Data Science, Fudan University; Shanghai Innovation Institute; Zhejiang University; Huawei Noah’s Ark Lab; Westlake University; School of Data Science and MOE Frontiers Center for Brain Science, Fudan University; Fudan ISTBI–ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University

Topic match (Video Generation): Controllable image-to-video generation with control over camera, objects, and lighting.

AI summary: Proposes the VidCRAFT3 framework, which explicitly models cross-factor interactions among geometry, motion, and illumination to enable independent or joint control over camera motion, object motion, and lighting direction, achieving state-of-the-art control precision and visual coherence.

Comments: Accepted to TVCG 2026

Abstract

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

2606.06361 2026-06-18 cs.CV Version update Topic 90

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

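Affiliations: National Institute of Standards and Technology

Topic match (Video Generation): Improving the physical consistency of image-to-video diffusion models.

AI summary: This paper finds that image-to-video diffusion models show better physical consistency with 2-step generation than with many-step generation, and traces the cause through spectral analysis to phase erosion during denoising. It proposes PhaseLock, a training-free framework that extracts a motion prior from 2-step inference and enforces it onto high-fidelity generation via Latent Delta Guidance, mitigating phase degradation and improving physical consistency by an average of 6.2 points while maintaining visual fidelity with minimal overhead.

Comments: ICML 2026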

Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time). Project Page: https://dnwjddl.github.io/phaselock
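
The abstract separates Fourier phase from magnitude. One general signal-processing fact helps explain why phase matters for motion: translating a pattern changes the phase of its spectrum but leaves the magnitude unchanged. The sketch below demonstrates this on a 1D toy signal with a naive DFT. It is only an illustration of that fact, not the paper's spectral analysis, and the signal values are made up.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def magnitude_and_phase(signal):
    """Split the spectrum into per-bin magnitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

frame = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shift = 2
# Same pattern, circularly moved `shift` samples to the right.
shifted = frame[-shift:] + frame[:-shift]

mag_a, phase_a = magnitude_and_phase(frame)
mag_b, phase_b = magnitude_and_phase(shifted)

# Magnitudes agree up to rounding: the "what" is unchanged.
max_mag_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))

# Phase at bin 1 moves by -2*pi*shift/n, wrapped into (-pi, pi]:
# the "where" is carried by the phase.
phase_shift_bin1 = cmath.phase(cmath.rect(1.0, phase_b[1] - phase_a[1]))
expected_shift = -2 * cmath.pi * shift / len(frame)
print(max_mag_diff, phase_shift_bin1, expected_shift)
```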

2605.15824 2026-06-18 cs.CV Version update Topic 90

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

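Affiliations: Xiamen University; Alibaba Group

Topic match (Video Generation): Proposes a real-time, interactive framework for human-garment video customization.

AI summary: This paper proposes the FashionChameleon framework, which achieves interactive multi-garment video customization using only single-garment video data while preserving motion coherence, and generates in real time at 23.8 FPS, 30-180× faster than existing methods.

Comments: Project Page: https://quanjiansong.github.io/projects/FashionChameleon/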

Abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garment during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.

2510.21615 2026-06-18 cs.CV Version update Topic 90

Epipolar Geometry Improves Video Generation Models

Orest Kupyn, Théo Uscidda, Marta Tintore Gazulla, Fabian Manhardt, Federico Tombari, Christian Rupprecht

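Affiliations: University of Oxford; Google Research; CREST-ENSAE, Institut Polytechnique de Paris; Technical University of Munich

Topic match (Video Generation): Uses epipolar geometry constraints to improve the geometric consistency of video generation models.

AI summary: To address geometric inconsistency and motion artifacts in video generation models, proposes a preference optimization method based on epipolar geometry constraints that reduces epipolar error by 31% and raises human-rated consistency from 54% to 72% while preserving visual quality.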

Abstract

Video generation models have advanced significantly through the latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite using massive training data, these models fail to capture fundamental geometric principles. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics. Training on static scenes with dynamic cameras ensures metric quality while the model generalizes to various dynamic scenes. By bridging data-driven learning with classical computer vision, we reduce epipolar error by 31% and improve human-rated consistency from 54% to 72% without compromising visual quality.
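
The abstract does not define its epipolar error. A standard choice in classical multi-view geometry is the Sampson distance, a first-order approximation of how far a point correspondence is from satisfying the epipolar constraint x2^T F x1 = 0. The sketch below assumes that metric and shows how a non-differentiable score of this kind could rank two candidate videos for preference-style training. The fundamental matrix, the matches, and the helper names are hypothetical; in practice F would be estimated from the frames rather than given.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def sampson_error(F, p1, p2):
    """Sampson distance for a correspondence p1 -> p2 in pixel coordinates."""
    x1 = [p1[0], p1[1], 1.0]
    x2 = [p2[0], p2[1], 1.0]
    Fx1 = mat_vec(F, x1)
    Ftx2 = mat_vec(transpose(F), x2)
    residual = sum(a * b for a, b in zip(x2, Fx1))  # x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / denom

def mean_error(F, matches):
    """Average Sampson distance over a list of (p1, p2) matches."""
    return sum(sampson_error(F, a, b) for a, b in matches) / len(matches)

# Fundamental matrix of a purely horizontal camera translation:
# corresponding points must stay on the same image row.
F = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]

# Hypothetical matches between two frames of each candidate video.
video_a = [((10.0, 5.0), (14.0, 5.0)), ((30.0, 8.0), (33.0, 8.0))]
video_b = [((10.0, 5.0), (14.0, 7.0)), ((30.0, 8.0), (33.0, 8.0))]

err_a = mean_error(F, video_a)
err_b = mean_error(F, video_b)
preferred = "a" if err_a <= err_b else "b"
print(err_a, err_b, preferred)
```

Because the score is only used to decide which sample of a pair is preferred, it never needs to be differentiated, which matches the abstract's point about not requiring end-to-end differentiability.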

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update Topic 85

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

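Affiliations: NVIDIA

Topic match (Video Generation): Video generation capability; world simulator.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and provides a scalable, general-purpose backbone for embodied agents.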

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2. Video Understanding (2 papers)

2602.08355 2026-06-18 cs.CV Version update Topic 90

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

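Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science; Vin University

Topic match (Video Understanding): An e-commerce short video understanding benchmark that evaluates the video understanding ability of multimodal large models.

AI summary: Proposes E-VAds, a benchmark for e-commerce short video understanding; quantifies domain complexity with a multi-modal information density assessment framework; builds a Q&A dataset generated by a multi-agent system; and develops E-VAds-R1, an RL-based reasoning model that achieves a 109.2% performance gain in commercial intent reasoning.

Comments: Accepted by ICML 2026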

Abstract

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.
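
The abstract names the reward design MG-GRPO but does not specify it. Assuming from the name that it builds on GRPO-style training (an assumption, not something the abstract states), the group-relative advantage step commonly used in GRPO can be sketched as follows: each sampled answer's reward is normalized by the mean and standard deviation of its own group. The reward values and the function name are hypothetical, and the multi-grained reward itself is not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for the same question.
rewards = [0.2, 0.4, 0.9, 0.5]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))
print(advantages, best)
```

Answers scoring above the group mean get a positive advantage and the rest a negative one, so the policy is pushed toward the relatively better answers without a separate value model.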

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update Topic 70

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

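Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

Topic match (Video Understanding): A video future forecasting benchmark involving temporal reasoning.

AI summary: Proposes the FutureOmni benchmark to evaluate the ability of multimodal large models to forecast the future from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments: Accepted by ICML 2026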

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).