Robotics / Embodied Intelligence - arXivDaily Topic

2606.18375 2026-06-18 cs.RO New submission 95%

PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou, Hewen Xiao, Ziqiao Zhou, Kaiyun Huang, Zhiyu Peng, Juzhan Xu, Hang Zhao, Chenyang Zhu, Renjiao Yi, Yifei Huang, Douhui Wu, Yan Zhang, Kexu Cheng, Chunhe Song, Yunzhi Xue, Xiuhong Zhang, Leitao Guo, Yunji Chen, Bin Wu, Haibin Yu, Kai Xu

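Institutions: Institute of AI for Industries, Chinese Academy of Sciences

Topic match (robot foundation models): Proposes a 3D-consistent world foundation model for robotic manipulation.

AI summary: Proposes the PAIWorld framework, which addresses 3D inconsistency in multi-view world models through geometry-aware cross-view attention, geometric rotary position embedding, and latent 3D-REPA distillation, and achieves leading performance on robotic manipulation benchmarks.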

Abstract

World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
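The camera ray directions mentioned in component (2) can be illustrated with a standard pinhole-camera back-projection. This is a minimal sketch, not PAIWorld's implementation: the abstract does not say how rays are computed or how they enter the rotary embedding, and the function name `pixel_ray_world`, the intrinsics values, and the OpenCV-style convention (camera looking along +z) are all assumptions made for illustration.

```python
import math


def pixel_ray_world(u, v, fx, fy, cx, cy, R_c2w):
    """Return the unit ray direction, in world coordinates, for pixel (u, v).

    fx, fy, cx, cy are pinhole intrinsics (hypothetical values below).
    R_c2w is a 3x3 camera-to-world rotation given as nested lists.
    Assumes the camera looks along its +z axis.
    """
    # Back-project the pixel to a direction in the camera frame.
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate into the world frame: d_world = R_c2w @ d_cam.
    d_world = [sum(R_c2w[i][k] * d_cam[k] for k in range(3)) for i in range(3)]
    # Normalize so that only the direction is kept.
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Camera rotated 90 degrees about the world y-axis: camera +z maps to world +x.
yaw90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

# The principal point (cx, cy) always maps to the camera's optical axis.
center_ray = pixel_ray_world(320, 240, 500.0, 500.0, 320.0, 240.0, identity)
rotated_ray = pixel_ray_world(320, 240, 500.0, 500.0, 320.0, 240.0, yaw90)
```

The same pixel yields different world-space rays for the two cameras, which is the kind of per-view geometric signal that a position encoding conditioned on extrinsics could expose to attention.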

2606.17846 2026-06-18 cs.RO cs.CV cs.LG New submission 90%

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

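Institutions: Qwen Team

Topic match (robot foundation models): Robotic manipulation foundation model with large-scale pretraining.

AI summary: Proposes Qwen-RobotManip, which uses a unified alignment framework spanning the representation, motion, and behavioral dimensions to enable large-scale co-training on heterogeneous multi-source manipulation data, builds a pretraining corpus of about 38,100 hours, and surpasses prior models in generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.

Comments: 44 pages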

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi_{0.5}$, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update 90%

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

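Institutions: NVIDIA

Topic match (robot foundation models): Provides a general-purpose backbone for embodied agents.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable general-purpose backbone for embodied agents.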

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.18632 2026-06-18 cs.RO New submission 85%

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

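Institutions: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

Topic match (robot foundation models): Safety dataset for embodied foundation models, aimed at preventing human injury.

AI summary: To address the difficulty of safely collecting data on robots injuring humans, proposes a safety data construction pipeline grounded in real observations and generates the ROBOSHACKLES dataset of 10,000 videos covering direct-harm and indirect-harm categories; in the evaluation, all six tested models produced unsafe actions in the safety-critical scenarios (a 100% unsafe action generation rate).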

Abstract

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

2606.18610 2026-06-18 cs.RO cs.CV New submission 85%

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

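Institutions: University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

Topic match (robot foundation models): Evaluates robot foundation models via self-consistent video generation.

AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pretrained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.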

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
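The two agreement metrics reported above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the abstract defines neither metric, so `mmrv` here follows the Mean Maximum Rank Violation definition commonly used in earlier simulated-evaluation work (a pair of policies counts as a violation when the evaluator orders them differently from the real world, weighted by their real-world performance gap), and the success rates are made-up numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def mmrv(real, sim):
    """Mean Maximum Rank Violation (assumed definition, lower is better).

    For each policy i, take the largest real-world gap |real[i] - real[j]|
    over all policies j that the evaluator orders differently from the
    real world, then average these worst-case gaps over all policies.
    """
    n = len(real)
    worst_gaps = []
    for i in range(n):
        gaps = [
            abs(real[i] - real[j])
            for j in range(n)
            if (sim[i] < sim[j]) != (real[i] < real[j])
        ]
        worst_gaps.append(max(gaps, default=0.0))
    return sum(worst_gaps) / n


# Hypothetical success rates for three policies.
real_success = [0.2, 0.5, 0.8]
sim_consistent = [0.3, 0.4, 0.9]  # same ordering as the real world
sim_swapped = [0.3, 0.9, 0.4]     # policies 2 and 3 are ranked the wrong way

r_consistent = pearson(real_success, sim_consistent)
mmrv_consistent = mmrv(real_success, sim_consistent)
mmrv_swapped = mmrv(real_success, sim_swapped)
```

Pearson correlation rewards evaluators whose scores track real success rates linearly, while MMRV penalizes ranking mistakes in proportion to how far apart the misordered policies really are, so the two are usually reported together.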

2606.17030 2026-06-18 cs.CV New submission 75%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

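Institutions: Qwen Team

Topic match (robot foundation models): Embodied world model for tasks such as robotic manipulation.

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, and achieves top results on multiple benchmarks.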

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
