VLA / Vision-Language-Action Models

2606.19297 2026-06-18 cs.LG cs.RO New submission 95%

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

Affiliations: CogAI Lab; FusionBrain Lab; IAI MSU; Lomonosov MSU; NUST MISIS; Applied AI Institute; HSE University; Generalizable AI Systems; ISP RAS; MIRAI; Domain-specific NLP Group

Topic match (VLA models): proposes Act2Answer to evaluate knowledge retention in VLA models.

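AI summary: Proposes the Act2Answer protocol, which evaluates knowledge retention in VLA models by having them answer through action. The study finds that VLAs perform well on simple concepts but show gaps on richer semantic categories, and that VQA co-training is associated with better knowledge retention.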

Comments Project page: https://tttonyalpha.github.io/act2answer/

Abstract

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.
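
A minimal sketch of how an action-grounded success rate of this kind could be tallied per category, assuming each episode records which candidate slot the agent placed the object on. The `Episode` fields, the category names, and `success_rate_by_category` are hypothetical and are not taken from the Act2Answer release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    category: str       # knowledge category of the question (hypothetical label)
    correct_slot: int   # index of the candidate that answers the question
    chosen_slot: int    # index of the candidate the agent placed the object on


def success_rate_by_category(episodes):
    """Return {category: fraction of episodes whose placement picked the right answer}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ep in episodes:
        totals[ep.category] += 1
        hits[ep.category] += int(ep.chosen_slot == ep.correct_slot)
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Made-up episodes for illustration only.
episodes = [
    Episode("colors", correct_slot=0, chosen_slot=0),
    Episode("colors", correct_slot=1, chosen_slot=1),
    Episode("landmarks", correct_slot=2, chosen_slot=0),
    Episode("landmarks", correct_slot=1, chosen_slot=1),
]
rates = success_rate_by_category(episodes)
print(rates)
```

Because success depends only on which candidate receives the object, a score like this is less sensitive to fine-grained control quality than a full manipulation task would be.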

2606.17846 2026-06-18 cs.RO cs.CV cs.LG New submission 95%

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

Affiliations: Qwen Team

Topic match (VLA models): proposes a VLA foundation model for robotic manipulation.

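AI summary: Proposes Qwen-RobotManip, which uses a unified alignment framework across the representation, motion, and behavioral dimensions to enable large-scale co-training on multi-source heterogeneous manipulation data. It builds a pretraining corpus of about 38,100 hours and outperforms prior models on generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.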

Comments 44 pages

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

2606.18955 2026-06-18 cs.CV cs.RO New submission 85%

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

Affiliations: Department of Electronic Engineering, Tsinghua University; Tianfu Jiangxi Laboratory

Topic match (VLA models): trains a VLA on action priors extracted from egocentric human videos.

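AI summary: Proposes a latent-action-based framework that uses a Hybrid Disentangled VQ-VAE to extract general action priors from unlabeled human videos, reduces action hallucinations through an intent-perception decoupling strategy, and needs only 50 trajectories to adapt to downstream tasks.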

Comments Accepted to IROS 2026

Abstract

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.
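
A minimal sketch of the vector-quantization step common to VQ-VAE models: a continuous latent is mapped to its nearest codebook entry, and the entry's index serves as a discrete latent action. This does not reproduce the paper's Hybrid Disentangled design or its physical masks; the 2-D codebook values and the function names are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent, codebook):
    """Return (index, code) of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook holding three latent "actions".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
index, code = quantize((0.9, 0.2), codebook)
print(index, code)
```

In a trained model the codebook is learned jointly with the encoder and decoder; here it is fixed so that only the lookup itself is shown.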

2606.18426 2026-06-18 cs.RO New submission 85%

VEGA: Learning Navigation VLAs from In-the-Wild Egocentric Video with Geometric Trajectory Supervision

Gershom Seneviratne, Yohan Abeysinghe, Jianyu An, Vaibhav Shende, Dinesh Manocha

Affiliations: University of Maryland, College Park

Topic match (VLA models): proposes the VEGA approach for training vision-language-action navigation policies.

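AI summary: Proposes VEGA, which reconstructs scene geometry from unlabeled egocentric videos to generate obstacle-aware trajectories and trains a flow-matching VLA navigation policy on them. On VEGA-Bench it reduces collisions by 33.0% relative to the strongest baseline, and in real-world trials it improves success by at least 150.0%.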

Abstract

We introduce VEGA, an approach for training navigation Vision-Language-Action (VLA) models from unlabeled egocentric navigation videos. Internet-scale egocentric videos provide a scalable source of navigation-relevant visual observations, capturing cluttered scenes, close-range obstacles, and natural human motion through real-world spaces. However, these videos are not directly usable for policy learning because they do not provide obstacle-aware trajectories conditioned on explicit navigation goals in the robot's coordinate frame. VEGA addresses this gap by reconstructing local scene geometry from monocular video, sampling navigation goals (represented as text, image, or spatial waypoints), and generating obstacle-aware trajectories using the constructed geometry. The resulting trajectory distribution is then used to train a flow-matching VLA navigation policy. By using geometry exclusively during training, VEGA distills obstacle-aware planning directly into a vision-based policy. Furthermore, we introduce VEGA-Bench, a benchmark containing 250k scenes and approximately 5 million navigation goals paired with scene geometry, designed to evaluate goal progress, collision avoidance, and obstacle clearance of VLAs. Our evaluation shows that VEGA achieves competitive goal progress while reducing collisions by 33.0% and improving obstacle clearance by 17.9% over the strongest baseline on VEGA-Bench; in real-world trials, it improves success by at least 150.0%, reduces collisions by at least 66.7%, and improves obstacle clearance by at least 60.0%. Ultimately, we demonstrate that video-derived geometric supervision provides a scalable and effective signal for training obstacle-aware navigation VLAs. The code and benchmark will be released at the time of publication.
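
A minimal sketch of how collision and obstacle-clearance scores might be computed for a 2-D trajectory against obstacle points, plus the relative-change arithmetic behind figures such as "by 33.0%". The metric definitions, the `robot_radius` parameter, and all numbers are assumptions for illustration; the abstract does not specify how VEGA-Bench defines these quantities.

```python
import math


def min_clearance(trajectory, obstacles):
    """Smallest distance from any waypoint to any obstacle point."""
    return min(math.dist(p, o) for p in trajectory for o in obstacles)


def collides(trajectory, obstacles, robot_radius):
    """Assumed rule: a collision occurs if a waypoint is within robot_radius of an obstacle."""
    return min_clearance(trajectory, obstacles) < robot_radius


def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline` (0.5 means +50%)."""
    return (value - baseline) / baseline


# Made-up waypoints and obstacle points, in metres.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
obstacles = [(1.0, 1.0), (2.0, 3.0)]
print(min_clearance(trajectory, obstacles))
print(collides(trajectory, obstacles, robot_radius=0.5))
print(relative_change(0.5, 1.25))
```

Under this reading, a relative improvement of 150.0% in success corresponds to a success rate 2.5 times that of the baseline.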

2606.18890 2026-06-18 cs.AI New submission 70%

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Affiliations: StepFun; University of Science and Technology Beijing; Tsinghua University; Nanyang Technological University

Topic match (VLA models): GUI agents involve vision, language, and action.

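AI summary: Proposes the Skill-Guided Continuation Distillation (SGCD) framework, in which a skill-guided policy generates successful continuations that supply supervision for states not covered by expert trajectories. On OSWorld-Verified it raises the success rate of three base models from the low-30% range to over 50%.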

Abstract

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.
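
A toy sketch of the data-collection loop described above: the plain policy runs for a few steps to reach an off-trajectory state, a guided policy finishes the task from there, and only successful continuations are mixed with expert data. The 1-D environment, both policies, and every name here are hypothetical stand-ins, not the SGCD implementation.

```python
GOAL = 3
MAX_STEPS = 8


def step(state, action):
    """Toy 1-D environment: states are integers, actions are -1 or +1."""
    return state + action


def plain_policy(state):
    # Stand-in for the current policy without skills: it drifts the wrong way.
    return -1


def skill_guided_policy(state):
    # Stand-in for the same policy when given extracted skills.
    return 1 if state < GOAL else -1


def collect_continuation(start, warmup_steps):
    """Run the plain policy for `warmup_steps`, then let the guided policy finish.

    Returns the guided (state, action) pairs if GOAL is reached within
    MAX_STEPS guided steps, otherwise an empty list.
    """
    state = start
    for _ in range(warmup_steps):
        state = step(state, plain_policy(state))  # reach an off-trajectory state
    continuation = []
    for _ in range(MAX_STEPS):
        if state == GOAL:
            return continuation
        action = skill_guided_policy(state)
        continuation.append((state, action))
        state = step(state, action)
    return continuation if state == GOAL else []


expert_data = [(0, 1), (1, 1), (2, 1)]       # states the expert visited
continuation = collect_continuation(start=0, warmup_steps=2)
training_data = expert_data + continuation   # mixed supervision
print(continuation)
```

The continuation starts from states (-2 and -1 in this toy) that the expert data never covers, which is the supervision gap the abstract describes.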

2605.05925 2026-06-18 cs.RO Updated version 60%

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

Affiliations: Korea Institute of Science and Technology; KAIST; Hanyang University

Topic match (VLA models): touches on vision, language, and action, but focuses mainly on manipulation.

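AI summary: Proposes the DexSynRefine framework, which synthesizes hand-object trajectories with the HOI-MMFP motion prior and combines task-space residual reinforcement learning with contact-dynamics adaptation to turn human-object interaction data into physically feasible dexterous manipulation. Across five tasks, it improves real-world success rates over kinematic retargeting by 50-70 percentage points.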

Comments Project page: https://dexsynrefine.github.io/

Abstract

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.
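
A minimal sketch of the general idea behind a task-space residual: a learned policy outputs a small correction that is added to a reference waypoint from the motion prior. Clipping the correction to a fixed bound is an assumption made here for illustration, as are the function names and all numbers; the abstract does not describe how DexSynRefine parameterizes or limits its residuals.

```python
def clip(value, limit):
    """Clamp `value` to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def apply_residual(reference_pose, residual, limit):
    """Add a bounded per-axis correction to one reference waypoint."""
    return tuple(r + clip(d, limit) for r, d in zip(reference_pose, residual))


reference = (0.30, 0.10, 0.20)   # hypothetical x, y, z from the motion prior, metres
residual = (0.05, -0.002, 0.0)   # hypothetical output of the residual policy
command = apply_residual(reference, residual, limit=0.01)
print(command)
```

Keeping the correction small lets the prior determine the overall motion while the learned term handles the physical details that a purely kinematic trajectory misses.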
