SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation
Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager
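AI summary: Proposes RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to provide dense per-step rewards for robotic manipulation tasks, and builds on RM to construct the SPIRAL framework, which improves VLA policies from cheap autonomous rollouts and substantially raises success rates on a 10-task benchmark.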
Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce RM, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on RM, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, RM reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.
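The abstract names a multi-gate Mixture-of-Experts (MMoE) value head but gives no implementation details. As a rough illustration of the general MMoE idea only, the sketch below shares a pool of experts across tasks and gives each task its own softmax gate and output tower. Everything here is an assumption made for illustration, not the authors' architecture: the class name `MMoEValueHead`, the layer sizes, the choice to gate per task, the random untrained weights, and the sigmoid that squashes the output into (0, 1) as a progress-like value.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MMoEValueHead:
    """Hypothetical multi-gate MoE head: shared experts, one gate and tower per task."""

    def __init__(self, in_dim, hidden_dim, num_experts, num_tasks, seed=0):
        rng = np.random.default_rng(seed)
        s_in = 1.0 / np.sqrt(in_dim)
        s_hid = 1.0 / np.sqrt(hidden_dim)
        # Shared expert weights: (experts, in_dim, hidden_dim).
        self.expert_w = rng.normal(0.0, s_in, (num_experts, in_dim, hidden_dim))
        # One gating matrix per task: (tasks, in_dim, experts).
        self.gate_w = rng.normal(0.0, s_in, (num_tasks, in_dim, num_experts))
        # One linear output tower per task: (tasks, hidden_dim).
        self.tower_w = rng.normal(0.0, s_hid, (num_tasks, hidden_dim))

    def gates(self, x, task_id):
        """Per-step mixture weights over experts for one task, shape (T, experts)."""
        return softmax(x @ self.gate_w[task_id], axis=-1)

    def __call__(self, x, task_id):
        """Map per-step features x of shape (T, in_dim) to T values in (0, 1)."""
        # Every expert processes every step: (T, experts, hidden_dim), ReLU activation.
        expert_out = np.maximum(0.0, np.einsum("td,edh->teh", x, self.expert_w))
        # Task-specific gate mixes the expert outputs: (T, hidden_dim).
        mixed = np.einsum("te,teh->th", self.gates(x, task_id), expert_out)
        # Task-specific tower yields one scalar per step, squashed by a sigmoid.
        logits = mixed @ self.tower_w[task_id]
        return 1.0 / (1.0 + np.exp(-logits))


# Usage with placeholder features standing in for per-frame embeddings.
head = MMoEValueHead(in_dim=16, hidden_dim=8, num_experts=4, num_tasks=10)
features = np.random.default_rng(1).normal(size=(5, 16))
values = head(features, task_id=3)
```

In this sketch the experts are shared while the gates and towers are task-specific, which is the property that lets one head serve several tasks; whether RM gates by task, by stage, or by action primitive is not stated in the abstract.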