VLA / Vision-Language-Action Models

2512.20014 2026-06-19 cs.RO cs.AI Version update 85%

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Sangoh Lee, Sangwoo Mo, Wook-Shin Han

Affiliations: GSAI, POSTECH (POSTECH artificial intelligence institute); IME, POSTECH (POSTECH information and media institute)

Topic match: VLA models: personalized VLA models, Visual Attentive Prompting

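AI summary: VLA models have difficulty with personalized instructions. To address this, the paper proposes Visual Attentive Prompting (VAP), a training-free method that uses reference images as a non-parametric memory, localizes personal objects through open-vocabulary detection and embedding matching, and injects the result into the model as a visual prompt. It markedly improves success rate and correct-object manipulation across multiple simulated and real-world settings.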

Comments ICML 2026. Project page: https://vap-project.github.io/

Abstract

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
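
The three steps named in the abstract (match detected candidates against the reference images, highlight the matched object, rewrite the instruction) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the detector, the embedding model, the highlight style, or the rewritten wording, so everything below is a hypothetical stand-in: candidate boxes and embeddings are assumed to come from some open-vocabulary detector and image encoder, the highlight is a simple color tint, and the function names are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_personal_object(candidates, reference_embeddings):
    """Pick the candidate box most similar to any reference embedding.

    candidates: list of (box, embedding) pairs, assumed to come from a
    hypothetical open-vocabulary detector plus image encoder.
    Returns (best_box, best_score); best_box is None if there are no candidates.
    """
    best_box, best_score = None, -1.0
    for box, emb in candidates:
        # Score each candidate by its closest reference image.
        score = max((cosine(emb, ref) for ref in reference_embeddings), default=-1.0)
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score

def highlight(image, box, tint=(255, 0, 0), alpha=0.5):
    """Blend a tint into the box region (assumed highlight style).

    image: list of rows of (r, g, b) tuples; box: (x0, y0, x1, y1), end-exclusive.
    Returns a new image and leaves the input unchanged.
    """
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = tuple(
                round((1 - alpha) * c + alpha * t) for c, t in zip(out[y][x], tint)
            )
    return out

def rewrite_instruction(instruction, personal_phrase, generic_phrase):
    """Replace the personal reference with a phrase pointing at the highlight."""
    return instruction.replace(personal_phrase, generic_phrase)

# Toy usage: two detected cups, two reference views of the user's cup.
candidates = [((0, 0, 2, 2), [1.0, 0.0]), ((2, 0, 4, 2), [0.0, 1.0])]
references = [[0.1, 0.9], [0.2, 0.8]]
box, score = match_personal_object(candidates, references)
frame = [[(0, 0, 0)] * 4 for _ in range(2)]
prompted_frame = highlight(frame, box)
prompted_text = rewrite_instruction("bring my cup", "my cup", "the highlighted cup")
```

In this toy setup the second candidate is selected because its embedding is closest to the references; the tinted frame and the rewritten instruction would then be passed to the frozen policy in place of the original inputs.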
