WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation
Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu
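AI Summary: Proposes WAM-Nav, an asymmetric diffusion Transformer model that jointly learns action generation and latent visual prediction. A shared Diffusion Transformer performs joint diffusion over long-horizon actions and short-horizon visual predictions, and the model adds a dual-stream contextual conditioning mechanism and a goal alignment module. A single unified policy supports Image-Goal, Point-Goal, and No-Goal navigation. On the ClutterScenes and InternScenes benchmarks, success rates improve by 15.7% on Image-Goal and 3.3% on Point-Goal navigation, and real-world deployment reaches an 85% task success rate.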
Details
Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.
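The asymmetric joint diffusion described above can be illustrated with a toy sketch. This is not the authors' implementation: the horizon lengths, token sizes, the stub `toy_shared_denoiser`, and the simple interpolation update are all assumptions chosen for illustration, since the abstract gives no concrete values or sampler. The only point is structural: a long sequence of action tokens and a shorter sequence of latent visual tokens are denoised together in one loop by one shared model, rather than by rolling out future frames autoregressively.

```python
import random

# All sizes below are hypothetical; the abstract gives no concrete values.
ACTION_HORIZON = 8   # long horizon: future action steps
VISUAL_HORIZON = 2   # short horizon: future latent frames
ACTION_DIM = 2       # e.g. one 2D waypoint per step
LATENT_DIM = 4       # size of one latent visual token


def gaussian_tokens(rng, count, dim):
    """Draw `count` noise tokens of width `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def toy_shared_denoiser(actions, latents, context):
    """Stand-in for the shared Diffusion Transformer (assumption).

    A real model would attend over both token groups and the context.
    This stub just predicts every clean token as the context value.
    """
    clean_actions = [[context] * ACTION_DIM for _ in actions]
    clean_latents = [[context] * LATENT_DIM for _ in latents]
    return clean_actions, clean_latents


def blend(tokens, targets, weight):
    """Move each token a fraction `weight` of the way to its target."""
    return [[x + weight * (y - x) for x, y in zip(tok, tgt)]
            for tok, tgt in zip(tokens, targets)]


def joint_sample(context, steps=4, seed=0):
    """Denoise actions and latent foresight together, in one loop."""
    rng = random.Random(seed)
    actions = gaussian_tokens(rng, ACTION_HORIZON, ACTION_DIM)
    latents = gaussian_tokens(rng, VISUAL_HORIZON, LATENT_DIM)
    for k in range(steps):
        pred_a, pred_l = toy_shared_denoiser(actions, latents, context)
        weight = 1.0 / (steps - k)  # the final step lands on the prediction
        actions = blend(actions, pred_a, weight)
        latents = blend(latents, pred_l, weight)
    return actions, latents


actions, latents = joint_sample(context=0.5)
print(len(actions), len(latents))  # prints "8 2"
```

Both outputs come from the same fixed number of denoising steps, so in this sketch the cost of producing visual foresight does not grow with a frame-by-frame rollout. The real conditioning inputs (ego-motion history, visual observations, goal embedding) are collapsed here into a single scalar `context` purely to keep the example short.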