Robotics / Embodied AI - arXivDaily Topic

2606.19383 2026-06-19 cs.RO cs.CV New submission 80%

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

Affiliations * University of Stuttgart; IMPRS-IS; Sapienza University of Rome; Google; MIT; University of Freiburg; UTN; University of Montreal; Mila; TU Munich

Topic match: embodied reasoning. 3DSGs are used for robot manipulation and navigation.

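AI summary: This survey gives a unified review of how 3D scene graphs (3DSGs) are constructed, applied, and evaluated, analyzes existing modeling choices and open challenges, and aims to advance robust deployment.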

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.
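
The abstract lists node and edge attributes and hierarchical structure as the main modeling choices. As a rough illustration only, a hierarchical scene graph could be sketched in Python as below. The class names, the layer names ("room", "object"), and the relation names ("contains", "on") are assumptions made for this example and are not taken from the survey's formal definition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    layer: str      # hypothetical hierarchy level, e.g. "object" or "room"
    label: str      # semantic class of the entity
    centroid: tuple[float, float, float]  # geometric grounding (x, y, z)


@dataclass
class SceneGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Each edge is stored as (source id, relation, target id).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, relation, dst))

    def related(self, node_id: str, relation: str) -> list[str]:
        """Ids of nodes reached from node_id through the given relation."""
        return [dst for src, rel, dst in self.edges
                if src == node_id and rel == relation]

    def children(self, parent_id: str) -> list[str]:
        """Ids of nodes one level below parent_id in the hierarchy."""
        return self.related(parent_id, "contains")


def build_example() -> SceneGraph:
    g = SceneGraph()
    g.add_node(Node("room_1", "room", "kitchen", (2.0, 1.5, 0.0)))
    g.add_node(Node("obj_1", "object", "table", (2.2, 1.0, 0.4)))
    g.add_node(Node("obj_2", "object", "mug", (2.2, 1.0, 0.8)))
    g.add_edge("room_1", "contains", "obj_1")   # hierarchical edge
    g.add_edge("room_1", "contains", "obj_2")   # hierarchical edge
    g.add_edge("obj_2", "on", "obj_1")          # spatial relation between objects
    return g
```

In this sketch, hierarchy and inter-object relations share one edge list and differ only in the relation name; an actual formulation may instead separate layers explicitly or attach richer attributes to edges.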

2606.20545 2026-06-19 cs.CV New submission 60%

Current World Models Lack a Persistent State Core

Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

Affiliations * University of Science and Technology of China; Beijing Innovation Center of Humanoid Robotics (X-Humanoid); NLPR, Institute of Automation, Chinese Academy of Sciences; Independent Researcher; Dresden University of Technology; Peking University

Topic match: embodied reasoning. World models are critical to embodied AI.

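AI summary: Introduces the WRBench benchmark and finds that existing world models fail to keep the world state evolving when observation is interrupted; argues that stability of the physical state kernel should be a primary objective of world-model design.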

Comments 39 pages, 16 figures

Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.
