3D Vision - arXivDaily Topic

2606.19383 2026-06-19 cs.RO cs.CV New submission 95%

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

Affiliations * University of Stuttgart; IMPRS-IS; Sapienza University of Rome; Google; MIT; University of Freiburg; UTN; University of Montreal; Mila; TU Munich

Topic match Spatial understanding: a survey of 3D scene graphs, which combine geometry with semantics.

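AI summary This paper gives a unified survey of how 3D scene graphs (3DSGs) are constructed, applied, and evaluated, analyzes existing modeling choices and open challenges, and aims to advance robust deployment.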

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.

2606.19915 2026-06-19 cs.CV New submission 85%

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Jiayu Tang, Yuchen Zhou, Chao Gou

Affiliations * School of Intelligent Systems Engineering, Sun Yat-sen University

Topic match Spatial understanding: proposes SpatialSV, an MLLM framework that internalizes 3D spatial awareness.

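AI summary Proposes the SpatialSV framework, which uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, point clouds). This internalizes interpretable 3D spatial awareness without external tools and shows strong generalization in semi-supervised settings.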

Comments Accepted by IJCAI 2026

Abstract

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

2606.20515 2026-06-19 cs.CV New submission 80%

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

Affiliations * NTU (Nanyang Technological University); THU (Tsinghua University); ByteDance; NWPU (Northwestern Polytechnical University)

Topic match Spatial understanding: focuses on spatial-intelligence reasoning over a continuous 3D world.

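AI summary Proposes S-Agent, a spatial tool-use agent paradigm that combines spatio-temporal evidence accumulation with a hierarchical toolset and uses the VLM as a semantic planner to reason spatially over continuous multi-view images and videos. It improves both open-source and closed-source VLMs without training, and fine-tuning on the S-300K trajectories yields S-Agent-8B, a compact spatial agent.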

Comments Project Page: https://Ropedia.github.io/S-Agent

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on the S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
