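arXivDaily arXiv Daily Academic Digest, updated Monday through Friday

Vision and Robotics

3D Vision

3D reconstruction, NeRF, Gaussian Splatting, point clouds, and spatial intelligence.

Included for today / current date: 33. Sources: cs.CV, cs.GR, cs.RO

1. Point clouds (1 paper)

2606.20455 2026-06-19 cs.CV New submission 95%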

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu

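Affiliations * School of Architecture and Urban Planning, Shenzhen University

Topic match Point clouds: extracts building footprints from aerial LiDAR point clouds; point cloud processing is the core.

AI summary Presents PCFootprint, the first large-scale dataset for building footprint extraction from airborne laser scanning point clouds, with 33,000 tiles and a cross-domain test set; an evaluation of mainstream methods reveals the challenges posed by complex geographic environments.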

Comments 14 pages, 9 figures

Details
Abstract

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m and comes with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.
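As a rough illustration of the tiling described in the abstract, the sketch below assigns points to 128 m × 128 m tiles by integer division of their planar coordinates. It is not the authors' preprocessing code: the function names, the tile origin at (0, 0), and the assumption that coordinates are already in a projected metric system are all hypothetical.

```python
import math
from collections import defaultdict

TILE_SIZE_M = 128.0  # tile edge length stated in the abstract


def tile_index(x, y, tile_size=TILE_SIZE_M):
    """Return the (column, row) index of the tile containing the point (x, y)."""
    return (math.floor(x / tile_size), math.floor(y / tile_size))


def group_points_by_tile(points, tile_size=TILE_SIZE_M):
    """Group (x, y, z) points into a dict keyed by tile index (hypothetical helper)."""
    tiles = defaultdict(list)
    for x, y, z in points:
        tiles[tile_index(x, y, tile_size)].append((x, y, z))
    return dict(tiles)
```

Using `math.floor` rather than `int` keeps negative coordinates in the correct tile, since `int` truncates toward zero.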

2. Spatial understanding (4 papers)

2606.19383 2026-06-19 cs.RO cs.CV New submission 95%

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

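Affiliations * University of Stuttgart; IMPRS-IS; Sapienza University of Rome; Google; MIT; University of Freiburg; UTN; University of Montreal; Mila; TU Munich

Topic match Spatial understanding: a survey of 3D scene graphs, which combine geometry and semantics.

AI summary A unified survey of the construction, applications, and evaluation of 3D scene graphs (3DSGs) that analyzes existing modeling choices and open challenges, with the aim of enabling robust deployment.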

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Details
Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.
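To make the ingredients listed in the abstract concrete (nodes with geometric grounding and semantic attributes, relational edges, and a hierarchy), here is a minimal sketch of a scene-graph container. The class names, the layer names, and the `contains` relation are illustrative assumptions, not a definition taken from the survey.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    layer: str        # hypothetical layer name, e.g. "object" or "room"
    position: tuple   # geometric grounding: (x, y, z) centroid in metres
    attributes: dict = field(default_factory=dict)  # semantic labels and similar


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, source_id, relation, target_id):
        # Refuse edges whose endpoints are unknown, so the graph stays consistent.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("add both endpoint nodes before adding an edge")
        self.edges.append((source_id, relation, target_id))

    def children(self, parent_id):
        # Hierarchy is modelled here as edges with the relation "contains".
        return [t for s, r, t in self.edges if s == parent_id and r == "contains"]


graph = SceneGraph()
graph.add_node(Node("kitchen", "room", (2.0, 1.0, 0.0)))
graph.add_node(Node("table", "object", (2.2, 1.0, 0.4), {"label": "table"}))
graph.add_node(Node("mug", "object", (2.3, 1.1, 0.9), {"label": "mug"}))
graph.add_edge("kitchen", "contains", "table")
graph.add_edge("kitchen", "contains", "mug")
graph.add_edge("mug", "on", "table")
```

Storing hierarchy and spatial relations in the same edge list keeps the sketch short; formulations with explicit layers or dynamic nodes would need more structure than this.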

2606.19915 2026-06-19 cs.CV New submission 85%

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Jiayu Tang, Yuchen Zhou, Chao Gou

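Affiliations * School of Intelligent Systems Engineering, Sun Yat-sen University

Topic match Spatial understanding: proposes SpatialSV, a framework that internalizes 3D spatial awareness in MLLMs.

AI summary Proposes SpatialSV, a framework that uses task-oriented visual supervision to lift an MLLM's 2D features into explicit 3D representations (depth maps, camera poses, and point clouds), internalizing interpretable 3D spatial awareness without external tools and showing strong generalization in semi-supervised settings.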

Comments Accepted by IJCAI 2026

Details
Abstract

Unlocking the spatial intelligence of multimodal large language models (MLLMs) is crucial for understanding and interacting with the 3D world. Prevailing approaches typically inject spatial priors via external tools, which impose significant inference overhead, or rely on latent feature distillation, which remains uninterpretable and lacks fine-grained geometric constraints. To address these issues, we propose SpatialSV, a framework designed to internalize robust 3D spatial awareness within MLLMs while simultaneously offering inherent interpretability. Deviating from passive feature imitation, SpatialSV employs task-oriented visual supervision, compelling the model to actively lift its 2D visual features into explicit 3D representations, including depth maps, camera poses, and point clouds. Crucially, this 2D-to-3D lifting process provides a transparent window into the model's representations: the resulting 3D reconstructions serve as an intuitive proxy for visualizing and diagnosing the quality of the model's intrinsic spatial knowledge. Extensive experiments across multiple models and benchmarks demonstrate the effectiveness of SpatialSV in enhancing and interpreting MLLMs' spatial intelligence. Furthermore, the framework exhibits strong generalization in semi-supervised settings, validating its potential to leverage unlabeled visual data for scalable, interpretable spatial representation learning.

2606.20515 2026-06-19 cs.CV New submission 80%

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

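Affiliations * NTU (Nanyang Technological University); THU (Tsinghua University); ByteDance; NWPU (Northwestern Polytechnical University)

Topic match Spatial understanding: focuses on spatial-intelligence reasoning over a continuous 3D world.

AI summary Proposes S-Agent, a spatial tool-use agentic paradigm that uses spatio-temporal evidence accumulation and a hierarchical toolset, with the VLM acting as a semantic planner, to reason spatially over continuous multi-view images and videos. It improves open-source and closed-source VLMs without training, and fine-tuning on the S-300K trajectories yields S-Agent-8B, a compact spatial agent.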

Comments Project page: https://Ropedia.github.io/S-Agent

Details
Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on the S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.05833 2026-06-19 cs.CV cs.AI Updated version 80%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

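Affiliations * University of California, Davis

Topic match Spatial understanding: learns 3D geometric representations from videos to improve spatial intelligence.

AI summary Proposes GeoVR, a framework that uses only 2D video sequences and distills 3D geometric knowledge from pre-trained 3D foundation models (camera poses, depth maps, a scale factor, and multi-scale 3D features) to reshape the internal representations of multimodal large language models and give them spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.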

Details
Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

3. 3D reconstruction (21 papers)

2605.00569 2026-06-19 cs.CV cs.GR 95%

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

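Affiliations * TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems

Topic match 3D reconstruction: proposes 2D-SuGaR, a method that improves the geometric accuracy of mesh reconstruction.

AI summary Proposes 2D-SuGaR, which incorporates monocular depth and normal priors to improve the geometric accuracy and robustness of mesh reconstruction from multi-view images, achieving state-of-the-art reconstruction results on the DTU dataset.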

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

Details
Abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for generating photorealistic renderings of a scene in real-time. However, the volumetric nature of 3DGS limits its ability to accurately capture surface geometry. To address this, 2D Gaussian Splatting (2DGS) was proposed to enable view-consistent and geometrically accurate surface reconstruction from multi-view images. However, 2DGS can be sensitive to the initialization of the Gaussian primitives. Reliance on Structure-from-Motion (SfM) initializations, which can produce poor estimates on challenging image sets, may lead to subpar results. In this work, we enhance 2DGS by incorporating monocular depth and normal priors to improve both geometric accuracy and robustness. We propose a depth-guided initialization strategy for Gaussians and introduce a clustering-based technique for pruning degenerate Gaussians. We evaluate our method on the DTU dataset, where it achieves state-of-the-art results in mesh reconstruction while preserving high-quality novel view synthesis.
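A depth-guided initialization presumably begins by back-projecting depth maps into 3D. The sketch below shows only that generic pinhole back-projection step. The function name and the intrinsics parameters (`fx`, `fy`, `cx`, `cy`) are assumptions, and the paper's actual strategy, including how monocular depth is brought to a consistent scale, is not described in the abstract.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3D points.

    depth is a list of rows; each value is the distance along the optical
    axis. Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue
            x = (u - cx) * z / fx          # pinhole model, horizontal axis
            y = (v - cy) * z / fy          # pinhole model, vertical axis
            points.append((x, y, z))
    return points
```

Each returned point could then seed one Gaussian primitive; transforming the points into world coordinates with the camera pose is left out of this sketch.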

2512.00850 2026-06-19 cs.CV Updated version 95%

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Haishan Wang, Mohammad Hassan Vali, Arno Solin

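Affiliations * ELLIS Institute Finland; Aalto University

Topic match 3D reconstruction: compact representations for 3D Gaussian Splatting.

AI summary Proposes Smol-GS, which learns efficient splat-wise features using octree-derived positional encoding and entropy-based compression to obtain a compact 3D Gaussian Splatting representation, greatly reducing storage while preserving rendering quality.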

Details
Abstract

We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient splat-wise features to model 3D space, which capture abstracted cues, including color, opacity, transformation, and material properties. We propose octree-derived positional encoding, which explicitly models spatial locality and enhances representation efficiency. We further apply entropy-based compression to exploit feature redundancy and compress splat coordinates using a recursive voxel hierarchy. This design enables orders-of-magnitude reduction in storage while preserving representation flexibility. Smol-GS achieves state-of-the-art compression performance on standard benchmarks with high-level rendering quality.
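The idea of addressing splat coordinates through a recursive voxel hierarchy can be illustrated with a plain octree path: at each level the bounding box is halved along every axis, and one index from 0 to 7 records which child contains the point. This is a generic sketch, not the Smol-GS codec; the function names are hypothetical and the entropy-coding stage is omitted.

```python
def octree_path(point, bounds_min, bounds_max, depth):
    """Return the child indices (0-7), one per level, that locate `point`."""
    lo, hi = list(bounds_min), list(bounds_max)
    path = []
    for _ in range(depth):
        child = 0
        for axis in range(3):
            mid = 0.5 * (lo[axis] + hi[axis])
            if point[axis] >= mid:
                child |= 1 << axis   # upper half along this axis
                lo[axis] = mid
            else:
                hi[axis] = mid
        path.append(child)
    return path


def decode_path(path, bounds_min, bounds_max):
    """Return the centre of the voxel addressed by `path` (lossy reconstruction)."""
    lo, hi = list(bounds_min), list(bounds_max)
    for child in path:
        for axis in range(3):
            mid = 0.5 * (lo[axis] + hi[axis])
            if (child >> axis) & 1:
                lo[axis] = mid
            else:
                hi[axis] = mid
    return tuple(0.5 * (l + h) for l, h in zip(lo, hi))
```

Each level costs 3 bits per point before any entropy coding, and the reconstruction error along each axis is at most half the voxel edge at the chosen depth.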

2606.20424 2026-06-19 cs.RO New submission 90%

LIT-GS: LiDAR-Inertial-Thermal Gaussian Splatting for Illumination-Robust Mapping

Shikuan Shi, Chunran Zheng, Jiaming Xu, Tianyong Ye, Tao Yu, Yukang Cui

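Affiliations * College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong

Topic match 3D reconstruction: LiDAR-inertial-thermal Gaussian Splatting for illumination-robust mapping.

AI summary Proposes the LIT-GS framework, which uses LiDAR plane-geometry constraints to jointly optimize poses and Gaussians, addressing the fragility of RGB-dependent pipelines under illumination changes and in texture-deficient scenes and improving geometric accuracy and rendering quality.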

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Details
Abstract

Gaussian Splatting has enabled real-time neural rendering, yet existing LiDAR-inertial-visual (LIV) Gaussian mapping pipelines remain fragile under illumination changes and texture-deficient scenes due to their reliance on RGB photometric cues. We present LIT-GS, a LiDAR-inertial-thermal Gaussian Splatting framework that injects LiDAR-derived plane geometry as an explicit constraint in both pose/structure refinement and Gaussian optimization. Specifically, we exploit LIV visual map points as confidence-aware cross-modal anchors to establish reliable thermal-LiDAR associations, and incorporate weighted LiDAR point-to-plane residuals into bundle adjustment to jointly refine camera poses and 3D points under weak thermal supervision. Building on the refined structure, we further introduce a LiDAR-plane-regularized differentiable splatting objective that constrains rendered 3D points to align with locally observed planes, mitigating surface thickening and structural drift in low-contrast thermal imagery. Experiments on proprietary sequences and public datasets demonstrate that LIT-GS consistently improves geometric accuracy and rendering quality over state-of-the-art LIV-based Gaussian Splatting baselines, particularly in challenging lighting conditions.
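The weighted point-to-plane residual named in the abstract has a standard generic form: the signed distance from a point to a plane, squared and scaled by a confidence weight. The sketch below shows that form only. The function names, the per-point plane association, and the weights are assumptions; how LIT-GS builds them and combines the term with its thermal supervision is not specified here.

```python
def point_to_plane_residual(point, plane_normal, plane_point):
    """Signed distance from `point` to a plane.

    The plane passes through `plane_point` and `plane_normal` is assumed
    to be a unit vector.
    """
    return sum(n * (p - q) for n, p, q in zip(plane_normal, point, plane_point))


def weighted_plane_cost(points, planes, weights):
    """Sum of weighted squared point-to-plane residuals.

    planes holds one (unit_normal, point_on_plane) pair per point.
    """
    total = 0.0
    for p, (n, q), w in zip(points, planes, weights):
        r = point_to_plane_residual(p, n, q)
        total += w * r * r
    return total
```

In an optimizer this cost would be minimized over the camera poses and 3D points that determine `points`; the sketch only evaluates it.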

2606.20322 2026-06-19 cs.RO New submission 90%

Towards 3D karst underwater scene reconstruction from rotating sonar data

Georgios Evangelos Margaritis, Lionel Lapierre, Simon Rohou, Zhi Yan, Andreas Nüchter, François Goulette

Affiliations * U2IS, ENSTA, Institut Polytechnique de Paris; Lab-STICC, ENSTA, Institut Polytechnique de Paris; Informatics XVII – Robotics, Julius-Maximilians-Universität Würzburg

Topic match 3D reconstruction: 3D reconstruction of underwater karst scenes.

AI summary To address sparse, noisy sonar data and navigation drift, which make 3D reconstruction difficult, proposes a pipeline that combines continuous-time SLAM trajectory correction with two-stage deep-learning surface reconstruction to produce an immersive, navigable 3D mesh.

Comments 1st Workshop on Long-term Deployments in the Wild (LoWi)

Details
Abstract

Karst aquifers provide critical freshwater resources but pose significant hazards due to their complex and poorly understood subsurface geometry. Mapping these environments is challenging because sonar data from underwater exploration is sparse and noisy, while navigation estimates suffer from drift, limiting standard 3D reconstruction methods. We present a pipeline for reconstructing underwater karst conduits from a sonar profiler. We combine a continuous-time SLAM approach to correct trajectory drift with a novel two-stage deep learning method for surface reconstruction, producing an immersive and navigable 3D mesh for hydrogeological analysis.

2606.20131 2026-06-19 cs.CV cs.GR New submission 90%

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

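Affiliations * Technical University of Munich; AUDI AG; University of Virginia

Topic match 3D reconstruction: generates artist-like 3D mesh topology.

AI summary Proposes TriFlow, a generative method based on nearest-vertex vector fields (NVFs): a flow-matching model synthesizes the NVF, which guides topology-aware mesh simplification to produce compact 3D meshes with artist-like topology directly from input geometry conditions.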

Details
Abstract

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
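The Chamfer Distance reported in the abstract compares two point sets sampled from surfaces. Several variants are in use (squared or unsquared distances, summed or averaged directions), and the abstract does not say which one the authors use. The brute-force sketch below implements one common choice, the mean squared nearest-neighbour distance summed over both directions; the function names are hypothetical.

```python
def _nearest_squared_distance(p, cloud):
    """Squared Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points."""
    a_to_b = sum(_nearest_squared_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(_nearest_squared_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a
```

The double loop is quadratic in the number of points, which is fine for a few thousand samples; larger clouds would normally use a spatial index such as a k-d tree.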

2606.15966 2026-06-19 cs.CV cs.GR New submission 90%

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

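Affiliations * Google XR

Topic match 3D reconstruction: proposes an end-to-end pipeline for dynamic hand capture and registration.

AI summary Proposes an end-to-end pipeline for dynamic hand capture and registration with a limited number of views (about 20), using a mask-free neural method and a physics-inspired framework to handle geometric ambiguity and self-contact deformation; high-fidelity reconstruction and registration are validated on more than 12,000 sequences.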

Details
Abstract

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups (~20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://vephand.github.io/.

2606.15908 2026-06-19 cs.CV New submission 90%

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

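Affiliations * Google XR; University of Science and Technology of China (USTC); The Hong Kong University of Science and Technology (Guangzhou)

Topic match 3D reconstruction: high-fidelity 4D reconstruction of hand-object interaction.

AI summary Proposes a template-free, marker-free multi-view system that initializes with a transformer aggregating cross-view geometry and temporal cues, then refines with physics-aware Gaussian optimization, for robust, artifact-free 4D reconstruction of hand-object interaction.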

Comments Project page: https://hostpg.github.io/

Details
Abstract

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet estimating these poses from images is challenging, in particular under the severe occlusion inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2604.13416 2026-06-19 cs.CV cs.AI Updated version 90%

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

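Affiliations * University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

Topic match 3D reconstruction: a dataset and benchmark for distractor-free novel view synthesis.

AI summary To fill the lack of large-scale real-world datasets for distractor-free radiance fields, builds DF3DV-1K, a dataset of 1,048 scenes that each provide clean and cluttered image sets, and uses it to benchmark nine recent methods, identifying the most robust methods and the most challenging scenarios.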

Details
Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.
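For readers less familiar with the PSNR figure quoted in the abstract, the sketch below computes PSNR between a reference image and a rendering using the standard definition, 10 · log10(MAX² / MSE). It is a generic illustration, not the benchmark's evaluation code; the flat-list image format and the `max_value` default are assumptions.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixel values."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this definition, a 0.96 dB gain on a single image corresponds to the mean squared error falling to about 80% of its previous value, since 10^(-0.096) ≈ 0.80.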

2606.20563 2026-06-19 cs.CV New submission 85%

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Siang-Ling Zhang, Huai-Hsun Cheng, Tsung-Ju Yang, Yu-Lun Liu

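Affiliations * National Yang Ming Chiao Tung University

Topic match 3D reconstruction: generates 3D visual illusions, involving 3D mesh and texture synthesis.

AI summary Proposes a fast, training-free framework that uses cross-space dual-branch denoising and view-conditioned texture synthesis to generate highly realistic dual-semantic 3D visual illusions in 3-5 minutes, outperforming existing methods.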

Comments ECCV 2026. Project page: https://siang1105.github.io/JanusMesh.github.io/

Details
Abstract

Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/

2606.19874 2026-06-19 cs.RO cs.CV 新提交 85%

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

MMD-SLAM:结构增强的多元高斯分布引导视觉SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院合肥物质科学研究院) University of Science and Technology of China(中国科学技术大学) Aarhus University(奥胡斯大学) University of Tokyo(东京大学) Beijing University of Chemical Technology(北京化工大学) North China Electric Power University(华北电力大学)

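Topic match 3D reconstruction: 3DGS-based visual SLAM with structure-enhanced mapping

AI summary Proposes MMD-SLAM, which uses the Atlanta World assumption to guide a Multi-Meta Gaussian representation and improves visual SLAM tracking accuracy and mapping quality through point-line fusion, dominant-direction encoding, and a Gaussian evolution strategy.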

Comments ICRA 2026

详情
AI中文摘要

3D高斯泼溅(3DGS)显著提升了新视角合成和高保真场景重建,扩展了基于3DGS的视觉同步定位与建图(SLAM)方法的潜力。然而,大多数现有系统未能充分利用底层结构信息,这限制了渲染质量并常常导致地图不一致。为了解决这些限制,我们提出了MMD-SLAM,一个结构增强的视觉SLAM框架,利用亚特兰大世界(AW)假设来引导多元高斯表示以实现逼真的建图。首先,我们引入了一种点线融合策略用于位姿优化,其中3D线段被纳入以提高跟踪鲁棒性并为建图提供额外约束。其次,我们设计了一种具有主导方向的多元高斯表示,显式编码来自AW假设的结构先验。最后,我们提出了一种高斯进化策略,该策略适应场景几何并将结构线索融入全局优化。大量实验表明,这些创新使MMD-SLAM在跟踪精度和建图质量方面均达到了最先进的性能。例如,与MonoGS相比,我们的方法在ScanNet上实现了48.56%的ATE RMSE降低,在Replica上实现了5.71%的PSNR提升。

英文摘要

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.
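For readers unfamiliar with the tracking metric quoted above, the following is a minimal sketch of how an ATE RMSE and a relative reduction are typically computed. It assumes the two trajectories are already time-associated and aligned (real evaluations usually align them first), and the function names and toy trajectories are illustrative only, not taken from the paper.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of the translational error between two already-aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy data (hypothetical): both estimates drift sideways by a constant offset.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
baseline_traj = [(0.0, 0.2, 0.0), (1.0, 0.2, 0.0), (2.0, 0.2, 0.0)]
ours_traj = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]

baseline_err = ate_rmse(baseline_traj, gt)
ours_err = ate_rmse(ours_traj, gt)
reduction = relative_reduction(baseline_err, ours_err)
print(f"baseline={baseline_err:.3f} ours={ours_err:.3f} reduction={reduction:.1f}%")
```

In this toy case the error halves, so the reported reduction is about 50%; the 48.56% figure in the abstract is a reduction of the same kind measured against MonoGS.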

2606.19733 2026-06-19 cs.CV cs.AI 新提交 85%

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

QueryGaussian: 可扩展且无需训练的开词汇3D实例检索

Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) State Key Laboratory of Communication Content Cognition(通信内容认知国家重点实验室) Peng Cheng Laboratory(鹏城实验室)

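Topic match 3D reconstruction: proposes a training-free 3D instance retrieval framework that builds on 2D vision models

AI summary Proposes QueryGaussian, a training-free open-vocabulary 3D instance retrieval framework that decouples semantics from geometry through an instance-level query mechanism. Combined with 2D vision models and a temporal fusion module, it maintains accuracy while cutting GPU memory by over 70% and accelerating inference by 180x, and it supports city-scale scenes.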

Comments 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

详情
AI中文摘要

通过自然语言提示从大规模场景中高效检索特定3D实例仍然是多媒体分析中的一个严峻挑战。现有方法主要遵循“场景级嵌入”范式,需要将高维语义特征蒸馏到每个3D基元中。这种策略存在一个根本性的架构瓶颈:内存和计算成本随场景复杂度线性增长,不可避免地导致城市级环境中的内存溢出(OOM)故障。为了解决这一障碍,我们提出了QueryGaussian,一个无需训练的框架,用于快速且可扩展的开词汇3D实例检索。与整体语义蒸馏不同,QueryGaussian采用实例级查询机制,将语义理解与几何表示解耦。具体来说,我们利用预训练的2D视觉模型解释用户提示,并通过并发最大权重关联策略将分割掩码提升到3D,确保语义-视觉一致性。为了缓解投影歧义,我们引入了一个具有多阶段自适应密度聚类的时间融合模块。实验结果表明,QueryGaussian不仅匹配了最先进方法的准确性,还实现了决定性的效率飞跃,将GPU内存使用减少超过70%,并将推理速度提升180倍。关键的是,QueryGaussian能够在包含数千万个高斯的城市级场景中,使用消费级硬件实现快速的实例检索。

英文摘要

Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 新提交 85%

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

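Topic match 3D reconstruction: self-supervised 3D object-centric scene representation learning that decomposes scenes into 3D particles

AI summary Proposes 3D-DLP, a self-supervised model that decomposes scene-level RGB-D or voxel observations into 3D latent particles, each encoding disentangled attributes. It yields interpretable per-particle segmentation maps and supports scene manipulation and downstream robotic manipulation.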

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新 85%

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Bosch Research(博世研究院) University of Haifa(海法大学)

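Topic match 3D reconstruction: Latent Gaussian Splatting for 4D occupancy tracking

AI summary Proposes Latent Gaussian Splatting (LaGS), which uses feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features for 4D panoptic occupancy tracking, achieving state-of-the-art performance on Occ3D nuScenes and Waymo.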

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

详情
AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而,现有方法通常只解决部分问题:它们要么通过边界框提供粗略的几何跟踪,要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中,我们提出了潜在高斯泼溅(LaGS)用于4D全景占据跟踪(4D-POT)。我们重新审视底层表示,将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点,在泼溅到体素网格进行解码之前,能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互,这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节,进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型:https://lags.cs.uni-freiburg.de/。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2503.01425 2026-06-19 cs.GR cs.CV 版本更新 85%

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

MeshPad: 交互式草图条件艺术家风格网格生成与编辑

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑技术大学) AUDI AG(奥迪股份公司)

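Topic match 3D reconstruction: generates and edits 3D meshes from sketches, which falls under 3D reconstruction

AI summary Proposes MeshPad, an interactive sketch-conditioned method for 3D mesh generation and editing. It decomposes edits into deletion and addition of mesh regions and combines a Transformer with vertex-aligned speculative prediction for fast iterative editing, improving mesh quality by more than 22% in Chamfer distance and earning a 90% user preference.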

Comments Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

详情
AI中文摘要

我们介绍了MeshPad,一种从草图输入生成3D网格的生成方法。基于最近在艺术家风格三角形网格生成方面的进展,我们的方法解决了交互式网格创建的需求。为此,我们专注于通过将编辑分解为网格区域的“删除”和随后新网格几何的“添加”来实现一致编辑。这两个操作都由用户对草图图像的简单编辑触发,促进了迭代内容创建过程,并能够构建复杂的3D网格。我们的方法基于三角形序列网格表示,利用大型Transformer模型进行网格三角形的添加和删除。为了交互式地执行编辑,我们在加法网格生成器之上引入了一种顶点对齐的推测预测策略。该推测器预测对应于一个顶点的多个输出标记,从而显著降低推理的计算成本并加速编辑过程,使得每个编辑步骤只需几秒钟即可完成。综合实验表明,MeshPad优于最先进的草图条件网格生成方法,在Chamfer距离上实现了超过22%的网格质量改进,并且在感知评估中被90%的参与者所偏好。

英文摘要

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
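The Chamfer distance cited above compares two point sets by nearest-neighbor distances, and lower is better. The sketch below uses one common convention (mean squared nearest-neighbor distance, summed over both directions); definitions vary across papers, so it is an assumption that this matches the variant used by MeshPad. The brute-force search is only suitable for small point sets.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both ways."""
    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(source, target):
        # For every source point, find its closest target point (brute force).
        return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy data (hypothetical): a unit square and a copy lifted by 0.5 along z.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
lifted = [(x, y, z + 0.5) for x, y, z in square]

same = chamfer_distance(square, square)
moved = chamfer_distance(square, lifted)
print(same, moved)
```

In practice the point sets are sampled from the generated and reference mesh surfaces, and a spatial index (for example a k-d tree) replaces the quadratic search.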

2606.20556 2026-06-19 cs.CV 新提交 80%

Thinking in Boxes: 3D Editing in Real Images Made Easy

Thinking in Boxes: 真实图像中的3D编辑变得简单

Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D. A. Forsyth, Anand Bhattad

发表机构 * Indian Institute of Science(印度科学研究所) Apple(苹果公司) UIUC(伊利诺伊大学厄巴纳-香槟分校) Johns Hopkins University(约翰霍普金斯大学)

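Topic match 3D reconstruction: 3D editing in real images using 3D boxes

AI summary Uses 3D boxes as structured specifications: the user supplies input and output boxes to precisely control translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity and recovering previously unseen object regions.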

Comments Project Page: https://thinking-in-boxes.github.io/

详情
AI中文摘要

文本和2D条件接口在图像编辑中提供对空间变换的弱、模糊控制——特别是在大物体运动和相机变化下。先前的工作使用了如盒子这样的3D基元,但仅作为松散的调节信号指示近似物体位置,而非指定变换。我们则使用3D盒子作为结构化规范:用户提供编辑的输入和输出盒子,将编辑视为一个适定的几何问题。这种“在盒子中思考”的界面,其中每个盒子面都带有颜色编码以传达3D方向,提供了对真实图像中平移、旋转、缩放和视角变化的精确控制,同时保留场景和物体身份,并恢复之前未见的物体区域。为了将变换与场景外观联系起来,我们引入了一个深度对齐的平面地板作为全局参考框架,并用深度感知线索进行着色。基于这种结构,图像生成器在大变换下产生一致的结果。该系统在两个阶段训练——在合成多物体场景和来自Objectron的小型真实世界视频集上——能够泛化到复杂的、野外真实图像。我们的方法直接作用于真实照片,并在大型3D编辑上显著优于最近的最先进方法。

英文摘要

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing, particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This "thinking in boxes" interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages (on synthetic multi-object scenes and on a small set of real-world videos from Objectron), the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

2606.19718 2026-06-19 cs.CV 新提交 80%

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

基于3D先验引导扩散模型的单样本新视角与姿态人体图像合成

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院教育部高维信息智能感知与系统重点实验室、江苏省社会安全图像与视频理解重点实验室及PCA实验室) Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center(国防科技大学电子工程学院安徽省先进激光技术实验室及江淮前沿技术中心)

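Topic match 3D reconstruction: uses 3D human priors to guide image generation

AI summary Proposes a conditional denoising diffusion approach that uses 3D human priors (normal maps and color prompts) as geometry and color conditions to synthesize high-quality human images in arbitrary poses and views from a single reference image, including occluded parts.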

Comments 30 pages, 10 figures

详情
AI中文摘要

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态,或基于可泛化人体NeRF(使用人体模型先验提取逐点特征)合成人体图像。然而,基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态,而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题,我们提出了一种基于条件去噪扩散模型的新方法,用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言,为了生成具有复杂和任意姿态的人体,我们将3D人体先验(即3D法线图和颜色提示)作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体,我们的扩散模型能够实现高质量合成,包括被遮挡/不可见部分。此外,我们提出了一种基于自重建的自定义细化方法,以在新人物上测试时增强细节。在多个公共数据集上的实验结果表明,我们的方法显著优于先前方法,并显示出更好的跨数据集泛化能力。代码将在 https://github.com/Yankeegsj/3DPGDM 上公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints or synthesize human images with generalizable human NeRFs, which use human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may recover occluded/invisible human parts inaccurately when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., 3D normal map and color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction-based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.

2606.18951 2026-06-19 cs.RO 新提交 80%

A High-accuracy Event-based Underwater SLAM System

高精度事件相机水下SLAM系统

Yifan Peng, Qihang Liu, Haoying Li, Yuzhe Li, Junfeng Wu, Ziyang Hong

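Topic match 3D reconstruction: event-based underwater SLAM, which falls under 3D reconstruction

AI summary Targets poor Time Surface imaging quality and matching failures in event-based underwater SLAM. Proposes a high-accuracy stereo SLAM system built on a structure-aware metric and Bayesian Optimization, and contributes UWE, the first high-quality underwater event dataset.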

详情
AI中文摘要

虽然事件相机为水下SLAM提供了巨大潜力,但现有的基于时间曲面(TS)的方法在水下部署时被证明非常不可靠。波动的相机速度严重降低了TS成像质量,而宽立体基线和重复的水下纹理导致关键匹配失败,频繁引发系统崩溃。为克服这些挑战,我们开发了首个高精度事件相机水下立体SLAM系统。基于结构张量相干性和梯度,设计了一种结构感知度量来定量评估TS结构信息密度。通过将最优TS生成解耦为基于系统初始化的两个不同阶段,贝叶斯优化(BO)在初始化前首先预测最优先验TS,同时我们设置异步在线局部搜索方法,在跟踪阶段实时获取合适的TS。我们使用先验视差保证精确的数据关联,并采用“最新观测优先”三角测量机制实现稳定三角测量。作为这些解决方案的基准和社区资源,我们还贡献了UWE,这是首个高质量真实世界水下事件数据集,包含变化的相机运动、复杂纹理和不同轨迹特征。在公共数据集和UWE上的广泛评估表明,所提出的SLAM系统与最先进的事件相机方法相比具有竞争力的精度性能。代码和数据将开源。

英文摘要

While event cameras offer immense potential for underwater SLAM, existing Time Surface (TS)-based methods prove highly unreliable when deployed underwater. Fluctuating camera velocities severely degrade TS imaging quality, while wide stereo baselines and repetitive underwater textures induce critical matching failures, frequently triggering system failure. To overcome these challenges, we develop the first high-accuracy event-based underwater stereo SLAM system. A structure-aware metric for TS is designed based on structure tensor coherence and gradients to quantitatively evaluate TS structural information density. By decoupling optimal TS generation into two distinct stages based on system initialization, Bayesian Optimization (BO) first predicts an optimal prior TS sequentially before initialization, while an asynchronous online local search runs periodically to obtain an appropriate TS in real time during the tracking stage. We use the prior disparity to guarantee precise data association and a "latest-observation-first" triangulation mechanism to realize stable triangulation. As a benchmark for these solutions and a resource for the community, we also contribute UWE, the first high-quality real-world underwater event dataset containing variable camera motions, complex textures and different trajectory features. Extensive evaluations on public datasets and UWE show the competitive accuracy of the proposed SLAM system compared to the state-of-the-art event-based method. The code and data will be open-sourced.

2508.15228 2026-06-19 cs.CV 版本更新 80%

Collaborative Multi-Modal Coding for High-Quality 3D Generation

协作多模态编码用于高质量3D生成

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University, Singapore(南洋理工大学S实验室) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

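Topic match 3D reconstruction: collaborative multi-modal coding for 3D generation

AI summary Proposes TriMM, the first feed-forward 3D-native generative model. It fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.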

详情
AI中文摘要

3D内容本质上具有多模态特性,可投影到不同模态(如RGB图像、RGBD和点云)。每种模态在3D资产建模中表现出独特优势:RGB图像包含生动的3D纹理,而点云定义精细的3D几何。然而,现有大多数3D原生生成架构要么主要在单模态范式下运行——从而忽略了多模态数据的互补优势,要么局限于3D结构,从而限制了可用训练数据集的范围。为了全面利用多模态进行3D建模,我们提出了TriMM,这是第一个从基本多模态(如RGB、RGBD和点云)学习的前馈式3D原生生成模型。具体来说,1) TriMM首先引入协作多模态编码,该编码在保留各模态独特表示优势的同时整合模态特定特征。2) 此外,引入辅助2D和3D监督以提高多模态编码的鲁棒性和性能。3) 基于嵌入的多模态编码,TriMM采用三平面潜在扩散模型生成更高质量的3D资产,增强了纹理和几何细节。在多个知名数据集上的大量实验表明,TriMM通过有效利用多模态,尽管使用少量训练数据,仍能达到与在大规模数据集上训练的模型相竞争的性能。此外,我们在最近的RGB-D数据集上进行了额外实验,验证了将其他多模态数据集纳入3D生成的可行性。

英文摘要

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2509.13972 2026-06-19 cs.RO 版本更新 80%

BIM Informed Visual SLAM for Construction Environments

BIM 引导的视觉 SLAM 在建筑环境中的应用

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠与信任跨学科研究中心(SnT),卢森堡大学)

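Topic match 3D reconstruction: uses BIM to enhance visual SLAM and reduce trajectory drift

AI summary Addresses visual SLAM trajectory drift in construction environments by augmenting an RGB-D SLAM system with structural priors from the Building Information Model (BIM). Wall correspondences serve as geometric constraints in the optimization, reducing drift and improving global consistency; experiments report a 25.23% reduction in trajectory error and a 7.14% improvement in map accuracy.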

Comments 9 pages, 7 tables, 4 figures

详情
AI中文摘要

监测建筑施工现场需要将计划设计与实际建造状态进行比较,而同步定位与地图构建(SLAM)技术可以实时估计实际状态。然而,视觉SLAM在建筑环境中容易产生轨迹漂移,生成的地图在几何上与实际环境不准确。为解决这一局限,我们利用从建筑信息模型(BIM)导出的结构先验增强现有的RGB-D SLAM系统。该系统将检测到的墙面与BIM中的对应墙面关联,并将这些对应关系作为几何约束加入后端优化,从而减少漂移并增强全局一致性。所提方法实时运行,并在多个真实建筑工地上验证,与最先进的基线相比,平均轨迹误差降低25.23%,地图精度提升7.14%。鲁棒性分析进一步表明,该方法对不完整的BIM数据以及计划模型与实际环境之间的几何差异具有韧性。

英文摘要

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inaccurate with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.

2606.20404 2026-06-19 cs.CV 新提交 70%

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FlowBender: 面向自校正条件流的反馈感知训练

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达) University of Toronto(多伦多大学) Vector Institute(向量研究所)

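Topic match 3D reconstruction: the method is applied to 3D mesh texturing, which relates to 3D reconstruction

AI summary Addresses conditional diffusion/flow models that often violate task constraints. Proposes FlowBender, a closed-loop framework that feeds the alignment error back as an input so the network learns a correction policy, improving fidelity and plausibility together in image translation, restoration, and 3D mesh texturing.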

Comments Project page: https://flow-bender.github.io/

详情
AI中文摘要

条件扩散和流模型通常无法满足定义其任务的约束条件。例如,深度条件模型经常产生重新提取的深度与输入不一致的图像,尽管定义约束的前向算子(深度预测器)在训练和推理期间都可用。现有方法通常分为两类:将条件信号视为静态线索并在推理时忽略对齐信息的监督模型,以及通过手动调整的线性更新咨询约束的基于引导的方法,通常以生成样本的合理性为代价来换取对条件的保真度。我们认为这两种范式的根本差距在于模型从未被训练利用自身的对齐误差。我们引入FlowBender,一个闭环框架,将此误差视为一等输入,训练网络学习基于推理时反馈的校正策略。在每一步,无引导的前瞻传递估计干净信号,通过前向算子计算特定任务的偏差,然后细化传递消耗此信号以产生校正速度。我们提出了FlowBender的几种变体,包括用于可微算子的基于梯度的公式和用于不可微设置(如JPEG压缩)的零阶变体。为了实现高效采样,我们引入了一个前一步捷径,使得以最小的额外计算成本实现闭环校正。在图像到图像翻译、复原和3D网格纹理贴图中,FlowBender始终优于标准监督基线、对齐损失增强训练和最先进的推理时引导,同时提高保真度和合理性,而不是在它们之间进行权衡。项目页面:https://flow-bender.github.io/

英文摘要

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator (the depth predictor defining the constraint) is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/

2606.19828 2026-06-19 cs.CV 新提交 70%

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

3D-PLOT-LLM: 用于三维大语言模型的部件级对象标记

Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C. -C. Jay Kuo

发表机构 * University of Southern California(南加州大学) Ohio State University(俄亥俄州立大学)

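Topic match 3D reconstruction: processes 3D point clouds and enables part-level understanding

AI summary Proposes 3D-PLOT-LLM, which reorganizes the input token stream so that parts are directly addressable through the LLM vocabulary, with no segmentation decoder or bounding boxes, and outperforms existing methods on part-level benchmarks.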

详情
AI中文摘要

三维多模态大语言模型(3D MLLMs)将3D对象作为一个整体进行描述,但无法处理、命名或推理其部件。先前的部件感知尝试增加了分割解码器、更重的3D编码器或边界框语法,导致参数成本大幅增加。我们采取了一条根本不同的路径:重新组织输入标记流,使得部件通过LLM自身的词汇变得可直接寻址。我们的模型3D-PLOT-LLM将冻结的点编码器的块分割成K个局部一致的区域,并在每个区域的块标记之前插入一个可学习的每区域标记和一个保留词汇标记<part_k>;然后,一个标记空间精化(MSR)模块根据每个区域的空间统计信息和邻接邻居对该标记进行条件化。因此,模型在其输出中引用部件,并遵循通过标记引用部件的提示,这是先前对象级3D MLLMs所不具备的能力。为了探究这一接口,我们构建了PartVerse-QA,一个基于PartVerse网格注释改编的词汇级部件问答基准(77K训练对和588个保留查询,基于不相交的对象划分),在该基准上,3D-PLOT-LLM达到了描述到槽的Jaccard指数0.459和精确匹配率13.78%,槽到描述的GPT-4o评判得分为44.68。在3DCoMPaT-GrIn部件感知接地描述基准上,3D-PLOT-LLM在所有文本输出指标上优于PointLLM、Kestrel、PARIS3D和SegPoint,并在4项指标中的3项上优于ShapeLLM,相比PointLLM的GPT-4o评判得分最高提升+3.03。在Objaverse整体对象描述中,在第二阶段添加PartVerse-QA使得相比PointLLM的SBERT得分提升+0.65,GPT-4o得分提升+1.85,并且在5项传统指标中的4项(SBERT、SimCSE、BLEU-1、METEOR)上超过PointLLM-PiSA,尽管其目标是不同的(部件接地)目标。所有这些仅需在冻结的点编码器上增加不到100万个可训练参数,比先前的部件感知3D MLLMs低一个数量级,且无需分割解码器或边界框头。

英文摘要

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token <part_k>; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All of this is achieved with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and with no segmentation decoder or bounding-box head.
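The token-stream reorganization described above can be illustrated with a small sketch. It uses plain strings in place of embeddings, and the function name, the `<marker_k>` naming, the marker-before-`<part_k>` ordering, and the skipping of empty regions are all assumptions made for illustration; the abstract only states that a learnable marker and a reserved `<part_k>` token precede each region's patch tokens.

```python
def build_token_stream(patch_tokens, region_of_patch, num_regions):
    """Group patch tokens by region and prefix each group with its marker tokens."""
    stream = []
    for k in range(num_regions):
        members = [tok for tok, region in zip(patch_tokens, region_of_patch) if region == k]
        if not members:
            continue  # assumption: regions without patches contribute no tokens
        stream.append(f"<marker_{k}>")  # stands in for the learnable per-region marker
        stream.append(f"<part_{k}>")    # reserved vocabulary token the LLM can cite
        stream.extend(members)
    return stream

# Toy data (hypothetical): five patches assigned to K = 3 regions.
patches = ["p0", "p1", "p2", "p3", "p4"]
regions = [0, 1, 0, 2, 1]
stream = build_token_stream(patches, regions, num_regions=3)
print(stream)
```

Because each `<part_k>` is an ordinary vocabulary entry, a prompt or an answer can refer to a region simply by emitting that token, which is the addressing behavior the abstract attributes to the model.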

4. NeRF (1 paper)

2606.20531 2026-06-19 cs.CV 新提交 85%

VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

VisDom: 具有可见域约束的稀疏新视角合成

Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung, Daniel Cremers

发表机构 * TU Munich(慕尼黑工业大学) MCML(慕尼黑机器学习中心)

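Topic match NeRF: proposes a visible domain constraint that strengthens sparse-view synthesis for NeRF and GS

AI summary Proposes VisDom, a learning-free geometric constraint that augments visual hull reconstruction with a minimum multi-view visibility requirement. Used as a spatial prior for sparse novel view synthesis and integrated into NeRF and GS pipelines, it enables high-quality reconstruction from four input images.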

详情
AI中文摘要

稀疏新视角合成(NVS)由于从少量输入视角恢复3D几何的歧义性仍然具有挑战性。虽然基于NeRF和高斯泼溅(GS)的方法在密集监督下表现良好,但在稀疏设置中它们往往过拟合,产生漂浮伪影和不一致的几何。轮廓一致性通常用作正则化器,但还不够,因为轮廓一致区域可能超出真实物体几何。我们引入VisDom,一种无学习的几何约束,通过强制执行最小多视角可见性要求来增强经典的基于雕刻的视觉外壳重建。具体地,我们将可见域定义为至少被K个视角观察到的3D空间子集,并将其用作标准基于轮廓重建之上的额外过滤标准。这在稀疏视角设置中提供了更强的空间先验。我们通过限制体积采样和指导优化过程中的高斯放置,将VisDom集成到隐式(NeRF)和显式(GS)管线中。在三个具有挑战性的数据集上的实验表明,稀疏NVS的一致改进,使得从仅四张输入图像就能实现高质量以物体为中心的重建。我们的方法是领域无关的,仅需要轮廓,并且不引入学习参数,使其成为现有方法的简单补充。在GaussianObject之上应用VisDom进一步提高了在Omni3D和MipNeRF360上的性能,同时以低22倍的训练成本达到或超越其性能。

英文摘要

Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least K views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22× lower training cost.
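The visible-domain filter described above can be sketched as a per-point test on top of silhouette carving. The version below is a simplified illustration, not the authors' implementation: it assumes the per-view projections have already been computed into two boolean tables, and it assumes a view that does not see a point cannot carve it away. The function and argument names are hypothetical.

```python
def visible_domain_mask(in_frustum, in_silhouette, k):
    """Keep points that survive silhouette carving AND are seen by at least k views.

    in_frustum[v][n]    -- True if point n projects inside the image of view v
    in_silhouette[v][n] -- True if point n projects inside the silhouette of view v
    """
    num_views = len(in_frustum)
    num_points = len(in_frustum[0])
    keep = []
    for n in range(num_points):
        views_seeing = [v for v in range(num_views) if in_frustum[v][n]]
        # Classical carving: any view that sees the point outside its silhouette rejects it.
        inside_hull = all(in_silhouette[v][n] for v in views_seeing)
        # Visible-domain criterion: require a minimum number of observing views.
        seen_enough = len(views_seeing) >= k
        keep.append(inside_hull and seen_enough)
    return keep

# Toy data (hypothetical): 3 views, 4 candidate points.
in_frustum = [
    [True, True, True, False],
    [True, True, False, False],
    [True, False, False, True],
]
in_silhouette = [
    [True, True, True, False],
    [True, False, False, False],
    [True, False, False, True],
]

hull_only = visible_domain_mask(in_frustum, in_silhouette, k=1)
with_visdom = visible_domain_mask(in_frustum, in_silhouette, k=2)
print(hull_only, with_visdom)
```

In the toy data, points 2 and 3 are each seen by a single view, so carving alone keeps them; requiring two observing views removes them, which is the kind of weakly observed region where floating artifacts tend to appear in sparse-view settings.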

5. Other (1 paper)

2606.20103 2026-06-19 cs.CV New submission 80%

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

3D高斯溅射中保持几何结构的LiDAR-相机外参标定

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

Affiliations * Kyung Hee University

Topic match: uses 3DGS for geometric calibration

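AI summary To address the scarcity of cross-modal features in LiDAR-camera calibration, the paper preserves the metric geometry of the 3DGS proxy through multi-view LiDAR depth supervision and by blocking photometric gradients from updating the Gaussian spatial parameters, improving calibration accuracy.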

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

Details
AI Chinese abstract

精确的LiDAR-相机标定对于鲁棒的多模态感知至关重要。无目标方法避免了手动设置,但仍受限于跨模态判别特征的稀缺性。最近的方法通过在可微模型中重建场景,通过密集光度监督实现外参优化。其中,3D高斯溅射(3DGS)被广泛用作几何代理,在单一可微框架内桥接LiDAR和相机。然而,由于3DGS最初是为新视图合成设计的,现有方法倾向于优先考虑渲染质量,导致代理几何偏离真实的LiDAR结构。我们提出了一种框架,通过聚合多视图LiDAR观测进行密集深度监督,并阻止光度梯度更新高斯空间参数,从而保持高斯代理的度量几何。我们在公开驾驶数据集上验证了该方法,在标定精度上持续优于现有无目标方法。

English abstract

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

6. Gaussian Splatting (1 paper)

2606.19586 2026-06-19 cs.RO New submission 80%

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

一个演示胜过千条轨迹:用于视觉运动策略的动作-视角增强

Chuer Pan, Litian Liang, Dominik Bauer, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Shuran Song

Affiliations * Stanford University, Columbia University, Toyota Research Institute

Topic match Gaussian Splatting: reconstructs the 3D scene with Gaussian Splatting for data augmentation

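AI summary Proposes a data augmentation framework that uses Gaussian Splatting and trajectory optimization to generate realistic fisheye image sequences and physically feasible action trajectories, improving the success rate of manipulation policies under scene changes and obstacles.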

Comments Project website: https://chuerpan.com/1001-demos.github.io/. Published at CoRL 2025

Journal ref Proceedings of The 9th Conference on Robot Learning, PMLR 305:3902-3914, 2025

Details
AI Chinese abstract

用于操作的视觉运动策略在建模复杂机器人行为方面展现出显著潜力,但机器人初始配置的微小变化和未见障碍物容易导致分布外观测。在没有大量数据收集工作的情况下,这些会导致灾难性的执行失败。在这项工作中,我们引入了一个有效的数据增强框架,该框架从真实世界的眼在手演示中生成视觉上逼真的鱼眼图像序列和相应的物理上可行的动作轨迹,这些演示使用带有单个鱼眼摄像头的便携式平行夹爪捕获。我们引入了一种新颖的高斯泼溅公式,适用于广角鱼眼摄像头,以重建和编辑带有未见物体的3D场景。我们利用轨迹优化生成平滑、无碰撞、视图渲染友好的动作轨迹,并从相应新视角渲染视觉观测。在仿真和现实世界中的综合实验表明,我们的增强框架提高了各种操作任务在相同场景和需要避障的增强场景中的成功率。

English abstract

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

7. Other 3D vision (1 paper)

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG New submission 70%

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Token 是群元素:关于矩阵李群上的李代数注意力

Przemyslaw Musialski

Affiliations * New Jersey Institute of Technology

Topic match Other 3D vision: attention mechanism on Lie groups, applicable to 3D transformations

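AI summary Proposes Lie-Algebra Attention, which defines tokens as matrix Lie group elements and uses the Lie-algebra norm of the relative pose as the attention score, requiring no learned kernel or representation-theoretic machinery; it applies to non-compact non-abelian groups such as the affine full-frame groups.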

Comments preprint, 19 pages, 3 figures

Details
AI Chinese abstract

我们将注意力token置于群上:一个token是矩阵李群$G$的一个元素$g_i$——一个纯粹的变换,没有特征负载,也没有外部作用$\rho(g)$承载它。据我们所知,这是第一个token为裸矩阵李群元素的注意力构造:它们的分数是相对位姿的闭式代数范数,而非学习核,并且它达到了每个基于不可约表示或满射指数的方法必须排除的仿射全帧群。我们称之为李代数注意力。一旦token是群元素,其余部分无需通常的表示论机制。一对的相对几何是规范的,即$g_i^{-1} g_j$,因此成对不变量$w_{ij} = \log(g_i^{-1} g_j)$是内在的而非设计的;在$G$对角作用下的等变性是重言式的,且余循环条件自动成立。注意力分数是负平方代数范数$s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$:在块加权Frobenius内积下的规范邻近核,无需不可约表示、球谐函数、Clebsch-Gordan积或学习核。该构造适用于任何矩阵李群,在包含相对位姿的选定对数图上,包括具有尺度和剪切的非紧致非阿贝尔仿射群,这些是向量token注意力方法无法达到的:既不是不可约表示传统,也不是满射指数方法。在SE(2)、SO(3)和Aff(2)上的三个序列补全实验证实了这一点:闭式分数匹配了相同不变量上的学习MLP核,并在SE(2)上优于它,使用的分数参数少50到80倍,而向量token基线破坏了不变量,误差达五到十二个数量级。

English abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
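The score $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$ can be made concrete on SE(2), one of the groups used in the experiments. The sketch below is illustrative only and is not taken from the paper. It assumes poses stored as `(theta, x, y)` tuples, block weights `lam_rot` and `lam_trans` applied to the rotation and translation blocks of the $3\times 3$ matrix logarithm, the principal branch of the logarithm, and a plain softmax over scores. All function names are hypothetical.

```python
import math


def se2_compose(h, g):
    """Group product h * g for SE(2) poses stored as (theta, x, y)."""
    th, xh, yh = h
    tg, xg, yg = g
    c, s = math.cos(th), math.sin(th)
    return (th + tg, xh + c * xg - s * yg, yh + s * xg + c * yg)


def se2_relative(gi, gj):
    """Relative pose g_i^{-1} g_j."""
    ti, xi, yi = gi
    tj, xj, yj = gj
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Translation part is R_i^T (t_j - t_i).
    return (tj - ti, c * dx + s * dy, -s * dx + c * dy)


def se2_log(g):
    """Principal logarithm of an SE(2) element, returned as (omega, vx, vy)."""
    theta, x, y = g
    omega = math.atan2(math.sin(theta), math.cos(theta))  # wrap the angle
    if abs(omega) < 1e-9:
        return (omega, x, y)  # V is (numerically) the identity
    a = math.sin(omega) / omega
    b = (1.0 - math.cos(omega)) / omega
    det = a * a + b * b
    # Solve V v = t with V = [[a, -b], [b, a]].
    return (omega, (a * x + b * y) / det, (-b * x + a * y) / det)


def score(gi, gj, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """Negative squared block-weighted Frobenius norm of log(g_i^{-1} g_j)."""
    omega, vx, vy = se2_log(se2_relative(gi, gj))
    # The matrix log is [[0, -w, vx], [w, 0, vy], [0, 0, 0]]; its rotation
    # block contributes 2 w^2 and its translation block vx^2 + vy^2.
    sq_norm = lam_rot * 2.0 * omega ** 2 + lam_trans * (vx ** 2 + vy ** 2)
    return -sq_norm / tau


def attention_weights(tokens, query_index, **kwargs):
    """Softmax over the scores from one query token to every token."""
    scores = [score(tokens[query_index], g, **kwargs) for g in tokens]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the score depends only on $g_i^{-1} g_j$, left-multiplying every token by the same group element `h` via `se2_compose` leaves it unchanged up to floating-point error. This is the invariance under the diagonal $G$-action that the abstract describes as tautological.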