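3D Vision

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO New submission · Topic · 95

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

Affiliations: Technical University of Munich; Huawei

Topic match (spatial understanding): panoramic reprojection for 3D scene understanding

AI summary: Proposes OneCanvas, which aggregates multi-view patch features onto a panoramic canvas by reprojecting them with depth and camera pose. It needs neither a complex geometry encoder nor a large training budget, and reaches state-of-the-art accuracy on benchmarks such as SQA3D.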

Comments Project page: https://baranowskibrt.github.io/onecanvas/

Abstract

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
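The placement step described in the abstract (unproject a patch with depth and pose, then read off its longitude and latitude from the canvas origin) can be sketched as follows. This is a minimal illustration, not the authors' code: the pinhole intrinsics, the +z-forward camera convention, the axis used for longitude, and the names `unproject` and `canvas_coords` are all assumptions.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy, cam_to_world):
    """Lift pixel (u, v) with metric depth to a 3D world point.

    cam_to_world is a 4x4 row-major matrix (list of lists). The pinhole model
    and +z-forward camera convention are assumptions of this sketch.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    p_cam = (x, y, depth, 1.0)
    return tuple(sum(cam_to_world[r][c] * p_cam[c] for c in range(4)) for r in range(3))

def canvas_coords(point, origin, width, height):
    """Map a world point to continuous equirectangular (col, row) plus its range.

    The point must not coincide with the origin (range would be zero).
    """
    dx, dy, dz = (point[i] - origin[i] for i in range(3))
    rng = math.sqrt(dx * dx + dy * dy + dz * dz)
    lon = math.atan2(dx, dz)      # longitude, 0 along +z (assumed convention)
    lat = math.asin(dy / rng)     # latitude in [-pi/2, pi/2]
    col = (lon / (2 * math.pi) + 0.5) * width
    row = (lat / math.pi + 0.5) * height
    return col, row, rng
```

The returned range is the quantity that the angular canvas coordinate discards, which is why the abstract adds a 3D position embedding of the metric coordinates to each patch feature.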

2606.18588 2026-06-18 cs.DC cs.CV New submission · Topic · 95

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Wenqi Jia, Zhewen Hu, Ying Huang, Yu Gong, Stavros Kalafatis, Yuke Wang, Wei Niu, Chengming Zhang, Ang Li, Sheng Di, Yuede Ji, Bo Fang, Miao Yin

Affiliations: Independent Researcher; Rice University; University of Georgia; University of Houston; University of Washington; Argonne National Labs

Topic match (Gaussian Splatting): the Splaxel framework for efficient distributed 3DGS training

AI summary: Proposes Splaxel, a framework that replaces Gaussian synchronization with pixel-level local rendering and global composition, keeping communication overhead stable while preserving mathematical consistency. Combined with visibility prediction and conflict-free camera-view consolidation, it speeds up large-scale distributed 3DGS training by up to 7.6x.

Comments 17 pages, 25 figures

Abstract

3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6x speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.
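One way to see why exchanging partial pixel values can stay mathematically consistent is standard front-to-back alpha compositing: a partial render can be summarized per pixel by an accumulated color and a remaining transmittance, and those summaries compose exactly. The sketch below is illustrative only. It assumes each worker's Gaussians occupy a contiguous depth range along the ray, which the abstract does not state, and the names `render_local` and `compose_pixel` are hypothetical.

```python
def render_local(samples):
    """Composite one worker's samples for a single pixel, front to back.

    samples: list of (rgb, alpha) sorted by depth. Returns (color, transmittance).
    """
    color = [0.0, 0.0, 0.0]
    trans = 1.0
    for rgb, alpha in samples:
        for i in range(3):
            color[i] += trans * alpha * rgb[i]
        trans *= 1.0 - alpha
    return color, trans

def compose_pixel(partials):
    """Combine per-worker (color, transmittance) pairs, ordered front to back."""
    color = [0.0, 0.0, 0.0]
    trans = 1.0
    for part_color, part_trans in partials:
        for i in range(3):
            color[i] += trans * part_color[i]   # attenuate by everything in front
        trans *= part_trans
    return color, trans
```

Under that depth-ordering assumption, the per-pixel message is four numbers per worker regardless of how many Gaussians the worker holds, which matches the abstract's claim that communication cost stays stable as the scene grows.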

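2606.18623 2026-06-18 cs.CV eess.IV New submission · Topic · 85

Intrinsic 4D Gaussian Segmentation from Scene Cues

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

Affiliations: Istanbul Technical University; Texas A&M University; Hamad Bin Khalifa University

Topic match (Gaussian Splatting): a training-free, mask-free 4D Gaussian segmentation method

AI summary: Proposes Intrinsic-GS, a training-free, mask-free method that segments 4D scenes by building an affinity graph over Gaussian primitives and partitioning it with community detection. On Neu3D a geometry-only variant matches the SAM-supervised TRASE (0.902 mIoU), and on HyperNeRF the method runs 12.5x faster than the mask-generation and feature-rendering stages of mask-supervised pipelines.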

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

Abstract

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.

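2606.18734 2026-06-18 eess.SP cs.LG New submission · Topic · 80

Point-Cloud-Assistant Localized Statistical Channel Prediction by Tangent Gaussian Splatting

Ye Xue, Yiheng Wang, Xinhua Shao, Qi Yan, Shutao Zhang, Tsung-Hui Chang

Affiliations: China Telecom

Topic match (Gaussian Splatting): channel prediction with Gaussian splatting

AI summary: Proposes point-cloud-assisted tangent Gaussian splatting (PC-TGS), a framework that fuses sparse radio measurements with dense LiDAR geometry to extrapolate the angular power spectrum to unmeasured grids, enabling efficient channel prediction in large-scale wireless digital twins.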

Abstract

Accurate, site-specific channel information is crucial for optimizing next-generation wireless networks. Among various approaches, localized statistical channel modeling (LSCM), which models the channel multipath angular power spectrum (APS) from reference signal received power (RSRP) measurements, has emerged as a state-of-the-art method tailored for efficient network optimization. However, despite its effectiveness, LSCM cannot predict APS at the vast majority of locations where no measurements are available, which significantly restricts its applicability in large-scale, real-world scenarios. To address this challenge, we present point-cloud-assisted tangent Gaussian splatting (PC-TGS), the first framework to extrapolate APS to unmeasured outdoor grids by integrating sparse radio measurements with dense LiDAR-based geometry. PC-TGS represents environmental scatterers as anisotropic 3D Gaussians, initialized and refined through a relaxed-mean reparameterization of the raw point cloud. A tangent-plane projection accurately maps each Gaussian into the local angular domain, while a depth-aware electromagnetic splatting process aggregates their contributions. To ensure practical deployment, we derive a closed-form Gaussian-weighted average (GWA) for APS bin integration and provide a provable error bound. Evaluations on a LiDAR-scanned city-scale dataset (5M points, 6,310 RSRP samples) demonstrate that PC-TGS achieves better APS and RSRP prediction performance than state-of-the-art baselines and faster inference on the APS extrapolation task. These results highlight the potential of PC-TGS to enable geometry-aware and data-efficient channel prediction in large-scale wireless digital twins.

2606.19156 2026-06-18 cs.CV New submission · Topic · 90

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

Affiliations: Yonsei University; Electronics and Telecommunications Research Institute; ETH Zurich; Microsoft Spatial AI Lab; VGG, University of Oxford

Topic match (3D reconstruction): dynamic 4D hand reconstruction from egocentric video with feed-forward 3DGS

AI summary: Proposes Hand-4DGS, the first feed-forward framework that reconstructs dynamic 4D hands directly from egocentric video. Using a mesh-guided representation and temporal convolutions, it achieves fast inference and strong generalization without 3D ground-truth annotations.

Comments Project page: https://jeongminb.github.io/hand-4dgs/

Abstract

Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

2606.19019 2026-06-18 cs.CV New submission · Topic · 90

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Yuchen Rao, Xuqian Ren, Yinyu Nie, Sayan Deb Sarkar, Biao Zhang, Vincent Lepetit, Friedrich Fraundorfer

Affiliations: Graz University of Technology, Austria; Tampere University, Finland; Technical University of Munich, Germany; Stanford University, United States of America; Xi’an Jiaotong University, China; École des Ponts ParisTech, France

Topic match (3D reconstruction): a flow-steering framework that combines generative priors with 3DGS for sparse-view reconstruction

AI summary: Proposes FlowObject, which steers the ODE trajectory of a flow-matching model with a dual-space guidance strategy, completing unobserved regions from generative priors while staying consistent with real observations. An integrated 3DGS refinement stage narrows the gap between generative outputs and photorealistic reconstruction, significantly improving geometric completeness and view-dependent appearance fidelity.

Comments Project page: https://yuchenrao.github.io/projects/flowObject/flowObject.html

Abstract

Recovering complete 3D representations of objects from a few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias", where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.
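The general idea of steering an ODE trajectory without retraining can be illustrated with a toy Euler integrator that subtracts a guidance gradient from the model's velocity at every step. This is only a sketch of guided sampling in general, not FlowObject's dual-space strategy, whose details the abstract does not give. The function `guided_euler`, the velocity callback, and the single quadratic observation loss are all hypothetical stand-ins.

```python
def guided_euler(x0, velocity, guidance_grad, scale, steps=10):
    """Integrate dx/dt = v(x, t) - scale * grad(x) from t = 0 to t = 1.

    velocity(x, t) stands in for a pretrained flow-matching model;
    guidance_grad(x) is the gradient of an observation-consistency loss.
    Both are placeholders for illustration.
    """
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        g = guidance_grad(x)
        x = [xi + dt * (vi - scale * gi) for xi, vi, gi in zip(x, v, g)]
    return x
```

With `scale=0.0` the sampler follows the prior alone; increasing `scale` pulls the trajectory toward states that agree with the observation, which is the trade-off between generative completion and observational consistency that the abstract describes.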

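2606.18472 2026-06-18 cs.CV New submission · Topic · 90

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

Affiliations: Concordia University

Topic match (3D reconstruction): domain-generalizable fine-tuning of 3D vision-language models

AI summary: Proposes ReFine3D, which improves the domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.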

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
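Test-time augmentation with confidence-based aggregation can be sketched as a weighted average of per-augmentation class probabilities. The abstract does not define the confidence measure, so this sketch assumes the top-class probability of each augmented view as its weight. The names `softmax` and `aggregate` are illustrative, not ReFine3D's API.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(view_logits):
    """Confidence-weighted average of class probabilities across augmented views.

    Confidence is taken to be each view's top-class probability (an assumption).
    """
    probs = [softmax(logits) for logits in view_logits]
    weights = [max(p) for p in probs]
    total = sum(weights)
    n_classes = len(probs[0])
    return [sum(w * p[k] for w, p in zip(weights, probs)) / total
            for k in range(n_classes)]
```

Weighting by confidence means an augmentation that produces a nearly uniform prediction contributes less than one that produces a sharp prediction, so a single badly corrupted view is less likely to flip the final label.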

2606.18439 2026-06-18 cs.CV cs.RO New submission · Topic · 90

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

Topic match (3D reconstruction): a VGGT acceleration method for multi-view 3D scene reconstruction

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup while maintaining reconstruction quality.

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.19316 2026-06-18 cs.CV New submission · Topic · 85

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

Affiliations: State Key Lab of CAD&CG, College of Computer Science, Zhejiang University; Ant Research; Google; ByteDance

Topic match (3D reconstruction): volumetric editing with a neural mesh-based implicit field

AI summary: Proposes a disentangled neural radiance field representation anchored on mesh vertices that enables efficient geometry, texture, and semantic-guided volumetric editing, including mesh-guided geometry editing, texture swapping, filling and painting, and semantic editing.

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/

Abstract

Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/

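2606.18787 2026-06-18 cs.CV New submission · Topic · 85

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Eito Ogawa, Hiroshi Watanabe

Affiliations: Graduate School of FSE, Waseda University, Tokyo, Japan

Topic match (3D reconstruction): point cloud surface reconstruction with UDF methods

AI summary: Proposes a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. It is trained on off-grid target radii obtained by parabolic interpolation and improves fine-grained accuracy in point cloud surface reconstruction.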

2606.18558 2026-06-18 cs.CV New submission · Topic · 85

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match (3D reconstruction): 3D point trajectory prediction, which involves 3D motion forecasting

AI summary: Proposes a language-instructed 3D point motion forecasting method. By building a large-scale dataset and benchmark, it achieves class-agnostic, view-stable trajectory prediction and validates it on robot manipulation and video generation.

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

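2606.18429 2026-06-18 cs.CV cs.AI cs.LG New submission · Topic · 85

CAOA -- Completion-Assisted Object-CAD Alignment

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

Affiliations: University at Albany

Topic match (3D reconstruction): a method for aligning CAD models to scanned objects

AI summary: Proposes CAOA, which combines semantic-aware point cloud completion with symmetry-aware relative pose estimation, achieves a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.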

Comments GitHub: https://github.com/MinhasKamal/CAOA

Journal ref Thirteenth International Conference on 3D Vision (3DV), 2026

DOI: 10.1109/3DV69130.2026.00047

Abstract

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, and they often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
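A symmetry-aware rotation error can be illustrated by taking the minimum angular error over all rotations that leave the object unchanged. The sketch below restricts itself to yaw about a single axis with n-fold rotational symmetry, which is a simplification chosen for illustration. The abstract does not specify CAOA's actual loss, and `wrap` and `symmetric_yaw_error` are hypothetical names.

```python
import math

def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def symmetric_yaw_error(pred, gt, n_fold):
    """Smallest absolute yaw error over the n_fold symmetric equivalents of gt.

    n_fold = 1 means no symmetry; n_fold = 4 suits, for example, a square table.
    """
    step = 2.0 * math.pi / n_fold
    return min(abs(wrap(pred - gt - k * step)) for k in range(n_fold))
```

For a four-fold symmetric object, a prediction that is off by 90 degrees plus a small residual is penalized only for the residual, so the optimizer is not pushed toward one arbitrary member of a set of visually identical poses.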

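2606.13376 2026-06-18 cs.CV New submission · Topic · 85

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

Affiliations: South China University of Technology; Columbia University; Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group; Singapore Management University

Topic match (3D reconstruction): building an interactively navigable 3D scene from a single image

AI summary: Proposes MoVerse, which builds an interactively navigable 360-degree panoramic world in real time from a single narrow-field-of-view image. Topology-aware diffusion completes the field of view, panoramic geometry-aware residual prediction produces a 3D Gaussian scaffold, and a bidirectional diffusion teacher is distilled into a causal autoregressive student for low-latency video rendering.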

Comments Project Page: https://orange-3dv-team.github.io/MoVerse/

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

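2606.18952 2026-06-18 cs.CV New submission · Topic · 80

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

Affiliations: Shanghai University; Southern University of Science and Technology; The University of Sydney

Topic match (3D reconstruction): a real-captured single-photon LiDAR benchmark supporting depth estimation and multi-view reconstruction

AI summary: To address the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes, proposes STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.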

Abstract

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at 256×192 resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.
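To make the notion of a multi-return time-of-flight histogram concrete, the sketch below converts histogram peaks to depths with the round-trip relation depth = c * t / 2. It is a naive local-maximum detector for illustration only. The bin width, the photon-count threshold, the choice to time-stamp a bin by its index times the bin width, and the name `returns_from_histogram` are all assumptions, and nothing here reflects the benchmark's actual data format.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def returns_from_histogram(hist, bin_width_s, min_count):
    """Return depths in metres for local maxima holding at least min_count photons.

    hist is a list of photon counts per time bin. The first and last bins are
    never reported because they lack a neighbour on one side.
    """
    depths = []
    for i in range(1, len(hist) - 1):
        is_peak = hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]
        if is_peak and hist[i] >= min_count:
            time_of_flight = i * bin_width_s
            depths.append(SPEED_OF_LIGHT * time_of_flight / 2.0)
    return depths
```

A pixel that looks through glass or straddles a depth edge yields two or more entries in the returned list, which is the multi-return behavior that the abstract says complicates geometric reconstruction.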

2606.18861 2026-06-18 cs.CV cs.AI 新提交专题 80

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微联合推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

专题命中三维重建：从RGB-D序列合成URDF，关节推理与验证。

AI总结提出KinemaForge管道，通过可微关节推理和能量一致性验证，从RGB-D序列联合估计部件形状、关节拓扑和参数，显著降低关节轴误差和仿真漂移。

AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约：(i) 部件级几何重建与运动学参数估计分离，(ii) 恢复的模型常违反能量守恒等基本动态不变量，导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge，一种约束驱动管道，从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数，并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件：将关节-部件关联编码为软边的运动学约束图；通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器；以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上，KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%)，从基于交互的Ditto基线的5.30度降至2.83度(-46.6%)，在50秒滚动中长时仿真漂移比PARIS降低64%，初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
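The percentage figures quoted for joint-axis error can be checked directly from the stated degrees. The short sketch below recomputes the relative change for both baselines; the helper name `relative_change_percent` is an illustrative choice, and the input numbers are taken from the abstract.

```python
def relative_change_percent(baseline, ours):
    """Signed relative change of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Average joint-axis error in degrees, as reported in the abstract.
vs_paris = relative_change_percent(4.52, 2.83)
vs_ditto = relative_change_percent(5.30, 2.83)

print(f"{vs_paris:.1f}% {vs_ditto:.1f}%")  # prints "-37.4% -46.6%"
```

Both values agree with the reductions stated in the abstract.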

2606.16849 2026-06-18 cs.NE cs.GR cs.HC 新提交专题 70

Evolution & Foundation: AI Shares Creative Control

进化与基础模型：AI共享创意控制

Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie

专题命中三维重建：生成3D有机形态，涉及三维建模

AI总结提出一种结合遗传算法与多模态AI基础模型的框架，实现自动化设计3D有机形态，将艺术家角色从直接选择转变为系统设计，加速创意探索。

AI中文摘要

本文研究使用进化系统进行自动化设计和艺术评估的创意过程。我们考虑多模态人工智能（AI）模型如何与组合生成和进化计算系统进行通信和引导。通过将遗传算法与大规模AI基础模型的视觉推理能力相结合，创建了一个用于进化美观的复杂3D有机形态的框架。该框架将艺术家的角色从密集的直接选择转变为系统设计；将详细的逐步策划转移给能够进行多模态审美判断的AI代理。该框架使人类艺术家/设计师能够快速穿越多维进化参数空间的大片区域，基于其语义目标找到创意结果。为每个实验生成AI审美推理的详细审计轨迹。交互式可视化工具，连同AI生成的摘要和进化叙事，使得能够深入探索每个进化实验，并提供对AI引导过程的透明洞察。

英文摘要

This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist's role from that of intensive direct selection to one of system design, transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and provide transparent insight into the AI-guided process.

2606.18826 2026-06-18 physics.optics cs.CV eess.IV 新提交专题 90

EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

EDoF-NeRF: 使用编码孔径相机扩展景深的神经辐射场

Yoshiyuki Shirasaki, Ryoichi Horisaki

发表机构 * Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo（信息物理与计算系，信息科学与技术研究生学校，东京大学）

专题命中 NeRF ：扩展景深NeRF，编码孔径相机

AI总结提出一种通过编码孔径相机扩展景深的方法，构建高保真神经辐射场，实现从不同视角图像渲染新视图，并验证其优于传统孔径相机。

2606.18583 2026-06-18 cs.CV cs.RO 新提交专题 85

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

空地激光雷达地点识别：基于块级自监督学习和扩展互逆重排序

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

发表机构 * University of Calgary（卡尔加里大学）； Nanchang University（南昌大学）； Nanyang Technological University（南洋理工大学）； Wuhan University（武汉大学）

专题命中点云：空地激光雷达地点识别框架，点云检索重排序

AI总结提出一种空地激光雷达地点识别框架，通过多尺度块级自监督学习缩小域差距，并利用扩展互逆重排序算法减少误检，在多个数据集上显著提升检索精度。

AI中文摘要

激光雷达地点识别用于确定在预先采集的点云地图上的位置。最常研究的基于地面激光雷达的地点识别存在预访问要求、覆盖不完整和视角有限等缺点。使用预先采集的全覆盖机载激光扫描（ALS）数据作为空中先验地图可以克服这些缺点，使得跨视角地点识别变得必要且有利。然而，空地激光雷达地点识别面临重大挑战，包括空中和地面点云之间的域差距以及初始检索中的误检。为了解决这些问题，我们提出了一种用于空地激光雷达地点识别的新型检索和重排序框架。基于相邻点云块与锚点块共享相似语义的先验知识，我们的检索网络在多个尺度上引入了块级自监督学习模块，并与场景级学习相结合，以提高空中和地面点云之间全局特征的判别性。此外，利用ALS点云的结构化空间分布，我们引入了一种扩展互逆（ER）重排序算法，以最大化利用邻域信息，并根据邻域特征优化每个特征，然后用于更新相似度矩阵以进行最终排序。大量实验表明，我们的检索网络优于现有最先进（SOTA）方法，在CS-Urban-Scenes数据集上平均Recall@1提高了9.8%，平均Recall@1%提高了3.2%，同时在CS-Campus3D数据集上也展示了最佳性能。此外，我们的ER重排序算法在无需额外训练的情况下，进一步将CS-Campus3D上的平均Recall@1提高了4.9%，CS-Urban-Scenes上提高了10.2%。

英文摘要

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm to exploit neighborhood information maximally and refine each feature based on neighbor features, which are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.
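For readers unfamiliar with the retrieval metrics quoted above, the sketch below shows how Recall@1 and Recall@1% are commonly computed from a query-by-database similarity matrix. It assumes the usual place-recognition convention that Recall@1% uses the top 1% of database entries (at least one) as candidates; the abstract does not spell out its exact evaluation protocol, and the function names and toy data are illustrative.

```python
def recall_at_k(similarity, positives, k):
    """Fraction of queries with at least one true match among the top-k results.

    similarity[q][d] : score between query q and database item d (higher is better)
    positives[q]     : set of database indices that are true matches for query q
    """
    hits = 0
    for q, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
        if positives[q] & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


def recall_at_one_percent(similarity, positives):
    """Recall with k set to 1% of the database size (assumed convention)."""
    db_size = len(similarity[0])
    k = max(1, round(db_size / 100))
    return recall_at_k(similarity, positives, k)


# Toy example: 3 ground queries against 3 aerial database submaps.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.7],
    [0.6, 0.5, 0.4],
]
toy_positives = [{0}, {2}, {2}]

r1 = recall_at_k(toy_similarity, toy_positives, 1)
r2 = recall_at_k(toy_similarity, toy_positives, 2)
```

A re-ranking stage such as the ER algorithm would modify the similarity matrix before this evaluation; the metric code itself stays the same.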

2606.18948 2026-06-18 cs.RO 新提交专题 75

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

C-ARC: 面向非重复式LiDAR传感器的连续自适应范围聚类

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

发表机构 * Technical University of Darmstadt（达姆施塔特工业大学）； Simulation, Systems Optimization and Robotics Group（仿真、系统优化与机器人组）

专题命中点云：处理非重复式LiDAR点云聚类，属于3D视觉

AI总结提出C-ARC框架，通过滑动窗口上的持久双图结构解耦高频点插入与按需聚类检索，并利用指数控制环自适应校准网格分辨率，实现非重复式LiDAR点云的实时聚类。

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

AI中文摘要

实时LiDAR聚类识别点云中的结构，是许多移动机器人算法的重要前提。当前方法主要针对重复式机械LiDAR传感器开发。近年来，由于成本和外形尺寸小，非重复式LiDAR传感器的使用显著增加。这类基于Risley棱镜的非重复传感器违反了重复式机械传感器的两个关键假设：结构化的扫描线和明确的帧边界。其Rhodonea曲线轨迹产生非均匀点分布，且缺乏旋转周期使得传统扫描线索引无法适用。为满足这些新需求，我们开发了C-ARC，一个连续自适应范围聚类框架，它在滑动窗口上维护一个持久双图，将高频点插入与按需聚类检索解耦。这对于SLAM或跟踪等关键功能至关重要。自适应范围网格分辨率机制在初始化时使用指数控制环校准网格尺寸，无需预先了解扫描模式即可平衡稀疏-碰撞权衡。作为开源的单线程C++17库实现，C-ARC在商用硬件上对Livox Mid-360以20 Hz产生实时聚类输出。在Livox Avia上的评估表明，对于扫描模式高度集中的传感器，无界单元占用是主要限制。自适应分辨率机制还提高了现有基于网格的方法在非重复数据上的聚类质量。

英文摘要

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors has increased strongly due to their low cost and small form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet these new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-sourced single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.
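The abstract only names the "exponential control loop" used to calibrate the grid, so the following is a guess at the general idea rather than the C-ARC implementation: scale the cell size by a constant factor until the mean number of points per occupied cell falls inside a target band, so that cells are neither mostly empty (sparsity) nor overcrowded (collisions). The 2D Cartesian grid, the target band, the factor, and all function names are assumptions made for illustration; the actual library is C++17 and operates on a range grid.

```python
import math


def mean_occupancy(points, cell_size):
    """Mean number of points per occupied cell of a uniform 2D grid."""
    cells = {}
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key] = cells.get(key, 0) + 1
    return len(points) / len(cells)


def calibrate_cell_size(points, cell_size=1.0, target=(2.0, 8.0),
                        factor=2.0, max_iters=32):
    """Multiplicatively adjust cell_size until occupancy lies inside `target`.

    target    : hypothetical (low, high) band for mean points per occupied cell
    factor    : hypothetical multiplicative step, giving exponential updates
    max_iters : bound on iterations, in case the band cannot be reached
    """
    low, high = target
    for _ in range(max_iters):
        occupancy = mean_occupancy(points, cell_size)
        if occupancy < low:
            cell_size *= factor   # too sparse: coarsen the grid
        elif occupancy > high:
            cell_size /= factor   # too many collisions: refine the grid
        else:
            break
    return cell_size


# Toy data: 256 points on a regular 16 x 16 lattice with unit spacing.
toy_points = [(float(i), float(j)) for i in range(16) for j in range(16)]

from_fine = calibrate_cell_size(toy_points, cell_size=1.0)
from_coarse = calibrate_cell_size(toy_points, cell_size=64.0)
```

On this toy lattice both starting points settle on the same cell size, which is the property such a calibration step is meant to provide when the scan pattern is not known in advance.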

1. 空间理解 1 篇

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

2. Gaussian Splatting 3 篇

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Intrinsic 4D Gaussian Segmentation from Scene Cues

Point-Cloud-Assistant Localized Statistical Channel Prediction by Tangent Gaussian Splatting

3. 三维重建 12 篇

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

CAOA -- Completion-Assisted Object-CAD Alignment

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

Evolution & Foundation: AI Shares Creative Control

4. NeRF 1 篇

EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

5. 点云 2 篇

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors