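arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

3D Vision

3D reconstruction, NeRF, Gaussian Splatting, point clouds, and spatial intelligence.

Entries collected for today/current date: 6. Sources: cs.CV, cs.GR, cs.RO
2512.00850 2026-06-19 cs.CV Version update 95%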

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Haishan Wang, Mohammad Hassan Vali, Arno Solin

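Affiliations: ELLIS Institute Finland; Aalto University

Topic match: 3D reconstruction. Compact representations for 3D Gaussian Splatting fall under 3D reconstruction.

AI summary: Proposes Smol-GS, which learns efficient per-splat features using octree-derived positional encoding and entropy-based compression, yielding a compact 3D Gaussian Splatting representation that substantially reduces storage while preserving rendering quality.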

Abstract

We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient splat-wise features to model 3D space, which capture abstracted cues, including color, opacity, transformation, and material properties. We propose octree-derived positional encoding, which explicitly models spatial locality and enhances representation efficiency. We further apply entropy-based compression to exploit feature redundancy and compress splat coordinates using a recursive voxel hierarchy. This design enables orders-of-magnitude reduction in storage while preserving representation flexibility. Smol-GS achieves state-of-the-art compression performance on standard benchmarks with high-level rendering quality.
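
To make the "recursive voxel hierarchy" idea in the abstract concrete, here is a minimal sketch of a generic breadth-first octree occupancy code for point coordinates. It is not the Smol-GS implementation: the function names, the unit-cube assumption, and the one-byte-per-node layout are illustrative choices, and the paper's actual coordinate coder and entropy model may differ.

```python
from collections import deque


def quantize(points, depth):
    """Map points in the unit cube [0, 1) to integer voxels on a 2^depth grid."""
    n = 1 << depth
    return sorted({tuple(min(int(c * n), n - 1) for c in p) for p in points})


def encode_octree(voxels, depth):
    """Breadth-first occupancy code: one byte per non-empty internal node,
    one bit per occupied child octant (assumed layout, for illustration)."""
    codes = []
    queue = deque([(0, (0, 0, 0), voxels)])  # (level, node origin, voxels inside)
    while queue:
        level, origin, inside = queue.popleft()
        if level == depth:
            continue  # leaf voxel: its position is implied by the path
        half = 1 << (depth - level - 1)  # child edge length in voxels
        children = {}
        for v in inside:
            octant = sum(((v[a] - origin[a]) >= half) << a for a in range(3))
            children.setdefault(octant, []).append(v)
        codes.append(sum(1 << o for o in children))
        for o in sorted(children):
            child_origin = tuple(origin[a] + half * ((o >> a) & 1) for a in range(3))
            queue.append((level + 1, child_origin, children[o]))
    return bytes(codes)


def decode_octree(code, depth):
    """Rebuild the sorted voxel coordinates from the occupancy bytes."""
    voxels = []
    masks = iter(code)
    queue = deque([(0, (0, 0, 0))])
    while queue:
        level, origin = queue.popleft()
        if level == depth:
            voxels.append(origin)
            continue
        mask = next(masks)
        half = 1 << (depth - level - 1)
        for o in range(8):
            if mask & (1 << o):
                child = tuple(origin[a] + half * ((o >> a) & 1) for a in range(3))
                queue.append((level + 1, child))
    return sorted(voxels)


points = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.7), (0.12, 0.1, 0.1)]
voxels = quantize(points, depth=4)
code = encode_octree(voxels, depth=4)
assert decode_octree(code, depth=4) == voxels
```

Nearby points share the upper levels of the tree, so their common prefix is stored once. In a real codec the occupancy bytes would then be passed to an entropy coder, which this sketch leaves out.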

2604.13416 2026-06-19 cs.CV cs.AI Version update 90%

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

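Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

Topic match: 3D reconstruction. A dataset and benchmark for distractor-free novel view synthesis.

AI summary: To address the lack of a large-scale real-world dataset for distractor-free radiance fields, the authors build DF3DV-1K, with 1,048 scenes that each provide clean and cluttered image sets. They use it to benchmark nine recent methods, identifying the most robust methods and the most challenging scenarios.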

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.
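
For readers who want to interpret the reported 0.96 dB PSNR gain, the following sketch shows the standard PSNR definition and what a dB gain means in terms of mean squared error. The helper names and the flat-list image representation are assumptions made for illustration; this is not the benchmark's evaluation code.

```python
import math


def psnr(reference, rendered, peak=1.0):
    """PSNR in dB between two equally sized flat lists of intensities in [0, peak]."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


ratio = mse_ratio_for_gain(0.96)  # roughly 0.80
```

Because PSNR is logarithmic in the error, a 0.96 dB gain corresponds to an MSE about 20% lower than before, whatever the starting quality.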

2602.23172 2026-06-19 cs.CV cs.AI cs.RO Version update 85%

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

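Affiliations: University of Freiburg; Bosch Research; University of Haifa

Topic match: 3D reconstruction. Latent Gaussian Splatting for 4D occupancy tracking.

AI summary: Proposes Latent Gaussian Splatting (LaGS), which uses feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features for 4D panoptic occupancy tracking, achieving state-of-the-art performance on Occ3D nuScenes and Waymo.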

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
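
The step in which feature-bearing Gaussians are "splatted into a voxel grid" can be pictured with a small distance-weighted sketch. This is only an illustration under simplifying assumptions (isotropic Gaussians, a tiny dense cubic grid, weight-normalized averaging, and a hypothetical `splat_features` helper); the abstract does not specify LaGS's actual kernels or normalization.

```python
import math


def splat_features(gaussians, grid_size, voxel_size=1.0):
    """Splat feature-bearing isotropic Gaussians into a dense voxel grid.

    gaussians: list of (center_xyz, sigma, feature_vector).
    Returns a dict mapping voxel index (i, j, k) to a weight-normalized feature.
    """
    dim = len(gaussians[0][2])
    grid = {}
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Voxel center in the same coordinate frame as the Gaussians.
                center = tuple((idx + 0.5) * voxel_size for idx in (i, j, k))
                total, acc = 0.0, [0.0] * dim
                for mu, sigma, feat in gaussians:
                    d2 = sum((c - m) ** 2 for c, m in zip(center, mu))
                    w = math.exp(-0.5 * d2 / sigma ** 2)  # distance-based weight
                    total += w
                    acc = [a + w * f for a, f in zip(acc, feat)]
                if total > 1e-8:
                    grid[(i, j, k)] = [a / total for a in acc]
    return grid


gaussians = [
    ((0.5, 0.5, 0.5), 0.5, [1.0, 0.0]),
    ((1.5, 0.5, 0.5), 0.5, [0.0, 1.0]),
]
grid = splat_features(gaussians, grid_size=2)
```

Each voxel receives a blend of all Gaussian features, dominated by the nearest ones, which is one way to obtain the data-dependent receptive fields the abstract refers to.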

2503.01425 2026-06-19 cs.GR cs.CV Version update 85%

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

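Affiliations: Technical University of Munich; AUDI AG

Topic match: 3D reconstruction. Generating and editing 3D meshes from sketches.

AI summary: Proposes MeshPad, an interactive sketch-conditioned method for generating and editing 3D meshes. It decomposes edits into deletion and addition of mesh regions and combines a Transformer with a vertex-aligned speculative prediction strategy for fast iterative editing, improving mesh quality by more than 22% in Chamfer distance and being preferred by 90% of participants.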

Comments Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

Abstract

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
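
Since the headline number is reported in Chamfer distance, a brute-force sketch of that metric may help. Conventions vary between papers (squared versus unsquared distances, sums versus means), and the abstract does not say which one MeshPad uses, so the variant below, the mean squared nearest-neighbor distance in both directions, is an assumption.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses the mean squared nearest-neighbor distance in each direction
    (one common convention; others exist).
    """
    def one_way(src, dst):
        return sum(
            min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
            for s in src
        ) / len(src)

    return one_way(a, b) + one_way(b, a)


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(generated, reference)
```

Lower is better, so an improvement of more than 22% means the distance between points sampled from the generated and reference meshes dropped by that fraction. In practice the points are sampled from mesh surfaces, and a spatial index replaces the quadratic search.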

2508.15228 2026-06-19 cs.CV Version update 80%

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

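Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

Topic match: 3D reconstruction. Collaborative multi-modal coding for 3D generation.

AI summary: Proposes TriMM, described as the first feed-forward 3D-native generative model that learns from multiple modalities. It fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.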

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2509.13972 2026-06-19 cs.RO Version update 80%

BIM Informed Visual SLAM for Construction Environments

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

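Affiliations: Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg

Topic match: 3D reconstruction. Using BIM to enhance visual SLAM and reduce trajectory drift.

AI summary: To counter the trajectory drift of visual SLAM in construction environments, the authors augment an RGB-D SLAM system with structural priors from the Building Information Model (BIM). Wall correspondences are added as geometric constraints in the optimization to reduce drift and improve global consistency. Experiments show a 25.23% reduction in trajectory error and a 7.14% improvement in map accuracy.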

Comments 9 pages, 7 tables, 4 figures

Abstract

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inaccurate with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.
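
The wall association step can be illustrated with a small sketch that matches detected wall planes to BIM wall planes. The abstract does not describe the authors' association rule or constraint formulation, so everything here is assumed for illustration: the plane parameterization, the thresholds, and the nearest-offset rule in the hypothetical `associate_walls` helper.

```python
import math


def associate_walls(detected, bim, max_angle_deg=10.0, max_offset=0.5):
    """Match each detected wall plane to the closest compatible BIM wall plane.

    Planes are (unit_normal, offset) with normal . x = offset in a shared frame.
    Returns a list of (detected_index, bim_index, offset_residual).
    """
    cos_min = math.cos(math.radians(max_angle_deg))
    matches = []
    for i, (n_d, d_d) in enumerate(detected):
        best = None
        for j, (n_b, d_b) in enumerate(bim):
            cos = sum(x * y for x, y in zip(n_d, n_b))
            # A plane and its flipped copy (-normal, -offset) are the same wall.
            if cos < 0:
                cos, d_b = -cos, -d_b
            residual = d_d - d_b
            if cos >= cos_min and abs(residual) <= max_offset:
                if best is None or abs(residual) < abs(best[2]):
                    best = (i, j, residual)
        if best is not None:
            matches.append(best)
    return matches


detected = [((1.0, 0.0, 0.0), 2.1), ((0.0, 1.0, 0.0), 9.0)]
bim = [((1.0, 0.0, 0.0), 2.0), ((-1.0, 0.0, 0.0), -5.0), ((0.0, 1.0, 0.0), 4.0)]
matches = associate_walls(detected, bim)
```

In a system like the one described, each accepted match would contribute a residual to the back-end optimization. Walls with no compatible BIM counterpart are simply left unconstrained, which is one plausible way to tolerate incomplete BIM data.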