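arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

3D Vision

3D reconstruction, NeRF, Gaussian Splatting, point clouds, and spatial intelligence.

Entries indexed for today / the current date: 14; sources: cs.CV, cs.GR, cs.RO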
2606.19156 2026-06-18 cs.CV New submission 90%

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

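Affiliations: Yonsei University; Electronics and Telecommunications Research Institute; ETH Zurich; Microsoft Spatial AI Lab; VGG, University of Oxford

Topic match: 3D reconstruction. Reconstructs dynamic 4D hands from egocentric video with feed-forward 3DGS.

AI summary: Proposes Hand-4DGS, the first feed-forward framework that reconstructs dynamic 4D hands directly from egocentric video. It uses a mesh-guided representation and temporal convolutions to achieve fast inference and strong generalization without 3D ground-truth annotations.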

Comments Project page: https://jeongminb.github.io/hand-4dgs/

Abstract

Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

2606.19019 2026-06-18 cs.CV New submission 90%

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

Yuchen Rao, Xuqian Ren, Yinyu Nie, Sayan Deb Sarkar, Biao Zhang, Vincent Lepetit, Friedrich Fraundorfer

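Affiliations: Graz University of Technology, Austria; Tampere University, Finland; Technical University of Munich, Germany; Stanford University, United States of America; Xi’an Jiaotong University, China; École des Ponts ParisTech, France

Topic match: 3D reconstruction. Proposes a flow-steering framework that combines generative priors with 3DGS for sparse-view reconstruction.

AI summary: Proposes FlowObject, which steers the ODE trajectory of a flow-matching model with a dual-space guidance strategy, completing unobserved regions from generative priors while staying consistent with real observations. An integrated 3DGS refinement stage bridges the gap between generative outputs and photorealistic reconstructions, significantly improving geometric completeness and view-dependent appearance fidelity.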

Comments Project page: https://yuchenrao.github.io/projects/flowObject/flowObject.html

Abstract

Recovering complete 3D representations of objects from few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias" where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.
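
The idea of steering an ODE trajectory with observation guidance can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `velocity` is a hand-written stand-in for a learned flow-matching field, `guidance_grad` is a plain masked squared-error gradient, and the single guidance term does not reproduce the paper's dual-space strategy or its 3DGS refinement stage.

```python
import numpy as np

def velocity(x, t):
    # Hypothetical stand-in for a learned flow-matching velocity field:
    # it simply pulls the sample toward a fixed "prior" point.
    prior = np.array([1.0, 1.0])
    return prior - x

def guidance_grad(x, observed, mask):
    # Gradient of 0.5 * ||mask * (x - observed)||^2 with respect to x.
    # mask is 1 for observed coordinates and 0 for unobserved ones.
    return mask * (x - observed)

def guided_euler(x0, observed, mask, steps=100, scale=5.0):
    # Explicit Euler integration of dx/dt = v(x, t) - scale * grad(x).
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * (velocity(x, t) - scale * guidance_grad(x, observed, mask))
    return x

observed = np.array([0.2, 0.0])
mask = np.array([1.0, 0.0])  # only the first coordinate is observed
guided = guided_euler(np.zeros(2), observed, mask)
unguided = guided_euler(np.zeros(2), observed, mask, scale=0.0)
```

In this toy setting the observed coordinate ends up closer to its observation than in the unguided run, while the unobserved coordinate follows the prior in both runs.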

2606.18472 2026-06-18 cs.CV New submission 90%

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

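Affiliations: Concordia University

Topic match: 3D reconstruction. Domain-generalizable fine-tuning of 3D vision-language models.

AI summary: Proposes ReFine3D, which improves the domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.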

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
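
The abstract mentions test-time augmentation with confidence-based aggregation but gives no formula. One plausible reading is sketched below, assuming that each augmented view yields class logits and that a view's confidence is its top-class probability. The function name `aggregate_by_confidence` and the sample logits are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_by_confidence(logits_per_view):
    # logits_per_view: (num_views, num_classes) predictions for one sample.
    probs = softmax(np.asarray(logits_per_view, dtype=float))
    conf = probs.max(axis=-1)        # confidence = top-class probability
    weights = conf / conf.sum()      # normalise so the weights sum to 1
    return (weights[:, None] * probs).sum(axis=0)

# Three augmented views: one confident, two nearly uniform.
views = [[4.0, 0.0, 0.0], [0.2, 0.0, 0.1], [0.0, 0.3, 0.0]]
fused = aggregate_by_confidence(views)
```

Weighting by confidence lets the one decisive view dominate the two ambiguous ones, whereas a plain average would weight all three equally.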

2606.18439 2026-06-18 cs.CV cs.RO New submission 90%

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

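Affiliations: University of Pennsylvania; University of California, Irvine; Nanyang Technological University

Topic match: 3D reconstruction. Proposes a VGGT acceleration method for multi-view 3D scene reconstruction.

AI summary: Proposes RegimeVGGT, which removes redundancy with layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup at matched reconstruction quality.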

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.19316 2026-06-18 cs.CV New submission 85%

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

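Affiliations: State Key Lab of CAD&CG, College of Computer Science, Zhejiang University; Ant Research; Google; ByteDance

Topic match: 3D reconstruction. Volumetric editing with a neural mesh-based implicit field.

AI summary: Proposes a disentangled neural radiance field representation anchored on mesh vertices that enables efficient geometry, texture, and semantic-guided volumetric editing, including mesh-guided geometry editing; texture swapping, filling, and painting; and semantic editing.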

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/

Abstract

Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/

2606.18787 2026-06-18 cs.CV New submission 85%

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Eito Ogawa, Hiroshi Watanabe

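Affiliations: Graduate School of FSE, Waseda University, Tokyo, Japan

Topic match: 3D reconstruction. Point cloud surface reconstruction with UDF methods.

AI summary: Proposes a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. It is trained on off-grid target radii obtained by parabolic interpolation, improving the fine-scale accuracy of point cloud surface reconstruction.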

Abstract

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

2606.18558 2026-06-18 cs.CV New submission 85%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

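Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match: 3D reconstruction. Forecasts 3D point trajectories, a form of 3D motion prediction.

AI summary: Proposes a language-instructed method for forecasting 3D point motion. By building a large-scale dataset and benchmark, it achieves class-agnostic, view-stable trajectory forecasting and validates it on robot manipulation and video generation.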

Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18429 2026-06-18 cs.CV cs.AI cs.LG New submission 85%

CAOA -- Completion-Assisted Object-CAD Alignment

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

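Affiliations: University at Albany

Topic match: 3D reconstruction. Proposes a method for aligning CAD models to scanned objects.

AI summary: Proposes CAOA, which combines semantically aware point cloud completion with symmetry-aware relative pose estimation, achieves a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.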

Comments GitHub: https://github.com/MinhasKamal/CAOA

Journal ref Thirteenth International Conference on 3D Vision (3DV), 2026

Abstract

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
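
A 9-DoF pose combines three translation, three rotation, and three per-axis scale parameters. The sketch below shows one way to assemble them into a 4x4 transform. The composition order (scale, then rotate, then translate) and the Z-Y-X Euler convention are assumptions chosen for illustration, and `pose_9dof` is a hypothetical helper, not part of CAOA.

```python
import numpy as np

def pose_9dof(translation, euler_zyx, scale):
    """Build a 4x4 matrix that scales, then rotates, then translates.
    euler_zyx = (yaw, pitch, roll) in radians."""
    yaw, pitch, roll = euler_zyx
    cz, sz = np.cos(yaw), np.sin(yaw)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    M = np.eye(4)
    M[:3, :3] = (Rz @ Ry @ Rx) @ np.diag(scale)  # rotation times per-axis scale
    M[:3, 3] = translation
    return M

# Stretch x by 2, rotate 90 degrees about z, then shift by 1 along x.
M = pose_9dof(translation=[1.0, 0.0, 0.0],
              euler_zyx=(np.pi / 2, 0.0, 0.0),
              scale=[2.0, 1.0, 1.0])
moved = M @ np.array([1.0, 0.0, 0.0, 1.0])
```

Here the CAD-space point (1, 0, 0) is first stretched to (2, 0, 0), rotated to (0, 2, 0), and finally translated to (1, 2, 0).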

2606.13376 2026-06-18 cs.CV New submission 85%

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

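Affiliations: South China University of Technology; Columbia University; Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group; Singapore Management University

Topic match: 3D reconstruction. Builds an interactively navigable 3D scene from a single image.

AI summary: Proposes MoVerse, which builds an interactively navigable 360-degree panoramic world in real time from a single narrow-field-of-view image. Topology-aware diffusion completes the field of view, panoramic geometry-aware residual prediction produces a 3D Gaussian scaffold, and a bidirectional diffusion teacher is distilled into a causal autoregressive student for low-latency video rendering.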

Comments Project Page: https://orange-3dv-team.github.io/MoVerse/

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2503.09439 2026-06-18 cs.CV Updated version 85%

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou

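Affiliations: Tencent Games, China; Department of Computer Science & Engineering, Texas A&M University; Department of Computer Science, City University of Hong Kong

Topic match: 3D reconstruction. Proposes a 3D geometry super-resolution pipeline that adds texture-consistent surface details.

AI summary: Proposes SuperCarver, a 3D geometry super-resolution pipeline that supplements coarse meshes with texture-consistent surface details through a prior-guided normal diffusion model and noise-resistant inverse rendering, producing high-fidelity detail.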

Comments Accepted in IEEE TVCG

Abstract

Conventional production workflow of high-precision mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.

2606.18952 2026-06-18 cs.CV New submission 80%

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

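Affiliations: Shanghai University; Southern University of Science and Technology; The University of Sydney

Topic match: 3D reconstruction. A real-captured single-photon LiDAR benchmark supporting depth estimation and multi-view reconstruction.

AI summary: To address the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes, proposes STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.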

Abstract

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at 256×192 resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.
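
To make the time-of-flight histograms and multi-return behavior concrete, here is a minimal sketch of turning one pixel's photon-count histogram into candidate depths using d = c·t/2. The 1 ns bin width, the bin-centre timing, the simple local-maximum peak picker, and the name `histogram_to_depths` are all assumptions for illustration; they do not describe the benchmark's data format or evaluation protocol.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def histogram_to_depths(hist, bin_width_s, max_returns=2, min_count=1):
    """Pick the strongest local maxima of a photon-count histogram and
    convert their bin-centre times to depths via d = c * t / 2."""
    hist = np.asarray(hist)
    peaks = [i for i in range(1, len(hist) - 1)
             if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]
             and hist[i] >= min_count]
    peaks.sort(key=lambda i: hist[i], reverse=True)  # strongest first
    strongest = sorted(peaks[:max_returns])          # back to time order
    return [C * (i + 0.5) * bin_width_s / 2.0 for i in strongest]

# Toy histogram: two surface returns plus one weak noise spike.
hist = np.zeros(100, dtype=int)
hist[20], hist[60], hist[40] = 50, 30, 3
depths = histogram_to_depths(hist, bin_width_s=1e-9)
```

With these toy values the two strongest peaks (bins 20 and 60) are kept and the weak spike at bin 40 is discarded, giving depths of roughly 3.07 m and 9.07 m.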

2606.18861 2026-06-18 cs.CV cs.AI New submission 80%

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

Xinze Zhang

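Affiliations: University of Southern California

Topic match: 3D reconstruction. Synthesizes URDFs from RGB-D sequences with joint inference and verification.

AI summary: Proposes the KinemaForge pipeline, which jointly estimates part shape, joint topology, and joint parameters from RGB-D sequences using differentiable joint inference and energy-consistent verification, significantly reducing joint-axis error and simulation drift.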

Abstract

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
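
The relative reductions quoted in the abstract follow directly from the stated joint-axis errors. A short arithmetic check is given below; the helper name `relative_change` is hypothetical, and the input numbers are the ones stated in the abstract.

```python
def relative_change(before, after):
    # Percentage change from `before` to `after`; negative means a reduction.
    return 100.0 * (after - before) / before

vs_paris = relative_change(4.52, 2.83)  # strongest geometric baseline
vs_ditto = relative_change(5.30, 2.83)  # interaction-based baseline
print(f"vs. PARIS: {vs_paris:.1f}%  vs. Ditto: {vs_ditto:.1f}%")
```

Both values round to the figures reported in the abstract (-37.4% and -46.6%).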

2511.02036 2026-06-18 cs.RO Updated version 80%

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu, Steven Y. Ko

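Affiliations: Simon Fraser University; University at Buffalo

Topic match: 3D reconstruction. GPU-accelerated local mapping for visual SLAM.

AI summary: To reduce local-mapping latency in visual SLAM, proposes TurboMap, a backend that combines GPU parallelization with CPU optimization. It restructures map point creation, map point fusion, and keyframe management to achieve 1.3x to 1.6x speedups while preserving accuracy.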

Comments Accepted for presentation at IROS 2026, preprint

Abstract

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.

2606.16849 2026-06-18 cs.NE cs.GR cs.HC New submission 70%

Evolution & Foundation: AI Shares Creative Control

Dylan Banarse, Stephen Todd, William Latham, Frederic Fol Leymarie

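Topic match: 3D reconstruction. Generates 3D organic forms, which involves 3D modeling.

AI summary: Proposes a framework that combines genetic algorithms with multimodal AI foundation models to automate the design of 3D organic forms, shifting the artist's role from direct selection to system design and accelerating creative exploration.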

Abstract

This paper investigates the creative process of automated design and artistic evaluation using an evolutionary system. We consider how a multimodal artificial intelligence (AI) model can communicate with and guide a combined generative and evolutionary computational system. This creates a framework for the evolution of aesthetically pleasing complex 3D organic forms by integrating genetic algorithms with the visual reasoning capabilities of large-scale AI foundation models. The framework shifts the artist's role from that of intensive direct selection to one of system design, transferring detailed step-by-step curation to an AI agent capable of multimodal aesthetic judgement. This framework enables the human artist/designer to rapidly traverse large areas of multi-dimensional evolutionary parameter space to find creative outcomes based on their semantic targets. Detailed audit trails of the AI's aesthetic reasoning are generated for each experiment. Interactive visualisation tools, together with AI-generated summaries and evolutionary narratives, enable deep exploration into each evolutionary experiment and provide transparent insight into the AI-guided process.
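
The loop described here, in which an AI judge replaces direct human selection inside a genetic algorithm, can be sketched as follows. This is an illustration under stated assumptions only: `judge` is a hypothetical numeric stand-in for the multimodal model's aesthetic score, genomes are plain lists of floats rather than 3D form parameters, and the selection, crossover, and mutation settings are arbitrary choices, not the authors' system.

```python
import random

def judge(genome):
    # Hypothetical stand-in for an AI aesthetic score: higher is better,
    # with the best possible score (0) when every gene equals 0.5.
    return -sum((g - 0.5) ** 2 for g in genome)

def evolve(pop_size=20, genome_len=4, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(genome_len)] for _ in range(pop_size)]
    history = []  # best score per generation, a simple audit trail
    for _ in range(generations):
        pop.sort(key=judge, reverse=True)
        history.append(judge(pop[0]))
        parents = pop[: pop_size // 2]  # the judge, not a human, selects
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Gaussian mutation, clamped to the [0, 1] parameter range.
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.05))) for g in child]
            children.append(child)
        pop = parents + children
    pop.sort(key=judge, reverse=True)
    return pop[0], history

best, history = evolve()
```

Because the top half of each generation survives unchanged, the best score recorded in `history` never decreases from one generation to the next.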