Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
Soumava Paul, Prakhar Kaushik, Alan Yuille
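AI Summary: This paper studies how reliable multiview 3D consistency evaluation is when 3D foundation models hallucinate. It proposes a controlled robustness benchmark and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components, and it introduces COLMAP-based metrics that agree more closely with human judgments.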
- Comments: Project Page at https://mvp18.github.io/3d-consistency-metrics/
Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in novel view synthesis (NVS) and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.
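The decomposition into backbone, residual, and aggregation components can be pictured with a small toy sketch. This is not the authors' implementation: every name here (`ConsistencyMetric`, `identity_backbone`, `abs_residual`) is hypothetical, the "views" are short lists of numbers standing in for image features, and the backbone is a placeholder rather than a learned reconstruction model. The sketch also assumes that a lower score means a more consistent set of views. It only illustrates how swapping one component, here the aggregator, changes how a metric reacts to a single outlier.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Callable, List, Sequence

# All names below are hypothetical and exist only for illustration.
Features = List[float]  # stand-in for the per-pixel features of one view


def identity_backbone(target: Features, source: Features) -> Features:
    """Placeholder for a learned backbone that would warp `source` into
    the frame of `target`; here it returns `source` unchanged."""
    return list(source)


def abs_residual(target: Features, warped: Features) -> List[float]:
    """Per-element absolute difference between a view and a warped view."""
    return [abs(t - w) for t, w in zip(target, warped)]


@dataclass
class ConsistencyMetric:
    backbone: Callable[[Features, Features], Features]
    residual: Callable[[Features, Features], List[float]]
    aggregate: Callable[[Sequence[float]], float]

    def score(self, views: Sequence[Features]) -> float:
        """Aggregate residuals within each view pair, then across pairs.
        Assumption of this sketch: lower means more consistent."""
        pair_scores = []
        for i in range(len(views)):
            for j in range(i + 1, len(views)):
                warped = self.backbone(views[i], views[j])
                errors = self.residual(views[i], warped)
                pair_scores.append(self.aggregate(errors))
        return self.aggregate(pair_scores)


# Three toy views; the third contains one corrupted element.
views = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 9.0]]

mean_metric = ConsistencyMetric(identity_backbone, abs_residual, mean)
median_metric = ConsistencyMetric(identity_backbone, abs_residual, median)

print(mean_metric.score(views))    # 2.0
print(median_metric.score(views))  # 0.0
```

With the same backbone and residual, the mean aggregator is pulled up by the single corrupted element, while the median ignores it. Whether ignoring or flagging such an outlier is the desired behavior depends on the evaluation goal, which is the kind of design choice a parametric family makes explicit.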