Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu
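AI Summary: This paper proposes an equation-anchored tool-use framework that re-purposes spatial tools as formula variables to address the camera intrinsic ambiguity of 3D localization in Multimodal Large Language Models (MLLMs), yielding significant improvements on 3D object detection and 3D visual grounding tasks.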
Details
3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly). In both cases, camera information is not deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in the Chain-of-Thought (CoT), and substitutes the tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding, with camera intrinsics rescaled from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
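The back-projection step named in the abstract can be illustrated with a minimal sketch under the standard pinhole camera model. This is not the authors' implementation. The reading of the symbols is an assumption ($u_c$ as the horizontal pixel coordinate of the object center, $\bar{Z}$ as the mean of the sampled metric depths, $f_x$ and $c_x$ as the focal length and principal point). The function name `back_project`, the analogous Y-axis term, and all numeric values are hypothetical.

```python
from statistics import mean


def back_project(u_c, v_c, depths, fx, fy, cx, cy):
    """Back-project a pixel to camera-frame 3D coordinates (pinhole model).

    Implements X = (u_c - cx) * Z_bar / fx from the abstract; the Y term is
    the standard pinhole analogue and is an assumption, not from the source.
    """
    z_bar = mean(depths)              # average of the multi-point depth samples
    x_hat = (u_c - cx) * z_bar / fx   # horizontal offset in metres
    y_hat = (v_c - cy) * z_bar / fy   # vertical offset in metres
    return x_hat, y_hat, z_bar


# Hypothetical depth samples (metres) around an object centred at pixel (400, 300).
depths = [1.5, 2.0, 2.5]

# Same pixel and depth, two hypothetical cameras: a reference focal length
# and one scaled by 1.5x.
x_ref, y_ref, z_ref = back_project(400.0, 300.0, depths,
                                   fx=500.0, fy=500.0, cx=320.0, cy=240.0)
x_scaled, _, _ = back_project(400.0, 300.0, depths,
                              fx=750.0, fy=750.0, cx=320.0, cy=240.0)

print(f"X with reference intrinsics: {x_ref:.4f} m")
print(f"X with 1.5x focal length:    {x_scaled:.4f} m")
```

With identical pixel coordinates and depth, the recovered lateral position shrinks by the same factor by which the focal length grows. This is the ambiguity the abstract describes: a model that implicitly assumes one focal length will misplace the object when the real camera differs, whereas substituting the retrieved intrinsics into the formula makes the dependence explicit.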