Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor
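AI summary: Proposes the Staged Executable Inverse Graphics (SEIG) framework, which uses pretrained vision-language models to reconstruct an editable Blender program directly from a single image, without specialized foundation models or differentiable rendering, and improves reconstruction fidelity by progressively refining geometry, materials, composition, and lighting.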
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors, including geometry, materials, composition, and lighting, directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
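The staged refinement described in the abstract could be sketched roughly as below. This is an illustrative guess, not the authors' implementation: the abstract does not specify the control loop, the acceptance rule, or the number of rounds. The names `staged_refine`, `propose`, `render`, and `score` are hypothetical placeholders for, respectively, the outer loop, a VLM call that edits the program, execution of the program in Blender, and an image-space metric. The toy stand-ins below replace all three with plain arithmetic so the sketch runs without Blender or a model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Scene factors named in the abstract; the ordering here is an assumption.
STAGES: Tuple[str, ...] = ("geometry", "materials", "composition", "lighting")

Program = Dict[str, float]   # toy stand-in for a Blender script
Image = List[float]          # toy stand-in for a rendered image


def staged_refine(
    program: Program,
    target: Image,
    propose: Callable[[Program, str, Image], Program],
    render: Callable[[Program], Image],
    score: Callable[[Image, Image], float],
    stages: Sequence[str] = STAGES,
    rounds: int = 3,
) -> Tuple[Program, float]:
    """Refine one scene factor at a time, keeping an edit only if the error drops."""
    best = score(render(program), target)
    for stage in stages:
        for _ in range(rounds):
            candidate = propose(program, stage, target)
            error = score(render(candidate), target)
            if error < best:  # greedy acceptance rule (an assumption)
                program, best = candidate, error
    return program, best


# --- Toy stand-ins (hypothetical; no VLM or Blender involved) ---

def toy_render(program: Program) -> Image:
    """Pretend each factor contributes one 'pixel' to the rendered image."""
    return [program[s] for s in STAGES]


def toy_score(rendered: Image, target: Image) -> float:
    """Mean squared error, a simple pixel-level metric."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


def toy_propose(program: Program, stage: str, target: Image) -> Program:
    """Move only the current stage's value halfway toward its target value."""
    goal = target[STAGES.index(stage)]
    edited = dict(program)
    edited[stage] = program[stage] + 0.5 * (goal - program[stage])
    return edited


target_image = [1.0, 0.5, 2.0, 4.0]
initial_program = {s: 0.0 for s in STAGES}
initial_error = toy_score(toy_render(initial_program), target_image)
final_program, final_error = staged_refine(
    initial_program, target_image, toy_propose, toy_render, toy_score
)
print(initial_error, final_error)
```

In a real pipeline, `render` would execute the generated script in Blender and `score` would combine pixel-level, perceptual, and semantic measures; which specific metrics and how they are weighted is not stated in the abstract.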