Can vision-language models reconstruct editable 3D scenes from photos?

Inverse graphics—reconstructing 3D scenes from photos—is notoriously hard. This work shows that standard vision-language models can do it by generating executable Blender code directly, refining geometry, materials, lighting, and composition stage-by-stage. The staged approach beats single-shot reconstruction, and the resulting editable scenes support relighting and object manipulation without requiring differentiable rendering or 3D foundation models.