Can vision-language models reconstruct editable 3D scenes from photos?
Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor
June 1, 2026
Inverse graphics, the task of reconstructing 3D scenes from photos, is notoriously hard. This work shows that standard vision-language models can do it by generating executable Blender code directly, refining geometry, materials, lighting, and composition stage by stage. The staged approach beats single-shot reconstruction, and the resulting editable scenes support relighting and object manipulation without requiring differentiable rendering or 3D foundation models.
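To make the staged idea concrete, here is a minimal sketch of the control flow only, not the authors' implementation. The stage order comes from the summary above; the names `staged_reconstruction`, `propose`, and `stub_propose` are hypothetical, and `propose` stands in for a vision-language model call that would return a revised Blender script. The stub below simply appends a comment per stage so the loop can run without a model or Blender installed.

```python
from typing import Callable, Sequence

# Stage order taken from the summary; everything else here is an assumption.
STAGES = ("geometry", "materials", "lighting", "composition")


def staged_reconstruction(
    photo_path: str,
    propose: Callable[[str, str, str], str],
    stages: Sequence[str] = STAGES,
) -> str:
    """Build a scene script one stage at a time.

    `propose(photo_path, stage, script_so_far)` is a hypothetical stand-in
    for a vision-language model call; it returns the revised script text.
    """
    # The script is handled as plain text here and is never executed.
    script = "import bpy\n"
    for stage in stages:
        # Each stage sees the script produced by all earlier stages.
        script = propose(photo_path, stage, script)
    return script


def stub_propose(photo_path: str, stage: str, script: str) -> str:
    # Placeholder for a model call: only records which stage ran.
    return script + f"# {stage} pass for {photo_path}\n"


result = staged_reconstruction("photo.jpg", stub_propose)
print(result)
```

A single-shot baseline would correspond to calling `propose` once with no stage breakdown. In the staged loop, each pass can revise the output of earlier passes, which is one plausible reading of why the staged approach does better.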