cs.CV

Can vision-language models reconstruct editable 3D scenes from photos?

Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor

June 1, 2026

Inverse graphics, the task of reconstructing 3D scenes from photos, is notoriously hard. This work shows that standard vision-language models can do it by generating executable Blender code directly, refining geometry, materials, lighting, and composition stage by stage. The staged approach beats single-shot reconstruction, and the resulting scenes remain editable: they support relighting and object manipulation, and the method requires neither differentiable rendering nor 3D foundation models.
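To make the staged idea concrete, here is a minimal sketch of a stage-by-stage refinement loop. It is not the authors' implementation. Only the stage order comes from the summary above; the `propose` and `score` callables, the round count, and the keep-if-better acceptance rule are all assumptions made for illustration. In a real setup, `propose` would call a vision-language model and `score` would render the script in Blender and compare the result with the photo.

```python
from typing import Callable

# Stage order as listed in the summary; everything else here is hypothetical.
STAGES = ["geometry", "materials", "lighting", "composition"]


def staged_reconstruction(
    propose: Callable[[str, str], str],
    score: Callable[[str], float],
    rounds_per_stage: int = 2,
) -> str:
    """Grow a scene script stage by stage, keeping only revisions that score higher.

    propose(stage, script) stands in for a model call returning a revised script.
    score(script) stands in for render-and-compare against the photo (higher is better).
    The keep-if-better rule is an assumption, not something the summary specifies.
    """
    script = ""
    best = score(script)
    for stage in STAGES:
        for _ in range(rounds_per_stage):
            candidate = propose(stage, script)
            candidate_score = score(candidate)
            if candidate_score > best:
                script, best = candidate, candidate_score
    return script


# Toy stand-ins so the sketch runs without Blender or a model.
def toy_propose(stage: str, script: str) -> str:
    line = f"# {stage} pass"
    return script if line in script else script + line + "\n"


def toy_score(script: str) -> float:
    # Counts how many stages have left a marker line in the script.
    return float(sum(f"# {s} pass" in script for s in STAGES))


result = staged_reconstruction(toy_propose, toy_score)
```

With the toy stand-ins, `result` ends up with one marker line per stage, in stage order. The point of the structure is that each stage edits one shared script, so the final output is still an ordinary program that can be changed by hand afterwards.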
Published as "Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models" (arXiv:2606.02580).