cs.CV

Can vision models imagine unseen viewpoints to reason better?

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

June 4, 2026

Vision-language models struggle with spatial reasoning beyond what they can directly see: inferring hidden layouts, maintaining consistency across views, or reasoning from new angles. Astra couples a vision model trained with reinforcement learning with a world simulator that generates novel viewpoints from text camera instructions. The simulator learns consistency-preserving 3D transformations, while the model learns when to invoke imagination and when to answer directly. On spatial reasoning benchmarks, the full system raises performance from 29.8% to 38.8% (a gain of 9.0 percentage points), showing that imagined evidence helps, but only when the model knows when to use it.
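To make the "imagine or answer directly" decision concrete, here is a minimal sketch of such a loop in Python. It is not the paper's implementation. The names `run_episode`, `toy_policy`, `toy_simulator`, the `must_answer` flag, and the imagination budget are all hypothetical, and views are plain strings standing in for images.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: request an imagined view OR give a final answer."""
    camera_instruction: Optional[str] = None  # e.g. "rotate camera 90 degrees left"
    answer: Optional[str] = None


@dataclass
class Episode:
    question: str
    views: List[str] = field(default_factory=list)  # observed view plus imagined ones


# Hypothetical interfaces (assumptions, not the paper's API).
Policy = Callable[[Episode, bool], Step]   # (episode, must_answer) -> Step
Simulator = Callable[[str, str], str]      # (view, camera instruction) -> new view


def run_episode(question: str, observed_view: str, policy: Policy,
                simulator: Simulator, max_imaginations: int = 3) -> Tuple[str, Episode]:
    """Let the policy alternate between imagining views and answering."""
    episode = Episode(question, [observed_view])
    for _ in range(max_imaginations):
        step = policy(episode, False)
        if step.answer is not None:
            return step.answer, episode  # the policy chose to answer directly
        if step.camera_instruction is None:
            raise ValueError("policy returned neither an answer nor an instruction")
        # Ask the simulator for a novel viewpoint derived from the latest view.
        episode.views.append(simulator(episode.views[-1], step.camera_instruction))
    # Imagination budget exhausted: force a final answer.
    step = policy(episode, True)
    if step.answer is None:
        raise ValueError("policy must answer once the budget is exhausted")
    return step.answer, episode


def toy_simulator(view: str, instruction: str) -> str:
    """Stand-in for a world simulator: tags the view with the instruction."""
    return f"{view} | imagined[{instruction}]"


def toy_policy(episode: Episode, must_answer: bool) -> Step:
    """Stand-in for a learned policy: imagine once, then answer."""
    if must_answer or len(episode.views) > 1:
        return Step(answer=f"answered from {len(episode.views)} view(s)")
    return Step(camera_instruction="rotate camera 90 degrees left")


if __name__ == "__main__":
    answer, episode = run_episode("What is behind the sofa?", "front view",
                                  toy_policy, toy_simulator)
    print(answer)
    print(episode.views)
```

In this sketch the choice between imagining and answering lives entirely in the policy, which is the part the summary says is learned. The budget only guarantees that the loop terminates.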
Published as "Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators", arXiv:2606.06476