Can vision models imagine unseen viewpoints to reason better?

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Vision-language models struggle with spatial reasoning beyond what is directly visible: inferring hidden layouts, maintaining consistency across views, or reasoning from new viewpoints. Astra couples a vision model trained with reinforcement learning with a world simulator that generates novel viewpoints from text camera instructions. The simulator learns consistency-preserving 3D transformations, while the model learns when to invoke imagination and when to answer directly. On spatial reasoning benchmarks, the full system raises performance from 29.8% to 38.8%, a gain of 9.0 percentage points, showing that imagined evidence helps, but only when the model has learned when to use it.
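
To make the interplay between the model and the simulator concrete, the following is a minimal sketch of one possible control loop, not the authors' implementation. All names here (`Answer`, `Imagine`, `Context`, `reason_with_imagination`, the toy policy and simulator, and the `max_imagined_views` budget) are hypothetical, and images are replaced by plain strings so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Answer:
    """Hypothetical action: the model answers directly."""
    text: str


@dataclass
class Imagine:
    """Hypothetical action: the model asks for a novel viewpoint."""
    camera_instruction: str  # free-text camera instruction


Action = Union[Answer, Imagine]


@dataclass
class Context:
    question: str
    views: List[str] = field(default_factory=list)  # strings stand in for images


def reason_with_imagination(
    ctx: Context,
    policy: Callable[[Context, bool], Action],
    simulator: Callable[[List[str], str], str],
    max_imagined_views: int = 3,  # assumed budget, not taken from the paper
) -> str:
    """Let the policy either answer or request imagined views, up to a budget."""
    for _ in range(max_imagined_views):
        action = policy(ctx, True)
        if isinstance(action, Answer):
            return action.text
        # The simulator turns the text camera instruction into a new view,
        # which is added to the evidence the policy sees on the next step.
        ctx.views.append(simulator(ctx.views, action.camera_instruction))
    # Budget spent: the policy is told it may no longer imagine.
    final = policy(ctx, False)
    if not isinstance(final, Answer):
        raise ValueError("policy must answer once the imagination budget is spent")
    return final.text


def toy_policy(ctx: Context, may_imagine: bool) -> Action:
    """Stand-in for the learned policy: imagine once, then answer."""
    if may_imagine and len(ctx.views) < 2:
        return Imagine("rotate camera 90 degrees to the left")
    return Answer(f"answered using {len(ctx.views)} view(s)")


def toy_simulator(views: List[str], instruction: str) -> str:
    """Stand-in for the world simulator: returns a label instead of an image."""
    return f"imagined[{instruction}] from {len(views)} view(s)"


ctx = Context("What is behind the sofa?", ["observed_view_0"])
result = reason_with_imagination(ctx, toy_policy, toy_simulator)
```

In this sketch the decision of when to imagine lives entirely in the policy, which mirrors the abstract's point that imagined evidence is only useful if the model chooses when to request it. In the described system that decision is learned with reinforcement learning rather than hard-coded as in `toy_policy`.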