cs.RO

Can robots reason in 3D space without seeing it?

Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan

June 3, 2026

Robot control models struggle with spatial reasoning from images alone. This work injects 3D geometric awareness into vision-language-action (VLA) models during training: the models learn from a 3D foundation model acting as a teacher, and that spatial knowledge is distilled into lightweight adapters. At test time, the robot uses only 2D images, with no 3D sensors or teacher needed, yet outperforms prior approaches on the LIBERO, LIBERO-PLUS, and SimplerEnv benchmarks and on real manipulation tasks.
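To make the training-versus-inference split concrete, here is a minimal NumPy sketch of teacher-guided co-training with an adapter. It is an illustration under stated assumptions, not the paper's implementation: the names `adapter`, `cosine_distill_loss`, and `cotraining_loss`, the bottleneck adapter design, the cosine alignment loss, the loss weight, and all tensor shapes are hypothetical, and random arrays stand in for real 2D features and 3D teacher features.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(features, w_down, w_up):
    """Hypothetical lightweight bottleneck adapter with a residual connection."""
    hidden = np.maximum(features @ w_down, 0.0)  # ReLU in the bottleneck
    return features + hidden @ w_up

def cosine_distill_loss(student, teacher):
    """One minus the mean per-token cosine similarity (assumed alignment loss)."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(1.0 - np.mean(np.sum(s * t, axis=-1)))

def cotraining_loss(pred_actions, target_actions, student, teacher, weight=0.5):
    """Action regression loss plus a weighted distillation term (weight is assumed)."""
    action_loss = float(np.mean((pred_actions - target_actions) ** 2))
    return action_loss + weight * cosine_distill_loss(student, teacher)

# Toy shapes, chosen only for illustration.
tokens, dim, bottleneck, action_dim = 16, 32, 8, 7
feats_2d = rng.normal(size=(tokens, dim))    # stand-in for 2D image features
teacher_3d = rng.normal(size=(tokens, dim))  # stand-in for 3D teacher features
w_down = rng.normal(scale=0.1, size=(dim, bottleneck))
w_up = np.zeros((bottleneck, dim))           # zero init: adapter starts as identity
w_action = rng.normal(scale=0.1, size=(dim, action_dim))

# Training step: the teacher features appear only inside the loss.
student = adapter(feats_2d, w_down, w_up)
pred = student.mean(axis=0) @ w_action
target = np.zeros(action_dim)
train_loss = cotraining_loss(pred, target, student, teacher_3d)

# Inference: only 2D features and the adapter are used; no teacher, no 3D input.
test_action = adapter(feats_2d, w_down, w_up).mean(axis=0) @ w_action
```

In this sketch the teacher features enter only through the loss, so dropping the teacher at test time leaves the forward pass unchanged. That is the property the summary describes, though the actual architecture and losses in the paper may differ.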
Published as "3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training" (arXiv:2606.04436).