Can robots reason in 3D space without seeing it?

Robot control models struggle with spatial reasoning when they see only 2D images. This work injects 3D geometric awareness into vision-language-action models during training: the models learn from a 3D foundation model that acts as a teacher, and that spatial knowledge is distilled into lightweight adapters. At test time the robot uses only 2D images, with no 3D sensors and no teacher, yet it outperforms prior approaches on the LIBERO, LIBERO-PLUS, and SimplerEnv benchmarks and on real manipulation tasks.
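
To make the training-time distillation idea concrete, here is a minimal sketch of a feature-alignment objective. The summary above does not specify the actual loss, so everything here is an assumption for illustration: the cosine-based alignment term, the weighting factor `weight`, and the function names `alignment_loss` and `training_loss` are all hypothetical. Features are plain lists of floats standing in for per-token embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over paired token features.

    Hypothetical stand-in for the distillation term: `student_tokens` would
    come from the adapter-augmented 2D model, `teacher_tokens` from the frozen
    3D foundation model. The real objective may differ.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("need the same non-zero number of student and teacher tokens")
    total = sum(1.0 - cosine_similarity(s, t)
                for s, t in zip(student_tokens, teacher_tokens))
    return total / len(student_tokens)


def training_loss(action_loss, student_tokens, teacher_tokens, weight=0.5):
    """Assumed combined objective: action loss plus a weighted alignment term.

    Only used during training; at test time the teacher is absent and the
    policy simply runs on 2D images.
    """
    return action_loss + weight * alignment_loss(student_tokens, teacher_tokens)
```

In this sketch the teacher appears only inside `training_loss`, which mirrors the claim that nothing 3D-specific is needed once training is finished.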