Can robots learn dexterous skills by aligning 3D space across cameras and bodies?

Huayi Zhou, Wei Gao, Dekun Lu, Ruiji Liu, Zhanqi Zhang, Ziyang Zhang, Jian Chen, Wenlve Zhou, Sheng Xu, Shumin Li, Kangyi Guo, Shichen Xu, Zixin Huang, Yongyi Su, Kui Jia

End-to-end manipulation policies often fail when camera viewpoints or robot bodies change, because they learn from 2D images without spatial grounding. This work adds 3D awareness by computing pixel-wise 3D coordinates from camera calibration, then aligning both the visual inputs and the robot actions to a shared bird's-eye-view frame. A temporal alignment scheme additionally handles the different recording speeds found across robots and datasets. The method improves consistency and real-world transfer; the code, trained models, and data pipeline are released.
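
To make the two alignment ideas concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' released pipeline. It assumes a per-pixel depth map is available (the summary above only mentions camera calibration), a pinhole intrinsic matrix `K`, and a 4x4 rigid transform `T_shared_from_cam` into the shared frame. The function names `pixel_coords_in_shared_frame` and `resample_actions` are hypothetical, and linear interpolation is only one simple stand-in for the temporal alignment scheme.

```python
import numpy as np


def pixel_coords_in_shared_frame(depth, K, T_shared_from_cam):
    """Return an (H, W, 3) map of per-pixel 3D points in the shared frame.

    Assumptions (not from the source): `depth` holds metric z-depth per pixel,
    `K` is a 3x3 pinhole intrinsic matrix, and `T_shared_from_cam` is a 4x4
    rigid transform from the camera frame to the shared frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # column and row indices
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # camera-frame rays with z = 1
    pts_cam = rays * depth[..., None]      # scale each ray by its depth
    R, t = T_shared_from_cam[:3, :3], T_shared_from_cam[:3, 3]
    return pts_cam @ R.T + t               # rotate, then translate


def resample_actions(actions, src_hz, dst_hz):
    """Linearly interpolate a (T, D) action sequence from src_hz to dst_hz.

    A simple stand-in for temporal alignment; the actual scheme may differ.
    """
    actions = np.asarray(actions, dtype=np.float64)
    t_src = np.arange(len(actions)) / src_hz
    n_dst = int(np.floor(t_src[-1] * dst_hz + 1e-9)) + 1
    t_dst = np.arange(n_dst) / dst_hz
    cols = [np.interp(t_dst, t_src, actions[:, d]) for d in range(actions.shape[1])]
    return np.stack(cols, axis=1)


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[:3, 3] = [0.0, 0.0, 1.0]             # shared frame offset 1 m along z
    xyz = pixel_coords_in_shared_frame(np.full((3, 5), 2.0), K, T)
    print(xyz.shape, xyz[1, 2])

    acts = np.stack([np.arange(5.0), 2.0 * np.arange(5.0)], axis=1)
    print(resample_actions(acts, src_hz=10, dst_hz=20).shape)
```

Applying the same rigid transform to end-effector positions in the action stream would express observations and actions in one frame, which is the intuition behind aligning both to a shared view. Resampling every dataset to a common control rate keeps one action step meaning the same time interval across robots.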