Teaching robots to interact with objects from videos, not blueprints

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

The main bottleneck in teaching humanoid robots to manipulate objects is the scarcity of detailed 3D training data. This work sidesteps that bottleneck by learning from video generative models instead. The key insight is to track only a few critical points (hands, base, and object) rather than the entire body, which eliminates the error-prone retargeting step. The authors anchor this sparse tracking in the latent space of a behavior foundation model so that the resulting motion stays natural. The result is zero-shot deployment on physical robots in a mocap system, with no need for explicit CAD models or extensive morphing.
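To make the sparse-tracking idea concrete, here is a minimal sketch of what such an objective could look like. It is not the authors' formulation: the keypoint names, the function names, the squared-error tracking term, and the simple L2 penalty that keeps the latent code near a reference latent are all assumptions chosen for illustration. The summary only says that a few points are tracked and that tracking is anchored in a behavior foundation model's latent space.

```python
from typing import Dict, Sequence

# Hypothetical keypoint names; the summary only says "hands, base, object".
KEYPOINTS = ("left_hand", "right_hand", "base", "object")


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sparse_tracking_cost(
    robot: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    latent: Sequence[float],
    latent_prior: Sequence[float],
    latent_weight: float = 0.1,
) -> float:
    """Illustrative cost: track a few keypoints, stay near a latent prior.

    ASSUMPTION: the L2 latent penalty is a stand-in for "anchoring" in a
    behavior foundation model's latent space; the actual mechanism is not
    described in the summary.
    """
    # Only the sparse keypoints contribute; no full-body pose is compared,
    # so no whole-body retargeting is needed to evaluate this term.
    tracking = sum(squared_distance(robot[k], target[k]) for k in KEYPOINTS)
    # Penalize drifting away from a latent code assumed to encode natural motion.
    prior = squared_distance(latent, latent_prior)
    return tracking + latent_weight * prior
```

In this sketch, `target` would hold keypoint positions extracted from a generated video and `robot` the corresponding positions on the humanoid. The design choice it illustrates is that the cost never references joints other than the listed keypoints, while the latent term discourages unnatural solutions that would otherwise satisfy the sparse constraints.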