cs.CV

Teaching robots to interact with objects from videos, not blueprints

Jiahe Chen, ZiRui Wang, Feiyu Jia, Xiao Chen, Xiaojie Niu, Weishuai Zeng, Tianfan Xue, Xiaowei Zhou, Jiangmiao Pang, Jingbo Wang

May 21, 2026

The main bottleneck in teaching humanoid robots to manipulate objects is the scarcity of detailed 3D training data. This work sidesteps that by drawing on video generative models instead. The key insight is to track only a few critical points (hands, base, object) rather than the entire body, which removes the error-prone retargeting step. The authors anchor the sparse tracking in a behavior foundation model's latent space to keep the motion natural. The result is zero-shot deployment on physical robots in a mocap system, without explicit CAD models or extensive morphing.
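
To make the sparse-tracking idea concrete, here is a minimal sketch of a tracking error computed over a handful of keypoints only. It is an illustration under stated assumptions, not the paper's implementation: the keypoint names, the 3D coordinates, and the `sparse_tracking_error` function are all hypothetical, and the behavior foundation model's latent space is not modeled here.

```python
import math

# Hypothetical keypoint names; the summary only mentions hands, base, and object.
KEYPOINTS = ("left_hand", "right_hand", "base", "object")


def sparse_tracking_error(reference, achieved, keypoints=KEYPOINTS):
    """Mean Euclidean distance (in the inputs' units) over the selected keypoints only."""
    total = 0.0
    for name in keypoints:
        total += math.dist(reference[name], achieved[name])
    return total / len(keypoints)


# Made-up target positions for one frame, e.g. lifted from a generated video.
reference = {
    "left_hand": (0.30, 0.20, 1.00),
    "right_hand": (0.30, -0.20, 1.00),
    "base": (0.00, 0.00, 0.00),
    "object": (0.40, 0.00, 0.90),
    "left_elbow": (0.15, 0.25, 1.10),  # present in the data but never tracked
}

# Made-up positions the robot actually reached in the same frame.
achieved = {
    "left_hand": (0.30, 0.20, 1.00),
    "right_hand": (0.30, -0.20, 1.00),
    "base": (0.00, 0.00, 0.00),
    "object": (0.40, 0.00, 1.00),
    "left_elbow": (0.90, 0.90, 0.90),  # a large mismatch here does not change the score
}

error = sparse_tracking_error(reference, achieved)
print(f"sparse tracking error: {error:.3f}")
```

Because the elbow is excluded from the tracked set, its mismatch contributes nothing to the error. That is the sense in which a sparse objective leaves the rest of the body unconstrained, with no per-joint retargeting required.
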
Published as "Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors", arXiv:2605.22272