Teaching robots locomotion and manipulation from human videos

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

Humanoid robots struggle to learn complex skills such as picking up and carrying objects while moving: existing methods either need hand-crafted rewards or fail on novel tasks. SUGAR trains robots from ordinary human videos in three stages. It first extracts motion patterns automatically, then uses physics simulation to correct the errors (occlusions, contact mistakes) that plague raw video data, and finally compresses the result into a policy that works in the real world. On six tasks ranging from object retrieval to stair climbing, it outperforms reference-motion baselines, scales with more video data, and transfers to physical hardware with robust closed-loop recovery.