Why robots learn better from human video when trained on camera movement

Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

Robots trained on human video consistently underperform those trained on robot data, even though human video is more abundant. This gap stems from ignoring active perception: how humans reposition their viewpoint during manipulation. ActiveMimic recovers camera and wrist trajectories from egocentric RGB video, models viewpoint repositioning as an action, and jointly learns manipulation and active perception. In real-world robot experiments, it matches robot-pretrained baselines while using only human video, suggesting that active perception is the missing link for scaling robot learning to unconstrained human footage.
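
To make the idea of treating viewpoint repositioning as an action concrete, the following is a minimal illustrative sketch, not the ActiveMimic implementation. It assumes that camera poses have already been recovered as world-from-camera rotations and translations, and that wrist positions are given in the same world frame. The function name `joint_actions`, the dictionary keys, and the pose convention are all hypothetical choices made for this example. Each step pairs a camera motion (the change in viewpoint, expressed in the current camera frame) with the next wrist position in that same frame, so a policy could predict both together.

```python
import math

def transpose(R):
    return [list(row) for row in zip(*R)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(R, v):
    return [sum(r * x for r, x in zip(row, v)) for row in R]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def yaw(theta):
    """Rotation about the z axis by theta radians (helper for building examples)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def joint_actions(cam_poses, wrist_world):
    """Build per-step actions from recovered trajectories.

    cam_poses:   list of (R, t), world-from-camera pose per frame (assumed convention).
    wrist_world: list of wrist positions [x, y, z] in the world frame, one per frame.
    Returns one dict per consecutive frame pair (hypothetical action layout).
    """
    actions = []
    for k in range(len(cam_poses) - 1):
        (R0, t0), (R1, t1) = cam_poses[k], cam_poses[k + 1]
        R0_T = transpose(R0)  # inverse of a rotation matrix is its transpose
        actions.append({
            # Viewpoint action: camera motion from frame k to k+1, in frame k's coordinates.
            "view_rot": matmul(R0_T, R1),
            "view_trans": matvec(R0_T, sub(t1, t0)),
            # Manipulation target: next wrist position, in frame k's camera coordinates.
            "wrist_in_cam": matvec(R0_T, sub(wrist_world[k + 1], t0)),
        })
    return actions
```

Expressing both quantities in the current camera frame is one plausible design choice: it keeps the action independent of where the recording started in the world, which is the kind of property one would want when transferring from human footage to a robot-mounted camera.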