cs.RO

Teaching robots to walk and manipulate from human videos

Haoran Huang, Haonan Dong, Huixu Dong

May 20, 2026

Mobile robots learning from human demonstrations face two tangled problems: camera footage mixes walking with hand motion, and inference delays let the moving base drift away from the positions the policy predicted. Mobile UMI addresses both. Demonstrations are recorded with two cameras (one on the chest, one on the wrist) and without a robot present; base movement is then mathematically separated from arm motion using spatial anchoring. An online executor continuously realigns pending actions to the robot's actual pose before they execute and discards outdated waypoints. On four household tasks, the system achieved 83.8% success, substantially better than prior approaches, with decoupled kinematics and latency correction each closing significant gaps.
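The two ideas above, separating base motion from arm motion and realigning queued actions to the robot's current pose, can be illustrated with a small planar sketch. It is not the paper's implementation: it assumes poses are simplified to SE(2) tuples `(x, y, theta)`, treats the chest camera pose as a stand-in for the base, and uses hypothetical helper names (`compose`, `inverse`, `decouple`, `realign`).

```python
import math

def compose(a, b):
    """Return the SE(2) pose a * b; each pose is (x, y, theta)."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, at + bt)

def inverse(a):
    """Return the pose that undoes a, so compose(inverse(a), a) is identity."""
    ax, ay, at = a
    c, s = math.cos(at), math.sin(at)
    return (-(c * ax + s * ay), s * ax - c * ay, -at)

def decouple(world_base, world_wrist):
    """Express the wrist pose in the base frame (assumed planar simplification).

    The result no longer changes when base and wrist move together,
    which is the sense in which walking is separated from hand motion.
    """
    return compose(inverse(world_base), world_wrist)

def realign(waypoints, planned_base, actual_base, now):
    """Drop stale waypoints and re-express the rest in the actual base frame.

    waypoints: list of (timestamp, pose), poses given in the base frame
    that was assumed when the actions were predicted (planned_base).
    """
    correction = compose(inverse(actual_base), planned_base)
    return [(t, compose(correction, p)) for t, p in waypoints if t > now]

# Base facing +y at the origin, wrist 1 m further along +y in the world.
relative = decouple((0.0, 0.0, math.pi / 2), (0.0, 1.0, math.pi / 2))

# The base drifted 0.1 m forward while the policy was still computing.
queued = [(0.2, (0.3, 0.0, 0.0)), (1.0, (0.5, 0.0, 0.0))]
fresh = realign(queued, (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), now=0.5)
```

In this example `relative` places the wrist 1 m straight ahead of the base regardless of where the base sits in the world, and `fresh` keeps only the waypoint timestamped after `now`, shifted back by the 0.1 m the base has already travelled. A real system would work with full 3D poses and measured timestamps; the planar version only shows the frame bookkeeping.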
Published as "Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation" (arXiv:2605.20894)