Can video models control robots without learning to move?

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

Video generative models can predict future robot states, but existing approaches retrain them with action labels. Instead, the researchers froze the video model and trained only a small inverse dynamics model to translate predicted frames into robot commands, much like keeping a chess engine fixed and teaching only the move translator. The approach, called VERA, works across different robot arms and hands without retraining the core planner, and it succeeds on real Panda arm manipulation and complex 16-DoF hand control without task-specific data.
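
The division of labor described above can be illustrated with a toy sketch. This is not the authors' implementation: the linear "camera" (`RENDER`), the interpolating stand-in for the video model (`frozen_video_planner`), and the least-squares inverse dynamics model (`fit_idm`) are all hypothetical simplifications chosen so the example runs with NumPy alone. The only point it makes is structural: the planner is never updated, and only the frame-to-action translator is fit to data.

```python
import numpy as np

rng = np.random.default_rng(0)
DOF, PIXELS = 4, 64
# Hypothetical camera: maps joint positions to a flattened "image" vector.
RENDER = rng.normal(size=(PIXELS, DOF))


def render(joints):
    return RENDER @ joints


def frozen_video_planner(frame_now, frame_goal, horizon=8):
    """Stand-in for a pretrained video model; nothing here is ever trained."""
    alphas = np.linspace(0.0, 1.0, horizon + 1)[:, None]
    return (1 - alphas) * frame_now + alphas * frame_goal


def fit_idm(n_samples=500):
    """Fit a linear inverse dynamics model: (frame, next frame) -> action.

    Assumption: an action is a joint-position delta. A real system would
    likely use a learned network rather than least squares.
    """
    joints = rng.normal(size=(n_samples, DOF))
    actions = 0.1 * rng.normal(size=(n_samples, DOF))
    frames = joints @ RENDER.T
    next_frames = (joints + actions) @ RENDER.T
    inputs = np.hstack([frames, next_frames])
    weights, *_ = np.linalg.lstsq(inputs, actions, rcond=None)
    return weights


def idm_predict(weights, frame, next_frame):
    return np.concatenate([frame, next_frame]) @ weights


def rollout(weights, start, goal):
    """Follow the planner's predicted frames, decoding each step with the IDM."""
    joints = start.copy()
    plan = frozen_video_planner(render(start), render(goal))
    for frame, next_frame in zip(plan[:-1], plan[1:]):
        joints = joints + idm_predict(weights, frame, next_frame)
    return joints


weights = fit_idm()
start, goal = np.zeros(DOF), np.ones(DOF)
final = rollout(weights, start, goal)
print("max joint error:", np.abs(final - goal).max())
```

In this sketch, supporting a different robot would mean refitting only `fit_idm` for the new action space while `frozen_video_planner` stays as it is, which mirrors the claim that the core planner needs no retraining.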