Using 3D point motion as a universal language for robot actions

Robots struggle to turn video predictions into physical actions because RGB video alone leaves 3D motion and contact geometry ambiguous. PointAction fine-tunes a video diffusion model to predict both future frames and dynamic 3D pointmaps of task-relevant objects, giving a structured intermediate representation between vision and control. The resulting point trajectories act as an embodiment-agnostic action interface: a kind of 3D motion sketch that any robot arm can follow. The method outperforms baselines in simulation and transfers to unseen real robots with limited action labels.
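
To make the idea of point trajectories as an action interface concrete, here is a minimal sketch of one way predicted 3D points could be converted into motion targets that do not depend on a particular robot. It is not PointAction's controller, which this summary does not describe. The sketch assumes the tracked object moves rigidly, that points are stored as an array of shape `(T, N, 3)`, and that a least-squares rigid fit (the Kabsch algorithm) is an acceptable way to summarize each frame's motion. The function names `rigid_transform_from_points` and `trajectory_to_poses` are hypothetical.

```python
import numpy as np


def rigid_transform_from_points(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= R @ src + t (Kabsch).

    src, dst: arrays of shape (N, 3) holding corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def trajectory_to_poses(points):
    """Convert a point trajectory into per-frame object poses.

    points: array of shape (T, N, 3), the same N points tracked over T frames.
    Returns a list of T (R, t) pairs, each relative to frame 0.
    """
    points = np.asarray(points, dtype=float)
    return [rigid_transform_from_points(points[0], frame) for frame in points]
```

Under these assumptions, each `(R, t)` pair describes how the object should have moved by that frame, expressed without reference to any robot's joints. A robot that has grasped the object could apply the same relative transform to its end-effector pose and hand the result to its own inverse-kinematics solver, which is one way an interface of this kind can stay embodiment-agnostic.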