Using 3D point motion as a universal language for robot actions
Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu
June 2, 2026
Robots struggle to translate video predictions into physical actions because RGB video alone leaves 3D motion and contact geometry ambiguous. PointAction fine-tunes a video diffusion model to predict both future frames and dynamic 3D pointmaps of task-relevant objects, creating a structured middle ground between vision and control. These point trajectories act as an embodiment-agnostic action interface—a kind of 3D motion sketch that any robot arm can follow. The method outperforms baselines in simulation and transfers to unseen real robots with limited action labels.
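To make the idea of point motion as an action interface concrete, here is a minimal sketch of one way predicted 3D points could be turned into a motion target for an arm. It is an assumption for illustration, not the paper's method: it treats the tracked object as rigid and fits a rotation and translation between two sets of corresponding points with the standard Kabsch (SVD) procedure. The function name `fit_rigid_motion` and the array layout are hypothetical.

```python
import numpy as np


def fit_rigid_motion(points_t0, points_t1):
    """Fit R, t so that points_t1 is approximately points_t0 @ R.T + t.

    Illustrative assumption: both inputs are (N, 3) arrays holding the same
    object points at two time steps, in the same order, and the object moves
    rigidly between them.
    """
    # Center both point sets on their centroids.
    c0 = points_t0.mean(axis=0)
    c1 = points_t1.mean(axis=0)
    p = points_t0 - c0
    q = points_t1 - c1

    # 3x3 cross-covariance between the centered sets.
    H = p.T @ q

    # Kabsch: the optimal rotation comes from the SVD of H.
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that maps the rotated source centroid onto the target centroid.
    t = c1 - R @ c0
    return R, t


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points_t0 = rng.normal(size=(100, 3))

    # Synthetic ground truth: rotate 20 degrees about z, then shift.
    angle = np.deg2rad(20.0)
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle),  np.cos(angle), 0.0],
        [0.0,            0.0,           1.0],
    ])
    t_true = np.array([0.05, 0.00, 0.10])
    points_t1 = points_t0 @ R_true.T + t_true

    R, t = fit_rigid_motion(points_t0, points_t1)
    print("max rotation error:", np.abs(R - R_true).max())
    print("max translation error:", np.abs(t - t_true).max())
```

Because the fitted transform describes how the object moves rather than how any particular joint moves, it could in principle be handed to whatever controller a given arm already has, which is the sense in which such an interface is embodiment-agnostic. Non-rigid objects or noisy predictions would need a more robust fit than this sketch provides.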