cs.RO

Can tracking moving objects improve robot learning?

Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, Juho Kannala

May 22, 2026

Robot policies learn better from world models that predict environment dynamics, but pixel-level prediction is easily distracted by changes in lighting and texture. JOPAT uses a diffusion transformer to jointly predict latent observations, 2D point tracks with visibility, and actions. Because it models motion explicitly through point tracks rather than raw pixels, it stays robust under occlusion and off-screen motion. On LIBERO and real-world LeRobot tasks, the largest gains appear in long-horizon scenarios that involve object interactions and partial visibility.
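To make "2D point tracks with visibility" concrete, here is a minimal sketch of one way such a track could be represented. It is not JOPAT's implementation: the summary does not specify the track format, so the names `TrackStep`, `make_track`, `visible_fraction`, and `net_motion`, and the rule that a point counts as visible only when it is inside the image and not occluded, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TrackStep:
    """One timestep of a single 2D point track (hypothetical layout)."""
    x: float       # pixel column; may lie outside the image
    y: float       # pixel row; may lie outside the image
    visible: bool  # False when the point is occluded or off-screen


def in_frame(x: float, y: float, width: int, height: int) -> bool:
    """Return True if (x, y) falls inside a width-by-height image."""
    return 0.0 <= x < width and 0.0 <= y < height


def make_track(
    points: Sequence[Tuple[float, float]],
    occluded: Sequence[bool],
    width: int,
    height: int,
) -> List[TrackStep]:
    """Build a track, marking a step visible only if in frame and unoccluded."""
    return [
        TrackStep(x, y, visible=in_frame(x, y, width, height) and not occ)
        for (x, y), occ in zip(points, occluded)
    ]


def visible_fraction(track: Sequence[TrackStep]) -> float:
    """Fraction of timesteps at which the point is visible."""
    if not track:
        return 0.0
    return sum(step.visible for step in track) / len(track)


def net_motion(track: Sequence[TrackStep]) -> Tuple[float, float]:
    """First-to-last displacement, using coordinates even for hidden steps."""
    if not track:
        return (0.0, 0.0)
    return (track[-1].x - track[0].x, track[-1].y - track[0].y)


# Assumed example: a point moving right across a 64x48 image,
# occluded at step 1 and leaving the frame at step 3.
example = make_track(
    points=[(10, 10), (20, 10), (30, 10), (70, 10)],
    occluded=[False, True, False, False],
    width=64,
    height=48,
)
```

In this sketch the coordinates stay defined while the point is hidden, so the motion of the point can still be described when half of its steps are not visible. That is the property the summary attributes to tracking-based motion modeling, shown here only at the level of the data structure.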
Published as "Point Tracking Improves World Action Models", arXiv:2605.23856