Can tracking moving objects improve robot learning?
Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, Juho Kannala
May 22, 2026
Robot policies learn better from world models that predict environment dynamics, but pixel-level predictions are easily distracted by lighting and texture changes. JOPAT uses a diffusion transformer to jointly predict latent observations, 2D point tracks with visibility, and actions. By modeling motion explicitly through point tracks rather than raw pixels, it stays robust under occlusion and off-screen motion. On LIBERO and real-world LeRobot tasks, the largest gains appear in long-horizon scenarios with object interactions and partial visibility.
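To make the role of per-point visibility concrete, here is a minimal sketch of a visibility-masked track loss. It is an illustration under stated assumptions, not the paper's training objective: the function name `masked_track_loss`, the squared-error position term, and the binary cross-entropy visibility term are all hypothetical choices. The idea it shows is that points flagged as not visible contribute nothing to the position error, while the visibility prediction itself is still supervised.

```python
import math


def masked_track_loss(pred_xy, true_xy, pred_vis, true_vis, eps=1e-7):
    """Hypothetical loss for 2D point tracks with visibility flags.

    pred_xy, true_xy: lists of (x, y) positions, one per tracked point.
    pred_vis: predicted visibility probabilities in (0, 1).
    true_vis: ground-truth visibility flags (1 = visible, 0 = not visible).
    Returns (position_loss, visibility_loss).
    """
    pos_sum, n_visible, vis_sum = 0.0, 0, 0.0
    for (px, py), (tx, ty), pv, tv in zip(pred_xy, true_xy, pred_vis, true_vis):
        if tv:
            # Position error is only counted for points that are visible.
            pos_sum += (px - tx) ** 2 + (py - ty) ** 2
            n_visible += 1
        # Binary cross-entropy on visibility, clamped for numerical safety.
        pv = min(max(pv, eps), 1.0 - eps)
        vis_sum += -(tv * math.log(pv) + (1 - tv) * math.log(1.0 - pv))
    pos_loss = pos_sum / max(n_visible, 1)
    vis_loss = vis_sum / max(len(true_vis), 1)
    return pos_loss, vis_loss


# Three tracked points; the third is occluded, so its position is ignored.
true_xy = [(0.0, 0.0), (1.0, 2.0), (5.0, 5.0)]
true_vis = [1, 1, 0]
pred_xy = [(0.0, 0.0), (1.0, 1.0), (100.0, 100.0)]
pred_vis = [0.9, 0.9, 0.1]
pos_loss, vis_loss = masked_track_loss(pred_xy, true_xy, pred_vis, true_vis)
```

In this toy example the wildly wrong position guess for the occluded third point does not affect the position loss, which is one simple way a track-based target can avoid penalizing predictions where no ground truth is observable.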