Can tracking moving objects improve robot learning?

Robot policies learn better from world models that predict environment dynamics, but pixel-level predictions get distracted by lighting and texture changes. JOPAT uses a diffusion transformer to jointly predict latent observations, 2D point tracks with visibility, and actions. By explicitly modeling motion through tracking rather than raw pixels, it stays robust during occlusion and off-screen motion. On LIBERO and real LeRobot tasks, the largest gains appear in long-horizon scenarios with object interactions and partial visibility.
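To make the joint-prediction idea concrete, here is a minimal NumPy sketch of how the three outputs could be packed into one diffusion target. It is not JOPAT's implementation: the function names (`pack_targets`, `add_noise`, `masked_track_loss`), the tensor shapes, the standard DDPM-style noising step, and the choice to score point positions only where a point is visible are all assumptions made for illustration. The summary above does not state how the model lays out its tokens or weights its losses.

```python
import numpy as np


def pack_targets(latent, tracks_xy, visibility, actions):
    """Flatten latents, point tracks (x, y, visibility) and actions into one vector.

    Assumed shapes (hypothetical, for illustration only):
      latent:     (T, D)     latent observation per future step
      tracks_xy:  (T, N, 2)  2D location of N tracked points per step
      visibility: (T, N)     1.0 if the point is visible, else 0.0
      actions:    (T, A)     robot action per step
    """
    # Attach the visibility flag to each point so it is predicted alongside x, y.
    track_feats = np.concatenate([tracks_xy, visibility[..., None]], axis=-1)
    return np.concatenate([latent.ravel(), track_feats.ravel(), actions.ravel()])


def add_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps


def masked_track_loss(pred_xy, true_xy, visibility):
    """Mean squared position error over visible points only (an assumed design choice)."""
    weights = visibility[..., None]             # (T, N, 1), broadcasts over x and y
    denom = max(float(weights.sum()) * 2.0, 1.0)  # two coordinates per visible point
    return float((((pred_xy - true_xy) ** 2) * weights).sum() / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, N, A = 2, 4, 3, 7
    latent = rng.standard_normal((T, D))
    tracks = rng.uniform(0.0, 1.0, size=(T, N, 2))
    vis = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])  # some points occluded
    actions = rng.standard_normal((T, A))

    x0 = pack_targets(latent, tracks, vis, actions)
    xt, eps = add_noise(x0, alpha_bar=0.5, rng=rng)
    print(x0.shape, xt.shape)
    print(masked_track_loss(tracks + 0.1, tracks, vis))
```

In this sketch an occluded or off-screen point contributes nothing to the position loss but still carries a visibility target, which is one plausible way a track-based objective could stay well defined when objects leave the frame.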