Can world models learn what objects are just by watching them move?

Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

Current video models struggle to understand what constitutes an object or how physics works. This work trains a probabilistic world model using autoregressive sequence prediction, which naturally learns to infer probability distributions over any visual property given others. The model discovers objects by analyzing motion correlations across multiple predicted futures, extracts articulated subparts, and can manipulate 3D objects and solve tasks like Visual Jenga—all from raw video without object labels.