Should robot vision learn how things move?

Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

Robot policies built on standard vision encoders miss motion cues that are essential for manipulation. DynaFLIP pre-trains visual encoders on triplets of images, language descriptions, and 3D flow drawn from human and robot videos, keeping the three modalities tightly aligned in a geometric space. The approach improves downstream policy performance in simulation and on real robots, particularly in novel scenarios, which suggests that encoding *how* the world changes matters as much as encoding *what* is there.
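
The summary above does not say which objective keeps the three modalities aligned. One common choice for this kind of alignment is a pairwise, symmetric contrastive (InfoNCE) loss, and the sketch below illustrates that idea only. It is an assumption, not DynaFLIP's actual training objective; the function names, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against every key in the batch."""
    queries = [normalize(v) for v in queries]
    keys = [normalize(v) for v in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


def triplet_alignment_loss(image, text, flow, temperature=0.1):
    """Average symmetric InfoNCE over the three modality pairs (assumed objective)."""
    pairs = [(image, text), (image, flow), (text, flow)]
    total = 0.0
    for x, y in pairs:
        total += info_nce(x, y, temperature) + info_nce(y, x, temperature)
    return total / (2 * len(pairs))


# Toy batch of two samples; row i of each modality describes the same clip.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
flow_emb = [[1.0, 0.2], [0.2, 1.0]]

aligned_loss = triplet_alignment_loss(image_emb, text_emb, flow_emb)
# Reversing the flow rows pairs each clip with the wrong motion.
shuffled_loss = triplet_alignment_loss(image_emb, text_emb, flow_emb[::-1])
```

In this toy batch the loss is lower when each image and caption is paired with its own flow embedding than when the flow rows are swapped, which is the behavior an alignment objective of this kind is meant to reward.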