Should robot vision learn how things move?
Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang
May 28, 2026
Robot policies built on standard vision encoders miss motion cues that are essential for manipulation. DynaFLIP pre-trains visual encoders on triplets of images, language descriptions, and 3D flow drawn from human and robot videos, keeping the three modalities tightly aligned in a geometric space. The approach improves downstream policy performance in simulation and on real robots, particularly in novel scenarios, which suggests that encoding *how* the world changes matters as much as *what* is there.