cs.LG

Should robot vision learn how things move?

Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

May 28, 2026

Robots trained on standard vision encoders miss motion cues essential for manipulation. DynaFLIP pre-trains visual encoders using triplets of images, language descriptions, and 3D flow from human and robot videos, keeping the three modalities tightly aligned in a geometric space. The approach boosts downstream policy performance in simulation and on real robots, particularly in novel scenarios, suggesting that encoding *how* the world changes matters as much as *what* is there.
Published as "DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation" (arXiv:2605.30350).