Teaching robots to see before they act

Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yue Meng, Wenda Xu, Yuan He, Gao Huang

Current robotic manipulation models cram language understanding, scene perception, and motor control into one neural network, forcing the action component to relearn what vision-language models already know. AVP splits the work into two stages: a pretrained vision-language model identifies the target and generates visual tokens, then a separate action network uses those tokens to plan movements via flow matching. On real robot pick-and-place tasks, AVP delivers a 27% improvement over prior methods, along with faster learning and better generalization to new objects and spatial arrangements.
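
To make the second stage concrete, here is a minimal sketch of how an action could be sampled with flow matching once visual tokens are available. It is a generic illustration, not the authors' implementation: the names `toy_velocity` and `sample_action`, the token shape, and the closed-form velocity field are all assumptions made for this example. In a real system the velocity field would be a trained network conditioned on the tokens produced by the vision-language model.

```python
import numpy as np


def toy_velocity(x, t, visual_tokens):
    """Hypothetical stand-in for a learned action network.

    Assumption: the mean of the visual tokens encodes the goal action.
    For a straight-line probability path, the velocity that carries x
    to that goal by t = 1 is (goal - x) / (1 - t).
    """
    goal = visual_tokens.mean(axis=0)
    return (goal - x) / (1.0 - t)


def sample_action(visual_tokens, velocity_fn, action_dim, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t = 0) toward an action (t = 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so (1 - t) is never zero
        x = x + dt * velocity_fn(x, t, visual_tokens)  # forward Euler step
    return x


# Two made-up visual tokens, each with the same dimension as the action.
tokens = np.array([[0.1, 0.2, 0.3],
                   [0.3, 0.4, 0.5]])
action = sample_action(tokens, toy_velocity, action_dim=3)
print(action)
```

With this toy field the last Euler step lands on the goal, so the sampled action matches the token mean up to floating-point error. A learned field would only approximate the target, and the number of integration steps would trade speed against accuracy.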