Can robots learn to follow detailed execution instructions, not just goals?

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

Robot policies trained on coarse commands such as "pick up the cup" miss critical execution details like approach direction and contact point. FineVLA adds fine-grained annotations (47K verified trajectories drawn from 972K robot videos) and shows that mixing detailed instructions with goal-level commands, with results peaking at a 1:1 ratio, yields 86.8% success in simulation and a score of 62.7/100 on real dual-arm tasks. The approach also improves steerable control (+23 points for pose adjustments) while maintaining baseline performance.
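
To make the mixing ratio concrete, here is a minimal Python sketch of building a training set that combines fine-grained and goal-level instructions at a chosen ratio. It is not taken from FineVLA: the function name `mix_instructions`, its parameters, and the example instruction strings are all hypothetical, and sampling with replacement from two pools is an assumption about how such a mixture could be drawn.

```python
import random


def mix_instructions(fine, coarse, ratio=(1, 1), n=1000, seed=0):
    """Return n (kind, instruction) pairs drawn at a fine:coarse ratio.

    Hypothetical helper for illustration only; samples with replacement.
    """
    if not fine or not coarse:
        raise ValueError("both instruction pools must be non-empty")
    f, c = ratio
    if f < 0 or c < 0 or f + c == 0:
        raise ValueError("ratio parts must be non-negative and not both zero")

    rng = random.Random(seed)          # seeded for reproducibility
    n_fine = round(n * f / (f + c))    # share of fine-grained samples
    n_coarse = n - n_fine              # remainder is goal-level

    mixed = [("fine", s) for s in rng.choices(fine, k=n_fine)]
    mixed += [("coarse", s) for s in rng.choices(coarse, k=n_coarse)]
    rng.shuffle(mixed)                 # interleave the two kinds
    return mixed


# Illustrative instruction pools (made up, not from the dataset).
fine_pool = ["approach the cup from the left and grasp it by the handle"]
coarse_pool = ["pick up the cup"]

batch = mix_instructions(fine_pool, coarse_pool, ratio=(1, 1), n=10)
```

With `ratio=(1, 1)` half of the returned samples carry detailed instructions and half carry goal-level commands; changing the tuple, for example to `(3, 1)`, shifts the balance toward detailed instructions.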