Teaching robots to understand pointing gestures

Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng

Robot manipulation systems typically rely on text commands, but pointing at objects is faster and clearer for humans. GesVLA adds gesture as a native instruction modality alongside language, encoding hand position directly into the robot's decision-making. The team generated synthetic training data by rendering hand models onto real scenes, then trained the system to both recognize gestures and predict actions. On real robot tasks like picking produce or products, gesture guidance boosted accuracy in complex scenes where multiple similar objects create confusion.
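
To make the synthetic-data step more concrete, here is a minimal sketch of one way a rendered hand could be pasted onto a real scene image: plain alpha compositing with NumPy. This is an illustration under stated assumptions, not the authors' pipeline. The function name `composite_hand`, the RGBA hand render, and the `top_left` placement argument are all hypothetical, and the summary above does not say how GesVLA actually blends or positions the hand.

```python
import numpy as np

def composite_hand(scene, hand_rgba, top_left):
    """Alpha-blend an RGBA hand render onto an RGB scene.

    Hypothetical helper for illustration only.
    scene:     (H, W, 3) uint8 image of the real scene
    hand_rgba: (h, w, 4) uint8 render of a hand model; alpha marks hand pixels
    top_left:  (row, col) where the render's top-left corner is placed
    """
    row, col = top_left
    h, w = hand_rgba.shape[:2]

    # Clip the paste region so a partly off-screen hand is still handled.
    r0, c0 = max(row, 0), max(col, 0)
    r1 = min(row + h, scene.shape[0])
    c1 = min(col + w, scene.shape[1])
    if r0 >= r1 or c0 >= c1:
        return scene.copy()  # hand lies entirely outside the image

    out = scene.astype(np.float32)  # astype returns a new array
    patch = hand_rgba[r0 - row:r1 - row, c0 - col:c1 - col].astype(np.float32)
    alpha = patch[..., 3:4] / 255.0  # (h, w, 1), broadcasts over RGB

    # Standard "over" blend: hand where alpha is 1, scene where alpha is 0.
    out[r0:r1, c0:c1] = alpha * patch[..., :3] + (1.0 - alpha) * out[r0:r1, c0:c1]
    return np.clip(out.round(), 0, 255).astype(np.uint8)
```

In a pipeline of this kind, the pixel location of the rendered fingertip and the identity of the object it points at would be known at render time, so each composited image could come with a gesture label at no annotation cost. Whether GesVLA records its labels this way is an assumption here.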