Can a robot understand what you're looking at?

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

Language instructions alone can't tell a robot which fork to pick up or where exactly to cut. Gaze2Act adds human eye gaze as a dynamic intent signal: it takes what a person is looking at in their first-person view and maps it into the robot's perspective via semantic matching. Tested on 16 real-world tasks with a Unitree G1 humanoid, it outperforms language-only baselines on object disambiguation and precise interactions.
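
The summary does not say how the semantic matching is implemented, so the following is only a minimal sketch of one plausible reading: the region the person fixates on is turned into a feature vector, and the robot picks the object in its own view whose feature vector is most similar. Everything here is an assumption for illustration, including the function names `cosine_similarity` and `match_gaze_target`, the `min_similarity` threshold, and the toy three-dimensional embeddings, which stand in for the output of some visual encoder.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_gaze_target(
    gaze_embedding: Vector,
    robot_view_objects: Dict[str, Vector],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the source
) -> Tuple[Optional[str], float]:
    """Return the robot-view object most similar to the gazed region.

    gaze_embedding: features of the region the person is looking at
        (assumed to come from an encoder applied to the first-person view).
    robot_view_objects: object name -> features from the robot's camera view.
    Returns (None, best_score) when no candidate reaches min_similarity.
    """
    best_name: Optional[str] = None
    best_score = 0.0
    for name, embedding in robot_view_objects.items():
        score = cosine_similarity(gaze_embedding, embedding)
        if best_name is None or score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < min_similarity:
        return None, best_score
    return best_name, best_score


# Toy example: two forks that a language instruction alone cannot tell apart.
candidates = {
    "fork_left": [1.0, 0.0, 0.0],
    "fork_right": [0.0, 1.0, 0.0],
    "knife": [0.0, 0.0, 1.0],
}
target, score = match_gaze_target([0.9, 0.1, 0.0], candidates)
print(target, round(score, 3))
```

In this toy setup the gaze features point mostly along the first axis, so the left fork is selected even though "pick up the fork" is ambiguous. A real system would also need to handle gaze noise and changing fixations over time, which this sketch ignores.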