cs.RO

Can a robot understand what you're looking at?

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

May 28, 2026

Language instructions alone can't tell a robot which fork to pick up or where exactly to cut. Gaze2Act adds human eye-gaze as a dynamic intent signal, mapping what a person is looking at from their first-person view into the robot's perspective via semantic matching. Tested on 16 real tasks with a Unitree G1 humanoid, it outperforms language-only baselines on object disambiguation and precise interactions.
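The summary does not say how the semantic matching is implemented. As a minimal sketch, assume the region a person is looking at and each candidate object in the robot's view have already been encoded as feature vectors by some vision model. Under that assumption, the gaze target can be transferred to the robot's perspective by picking the candidate with the highest cosine similarity. The function names and the toy vectors below are hypothetical and are not taken from Gaze2Act.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_gaze_target(gaze_embedding, candidate_embeddings):
    """Return (index, score) of the robot-view candidate closest to the gazed region.

    Assumption: embeddings come from a shared feature space, so that the same
    object seen from two viewpoints yields similar vectors.
    """
    scores = [cosine_similarity(gaze_embedding, c) for c in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data (hypothetical): one gazed region, three objects in the robot's view.
gaze_vec = [0.9, 0.1, 0.0]
robot_view = [
    [0.0, 1.0, 0.0],  # e.g. a spoon
    [1.0, 0.2, 0.0],  # e.g. the fork the person is looking at
    [0.0, 0.0, 1.0],  # e.g. a plate
]

best_index, best_score = match_gaze_target(gaze_vec, robot_view)
print(best_index, round(best_score, 3))
```

Matching in feature space rather than pixel space is what lets a signal recorded from the person's viewpoint select an object in a different camera's image. A real system would also need a threshold or fallback for the case where no candidate matches well.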
Published as "Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation" (arXiv:2605.30282).