cs.RO

Robot learning to understand human actions from wearable cameras

Yuejiao Su, Xinshen Zhang, Zhen Ye, Lei Yao, Lap-Pui Chau, Yi Wang

May 14, 2026

Assistive robots need to understand human-environment interactions from egocentric (first-person) video, but current multimodal models struggle with precise spatial grounding. EARL takes a two-stage approach: it first parses egocentric interactions into structured descriptions, then generates text answers and pixel-level masks in response to specific queries. A novel Analysis-guided Feature Synthesizer bridges the two stages by transferring coarse interaction semantics to query-oriented reasoning. The model is trained with a multi-faceted reward function optimized via GRPO (Group Relative Policy Optimization). On Ego-IRGBench, EARL reaches 65.48% cIoU for grounding, 8.37% above prior RL methods, and shows strong transfer to unseen egocentric scenarios on EgoHOS.
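To make the two technical terms in the summary concrete, here is a minimal sketch of (a) group-relative advantage normalization, the core idea behind GRPO, and (b) cumulative IoU (cIoU) as it is commonly defined in referring segmentation. This is not EARL's implementation: the function names, the reward values, and the toy masks are hypothetical, the use of the population standard deviation is an assumption (implementations differ), and whether Ego-IRGBench computes cIoU exactly this way is also an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each sampled response against its own group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    Assumption: population standard deviation; some implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cumulative_iou(pred_masks, gt_masks):
    """cIoU: total intersection divided by total union over all samples.

    Each mask is a 2D list of 0/1 values; masks in a pair share the same shape.
    """
    inter = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                inter += int(bool(p) and bool(g))
                union += int(bool(p) or bool(g))
    return inter / union if union else 0.0


# Hypothetical rewards for four responses sampled for one query.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# Two toy 2x2 predictions and their ground-truth masks.
preds = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
score = cumulative_iou(preds, gts)  # (1 + 2) / (2 + 2) = 0.75
```

Because cIoU pools pixels across the whole dataset before dividing, samples with large masks weigh more than samples with small ones, which is why it can differ noticeably from a per-sample mean IoU.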
Published as "EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding" (arXiv:2605.14742).