Robot learning to understand human actions from wearable cameras

Assistive robots need to understand human-environment interactions from egocentric (first-person) video, but current multimodal models struggle with precise spatial grounding. EARL uses a two-stage approach: it first parses egocentric interactions into structured descriptions, then generates text answers and pixel-level masks in response to specific queries. A novel Analysis-guided Feature Synthesizer bridges the two stages by transferring coarse interaction semantics to query-oriented reasoning. Training uses a multi-faceted reward function optimized with Group Relative Policy Optimization (GRPO). On Ego-IRGBench, EARL reaches 65.48% cIoU for grounding, 8.37% above prior RL methods, and shows strong transfer to unseen egocentric scenarios on EgoHOS.
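
To make the grounding metric concrete, here is a minimal sketch of how a cIoU score could be computed. It assumes that cIoU means cumulative IoU as commonly used in referring and reasoning segmentation: intersections and unions are summed over the whole evaluation set before dividing. The summary above does not define the metric or the Ego-IRGBench evaluation protocol, so the function names and the toy masks below are hypothetical illustrations, not the authors' code.

```python
from typing import Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values (rows of pixels).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count foreground pixels in the intersection and union of one mask pair."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same height")
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("masks must have the same width")
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def cumulative_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative IoU: total intersection over total union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Convention chosen for this sketch: an empty union scores 0.0.
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Per-sample IoU averaged over samples, shown only for contrast."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical): two 2x2 predictions and their ground-truth masks.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]

ciou = cumulative_iou(preds, targets)  # (1 + 1) / (2 + 1)
miou = mean_iou(preds, targets)        # (0.5 + 1.0) / 2
print(f"cIoU = {ciou:.4f}, per-sample mean IoU = {miou:.4f}")
```

Under this definition, samples with large objects contribute more pixels to both sums, so cIoU can differ noticeably from a per-sample average on the same predictions.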