When hands hide objects, use hands to find them

Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon

Robot manipulation and embodied AI need to track objects while hands hold and manipulate them, but hands occlude the very objects they are moving. ComPose flips this problem: instead of treating hands as noise, it uses hand joint positions as a complementary signal for 6DoF object pose estimation from RGB video alone. The method combines object and hand cues from foundation models, adaptively weights informative joints, and enforces temporal consistency across frames. It outperforms depth-dependent and template-based trackers under severe occlusion, and transfers directly to robot action reconstruction from human videos.
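
To make the idea of weighting hand joints concrete, here is a minimal sketch of a weighted rigid fit (the weighted Kabsch algorithm). It is not ComPose's actual formulation, which this summary does not spell out. It assumes, purely for illustration, that 3D positions for 21 hand joints are available in two consecutive frames and that joints in firm contact with the object move rigidly with it. The function name `weighted_rigid_fit`, the joint count, and the hand-picked weights are all hypothetical.

```python
import numpy as np

def weighted_rigid_fit(src, dst, weights):
    """Return R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2 (weighted Kabsch)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize so weighted means are well defined
    src_mean = w @ src                     # weighted centroids
    dst_mean = w @ dst
    src_c = src - src_mean
    dst_c = dst - dst_mean
    H = (src_c * w[:, None]).T @ dst_c     # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Hypothetical data: 21 joints in the previous frame, moved by a known rigid motion.
rng = np.random.default_rng(0)
joints_prev = rng.normal(size=(21, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.05])
joints_curr = joints_prev @ R_true.T + t_true

# One joint is not in contact with the object and moves independently.
joints_curr[0] += 5.0
weights = np.ones(21)
weights[0] = 0.0                           # an uninformative joint gets zero weight

R_est, t_est = weighted_rigid_fit(joints_prev, joints_curr, weights)
```

With the stray joint weighted to zero, the fit recovers the rigid motion shared by the remaining joints. In a learned system the weights would be predicted per joint and per frame rather than set by hand, and a temporal term would tie consecutive estimates together.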