When hands hide objects, use hands to find them
Jisu Shin, Junoh Lee, JunGyu Lee, Inhwan Bae, Dohyeon Lee, Hokyun Im, Youngwoon Lee, Hae-Gon Jeon
May 22, 2026
Robot manipulation and embodied AI systems need to track objects that hands are holding and manipulating, but the hands occlude the very objects they move. ComPose flips this problem: instead of treating hands as noise, it uses hand joint positions as a complementary signal for 6DoF object pose estimation from RGB video alone. The method combines object and hand cues from foundation models, adaptively weights the most informative joints, and enforces temporal consistency across frames. It outperforms depth-dependent and template-based trackers under severe occlusion, and it transfers directly to reconstructing robot actions from human videos.
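The summary does not say how ComPose turns hand joints into a pose or how it computes its joint weights. As a generic illustration of why per-point weights matter for a 6DoF estimate, here is a minimal weighted Kabsch (least-squares rigid alignment) sketch in Python with NumPy. It assumes known 3D correspondences between reference points and their observed positions, plus a hypothetical per-point reliability weight. The function name `weighted_rigid_fit` and this whole setup are illustrative assumptions, not the paper's method.

```python
import numpy as np


def weighted_rigid_fit(src, dst, weights):
    """Hypothetical helper: weighted Kabsch alignment.

    Finds rotation R (3x3) and translation t (3,) minimizing
    sum_i w_i * ||R @ src_i + t - dst_i||^2.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1

    # Weighted centroids of both point sets.
    mu_src = w @ src
    mu_dst = w @ dst

    # Weighted 3x3 cross-covariance between centered point sets.
    H = (src - mu_src).T @ (w[:, None] * (dst - mu_dst))

    # SVD gives H = U @ diag(S) @ Vt; the optimal rotation is V @ D @ U^T.
    U, _, Vt = np.linalg.svd(H)
    # D flips the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Translation maps the weighted source centroid onto the target centroid.
    t = mu_dst - R @ mu_src
    return R, t


# Usage: recover a 90-degree rotation about z plus a translation.
src = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
t_true = np.array([1.0, 2.0, 3.0])
dst = src @ R_true.T + t_true

R, t = weighted_rigid_fit(src, dst, [1, 1, 1, 1, 1])

# Corrupt one observation, then give it a near-zero (hypothetical) weight.
dst_bad = dst.copy()
dst_bad[4] += 10.0
R_w, t_w = weighted_rigid_fit(src, dst_bad, [1, 1, 1, 1, 1e-9])
```

Giving a corrupted correspondence a near-zero weight leaves the recovered rotation and translation essentially unchanged. That is the intuition behind down-weighting uninformative or occluded joints, whatever specific weighting ComPose actually uses.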