How do we reconstruct moving people and their environments from video?
Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu
June 1, 2026
Reconstructing humans, scenes, and camera motion from multi-view video as one coherent 4D model remains difficult, in part because prior methods estimate these components separately. TROPHIES jointly estimates dynamic humans, static geometry, and camera poses in a shared coordinate frame. It uses separate branches for humans and scenes and couples them through physical constraints such as contact and temporal consistency. On EgoHuman and EgoExo4D, it produces globally aligned reconstructions in which people stay grounded and environments remain stable.
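To make the idea of coupling through physical constraints more concrete, here is a minimal sketch of what a contact term and a temporal-consistency term could look like. It is an illustration only, not the TROPHIES formulation: the function names, the z-up ground plane at a fixed height, the per-frame contact flags, and the use of squared acceleration as a smoothness measure are all assumptions made for this example.

```python
from typing import Sequence

Vec3 = Sequence[float]


def contact_penalty(
    foot_positions: Sequence[Vec3],
    in_contact: Sequence[bool],
    ground_height: float = 0.0,
) -> float:
    """Mean squared foot height above a horizontal ground plane.

    Only frames flagged as in contact contribute. Assumes a z-up
    world frame and a flat ground (both hypothetical choices).
    """
    squared = [
        (pos[2] - ground_height) ** 2
        for pos, touching in zip(foot_positions, in_contact)
        if touching
    ]
    return sum(squared) / len(squared) if squared else 0.0


def temporal_smoothness(trajectory: Sequence[Vec3]) -> float:
    """Mean squared finite-difference acceleration of one 3D point.

    Constant-velocity motion scores zero, so ordinary movement is not
    penalized; frame-to-frame jitter is.
    """
    if len(trajectory) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        accel = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += sum(a * a for a in accel)
    return total / (len(trajectory) - 2)
```

In a joint optimization, terms of this kind would typically be weighted and added to the reconstruction objective, so that a human estimate floating above the scene geometry or jittering between frames incurs a cost. How TROPHIES actually defines and weights its constraints is specified in the paper, not here.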