cs.CV

How do we reconstruct moving people and their environments from video?

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu

June 1, 2026

Reconstructing humans, scenes, and camera motion from multi-view video as one coherent 4D model remains difficult, in part because prior methods estimate these components separately. TROPHIES jointly estimates dynamic humans, static scene geometry, and camera poses in a shared coordinate frame. It uses separate branches for humans and scenes and couples them through physical constraints such as contact and temporal consistency. On EgoHuman and EgoExo4D, it produces globally aligned reconstructions in which people stay grounded and environments remain stable.
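To make the idea of coupling through contact and temporal consistency concrete, here is a minimal toy sketch of two penalty terms of that general kind. It is not the TROPHIES objective: the function names, the ground plane at height zero, the squared-error forms, and the weights `w_contact` and `w_temporal` are all assumptions chosen for illustration.

```python
import numpy as np


def contact_penalty(foot_heights, in_contact):
    """Mean squared foot height over frames flagged as in contact.

    Assumes (hypothetically) that the ground plane sits at height 0
    in the shared coordinate frame.
    """
    foot_heights = np.asarray(foot_heights, dtype=float)
    in_contact = np.asarray(in_contact, dtype=bool)
    if not in_contact.any():
        return 0.0
    return float(np.mean(foot_heights[in_contact] ** 2))


def temporal_penalty(positions):
    """Mean squared second difference of a (T, 3) trajectory.

    A constant-velocity path has zero second difference, so only
    jitter and abrupt changes in motion are penalized.
    """
    positions = np.asarray(positions, dtype=float)
    if positions.shape[0] < 3:
        return 0.0
    accel = positions[2:] - 2.0 * positions[1:-1] + positions[:-2]
    return float(np.mean(np.sum(accel ** 2, axis=1)))


def coupled_loss(foot_heights, in_contact, positions,
                 w_contact=1.0, w_temporal=1.0):
    """Weighted sum of the two illustrative terms (weights are assumptions)."""
    return (w_contact * contact_penalty(foot_heights, in_contact)
            + w_temporal * temporal_penalty(positions))


# A person walking at constant velocity with feet on the ground
# whenever contact is flagged incurs no penalty under this toy loss.
T = 5
walk = np.stack([np.arange(T, dtype=float), np.zeros(T), np.zeros(T)], axis=1)
heights = np.array([0.0, 0.0, 0.1, 0.0, 0.0])
contact = np.array([True, True, False, True, True])
loss = coupled_loss(heights, contact, walk)
```

In a sketch like this, a floating foot in a contact frame or a jittery root trajectory raises the loss, which is the kind of signal that could keep a reconstructed person grounded and moving smoothly relative to a static scene.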
Published as "TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos" (arXiv:2606.02350)