How do we reconstruct moving people and their environments from video?

Reconstructing humans, scenes, and camera motion from multi-view video as one coherent 4D model remains hard, in part because prior methods decouple these components. TROPHIES jointly estimates dynamic humans, static geometry, and camera poses in a shared coordinate frame. It uses separate branches for humans and scenes and couples them through constraints such as contact and temporal consistency. On EgoHuman and EgoExo4D, it produces globally aligned reconstructions in which people stay grounded and environments remain stable.
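
To make the coupling idea concrete, here is a minimal sketch of what contact and temporal-consistency terms could look like. It is not the TROPHIES implementation: the function names (`contact_loss`, `temporal_loss`, `coupling_loss`), the weights, the flat ground plane, and the z-up world frame are all assumptions chosen for illustration.

```python
import numpy as np


def contact_loss(foot_pos, in_contact, ground_height=0.0):
    """Mean squared gap between contacting feet and a flat ground plane.

    foot_pos:   (T, F, 3) foot positions in the shared world frame (z up, assumed).
    in_contact: (T, F) boolean flags marking frames where a foot should touch ground.
    """
    gap = foot_pos[..., 2] - ground_height
    mask = in_contact.astype(float)
    # Average only over contacting feet; avoid dividing by zero when none touch.
    return float((mask * gap**2).sum() / max(mask.sum(), 1.0))


def temporal_loss(joints):
    """Mean squared second finite difference (acceleration) of joints.

    joints: (T, J, 3) joint positions over T frames.
    """
    if joints.shape[0] < 3:
        return 0.0  # acceleration needs at least three frames
    accel = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]
    return float((accel**2).mean())


def coupling_loss(foot_pos, in_contact, joints, w_contact=1.0, w_temporal=0.1):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_contact * contact_loss(foot_pos, in_contact)
            + w_temporal * temporal_loss(joints))
```

In this sketch, a person floating above the floor during a flagged contact frame raises the first term, and jittery joint trajectories raise the second, so minimizing the sum pushes the human estimate toward a grounded, smooth motion in the same frame as the scene.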