TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 21154-21164

Abstract


Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples the two branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.

Related Material


[bibtex]
@InProceedings{Liu_2026_CVPR,
    author    = {Liu, Jinpeng and Xu, Yukang and Li, Yutong and Liu, Xingyu},
    title     = {TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {21154-21164}
}