GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
Abstract
We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames from a reference frame, sparse features, human poses, and ego-trajectories, giving the model precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generation. Our dataset comprises 4000+ hours of multimodal data spanning domains such as autonomous driving, egocentric human activities, and drone flights; pseudo-labels provide the depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and at maintaining temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
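As a rough illustration of the long-horizon generation idea mentioned in the abstract, the sketch below pairs a per-frame, monotonically increasing noise schedule with an overlapping-window autoregressive rollout. This is a minimal, hypothetical reading of "autoregressive noise schedules"; the function names, window sizes, and noise values are illustrative assumptions and are not taken from the GEM paper or codebase.

import numpy as np

def per_frame_noise_levels(num_frames, sigma_min=0.02, sigma_max=80.0):
    # Log-spaced, monotonically increasing noise level for each frame in a window
    # (illustrative values, not the schedule used in the paper).
    return np.exp(np.linspace(np.log(sigma_min), np.log(sigma_max), num_frames))

def rollout(first_frame, denoise_window, horizon, window=8, overlap=2):
    # Autoregressive rollout: denoise overlapping windows, reusing the last
    # `overlap` generated frames as (nearly) clean context for the next window.
    h, w, c = first_frame.shape
    frames = [first_frame]
    sigmas = per_frame_noise_levels(window)
    while len(frames) < horizon:
        context = np.stack(frames[-overlap:])                # clean context frames
        noisy = np.random.randn(window - overlap, h, w, c) * sigmas[overlap:, None, None, None]
        new_frames = denoise_window(context, noisy, sigmas)  # placeholder for the video model
        frames.extend(list(new_frames))
    return np.stack(frames[:horizon])

# Toy usage with a dummy "denoiser" that just repeats the last context frame.
dummy = lambda ctx, noisy, sig: np.repeat(ctx[-1:], len(noisy), axis=0)
video = rollout(np.zeros((4, 4, 3)), dummy, horizon=20)
print(video.shape)  # (20, 4, 4, 3)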
Related Material
[pdf] [supp] [arXiv] [bibtex]
@InProceedings{Hassan_2025_CVPR,
    author    = {Hassan, Mariam and Stapf, Sebastian and Rahimi, Ahmad and Rezende, Pedro M B and Haghighi, Yasaman and Br\"uggemann, David and Katircioglu, Isinsu and Zhang, Lin and Chen, Xiaoran and Saha, Suman and Cannici, Marco and Aljalbout, Elie and Ye, Botao and Wang, Xi and Davtyan, Aram and Salzmann, Mathieu and Scaramuzza, Davide and Pollefeys, Marc and Favaro, Paolo and Alahi, Alexandre},
    title     = {GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {22404-22415}
}