Grounded Latents for Entity-Centric 4D Scene Generation

Jinhyung Park, Navyata Sanghvi, Erica Weng, Shawn Hunt, Shinya Tanaka, Hironobu Fujiyoshi, Kris Kitani; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 21420-21430

Abstract


Although recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (e.g., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose LatentWorld, a framework for generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both the ego vehicle and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents through time. Our framework produces physically consistent and temporally coherent 4D scenes, supporting controllable and realistic generation.

Related Material


@InProceedings{Park_2026_CVPR,
    author    = {Park, Jinhyung and Sanghvi, Navyata and Weng, Erica and Hunt, Shawn and Tanaka, Shinya and Fujiyoshi, Hironobu and Kitani, Kris},
    title     = {Grounded Latents for Entity-Centric 4D Scene Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {21420-21430}
}