- 
[pdf]
[supp]
[bibtex]@InProceedings{Wu_2025_CVPR, author = {Wu, Wei and Guo, Xi and Tang, Weixuan and Huang, Tingxuan and Wang, Chiyu and Ding, Chenjing}, title = {DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2025}, pages = {17187-17196} }
        DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion
    
    
    
    Abstract
    Recent advancements in generative models offer promising solutions for synthesizing realistic driving videos, aiding in training autonomous driving perception models. However, existing methods often struggle with high-resolution multi-view generation, mainly due to the significant memory and computational overhead caused by simultaneously inputting multi-view videos into denoising diffusion models.In this paper, we propose a driving video generation framework based on multi-view feature fusion named DriveScape for multi-view 3D condition-guided video generation. We introduce a Bi-Directional Modulated Transformer (BiMoT) module to encode, fuse and inject multi-view features along with various 3D road structures and objects, which enables high-resolution multi-view generation. Consequently, our approach allows precise control over video generation, greatly enhancing realism and providing a robust solution for creating high-quality, multi-view driving videos.Our framework achieves state-of-the-art results on the nuScenes dataset, demonstrating impressive generative quality metrics with an FID score of 8.34 and an FVD score of 76.39, as well as superior performance across various perception tasks. This lays the foundation for more accurate environment simulation in autonomous driving. We plan to make our code and pre-trained model publicly available.Please refer to index.html webpage in the supplementary materials for more visualization results.
    
    
    Related Material


 
         
        