Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Simon, Christian; Ishii, Masato; Wang, Wei-Yao; Saito, Koichi; Hayakawa, Akio; Shim, Dongseok; Zhong, Zhi; Cui, Shuyang; Shibuya, Takashi; Takahashi, Shusuke; Mitsufuji, Yuki

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 15840-15849

Abstract

Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones at test time. To this end, we present a multimodal hierarchical network, called MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. The proposed method significantly improves long audio generation, reaching durations of more than 5 minutes. We also demonstrate that training on short durations and testing on long ones is possible in video-to-audio generation tasks, without training on the longer durations. Our experiments show that the proposed method achieves remarkable results on long-video-to-audio benchmarks, outperforming prior work on video-to-audio tasks. Moreover, we showcase our model's ability to generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations. Our project page: https://echoesovertime.github.io

Related Material


@InProceedings{Simon_2026_CVPR,
    author    = {Simon, Christian and Ishii, Masato and Wang, Wei-Yao and Saito, Koichi and Hayakawa, Akio and Shim, Dongseok and Zhong, Zhi and Cui, Shuyang and Shibuya, Takashi and Takahashi, Shusuke and Mitsufuji, Yuki},
    title     = {Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {15840-15849}
}