@InProceedings{Haji-Ali_2025_ICCV,
  author    = {Haji-Ali, Moayed and Menapace, Willi and Siarohin, Aliaksandr and Skorokhodov, Ivan and Canberk, Alper and Lee, Kwot Sin and Ordonez, Vicente and Tulyakov, Sergey},
  title     = {AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {19373-19385}
}
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive evaluations demonstrate that AV-Link achieves substantial improvements in audio-video synchronization, outperforming more expensive baselines such as the MovieGen V2A model.
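The abstract describes a Fusion Block that exchanges information between the two modalities via temporally-aligned self-attention over diffusion features. The sketch below is a minimal, illustrative NumPy toy (not the authors' implementation): it assumes single-head attention, random projection matrices, and video/audio feature sequences that have already been resampled to the same number of temporal tokens, so that joint self-attention over the concatenated sequence lets each modality attend to the other's temporally corresponding features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block_sketch(video_feats, audio_feats, d=64, seed=0):
    """Toy sketch of temporally-aligned cross-modal fusion.

    video_feats: (T, Dv) frozen video-diffusion activations (hypothetical)
    audio_feats: (T, Da) frozen audio-diffusion activations (hypothetical)
    Both share the same temporal length T (already aligned/resampled).
    Returns fused (T, d) features for each modality.
    """
    rng = np.random.default_rng(seed)
    # Project each modality into a shared width d (toy random projections).
    Wv = rng.standard_normal((video_feats.shape[1], d)) / np.sqrt(video_feats.shape[1])
    Wa = rng.standard_normal((audio_feats.shape[1], d)) / np.sqrt(audio_feats.shape[1])
    T = video_feats.shape[0]
    # Concatenate temporally-aligned tokens from both modalities: (2T, d).
    tokens = np.concatenate([video_feats @ Wv, audio_feats @ Wa], axis=0)
    # Joint (bidirectional) self-attention: every token can attend to
    # tokens of the other modality, including its time-aligned counterpart.
    attn = softmax(tokens @ tokens.T / np.sqrt(d))
    fused = attn @ tokens
    # Split back into per-modality conditioning features.
    return fused[:T], fused[T:]

v_cond, a_cond = fusion_block_sketch(np.ones((8, 32)), np.ones((8, 16)))
```

In the paper's framework this fusion feeds conditioning signals back into both diffusion models; the toy above only shows the attention-based information exchange itself.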