@InProceedings{Haji-Ali_2025_ICCV,
  author    = {Haji-Ali, Moayed and Menapace, Willi and Siarohin, Aliaksandr and Skorokhodov, Ivan and Canberk, Alper and Lee, Kwot Sin and Ordonez, Vicente and Tulyakov, Sergey},
  title     = {AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {19373-19385}
}
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Abstract
We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive evaluations demonstrate that AV-Link achieves substantial improvements in audio-video synchronization, outperforming more expensive baselines such as the MovieGen V2A model.
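The abstract describes a Fusion Block that exchanges information between the two modalities via temporally-aligned self-attention over diffusion features. The sketch below is a minimal, illustrative NumPy toy (not the authors' implementation): it assumes single-head attention, random projection matrices, and video/audio feature sequences that have already been resampled to the same number of temporal tokens, so that joint self-attention over the concatenated sequence lets each modality attend to the other's temporally corresponding features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block_sketch(video_feats, audio_feats, d=64, seed=0):
    """Toy sketch of temporally-aligned cross-modal fusion.

    video_feats: (T, Dv) frozen video-diffusion activations (hypothetical)
    audio_feats: (T, Da) frozen audio-diffusion activations (hypothetical)
    Both share the same temporal length T (already aligned/resampled).
    Returns fused (T, d) features for each modality.
    """
    rng = np.random.default_rng(seed)
    # Project each modality into a shared width d (toy random projections).
    Wv = rng.standard_normal((video_feats.shape[1], d)) / np.sqrt(video_feats.shape[1])
    Wa = rng.standard_normal((audio_feats.shape[1], d)) / np.sqrt(audio_feats.shape[1])
    T = video_feats.shape[0]
    # Concatenate temporally-aligned tokens from both modalities: (2T, d).
    tokens = np.concatenate([video_feats @ Wv, audio_feats @ Wa], axis=0)
    # Joint (bidirectional) self-attention: every token can attend to
    # tokens of the other modality, including its time-aligned counterpart.
    attn = softmax(tokens @ tokens.T / np.sqrt(d))
    fused = attn @ tokens
    # Split back into per-modality conditioning features.
    return fused[:T], fused[T:]

v_cond, a_cond = fusion_block_sketch(np.ones((8, 32)), np.ones((8, 16)))
```

In the paper's framework this fusion feeds conditioning signals back into both diffusion models; the toy above only shows the attention-based information exchange itself.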