MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Tanvir Mahmud, Shentong Mo, Yapeng Tian, Diana Marculescu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7996-8005

Abstract


Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment of corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality while also attending to the cross-modal relationships between them. In addition, unlike prior work that aligns only coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from the foreground-matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on the benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over state-of-the-art methods.
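To make the blockwise contrastive alignment concrete, the following is a minimal PyTorch sketch of a symmetric InfoNCE-style loss applied to per-block audio and visual features from a shared transformer. All names here (blockwise_contrastive_loss, audio_feats, visual_feats, temperature) are illustrative assumptions, not the paper's actual code, and the paper's exact formulation may differ.

import torch
import torch.nn.functional as F

def blockwise_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE loss averaged over transformer blocks.

    audio_feats, visual_feats: lists of (batch, dim) tensors, one per block,
    where matching batch indices are corresponding audio-visual pairs.
    """
    total = 0.0
    for a, v in zip(audio_feats, visual_feats):
        # L2-normalize so dot products are cosine similarities
        a = F.normalize(a, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric cross-entropy: audio-to-visual and visual-to-audio
        total = total + 0.5 * (F.cross_entropy(logits, targets)
                               + F.cross_entropy(logits.t(), targets))
    return total / len(audio_feats)

Applying such a loss at every block, rather than only at the encoder output, is what would push intermediate (coarse-to-fine) features of the two modalities toward alignment, consistent with the hierarchical alignment described in the abstract.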

Related Material


BibTeX
@InProceedings{Mahmud_2024_CVPR,
  author    = {Mahmud, Tanvir and Mo, Shentong and Tian, Yapeng and Marculescu, Diana},
  title     = {MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {7996-8005}
}