Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2627-2638

Abstract


Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features and CLAP for audio features. Furthermore, the CLIP and CLAP text encoders provide class label embeddings, which are combined to boost the performance of the system. We propose a simple yet effective model that relies only on feed-forward neural networks, exploiting the strong generalization capabilities of the new audio, visual, and textual features. Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL with our new features. Code and data are available at: https://github.com/dkurzend/ClipClap-GZSL.
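To make the described pipeline concrete, the following is a minimal, hypothetical sketch of the kind of system the abstract outlines: pre-extracted CLIP visual features and CLAP audio features are fused by small feed-forward networks and compared against combined class label text embeddings for zero-shot classification. The feature dimensions, layer sizes, fusion scheme, and all names below are illustrative assumptions, not the paper's actual architecture; consult the linked repository for the real implementation.

```python
# Hypothetical sketch (not the paper's architecture): feed-forward fusion of
# CLIP visual and CLAP audio features, scored against class text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualFusion(nn.Module):
    """Projects pre-extracted CLIP (visual) and CLAP (audio) features into a
    joint embedding space shared with the class label text embeddings."""

    def __init__(self, vis_dim=512, aud_dim=512, embed_dim=512):
        super().__init__()
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))
        self.aud_proj = nn.Sequential(nn.Linear(aud_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, embed_dim))

    def forward(self, vis_feat, aud_feat):
        # Fuse the two modalities by averaging their projections
        # (an illustrative choice, not the paper's fusion scheme).
        joint = 0.5 * (self.vis_proj(vis_feat) + self.aud_proj(aud_feat))
        return F.normalize(joint, dim=-1)

def zero_shot_logits(joint_emb, class_text_emb):
    """Cosine-similarity logits against the combined (CLIP + CLAP) class
    label embeddings; the highest-scoring class is the prediction, so unseen
    classes can be recognized from their text embeddings alone."""
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return joint_emb @ class_text_emb.t()

# Usage with random stand-ins for pre-extracted features:
fusion = AudioVisualFusion()
vis = torch.randn(4, 512)   # CLIP visual features for 4 video clips
aud = torch.randn(4, 512)   # CLAP audio features for the same clips
txt = torch.randn(10, 512)  # combined class label embeddings, 10 classes
preds = zero_shot_logits(fusion(vis, aud), txt).argmax(dim=-1)
```

In a generalized zero-shot setting, the class text embeddings would cover both seen and unseen classes, so the same cosine-similarity scoring handles both at test time.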

Related Material


@InProceedings{Kurzendorfer_2024_CVPR,
    author    = {Kurzend\"orfer, David and Mercea, Otniel-Bogdan and Koepke, A. Sophia and Akata, Zeynep},
    title     = {Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2627-2638}
}