GAFNet: A Global Fourier Self Attention Based Novel Network for Multi-Modal Downstream Tasks

Onkar Susladkar, Gayatri Deshmukh, Dhruv Makwana, Sparsh Mittal, R. Sai Chandra Teja, Rekha Singhal; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5242-5251

Abstract


In "vision and language" problems, multimodal inputs are processed simultaneously to build a combined visual and textual understanding for image-text embedding. In this paper, we discuss the necessity of considering the difference between the feature space and the distribution when performing multimodal learning. We address this problem with deep learning and a generative-model approach. We introduce a novel network, GAFNet (Global Attention Fourier Net), which is pre-trained at large scale on three image-text datasets (COCO, SBU, and CC-3M) to achieve high performance on downstream vision and language tasks. We propose a GAF (Global Attention Fourier) module that integrates multiple modalities into one latent space. The GAF module is independent of the type of modality and allows shared representations to be combined at each stage. There are various ways of thinking about the relationships between different modalities, and this choice directly affects the model's design. Conventional multimodal learning does not consider global attention. A GAF-based model can work with any modality (language, image, audio, category) and is designed to be used for different tasks. In contrast to previous research, our work treats visual grounding as a pretrainable and transferable quality rather than something that must be trained from scratch. Experimental results demonstrate that our technique is competitive and achieves state-of-the-art performance on a variety of popular downstream vision-language tasks, including image generation and image-text retrieval.
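
The abstract does not specify the internals of the GAF module, so the following is only an illustrative sketch of the general idea of mixing tokens from several modalities globally in the Fourier domain. It is not the authors' implementation: it substitutes FNet-style Fourier token mixing (a 2D FFT over the sequence and hidden axes, keeping the real part) for the GAF module, and the function names `fourier_mix` and `fuse_modalities`, the token counts, and the hidden size are all assumptions made for illustration.

```python
import numpy as np


def fourier_mix(tokens: np.ndarray) -> np.ndarray:
    """FNet-style global mixing (a stand-in, not the paper's GAF module).

    Applies a 2D FFT over the (sequence, hidden) axes and keeps the real part,
    so every output position depends on every input position.
    """
    return np.fft.fft2(tokens, axes=(-2, -1)).real


def fuse_modalities(image_tokens: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Hypothetical fusion step: place both modalities in one token sequence,
    mix them globally, then apply a residual connection and layer normalisation."""
    # Concatenate along the sequence axis; both inputs share the hidden size.
    joint = np.concatenate([image_tokens, text_tokens], axis=0)
    out = joint + fourier_mix(joint)
    mean = out.mean(axis=-1, keepdims=True)
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + 1e-6)


# Assumed shapes: 49 image patch tokens and 16 text tokens, hidden size 64.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(49, 64))
text_tokens = rng.normal(size=(16, 64))

fused = fuse_modalities(image_tokens, text_tokens)
print(fused.shape)  # (65, 64)
```

Because the transform runs over the whole concatenated sequence, a change to any text token alters the fused representation of the image tokens as well, which is the sense in which the mixing is global. Nothing in this step depends on which modality a token came from, only on the tokens sharing a hidden size.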

Related Material


[bibtex]
@InProceedings{Susladkar_2023_WACV,
    author    = {Susladkar, Onkar and Deshmukh, Gayatri and Makwana, Dhruv and Mittal, Sparsh and Teja, R. Sai Chandra and Singhal, Rekha},
    title     = {GAFNet: A Global Fourier Self Attention Based Novel Network for Multi-Modal Downstream Tasks},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {5242-5251}
}