Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

Shvetsova, Nina; Chen, Brian; Rouditchenko, Andrew; Thomas, Samuel; Kingsbury, Brian; Feris, Rogerio S.; Harwath, David; Glass, James; Kuehne, Hilde


Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio S. Feris, David Harwath, James Glass, Hilde Kuehne; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20020-20029

Abstract

Multi-modal learning from video data has seen increased attention recently, as it allows training of semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and action localization. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a fused representation in a joint multi-modal embedding space. We propose to train the system with a combinatorial loss on everything at once: any combination of input modalities, including single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization. Our code for this work is also available.
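As a rough illustration of what a combinatorial loss over single modalities and pairs of modalities might look like, the sketch below enumerates the modality subsets, pairs up the disjoint ones, and sums a symmetric contrastive (InfoNCE-style) term over each pair. The abstract does not specify the contrastive objective, the loss weights, or which pairs are contrasted, so those choices are assumptions here. All function names are hypothetical, and the mean-based `fuse` is only a stand-in for the fusion transformer described above.

```python
from itertools import combinations
import math


def modality_combinations(modalities, max_size=2):
    """All non-empty modality subsets up to max_size (singles and pairs by default)."""
    return [frozenset(c)
            for k in range(1, max_size + 1)
            for c in combinations(modalities, k)]


def loss_term_pairs(modalities, max_size=2):
    """Pairs of disjoint modality subsets that are contrasted with each other (assumed scheme)."""
    combos = modality_combinations(modalities, max_size)
    return [(a, b) for a, b in combinations(combos, 2) if a.isdisjoint(b)]


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def normalize(x):
    norm = math.sqrt(dot(x, x)) or 1.0  # guard against zero vectors
    return [p / norm for p in x]


def fuse(embeddings):
    """Stand-in fusion: element-wise mean. The paper uses a fusion transformer instead."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def info_nce(queries, keys, temperature=0.1):
    """One-directional InfoNCE: queries[i] should match keys[i] within the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def combinatorial_loss(batch, temperature=0.1):
    """batch maps a modality name to a list of per-sample embedding vectors."""
    def embed(subset):
        names = sorted(subset)
        n_samples = len(batch[names[0]])
        return [normalize(fuse([batch[m][i] for m in names]))
                for i in range(n_samples)]

    total = 0.0
    for a, b in loss_term_pairs(sorted(batch)):
        ea, eb = embed(a), embed(b)
        # Symmetric term: contrast a against b and b against a.
        total += info_nce(ea, eb, temperature) + info_nce(eb, ea, temperature)
    return total
```

With three modalities (text, video, audio), `loss_term_pairs` yields six terms: three single-vs-single pairs and three single-vs-fused-pair pairs. Because every subset is embedded by the same `embed` routine, the same code path can also embed any subset of modalities at test time, which mirrors the property claimed in the abstract.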

Related Material


@InProceedings{Shvetsova_2022_CVPR,
    author    = {Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio S. and Harwath, David and Glass, James and Kuehne, Hilde},
    title     = {Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {20020-20029}
}