Cross-modal Embeddings for Video and Audio Retrieval

Didac Suris, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

Abstract


In this work, we explore the multi-modal information provided by the YouTube-8M dataset by projecting its audio and visual features into a common feature space to obtain joint audio-visual embeddings. These embeddings are used to retrieve audio samples that fit a given silent video, and also to retrieve images that match a given query audio. The results, measured in terms of Recall@K over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning.
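As an illustration of the Recall@K metric mentioned above, the following is a minimal sketch rather than the authors' evaluation code. It assumes that video and audio embeddings already live in a common space, that the i-th video and the i-th audio clip come from the same source video, and that candidates are ranked by cosine similarity. The function names `cosine_similarity` and `recall_at_k` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_embeddings, candidate_embeddings, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k.

    Assumption: query i and candidate i come from the same video.
    """
    hits = 0
    for i, query in enumerate(query_embeddings):
        scores = [cosine_similarity(query, c) for c in candidate_embeddings]
        # Candidate indices sorted from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embeddings)


# Toy 2-D embeddings, invented for illustration only.
video_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_embeddings = [[0.9, 0.1], [0.8, 0.6], [0.2, 0.9]]

for k in (1, 2):
    print(f"Recall@{k}:", recall_at_k(video_embeddings, audio_embeddings, k))
```

Swapping the two arguments gives the audio-to-video direction, so the same function covers both retrieval tasks described in the abstract.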

Related Material


[bibtex]
@InProceedings{Suris_2018_ECCV_Workshops,
author = {Suris, Didac and Duarte, Amanda and Salvador, Amaia and Torres, Jordi and Giro-i-Nieto, Xavier},
title = {Cross-modal Embeddings for Video and Audio Retrieval},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}