ViTS: Video Tagging System From Massive Web Multimedia Collections

Delia Fernandez, David Varas, Joan Espadaler, Issey Masuda, Jordi Ferreira, Alejandro Woodward, David Rodriguez, Xavier Giro-i-Nieto, Juan Carlos Riveiro, Elisenda Bou; Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2017, pp. 337-346


The popularization of multimedia content on the Web has created the need to automatically understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging System which learns from videos, their web context and comments shared on social networks. ViTS analyses massive multimedia collections by Internet crawling, and maintains a knowledge base that updates in real time with no need of human supervision. As a result, each video is indexed with a rich set of labels and linked with other related content. ViTS is an industrial product in commercial use with a vocabulary of over 2.5M concepts, capable of indexing more than 150k videos per month. We compare the quality and completeness of our tags with those in the YouTube-8M dataset, and we show how ViTS enhances the semantic annotation of the videos with a larger number of labels (10.04 tags/video) and an accuracy of 80.87%. Extracted tags and video summaries are publicly available.

Related Material

author = {Fernandez, Delia and Varas, David and Espadaler, Joan and Masuda, Issey and Ferreira, Jordi and Woodward, Alejandro and Rodriguez, David and Giro-i-Nieto, Xavier and Riveiro, Juan Carlos and Bou, Elisenda},
title = {ViTS: Video Tagging System From Massive Web Multimedia Collections},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2017}