Adaptive Pooling in Multi-Instance Learning for Web Video Annotation

Yizhou Zhou, Xiaoyan Sun, Dong Liu, Zhengjun Zha, Wenjun Zeng; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 318-327

Abstract


Web videos are usually weakly annotated, i.e., a tag is associated with a video as soon as the corresponding concept appears in any frame of the video, without indicating when or where it occurs. These weakly annotated tags cause serious problems for many Web video applications, e.g., search and recommendation. In this paper, we present a new Web video annotation approach based on multi-instance learning (MIL) with a learnable pooling function. By formulating Web video annotation as an MIL problem, we present an end-to-end deep network framework in which the frame (instance) level annotation is estimated, via a convolutional neural network, from tags given at the video (bag of instances) level. Experimental results demonstrate that our framework not only improves the accuracy of Web video annotation, outperforming state-of-the-art Web video annotation methods on the large-scale video dataset FCVID, but also helps to infer the most relevant frames in Web videos.
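
To make the MIL setting concrete, the following is a minimal sketch of how frame-level scores might be pooled into a single video-level score with a tunable pooling function. It is an illustration only: the abstract does not specify the form of the learnable pooling function, so the softmax-weighted pooling used here, the sharpness parameter `r`, and the function names `softmax_pool` and `frame_relevance` are all assumptions rather than the authors' method. In a real network, `r` would be trained jointly with the CNN; here it is simply passed in.

```python
import math


def frame_relevance(frame_scores, r):
    """Return normalized pooling weights, one per frame (they sum to 1).

    Assumption: softmax weighting with a hypothetical sharpness parameter r.
    Frames with higher scores receive larger weights when r > 0.
    """
    # Subtract the maximum exponent for numerical stability.
    m = max(r * s for s in frame_scores)
    weights = [math.exp(r * s - m) for s in frame_scores]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_pool(frame_scores, r):
    """Pool frame (instance) scores into one video (bag) score.

    r = 0 reduces to mean pooling; a large positive r approaches max pooling.
    """
    weights = frame_relevance(frame_scores, r)
    return sum(w * s for w, s in zip(weights, frame_scores))


# Hypothetical per-frame scores for one tag in one video.
scores = [0.1, 0.9, 0.2]
video_score_mean_like = softmax_pool(scores, 0.0)
video_score_max_like = softmax_pool(scores, 50.0)
most_relevant_frame = max(range(len(scores)),
                          key=lambda i: frame_relevance(scores, 5.0)[i])
```

Under these assumptions, the same weights that form the video-level score can also be read as a ranking of frames, which is one plausible way a pooling-based MIL model could point to the most relevant frames.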

Related Material


[bibtex]
@InProceedings{Zhou_2017_ICCV,
author = {Zhou, Yizhou and Sun, Xiaoyan and Liu, Dong and Zha, Zhengjun and Zeng, Wenjun},
title = {Adaptive Pooling in Multi-Instance Learning for Web Video Annotation},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}