Less is More: Picking Informative Frames for Video Captioning

Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang; The European Conference on Computer Vision (ECCV), 2018, pp. 358-373


In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components of a video with sentences. However, existing studies follow a common procedure of frame-level appearance and motion modeling over frames sampled at equal intervals, which can introduce redundant visual information, sensitivity to content noise, and unnecessary computational cost. We propose PickNet, a plug-and-play module that picks informative frames for video captioning. Built on a standard encoder-decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward for each frame-picking action is designed to maximize visual diversity and minimize the discrepancy between the generated caption and the ground truth. Rewarded candidates are selected, and the corresponding latent representation of the encoder-decoder is updated for future trials. This procedure continues until the end of the video sequence. As a result, a compact subset of frames can be selected to represent the visual information and perform video captioning without performance degradation. Experimental results show that our model achieves competitive performance on popular benchmarks while using only 6-8 frames.
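To make the described procedure concrete, below is a minimal sketch of a sequential pick/skip policy trained with REINFORCE, where the episode reward combines a visual-diversity term with a caption-quality term, as the abstract describes. This is an illustrative assumption written against PyTorch, not the authors' implementation: the names (`PickNet`, `diversity_reward`, `language_reward_fn`) and the exact reward shaping are hypothetical.

```python
# Hypothetical sketch of the frame-picking training loop (not the paper's code).
import torch
import torch.nn as nn

class PickNet(nn.Module):
    """Scores each incoming frame feature for a binary pick/skip decision."""
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits for {skip, pick}
        )

    def forward(self, frame_feat):
        return self.net(frame_feat)

def diversity_reward(picked_feats):
    # Assumed visual-diversity term: mean pairwise distance between
    # the features of the frames picked so far.
    if len(picked_feats) < 2:
        return 0.0
    feats = torch.stack(picked_feats)
    dists = torch.cdist(feats, feats)
    return dists.sum().item() / (len(picked_feats) * (len(picked_feats) - 1))

def train_step(picknet, frames, language_reward_fn, optimizer):
    """One REINFORCE episode over a video's frame sequence.

    `language_reward_fn` stands in for the encoder-decoder: it should
    caption the picked frames and return a score that penalizes the
    discrepancy from the ground-truth sentence (e.g., a CIDEr score).
    """
    log_probs, picked = [], []
    for feat in frames:                     # process frames sequentially
        logits = picknet(feat)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()              # 1 = pick, 0 = skip
        log_probs.append(dist.log_prob(action))
        if action.item() == 1:
            picked.append(feat)             # grow the selected frame subset
    # Episode reward: diversity of the picked subset plus caption quality.
    reward = diversity_reward(picked) + language_reward_fn(picked)
    loss = -torch.stack(log_probs).sum() * reward   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In practice one would likely subtract a baseline from the reward to reduce gradient variance; the sketch omits this for brevity.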

Related Material

@InProceedings{Chen_2018_ECCV,
  author    = {Chen, Yangyu and Wang, Shuhui and Zhang, Weigang and Huang, Qingming},
  title     = {Less is More: Picking Informative Frames for Video Captioning},
  booktitle = {The European Conference on Computer Vision (ECCV)},
  month     = {September},
  year      = {2018}
}