Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning

Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia; The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 8918-8927


Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.
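As a rough illustration of what it means for words to be conditional on the inferred syntax, the toy sketch below mixes per-POS-tag word distributions using the probability of each tag at one decoding step. It is not the paper's model: the function name `mix_word_distribution`, the tag set, and all probabilities are hypothetical values chosen for illustration, and the visual features that would drive both distributions are omitted.

```python
from typing import Dict


def mix_word_distribution(
    pos_probs: Dict[str, float],
    word_probs_by_pos: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Marginalize over POS tags: p(word) = sum_t p(t) * p(word | t)."""
    mixed: Dict[str, float] = {}
    for tag, p_tag in pos_probs.items():
        for word, p_word in word_probs_by_pos[tag].items():
            mixed[word] = mixed.get(word, 0.0) + p_tag * p_word
    return mixed


# Hypothetical output of a video POS tagger for one position in the caption.
pos_probs = {"NOUN": 0.75, "VERB": 0.25}

# Hypothetical word distributions given each POS tag (made-up numbers).
word_probs_by_pos = {
    "NOUN": {"man": 0.75, "guitar": 0.25},
    "VERB": {"plays": 1.0},
}

word_probs = mix_word_distribution(pos_probs, word_probs_by_pos)
best_word = max(word_probs, key=word_probs.get)
```

In this toy setting the predicted tag distribution re-weights which vocabulary entries are plausible, so a confident NOUN prediction pushes probability mass toward noun candidates before any word is chosen.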

Related Material

author = {Hou, Jingyi and Wu, Xinxiao and Zhao, Wentian and Luo, Jiebo and Jia, Yunde},
title = {Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}