A Joint Sequence Fusion Model for Video Question Answering and Retrieval

Youngjae Yu, Jongseok Kim, Gunhee Kim; The European Conference on Computer Vision (ECCV), 2018, pp. 471-487


We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g., a video sequence and a language sentence). Our idea is to learn an effective multimodal matching network for sequence data, consisting of two key components. First, the Joint Semantic Tensor embeds the joint representation of the two sequences. Then, the Convolutional Hierarchical Decoder computes a matching score, or predicts a word as an answer to a question, by discovering hidden hierarchical joint relations between the two sequence modalities. Both modules leverage an attention mechanism to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, we focus on video-language tasks, including multimodal retrieval and video QA. We evaluate JSFusion on three VQA and retrieval tasks in LSMDC, where it achieves the best performance by significant margins. We also perform multiple-choice and movie retrieval tasks on the MSR-VTT dataset, on which our approach also outperforms many state-of-the-art methods.
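The two-stage idea described above (build a joint representation over all frame-word pairs, then gate and aggregate it into a matching score) can be illustrated with a toy sketch. This is not the authors' implementation: the element-wise product used for each cell, the sigmoid gate, and the plain average that stands in for the learned Convolutional Hierarchical Decoder are all assumptions chosen for brevity, and the function names are hypothetical.

```python
import math

def joint_tensor(video, sentence):
    """Cell (i, j) holds the element-wise product of frame i and word j.

    Assumption: the pairwise combination is a plain product of raw features;
    a learned model would project both modalities first.
    """
    return [[[v * w for v, w in zip(frame, word)] for word in sentence]
            for frame in video]

def attention_gates(tensor):
    """One sigmoid gate per cell, computed from the summed joint feature.

    Assumption: the gate has no learned parameters here.
    """
    return [[1.0 / (1.0 + math.exp(-sum(cell))) for cell in row]
            for row in tensor]

def matching_score(video, sentence):
    """Average of gated cell responses.

    Assumption: a flat average replaces the hierarchical convolutional decoder.
    """
    tensor = joint_tensor(video, sentence)
    gates = attention_gates(tensor)
    gated = [gate * sum(cell)
             for tensor_row, gate_row in zip(tensor, gates)
             for cell, gate in zip(tensor_row, gate_row)]
    return sum(gated) / len(gated)

# Toy 2-frame video and two 2-word sentences with 2-dimensional features.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[-1.0, 0.0], [0.0, -1.0]]

score_aligned = matching_score(video, aligned)
score_misaligned = matching_score(video, misaligned)
```

In this toy setting the gate shrinks the contribution of cells whose joint feature is negative, so the aligned sentence scores higher than the misaligned one. For retrieval, such a score would be computed for every candidate and the candidates ranked by it.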

Related Material

author = {Yu, Youngjae and Kim, Jongseok and Kim, Gunhee},
title = {A Joint Sequence Fusion Model for Video Question Answering and Retrieval},
booktitle = {The European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}