Pivot Correlational Neural Network for Multimodal Video Categorization

Sunghun Kang, Junyeong Kim, Hyunsoo Choi, Sungjin Kim, Chang D. Yoo; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 386-401


This paper considers an architecture for multimodal video categorization referred to as the Pivot Correlational Neural Network (Pivot CorrNN). The architecture is trained to maximize the correlation between the hidden states, as well as between the predictions, of the modal-agnostic pivot stream and the modal-specific streams in the network. Here, the modal-agnostic pivot hidden state considers all modal inputs without distinction, while each modal-specific hidden state is dedicated exclusively to one modal input. The Pivot CorrNN consists of three modules: (1) a maximizing pivot-correlation module that maximally correlates the modal-agnostic and modal-specific hidden states as well as their predictions, (2) a contextual Gated Recurrent Unit (cGRU) module that extends a generic GRU to take multimodal inputs when updating the pivot hidden state, and (3) an adaptive aggregation module that aggregates all modal-specific predictions together with the modal-agnostic pivot prediction into one final prediction. We evaluate the Pivot CorrNN on two publicly available large-scale multimodal video categorization datasets, FCVID and YouTube-8M. From the experimental results, Pivot CorrNN achieves the best performance on the FCVID dataset and performance comparable to the state of the art on the YouTube-8M dataset.
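To make the pivot-correlation objective concrete, the sketch below computes a Pearson correlation between a pivot hidden state and a modal-specific hidden state over a batch, and negates it so that minimizing the loss maximizes the correlation. This is only an illustrative sketch under standard assumptions; the function names and the exact form of the paper's correlation objective are hypothetical, not taken from the authors' implementation.

```python
import numpy as np

def pearson_correlation(h_pivot, h_modal, eps=1e-8):
    """Mean per-unit Pearson correlation over the batch dimension.

    h_pivot, h_modal: arrays of shape (batch, hidden_dim), e.g. the
    modal-agnostic pivot hidden state and one modal-specific hidden state.
    Hypothetical helper; the paper's exact objective may differ.
    """
    hp = h_pivot - h_pivot.mean(axis=0, keepdims=True)  # center over batch
    hm = h_modal - h_modal.mean(axis=0, keepdims=True)
    num = (hp * hm).sum(axis=0)                          # covariance term
    den = np.sqrt((hp ** 2).sum(axis=0) * (hm ** 2).sum(axis=0)) + eps
    return float(np.mean(num / den))

def pivot_correlation_loss(h_pivot, h_modal):
    """Negative correlation: minimizing this maximizes the correlation."""
    return -pearson_correlation(h_pivot, h_modal)
```

Identical states yield a correlation near 1 (loss near -1), so a trained network is pushed to align each modal-specific stream with the pivot stream; the same loss form can also be applied to the streams' predictions, as the paper's first module does for both hidden states and predictions.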

Related Material

@InProceedings{Kang_2018_ECCV,
author = {Kang, Sunghun and Kim, Junyeong and Choi, Hyunsoo and Kim, Sungjin and Yoo, Chang D.},
title = {Pivot Correlational Neural Network for Multimodal Video Categorization},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}