Motion Guided Region Message Passing for Video Captioning

Shaoxiang Chen, Yu-Gang Jiang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1543-1552


Video captioning is an important vision task and has been intensively studied in the computer vision community. Existing methods that utilize fine-grained spatial information have achieved significant improvements; however, they either rely on costly external object detectors or do not sufficiently model spatial and temporal relations. In this paper, we aim to design a spatial information extraction and aggregation method for video captioning that does not need external object detectors. To this end, we propose a Recurrent Region Attention module to better extract diverse spatial features. By employing Motion-Guided Cross-frame Message Passing, our model is aware of the temporal structure and able to establish high-order relations among the diverse regions across frames. Together, these two components encourage information communication and produce compact and powerful video representations. Furthermore, an Adjusted Temporal Graph Decoder is proposed to flexibly update video features and model high-order temporal relations during decoding. Experimental results on three benchmark datasets (MSVD, MSR-VTT, and VATEX) demonstrate that our proposed method can outperform state-of-the-art methods.
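As a rough illustration of the general idea of extracting several region features from a frame by repeated soft attention, without an object detector, the sketch below pools a grid of location features several times in sequence. It is not the paper's Recurrent Region Attention module: the function names, the dot-product scoring, and in particular the query update rule (subtracting the previously pooled feature) are assumptions made only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, locations):
    """Soft attention: dot-product scores, softmax, weighted sum of locations."""
    scores = [sum(q * x for q, x in zip(query, loc)) for loc in locations]
    weights = softmax(scores)
    dim = len(locations[0])
    pooled = [
        sum(w * loc[d] for w, loc in zip(weights, locations))
        for d in range(dim)
    ]
    return pooled, weights


def recurrent_region_attention(locations, init_query, num_regions):
    """Pool `num_regions` features, one attention step after another.

    Hypothetical update rule: after each step the query is shifted away
    from the feature that was just pooled. This is an assumption for
    illustration, not the mechanism described in the paper.
    """
    query = list(init_query)
    regions = []
    for _ in range(num_regions):
        pooled, _weights = attend(query, locations)
        regions.append(pooled)
        query = [q - p for q, p in zip(query, pooled)]
    return regions


# Toy frame: four spatial locations with 2-dimensional features.
frame = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
region_features = recurrent_region_attention(frame, [1.0, 0.0], num_regions=3)
```

Each returned region feature is a convex combination of the location features, so the frame is summarized by a small, fixed number of vectors regardless of the grid size.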

Related Material

@InProceedings{Chen_2021_ICCV,
    author    = {Chen, Shaoxiang and Jiang, Yu-Gang},
    title     = {Motion Guided Region Message Passing for Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1543-1552}
}