Video Language Model Pretraining with Spatio-temporal Masking

Yue Wu, Zhaobo Qi, Junshu Sun, Yaowei Wang, Qingming Huang, Shuhui Wang
Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), June 2025, pp. 8557-8567

BibTeX:
@InProceedings{Wu_2025_CVPR,
  author    = {Wu, Yue and Qi, Zhaobo and Sun, Junshu and Wang, Yaowei and Huang, Qingming and Wang, Shuhui},
  title     = {Video Language Model Pretraining with Spatio-temporal Masking},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {8557-8567}
}
Abstract
The development of self-supervised video-language models based on masked learning has significantly advanced downstream video tasks. These models leverage masked reconstruction to facilitate joint learning of visual and linguistic information. However, a recent study reveals that reconstructing image features yields superior downstream performance compared to reconstructing video features. We hypothesize that this performance gap stems from the way masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we conducted two sets of experiments, which demonstrate that alignment between the masked target and the reconstruction target is crucial for self-supervised video-language learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. Through the combination of the masking strategy and the reconstruction decoder, STM forces the model to learn comprehensive spatio-temporal feature representations. Experiments on three downstream video understanding tasks validate the superiority of our method.
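As a rough illustration of the masking idea described in the abstract, the sketch below builds a mask that covers the same patch positions in pairs of adjacent frames, so a masked patch cannot be recovered simply by copying its temporal neighbour. This is a minimal, hypothetical example: the function name, mask ratio, patch-grid size, and the frame-pairing rule are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch (assumed details, not the paper's exact STM formulation) of
# masking that spans adjacent frames: each pair of consecutive frames shares
# one random spatial mask, encouraging reconstruction from temporal context.
from typing import Optional

import torch


def spatio_temporal_mask(num_frames: int, grid_h: int, grid_w: int,
                         mask_ratio: float = 0.5,
                         generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Return a boolean mask of shape (num_frames, grid_h * grid_w).

    Frames (0, 1) share one random spatial mask, frames (2, 3) share another,
    and so on, so a masked patch is also masked in its adjacent frame.
    """
    num_patches = grid_h * grid_w
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_frames, num_patches, dtype=torch.bool)

    for start in range(0, num_frames, 2):
        # One spatial mask shared by the pair of adjacent frames.
        idx = torch.randperm(num_patches, generator=generator)[:num_masked]
        mask[start, idx] = True
        if start + 1 < num_frames:
            mask[start + 1, idx] = True
    return mask


if __name__ == "__main__":
    m = spatio_temporal_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.5)
    print(m.shape)                  # torch.Size([8, 196])
    print(m.float().mean().item())  # roughly 0.5 of all patches masked
```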