VideoMAC: Video Masked Autoencoders Meet ConvNets

Gensheng Pei, Tao Chen, Xiruo Jiang, Huafeng Liu, Zeren Sun, Yazhou Yao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22733-22743

Abstract


Recently, the advancement of self-supervised learning techniques such as masked autoencoders (MAE) has greatly influenced visual representation learning for images and videos. Nevertheless, the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach: a dual encoder architecture comprising an online encoder and an exponential moving average target encoder, aimed at facilitating inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% $\mathcal{J}\&\mathcal{F}$), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
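As a rough illustration of two ingredients named in the abstract, the sketch below samples one patch mask and applies it to both frames of a pair, then performs an exponential moving average (EMA) update of target parameters from online parameters. It is not the authors' implementation: reading "symmetric masking" as one mask shared by both frames, the 4x4 patch grid, the 0.75 mask ratio, the 0.5 momentum, and all function names are assumptions made only for this example, and plain Python lists stand in for tensors.

```python
import random


def symmetric_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Sample one random patch mask intended to be shared by both frames.

    Assumption: "symmetric" is read here as "same mask for both frames".
    Returns a grid_h x grid_w grid where 1 = masked and 0 = visible.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [[1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(frame_patches, mask):
    """Replace masked patch values with 0.0 and keep visible ones."""
    return [[0.0 if m else v for v, m in zip(row_v, row_m)]
            for row_v, row_m in zip(frame_patches, mask)]


def ema_update(target_params, online_params, momentum):
    """Per-parameter EMA: target <- momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


# Hypothetical frame pair: each frame is a 4x4 grid of scalar patch values.
frame_a = [[1.0] * 4 for _ in range(4)]
frame_b = [[2.0] * 4 for _ in range(4)]

mask = symmetric_patch_mask(4, 4, mask_ratio=0.75, seed=0)
masked_a = apply_mask(frame_a, mask)
masked_b = apply_mask(frame_b, mask)

# Hypothetical flattened parameters for the online and target encoders.
target = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.5)
```

Because the mask is sampled once and reused, the visible positions in `masked_a` and `masked_b` coincide. In the EMA step, a momentum closer to 1 makes the target parameters change more slowly than the online ones.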

Related Material


@InProceedings{Pei_2024_CVPR,
    author    = {Pei, Gensheng and Chen, Tao and Jiang, Xiruo and Liu, Huafeng and Sun, Zeren and Yao, Yazhou},
    title     = {VideoMAC: Video Masked Autoencoders Meet ConvNets},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {22733-22743}
}