ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders
Driven by improved architectures and better representation learning frameworks, the field of visual recognition enjoyed rapid modernization and performance gains in the early 2020s. For example, modern ConvNets, represented by ConvNeXt models, have demonstrated strong performance across different application scenarios. Like many other architectures, ConvNeXt models were designed under the supervised learning setting with ImageNet labels. It is natural to expect that ConvNeXt can also benefit from state-of-the-art self-supervised learning frameworks such as masked autoencoders (MAE), which were originally designed for Transformers. However, we show that simply combining the two designs yields subpar performance. In this paper, we develop an efficient and fully convolutional masked autoencoder framework. We then upgrade the ConvNeXt architecture with a new Global Response Normalization (GRN) layer. GRN enhances inter-channel feature competition and is crucial for pre-training with masked input. The new model family, dubbed ConvNeXt V2, comes with a complete training recipe that combines the architectural improvement with the advancement in self-supervised learning. With ConvNeXt V2, we significantly advance the performance of pure ConvNets across recognition benchmarks including ImageNet classification, ADE20K segmentation, and COCO detection. To accommodate different use cases, we provide pre-trained ConvNeXt V2 models spanning a wide range of complexity: from an efficient 3.7M-parameter Atto model that achieves 76.8% top-1 accuracy on ImageNet to a 650M-parameter Huge model that reaches a state-of-the-art 88.9% accuracy using public training data only.
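
The abstract names GRN but does not spell out how it works. As a rough illustration only, the sketch below assumes the commonly described three-step form: a per-channel global L2 aggregation over the spatial dimensions, a divisive normalization across channels, and a learnable calibration with a residual connection. The function name `grn`, the channels-last layout, the `eps` constant, and the toy inputs are assumptions made for this example, not details taken from the text above.

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Sketch of Global Response Normalization for an array of shape (N, H, W, C)."""
    # 1) Global aggregation: L2 norm of each channel's spatial map, shape (N, 1, 1, C).
    gx = np.sqrt(np.sum(x ** 2, axis=(1, 2), keepdims=True))
    # 2) Divisive normalization: each channel's norm relative to the mean over channels.
    nx = gx / (np.mean(gx, axis=-1, keepdims=True) + eps)
    # 3) Calibration: rescale the input by its relative response, then add a residual.
    return gamma * (x * nx) + beta + x

# With gamma = beta = 0 (an assumed initialization), the layer is the identity.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 7, 7, 8))
y = grn(x, np.zeros(8), np.zeros(8))

# Toy case: channel 0 has three times the magnitude of channel 1.
toy = np.ones((1, 2, 2, 2))
toy[..., 0] *= 3.0
toy_out = grn(toy, np.ones(2), np.zeros(2))
```

In the toy case the stronger channel is scaled by a relative response above 1 and the weaker one by a response below 1, which is one way to read "inter-channel feature competition": channels with a larger global response are amplified relative to the others.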