Efficient MAE Towards Large-Scale Vision Transformers
The Masked Autoencoder (MAE) has demonstrated superb pre-training efficiency for vision Transformers, thanks to its partial-input paradigm and high mask ratio (0.75). However, MAE often suffers a severe performance drop under higher mask ratios, which limits its potential for larger-scale vision Transformers. In this work, we identify that the performance drop is largely attributable to the over-dominance of difficult reconstruction targets, since higher mask ratios leave sparser visible patches and fewer visual clues for reconstruction. To mitigate this issue, we design Efficient MAE, which introduces a novel Difficulty-Flatten Loss and a decoder masking strategy, enabling a higher mask ratio for more efficient pre-training. The Difficulty-Flatten Loss provides balanced supervision across reconstruction targets of different difficulties, effectively mitigating the performance drop under higher mask ratios. Additionally, the decoder masking strategy discards the most difficult reconstruction targets, which further eases optimization and noticeably accelerates pre-training. Our proposed Efficient MAE delivers 27% and 30% pre-training runtime accelerations for the ViT-Large and ViT-Huge models, respectively, provides valuable insights into MAE's optimization, and paves the way for larger-scale vision Transformer pre-training. Code and pre-trained models will be released.
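To make the effect of the mask ratio concrete, the following is a minimal sketch rather than the released implementation. It assumes MAE-style uniform random masking over a 14 x 14 grid of 196 patches (a 224 x 224 image with 16 x 16 patches, which the abstract does not state), and it assumes a per-patch `difficulty` score is available; the abstract does not specify how difficulty is measured or how many targets are discarded. The names `num_visible`, `random_mask`, `drop_hardest_targets`, and `drop_ratio` are hypothetical.

```python
import random


def num_visible(num_patches: int, mask_ratio: float) -> int:
    """Number of patches the encoder sees under a given mask ratio."""
    return int(num_patches * (1 - mask_ratio))


def random_mask(num_patches: int, mask_ratio: float, rng: random.Random):
    """Uniform random masking: return (visible_idx, masked_idx), both sorted."""
    perm = list(range(num_patches))
    rng.shuffle(perm)
    keep = num_visible(num_patches, mask_ratio)
    return sorted(perm[:keep]), sorted(perm[keep:])


def drop_hardest_targets(masked_idx, difficulty, drop_ratio: float):
    """Illustrative decoder masking: discard the hardest masked targets.

    `difficulty[i]` is an assumed per-patch difficulty score (for example a
    running reconstruction error); higher means harder. The scoring rule and
    `drop_ratio` are assumptions, not taken from the abstract.
    """
    ranked = sorted(masked_idx, key=lambda i: difficulty[i])  # easiest first
    keep = len(ranked) - int(len(ranked) * drop_ratio)
    return sorted(ranked[:keep])


# Visible patches shrink quickly as the mask ratio grows.
print(num_visible(196, 0.75))  # 49
print(num_visible(196, 0.90))  # 19

visible, masked = random_mask(196, 0.75, random.Random(0))
toy_difficulty = [float(i) for i in range(196)]  # placeholder scores only
targets = drop_hardest_targets(masked, toy_difficulty, drop_ratio=0.5)
print(len(visible), len(masked), len(targets))  # 49 147 74
```

In this sketch, going from a mask ratio of 0.75 to 0.90 cuts the encoder input from 49 to 19 patches, which illustrates both the efficiency gain and the loss of visual clues that the abstract describes. Dropping part of the masked targets likewise shortens the decoder's input sequence.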