Delving Into Masked Autoencoders for Multi-Label Thorax Disease Classification

Junfei Xiao, Yutong Bai, Alan Yuille, Zongwei Zhou; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 3588-3600

Abstract


Vision Transformer (ViT) has become one of the most popular neural architectures due to its simplicity, scalability, and compelling performance on multiple vision tasks. However, because medical datasets are relatively small, ViT has shown inferior performance on them even after pre-training on ImageNet. In this paper, we unleash the potential of ViT by pre-training on 266,340 unlabeled chest X-rays. Specifically, we explore Masked Autoencoders (MAE), whose task is to reconstruct the missing pixels of an image from a small visible proportion, and develop a strong recipe for pre-training MAE and fine-tuning on chest X-ray datasets. This recipe reveals that medical reconstruction needs a much smaller visible proportion of each image than natural images (10% vs. 25%) and a more moderate RandomResizedCrop range (scale 0.5-1.0 vs. 0.2-1.0). With our recipe, ViT-S achieves results competitive with the state-of-the-art CNN (DenseNet-121) on three public chest X-ray datasets, with 2.5x faster pre-training on NIH ChestX-ray14 and CheXpert. To the best of our knowledge, we are the first to make vanilla ViT achieve state-of-the-art performance on chest X-ray datasets. We hope this study can direct future research on applying Transformers to a larger variety of medical imaging tasks. Code will be made available.
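The recipe above hinges on how many patch tokens stay visible during MAE pre-training. The sketch below illustrates MAE-style random masking in NumPy, with a 90% mask ratio (10% of patches visible) as the abstract reports for chest X-rays, versus the 75% ratio (25% visible) used for natural images in the original MAE. The function name and tensor shapes are illustrative, not the authors' code.

```python
import numpy as np

def random_masking(patches, mask_ratio, rng):
    """MAE-style random masking: shuffle patch tokens with random noise
    and keep the first (1 - mask_ratio) fraction as visible tokens.
    `patches` has shape (num_tokens, embed_dim)."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    noise = rng.random(n)                 # one random score per token
    ids_shuffle = np.argsort(noise)       # random permutation of token ids
    ids_keep = ids_shuffle[:n_keep]       # tokens the encoder will see
    mask = np.ones(n, dtype=bool)         # True = masked (to reconstruct)
    mask[ids_keep] = False
    return patches[ids_keep], mask

rng = np.random.default_rng(0)
# A 224x224 image split into 16x16 patches gives 196 tokens; 768 is a
# typical ViT embedding dimension (both assumed here for illustration).
tokens = rng.standard_normal((196, 768))
# Chest X-rays: keep only ~10% of patches visible (mask ratio 0.90),
# vs. ~25% visible (mask ratio 0.75) for natural images.
visible, mask = random_masking(tokens, mask_ratio=0.90, rng=rng)
print(visible.shape)  # (19, 768)
```

For fine-tuning, the abstract's moderate cropping range would correspond to something like `RandomResizedCrop(size, scale=(0.5, 1.0))` in torchvision, rather than the default-style `(0.2, 1.0)` used for natural images.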

Related Material


[bibtex]
@InProceedings{Xiao_2023_WACV,
  author    = {Xiao, Junfei and Bai, Yutong and Yuille, Alan and Zhou, Zongwei},
  title     = {Delving Into Masked Autoencoders for Multi-Label Thorax Disease Classification},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {January},
  year      = {2023},
  pages     = {3588-3600}
}