DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Rangwani, Harsh; Mondal, Pradipto; Mishra, Mayank; Asokan, Ashish Ramayee; Babu, R. Venkatesh

Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23396-23406

Abstract

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT we divide the input image into patch tokens and process them through a stack of self-attention blocks. However unlike Convolutional Neural Network (CNN) ViT's simple architecture has no informative inductive bias (e.g. locality etc.). Due to this ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT we introduce an efficient and effective way of distillation from CNN via distillation \texttt DIST token by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks improving generalization for tail classes. Further to mitigate overfitting we propose distilling from a flat CNN teacher which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme the distillation DIST token becomes an expert on the tail classes and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018. Project Page: https://rangwani-harsh.github.io/DeiT-LT.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Rangwani_2024_CVPR, author = {Rangwani, Harsh and Mondal, Pradipto and Mishra, Mayank and Asokan, Ashish Ramayee and Babu, R. Venkatesh}, title = {DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {23396-23406} }