ViTamin: Designing Scalable Vision Models in the Vision-Language Era

Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12954-12966

Abstract


Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision community. Thanks to training on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. However, despite the remarkable achievements of VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven its effectiveness for text encoding, it remains questionable whether the same holds for image encoding, especially considering that various types of networks have been proposed on the ImageNet benchmark yet are rarely studied in VLMs. Due to the small data and model scales, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
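For readers unfamiliar with the evaluation protocol the abstract refers to, the sketch below shows how CLIP-style zero-shot classification is typically run with the OpenCLIP library: class names are turned into text prompts, and an image is assigned to the class whose text embedding is most similar to its image embedding. This is a minimal illustration, not the paper's code; the model name and pretrained tag follow OpenCLIP's public checkpoints, and the class names and image file are placeholders.

import torch
import open_clip
from PIL import Image

# Illustrative checkpoint: a ViT-L/14 trained on DataComp-1B via OpenCLIP.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Placeholder label set; ImageNet zero-shot uses all 1000 class names.
class_names = ["golden retriever", "tabby cat", "sports car"]
text = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between L2-normalized embeddings yields class logits.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))

Swapping the image encoder (e.g., ViTamin-L for ViT-L) while keeping the text tower, data, and training scheme fixed is what makes this protocol a controlled comparison of vision backbones.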

Related Material


@InProceedings{Chen_2024_CVPR,
  author    = {Chen, Jieneng and Yu, Qihang and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  title     = {ViTamin: Designing Scalable Vision Models in the Vision-Language Era},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {12954-12966}
}