Bootstrapping SparseFormers from Vision Foundation Models

Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17710-17721

Abstract


The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by using a significantly smaller number of visual tokens with adjustable RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are standard transformer blocks, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather small set of training samples (e.g., IN-1K), without labels or captions, within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) reaches 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer bootstrapped from CLIP also demonstrates notable zero-shot performance at greatly reduced computational cost, without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align their output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
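
To make the bootstrapping recipe described above concrete, below is a minimal PyTorch-style sketch of the two ingredients the abstract names: inheriting and freezing most pre-trained ViT blocks while training only a lightweight focusing stage and a few early blocks, and aligning the final token representation to a frozen teacher without labels or captions. The names (FocusingTransformer, build_sparseformer_from_vit, num_trainable_early_blocks) and the cosine alignment loss are illustrative assumptions, not the authors' implementation; refer to the repository linked above for the actual code.

    # Illustrative sketch only; module names and the alignment loss are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FocusingTransformer(nn.Module):
        """Stand-in for the SparseFormer-specific lightweight focusing stage
        that adjusts token RoIs and produces a small set of visual tokens."""

        def __init__(self, num_tokens=49, dim=1024, depth=2):
            super().__init__()
            self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)

        def forward(self, images):
            b = images.shape[0]
            # The real focusing transformer sparsely samples image features
            # inside each token's RoI; here latent tokens act as a placeholder.
            return self.encoder(self.tokens.unsqueeze(0).expand(b, -1, -1))

    def build_sparseformer_from_vit(vit_blocks, num_trainable_early_blocks=4):
        """Inherit pre-trained ViT blocks, keep only the first few trainable,
        and pair them with a trainable focusing stage."""
        focusing = FocusingTransformer()
        for i, block in enumerate(vit_blocks):
            trainable = i < num_trainable_early_blocks
            for p in block.parameters():
                p.requires_grad_(trainable)
        return nn.ModuleDict({"focusing": focusing,
                              "blocks": nn.ModuleList(vit_blocks)})

    def alignment_loss(student_tokens, teacher_embedding):
        """Label-free bootstrapping objective: align the pooled student
        representation with the frozen teacher's output embedding
        (cosine distance chosen here purely for illustration)."""
        student = student_tokens.mean(dim=1)
        return 1.0 - F.cosine_similarity(student, teacher_embedding, dim=-1).mean()

    # Assumed usage: load a pre-trained ViT (e.g. via timm), reuse its blocks,
    # and distill against its frozen output on unlabeled IN-1K images:
    #   vit = timm.create_model('vit_large_patch16_384', pretrained=True)
    #   model = build_sparseformer_from_vit(list(vit.blocks))
    #   loss = alignment_loss(model["focusing"](images), frozen_teacher(images))

Because the inherited blocks stay frozen and only the focusing stage plus a handful of early blocks receive gradients, this kind of alignment training is what allows bootstrapping within a few hours on a comparatively small, unlabeled dataset.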

Related Material


@InProceedings{Gao_2024_CVPR,
  author    = {Gao, Ziteng and Tong, Zhan and Lin, Kevin Qinghong and Chen, Joya and Shou, Mike Zheng},
  title     = {Bootstrapping SparseFormers from Vision Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {17710-17721}
}