Vision Transformer Compression and Architecture Exploration with Efficient Embedding Space Search
This paper addresses theoretical and practical problems in compressing vision transformers for resource-constrained environments. We find that deep feature collapse and gradient collapse can occur during the search process for vision transformer compression. Deep feature collapse rapidly diminishes feature diversity as layer depth increases, and gradient collapse causes gradient explosion during training. To address these issues, we propose a novel framework, called VTCA, that performs vision transformer compression and architecture exploration jointly through embedding space search using Bayesian optimization. In this framework, we formulate block-wise removal, shrinkage, and cross-block skip augmentation to prevent deep feature collapse, and Res-Post layer normalization to prevent gradient collapse under a knowledge distillation loss. In the search phase, we adopt training speed estimation for a large-scale dataset and propose a novel elastic reward function that can represent a generalized manifold of rewards. Experiments were conducted with DeiT-Tiny/Small/Base backbones on ImageNet, and our approach achieved accuracy competitive with recent patch reduction and pruning methods. The code is available at https://github.com/kdaeho27/VTCA.