[bibtex]
@InProceedings{Ahmed_2025_CVPR,
  author    = {Ahmed, Sabbir and Al Arafat, Abdullah and Najafi, Deniz and Mahmood, Akhlak and Rizve, Mamshad Nayeem and Al Nahian, Mohaiminul and Zhou, Ranyang and Angizi, Shaahin and Rakin, Adnan Siraj},
  title     = {DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {30147-30156}
}
DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
Abstract
Vision Transformers (ViTs) excel at tackling complex vision tasks, yet their substantial size poses significant challenges for applications on resource-constrained edge devices. The increased size of these models leads to higher overhead (e.g., energy, latency) when transmitting model weights between the edge device and the server. Hence, ViTs are not ideal for edge devices where the entire model may not fit on the device. Current model compression techniques often achieve high compression ratios at the expense of performance degradation, particularly for ViTs. To overcome the limitations of existing works, we rethink the model compression strategy for ViTs from a first-principles approach and develop an orthogonal strategy called DeepCompress-ViT. The objective of DeepCompress-ViT is to encode the model weights into a highly compressed representation using a novel training method, denoted as Unified Compression Training (UCT). The proposed UCT is accompanied by a decoding mechanism during inference, which helps recover any loss of accuracy due to the high compression ratio. We further optimize this decoding step by re-ordering the decoding operation using the associative property of matrix multiplication, ensuring that the compressed weights can be decoded during inference without incurring any computational overhead. Our extensive experiments across multiple ViT models on modern edge devices show that DeepCompress-ViT can successfully compress ViTs at high compression ratios (>14x). DeepCompress-ViT enables the entire model to be stored on the edge device, resulting in unprecedented reductions in energy consumption (>1470x) and latency (>68x) for edge ViT inference. Our code is available at https://github.com/ML-Security-Research-LAB/DeepCompress-ViT.
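The associativity reordering mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's actual UCT decoder; it assumes, purely for illustration, that a compressed weight is stored as two factors U and V whose product approximates the full weight matrix. The point is that (U V) x = U (V x), so the full weight never needs to be materialized at inference time:

```python
import numpy as np

# Hypothetical shapes for illustration: a d_out x d_in weight stored
# compressed as U (d_out x r) and V (r x d_in), with r << d_in.
d_out, d_in, r = 768, 768, 48
rng = np.random.default_rng(0)
U = rng.standard_normal((d_out, r))
V = rng.standard_normal((r, d_in))
x = rng.standard_normal((d_in,))

# Naive order: decode the full weight first, then multiply.
# Materializing U @ V costs O(d_out * r * d_in) and O(d_out * d_in) memory.
y_decode_first = (U @ V) @ x

# Reordered via associativity: multiply the input through each factor.
# Costs O(r * d_in) + O(d_out * r); the full weight is never formed.
y_reordered = U @ (V @ x)

# Both orderings give the same result (up to floating-point error).
assert np.allclose(y_decode_first, y_reordered)
```

With these assumed shapes, the reordered path performs roughly r/d_out as many multiply-accumulates per token as the naive path, which is the sense in which decoding adds no computational overhead.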