End-to-End Neural Network Compression via l1/l2 Regularized Latency Surrogates

Anshul Nasery, Hardik Shah, Arun Sai Suggala, Prateek Jain; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5866-5877

Abstract


Neural network (NN) compression via techniques such as pruning and quantization requires setting compression hyperparameters (e.g., the number of channels to prune and bitwidths for quantization) for each layer, either manually or via neural architecture search (NAS), which can be computationally expensive. We address this problem with an end-to-end technique that optimizes the model's floating point operations (FLOPs) via a novel l1/l2 latency surrogate. Our algorithm is versatile: it can be used with many popular compression methods, including pruning, low-rank factorization, and quantization, and can optimize for on-device latency. Crucially, it is fast and runs in almost the same amount of time as a single model training run, which is a significant training speed-up over standard NAS methods. For BERT compression on GLUE fine-tuning tasks, we achieve a 50% reduction in FLOPs with only a 1% drop in performance. For compressing MobileNetV3 on ImageNet-1K, we achieve a 15% reduction in FLOPs without a drop in accuracy, while requiring 3x less training compute than SOTA NAS techniques. Finally, for transfer learning on smaller datasets, our technique identifies architectures that are 1.2x-1.4x cheaper than the standard MobileNetV3/EfficientNet suite of architectures, at almost the same training cost and accuracy.
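The abstract does not spell out the surrogate's exact form; as a rough illustration only (a minimal sketch, not the authors' formulation), the snippet below attaches learnable per-channel gates to a convolutional layer and penalizes their l1/l2 norm ratio, a scale-invariant sparsity measure, alongside the task loss. All identifiers here (GatedConv, l1_over_l2, lam) are hypothetical.

    import torch
    import torch.nn as nn

    class GatedConv(nn.Module):
        """Conv layer with learnable per-output-channel gates; channels
        whose gates go to zero can be pruned after training."""
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            self.gates = nn.Parameter(torch.ones(out_ch))

        def forward(self, x):
            return self.conv(x) * self.gates.view(1, -1, 1, 1)

    def l1_over_l2(g, eps=1e-8):
        # Scale-invariant l1/l2 ratio: smallest when a few entries
        # dominate, so it encourages sparse gates without uniformly
        # shrinking all of them toward zero.
        return g.abs().sum() / (g.norm() + eps)

    # Hypothetical training step: optimize the task loss plus the
    # sparsity-regularized surrogate jointly with the model weights.
    layer = GatedConv(16, 64)
    x = torch.randn(2, 16, 32, 32)
    out = layer(x)
    task_loss = out.pow(2).mean()   # stand-in for the real task loss
    lam = 1e-2                      # regularization strength (tunable)
    loss = task_loss + lam * l1_over_l2(layer.gates)
    loss.backward()

After training, channels with near-zero gate magnitudes would be dropped, directly reducing the layer's FLOPs; the paper applies the same idea end-to-end across layers and compression operations.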

Related Material


BibTeX

@InProceedings{Nasery_2024_CVPR,
  author    = {Nasery, Anshul and Shah, Hardik and Suggala, Arun Sai and Jain, Prateek},
  title     = {End-to-End Neural Network Compression via l1/l2 Regularized Latency Surrogates},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {5866-5877}
}