Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning

Chen Tang, Yuan Meng, Jiacheng Jiang, Shuzhao Xie, Rongwei Lu, Xinzhu Ma, Zhi Wang, Wenwu Zhu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15855-15865

Abstract


Quantization is of significance for compressing over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from a performance drop due to its limited numerical representation ability. Conversely, mixed-precision quantization (MPQ) compresses the model more effectively by allocating heterogeneous bit-widths across layers. MPQ is typically organized as a two-stage search-retraining process. Previous works focus only on efficiently determining the optimal bit-width configuration in the first stage, while ignoring the considerable time cost of the second stage. However, retraining often consumes hundreds of GPU-hours even on cutting-edge GPUs, significantly hindering deployment efficiency. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically, in the first stage, all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However, our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization, leading to considerable performance degradation under high compression ratios. To tackle this problem, we first design a bit-width scheduler that dynamically freezes the most turbulent bit-width of each layer during training, ensuring that the remaining bit-widths converge properly. Then, taking inspiration from information theory, we present an information distortion mitigation technique that aligns the behaviour of poorly performing bit-widths with that of well-performing ones. In the second stage, an inference-only greedy search scheme is devised to evaluate the quality of bit-width configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method.
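
To make the weight-coupling idea of the first stage concrete, below is a minimal PyTorch sketch (not the authors' released code) of a linear layer whose single shared weight tensor is quantized on the fly to each candidate bit-width, so every bit-width trains the same underlying weights. The class name WeightCoupledQuantLinear, the symmetric uniform quantizer, and the straight-through estimator are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


class WeightCoupledQuantLinear(nn.Module):
    """Linear layer whose one shared weight tensor serves all candidate bit-widths."""

    def __init__(self, in_features, out_features, bit_candidates=(2, 4, 8)):
        super().__init__()
        # A single full-precision weight tensor shared across every bit-width.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bit_candidates = list(bit_candidates)

    @staticmethod
    def quantize(w, bits):
        # Symmetric uniform quantization; the straight-through estimator makes
        # the forward pass use rounded weights while gradients flow to w.
        qmax = 2 ** (bits - 1) - 1
        scale = w.detach().abs().max() / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        return w + (w_q - w).detach()

    def forward(self, x, bits):
        return nn.functional.linear(x, self.quantize(self.weight, bits))


layer = WeightCoupledQuantLinear(16, 8)
x = torch.randn(4, 16)
# Gradients from every candidate bit-width accumulate in the same shared
# tensor (a toy loss here), which is exactly the coupling that gives rise to
# the bit-width interference phenomenon studied in the paper.
loss = sum(layer(x, b).pow(2).mean() for b in layer.bit_candidates)
loss.backward()

Because low and high bit-widths can pull the shared weights in conflicting directions, the paper's bit-width scheduler and information distortion mitigation technique are introduced to suppress this interference during one-shot training.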

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Tang_2024_CVPR,
    author    = {Tang, Chen and Meng, Yuan and Jiang, Jiacheng and Xie, Shuzhao and Lu, Rongwei and Ma, Xinzhu and Wang, Zhi and Zhu, Wenwu},
    title     = {Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {15855-15865}
}