[bibtex]
@InProceedings{Kim_2024_ACCV,
    author    = {Kim, Soosung and Park, Yeonhong and Lee, Hyunseung and Yi, Sungchan and Lee, Jae W.},
    title     = {ReLUifying Smooth Functions: Low-Cost Knowledge Distillation to Obtain High-Performance ReLU Networks},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2162-2178}
}
ReLUifying Smooth Functions: Low-Cost Knowledge Distillation to Obtain High-Performance ReLU Networks
Abstract
Smooth activation functions such as Swish, GeLU, and Mish have gained popularity due to their potentially better generalization performance than ReLU. However, there is still high demand for ReLU networks because of their simplicity, which yields higher execution efficiency and broader device coverage. To meet this practical demand, research has explored producing ReLU networks from pretrained smooth-function networks within a limited training time, a process termed ReLUification. Specifically, knowledge distillation (KD) has been a key tool for this endeavor. While KD-based ReLUification is effective to a certain extent, the previous approach fails to fully leverage the potential of KD, resulting in suboptimal outcomes. Through in-depth empirical analysis, we uncover that employing a high learning rate synergizes effectively with KD, leading to a substantial improvement in KD-based ReLUification. Additionally, we introduce a novel approach that selectively excludes a portion of the network from ReLUification, significantly enhancing accuracy with negligible additional latency compared to all-ReLU networks. As a result, our method produces ReLU networks that substantially surpass the quality of independently trained ReLU networks while requiring an order of magnitude less training time.
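The following is a minimal sketch of the KD-based ReLUification recipe the abstract describes, assuming a PyTorch model whose smooth activations (GeLU, SiLU/Swish, Mish) are ordinary submodules that can be swapped in place. The helper names (reluify, kd_loss), the choice of which activation to exclude, and the learning-rate value are illustrative assumptions, not the authors' implementation.

```python
# Sketch: build a ReLU student from a smooth-activation teacher, optionally
# excluding some activations from conversion, then distill with a high LR.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def reluify(model: nn.Module, exclude=frozenset()) -> nn.Module:
    """Return a copy of `model` with smooth activations replaced by ReLU,
    skipping any submodule whose qualified name appears in `exclude`."""
    student = copy.deepcopy(model)
    for name, module in student.named_modules():
        for child_name, child in module.named_children():
            full_name = f"{name}.{child_name}" if name else child_name
            if isinstance(child, (nn.GELU, nn.SiLU, nn.Mish)) and full_name not in exclude:
                setattr(module, child_name, nn.ReLU(inplace=True))
    return student

def kd_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Standard KL-divergence distillation loss with temperature scaling."""
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    q = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2

# Toy teacher with smooth activations (stands in for a pretrained network).
teacher = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 256), nn.GELU(),
    nn.Linear(256, 256), nn.GELU(),   # this later activation is kept smooth below
    nn.Linear(256, 10),
).eval()

# Partial ReLUification: exclude the activation at index "4" from conversion.
student = reluify(teacher, exclude={"4"})

# A high learning rate is the key ingredient reported in the paper; the exact
# value and schedule here are placeholders.
optimizer = torch.optim.SGD(student.parameters(), lr=0.4, momentum=0.9)

images = torch.randn(8, 3, 32, 32)
with torch.no_grad():
    t_logits = teacher(images)
s_logits = student(images)
loss = kd_loss(s_logits, t_logits)
loss.backward()
optimizer.step()
```

In practice the distillation loss is typically combined with a cross-entropy term on the ground-truth labels and run for only a small fraction of the original training schedule, which is where the order-of-magnitude savings in training time come from.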
Related Material