Efficient Deep Learning Inference Based on Model Compression

Qing Zhang, Mengru Zhang, Mengdi Wang, Wanchen Sui, Chen Meng, Jun Yang, Weidan Kong, Xiaoyuan Cui, Wei Lin; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018, pp. 1695-1702


Deep neural networks (DNNs) have evolved remarkably over the last decade and achieved great success in many machine learning tasks. Along the evolution of deep learning (DL) methods, computational complexity and resource consumption of DL models continue to increase, this makes efficient deployment challenging, especially in devices with low memory resources or in applications with strict latency requirements. In this paper, we will introduce a DL inference optimization pipeline, which consists of a series of model compression methods, including Tensor Decomposition (TD), Graph Adaptive Pruning (GAP), Intrinsic Sparse Structures (ISS) in Long Short-Term Memory (LSTM), Knowledge Distillation (KD) and low-bit model quantization. We use different modeling scenarios to test our inference optimization pipeline with above mentioned methods, and it shows promising results to make inference more efficient with marginal loss of model accuracy.

Related Material

author = {Zhang, Qing and Zhang, Mengru and Wang, Mengdi and Sui, Wanchen and Meng, Chen and Yang, Jun and Kong, Weidan and Cui, Xiaoyuan and Lin, Wei},
title = {Efficient Deep Learning Inference Based on Model Compression},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}