Hardware-Aware Pruning for FPGA Deep Learning Accelerators
Pruning has been widely used to optimize and compress deep neural networks. In this paper we propose a pruning method for accelerating FPGA implementations of neural networks, with the goal of minimizing inference time while accounting for hardware-specific constraints. We use the normalized L2-norm as a measure of filter importance and iteratively prune the network with a predefined pruning step size. We extend this method to also prune around residual connections, using the maximum normalized L2-norm as the importance of a group of connected channels. We further introduce a hardware-aware pruning method for FPGA deep learning accelerators that adaptively prunes the network based on the size of the systolic array used to compute the convolutions. We validate our methods by pruning a polyp segmentation model on two different datasets, nearly halving the inference time with minimal loss of accuracy on both. We show that our two contributions together yield an extra 30% increase in processing speed compared to classical L2 pruning.
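To make the core idea concrete, the following is a minimal sketch of one pruning iteration as described above: filters are ranked by a normalized L2-norm, a fixed fraction is removed per step, and the number of surviving filters is rounded to a multiple of the systolic array dimension. The specific normalization (dividing by the square root of the per-filter weight count), the function names, and the rounding policy are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def filter_importance(weights):
    """Normalized L2-norm per output filter.

    weights: array of shape (out_channels, in_channels, kh, kw).
    Normalization by sqrt(weights per filter) is one plausible choice
    (assumption); it makes norms comparable across layers with
    different filter sizes.
    """
    flat = weights.reshape(weights.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1)
    return norms / np.sqrt(flat.shape[1])

def prune_step(weights, step=0.1, array_dim=16):
    """One iterative pruning step with a hardware-aware filter count.

    Removes the `step` fraction of least-important filters, then rounds
    the kept count down to a multiple of `array_dim` (the systolic array
    dimension) so the remaining channels map evenly onto the array.
    Rounding down is an illustrative policy choice.
    """
    imp = filter_importance(weights)
    n = weights.shape[0]
    keep = max(1, int(round(n * (1 - step))))
    if keep > array_dim:
        keep = (keep // array_dim) * array_dim
    # Keep the highest-importance filters, preserving their original order.
    idx = np.argsort(imp)[::-1][:keep]
    return weights[np.sort(idx)]

# Example: 64 filters, 10% step, 16x16 systolic array -> 48 filters kept.
w = np.random.default_rng(0).normal(size=(64, 3, 3, 3))
pruned = prune_step(w, step=0.1, array_dim=16)
```

For residual connections, the same ranking could be applied to a group of coupled channels by taking the maximum normalized L2-norm across the layers that share the connection, so all members of the group are kept or pruned together.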