Communication Efficient SGD via Gradient Sampling With Bayes Prior
Gradient compression has been widely adopted in data-parallel distributed training of deep neural networks to reduce communication overhead. Some literatures have demonstrated that large gradients are more important than small ones because they contain more information, such as Top-k compressor. Other mainstream methods, like random-k compressor and gradient quantization, usually treat all gradients equally. Different from all of them, we regard large and small gradients selection as the exploitation and exploration of gradient information, respectively. And we find taking both of them into consideration is the key to boost the final accuracy. So, we propose a novel gradient compressor: Gradient Sampling with Bayes Prior in this paper. Specifically, we sample important/large gradients based on the global gradient distribution, which is periodically updated across multiple workers. Then we introduce Bayes Prior into distribution model to further explore the gradients. We prove the convergence of our method for smooth non-convex problems in the distributed system. Compared with methods that running after high compression ratio at the expense of accuracy, we pursue no loss of accuracy and the actual acceleration benefit in practice. Experimental comparisons on a variety of computer vision tasks (e.g. image classification and object detection) and backbones (ResNet, MobileNetV2, InceptionV3 and AlexNet) show that our approach outperforms the state-of-the-art techniques in terms of both speed and accuracy, with the limitation of 100* compression ratio.