Phase-Wise Parameter Aggregation for Improving SGD Optimization
Stochastic gradient descent (SGD) is successfully applied to train deep convolutional neural networks (CNNs) on various computer vision tasks. Since fixed step-size SGD converges to a so-called error plateau, it is typically combined with a decaying learning rate to reach a favorable optimum. In this paper, we propose a simple yet effective optimization method that improves SGD with a phase-wise decay of the learning rate. By analyzing both the loss surface around the error plateau and the structure of the SGD optimization process, the proposed method is formulated to improve convergence as well as initialization at each training phase by efficiently aggregating the CNN parameters along the optimization sequence. The method retains the simplicity of SGD, touching the SGD procedure only a few times during training. Experimental results on image classification tasks thoroughly validate the effectiveness of the proposed method in comparison to other methods.
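The idea of aggregating parameters along the optimization sequence at each learning-rate phase can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' exact algorithm: within each phase (one fixed learning rate), a running average of the SGD iterates is maintained, and at the phase boundary the averaged parameters re-initialize the next phase before the learning rate is decayed. The function names, the toy quadratic loss, and the schedule are all assumptions for demonstration.

```python
# Hypothetical sketch of phase-wise parameter aggregation for SGD.
# Assumption: parameters are represented as a flat list of floats.

def sgd_step(w, grad, lr):
    # One plain SGD update on the parameter vector.
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def train_phasewise(w0, grad_fn, lrs, steps_per_phase):
    """Run SGD in phases with decaying learning rates `lrs` (one per phase).

    Within a phase, keep a running average of the iterates; at the end of
    the phase, use the average as the initialization of the next phase.
    """
    w = list(w0)
    for lr in lrs:
        avg = list(w)  # running average of iterates seen in this phase
        n = 1
        for _ in range(steps_per_phase):
            w = sgd_step(w, grad_fn(w), lr)
            n += 1
            # Incremental mean update: avg <- avg + (w - avg) / n
            avg = [a + (wi - a) / n for a, wi in zip(avg, w)]
        w = avg  # aggregated parameters initialize the next phase
    return w

# Toy example: quadratic loss f(w) = 0.5 * ||w||^2, so grad f(w) = w.
final = train_phasewise([4.0, -2.0], lambda w: w,
                        lrs=[0.5, 0.05, 0.005], steps_per_phase=20)
```

On this toy convex problem the averaged iterates move steadily toward the minimizer at the origin; the point of the sketch is only that aggregation is cheap (a running mean) and touches the training loop solely at phase boundaries.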