On the Efficiency of Recurrent Neural Network Optimization Algorithms

Ben Krause, Liang Lu, Iain Murray, Steve Renals
University of Edinburgh, Department of Informatics
s1470305@sms.ed.ac.uk, llu@staffmail.ed.ac.uk, i.murray@ed.ac.uk, s.renals@ed.ac.uk

Abstract

This study compares the sequential and parallel efficiency of training Recurrent Neural Networks (RNNs) with Hessian-free optimization versus a gradient descent variant. Experiments are performed using the long short-term memory (LSTM) architecture and the recently proposed multiplicative LSTM (mLSTM) architecture. The results yield a number of insights into these architectures and optimization algorithms, including that Hessian-free optimization has the potential for large efficiency gains in a highly parallel setup.

1 Introduction

Recurrent neural networks (RNNs) have achieved remarkable success in a number of machine learning tasks, including speech recognition (Robinson et al., 1996; Graves et al., 2013; Sak et al., 2014), language modelling (Mikolov et al., 2010), image captioning (Vinyals et al., 2015; Xu et al., 2015), and machine translation (Sutskever et al., 2014; Cho et al., 2014). However, training RNNs remains a challenging optimization problem. The recursive nature of recurrent weight matrices can cause simple gradient-based algorithms to become unstable or to converge very slowly. Heuristics such as gradient clipping (Pascanu et al., 2013) and specially designed architectures such as the LSTM (Hochreiter and Schmidhuber, 1997) have made RNNs work on difficult problems. However, these methods generally require small minibatches for efficient training, which prevents them from fully exploiting highly parallel hardware such as GPUs. The Hessian-free optimization algorithm has been used to train RNNs with much larger batch sizes (Martens and Sutskever, 2011; Sutskever et al., 2011), allowing greater use of parallelization. However, it is non-trivial to compare the efficiency of Hessian-free optimization with commonly used first-order algorithms, because the efficiency ratio of methods with different batch sizes is highly dependent on the hardware being used. Additionally, second-order methods tend to make larger but more computationally expensive updates than first-order methods. We compare Hessian-free optimization to gradient descent with gradient norm clipping in both a stochastic and a large-batch setting, using two metrics of efficiency: the first proportional to runtime on fully parallelizable hardware, and the second proportional to runtime on fully sequential hardware. Since the performance of RNN optimization is highly architecture dependent, this study focuses on the widely used LSTM architecture and the recently proposed multiplicative LSTM architecture (Krause, 2015), which builds on the multiplicative RNN (Sutskever et al., 2011).

The standard RNN architecture operates over a sequence of input and output pairs of T timesteps, and predicts the output O(t) using the input I(t) as well as the previous hidden state H(t - 1), i.e.

H(t) = f(W_hi I(t) + W_hh H(t - 1))    (1)
O(t) = g(W_oh H(t))    (2)

where W_hi, W_hh, and W_oh are weight matrices, f(·) is the hidden-layer nonlinear activation function (e.g. tanh), and g(·) is the output-layer activation function (e.g. softmax).
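As an illustration of Eqs. (1)-(2), the forward pass of this standard RNN can be sketched in a few lines of NumPy. This is a minimal sketch, not the implementation used in the experiments: the variable names mirror the notation above, while the tanh and softmax activations, the zero initial state, and the function name rnn_forward are illustrative assumptions.

import numpy as np

def rnn_forward(inputs, W_hi, W_hh, W_oh):
    # Sketch of the standard RNN forward pass of Eqs. (1)-(2).
    # inputs: array of shape (T, input_dim); tanh and softmax are the
    # example activations mentioned in the text, not a fixed choice.
    H = np.zeros(W_hh.shape[0])                # H(0), assumed zero-initialised
    outputs = []
    for I_t in inputs:
        H = np.tanh(W_hi @ I_t + W_hh @ H)     # Eq. (1): hidden state update
        logits = W_oh @ H                      # Eq. (2), before the softmax
        e = np.exp(logits - logits.max())      # numerically stable softmax
        outputs.append(e / e.sum())
    return np.stack(outputs)                   # O(1), ..., O(T)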
Training an RNN with stochastic gradient descent is difficult due to the vanishing/exploding gradient problem (Bengio et al., 1994; Hochreiter et al., 2001): the gradient tends to decay or grow exponentially as it is back-propagated through time. This makes it difficult for RNNs to learn long-term dependencies, and has previously been addressed through the choice of architecture and of optimization algorithm. LSTMs are popular because their gated units control the transition of memory states, making them less susceptible to vanishing gradients. Multiplicative RNNs (mRNNs, Sutskever et al., 2011), on the other hand, rely on the strength of second-order algorithms to handle pathological curvature, and allow complex transitions between states by factorizing the recurrent weight matrix of a standard RNN as

W_hh = W_hm diag(W_mi I(t)) W_mh.    (3)

The multiplicative LSTM – a combination of the mRNN and the LSTM – was shown to be especially successful at character-level modelling (Krause, 2015). This paper presents further insights on optimizing these RNN architectures.

2 First-order and second-order optimization of RNNs

A learning algorithm for RNNs must be able to compensate for vanishing and exploding gradients in order to learn long-range dependencies and keep updates stable. A simple approach is first-order stochastic gradient descent with gradient norm clipping (Pascanu et al., 2013), which reduces the step size in the direction of the gradient when the magnitude of the gradient exceeds a threshold. While this method only explicitly addresses the exploding gradient problem, it can be combined with the LSTM architecture to address vanishing gradients.

Second-order optimization approaches take advantage of the full curvature matrix, which can be used to rescale the gradient according to the second derivatives. As the second derivatives are expected to decay and explode at a similar rate to the first derivatives (Martens, 2010), they provide a sensible rescaling that accounts for the training difficulties of RNNs. In Newton's method, the error function F(·) is approximated by its second-order Taylor series expansion around the current estimate of the model parameters θ, and the update of the model parameters ∆θ is obtained as

∆θ = −H(F(θ))^{-1} ∇F(θ).    (4)

However, storing and inverting the curvature matrix H(F(θ)) is computationally infeasible for large neural networks. Hessian-free optimization instead uses the conjugate gradient method, which exploits the potentially sparse and low-rank structure of the curvature matrix, to solve the linear system

H(F(θ)) ∆θ = −∇F(θ),    (5)

rather than computing Eq. (4) directly. Because the conjugate gradient method is only guaranteed to converge for a positive semi-definite matrix, the Hessian matrix H(F(θ)) is approximated by the Gauss–Newton matrix G, which only includes positive curvature. In this case, the matrix-vector products Gv required by the conjugate gradient method can be computed efficiently using the method derived by Schraudolph (2002). However, the matrix G is poorly conditioned, which can make conjugate gradients converge far outside the region where the local Taylor series approximation is accurate. Damping methods such as Tikhonov damping and structural damping can be used to improve the conditioning of G (Martens and Sutskever, 2011).
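The two update rules discussed in this section can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the learning rate, clipping threshold, damping constant, and stopping criteria are illustrative assumptions, and the Gauss–Newton product Gv is assumed to be supplied externally (e.g. by the Schraudolph (2002) procedure), which is not reproduced here.

import numpy as np

def clipped_gradient_step(theta, grad, lr=0.1, threshold=1.0):
    # First-order update with gradient norm clipping (Pascanu et al., 2013):
    # rescale the gradient whenever its norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return theta - lr * grad

def hessian_free_direction(grad, Gv, damping=1e-2, max_iters=50, tol=1e-8):
    # One Hessian-free update direction: approximately solve
    # (G + damping * I) d = -grad with conjugate gradients, using only
    # Tikhonov-damped matrix-vector products Gv(v), as in Eq. (5).
    d = np.zeros_like(grad)
    r = -grad.copy()                 # residual for the initial guess d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = Gv(p) + damping * p     # damped curvature-vector product
        alpha = rs_old / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d                         # candidate parameter update ∆θ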
Batch sizes: Batching can be used to train RNNs by splitting the training data into sequences of fixed length. For both first- and second-order algorithms, B_g sequences are used in parallel to approximate the gradient. In Hessian-free optimization, B_c training examples must also be used to compute Gv products during the conjugate gradient iterations. Since there are many conjugate gradient iterations per gradient computation, B_g is typically several times larger than B_c: it is relatively cheap to obtain a more accurate approximation to the gradient. While first-order methods must rely on many small updates to converge, Hessian-free optimization can use larger batch sizes, since an accurate local approximation of the curvature allows large updates that justify the cost of processing a large batch.

Batch size greatly affects the efficiency and effectiveness of RNN optimization. Small batch sizes give noisy estimates of the derivative information cheaply, which can be very efficient if there is limited variance in the gradient across training examples. Larger batches give better derivative estimates but require more computation. However, this extra computation can be relatively cheap, as batch updates are easily parallelized.

Optimization methods for neural networks are traditionally evaluated by measuring the convergence rate over the number of training epochs. However, this is not a sensible way to compare first- and second-order methods, or methods with different batch sizes, because the number of epochs is not a consistent measure of the computation time required. Furthermore, the relative computational efficiency of algorithms with different batch sizes depends heavily on the hardware being used. For these reasons, we derive two units of training computation that are proportional to computation time with fully sequential and with fully parallelized operations, for an RNN of fixed size.

Sequential computation cost measures: On a computer that can only perform sequential operations, computation time is proportional to the number of floating point operations performed. We reflect this extreme with the number of batch-dependent forward passes, P_D, where one unit is proportional to the computation in a forward pass through a single training sequence of fixed length for an RNN of fixed size. For a first-order gradient algorithm, one forward pass and one gradient computation must be performed for each of I_t training iterations. Since a gradient computation requires approximately twice the computation of a forward pass,

P_D = 3 I_t B_g,    for first-order methods.

For Hessian-free optimization, the gradient must also be computed at each training iteration, and additionally a Gv product must be performed for each of I_c conjugate gradient iterations. A Gv product requires approximately 4 times the computation of a forward pass, so

P_D = 3 I_t B_g + 4 I_c B_c,    for Hessian-free methods.

Parallel computation cost measures: At the opposite end of the spectrum is a computer with enough processors to fully parallelize batches, so that computation time is independent of batch size. To reflect this scenario, we measure the number of batch-independent forward passes, P_I, in which the batch size terms drop out:

P_I = 3 I_t,    for first-order methods,
P_I = 3 I_t + 4 I_c,    for Hessian-free methods.
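To make the cost measures concrete, the helper below simply restates the formulas for P_D and P_I in code; the function and argument names are ours and purely illustrative.

def computation_cost(num_iters, batch_grad, num_cg_iters=0, batch_curv=0):
    # P_D: batch-dependent forward passes (proxy for sequential hardware).
    # P_I: batch-independent forward passes (proxy for fully parallel hardware).
    # One iteration costs ~3 forward passes (forward pass + gradient at ~2x);
    # each conjugate gradient iteration adds a Gv product at ~4 forward passes.
    P_D = 3 * num_iters * batch_grad + 4 * num_cg_iters * batch_curv
    P_I = 3 * num_iters + 4 * num_cg_iters
    return P_D, P_I

# A first-order method uses num_cg_iters = 0 and batch_curv = 0, recovering
# P_D = 3 I_t B_g and P_I = 3 I_t; the Hessian-free case adds the CG terms.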
3 Experiments

Experiments were carried out to compare the efficiency and effectiveness of Hessian-free optimization against a variant of gradient descent with norm clipping, for small and large batch sizes, on the LSTM and mLSTM architectures, trained on sequences of length 200. We compared stochastic Hessian-free optimization (Kiros, 2013) (SHF) with B_g = 320 and B_c = 32, large-batch Hessian-free optimization (BHF) with B_g = 14000 and B_c = 3500, stochastic gradient descent (SGD) with B_g = 32, and large-batch gradient descent (BGD) with B_g = 3500. The training and validation errors were measured over training as a function of our computational cost measures P_I and P_D.

We benchmarked these architectures and optimization methods on the task of character-level sequence modelling. Character-level models are challenging to optimize and push the limits of fitting algorithms, making them a good option for benchmarking advanced optimization methods. These experiments were run on a subset of the Penn Treebank corpus (catalog.ldc.upenn.edu/LDC99T42), containing the first 2.8 million characters for training and approximately 200,000 characters for validation (the same data set as used in Krause (2015)). The corpus was used in its raw form, with no preprocessing, and with numbers and rare words included. The LSTM and mLSTM had approximately 220,000 and 215,000 trainable parameters respectively.

Results: Convergence plots for training error as a function of training computation are presented in Figure 1. Validation error was also measured, but is not included in the graphs. SHF and BGD for LSTM were too inefficient to fully converge in the time allotted. mLSTM with stochastic Hessian-free optimization achieved the best validation error of 1.82 bits/char. All other runs that converged, for both mLSTM and LSTM, had validation errors in the range 1.89-1.93 bits/char.

[Figure 1: Training errors (bits/char) for LSTM over batch-independent (upper left) and batch-dependent (upper right) computation, and mLSTM over batch-independent (lower left) and batch-dependent (lower right) computation. Each panel compares SGD, BGD, SHF, and BHF.]

4 Discussion

There are several important insights from these experiments. The first is that Hessian-free optimization was far more efficient than gradient descent at the batch level, regardless of the architecture being trained. This suggests that the curvature of the error surface was significant, and that enforcing conjugacy of the search directions with respect to the curvature matrix greatly increased the convergence rate.

For LSTM, there was very little difference between the batch-independent convergence rates of BGD and SGD, suggesting that the gradient of the LSTM can be well approximated with a few examples. This may partially explain why LSTM has traditionally been easy to optimize with first-order methods. In mLSTM, on the other hand, the batch-independent convergence rate was similar for BGD and SGD at the beginning of training, but BGD began to converge faster fairly early on. This suggests that the gradient of mLSTM has a higher variance across training examples than that of LSTM.

There was an enormous difference in the performance of SHF optimization between LSTM and mLSTM. In LSTM, SHF failed to fully converge, suggesting a greater variance in the curvature across training examples for LSTM.
In roughly the same number of training iterations, SHF applied to mLSTM achieved by far the best training and generalization results in these experiments. It is likely that the large stochastic updates observed in this run acted as a form of weight noise that improved regularization, and possibly helped avoid poor local minima.

An important comparison can be drawn between SGD, which is used in nearly all LSTM experiments, and BHF, which was more than an order of magnitude more efficient on this dataset under full parallelization. While full parallelization is not a realistic goal, large compute clusters and multi-GPU machines can often come close to it.

Overall, these experiments highlight several important differences in training LSTMs and mLSTMs with first- and second-order algorithms. The results suggest that Hessian-free methods have the potential to be much more efficient on highly parallelizable hardware.

Acknowledgements

This work was supported by a donation from MERL.

References

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In Kremer and Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9:1735–1780.

Kiros, R. (2013). Training neural networks with stochastic Hessian-free optimization. arXiv preprint arXiv:1301.3641.

Krause, B. (2015). Optimizing and contrasting recurrent neural network architectures. arXiv preprint arXiv:1510.04953.

Martens, J. (2010). Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735–742.

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033–1040.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the Annual Conference of the International Speech Communication Association, pages 1045–1048.

Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning.

Robinson, T., Hochberg, M., and Renals, S. (1996). The use of recurrent networks in continuous speech recognition. In Lee, C. H., Paliwal, K. K., and Soong, F. K., editors, Automatic Speech and Speaker Recognition – Advanced Topics, pages 233–258. Kluwer Academic Publishers.

Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association.
Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738.

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning.