Training Larger Networks for Deep Reinforcement Learning

Kei Ota 1 2  Devesh K. Jha 3  Asako Kanezaki 2

1 Mitsubishi Electric, Kanagawa, Japan  2 Tokyo Institute of Technology, Tokyo, Japan  3 Mitsubishi Electric Research Labs, Cambridge, USA. Correspondence to: Kei Ota.

arXiv:2102.07920v1 [cs.LG] 16 Feb 2021

Abstract

The success of deep learning in the computer vision and natural language processing communities can be attributed to very deep neural networks with millions or billions of parameters that can be trained with massive amounts of data. However, a similar trend has largely eluded the training of deep reinforcement learning (RL) algorithms, where larger networks do not lead to performance improvements. Previous work has shown that this is mostly due to instability during training of deep RL agents when using larger networks. In this paper, we make an attempt to understand and address the training of larger networks for deep RL. We first show that naively increasing network capacity does not improve performance. Then, we propose a novel method that consists of 1) wider networks with DenseNet connections, 2) decoupling representation learning from the training of RL, and 3) a distributed training method to mitigate overfitting. Using this three-fold technique, we show that we can train very large networks that result in significant performance gains. We present several ablation studies to demonstrate the efficacy of the proposed method and to provide some intuitive understanding of the reasons for the performance gain. We show that our proposed method outperforms other baseline algorithms on several challenging locomotion tasks.

(a) Average return. (b) Loss surface. Figure 1. Training curves of SAC agents with different numbers of layers on the Ant-v2 environment, and the loss surface of the deepest (16-layer) Q-network. The training curves suggest that simply building a deeper MLP with a fixed number of units (256) does not improve the performance of DRL, whereas building a larger network is generally effective in supervised learning. Motivated by this, we conduct an extensive study on how to train larger networks that contribute to performance gains for RL agents.

1. Introduction

We have witnessed huge improvements in the fields of computer vision (CV) and natural language processing (NLP) in the last decade (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017; Devlin et al., 2019; Brown et al., 2020). These developments can largely be attributed to the training of very large neural networks with millions (or even billions or trillions) of parameters, which can be trained using massive amounts of data and appropriate optimization techniques to stabilize training. In general, the motivation for training larger networks comes from the intuition that larger networks allow better solutions, as they increase the search space of possible solutions. Having said that, neural network training largely relies on the ability to find good minimizers of highly non-convex loss functions. These loss functions are also governed by choices such as network architecture and batch size. This has driven a lot of research in these communities towards understanding the underlying reasons for performance gains (Lu et al., 2017; Zhang et al., 2017; Nguyen & Hein, 2017; Li et al., 2018). In striking contrast, the deep reinforcement learning (DRL) community has not reported a similar trend with regard to training larger networks for RL.
It has been reported in some studies that deep RL agents experience instability when trained with larger networks (Henderson et al., 2018; van Hasselt et al., 2018; Achiam et al., 2019; Sinha et al., 2020). As an example, in Fig. 1 we show the results of a Soft Actor-Critic (SAC) (Haarnoja et al., 2018) agent that uses a multi-layer perceptron (MLP) for function approximation, with an increasing number of layers while the unit size is fixed to 256 (also notice the loss surface). These plots show that naively using deeper networks leads to poor performance for a deep RL agent. Consequently, training deep RL agents with larger networks is not fully understood, which is limiting in several ways. As a result, most of the reported work in the literature ends up using similar hyperparameters (i.e., network structure, number and size of layers, etc.). Our work is motivated by this limitation, and we make an attempt to explore the interplay between the size, structure, training, and performance of deep RL agents to provide some intuition and guidelines for using larger networks.

In light of these facts, we present a large-scale study and provide empirical evidence for using larger networks for training DRL agents. We first highlight the challenges that one might come across while using larger networks for training deep RL agents. To circumvent these problems, we integrate a three-fold approach: decoupling feature representation from RL to efficiently produce high-dimensional features, employing a DenseNet architecture to propagate richer information, and using distributed training to collect more on-policy transitions and thereby reduce overfitting. Our method is a novel architecture that combines these three elements, and we demonstrate that it significantly improves the performance of RL agents in continuous control tasks. We also conduct an ablation study to show which components contribute to the performance gain. Our contributions can be summarized as follows:

• We conduct a large-scale study on employing larger networks for DRL agents, and empirically show that, contrary to deeper networks, wider networks can improve performance.

• We propose a novel network architecture that synergistically combines recently proposed techniques to stabilize training: decoupling representation learning from RL, a DenseNet architecture, and distributed training, and demonstrate that it significantly improves performance.

• We analyze the performance gain of our method using the effective rank of the features as well as visualizations of the loss landscapes of RL agents.

2. Related Work

Our work is broadly motivated by Henderson et al. (2018), which empirically demonstrates that DRL algorithms are vulnerable to different training choices such as architectures, hyperparameters, and activation functions. The paper compares performance across different numbers of units and layers, and shows that larger networks do not consistently improve performance. This is contrary to our intuition given the recent progress on computer vision tasks such as ImageNet (Deng et al., 2009), where larger and more complex network architectures have proven to achieve better performance (Krizhevsky et al., 2012; He et al., 2016; Huang et al., 2017; Tan & Le, 2019). Sutton & Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning.
When these three properties are combined, learning can be unstable, and potentially diverge with the value estimates becoming unbounded. Some prior works have challenged to mitigate this problem, including target networks (Mnih et al., 2015), double Q-learning (Van Hasselt et al., 2016), n-step learning (Hessel et al., 2018), etc. Our challenge of training larger networks is specifically related to function approximation, however, as the deadly triad is entangled in a complex manner, we also have to deal with the other problems. As for the network size, some studies investigate the effect of making network larger for continuous control task using MLP (Fu et al., 2019; Achiam et al., 2019) and Atari games using CNN (van Hasselt et al., 2018), and concluded the larger networks tend to perform better, but also become unstable and prone to diverge more. Andrychowicz et al. (2021) and Liu et al. (2021) performed similar study on on-policy methods and showed too small or large networks can cause significant drop in performance of the policy. While these studies are limited to relatively small size (hundreds of units with several layers), we will have more thorough study on much larger networks, combination of state representation learning, and employing different network architectures. To build a large network, unsupervised learning has been used to learn powerful representations for downstream tasks in natural language processing (Devlin et al., 2019; Radford et al., 2019) and computer vision (He et al., 2020; Chen et al., 2020). In the context of RL, auxiliary tasks such as predicting the next state conditioned on the past state(s) and action(s) have been widely studied to improve the sample efficiency of the RL algorithms (Jaderberg et al., 2017; Shelhamer et al., 2017; Ha & Schmidhuber, 2018). For the state-input setting, researchers have generally focused on learning a good representation that produces low dimensional features (Munk et al., 2016; Lesort et al., 2018). Contrary to that, Ota et al. (2020) proposes the use of online feature extractor network (OFENet) that intentionally increases input dimensionality, and demonstrates that larger feature size enables to improve RL performance on both sample efficiency and control performance. We leverage this idea and use larger input (or feature) for RL agents as well as using larger networks for the policy and the value function networks. Training Larger Networks for Deep RL Figure 2. Proposed architecture to train larger networks for deep RL agents. We combine three elements. Firstly, we decouple representation learning from RL to extract an informative feature zst from the current state st using a feature extractor network that is trained using an auxiliary task of predicting the next state st+1 . Secondly, we use large networks using DenseNet architecture, which allows stronger feature propagation. Finally, we employ the Ape-X-like distributed training framework to mitigate the overfitting problems which tends to happen in larger networks, and enables to collect more on-policy data that can improve performance. FC refers to a fully-connected layer. 3. Method While recent studies suggest that larger networks for DRL agents have potential to improve performance, it is nontrivial to alleviate some potential issues that lead to instability when using larger networks to train RL agents. 
Our method is based on two key ideas: allowing better feature propagation through a suitable network architecture, and using large amounts of more on-policy data, collected by distributed training, to avoid overfitting in larger networks. We first obtain good features separately from RL using an auxiliary task, and then propagate these features more efficiently by employing the DenseNet (Huang et al., 2017) architecture. In addition, we use a distributed RL framework that mitigates the potential overfitting problem. In the following, we describe in detail the three elements we use for training larger networks for deep RL agents. Our proposed approach is shown as a schematic in Fig. 2.

3.1. Decoupling Representation Learning from RL

While the simplicity of learning whole networks in an end-to-end fashion is appealing, updating all parameters of a large network using just a scalar reward signal can result in very inefficient training (Stooke et al., 2020). Decoupling unsupervised pretraining from downstream tasks is common in computer vision (He et al., 2020; Henaff, 2020). Taking inspiration from this, we adopt the online feature extractor network (OFENet) (Ota et al., 2020) to learn meaningful features separately from the training of RL. OFENet learns representation vectors of states z_{s_t} and state-action pairs z_{s_t,a_t}, and provides them to the agent instead of the original inputs s_t and a_t, which gives significant performance improvements on continuous robot control tasks. As the representation vectors z_{s_t} and z_{s_t,a_t} are designed to have much higher dimensionality than the original inputs, OFENet matches our philosophy of providing a larger solution space that allows us to find better policies. The representations are obtained by learning the mappings z_{s_t} = φ_s(s_t) and z_{s_t,a_t} = φ_{s,a}(s_t, a_t), with parameters θ_{φ_s} and θ_{φ_{s,a}}, using an auxiliary task of predicting the next state s_{t+1} from the current state-action representation z_{s_t,a_t}:

L_aux = E_{(s_t,a_t)∼p,π} [ || f_pred(z_{s_t,a_t}) − s_{t+1} ||^2 ],   (1)

where f_pred is a linear combination of the representation z_{s_t,a_t}. The auxiliary task is learned concurrently with the downstream RL task. In our experiments, we allow input dimensionality much larger than previously presented in (Ota et al., 2020). Furthermore, we also increase the network size of the RL agents (see A.2 for the number of network parameters used in our experiments). For more details, interested readers are referred to (Ota et al., 2020).

3.2. Distributed Training

In general, larger networks need more data to improve the accuracy of function approximation (Deng et al., 2009; Hernandez et al., 2021) and to mitigate overfitting (Bishop, 2006). MLPs with a large number of hidden layers are, in particular, known to overfit the training data, which often results in inferior performance compared to shallow networks (Bengio et al., 2007; Ramchoun et al., 2017). In the context of RL, even though we train and evaluate on the same environment, overfitting remains a problem: the agent is only trained on the limited trajectories it has experienced, which cannot cover the whole state-action space of the environment (Liu et al., 2021). Fu et al. (2019) showed that overfitting to the experience replay exists, and Fedus et al. (2020) empirically showed that having more on-policy data in the replay buffer, i.e., collecting more than one transition per policy update, can improve the performance of an RL agent.
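Before continuing with the distributed-training discussion, we illustrate the auxiliary objective of Eq. (1) from Sec. 3.1 with a minimal sketch. The code below is our own illustration (PyTorch, with hypothetical class and variable names, and a single feature layer per extractor instead of the 8-layer DenseNet used for OFENet, see Appendix A.4); it is not the authors' implementation.

import torch
import torch.nn as nn

class StateFeature(nn.Module):
    """phi_s: maps s_t to a higher-dimensional feature z_{s_t}."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.fc = nn.Linear(state_dim, hidden)
        self.act = nn.SiLU()  # Swish activation, as used for OFENet

    def forward(self, s):
        # DenseNet-style connectivity: concatenate the input with the new features
        return torch.cat([s, self.act(self.fc(s))], dim=-1)

class StateActionFeature(nn.Module):
    """phi_{s,a}: maps (z_{s_t}, a_t) to z_{s_t,a_t}."""
    def __init__(self, z_s_dim, action_dim, hidden=256):
        super().__init__()
        self.fc = nn.Linear(z_s_dim + action_dim, hidden)
        self.act = nn.SiLU()

    def forward(self, z_s, a):
        x = torch.cat([z_s, a], dim=-1)
        return torch.cat([x, self.act(self.fc(x))], dim=-1)

state_dim, action_dim = 111, 8  # e.g., Ant-v2
phi_s = StateFeature(state_dim)
phi_sa = StateActionFeature(state_dim + 256, action_dim)
f_pred = nn.Linear(state_dim + 256 + action_dim + 256, state_dim)  # linear predictor
opt = torch.optim.Adam(
    list(phi_s.parameters()) + list(phi_sa.parameters()) + list(f_pred.parameters()),
    lr=3e-4)

def aux_update(s, a, s_next):
    """One gradient step on L_aux = E[ ||f_pred(z_{s_t,a_t}) - s_{t+1}||^2 ]."""
    z_s = phi_s(s)
    z_sa = phi_sa(z_s, a)
    loss = ((f_pred(z_sa) - s_next) ** 2).sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return z_s.detach(), z_sa.detach(), loss.item()

In this sketch the RL agent would consume the detached features z_{s_t} and z_{s_t,a_t}, so the RL loss does not shape the representation; the auxiliary update runs concurrently with the RL updates, as described above.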
Training Larger Networks for Deep RL In light of these studies, we employ distributed RL framework, which leverages distributed training architectures that decouples learning from collecting transitions by utilizing many actors running in parallel on separate environment instances (Horgan et al., 2018; Kapturowski et al., 2019). In particular, we use Ape-X (Horgan et al., 2018) framework, where a single learner receives experiences from distributed prioritized replay (Schaul et al., 2016), and multiple actors collect transitions in parallel (see Fig. 2). This helps increase the number of data that are close to the current policy, i.e. more on-policy data, which can improve performance of off-policy RL agents (Fedus et al., 2020) and mitigate rank collapse issues of Q-networks (Aviral Kumar & Levine, 2021). One difference is that we do not use the RL algorithm used in (Horgan et al., 2018), but instead use standard off-policy RL algorithms: SAC (Haarnoja et al., 2018) and Twin Delayed Deep Deterministic policy gradient algorithm (TD3) (Fujimoto et al., 2018) in our experiments. 3.3. Network Architectures Tremendous developments have been made in the computer vision community in designing sophisticated architectures that enable training of very large networks (He et al., 2016; Huang et al., 2017; Tan & Le, 2019). Huang et al. (2017) proposed Dense Convolutional Network (DenseNet) that has a skip connection that directly connects each layer to all subsequent layers as: yi = fidense ([y0 , y1 , ..., yi−1 ]), where yi is the output of the ith layer, thus all the inputs are concatenated into a single tensor. Here, fidense is a composite function which consists of a sequence of convolutions, Batch Normalization (BN) (Ioffe & Szegedy, 2015), and an activation function. An advantage of DenseNet is its improved flow of information and gradients throughout the network, which makes the large networks easier to train. We borrow this architecture to train large networks for RL agents. Although using DenseNet architecture for DRL agents is existing, it has not been fully explored yet. D2RL (Sinha et al., 2020) employs a modified DenseNet architecture which concatenates the state or the state-action pair to each hidden layer of the MLP networks except the last linear layer. Contrary to this modified version, Ota et al. (2020) exactly follows the original DenseNet: it uses the dense connection that concatenates all the outputs of the previous layer instead of only the state or the state-action pair for training only OFENet, and is not used for training RL agents. We also follow the original DenseNet architecture as done in (Ota et al., 2020) to represent the policy and the value function networks. The schematic of the DenseNet architecture is also shown in Fig. 2. We omit BN for SAC agent because we found it inhibits improving performance. 4. Experiments In this section, we present results of numerical experiments in order to answer some relevant underlying questions posed in this paper. In particular, we answer the following questions. • Can RL agents benefit from usage of larger networks during training? More concretely, can using larger networks lead to better policies for DRL agents? • What characterizes a good architecture which facilitates better performance when using larger networks? • Can our method work across different RL algorithms as well as different tasks? 
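Before addressing these questions, the MLP-DenseNet connectivity introduced in Sec. 3.3, y_i = f_i([y_0, y_1, ..., y_{i−1}]), can be summarized with a short sketch. This is an illustrative PyTorch rendering with assumed names, not the authors' code.

import torch
import torch.nn as nn

class MLPDenseNet(nn.Module):
    """Each hidden layer sees the concatenation of the block input and all
    previous layer outputs, i.e., y_i = f_i([y_0, y_1, ..., y_{i-1}])."""
    def __init__(self, in_dim, out_dim, n_layers=2, n_units=2048):
        super().__init__()
        self.hidden = nn.ModuleList()
        dim = in_dim
        for _ in range(n_layers):
            # f_i: fully connected layer + activation. Batch Normalization could
            # be inserted before the activation; it is omitted for the SAC agent,
            # as noted in Sec. 3.3.
            self.hidden.append(nn.Linear(dim, n_units))
            dim += n_units  # the concatenated input grows by n_units per layer
        self.out = nn.Linear(dim, out_dim)  # final linear layer
        self.act = nn.SiLU()

    def forward(self, x):
        feats = x
        for layer in self.hidden:
            feats = torch.cat([feats, self.act(layer(feats))], dim=-1)
        return self.out(feats)

# Hypothetical usage: a Q-network head fed with an OFENet feature z_{s_t,a_t}.
q_net = MLPDenseNet(in_dim=2048, out_dim=1, n_layers=2, n_units=2048)

Because each layer's input grows by n_units, the parameter count grows accordingly (cf. Table 2 in Appendix A.2, where the second SAC layer receives 4,207 inputs).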
Experimental settings We run each experiment independently with five seeds and report the average and ±1 standard deviation, shown as solid lines and shaded regions in the training curves. The horizontal axis of a training curve is the number of gradient steps, which differs from the number of environment steps only when we use distributed replay. The network architectures, optimizers, and hyperparameters are the same as in the original papers (Haarnoja et al., 2018; Fujimoto et al., 2018; Ota et al., 2020) unless otherwise noted. We used a single NVIDIA Tesla V100 GPU with a Xeon Gold 6148 processor. Appendix A gives more details of the experimental settings.

Evaluation metrics We evaluate the experimental results on two metrics: the average return and the recently proposed effective rank (Aviral Kumar & Levine, 2021) of the feature matrices of the Q-networks. Aviral Kumar & Levine (2021) showed that, for MLPs used to approximate policy and value functions, bootstrapping leads to a reduction in the effective rank of the features, and this rank collapse of the feature matrix results in poorer performance. We report the effective rank of the features in the penultimate layer of the Q-networks to evaluate whether our proposed architecture can alleviate the rank collapse issue.

4.1. Does increasing the size of networks fail to improve performance?

In the first set of experiments, we investigate whether increasing network size always leads to poor performance. We quantitatively measure the effect of increasing the network size by changing the number of units N unit and layers N layer while the other parameters are fixed. Figure 1a shows the training curves when increasing the number of layers while the unit size is fixed to N unit = 256. As described in Sec. 1, we observe that performance becomes worse as the network becomes deeper. In Fig. 3a, we show the effect of increasing the number of units while the number of layers is fixed to N layer = 2. Contrary to the results when making the network deeper, we observe consistent improvement when making the network wider. To investigate more thoroughly, we also conduct a grid search, where we sample each parameter of the network from N unit ∈ {128, 256, 512, 1024, 2048} and N layer ∈ {1, 2, 4, 8, 16}, and evaluate the performance in Fig. 4. We see monotonic improvement in performance when widening networks at almost all depths. This result is in line with the general belief that training deeper networks is harder and more susceptible to the choice of hyperparameters (Bengio et al., 2007; Ramchoun et al., 2017). This could be attributed to the vanishing gradient problem with an increasing number of layers (Bengio et al., 1994). However, we found that the difficulty of training deeper networks compared with wider ones cannot be attributed to vanishing gradients; rather, it results from the sharpness of the loss surface (Li et al., 2018). We show the loss surface of the deeper network (N layer = 16, N unit = 256) in Fig. 1b and of the wider network (N layer = 2, N unit = 2048) in Fig. 3b, obtained using the visualization method proposed in (Li et al., 2018) with the TD-error loss of the Q-functions of SAC agents (see Appendix A.3 for more details).
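As a concrete illustration of the sweep behind Fig. 4, the width/depth grid can be instantiated as plain MLP Q-networks as follows. This is an illustrative sketch with assumed dimensions (111 and 8 are the Ant-v2 state and action sizes), not the authors' training code.

import itertools
import torch.nn as nn

def make_mlp_q(state_dim, action_dim, n_layers, n_units):
    """Plain MLP Q-network with n_layers hidden layers of n_units each."""
    dims = [state_dim + action_dim] + [n_units] * n_layers
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(dims[-1], 1))  # scalar Q-value
    return nn.Sequential(*layers)

grid = itertools.product([128, 256, 512, 1024, 2048], [1, 2, 4, 8, 16])
q_networks = {(u, l): make_mlp_q(111, 8, n_layers=l, n_units=u) for u, l in grid}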
The loss surfaces in Fig. 1b and Fig. 3b show that the wider networks have a nearly convex surface, while the deeper networks have a more complex loss surface that can be more susceptible to the choice of hyperparameters (Li et al., 2018). Comparisons of deeper and wider networks have also been made in several works (Wu et al., 2019; Nguyen & Hein, 2017; Li et al., 2018), where wider networks tend to generalize better due to their smoother loss functions. From these results, we conclude that larger networks can be effective in improving deep RL performance; in particular, we achieve consistent performance gains when widening individual layers rather than going deeper. Consequently, we fix the number of layers to N layer = 2 and only change the number of units to train larger networks in the following experiments.

(a) Average return. (b) Loss surface. Figure 3. Training curves of the SAC agent with different numbers of units on the Ant-v2 environment, and the loss surface of the widest (2048-unit) Q-network. The performance consistently improves when using wider MLPs.

Figure 4. Grid search results of the maximum average return at one million training steps over different numbers of units and layers for the SAC agent on the Ant-v2 environment. A deeper MLP (read horizontally) does not consistently improve performance, while a wider MLP (read vertically) generally does.

4.2. Architecture Comparison

In the next set of experiments, we investigate the role of the synergistic combination of connectivity architecture, state representation, and distributed training in allowing the use of larger networks for training deep RL agents. A brief introduction to these techniques is given in Sec. 3.

Connectivity architecture We first compare four connectivity architectures: standard MLP, MLP-ResNet, MLP-DenseNet, and MLP-D2RL, a recently proposed architecture to improve RL performance. MLP-ResNet is a modified version of Residual Networks (ResNet) (He et al., 2016; He et al., 2016), which has a skip connection that bypasses the non-linear transformation with an identity function: y_i = f_i^res(y_{i−1}) + y_{i−1}, where y_i is the output of the i-th layer and f_i^res is a residual module consisting of a fully connected layer and a nonlinear activation function. An advantage of this architecture is that the gradient can flow directly through the identity mapping from top layers to bottom layers. MLP-D2RL is identical to (Sinha et al., 2020), and MLP-DenseNet is our proposed architecture defined in Sec. 3.3. We compare these four architectures on both a small network (N unit = 128, denoted by S) and a large network (N unit = 2048, denoted by L). Figure 5 shows the training curves of the average return in Fig. 5a and the effective ranks in Fig. 5b. The results show that our MLP-DenseNet achieves the highest return on both the small and the large networks, while mitigating rank collapse comparably to MLP-D2RL.

(a) Average return. (b) Effective ranks. Figure 5. Comparison of connectivity architectures on Ant-v2. Our proposed DenseNet architecture produces the best return on both large (N unit = 2048, denoted by L) and small (N unit = 128, S) networks while mitigating rank collapse as well as MLP-D2RL.

(a) Average return. (b) Effective ranks. Figure 6. Training curves with and without OFENet on Ant-v2.
This shows that decoupling representation learning from RL is generally effective across different network sizes, in terms of both control performance and mitigating rank collapse.

Overall, MLP-DenseNet is the best architecture among these four choices, and thus we employ it for both the policy and the value function networks in the following experiments.

Decoupling representation learning from RL Next, we evaluate the effectiveness of using OFENet (see Sec. 3.1) to decouple representation learning from RL. In order to evaluate the performance for different network sizes, we sample the number of units from N units ∈ {256, 1024, 2048}, which we denote S, M, and L respectively, and compare these against baseline SAC agents, which do not use OFENet and are trained only from a scalar reward signal. In other words, the baseline agents are identical to the DenseNet architecture of the previous connectivity comparison experiment.

Figure 7. Grid search results of the average maximum return over different numbers of units for SAC and OFENet. OFENet improves performance in almost all settings, but saturates around a return of 8000.

The results in Fig. 6 show that separating representation learning from RL improves control performance and mitigates rank collapse of the Q-networks regardless of network size. Thus, we conclude that using larger representations, learned via the auxiliary task (see Sec. 3.1), contributes to improved performance on the downstream RL task. To investigate in more depth, we also conduct a grid search over different numbers of units for both SAC and OFENet in Fig. 7. The baseline is the SAC agent without OFENet (see the leftmost column). The results suggest that the performance does improve compared against the baseline agent (read horizontally); however, it saturates around an average return of 8000. In the following experiments, we employ distributed replay and expect to attain higher performance.

Figure 8. Grid search results of the average maximum return over different numbers of units for SAC and OFENet with Ape-X-like distributed training. Compared to Fig. 7, adding distributed RL enables monotonic improvement when we widen either SAC or OFENet.

(a) Hopper-v2 (b) Walker2d-v2 (c) HalfCheetah-v2 (d) Ant-v2 (e) Humanoid-v2. Figure 9. Training curves on five different MuJoCo tasks with two different RL algorithms (SAC and TD3).

Table 1. The highest average returns for each environment. The bold number indicates the best performance. Our method outperforms OFENet (Ota et al., 2020) and the original algorithms in most environments.

Environment       SAC (Ours)  SAC (OFENet)  SAC (Original)  TD3 (Ours)  TD3 (OFENet)  TD3 (Original)
Hopper-v2         3467.3      3511.6        3316.6          3206.7      3488.3        3613.0
Walker2d-v2       8802.4      5237.0        3401.5          7645.8      4915.1        4515.6
HalfCheetah-v2    19209.9     16964.1       14116.1         18147.5     16259.5       13319.9
Ant-v2            14021.0     8086.2        5953.1          12811.3     8472.4        6148.6
Humanoid-v2       14858.2     9560.5        6092.6          13282.0     120.6         340.5

Distributed RL Finally, we add distributed replay (Horgan et al., 2018) to further improve performance while using larger networks. We use an implementation similar to (Stooke & Abbeel, 2018), which collects experiences using N core cores, where each core runs N env environments; specifically, we used N core = 2 and N env = 32. Similar to the previous experiments, we conduct a grid search over different numbers of units for SAC and OFENet with the distributed replay in Fig.
8, and also compare the training curves of the three network sizes S, M, and L in Appendix B. Comparing Fig. 8 with Fig. 7, we can clearly see that distributed training enables further performance gains for all network sizes. Furthermore, we observe monotonic improvement when we increase the number of units for both SAC and OFENet. Thus, we verified that adding distributed replay contributes further performance gains while training larger networks.

How about generalization to different RL algorithms and environments? To quantitatively measure the effectiveness of our method across different RL algorithms and tasks, we evaluate two popular algorithms, namely SAC and TD3 (Fujimoto et al., 2018), on five different locomotion tasks in MuJoCo (Todorov et al., 2012). We denote our method as Ours, which uses the largest network of N units = 2048 from the previous experiments for both OFENet and the RL algorithms. We compare the proposed method against two baselines: the original RL algorithm, denoted Original, and OFENet, which achieves the current state-of-the-art performance on these tasks to the best of our knowledge. We plot the training curves in Fig. 9 and list the highest average returns in Table 1. In the figure and the table, our method, SAC (Ours) and TD3 (Ours), achieves the best performance in almost all environments. Furthermore, we can see that our proposed method works with both RL algorithms, and thus is agnostic to the choice of the training algorithm. In particular, our method achieves notably higher episode returns in Ant-v2 and Humanoid-v2, which are harder environments with larger state/action spaces and more training examples. Interestingly, the proposed method does not achieve a noticeable gain in Hopper-v2, which has the smallest dimensionality among the five environments. We consider that performance on this smaller-dimensional problem saturates early, so that even additional methods are unable to provide any significant performance gain.

4.3. Ablation study

Since our method integrates several different ideas into a single agent, we conduct additional experiments to understand which components contribute to the performance gain. We highlight that our method consists of three elements: feature representation learning using OFENet, the DenseNet architecture, and distributed training. In addition, we compare the results without increasing the network size to reinforce that the larger network does improve performance. Figure 10 shows the ablation study for SAC on the Ant-v2 environment. Full is our method, which combines all three elements and uses large networks (N unit = 2048, N layer = 2) for the SAC agent. sac is the original SAC implementation. w/o Ape-X removes the Ape-X-like distributed training setting. As distributed RL enables the collection of more experiences close to the current policy, we consider that the significant performance gain can be explained by learning from more on-policy data, which was also empirically shown by (Fedus et al., 2020). We also believe that receiving more novel experiences helps the agent generalize across the state-action space. In other words, more novel experience reduces overfitting to limited trajectories, which becomes more important in harder environments with larger state/action spaces and with larger neural networks. w/o OFENet removes OFENet and trains the whole architecture by using only a scalar reward signal.
The much lower return shows that learning the large networks from just the scalar reinforcement signal is difficult, and training the bottom networks (close to the input layer), i.e., obtaining informative features by using an auxiliary task enables better learning of control policy. w/o Larger NN reduces the number of units from N unit = 2048 to 256 for both OFENet and SAC. This also significantly drops the performance, and thus we can conclude that using larger networks is essential to achieve high performance. Finally, w/o DenseNet replaces MLP-DenseNet defined in Sec. 3.3 with standard MLP architecture. The result shows that strengthening feature propagation does contribute to improve performance. Figure 10. Training curves of the derived methods of SAC on Antv2. This shows that each element does contribute to performance gain, and our combination of DenseNet architecture, distributed training, and decoupled feature representation (shown as Full ) allows us to train larger networks that performs significantly better compared against the baseline SAC algorithm (shown as sac ). 5. Conclusion Deep Learning has catalyzed huge breakthroughs in the fields of computer vision and natural language processing making use of massive neural networks that can be trained with huge amounts of data. While these domains have hugely benefitted from the use of larger networks, the RL community has not witnessed similar trend in use of larger networks for training high performance agents. This is mostly due to instability that occurs when using larger networks for training RL agents. In this paper, we studied the problem of using larger network for training RL agents. To achieve this, we proposed a novel method for training larger networks for deep RL agents while reflecting on some of the important design choices one has to make when using such networks. In particular, the proposed method consists of three elements. First, we decouple representation learning from RL using an auxiliary loss of predicting the next state. This allows to obtain more informative features to be used to learn control policies with richer information compared to learning entire networks from a scalar reward signal. The learned representation is then propagated to the DenseNet architecture that consists of very wide networks. Finally, a distributed training framework provides huge amounts of on-policy data whose distribution is much closer to the current policy, and thus enables to mitigate overfitting problem and enhance generalization to novel scenarios. Our experiments demonstrate that this novel combination achieves significantly higher performance compared against the current state-of-the-art algorithms across different off-policy RL algorithms and different continuous control tasks. In the future, we would like to study the application to highdimensional inputs (e.g., images). We also would like to investigate how we can make use of the proposed method for other off-policy methods to make our method agnostic to the choice of underlying algorithm. Training Larger Networks for Deep RL References Achiam, J., Knight, E., and Abbeel, P. Towards characterizing divergence in deep q-learning. 2019. Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., and Bachem, O. What matters in on-policy reinforcement learning? a largescale empirical study. In Proceedings of International Conference on Learning Representations (ICLR), 2021. 
Fedus, W., Ramachandran, P., Agarwal, R., Bengio, Y., Larochelle, H., Rowland, M., and Dabney, W. Revisiting fundamentals of experience replay. In Proceedings of the 37th International Conference on Machine Learning, pp. 3061–3071, 2020. Fu, J., Kumar, A., Soh, M., and Levine, S. Diagnosing bottlenecks in deep q-learning algorithms. In Proceedings of International Conference on Machine Learning (ICML), 2019. Aviral Kumar, Rishabh Agarwal, D. G. and Levine, S. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2021. Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of International Conference on Machine Learning (ICML), 2018. Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Proceedings of Advances in Neural Information Processing Systems (NIPS). 2018. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. Greedy layer-wise training of deep networks. Proceedings of Advances in Neural Information Processing Systems (NIPS), 19:153, 2007. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. CoRR, 2018. Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Proceedings of Advances in Neural Information Processing Systems (NIPS), 2020. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016. 90. He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In Proceedings of European Conference on Computer Vision (ECCV), 2016. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of International Conference on Machine Learning (ICML), pp. 1597–1607, 2020. Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of International Conference on Machine Learning (ICML), pp. 4182–4192, 2020. Deng, J., Dong, W., Socher, R., Li, L., Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In AAAI, 2018. 
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer. 2021. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018. Training Larger Networks for Deep RL Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017. doi: 10.1109/ CVPR.2017.243. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning (ICML), 2015. Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of International Conference on Learning Representations (ICLR), 2017. Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J., and Munos, R. Recurrent experience replay in distributed reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2019. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Proceedings of Advances in Neural Information Processing Systems (NIPS), volume 25, pp. 1097–1105. Curran Associates, Inc., 2012. Munk, J., Kober, J., and Babuška, R. Learning state representation for deep actor-critic control. In IEEE Conference on Decision and Control (CDC), pp. 4667–4673, 2016. Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In Proceedings of International Conference on Machine Learning (ICML), pp. 2603–2612, 2017. Ota, K., Oiki, T., Jha, D., Mariyama, T., and Nikovski, D. Can increasing input dimensionality improve deep reinforcement learning? In Proceedings of International Conference on Machine Learning (ICML), pp. 7424–7433, 2020. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019. Ramachandran, P., Zoph, B., and Le, Q. V. Searching for Activation Functions. CoRR, 2017. URL http: //arxiv.org/abs/1710.05941. Ramchoun, H., Idrissi, M. A. J., Ghanou, Y., and Ettaouil, M. New modeling of multilayer perceptron architecture optimization with regularization: an application to pattern classification. IAENG International Journal of Computer Science, 44(3):261–269, 2017. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In Proceedings of International Conference on Learning Representations (ICLR), 2016. Lesort, T., Dı́az-Rodrı́guez, N., Goudou, J.-F., and Filliat, D. State Representation Learning for Control: An Overview. CoRR, 2018. Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. 
Loss is its own reward: Self-supervision for reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2017. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 6391–6401, 2018. Sinha, S., Bharadhwaj, H., Srinivas, A., and Garg, A. D2rl: Deep dense architectures in reinforcement learning, 2020. Liu, Z., Li, X., Kang, B., and Darrell, T. Regularization matters in policy optimization. In Proceedings of International Conference on Learning Representations (ICLR), 2021. Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: a view from the width. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 6232–6240, 2017. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533, 2015. Stooke, A. and Abbeel, P. Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811, 2018. Stooke, A., Lee, K., Abbeel, P., and Laskin, M. Decoupling representation learning from reinforcement learning, 2020. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of International Conference on Machine Learning (ICML), pp. 6105–6114. PMLR, 2019. Training Larger Networks for Deep RL Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016. van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. CoRR, 2018. Wu, Z., Shen, C., and Van Den Hengel, A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119–133, 2019. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In Proceedings of International Conference on Learning Representations (ICLR), 2017. Training Larger Networks for Deep RL A. Experimenal Details This section describes more details of our experiments. A.1. Implementation OFENet To implement OFENet, we referred the official codebase provided by Ota et al. (2020), which is available at their website1 . We also employed target networks (Mnih et al., 2015) to stabilize the training of OFENet, since the distribution of experiences stored in the shared replay buffer can change more dynamically by utilizing the Ape-X-like distributed training setting as described in Sec. 3.2. The target networks are updated on each training step by having them slowly track the learned networks: θ′ ← τ θ + (1 − τ )θ′ , where we assume θ to be the network parameters of the current OFENet, and θ′ is the parameters of the target network. We use the target smoothing coefficient τ = 0.005, which is the same with the one used to update target value networks in SAC (Haarnoja et al., 2018), in other words, we do not tune this parameter. 
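A minimal sketch of this soft target update (illustrative names; any framework with parameter iterators would look similar) is:

import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise."""
    for p_target, p in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(p, alpha=tau)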
RL agents Our implementation of the RL agents are also based on the public codebase used in (Ota et al., 2020). As for Batch Normalization (Ioffe & Szegedy, 2015), which we use for OFENet and TD3 (Ours) in Fig. 9, we use its training mode in updating the network, and test mode in collecting experiences as done in (Liu et al., 2021). We also used Huber loss to stabilize the training of RL agents for the same reason that we employ the target network for training OFENet described in the previous paragraph. Distributed training The distributed training setting we used is similar to (Stooke & Abbeel, 2018), which collects experiences using N core cores on which each core contains N env environments. Specifically, we used N core = 2 and N env = 32. Figure 11 shows the schematic of the distributed training. Since the actions are computed by the latest parameters, the collected experiences result in more on-policy data. Figure 11. Schematic of asynchronous training. We use N core = 2 cores for collecting experiences, where each core has N env = 32 environments. Since the network parameters are shared and the training and collecting transitions are decoupled, the collected experiences result in more on-policy data compared against the standard off-policy training, where the agent collects one transition while it applies one gradient step. 1 Codes used for implementing MLP-ResNet and MLP-DenseNet can be found at https://www.merl.com/research/ license/OFENet Training Larger Networks for Deep RL A.2. Network Architectures We highlight that our proposed architecture consists of three elements: 1) Decoupling representation learning from RL using OFENet, 2) DenseNet architecture with large NNs to effectively propagate the features obtained using OFENet, and 3) distributed training to obtain more on-policy experiences that can mitigate overfitting problems and improve performance (see Fig. 2 as well). As described in Sec. 3.3, the DenseNet architecture consists of a composite function of fully-connected layer, Batch Normalization, and an activation function. We choose the activation function to be Swish (Ramachandran et al., 2017) for MLP-ResNet and MLP-DenseNet, because it showed the smallest value of the auxiliary loss, i.e. attains the best accuracy of function approximation for predicting the next state on all environments as shown in (Ota et al., 2020). (We compared the performance of RL with different activation functions in Fig. 13) For OFENet, we designed the architecture to increase the feature dimensionality to 2048 from original inputs by using 8-layers DenseNet architecture as proposed in (Ota et al., 2020). For SAC agent, we designed the architecture to concatenate 2048 features at each layer. Table 2 shows the number of parameters for SAC (Ours) and SAC (Original) as reference on Ant-v2 environment as an example, which has 111 and 8 dimensionality for state and action space. As you can see from the table, we increase the parameters of the network to be 100 times more than the original SAC implementation. Table 2. The parameter size of SAC (Ours) and SAC (Original) used in experiments on Ant-v2 environment in Sec. 4.2 and Sec. 4.3. 
SAC (Ours)
                 Input units   Output units   Parameters
OFENet: z_s
  1st layer      111           256            29,696
  2nd layer      367           256            95,232
  3rd layer      623           256            160,768
  4th layer      879           256            226,304
  5th layer      1,135         256            291,840
  6th layer      1,391         256            357,376
  7th layer      1,647         256            422,912
  8th layer      1,903         256            488,448
  Total                                       2,072,576
SAC
  1st layer      2,159         2,048          4,423,680
  2nd layer      4,207         2,048          8,617,984
  Output layer   6,255         8              50,048
  Total                                       13,091,712
Total parameters                              10,709,000

SAC (Original)
                 Input units   Output units   Parameters
  1st layer      111           256            28,672
  2nd layer      256           256            65,792
  Output layer   256           8              2,056
  Total                                       96,520
Total parameters                              96,520

A.3. Visualizing loss surface of Q-function networks

This section provides the details of how we estimate the loss surfaces shown in Fig. 1b and Fig. 3b. Li et al. (2018) proposed a method to visualize loss function curvature by introducing a filter normalization method. The authors empirically demonstrated that the non-convexity of the loss function can be problematic, and that the sharpness of the loss surface correlates well with test error and generalization error. In light of this, we also visualize the loss surfaces of our networks to understand why the deeper networks do not lead to better performance while the wider networks result in high-performance policies (Fig. 4). To visualize the loss surface of our Q-networks, we use the authors' implementation2 with the loss

J_Q(θ) = E_{(s_t, a_t)∼D} [ (1/2) ( Q_θ(s_t, a_t) − Q̂(s_t, a_t) )^2 ],   (2)

with

Q̂(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}∼p} [ V_ψ̄(s_{t+1}) ],   (3)

in which we exactly follow the notation used in the SAC paper (Haarnoja et al., 2018). To compute this objective J_Q(θ), we collect all transitions used in the training of the deeper and wider networks, compute the target values Q̂(s_t, a_t) after training has finished, and store the tuples (s_t, a_t, Q̂(s_t, a_t)) for all transitions seen during training. Then, we use the authors' implementation to visualize the loss with the stored transitions and the trained weights of the Q-network. Please refer to (Li et al., 2018) for more details.

A.4. Hyperparameters

OFENet All OFENet networks used in our experiments consist of 8-layer DenseNet architectures with Swish activation (Ramachandran et al., 2017), as used in (Ota et al., 2020). The output dimensionality of OFENet is defined in each experiment. Table 2 in the previous section shows the detailed output units of each layer as an example.

RL algorithms The hyperparameters of the RL algorithms are also the same as in their original papers, except that TD3 uses a batch size of 256 instead of 100, as done in (Ota et al., 2020). Also, for a fair comparison with (Ota et al., 2020), we used a random policy to store transitions to the replay buffer before training the RL agents, for 10K time steps for SAC and 100K steps for TD3.

Effective rank We employ the effective rank, recently proposed in (Aviral Kumar & Levine, 2021), as a metric to evaluate the effectiveness of our architecture, as described in Sec. 3. Following the notation of Aviral Kumar & Levine (2021), the effective rank is computed as srank_δ(Φ) = min{ k : ( Σ_{i=1}^{k} σ_i(Φ) ) / ( Σ_{i=1}^{d} σ_i(Φ) ) ≥ 1 − δ }, where {σ_i(Φ)} are the singular values of the feature matrix Φ, taken from the penultimate layer of the Q-networks. We used δ = 0.01 to compute the effective ranks in the experiments, as in (Aviral Kumar & Levine, 2021).

2 Code used for these plots can be found at: https://github.com/tomgoldstein/loss-landscape

B. Additional Results

B.1.
Training Curves

Distributed RL In order to evaluate the effect of distributed RL, we compare the performance of our method with and without Ape-X-like distributed training over three different network sizes, N units ∈ {256, 1024, 2048}, which we denote S, M, and L respectively. We denote the baseline by w/o Ape-X, which is the same as the w/ OFENet setting in Fig. 6. Note again that the horizontal axis indicates the number of gradient steps, not environment steps, as we use distributed actors that interact with environments in parallel (see Fig. 2). Figure 12 shows that using Ape-X improves performance for all network sizes, and that the larger networks tend to improve performance further.

Activation As described in Appendix A.2, we used the Swish activation for the RL agents (policy and value function networks), as it was shown to improve the accuracy of function approximation in (Ota et al., 2020). To evaluate the effect of different activation functions, we plot the results of SAC (Ours) with different activation functions (ReLU and Swish) for the policy and value function networks in Fig. 13. The results show a slight performance gain from replacing ReLU with Swish. More comprehensive empirical studies can be found in (Andrychowicz et al., 2021; Henderson et al., 2018).

Figure 12. Comparison of training with and without the Ape-X architecture.

(a) HalfCheetah-v2 (b) Ant-v2 (c) Humanoid-v2. Figure 13. Comparison of the Swish and ReLU activation functions for the DenseNet architecture.

B.2. Loss surface

In addition to the loss surfaces of the deeper and wider networks trained on the Ant-v2 environment shown in Fig. 1b and Fig. 3b, we also visualize the loss landscapes of Q-networks trained on the HalfCheetah-v2 environment in Fig. 14. Together with the results in Fig. 1b and Fig. 3b, we can see that the wider networks tend to converge to flatter minima, while the deeper networks have sharper minima.

(a) Wider network (N layer = 2, N unit = 2048) (b) Deeper network (N layer = 16, N unit = 256). Figure 14. Loss landscapes of models trained on HalfCheetah-v2 for one million steps, visualized using the technique in Li et al. (2018) and the settings described in Appendix A.3.
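For completeness, the effective-rank metric reported in Figs. 5b and 6b (defined in Appendix A.4) can be computed from the singular values of the penultimate-layer feature matrix. The NumPy sketch below is our own illustration, with a randomly generated matrix standing in for real Q-network features.

import numpy as np

def effective_rank(phi, delta=0.01):
    """srank_delta(Phi): smallest k whose top-k singular values account for
    at least a (1 - delta) fraction of the total singular-value mass."""
    sv = np.linalg.svd(phi, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(sv) / np.sum(sv)
    return int(np.argmax(cumulative >= 1.0 - delta)) + 1

# Example: a stand-in feature matrix (batch of 2048 samples, 256-dim features).
phi = np.random.randn(2048, 256)
print(effective_rank(phi))  # close to 256 for random, full-rank features

For such a random, full-rank matrix the value approaches the feature dimension; rank collapse manifests as a much smaller value.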