arXiv:1707.06347v2 [cs.LG] 28 Aug 2017

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OpenAI
{joschu, filip, prafulla, alec, oleg}@openai.com

Abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

1 Introduction

In recent years, several different approaches have been proposed for reinforcement learning with neural network function approximators. The leading contenders are deep Q-learning [Mni+15], “vanilla” policy gradient methods [Mni+16], and trust region / natural policy gradient methods [Sch+15b]. However, there is room for improvement in developing a method that is scalable (to large models and parallel implementations), data efficient, and robust (i.e., successful on a variety of problems without hyperparameter tuning). Q-learning (with function approximation) fails on many simple problems¹ and is poorly understood; vanilla policy gradient methods have poor data efficiency and robustness; and trust region policy optimization (TRPO) is relatively complicated, and is not compatible with architectures that include noise (such as dropout) or parameter sharing (between the policy and value function, or with auxiliary tasks).

This paper seeks to improve the current state of affairs by introducing an algorithm that attains the data efficiency and reliable performance of TRPO, while using only first-order optimization. We propose a novel objective with clipped probability ratios, which forms a pessimistic estimate (i.e., lower bound) of the performance of the policy. To optimize policies, we alternate between sampling data from the policy and performing several epochs of optimization on the sampled data.

Our experiments compare the performance of various different versions of the surrogate objective, and find that the version with the clipped probability ratios performs best. We also compare PPO to several previous algorithms from the literature. On continuous control tasks, it performs better than the algorithms we compare against. On Atari, it performs significantly better (in terms of sample complexity) than A2C and similarly to ACER, though it is much simpler.

¹ While DQN works well on game environments like the Arcade Learning Environment [Bel+15] with discrete action spaces, it has not been demonstrated to perform well on continuous control benchmarks such as those in OpenAI Gym [Bro+16] and described by Duan et al. [Dua+16].

2 Background: Policy Optimization

2.1 Policy Gradient Methods

Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm.
The most commonly used gradient estimator has the form

    ĝ = Êt[ ∇θ log πθ(at | st) Ât ]    (1)

where πθ is a stochastic policy and Ât is an estimator of the advantage function at timestep t. Here, the expectation Êt[. . .] indicates the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization. Implementations that use automatic differentiation software work by constructing an objective function whose gradient is the policy gradient estimator; the estimator ĝ is obtained by differentiating the objective

    LPG(θ) = Êt[ log πθ(at | st) Ât ].    (2)

While it is appealing to perform multiple steps of optimization on this loss LPG using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates (see Section 6.1; results are not shown but were similar or worse than the “no clipping or penalty” setting).

2.2 Trust Region Methods

In TRPO [Sch+15b], an objective function (the “surrogate” objective) is maximized subject to a constraint on the size of the policy update. Specifically,

    maximize_θ   Êt[ (πθ(at | st) / πθold(at | st)) Ât ]    (3)
    subject to   Êt[ KL[πθold(· | st), πθ(· | st)] ] ≤ δ.    (4)

Here, θold is the vector of policy parameters before the update. This problem can efficiently be approximately solved using the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.

The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem

    maximize_θ   Êt[ (πθ(at | st) / πθold(at | st)) Ât − β KL[πθold(· | st), πθ(· | st)] ]    (5)

for some coefficient β. This follows from the fact that a certain surrogate objective (which computes the max KL over states instead of the mean) forms a lower bound (i.e., a pessimistic bound) on the performance of the policy π. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning. Hence, to achieve our goal of a first-order algorithm that emulates the monotonic improvement of TRPO, experiments show that it is not sufficient to simply choose a fixed penalty coefficient β and optimize the penalized objective Equation (5) with SGD; additional modifications are required.

3 Clipped Surrogate Objective

Let rt(θ) denote the probability ratio rt(θ) = πθ(at | st) / πθold(at | st), so rt(θold) = 1. TRPO maximizes a “surrogate” objective

    LCPI(θ) = Êt[ (πθ(at | st) / πθold(at | st)) Ât ] = Êt[ rt(θ) Ât ].    (6)

The superscript CPI refers to conservative policy iteration [KL02], where this objective was proposed. Without a constraint, maximization of LCPI would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move rt(θ) away from 1.

The main objective we propose is the following:

    LCLIP(θ) = Êt[ min(rt(θ) Ât, clip(rt(θ), 1 − ε, 1 + ε) Ât) ]    (7)

where ε is a hyperparameter, say, ε = 0.2. The motivation for this objective is as follows. The first term inside the min is LCPI. The second term, clip(rt(θ), 1 − ε, 1 + ε) Ât, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving rt outside of the interval [1 − ε, 1 + ε]. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.
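To make Equation (7) concrete, the following is a minimal NumPy sketch of the clipped objective, assuming precomputed arrays of probability ratios and advantage estimates; the function and variable names are ours, for illustration only, and this is not the authors' implementation.

    import numpy as np

    def clipped_surrogate(ratio, adv, epsilon=0.2):
        """Empirical average of the per-timestep clipped objective in Equation (7).

        ratio: r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), one entry per timestep
        adv:   advantage estimates A_hat_t, one entry per timestep
        """
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
        # The elementwise minimum makes the objective a pessimistic (lower) bound:
        # clipping only removes the incentive to move r_t outside [1 - eps, 1 + eps];
        # it never hides a term that makes the objective worse.
        return np.mean(np.minimum(unclipped, clipped))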
Note that LCLIP(θ) = LCPI(θ) to first order around θold (i.e., where r = 1); however, they become different as θ moves away from θold. Figure 1 plots a single term (i.e., a single t) in LCLIP; note that the probability ratio r is clipped at 1 − ε or 1 + ε depending on whether the advantage is positive or negative.

Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function LCLIP as a function of the probability ratio r, for positive advantages (left) and negative advantages (right). The red circle on each plot shows the starting point for the optimization, i.e., r = 1. Note that LCLIP sums many of these terms.

Figure 2 provides another source of intuition about the surrogate objective LCLIP. It shows how several objectives vary as we interpolate along the policy update direction, obtained by proximal policy optimization (the algorithm we will introduce shortly) on a continuous control problem. We can see that LCLIP is a lower bound on LCPI, with a penalty for having too large of a policy update.

Figure 2: Surrogate objectives, as we interpolate between the initial policy parameter θold and the updated policy parameter, which we compute after one iteration of PPO (plotted curves: Êt[KLt], LCPI = Êt[rtÂt], Êt[clip(rt, 1 − ε, 1 + ε)Ât], and LCLIP = Êt[min(rtÂt, clip(rt, 1 − ε, 1 + ε)Ât)], as functions of the linear interpolation factor). The updated policy has a KL divergence of about 0.02 from the initial policy, and this is the point at which LCLIP is maximal. This plot corresponds to the first policy update on the Hopper-v1 problem, using hyperparameters provided in Section 6.1.

4 Adaptive KL Penalty Coefficient

Another approach, which can be used as an alternative to the clipped surrogate objective, or in addition to it, is to use a penalty on KL divergence, and to adapt the penalty coefficient so that we achieve some target value of the KL divergence dtarg at each policy update. In our experiments, we found that the KL penalty performed worse than the clipped surrogate objective; however, we’ve included it here because it’s an important baseline.

In the simplest instantiation of this algorithm, we perform the following steps in each policy update:

• Using several epochs of minibatch SGD, optimize the KL-penalized objective

    LKLPEN(θ) = Êt[ (πθ(at | st) / πθold(at | st)) Ât − β KL[πθold(· | st), πθ(· | st)] ]    (8)

• Compute d = Êt[KL[πθold(· | st), πθ(· | st)]]
  – If d < dtarg / 1.5, β ← β / 2
  – If d > dtarg × 1.5, β ← β × 2

The updated β is used for the next policy update. With this scheme, we occasionally see policy updates where the KL divergence is significantly different from dtarg; however, these are rare, and β quickly adjusts. The parameters 1.5 and 2 above are chosen heuristically, but the algorithm is not very sensitive to them. The initial value of β is another hyperparameter but is not important in practice because the algorithm quickly adjusts it.
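The coefficient adaptation above amounts to a couple of comparisons; the following is a minimal sketch, with function and argument names that are ours and purely illustrative.

    def update_kl_penalty_coefficient(beta, d, d_targ):
        """Adapt the KL penalty coefficient beta as described above.

        beta:   current KL penalty coefficient
        d:      measured KL divergence E_hat_t[KL[pi_theta_old, pi_theta]] after the update
        d_targ: target KL divergence per policy update
        """
        if d < d_targ / 1.5:
            beta /= 2.0   # policy changed too little: weaken the penalty
        elif d > d_targ * 1.5:
            beta *= 2.0   # policy changed too much: strengthen the penalty
        return beta       # used for the next policy update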
5 Algorithm

The surrogate losses from the previous sections can be computed and differentiated with a minor change to a typical policy gradient implementation. For implementations that use automatic differentiation, one simply constructs the loss LCLIP or LKLPEN instead of LPG, and one performs multiple steps of stochastic gradient ascent on this objective.

Most techniques for computing variance-reduced advantage-function estimators make use of a learned state-value function V(s); for example, generalized advantage estimation [Sch+15a], or the finite-horizon estimators in [Mni+16]. If using a neural network architecture that shares parameters between the policy and value function, we must use a loss function that combines the policy surrogate and a value function error term. This objective can further be augmented by adding an entropy bonus to ensure sufficient exploration, as suggested in past work [Wil92; Mni+16]. Combining these terms, we obtain the following objective, which is (approximately) maximized each iteration:

    LtCLIP+VF+S(θ) = Êt[ LtCLIP(θ) − c1 LtVF(θ) + c2 S[πθ](st) ],    (9)

where c1, c2 are coefficients, S denotes an entropy bonus, and LtVF is a squared-error loss (Vθ(st) − Vttarg)².

One style of policy gradient implementation, popularized in [Mni+16] and well-suited for use with recurrent neural networks, runs the policy for T timesteps (where T is much less than the episode length), and uses the collected samples for an update. This style requires an advantage estimator that does not look beyond timestep T. The estimator used by [Mni+16] is

    Ât = −V(st) + rt + γrt+1 + · · · + γ^(T−t−1) rT−1 + γ^(T−t) V(sT)    (10)

where t specifies the time index in [0, T], within a given length-T trajectory segment. Generalizing this choice, we can use a truncated version of generalized advantage estimation, which reduces to Equation (10) when λ = 1:

    Ât = δt + (γλ)δt+1 + · · · + (γλ)^(T−t−1) δT−1,    (11)
    where δt = rt + γV(st+1) − V(st).    (12)

A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collects T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD (or, usually for better performance, Adam [KB14]) for K epochs.

Algorithm 1 PPO, Actor-Critic Style
for iteration = 1, 2, . . . do
    for actor = 1, 2, . . . , N do
        Run policy πθold in environment for T timesteps
        Compute advantage estimates Â1, . . . , ÂT
    end for
    Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
    θold ← θ
end for

6 Experiments

6.1 Comparison of Surrogate Objectives

First, we compare several different surrogate objectives under different hyperparameters. Here, we compare the surrogate objective LCLIP to several natural variations and ablated versions:

    No clipping or penalty:          Lt(θ) = rt(θ)Ât
    Clipping:                        Lt(θ) = min(rt(θ)Ât, clip(rt(θ), 1 − ε, 1 + ε)Ât)
    KL penalty (fixed or adaptive):  Lt(θ) = rt(θ)Ât − β KL[πθold, πθ]

For the KL penalty, one can either use a fixed penalty coefficient β or an adaptive coefficient as described in Section 4 using target KL value dtarg. Note that we also tried clipping in log space, but found the performance to be no better.
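All of the surrogate variants above consume the same advantage estimates Ât. As a reference, here is a minimal NumPy sketch of the truncated generalized advantage estimator of Equations (11)–(12), assuming per-segment arrays of rewards and value predictions; it ignores episode boundaries within a segment, the names are ours, and it is an illustrative sketch rather than the authors' implementation.

    import numpy as np

    def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Truncated generalized advantage estimation, Equations (11)-(12).

        rewards:    r_t for t = 0, ..., T-1 within one length-T segment
        values:     V(s_t) for t = 0, ..., T-1
        last_value: V(s_T), the value estimate used to bootstrap past the segment
        Returns the advantage estimates A_hat_t for t = 0, ..., T-1.
        """
        rewards = np.asarray(rewards, dtype=float)
        values_ext = np.append(np.asarray(values, dtype=float), last_value)  # V(s_0), ..., V(s_T)
        deltas = rewards + gamma * values_ext[1:] - values_ext[:-1]          # Equation (12)
        advantages = np.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):   # backward recursion implementing Equation (11)
            running = deltas[t] + gamma * lam * running
            advantages[t] = running
        return advantages

Setting lam=1.0 recovers the estimator of Equation (10).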
Because we are searching over hyperparameters for each algorithm variant, we chose a computationally cheap benchmark to test the algorithms on. Namely, we used 7 simulated robotics tasks² implemented in OpenAI Gym [Bro+16], which use the MuJoCo [TET12] physics engine. We do one million timesteps of training on each one. Besides the hyperparameters used for clipping (ε) and the KL penalty (β, dtarg), which we search over, the other hyperparameters are provided in Table 3.

² HalfCheetah, Hopper, InvertedDoublePendulum, InvertedPendulum, Reacher, Swimmer, and Walker2d, all “-v1”.

To represent the policy, we used a fully-connected MLP with two hidden layers of 64 units and tanh nonlinearities, outputting the mean of a Gaussian distribution, with variable standard deviations, following [Sch+15b; Dua+16]. We don’t share parameters between the policy and value function (so coefficient c1 is irrelevant), and we don’t use an entropy bonus.

Each algorithm was run on all 7 environments, with 3 random seeds on each. We scored each run of the algorithm by computing the average total reward of the last 100 episodes. We shifted and scaled the scores for each environment so that the random policy gave a score of 0 and the best result was set to 1, and averaged over 21 runs to produce a single scalar for each algorithm setting.

The results are shown in Table 1. Note that the score is negative for the setting without clipping or penalties, because for one environment (half cheetah) it leads to a very negative score, which is worse than the initial random policy.

algorithm                      avg. normalized score
No clipping or penalty              -0.39
Clipping, ε = 0.1                    0.76
Clipping, ε = 0.2                    0.82
Clipping, ε = 0.3                    0.70
Adaptive KL, dtarg = 0.003           0.68
Adaptive KL, dtarg = 0.01            0.74
Adaptive KL, dtarg = 0.03            0.71
Fixed KL, β = 0.3                    0.62
Fixed KL, β = 1.0                    0.71
Fixed KL, β = 3.0                    0.72
Fixed KL, β = 10.0                   0.69

Table 1: Results from the continuous control benchmark. Average normalized scores (over 21 runs of the algorithm, on 7 environments) for each algorithm / hyperparameter setting. β was initialized at 1.

6.2 Comparison to Other Algorithms in the Continuous Domain

Next, we compare PPO (with the “clipped” surrogate objective from Section 3) to several other methods from the literature, which are considered to be effective for continuous problems. We compared against tuned implementations of the following algorithms: trust region policy optimization [Sch+15b], cross-entropy method (CEM) [SL06], vanilla policy gradient with adaptive stepsize³, A2C [Mni+16], and A2C with trust region [Wan+16]. A2C stands for advantage actor critic, and is a synchronous version of A3C, which we found to have the same or better performance than the asynchronous version. For PPO, we used the hyperparameters from the previous section, with ε = 0.2. We see that PPO outperforms the previous methods on almost all the continuous control environments.

³ After each batch of data, the Adam stepsize is adjusted based on the KL divergence of the original and updated policy, using a rule similar to the one shown in Section 4. An implementation is available at https://github.com/berkeleydeeprlcourse/homework/tree/master/hw4.

Figure 3: Comparison of several algorithms (A2C, A2C + Trust Region, CEM, PPO (Clip), Vanilla PG with adaptive stepsize, and TRPO) on seven MuJoCo environments (HalfCheetah-v1, Hopper-v1, InvertedDoublePendulum-v1, InvertedPendulum-v1, Reacher-v1, Swimmer-v1, Walker2d-v1), training for one million timesteps.
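For reference, the following is a minimal NumPy sketch of the policy architecture described in Section 6.1 and used in these continuous control comparisons: a fully-connected MLP with two hidden layers of 64 tanh units outputting the mean of a Gaussian, together with a learned log standard deviation, here assumed to be state-independent as in [Sch+15b; Dua+16]. The class and method names are ours; this is an illustrative sketch, not the code used for the experiments.

    import numpy as np

    class GaussianMLPPolicy:
        """Diagonal-Gaussian policy: tanh MLP (two hidden layers of 64 units) for the mean,
        plus a log standard deviation that does not depend on the state."""

        def __init__(self, obs_dim, act_dim, hidden=64, seed=0):
            rng = np.random.default_rng(seed)
            sizes = [obs_dim, hidden, hidden, act_dim]
            self.weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
            self.biases = [np.zeros(n) for n in sizes[1:]]
            self.log_std = np.zeros(act_dim)   # learned parameter, independent of the observation

        def mean(self, obs):
            x = obs
            for W, b in zip(self.weights[:-1], self.biases[:-1]):
                x = np.tanh(x @ W + b)         # tanh nonlinearities on the hidden layers
            return x @ self.weights[-1] + self.biases[-1]

        def log_prob(self, obs, act):
            mu, std = self.mean(obs), np.exp(self.log_std)
            return np.sum(-0.5 * ((act - mu) / std) ** 2 - self.log_std
                          - 0.5 * np.log(2 * np.pi), axis=-1)

In a PPO update, the probability ratio rt(θ) of Section 3 would be the exponential of the difference between log_prob under the current parameters and log_prob under θold for the sampled actions.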
6.3 Showcase in the Continuous Domain: Humanoid Running and Steering

To showcase the performance of PPO on high-dimensional continuous control problems, we train on a set of problems involving a 3D humanoid, where the robot must run, steer, and get up off the ground, possibly while being pelted by cubes. The three tasks we test on are (1) RoboschoolHumanoid: forward locomotion only, (2) RoboschoolHumanoidFlagrun: the position of the target is randomly varied every 200 timesteps or whenever the goal is reached, and (3) RoboschoolHumanoidFlagrunHarder: the robot is pelted by cubes and needs to get up off the ground. See Figure 5 for still frames of a learned policy, and Figure 4 for learning curves on the three tasks. Hyperparameters are provided in Table 4. In concurrent work, Heess et al. [Hee+17] used the adaptive KL variant of PPO (Section 4) to learn locomotion policies for 3D robots.

Figure 4: Learning curves from PPO on the 3D humanoid control tasks (RoboschoolHumanoid-v0, RoboschoolHumanoidFlagrun-v0, and RoboschoolHumanoidFlagrunHarder-v0), using Roboschool.

Figure 5: Still frames of the policy learned from RoboschoolHumanoidFlagrun. In the first six frames, the robot runs towards a target. Then the position is randomly changed, and the robot turns and runs toward the new target.

6.4 Comparison to Other Algorithms on the Atari Domain

We also ran PPO on the Arcade Learning Environment [Bel+15] benchmark and compared against well-tuned implementations of A2C [Mni+16] and ACER [Wan+16]. For all three algorithms, we used the same policy network architecture as used in [Mni+16]. The hyperparameters for PPO are provided in Table 5. For the other two algorithms, we used hyperparameters that were tuned to maximize performance on this benchmark.

A table of results and learning curves for all 49 games is provided in Appendix B. We consider the following two scoring metrics: (1) average reward per episode over the entire training period (which favors fast learning), and (2) average reward per episode over the last 100 episodes of training (which favors final performance). Table 2 shows the number of games “won” by each algorithm, where we compute the victor by averaging the scoring metric across three trials.

                                                 A2C   ACER   PPO   Tie
(1) avg. episode reward over all of training       1     18    30     0
(2) avg. episode reward over last 100 episodes     1     28    19     1

Table 2: Number of games “won” by each algorithm, where the scoring metric is averaged across three trials.

7 Conclusion

We have introduced proximal policy optimization, a family of policy optimization methods that use multiple epochs of stochastic gradient ascent to perform each policy update. These methods have the stability and reliability of trust-region methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy gradient implementation, are applicable in more general settings (for example, when using a joint architecture for the policy and value function), and have better overall performance.

8 Acknowledgements

Thanks to Rocky Duan, Peter Chen, and others at OpenAI for insightful comments.

References

[Bel+15] M. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The arcade learning environment: An evaluation platform for general agents”. In: Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.

[Bro+16] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. “OpenAI Gym”.
In: arXiv preprint arXiv:1606.01540 (2016).

[Dua+16] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. “Benchmarking Deep Reinforcement Learning for Continuous Control”. In: arXiv preprint arXiv:1604.06778 (2016).

[Hee+17] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. “Emergence of Locomotion Behaviours in Rich Environments”. In: arXiv preprint arXiv:1707.02286 (2017).

[KL02] S. Kakade and J. Langford. “Approximately optimal approximate reinforcement learning”. In: ICML. Vol. 2. 2002, pp. 267–274.

[KB14] D. Kingma and J. Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[Mni+15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533.

[Mni+16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. “Asynchronous methods for deep reinforcement learning”. In: arXiv preprint arXiv:1602.01783 (2016).

[Sch+15a] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).

[Sch+15b] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust region policy optimization”. In: CoRR abs/1502.05477 (2015).

[SL06] I. Szita and A. Lörincz. “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation 18.12 (2006), pp. 2936–2941.

[TET12] E. Todorov, T. Erez, and Y. Tassa. “MuJoCo: A physics engine for model-based control”. In: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.

[Wan+16] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. “Sample Efficient Actor-Critic with Experience Replay”. In: arXiv preprint arXiv:1611.01224 (2016).

[Wil92] R. J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3-4 (1992), pp. 229–256.

A Hyperparameters

Hyperparameter        Value
Horizon (T)           2048
Adam stepsize         3 × 10−4
Num. epochs           10
Minibatch size        64
Discount (γ)          0.99
GAE parameter (λ)     0.95

Table 3: PPO hyperparameters used for the MuJoCo 1 million timestep benchmark.

Hyperparameter                       Value
Horizon (T)                          512
Adam stepsize                        ∗
Num. epochs                          15
Minibatch size                       4096
Discount (γ)                         0.99
GAE parameter (λ)                    0.95
Number of actors                     32 (locomotion), 128 (flagrun)
Log stdev. of action distribution    LinearAnneal(−0.7, −1.6)

Table 4: PPO hyperparameters used for the Roboschool experiments. ∗ The Adam stepsize was adjusted based on the target value of the KL divergence.

Hyperparameter              Value
Horizon (T)                 128
Adam stepsize               2.5 × 10−4 × α
Num. epochs                 3
Minibatch size              32 × 8
Discount (γ)                0.99
GAE parameter (λ)           0.95
Number of actors            8
Clipping parameter ε        0.1 × α
VF coeff. c1 (Eq. 9)        1
Entropy coeff. c2 (Eq. 9)   0.01

Table 5: PPO hyperparameters used in the Atari experiments. α is linearly annealed from 1 to 0 over the course of learning.
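Both the Adam stepsize and the clipping parameter in Table 5 are scaled by a coefficient α that is linearly annealed from 1 to 0 over training. A minimal sketch of one such schedule follows; the exact functional form, the names, and the example step counts are our assumptions, not taken from the paper.

    def linearly_annealed_alpha(step, total_steps):
        """Alpha annealed linearly from 1 (start of training) to 0 (end of training)."""
        return max(0.0, 1.0 - step / float(total_steps))

    # Example: if Atari training runs for 10M timesteps (40M frames), then halfway through:
    alpha = linearly_annealed_alpha(step=5_000_000, total_steps=10_000_000)
    adam_stepsize = 2.5e-4 * alpha   # "Adam stepsize" row of Table 5
    clip_epsilon = 0.1 * alpha       # "Clipping parameter" row of Table 5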
B Performance on More Atari Games

Here we include a comparison of PPO against A2C on a larger collection of 49 Atari games. Figure 6 shows the learning curves of each of three random seeds, while Table 6 shows the mean performance.

Figure 6: Comparison of PPO and A2C on all 49 ATARI games included in OpenAI Gym at the time of publication.

                       A2C        ACER        PPO
Alien                1141.7      1655.4      1850.3
Amidar                380.8       827.6       674.6
Assault              1562.9      4653.8      4971.9
Asterix              3176.3      6801.2      4532.5
Asteroids            1653.3      2389.3      2097.5
Atlantis           729265.3   1841376.0   2311815.0
BankHeist            1095.3      1177.5      1280.6
BattleZone           3080.0      8983.3     17366.7
BeamRider            3031.7      3863.3      1590.0
Bowling                30.1        33.3        40.1
Boxing                 17.7        98.9        94.6
Breakout              303.0       456.4       274.8
Centipede            3496.5      8904.8      4386.4
ChopperCommand       1171.7      5287.7      3516.3
CrazyClimber       107770.0    132461.0    110202.0
DemonAttack          6639.1     38808.3     11378.4
DoubleDunk            -16.2       -13.2       -14.9
Enduro                  0.0         0.0       758.3
FishingDerby           20.6        34.7        17.8
Freeway                 0.0         0.0        32.5
Frostbite             261.8       285.6       314.2
Gopher               1500.9     37802.3      2932.9
Gravitar              194.0       225.3       737.2
IceHockey              -6.4        -5.9        -4.2
Jamesbond              52.3       261.8       560.7
Kangaroo               45.3        50.0      9928.7
Krull                8367.4      7268.4      7942.3
KungFuMaster        24900.3     27599.3     23310.3
MontezumaRevenge        0.0         0.3        42.0
MsPacman             1626.9      2718.5      2096.5
NameThisGame         5961.2      8488.0      6254.9
Pitfall               -55.0       -16.9       -32.9
Pong                   19.7        20.7        20.7
PrivateEye             91.3       182.0        69.5
Qbert               10065.7     15316.6     14293.3
Riverraid            7653.5      9125.1      8393.6
RoadRunner          32810.0     35466.0     25076.0
Robotank                2.2         2.5         5.5
Seaquest             1714.3      1739.5      1204.5
SpaceInvaders         744.5      1213.9       942.5
StarGunner          26204.0     49817.7     32689.0
Tennis                -22.2       -17.6       -14.8
TimePilot            2898.0      4175.7      4342.0
Tutankham             206.8       280.8       254.4
UpNDown             17369.8    145051.4     95445.0
Venture                 0.0         0.0         0.0
VideoPinball        19735.9    156225.6     37389.0
WizardOfWor           859.0      2308.3      4185.3
Zaxxon                 16.3        29.0      5008.7

Table 6: Mean final scores (last 100 episodes) of A2C, ACER, and PPO on Atari games after 40M game frames (10M timesteps).