arXiv:1907.06077v1 [cs.NE] 13 Jul 2019

Evolvability ES: Scalable and Direct Optimization of Evolvability

Alexander Gajewski∗ (Columbia University), Jeff Clune (Uber AI Labs and University of Wyoming), Kenneth O. Stanley (Uber AI Labs), and Joel Lehman (Uber AI Labs)
∗ Work done during an internship at Uber AI Labs

ABSTRACT

Designing evolutionary algorithms capable of uncovering highly evolvable representations is an open challenge in evolutionary computation; such evolvability is important in practice, because it accelerates evolution and enables fast adaptation to changing circumstances. This paper introduces evolvability ES, an evolutionary algorithm designed to explicitly and efficiently optimize for evolvability, i.e. the ability to further adapt. The insight is that it is possible to derive a novel objective in the spirit of natural evolution strategies that maximizes the diversity of behaviors exhibited when an individual is subject to random mutations, and that efficiently scales with computation. Experiments in 2-D and 3-D locomotion tasks highlight the potential of evolvability ES to generate solutions with tens of thousands of parameters that can quickly be adapted to solve different tasks and that can productively seed further evolution. We further highlight a connection between evolvability in EC and a recent and popular gradient-based meta-learning algorithm called MAML; results show that evolvability ES can perform competitively with MAML and that it discovers solutions with distinct properties. The conclusion is that evolvability ES opens up novel research directions for studying and exploiting the potential of evolvable representations for deep neural networks.

CCS CONCEPTS
• Computing methodologies → Genetic algorithms; Neural networks;

KEYWORDS
Evolvability, neuroevolution, evolution strategy, meta-learning

ACM Reference Format:
Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman. 2019. Evolvability ES: Scalable and Direct Optimization of Evolvability. In Genetic and Evolutionary Computation Conference (GECCO '19), July 13–17, 2019, Prague, Czech Republic. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3321707.3321876

1 INTRODUCTION

One challenge in evolutionary computation (EC) is to design algorithms capable of uncovering highly evolvable representations; though evolvability's definition is debated, the idea is to find genomes with great potential for further evolution [2, 10, 15, 19, 21, 26, 33, 43]. Here, as in previous work, we adopt a definition of evolvability as the propensity of an individual to generate phenotypic diversity [21, 23, 26]. Such evolvability is important in practice, because it broadens the variation accessible through mutation, thereby accelerating evolution; improved evolvability thus would benefit many areas across EC, e.g. evolutionary robotics, open-ended evolution, and quality diversity (QD; [22, 32]). While evolvability is seemingly ubiquitous in nature (e.g. the amazing diversity of dogs accessible within a few generations of breeding), its emergence in evolutionary algorithms (EAs) is seemingly rare [33, 43], and how best to encourage it remains an important open question.

There are two general approaches to encourage evolvability in EAs. The first is to create environments or selection criteria that produce evolvability as an indirect consequence [6, 10, 17, 21, 33]. For example, environments wherein goals vary modularly over generations may implicitly favor individuals better able to adapt to such variations [17]. The second approach, which is the focus of this paper, is to select directly for evolvability, i.e. to judge individuals by directly testing their potential for further evolution [26]. While the first approach is more biologically plausible and is important to understanding natural evolvability, the second benefits from its directness, its potential ease of application to new domains, and its ability to enable the study of highly-evolvable genomes without fully understanding evolvability's natural emergence.
However, current implementations of such evolvability search [26] suffer from their computational cost.

A separate (but complementary) challenge in EC is that of effectively evolving large genomes. For example, there has been recent interest in training deep neural networks (DNNs) because of their potential for expressing complex behaviors, e.g. playing Atari games from raw pixels [28]. However, evolving DNNs is challenging because they have many more parameters than genomes typically evolved by comparable approaches in EC (e.g. neural networks evolved by NEAT [38]). For this reason, the study of scalable EAs that can benefit from increased computation is of recent interest [5, 34, 40], and evolvability-seeking algorithms will require similar considerations to scale effectively to large networks.

This paper addresses both of these challenges, by synthesizing three threads of research in deep learning and EC. The first thread involves a popular gradient-based meta-learning algorithm called MAML [11] that searches for points in the search space from which one (or a few) optimization step(s) can solve diverse tasks. We introduce here a connection between this kind of parameter-space meta-learning and evolvability, as MAML's formulation is very similar to that of evolvability search [26], which searches for individuals from which mutations (instead of optimization) yield a diverse repertoire of behaviors. MAML's formulation, and its success with DNNs on complicated reinforcement learning (RL) tasks, hints that there may similarly be efficient and effective formulations of evolvability.

The second thread involves the recent scalable form of evolution strategy (ES) of Salimans et al. [34] (which at heart is a simplified form of natural evolution strategy [45]), shown to be surprisingly competitive with gradient-based RL. We refer to this specific algorithm as ES in this paper for simplicity, and note that the field of ES as a whole encompasses many diverse algorithms [3, 36].
The final thread is a recent formalism called stochastic computation graphs (SCGs) [35], which enables automatic derivations of gradient estimations that include expectations over distributions (such as the objective optimized by ES). We here extend SCGs to handle a larger class of functions, which enables formulating an efficient evolvability-inspired objective. Weaving together these three threads, the main insight in this paper is that it is possible to derive a novel algorithm, called evolvability ES, that optimizes an evolvability-inspired objective without incurring any additional overhead in domain evaluations relative to optimizing a traditional objective with ES. Such efficiency is possible because each iteration of ES can aggregate information across samples to estimate local gradients of evolvability. The experiments in this paper demonstrate the potential of evolvability ES in two deep RL locomotion benchmarks, wherein evolvability ES often uncovers a diversity of high-performing behaviors using the same computation required for ES to uncover a single one. Further tests highlight the potential benefits of such evolvable genomes for fast adaptability, show that evolvability ES can optimize a form of population-level evolvability [46], and demonstrate that evolvability ES performs competitively with the popular MAML gradient-based method (while discovering pockets of the search space with interestingly distinct properties). The conclusion is that evolvability ES is a promising new algorithm for producing and studying evolvability at deep-learning scale. 2 BACKGROUND 2.1 Evolvability and Evolvability Search No consensus exists on the measure or definition of evolvability [31]; the definition we adopt here, as in prior work [21, 23, 26], follows one mainstream conception of evolvability as phenotypic variability [4, 18, 31], i.e. the phenotypic diversity demonstrated by an individual’s offspring. Exact definition aside, reproducing the evolvability of natural organisms is an important yet unmet goal in EC. Accordingly, researchers have proposed many approaches for encouraging evolvability [6, 10, 17, 19, 21, 29, 33]. This paper focuses in particular on extending the ideas from evolvability search [26], an algorithm that directly rewards evolvability by explicitly measuring evolvability and guiding search towards it. The motivation is that it often may be more straightforward to directly optimize evolvability than to encourage its indirect emergence, that it may Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman help compare the advantages or drawbacks of different quantifications of evolvability [24], and that it may facilitate the study of evolvability even before its natural emergence is understood. In evolvability search, the central idea is to calculate an individual’s fitness from domain evaluations of many of its potential offspring. In particular, an individual’s potential to generate phenotypic variability is estimated by quantifying the diversity of behaviors demonstrated from evaluating a sample of its offspring. Note the distinction between evolvability search and QD: QD attempts to build a collection of diverse well-adapted genomes, while evolvability search attempts to find genomes from which diverse behaviors can readily be evolved. 
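To make the fitness calculation of evolvability search concrete, the sketch below scores one individual by the behavioral diversity of a sample of its offspring. The mutate and evaluate_behavior helpers and the particular diversity measure (counting offspring behaviors separated by more than a threshold) are illustrative assumptions, not the exact procedure of Mengistu et al. [26].

```python
import numpy as np

def evolvability_fitness(genome, mutate, evaluate_behavior,
                         num_offspring=100, distance_threshold=1.0):
    """Estimate an individual's evolvability as the behavioral diversity of a
    sample of its offspring (a sketch of evolvability search's fitness).

    mutate(genome) -> a perturbed copy of the genome (assumed helper)
    evaluate_behavior(genome) -> behavior characterization, a 1-D array (assumed helper)
    """
    behaviors = np.array([evaluate_behavior(mutate(genome))
                          for _ in range(num_offspring)])
    # Greedily count behaviors that are "distinct": farther than the distance
    # threshold from every behavior already counted.
    distinct = []
    for b in behaviors:
        if all(np.linalg.norm(b - d) > distance_threshold for d in distinct):
            distinct.append(b)
    return len(distinct)
```

Note how every candidate individual requires its own batch of offspring evaluations, which is the source of the computational cost discussed below.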
In practice, evolvability search requires (1) quantifying dimensions of behavior of interest, as in the behavior characterizations of novelty search [20], and (2) a distance threshold to formalize what qualifies as two behaviors being distinct. Interestingly, optimizing evolvability, just like optimizing novelty, can sometimes lead to solving problems as a byproduct [26]. However, evolvability search is computationally expensive (it requires evaluating enough offspring of each individual in the population to estimate its potential), and has only been demonstrated with small neural networks (NNs); evolvability ES addresses both issues. 2.2 MAML Meta-learning [42] focuses on optimizing an agent’s learning potential (i.e. its ability to solve new tasks) rather than its immediate performance (i.e. how well it solves the current task) as is more typical in optimization and RL, and has a rich history both in EC and machine learning at large. The most common forms of neural meta-learning algorithms train NNs to implement their own learning algorithms, either through exploiting recurrence [16, 39, 44] or plastic connections [12, 27, 37, 39]. However, another approach is taken by the recent and popular model-agnostic meta-learning (MAML) algorithm [11]. Instead of training a NN that itself can learn from experience, MAML searches for a fixed set of NN weights from which a single additional optimization step can solve a variety of tasks. The insight is that it is possible to differentiate through the optimization step, to create a fully gradient-based algorithm that seeks adaptable areas of the search space. Interestingly, MAML’s formulation shares deep similarities with the idea of evolvability in EC: Both involve finding points in the search space nearby to diverse functional behaviors. One contribution here is to expose this connection between MAML and evolvability and explore it experimentally. Intriguingly, MAML has been successful in training DNNs with many parameters to quickly adapt in supervised and RL domains [11]. These results suggest that evolvable DNNs exist and can sometimes be directly optimized towards, a key source of inspiration for evolvability ES. 2.3 Natural Evolution Strategies The evolvability ES algorithm introduced here builds upon the ES algorithm of Salimans et al. [34], which itself is based on the Natural Evolution Strategies (NES; [45]) black-box optimization algorithm. Because in such a setting the gradients of the function to be optimized, f (z), are unavailable, NES instead creates a smoother version of f that is differentiable, by defining a population distribution Evolvability ES: Scalable and Direct Optimization of Evolvability GECCO ’19, July 13–17, 2019, Prague, Czech Republic π (z; θ ) and setting the smoothed loss function J (θ ) = Ez∼π [f (z)]. This function is then optimized iteratively with gradient descent, where the gradients are estimated by samples from π . Salimans et al. [34] showed recently that NES with an isotropic Gaussian distribution of fixed variance (i.e. π = N (µ, σI ) and θ = µ) is competitive with deep RL on high-dimensional RL tasks, and can be very time-efficient because the expensive gradient estimation is easily parallelizable. Salimans et al. [34] refer to this algorithm as Evolution Strategies (ES), and we adopt their terminology in this work, referring to it in the experiments as standard ES (to differentiate it from evolvability ES). 
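For reference, one generation of this standard ES can be sketched as follows; the population size, mutation standard deviation, and learning rate are placeholders, and the rank normalization and mirrored sampling used in practice by Salimans et al. [34] are omitted for brevity.

```python
import numpy as np

def es_update(mu, f, sigma=0.02, population_size=200, learning_rate=0.01):
    """One generation of standard ES: estimate the gradient of
    J(mu) = E_{z ~ N(mu, sigma^2 I)}[f(z)] from sampled pseudo-offspring
    and take a gradient-ascent step on the population mean mu."""
    eps = np.random.randn(population_size, mu.size)        # perturbation directions
    fitness = np.array([f(mu + sigma * e) for e in eps])   # evaluate pseudo-offspring
    # Score-function estimate: grad J ~= (1 / (n * sigma)) * sum_i fitness_i * eps_i
    grad = (fitness[:, None] * eps).mean(axis=0) / sigma
    return mu + learning_rate * grad
```

Because each domain evaluation is independent, the loop over pseudo-offspring is what gets distributed across workers in the parallel implementation.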
Connecting to traditional EAs, the distribution π can be viewed as a population of individuals that evolves by one generation at every gradient step, where µ can be viewed as the parent individual, and samples from π can be termed that parent’s pseudo-offspring cloud. NES is formulated differently from other EAs; unlike them it operates by gradient descent. Because of NES’s formulation, the desired outcome (i.e. to increase the average fitness f ) can be represented by a differentiable function of trainable parameters of the population distribution. We can then analytically compute the optimal update rule for each generation: to take a step in the direction of the gradient. For such algorithms, the burden shifts from designing a generational update rule to designing an objective function. Thus, to enable fast experimentation with new objective functions, ES-like algorithms can benefit from automatic differentiation, an idea explored in the next two sections. represent this as a SCG, because it involves a function applied to a nested expectation (the key reason being that in general, Ez [f (z)] , f (Ez [z])). In light of these limitations, in supplemental sections S2 and S3, we generalize SCGs [35] to more easily account for nested expectations, and show that this new formalism yields surrogate loss functions that can automatically be differentiated by popular tools like PyTorch [30]. Through this approach, nearly any loss function involving potentially nested expectations over differentiable probability distributions can be automatically optimized with gradient descent through sampling. This approach is applied in the next section to enable an efficient evolvability-optimizing variant of ES. 2.4 Stochastic Computation Graphs In most modern tensor computation frameworks (e.g. PyTorch [1] and Tensorflow [30]), functions are represented as computation graphs, with input nodes representing a function’s arguments, and inner nodes representing compositions of certain elementary functions like addition or exponentiation. The significance of computation graphs is that they support automatic differentiation, enabling easy use of gradient descent-like algorithms with novel architectures and loss functions. Note, however, that these computation graphs are deterministic, as these variables have no probability distributions associated with them. Recently introduced in Schulman et al. [35], Stochastic Computation Graphs (SCGs) are a formalism adding support for random variables. Importantly, like deterministic computation graphs, SCGs support automatic differentiation. SCGs are thus a natural way to represent ES-like algorithms: The function ES tries to optimize is an expectation over individuals in the population of some function of their parameters. In this way, automatic differentiation of SCGs enables rapid experimentation with modifications of ES to different loss functions or population distributions; more details about SCGs can be found in supplemental sections S2 and S3. The next section extends SCGs, which is needed to explore certain formulations of evolvability ES. 3 NESTED SCGS While SCGs are adequate for simple modifications of ES (e.g. changing the population distribution from an isotropic Gaussian to a Gaussian with diagonal covariance), they lack the expressiveness needed to enable certain more complex objectives. For example, consider an ES-like objective of the form Ez [f (Ez ′ [д(z, z ′ )])] (the form of one variant of evolvability ES). 
There is no natural way to 4 APPROACH: EVOLVABILITY ES One motivation for evolvability ES is to create an algorithm similar in effect to evolvability search [26] but without immense computational requirements, such that it can scale efficiently with computation to large DNNs. To enable such an algorithm, a driving insight is that the same domain evaluations exploited in standard ES to estimate gradients of fitness improvement also contain the information needed to estimate gradients of evolvability improvement. That is, ES estimates fitness improvement by sampling and evaluating policies drawn from the neighborhood of a central individual, and updates the central individual such that future draws of the higher-fitness policies are more likely. Similarly, the central individual could be updated to instead make policies that contribute more to the diversity of the entire population-cloud more likely. The result is an algorithm that does not incur the cost of evolvability search of evaluating potential offspring of all individuals, because in an algorithm like ES that represents the population as a formal distribution over the search space, information about local diversity is shared between individuals; a further benefit is the computational efficiency and scalability of ES itself, which this algorithm retains. While such an insight sounds similar to what motivates noveltydriven ES algorithms [7, 9], there is a subtle but important difference: Novelty-driven ES drives the central individual of ES towards behaviors unseen in previous generations, while evolvability ES instead aims to optimize the diversity of offspring of the current central individual. Thus the product of evolvability ES is an evolvable central individual whose mutations lead to diverse behaviors (note that an experiment in section 5.4 extends evolvability ES to include multiple central individuals). We now formally describe the method, and two concrete variants1 . 4.1 Formal Description Consider the isotropic Gaussian distribution π of ES. Recall the analogy wherein this distribution represents the population, and the distribution mean can be seen as the central individual, or parent. By the definition adopted here, evolvable points in the search space can through mutation produce many different behaviors: In other words, mutations provide many different options for natural selection to select from. Thus, as in evolvability search [26], our aim is to maximize some statistic of behavioral diversity of an individual’s mutations. Formally, behavior is represented as a behavior characteristic (BC; [20]), a vector-valued function mapping 1 Source code available at: https://github.com/uber-research/Evolvability-ES GECCO ’19, July 13–17, 2019, Prague, Czech Republic Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman a genome z to behaviors B(z). For example, in a locomotion task, a policy’s behavior can be its final position on a plane. Here, we consider two different diversity statistics which lead to maximum variance (MaxVar) and maximum entropy (MaxEnt) variants of evolvability ES, here called MaxVar-EES and MaxEntEES. MaxVar-EES maximizes the trace of the covariance matrix of the BC over the population. The motivation is that the variance of a distribution captures one facet of the distribution’s diversity. 
This formulation results in the following loss function:

J(θ) = Σ_j E_z[(B_j(z) − µ_j)²],  (1)

where the expectation is over policies z ∼ π(·; θ), the summation is over components j of the BC, and µ_j represents the mean of the jth component of the BC.

MaxEnt-EES maximizes a distribution's entropy rather than its variance. We expect in general this method to behave more similarly to evolvability search than MaxVar-EES, because over a fixed, bounded region of support in behavior space, the maximum-entropy distribution of behavior is a uniform distribution, which also would maximize evolvability search's fitness criterion, i.e. to maximize the amount of distinct behaviors displayed by offspring. To estimate entropy, we first compute a kernel-density estimate of the distribution of behavior for a chosen kernel function φ, giving

p(B(z); θ) ≈ E_z'[φ(B(z') − B(z))],  (2)

which can be applied to derive a loss function that estimates the entropy:

J(θ) = −E_z[log E_z'[φ(B(z') − B(z))]].  (3)

In practice, these losses are differentiated with PyTorch (MaxEnt-EES in particular depends on the nested SCGs described earlier), and both the losses and their gradients are then estimated from samples (recall that these samples all come from the current population distribution, and information from all samples is integrated to estimate the direction of improving evolvability for the population as a whole). The gradient estimates are then applied to update the population distribution, enabling a generation of evolution.

4.2 Proof of Feasibility

It may at first seem unreasonable to expect that Evolvability ES could even in theory optimize the objective set out for it, i.e. to uncover a particular region of DNN weight-space such that mutations can produce the arbitrarily different mappings from input to output necessary to demonstrate diverse behaviors. In addition to the empirical results in this paper, we also prove by construction that such parts of the search space do exist. Specifically, we show that given any two continuous policies, it is possible to construct a five-layer network from which random mutations result in "flipping" between the two initial policies. This construction is described in supplemental section S6.

5 EXPERIMENTS

5.1 Interference Pattern Task

To help validate and aid understanding of the evolvability ES approach, consider the interference pattern in figure 1a, generated as the product of a high-frequency and a low-frequency sine wave. In this figure, the horizontal axis represents a single-dimensional parameter space, and the vertical axis represents the single-dimensional BC. In this task, evolvability is maximized at the maximum of the slower sine wave, because this is the point at which the amplitude of the faster sine wave is greatest, meaning that mutation will induce the widest distribution of vertical change. Figure 1 shows that both variants of evolvability ES approached the optimal value in parameter space as training progressed. The next sections demonstrate that the algorithm also successfully scales to more complex domains.

5.2 2-D Locomotion Task

We next applied evolvability ES to a physically-simulated robotics domain. In this 2-D locomotion task, a NN policy with 74,246 parameters controlled the "Half-Cheetah" planar robot from the PyBullet simulator [8]; this simulation and NN architecture are modeled after the Half-Cheetah deep RL benchmark of Finn et al. [11].
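As an illustration of the evaluation loop shared by standard ES and evolvability ES in this domain, a sampled policy can be rolled out and reduced to its behavior characterization (its final horizontal offset) roughly as follows. The environment handle and the get_x_position readout are placeholders rather than specific PyBullet APIs, and the 1,000-step episode length follows the supplemental experimental details.

```python
import numpy as np

def behavior_characterization(policy, env, get_x_position, num_steps=1000):
    """Roll out one sampled policy and return its behavior characterization:
    the robot's final horizontal offset after a fixed number of steps.

    `env` is a Gym-style environment and `get_x_position(env)` stands in for
    however the simulator exposes the robot's x coordinate (an assumption)."""
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs)                         # NN outputs interpreted as joint torques
        obs, _reward, done, _info = env.step(action)
        if done:
            break
    return np.array([get_x_position(env)])           # 1-D BC for the 2-D task
```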
The NN policy received as input 26 variables consisting of the robot’s position, velocity, and the angle and angular velocity of its joints; the policy controlled the robots through NN outputs that are interpreted as torques applied to each of the robot’s 6 joints. In this domain, we characterized behavior as the final horizontal offset of the robot after a fixed number of simulation steps. All population sizes were 10, 000; other hyperparameters and algorithmic details can be found in supplemental section S1. Figure 2 shows the distribution of behaviors in the population over training time for standard ES as well as both evolvability ES variants. First, these results show that, perhaps surprisingly, the results from the interference pattern task generalize to domains with complex dynamics and large NNs. In particular, evolvability ES discovered policies such that small mutations resulted in diametrically opposed behaviors, i.e. approximately half of all mutations caused the robot to move left, while the others resulted in it moving right. Videos of evaluations of these policies (available at http://t.uber.com/evolvabilityes) show the striking difference between the rightward and leftward behaviors, highlighting the non-triviality in “fitting” both of these policies into the same NN. Interestingly, both evolvability ES variants produced similar distributions of behavior, even though MaxEnt-EES’s objective is explicitly intended to incentivize more uniform distributions of behavior. Supplemental figure S1 shows raw locomotion ability over training in this domain, i.e. the mean distance from the origin randomly sampled policies travel. Note that this is the metric optimized by standard ES (and does not reflect evolvability), and standard ES policies indeed moved further on average than both MaxVar- and MaxEnt-EES (p < 0.05, 12 runs; all tests unless otherwise specified are Mann-Whitney U Tests). However, on average, the MaxVar- and MaxEnt-EES variants yielded policies which moved 92.1% and 93.9% as far on average as those found by standard ES, so the performance price for evolvability is minimal. There was no significant difference between MaxVar- and MaxEnt-EES (p > 0.1, 12 runs). 5.2.1 Comparison to MAML. Evolvability ES takes inspiration from MAML’s ability to successfully find policies near in parameter space to diverse behaviors [11] and to (like evolvability-inspired methods) adapt quickly to new circumstances. Two questions naturally arise: (1) how does evolvability ES compare to MAML in Evolvability ES: Scalable and Direct Optimization of Evolvability GECCO ’19, July 13–17, 2019, Prague, Czech Republic 8 8 6 6 0 Genome 2 Genome Behavior 4 4 4 −2 2 2 −4 0 5 10 15 0 500 Genome (a) Interference Pattern 1000 1500 Generation 2000 (b) MaxVar Evolvability ES 0 500 1000 1500 Generation 2000 (c) MaxEnt Evolvability ES 103 102 101 Number of Pseudo-Offspring Figure 1: Interference Pattern Results. In the interference pattern task, a genome consisted of a single floating point parameter (x), and the resulting behavior was generated as a function of the genome parameter by: f (x) = 5 sin(x/5) sin(20x). (a) Shows the plot of behavior (vertical axis) as a function of genome (horizontal axis). The training plots shown in (b) and (c) validate both evolvablity ES variants, showing that they converged to the point with behavior that is most sensitive to perturbations, shown as a dashed line in all three plots. 100 Figure 2: Distribution of behaviors across evolution in the 2-D locomotion domain. 
Heat-maps of the final horizontal positions of 10,000 policies sampled from the population distribution are shown for each generation over training time for representative runs of each method. These plots suggest that both evolvability ES methods discovered policies that travel nearly as far both backwards and forwards as standard ES travels forward alone (note that standard ES is rewarded for traveling forward). enabling fast adaptation, and (2) is MAML itself drawn towards evolvable parts of the search space, i.e. are mutations of MAML solutions similar to those of evolvability ES solutions? To address these questions, MAML was trained in the 2-D locomotion task. Recall that MAML learns a policy which can readily adapt to a new task through further training. MAML, unlike evolvability ES, requires a distribution of training tasks to adapt to rather than a BC that recognizes distinct behaviors. MAML’s task distribution consisted of two tasks of equal probability, i.e. to walk left or right. MAML additionally makes use of per-timestep rewards (rather than per-evaluation information as in EC), given here as the distance travelled that timestep in the desired direction. All such details were similar to MAML’s canonical application to this domain [11]; see supplemental section S1 for more information. To explore how evolvability ES compares to MAML in enabling fast adaptation, we iteratively chose tasks, here directions to walk in, and allowed each algorithm to adapt to each given task. For evolvability ES, when given a task to adapt to, the central individual was mutated 40 times, and the mutation performing best was selected (and its reported performance calculated from 10 separate evaluations with different random seeds). For MAML, in an RL environment like the 2-D locomotion task, first the unmodified learned policy was evaluated in a new task (many times in parallel to reduce noise in the gradient), and then a policy-gradient update modified the MAML solution to solve the new task. During such test-time adaptation, MaxEnt-EES performed statistically significantly better than MAML (p < 0.005, 12 runs), walking further on average in the desired direction after mutation and selection than MAML after adaptation. There was no significant difference between MaxVar-EES and MAML (p > 0.1, 12 runs), and MaxEntEES was significantly better than MaxVar-EES (p < 0.05, 12 runs). The conclusion is that evolvability ES in this domain is competitive with the leading gradient-based approach for meta-learning. To compare what areas of the search space are discovered by MAML and evolvability ES, we explored what distribution of behaviors resulted from applying random mutations to the solutions produced by MAML. First, we subjected MAML policies to mutations with the same distribution as that used by evolvability ES, GECCO ’19, July 13–17, 2019, Prague, Czech Republic and then recorded the final positions of each of these offspring policies. Figure 3 shows the resulting distributions of behaviors for representative runs of MAML and both EES-variants. Qualitatively, the MAML policies are clumped closer to the origin (i.e. they typically move less far), and have less overall spread (i.e. their variance is lower). 
Both of these observations are statistically significant over 12 runs of each algorithm: The mean final distance from the origin for both evolvability ES variants was higher than for MAML (p < 0.00005 for both), and the variance of the distribution of final distances from the origin was higher for both evolvability ES variants than for MAML (p < 0.00005 for both). We also tested smaller perturbations of the MAML policy, under the hypothesis that gradient steps may be smaller than evolvability ES mutations, but found similar results; see supplemental figure S5. These two results together suggest that evolvability ES does indeed find more evolvable representations than MAML. Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman and rewarded the agent for walking as far as possible in a particular direction, here, for simplicity, the positive x direction. Figure 6 shows a heat-map of behavior characteristics for the pre-trained populations produced by MaxEnt-EES, as well as heatmaps following up to twenty standard ES updates (for a similar MaxVar-EES plot see supplemental figure S3). The population successfully converged in accordance with the new selection pressure, and importantly, it adapted much more quickly than randomly initialized population, as shown in figure 5. Indeed, further standard ES training of both variants of evolvability ES resulted in policies which moved statistically significantly further in the positive x direction than a random initialization after 5, 10, and 20 generations of adaptation (p < 0.0005, 12 runs). There was no significant difference between MaxVar- and MaxEnt-EES (p > 0.1, 12 runs). 5.4 5.3 3-D Locomotion Task The final challenge domain applied evolvability ES to a more complex 3-D locomotion task that controls a four-legged robot; here evolvability ES aimed to produce a genome with offspring that can locomote in any direction. In particular, in the 3-D locomotion task, a NN policy with 75, 272 parameters controlled the “Ant” robot from the PyBullet simulator [8], modeled after the domain in the MuJoCo simulator [41] that serves as a common benchmark in deep RL [11]. The 28 variables the policy received as inputs consisted of the robot’s position and velocity, and the angle and angular velocity of its joints; the NN output torques for each of the ant’s 8 joints. In this domain, a policy’s BC was its final (x, y) position. Figure 4 shows the population of behaviors after training (note that standard ES is rewarded for traveling forward). Interestingly, both evolvability ES variants formed a ring of policies, i.e. they found parameter vectors from which nearly all directions of travel were reachable through mutations. For evolvability ES there were few behaviors in the interior of the ring (i.e. few mutations were degenerate), whereas standard ES exhibited a trail of reduced performance (i.e. its policy was less robust to mutation). Supporting this observation, for standard ES more policies ended their evaluation less than 5 units from the origin than did those sampled from either variant of evolvability ES (p < 0.01, 12 runs). Supplemental figure S2 shows locomotion ability across training. As in the 2-D locomotion domain, standard ES policies moved further on average than MaxVar- and MaxEnt-EES (p < 0.05, 12 runs). 
However, the MaxVar- and MaxEnt-EES variants yielded policies which moved on average 85.4% and 83.2% as far as those from standard ES, which strikes a reasonable trade-off given the diversity of behaviors uncovered by evolvability ES relative to the single policy of standard ES. There was no significant difference in performance between MaxVar- and MaxEnt-EES (p > 0.1, 12 runs). 5.3.1 Seeding Further Evolution. While previous experiments demonstrate that evolvability ES enables adaptation to new tasks without further evolution, a central motivation for evolvability is to accelerate evolution in general. As an initial investigation of evolvability ES’s ability to seed further evolution, we used trained populations from evolvability ES as initializations for standard ES, Optimizing Population-Level Evolvability Finally, we explore (in the 3-D locomotion domain) the potential to evolve what Wilder and Stanley [46] term population-level evolvability, i.e. to consider evolvability as the sum of behaviors reachable from possible mutations of all individuals within a population; this is contrast to the individual evolvability targeted by evolvability search and evolvability ES as described so far (both reward behavioral diversity accessible from a single central individual). E.g. in nature different organisms are primed to evolve in different directions, and one would not expect a dandelion to be able to quickly adapt into a dinosaur, though both may be individually highlyevolvable in certain phenotypic dimensions. Indeed, for some applications, allowing for many separatelyevolvable individuals might be more effective than optimizing for a single evolvable individual, e.g. some environments may entail too many divergent phenotypes for the neighborhood around a single central individual to encompass them all. Interestingly, the idea of joint optimization of many individuals for maximizing evolvability is naturally compatible with evolvability ES. The main insight is recognizing that the population distribution need not be unimodal; as an initial investigation of this idea, we experimented with a population distribution that is the mixture of two unimodal distributions, each representing a different species of individuals. As a proof of concept, we compared two multi-modal variants of evolvability ES, where the different modes of a Gaussian Mixture Model (GMM; [25]) distribution can learn to specialize. In the first variant (vanilla GMM), we equipped Evolvability ES with a multimodal GMM population, where each mode was equally likely and was separately randomly initialized. In the second variant (splitting GMM), we first trained a uni-modal population (as before) with evolvability ES until convergence, then seeded a GMM from it, by initializing the new component means from independent samples from the pre-trained uni-modal population (to break symmetry). The second variant models speciation, wherein a single species splits into subspecies that can further specialize. A priori, it is unclear whether the different modes will specialize into different behaviors, because the only optimization pressure is for the union of both modes to cover a large region of behavior space. However, one might expect that more total behaviors be covered if the species do specialize. Evolvability ES: Scalable and Direct Optimization of Evolvability GECCO ’19, July 13–17, 2019, Prague, Czech Republic Figure 3: Distribution of behaviors compared to MAML in the 2-D locomotion domain. 
Histograms of the final x positions of 1,000 policies sampled from the final population distribution are shown each variant of evolvability ES, as well as for perturbations of MAML solutions of the same size as those of evolvability ES. These plots suggest that both evolvability ES methods discovered more evolvable regions of parameter space than did MAML. 6 × 100 4 × 100 3 × 100 2 × 100 Number of Pseudo-Offspring 101 100 Figure 4: Distribution of behaviors in the final population in the 3-D locomotion domain. Shown are heat-maps (taken from representative runs) of the final positions of 10,000 policies sampled from the population distribution at generation 100 for each of (a) standard ES, (b) MaxVar-EES, and (c) MaxEnt-EES. These plots suggest that both evolvability ES variants successfully found policies which moved in many different directions, and roughly as far as standard ES traveled in the positive x direction alone. Standard MaxVar MaxEnt Mean Final X 30 Figure 7 and supplemental figure S4 show the qualitative performance of these two variants. These figures first show that specialization did emerge in both variants, without direct pressure to do so, although there was no significant difference in how the speciated version covered the space relative to individual-based evolability ES (p > 0.1, 6 runs). This result shows that intriguingly it is possible to directly optimize population-level evolvability, but its benefit may come only from domains with richer possibilities for specialization (e.g. an environment in which it is possible to either swim or fly), a direction to be explored in future work. 20 10 0 0 5 10 Generation 15 Figure 5: Adaptation performance. The plot compares the performance of standard ES adapting to a new behavior when seeded with a random initialization, or initializations from completed runs of MaxVar- and MaxEnt-EES. Mean and standard deviation over 12 runs shown. Populations trained by both methods of Evolvability ES evolved more quickly than randomly initialized populations. 6 DISCUSSION AND CONCLUSION This paper’s results highlight that, surprisingly, it is possible to directly, efficiently, and successfully optimize evolvability in the space of large neural networks. One key insight enabling such efficient optimization of evolvability is that there exist deep connections between evolvability and parameter-space meta-learning algorithms like MAML, a gradient-based approach to meta-learning popular in traditional ML. These results suggest that not only may Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman 101 Number of Pseudo-Offspring GECCO ’19, July 13–17, 2019, Prague, Czech Republic 100 101 Number of Pseudo-Offspring Figure 6: Distribution of behaviors during adaptation in the 3-D locomotion domain. Heat-maps are shown of the final positions of 10,000 policies sampled from the population distribution initialized with MaxEnt-EES, and adapted to move in the positive x direction with standard ES over several generations. These plots suggest that MaxEnt-EES successfully found policies that could quickly adapt to perform new tasks. See supplemental figure S3 for the MaxVar version. 100 Figure 7: Distribution of behaviors for uni- and multi-modal MaxEnt-EES variants. Heat-maps are shown of the final positions of 10,000 policies sampled from the population at generation 100 of MaxEnt-EES, with (a) one component and (b) a vanilla GMM with two components. 
Also shown is the result of (c) splitting the trained single component into two components, and evolving for 20 additional generations. These plots suggest that both bi-modal variants performed similarly. See supplemental figure S4 for the MaxVar version. MAML-like algorithms be used to uncover evolvable policies, evolutionary algorithms that in general incline towards evolvability may serve as a productive basis for meta-learning. That is, evolutionary meta-learning has largely focused on plastic neural networks [12, 37, 39], but evolvable genomes that are primed to quickly adapt are themselves a form of meta-learning (as MAML and evolvability ES demonstrate); such evolvability could be considered instead of, or complementary to, NNs that adapt online. The results of this paper also open up many future research directions. One natural follow-up concerns the use of evolvability as an auxiliary objective to complement novelty-driven ES [7, 9] or objective-driven ES. The intuition is that increased evolvability will catalyze the accumulation of novelty or progress towards a goal. Interestingly, the same samples used for calculating novelty-driven or objective-driven ES updates can be reused to estimate the gradient of evolvability; in other words, such an auxiliary objective could be calculated with no additional domain evaluations. Additionally, the population-level formulation of evolvability ES is itself similar to novelty-driven ES, and future work could compare them directly. Interestingly, while part of Evolvability ES’s origin comes from inspiration from gradient-based ML (i.e. MAML), it also offers the opportunity to inspire new gradient-based algorithms: A reformulation of the evolvability ES loss function could enable a policy gradients version of evolvability ES which exploits the differentiability of the policy to decrease the variance of gradient estimates (supplemental section S5 further discusses this possibility). The conclusion is that Evolvability ES is a new and scalable addition to the set of tools for exploring, encouraging, and studying evolvability, one we hope will continue to foster cross-pollination between the EC and deep RL communities. Evolvability ES: Scalable and Direct Optimization of Evolvability REFERENCES [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org. [2] Lee Altenberg et al. The evolution of evolvability in genetic programming. Advances in genetic programming, 3:47–74, 1994. [3] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies: A comprehensive introduction. Natural Computing, 1:3–52, 2002. [4] J.F.Y Brookfield. Evolution: The evolvability enigma. Current Biology, 11(3):R106 – R108, 2001. ISSN 0960-9822. doi: DOI:10.1016/S0960-9822(01)00041-0. [5] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. 
Back to basics: Benchmarking canonical evolution strategies for playing atari. arXiv preprint arXiv:1802.08842, 2018. [6] Jeff Clune, Jean-Baptiste Mouret, and Hod Lipson. The evolutionary origins of modularity. Proc. R. Soc. B, 280(1755):20122863, 2013. [7] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems 31. 2018. [8] E Coumans and Y Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016. [9] Giuseppe Cuccu and Faustino Gomez. When novelty is not enough. In European Conference on the Applications of Evolutionary Computation, pages 234–243. Springer, 2011. [10] M. Ebner, M. Shackleton, and R. Shipman. How neutral networks influence evolvability. Complexity, 7(2):19–33, 2001. ISSN 1099-0526. [11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017. [12] Dario Floreano and Francesco Mondada. Evolution of plastic neurocontrollers for situated agents. In Proc. of The Fourth International Conference on Simulation of Adaptive Behavior (SAB), From Animals to Animats. ETH Zürich, 1996. [13] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. CoRR, abs/1706.10295, 2017. URL http://arxiv.org/abs/1706.10295. [14] Michael C. Fu. Chapter 19 gradient estimation. Simulation Handbooks in Operations Research and Management Science, 13:575, 2006. doi: 10.1016/s0927-0507(06) 13019-4. [15] J.J. Grefenstette. Evolvability in dynamic fitness landscapes: A genetic algorithm approach. In Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, volume 3. IEEE, 2002. ISBN 0780355369. [16] Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001. [17] Nadav Kashtan, Elad Noor, and Uri Alon. Varying environments can speed up evolution. Proceedings of the National Academy of Sciences, 104(34):13711–13716, 2007. [18] Marc Kirschner and John Gerhart. Evolvability. Proceedings of the National Academy of Sciences, 95(15):8420–8427, 1998. [19] Loizos Kounios, Jeff Clune, Kostas Kouvaris, Günter P Wagner, Mihaela Pavlicev, Daniel M Weinreich, and Richard A Watson. Resolving the paradox of evolvability with learning theory: How evolution learns to improve evolvability on rugged fitness landscapes. arXiv preprint arXiv:1612.05955, 2016. [20] Joel Lehman and Kenneth O Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation, 19(2):189–223, 2011. [21] Joel Lehman and Kenneth O Stanley. Improving evolvability through novelty search and self-adaptation. In IEEE Congress on Evolutionary Computation, pages 2693–2700, 2011. [22] Joel Lehman and Kenneth O Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 211–218. ACM, 2011. [23] Joel Lehman and Kenneth O Stanley. Evolvability is inevitable: Increasing evolvability without the pressure to adapt. 
PloS one, 8(4):e62186, 2013. [24] Joel Lehman and Kenneth O Stanley. On the potential benefits of knowing everything. In Artificial Life Conference Proceedings, pages 558–565. MIT Press, 2018. [25] Geoffrey J. McLachlan, Sharon X. Lee, and Suren I. Rathnayake. Finite mixture models. Annual Review of Statistics and Its Application, 6(1):355–378, 2019. doi: 10.1146/annurev-statistics-031017-100325. URL https://doi.org/10.1146/ annurev-statistics-031017-100325. GECCO ’19, July 13–17, 2019, Prague, Czech Republic [26] Henok Mengistu, Joel Lehman, and Jeff Clune. Evolvability search. Proceedings of the 2016 on Genetic and Evolutionary Computation Conference - GECCO 16, 2016. doi: 10.1145/2908812.2908838. [27] Thomas Miconi, Jeff Clune, and Kenneth O Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. arXiv preprint arXiv:1804.02464, 2018. [28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. [29] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Innovation engines: Automated creativity and improved stochastic optimization via deep learning. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 959–966. ACM, 2015. [30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017. [31] M. Pigliucci. Is evolvability evolvable? Nature Reviews Genetics, 9(1):75–82, 2008. ISSN 1471-0056. [32] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016. [33] Joseph Reisinger, Kenneth O. Stanley, and Risto Miikkulainen. Towards an empirical measure of evolvability. In Genetic and Evolutionary Computation Conference (GECCO2005) Workshop Program, pages 257–264, Washington, D.C., 2005. ACM Press. URL http://nn.cs.utexas.edu/?reisinger:gecco05. [34] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017. [35] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. CoRR, abs/1506.05254, 2015. URL http://arxiv.org/abs/1506.05254. [36] Hans-Paul Paul Schwefel. Evolution and optimum seeking: the sixth generation. John Wiley & Sons, Inc., 1993. [37] Andrea Soltoggio, John A Bullinaria, Claudio Mattiussi, Peter Dürr, and Dario Floreano. Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Proceedings of the 11th international conference on artificial life (Alife XI), number LIS-CONF-2008-012, pages 569–576. MIT Press, 2008. [38] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99–127, 2002. [39] Kenneth O Stanley, Bobby D Bryant, and Risto Miikkulainen. Evolving adaptive neural networks with and without adaptive synapses. In Evolutionary Computation, 2003. CEC’03. The 2003 Congress on, volume 4, pages 2557–2564. IEEE, 2003. [40] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 
Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017. [41] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012. doi: 10.1109/IROS.2012.6386109. [42] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of metalearning. Artificial Intelligence Review, 18(2):77–95, 2002. [43] Günter P Wagner and Lee Altenberg. Perspective: complex adaptations and the evolution of evolvability. Evolution, 50(3):967–976, 1996. [44] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016. [45] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, and JÃijrgen Schmidhuber. Natural evolution strategies, 2011. [46] Bryan Wilder and Kenneth Stanley. Reconciling explanations for the evolution of evolvability. Adaptive Behavior, 23(3):171–179, 2015. GECCO ’19, July 13–17, 2019, Prague, Czech Republic Alexander Gajewski, Jeff Clune, Kenneth O. Stanley, and Joel Lehman 40 40 30 20 Standard MaxVar MaxEnt 10 0 0 20 40 60 Generation 80 Max Final Distance Max Final Distance 50 Included in the supplemental information are experimental details and hyperparameters for all algorithms (section S1); a description of stochastic computation graphs and how we extended them (sections S2 and S3); and the particular stochastic computation graphs that enable calculating loss and gradients for ES and Evolvability ES (section S4). S1 EXPERIMENTAL DETAILS This section presents additional plots useful for better understanding the performance of evolvability ES, as well as further details and hyperparameters for all algorithms. S1.1 Additional Plots Figures S1 and S2 display the mean distance from the origin of both standard ES and both variants of evolvability ES, on the 2D and 3D locomotion tasks respectively. Figure S3 highlights how the distribution of behaviors changes during meta-learning test-time to quickly adapt to the task at hand for MaxVar-EES (see figure 6 for the MaxEnt version). Figure S4 contrasts the two variants of multi-modal MaxVar-ES (see figure 7 for the MaxEnt version). Figure S5 compares perturbations of central evolvability ES policies to perturbations of MAML policies with a standard deviation 4 times smaller than that of EES. 20 10 0 20 40 60 Generation 80 100 Figure S2: 3-D locomotion ability across training. This plot compares raw locomotion ability of policies samples from standard ES, MaxVar-EES, and MaxEnt-EES. Each curve shows the mean final distance from the origin over 10,000 samples from the population distribution during training. Mean and standard deviation over 12 runs shown. While standard ES learned its single forward-moving policy more quickly, this plot highlights that both evolvability ES variants found policies which moved almost as far as standard ES on the ant domain, despite encoding policies that move in many more directions. S1.2 SUPPLEMENTAL INFORMATION 30 0 100 Figure S1: 2-D locomotion ability across training. This plot compares raw locomotion ability of policies sampled from standard ES, MaxVar-EES, and MaxEnt-EES during training. Each curve plots the mean final distance from the origin over 10,000 samples from the population distribution. 
The error bars indicate standard deviation over the 12 training runs. While standard ES learned slightly faster, this plot shows that both evolvability ES variants found policies which moved almost as far as standard ES in this domain, despite encoding both forwards and backwards-moving policies. Standard MaxVar MaxEnt Hyperparameters and Training Details For standard ES, fitness was rank-normalized to take values symmetrically between −0.5 and 0.5 at each generation before computing gradient steps. For both variants of evolvability ES, BCs were whitened to have mean zero and a standard deviation of one at each generation before computing losses and gradients. This was done instead of rank normalization in order to preserve density information for variance and entropy estimation. A Gaussian kernel with standard deviation 1.0 was used for the MaxEnt-EES to estimate the density of behavior given samples from the population distribution. S1.2.1 Interference Pattern Details. The interference pattern was generated by the function x (4) f (x) = 5 sin sin 20x . 5 Hyperparameters for the interference pattern task are shown in Tables S1 and S2. Hyperparameter Learning Rate Population Standard Deviation Population Size Setting 0.03 0.5 500 Table S1: MaxVar Hyperparameters: Interference Pattern Task. GECCO ’19, July 13–17, 2019, Prague, Czech Republic 101 Number of Pseudo-Offspring Evolvability ES: Scalable and Direct Optimization of Evolvability 100 101 Number of Pseudo-Offspring Figure S3: Distribution of behaviors during adaptation in the 3D locomotion domain. Heat-maps of the final positions of 10,000 policies sampled from the population distribution initialized with MaxVar-EES, and adapted to move in the positive x direction with Standard ES over several generations. These plots suggest that MaxVar-EES successfully found policies which could quickly adapt to perform new tasks. See figure 6 for the MaxEnt version. 100 Figure S4: Distribution of behaviors for uni- and multi-modal MaxVar-EES variants. Heat-maps of the final positions of 10,000 policies sampled from the population distribution at generation 100 of MaxVar-EES, with (a) one component and (b) a vanilla GMM with two components. Also shown is the result of splitting the single component in (a) into two components and evolving for 20 additional generations. These plots suggest that both bi-modal variants performed similarly. See figure 7 for the MaxEnt version. Hyperparameter Learning Rate Population Standard Deviation Population Size Kernel Standard Deviation Setting 0.1 0.5 500 1.0 Table S2: MaxEnt Hyperparameters: Interference Pattern Task. S1.2.2 Locomotion Task Details. For the locomotion tasks, all environments were run deterministically and no action noise was used during training. The only source of randomness was from sampling from the population distribution. Policies were executed in the environment for 1, 000 timesteps. For comparison to evolvability ES, the fitness function for standard ES was set to be the final x position of a policy, rewarding standard ES for walking as far as possible in positive x direction. NNs for both 2D and 3D locomotion were composed of two hidden layers with 256 hidden units each, resulting in 166.7K total parameters, and were regularized with L2 penalties during training. 
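For concreteness, a policy of the shape described above might be written as follows; the tanh activations and the 26-input/6-output sizes (the 2-D task) are illustrative assumptions rather than a transcription of the released architecture.

```python
import torch.nn as nn

class LocomotionPolicy(nn.Module):
    """Two hidden layers of 256 units each, as described above; the tanh
    activation and the 2-D-task input/output sizes are assumptions."""
    def __init__(self, obs_dim=26, act_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)
```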
Inputs to the networks were normalized to have mean zero and a standard deviation of one, based on the mean and standard deviation of the states seen in a random subset of all training rollouts, with each rollout having probability 0.01 of being sampled. Experiments were performed on a cluster system and were distributed across a pool of 550 CPU cores shared between two runs of the same algorithm. Each run took approximately 5 hours to complete.

Figure S5: Distribution of behaviors compared to MAML in the 2-D locomotion domain. Histograms of the final x positions of 1,000 policies sampled from the final population distribution are shown for each variant of evolvability ES, as well as for perturbations of MAML policies 4 times smaller than those of evolvability ES. Smaller perturbations of MAML did not change the fundamental result (and interestingly eliminated the potential of generating far-right-walking policies).

Hyperparameters for both variants of evolvability ES are shown in Tables S3 and S4.

Table S3: MaxVar Hyperparameters: Locomotion Tasks.
Hyperparameter                   Setting
Learning Rate                    0.01
Population Standard Deviation    0.02
Population Size                  10,000
L2 Regularization Coefficient    0.05

Table S4: MaxEnt Hyperparameters: Locomotion Tasks.
Hyperparameter                   Setting
Learning Rate                    0.01
Population Standard Deviation    0.02
Population Size                  10,000
L2 Regularization Coefficient    0.05
Kernel Bandwidth                 1.0

Table S5: MAML Hyperparameters.
Hyperparameter                   Setting
Adaptation Learning Rate         0.1
Conjugate Gradient Damping       1e-5
Conjugate Gradient Iterations    10
Max Number of Line Search Steps  15
Line Search Backtrack Ratio      0.8
Discount Factor                  0.99
Adaptation Batch Size            10
Outer Batch Size                 40

S1.2.3 MAML Details. The MAML algorithm was run on the 2-D locomotion task using the same fully-connected neural network with 2 hidden layers of 256 hidden units each as that used in evolvability ES. There are two main differences between our usage of MAML and typical usage in past work (e.g. in Finn et al. [11]). First, we used the PyBullet simulator [8] as opposed to the more canonical MuJoCo simulator [41]. Second, locomotion tasks typically have an associated energy penalty that supplements the standard distance-based reward function. Because it is unclear how to incorporate an energy penalty into a (potentially vector-valued) behavior characteristic, we used no energy penalty in our evolvability ES experiments. Consequently, for a fairer comparison we also used no energy penalty in our MAML experiments. Table S5 displays additional hyperparameters for MAML.

S2 STOCHASTIC COMPUTATION GRAPHS

A stochastic computation graph, as defined in Schulman et al. [35], is a directed acyclic graph consisting of fixed input nodes, deterministic nodes representing functions of their inputs, and stochastic nodes representing random variables distributed conditionally on their inputs. A stochastic computation graph G represents the expectation (over its stochastic nodes $\{z_i\}$) of the sum of its output nodes $\{f_i\}$, as a function of its input nodes $\{x_i\}$:
$$G(x_1, \ldots, x_l) = \mathbb{E}_{z_1, \ldots, z_m}\!\left[\sum_{i=1}^{n} f_i\right] \quad (5)$$
For example, consider the stochastic computation graph in figure S6, reproduced from [35]. This graph G represents the expectation
$$G(x_0, \theta) = \mathbb{E}_{x_1, x_2}\bigl[f_1(x_1) + f_2(x_2)\bigr], \quad (6)$$
where $x_1 \sim p(\cdot\,; x_0, \theta)$ and $x_2 \sim p(\cdot\,; x_1, \theta)$.
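As a concrete (if toy) illustration of equation (6), the sketch below forms a Monte Carlo estimate of $G(x_0, \theta)$ for the example graph. The conditional distribution p and the functions $f_1$ and $f_2$ are left abstract in the text, so Gaussian conditionals and simple nonlinearities are assumed here purely for illustration.

import torch

# Placeholder choices: the example graph leaves p, f1, and f2 abstract, so we
# assume Gaussian conditionals and simple toy functions for illustration.
def f1(x): return x ** 2
def f2(x): return torch.sin(x)

def estimate_G(x0, theta, n_samples=10_000):
    # x1 ~ p(.; x0, theta) and x2 ~ p(.; x1, theta), modeled here as Gaussians
    x1 = torch.distributions.Normal(x0 + theta, 1.0).sample((n_samples,))
    x2 = torch.distributions.Normal(x1 + theta, 1.0).sample()
    # Monte Carlo estimate of E[f1(x1) + f2(x2)] as in equation (6)
    return (f1(x1) + f2(x2)).mean()

print(estimate_G(torch.tensor(0.5), torch.tensor(1.0)))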
A key property of stochastic computation graphs is that they may be differentiated with respect to their inputs. Using the score function estimator [14], we have that
$$\nabla_\theta G(x_0, \theta) = \mathbb{E}_{x_1, x_2}\bigl[\nabla_\theta \log p(x_1; \theta, x_0)\,(f_1(x_1) + f_2(x_2)) + \nabla_\theta \log p(x_2; \theta, x_1)\,f_2(x_2)\bigr]. \quad (7)$$

Figure S6: Example stochastic computation graph: input nodes are depicted with no border, deterministic nodes with square borders, and stochastic nodes with circular borders.

Schulman et al. [35] also derive surrogate loss functions for stochastic computation graphs, allowing for implementations of stochastic computation graphs with existing automatic differentiation software. For example, given a sample $\{x_1^i\}_{1 \le i \le N}$ of $x_1$ and $\{x_2^i\}_{1 \le i \le N}$ of $x_2$, we can write
$$\hat{L}(\theta) = \frac{1}{N} \sum_i \log p(x_1^i; \theta, x_0)\,\bigl(f_1(x_1^i) + f_2(x_2^i)\bigr) + \log p(x_2^i; \theta, x_1^i)\,f_2(x_2^i). \quad (8)$$
Now to estimate $\nabla_\theta G(x_0, \theta)$, we have
$$\nabla_\theta G(x_0, \theta) \approx \nabla_\theta \hat{L}(\theta), \quad (9)$$
which may be computed with popular automatic differentiation software.

S3 NESTED STOCHASTIC COMPUTATION GRAPHS

We make two changes to the stochastic computation graph formalism:
(1) We add a third type of node which represents the expectation over one of its parent stochastic nodes of one of its inputs. We require that a stochastic node representing a random variable z be a dependency of exactly one expectation node over z, and that every expectation node over a random variable z depend on a stochastic node representing z.
(2) Consider a stochastic node representing a random variable z conditionally dependent on a node y. Rather than expressing this as a dependency of z on y, we represent this as a dependency of the expectation node over z on y. Formally, this means all stochastic nodes are required to be leaves of the computation graph.
Because "nested stochastic computation graphs," as we term them, contain their expectations explicitly, they simply represent the sum of their output nodes (instead of the expected sum of their output nodes, as with regular stochastic computation graphs).

Figure S7: Example nested stochastic computation graph: input nodes are depicted with no border, deterministic nodes with square borders, stochastic nodes with circular borders, and expectation nodes with double elliptical borders.

As an example, consider the nested stochastic computation graph G depicted in figure S7. First, note that G is indeed a nested stochastic computation graph, because the stochastic nodes and expectation nodes correspond, and because all stochastic nodes are leaves of the graph. Next, note that G is equivalent to the stochastic computation graph of figure S6 in the sense that it computes the same function of its inputs:
$$G(x_0, \theta) = \mathbb{E}_{x_1}\bigl[f_1(x_1) + \mathbb{E}_{x_2}[f_2(x_2)]\bigr] \quad (10)$$
$$= \mathbb{E}_{x_1, x_2}\bigl[f_1(x_1) + f_2(x_2)\bigr] \quad (11)$$
The original stochastic computation graph formalism has the advantage of more clearly depicting conditional relationships, but the new formalism has two advantages:
(1) Nested stochastic computation graphs can represent arbitrarily nested expectations. We have already seen this in part with the example of figure S7, and we shall see it more clearly in the following sections.
(2) It is trivial to define surrogate loss functions for nested stochastic computation graphs.
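To make the surrogate-loss recipe of equations (8) and (9) concrete before moving on to nested graphs, the following sketch (reusing the placeholder Gaussian conditionals and toy $f_1$, $f_2$ assumed in the previous sketch) builds the surrogate loss and differentiates it with automatic differentiation; only the log-probability terms carry the dependence on $\theta$.

import torch

def f1(x): return x ** 2
def f2(x): return torch.sin(x)

x0 = torch.tensor(0.5)
theta = torch.tensor(1.0, requires_grad=True)
N = 10_000

# Draw samples of x1 and x2 without tracking gradients through the sampler
with torch.no_grad():
    x1 = torch.distributions.Normal(x0 + theta, 1.0).sample((N,))
    x2 = torch.distributions.Normal(x1 + theta, 1.0).sample()

# Surrogate loss of equation (8): the log-probabilities depend on theta
logp1 = torch.distributions.Normal(x0 + theta, 1.0).log_prob(x1)
logp2 = torch.distributions.Normal(x1 + theta, 1.0).log_prob(x2)
L_hat = (logp1 * (f1(x1) + f2(x2)) + logp2 * f2(x2)).mean()

L_hat.backward()   # equation (9): the gradient of L_hat estimates the gradient of G
print(theta.grad)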
Moreover, these surrogate loss functions have the property that, in the forward pass, they estimate the true loss function.

Consider a nested stochastic computation graph G with input nodes $\{\theta\} \cup \{x_i\}$, and suppose we wish to compute the gradient $\nabla_\theta G(\theta, x_1, \ldots, x_n)$. We would like to be able to compute the gradient of any node with respect to any of its inputs, as this would allow us to use the well-known backpropagation algorithm to compute $\nabla_\theta G$. Unfortunately, it is often impossible to write the gradient of an expectation in closed form; we shall instead estimate $\nabla_\theta G$ given a sample from the stochastic nodes of G.

Suppose G has an output node $\mathbb{E}_z[f]$, the only expectation node in G. Suppose moreover that $\mathbb{E}_z[f]$ has inputs $\{\xi_i\}$ (apart from f), so $z \sim p(\cdot\,; \xi_1, \ldots, \xi_m)$, and suppose f has inputs $\{y_i\}$. Note that to satisfy the definition of a nested stochastic computation graph, f must ultimately depend on z, so we write f as $f(y_1, \ldots, y_l; z)$. See figure S8 for a visual representation of G.

Figure S8: Nested stochastic computation graph with a single expectation node.

Figure S9: Nested stochastic computation graph representing Natural Evolution Strategies.

If we wish to compute $\nabla_\omega \mathbb{E}_z[f(y_1, \ldots, y_l; z)]$, using the likelihood ratio interpretation of the score function [14] and given a sample $\{z_i\}_{1 \le i \le N}$ of z, we can write
$$\hat{L}(\omega) = \frac{1}{N} \sum_i f(y_1, \ldots, y_l; z_i)\, L(z_i), \quad (12)$$
where L is the likelihood ratio given by
$$L(z_i) = \frac{p(z_i; \xi_1, \ldots, \xi_m)}{p(z_i; \xi'_1, \ldots, \xi'_m)}, \quad (13)$$
and setting $\xi'_i = \xi_i$ gives
$$\nabla_\omega L(z_i) = \frac{\nabla_\omega p(z_i; \xi_1, \ldots, \xi_m)}{p(z_i; \xi_1, \ldots, \xi_m)} \quad (14)$$
$$= \nabla_\omega \log p(z_i; \xi_1, \ldots, \xi_m). \quad (15)$$
Note that f can depend on $\omega$ either directly, if $\omega \in \{y_i\}$, or through the distribution of z, if $\omega \in \{\xi_i\}$. Differentiating, we have
$$\nabla_\omega \hat{L}(\omega) = \frac{1}{N} \sum_i f(y_1, \ldots, y_l; z_i)\,\nabla_\omega L(z_i) + \nabla_\omega f(y_1, \ldots, y_l; z_i)\, L(z_i), \quad (16)$$
and setting $\xi'_i = \xi_i$ we see that $\nabla_\omega \hat{L}(\omega)$ is an estimate of
$$\nabla_\omega \mathbb{E}_z[f(y_1, \ldots, y_l; z)]. \quad (17)$$

Generalizing this trick to an arbitrary nested stochastic computation graph G, we see that creating a surrogate loss function $\hat{L}$ is as simple as replacing each expectation node with a sample mean as in Equation 12, weighted by the likelihood ratio $L(z_i)$. Note that since $L(z_i) = 1$ when $\xi'_i = \xi_i$, the surrogate loss is simply the method-of-moments estimate of the true loss. Considering again the graph of figure S7, we can construct a surrogate loss function
$$\hat{L}(\theta, x_0) = \frac{1}{N} \sum_i \Bigl( f_1(x_1^i) + \sum_j f_2(x_2^j)\, L(x_2^j; \theta) \Bigr) L(x_1^i; \theta, x_0). \quad (18)$$
While this may not seem like much of an improvement at first, it is insightful to note how similar the forms of Equations 10 and 18 are. In particular, this similarity makes it straightforward to write a custom "expectation" operation for use in automatic differentiation software which computes the sample mean weighted by the likelihood ratio (a minimal sketch of such an operation is given at the start of section S4 below).

Figure S10: Nested stochastic computation graph representing the loss function of MaxVar-EES.

S4 STOCHASTIC COMPUTATION GRAPHS FOR STANDARD ES AND EVOLVABILITY ES

As mentioned in the main text of the paper, we can estimate the gradients of the standard ES and evolvability ES loss functions with the score function estimator because we can represent these loss functions as (nested) stochastic computation graphs.
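The following is a minimal sketch (not the authors' implementation) of the custom "expectation" operation referenced above, assuming a PyTorch-style autodiff API and a Gaussian sampling distribution: its forward value is the plain sample mean of equation (12) with $L(z_i) = 1$, while its backward pass contributes the score-function gradient of equations (14)-(16).

import torch

def expectation(f_values, log_prob):
    # Likelihood-ratio-weighted sample mean (equation 12).
    # f_values: per-sample values of the node under the expectation (detached
    #           from the sampling step).
    # log_prob: differentiable log-density of each sample under the current
    #           distribution parameters.
    # The ratio equals 1 in the forward pass, so this returns the ordinary
    # sample mean; in the backward pass it supplies the score-function gradient.
    ratio = torch.exp(log_prob - log_prob.detach())
    return (ratio * f_values).mean()

# Toy usage on the single-expectation graph of figure S8, with an assumed
# Gaussian sampling distribution z ~ N(xi, 1) and f(z) = z**2.
xi = torch.tensor(0.3, requires_grad=True)
z = torch.distributions.Normal(xi, 1.0).sample((5000,))
loss = expectation(z ** 2, torch.distributions.Normal(xi, 1.0).log_prob(z))
loss.backward()
print(loss.item(), xi.grad)   # forward ~ E[z^2]; grad ~ d/dxi E[z^2] = 2*xi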
Figure S9 shows a (nested) stochastic computation graph representing the standard ES loss function. This yields the following surrogate loss function for ES:
$$\hat{L}(\theta) = \frac{1}{N} \sum_i f(z_i)\, L(z_i), \quad (19)$$
where $L(z_i)$ is the likelihood ratio defined above. Figure S10 shows a nested stochastic computation graph representing MaxVar-EES, yielding the following surrogate loss function:
$$\hat{L}(\theta) = \frac{1}{N} \sum_i B(z_i)^2\, L(z_i; \theta). \quad (20)$$
Finally, figure S11 shows a nested stochastic computation graph representing the loss function for MaxEnt-EES, yielding the following surrogate loss function:
$$\hat{L}(\theta) = -\frac{1}{N} \sum_i \log\Bigl(\sum_j \varphi\bigl(B(z'_j) - B(z_i)\bigr)\, L(z'_j; \theta)\Bigr) L(z_i; \theta). \quad (21)$$

Figure S11: Nested stochastic computation graph representing the loss function of MaxEnt-EES.

S5 POLICY GRADIENTS EVOLVABILITY ES

With MaxVar-EES, the loss function we wish to optimize is given by:
$$J(\mu) = \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}\, \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t, \theta)}\!\left[\sum_{t=1}^{\tau} \bigl(B_\tau - \mathbb{E}_\theta(B_\tau)\bigr)^2\, \mathbb{1}[t = \tau]\right], \quad (22)$$
where $\tau$ is the length of an episode; the action $a_t$ at time $t$ is sampled from a distribution $\pi(\cdot \mid s_t, \theta)$ depending on the current state $s_t$ and the parameters of the current policy $\theta$; and $B_\tau$ is the behavior at time $\tau$. Evolvability ES uses the following representation of the gradient of this function:
$$\nabla_\mu J(\mu) = \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}\!\left[\sum_{t=1}^{\tau} \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t, \theta)}\bigl[(B_\tau - \mathbb{E}_\theta(B_\tau))^2\bigr]\, \nabla_\mu \log N(\theta; \mu, \sigma^2)\, \mathbb{1}[t = \tau]\right]. \quad (23)$$
Another way to write this same gradient is as follows, using the reparametrization trick:
$$\nabla_\mu J(\mu) = \mathbb{E}_{\epsilon \sim N(0, \sigma^2)}\!\left[\sum_{t=1}^{\tau} \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t, \mu+\epsilon)}\bigl[(B_\tau - \mathbb{E}_\theta(B_\tau))^2\, \nabla_\mu \log \pi(a_t \mid s_t, \mu + \epsilon)\, \mathbb{1}[t = \tau]\bigr]\right]. \quad (24)$$
This representation is reminiscent of the NoisyNet formulation of the policy gradient algorithm [13]. Evolvability ES estimates its representation of the gradient from samples as follows:
$$\nabla_\mu J(\mu) \approx \sum_{k=1}^{m} \sum_{t=1}^{\tau} (B_\tau^k - \bar{B}_\tau)^2\, \mathbb{1}[t = \tau]\, \nabla_\mu \log N(\theta^k; \mu, \sigma^2), \quad (25)$$
where $B_\tau^k$ is the BC of the kth individual, and $\bar{B}_\tau$ is the mean BC among all m individuals. A hypothetical policy gradients version of evolvability ES could estimate its gradient as follows:
$$\nabla_\mu J(\mu) \approx \sum_{k=1}^{m} \sum_{t=1}^{\tau} (B_\tau^k - \bar{B}_\tau)^2\, \mathbb{1}[t = \tau]\, \nabla_\mu \log \pi(a_t \mid s_t, \mu + \epsilon). \quad (26)$$
The formulation above in terms of an explicit sum over $\tau$ timesteps and an indicator on the current timestep makes it easy to extend both of these algorithms to allow for diversity of behavior at multiple timesteps, closer to a traditional RL per-timestep reward function. A small sample-based sketch of the estimator in equation (25) is included at the end of this supplement.

S6 THEORETICAL RESULTS

Theorem S6.1. For every pair of continuous functions $g : [0, 1]^n \to \mathbb{R}$ and $h : [0, 1]^n \to \mathbb{R}$, and for every $\epsilon > 0$, there exists a $\delta_2 > 0$ such that for every $0 < \delta_1 < \delta_2$, there exists a 5-hidden-layer neural network $f(x; W_1, \ldots, W_5) : [0, 1]^n \to \mathbb{R}$ with ReLU activations such that, for any distribution of weight perturbations in which each component is sampled i.i.d. from a distribution with distribution function F, $|f(x; \tilde{W}_1, \ldots, \tilde{W}_5) - g(x)| < \epsilon$ for all $x \in [0, 1]^n$ with probability at least $F(\delta_2) - F(\delta_1)$, and $|f(x; \tilde{W}_1, \ldots, \tilde{W}_5) - h(x)| < \epsilon$ for all $x \in [0, 1]^n$ with probability at least $F(-\delta_1) - F(-\delta_2)$, where $\tilde{W}_l$ are the perturbed weights.

Proof sketch. First, note that g and h can each be approximated arbitrarily well by 2-layer neural networks G and H. Next, construct a 2-layer neural network $\sigma : \mathbb{R} \to \mathbb{R}^2$ with ReLU activations which is uniformly $\delta$-close in each component to the function $x \mapsto (I[x > 0], I[x < 0])$.
Each of these 3 neural networks is uniformly continuous both in the inputs and in the weights. We now connect them into a neural network F. Create a node K in the first hidden layer of F with bias B, and connect G and H to the inputs such that their outputs are in the second hidden layer of F. Also construct a node K' in the second hidden layer of F, and label the weight connecting K and K' as $w_k$. Now place $\sigma$ such that its outputs are in the fourth hidden layer of F. Next, create two nodes $\hat{G}$ and $\hat{H}$ in the 5th hidden layer of F, initialize the weight connecting G and $\hat{G}$ to 1, and that connecting H and $\hat{H}$ to 1. Label the weight connecting $\sigma_1$ to $\hat{G}$ as $w_g$ and that connecting $\sigma_2$ to $\hat{H}$ as $w_h$. Finally, connect $\hat{G}$ and $\hat{H}$ to the output node F with weight 1 each. Initialize all unmentioned weights to 0.

Now pick $\delta_2$ such that, uniformly over perturbations less than $\delta/2$, nodes G and H are $\epsilon_1$-close to their original outputs, and such that nodes $\sigma_1$ and $\sigma_2$ are $\epsilon_1/M$-close to their original outputs, where M is a bound on the activations of nodes g and h. Fix $0 < \delta_1 < \delta_2$. Set B such that every perturbation in $(\delta_1, \delta_2)$ to $w_k$ results in node K being positive over all inputs, and every perturbation in $(-\delta_2, -\delta_1)$ to $w_k$ results in node K being negative over all inputs. We can do this because the inputs to K are bounded. Then for perturbations to $w_k$ in $(\delta_1, \delta_2)$, we can make $\sigma$ $\epsilon_2$-close to (1, 0) (in each component). We can then set $w_g$ and $w_h$ small enough that when $\sigma_1$ is $\epsilon_1/M$-close to 1, node $\hat{G}$ is 0 and node $\hat{H}$ is $\epsilon_3$-close to its original output, and when $\sigma_2$ is $\epsilon_1/M$-close to 1, $\hat{H}$ is 0 and $\hat{G}$ is $\epsilon_3$-close to its original output. Putting everything together, we can choose $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ such that for perturbations to $w_k$ in $(\delta_1, \delta_2)$, F is $\epsilon$-close to g, and for perturbations to $w_k$ in $(-\delta_2, -\delta_1)$, F is $\epsilon$-close to h. □

We hope that future work will strengthen these results in several ways. First, the number of layers can likely be made more economical, and the results should be extensible to other activation functions such as tanh. It should also be possible to merge more than two continuous functions in a similar way; a more open-ended question is whether, as in the Ant domain, some uncountable family of functions can be merged into a single neural network.
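As promised in section S5, the following is a small sample-based sketch of the MaxVar-EES gradient estimate of equation (25). It assumes a one-dimensional behavior characteristic and a hypothetical behavior_of function standing in for a full rollout; it is an illustration of the estimator's form, not the authors' implementation.

import numpy as np

def evolvability_es_gradient(mu, sigma, behavior_of, m=1000):
    # Sample theta^k ~ N(mu, sigma^2 I), evaluate behaviors, and form equation (25).
    rng = np.random.default_rng(0)
    thetas = mu + sigma * rng.standard_normal((m, mu.size))
    behaviors = np.array([behavior_of(t) for t in thetas])   # B_tau^k per sample
    centered = behaviors - behaviors.mean()                  # B_tau^k - B_bar
    # grad_mu log N(theta^k; mu, sigma^2) = (theta^k - mu) / sigma^2
    score = (thetas - mu) / sigma**2
    return (centered[:, None] ** 2 * score).sum(axis=0)      # sum over k, as in (25)

# Toy usage with a hypothetical behavior function (sum of parameters)
grad = evolvability_es_gradient(np.zeros(5), 0.02, lambda th: float(th.sum()))
print(grad)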