Stein Variational Goal Generation for adaptive Exploration in Multi-Goal Reinforcement Learning

Nicolas Castanet 1, Olivier Sigaud 1, Sylvain Lamprier 2

Abstract

In multi-goal Reinforcement Learning, an agent can share experience between related training tasks, resulting in better generalization for new tasks at test time. However, when the goal space has discontinuities and the reward is sparse, a majority of goals are difficult to reach. In this context, a curriculum over goals helps agents learn by adapting training tasks to their current capabilities. In this work we propose Stein Variational Goal Generation (SVGG), which samples goals of intermediate difficulty for the agent, by leveraging a learned predictive model of its goal reaching capabilities. The distribution of goals is modeled with particles that are attracted in areas of appropriate difficulty using Stein Variational Gradient Descent. We show that SVGG outperforms state-of-the-art multi-goal Reinforcement Learning methods in terms of success coverage in hard exploration problems, and demonstrate that it is endowed with a useful recovery property when the environment changes.

1. Introduction

Multi-goal Reinforcement Learning (RL) (Kaelbling, 1993), where agent policies are conditioned by goals specified according to the task at hand, has recently been at the heart of many research works (Schaul et al., 2015; Pitis et al., 2020; Yang et al., 2021), since it offers an efficient way for sharing experience between related tasks. The usual ambition in the multi-goal RL context is to obtain an agent able to reach any goal from some desired goal distribution. This is particularly challenging in settings where the desired goal distribution is unknown at train time, which means discovering the space of valid goals by experience, and optimizing success coverage1.

To avoid deceptive gradient issues and a tedious reward engineering process, multi-goal RL often considers the sparse reward context, where the agent only obtains a non-null learning signal when the goal is achieved. In that case, the multi-goal framework makes it possible to leverage Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), which helps densify the reward signal by relabeling failures as successes for the goals achieved by accident. However, in settings with discontinuities in the goal space (e.g., walls in a maze), or in hard exploration problems where the long task horizon results in an exponential decrease of the learning signal (Osband et al., 2016), many goals remain hard to achieve and using HER does not suffice to reach all valid goals.

In these more difficult contexts, and without any desired goal distribution at hand, the selection of training goals from a behavior distribution must be structured into a curriculum to help agents explore and learn progressively by adapting training tasks to their current capabilities (Colas et al., 2022). The question is: how can we organize a curriculum of goals to maximize the success coverage?

A first approach consists in focusing on novelty, with the objective of expanding the set of achieved goals. This is the approach of RIG (Nair et al., 2018), DISCERN (Warde-Farley et al., 2018), SKEW-FIT (Pong et al., 2019) and MEGA (Pitis et al., 2020)2. This leads to strong exploration results, but success coverage is only optimized implicitly.

Another strategy is to bias the goal generation process toward goals of intermediate difficulty (GOIDs).
The general intuition is that addressing goals that are too easy or too hard does not foster progress, so the agent needs to identify goals on which it can make some progress. The focus is thus more on performance. This is the approach of asymmetric self-play (Sukhbaatar et al., 2017), GOALGAN (Florensa et al., 2018), SETTER-SOLVER (Racaniere et al., 2019) or VDS (Zhang et al., 2020). By aiming at performance, those methods target more explicitly success on encountered goals, but benefit from implicit exploration.

In this work, we propose a novel method which provides the best of both worlds. Our method, called SVGG3, learns a model of the probability of succeeding in achieving goals by relying on a set of particles where each particle represents a goal candidate. This set of particles is updated via Stein Variational Gradient Descent (Liu & Wang, 2016) to fit GOIDs, modeled as goals whose success is the most unpredictable. A key feature of SVGD is that the gradient applied to particles combines an attraction term and a repulsion term, which helps strike a good balance between exploration and exploitation. In particular, when the agent cannot find any additional GOID, the repulsion term of SVGD makes the current particles repel one another, which fosters more exploratory goal sampling. This endows SVGG with a very flexible model of the current capabilities of the agent. Based on this feature, we formally demonstrate that SVGG possesses a very useful recovery property that prevents catastrophic forgetting and enables the agent to adapt when the environment changes during training. We empirically validate SVGG on multi-goal RL problems where the goal space is of moderate size, and leave investigations on problems where goals are images for future work.

1 Sorbonne Université, ISIR, Paris, France. 2 Univ Angers, LERIA, SFR MATHSTIC, F-49000 Angers, France. Correspondence to: Nicolas Castanet, Olivier Sigaud, Sylvain Lamprier.
1 Success coverage, which measures the average performance of the agent on all valid goals, is denoted competence in (Blaes et al., 2019).
2 This is also the case for OMEGA (Pitis et al., 2020), which extends MEGA, but for settings where the desired goal distribution is known, which is out of the scope of our work.
3 For Stein Variational Goal Generation.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

2. Background

2.1. Goal-conditioned Reinforcement Learning

In this paper, we consider the multi-goal reinforcement learning setting, defined as a Markov Decision Process (MDP) Mg = ⟨S, T, A, g, Rg⟩, where S is a set of states, T is the set of transitions, A the set of actions, and the reward function Rg is parametrized by a goal g lying in the d-dimensional continuous goal space G ≡ Rd. In our setting, each goal g is defined as a set of states Sg ⊆ S that are desirable situations for the corresponding task, with states in Sg being terminal states of the corresponding MDP. Thus, a goal g is considered achieved when the agent reaches at some step t any state st ∈ Sg, which implies the following sparse reward function with values in {0; 1} in the absence of expert knowledge. The goal-conditioned reward is defined as Rg(st, at, st+1) = I(st+1 ∈ Sg) for discrete state spaces and Rg(st, at, st+1) = I(min_{s∗∈Sg} ||st+1 − s∗||_2 < δ) for continuous ones, where δ is a distance threshold and I the indicator function.

Then, the objective is to learn a goal-conditioned policy (GCP) π : S × G → A which maximizes the expected cumulative reward from any initial state of the environment, given a goal g ∈ G:

$\pi^* = \arg\max_\pi \; \mathbb{E}_{g\sim p_d}\,\mathbb{E}_{\tau\sim\pi(\tau)}\Big[\sum_{t=0}^{\infty}\gamma^t r_t^g\Big],$

where r_t^g = Rg(st, at, st+1) stands for the goal-conditioned reward obtained at step t of trajectory τ using goal g, γ is a discount factor in ]0; 1[ and pd is the distribution of goals over G. In our setting we consider that pd is uniform over S (i.e., no known desired distribution), while the work could be extended to cover different distributions.

Importantly, since S is not known in advance and we want to adapt training goals to the current capabilities of the agent, learning is performed at each step through goals sampled from a behavioral distribution pgoals, which is periodically updated by experience. While training the agent can be performed by any classical RL algorithm, our work focuses on the definition of this behavioral distribution pgoals, which drives the learning process.
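To make the sparse goal-conditioned reward concrete in the continuous case, here is a minimal Python sketch; the function names are illustrative rather than taken from the authors' code, and the default δ = 0.15 follows the value reported in Appendix C.

```python
import numpy as np

def sparse_goal_reward(next_state: np.ndarray,
                       goal_state: np.ndarray,
                       delta: float = 0.15) -> float:
    """R_g(s_t, a_t, s_{t+1}) = 1 if the next state lies within
    distance delta of the goal state, 0 otherwise (continuous case)."""
    return float(np.linalg.norm(next_state - goal_state, ord=2) < delta)

def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Discounted return sum_t gamma^t r_t^g of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```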
2.2. Automatic Curriculum for sparse Reward RL

Our SVGG method addresses automatic curriculum for sparse reward goal-conditioned RL (GCRL) problems and learns to achieve a continuum of related tasks.

Achieved Goals Distributions. Our work is strongly related to the MEGA algorithm (Pitis et al., 2020), which (1) maintains a buffer of previously achieved goals, (2) models the distribution of achieved goals via kernel density estimation (KDE), and (3) uses this distribution to define its behavior distribution. By preferentially sampling from the buffer goals at the boundary of the set of already reached states, an increase of the support of that distribution is expected. In that way, MEGA aims at overcoming the limitations of previous related approaches which also model the distribution of achieved goals. For instance, DISCERN (Warde-Farley et al., 2018) only uses a replay buffer of goals, whereas RIG (Nair et al., 2018) and SKEW-FIT (Pong et al., 2019) rather use variational auto-encoding (Kingma & Welling, 2013) of the distribution. While RIG samples from the model of the achieved distribution, and DISCERN and SKEW-FIT skew that distribution to sample more diverse achieved goals, MEGA rather focuses on low-density regions of the distribution, aiming to expand it. This results in improved exploration compared to competitors. Our approach differs from all these works: they only model achieved goals, independently of which goal was targeted when they were achieved, whereas we model the capability of reaching target goals. This makes a strong difference since, while MEGA selects goals at the frontier of what it has already discovered, nothing indicates that goals closer to the mode of the distribution can be achieved when they are targeted. MEGA is also prone to catastrophic forgetting and limits exploration to goals present in the replay buffer.
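As a point of comparison with SVGG, the minimum-density heuristic described above can be sketched in a few lines: fit a KDE on achieved goals and pick the buffered candidate with the lowest estimated density. This is only an illustrative reconstruction of the idea, not MEGA's actual implementation; the Gaussian kernel and the bandwidth default (0.1, as in the hyper-parameter table of Appendix G.3) are the only values taken from this paper.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def lowest_density_goal(achieved_goals: np.ndarray,
                        candidates: np.ndarray,
                        bandwidth: float = 0.1) -> np.ndarray:
    """MEGA-style heuristic: model the achieved-goal distribution with a
    KDE, then return the candidate goal of minimum estimated density."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(achieved_goals)                       # (N, d) buffer of achieved goals
    log_density = kde.score_samples(candidates)   # (M,) log-densities of candidates
    return candidates[np.argmin(log_density)]
```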
Adversarial Goal Generation. Another trend proposes to adversarially learn a goal generator that produces targets at the frontier of the agent's capabilities. In that vein, GOALGAN (Florensa et al., 2018) simultaneously learns a discriminator to sort out non-GOIDs or generated goals from GOIDs in the buffer of achieved goals, and a generator that aims at producing goals that fool the discriminator. While appealing, the approach is prone to instabilities, with a generator that may diverge far from the space of reachable goals. SETTER-SOLVER (Racaniere et al., 2019) stands as an extension of GOALGAN, where a goal setter is learned to provide goals of various levels of difficulty w.r.t. a judge network (similar to our success predictor, see next section). The provided goals remain close to already achieved goals, and are diverse enough to avoid mode collapse. This approach however suffers from relying on invertible networks to map from the latent space to the goal space, which severely limits the modeling power and can prove problematic for environments with strong discontinuities. Asymmetric self-play (Sukhbaatar et al., 2017) is another way to generate goals, with a teacher agent seeking to produce goals that are just beyond the capabilities of the student agent. Both teacher and student learn simultaneously, with an equilibrium of adverse rewards determined by their respective times to go. However, this balance is hard to maintain, and many useful areas are usually missed.

Our SVGG algorithm also samples GOIDs, but it does so by learning a predictive model of the agent's goal achievement capability and building a sampling distribution that focuses on goals whose achievement is the most unpredictable. Because it does not call upon an adversarial generative process, SVGG is less prone to instabilities.
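For reference, the notion of goals of intermediate difficulty used by GOALGAN reduces to a band-pass test on estimated success probabilities, with the Pmin and Pmax bounds listed in Appendix G.4; the sketch below is an illustrative rendering of that criterion, with a learned success predictor standing in for Pπ(g).

```python
import numpy as np

def is_goid(success_prob: np.ndarray,
            p_min: float = 0.1,
            p_max: float = 0.9) -> np.ndarray:
    """GOALGAN-style GOID criterion: a goal is of intermediate difficulty
    when its estimated success probability lies strictly between p_min and p_max."""
    return (success_prob > p_min) & (success_prob < p_max)
```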
2.3. Stein Variational Gradient Descent

Our method builds on Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) to approximate the distribution of goals of interest. SVGD is a powerful non-parametric tool for density estimation when the partition function of the target distribution p to approximate is intractable, which is the case when we do not know the support of the valid goal distribution in the environment. It stands as an efficient alternative to MCMC methods, which are proven to converge to the true distribution p but are usually too slow to be used in complex optimization processes. It also stands as an alternative to variational inference of parametric neural distributions q, which are restricted to pre-specified families of distributions (e.g., Gaussians or mixtures of Gaussians) that may not fit target distributions. Instead, it models q as a set of particles {x_i}_{i=1}^n, all belonging to the support of p.

The idea behind SVGD is to approximate the target distribution p with q by minimizing their KL-divergence: min_q KL(q||p). This objective is reached by iterative deterministic transforms, applied to the set of particles as small perturbations of the identity map: T(x) = x + ϵϕ(x), where ϕ is a smooth transform function that indicates the direction of the perturbation, while ϵ is its magnitude. The authors draw a connection between the KL-divergence and the Stein operator A_p ϕ(x) = ϕ(x) ∇_x log p(x)^T + ∇_x ϕ(x) by showing that

$\nabla_\epsilon \mathrm{KL}(q_{[T]}\,\|\,p)\big|_{\epsilon=0} = -\mathbb{E}_{x\sim q}\big[\mathrm{trace}(\mathcal{A}_p\phi(x))\big],$

where q_[T] is the distribution of particles after the transformation T. The KL minimization objective is thus related to the Stein Discrepancy, defined as

$\mathbb{S}(q,p) = \max_{\phi\in\mathcal{F}} \mathbb{E}_{x\sim q}\big[\mathrm{trace}(\mathcal{A}_p\phi(x))\big].$ 4

Minimizing the Stein Discrepancy being intractable as such, (Liu et al., 2016) and (Chwialkowski et al., 2016) introduce the Kernelized Stein Discrepancy (KSD), where the idea is to restrict to functions ϕ that belong to the unit ball of a reproducing kernel Hilbert space H (RKHS), for which there is a closed-form solution. The KSD is defined as

$\mathbb{S}(q,p) = \max_{\phi\in\mathcal{H}}\big\{\mathbb{E}_{x\sim q}[\mathrm{trace}(\mathcal{A}_p\phi(x))] \;\text{ s.t. }\; \|\phi\|_{\mathcal{H}}\le 1\big\},$

whose solution is given by ϕ∗(·) = E_{x∼q}[A_p k(x, ·)], where k(x, x′) is the positive definite kernel of the RKHS H. The RBF kernel k(x, x′) = exp(−(1/h)||x − x′||_2^2) is commonly used. Therefore, the steepest descent on the KL objective is given by the optimal transform x_i ← x_i + ϵϕ∗(x_i), ∀i = 1 · · · n, where

$\phi^*(x_i) = \frac{1}{n}\sum_{j=1}^{n}\Big[\underbrace{k(x_j,x_i)\,\nabla_{x_j}\log p(x_j)}_{\text{attractive force}} \;+\; \underbrace{\nabla_{x_j} k(x_j,x_i)}_{\text{repulsive force}}\Big]. \qquad (1)$

4 Note that S(q, p) = 0 only if q = p.

The "attractive force" in the update (1) drives the particles toward high-density areas of the target p. The "repulsive force" pushes the particles away from each other, which fosters exploration and avoids mode collapse. Note that if n = 1, the update in (1) corresponds to a Maximum a Posteriori estimate. SVGD has already been successfully explored in the context of RL. The Stein Variational Policy Gradient (SVPG) (Liu et al., 2017) employs SVGD to maintain a distribution of efficient agents as particles. It strongly differs from our approach, since we consider particles as behavior goal candidates, while SVPG aims at capturing the epistemic uncertainty about policy parameters. (Chen et al., 2021) also relies on SVGD to build a strategy to generate goals for agents, but in a very simplified setting without the attractive force from (1), which prevents it from fully benefiting from this theoretical framework. Notably, such an approach is particularly sensitive to catastrophic forgetting.
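To make the update in (1) concrete, here is a small NumPy sketch of one SVGD step with an RBF kernel, assuming a user-supplied score function ∇_x log p(x). It is an illustrative reconstruction of the standard SVGD update rather than the authors' implementation; the default step size (1e-3) and bandwidth (1) mirror the SVGG hyper-parameter table in Appendix G.2.

```python
import numpy as np

def rbf_kernel(x: np.ndarray, h: float):
    """RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / h) and its gradient
    with respect to the first argument, as needed by the SVGD update."""
    diffs = x[:, None, :] - x[None, :, :]          # (n, n, d), diffs[i, j] = x_i - x_j
    sq_dists = np.sum(diffs ** 2, axis=-1)         # (n, n) squared distances
    K = np.exp(-sq_dists / h)                      # kernel matrix
    # grad_{x_j} k(x_j, x_i) = -(2 / h) * (x_j - x_i) * k(x_j, x_i), indexed [j, i]
    grad_K = -(2.0 / h) * diffs * K[:, :, None]
    return K, grad_K

def svgd_step(particles: np.ndarray, score_fn,
              step_size: float = 1e-3, h: float = 1.0) -> np.ndarray:
    """One SVGD update x_i <- x_i + eps * phi*(x_i) following Eq. (1).

    particles: (n, d) array of goal candidates.
    score_fn:  callable returning grad_x log p(x) for an (n, d) array.
    """
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles, h)
    score = score_fn(particles)                    # (n, d)
    attraction = K.T @ score                       # sum_j k(x_j, x_i) * grad log p(x_j)
    repulsion = grad_K.sum(axis=0)                 # sum_j grad_{x_j} k(x_j, x_i)
    phi = (attraction + repulsion) / n
    return particles + step_size * phi
```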
3. Stein Variational Goal Generation

In this section we introduce our Stein Variational Goal Generation (SVGG) algorithm. The pseudo-code of SVGG is given in Algorithm 1 (a more detailed version is given in Appendix G.2). Our aim is to obtain a curriculum to sample goals of appropriate difficulty for the RL agent, where the curriculum is represented as an evolving goal sampling probability pgoals. To do so, we maintain a model of the agent's skills – or goal reaching capability –, which helps define a distribution pskills. This distribution assigns probability mass to areas of goals of appropriate difficulty. Additionally, with a simple one-class SVM, we learn a validity distribution pvalid preventing the particles from being sampled from non-valid areas of the goal space. Then, we aim at sampling training goals from the following target distribution:

$p_{goals}(g) \propto p_{skills}(g)\cdot p_{valid}(g). \qquad (2)$

Since computing the partition function is intractable for such a distribution formulation, we rather sample uniformly over a set of m particles q = {x_i}_{i=1}^m that are optimized through SVGD to approximate pgoals(g). Importantly, for our setting where we are interested in tracking useful areas to train the agent, dealing with a set of particles representing the state of the full exploration landscape appears better fitted than methods relying on single candidates, such as MCMC or Langevin Dynamics, that would produce samples very correlated in time, with unstable dynamics. This choice also improves the interpretability of the process, by providing a comprehensive picture of the current behavior distribution along training. Formal definitions of the two components of pgoals are given below.

Figure 1. Overview of the SVGG method. The interaction of agents with their environment is stored in a replay buffer (top right) and used to learn Dϕ, a model of their abilities to achieve goals (bottom right). We build on this model to compute a distribution of goals of appropriate difficulty pskills, leveraging a validity distribution pvalid to stay inside the space of valid goals. The obtained behavioral goal distribution pgoals is approximated with particles {g_i}_{i=1}^n using Stein Variational Gradient Descent (SVGD), and the agent samples a goal from these particles.

Model of the agent's skills. The probability pskills is modeled as a neural network Dϕ whose parameters ϕ are learned by gradient descent on the following Binary Cross Entropy (BCE) loss:

$\mathcal{L}_\phi = -\sum_{(g^i,s^i)\in\mathcal{O}_B}\big[s^i\log D_\phi(g^i) + (1-s^i)\log(1-D_\phi(g^i))\big], \qquad (3)$

where O_B = {(g^i, s^i)}_{i=1}^n is a batch of (goal, success) pairs coming from recent trajectories of the agent in the environment. The sampled goals are those whose predicted probability of success is neither too high nor too low (i.e., we avoid Dϕ(g) ≈ 1 or Dϕ(g) ≈ 0). To build pskills based on the prediction of Dϕ, we use a beta distribution whose maximum density point mass is determined according to the output of Dϕ, by two hyperparameters α and β that shape the distribution and control the difficulty of the addressed goals, as illustrated in Figure 7 in the appendix. We define the distribution pskills as an energy-based density whose potential is the output of the beta distribution f:

$p_{skills}(g) \propto \exp\big(f_{\alpha,\beta}(D_\phi(g))\big). \qquad (4)$

In Appendix D, we compare the relative performance of 5 pairs of α and β and show that targeting a Medium difficulty works best. We stick to this setting in the rest of the study.

Validity distribution. As outlined in (Racaniere et al., 2019), we would like to only sample valid goals. To do so, instead of their validity loss, we define a validity distribution which represents the probability that a goal g belongs to the set of valid (reachable) goals G∗ ⊆ G. However, G∗ is not known in advance. To circumvent this difficulty, the states already reached by the agent are stored in an archive R and we aim at defining the validity distribution as depending on the posterior probability given R: pvalid(g) ∝ P(g ∈ G∗|R). We progressively build this distribution with a One-Class SVM (OCSVM). This model is mainly designed for outlier or novelty detection in the absence of labeled data. Given a dataset of points in Rd, it defines a boundary of the data support in Rd, while keeping a small portion of the data points out of that boundary. With data points being goals from R, we get

$p_{valid}(g) \propto V_\psi(g), \qquad (5)$

where Vψ(g) is the output of the OCSVM model trained on R, with parameters ψ. As the agent progresses and expands its set of achieved states through training, it eventually reaches the environment boundary. In this case, we can expect Vψ(g) ≈ P(g ∈ G∗|g ∈ ω) for any area ω ⊆ G.

Recovery property. As demonstrated in Theorem 3.1 and empirically validated in Section 4.2, SVGG benefits from a useful recovery property: when the environment suddenly changes, the SVGG agent will spontaneously resample goals in the areas that are affected by the change.

Theorem 3.1 (Recovery property). Let us denote as G+ the set of goals g such that Vψ(g) > 0 and C ⊆ Rd its convex hull. Assume that, at a given iteration l, Dϕ(g) ≈ 1 for every g ∈ G+ (i.e., the whole set G+ is considered as mastered by Dϕ), and that, on that set, Vψ is well calibrated: for any area ω ⊆ G+ and any goal g ∈ ω, Vψ(g) ≈ P(g ∈ G∗|g ∈ ω). Assume also that we use a kernel which ensures that the Kernelized Stein Discrepancy KSD(q, pgoals) of any distribution q with pgoals is 0 only if q weakly converges to pgoals5. Then, with no updates of the models after iteration l and a number of particles m > 1, any area ω ⊆ G+ ∩ G∗ with diameter $\mathrm{diam}(\omega) \ge \sqrt{d}\,\frac{\mathrm{diam}(C)}{\sqrt[d]{m}-1}$ eventually contains at least one particle, whenever KSD({x_i}_{i=1}^m, pgoals) = 0 after convergence of the particle updates.

5 As assumed for instance in (Liu, 2017). This is not always true, but gives strong insights about the behavior of SVGD. Refer to (Gorham & Mackey, 2017) for more discussions about KSD.

The above theorem (proof in Appendix A) ensures that, even if the success model overestimates the capacities of the agent for some area ω (e.g., due to changes in the environment, catastrophic forgetting or success model error), some particles are likely to go back to this area once every goal in G∗ looks well covered by the agent, with an increasing probability for more particles. This way, the process can reconsider overestimated areas, by sampling goals in them again and hence correcting the corresponding predictions, which attracts the attention of pskills back to these difficult areas. Approaches such as MEGA do not exhibit such recovery properties, since they always sample at the boundary of their achievable goal distribution, which is likely to incrementally grow towards G∗. The particles of our approach can be seen as attention trackers which remodel the behavior distribution and mobilize the effort on useful areas when needed. This is much better than uniformly sampling from the whole space of achievable states with small probability, which would also ensure some recovery of forgotten areas but in a very inefficient way. This theoretical guarantee of SVGG is empirically validated by the experiment from Figure 3, which shows the good recovery property of our approach after a sudden change in the environment.
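As an illustration of how (2), (4) and (5) can be combined into a differentiable target for SVGD, here is a hedged PyTorch sketch. The interfaces, the Beta parameters (α = β = 3 as a symmetric "medium difficulty" placeholder) and the way the validity score is smoothed are illustrative choices, not the authors' exact implementation.

```python
import torch
from torch.distributions import Beta

def log_p_goals(goals: torch.Tensor,
                success_model,            # D_phi: maps (m, d) goals to success probs in (0, 1)
                validity_fn,              # differentiable stand-in for the OCSVM output V_psi(g) >= 0
                alpha: float = 3.0,       # placeholder Beta shape ("medium" difficulty)
                beta: float = 3.0) -> torch.Tensor:
    """Unnormalized log-density of Eq. (2):
    log p_goals(g) = f_{alpha,beta}(D_phi(g)) + log V_psi(g) + const,
    where f_{alpha,beta} is the Beta pdf used as an energy potential (Eq. 4)."""
    d_g = success_model(goals).clamp(1e-4, 1 - 1e-4)           # predicted success probability
    skills_potential = Beta(alpha, beta).log_prob(d_g).exp()   # f_{alpha,beta}(D_phi(g))
    log_valid = torch.log(validity_fn(goals).clamp_min(1e-6))  # log V_psi(g), Eq. (5)
    return skills_potential + log_valid

def score(goals: torch.Tensor, success_model, validity_fn) -> torch.Tensor:
    """grad_g log p_goals(g): the attractive-force term needed by the SVGD update (1)."""
    goals = goals.clone().requires_grad_(True)
    log_p = log_p_goals(goals, success_model, validity_fn).sum()
    return torch.autograd.grad(log_p, goals)[0]
```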
4. Experiments

4.1. Experimental setup

Success coverage metric. Our general objective is to obtain a policy that can reliably achieve all valid goals in some environment. To quantify this objective, we evaluate the resulting policy on the entire space of valid goals G∗ in our environment using a success coverage metric, defined as

$S(\pi) = \frac{1}{V(G^*)}\int_{G^*} P(\pi \text{ achieves } g)\,dg,$

with V(G∗) the volume of G∗. The goal space being continuous, we evaluate the policy on a finite subset Ĝ uniformly sampled from G∗. Then our objective reduces to:

$S(\pi) = \frac{1}{|\hat{G}|}\sum_{i=1}^{|\hat{G}|}\mathbb{E}_{\tau\sim\pi}\Big[\mathbb{1}\big\{\exists\, s\in\tau,\ \min_{s^*\in S_{g_i}}\|s-s^*\|_2 < \delta\big\}\Big]. \qquad (6)$

To build Ĝ, we split G∗ into areas following a regular grid, and then uniformly sample 30 goals inside each part of the division.
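A minimal sketch of this Monte Carlo estimate of (6), assuming a user-supplied rollout helper that returns the states visited when targeting a given evaluation goal (all names below are illustrative):

```python
import numpy as np

def success_coverage(policy, eval_goals, rollout_fn, delta: float = 0.15) -> float:
    """Estimate S(pi): fraction of evaluation goals for which at least one
    visited state comes within distance delta of the goal (Eq. 6)."""
    successes = 0.0
    for g in eval_goals:                      # goals sampled on a regular grid over G*
        states = rollout_fn(policy, g)        # (T, d) states visited when targeting g
        dists = np.linalg.norm(states - g, axis=-1)
        successes += float(dists.min() < delta)
    return successes / len(eval_goals)
```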
Compared Approaches. As baselines, we choose two methods from the literature that span the existing trends in unsupervised RL with goal-conditioned policies: MEGA (Pitis et al., 2020) and GOALGAN (Florensa et al., 2018). In addition, the Random baseline randomly selects the behavior goal among past achieved states. We also evaluate two SVGG ablations: a No Validity Distribution version, which considers pgoals ∝ pskills, and an Only Validity version, where pgoals ∝ pvalid.

Questions. To compare the performance of SVGG to the baselines, we investigate the following questions: 1) Does SVGG maximize the success coverage metric better than the baselines? 2) Is SVGG more robust than the baselines to the catastrophic forgetting that may occur after sudden environment changes? In Appendix D, we also investigate the impact of target difficulty (i.e., the beta distribution described above) on success coverage maximization.

Algorithm 1 RL with Stein Variational Goal Generation
1: Input: a GCP πθ, a success predictor Dϕ, a reachability predictor Vψ, buffers of transitions B, reached states R and success outcomes O, a kernel k.
2: Sample a set of particles q = {x_i}_{i=1}^m uniformly from states reached by pre-runs of πθ;
3: for n epochs do
4:   ▷ Data Collection: Perform rollouts of πθ conditioned on goals uniformly sampled from q;
5:   Store transitions in B, visited states in R and success outcomes in O;
6:   ▷ Model Update
7:   Update model Dϕ (e.g., via ADAM), using the gradient of Lϕ (3) with samples from O;
8:   ▷ Prior Update
9:   Update model Vψ according to all states in R;
10:  ▷ Particles Update (t steps)
11:  Update particles x_i ← x_i + ϵ (1/m) Σ_{j=1}^m [ k(x_j, x_i) ∇_{x_j} log pgoals(x_j) + ∇_{x_j} k(x_j, x_i) ];
12:  ▷ Agent Improvement
13:  Improve the agent with any off-policy RL algorithm (e.g., DDPG) using transitions from B;
14: end for

4.2. Main results

Learning to reliably achieve all valid goals can be decomposed into learning to sample appropriate behavioral goals, and learning to achieve these goals. Our focus being on the goal sampling mechanism, all compared approaches learn to achieve goals using DDPG and HER with a mixed strategy described in Appendix G. In all versions of SVGG, we use m = 100 particles to approximate the target distribution. Implementation details about all considered architectures are given in Appendix G.

Success coverage evaluations. Figure 2 shows that SVGG significantly outperforms all baselines in terms of success coverage, especially in highly discontinuous goal space settings such as the mazes and the modified versions of FetchReach and FetchPush, where it efficiently discovers and achieves most of the valid goals. On AntMaze and FetchPickAndPlace, where the goal space is smoother, our approach obtains results comparable to its competitors. Due to the known stability issues in GAN training, GOALGAN is the least efficient baseline. Another explanation of the failure of GOALGAN is that it is likely to generate non-valid goals, which is not the case for MEGA or SVGG: MEGA chooses behavior goals from a replay buffer of achieved goals, while the validity distribution pvalid considered in SVGG keeps particles inside valid areas.

The minimum density heuristic of MEGA efficiently discovers all valid goals in the environment, but our results show that its success plateaus in almost all the considered environments. MEGA's intrinsic motivation only relies on state novelty; thus, when the agent has discovered all reachable states, it is unable to target areas that it has reached in the past but has not mastered yet.
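To tie the pieces of Algorithm 1 above together, the following schematic Python skeleton shows one way the loop could be organized, reusing the svgd_step and score sketches given earlier (and glossing over NumPy/PyTorch conversions). The rollout helper, agent and model objects are placeholder interfaces, not the authors' code; only the default step size and single SVGD move per update follow the hyper-parameter table in Appendix G.2.

```python
def train_svgg(agent, success_model, validity_model, rollout_fn,
               particles, n_epochs=1000, svgd_steps=1, eps=1e-3):
    """Schematic rendering of Algorithm 1 (illustrative interfaces)."""
    replay, reached, outcomes = [], [], []
    for _ in range(n_epochs):
        # Data collection: roll out the GCP on goals sampled from the particles.
        for goal in particles:
            transitions, visited, success = rollout_fn(agent, goal)
            replay.extend(transitions)
            reached.extend(visited)
            outcomes.append((goal, success))
        # Model update: fit D_phi on (goal, success) pairs with the BCE loss (3).
        success_model.fit(outcomes)
        # Prior update: refit the OCSVM validity model on all reached states.
        validity_model.fit(reached)
        # Particles update: a few SVGD steps on the target p_goals (Eqs. 1-2).
        for _ in range(svgd_steps):
            particles = svgd_step(
                particles,
                score_fn=lambda x: score(x, success_model, validity_model),
                step_size=eps)
        # Agent improvement with any off-policy RL algorithm (e.g., DDPG + HER).
        agent.update(replay)
    return agent, particles
```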
Evaluation environments To answer the above metricoriented questions, we use a modified version of the FetchReach and FetchPush environments (Plappert et al., 2018) where we have added obstacles in the workspace of the robot arm to increase the amount of discontinuities between the optimal goal-oriented behaviors. We also compare SVGG to baselines in the U-shaped AntMaze environment (Trott et al., 2019) and in a hard version of FetchPickAndPlace. Additionally, to provide more analysis-oriented visualizations, we use a Point-Maze environment where an agent moves a point without mass within a 2D maze with continuous spaces of states and actions. As the agent does not perceive the walls, maximizing success coverage in these environments is harder than it seems. Recovery Property Figure 3 shows the advantages of SVGG over MEGA and G OAL GAN in changing environments. Walls are suddenly added during the training process (dot black line from Figure 3), after the methods had reached their pick performance. We see that the performance of MEGA and G OAL GAN plateaus lower than their pick performance (see Figure 2) whereas SVGG discovers new difficulties resulting from the environment modification and focuses on those to finally recover its pick success coverage. 6 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Figure 2. Success coverage over 4 different PointMazes, AntMaze, FetchPickAndPlace (Hard), FetchReach (Stuck) and FetchPush (Maze) (4 seeds each). SVGG outperforms MEGA and G OAL GAN as well as ablations. pgoals reduces to pvalid , that is at this point uniform over the entire goal space. Therefore, the particles spread uniformly over the goal space and prevent SVGG from catastrophic forgetting, as the model rediscovers areas that the agent has forgotten how to reach (cf. rightmost column in Figure 4). Additional analyses on SVGG are given in Appendix D. We also observe that the advantages of our method over MEGA in terms of recovery ability are more significant when changes in the environment are more drastic (i.e., when starting from maze B). Further SVGG analyses To gain further intuition on how SVGG maximizes the success coverage, we show in Figure 4 the evolution of the particles throughout training. As the agent progresses and achieves novel and harder goals, the Dϕ model updates its predictions. Thus, the target distribution pgoals is updated accordingly (background color of the 2nd row of the figure). The particles q = {gi }ni=1 are then moved toward new areas of intermediate difficulty through SVGD to minimize KL(q||pgoals ). 5. Conclusion This paper introduces a new multi-goal reinforcement learning algorithm, SVGG, which leverages Stein Variational Gradient Descent to monitor a model w.r.t. its goal achievement capabilities. Using this model, the agent addresses goals of intermediate difficulty, resulting in an efficient curriculum for finally covering the whole goal space. Moreover, SVGG can recover from catastrophic forgetting, which is a classic Figure 4 also highlights the recovery property of SVGG. When the agent has nothing else to learn in the environment, 7 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Figure 3. Evolution of success coverage in a changing environment for MEGA, G OAL GAN and SVGG (4 seeds each). We add walls to the starting mazes A and B (left) and go from 2 to 4 walls in FetchReach (right). 
The triangles correspond to the pick success coverage of the methods on the final environments (which correspond to maze 1 and FetchReach (Stuck) Figure 2) during regular training. Figure 4. Parallel visualization of the model’s prediction (A), the particles evolution (B) and of the coverage (C) throughout training in Maze 1. pitfall in multi-goal RL. Some limitations should be addressed in future work. The environments used in our experiments have low dimensional goal space, which facilitates the approximation of the target distribution with SVGD and the agent’s model learning phase. When the agent’s observation and goal space will be images, the agent should learn a compact latent space of observation as in (Pong et al., 2019; Yarats et al., 2021; Liu & Abbeel, 2021; Hafner et al., 2019), using various representation learning techniques like contrastive learning, prototypical representations or variational auto-encoders. In future work, we should learn a latent goal space from observations, and perform SVGD over particles in this latent space. This would result in an end-to-end algorithm learning to discover and Studying the impact of the number of particles is left for future work. Actually, the target distribution being in constant evolution, the KL divergence minimization objective is hard to reach at all times, which makes it difficult to claim that using more particles is always better. Furthermore, a previous work (D’Angelo & Fortuin, 2021) spotted exploration failures in SVGD, and suggests that periodically annealing the attraction force in particle optimization (1) is required to enable particles to cover non-trivial distributions, e.g. in multimodal settings (which is the case for us) or in high dimensions. 8 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL achieve all possible goals in its environment from pixels. Colas, C., Karch, T., Sigaud, O., and Oudeyer, P.Y. Autotelic agents with intrinsically motivated goalconditioned reinforcement learning: a short survey. Journal of Artificial Intelligence Research, 74:1159–1199, 2022. Besides, we also envision to address larger and stochastic environments, where additional uncertainty estimation should be added to the goal generation process, to prevent the agent getting stuck in uncontrollable states (like a TV screen showing white noise) as in (Chua et al., 2018; Pathak et al., 2019), using methods such as model disagreement between multiple agents. D’Angelo, F. and Fortuin, V. Annealed stein variational gradient descent. CoRR, abs/2101.09815, 2021. URL https://arxiv.org/abs/2101.09815. Eysenbach, B., Geng, X., Levine, S., and Salakhutdinov, R. Rewriting history with inverse RL: hindsight inference for policy improvement. CoRR, abs/2002.11089, 2020. URL https://arxiv.org/abs/2002.11089. Acknowledgments We thank our colleges at ISIR and the anonymous reviewers for their helpful comments. This work benefited from the use of SCAI (Sorbonne Center for Artificial Intelligence) cluster of GPUs. This work has received funding from the European Commission’s Horizon Europe Framework Programme under grant agreement No 101070381 (PILLARrobots project). Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic goal generation for reinforcement learning agents. In International conference on machine learning, pp. 1515– 1528. PMLR, 2018. Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning, pp. 1292–1301. PMLR, 2017. 
References Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight Experience Replay. arXiv preprint arXiv:1707.01495, 2017. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2555–2565. PMLR, 09– 15 Jun 2019. URL https://proceedings.mlr. press/v97/hafner19a.html. Blaes, S., Vlastelica Pogančić, M., Zhu, J., and Martius, G. Control what you can: Intrinsically motivated taskplanning agent. Advances in Neural Information Processing Systems, 32, 2019. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. Chen, J., Zhang, Y., Xu, Y., Ma, H., Yang, H., Song, J., Wang, Y., and Wu, Y. Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems, 2021. URL https://arxiv.org/abs/ 2111.04613. Liu, H. and Abbeel, P. Behavior from the void: Unsupervised active pre-training, 2021. URL https: //arxiv.org/abs/2103.04551. Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018. URL https: //arxiv.org/abs/1805.12114. Liu, Q. Stein variational gradient descent as gradient flow. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings. neurips.cc/paper/2017/file/ 17ed8abedc255908be746d245e50263a-Paper. pdf. Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2606–2615, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/ chwialkowski16.html. Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. arXiv preprint arXiv:1608.04471, 2016. 9 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Liu, Q., Lee, J., and Jordan, M. A kernelized stein discrepancy for goodness-of-fit tests. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 276– 284, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr.press/v48/ liub16.html. Trott, A., Zheng, S., Xiong, C., and Socher, R. Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards, 2019. URL https: //arxiv.org/abs/1911.01417. Warde-Farley, D., Van de Wiele, T., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018. Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017. Yang, D., Zhang, H., Lan, X., and Ding, J. 
Density-based curriculum for multi-goal reinforcement learning with sparse rewards. CoRR, abs/2109.08903, 2021. URL https://arxiv.org/abs/2109.08903. Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual reinforcement learning with imagined goals. arXiv preprint arXiv:1807.04742, 2018. Yarats, D., Fergus, R., Lazaric, A., and Pinto, L. Reinforcement learning with prototypical representations. arXiv, 2021. doi: 10.48550/ARXIV.2102.11271. URL https://arxiv.org/abs/2102.11271. Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pp. 2377–2386. PMLR, 2016. Zhang, Y., Abbeel, P., and Pinto, L. Automatic curriculum learning through value disagreement. arXiv preprint arXiv:2006.09641, 2020. doi: 10.48550/ARXIV.2006. 09641. URL https://arxiv.org/abs/2006. 09641. Pathak, D., Gandhi, D., and Gupta, A. Self-supervised exploration via disagreement, 2019. URL https:// arxiv.org/abs/1906.04161. Pitis, S., Chan, H., Zhao, S., Stadie, B., and Ba, J. Maximum entropy gain exploration for long horizon multi-goal reinforcement learning. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7750–7761. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/ v119/pitis20a.html. Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., and Zaremba, W. Multi-goal reinforcement learning: Challenging robotics environments and request for research, 2018. Pong, V. H., Dalal, M., Lin, S., Nair, A., Bahl, S., and Levine, S. Skew-fit: State-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698, 2019. Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., and Lillicrap, T. Automated curriculum generation through setter-solver interactions. In International Conference on Learning Representations, 2019. Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320. PMLR, 2015. Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017. 10 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL A. Proof of Theorem 1 B. Visualization of goals Theorem 1. Recovery property: Let us denote as G + the set of goals g such that Vψ (g) > 0 and C ∈ Rd its convex hull. Assume that, at a given iteration l, Dϕ (g) ≈ 1 for every g ∈ G + (i.e., the whole set G + is considered as mastered by Dϕ (g)), and that, on that set, Vψ is well calibrated: for any area ω ⊆ G + and any goal g ∈ ω, Vψ (g) ≈ P (g ∈ G ∗ |g ∈ ω). Assume also that we use a kernel which ensures that the Kernelized Stein Discrepancy KSD(q, pgoals ) of any distribution q with pgoals is 0 only if q weakly converges to pgoals 6 . Then, with no updates of the models after iteration l and a number of particles √ m > 1, any area ω ⊆ G + ∩ G ∗ √ with diameter diam(ω) ≥ d (diam(C) d m−1) eventually contains We visualize behavioral and achieved goals in Maze 1 (Figure 5) and Maze 2 (Figure 6), in parallel with the success coverage after training. 
The main advantages of our method lie in the capacity to target difficult areas and avoid catastrophic forgetting, which results in nearly optimal success coverage. We observe that MEGA efficiently discovers the environment but fails to master the corresponding goals. This also leads to catastrophic forgetting and a lack of adaptation when the environment changes, as studied in the main paper. One can see that the generation of GOIDs in G OAL GAN is very unstable and tricky in such discontinuous goal space, especially as the generator is susceptible to output goals outside the environment boundary, which SVGG avoids with the validity distribution. at least one particle, whenever KSD({xi }ni=1 , pgoals ) = 0 after convergence of the particle updates. C. Additional experiments details Proof. In settings where the KSD of a distribution µ from a target p is 0 only if µ weakly converges to p, (Liu, 2017) previously proved that, for any compact set, the empirical measure µn of µ, computed on a set of n particles, converges weakly towards the target p when a sufficient number of particles is used. Thus, under our hypotheses, the set of particles of SVGG appropriately represents pgoals after a sufficient number of steps of Stein transforms. C.1. Environments Pointmaze We use a 2D pointmaze environment where the state space and goal space are (x, y) coordinates (the agent cannot see the walls), and the agent moves according to its action, which is a 2-dimensional vector with dimensions constrained to [−0.95, 0.95]. All methods are tested on four 5 × 5 mazes with different shapes and a highly discontinuous goal space. The maximum number of steps in one trajectory is set to 30. We argue that the difficulty of these environments does not lie in their size, but rather in the complexity of their goal space and thus of the trajectories adopted by the agent to achieve these goals. Now, we know that particles cannot leave G + since Vψ = 0 outside, and so does pgoals . Since Vψ is well calibrated on G + , we also know that Vψ = 1 on every area of G + only containing achievable goals. Thus, since pskills = 1 in G + , pgoals is maximal and constant for any area ω ∈ G + ∩ G ∗ . This implies that the concentration of particles in any area of G + ∩ G ∗ is greater than if particles were uniformly spread over C. In that case, for any particle x from the set {xi }m i=1 , we know that P (x ∈ ω|KSD(q = {xi }m i=1 , pgoals ) = 0) ≥ P (x ∈ ω|KSD(q = {xi }m i=1 , U (X )) = 0), with X an hypercube of d dimensions with side length equal to diam(C) and U (X ) the uniform distribution over X . Antmaze An Ant has to move around in a U-shape hallway maze of size 20x4, the goal space remains (x, y) coordinates of the Ant as in Pointmaze. However, the exploratory behavior is much simpler than in the considered pointmazes environments, as the maze is simply U-shaped. The difficulty lies in the control of the Ant dynamics, as it must move its legs with a 8-dimensional action space and the observation space is a 30-dimensional vector corresponding to the angles of the legs. Next, if KSD(q = {xi }m i=1 , U (X )) = 0, we know that particles are spread as a grid over each dimension of X . Thus, in each dimension of X any particle is separated from its √ closest neighbor by at most a difference of diam(C)/( d m − 1) in the worst case. Thus, any area ω √ √ with diameter greater than d (diam(C) d m−1) is guaranteed to contain a particle in that case, which concludes the proof. 
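For concreteness, the Point-Maze dynamics described above are simple enough to sketch as a gym-style environment stub. The wall layout and collision check are placeholders (the real mazes are 5 × 5 with walls the agent cannot perceive); only the action clipping bounds, the 30-step horizon and δ = 0.15 come from this appendix.

```python
import numpy as np

class PointMazeStub:
    """Minimal sketch of the 2D point-maze dynamics: states and goals are
    (x, y), actions are clipped to [-0.95, 0.95], episodes last at most
    30 steps, and success means reaching within delta of the goal."""

    def __init__(self, walls=None, delta=0.15, horizon=30):
        self.walls = walls or []      # placeholder list of wall segments
        self.delta = delta
        self.horizon = horizon

    def reset(self, goal):
        self.pos = np.zeros(2)
        self.goal = np.asarray(goal, dtype=float)
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        move = np.clip(np.asarray(action, dtype=float), -0.95, 0.95)
        nxt = self.pos + move
        if not self._hits_wall(self.pos, nxt):   # walls block moves but are not observed
            self.pos = nxt
        self.t += 1
        success = np.linalg.norm(self.pos - self.goal) < self.delta
        done = success or self.t >= self.horizon
        return self.pos.copy(), float(success), done

    def _hits_wall(self, start, end):
        return False                   # placeholder collision check
```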
FetchPickAndPlace (Hard version) We also perform comparisons on a hard version of FetchPickAndPlace-v1 from OpenAI gym (Brockman et al., 2016). The agent controls a robotic arm which must pick and place a block to a 3D desired location. In the hard version, the behavioral goals are all between 20 and 45cm in the air, while in the original version, 50% of behavioral goals are on the table and the remaining ones are from 0 to 45cm in the air. 6 As assumed for instance in (Liu, 2017). This is not always true, but gives strong insights about the behavior of SVGD. Please refer to (Gorham & Mackey, 2017) for more discussions about KSD. While this environment presents a greater difficulty in terms of control (the action dimension is 4 and the state dimen11 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Figure 5. Visualization of: (A) behavioral goals , (B) achieved goals and (C) success coverage for 1M steps in Maze 1. sion is 24 which correspond to the various positions of the robot joint), the goal generation component is significantly easier: as the 3-dimensional goal space is smooth, there is no obstacle for the agent to bypass. As Figure 2 shows, SVGG solves the environment within 1.5M steps, but does not significantly outperform MEGA. tion. The initial task is solved within 20.000 steps by all baselines, whereas with the modified version, only SVGG achieves a 0.8 success rate in 3M steps. FetchPush (Maze version) We add a maze structure with walls on the table to the FetchPush benchmark to once again add discontinuities in the goal space. The observations are 24-dimensional vectors, the actions as well as the goals are 3-dimensional. The success coverage is computed on goals uniformly sampled on the table, and the block’s initial position is sampled behind the first wall on the robot side. The standard FetchPush task was solved by SVGG and MEGA in less than 1M steps. The maze version is much harder: SVGG achieves a 0.8 success coverage rate within 10M steps, whereas MEGA barely reaches 0.5. The distance to reach the goals in all environments is δ = 0.15. We argue that the interest of our goal generation algorithm resides in environments with highly discontinuous goal space as the action control is largely supported by the RL algorithm (e.g. DDPG). Therefore, the smaller difference between MEGA and SVGG in this environment was expected, as SVGG is mainly designed to target non-smooth goal spaces, and to avoid pitfalls such as catastrophic forgetting. FetchReach (Stuck version) We compare SVGG to baselines with a modified version of the FetchReach environment, where the gripper is initially stuck between four walls and has to carefully navigate between them to reach the goals. The observations are 9-dimensional vectors, the actions as well as the goals are 3-dimensional. The success coverage is computed on uniformly sampled goals with a target range fixed to 0.2, which corresponds to the maximal L2 distance between the goal and the initial gripper posi- D. Control of the sampling difficulty Using beta distributions makes it possible to tune the goal difficulty an agent should aim at to efficiently explore and control an environment. Indeed, by tuning the α and β hyper-parameters of the distribution, one gets a different 12 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Figure 6. Visualization of: (A) behavioral goals , (B) achieved goals and (C) success coverage for 4M steps in Maze 2. 
SVGG is the only one to achieve nearly perfect success coverage. incentive for goal difficulty. We choose to compare the efficiency of 5 pairs of α and β hyper-parameters, as illustrated in Figure 7. In Figure 8, we compare the 5 versions of SVGG using different beta distributions from Figure 7, on the 4 previously considered mazes. One can observe that extreme targets difficulties are the least effective ones, especially the Very easy, which is too conservative to efficiently explore new areas of the space. On the other hand, SVGG performs very well with Medium and Hard distributions. This suggests that the optimal goal difficulty is somewhere between medium and hard. Performing a more advanced optimization over α and β is left for future work. E. Target goal distribution ablations We conduct an ablation which consists in keeping SVGD, but replacing our target goal distribution (corresponding to goals whose prediction of success is close to 0.5) with the target goal distributions of related work such as of S KEWF IT, MEGA or G OAL GAN (resp. the uniform distribution over the support of achieved goals, the region of low den- Figure 7. Modulation of goal’s target difficulty with beta distributions. Dϕ (g) is the model’s predicted probability of the agent achieving the goal g. sity of achieved goals and goals of intermediate difficulty (GOIDs)). For all criteria, the goal target distribution is rep13 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Figure 8. Evolution of success coverage for 5 target difficulties as intrinsic motivation for SVGG, on 4 mazes and 4 seeds each. resented as a set of particles and optimized with SVGD. We compare the average success coverage over our 4 mazes on Figure 9 where we can see that our choice of target distribution is the most effective. We describe below the compared goals’ distribution. “Only Validity” ablation with a beta distribution tuned to address the lowest probability region (i.e taking pgoals ∝ f (α, β, Vψ (g))), with f a Beta distribution and α and β set to target low density of V ). Due to the absence of a mechanism to avoid sampling unfeasible goals, the particles are attracted to low density areas that mostly correspond to non-valid goals, which make this baseline very inefficient. distribution of GOIDs (as in G OAL GAN): Our target distribution pgoals is very close to the GOID in G OAL GAN, the main difference is the smoothness of our Beta distribution versus the indicator function of G OAL GAN that labels a goal as of ”intermediate difficulty” when the probability of reaching it in an interval (eg. between 0.2 and 0.8). So, to move closer to the G OAL GAN criterion, we replaced the beta-distribution used in SVGG with the crisp distribution (with a generalized Gaussian distribution of skewing parameter β = 6) which outputs 1 for probabilities between 0.3 and 0.7 and 0 otherwise (which are the parameters that give the best results). Note that while this distribution is differentiable, the gradient is less informative than our version. As a consequence, approximating this distribution with SVGD is less efficient and gets a lower success coverage. Figure 9. Plot of the average success coverage of SVGG over the 4 mazes (3 seeds each for a total of 12 runs) after 4 millions training steps where we replace our target goal distribution with other choices and keep SVGD as sampling method, the error bar shows the standard deviation between all runs. F. 
Sampling method ablations To highlight the interest of the SVGD choice to sample goals from our target distribution pgoals , we conducted additional experiment where we swap SVGD with some other sampling tools : MCMC (Metropolis Hastings), GANs and direct sampling from the replay buffer. Uniform distribution over the support of achieved goals (as in S KEW-F IT and DISCERN): This corresponds to our ”Only validity” baseline in Figure 2. Indeed, the probability for a goal to be valid is learned with a One Class SVM which exactly strives to uniformly cover the distribution of achieved goals. As the results presented in Figure 9 show, this baseline performs relatively well, but is unable to target areas where the agent struggles. Therefore, it is not as efficient as our target distribution pgoals . In Figure 10 we can see that SVGD is by far the most efficient way to sample from pgoals and thus maximize the success coverage. We describe below the tested sampling tools. GANs : We use the same procedure as G OAL GAN by replacing their GOID criterion by our target distribution pgoals . the results are very similar to G OAL GAN, this can be explained by the proximity of our criterion in terms of Distribution derived from the region of low density of achieved goals (as in MEGA): We can obtain a distribution of goals very close to MEGA ’s by combining our 14 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Table 1. DDPG parameters DDPG Hyper-Parameters Batch size for replay buffer Discount factor γ Action L2 regularization (Gaussian) Action noise max std Warm up steps before training Actor learning rate Critic learning rate Target network soft update rate Actor & critic networks activation Actor & critic hidden layers sizes Replay buffer size Value 2000 0.99 0.1 0.1 2500 1e-3 1e-3 0.05 Gelu 5123 nb training steps Figure 10. Plot of the average success coverage of SVGG over the 4 mazes (3 seeds each for a total of 12 runs) after 4 millions training steps where we replace SVGD with other sampling methods, the error bar shows the standard deviation between all runs. intermediate difficulty, except from the fact that we add pvalid , we can also conclude that GANs are not the best choice for moving target distributions due to their training instability. We use the same hyperparameters as in the G OAL GAN baseline. Buffer sampling : We sample a batch of previously achieved goals in the buffer, and then compute a categorical distribution based on the pgoals criterion and sample candidate goals. This method is the least effective baseline, which was expected as the exploration component was provided by SVGD , the goals from the replay buffer are not sufficiently diverse to explore the environment if not combined with a diversity criterion. G.1. Hindsight Experience Replay The original goal relabeling strategy introduced in HER (Andrychowicz et al., 2017) is the future, which consists in relabeling a given transition with a goal achieved on the trajectory later on. This is very effective in sparse reward setting to learn a GCP. However, many works suggested that relabeling transitions with goals outside the current trajectory helps the agent generalize across trajectories. For example, one can use inverse RL to determine the optimal goal to relabel a given transition (Eysenbach et al., 2020). We use a naive version of this method. As in (Pitis et al., 2020), we relabel transitions for DDPG optimization using a mixed strategy. All methods presented in this work use the same strategy. 
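A hedged sketch of such a mixed relabeling scheme, combining real transitions, HER's future strategy, and goals sampled from other trajectories (the exact proportions used in this work are given in this section); buffer structures and names are illustrative, and the reward is recomputed with the sparse_goal_reward sketch from Section 2.1.

```python
import random

def relabel_batch(transitions, achieved_goals_buffer,
                  p_real=0.1, p_future=0.4):
    """Mixed HER relabeling: keep a fraction of transitions unchanged,
    relabel a fraction with a goal achieved later in the same trajectory
    ('future'), and relabel the rest with goals sampled among past
    behavioral/achieved goals from other trajectories."""
    relabeled = []
    for tr in transitions:  # tr: dict with 'next_state', 'goal', 'future_achieved', ...
        u = random.random()
        if u < p_real:
            goal = tr['goal']                               # keep the original goal
        elif u < p_real + p_future:
            goal = random.choice(tr['future_achieved'])     # HER 'future' strategy
        else:
            goal = random.choice(achieved_goals_buffer)     # goal from another trajectory
        reward = sparse_goal_reward(tr['next_state'], goal)  # recompute the sparse reward
        relabeled.append({**tr, 'goal': goal, 'reward': reward})
    return relabeled
```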
MCMC (Metropolis Hastings) : At each pgoals update, we construct a markov-chain with normal movement (with µ = 0 and σ = 0.5) then classic likelihood ratio rejection sampling to draw goals samples that are provided to the agent. Metropolis-Hastings is a good candidate but is still far from the SVGD performance, it presents some good sampling accuracy at times but is very slow to converge to the true distribution, thus the goal sampling isn’t always aligned with the agent skills at the time, and enable the recovery property. 10% of real experience are kept while 40% of the transitions are relabeled using the future strategy of HER. We relabel the remaining 50% transitions with goals outside of their trajectory, with randomly sampled goals among the past behavioral and achieved goals. The latter part of the strategy helps share information between different trajectories that often contains similar transitions. G. Methods details We first focus on the behavioral goal sampling strategy. We use DDPG with HER to learn how to reach a goal in SVGG, MEGA and G OAL GAN. DDPG is parameterized as in Table 1. All algorithms are run using the same number of training epochs n. G.2. SVGG SVGG is described in the main paper, in this section, we go through implementation details and hyper-parameters. 15 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Algorithm 2 RL with Stein Variational Goal Generation 1: Input: a GCP πθ , a success predictor Dϕ , a reachability predictor Rψ , buffers of transitions B, reached states R and success outcomes O, a kernel k, hyper-parameters t(r) , t(m) , t(p) standing for number of rollouts, model updates and particles updates respectively, and hyper-parameter b standing for the batchsize of the model update. 2: Sample a set of particles q = {xi }m i=1 uniformly from states reached by pre-runs of πθ ; 3: for n epochs do 4: for t(r) iterations do ▷ Data Collection 5: Sample a goal g from q; 6: τ ← Rollout πθ (.|., g); 7: Store all (st , at , st+1 , rt , g) from τ in B and every st from τ in R; 8: Optionally (HER): Store relabeled transitions from τ ; 9: Store outcome (g, I(s|τ | ≈ g)) in O; 10: end for 11: for t(m) iterations do ▷ Model Update 12: Sample a batch {(g i , si )}bi=1 from O; 13: Oversample the minority class w.r.t. s to get a balanced success/failure dataset; 14: Update model Dϕ (e.g., via ADAM), with gradient of Lϕ (3)); 15: end for 16: Update model Rψ according to all states in R; ▷ Prior Update 17: for t(p) iterations do ▷ Particles Update 18: Compute the density of the target pP the set of particles q using 2; goals for  m  1 k(x , x )∇ log p (x ) + ∇xj k(xj , xi ) ; 19: Compute transforms: ϕ∗ (xi ) = m j i x goals j j j=1 20: Update particles xi ← xi + ϵϕ∗ (xi ), ∀i = 1 · · · m; 21: end for 22: Improve agent with any Off-Policy RL algorithm ▷ Agent Improvement 23: (e.g., DDPG) using transitions from B; 24: end for SVGG Hyper-Parameters SVGD Number of particles Optimization interval (in steps) Nb of particle moves per optim. RBF kernel k(., .) bandwidth Learning rate Agent’s skill model Dϕ Hidden layers sizes Gradient steps per optimization Learning rate Training batch size Training history length (episodes) Optimization interval (in steps) Nb of training steps Activations OCSVM validity distribution RBF kernel k(., .) bandwidth Optimization interval (in steps) Symbol m k (p) σ ϵ K l(m) k (m) σ Value G.3. MEGA 100 20 1 1 1e-3 The authors of MEGA (Pitis et al., 2020) train a GCP with previously achieved goals from a replay buffer. 
Their choice of goals relies on a minimum density heuristic, where they model the distribution of achieved goals with a KDE. They argue that aiming at novel goals suffices to efficiently discover and control the environment. We use the original implementation of the authors, the pseudocode is given in Algorithm 3 and specific hyper-parameters are in Table 2. (64, 64) 10 1e-3 100 1000 4000 100 Gelu Table 2. MEGA parameters MEGA Hyper-parameters RBF kernel bandwidth KDE optimization interval (in steps) Nb of state samples for KDE optim. Nb of sampled candidate goals Q-value cutoff 1 4000 Symbol Value σ 0.1 1 10.000 100 -3 N c G.4. GoalGAN G OAL GAN (Florensa et al., 2018) uses a procedural goal generation method based on GAN training. As our SVGG, it 16 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Algorithm 3 MEGA 1: Input: a GCP πθ , buffer of reached states R, a KDE model Pas of the achieved states, numbers c, N , t(r) and l(r) . 2: Initialize R with states reached by pre-runs of πθ ; 3: for n epochs do ▷ Data Collection 4: for t(r) iterations do 5: Sample a batch {g i }N i=1 uniformly from R; 6: Eliminate unachievable candidates if their Q-values are below cutoff c; (r) 7: Choose goals {g i }li=1 = argmingi Pas (gi ) 8: for i from 1 to l(r) do 9: τ ← Rollout πθ (.|., g = g i ); 10: Store all (st , at , st+1 , rt , g i ) from τ in B and every st from τ in R; 11: Optionally (HER): Store relabeled transitions from τ ; 12: Store outcome (g i , I(s|τ | ≈ g i )) in O; 13: end for 14: end for 15: Update KDE model Pas with uniform sampling from R; ▷ Model Update 16: Improve agent with any Off-Policy RL algorithm ▷ Agent Improvement 17: (e.g., DDPG) using transitions from B; 18: end for Algorithm 4 GoalGAN 1: Input: a GCP πθ , a goal Generator Gθg , a Discriminator Dθd , a success predictor Dϕ , buffers of transitions B, reached states R and success outcomes O, numbers t(r) , l(r) , l(m) , k (m) , l(g) and k (g) . 2: Initialize Gθg and Dθd with pre-runs of πθ ; 3: for n epochs do ▷ Data Collection 4: for t(r) iterations do (r) 5: Sample noise {zi }li=1 ∼ N (0, 1) (r) (r) 6: generate {g i }li=1 = Gθg ({zi }li=1 ) 7: for i from 1 to l(r) do 8: τ ← Rollout πθ (.|., g = g i ); 9: Store all (st , at , st+1 , rt , g i ) from τ in B and every st from τ in R; 10: Optionally (HER): Store relabeled transitions from τ ; 11: Store outcome (g i , I(s|τ | ≈ g i )) in O; 12: end for 13: end for (g) 14: Sample a batch {(g i , si )}li=1 from O; ▷ GAN training (g) l(g) 15: Label goals (GOID or not) with model Dϕ : {ygi }li=1 = {Pmin < Dϕ (gi ) < Pmax }i=1 (g) 16: for k iterations do 17: Update Gθg and Dθd (e.g. with ADAM) with gradients of LSGAN losses; (7) 18: end for 19: for k (m) iterations do ▷ Model Update i i l(m) 20: Sample a batch {(g , s )}i=1 from O; 21: Update model Dϕ (e.g., via ADAM), with gradient of Lϕ (agent model loss described in the paper); 22: end for 23: Improve agent with any Off-Policy RL algorithm ▷ Agent Improvement 24: (e.g., DDPG) using transitions from B; 25: end for Dθd is trained to distinguish between goals in GGOID and other goals, while a generator Gθg is trained to output goals in GGOID by relying on the discriminator outputs. They optimize Gθg and Dθd in a manner similar to the Least- aims at sampling goals of intermediate difficulty, which they define as GGOID = {g|Pmin < Pπ (g) < Pmax }, Pπ (g) being the probability for the policy π to achieve goal g, Pmin and Pmax are hyper-parameters. 
A Discriminator 17 Stein Variational Goal Generation for adaptive Exploration in Multi-Goal RL Squares GAN (LSGAN) with the following losses:  min V (Dθd ) =Eg∼R yg (Dθd (g) − 1)2 θd  + (1 − yg )(Dθd (g) + 1)2   min V (Gθg ) =Ez∼N (0,1) Dθd (Gθg (z))2 , (7) θg where yg is the label that indicates whether g belongs to GGOID or not. In (Florensa et al., 2018), the authors use Monte-Carlo sampling of the policy to estimate yg . For efficiency reasons, we use a learned model of the agent’s capabilities as in SVGG. The pseudocode is given in Algorithm 4 and specific hyper-parameters are in Table 3. Table 3. G OAL GAN parameters GoalGAN Hyper-parameters Gaussian prior dimension Generator hidden layers sizes Discriminator hidden layers sizes Optimization interval (in steps) GAN training batch size Nb of GAN optimization steps GAN Learning rate Minimum GOID probability Maximum GOID probability Symbol l(g) k (g) Pmin Pmax Value 4 (64, 64) (64, 64) 2000 200 100 1e-3 0.1 0.9 G.5. Ressources Every seed of each experiment was run on 1 GPU in the following list {Nvidia RTX A6000, Nvidia RTX A5000, Nvidia TITAN RTX, Nvidia Titan Xp, Nvidia TITAN V}. The total training time is close to 4.000 hours, most of it executed in parallel, as we had full time access to 12 GPUs. 18