arXiv:2107.08829v2 [cs.LG] 27 Jun 2022

Visual Adversarial Imitation Learning using Variational Models

Rafael Rafailov^1, Tianhe Yu^1, Aravind Rajeswaran^{2,3}, Chelsea Finn^1
{rafailov, tianheyu, cbfinn}@stanford.edu, aravraj@fb.com
^1 Stanford University, ^2 University of Washington, ^3 Facebook AI Research

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at https://sites.google.com/view/variational-mail.

1 Introduction

The ability of reinforcement learning (RL) agents to autonomously learn by interacting with the environment presents a promising approach for learning diverse skills. However, reward specification has remained a major challenge in the deployment of RL in practical settings [1, 2, 3]. The ability to imitate humans or other expert trajectories allows us to avoid the reward specification problem, while also circumventing challenges related to task-specific exploration in RL. Visual demonstrations can also be a more natural way to teach robots various tasks and skills in real-world applications. However, this setting is also fraught with a number of technical challenges, including representation learning for visual observations, sample complexity due to the high-dimensional observation spaces, and learning instability [4, 5, 6] due to the lack of a stationary learning signal. We aim to overcome these challenges and to develop an algorithm that can learn from limited demonstration data and scale to the high-dimensional observation and action spaces often encountered in robotics applications.

Behaviour cloning (BC) is a classic algorithm for imitating expert demonstrations [7]: it uses supervised learning to greedily match the expert behaviour at demonstrated expert states. Due to environment stochasticity, covariate shift, and policy approximation error, the agent may drift away from the expert state distribution and ultimately fail to mimic the demonstrator [8]. While a wide initial state distribution [9] or the ability to interactively query the expert policy [8] can circumvent these difficulties, such conditions require additional supervision and are difficult to meet in practical applications.
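To make the behaviour cloning baseline concrete, the sketch below shows a minimal supervised imitation loss on image observations. It is an illustrative sketch only: the network architecture, tensor sizes, and the random stand-in data are assumptions made for the example, not the setup used in this paper.

```python
import torch
import torch.nn as nn

# Minimal behaviour-cloning sketch: regress expert actions from image
# observations with a small convolutional policy (all sizes are placeholders).
class ConvPolicy(nn.Module):
    def __init__(self, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 14 * 14, action_dim),  # sized for 64x64 RGB inputs
        )

    def forward(self, obs):
        return self.net(obs)

# One gradient step on a batch of (observation, expert action) pairs.
policy = ConvPolicy(action_dim=6)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs = torch.randn(16, 3, 64, 64)       # stand-in for demonstration images
expert_actions = torch.randn(16, 6)    # stand-in for demonstrated actions
loss = ((policy(obs) - expert_actions) ** 2).mean()  # supervised imitation (MSE) loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

As noted above, purely supervised imitation of this form is vulnerable to compounding errors once the agent drifts away from the demonstrated state distribution.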
An alternate line of work based on inverse RL [10, 11] and adversarial imitation learning [12, 13] aims to match not only the actions at demonstrated states, but also the long-term state visitation distribution of the expert [14]. Adversarial imitation learning approaches explicitly train a GAN-based classifier [15] to distinguish the visitation distribution of the agent from that of the expert, and use it as a reward signal for training the agent with RL. While these methods have achieved substantial improvements over behaviour cloning without additional expert supervision, they are difficult to deploy in realistic scenarios for multiple reasons: (1) the objective requires on-policy data collection, leading to high sample complexity; (2) the reward function changes as the RL agent learns; and (3) high-dimensional observation spaces require representation learning and exacerbate the optimization challenges.

Figure 1: Left: the variational dynamics model, which enables joint representation learning from visual inputs and a latent-space dynamics model, and the discriminator, which is trained to distinguish latent states of expert demonstrations from those of policy rollouts. Dashed lines represent inference and solid lines represent the generative model. Right: the policy training, which uses the discriminator as the reward function, so that the policy induces a latent state visitation distribution that is indistinguishable from that of the expert. The learned policy network is composed with the image encoder from the variational model to recover a visuomotor policy.

Our main contribution in this work is the development of a new algorithm, variational model-based adversarial imitation learning (V-MAIL), which aims to overcome each of the aforementioned challenges within a single framework. As illustrated in Figure 1, V-MAIL trains a variational latent-space dynamics model and a discriminator that provides a learning signal by distinguishing latent rollouts of the agent from those of the expert. The key insight of our approach is that variational models can address these challenges simultaneously by (a) making it possible to collect on-policy rollouts inside the model without environment interaction, leading to an efficient and stable optimization process, and (b) providing a rich auxiliary objective that enables efficient learning of compact state representations and regularizes the discriminator. Furthermore, the variational model also allows V-MAIL to perform zero-shot transfer to new imitation learning tasks. By generating on-policy rollouts within the model, and training the discriminator using these rollouts along with demonstrations of a new task, V-MAIL can learn policies for new tasks without any additional environment interactions.

Through experiments on vision-based locomotion and manipulation tasks, we find that V-MAIL can successfully learn visuomotor control policies from limited demonstrations. In particular, V-MAIL exhibits stable and near-monotonic learning, is highly sample efficient, and asymptotically matches expert-level performance on most tasks. In contrast, prior algorithms exhibit unstable learning and poor asymptotic performance, often achieving less than 20% of expert performance on these vision-based tasks. We further show the ability to transfer the model to novel tasks, acquiring qualitatively new behaviors using only a few demonstrations and no additional environment interactions.

2 Preliminaries

We consider the problem setting of learning in partially observed Markov decision processes (POMDPs), which can be described with the tuple M = (S, A, X, R, T, U, γ), where s ∈ S is the state, a ∈ A is the action, x ∈ X is the observation, r = R(s, a) is the reward function, and γ is the discount factor. The state evolution is Markovian and governed by the dynamics s′ ∼ T(·|s, a). Finally, the observations are generated through the observation model x ∼ U(·|s). The widely studied Markov decision process (MDP) is a special case of this 7-tuple in which the underlying state is directly observed in the observation model.

In this work, we study imitation learning in unknown POMDPs. Thus, we do not have access to the underlying dynamics, the true state representation of the POMDP, or the reward function. In place of the rewards, the agent is provided with a fixed set of expert demonstrations collected by executing an expert policy π^E, which we assume is optimal under the unknown reward function. The agent can interact with the environment and must learn a policy π(a_t | x_{≤t}) that mimics the expert.

2.1 Imitation learning as divergence minimization

In line with prior work, we interpret imitation learning as a divergence minimization problem [12, 14, 16]. For simplicity of exposition, we consider the MDP case in this section, and discuss POMDP extensions in Section 3.2. Let ρ^π_M(s, a) = (1 − γ) ∑_{t=0}^{∞} γ^t P(s_t = s, a_t = a) be the discounted state-action visitation distribution of a policy π in MDP M. Then, a divergence minimization objective for imitation learning corresponds to

    \min_{\pi} \; D\left(\rho^{\pi}_{\mathcal{M}}, \rho^{E}_{\mathcal{M}}\right),    (1)

where ρ^E_M is the discounted visitation distribution of the expert policy π^E, and D is a divergence measure between probability distributions, such as the KL divergence, the Jensen-Shannon divergence, or a generic f-divergence. To see why this is a reasonable objective, let J(π, M) denote the expected value of a policy π in M. Inverse RL [17, 12, 13] interprets the expert as the optimal policy under some unknown reward function. With respect to this unknown reward function, the sub-optimality of any policy π can be bounded as

    J(\pi^{E}, \mathcal{M}) - J(\pi, \mathcal{M}) \;\le\; \frac{R_{\max}}{1-\gamma}\, D_{TV}\left(\rho^{\pi}_{\mathcal{M}}, \rho^{E}_{\mathcal{M}}\right),

since the policy performance is (1 − γ) · J(π, M) = E_{(s,a)∼ρ^π_M}[r(s, a)]. We use D_TV to denote the total variation distance. Since various divergence measures are related to the total variation distance, optimizing the divergence between visitation distributions in state space amounts to optimizing a bound on the policy sub-optimality.
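One way to see why this bound holds, under the additional assumptions (made here for illustration) that rewards lie in [0, R_max] and that D_TV denotes half the ℓ1 distance between the two visitation distributions, is the following short derivation sketch:

```latex
% Derivation sketch, assuming r(s,a) \in [0, R_{\max}] and
% D_{TV}(p, q) = \tfrac{1}{2} \sum_{s,a} |p(s,a) - q(s,a)|.
\begin{align*}
(1-\gamma)\bigl(J(\pi^E,\mathcal{M}) - J(\pi,\mathcal{M})\bigr)
  &= \mathbb{E}_{(s,a)\sim\rho^{E}_{\mathcal{M}}}[r(s,a)]
     - \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\mathcal{M}}}[r(s,a)] \\
  &= \sum_{s,a}\bigl(\rho^{E}_{\mathcal{M}}(s,a) - \rho^{\pi}_{\mathcal{M}}(s,a)\bigr)
     \Bigl(r(s,a) - \tfrac{R_{\max}}{2}\Bigr)
     \qquad \text{(the visitation gaps sum to zero)} \\
  &\le \tfrac{R_{\max}}{2}
     \sum_{s,a}\bigl|\rho^{E}_{\mathcal{M}}(s,a) - \rho^{\pi}_{\mathcal{M}}(s,a)\bigr|
   \;=\; R_{\max}\, D_{TV}\bigl(\rho^{\pi}_{\mathcal{M}}, \rho^{E}_{\mathcal{M}}\bigr).
\end{align*}
```

Dividing both sides by (1 − γ) recovers the bound stated above.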
2.2 Generative Adversarial Imitation Learning (GAIL)

With the divergence minimization viewpoint, any standard generative modeling technique, including density estimation, VAEs, GANs, etc., can in principle be used to minimize Eq. 1. However, in practice, the use of certain generative modeling techniques can be difficult. A standard density estimation technique would involve directly parameterizing ρ^π_M, say through auto-regressive flows, and learning the density model. However, a policy that induces the learned visitation distribution in M is not guaranteed to exist and may prove hard to recover. Similar challenges prevent the direct application of a VAE-based generative model as well. In contrast, GANs allow for a policy-based parameterization, since they only require the ability to sample from the generative model and do not require the likelihood. This approach was followed in GAIL, leading to the optimization

    \max_{\pi} \; \min_{D_{\psi}} \;\; \mathbb{E}_{(s,a)\sim\rho^{E}_{\mathcal{M}}}\left[-\log D_{\psi}(s, a)\right] + \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\mathcal{M}}}\left[-\log\left(1 - D_{\psi}(s, a)\right)\right],    (2)

where D_ψ is a discriminative classifier used to distinguish between samples from the expert distribution and the policy-generated distribution. Results from Goodfellow et al. [15] and Ho and Ermon [12] suggest that the learning objective in Eq. 2 corresponds to the divergence minimization objective in Eq. 1 with the Jensen-Shannon divergence. In order to estimate the second expectation in Eq. 2, we require on-policy samples from π, which is often data-inefficient and difficult to scale to high-dimensional image observations. Some off-policy algorithms [18, 19] replace the expectation under the policy distribution with an expectation under the current replay buffer distribution, which allows for off-policy training but can no longer guarantee that the induced visitation distribution of the learned policy will match that of the expert.
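The sketch below illustrates the discriminator side of Eq. 2 and how its output can be reused as a reward for the policy. It assumes flat (state, action) tensors and placeholder network sizes, and is not tied to any particular GAIL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative GAIL-style discriminator over (state, action) pairs.
class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit of D_psi(s, a)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def discriminator_loss(disc, expert_s, expert_a, policy_s, policy_a):
    """Binary cross-entropy form of the inner minimization in Eq. (2):
    expert pairs are labeled 1, policy pairs are labeled 0."""
    expert_logits = disc(expert_s, expert_a)
    policy_logits = disc(policy_s, policy_a)
    expert_term = F.binary_cross_entropy_with_logits(
        expert_logits, torch.ones_like(expert_logits))   # -log D on expert samples
    policy_term = F.binary_cross_entropy_with_logits(
        policy_logits, torch.zeros_like(policy_logits))  # -log(1 - D) on policy samples
    return expert_term + policy_term

def adversarial_reward(disc, state, action):
    """Reward handed to the RL learner, matching the outer maximization
    in Eq. (2): r(s, a) = -log(1 - D_psi(s, a))."""
    with torch.no_grad():
        d = torch.sigmoid(disc(state, action))
    return -torch.log(1.0 - d + 1e-8)
```

Estimating the second expectation in Eq. 2 with such policy samples is precisely the step that normally requires fresh environment rollouts; the model-based approach developed next replaces these with rollouts inside a learned model.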
3 Variational Model-Based Adversarial Imitation Learning

Imitation learning methods based on expert distribution matching have unique challenges. Improving the generative distribution of trajectories (through policy optimization, since we do not have control over the environment dynamics) requires samples from ρ^π_M, which entails rolling out π in the environment. Furthermore, the optimization landscape of a saddle-point problem (see Eq. 2) can require many iterations of learning, each requiring fresh on-policy rollouts. This is different from typical generative modeling applications [15, 20], where sampling from the generator is cheap. To overcome these challenges, we present a model-based imitation learning algorithm. Model-based algorithms can utilize a large number of synthetic on-policy rollouts using the learned dynamics model, with periodic model correction. In addition, learning the dynamics model serves as a rich auxiliary task for state representation learning, making policy learning easier and more sample efficient. For conceptual clarity and ease of exposition, we first present our conceptual algorithm in the MDP setting in Section 3.1, and then extend this algorithm to the POMDP case in Section 3.2. Finally, we present a practical version of our algorithm in Sections 3.3 and 3.4.

3.1 Model-Based Adversarial Imitation Learning

Model-based algorithms for RL and IL involve learning an approximate dynamics model T̂ using environment interactions. The learned dynamics model can be used to construct an approximate MDP M̂. In our context of imitation learning, learning a dynamics model allows us to generate samples from M̂ as a surrogate for samples from M, leading to the objective

    \min_{\pi} \; D\left(\rho^{\pi}_{\hat{\mathcal{M}}}, \rho^{E}_{\mathcal{M}}\right),    (3)

which can serve as a good proxy to Eq. 1 as long as the model approximation is accurate. This intuition can be captured using the following lemma (see Appendix A for the proof).

Lemma 1. (Simultaneous policy and model deviation) Suppose we have an α-approximate dynamics model given by D_TV(T̂(s, a), T(s, a)) ≤ α for all (s, a). Let R_max = max_{(s,a)} R(s, a) be the maximum of the unknown reward in the MDP with unknown dynamics T. For any policy π, we can bound the sub-optimality with respect to the expert policy π^E as

    J(\pi^{E}, \mathcal{M}) - J(\pi, \mathcal{M}) \;\le\; \frac{R_{\max}}{1-\gamma}\, D_{TV}\left(\rho^{\pi}_{\hat{\mathcal{M}}}, \rho^{E}_{\mathcal{M}}\right) + \frac{\alpha\, R_{\max}}{(1-\gamma)^{2}}.    (4)

Thus, the divergence minimization in Eq. 3 serves as an approximate bound on the sub-optimality, with a bias that is proportional to the model error. We therefore ultimately propose to solve the following saddle-point optimization problem:

    \max_{\pi} \; \min_{D_{\psi}} \;\; \mathbb{E}_{(s,a)\sim\rho^{E}_{\mathcal{M}}}\left[-\log D_{\psi}(s, a)\right] + \mathbb{E}_{(s,a)\sim\rho^{\pi}_{\hat{\mathcal{M}}}}\left[-\log\left(1 - D_{\psi}(s, a)\right)\right],    (5)

which requires generating on-policy samples only from the learned model M̂. We can interleave policy learning according to Eq. 5 with performing policy rollouts in the real environment to iteratively improve the model. Provided the policy is updated sufficiently slowly, Rajeswaran et al. [21] show that such interleaved policy and model learning corresponds to a stable and convergent algorithm, while being highly sample efficient.
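The schematic sketch below shows one way this interleaving could be organized in the MDP setting. It is only a structural illustration of the procedure described above: the component interfaces (env_rollout, model, disc, policy) are assumptions made for the example and do not correspond to the authors' implementation.

```python
from typing import Callable, List, Tuple

# Assumed interfaces (placeholders for illustration):
#   env_rollout(policy, n)           -> real (s, a) pairs from the environment
#   model.update(real_data)          -> fit the approximate dynamics model T_hat
#   model.rollout(policy, n, h)      -> synthetic on-policy (s, a) pairs of horizon h
#   disc.update(expert_sa, model_sa) -> inner minimization of Eq. (5)
#   disc.reward(s, a)                -> -log(1 - D_psi(s, a))
#   policy.update(model_sa, rewards) -> any RL update (outer maximization of Eq. (5))

def model_based_adversarial_il_sketch(env_rollout: Callable, model, disc, policy,
                                      expert_sa: List[Tuple], iterations: int = 1000,
                                      real_rollouts: int = 1,
                                      synthetic_rollouts: int = 64,
                                      horizon: int = 15):
    real_data = []
    for _ in range(iterations):
        # (1) Periodic model correction with a small amount of real experience.
        real_data += env_rollout(policy, real_rollouts)
        model.update(real_data)

        # (2) Cheap on-policy samples generated inside the learned model M_hat.
        model_sa = model.rollout(policy, synthetic_rollouts, horizon)

        # (3) Discriminator step: distinguish expert pairs from model rollouts.
        disc.update(expert_sa, model_sa)

        # (4) Policy step: RL inside the model with reward -log(1 - D_psi(s, a)).
        rewards = [disc.reward(s, a) for (s, a) in model_sa]
        policy.update(model_sa, rewards)
    return policy
```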
3.2 Extension to POMDPs

In POMDPs, the underlying state is not directly observed, and thus cannot be directly used by the policy. In this case, we typically use the notion of a belief state, which is defined to be the filtering distribution P(s_t | h_t), where we denote the history by h_t := (x_{≤t}, a_{<t}).
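To make the variational latent dynamics model of Figure 1 concrete, the sketch below shows one possible structure: an inference network (the dashed paths in Figure 1) that maps the previous latent state, the previous action, and the current image to a posterior over the current latent state; a latent transition prior (the solid paths); and a decoder that reconstructs observations. All architectural choices and sizes here are illustrative assumptions, not the model used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

# Illustrative variational latent dynamics model (all sizes are placeholders):
#   posterior q(z_t | z_{t-1}, a_{t-1}, x_t): inference ("dashed") path
#   prior     p(z_t | z_{t-1}, a_{t-1}):      generative ("solid") path
#   decoder   p(x_t | z_t):                   image reconstruction
class LatentDynamicsModel(nn.Module):
    def __init__(self, latent_dim=32, action_dim=6, embed_dim=256):
        super().__init__()
        self.obs_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 14 * 14, embed_dim), nn.ReLU(),
        )
        self.posterior = nn.Linear(embed_dim + latent_dim + action_dim, 2 * latent_dim)
        self.prior = nn.Linear(latent_dim + action_dim, 2 * latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 14 * 14), nn.ReLU(),
            nn.Unflatten(1, (64, 14, 14)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2),  # back to 3x64x64
        )

    @staticmethod
    def _to_normal(stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

    def forward(self, prev_z, prev_action, obs):
        embed = self.obs_encoder(obs)
        post = self._to_normal(self.posterior(torch.cat([embed, prev_z, prev_action], -1)))
        prior = self._to_normal(self.prior(torch.cat([prev_z, prev_action], -1)))
        z = post.rsample()          # reparameterized sample of the latent state
        recon = self.decoder(z)
        return z, post, prior, recon

def negative_elbo(model, prev_z, prev_action, obs):
    """Reconstruction error plus KL(posterior || prior) for one time step."""
    z, post, prior, recon = model(prev_z, prev_action, obs)
    recon_loss = F.mse_loss(recon, obs)
    kl = kl_divergence(post, prior).sum(-1).mean()
    return recon_loss + kl, z

# Example usage with stand-in tensors (batch of 8, 64x64 RGB observations).
model = LatentDynamicsModel()
loss, z = negative_elbo(model,
                        prev_z=torch.zeros(8, 32),
                        prev_action=torch.zeros(8, 6),
                        obs=torch.randn(8, 3, 64, 64))
```

Consistent with Figure 1, on-policy rollouts could then be generated purely in latent space by repeatedly sampling the prior transition, with the discriminator and policy of Section 3.1 operating on the resulting latent states.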