arXiv:2102.08363v2 [cs.LG] 27 Jan 2022
COMBO: Conservative Offline Model-Based
Policy Optimization
Tianhe Yu∗,1 , Aviral Kumar∗,2 , Rafael Rafailov1 , Aravind Rajeswaran3 ,
Sergey Levine2 , Chelsea Finn1
1 Stanford University, 2 UC Berkeley, 3 Facebook AI Research
(∗ Equal Contribution)
tianheyu@cs.stanford.edu, aviralk@berkeley.edu
Abstract
Model-based reinforcement learning (RL) algorithms, which learn a dynamics
model from logged experience and perform conservative planning under the learned
model, have emerged as a promising paradigm for offline reinforcement learning
(offline RL). However, practical variants of such model-based algorithms rely on
explicit uncertainty quantification for incorporating conservatism. Uncertainty
estimation with complex models, such as deep neural networks, can be difficult and
unreliable. We empirically find that uncertainty estimation is not accurate and leads
to poor performance in certain scenarios in offline model-based RL. We overcome
this limitation by developing a new model-based offline RL algorithm, COMBO,
that trains a value function using both the offline dataset and data generated using
rollouts under the model while also additionally regularizing the value function on
out-of-support state-action tuples generated via model rollouts. This results in a
conservative estimate of the value function for out-of-support state-action tuples,
without requiring explicit uncertainty estimation. Theoretically, we show that
COMBO satisfies a policy improvement guarantee in the offline setting. Through
extensive experiments, we find that COMBO attains greater performance compared
to prior offline RL methods on problems that demand generalization to related but previously
unseen tasks, and also consistently matches or outperforms prior offline RL methods
on widely studied offline RL benchmarks, including image-based tasks.
1 Introduction
Offline reinforcement learning (offline RL) [30, 34] refers to the setting where policies are trained
using static, previously collected datasets. This presents an attractive paradigm for data reuse
and safe policy learning in many applications, such as healthcare [62], autonomous driving [65],
robotics [25, 48], and personalized recommendation systems [59]. Recent studies have observed
that RL algorithms originally developed for the online or interactive paradigm perform poorly in the
offline case [14, 28, 26]. This is primarily attributed to the distribution shift that arises over the course
of learning between the offline dataset and the learned policy. Thus, development of algorithms
specialized for offline RL is of paramount importance to benefit from the offline data available in
aforementioned applications. In this work, we develop a principled model-based offline RL algorithm
that matches or exceeds the performance of prior offline RL algorithms in benchmark tasks.
A major paradigm for algorithm design in offline RL is to incorporate conservatism or regularization
into online RL algorithms. Model-free offline RL algorithms [15, 28, 63, 21, 29, 27] directly
incorporate conservatism into the policy or value function training and do not require learning a
dynamics model. However, model-free algorithms learn only on the states in the offline dataset,
which can lead to overly conservative algorithms. In contrast, model-based algorithms [26, 67] learn
a pessimistic dynamics model, which in turn induces a conservative estimate of the value function. By
generating and training on additional synthetic data, model-based algorithms have the potential for
broader generalization and solving new tasks using the offline dataset [67]. However, these methods
rely on some sort of strong assumption about uncertainty estimation, typically assuming access to
a model error oracle that can estimate upper bounds on model error for any state-action tuple. In
practice, such methods use more heuristic uncertainty estimation methods, which can be difficult
or unreliable for complex datasets or deep network models. It then remains an open question as to
whether we can formulate principled model-based offline RL algorithms with concrete theoretical
guarantees on performance without assuming access to an uncertainty or model error oracle. In this
work, we propose precisely such a method, by eschewing direct uncertainty estimation, which we
argue is not necessary for offline RL.
Our main contribution is the development of conservative offline model-based policy optimization (COMBO), a new model-based algorithm for offline RL. COMBO learns a dynamics model using the offline dataset. Subsequently, it employs an actor-critic method where the value function is learned using both the offline dataset as well as synthetically generated data from the model, similar to Dyna [57] and a number of recent methods [20, 67, 7, 48]. However, in contrast to Dyna, COMBO learns a conservative critic function by penalizing the value function in state-action tuples that are not in the support of the offline dataset, obtained by simulating the learned model. We theoretically show that for any policy, the Q-function learned by COMBO is a lower bound on the true Q-function. While the approach of optimizing a performance lower bound is similar in spirit to prior model-based algorithms [26, 67], COMBO crucially does not assume access to a model error or uncertainty oracle. In addition, we show theoretically that the Q-function learned by COMBO is less conservative than model-free counterparts such as CQL [29], and quantify conditions under which this lower bound is tighter than the one derived in CQL. This is illustrated through an example in Figure 1.

Figure 1: COMBO learns a conservative value function by utilizing both the offline dataset as well as simulated data from the model. Crucially, COMBO does not require uncertainty quantification, and the value function learned by COMBO is less conservative on the transitions seen in the dataset than CQL. This enables COMBO to steer the agent towards higher value states compared to CQL, which may steer towards more optimal states, as illustrated in the figure.
Following prior works [31], we show that COMBO enjoys a safe policy improvement guarantee.
By interpolating model-free and model-based components, this guarantee can utilize the best of
both guarantees in certain cases. Finally, in our experiments, we find that COMBO achieves the
best performance on tasks that require out-of-distribution generalization and outperforms previous
latent-space offline model-based RL methods on image-based robotic manipulation benchmarks. We
also test COMBO on commonly studied benchmarks for offline RL and find that COMBO generally
performs well on the benchmarks, achieving the highest score in 9 out of 12 MuJoCo domains from
the D4RL [12] benchmark suite.
2 Preliminaries
Markov Decision Processes and Offline RL. We study RL in the framework of Markov decision processes (MDPs) specified by the tuple M = (S, A, T, r, µ0, γ). S, A denote the state and action spaces. T(s'|s, a) and r(s, a) ∈ [−Rmax, Rmax] represent the dynamics and reward function respectively. µ0(s) denotes the initial state distribution, and γ ∈ (0, 1) denotes the discount factor. We denote the discounted state visitation distribution of a policy π using $d^\pi_M(s) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \mathcal{P}(s_t = s \mid \pi)$, where $\mathcal{P}(s_t = s \mid \pi)$ is the probability of reaching state s at time t by rolling out π in M. Similarly, we denote the state-action visitation distribution with $d^\pi_M(s, a) := d^\pi_M(s)\pi(a|s)$. The goal of RL is to learn a policy that maximizes the return, or long-term cumulative rewards: $\max_\pi J(M, \pi) := \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^\pi_M(s,a)}[r(s,a)]$.
Offline RL is the setting where we have access only to a fixed dataset D = {(s, a, r, s0 )}, which
consists of transition tuples from trajectories collected using a behavior policy πβ . In other words, the
dataset D is sampled from $d^{\pi_\beta}(s, a) := d^{\pi_\beta}(s)\pi_\beta(a|s)$. We define $\overline{M}$ as the empirical MDP induced by the dataset D and d(s, a) as the sample-based version of $d^{\pi_\beta}(s, a)$. In the offline setting, the goal is
to find the best possible policy using the fixed offline dataset.
Model-Free Offline RL Algorithms. One class of approaches for solving MDPs involves the use of
dynamic programming and actor-critic schemes [56, 5], which do not explicitly require the learning
of a dynamics model. To capture the long-term behavior of a policy without a model, we define the action value function as $Q^\pi(s,a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, where future actions are sampled from π(·|s) and state transitions happen according to the MDP dynamics. Consider the following Bellman operator: $\mathcal{B}^\pi Q(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'\sim T(\cdot|s,a),\, a'\sim\pi(\cdot|s')}[Q(s',a')]$, and its sample-based counterpart: $\widehat{\mathcal{B}}^\pi Q(s,a) := r(s,a) + \gamma Q(s',a')$, associated with a single transition (s, a, s') and a' ∼ π(·|s'). The action-value function satisfies the Bellman consistency criterion given by $\mathcal{B}^\pi Q^\pi(s,a) = Q^\pi(s,a)$ for all (s, a). When given an offline dataset D, standard approximate
dynamic programming (ADP) and actor-critic methods use this criterion to alternate between policy
evaluation [40] and policy improvement. A number of prior works have observed that such a direct
extension of ADP and actor-critic schemes to offline RL leads to poor results due to distribution
shift over the course of learning and over-estimation bias in the Q function [14, 28, 63]. To address
these drawbacks, prior works have proposed a number of modifications aimed towards regularizing
the policy or value function (see Section 6). In this work, we primarily focus on CQL [29], which
alternates between:
Policy Evaluation: The Q function associated with the current policy π is approximated conservatively by repeating the following optimization:
$$Q^{k+1} \leftarrow \arg\min_Q\; \beta\left(\mathbb{E}_{s\sim\mathcal{D},\, a\sim\mu(\cdot|s)}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal{D}}[Q(s,a)]\right) + \frac{1}{2}\,\mathbb{E}_{s,a,s'\sim\mathcal{D}}\left[\left(Q(s,a) - \widehat{\mathcal{B}}^\pi Q^k(s,a)\right)^2\right], \quad (1)$$
where µ(·|s) is a wide sampling distribution such as the uniform distribution over action bounds.
CQL effectively penalizes the Q function at states in the dataset for actions not observed in the
dataset. This enables a conservative estimation of the value function for any policy [29], mitigating
the challenges of over-estimation bias and distribution shift.
Policy Improvement: After approximating the Q function as $\hat{Q}^\pi$, the policy is improved as $\pi \leftarrow \arg\max_{\pi'} \mathbb{E}_{s\sim\mathcal{D},\, a\sim\pi'(\cdot|s)}\left[\hat{Q}^\pi(s,a)\right]$. Actor-critic methods with parameterized policies and Q-functions approximate the arg max and arg min in the above equations with a few gradient descent steps.
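To make this concrete, the snippet below is a minimal PyTorch sketch of one evaluation of the CQL objective in Eq. 1, with µ taken to be the uniform distribution over action bounds. The interfaces q_net(s, a) and policy(s), the batch layout, and the number of sampled actions are illustrative assumptions, not CQL's exact implementation.

```python
import torch
import torch.nn.functional as F

def cql_critic_loss(q_net, q_target, policy, batch, beta, gamma, n_random=10):
    """Sample-based estimate of Eq. 1 (assumed interfaces; actions lie in [-1, 1])."""
    s, a, r, s2 = batch["s"], batch["a"], batch["r"], batch["s2"]

    # Bellman error term: 1/2 * E_D[(Q(s,a) - \hat{B}^pi Q^k(s,a))^2]
    with torch.no_grad():
        a2 = policy(s2)                          # a' ~ pi(.|s')
        target = r + gamma * q_target(s2, a2)    # sample-based backup \hat{B}^pi Q^k
    bellman = 0.5 * F.mse_loss(q_net(s, a), target)

    # Conservative penalty: E_{s~D, a~mu}[Q(s,a)] - E_{s,a~D}[Q(s,a)], with mu = Uniform
    n, act_dim = a.shape
    rand_a = torch.rand(n * n_random, act_dim) * 2 - 1   # uniform over action bounds
    s_rep = s.repeat_interleave(n_random, dim=0)
    penalty = q_net(s_rep, rand_a).mean() - q_net(s, a).mean()

    return beta * penalty + bellman
```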
Model-Based Offline RL Algorithms. A second class of algorithms for solving MDPs involves learning the dynamics function and using the learned model to aid policy search. Using the given dataset D, a dynamics model $\widehat{T}$ is typically trained via maximum likelihood estimation: $\max_{\widehat{T}} \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\left[\log \widehat{T}(s'|s,a)\right]$. A reward model $\hat{r}(s, a)$ can also be learned similarly if it is unknown. Once a model has been learned, we can construct the learned MDP $\widehat{M} = (S, A, \widehat{T}, \hat{r}, \mu_0, \gamma)$, which has the same state and action spaces, but uses the learned dynamics and reward function. Subsequently, any policy learning or planning algorithm can be used to recover the optimal policy in the model as $\hat{\pi} = \arg\max_\pi J(\widehat{M}, \pi)$.
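As a concrete illustration of this model-fitting step, the sketch below trains a diagonal-Gaussian next-state model with a (negative) log-likelihood objective. The architecture, the diagonal covariance, and all names here are assumptions made for illustration, not the exact model class prescribed by the methods discussed in this section.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Diagonal-Gaussian model of the next state, approximating T_hat(s'|s, a) (illustrative)."""
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

def dynamics_nll(model, s, a, s_next):
    # Minimizing the negative log-likelihood of s' is equivalent to the
    # maximum-likelihood objective described in the text.
    mu, log_std = model(s, a)
    dist = torch.distributions.Normal(mu, log_std.exp())
    return -dist.log_prob(s_next).sum(dim=-1).mean()
```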
This straightforward approach is known to fail in the offline RL setting, both in theory and practice,
due to distribution shift and model-bias [51, 26]. In order to overcome these challenges, offline
model-based algorithms like MOReL [26] and MOPO [67] use uncertainty quantification to construct
a lower bound for policy performance and optimize this lower bound by assuming a model error
oracle u(s, a). By using an uncertainty estimation algorithm like bootstrap ensembles [43, 4, 37],
we can estimate u(s, a). By constructing and optimizing such a lower bound, offline model-based
RL algorithms avoid the aforementioned pitfalls like model-bias and distribution shift. While any RL or planning algorithm can be used to learn the optimal policy for $\widehat{M}$, we focus specifically on MBPO [20, 57], which was used in MOPO. MBPO follows the standard structure of actor-critic algorithms, but in each iteration uses an augmented dataset $\mathcal{D} \cup \mathcal{D}_{\text{model}}$ for policy evaluation. Here, $\mathcal{D}$ is the offline dataset and $\mathcal{D}_{\text{model}}$ is a dataset obtained by simulating the current policy using the learned dynamics model. Specifically, at each iteration, MBPO performs k-step rollouts using $\widehat{T}$ starting from state s ∈ D with a particular rollout policy µ(a|s), adds the model-generated data to $\mathcal{D}_{\text{model}}$, and optimizes the policy with a batch of data sampled from $\mathcal{D} \cup \mathcal{D}_{\text{model}}$, where each datapoint in the batch is drawn from $\mathcal{D}$ with probability f ∈ [0, 1] and from $\mathcal{D}_{\text{model}}$ with probability 1 − f.
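The rollout-and-mix procedure described above can be sketched as follows; the model.sample interface, the transition dictionaries, and the rollout policy are assumptions for illustration rather than MBPO's exact implementation.

```python
import random

def collect_model_rollouts(model, rollout_policy, offline_data, d_model, k=5, n_starts=1000):
    """k-step rollouts under the learned model, branched from dataset states."""
    for start in random.sample(offline_data, min(n_starts, len(offline_data))):
        s = start["s"]
        for _ in range(k):
            a = rollout_policy(s)
            s_next, r = model.sample(s, a)     # assumed model interface
            d_model.append({"s": s, "a": a, "r": r, "s2": s_next})
            s = s_next

def sample_mixed_batch(offline_data, d_model, batch_size, f=0.5):
    """Draw each datapoint from D with probability f and from D_model otherwise."""
    return [random.choice(offline_data) if random.random() < f else random.choice(d_model)
            for _ in range(batch_size)]
```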
3 Conservative Offline Model-Based Policy Optimization
The principal limitation of prior offline model-based algorithms (discussed in Section 2) is the
assumption of having access to a model error oracle for uncertainty estimation and strong reliance
on heuristics of quantifying the uncertainty. In practice, such heuristics could be challenging for
complex datasets or deep neural network models [44]. We argue that uncertainty estimation is not
imperative for offline model-based RL and empirically show in Section 5.1.1 that uncertainty estimation can be inaccurate in offline RL problems, especially when generalization to unknown behaviors is required. Our goal is to develop a model-based offline RL algorithm that enables optimizing a lower bound on the policy performance, but without requiring uncertainty quantification. We achieve this by extending conservative Q-learning [29], which does not require explicit uncertainty quantification, to the model-based setting. Our algorithm COMBO, summarized in Algorithm 1, alternates between a conservative policy evaluation step and a policy improvement step, which we outline below.

Algorithm 1 COMBO: Conservative Model-Based Offline Policy Optimization
Require: Offline dataset D, rollout distribution µ(·|s), learned dynamics model $\widehat{T}_\theta$, initialized policy $\pi_\phi$ and critic $Q_\psi$.
1: Train the probabilistic dynamics model $\widehat{T}_\theta(s', r \mid s, a) = \mathcal{N}(\mu_\theta(s, a), \Sigma_\theta(s, a))$ on D.
2: Initialize the replay buffer $\mathcal{D}_{\text{model}} \leftarrow \varnothing$.
3: for i = 1, 2, 3, . . . do
4:   Collect model rollouts by sampling from µ and $\widehat{T}_\theta$ starting from states in D. Add the model rollouts to $\mathcal{D}_{\text{model}}$.
5:   Conservatively evaluate $\pi^i_\phi$ by repeatedly solving Eq. 2 to obtain $\hat{Q}^{\pi^i_\phi}_\psi$ using samples from $\mathcal{D} \cup \mathcal{D}_{\text{model}}$.
6:   Improve the policy under the state marginal of $d_f$ by solving Eq. 3 to obtain $\pi^{i+1}_\phi$.
7: end for
Conservative Policy Evaluation: Given a policy π, an offline dataset D, and a learned model of the
MDP M̂, the goal in this step is to obtain a conservative estimate of Qπ . To achieve this, we penalize
the Q-values evaluated on data drawn from a particular state-action distribution that is more likely to
be out-of-support while pushing up the Q-values on state-action pairs that are trustworthy, which is
implemented by repeating the following recursion:
$$\hat{Q}^{k+1} \leftarrow \arg\min_Q\; \beta\left(\mathbb{E}_{s,a\sim\rho(s,a)}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal{D}}[Q(s,a)]\right) + \frac{1}{2}\,\mathbb{E}_{s,a,s'\sim d_f}\left[\left(Q(s,a) - \widehat{\mathcal{B}}^\pi \hat{Q}^k(s,a)\right)^2\right]. \quad (2)$$
Here, ρ(s, a) and $d_f$ are sampling distributions that we can choose. Model-based algorithms allow ample flexibility for these choices while providing the ability to control the bias introduced by these choices. For ρ(s, a), we make the following choice: $\rho(s, a) = d^\pi_{\widehat{M}}(s)\pi(a|s)$, where $d^\pi_{\widehat{M}}(s)$ is the discounted marginal state distribution when executing π in the learned model $\widehat{M}$. Samples from $d^\pi_{\widehat{M}}(s)$ can be obtained by rolling out π in $\widehat{M}$. Similarly, $d_f$ is an f-interpolation between the offline dataset and synthetic rollouts from the model: $d^\mu_f(s, a) := f\, d(s, a) + (1 - f)\, d^\mu_{\widehat{M}}(s, a)$, where f ∈ [0, 1] is the ratio of the datapoints drawn from the offline dataset as defined in Section 2 and µ(·|s) is the rollout distribution used with the model, which can be modeled as π or a uniform distribution. To avoid notation clutter, we also denote $d_f := d^\mu_f$.
Under such choices of ρ and df , we push down (or conservatively estimate) Q-values on state-action
tuples from model rollouts and push up Q-values on the real state-action pairs from the offline dataset.
When updating Q-values with the Bellman backup, we use a mixture of both the model-generated
data and the real data, similar to Dyna [57]. Note that in comparison to CQL and other model-free
algorithms, COMBO learns the Q-function over a richer set of states beyond the states in the offline
dataset. This is made possible by performing rollouts under the learned dynamics model, denoted by $d^\mu_{\widehat{M}}(s, a)$. We will show in Section 4 that the Q-function learned by repeating the recursion in Eq. 2 provides a lower bound on the true Q-function, without the need for explicit uncertainty estimation.
Furthermore, we will theoretically study the advantages of using synthetic data from the learned
model, and characterize the impacts of model bias.
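To make this step concrete, the sketch below estimates Eq. 2 from one batch of real transitions (from D) and one batch of model-generated transitions (samples from ρ), with the Bellman error weighted by f over the two sources as in $d_f$. The interfaces q_net(s, a), policy(s), and the batch layout are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def combo_critic_loss(q_net, q_target, policy, real_batch, model_batch, beta, gamma, f=0.5):
    """Sample-based estimate of Eq. 2 (assumed interfaces; batches are dicts of tensors)."""
    # Conservative term: push Q down on model rollouts (rho), push Q up on dataset tuples.
    q_rho = q_net(model_batch["s"], model_batch["a"]).mean()
    q_data = q_net(real_batch["s"], real_batch["a"]).mean()
    penalty = beta * (q_rho - q_data)

    # Bellman error under d_f: an f-weighted mixture of real and model-generated transitions.
    def bellman_error(batch):
        with torch.no_grad():
            a2 = policy(batch["s2"])
            target = batch["r"] + gamma * q_target(batch["s2"], a2)
        return F.mse_loss(q_net(batch["s"], batch["a"]), target)

    return penalty + 0.5 * (f * bellman_error(real_batch) + (1 - f) * bellman_error(model_batch))
```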
Policy Improvement Using a Conservative Critic: After learning a conservative critic Q̂π , we
improve the policy as:
$$\pi' \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s\sim\rho,\, a\sim\pi(\cdot|s)}\left[\hat{Q}^\pi(s,a)\right] \quad (3)$$
where ρ(s) is the state marginal of ρ(s, a). When policies are parameterized with neural networks,
we approximate the arg max with a few steps of gradient descent. In addition, entropy regularization
can also be used to prevent the policy from becoming degenerate if required [17]. In Section 4.2, we
show that the resulting policy is guaranteed to improve over the behavior policy.
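A minimal sketch of this improvement step, assuming a reparameterized policy whose sampled actions are differentiable (e.g., a tanh-Gaussian actor), is shown below; the optimizer and the number of gradient steps are illustrative choices.

```python
def improve_policy(policy, q_net, rho_states, policy_optimizer, n_steps=1):
    """A few gradient steps approximating the arg max in Eq. 3 over states s ~ rho."""
    for _ in range(n_steps):
        actions = policy(rho_states)               # assumed: differentiable (reparameterized) samples
        loss = -q_net(rho_states, actions).mean()  # ascent on E[Q(s, a)] == descent on its negation
        policy_optimizer.zero_grad()
        loss.backward()
        policy_optimizer.step()
```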
Practical Implementation Details. Our practical implementation largely follows MOPO, with the
key exception that we perform conservative policy evaluation as outlined in this section, rather than
using uncertainty-based reward penalties. Following MOPO, we represent the probabilistic dynamics
model using a neural network, with parameters θ, that produces a Gaussian distribution over the next
state and reward: $\widehat{T}_\theta(s_{t+1}, r \mid s_t, a_t) = \mathcal{N}(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t))$. The model is trained via maximum
likelihood. For conservative policy evaluation (eq. 2) and policy improvement (eq. 3), we augment
ρ with states sampled from the offline dataset, which shows more stable improvement in practice.
It is relatively common in prior work on model-based offline RL to select various hyperparameters
using online policy rollouts [67, 26, 3, 33]. However, we would like to avoid this with our method,
since requiring online rollouts to tune hyperparameters contradicts the main aim of offline RL, which
is to learn entirely from offline data. Therefore, we do not use online rollouts for tuning COMBO,
and instead devise an automated rule for tuning important hyperparameters such as β and f in a
fully offline manner. We search over a small discrete set of hyperparameters for each task, and use
the value of the regularization term $\mathbb{E}_{s,a\sim\rho(s,a)}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal{D}}[Q(s,a)]$ (shown in Eq. 2) to pick
hyperparameters in an entirely offline fashion. We select the hyperparameter setting that achieves
the lowest regularization objective, which indicates that the Q-values on unseen model-predicted
state-action tuples are not overestimated. Additional details about the practical implementation and
the hyperparameter selection rule are provided in Appendix B.1 and Appendix B.2 respectively.
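A sketch of this selection rule under stated assumptions is given below: each candidate (β, f) is trained fully offline, and the setting with the smallest value of the Eq. 2 regularizer on held-out batches is kept. The candidate grids and the train_fn helper are hypothetical placeholders, not the paper's exact procedure.

```python
def regularizer_value(q_net, real_batch, model_batch):
    """E_{s,a~rho}[Q(s,a)] - E_{s,a~D}[Q(s,a)], the regularizer of Eq. 2."""
    q_rho = q_net(model_batch["s"], model_batch["a"]).mean().item()
    q_data = q_net(real_batch["s"], real_batch["a"]).mean().item()
    return q_rho - q_data

def select_hyperparameters(train_fn, real_batch, model_batch, betas=(0.5, 1.0, 5.0), fs=(0.5, 0.8)):
    """Pick the (beta, f) whose trained critic attains the lowest regularizer value,
    i.e., the least overestimation on unseen model-predicted state-action tuples."""
    best, best_val = None, float("inf")
    for beta in betas:
        for f in fs:
            q_net = train_fn(beta=beta, f=f)   # hypothetical: runs COMBO offline, returns the critic
            val = regularizer_value(q_net, real_batch, model_batch)
            if val < best_val:
                best, best_val = (beta, f), val
    return best
```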
4 Theoretical Analysis of COMBO
In this section, we theoretically analyze our method and show that it optimizes a lower-bound on the
expected return of the learned policy. This lower bound is close to the actual policy performance
(modulo sampling error) when the policy’s state-action marginal distribution is in support of the
state-action marginal of the behavior policy and conservatively estimates the performance of a
policy otherwise. By optimizing the policy against this lower bound, COMBO guarantees policy
improvement beyond the behavior policy. Furthermore, we use these insights to discuss cases when
COMBO is less conservative compared to model-free counterparts.
4.1 COMBO Optimizes a Lower Bound
We first show that training the Q-function using Eq. 2 produces a Q-function such that the expected
off-policy policy improvement objective [8] computed using this learned Q-function lower-bounds
its actual value. We will reuse notation for df and d from Sections 2 and 3. Assuming that the
Q-function is tabular, the Q-function found by approximate dynamic programming in iteration k, can
be obtained by differentiating Eq. 2 with respect to Qk (see App. A for details):
$$\hat{Q}^{k+1}(s,a) = \left(\widehat{\mathcal{B}}^\pi Q^k\right)(s,a) - \beta\,\frac{\rho(s,a) - d(s,a)}{d_f(s,a)}. \quad (4)$$
Eq. 4 effectively applies a penalty that depends on the three distributions appearing in the COMBO
critic training objective (Eq. 2), of which ρ and df are free variables that we choose in practice as
discussed in Section 3. For a given iteration k of Eq. 4, we further define the expected penalty under
ρ(s, a) as:
$$\nu(\rho, f) := \mathbb{E}_{s,a\sim\rho(s,a)}\left[\frac{\rho(s,a) - d(s,a)}{d_f(s,a)}\right]. \quad (5)$$
Next, we will show that the Q-function learned by COMBO lower-bounds the actual Q-function
under the initial state distribution µ0 and any policy π. We also show that the asymptotic Q-function
learned by COMBO lower-bounds the actual Q-function of any policy π with high probability for a
large enough β ≥ 0, which we include in Appendix A.2. Let $\overline{M}$ represent the empirical MDP which uses the empirical transition model based on raw data counts. The Bellman backup over the dataset distribution $d_f$ in Eq. 2 that we analyze is an f-interpolation of the backup operator in the empirical MDP (denoted by $\mathcal{B}^\pi_{\overline{M}}$) and the backup operator under the learned model $\widehat{M}$ (denoted by $\mathcal{B}^\pi_{\widehat{M}}$). The
empirical backup operator suffers from sampling error, but is unbiased in expectation, whereas the
model backup operator induces bias but no sampling error. We assume that all of these backups enjoy
concentration properties with concentration coefficient Cr,T,δ , dependent on the desired confidence
value δ (details in Appendix A.2). This is a standard assumption in literature [31]. Now, we state our
main results below.
Proposition 4.1. For large enough β, we have $\mathbb{E}_{s\sim\mu_0, a\sim\pi(\cdot|s)}[\hat{Q}^\pi(s,a)] \leq \mathbb{E}_{s\sim\mu_0, a\sim\pi(\cdot|s)}[Q^\pi(s,a)]$, where µ0(s) is the initial state distribution. Furthermore, when the sampling error is small, such as in the large sample regime, or when the model bias is small, a small β is sufficient to guarantee this condition along with an appropriate choice of f.
The proof for Proposition 4.1 can be found in Appendix A.2. Finally, while Kumar et al. [29] also
analyze how regularized value function training can provide lower bounds on the value function at
each state in the dataset [29] (Proposition 3.1-3.2), our result shows that COMBO is less conservative
in that it does not underestimate the value function at every state in the dataset like CQL (Remark 1)
and might even overestimate these values. Instead COMBO penalizes Q-values at states generated via
model rollouts from ρ(s, a). Note that in general, the required value of β may be quite large similar
to prior works, which typically utilize a large constant β, which may be in the form of a penalty on a
regularizer [36, 29] or as constants in theoretically optimal algorithms [23, 49]. While it is challenging
to argue that either COMBO or CQL attains the tightest possible lower bound on return, in our
final result of this section, we discuss a sufficient condition for the COMBO lower-bound to be tighter
than CQL.
Proposition 4.2. Assuming previous notation, let $\Delta^\pi_{\text{COMBO}} := \mathbb{E}_{s\sim d_M(s),\, a\sim\pi(\cdot|s)}\left[\hat{Q}^\pi(s,a)\right]$ and $\Delta^\pi_{\text{CQL}} := \mathbb{E}_{s\sim d_M(s),\, a\sim\pi(\cdot|s)}\left[\hat{Q}^\pi_{\text{CQL}}(s,a)\right]$ denote the average values on the dataset under the Q-functions learned by COMBO and CQL respectively. Then, $\Delta^\pi_{\text{COMBO}} \geq \Delta^\pi_{\text{CQL}}$, if:
$$\mathbb{E}_{s,a\sim\rho(s,a)}\left[\frac{\pi(a|s)}{\pi_\beta(a|s)}\right] - \mathbb{E}_{s,a\sim d_M(s)\pi(a|s)}\left[\frac{\pi(a|s)}{\pi_\beta(a|s)}\right] \leq 0. \quad (*)$$
Proposition 4.2 indicates that COMBO will be less conservative than CQL when the action probabilities under learned policy π(a|s) and the probabilities under the behavior policy πβ (a|s) are
closer together on state-action tuples drawn from ρ(s, a) (i.e., sampled from the model using the
policy π(a|s)), than they are on states from the dataset and actions from the policy, dM (s)π(a|s).
COMBO’s objective (Eq. 2) only penalizes Q-values under ρ(s, a), which, in practice, are expected
to primarily consist of out-of-distribution states generated from model rollouts, and does not penalize
the Q-value at states drawn from dM (s). As a result, the expression (∗) is likely to be negative,
making COMBO less conservative than CQL.
4.2 Safe Policy Improvement Guarantees
Now that we have shown various aspects of the lower-bound on the Q-function induced by COMBO,
we provide policy improvement guarantees for the COMBO algorithm. Formally, Proposition 4.3
discusses safe improvement guarantees over the behavior policy, building on prior work [46, 31, 29].
Proposition 4.3 (ζ-safe policy improvement). Let π̂out (a|s) be the policy obtained by COMBO. Then,
if β is sufficiently large and ν(ρπ , f ) − ν(ρβ , f ) ≥ C for a positive constant C, the policy π̂out (a|s)
is a ζ-safe policy improvement over πβ in the actual MDP M, i.e., J(π̂out , M) ≥ J(πβ , M) − ζ,
with probability at least 1 − δ, where ζ is given by,
$$\underbrace{\mathcal{O}\!\left(\frac{\gamma f}{(1-\gamma)^2}\right) \mathbb{E}_{s\sim d^{\hat{\pi}_{\text{out}}}_{M}}\!\left[\sqrt{\frac{|\mathcal{A}|}{|\mathcal{D}(s)|}\, D_{\text{CQL}}(\hat{\pi}_{\text{out}}, \pi_\beta)}\right]}_{:=\,(1)} \;+\; \underbrace{\mathcal{O}\!\left(\frac{\gamma (1-f)}{(1-\gamma)^2}\right) D_{TV}(M, \widehat{M})}_{:=\,(2)} \;-\; \underbrace{\beta\,\frac{C}{1-\gamma}}_{:=\,(3)}.$$
The complete statement (with constants and terms that grow smaller than quadratic in the horizon)
and proof for Proposition 4.3 is provided in Appendix A.4. DCQL denotes a notion of probabilistic
distance between policies [29] which we discuss further in Appendix A.4. The expression for ζ in
Proposition 4.3 consists of three terms: term (1) captures the decrease in the policy performance due
to limited data, and decays as the size of D increases. The second term (2) captures the suboptimality
induced by the bias in the learned model. Finally, as we show in Appendix A.4, the third term (3)
comes from ν(ρπ , f ) − ν(ρβ , f ), which is equivalent to the improvement in policy performance as
a result of running COMBO in the empirical and model MDPs. Since the learned model is trained
on the dataset D with transitions generated from the behavior policy πβ , the marginal distribution
ρβ (s, a) is expected to be closer to d(s, a) for πβ as compared to the counterpart for the learned
policy, ρπ . Thus, the assumption that ν(ρπ , f ) − ν(ρβ , f ) is positive is reasonable, and in such cases,
an appropriate (large) choice of β will make term (3) large enough to counteract terms (1) and (2)
that reduce policy performance. We discuss this elaborately in Appendix A.4 (Remark 3).
Further note that in contrast to Proposition 3.6 in Kumar et al. [29], our result indicates that the sampling error (term (1)) is reduced (multiplied by a fraction f) when a near-accurate model is used to augment the data for training the Q-function, and, similarly, it can avoid the bias of model-based methods
by relying more on the model-free component. This allows COMBO to attain the best-of-both
model-free and model-based methods, via a suitable choice of the fraction f .
To summarize, through an appropriate choice of f , Proposition 4.3 guarantees safe improvement over
the behavior policy without requiring access to an oracle uncertainty estimation algorithm.
5 Experiments
In our experiments, we aim to answer the following questions: (1) Can COMBO generalize better than
previous offline model-free and model-based approaches in a setting that requires generalization to
tasks that are different from what the behavior policy solves? (2) How does COMBO compare with
prior work in tasks with high-dimensional image observations? (3) How does COMBO compare to
prior offline model-free and model-based methods in standard offline RL benchmarks?
To answer those questions, we compare COMBO to several prior methods. In the domains with
compact state spaces, we compare with recent model-free algorithms like BEAR [28], BRAC [63],
and CQL [29]; as well as MOPO [67] and MOReL [26] which are two recent model-based algorithms.
In addition, we also compare with an offline version of SAC [17] (denoted as SAC-off), and behavioral
cloning (BC). In high-dimensional image-based domains, which we use to answer question (2), we
compare to LOMPO [48], which is a latent space offline model-based RL method that handles image
inputs, latent space MBPO (denoted LMBPO), similar to Janner et al. [20] which uses the model
to generate additional synthetic data, the fully offline version of SLAC [32] (denoted SLAC-off),
which only uses a variational model for state representation purposes, and CQL from image inputs.
To our knowledge, CQL, MOPO, and LOMPO are representative of state-of-the-art model-free and
model-based offline RL methods. Hence we choose them as comparisons to COMBO. To highlight
the distinction between COMBO and a naïve combination of CQL and MBPO, we perform such a
comparison in Table 8 in Appendix C. For more details of our experimental set-up, comparisons, and
hyperparameters, see Appendix B.
5.1 Results on tasks that require generalization
Table 1: Average returns of halfcheetah-jump and ant-angle and average success rate of sawyer-door-close that require out-of-distribution generalization. All results are averaged over 6 random seeds. We include the mean and max return / success rate of episodes in the batch data (under Batch Mean and Batch Max, respectively) for comparison. We also include the 95%-confidence interval for COMBO.

Environment          Batch Mean   Batch Max   COMBO (Ours)    MOPO     MOReL    CQL
halfcheetah-jump     -1022.6      1808.6      5308.7±575.5    4016.6   3228.7   741.1
ant-angle            866.7        2311.9      2776.9±43.6     2530.9   2660.3   2473.4
sawyer-door-close    5%           100%        98.3%±3.0%      65.8%    42.9%    36.7%

To answer question (1), we use two environments, halfcheetah-jump and ant-angle, constructed in Yu et al. [67], which require the agent to solve a task that is different from what the behavior policy solved. In both environments, the offline dataset is
collected by policies trained with original reward functions of halfcheetah and ant, which reward
the robots to run as fast as possible. The behavior policies are trained with SAC with 1M steps and
we take the full replay buffer as the offline dataset. Following Yu et al. [67], we relabel rewards in the
offline datasets to reward the halfcheetah to jump as high as possible and the ant to run to the top
corner with a 30 degree angle as fast as possible. In the same manner, we construct a third
task sawyer-door-close based on the environment in Yu et al. [66], Rafailov et al. [48]. In this
task, we collect the offline data with SAC policies trained with a sparse reward function that only
gives a reward of 1 when the door is opened by the sawyer robot and 0 otherwise. The offline dataset
is similar to the “medium-expert“ dataset in the D4RL benchmark since we mix equal amounts of
data collected by a fully-trained SAC policy and a partially-trained SAC policy. We relabel the
reward such that it is 1 when the door is closed and 0 otherwise. Therefore, in these datasets, the
offline RL methods must generalize beyond behaviors in the offline data in order to learn the intended
behaviors. We visualize the sawyer-door-close environment in the right image in Figure 3 in
Appendix B.4.
We present the results on the three tasks in Table 1. COMBO significantly outperforms MOPO,
MOReL and CQL, two representative model-based methods and one representative model-free method respectively, in the halfcheetah-jump and sawyer-door-close tasks, and achieves
an approximately 8%, 4% and 12% improvement over MOPO, MOReL and CQL respectively on
the ant-angle task. These results validate that COMBO achieves better generalization results in
practice by behaving less conservatively than prior model-free offline methods (compare to CQL,
which doesn’t improve much), and does so more robustly than prior model-based offline methods
(compare to MOReL and MOPO).
5.1.1 Empirical analysis on uncertainty estimation in offline model-based RL
To further understand why COMBO outperforms prior model-based methods in tasks that require generalization, we argue that one of the main reasons could be that uncertainty estimation is hard in these tasks, where the agent is required to go further away from the data distribution. To test this intuition, we perform empirical evaluations to study whether uncertainty quantification with deep neural networks, especially in the setting of dynamics model learning, is challenging and could cause problems with uncertainty-based model-based offline RL methods such as MOReL [26] and MOPO [67]. In our evaluations, we consider the maximum learned variance over the ensemble (denoted as Max Var), $\max_{i=1,\dots,N} \|\Sigma^i_\theta(s,a)\|_F$ (used in MOPO).

Figure 2: We visualize the fitted linear regression line between the model error and the uncertainty quantification method maximum learned variance over the ensemble (denoted as Max Var) on two tasks that test the generalization abilities of offline RL algorithms (halfcheetah-jump and ant-angle). We show that Max Var struggles to predict the true model error. Such visualizations indicate that uncertainty quantification is challenging with deep neural networks and could lead to poor performance in model-based offline RL in settings where out-of-distribution generalization is needed. In the meantime, COMBO addresses this issue by removing the burden of performing uncertainty quantification.
We consider two tasks, halfcheetah-jump and ant-angle. We normalize both the model error and the uncertainty estimates to lie in [0, 1] and perform linear regression to learn the mapping between the uncertainty estimates and the true model error. As shown in Figure 2, on both tasks, Max Var is unable to accurately predict the true model error, suggesting that the uncertainty estimation used by offline model-based methods is not accurate and might be the major factor that results in their poor performance. Meanwhile, COMBO circumvents the challenging uncertainty quantification problem and achieves better performance on those tasks, indicating the effectiveness and robustness of the method.
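A sketch of this analysis, under stated assumptions about the ensemble interface, is given below: Max Var is the largest Frobenius norm of the predicted covariances across ensemble members, and both it and the true model error are min-max normalized to [0, 1] before fitting the regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def max_var_score(ensemble_covs):
    """Max Var for one (s, a): largest Frobenius norm of the predicted covariance
    over the N ensemble members. ensemble_covs: array of shape [N, d, d] (assumed)."""
    return max(np.linalg.norm(cov, ord="fro") for cov in ensemble_covs)

def fit_uncertainty_vs_error(uncertainties, model_errors):
    """Normalize both quantities to [0, 1] and fit a linear regression that predicts
    the true model error from the uncertainty estimate; returns slope, intercept, R^2."""
    def normalize(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    u = normalize(uncertainties).reshape(-1, 1)
    e = normalize(model_errors)
    reg = LinearRegression().fit(u, e)
    return reg.coef_[0], reg.intercept_, reg.score(u, e)
```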
5.2 Results on image-based tasks
To answer question (2), we evaluate COMBO on two image-based environments: the standard walker (walker-walk) task from the DeepMind Control suite [61] and a visual door opening environment with a Sawyer robotic arm (sawyer-door) as used in Section 5.1. For the walker task we construct 4 datasets: medium-replay (M-R), medium (M), medium-expert (M-E), and expert, similar to Fu et al. [12], each consisting of 200 trajectories. For the sawyer-door task we use only the medium-expert and the expert datasets, due to the sparse reward – the agent is rewarded only when it successfully opens the door. Both environments are visualized in Figure 3 in Appendix B.4. To extend COMBO to the image-based setting, we follow Rafailov et al. [48] and train a recurrent variational model using the offline data, and train COMBO in the latent space of this model. We present results in Table 2.

Table 2: Results for vision experiments. For the Walker task each number is the normalized score proposed in [12] of the policy at the last iteration of training, averaged over 3 random seeds. For the Sawyer task, we report success rates over the last 100 evaluation runs of training. For the dataset, M refers to medium, M-R refers to medium-replay, and M-E refers to medium-expert.

Dataset   Environment    COMBO (Ours)   LOMPO    LMBPO   SLAC-Off   CQL
M-R       walker_walk    69.2           66.9     59.8    45.1       15.6
M         walker_walk    57.7           60.2     61.7    41.5       38.9
M-E       walker_walk    76.4           78.9     47.3    34.9       36.3
expert    walker_walk    61.1           55.6     13.2    12.6       43.3
M-E       sawyer-door    100.0%         100.0%   0.0%    0.0%       0.0%
expert    sawyer-door    96.7%          0.0%     0.0%    0.0%       0.0%
Dataset type     Environment    BC      COMBO (Ours)    MOPO    MOReL   CQL     SAC-off   BEAR    BRAC-p   BRAC-v
random           halfcheetah    2.1     38.8±3.7        35.4    25.6    35.4    30.5      25.1    24.1     31.2
random           hopper         1.6     17.9±1.4        11.7    53.6    10.8    11.3      11.4    11.0     12.2
random           walker2d       9.8     7.0±3.6         13.6    37.3    7.0     4.1       7.3     -0.2     1.9
medium           halfcheetah    36.1    54.2±1.5        42.3    42.1    44.4    -4.3      41.7    43.8     46.3
medium           hopper         29.0    97.2±2.2        28.0    95.4    86.6    0.8       52.1    32.7     31.1
medium           walker2d       6.6     81.9±2.8        17.8    77.8    74.5    0.9       59.1    77.5     81.1
medium-replay    halfcheetah    38.4    55.1±1.0        53.1    40.2    46.2    -2.4      38.6    45.4     47.7
medium-replay    hopper         11.8    89.5±1.8        67.5    93.6    48.6    3.5       33.7    0.6      0.6
medium-replay    walker2d       11.3    56.0±8.6        39.0    49.8    32.6    1.9       19.2    -0.3     0.9
med-expert       halfcheetah    35.8    90.0±5.6        63.3    53.3    62.4    1.8       53.4    44.2     41.9
med-expert       hopper         111.9   111.1±2.9       23.7    108.7   111.0   1.6       96.3    1.9      0.8
med-expert       walker2d       6.4     103.3±5.6       44.6    95.6    98.7    -0.1      40.1    76.9     81.6

Table 3: Results for D4RL datasets. Each number is the normalized score proposed in [12] of the policy at the last iteration of training, averaged over 6 random seeds. We take results of MOPO, MOReL and CQL from their original papers and results of other model-free methods from [12]. We include the performance of behavior cloning (BC) for comparison. We include the 95%-confidence interval for COMBO. We bold the highest score across all methods.
On the walker-walk task, COMBO performs in line with LOMPO and previous
methods. On the more challenging Sawyer task, COMBO matches LOMPO and achieves 100%
success rate on the medium-expert dataset, and substantially outperforms all other methods on the
narrow expert dataset, achieving an average success rate of 96.7%, when all other model-based and
model-free methods fail.
5.3 Results on the D4RL tasks
Finally, to answer question (3), we evaluate COMBO on the OpenAI Gym [6] domains in the
D4RL benchmark [12], which contains three environments (halfcheetah, hopper, and walker2d) and
four dataset types (random, medium, medium-replay, and medium-expert). We include the results
in Table 3. The numbers of BC, SAC-off, BEAR, BRAC-P and BRAC-v are taken from the D4RL
paper, while the results for MOPO, MOReL and CQL are based on their respective papers [67, 29].
COMBO achieves the best performance in 9 out of 12 settings and a comparable result in 1 out of the remaining 3 settings (hopper medium-replay). As noted by Yu et al. [67] and Rafailov et al. [48],
model-based offline methods are generally more performant on datasets that are collected by a wide
range of policies and have diverse state-action distributions (random, medium-replay datasets) while
model-free approaches do better on datasets with narrow distributions (medium, medium-expert
datasets). However, in these results, COMBO generally performs well across dataset types compared
to existing model-free and model-based approaches, suggesting that COMBO is robust to different
dataset types.
6 Related Work
Offline RL [10, 50, 30, 34] is the task of learning policies from a static dataset of past interactions with
the environment. It has found applications in domains including robotic manipulation [25, 38, 48, 54],
NLP [21, 22] and healthcare [52, 62]. Similar to interactive RL, both model-free and model-based
algorithms have been studied for offline RL, with explicit or implicit regularization of the learning
algorithm playing a major role.
Model-free offline RL. Prior model-free offline RL algorithms have been designed to regularize
the learned policy to be “close“ to the behavioral policy either implicitly via regularized variants of
importance sampling based algorithms [47, 58, 35, 59, 41], offline actor-critic methods [53, 45, 27, 16,
64], applying uncertainty quantification to the predictions of the Q-values [2, 28, 63, 34], and learning
conservative Q-values [29, 55] or explicitly measured by direct state or action constraints [14, 36], KL
divergence [21, 63, 69], Wasserstein distance, MMD [28] and auxiliary imitation loss [13]. Different
from these works, COMBO uses both the offline dataset as well as model-generated data.
Model-based offline RL. Model-based offline RL methods [11, 9, 24, 26, 67, 39, 3, 60, 48, 33, 68]
provide an alternative approach to policy learning that involves the learning of a dynamics model
using techniques from supervised learning and generative modeling. Such methods however rely
either on uncertainty quantification of the learned dynamics model which can be difficult for deep
network models [44], or on directly constraining the policy towards the behavioral policy similar
to model-free algorithms [39]. In contrast, COMBO conservatively estimates the value function
by penalizing it in out-of-support states generated through model rollouts. This allows COMBO to
retain all benefits of model-based algorithms such as broad generalization, without the constraints of
explicit policy regularization or uncertainty quantification.
7 Conclusion
In this paper, we present conservative offline model-based policy optimization (COMBO), a model-based offline RL algorithm that penalizes the Q-values evaluated on out-of-support state-action pairs. In particular, COMBO removes the need for the uncertainty quantification widely used in previous model-based offline RL works [26, 67], which can be challenging and unreliable with deep neural networks [44]. Theoretically, we show that COMBO achieves less conservative Q-values compared to prior model-free offline RL methods [29] and guarantees a safe policy improvement. In our empirical study, COMBO achieves the best generalization performance in 3 tasks that require adaptation to unseen behaviors. Moreover, COMBO is able to scale to vision-based tasks and outperforms or obtains comparable results in vision-based locomotion and robotic manipulation tasks. Finally, on the standard D4RL benchmark, COMBO generally performs well across dataset types compared to prior methods. Despite the advantages of COMBO, a few challenges remain, such as the lack of an offline hyperparameter selection scheme that can yield a uniform hyperparameter across different datasets, and an automatically selected f conditioned on the model error. We leave them for future work.
Acknowledgments and Disclosure of Funding
We thank members of RAIL and IRIS for their support and feedback. This work was supported in
part by ONR grants N00014-20-1-2675 and N00014-21-1-2685 as well as Intel Corporation. AK and
SL are supported by the DARPA Assured Autonomy program. AR was supported by the J.P. Morgan
PhD Fellowship in AI.
References
[1] Alekh Agarwal, Nan Jiang, and Sham M Kakade. Reinforcement learning: Theory and
algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 2019.
[2] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective
on offline reinforcement learning. In International Conference on Machine Learning, pages
104–114. PMLR, 2020.
[3] Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. arXiv preprint
arXiv:2008.05556, 2020.
[4] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration
through bayesian deep q-networks. In ITA, pages 1–9. IEEE, 2018.
[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific,
Belmont, MA, 1996.
[6] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[7] Ignasi Clavera, Violet Fu, and Pieter Abbeel. Model-augmented actor-critic: Backpropagating
through paths. arXiv preprint arXiv:2005.08068, 2020.
[8] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint
arXiv:1205.4839, 2012.
[9] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual
foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv
preprint arXiv:1812.00568, 2018.
[10] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement
learning. Journal of Machine Learning Research, 6:503–556, 2005.
[11] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017
IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE,
2017.
[12] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for
deep data-driven reinforcement learning, 2020.
[13] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning.
arXiv preprint arXiv:2106.06860, 2021.
[14] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning
without exploration. arXiv preprint arXiv:1812.02900, 2018.
[15] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error
in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
[16] Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq:
Expected-max q-learning operator for simple yet effective offline and online rl. In International
Conference on Machine Learning, pages 3682–3691. PMLR, 2021.
[17] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
[18] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and
James Davidson. Learning latent dynamics for planning from pixels. In International Conference on
Machine Learning, 2019.
[19] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement
learning. Journal of Machine Learning Research, 11(4), 2010.
[20] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model:
Model-based policy optimization. In Advances in Neural Information Processing Systems,
pages 12498–12509, 2019.
[21] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza,
Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement
learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
[22] Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson, Agata Lapedriza,
Noah Jones, Shixiang Shane Gu, and Rosalind Picard. Human-centric dialog training via offline
reinforcement learning. arXiv preprint arXiv:2010.05848, 2020.
[23] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In
International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
[24] Gregory Kahn, Adam Villaflor, Pieter Abbeel, and Sergey Levine. Composable action-conditioned predictors: Flexible off-policy learning for robot navigation. In Conference on
Robot Learning, pages 806–816. PMLR, 2018.
[25] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep
reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning,
pages 651–673. PMLR, 2018.
[26] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel:
Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
[27] Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement
learning with fisher divergence critic regularization. In International Conference on Machine
Learning, pages 5774–5783. PMLR, 2021.
[28] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy
q-learning via bootstrapping error reduction. In Advances in Neural Information Processing
Systems, pages 11761–11771, 2019.
[29] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for
offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
[30] Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In
Reinforcement Learning, volume 12. Springer, 2012.
[31] Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with
baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661.
PMLR, 2019.
[32] Alex X. Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic:
Deep reinforcement learning with a latent variable model. In Advances in Neural Information
Processing Systems, 2020.
[33] Byung-Jun Lee, Jongmin Lee, and Kee-Eung Kim. Representation balancing offline model-based reinforcement learning. In International Conference on Learning Representations, 2021.
URL https://openreview.net/forum?id=QpNz8r_Ri2Y.
[34] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning:
Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[35] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient
with state distribution correction. CoRR, abs/1904.08473, 2019.
[36] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch
reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202, 2020.
[37] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch.
Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In
International Conference on Learning Representations (ICLR), 2019.
[38] Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, and
Dieter Fox. Iris: Implicit reinforcement without interaction at scale for learning control from
offline robot manipulation data. In 2020 IEEE International Conference on Robotics and
Automation (ICRA), pages 4414–4420. IEEE, 2020.
[39] Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu.
Deployment-efficient reinforcement learning via model-based offline optimization. arXiv
preprint arXiv:2006.03647, 2020.
[40] Rémi Munos and Csaba Szepesvari. Finite-time bounds for fitted value iteration. J. Mach.
Learn. Res., 9:815–857, 2008.
[41] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
[42] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for
reinforcement learning? In International Conference on Machine Learning, pages 2701–2710.
PMLR, 2017.
[43] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. CoRR, abs/1806.03335, 2018.
[44] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin,
Joshua V Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv preprint arXiv:1906.02530,
2019.
[45] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression:
Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
[46] Marek Petrik, Yinlam Chow, and Mohammad Ghavamzadeh. Safe policy improvement by
minimizing robust baseline regret. arXiv preprint arXiv:1607.03842, 2016.
[47] Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning
with function approximation. In ICML, pages 417–424, 2001.
[48] Rafael Rafailov, Tianhe Yu, A. Rajeswaran, and Chelsea Finn. Offline reinforcement learning
from images with latent space models. ArXiv, abs/2012.11547, 2020.
[49] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging
offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint
arXiv:2103.12021, 2021.
[50] Martin Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural
reinforcement learning method. In European Conference on Machine Learning, pages 317–328.
Springer, 2005.
[51] Stephane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement
learning. In ICML, 2012.
[52] Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A
Murphy. Informing sequential clinical decision-making through reinforcement learning: an
empirical study. Machine learning, 84(1-2):109–136, 2011.
[53] Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael
Neunert, Thomas Lampe, Roland Hafner, and Martin Riedmiller. Keep doing what worked:
Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396,
2020.
[54] Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, and Sergey Levine. Cog:
Connecting new skills to past experience with offline reinforcement learning. arXiv preprint
arXiv:2010.14500, 2020.
[55] Samarth Sinha and Animesh Garg. S4rl: Surprisingly simple self-supervision for offline
reinforcement learning. arXiv preprint arXiv:2103.06326, 2021.
[56] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge,
MA, 1998.
[57] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM
Sigart Bulletin, 2(4):160–163, 1991.
[58] Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the
problem of off-policy temporal-difference learning. The Journal of Machine Learning Research,
17(1):2603–2631, 2016.
[59] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback
through counterfactual risk minimization. J. Mach. Learn. Res, 16:1731–1755, 2015.
[60] Phillip Swazinna, Steffen Udluft, and Thomas Runkler. Overcoming model bias for robust
offline deep reinforcement learning. arXiv preprint arXiv:2008.05533, 2020.
[61] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David
Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite.
arXiv preprint arXiv:1801.00690, 2018.
[62] L. Wang, Wei Zhang, Xiaofeng He, and H. Zha. Supervised reinforcement learning with
recurrent neural network for dynamic treatment recommendation. Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[63] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement
learning. arXiv preprint arXiv:1911.11361, 2019.
[64] Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov,
and Hanlin Goh. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv
preprint arXiv:2105.08140, 2021.
[65] F. Yu, H. Chen, X. Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, V. Madhavan, and
Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2633–2642,
2020.
[66] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and
Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement
learning. In Conference on Robot Learning, pages 1094–1100. PMLR, 2020.
[67] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea
Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. arXiv preprint
arXiv:2005.13239, 2020.
[68] Xianyuan Zhan, Xiangyu Zhu, and Haoran Xu. Model-based offline planning with trajectory
pruning. arXiv preprint arXiv:2105.07351, 2021.
[69] Wenxuan Zhou, Sujay Bajracharya, and David Held. Plas: Latent action space for offline
reinforcement learning. arXiv preprint arXiv:2011.07213, 2020.
A Proofs from Section 4
In this section, we provide proofs for the theoretical results in Section 4. Before the proofs, we note that all statements are proven in the case of a finite state space (i.e., |S| < ∞) and a finite action space (i.e., |A| < ∞). We define some commonly appearing notation symbols used in the proofs:
• $P_M$ and $r_M$ (or P and r with no subscript for notational simplicity) denote the dynamics and reward function of the actual MDP M.
• $P_{\overline{M}}$ and $r_{\overline{M}}$ denote the dynamics and reward of the empirical MDP $\overline{M}$ generated from the transitions in the dataset.
• $P_{\widehat{M}}$ and $r_{\widehat{M}}$ denote the dynamics and reward of the MDP induced by the learned model $\widehat{M}$.
We also assume that whenever the cardinality of a particular state-action pair in the offline dataset D, denoted by |D(s, a)|, appears in the denominator, it is non-zero. For any non-existent $(s, a) \notin \mathcal{D}$, we can simply set |D(s, a)| to be a small value < 1, which prevents any bound from producing trivially ∞ values.
A.1 A Useful Lemma and Its Proof
Before proving our main results, we first show that the penalty term in Equation 4 is positive in expectation. Such a positive penalty is important to combat any overestimation that may arise as a result of using $\widehat{\mathcal{B}}$.

Lemma A.1 (Interpolation Lemma). For any f ∈ [0, 1] and any given ρ(s, a) ∈ $\Delta^{|S||A|}$, let $d_f$ be an f-interpolation of ρ and D, i.e., $d_f(s, a) := f\, d(s, a) + (1 - f)\rho(s, a)$. For a given iteration k of Equation 4, we restate the definition of the expected penalty under ρ(s, a) in Eq. 5:
$$\nu(\rho, f) := \mathbb{E}_{s,a\sim\rho(s,a)}\left[\frac{\rho(s,a) - d(s,a)}{d_f(s,a)}\right].$$
Then ν(ρ, f) satisfies: (1) ν(ρ, f) ≥ 0, ∀ρ, f; (2) ν(ρ, f) is monotonically increasing in f for a fixed ρ; and (3) ν(ρ, f) = 0 iff ∀ s, a, ρ(s, a) = d(s, a) or f = 0.
Proof. To prove this lemma, we use algebraic manipulation on the expression for the quantity ν(ρ, f)
and show that it is indeed positive and monotonically increasing in f ∈ [0, 1]:

    ν(ρ, f) = Σ_{s,a} ρ(s, a) (ρ(s, a) − d(s, a)) / (f d(s, a) + (1 − f)ρ(s, a))
            = Σ_{s,a} ρ(s, a) (ρ(s, a) − d(s, a)) / (ρ(s, a) + f(d(s, a) − ρ(s, a)))    (6)

    =⇒ dν(ρ, f)/df = Σ_{s,a} ρ(s, a) (ρ(s, a) − d(s, a))^2 · 1/(ρ(s, a) + f(d(s, a) − ρ(s, a)))^2 ≥ 0,  ∀f ∈ [0, 1].    (7)
Since the derivative of ν(ρ, f) with respect to f is always non-negative, ν(ρ, f) is an increasing function of f
for a fixed ρ, which proves part (2) of the Lemma. Using this property, we can show part (1) of the Lemma as follows:

    ∀f ∈ (0, 1],  ν(ρ, f) ≥ ν(ρ, 0) = Σ_{s,a} ρ(s, a) (ρ(s, a) − d(s, a)) / ρ(s, a) = Σ_{s,a} (ρ(s, a) − d(s, a)) = 1 − 1 = 0.    (8)
Finally, to prove part (3) of this Lemma, note that when f = 0, ν(ρ, f) = 0 (as shown above), and similarly
by setting ρ(s, a) = d(s, a) we obtain ν(ρ, f) = 0. To prove the only-if side of (3), assume that f ≠ 0 and
ρ(s, a) ≠ d(s, a); we will show that in this case ν(ρ, f) ≠ 0. When d(s, a) ≠ ρ(s, a), the derivative
dν(ρ, f)/df > 0 (i.e., strictly positive), and hence the function ν(ρ, f) is a strictly increasing function of f.
Thus, in this case, ν(ρ, f) > ν(ρ, 0) = 0 for all f > 0. Thus we have shown that if ρ(s, a) ≠ d(s, a) and f > 0,
then ν(ρ, f) ≠ 0, which completes the proof of the only-if side of (3).
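As an illustrative aside (not part of the original analysis), the three properties in Lemma A.1 can be checked numerically for small discrete distributions. The snippet below is a minimal Python sketch with arbitrary example values for ρ and d; it simply evaluates ν(ρ, f) on a 4-element support.

    import numpy as np

    def nu(rho, d, f):
        # Expected penalty nu(rho, f) = E_{rho}[(rho - d) / d_f] with d_f = f*d + (1 - f)*rho.
        d_f = f * d + (1 - f) * rho
        return np.sum(rho * (rho - d) / d_f)

    # Arbitrary example distributions over 4 state-action pairs (illustrative only).
    rho = np.array([0.4, 0.3, 0.2, 0.1])
    d = np.array([0.1, 0.2, 0.3, 0.4])

    for f in [0.0, 0.25, 0.5, 0.75, 1.0]:
        print(f, nu(rho, d, f))   # values are non-negative and increase with f
    print(nu(d, d, 0.7))          # exactly 0 when rho == d, matching property (3)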
A.2 Proof of Proposition 4.1
Before proving this proposition, we provide a bound on the Bellman backup in the empirical MDP,
B_M̄. To do so, we formally define the standard concentration properties of the reward and transition
dynamics in the empirical MDP M̄ that we assume in order to prove Proposition A.1. Following prior
work [42, 19, 29], we assume:
Assumption A1. ∀ s, a ∈ M̄, the following relationships hold with high probability, ≥ 1 − δ:

    |r_M̄(s, a) − r(s, a)| ≤ C_{r,δ} / √|D(s, a)|,    ||P_M̄(s'|s, a) − P(s'|s, a)||_1 ≤ C_{P,δ} / √|D(s, a)|.
Under this assumption, and assuming that the reward function in the MDP, r(s, a), is bounded as
|r(s, a)| ≤ R_max, we can bound the difference between the empirical Bellman operator, B_M̄, and the
actual Bellman operator, B_M:

    |(B^π_M̄ Q̂^k − B^π_M Q̂^k)(s, a)| = |(r_M̄(s, a) − r_M(s, a)) + γ Σ_{s'} (P_M̄(s'|s, a) − P_M(s'|s, a)) E_{π(a'|s')}[Q̂^k(s', a')]|
        ≤ |r_M̄(s, a) − r_M(s, a)| + γ |Σ_{s'} (P_M̄(s'|s, a) − P_M(s'|s, a)) E_{π(a'|s')}[Q̂^k(s', a')]|
        ≤ (C_{r,δ} + γ C_{P,δ} 2R_max/(1 − γ)) / √|D(s, a)|.
Thus, the overestimation due to sampling error in the empirical MDP M̄ is bounded as a function of
a bigger constant, C_{r,P,δ}, that can be expressed as a function of C_{r,δ} and C_{P,δ}, and depends on δ via a
√(log(1/δ)) dependency. For the purposes of proving Proposition A.1, we assume that:

    ∀s, a,   |(B^π_M̄ Q̂^k − B^π_M Q̂^k)(s, a)| ≤ C_{r,T,δ} R_max / ((1 − γ) √|D(s, a)|).    (9)
Next, we provide a bound on the error between the Bellman backup induced by the learned dynamics
model and the learned reward, B_M̂, and the actual Bellman backup, B_M. To do so, we note that:

    |(B^π_M̂ Q̂^k − B^π_M Q̂^k)(s, a)| = |r_M̂(s, a) − r_M(s, a) + γ Σ_{s'} (P_M̂(s'|s, a) − P_M(s'|s, a)) E_{π(a'|s')}[Q̂^k(s', a')]|    (10)
        ≤ |r_M̂(s, a) − r_M(s, a)| + γ (2R_max/(1 − γ)) D(P, P_M̂),    (11)

where D(P, P_M̂) is the total-variation divergence between the learned dynamics model and the actual
MDP. Now, we show that the asymptotic Q-function learned by COMBO lower-bounds the actual
Q-function of any policy π with high probability for a large enough β ≥ 0. We will use Equations 9
and 11 to prove this result.
Proposition A.1 (Asymptotic lower-bound). Let P^π denote the Hadamard product of the dynamics
P and a given policy π in the actual MDP, and let S^π := (I − γP^π)^{−1}. Let D denote the total-variation
divergence between two probability distributions. For any π(a|s), the Q-function obtained by
recursively applying Equation 4, with B̂^π = f B^π_M̄ + (1 − f) B^π_M̂, with probability at least 1 − δ,
results in a Q̂^π that satisfies:

    ∀s, a,  Q̂^π(s, a) ≤ Q^π(s, a) − β · [S^π ((ρ − d)/d_f)](s, a) + f [S^π (C_{r,T,δ} R_max / ((1 − γ)√|D|))](s, a)
                        + (1 − f) [S^π (|r − r_M̂| + (2γR_max/(1 − γ)) D(P, P_M̂))](s, a).
Proof. We first note that the Bellman backup B̂^π induces the following Q-function iterates as per
Equation 4:

    Q̂^{k+1}(s, a) = (B̂^π Q̂^k)(s, a) − β (ρ(s, a) − d(s, a))/d_f(s, a)
                 = f (B^π_M̄ Q̂^k)(s, a) + (1 − f) (B^π_M̂ Q̂^k)(s, a) − β (ρ(s, a) − d(s, a))/d_f(s, a)
                 = (B^π Q̂^k)(s, a) − β (ρ(s, a) − d(s, a))/d_f(s, a) + (1 − f) (B^π_M̂ Q̂^k − B^π Q̂^k)(s, a) + f (B^π_M̄ Q̂^k − B^π Q̂^k)(s, a)

    ∀s, a,  Q̂^{k+1} ≤ B^π Q̂^k − β (ρ − d)/d_f + f · C_{r,T,δ} R_max / ((1 − γ)√|D|) + (1 − f) (|r_M̂ − r_M| + (2γR_max/(1 − γ)) D(P, P_M̂)).
Since the RHS upper bounds the Q-function pointwise for each (s, a), the fixed point of the Bellman
iteration process will be pointwise smaller than the fixed point of the Q-function found by solving for
the RHS via equality. Thus, we get that
    Q̂^π(s, a) ≤ [S^π r_M](s, a) − β [S^π ((ρ − d)/d_f)](s, a) + f [S^π (C_{r,T,δ} R_max / ((1 − γ)√|D|))](s, a)
                 + (1 − f) [S^π (|r − r_M̂| + (2γR_max/(1 − γ)) D(P, P_M̂))](s, a),

where [S^π r_M](s, a) = Q^π(s, a), which completes the proof of this proposition.
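To make the operator notation above concrete, the following tiny tabular sketch (purely illustrative, with randomly generated dynamics, rewards and policy; none of the numbers come from the paper) computes S^π = (I − γP^π)^{−1} and the identity Q^π = S^π r used in the proof.

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)

    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
    r = rng.uniform(size=(n_states, n_actions))                        # r[s, a]
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s, a]

    # State-action transition matrix under pi: P^pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s').
    P_pi = np.einsum('kas,st->kast', P, pi).reshape(n_states * n_actions, n_states * n_actions)
    S_pi = np.linalg.inv(np.eye(n_states * n_actions) - gamma * P_pi)

    # Q^pi = S^pi r, matching the identity [S^pi r](s, a) = Q^pi(s, a) used above.
    Q_pi = (S_pi @ r.reshape(-1)).reshape(n_states, n_actions)
    print(Q_pi)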
Next, we use the result and proof technique from Proposition A.1 to prove Corollary 4.1, that in
expectation under the initial state-distribution, the expected Q-value is indeed a lower-bound.
Corollary A.1 (Corollary 4.1 restated). For a sufficiently large β, we have the lower bound
E_{s∼µ0, a∼π(·|s)}[Q̂^π(s, a)] ≤ E_{s∼µ0, a∼π(·|s)}[Q^π(s, a)], where µ0(s) is the initial state distribution.
Furthermore, when the sampling error is small, such as in the large-sample regime, or when the model bias is
small, a small β is sufficient along with an appropriate choice of f.
Proof. To prove this corollary, we note a slightly different variant of Proposition A.1. To observe this,
we will deviate from the proof of Proposition A.1 slightly and aim to express the inequality using
B_M̂, the Bellman operator defined by the learned model and the learned reward function. Denoting
(I − γP^π_M̂)^{−1} as S^π_M̂, doing this will intuitively allow us to obtain β (µ(s)π(a|s))^T [S^π_M̂ ((ρ − d)/d_f)](s, a)
as the conservative penalty, which can be controlled by choosing β appropriately so as to nullify the
potential overestimation caused by the other terms. Formally,

    Q̂^{k+1}(s, a) = (B̂^π Q̂^k)(s, a) − β (ρ(s, a) − d(s, a))/d_f(s, a)
                 = (B^π_M̂ Q̂^k)(s, a) − β (ρ(s, a) − d(s, a))/d_f(s, a) + f (B^π_M̄ Q̂^k − B^π_M̂ Q̂^k)(s, a),

where we denote the last term by ∆(s, a).
By controlling ∆(s, a) using the pointwise triangle inequality,

    ∀s, a,  |B^π_M̄ Q̂^k − B^π_M̂ Q̂^k| ≤ |B^π Q̂^k − B^π_M̂ Q̂^k| + |B^π_M̄ Q̂^k − B^π Q̂^k|,    (12)

and then iterating the backup B^π_M̂ to its fixed point, and finally noting that ρ(s, a) = [(µ · π)^T S^π_M̂](s, a),
we obtain:

    E_{µ,π}[Q̂^π(s, a)] ≤ E_{µ,π}[Q^π_M̂(s, a)] − β E_{ρ(s,a)}[(ρ(s, a) − d(s, a))/d_f(s, a)] + terms independent of β.    (13)
The terms marked as “terms independent of β” correspond to the additional positive error terms
obtained by iterating |B^π Q̂^k − B^π_M̂ Q̂^k| and |B^π_M̄ Q̂^k − B^π Q̂^k|, which can be bounded similarly to
the proof of Proposition A.1 above. Now, by replacing the model Q-function E_{µ,π}[Q^π_M̂(s, a)] with
the actual Q-function E_{µ,π}[Q^π(s, a)] and adding an error term corresponding to the model error to the
bound, we obtain:

    E_{µ,π}[Q̂^π(s, a)] ≤ E_{µ,π}[Q^π(s, a)] + terms independent of β − β E_{ρ(s,a)}[(ρ(s, a) − d(s, a))/d_f(s, a)],    (14)

where the final expectation equals ν(ρ, f) > 0. Hence, by choosing β large enough, we obtain the desired
lower-bound guarantee.
Remark 1 (COMBO does not underestimate at every s ∈ D, unlike CQL). Before concluding
this section, we discuss how the bound obtained by COMBO (Equation 14) is tighter than that of CQL. CQL
learns a Q-function such that the value of the policy under the resulting Q-function lower-bounds
the true value function at each state s ∈ D individually (in the absence of sampling error), i.e.,
∀s ∈ D, V̂^π_CQL(s) ≤ V^π(s), whereas the bound in COMBO is only valid in expectation of the value
function over the initial state distribution, i.e., E_{s∼µ0(s)}[V̂^π_COMBO(s)] ≤ E_{s∼µ0(s)}[V^π(s)], and the
value function at a given state may not be a lower bound. For instance, COMBO can overestimate
the value of a state that is frequent in the dataset distribution d(s, a) but not so frequent in the
marginal distribution ρ(s, a) of the policy under the learned model M̂. To see this more formally, note that
the expected penalty added in the effective Bellman backup performed by COMBO (Equation 4), in
expectation under the dataset distribution d(s, a), denoted ν̃(ρ, d, f), is actually negative:

    ν̃(ρ, d, f) = Σ_{s,a} d(s, a) (ρ(s, a) − d(s, a)) / d_f(s, a) = − Σ_{s,a} d(s, a) (d(s, a) − ρ(s, a)) / (f d(s, a) + (1 − f)ρ(s, a)) < 0,

where the final inequality follows via a direct application of the proof of Lemma A.1. Thus, COMBO
actually overestimates the values at at least some states (in the dataset), unlike CQL.
A.3 Proof of Proposition 4.2
In this section, we provide a proof of Proposition 4.2 and show that COMBO can be less
conservative in terms of the estimated value. To recall, let ∆^π_COMBO := E_{s,a∼d_M(s)π(a|s)}[Q̂^π(s, a)]
and let ∆^π_CQL := E_{s,a∼d_M(s)π(a|s)}[Q̂^π_CQL(s, a)]. From Kumar et al. [29], we obtain that
Q̂^π_CQL(s, a) := Q^π(s, a) − β (π(a|s) − π_β(a|s))/π_β(a|s). We shall derive the condition for the real-data
fraction f = 1 for COMBO, thus making sure that d_f(s) = d^{π_β}(s). To derive the condition under which
∆^π_COMBO ≥ ∆^π_CQL, we note the following simplifications:
    ∆^π_COMBO ≥ ∆^π_CQL    (15)
    =⇒ Σ_{s,a} d_M(s)π(a|s) Q̂^π(s, a) ≥ Σ_{s,a} d_M(s)π(a|s) Q̂^π_CQL(s, a)    (16)
    =⇒ β Σ_{s,a} d_M(s)π(a|s) (ρ(s, a) − d^{π_β}(s)π_β(a|s)) / (d^{π_β}(s)π_β(a|s)) ≤ β Σ_{s,a} d_M(s)π(a|s) (π(a|s) − π_β(a|s)) / π_β(a|s).    (17)
Now, in the expression on the left-hand side, we add and subtract d^{π_β}(s)π(a|s) from the numerator
inside the parenthesis:

    Σ_{s,a} d_M(s, a) (ρ(s, a) − d^{π_β}(s)π_β(a|s)) / (d^{π_β}(s)π_β(a|s))    (18)
    = Σ_{s,a} d_M(s, a) (ρ(s, a) − d^{π_β}(s)π(a|s) + d^{π_β}(s)π(a|s) − d^{π_β}(s)π_β(a|s)) / (d^{π_β}(s)π_β(a|s))    (19)
    = Σ_{s,a} d_M(s, a) (π(a|s) − π_β(a|s)) / π_β(a|s)  +  Σ_{s,a} d_M(s, a) · (ρ(s) − d^{π_β}(s)) / d^{π_β}(s) · π(a|s)/π_β(a|s),    (20)

where the first sum is marked as term (1). The term marked (1) is identical to the CQL term that appears on
the right-hand side of Equation 17. Thus the inequality in Equation 17 is satisfied when the second term above
is negative. To show this, first note that d^{π_β}(s) = d_M(s), which results in a cancellation. Finally,
re-arranging the second term into expectations gives us the desired result. An analogous condition can be
derived when f ≠ 1, but we omit that derivation as the terms appearing in the final inequality would be hard
to interpret.
A.4 Proof of Proposition 4.3
To prove the policy improvement result in Proposition 4.3, we first observe that using Equation 4 for
the Bellman backups amounts to finding a policy that maximizes the return in a modified
“f-interpolant” MDP, which admits the Bellman backup B̂^π and is induced by a linear interpolation of
backups in the empirical MDP M̄ and the MDP induced by the learned dynamics model M̂. The return of a
policy π in this effective f-interpolant MDP is denoted by J(M̄, M̂, f, π). Alongside this, the return
is penalized by the conservative penalty, where ρ^π denotes the marginal state-action distribution of
policy π in the learned model M̂:

    Ĵ(f, π) = J(M̄, M̂, f, π) − β ν(ρ^π, f)/(1 − γ).    (21)

We will require bounds on the return of a policy π in this f-interpolant MDP, J(M̄, M̂, f, π), which
we first prove separately as Lemma A.2 below, and then move to the proof of Proposition 4.3.
Lemma A.2 (Bound on return in f-interpolant MDP). For any two MDPs, M1 and M2, with
the same state space, action space and discount factor, and for a given fraction f ∈ [0, 1], define
the f-interpolant MDP Mf as the MDP on the same state space and action space, with the same
discount, with dynamics P_{Mf} := f P_{M1} + (1 − f) P_{M2} and reward function r_{Mf} :=
f r_{M1} + (1 − f) r_{M2}. Then, given any auxiliary MDP M, the return of any policy π in Mf,
J(π, Mf), also denoted by J(M1, M2, f, π), lies in the interval [J(π, M) − α, J(π, M) + α],
where α is given by:

    α = (2γ(1 − f)/(1 − γ)^2) R_max D(P_{M2}, P_M) + (γf/(1 − γ)) E_{d^π_M π}[ |(P^π_{M1} − P^π_M) Q^π_M| ]
        + (f/(1 − γ)) E_{s,a∼d^π_M π}[ |r_{M1}(s, a) − r_M(s, a)| ] + ((1 − f)/(1 − γ)) E_{s,a∼d^π_M π}[ |r_{M2}(s, a) − r_M(s, a)| ].    (22)
Proof. To prove this lemma, we note two general inequalities. First, note that for a fixed transition
dynamics, say P , the return decomposes linearly in the components of the reward as the expected
return is linear in the reward function:
J(P, rMf ) = J(P, f rM1 + (1 − f )rM2 ) = f J(P, rM1 ) + (1 − f )J(P, rM2 ).
As a result, we can bound J(P, r_{Mf}) using J(P, r) for a new reward function r of the auxiliary
MDP M as follows:

    J(P, r_{Mf}) = J(P, f r_{M1} + (1 − f) r_{M2}) = J(P, r + f(r_{M1} − r) + (1 − f)(r_{M2} − r))
                 = J(P, r) + f J(P, r_{M1} − r) + (1 − f) J(P, r_{M2} − r)
                 = J(P, r) + (f/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[r_{M1}(s, a) − r(s, a)]
                           + ((1 − f)/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[r_{M2}(s, a) − r(s, a)].
Second, note that for a given reward function r, but a linear combination of dynamics, the following
bound holds:

    J(P_{Mf}, r) = J(f P_{M1} + (1 − f) P_{M2}, r)
                 = J(P_M + f (P_{M1} − P_M) + (1 − f)(P_{M2} − P_M), r)
                 = J(P_M, r) − (γ(1 − f)/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[ (P^π_{M2} − P^π_M) Q^π_M ]
                             − (γf/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[ (P^π_{M1} − P^π_M) Q^π_M ]
                 ∈ J(P_M, r) ± [ (γf/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[ |(P^π_{M1} − P^π_M) Q^π_M| ] + (2γ(1 − f) R_max/(1 − γ)^2) D(P_{M2}, P_M) ].
To observe the third equality, we utilize the result on the difference between returns of a policy π
on two different MDPs, PM1 and PMf from Agarwal et al. [1] (Chapter 2, Lemma 2.2, Simulation Lemma), and additionally incorporate the auxiliary MDP M in the expression via addition
and subtraction in the previous (second) step. In the fourth step, we finally bound one term that
corresponds to the learned model via the total-variation divergence D(PM2 , PM ) and the other term
corresponding to the empirical MDP M is left in its expectation form to be bounded later.
Using the above bounds on the return for reward mixtures and dynamics mixtures, proving this lemma is
straightforward:

    J(M1, M2, f, π) := J(P_{Mf}, f r_{M1} + (1 − f) r_{M2}) = J(f P_{M1} + (1 − f) P_{M2}, r_{Mf})
        ∈ J(P_{Mf}, r_M) ± [ (f/(1 − γ)) E_{s,a∼d^π_M π}[|r_{M1}(s, a) − r_M(s, a)|] + ((1 − f)/(1 − γ)) E_{s,a∼d^π_M π}[|r_{M2}(s, a) − r_M(s, a)|] ]  (=: ∆_R),
where the second step holds via linear decomposition of the return of π in Mf with respect to the
reward interpolation, and bounding the terms that appear in the reward difference. For convenience,
we refer to these offset terms due to the reward as ∆R . For the final part of this proof, we bound
J(PMf , rM ) in terms of the return on the actual MDP, J(PM , rM ), using the inequality proved
above that provides intervals for mixture dynamics but a fixed reward function. Thus, the overall
bound is given by J(π, Mf ) ∈ [J(π, M) − α, J(π, M) + α], where α is given by:
    α = (2γ(1 − f)/(1 − γ)^2) R_max D(P_{M2}, P_M) + (γf/(1 − γ)) E_{d^π_M π}[ |(P^π_{M1} − P^π_M) Q^π_M| ] + ∆_R.    (23)
This concludes the proof of this lemma.
Finally, we prove Theorem 4.3, which shows how policy optimization with respect to Ĵ(f, π) affects the
performance in the actual MDP, by using Equation 21 and building on the analysis of pure model-free
algorithms from Kumar et al. [29]. We restate a more complete statement of the theorem below and
present the constants at the end of the proof.
Theorem 2 (Formal version of Proposition 4.3). Let π_out(a|s) be the policy obtained by COMBO.
Assume ν(ρ^{π_out}, f) − ν(ρ^β, f) ≥ C for some constant C > 0. Then, the policy π_out(a|s) is a ζ-safe
policy improvement over π_β in the actual MDP M, i.e., J(π_out, M) ≥ J(π_β, M) − ζ, with
probability at least 1 − δ, where ζ is given by (with ρ^β(s, a) := d^{π_β}_M̂(s, a)):

    ζ = O(γf/(1 − γ)^2) E_{s∼d^{π_out}_M}[ √(|A|/|D(s)|) (DCQL(π_out, π_β) + 1) ]
        + O(γ(1 − f)/(1 − γ)^2) D_TV(P_M, P_M̂) − β C/(1 − γ).
Proof. We first note that since policy improvement is not being performed in the same MDP M
as the f-interpolant MDP Mf, we need to upper and lower bound the amount of improvement
occurring in the actual MDP due to optimization in the f-interpolant MDP. As a result, our first step
is to relate J(π, M) and J(π, Mf) := J(M̄, M̂, f, π) for any given policy π.
Step 1: Bounding the return in the actual MDP due to optimization in the f-interpolant MDP.
By directly applying Lemma A.2, stated and proved previously, we obtain the following upper and
lower bounds on the return of a policy π:

    J(M̄, M̂, f, π) ∈ [J(π, M) − α, J(π, M) + α],

where α is given in Equation 22. As a result, we just need to bound the terms appearing in the
expression of α to obtain a bound on the return differences. We first note that the terms in the
expression for α are of two types: (1) terms that depend only on the reward function differences
(captured in ∆_R in Equation 23), and (2) terms that depend on the dynamics (the other two terms in
Equation 23).
To bound ∆_R, we simply appeal to the concentration inequalities on the reward (Assumption A1), and bound
∆_R as:

    ∆_R := (f/(1 − γ)) E_{s,a∼d^π_M π}[|r_{M1}(s, a) − r_M(s, a)|] + ((1 − f)/(1 − γ)) E_{s,a∼d^π_M π}[|r_{M2}(s, a) − r_M(s, a)|]
         ≤ (1/(1 − γ)) E_{s,a∼d^π_M π}[ C_{r,δ}/√|D(s, a)| ] + (1/(1 − γ)) ||R_M − R_M̂|| =: ∆^u_R.
Note that both of these terms are of the order of O(1/(1 − γ)) and hence do not figure in
the informal bound in Theorem 4.3 in the main text, as they are dominated by terms that grow
quadratically with the horizon. To bound the remaining terms in the expression for α, we utilize a
result directly from Kumar et al. [29] for the empirical MDP M̄, which holds for any policy π(a|s),
as shown below:

    (γ/(1 − γ)) E_{s,a∼d^π_M(s)π(a|s)}[ |(P^π_M̄ − P^π_M) Q^π_M| ]
        ≤ (2γR_max C_{P,δ}/(1 − γ)^2) E_{s∼d^π_M(s)}[ (√|A|/√|D(s)|) (√(DCQL(π, π_β)(s)) + 1) ].
Step 2: Incorporating policy improvement in the f-interpolant MDP. Now we incorporate the
improvement of the policy π_out over the policy π_β on a weighted mixture of M̄ and M̂. In what follows,
we derive a lower bound on this improvement by using the fact that the policy π_out is obtained by
maximizing Ĵ(f, π) from Equation 21. As a direct consequence of Equation 21, we note that

    Ĵ(f, π_out) = J(M̄, M̂, f, π_out) − β ν(ρ^{π_out}, f)/(1 − γ) ≥ Ĵ(f, π_β) = J(M̄, M̂, f, π_β) − β ν(ρ^β, f)/(1 − γ).    (24)
Following Step 1, we will use the upper bound on J(M̄, M̂, f, π) for the policy π = π_out and the
lower bound on J(M̄, M̂, f, π) for the policy π = π_β, and obtain the following inequality:

    J(π_out, M) − β ν(ρ^{π_out}, f)/(1 − γ) ≥ J(π_β, M) − β ν(ρ^β, f)/(1 − γ)
        − (4γ(1 − f)R_max/(1 − γ)^2) D(P_M, P_M̂)
        − (2γf/(1 − γ)) E_{d^{π_out}_M}[ |(P^{π_out}_M̄ − P^{π_out}_M) Q^{π_out}_M| ]    (this term is denoted (∗))
        − (4γR_max C_{P,δ} f/(1 − γ)^2) E_{s∼d^{π_β}_M}[ √(|A|/|D(s)|) ]    (this term is denoted (∧))
        − ∆^u_R.
The term marked by (∗) in the above expression can be upper bounded via the concentration properties
of the dynamics, as done in Step 1 of this proof:

    (∗) ≤ (4γf C_{P,δ} R_max/(1 − γ)^2) E_{s∼d^{π_out}_M(s)}[ (√|A|/√|D(s)|) (√(DCQL(π_out, π_β)(s)) + 1) ].    (25)
Finally, using Equation 25, we can lower-bound the policy return difference as:

    J(π_out, M) − J(π_β, M) ≥ β ν(ρ^{π_out}, f)/(1 − γ) − β ν(ρ^β, f)/(1 − γ) − (4γ(1 − f)R_max/(1 − γ)^2) D(P_M, P_M̂) − (∗) − ∆^u_R
                             ≥ β C/(1 − γ) − (4γ(1 − f)R_max/(1 − γ)^2) D(P_M, P_M̂) − (∗) − ∆^u_R.
Plugging the bounds for the individual terms into the expression for ζ, where J(π_out, M) − J(π_β, M) ≥ −ζ,
we obtain:

    ζ = (4fγR_max C_{P,δ}/(1 − γ)^2) E_{s∼d^{π_out}_M(s)}[ (√|A|/√|D(s)|) (√(DCQL(π_out, π_β)(s)) + 1) ] + (∧) + ∆^u_R
        + (4(1 − f)γR_max/(1 − γ)^2) D(P_M, P_M̂) − β C/(1 − γ).    (26)
Remark 3 (Interpretation of Proposition 4.3). Now we will interpret the theoretical expression for
ζ in Equation 26, and discuss the scenarios when it is negative. When the expression for ζ is negative,
the policy πout is an improvement over πβ in the original MDP, M.
• We first discuss whether the assumption ν(ρ^{π_out}, f) − ν(ρ^β, f) ≥ C > 0 is reasonable in
practice. Note that we have never used the fact that the learned model P_M̂ is close to the
actual dynamics P_M on the states visited by the behavior policy π_β in our analysis. We will use
this fact now: in practical scenarios, ν(ρ^β, f) is expected to be smaller than ν(ρ^π, f), since
ν(ρ^β, f) is directly controlled by the difference and density ratio of ρ^β(s, a) and d(s, a); by Lemma A.1,

    ν(ρ^β, f) ≤ ν(ρ^β, f = 1) = Σ_{s,a} d^{π_β}_M̂(s, a) ( d^{π_β}_M̂(s, a)/d^{π_β}_M̄(s, a) − 1 ),

which is expected to be small for the behavior policy π_β when the behavior policy
marginal in the empirical MDP, d^{π_β}_M̄(s, a), is broad. This is a direct consequence of the
fact that the learned dynamics composed with the policy, P^{π_β}_M̂, is
close to its counterpart in the empirical MDP, P^{π_β}_M̄, for π_β. Note that this is not true for
any other policy besides the behavior policy: a learned policy π that performs several counterfactual
actions in a rollout and deviates from the data incurs an extra error
which depends on the importance ratio of policy densities, compounded over the horizon,
and manifests as the DCQL term (similar to Equation 25, or Lemma D.4.1 in Kumar et al.
[29]). Thus, in practice, we argue that we are interested in situations where the assumption
ν(ρ^{π_out}, f) − ν(ρ^β, f) ≥ C > 0 holds, in which case, by increasing β, we can make the
expression for ζ in Equation 26 negative, allowing for policy improvement.
• In addition, note that when f is close to 1, the bound reverts to a standard model-free policy
improvement bound, and when f is close to 0, the bound reverts to a typical model-based
policy improvement bound. In scenarios with high sampling error (i.e., smaller |D(s)|), if
we can learn a good model, i.e., D(P_M, P_M̂) is small, we can attain policy improvement
better than model-free methods by relying on the learned model and setting f closer to 0. A
similar argument can be made in reverse for handling cases when learning an accurate
dynamics model is hard.
B Experimental details
In this section, we include all details of our empirical evaluations of COMBO.
B.1 Practical algorithm implementation details
Model training. In the setting where the observation space is low-dimensional, as mentioned
in Section 3, we represent the model as a probabilistic neural network that outputs a Gaussian
distribution over the next state and reward given the current state and action:
    T̂_θ(s_{t+1}, r_t | s_t, a_t) = N(µ_θ(s_t, a_t), Σ_θ(s_t, a_t)).
We train an ensemble of 7 such dynamics models following [20] and pick the best 5 models based on
the validation prediction error on a held-out set that contains 1000 transitions in the offline dataset D.
During model rollouts, we randomly pick one dynamics model from the best 5 models. Each model
in the ensemble is represented as a 4-layer feedforward neural network with 200 hidden units. For
the generalization experiments in Section 5.1, we additionally use a two-head architecture to output
the mean and variance after the last hidden layer following [67].
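For illustration, a minimal PyTorch sketch of one such ensemble member is given below. It is our own simplified rendering of the two-head probabilistic model described above (class, layer and activation choices are hypothetical), not the exact MOPO/COMBO implementation.

    import torch
    import torch.nn as nn

    class GaussianDynamicsModel(nn.Module):
        # One ensemble member: predicts a diagonal Gaussian over (next state, reward).
        def __init__(self, state_dim, action_dim, hidden_dim=200):
            super().__init__()
            out_dim = state_dim + 1  # next state + scalar reward
            self.trunk = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            )
            # Two heads after the last hidden layer: mean and log-variance.
            self.mean_head = nn.Linear(hidden_dim, out_dim)
            self.logvar_head = nn.Linear(hidden_dim, out_dim)

        def forward(self, state, action):
            h = self.trunk(torch.cat([state, action], dim=-1))
            mean = self.mean_head(h)
            logvar = torch.clamp(self.logvar_head(h), -10.0, 2.0)  # keep variances numerically sane
            return mean, logvar

        def loss(self, state, action, target):
            # Gaussian negative log-likelihood (up to additive constants).
            mean, logvar = self(state, action)
            inv_var = torch.exp(-logvar)
            return (((mean - target) ** 2) * inv_var + logvar).mean()

An ensemble of 7 such models would be trained independently on the offline data, with the best 5 retained according to held-out prediction error, as described above.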
In the image-based setting, we follow Rafailov et al. [48] and use a variational model with the
following components:

    Image encoder:           h_t = E_θ(o_t)
    Inference model:         s_t ∼ q_θ(s_t | h_t, s_{t−1}, a_{t−1})
    Latent transition model: s_t ∼ T̂_θ(s_t | s_{t−1}, a_{t−1})
    Reward predictor:        r_t ∼ p_θ(r_t | s_t)
    Image decoder:           o_t ∼ D_θ(o_t | s_t).    (27)
We train the model using the evidence lower bound:

    max_θ  Σ_{τ=0}^{T−1} ( E_{q_θ}[ log D_θ(o_{τ+1} | s_{τ+1}) ] − E_{q_θ}[ D_KL( q_θ(o_{τ+1}, s_{τ+1} | s_τ, a_τ) ‖ T̂^τ_θ(s_{τ+1}, a_{τ+1}) ) ] )
At each step τ we sample a latent forward model T̂^τ_θ from a fixed set of K models [T̂^1_θ, . . . , T̂^K_θ]. For
the encoder E_θ we use a convolutional neural network with kernel size 4 and stride 2. For the Walker
environment we use 4 layers, while the Door Opening task has 5 layers. The decoder D_θ is a transposed
convolutional network with stride 2 and kernel sizes [5, 5, 6, 6] and [5, 5, 5, 6, 6] respectively. The
inference network has a two-level structure similar to Hafner et al. [18] with a deterministic path
using a GRU cell with 256 units and a stochastic path implemented as a conditional diagonal Gaussian
with 128 units. We only train an ensemble of stochastic forward models, which are also implemented
as conditional diagonal Gaussians.
Policy Optimization. We sample a batch size of 256 transitions for the critic and policy learning.
We set f = 0.5, which means we sample 50% of the batch of transitions from D and another 50%
from Dmodel . The equal split between the offline data and the model rollouts strikes the balance
between conservatism and generalization in our experiments as shown in our experimental results in
Section 5. We represent the Q-networks and policy as 3-layer feedforward neural networks with 256
hidden units.
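Schematically, the mixed batch can be assembled as in the sketch below; the buffer objects and their sample method are hypothetical stand-ins rather than the authors' code.

    import numpy as np

    def sample_mixed_batch(offline_buffer, model_buffer, batch_size=256, f=0.5):
        # Draw f * batch_size transitions from the offline dataset D and the remainder
        # from the model-rollout buffer D_model, then concatenate field-by-field.
        n_real = int(f * batch_size)
        real = offline_buffer.sample(n_real)                  # hypothetical buffer API
        synthetic = model_buffer.sample(batch_size - n_real)
        return {k: np.concatenate([real[k], synthetic[k]], axis=0) for k in real}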
For the choice of ρ(s, a) in Equation 2, we can obtain Q-values that lower-bound the true value of
the learned policy π by setting ρ(s, a) = d^π_M̂(s)π(a|s). However, as discussed in [29], computing this
distribution by alternating full off-policy evaluation of the policy π̂^k at each iteration k with one step of
policy improvement is computationally expensive. Instead, following [29], we pick a particular distribution
ψ(a|s) that approximates the policy maximizing the Q-function at the current iteration and set
ρ(s, a) = d^π_M̂(s)ψ(a|s). We formulate the new objective as follows:

    Q̂^{k+1} ← argmin_Q  β ( E_{s∼d^π_M̂(s), a∼ψ(a|s)}[Q(s, a)] − E_{s,a∼D}[Q(s, a)] )
               + (1/2) E_{s,a,s'∼d_f}[ (Q(s, a) − B̂^π Q̂^k(s, a))^2 ] + R(ψ),    (28)
where R(ψ) is a regularizer on ψ. In practice, we pick R(ψ) to be −D_KL(ψ(a|s) ‖ Unif(a)), and
under such a regularization, the first term in Equation 28 corresponds to computing the softmax of the
Q-values at any state s as follows:

    Q̂^{k+1} ← argmin_Q max_ψ  β ( E_{s∼d^π_M̂(s)}[ log Σ_a exp(Q(s, a)) ] − E_{s,a∼D}[Q(s, a)] )
               + (1/2) E_{s,a,s'∼d_f}[ (Q(s, a) − B̂^π Q̂^k(s, a))^2 ].    (29)
We estimate the log-sum-exp term in Equation 29 by sampling 10 actions at every state s in the
batch from a uniform policy Unif(a) and the current learned policy π(a|s) with importance sampling
following [29].
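A minimal sketch of this estimator is given below, assuming a continuous action space normalized to [−1, 1]^d and hypothetical q_network and policy objects; it mirrors the CQL-style importance-sampled estimator rather than reproducing the exact released code.

    import torch

    def estimate_logsumexp(q_network, policy, states, num_samples=10, action_dim=6):
        # Importance-sampled estimate of log ∫ exp(Q(s, a)) da, the continuous analogue of the
        # log-sum-exp term, using half uniform samples and half samples from the current policy.
        n = states.shape[0]
        # Uniform proposal over [-1, 1]^action_dim; its log-density is -action_dim * log(2).
        unif_actions = torch.rand(num_samples, n, action_dim) * 2 - 1
        unif_logprob = -action_dim * torch.log(torch.tensor(2.0))

        # Policy proposal (assumes policy.sample returns actions and their log-probabilities).
        pi_actions, pi_logprob = policy.sample(states, num_samples)

        # Assumes q_network broadcasts over the leading sample dimension, returning [num_samples, n].
        q_unif = q_network(states, unif_actions) - unif_logprob
        q_pi = q_network(states, pi_actions) - pi_logprob

        # Average the importance-weighted terms from both proposals.
        stacked = torch.cat([q_unif, q_pi], dim=0)
        return torch.logsumexp(stacked, dim=0) - torch.log(torch.tensor(2.0 * num_samples))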
B.2 Hyperparameter Selection
In this section, we discuss the hyperparameters that we use for COMBO. In the D4RL and generalization
experiments, our method is built upon the implementation of MOPO provided at:
https://github.com/tianheyu927/mopo. The hyperparameters used in COMBO that relate
to the backbone RL algorithm SAC, such as twin Q-functions and the number of gradient steps, follow
those used in MOPO, with the exception of smaller critic and policy learning rates, which
we discuss below. In the image-based domains, COMBO is built upon LOMPO without any
changes to the parameters used there. For the evaluation of COMBO, we follow the evaluation
protocol in D4RL [12] and a variety of prior offline RL works [29, 67, 26] and report the normalized
score of the smoothed undiscounted average return over 3 random seeds for all environments except
sawyer-door-close and sawyer-door, where we report the average success rate over 3 random
seeds. As mentioned in Section 3, we use the regularization objective in Eq. 2 to select each hyperparameter
from a range of pre-specified candidates in a fully offline manner, unlike prior model-based
offline RL methods such as [67] and [26] that use similar hyperparameters to COMBO but tune them
manually based on policy performance obtained via online rollouts.
We now list the additional hyperparameters as follows.
• Rollout length h. We perform short-horizon model rollouts in COMBO similar to
Yu et al. [67] and Rafailov et al. [48]. For the D4RL and generalization
experiments, we followed the defaults used in MOPO and used h = 1 for walker2d and
sawyer-door-close, h = 5 for hopper, halfcheetah and halfcheetah-jump, and h = 25
for ant-angle. In the image-based domains, we used a rollout length of h = 5 for both the
walker-walk and sawyer-door-open environments, following the same hyperparameters
used in Rafailov et al. [48].
• Q-function and policy learning rates. On state-based domains, we apply our automatic
selection rule to the set {1e − 4, 3e − 4} for the Q-function learning rate and the set
{1e − 5, 3e − 5, 1e − 4} for the policy learning rate. We found that 3e − 4 for the Q-function
learning rate (also used previously in Kumar et al. [29]) and 1e−4 for the policy learning rate
(also recommended previously in Kumar et al. [29] for gym domains) work well for almost
all domains except that on walker2d where a smaller Q-function learning rate of 1e − 4 and
a correspondingly smaller policy learning rate of 1e − 5 works the best according to our
automatic hyperparameter selection scheme. In the image-based domains, we followed the
defaults from prior work [48] and used 3e − 4 for both the policy and Q-function.
• Conservative coefficient β. We use our hyperparameter selection rule to select β from the set
{0.5, 1.0, 5.0}, corresponding to low, medium and high conservatism. A larger β is
desirable for more narrow dataset distributions with lower coverage of the state-action space, which
propagate error in a backup, whereas a smaller β is desirable for diverse dataset distributions.
On the D4RL experiments, we found that β = 0.5 works well for halfcheetah regardless of dataset
quality, while on hopper and walker2d, we found that the more “narrow” dataset distributions (medium
and medium-expert) work best with the larger β = 5.0, whereas the more “diverse” dataset
distributions (random and medium-replay) work best with the smaller β = 0.5, which
is consistent with this intuition. On the generalization experiments, β = 1.0 works best for all
environments. In the image-based domains we use β = 0.5 for the medium-replay walker-walk
task and β = 1.0 for all other domains, which again is in accordance with the impact of
β on performance.
• Choice of ρ(s, a). We first decouple ρ(s, a) = ρ(s)ρ(a|s) for convenience. As discussed
in Appendix B.1, we use ρ(a|s) as the soft-maximum of the Q-values, estimated with
log-sum-exp. For ρ(s), we apply the automatic hyperparameter selection rule to the set
{ρ(s) = d^π_M̂, ρ(s) = d_f}. We found that d^π_M̂ works better on the hopper task in D4RL,
while ρ(s) = d_f works well for the remaining domains.
• Choice of µ(a|s). For the rollout policy µ, we use our automatic selection rule on the set
{Unif(a), π(a|s)}, i.e., the set containing a random policy and the current learned policy.
We found that µ(a|s) = Unif(a) works well on the hopper task in D4RL and also in
the ant-angle generalization experiment. For the remaining state-based environments,
we found that µ(a|s) = π(a|s) excels. In the image-based domains, µ(a|s) = Unif(a) works
well on the walker-walk domain and µ(a|s) = π(a|s) is better for the sawyer-door
environment. We observed that µ(a|s) = Unif(a) behaves less conservatively and is
suitable for tasks where the dynamics model can be learned fairly precisely.
• Choice of f. For the ratio between model rollouts and offline data, f, we input the set
{0.5, 0.8} to our automatic hyperparameter selection rule to determine the best f for each
domain. We found that f = 0.8 works well on the medium and medium-expert datasets of the
walker2d task in D4RL. For the remaining environments, we find that f = 0.5 works well.
We also provide additional experimental results on how our automatic hyperparameter selection rule
selects hyperparameters. As shown in Tables 4, 5, 6 and 7, our automatic hyperparameter selection
rule is able to pick the values of β, µ(a|s), ρ(s) and f that correspond to the best policy
performance, based on the regularization value.
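Schematically, the fully offline selection rule amounts to choosing, for each hyperparameter, the candidate value whose trained agent attains the lowest regularizer value. The sketch below uses a hypothetical train_and_evaluate_regularizer helper and is not the released implementation.

    def select_hyperparameter(candidates, train_and_evaluate_regularizer):
        # Pick the candidate whose trained agent attains the lowest value of the COMBO
        # regularizer (Eq. 2), evaluated fully offline -- no environment rollouts needed.
        scores = {c: train_and_evaluate_regularizer(c) for c in candidates}
        return min(scores, key=scores.get)

    # Example: choosing beta from the candidate set used in our experiments.
    # best_beta = select_hyperparameter([0.5, 1.0, 5.0], my_offline_training_fn)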
Task | β = 0.5 performance | β = 0.5 regularizer value | β = 5.0 performance | β = 5.0 regularizer value
halfcheetah-medium | 54.2 | -778.6 | 40.8 | -236.8
halfcheetah-medium-replay | 55.1 | 28.9 | 9.3 | 283.9
halfcheetah-medium-expert | 89.4 | 189.8 | 90.0 | 6.5
hopper-medium | 75.0 | -740.7 | 97.2 | -2035.9
hopper-medium-replay | 89.5 | 37.7 | 28.3 | 107.2
hopper-medium-expert | 111.1 | -705.6 | 75.3 | -64.1
walker2d-medium | 1.9 | 51.5 | 81.9 | -1991.2
walker2d-medium-replay | 56.0 | -157.9 | 27.0 | 53.6
walker2d-medium-expert | 10.3 | -788.3 | 103.3 | -3891.4

Table 4: Our automatic hyperparameter selection rule for β on a set of representative D4RL
environments. We show the policy performance and the regularizer value for each candidate. The lower
regularizer value consistently corresponds to the higher policy return, suggesting
the effectiveness of our automatic selection rule.
B.3 Details of generalization environments
For halfcheetah-jump and ant-angle, we follow the same environments used in MOPO.
For sawyer-door-close, we train on the sawyer-door environment in https://github.com/
rlworkgroup/metaworld with dense rewards for opening the door until convergence. We collect 50000
transitions, with half of the data collected by the final expert policy and the other half by a policy that
reaches about half of the expert-level performance. We relabel the reward such that the reward is 1 when
the door is fully closed and 0 otherwise. Hence, the offline RL agent is required to learn a behavior that
is different from the behavior policy, in a sparse reward setting. We provide the datasets at the anonymous
link given in footnote 1.
Task | µ(a|s) = Unif(a) performance | µ(a|s) = Unif(a) regularizer value | µ(a|s) = π(a|s) performance | µ(a|s) = π(a|s) regularizer value
hopper-medium | 97.2 | -2035.9 | 52.6 | -14.9
walker2d-medium | 7.9 | -106.8 | 81.9 | -1991.2

Table 5: Our automatic hyperparameter selection rule for µ(a|s) on the medium datasets in the
hopper and walker2d environments from D4RL. We follow the same convention defined in Table 4 and find that
our automatic selection rule can effectively select µ offline.
Task | ρ(s) = d^π_M̂ performance | ρ(s) = d^π_M̂ regularizer value | ρ(s) = d_f performance | ρ(s) = d_f regularizer value
hopper-medium | 97.2 | -2035.9 | 56.0 | -6.0
walker2d-medium | 1.8 | 14617.4 | 81.9 | -1991.2

Table 6: Our automatic hyperparameter selection rule for ρ(s) on the medium datasets in the hopper
and walker2d environments from D4RL. We follow the same convention defined in Table 4 and find that our
automatic selection rule can effectively select ρ offline.
B.4 Details of image-based environments
Figure 3: Our image-based environments. The observations are 64 × 64 and 128 × 128 raw RGB images
for the walker-walk and sawyer-door tasks, respectively. The sawyer-door-close environment used in
Section 5.1 also uses the sawyer-door environment.
We visualize our image-based environments in Figure 3. We use the standard walker-walk environment from Tassa et al. [61] with 64 × 64 pixel observations and an action repeat of 2. Datasets
were constructed the same way as Fu et al. [12] with 200 trajectories each. For the sawyer-door we
use 128 × 128 pixel observations. The medium-expert dataset contains 1000 rollouts (with a rollout
length of 50 steps) covering the state distribution from grasping the door handle to opening the door.
The expert dataset contains 1000 trajectories sampled from a fully trained (stochastic) policy. The
data was obtained from the training process of a stochastic SAC policy using the dense reward function
as defined in Yu et al. [66]. However, we relabel the rewards, so an agent receives a reward of 1
when the door is fully open and 0 otherwise. This aims to evaluate offline-RL performance in a
sparse-reward setting. All the datasets are from [48].
B.5 Computation Complexity
For the D4RL and generalization experiments, COMBO is trained on a single NVIDIA GeForce RTX
2080 Ti for one day. For the image-based experiments, we utilized a single NVIDIA GeForce RTX
2070. We trained the walker-walk tasks for a day and the sawyer-door-open tasks for about two
days.
B.6 License of datasets
We acknowledge that all datasets used in this paper use the MIT license.
1 The datasets of the generalization environments are available at the anonymous link: https://drive.google.com/file/d/1pn6dS5OgPQVp_ivGws-tmWdZoU7m_LvC/view?usp=sharing.
Task | f = 0.5 performance | f = 0.5 regularizer value | f = 0.8 performance | f = 0.8 regularizer value
hopper-medium | 97.2 | -2035.9 | 93.8 | -21.3
walker2d-medium | 70.9 | -1707.0 | 81.9 | -1991.2

Table 7: Our automatic hyperparameter selection rule for f on the medium datasets in the hopper
and walker2d environments from D4RL. We follow the same convention defined in Table 4 and find that our
automatic selection rule can effectively select f offline.
Environment | Batch Mean | Batch Max | COMBO (Ours) | CQL+MBPO
halfcheetah-jump | -1022.6 | 1808.6 | 5392.7 ± 575.5 | 4053.4 ± 176.9
ant-angle | 866.7 | 2311.9 | 2764.8 ± 43.6 | 809.2 ± 135.4
sawyer-door-close | 5% | 100% | 100% ± 0.0% | 62.7% ± 24.8%

Table 8: Comparison between COMBO and CQL+MBPO on tasks that require out-of-distribution
generalization. Results are average returns for halfcheetah-jump and ant-angle and average success rates for
sawyer-door-close. All results are averaged over 6 random seeds, ± the 95%-confidence interval.
C Comparison to the Naive Combination of CQL and MBPO
In this section, we stress the distinction between COMBO and a direct combination of two previous
methods CQL and MBPO (denoted as CQL + MBPO). CQL+MBPO performs Q-value regularization
using CQL while expanding the offline data with MBPO-style model rollouts. While COMBO
utilizes Q-value regularization similar to CQL, the effect is very different. CQL only penalizes the
Q-value on unseen actions on the states observed in the dataset whereas COMBO penalizes Q-values
on states generated by the learned model while maximizing Q values on state-action tuples in the
dataset. Additionally, COMBO utilizes MBPO-style model rollouts to augment the samples used
for training Q-functions.
To empirically demonstrate the consequences of this distinction, we compare COMBO to CQL + MBPO on the
generalization experiments (Section 5.1): as shown in Table 8, CQL + MBPO performs considerably worse
than COMBO. The results are averaged across 6 random seeds (± denotes the 95%-confidence interval
across runs). This suggests that carefully considering the state distribution, as done in COMBO, is crucial.