OFFLINE Q-LEARNING ON DIVERSE MULTI-TASK DATA BOTH SCALES AND GENERALIZES
Aviral Kumar1,2  Rishabh Agarwal1  Xinyang Geng2  George Tucker∗,1  Sergey Levine∗,1,2
1 Google Research, Brain Team  2 UC Berkeley
{aviralk, young.geng, svlevine}@eecs.berkeley.edu, {rishabhagarwal, gjt}@google.com
∗ Co-senior authors

ABSTRACT
The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the learnings from these works, we re-examine previous design choices and find that with appropriate choices (ResNets, cross-entropy-based distributional backups, and feature normalization), offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 80 million parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.

1 INTRODUCTION
High-capacity neural networks trained on large, diverse datasets have led to remarkable models that can solve numerous tasks, rapidly adapt to new tasks, and produce general-purpose representations in NLP and vision (Brown et al., 2020; He et al., 2021). The promise of offline RL is to leverage these advances to produce policies with broad generalization, emergent capabilities, and performance that exceeds the capabilities demonstrated in the training dataset. Thus far, the only offline RL approaches that demonstrate broadly generalizing policies and transferable representations are heavily based on supervised learning (Reed et al., 2022; Lee et al., 2022). However, these approaches are likely to perform poorly when the dataset does not contain expert trajectories (Kumar et al., 2021b). Offline Q-learning performs well across dataset compositions in a variety of simulated (Gulcehre et al., 2020; Fu et al., 2020) and real-world domains (Chebotar et al., 2021; Soares et al., 2021); however, these are largely centered around small-scale, single-task problems where broad generalization and learning general-purpose representations are not expected. Scaling these methods up to high-capacity models on large, diverse datasets is the critical challenge. Prior works hint at the difficulties: on small-scale, single-task deep RL benchmarks, scaling model capacity can lead to instabilities or degrade performance (Van Hasselt et al., 2018; Sinha et al., 2020; Ota et al., 2021), explaining why decade-old tiny 3-layer CNN architectures (Mnih et al., 2013) are still prevalent.
Moreover, works that have scaled architectures to millions of parameters (Espeholt et al., 2018; Teh et al., 2017; Vinyals et al., 2019; Schrittwieser et al., 2021) typically focus on online learning and employ many sophisticated techniques to stabilize learning, such as supervised auxiliary losses, distillation, and pre-training. Thus, it is unclear whether offline Q-learning can be scaled to high-capacity models trained on a large, diverse dataset.

Figure 1: An overview of the training and evaluation setup. Models are trained offline with potentially sub-optimal data. We adapt CQL to the multi-task setup via a multi-headed architecture (a shared ResNet encoder with separate heads for the 40 games). The pre-trained visual encoder is reused in fine-tuning (the weights are either frozen or fine-tuned), whereas the downstream fully-connected layers are reinitialized and trained.

In this paper, we demonstrate that with careful design decisions, offline Q-learning can scale to high-capacity models trained on large, diverse datasets from many tasks, leading to policies that not only generalize broadly, but also learn representations that effectively transfer to new downstream tasks and exceed the performance in the training dataset. Crucially, we make three modifications motivated by prior work in deep learning and offline RL. First, we find that a modified ResNet architecture (He et al., 2016) substantially outperforms typical deep RL architectures and follows a power-law relationship between model capacity and performance, unlike common alternatives. Second, a discretized representation of the return distribution with a distributional cross-entropy loss (Bellemare et al., 2017) substantially improves performance compared to standard Q-learning, which utilizes the mean squared error. Finally, feature normalization on the intermediate feature representations stabilizes training and prevents feature co-adaptation (Kumar et al., 2021a).
To systematically evaluate the impact of these changes on scaling and generalization, we train a single policy to play 40 Atari games (Bellemare et al., 2013; Agarwal et al., 2020), similarly to Lee et al. (2022), and evaluate performance both when the training dataset contains expert trajectories and when the data is sub-optimal. This problem is especially challenging because of the diversity of games, each with its own unique dynamics, rewards, visuals, and agent embodiment. Furthermore, the sub-optimal data setting requires the learning algorithm to "stitch together" useful segments of sub-optimal trajectories to perform well. To investigate generalization of learned representations, we evaluate offline fine-tuning to never-before-seen games and fast online adaptation on new variations of a training game (Section 5.2). With our modifications:
• Offline Q-learning learns policies that attain more than 100% human-level performance on most of these games, about 2x better than prior supervised learning (SL) approaches for learning from sub-optimal offline data (51% human-level performance).
• Akin to scaling laws in SL (Kaplan et al., 2020), offline Q-learning performance scales favorably with model capacity (Figure 6).
• Representations learned by offline Q-learning give rise to more than 80% better performance when fine-tuning on new games compared to representations learned by state-of-the-art return-conditioned supervised (Lee et al., 2022) and self-supervised methods (He et al., 2021; Oord et al., 2018).
By scaling Q-learning, we realize the promise of offline RL: learning policies that broadly generalize and exceed the capabilities demonstrated in the training dataset. We hope that this work encourages large-scale offline RL applications, especially in domains with large sub-optimal datasets.

Figure 2: Offline multi-task performance on 40 games with sub-optimal data. Left. Scaled QL significantly outperforms the previous state-of-the-art method, DT, attaining about a 2.5x performance improvement in normalized IQM score. To contextualize the absolute numbers, we include online multi-task Impala DQN (Espeholt et al., 2018) trained on 5x as much data. Right. Performance profiles (Agarwal et al., 2021) showing the distribution of normalized scores across all 40 training games (higher is better). Scaled QL stochastically dominates other offline RL algorithms and achieves superhuman performance in 40% of the games. "Behavior policy" corresponds to the score of the dataset trajectories. Online MT DQN (5X), taken directly from Lee et al. (2022), corresponds to running multi-task online RL for 5x more data with IMPALA (details in Appendix B.5).

2 RELATED WORK
Prior works have sought to train a single generalist policy to play multiple Atari games simultaneously from environment interactions, either using off-policy RL with online data collection (Espeholt et al., 2018; Hessel et al., 2019a; Song et al., 2019), or policy distillation (Teh et al., 2017; Rusu et al., 2015) from single-task policies. While our work also focuses on learning such a generalist multi-task policy, it investigates whether we can do so by scaling offline Q-learning on suboptimal offline data, analogous to how supervised learning can be scaled to large, diverse datasets. Furthermore, prior attempts to apply transfer learning using RL-learned policies in ALE (Rusu et al., 2015; Parisotto et al., 2015; Mittel & Sowmya Munukutla, 2019) are restricted to a dozen games that tend to be similar and generally require an "expert", instead of learning how to play all games concurrently. Closely related to our work, recent works train Transformers (Vaswani et al., 2017) on purely offline data for learning such a generalist policy using supervised learning (SL) approaches, namely, behavioral cloning (Reed et al., 2022) or return-conditioned behavioral cloning (Lee et al., 2022). While these works focus on large datasets containing expert or near-human performance trajectories, our work focuses on the regime where we only have access to highly diverse but sub-optimal datasets. We find that these SL approaches perform poorly with such datasets, while offline Q-learning is able to substantially extrapolate beyond dataset performance (Figure 2).
Even with near-optimal data, we observe that scaling up offline Q-learning outperforms SL approaches with 200 million parameters while using fewer than half as many network parameters (Figure 6). There has been a recent surge of offline RL algorithms that focus on mitigating distribution shift in single-task settings (Fujimoto et al., 2018; Kumar et al., 2019; Liu et al., 2020; Wu et al., 2019; Fujimoto & Gu, 2021; Siegel et al., 2020; Peng et al., 2019; Nair et al., 2020; Liu et al., 2019; Swaminathan & Joachims, 2015; Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021; Kidambi et al., 2020; Yu et al., 2020b; 2021). Complementary to such work, our work investigates scaling offline RL on the more diverse and challenging multi-task Atari setting with data from 40 different games (Agarwal et al., 2020; Lee et al., 2022). To do so, we use CQL (Kumar et al., 2020), due to its simplicity as well as its efficacy on offline RL datasets with high-dimensional observations.

3 PRELIMINARIES AND PROBLEM SETUP
We consider sequential decision-making problems (Sutton & Barto, 1998) where, on each timestep, an agent observes a state s, produces an action a, and receives a reward r. The goal of a learning algorithm is to maximize the sum of discounted rewards. Our approach is based on conservative Q-learning (CQL) (Kumar et al., 2020), an offline Q-learning algorithm. CQL uses a sum of two loss functions to combat value overestimation on unseen actions: (i) a standard TD-error that enforces Bellman consistency, and (ii) a regularizer that minimizes the Q-values for unseen actions at a given state, while maximizing the Q-value at the dataset action to counteract excessive underestimation.

Figure 3: An overview of the network architecture. The key design decisions are: (1) the use of ResNet models with learned spatial embeddings and group normalization, (2) use of a distributional representation of return values and cross-entropy TD loss for training (i.e., C51 (Bellemare et al., 2017)), and (3) feature normalization to stabilize training.

Denoting Q_θ(s, a) as the learned Q-function, the training objective for CQL is given by:

\min_{\theta}\ \alpha\left(\mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a'} \exp(Q_\theta(s, a'))\right] - \mathbb{E}_{s, a \sim \mathcal{D}}\left[Q_\theta(s, a)\right]\right) + \mathrm{TDError}(\theta; \mathcal{D}), \quad (1)

where α is the regularizer weight, which we fix to α = 0.05 based on preliminary experiments unless noted otherwise. Kumar et al. (2020) utilized a distributional TDError(θ; D) from C51 (Bellemare et al., 2017), whereas Kumar et al. (2021a) showed that similar results could be attained with the standard mean-squared TD-error. Lee et al. (2022) used the distributional formulation of CQL and found that it underperforms alternatives and that performance does not improve with model capacity. In general, there is no consensus on which formulation of TD-error should be utilized in Equation 1, and we will study this choice in our scaling experiments.
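For concreteness, the sketch below shows one way this objective can be instantiated in JAX with a mean-squared TD error. The function and batch-field names are illustrative rather than our exact implementation; the distributional (C51) variant used in our main results replaces the squared TD term with a cross-entropy loss (Section 4).

```python
import jax
import jax.numpy as jnp

def cql_loss(q_fn, params, target_params, batch, alpha=0.05, gamma=0.99):
    """Sketch of Equation 1 with a mean-squared TD error.

    q_fn(params, observations) is assumed to return a (batch, num_actions)
    array of Q-values; `batch` is assumed to hold observations, integer
    actions, rewards, discounts, and next observations as arrays.
    """
    q_all = q_fn(params, batch['observations'])                   # (B, |A|)
    q_data = jnp.take_along_axis(
        q_all, batch['actions'][:, None], axis=1)[:, 0]           # Q(s, a) at dataset actions

    # CQL regularizer: push down a soft maximum over all actions and
    # push up the Q-value of the action actually taken in the dataset.
    cql_term = jax.scipy.special.logsumexp(q_all, axis=1) - q_data

    # Standard TD error computed with a target network.
    q_next = q_fn(target_params, batch['next_observations'])      # (B, |A|)
    td_target = batch['rewards'] + gamma * batch['discounts'] * q_next.max(axis=1)
    td_error = (q_data - jax.lax.stop_gradient(td_target)) ** 2

    return jnp.mean(alpha * cql_term + 0.5 * td_error)
```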
Problem setup. Our goal is to learn a single policy that is effective at multiple Atari games and can be fine-tuned to new games. For training, we utilize the set of 40 Atari games used by Lee et al. (2022), and for each game, we utilize the experience collected in the DQN-Replay dataset (Agarwal et al., 2020) as our offline dataset. We consider two different dataset compositions:
1. Sub-optimal dataset, consisting of the initial 20% of the trajectories (10M transitions) from DQN-Replay for each game, containing 400 million transitions overall with an average human-normalized interquartile-mean (IQM) (Agarwal et al., 2021) score of 51%. Since this dataset does not contain optimal trajectories, we do not expect methods that simply copy behaviors in this dataset to perform well. On the other hand, we would expect methods that can combine useful segments of sub-optimal trajectories to perform well.
2. Near-optimal dataset, used by Lee et al. (2022), consisting of all the experience (50M transitions) encountered during training of a DQN agent, including human-level trajectories, containing 2 billion transitions with an average human-normalized IQM score of 93.5%.
Evaluation. We evaluate our method in a variety of settings as we discuss in our experiments in Section 5. Due to the excessive computational requirements of running huge models, we are only able to run our main experiments with one seed. Prior work (Lee et al., 2022) that also studied offline multi-game Atari evaluated models with only one seed. That said, to ensure that our evaluations are reliable, for reporting performance we follow the recommendations of Agarwal et al. (2021). Specifically, we report interquartile mean (IQM) normalized scores, which is the average score across the middle 50% of the games, as well as performance profiles for qualitative summarization.

Figure 4: Comparing Scaled QL to DT on all training games on the sub-optimal dataset.

4 OUR APPROACH FOR SCALING OFFLINE RL
In this section, we describe the critical modifications required to make CQL effective in learning highly-expressive policies from large, heterogeneous datasets.
Parameterization of Q-values and TD error. In the single-game setting, both mean-squared TD error and distributional TD error perform comparably online (Agarwal et al., 2021) and offline (Kumar et al., 2020; 2021a). In contrast, we observed, perhaps surprisingly, that mean-squared TD error does not scale well, and performs much worse than using a categorical distributional representation of return values (Bellemare et al., 2017) when we train on many Atari games. We hypothesize that this is because, even with reward clipping, Q-values for different games often span different ranges, and training a single network with shared parameters to accurately predict all of them presents challenges pertaining to gradient interference across different games (Hessel et al., 2019b; Yu et al., 2020a). While prior works have proposed to use adaptive normalization schemes (Hessel et al., 2019b; Kurin et al., 2022), preliminary experiments with these approaches were not effective at closing the gap.
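To make the categorical alternative concrete, the sketch below shows a C51-style cross-entropy TD loss of the kind we use in place of the squared error. The projection follows Bellemare et al. (2017), and the support matches the [-20, 20] range with 51 atoms described in Appendix B.2; the array names and exact interface are illustrative.

```python
import jax
import jax.numpy as jnp

NUM_ATOMS = 51
ATOMS = jnp.linspace(-20.0, 20.0, NUM_ATOMS)   # support used in our runs (Appendix B.2)

def project_distribution(next_probs, rewards, discounts, atoms=ATOMS):
    """Projects r + gamma * Z(s', a*) back onto the fixed support (C51-style).

    next_probs: (B, num_atoms) target-network return distribution of the
    greedy next action; rewards, discounts: (B,) arrays.
    """
    tz = jnp.clip(rewards[:, None] + discounts[:, None] * atoms[None, :],
                  atoms[0], atoms[-1])                       # (B, num_atoms)
    b = (tz - atoms[0]) / (atoms[1] - atoms[0])              # fractional atom index
    lower, upper = jnp.floor(b), jnp.ceil(b)
    # Split each source atom's mass between its two neighboring target atoms;
    # if it lands exactly on an atom, all of the mass stays there.
    exact = (lower == upper).astype(next_probs.dtype)
    m_lower = next_probs * (upper - b + exact)
    m_upper = next_probs * (b - lower)
    lower_onehot = jax.nn.one_hot(lower.astype(jnp.int32), atoms.shape[0])
    upper_onehot = jax.nn.one_hot(upper.astype(jnp.int32), atoms.shape[0])
    return (m_lower[..., None] * lower_onehot +
            m_upper[..., None] * upper_onehot).sum(axis=1)   # (B, num_atoms)

def categorical_td_loss(logits, target_probs):
    """Cross-entropy between the projected target and the predicted distribution."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)          # logits: (B, num_atoms)
    return -jnp.mean(jnp.sum(jax.lax.stop_gradient(target_probs) * log_probs, axis=-1))
```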
Q-function architecture. Since large neural networks have been crucial for scaling to large, diverse datasets in NLP and vision (e.g., Tan & Le, 2019; Brown et al., 2020; Kaplan et al., 2020), we explore using bigger architectures for scaling offline Q-learning. We use standard feature extractor backbones from vision, namely, the Impala-CNN architectures (Espeholt et al., 2018) that are fairly standard in deep RL, and ResNet 34, 50 and 101 models from the ResNet family (He et al., 2016). We make modifications to these networks following recommendations from prior work (Anonymous, 2022): we utilize group normalization instead of batch normalization in ResNets, and utilize point-wise multiplication with a learned spatial embedding when converting the output feature map of the vision backbone into a flattened vector that is fed into the feed-forward part of the Q-function.
To handle the multi-task setting, we use a multi-headed architecture where the Q-network outputs values for each game separately. The architecture uses a shared encoder and feed-forward layers, with separate linear projection layers for each game (Figure 3). The training objective (Eq. 1) is computed using the Q-values for the game that the transition originates from. In principle, explicitly injecting the task identifier may be unnecessary, and its impact could be investigated in future work.
Feature Normalization via DR3 (Kumar et al., 2021a). While the previous modifications lead to significant improvements over naïve CQL, our preliminary experiments on a subset of games still did not attain good performance. In the single-task setting, Kumar et al. (2021a) propose a regularizer that stabilizes training and allows the network to better use capacity; however, it introduces an additional hyperparameter to tune. Motivated by this approach, we regularize the magnitude of the learned features of the observation by introducing a "normalization" layer in the Q-network. This layer forces the learned features to have an ℓ2 norm of 1 by construction, and we found that this speeds up learning, resulting in better performance. We present an ablation study analyzing this choice in Table 2. We found this sufficient to achieve strong performance; however, we leave exploring alternative feature normalization schemes to future work.
To summarize, the primary modifications that enable us to scale CQL are: (1) use of large ResNets with learned spatial embeddings and group normalization, (2) use of a distributional representation of return values and cross-entropy loss for training (i.e., C51 (Bellemare et al., 2017)), and (3) feature normalization at intermediate layers to prevent feature co-adaptation, motivated by Kumar et al. (2021a). For brevity, we call our approach Scaled Q-learning.
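As a minimal illustration of how these pieces fit together, the Flax-style sketch below converts the backbone's output feature map into per-game return-distribution logits. Layer sizes follow Appendix B.1, while the module and parameter names (and the small epsilon in the normalization) are illustrative.

```python
import flax.linen as nn
import jax.numpy as jnp

class MultiGameQHead(nn.Module):
    """Maps a backbone feature map to per-game Q-value (C51) logits.

    Combines (1) a learned spatial embedding in place of mean pooling,
    (2) the l2 feature-normalization layer, and (3) separate linear heads
    for each game, here implemented as one wide layer reshaped per game.
    """
    num_games: int
    num_actions: int
    num_atoms: int = 51

    @nn.compact
    def __call__(self, feature_map):              # (B, H, W, C) from the ResNet
        # (1) Point-wise multiply with a learned spatial embedding, then flatten.
        spatial_emb = self.param('spatial_embedding',
                                 nn.initializers.normal(0.01),
                                 feature_map.shape[1:])
        x = (feature_map * spatial_emb).reshape(feature_map.shape[0], -1)

        x = nn.LayerNorm()(nn.Dense(2048)(x))
        for _ in range(3):
            x = nn.relu(nn.Dense(1024)(x))

        # (2) Feature normalization: unit l2 norm; gradients flow through it.
        x = x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-6)

        # (3) Per-game heads producing logits over the return atoms (C51).
        out = nn.Dense(self.num_games * self.num_actions * self.num_atoms)(x)
        return out.reshape(-1, self.num_games, self.num_actions, self.num_atoms)
```

During training, only the head corresponding to the game a transition originates from contributes to the objective in Equation 1.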
5 EXPERIMENTAL EVALUATION
In our experiments, we study how our approach, scaled Q-learning, can simultaneously learn from sub-optimal and optimal data collected from 40 different Atari games. We compare the resulting multi-task policies to behavior cloning (BC) with the same architecture as scaled QL, and the prior state-of-the-art method based on decision transformers (DT) (Chen et al., 2021), which utilizes return-conditioned supervised learning with large transformers (Lee et al., 2022) and has been previously proposed for addressing this task. We also study the efficacy of the multi-task initialization produced by scaled Q-learning in facilitating rapid transfer to new games via both offline and online fine-tuning, in comparison to state-of-the-art self-supervised representation learning methods and other prior approaches.
Our goal is to answer the following questions: (1) How do our proposed design decisions impact performance scaling with high-capacity models? (2) Can scaled QL more effectively leverage higher model capacity compared to naïve instantiations of Q-learning? (3) Do the representations learned by scaled QL transfer to new games? We will answer these questions in detail through multiple experiments in the coming sections, but we will first summarize our main results below.
Main empirical findings. Our main results are summarized in Figures 2 and 5. These figures show the performance of scaled QL, multi-game decision transformers (Lee et al., 2022) (marked as "DT"), a prior method based on supervised learning via return conditioning, and standard behavioral cloning baselines (marked as "BC") in the two settings discussed previously, where we must learn from: (i) near-optimal data, and (ii) sub-optimal data obtained from the initial 20% segment of the replay buffer (see Section 3 for the problem setup). See Figure 4 for a direct comparison between scaled QL and DT.
In the more challenging sub-optimal data setting, scaled QL attains a performance of 77.8% IQM human-normalized score, although trajectories in the sub-optimal training dataset only attain 51% IQM human-normalized score. Scaled QL also outperforms the prior DT approach by 2.5 times on this dataset, even though the DT model has more than twice as many parameters and uses data augmentation, compared to scaled QL.

Figure 5: Offline scaled conservative Q-learning vs other prior methods with near-optimal data and sub-optimal data. Scaled QL outperforms the best DT model, attaining an IQM human-normalized score of 114.1% on the near-optimal data and 77.8% on the sub-optimal data, compared to 111.8% and 30.6% for DT, respectively.

In the second setting with near-optimal data, where the training dataset already contains expert trajectories, scaled QL with 80M parameters still outperforms the DT approach with 200M parameters, although the gap in performance is small (3% in IQM performance, and 20% on median performance). Overall, these results show that scaled QL is an effective approach for learning from large multi-task datasets, for a variety of data compositions including sub-optimal datasets, where we must stitch useful segments of sub-optimal trajectories to perform well, and near-optimal datasets, where we should attempt to mimic the best behavior in the offline dataset. To the best of our knowledge, these results represent the largest performance improvement over the average performance in the offline dataset on such a challenging problem. We will now present experiments that show that offline Q-learning scales and generalizes.

5.1 DOES OFFLINE Q-LEARNING SCALE FAVORABLY?
One of the primary goals of this paper was to understand if scaled Q-learning is able to leverage the benefit of higher-capacity architectures. Recently, Lee et al. (2022) found that the performance of CQL with the IMPALA architecture does not improve with larger model sizes and may even degrade.
To verify if scaled Q-learning can address this limitation, we compare our value-based offline RL approach with a variety of model families: (a) the IMPALA family (Espeholt et al., 2018): three IMPALA models with varying widths (4, 8, 16) whose performance numbers are taken directly from Lee et al. (2022) (and were consistent with our preliminary experiments); (b) ResNet 34, 50, 101 and 152 from the ResNet family, modified to include group normalization and learned spatial embeddings. These architectures include both small and large networks, spanning a wide range from 1M to 100M parameters. As a point of reference, we use the scaling trends of the multi-game decision transformer and BC transformer approaches from Lee et al. (2022).

Figure 6: Scaling trends for offline Q-learning. Observe that while the performance of scaled QL instantiated with IMPALA architectures (Espeholt et al., 2018) degrades as we increase model size, the performance of scaled QL utilizing the ResNets described in Section 4 continues to increase as model capacity increases. This is true for both an MSE-style TD error as well as for the categorical TD error used by C51 (which performs better on an absolute scale). The CQL + IMPALA performance numbers are from Lee et al. (2022).

Figure 7: Offline fine-tuning performance on unseen games trained with 1% of the held-out game's data, measured in terms of DQN-normalized score, following Lee et al. (2022). On average, pre-training with scaled QL outperforms other methods by 82%. Furthermore, scaled QL improves over scaled QL (scratch) by 45%, indicating that the representations learned by scaled QL during multi-game pre-training are useful for transfer. Self-supervised representation learning (CPC, MAE) alone does not attain good fine-tuning performance.

Observe in Figure 6 that the performance of scaled Q-learning improves as the underlying Q-function model size grows. Even though the standard mean-squared error formulation of TD error results in worse absolute performance than C51 (blue vs orange), for both of these versions, the performance of scaled Q-learning increases as the models become larger. This result indicates that value-based offline RL methods can scale favorably, and give rise to better results, but this requires carefully picking a model family. This also explains the findings from Lee et al. (2022): while this prior work observed that CQL with IMPALA scaled poorly as model size increases, they also observed that the performance of return-conditioned RL instantiated with IMPALA architectures also degraded with higher model sizes. Combined with the results in Figure 6 above, this suggests that the poor scaling properties of offline RL can largely be attributed to the choice of IMPALA architectures, which may not work well in general even for supervised learning methods (like return-conditioned BC).

5.2 CAN OFFLINE RL LEARN USEFUL INITIALIZATIONS THAT ENABLE FINE-TUNING?
Next, we study how multi-task training on multiple games via scaled QL can learn general-purpose representations that enable rapid fine-tuning to new games. We study this question in two scenarios: fine-tuning to a new game via offline RL with a small amount of held-out data (1% uniformly subsampled datasets from DQN-Replay (Agarwal et al., 2020)), and fine-tuning to a new game mode via sample-efficient online RL initialized from our multi-game offline Q-function. For fine-tuning, we transfer the weights from the visual encoder and reinitialize the downstream feed-forward component (Figure 1). For both of these scenarios, we utilize a ResNet 101 Q-function trained via the methodology in Section 4, using C51 and feature normalization.
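As a rough illustration, this initialization amounts to a simple parameter surgery, sketched below under the assumption that parameters are stored as a flat dictionary of arrays whose names distinguish the encoder from the downstream layers; the key names are illustrative.

```python
def init_finetune_params(pretrained_params, fresh_params):
    """Builds the fine-tuning initialization depicted in Figure 1.

    Both arguments are assumed to be flat dicts mapping parameter names
    (e.g. 'encoder/block3/conv2', 'head/linear') to arrays: entries under
    'encoder/' are copied from multi-game pre-training, while the downstream
    feed-forward layers and the new game's head keep their fresh random values.
    """
    return {
        name: (pretrained_params[name] if name.startswith('encoder/') else value)
        for name, value in fresh_params.items()
    }
```

When the encoder is kept frozen instead of fine-tuned, the same split determines which entries receive gradient updates.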
Scenario 1 (Offline fine-tuning): First, we present the results for fine-tuning in an offline setting: following the protocol from Lee et al. (2022), we use the pre-trained representations to rapidly learn a policy for a novel game using limited offline data (1% of the experience of an online DQN run). In Figure 7, we present our results for offline fine-tuning on 5 games from Lee et al. (2022) (ALIEN, MS PACMAN, SPACE INVADERS, STAR GUNNER and PONG), alongside the prior approach based on decision transformers ("DT (pre-trained)"), and fine-tuning using pre-trained representations learned from state-of-the-art self-supervised representation learning methods such as contrastive predictive coding (CPC) (Oord et al., 2018) and masked autoencoders (MAE) (He et al., 2021). For CPC performance, we use the baseline reported in Lee et al. (2022). MAE is a more recent self-supervised approach that we find generally outperformed CPC in this comparison. For MAE, we first pre-trained an 80M-parameter vision transformer (ViT-Base) (Dosovitskiy et al., 2020) encoder via a reconstruction loss on observations from the multi-game Atari dataset and froze the encoder weights, as done in prior work (Xiao et al.). Then, with this frozen visual encoder, we used the same feed-forward architecture, Q-function parameterization, and training objective (CQL with C51) as scaled QL to fine-tune the MAE network. We also compare to baseline methods that do not utilize any multi-game pre-training (DT (scratch) and Scaled QL (scratch)).
Results. Observe in Figure 7 that multi-game pre-training via scaled QL leads to the best fine-tuning performance and improves over prior methods, including decision transformers trained from scratch. Importantly, we observe positive transfer to new games via scaled QL. Prior works (Badia et al., 2020) running multi-game Atari (primarily in the online setting) have generally observed negative transfer across Atari games. We show for the first time that pre-trained representations from Q-learning enable positive transfer to novel games that significantly outperforms return-conditioned supervised learning methods and dedicated representation learning approaches.
Scenario 2 (Online fine-tuning): Next, we study the efficacy of the learned representations in enabling online fine-tuning. While deep RL agents on ALE are typically trained on default game modes (referred to as m0d0), we utilize new variants of the ALE games designed to be challenging for humans (Machado et al., 2018; Farebrother et al., 2018) for online fine-tuning. We investigate whether multi-task training on the 40 default game variants can enable fast online adaptation to these never-before-seen variants.
In contrast to offline fine-tuning (Scenario 1), this setting tests whether scaled QL can also provide a good initialization for online data collection and learning, for closely related but different tasks. Following Farebrother et al. (2018), we use the same variants investigated in this prior work: BREAKOUT, HERO, and FREEWAY, which we visualize in Figure 8 (left). To disentangle the performance gains from multi-game pre-training and the choice of Q-function architecture, we compare to a baseline approach ("scaled QL (scratch)") that utilizes an identical Q-function architecture as pre-trained scaled QL, but starts from a random initialization. As before, we also evaluate fine-tuning performance using the representations obtained via masked auto-encoder pre-training (He et al., 2021; Xiao et al.). We also compare to single-game DQN performance attained after training for 50M steps, 16× more transitions than what is allowed for scaled QL, as reported by Farebrother et al. (2018).
Results. Observe in Figure 8 that fine-tuning from the multi-task initialization learned by scaled QL significantly outperforms training from scratch as well as the single-game DQN run trained with 16x more data. Fine-tuning with the frozen representations learned by MAE performs poorly, which we hypothesize is due to differences in game dynamics and subtle changes in observations, which must be accurately accounted for in order to learn optimal behavior (Dean et al., 2022). Our results confirm that offline Q-learning can both effectively benefit from higher-capacity models and learn multi-task initializations that enable sample-efficient transfer to new games.

Figure 8: Online fine-tuning results on unseen game variants. Left. The top row shows default variants and the bottom row shows unseen variants evaluated for transfer: Freeway's mode 1 adds buses, more vehicles, and increases velocity; Hero's mode 1 starts the agent at level 5; Breakout's mode 12 hides all bricks unless the ball has recently collided with a brick. Right.
We fine-tune all methods except single-game DQN for 3M online frames (as we wish to test fast online adaptation). Error bars show minimum and maximum scores across 2 runs, while the bar shows their average. Observe that scaled QL significantly outperforms learning from scratch and single-game DQN with 50M online frames. Furthermore, scaled QL also outperforms RL fine-tuning on representations learned using masked auto-encoders. See Figure A.1 for learning curves.

5.3 ABLATION STUDIES
Finally, in this section we perform controlled ablation studies to understand how crucial the design decisions introduced in Section 4 are for the success of scaled Q-learning. In particular, we will attempt to understand the benefits of using C51 and feature normalization.
MSE vs C51: We ran scaled Q-learning with identical network architectures (ResNet 50 and ResNet 101) using the conventional squared-error formulation of TD error, and compare it to C51, which our main results utilize. Observe in Table 1 that C51 leads to much better performance for both ResNet 50 and ResNet 101 models. The boost in performance is the largest for ResNet 101, where C51 improves by over 39% as measured by median human-normalized score. This observation is surprising since prior work (Agarwal et al., 2021) has shown that C51 performs on par with standard DQN with an Adam optimizer, which all of our results use. One hypothesis is that this could be the case as the TD gradients would depend on the scale of the reward function, and hence some games would likely exhibit a stronger contribution to the gradient. This is despite the fact that our implementation of the MSE TD-error already attempts to correct for this issue by applying the unitary scaling technique from Kurin et al. (2022) to standardize reward scales across games. That said, we still observe that C51 performs significantly better.

Table 1: Performance of Scaled QL with the standard mean-squared TD-error and C51 in the offline 40-game setting, aggregated by the median human-normalized score. Observe that for both ResNet 50 and ResNet 101, utilizing C51 leads to a drastic improvement in performance.
             Scaled QL (ResNet 50)    Scaled QL (ResNet 101)
with MSE     41.1%                    59.5%
with C51     53.5% (+12.4%)           98.9% (+39.4%)

Importance of feature normalization: We ran small-scale experiments with and without feature normalization (Section 4). In these experiments, we consider a multi-game setting with only 6 games: ASTERIX, BREAKOUT, PONG, QBERT, SEAQUEST, and SPACE INVADERS, and we train with the initial 20% data for each game. We report the aggregated median human-normalized score across the 6 games in Table 2 for three different network architectures (ResNet 34, ResNet 50 and ResNet 101). Observe that the addition of feature normalization significantly improves performance for all the models. Motivated by this initial empirical finding, we used feature normalization in all of our main experiments.
To summarize, these ablation studies validate the efficacy of the two key design decisions introduced in this paper. However, there are several avenues for future investigation: 1) it is unclear if C51 works better because of the distributional formulation or the categorical representation, and experiments with other distributional formulations could answer this question; 2) we did not extensively try alternate feature normalization schemes, which may improve results.

Table 2: Performance of Scaled QL with and without feature normalization in the 6-game setting, reported in terms of the median human-normalized score. Observe that with models of all sizes, the addition of feature normalization improves performance.
                                Scaled QL (ResNet 34)    Scaled QL (ResNet 50)    Scaled QL (ResNet 101)
without feature normalization   50.9%                    73.9%                    80.4%
with feature normalization      78.0% (+28.9%)           83.5% (+9.6%)            98.0% (+17.6%)

Additional ablations: We also conducted ablation studies for the choice of the backbone architecture (learned spatial embeddings) in Appendix A.3, and observed that utilizing spatial embeddings is better. We also evaluated the performance of scaled QL without conservatism to test the importance of utilizing pessimism in our setting with diverse data in Appendix A.4, and observe that pessimism is crucial for attaining good performance on average. We also provide some scaling studies for another offline RL method (discrete BCQ) in Appendix A.2.

6 DISCUSSION
This work shows, for the first time (to the best of our knowledge), that offline Q-learning can scale to high-capacity models trained on large, diverse datasets. As we hoped, by scaling up model capacity, we unlocked analogous trends to those observed in vision and NLP.
We found that scaled Q-learning trains policies that exceed the average dataset performance and prior methods, especially when the dataset does not already contain expert trajectories. Furthermore, by training a large-capacity model on a diverse set of tasks, we show that Q-learning alone is sufficient to recover general-purpose representations that enable rapid learning of novel tasks.
Although we detailed an approach that is sufficient to scale Q-learning, this is by no means optimal. The scale of the experiments limited the number of alternatives we could explore, and we expect that future work will greatly improve performance. Given the strong performance of transformers, we suspect that offline Q-learning with a transformer architecture is a promising future direction. For example, contrary to DT (Lee et al., 2022), we did not use data augmentation in our experiments, which we believe can provide significant benefits. While we made a preliminary attempt to perform online fine-tuning on an entirely new game (SPACE INVADERS), we found that this did not work well for any of the pre-trained representations (see Figure A.1). Addressing this is an important direction for future work. We speculate that this challenge is related to designing methods for learning better exploration from offline data, which is not required for offline fine-tuning. Another important avenue for future work is to scale offline Q-learning on other RL domains such as robotic navigation, manipulation, locomotion, education, etc. This would require building large-scale tasks, and we believe that scaled QL would provide a good starting point for scaling in these domains. Finally, in line with Agarwal et al. (2022), we will release our pretrained models, which we hope will enable subsequent methods to build upon them.

AUTHOR CONTRIBUTIONS
AK conceived and led the project, developed scaled QL, and decided on and ran most of the experiments. RA discussed the experiment design and project direction, helped set up and debug the training pipeline, and took the lead on setting up and running the MAE baseline and the online fine-tuning experiments. XG helped with design choices for some experiments. GT advised the project and ran baseline DT experiments. SL advised the project and provided valuable suggestions. AK, RA, GT, and SL all contributed to writing and editing the paper.

ACKNOWLEDGEMENTS
We thank several members of the Google Brain team for their help, support and feedback on this paper. We thank Dale Schuurmans, Dibya Ghosh, Ross Goroshin, Marc Bellemare and Aleksandra Faust for informative discussions. We thank Sherry Yang, Ofir Nachum, and Kuang-Huei Lee for help with the multi-game decision transformer codebase, and Anurag Arnab for help with the Scenic ViT codebase. We thank Zoubin Ghahramani and Douglas Eck for leadership support.

REFERENCES
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34:29304–29320, 2021.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. NeurIPS, 2022.
Anonymous. Pre-training for robots: Leverage diverse multitask data via offline RL.
Under submission to ICLR, 2022.
A Badia, B Piot, S Kapturowski, P Sprechmann, A Vitvitskyi, D Guo, and C Blundell. Agent57: Outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of PMLR, 2020.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449–458. PMLR, 2017.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, and Sergey Levine. Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv preprint arXiv:2104.07749, 2021.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
Victoria Dean, Daniel Kenji Toyama, and Doina Precup. Don't freeze your embedding: Lessons from policy finetuning in environment transfer. In ICLR Workshop on Agent Learning in Open-Endedness, 2022.
Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, and Yi Tay. Scenic: A JAX library for computer vision research and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21393–21398, 2022.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 2018.
Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2020.
Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018.
Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. RL Unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3796–3803, 2019a.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019b.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774–5783. PMLR, 2021.
Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11761–11771, 2019.
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.
Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, and Sergey Levine. DR3: Value-based deep reinforcement learning requires explicit regularization. arXiv preprint arXiv:2112.04716, 2021a.
Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. Should I run offline reinforcement learning or behavioral cloning? In International Conference on Learning Representations, 2021b.
Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, and M Pawan Kumar. In defense of the unitary scalarization for deep multi-task learning. arXiv preprint arXiv:2201.04122, 2022.
Kuang-Huei Lee, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Winnie Xu, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski, et al. Multi-game decision transformers. arXiv preprint arXiv:2205.15241, 2022.
Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. CoRR, abs/1904.08473, 2019.
Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202, 2020.
Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
Akshita Mittel and Purna Sowmya Munukutla. Visual transfer between Atari games using competitive reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, 2013.
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Kei Ota, Devesh K Jha, and Asako Kanezaki. Training larger networks for deep reinforcement learning. arXiv preprint arXiv:2102.07920, 2021.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, and David Silver. Online and offline reinforcement learning by planning with a learned model. Advances in Neural Information Processing Systems, 34:27580–27591, 2021.
Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396, 2020.
Samarth Sinha, Homanga Bharadhwaj, Aravind Srinivas, and Animesh Garg. D2RL: Deep dense architectures in reinforcement learning. arXiv preprint arXiv:2010.09163, 2020.
Douglas W Soares, Acordo Certo, Telma Lima, and Deep Learning Brazil. PulseRL: Enabling offline reinforcement learning for digital marketing systems via conservative Q-learning. 2021.
H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 16:1731–1755, 2015.
Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. PMLR, 2019.
Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. arXiv preprint arXiv:1707.04175, 2017.
Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173.
Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. arXiv preprint arXiv:2001.06782, 2020a.
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. arXiv:2005.13239, 2020b.
Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. arXiv:2102.08363, 2021.

Appendices

A ADDITIONAL RESULTS
A.1 ADDITIONAL RESULTS FROM THE PAPER
Figure A.1: Learning curves for online fine-tuning on unseen game variants. The dotted horizontal line shows the performance of a single-game DQN agent trained for 50M frames (16x more data than our methods). See Figure 8 for visualization of the variants.
Figure A.2: Offline scaled conservative Q-learning vs other prior methods with near-optimal data. Scaled QL outperforms the best DT model, attaining an IQM human-normalized score of 114.1% and a median human-normalized score of 98.9%, compared to 111.8% and 78.2% for DT, respectively.

A.2 RESULTS FOR SCALING DISCRETE-BCQ
To implement discrete BCQ, we followed the official implementation from Fujimoto et al. (2019). We first trained a model of the behavior policy, π̂_β(a|s), using an architecture identical to that of the Q-function, via negative log-likelihood. Then, following Fujimoto et al. (2019), we updated the Bellman backup to only perform the maximization over actions that attain a high likelihood under the probabilities learned by the behavior policy, as shown below:

y(s, a) := r(s, a) + \gamma \max_{a' : \hat{\pi}_\beta(a' \mid s') \geq \tau \cdot \max_{a''} \hat{\pi}_\beta(a'' \mid s')} \bar{Q}(s', a'),

where τ is a hyperparameter.
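A minimal sketch of this filtered backup for discrete actions is given below; the array names are illustrative, with q_next coming from the target Q-network Q̄ and behavior_probs from the learned π̂_β.

```python
import jax.numpy as jnp

def discrete_bcq_target(rewards, discounts, q_next, behavior_probs, tau=0.3):
    """Discrete-BCQ backup: maximize only over actions whose behavior-policy
    probability is at least tau times that of the most likely action.

    q_next, behavior_probs: (batch, num_actions) arrays evaluated at s'.
    """
    allowed = behavior_probs >= tau * behavior_probs.max(axis=1, keepdims=True)
    masked_q = jnp.where(allowed, q_next, -jnp.inf)   # mask out unlikely actions
    return rewards + discounts * masked_q.max(axis=1)
```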
To tune the value of τ, we ran a preliminary sweep over τ ∈ {0.05, 0.1, 0.3}. When using C51 in our setup, we had to use a smaller CQL α of 0.05 (instead of 0.1 for the MSE setting from Kumar et al. (2021a)), possibly because the discrete representation of Q-values used by C51 is less prone to overestimation. Therefore, in the case of discrete BCQ, we chose to perform an initial sweep over τ values that were smaller than or equal to (i.e., less conservative than) the value of τ = 0.3 used in Fujimoto et al. (2019). Since BCQ requires an additional policy network, it imposes a substantial memory overhead and, as such, we performed a sweep only for the initial 20 iterations to pick the best τ. We found that in these initial experiments, τ = 0.05 performed significantly worse, but τ = 0.1 and τ = 0.3 performed similarly. So, we utilized τ = 0.3 for reporting these results.
We ran these scaling experiments with ResNet 34 and ResNet 50 in the six-game setting and report human-normalized IQM performance after 100 epochs (6.25M gradient steps) in Figure A.3. We also present the results for CQL alongside for comparison. Observe that we find favorable scaling trends for BCQ: average performance over all games increases as the network size increases, indicating that other offline RL algorithms, such as BCQ, can scale as we increase network capacity.

Figure A.3: Performance of scaling CQL and BCQ in terms of IQM human-normalized score. We perform this comparison on the six-game setting for 100 epochs (note that these results are after 2x longer training than other ablations in Table 2). Observe that for discrete BCQ the performance improves from ResNet 34 to ResNet 50, indicating that it does scale favorably as network capacity increases.

A.3 ABLATION FOR BACKBONE ARCHITECTURE
In this section, we present some results ablating the choice of the backbone architecture. For this ablation, we ablate the choice of the spatial embedding while keeping group normalization fixed in both cases. We perform this study for the 40-game setting. Observe that using the learned spatial embedding results in better performance, and improves scores in 27 out of 40 games compared to not using the learned embeddings.

Table A.1: Ablations for the backbone architecture in the 40-game setting with ResNet 101. Observe that using learned spatial embeddings leads to around an 80% improvement in performance.
                                       Scaled QL without learned spatial embeddings    Scaled QL with learned spatial embeddings
Median human-normalized score          54.9%                                            98.9%
IQM human-normalized score             68.9%                                            114.1%
Num. games with better performance     13 / 40                                          27 / 40

Regarding the choice of group normalization vs batch normalization, note that we have been operating in a setting where the batch size per device / core is only 4. In particular, we use Cloud TPU v3 accelerators with 64 / 128 cores, and batch sizes bigger than 4 do not fit in memory, especially for larger-capacity ResNets. This means that if we utilized batch normalization, we would be computing batch statistics over only 4 elements, which is known to be unstable even for standard computer vision tasks; for example, see Figure 1 in Wu & He (2018).

A.4 RESULTS FOR SCALED QL WITHOUT PESSIMISM
In Table A.2, we present the results of running scaled Q-learning with no conservatism, i.e., by setting the value of α in Equation 1 to 0.0, in the six-game setting.
A.4 Results for Scaled QL without Pessimism

In Table A.2, we present the results of running scaled Q-learning with no conservatism, i.e., by setting the value of α in Equation 1 to 0.0, in the six-game setting. We utilize the entire DQN-replay dataset (Agarwal et al., 2020) for each of these six games, as would be present in the full 40-game dataset, to preserve the per-game dataset diversity.

Observe that while scaled QL without conservatism does still learn, its performance is notably worse than that of standard scaled QL. Interestingly, on Asterix the performance without pessimism is better than the performance with pessimism, whereas the use of pessimism on Space Invaders and Seaquest leads to at least a 2x improvement in performance.

Table A.2: Performance of scaled QL with and without conservatism in terms of IQM human-normalized score in the six-game setting for 100 epochs (2x longer training compared to other ablations in Table 2), performed with a ResNet 50. Observe that utilizing conservatism via CQL is beneficial. We also present per-game raw scores in this table. Observe that while in one game no pessimism with such data can outperform CQL, we find that overall, conservatism performs better.

                              Scaled QL without CQL   Scaled QL w/ CQL
Asterix                       38000                   35200
Breakout                      322                     410
Pong                          12.6                    19.8
Qbert                         13800                   15500
Seaquest                      1378                    3694
Space Invaders                1675                    3819
IQM human-normalized score    188.3%                  223.4%

We also present some results without pessimism in the complete 40-game setting in Table A.3. Unlike the smaller six-game setting, we find a much larger difference between no pessimism (without CQL) and utilizing pessimism via CQL. In particular, we find that in 6 games, not using pessimism leads to slightly better performance, but this strategy hurts in all other games, giving rise to an agent that performs worse than random in many of these 34 games. This indicates that pessimism is especially desirable as the diversity of tasks increases.

Table A.3: Scaled QL with and without conservatism in terms of IQM human-normalized score in the 40-game setting with ResNet 101. Observe that utilizing conservatism via CQL is still beneficial.

                                      Scaled QL without CQL   Scaled QL w/ CQL
Median human-normalized score         11.1%                   98.9%
IQM human-normalized score            13.5%                   114.1%
Num. games with better performance    6 / 40                  34 / 40

B Implementation Details and Hyper-parameters

In this section, we describe the implementation details of our approach, including the network architectures, feature normalization, and our training and evaluation protocols.

B.1 Network Architecture

In our primary experiments, we consider variants of ResNet architectures for scaled Q-learning. The vision backbone in these architectures mimics the corresponding ResNet architectures from He et al. (2016); however, we utilize group normalization (Wu & He, 2018) (with a group size of 4) instead of batch normalization, and instead of applying global mean pooling to aggregate the outputs of the ResNet, we utilize learned spatial embeddings (Anonymous, 2022), which learn a matrix that point-wise multiplies the output feature map of the ResNet. The output volume is then flattened and passed as input to the feed-forward part of the network. The feed-forward part of the network begins with a layer of size 2048, followed by layer normalization. After this, we apply three feed-forward layers with hidden dimension 1024 and ReLU activations to obtain the representation of the image observation.
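The following PyTorch sketch illustrates how these pieces compose: a ResNet-style feature map is aggregated via a learned spatial embedding (a learned matrix that point-wise multiplies the feature map before flattening, in place of global mean pooling) and then fed through the 2048/1024-unit feed-forward trunk. The tensor shapes and the exact placement of activations around the layer norm are assumptions for illustration, not our exact training code.

```python
import torch
import torch.nn as nn

class LearnedSpatialEmbedding(nn.Module):
    """Replaces global mean pooling: point-wise multiply the backbone's output
    feature map by a learned (C, H, W) matrix, then flatten."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.embedding = nn.Parameter(torch.randn(channels, height, width) * 0.02)

    def forward(self, feature_map):            # [B, C, H, W]
        return (feature_map * self.embedding).flatten(start_dim=1)  # [B, C*H*W]

class FeedForwardTrunk(nn.Module):
    """A 2048-unit layer with layer norm, followed by three 1024-unit ReLU layers."""
    def __init__(self, input_dim):
        super().__init__()
        layers = [nn.Linear(input_dim, 2048), nn.LayerNorm(2048), nn.ReLU()]
        dim = 2048
        for _ in range(3):
            layers += [nn.Linear(dim, 1024), nn.ReLU()]
            dim = 1024
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)                     # [B, 1024] observation representation

# Example with an assumed backbone output of 512 channels on an 11x11 grid.
features = torch.randn(32, 512, 11, 11)
pooled = LearnedSpatialEmbedding(512, 11, 11)(features)
representation = FeedForwardTrunk(pooled.shape[1])(pooled)
print(representation.shape)  # torch.Size([32, 1024])
```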
Then, we apply feature normalization to this representation via a normalization layer that divides the representation of a given observation by its ℓ2 norm. Note that we do pass gradients through this normalization term. The normalized representation is then passed into separate heads that predict the Q-values. The total number of heads is equal to the number of games we train on. Each head consists of a linear layer that maps the 1024-dimensional normalized representation to a vector of K elements, where K = |A| (i.e., the size of the action space) for the standard real-valued parameterization of Q-values, and K = |A| × 51 for C51. The network does not apply any output activation in either case, and the Q-values are treated as logits for C51.

B.2 Details of C51

For the main results in the paper, we utilize C51. The main hyperparameter in C51 is the support of the Q-value distribution. Unlike Bellemare et al. (2017), which utilizes a support of [−10, 10], we utilize a support of [−20, 20] to allow additional flexibility for CQL: applying the CQL regularizer can underestimate or overestimate Q-values, and this additional flexibility aids such scenarios. We still utilize only 51 atoms in our support, and the average dataset Q-value in our training runs is generally much smaller, around 8–9.

B.3 Training and Evaluation Protocols and Hyperparameters

We utilize the initial 20% (sub-optimal) and 100% (near-optimal) datasets from Agarwal et al. (2020) for our experiments. These datasets are generated from runs of standard online DQN on stochastic-dynamics Atari environments with sticky actions, i.e., there is a 25% chance at every time step that the environment executes the agent's previous action again, instead of the newly commanded action.

The majority of the training details are identical to a typical run of offline RL on single-game Atari; we discuss the key differences below. We trained our ResNet 101 network for 10M gradient steps with a batch size of 512; the agent has not yet converged, and its performance is still improving gradually. When training on multiple games, we utilize a stratified batch sampling scheme with a total batch size of 512. To obtain the batch at any given training iteration, we first sample 128 game indices from the set of all games (40 games in our experiments) with replacement, and then sample 4 transitions from each sampled game. This scheme does not necessarily produce an equal number of transitions from every game in a given batch, but it does ensure that all games are seen in expectation throughout training. Since our batch size is 16 times larger than the standard batch size of 32 on Atari, we scale up the learning rate from 5e-05 to 0.0002, but keep the target network update period fixed to the same value of 1 target update per 2000 gradient steps as in single-task Atari. We also utilize n-step returns with n = 3 by default, for both our MSE and C51 runs.
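The stratified sampling scheme above is straightforward to express in code. The sketch below is a minimal illustration that assumes per-game replay buffers exposed as in-memory lists of transitions; the buffer representation and function name are illustrative, not our actual data pipeline.

```python
import random

def sample_multigame_batch(per_game_buffers, games_per_batch=128, transitions_per_game=4):
    """Stratified batch sampling: draw game indices with replacement, then a few
    transitions from each drawn game (128 x 4 = 512 transitions per batch).
    Every game is sampled in expectation, though not necessarily in every batch."""
    batch = []
    game_ids = random.choices(range(len(per_game_buffers)), k=games_per_batch)
    for g in game_ids:
        batch.extend(random.sample(per_game_buffers[g], k=transitions_per_game))
    return batch
```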
Evaluation Protocol. Even though we train on Atari datasets with sticky actions, we evaluate on Atari environments that do not enable sticky actions, following the protocol from Lee et al. (2022). This allows us to be comparable to this prior work in all of our comparisons, without needing to re-train their model, which would have been too computationally expensive. Following standard protocols on Atari, we evaluate a noised version of the policy with an ε-greedy scheme, with ε_eval = 0.001. Following the protocol in Castro et al. (2018), we compute the average return over 125K evaluation steps.

B.4 Fine-Tuning Protocol

For offline fine-tuning, we fine-tuned the parameters of the pre-trained policy on the new domain using a batch size of 32 and hyperparameters identical to those used during pre-training. We utilized α = 0.05 for fine-tuning, but with the default learning rate of 5e-05 (since the batch size was the default 32). We attempted other CQL α values, {0.07, 0.02, 0.1}, for fine-tuning, but found that retaining the pre-training value of α = 0.05 worked best. For reporting results, we report the performance of the algorithm at the end of 300K gradient steps.

For online fine-tuning, we use the C51 algorithm (Bellemare et al., 2017), with n-step = 3 and all other hyperparameters from the C51 implementation in the Dopamine library (Castro et al., 2018). We swept over two learning rates, {1e-05, 5e-05}, for all the methods and picked the best learning rate per game for each method. For the MAE implementation, we used the Scenic library (Dehghani et al., 2022) with the typical configuration used for ImageNet pretraining, except that we use 84 × 84 × 4 sized Atari observations instead of images of size 224 × 224 × 3. We train the MAE for 2 epochs on the entire multi-task offline Atari dataset and observe that the reconstruction loss plateaus to a low value.

Table B.1: Hyperparameters used by multi-game training. Here we report the key hyperparameters used by multi-game training. The differences from the standard single-game training setup are highlighted in red.

Hyperparameter                               Setting (for both variations)
Sticky actions (eval)                        No
Grey-scaling                                 True
Observation down-sampling                    (84, 84)
Frames stacked                               4
Frame skip (action repetitions)              4
Reward clipping                              [-1, 1]
Terminal condition                           Game Over
Max frames per episode                       108K
Discount factor                              0.99
Mini-batch size                              512
Target network update period                 every 2000 updates
Training environment steps per iteration     62.5k
Update period                                every 1 environment step
Evaluation ε                                 0.001
Evaluation steps per iteration               125K
Learning rate                                0.0002
n-step returns (n)                           3
CQL regularizer weight α                     0.1 for MSE, 0.05 for C51

B.5 Details of Multi-Task Impala DQN

The "MT Impala DQN" comparison in Figures 2 & 5 is a multi-task implementation of online DQN, evaluated at 5x as many gradient steps as the size of the sub-optimal dataset. This comparison is taken directly from Lee et al. (2022). Briefly, this baseline runs C51 in conjunction with n-step returns with n = 4, using an IMPALA architecture with three blocks of 64, 128, and 128 channels. It was trained with a batch size of 128 and a target update period of 256.

C Raw Training Scores for Different Models

Table C.1: Raw scores on 40 training Atari games in the sub-optimal multi-task Atari dataset (51% human-normalized IQM). Scaled QL uses the ResNet 101 architecture.
Game              DT (200M)   DT (40M)    Scaled QL (80M)   BC (80M)    MT Impala-DQN*   Human
Amidar            72.9        82.2        33.1              14.5        629.8            1719.5
Assault           392.9       124.7       1380.8            1060.0      1338.7           742.0
Asterix           1518.8      2256.2      9967.3            745.3       2949.1           8503.3
Atlantis          10525.0     13125.0     485200.0          2494.1      976030.4         29028.1
BankHeist         13.1        15.6        18.6              87.6        1069.6           753.1
BattleZone        3750.0      7687.5      8500.0            1550.0      26235.2          37187.5
BeamRider         1535.8      1397.5      5856.5            327.2       1524.8           16926.5
Boxing            71.4        74.2        95.2              95.4        68.3             12.1
Breakout          38.8        38.2        351.1             274.7       32.6             30.5
Carnival          993.8       791.2       199.3             792.7       2021.2           3800.0
Centipede         2645.4      3026.9      2711.4            2260.8      4848.0           12017.0
ChopperCommand    1006.2      1093.8      752.2             336.7       951.4            7387.8
CrazyClimber      85487.5     86050.0     122933.3          121394.4    146362.5         35829.4
DemonAttack       2269.7      1049.4      14229.8           765.8       446.8            1971.0
DoubleDunk        -14.5       -20.2       -12.4             -13.6       -156.2           -16.4
Enduro            336.5       266.2       2297.6            638.7       896.3            860.5
FishingDerby      15.9        16.8        13.7              -88.1       -152.3           -38.7
Freeway           16.2        20.5        24.4              0.1         30.6             29.6
Frostbite         1014.4      776.2       2324.5            234.8       2748.4           4334.7
Gopher            1137.5      1251.2      1041.0            231.5       3205.6           2412.5
Gravitar          237.5       193.8       260.3             248.8       492.5            3351.4
Hero              6741.2      6295.3      4011.9            7485.8      26568.8          30826.4
IceHockey         -8.8        -11.1       -3.7              -10.8       -10.4            0.9
Jamesbond         378.1       312.5       58.7              7.1         264.6            302.8
Kangaroo          1975.0      2687.5      5796.6            307.1       7997.1           3035.0
Krull             6913.8      4377.5      9333.7            9585.3      8221.4           2665.5
KungFuMaster      17575.0     14743.8     24320.0           15778.6     29383.1          22736.3
NameThisGame      4396.9      4502.5      6759.6            2756.8      6548.8           8049.0
Phoenix           3560.0      2813.8      12770.0           762.9       3932.5           7242.6
Pooyan            1053.8      1394.7      1264.5            718.7       4000.0           4000.0
Qbert             8371.9      5917.2      14877.9           5759.6      4226.5           13455.0
Riverraid         6191.9      4265.6      9602.7            6657.2      7306.6           17118.0
Robotank          14.9        12.8        17.4              5.7         9.2              11.9
Seaquest          781.9       512.5       1021.8            113.9       1415.2           42054.7
TimePilot         2512.5      2700.0      767.3             3841.1      -883.1           5229.2
UpNDown           5288.8      5456.2      35541.3           8395.2      8167.6           11693.2
VideoPinball      1277.4      1953.1      40.0              2650.3      85351.0          17667.9
WizardOfWor       237.5       881.2       107.0             495.3       975.9            4756.5
YarsRevenge       11867.4     10436.8     11482.4           17755.5     18889.5          54576.9
Zaxxon            287.5       337.5       1.4               0.0         -0.1             9173.3
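The aggregate numbers quoted alongside these tables (median and IQM human-normalized scores) are computed from raw per-game scores like the ones above. As a reference, here is a minimal sketch of that aggregation; the random-agent scores needed for normalization are not listed in these tables, so the reference values in the usage example are placeholders, not real data.

```python
import numpy as np

def human_normalized(score, random_score, human_score):
    """Standard Atari normalization: 0 corresponds to a random agent, 1 to a human."""
    return (score - random_score) / (human_score - random_score)

def interquartile_mean(scores):
    """IQM: mean of the middle 50% of scores, discarding the lowest and highest 25%
    (assuming here, for simplicity, that the number of scores is divisible by four)."""
    scores = np.sort(np.asarray(scores, dtype=np.float64))
    n = len(scores)
    return scores[n // 4 : n - n // 4].mean()

# Illustrative usage with placeholder random/human reference values.
normalized = [human_normalized(s, 0.0, 100.0) for s in [10.0, 40.0, 55.0, 90.0]]
print(interquartile_mean(normalized))  # mean of the middle two values: 0.475
```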
Table C.2: Raw scores on 40 training Atari games in the near-optimal multi-task Atari dataset. Scaled QL uses the ResNet 101 architecture.

Game              DT (200M)   DT (40M)    BC (200M)   MT Impala-DQN*   Scaled QL (80M)   Human
Amidar            101.5       1703.8      101.0       629.8            21.0              1719.5
Assault           2385.9      1772.2      1872.1      1338.7           3809.6            742.0
Asterix           14706.3     4575.0      5162.5      2949.1           34278.9           8503.3
Atlantis          3105342.3   304931.2    4237.5      976030.4         881980.0          29028.1
BankHeist         5.0         40.0        63.1        1069.6           33.9              753.1
BattleZone        17687.5     17250.0     9250.0      26235.2          8812.5            37187.5
BeamRider         8560.5      3225.5      4948.4      1524.8           10301.0           16926.5
Boxing            95.1        92.1        90.9        68.3             99.5              12.1
Breakout          290.6       160.0       185.6       32.6             415.0             30.5
Carnival          2213.8      3786.9      2986.9      2021.2           926.1             3800.0
Centipede         2463.0      2867.5      2262.8      4848.0           3168.2            12017.0
ChopperCommand    4268.8      3337.5      1800.0      951.4            832.2             7387.8
CrazyClimber      126018.8    113425.0    123350.0    146362.5         140500.0          35829.4
DemonAttack       23768.4     3629.4      7870.6      446.8            56318.3           1971.0
DoubleDunk        -10.6       -12.5       -1.5        -156.2           -13.1             -16.4
Enduro            1092.6      770.8       793.2       896.3            2345.8            860.5
FishingDerby      11.8        19.2        5.6         -152.3           23.8              -38.7
Freeway           30.4        32.8        29.8        30.6             31.9              29.6
Frostbite         2435.6      934.4       782.5       2748.4           3566.4            4334.7
Gopher            9935.0      3827.5      3496.2      3205.6           3776.9            2412.5
Gravitar          59.4        75.0        12.5        492.5            262.3             3351.4
Hero              20408.8     19667.2     13850.0     26568.8          20470.6           30826.4
IceHockey         -10.1       -5.2        -8.3        -10.4            -1.5              0.9
Jamesbond         700.0       712.5       431.2       264.6            483.6             302.8
Kangaroo          12700.0     11581.2     12143.8     7997.1           2738.6            3035.0
Krull             8685.6      8295.6      8058.8      8221.4           10176.9           2665.5
KungFuMaster      15562.5     16387.5     4362.5      29383.1          25808.3           22736.3
NameThisGame      9056.9      7777.5      7241.9      6548.8           11647.0           8049.0
Phoenix           5295.6      4744.4      4326.9      3932.5           5264.0            7242.6
Pooyan            2859.1      1191.9      1677.2      4000.0           2020.1            4000.0
Qbert             13734.4     12534.4     11276.6     4226.5           15946.0           13455.0
Riverraid         14755.6     11330.6     9816.2      7306.6           18494.8           17118.0
Robotank          63.2        50.9        44.6        9.2              53.2              11.9
Seaquest          5173.8      3112.5      1175.6      1415.2           414.1             42054.7
TimePilot         2743.8      3487.5      1312.5      -883.1           4220.5            5229.2
UpNDown           16291.3     9306.9      10454.4     8167.6           55512.9           11693.2
VideoPinball      1007.7      9671.4      1140.8      85351.0          285.7             17667.9
WizardOfWor       187.5       687.5       443.8       975.9            301.6             4756.5
YarsRevenge       28897.9     25306.3     20738.9     18889.5          24393.9           54576.9
Zaxxon            275.0       4637.5      50.0        -0.1             2.1               9173.3