Value-Based Deep RL Scales Predictably

Oleh Rybkin^1, Michal Nauman^1,2, Preston Fu^1, Charlie Snell^1, Pieter Abbeel^1, Sergey Levine^1 and Aviral Kumar^3
^1 University of California, Berkeley; ^2 University of Warsaw; ^3 Carnegie Mellon University
arXiv:2502.04327v1 [cs.LG] 6 Feb 2025

Figure 1: Scaling properties when increasing compute C, data D, budget F, or performance J. Panels: (I) Compute-Data Pareto frontier, (II) Budget extrapolation, (III) Fits for multiple J; results are shown for DMC, OpenAI Gym, and Isaac Gym. Left: Compute versus data requirements Pareto frontier controlled by the UTD ratio σ. We observe that we can trade off data for compute and vice versa, and this relationship is predictable. Middle: Extrapolation from low to high performance. We observe that the optimal resource allocation controlled by σ evolves predictably with increasing budget, and can be used to extrapolate from low to high performance. Right: Pareto frontiers for several performance levels J.

Abstract: Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that the data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict the data requirement when given more compute, and the compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling is enabled by first estimating predictable relationships between hyperparameters, which are used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.

Corresponding author(s): oleh.rybkin@gmail.com, aviralku@andrew.cmu.edu

1. Introduction

Many of the latest advances in various areas of machine learning have emerged from training big models on large datasets. In this scaling-guided research landscape, successfully executing even a single training run often requires a large amount of data, computational resources, and wall-clock time, such as weeks or months (Achiam et al., 2023; Team et al., 2023; Ramesh et al., 2022; Brooks et al., 2024). To maximize the success of these large-scale runs, the trend in the machine learning (ML) community has shifted toward not just performant, but also more predictable algorithms that scale reliably with more computation and training data, such that downstream performance can be predicted from small-scale experiments, without actually running the large-scale experiment (McCandlish et al., 2018; Kaplan et al., 2020; Hoffmann et al., 2022; Dubey et al., 2024). In this paper, we study whether deep reinforcement learning (RL) is also amenable to such scaling and predictability benefits.
We focus on value-based methods that train value functions using temporal difference (TD) learning, which are known to be performant at small scales, especially in dense-reward environments (Mnih et al., 2015; Lillicrap et al., 2015; Haarnoja et al., 2018a). Compared to policy gradient (Mnih, 2016; Schulman et al., 2017) and search methods (Silver et al., 2016), value-based RL can learn from arbitrary data and requires less sampling or search, which can be inefficient or infeasible for open-world problems where environment interaction is costly. We study scaling properties by predicting relationships between the different resources required for training. The data requirement D is the amount of data needed to attain a certain level of performance. Likewise, the compute requirement C refers to the amount of FLOPs or gradient steps needed to attain a certain level of performance. Uniquely in RL, performance can be improved by increasing either available data or compute (e.g., training multiple times on the same data), which we capture via a budget requirement that combines data and compute, F = C + δ · D, where δ is a constant multiplier. An additive budget function is representative of practical scenarios where the cost of data and compute can be expressed in similar units, such as wall-clock time or required finances.

To establish scaling relationships, we first require a way to predict the best hyperparameter settings at each scale. We find that the learning rate η, batch size B, and the updates-to-data (UTD) ratio σ are the most crucial hyperparameters for value-based RL. While supervised learning benefits from abundant theory for establishing optimal hyperparameters (Krizhevsky, 2014; McCandlish et al., 2018; Yang et al., 2022), value-based RL often does not satisfy assumptions typical of supervised learning. For example, value-based RL needs to account for the non-i.i.d. nature of training data. Distribution shift due to periodic changes in the data collection policy (Levine et al., 2020) contributes to a form of overfitting where minimizing training TD error may not result in a low TD error under the data distribution induced by the new policy. In addition, objective shift due to changing target values (Dabney et al., 2020) contributes to "plasticity loss" (D'Oro et al., 2022; Kumar et al., 2021a). We show that it is possible to account for the training dynamics unique to value-based RL, and that we can find the best hyperparameters by setting the batch size and learning rate inversely proportional to the UTD ratio. We estimate this dependency using a power law (Kaplan et al., 2020), and observe that this model makes effective predictions.

Using the best predicted hyperparameters, we are able to establish that data and compute requirements evolve as a predictable function of the UTD ratio σ. Furthermore, σ defines the tradeoff between data and compute, which can be visualized as a Pareto frontier (Figure 1, left). Using this model, we are able to extrapolate the resource requirements from a low-compute to a high-compute setting, as well as from a low-data to a high-data setting, as shown in the figure.

Using the Pareto frontiers, we are then able to extrapolate from low to high performance levels. Instead of extrapolating as a function of return, which can be arbitrary and non-smooth, we extrapolate as a function of the allowed budget F.
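To make these quantities concrete, the data requirement D_J, compute requirement C_J, and budget F can be read off directly from a training curve. The sketch below is a minimal illustration in Python, not the paper's exact pipeline: the running-maximum smoothing stands in for the isotonic-regression preprocessing described later (Appendix D), and the constants N (parameter count), B, σ, and δ are made-up example values.

```python
import numpy as np

def resource_requirements(env_steps, returns, J0, N, B, sigma, delta):
    """Estimate D_J, C_J (cf. Eq. 4.2), and the budget F = C + delta * D for one run.

    env_steps, returns: training curve (environment steps vs. evaluation return).
    J0: target return; N: parameter count; B: batch size; sigma: UTD ratio.
    """
    # Monotone smoothing (a stand-in for the isotonic regression used in the paper).
    smoothed = np.maximum.accumulate(np.asarray(returns, dtype=float))
    crossed = np.nonzero(smoothed >= J0)[0]
    if len(crossed) == 0:
        return None  # target return never reached
    D_J = env_steps[crossed[0]]        # data requirement (environment steps)
    C_J = 10 * N * B * sigma * D_J     # compute requirement in FLOPs (cf. Eq. 4.2)
    F = C_J + delta * D_J              # additive budget
    return D_J, C_J, F

# Toy usage with fabricated numbers.
steps = np.arange(1, 101) * 1_000
curve = 1000 * (1 - np.exp(-steps / 3e4)) + np.random.default_rng(0).normal(0, 20, 100)
print(resource_requirements(steps, curve, J0=700, N=5e6, B=256, sigma=2, delta=1e10))
```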
We can define an optimal tradeoff between data and compute, and we observe that this optimal tradeoff evolves predictably with higher budgets, which in turn attain higher performance levels (Figure 1, middle). Thus we are able to predict optimal hyperparameters, as well as data and compute allocation, for high-budget runs using only data from low-budget runs.

Our contribution is showing that the behavior of value-based deep RL methods based on TD-learning is predictable in larger data and compute regimes. Specifically, we:

1. establish predictable rules for dependencies between the hyperparameters batch size (B), learning rate (η), and UTD ratio (σ) in value-based RL, and show that these rules enable more effective scaling.
2. show that the data and compute required to attain a given performance level lie on a Pareto frontier, and are respectively predictable in the higher-compute or higher-data regimes.
3. show the optimal allocation of budget between data and compute, and predict how this allocation evolves with higher budgets for best performance.

Our findings apply to algorithms such as SAC, BRO, and PQL, and domains such as the DeepMind Control Suite (DMC), OpenAI Gym, and IsaacGym. The generality of our conclusions challenges conventional wisdom and community lore that value-based deep RL does not scale predictably.

2. RL Preliminaries and Notation

We study standard off-policy online RL, which maximizes the agent's return by training on a replay buffer and periodically collecting new data (Sutton and Barto, 2018). Value-based deep RL methods train a Q-network, Q_θ, to minimize the temporal difference (TD) error:

    L(θ) = E_{(s,a,s') ∼ P, a' ∼ π(·|s')} [ ( r(s,a) + γ Q̄(s', a') − Q_θ(s, a) )² ],    (2.1)

where P is the replay buffer, Q̄ is the target Q-network, s denotes a state, and a' is an action drawn from a policy π(·|s') that aims to maximize Q_θ(s, a). We implement this operation by sampling a batch of size B from the buffer and taking a step along the gradient of this loss with a learning rate η.

In theory, off-policy algorithms can be made very sample efficient by minimizing the TD error fully over any data batch, which in practice translates to making more update steps to the Q-network per environment step, i.e., a higher "updates-to-data" (UTD) ratio (Chen et al., 2020). However, increasing the UTD ratio naïvely can lead to worse performance (Nikishin et al., 2022; Janner et al., 2019). To this end, unlike the standard supervised learning or LLM literature that considers B and η as the two main hyperparameters affecting training (Kaplan et al., 2020; Hoffmann et al., 2022), our setting presents another hyperparameter, the UTD ratio σ, that we also study in our paper.

Notation. In this paper, we focus on the following key hyperparameters: the UTD ratio σ, the learning rate η, and the batch size B. We will answer questions pertaining to the performance of a policy π, denoted J(π), the total data utilized by an algorithm to reach a given target level of performance J (denoted D_J), and the total compute utilized by the algorithm to reach performance J (denoted C_J), which is measured in terms of FLOPs or wall-clock time taken by the algorithm.
3. Problem Statement and Formulation

To demonstrate that the behavior of value-based RL can be predicted reliably at scale, we first pose multiple resource optimization questions that guide our scaling study. Viewing data and compute as two resources, we answer questions of the form: what is the minimum value of [resource] needed to attain a given target performance? And what should the hyperparameters (e.g., B, η, σ) be in such a training run? We answer questions of this form by fitting empirical laws from low-data and low-compute runs to determine relationships between hyperparameters. Doing so, in turn, enables us to determine how to set hyperparameters and allocate resources to maximize performance when provided with a larger data and compute budget. Note that we wish to make these hyperparameter predictions without running the large data and compute budget experiment. While questions of this form have been studied in supervised learning, answering them is different in the context of online RL, because online RL requires the algorithm to collect its own data during training, which ties data and compute together in a complex manner and breaks the i.i.d. nature of datapoints. Concretely, we study three resource optimization questions: (1) maximizing sample efficiency (i.e., minimizing the amount of data D needed to attain a given target performance under a given compute budget), (2) conversely, minimizing the compute C (e.g., FLOPs or gradient steps, whichever is more appropriate for the practitioner) needed to attain a given performance given an upper bound on the data that can be collected, and (3) maximizing performance given a total bound on data and compute.

Problem 3.1 (Resource optimization problems). Find the best configuration (B, η, σ) for algorithm Alg that minimizes either the data D or the compute C consumed to obtain performance J_0:

1. Maximal sample efficiency: (B*, η*, σ*) := argmin_{(B, η, σ)} D s.t. J(π_Alg(B, η, σ)) ≥ J_0, C ≤ C_0
2. Maximal compute efficiency: (B*, η*, σ*) := argmin_{(B, η, σ)} C s.t. J(π_Alg(B, η, σ)) ≥ J_0, D ≤ D_0

We solve these problems by fitting empirical models of the minimum data and compute needed to attain a target performance for different values of J_0. Doing so allows us to then solve the third setting (3) of maximizing performance given a total budget on data and compute, as shown below.

Problem 3.2 (Maximize performance at large data and compute budget). Find the best configuration (B, η, σ) and resource allocations for data D and compute C that enable Alg to maximize performance at budget F_0:

    (B*, η*, σ*) := argmax_{(B, η, σ)} J(π_Alg(B, η, σ)) s.t. C + δ · D ≤ F_0.

4. Scaling Results For Value-Based Deep RL

We will now present our main results addressing Problem 3.1 under the two settings discussed above. We will then use these results to present results for Problem 3.2. In order to do so, we run several experiments and estimate scaling trends from the results. Although this procedure might appear standard from scaling studies in language modeling, we found that instantiating it for value-based RL requires understanding the interaction of the various hyperparameters appearing in TD updates, and the data and compute efficiency of the algorithm.
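To make the role of these hyperparameters concrete, the schematic below sketches how σ, B, and η enter a generic off-policy training loop. This is a minimal illustration rather than any specific algorithm from the paper; the environment and agent are placeholder stubs.

```python
import random

class _StubEnv:
    """Placeholder environment (illustration only)."""
    def reset(self): return 0.0
    def step(self, action): return 0.0, 0.0, False  # next_state, reward, done

class _StubAgent:
    """Placeholder agent; td_update would take one gradient step on Eq. (2.1)."""
    def act(self, state): return 0.0
    def td_update(self, batch, learning_rate): pass

def train(env, agent, total_env_steps, sigma, batch_size, learning_rate):
    """Generic off-policy loop: on average `sigma` TD updates per environment step."""
    replay_buffer, state, update_debt = [], env.reset(), 0.0
    for _ in range(total_env_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
        update_debt += sigma  # accumulating updates also handles fractional UTD ratios (e.g., 1/1024)
        while update_debt >= 1.0:
            batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
            agent.td_update(batch, learning_rate)
            update_debt -= 1.0
    return agent

train(_StubEnv(), _StubAgent(), total_env_steps=100, sigma=2, batch_size=32, learning_rate=3e-4)
```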
We will formalize these relationships via empirically estimated laws and show that these laws extrapolate reliably to new settings not used to obtain them. Therefore, in this section, we present empirical and conceptual arguments to build functional forms of relationships between different hyperparameters. Before doing so, we provide our answers to Problems 3.1 and 3.2.

4.1. Main Scaling Results

We begin by answering Problem 3.1 where we need to maximize sample efficiency. We wish to estimate the minimal amount of data D_J needed to attain a given target performance, given an upper bound on compute C ≤ C_0. To do so, we fit the D_J needed to attain the target performance J = J_0, parameterized by the UTD ratio σ (Eq. (4.1)).

Figure 2: The data-compute tradeoff on DMC. Left: The minimum required data D_J scales with the UTD σ as a power law. Right: The minimum required compute C_J increases with the UTD σ as a sum of two power laws.

Intuitively, we would expect the minimum amount of data needed to attain a given performance to be low as more updates are made per datapoint (i.e., when σ is high), as more "value" could be derived from the same datapoint. In addition, we would expect that even for the best value of σ, there is a minimum number of datapoints D_min that are needed to learn given the "intrinsic" difficulty of the task at hand. Based on these intuitions, we hypothesize a power law relationship between D_J(σ) and σ, with an offset D_J^min and constants α_J and β_J:

    D_J(σ) ≈ D_J^min + (β_J / σ)^{α_J}.    (4.1)

Empirical fits of D_J and σ on the DMC suite are in Figure 2 and they validate the efficacy of this fit.

Scaling Observation 1: Data Requirements. The amount of data D_J needed to reach a given return target J_0 decreases as a predictable function of the UTD σ, and follows a power law (Eq. (4.1)).

We also emphasize that the existence of this power law makes D_J predictable, in that this relation is able to predict D_J for larger values of σ that fall outside the range of σ values used to obtain the fit (Figure 6).

To answer the optimization questions in Problem 3.1, we also need an expression for the compute C_J required to reach the target return. As σ determines the number of gradient steps run per data point, C_J is a function of σ. In particular, total compute is equal to the number of gradient steps taken multiplied by the parameter count of the model. Our study does not optimize over the model size and treats it as a constant. Thus, we can write the compute C_J as a function of σ as:

    C_J(σ) ≈ 10 · N · B(σ) · σ · D_J(σ),    (4.2)

where N denotes the model size, B(σ) denotes the "best choice" batch size for a given UTD value σ, and the other variables follow definitions from before. Note that the additional factor of 10 in Eq. (4.2) emerges from the use of multiple forward passes to compute the loss function for value-based RL and the backward pass through the Q-network (to contrast with language modeling, where the typical multiplier is 6; the gap in our setting comes from the use of multiple forward passes). We plot C_J(σ) for different values of σ and J = J_0 in Figure 2.
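As an illustration of Equations (4.1) and (4.2), the sketch below fits the data-requirement power law to a handful of (σ, D_J) measurements and then maps each σ to a compute estimate. It is a simplified stand-in for the paper's procedure (which uses a brute-force initialization followed by L-BFGS on a log-MSE loss, Section 5); the measurements, model size N, and batch-size law B(σ) are fabricated example values.

```python
import numpy as np
from scipy.optimize import curve_fit

# Example measurements: UTD ratios and env steps needed to reach a target return J0.
sigmas = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
data_to_J = np.array([9.1e5, 6.3e5, 4.4e5, 3.5e5, 3.1e5])  # hypothetical D_J values

def log_data_law(sigma, log_D_min, log_beta, alpha):
    """log of Eq. (4.1), D_J = D_min + (beta / sigma)**alpha, in a well-scaled parameterization."""
    return np.log(np.exp(log_D_min) + np.exp(alpha * (log_beta - np.log(sigma))))

# Fit in log space (a simple proxy for the paper's log-MSE objective).
popt, _ = curve_fit(log_data_law, sigmas, np.log(data_to_J), p0=[np.log(2e5), np.log(1e6), 1.0])
D_min, beta, alpha = np.exp(popt[0]), np.exp(popt[1]), popt[2]

def D_J(sigma):
    return D_min + (beta / sigma) ** alpha                 # Eq. (4.1)

def C_J(sigma, N=5e6, B_of_sigma=lambda s: 512 * s ** -0.5):
    """Eq. (4.2): FLOPs to reach J0, given model size N and an assumed batch-size law B(sigma)."""
    return 10 * N * B_of_sigma(sigma) * sigma * D_J(sigma)

for s in [1.0, 4.0, 16.0]:  # sigma = 16 extrapolates beyond the fitted range
    print(f"sigma={s:5.1f}  D_J={D_J(s):.3g}  C_J={C_J(s):.3g}")
```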
Since D_J(σ) is not a constant and itself depends on σ, this particular relationship between C_J(σ) and σ is not a simple power law, unlike Eq. (4.1). Instead, our derivation in Eq. (A.4) shows that C_J(σ) is given by a sum of two different power laws in σ. Similarly to D_J, we also observe that the compute utilized is a predictable function of σ: we are able to accurately estimate the compute at larger values of σ using the relationship in Eq. (4.2).

Scaling Observation 2: Compute Requirements. The compute C_J needed to attain a given return target J_0 increases as a predictable function of the UTD ratio σ, and is a sum of two power laws (Eq. (4.2)).

We observe that both the required compute and data are controlled by the UTD ratio σ, which allows us to define a tradeoff between compute and data controlled by σ. We plot this tradeoff as a curve with compute C_J(σ) on the x-axis and D_J(σ) on the y-axis in Figure 1 (left). Further, as D_J(σ) is a monotonically decreasing function of σ, this curve defines a Pareto frontier: we can move left on the curve to increase data efficiency at the expense of compute, and move right to increase compute efficiency at the expense of data. Interestingly, because the compute law is a sum of two power laws, in many environments there is a minimum σ after which compute efficiency no longer improves, as seen on OpenAI Gym in Figure 1.

Solving for maximal data efficiency (Problem 3.1, (1)). We can now solve Problem 3.1 in setting (1). Our strategy is to find the largest σ (say σ_max) that satisfies the compute constraint C_J(σ) ≤ C_0, and then plug this σ_max into D_J(σ) to obtain the data estimate. This approach enables us to express D_J directly as a function of the available compute C_0, as we calculate using Eq. (4.2). This can be visualized as finding the value D_J corresponding to some value C_0 on the Pareto frontier (Figure 1, left).

Solving for maximal compute efficiency (Problem 3.1, (2)). Likewise, the solution in setting (2) can be obtained by finding the smallest value of σ in the range that satisfies the data constraint D_J(σ) ≤ D_0, and computing the corresponding value of C_J(σ). This can similarly be visualized on the Pareto frontier (Figure 1, left). We summarize our observations in terms of the following takeaway.

Solving Problem 3.1: Defining the Compute-Data Pareto Frontier. The UTD ratio σ defines a Pareto frontier between data and compute requirements, and estimating this frontier yields predictable solutions to the resource optimization problems in settings (1) and (2). Theoretically, the optimal D_J* for an available compute budget C_0 is:

    D_J*(C_0) ≈ C_0 · (10 · N · B(σ*) · σ*)^{−1}.    (4.3)

The optimal C_J* for a given data budget D_0 is:

    C_J*(D_0) ≈ 10 · N · B(σ*) · σ* · D_0.    (4.4)

Above, σ* denotes the minimizing UTD value. Calculation details are in Appendix A.

Maximize return within a budget (Problem 3.2). Finally, we tackle Problem 3.2 in order to extrapolate from low to high return. Here, we do not want to minimize resources, but rather want to maximize performance within a given total "budget" on data and compute. As discussed in Section 3, we consider budget functions linear in both data and compute, i.e., F = C + δ · D, for a given constant δ.
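As a concrete illustration of how such a budget interacts with the fitted frontier, the sketch below scans a family of Pareto frontiers (one per candidate target J_0, each parameterized by σ), finds the cheapest point under F = C + δ · D for each, and reports the highest target affordable at a budget F_0 together with its σ; this is the selection procedure detailed next. All constants, the fitted coefficients, and the helper law B(σ) are hypothetical examples, not the paper's fits.

```python
import numpy as np

N, delta = 5e6, 1e10                    # model size and data-vs-compute cost multiplier (examples)
B = lambda sigma: 512 * sigma ** -0.5   # assumed best-batch-size law in the style of Eq. (4.6)

# Hypothetical fitted constants (D_min, beta, alpha) of Eq. (4.1) for a few target returns J0.
fits = {400: (1.5e5, 4e6, 0.8), 600: (2.5e5, 1e7, 0.8), 800: (4.0e5, 3e7, 0.8)}

def D_J(J0, sigma):
    D_min, beta, alpha = fits[J0]
    return D_min + (beta / sigma) ** alpha                 # Eq. (4.1)

def C_J(J0, sigma):
    return 10 * N * B(sigma) * sigma * D_J(J0, sigma)      # Eq. (4.2)

def best_within_budget(F0, sigma_grid=np.geomspace(0.25, 32, 200)):
    """Largest reachable J0 under budget F0, with the sigma (and budget) that achieves it."""
    best = None
    for J0 in sorted(fits):
        budgets = C_J(J0, sigma_grid) + delta * D_J(J0, sigma_grid)  # F(sigma) along the frontier
        i = int(np.argmin(budgets))
        if budgets[i] <= F0:
            best = (J0, float(sigma_grid[i]), float(budgets[i]))
    return best

print(best_within_budget(F0=3e16))
```

Repeating this selection over a range of budgets yields the (F_0, σ*) pairs to which the power law in Eq. (4.5) below is fit.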
Our estimated Pareto frontier in Eq. (4.4) will enable answering this question. To do so, we turn to directly predicting a good UTD value σ*. This UTD value is one that not only leads to maximal performance, but also stays within the total resource budget F_0. Once the UTD value has been identified, it prescribes a concrete way to partition the total resource budget into good data and compute requirements using the solutions to Problem 3.1.

Figure 3: Visualization of the solution to Problem 3.2. Several Pareto frontiers (Figure 1, left) are shown, together with lines of iso-budget F, which define optimal budget points (D*, C*). The corresponding optimal UTD ratios σ* are a predictable function of the budgets F_0; the trend line is shown dashed.

We plot the data-compute Pareto frontiers for multiple values of J_0 in Figure 3 and in Figure 1 (right), and find that these curves move diagonally to the top-right for larger J_0. Intersecting these curves with iso-budget frontiers over D and C prescribed by the budget function gives us the largest possible J_0 for which there is still a (D, C) pair that falls just within the budget F_0 but attains performance J_0 (see Figure 3 for a worked-out version of this procedure). Since both D and C are explained by σ, we can associate this point with a given σ value. Hence, we can estimate the best value σ*(F_0) for a given budget threshold F_0. Concretely, we observe a power law between σ*(F_0) and F_0, with constants β_σ and α_σ:

    σ*(F_0) ≈ (β_σ / F_0)^{α_σ}.    (4.5)

Solving Problem 3.2: Maximize return given a total data and compute budget. The best UTD value σ that leads to maximal J is a predictable function of the budget F_0 over data and compute; this relationship follows a power law and also extrapolates to large budgets. This relationship produces the optimal σ, and as a result, the optimal data and compute allocations to reliably attain maximum performance.

As shown in Figure 1, estimating this law from low-budget experiments is sufficient for predicting good σ values for large-budget runs. These predicted σ*(F_0) values extrapolate reliably to budgets outside the range used to fit this law (as shown by × in Figure 1). This concludes the exposition of our main results.

4.2. Fitting Relationships Between (B, η, σ)

To arrive at the scaling law fits above, we had to set the hyperparameters B and η, which we empirically observed to be important. We fit these hyperparameters as a function of σ, the only variable appearing in many of the scaling relationships discussed above. In this section, we describe how to estimate good values of B and η in terms of σ. Our analysis here relies crucially on behavior of TD-learning that is distinct from supervised learning, where the UTD ratio σ does not exist.

To understand the relationships between the batch size B, learning rate η, and the UTD ratio σ, we ran an extensive grid search. We first attempted to explain the relationship between the B and η values that attain the highest data efficiency (denoted B*, η*) using the standard heuristic in supervised learning: when the batch size is smaller than the critical batch size, B and η are inversely correlated with each other (McCandlish et al., 2018).
However, as shown in Figure 5 (right), we find that without including the UTD ratio σ, the best B* and η* exhibit very weak correlation. Further, the critical batch size (McCandlish et al., 2018) does not correlate with the empirically best batch size, as we show in Appendix E. Instead, surprisingly, we observe a strong correlation between B* and σ, as well as between η* and σ. Since B* and η* exhibit near-zero correlation among themselves, we can simply omit their dependency on each other and opt for modeling them independently as functions of the UTD ratio σ. We conceptually explain the relationships between B* and σ, and between η* and σ, below and show that models developed from this understanding enable us to reliably predict good values of B and η, allowing us to fully answer Problem 3.1.

Figure 4: Hyperparameter effects in supervised learning and TD learning on DMC. Panels: (I) Hyperparameter choice for SL vs. RL, (II) Effect of the UTD ratio σ, (III) Effect of B and η. Top: Overfitting increases with UTD, while batch size can be used to counteract it. Bottom: Higher UTD leads to poor training dynamics and plasticity loss (D'Oro et al., 2022); lower learning rates can be used to counteract it. While these relationships are not perfectly predictable, we use them to inform our design choices.

Predicting the best choice of B in terms of σ. Our proposed functional form for the best batch size B* takes the form of a power law in σ, which we also empirically validate in Figure 5 (left). We posit this form because, intuitively, large batch sizes increase the risk of overfitting, as they lead to repetitive training on a fixed set of data. Furthermore, a small training loss on the distribution of data in the buffer does not necessarily reflect the behavior policy distribution of a learning agent (Levine et al., 2020). This means that minimizing the training loss to a large extent can result in poor test performance J(π), as also seen in prior work (Li et al., 2023a; Nauman et al., 2024a). One way to counteract this form of "overfitting" from a high UTD value σ is to instead reduce the batch size of the run so that the training process sees a given sample fewer times. In fact, for a fixed UTD value σ, we empirically validate the hypothesis that a lower B leads to substantially reduced overfitting on several tasks in Figure 4. Hence, we posit an inverse relationship between the best batch size B* and the UTD value σ. We show in Figure 5 that this inverse relationship can indeed be estimated well by a power law, given formally as:

    B*(σ) ≈ (β_B / σ)^{α_B}.    (4.6)

Figure 5: Left, middle: Fitting the best learning rate η* and batch size B* given the UTD σ on DMC. Modeling the dependency on σ is crucial to obtain good hyperparameters, whereas using a constant B and η, as is commonly done, leads to poor extrapolation. Right: the best learning rate and batch size are not significantly correlated, a major difference from supervised learning.

Predicting the best choice of learning rate η as a function of σ. Next, we turn to understanding the relationship between η and σ.
We start from a simple observation: a very large σ typically leads to worse performance not only due to overfitting but also due to plasticity loss (Kumar et al., 2021a; D'Oro et al., 2022; Lyle et al., 2023), defined broadly as the inability of the value network to fit TD targets appearing later in training. Prior work states that plasticity loss is inherently related to the number of gradient steps performed and claims that larger norms of the parameters of the Q-network are indicative of plasticity loss (D'Oro et al., 2022; Lyle et al., 2023). We would expect a larger learning rate to make higher-magnitude updates against the same TD target, and hence move parameters to a state that suffers from difficulty in fitting subsequent targets (Dabney et al., 2021; Lee et al., 2024). As shown in Figure 4, the parameter norm indeed increases with a high learning rate. Therefore, given a UTD value σ, we hypothesize that the best choice of learning rate η*(σ) for a given performance should scale inversely in σ. Empirically, we observe that this is indeed the case (Figure 5 (middle)), and we model this relationship as:

    η*(σ) ≈ (β_η / σ)^{α_η}.    (4.7)

Scaling Observation 3: Hyperparameter Selection. The best choices for the batch size and learning rate are predictable functions of the UTD σ, and both of these relationships follow a power law.

4.3. Empirical Workflow for Obtaining Fits

Having presented solutions to Problems 3.1 and 3.2, we now present the workflow we utilize to estimate these empirical fits. Further details are in Section 5 and Appendix D. This workflow can serve as a useful skeleton for scaling law studies with other value-based algorithms as well.

Our Workflow for Fitting Empirical Relationships
1. Run a sweep over batch size B and learning rate η for several values of the UTD σ. Since the batch size and learning rate are independent for the best σ, we can run these sweeps independently.
2. Estimate empirically the best batch size B̃ and learning rate η̃, with statistical bootstrapping.
3. Fit B*(σ) and η*(σ) on B̃, η̃ according to Equations (4.6) and (4.7).
4. Using the found fits B*(σ), η*(σ), run different values of σ that cover a range spanning an order of magnitude; we use 16×, i.e., σ_max/σ_min > 16.
5. Fit D_J(σ) according to Eq. (4.1).
6. Using fits of D_J(σ) for different values of J_0, fit σ*(F_0) according to Eq. (4.5).
7. Optimal hyperparameters can now be extrapolated to larger data, larger compute, or larger budget settings according to Problem 3.1.

Figure 6: Extrapolation towards unseen values of σ on OpenAI Gym. Left: Pareto frontier extrapolation towards the higher-data regime. Middle: Pareto frontier extrapolation towards the higher-compute regime. Right: Comparison of the best-performing hyperparameters (red) for σ = 2 to hyperparameters predicted via our proposed workflow (blue).
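Step 2 of the workflow selects the empirically best batch size and learning rate per UTD value. A minimal sketch of one way to do this with bootstrapped confidence intervals is shown below; the seed-resampling scheme and all numbers are our own illustration, and the paper's exact uncertainty-adjusted selection rule is the one described in Appendix D.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean_ci(samples, n_boot=2000, alpha=0.05):
    """Bootstrap confidence interval for the mean data-to-threshold across seeds."""
    samples = np.asarray(samples, dtype=float)
    means = [rng.choice(samples, size=len(samples), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return samples.mean(), lo, hi

def select_best(candidates):
    """Pick the candidate hyperparameter value with the lowest mean D_J, reporting its CI."""
    stats = {value: bootstrap_mean_ci(runs) for value, runs in candidates.items()}
    best = min(stats, key=lambda v: stats[v][0])
    return best, stats

# Hypothetical per-seed data-to-threshold (env steps) for three batch sizes at one UTD value.
candidates = {128: [5.1e5, 4.8e5, 5.6e5], 256: [4.3e5, 4.9e5, 4.4e5], 512: [5.9e5, 6.4e5, 6.1e5]}
best_B, stats = select_best(candidates)
print("best batch size:", best_B, "CI:", stats[best_B])
```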
4.4. Evaluating Extrapolation

Evaluating budget extrapolation. Results on all environments are shown in Figure 1 (middle). We estimate several Pareto frontiers corresponding to points with equal changes in budget. We perform the σ*(F_0) fit while holding out the two largest budgets. The quality of our fit for these two extrapolated budgets can be seen in the figure.

Evaluating Pareto frontier extrapolation. Results on OpenAI Gym are shown in Figure 6. We fit the data efficiency equation D_J(σ) (Eq. (4.1)) while holding out either the two UTD values σ with the largest data requirement (left) or the two σ values with the largest compute requirement (right). The quality of our fit for these two extrapolated σ values can be seen in the figure.

Hyperparameter fit extrapolation. Results on OpenAI Gym are shown in Figure 6 (right). We plot the data efficiency fit when using hyperparameters according to our found dependency B*(σ), η*(σ) (shown in olive). These fits are estimated from σ = 1, ..., 8 and extrapolated to σ = 0.5. We compare to the typical approach of tuning hyperparameters in online RL, where hyperparameters are tuned for one setting of σ = 2 and this setting is used for all UTD values (shown in blue). We see that our proposed hyperparameter fits improve results for values other than σ = 2. Further, this improvement is larger for larger values of σ, showing that accounting for the hyperparameter dependency is critical.

5. Experimental Details

Experimental Setup. We focus on 12 tasks from 3 domains in our study. On OpenAI Gym (Brockman et al., 2016), we use Soft Actor-Critic, a commonly used TD-learning algorithm (Haarnoja et al., 2018b). We first run a sweep on 5 values of η, then a grid of runs with 4 values of σ and 3 values of B, and then use the hyperparameter fits to run 2 more values of σ with 8 seeds per task. To test our approach with larger models, we use DMC (Tassa et al., 2018), where we utilize the state-of-the-art Bigger, Regularized, Optimistic (BRO) algorithm (Nauman et al., 2024b) that uses a larger and more modern architecture. We first run 5 values of B, 4 values of η, and 4 values of σ, and then use the hyperparameter fits to run 2 more values of σ, with 10 seeds per task. Finally, we test our approach with more data on IsaacGym (Makoviychuk et al., 2021), where we use the Parallel Q-Learning (PQL) algorithm (Li et al., 2023b), which was designed to leverage massively parallel simulation like Isaac Gym that can quickly produce billions of environment samples. Because of computational expense, we only run one IsaacGym task. We first run 4 values of σ, 3 values of η, as well as 5 values of B, with 5 seeds per task, after which we run a second round of grid search with 7 values of σ. Further details are in Appendices B and D and Table 3.

Fitting Functional Forms for Scaling Laws. We approximate Eq. (4.1) via brute-force search followed by L-BFGS with a log-MSE loss, following Hoffmann et al. (2022). For Equations (4.6) and (4.7), we fit a line in log space using least squares regression, following Kaplan et al. (2020). In our experiments, we run a single fit that is shared across different tasks in a given benchmark. Specifically, we share the slopes α_B, α_η and allow the task-specific intercepts (as defined in Equations (4.6) and (4.7)) to differ across tasks. This technique is standard in ordinary least squares modeling and is referred to as fixed effect regression (Bishop and Nasrabadi, 2006). A minimal sketch of this shared-slope fit is given below.
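The sketch uses one shared slope and per-task intercepts in log space; the task names and (σ, B̃) values are fabricated for illustration, and the paper's actual implementation may differ.

```python
import numpy as np

# Hypothetical empirically-best batch sizes per (task, UTD ratio).
obs = [  # (task, sigma, best batch size)
    ("cheetah-run", 1, 420), ("cheetah-run", 2, 300), ("cheetah-run", 4, 215), ("cheetah-run", 8, 160),
    ("walker-walk", 1, 250), ("walker-walk", 2, 180), ("walker-walk", 4, 130), ("walker-walk", 8, 95),
]
tasks = sorted({t for t, _, _ in obs})

# Model: log B~ = log(beta_B[task]) - alpha_B * log(sigma), with alpha_B shared across tasks.
X = np.zeros((len(obs), len(tasks) + 1))
y = np.zeros(len(obs))
for i, (task, sigma, b) in enumerate(obs):
    X[i, tasks.index(task)] = 1.0   # per-task intercept column (fixed effect)
    X[i, -1] = -np.log(sigma)       # shared slope column
    y[i] = np.log(b)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_B = coef[-1]
intercepts = {t: float(np.exp(coef[j])) for j, t in enumerate(tasks)}
print(f"shared slope alpha_B = {alpha_B:.2f}; per-task beta_B = {intercepts}")
```

Eq. (4.7) is fit the same way, with the empirically best learning rates in place of the batch sizes.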
Sharing this slope serves the goal of variance reduction, which can be important if the granularity of the grid search over various hyperparameters run is coarse. More details are in Appendices B and D. 6. Related Work Scaling laws and predictability. Prior work has studied scaling laws in the context of supervised learning (Kaplan et al., 2020; Hoffmann et al., 2022), primarily to predict the effect of model size and training data on validation loss, while marginalizing out hyperparameters like batch size (McCandlish et al., 2018) and learning rate (Kaplan et al., 2020). There are several extensions of such scaling laws for language models, such as laws for settings with data repetition (Muennighoff et al., 2023) or mixture-ofexperts (Ludziejewski et al., 2024), but most focus on cross-entropy loss, with an exception of Gadre et al. (2024), which focuses on downstream metrics. While scaling laws have guided supervised learning experiments, little work explores this for RL. The closest works are: Hilton et al. (2023) which fits power laws for on-policy RL methods using model size and the number of environment steps; Jones (2021) which studies the scaling of AlphaZero on board games of increasing complexity; and Gao et al. (2023) which studies reward model overoptimization in RLHF. In contrast, we are the first ones to study off-policy value-based RL methods that are trained via TD-learning. Not only do off-policy methods exhibit training dynamics distinct from supervised learning and on-policy methods (Kumar et al., 2021b; Lyle et al., 2023), but we show that this distinction also results in a different functional form for scaling law altogether. We also note that while Hilton et al. (2023) use minimal compute, i.e., ๐’ž๐ฝ in our notation as a metric of performance, our analysis goes further in several respects: (1) we also study the tradeoff between data and compute (Figure 1), (2) we can predict the algorithm configuration for best performance (Problem 3.1); (3) we study many budget functions (๐’ž + ๐›ฟ ยท ๐’Ÿ can be any affine function). Methods for large-scale deep RL. Recent work has scaled deep RL across three axes: model size (Kumar et al., 2023; Schwarzer et al., 2023; Nauman et al., 2024b), data (Kumar et al., 2023; Gallici et al., 2024; Singla et al., 2024), and UTD (Chen et al., 2020; Dโ€™Oro et al., 2022). Naรฏve scaling of model size or UTD often degrades performance or causes divergence (Nikishin et al., 2022; Schwarzer et al., 2023), mitigated by classification losses (Kumar et al., 2023), layer normalization (Nauman et al., 2024a), or feature normalization (Kumar et al., 2021b). In our work, we use scaled network architectures from Nauman et al. (2024b) (Section 5). In on-policy RL, prior works focus on effective learning from parallelized data streams in a simulator or a world model (Mnih, 2016; Silver et al., 2016; Schrittwieser et al., 2020). Follow-up works like IMPALA (Espeholt et al., 2018) and SAPG (Singla et al., 2024) use a centralized learner that collects experience from distributed workers with importance sampling updates. These 11 Value-Based Deep RL Scales Predictably works differ substantially from our study as we focus exclusively on value-based off-policy RL algorithms that use TD-learning and not on-policy methods. In value-based RL, prior work on data scaling focuses on offline (Yu et al.; Kumar et al., 2023; Park et al., 2024) and multi-task RL (Hafner et al., 2023). 
In contrast, we study online RL and fit scaling laws to answer resource optimization questions. 7. Discussion, Limitations, and Future Work In this paper, we show that value-based deep RL algorithms scale predictably. We establish relationships between good values of hyperparameters of value-based RL. We then establish a relationship between required data and required compute for a certain performance. Finally, this allows us to determine an optimal allocation of resources to either data and compute. Although only estimated from small-scale runs, our empirical models reliably extrapolate to large compute, data, budget, or performance regimes. To the best of our knowledge, this is the first demonstration that it is possible to predict behavior of value-based off-policy RL algorithms at larger scale using small-scale experiments. At the same time, this first study also presents a number of open questions and challenges: 1. While simple power law models work well, an open question remains as to whether such laws are theoretically grounded, and whether there are better and more refined functional forms. 2. Our study only focused on three hyperparameters (๐ต, ๐œ‚, and ๐œŽ). We do not focus on optimal tradeoff between model size and UTD, which is important for compute scaling. For data efficient RL, it is important to analyze the dependency of weight decay and weight reset frequency on UTD, which are typical tricks employed by many of the most performant methods in literature. 3. While we focus on online RL, it is important to study scaling of offline-to-online and offline RL, which will allow direct applications of scaling law findings to large model training. 4. Finally, while we study relatively small models, future work will focus on verifying our results with larger model scales, larger scale tasks, study the effect of modern architectures, and cover a larger range of compute scales spanning multiple orders of magnitude. Our work is only the first step in studying scaling laws for value-based RL methods. Further research has the potential to improve our understanding of value-based RL at scale, provide researchers with tools to focus innovation on more important components, and eventually provide guidelines towards scaling value-based RL similarly to scaling enjoyed by other modern deep learning approaches. Acknowledgements We would like to thank Zhang-Wei Hong, Amrith Setlur, Rishabh Agarwal, Seohong Park, and Max Simchowitz for feedback on an earlier version of this paper. We would like to thank Andrea Zanette, Seohong Park, Kyle Stachowicz, and Qiyang Li for informative discussions. This research was supported by ONR under N00014-24-12206, N00014-22-1-2773, and ONR DURIP grant, with compute support from the Berkeley Research Compute, Polish high-performance computing infrastructure, PLGrid (HPC Center: ACK Cyfronet AGH), that provided computational resources and support under grant no. PLG/2024/017817. Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This work was done at UC Berkeley and CMU, and is not associated with Amazon. 12 Value-Based Deep RL Scales Predictably References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. Richard E Barlow and Hugh D Brunk. The isotonic regression problem and its dual. 
Journal of the American Statistical Association, 67(337):140โ€“147, 1972. Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016. Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/ video-generation-models-as-world-simulators. Xinyue Chen, Che Wang, Zijian Zhou, and Keith W Ross. Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations, 2020. Will Dabney, Andrรฉ Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning. arXiv preprint arXiv:2006.02243, 2020. Will Dabney, Andrรฉ Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7160โ€“7168, 2021. Pierluca Dโ€™Oro, Max Schwarzer, Evgenii Nikishin, Pierre-Luc Bacon, Marc G Bellemare, and Aaron Courville. Sample-efficient reinforcement learning by breaking the replay ratio barrier. In The Eleventh International Conference on Learning Representations, 2022. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018. Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. arXiv preprint arXiv:2403.08540, 2024. Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. Simplifying deep temporal difference learning. arXiv preprint arXiv:2407.04811, 2024. 13 Value-Based Deep RL Scales Predictably Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835โ€“10866. PMLR, 2023. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In arXiv, 2018a. URL https://arxiv.org/pdf/ 1801.01290.pdf. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861โ€“1870. PMLR, 2018b. Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. Jacob Hilton, Jie Tang, and John Schulman. Scaling laws for single-agent reinforcement learning. arXiv preprint arXiv:2301.13442, 2023. 
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498โ€“12509, 2019. Andy L. Jones. Scaling scaling laws with board games, 2021. URL https://arxiv.org/abs/2104. 03113. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014. Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations, 2021a. URL https://openreview.net/forum?id=O9bnihsFfXU. Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, and Sergey Levine. DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization. arXiv preprint arXiv:2112.04716, 2021b. Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Offline q-learning on diverse multi-task data both scales and generalizes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4-k7kUavAj. Hojoon Lee, Hanseul Cho, Hyunseung Kim, Daehoon Gwak, Joonkee Kim, Jaegul Choo, Se-Young Yun, and Chulhee Yun. Plastic: Improving input and label plasticity for sample efficient reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. 14 Value-Based Deep RL Scales Predictably Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=14-kr46GvP- . Zechu Li, Tao Chen, Zhang-Wei Hong, Anurag Ajay, and Pulkit Agrawal. Parallel ๐‘ž-learning: Scaling off-policy reinforcement learning under massively parallel simulation. In International Conference on Machine Learning, pages 19440โ€“19459. PMLR, 2023b. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. Jan Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Maciej Piรณro, Michaล‚ Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Krรณl, Tomasz Odrzygรณลบdลบ, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. In Forty-first International Conference on Machine Learning, 2024. Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. In International Conference on Machine Learning, pages 23190โ€“23211. PMLR, 2023. 
Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021. Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018. Volodymyr Mnih. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783, 2016. arXiv preprint Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529โ€“533, 2015. Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36:50358โ€“50376, 2023. Michal Nauman, Michaล‚ Bortkiewicz, Piotr Miล‚oล›, Tomasz Trzcinski, Mateusz Ostaszewski, and Marek Cygan. Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning. In Proceedings of the 41st International Conference on Machine Learning, 2024a. URL https://arxiv.org/pdf/2403.00514. PMLR 235:37342-37364. Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miล‚oล›, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control. arXiv preprint arXiv:2405.16158, 2024b. 15 Value-Based Deep RL Scales Predictably Evgenii Nikishin, Max Schwarzer, Pierluca Dโ€™Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International conference on machine learning, pages 16828โ€“16847. PMLR, 2022. Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl? arXiv preprint arXiv:2406.09329, 2024. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022. Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604โ€“609, 2020. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, Marc G Bellemare, Rishabh Agarwal, and Pablo Samuel Castro. Bigger, better, faster: Human-level atari with human-level efficiency. In International Conference on Machine Learning, pages 30365โ€“30380. PMLR, 2023. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484โ€“489, 2016. Jayesh Singla, Ananye Agarwal, and Deepak Pathak. Sapg: split and aggregate policy gradients. arXiv preprint arXiv:2407.20230, 2024. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. 
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https://doi.org/10. 1016/j.simpa.2020.100022. URL https://www.sciencedirect.com/science/article/pii/ S2665963820300099. Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261โ€“272, 2020. Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022. 16 Value-Based Deep RL Scales Predictably T. Yu, A. Kumar, et al. How to Leverage Unlabeled Data in Offline Reinforcement Learning. ICML 2022. 17 Value-Based Deep RL Scales Predictably Appendices A. Additional details on derivations FLOPs calculation. Recall that FLOPs per forward and backward passes are equal to ๐’ž๐ฝforward (๐œŽ) โ‰ˆ 2 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ) and ๐’ž๐ฝbackward (๐œŽ) โ‰ˆ 4 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ), with ๐œŽ denoting the number of gradient steps per environment steps. Q-learning methods used in our study use MLP and ResNet architectures, which are well modeled with this approximation. Assuming same size for actor and critic as an approximation, a training iteration of the critic requires three forward passes and one backward pass, totaling ๐’ž๐ฝcritic (๐œŽ) โ‰ˆ 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ). A training iteration of the actor requires two forward and two backward passes, totaling ๐’ž๐ฝactor (๐œŽ) โ‰ˆ 12 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ). Here we follow the standard practice of updating the actor every time a new data point collected, while the critic is updated according to the UTD ratio ๐œŽ. Since we expect the critic to be updated more then the actor. As such, in this study we assume ๐’ž๐ฝ (๐œŽ) โ‰ˆ ๐’ž๐ฝcritic (๐œŽ) โ‰ˆ 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ). (A.1) Compute and sample efficiency. Following Eq. (4.1), the number of data points required to achieve performance ๐ฝ is equal to: ๐’Ÿ๐ฝ (๐œŽ) โ‰ˆ ๐’Ÿ๐ฝmin + (๏ธ‚ ๐›ฝ๐ฝ ๐œŽ )๏ธ‚๐›ผ๐ฝ (A.2) Given the expressions for required data points, practical batch size, and FLOPs Equations (4.1), (4.6) and (A.1), we can now derive the expression for compute required to reach a particular performance expressed in terms of ๐œŽ. First, note that the number of parameter updates is ๐œŽ ยท ๐’Ÿ๐ฝ (๐œŽ) โ‰ˆ ๐œŽ ยท ๐’Ÿ๐ฝmin + ๐›ฝ๐ฝ๐›ผ๐ฝ ๐œŽ ๐›ผ๐ฝ โˆ’1 (A.3) Combining above, Eq. (4.6) with Eq. 
(A.1) yields: (๏ธ‚ )๏ธ‚ ๐›ฝ๐ฝ๐›ผ๐ฝ min ๐’ž๐ฝ (๐œŽ) โ‰ˆ 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ + ๐›ผ โˆ’1 ๐œŽ ๐ฝ (๏ธ‚ )๏ธ‚๐›ผ๐ต (๏ธ‚ )๏ธ‚ ๐›ฝ๐ฝ๐›ผ๐ฝ ๐›ฝ๐ต min โ‰ˆ 10 ยท ๐‘ ยท ยท ๐œŽ ยท ๐’Ÿ๐ฝ + ๐›ผ โˆ’1 ๐œŽ ๐œŽ ๐ฝ (๏ธ‚ min ๐›ผ๐ต ๐›ผ๐ฝ ๐›ผ๐ต )๏ธ‚ ๐’Ÿ๐ฝ ยท ๐›ฝ๐ต ๐›ฝ ยท๐›ฝ โ‰ˆ 10 ยท ๐‘ ยท + ๐›ผ๐ฝ +๐›ผ ๐ตโˆ’1 . ๐›ผ โˆ’1 ๐œŽ ๐ต ๐œŽ ๐ฝ ๐ต (A.4) We observe that the resulting expression is a sum of two power laws. In practice, one of the power laws will dominate the expression and a simple mental model is that compute increases with UTD as a power law with a coefficient < 1 (see Figure 2). 18 Value-Based Deep RL Scales Predictably Maximal compute efficiency. Here, we solve the compute optimization problem presented in Section 3. We write the problem: (๐ต * , ๐œ‚ * , ๐œŽ * ) := arg min ๐’ž (๐ต,๐œ‚,๐œŽ) s.t. ๐ฝ (๐œ‹Alg (๐ต, ๐œ‚, ๐œŽ)) โ‰ฅ ๐ฝ0 โˆง ๐’Ÿ โ‰ค ๐ท0 . (A.5) Firstly, we formulate the Lagrangian โ„’: โ„’(๐œŽ, ๐œ†) = ๐’ž๐ฝ (๐œŽ) + ๐œ† ยท (๐’Ÿ๐ฝ (๐œŽ) โˆ’ ๐ท0 ) )๏ธ‚ (๏ธ‚ )๏ธ‚ (๏ธ‚ (๏ธ‚ )๏ธ‚๐›ผ๐ฝ ๐›ฝ๐ฝ๐›ผ๐ฝ ๐›ฝ๐ฝ min min โ‰ˆ 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ + ๐›ผ โˆ’1 + ๐œ† ยท ๐’Ÿ๐ฝ + โˆ’ ๐’Ÿ0 ๐œŽ ๐ฝ ๐œŽ (A.6) Here, the constrained with respect to performance ๐ฝ0 is upheld through the use of ๐’ž๐ฝ (๐œŽ) and ๐’Ÿ๐ฝ (๐œŽ) which are defined such that ๐ฝ = ๐ฝ0 . We proceed with calculating the derivative with respect to ๐œ† to find the minimal ๐œŽ that is able to achieve the desired sample efficiency ๐’Ÿ๐ฝ . We denote such such optimal UTD as ๐œŽ * : ๐œ•โ„’ = ๐’Ÿ๐ฝmin + ๐œ•๐œ† (๏ธ‚ ๐›ฝ๐ฝ ๐œŽ )๏ธ‚๐›ผ๐ฝ โˆ’ ๐’Ÿ0 = 0 =โ‡’ ๐œŽ * = (๏ธ€ โˆ’๐›ฝ๐ฝ ๐’Ÿ๐ฝmin โˆ’ ๐’Ÿ0 )๏ธ€1/๐›ผ๐ฝ (A.7) Then, we substitute the ๐œŽ * into the expression defining compute, as well as use Eq. (4.6): (๏ธ‚ )๏ธ‚ ๐›ผ๐ต ๐›ฝ๐ฝ๐›ผ๐ฝ ๐›ฝ๐ต min ๐’ž๐ฝ (๐œŽ ) โ‰ˆ 10 ยท ๐‘ ยท ๐›ผ โˆ’1 ยท ๐’Ÿ๐ฝ + ๐›ผ ๐œŽ ๐ต ๐œŽ ๐ฝ (๏ธƒ (๏ธ€ )๏ธ€ )๏ธƒ ๐›ผ๐ต ๐›ฝ๐ฝ๐›ผ๐ฝ ยท ๐’Ÿ๐ฝmin โˆ’ ๐’Ÿ0 ๐›ฝ๐ต min โ‰ˆ 10 ยท ๐‘ ยท * ๐›ผ โˆ’1 ยท ๐’Ÿ๐ฝ + (๐œŽ ) ๐ต โˆ’๐›ฝ๐ฝ๐›ผ๐ฝ * (A.8) ๐›ผ๐ต โ‰ˆ 10 ยท ๐‘ ยท ๐›ฝ๐ต ยท (๐œŽ * )1โˆ’๐›ผ๐ต ยท ๐’Ÿ0 Maximal sample efficiency. Firstly, we note that we treat ๐ต(๐œŽ) as a constant and do not optimize with respect to it. We start with the problem definition: (๐ต * , ๐œ‚ * , ๐œŽ * ) := arg min (๐ต,๐œ‚,๐œŽ) ๐’Ÿ s.t. ๐ฝ (๐œ‹Alg (๐ต, ๐œ‚, ๐œŽ)) โ‰ฅ ๐ฝ0 โˆง ๐’ž โ‰ค ๐ถ0 . (A.9) Similarly to the maximal compute efficiency problem, we formulate the Lagrangian โ„’: โ„’(๐œŽ, ๐œ†) = ๐’Ÿ๐ฝ (๐œŽ) + ๐œ† ยท (๐’ž๐ฝ (๐œŽ) โˆ’ ๐ถ0 ) (๏ธ‚ )๏ธ‚๐›ผ๐ฝ (๏ธ‚ (๏ธ‚ )๏ธ‚ )๏ธ‚ ๐›ฝ๐ฝ๐›ผ๐ฝ ๐›ฝ๐ฝ min min โ‰ˆ ๐’Ÿ๐ฝ + + ๐œ† ยท 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ + ๐›ผ โˆ’ ๐’ž0 ๐œŽ ๐œŽ ๐ฝ (A.10) 19 Value-Based Deep RL Scales Predictably Again, we uphold the constraint with respect to the performance through the use of ๐’Ÿ๐ฝ (๐œŽ) and ๐’ž๐ฝ (๐œŽ). We calculate the derivative with respect to ๐œ†: (๏ธ‚ )๏ธ‚ ๐›ฝ๐ฝ๐›ผ๐ฝ ๐œ•โ„’ min = 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ ยท ๐’Ÿ๐ฝ + ๐›ผ โˆ’ ๐’ž0 = 0 ๐œ•๐œ† ๐œŽ ๐ฝ =โ‡’ ๐’Ÿ๐ฝmin + ๐›ฝ๐ฝ๐›ผ๐ฝ ๐’ž0 = = ๐’Ÿ๐ฝ ๐›ผ ๐ฝ ๐œŽ 10 ยท ๐‘ ยท ๐ต(๐œŽ) ยท ๐œŽ (A.11) Since ๐’Ÿ๐ฝ is monotonic in ๐œŽ and does not model impact of ๐ต on the sample efficiency, the optimization problem can be solved via Weierstrass extreme value theorem. 
B. Experimental details

For our experiments, we use a total of 12 tasks from 3 benchmarks (DeepMind Control (Tunyasuvunakool et al., 2020), Isaac Gym (Makoviychuk et al., 2021), and OpenAI Gym (Brockman et al., 2016)). We list all considered tasks in Table 1.

Table 1: Tasks used in the presented experiments.

Domain            Task               Optimal π Returns
DeepMind Control  Cartpole-Swingup   1000
DeepMind Control  Cheetah-Run        1000
DeepMind Control  Dog-Stand          1000
DeepMind Control  Finger-Spin        1000
DeepMind Control  Humanoid-Stand     1000
DeepMind Control  Quadruped-Walk     1000
DeepMind Control  Walker-Walk        1000
Isaac Gym         Franka-Push        0.05
OpenAI Gym        HalfCheetah-v4     8500
OpenAI Gym        Walker2d-v4        4500
OpenAI Gym        Ant-v4             6625
OpenAI Gym        Humanoid-v4        6125

Figure 1. We use all available UTD values for the fits, which is 6 for DMC, 5 for OAI Gym, and 7 for Isaac Gym. Given the dependency of compute and data on UTD, we plot the resulting curve. We average the data efficiencies across all tasks in each domain, as described in Appendix D. We calculate compute given model sizes of N = 4.92e6 for DMC, N = 1.5e5 for OAI Gym, and N = 2e6 for Isaac Gym, following standard implementations of the respective algorithms. For budget extrapolation, we use tradeoff values δ that mimic the wall-clock time of the algorithm: δ = 1e10 for DMC, δ = 5e9 for OAI Gym, and δ = 1e4 for Isaac Gym. We exclude runs affected by resets (σ = 8) for DMC, since the returns right after a reset are lower, which adds noise to the results.

Figure 2. We use the same data as for DMC in Figure 1 (left).

Figure 3. We use the same data as for DMC in Figure 1 (right).

Figure 5. In the left and central figures, we evaluate the B* and η* models. For each DMC task, we find the best hyperparameters according to the workflow and procedure described in Section 5 and Appendix D. While the intercepts vary across environments, for simplicity we plot data points and fits from all environments in the same figure by shifting them by the corresponding intercept. In the right figure, we marginalize over σ and visualize the best-performing pairs of B and η.

Figure 4. Left: we show an illustration that reflects our observed empirical results about the dependencies between hyperparameters. Right and middle: we investigate the correlations between overfitting, the parameter norm of the critic network, and σ. We observed the same relationships on all tasks; here, to avoid clutter, we plot 3 tasks from the DMC benchmark: cheetah-run, dog-stand, and quadruped-walk. To measure overfitting, we compare the TD loss calculated on samples drawn randomly from the buffer (corresponding to training data) to the TD loss calculated on the 16 newest transitions (corresponding to validation data) according to:

Overfitting = TD_training − TD_validation.    (B.1)

We fit the linear curves using ordinary least squares with mean absolute error loss.

Figure 6. Here, we investigate 4 tasks from OpenAI Gym, listed in Table 1, and compare the extrapolation performance of two hyperparameter sets: the best-performing hyperparameters for σ = 1, found by testing the 8 hyperparameter values listed in Table 3 (we refer to this configuration as the baseline), and the hyperparameters predicted by our proposed models of B* and η*. We fit our models using σ ∈ (1, 2, 4, 8) and extrapolate to σ ∈ (0.5, 16). The graph shows the data efficiency with a threshold of 700, normalized according to the procedure in Appendix D.
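The comparison in Figure 6 rests on fitting the B*(σ) and η*(σ) power laws on a few UTD values and extrapolating outside the fitted range. A minimal sketch of such a fit, a straight line in log-log space obtained with numpy.polyfit, is shown below; the tabulated optima are hypothetical values used only to exercise the code.

    import numpy as np

    # Empirically selected optima at the fitted UTD values (hypothetical numbers for illustration).
    sigmas_fit = np.array([1.0, 2.0, 4.0, 8.0])
    B_opt = np.array([448.0, 352.0, 288.0, 224.0])            # hypothetical best batch sizes
    lr_opt = np.array([1.4e-4, 1.1e-4, 8.7e-5, 7.0e-5])       # hypothetical best learning rates

    def fit_power_law(sigma, y):
        # y ~ beta * sigma^(-alpha)  <=>  log y = log beta - alpha * log sigma
        slope, intercept = np.polyfit(np.log(sigma), np.log(y), deg=1)
        return np.exp(intercept), -slope                      # (beta, alpha)

    beta_B, alpha_B = fit_power_law(sigmas_fit, B_opt)
    beta_lr, alpha_lr = fit_power_law(sigmas_fit, lr_opt)

    # Extrapolate to UTD values outside the fitted range, as in Figure 6.
    for sigma in (0.5, 16.0):
        B_pred = beta_B * sigma ** (-alpha_B)
        lr_pred = beta_lr * sigma ** (-alpha_lr)
        print(f"sigma={sigma:4.1f}  B*~{B_pred:6.1f}  lr*~{lr_pred:.2e}")

Fitting in log space keeps the relative error comparable across the wide range of σ values.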
Figure 7. The goal of the left figure is to visualize the effect of an isotonic regression fit on noisy data. We use the SciPy package (Virtanen et al., 2020) to run the isotonic model. In the right figure, we visualize the process of best hyperparameter selection using bootstrapped confidence intervals. We describe the bootstrapping strategy in Appendix D.

C. Resulting Fits

DMC. Refer to Table 2 for environment-specific values.

η* = β_η · σ^{−0.26}
B* = β_B · σ^{−0.47}
D_J = D^min · (1 + (σ / 0.45)^{−0.74})
σ* = 1.4e8 · F_0^{−0.53}    (C.1)

OpenAI Gym. Refer to Table 2 for environment-specific values.

η* = β_η · σ^{−0.30}
B* = β_B · σ^{−0.33}
D_J = D^min · (1 + (σ / 4.02)^{−0.69})
σ* = 1.4e8 · F_0^{−0.53}    (C.2)

Isaac Gym.

η* = 8.77 · (1 + (σ / 2.57e-3)^{−0.26})
B* = 38.6 · (1 + (σ / 1.42e-2)^{−0.68})
D_J = 6.8e7 · (1 + (σ / 1.88)^{−0.87})
σ* = 11.3 · F_0^{−0.57}    (C.3)

Table 2: Coefficients for DMC and OpenAI Gym fits.

Domain      Task              β_η      β_B    D^min
DMC         cartpole-swingup  7.55e-4  538.2  2.4e4
DMC         cheetah-run       6.25e-4  564.9  3.5e5
DMC         finger-spin       8.77e-4  608.2  2.9e4
DMC         humanoid-stand    3.86e-4  451.8  3.8e5
DMC         quadruped-walk    8.46e-4  526.4  6.2e4
DMC         walker-walk       9.38e-4  313.3  3.3e4
OpenAI Gym  Ant-v4            1.35e-4  447.0  2.7e5
OpenAI Gym  HalfCheetah-v4    1.86e-3  415.4  7.8e4
OpenAI Gym  Humanoid-v4       1.65e-4  351.6  1.8e5
OpenAI Gym  Walker2d-v4       7.85e-4  399.1  1.7e5

Table 3: Tested configurations.

Hyperparameter     DeepMind Control          Isaac Gym                                                   OpenAI Gym
Updates-to-data σ  1, 2, 4, 8                1/1024, 1/2048, 1/4096, 1/8192, 1/16384, 1/32768, 1/65536   1, 2, 4, 8
Batch size B       32, 64, 128, 256, 512     512, 1024, 2048, 4096, 8192                                 128, 256, 512
Learning rate η    15e-5, 3e-4, 6e-4, 12e-3  1e-4, 2e-4, 3e-4                                            1e-4, 2e-4, 5e-4, 1e-3, 2e-3

D. Additional details on the fitting procedure

Preprocessing return values. In order to estimate the fits for our scaling laws, we need to track the data and compute needed by a run to hit a target performance level. Due to stochasticity both in training and evaluation, naïve measurements of this point can exhibit high variance, which in turn would result in low-quality fits for D_J and C_J. Thus, we preprocess the return values before estimating the fits by running isotonic regression (Barlow and Brunk, 1972). Isotonic regression transforms the return values into the closest monotonic sequence, which can then be used to estimate D_J. While return values can in general decrease with further training after reaching a target value, which would result in a large deviation between the isotonic fit and the true return values, the isotonic transformation still suffices for our purposes, since our goal is simply to estimate the minimum number of samples or compute needed to attain a target return. As we can still make reliable predictions that extrapolate to larger scales, the downstream impact of this error is not substantial. We also average across random seeds before running isotonic regression to further reduce noise. We normalize the returns for all environments to be between 0 and 1000 (Table 1 lists pre-normalized returns), and reserve the return thresholds of 700 and 800 for the budget extrapolation in Figure 1.
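A minimal sketch of this preprocessing step is shown below. The paper runs isotonic regression via SciPy; for brevity this sketch uses scikit-learn's IsotonicRegression instead, and the learning curves are synthetic.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def data_to_threshold(env_steps, returns_per_seed, threshold):
        """Environment steps needed to first reach `threshold`, after seed-averaging
        and isotonic smoothing of the evaluation returns."""
        mean_returns = np.mean(returns_per_seed, axis=0)          # average across seeds first
        iso = IsotonicRegression(increasing=True)
        smooth = iso.fit_transform(env_steps, mean_returns)       # closest monotone sequence
        hit = np.nonzero(smooth >= threshold)[0]
        return env_steps[hit[0]] if hit.size else np.inf          # D_J estimate (inf if never reached)

    # Synthetic example: 5 seeds, noisy saturating learning curves.
    rng = np.random.default_rng(0)
    env_steps = np.linspace(1e4, 1e6, 200)
    curves = 1000 * (1 - np.exp(-env_steps / 3e5)) + rng.normal(0, 40, size=(5, 200))
    print("D_700 ~", data_to_threshold(env_steps, curves, threshold=700.0))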
Uncertainty-adjusted optimal hyperparameters. While averaging across seeds and applying isotonic regression reduces noise, we observe that the granularity of our grid search on learning rate and batch size limits the precision of the resulting hyperparameter estimates B̃, η̃. Noise due to random seeds makes hyperparameter selection harder, as some hyperparameters that appear empirically optimal might simply be so due to noise. We observe that we can correct for this precision loss by constructing a more precise estimate of B̃, η̃ adjusted for this uncertainty. Specifically, we run K = 100 bootstrap estimates by sampling n random seeds with replacement out of the original n random seeds, applying isotonic regression, and selecting the optimal hyperparameters B̃_k, η̃_k. We then use the mean of this bootstrapped estimate to improve the precision:

B̃_bootstrap = (1/K) · Σ_k B̃_k,        η̃_bootstrap = (1/K) · Σ_k η̃_k.    (D.1)

We have also experimented with more precise laws for learning rate and batch size that add an additive offset. In this case, we follow Hoffmann et al. (2022) and fit the data using brute-force search followed by L-BFGS, using MSE in log space as the error: MSE_log(a, b) = (log a − log b)².

B*(σ) ≈ B_min + σ_B / σ^{α_B},    (D.2)
η*(σ) ≈ η_min + σ_η / σ^{α_η}.    (D.3)

However, we found that given our limited sweep range this more complex fit did not justify the decrease in degrees of freedom, resulting in reduced extrapolation accuracy.

Independence of B and η. Whereas the optimal choices of B and η are often intertwined as UTD changes, we observe in our experiments that the correlation between them is relatively low (Figure 5). Since we run a cross-product grid search over the hyperparameter space {B_1, ..., B_{n_B}} × {η_1, ..., η_{n_η}}, we can use this fact to further improve the results by averaging the estimate B̃ over different values of η. That is, we produce the estimate B̃^{[η=η_i]} (respectively η̃^{[B=B_i]}) by only looking at the runs where η = η_i, and averaging such estimates:

B̃_mean = (1/n_η) · Σ_i B̃^{[η=η_i]},        η̃_mean = (1/n_B) · Σ_i η̃^{[B=B_i]}.    (D.4)

Data efficiency. We fit the data efficiency of runs that use the fitted practical hyperparameters B*, η* according to Eq. (4.1). We follow Hoffmann et al. (2022) and fit the data using brute-force search followed by L-BFGS, using MSE in log space as the error: MSE_log(a, b) = (log a − log b)². In the DeepMind Control Suite, we would like to share the data efficiency fit across different environments env. We normalize the data efficiency D by the intra-environment median data efficiency D_med^env = median{D^env_[σ=σ_i] | i = 1..n_σ}. For interpretability, we further re-normalize D with the overall median D_med: D_norm = D · D_med / D_med^env. To do so, we express the data efficiency law alternatively as:

D_J(σ) ≈ D_J^min · (1 + (β_J / σ)^{α_J}).    (D.5)

This is equivalent to Eq. (4.1) because the coefficient β_J absorbs D_J^min. However, this expression makes explicit an overall multiplicative offset D_J^min (this form also enforces that D_J^min is positive). Our median normalization is then equivalent to fitting per-environment coefficients D_J^min, following our procedure for environment-shared hyperparameter fits. However, we further improve robustness by fixing the per-environment coefficients to the median data efficiency, so they do not need to be fit.
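A sketch of this fitting recipe (a coarse grid of initializations followed by L-BFGS, with MSE in log space) is given below; the observed (σ, D_J) pairs are synthetic, and the model follows the parameterization of Eq. (D.5).

    import numpy as np
    from scipy.optimize import minimize

    # Synthetic observations of data efficiency at several UTD values.
    sigmas = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
    noise = np.exp(np.random.default_rng(1).normal(0, 0.05, sigmas.size))
    D_obs = 3e4 * (1.0 + (sigmas / 0.45) ** -0.74) * noise

    def model(params, sigma):
        # Eq. (D.5): D_J(sigma) = D_min * (1 + (beta_J / sigma)^alpha_J); exp-parameterized to stay positive.
        log_D_min, log_beta, alpha = params
        return np.exp(log_D_min) * (1.0 + (np.exp(log_beta) / sigma) ** alpha)

    def loss(params):
        # MSE in log space: (log a - log b)^2, averaged over the observed points.
        return np.mean((np.log(model(params, sigmas)) - np.log(D_obs)) ** 2)

    # Brute-force search over a coarse grid of initializations, then refine each with L-BFGS.
    grid = [(ld, lb, a) for ld in np.log([1e4, 1e5]) for lb in np.log([0.1, 1.0, 10.0]) for a in (0.5, 1.0)]
    best = min((minimize(loss, x0, method="L-BFGS-B") for x0 in grid), key=lambda r: r.fun)
    D_min_hat, beta_hat, alpha_hat = np.exp(best.x[0]), np.exp(best.x[1]), best.x[2]
    print(f"D_min~{D_min_hat:.3e}  beta_J~{beta_hat:.3f}  alpha_J~{alpha_hat:.3f}")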
E. Critical batch size analysis

Previous work has argued that there is a critical batch size B_crit for neural network training in image classification, generative modeling, and reinforcement learning with policy gradient algorithms (McCandlish et al., 2018): a transition point at which increasing the batch size begins to yield diminishing returns.

Figure 7: Left: determining performance via isotonic regression on DMC. Right: improving hyperparameter selection with uncertainty adjustment on DMC. Further details are in Appendix D.

Figure 8: An approximation of the critical batch size over training. Further details are in Appendix E.

Figure 9: B̃_final vs. B̃_crit, grouped by task and UTD.

We follow this work and compute an estimate of the gradient noise scale B_noise ≈ B_crit according to the following procedure: throughout training, we compute the gradient norm |G_B| of the critic network for batches of size B = B_small := 64 and B = B_big := 1024. Then, we evaluate

|G|² := (1 / (B_big − B_small)) · (B_big · |G_{B_big}|² − B_small · |G_{B_small}|²),
S := (1 / (1/B_small − 1/B_big)) · (|G_{B_small}|² − |G_{B_big}|²),

and take B̃_crit := S / |G|². In practice, to account for the noisiness of |G|², we first take rolling averages of |G_{B_small}| and |G_{B_big}| over training, and tune the window size so that the estimates for |G|² and S are stable. We show the values of B̃_crit over training in Figure 8. Unlike policy gradient methods, we find that the critical batch size (averaged over training) has little correlation with the optimal batch size, as shown in Figure 9.
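A sketch of this estimator is given below; the per-step squared gradient norms are synthetic placeholders standing in for measurements taken during training, and the rolling-average window is an illustrative choice.

    import numpy as np

    B_SMALL, B_BIG = 64, 1024

    def critical_batch_size(gnorm_sq_small, gnorm_sq_big, window=200):
        """Estimate B_crit ~ S / |G|^2 from per-step squared gradient norms measured
        at batch sizes B_SMALL and B_BIG (McCandlish et al., 2018)."""
        kernel = np.ones(window) / window
        small = np.convolve(gnorm_sq_small, kernel, mode="valid")  # rolling averages to tame noise
        big = np.convolve(gnorm_sq_big, kernel, mode="valid")
        G2 = (B_BIG * big - B_SMALL * small) / (B_BIG - B_SMALL)   # |G|^2 estimate
        S = (small - big) / (1.0 / B_SMALL - 1.0 / B_BIG)          # gradient-covariance scale estimate
        return S / G2                                              # B_crit estimate over training

    # Synthetic squared gradient norms, only to exercise the function.
    rng = np.random.default_rng(0)
    true_G2, true_S = 1.0, 256.0
    g_small = true_G2 + true_S / B_SMALL + rng.normal(0, 0.2, 10_000)
    g_big = true_G2 + true_S / B_BIG + rng.normal(0, 0.2, 10_000)
    print("median B_crit estimate:", np.median(critical_batch_size(g_small, g_big)))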
Table 4: Batch size values predicted by the proposed model on DMC.

Task              σ=0.25  σ=0.5  σ=1  σ=2  σ=4  σ=8
cartpole-swingup  1040    752    544  384  288  208
cheetah-run       1088    784    560  400  288  208
dog-stand         240     176    128  96   64   48
finger-spin       1168    848    608  432  320  224
humanoid-stand    864     624    448  320  240  176
quadruped-walk    1008    736    528  384  272  192
walker-walk       608     432    320  224  160  112

Table 5: Learning rate values predicted by the proposed model on DMC.

Task              σ=0.25   σ=0.5    σ=1      σ=2      σ=4      σ=8
cartpole-swingup  .00108   .000902  .000755  .000631  .000528  .000442
cheetah-run       .000893  .000747  .000625  .000523  .000438  .000366
dog-stand         .000664  .000555  .000465  .000389  .000325  .000272
finger-spin       .00125   .00105   .000877  .000734  .000614  .000514
humanoid-stand    .000551  .000461  .000386  .000323  .00027   .000226
quadruped-walk    .00121   .00101   .000846  .000708  .000592  .000496
walker-walk       .00134   .00112   .000938  .000785  .000657  .000549

Table 6: Batch size values predicted by the proposed model on OpenAI Gym.

Task            σ=0.25  σ=0.5  σ=1  σ=2  σ=4  σ=8  σ=16
Ant-v4          704     560    448  352  288  224  176
HalfCheetah-v4  672     528    416  336  256  208  160
Humanoid-v4     560     432    352  272  224  176  144
Walker2d-v4     640     496    400  320  256  192  160

Table 7: Learning rate values predicted by the proposed model on OpenAI Gym.

Task            σ=0.25   σ=0.5    σ=1      σ=2      σ=4      σ=8      σ=16
Ant-v4          .000206  .000167  .000138  .000109  .000087  .000070  .000060
HalfCheetah-v4  .002820  .002280  .001900  .001510  .001210  .000972  .000827
Humanoid-v4     .000251  .000203  .000169  .000134  .000107  .000086  .000073
Walker2d-v4     .001180  .000958  .000806  .000640  .000512  .000412  .000347
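For reference, the sketch below regenerates a few of the DMC rows above from the shared exponents in Eq. (C.1) and the per-task coefficients in Table 2. Rounding batch sizes to the nearest multiple of 16 is an assumption made here to match the tabulated values, not a procedure stated in the paper.

    # DMC exponents from Eq. (C.1) and a few per-task coefficients from Table 2.
    ALPHA_LR, ALPHA_B = 0.26, 0.47
    COEFFS = {                          # task: (beta_eta, beta_B)
        "cartpole-swingup": (7.55e-4, 538.2),
        "cheetah-run": (6.25e-4, 564.9),
        "walker-walk": (9.38e-4, 313.3),
    }

    def predict(task, sigma):
        beta_eta, beta_B = COEFFS[task]
        lr = beta_eta * sigma ** (-ALPHA_LR)                    # eta*(sigma) = beta_eta * sigma^-0.26
        batch = 16 * round(beta_B * sigma ** (-ALPHA_B) / 16)   # assumed rounding to a multiple of 16
        return batch, lr

    for task in COEFFS:
        row = [predict(task, s) for s in (0.25, 0.5, 1, 2, 4, 8)]
        print(task, [b for b, _ in row], [f"{lr:.3g}" for _, lr in row])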