Title: User Comment Replies — LessWrong
Description: A community blog devoted to refining the art of rationality

All of Julian Schrittwieser's Comments + Replies

chinchilla's wild implications

Julian Schrittwieser (2y): An important distinction here is that the number of tokens a model was trained for should not be confused with the number of tokens in a dataset: if each token is seen exactly once during training, then the model has been trained for one "epoch". In my experience, scaling continues for quite a few epochs over the same dataset; only if the model has more parameters than the dataset has tokens, and is trained for >10 epochs, does overfitting kick in and scaling break down.

gwern (2y):
> only if the model has more parameters than the dataset tokens and training for >10 epochs does overfitting kick in and scaling break down.
That sounds surprising. You are claiming that you observe the exact same loss, and downstream benchmarks, if you train a model on a dataset for 10 epochs as you do training on 10x more data for 1 epoch? I would have expected some substantial degradation in efficiency, such that the 10-epoch case was equivalent to training on 5x the data or something.

nostalgebraist (2y): This distinction exists in general, but it's irrelevant when training sufficiently large LMs. It is well-established that repeating data during large LM training is not a good practice. Depending on the model size and the amount of repeating, one finds that it is either
(1) a suboptimal use of compute (relative to training a bigger model for 1 epoch), or
(2) actively harmful, as measured by test loss or loss on out-of-distribution data,
with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and … (read more)

REPL's: a type signature for agents

Julian Schrittwieser (3y): Could you explain how this differs from the standard Reinforcement Learning formulation? (See e.g. http://incompleteideas.net/book/first/ebook/node28.html for an introduction.)

OpenAI Solves (Some) Formal Math Olympiad Problems

Julian Schrittwieser (3y): This is indeed amusing. In reality, the action space can be taken to be of size 256 (the number of possible byte values), with the number of bytes in the solution as the episode length. Note also that 256 is an upper bound: not all byte values are valid at all points, and most of the time only the 128 ASCII values are used. Using a tokenizer, as is standard in language models, simply reduces the episode length by increasing the action space; it does not change the size of the overall state space. This also means that, despite their claims, the search space for the example solutions shown on their website is similar to or smaller than for board games such as Chess and Go :D
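A back-of-the-envelope sketch of that comparison (illustrative only, not figures from the comment): the ~150-byte solution length is an assumption, and the Chess/Go branching factors and game lengths below are the usual rough approximations.

```python
import math

def log10_search_space(branching_factor: float, depth: int) -> float:
    """log10 of the naive search-space size branching_factor ** depth."""
    return depth * math.log10(branching_factor)

# Byte-level generation of a hypothetical ~150-byte solution,
# using the full 256-value upper bound on the per-step action space.
proof = log10_search_space(256, 150)

# Commonly cited rough figures for board games (assumptions, not from the comment):
chess = log10_search_space(35, 80)
go = log10_search_space(250, 150)

print(f"~150-byte proof: ~10^{proof:.0f}")  # ~10^361
print(f"Chess:           ~10^{chess:.0f}")  # ~10^123
print(f"Go:              ~10^{go:.0f}")     # ~10^360
```

On these assumptions the byte-level search space is in the same ballpark as Go, and the effective number is smaller still, since most byte values are never valid at a given step.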
EfficientZero: How It Works

Julian Schrittwieser (3y): Nice summary! I agree, this is an interesting paper :) But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story… (read more)

1a3orn (3y): Ah, that does make sense, thanks. And yeah, it would be interesting to know what the curve / crossover point would look like for the impact from the consistency loss.

gwern (3y): Yes, that was the comment I meant to leave but apparently didn't: it's just another bias-variance tradeoff. In the limit (say, 20b frames…) all of these regularizations and auxiliary tasks (and/or informative priors) are either going to be neutral or hurt converged performance compared to pure end-to-end reward-only learning. And they should, if you do them right, help most early on when data is scarce and the end-to-end reward-only approach hasn't been able to learn much. This isn't post hoc; it's just what any ML person should predict from the bias-variance tradeoff. The devil is in the details, though, and you could be doing any of it wrong or not be where you think you are in the tradeoff, and that's where this sort of research finding lives.

Omicron Post #3

Julian Schrittwieser (3y): Does Omicron already having spread through community transmission in the Netherlands (and other European countries) before the reports from South Africa, yet still not being as widespread in Europe, suggest that it's not that transmissive after all?

Parameter counts in Machine Learning

Julian Schrittwieser (3y): The difference in compute between AlexNet and AlphaZero is because for AlexNet you are only counting the flops during training, while for AlphaZero you are counting both the training and the self-play data generation (which does 800 forward passes per move * ~200 moves to generate each game). If you were to compare supervised training numbers for both (e.g. training on human chess or Go games) then you'd get much closer.

Rohin Shah (3y): That's fair. I was thinking of that as part of "compute needed during training", but you could also split it up into "compute needed for gradient updates" and "compute needed to create data of sufficient quality", and then say that the stable thing is the "compute needed for gradient updates".

How much compute was used to train DeepMind's generally capable agents?

Julian Schrittwieser (3y): The TOPS numbers from the wiki page seem wrong. TPUv1 had 92 TOPS (uint8); for TPUv3, the "90 TOPS" refers to a single chip, but I'm fairly sure that when the paper says "8 TPUv3s" they mean 8 cards, as that's how they are available on Google Cloud (1 card = 4 chips).

Daniel Kokotajlo (3y): Huh, thanks! I guess my guesstimate is wrong then. So should I multiply everything by 8?
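Taking the figures in this exchange at face value, a quick sketch of the implied total (the 90-TOPS-per-chip number is the wiki page's figure as discussed above, used here for illustration rather than as an official spec):

```python
# Illustrative arithmetic only, using the numbers from the comment above.
tops_per_chip = 90        # the wiki page's TPUv3 per-chip figure
chips_per_card = 4        # 1 TPUv3 card = 4 chips
cards = 8                 # the paper's "8 TPUv3s", read as 8 cards

total_chips = cards * chips_per_card       # 32
total_tops = total_chips * tops_per_chip   # 2880

print(total_chips, total_tops)
```

Whether the right correction factor is 8, 4, or something else depends on what the original guesstimate assumed a "TPUv3" to be (a chip or a card).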
How much compute was used to train DeepMind's generally capable agents?

Julian Schrittwieser (3y): Only Anakin actually runs the environment on the TPU, and this only works for pretty simple environments (basically: can you implement it in JAX?). Sebulba runs environments on the host, which is what would have been done for this paper too (no idea if they used Sebulba or had a different setup). This doesn't really matter though, because for these simulated environments it's fairly simple to fully utilize the TPUs by running more (remote) environments in parallel.

gwern (3y): Yes, I see that they used Unity, so the TPUs themselves couldn't run the env, but the TPU CPU VM* could potentially run a lot of copies (with the ~300GB of RAM it's got access to), and that'd be a lot nicer than running remote VMs. At least in Tensorfork, when we try to use TPU pods, a lot of time goes into figuring out correct use of the interconnect & traffic, because the on-TPU ops are so optimized by default. (And regardless of which of those tricks this open-ended paper uses, this is a point well worth knowing about how research could potentially get way more performance out of a TPU pod than one would expect from knowing the TPU usage of old stuff like AlphaStar.)

* advertisement: access to the VM was recently unlocked for non-Google TPU users. It really changes how you treat TPU use!
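To illustrate the "can you implement it in JAX?" criterion for Anakin-style setups, here is a minimal sketch with a made-up toy environment and policy (it is not the setup used in the paper): because the environment step is a pure JAX function, a whole batch of environments plus the rollout loop can be jitted and run directly on the accelerator.

```python
import jax
import jax.numpy as jnp

# Toy environment, purely illustrative: the state is a scalar position,
# actions {0, 1, 2} nudge it by {-0.1, 0, +0.1}, and the reward is -|position|.
def env_step(state, action):
    next_state = state + (action - 1.0) * 0.1
    reward = -jnp.abs(next_state)
    return next_state, reward

# Toy policy: linear logits over the 3 actions, sampled with a PRNG key.
def select_action(params, state, key):
    logits = params @ jnp.stack([state, jnp.ones_like(state)])  # params: (3, 2)
    return jax.random.categorical(key, logits)

def rollout(params, init_states, key, num_steps):
    """Unroll a batch of environments entirely on-device with lax.scan."""
    def step(carry, _):
        states, key = carry
        key, subkey = jax.random.split(key)
        keys = jax.random.split(subkey, states.shape[0])
        actions = jax.vmap(select_action, in_axes=(None, 0, 0))(params, states, keys)
        states, rewards = jax.vmap(env_step)(states, actions)
        return (states, key), rewards

    (states, _), rewards = jax.lax.scan(step, (init_states, key), None, length=num_steps)
    return states, rewards

rollout_jit = jax.jit(rollout, static_argnames="num_steps")

params = jnp.zeros((3, 2))          # untrained toy policy
init_states = jnp.zeros(1024)       # 1024 environments in parallel
_, rewards = rollout_jit(params, init_states, jax.random.PRNGKey(0), num_steps=128)
print(rewards.shape)                # (128, 1024)
```

An environment like the Unity one used in this paper can't be written as a pure JAX function like env_step above, which is why it has to run on the host (Sebulba-style) or on remote machines, with only the agent's computation on the TPU.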