[Music] Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I'm Ron Minsky. It's my pleasure to introduce Sylvain Gugger. Sylvain is a machine learning engineer here at Jane Street, and he's done a bunch of interesting stuff in the outside world as well: he was a core maintainer of Hugging Face's Transformers library, he wrote Hugging Face Accelerate, which is a nice library from them that helps you run your models performantly on a lot of different kinds of hardware, and he also wrote a lovely book along with Jeremy Howard called Deep Learning for Coders with fastai and PyTorch. So he's done a bunch of interesting stuff in the outside world, and he's also doing a lot of interesting machine learning work here at Jane Street. Thanks for joining me. Thanks, I'm very honored to be here. Just to kick things off, I'd love to hear a little bit more about your background, and in particular how you got to work on machine learning in the first place. That's a good question. I was originally a math teacher, like 10 years ago, teaching at the first years of university level. I moved to the US in 2015, I had kids, so I took some small projects at home while mostly taking care of my kids. In 2017 AI was becoming more mainstream; I actually read an article in the New York Times about how it was going to steal everyone's jobs in the next two or three years. That didn't happen, but it was still something that was becoming more mainstream, and at the end of the article they mentioned a couple of online courses for people interested in diving in more. I was interested, so I dived into it. One of the courses mentioned was the fast.ai course by Jeremy Howard, which I followed and found very interesting. I started commenting more and more on the forums and making a couple of contributions to the fastai library, which is used throughout the course to make training models a little bit faster and a little bit easier. And then towards the end of the course, Jeremy led the fast.ai team into this competition called DAWNBench, which is the ancestor of the MLPerf benchmark. It was organized by Stanford, and the goal was to train a computer vision model as fast as possible to a given accuracy. So we entered the competition, I helped the team, and we were positioned first for the longest time. At the very end, Google kind of trolled us by publicly releasing TPUs for the first time, and those massive computers that no one else had access to trashed our best entry and our best time. So I want to hear more about the competition, but before that, can you tell me a little bit about what fast.ai is? What's the basic program there, and what's the mission behind the organization? fast.ai is a nonprofit whose goal is to educate people about deep learning. Especially in those early years, it was starting to become more mainstream, but not necessarily as mainstream as it is today. The idea behind it is that Jeremy Howard and I believe that to get the best models you need good machine learning engineers, but you also need people who really understand the data those models are going to consume. If you want good models in radiology, you need very good radiologists who understand how machine learning works, so that they're going to be able to help you build those best models. The fast.ai course is both for coders who want to really dive deep into machine learning, but it's also an introduction that anyone who's interested can take to learn more
about what machine learning is, what those deep learning models are, and what they can do. So the basic idea is to democratize machine learning, so that all sorts of domain experts can know enough about it to really leverage it in a meaningful way. Exactly — you said it way better than I did. Let's get back to the competition. The end of the competition is a great dramatic story: Google gorilla-stomps on everything by dropping TPUs at the last moment. But what were you actually doing in order to get into first place before Google jumped in? A couple of things. The main one is related to the way we were training the model, and in particular the learning rate schedule. To take a little step back: when you train those machine learning models, initially your model is random, so it's outputting crappy predictions, but then you compute a loss and, from that loss, some gradients that are going to make your model a little bit better if you adjust the weights following those gradients. The whole process is called stochastic gradient descent. Right, and to say a higher-level thing about all this, this is just an example of a more general thing called function optimization. You have some function that you want to optimize — in this case the function is: given the set of model weights and the input you want to run on, you want to find the set of model weights that gives you the best, most accurate answer — and we approach this like we do, in some sense, almost any other kind of optimization problem, with techniques that go back 50 years or so: we're just going to compute a derivative, walk the model weights in the direction of the derivative, and do that over and over until we get to a more optimal result. Yes, exactly. The whole process, and the whole math behind it, has existed for 50 or 60 years; it's just that with GPUs becoming more and more powerful, we actually have the compute to apply that process to complex problems like deep learning. So that very important hyperparameter, the learning rate, is the size of the step we take following those gradients. At the time of the competition, the most popular learning rate schedules were very inefficient: you would train at a low learning rate for a very long time, then divide that low learning rate by 10 and train for even longer. That did converge to a good accuracy, but it was very inefficient. One of the things we had in our competition entry was to follow a learning rate schedule that is more like a warm-up, from a low learning rate to a high learning rate — not starting at a high learning rate, because otherwise the model immediately explodes, but warming up from something low and gradually increasing it to the maximum, so that the model can first learn a little bit of something, then spend some time at a high learning rate so we can explore the loss landscape efficiently, and then decrease it towards the end. That kind of schedule made it possible to train the model to the same accuracy, but much faster.
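To make the shape of that schedule concrete, here is a minimal PyTorch sketch using the built-in OneCycleLR scheduler, which implements this warm-up-then-anneal pattern; the tiny model, data, and hyperparameters are made up for illustration and are not the actual competition settings.

```python
import torch

# Hypothetical tiny model and data, just to make the loop runnable.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10), torch.randn(32, 2)

total_steps = 1000
# OneCycleLR warms the learning rate up from a low value to max_lr,
# then anneals it back down towards the end of training.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=total_steps
)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()   # warm up early in training, decay towards the end
```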
Why is that a better schedule? If you think about this without knowing a lot of the details, the idea that when you're very far from the answer you want to take large steps, and when you get closer to the answer you want to take smaller steps, seems intuitive. But here you say instead: actually, you start with small steps, then go up to big steps, and then go back down to small steps. So what about the structure of the problem makes that the right approach? At the beginning of the problem, since your model is randomly initialized, you are very high up in the landscape of your loss function, and you have very steep canyons. If you take small steps at the beginning, you can at least begin to descend into one of those canyons of the loss function, and then increase the learning rate to dive through it fast — and you will skip over a lot of local minima because your learning rate is large. Then towards the end you need that decrease, to step down further into one of those smaller parts of the loss landscape that have these local minima. So is the intuition here that when you start at a randomly initialized point, the terrain over which you're trying to optimize is just more wild, and if you take big steps the derivatives are very high and you're jumping all over the place — but even with a little bit of optimization away from that initial randomness, you end up in something that feels like a more regular space, and now you can go back to what makes more intuitive sense? Yes. It also depends — we're talking about this as if it were just a 3D problem, but we have millions of dimensions, because the model has millions of parameters. The idea is that on some of those dimensions the landscape is very, very spiky, so at least taking care of that at the beginning, with a low learning rate, is going to make the whole optimization problem easier, and then you can take larger steps. Yeah, I do think the intuition one has when one thinks about a problem like this — "I'll try to visualize it in two or three dimensions" — means you've just lost all of the important structure; you really need to think about the high-dimensional problem to know what's going on. That was one of the optimizations. The other optimization: it was a computer vision problem, and the kind of models we apply to those, which are called CNNs, for convolutional neural networks, can work on any size of image, because it's just some kind of filter that you apply all over your image. The idea is that at the beginning of training, the model is random, it's crappy, so it doesn't really need to see the whole picture — we gave it more blurry versions of the pictures, just 128 by 128 — and then gradually, as training goes on, we increase the size of those images up to the standard size people working on that problem were using. That gradual resizing also works well because, if at the beginning your images are smaller and you have your filter sliding all over the image, it's going to be more efficient with fewer pixels, compared to doing the whole training with the high-resolution images.
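Here's a rough sketch of that gradual-resizing idea in plain PyTorch, with a toy convolutional model and random stand-in data (the real entry used fastai's higher-level training APIs); the adaptive pooling layer is what lets the same network accept either resolution.

```python
import torch
import torch.nn.functional as F

# Toy fully convolutional model: adaptive pooling makes it resolution-independent.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in dataset: 256 random "full resolution" images with class labels.
images = torch.randn(256, 3, 224, 224)
labels = torch.randint(0, 10, (256,))

# Early epochs train on downsampled (blurrier, cheaper) images; later epochs
# switch to the full resolution. The same convolutional filters apply either way.
for image_size, epochs in [(128, 2), (224, 2)]:
    for _ in range(epochs):
        for i in range(0, len(images), 32):
            x = F.interpolate(images[i:i + 32], size=(image_size, image_size), mode="bilinear")
            y = labels[i:i + 32]
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
```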
Right, so there are two neat properties of convolutional neural networks coming into play here. Convolutional neural networks are, in general, a dimensionality reduction trick: you can imagine a big network that applies to all of the different inputs in all the different parts of the image, where you have weights that are individual to all the neurons associated with all the different parts of the image — but that's enormously wasteful, because in the early parts of the network you actually want the same regular structure over and over. So the basic idea of a CNN is that you lock those weights together: in some sense you just have one copy of this neuron, which is activated multiple times in multiple places. But then, once you've done that trick, you also have this resolution independence, where you can run it at multiple different resolutions, and you're saying: well, we're just going to train this thing at low resolution, and then, after it gets into the ballpark and we need to more precisely fine-tune it, we'll increase the resolution and do the rest of it. Yeah. And were these essentially new techniques at the time of the competition? Yeah, both of them were new techniques. The gradual resizing is still not that widely used; the new kind of learning rate schedule — now everyone uses that. All Transformer models, like GPT-3.5, and I think the latest GPT models too — not for sure, because OpenAI doesn't publish its research anymore, but the open-source versions of those are trained using that kind of schedule, and since BERT we have seen that kind of learning rate schedule all the time. So that's how you got into fast.ai, and you got into this competition space. How did you end up being co-author of this book? After collaborating on the fastai library and participating in the forums and that competition, Jeremy Howard eventually offered me a job at fast.ai, which I accepted. I worked there for two years, built a couple of versions of the fastai library and two iterations of the online course, and it was very natural to go from the course to publishing a reference book with kind of the same content, just in a different format, for people who prefer to learn from books instead of YouTube videos. Got it. And then what brought you to Hugging Face, and what is Hugging Face? Hugging Face is kind of the GitHub of machine learning. The idea is that there's a website that looks kind of like GitHub, except instead of having repos with code, you have repos with model weights. Llama 1, Llama 2, Llama 3, and the 3.2 that was released a couple of days ago are all on Hugging Face, along with — I think there are now a million public models — from all kinds of applications of machine learning: computer vision, text, speech, etc. The idea is that they're kind of at the forefront of open-source AI by allowing people to share those models. They've also built a couple of libraries, because model weights are all very well, but if you don't have the code to actually instantiate those models, they're kind of useless — so to complement that, they have libraries like the Transformers library, which actually contains the code for those models. And how did you end up at Hugging Face? In 2020 there was this thing that happened worldwide. Oh yeah, I think I remember that. Yeah, and that kind of disrupted some plans. So I looked for another job, and there was this startup founded by French people that was based in New York City — I knew them from the French tech community in New York City and I'd met them a couple of times before. They were looking to expand, so I applied at Hugging Face and joined them in June of 2020, as a continuation of my work in open source from fast.ai to democratize machine learning — lots of people use the Transformers library, or their website with all the public weights on it. So what kind of technical work did you end up doing at Hugging Face? A couple of things. The maintenance of the open-source libraries, because people are opening pull requests and filing issues kind of all the time, so that's already a huge amount of work. Then I developed new tutorials and new examples to help people use those libraries, and that ended with an online course that was meant to be taken after the fast.ai course, for people who wanted to specialize a little bit more in
Transformers. So there were those two aspects, and then at some point all the researchers at Hugging Face were kind of annoyed by our big black-box Trainer, which contained all the stuff of the training loop and had become, with time, this huge amount of spaghetti code, because new flags kept appearing to control everything people wanted to do with their trainings. So I created a new open-source library to make it much more lightweight to help people with their trainings, so that they could have more flexibility. The usual idea with APIs that train models is that you have a trainer API: you give it some things, like your model and your data, you click "train," and it trains — which is marvelous for people who just want that. But researchers who wanted to tweak the training loop a little bit were struggling a bit more. There are various techniques that have been applied for that in the past: in fastai we had a callback system, where the researcher could implement callbacks to change the behavior of the training loop a little bit at this particular point or another. The Hugging Face Trainer was less extensible. For that library, called Accelerate, I went back to: the researcher is just going to write their training loop, there's not going to be a black-box trainer, and they just need to change a couple of lines here and there to make it run on any kind of system. At first it was six lines, then five lines, and we tried to reduce that number of lines to the absolute minimum, so that there was as little intrusion as possible while still giving you the API from Accelerate. And when you say you want to make it possible for people to do their training on multiple different kinds of systems, what is the diversity of systems underneath that you're thinking about — what are the kinds of systems and the variations on training that you're trying to enable with Accelerate? Training requires a lot of data, usually, when you train those large language models, or even other kinds of models, and to make it more efficient you usually divide and conquer: if you have multiple GPUs, you give a slice of the dataset to each of your GPUs. Let's say you have n GPUs — then your training time should be reduced by n at the end of the day, because you've fully parallelized the thing you care about just by splitting your data this way. This is called data parallelism, and it's the first level of parallelism we can use when we have multiple GPUs and want to run a training on them. You can do that in PyTorch, except it requires some boilerplate code that is a bit annoying, so the idea of Accelerate was to remove that boilerplate: you just change a couple of lines in your training loop and, poof, your model can train on multiple GPUs. Also on TPUs, because of course the code to run the same kind of distributed data parallelism on TPUs is different from the one for GPUs — that would be too simple otherwise. And once you've done the modification, the code still runs on CPU as well. The idea is that it deals with all of that crap for you: detecting which kind of environment you're on, and then adding the boilerplate code that's needed for your training to run successfully on all of those kinds of systems. And then, if you want to train in a mixed-precision setting — because you want to use lower-precision types, which we can talk about later — it also dealt with the additional lines of code required to do that properly, kind of automatically.
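A minimal sketch of what that looks like with Accelerate, using a toy model and dataset: you wrap your objects with `accelerator.prepare` and swap `loss.backward()` for `accelerator.backward(loss)`, and the same script can then run on CPU, a single GPU, multiple GPUs with data parallelism, or TPUs, depending on how it's launched (for example via `accelerate launch`).

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()               # detects CPU / single GPU / multi-GPU / TPU setup

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 10), torch.randn(256, 2))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# The "couple of lines" you change: wrap your objects once...
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)            # ...and replace loss.backward() with this
    optimizer.step()
```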
Yeah, I think this whole discussion underlines the diversity of different hardware and setups you can use when you're doing training. There's, in some sense, the simplest thing: you can run your training on a CPU, which is a thing people did for a long time. Then there are multiple different parallel architectures: GPUs, which are literally descendants of graphics-programming chips, and TPUs, the tensor processors Google came up with. And the main game here, going from CPUs to GPUs and TPUs, is about parallelism. It turns out CPUs are these funny machines that have lots of parallel circuits, but they're interpreters for a brutally sequential programming language, so they're not that good at doing lots of things in parallel — and in fact there are all the complexities of multi-core architectures and such on that side, which is how you try to take advantage of parallelism there. But GPUs and TPUs are machines that are much more directly parallel in their structure, built for large-scale, highly regular parallel computations. And then at some point those aren't enough either, and you start getting into various forms of distribution: you want so much parallelism that you want multiple GPUs. The first thing you were talking about was this data-parallel training, where we're running stochastic gradient descent, picking random subsets of data, breaking them up into batches, training on individual GPUs, and computing a net gradient which we then use for updating the model. And then there's also pipeline-style parallelism, which you might need when your model itself is too big to fit — in fact, not just pipeline parallelism but various kinds of model-level parallelism, where you actually take the model, break it up, and split it among multiple GPUs, because even the model weights themselves are too big to fit. And Accelerate is trying to help you write your model once and your training loop once, and do a modest amount of modification to be able to access this whole sweep of different ways of doing the training. Yeah, exactly. If your model does not fit anymore on one GPU, you can split it in different ways. You can split the layers — if it's a deep learning model, usually when they're bigger it's because you've stacked more layers — so you have layer one on GPU 1, layer two on GPU 2, layer three on GPU 3, and so on. That's a good idea because then your model fits, but there's an inefficiency, in the sense that GPU 2 has to wait for GPU 1 to be finished to be able to process its results and pass them along to GPU 3. That's where pipeline parallelism comes into play: you try to pipeline things efficiently, so you give a little bit of your data to GPU 1, which is going to send its result to GPU 2, and then GPU 1 processes the second little bit of data while GPU 2 is busy computing the first part. And there's this ping-pong between the forward pass, when you run through your model, and the backward pass, where you compute all of your gradients, so you can also efficiently interleave some parts of the forward and some parts of the backward computation in that pipeline parallelism.
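To make the layer-splitting idea concrete, here's a deliberately naive sketch (assuming a machine with at least two GPUs) that places consecutive layers on different devices without any pipelining — exactly the setup whose idle time pipeline parallelism is meant to hide.

```python
import torch

# Assumes a machine with at least two GPUs ("cuda:0" and "cuda:1").
layer1 = torch.nn.Linear(1024, 1024).to("cuda:0")
layer2 = torch.nn.Linear(1024, 10).to("cuda:1")

def forward(x):
    # Naive model parallelism: layer 1 runs on GPU 0, then its activations are
    # copied to GPU 1 for layer 2. Without pipelining, GPU 1 sits idle while
    # GPU 0 works, and vice versa -- that's the inefficiency pipeline
    # parallelism tries to hide by streaming micro-batches through.
    h = torch.relu(layer1(x.to("cuda:0")))
    return layer2(h.to("cuda:1"))

out = forward(torch.randn(32, 1024))
```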
And then there is tensor parallelism, where instead of splitting your model by layers, you actually split the weights of your model into chunks, and each GPU only sees one part of the weights — so the GPUs need to come together and agree on the results of all the matrix multiplies they compute. This kind of parallelism requires a very efficient way to communicate between GPUs to be practical. That's right. Maybe the other interesting thing about the hardware around this kind of stuff is the criticality of the network: you need these very fast network transfers to do the tensor exchanges. There are some contexts where it can be a little less critical, because you can overlap compute and data transfer, but in something like this tensor parallelism the GPUs are just going to sit idle while you're waiting. So we nowadays have these kind of wild new networks, which have much, much higher capacity and are very focused on low-latency, highly deterministic data transfers. One of the things I think is interesting about this is the way the networking stack has changed. When I started learning about how you build high-performance trading systems, I learned that the operating system kernel is obviously too slow, so if you want to be reasonably fast you have to do kernel bypass — you have to have a user-level networking stack doing the communication. These systems use a technology called RDMA, remote direct memory access, and I think an easier way of understanding what's going on is that it's CPU bypass: the network data comes in on the NIC and then, without going through any CPU at all, gets copied directly to the place in memory it needs to go — maybe directly into GPU memory. So you're really cutting away all of the fat from the bones that you can, to make this stuff go as fast as possible. Yes. And even in the more recent hardware that Nvidia announced at the last GTC, you stack your GPUs as close together as possible, and you try to put as many as you can in a single cabinet — there are 72 GPUs in the same cab, very close to each other, so that you can have an even faster network between them. They stack the network in the middle, some GPUs above, some GPUs below, and there's this big NVLink in the back that links everything together very fast, just because they sit very close together. Yeah, you start caring an enormous amount about the physical layer at this point. Today we can get these NVLink setups where, inside a single box with, say, eight GPUs in it, you get this fast network, and what you're describing is doing this at the cabinet level. Yes. Which is funny — I remember hearing people talk about earlier hacks, not for machine learning but for other purposes, where people would basically try to make little supercomputers by unrolling the PCI Express network and spreading it over an entire cabinet. And in some sense InfiniBand sort of grew out of similar supercomputer networking fabrics, and indeed InfiniBand plays a real role in how these GPU networks work as well. Okay, that was the stuff you did at Hugging Face. More recently you've joined Jane Street — tell me a little bit about what your role here entails. Sure. At Jane Street I mostly work on the performance engineering side of machine learning. The day-to-day life is: a researcher will come to me with a model they've trained and say, "Oh, my training is going really slowly, could you help me with that?" We'll profile it together, try to identify the bottleneck, and make it faster. To take a step back,
most of the researchers here at Jane Street use PyTorch, which is the software used to write neural nets and train models, and which has the particularity of being really accessible because it's eager. The counterparts from Google, TensorFlow and JAX, are more like compiled languages, so it's harder to get started: you write your model, but then it does not compile, and you need to fix some of the operations that seem like valid Python operations but have to be modified so that TensorFlow or JAX recognizes them and sees, oh, this is what you're trying to do. In PyTorch you can do anything you want, but then your code can be inefficient in surprising ways — because a particular operation, for instance, has no implementation on the GPU, so the computer needs to transfer data back to the CPU just to be able to execute it, and then send it back the other way. And in general, especially on modern GPUs, the way PyTorch works is that when you want to execute a model, the CPU dispatches the operations to the GPU asynchronously, so the CPU immediately runs on to the next instruction. You're keeping your hardware in a good state if your CPU is always ahead of your GPU, so that the GPU always has lots of stuff to process. But as soon as your code requires some synchronization, because you need some data back from the GPU on the CPU, it can become pretty inefficient, just because you're stalling the GPU: the CPU will wait to get the data back, and then it takes time for the CPU to send new operations to the GPU to execute. Right, and that waiting, where the GPU is waiting on the CPU, is slow for a lot of reasons. It's slow because the memory transfers are slow, it's slow because CPUs are inherently slow, and then, oh my god, it's slow because the code that's running is written in Python, which is maybe 60 times slower than what the corresponding thing written in C might have looked like. Exactly. Even if you don't care about the GPU, you always want to write your Python code vectorized: you try to write as few for loops in Python as possible, because those will be very slow, whereas if you can execute an operation from, say, NumPy, which is backed by C or C++, it will be much faster. It's the same idea for the GPU, except on top of that you have the complexity of avoiding synchronization points between the CPU and the GPU as much as possible. And notably, when a C programmer says "I want to make sure this is vectorized," what they mean is "I want to make sure I'm using the SSE, AVX, whatever instructions" — the fundamental parallelism technologies baked into the CPU that let you do four or eight or however many computations in parallel. When a Python programmer says "vectorized," what they mean is that the inner loop is in C — and maybe it's also vectorized with AVX or whatever at the bottom, but the fundamental thing is getting away from the Python interpreter loop. Exactly. Sometimes you can have code that looks very innocuous but is actually executing a for loop that at every iteration triggers a synchronization between the CPU and the GPU, which is extremely bad, because you'll launch a tiny operation on the GPU, then have to wait for the GPU to finish it and get the result back to the CPU, then launch a new tiny operation on the GPU, and so on. And this is also slow because — one thing we forgot to mention — starting something on the GPU is itself slow: it takes time for the CPU to send the code of the kernel and all the inputs and outputs, and it takes a couple of microseconds, or sometimes even a millisecond, to get started and actually have the GPU doing the work.
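Here's a toy illustration of the kind of hidden synchronization point a profile would flag — summing losses with `.item()` every step versus accumulating on the GPU and synchronizing once (the model and data are placeholders):

```python
import torch

device = "cuda"
model = torch.nn.Linear(512, 512).to(device)
data = [torch.randn(64, 512, device=device) for _ in range(100)]

# Slow pattern: .item() forces the CPU to wait for the GPU at every step,
# so the GPU stalls between kernels.
total = 0.0
for x in data:
    loss = model(x).square().mean()
    total += loss.item()          # CPU-GPU synchronization every iteration

# Better: accumulate on the GPU and synchronize once at the end.
total = torch.zeros((), device=device)
for x in data:
    loss = model(x).square().mean()
    total += loss                 # stays on the GPU, no sync
print(total.item())               # single synchronization here
```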
It's maybe worth saying — we're throwing this word "kernel" around a lot, which is kind of a funny, GPU-specific word. Basically, a kernel is the small computational program that you typically run on the GPU. Writing these GPU kernels is actually really hard, because they're highly parallel and hard to reason about, so the programs tend to be numerically very intense but pretty small in terms of lines of code — you're not creating million-line code bases that run on the GPU; they're a lot tighter than that. Yeah, and you call those individual small programs kernels: a kernel to do a matmul, then a kernel to do some activation function in a net — each is just one Python line, which is then dispatched to the GPU to be executed in parallel. The thing that always kills me about this whole PyTorch story is that if you asked me to design something, I would definitely have designed something like TensorFlow or JAX. The basic idea of TensorFlow and JAX is that you're more or less hijacking Python as a metaprogramming system: you write what looks like Python, but what you're really doing is writing in some domain-specific language for expressing the computational graph that represents the program you're going to run on the GPU. The reason I would have wanted to do it that way is that it seems dramatically easier to make sure the thing is going to run fast. You can't take every arbitrary Python thing and make it run fast on the GPU, so you restrict yourself to some DSL where you can guarantee that things run fast, and the whole thing seems much easier to reason about — whether I'm staying inside the envelope of reasonable, fast programs. But PyTorch has pretty clearly won. JAX is new and exciting, and maybe it will get more mindshare over time, but TensorFlow was the big thing and PyTorch has been much more successful, and it just frustrates my intuitions as a person who designs APIs. Do you have a view as to why PyTorch won and things like TensorFlow and JAX are more niche? PyTorch won on flexibility. ML researchers want to easily fool around with various ideas — maybe at first it will be very inefficient, but they want to iterate really fast through their ideas and test quickly whether they're going to be worth it or not. Even if the first training run is inefficient, if the idea turns out to be a good one, then you can spend some time optimizing it and making it as fast as possible. PyTorch represents that model well: you can fool around very easily, and with that asynchronous execution model you still get the performance — unless your code triggers one of those CPU-GPU synchronizations, your code is still performant when you run it from PyTorch. So there's this flexibility, this idea that you can easily fool around. And they did come around to having a compiled thing: PyTorch 2.0 introduced torch.compile, which is kind of what people didn't like about TensorFlow, but they had to implement it in the end, because modern GPUs are really, really fast, and that programming model of "I'm just going to dispatch the operations asynchronously from Python" was starting to lose, just because the GPU was so fast that by the time your CPU had scheduled the kernel,
the GPU was already finished, basically, and even if you kept telling the CPU "this kernel, this kernel, this kernel" in a row, it would just not be fast enough for the GPU. The idea behind torch.compile is, again, to get the whole computational graph from your model and then try to identify places in that graph where maybe you're doing something that's very inefficient, and simplify the instructions — but more importantly, to try to take two consecutive instructions and fuse them together on the GPU, so that instead of launching a lot of small kernels on the GPU, you launch one big kernel that does a lot of work. This is very efficient, first because you don't pay the launch overhead, and second because very often kernels that run in a row read the data the previous kernel has just written. You have this inefficiency: I'm going to write something into GPU memory, and then immediately, in the next kernel, I'm going to read that GPU memory I just wrote. There are some cache systems on the GPU, but you still have some overhead doing that, whereas in a fused kernel you can just keep that data in registers on the GPU and not move it around when it isn't necessary. Right, so you get to skip both the memory transfers and the kernel launch time. Yeah, the kernel launch overhead. And they do this — which is kind of a crazy hack — by using another Python DSL called Triton, which is kind of a subset of Python in which you can directly write efficient CUDA kernels. It works well: if you want to write a fast matrix multiplication in Triton, it's relatively easy. And they have some crazy templates for basically all the operations you can do in a PyTorch model, and they fuse these templates, based on the graph extracted during torch.compile, to create big Triton kernels that execute big chunks of the model on the GPU at once.
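Using it from user code is a one-line change; this small sketch (toy model, made-up sizes) is just to show where torch.compile slots in:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).cuda()

# torch.compile traces the model into a graph and generates fused Triton kernels,
# so that e.g. pointwise ops like the GELU can end up fused with neighboring work
# instead of being launched as separate small kernels.
compiled_model = torch.compile(model)

x = torch.randn(64, 1024, device="cuda")
y = compiled_model(x)   # first call compiles (slow); later calls reuse the fused kernels
```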
Right. So maybe we should talk for a second about the programming-language ecosystem around GPUs. GPUs have a really interesting underlying computational model, and then there's a big collection of programming tools for it. Maybe you can walk us through some of the major pieces of this ecosystem. Yeah. If we start at the bottom level, the equivalent of C for GPUs is CUDA, which is a proprietary language from Nvidia that they developed to program their GPUs. AMD supports most of CUDA as well, because they kind of have to: they're a bit late in the game, and if they want people to adopt their products, they need to make sure the software is what people are used to. It's basically C, except you have those kernels that you write, which are executed in parallel on the GPU, and it comes with everything that's kind of a pain in C: you have to do lots of pointer arithmetic to make sure you're looking at the right data, you have undefined behavior every time you're not super careful, and it's pretty hard to debug. So it's a very low-level system, and it also exposes you to the performance characteristics of the GPU — not exactly directly, because it gives you some layer of abstraction, but you get to see a lot of the underlying details. And I guess one of the things that struck me, as someone who's mostly used to thinking about performance in a CPU context, is how different the concept of threads is on a GPU versus a CPU. I wonder if you can say a little bit about how someone coming to GPUs for the first time should think about threads. You will have lots and lots of them, for one: the GPU can launch a million threads pretty easily and execute them all in parallel. The idea is that you have blocks, which correspond to a physical block on the hardware where a bunch of threads is executed. And those threads are executed in groups called warps: when you write a kernel, each instruction is going to be seen at exactly the same time by 32 threads, which together form a warp. One block has a number of warps — any number of warps you want that's not too large; one block can accommodate 1,024 threads maximum — and then you can launch several of those blocks in parallel. The idea of that block layer is that it sits physically at one location on the GPU chip, so you can have some memory that is shared between those threads, which is useful, for instance, if you're doing a matrix multiply: you load some of the weights into that shared memory and then use them repeatedly with those threads to compute something, instead of accessing the same region of global memory several times. Right, so there's some more expensive, smaller, closer-to-the-thread memory that sits there to be shared among the threads that are on the same streaming multiprocessor. Right. And then maybe the other thing that's perhaps not obvious to someone who hasn't thought much about GPUs is that you also have dedicated registers. Yeah, up to a certain amount of registers — something like 64K for the whole SM — and you can have a program with lots of threads that each use few registers, or a program with fewer threads where each thread can use more registers. Right, and the critical difference here between CPUs and GPUs is that on a CPU you have a really small number of registers, and there's just one thread running on the CPU core using all of those registers; when you want a different thread to run, you have to swap it out of all the registers and swap the new thread in, so you have this fairly large context-switch time. Context-switch times on GPUs are incredibly small, and that's part of what enables this kind of massive multithreading. You have all of these different threads, and the threads are able to execute in these warp groups, so they can do stuff in parallel in groups of 32, but they also often end up being blocked — not typically on I/O, because the GPU is not doing I/O, but on memory: they need to wait for memory to be shuffled in, and the GPU can immediately grab some other group of threads that's ready and get them started. So you can hide a lot of the memory latency by having all of these threads consuming different pieces of memory concurrently. Yeah, that's the job of the SM: the warp scheduler is going to schedule warps — one onto a unit that does float arithmetic, another onto a unit specifically dedicated to matrix multiplies, which can do a small 16-by-16 matrix multiply for those 32 threads we just mentioned, others onto units that load something from global memory or from shared memory. Each instruction is dispatched onto one of those units, and immediately after it's finished, another warp takes its place. This way most of the latency is hidden from the user, as long as you can express your program in a way where you always have a warp computing something.
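As a concrete picture of that launch configuration, here's a toy elementwise kernel written with Numba's CUDA support (Numba comes up again below) — not anything production-grade, just enough to show the grid-of-blocks, block-of-threads structure:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, alpha):
    # Each thread handles one element; cuda.grid(1) is the thread's global index
    # across all blocks (blockIdx.x * blockDim.x + threadIdx.x in CUDA terms).
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= alpha

x = cuda.to_device(np.ones(1_000_000, dtype=np.float32))
threads_per_block = 256                     # threads in a block share an SM (and its shared memory)
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](x, 2.0)    # launch configuration: grid of blocks x threads
result = x.copy_to_host()
```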
And CUDA gives you direct, explicit, low-level access to more or less this computational model, in an unsafe programming model that's not especially clearly documented, can be hard to figure out and hard to understand, and when you get it wrong you just get weird undefined behavior and your program breaks in hard-to-understand ways. Okay, so that's CUDA — it's great and terrible. What else is there in the programming-language space? We mentioned PyTorch, and TensorFlow and JAX, which are kind of at the exact other end: something that's in Python, with all the good and the bad of Python, and that is then either going to compile the computational graph, on the JAX and TensorFlow side, or directly send instructions to the GPU, on the PyTorch side, by dispatching those CUDA kernels we just talked about. And in the middle there's a flurry of new languages, because, as it turns out, researchers love to hack and test new ideas, but they don't love to code in CUDA, for some reason. In the middle there are several languages like Triton, which sort of sit in Python-land, in the sense that the syntax looks like Python and some subset of Python operations is supported, but which are really DSLs to generate efficient CUDA kernels. We mentioned Triton is one of them. And I guess one thing about Triton is that it's in some ways not quite general-purpose — it's really good for doing things that vaguely look like matrix multiplies. Yeah. In general, modern GPUs are really, really good at matrix multiplies: they have those special cores called tensor cores, which are really efficient, and any way you can make your program look like a matrix multiply, you're going to get way more flops than if it's just regular floating-point operations. Triton is really good at programming those styles of arrays, matrix-multiplying them, or reducing them. If your model's computation is slightly different from that, sadly, very often Triton will not compile your code and won't necessarily tell you why, as the error messages are not always super clear. The debugging experience is also not always super nice, because you're not in Python anymore — it's generated CUDA kernels — so you can't really inspect the state of everything in the middle of it; you can try to print a bit of the stuff, but it kind of stops there.
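For flavor, here's the classic vector-addition example, essentially the one from Triton's own tutorials: Python syntax, but each "program" is really a block of GPU work. (The real wins come from matmul-shaped kernels; this just shows the programming model.)

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each Triton "program" handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)               # number of programs to launch
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(10_000, device="cuda")
b = torch.randn(10_000, device="cuda")
assert torch.allclose(add(a, b), a + b)
```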
There's also this weird decision that the whole machine-learning world has made, that we can have all the innovation we want on the programming-language side, but the syntax always has to be Python. Yeah, people are used to Python. You can try to move them all to another language — Google tried Swift for TensorFlow, to try to get Swift programmers into machine learning, or to move Python programmers to another language that's more efficient — but that didn't go so well. It turns out there's a whole ecosystem in Python, with all the rest of the libraries you need to process your data or inspect your results and things like that, so you can try moving researchers away from what they like, but usually they don't follow you. Another interesting language in the space, which I actually don't know a ton about, is Mojo — I'm kind of curious what your thoughts on that are. So, Mojo — I don't know a lot about it, so I hope people will excuse me if I make some mistakes — is kind of the same as Triton, except instead of wanting to be a new DSL inside Python, it's its own new language, which looks a bit like Python. The support for GPUs in Mojo is going to be released in a couple of months, from what I heard — it's not there yet — but the idea is that you'll be able to write those efficient CUDA kernels, like you do in Triton, in that language, Mojo. And since you're not trying to fit a DSL inside Python, there is going to be support for things like debugging, or maybe better error handling, just because you're writing in a language that was specifically designed for that, instead of trying to add that functionality on top of Python. Right, and I think, unlike writing stuff directly in CUDA, it's a safe language — I think it's got enough type-system support that if you do something crazy, it will actually try to catch it for you. The way I understand it, it's a little bit Rust-inspired — I think it has some of the same Rust-like mechanisms, lifetimes and things like that — and if it's following that kind of approach, I would expect them to try to make it actually safe. Yeah. And then you have other projects in kind of the same space: Mosaic GPU and Mosaic TPU are Google projects that do the same kind of thing, giving you a Python interface to create efficient CUDA kernels. And if you want to write CUDA kernels, but in Python, because we really love Python, there are languages like Numba where you do exactly the same thing you would do in CUDA — just the syntax is Python. Got it. Stepping away from this whole panoply of languages out there, how does this all play into the work you do here? A researcher working on a model has put it together in PyTorch, and it's not running as fast as they think it should, or as they hoped it would. What do you do? How do you approach these performance questions? First things first is profiling, multiple times, to identify — we talked about the CPU-GPU synchronization points, which are inefficient — a profile will show you those very easily, and you can track it down: oh, this instruction created a sync point by synchronizing GPU and CPU, so let's remove it, or let's try to find a way to remove it. Some of them are easy to remove, because you can express the same thing in a different way. Others can be a bit trickier. For instance, say you want your training to stop when your loss is NaN. If, after computing your loss from your data on your randomly initialized weights, you get a loss that is very large, then all your gradients are going to be NaN, and then all your model weights are going to be NaN, so basically your training is finished and completely dead — you might as well stop it and stop wasting GPU-hours on it. Even that tiny thing is kind of difficult, because when you write "if the loss is NaN" in Python, to know which branch of that if-statement the CPU should execute, it needs to know whether the loss is NaN or not, so it needs to wait for the GPU to have finished computing it to be able to inspect the value. You have a synchronization point here that looks difficult to remove. One of the solutions is to do that check, but in another thread: launch another thread that does the check, where that thread will be blocked on the GPU, but that's okay, because the main thread will continue executing the model. Maybe you'll do a couple of iterations "wrong," with weights that will ultimately be NaN, but that's okay, because your program will be stopped by that other thread. That's one example of something that's a bit trickier to remove.
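A rough sketch of that side-thread idea is below — the details like the check frequency are hypothetical, and a real version would probably reuse a single worker thread rather than spawning new ones — but it shows how the main loop only ever inspects a cheap Python flag:

```python
import threading
import torch

stop_training = threading.Event()

def check_nan(loss_tensor):
    # .item() blocks *this* thread until the GPU has actually produced the value,
    # but the main thread keeps queueing new work on the GPU in the meantime.
    if torch.isnan(loss_tensor).item():
        stop_training.set()

model = torch.nn.Linear(128, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(10_000):
    if stop_training.is_set():          # cheap CPU-side flag check, no GPU sync
        print(f"NaN loss detected, stopping around step {step}")
        break
    x = torch.randn(32, 128, device="cuda")
    loss = model(x).square().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:                  # hypothetical frequency; no need to check every step
        threading.Thread(target=check_nan, args=(loss.detach(),), daemon=True).start()
```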
The idea is that once you've removed all of those GPU-CPU synchronizations, your GPU is fed as fast as possible, and then the next step is that you can try to compile your model to access this world of kernel fusion we talked about just before, to make it even faster. In the process, you might also want to use different types for your floating-point operations. Most models were trained for a long time in float32, but we discovered that for deep neural networks, float16 actually gives you enough precision for the layers in the middle, as long as you do your sums carefully. For instance, when you do a matrix multiply, you can have the weights of both matrices in float16 and still have a result that's correct enough, as long as you do the accumulation of the sum in float32. That has led Nvidia to introduce, on their GPUs, very efficient matrix multiplies for float16, and now float8, or even FP4 for the new generation of Blackwell GPUs that are going to be released soon. Is float4 a real thing, or is that just a joke? I have no idea — it's on the slides, so... I'm not looking forward to float1, but it sounds interesting. It's either zero or one, or something — I don't even know. But without going as deep as that, float16 is really great, because you can train as much as 2x or 4x faster, depending on the shapes of your problem, basically for free, by doing this mixed-precision thing where some operations are computed in float16 and some are computed in float32 — just because you access those tensor cores, which are really specialized matrix-multiply units, and they do it really fast if the two matrices are in float16, or in this variant called bfloat16 that was invented at Google, the B standing for "brain."
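This is the standard PyTorch mixed-precision recipe, sketched with a placeholder model: the autocast context routes matmuls to the half-precision tensor-core paths while keeping precision-sensitive ops in float32, and the gradient scaler guards against float16 underflow (with bfloat16 you can typically drop the scaler):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()     # only needed for float16, not bfloat16

x = torch.randn(64, 1024, device="cuda")
y = torch.randn(64, 1024, device="cuda")

for _ in range(100):
    optimizer.zero_grad()
    # Inside autocast, matmuls run in half precision on the tensor cores while
    # reductions and other precision-sensitive ops stay in float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()        # scale the loss to avoid float16 underflow in the grads
    scaler.step(optimizer)
    scaler.update()
```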
So how do you think about the programming-language setup influencing the problem you have, of helping people build fast models? One thing you might imagine wanting is to have this split between "I'm doing the inefficient thing" and "I'm doing the efficient thing" be really explicit, so that instead of having to come to someone who knows a lot about performance, people could just look at their code, press the "is it fast?" button, and be told, "oh yeah, it's not fast here and here," and then move things around until it was fast. But it sounds like that's kind of not what's going on. What's going on is that everything just looks okay, and you can run it and such, but it's a little harder to figure out whether or not you're doing the bad thing. So is there something to improve there, to make it easier for the end users of the system to understand when they're doing the slow thing? It's kind of hard, and in that regard PyTorch is actually better than TensorFlow, for instance, because it lets you explicitly manage what data is on the GPU and what data is on the CPU. You choose when you do the transfers, unless there's an instruction — like the "if loss is NaN" we talked about — that creates a transfer. TensorFlow, for instance, does not even let you control what is on the GPU and what is on the CPU; it takes care of everything for you, because it has compiled everything, and it decides where your data lives and moves it. Sometimes that can result in inefficient code, just because the compiler decided that this line should be executed on the CPU and this line should be executed on the GPU, and sadly the compiler was wrong. At least in PyTorch you can fix things, because you get more fine-grained control over where stuff lives. An employee who joined recently and was in love with Keras, for instance, was like, "Hey, this thing in PyTorch is really great: I can choose where my data is and on which device, and move it when I want to, and it's not going to move back unless I ask for it." So there are two different dimensions along which you might want explicit control: one is about whether I'm doing a thing that can be put on the GPU or not, and the other is, even for a thing that could be put on either the GPU or the CPU, whether I can explicitly control where it goes. It sounds like it's more explicit on one side, in that it forces everything into the completely understood domain-specific language, but then the actual execution of that language has a bunch of compiler magic that you don't get to control in an explicit way. This echoes a bunch of stuff we're doing on the OCaml compiler side, where we're trying to do a lot to make OCaml faster, mostly by giving end users more explicit control, as opposed to making the compiler magically faster. Exactly, and for the same reason: when you're trying to enable performance engineering, the key to the realm is control. Which is also the idea behind Accelerate, in some ways — the key was to give researchers back control over the training loop, because they wanted to mess around with it. Then there's the problem that a synchronization that makes the code bad is something we only see by profiling, and we're trying to do a better job here at Jane Street of at least automatically profiling all the jobs, to identify things like: oh, by the way, this particular model takes a very, very long time to do this particular step — are you sure it's implemented efficiently? Maybe you should profile it. And we're trying to give everyone easy ways to profile and look at traces, to identify the bottlenecks. And when we've done all of that, sometimes researchers have ideas that cannot be expressed with the building blocks we have. If they want to do something that doesn't already have a fast CUDA implementation packaged in PyTorch, we need to dive deeper into the stack — Triton, which we mentioned, and then writing CUDA directly. Sometimes this is needed just because there's a specific layer a researcher invented, and they want to either try it or put it into production, and we need to make it as fast as possible. Right. And then there are a couple of other interesting features from CUDA that I've heard you all talk about a bunch: one of them is CUDA graphs and the other is CUDA streams. How do those fit in? So, CUDA graphs is something that CUDA released and that was used by PyTorch before torch.
compile. It was designed explicitly to remove that kernel-launch overhead we talked about earlier, when you're launching a lot of small kernels and paying that overhead for each small launch. CUDA graphs is a technology that lets you play that graph of kernels once, inefficiently, while it records all of the kernel launches; the next time you replay that graph, it removes all of that overhead, because it already knows what it has to dispatch and can do it more efficiently. So that technology is really useful for removing the overhead of launching a series of small kernels. So it gives you a lightweight form of fusion, where you're not really changing any of the operations — you're just taking all of the scheduling work and putting it on the hardware, so you never have to go back to the CPU to do it, and you don't do unnecessary copying of memory back and forth. Exactly, but you don't get the kernel fusion, which would give you the additional benefit of avoiding memory transfers. You're still doing both memory transfers: if kernel 2 requires something in memory from kernel 1, kernel 1 is still going to write it and kernel 2 is still going to read it. That's still there — the only way to remove that memory inefficiency is to fuse those two kernels, either by hand or using something like torch.compile — but you remove all the overheads, which is already nice. When you think about fusion in lots of programming languages, you can get rid of memory operations, but you can also sometimes get rid of other operations: you can do other kinds of optimizations across the two kernels, where, if I know I'm going to do this set of things, maybe some things can be merged together. So are you also getting that computational benefit sometimes? Yeah — if your two kernels had something in the middle that you didn't really need, you can remove that when you fuse them together. Usually the benefits come more from avoiding memory transfers, but in some instances maybe some intermediate state wasn't really needed, and you can avoid computing it. Got it. Okay, so that's CUDA graphs — what are CUDA streams? It's a way of parallelizing stuff within CUDA. When you build these CUDA kernels and then tell CUDA to execute them, it's going to execute them sequentially: kernel 2 is only going to be executed when kernel 1 is fully finished on the GPU. CUDA streams are the way to parallelize that: if you have two kernels and you know they can be run in parallel, because they don't touch the same data, you can launch the two of them in different streams, and they will be executed in parallel — at least up to a certain limit. You shouldn't use CUDA streams if you want to run 100 things in parallel; Nvidia told us it's not a good idea, and it's true that CUDA streams do not perform well at that. This API is exposed all the way up to PyTorch. So, for instance, if you want to do something like: I'm loading my data and I'm going to put it on the GPU, and in parallel I would also like to compute the predictions for my previous batch of data, which is already on the GPU — you can use CUDA streams for that. You have one stream that does the compute and one stream where you transfer the data from the CPU to the GPU, and if your model is written well, with no synchronization points, your GPU is fully utilized all the time, without any break.
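A simplified sketch of that overlap pattern in PyTorch is below; it glosses over details a production version would need (notably memory ownership across streams, e.g. `Tensor.record_stream`), but it shows the one-compute-stream-plus-one-copy-stream shape:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
copy_stream = torch.cuda.Stream()

# Pinned (page-locked) host memory makes the host-to-device copies truly asynchronous.
host_batches = [torch.randn(64, 1024).pin_memory() for _ in range(10)]
gpu_batch = host_batches[0].to("cuda", non_blocking=True)

for i in range(1, len(host_batches)):
    with torch.cuda.stream(copy_stream):
        # Copy the *next* batch on a side stream while the default stream computes.
        next_gpu_batch = host_batches[i].to("cuda", non_blocking=True)
    out = model(gpu_batch)                                   # compute on the default stream
    torch.cuda.current_stream().wait_stream(copy_stream)     # don't touch the new batch before its copy is done
    gpu_batch = next_gpu_batch
```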
So can I just think of CUDA streams as a coarse-grained threading protocol, where each of the threads itself has lots of little mini-threads on the inside? Yeah, kind of. It's more like a hint than a hard requirement: you're hinting to the GPU that it can run these two things in parallel, that it's safe — the GPU might choose not to do it sometimes. Okay. A lot of the different optimizations you've talked about here have been very focused on the GPU programming itself and the connection between the CPU and GPU pieces. What other parts of the process end up needing thought when you're trying to get the maximum performance out of a training run? We talked about CPU-GPU transfers; there's also networking, which we talked about a little bit as well. That's another part that's really important: if you're training on multiple GPUs, they need to communicate efficiently if you don't want to be bottlenecked by that. A thing that I've seen people spend a bunch of time on is not just making the fabric of the network efficient, but also organizing the data loading in a way that you're not going to stall out. This is a kind of general problem: GPUs are these incredibly productive compute machines, so they're very hungry — they want data as fast as possible. What do you need to do to make sure you can keep them fed? Yeah, data loading is definitely an important subject, especially when your data is asymmetrical: you have examples in your training set that are really long and examples that are really, really short, and you need to batch them together to do one iteration of that stochastic gradient descent we talked about before. There are lots of ways you can do that. For instance, you can just decide, "I'm going to take the long and the short together and pad everything," so you have a bunch of zeros in your tensor after the short sequence has finished, up until the end of the very long sequence — which consumes a lot of memory, so that's not super efficient. You can use a representation of tensors where you concatenate everything together but save the offsets at which each sequence sits, which is a more efficient memory layout. But even then, when you do this in distributed training, it's kind of sad, because if one GPU has to load a very, very long sample and the other GPUs have shorter samples, then, since they need to communicate to agree on what the aggregated gradient is, the GPUs with the very short samples are going to wait on the GPU with the long sample for a long time. So you kind of need to organize your data loading in a way that takes your distributed training into account, so that each GPU gets a balanced load of data and they all take about the same time to process their samples — at least so that when one of them has a long sample, everyone has a long sample; otherwise it's pretty inefficient. But then it might impact your training accuracy, because you're not shuffling your dataset fully randomly: you're doing a pseudo-shuffle where you still group things by size. So it's a trade-off between performance and accuracy, made by removing some degrees of randomness in your shuffle of the data.
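One simple way to implement that pseudo-shuffle is a plain-Python batching helper like the hypothetical one below: shuffle, sort by length within large buckets so that batches contain similar-length samples, then shuffle the order of the batches (not their contents):

```python
import random

def bucketed_batches(lengths, batch_size, bucket_size=1024, seed=0):
    """Pseudo-shuffle: shuffle indices, then sort within large buckets by length,
    so every batch (and every GPU's share of it) holds samples of similar length."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)                                    # keep some randomness
    batches = []
    for start in range(0, len(indices), bucket_size):
        bucket = sorted(indices[start:start + bucket_size], key=lambda i: lengths[i])
        batches.extend(bucket[j:j + batch_size] for j in range(0, len(bucket), batch_size))
    rng.shuffle(batches)                                    # shuffle the batch order, not the contents
    return batches

# Example: sequence lengths for a toy dataset of 10,000 samples.
lengths = [random.randint(10, 2000) for _ in range(10_000)]
for batch in bucketed_batches(lengths, batch_size=32)[:3]:
    print(len(batch), min(lengths[i] for i in batch), max(lengths[i] for i in batch))
```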
Yeah, one thing that's maybe not obvious is that a lot of these algorithms are structured around barrier synchronizations: you're going to do a bunch of stuff and then wait until everyone's done, then do a bunch more stuff and wait until everyone's done. Barrier synchronizations are super terrible if you have a lot of non-determinism, or just non-uniformity, in the different pieces that are going into meeting that barrier, because some people are going to get to the barrier first and wait on the others, and while you're waiting you're just not doing anything. Happily, GPUs are mostly pretty deterministic in terms of the amount of time it takes to do a given computation, but you have to feed them a computation of the same shape everywhere in order to get it really neatly lined up. And we were talking before, when you were asking me why PyTorch won: I think one thing that really went in PyTorch's favor is that when you have asymmetrical data, with different sizes in different batches, it's way easier to code that in PyTorch, which is more flexible, because compiling that kind of thing is really, really hard. Yeah, in TensorFlow or JAX you kind of need to go to some extreme lengths in your code to make your data the same shape again before you send it to your model. In PyTorch it's really easy: you can batch together small things, and if the next batch is going to be a very long thing, PyTorch is still happy, because it's eager and not compiled. Right, I mean, I guess this is always the problem when you go to some simpler, more highly structured domain-specific language: there are some things it's good at expressing and some things it's bad at expressing, and when you want to do the thing that it's bad at expressing, you can just be in a world of hurt. Yeah, exactly.
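As a small illustration of that flexibility (a made-up toy model, not anything discussed in the episode): in eager PyTorch every iteration can see a different batch shape with no recompilation, whereas a compile-and-trace workflow typically wants every batch padded to one static shape.

```python
import torch

# A tiny model that works on [batch, seq_len, features] inputs of any seq_len.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)

# Eager mode: each batch can have a completely different shape.
for seq_len in (5, 512, 12, 300):
    batch = torch.randn(4, seq_len, 16)
    out = model(batch)         # no recompilation, no padding needed
    print(seq_len, out.shape)  # e.g. 512 -> torch.Size([4, 512, 1])

# A traced/compiled graph, by contrast, is specialized to the shapes it saw;
# with very asymmetric data you either re-trace for every new shape or pad
# everything to one fixed maximum length.
```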
You know, you've spent a bunch of time in your career working on various kinds of open-source machine learning training ecosystems, and you now spend a lot of time internally working in our world. I think it's fair to say we are in various ways more immature than lots of other organizations; at the time when Google was already designing their own custom hardware for efficiently evaluating neural net models, we weren't really using neural net models at all. All of this effort on our side has really spun up in the last few years. We've been doing various kinds of statistically driven inference of trading strategies for as long as I've been at Jane Street, that's the first job I had, like 21 years ago, doing various kinds of optimization and model-fitting stuff, but with very different models that didn't have any of the same performance shape, and so all of our tooling around this is relatively new. I'm curious, for you as someone who's seen stuff in the outside world and seen our ecosystem here, what are the things that you see as the big gaps, the kinds of things that don't work as well here as they should, that you want to see us improve and that you want to work on? One nice aspect of the fact that we are newer to this machine learning stuff is that people are not necessarily aware of the things that are not performant, and they make a lot of mistakes when writing their code, so it's really easy for me to come in and spend a couple of hours on a project and be like, oh yeah, it's going to train four times faster, you just have to change those five lines of code. So it makes my job very easy in that regard. Sometimes it's a little bit more difficult than that, but there have been a couple of instances where optimizing a given training run was really, really easy just from profiling it once. Beyond that, we should improve our infrastructure around training loops in general, making the training infrastructure work better for researchers, because we're kind of making the same mistakes other people already made in the open-source world: these giant training loops with lots of spaghetti code that researchers end up not willing to use because they can't modify what's inside of them. It feels like sometimes we have that same problem internally as well. So do you think we need to do morally the same thing that Accelerate did, of trying to build a set of libraries that are highly configurable and modular, instead of having one training loop that everyone uses, and make it easy for people to build their own training loops for different use cases? Yeah, especially since people are very smart and really like to hack things together, it feels like a better solution for them than the magic training loop where you press play and your model just trains, which I can understand for people who are less familiar with machine learning. But at least for people who are deeply familiar with all the internals of machine learning and want to do deep research into every part of a training loop, they need something that's akin to Accelerate, where you just have small, composable building blocks that are very easy to use, and not this giant black box with 150 arguments that you have to pass in the correct order. Yeah, that's terrible. I'm talking about other training APIs out there, not giving a bad time to any engineers here, that one does not exist internally. Right, certainly there are pieces of code that we have, you know, some of which I've written, that have the property that you've hard-coded a bunch of concrete behaviors into them, and they have become ossified and hard to change, and it's certainly a problem that shows up.
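For a sense of what "small composable building blocks" can look like, here is a minimal sketch in the style of Hugging Face Accelerate (the loop structure, loss, and function name are made up for illustration; this is not Jane Street's internal tooling): the researcher keeps ownership of the training loop itself, and the library only handles device placement and distributed plumbing.

```python
import torch
from accelerate import Accelerator

def train(model, optimizer, dataloader, num_epochs=1):
    # The researcher writes (and can freely modify) the loop; Accelerator
    # moves things to the right devices and syncs gradients across GPUs.
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    model.train()
    for _ in range(num_epochs):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)  # gradient sync happens here
            optimizer.step()
```

The contrast is with a monolithic trainer object configured through a long argument list: here every step is ordinary code you can edit, reorder, or replace.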
Maybe it's worth just saying a few words about the way in which the problems we solve here are different from the problems solved in the outside world, and to talk about what role machine learning actually has here. To say some very high-level things: we use machine learning for a bunch of stuff. We use it for some general purposes in the way that any organization might; we have a whole AI assistance team whose job is to try and leverage various AI techniques for building various kinds of automations, a lot of it focused around LLMs and coding assistance, but not just that. So that's one kind of use case, and then we have a bunch of use cases that are very focused on trading. Even inside of the trading world, I think there are two major streams of applications. One is where we go off and try to extract data from the outside world in order to inform our trading, but we're using the same kind of data that has already been shown to be a good target for standard machine learning techniques: maybe we want to get data out of images or geospatial data or text data. There are all sorts of published models and architectures out there that you can use for this, and we are happy to leverage and fine-tune and exploit those existing things, and there the work we end up doing looks a lot like the work that people do on machine learning in the outside world; in some sense the magic is more about how we pick the data we're going to apply it to, and how we integrate that into the decisions we're making on the trading side. And then there are places where we are applying machine learning techniques to trading data itself, the data that you get from exchanges and various alternative sources of data that can inform that. And I'm curious, how do you think of that set of data as being different from the kind of data you typically see in the larger world of machine learning? That data is much noisier, so it's way harder to train good models on it, just because the signal you can extract from it is actually way weaker than in something very structured like text or images. Very often you'll never get the same kind of accuracies as what you get on computer vision or on text, but even extracting a very small amount of signal can still lead to good trading strategies, so you can still get valuable feedback from that kind of data. It's maybe worth saying, there are fundamental reasons why trading datasets are noisy, because the fundamental process of trading is one where, when there's a signal, people trade that signal, and that signal kind of gets removed from the data, essentially. So to a first order of approximation, the time series of the prices of a security looks kind of like a random walk; there's a little bit of signal in there, but it really is mostly noise. And so whatever your training technique is, it has to be able to deal with the fact that there's a ton of noise in the labels, essentially.
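A toy illustration of that point (purely synthetic numbers, not real market data): if returns are mostly noise with only a faint predictable component, even a feature that genuinely drives them explains almost none of their variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Returns are 1% signal and 99% noise (in standard-deviation terms).
signal = rng.standard_normal(n)
noise = rng.standard_normal(n)
returns = 0.01 * signal + noise
prices = 100 + np.cumsum(returns)  # the price path looks like a random walk

corr = np.corrcoef(signal, returns)[0, 1]
print(f"correlation of a 'perfect' feature with returns: {corr:.4f}")  # ~0.01
print(f"fraction of variance explained (R^2): {corr**2:.6f}")          # ~0.0001
```

Even that tiny R-squared can be worth acting on at scale, which is the "a very small amount of signal can still lead to good trading strategies" point.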
And so, yeah, that's one thing, it's very noisy. And as you said, it also changes and reacts to the way you use it. That's a difference with text, for instance: a BERT that was released in 2018, if you use it right now, is still as good as it was in 2018. The same is definitely not true for a model that you train on market data. The kinds of strategies that you could run a couple of years ago won't necessarily work right now, just because the market has reacted to them and adapted to them, so you need to come up with new modeling strategies all the time and reinvent yourself. Another aspect of that data that is different from the rest of the world is that it's huge: we have massive amounts of market data, think a couple of terabytes per day, so you multiply that by the number of days in a couple of years and you have a massive amount of data to feed your model. That brings its own challenges in terms of data loading, making sure it's efficient and that the GPU gets saturated. Right, and in practice the model sizes we tend to use are smaller than the sizes of the very largest language models, and so the overall ratios of flops per byte are just very, very different, and the things that people are optimizing for in the designs of the GPUs and the designs of the network are often not exactly the thing that we're seeing. We have to do a whole bunch of research, basically; we can't just rely on what's been done for other kinds of modalities like text or images. We need to invent new models that are adapted to market data, and new ways of loading that data and keeping the GPU fed as much as possible, and sometimes we care about algorithms that are completely different from the ones NVIDIA or PyTorch care about, because they're not necessarily used in LLMs, and everyone is all about LLMs these days. It gives us a good amount of programming to do in terms of GPU performance. So yeah, I think an exciting part of the machine learning world here in general is that there's just a wide variety of coming up with and experimenting with new architectures and new models and new ways of applying them to datasets, where there just aren't a lot of papers telling you how to analyze financial time series, because the people who are good at that mostly don't publish papers. I wonder why. Another thing that comes up, which is interesting, is inference times, right? We care about using these things in the context of trading, and the level of speed at which we care about responsiveness to some input is sometimes very, very small, and can vary by orders of magnitude depending on the application. Sometimes we care about turning around a packet in literally 100 nanos, and sometimes a few microseconds, and sometimes a few hundred microseconds is slow enough, sometimes milliseconds. There are some kinds of machine learning problems where getting an inference once an hour would be great, that's all we need, and sometimes even less than that. So you just have a wide variety of inference times you need, and at the very low end of the scale it's nothing anyone else cares about. Yeah, and that's why, as you were mentioning before, our models have various sizes: some are very small because we want them to run very, very fast, but even if they are small, there are challenges in making sure they can run in the time frame we need, to make sure we are as low-latency as possible. Right, and just to keep up with the data rate. Yeah, because there can be a million events in a single stock in one day, so if you're not fast enough to just process them... it might not be the case that we need the prediction very fast, but if you want to keep up and not get too far behind, you just need to be a couple of mics per event and not much more than that. So those are a whole bunch of differences between the kinds of problems we look at and what you've seen in other places. How do you think that influences the tooling we build? Are there ways in which we need to build different kinds of machine learning tools in response to the ways in which the shape of the problem is different? Yeah, we talked about data loading, for instance; that comes with its own challenges, so obviously we have developed custom data loading utilities that we can use to make this faster. We also talked about models that are not necessarily the same ones everyone else cares about for other kinds of well-studied modalities like text or images, so we have a lot of custom models written internally that we have found work well and that we keep trying to optimize for training and inference. So that's a bunch of exciting work; the rest of the training is mostly the same as in any other machine learning job, I guess: stochastic gradient descent has not changed, it's the same algorithm it was 40 years ago.
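For reference, that update really is just the classic step w ← w − η·∇L(w); a bare-bones version in PyTorch (illustrative only, with a made-up model and loss) looks like this:

```python
import torch

def sgd_step(params, lr=1e-3):
    # w <- w - lr * dL/dw, applied to whatever minibatch produced the gradients.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad
                p.grad = None

# Usage sketch: compute a minibatch loss, backprop, then take the step.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
sgd_step(model.parameters(), lr=0.01)
```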
Yeah. So another thing that you've done a lot of in your career is education, right? You were a math teacher for a bunch of years, and then you did a lot of education work both at fast.ai and at Hugging Face, and you're also involved some in the education story here. Can you tell us a little bit more about what that looks like? Yeah, Jane Street is trying to up its machine learning game, both by hiring more people who do machine learning and by educating the existing people in machine learning. We talked a little bit before about fast.ai and how it was important to make radiologists, for instance, competent at machine learning so that they can do radiology better with machine learning. It's the same here: we need to educate traders about machine learning so that they can do better trading using machine learning, and so they can inform the choice of models that the machine learning researchers then pick, because they know the data very well and they are kind of the domain experts. So we have a boot camp that we run every couple of months with either traders or researchers who are not super familiar with machine learning, and we try to bring them up to speed with the latest techniques, both from the outside world and from inside Jane Street. You mentioned this point of, in part you want to teach people who are not primarily going to be machine learning modelers as their main job, but who are experts in other aspects of the trading problem, and have them understand more about machine learning. That's one goal. I think it's also the case that you can just teach people the machine learning stuff; in some ways it can't be that hard to learn modern machine learning, because in some ways modern machine learning is 10 years old. I was saying, make them into domain experts so they can help, but some of them actually end up training models and doing a lot of machine learning themselves. It depends on whether they take a liking to it or not, because machine learning is a bit like cooking: you do a bunch of stuff, then you let your model train and stew for a while, and then you see whether it was good or not. It's not the same kind of thing as just programming a trading system, so some people like it and some people dislike it. Yeah, that makes total sense. So there are a lot of things you're trying to convey to people when you're running these classes and these courses; what are the things that you find hard to get across? The point that's most difficult to get across to people is that no one really knows anything about machine learning, it's really just a cooking science. We still don't know why neural nets generalize so well; there's a bit of theory explaining why they are able to fit the training data, but why are they any good on out-of-training samples? We still don't know why they're so good at generalizing. In general, you can try to build a little bit of intuition, like, to fix this kind of problem, I'm overfitting, so I'm going to try this regularization technique, maybe that will help. But there's always that maybe; it's not until you've tried it that you know for sure whether the thing is going to work, and that is really hard to convey. And then, trying to get people very disciplined about reproducibility. One mistake that beginners in machine learning make all the time is they train a model and then forget what they did to train that model, and so two months later it's, oh, I did train that model, it was good, I should try to train it again and maybe use it, and they never manage to reproduce their initial results, just because they didn't write down all of the stuff that was needed to train their model. Those two points are really the ones that are difficult to get across, because I guess you can't fully understand them until you have gone through the pain of them. You don't understand the importance of reproducibility until you've gone through your first reproducibility crisis. Yeah, exactly, then you fully understand why it's so important to save absolutely everything, down to the revision of the code, so that you can run the exact same thing at another time.
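A minimal sketch of that "write everything down" habit (the helper name, file layout, and config fields here are hypothetical, not an internal tool): record the hyperparameters, the exact code revision, and the random seed alongside every run.

```python
import json
import random
import subprocess
import time
from pathlib import Path

import numpy as np
import torch

def snapshot_run(config: dict, out_dir: str = "runs") -> Path:
    """Save what is needed to rerun this training: config, code revision, seed."""
    run_dir = Path(out_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    record = dict(config)
    # Record the exact code revision the run was launched from.
    record["git_revision"] = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    # Fix the seeds so the run is as repeatable as the hardware allows.
    random.seed(record["seed"])
    np.random.seed(record["seed"])
    torch.manual_seed(record["seed"])
    return run_dir

# Usage: snapshot_run({"learning_rate": 3e-4, "batch_size": 256, "seed": 0})
```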
Yeah, and I think some of the reproducibility stuff is about getting people on board with the discipline required. We've talked a lot about technology that meets the researchers where they are and makes it easy for them to express what they want, but there's some part of it where, if you want to be a good researcher, you actually just need an enormous amount of discipline, because the tools are somewhat imperfect and you just have to be really careful to get that reproducibility story right. And at the same time, I think there's also a lot of work we can do on the tool side to make reproducibility easier. Right, and I think it's complicated in part because the overall ecosystem is complicated: just managing Python packages is shockingly complicated. Making sure you didn't have an upgrade of a random package that broke everything, yeah, that's already difficult. Then making sure your code is checked out and that you know the revision of the code you are using, that's another thing, because you can change a small line of code in your model and think, oh, this is totally harmless, but then it actually destroys the performance of your model, because it was a key ingredient in your cooking recipe and you hadn't realized it. So yeah, make sure that your code is still there. And the last thing is, training involves a certain number of hyperparameters; usually people write all of that in some kind of config, so make sure you save that config somewhere, so that when you want to reproduce your training you actually know that you used this learning rate, this batch size, etc. I guess another fun source of non-determinism, to the degree that you're doing your research in Python notebooks, is the fact that you can run cells in notebooks in arbitrary orders, and if you do this in the wrong way you can end up computing a result that you have no record of, in terms of exactly where that result came from. Yeah, that's another kind of fun. Fortunately, notebooks are still a bit difficult to check into any kind of repo, so usually people move away from notebooks by the time it's time to check the code into some kind of infrastructure, so this issue kind of disappears, but while you are experimenting it's another fun source of irreproducibility. And then you have the GPU being non-deterministic at a fundamental level, because we are heavily parallel, so that's always fun when you're trying to debug exactly why the floating-point result at the end is not the same thing as what you were expecting, just because floating-point arithmetic is not associative and GPUs have many threads which may finish in any kind of random order. GPU training is in some sense non-deterministic because it's in parallel, but it's also in some sense non-deterministic just because we can tolerate it. You could do things highly in parallel and make sure that you're always doing things in the same order and do stuff to preserve the determinism, but it's usually at a huge cost in performance. It's at a huge cost in performance, and it doesn't totally matter, right? That's one of the interesting things about machine learning: because you're doing this kind of soft numerical optimization process, you can just take some error. Actually, a lot of the early research in various places used what's called Hogwild concurrency, where you just had shared model weights and things checking them out, producing new gradients, and updating them. And were there data races? Yes, there were, and it was kind of okay, but I think over time that's fallen somewhat into disfavor, because it's just even harder to predict what's going on, it's completely unreproducible, so you can end up with a model that's pretty good but you have no idea why, and you're never able to reproduce it anymore, so that's a bit annoying.
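The floating-point point is easy to demonstrate even without a GPU (a toy example; on a GPU the summation order is decided by how the parallel reduction happens to be scheduled, which is the same effect):

```python
import torch

# Plain Python floats already show it:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False: 0.6000000000000001 vs 0.6

# The same thing at tensor scale: summing identical numbers in a different
# order usually lands on a slightly different float32 value.
x = torch.randn(1_000_000)
print(x.sum().item(), x.sort().values.sum().item())  # typically differ in the last bits

# torch.use_deterministic_algorithms(True) is the "same order every time"
# knob, and it trades performance for that repeatability.
```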
Anyway, this was a lot of fun, thanks for joining me. Oh, thanks a lot for having me. You'll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com. Thanks for joining us, and see you next time.