So it's my great pleasure to introduce our speaker tonight, Horace He. Horace graduated from Cornell in 2020, and he works at Meta on the PyTorch team, specifically at the intersection of compilers and machine learning. If you've used torch.compile, the thing in PyTorch that makes your model go two to four times faster with one line of code, or if you've used FlexAttention, which lets researchers design fast attention kernels without leaving the world of Python, he is the person responsible for both of those things. You should also check out his blog, which is pretty awesome, in particular the post "Making Deep Learning Go Brrrr From First Principles". And without further ado, I'll hand it over to Horace.

Thanks for the intro. Today I'm going to give a talk about building machine learning systems for a trillion trillion floating point operations. My name is Horace He, and I'm on the PyTorch compilers team at Meta.

I think we live in pretty unprecedented times in terms of infrastructure buildout, which is nicely reflected in Nvidia's stock price. Basically every month we see a different headline about a new nuclear power plant from Microsoft, or a massive 300K-GPU cluster from xAI. When Amazon first bought a nuclear-powered data center it was big news, but now practically every cool AI company has its own nuclear data center. This massive buildout has also resulted in pretty ludicrous fundraises from startups. I remember that back in 2016, "making it" as a startup meant being worth a billion dollars: a unicorn was the mark of succeeding beyond your wildest dreams. In 2024 you have to raise a billion dollars just to get started, just to play the game, and most of it goes to Nvidia.

It's kind of crazy to think that all of these billions of dollars are really just there to do an absolutely insane number of floating point operations. Here's a nice chart from Epoch AI showing the growth of training compute over time. These floating point operations are really just big matmuls done over and over, for millions of iterations, over months. The leading-edge models today are trained with about 1e26 floating point operations, which is roughly 100 trillion trillion floating point operations.

Another effect you might have noticed: back around 2016, if you searched "ML" on Hacker News, you'd often get people asking about the ML family of languages. Nowadays you search "ML" on Hacker News and you get a very different kind of article. And with all these billions of dollars and yottaflops of compute, it's easy to forget that these operations need to actually run somehow, on real machines.
A modern stack might involve calling an LLM API, which calls PyTorch, which calls NCCL, Triton, CUDA, nvcc: all these different layers in the stack that somebody was involved in writing. I'm often reminded of that XKCD comic about the state of modern digital infrastructure, the whole tower resting on one tiny block. You often forget about all the infrastructure that was involved in getting you where you are, but really we're just building layers on top of layers.

If you work in systems, like I do and I suspect many of you do, I think you often think about your work a little bit like being a benevolent dictator of sorts. A lot of people build on top of your work, so if you can give a small improvement to millions of people, your small improvements can add up to a significant impact on the world. I imagine Guido with Python, sitting on top of his cloud, eventually handing down no-GIL, or faster CPython, or the walrus operator. That's often how I imagined infrastructure work, and it's a lot of why I got into infra work in the first place.

On the other hand, if you work in ML, things can sometimes feel a little different. I recently came across a post on Threads where someone said their main gripe with working on top of LLM APIs is that no one is really engineering anything; you're just chanting a prayer to a man-made demon deity to do your bidding. This is very similar to the shoggoth metaphor that has become popular in deep learning circles: we pour all these matmuls and computation into producing a very weird alien intelligence, then we RLHF it and present it to people through a nice convenient interface like ChatGPT. But if you think about it, we're working with shoggoths all the way down. Even if you work in systems and never call an ML model, you still sit on a massive amount of infrastructure that you presumably don't really understand, written by people you've probably never talked to, and the end result is exposed as some kind of simple interface where you can just import torch and then, hopefully, run your code across 100K GPUs.

As a result, there are some interesting ways in which ML models feel different from regular systems. One of them is that ML models are extremely simple, and as a result we have very high expectations for the performance of the code. I think you can trace this all the way back to a very nice article called The Bitter Lesson, which I'd really recommend reading if you haven't come across it. Its main observation is that clever ideas in machine learning have, throughout its roughly 50-year history, basically always lost out to simple ideas that scaled well with Moore's law.
And I'm not joking when I say that machine learning logic is exceedingly simple. There's a cool project from Andrej Karpathy called llama2.c, where he implemented Llama 2 in about 973 lines of C with no other dependencies: every single loop, every single matmul is implemented from scratch. In 973 lines you can't make it very fast, but it does run, and it indicates just how fundamentally simple the models we're spending all this compute on end up being.

So although the problems themselves are extremely simple, and in some sense very easy to optimize, the expectations for how well we optimize these matmuls are very high. The predominant metric for measuring model performance in deep learning is model FLOP utilization (MFU): the percentage of your GPU's theoretical peak FLOPs that your model actually achieves. If you think about it, this is an absurd metric to hold code to. On a CPU, you could only hit 100% if every core were issuing max-width SIMD instructions on every single cycle. If you look at the code most of you presumably write, almost no CPU code is anywhere near 100% of peak FLOPs; it's probably well under 1% most of the time. In machine learning, on the other hand, large-scale training runs typically hit around 50% of peak FLOPs. Even though the overall problem is very simple, the corresponding difficulty goes into making your model hit this very high performance bar.
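To make that metric concrete, here's a minimal sketch of how MFU is often estimated for a dense transformer. The "6 FLOPs per parameter per token" rule of thumb and the specific numbers below are my own illustrative assumptions, not figures from the talk.

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    # Rough rule of thumb: a dense transformer spends about 6 FLOPs per parameter
    # per token during training (2 in the forward pass, 4 in the backward pass).
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical numbers: a 70B-parameter model at 700 tokens/sec/GPU, on a GPU with
# roughly 989 TFLOP/s of BF16 matmul peak.
print(f"MFU: {mfu(70e9, 700, 989e12):.1%}")  # about 30%
```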
Another interesting observation about machine learning, which has exacerbated this, is that the field has consolidated significantly over the last five to ten years. Maybe ten years ago you had a lot more architectures, more variants, more things people were trying; nowadays Transformers are the dominant architecture for everything: Transformers for vision, Transformers for language, Transformers for audio. The other way things have changed is that instead of many different people training state-of-the-art models, only a few companies train them. We've gone from something like a monopoly, where one group provided the infra and many people used it, to something that feels more like a monopsony, where many people are trying to provide infra and only one person is actually training the job at the end of the day.

As a result, I think there are two general ways to think about getting performance from your systems. One is optimization: you have a compiler, you make the compiler faster, you improve the performance of everybody using your compiler. The other is programming models. As an analogy, instead of building a system whose job is to chop down a tree and then optimizing how fast it chops, you provide tools for people to cut down the trees themselves.

To talk about programming models, it's illustrative to look at how ML frameworks have evolved in terms of the programming model they expose to users. Originally, around 2010 to 2012, the first ML framework to get a lot of popularity was Caffe, and the way you expressed neural networks in Caffe was very declarative. By that I mean you edited a protobuf file, and the protobuf specified everything about your neural network. As you can imagine, programming protobufs is not very fun; there are a lot of things you might want to do that you can't do in a protobuf. So the natural next step was graph-builder-style APIs. This is roughly what TensorFlow 1 looked like: the idea was that no human should have to write protobufs by hand, so you write a DSL that generates the protobufs for you. But even this DSL still has a lot of confusion in it; it's not at all clear, given the DSL, how your code actually executes on the GPU.

Finally, around 2016 or 2017, PyTorch started to become really successful, and the feature most emblematic of PyTorch was what's called imperative, or eager, execution. It's worth talking about why eager execution was so successful, and I think the main reason comes down to what the programming model of execution looks like. In imperative/eager execution, you call a function, the GPU runs the function, and the function finishes. That's basically all there is: you call torch.mm, and that is the sequence of operations that happens.
With a graph-mode style approach, where a compiler interjects, you first define the function, the function gets converted into some intermediate IR, a bunch of who-knows-what happens to it, and then eventually the function executes on the GPU. The first model, the eager one, is a much simpler execution model for people to understand.

Another interesting thing to notice is that this simple model also describes how Python itself executes, and I think that is pretty illustrative of why Python has been so successful in machine learning. A funny claim you can make is that if I gave you a day to train a model today, it would run faster if you did it in Python than in C++. You might argue that's an unfair comparison, since all the frameworks people have built are in Python. But I think the reason so many frameworks have been built in Python is that Python is an exceedingly simple language, and therefore a very growable one. By growable I mean it's very easy for people to build their own infrastructure on top of Python without fighting anything the language does, because the Python language itself does basically nothing: it doesn't do any optimizations for you, it just takes your function and runs it. PyTorch historically sits at a very similar point in the design space. Its execution model is so simple that, although it doesn't automatically do much for you, it's very easy for people to build their own infrastructure and frameworks on top of it.

Another important detail, especially about PyTorch when it first came out, is that this unoptimized execution didn't actually sacrifice any performance. When people benchmarked PyTorch against TensorFlow or Caffe, PyTorch often wasn't even slower. There are two main reasons. The first is that back then, 90-plus percent of your time was spent in matmuls, i.e. matrix multiplications, and those are provided by vendor libraries like cuBLAS and cuDNN from Nvidia, which are very hand-optimized. If 90% of your time is spent in matmuls, what else can you even do to optimize your neural network? The second reason PyTorch's performance was quite good is its async execution model. The idea is that there's a parallel work queue on your GPU; the CPU is only responsible for scheduling work onto that queue, and the GPU executes work from the queue.
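To make the async execution model concrete, here's a small timing sketch (my own example, not from the talk): the Python loop only enqueues kernels, and the real cost only shows up once you synchronize.

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

t0 = time.time()
for _ in range(100):
    y = x @ x                    # each call just enqueues a matmul kernel and returns
t1 = time.time()
torch.cuda.synchronize()         # block until the GPU has drained its work queue
t2 = time.time()

print(f"time to enqueue 100 matmuls: {(t1 - t0) * 1e3:.1f} ms")  # small: Python is just laying track
print(f"time until they finish:      {(t2 - t0) * 1e3:.1f} ms")  # the actual GPU time
```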
The mental picture I usually have is that GIF of Gromit laying down train track from the front of the train he's riding. Gromit is Python, laying down track in front of the train, which is the GPU. As long as Gromit can put down track faster than the train rolls over it, Python effectively has zero overhead: it costs you nothing extra compared to driving the GPU from a more efficient language like C++. So eager execution not only had a much easier programming model for users to understand, it was also basically just as fast as non-eager execution.

Unfortunately, good things never last, and in 2017 Nvidia introduced what are called Tensor Cores. If you're unfamiliar with them, Tensor Cores are hardware units on the GPU that only do matrix multiplications. I don't mean that figuratively, in the sense that GPUs are well suited to matmuls; I mean it literally, in that there is an assembly instruction that performs a small matmul, and that is how you interact with the Tensor Cores. If you look at a plot of matmul FLOPs versus non-matmul FLOPs over time, you can see exactly when Nvidia realized deep learning was a big deal, because all of a sudden there's a massive gap, about 10x on a log-scale chart, between how fast matmuls are on the GPU and how fast literally anything else you might want to run is.

The end result: previously, matmuls were 90% of your time, but if Nvidia speeds up matmuls by 10x and everything else stays the same speed, suddenly you're spending a lot more of your time on non-matmul operations. Largely due to this change, we got ML compilers. An important detail about ML compilers, in terms of how they differ from the frameworks that came before, is that they keep the eager programming model. Logically, the programming model exposed to users is still that you write Python code and it executes line by line; the only difference is that instead of actually executing line by line, the code gets captured into a graph in some manner. torch.compile does this in a pretty interesting way: it intercepts at the Python bytecode interpreter level, using APIs Python exposes that let you insert your own frame interpreter. It looks very much like a traditional JIT for any other VM, except this JIT is only meant for PyTorch programs.
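As a small sketch of what that looks like from the user's side: the code below is ordinary eager-style PyTorch, and wrapping it in torch.compile is what triggers the bytecode-level capture. The function itself is just an arbitrary chain of pointwise ops I made up for illustration.

```python
import torch

def pointwise_chain(x):
    # In eager mode, each of these lines dispatches its own kernel.
    y = x * 2.0
    y = torch.relu(y)
    return torch.cos(y).sum()

compiled = torch.compile(pointwise_chain)  # TorchDynamo hooks the CPython frame evaluation
                                           # API, captures the ops into a graph, and Inductor
                                           # generates fused kernels for that graph.
x = torch.randn(8192, 8192, device="cuda")
print(compiled(x))          # first call traces and compiles; later calls reuse the result
print(pointwise_chain(x))   # same answer, computed op by op
```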
If you look at how things have evolved over time: originally you had frameworks like TensorFlow, and Caffe before that, where the programming model users wrote in was a graph-builder abstraction and the execution model was also graph execution. Then you had PyTorch-1.0-style frameworks, where the user programming model switched to eager style and the execution model was also eager, with each operator executed one at a time. Now, finally, pretty much all modern ML frameworks use an imperative, eager programming model, but almost all of them also have some way to capture that program into a graph of some kind so they can perform optimizations. That's jax.jit, that's tinygrad, that's MLX: they all have their own approaches for capturing the graph to optimize.

We've discussed how we got to ML compilers in the first place; next I want to talk about what ML compilers are actually doing for you, and what kinds of optimizations they perform. Generally speaking, the way I think about deep learning performance, or GPU performance in general, is that there are three things you can be spending your time on. The first is compute: time on your GPU computing actual floating point operations. The next is memory: time spent transferring tensors within the GPU, across its various memory subsystems. And finally overhead, which is everything else, like time your GPU spends idle.

First, compute. To a first approximation, all runtime on a GPU is either compute or shuffling data. Data movement is not a "real" operation; from a theoretical point of view it's a no-op that just moves data from one convenient place to another, so a floating point operation is essentially the only real thing a GPU can do. But you can simplify even further and say that nowadays all runtime is either matmuls or shuffling data. If you look at the actual FLOP numbers for an H100, you have only 67 teraflops of FP32 compute, but you have roughly a thousand teraflops of TF32 compute, which is matrix multiplication compute. What this means is that if you're not doing matrix multiplications on your GPU, you're only getting about 7% of your peak FLOPs. By the model FLOP utilization metric from before, even if your GPU were fully occupied doing things that aren't matmuls, you could only ever reach about 7% utilization, far below the theoretical peak.
I do have a brief interlude here about an interesting case where these abstractions break down even further. Here's a fun question: do the contents of your matrices affect your matrix multiplication performance? If you're familiar with performance work in general, there are a lot of ways data can affect performance, but matmuls avoid most of them: the memory access pattern is identical regardless of what data is in your tensors, there's no control flow in a matmul, and GPUs don't have denormals, so that isn't a possibility either. In this experiment we take three sets of tensors: one initialized with all zeros, one initialized from the Gaussian distribution, and one initialized from the uniform distribution on [0, 1). Funnily enough, if you benchmark this, you find there is a performance difference depending on the actual data inside your tensors.

There's a tweet from a long time ago that some of you might know; when I first saw this result I was very much reminded of it. I thought I knew how a GPU worked, and I was very confused about what could possibly be causing these performance differences between different data. The actual cause is what's called dynamic power, or switching power. Most of you are probably familiar with the fact that when a CPU or GPU is under load it draws more power, and at some point it throttles because it has hit the maximum power or heat it's allowed. But that power doesn't come from nowhere: it largely comes from dynamic, or switching, power, which means that every time a transistor on the GPU switches from zero to one or one to zero, it burns a little bit of power. The total power usage of your GPU is roughly the sum of all the switching going on across the chip. That's why, if you multiply by all zeros, the transistors essentially never switch, the GPU doesn't consume much power, and it throttles much less. If you measure it, you get noticeably different performance for all sorts of fun distributions, whether it's the normal distribution, a checkerboard pattern, or sparse or ternary data. The reason is just this abstraction-breaking chain: different bit patterns lead to more or fewer transistor flips, which leads to more or less power throttling, which leads to more or less performance.
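Here's roughly what that benchmark looks like, as a sketch. It assumes Triton is installed so we can borrow its do_bench timing helper; the shapes and dtype are arbitrary choices of mine. On a power-limited GPU the three timings come out measurably different even though the kernel and the memory traffic are identical.

```python
import torch
from triton.testing import do_bench  # benchmarking helper that ships with Triton

def bench(make_tensor):
    a = make_tensor((8192, 8192))
    b = make_tensor((8192, 8192))
    return do_bench(lambda: a @ b)   # time per matmul, in milliseconds

inits = {
    "zeros": lambda s: torch.zeros(s, device="cuda", dtype=torch.bfloat16),
    "randn": lambda s: torch.randn(s, device="cuda", dtype=torch.bfloat16),
    "rand":  lambda s: torch.rand(s, device="cuda", dtype=torch.bfloat16),
}
for name, make_tensor in inits.items():
    print(f"{name:>5}: {bench(make_tensor):.3f} ms")

# Same shapes, same kernel, same memory access pattern; the data only changes how
# many transistors flip, and therefore how hard the GPU gets power-throttled.
```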
I remember somebody once told me a funny story about this. They were training their machine learning model and benchmarking performance, and at some point their model would NaN, and then their performance would suddenly get way better. I wrote an article about this, and they messaged me saying it was very illustrative, because they had been really confused about why performance was improving. But that's why: if all your tensors are NaN, your transistors also don't need to do a lot of flipping, and you measure better perf.

So that's compute. The next thing your GPU can spend a lot of its time on is memory, which is essentially time spent transferring tensors within the GPU. One thing to observe, from an empirical plot in a paper on data movement that breaks down the operations in a BERT-style model into tensor contractions (i.e. matmuls), normalization operations, and elementwise operations: although matrix multiplications are responsible for 99.8% of the FLOPs, they're only responsible for 61% of the runtime. Why are we spending roughly 40% of our runtime on operations that cumulatively account for 0.2% of our FLOPs? The key is the memory bandwidth cost. Here I'm assuming all of your data already lives on the GPU, occupying its VRAM. But the VRAM is not where the compute units are located, and to actually do operations you need to move data from VRAM to where the compute units are, through SRAM and registers. I usually think of this as a factory and a warehouse: the factory doesn't have much floor space, and the warehouse, located much further away, has a lot more space; to do any operations on your supplies, you have to move them from the warehouse to the factory and then back. This cost of moving data around is the memory bandwidth cost, and it's responsible for a lot of what your GPU spends its time doing.

Imagine we do three operations on the GPU, say an add, then a ReLU, then a sin. What actually happens is: the GPU sends the data from memory to the compute units, turns the square into a triangle, and sends it all the way back; then it sends the triangle from the memory units to the compute units again, does the next operation, and sends it all the way back; and then, you get the idea, it sends the circle over and back one more time. By default, whenever you run operations in PyTorch eager mode, say an add, then a multiply, then a cosine, this is exactly what happens on your GPU. You might think this is a very dumb thing to do, and you would be correct: why are we sending the triangle back from the factory to the warehouse, just to ship it from the warehouse back to the factory again?

So a very common thing for GPU compilers to do, and what I'd call the most important optimization in a deep learning compiler by far, is operator fusion. Instead of sending the data back and forth so much, you emit a single GPU kernel: you send the data to the factory once, do all of the operations, and send the result back once.
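Here's a minimal sketch of what such a fused kernel looks like in Triton. It's my own toy version of an add-multiply-cos chain, not a kernel from the talk: the inputs make one trip from DRAM to the compute units, all three ops happen in registers, and the result makes one trip back.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_mul_cos(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)         # one trip from the "warehouse" (DRAM)
    y = tl.load(y_ptr + offs, mask=mask)
    result = tl.cos((x + y) * y)                 # all three ops happen in registers
    tl.store(out_ptr + offs, result, mask=mask)  # one trip back

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
fused_add_mul_cos[grid](x, y, out, x.numel(), BLOCK=1024)
torch.testing.assert_close(out, torch.cos((x + y) * y))
```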
Notably, this optimization is not something you can really do in eager mode. The eager execution model is "you run an operation, it executes", and if we want to do this optimization, that programming model is no longer sufficient.

There are a lot of different ways to minimize memory movement, and although operator fusion is the most important thing an ML compiler can do, there are many decisions within fusion that make it more or less effective. One example is the recomputation-versus-reuse tradeoff. If you're familiar with register allocation, you have a similar issue there: given a value in a register, you can either spill it to memory and load it later, or just recompute it from values already in registers. The same idea applies here. There are often cases where doing some amount of recomputation significantly reduces your number of memory accesses, so recomputation can not only reduce your peak memory usage, it can often improve your actual runtime as well.

One reason this recomputation-versus-reuse tradeoff ends up mattering so much for deep learning performance is that the shape of machine learning programs is quite unusual compared to typical programs. It's generally a bit of an axiom that most intermediates are very short-lived: your program consists of lots of values that are created and then destroyed shortly afterwards. In machine learning this is not the case, because a typical model first runs forwards (layer zero, layer one, layer two, layer three, layer four) and then runs what's called the backwards pass, which goes through the layers in reverse (layer four, layer three, layer two, layer one, layer zero). In between the forward pass and the backward pass, you need to save what are called intermediates, or activations. There are a lot of them, and they're often what's responsible for running into out-of-memory errors. It's a pretty unusual program structure, and it's caused by backpropagation and gradient descent.
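One concrete, user-facing version of this recompute-versus-reuse tradeoff is activation checkpointing in PyTorch. A minimal sketch with an arbitrary toy module is below; it isn't the compiler-internal recomputation described above, but it's the same idea: trade extra FLOPs in the backward pass for fewer saved activations.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
x = torch.randn(64, 4096, device="cuda", requires_grad=True)

# Default: every intermediate inside `block` is kept alive until the backward pass.
y_saved = block(x)

# Checkpointed: only the input is saved; the intermediates are recomputed during
# backward, reducing peak memory at the cost of an extra forward pass.
y_recomputed = checkpoint(block, x, use_reentrant=False)

(y_saved.sum() + y_recomputed.sum()).backward()
```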
Finally, the last thing you can be spending your time on is overhead. If poor Gromit can't put down the train tracks faster than the train rolls over them, sometimes the train is just stuck waiting for him to place the next piece of track. Here I have a profiler trace where the bottom row, the GPU trace, is largely idle: it's mostly just waiting for the CPU to schedule the next operation. There are a lot of ways to address this nowadays. One of the most powerful is CUDA Graphs, an Nvidia-provided API, but there are also other approaches, like having a compiler codegen a lower-overhead wrapper.
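As a sketch of the CUDA Graphs approach through PyTorch's wrapper (following the warmup-capture-replay pattern from the PyTorch documentation; the toy model and shapes are mine): all the kernel launches for the model get recorded once, and each replay afterwards is a single CPU-side call.

```python
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(20)]).cuda()
static_x = torch.randn(32, 1024, device="cuda")

# Warm up on a side stream (required before capture).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the whole sequence of kernel launches into one graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_x)

# Replay: copy new data into the static input, then launch all 20 layers' kernels
# with a single call, so no Python or dispatcher overhead sits between kernels.
static_x.copy_(torch.randn(32, 1024, device="cuda"))
g.replay()
print(static_out.sum().item())
```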
So I've talked about ML compilers, what they can do to your program, and how they can be useful. But an interesting question comes up. I've talked about how we have this super massive infra buildout, how the programs are very simple, and how much consolidation we've seen in what the architecture looks like. So a reasonable question is: if you only have one architecture and you're spending billions of dollars to train it, why do you even need a compiler? Why can't you just assign a group of people to optimize it by hand instead of leveraging a compiler? And in practice, a lot of the time people do not use compilers, for basically this reason. So this section is about why I think that's the case, and what some of the challenges are when it comes to using compilers in this setting.

A disclaimer: I really do like compilers, and I'm about to say some mean things about them, but to establish my credibility, I work on a team called PyTorch Compilers. To be clear, there are a lot of reasons compilers can be very useful, in particular this notion of leverage: do the optimization once in the compiler, and everybody can take advantage of it. Compilers are also very fun to work on. That being said, let me introduce my new exciting library: Horace's Exciting Library. It has a couple of features you might be interested in. First, it doesn't always work. To address that, it has no documentation about why or when it will work, except by reading the library's implementation. And in exchange, when you update the library, it may totally change which code works and which doesn't; that is, there are no backwards compatibility guarantees on whether your code works. So, are you interested in using my library? I'm guessing most people would say no.

The thing to note is that if "works" means "has the desired performance and applies the desired optimizations", this largely describes how compiler optimizations behave. Compiler optimizations don't always work, there's usually no real documentation on when an optimization will trigger, and when you update your compiler, it may completely change when it does or does not apply them.

There's a very influential (to me) article about a compiler called ISPC, with a section titled "auto-vectorization is not a programming model". The overall framing of the article is that the author wrote this compiler, ISPC, which you can think of as CUDA for Intel SIMD instructions, and throughout the story he is constantly fighting the Intel compiler team, where he worked, because they wanted to rely on auto-vectorization to get vectorization done instead of introducing a new programming model like ISPC's. He articulates the problem with an auto-vectorizer like this: as long as the vectorizer can fail (and it will), then if you're a programmer who actually cares about what code the compiler generates from your program, you need to deeply understand the compiler and its auto-vectorizer, and when it fails to vectorize code you want vectorized, you need to either poke it in the right way or change your program in the right way so that it works for you again. This is a very horrible way to program, and if any of you are into using SIMD instructions, you probably also don't trust the auto-vectorizer at all and mostly just write intrinsics.

With a proper programming model, the user can learn what does and does not work without being tied to the implementation: a compiler implements the model, and the user can learn to reliably rely on the optimization while only needing to understand the programming model, not the compiler. One way to phrase this is that a compiler optimization that always works is just part of the programming model. For example, when you write SIMD intrinsics, you know that a SIMD intrinsic will always map to a SIMD instruction; that's part of your programming model, not an optimization the compiler may or may not do.

When I see people complaining about shoggoths and working with them, I'm often reminded of compilers. A compiler is a large piece of code, worked on by a lot of very smart people, with a lot of tricky implementation details inside. Ideally, only the programming model is exposed to the user: the compiler implementation stays completely hidden, and the user only deals with the nice programming model, which is usually just the language the compiler is compiling from.
Unfortunately, when compilers fail, when they don't apply the optimization you want them to apply, the entire thing becomes exposed. This is what's happening any time you find yourself wrestling with a compiler, trying to understand why it did or did not inline your code, and so on.

To give some examples of cases where nuanced details cause compilers to struggle: one of them is numerics in machine learning. Floating point arithmetic is a pretty cursed thing to deal with in general, and it has only gotten worse as Nvidia, and not just Nvidia but everybody in the industry, keeps pushing data types to lower and lower bit widths. The V100 introduced 16-bit operations as the default, the H100 introduced 8-bit floating point operations, and on the B100 they're now pushing 4-bit floats. It's reasonable to ask how a 4-bit float is even a floating point number at this point, but that's what we're dealing with. As a result of this low precision, and the fact that numerics are so subtle, you often end up with very annoying numerical problems.

A good example that was very frustrating for me was a NaN in a FlashAttention implementation, where the underlying cause was an FMA, a fused multiply-add, which can actually be disastrously bad for your numerics. For folks unfamiliar with FMA: it takes a, b, and c and performs a * b + c as a single operation. One way an FMA differs from doing the operations separately is that, in addition to being faster, it is computed with what people call infinite precision internally: the result of the FMA is the closest possible representation of the true value of a * b + c. That's different from doing the operations separately, where the multiply gets rounded before the add. In this particular case, we're computing exp(a_i * b - max_scale), where max_scale is a single scalar that's the result of taking the maximum across the a_i * b values. The idea is that exp is a very numerically unstable operation, so to make sure we never exponentiate anything large, we subtract out the maximum; that way the largest argument exp can ever see is zero. The compiler, in its wisdom, helpfully rewrites this as exp(fma(a_i, b, -max_scale)); it's smart enough to fold the multiply and the subtraction into an FMA. But if you look at the numerics, max_scale is a rounding of a_i * b, and we end up computing exp of the exact a_i * b minus the rounded a_i * b, which can be disastrously off. The end result is that we NaN with FMAs turned on and don't NaN with FMAs turned off, even though FMA is theoretically a precision-improving optimization. To summarize the underlying cause: our numerical properties relied on two separate computations of the same value being exactly identical, and the FMA can get applied to one of those computations but not the other, which in this case leads to a NaN.
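Here's a tiny self-contained illustration of that failure mode, with made-up numbers (math.fma requires Python 3.13 or newer): computing "a*b minus a rounded copy of a*b" gives exactly zero without an FMA, while the FMA version recovers the rounding error of the product, which in the attention kernel lands inside an exp.

```python
import math  # math.fma requires Python 3.13+

a, b = 1.0 / 3.0, 1.0e20
max_scale = a * b            # stands in for the row max: a *rounded* copy of a * b

without_fma = a * b - max_scale          # rounded product minus itself: exactly 0.0
with_fma = math.fma(a, b, -max_scale)    # exact product minus rounded product: the rounding error

print(without_fma)   # 0.0, so exp(0.0) == 1.0, as the kernel intends
print(with_fma)      # a nonzero leftover on the order of 1e3 in magnitude; when it comes out
                     # positive, exp() of it overflows to inf, and the softmax normalization
                     # then turns inf/inf into NaN
```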
Another thing compilers often struggle with is algebraic rewrites. If you're familiar with the optimization called FlashAttention, which fuses the attention operators together, it relies on a somewhat subtle rewrite that people call online softmax: softmax is typically implemented with a couple of global passes over the data, but you can rewrite it in a way that removes one of them. This requires some amount of math. It's not very hard math, but compilers are generally very bad at math, so it's difficult for them to come up with this rewrite by themselves.

FlashAttention is a very good example of a case where you need to think about the programming model you expose. Before this fused attention kernel came about, people just wrote attention as a matmul, then a softmax, then another matmul. That wasn't super efficient, but it was as efficient as you could make it, so people were happy with it. With FlashAttention, we now want to fuse those three operations into a single kernel, so the difficulty becomes: what API do we expose for it?

One way you might tackle this is pattern matching: have the compiler find the sequence matmul, softmax, matmul and pattern-match it into a single fused attention operator. That's a reasonable option, but it becomes very frustrating to debug. For example, the user might change how they write softmax, say by reimplementing it themselves instead of using your vendor-provided softmax, and all of a sudden your pattern no longer applies; their memory suddenly blows up, their code is 3x slower, and they're sad. That's very frustrating for users to deal with. So instead, what PyTorch did in this case is say: these three operators need to be fused into one, so we'll just directly provide that single operator as an API you can call. This is one common way to deal with programming models: introduce a slightly higher-level thing that does one very specific job for users. But this is also still somewhat frustrating, and you might have questions about whether it's actually good enough.
Because when you consolidate multiple, more primitive APIs into a single monolithic API, you often run into the problem that the monolithic API can no longer represent everything users want to do. With attention, we do see people constantly coming out with new attention variants: sliding window attention, ALiBi, paged attention, neighborhood attention, all this stuff. You look on Twitter and four new attention papers come out every single week. As a result, the fused attention kernels people maintain keep accumulating new kwargs. This example is from the FlashAttention repo: it now has dropout, a softmax scale, a causal flag, a window size, a softcap, ALiBi slopes, and so on, and some of these were added just in the past couple of months. And once you've added a kwarg to an API, or to a programming model, you can't remove it anymore, because now that would break what users rely on. Even worse, even though people have been pretty aggressive about adding new kwargs, it still doesn't end up being enough. You have users constantly complaining that nobody has implemented FlashAttention for their pet attention variant, like "there's no FlashAttention for prefix LM"; that bottom quote is from a blog post by somebody who used to be at Google, complaining about the ecosystem outside of Google. The point is that a single monolithic operator is not actually sufficient either.

So we're in a situation where we don't really want to rely on a compiler to generate this kernel from scratch, but it's also painful to do the modifications by hand. This might seem like a no-win situation, where you must choose either unpredictability, by doing it as a compiler optimization, or a single monolithic API that's difficult for users to modify. But this is where you can be clever and come up with a new programming model that is neither of the programming models users had before. The thing to notice is that this custom kernel, which is difficult for a compiler to generate from scratch, can actually be decomposed into a handwritten, complicated FlashAttention kernel plus a bunch of trivial modifications from users that can be generated very mechanically.

So here's an API we introduced recently called FlexAttention. I really like FlexAttention, and we've seen a very positive reception from the community, including from users who traditionally don't like using compilers very much. One of the reasons FlexAttention is so liked, even though it relies on torch.compile to work, is that it's guaranteed to always result in a single fused attention kernel, with the same memory properties as a fused attention kernel. This means the user now has a programming model they can rely on: they can program against it and try a bunch of different variants that all fit within it, without needing to understand how the API is actually implemented under the hood.

To give you a peek at what the API looks like: if you're familiar with sliding window attention or causal attention, you can implement causal attention by checking whether your query position is greater than or equal to your KV position, and implement the sliding window by also checking that the distance between them is within your sliding window size; then you just AND these masks together to get sliding-window causal attention.
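A sketch of what that looks like with FlexAttention, based on my understanding of the torch.nn.attention.flex_attention API (check the PyTorch docs for the exact signatures; the shapes and window size here are arbitrary): the mask is just a Python function over index tensors, and the library turns it into a single fused kernel.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

WINDOW = 1024

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

def sliding_window_causal(b, h, q_idx, kv_idx):
    # AND the causal mask with the "within the window" mask.
    return causal(b, h, q_idx, kv_idx) & (q_idx - kv_idx <= WINDOW)

B, H, S, D = 4, 16, 8192, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
flex_attention = torch.compile(flex_attention)   # compiling is what produces the fused kernel
out = flex_attention(q, k, v, block_mask=block_mask)
```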
I think this is a good example of where compilers can provide a lot of value for users even in these very handcrafted scenarios. Prior to this attention API, you had a big cube of attention variants: masking schemes, positional biases, whether training or inference was supported, and only some of the cells in that cube were filled in by people who had manually implemented attention kernels. With FlexAttention, every one of those cells is filled in, so users can consistently rely on fused attention kernels regardless of which attention variant they're using. As an example of what people have been doing with FlexAttention, the last one here is a fun attention mask from a guy on Twitter: he had a bunch of molecular graphs of different sizes and was able to convert them into an attention mask that FlexAttention supported. It's a very weird mask that I definitely did not think about when developing FlexAttention, but when you've built a good programming model, users are often able to do things you never considered when you developed the abstraction.

Another analogy I sometimes think about for optimizations versus programming models is the famous Grothendieck line about tackling math problems. He imagined a math problem as a walnut: you can either open the walnut by hammering on it until it cracks, or you can soak the walnut in water, building new mental models for how to think about the problem, until eventually the walnut opens by itself. I think about programming models very similarly. Take auto-vectorization from the ISPC author's perspective: he has a fun anecdote where the Intel people kept asking what happens when the CUDA compiler fails to vectorize, and were baffled by it, because they felt it was something they needed to understand in order to change their compiler to be more competitive.
But the misunderstanding is that this is almost a nonsensical question to ask. When does the CUDA compiler fail to make your code parallel? If you've written your code in CUDA, in the CUDA programming model, it must be parallel; that's essentially axiomatic to the programming model. There is no way to write a CUDA program that does not execute across multiple cores. You can write CUDA programs that are incorrect, or that deadlock, but they're still guaranteed to run across multiple cores, because the parallelism is inherent to the programming model in CUDA, while the parallelism is not inherent to the programming model of auto-vectorization.

ML compilers, and optimizing GPUs in general, are already quite difficult, but you get a lot more issues when it comes to making GPUs run at scale. At scale, there are a couple of interesting differences between distributed ML programs and traditional distributed systems. Traditional distributed systems are often just trying to scale QPS: you have lots of very small, interchangeable queries, the performance per unit of hardware is not that critical, and you're often willing to double your hardware in order to get fault tolerance or higher uptime. In ML systems, by contrast, we basically have a single query, your training job; we have frequent global synchronization across all of our hardware; and performance is so important that there's no way we could tolerate a 2x loss in performance for basically any reason.

One way to think about parallelizing programs, and this is not specific to ML, is that you have a cube of computation to do, with three dimensions. One is the batch dimension: the different tasks you can perform. One is the within-task dimension: the different operations inside a single task. And one is the time dimension: what your GPU is doing at any given point in time. Data parallelism splits along the batch dimension, task parallelism splits along the task dimension, and pipeline parallelism splits along the time dimension.

For data parallelism, the basic idea is pretty simple: we have a bunch of different tasks with lots of parallelism between them, so each GPU just handles a different task; you put one task on each GPU and run it. That seems very nice and very trivial, but the issue is that we have a massive synchronization of parameters after every step, because after every step the gradients need to be synchronized across all of your hardware.
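A minimal sketch of that per-step synchronization is below. The helper is hypothetical and assumes torch.distributed has already been initialized (e.g. via torchrun); it is not how DDP or FSDP are actually implemented, since both bucket these all-reduces and overlap them with the backward pass.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, local_batch, optimizer):
    # Each rank runs the same model on its own shard of the global batch.
    loss = loss_fn(model(local_batch))
    loss.backward()

    # The massive synchronization: average gradients across every rank so that
    # all replicas take the identical optimizer step.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)

    optimizer.step()
    optimizer.zero_grad()
```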
And you can't just skip that synchronization, because you also have a mathy constraint from the ML training side: you simply can't train with too large a batch size, or your model will not converge properly. There's another detail here too, which is that you can't just naively replicate your parameters on every GPU; people often use what's called fully sharded data parallelism (FSDP) instead.

The second kind of parallelism is task parallelism, which in ML is commonly known as tensor parallelism. In this case you basically have two GPUs splitting the same task. The main problem you run into here, and it's somewhat specific to ML, is that because you're running the same task, task parallelism often doesn't have obvious ways to overlap your communication with your computation. But because performance is so important, we still really want that overlap, so there end up being a lot of fairly involved things people do to get it. One of them is called async tensor parallelism. Normally you have a communication op followed by a computation op, and while you're doing the communication the GPU can't do any computation, so that's wasted idle time on the GPU. The observation behind async tensor parallelism is that you can micro-pipeline your communication and your matmul: even within a single tensor operation there's usually still a batch dimension you can parallelize along, so by micro-pipelining the communication and the computation chunk by chunk, you can still get overlap even though we're doing tensor parallelism (there's a small sketch of the basic tensor-parallel split right after this).
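Here is the tensor-parallel split written out in plain single-process PyTorch, just to show the arithmetic; the shapes are made up, and in a real deployment the concatenation would be an all_gather between GPUs:

```python
import torch

x = torch.randn(32, 1024)     # activations: (batch, hidden)
w = torch.randn(1024, 4096)   # full weight of one linear layer

# Tensor parallelism: split the weight column-wise across two devices.
w0, w1 = w.chunk(2, dim=1)    # "rank 0" holds the first half of the columns, "rank 1" the second
y0 = x @ w0                   # partial output computed on rank 0
y1 = x @ w1                   # partial output computed on rank 1

# The communication step: gather the partial outputs (an all_gather across ranks in practice).
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ w, atol=1e-4)

# Async tensor parallelism micro-pipelines this further: chunk x along the batch
# dimension so that the gather for chunk i overlaps with the matmul for chunk i+1,
# hiding the communication behind the computation.
```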
Finally, the last kind of parallelism is pipeline parallelism, where the general idea is that you assign the first part of the task to the first GPU and the second part of the task to the second GPU. This differs from tensor (task) parallelism in that you do not run the same task on both GPUs at the same time, so you can think of the task as being sharded across the time dimension. There are a couple of additional wrinkles with pipeline parallelism that are unique to ML. One is that, once again, the frequent massive synchronizations prevent us from just keeping the pipeline full. The second is that back-propagation, the forwards-then-backwards pattern I mentioned before, adds a lot of fun wrinkles to your pipeline schedule. Here's a pipeline schedule people have designed, where the blue boxes are the forward pass, the cyan boxes are the backward pass, and the green boxes are the backward pass split up even further. The main nuance is that at various points the pipeline can choose to run either a forwards micro-batch or a backwards micro-batch, and that choice is not something you have to deal with in a traditional pipelining setting; it ends up leading to a lot of different optimizations that people do with pipeline parallelism.

Putting this all together, here's a diagram from the Llama 3 paper showing what's actually done to train Llama 3: they combine tensor parallelism (i.e. task parallelism) with data parallelism and pipeline parallelism, and there's a fourth one thrown in called context parallelism, which you can think of as just another form of task/tensor parallelism. Again, compilers have historically struggled to keep up with the distributed optimizations that people do, and the main reason is that compilers are dumb and humans are smart. What I mean is that for any given set of parallelism schemes, any given set of parallelism configs, it's usually pretty feasible to build an automated model that lets the compiler figure out how to parallelize your program. The issue is that most of the innovation in parallelism doesn't come from searching within your existing search space; it usually comes from expanding the search space along a new dimension, and that's something compilers are not great at. It's been one of the persistent struggles when people try to automate parallelism with compilers or other general systems.

For example, one of the additional wrinkles that shows up at a certain scale is fault tolerance. When you're running on 10 or 20 thousand GPUs, doing these global synchronizations across all of them, you have the issue that GPUs can fail for all sorts of reasons: some user related, some hardware related, some networking related. And a single failure takes down your entire run, which is quite problematic. This table is from a paper Meta published about training Llama 3 and fault tolerance. One example of how this becomes an interesting issue: training on 16,000 GPUs, you only see an error about once every 1.8 hours, but if you scale up to 131,000 GPUs, you now see a failure roughly every 15 minutes. So something that isn't that problematic at 16,000-GPU scale or below becomes very problematic at 131,000-GPU scale or above, where you can imagine getting to the point where you can't even take a single step before some GPU in your entire fleet fails.
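The scaling math behind that claim is simple enough to write down. This assumes failures are roughly independent per GPU, so the fleet-wide failure rate grows linearly with fleet size; the 1.8-hour figure is the one quoted above, and 131,072 is used as the larger fleet size:

```python
# Back-of-the-envelope failure-rate scaling: if one failure arrives every 1.8 hours
# on a 16,000-GPU job, and failures are roughly independent per GPU, the time
# between failures shrinks in proportion to fleet size.
hours_between_failures_small = 1.8
small_fleet, large_fleet = 16_000, 131_072

failures_per_gpu_hour = 1 / (hours_between_failures_small * small_fleet)
hours_between_failures_large = 1 / (failures_per_gpu_hour * large_fleet)

print(f"~{hours_between_failures_large * 60:.0f} minutes between failures")  # ~13 minutes
```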
So, to conclude: I think ML has basically become the single most important computational workload in the world over the last decade. I may be a bit biased, but by the amount of flops and investment involved, I think you can make a strong case for it. The characteristics of the workload, though, both from a social point of view, with the massive infrastructure required, and from a technical point of view, often mean that a lot of the traditional approaches to building systems or compilers don't directly apply. You can't just build a compiler and then have people optimize that compiler for five years without thinking about how the workloads evolving on top of it affect the compiler. So to me, the most interesting question about building systems here isn't really about building the right optimizations; it's about coming up with the right programming models for expressing your systems, programming models that let people do their own optimizations and build the next 100,000-GPU model. Thanks for coming to my talk; I hope you liked it.