Hello, hello everybody. I'm Pavel "Pasha" Shamis from NVIDIA, and it's a true honor to introduce today's speaker, someone I've had the privilege of knowing and working alongside for over 20 years. Pavan is a principal research scientist at Meta, where he leads transformative work in GPU training systems and AI communication libraries. His contributions have been pivotal in building some of Meta's most advanced AI supercomputing systems, including the Grand Teton architecture, which supports Meta's most demanding workloads, ranging from recommendation models to generative AI models like Llama. Pavan's influence extends far beyond his current role: during his time at Argonne National Laboratory he made significant strides in the design and implementation of communication runtime systems and programming models. I had the distinct pleasure of collaborating with him on projects like MPI and UCX, where his expertise and commitment were truly inspiring, and his work on projects such as MPICH, UCC, and Argobots has been impacting supercomputers and commercial products around the globe. Today Pavan will be sharing insights into one of Meta's latest and most exciting projects, Llama 3. This is not just an AI model; it's a groundbreaking achievement that showcases state-of-the-art capabilities ranging from general knowledge to multilingual translation. Pavan will give us an exclusive look at the massive infrastructure behind Llama 3, diving into the challenges and innovations that made it possible. Please join me in welcoming my colleague and friend, Dr. Pavan Balaji.

Thank you, Pasha, thank you for the very kind introduction, and thank you for the invite to come here and speak. It's a great pleasure and honor to be here. I think I started attending Hot Interconnects 20 years ago as a grad student, so it's a pleasure to be back here to give this talk. I'm going to share my slides. I have a cheesy video; I don't know if it'll play or whether the volume will come through, but we'll try. Can you guys see my slides? You probably want to switch to presenter mode. Yes, okay. I'll get started in a second; we'll try the video, and if it doesn't work we'll move on.

So my talk is on Llama 3, the most recent model that we have released (3.1, actually). I will try to cover a little of the infrastructure details: the talk is not about the model itself but about the infrastructure that's powering the model, both from a training and from a serving perspective. Okay, here's the cheesy video. I'm going to play it; let me know if the volume is coming through, and if it's not we'll skip it.

Infrastructure, fundamentally, is what I call the engine behind the growth of this company. We have been building data centers since 2010; we're talking about serving maybe half of humanity now. When we talk about AI, you need a similar engine behind AI, so that AI can achieve the potential that we are dreaming of. AI workloads are growing at a pace of a thousand-x every two years; just contemplate what that means for our systems, our silicon, our software stack. We're in the middle of a pivot to the next age of information. The models themselves are becoming hundreds or thousands of times larger and more complex, and what this is going to require is infrastructure at the exaflop level. We're reimagining everything we do about our infrastructure for AI. We're creating data centers that are specific for AI. We're creating new hardware, including our own silicon.
We're building out new kinds of network architectures. We're reimagining the software stack, like PyTorch. Thousands of engineers are innovating on this large-scale infrastructure, infrastructure that's built specifically for AI.

Okay, that was my cheesy video, let me move on. Hopefully the volume came through. (It was audible? Oh, great.) So, just a bit of a video to show the kind of infrastructure we're building at Meta.

Let me take a step back and talk about the generative AI models at Meta. For Meta, generative AI is relatively new, just a few years old, two or two and a half years, and most of the investment here has been in the context of the Llama models. The idea is to build models that can create new, realistic content and open up possibilities for more creative applications to be built. We started with Llama a few years ago, then Llama 2, and most recently, earlier this year, we released Llama 3 in two batches. In the first batch we released the 8 billion and 70 billion parameter models, and very recently we released our large 405 billion parameter model. This is the largest dense model used in production today, so it's a big deal, and it is the only open model that is competitive with the top closed models across a variety of benchmarks.

That's just the model; the model works in the background, like a mathematical equation running behind the scenes. There are products you can interact with on top of it. Meta AI is what was recently announced (there's a picture of Zuckerberg there presenting the announcement), and it's integrated into Meta's family of apps: WhatsApp, Instagram, and so on. So Meta AI is how you interact with the model: you type your query, and at the back end it uses Llama 3 (Llama 3.1, to be more precise) to come back with the answers you're looking for. There are a few statistics on the right-hand side if you're interested.

We've come a long way in the last few years. Here's a picture of how the evolution went, with the same prompt at the top. The left-hand picture is from around the time of Llama 1: with that prompt you get a decent picture, but it's cartoonish in some sense; it's good, but not that realistic. Fast forward to around mid-2023, about six to nine months later, around the Llama 2 generation, and the outcome of the same prompt is much more realistic, with richer colors and all of that, but this is still the previous generation, Llama 2. Coming to 2024, with Llama 3, we of course have all of the realistic content we had with Llama 2, but it's enhanced along two axes, if you will. The first is accuracy: the model has to be more accurate, and we achieve that by using much larger models and more data to train them (405 billion parameters and a lot of data; I'll talk about that in a bit). The second axis is speed: how fast we can answer queries. Unfortunately, accuracy and speed are strongly in tension; it's very hard to achieve both. You can gain a lot of accuracy by making the model very large and training it with a lot of data, but in doing that you usually lose serving speed, because the model is too large to serve.
Achieving both together, that's the trick, and that's where the infrastructure comes in. We build large, very fast supercomputers so that we can deliver fast responses to queries while still maintaining the high accuracy that comes from the large models. I'll talk about the infrastructure, focusing a little more on the network and communication-library aspects of it, but the whole infrastructure matters for achieving both of these at the same time.

A quick training-versus-serving 101, just to warm up the crowd. There are two parts to being able to serve the model to users. The first part is training: we take a lot of data and a basically untrained neural network (think of it, for simplicity, as a network with its weights initialized to random values), we train it with a large amount of data, and we end up with a trained model. This is a long process: it takes a lot of GPUs, a huge network, and months of training, and at the end of it we have a trained model whose weights are optimized for that data set. Once training is completed, the model goes through a bunch of pruning and other optimizations, and then it's ready for serving: when a new query comes in, we serve the query based on this trained model sitting in the back end. You can see that the larger the trained model is, the harder the serving gets, so there's a trade-off between the two that we need to handle.

Okay, let me jump into the training scale. I'll talk about four parts, focusing more on performance and reliability, but covering four sub-topics: the compute side, the storage aspects, the network of course, and the software stack associated with it.

First, the compute part. Meta has a large infrastructure, with dozens of data centers around the world, and AI is a big part of it. A lot of our AI infrastructure (not all, but a lot) is powered by multiple types of GPUs as well as our internal silicon. For example, we buy a lot of GPUs from NVIDIA: our portfolio will have about 350,000 H100 GPUs by the end of this year, and if you include all the other kinds of GPUs we have (A100s, other vendors, and so on) we have the equivalent of 600,000 H100 GPUs in the system. That's a very large set of GPUs, and of course we need to combine them to make a supercomputer out of it. Two years ago we announced an architecture called Grand Teton; it's openly available through OCP, so you can take a look at it, but here's a rough TL;DR of what the architecture looks like. It's a tightly coupled chassis architecture with CPUs and eight H100 GPUs inside a box, NVLink switches within the box, PCIe switches, a backend RDMA NIC for each GPU (so each GPU gets a 400-gigabit RDMA NIC), and some additional frontend NICs where TCP traffic goes, for things like loading data and checkpointing. That's one chassis with eight GPUs; we have two chassis per rack, so a rack has 16 GPUs, and we have multiple racks connected together in two layers.
One layer, which I'll talk more about shortly, has full bisection bandwidth, and then we have multiple of those units, what we call AI zones, connected together in a tapered manner. Okay, that's the architecture of the system.

Let's talk a little about Llama 3. Llama 3 training was supported by two large supercomputers, each with about 24,000 H100 GPUs. That's an aggregate of about 200 exaflops of peak FP8 performance at 2:1 sparsity, which is a bit of a misleading statement; it's actually about 100 exaflops, because we don't use the sparsity, but I'm being consistent with the number that floats around elsewhere. So: about 24,000 H100 GPUs each, say about 100 exaflops of peak FP8 performance in aggregate. And that's peak; the achieved performance is of course lower than that. These are two supercomputers: one was built with RoCE, the other with InfiniBand, and other than the network difference everything else is identical between them; both are based on Meta's Grand Teton architecture. These were the supercomputers mainly used for our training; we used a few other supercomputers for smaller pieces, but the bulk of the work was on these two. That's the compute side.

Let's talk about data and storage. Of course, having a lot of compute is not particularly helpful if we don't have a lot of data to train with, and not just a lot of data but high-quality data. We spend a lot of time curating and bucketizing the data we train our models on; the Llama 3 paper has a lot more detail on the kind of cleaning that happens on the data sets, so look there for more. Llama 3 was trained on 15 trillion tokens of data. That's a fairly large data set, not the largest out there (I believe other models have trained on even larger sets, like 40 trillion), but we trained on 15 trillion tokens, seven times larger than Llama 2, and it's a mix of publicly available online data. With 15 trillion tokens there's also the question of the number of epochs trained: not all of the data is replicated, but some parts are, so you get a sort of mini-epoch behavior, where parts that are more important to train on get replicated more and parts that are less important don't get replicated as much. The entire storage infrastructure was based on flash media; we used Meta's Tectonic distributed storage solution (I'll talk more about that in a second), and we also co-developed an NFS solution with Hammerspace for interactive debugging.

Okay, let's look at a very high-level view of the storage system. This is a very simple cartoon; most people who work on training systems know there's a lot more going on, but bear with me as I walk through the simple model. We start the model, and every iteration we ingest data from the storage system: here's the data I need to train on for this iteration or the next few iterations. And after some number of iterations we checkpoint our state. So every iteration we're ingesting data, every few iterations we're checkpointing data, and we continue through this loop many, many times.
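To make that loop concrete, here's a minimal, single-process sketch of the ingest/train/checkpoint cycle. This is just an illustration, nothing like the real distributed trainer; the model, batch shapes, and checkpoint interval are all made-up values.

import torch
from torch import nn

# Toy stand-ins: a tiny model and random tensors in place of real token batches.
model = nn.Linear(256, 256)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def ingest_batch():
    # In the real system this is a synchronous, read-heavy pull from the
    # storage tier; every rank must have its data before the step can start.
    return torch.randn(8, 256), torch.randn(8, 256)

CHECKPOINT_EVERY = 50   # illustrative; the real interval is a tuning knob

for step in range(200):
    x, y = ingest_batch()                       # read path
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % CHECKPOINT_EVERY == 0:            # write path: bursty by nature
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()},
                   f"ckpt_{step:06d}.pt")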
Eventually we come to the end of the run, once we have run out of data or we've reached our desired accuracy. The ingest part is entirely read-based and it's synchronizing (we need all of the data before we can start training), and the checkpoint part is entirely write-based. (Pasha, did you have a hand up? Your hand is up; is there a question? Okay, it was a mistake, ignore it, I clicked on something.)

So, gen AI training storage needs. As I mentioned earlier, generative AI is relatively new for Meta. Earlier we of course used AI, but in a different context, for example recommendation models, and we used storage systems for things like our regular, traditional data warehouse: you upload your pictures to Facebook, and they go and sit in a data warehouse somewhere. The storage needs change dramatically compared with that traditional usage. In traditional warehouse terms you have long, consistent streaming IO, throughput matters more than latency, and storage is typically the expensive part. For gen AI training it's a bit of a reversal: the IO is very short and bursty, because we're loading in just the data we need and checkpointing every once in a while, not streaming continuously, and latency matters a lot. The training is synchronous across thousands of GPUs, so the P90/P99 latency, when the last process gets its data, matters a great deal, and we have to be very careful about that because of the synchronous-training aspect. And storage is the cheaper aspect here, so we can afford to replicate data or add storage devices as needed.

We played around with a lot of different solutions, including vendor-provided ones, but eventually we decided to build our own, for two reasons. First, some of the vendor solutions we evaluated simply don't work at our scale. Second, and this is the big one, they did not offer the strict privacy and security that Meta requires. That's a big no-no: if a solution doesn't meet Meta's standards of privacy and security, it doesn't matter how good it is, we're just not going to use it, because we can get sued if we're not maintaining privacy correctly. So at the end of the day we went with a system called Tectonic FS. This is Meta's exabyte-scale distributed file system. It has several advantages with respect to scalability, and it has simplified, append-only file semantics, which is sufficient for what we need for our gen AI workloads. It doesn't have some cool features that other file systems have (no NFS look and feel, for example), but that's okay; we can modify our applications and deal with those sorts of problems.

Here's a quick picture of how this works out. When we're running our traditional applications, the traditional Facebook feed or whatever it is, our storage-access P99 latencies, as you can see in the picture on the left, stay below a certain threshold. But the moment we add things like AI checkpointing, you can see on the right-hand side these huge bursts of traffic.
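To get a feel for why those bursts are so big, here's a rough back-of-envelope calculation. The 405B parameter count is real; everything else (bf16 weights plus fp32 optimizer state at roughly 14 bytes per parameter, and the 60-second target) is my own illustrative assumption, not the actual checkpoint format.

params = 405e9
bytes_per_param = 2 + 4 + 4 + 4   # bf16 weights + fp32 master copy + two Adam moments (assumed)
ckpt_tb = params * bytes_per_param / 1e12
print(f"full checkpoint ~ {ckpt_tb:.1f} TB")          # ~5.7 TB
target_s = 60                     # land the checkpoint within a minute (assumed)
print(f"aggregate write burst ~ {ckpt_tb * 1e3 / target_s:.0f} GB/s")

Even with generous assumptions, a full checkpoint is terabytes landing on the storage tier in a short window, which is exactly the spiky write pattern in the picture.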
Every time you do a checkpoint you're dumping a ton of data to storage, and with the two supercomputers combined, 48,000 GPUs, the entire job is dumping data at the same time. That creates this giant burst of IO, and the P99 latencies shoot up every once in a while. Again, I want to remind folks that the P99 latency matters a lot, because this is synchronous training: even if one process is delayed, all processes have to wait. It doesn't matter that my average latency was great; what matters is my P99 latency. Now, we do some buffering, for example trying not to wait for the checkpoint to finish so we can move forward, but there's only so much buffering can do; eventually the latency difference catches up with us.

Okay, that's storage. Let's switch gears and talk a little about the network. I mentioned this earlier: we have a multi-layer switch architecture. A rack contains two nodes, each node with eight GPUs, so at the bottom you see the 16-GPU racks, and each rack has a switch called a rack switch, or top-of-rack switch (it's not actually at the top of the rack, but that's what it's called). Then at the second layer we connect a bunch of these racks together with first- or second-level switches, if you will, called training cluster switches (CTSWs), and within this training cluster, the subset at the bottom left, you get full bisection bandwidth. I should point out that there are 256 such racks at the bottom within one training cluster. Now, I think everyone in this group knows that building a fat tree with full bisection bandwidth does not mean you will not have network collisions; it's not a crossbar, so you will still have collisions, and that's something we need to deal with, but at least from a theoretical perspective there is full bisection bandwidth within that set. Then you have multiple of these clusters, what we call AI zones, connected together with a third level of aggregation switches to form the overall supercomputer. The aggregation layer is tapered at 7:1 oversubscription, so communication between AI zones is slower, and we need to be careful about how, and how much, we communicate between them.

Okay, performance at scale. The first thing I would point out is a paper: multi-stage networks are not crossbars. This is from Torsten's group, from about 20 years ago; I think it might have been at Hot Interconnects, I don't fully remember. It's recommended reading, and I would highly encourage looking through it. Basically, what it says is that irrespective of what we do, unless we are very, very careful, in most practical cases a full fat-tree network makes it almost impossible to avoid network collisions, so we have to deal with them. There are some ways to avoid it, but unless we coordinate things globally it's very hard to do.
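Here's a tiny toy model of why collisions are so hard to avoid. This is not our routing and not any real switch's hashing, just independent uniform hashing of flows onto equal-cost uplinks, which is roughly the static-ECMP mental model.

import random

def avg_worst_link_load(num_flows, num_uplinks, trials=10_000):
    # Each flow is hashed independently and uniformly onto one uplink;
    # measure how many flows pile onto the most loaded link on average.
    total = 0
    for _ in range(trials):
        load = [0] * num_uplinks
        for _ in range(num_flows):
            load[random.randrange(num_uplinks)] += 1
        total += max(load)
    return total / trials

# 16 flows over 16 uplinks: a crossbar (or perfect routing) would place one
# flow per link, but random hashing typically lands 3 or more flows on some
# link, so each of those flows gets a fraction of the bandwidth it expected.
print(avg_worst_link_load(16, 16))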
Another aspect is why we have such a large network in the first place: why can't we fit more GPUs under each rack switch, for example? This is mainly to do with power constraints, which I'll talk more about later, but power constraints often require us to spread our GPUs out much further apart. Because they are more spread out, we need more levels of switches; we can't pack them as densely as some others might be able to, and that creates other challenges for us: a larger network to deal with, more collisions, more latency, and so on.

And one last thing: one of the things we've noticed is that there are a lot of excellent papers out there about all sorts of optimizations for various cases, but many of them assume that the network is well behaved, that nothing goes down. Every once in a while a node will go down and we need to be able to restart the job with the remaining nodes, so we will not have well-behaved networks, or well-behaved systems, all the time. What we sometimes refer to as corner cases are often front and center for these production systems.

I'll give you a simple example, on the right-hand side. Pick any collective algorithm; it doesn't matter which one, let's say all-gather. There are three types of algorithms, and we've seen many papers on this, as you might imagine. There's a full-mesh algorithm, where each process sends its data directly to all other processes; there's a logarithmic algorithm, like a tree or recursive doubling or whatever else you can think of; and there's a linear algorithm, like a ring. We might think the full mesh should always do better, because I'm sending the same amount of data in a constant number of hops, whereas with a ring I'm using many more steps to do the exchange, so it should be worse. It's not always the case: in some situations the number of hash collisions, the network route collisions, that occur while I'm dumping data in a full mesh can actually make it worse than a logarithmic algorithm, or even a ring algorithm. So something to keep in mind: a full fat tree is not the same thing as a crossbar; we will have route collisions and hash collisions, and our algorithms need to be aware of that for best performance.
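For reference, here's what the ring variant of all-gather looks like as a toy, single-process simulation. This is my own sketch, not NCCL's implementation; the point is that in p - 1 steps, each moving one shard to a nearest neighbour, every rank ends up with all the data, and the network only ever sees neighbour-to-neighbour traffic.

def ring_allgather(chunks):
    # chunks[r] is rank r's local shard; returns each rank's gathered view.
    p = len(chunks)
    gathered = [[None] * p for _ in range(p)]
    for r in range(p):
        gathered[r][r] = chunks[r]        # every rank starts with its own shard
    for step in range(p - 1):
        for r in range(p):
            src = (r - 1) % p             # receive from the left neighbour
            idx = (src - step) % p        # the shard that neighbour forwards now
            gathered[r][idx] = chunks[idx]
    return gathered

print(ring_allgather(["A", "B", "C", "D"]))
# after p - 1 = 3 steps every rank holds ['A', 'B', 'C', 'D']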
One example I want to call out is with respect to static routing. Here is the same picture from the last slide, shown slightly differently. I have three racks, and each rack has two servers: servers 1, 2, 3, 4, 5, 6. Assume that originally my job requires four servers, 32 GPUs, and it was using servers 1, 2, 3, and 4, and the way my routing was set up is that if I'm sending to an odd-numbered server I go through the first half of the CTSWs and the first half of the RTSW ports, and if I'm sending to an even-numbered server I go through the second half of the RTSW ports and the second half of the CTSW switches. Now say server 4 dies for some reason; there's a fault, whatever, it has to be taken offline, and I get allocated a new server, server 5. It's still a four-node job, but now I have servers 1, 2, 3, and 5, and because both server 3 and server 5 are odd-numbered, all of my traffic to server 3 and server 5 goes through just half the ports on the RTSWs and half the core switches. That obviously creates route collisions. We can think up other static configurations, but in reality almost no static routing configuration can solve this problem, because there will always be some failure case where I end up with route conflicts because of static routing.

So the obvious question is: why not adaptive routing? Well, that has more to do with physical distances and latency. (Let me check the time. 10:30, okay.) Here's a picture of Meta's Arizona data center. You can see the spread is huge in physical distance, and power constraints often force us to spread these GPUs out over very large distances: they might be fed from different substations, or the power or cooling might not be able to handle a very densely packed GPU layout, and so on. It's very hard for us to do adaptive routing here, because the NIC needs to maintain a large window of out-of-order packets: my bandwidth-delay product is much larger simply because physical distance and the speed of light make my latency much larger. That bandwidth-delay product can make it hard to do full out-of-order handling on very high-latency networks. We may need more custom transports to handle this, more flexible software than verbs, maybe libfabric (and I'm supposed to say UCX too, so I don't offend anyone here), custom software that can expose out-of-order communication of some sort, or the ability to chunk data across multiple paths, and things of that sort, to work around such issues.

Okay, that's the network architecture itself. Let's take a quick look at the software stack, or the model-parallelism side of things. The model architecture needs to scale to 24,000 GPUs (there are two different supercomputers, so 24,000 each), and we can't simply use data parallelism. Our global batch size is fairly small, about 2K, so even if you give one sample per GPU we can't go beyond 2,000 GPUs with data parallelism alone; it's limited by the batch size, and typically we want more than one sample per GPU. So we need multi-dimensional parallelism techniques, combining some form of data parallelism and model parallelism, to scale to these large supercomputers. For Llama 3 we used three levels of parallelism (the next Llama will have more): data parallelism, pipeline parallelism, and tensor parallelism. This lets us essentially build the job as a Cartesian grid, where each dimension is sized based on how many samples I have, how much I want to split the model, and so on. Data parallelism splits the global batch; pipeline parallelism splits the layers of the model across GPUs; tensor parallelism splits the tensors themselves across GPUs. What this means from a network perspective is that I'm no longer doing one global communication that includes all the GPUs; I'm only communicating within each dimension: a data-parallel communication, a pipeline-parallel communication, a tensor-parallel communication, and we have many such communications going on at once.
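Here's a tiny standalone sketch of that Cartesian decomposition. It's just plain Python, not the actual trainer, and the 2 x 4 x 4 split is an illustrative choice of mine that happens to give 32 ranks.

def rank_to_coords(rank, dp, pp, tp):
    # Lay the ranks out on a dp x pp x tp grid; which dimension is "fastest"
    # is an arbitrary choice here.
    return (rank // (pp * tp), (rank // tp) % pp, rank % tp)

def groups_along(dim, dp, pp, tp):
    # All process groups that communicate along one dimension of the grid:
    # ranks that share every coordinate except `dim`.
    buckets = {}
    for r in range(dp * pp * tp):
        key = list(rank_to_coords(r, dp, pp, tp))
        key[dim] = None
        buckets.setdefault(tuple(key), []).append(r)
    return list(buckets.values())

dp, pp, tp = 2, 4, 4                         # 32 GPUs, illustrative split
dp_groups = groups_along(0, dp, pp, tp)
print(len(dp_groups), dp_groups[:3])
# -> 16 groups of 2 ranks each, e.g. [[0, 16], [1, 17], [2, 18]]
# Sixteen small data-parallel collectives run concurrently, each blind to the rest.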
So if we just take data-parallel communication as an example: how many GPUs do we have in this picture? 32 GPUs, and my data-parallel dimension only has two GPUs. So when I'm communicating across the data-parallel dimension, each collective operation involves only two GPUs, but I have 16 of these collective operations happening in parallel. And the trick is that each collective, at the communication-library layer, is not aware of the other 15 communication operations happening at the same time. When each collective is pushing data onto the network, it doesn't have a global picture of the other 15 collectives, and that makes it harder to communicate in a way that doesn't conflict with the other operations on the network routes. That's really the challenge we need to solve.

Okay, that's the network. Let's take one step higher and see where we are with all these constraints. First, power is a key basic resource. Meta is committed to sustainable data centers, 100% renewable energy, so the infrastructure has to be very power efficient. This is something Meta decided a long time ago and we stick with it, and it makes it very hard for us to get a lot of power, so we try to be as efficient as possible. This is a little different from some other companies; Amazon, for example, just bought a nuclear-powered data center, which is clean energy but not renewable energy, so we differentiate on that as well. Power is a very important resource for us, we have to be very careful how we spend it, and a lot of the design decisions we make for the data center and the overall system are influenced heavily by this big constraint. Some examples: optics versus copper; there's a cost-versus-power-efficiency trade-off there, and in many cases we end up picking optics. Another: we buy a lot of GPUs but run each of them at lower power. Even though I'm not getting the best performance out of each GPU, as an aggregate, as a full supercomputer, I get higher aggregate performance if I buy more GPUs and run them at lower power, assuming my network does its job and the system scales well; buying more GPUs at lower power is more performant than buying fewer GPUs and running them at full power. That's another trade-off we make. And of course we're also building Meta's internal silicon for more aggressive performance-per-watt optimization going forward.

A quick look at reliability, or unavailability events, for the system. The training runs for many months (Llama 3 trained for many months), and we took a 54-day snapshot period to study the interruptions that happen. We had a total of 466 job interruptions, 419 of them unplanned, and the TL;DR is that GPU issues were the top category of failures, accounting for close to 60% of all unplanned failures. The top failure modes were faulty GPUs, GPU HBM3 memory, and software bugs (in many cases the GPU driver stack, but other parts of the software stack as well), and fourth (these are just the top four listed here), we also had some failures on the network switch and cable side, but relatively few compared to the interruptions we had from GPUs. So this is something to keep in mind at this scale.
Continuous interruptions are a fact of life when we're running at that scale for that long. Another interesting thing I want to point out is soft errors. Soft errors are deadly. With a hard error, if a GPU dies for example, the job fails, so we know immediately. Soft errors are tricky, because the GPU is working, just not quite optimally, and since this is a synchronous training job, just one GPU doing that slows down the entire training, because it synchronizes at every step. There were cases where our training throughput dropped by 50%; that's actually a very extreme, obvious example: you look at it and say, oh, something went wrong, my training performance dropped 50%, and you go debug it and figure out what happened. But imagine the training performance is 5% lower, or 2% lower: how do we detect those soft errors going forward? We have very sophisticated diagnostics for this, but they're never perfect, because very small drops in performance have to be studied very carefully to tune the detection correctly.

One fun fact I can mention: while we were doing Llama 3 training, every day around midday, when the temperatures are very high, the training throughput drops by 1 to 2%. This is a natural occurrence: it's very hot outside, we use air cooling, and the GPUs get thermally throttled more often than when it's cooler at night. So the training throughput drops a little, 1 to 2%, every day around midday, and by night it goes back up. You can consider it a soft error, but it's not really an error; it's just an artifact of the environment, and there's nothing we can do about it. So, lesson learned: run your training job in the winter, and you get a little more bang for the buck from your systems.

Okay, how am I doing on time? 10:42, all right. Let's switch gears a little and talk about serving: after we're done training, we need to serve the model, so what do we need for that? Serving Llama 3. This part you should all be familiar with: you can go to meta.ai and type any query you like. Here I typed "what is libfabric" and you get a response explaining what it is. In the back end there are two things we should be aware of in getting this response. The first is what's called time to first token: after you send the query, how much time passes before you get the first word of the response back. Typically we need this to be on the order of seconds, the lower the better obviously, and this part is very compute intensive and network bandwidth intensive, because it moves a lot of data around. That's the first part. The second part is what's called time per incremental token: after you get the first word back, how long does it take to get the second word, then the third word, and so on. This is on the order of hundreds of milliseconds, and it's very memory-bandwidth intensive. It's not network bandwidth intensive, because it doesn't move a lot of data, but it is very network latency sensitive: how long it takes for the first byte to get transferred matters a lot in this case.
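Two quick back-of-envelope numbers to make those metrics concrete. All of the inputs here are my own illustrative assumptions (response length, fp8 weights, an 8-way tensor-parallel split, H100-class HBM bandwidth), not measured serving numbers.

# A simple response-time model from the two metrics above.
ttft_s = 1.0                 # time to first token: "order of seconds"
tpit_s = 0.1                 # time per incremental token: "order of hundreds of ms"
n_tokens = 200
print(f"end-to-end response ~ {ttft_s + (n_tokens - 1) * tpit_s:.0f} s")

# Why decode is memory-bandwidth bound: each generated token has to stream
# essentially all of the weights out of HBM.
weight_bytes = 405e9 * 1     # 405B parameters at 1 byte each (fp8, assumed)
per_gpu = weight_bytes / 8   # assumed 8-way tensor-parallel shard
hbm_bw = 3.35e12             # ~3.35 TB/s HBM3 per H100 (nominal)
print(f"memory-bandwidth floor ~ {per_gpu / hbm_bw * 1e3:.0f} ms per token")
# ~15 ms: a floor before any KV-cache traffic, batching, or network cost.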
All right, so as I said, there are two parts: the prefill part, which matters most for time to first token, and the decode part, which matters most for time per incremental token. Let me focus on the decode part, the latency part. These are small messages, smallish messages, less than a megabyte, sometimes tens of kilobytes, and they need to complete fast from a latency perspective: we need to finish the collective as quickly as possible, and that is not easy to do. For example, if you look at NVIDIA's NCCL communication library, it goes through a bunch of steps: there's a compute buffer sitting on the GPU, it gets copied into some temporary buffers, it goes to the RDMA NIC, the NIC transfers it, the remote NIC hands it up on the other side, and finally it lands in the destination compute buffer. And there's a CPU proxy sitting on top that orchestrates the use of the RDMA NIC; the GPU itself is not directly triggering the RDMA operation. A few things to note here. The first part, GPU data getting to the RDMA NIC, is itself bottlenecked by other aspects of the GPU system: for example, GPU memory has to be flushed, and if I'm running a CUDA stream I need to launch work that can block other work going on. And of course the control messages that go between the two CPUs also play a role. All of this together, even for a very small collective: even without an RDMA NIC our latency is around 20 to 30 microseconds, and if you add an RDMA NIC it goes to 40 microseconds or so. That's a very high latency number, because as all you network experts know, if I'm moving a byte of data it should take less than a microsecond; I shouldn't have to spend 40 or 50 microseconds on this communication operation. Fixing this is highly desirable but also very hard to achieve, because fundamentally GPUs and NICs are two separate, decoupled entities, and it's very hard for them to access each other's internal state to do low-latency collectives easily. This is a problem many vendors are looking to solve. NVIDIA, for example, is looking at more tightly integrated network systems, like NVLink at larger scale, which they recently announced, and other companies are following suit, looking at how to bring this down to very low-latency communication capabilities.
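A quick sanity check on why those 40 to 50 microseconds are almost entirely overhead rather than the network itself (the message size is an illustrative assumption):

msg_bytes = 1024                 # a small decode-phase message (assumed)
link_bps = 400e9                 # 400 Gb/s backend NIC
wire_us = msg_bytes * 8 / link_bps * 1e6
print(f"time on the wire: {wire_us:.3f} us")        # ~0.02 us
observed_us = 40                 # end-to-end small-collective latency quoted above
print(f"copies, proxy hand-offs, kernel launches: ~{observed_us - wire_us:.1f} us")
# The fabric contributes a tiny fraction; the rest is software orchestration,
# which is what tighter GPU/NIC integration is trying to remove.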
All right, the last piece I want to talk about is infrastructure fungibility. We've talked about the infrastructure needs of our generative AI models, but I want to take a step back and look at what we need from a broader infrastructure-support perspective. As I mentioned earlier, traditionally our infrastructure was built around ranking and recommendation models, and those have a very different characteristic: they don't use a lot of FLOPs, they don't do a lot of computation, but they rely on other things, like a lot of all-to-all network communication, irregular memory accesses, and a bunch of other differences. You can see it in the table here: ranking and recommendation models take about 100 gigaFLOPs per training iteration, and serving (inference) takes a few gigaFLOP/s at around 100 milliseconds of latency. If you look at generative AI, the Llama models (and this row is an even older model, the 65-billion-parameter Llama), the FLOPs needed for training are much, much larger, around a petaFLOP/s, and the FLOPs needed for serving are also much, much larger. So generative AI is far more compute intensive than recommendation models, and it runs at a much larger scale because of that compute need, so the number of GPUs is much larger as well. But the network access pattern is more regular, closer to scientific-computing applications, if you will: you're doing Cartesian-style communication, communicating across each dimension of your Cartesian grid. Recommendation models, on the other hand, do this all-to-all, or all-to-all-v, communication where everyone talks to everyone, which makes it a pain in the neck to optimize from a network perspective: if I'm doing all-to-all-v across all my processes I can't scale it to 10,000 GPUs (imagine doing all-to-all-v across 10,000 GPUs), and I can't use rail-based networks, for example, because all-to-all communication becomes very hard on those.

So from an infrastructure build-out perspective, this is the decision we have to make. We can't custom-build everything for generative AI, because then we can't support ranking models; similarly, if we custom-build everything for ranking models, we're spending a lot of money on an architecture that is less optimal for generative AI, and wasting a lot of money there. It's a trade-off we think about every day. There's no perfect solution; we try to strike a balance. So you will always notice some strange things in our designs, things where you could say "hey, if you did it this way it would be more optimal for generative AI"; we intentionally made that trade-off, because generative AI is not the only product we need to support; we still need to support ranking models, like ads.

Here's a picture of what the different areas require. Ranking and recommendation training, for example, is more intensive in network bandwidth and memory bandwidth (irregular-access memory bandwidth, if you will), but not as compute intensive, whereas generative AI training is more intensive in compute and in scale (number of nodes), but not as intensive in memory bandwidth. Building a supercomputer that is a superset of all of this would be incredibly wasteful; we can't do that, it would cost a lot of money. So in some cases we specialize a few things, and in other cases we try to keep the system as fungible as possible for our other workloads.

Okay, almost at time, so just some quick parting thoughts.
First: scale changes everything. In our experience with Llama 3 we never, or very rarely, found raw performance to be the core problem. Performance is almost always about this weird situation: when something happens, what performance do I get then? I'm almost always limited by that, not by the raw performance the hardware can give; it could be power, physical layout, system reliability, physical constraints, and so on. Training and serving are both very important, but from a network perspective they look very different: training focuses on scale, obviously, while serving focuses on latency of response. And they have to go hand in hand: you can build an amazing system and train a very large model, but if you don't have a good enough system to serve it, it's pointless; you have a trained model you can't serve, which rather defeats the purpose. And of course hardware fungibility is very important, which forces us into some decisions we would not otherwise have made, just so we can reuse the hardware for other models as well.

Okay, that's it. I know this is not good etiquette in a keynote, but I want to call out that we're looking for people to work with us; there's a QR code if you're interested in learning more. We have full-time opportunities, internships, sabbaticals, research collaborations, and whatnot, so if any of this sounds interesting to you, please reach out; we'd love to work with you. That's it, thank you very much.

Hey Pavan, thank you so much for a very interesting talk. Slack exploded with questions, so I'll have to cherry-pick; I have the privilege of cherry-picking the questions that overlap with mine, and I hope we have enough time, because there are a lot. The first, very long question I'll break into three smaller ones: what is the solution for an instantaneous power draw of tens of megawatts after checkpointing?

If you have a solution, I would love to hear it. We work very closely with the power companies, because it is a big problem. There are really two or three problems here. One: when we restart a job, 24,000 GPUs come up at once, so they draw a lot of power as they come up. The second problem is things like checkpointing: when I'm checkpointing I'm stalling the GPUs, so I get a medium-sized dip in the power. And the third is communication: whenever I'm stalled on communication my GPUs are not active, so I see a smaller dip. So I have a large dip on restart, a medium-sized dip on checkpointing, and a small dip on collective communication. From what we can tell from the power companies, the large and medium dips are not the problem, because apparently they have worse customers than us who draw even more power. The bigger problem is the spikiness when we're doing collective communication. That is an unsolved problem, unfortunately, and if you have any smart solutions, we're happy to hear them.
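For a rough sense of scale here, assuming something like 700 W per H100 under load (my assumption, not a quoted figure), the swing from a whole-job stall really is in the tens of megawatts:

gpus = 24_000
watts_per_gpu = 700      # assumed per-GPU draw under load; host power excluded
print(f"potential swing ~ {gpus * watts_per_gpu / 1e6:.0f} MW")   # ~17 MW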
Another one: for the message sizes used in Llama training, how does InfiniBand compare to RoCE? We did not notice any performance difference between the two, end to end, for the models. In some cases one does a bit better, in some cases the other does, but at the end of the day we didn't see any major differentiation between the two.

Okay. Any plans for a rail-specific, I would say multi-rail-optimized, network instead of doing top-of-rack? Yes, we are looking at multi-rail systems as well. This was one of the points of my fungibility slides: multi-rail is great for generative AI, in the sense that we're doing more Cartesian-style communication, so I don't need all-to-all connectivity, or first-hop all-to-all connectivity, across all GPUs, and multi-rail is more natural in those scenarios. But imagine using the same infrastructure for recommendation models that are doing all-to-all communication: in those cases multi-rail is going to be horrible, because almost all communication goes up to the top-level core switches and the rail itself is not very useful. There are some tricks we can play, for example having GPUs communicate over NVLink first and then send over the rail, but then we lose zero-copy communication. So the TL;DR answer is: we are of course considering multi-rail, but it's not as easy as "it performs well in this case, so we'll just turn it on"; it heavily loses fungibility with our recommendation models if we do that.

Great. Another question, not directly network related but still very interesting: can you provide the next level of detail on what kind of diagnostics you use for detecting hard and soft errors? Is it some ML-based heuristic or rule-based? Whatever you can share. I can share for the hard errors; for the soft errors, unfortunately, we don't have a good solution. We monitor and throw alerts when the training performance drops, but it's fairly ad hoc in some sense, nothing to boast about. For hardware failures, hard faults, however, we have a very sophisticated set of tooling. We modified things all through the stack: we modified NCCL, we modified PyTorch, we modified the models as well as all our tooling infrastructure, to quickly detect and throw errors and to do what we call localization, figuring out exactly what went wrong, quickly, so that we can repair and move on, or even before repairing, restart the job on a different set of nodes. The paper has a lot more detail; I think there's an entire section on the kinds of faults it detects. For instance, we modified NCCL so that PyTorch has direct access to NCCL's internal state: when some part of the communication drops, it can dump that information, saying "look here, this collective, this operation to this process stalled or failed," so we can diagnose it a lot more easily. It's hard to go into it fully in this talk, but the paper has a full section with a lot of details on this part.

Great. I believe we still have time for questions. One question I'll try to squeeze in: can you elaborate a bit more on the next levels of parallelism, beyond 3D parallelism, that you're thinking of exploring? I don't know if I'm allowed to share that part, but there's a lot of literature in the community on other forms of parallelism. I think NVIDIA published a paper on Megatron, which talks about five or six levels of parallelism. Apart from the three I already mentioned, there's also expert parallelism,
there's context parallelism, there's sequence parallelism, and various other forms exist. But sorry, for the next version of Llama you'll have to wait for the Llama 4 paper.

We will wait. I have time for one more question, I think; I'm asking the organizers because I don't have the schedule in front of me, but I think we're borderline, so I'll go for one more bonus question: what is the increase in storage needs going from text-only Llama training to audio and video Llama training?

From a tokens perspective, there are two parts to it. For Llama 3, the pre-training overall happened on text-based data; that was the big job that ran for many months on a lot of GPUs. Then we did post-training to add images, video, and so on, and the post-training jobs are not that large. So the storage needs in the Llama 3 generation were not that large for images and video, because they were part of post-training, not part of the main pre-training of the foundational model. That might change in the future, in which case our storage needs would increase dramatically, but for Llama 3 it was not a big problem, mainly because it was part of post-training rather than pre-training.

Okay, thank you very much for taking the time to answer questions. If you have extra time and can jump on Slack, there are many more questions in the chat. Yeah, there are tons, so you'll have a lot of work if you have time to answer them all. I really appreciate you taking the time to present; it was a super interesting talk, I'm really excited to hear about all the infrastructure, and I personally would love to learn even more, so I'll be posting more questions as well. Thank you so much. Thank you, Pasha, thank you for having me.