Okay, hello everyone, welcome to lecture 32, I believe, of GPU MODE. I'm really thrilled that we have the Han brothers, Daniel and Mike, here to talk to us about Unsloth. The first time I heard about you guys was the NeurIPS LLM Efficiency Challenge from last year — you were already doing some really cool stuff and some very precise performance work. It's been very obvious to me, meeting both of you, that you really love what you do, so your enthusiasm is pretty infectious, and I'm really glad you're here in the server. Throughout the talk, if you have any questions, please post them in the Q&A here on Zoom and we'll occasionally read them out to Daniel. So thank you — Daniel, please take it away.

Yeah, thanks Mark for inviting us, and thanks to everyone — GPU MODE used to be CUDA MODE, but yes. We started off with the LLM Efficiency Challenge way back last year, and we did Unsloth as a side project after that. Okay, I'll share my screen — can you guys see that? Oh, press slideshow. Yeah, we see it, it's not in slideshow mode yet... it's good now. Okay, cool.

So I was thinking about what I should talk about today. It was going to be all about Triton and CUDA — and about half of the talk will be about that — but I also wanted to talk about general systems engineering. When we first started Unsloth, we built it as an optimization library: making fine-tuning faster, reducing memory usage. We did not expect all the extra stuff that came with it, such as fixing bugs in models.

Some of you might know me from Twitter as well — we fix bugs, for example in Gemma. It's currently just two people in the startup, me and my brother, but essentially we fixed bugs in Gemma: there were some BOS token issues, some RoPE issues, layernorm issues, and activation function issues. We also do model analysis — for example for Grok we posted those fully packed, stacked diagrams; I use Paint to do all of this analysis. It's very fun; sometimes it gets a bit tough and confusing when the model architecture is very different, and sometimes there are mistakes as well. For example, in Grok I think I made a mistake with the tanh part — I did a division when I was actually supposed to multiply. So yes, there are some mistakes in my model analysis, but it's fun. We also did analysis on NVIDIA's Nemotron-4 340B, for example. And tokenizer issues — there are a lot of tokenization problems in language models — so that's also very fun to provide to the community.

Recently, if you've been following, there was a gradient accumulation bug. Benjamin posted about the bug, I think about one week ago, and this bug has been in most trainers for around four years. Essentially, gradient accumulation was theorized to be equivalent to full-batch training, but it actually wasn't. I'll talk about that today as well.
Joey posted a very cool picture showcasing the main issue with gradient accumulation: the denominator was not correct in the cross-entropy loss calculation. We also have a GitHub package where we post our fine-tuning code, our bug fixes, and our CUDA kernels — well, Triton kernels actually — so definitely check that out.

In terms of backstory, we first started Unsloth with Triton kernels, some maths, and our own backpropagation engine. The goal at the beginning was to make LLM fine-tuning — Llama, Mistral, Gemma... well, at the very start it was just Llama 2 fine-tuning — two times faster. This was around December, launched after the LLM Efficiency Challenge, and we reduced memory usage by 70%. But we did not know there was actually lots of extra stuff you have to put together with a package in order to make it work. I'll be talking about most of these things: tokenizer problems, pre-trained model issues, exporting models to vLLM, collaborations with different companies and organizations, making inference better, fine-tuning, DPO, ORPO, best practices for LoRA, and also many algorithms — chunked cross-entropy, long-context fine-tuning, chained matrix multiplication, training on completions, and the gradient accumulation bug fix. The goal is not just making the Triton kernels better and writing optimized libraries — that's very important, but it was actually just the beginning of making a full-fledged training library — and there were bugs and issues in every single one of those parts. That was the unexpected part. For example, we did not support Mistral models when we first started, so we had to support Mistral's sliding window attention, and then Gemma came along with interleaved sliding window attention plus global attention, which was another new thing. There are always new things coming out, and we have to implement them in Unsloth. So it was very interesting — basically every part of the stack had some bugs and issues.

Going back to our first release: we took the transformer architecture — the decoder-only, Llama-style architecture — and wrote it all down as maths, trying to fit it onto one page. Then we noticed we could do some backpropagation tricks, derive all the gradients by hand with matrix calculus, and make the entire process fully Triton. Our theory was that this would make training somewhat faster and reduce memory usage. For example, the RMSNorm kernel that we wrote is cited in many packages. These were our first kernels: we worked on them around October to December, during the LLM Efficiency Challenge, perfected them, and released them around December — I can't remember the exact release date.
These are the RMS layernorm kernels. You can see that we commented out some of the upcasting and downcasting — this was actually found through trial and error. Especially in the backward kernels, we had to upcast everything to float32. We spent a lot of time and energy trying to copy the exact methodology for the correct gradients, and the upcasting and downcasting is genuinely painful: for the RMS layernorm kernel there isn't that much casting to worry about, but the other kernels, as you'll see, get more complicated.

So, a quick question: I've recently been doing a GitHub scan of Triton kernels that exist, and I feel like RMSNorm is by far the most popular Triton kernel on the internet. Do you have some sense as to why, relative to all the other kernels people could possibly write?

That's a great question. I think RMSNorm is the first kernel that's not that complicated but also not that easy — medium difficulty. There isn't that much maths to do for the backprop part; deriving the derivative isn't that complicated. My view is that RMSNorm is the first kernel people try that isn't trivial — addition would be very easy — but the difficulty level is reasonable for people to try out. If that answers the question. Yeah, it does, thank you. Does Raid have a question? It's going to be hard for us to do hand raises, so Raid, if you have a question, just post it in chat and we can do live answers near the end. Yeah, I'll do Q&A.

Okay, so the next kernel we did was the RoPE embeddings. This was more involved, mainly because now you actually have to derive the derivative, and it was unclear what the derivative was. We did some maths and noticed that in Llama they use a function called rotate_half — it might sound confusing, but essentially they divide the tensor into two halves, move the right half to the left and the left half to the right, and put a minus sign on one of them. The derivative looked complicated, but then we noticed it's just a rotation matrix, and the transpose of a rotation matrix is the same rotation with a negative sign on the sine. So you can see, for example, this line: in the backward pass, sin is just set to minus sin. The hardest part was deriving the derivative; the rest is actually not that hard. I think some implementations of the RoPE kernel don't realize that the layout is the most important part. We wrote a kernel that considers the layout of the multiplications — not matrix multiplications, just the element-wise multiplications — so you don't need to write a very complicated kernel if you just consider the layout. If you draw the layout of the RoPE kernel on a piece of paper, when you work through the actual multiplications you can see the kernel doesn't need to be complicated — you just write these lines of code step by step. When we write kernels, we write them out on paper first, and you'll see it's not as complicated as people think.
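To make the RoPE point concrete, here is a small PyTorch sketch of the idea — my own illustration, not Unsloth's actual Triton kernel. The forward pass applies the rotation, and because the rotation matrix is orthogonal, the backward pass is just the same rotation with the sign of the sine flipped.

```python
import torch

def rotate_half(x):
    # split the last dimension into two halves and swap them with a sign flip
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rope_forward(q, cos, sin):
    # standard Llama-style RoPE: q_rot = q * cos + rotate_half(q) * sin
    return q * cos + rotate_half(q) * sin

def rope_backward(grad_out, cos, sin):
    # the rotation matrix is orthogonal, so its transpose is the same
    # rotation with sin negated -- "sin = -sin" in the backward pass
    return grad_out * cos + rotate_half(grad_out) * (-sin)
```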
The next thing is not really a Triton kernel, but we do have efficient flash attention inside our package. One of the fundamental problems was that we had to use three different implementations of flash attention — this was launched back in December, which is probably why. Now I think people can just use scaled dot product attention from PyTorch; PyTorch 2.5 just made it much faster on the Hopper architecture, so I'd suggest people just use PyTorch's version. But when we launched back in December, we had to include three different implementations: one from xformers, one from the actual flash attention repo from Tri Dao, and the actual PyTorch scaled dot product attention. We're also going to implement FlexAttention in Unsloth. One of the problems was that we couldn't rely on flash attention everywhere, because the Tesla T4 GPUs did not have bfloat16 support, and scaled dot product attention also had issues there back in December, so we had to use xformers as a temporary measure in order to support float16 flash attention on those GPUs. That was one of the reasons we had to include xformers.

Then there's SwiGLU — Gemma has GeGLU and other variants, but at the very beginning it was just SwiGLU. Writing down the derivatives and doing the differentiation was fine; the biggest problem, again, was the upcasting and downcasting. You can see: why did we comment out dW's upcast? e has an upcast, g — we had to comment out the upcast. We used many ways to check the accuracy of the kernels, and this turned out to be the correct way to do it. Nowadays you can use torch.compile to generate Triton kernels, which is much more helpful — you don't have to manually test which variant is correct; you can generate the Triton kernels and look at which upcast is right. That's how we normally do it now — no more pain of trying every single combination.
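As a rough illustration of the kind of upcasting being discussed, here is a minimal element-wise SwiGLU forward kernel in Triton. This is a sketch under my own naming (e = gate-projection output, g = up-projection output), not Unsloth's actual kernel: the sigmoid/exponential is computed in float32, and the result is cast back down before the final multiply and store.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def swiglu_fwd_kernel(e_ptr, g_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    e = tl.load(e_ptr + offsets, mask=mask, other=0.0).to(tl.float32)  # upcast for exp
    g = tl.load(g_ptr + offsets, mask=mask, other=0.0)
    silu = e * (1.0 / (1.0 + tl.exp(-e)))   # SiLU computed in float32
    h = silu.to(g.dtype) * g                # downcast before the final multiply / store
    tl.store(out_ptr + offsets, h, mask=mask)

def swiglu_forward(e, g):
    out = torch.empty_like(g)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    swiglu_fwd_kernel[grid](e, g, out, n, BLOCK_SIZE=1024)
    return out
```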
We also released cross-entropy loss kernels — the code is a bit combined together, so it's harder to show in a single screenshot — but cross-entropy loss was very interesting as well. It was mainly about understanding the maths behind it: how do we derive the derivatives, how do we do this efficiently, and we provided some of that in Triton. As new models got released we had to extend it: logit softcapping for Gemma, logit scaling for the Cohere model, so we had to edit the cross-entropy loss kernels to make those work too. One of the biggest wins was moving the casting of the final linear layer: when you do X times W for the LM head, you'd normally upcast the logits to float32; instead you can move that cast into the cross-entropy loss kernel itself to reduce memory usage dramatically — the logits just get cast dynamically inside the kernel.

So now I'll move on to pre-trained models. Many models got released — Llama, Mistral, Gemma, Phi, Qwen, Cohere, many models from different companies and organizations — and one of the things we found is that there are a lot of bugs in them. It's not really the model creators' fault. When you pre-train a large model there are always going to be issues, because the organization that created the model is very large, and some issues can't be caught by the whole organization — you need one person to look through the whole codebase and the whole system to see whether anything is off.

For example, Gemma: this was actually the first bug fix we ever did, when we added Gemma support. We found that Gemma had issues with the approximate GELU — you're supposed to use the approximate GELU, not the exact one. We also found a RoPE fix at the same time: RoPE was not being done in float32, it was done in bfloat16 or float16, and we found that Llama and Gemma implementations should not use lower-precision RoPE — if you do, you lose accuracy at long-context fine-tuning. So definitely don't do that. We posted around seven or eight bug fixes for Gemma, around March this year — quite a while back.

So, Daniel, I'm sure a lot of people have been curious to ask you: how did you go about finding these in the first place? That's a great question. The first thing you have to do is read the codebase — when they release a model implementation, we take a skim through it: is there anything interesting in this model architecture? That's also why we do the model analyses — we analyze models and then post them. Once you go through the analysis you start asking, wait a second, why is this line like that? Why does, say, the original implementation set approximate GELU to true but the PyTorch implementation sets it to false? We essentially read every single implementation that the companies, or Transformers, or any organization launches, compare the implementations side by side, and see why one line is different from another. After all that analysis, we write up what we hypothesize is the correct version, and then we test it: what is the L2 norm of the difference between each implementation? It's a long process, but in general, if you can read through the different implementations and see which line is wrong, you will find these bugs. It's actually not that complicated — once you get used to reading a lot of Transformers code, you see these little issues. The search space is bounded, which I think is interesting.
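A quick way to see those two Gemma issues numerically — this is just my own sanity-check script, not the actual comparison Unsloth ran — is that exact and tanh-approximate GELU differ slightly, and RoPE angles computed in bfloat16 drift away from the float32 reference as the position index grows:

```python
import torch

x = torch.randn(4096)

# exact vs tanh-approximate GELU: small but real differences, so the
# fine-tuning code has to match whatever the model was pretrained with
gelu_exact  = torch.nn.functional.gelu(x, approximate="none")
gelu_approx = torch.nn.functional.gelu(x, approximate="tanh")
print("GELU max diff:", (gelu_exact - gelu_approx).abs().max().item())

# RoPE angles = position * inv_freq: in bfloat16 the absolute error grows
# with the position index, which is why long-context fine-tuning suffers
positions = torch.arange(8192, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 64, 2, dtype=torch.float32) / 64))
angles_fp32 = positions[:, None] * inv_freq[None, :]
angles_bf16 = (positions.bfloat16()[:, None] * inv_freq.bfloat16()[None, :]).float()
print("RoPE angle max diff:", (angles_fp32 - angles_bf16).abs().max().item())
```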
Okay, there's one long question in the Q&A — maybe let me just read it for you. "I'm a bit confused about this partial upcasting inside the kernel: if you upcast only one of the two operands, I'm pretty sure the compiler will automatically insert an upcast anyway."

Interesting point. Sometimes upcasting or downcasting is handled for you, but sometimes the operations don't actually work. Let's go back to that slide — for example, this one. These tensors were originally in float16 or bfloat16, because we're using mixed-precision training, so everything is in float16 or bfloat16, and you have to manually upcast them to float32 where it matters. For example, the sigmoid: you must upcast to float32 because the operation effectively doesn't exist in float16 — I'm not sure about newer Triton versions, but you must upcast it to float32, otherwise Triton will crash; I'm not 100% sure. For the other ones — general multiplications — they are not in float32, so you have to force the Triton compiler, tell it explicitly to upcast or downcast this, and some of them we deliberately don't upcast. The main point is: inside the kernel, because we're using mixed precision, all of the tensors are in float16 or bfloat16, so you must upcast manually to float32 where it's needed. If that answers the question.

Continuing on: when Llama 3 got released, that was also very interesting — this was around April, I'm not sure you guys remember, Llama 3 was just April. We found some issues in it as well. There's a base model and an instruct model, and we noticed the base model actually had untrained tokens — the Llama team probably accidentally set some token embeddings to zero. During fine-tuning it's probably not a good idea to have tokens set to zero, because fine-tuners don't know that some tokens are zero, and sometimes what people do — they shouldn't, it's more of a user error — is take the base model and use the Llama 3 chat template to fine-tune it. We noticed you're not allowed to do that, because the end-of-turn tokens and the start-header tokens are all zero in the base model, and if you do that your gradients will become NaN. So don't do that — in Unsloth we warn the user, actually we error out, and tell the user that if you're using the Llama 3 chat template on the base model you will get NaN gradients, so definitely do not do that.
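A sketch of the kind of check this implies — not Unsloth's actual implementation — is to look for embedding rows that are exactly (or almost exactly) zero and refuse to train on chat templates that reference those token ids:

```python
import torch

def find_untrained_token_ids(model, eps: float = 1e-16):
    """Return token ids whose input-embedding row is (almost) all zeros.
    `model` is assumed to be a Hugging Face PreTrainedModel."""
    embed = model.get_input_embeddings().weight        # (vocab_size, hidden_size)
    row_norms = embed.detach().float().norm(dim=-1)
    return torch.nonzero(row_norms <= eps).flatten().tolist()

# usage sketch: error out if the chat template references an untrained token
# bad_ids = set(find_untrained_token_ids(model))
# assert not (set(template_token_ids) & bad_ids), "template uses untrained tokens"
```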
We now provide a check like that for all models — Llama, Mistral, Gemma, Phi, any pre-trained model — we check internally in Unsloth whether you're using these untrained tokens and tell you not to. This is not just a problem for Llama; it's a problem for many pre-trained models.

When Phi got released there was also an issue: the sliding window was 2047, but it's actually supposed to be 2048. It might sound silly — it's just plus one — but it could affect some fine-tuning runs, so we definitely wanted it fixed; I'm not sure whether they've fixed it everywhere now. From what I understand, they had a "plus one" in their own codebase, so it was effectively 2047 plus one there, but the Transformers implementation did not copy that exact code, and there you need the actual value — in this case a power of two. It might be coincidental that it's a power of two, but the right number was 2048; we verified with the Phi team that it really is 2048.

Also for Phi, they used a very interesting architecture: they fused all the MLP weights together, and they fused Q, K and V into one large matrix. We found that if you do LoRA fine-tuning, you need to unfuse them — split the QKV back into three separate matrices — and if you don't, your accuracy will be lower than if you fine-tune an unfused model. The reason is that when the matrices are fused, LoRA gives you only one A matrix for all of Q, K and V — the B matrix is fine — but each of Q, K and V should have its own A matrix, so definitely unfuse them.
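A minimal sketch of that unfusing step, with an assumed weight layout (Q rows first, then K, then V — check the actual checkpoint before relying on this; it is not Unsloth's or Phi's exact code):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def unfuse_qkv(fused_qkv: nn.Linear, q_dim: int, kv_dim: int):
    """Split a fused [Q; K; V] projection into three nn.Linear layers so each
    projection gets its own LoRA A matrix."""
    w = fused_qkv.weight                              # (q_dim + 2 * kv_dim, hidden_size)
    q_w, k_w, v_w = torch.split(w, [q_dim, kv_dim, kv_dim], dim=0)

    def to_linear(weight):
        lin = nn.Linear(weight.shape[1], weight.shape[0], bias=False,
                        dtype=weight.dtype, device=weight.device)
        lin.weight.copy_(weight)
        return lin

    return to_linear(q_w), to_linear(k_w), to_linear(v_w)
```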
Chat templates were also a problem we found. Llama, Mistral and many other pre-trained models all have these tokenization issues. Tokenizers are a bit unfortunate for language models — that's probably one of the biggest issues for language models currently. Tokenizers are a good temporary fix for most problems: it's genuinely hard to take a piece of text and decide how to chunk it. You could go character-level, which is probably not a good idea; you could go byte-level; or you could use current tokenization methods like BPE. It's a good temporary solution, probably not the future, but the practical issue is that there are so many tokenization problems in current pre-trained models. For example, on the left, the sun emoji shows the spaces in each tokenization — you can see Llama 2 is a bit different from Mistral v1, Mistral v2 had a different tokenization, and v3 a different one again. The question is which one is correct, because if you don't select the correct tokenizer during the tokenization stage, you can screw up the fine-tune. So you definitely need to look at the tokenizer and figure out which one is the correct one.

There are a lot of issues in tokenization and chat templates, and it gets more complicated when you want to export to GGUF: GGUF and llama.cpp have their own tokenization code, and that causes even more problems, so definitely watch out for that too. More tokenization problems: when Llama 3.1 got released, we noticed that the tokenizer did not add a BOS token by default — this was a few hours after its release — so we worked with Hugging Face to add a BOS token by default in Llama 3.1; you're supposed to add one by default. Mistral NeMo also had some untrained token problems. It was also very fun posting these model analyses on Twitter — both the Llama and Mistral ones were packed full of information; it was tough in Paint trying to squeeze everything into the diagram, but very fun.

Now moving on to bitsandbytes. In Unsloth we provide QLoRA, and we make it two times faster with 70% less memory. We also had to consider that people are going to download the float16 weights, and whether we could somehow make that process better, so we decided to start uploading models to Hugging Face ourselves: pre-quantized bitsandbytes models, GGUFs, and many models from different model providers. We find that for users of Unsloth, it becomes much easier if we provide our own uploads as well.

A big issue that we're currently trying to solve: when you finish fine-tuning with QLoRA, you have the NF4 weights — the weights get quantized down to NF4, which is a data format — and the question is what to do when you serve these models. One option is to upcast the NF4 weights to float16 and then add back the LoRA weights. Another method, which we know some people do, is to take the actual float16 base weights and add the LoRA adapters onto them directly, so you never do the upcasting step. We're still investigating which one is better, but we suspect the first one might be a bit less ideal. The main reason is that during NF4 conversion there are some large outliers — language models have large outliers, according to the bitsandbytes work and the QLoRA paper — and if you round-trip through the quantized format you might screw up the outliers. So we think that taking the original, untrained float16 base weights and literally adding the LoRA adapters on top might actually increase accuracy, and we're going to push that out into Unsloth maybe in a few weeks.
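The two merging strategies come down to a couple of lines of arithmetic. This is a sketch of the standard LoRA merge with alpha/r scaling, not Unsloth's exporter:

```python
import torch

@torch.no_grad()
def merge_lora(w_base_fp16, lora_A, lora_B, alpha, r):
    """Option 2 from the talk: take the original (never-quantized) fp16 base
    weight and add the LoRA update on top.
    lora_A: (r, in_features), lora_B: (out_features, r)."""
    return w_base_fp16 + (alpha / r) * (lora_B @ lora_A).to(w_base_fp16.dtype)

# Option 1 would instead dequantize the NF4 weight back to fp16 first and add
# the same (alpha / r) * B @ A update -- that round-trips the outliers through
# the 4-bit format, which is the variant described as possibly less ideal above.
```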
After fine-tuning, we noticed that people have lots of issues with exporting models. The fine-tuning process itself is okay, but how do you actually run the model afterwards? We let people run the models in native Transformers, and also in vLLM, llama.cpp, Ollama and more.

We also found that if you're in a Colab and you save to safetensors, it makes the saving process about five times slower, so you actually want to save in the PyTorch .bin style — that was interesting, and it made saving much faster. We also provide different quantization methods for people to push to Hugging Face — pushing to Hugging Face was a very highly requested feature, which we added — and you can push multiple quantization methods at once; for llama.cpp, for example, there are different precisions like int8, int4 and so on, so we let people push different versions of the model to Hugging Face as well. We also worked with Ollama, so you can now fine-tune a model, export it to Ollama, and use it on your local computer.

I think our biggest difference from other training packages is that we actually provide Colab and Kaggle notebooks for people to use. For example, for Llama 3.2 we have a Colab notebook and a Kaggle notebook, and the same for Phi and Gemma. Mistral Small, 22 billion parameters, interestingly enough fits in a Colab — it fits exactly in a 16GB card — so all of these fit in 16GB cards. You can do DPO, ORPO and many other methods, and they all fit in 16GB cards. Our Colab notebooks — we provide all of this to the community — and we provide Kaggle notebooks as well.

On the code side, people were interestingly complaining to us that once you finish fine-tuning, there was no way to run the model inside Unsloth itself — we didn't have inference code inside Unsloth — so around January or February we decided to add inference directly into Unsloth. Generally, these days we would suggest people use torch.compile for faster inference.
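For the torch.compile route, a minimal sketch looks something like this — assuming a standard Hugging Face causal LM; the checkpoint name is just illustrative, and actual speedups depend heavily on the setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unsloth/Llama-3.2-1B-Instruct"   # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# compile the forward pass; "reduce-overhead" targets kernel-launch overhead
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```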
But in general, when we launched inference back in January or February, it was around two times faster than native 4-bit inference inside Transformers, and last I checked it's still around 20 to 30% faster than torch.compile. Essentially we make 4-bit inference much faster for people. We leverage ideas from PagedAttention and vLLM — we run the prefill stage and the decoding stage separately — and we pre-allocate the KV cache, use in-place torch operations, and so on to make it much faster. It's very different from our fine-tuning optimizations, so Unsloth is not just a "faster fine-tuning" library; we have faster inference inside it as well.

We also provide fine-tuning best practices. In Unsloth, we noticed you can do continued pre-training. There was a paper, "LoRA Learns Less and Forgets Less" — I'm not sure if you saw it — essentially arguing that LoRA forgets less than full fine-tuning. We also show that if you use two times the rank for alpha, or at least make alpha equal to the rank in LoRA fine-tuning, this can increase accuracy dramatically, and it essentially lets you approach full fine-tuning capability as you push the rank all the way up to, say, 4,096 — if you make the rank 4,096, you should make alpha 4,096 as well. I think most people, when they did fine-tuning, would just set alpha to 128 regardless, which is not good; we found that if you set alpha relative to the rank, you get much higher accuracies.

We also show that you can do continued pre-training with QLoRA — maybe the naming isn't quite right, but essentially we noticed that if you also fine-tune the LM head and the embed_tokens — you don't need to fine-tune the layernorms — you can mimic full fine-tuning. When people do full fine-tuning, they train all the layers: the LM head, the embed_tokens, the layernorms, everything. In QLoRA you don't normally train the LM head or the embed_tokens, so in some research we did, we show that if you do train the LM head and the embed_tokens, you can mimic full fine-tuning as well. One caveat is that this increases memory usage, so be careful.

In terms of more algorithms we provide: one of the biggest things we did for LoRA and QLoRA fine-tuning was something called chained matrix multiplication. When you do LoRA you compute X times A times B, and the ordering — the bracketing — of X @ A @ B actually makes a difference to the number of FLOPs you use. If you bracket correctly in the forward and backward passes, you can genuinely reduce FLOPs, and this does not reduce accuracy: the result of X @ A @ B is essentially the same regardless of bracketing. There are obviously some tiny numerical precision differences, but they're so small that we treat the results as the same.
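To see why the bracketing matters, you can count the multiply–adds for the LoRA product with a toy script (made-up sizes; conventions here are X: (tokens, d), A: (d, r), B: (r, d)):

```python
def matmul_flops(m, k, n):
    # an (m, k) @ (k, n) product costs roughly 2 * m * k * n flops
    return 2 * m * k * n

tokens, d, r = 4096, 4096, 16          # batch * seq, hidden size, LoRA rank

xa_then_b = matmul_flops(tokens, d, r) + matmul_flops(tokens, r, d)  # (X @ A) @ B
ab_then_x = matmul_flops(d, r, d) + matmul_flops(tokens, d, d)       # X @ (A @ B)

print(f"(X @ A) @ B: {xa_then_b / 1e9:.2f} GFLOPs")   # ~1.1 GFLOPs
print(f"X @ (A @ B): {ab_then_x / 1e9:.2f} GFLOPs")   # ~138 GFLOPs
```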
During the backward pass we fuse all the operations: we show that for LoRA and QLoRA, if you manually fuse the operations correctly, you can make training much faster and also reduce memory usage, and we show through the maths why these methods end up faster.

We also released something called Unsloth gradient checkpointing, or offloaded gradient checkpointing — I'm not sure exactly when we released it, maybe March or April. We showed that if you offload the activations during gradient checkpointing to system RAM asynchronously, you can reduce VRAM usage a lot, and it only increases the training runtime by maybe 1 to 3%. Trading 1 to 3% extra time for a dramatic VRAM reduction is a reasonable deal. With Unsloth gradient checkpointing and other methods you can fine-tune a 70-billion-parameter model on a 48GB card — we're trying to get that down to 40GB as well — so for fine-tuning a 70B model you only need something like an RTX A6000; you don't need H100s. We also showed you can get around six times longer context lengths with offloaded gradient checkpointing. One surprise, when we talked to people about it, was that the code wasn't that long — it was reasonably short; you don't need to write much code to make this happen. I think the surprise was that people hadn't noticed you can simply offload to system RAM asynchronously and it just works — the interesting part is that it actually works. It's in many implementations now; it was one of our more popular algorithms in Unsloth.
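The idea fits in a short custom autograd function. This is a rough sketch of the concept — single-tensor layer output, and the CPU copy needs pinned memory to be truly asynchronous — not Unsloth's actual implementation:

```python
import torch

class OffloadedCheckpoint(torch.autograd.Function):
    """Gradient checkpointing that stashes the layer input in system RAM during
    the forward pass and recomputes the layer in the backward pass."""

    @staticmethod
    def forward(ctx, layer_fn, hidden_states):
        # copy the activation to CPU (pin memory + non_blocking for an async copy)
        cpu_copy = hidden_states.detach().to("cpu", non_blocking=True)
        with torch.no_grad():
            output = layer_fn(hidden_states)
        ctx.layer_fn = layer_fn
        ctx.save_for_backward(cpu_copy)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (cpu_copy,) = ctx.saved_tensors
        hidden_states = cpu_copy.to(grad_output.device, non_blocking=True)
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.layer_fn(hidden_states)     # recompute the layer
        torch.autograd.backward(output, grad_output)
        return None, hidden_states.grad

# usage sketch: out = OffloadedCheckpoint.apply(decoder_layer_fn, hidden_states)
```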
We also had to implement something called chunked cross-entropy. When Gemma got released, one of the issues was that the vocabulary size had become quite large, and we noticed we couldn't just use our old cross-entropy loss kernel — it didn't work anymore. The maximum block size is 2 to the power of 16 — 65,536, I think — so you have to be careful: when the vocab size is larger than 2^16, we had to split the work into multiple blocks and process them separately. Maybe it exists in other kernels too, but we essentially had to rewrite the cross-entropy loss kernel to account for large vocabulary sizes, and we called it chunked cross-entropy loss: each cross-entropy calculation is done separately in chunks. Interestingly, this also made it faster. So we wrote a kernel for that.

The next thing is not a Triton kernel anymore: training on completions, training on responses only. This was actually one of our most requested features that Unsloth didn't have — how do we mask out the instruction, or input, tokens effectively? One of the difficulties was writing a fast implementation, and in the end we just decided to use pure Python, which seems to work well — it's around 2,000 examples per second (not iterations, examples), so reasonably fast. We know other implementations don't do this efficiently: they tokenize each sentence separately, checking whether this piece is an instruction or a response, and handle each separately. The trick we use is that you don't need to do that: you just tokenize the whole thing in one shot and then fix the labels afterwards. That was one interesting thing we do. We also support multi-turn conversations — I think the Hugging Face implementation doesn't yet support multi-turn for this — so that's another thing we provide: it works on multi-turn conversations, like ShareGPT-style data.
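The tokenize-once-then-fix-the-labels trick boils down to something like this sketch (finding the assistant spans from the chat template's response markers is left to the caller; this is not Unsloth's API):

```python
IGNORE_INDEX = -100   # PyTorch's cross-entropy ignores these positions

def labels_for_completions(input_ids, assistant_spans):
    """input_ids: the whole (possibly multi-turn) conversation, tokenized once.
    assistant_spans: list of (start, end) token-index pairs covering the
    assistant responses. Everything else is masked out of the loss."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# usage sketch for a two-turn conversation:
# labels = labels_for_completions(input_ids, [(s1, e1), (s2, e2)])
```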
A recent algorithmic fix — I don't know if you've been following — was the gradient accumulation fix. Joey posted a nice picture showcasing what the issue was. As a story: the theory was that people believed gradient accumulation was mathematically equivalent to full-batch training. What is gradient accumulation? The trick is that it lets you reduce VRAM usage by chunking the batch into smaller pieces, putting each smaller piece through the language model, getting the gradients on the fly, and adding them in place — so you reduce VRAM usage. Benjamin was the one who found the bug — and the bug is actually about four years old — but the point is that you have to get the denominator right: in the cross-entropy loss calculation, the denominator was not correct, and you can see on the right that the losses come out much higher than with full-batch training.

So what happens in each case? On the left-hand side, full-batch training is very simple: you take, say, four sequences, shove them through the network, get the gradients, and do the optimizer step. Not complicated, but it uses much more memory. The trick for gradient accumulation is that you split the mini-batch into steps: you do text one first and get its gradients, then text two, and you add text two's gradients in place onto text one's, and so on. Once you've accumulated all the gradients, you do some maths tricks and then the optimizer step.

Gradient accumulation is extremely popular, especially for GPU-poor people, because it lets you mimic full-batch training without increasing VRAM usage at all. But the problem was that during the cross-entropy loss calculation the denominator was not correctly implemented. If you do an L2-norm analysis, the difference between fine-tuning with batch size 8 and gradient accumulation 2, batch size 4 and accumulation 4, and so on, needs to match batch size 16 with accumulation 1 exactly — because gradient accumulation is supposed to match full-batch training, batch size times accumulation steps is held constant. We found that with standard gradient accumulation, the error actually increases as you use more accumulation steps; with the fix, the error drops to something very small. There is still a little error, because accumulating gradients has some floating-point precision issues, but essentially the fix reduces the error by a lot.

This is already in Hugging Face — we worked with the Hugging Face team to fix the gradient accumulation bug. At first, when I posted about it, I thought it wasn't that serious a problem; after we did more experiments we found it's actually very problematic. For example, these are training losses for Llama 3.2 — 1 billion or 3 billion, I think — on the WikiText dataset. Without the fix, with gradient accumulation steps of 32 and a batch size of one, the training loss goes haywire — it's really bad, around a thousand times larger — and that will screw up the training process. With the fix you get the left-hand side, and the right-hand side, with an actual batch size of 32, is the desired result — that's what you're supposed to get. Without the fix you get large training losses, so the problem was much bigger than I expected. We're working with many trainers to fix this issue: Hugging Face already has the fix, Unsloth already has the fix, and we know some other trainers have this problem as well — we're more than happy to work with them to solve it too.

If you graph out the actual training losses, the gradient accumulation problem is really bad — and this plot is not scaled, it's the actual loss. The red line is the broken gradient accumulation; the yellow and blue lines use the correct fix. The blue line is gradient accumulation of 32, and the yellow line is just a batch size of 32 with no gradient accumulation. You can see the yellow line stops — that's because with a batch size of 32 you go out of memory; a batch size of 32 needs a lot of GPU resources.
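In code, the fix is just a change of denominator. This is a simplified sketch of the idea (ignoring the causal-LM label shift), not the exact Hugging Face or Unsloth patch:

```python
import torch.nn.functional as F

def microbatch_loss_broken(logits, labels, accumulation_steps):
    # old behaviour: mean over *this* micro-batch's tokens, then average over
    # accumulation steps -> micro-batches with few tokens get over-weighted
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100, reduction="mean")
    return loss / accumulation_steps

def microbatch_loss_fixed(logits, labels, total_tokens_in_accumulated_batch):
    # fix: sum the per-token losses and divide by the number of non-masked
    # tokens across the WHOLE accumulated batch -> matches full-batch training
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100, reduction="sum")
    return loss / total_tokens_in_accumulated_batch
```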
So the trick is: instead of using a batch size of 32 and going out of memory, you can use gradient accumulation to simulate full-batch training in a mathematically equivalent way. Another thing we know some trainers did is use unnormalized loss — that also works, in the sense that it also removes the gradient accumulation problem, but the loss values go haywire, so probably don't do that; don't use unnormalized loss in your training runs. It does show that if you remove the denominator from the equation, all of the loss curves match up, which is as expected — that part is correct — but you don't want your cross-entropy loss to read 12,000 or 14,000, so don't use unnormalized loss.

We also do collaborations, events and work with organizations — we collaborate with lots of people. We did an AI Engineer World's Fair talk about the low-level technicals of LLMs — it's quite long, about three hours. We also went to the PyTorch Conference, which was very fun, talking about making things faster and about LLMs and their future. We communicate with the PyTorch team — we're trying to get torchao into Unsloth, and FlexAttention as well — and we really like collaborating with the whole community. We have a GitHub package, so please check it out; we provide Colab notebooks; and we have a Discord channel if you want to ask questions — we're on there. So the talk has finished — thank you a lot for inviting us. I'm on Twitter as Daniel Han, where I post a lot about model bug fixes and making things faster; we also have an official company account, UnslothAI. It's still a two-person team — just me and my brother — so if you want to have fun writing kernels and finding bugs and issues, we're looking for people to join in on the fun as well. Those are the slides — thanks a lot.

Sweet, thank you so much Daniel. If people have any questions for Daniel, please start typing them into the Q&A; in the meantime I'll read some out and ask my own. So, there are two questions around tokenization: why don't people train some of these tokens for the base model? And, when working with an even larger dataset, how can we scale a custom tokenization function that would work best for our specific data?

For the first question: technically, if I had to give advice to model trainers, then yes, for base models you should train the chat template that you're going to use for the instruct model, because it's really a user-error issue — most people will take the base model and use the chat template the model providers supply, so you should train the instruct model's tokens inside the base model too. But it's not strictly necessary. The main reason model trainers don't train the chat-template tokens in the base model is that the base model is not an instruct model, so you don't actually have to train those tokens. As a caution for users, we provide a check in Unsloth, but in general I'd suggest model trainers maybe train some of those tokens — at least make them non-zero, because setting them to exactly zero is the problem. Maybe a compromise is to set them to the mean — set them to the mean of all the LM head rows and the embed_tokens rows; that's one way.
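That compromise — setting untrained rows to the mean of the trained rows instead of leaving them at zero — looks roughly like this. A sketch assuming untied lm_head and embed_tokens; it is not Unsloth's code:

```python
import torch

@torch.no_grad()
def set_untrained_tokens_to_mean(model, untrained_ids):
    embed = model.get_input_embeddings().weight        # (vocab_size, hidden_size)
    lm_head = model.get_output_embeddings().weight     # same shape if untied
    trained = torch.ones(embed.shape[0], dtype=torch.bool, device=embed.device)
    trained[untrained_ids] = False
    embed[untrained_ids] = embed[trained].mean(dim=0)
    lm_head[untrained_ids] = lm_head[trained].mean(dim=0)
```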
I think the second point was about tokenization — should we do a custom tokenizer, adding tokens back into the model? Is that the question? Yes, that's my understanding. So, we generally suggest people not add tokens for the fine-tuning run. One problem with adding tokens is that it increases memory usage dramatically: if you add tokens, you must train the LM head and the embed_tokens, which means you have to get gradients for them and upcast them to float32, and that dramatically increases VRAM usage. Another problem is that it can effectively damage the model: it changes the model's behaviour by changing the distribution over tokens — you're reducing the impact of some of the instruct model's tokens, because the gradient updates touch all of the token embeddings. So maybe don't do that. We normally suggest people not add tokens and just rely on the byte fallback — all of these models use BPE with byte-level fallback, so if you have some new language, just use the tokenizer itself to tokenize it. It won't be very efficient — you'll use more tokens — but in general I'd suggest people not train new tokens.

Got it. So Sam is asking: what's the compromise for gradient accumulation — is there a performance hit? Yeah — when Benjamin first posted the gradient accumulation problem on Twitter, we thought it wasn't going to be that serious. But once we investigated more and worked with Hugging Face to solve the problem, it was actually much more serious than we thought. One of the problems we found is varying sequence lengths: during fine-tuning runs there can be very long sequences and very short ones, because when people chat with a language model, most chats are short but some go very, very long. The main problem we found is that with the broken gradient accumulation, long sequences were penalized — we were essentially under-weighting the long conversations and over-weighting the short ones — and maybe that's one of the reasons why long-context fine-tuning on these models was not very effective.
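A two-sequence toy example makes the weighting issue obvious — averaging per-micro-batch means over-weights short conversations and under-weights long ones:

```python
import torch

short = torch.full((10,), 2.0)     # 10-token conversation, per-token loss 2.0
long  = torch.full((1000,), 0.5)   # 1000-token conversation, per-token loss 0.5

full_batch   = torch.cat([short, long]).mean()   # ~0.515: one mean over every token
broken_accum = (short.mean() + long.mean()) / 2  # 1.25: mean of per-sequence means
print(full_batch.item(), broken_accum.item())
```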
One of the problems was that when people used gradient accumulation, the long sequences were essentially not even counted in the fine-tuning run, and that's why you must fix this issue, especially for long-context fine-tuning. In terms of a compromise fix, the only one we found is to use unnormalized cross-entropy loss — that's one way, but it screws up the fine-tuning run, because you then need to divide the learning rate by a large amount, so it's probably not a good idea.

So I guess what I found interesting about the gradient accumulation bug: if I compare it to some other bugs you've found — for example some of the Gemma bugs — those seem a bit more innocuous, because how many people are actually downloading the model a few days after release and then fine-tuning it? But this one seems like an issue that was hidden in plain sight, yet so far-reaching that it's easy to imagine it should have been caught a long time ago. I know there may be some speculation in your answer, but I'm curious if you could speak more to that.

Yeah — this was one of the first bug fixes we did that affected many trainers; it's not just a Transformers problem, it affected many trainers throughout the entire ecosystem. The theory at the very beginning — I think there was a paper released around 2017, and I think Teknium posted about it — was that you can use gradient accumulation: the paper showed, and even in the paper it was only an approximation and arguably not quite right, that you can accumulate the gradients and divide by one over the number of accumulation steps, and that roughly approximates full-batch training. The biggest issue is that nobody expected to apply gradient accumulation to language models, and sequence models like language models have varying sequence lengths — that was the key difference. It was an uncaught bug for, I think, about four years; some people reported the issue on Transformers around four years ago. The difference is that no one bothered to check: if you do this for a CNN, or any model that isn't a language model, the problem basically doesn't arise, but for language models it does. People just didn't check why the losses weren't matching. I think when people first wrote the Transformers implementation — and when people wrote the training code for other trainers — they took it as gospel that this should be fine, and no one bothered to actually check that one part of the equation was incorrect.
So I wouldn't blame this on anyone — it's not really anyone's fault. It was a common misconception that gradient accumulation was equivalent, when it actually wasn't. But now it is: after the fix it really is mathematically equivalent. If that kind of makes sense. Yeah, it does — I think there's a lot of value in writing out the math, and you seem to do that quite a bit to guide the intuition.

Okay, so Mark, from his iPhone, is asking: they're trying to speed up attention for a model, they're not doing anything fancy, no research — should they stick with flash attention, or should they try FlexAttention? Yeah — if you're not doing anything fancy, just use scaled dot product attention from PyTorch. I think for the new PyTorch 2.5 release they can use cuDNN's fast implementation, so definitely use scaled dot product attention if you aren't doing anything fancy. Flash attention is fine, but its biggest issue is dependency hell: installing it is quite complicated, and we find lots of users have problems installing it — that's the problem. So I'd suggest people just use scaled dot product attention. FlexAttention is another very good option — highly suggested as well — and you can also customize the attention mechanism inside FlexAttention, so definitely look at that. But if you don't have any customizations, just use scaled dot product attention.
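For reference, the non-fancy path is literally one call — PyTorch picks a fused backend (flash, cuDNN or memory-efficient) for you. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```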
Yeah, and I have a pain point near and dear to my heart, which is that building CUDA extensions across multiple Python versions, CUDA versions, and OSes is kind of a lottery; a big part of what makes this work in the PyTorch CI is that we basically build it on everything. Speaking of which, is this one of the reasons why, when you were starting Unsloth, and I know you worked at NVIDIA briefly and Triton was newer then, you picked Triton? Was the build story a big motivation, or was it more the user experience?

That's a great question. Back at NVIDIA it was fun writing CUDA kernels. Actually, when I joined NVIDIA I did not know how to write CUDA kernels; I knew how to call cuBLAS and cuSPARSE operations, but not how to write CUDA itself, so I spent a few months just learning how to write CUDA. It was painful at the beginning, but the learnings from writing CUDA were very important. There's a common misconception that Triton is very easy. It is relatively easy, but I think it's good to have CUDA experience before you start writing Triton. The main reason we chose Triton was that we were unsure whether NVIDIA GPUs would always be the only hardware. Obviously right now everyone uses NVIDIA GPUs, but there could be some new hardware, or AMD, or Intel, so we selected Triton as an intermediate layer: a more general language that compiles down to the other backends, so we don't have to write only CUDA code. If you write CUDA you can still get more speedups, maybe 10 to 20% more, but Triton is good enough. We generally rely on torch.compile, and on the Triton team to keep putting optimizations into the compiler and making it better, so it was really a bet on future-proofing.

Yeah, that sort of makes sense, because a lot of the optimizations you talked about are also algorithmic in nature, not purely systems-level, and if you're doing math then sticking to Python seems like the easier choice. All right, this is more gossip; chat is heading towards gossip now. Have you been watching the sampling methods that xjdr is exploring with the entropix codebase?

Oh yes, I saw something like this on Twitter. I've not actually explored it, but it's very interesting. I think optillm posted on Twitter that it still does worse than chain-of-thought. I did investigate a little bit, but not that much; I'm going to do that over the weekend. It does increase accuracy a lot, so it's relatively interesting, but I'm not the best person to ask about that.

All right. Okay, we have Warren saying he's been working on it; Warren, if you have any comments, please post them in chat. Eric is asking: I haven't really looked at the math yet, but if gradient accumulation is not equal to mini-batch training, do we also get that the expectation of the mini-batch gradient is not equal to the full-dataset gradient?

Yeah, so I think there are two parts to that question. Maybe let me backtrack first. The theory was that gradient accumulation is equivalent to full-batch training: if you use a full batch size of 128, it should be the same as using 128 gradient accumulation steps with a batch size of one. That was the theory, but it was incorrect. With the old gradient accumulation methods, it was actually not equivalent to full-batch training. After the fix, if you divide by the correct denominator, it is equivalent, except for some floating-point accumulation error; we do show there is an L2 norm difference of around 0.01 from accumulated floating-point error. If you use float32, then after the fix it is equivalent; before the fix it was not.
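To make the denominator point concrete, here is a minimal sketch of the difference, assuming a Hugging Face-style causal LM whose output exposes `.logits` and labels that are already shifted and padded with -100; the function names and the micro-batch dict layout are illustrative, not Unsloth's or Transformers' actual internals.

```python
import torch
import torch.nn.functional as F

def loss_sum_and_count(logits, labels):
    # Sum (not mean) of per-token cross entropy, ignoring padding labels of -100,
    # plus the number of tokens that actually contributed.
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=-100, reduction="sum",
    )
    return loss_sum, (labels != -100).sum()

# Old behaviour (the long-standing bug): take the *mean* loss of each micro-batch and
# average those means across accumulation steps. That only equals full-batch training
# when every micro-batch happens to contain the same number of un-padded tokens.

def accumulate_fixed(model, micro_batches, optimizer):
    # Fixed behaviour: normalise once, by the total token count of the whole
    # accumulated batch, so the summed gradients match the full-batch gradient.
    total_tokens = sum((mb["labels"] != -100).sum() for mb in micro_batches)
    optimizer.zero_grad()
    for mb in micro_batches:
        logits = model(mb["input_ids"]).logits   # assumes an HF-style model output
        loss_sum, _ = loss_sum_and_count(logits, mb["labels"])
        (loss_sum / total_tokens).backward()
    optimizer.step()
```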
As for the other part of the question, true full-dataset training where you shove the whole dataset in at once: I don't think people do that. Nobody shoves all one trillion tokens in and makes that the batch size. The main reason is speed, and you're also not going to get many gradient updates if you shove everything into one batch; the model probably won't even train that well. So you definitely have to shrink the batch sizes. I think Llama used something like a 1-million-token batch size. So don't use overly large batch sizes, otherwise your model won't train well at all.

Yeah, for what it's worth, some other hardware vendors have looked at this. When I worked at Graphcore, the research team there was investigating how you get much better convergence if you can do batch-size-one training; it's just that on GPUs that's not a very good choice. I think Yann was sort of plus-one-ing some of this work. All right, more questions; you have a lot of questions, by the way. RT is asking: are we supposed to use gradient accumulation from Hugging Face now if we want checkpointing plus wandb logging to work in the training args? I've never used Unsloth, but are wandb logging and checkpointing there if I change to the Unsloth trainer?

Yeah, so we implemented the fix in Hugging Face a few days ago, so if you install the nightly branch of Hugging Face Transformers, the fix should be there. For Unsloth we provided the fix on day one, but it was pure PyTorch code, so there was no wandb logging or any logging at all. I pushed a commit yesterday to install the latest nightly Transformers version, so now there is logging and you can use Transformers directly inside Unsloth as well. So the fix is in nightly Transformers; there's no need to use our trainer, but if you want faster training and so on, we do have the fix internally. If you just use the plain Transformers code I think there are still some issues, and we're still working with the Hugging Face team to fix them, so if you want a stable version of gradient accumulation, for now you should probably use the Unsloth trainer. But it's not just a Hugging Face problem. If you use other trainers, I don't know if they're fixed, so please don't ask me whether a particular trainer is fixed. There are a lot of training libraries, and if they implement gradient accumulation, most of them will have this problem. I'm uncertain whether other packages have fixed it; I do know firsthand that Transformers has, because we worked with them on it, and obviously Unsloth has the fix as well.

So Tanish is asking: have you checked bitnet.cpp, and what are your thoughts on it? Yeah, I think it was released yesterday, and it was pretty cool.
I think they also released the models along with it, the 1.58-bit models. That was the more interesting part to me: wait a second, they actually released the models they did the training runs with using this 1.58-bit scheme. And they showed that accuracy is very similar, which is very interesting. I think this is CPU-only for now, right? I haven't checked it too closely, so maybe I'm misrepresenting it. It's typically CPU-only, yeah. So obviously this would be very interesting on GPU as well, but for now it's relatively complicated to do on GPU; maybe you could pack it somehow into the tensor cores, but it's probably not going to be that effective on GPUs. Maybe if NVIDIA adds support for this in newer hardware. I'd fall back to FP4: FP4 in the new Blackwell chips is very interesting and could be a temporary intermediate solution. But the 1.58-bit idea is genuinely interesting; you can essentially fit very large models on CPUs only, and the training losses don't seem that bad. Obviously there is a gap between full float16 training, or even float8 training, and the 1.58-bit version, and we're not sure about its true capabilities yet. Is it going to be useful for long context or long conversations? Is it going to forget things? We don't know, so it's very good that the team released this into the wild, and I guess it remains to be seen. I'll be trying it out over the weekend as well. But I think the biggest thing is that they actually released the model. I was expecting them not to; I kept wondering why it was taking so long for them to release it, and finally they did, so that was very insightful.

Yeah, and if anyone is interested in pre-training BitNet-style models on GPU, there's an experimental implementation in torchao. We're also not sure how useful it is, but it's there if you're curious about experimenting.
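For intuition on what "1.58-bit" means: the weights are restricted to the three values {-1, 0, +1}, and log2(3) is about 1.58 bits of information per weight. Below is a minimal absmean-style ternary quantizer as a sketch of the idea only; it is not the reference bitnet.cpp or torchao code, and the per-tensor scale is one of several possible choices.

```python
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    # Absmean-style ternary quantisation: every weight becomes -1, 0 or +1,
    # sharing a single per-tensor scale.
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(4096, 4096)
w_q, scale = quantize_ternary(w)
print(w_q.unique())                    # tensor([-1., 0., 1.])
print((w_q * scale - w).abs().mean())  # mean absolute quantisation error
```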
Okay, I'll take maybe three more questions and then we'll let Daniel get back to work. Mark is also asking: have you seen lolcats, and would it be something you're interested in integrating into Unsloth? It's the linearizing-LLMs work, basically linearizing attention.

Oh yes, I remember that one. Maybe someone should put in a feature request; I'm not exactly sure yet. For now we want to focus on getting more model support. One of the things on our roadmap is Apple support, and we do need help with that. Once Apple support is done, we want to write a UI and do some extra fine-tuning work like vision support, and maybe we'll add something like lolcats after that. So maybe someone should make a feature request first and I'll look into it. I think I did a Twitter post about it, but I haven't investigated the package yet, so I'll add it to the weekend to-do list; there's a lot to do this weekend. It is very interesting, linearizing attention, but my take is that attention is still going to be here to stay. The trick with attention is that you can see the entire context in one go, and if you linearize attention you definitely lose something. We can't really say exactly what you lose, and the experiments show it's generally fine, but in general, for example on LocalLLaMA on Reddit or on Twitter, when people try these linearized-attention-style mechanisms, the quality just seems worse than full attention. The trick, if you read for example the Character.AI blog about how they do faster inference, is to interleave attention: something like six sliding-window attention layers with a very small window, maybe 2048 or 4096, and then some sort of large global attention layer. That seems to work very well. So maybe you could do a combination: linearized attention, but with one global attention layer added somewhere. I think you need global attention somewhere; that's my take.

I think your instinct is correct, and I think even the same lab later published work saying exactly what you were saying: you basically need the global plus the sliding window, and that will out-compete most linear-attention variants.
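Here is a sketch of that interleaved idea expressed as a single layer's attention mask: causal sliding-window attention plus a handful of "global" tokens every position can see. It is written against PyTorch 2.5's flex_attention API; the window size, number of global tokens, and tensor shapes are arbitrary choices for illustration, not anyone's production settings.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

SEQ_LEN, WINDOW, NUM_GLOBAL = 2048, 256, 4        # arbitrary illustrative sizes

def sliding_plus_global(b, h, q_idx, kv_idx):
    causal    = q_idx >= kv_idx
    in_window = (q_idx - kv_idx) <= WINDOW        # local sliding-window band
    is_global = kv_idx < NUM_GLOBAL               # a few tokens every position may attend to
    return causal & (in_window | is_global)

block_mask = create_block_mask(sliding_plus_global, B=None, H=None,
                               Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cuda")

q, k, v = (torch.randn(1, 8, SEQ_LEN, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)   # wrap in torch.compile for speed
```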
So maybe I'll ask my own question here. Something kind of interesting is that you don't have control over how large labs train their models, and they may make certain choices in training that make fine-tuning or inference slower. One piece of work I'm seeing is a kind of model surgery, where you later fine-tune a model to recover some of the accuracy degradation, and I'm wondering if that's something you've explored, and if not, why not.

So is this, for example, quantizing the weights down to one or two bits and training LoRA adapters on top? No, it's more like you change the norm pattern of a model, or the activations, or the kind of attention, and then you fine-tune to recover the accuracy loss; it's actual model surgery.

Interesting. Yes, I think some people in the Unsloth community have tried this out; I haven't tried it personally. So do you mean, for example, swapping SwiGLU for ReLU, something like that? Yeah, that could work. The trick people use is that there are two ways to go about it: you can train QLoRA, or train LoRA adapters combined with the model. I think someone actually released something about this a few days ago; one of the model releases trained LoRA adapters together with the actual change they made to the model, and you can recover the accuracy back. This also works for quantization. If you use torchao, for example, or Mobius' HQQ, which is in torchao now, and you quantize the weights down, the accuracy does degrade, but if you train LoRA adapters on top you can mostly recover it. So if you train LoRA adapters, or if you edit parts of the model and then continuously fine-tune it, you can recover the accuracy back. It's kind of interesting why this even works. Personally, if you swap out SwiGLU and replace it with ReLU, it doesn't seem like it should work, because you're truncating all the small numbers to zero, but experiments show it kind of works, maybe not all the time. Another thing people do is sparsify the network: there were research papers showing that if you set 99% of the small weights to zero, the accuracy is really, really bad, but if you fine-tune a LoRA adapter on top of that changed model, or continuously fine-tune it, you can recover the accuracy back. So yes, I think it's relatively interesting that this seems to work.

Yeah, I guess the way I'd understand the intuition you described is that LoRA is typically thought of as something to adapt the model to a different task, but you could also think of it as something that adapts a model to pretend it's a different architecture. Okay, I see; that makes sense.
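As a sketch of that recipe and not anyone's published procedure: do the "surgery" (here, swapping the gated MLP's SiLU for ReLU), then attach LoRA adapters and fine-tune so the adapters absorb the damage. The checkpoint name, the `.act_fn` attribute (how HF's Llama-style MLP typically exposes its activation), and the hyperparameters are all assumptions for illustration.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# "Surgery": swap the SiLU inside each gated MLP for ReLU.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # assumed checkpoint
for module in model.modules():
    if hasattr(module, "act_fn"):          # Llama-style MLP blocks expose the activation here
        module.act_fn = nn.ReLU()

# Attach LoRA adapters so continued fine-tuning can recover the lost accuracy.
lora_cfg = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# ...then run an ordinary SFT loop / trainer on the edited model.
```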
So there are more questions, sorry Daniel. Is Unsloth used in federated learning contexts, like Petals, Hivemind, or OpenDiLoCo-style setups? Wait, you mean distributed learning? I'm not actually familiar with those. Yeah, I think they mean actual distributed, decentralized learning. (We had a brief dropout here; both hosts got disconnected from Zoom and reconnected.) Okay, we're back; sorry folks. What was the question again? It was about decentralized training, basically. Yeah, so I guess you could use Unsloth on each device to make it faster, but we don't have an orchestration system for distributed training, with computer one and computer two doing some sort of communication between them; we don't have that yet. I think Nous Research is working on that (I always get the pronunciation wrong), and Prime Intellect did something like that. For now we don't support it, but maybe someone should file a feature request for that too.

This feels like a good community contribution for whoever asked the question. Okay, so maybe the next question, again from Eric, in reference to the mini-batch point: but the mini-batch should, in expectation, be the same as the full dataset?

Yeah, that's correct in expectation. If you use large batches, for pre-training runs it's generally fine, because you're essentially smoothing out the denominator across all of the data. But for fine-tuning runs it's definitely not good, because your batch sizes are more like 16, 32, or 128; they're not large, and even batch sizes of 1,024 are generally not that large. For smaller runs this effect is much more pronounced, and it definitely needed to be fixed. The old theory was that it's fine in expectation, but we show that after the fix it's not just fine in expectation; it's mathematically equivalent, exactly the same, if you divide by the correct denominator. That was the hidden little bug in Transformers and in the other trainers: if you don't divide by the correct denominator, you only satisfy the in-expectation part, so large batches will be fine, but smaller runs are much more problematic. And in DPO especially, people don't use large batches; you don't need million-token batch sizes there, so in DPO, reward modelling, or ORPO this effect is much more problematic. After the fix we show it's not just in expectation; it is actually equivalent, if that makes sense.

Yeah, absolutely. So Patrick is asking: have you checked DeepSeek Janus, the unified multimodal understanding and generation model? Interestingly, we don't even support the DeepSeek-Coder-style models in Unsloth yet, so that also remains on the list; I'll have to check that as well. For multimodal specifically, we're trying to get support out as soon as possible. Llama 3.2 multimodal is on our roadmap, maybe next week, I don't know. It's another thing on our to-do list, and if people want to help contribute, that would be much appreciated; we're kind of drowning in feature requests now. But yes, multimodal for Llama 3.2, maybe next week. It's really a question of how we can actually fit it on a T4 GPU; that's the main question, and that's what we're trying to do.
Any new suggestions on learning rates depending on model size for fine-tuning, or do the old rules of thumb still hold? That's a great question. The "LoRA Learns Less and Forgets Less" paper is very useful; we collaborated with them, and they cited some of the settings we suggest, like making the LoRA rank equal to the alpha and using a larger learning rate, and they show you do need the correct presets. For fine-tuning runs we generally tell people to use learning rates of around 2e-4 down to 2e-5, somewhere in that range. For pre-training runs you have to use much smaller learning rates, but for fine-tuning you generally don't need very small ones. On model sizes, that's an interesting question: can smaller models use different learning rates than large models like Llama 70B or Llama 405B? Maybe someone should write a research paper on that; I don't think there's any research on whether you can fine-tune small and large models at the same learning rate. Maybe large models can use a smaller learning rate and small models a larger one, or maybe the other way around. Honestly, I don't know.

Yeah, that very much feels like a run-the-experiments kind of thing. All right, next question, from gecko: do you think torch.compile will ever be as good as writing Triton kernels by hand?

Good question. We chatted with the torch team; there's a video we did with them as well. In general, if it's just about fusing, torch.compile is fine, so definitely do that. I'd also suggest: if you've literally just written a model in plain PyTorch, just add torch.compile. I don't know why people don't; it makes your PyTorch magically faster. Maybe one of the goals of PyTorch is to make this the default somehow, to make it so stable and usable that you can use torch.compile everywhere; I'm not sure, I'm not on the PyTorch team. The biggest difference, though, is beyond fusing. Flex Attention, for example, does have some handwritten Triton kernels inside PyTorch, and they use that template and specialize it for different cases. And for other algorithms there's a lot of math involved. Chunked cross-entropy, for example, was something where we had to think through the math; maybe that one is handled by torch.compile now.
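For context on what "chunked cross-entropy" refers to, here is a minimal sketch of the chunking idea: compute the loss a slab of tokens at a time so the full (num_tokens, vocab_size) logit matrix never exists at once. This is only an illustration of the math, not Unsloth's actual kernel, and the function and argument names are made up for the example.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=4096):
    # Cross entropy without materialising the full (num_tokens, vocab_size) logit
    # matrix in one go; labels use -100 for padding and are assumed already shifted.
    hidden = hidden.reshape(-1, hidden.size(-1))
    labels = labels.reshape(-1)
    total_loss   = hidden.new_zeros(())
    total_tokens = torch.zeros((), dtype=torch.long, device=hidden.device)
    for start in range(0, hidden.size(0), chunk_size):
        h, y = hidden[start:start + chunk_size], labels[start:start + chunk_size]
        logits = h @ lm_head_weight.t()            # only one chunk of logits is live here
        total_loss   = total_loss + F.cross_entropy(logits, y,
                                                    ignore_index=-100, reduction="sum")
        total_tokens = total_tokens + (y != -100).sum()
    # Autograd still saves each chunk's logits for backward; production kernels go further
    # (recomputation / fusion), but the chunked forward math is the same idea.
    return total_loss / total_tokens.clamp(min=1)
```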
But there are many other components that require a lot of math, and that's relatively hard for torch.compile to do. torch.compile can get you most of the way there, but there's still more work to do for the math component that's left; maybe in the future.

Yeah, what you're describing is very much in line with something else. If people go on OpenReview and check the reviews of the FlashAttention-2 paper, reviewer number two left an interesting comment, basically: why do we need this work, can't compilers just do this? And I think Tri's point was no, because compilers won't algorithmically change your math and make sure it's numerically equivalent. Once humans figure out those tricks you can always template-match them, but I don't think compilers are good at novel math; that's actually one area where maybe LLMs are a bit better, but for now humans still win. Okay, I'm going to keep going with more questions: do you think training LoRA on BERT is a free lunch, or does that not make sense to do?

Yes, you could do that. Some people have actually asked us to support BERT inside Unsloth, and we don't support it yet; one issue with BERT is that it's encoder-style, so it's a bit different from the general decoder-style models. But yes, I'd suggest people can train LoRAs on it; I don't see why not. It's a bit different in mechanics, but not in terms of results; I think the results will still be fine. LoRA is not just for decoder-style language models. It's used in diffusion models, in image models, it's used everywhere; it's a general technique that works. So yes, you should try it out for BERT.
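A minimal sketch of what that might look like with the PEFT library on an encoder-style model; the module names "query"/"value" match Hugging Face's BERT implementation, and the rank, alpha, and label count are illustrative choices only, not recommended settings.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# An encoder-style example: BERT with a 2-class classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
lora_cfg = LoraConfig(
    r=8, lora_alpha=8, lora_dropout=0.0, task_type="SEQ_CLS",
    target_modules=["query", "value"],
    modules_to_save=["classifier"],    # keep the new classification head trainable too
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```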
Okay, I'm going to ask you one last big question, and it's a batch of maybe five questions I see in chat. People really seem to like your approach, the way you go about debugging, testing, and talking about your work, and they want to work more with you. They're trying to figure out what that actually looks like: how do I contribute to Unsloth, and what kind of research can I do on top of Unsloth that would be helpful to the Unsloth community? Where should people engage with you, and can I maybe pitch this server as one place where you could also do that? So, one mega-question for you.

Yeah, definitely. We have a GitHub package, so if anyone wants to contribute, there are a lot of open issues and feature requests on GitHub; that's one way to collaborate. We have our own Discord community for Unsloth, and the GPU MODE server is definitely another place to collaborate; I'm on it as well and will lurk around sometimes, so if you want to collaborate there, that works too. We also tweet on Twitter. Unfortunately, some people DM me on Twitter and LinkedIn, or email me, to fix GitHub issues; please just use the GitHub issues directly, and if you want to escalate a problem I'll respond to that. But honestly, if anyone wants to collaborate through any of those channels, any public forum, we'll always answer. We essentially do a daily scour of everywhere, so we'll see your message, and we're more than happy to collaborate; we always welcome more contributors.

All right, I think that's probably a good time to end. Daniel, thank you so much for coming, and Michael, thank you also for being here with us. For the next two lectures we're going to cover low-bit kernels: we'll have Mam, and then Lei from the BitBLAS team at MSR, so it's going to be low-bit galore for the next two weeks. If you care about efficiency and enjoyed this lecture, I bet you'll enjoy the ones coming next. Thanks again, Daniel; thank you, Michael. Yeah, thanks a lot.