Good morning, good day, and good evening. I'm, as always, your host Brody Robertson, and today we have the developer of a project that I'm sure a lot of you have probably heard of, but maybe haven't had a chance to use just yet. Is it pronounced "zudo"? Because that's how I've been saying it.

It's a very good pronunciation. It's not 100% correct, but I don't think I can give Polish lessons to the whole world, so let's say "zuda".

You can certainly try. How would you say it if you tried to say it correctly?

Well, it's written incorrectly, but you would pronounce it "zwuda".

Ah, okay. I was never going to get to that from the way it's written. So how about you just introduce yourself and we can go from there?

So, I am Andrzej Janik. It's nice being here. I'm a software engineer. I wrote ZLUDA, I'm still working on ZLUDA, and hopefully I'll still be working on ZLUDA in the near and far future. And I have my tea, because I heard it's a requirement for this podcast.

I didn't even mention that, but yes, it works. Hey, look, if you actually bring it, you're better than most of the people, and better than me most of the time as well. So, I'm going to try not to look really shiny right now, because I'm in Australia and it is, what time is it, 9:30 p.m., and it's still 29°C.

Oh wow.

Yeah. If you notice me looking shiny, that's going to be why. I've got my fan on, but we'll see how it goes; hopefully it cools down a bit later. Anyway, enough of me complaining about the terrible weather here. I guess we can just start with what the project actually is and why it was initially created.

Oh, those are two difficult questions.

Yeah, it's a long story. Which is easier?

I think, hopefully, I'll get to the end of it before you collapse from the heat. Basically, it started because I thought I would not get in trouble writing this project, so it went wrong from day one. It started when I was working at Intel. I was working at Intel on a certain project which, okay, will remain unnamed; let's call it Project X.

Sure.

I worked on Project X with, and these two guys are important for this story, Alexander and Alexey; they're the grandfathers of this project. Some details I can provide: it was a certain GPU library for a certain workload, for, obviously, Intel GPUs. It was one of the first projects actually using Intel's SYCL implementation, called DPC++. We can think of it as sort of Intel's answer to CUDA; it's more or less a flavor of SYCL. We had a lot of discussions about how useful it is, how good it is, and in one of those discussions or arguments, I don't remember whether it was Alexander or Alexey, one of them told me something that stuck with me: okay, SYCL might be good, might not be good, but what our customers want is CUDA. They want CUDA, they want CUDA on Intel GPUs. I don't remember what we discussed in detail, but he won that discussion; he was right, I had no comeback. But the thought stuck with me. Time passed, Project X was cancelled, and I started working as a manager. As a manager I had almost no opportunities to write code, and it started to become slightly frustrating. If you don't write code, every day you get slightly more divorced from the
daily realities of writing code, especially if it's Intel GPU code, where at the time, this was pre-Alchemist (that's the public name), things were still in flux, still moving quickly. So I was getting a little more frustrated, and then 2020 came in with COVID; we were all stuck inside our homes, me included. At the same time, Intel GPU development started moving to a different host code library. Previously most of the development used OpenCL, but OpenCL is relatively high level, and if you want some kind of extensions to OpenCL, you have to, to some degree, negotiate with other vendors, so that you don't create something that is useful to you while another vendor creates something useful to them with just very minor differences. So Intel came up with something called Level Zero. This was the new premier way to write host code, and it looked fairly good; we were all just starting to use it, starting to learn it, and I wanted to know more about it. In my mind I had two choices. Either resurrect Project X, which was cancelled; I was specifically asked not to do this, not to do it in the open source, and I thought, well, okay, if I do this I get in trouble. Or there was another project I thought would be interesting: implementing CUDA on top of Level Zero. My thinking at the time was: what I really wanted was to learn Level Zero, understand how it works, what I can do with it that I cannot do with OpenCL. And if it works, well, it's probably not going to work, but if it works, it's going to be very cool. So I started working on it, I worked on it through most of 2020, and that's how ZLUDA came to be. Surprisingly it worked, surprisingly it reached decent performance, and I released it sometime late 2020.

What would be decent performance in your mind? If you were to take an equivalent NVIDIA GPU and Intel GPU, what percentage of the performance were you getting?

Right, so this is a difficult question. It's very difficult to compare the performance, because you need a workload where you have a native CUDA implementation and an OpenCL implementation that works natively on Intel. A good target was Geekbench; that's why Geekbench was pretty much the only thing that was working. In my mind, something like 80% of the native OpenCL performance would have been really good, surprisingly good. I don't remember what the exact performance was; it was something like 90%. Geekbench is actually a suite of tests, and surprisingly some tests were even faster than the native versions, some were slower, but the performance was, I think, around 90% overall, which was much better than I expected.

Well, it's much better than the alternative, which was zero.

Yeah, yeah.

So for anyone who may be unaware of what ZLUDA actually is, at a high level, can you just explain the general purpose of the project?

The general purpose of the project is: you have your software, and your software doesn't work on your hardware, so ZLUDA makes your software work on your hardware.

That is... okay, maybe a bit lower level than that.

No, no. So, to understand what this project is about: it specifically targets applications, libraries, plugins, whatever, using CUDA. And the reality is, you have your GPU, and maybe you use your GPU for gaming, which is horrible. Why would you do this?
GPUs were meant to multiply matrices. GPUs were meant to calculate chemistry and physics. So if you have a gaming workload, if you have a game, then you're going to use a specific API to do the things that games do, I don't know what they do, generate polygons.

Yeah, things like DirectX, Vulkan, OpenGL.

But if you want to do computations, physics simulations, chemical simulations, machine learning, whatever, then you have specific APIs for compute, and by far the biggest one is CUDA. And the thing about CUDA is that it's written by NVIDIA, created for NVIDIA GPUs. But it turns out, and this is what ZLUDA does, you can implement this sort of runtime environment of CUDA on some other GPU from another vendor. You can use your application on another, non-NVIDIA GPU, and from the perspective of the application it's a normal NVIDIA GPU. It's a slightly strange one, it has a little bit of a strange configuration, some things don't work, but from the perspective of the application it's the same. It's similar to how Wine works: on Linux you can run Windows applications.

There was something I was going to say, and, okay, I completely blanked on it. A lower-level explanation: those of you who wrote CUDA might be thinking, okay, how is this possible? Because if you write CUDA, well, CUDA is many things, but other than a runtime environment it's also a programming language. You write your program in CUDA, it's a dialect of C++, and you can mix GPU code and CPU code. But this is an illusion, because what happens at the end, at the sort of bottom level, and that's where ZLUDA lives, is that you have a normal application which calls into nvcuda.dll or libcuda.so. It's a normal application which uses certain functions provided by a certain library. There's no magic. And if you implement this runtime driver library, and that's what ZLUDA supplies, then you have it. This library can be relatively complex, because of what it has to do. What the runtime library does is allow you to control your GPU: you can query your system, okay, I have GPU number one with this name, with this much RAM, with this many FLOPS, I have GPU number two or three; on this GPU I can allocate memory; on this GPU I can execute a computation. The runtime also contains a compiler for a virtual assembly. This is a bit more complex, but you have to provide this too, and ZLUDA does.
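To make the "no magic" point concrete, here is a minimal sketch in Rust, ZLUDA's own language. The three declarations are real entry points exported by the CUDA driver library; the program around them is only an illustration, assuming a driver library the linker can find. Any library that exports functions with these names and signatures, NVIDIA's or a reimplementation, satisfies the application.

```rust
use std::os::raw::{c_char, c_int, c_uint};

// Real entry points exported by the CUDA driver library
// (libcuda.so on Linux, nvcuda.dll on Windows). A drop-in
// replacement works by exporting functions with these exact names.
#[link(name = "cuda")]
extern "C" {
    fn cuInit(flags: c_uint) -> c_int;
    fn cuDeviceGetCount(count: *mut c_int) -> c_int;
    fn cuDeviceGetName(name: *mut c_char, len: c_int, dev: c_int) -> c_int;
}

fn main() {
    unsafe {
        assert_eq!(cuInit(0), 0); // 0 == CUDA_SUCCESS
        let mut count = 0;
        assert_eq!(cuDeviceGetCount(&mut count), 0);
        println!("driver reports {count} device(s)");
        let mut name = [0 as c_char; 256];
        assert_eq!(cuDeviceGetName(name.as_mut_ptr(), 256, 0), 0);
        // From the application's point of view it never matters who
        // implemented these functions: NVIDIA or a reimplementation.
    }
}
```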
So obviously NVIDIA likes CUDA being on NVIDIA GPUs. Were you ever concerned with NVIDIA having any issue with you doing this?

I was initially, but the case has been settled with Oracle versus Google. I have been contacted by a fair number of organizations, initially including Intel, and pretty much everyone tells me they checked with their lawyers, and the lawyers think that after Oracle versus Google, implementing APIs is fine; you can just do this. And I actually saw this when I was at Intel: I released ZLUDA end of 2020, the next version early 2021, and I don't remember how much later, but I think three months later, Oracle versus Google was decided; the judgment was made in the United States Supreme Court. Shortly after, I was contacted by Intel, and they told me: hey, Oracle versus Google is settled, we think we can do this.

Yeah, I know about that case. I didn't realize it was only settled that recently.

Yeah, two years ago.

Still pretty recent. When did the case actually start, if you remember? What year was it?

Early, I don't know, maybe 2018, 2019. Those cases take a long time.

Sure, yeah. But basically you just effectively assumed, okay, that case is going to go well, and now that it's actually done, it's settled. By the way, I'm not a lawyer.

I'm not a lawyer either.

Neither of us are lawyers here, so don't take anything we say... why is the camera following me? Don't do that. Don't assume anything we say is legal advice here. Oh no, now where is it even focusing? Is it going to move? No? I hate this camera, I don't know.

On my screen the camera is fixed.

Oh, okay, I'll just fix it up on here then. Fine. Anyway, getting distracted by that.

There's probably a setting to turn that off; I just don't know where it is.

So what is it about CUDA that makes it such a driving force in this space? Why does everybody want access to CUDA? Is it just the fact that NVIDIA has that market share now, or is there more to it than just that?

There's much more to it. If you look at how the current market for server GPUs looks, for GPU compute, you have three vendors: NVIDIA, who is competent, and Intel and AMD, who are not. And I'm talking about software, because that's my expertise; I can't really speak much about hardware, and the hardware looks fine; well, other than some embarrassing failures, the hardware looks fine. The problem is with the software. If you are a developer, you expect a certain level of quality in your software: your compiler will not miscompile your code, your profiler will give you profiling information, your debugger will work. And things that just work in CUDA-land are somehow impossible in HIP land and DPC++ land, or whatever it's called in Intel-land. That's it: things just work on NVIDIA. It's not like the NVIDIA stack is perfect; we ran, for example, into a miscompilation in the NVIDIA compiler, so they have their own bugs. But things work: you have a profiler, it works; you have a compiler, it works; you have a debugger. I can talk more about AMD specifically, because at Intel I was working in a pre-silicon environment, where there's an expectation that things will not work, so I didn't really use the public versions much. But on the AMD side, imagine that your runtime, for example operations on images, doesn't work: it silently gives you wrong results. And it's arbitrary, only for some formats. You have an image with 32-bit floating point pixels, it works; if it's 16-bit floating point pixels, you get wrong results, for no good reason. You have 8-bit integers: unsigned works, signed gives wrong results, for no reason. And this sort of thing, you just cannot live a life where you have to second-guess everything you do with your compiler. And some tools are just not usable. If you're writing GPU code, you're not doing it for your deep love of GPUs; you do it because you want performance. And what you need is a profiler, to tell you, okay, what are my performance problems, and you're going to use it a lot when writing GPU code. I think an average GPU
programmer has a higher need for a profiler than an average CPU programmer. So, an example of AMD performance profilers. You want to profile a workload, and a workload will have, let's say, performance events, coarse-grained performance events, meaning copying data between CPU and GPU, dispatch of a kernel, and some other things that take time on a GPU. In a normal workload you will have tens of thousands, hundreds of thousands, maybe a low number of millions of those events. And imagine this: you open the AMD GPU profiler, and it has a certain limitation, it can only capture a certain number of performance events. I want you to guess how many it can capture.

If we're talking about dealing with millions, I would hope that you could handle that much, but if there's a problem here, maybe in the range of a few hundred thousand?

Fifty.

Oh. Fifty? Okay.

That's the limit, at least it was last year when I was using it. You cannot measure performance.

I see. That sounds like a problem.

It is a problem. I think they've improved it, and you can now capture low thousands.

I mean, it's better, but it's still orders of magnitude off.

Yeah. But still, with the NVIDIA profiler you can just capture the things. With smaller workloads, when it's maybe tens of thousands of events, it just works. So that's more of an AMD problem. Going back to a higher level: if you look at AMD, they have the right strategy and bad execution; we at Intel had a bad strategy but, I think, decent execution of the strategy, and it could have worked. Well, let's talk more about Intel. I want to talk more about Intel because right now they are poor; there's no risk that they're going to sue me. So if you look at what AMD is doing, their compute stack is relatively similar to CUDA, meaning the goal is: you have your existing CUDA code, and porting it to the AMD world, if it works, is relatively easy.

Yeah, I was going to ask, what does the AMD stack actually look like?

They're following CUDA, basically. The APIs, the runtime, the performance libraries are relatively similar. The tools might be different, but the tools are not super important; they're not part of your program. So imagine you have a function to allocate memory: on the CUDA side it will be called cudaMalloc and take some arguments; on the AMD side it's going to be pretty much exactly the same, but it's going to be named hipMalloc. And this is what programmers actually want, because the world we live in is a world where most of the GPU compute code is already written, and it's using CUDA. This is objective reality. Intel rejects this objective reality and lives in their own dream world where there's no GPU code and every programmer wants to write code from scratch. If you look at their DPC++, or SYCL, in some ways it is actually better than CUDA, if you look purely at the APIs they have. But the problem is the API is very different from CUDA, so every port you do is a one-way ticket to the Intel ghetto. There's also the social aspect: it's somewhat difficult to trust Intel, who hasn't seen much success in the GPU compute world, enough to port to the Intel platform and just abandon your existing, working CUDA code. With HIP, you can always port back from HIP to CUDA.
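That near-mirroring is visible in the signatures themselves. A small sketch: both declarations below match the public CUDA runtime and HIP headers (a pointer-to-pointer plus a byte count), which is why porting is so often a mechanical rename.

```rust
use std::ffi::c_void;

// CUDA runtime API allocation...
extern "C" {
    fn cudaMalloc(dev_ptr: *mut *mut c_void, size: usize) -> i32;
}

// ...and the AMD HIP twin: same shape, same meaning, different prefix.
extern "C" {
    fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> i32;
}
```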
With DPC++ it's much more difficult, very different. And another problem with DPC++ is that it's very strongly C++ based. With CUDA and HIP, you can have your application written in some other language and it's going to interoperate with your CUDA C++ code relatively easily, because there's a C API which you can use from any language. With DPC++ it's not so easy: everything is C++, there's no C API, so using it from something like Python is that much more difficult. And even if you do the mapping, it's very difficult to have interoperability between C++ and Python. And we're talking about C++ and Python, but there are other languages that also want to do some degree of GPU compute, and they have the same difficulty; you're not going to be able to interoperate between all of them, and you need to have strong C++ support. So it's already a sort of losing position. This strategy is just not good; it has no future, unless we port ZLUDA to Intel GPUs. Hopefully we'll do this soon and rescue them from their own wrong ideas.

Yeah, when you bring up supporting things like Python: Python is a really popular language in the research space. You have a lot of people who are already doing a lot of their math computation with Python, and they're going to do a lot of their other stuff with Python if they can, so it just makes sense to make it easy to interact with other systems through it as well, whereas this Intel system sounds like it's a whole different way of approaching things.

Python is actually in a relatively good spot, because as I understand it, they have some good ways to bind to C++ libraries. Other languages are not in such a strong position.

Right, sure, that makes sense.

So it's slightly less of a problem from Python's perspective.

Okay. I don't know how exactly we got to this bit, but let's shift focus a little. Now that the project has been through its revival into what it is now, what are the current goals of the project? What sort of applications are intended to be supported, what would be a dream to support that right now is a bit outside the scope, and what are the things that you could support theoretically but you're just not going to touch?

Right, so the main goal didn't change: total world domination. But more specifically, since our team is relatively small, just me and external contributors, for whom I'm very thankful, there's a limited amount of time, so we need to focus. Right now we are focusing on machine learning workloads: PyTorch, TensorFlow, starting with something simpler like llm.c, but we want whatever machine learning workload you have to run smoothly. We're also making some other choices. We are focusing less on Windows, because Windows needs extra support; it's going to work, but it's not going to be as smooth. Well, it was never really smooth, but it's going to be even less smooth. And we are going to support fewer GPUs, only focusing on the GPUs that are sort of similar to NVIDIA GPUs, and this is specifically RDNA GPUs: RDNA 1, 2, 3, and future generations. Those are the main areas of focus.
What is it about the Windows support that actually makes it a challenge?

Loading libraries is tricky. In the perfect world, how it would work: you have your executable, and it doesn't matter what sort of executable, whether it's Python or Blender or something else; the language is not important, because at the bottom level you are talking to a library. You launch your app through some kind of ZLUDA launcher, and every time the app loads a library, we check: is it a CUDA library? If it's a CUDA library, we replace it with the ZLUDA library, so it's all transparent and efficient. And every time you launch a new subprocess, we also insert ourselves into the subprocess, and if this subprocess loads a library, we also replace it with the ZLUDA library. It turns out that's not really possible, and here's why, and what the ZLUDA launcher settled on. Firstly, there are surprisingly creative ways you can load code into a process. Some applications will have the main executable depend directly on nvcuda.dll; we can replace that, it's fine. But think of things like Python: Python doesn't actually have a dependency on nvcuda.dll; instead, some Python script is going to load the DLL using the dynamic loading APIs, LoadLibrary. Okay, we can hook your LoadLibrary and replace the libraries. But there are also applications where you load a DLL, and that DLL itself has a static dependency on nvcuda.dll, and we cannot detect this. In the Linux world it's slightly cleaner, because any time you load a library, the system runtime loader goes through the public API; dlopen will be called. On Windows it's split: if you're doing it yourself, it's the public API, but the system loader has its own internal function, and we don't want to overwrite system functions from kernel32.dll or whatever; that's not a robust way. So what we settled on: you run your application under the ZLUDA loader, and the ZLUDA loader will inject our own nvcuda into every process, whether you want it or not; you're getting it. Which is already not nice, but presumably you want this. And then if you're explicitly loading a library, we explicitly replace it: if you are explicitly loading nvcuda, we override it to explicitly load our own nvcuda. This part, implicitly loading ZLUDA into every process whether it's necessary or not, is sort of not nice, but it's the most robust way. And you need to have this support for every DLL, and there's some degree of complication with those DLLs, because you have sort of two kinds of DLLs on Windows when it comes to CUDA: you have the loader DLL, which lives in Windows\System32, and you have the real library, which lives in one of those driver paths. On Linux it's simpler: you don't have this loader DLL, and Linux gives you a sort of official, supported way to inject yourself into a process and all its children, through an environment variable. So it's less work, and it works.
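A minimal sketch of that Linux mechanism, assuming the standard LD_PRELOAD variable; the path and the launcher itself are illustrative, not ZLUDA's actual code. The dynamic linker consults the variable before normal symbol resolution, and child processes inherit the environment, so the whole process tree picks up the replacement library.

```rust
use std::process::Command;

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    let (prog, rest) = args
        .split_first()
        .expect("usage: launcher <program> [args...]");

    // The dynamic linker consults LD_PRELOAD before normal symbol
    // resolution, and children inherit the environment, so the
    // replacement libcuda is seen by every subprocess too.
    let status = Command::new(prog)
        .args(rest)
        .env("LD_PRELOAD", "/opt/zluda/libcuda.so.1") // illustrative path
        .status()
        .expect("failed to launch");
    std::process::exit(status.code().unwrap_or(1));
}
```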
If you could just suddenly force all the Windows developers to do things in a certain way, what would you want them to be doing?

I don't want to force them to do things my way just because it's easier for me; it's a different system with different needs. This is kind of baked into the system, the way the dependencies are resolved, and it's more difficult for ZLUDA, but in some ways it's actually better than on Linux. Because on Linux you have a dependency on a function. Say we take the simplest function there is, malloc. On Windows, what is baked into your executable or library is the name of the DLL and the name of the function; on the Linux side, it's just the name of the function. The name of the library is not included in the dependency resolution, and it quite often leads to conflicts. So for my purposes, the Linux way is easier and smoother, but it has its own problems.

That's understandable. It does sound, though, like the Windows side just makes things more complex.

For me, yes, because what we do is not the normal way of doing things. What we could do alternatively is just throw away your official nvcuda and install ourselves into Windows\System32. But I don't want to do this, because it can crash some applications: maybe we don't support something, and even if you don't have an NVIDIA GPU, it's going to be more robust if you use the official library. There are some hidden APIs; there's the dark API, we might come to this in our discussion; and if something is not supported on our side, your application is likely to crash. And it's going to happen with games. So suddenly you've installed ZLUDA into your system and some of your games start crashing for no reason, and it doesn't show that it's the fault of ZLUDA; it just crashes and you don't know why. We don't want to do this. We want you to launch your application with the ZLUDA launcher, so if it crashes, you know it's ZLUDA and not something else.

No, that actually makes a lot of sense. My understanding is there are certain parts that you don't, at least right now, want to bother touching, just because, as you said, there are only so many people working on the project, there's only so much time to work on things, so there are things that are just not going to be included, at least for now.

Yes, yes. If you look at the CUDA API, it's huge, it's gigantic, and it's the 80/20 thing: applications are going to use certain functionality much more than other functionality. Pretty much every application is going to want to allocate memory on a GPU; every application wants to launch kernels; not every application will have multi-GPU support; not every application will want to do runtime linking of kernels and some of the other newer functionality. So generally, how we approach things in ZLUDA: we don't track the CUDA version and add the APIs that are in that CUDA version. We look at applications, and if an application uses certain APIs, then we implement those APIs; then the next application, what APIs does it use, we implement those APIs. Sometimes I get the question: which version of CUDA is implemented in ZLUDA? I don't know. It's whatever the applications are using; it's a mix. We want applications to work; we don't care about some artificial standard that doesn't exist.

Right, that's actually, I think, a really sensible way to approach it. Because, and I don't know how CUDA versioning works, you could approach it as: let's just start at version one, get a complete, perfect implementation, but then nothing else, no modern applications, are supported, because all you're supporting is that basic
core functionality, and you're going to take a really long time to get to the point where things are actually working with real-world software.

Yeah, we care about what applications want. We want your stuff to work; that's the goal, whatever it takes.

Right, you're not trying to pass, like, Vulkan conformance tests.

No, no.

So, you brought up that term, dark API. You sent that as one of the things you wanted to talk about, and I've never heard that term before. What does that actually mean, and then we can go into what you're actually talking about with the CUDA dark API.

Yeah, it's a pain. So, something you should understand: CUDA is obviously not an open source API, but there's something more to it, which is that your code is always a second-class citizen on the CUDA platform. You may not realize it, and I will explain why. First, there are really two and a half APIs in CUDA. There are the two APIs most CUDA engineers will be aware of: what they call the runtime API and the driver API, and they're extremely similar; the runtime API is slightly higher level, the driver API slightly lower level. We implement the driver API. And there's also one more API, hidden from public view. Normally, if you have an API, you have the name of a function, the number, types, and names of the parameters, the return values, and some documentation. With the dark API, you have nothing. Every function that is exposed through the dark API sits behind some kind of unique identifier, a GUID, and an index. You ask your CUDA driver: hey, give me the table of function pointers for this unique key. And this is used by the CUDA runtime, really for no good reason, I don't know why they do this, and by some first-party libraries, because there are some things NVIDIA doesn't want you to know or doesn't want you to be able to do. A classic example is ray tracing. You might think, okay, I have my GPU and I want to know if my GPU is capable of hardware ray tracing. CUDA doesn't expose this information, at least not to you; it exposes it to its own libraries, its own runtime, and it goes through this hidden API. It has no name, it has no documentation, so we call it the dark API. I don't know what its proper name is; nobody knows. And the dark API is a frustrating experience for something like ZLUDA, because we have to implement it. It actually almost killed ZLUDA. The first time I was enabling a very simple application, all it does is add two numbers on the GPU, I had this nice application, and I looked at the interactions between the application and the CUDA driver, and everything goes as expected, every function that I wrote in the source code is being called, and then there's one small addition I did not write: a call to this cuGetExportTable. I look at it: some kind of unique key I've never seen before, and it gives me a table of pointers, and I don't know what those pointers are. Okay, I set a breakpoint on each one of them, and the application calls one of them, for some reason with no arguments, and I don't know what this function does. I decided: that's too much for me, I give up. And I gave up for, I don't remember, two weeks. After two weeks I thought, well, okay, maybe I'm in the mood for some pain, let's try it; it's probably not going to work. And I looked at the function, actually looked at what the function does, and it's four instructions: it calls
an allocation; it calls malloc to allocate some memory and returns this memory to the driver. Okay, I said, well, it's not so bad actually, and it really motivated me to keep going. Pure luck, because it was by far the simplest and easiest function in the dark API; all the other ones have been slightly more difficult. And generally, it's the wrong approach to look at what the function does, because most of the time it's not super useful. What I do is look at what the inputs are and what the outputs are, because usually that's your first clue. If a function returns some kind of pointer, and you see across the rest of the application that this pointer is used as a context, then you know: okay, this dark API call created a context with some internal bits being set or unset, and those bits are not really important; we might as well create a normal context, because while it probably does something, across the applications we've seen there's no observable effect of this function that is different from the public API. So that's how it goes, and we implemented only the parts of the dark API that are necessary to run your applications, mainly the interaction between the high-level runtime API and the driver API: the runtime API doesn't always talk to the driver API through the public APIs, it also talks through the dark API. And then I noticed that first-party libraries use the dark API for various reasons which I don't know about.

So it's basically just those internal functions that there's no point for them to document, because you're never actually supposed to call them yourself?

I mean, I think they should document them, because it would be nice to know how I can use everything that I paid for with this GPU, but they decided not to. There might be some thinking behind why they do this. It's obfuscation: maybe some of those things they just don't think are useful, maybe some things they just want to hide. When it comes to ray tracing, they definitely want to hide those things, because OptiX also doesn't expose them.

Whatever the reason for it, it makes your life a lot more complex.

Yeah, and it probably also makes the life of CUDA engineers more complex, because some things they do in this dark API, I don't know why they are doing them. For example, there's something they added recently, and my suspicion is they wanted to make my life more difficult, but the compiler optimized out everything they wanted to do in this function. Because what does this function do? Let's call this function foobar: it returns whether the function foobar itself starts at an even or an odd byte in memory. And this is completely, totally pointless, because your compiler will pretty much always align your function, not only to an even address, but most likely to the natural size, to 64 bits. So it's always going to return zero. I don't know why they do this; it's just a mystery to me. And it's literally two instructions, three.

I don't have an answer for you either.

My expectation is there was a more complex body inside that they wanted me to implement, but the optimizer optimized everything out. Who knows what they are thinking; it's a mystery. I only care about the applications running, not about their creative ideas.
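The door into the dark API is the one function he mentioned, cuGetExportTable: pass a 16-byte key, get back an opaque table of function pointers. A sketch of probing it; the all-zero key is a placeholder, since the real keys have no published meaning and are only learned by watching applications.

```rust
use std::ffi::c_void;
use std::os::raw::{c_int, c_uchar};

#[repr(C)]
struct CUuuid {
    bytes: [c_uchar; 16],
}

#[link(name = "cuda")]
extern "C" {
    // The undocumented entry point: give it a UUID, get back an
    // opaque table of function pointers with no names and no docs.
    fn cuGetExportTable(table: *mut *const c_void, id: *const CUuuid) -> c_int;
}

fn main() {
    // Placeholder key: real applications pass specific UUIDs that
    // are only known from observing them at runtime.
    let id = CUuuid { bytes: [0; 16] };
    let mut table: *const c_void = std::ptr::null();
    let rc = unsafe { cuGetExportTable(&mut table, &id) };
    println!("cuGetExportTable returned {rc}, table = {table:?}");
    // What each pointer in the table does, and what arguments it
    // takes, is exactly the reverse-engineering work described above.
}
```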
So with these dark API functions, the way you're basically approaching them is effectively like a test suite, where you're throwing data into it, looking at the data that comes out, and hoping that whatever you're doing in between is getting you the result that would happen on an actual NVIDIA GPU with CUDA?

Yeah, most of the time that's sufficient. Sometimes we have to look at what it does internally, but it's usually too complex to do that, because if you have a function, it's going to call any number of internal functions. Observable properties are what matters. If it sets a flag in some kind of object, well, that tells me nothing; we're going to do the same thing without setting this flag.

You said earlier that CUDA is a really big API. How big actually is it?

If you give me a minute, I can open my IDE and tell you. Just give me a second. So, this is not 100% accurate, it might also count some function pointers, but just the driver API: 575 functions exported. We don't implement all of them. Well, currently we implement nothing, but the old ZLUDA implemented a bit below 100, or maybe slightly above 100, and it was enough to run a lot of applications, most of them. And this is just the driver API; there are also the performance libraries, which have their own APIs, and they also have a lot of functions.

So those 100 are a lot of the main functions that sort of everything is going to need to deal with, and then there are going to be a lot of these little functions here and there that are important for specific workloads, but may not necessarily be something that most applications are using?

Right. As I said, we do enablement workload by workload, application by application, and we can see that there's a core of functions that everyone is going to use, and then it gets less and less common. Some of them might be used by nobody; they're added for completeness, some kind of getters of properties, that sort of stuff.

One thing I didn't really ask earlier: what does the name of the project actually mean? I kind of get the "UDA", so, CUDA, but why ZLUDA?

Okay, I have to make an excuse for myself here. I literally came up with the name the day before release. Initially, for all of 2020, and you can see this if you go back in the history, it was simply called notCUDA. And, you know, I'm not a lawyer, but I'm relatively certain that I cannot release a project named like this. So the day before release: well, let's go with something that sounds slightly like CUDA. There's no such word in English, I think, so let's try Polish. I'm Polish, and this one sounded nice. It means something like mirage, illusion.

Okay, that actually sounds kind of cool.

There's no hidden second meaning. I just thought it would be nice to use Polish, to have a word that sounds nice. But I was aware that it was going to be impossible to pronounce, so it's spelled differently; I simplified it a little bit.

Yeah, as I said, I was never going to pronounce this word correctly, but I don't think anyone was going to. If it was spelled correctly, how would it actually be spelled?

I can write it in the chat.

Oh yeah, let's see it.
Yeah, this L is spelled differently: ł.

Okay, I don't even know what that symbol is.

It's pronounced like an English W.

Oh, for anyone who is just listening, it's an L with a slash through it.

Yep.

I have Polish listeners who are going to be like... look, I can barely speak English on the best of days; don't get me started on Polish. That's what I get for being Australian. So, where do we go from here? Oh, we actually haven't talked about it being written in Rust yet, have we?

Ah yeah, I get this question a lot. Rust is still perceived as a sort of exotic alternative to the more mainstream languages in this area, for solving this sort of problem: the mainstream solution would be C or C++. But the thing is, I have known Rust for a long time, for over a decade; I learned Rust before version 1.0.

Yeah, I was going to say, if you've known Rust over a decade...

I was interested in it early because, professionally, I had been writing C# and then F#, but I always had a certain level of interest in lower-level development, and I always had the suspicion that lower-level, system-level programming is difficult not only because it's less mainstream and slightly more tricky, but also because C++ is just not a good language. The first time I learned about Rust and saw the sort of semantics it has, I realized: wow, that's what I always wanted for system-level programming. I would never write ZLUDA in C++; it would be just too difficult. And I knew Rust, and I always wanted to do a project in Rust; the things I had written in Rust were small projects, never released, just to try some feature of the language, to try something out. This is, I think, the first big mainstream project I did in Rust, and I'm really happy with the language. The language is relatively good for writing system code. Well, there are some problems with it, but in a very small minority of places, in my opinion at least, so maybe don't listen to me. The build system is really anemic: it works well if you have a relatively simple project that is all in Rust, but we have things that are slightly more complex; there's interop between languages, there's codegen, and Cargo is just not good enough for this purpose, so we kind of have our own solution. It would be better if Cargo had a real build system. I actually prefer CMake; I'm one of the few people, probably the only person, who would prefer to have CMake or MSBuild or anything else over Cargo for building. Cargo does package management, and it's relatively good at package management, but the build system is just not it. But when it comes to the language semantics, to the availability of libraries, it's good; it's what I wanted.

Why would C++ be painful? What about the language is a problem?

It doesn't have the features I want. I want to have enums: discriminated unions. I have professional experience writing F#, and in F# you are going to use discriminated unions a lot. And a lot of really fundamental types in ZLUDA are expressed as enums, F#-style, Rust-style enums, where you have not only the value but also data associated with it: discriminated unions. Things like an instruction, a statement, a directive; pretty much everything in the compiler. And C++ doesn't really have good support for this.
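What he means by discriminated unions, in one toy sketch: a Rust enum carries different data per variant, and the compiler forces every match to handle all of them. The variant names here are invented; ZLUDA's real types are much richer.

```rust
// Invented, simplified variants; the point is the shape, not the names.
enum Operand {
    Register(u32),
    Immediate(i64),
}

enum Statement {
    Label(u32),
    // A variant can carry structured data of its own.
    Instruction { opcode: String, operands: Vec<Operand> },
    Directive(String),
}

fn describe(s: &Statement) -> String {
    // Matching is exhaustive: add a variant and every match that
    // forgets it becomes a compile error, which C++ unions and
    // std::variant cannot offer nearly as cleanly.
    match s {
        Statement::Label(id) => format!("label L{id}"),
        Statement::Instruction { opcode, operands } => {
            format!("{opcode} with {} operand(s)", operands.len())
        }
        Statement::Directive(d) => format!("directive .{d}"),
    }
}

fn main() {
    let s = Statement::Instruction {
        opcode: "add".to_string(),
        operands: vec![Operand::Register(1), Operand::Immediate(2)],
    };
    println!("{}", describe(&s));
}
```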
Other things, like memory management, are much easier once you learn how to deal with the borrow checker; it's much easier in Rust than in C++. And obviously, not the whole world subscribes to Rust's ideas about memory management and code safety, so if you look at the places where we have to interact with the outside world, like emitting LLVM assembly, that code is just every second line unsafe. But that's okay, because unsafe Rust has the same semantics, or even stricter semantics, than C++, so we're still coming out ahead with Rust.

So Rust is basically, at least for you, the better tool for the job.

Yeah, for me it's better. And as I said, Rust is not perfect; there are some areas where it's just unusable, like writing GPU-side code. C++, and specifically Clang, has all those little attributes that are useful to me, and things like support for address spaces. These are very niche features, but I want them for writing GPU code, because if you look at the ZLUDA compiler, certain functions are implemented as calls to full-blown functions in an LLVM module which we link in during compilation, and this LLVM module is written in C++, well, C++ compiled through LLVM. You cannot really write this sort of code in Rust, because it doesn't have good support for writing GPU code.

Fair enough. I thought you were going to say more there. The delay sometimes throws me off, when people are going to stop talking.

You're literally on the other end of the world, so there must be a fair bit of delay.

Yeah, I can make it work though; I've made it work so far, so it hasn't gone horribly bad. One thing about the project itself, and I don't think you really talk about this anywhere: the choice of license on the project. And again, neither of us are lawyers, so don't get on our case about any specific points, but it seems to have two licenses attached to it, the Apache and MIT licenses. Why are there two licenses, and why those licenses in particular?

I want my software to be used by everyone, for any purpose you want.

Fair enough.

And this actually comes from the Rust community: it's just been a popular solution in the Rust community to dual-license under the MIT and Apache licenses; you pick whichever one you want. And, as we agreed, we are not lawyers, but my understanding is that this is the path that is most compatible with all other open source and closed source licenses. That's why.

So your goal is basically just making it so people can actually use it, rather than ensuring some free-software perfection about the software, where everyone that uses it has to also be free software, that sort of stuff?

Yes, yes.

Actually, while we're down this route: what is your general stance when it comes to open source and free software? Clearly from this you're more in favor of the open source side, but do you have a position you stand on more generally?

So, my thinking, and this comes from working at relatively big companies: if you want your software to be used, the way to go is to use the MIT or Apache license. If you're licensing under GPL, big companies will not touch it, unless it's something that is extremely critical, like the Linux kernel. Those companies have explicit policies: hey, if it's MIT or BSD or Apache,
then you can use it, install it on your computer, link it into your software; if it's GPL, not really. And before release, and this goes for internal stuff, and even more for external stuff, anything being released from within the company, there's usually going to be a legal review, and they will check if you're using something that is GPL, and if it uses GPL, then it's probably not going to be released. So maybe it's your goal that corporations don't use your software; then I would actually recommend using GPL. That's not my stance or my politics, but this is the reality I observed.

Yeah, the Linux kernel was in a weird position when it came along, because it was there to replace the proprietary Unix systems that were coming up, and the BSD world did exist, but it was this weird mix of proprietary BSD, and then 386BSD was there as well. We could just turn this into ranting about the early history of Linux, because this is one of the topics I really, really enjoy: why the whole GNU Hurd thing just didn't work, and why it should have been based on 386BSD, but then they didn't end up wanting to do that, and chucked away that entire project to wait on this mythical kernel that was never going to come around anyway. And then Linux came along before they even started the project, so no one cared about Hurd after that.

Yeah, but just be aware that GPL-licensed stuff has a sort of special position when it comes to licensing in corporations. They're going to use it, they're going to contribute, but it has to be big enough. So they're going to use the Linux kernel, they're going to have their own fork of GDB, which is also GPL-licensed, and stuff like this. But if it's something smaller, then the lawyer who is giving you a review is probably not going to give you a special exception for it. If your goal is to make a library, don't even bother with the GPL. I mean, I'm not going to tell you how you should live your life, but if you want maximum usage, then GPL is probably not the solution.

So what is your background in programming? I don't just mean your corporate background; when did you actually start programming, and how did you actually get interested in it?

During my first programming class at university. I didn't program before going to university. I hope this is not going to be a letdown, but I always used computers without really being interested in programming as such. I went into a computer science degree because I was broadly aware that you can have a good life if you have a computer science degree: whatever you do, programming, security, databases, it's going to be fine. Those were different times; that was, oh wow, so long ago. Well, I'm not going to say how long, but a long time ago. Still, I understood that it was going to be great. And I was actually, how do you say, overwhelmed; maybe not the best word, but I did not see myself as a possible programmer, because my thinking was that programming is extremely difficult, every programmer has this sort of galactic brain, the tools they are using are really high technology, and everyone is excellent. Then I started programming, and what I realized is that all the programming languages are old garbage from 30 years ago, and programmers, they cannot program FizzBuzz. So,
you know, it was extremely encouraging, because I realized it doesn't matter if I'm really bad at programming: other programmers are even worse, so my bad code is not going to make things worse on average. So I started programming, and it was interesting. That's how I became a programmer; I found my passion for programming when I was studying for a computer science degree.

In that first programming class, what language were you actually working with?

My first programming class was C.

Okay, that's a sensible language. I've heard some real weird answers from people before, where it's like Objective-C or other random things that don't make any sense.

Right, so I learned a number of languages, and I think I'm still using many of them. At university I learned C and C++, but no Python. C# I learned by myself, and that was the first language I used professionally.

Okay, so all completely normal and sensible languages. I started with Java.

Oh yeah, I learned some Java too. And we learned Prolog, but I remember nothing about Prolog.

Yeah, it's fairly interesting, but I wouldn't be able to write anything in Prolog nowadays. So now, is ZLUDA your main focus at this point?

Yes, yes. And my journey, in terms of main languages, has been sort of C#, then F#, then assembly, then ZLUDA, which is Rust and C++. There are some in between: there's OpenCL, and C for Media, which you definitely didn't hear about, and none of your listeners have heard about, but it's a great language. Unironically, it's a great language for writing code for Intel GPUs: it's a dialect of C++ for writing GPU code, and it's really good if you want to do just that. It's mainly used internally at Intel, but I think they did some public releases.

Oh okay, that makes sense then. How do you spell the name, by the way?

CM. I think it's "C for Metal" nowadays; previously it was called "C for Media".

Oh, yep, okay.

It has nothing to do with Apple's Metal; everyone asks this question.

"The C for Metal programming language allows for creation of high-performance compute on Intel GPUs using an explicit single instruction multiple..." blah blah blah, "...OpenCL applications using implicit SIMD..." okay, I don't want to read all this.

By the way, if you have an Intel first-party library that has GPU code, check which language it uses. If it's written in C for Metal, those people knew what they were doing; if it's written in something else, probably not. They are trying to include C for Metal in DPC++ as a different mode of programming; it's now called ESIMD, explicit SIMD.

Ah, okay.

But CM is still more ergonomic if you just want Intel GPU code; ESIMD has the advantage that it interoperates much better with normal DPC++ code.

So let's actually get back to the project itself. When you want to sit down and actually start working on a project like this, where do you even start? You see the API there, you have a GPU; what do you even do to start getting anything doing anything?

In this case, it started with trying to understand what CUDA is, because it was my first CUDA project. I mean, I knew what CUDA is, but I had years of
experience writing Intel GPU code, and Intel GPUs are slightly niche. Oh, this was also, I didn't mention, a secondary goal of this project: to actually understand how CUDA works, how you write CUDA code; it's always worth it to understand what the competition is doing. So, firstly, understand how you do even the simplest thing with CUDA. The first goal was an application that does the simplest thing possible: add two numbers on a GPU. It still needs to do a fair bit of setup: get a context, load the source module, allocate memory, all that stuff. On Linux you have a tool like ltrace, where you can log all the interactions of your application with a certain library, so one of the first things to do was to understand what sort of host functionality has to be implemented, and think really hard: can it be implemented in terms of an Intel GPU? Don't think big; that's what worked for me. It was just a simple application, because it was a research project; I had no goals of having complex things running. If I had two plus two returning four on a GPU, I would be successful. So I have a simple application, I know what the interactions are (obviously, that was a lie, because I didn't know about the dark API at the time), and I look at those calls, and there are, I don't know, ten of them, and I know: okay, all of them can be expressed in terms of Intel's Level Zero. And that's the host code interaction. Since I have experience writing GPU code, I know there's also the GPU-side code, so the next step was to try to understand: what is the format of the NVIDIA GPU code? Is it some kind of intermediate representation, or is it some kind of architecture-specific assembly? And here I was lucky, because, as I said, NVIDIA is competent, and they have a virtual assembly called PTX, and it's a text format. It's somewhat abstract, at least from my perspective: it's a virtual, abstract representation, it's not written for a specific architecture, but it's going to work, after compilation, with any CUDA GPU. And, a good thing, NVIDIA documents the PTX format. I looked at the instruction set, and all those instructions I'm familiar with, and I know they're going to work on an Intel GPU: things like load memory, store memory, do multiplication, do addition; all those things are going to map onto an Intel GPU. I was still thinking small: okay, I have my very simple module in a text format, and I have my host code, so it looks possible. Then I started implementing, starting with the PTX, with the compiler, because it looked relatively easy; but as I started implementing, there were complications. The way to implement it is sort of classical compiler design: you have your text code, and you start by parsing it using a parser generator; in my case it was a Rust library called LALRPOP. Yeah, that's the name; it comes from the abbreviation for the kind of grammars it can parse. The name is not super, but the library is good; no, it's a solid library. So I start parsing those things, and once you have things parsed, you have a sort of in-memory representation of the source code.
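For a sense of the input: a PTX module at the "add two numbers" level looks roughly like the text below, here carried in host code the way the driver API consumes it. This module is written against the public PTX documentation as an illustration; it is not taken from ZLUDA's sources.

```rust
// A minimal PTX module: one kernel adding two 32-bit integers.
// PTX is a textual, virtual ISA; the driver (or a reimplementation
// like ZLUDA) compiles it for the real GPU at module-load time.
const ADD_PTX: &str = r#"
.version 6.5
.target sm_50
.address_size 64

.visible .entry add(
    .param .u64 in_a,
    .param .u64 in_b,
    .param .u64 out
)
{
    .reg .u64 %rd<4>;
    .reg .u32 %r<3>;

    ld.param.u64  %rd1, [in_a];
    ld.param.u64  %rd2, [in_b];
    ld.param.u64  %rd3, [out];
    cvta.to.global.u64 %rd1, %rd1;
    cvta.to.global.u64 %rd2, %rd2;
    cvta.to.global.u64 %rd3, %rd3;
    ld.global.u32 %r1, [%rd1];
    ld.global.u32 %r2, [%rd2];
    add.u32       %r1, %r1, %r2;
    st.global.u32 [%rd3], %r1;
    ret;
}
"#;

fn main() {
    // Host code hands this text to the driver, e.g. via
    // cuModuleLoadData, then looks up the kernel and launches it.
    println!("{ADD_PTX}");
}
```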
Now you start writing your transformation passes, and every transformation pass makes it slightly closer to your target. At first I didn't even think about what my target was going to be; I thought maybe I could compile it straight to the Intel GPU instruction set, but relatively early I decided: no, it's not going to work; I'm going to compile it to SPIR-V, which is a bit more abstract and much easier to compile to, and that was the right choice. So, okay, I'm going to compile to SPIR-V, and now in every compilation pass I try to make it a little closer to what SPIR-V is expecting. It sounds simple, but there's a fair bit of weird behavior in the PTX format. For example, there are entirely too many address spaces. On a GPU you are generally going to have something like four different address spaces: global memory, shared memory, private memory, generic memory, maybe constant memory. PTX has like seven or eight different address spaces, even in a simple case. Some of them, if things were designed from the ground up, would really not be necessary in the current world, but they exist, and you need some way to translate them to the address spaces that are available in SPIR-V. Some instructions have no direct replacement in SPIR-V, so you have to replace them with a series of simpler instructions, that sort of thing. And it took a long time: it was almost a year just to have a translation that actually translated correctly and worked. So if you are attempting something like this, I don't have any advice other than: start small and never give up. I gave up like three or four times. One of those times was, as I said previously, the dark API, the first time I encountered it. Another one was very technical: I wrote one pass that was completely unnecessary, a translation to SSA form, and I realized that, well, it's completely unnecessary, LLVM is going to do this for you; you don't have to do it.

What is the SSA form?

Static single assignment. It's a sort of representation on which optimizing compilers operate, and the compiler that is inside ZLUDA is not an optimizing compiler, it's just a translation, so this was completely unnecessary.
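SSA in one small sketch: the same computation before and after the renaming. Building this form by hand is the pass he threw away, since LLVM constructs SSA itself.

```rust
// Not SSA: the name x is assigned twice.
fn before(a: i32, b: i32) -> i32 {
    let mut x = a + b;
    x = x * 2;
    x
}

// SSA form: every name is assigned exactly once; a reassignment
// becomes a fresh name. Optimizers like LLVM build this themselves,
// so a translate-only compiler never needs to construct it by hand.
fn after(a: i32, b: i32) -> i32 {
    let x1 = a + b;
    let x2 = x1 * 2;
    x2
}

fn main() {
    assert_eq!(before(2, 3), after(2, 3));
}
```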
Yeah, I didn't expect — nobody expected — that it would work. But it does, and the performance is actually not bad.

I didn't bring up the AMD stuff earlier because I wasn't sure what you can and can't say there, and I'm sure — as you've publicly said — there have certainly been issues.

So I can only repeat what happened. I worked as a contractor for AMD for two years; towards the end, I released the source code, and they decided that they shouldn't have allowed me to do that — that it wasn't legally binding. We rolled back the code to the pre-AMD version, and we're starting again.

I'm sure there's a lot more you would like to say if there weren't issues there — maybe one day — but I won't push you on that, because there are certainly some legal issues you want to avoid getting sued over.

What was interesting about the situation — something I learned — is that there are so many internet lawyers who want to prosecute the case in front of internet judges. The main reason I try to dodge this topic is that it's going to bring the internet lawyers into the comment section. It has already been handled by actual lawyers; it's not going to be relitigated here. Another thing is that there's a demographic of people who blame NVIDIA for everything, even though NVIDIA has nothing to do with this, at least to my understanding.

Yeah, I've definitely seen a lot of those internet lawyer comments myself — go on Reddit and you'll see plenty of them. Even just in the GitHub discussion there were people arguing things like: what was the position of the person who told you to take it down, were they at the correct level of the chain to do that? Even if you were in the right, I can't imagine you're in a position to go to war against AMD, or would even want to.

No, no. AMD is a billion-dollar company; I'm not going to fight them. They have lawyers, they have a PR department — it's not a fight I can pick with anyone.

Yeah, it's much better to just comply with whatever they reasonably ask you to do and go from there.

Yeah — focus on the technology, focus on the nice things we can build, all the applications we can have running on your GPU.

So what is the current state of things?

Currently, very little works. This is the situation as of today: I don't think the whole project even builds, because it's being rewritten from the ground up, starting from the most basic things — even the parsing of the PTX. We have a new parser, and the way it works has an effect on the rest of the compilation. So I rewrote the parser, rewrote the compilation passes to be nicer and simpler, and recently finished writing the code responsible for emitting LLVM bitcode. There are a number of unit tests inside ZLUDA, and a unit test for the compiler works like this: the compiler takes a simple handwritten PTX module, compiles it, and checks that the result is as expected. There are about ninety of them, and those tests pass — the compiler works — but we don't have the host code working yet. That's the next step: get the host code working, and get some of the other tooling around the project working.
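The compiler tests he describes — compile a handwritten PTX module, run it, compare the output — live as Rust unit tests inside the ZLUDA repository, but the shape of such a test is easy to show against the plain CUDA driver API. The sketch below is hypothetical (a made-up `add_one` kernel, driver-API calls left unchecked for brevity), not ZLUDA's actual harness; in ZLUDA's tests the module goes through its own compiler rather than NVIDIA's JIT, but the flow is the same:

```cuda
// Shape of a "handwritten PTX in, checked result out" test, expressed
// against the CUDA driver API. Hypothetical example, not ZLUDA's harness.
#include <cuda.h>
#include <assert.h>

static const char *PTX =
    ".version 6.5\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry add_one(.param .u64 in_ptr, .param .u64 out_ptr) {\n"
    "  .reg .u64 %rd<3>;\n"
    "  .reg .u32 %r<3>;\n"
    "  ld.param.u64 %rd1, [in_ptr];\n"
    "  ld.param.u64 %rd2, [out_ptr];\n"
    "  ld.global.u32 %r1, [%rd1];\n"
    "  add.u32 %r2, %r1, 1;\n"
    "  st.global.u32 [%rd2], %r2;\n"
    "  ret;\n"
    "}\n";

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction f;
    CUdeviceptr d_in, d_out;
    unsigned int input = 41, result = 0;

    cuInit(0); cuDeviceGet(&dev, 0); cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, PTX);            // compile the PTX text in-memory
    cuModuleGetFunction(&f, mod, "add_one");

    cuMemAlloc(&d_in, 4); cuMemAlloc(&d_out, 4);
    cuMemcpyHtoD(d_in, &input, 4);

    void *args[] = { &d_in, &d_out };
    cuLaunchKernel(f, 1, 1, 1, 1, 1, 1, 0, NULL, args, NULL);
    cuMemcpyDtoH(&result, d_out, 4);

    assert(result == 42);                   // the actual "unit test"
    cuCtxDestroy(ctx);
    return 0;
}
```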
Once we have the host code and some of the necessary tools, then we'll start focusing on specific workloads. The first one is probably going to be Geekbench, because it's relatively simple, and then we'll work on the machine learning workloads. llm.c is probably going to be the first machine learning workload that works, with some caveats — Flash Attention support will have to wait a little bit — but yeah, llm.c is the first goal, the first milestone.

So, now that you've done a lot of this work in the past and you're doing a rewrite, are there things you've learned from that experience that will make things easier now? As you said, this was your first attempt at doing this — I assume there's a lot you've learned from doing it at least once.

Right. One big lesson: don't trust AMD things to work. That might sound funny, but it was an extremely stressful situation. One of the first workloads I did when I contracted for AMD was a closed-source application — I don't remember which one, either 3DF Zephyr or RealityCapture. I enabled everything that was required for this application, both in the host code and in the compiler, but for some reason it didn't work. Since it's a closed-source application, it's relatively tricky to debug, but eventually I found the kernel that was producing incorrect data. So I extracted this kernel and the data it operates on, and I ran it both on CUDA and on ZLUDA, and it worked correctly — on both, every time. But for some reason the same kernel failed when run inside the application. At this point I was starting to panic a little, because it was an impossible situation, impossible to debug; it should work; it defied logic and the laws of physics. And I tried it both ways: I ran the whole workload, because I thought, well, maybe an earlier kernel is computing things wrong, and I ran it CUDA-to-ZLUDA and ZLUDA-to-CUDA, and it all gave the right results. Then, purely by luck — purely by luck — I realized that there's a bug in AMD's host code for certain memory copies: it copies them incorrectly. That was my first revelation: do not trust the host code. Even if it's a simple mapping, even if it looks simple, do not trust that AMD host code is going to work correctly — always double-check. That sounds pessimistic, and there are not so many cases, but generally, if it's something that has to do with textures, then double-check if you're using HIP. (A sketch of that kind of double-check follows below.)

There are some other lessons that are very ZLUDA-specific. It's actually somewhat frustrating, looking at this from the vantage point of two years later: I look at my code from before the rollback, and I look at a compilation pass and think, well, that's wrong, this is wrong, there are better ways to do this — but I have to fix all that other code before I arrive at this. You can see where the goal is, but there's just so much in between. I remember: oh yeah, I got rid of this spot because there were better ways to do it — and now I have to live with it again.

But hopefully the second time around you can get there quicker, now that you know.

Yeah, it's much quicker right now — firstly because the workloads are different.
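The habit he's describing — never assuming a runtime's memory copies are correct — is easy to turn into a smoke test. Below is a minimal sketch using the CUDA runtime API for illustration (on AMD the equivalent code uses HIP's same-shaped entry points); pitched 2D copies are a good candidate to check because the row padding gives the runtime strides to get wrong. This shows the verification habit, not the specific bug he hit:

```cuda
// Round-trip a pitched 2D copy and diff it against the original on the CPU.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t width = 640, height = 480;      // bytes per row, row count
    unsigned char *src  = (unsigned char *)malloc(width * height);
    unsigned char *back = (unsigned char *)malloc(width * height);
    for (size_t i = 0; i < width * height; ++i)
        src[i] = (unsigned char)(i * 31);        // recognizable pattern

    void  *dev = NULL;
    size_t pitch = 0;                            // driver may pad each row
    cudaMallocPitch(&dev, &pitch, width, height);

    // Host -> device and straight back again, through the pitched layout.
    cudaMemcpy2D(dev, pitch, src, width, width, height,
                 cudaMemcpyHostToDevice);
    cudaMemcpy2D(back, width, dev, pitch, width, height,
                 cudaMemcpyDeviceToHost);

    // The paranoid part: never assume the runtime got the strides right.
    printf(memcmp(src, back, width * height) == 0 ? "copies ok\n"
                                                  : "MISMATCH: runtime bug?\n");
    cudaFree(dev);
    free(src); free(back);
    return 0;
}
```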
Oh — one thing I did not mention that we're getting rid of, and I have mixed feelings about this, somewhat sadly: we're not going to do ray tracing. ZLUDA has had an implementation of OptiX — OptiX is NVIDIA's framework for doing ray tracing — and it's very complex. During my contract work at AMD, I think OptiX took almost a year. It was very complex — not only OptiX itself but also the applications using OptiX. The goal there was Arnold, a closed-source, very complex rendering solution, and debugging it was a lot of effort. So it's going to be so much faster without OptiX, so much easier.

It would be cool to have, you know, if you had infinite time.

Yeah, I agree. So, as I left AMD and released ZLUDA, something that became really clear really quickly was that there's a lot of interest in machine learning workloads and relatively little interest in professional graphics workloads. There was a time when I had left AMD and had nothing else lined up, and I was still working on ZLUDA but focusing on the workloads for which there's no commercial interest. I almost got GameWorks running — well, I got it running in one application, but never merged the code — and things similar to it, like a suite for 3D photogrammetry, I don't remember what it was called. That had also been requested, but there's no commercial interest in it.

Right. Most people at this point would describe NVIDIA not as a GPU company but as an AI company. I like to call them a machine-learning shovel company: during a gold rush, you don't dig for gold, you sell the shovels. That's what NVIDIA does.

Yeah, that's true. If you look at the machine learning hardware market, there are two markets: the NVIDIA market and the non-NVIDIA market. A lot of people want ZLUDA for machine learning because they want to take part of the NVIDIA market — not by means of hardware, but by means of software — because the NVIDIA market is much bigger than the non-NVIDIA market.

Yeah, and that's just not going to change in the real world, no matter what AMD does. I mean, AMD is making an effort, and AMD is, I think, the closest, because as you said they made the right strategic call to make their APIs close to NVIDIA's APIs. The problem is that their execution is just not good.

And that's purely talking about software. There are also some hardware choices — some design choices they made in the hardware — that make porting from CUDA to HIP more difficult than it should be.

Is there anything you want to get into there?

Yeah — it's no secret. Do we have a drawing board? Because I need to draw. In Jitsi — I don't know if Jitsi has one, actually.

It would be helpful if you had one... yeah, there's "show whiteboard."

Okay, excellent, beautiful. So the core difference between a CPU and a GPU is that a hardware GPU thread operates on a vector of data. What that means is: your CPU operates on a single element at a time. You add two plus two, you get four — a single element at a time. GPUs operate on a fixed vector of elements. In this example, this is a GPU where the instruction set has a width of four, so maybe there's one, two,
three, four — and another — I don't know, where's the eraser? There's no eraser on this thing, I think. Well, whatever. There's one plus six: seven. And the result — well, we'll leave it as an exercise for our readers, watchers, listeners. But it operates on a vector at a time. This is the key difference, and this is why CPU code is not going to be efficient when just taken and run on a GPU core: you'd be using only one element of the vector. And that's why only some workloads are efficient on a GPU — you need to be able to vectorize your work. In this example the vector width is four, and this is sort of the most basic parameter of your GPU. If you look at all the NVIDIA GPUs in history that are programmable with CUDA — and probably every one in the future — the width of the vector, the number of elements you operate on at a time, is 32. And AMD, for some reason that is impossible to explain — okay, I'll explain in a second; it's going to sound silly, but this is the explanation I received inside AMD — AMD has really two architectures: one architecture that is not meant for compute, that is meant for gaming, has width 32, which makes it easy to port compute code to; and their compute architecture is difficult to port compute code to.

Are we done with the whiteboard, by the way?

Come again?

Are we done with the whiteboard? Do we need that on the screen?

Yeah, yeah — I don't know how to close it either. Unpin... what now?

Okay, we'll suffer with the whiteboard. We've got a few minutes left of the show anyway; it'll just be a broken layout. It's fine.

Okay, so on AMD GPUs — the compute GPUs — you have width 64. It's relatively difficult — well, not impossible, but it's extra difficulty — to port CUDA code written for vector size 32 to AMD GPUs with vector size 64 (the sketch below shows how warp size 32 gets baked into typical CUDA code). If every AMD GPU had vector size — what's called warp size — 32, it would be much easier to port, much easier. But what I learned at AMD is that there's a guy who has a spreadsheet, and according to his spreadsheet, warp size 64 is better for data-center workloads than warp size 32. And that's why AMD spends millions of dollars porting from warp size 64 to warp size 32 — otherwise you could just replace the names and it should work, 32 to 32.

You know, I don't think I've ever had a visual demo in the middle of a show before. That's certainly a first — oh wait, no, I did have a game dev who did one once. So this is the second time, but I've never had a whiteboard. A whiteboard is a new one.

I think it's nice. It's a nice change.

I've done like 200, 250 of these or so, and finally there's something new and weird.

So, by the way: previously ZLUDA had two modes, so it could run code that expects warp size 32 on hardware with warp size 64, but it was a time-consuming feature, because it applies to every layer of ZLUDA — both the compiler and the host code. So we're leaving it out. And AMD announced that they're merging those two architectures into one, and — I mean, I don't have special knowledge, but I expect it's going to be warp size 32, to be similar to NVIDIA. Otherwise it would be self-sabotage not to use warp 32: even if it's less efficient in a hardware sense, it's much more efficient when porting the software, and porting the software is what matters.
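To see why the warp-size difference bleeds into real code, here's the kind of idiom that appears all over CUDA kernels — a warp-level reduction. This is a generic illustration, not ZLUDA code; note how the number 32 is baked in twice, once in the lane mask's type and once in the starting offset:

```cuda
// Classic CUDA warp reduction: warp size 32 is hard-coded in both the
// 32-bit lane mask and the loop bound.
#include <cuda_runtime.h>
#include <cstdio>

__device__ float warp_sum(float v) {
    // 0xffffffff means "all 32 lanes" -- the mask type itself
    // assumes a 32-wide warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 now holds the sum of its 32-wide warp
}

__global__ void demo(float *out) {
    float total = warp_sum(1.0f);       // every lane contributes 1.0
    if (threadIdx.x == 0) *out = total; // expect 32.0 on NVIDIA hardware
}

int main() {
    float *d_out, h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    demo<<<1, 32>>>(d_out);             // launch exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);   // 32.0
    cudaFree(d_out);
    return 0;
}
```

On warp-size-64 hardware, the same pattern needs a 64-bit mask, an extra iteration starting at offset 32, and different assumptions in every ballot, shared-memory sizing, and reduction around it — which is exactly why a translation layer must either rewrite all of this or emulate a 32-wide warp.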
Right — even if it's faster, if no one's writing software for it, it doesn't matter.

Yeah, yeah. Sadly, we live in a sort of CUDA-shaped world when it comes to GPU compute. That's the objective reality.

Well, on that note, I guess we can start wrapping things up. So, let people know where they can find the project. I know you've got a Discord server linked here as well — how active is that? I haven't gone into it myself.

I mean, it's fairly active. So — can you put the links to the GitHub and the Discord in the description?

Yeah, I can do all that. There's no nice and easy link to it otherwise.

Please do. So, we have a Discord. I was worried at the beginning that there would either be, I don't know, three people and it would be totally empty, or 3,000 and I'd be spending all my time moderating it. But there are, I think, 100 or 200 people, and it has a healthy level of activity. It's nice, and I encourage you to join — unless you're one of those people from the comment sections who's going to write really ugly things about NVIDIA or AMD, or Intel in this case; then please do not come. If you're a normal person, please do. When I have something working, or work in progress, I share it first on Discord, and later, when I have more things batched together, I write it up as news. So if you want to be a bit closer to the development as it happens, or you have some questions, then please join the Discord.

So if somebody wants to get involved with the project, head over there and head over to the GitHub?

Yeah — if you want to get involved, look at the GitHub, join the Discord. We're not getting so many external contributors, and it's sort of not mainstream programming: if it's something web-related, you have a much bigger pool of developers who can contribute, but GPU development is much more niche. So it's not like we're overwhelmed by contributors. If you want to add something, you will be a really special person in this project.

Oh, there's 12 — okay, there are 12 contributors listed.

Yeah, you will be a very special person. And even if you cannot contribute code, then contributions to the documentation — changes to it, or documentation itself — are very welcome. I'm not a native English speaker, so you might have an advantage over me there. And ways to improve the project in general.

It's really normal for a project to have a big drop between the core contributor and the second-top contributor, but I have not seen a distribution like this before, where you're at like 400,000 total lines of code added and removed and the next person is at two. You would be a very special person if you wrote a serious patch for this project.

Yeah — if you contribute 20 lines of code, you can probably be the second-biggest contributor.

So is there anything else you want to direct people to, or is that pretty much it?

I think that's it... oh, yeah, one more thing: watch the next episode and the previous episodes of this podcast.

Yes, do that. I'll do my outro and then we can sign off. So: my main channel is Brodie Robertson, I do Linux videos there six-ish days a week. I did a video on ZLUDA coming back, so if you've not seen that one yet, go check it out — it'll be like three weeks old
by the time this video comes out. So if you haven't heard about it coming back yet, go watch that video — I go over the blog post, the history of the project, all that fun stuff. If you want to see my gaming stuff, that's over on Brodie On Games, where I stream twice a week. I also have a reaction channel where clips from that stream go up, so if you want to watch just clips of the stream, do that as well. And if you're listening to the audio version of this, you can find the video version on YouTube at Tech Over Tea. I'll give you the final word — what do you want to say, how do you want to sign us off?

Well, I finished my tea, so it's time for another tea.

That's a good plan, actually. That's it.