We need a new direction. Large language models are not our way to advance AI. A language model is for me a database technology; it's not artificial intelligence. You grab all the human knowledge in text, perhaps also in code or whatever, and store it. Currently its reasoning is not real reasoning — it's repeating reasoning things, or code things, which have already been seen. Our path ends with scaling up: you put in more training data to make it larger, but not smarter. The systems are not different, they're only larger. Something is missing.
Yeah, something is missing.
You and Jürgen, you're pioneers of connectionism in a way, and yet you're neurosymbolic guys — you always have been. Why is that?
Sepp, it's an honor to have you on MLST. Thank you so much for joining us today.
It's an honor for me that you have me.
Oh, don't be silly. The thing that is amazing about language models, and deep learning in general, is that they capture a lot of the subtle intuitions, cultural information, creativity, and so on — so they're really good for generating programs. And the thing is, if we want to do abstraction, we need to have programs. But where do the programs come from? If we build systems that can create and acquire abstractions, we need to build systems that can write their own programs; it doesn't seem possible just to do some discrete program search, because it's too difficult. What is your view on large language models?
A large language model is for me a database technology. It is not artificial intelligence — OK, it's artificial, but it's more or less a database technology. You grab all the human knowledge in text, perhaps also in code or whatever, and store it, and you generalize it, you combine it. If there's a Tuesday, I can replace it by Wednesday, because these are days of the week, or names, or numbers — it's some kind of generalization, but of things which already exist. The question is: do we need new code? Is every code already written somewhere, and you only have to pull it together or combine it? But if you really should come up with new code, with a new idea, with a new concept — large language models can only pull out existing code they have been trained on. It's just not possible; they are not trained to produce something new, so they are very limited. But they are very powerful, because AI needs a knowledge representation. And right now there's the problem of hallucination — how to pull out the knowledge. Also with inference, with Strawberry (o1): the knowledge is perhaps already in the system, how do I get it out? It's a database where I don't know how I can access the information. We need a new direction. Large language models are not our way to advance AI in the long term. It's a good database technology, a good knowledge representation technology, it's important for AI, but we have to find new ways.
Could I challenge a tiny bit? I completely agree that vanilla LLMs are approximate retrieval engines — and even then, they're not quite databases, because they have this interpolative property. Things like o1 are approximate reasoning engines: they're doing this test-time compute and searching many combinations. Because even though this thing is a finite state automaton with a fixed amount of compute for a single forward pass, it can generate code, and the code contains all of these fundamental basis primitives that can be composed together.
So you can do this test-time search, you can compose programs together, you can in a sense search Turing space indirectly by searching through the space of programs. So there seems to be something there. Are you saying that methods like o1 are a road to nowhere and we need something completely different, or could we just tweak a little bit?
You can tweak it, and you will go very far, because you have — let's say — the program space, and that's very nice. There are so many programs, so many combinations of programs which give you a new program. If you think of Kolmogorov complexity — the length of programs, the complexity of programs — the simple, low-complexity programs are already stored or can be combined. But if you have to find a program which needs completely new concepts, which cannot be combined out of existing programs, I think it cannot do it. It can only combine things it has already seen. The large language model learned on code, but it cannot come up with completely new code concepts. Perhaps such concepts don't exist — if you say everything in code was already invented and we only have to combine it, and there is nothing new, then I agree with you. But if there is something new to invent, I don't think large language models can advance us.
~ Sponsored Segment Removed ~
Well, let me push gently on that. I think this is a discussion about creativity and also epistemic foraging — creating new knowledge to explore.
Also reasoning — programming is a lot about reasoning. You do have to have some logic: if you get this very complex logic right, then your program is working.
Absolutely. But if we say reasoning is knowledge acquisition, and we need systems to come up with new abstractions, and if we agree that those abstractions can be combinatorially deduced from abstractions that are already in the system — so we have the combinatorial closure, they do exist — then creating the abstractions is more a matter of understanding when an abstraction is good: how do I find a good abstraction using an algorithm?
But I think what we humans have is not only that we draw our ideas from coding, but from understanding the world, from having all this world knowledge. From coding alone you're limited; I think we have a lot of reasoning capability outside of doing only programs. But I agree, with programs you can go very far. Still, if it's a program where you need a lot of reasoning, a lot of logic to go to the next step, and the next, and it was not in your training database, I don't think the current large language models can do it. Right now I don't believe they really understand reasoning. They imitate reasoning; they reproduce what they have already seen. I don't know whether they understand it, and there are many examples where you change a little bit and then they go wrong.
Can you explain the difference between the kind of reasoning we do — strong reasoning — and the kind of reasoning we can do in current AI?
I think in current AI the reasoning is not real reasoning. It's repeating reasoning things, or code things, which have already been seen in the input data, combining them, and also replacing some variables. The reasoning we do — we have reasoning concepts like contradiction, induction, all these things we learned, and for us it was also hard to learn, at school or at university. But now we have reasoning concepts: how we structure things, how we show that something is true or not true. All these formal systems — you have to have formal rules. In theory LLMs might learn some formal rules, and then they can do reasoning in one very specific setting and produce new things, because they only apply the rules. If the rules are in the training data, they can apply them, also to new things. Then, within this one reasoning system, they can probably reason; but if you go to another system, the reasoning capabilities are lost.
Two quick points on that. First of all, would you consider move 37 — in AlphaGo, the Google Go-playing algorithm, which creatively discovered this amazing move — would you consider that reasoning?
So now, it's a move, OK. Yes, it created new knowledge. But here there was a sub-symbolic part and a symbolic part: it was Monte Carlo tree search, which is a classical AI concept, and at the end of the Monte Carlo tree search you have the value functions and so on. It discovered the move by checking things and then evaluating. Yes, I think it's a combination between understanding the game and computing a lot of moves into the future with Monte Carlo tree search.
~ Sponsored Segment Removed ~
Even with that, I guess you could say it was still an approximate value function — there were no formal guarantees or anything.
Exactly, that's true.
But I completely agree that LLMs on their own are approximate retrieval engines. The thing is, we can build systems around them: we can have formal verifiers, we can have neurosymbolic systems, we can have Lean, for example. So with these systems, can we do reasoning?
I think in principle it should be possible. I'm not sure, but I think the reasoning is limited to the domain you see in the training data. There are different formal systems, different formal logics. It can learn one logic if it sees enough of its rules; it knows which variables it can change and how to produce something. I think you can train an LLM on one logic system to produce new logic. But you learn the syntax, not the semantics. If I want to prove something, I have to take different steps towards the proof, and here I think it would struggle. It would learn to do the formal things, the syntax: I have a sentence, a correct formula, and I produce another correct formula by applying the rules it has learned — it has seen it, it can do it — but it's not goal-directed in the step by step. They are still not perfect; sometimes, or in most cases, not as good as humans.
It's interesting. I kind of agree that knowledge is created in service of a goal, and there's a creative component to reasoning. And we can build systems that can dream and generate data, and we can bootstrap that, and some of it can come from the users of a system. So it feels like we can build systems that can reason — but perhaps they wouldn't have something that we have. Maybe we have something extra.
Yeah, but I would bet on this. Also: why should you learn to reason? Why not use a reasoning system? Why not call a subprogram — can you prove this? — a theorem prover, Mathematica, stuff like this. You can learn it, perhaps it's also OK to learn it, but I don't see the necessity, because we also use tools. Why should these future AI systems not use tools for everything: for math, for looking up knowledge, and so on? For me it's stupid to push everything into one system, because we don't do it either — we know how to use our tools. Somehow I feel that's the better solution.
What has happened in the last two years since we spoke?
A lot of things happened. For example, I founded a company, NXAI — a company dedicated to industrial AI. And also xLSTM happened: this revival of the LSTM method, which now should compete with the Transformer technology.
Yes, and we're going to get on to that. Before we do, it would be good to go on an intellectual journey through LSTMs. But just before we get there — we've not really spoken about some broader things. What was it like working with Jürgen?
Jürgen is a very special person. He's very inspiring.
I can tell you one story from the University of Munich, where we both were. There was a seminar, and there were three people presenting: one tried to get all the students into multi-agent systems, one into spatial cognition, and Jürgen into neural networks. Jürgen came and said, "Oh, I'm not prepared, I don't know what to do." Then he gave his introduction to his topic, and out of the 50 students, most selected his topic. So you see, he can convince people. And it was fun. I was sitting there doing the programming, and Jürgen did his art things, where he made circles, and out of the circles women appeared. He did a lot of things, and he once told me it was not clear to him whether he would go into art or into science. But it was always fun with him, always inspiring.
Just in case the audience don't know: of course you worked under Jürgen, and you are both pioneers in the realm of artificial intelligence. It's insane. But what gave you the intuition, all those years ago, to be working on the right things?
You mean LSTM, which stands for long short-term memory. Jürgen introduced me to neural networks, in particular to recurrent neural networks, but they did not work. In my diploma thesis — he was my supervisor — he gave me a task. It was called the chunker system: you have a sequence, and everything you can predict you can remove from the sequence, because it's predictable anyway. So you shorten the sequence and you can learn it. That was the idea of the chunker system, but it was really a workaround, because recurrent neural networks were not working. And then two things happened. First, I built a neural network where only one weight had to be adjusted — one weight to store a piece of information which you need at the sequence end — and the network could not do this. I did all my printfs, all my debugging on the screen, numbers flowing over the screen, and then I saw: hey, these are super small numbers. And those were the gradients. There was no weight update; the gradients just were not there. This was the discovery of the vanishing gradient: if you have some target and you want to know what is needed to predict the target, you do credit assignment back through the sequence, and at the sequence beginning you get no signal — vanishing gradient. Now I knew why recurrent networks do not work. And the solution was LSTM: the long short-term memory cell, where I built something which makes sure that the gradients, as they get propagated back, are not rescaled — they remain the same. So at the sequence beginning there was exactly the same gradient as at the sequence end; there is no vanishing gradient anymore. This was the memory cell architecture, which is the core of LSTM. So I discovered LSTM, I wrote it up in my diploma thesis, and later, when Jürgen came back, he asked me: "Hey, you did something in your diploma thesis — should we publish it?" And then we published it.
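To make the memory-cell idea concrete, here is a minimal sketch of one step of a modern LSTM cell (with the forget gate that was added after the original formulation). It is illustrative only — the weight layout and names are my own, not the original equations — but it shows the property described above: the cell state is updated additively, so with a forget gate near one the error signal flowing back through the cell is not repeatedly rescaled and does not vanish.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, b):
    """One step of a forget-gated LSTM cell (illustrative sketch).

    The cell state is updated additively: c = f * c_prev + i * g.
    With f close to 1, gradients flowing back through c are not
    repeatedly squashed, so they survive over long sequences
    (the 'constant error carousel').
    """
    z = W @ np.concatenate([x, h_prev]) + b       # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell input
    c = f * c_prev + i * g                        # additive cell-state update
    h = o * np.tanh(c)                            # gated cell output
    return h, c

# usage: tiny example with input size 3 and hidden size 4
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):             # run over a short sequence
    h, c = lstm_cell_step(x, h, c, W, b)
```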
And it's been one of the most cited papers in the history of deep learning — a very impactful paper. Just reflecting back on it, what do you think the long-term impact of LSTM has been?
I think it's still used. In my keynote I gave one example from this year: predicting floods. LSTM is a major model in the Google app there, and for predicting floods it's used by the US government and the Canadian government, and here LSTM works better than everything else — better than Transformers and so on. OpenAI built a big LSTM network as an agent, and DeepMind, with StarCraft — AlphaStar was a big LSTM network. LSTM became the major thing in language up to 2017. Everybody used LSTM, together also with attention; attention was used together with LSTM. And then that one paper came out, "Attention Is All You Need" — meaning you only need attention and you don't need the LSTM anymore — and that is where the Transformer was born, and the new technology took over language. LSTMs still performed well in time-series prediction, as reinforcement learning agents and so forth, but Transformers were stronger, especially in language. This has changed again now, hopefully, but at that time Transformers took everything over: they were better parallelizable, you could throw more data at these models, learn on more data, they were faster, and LSTM could not compete.
How did the LSTM solve the trade-off between storing new data and protecting data that was already stored?
That's a very interesting question, and it's also the strength of the new xLSTM. The idea is gating. We have different gates, and perhaps the most important one is the input gate: it scales new incoming information up or down. It can be scaled down to zero — nothing is stored — or up to one — everything is stored. The input gate is something like an early attention mechanism: you have a time series, and deciding which sequence elements you want to pay attention to — the input gate does that. And then there is the forget gate. The forget gate asks: is the already stored memory important, or should I downscale it? But more important is the input gate. The input gate really picks out specific sequence elements to be stored, so non-relevant stuff is not stored. It was one of the first attention mechanisms, but we called it gating.
Just before we get to xLSTM — because of course you have this amazing new invention which solves many of the problems the original LSTM had — can you tell me about the computational complexity of an LSTM compared to an RNN?
An LSTM is an RNN.
A vanilla one, without all the gates?
Without the gating, yes. The complexity is only increased by the gating mechanism, but it's still linear in time. Perhaps it's better to compare it to attention. With attention, if you have a new query, a new piece of information, you have to look back at all previous items. The LSTM only interacts with the memory, with what is already stored — so for one query you always have a constant interaction with the memory. Attention has to go through all keys and do these pairwise interactions, and there are two disadvantages. The first: it's computationally very expensive — quadratic in the context length. The second: you have only pairwise interactions. You do a dot product, then the exponential of the dot product — that's the softmax — but it's only a pairwise comparison. It could be better if more tokens, more sequence elements, were pulled together, so that a new element interacts with, I would say, an abstraction of these different tokens. So two disadvantages of the Transformer: the computational complexity, plus the very simple interactions. The LSTM is like a recurrent network — all recurrent networks are linear in the sequence length, linear in the context length. The LSTM is a little more complex because of the gating mechanism, but it is by far not as complex as a Transformer with its quadratic complexity.
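As a rough illustration of this complexity point, here is a toy sketch (my own example, not code from the interview or any xLSTM release) contrasting the two update patterns: attention compares each new query against every stored key, so its per-step work grows with the context, while a recurrent memory folds each new element into a fixed-size state at constant cost per step.

```python
import numpy as np

d = 16  # feature dimension (illustrative)

def attention_step(q, keys, values):
    """Per-step cost grows with len(keys): O(T) work per token, O(T^2) over a sequence."""
    scores = keys @ q                       # one dot product per stored key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the whole history
    return weights @ values

def recurrent_step(state, x):
    """Per-step cost is constant: the history is compressed into a fixed-size state."""
    return 0.9 * state + 0.1 * x            # toy update (stand-in for gated memory)

# usage: process a sequence of T tokens
T = 1000
xs = np.random.randn(T, d)
keys, values = xs, xs
state = np.zeros(d)
for t in range(T):
    _ = attention_step(xs[t], keys[: t + 1], values[: t + 1])  # looks back at t+1 items
    state = recurrent_step(state, xs[t])                        # constant work per step
```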
Could you explain to the audience why something which is quadratic — something that should be worse — actually ran faster?
It ran faster because of the implementation on the GPU, the graphics processing units, the chips everything is implemented on. We had something called flash attention, a very fast attention mechanism, and you use hardware optimization — that's one thing. The other thing is that you can do it in parallel. I said one query looks back at all keys, but it can look back at all keys at the same time. You can do everything in parallel: assume you have a sequence, a sentence, and all the words are pushed up one level simultaneously, while the recurrent networks, the LSTM, have to go over it sequentially — the first element builds up a new memory, the next element builds up a new memory. Attention can push everything up in parallel, and therefore attention was at that time much faster than LSTM. And the second thing: you could optimize it for the GPU, for the hardware. These two things — parallelization and hardware optimization — gave attention a big advantage. You could train on much more data in the same time, and LSTM could not compete with that.
You mentioned flash attention — can you quickly explain that to the audience? Does it mean that in certain circumstances you don't actually need to do the full quadratic attention?
It's still quadratic, but super highly optimized: you use the fast memory caches, you even use the registers of the GPU — very, very fast memory. It still has the same complexity, because mathematically it's quadratic and you cannot cheat math, but you can do it super fast. Flash attention was super fast because it is hardware-optimized.
Wonderful. So can you bring in xLSTM, this new invention? How does it overcome some of these problems with the original LSTM?
Yes. I'll start with a spoiler, because I talked about flash attention: with xLSTM we are faster than flash attention, both in training and in inference — and inference especially is important. Now I go back. After seeing this rise of the Transformer, we thought: OK, first of all, why could it not be the LSTM? Was the backbone architecture the key to building very, very large models — having these feed-forward connections, these many parameters where you store all the information? Meaning: is it important to build big models, or is it important to have some specific technology for looking back, for compressing the history? We thought the LSTM should be able to do it, and we asked the question: can we scale up the LSTM like Transformers and get the performance of Transformers? But we knew some limitations, some drawbacks of LSTMs. One we already mentioned: parallelization. We have now made the LSTM parallel too — we used the same ideas as attention to parallelize it. But there were two other limitations. One was that the LSTM could not revise decisions: if you store something and then later you see something different and realize that this new thing should have been stored, you cannot revise it. I'll give you an example.
It's like you want to find new clothes, for example. You say: now I found these clothes at this price. And if you look further on the internet, you find new clothes which are even better, plus the price, and perhaps the clothes should fit your shoes or whatever. If you find something better, you should throw away what you have already stored — both how similar it is to your shoes and also the price. The old LSTM could not do this: if I find a better-matching thing and I have to memorize its price, I have to delete what was stored before, and the old LSTM cannot. The xLSTM can do this, and the idea is exponential gating. To revise a storage decision we use exponential gating. The idea is: if I find something better, I upweight it very heavily, and then I normalize; therefore the old best solution is downweighted, and this way I can take the better thing and throw away my old stuff. In theory the forget gate could do it, but in practice you cannot learn to forget, because you cannot at the same time learn to store very precisely and then forget at some single time step. Exponential gating — exponential input gating — was the key: I have something better, so forget everything that was before. This gave us an advantage. The second thing was a matrix memory, a big memory. The original LSTM has as its memory a scalar, one number — you have only one number you can store, and that's not much. The new LSTM, the xLSTM, has a whole Hopfield network. We use a classical Hopfield network — it became popular again because there was something called the Nobel Prize, and John Hopfield got the Nobel Prize for the classical Hopfield network. So instead of this single scalar we use a whole Hopfield network. It's like a classical Hopfield network plus gating: with the input gate we say what should be stored in the Hopfield network, and with the forget gate, how much the old stored items should be downweighted. It's a Hopfield network equipped with gating. We merged the Hopfield network idea with the LSTM idea, and this gives us an LSTM with a much stronger, much bigger memory. So: exponential gating was important, increasing the memory was important, and the third thing, as I mentioned, was to parallelize it. These three ingredients we used to build the new xLSTM, and it was fantastic what results it gave — we didn't expect such good results, to be honest.
And I suppose the memory mechanism is also reminiscent of the fast weight programmers of the 1990s?
Exactly — they already did this. Like other things, like Hopfield networks, it's always an outer-product memory. You have a memory and you have two vectors — one, like in attention, we call the key, the other the value — and you take the outer product of key and value and add it to the memory. That's the idea. What we added on top is an input gate on the new item which is added, and a forget gate on the old memory. But it's a known technique, this outer-product storage. It's even older: the Ising-type models in the '70s already had these ideas, Hopfield networks used the same idea, and the fast weights used this idea too.
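Here is a minimal sketch of the gated outer-product memory described above — my own simplification in NumPy, loosely following the published matrix-memory idea rather than quoting the exact xLSTM equations, with invented variable names. The memory is a matrix built from key/value outer products; an (exponential) input gate weights what gets written, a forget gate decays what is already stored, and a normalizer vector keeps the exponential gates stable when reading.

```python
import numpy as np

d = 8  # head dimension (illustrative)

class MatrixMemory:
    """Toy gated outer-product (associative) memory, in the spirit of the
    xLSTM matrix memory / fast-weight idea. Not the exact published equations."""

    def __init__(self, d):
        self.C = np.zeros((d, d))  # matrix memory: sum of gated value-key outer products
        self.n = np.zeros(d)       # normalizer: sum of gated keys

    def write(self, key, value, i_gate, f_gate):
        # the forget gate decays the old memory, the input gate scales the new item
        self.C = f_gate * self.C + i_gate * np.outer(value, key)
        self.n = f_gate * self.n + i_gate * key

    def read(self, query):
        # retrieve an abstraction of all stored items; the normalizer keeps
        # large exponential input gates from blowing up the output
        denom = max(abs(self.n @ query), 1.0)
        return (self.C @ query) / denom

# usage: a later, strongly up-weighted item dominates retrieval
mem = MatrixMemory(d)
k1, v1 = np.random.randn(d), np.random.randn(d)
k2, v2 = np.random.randn(d), np.random.randn(d)
mem.write(k1, v1, i_gate=np.exp(1.0), f_gate=1.0)   # first storage decision
mem.write(k2, v2, i_gate=np.exp(4.0), f_gate=1.0)   # better item, exponentially up-weighted
out = mem.read(k2)                                   # retrieval is dominated by v2
```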
Could you also give a little more intuition on the gating? You said you moved from sigmoid to exponential. Certainly in the 1990s people were using the sigmoid, and even the hyperbolic tangent, as activation functions. What was the intuition at the time to go with the sigmoid, and, in a bit more detail, how does the exponential version fix the problem?
Going with the sigmoid: it's gating, and the sigmoid is between zero and one, which is the natural thing. One means the gate is open, everything goes through; zero means nothing goes through, the gate is closed; and in between you scale. So the sigmoid was the natural thing to use for gating. But it has a problem. If you encounter one sequence element and you say, "I multiply it by 0.5," and then another element comes and you say, "Oh, if that one was 0.5, I should multiply this one by four" — that doesn't work, because the sigmoid only goes up to one. I cannot go higher, I cannot overrule the earlier decision. So the sigmoid is limited: you have to make a decision, and later elements cannot outweigh it. Exponential gating is not limited — you can always use larger values. But there's a problem: in the early days we never used exponential activation functions, because learning would break down. So we need a second ingredient: one is the exponential gating, the other is normalization. You have this exponential, but then you normalize by the sum of the exponential input gates. It's like a softmax — if you remember how a softmax works, you have these e-to-the-power-of-something terms and then you divide by their sum. Here it's like a rolling softmax. So we moved in the direction of attention with the LSTM, but recursively: you have the exponential input gate, and then you divide by the sum of all input gates. It's a little like a softmax, but also different. And there's another thing: it changes the learning dynamics, and we don't have a clear understanding of what's happening. We only saw, across different architectures, that the softmax with this exponential function gives an advantage in the learning dynamics — meaning when other systems get stuck, are stalled, and do not learn anymore, there are some gradient peaks which let the Transformer keep learning, and we now observe the same with xLSTM. Revising storage decisions was our reason for it, but the learning dynamics were also modified in a positive way. It's not completely understood what's happening here. I think there are some random directions — if nothing works anymore, you get some random weight updates which help learning progress — but that's speculation. So: exponential gating, matrix memory, and parallelization.
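A tiny numeric sketch of the "revise a storage decision" point (entirely my own toy example, not from the xLSTM paper): with sigmoid gates capped at one, an earlier item written with a gate of 0.9 can never be strongly overruled by a later one, whereas exponential gates followed by normalization behave like a rolling softmax, so a much larger later gate pushes the earlier item's effective weight towards zero.

```python
import numpy as np

def normalized_weights(raw_gates):
    """Effective weight of each stored item after normalizing the gate values."""
    g = np.asarray(raw_gates, dtype=float)
    return g / g.sum()

# sigmoid-style gating: both gates live in (0, 1), so a later item cannot dominate strongly
sig = normalized_weights([0.9, 0.95])                  # -> ~[0.49, 0.51], old item barely demoted

# exponential gating: a larger later pre-activation overrules the earlier decision
exp = normalized_weights([np.exp(1.0), np.exp(5.0)])   # -> ~[0.02, 0.98]

print("sigmoid-style weights:    ", sig)
print("exponential-style weights:", exp)
```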
I'm just interested in what triggered the flash of inspiration. If you could go back in time and tell your younger self about this, would your younger self just say "yes, absolutely", and would you have done it then?
Yes, but my younger self would have had to see a couple of examples, because at that time we didn't have these big language models, we didn't have the problems where you see that exponential gating helps, that this big memory helps — we did not have these data sets. If I not only told him what to do but also what data would come, then he would say: "Yes, of course — I have such a small storage; if you want to store much more, of course you have to do this." I would have seen it, but I would also have needed a glimpse into the future, of what data would come.
In your paper you studied how these things scale with data and model size and so on. Can you tell me about some of the theoretical underpinnings?
These are standard scaling laws — not developed by us — where you increase the model parameters and the loss follows a certain law, a certain curve, and you compare it with, say, Transformers or state-space models, which also follow a certain law. Then you can extrapolate and say: if you build larger models, we will also be better. These are scaling laws used, not invented, by us. It's nice because you can predict how these larger models will behave if you make the model larger or use more data.
You mentioned state-space models — things like Mamba. Could you contrast with that?
Mamba was the most competitive method compared to xLSTM, and after our publication of xLSTM, Mamba-2 came out. The nice thing is that Mamba-2 is essentially xLSTM without the input gate. It has e to the softplus — do some math and you see that's a sigmoid — and you see they also have this forget gate and an output gate. Mamba-2 is like an xLSTM with the input gate left out. So it's nice to see different methods converging to almost the same architecture. It's not the same as Mamba, because they don't have the input gate, and I think the input gate is important, but the remaining architecture is very, very similar. They started with state-space models, we started with the LSTM, with Hopfield networks and so on, and now we converge more and more to very similar architectures.
Are you seeing any hints of industry adoption of xLSTM?
Yes. First of all, xLSTM is now faster than flash attention in inference and also in training, and I can tell you why. With flash attention you also have to push the whole long context into the GPU. What we now do is chunks of flash attention, and between the chunks we do the recurrent part, and we design the chunks so that we are more efficient on the GPU: with smaller chunks you don't have to squeeze things in and do inefficient stuff — you can make them exactly as large as the caches. So we use flash-attention technology — we stole it from those guys — but choosing the right size of flash attention makes it fast. We do flash attention, recurrence, flash attention, recurrence, and now we are faster than doing one whole flash attention over the whole context — both in training and in inference. We call it chunkwise flash attention, and it gives us speed. I didn't expect that we could be faster than flash attention in training — no way — but hey, it's unbelievably fantastic. That we would be fast in inference we knew, because attention has to run autoregressively: you produce a new word in generation and then push everything into the system again, produce the next word, push everything in again. You can cache some of the processing and do it fast, but attention is not made for this autoregressive mechanism. In training you have the whole sequence, so being faster there was unbelievable; in inference I was sure we would be faster. And this gives us advantages in different ways. First, something in language: are you aware of this Strawberry, or o1, thing?
Oh yes — it's doing more on the inference side, it's doing more "thinking".
Yes, and on the inference side we would be much faster. If we are, say, 100 times faster at inference, we can do 100 times more thinking, and this is a big opportunity — it plays into our hands. It was so nice that it came out, because we are exactly there: we are fast at inference. But this fast inference speed also helps us go into industrial applications, away from language.
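The chunkwise scheme described above can be sketched roughly as follows — a hand-wavy illustration in NumPy with invented helper names, not NXAI's actual kernel: the sequence is cut into cache-sized chunks, each chunk is processed in parallel with attention-style dense math, and a small fixed-size recurrent state is carried from one chunk to the next, so the overall cost stays linear in sequence length.

```python
import numpy as np

def process_chunk_parallel(chunk, state):
    """Placeholder for the dense, flash-attention-style work inside one chunk.
    All positions in the chunk are handled with parallel matrix math."""
    mixed = chunk @ chunk.T          # toy intra-chunk interactions (parallelizable)
    return mixed @ chunk + state     # combine with the carried-in recurrent state

def update_state(state, chunk, decay=0.9):
    """Placeholder for the recurrent hand-off between chunks: the chunk is
    folded into a fixed-size state, so memory does not grow with context."""
    return decay * state + chunk.mean(axis=0)

def chunkwise_forward(x, chunk_size=64):
    T, d = x.shape
    state = np.zeros(d)
    outputs = []
    for start in range(0, T, chunk_size):
        chunk = x[start:start + chunk_size]
        outputs.append(process_chunk_parallel(chunk, state))  # parallel within the chunk
        state = update_state(state, chunk)                    # sequential across chunks
    return np.vstack(outputs), state

# usage: linear in T, with chunk_size chosen to fit fast GPU memory
y, final_state = chunkwise_forward(np.random.randn(1024, 32), chunk_size=64)
```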
Language is not at the core of many industries — it's not where many companies have their business — but now I can go into robotics. Transformers were used for robotics — DeepMind had a paper, Tesla had a paper — but they all struggled because a Transformer is too slow: sometimes you have to wait a couple of seconds before the agent reacts. Now we have something which is much faster. And we have a second advantage: we have a fixed memory. We know in advance how large our memory is. If we go onto an embedded device, we know how large the memory is; we can design the LSTM with this fixed memory, and no matter how long the sequence is, you use the same fixed memory — whether the sequence is 100 elements or 100 million, it's the same memory. We can fix the memory and we are very fast, and these two things give us an advantage for going embedded, for going into robotics. Someone has even already tried it out for drones — they have these devices, these GPUs, on drones — and they emailed us (they don't want to reveal who they are) and said it's unbelievable: much better results, and now the drones are flying autonomously. You need real-time control, you cannot wait, and with xLSTM it's working — it's fantastic. This person also talked to me at NeurIPS and said, "I don't know whether I want to reveal it, because it's so good for us" — it's a company. But going into robotics, going to drones, also to self-driving in a car: you want to be energy-efficient, you have batteries with you, you want to be fast, you want to be compact, you want a small but powerful system, and here I see big advantages for xLSTM. Perhaps even going onto the cell phone — I'm not sure, I don't know the constraints of a cell phone, perhaps that's too far-fetched — but we have a thing which is energy-efficient, fast, and whose memory we can control: we can design the memory for the device, for the embedded device, for example.
Do you think the xLSTM moves us closer to something that resembles symbol manipulation?
Symbol manipulation — I don't know. We have a project about neurosymbolic AI. In one sense I think xLSTM is better at building abstractions. What I'm missing from the AI systems we have out there: I never saw an AI system build proper abstractions. It's always human-made — language is human-made; if you look at ImageNet, a human put the object in the middle. I want to see an artificial system which comes up with a new concept which is not human-made. And xLSTM — I don't know whether it can do it — but in the memory, by combining more tokens, by combining more from the past, perhaps you can build a concept, because it's more efficient to store an abstract concept than to store the single items, the way attention stores single items. If you can compress it into something — if you see sun, beach, cocktail and so on, you say, "Ah, perhaps somebody is on a beach holiday" — that's perhaps one abstract concept, and storing this is perhaps more efficient than storing the single items. The same should happen in industrial applications: you see concepts, you see structure, and you store the structure, not the single things. And if you have the right abstraction, you're better at generalization, because if you have abstract concepts, in the future we will hopefully encounter these abstract concepts again.
Yes — the reason I asked is, of course, you've got your symbolic AI paper as well, and I'm really interested in neurosymbolic architectures; there are many approaches to doing that. In neurosymbolic research we've seen many people using Transformers to generate programs; some folks skip the explicit program generation and just get Transformers to perform symbolic-like tasks — and Transformers are incredibly limited: they can't copy, they can't count, there are lots of things they can't do. Do you think the xLSTM could overcome some of these obvious computational limitations of the Transformer?
Probably it can overcome some of them, but I think the solution is to combine both. What we have right now is not the final solution; we have to go symbolic. There are already things out there where a Transformer is, say, using MATLAB to solve an equation, or querying the internet, or whatever. I think we need both, because there are so many symbolic techniques out there that we have developed over 50 years, and we should somehow integrate them, use them. Maybe everything is learnable in principle, but right now the shortcut is to use what's already there and combine it in the right way. In Austria, the biggest AI project — it's about 40 million euros, and I'm leading it — is Bilateral AI: bilateral because it brings symbolic and sub-symbolic AI together. Because, as I said in my talk, scaling is over; now we have to go into the industrialization of AI, and for that we need new techniques — and perhaps not only new techniques from the sub-symbolic side, from the neural network side; perhaps we need things from the symbolic side to make things more robust. If a production process stands still or is stalled, that should not happen, and therefore you perhaps need symbolic methods integrated with, or surrounding, the sub-symbolic methods like large language models and others.
I completely agree — I think we need to build hybrid systems.
Yeah, that's the neurosymbolic approach, and that's what we are doing in Austria in this big project. It's hard — it's hard to bring these two communities together. Sometimes they don't like each other: one side says, "We have big success stories," the other says, "We have other success stories." But I think that's the way to advance AI, and also to enable industrial AI, as I said in my talk, because for industrial AI we need symbolic systems to make it robust, to guarantee things. We now have to team up with the symbolic people to advance AI.
I completely agree, so we need formal verification. The thing is, though, can we have our cake and eat it? Because the one problem with these hybrid neurosymbolic systems is the degree of human engineering. Can we automate the creation with some kind of architecture search? We're building these big systems with many, many components, many verifiers and so on — how much of that can we automate?
In the group where we do the neurosymbolic work, the symbolic people say, "Hey, we need machine learning, perhaps to adjust the parameters of our symbolic methods," and the sub-symbolic people say, "We can use the symbolic AI perhaps as a shield surrounding it." But they don't merge it, they don't integrate it — like learning rules, learning new symbolic rules within the rule machinery. I know how the symbolic side works, but perhaps some rules are better learned; you have to integrate those things better.
But right now these two groups are each thinking in their own domain, and I'm missing that. And when somebody does do it — taking this from one community and that from the other and gluing it together — it's clumsy, not nice, not elegant. The elegant thing — like what we did when we learned some formal systems — would be that the learning goes into the formal systems, and the formal system becomes a sub-component, an integrated sub-component, of, I don't know, a large language model or whatever. Right now that's not there; the two groups are too separated.
So on the connectionist side there's Hinton and Bengio and LeCun. You and Jürgen are pioneers of connectionism in a way, and yet you're neurosymbolic guys — you always have been. Why is that?
Perhaps going back in history: you have to know that Germany and Austria were very strong in the symbolic area, in formal systems — many professors were doing this — while in the US and elsewhere other things were getting going when Europe was starting out. But Jürgen was, and still is, a guy who thinks along different lines. There was a big group on AI, but it was formal, and he said: no, I think it's these neural networks. And when I went to the university, as a student, everything was boring — theorems 50 or 100 years old, all of computer science, quicksort, all this old stuff. But then there was this neural network stuff Jürgen did, and nobody knew what would come out of it. You learn something — this was super interesting. And that was also Jürgen's thing: it was something new, not something old and traditional. Also, in the group we read science fiction books — "hey, I have a new science fiction book" — and so many ideas came up, like how you could traverse the universe with generation ships, what's possible, what's not possible. Good ideas. It was also the time when recurrent networks were a new technology, and there was a lot of innovation, a lot of ideas. Going away from the traditional symbolic thing was super fascinating — with this new neural network stuff you did not know what would come out; you changed something here and there, and it was exciting.
And in a way that's very polymathic — it's knowledge of so many different fields at once. Certainly Jürgen was talking about things like Gödel machines and recursive self-improvement, artificial creativity — all of these amazing ideas, in some sense before their time. But do you think things are starting to swing back the other way? Look at DeepMind, for example — I'm certainly seeing loads of neurosymbolic architectures coming out. Do you think the collective consciousness is changing a little bit?
I think so — perhaps it has to, because I think our path ends with scaling up, with making things larger. I don't know whether it was the right way, because it's more about storing more information in these systems: you put in more training data to make them larger, but not smarter. The systems are not different, they're only larger. And if this comes to an end, we have to be smarter, and I think the symbolic way, or a neurosymbolic approach, has to come, because it gives us a way forward. On the sub-symbolic side, with neural networks alone, I don't know where we should go or what we should do: we scale up, we have these almost brain-like models now, but something is missing. It's not what humans are doing.
Humans learn differently — with a few examples. They have other abstraction capabilities, they are much more adaptive, they can plan. Something is missing, and perhaps the symbolic side gives us what is missing.
How do we blend these ideas together? People think of System 1 and System 2 as being completely different, and they might be very intertwined — a lot of reasoning is perception-guided. How do we really integrate these ideas?
It's very popular, after Kahneman — also in the Turing Award speeches they always used System 1 and System 2. It's very compelling, but I'm not sure there's a clear separation. OK, there's perhaps a clear separation if you play a game of chess and start to plan — that's System 2. But there are intermediate things: something you do on gut feeling, you grab a thing and don't think about it; sometimes you think a little bit; and sometimes you plan two steps — should I go here or there, what's faster, I hear some people coming — very short thoughts where you are actually thinking. For me it's not so separated: I do things immediately, and I do long, long thinking, and there is everything in between. Many things you do intuitively — that's System 1-like — and sometimes you really think something through, but there are so many things in between. If I leave here, will I go home? I can go straight ahead, or perhaps I go down there — I make a couple of decisions, it's a little bit of planning, and it's intermediate. I don't think there's a clear division between System 1 and System 2.
I agree, I agree. The abstractions in these systems — should they always be human-intelligible? What I mean is, Elizabeth Spelke had these core knowledge priors — things like agents, spatial reasoning, objects and so on — and it's almost as if there's a core set of basis functions that we've acquired or learned about how the world works, which suggests that any reasoning system would just compose those simple priors together. Is that all there is to reasoning, or do you think AI systems could discover weird, alien forms of reasoning that we wouldn't understand?
I also believe in different concepts. The concepts we developed — the words — are what help us. For example, in a neural network, perhaps you have speed, acceleration and so on; you have these concepts. But if you apply a linear transformation to them, you have the same information, just mixed up a little. For a neural network that's no problem, because with a linear transformation you can do the inverse — it's the same information, just differently distributed — and perhaps sometimes it even helps that the information is differently distributed. I think we humans developed the concepts and abstractions that help us as humans, help us convey experience from one generation to the next, to inform others — about the best food and so on. That's the most important thing we do, because our kids have to learn which mushrooms are poisonous and which are not. Most of the information our children get is from the previous generation — they go to school and so on.
And I think our language, our abstraction, is designed, is tailored, to transmit this information from one generation to the next, because the information you acquire as a single human is much less than what you acquire through the culture. Therefore I think our abstraction, our language, our kind of thinking, is tailored to our society. And I think AI systems could come up with completely different ways of reasoning, but also with different abstractions. For them, other concepts might be much more useful, because they live in the same world in a different way; they manipulate the world in a different way.
Yeah, it's something I think about a lot, because there's this constructive component to abstractions, as you're saying. There's the language game, and we have this memetic cultural transfer, and it seems to be in service of some utility — for us to understand each other. But they are still grounded in the physical world. Acceleration is a thing in the physical world.
Is it? I would challenge you. For us it is. Perhaps acceleration plus something else, combined, is the real thing. I don't know — is it only convenient for us because it suits our kind of thinking, or might the real thing be acceleration plus the location? I don't know.
Oops. But we also have this weird ability as humans to think of things that are not directly coming from our sensory experience — abstract mathematical, Platonic ideas and so on. Where do those come from?
I think many of these things could, first of all, be only symbols — they are only placeholders for something. It's more interesting in physics: you have a concept of an atom, but you probably never saw an atom — at least I did not — yet you have a concept of an atom. If I ask what shape an atom has, you say it's perhaps a ball, a circle — why? Perhaps it's a triangle or whatever. You make these kinds of abstractions and you have some image in your head for things, but often it's a placeholder: if this and this go together, let's call it blah blah, and we invent a nice word for it. You have an intuition, perhaps even an image in your head for it, but sometimes it's abstract — it has no counterpart in reality.
Yes, exactly — so there's this huge difference between the semantics and the actual thing. I often think: if we took a 21st-century physics book and went back in time and gave it to Newton, I don't think he would understand very much of it.
I completely agree. We are trained in a specific way of thinking, and it's perhaps different from the thinking of many generations ago.
Indeed, indeed. Sepp, this has been amazing. Can you tell the audience a bit more about NXAI?
NXAI is my new company. The first idea, the first founding: I already told you about this xLSTM. I was super excited, and I'm at the university here, and I went to the media and said, "Hey, I have a new idea, but I don't have the money to show that it's a cool idea." Then there came the venture capital question: "Do you have a business plan?" I said, "No, I'm not interested in a business plan. I need some money to show this is a cool idea, and I want to keep this cool idea in Europe, I want to keep it local." That was a concept nobody understood, until somebody local said: "Yes, I'll give you some money. Let's first build the technology and then build on top of it, perhaps vertically, some companies." That's how it started.
NXAI was founded for this xLSTM; the first 10 million euros went into compute and into the first paper. And now NXAI has grown: it's a company dedicated to industrial AI. One pillar is this xLSTM, a new technology we want to develop. We have now shown with a 7B model that we can compete with the Transformer technology. It's powerful enough, and it has other advantages, like energy efficiency and speed, so we can go in other directions — into industry, not language, because in language there are so many companies competing, I don't know whether you can make money there, and it's not our core business. The second pillar is AI for simulation. Here too we have big success stories, because now we can do simulations that numerical simulation struggles with, that it cannot do. Take discrete element methods — particles, many particles: if it's a million, ten million, a hundred million particles, the numerical method cannot cope anymore. The same with mesh points: in computational fluid dynamics you have these mesh points — when air flows over a car, or over an airplane, you have all these points — and sometimes there are so many mesh points that the numerical methods no longer work. Now we have cases where, for example, you change something on a car: a numerical simulation takes three weeks — the engineer starts it, goes home, and after three weeks looks at what came out — and we can do it in three minutes.
Wow. And what is the idea behind it — why are these neural simulations so good?
I always give the example of the Moon. The Moon can be described by a location, a momentum, perhaps a mass — we don't describe each particle, each atom or each grain — and it's a very good prediction of where the Moon will be in an hour, or the next day, or whatever. In many numerical simulations you can likewise group particles, because there are structures. If you throw a snowball, you would not simulate every snowflake, but the whole snowball, and it's quite good. If an AI system can identify these structures — where 10,000 particles stick together, or do the same thing, or move in parallel — you can speed up the simulation. And that's what happens: there are regions where particles are somehow synchronized or glued together. One example: if you have something like grain — grain in a machine — there is no simple physics between grains; the numerical simulation would have to go down to the atomic level or something like that. But with one grain and another grain you can learn the physics of how they interact — how this one pushes that one, perhaps how they interact when they're a little wet, or a little larger, and so on. Then, where the numerical simulation needs thousands of points, you have one thing — a grain of sand or whatever — and you learn the physics of grains. There is no tractable physics of individual grains; you would have to go down to the atomic level, so this helps a lot to speed up the simulations. They are super powerful, because now we can do simulations where numerics struggles. This goes so far that we have, from the local industry — the steel industry, for example — these big ovens full of steel which they cannot simulate, because numerically there are too many particles.
Often they have to build a prototype, a larger prototype, because the simulation cannot cope with it. Now we can skip the prototype — the prototype is worth 100 million euros — and we can simulate it and build the real thing. This gives industry a big, big push, if it works. That's the simulation idea. Johannes is the one who will come later, hopefully — he will tell you much more about it — but I think it's super fruitful, a super cool direction. Ask Johannes; he can convince you much better than I can.
Well, he's coming here in 30 minutes, so we'll ask him about it. Sepp, it's been an honor and a pleasure to have you on. Thank you so much for joining us today.
It was a pleasure to be here. It was fun, I enjoyed it. Thank you.
Wonderful. Amazing.