...so much so that we now have phrases like "fractal intelligence." In fact, I think Andrej Karpathy was basically saying LLMs have fractal intelligence, and what fractal intelligence means is basically: we don't know when they work. When they work, they work; when they don't, they don't. That's fractal intelligence, and that sort of shows. Which is fine — we still had nothing like this before — but part of the science of LLMs has to be to say something more than "fractal intelligence," to say: here is the level to which you can depend on their results. In reasoning, in logic, there are ways of formally characterizing the limits of reasoning — limited-depth reasoning, limited-lookahead reasoning, and so on. None of them seem to work for LLMs. The question then is what would work, and we have to figure that out.

The bitter lesson is over and efficiency is going to matter, and I completely agree with that — I've been arguing this for a long time too. Think about the following: the first time we sent humans to the Moon, cost was not a consideration; we wanted to show that we could do it, and NASA was the one doing it. The second and third times, for space as well as the Moon, that may still be okay, but by now it's Elon Musk sending people to space, and supposedly possibly to Mars too, because the cost matters. Once something has been done, you start caring about the cost you're paying, and computer science is actually quite a bit about the unsexy parts of cost, just as it is about doing things that haven't been done before.

There are people who say, well, if it's not retrieval then it is reasoning — so what say you? It reminds me of that old Monty Python bit — Life of Brian, I think — where a guy does something that looks like reasoning to prove that someone is a witch: if she is made of wood and she floats on water, then she is a witch. "She's a witch!" "I'm not a witch!" "They dressed me up like this, and this isn't my nose, it's a false one." He strings together random connections, concludes that she's a witch, and says QED. That looks like reasoning, because it's not just retrieving some "she's a witch" answer — but we know that it's not sound reasoning.

Tufa Labs is a new AI research lab I'm starting in Zurich. In a way it is a Swiss version of DeepSeek. First we want to investigate o1-style LLM systems and the search methods applied to them, and so we want to investigate, reverse-engineer, and explore the techniques ourselves.

MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box — Llama, for example. You can just choose the pricing point, choose the model that you want, it spins up, it's elastic, it autoscales, you can pay on consumption essentially, or you can have a model which is always running, or it can be freeze-dried when you're not using it. So what are you waiting for? Go to CentML and sign up now.
...that Microsoft essentially can no longer control OpenAI if in fact AGI has been achieved — that was one way they could avoid being beholden to Microsoft. But now they're trying to remove it so that they'll get more money from Microsoft. I don't know what that says: are they looking for money, or have they realized AGI is not actually going to come anyway, so why bother with that clause?

So much has changed since our last conversation at ICML — can you give us a bit of a rundown of what's happened?

When we were talking in Vienna, I think we were talking about the reasoning abilities of large language models, and I think of large language models especially as the autoregressive, token-by-token prediction models, which are pre-trained that way and also operate that way at inference time. It was clear, I think, as we were talking at that time, that those — from my perspective — did not have reasoning abilities. They're amazing at supporting creative work, where they can give you ideas and you can run with them, and they will give you an answer as soon as you hit return, but they're not guaranteed to be correct. One of the interesting questions, of course, is that reasoning tends to have higher complexity in terms of the time needed, and are there ways of actually changing the LLM substrate to do that? A couple of things happened — and obviously we'll get to o1 in a second, because that's the bigger thing that happened — but an interesting way of looking at that whole direction is what's been called two parts: inference-time scaling and post-training.

The first ideas that were tried — and in fact we talked about this when I talked about LLM-Modulo — are that, to the extent LLMs are essentially quickly generating candidates with no guarantees, maybe you can make them generate tons and tons of candidates and then do majority voting, or self-consistency, or something like that, to see if you get a better answer. And how do you check for the better answer? There's a whole series of options: there might be external verifiers, there might be LLMs themselves trying to partially verify — there are problems with that, which we talked about — but they have tried that too. So that's one type of inference-time scaling.

A related, very interesting idea is this: it's been known from day one that if you give a reasoning task to an LLM as a prompt, it gives a completion, and you check whether the completion contains the solution, the probability of that happening can in general be made higher if you can find the right kind of prompt augmentation — in addition to your reasoning problem, some magical tokens that you add that seem to increase the probability. This has been seen in multiple scenarios. Originally this idea was bandied about as Chain of Thought, and the very first version of it is essentially the zeroth-order Chain of Thought, where the magical token is always the same one irrespective of the task and the LLM — "let's think step by step." That sort of worked, partly because the human data had those types of tokens, so the LLM outputs them, and that jogs its pattern matching to actually pick up other solutions, and so on. That would be the first thing.
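A minimal sketch of the two inference-time ideas just described — a fixed "let's think step by step" style augmentation plus sampling many candidates and picking one by majority vote (self-consistency) or with an external verifier. The `generate`, `extract_answer`, and `verifier` pieces are stand-ins for whatever model API and checker you have, not any particular library:

```python
from collections import Counter
from typing import Callable, Optional

def generate(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical stochastic LLM call; replace with a real API."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a free-form completion (task-specific)."""
    return completion.strip().splitlines()[-1]

def best_of_n(task: str,
              n: int = 16,
              augmentation: str = "Let's think step by step.",
              verifier: Optional[Callable[[str, str], bool]] = None) -> Optional[str]:
    prompt = f"{task}\n{augmentation}"            # zero-shot CoT-style prompt augmentation
    answers = []
    for _ in range(n):                            # spend more inference-time compute
        ans = extract_answer(generate(prompt, temperature=1.0))
        if verifier is not None and verifier(task, ans):
            return ans                            # stop at the first verified answer
        answers.append(ans)
    if verifier is not None:
        return None                               # nothing passed the verifier
    return Counter(answers).most_common(1)[0][0]  # otherwise: majority vote, no guarantee
```

With a sound verifier, whatever comes back inherits the verifier's guarantee; with majority voting alone there is no such guarantee — which is exactly the distinction being drawn in this discussion.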
Then came the task-specific Chain of Thought — the one Jason Wei and co. did — where humans give task-specific advice on how to solve the problem and then hope the LLM will actually solve it. This can be connected with inference-time scaling, because you're adding Chain of Thought and also making it generate multiple candidates and then picking from them. Chain of Thought by itself has problems, just as verification has problems — in fact at this NeurIPS we had a paper called "Chain of Thoughtlessness" that we can talk about later — but as part of the toolbox of increasing the time spent during inference before blurting out one answer, Chain of Thought together with picking from many samples has shown some promise.

One variation — something I've been pushing more recently — is this: originally Chain of Thought got confused with something anthropomorphic. We tend to tell ourselves, "okay, let me do it this way," and people were hoping LLMs are doing the same thing. Mostly they were just imitating whatever "let's think step by step" data they found in the training corpus, but somehow people thought that if you make them imitate human thinking, maybe they will do better. Those are the first two ideas, and neither of them actually went that far.

Another idea is to realize that it's just magical tokens you're trying to add, and you have to figure out what the right magical token is — it's like a Skolem function. You're trying to figure out a task-specific, LLM-specific magical token that increases the probability. This is a learning problem — an extra learning problem — and two general approaches have been tried. The first is essentially to say: before giving the answer, the LLM has to tell itself a few things. Something like "step by step" is the one that makes sense to us, but it could just as well give itself some gobbledygook string, then another gobbledygook string, and that probes its conditional completion probabilities in such a way that it might actually come up with the correct solution. The question then is where these tokens come from. The first idea people had was that humans would supply these tokens via Chain of Thought advice — that wasn't going anywhere.

Actually, before going further, OpenAI did the following thing: maybe we will ask humans to solve specific problems while thinking aloud. There's a paper from about a year and a half back — the "Let's Verify Step by Step" work — and this went under the whole banner of process supervision, where people were actually being asked to record what they were telling themselves. This is like the worst form of psychology, unfortunately, because we don't actually know how we think, but they tried it, and one of the problems is that it's extremely costly. My joke is that they improved the GDP of Nigeria, because Nigerian Turkers were being asked to solve tons and tons of these problems and then think aloud. That was very costly.

A separate, similar idea was that there are bunches of problems for which there are systematic solvers.
For arithmetic, for example, there are arithmetic solvers; for search problems there are A*-style searches; and for planning you have planners. In general, any systematic solver manipulates some data structures until a certain termination condition is reached, and then it outputs the solution. Imagine you make it output the trace of those data-structure manipulation operations — all you needed, hopefully, were some extra tokens coming out before the solution. That trace can be thought of as a derivation, and the idea people had was: let's train the LLM on a huge number of these synthetic derivational traces paired with the solutions. Remember, this only works for problems for which there actually are systematic solvers, and you're just trying to make the LLM able to solve them in a general sense without having to call those solvers. That was the idea, and there have been three or four efforts in that direction: there's Searchformer from Meta, there's Stream of Search, and just last week there was a Google DeepMind paper which also talks about internal versus external planning for multi-board-game solving. All of these use variations of this idea. You have to realize that all they're doing is this: before outputting the solution, the LLM has to output some additional tokens that will jog its memory to hopefully output a better solution. This is the hope; people tried it, and sometimes it actually works — it improves performance. There's no good reason to say it systematically makes sense. It's almost like trying to teach very small kids how to do reasoning: if you do some hand movements, then think, and then give the answer, you'll see the junior also doing those hand movements — and giving the wrong answer. LLMs can do that; they're essentially imitating your derivational pieces, which may not even make sense, but sometimes they have shown some promise. This has basically become the most recent idea called inference-time scaling, where you do this and you also generate multiple suggestions and pick from them, etc.

This comes very close to what I think o1 is doing, but with a big difference. Again, with o1 nobody knows — it's become like we all sit around the ring and suppose, and, as I like to say, Noam Brown sits in the middle and knows — but they don't want to tell what they're doing. Everybody has a guess; my best guess as to what o1 might be doing is to go again with this prompt-augmentation idea. The question, of course, is where these prompt augmentations come from. We talked about, first, one prompt augmentation for everything; second, human-given prompt augmentation, which is Chain of Thought; and third, synthetic derivational traces that supply these tokens. None of them really makes too much sense on its own.
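A minimal sketch of the derivational-trace idea described above (Searchformer / Stream of Search style): run a systematic solver, serialize the data-structure manipulations it performs, and pair that trace with the solution as synthetic training text. The toy graph problem and the trace vocabulary here are invented for illustration:

```python
from collections import deque

def bfs_with_trace(graph: dict, start: str, goal: str):
    """Breadth-first search that records every queue operation it performs."""
    trace, parent = [], {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        trace.append(f"expand {node}")
        if node == goal:
            break
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                trace.append(f"push {nbr}")
                queue.append(nbr)
    # reconstruct the solution path from the parent pointers
    path, n = [], goal
    while n is not None:
        path.append(n)
        n = parent[n]
    return trace, list(reversed(path))

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
trace, plan = bfs_with_trace(graph, "A", "D")
training_example = (
    "PROBLEM: reach D from A\n"
    "TRACE: " + " ; ".join(trace) + "\n"
    "SOLUTION: " + " -> ".join(plan)
)
print(training_example)  # strings like this, over many instances, are the synthetic training data
```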
A much better idea is this: if you are asking "what should I be telling myself to improve my outcome?", that's a kind of reinforcement-learning problem. Imagine an AlphaGo agent sitting there thinking: what action should I take, one after another, such that my win probability increases? It does a whole bunch of these board actions, and at some point it gets a signal saying you won or you lost the game. You do this gazillions of times, and you can then propagate the result back through the sequential decisions, computing their Q-values — under which board positions which actions are worth doing; that's a Q-value. Now take the AlphaGo analogy and map it to LLMs: the "board position" is essentially the context window, with the prompt and all the other stuff you have put in, and the "action" is the token you are generating. To make things simple, I like to think of it this way: there's a big LLM — think of GPT-4 — and there might be a small LLM with a reduced vocabulary, and all it does is throw out these prompt augmentations. They get given to the other LLM as part of its context, it gives extensions, then the small one tries one more, and at some point you check whether the solution is correct. How do you get the solution to check against? You could have generated huge numbers of synthetic examples beforehand, again using solvers — it's pretty much known that OpenAI did this. It's no longer humans solving problems, because that's too costly; it's systematic solvers solving planning problems, constraint satisfaction problems, all sorts of problems for which you then have both the problem and the answer. Then the LLM — the LLM plus this prompt-augmentation engine — tries to solve it, and if it happens to reach the correct solution, you can propagate the signal back. This is RL in pseudo-moves: even if the prompt is about Go, the actions are not Go actions; they are just these prompt-augmentation tokens. One nice thing is that instead of learning the Q-values as a table, you can change the weights of the smaller LLM in the right ways so that it puts out the right kinds of tokens given the context window — you get approximate Q-values. This is the pre-training story: the LLM training, followed by this humongously costly post-training phase — they're not telling us how costly, but they spent billions of dollars — and at that point you have the o1 model, which is now ready for inference time. At inference time they're once again doing inference-time scaling, except now they have the Q-values and can improve them with online MCTS-style approaches, the kind of thing AlphaGo does. That's where you can actually see they're doing it, because they charge you for the reasoning tokens. If you run o1, it takes the prompt and gives the answer; in the old GPT-4, the amount of money you pay is proportional to the number of input prompt tokens plus roughly four times the number of output tokens. In the case of o1, it also does this whole bunch of stuff where it's telling itself these pseudo-moves whose Q-values improved. It never shows that to you, but they are all counted as output tokens. So you might have, let's say, 50 input tokens, 100 output tokens, and maybe 5,000 reasoning tokens, and you suddenly start paying a lot more.
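A back-of-the-envelope sketch of that pricing point, using the illustrative token counts above. The per-token prices here are placeholders, not OpenAI's actual rates; the structural point is that reasoning tokens are billed like output tokens even though you never see them:

```python
# Assumed relative prices per token (output priced at ~4x input, as discussed above).
PRICE_IN, PRICE_OUT = 1.0, 4.0

def bill(input_toks: int, output_toks: int, reasoning_toks: int = 0) -> float:
    """Total cost in arbitrary units; reasoning tokens are charged as output tokens."""
    return PRICE_IN * input_toks + PRICE_OUT * (output_toks + reasoning_toks)

print(bill(50, 100))         # classic GPT-4-style call: 50 + 4*100   = 450 units
print(bill(50, 100, 5000))   # o1-style call:           50 + 4*5100 = 20450 units, ~45x more
```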
One of the funny things that happened is that when we started playing with o1-preview when it came out, in two days we spent like $8,000, and I had to get special permission from the university, because they normally don't reimburse beyond a certain amount unless you have separate approval. That's basically how this works out now. The interesting thing is that, the way we've described it, it is based on an LLM, but a significant amount of additional machinery has been added: you're essentially doing something like an AlphaGo-style post-training phase followed by an AlphaGo-style online MCTS computation at inference. At that point I would think it could make sense, and not surprisingly, in our results we found that on the normal PlanBench problems it does much better than the state-of-the-art LLMs, including Claude and so on. Then of course you can go to the next level: it has its own issues — it doesn't scale to the larger problems, it makes mistakes, it has problems with unsolvability, there are no guarantees about the solution — but it now makes more sense to me. Again, I don't know; I think this is a reasonable way o1 could be working, and if it is the way it's working, then for the first time I can make sense of how reasoning can emerge, because you at least have pseudo-actions whose Q-values you are learning, and nobody ever said RL cannot do reasoning. RL can do reasoning. It's like the stone soup analogy I keep using: you can make soup with stones if you start adding carrots and tomatoes and all that stuff, and at the end it will still taste like soup. The question, of course, is who gets the credit, and that's an interesting question to think about. But that is the long arc of what happened, in my view, in the four months or so since we last talked.

One of the other interesting things is that part of the appeal of LLMs was that you write the prompt, you hit return, you get the answer, and it doesn't cost you too much — that's how everybody was using them. With o1, the post-training itself is extremely costly, but they're not charging us for that; they're charging us for the reasoning tokens, which you never see but you pay for. You just have to take their word that a huge number of reasoning tokens were generated, and they're going to make you pay for them. As far as I can tell, at least in academia, very few people have actually been running experiments evaluating o1, because it costs a lot, so people are still going with the autoregressive LLMs because they're cheap. So one of the interesting things is: you can do reasoning, but the usual computational-complexity issues that we blithely forgot in the era of autoregressive LLMs — hoping complexity would somehow disappear — will come back, because if you want to improve accuracy you have to actually do reasoning, and this is pseudo-move reasoning in my view, but it still costs. That raises an interesting question: when is it useful to use a general-purpose system, versus a hybrid general/special-purpose system, versus an extremely specialized solver? That's something we didn't have to talk about before, but it will now become costlier, at least for industry.
In fact, there's this whole movement around compound AI systems, and that's basically the kind of thing people are thinking about.

Very shortly after o1 was released — quickly, as you were just saying, you spent $8,000 — you put a paper together called "Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1." You basically said they are positioned as approximate reasoners rather than mere approximate retrievers, though we don't know the actual details of what they're doing.

There are two parts. One is what is objectively verifiable, which is that we did test o1 on the same PlanBench problems, and it did quite well on Blocks World — I think Claude was already around 66%, and these were at 99% or something; they basically saturated it. More impressively, they did better on the Mystery domain. Given what I explained earlier about the possibility that they're training themselves on synthetic data, maybe they unintentionally trained on the Mystery domain, which is available publicly, so we actually generated truly new, random mystery domains. Performance is lower there, but it's still not like the roughly 5% of the old models — it goes up to, I don't remember the exact numbers, 20 or 23% on some of these problems, which is obviously a good sign that it's actually able to solve some of them.

The other part — why I'd call them approximate reasoners rather than retrievers — is based a lot more on my construction of what they could potentially be doing, which is reinforcement-learning-based post-training as well as online Q-value updates, using pseudo-actions. I call them pseudo-actions because you could do RL for normal Go or any specific board game; this one is a language game, where the game is: there's a language context window, there's a prompt augmentation, there's a new completion, then one more prompt augmentation — this is what they call these long strings of chain of thought, but it's basically adding bunches of prompt augmentations and seeing what happens at the end. If it winds up being correct — in the sense that it winds up containing the correct solution for your training data — then that's like AlphaGo getting the victory signal after a bunch of moves, and then it just needs to do credit/blame assignment over the moves, which is what RL is good at doing. If you're doing that, it's worth calling it reasoning — and it's approximate reasoning, because these are not problem-specific actions; they are problem-independent language prompt actions.

Is it possible that you might be wrong about that? Is it possible that we're giving them too much credit, and what they're actually doing is just this massive generation of trajectories, all in a single forward pass — maybe they do something like process supervision, some clever RL pre-training stuff?

Obviously, again, this is the sad part of how the o1 thing is. By the way, a funny story: I was talking to somebody who said they were having conversations with the OpenAI people, trying to sound them out on what o1 might be doing, and at some point one of them said, "I think you may have to wait until the Chinese replicate what we did to actually figure out what we did." That's the level to which the science of o1 has gone.
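A cartoon of the pseudo-action framing speculated about above — a guess at the style of method, not o1's actual algorithm. The "board position" is the context window, the "moves" are prompt-augmentation snippets, the win signal comes from a verifier on synthetic problems with known answers, and `base_llm` and `verifier` are stand-ins you would supply:

```python
import random
from collections import defaultdict

# Toy pseudo-action vocabulary; a real system would learn these, not hard-code them.
AUGMENTATIONS = ["Let's think step by step.",
                 "First restate the goal.",
                 "Check each constraint before answering."]

def episode(problem, answer, base_llm, verifier, q, eps=0.2, horizon=3):
    """One self-play episode: pick augmentations, extend the context, score the result."""
    context, taken = problem, []
    for _ in range(horizon):
        # epsilon-greedy choice of pseudo-action given the current context window
        if random.random() < eps:
            a = random.choice(AUGMENTATIONS)
        else:
            a = max(AUGMENTATIONS, key=lambda x: q[(context, x)])
        taken.append((context, a))
        context = context + "\n" + a + "\n" + base_llm(context + "\n" + a)
    reward = 1.0 if verifier(problem, answer, context) else 0.0
    for s, a in taken:  # Monte Carlo credit/blame assignment back through the pseudo-moves
        q[(s, a)] += 0.1 * (reward - q[(s, a)])
    return reward

def train(problems, base_llm, verifier, episodes=10_000):
    """Problems and answers come from systematic solvers, so the win signal is cheap and reliable."""
    q = defaultdict(float)
    for _ in range(episodes):
        p, ans = random.choice(problems)
        episode(p, ans, base_llm, verifier, q)
    return q
```

In the speculation above, the tabular Q-values would instead be distilled into the weights of a smaller "augmentation" model, and an MCTS-style search over these pseudo-moves would reuse them at inference time.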
The point is, it is possible that I'm giving too much credit to the sophistication of the method they're using. The reason I still think my story is likely is that, as I said in the earlier description of how things shifted from LLMs to inference-time scaling to this o1-style method, the general inference-time-scaling methods are not comparable — inference-time scaling alone hasn't been as good. The other very important thing to keep in mind is that while o1 takes more time, it doesn't take hours. A second of online computation is way more expensive, from a business perspective, than days and months in the pre-training phase, and some of the inference-time-scaling approaches actually spend a lot more time than o1 does and still, as far as I know, don't generally get to that level of accuracy. That makes me think that unless you do a significant amount of post-training to get approximate Q-values up front, you can't improve just with MCTS. Think in terms of the AlphaGo analogy again: if you only did MCTS, it would take much, much longer per move before you could get any level of accuracy, any level of confidence. One of the things AlphaGo does is a humongous amount of pre-training in which it learns an approximate policy, which it then rolls out to improve the Q-value estimates it has. That's possibly the reason I think it makes sense — and I also think the plain inference-time-scaling methods don't make too much sense on their own. The closest-to-pure-MCTS method I have seen is a paper from Alibaba called Marco-o1 — they have this Marco Polo group or something. Marco-o1 essentially trains itself on Chain of Thought data, which is basically derivational data, and then on top of that does an online MCTS-style computation to improve the Q-values further. They're much smaller, and they're not as impressive in terms of performance gains as o1. So those are the reasons I think the full picture requires post-training as well as inference-time computation. The part you and I see is the inference time, but the part OpenAI can spend tons and tons of money on is the post-training, before they even deploy the model, and that's where it gets these approximate Q-values — that's my guess.

Again, it's a strange thing to be involved in. We should be looking for the secrets of nature, because nature won't tell us; instead we are now looking for the secrets of OpenAI, because OpenAI won't tell us. There are many efforts already trying to replicate this sort of thing, so hopefully we'll know more, but as of now that's where things stand. I cannot be sure exactly what they're doing; all I can say is that everything they have said publicly is consistent with my hypothesis — there's nothing inconsistent with my model, my speculation, of how o1 works. In the strawberry paper there's an appendix where we wrote down that speculation, and it is still consistent with everything they have said. That's the only thing I can say.

Yeah, I like the sound of it. It makes me more excited about using it, because it makes me feel that there's more sophistication behind the system.
But a lot of this comes down to reasoning, and I'd love to hear your definition of reasoning — there are people who say, well, if it's not retrieval then it is reasoning. So what say you?

Let's look at the first part and then the second part. The definition of reasoning itself is a good place to start. I know that the whole AGI crowd tries to say AI is going to be like humans; the problem is we don't have a good definition of what human reasoning is. Since the Greeks, our civilization went forward not by defining what humans do, but by defining what sound reasoning patterns are — Aristotle's syllogisms, logic, probabilistic logic. The entirety of computer science, the entire civilization, depended on having formal notions of reasoning, where there is correctness, there is incorrectness, and so on. It reminds me of that old Monty Python bit — Life of Brian, I think — where a guy does something that looks like reasoning to prove that someone is a witch: if she is made of wood and she floats on water, then she must be a duck, or something — random connections — and then he concludes that she's a witch and says QED. That looks like reasoning, because it's not just retrieving some "she's a witch" answer, but we know it's not sound reasoning.

So in general I prefer to think in formal terms, because ultimately these systems are going to be deployed whether you like it or not, and civilization didn't just shrug because fallible humans can make mistakes and look the other way; we actually have to have guarantees, at some level or other, about the soundness and the completeness of reasoning. So I go back to definitions of reasoning from logic — formal definitions. I try to avoid getting into the question of what human reasoning is, because that's a big mess: cognitive scientists don't know, we don't know, psychologists don't know, so I give it a wide berth. That's the part about what I believe reasoning is, and that's why we looked at planning problems, for which there is a correct solution, and constraint satisfaction problems, for which there is a correct solution. If you say a system is a reasoning system that can be deployed, it should have some guarantees. Now, you can say that humans make mistakes too, but one of the things I keep saying is: if you are being paid to make decisions and you make mistakes, there are penalties for you — in the end you can be put in jail. Until we figure out who to put in jail, and how, when AI systems make mistakes they have no actual guarantees over, we are better off thinking in terms of formal definitions of reasoning and then seeing to what extent AI systems come close to them. That's been very connected to how AI has developed up until now anyway.

Now, this discussion also brings us back to the issue of retrieval versus reasoning. I think you're referring to a couple of these papers that keep coming out, basically saying: look, LLMs are not exactly retrieving anything they've been shown, they're not just memorizing and retrieving, so they must be doing something else. And I would say: well, the Monty Python character isn't actually retrieving
anything either; he puts together a whole bunch of things, but that's not reasoning. Between retrieval and what I would consider reasoning there can be an entire universe of things that still won't count as reasoning as far as I'm concerned, because there are no guarantees of any sort. And we knew this from the beginning. Many of these claims go back to the autoregressive LLMs, because researchers are still very busy with those — I think we're one of the few groups with papers on o1; our o1 evaluation is also being presented at this NeurIPS workshop — but most people are still trying to make sense of autoregressive LLMs themselves, because those are still there. As we talked about last time, I still think they're a very impressive System 1 — we never had a System 1 like this before in human civilization — and trying to understand what they're doing is useful. So people go back to that and say: look, they're not actually doing exact retrieval, they're doing something else, and we'll call this something else reasoning. That doesn't follow. First of all, everybody always knew that LLMs are not databases; they don't retrieve, exactly — they actually have a hard time memorizing and retrieving, and when they do memorize, it doesn't happen deliberately, it happens fortuitously. It's surprising that they sometimes wind up memorizing long passages, because everybody agrees they are some kind of n-gram-style models rather than databases, in the way they're trained. Given that, it's very clear they will never purely retrieve, and the fact that they are not retrieving should not be seen as an indication of anything more than that they are not retrieving — which we knew already. But the part people seem to hint at is: since they're not doing retrieval, maybe they're doing reasoning. No, that doesn't make sense, because again you have to subject them to what you would consider the evaluations for sound reasoning procedures, and they fail those just as easily as before.

Come back to the Chain of Thought paper I mentioned that we just presented at NeurIPS. In the Wei et al.-style Chain of Thought papers, what you do is take something like last-letter concatenation, which is a really small toy problem: you give n words, and the system is supposed to take the last letter of each word and concatenate them into a string. So "large", "big", "rose" — "e", "g", "e" is the string you're supposed to output. What they said was: if you just tell the LLM in the prompt that it's supposed to take the last letters, concatenate them, and give the answer, its performance is not as good; but if you give it a couple of worked examples — here are some three-word last-letter-concatenation problems, and four-word ones — and then ask your question, performance improves. That looks like reasoning: somehow it is able to follow the procedure. The problem — and I think we talked about this last time too — is that in an empirical science you shouldn't stop when you get the answers you were hoping for; you should see how to break your own hypothesis.
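A minimal sketch of the kind of length-generalization probe being described: build last-letter-concatenation problems with more and more words, and check whether accuracy holds up beyond the lengths shown in the examples. The word list is arbitrary and `ask_llm` is a stand-in for a real model call:

```python
import random

WORDS = ["large", "big", "rose", "table", "green", "music", "stone", "cloud"]

def make_problem(n_words: int):
    words = random.choices(WORDS, k=n_words)         # words may repeat for long problems
    target = "".join(w[-1] for w in words)           # ground truth is trivial to compute
    prompt = ("Take the last letter of each word and concatenate the letters "
              "into one string: " + ", ".join(words))
    return prompt, target

def ask_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API."""
    raise NotImplementedError

def accuracy_at_length(n_words: int, trials: int = 50) -> float:
    correct = 0
    for _ in range(trials):
        prompt, target = make_problem(n_words)
        correct += ask_llm(prompt).strip().lower() == target
    return correct / trials

# for n in (3, 4, 10, 20, 30):
#     print(n, accuracy_at_length(n))   # does accuracy survive beyond the lengths in the examples?
```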
Right — so what they didn't ask is this: they gave three- and four-word examples, and then they tested on three- and four-word problems. But if you expect the system to be doing any kind of reasoning, any kind of procedure following, then once I tell you what last-letter concatenation is and give you an example, you will do it for 20 words, for 30 words, etc. — it's just mechanically taking the last letter and concatenating. What we show is that if you increase the number of words, the performance plummets to close to zero, and this also happens in planning problems — not surprisingly: it happens in last-letter concatenation, it happens in planning problems. Which shows that, yes, it's doing something that seems to improve its performance at the problem sizes for which you gave the examples, and its pattern matching of some kind is helping there, but it's not in any way generalized reasoning that would generalize with respect to length, for example.

One way I've been thinking about this is: it's "the glass is nowhere near full" versus "the glass is already wet" — optimism versus pessimism. People tend to think that since it's at least solving the three- and four-word problems with higher accuracy when given this Chain of Thought, that shows reasoning abilities. But we don't have a good understanding of where the boundary is within which it will actually get things right — so much so that we now have phrases like "fractal intelligence." In fact I think Andrej Karpathy was basically saying LLMs have fractal intelligence, where fractal intelligence basically means: we don't know when they work; when they work, they work, and when they don't, they don't. That sort of shows — which is fine, we had nothing like this before — but part of the science of LLMs has to be to say something more than "fractal intelligence": to say, here is the level to which you can depend on their results. In reasoning, in logic, there are ways of formally characterizing the limits of reasoning — limited-depth reasoning, limited-lookahead reasoning, and so on. None of them seem to work for LLMs, so the question is what would work, and we have to figure that out. Instead, once in a while these papers come out saying: look, we probed the models using mechanistic-interpretability techniques and found that LLMs are not acting like they're doing retrieval. But that's already understood. I think the mechanistic interpretability work is very interesting — it may actually be part of the solution to figuring out what LLMs are doing — but the argument that since it's not retrieval it must be something like reasoning is still quite unsatisfactory to me, because that "reasoning" is not reasoning. Whatever it is they were doing before you did your mechanistic-interpretability study, they're still doing; they have the same limitations before and after your study, and we still don't know how to characterize what it is they are doing. That's where we're stuck right now.

Is it possible that everyone is right? What I mean by that is: I spoke with some DeepMind guys earlier in the week — there's a great paper about, you know, softmax needing glasses,
talking about how sometimes we need directed attention for doing reasoning and sometimes we don't. There was another great paper — I spoke to the authors — about the sheer limitations of Transformers at counting and copying. And Laura Ruis — I'm speaking with her on Sunday — has a paper out where she looked at reasoning traces: sometimes the models are retrieving facts from documents, and sometimes they're doing a kind of procedural information generation, which you might liken to a reasoning process. I guess it's a little bit like this fractal intelligence thing: it might be that in certain circumstances these models are doing something we would think of as reasoning, sometimes they're doing retrieval, and sometimes they're doing something else.

Yeah — actually, I think Laura Ruis's paper is one of the ones I had in mind when I was describing this issue of mechanistic interpretability earlier. I think it's a good paper, in that they've developed an interesting set of techniques to see what is actually going on in the way LLMs output their tokens. But two things are unsatisfactory to me. First, everybody knew that LLMs are not doing retrieval alone — that was well known before — so nobody believes LLMs are just doing retrieval. The question is: what else are they doing, and is there any clean characterization of what that is? That I did not see. I actually looked at that paper — I think they've done good work — but I'm still hoping there will be an interesting characterization. Lots and lots of groups are trying to look for a characterization of what this fractal intelligence might be right now, but we haven't gotten further than that.

In terms of "everybody might be right": there could be a whole blind-men-and-the-elephant phenomenon in play to some extent, and that's possible, because we are trying to piece together a large number of parts of this puzzle — the reasoning part, what they're even trying to do, what sorts of techniques seem to improve their accuracy, and so on. But I think that's part of science, and my sense is that eternal discontent is part of science. I'm actually much more worried about being too optimistic — thinking we've figured it out — than I am about being somewhat too discontented that we haven't figured it out yet, so I want to err on that side. Not that I think we know no more than we did when GPT-3 came out — I think all the camps have learned something: the people who thought GPT-3 was AGI know that's not the case, and the camp that thought GPT-3 was just a stochastic parrot has to know by now that it's more than that. So that is collective improvement in our understanding, but there is still a large number of pieces we haven't figured out yet.

Yeah — on Laura's paper, she was using influence functions, and I'm not sure whether that would be classed as classical interpretability or mech interp; I think mech interp is largely about finding circuits in neural networks, but even that's an interesting discussion.

To me it's more the general idea of figuring out a way of probing the inside of what
LLMs are doing — I think of that as mechanistic interpretability. There are very specific techniques that have shown great promise, such as the autoencoder stuff, but all of these, to me, are essentially trying to interpret what the models are doing at the circuit level and make sense of their external behavior from that. There are two ways of making sense of what LLMs are doing. One is pure external evaluation — that has happened already, and we know they're not doing any kind of guaranteed reasoning; there are enough results showing they seem to do promising things in some cases, and also results showing they are very brittle: change a prompt a little, change the problem specification a little, and they die. Again, we're talking about autoregressive LLMs, not the o1 sorts of things — that's a whole other class for which we haven't yet started doing the same sort of analysis. The other is to probe the internal circuits, and if you start doing that internal probing, I think of that, generally, in my view, as the mechanistic-interpretability style.

Okay — but isn't it interesting, though, that she found that code and math-based procedural documents appear disproportionately influential for tasks requiring reasoning; that larger models show an even stronger reliance on general procedural data for reasoning; and that the presence of code data in the pre-training mix seems to offer abstract reasoning patterns the model can generalize from? These are interesting observations.

So again, I don't want to make this a very specific critique of a particular paper, because that's not fair to them or to me, but I do want to say that there is a distinction between factual tasks and reasoning tasks. LLMs have been used for both, and they have troubles with both. For factuality, I would think the only sorts of things that will improve them are RAG-style techniques, where you just give the factual data and ask it to summarize. For the reasoning stuff — arithmetic and so on — these are, to some extent, the kinds of tasks where the exact result may not exist in the data, and I would be equally troubled by what people have shown — way before all of Laura's work — with something like LLM multiplication: they tend to be correct in multiplications involving popular digits and less correct for non-popular digits. It's sort of mind-blowing that there are popular versus non-popular digits at all, but it makes an interesting point: the LLM's final performance is a complex combination of the data it has been trained on and some additional pattern-matching abilities it uses on top — and that's not sound reasoning. We still don't quite know where it breaks, but the fact that it's correct for popular digits and not for others is a particularly interesting thing to me.
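A sketch of the kind of probe behind that popular-versus-unpopular observation: bucket multiplication problems by how common the operands are and compare accuracy. The bucket definitions and the `ask_llm` call are assumptions for illustration, not the original study's setup:

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API."""
    raise NotImplementedError

def accuracy(pairs, trials: int = 100) -> float:
    correct = 0
    for _ in range(trials):
        a, b = random.choice(pairs)
        reply = ask_llm(f"What is {a} * {b}? Answer with just the number.")
        correct += reply.strip() == str(a * b)    # exact arithmetic is easy to check
    return correct / trials

# "Popular" operands: the times-table range that is everywhere in web text.
popular = [(a, b) for a in range(2, 13) for b in range(2, 13)]
# "Unpopular" operands: arbitrary three-digit numbers.
unpopular = [(random.randint(97, 999), random.randint(97, 999)) for _ in range(200)]

# print("popular:", accuracy(popular), "unpopular:", accuracy(unpopular))
```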
That shows up, by the way — while we're on the subject — in some work on o1 as well. We looked at o1 more on the planning side, but some people — Tom McCoy and co., I think, the ones who did the Caesar-cipher sort of thing, the Embers of Autoregression work — did more of this, and they found that o1 does better on some of those tasks, but they also still found data dependencies, in the sense that its accuracy was higher in the regions where there was more pre-training data. That again, I think, is consistent with my view of what o1 might be doing: there is an LLM which was pre-trained on some corpus, and there is a smaller LLM which generates these pseudo-action tokens that make it output things. One of the interesting things — I'm told, and again we don't know for sure — is that when the original o1 models came out, there were o1-mini and o1-preview, and the difference, I'm told, was that o1-mini was using the smaller LLM as the base model and o1-preview was using the larger LLM as the base. They didn't say this publicly, but I would assume that if you have a pseudo-action-generator model working on a bigger base LLM with higher capacity, it can generate more interesting completions than a smaller one, and that makes a difference in the level to which the RL-based training can get your accuracy up.

Yeah, I've noticed some interesting things. I've now paid for o1 Pro. I was very skeptical of o1 — as you say, the base model is, in a sense, an even weaker version of GPT-4, and GPT-4 — I hate that model, I hate the style of it, I think it's dumb. I must admit it's mostly because I'm sort of anthropomorphizing it: because I hate the style, I think it's dumb. Humans are very brittle; even with RLHF, we like assertiveness, we like complexity, there are certain styles we like, and we don't actually see past them to the content. But that's to one side — I don't like the model. And o1-preview and o1-mini don't really want to think: most of the time they won't think, and you get an even dumber answer than you would with GPT-4. However, with o1 Pro the vibes are different. It thinks more, and it gives you something that is qualitatively on a completely different level; it doesn't look like dumb ChatGPT anymore, it feels very different. But there are still issues with it. For situations where you're dealing with ambiguity — doing programming or something like that — I actually like having a dumber model, because it's a dialectical exchange: I'm saying, no, you misunderstood, let's do this, let's do that, we're working on this thing together. What o1 does is say, well, on the one hand you can do this, and on the other hand you can do that — it gives you a range of options — and I'm like, wouldn't it be better to either go and dance with the model, or just better specify what you wanted in the first place?

So again, two things. First of all, o1 Pro just came out, I think last week, and it was exam week for me, so we haven't spent time — or money — on it yet. I've played with it from the outside, but we haven't done any API-level studies of the kind we did with o1-preview. But I've looked at the Twitter exchanges where the usual suspects try various things on these models, and two things jumped out at me. One of the things we saw in o1-preview is exactly the kind of thing you
are saying — and it looks like o1 Pro is still doing it — which is that they are good at digging in to explain why the answer they gave is the correct answer. One of the funny examples: I use a particular three-block stacking problem which is unsolvable, and in fact it showed up in the New York Times as an example of why GPT-4o fails. When o1-preview came out, Noam Brown, in his long tweet, said: Rao said in his ACL talk that this problem can't be solved, and o1-preview actually does solve this instance — which is good. Now, people have since reported that o1 sometimes gets it wrong — multiple people have posted screenshots — it gets the wrong answer, but it argues with you as to why the answer it's giving is still possibly correct. This particular problem cannot be solved without moving block C, and it turns out it gives an answer where C does move — because of gravity it will fall down — and then it tries to argue with you that there are games where, unless you are intentionally moving C, a natural process making it fall is not considered moving it. That's a very interesting thing we saw in o1-preview too, when we gave it unsolvable instances — which, by the way, normal LLMs just die on, because they've been RLHF'd to death and so they think that if you give them a problem there must be an answer, so they'll give you something. That's why I had shown this unsolvable instance to GPT-4o before. o1-preview actually solves more of the unsolvable ones correctly — that's a credit to it, and that's why it's more of an approximate-reasoning model, an LRM in my view rather than an LLM. But on the other hand, when it does give a solution for an unsolvable instance, it will argue with you that it is actually right, and so I made this joke in the strawberry paper that we have gone from hallucination to gaslighting. It tries to argue — just like what you're saying — that on the one hand what you want to do might be worthwhile, but on the other hand here is the reason why what I'm doing is just as good.

In fact, I think this guy Colin — one of those people on Twitter who keeps playing with these models — gave it the surgeon problem, the classic one with the boy getting into an accident, except he changes the puzzle: the mother and the boy are driving, the mother dies, and the doctor says "I can't operate on this boy." o1 Pro apparently does the whole routine — "this is a classical puzzle that brings gender stereotypes into account," etc. — and then gives the answer that the right way to think about it is that the doctor is the boy's second mother, and it will try to argue that position, even though the puzzle has been changed. So, interestingly, overthinking and digging in is a thing. One of the interesting questions, which we haven't studied, is to what extent its explanation and its reasoning are connected.
In humans — and I'm not trying to anthropomorphize what it's doing — it's just that if there are two different phases, where in phase one it comes up with a solution, and in phase two it needs to explain, and it doesn't have to look at what it actually did to get to the solution, then the explanation is just: dig in my heels and argue that the solution is correct. People tend to do that sometimes: we come to some solution and then come up with an explanation as to why what we did might be right. LLMs had this problem to begin with, because for them these are completely separate processes — I'm always worried about LLM explanations — and LRMs seem to be even more sophisticated at it. But this is mostly anecdotal; I haven't done systematic studies on it. I don't have visceral opinions about any of these models, because, to be honest, I don't use them in my day-to-day life. Most of the time, I write English well enough that I haven't yet seen an LLM that does a better job at the things I do, and I haven't yet found tasks where I would need an LLM's or an LRM's help — maybe I will at some point. So I don't have anecdotal experiences of the kind you have; I'm mostly focused on specific systematic studies, with multiple instances of planning problems. We extended PlanBench to look at unsolvability, longer-length problems, scheduling problems, etc., to evaluate these models — those are the things for which I have a better sense of what o1 can and cannot do.

Yes, I must admit I've updated a little bit. I was always in the same camp as you, when we thought of them as approximate retrievers, and I'm now starting to see something more.

Yes — but again, my point is that there are two different ways of thinking about it. One is that it's not the LLMs that became that. How you define LLMs ought to be a discussion we have — that's why I keep talking about the stone soup metaphor, not because I want to play down the importance of o1; it's a great thing, but you do have to decide who you want to give credit to. My reservations about the reasoning abilities of LLMs were about the autoregressive, teacher-forcing-trained models, and that was true from GPT-3.5 all the way to GPT-4o. OpenAI knows this — knows it well enough that they no longer call it GPT; it's called o1. It's a completely different model, and they know it. All you can say is that it was built by some of the same people who also developed LLMs, but we can't define LLMs to be whatever it is that OpenAI is producing — we have to have theoretical definitions. My sense is that autoregressive LLMs still have all the problems, but also all the advantages: they're very fast, they're amazing fast System 1s. And o1 is a reasoning model because it actually adds the reasoning post-training as well as reasoning at inference — which nobody said would not be doable. It's still great that they were able to do it in a fairly general way, but I don't think there was ever an argument that AI systems couldn't do reasoning — after all, AlphaGo is basically a reasoning system; it was just a deep and narrow reasoning system. The question was whether you could have something more general and broader, yet not as shallow as LLMs — that is what LRMs are, and it's a good step in the right direction. But it doesn't change what I thought
about LLMs — the autoregressive models. They are different, and in fact they have advantages that o1 lacks: for example, the cost of LLMs can be much, much lower. One of the things we learned in the strawberry paper — the Planning in Strawberry Fields paper — is that in some cases, if you remember that computer science is eventually about efficiency and cost too, then giving a particular problem instance to o1 and paying so many dollars, versus giving the same instance to an LLM plus a verifier in an inference-time-scaling approach — what I call LLM-Modulo, the general approach we've been pushing, where an autoregressive LLM generates many candidates and an external verifier (or even an LLM-based or learned verifier) checks them — the latter can actually be cheaper than o1 producing one candidate, at the same accuracy. That becomes interesting, because part of what's interesting about human civilization is that on one hand we are general-purpose reasoners, but on the other hand we also know that every job requires a tool, and we use tools: doing by hand everything that a specialized tool does can be extremely inefficient in terms of the time spent. That is going to be the case for these reasoning models too, to some extent, because o1 costs quite a bit right now — how much, and when that will change, is anybody's guess.

That brings up — in fact, there was Sepp Hochreiter, the LSTM guy — Friday — oh great, that's great, you should ask him about this — so yesterday I was at his talk, and one of his slides basically said the bitter lesson is over and efficiency is going to matter. I completely agree with that; I've been arguing it for a long time too. Think about the following: the first time we sent humans to the Moon, cost was not a consideration — we wanted to show we could do it, and NASA was the one doing it. The second and third times, for space as well as the Moon, that may still be okay, but by now it's Elon Musk sending people to space, and supposedly possibly to Mars too, because the cost matters. Once something has been done, you start caring about the cost you're paying, and computer science is actually quite a bit about the unsexy parts of cost, just as it is about doing things that haven't been done before. We are now in the second phase, where we're actually going to care about how much we're spending — the pre-training cost, the inference cost — and whether there are better approaches we could be using. This has been the case in computer science before; it just became less of an issue for a while because LLMs were just System 1s with essentially no inference-time cost — even though the pre-training was very costly, inference was very cheap, so we didn't have to worry about it. Now we will. One of the funny things — the elephant in the room for our PlanBench results on o1-preview — is that the normal classical planners that are meant to solve these problems solve them at a tiny fraction of the cost: they run on our laptops and solve all the problems with 100% guarantees.
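A minimal sketch of the LLM-Modulo comparison described above: a cheap autoregressive LLM proposes candidates, an external sound verifier (for PlanBench this would be a plan validator such as VAL) accepts or rejects them, and you track the cost per solved instance. The model call, verifier, and per-call cost are all stand-ins:

```python
from typing import Callable

def llm_propose(problem: str) -> str:
    """Hypothetical cheap autoregressive LLM call; replace with a real API."""
    raise NotImplementedError

def llm_modulo(problem: str,
               verify: Callable[[str, str], bool],
               max_tries: int = 32,
               cost_per_call: float = 0.002):
    """Generate-and-test loop: soundness comes from the verifier, not the LLM."""
    cost = 0.0
    for _ in range(max_tries):
        candidate = llm_propose(problem)
        cost += cost_per_call
        if verify(problem, candidate):
            return candidate, cost          # verified answer, with the money it took
    return None, cost                       # verified failure, not a confident wrong answer
```

The comparison being described is then: accuracy and total cost of this loop over a benchmark, versus a single (much pricier) o1 call per instance, versus a specialized solver that costs almost nothing.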
So one of the funny things, the elephant in the room for our PlanBench results on o1-preview, is that the classical planners that are actually meant to solve these problems solve them at a tiny fraction of the cost; they run on our laptops and solve all the problems with 100% guarantees. I realize they are completely specialized for that one problem, and on the other hand you have this very general-purpose thing that has cost as well as inaccuracies, so you start worrying about the trade-off: where on this generality-cost spectrum are you going to find a home? That is going to be a very important question, and I think that's what Hochreiter was hinting at when he said the bitter lesson part is over: you do actually need to worry about the cost you spend to achieve a goal. The first time you achieve it, nobody cares about the cost, because it's never been done and you get all the credit; but the nth time it's being done, when it's become a normal day-to-day thing, the efficiency aspects matter.

A few things on that. First of all, with o1 Pro, I think it's worth the $200 a month; you can call it a hundred times a day, and of course the API is very, very expensive, but I'm already spending over $1,000 a month on Claude Sonnet 3.5. But you raised an interesting point. The utility of an o1 model is a bit weird: it's useful in certain specific circumstances, and if anything, because of the verbosity and the distractors and the context, it's not really a model you want to be using most of the time. But that raises the pragmatism, architecture, and efficiency question you're speaking to. I spoke with some people this morning who have built a kind of neuroevolution approach to designing multi-agent systems. At the moment we hack in the tool use: do we use a debate pattern, do we have a small model we prompt a lot, or do we use a bigger model? We're all just hacking together these multi-agent architectures, and some of those architectures will even be doing the kinds of things you're speaking about. Rather than the model trying to convince you it got the right answer, there might be a supervisor agent that does some reflection, or another agent that generates the symbolic planning code and runs it on a tool. We're building these big complicated things, and I think the process we need to figure out now is building the systems that actually use this technology in the best way.

I broadly agree, but one distinction I want to point out is that there are two notions of use for these kinds of models. When you use a subscription model, $20 or $200 a month, I would argue that is by definition human-in-the-loop, with the model being an assistant to you. It's a very different kind of evaluation: you were unhappy with the previous model because it was wasting more of your time and wasn't worth it, and this one helps with whatever you were doing, so you're happy with it. That's one particular type of use. In general, I've always thought, and I think we talked about this last time too, that large language models, and now large reasoning models, are intelligence amplifiers; there's no question about that part. If you want to use them, use them, and people are able to find uses for them, which is great. The part I want to focus on, though, and which has been most of our work,
is the scenarios where these become user-facing systems that make the decisions themselves. They just say "this is the answer," and then the plan gets executed: the robot executes the plan, or this is the travel plan and the tickets get bought. You don't get to come back in and say you don't like the travel plan; that's what you can do in the subscription model. What I'm talking about, basically the API-access case, is what all the startups building additional tools on top of these models are doing: they are going to ship specific autonomous functionality, and there the question is the actual computational cost versus benefit for a certain level of accuracy on the end user's side. These are two very different kinds of use. I have no question at all in my mind that LLMs, and certainly LRMs too, are great intelligence amplifiers; that's not my worry. My worry has always been that people are trying to put these in end-user-facing situations where the model makes the decisions and some executor just executes them without pushing back. When that happens, the guarantees matter, and the brittleness of the reasoning matters. If you are in the loop, it's like having an assistant: you may fire the assistant if they give mostly bad ideas, but you will never blindly act on the assistant's ideas; the buck always stops with you. That's a very different way of using LLMs than the LLM being the one the patient talks to, with no doctor between the LLM or LRM and the patient, in which case the accuracy matters and the cost of getting to a certain level of accuracy matters. These are two very different uses, and I'm much more interested in the second than the first.

Can I push back just a tiny bit? First of all, I completely agree that used autonomously these things don't work, for all of the reasons you said, and that's exactly why they're not being used that way. What we are seeing is that all the successful reimaginings of applications with language models are completely interactive: there's a human in the loop who is supervising, augmenting, redirecting, and so on. The next step, which we haven't really seen yet but are starting to see, is autonomous agent-based systems with multiple levels of reflection and checking. For example, it could be a bunch of agents generating programs and contributing to a library of programs, where the programs are supervised not just by you but by other users of the application, and the whole thing grows as a living ecosystem. So there's some diffuse form of human-supervised verification, and maybe in the future humans might be increasingly taken out of the front plane.

I think that's a very sane way of using them, but I'm afraid it's not the only way they're being used. There are really two issues. If that's the only way, I'm very happy: it's a tool, you use it, the onus is still on you, the buck stops with you because you are in the loop. But most of the imagined uses, at least from where I sit, judging by the kinds of startups I hear from and the
kinds of papers I'm reading, are all about autonomous use. And that's where I'm looking at the fact that there is more promise than before: it was very brittle before, it's less brittle now, but it is less brittle at the expense of cost. It's also interesting that the evaluation strategies for these two uses are quite different. Evaluating assistive technologies is very different from evaluating autonomous technologies, and it's not that one is easier. For assistive technology you can say the evaluation is just whether people keep buying it and paying for the subscription, which is proof that they're getting some value out of it, but correctly evaluating assistive technologies is actually pretty hard; it's an entire research area in itself. And most of the people who worry about the limits of LLMs, including François Chollet with his ARC benchmark, for example, are ultimately after the same thing: irrespective of whether you believe AGI is coming next week, next decade, or next century, everybody in AI eventually wants autonomous systems that can take intelligent action with guarantees. I think we will get there, but prematurely declaring that whatever currently exists is already working is what a bunch of us are worried about, and that's what we're pushing back on. With humans in the loop it's a completely different thing. Even for code generation there are two different uses: one where the technique tries to push accuracy to the level where humans don't have to check it, and one that's just idea generation for the human. If it's idea generation, that's fine, because the buck still stops with the actual programmer. So the autonomous case is the one I care about, and the one where I worry about premature declarations that these systems are already autonomously intelligent. But I'm generally very happy that this technology exists as a human-in-the-loop technology. And it's interesting for me, sitting here, to hear this from you as a user; you're much more of a regular user of LLMs and LRMs than I have ever been. It means something to me when you say you like o1 more than you ever liked o1-preview, and that you were maybe okay with GPT-4 but now like o1 a little more. You're getting value out of it, but you still always have the red switch: you can decide not to take its answer.

o1 Pro, actually. Yeah, okay, o1 Pro. The only difference is that when it thinks for a long time there seems to be a qualitative improvement. I wanted to get your take on something else. We've got your LLM-Modulo architecture, and then we've got this huge test-time approach, the kind of Greenblatting approach, where you Greenblatt the model and get it to generate loads and loads of Python functions, and in a way this is the sort of thing we like, because...

Yeah, the ARC thing. Greenblatt, okay, fine, yeah.

Yes, but we're seeing that in lots and lots of different ways: doing loads and
loads of inference, and then we've got these Python functions, and maybe we do library learning and remixing. We're in the world of code, so we're generating an explicit function we can verify, and we love that; we're in a very happy place. But now we're seeing an interesting shift. Certainly on ARC, and in several other papers, people are moving towards this idea of transductive active fine-tuning, which simply means that rather than generating an explicit Python function loads and loads of times, you just generate the solution directly using the neural network. That's a step away from what we liked, because we like programs: programs are Turing-complete, we understand what they mean, and so on. And now there's a whole load of people saying that the neural network can just do whatever the program does, so let the neural network output the solution. What do you think about that?

To be honest, I haven't followed that work as closely, so my answer is somewhat generic. I would be surprised; I would have the same bias. There's an old saying: why write programs when you can write programs that write programs? The version we're talking about is that you want to generate higher-level code that generates the solutions, and that has always been the conceit of computer science. So I am surprised, though I don't know the specific work you're referring to, that people would go back to directly guessing the solutions. In the context of inference-time scaling, one interesting question is this: you generate loads and loads of candidates, and the candidates can be either direct solution candidates or code candidates. Either way you still have to have a verifier: if it's code you need a code verifier, and if it's a solution you need a solution verifier, and one of the interesting questions is where these verifiers come from. One of the more effective ideas we've been pursuing is that you can essentially generate verifiers. Of course there are symbolic verifiers for specific things, and we can use those in LLM-Modulo-style frameworks, but you can also use learned verifiers, where you learn discriminatively what is and what is not a solution. A third idea is to generate the code for the verifier and then correct it, and at least in our case that seems promising; we're working on some things that will come out soon. But especially in the context of LLMs, and it's a very different question if there's no LLM in the loop at all, one of the things they are good at is outputting code as well as solutions, and code, once you've corrected it, can cover lots and lots of classes of solutions and keep working for a long time. So I would still think that, at least for the inference-time-scaling-with-verifiers case, code still seems like a good idea.
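(For the learned-verifier idea mentioned above, a minimal sketch of best-of-n selection. `score` is a hypothetical learned model that discriminatively rates candidates; it could score either code candidates or direct solution candidates.)

```python
# Sketch of best-of-n selection with a learned (discriminative) verifier.
# `score` is assumed to map (problem, candidate) to an estimated probability
# that the candidate is correct; it is a hypothetical stand-in, not any
# particular published verifier.

def best_of_n(problem, candidates, score, threshold=0.5):
    """Rank sampled candidates by verifier score and keep the best, if any clears the bar."""
    ranked = sorted(candidates, key=lambda c: score(problem, c), reverse=True)
    if not ranked:
        return None
    best = ranked[0]
    # Unlike a sound symbolic verifier, a learned verifier gives no guarantee:
    # the threshold only filters out low-confidence picks.
    return best if score(problem, best) >= threshold else None
```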
I don't quite know the specific context in which people are saying that transductively, directly guessing solutions helps; I'm not sure whether they still have an LLM in the loop or whether they're just directly training a separate neural network.

Well, I'll sketch it out. For solving ARC they have two Llama 8-billion-parameter models: one generates Python programs, and they Greenblatt it, and the other is trained separately just to output the answer grid directly. In both cases they spend inference-time compute, either generating lots of Python programs or doing active fine-tuning of the direct-solution model by augmenting the test-time examples. What they found, on the Venn diagram of their success rates, is that for some problems the program route, the Greenblatt approach, works really well, and for some problems, certainly things like mosaics and spatial, perceptual-type stuff, the transduction works really well. And this is kind of weird, because if you think about the space of functions the neural network could reason about, they should be the same. So I don't know whether it's limitations of the neural network, or characteristics of the problems, or something else.

Interestingly, to me it depends very much on the space of solution configurations versus the space of code configurations. There are many problems where the solutions have less, quote-unquote, syntactic complexity, so a neural network that can guess a string may not be able to guess something that looks like a syntactically correct Python program; LLMs can do the latter. So it is interesting that, when you can do that, going back to a neural network directly guessing the solution could still be a useful step. The work we're doing on verification is still in its initial stages, and we haven't checked whether this kind of trade-off exists, so I have no further insight into why that might be happening.
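(A conceptual sketch of the two ARC-style routes just described. `sample_program` and `predict_grid` are hypothetical stand-ins for the two Llama-8B models, one inductive and one transductive; this is not the actual pipeline from the papers mentioned.)

```python
# Induction vs. transduction on an ARC-like task, as discussed above.
# `sample_program` is assumed to return a candidate grid->grid function
# (e.g. LLM-written Python); `predict_grid` is assumed to output the
# answer grid directly. Both are hypothetical interfaces.

def solve_by_induction(train_pairs, test_input, sample_program, n=256):
    """Greenblatt-style route: sample many programs, keep one that fits every training pair."""
    for _ in range(n):
        program = sample_program(train_pairs)
        try:
            if all(program(x) == y for x, y in train_pairs):  # programs are checkable
                return program(test_input)
        except Exception:
            continue                                          # sampled programs may crash
    return None

def solve_by_transduction(train_pairs, test_input, predict_grid):
    """Transductive route: no intermediate program, just predict the output grid."""
    return predict_grid(train_pairs, test_input)              # no independent check available
```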
Wonderful. What are you doing at the conference this week?

I'm just here today, and we presented the Chain of Thoughtlessness paper. When we wrote it, the framing was mostly "it can't follow a procedure, so I should be able to show that," but now I explain the whole thing the way I explained it to you at the beginning, starting from prompt augmentation. As Schopenhauer said, life must be lived forwards but only makes sense backwards, and papers also only make sense backwards: a while after writing one, you look at it and realize what you really wanted to say. Here, that's that chain of thought is not a great framing, because you really want to think in terms of prompt augmentations, where the human in the loop becomes less important. So that's what we did. Then I'm going to this compound-systems thing, and it's a great time; there are like 16,000 people here and I keep running into old friends.

One of the best moments from our last interview, when you were talking about that paper, was you saying that you can teach it to catch two fish, or three fish...

Yes, and that's basically because it doesn't quite know how to generalize. You have to give it examples for four-word problems, then again for seven-word problems, then again for nine-word problems, and so on, and try to improve it, whereas the conceit is that when people see this they assume it must be doing procedural generalization. The interesting thing, and I think we had this conversation last time too, is that I'm skeptical only because of some additional background. John McCarthy, one of the founding fathers, the guy who coined the name artificial intelligence, basically said that the holy grail of AI is an advice-taker program, and advice taking is AI-complete. If chain of thought were able to make an LLM take advice, that would be pretty impressive, so I went in expecting there to be holes, and that is where the one-fish, two-fish thing comes in. The more interesting point, though, is de-anthropomorphizing LLMs and thinking of them as these alien entities for which arbitrary prompt augmentations can produce good behavior. One example people should think about is jailbreaks: you give a normal prompt plus a particular, carefully constructed, learned sequence. Zico Kolter's group's original paper shows that the sequence makes no sense to humans, but it will make most LLMs produce a deterministic behavior, saying "got you" or something of that kind. That should tell us they are not seeing language the way we do, so the prompt augmentations don't have to make sense to the humans in the loop, and that's okay. In some sense, the only reason chain of thought made sense to humans is that it gave this false impression that LLMs are doing things the way we do, but that's not how it is, so you might as well go with what they can do and optimize directly, which is what the inference-time scaling and post-training methods seem to be doing.

The one thing I get stuck on is that we can criticize individual LLMs. Yes, they are approximate retrieval engines; my co-host Keith Duggar is always at pains to point out that, theoretically, they're not Turing-complete, they're finite-state automata, and so on. But it all breaks down when you talk about LLM systems. Even with the chain-of-thought thing, I could have another supervisor model that generalizes the prompt to five fish, six fish, and so on. We can easily build systems that overcome all of these criticisms, so at some point doesn't it seem like we're making criticisms that can be easily...

No, actually it's a very good point. In fact, after this I'm going to the compound-systems meetup, and I'm completely a believer in that whole direction. But there are some people who don't want to believe in that. By the way, it's a very interesting thing that OpenAI was at pains to point out that o1-preview was a model, not a system; it's not me and you saying it, it's them saying it. They would like to say there is this one-size-fits-all model that will do it, and
so it is reasonable to take their word for that. In parallel, I also like the compound-systems work. LLM-Modulo is a compound system, and it addresses a set of limitations of LLMs. I'm completely fine with that; it doesn't matter to me, as long as I can give guarantees in safety-critical scenarios. I don't have that bias. But if you are saying a single model will do it, I will take you at your word and then check whether or not that's true. That's a fair thing.

Why do you think Google has completely embraced hybrid systems while OpenAI is really clinging on to this single model that does everything?

I think they're slowly changing that, but I think there was a reason, and to some extent I can understand it. It's this anthropomorphization again: we only have one brain; it's not that we have a brain for eating and a separate brain for something else, just one brain. So it would be nice if what we are trying to build were this one-size-fits-all general system. But at the same time there's the issue that whatever I build, I want to provide guarantees, so that it can be used in safety-critical systems. And the problem is that modern AI, neuroscience, and cognitive science are not one and the same. Everybody understands that neural networks themselves are not really that well connected to the brain; they're biologically implausible, and LLMs definitely are too. There's nothing wrong with that, just as planes don't have to flap their wings, but we don't try to make sense of planes and birds in the same sentence: they both fly, but the mechanics are different, and the flight equations are not the same. That's going to be even more the case with LLMs, and as long as we realize that, it's fine. My sense is that OpenAI, and a bunch of these people, were originally hoping to get two birds with one stone: we'll get AI systems and also understand how the brain works. Honestly, I don't think anybody really believes that part. You might use these systems to improve our understanding while actually doing neuroscience; I think somebody, I forget who, basically says these systems obviously help in doing neuroscience research, but they're not actually telling you how the brain necessarily works. That's just speculation about why OpenAI and some of these people were sticking to the single-model story. But in the conversations I've been having on the sidelines of the conference, the companies, the startups and so on are already going much more into these hybrid systems, these compound systems, which would not be a single system. OpenAI is also slowly coming out with fine-tuning; they have this RL fine-tuning offering for your specific kinds of scenarios, so it will be interesting to see. But going back to your original point, I think compound systems
are a very different setting: the individual role the LLM has to play there is much less demanding. In fact, one of the fun things is that we can do LLM-Modulo with normal LLMs, or LRM-Modulo where instead of an LLM we call o1, in which case the generation of candidates is costlier. We actually show in the strawberry paper that we can further improve the performance of o1-preview on some of the problems, even though we couldn't change how much time it spends thinking and so on: just by calling it multiple times with better criticisms of the answers it gave for the problem instance, we could improve its accuracy quite significantly. So that is still using them in a system; LRMs themselves can be used inside a system. But OpenAI itself has wanted to call o1 just a model up until now. Let's see what happens.
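(For the shape of that back-prompting loop, a minimal sketch. `lrm_call` and `critique` are hypothetical interfaces, an expensive reasoning-model call and an external critic that explains what is wrong with a failed answer; the actual setup in the strawberry paper may differ.)

```python
# Sketch of an LRM-Modulo style loop: the reasoning model proposes an answer,
# an external critic points out what is wrong, and the criticism is folded
# back into the next prompt. `lrm_call` and `critique` are assumed interfaces.

def lrm_modulo(problem, lrm_call, critique, max_rounds=5):
    feedback = ""
    for _ in range(max_rounds):
        answer = lrm_call(problem, feedback)   # each call is the expensive step
        error = critique(problem, answer)      # None means the answer verifies
        if error is None:
            return answer
        feedback += f"\nYour previous answer was wrong because: {error}"
    return None                                # ran out of rounds without a verified answer
```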