I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems. I believe that people are fundamentally good, and so if AI, especially open-source AI, can make them smarter, it just empowers the goodness in humans.

So I share that feeling. I think people are fundamentally good, and in fact a lot of doomers are doomers because they don't think that people are fundamentally good.

The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, a professor at NYU, a Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open-sourcing AI development and have been walking the walk by open-sourcing many of their biggest models, including Llama 2 and, eventually, Llama 3. Yann has also been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good: it will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be a somewhat controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.

You've had some strong statements, technical statements, about the future of artificial intelligence recently, and throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and soon Llama 3, and so on. How do they work, and why are they not going to take us all the way?

For a number of reasons. The first is that there are a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world, they don't really have persistent memory, they can't really reason, and they certainly can't plan. So if you expect the system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful, they're certainly useful; or that they're not interesting, or that we can't build a whole ecosystem of applications around them, of course we can. But as a path towards human-level intelligence, they're missing essential components.

And then there is another tidbit, or fact, that I think is very interesting. Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available text on the internet. That's typically on the order of 10^13 tokens.
Each token is typically about two bytes, so that's 2 x 10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his life, and the amount of information that has reached the visual cortex of that child in four years is about 10^15 bytes. You can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. So it's 10^15 bytes for a four-year-old versus 2 x 10^13 bytes for 170,000 years' worth of reading. What that tells you is that through sensory input we see a lot more information than we do through language, and that, despite our intuition, most of what we learn and most of our knowledge comes through our observation of and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.
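As an aside, a quick back-of-the-envelope check of the data-volume comparison above, using the figures as quoted. The bytes-per-word and reading-pace values in this sketch are assumptions for illustration, not numbers from the conversation:

```python
# Rough check of the numbers quoted above.
# Assumptions (not from the conversation): ~6 bytes per word read,
# and a reading pace somewhere between 100 and 250 words per minute.

llm_bytes = 1e13 * 2                 # ~10^13 tokens at ~2 bytes per token
words = llm_bytes / 6                # rough word count of the training text
for wpm in (100, 250):
    hours = words / wpm / 60
    years = hours / 8 / 365          # reading 8 hours a day
    print(f"{wpm} wpm -> ~{years:,.0f} years to read it all")

# Visual input of a four-year-old, using the figures quoted above:
# ~16,000 waking hours at ~20 MB/s through the optic nerve.
visual_bytes = 16_000 * 3600 * 20e6
print(f"visual input ~ {visual_bytes:.1e} bytes vs text ~ {llm_bytes:.1e} bytes")
```

At these assumed reading speeds the result lands between roughly 75,000 and 190,000 years, so the 170,000-year figure and the two-orders-of-magnitude gap with visual input are both in the right ballpark.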
So it would be good to maybe push against some of the intuition behind what you're saying. It is true that there are several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue, about your comparison between sensory data and language, that language is already very compressed: it already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. There's a lot of wisdom in language, in the words and the way we stitch them together; it already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able to, from that language, construct a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

So it's a big debate among philosophers, and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality, it could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models. There are a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language: everything that's physical, mechanical, whatever. When we build something, when we accomplish a task, a manipulation task of grabbing something, etc., we plan our action sequences, and we do this by essentially imagining the result, the outcome, of a sequence of actions we might imagine, and that requires mental models that don't have much to do with language. I would argue most of our knowledge is derived from that interaction with the physical world. A lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. Philosophers are split as well.

And the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence. This is the old Moravec's paradox, from Hans Moravec, the pioneer of robotics, who said: how is it that with computers it seems to be easy to do high-level, complex tasks like playing chess and solving integrals, whereas the things we take for granted that we do every day, like learning to drive a car or grabbing an object, we can't do with computers? We have LLMs that can pass the bar exam, so they must be smart, but then they can't learn to drive in 20 hours like any 17-year-old, and they can't learn to clear the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture are we missing that basically prevents us from having level-five self-driving cars and domestic robots?

Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time, so it can operate in a space of concepts?

Yeah, that's what a lot of people are working on. The short answer is no. The more complex answer is: you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. A classical way of doing this is: you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, or self-supervised, all kinds of different ways, that will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens a typical LLM takes as input. Then you just feed that to the LLM in addition to the text, and you expect the LLM, during training, to be able to use those representations to help make decisions. There has been work along those lines for quite a long time, and now you see those systems: there are LLMs that have some vision extension. But they're basically hacks, in the sense that those things are not trained end to end to really understand the world. They're not trained with video, for example, and they don't really understand intuitive physics, at least not at the moment.

So you don't think there's something special about intuitive physics, about common-sense reasoning about physical space, about physical reality, that to you is a giant leap that LLMs are just not able to do?

We're not going to be able to do this with the type of LLMs we are working with today, and there are a number of reasons for this, but the main reason is the way LLMs are trained: you take a piece of text, you remove some of the words in that text, you mask them, you replace them with blank markers, and you train a gigantic neural net to predict the words that are missing. And if you build this neural net in a particular way, so that it can only look at the words that are to the left of the one it's trying to predict, then what you have is a system that is basically trying to predict the next word in a text. So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly, so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens, which are kind of subword units. And so it's easy to handle the uncertainty in the prediction there, because there's only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Then what the system does is pick a word from that distribution. Of course there's a higher chance of picking words that have a higher probability within that distribution, so you sample from that distribution to actually produce a word, and then you shift that word into the input. And that allows the system to predict the second word. Once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs.
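As a side note, the loop just described fits in a few lines. In this sketch, `model` is a stand-in for whatever trained network returns a probability distribution over the next token; it is not any particular library's API:

```python
import random

def generate(model, prompt_tokens, n_new_tokens):
    """Minimal autoregressive decoding loop.

    `model(tokens)` is assumed to return a dict mapping each candidate
    next token to its probability (a stand-in for a trained LLM's
    softmax output over the vocabulary).
    """
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        dist = model(tokens)                    # distribution over the next token
        candidates = list(dist.keys())
        weights = list(dist.values())
        next_tok = random.choices(candidates, weights=weights, k=1)[0]  # sample one
        tokens.append(next_tok)                 # shift the sampled token into the input
    return tokens
```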
But we just call them LLMs. And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk, you and I are bilingual, we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, let's say, a mathematical concept or something, the kind of thinking that we're doing, and the answer that we're planning to produce, is not linked to whether we're going to say it in French or Russian or English.

Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language?

Right. It's certainly true for a lot of the thinking that we do.

Is that obvious? Like, you're saying your thinking is the same in French as it is in English?

Yeah, pretty much. Pretty much.

Or how flexible are you? Like, if there's a probability distribution...

Well, it depends what kind of thinking. If it's producing puns, I get much better in French than in English about that.

No, but is there an abstract representation of puns? Like, is your humor abstract? When you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

There is an abstract representation of imagining the reaction of a reader to that text.

Or you start with laughter and then figure out how to make that happen?

Or figure out a reaction you want to cause, and then figure out how to say it so that it causes that reaction. But that's really close to language. Think instead about a mathematical concept, or imagining something you want to build out of wood, or something like this. The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language; you're imagining mental models of the thing. If I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. So clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say if the output is uttered words, as opposed to the output being muscle actions.
We plan our answer before we produce it, and LLMs don't do that; they just produce one word after the other, instinctively if you want. It's a bit like subconscious actions: you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention; you sort of respond automatically. That's kind of what an LLM does. It doesn't really think about its answer. It retrieves it, because it has accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

But you're making it sound like one-token-at-a-time generation is bound to be simplistic. If the world model is sufficiently sophisticated, then the most likely thing it generates, one token at a time, as a sequence of tokens, is going to be a deeply profound thing.

Okay, but then that assumes that those systems actually possess an internal world model.

So it really goes to, I think, the fundamental question: can you build a really complete world model? Not complete, but one that has a deep understanding of the world.

Yeah. So, can you build this, first of all, by prediction? The answer is probably yes. Can you build it by predicting words? The answer is most probably no, because language is very poor in terms of, or weak, or low-bandwidth if you want; there's just not enough information there. So building world models means observing the world and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So what a world model really is, is this: here is my idea of the state of the world at time t, here is an action I might take, what is the predicted state of the world at time t+1? Now, that state of the world does not need to represent everything about the world; it just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details.

Now, here is the problem: you're not going to be able to do this with generative models. So, a generative model trained on video, and we've tried to do this for ten years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video, basically predict what's going to happen, one frame at a time. Do the same thing as the autoregressive LLMs do, but for video, either one frame at a time or a group of frames at a time. A large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR, some colleagues and I have been trying to do this for about ten years. And you can't really do the same trick as with LLMs, because, as I said, with LLMs you can't predict exactly which word is going to follow a sequence of words, but you can predict the distribution over words. Now, if you go to video, what you would have to do is predict the distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful, and there lies the main issue.
And the reason we can't do it is that the world is incredibly more complicated and richer, in terms of information, than text. Text is discrete; video is high-dimensional and continuous, with a lot of details in it. So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict that this is a room where there's a light and there is a wall and things like that, but it can't predict what the painting on the wall looks like, or what the texture of the couch looks like, and certainly not the texture of the carpet. There's no way I can predict all those details.

So one way, possibly, to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. The latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch and the painting on the wall. That has been a complete failure, essentially. We've tried lots of things: we tried straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system, and that also basically failed. All the systems that attempt to predict missing parts of an image or a video from a corrupted version of it, so: take an image or a video, corrupt it or transform it in some way, try to reconstruct the complete video or image from the corrupted version, and then hope that, internally, the system will develop good representations of images that you can use for object recognition, segmentation, whatever. That has been essentially a complete failure. And it works really well for text; that's the principle that is used for LLMs.

So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, a good embedding of all the important information in the image? Is it in terms of the consistency from image to image to image, the images that form the video? If we do a highlight reel of all the ways you've failed, what does that look like?

Okay, so first of all, I have to tell you exactly what doesn't work, because there is something else that does work. The thing that does not work is training a system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders, including something called MAE, masked autoencoder, developed by some of my colleagues at FAIR. It's basically like the LLMs, where you train the system by corrupting text, except you corrupt images, you remove patches from them, and you train a gigantic neural net to reconstruct. The features you get are not good.
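For illustration, here is a highly simplified sketch of the reconstruction-style objective being described, in the spirit of a masked autoencoder. It is not FAIR's implementation: real MAE drops the masked patches from the encoder input rather than zeroing them, and the module sizes below are arbitrary toy choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic masked-reconstruction objective: split the image into patches,
# hide most of them, and train the network to reconstruct the missing pixels.

patch = 16
def patchify(img):                      # img: (B, 3, 224, 224) -> (B, 196, 768)
    B = img.shape[0]
    p = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, 3, 14, 14, 16, 16)
    return p.permute(0, 2, 3, 1, 4, 5).reshape(B, 14 * 14, 3 * patch * patch)

encoder = nn.Linear(768, 256)           # stand-in for a large vision backbone
decoder = nn.Linear(256, 768)           # stand-in for the reconstruction head

def mae_style_loss(img, mask_ratio=0.75):
    x = patchify(img)                                   # (B, N, 768)
    mask = torch.rand(x.shape[:2]) < mask_ratio         # which patches are hidden
    corrupted = x.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out the hidden patches
    recon = decoder(encoder(corrupted))                 # predict pixels for every patch
    return F.mse_loss(recon[mask], x[mask])             # score only the hidden ones
```

The point of showing it is the loss: it is computed in pixel space, which is exactly what the next part of the conversation argues against for learning representations.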
And you know they're not good, because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pre-training.

So the architecture is good.

The architecture is good, the architecture of the encoder is good, but the fact that you train the system to reconstruct images does not lead it to learn good, generic features of images.

When you train it in a self-supervised way?

Self-supervised by reconstruction.

Yeah, by reconstruction. Okay, so what's the alternative?

The alternative is joint embedding.

What is joint embedding? What are these architectures that you're so excited about?

Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image and the corrupted or transformed version, you run them both through encoders, which in general are identical but not necessarily, and then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So, joint embedding, because you're taking the full input and the corrupted or transformed version and running them both through encoders, so you get a joint embedding. And then you're asking: can I predict the representation of the full one from the representation of the corrupted one? I call this a JEPA, which means joint-embedding predictive architecture, because it's a joint embedding and there is this predictor that predicts the representation of the good guy from the bad guy.

And the big question is: how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how you train those things, except for one, called contrastive learning. The idea of contrastive learning is that you take a pair of images that are, again, an image and a corrupted, degraded, or transformed version of the original one, and you train the predicted representation to be the same as that of the original. If you only do this, the system collapses: it basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this, and those things have been around since the early '90s; I had a paper on this in 1993. You also show pairs of images that you know are different, and then you push their representations away from each other. So you say: not only should representations of things that we know are the same be the same, or similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitations. There's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some from Google and other places, but there are limitations to those contrastive methods.

What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative contrastive samples of images that we know are different. You train them only with images that are different versions, or different views, of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have a dozen different methods for this now.
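By contrast with the reconstruction sketch above, here is a minimal sketch of the joint-embedding predictive setup just described: both views go through encoders, a predictor maps one representation to the other, and the loss lives in representation space rather than pixel space. The modules and the stop-gradient on the target branch are illustrative assumptions standing in for the collapse-prevention "tricks" mentioned, not FAIR's actual training recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(768, 256)     # shared encoder for both views (toy size)
predictor = nn.Linear(256, 256)   # predicts the target representation

def jepa_style_loss(full_view, corrupted_view):
    # Target branch: representation of the full input, no gradient through it
    # (one simple stand-in for the tricks used to avoid collapse).
    with torch.no_grad():
        target = encoder(full_view)
    # Online branch: encode the corrupted view, then predict the target representation.
    pred = predictor(encoder(corrupted_view))
    # The loss is in representation space; no pixels are reconstructed.
    return F.mse_loss(pred, target)
```

A contrastive variant would instead add pairs known to be different and push their representations apart; the non-contrastive methods discussed next avoid those negative pairs.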
So what is the fundamental difference between joint-embedding architectures and LLMs? Can a JEPA take us to AGI? Whether or not we should use the term AGI, and we'll probably argue; I think every single time I've talked to you, we've argued about the G in AGI.

Yes.

I get it, I get it. We'll probably continue to argue about it; it's great. You like AMI, because you like French, and "ami", I guess, is "friend" in French.

Yes, and AMI stands for advanced machine intelligence.

Right. But either way, can JEPA take us towards that advanced machine intelligence?

Well, it's a first step. So first of all, what's the difference with generative architectures like LLMs? LLMs, or vision systems that are trained by reconstruction, generate the inputs. They generate the original input that is non-corrupted, non-transformed, so you have to predict all the pixels, and there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels; you're only trying to predict an abstract representation of the inputs. And that's much easier in many ways. So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but only extract information that is relatively easily predictable. There are a lot of things in the world that we cannot predict. For example, if you have a self-driving car driving down the street, there may be trees around the road, and it could be a windy day, so the leaves on the trees are moving in semi-chaotic, random ways that you can't predict and don't care about; you don't want to predict them. So what you want is for your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on. So when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. Not only is that a lot simpler, but it also allows the system to essentially learn an abstract representation of the world in which what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder.

So it kind of lifts the level of abstraction of the representation. If you think about this, it's something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction; we don't describe every natural phenomenon in terms of quantum field theory. That would be impossible. So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory, to atomic theory and molecules, to chemistry and materials, all the way up to concrete objects in the real world and things like that. We can't just model everything at the lowest level, and that's what the idea of JEPA is really about: learn abstract representations in a self-supervised manner, and you can do it hierarchically as well. That, I think, is an essential component of an intelligent system. And in language, we can get away without doing this, because language is already, to some level, abstract and has already eliminated a lot of information that is not predictable. So we can get away without lifting the abstraction level, by directly predicting words.
So joint embedding is still generative, but it's generative in this abstract representation space.

Yeah.

And you're saying language, we were lazy with language, because we already got the abstract representation for free, and now we have to zoom out and actually think about generally intelligent systems: we have to deal with the full mess of physical reality, and you do have to do this step of jumping from the full, rich, detailed reality to an abstract representation of that reality, based on which you can then reason and all that kind of stuff.

Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it. And there is way more redundancy and structure in perceptual inputs, sensory input like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago: language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant, and so self-supervision will not work as well.

Is it possible to join the self-supervised training on visual data and self-supervised training on language data? There is a huge amount of knowledge, even though you talk down about those 10^13 tokens. Those 10^13 tokens represent the entirety, a large fraction, of what us humans have figured out: both the discussions on Reddit and the contents of all the books and the articles and the full spectrum of human intellectual creation. So is it possible to join those two together?

Well, eventually, yes, but I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision-language models: we're basically cheating. We are using language as a crutch to help the deficiencies of our vision systems, to kind of learn good representations from images and video. The problem with this is that we might improve our vision-language systems a bit, I mean our language models, by feeding them images, but we're not going to get to the level of even the intelligence, or the level of understanding of the world, of a cat or a dog, which doesn't have language. They don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine it with language? Obviously, if we combine this with language, it's going to be a winner, but before that, we have to focus on how we get systems to learn how the world works.

So this kind of joint-embedding predictive architecture, for you, is going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing?

That's the hope. In fact, the techniques we're using are non-contrastive. So not only is the architecture non-generative, the learning procedures we're using are non-contrastive. We have two sets of techniques. One set is based on distillation, and there are a number of methods that use this principle: one by DeepMind called BYOL, a couple by FAIR, one called VICReg and another one called I-JEPA. And VICReg, I should say, is not actually a distillation method, but I-JEPA and BYOL certainly are.
And there's another one also called DINO, also produced at FAIR. The idea of those things is that you take the full input, let's say an image, you run it through an encoder, which produces a representation, and then you corrupt that input or transform it and run it through essentially what amounts to the same encoder, with some minor differences. And then you train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch; you only train the part of the network that is fed with the corrupted input. The other network you don't train, but since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, with the collapse of the type I was explaining before, where the system basically ignores the input. So that works very well. The two techniques we developed at FAIR, DINO and I-JEPA, work really well for that.

So what kind of data are we talking about here?

There are several scenarios. One scenario is: you take an image, and you corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.

But basic horrible things.

Basic horrible things that degrade the quality a little bit and change the framing, crop the image. In some cases, in the case of I-JEPA, you don't need to do any of this; you just mask some parts of it. You basically remove some regions, like a big block, essentially, and then run it through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So that's I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking, whereas with DINO you need to know it's an image, because you need to do things like geometric transformations and blurring and things like that that are really image-specific.

A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it, and what we mask is actually kind of a temporal tube: a whole segment of each frame in the video, over the entire video.

And that tube is statically positioned throughout the frames? Like a straight tube?

The tube, yeah. Typically it's 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then, again, we train that system so as to predict the representation of the full video from the partially masked video. That works really well. It's the first system that we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. It's the first time we get something of that quality.

So that's a good test that a good representation is formed. That means there's something to this.

Yeah.
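For concreteness, here is an illustrative version of the tube masking just described. The region size and coordinates are arbitrary choices, and the actual V-JEPA masking strategy may differ in detail:

```python
import torch

def tube_mask(video, y0=40, x0=40, h=80, w=80):
    """video: (T, C, H, W). Hide the same spatial block in every frame.

    Returns the masked clip and the boolean mask that was applied.
    """
    T, C, H, W = video.shape
    mask = torch.zeros(T, 1, H, W, dtype=torch.bool)
    mask[:, :, y0:y0 + h, x0:x0 + w] = True          # same region across all T frames
    return video.masked_fill(mask, 0.0), mask

clip = torch.randn(16, 3, 224, 224)                  # a 16-frame toy video clip
masked_clip, mask = tube_mask(clip)
# The JEPA predictor is then trained to predict the representation of the
# full clip from the representation of the masked clip.
```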
We also have preliminary results that seem to indicate that the representation allows our system to tell whether a video is physically possible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape, or something.

So it's able to capture some physics-based constraints about the reality represented in the video, about the appearance and disappearance of objects?

Yeah, that's really new.

Okay, but can this actually get us to the kind of world model that understands enough about the world to be able to drive a car?

Possibly. It's going to take a while before we get to that point, but there are already systems that are based on this idea. What you need for this is a slightly modified version, where you imagine that you have a complete video, and what you're doing to this video is that you're either translating it in time towards the future, so you only see the beginning of the video but you don't see the latter part of it that is in the original, or you just mask the second half of the video, for example. Then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one, but you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right, or something. So if it's a dash cam in a car and you know the angle of the wheel, you should be able to predict, to some extent, what's going to happen to what you see. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen.

So now what you have is an internal model that says: here is my idea of the state of the world at time t, here is an action I'm taking, and here is a prediction of the state of the world at time t+1, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is plan what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. You can have a number of objectives. If I have an object like this and I open my hand, I can predict it's going to fall. If I push it with a particular force on the table, it's going to move. If I push the table itself, it's probably not going to move with the same force. We have this internal model of the world in our mind, which allows us to plan sequences of actions to arrive at a particular goal. So now, if you have this world model, we can imagine a sequence of actions, predict what the outcome of that sequence of actions is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at run time. We're not talking about learning; we're talking about inference time. So this is planning, really, and in optimal control this is a very classical thing: it's called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you're planning a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix.
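A minimal sketch of the model-predictive-control loop being described: roll a learned world model forward under a candidate action sequence, score the predicted end state against an objective, and optimize the actions at inference time. Here `world_model` and `objective` are stand-ins for learned or hand-written functions, not an existing API, and gradient descent on the actions is just one of several possible optimizers:

```python
import torch

def plan(world_model, objective, s0, horizon=10, steps=100, lr=0.1):
    """Optimize a sequence of actions so the predicted end state scores well."""
    actions = torch.zeros(horizon, requires_grad=True)   # candidate action sequence
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        s = s0
        for a in actions:              # predict s_{t+1} from (s_t, a_t), step by step
            s = world_model(s, a)
        cost = objective(s)            # e.g. distance of the bottle to the target spot
        opt.zero_grad()
        cost.backward()                # gradients flow back through the rollout
        opt.step()
    return actions.detach()
```

Note that nothing is learned here: the world model's weights are fixed, and only the action sequence is adjusted, which matches the "inference time, not learning" point above.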
This is the way rocket trajectories have been planned since computers have been around, since the early '60s, essentially.

So, yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?

Well, no. You will have to build a specific architecture to allow for hierarchical planning. Hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, this is the example I use all the time, and I'm sitting in my office at NYU, the objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I would have to decompose this into two subgoals: the first one is go to the airport, the second one is catch a plane to Paris. Okay, so my subgoal is now going to the airport; my objective function is my distance to the airport. How do I go to the airport? Well, I have to go into the street and hail a taxi, which you can do in New York. Okay, now I have another subgoal: get down to the street. Well, that means going to the elevator, going down the elevator, walking out to the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, push the button. How do I get up from my chair? You can imagine going down, all the way down, to basically what amounts to millisecond-by-millisecond muscle control. And obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible, because you don't know all the conditions of what's going to happen: how long it's going to take to catch a taxi, or to go to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have that information. So you have to do this hierarchical planning, so that you can start acting and then replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.

Does something like that already emerge? Like, can you use a state-of-the-art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did? Which is: can you give me a list of ten steps I need to do to get from New York to Paris, and then, for each of those steps, can you give me a list of ten steps for how I make that step happen, and for each of those steps, can you give me a list of ten steps to make each one of those, until you're moving your individual muscles? Maybe not that far, but whatever you can actually act upon using your mind.

Right. So there are a lot of questions that are sort of implied by this. The first thing is: LLMs will be able to answer some of those questions, down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set.

They would be able to answer all those questions, but some of them may be hallucinated, meaning non-factual.

Yeah, true. They will probably produce some answer, except they're not going to be able to really produce millisecond-by-millisecond muscle control of how you stand up from your chair.
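Purely to illustrate the shape of the decomposition walked through above, here is a toy hierarchy of goals and subgoals. It only shows the structure; as noted, learning the right levels of abstraction automatically is an open problem, and the names below are just the example's own steps:

```python
# Illustrative only: the New York -> Paris decomposition as nested goals.
trip = {
    "goal": "be in Paris",
    "subgoals": [
        {"goal": "get to the airport",
         "subgoals": [
             {"goal": "get down to the street",
              "subgoals": ["stand up", "open the office door",
                           "take the elevator down", "walk out"]},
             {"goal": "hail a taxi"},
         ]},
        {"goal": "catch a plane to Paris"},
    ],
}

def leaves(plan):
    """Walk the hierarchy down to the lowest level that was spelled out."""
    if isinstance(plan, str):
        return [plan]
    out = []
    for sub in plan.get("subgoals", [plan["goal"]]):
        out.extend(leaves(sub))
    return out

print(leaves(trip))   # the lowest-level actions that were written down
```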
But down to some level of abstraction where things can be described by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. They're not going to be able to plan for situations that they never encountered before; they're basically going to have to regurgitate the templates that they've been trained on.

But where, just for the example of New York to Paris, is it going to start getting into trouble? At which layer of abstraction do you think it will start failing? Because I can imagine almost every single part of that an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.

Certainly an LLM would be able to solve that problem if you fine-tune it for it. So I can't say that an LLM cannot do this; it can do this if you train it for it, there's no question, down to a certain level where things can be formulated in terms of words. But if you want to go down to how you climb down the stairs, or just stand up from your chair, in terms of words, you can't do it. You need experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.

So for everything we've been talking about on the joint-embedding side, is it possible that that's what we need for the interaction with physical reality, on the robotics front, and then the LLMs are the thing that sits on top of it, for the bigger reasoning about, say, the fact that I need to book a plane ticket and I need to know how to go to the websites, and so on?

Sure. And a lot of plans that people know about, that are relatively high-level, are actually learned. Most people don't invent plans by themselves. We have some ability to do this, of course, but most plans that people use are plans that they've been trained on: they've seen other people use those plans, or they've been told how to do things. You can't invent it from scratch: take a person who's never heard of airplanes and ask them how you go from New York to Paris, and they're probably not going to be able to deconstruct the whole plan unless they've seen examples of that before. So certainly LLMs are going to be able to do this. But then, how do you link this with the low level of actions that need to be done, with things like JEPA that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation? That's what we need JEPAs for.

I would love to linger on your skepticism around autoregressive LLMs. One way I would like to test that skepticism is this: everything you say makes a lot of sense, but if I applied everything you said today, and in general, to, I don't know, ten years ago, maybe a little less, let's say three years ago, I wouldn't have been able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good?

Yes.

Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.
No. There's one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. It didn't start with BERT, but it was really kind of a good demonstration with this. The idea that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual: a single system that can be trained to understand hundreds of languages and translate in any direction, produce summaries, and then answer questions and produce text.

And then there's a special case of it, which is the autoregressive trick, where you constrain the system to not elaborate a representation of the text from looking at the entire text, but only to predict a word from the words that come before. You do this by constraining the architecture of the network, and that's how you can build an autoregressive LLM. So there was a surprise many years ago with what's called decoder-only LLMs, systems of this type that are just trying to produce words from the previous ones. The fact that when you scale them up, they tend to really understand more about language, when you train them on lots of data and make them really big, that was kind of a surprise. And that surprise occurred quite a while back, with work from Google, Meta, OpenAI, etc., going back to the GPT kind of work, generative pre-trained transformers.

You mean like GPT-2? Like, there's a certain place where you start to realize that scaling might actually keep giving us an emergent benefit.

Yeah, I mean, there was work from various places, but if you want to place it in the GPT timeline, that would be around GPT-2, yeah.

Well, I just, because you said it, you're so charismatic, and you said so many words, but self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?

Well, we're fooled by their fluency. We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it.

What do you think Alan Turing would say, without understanding anything, just hanging out with it?

Alan Turing would decide that the Turing test is a really bad test. This is what the AI community decided many years ago: that the Turing test was a really bad test of intelligence.

What would Hans Moravec say about the large language models?

Hans Moravec would say that Moravec's paradox still applies.

Okay, okay, we can pass.
You don't think he would be really impressed?

No, of course everybody would be impressed. But it's not a question of being impressed or not; it's a question of knowing what the limits of those systems are. They are impressive, they can do a lot of useful things, there's a whole industry being built around them, and they're going to make progress. But there are a lot of things they cannot do, and we have to realize what they cannot do and then figure out how we get there. And I'm saying this from basically ten years of research on the idea of self-supervised learning, actually going back more than ten years: the idea of basically capturing the internal structure of a set of inputs without training the system for any particular task, learning representations. The conference I co-founded 14 years ago is called the International Conference on Learning Representations. That's the entire issue that deep learning is dealing with, and it's been my obsession for almost 40 years now. So learning representations is really the thing. For the longest time, we could only do this with supervised learning. Then we started working on what we used to call unsupervised learning, and sort of revived the idea of unsupervised learning in the early 2000s with Yoshua Bengio and Geoff Hinton. Then we discovered that supervised learning actually works pretty well if you can collect enough data, and so the whole idea of unsupervised and self-supervised learning took a backseat for a bit, and then I tried to revive it in a big way, starting in 2014, basically, when we started FAIR, really pushing for finding new methods to do self-supervised learning, both for text and for images and video and audio.

And some of that work has been incredibly successful. The reason why we have multilingual translation systems, things to do content moderation on Meta, for example on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not, is due to that progress, using self-supervised learning for NLP, combining this with transformer architectures, and so on. That's the big success of self-supervised learning. We had a similar success in speech recognition, a system called wav2vec, which is also a joint-embedding architecture, by the way, trained with contrastive learning. That system can produce speech recognition systems that are multilingual with mostly unlabeled data, and they only need a few minutes of labeled data to actually do speech recognition. That's amazing. We have systems now, based on those combinations of ideas, that can do real-time translation of hundreds of languages into each other, speech to speech.

Speech to speech, even including, which is fascinating, languages that don't have written forms.

That's right, they're spoken only.

That's right. We don't go through text; it goes directly from speech to speech, using an internal representation of speech units that are discrete. It's called textless NLP; we used to call it that. So, incredible success there. And then, for ten years, we tried to apply this idea to learning representations of images, by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in the video.
And we tried and tried and failed and failed, with generative models, with models that predict pixels. We could not get them to learn good representations of images, we could not get them to learn good representations of videos, and we tried many times. We published lots of papers on it; they kind of sort of work, but not really great. They started working once we abandoned the idea of predicting every pixel and basically just did the joint embedding and predicted in representation space. That works. So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people: everybody is talking about generative AI; if you're really interested in human-level AI, abandon the idea of generative AI.

Okay, but you really think it's possible to get far with the joint-embedding representation. So there's common-sense reasoning, and then there's high-level reasoning. I feel like those are two... the kind of reasoning that LLMs are able to do... okay, let me not use the word reasoning, but the kind of stuff that LLMs are able to do seems fundamentally different from the common-sense reasoning we use to navigate the world. It seems like we're going to need both. With the joint embedding, with the JEPA type of approach looking at video, would you be able to learn, let's see, how to get from New York to Paris, or how to understand the state of politics in the world today? These are things where various humans generate a lot of language and opinions in the space of language, but don't visually represent them in any clearly compressible way.

Right. Well, there are a lot of situations that might be difficult for a purely language-based system to know. Okay, you can probably learn, from reading text, the entirety of the publicly available text in the world, that I cannot get from New York to Paris by snapping my fingers. That's not going to work, right?

Yes.

But there are probably more complex scenarios of this type which an LLM may never have encountered, and it may not be able to determine whether they're possible or not. So that link from the low level to the high level... the thing is that the high level that language expresses is based on the common experience of the low level, which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world; a lot of it is similar. And LLMs don't have that.

But see, it's present. You and I have a common experience of the world in terms of the physics of how gravity works and stuff like this, and that common knowledge of the world, I feel like, is there in the language. We don't explicitly express it, but if you have a huge amount of text, you're going to get this stuff that's between the lines. In order to form a consistent world model, you're going to have to understand how gravity works, even if you don't have an explicit explanation of gravity. So even though, in the case of gravity, there is an explicit explanation of gravity on Wikipedia, the stuff that we think of as common-sense reasoning, I feel like, to generate language correctly, you're going to have to figure that out. Now, you could say, as you have, there's not enough text. Okay, so what? You don't think so?

No, I agree with what you just said,
which is that to have high-level common sense, you need to have the low-level common sense to build on top of. But that's not there in LLMs; LLMs are purely trained from text. But then the other statement you made, I would not agree with: the claim that implicit in all the language in the world is the underlying reality. There's a lot about the underlying reality which is not expressed in language.

Is that obvious to you?

Yeah, totally.

So, like, all the conversations we have... okay, there's the dark web, meaning the private conversations, like DMs and stuff like this, which is probably much, much larger than what LLMs are trained on.

You don't need to communicate the stuff that is common.

But the humor, all of it... no, you do. You don't need to, but it comes through. Like, if I accidentally knock this over, you'll probably make fun of me, and in the content of you making fun of me will be an explanation of the fact that cups fall, and that gravity works in this way. And you'll have some very vague information about what kinds of things explode when they hit the ground, and then maybe you'll make a joke about entropy or something like this, and you'll never be able to reconstruct this again. Okay, you make a little joke like this, and there will be trillions of other jokes, and from the jokes you can piece together the fact that gravity works and mugs can break and all this kind of stuff. You don't need to see it. It'll be very inefficient. It's easier to just not knock the thing over. But I feel like it would be there, if you have enough of that data.

I just think that most of the information of this type that we accumulated when we were babies is just not present in text, in any description, essentially. And the sensory data is a much richer source for getting that kind of understanding. I mean, that's the 16,000 hours of wake time of a four-year-old, and the 10^15 bytes going through vision, just vision. There is a similar bandwidth of touch, and a little less through audio. And then language doesn't come in until about a year into life. And by the time you are nine years old, you've learned about gravity, you know about inertia, you know about stability, you know about the distinction between animate and inanimate objects. By 18 months, you know about why people want to do things, and you help them if they can't. There are a lot of things that you learn mostly by observation, really, not even through interaction. In the first few months of life, babies don't really have any influence on the world; they can only observe. And you accumulate a gigantic amount of knowledge just from that. So that's what we're missing from current AI systems.

I think in one of your slides you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective: why hallucinations happen with large language models, and to what degree that is a fundamental flaw of large language models.

Right. So, because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across the sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases exponentially.
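To make the compounding argument concrete: if each token independently has probability e of stepping outside the set of acceptable continuations, the chance that an n-token answer stays entirely acceptable is (1 - e)^n, which decays exponentially with n. The 1% per-token error rate below is an illustrative assumption, not a measured value:

```python
# Probability of an answer staying within the "correct" set after n tokens,
# under the independence assumption stated above.
e = 0.01                       # assumed per-token error probability (illustrative)
for n in (10, 100, 1000):
    print(n, round((1 - e) ** n, 3))
# 10 -> 0.904, 100 -> 0.366, 1000 -> essentially 0
```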
And if you assume — which is a very strong assumption — that those errors are independent across the sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases exponentially.

So there's a strong, like you said, assumption there that if there's a non-zero probability of making a mistake — which there appears to be — then there's going to be a kind of drift.

Yeah, and that drift is exponential — errors accumulate. So the probability that an answer is nonsensical increases exponentially with the number of tokens.

Is that obvious to you, by the way? Mathematically speaking, maybe, but isn't there a kind of gravitational pull towards the truth? Because on average, hopefully, the truth is well represented in the training set.

No, it's basically a struggle against the curse of dimensionality. The way you can correct for this is that you fine-tune the system by having it produce answers for all kinds of questions that people might come up with. And people are people, so a lot of the questions they have are very similar to each other, so you can probably cover 80% or whatever of the questions people will ask by collecting data, and then you fine-tune the system to produce good answers for all of those things, and it's probably going to be able to learn that, because it's got a lot of capacity to learn. But then there is the enormous set of prompts that you have not covered during training, and that set is enormous: within the set of all possible prompts, the proportion of prompts that have been used for training is absolutely tiny — a tiny, tiny subset of all possible prompts. So the system will behave properly on the prompts it's been either pre-trained or fine-tuned on, but then there is an entire space of things that it cannot possibly have been trained on, because the number is just gigantic. So whatever training the system has been subject to in order to produce appropriate answers, you can break it by finding a prompt that is outside the set of prompts it has been trained on, or things that are similar, and then it will just spew complete nonsense.

When you say prompt, do you mean that exact prompt, or do you mean a prompt that is in many parts very different? Is it that easy to ask a question, or to say a thing, that hasn't been said before on the internet?

People have come up with things where you put essentially a random sequence of characters in the prompt, and that's enough to throw the system into a mode where it's going to answer something completely different than it would have answered without it. So that's a way to jailbreak the system — get it to go outside of its conditioning.

Right. So that's a very clear demonstration of it, but of course that goes outside of what it's designed to do. If you actually stitch together reasonably grammatical sentences, is it that easy to break it?

Yeah. Some people have done things like: you write a sentence in English, or you ask a question in English, and it produces a perfectly fine answer; and then you just substitute a few words by the same words in another language, and all of a sudden the answer is complete nonsense.
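Formalizing the error-accumulation argument above (a sketch in my own notation, not LeCun's): suppose each generated token independently has a fixed probability $e$ of stepping outside the set of reasonable continuations. Then

\[
P(\text{answer still reasonable after } n \text{ tokens}) = (1 - e)^n = e^{\,n \ln(1-e)} \approx e^{-en} \quad \text{for small } e,
\]

which decays exponentially in the answer length $n$. For example, with $e = 0.01$ and $n = 500$ tokens, $(0.99)^{500} \approx 0.7\%$ — which is the force of the argument, and also why it hinges on the strong independence assumption flagged in the conversation.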
So I guess what I'm saying is: which fraction of prompts that humans are likely to generate are going to break the system?

The problem is that there is a long tail. This is an issue a lot of people have realized in social networks and so on: there's a very, very long tail of things that people will ask, and you can fine-tune the system for the 80% or whatever of the things that most people will ask, but then this long tail is so large that you're not going to be able to fine-tune the system for all the conditions. And in the end the system ends up being kind of a giant lookup table, essentially, which is not really what you want — you want systems that can reason, and certainly that can plan. The type of reasoning that takes place in an LLM is very, very primitive, and the reason you can tell it's primitive is that the amount of computation spent per token produced is constant. So if you ask a question, and that question has an answer with a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated: it's the size of the prediction network — with its 36 layers or 92 layers or whatever — multiplied by the number of tokens. That's it. So essentially it doesn't matter whether the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something — the amount of computation the system will be able to devote to the answer is constant, or proportional to the number of tokens produced in the answer. This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, because it's more difficult.

There's a prediction element, there's an iterative element where you're adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so on. Does this mean that's a fundamental flaw of LLMs, or does it mean — well, there's more parts to that question — now you're just behaving like an LLM, immediately answering — or is it that the LLM is just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on? But we need that world model that comes from language; maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

Okay. Whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems. Even if we restrict ourselves to language, just having the ability to plan your answer before you answer — in terms that are not necessarily linked to the language you're going to use to produce the answer — so this idea of a mental model that allows you to plan what you're going to say before you say it — that is very important. I think there are going to be a lot of systems over the next few years that have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs. It's the same difference as the difference between what psychologists call System 1 and System 2 in humans.
System 1 is the type of task that you can accomplish without deliberately, consciously thinking about how you do it; you've done it enough that you can just do it subconsciously, without thinking about it. If you're an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio. If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either — you just recognize the pattern and you play. That's System 1: all the things that you do instinctively, without really having to deliberately plan and think about them. And then there are all the tasks where you need to plan. If you are a not-too-experienced chess player, or you are experienced but play against another experienced chess player, you think about all kinds of options — you think about it for a while, and you're much better if you have time to think about it than if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model — that's System 2 — this is what LLMs currently cannot do. So how do we get them to do this? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems? It's not going to be autoregressive prediction of tokens; it's going to be more something akin to inference of latent variables in what used to be called probabilistic models, or graphical models, and things of that type.

The principle is basically this: the prompt is like an observed variable, and what the model does is basically measure to what extent an answer is a good answer for the prompt. Think of it as some gigantic neural net, but with only one output, and that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model — if you had such a model, you could use it to produce good answers. The way you would do it is: given the prompt, search through the space of possible answers for one that minimizes that number. That's called an energy-based model.

But that energy-based model would need the model constructed by the LLM.

Well, really what you would do is not search over possible strings of text that minimize that energy; you would do this in abstract representation space. So in the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model — which is just a scalar. It's an optimization process. So now the way the system produces its answer is through optimization, by minimizing an objective function, basically. And we're talking about inference here, not training — the system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer; we feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. That, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization before turning it into text, and that is Turing complete.
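In symbols — a sketch of the objective-driven setup just described, with notation chosen here for illustration: given a prompt $x$, an energy function $E_\theta(x, z)$ scores a candidate abstract representation $z$ of the answer, and inference is the minimization

\[
z^\star = \arg\min_{z} E_\theta(x, z), \qquad \text{answer} = \mathrm{Decoder}(z^\star),
\]

where the decoder is the simple autoregressive module that turns the optimized "thought" $z^\star$ into text, and a harder question can simply be given more optimization steps than an easy one.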
Can you explain exactly what the optimization problem there is? What's the objective function? Just linger on it — you briefly described it, but over what space are you optimizing?

The space of representations — those abstract representations. You have an abstract representation inside the system: the prompt goes through an encoder, produces a representation, perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a good answer, because there might be some complicated reasoning you need to do. So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now, let's ignore for a moment the issue of how you train that system to measure whether an answer is a good answer for a question — but suppose such a system could be created.

But what's the process — this kind of search-like process?

It's an optimization process. You can do this if the entire system is differentiable: that scalar output is the result of running the representation of the answer through some neural net, so by gradient descent — by backpropagating gradients — you can figure out how to modify the representation of the answer so as to minimize it.

So that's still gradient-based?

It's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text. And the cool thing about this is that the representation can now be optimized through gradient descent, and it is independent of the language in which you're going to express the answer.

Right, so you're operating in the abstract representation. I mean, this goes back to the joint embedding — that it's better to work in the space of, I don't know, to romanticize the notion, the space of concepts, versus the space of concrete sensory information.

Right.

Okay, but can this do something like reasoning, which is what we're talking about?

Well, not really — only in a very simple way. Basically, you can think of LLMs as doing the kind of optimization I was talking about, except they optimize in a discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. That's incredibly wasteful in terms of computation, because you basically have to run your LLM for every generated sequence. It's much better to do the optimization in continuous space, where you can do gradient descent, as opposed to generating tons of things and then selecting the best — you just iteratively refine your answer to go towards the best one. That's much more efficient, and you can only do this in continuous spaces with differentiable functions.
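A minimal sketch of the gradient-based inference loop LeCun just described, in PyTorch-flavored Python. The module names (encoder, predictor, energy, decoder) are hypothetical placeholders, not an existing API; the point is only that, at inference time, it is the answer representation z that gets optimized, while the trained weights stay frozen.

```python
import torch

def infer_answer(prompt_tokens, encoder, predictor, energy, decoder,
                 steps=50, lr=0.1):
    """Objective-driven inference: refine an abstract answer representation z
    by gradient descent on a scalar energy, then decode z into text."""
    with torch.no_grad():
        x = encoder(prompt_tokens)      # representation of the prompt
        z = predictor(x)                # initial guess at the answer's representation

    z = z.clone().requires_grad_(True)  # z is the only free variable here
    opt = torch.optim.SGD([z], lr=lr)

    for _ in range(steps):              # harder questions could simply get more steps
        opt.zero_grad()
        e = energy(x, z)                # scalar: ~0 if z is a good answer for x
        e.backward()                    # gradients flow into z, not into the weights
        opt.step()

    return decoder(z.detach())          # a simple autoregressive decoder turns z into text
```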
You're talking about the ability to think deeply, to reason deeply — how do you know what is an answer that's better or worse, based on deep reasoning?

Right, so then we're asking the question of how, conceptually, you train an energy-based model. An energy-based model is a function with a scalar output — just a number. You give it two inputs, X and Y, and it tells you whether Y is compatible with X. X is what you observe — let's say a prompt, an image, a video, whatever — and Y is a proposal for an answer, a continuation of the video, whatever. And the way it tells you that Y is compatible with X is that the output of that function is zero if Y is compatible with X, and a positive, non-zero number if Y is not compatible with X.

How do you train a system like this? At a completely general level, you show it pairs of X and Y that are compatible — a question and the corresponding answer — and you train the parameters of the big neural net inside to produce zero. Now, that doesn't completely work, because the system might decide, well, I'm just going to say zero for everything. So you have to have a process to make sure that for a wrong Y, the energy is larger than zero. And there you have two options. One is contrastive methods: you show an X and a bad Y, and you tell the system, give a high energy to this — push up the energy, change the weights in the neural net that computes the energy so that it goes up. Those are contrastive methods. The problem with this is that if the space of Y is large, the number of such contrastive samples you have to show is gigantic. But people do this — when you train a system with RLHF, what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent; we're just not using it for inference, we're just using it for training.

There is another set of methods, which are non-contrastive, and I prefer those. Those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. The precise way to do this — there are all kinds of specific ways to do it, depending on the architecture — but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.

We've been talking very generally, but what is a good X and a good Y? What is a good representation of X and Y? Because we've been talking about language, and if you just take language directly, that presumably is not good — so there has to be some kind of abstract representation of ideas.

Yeah. I mean, you can do this with language directly, by just making X a text and Y a continuation of that text, or X a question and Y the answer.

But you're saying that's not going to take it — I mean, that's going to do what LLMs are doing.

Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside the system there is a latent variable — let's call it Z — that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer. So this kind of system could be trained in a very similar way — very similar — but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on.
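A toy sketch of the two training options just contrasted — contrastive (explicitly push up the energy of a mismatched pair) versus regularized, non-contrastive (only show compatible pairs and penalize the volume of low-energy space). The margin loss and the variance-style regularizer below are illustrative stand-ins chosen here, not a specific recipe quoted from the conversation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(energy, x, y_good, y_bad, margin=1.0):
    # Push the energy of the compatible pair toward zero while keeping the
    # incompatible pair at least `margin` above zero (hinge-style loss).
    e_pos = energy(x, y_good).mean()
    e_neg = energy(x, y_bad).mean()
    return e_pos + F.relu(margin - e_neg)

def regularized_loss(energy, encoder, x, y_good, reg_weight=1.0):
    # Non-contrastive: only compatible pairs are shown. A regularizer
    # (here: keep the per-dimension variance of the codes above a floor)
    # limits how much of the space can collapse to low energy.
    e_pos = energy(x, y_good).mean()
    z = encoder(y_good)                                # codes for a batch of good answers
    variance_penalty = F.relu(1.0 - z.var(dim=0)).mean()
    return e_pos + reg_weight * variance_penalty
```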
Currently, in LLMs, this is done in a way that's very implicit — it's done in a way that people don't realize is being done, but it is. It's due to the fact that when you give a high probability to a word, you automatically give a low probability to the other words, because you only have a finite amount of probability to go around — they have to sum to one. So when you minimize the cross-entropy or whatever, when you train your LLM to predict the next word, you're increasing the probability the system gives to the correct word, but you're also decreasing the probability it gives to the incorrect words. Indirectly, that gives a high probability to sequences of words that are good and a low probability to sequences of words that are bad, but it's very indirect, and it's not obvious why this actually works at all — because you're not doing it on the joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.

So how do you do this for visual data?

We've been doing this with the JEPA architectures — the joint embedding. There, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, or transformed version of that image or video, or a masked version. And the energy of the system is the prediction error of the representation — the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. This system will tell you: if this is a good image and this is a corrupted version, it will give you zero energy if those two things are effectively, one of them, a corrupted version of the other, and a high energy if the two images are completely different.

And hopefully that whole process gives you a really nice compressed representation of reality — of visual reality.

And we know it does, because then we use those representations as input to a classification system, and that classification system works really nicely.
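The image/video JEPA energy just described, written out — notation mine, as a sketch: with an encoder $\mathrm{Enc}$, a predictor $\mathrm{Pred}$, and a corruption operator $c(\cdot)$ (masking, shifting, transforming),

\[
E(x, y) = \big\lVert \mathrm{Pred}\big(\mathrm{Enc}(x)\big) - \mathrm{Enc}(y) \big\rVert^{2}, \qquad x = c(y),
\]

i.e. the prediction error, measured in representation space rather than pixel space, between what the predictor infers from the corrupted view $x$ and the representation of the uncorrupted input $y$; low energy means the two views are compatible. (In practice the two encoders need not share weights — a detail omitted here.)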
Okay. Well, so to summarize, you recommend — in a spicy way that only Yann LeCun can — that we abandon generative models in favor of joint-embedding architectures?

Yes.

Abandon autoregressive generation?

Yes.

Abandon — this feels like court testimony — abandon probabilistic models in favor of energy-based models, as we talked about; abandon contrastive methods in favor of regularized methods. And let me ask you about this: you've been, for a while, a critic of reinforcement learning.

Yes.

So the last recommendation is that we abandon RL in favor of model-predictive control, as you were talking about, and only use RL when planning doesn't yield the predicted outcome — and we use RL in that case to adjust the world model or the critic.

Yes.

So you mentioned RLHF, reinforcement learning with human feedback — why do you still hate reinforcement learning?

I don't hate reinforcement learning, and I don't think it should be abandoned completely, but I think its use should be minimized, because it's incredibly inefficient in terms of samples. The proper way to train a system is to first just have it learn good representations of the world, and world models, from mostly observation, maybe a little bit of interaction.

And then steer it based on that — if the representation is good, then the adjustment should be minimal.

Yeah. Now, there are two things. If you've learned a world model, you can use the world model to plan a sequence of actions to arrive at a particular objective, and you don't need RL — unless the way you measure whether you succeed might be inexact: your idea of whether you're going to fall off your bike might be wrong, or whether the person you're fighting in MMA is going to do one thing rather than another. So there are two ways you can be wrong: either your objective function does not reflect the actual objective function you want to optimize, or your world model is inaccurate — the prediction you were making about what was going to happen in the world is inaccurate. If you want to adjust your world model while you are operating in the world, or your objective function, that is basically the realm of RL — that's what RL deals with, to some extent. One way to adjust your world model, even in advance, is to explore parts of the space where you know your world model is inaccurate. That's called curiosity, basically, or play: when you play, you explore parts of the state space that you don't want to do for real, because it might be dangerous, but you can adjust your world model without killing yourself, basically. So that's what you want to use RL for: when it comes time to learn a particular task, you already have all the good representations, you already have your world model, but you need to adjust it for the situation at hand — that's when you use RL.

Why does RLHF work so well — this reinforcement learning with human feedback? Why did it have such a transformational effect on large language models?

What had the transformational effect is human feedback. There are many ways to use it, and some of it is just purely supervised, actually — it's not really reinforcement learning.

So it's the HF.

It's the HF, yeah. And then there are ways to use human feedback: you can ask humans to rate multiple answers that are produced by the model, and then what you do is train an objective function to predict that rating; then you can use that objective function to predict whether an answer is good, and you can backpropagate gradients through this to fine-tune your system so that it only produces highly rated answers. That's one way. In RL, that means training what's called a reward model — basically a small neural net that estimates to what extent an answer is good. It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning; it's used for fine-tuning your system. I think it would be much more efficient to use it for planning, but currently it's used to fine-tune the parameters of the system. Now, there are several ways to do this; some of them are supervised — you just ask a human person what a good answer for this is, and they type the answer. There are lots of ways that those systems are being adjusted.
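A compact sketch of the reward-model recipe just outlined — fit a scalar "how good is this answer" predictor on human ratings, then adjust the generator against it by backpropagating through the frozen reward model. Function and module names are hypothetical placeholders; in particular, treating the generator's output as differentiable glosses over the discrete-token issue that real RLHF pipelines handle differently.

```python
import torch

def train_reward_model(reward_model, rated_examples, lr=1e-4):
    """rated_examples: iterable of (prompt, answer, human_score) tensors.
    Fits a scalar predictor of how highly humans rate an answer."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for prompt, answer, score in rated_examples:
        pred = reward_model(prompt, answer)       # scalar prediction of the rating
        loss = (pred - score).pow(2).mean()       # simple regression to the human rating
        opt.zero_grad()
        loss.backward()
        opt.step()

def finetune_against_reward(generator, reward_model, prompts, lr=1e-5):
    """Adjust the generator so its (assumed differentiable) outputs score highly;
    the reward model is frozen and only supplies gradients."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for prompt in prompts:
        answer_repr = generator(prompt)           # assumed to be a differentiable representation
        reward = reward_model(prompt, answer_repr)
        loss = -reward.mean()                     # maximize the predicted rating
        opt.zero_grad()
        loss.backward()
        opt.step()
```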
Now, a lot of people have been very critical of the recently released Google Gemini 1.5 for being, essentially — in my words — super woke, in the negative connotation of that word. There are some almost hilariously absurd things that it does, like modifying history — generating images of a Black George Washington — or, perhaps more seriously, something you commented on Twitter: refusing to comment on, or generate images of, or even descriptions of, Tiananmen Square or the Tank Man, one of the most legendary protest images in history — and of course those images are highly censored by the Chinese government. So everybody started asking questions about what the process of designing these LLMs is, what the role of censorship is, all that kind of stuff. And you commented on Twitter, saying that open source is the answer.

Yeah, essentially.

So can you explain?

I actually made that comment on just about every social network I can, and I've made that point multiple times in various forums. Here's my point of view on this. People can complain that AI systems are biased, and they generally are — biased by the distribution of the training data they've been trained on, which reflects biases in society — and that is potentially offensive to some people, or potentially not. And some techniques to de-bias then become offensive to some people, because of historical incorrectness and things like that. So you can ask two questions. The first question is: is it possible to produce an AI system that is not biased? And the answer is absolutely not. And it's not because of technological challenges — although there are technological challenges to that — it's because bias is in the eye of the beholder. Different people may have different ideas about what constitutes bias. For a lot of things, there are facts that are indisputable, but there are a lot of opinions, or things that can be expressed in different ways, and so you cannot have an unbiased system — that's just an impossibility.

So what's the answer to this? The answer is the same answer that we found in liberal democracy for the press: the press has to be free and diverse. We have free speech for a good reason: because we don't want all of our information to come from a single source, because that's the opposite of the whole idea of democracy and the progress of ideas, and even of science. In science, people have to argue for different opinions, and science makes progress when people disagree and they come up with an answer and a consensus forms. And it's true in all democracies around the world.

So there is a future, which is already happening, where every single one of our interactions with the digital world will be mediated by AI systems — AI assistants. We're going to have smart glasses — you can already buy them from Meta, the Ray-Ban Meta — where you can talk to them, and they are connected to an LLM, and you can get answers to any question you have. Or you can be looking at a monument, and there is a camera in the glasses, and you can ask it: what can you tell me about this building or this monument? You can be looking at a menu in a foreign language, and the thing will translate it for you. Or you can do real-time translation if you speak different languages. So a lot of our interactions with the digital world are going to be mediated by those systems in the near future.
Increasingly, the search engines that we're going to use are not going to be search engines; they're going to be dialogue systems that you just ask a question of, and it will answer and then point you to the perhaps appropriate references for it. But here is the thing: we cannot afford those systems to come from a handful of companies on the west coast of the US, because those systems will constitute the repository of all human knowledge, and we cannot have that be controlled by a small number of people. It has to be diverse, for the same reason the press has to be diverse.

So how do we get a diverse set of AI assistants? It's very expensive and difficult to train a base model — a base LLM at the moment; in the future it might be something different, but at the moment that's an LLM — so only a few companies can do this properly. If some of those top systems are open source, anybody can use them, anybody can fine-tune them. If we put in place some systems that allow any group of people — whether they are individual citizens, groups of citizens, government organizations, NGOs, companies, whatever — to take those open-source AI systems and fine-tune them for their own purpose on their own data, then we're going to have a very large diversity of different AI systems that are specialized for all of those purposes. I talk to the French government quite a bit, and the French government will not accept that the digital diet of all their citizens be controlled by three companies on the west coast of the US. That's just not acceptable — it's a danger to democracy, regardless of how well-intentioned those companies are. And it's also a danger to local culture, to values, to language. I was talking with the founder of Infosys in India; he's funding a project to fine-tune Llama 2, the open-source model produced by Meta, so that Llama 2 speaks all 22 official languages of India. It's very important for people in India. I was talking to a former colleague of mine, Moustapha, who used to be a scientist at FAIR and then moved back to Africa, created a research lab for Google in Africa, and now has a new startup, Kera. What he's trying to do is basically have LLMs that speak the local languages in Senegal, so that people can have access to medical information, because they don't have access to doctors — there's a very small number of doctors per capita in Senegal. You can't have any of this unless you have open-source platforms.

So with open-source platforms, you can have AI systems that are diverse not only in terms of political opinions or things of that type, but in terms of language, culture, value systems, technical abilities in various domains. And you can have an industry, an ecosystem of companies that fine-tune those open-source systems for vertical applications in industry. I don't know — a publisher has thousands of books, and they want to build a system that allows a customer to just ask a question about the content of any of their books; you need to train on their proprietary data. Or you have a company — we have one within Meta, it's called Metamate — which is basically an LLM that can answer any question about internal stuff about the company. Very useful. A lot of companies want this — not just for their employees, but also for their customers, to take care of the customers.
So the only way you're going to have an AI industry, the only way you're going to have AI systems that are not uniquely biased, is if you have open-source platforms on top of which any group can build specialized systems. So the inevitable direction of history is that the vast majority of AI systems will be built on top of open-source platforms.

So that's a beautiful vision — meaning a company like Meta or Google or so on should take only minimal fine-tuning steps after building the foundation, pre-trained model — as few steps as possible, basically. Can Meta afford to do that? I don't know if you know this, but companies are supposed to make money somehow, and open source is like giving it away. Mark made a video — Mark Zuckerberg — a very sexy video, talking about 350,000 Nvidia H100s. The math of that, just for the GPUs — that's 100 billion, plus the infrastructure for training everything. So I'm no business guy, but how do you make money on that? The vision you paint is a really powerful one, but how is it possible to make money?

Okay, so you have several business models. The business model that Meta is built around is: you offer a service, and the financing of that service is either through ads or through business customers. For example, if you have an LLM that can help a mom-and-pop pizza place by talking to their customers through WhatsApp, so the customers can just order a pizza and the system will just ask them, what topping do you want, or what size, and so on — the business will pay for that. That's a model. And otherwise, if it's a system on the more classical-services side, it can be ad-supported. There are several models, but the point is: if you have a big enough potential customer base, and you need to build that system anyway for them, it doesn't hurt you to actually distribute it in open source.

Again, I'm no business guy, but if you release the open-source model, then other people can do the same kind of task and compete on it — basically provide fine-tuned models for businesses. Is the bet that Meta is making — by the way, I'm a huge fan of all this — is the bet that Meta is making, like, we'll do a better job of it?

Well, no. The bet is more: we already have a huge user base and customer base.

Ah, right. So it's going to be useful to them.

Whatever we offer them is going to be useful, and there is a way to derive revenue from this. And it doesn't hurt that we provide that system, or the base model — the foundation model — in open source, for others to build applications on top of it too. If those applications turn out to be useful for our customers, we can just buy it from them. It could also be that they will improve the platform — in fact, we see this already: there are literally millions of downloads of Llama 2 and thousands of people who have provided ideas about how to make it better. So this clearly accelerates progress, making the system available to a wide community of people, and there are literally thousands of businesses building applications with it. So Meta's ability to derive revenue from this technology is not impaired by the distribution of base models in open source.
The fundamental criticism that Gemini is getting is that — as you point out, "on the west coast" — just to clarify, we're currently on the east coast, where I would suppose Meta AI headquarters would be. So, strong words about the west coast. But I guess the issue that happens is — I think it's fair to say that most tech people have a political affiliation with the left wing; they lean left — and the problem people are criticizing Gemini with is that in that de-biasing process that you mentioned, their ideological lean becomes obvious. Is this something that can be escaped? You're saying open source is the only way. Have you witnessed this kind of ideological lean that makes engineering difficult?

No, I don't think the issue has to do with the political leaning of the people designing those systems. It has to do with the acceptability, or the political leanings, of their customer base, or audience. A big company cannot afford to offend too many people, so they're going to make sure that whatever product they put out is "safe," whatever that means. And it's very possible to overdo it — and it's also impossible to do it properly for everyone. You're not going to satisfy everyone. So that's what I said before: you cannot have a system that is perceived as unbiased by everyone. You push it one way and one set of people are going to see it as biased, and then you push it the other way and another set of people is going to see it as biased. And then, in addition to this, there's the issue that if you push the system perhaps a little too far in one direction, it's going to be non-factual — you're going to have black Nazi soldiers in the —

We should mention the image generation of black Nazi soldiers, which is not factually accurate.

Right, and can be offensive for some people as well. So it's going to be impossible to produce systems that are unbiased for everyone. The only solution that I see is diversity — and diversity in the full meaning of that word, diversity in every possible way.

Yeah. Marc Andreessen just tweeted today — let me do a TL;DR. The conclusion is: only startups and open source can avoid the issue he's highlighting with big tech. He's asking, can big tech actually field generative AI products? One: ever-escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, "experts," and so on, corrupting the output. Two: constant risk of generating a bad answer, or drawing a bad picture, or rendering a bad video — who knows what it's going to say or do at any moment. Three: legal exposure — product liability, slander, election law, many other things, and so on — anything that makes Congress mad. Four: continuous attempts to tighten grip on acceptable output degrade the model — how good it actually is, in terms of being usable and pleasant to use and effective and all that kind of stuff. And five: publicity of bad text, images, video actually puts those examples into the training data for the next version. And so on. So he just highlights how difficult this is — all kinds of people being unhappy. As you said, you can't create a system that makes everybody happy. So if you're going to do the fine-tuning yourself and keep it closed source, the problem there is then trying to minimize the number of people who are going to be unhappy, and you're saying that's almost impossible to do, and the better way is to do open source.

Basically, yeah.
I mean, Marc is right about a number of the things he lists that indeed scare large companies. Certainly, congressional investigations is one of them; legal liability; making things that get people to hurt themselves or hurt others — big companies are really careful about not producing things of this type, because they don't want to hurt anyone, first of all, and then second, they want to preserve their business. So it's essentially impossible for systems like this — which will inevitably formulate political opinions, and opinions about various things that may be political or not, but that people may disagree about: moral issues, questions about religion and things like that, or cultural issues that people from different communities would disagree about in the first place. There's only a relatively small number of things that people will agree on — basic principles — but beyond that, if you want those systems to be useful, they will necessarily have to offend a number of people, inevitably.

And so open source is just better, and then diversity is better, right? And open source enables diversity.

That's right — open source enables diversity.

That could be a fascinating world, where, if it's true that the open-source world — if Meta leads the way and creates this kind of open-source foundation-model world — governments will have fine-tuned models, and then potentially people who vote left and right will have their own models and preferences and be able to choose, and it will potentially divide us even more. But that's on us humans — we get to figure that out. Basically, the technology enables humans to human more effectively, and all the difficult ethical questions that humans raise — it'll leave it up to us to figure them out.

Yeah, I mean, there are some limits to it — the same way there are limits to free speech, there has to be some limit to the kind of stuff that those systems might be authorized to produce, some guardrails. So that's one thing I've been interested in, which is: in the type of architecture we were discussing before, where the output of a system is the result of an inference to satisfy an objective, that objective can include guardrails, and we can put guardrails in open-source systems. If we eventually have systems that are built with this blueprint, we can put guardrails in those systems that guarantee there is a sort of minimum set of guardrails that make the system non-dangerous and non-toxic, et cetera — basic things that everybody would agree on — and then the fine-tuning that people will add, or the additional guardrails that people will add, will cater to their community, whatever it is.

And the fine-tuning will be more about the gray areas of what is hate speech, what is dangerous, and all that kind of stuff.

Different value systems.

But still — even with the objective of, say, how to build a bioweapon, for example. I think it's something you've commented on, or at least there's a paper where a collection of researchers is trying to understand the social impacts of these LLMs. And I guess one threshold that's nice is: does the LLM make it any easier than a search would — like a Google search would?

Right.
So the increasing number of studies on this seems to point to the fact that it doesn't help. Having an LLM doesn't help you design or build a bioweapon or a chemical weapon if you already have access to a search engine and a library. The sort of increased information you get, or the ease with which you get it, doesn't really help you. That's the first thing. The second thing is: it's one thing to have a list of instructions of how to make a chemical weapon, for example, or a bioweapon; it's another thing to actually build it, and it's much harder than you might think, and an LLM will not help you with that. In fact, nobody in the world — not even countries — uses bioweapons, because most of the time they have no idea how to protect their own populations against them. So it's too dangerous, actually, to ever use, and it's in fact banned by international treaties. Chemical weapons are different — also banned by treaties — but it's the same problem: it's difficult to use them in situations that don't turn against the perpetrators. But you could ask Elon Musk: I can give you a very precise list of instructions for how you build a rocket engine, and even if you have a team of fifty engineers who are really experienced in building them, you're still going to have to blow up a dozen of them before you get one that works. And it's the same with chemical weapons or bioweapons or things like this: it requires expertise in the real world that an LLM is not going to help you with.

And it requires even the common-sense expertise that we've been talking about, which is how to take language-based instructions and materialize them in the physical world — that requires a lot of knowledge that's not in the instructions.

Yeah, exactly. A lot of biologists have posted on this, actually, in response to those claims, saying, do you realize how hard it is to actually do the lab work? This is not trivial.

Yeah, and that's where Hans Moravec comes to light once again. Just to linger on Llama: Mark announced that Llama 3 is coming out eventually — I don't think there's a release date — but what are you most excited about? First of all, Llama 2 that's already out there, and maybe the future Llama 3, 4, 5, 6, 10 — just the future of open source under Meta?

Well, a number of things. There are going to be various versions of Llama that are improvements of previous Llamas — bigger, better, multimodal, things like that. And then, in future generations, systems that are capable of planning, that really understand how the world works — maybe trained from video, so they have some world model — maybe capable of the type of reasoning and planning I was talking about earlier.

How long is that going to take? When is the research you're doing in that direction going to feed into the product line, if you want, of Llama?

I don't know, I can't tell you, and there are a few breakthroughs we have to basically go through before we can get there. But you'll be able to monitor our progress, because we publish our research. So last week we published the V-JEPA work, which is sort of a first step towards training systems from video, and then the next step is going to be world models based on this type of idea — training from video. There's similar work at DeepMind taking place, and also at UC Berkeley, on world models from video. A lot of people are working on this.
I think a lot of good ideas are appearing. My bet is that those systems are going to be JEPA-like; they're not going to be generative models. And we'll see what the future tells. There's really good work by a gentleman called Danijar Hafner, who is now at DeepMind, who has worked on models of this type that learn representations and then use them for planning, or for learning tasks by reinforcement learning. And a lot of work at Berkeley by Pieter Abbeel, Sergey Levine, and a bunch of other people of that type — whom I'm collaborating with, actually, in the context of some grants, with my NYU hat, and then collaborations also through Meta, because the lab at Berkeley is associated with Meta in some way — with FAIR. So I think it's very exciting. I haven't been that excited about the direction of machine learning and AI since ten years ago, when FAIR was started — and before that, thirty, thirty-five years ago, when we were working on ConvNets and the early days of neural nets. I'm super excited because I see a path towards potentially human-level intelligence, with systems that can understand the world, remember, plan, reason. There is some set of ideas to make progress there that might have a chance of working, and I'm really excited about this. What I like is that somewhere we get onto a good direction, and perhaps succeed before my brain turns to a white sauce, or before I need to retire.

Yeah. You're also excited by — is it beautiful to you — just the amount of GPUs involved, the whole training process on this much compute? Just zooming out, looking at Earth: humans together have built these computing devices and are able to train this one brain, and then we open-source it, like giving birth to this open-source brain trained on this gigantic compute system. There are just the details of how to train on that, how to build the infrastructure and the hardware, the cooling, all of this kind of stuff. Or is most of your excitement still in the theory aspect of it — meaning, like, the software?

Well, I used to be a hardware guy many years ago.

Yes, that's true — decades ago. Hardware has improved a little bit. Changed a little bit, yeah.

I mean, certainly scale is necessary but not sufficient. Absolutely, we certainly need computation. We're still far, in terms of compute power, from what we would need to match the compute power of the human brain. This may occur in the next couple of decades, but we're still some ways away, and certainly in terms of power efficiency we're really far. So there's a lot of progress to make in hardware, and right now a lot of the progress is — I mean, there's a bit coming from silicon technology, but a lot of it is coming from architectural innovation, and quite a bit coming from more efficient ways of implementing the architectures that have become popular — basically combinations of Transformers and ConvNets. And so there's still some ways to go until we saturate; then we're going to have to come up with new principles, new fabrication technology, new basic components, perhaps based on different principles than the classical digital CMOS.

Interesting. So you think in order to build AMI, we potentially might need some hardware innovation too?

Well, if you want to make it ubiquitous, yeah, certainly.
Because we're going to have to reduce the power consumption. A GPU today is half a kilowatt to a kilowatt; the human brain is about 25 watts, and a GPU is way below the power of the human brain — you need something like 100,000 or a million of them to match it. So we're off by a huge factor here.

You often say that AGI is not coming soon — meaning not this year, not the next few years, potentially farther away. What's your basic intuition behind that?

So, first of all, it's not going to be an event. The idea, somehow popularized by science fiction and Hollywood, that somebody is going to discover the secret — the secret to AGI, or human-level AI, or AMI, whatever you want to call it — and then turn on a machine and we have AGI: that's just not going to happen. It's not going to be an event; it's going to be gradual progress. Are we going to have systems that can learn from video how the world works and learn good world representations? Yeah. Before we get them to the scale and performance we observe in humans, it's going to take quite a while; it's not going to happen in one day. Are we going to get systems that have a large amount of associative memory, so they can remember stuff? Yeah, but same thing — it's not going to happen tomorrow. There are some basic techniques that need to be developed; we have a lot of them, but getting them to work together with a full system is another story. Are we going to have systems that can reason and plan, perhaps along the lines of the objective-driven AI architectures I described before? Yeah, but before we get this to work properly, it's going to take a while. And before we get all those things to work together, and then on top of this have systems that can learn hierarchical planning, hierarchical representations — systems that can be configured for the many different situations at hand, the way the human brain can — all of this is going to take at least a decade and probably much more, because there are a lot of problems that we're not seeing right now, that we have not encountered, and so we don't know if there is an easy solution within this framework. So it's not just around the corner. I've been hearing people for the last twelve, fifteen years claiming that AGI is just around the corner, and being systematically wrong — and I knew they were wrong when they were saying it; I called it.

Why do you think people have been calling it that way? First of all, from the birth of the term "artificial intelligence," there has been an eternal optimism that's perhaps unlike other technologies. Is it Moravec's paradox — is that the explanation for why people are so optimistic about AGI?

I don't think it's just Moravec's paradox. Moravec's paradox is a consequence of realizing that the world is not as easy as we think. First of all, intelligence is not a linear thing that you can measure with a scalar, with a single number. Can you say that humans are smarter than orangutans? In some ways, yes, but in some ways orangutans are smarter than humans — in a lot of domains that allow them to survive in the forest, for example.

So IQ is a very limited measure of intelligence. Intelligence is bigger than what IQ, for example, measures.

Well, IQ can measure approximately something for humans, because humans come in a relatively uniform form, but it only measures one type of ability, which may be relevant for some tasks but not others.
But then, if you're talking about other intelligent entities for which the basic things that are easy to them are very different, then it doesn't mean anything. So intelligence is a collection of skills, and an ability to acquire new skills efficiently, and the collection of skills that a particular intelligent entity possesses, or is capable of learning quickly, is different from the collection of skills of another one. And because it's a multi-dimensional thing — the set of skills is a high-dimensional space — you can't measure it, and you cannot compare two things as to whether one is more intelligent than the other. It's multi-dimensional.

So you push back against what are called AI doomers a lot. Can you explain their perspective and why you think they're wrong?

Okay, so AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all, and that relies on a whole bunch of assumptions that are mostly false. The first assumption is that the emergence of superintelligence is going to be an event — that at some point we're going to figure out the secret, and we'll turn on a machine that is superintelligent, and because we've never done it before, it's going to take over the world and kill us all. That is false. It's not going to be an event. We're going to have systems that are, like, as smart as a cat — that have all the characteristics of human-level intelligence, but their level of intelligence would be like a cat, or a parrot, maybe, or something — and then we're going to work our way up to make those things more intelligent, and as we make them more intelligent, we're also going to put some guardrails in them and learn how to put in guardrails so they behave properly. And we're not going to do this with just one — it's not going to be one effort; there are going to be lots of different people doing this, and some of them are going to succeed at making intelligent systems that are controllable and safe and have the right guardrails. And if some others go rogue, then we can use the good ones to go against the rogue ones — so it's going to be my smart AI police against your rogue AI. So it's not going to be like we're going to be exposed to a single rogue AI that's going to kill us all; that's just not happening. Now, there is another fallacy, which is the idea that because a system is intelligent, it necessarily wants to take over. There are several arguments that make people scared of this, which I think are completely false as well. One of them is that in nature it seems to be that the more intelligent species end up dominating the others — and even, sometimes, extinguishing the others, sometimes by design, sometimes just by mistake. And so there is a line of thinking by which you say, well, if AI systems are more intelligent than us, surely they're going to eliminate us — if not by design, simply because they don't care about us. And that's just preposterous, for a number of reasons. The first reason is: they're not going to be a species; they're not going to be a species that competes with us; and they're not going to have the desire to dominate, because the desire to dominate is something that has to be hardwired into an intelligent system. It is hardwired in humans; it is hardwired in baboons, in chimpanzees, in wolves — not in orangutans.
The species in which this desire to dominate, or submit, or attain status in other ways exists are social species. Non-social species, like orangutans, don't have it, and they are almost as smart as we are.

And to you, there's no significant incentive for humans to encode that into AI systems — and to the degree they do, there will be other AIs that sort of punish them for it, out-compete them.

Well, there are all kinds of incentives to make AI systems submissive to humans. I mean, this is the way we're going to build them. So then people say, oh, but look at LLMs — LLMs are not controllable. And they're right: LLMs are not controllable. But objective-driven AI — systems that derive their answers by optimization of an objective — have to optimize that objective, and that objective can include guardrails. One guardrail is: obey humans. Another guardrail is: don't obey humans if it's hurting other humans.

I've heard that before somewhere — I don't remember.

Yes, maybe in a book.

But speaking of that book, could there be unintended consequences from all of this?

Of course. This is not a simple problem. Designing those guardrails so that the system behaves properly is not going to be a simple issue for which there is a silver bullet, for which you have a mathematical proof that the system can be safe. It's going to be a very progressive, iterative design: we put those guardrails in such a way that the system behaves properly, and sometimes it's going to do something that was unexpected because a guardrail wasn't right, and we're going to correct it so that it does it right. The idea, somehow, that we can't get it slightly wrong — because if we get it slightly wrong we all die — is ridiculous. We're just going to go progressively. The analogy I've used many times is turbojet design. How did we figure out how to make turbojets so unbelievably reliable? Those are incredibly complex pieces of hardware that run at really high temperatures for twenty hours at a time sometimes, and we can fly halfway around the world on a two-engine jetliner at near the speed of sound — how incredible is this? It's just unbelievable. And did we do this because we invented some general principle of how to make turbojets safe? No — it took decades to fine-tune the design of those systems so that they were safe. Is there a separate group within General Electric or Snecma or whatever that is specialized in turbojet safety? No — the design is all about safety, because a better turbojet is also a safer turbojet, a more reliable one. It's the same for AI: do you need specific provisions to make AI safe? No — you need to make better AI systems, and they will be safe because they are designed to be more useful and more controllable.

So let's imagine an AI system that's able to be incredibly convincing and can convince you of anything. I can at least imagine such a system, and I can see such a system being weapon-like, because it can control people's minds — we're pretty gullible; we want to believe things — and you could see governments using that as a weapon. So, if you imagine such a system, do you think there's any parallel to something like nuclear weapons?

No.

So why is that technology different?
going to be gradual development yeah I mean it might be rapid but it'll be iterative and then we'll be able to kind of respond and so on so that AI system designed by Vladimir Putin or whatever or his minions is going to be trying to talk to every American to convince them to vote for whoever pleases Putin sure or whatever or rile people up against each other as they've been trying to do they're not going to be talking to you they're going to be talking to your AI assistant mhm which is going to be as smart as theirs mhm right right because as I said in the future every single one of your interactions with the digital world will be mediated by your AI assistant so the first thing you're going to ask is is this a scam is this thing telling me the truth it's not even going to be able to get to you because it's only going to talk to your AI assistant and your AI assistant is going to be like a spam filter right you're not even seeing the spam email right it's automatically put in a folder that you never see it's going to be the same thing that AI system that tries to convince you of something is going to be talking to your AI assistant which is going to be at least as smart as it and is going to say this is spam it's not even going to bring it to your attention so to you it's very difficult for any one AI system to take such a big leap ahead to where it can convince even the other AI systems so there's always going to be this kind of race where nobody's way ahead that's the history of the world the history of the world is whenever there is progress someplace there is a countermeasure and it's a cat-and-mouse game well mostly yes but this is why nuclear weapons are so interesting because that was such a powerful weapon that it mattered who got it first you could imagine Hitler Stalin or Mao getting the weapon first and that having a different kind of impact on the world than the United States getting the weapon first but to you nuclear weapons is like you don't imagine a breakthrough discovery and then a Manhattan Project-like effort for AI no as I said it's not going to be an event it's going to be continuous progress and whenever one breakthrough occurs it's going to be widely disseminated really quickly yeah probably first within industry I mean this is not a domain where government or military organizations are particularly innovative and they're in fact way behind and so this is going to come from industry and this kind of information disseminates extremely quickly we've seen this over the last few years right even take AlphaGo this was reproduced within three months even without particularly detailed information right yeah this is an industry that's not good at secrecy no but even if there is just the fact that you know that something is possible that makes you realize that it's worth investing the time to actually do it you may be the second person to do it but you'll do it and the same for all the innovations self-supervised learning Transformers decoder-only architectures LLMs those things you don't need to know exactly the details of how they work to know that it's possible because it's deployed and then it's
getting reproduced and then people who work for those companies move they go from one company to another and the information disseminates what makes the success of the US tech industry and Silicon Valley in particular is exactly that information circulates really really quickly and disseminates very quickly and so the whole region sort of is ahead because of that circulation of information so maybe just to linger on the psychology of AI doomers you give in the classic Yann LeCun way a pretty good example of just when a new technology comes to be you say engineer says I invented this new thing I call it a ballpoint pen and then the Twittersphere responds OMG people could write horrible things with it like misinformation propaganda hate speech ban it now then writing doomers come in akin to the AI doomers imagine if everyone can get a ballpoint pen this could destroy society there should be a law against using ballpoint pens to write hate speech regulate ballpoint pens now and then the pencil industry mogul says yeah ballpoint pens are very dangerous unlike pencil writing which is erasable ballpoint pen writing stays forever government should require a license for pen manufacturers I mean this does seem to be part of human psychology when it comes up against new technology so what deep insights can you speak to about this well there is a natural fear of new technology and the impact it can have on society and people have kind of an instinctive reaction to the world they know being threatened by major transformations that are either cultural phenomena or technological revolutions and they fear for their culture they fear for their jobs they fear for the future of their children and their way of life right so any change is feared and you see this through long history like any technological revolution or cultural phenomenon was always accompanied by groups or reactions in the media that basically attributed all the current problems of society to that particular change right electricity was going to kill everyone at some point the train was going to be a horrible thing because you can't breathe past 50 kilometers an hour and so there's a wonderful website called the Pessimists Archive right which has all those newspaper clips of all the horrible things people imagined would arrive because of either technological innovation or a cultural phenomenon there are wonderful examples of jazz or comic books being blamed for unemployment or young people not wanting to work anymore and things like that right and that has existed for centuries and it's knee-jerk reactions the question is do we embrace change or do we resist it and what are the real dangers as opposed to the imagined ones so I think one thing people worry about with big tech something we've been talking about over and over but I think worth mentioning again they worry about how powerful AI will be and they worry about it being in the hands of one centralized power just a handful of companies in central control and so that's the skepticism with big tech these companies can make a huge amount of money and control this technology and by so doing take advantage of and abuse the little guy in society well that's exactly why we need open source platforms
yeah I just wanted to nail the point home more and more yes so let me ask you like I said you do get a little bit flavorful on the internet Joscha Bach tweeted something that you LOLed at in reference to HAL 9000 quote I appreciate your argument and I fully understand your frustration but whether the pod bay doors should be opened or closed is a complex and nuanced issue so you're at the head of Meta AI this is something that really worries me that our AI overlords will speak down to us with corporate speak of this nature and you sort of resist that with your way of being is this something you can just comment on of working at a big company how you can avoid the over-fearing I suppose where through caution you create harm yeah again I think the answer to this is open source platforms and then enabling a widely diverse set of people to build AI assistants that represent the diversity of cultures opinions languages and value systems across the world so that you're not bound to just be brainwashed by a particular way of thinking because of a single AI entity so I mean I think it's a really really important question for society and the problem I'm seeing which is why I've been so vocal and sometimes a little sardonic about it never stop never stop Yann we love it is because I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else that if we really want diversity of opinion AI systems in this future where we'll all be interacting through AI systems we need those to be diverse for the preservation of a diversity of ideas and creeds and political opinions and whatever and the preservation of democracy and what works against this is people who think that for reasons of security we should keep AI systems under lock and key because it's too dangerous to put it in the hands of everybody because it could be used by terrorists or something that would lead to potentially a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems do you trust humans with this technology to build systems that are on the whole good for humanity isn't that what democracy and free speech is all about I think so do you trust institutions to do the right thing do you trust people to do the right thing and yeah there's bad people who are going to do bad things but they're not going to have superior technology to the good people so then it's going to be my good AI against your bad AI right I mean there's the examples that we were just talking about of maybe some rogue country will build some AI system that's going to try to convince everybody to go into a civil war or something or elect a favorable ruler but then they will have to go past our AI systems right an AI system with a strong Russian accent will be trying to convince us and doesn't put any articles in their sentences well it'll be at the very least absurdly comedic okay so since we talked about sort of the physical reality I'd love to ask your vision of the future with robots in this physical reality so many of the kinds of intelligence you've been speaking about would empower robots to be more effective collaborators with us humans so since Tesla's Optimus team has been showing
off some progress on humanoid robots I think it really reinvigorated the whole industry that I think Boston Dynamics has been leading for a very very long time so now there's all kinds of companies Figure AI obviously Boston Dynamics Unitree but there's like a lot of them it's great it's great I mean I love it so do you think there'll be millions of humanoid robots walking around soon not soon but it's going to happen like the next decade I think is going to be really interesting in robots like the emergence of the robotics industry has been in the waiting for 10 20 years without really emerging other than for kind of preprogrammed behavior and stuff like that and the main issue is again the Moravec paradox like how do we get those systems to understand how the world works and kind of plan actions and so we can do it for really specialized tasks and the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models and careful planning in advance which is very classical robotics with a lot of innovation a little bit of perception but still like they can't build a domestic robot right and we're still some distance away from completely autonomous level five driving and we're certainly very far away from having level five autonomous driving by a system that can train itself by driving 20 hours like any 17-year-old so until we have again world models systems that can train themselves to understand how the world works we're not going to have significant progress in robotics so a lot of the people working on robotic hardware at the moment are betting or banking on the fact that AI is going to make sufficient progress towards that and they're hoping to discover a product in it too yeah before you have a really strong world model there'll be an almost strong world model and people are trying to find a product in a clumsy robot I suppose like not a perfectly efficient robot so there's the factory setting where humanoid robots can help automate some aspects of the factory I think that's a crazy difficult task because of all the safety required and all this kind of stuff I think in the home is more interesting but then you start to think I think you mentioned loading the dishwasher right yeah like I suppose that's one of the main problems you're working on I mean there's cleaning up cleaning the house clearing up the table after a meal washing the dishes all those tasks cooking I mean all the tasks that in principle could be automated but are actually incredibly sophisticated really complicated but even just basic navigation around an unknown space full of uncertainty that sort of works like you can sort of do this now navigation is fine well navigation in a way that's compelling to us humans is a different thing yeah it's not going to be necessarily I mean we have demos actually because there is a so-called embodied AI group at FAIR and they've been not building their own robots but using commercial robots and you can tell a robot dog like go to the fridge and they can actually open the fridge and they can probably pick up a can in the fridge and stuff like that and bring it to you so it can navigate can grab objects as long as it's been trained to recognize them
which you know vision systems work pretty well nowadays but it's not like a completely general robot that would be sophisticated enough to do things like clearing up the dinner table yeah to me that's an exciting future of getting humanoid robots robots in general into the world more and more because that gets humans to really directly interact with AI systems in the physical space and in so doing it allows us to philosophically psychologically explore our relationships with robots it can be really really interesting so I hope you make progress on the whole JEPA thing soon well I mean I hope things kind of work as planned I mean again we've been kind of working on this idea of self-supervised learning from video for 10 years and only made significant progress in the last two or three and actually you've mentioned that there's a lot of interesting breakthroughs that can happen without having access to a lot of compute yeah so if you're interested in doing a PhD in this kind of stuff there's a lot of possibilities still yeah to do innovative work so like what advice would you give to an undergrad that's looking to go to grad school and do a PhD so basically I've listed them already this idea of how do you train a world model by observation and you don't have to train necessarily on gigantic data sets I mean it could turn out to be necessary to actually train on large data sets to have emergent properties like we have with LLMs but I think there's a lot of good ideas that can be done without necessarily scaling up then there is how you do planning with a learned world model if the world the system evolves in is not the physical world but is the world of let's say the internet or some sort of world where an action consists in doing a search in a search engine or interrogating a database or running a simulation or calling a calculator or solving a differential equation how do you get a system to actually plan a sequence of actions to give the solution to a problem and so the question of planning is not just a question of planning physical actions it could be planning actions to use tools for a dialogue system or for any kind of intelligent system and there's some work on this but not a huge amount some work at FAIR one called Toolformer which was a couple years ago and some more recent work on planning but I don't think we have like a good solution for any of that then there is the question of hierarchical planning so the example I mentioned of planning a trip from New York to Paris that's hierarchical but almost every action that we take involves hierarchical planning in some sense and we really have absolutely no idea how to do this like there's zero demonstration of hierarchical planning in AI where the various levels of representations that are necessary have been learned we can do like two-level hierarchical planning when we design the two levels so for example you have like a legged robot a dog robot right you want it to go from the living room to the kitchen you can plan a path that avoids the obstacles and then you can send this to a lower-level planner that figures out how to move the legs to kind of follow that trajectory right so that works but that two-level planning is designed by hand right we specify what the proper levels of abstraction and the representations at each level of abstraction have to be
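[A minimal illustrative sketch, not from the conversation, of what such a hand-designed two-level planner might look like: it assumes a coarse occupancy grid as the high-level abstraction, breadth-first search as the high-level path planner, and a stub standing in for the low-level leg controller. All of these choices (the grid, the search, the waypoint interface between the two levels) are specified by the engineer rather than learned, which is exactly the limitation being discussed.]

```python
# Hypothetical two-level planner sketch: high level plans over a coarse grid,
# low level is a stand-in for a hand-crafted gait controller.
from collections import deque

def high_level_path(grid, start, goal):
    """Breadth-first search over an occupancy grid (0 = free, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no collision-free path found

def low_level_controller(waypoint):
    """Stand-in for the hand-designed leg/gait controller that tracks one waypoint.
    A real robot would output joint commands; here we just report the step."""
    return f"step toward cell {waypoint}"

# toy map: living room at (0, 0), kitchen at (3, 3), a wall of obstacles in between
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
path = high_level_path(grid, (0, 0), (3, 3))
for waypoint in path[1:]:
    print(low_level_controller(waypoint))
```

[A learned hierarchical planner would, in contrast, have to discover both the grid-like abstraction and the waypoint interface from experience, which is the open problem described here.]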
how do you learn this how do you learn that hierarchical representation of action plans right with deep learning we can train a system to learn hierarchical representations of percepts mhm what is the equivalent when what you're trying to represent are action plans for action plans yeah so you want basically a robot dog or humanoid robot that turns on and travels from New York to Paris all by itself for example all right they might have some trouble at the TSA but yeah no but even doing something fairly simple like a household task sure like cooking or something yeah there's a lot involved it's a super complex task and once again we take it for granted what hope do you have for the future of humanity we're talking about so many exciting technologies so many exciting possibilities what gives you hope when you look out over the next 10 20 50 100 years if you look at social media there's wars going on there's division there's hatred all this kind of stuff that's also part of humanity but amidst all that what gives you hope well we can make humanity smarter with AI okay I mean AI basically will amplify human intelligence it's as if every one of us will have a staff of smart AI assistants they might be smarter than us they'll do our bidding perhaps execute a task in ways that are much better than we could do ourselves because they'll be smarter than us and so it's like everyone would be the boss of a staff of super smart virtual people so we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people some of whom are more intelligent than us I certainly have a lot of experience with this of having people working with me who are smarter than me that's actually a wonderful thing so having machines that are smarter than us that assist us in all of our tasks our daily lives whether it's professional or personal I think would be an absolutely wonderful thing because intelligence is the commodity that is most in demand that's really what I mean all the mistakes that humanity makes are because of lack of intelligence really or lack of knowledge which is related so making people smarter can only be better I mean for the same reason that public education is a good thing and books are a good thing and the internet is also a good thing intrinsically and even social networks are a good thing if you run them properly it's difficult but you can because it helps the communication of information and knowledge and the transmission of knowledge so AI is going to make humanity smarter and the analogy I've been using is that perhaps an equivalent event in the history of humanity to what might be provided by AI is the invention of the printing press it made everybody smarter the fact that people could have access to books books were a lot cheaper than they were before and so a lot more people had an incentive to learn to read which wasn't the case before and people became smarter it enabled the Enlightenment right there wouldn't be an Enlightenment without the printing press it enabled philosophy rationalism escape from religious doctrine democracy science and certainly without this there wouldn't have been the American Revolution or the
French Revolution and so we would still be under feudal regimes perhaps and so it completely transformed the world because people became smarter and kind of learned about things now it also created 200 years of essentially religious conflicts in Europe right because the first thing that people read was the Bible and realized that perhaps there was a different interpretation of the Bible than what the priests were telling them and so that created the Protestant movement and created the rift and in fact the Catholic Church didn't like the idea of the printing press but they had no choice and so it had some bad effects and some good effects I don't think anyone today would say that the invention of the printing press had an overall negative effect despite the fact that it created 200 years of religious conflicts in Europe now compare this and I thought I was very proud of myself to come up with this analogy but realized someone else came up with the same idea before me compare this with what happened in the Ottoman Empire the Ottoman Empire banned the printing press for 200 years and it didn't ban it for all languages only for Arabic you could actually print books in Latin or Hebrew or whatever in the Ottoman Empire just not in Arabic and I thought it was because the rulers just wanted to preserve the control over the population and the religious dogma and everything but after talking with the UAE minister of AI Omar he told me no there was another reason and the other reason was that it was to preserve the corporation of calligraphers right there's like an art form which is writing those beautiful yes Arabic poems or whatever religious texts and it was a very powerful corporation of scribes basically that kind of ran a big chunk of the Empire and you couldn't put them out of business so they banned the printing press in part to protect that business now what's the analogy for AI today like who are we protecting by banning AI like who are the people who are asking that AI be regulated to protect their jobs and of course there's a real question of what is going to be the effect of a technological transformation like AI on the job market and the labor market and there are economists who are much more expert at this than I am but when I talk to them they tell us we're not going to run out of jobs this is not going to cause mass unemployment this is just going to be a gradual shift of different professions the professions that are going to be hot 10 or 15 years from now we have no idea today what they're going to be the same way if we go back 20 years in the past like who could have thought 20 years ago that the hottest job even like 5 10 years ago would be mobile app developer like smartphones weren't invented most of the jobs of the future might be in the metaverse well it could be yeah but the point is you can't possibly predict but you're right I mean you made a lot of strong points and I believe that people are fundamentally good and so if AI especially open source AI can make them smarter it just empowers the goodness in humans so I share that feeling okay I think people are fundamentally good and in fact a lot of doomers are doomers because they don't think that people are fundamentally good and they either don't trust people or they don't trust the institutions to do the right thing so that
people behave properly well I think both you and I believe in humanity and I think I speak for a lot of people in saying thank you for pushing the open source movement pushing to make both research in AI open source making it available to people and also the models themselves open source so thank you for that and thank you for speaking your mind in such colorful and beautiful ways on the internet I hope you never stop you're one of the most fun people I know and get to be a fan of so yeah thank you for speaking to me once again and thank you for being you thank you Lex thanks for listening to this conversation with Yann LeCun to support this podcast please check out our sponsors in the description and now let me leave you with some words from Arthur C. Clarke the only way to discover the limits of the possible is to go beyond them into the impossible thank you for listening and hope to see you next time