Hello, this is Deep Learning: Classics and Trends, a reading group we've been running for five years now, and I like doing things differently every now and then, just to throw in something creative. This time, I'd been reading so many Minecraft papers and projects on Twitter, and I was just amazed that at that very moment a lot of Minecraft energy was bursting out, and I was curious why. So I thought, let's bring the different teams together and do this jam session. It was also easy because DLCT happens every week, and I sort of know everyone from past work relationships and social events, so I just put everyone together in a thread and this happened. Yay. So this is kind of like a panel, but we go very casual, that's the vibe of DLCT, so feel free to keep it casual. Usually during a talk people also ask questions as we go; I think we can allow questions during this one too. What do you guys think? That's good to me. Okay, sounds great. Of course. Okay, to get started, I think maybe each of you can give a brief intro of yourself and your Minecraft project team, and maybe one sentence about why you think Minecraft is a good problem to work on. We can start with, on my screen, Sebastian. Sure, should I share some slides about what we do? I don't know. Okay, let's see if the sharing... we lost Sebastian. I think we lost him; he muted himself and turned his camera off. [Laughter] Can you still see me now? You came back, and we can hear you, but no slides. No slides. Let's see if this works on my phone, maybe. How to share slides from your phone... that's not a phone? I'm very impressed. I know, yeah. Well, maybe it doesn't work. All right, I'll try one more thing, and then I'll just say I don't have my laptop. Okay, then I'll just talk a little bit about this
project. No, it's working, I see your phone. You have four WhatsApp messages you haven't checked yet. This is amazing. Don't look too closely at what messages I get; I'll try to get through this real quick. Do you see the Minecraft thing? This is the first time I've seen this; it looks perfect from a phone. You see the slideshow? It's perfect, yes, it looks amazing. Okay, perfect. So we started a few years ago to organize this Minecraft open-endedness challenge, basically, and the team is Djordje, Rasmus, Elias, Claire, a few people from the IT University of Copenhagen. Our motivation was that you see all this great diversity in nature, and we want to create algorithms that can invent similar things in an open-ended fashion; this is a challenge that Jeff is also working on, and Ken and Joel. So we want to create algorithms that can, in an open-ended way, generate interesting artifacts and never stop inventing, similar to what evolution did in nature. We were looking at different artificial-life frameworks, and we thought they're all interesting, they allow us to answer interesting questions, but there's no real framework that lets us study this question of open-endedness in the sense that, if you let it run, you would be surprised by what comes out of the system, by what things it would invent that you didn't even think about. And that's why we thought Minecraft is the perfect platform for this, because people have been able to build all kinds of things in Minecraft: computers simulating Atari games, neural networks, robots, basically anything you can imagine. With the redstone components you can actually build circuits that do computations; you can build fully functioning word processors. Basically anything you can imagine, you can build in this game. So, can we make an algorithm, that was the goal, that could create
similar things automatically, that could invent a computer or all kinds of other things in this game? I think in contrast to some other of these competitions, we're not focused so much on the agent; rather, the API directly allows you to spawn blocks, delete blocks, and read blocks in the environment. So it's more about what kinds of structures could be made, not so much from an NPC perspective: we allow direct manipulation of blocks in the environment and provide a very simple API, which nevertheless allows you to run things like optimization and evolution. And then we ran this competition where we wanted to see what kind of... Sebastian, you're muted for whatever reason. Yeah, we lost your audio, but we can still see the beautiful slides and videos. Maybe when the video starts playing it mutes you or something? That's a pretty good hypothesis. Go on to your next slide and see if you come back. Sebastian, can you hear us? He might not be able to hear us. Oh, because this is also a video. Let's just comment over his slides. So it looks like this may be one of the winners, I'm guessing; let's see what they came up with in the competition. I suppose this is procedural generation inside a procedurally generated world. It's really cool. Yeah, it procedurally adds in one block after another, and it's not repeating like the previous one: the building has different rooms in every block, I guess it's algorithmically generated. This is a fun experience, us commentating. All right, and then I'm back. Did you guys see everything? We couldn't hear you the whole time, Sebastian, I'm sorry. You couldn't hear me? No, you were muted, but we got the gist. Oh, can you hear me now? Yes. So when you're playing the videos, I think you're
automatically muted. Oh, that's when it muted me, okay. Well, just check out the site evocraft.live, which explains a lot about the framework. [Music] So what did the winners' creations tell you through this competition? Last year the winner did this evolution of circuits in Minecraft: in an open-ended way it would invent novel circuits using the redstone components. And this year the winner used CLIP, the neural network, to invent all kinds of novel images and create family trees of different concepts; so you have the concept of a tree, the concept of a table, the concept of a chair, and it makes all these structures in Minecraft. So there are a lot of possibilities. What's nice, I think, is that this competition allows people to do very different things; it's not focused on one particular thing, just on systems that create things in an open-ended manner, and there are a lot of different approaches people are taking in that direction. That's cool. Okay, then we'll move on to Jeff, you're next on my screen. We've all known Sebastian for a long time; were you inspired by the competition, was there any linkage in how you decided to work on Minecraft? I love Sebastian's work, but I cannot say that Sebastian's work inspired our particular work. I think it's probably a common cause, which is the idea that, as Sebastian mentioned, many of us, and he listed some names, Joel, Ken, myself, Sebastian, others, have been fascinated by this question of how you can recreate the open-ended explosion of complexity that you see in the natural world, where a simple process produces jaguars, hawks, human brains, three-toed sloths, dolphins, you name it. And one of the things you quickly run
into when you work on open-endedness is this problem that, if you're in a very simple, restricted domain, it's hard to really push the algorithms to their limits and put them to the test of what they would do in a very rich, open-ended environment. So it's almost natural: if you think about what's one of the most open-ended environments out there, Minecraft is it. It's a sandbox game, it doesn't have a clear objective, and Sebastian just showed all these different things; there's an infinite number of possible things you can do in there. Another thing that really motivated us, and when I say "us" I'm talking about the team at OpenAI that put together VPT, the Video PreTraining paper and project; if people aren't familiar with that, you can quickly Google it and you'll see the list of the wonderful team members. One of the things we were thinking there is that one of the clear trends in the last few years of machine learning is that AI will see farther and move faster if it stands on the shoulders of giant human data sets, which is a play on the Isaac Newton quote. We can dramatically catalyze, speed up, and improve the ability of AI to do anything if we pre-train first on human data. And Minecraft has a double win for us: one, it's a huge open-ended environment to play in, and two, there's a tremendous amount of human data out there. So we were motivated to try to unlock the amount of data that you see on YouTube and directly be able to learn by watching YouTube videos. And I know that you and Jim have also done a fantastic job of thinking hard about the potential that's there for pre-training off of human data, and their MineDojo project, which I'm sure they're going to tell you about, is a wonderful example of that as well, which happened in parallel; so obviously we were thinking along very similar lines independently.
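The "stand on the shoulders of giant human data sets" recipe Jeff is describing, and the VPT pipeline he walks through in more detail later (train an inverse dynamics model on a little labeled contractor data, pseudo-label action-free video, then behavioral-clone at scale), can be sketched in miniature. Everything below is a toy stand-in on a one-dimensional gridworld; none of the names, data, or models correspond to the real VPT code:

```python
# Toy sketch of the VPT recipe (all names and data invented for
# illustration; the real system uses deep nets over pixels):
#   1. train an inverse dynamics model (IDM) on a small labeled
#      "contractor" set of (state, action, next_state) triples,
#   2. use the IDM to pseudo-label action-free "YouTube" state sequences,
#   3. behavioral-clone a policy from the pseudo-labeled data.
from collections import Counter, defaultdict

def train_idm(labeled):
    """IDM: infer the action from the transition it caused.
    Here the 'model' is just a lookup over position deltas."""
    table = {s2 - s: a for s, a, s2 in labeled}
    return lambda s, s2: table.get(s2 - s, "stay")

def pseudo_label(idm, state_seqs):
    """Label every transition in the action-free corpus with the IDM."""
    return [(s, idm(s, s2)) for seq in state_seqs for s, s2 in zip(seq, seq[1:])]

def behavior_clone(data):
    """Policy: majority-vote action per state (a stand-in for a neural
    net trained with behavioral cloning)."""
    votes = defaultdict(Counter)
    for s, a in data:
        votes[s][a] += 1
    return lambda s: votes[s].most_common(1)[0][0] if s in votes else "stay"

# A tiny labeled set covering each action, then a large unlabeled corpus
# in which every demonstrator walks right toward state 5.
labeled = [(0, "right", 1), (1, "stay", 1), (2, "left", 1)]
idm = train_idm(labeled)
corpus = [[0, 1, 2, 3, 4, 5] for _ in range(100)]
policy = behavior_clone(pseudo_label(idm, corpus))
```

The point of the sketch is the leverage: three labeled transitions are enough to pseudo-label five hundred unlabeled ones, and the cloned policy then heads toward the goal from any visited state.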
So those are the reasons we were really interested in Minecraft. But I should also point out that I don't think Minecraft is an end in itself; it just had those two properties, and there are other things that have those properties, and eventually we probably want to work on other domains as well. But I do think Minecraft is a fantastic test bed, because it's got the human data, it's open-ended, and it's also quite hard. Especially if you do what we did, which is play with the keyboard and the mouse, you're really starting to take AI out of these simplified environments where we make everything really easy, where we have macros so agents can do complex things at the push of a button. Instead you say: you have to deal with very long action sequences, tens or hundreds of thousands of actions; you've got to deal with extended time scales, huge action spaces, 3D worlds, point-of-view pixels. We're starting to see AI transition from toy environments to human environments. Humans have spent a ridiculous amount of time in Minecraft because it's complex and interesting and open-ended, and now we can start to see AI do that as well. Cool. So Jim, you can jump in whenever; I think Jim has some slides to present, maybe he can go first. Yeah, sounds good. First of all, thank you so much, Sebastian and Jeff, for introducing Minecraft, so I don't need to reintroduce it, and you've covered all the motivations; I vigorously agree on all counts, that's why I'm working on this in the first place. Totally great points you guys have raised. Let me share my slides; I have some cool pictures and videos, so everyone relax and enjoy the show. Let me share my screen. All right, can everyone see the PowerPoint? Yep. Awesome. Also thanks, Rosanne and ML Collective, for hosting this great event. My name is Jim, and on behalf of the NVIDIA team I'm very
excited to share with you an initiative called MineDojo, built on Minecraft. Jeff has already covered some of these main points, so I'll just do a quick summary. The grand goal we want to pursue is to build a generally capable agent. But what does that mean? We argue that a generalist agent should have three properties. First, instead of maximizing some score, say in Atari games, the agent should be able to discover and pursue very complex and open-ended objectives. Second, a generalist agent, as its name implies, must be able to do a large number of tasks, or even better, can we prompt it with arbitrary natural-language commands, just like GPT-3? And third, the agent should understand how the world works through massive pre-trained knowledge, another great point Jeff covered, instead of training from scratch every time. Towards this end we introduce MineDojo, a framework built on Minecraft for AI research. MineDojo implements a recipe of three ingredients towards a generalist agent: first, an open-ended environment; next, an internet-scale knowledge base, which VPT also touched on; and finally, can we learn a foundation model for embodied agents using the simulator and the massive dataset? Let's zoom into the first one. MineDojo features a massive benchmarking suite of more than 3,000 tasks; to our knowledge this is among the largest open-source agent benchmarks created. It is enabled by a versatile simulator that unlocks the full potential of Minecraft: for example, it supports multimodal observation and action spaces, and it can be customized in every detail to make a wide variety of environments. Given the simulator, we introduce around 1,500 programmatic tasks, which are tasks that have ground-truth success conditions; examples are harvesting different resources, learning to use a tech tree of tools, and combating various monsters. But Minecraft is more than that, so we also introduce 1,500 creative tasks that are free-form and open-ended. For example, let's say we want the agent to build a
house, but it's really hard to define what makes a house a house. So these tasks don't have a simple check; human judgment or learned scoring methods should be used. Moving to the second ingredient: MineDojo features an internet knowledge base of three parts. The first is YouTube. We find that Minecraft is among the most streamed games on YouTube, and gamers love to narrate what they're doing, so we collected more than 700,000 videos with two billion words in the transcripts, and this provides very rich learning material on human strategies and creativity. Second, players have compiled a large Minecraft-specific wiki that explains every entity and mechanism you need to know in the game. We scraped 7,000 wiki pages with interleaving multimodal data; for example, there is a gallery of all the monsters, all explained in the wiki, and the crafting recipes, everything you need to know. Finally, the Minecraft subreddit is a very active forum where players showcase their creations and also ask questions for help. There are examples of how people essentially use this subreddit as a Stack Overflow for Minecraft, and we can see that the top-voted answers are actually pretty good. Cool. So now, given the massive task suite and the internet data, we have the essential components to train foundation models, and in MineDojo we propose a new algorithm that takes a baby step towards this ideal of a generalist agent. The idea is quite simple: from our YouTube database we have time-aligned video clips and transcripts in natural language, so we can train a contrastive video-text model called MineCLIP, in the same spirit as the OpenAI CLIP. Intuitively, the model learns to associate a video with the text that describes its content. Then we repurpose this MineCLIP model as a language-conditioned learned reward function for our agent. So here's our agent interacting with the simulator; let's say the task is "shear sheep to obtain wool". As the agent explores, it
generates a video snippet, which can be encoded and fed to the MineCLIP model. Intuitively, the higher the association score, the more the agent's behavior aligns with the language description, so that becomes an effective reward function for any reinforcement learning algorithm you like. Here's our learned agent behavior on various tasks; it is a multi-task agent that we train. But still, I want to point out that what we can do is pretty far from the awesome things that human players have demonstrated, such as recreating the entire Hogwarts, or even simulating a CPU circuit, as Sebastian already covered. So here's a call to action for the community: if humans can do these mind-blowing tasks, then why not our AI? That concludes a whirlwind tour of MineDojo. A lot more content is available on our website, minedojo.org, so please visit the website and let us know what you think. Our simulator suite, database, and MineCLIP models are open source. Another big shout-out to Jeff's team: you guys have open-sourced the VPT checkpoints and the human contractor data as well, so all of these are complementary and will be very useful for the community; please check out all of these resources. This project is big teamwork at NVIDIA; our co-advisor, Professor Yuke Zhu, is also here to answer any questions you have. So thank you all so much, and back to Rosanne. Cool. So let me zoom into that pre-training part, because I think that's the part that MineDojo and VPT sort of share. You both use video, and in MineDojo's case you have paired text, what you call captions. What about in your case, Jeff, in VPT's case? Yeah, so I love so much of the vision in the MineDojo project. We had thought a lot about taking advantage of things like the captions, which we played with and put in our appendix; we didn't even think about the Reddit stuff, so that's really cool. So we don't have those pieces; what we do have is
something a little bit different. The MineDojo team did something where they took videos, but, as they mention in their paper, they don't have the action labels for the videos, so they did a contrastive kind of learning where you're learning representations from the video. You're harnessing these videos to learn how to see and know what's going on in the world, but you don't necessarily end up knowing how to act with that kind of contrastive learning. We were interested in seeing whether we could actually get all the way to an agent that learns how to act via pre-training, which is to say it knows "in this situation I should take this action," and to do that we needed the actions that the humans were taking. So we had this idea that we would basically try to solve the task of: if I can watch a video on YouTube, I can probably figure out what action must have been taken at each point in time. The way we do this is we take a relatively small amount of contractor data; we used 2,000 hours in the end, though it turned out we probably only needed 100 to 200 hours. We pay contractors to play Minecraft and record their actions as they're playing the game, and then we train a deep neural net to say: if I can look at the past and the future, what action must have been taken in the middle? That's actually not that hard of a task, right? If you see a hand move, then you probably hit the karate-chop button, or if you see a block get placed, you must have hit the place-block button, et cetera. That's a relatively simple task. This is a general principle that works across almost any game or situation, because the physics of the world tend to be consistent and not that complicated, especially in video games, so it's not that hard to know, oh, you hit jump, you hit forward, you hit backward, et cetera. Once you learn that model that knows how to look at past and future and infer the action that must have been taken, you can
take that model and run it on all of the videos on YouTube and label them with the actions that must have been taken, and now you have a gigantic database of video-to-action mappings. Then you can just train a model via behavioral cloning, imitation learning, to go from state to action, and pre-train at scale. So we ended up pre-training on 70,000 hours, which is about eight years of videos of people playing Minecraft, where now we have the actions and we just do imitation learning, behavioral cloning, at scale. Then you're basically in the GPT style of learning; that's why we named it VPT as a nod to GPT, where VPT stands for Video PreTraining. The model does all the things that GPT does: it comes out of the gate with a lot of zero-shot capabilities, where it can do pretty impressive things, and then with a little bit of fine-tuning you can get it to go off and do all sorts of very complicated things. We showed in the paper that you can even get it to mine diamond tools, which takes humans over 20 minutes and, I forget exactly, tens of thousands of actions in a row, which is just unprecedented. If you go back a couple of years in RL, to think that agents could be not only learning that, but in this case learning it in an unsupervised way, basically for free from the internet, is pretty cool. And so the idea, going beyond Minecraft, is that you could imagine going to almost any domain for which there's lots of video out there on the internet: you train a model that infers the actions, then you label those videos, and then you can pre-train a model to, as I said earlier, stand on the shoulders of giant human data sets. Yeah, I just want to add a bit to Jeff's point. One thing that I really, really like about Minecraft is that this is a game designed for humans, and humans use a mouse-and-keyboard interface to
play this game, and this keyboard-and-mouse interface is pretty universal. Humans use it to play all kinds of games, but even beyond games, they use it to Photoshop Instagram pictures, or to design 3D ships in Unity, and all those kinds of things. So you can imagine that similar principles can be applied to those other domains, using the same interface, for building real-world applications. So really, the richness and the universal user interface offered by Minecraft is what appealed to us in building a benchmark on top of it. For context, when we first started this MineDojo project with Jim, we were looking at what existing benchmarks, in particular in continual learning and lifelong learning, have to offer, and we found that there are a lot of benchmarks developed in the community as pioneering work, but a continual-learning benchmark might have 20 or 50 tasks, and then your agent learns for two hours and reaches the end of the list of tasks. But really it's the openness and the possibilities that make humans play this game for years. That's why we thought this could be a really good platform to enable a much larger-scale study of how to deploy embodied agents that learn throughout their entire lifespan; now we're talking maybe a month or even years of learning. So these are the possibilities offered by the open-endedness of the platform. Yeah, I wanted to jump in on a few points there, which I think are really interesting. One is that we also were really motivated by the fact that the interface in Minecraft is particularly interesting, not just because you're using a mouse and a keyboard and you're in a 3D space, which is a lot of what humans
do, but also because it has these games within a game. You've got to do crafting, which opens up something that looks a lot like a little computer application, and you have to drag and drop these little icons and do precise sequences. So one of our motivating use cases was computer-using agents: agents that eventually learn to do things like Yuke was just saying, Photoshop, or using Microsoft Word or Google Docs or whatever. Being able to precisely do drag-and-drop sequences with icons was a really nice added challenge within Minecraft, because it requires a lot of precision and it's kind of a different modality within the game. Another thing I think is really cool is that you don't necessarily need to follow exactly the playbook we did, where you train contractors in one domain and you have what we call this inverse dynamics model, the thing that labels the actions given past and future. You don't have to have that be domain-specific; you could imagine trying to learn a very generic IDM that works across all domains, or at least all computer-using domains, where you have the same input-output space, which is basically pixels to mouse and keyboard. In fact, just yesterday, somebody on the team at OpenAI that worked on VPT found that somebody had taken our exact pre-trained Minecraft IDM and run it on the game Fallout, which I'm not familiar with, but it's totally visually different; it's a 3D game but it doesn't look at all like Minecraft, and apparently it's working relatively well, which I wouldn't have expected, and I'm surprised. But imagine if you trained on 10, 20, 30 games or domains or computer-using applications, and then you can generalize to the next, and the next, and the next, and you just add more and more pre-training, and you get this one generic IDM that can recognize what action must have been taken in a whole lot of
different domains; then you're really off to the races in terms of pre-training an agent on all the different things that people do on computers. The final thing I wanted to add, really quickly: continual learning. We actually do have evidence that in Minecraft continual learning is a problem. The previous paper we put out of OpenAI, out of my team, was on curriculum learning in Minecraft, and if you try to train the agents with a curriculum based on learning progress or whatever, they'll continuously learn, but they're learning new skills at the expense of forgetting old skills. We see this fall-off in skills they were already really good at, because they're learning the next thing, and if we didn't properly apply curriculum-learning and continual-learning techniques to abate that, it would show up. And that's because Minecraft is complex and rich enough that you can focus on one totally different aspect of the game and forget this other one. So that is another really nice element of challenging AI problems that shows up in this game and doesn't show up in a lot of simpler RL games, so I'm glad you mentioned that, Yuke. Sorry, Sebastian, go ahead. Yeah, following up on this, I think what also makes it really nice is that the game offers so many possibilities: you can do RL tasks, learning from human data, but, there's some siren in the background, you can also focus on what I think is really great: you have the survival mode where you have to do this crafting, but you also have this creative mode where you build all kinds of crazy creations, where there is no objective; you basically have to make an algorithm that can create things in an open-ended manner. I mean, you can learn to replicate what humans have been doing in Minecraft, and you can do this on a very low level, like mouse and keyboard, but then you can
even go further and try to build on this: now we have an algorithm that can do the basic things you can do in the game, it can mine diamonds, it can craft, and I think the next step would be, can they actually create things similar to what humans have created in the game? And I think it's really interesting to think about what it would take for an AI to invent a computer in Minecraft: what kind of selective pressure would make the AI invent a computer in Minecraft, or invent a neural network in Minecraft, or all the other things humans have been inventing? Can we bias these systems toward inventing these things because they're somehow useful for some in-game behavior, and what kind of open-ended algorithm would ever discover those things? I think that's really interesting, and it allows Minecraft to simulate things from the lowest levels basically up to the highest level of human creativity; it's Lego-like building blocks where you can basically do anything. That's also what I think is really exciting about it. Yeah, I want to follow up on a slightly different point that Jeff, Yuke, and Sebastian brought up, on the data side. Jeff said we're standing on the shoulders of giants; I want to provide a statistic in case people don't know. For Minecraft, there are 140 million active players. Just to put this number in perspective, it is bigger than the entire population of Mexico, and twice the population of the UK. So this massive number of humans is playing this game, generating all the data, all the knowledge. So I feel that, back to Sebastian's point about what it takes to build a computer in Minecraft, maybe we can get started from data-driven creativity. There are already so many open-ended objectives and awesome, even crazy
things that humans have demonstrated, either on YouTube or explained step by step using language, right? So can we first mimic some of the diversity and creativity that human players already show, and then, once we have some kind of foundation model for creativity, build upon it using evolutionary algorithms to bring it to an even higher level? That's very analogous to what AlphaGo did: it first bootstrapped from human play, and then self-play got it to superhuman performance. So I feel that both the creative side and the policy-learning side can benefit a lot from the data. The other point is that for all of these YouTube, Reddit, and wiki datasets, language is a central component, and I feel that these days NLP folks are having a great time; agent learning is falling a bit behind what NLP people have achieved, like GPT-3. Those large language models are already open-ended: GPT-3 is open-ended, you can explain a task to it in natural language and it will be able to do it. But we do not have any agent that you can just tell arbitrary things and it will do them; we are so far from that ideal. So I think maybe it's time to leverage some of the technologies created in the NLP community and apply them to Minecraft, and we have the data: we have people explaining and narrating what they're doing on YouTube, and all of the knowledge and tables of contents on the wiki, and all of this data can be utilized, combined with natural language. Yeah, maybe one remaining thing about those datasets is also the collaboration of humans. In the datasets you guys created, you have people building things together, for example, and that requires a lot of collaboration.
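The language-conditioned reward idea that keeps coming up here, score how well the agent's recent video matches the task prompt (MineCLIP's association score) and hand that score to RL, can be sketched very simply. Mean-pooled toy vectors stand in for the real video and text encoders; every name and number below is made up for illustration:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-8
    nv = math.sqrt(sum(b * b for b in v)) or 1e-8
    return dot / (nu * nv)

def language_reward(frame_embs, prompt_emb, window=4):
    """Reward for the current step: similarity between the embedding of the
    last `window` frames (mean-pooled here, standing in for a learned video
    encoder) and the embedding of the task prompt (standing in for a text
    encoder)."""
    recent = frame_embs[-window:]
    pooled = [sum(col) / len(recent) for col in zip(*recent)]
    return cosine(pooled, prompt_emb)

# Demo: as the rollout drifts toward the goal embedding ("shear sheep to
# obtain wool", say), the reward grows, giving RL a gradient to climb.
random.seed(0)
goal = [random.gauss(0, 1) for _ in range(16)]
frames = [[0.1 * random.gauss(0, 1) + (step / 11) * g for g in goal]
          for step in range(12)]
early = language_reward(frames[:4], goal)   # behavior unrelated to the task
late = language_reward(frames, goal)        # behavior matching the prompt
```

The design choice worth noting is that the reward is dense: the agent gets graded continuously on how task-like its recent behavior looks, rather than only at a sparse success condition.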
If you see those videos of hundreds of people creating Minecraft structures together, it's amazing how this collective system creates amazing things. So I think that's also a rich source on how humans collaborate in this game: can we make algorithms that do similar things, and even collaborate, not just AIs separately, but maybe together with humans in some way in the future? I completely agree, and language will provide a very good channel for humans to communicate with the agent. We have models like GPT-3 that understand language, so plugging one into some kind of Minecraft agent will facilitate this kind of human-AI collaboration, because natural language is just such a comfortable interface for us. Yeah, a couple of things. One is that one of my favorite parts of the MineDojo paper is the emphasis and the vision on natural-language-conditioned agents. I'm a huge believer that this is the future of agents in general and RL agents in particular, and I love the idea of trying to harness the data that's out there, like in captions, to produce that. In our appendix we played with that idea a little bit, but we didn't get it to work nearly as well as you guys did, and I just think that is going to be very, very powerful going forward, because, as you said, with GPT you want to just be able to tell the agent what to do and then it can go do it. That also dramatically decreases the exploration challenge of the task. Most of the time, when we give an RL agent a task, it's basically the equivalent of putting it in a giant cluttered room and saying "go". What we really wanted is for it to go get all the blue things, but we don't tell it that, and it's just flailing around randomly trying to figure out what the heck we want; even a human would do terribly at that task. But if you could just tell the agent, go get all the blue
But if you could just tell the agent, "go get all the blue things," then it's just a matter of mastering the physical skills of picking up the blue things, which is usually very easy for humans and ultimately will be for agents as well, once they come to the task with skills from other domains. So I think this natural language conditioning is very essential going forward, and I love that you guys are focusing on that, and on how we can get it via pre-training on internet-scale data. I'm very bullish on that as a future direction. Another thing I wanted to mention, also related to overcoming the exploration challenges of tasks: a few years ago I put out this idea of AI-generating algorithms, that we were going to try to bootstrap up from very simple conditions all the way to superhuman AI. The inevitable question I got from everybody, and that I had myself, was: how are we going to do this without a planet-sized computer, as Josh Tenenbaum put it when I put the idea to him? That was a real challenge. How are you going to shave many, many orders of magnitude off of what happened on Earth, so this can happen within the computational resources we're going to have as AI researchers? One of the big challenges here is bootstrapping: getting into the regime where you can start to learn new tasks, where you have to learn all these initial skills, including what language means and how to move about the world. And that's where I think this standing-on-the-shoulders-of-giant-human-datasets thing comes in. The VPT paper really shows this, and the MineDojo paper as well: basically, we show that if you pre-train on all of this internet-scale data and then challenge the agent to go get diamond tools, it can do it, whereas if you don't have that pre-training, it basically will never happen. Now, "never" we should put in quotes; with infinite amounts of compute you'll eventually figure this out, and there are projects like Dota that did impressive things without any pre-training. But as Jim just mentioned, a lot of works like AlphaGo and AlphaStar seeded learning with pre-training, and from there you create the next wave of learning. I think that's what's really exciting in terms of open-endedness: we can probably leapfrog ahead over the first couple billion years of evolution, right into a regime where we have pretty capable agents that can then kick off an open-ended process, and we can do that by pre-training on human data. So I think the future is very exciting for open-endedness research, because it now seems possible that you could use pre-trained models, video pre-trained models and language pre-trained models, to kick off open-ended processes maybe within the next couple of years, whereas before pre-training it seemed like we had a much bigger mountain to climb before we'd see truly open-ended processes. Maybe just a question here: when we talk about pre-training, of course we're thinking of massive data. Does it have to be self-supervised, or do you think a little bit of human-labeled data in the beginning would get us a long way, the way that VPT did it? Yeah, I was just going to say that VPT is an example where you can use a little bit of human data as a great catalyst, a great multiplier, that allows you to unlock huge amounts of unsupervised data. So I don't think there's any reason why you shouldn't mix the two, as long as you do a little bit of the expensive stuff and a lot of the inexpensive stuff; it's a very powerful formula. I think pre-training will definitely get us a big way there, and it's really impressive.
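The pattern being described, imitation pre-training as a catalyst followed by RL on a sparse reward, can be caricatured with entirely made-up numbers. The steep `competence ** 4` term is a stand-in for the long chain of prerequisite skills behind a reward like obtaining diamond tools; nothing here models the real VPT training.

```python
import random

def train(pretrain_steps, rl_steps, seed=0):
    # Toy model: imitation pre-training buys initial competence, then RL
    # must stumble on a sparse reward whose probability rises steeply with
    # competence. All constants are invented for illustration.
    rng = random.Random(seed)
    competence = min(1.0, 0.01 * pretrain_steps)   # behavioral-cloning phase
    reward = 0
    for _ in range(rl_steps):                      # RL fine-tuning phase
        if rng.random() < competence ** 4:         # sparse-reward episode
            reward += 1
            competence = min(1.0, competence + 0.05)
    return reward

from_scratch = train(pretrain_steps=0, rl_steps=500)
with_pretraining = train(pretrain_steps=80, rl_steps=500)
```

From scratch, the success probability starts at zero and the run flatlines; with the pre-training head start, RL makes steady progress, which is the qualitative picture of the flatline result discussed above.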
But I'm wondering sometimes if there are any drawbacks, whether pre-training might get us stuck somewhere from which it is hard to get out. To be a little bit, you know, not only positive: could it be that these large foundation models we pre-train on, which I'm also a big fan of, somehow get us stuck somewhere from which it is hard to learn more, whereas if we had started from scratch, without any human data, letting the algorithm discover everything itself, we would somehow not have gone into this cul-de-sac? I'm not sure that would actually happen, but it's something I sometimes think about, because now everybody is using these pre-trained models in a lot of different areas. So are there any drawbacks you could imagine, or do you think it's always useful to have these pre-trained models and then kick off this kind of open-ended learning? So, in the AIGA paper I talk about how exciting it will be, once we figure out how to make these AI-generating algorithms, to run them multiple times and see vastly different types of intelligence show up, almost as if you were doing alien cultural travel, visiting this culture and that culture, so you come to understand the space of all possible intelligences, wildly different societies, et cetera. And I think there's a risk of not seeing that as much, or at all, if you start off with a human seed, because you're going to get something that looks a lot like human society and human culture. However, I think about the staging here. In my opinion, let's create the first open-ended process; that's really exciting, and the first few can be based on human data. They basically will be a narrow cone around human society, and they'll probably reflect a lot of our biases, and we'll figure out how this stuff works, and then eventually we'll get to the more super-ambitious thing of seeing what wildly different societies look like. Because right now, take our RL results as a microcosm of what we're trying to achieve with pre-training: with it you could do pretty impressive stuff and actually accomplish things like learning diamond tools; without it, you flatline and learn almost nothing. So barring some major algorithmic innovations that get us past that exploration bottleneck without using human data, this is the only way to play the game. And even getting a diamond pickaxe, while impressive for AI, is not that impressive for a human player; we want them to build computers and build Hogwarts and things like that. So I think this is going to be a dramatic shortcut that gets us where we want to go a lot faster. But ultimately, I want to put this out there: I think scientifically it is a fascinating challenge we should absolutely pursue, how to do this without human data, because we know natural evolution did it, and we don't want these biases in the system, because ultimately we do want to see the entire space of intelligences and what is in the realm of possibility. So I'm okay with the impoverished version now, because it's the fastest way to play, but ultimately we do want to do what you're saying, Sebastian. Sorry, I just wanted to add a comparison: you might get really human-like intelligence, but you might never get something like an octopus, which evolved intelligence in a very different way, because you took this shortcut. But I also agree that it's probably a very useful shortcut. Sorry, you were going to say? No
worries. Yeah, so to add to what Jeff said: first of all, I completely agree there is an exploration bottleneck, and what VPT solved was getting familiar with all the motor skills, so the agent does not just do things randomly; otherwise it cannot even walk in a straight line, and you need to be able to walk in a straight line and do some crafting to make anything interesting happen. So the exploration burden is very real, and I think it's important to use some human data to bootstrap. But the point I want to add is that Minecraft is a little bit special, because it is a very human-centric game. There are many concepts in it that we know from common sense: in Minecraft you can use sand to craft glass, and you can use water to put out lava. All of this is common-sense knowledge, and humans have curated so much knowledge about it online, both general common knowledge, say on Wikipedia or in a general text corpus, and Minecraft-specific knowledge. So I feel all of this knowledge can be combined to bootstrap the agent to a very good state, where it can understand language and do the tasks it is instructed to do. And after that we can set the agent loose in Minecraft to discover new things that humans have not even considered before. Also, back to the point about collaborating with humans: language is a great interface, so humans can even help the agent the way we mentor a child. We tell the child, these are some of the things you can think about; or the way, as advisors, we tell grad students: these are some promising ideas to explore, but exactly how you implement them is up to you. As advisors, we can point the AI in the right direction. So I think Minecraft is quite special in this regard. Yeah, I mean, if you look at a baby in the early stages of development, a lot of learning happens because they are not born in a vacuum, right? They imitate their parents to learn a lot of skills; they learn to stick their tongues out a few days after birth by watching their parents. So I think if we have this human data, we should use it; that's my opinion. It allows us to do more structured exploration. A lot of these creative tasks require really prolonged interaction with the environment, tens of thousands of steps, and if we can solve these basic skills, we can level up and do more efficient exploration, so that we get to discover more interesting goals. But I do think that training only on human data does not replace the agent's own experience; we need a mechanism for the agents to set their own goals, to have their own intents or curiosity to explore the world. That touches on the topic of artificial curiosity and curriculum learning, like Jeff also said. We have been developing along this line of work too, following Jeff's line of work, on how to have agents that create their own environments, or a procedure to generate their own tasks: a kind of self-play where the system is a problem generator and a problem solver at the same time, which we can cast as a learning framework. But I think more research has to be done here, because we have only focused on very small domains, not really Minecraft scale. Still, it's an interesting research direction: how can we have agents with that self-generation of goals and intents?
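A minimal caricature of that generator-and-solver idea, with all quantities invented: tasks are just difficulty numbers, the generator proposes tasks near the current skill level, and a crude band filter plays the role of a learning-progress or minimal criterion.

```python
import random

def solve(difficulty, skill):
    # Placeholder solver: the agent masters tasks up to its skill level
    # (the caller grants a training bonus on top of raw skill).
    return skill >= difficulty

def generator_solver_loop(steps=50, seed=0):
    # The system proposes its own tasks (generator), discards ones that are
    # trivially easy or hopeless (a crude learning-progress band), and grows
    # in skill when it masters a kept task (solver).
    rng = random.Random(seed)
    skill = 1.0
    mastered = []
    for _ in range(steps):
        difficulty = skill * rng.uniform(0.8, 1.4)    # propose a task
        if not (0.9 * skill <= difficulty <= 1.3 * skill):
            continue                                   # outside the band
        if solve(difficulty, skill * 1.25):            # training closes a 25% gap
            skill = max(skill, difficulty)
            mastered.append(round(difficulty, 2))
    return skill, mastered

final_skill, mastered = generator_solver_loop()
```

The coupled loop ratchets: each mastered task raises the skill level, which moves the band of proposable tasks upward, a toy version of the open-ended dynamic being described.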
Yeah, I love that. One of the things we had on our roadmap at OpenAI was something very similar to what you proposed in the MineDojo paper, which is using GPT not only to generate tasks but to generate subtasks, and then combining that with the curriculum paper that we did for Minecraft. So basically, you go to GPT and say, what are the first ten things I should do in Minecraft? If the agent can solve those, good; you say, okay, I solved those, what's next? And if it couldn't solve them, you go back to GPT and say, you challenged me to make an iron shovel but I didn't know how to do that; can you help me? And then maybe it breaks the task down, and you assign those subtasks; if the agent solves those, great, if not, you ask for more help. You can then drive that with a learning-progress-based curriculum, like we have in the earlier paper I mentioned, and just fan out and try to learn all of the skills there are in Minecraft. So: a natural-language-conditioned agent that can parse and consume the tasks and subtasks from GPT, paired with a learning-progress-based curriculum in an open-ended environment, plus a lot of compute, which you have at NVIDIA, and boom, you have a very powerful, exciting research project that I think would be really interesting to see somebody pursue in the next year. I'm glad to hear you're maybe working on that now. Yeah, even better if the GPT is fine-tuned or even trained with the Reddit data, so it has expert knowledge of the game and is truly not just a general language model but a true expert on everything you can do in Minecraft. Yeah, and then you can also seed that agent with the VPT agent, which already knows how to do a lot of stuff in the world, so it's off to the races. And I just saw that Thomas is on; Thomas did this amazing thread right after both papers came out, proposing something very similar: how the MineDojo work and the VPT work really dovetail nicely together, and how it would be great to see them put together. So thanks for that thread, Thomas.
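The curriculum loop Jeff describes can be sketched directly. `ask_llm` below is a hypothetical stand-in for a real GPT call (with canned answers so the sketch runs), and `try_task` stands in for actually running the language-conditioned agent.

```python
def ask_llm(prompt):
    # Hypothetical stand-in for a language-model query; a real system would
    # call something like GPT-3, ideally fine-tuned on Minecraft text.
    canned = {
        "first tasks": ["chop a tree", "make a crafting table", "make an iron shovel"],
        "decompose: make an iron shovel": ["mine iron ore", "smelt iron ingots", "craft the shovel"],
    }
    return canned.get(prompt, [])

def try_task(task, known_skills):
    # Placeholder for running the agent on the task.
    return task in known_skills

def curriculum():
    known_skills = {"chop a tree", "make a crafting table",
                    "mine iron ore", "smelt iron ingots", "craft the shovel"}
    learned = []
    queue = list(ask_llm("first tasks"))
    while queue:
        task = queue.pop(0)
        if try_task(task, known_skills):
            learned.append(task)
        else:
            # On failure, ask the model to break the task into subtasks,
            # do those first, then retry the original task.
            queue = ask_llm(f"decompose: {task}") + [task] + queue
            known_skills.add(task)  # assume the subtasks unlock the parent
    return learned
```

In this toy run, "make an iron shovel" fails, gets decomposed into three subtasks, and is then retried and learned last, which is the ask / fail / decompose / retry cycle described above.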
I had a quick question about embodiment. What do you think would change depending on the form factor of these agents? Because a lot of the intelligence, and the exploration that happens, is conditioned by embodiment and by the relationship of the body to the environment in which you place these agents. I'm sorry, my internet hiccuped there for a second; was that a question for me? Yes. Okay, can you say it again? I apologize. Yeah, I was curious: we talked a lot about the task, whether it's open-ended or directed and so on, and you were talking about discovering different types of intelligence, and I wonder what you think is the role of embodiment, the form factors these agents have. I'm thinking of Pierre-Yves Oudeyer at Inria in France, who had robots with very different bodies; he would put them in an environment where they had to develop a language to speak with each other, and the types of languages they developed changed depending on their form factor. That gets me thinking: you talked about evolution, and a lot of adaptation had to do with the relationship between our bodies and the environment we're in. So a lot of the learning and intelligence you will uncover is also determined by that adaptation of form factor and body to the environment you're creating for these agents. Yeah, those are wonderful questions. I have a similarly staged version of this answer. Think about the 80/20 rule, or the 90/10 rule: you get most of the interesting stuff first, and then there's a long tail of stuff that's harder and fascinating. I think there's probably a core to intelligence that is relatively independent of the body. For example, symbolic thought, learning how to play chess, doing math: you could probably do that almost no matter what your body was. But there are subtleties, like you mentioned, in how you might end up communicating, and what that would do to influence the way you think about the world. There's this wonderful pair of science fiction books, Star Maker and Last and First Men, where they travel to all these different planets, and one of my favorite parts is a visit to a planet with a really thick cloud of dust, so you couldn't really see; it was dark all the time, and the creatures there became dominant in smell. If you think about it, a lot of our metaphors for understanding are vision-based: "oh, I see," "that's clear to me." There's this huge family of vision-based metaphors for understanding, whereas in the book all of those were transformed into touch and smell: "oh yeah, that smells right," "I'm picking up the scent of what you're saying," et cetera. So subtle things about how they think and express their metaphors might really depend on the embodiment. If they communicate by vibrating their hands, they probably have an entirely different set of metaphors for what happens when they finally understand you. But to some extent, appreciating those subtleties is way down the path toward understanding the space of possible intelligences, and there's so much between here and there, just to get that 80 before we enjoy that last 20, that I'm not that worried about the particular embodiment. I think you could study AGI in maybe even just a text space, a purely text-based thing. It's one of the most fascinating questions: could GPT, just by studying the internet and never being in a 3D world, get close to human-level intelligence and understanding us? Or would it completely lack 3D common sense, having read about it forever but never walked around in a 3D world, so it doesn't understand the idea of a collision or a shortcut? That's a fascinating thing to think about, and one of the cool things about doing AI research is that these become testable hypotheses; we'll find out. My instinct is that the body doesn't matter as much as, say, Rolf Pfeifer or Josh Bongard have advocated over the years, but Josh and I debate this over beers, so we'll find out. If we're moving to multi-modal learning, I feel that embodiment could help ground some of the linguistic concepts in the visual world or the physical world. It depends on what downstream application you're aiming for: if it's video games or robotics, then embodiment may be very useful, because the agent is receiving information from multimodal sensors and you introduce this kind of multi-modal concept grounding, and again, I think language is a very good medium for doing that, just like CLIP from OpenAI, and also DALL·E from OpenAI. This multi-modal connection is becoming a very hot research topic. But I guess if we want to create a completely open-ended agent, or AI-generating algorithms, I feel it could be embodiment-agnostic: a general paradigm that can be applied to all embodiments. Yeah, I think these are all great points; I fully agree, and that's another reason I just want to trumpet the genius of the natural-language-conditioned agents and pre-training in the MineDojo vision paper. It's just so right. Go back to my example, "go pick up all the blue things." Well, imagine
instead it was "go pick up all the chess pieces." If you don't know what a chess piece is, and you have to learn that while also learning what the task is, it becomes combinatorially impossible, or virtually impossible. But if you come to the world knowing what a chess piece is, then you're so far ahead of the game. And you can get that, we were actually working on the same thing, by taking the transcripts of YouTube videos while people are talking about what they're doing, and putting that into the conditioning for which action should be taken, or into your contrastive learning, or whatever. Then you learn what a diamond is, and a shovel, and a tree, and a cow, and a sheep. You can get all of that via pre-training, and then the learning task becomes so much easier; you have this giant shortcut to all the things you want to accomplish. We are almost at time, well, actually at time, but maybe we can do a little bit of Q&A from the audience that's been with us for this hour. There are some questions in the chat that I haven't been able to read while participating in the discussion. I don't know if anybody wants to speak up. Thomas, did you have one, or are you just commenting? I'm calling you out. No question? Thanks, Thomas. Anyone else have a question you want to speak up with, or type? Yeah, so Thomas is basically echoing the idea of using a POET-style system with text-based goals provided by GPT, taking advantage of GPT's knowledge of Minecraft, which it got from pre-training, and which would be even better fine-tuned, to propose: oh, you know how to do that, so this is what's next, and this is what's next, and this is what's next. I think that is just such an awesome project to be pushing on. We were actually very surprised that GPT-3, just out of the box, the davinci model, already knows a lot about Minecraft.
Yeah, it's such a popular game, as you mentioned; everybody has written so much text on the internet about it. Yeah, we did the same thing. I basically asked, how do you do this in Minecraft, what are the first ten things to learn in Minecraft, and at least half of the answers were very good suggestions. So you can see it could already basically be used to generate a curriculum. We have a raised hand from Hamid. Hey, can you hear me? All right, great, thanks for the talk. I just wanted to get your opinion: if we were to build an agent that can interact with the environment, and we wanted to use the intuition and common-sense reasoning of a language model, but the perception has a different input modality, or a combination of different input modalities, what are the important aspects to take into account in mapping an unseen input representation to the domain of the language model? Because there are different paradigms at the moment for mapping a sequence of inputs from perception into the domain of a language model. What are your thoughts in that direction? My personal thoughts, and I'll jump in because nobody else has: I really like the idea of universal inputs and outputs. Basically, like a computer screen: right now all of us are looking at a bunch of pixels, and we have a keyboard and mouse, and I guess I'm using a microphone as well. You set that basic environment up, and, as long as we're not talking about robotics and stuff in the real world, in the computer world, which already contains a massive number of learning tasks, you've got this universal API, and then you never have to deal with a new modality, in some sense, at inference time. You might be dealing with a different modality in the sense that you've never heard rock music before and all of a sudden that's coming out of the speakers (I guess speakers are another one), or you might see a picture of text as opposed to getting the characters of the text. But if you just go with the same kinds of inputs and outputs we have on a computer, think of all the things you can do on a computer: it's very, very universal. So I like that as a starting place for agents. Right, so what you're suggesting is to have a universal representation of all inputs, rather than mapping a certain input to another modality? That's exactly right. I tend to like the long-term view of AI, where I'm not trying to solve one particular problem or one particular modality; I want to get the very general long-term solutions in place and then create a path toward that future. I like that, but one blocker is that the large pre-trained models are usually trained on different kinds of modalities. We have language models trained on text, and then we have large video models trained on images or sequences of images, and if we want to leverage those, it introduces this challenge of mapping between modalities; or we just make a decision that, okay, text is going to be my universal representation of all things, I'm mapping everything to text or to language-model tokens, and I take it from there. Yeah, or we should fix the problem you mentioned and not have foundation models be domain-specific: let's just start training general models that settle on a set of core modalities, like pixels and sound, with outputs like keyboard, mouse, and sound, and then boom, start training foundation models on those. Or you can do what the MineDojo team did, and many people have done, which is: all right, I'm going to take my CLIP vision piece, I'm going to take my GPT, run them through, and then merge them. But I really think the future is going to be models that can consume all of these modalities.
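That universal pixels-in, keyboard-and-mouse-out interface can be written down as a tiny API sketch. The types and the brightest-pixel policy below are invented for illustration; the only point is that the observation and action spaces never change across tasks.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Observation:
    # Universal input: what any program shows a person at a computer.
    pixels: List[List[Tuple[int, int, int]]]          # H x W RGB frame
    audio: List[float] = field(default_factory=list)  # recent sound samples

@dataclass
class Action:
    # Universal output: what a person can do at a computer.
    keys: List[str] = field(default_factory=list)     # e.g. ["w", "space"]
    mouse_dx: int = 0
    mouse_dy: int = 0
    click: bool = False

class UniversalAgent:
    # The agent never sees task-specific observations or actions, only
    # screen and sound in, keyboard and mouse out, so Minecraft, a browser,
    # or a word processor all look identical from its side of the API.
    def act(self, obs: Observation) -> Action:
        # Placeholder policy: move the cursor to the brightest pixel.
        best, best_y, best_x = -1, 0, 0
        for y, row in enumerate(obs.pixels):
            for x, (r, g, b) in enumerate(row):
                if r + g + b > best:
                    best, best_y, best_x = r + g + b, y, x
        return Action(mouse_dx=best_x, mouse_dy=best_y, click=True)
```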
And not have separate models for separate domains. I completely agree with Jeff. Sorry, I just want to quickly bring up: I highly agree with Jeff, and there are actually a couple of great works coming out from the community, like the Perceiver model from DeepMind that can process all kinds of different modalities. Multi-modal fusion is a very hot topic, so I agree with Jeff, this is the way of the future. Thank you. Maybe I'll highlight one question from the chat that's interesting to me: are there examples of real-world applications of these methods learned in the Minecraft setting? I don't know about the actual real world, but one thing I wanted to mention: it's great having these systems running in Minecraft, and what we're trying to do now with our company, modl.ai, is actually using them for one thing they're really good at, which is finding bugs in games. If you have a system that can play a game really well, and we want to train large foundation models not just on Minecraft but on a variety of different games, then you can actually use them in addition to human testers. You just have them play the game, instruct them to find bugs, and they report those bugs back to you. So I think that's an interesting application: not just trying to build more general agents, but using them for tasks where before we needed hundreds of human testers doing repetitive work. Now we can use AI agents that play these games and report back, "here I found this bug, I found this glitch, this thing is not working," which I think is a nice combination of open-ended search methods and these kinds of pre-trained large-scale
foundation models. Yeah, I bet Go-Explore is going to be really good at finding bugs in games; I found a lot in Atari. To directly answer your question: oh yeah, reward hacking and general RL over-optimization aside, I'm not yet familiar with anybody using this in the real world on our end; I don't know about you guys, Yuke and Jim. So part of it is debugging games and finding defects, as you said, but also improving the gamer's experience by making the non-playable characters, so-called NPCs, more interesting. If you've played Pokémon or other RPG games before, you know that if you walk up to a character and talk to them, they say one thing, and you come back another day, talk to the same person, and they say exactly the same thing. So how can you really personalize and make these non-playable characters adaptive, so that it significantly improves the gamer's experience? It feels like the game is unique to your own playthrough, and the game also adapts to your own preferences and personality. That's within the game. Beyond the game, we talked about generally using computers through the keyboard-and-mouse interface, and I think some of the principles from building general-purpose agents in Minecraft can transfer to more practical domains, like robotics, which is something I personally care about. Eventually we want to move beyond Minecraft into real-world settings, and those principles can provide guidelines for building machine-learning foundations for embodied agents, or even physical robotic systems. Yeah, one of the things we want to do: we're hopeful that our paper will get into NeurIPS, we don't know yet, but if it does, what we're planning for the celebration party is trying to create
a game server for our team where we put in a bunch of VPT agents alongside the humans, and we all get to play together, because we've always wanted to do that and haven't yet. You can't yet talk to our agent, though, because we didn't get to the text-conditioning piece yet; we were hoping to. It would also be fun if you had a text-conditioned Minecraft pet that you could play with: "go get me some redstone, I'm building something," and it comes back with redstone or whatever. That would be fun. Yeah, I love that, because I'm not a game player, but I kind of want to get into it, and I'm also lazy. I have some ideas, but do I have to click all the things? Can I just talk to the agent and have it do the work for me, while I mind it or advise it? Like a text-conditional AI assistant. Rosanne can be our first customer! Yeah, if Rosanne is happy, then we're doing a great job; that's our objective function. Make sure I'm happy and we've solved the alignment problem. Absolutely, we need you happy with the game. That's actually a pretty high bar; I don't play games. I'll set that aside. Challenge accepted. Cool, I think that's probably it; we're already ten minutes over time. Thanks, everyone, for showing up and having this lovely panel at DLCT. Thank you, Rosanne. Thanks a lot for the invite. Thank you so much. Yeah, thanks, everyone, Jeff and Sebastian. Great discussion, great papers, I loved reading them, congratulations. Same here, VPT is awesome. Okay, thanks everyone, bye! Have a good weekend, bye!