This is Unsupervised Learning, Redpoint's AI podcast, and today we have on Percy Liang, one of the world's leading AI researchers and co-founder of Together AI. Outside of founding Together, which is an AI unicorn, Percy has led some of the most innovative AI projects. In his work on generative agents, Percy created a virtual world similar to The Sims where AI agents interact with each other, allowing researchers to study complex social dynamics. During the episode we also got Percy's takes on the future of evals, why interpretability is so difficult, and the current state of AI research overall. This was an awesome episode with one of the leading minds in AI research. Don't forget to stick around for the debrief. Now, here's Percy.

Well, thanks so much for coming on, really appreciate it.

Yeah, thank you for having me.

Awesome. I feel like there's a lot of places we could start, but top of everyone's list these days is OpenAI's new o1 model. I'm curious about your reaction to it. Is it a crazy breakthrough? What did you think when it came out?

From a product perspective, I thought it wasn't very good. It was kind of slow, and it was hard to use for many of the things I wanted. From a research perspective, I think it signals a change that we'll see going forward. The idea of test-time compute has been around, but as with many things, it needs a certain kind of scale for it to sink in. A lot of how we think about language models is: prompt comes in, response comes out, people measure tokens per second, and everything happens within a few seconds. But that's really underselling the value and the potential of an AI being able to solve much more ambitious tasks, tasks that take on the order of days or
weeks or even months, the big projects that we as humans take on. Directionally, o1 is just a very small step in that direction, but you can see where the field could go: you have these agents that can reason and plan and do things for quite some time, and solve ambitious problems, like coming up with new research or inventing new drugs.

One other thing I think is interesting: if you rewind ten years, reinforcement learning was all the rage. AlphaGo, remember that? In the last five years or so we'd almost forgotten about it, because now it's large language models predicting the next token. But I think we're starting to see a shift back. We're even using the same word: agents. We have agents that can take actions, and I think the way to interpret the generation of a language model is not necessarily as text but as actions in some space. When you embed an agent in a larger task, where it's operating for a long time, there's a possibility of accumulating more and more experience, and that experience, with an appropriate reward signal, means you can actually learn and iterate. So I think we'll see a lot more agents being deployed on tasks, getting feedback, and actually getting better.

Super cool. I saw on LMSYS, or some of the benchmarks around it, that the o1 models were incredible at math, right? Are there specific domains you look toward, where if it's really good at math or some of these other things, that means it'll be better at these multi-step kinds of workflows?

My favorite new one is a benchmark we just put out called Cybench. It's basically capture-the-flag cybersecurity exercises, and these are, you
know, crazy hard. I don't think I can solve most of them.

What hope do the rest of us mortals have? There's no chance at these CTFs.

Some of them, the hardest ones, take a team of human competitors over 24 hours to solve. Currently the models are able to solve the challenges where the first time a team or a person solved it was around 11 minutes or so. So there's quite a bit of a gap, and I think this is a challenging benchmark. When people say benchmarks are being saturated, I say, okay, take a look at this one.

We actually tried o1 on it, and it was a little bit strange, and this points out the subtlety of evaluation. If we use it as a drop-in replacement, it doesn't really improve overall. It does improve in certain respects: subtask performance, if you break down the task, gets better. But overall it wasn't a huge bump, and I think the reason is that when we dropped it in as a replacement, we were treating it not as o1 but like a normal language model. We have our own templates for how the agent should reflect and plan and do all these things, and o1 just ignored all of that; it did its own thing and generated the answer. It wasn't really compatible with our framework, and that's why I think the scores weren't as good. If we'd spent more time developing toward it, it could be better. But maybe this is a cautionary tale: if you just look at raw benchmark scores, they don't necessarily tell you the full story.

Another cautionary tale is that we have this idea of monotonic progress, but if you're using AI in the context of a larger system, there's a compatibility that needs to be there for you to actually see the improvements. So if
it doesn't fit in, you're just not going to get good results. For everyone trying to use o1, or any new model that's coming out, compatibility is something that's maybe not looked at as much.

It's a really interesting question, what you alluded to: so many of these AI applications were built on GPT-4-level models, with all the scaffolding around them, ways to chain things together in prompts. Will that stuff still be relevant for the next generation of models, or was that all great scaffolding for the previous generation, and now there's a whole new set of things you have to do?

I do think a lot of that will change. I don't think we should be particularly proud of, or attached to, the scaffolding that has been built up so far. It's just a chain of prompts; it's very dispensable. I think the paradigm might shift in a way that o1 kind of reveals: the idea of having all that reasoning scaffolding internalized in the model, and furthermore not exposed to the user. From a research perspective that's actually going to be somewhat annoying, because if things go wrong you just can't debug it. You don't see the trace; you don't get the stack trace, in some sense. That might make it harder for a developer to customize and develop applications. I think OpenAI's hope is that they will just take care of it.

Which means they don't want people to train on the step-by-step logic.

Yeah, there are competitive reasons why they don't want to expose that. But I think there's a tension here: when it does work, it'll be great, it'll solve IMO problems. But when it doesn't, which is going to happen all the time, because all the places where we
didn't have data coverage, or OpenAI didn't, or you have some novel application, the ability to customize is important. Hopefully we'll talk about open-weight and open-source models later, but we're losing that even with the closed models. Before, if you used OpenAI's GPT-4 and you had an application with a bunch of chains, at least the prompts were transparent and you had openness there. But now, if that's getting internalized, it's just a black box.

You mentioned playing with o1 when it came out. Obviously you've invented and been a key part of a lot of the benchmarks people use, but what do you personally do? The morning that o1 drops, what are you testing out to see what it can and can't do?

It's kind of funny, the AI developer doesn't actually use AI as much as you might think. There are a few things you can kick the can around on. I don't have a set of prompts; sometimes it's just trying whatever is on the top of my mind and looking at tail behavior. The other thing is, I do use ChatGPT for various things, and over the weekend I was trying to write this React app. I hadn't used React in a long time, and it was rather helpful. I actually used o1, and it was slow and kind of annoying, so I switched back to 4o, because for that application o1 doesn't work well. There are interesting use cases where you want immediate feedback, you just want something fast, and it's okay if it's wrong, because you're going to steer it along the way. So it's very multi-dimensional: cost, speed, accuracy, which task it's good at, customizability. All of these facets matter, and I think portraying it as one-dimensional, like models are getting
better, is a bit of an oversimplification. It was interesting to see how it's obviously better on certain tasks, coding and math, but worse on others, which makes a lot of sense, because math and coding are the ones where these reasoning chains really help and where you can get better supervision.

One thing I was struck by when we were preparing for this is that the breadth of your research, all the things you spend time on, is incredible. You're doing a lot of different things. I wonder how you think about, given you sit at Stanford, the role academia should have in AI research, and how you prioritize where it makes sense for you to spend time and where it doesn't.

I think this is a really good question, and often you hear people in academia having this sort of existential crisis: GPT-4, GPT-4o, o1, whatever, is coming out, and my research is doomed, and it's totally unfair, they have more resources than us, how can we get tens of thousands of GPUs? And I think that's exactly right: if you go head-to-head against OpenAI, there's just no point. So you should play another game. I've thought about this quite a bit, and I think one of the key things is to be orthogonal. It should be the case that whatever research project you pick, if GPT-5, say, comes out, it should either enhance your research or be irrelevant to it.

To give a few examples: we had this work on generative agents, which we'll maybe talk about a little later, which would be enhanced if the models get better; the ability to use these models to create simulated societies will get better, and that's great, because we're not interested in, you
know, the raw language model itself; we're interested in novel use cases of it. Another example: we have a benchmark on the ability of a language model to solve machine-learning engineering tasks, and that's going to get better too, so every time a new model comes out, it's exciting for those applications.

Another direction academia could do more in is thinking about the open-source community, because at the end of the day, academia is about open science, which means creating knowledge and putting it in the public domain. There is a ton of knowledge at a lot of these frontier labs, but it's not in the public domain. So a major contribution is actually discovering, or even reinventing, who knows, and publishing it. That has a lot of benefits, because ideas can get more easily taken up by the broader community, and you get new models and products built on top of them, than if they were just in one place. Some of the things we're doing around understanding data quality in pre-training, for example how to weight data, fall in this category. I don't know what the frontier labs are doing; maybe they're already doing something fairly sophisticated. But having some of these ideas out in the open is in itself a valuable contribution.

Then there's a third category of work we've been doing that you could call transparency, or benchmarking, or auditing. Academia stands in a unique position where we don't have commercial interests; we're trying to develop things that are for the benefit of society, or at least for the public good,
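The data-weighting idea Percy mentions, choosing mixture proportions over pre-training sources, can be sketched as a simple weighted sampler. This is only an illustration of the concept, not any lab's actual recipe; the source names, documents, and weights below are all made up.

```python
import random

# Hypothetical pre-training sources with mixture weights (illustrative only).
mixture = {
    "web_crawl": 0.6,
    "code": 0.25,
    "books": 0.15,
}

# Tiny stand-in corpora for each source.
corpora = {
    "web_crawl": ["web doc 1", "web doc 2"],
    "code": ["def f(): pass"],
    "books": ["chapter one ..."],
}

def sample_batch(batch_size, seed=0):
    """Draw a batch of documents, choosing each document's source
    according to the mixture weights."""
    rng = random.Random(seed)
    sources = list(mixture)
    weights = [mixture[s] for s in sources]
    batch = []
    for _ in range(batch_size):
        src = rng.choices(sources, weights=weights, k=1)[0]
        batch.append((src, rng.choice(corpora[src])))
    return batch

batch = sample_batch(8)
print(sum(1 for src, _ in batch if src == "web_crawl"), "web docs in batch")
```

Research in this area is largely about how to pick those weights well, for instance by measuring which mixture improves downstream loss, rather than about the sampling mechanics themselves.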
and so when we benchmark, or when we run a project that assesses the transparency of different foundation model providers, that work is extremely hard, almost impossible, to do in a commercial setting. So academia makes sense from that perspective. And we have collaborations with folks at the law school and other areas of campus, where a bunch of unique problems come up in an academic context. So, enough to keep busy.

It's interesting, what you were saying about choosing projects that are orthogonal to the innovation of the big model players. A lot of startups take a similar approach.

Yeah, I was actually thinking the same thing; it's not specific to academia.

They want GPT-5, 6, 7 to reinforce the advantage they have, rather than being a bet against OpenAI's next model. I think Sam Altman said something like this: when asked what startup he would invest in, the answer was that the startup should be rooting for the models to get better.

Yeah, he did.

You've obviously been in the AI world for years now. Is there anything you've seen that has completely changed your mind with regard to AI in the last few years?

One thing I've come to appreciate more is how to think more holistically about AI's role. A lot of AI researchers, because they come from a more technical background, think of the model as the central object. If you think, for example, in the context of AI safety, safety often gets construed as: here's a model, we're going to make it safe by making sure it doesn't do bad things, doesn't respond to harmful requests. I think that's missing a broader picture, which is that there's the model, but it's one small piece of a larger
ecosystem of actors with different incentives doing things in the world, and what you're trying to do when you think about safety is make sure the whole system is quote-unquote safe, not the particular model. That leads to a bunch of other interventions besides just RLHF-ing the model to death, which I think are much more sensible. For example, a lot of bad actors can circumvent the safety measures by decomposing their problem. If I wanted to send you a phishing email, I wouldn't go to a model and declare my intent, saying I want to write a phishing email. I would write a personalized email; I would say something like, here are the top 10 research papers you should read that are going to be big, and then I'd go in and change all the links to whatever I want. So I think it's actually harder than people give credit for to police some of these cases of misuse.

The other thing is defense. I think there needs to be more invested in defense, just like with any dual-use technology, like email or the internet: it can be used for good and it can be used for ill, and we have anti-spam filters, we have fraud detectors, and these things are essential, otherwise we'd have a complete mess. The analog of that needs to develop too. These tools will be out there, and people will always be able to use them; I think it's a losing battle to try to gate access to models, because they get cheaper and more widespread. But that doesn't mean we give up and throw up our hands like it's impossible; there are measures we can take to secure things. So, thinking more holistically about how the whole ecosystem works, as opposed to just focusing on what the model does. There's a limited amount of things you
can do with the model itself, and obviously some of those measures can be taken by the model providers themselves, or by net-new startups. It's a good segue to talk a little about regulation, because people are asking what role regulators should have in controlling some of these use cases. You've seen things like the EU AI Act, and there are a bunch of different facets here; you've talked a lot about how there are values encoded in models. I'm curious how you think about the regulatory landscape we have today, and what the ideal landscape would be.

Hey, this is Jacob. This isn't an ad, but I wanted to quickly share more about our show. Our goal with Unsupervised Learning is to unpack how the best AI operators make day-to-day decisions and where the space is headed. Through these conversations we hope to shine light on what's working versus not, and what's real versus hype in the world of AI. If you're enjoying this episode with Percy, please consider subscribing on whatever platform to support the show, and share with a teammate if you think they'd benefit.

I'm curious how you think about the regulatory landscape we have today, and what the ideal landscape would be.

This is a complex topic, and I should say I'm not an expert. But the stance I've always taken is that it is really early and there's a lot we don't understand. I get the desire to regulate; we don't want a repeat of social media or whatever. But there are things you can do that are potentially bad, like SB 1047, for example. And to say a bit more: I am very much pro regulation that emphasizes transparency and disclosure, because I think the first step to
gating something is just to understand the risks, and also the benefits, and basically what's happening. Currently everything is so closed off that it's hard for policymakers or researchers or third-party auditors to get a sense of what's happening, so having more evaluations makes a lot of sense.

The other piece of regulation is where you should regulate: upstream or downstream? Do you regulate foundation model developers, or do you regulate the end products? Clearly we have regulation for various sectors already: in finance, in healthcare, we have sectoral regulation, and in some sense that's where it's most effective, because you can see what the harms are. The problem with regulating upstream in a heavy-handed way is that it might be either ineffective or a very blunt instrument. But having transparency obligations that allow downstream decision-makers to make sense of what's coming down the pipe is totally reasonable, just like nutrition labels for food, or spec sheets for devices. A lot of that is necessary so that downstream product developers have an idea of what's happening.

Transitioning to some of your research: one of the pieces I found most compelling was this generative agents work you did. I'm sure this is butchering it from a simplicity standpoint, but I thought of it as: you created this Sims-like environment where all these agents are interacting with each other. It's a fascinating way to
study social dynamics. Can you talk a bit about that work and what's next for it?

So, generative agents. Wow, that came out last year, which is a long time in AI dog years. But I think it's still directionally relevant. This was work with my student Joon Park and Michael Bernstein, and the main thing we were interested in going into it, from an AI perspective, is this: we classically think about language models as generating text, and text could be a document, or a piece of code. But what other things can you generate? Can you generate, essentially, an agent, or a society of agents? That's what triggered our imagination: what if you could do this? So we built this set of agents. Each agent was powered by a language model with a set of prompts, grounded in a virtual environment where agents could actually move around and talk to each other. It was a really fun project because it was pure exploration. We didn't know what was going to happen; let's build it, simulate, and see. And it was interesting that many of the phenomena you see in social dynamics cropped up, like information diffusion: one agent announced they were running for mayor, and they go around talking and trying to convince other people, and so on. You could call that emergent behavior.

Going forward, one gap that needs to be filled is this: generative agents was about creating believable simulations. You look at it and it looks good. But if you could get simulations that were actually valid, in the sense that they reflected something in
reality, then I think you unlock a lot of new vistas. Believability is sufficient for, say, games, and that's going to be its own thing, but a lot of the potential is in having essentially a digital twin of society, which would allow you to run experiments. For example, take a COVID mask policy: let's apply it and see what happens to the agents. Or: we're going to create this law, and see what happens. Of course, we're still a bit away from being able to trust these things, but if we bet on these models getting better, in the future we might be able to simulate many of the decisions we would make. And even just being able to do social science studies, where you normally have to recruit a set of participants, which is slow and expensive, and often it's just done on college kids. If you could build a demographically diverse set of agents, you could do more. One thing I really like is that you can give someone both the treatment and the control, which you can't do with humans: the agents have the same underlying model, so you give one of them condition A, then you reset, you wipe the agent's memory, and give it condition B. You get a really clean control. Obviously this is just in simulation, but you can use it as a first pass to get some ideas and then do the actual studies.

It's so interesting. Even for clinical trials right now in medicine, they're trying to do this digital-twin type of work, where you give someone the treatment but then simulate what would have happened to them if they had stayed in the control group. What an
awesome set of applications. You could also imagine org design within a company, simulating it in different ways and seeing what happens.

Yeah. So I think there are two types of agents, and both are fascinating. The first is what we talked about earlier, agents like o1-style systems that are able to perform really difficult tasks. The other type of agent is about simulation: it's not about performing a particular task per se, but about mimicking what a human, an individual, is doing. The latter is maybe less well studied, but it has a whole host of potential applications that we're not really looking at right now.

Totally. What makes that type of simulation so different from the simulation people have been doing in the past? People would model diseases and things like that, but I guess that wasn't really in a context where each agent has its own decision-making and model of the world, and they're all going off and interacting with each other.

So, simulations have been around for a long time. Often, if you think about physical simulations or weather simulations, those are governed by physics or some equations. And agent-based modeling is an area where people create very stylized, simplistic models; you have to simplify, the system has to be simple. But for the first time, because we have these models, you can actually simulate something in much greater detail than was ever possible. That's the sort of thing that's potentially unlocked.

Do you think before we make major life decisions in the future, we'll run these simulations and see how they play out?

I think hopefully we can do that with potential investments. What happens if
there's your killer app right there.

I think it makes a lot of sense. Even now, let's say you're going to do a podcast, for example. I didn't actually do this, but you could ask a language model to simulate what the interviewer would say. Or I'm sure people, before they go on a date, use it as practice, or whatever. I think there are a lot of potential applications and benefits from being able to do accurate simulations. But I will say that we have to be very careful: the simulations currently are still quite far from the real thing, or at least we don't know how close they are.

Such interesting work. Switching gears to other parts of your research: you obviously spend a lot of time working on evaluations; we were talking about some of it earlier. I'd love to give our listeners some context on your thoughts on the state of evals today. And what happens as we get more of these agentic models? A lot of this has been built to benchmark LLMs; are we going to need something whole new? You talked about some of the capture-the-flag work you've done. Where are we headed?

Evaluations are exciting, but also a huge mess right now. It used to be so much easier: you had these datasets, there would be a train set and a test set, you train, you test, you get some number, and you'd be done. But one big problem right now is train-test overlap. We have no idea what's actually in the training data, and the companies won't tell us either. So with any evaluation I run, I constantly wonder, okay, how do I trust this? And even
if they didn't train on it exactly, maybe it was something similar; were there a lot of word problems of a certain type? So that's just hanging over everyone's heads.

That said, evaluation is a constantly moving target, because as models get better, there are new things we want the models to do that aren't captured by existing benchmarks. One thing I've been really excited about is how you can use language models themselves to benchmark language models.

Of course. At some point they're going to be so capable that we couldn't possibly come up with the latest frontier ourselves.

Right. In particular, this gets at the idea of coverage. What's really interesting about these models is that they claim they can do anything: give me any instruction and I'll follow it. It's unlike classic benchmarking, where you have just one task. And because it's so diverse, it's nearly impossible for someone to sit down and think of all the tasks. Even if you have user data, user data maybe doesn't stress the tails as much. But language models, you can get them to invent all sorts of things. We had a paper called AutoBencher that looks at how you can automatically generate inputs in a way that leverages side information: the language model generating the question has some piece of information that the test-taker doesn't know, so you can use this asymmetry to get a more sensible evaluation. Otherwise you're just doing self-evaluation, which is not trivial, and I think you want this extra layer of protection.

Another thing I've been thinking about in terms of
evaluation is that a lot of evaluation right now relies on fairly superficial judgments. You have A and B, each a long piece of text, and you just say B is better than A. That feels so unsatisfying. What I'd like to do, having taught for a while, is think more in terms of how we grade exams: we have the idea of a rubric. A rubric provides a standard for evaluation, where any student response, or in this case LLM response, is subject to it. That anchors evaluation in more concrete terms, as opposed to "is this good?" Well, tell me what I should be looking for. Overall, I think the evaluation landscape will need to evolve quite a bit beyond these standardized benchmarks, which everyone says they don't trust anyway; we should do something about that.

It makes sense that sitting in academia is actually a great place to develop an objective set of benchmarks that work for the industry. And I think we'll also see a ton of vertical-specific benchmarks, where, say, we don't care if our model is amazing at cutting-edge math problems; we care whether it can diagnose a medical image, or whatever the task is.

One thing that has happened: two years ago we put out HELM, the Holistic Evaluation of Language Models, which tried to cover all the different aspects manually. Since then, HELM has become more of a framework, and we've looked at different verticals. For example, we have a benchmark that looks at safety evaluations, there are other HELM leaderboards for other areas, and there are the different languages you might be interested in;
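The rubric idea described above, anchoring a judge to explicit criteria instead of a bare A-versus-B preference, can be sketched roughly as follows. This is a hypothetical illustration: the rubric items and scoring scale are made up, and in a real system the per-criterion check would be a call to a language model rather than the trivial keyword stubs used here so the sketch runs end to end.

```python
# Hypothetical rubric: each criterion has a name and a check that returns
# True/False. In practice the check would prompt an LLM judge with one
# criterion at a time.
RUBRIC = [
    ("cites evidence", lambda r: "because" in r.lower()),
    ("gives a concrete example", lambda r: "for example" in r.lower()),
    ("stays on topic", lambda r: len(r.split()) < 200),
]

def score_response(response: str) -> dict:
    """Score a response against each rubric criterion (1 = met, 0 = not met)
    and report the total, instead of a bare 'A is better than B'."""
    per_criterion = {name: int(check(response)) for name, check in RUBRIC}
    per_criterion["total"] = sum(per_criterion.values())
    return per_criterion

a = "Paris, because it is the capital. For example, the government sits there."
b = "Paris."
print(score_response(a)["total"], score_response(b)["total"])  # prints: 3 1
```

The payoff is that a disagreement between judges (human or model) can be traced to a specific criterion, rather than being hidden inside an opaque overall preference.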
We're doing one for medical and one for finance. So I do think that in any given vertical there's actually quite a bit to do, and developing something more tailored to that setting makes sense.

Makes sense. And speaking of verticals — for some of these regulated industries, I think a problem they have with LLMs is just how black-box it all is. I know you've worked on interpretability before, and I can't tell whether interpretability at this point is a lost cause, because these models are so complex and we have so little knowledge of what's going on in them. How do you think that evolves as these models make loans or credit decisions — the ability to understand them?

Yeah, there's a lot to unpack here, and I feel like the problem has been exacerbated. When we started working on interpretability back in 2017, at least you got the model weights and the training data. You could look at that, and even then it was hard to understand why a model was making a prediction, because deep learning is difficult to work with. Now, on top of that, we often don't have the weights and we don't have the data, so it adds another layer of opacity — as if it weren't hard enough already. That's the world we're living in.

In terms of what interpretability means: there's a school of thought, a body of work, around mechanistic interpretability, which tries to really understand the individual neurons inside a network and make sense of what's happening. I find it very interesting when you discover that this neuron is responsible for this particular style, but that's more about science. I think interpretability has two audiences. One is scientific understanding — you're just curious what's going on. The other is, like you said, regulated industries: the model makes a decision and you want to know why, or you're developing a model and you want to understand, debug, and fix a problem.

There are other approaches too. For example, in 2017 we worked on influence functions as a way to attribute why a model makes a prediction in terms of its training examples. That work has been adapted to language models, but it's very difficult to scale.

That's like, "I made this decision because I saw these ten examples."

Yeah — basically, the idea is: which training examples were most influential for making this prediction? The thought experiment is: if I took an example and removed it from the training set, would the model still make the prediction? If it wouldn't, that example was influential; if removing it basically didn't change anything, that example doesn't matter. There are subtleties in how you compute this and how you interpret it, but that's one flavor of interpretability that I think is useful. Of course, if your training data is private, you don't want a system that says you got diagnosed with a disease because of this guy over here — because of this random Reddit thread. So it's difficult.

Another tack on interpretability is explanations: you can do chain of thought, and it generates some explanations. But there's all this work showing that explanations are maybe not what's actually happening —

Kind of like people.

Kind of like people, yeah. At least with agent architectures, you can think of explanations as a bottleneck: if you have things that are more modular, and all that gets transmitted between the pieces is explanations, then at least you know how the pieces are moving.
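The leave-one-out thought experiment behind influence functions can be made concrete with a brute-force sketch. This is a toy illustration on a tiny hand-rolled logistic regression — real influence functions approximate the effect of removing a point without retraining, which is exactly the scaling problem mentioned above; the data and hyperparameters here are invented.

```python
# Brute-force leave-one-out influence: retrain with each training example
# held out and measure how much the prediction on a test point moves.
import math

def fit(data, lr=0.5, steps=400):
    """Two-feature logistic regression via batch gradient descent."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in data:
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1])))
            g[0] += (p - y) * x[0]
            g[1] += (p - y) * x[1]
        w = [w[0] - lr * g[0] / len(data), w[1] - lr * g[1] / len(data)]
    return w

def predict(w, x):
    return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1])))

def influences(data, x_test):
    """How much dropping each training point changes the test prediction."""
    base = predict(fit(data), x_test)
    return [base - predict(fit(data[:i] + data[i + 1:]), x_test)
            for i in range(len(data))]

data = [((1.0, 0.2), 1), ((0.8, -0.1), 1), ((-1.0, 0.3), 0), ((-0.7, -0.4), 0)]
print(influences(data, (0.9, 0.0)))  # large |value| => influential example
```

Retraining once per training example is O(n) full training runs, which is why this exact recipe is only feasible on toy problems; the influence-function line of work replaces the retraining with a gradient/Hessian approximation.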
When there are open models that do something similar to o1 and you can see the steps, you may have some ability to explain how something came about. But where I'm at with interpretability is: in order to even do interpretability, we need to have the weights and the training data. Get back to 2017-era levels of access, and then you can answer a lot of these questions.

Maybe one other question on the research side, before we move on to Together and all the other amazing things you've been up to — and this is just because I'm very curious. There's been the Transformer, and then Mamba, and these different model architectures. One, how enduring do you think those will be? And two, when new model architectures come out, is that due to breakthroughs in math, or experimentation — where do they come from? I think back to, okay, there was the vanishing gradient problem, and people made new adjustments for that. Is it math, or experimentation — how are people coming up with these new architectures?

Yeah, so historically a lot of architectures — LSTMs, convnets, Transformers — came from intuition: you train, you think about the gradients and how that should work, and you do some experimentation. What's interesting about the Mamba, or state space model, line of work is that it actually came from math. It was Albert and Chris and others trying to answer the question: if you have a sequence of points, how do you maintain an online update of the best polynomial fit? It was just math. They solved the problem, and then they figured out how to get it to actually work at scale inside a neural net, which is kind of
interesting, because even though the math doesn't apply to the actual model, it was a source of inspiration for the architecture. In terms of where all these architectures go, I don't have terribly strong opinions — I don't really do architecture work — but I would bet on Transformers and so on not being the thing we keep using in the future, just because it's somewhat arbitrary and you can change it. Whether there will be super large changes remains to be seen. What's maybe more interesting is other regimes — for example video, where you can't just do the naive thing of a giant Transformer, so that's where you're more likely to see innovations. Or, now that we're in this more agentic setting, maybe there are architectures that fit better into the mold of a larger system. I think structurally something has to change for something really new to come out. For example, in vision, if you just want to do simple image classification, convnets are totally fine — Transformers wouldn't have been invented unless people had really tackled machine translation, which is a much different problem. So looking at different types of problems is a good way to look for where architectural innovations will happen.

Super interesting. Another hat you wear, obviously, is co-founder of Together, and one thing I'm struck by is that this new o1 paradigm of test-time compute — trying a lot of different scenarios — seems like it has big implications for the inference
market, right? It's quite different from the current setting. So what impact do you think that has? Are a lot of these breakthroughs, like FlashAttention, still relevant for this new class of models? What happens to the inference market with this chain-of-thought paradigm?

I see inference as a very low-level primitive that you just need to get right: you make it robust, you make it cheap, and that's a lot of what Together focuses on right now. In some sense everything needs inference. If you train a model, you need forward passes — inference. If you're doing any sort of agentic flow, you're doing inference. If you're generating synthetic data to further tune the model, you're doing inference. So it's a building block that needs to be stable. Of course, over time you think about what the inference is for, and then you start operating at a different abstraction. For example, a lot of the inference market right now is essentially serving Llama 3. But why are we serving Llama 3? Because it's a good model — but why Llama 3, why not a different model? What do people actually care about? They care about whether the model works for their use case. They don't care that it's Llama 3; Llama 3 just gives a stamp of approval that this is not a totally sucky thing. Over time, what's interesting is that if you can adapt the model to a particular use case, you can potentially get much faster and better performance. And the agentic workflows open up opportunities for further optimization — for example, the high-throughput setting, if you want to
just generate a huge number of possibilities — then that is something you could optimize for. So the inference optimizations could definitely take into account what's happening at the level above.

From the Together perspective, you think of inference optimization as this prerequisite to then do all this really interesting work?

So the Together story is: you need GPUs — that's the lowest level — then you need inference, and then you have the ability to fine-tune, and what we think of as customizing models. It's generally pretty hard to take, say, Llama 3 405B and make it 10 times faster while it still does everything the original model does. But if you say, take this one — I don't know, whatever call-center application — and you're going to run it a lot of times, there's much more room for squeezing out performance.

You alluded to this earlier — it feels like we need some sort of algorithmic breakthrough, or change from Transformers, to get to the next step of reasoning. I'm curious what milestones are meaningful to you — people talk about reasoning as this broad term. And, to ask a shameless question: how far away do you think we are from some of these milestones?

Well, starting with the benchmarks we have — Cybench for cybersecurity, or MLAgentBench for solving ML research tasks — I think these are still good trackers of performance, because performance is well below what it could be. So that's one way to track progress in the near
term, and of course we'll continue developing benchmarks over time. But if you want to go farther out: solving some really open problem — say, an open math problem — doing something that's beyond what humans have done. Basically anything that extends human knowledge is really, really interesting, because a lot of what AI has been about is catching up with humans in some sense. You're mimicking humans, and we're getting past the point where we're mimicking even expert humans. All this stuff like solving IMO problems — well, people have already done that; clearly, because people made the problems. But if you can actually discover something new — create new research, solve something that hasn't been solved, or, in the cybersecurity case, find a zero-day — then that's a game changer.

Do you feel like — I mean, everyone's asking — are we getting to some plateau of model capabilities, or does it feel like we're still on this crazy exponential?

I think things are still moving up pretty quickly. Locally it always feels like — well, locally everything's linear, in some sense. But I think it's going to be more than just naive scaling — oh yes, of course, models will get bigger, more data, and so on. There will be qualitative changes as well. For example, o1 represents a somewhat different way you might think about using these systems. And on top of that, chips are going to get more powerful and cheaper, so that's going to keep driving the whole thing.

Yeah. Another category of models you've spent time in: you
worked on open-source models for robotics, and I think there are a lot of varying opinions on where we are with robotics foundation models, and whether we're close to some sort of ChatGPT moment in that space. What do you think?

Yeah, I think we're definitely not at a ChatGPT moment for robotics today. We're closer to a kind of BERT-era moment for robotics, where the idea of taking vision-language models, fine-tuning them, and getting foundation models for robotics is effective — but you still have to fine-tune them for narrow tasks, and the resulting policies are still very brittle, in ways that aren't the case in language. So I think it'll be a few more years. The good news is there's a lot more interest and funding for this, and there are a bunch of data collection efforts, so things are moving. But hardware is hard, so it will take some amount of time. I'm optimistic, though — I don't see any fundamental obstacles, any reason we can't get to a ChatGPT moment for robotics. It's just a question of time.

Yeah, there's no internet of robotics data available — you just have to do the hard work.

You have to do it the hard way, yes. And you're still betting on emergent capabilities — until they emerge, it's hard to know when and how they might appear. One thing that's maybe good is that we can lean on language and vision to drive the architectures, the data, and the compute recipes, so that when we have the right data and problem formulation for robotics, we have infrastructure we can just borrow. And furthermore, my hope is that a lot of the so-called robotics problems are actually just language and vision
problems. So if you can factor out a lot of that stuff, you can potentially make robotics a lot more efficient.

Totally — you shouldn't need robotics data to learn what a cup is. You need robotics data to figure out how to manipulate and grasp a cup, but identifying a lot of the high-level stuff — that should all come from language and videos.

One thing about your background: I understand you're quite a talented classical musician, and there have been a lot of really interesting developments in the AI music world — Suno and Udio and all these companies. As someone who's really passionate about music, where do you think that space goes?

Yeah, that's interesting. Certainly the same recipe — train a giant transformer on some data — is effective there. But there are a few things that are challenging for music. One is copyright, which is a much bigger hurdle there — that's a whole can of worms. The other is control. I had a postdoc in my lab, John Thickstun, who worked on foundation models for music, and the emphasis there was on models you can actually control. We didn't want just unconditional generation, or even just textual conditioning — having it just generate the music felt like it wasn't giving artists enough control over what was happening. So we built a generalized infilling model, called the Anticipatory Music Transformer, where you can condition on any subset of the musical events in the score: condition on the melody and generate the harmony, or condition on one part and fill in a section.

Cool.

And one direction this could go is basically a co-pilot for musicians, to
help composers and artists create their works — just like people use GitHub Copilot. I think that's one pretty interesting area. In terms of how it intersects with my personal musical aspirations: many years ago I was trying really hard to compete in piano competitions, but I was doing it while doing a PhD, so I didn't have enough time to practice. I had these musical ideas — I wanted it to sound a certain way, I knew what I wanted, but my fingers wouldn't do it. So at a personal level I'd be really excited about how these tools could help me realize my musical vision. It's not that I want to be lazy and not practice — it's an augmentation. I'm not sure what that would look like exactly, but because I play classical music, it's certainly very far away from just pushing a button and having some music come out. There are a lot of subtleties, and there isn't that much data for classical music, so there will be some challenges there.

Yeah — and I imagine these AI piano teachers in the future too. I know everyone always has complex relationships with their piano teachers, so I think that'll be fun.

Yeah, in general, on the point of teachers: they can be extremely good as teachers and coaches, and the use in education is something I'm very bullish on and excited about.

It's really cool when you see even those GPT-4 demos of drawing out a math problem and having it explain it.

Yeah, I use it for teaching my kids all the time: how do I explain this concept, which I know, to a five-year-old? It's so good at that — just breaking
down complicated things and making them simple. It's very impressive.

Well, we always like to end our interviews with a quick-fire round where we get your take on a few things. To kick it off: what's one thing that's overhyped and one that's underhyped in AI today?

Agents, and agents.

Yeah, I feel like we've gone through a full hype cycle and then some with agents — they've been on both sides of the question. You've written about ML agents doing their own ML research before, and people always talk about that as a hypothetical step on the path to AGI. How far away do you think we are from ML agents contributing novel insights into ML work? And I know this is supposed to be a quick-fire round — that was a shamelessly broad question.

I was laughing in my head, because I was like, did you just ask about AGI in the quick-fire round? Usually it's a yes-or-no question. So — it depends on what you mean by novel knowledge, but I don't think we're very far. To the extent that you want an agent that runs an experiment — say, runs an ablation experiment to answer some question — I think we can do that already. Now the question is — you can think of it as being at the level of a very junior student — how can the model come up with new experiments to run, and take us in new directions? I'm optimistic that you can do something meaningful there in the next few years. If you think about coding: it has certainly created new code and helped people code, and that's maybe a simpler version of the same thing. So I think the same will happen to research — I don't think
it's that different.

Totally. What application areas do you feel are underexplored right now, on top of these models?

Classically, a lot of AI applications are driven by commercial needs — everyone wants their RAG solution to question answering, summarization, and so on. But fundamental science and scientific discovery, improving researcher productivity — these are obviously not as well studied, but I think they're important, because they can feed into the whole cycle of improving the ecosystem.

Pat, that was a fun one.

Super fun.

So do I understand correctly that you took Percy's class?

I did take Percy's class — CS 221, which was kind of the first AI class that's offered. So yeah, I took his class back in the day.

I feel like even just from that recording I can tell he was probably a very good professor.

He is a very good teacher, and the class is very well done, for sure.

That's awesome. What were the takeaways for you? What stood out?

I love this work he did around generative agents — this kind of Sims world he created, where when you throw a bunch of agents at each other, things emerge in what they start doing and how they interact with each other.

Totally. And I loved this idea that in the future we'll have agent twins that can simulate a policy — or even, he talked about dating, and I was thinking about that Black Mirror episode where you simulate a bunch of dates with a ton of people and figure out who's the right person.

Totally — you could imagine that instead of dating apps, you just send your digital agent to go do the work for you.

Yeah — you're like, that looked like a fun date with that person; that one, I don't know if I want to go on. Yeah, his
whole take on simulation was really interesting — just what it could mean for the future if you can actually simulate how humans or agents would speak or communicate or behave in these different worlds. I thought that was really fascinating.

Yeah, I feel like there's always been this dream of central planning — government figuring out how to move every lever in the economy — and it never works. And you wonder, if these agents get better and better, whether governments would at least be more informed in projecting how things might go.

I thought it was really interesting what he said about having academic research be orthogonal to the progress the model vendors are making — and we talked in the episode about that being similar to startups. I thought it was interesting how he approached that from a practical standpoint, when you think about how many more resources — at least compute resources — a Google or OpenAI may have than Stanford and some of the other academic institutions, and figuring out where they can still push the envelope. That dovetailed into one of the things he mentioned around AI research: thinking through the whole system versus just the model, and how that's changed over time as these models have gotten into the real world. I thought that was fascinating.

Yeah, I love that he's telling his students what VCs tell founders, which is: be in the path of progress of the models. You should want GPT-5 to come out and be really good, because that makes your work even better. For any type of work — whether it's academia or company building — being on the path of progress ends up being super important.

Totally. He just does so many different things. I
like all the work he's done around evals — HELM is one of these clear benchmarks, and now the capture-the-flag work he's doing as well. I love this idea that we're getting to a point of model capability where we may just need models to write evals for models — you and I couldn't figure out how; they're smarter than us. And the capture-the-flag work is a really cool example that's very easy to understand, and you can see how it would be a good test of model capability: there's some objective out there, and how long does it take you to reach it? I thought that was interesting too.

And then his takes on o1, which has taken the space by storm in the last week. His point that it's kind of this first step of running much more compute at inference — simulating a bunch of different steps — and how we've come full circle with AlphaGo and a lot of that work. But it also presents challenges to academia, because OpenAI has hidden the steps, right? What's actually going on behind the scenes is hard to study, and that makes it hard for interpretability — so I wonder when we'll see an open-source equivalent. Although he got me thinking that, from an interpretability standpoint, for some of these industries like healthcare or finance, this might actually be a step in a good direction: at least the model is explaining itself step by step along the way, and that could be helpful, rather than trying to identify the one neuron in 70 billion that's driving something.

Yeah, totally. I thought his take on the Transformer and model architectures was interesting — and we've heard this from other people — kind of this belief in
future model architectures, and he seemed to think those would come from video or other domains that are going to push the existing architectures to break. I thought that was pretty interesting. I also thought it was cool — maybe this is just the nerd in me coming out — that Mamba was actually driven by a math breakthrough. It was fascinating that they were sitting down looking at the math of how these things work, and that led to new architectures. Did you understand the math he was describing? Because I was like, you have the master's from Stanford — you might have gotten it for me.

I was like, I'll just nod along here. There is some important math that seemed to have influenced the model architecture, but no, it's been a long time since I've done any real math. Although I appreciated his very subtle calculus hint — what did he say? When you're looking locally, everything looks linear.

Yeah, exactly. I still have that one.

Maybe we'll ask one of these really advanced reasoning models to explain to us what in the world happened.

Yeah — explain it like we're 12 years old. Even younger for me.

What did you think of his take on AI in the music world?

I love this idea that there's something in your head that you can't quite get out into the world, and that there's a way to communicate it to AI that enables you to. And it's this co-pilot vision, rather than one prompt and you have a song or a picture — there's something you want to create, and not all of us can practice six hours of piano a day to get there, or go to art school, but can you still express that creativity within you? I thought that was pretty cool.

Yeah — people like Percy may have beautiful songs in their head.
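The generalized infilling interface Percy described earlier — fix any subset of musical events and generate the rest — could be sketched like this. This is a toy illustration only: a hard-coded harmonization rule stands in for the learned model, and the `(time, pitch)` event encoding is my simplification, not the Anticipatory Music Transformer's actual representation.

```python
# Toy generalized infilling for music: a score is a list of
# (time, pitch) events; events the user fixes stay put, and any
# event whose pitch is None gets filled in. A trivial rule (copy the
# nearest known pitch, down a major third) stands in for the model.
def infill(events):
    """events: list of (time, pitch-or-None); returns the completed score."""
    known = [(t, p) for t, p in events if p is not None]
    out = []
    for t, p in events:
        if p is None:
            # stand-in for the model: nearest known pitch minus 4 semitones
            nearest = min(known, key=lambda kp: abs(kp[0] - t))
            p = nearest[1] - 4
        out.append((t, p))
    return out

melody = [(0, 60), (1, None), (2, 64), (3, None)]  # MIDI 60 = middle C
print(infill(melody))
```

The interface is the interesting part: because conditioning is just "which pitches are None," the same call covers melody-to-harmony, filling in one part, or completing a missing section.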
I feel like I would need a lot more than AI to help me get the songs that are in my head out — "Shake It Off" for the 17th time.

Exactly. Why do you keep coming up with Taylor Swift songs?

We brought up Taylor Swift — it's actually contractually obligated that in our debriefs it comes up one way or another.

Yeah, that was a really fun episode with Percy. Folks, definitely stay tuned — next week we have a pretty sweet lineup of guests coming up.

[Music]