Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and today we're in the new studio with my usual co-host, swyx from Smol AI.

Hey, and today we are very blessed to have Erik Schluntz from Anthropic with us. Welcome.

Hi, thanks very much. I'm Erik Schluntz, I'm a member of technical staff at Anthropic, working on tool use, computer use, and SWE-bench.

So how did you get into the whole AI journey? I think you spent some time at SpaceX as well, and robotics. There's a lot of overlap between the robotics people and the AI people, and maybe some shared interest in language models for robots right now. Maybe just a little bit of background on how you got to where you are.

Yeah, sure. I was at SpaceX a long time ago, but before joining Anthropic I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots: five-foot-tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything; we would just call a remote operator if we saw anything. We have about a hundred of those out in the world, and we had a team of about a hundred. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me ten years ago that AI would be writing a lot of my code, I would have said, hey, I think that's AGI. And so I kind of realized that we had passed this level: wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models, so I ended up taking a sabbatical, doing a lot of reading and research myself, and decided, hey, I want to go be at the core of this, and joined Anthropic.

And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies?

Yeah, I think at the time I was a little burnt out on robotics. Also, for the rest of this, any negative things I say about robotics or hardware are coming from a place of burnout, and I reserve my right to change my opinion in a few years. I looked around, but ultimately I knew a lot of people at Anthropic that I really trusted and thought were incredibly smart, and I think that was the big deciding factor: hey, this team's amazing. They're not just brilliant, but also some of the nicest and kindest people that I know, so I felt like it would be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure I don't build something that's used for bad purposes, and I felt the best chance of that was joining Anthropic.

From the outside, these labs kind of look like huge organizations with these obscure ways of organizing themselves. How did it work when you joined Anthropic? Did you already know you were going to work on SWE-bench and some of the stuff you've published, or did you join and then figure out where you'd land? I think people are always curious to learn more.

Yeah, I've been very happy that Anthropic is very bottoms-up and very receptive to whatever your interests are. I joined being very transparent: hey, I'm most excited about code generation and AI that can actually go out and touch the world, or help people build things.
And, you know, those weren't my initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed, so let me find the balance of those. So I was working on lots of things at the beginning: function calling, tool use. And then, as it became more and more relevant, I was like, oh hey, it's time to go work on coding agents, and started looking at SWE-bench as a really good benchmark for that.

So let's get right into SWE-bench; that's one of the many claims to fame. I feel like there's just been a series of releases around Claude 3.5 Sonnet. About two or three months ago 3.5 Sonnet came out, and it was a step ahead; a lot of people immediately fell in love with it for coding. And then last month you released a new, updated version of Sonnet. We're not going to talk about the training for that, because that's still confidential, but I think Anthropic has done a really good job of applying the model to different things. You took the lead on SWE-bench, and then we're also going to talk a little bit about computer use later on. So maybe just give us some context on why you looked at SWE-bench Verified, and how you came up with the whole system for building agents that would use the model maximally well.

Yeah, so I'm on a sub-team called product research, and the idea of product research is to really understand what end customers care about and want in the models, and then work to try to make that happen. So we're not focused on the more abstract, general benchmarks like math problems or MMLU; we really care about finding the things that are really valuable and making sure the models are great at those. Because I've been interested in coding agents, I knew this would be a really valuable thing, and I knew there were a lot of startups and customers of ours trying to build coding agents with our models, so I said, hey, this is going to be a really good benchmark to be able to measure that and to do well on. I wasn't the first person at Anthropic to find SWE-bench; lots of people already knew about it and had done some internal efforts on it. It fell to me to both implement the benchmark, which is very tricky, and to make sure we had an agent, basically a reference agent I'd call it, that could do very well on it. Ultimately, we wanted to publish how we implemented that reference agent so that people can build their own agents on top of our system and get the most out of it. So with the blog post we released on SWE-bench, we released the exact tools and the prompt that we gave the model to do well.

For people who don't know, who maybe haven't dived into SWE-bench: I think the general perception is that it's tasks a software engineer could do. I feel like that's an inaccurate description, because it's basically, one, a subset of twelve repos, and it's everything they could find, every issue with a matching commit that could be tested. So that's not every commit. And then SWE-bench Verified is further manually filtered by OpenAI. Is that an accurate description? Anything you'd change about it?

Yes, SWE-bench certainly is a subset of all tasks.
First of all, it's only Python repos, so it's already fairly limited there, and it's just twelve of these popular open-source repos. And then, yes, it's only ones where there were tests that passed at the beginning, plus new tests that were introduced to test the new feature that's added. So it is, I think, a very limited subset of real engineering tasks, but I also think it's very valuable, because even though it's a subset, they are true engineering tasks. A lot of other benchmarks are really much more artificial setups: even if they're related to coding, they're more like coding-interview-style questions or puzzles, which I think are very different from what you end up doing day to day. I don't know how frequently you all get to use recursion in your day-to-day job, but whenever I do, it's a treat. It's almost comical, and a lot of people in the industry joke about this, how different interview questions are...

Dynamic programming.

Yeah, exactly, from the day-to-day job. But I think one of the most interesting things about SWE-bench is that all these other benchmarks are usually isolated puzzles where you're starting from scratch, whereas in SWE-bench you're starting in the context of an entire repository. It adds this entirely new dimension to the problem: finding the relevant files. And this is a huge part of real engineering. It's actually pretty rare that you're starting something totally greenfield; you need to go figure out where in a codebase you're going to make a change, and understand how your work is going to interact with the rest of the system. I think SWE-bench does a really good job of presenting that problem.

Why do we still use HumanEval? It's at like 92%, and I don't even know if you can actually get to 100%, because some of the data is not solvable. Do you see benchmarks like that getting sunset? When you look at model releases, it's, oh, it's 92 instead of 89 or 90% on HumanEval, versus SWE-bench Verified, where you're at 49%, and 45% was state of the art before that, and maybe six months ago it was around 30%, something like that. Is that a benchmark you think is going to replace HumanEval, or do you think they'll just run in parallel?

I think there's still a need for a variety of different evals. Sometimes you do really care about just greenfield code generation, so I don't think everything needs to go to an agentic setup; it would be very expensive. The other thing I was going to say is that SWE-bench is certainly hard to implement and expensive to run, because for each task you have to parse a lot of the repo to understand where to put your code, and a lot of times it takes many tries of writing code, running it, and editing it. It can use a lot of tokens compared to something like HumanEval. So I think there's definitely a space for these more traditional coding evals that are easy to implement, quick to run, and still give you some signal, and hopefully harder versions of HumanEval get created.

How do we get SWE-bench Verified to 92%? Is that something where there's a line of sight to it, or do a whole lot of things need to go right?

Yeah. Actually, maybe I'll start with SWE-bench versus SWE-bench Verified, which I think is something I missed earlier.
So SWE-bench is, as we described, this big set of tasks that were scraped, like 12,000 or something; I think it's 2,000 in the final set. But a lot of those, even though a human did them, are actually impossible given the information that comes with the task. The most classic example is when the test looks for a very specific error string, you know, assert message equals "error something something something", and unless you know that's exactly what you're looking for, there's no way the model is going to write that exact same error message, so the tests are going to fail. SWE-bench Verified was made in partnership with OpenAI: they hired humans to go review all these tasks and pick out a subset, to try to remove any obstacle like this that would make a task impossible. So in theory, all of those tasks should be fully doable by the model. They also had humans grade how difficult they thought each problem would be: less than 15 minutes, 15 minutes to an hour, an hour to four hours, and greater than four hours. So you get an interesting sense of how big each problem is as well.

To get SWE-bench Verified to 90%, maybe I'll start with some of the remaining failures I see when running our model on it. I'd say the biggest case is the model operating at the wrong level of abstraction. What I mean by that is the model puts in maybe a smaller band-aid when really the task is asking for a bigger refactor. Some of that is the model's fault, but a lot of times, if you're just seeing the GitHub issue, it's not exactly clear which way you should go. So even though these tasks are possible, there's still some ambiguity in how they're described. That being said, I think in general language models will frequently produce a smaller diff when possible, rather than trying to do a big refactor. Another area: the agent we created didn't have any multimodal abilities, even though our models are very good at vision, so I think that's just a missed opportunity. If I read through some of the traces, there are some funny things, especially on the Matplotlib tasks — a graphing library — where the test script will save an image and the model will just say, okay, looks great, without actually looking at it. So there's certainly extra juice to squeeze there, making sure the model really uses all the sides of the input it's given, including multimodal.

But yeah, getting to 92%: this is something I have not looked at but am very curious about. I want someone to look at the union of all the different tasks that have been solved by at least one attempt at SWE-bench Verified. There are a ton of submissions to the benchmark, so I'd be really curious to see how many of those 500 tasks at least someone has solved. I think there are probably a bunch that none of the attempts have ever solved, and it would be interesting to look at those and ask: hey, is there some problem with these? Are they impossible, or are they just really hard and only a human could do them?

Yeah, specifically: is there a category of problems that are still unreachable by an LLM agent?

Yeah, I think there definitely are. The question is whether those are just fairly inaccessible, or whether they're impossible because of how they're described.
But I think certainly some of the tasks, especially the ones the human graders reviewed as taking longer than four hours, are extremely difficult. We got a few of them right, but not very many at all across the benchmark.

And did those runs take less than four hours?

They certainly took less than four hours, yeah.

Is there a correlation between the length of the run and the human-estimated time? Or do we have more of a Moravec's-paradox-type situation, where something is super easy for a model but hard for a human?

I actually haven't done the stats on that, but I think it would be really interesting to see: how many tokens does it take, how is that correlated with difficulty, and what is the likelihood of success versus difficulty? Actually, here is a really interesting thing I saw. One of my co-workers who was also working on this, named Simon, was focusing just on the very hard problems, the ones said to take longer than four hours. He ended up creating a much more detailed prompt than I used, and he got a higher score on the most difficult subset of problems, but a lower score overall on the whole benchmark. The prompt that I made, which is much simpler and more bare-bones, got a higher score on the overall benchmark but a lower score on the really hard problems. I think some of that is that the really detailed prompt made the model over-complicate a lot of the easy problems, because honestly, a lot of the SWE-bench problems really do just call for a band-aid. It's like, hey, this crashes if this is None, and really all you need to do is add a check for None. Sometimes, if you try to make the model think really deeply, it'll think in circles and over-complicate something, which certainly human engineers are capable of as well. But I think there's something interesting there: the best prompt for hard problems might not be the best prompt for easy problems.

How do we fix that? Are you supposed to fix it at the model level? How do I know which prompt I'm supposed to use?

I'll say this was a very small effect size, so I don't think it's worth obsessing over. But I would say that as people build systems around agents, the more you can separate out the different kinds of work the agent needs to do, the better you can tailor a prompt to each task. For instance, if you were trying to make an agent that could both solve hard programming tasks and also just write quick test files for something someone else had already made, the best prompts for those two tasks might be very different. I see a lot of people build systems where they first do a classification and then route the problem to two different prompts. That's very effective, because, one, it makes the two prompts much simpler and smaller, and two, it means you can have someone work on one of the prompts without any risk of affecting the other task. So it creates a nice separation of concerns.
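For readers who want to see what that classify-then-route pattern can look like, here is a minimal sketch using the Anthropic Python SDK. The model ID, the category names, and the prompts are placeholders for illustration, not anything from Anthropic's actual system.

```python
# Minimal sketch of "classify first, then route to a task-specific prompt".
# Model id, categories, and prompt text are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder model id

PROMPTS = {
    # A small, focused prompt for easy fixes...
    "small_fix": "You are fixing a small bug. Prefer the most minimal change that resolves the issue.",
    # ...and a more detailed prompt reserved for bigger refactors.
    "refactor": "You are doing a significant refactor. Plan the change across files before editing anything.",
}

def classify(task: str) -> str:
    """Ask the model which kind of work this is, so we can pick a tailored prompt."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=10,
        system="Answer with exactly one word: small_fix or refactor.",
        messages=[{"role": "user", "content": task}],
    )
    label = resp.content[0].text.strip()
    return label if label in PROMPTS else "small_fix"

def solve(task: str) -> str:
    """Route the task to the prompt tailored for its category."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=PROMPTS[classify(task)],
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text
```

The point of the split is exactly the separation of concerns described above: each prompt stays small, and you can tune one without touching the other.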
Yeah. And the other model behavior thing you mentioned: they prefer to generate shorter diffs. Why is that? I think that's the lazy-model question people have: why are you not just generating the whole code instead of making me fill in the rest?

Exactly, like a conspiracy theory.

Yeah. So there are two different things there. One is, I'd say, doing the easier solution rather than the harder solution. The second one, which I think is what you're talking about with the lazy model, is when the model says, dot dot dot, "rest of the code remains the same."

Right, and I'm like, thanks, dude.

I think honestly that just comes from the fact that people on the internet will do stuff like that. If you're talking to a friend and you ask them to give you some example code, they would definitely do that; they're not going to write out the whole thing. So sometimes you actually do just want the relevant changes, and the models aren't good at mind-reading which one you want. The more explicit you can be in prompting — say, hey, give me the entire thing, no elisions, versus just give me the relevant changes — the better, and that's something we want to make the models ever better at: following those kinds of instructions.

I'll drop a couple of references here. We're recording this a day after Lex Fridman dropped his five-hour pod with Dario and Amanda and the rest of the crew, and Dario actually made this interesting observation: we complain about models being too chatty in text and then not chatty enough in code. Getting that right is kind of an awkward bar, because you don't want it to yap in its responses, but you also want it to be complete in code — and then sometimes not complete, because sometimes you just want the diff, which is something Anthropic has also released with the fast edit stuff you guys did. The other thing I wanted to double back on is the prompting. You said it was a small effect, but it was a noticeable effect in terms of picking a prompt. We'll go into SWE-agent in a little bit, but I kind of reject the idea that you need to choose one prompt and have your whole performance be predicated on that one prompt. Something Anthropic has done really well is meta-prompting: prompting for a prompt. So why can't you just develop a meta-prompt for all the other prompts? If it's a simple task, make a simple prompt; if it's a hard task, make a hard prompt. Obviously I'm hand-waving a little bit, but I will definitely ask people to try the Anthropic Workbench metaprompt if they haven't tried it yet. I went to the Build Day recently at Anthropic HQ, and it's the closest I've felt to an AGI learning how to operate itself. It's really magical.

Yeah, Claude is great at writing prompts for Claude.

The way I think about this is that humans, even very smart humans, still use checklists and scaffolding for themselves. Surgeons still have checklists even though they're incredible experts. And certainly a very senior engineer needs less structure than a junior engineer, but there's still some structure you want to keep. So I always try to anthropomorphize the models and think about what the equivalent would be for a human. That's how I think about these things: how much instruction would you give a human with the same task? Would you need to give them a lot of instruction, or just a little?
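The Workbench metaprompt itself is a much longer, carefully engineered template, but the core idea can be sketched in a few lines: ask Claude to draft the prompt that a later call will use, simple prompts for simple work and detailed prompts for hard work. Everything below — the model ID, the wording, the helper name — is an illustrative assumption, not Anthropic's metaprompt.

```python
# Sketch of the metaprompting idea: have Claude write the prompt another call will use.
# Not Anthropic's Workbench metaprompt; model id and wording are placeholders.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder

def write_prompt_for(task_description: str) -> str:
    """Ask the model to draft a system prompt tailored to one kind of task."""
    content = (
        "Write a system prompt for an AI assistant that will perform the following kind of task:\n\n"
        f"<task>{task_description}</task>\n\n"
        "Include step-by-step guidance and one worked example. Return only the prompt text."
    )
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text

# Usage: a lightweight prompt for simple work, a detailed one for hard work.
easy_prompt = write_prompt_for("Write a quick unit test file for an existing function.")
hard_prompt = write_prompt_for("Resolve a GitHub issue that requires a multi-file refactor.")
```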
Let's talk about the agent architecture, maybe. So first, runtime: you let it run until it thinks it's done, or until it reaches the 200K context window. How did you come up with that?

Yeah. I'd say a lot of previous agent work built very hardcoded and rigid workflows where the model is pushed through certain flows of steps, and to some extent that's needed with smaller models and models that are less smart. But one of the things we really wanted to explore was: let's really give Claude the reins here, not force Claude to do anything, but let Claude decide how it should approach the problem and what steps it should take. So what we did is really the most extreme version of this: just give it some tools that it can call, and let it keep calling the tools, keep thinking, and keep doing that until it thinks it's done. That's the most minimal agent framework we came up with, and I think it works very well. Especially the new Sonnet 3.5 is very, very good at self-correction; it has a lot of grit. Claude will try things that fail and then come back and try different approaches, and that's something you didn't see in a lot of previous models. Some of the existing agent frameworks I looked at had whole systems built to detect loops — oh, is the model doing the same thing more than three times, then we have to pull it out — and I think the smarter the models are, the less you need that kind of extra scaffolding. So, just giving the model tools and letting it keep sampling and calling tools until it thinks it's done was the most minimal framework we could think of, and that's what we did.

So you're not pruning bad paths from the context? If it tries to do something and fails, you just burn all those tokens, too bad?

Yes. I would say the downside of this is that it's a very token-expensive way to do it.

But still, it's very common to prune bad paths, because models get stuck.

Yeah, but I'd say 3.5 Sonnet is not getting stuck as much as previous models, and so we wanted to at least try the most minimal thing. Now, I would say this is definitely an area for future research, especially if we talk about problems that would take a human more than four hours. Those might be cases where we're going to need to prune bad paths to let the model accomplish the task within 200K tokens. So there's certainly future research to be done in that area, but it's not necessary to do well on these benchmarks.

Another thing I always have questions about on context windows: there's a mini cottage industry of code indexers that have sprung up for large codebases, like the ones in SWE-bench. You didn't need them?

We didn't, and I'd say there are two reasons for this. One is SWE-bench specific, and the other is more general. The more general thing is that I think Sonnet is very good at what we call agentic search, which basically means letting the model decide how to search for something: it gets the results back, and then it can decide whether it should keep searching or whether it's done and has everything it needs.
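The "most minimal framework" described above — hand the model tools and keep sampling until it stops asking to use them or a budget runs out — can be sketched as a short loop over the Anthropic Messages API. The tool schema here is a simplified stand-in, not the exact definitions from Anthropic's SWE-bench post, and run_tool is left as a stub you would implement (for example, in a sandbox).

```python
# Minimal "let Claude drive" loop: give the model tools and keep going until it stops
# calling them or we hit a turn budget. Tool definition is a simplified stand-in.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder

TOOLS = [
    {
        "name": "bash",
        "description": "Run a bash command and return its stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]

def run_tool(name: str, args: dict) -> str:
    """Execute the tool call (ideally in a sandbox) and return its textual output."""
    raise NotImplementedError

def run_agent(task: str, max_turns: int = 100) -> list:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        resp = client.messages.create(
            model=MODEL, max_tokens=4096, tools=TOOLS, messages=messages
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            break  # the model decided it is done
        # Execute every tool call in this turn and feed the results back.
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": run_tool(block.name, block.input)}
            for block in resp.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return messages
```

There is deliberately no loop detection, no pruning, and no hardcoded sequence of steps; the only stopping conditions are the model itself and the turn or context budget.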
If you read through a lot of the SWE-bench traces, the model is calling tools to view directories, list things out, and view files, and it will do a few of those until it feels like it's found the file where the bug is, and then it will start working on that file. And again, everything we did was about giving Claude full rein, so there's no hardcoded system, no search system that you're relying on to get the correct files into context. It just totally lets Claude do it.

Or embedding things into a vector database.

Exactly. Nope.

But again, this is very, very token-expensive, and it also takes many, many turns. So certainly, if you want to do something in a single turn, you need to do RAG and just push stuff into the first prompt.

And just to make it clear: it's using the bash tool, basically doing ls, looking at files, and then doing cat to put the file in context?

It can do that, but its file-editing tool also has a command in it called view that can view a directory. It's very similar to ls, but it has some nice quality-of-life improvements: for example, it will only list two directories deep, so the model doesn't get overwhelmed if it runs it on a huge tree. I would actually say we did more engineering on the tools than on the overall prompt. But the one other thing I want to say about agentic search is that, for SWE-bench specifically, a lot of the tasks are bug reports, which means they have a stack trace in them, and that means right in that first prompt there is something telling the model where to go. So this is a very easy case for the model to find the right files, versus if you're using this as a general coding assistant where there isn't a stack trace, or where you're asking it to add a new feature. There it's much harder to know which files to look at, and it may be an area where you need more exhaustive search, where agentic search alone would take way too long.

As someone who's spent the last few years in the JS world, it would be interesting to see a SWE-bench JS, because those stack traces are useless with how much virtualization we do; they're very disconnected from where the code problems actually appear.

That makes me feel better about my limited frontend experience. I've always struggled with it.

It's not your fault. We've gotten ourselves into a very, very complicated situation, and I'm not sure it's entirely needed, but if you talk to our friends at Vercel, they will say it is. I will say, SWE-bench just released SWE-bench Multimodal, which I believe is either entirely or largely JavaScript, and it's entirely tasks that have visual components to them. Are you going to tackle that?

We'll see. I think it's on the list and there's interest, but no guarantees yet.

Just as a side note, it occurs to me that every model lab, including Anthropic but the others as well, should have their own SWE-bench on whatever your internal bug tracker is. This is a general methodology you can use to track progress.

I guess, sort of running it on our own internal codebase? Yeah, that's a fun idea.

Since you spent so much time on the tool design: you have this edit tool that can make changes and whatnot. Any learnings from that that you wish the AI IDEs would take in? Is there some special way to look at files and feed them in?

I would say the core of that tool is string replace. We did a few different experiments with different ways to specify how to edit a file, and with string replace, basically, the model has to write out the existing version of the string and then a new version, and that just gets swapped in. We found that to be the most reliable way to do these edits.
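As a rough illustration of that idea — not Anthropic's released implementation — a string-replace edit can be as simple as the function below. The key property is that the old string must match exactly once, which keeps the edit unambiguous; the error messages are written to be read by the model, not a human.

```python
# Sketch of a string-replace file edit in the spirit of the tool described above
# (not Anthropic's released implementation). The model supplies the exact existing
# text and its replacement; requiring a unique match keeps the edit unambiguous.
import os

def str_replace(path: str, old_str: str, new_str: str) -> str:
    if not os.path.isabs(path):
        # See the absolute-path discussion later in the conversation.
        return "Error: please provide an absolute path."
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    count = content.count(old_str)
    if count == 0:
        return "Error: old_str was not found. It must match the file exactly, including whitespace."
    if count > 1:
        return f"Error: old_str occurs {count} times; include more surrounding context to make it unique."
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.replace(old_str, new_str, 1))
    return "Edit applied successfully."
```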
Other things we tried were having the model directly write a diff, or having the model fully regenerate files. That last one is actually the most accurate, but it takes so many tokens, and if you're in a very big file it's cost-prohibitive. There are basically a lot of different ways to represent the same task, and they actually have pretty big differences in terms of model accuracy. I think Aider has a really good blog where they explore some of these different methods for editing files and post results about them, which is interesting. But I think this is a really good example of the broader idea that you need to iterate on tools, not just on the prompt. A lot of people, when they make tools for an LLM, treat it like they're just writing an API for a computer: very minimal, just the bare bones of what you'd need, and honestly it's so hard for the models to use those. Again, I come back to anthropomorphizing these models: imagine you're a developer and you're reading this for the very first time and trying to use it. You can do so much better than the bare API spec you'd often see. Include examples in the description, include really detailed explanations of how things work. And also think about what the easiest way is for the model to represent the change it wants to make. For file editing, take the most extreme example: you want the model to literally write a patch file. I think patch files have, at the very beginning, numbers for how many total lines change, which means that before the model has actually written the edit, it needs to decide how many lines are going to change. Don't quote me on that — I'm pretty sure it's something like that, though I don't know if that's exactly the diff format — but you can certainly have formats that are much easier to express without messing up than others. And I like to think about how much human effort goes into designing human interfaces for things. It's incredible; that's entirely what frontend is about, creating better interfaces to do the same things. I think that same amount of attention and effort needs to go into creating agent-computer interfaces.

It's a topic we've discussed: ACI, or whatever that ends up looking like. Oh, I should also shout out that I think you released some of this tooling as part of computer use as well, and people really liked it.

Yeah, it's all open source if people want to check it out.

I'm curious if there's an environment element that complements the tools. Do you have a sandbox? Is it just Docker, because that can be slow or resource-intensive? Anything else you'd recommend?

I don't think I can talk about the details, public or private, of how we implement our sandboxing. But obviously we need a safe, secure, and fast sandbox for training, so the models can practice writing code and working in an environment.

I'm aware of a few startups working on agent sandboxing. E2B is a close friend of ours that Alessio has led a round in, but I think there are also others focusing on things like snapshotting memory, so you can do time-travel debugging, or on computer use, where you control the mouse and keyboard.
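Erik can't share how Anthropic's sandbox works, but purely as an illustration of the general idea, here is one common DIY approach hobbyists use: run each of the agent's bash commands inside a long-lived, disposable Docker container via docker exec. The container name and setup are assumptions; you would start the container yourself beforehand (for example, docker run -d --name agent-sbx some-image sleep infinity).

```python
# One common DIY sandboxing approach (not Anthropic's implementation): run the agent's
# bash commands inside a disposable Docker container via `docker exec`. The container
# name is a placeholder and is assumed to already be running.
import subprocess

CONTAINER = "agent-sbx"  # assumed to be started separately

def sandboxed_bash(command: str, timeout: int = 120) -> str:
    """Run a command inside the sandbox container and return combined stdout/stderr."""
    proc = subprocess.run(
        ["docker", "exec", CONTAINER, "bash", "-lc", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr
```

A setup like this plugs directly into the run_tool stub from the earlier loop sketch, and keeps anything the agent does confined to the container's filesystem.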
Whereas here, I think the kinds of tools you offer are very limited to coding-agent use cases: bash, edit, stuff like that.

Yeah. I think the computer use demo we released is an extension of that: it has the same bash and edit tools, but it also has the computer tool that lets it take screenshots and move the mouse and keyboard. So there are definitely more general tools there. And again, the tools we released as part of SWE-bench are, I'd say, very specific to editing files and running bash, but at the same time that's actually very general if you think about it: anything you would do on a command line or by editing a file, you can do with those tools. We want those tools to feel like any sort of computer terminal work could be done with them, rather than making tools that were very specific to SWE-bench, like a run-tests tool, for instance.

Yeah, you had a question about tests.

Yeah, exactly. I noticed there's no test verifier tool. Is that because it generates the code and then you run it against SWE-bench's tests anyway, so it doesn't really need to write tests?

So this is one of the interesting things about SWE-bench: the tests that the model's output is graded on are hidden from it. That's basically so the model can't cheat by looking at the tests and writing the exact solution. But typically, the first thing the model does is write a little script to reproduce the error, because most SWE-bench tasks are, hey, here's a bug I found: I run this and I get this error. So the first thing the model does is try to reproduce that, and then it keeps rerunning that script as a kind of mini test. But yeah, sometimes the model will accidentally introduce a bug that breaks some other test, and it doesn't know about it.

And should we be redesigning any tools or CLIs? We kind of talked about this with having more examples, but I'm thinking even of things like q as a query parameter in many APIs: it's almost easier for the model to guess than to figure out what q means, although I'm sure it's learned q by this point. Is there anything you've seen, building this, where you'd say, hey, if I were to redesign some CLI tool or some API, I would change the way it's structured to make it better for LLMs?

I don't think I've thought enough about that off the top of my head, but certainly just making everything more human-friendly: more detailed documentation, and examples. Examples are really good to have in descriptions. Like using the Linux command line: how many times do I run --help or look at the man page, and all I want is one example of how I actually use this? I don't want to read through a hundred flags; just give me the most common example. So again, things that would be useful for a human are also very useful for a model.

Yeah. I mean, there's one thing you cannot give a code agent that is useful for a human, and that's access to the internet. I wonder how you'd design that in, because one of the issues I have with the whole idea of SWE-bench is that you can't ask follow-up questions, you can't look around for similar implementations. These are all things I do when I try to fix code.
And SWE-bench doesn't allow that. It wouldn't be fair; it would be too easy to cheat. But then it's also kind of not fair to these agents, because they're not operating in a real-world situation. If I had a real-world agent, of course I'd give it access to the internet, because I'm not trying to pass a benchmark. I don't really have a question in there; it's more that I feel like the most obvious tool, access to the internet, is not being used.

I think that's really important for humans, but honestly the models have so much general knowledge from pre-training that it's less important for them.

I feel like versioning matters, though — if you're working on a newer thing that came after the knowledge cutoff.

Then yes, I think that's very important. Actually, I think this is a broader problem: there is a divergence between SWE-bench and what customers who are building a coding agent for real use will actually care about. One of those things is internet access and being able to pull in outside information. Another is that if you have a real coding agent, you don't want it to start on a task and spin its wheels for hours because you gave it a bad prompt. You want it to come back immediately, ask follow-up questions, and really make sure it has a very detailed understanding of what to do, and then go off for a few hours and do the work. So I think real tasks are going to be much more interactive with the agent, rather than this kind of one-shot system, and right now there's no benchmark that measures that. I think it would be interesting to have a benchmark that's more interactive. I don't know if you're familiar with τ-bench, but it's a customer service benchmark where there's basically one LLM playing the user, the customer who's getting support, and another LLM playing the support agent, and they interact and try to resolve the issue.

Yeah, we talked to the LMSYS guys — awesome — and they also did MT-Bench, for people listening along. So maybe we need MT-SWE-bench.

Yeah. So maybe you could have something where, before the SWE-bench task starts, you have a few back-and-forths with the author, who can answer follow-up questions about what they want the task to do. Of course you'd need to do that in a way where the agent can't cheat and just extract the exact answer out of the human, or out of the user, but I think that would be a really interesting thing to see. If you look at existing agent work, like Replit's coding agent, I think one of the really great UX things they do is first having the agent create a plan, and then having the human approve that plan or give feedback. For agents in general, having a planning step at the beginning — one, just having that plan will improve performance on the downstream task, because it's kind of like a bigger chain of thought, but it's also just a much better UX. It's way easier for a human to iterate on a plan with a model than to iterate on the full task, which has a much slower time through each loop. And if the human has approved the implementation plan, I think it makes the end result a lot more auditable and trustworthy. So I think there are a lot of things outside of SWE-bench that will be very important for real agent usage in the world.
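A minimal sketch of that plan-first flow, under the assumption of a single helper that calls the model: draft a plan, let the human approve or revise it, then do the work with the approved plan in context. The approval step here is just input(); the helper name and prompts are illustrative.

```python
# Sketch of a plan-first flow like the one described above: draft a plan, let the human
# approve or revise it, then execute against the approved plan. Names are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder

def ask(prompt: str) -> str:
    resp = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def plan_then_execute(task: str) -> str:
    plan = ask(f"Write a short, numbered implementation plan for this task:\n\n{task}")
    while True:
        print(plan)
        feedback = input("Approve plan? (type 'yes' or give feedback): ").strip()
        if feedback.lower() == "yes":
            break
        plan = ask(
            f"Task:\n{task}\n\nPrevious plan:\n{plan}\n\nRevise the plan per this feedback:\n{feedback}"
        )
    # The approved plan acts like a bigger chain of thought for the actual work.
    return ask(f"Task:\n{task}\n\nFollow this approved plan:\n{plan}\n\nNow implement it.")
```

Iterating on the plan is a much tighter loop than iterating on finished work, which is the UX point being made above.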
I'd also say, a couple of comments on the names you dropped: Copilot also does the plan stage before it writes code. I feel like those approaches have generally been less Twitter-successful because it's not prompt-to-code, it's prompt, plan, code, so there's a little bit of friction in there. But it's not much, and you actually get a lot for it. I also like the way Devin does it, where you can edit the plan as it goes along. And then the other thing: Replit — we hosted a sort of Dev Day pregame with Replit — and they also talked about multi-agent, having two agents bounce off each other. I think it's a similar approach to what you're talking about with the few-shot example, just in the prompts, of clarifying what the agent wants, but typically it would be implemented as a tool calling another agent, like a sub-agent. I don't know if you've explored that. Do you like that idea?

I haven't explored this enough, but I've definitely heard of people having good success with it: basically having a few different personas of agents, even if they're all the same LLM. I think this is one thing with multi-agent that a lot of people get confused by: they think it has to be a different model behind each one, but really it's usually the same model with different prompts. Having them take different personas, to bring different thoughts and priorities to the table, I've seen work very well and create a much more thorough and thought-out response. The downside is just that it adds a lot of complexity and a lot of extra tokens. So it depends what you care about: if you want a plan that's very thorough and detailed, I think it's great; if you want a really quick "write this function," you probably don't want a bunch of different calls before it does it.

And just talking about the prompt: why are XML tags so good with Claude? I think initially people were like, oh, maybe you're just getting lucky with XML, but you obviously use them in your own agent prompts, so they must work. And why is it so specific to your model family?

Yeah. Again, I'm not sure how much I can say, but I think there are historical reasons that internally we've preferred XML for data. The one broader thing I'll say is that if you look at certain kinds of outputs, there is overhead to outputting in JSON. If you're trying to output code in JSON, there's a lot of extra escaping that needs to be done, and that actually hurts model performance across the board, whereas if you're inside a single XML tag, there's none of that escaping. That said, I haven't tried having it write HTML inside XML, where maybe you start running into weird escaping things; I'm not sure. But yeah, I'd say it's some historical reasons, plus there's less escaping overhead.

I use them with other models as well, and it's just a really nice way to make sure that the thing that ends is tied to the thing that starts. That's the only way to do code fences where you're pretty sure "example one start" and "example one end" delimit one cohesive unit, because braces are non-descriptive.

Yeah, exactly.

So that would be my simple reason: XML is good for everyone, not just Claude. Claude was just the first one to popularize it.

I definitely prefer reading XML to reading JSON.
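A small illustration of the escaping point, using an assumed snippet: embedded in an XML tag, the code appears verbatim, while embedded in a JSON string value, every quote and newline has to be escaped, and the model has to get that escaping exactly right on output too.

```python
# Illustration of the escaping overhead: the same snippet embedded in a prompt via an
# XML tag versus inside a JSON string value.
import json

snippet = 'def greet(name):\n    print(f"hello {name}")\n'

xml_prompt = f"Here is the file to edit:\n<code>\n{snippet}</code>"
# The code appears verbatim inside the tag, so the model can read and reproduce it token for token.

json_prompt = json.dumps({"code": snippet})
# -> {"code": "def greet(name):\n    print(f\"hello {name}\")\n"}
# Every quote and newline is escaped, and the model must produce the same escaping when it answers.
```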
So, yeah. Any other details that are maybe underappreciated? I know, for example, you had the absolute-paths-versus-relative-paths thing. Any other fun nuggets?

Yeah, I think that's a good anecdote to mention about iterating on tools. Like I said, spend time prompt engineering your tools: don't just write the prompt, write the tool, then actually give it to the model and read a bunch of transcripts of how the model tries to use it. By doing that, you will find areas where the model misunderstands a tool or makes mistakes, and then you can change the tool to make it foolproof. There's this Japanese term, poka-yoke, about making tools mistake-proof. The classic idea is that you can have a plug that fits either way, which is dangerous, or you can make it asymmetric so it can't fit the wrong way — it has to go like this — and that's a better tool, because you can't use it incorrectly. So for this example of absolute paths: one of the things we saw while testing these tools is that if the model had done a cd and moved to a different directory, it would often get confused when trying to use the tool, because it's now in a different directory and the paths aren't lining up. So we said, let's just force the tool to always require an absolute path. That's easy for the model to understand: it knows where it is, it knows where the files are. And once we had it always giving absolute paths, it never messes up, no matter where it is, because with an absolute path it doesn't matter where you are. Iterations like that let us make the tool foolproof for the model. There are other categories of things where we saw, oh, if the model opens Vim, it's never going to return.

Did it get stuck?

Yeah, because the tool is just text in, text out; it's not interactive. So it's not that the model doesn't know how to get out of Vim, it's that the way the tool is hooked up to the computer is not interactive.

I mean, there is the meme that no one knows how to get out of Vim.

Basically, we just added instructions in the tool: hey, don't launch commands that don't return.

Like, don't launch Vim.

Don't launch Vim, don't launch whatever. If you do need to run something long-lived, put an ampersand after it to launch it in the background. Just putting instructions like that right in the description for the tool really helps the model, and I think that's an underutilized space of prompt engineering. People might try to do that in the overall prompt, but if you put it in the tool itself, the model knows that for this tool, this is what's relevant.
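A sketch of what "prompt engineering inside the tool" can look like: the description carries a usage example and the don't-launch-interactive-programs warning, and the handler enforces the constraint instead of trusting the model. The schema shape follows the Anthropic tool-use format, but the wording and the handler are illustrative, not the released tool.

```python
# Sketch of putting guidance in the tool itself: the description carries usage rules and
# an example, and the handler enforces the poka-yoke instead of trusting the model.
# Schema shape follows the Anthropic tool-use format; wording is illustrative.
BASH_TOOL = {
    "name": "bash",
    "description": (
        "Run a bash command in the repository and return its output.\n"
        "* Commands must be non-interactive and must return: do NOT launch editors or "
        "pagers such as vim, nano, less, or top.\n"
        "* To start a long-running process, append ' &' to run it in the background.\n"
        '* Example input: {"command": "grep -rn TODO src/ | head -20"}'
    ),
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string", "description": "The bash command to run."}},
        "required": ["command"],
    },
}

def handle_bash(command: str) -> str:
    """Reject obviously interactive commands instead of letting the agent hang forever."""
    first = command.strip().split()[0] if command.strip() else ""
    if first in {"vim", "vi", "nano", "less", "top"}:
        return f"Error: '{first}' is interactive and never returns; use a non-interactive alternative."
    # Actually execute here, e.g. via the sandboxed_bash() sketch shown earlier.
    raise NotImplementedError
```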
You said you worked on the function calling and tool use APIs before you started the SWE-bench work, right? Were there any surprises, given that you basically went from creator of that API to user of it? Any changes you would make now that you've extensively dogfooded it in a state-of-the-art agent?

I'd want us to make a maybe slightly less verbose SDK. Right now we sort of force people into the best practice of writing out these full JSON schemas, but it would be really nice if you could just pass in a Python function as a tool.

I think there's a lot of that — Pydantic, Instructor, you know. I don't know if anyone else is specializing for Anthropic. Maybe Jeremy Howard's and Simon Willison's stuff — they all have Claude-specific things they're working on.

Claudette.

Claudette, exactly.

I also wanted to spend a little bit of time on SWE-agent. It seems like a very general framework. Is there a reason you picked it, apart from it being by the same authors as SWE-bench?

Honestly, the main reason was that it was the same authors as SWE-bench. It just felt like the safest, most neutral option, and it was very high quality and very easy to modify and work with. I would say their underlying framework is this think-act-observe loop that they go through, which is a little more hardcoded than what we wanted to do, but it's still very close and still very general, so it felt like a good starting point for our agent. And we had already worked with and talked to the SWE-bench people directly, so it felt nice: we already know the authors, this will be easy to work with.

I'll share a little bit: this all seems disconnected, but once you figure out the people and where they went to school, it all makes sense. It's all Princeton.

Yeah, SWE-bench and SWE-agent, it's a group out of Princeton.

We had Shunyu on the pod, and he came up with the ReAct paradigm, and that's think, act, observe. That's all ReAct. So they're all friends.

Yep, exactly. And if you actually read the traces of our submission, you can see "think, act, observe" in our logs; we didn't even change the printing code. It's still doing function calls under the hood, and the model can do multiple function calls in a row without thinking in between if it wants to, but yeah, there are a lot of similarities and a lot of things we inherited from SWE-agent, just as a starting point for the framework.

Any thoughts about other agent frameworks? There's the whole gamut from very simple to very complex.

CrewAI, LangGraph...

Yeah, I haven't explored a lot of them in detail. I would say, with agent frameworks in general, they can certainly save you some boilerplate, but there's actually this downside of making agents too easy, where you end up very quickly building a much more complex system than you need. Suddenly, instead of having one prompt, you have five agents talking to each other in a dialogue, because the framework made that ten lines to do, and you end up building something that's way too complex. So I would actually caution people to try to start without these frameworks if you can, because you'll be closer to the raw prompts and able to directly understand what's going on. A lot of times these frameworks, by trying to make everything feel really magical, end up hiding what the actual prompt and output of the model is, and that can make it much harder to debug. So certainly these things have a place, and they do really help at getting rid of boilerplate, but they come with the cost of obfuscating what's really happening and making it too easy to very quickly add a lot of complexity.
So yeah, I would recommend people try it from scratch; it's really not that bad.

Would you rather have a framework of tools? Do you almost see it like, hey, maybe it's easier to get tools that are already well curated, like the ones you built? If I had an easy way to get the bash tool from you, and you maintained the definition — any thoughts on how you want to formalize tool sharing?

Yeah, I think that's something we're certainly interested in exploring, and I think there is space for these general tools that are very broadly applicable. But at the same time, most people building on these have much more specific things they're trying to do. I think shared tools might be useful for hobbyists and demos, but the ultimate end applications are going to be bespoke. So we just want to make sure the model is great at whatever tool it uses. But it's certainly something we're exploring.

So everything bespoke, no frameworks, no anything?

For now, yeah. I would say the best thing I've seen is people building up from good util functions, and then using those as building blocks.

Yeah, I have a utils folder where all these scripts live. My framework is "def call_anthropic," and then I just put in all the defaults I need.

Exactly.

There's a startup hidden in every utils folder, you know. If you use it enough, it's a startup at some point.

I'm kind of curious: is there a maximum number of turns it took? What was the longest run?

I actually don't know offhand; it had basically infinite turns until it ran into the 200K context. I should have looked this up. For some of the failed cases, where it eventually ran out of context, it was over a hundred turns. I'm trying to remember the longest successful run, but I think it was definitely over a hundred turns some of the time.

Which is not that much. It's a coffee break.

Yeah, but certainly these things can be a lot of turns, and I think that's because some of these problems are really hard, and it's going to take many tries. If you think about a task that takes a human four hours to do, think about how many files you read and how many times you edit a file in four hours. That's a lot more than a hundred.

And how many times you open Twitter because you get distracted.

But if you had a lot more compute, what's the return on the extra compute? If you had thousands of turns, how much better would it get?

This I don't know, and I think this is one of the open areas of research in general with agents: memory, and how you have something that can do work beyond its context length when you're just purely appending. You mentioned earlier things like pruning bad paths; I think there's a lot of interesting work around that. Can you roll back, but summarize — hey, don't go down this path? I think that's very interesting: you could have something that uses way more tokens in total without ever using more than 200K at a time. I think the biggest question is whether you can make the model losslessly summarize what it's learned from trying different approaches and bring that back. I think that's the big challenge.
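Purely as a speculative sketch of that "roll back, but keep the lessons" idea: when the transcript nears the context budget, ask the model to summarize what it has learned — including approaches that failed and why — and restart the conversation from the original task plus that summary. The token counting here is a crude character-based heuristic, and the summarization prompt is an assumption, not anything Anthropic has published.

```python
# Speculative sketch of context compaction: near the budget, distill the transcript into a
# summary (including failed approaches to avoid) and restart from it. Token counting is a
# rough heuristic; a real system would count tokens properly.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder

def approx_tokens(messages: list) -> int:
    # ~4 characters per token is a crude approximation; content may be a list of blocks.
    return sum(len(str(m["content"])) for m in messages) // 4

def maybe_compact(task: str, messages: list, budget: int = 150_000) -> list:
    if approx_tokens(messages) < budget:
        return messages
    transcript = "\n\n".join(f"[{m['role']}]\n{m['content']}" for m in messages)
    resp = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Here is an agent transcript:\n<transcript>\n" + transcript + "\n</transcript>\n\n"
                "Summarize everything learned about the task: the relevant files, what was tried, "
                "which approaches failed and why, and what to do next."
            ),
        }],
    )
    summary = resp.content[0].text
    # Restart the conversation: original task plus the distilled state; old turns dropped.
    return [{"role": "user", "content": f"{task}\n\nProgress so far:\n{summary}"}]
```

Whether such a summary is "lossless" enough to preserve the lessons from failed attempts is exactly the open question raised above.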
What about different models? You have Haiku, which is cheaper, so you could say, what if I have Haiku do a lot of these smaller things and then pass it back up? I think Cursor might have said they actually have a separate model for file editing. I'm trying to remember — I think they were on the Lex Fridman podcast — and they said they have a bigger model write what the code should be, and then a different model apply it. So I think there's a lot of interesting room for stuff like that.

Yeah, fast apply. We actually did a pod with Fireworks, who they worked with on that; it's speculative decoding.

But I think there are also really interesting things about paring down input tokens. Sometimes the model is trying to read a 10,000-line file — that's a lot of tokens, and most of it is actually not going to be relevant. I think it would be really interesting to delegate that to Haiku: Haiku, read this file and just pull out the most relevant functions, and then Sonnet reads just those, and you save 90% on tokens. I think there's a lot of really interesting room for things like that. And again, we were just trying to do the simplest, most minimal thing and show that it works. I'm really hoping that people — the agent community — build things like that on top of our models. That's again why we released these tools. We're not going to go do lots more submissions to SWE-bench and try to prompt-engineer this and build a bigger system; we want the ecosystem to do that on top of our models.
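A sketch of that delegation idea — a cheaper model skims the huge file and returns only the relevant pieces, and the stronger model works from that condensed view. Model IDs, prompts, and function names are placeholders; this is the pattern being floated, not anything from Anthropic's reference agent.

```python
# Sketch of the delegation idea above: a cheaper model (Haiku) skims a large file and
# returns only the parts relevant to the task; the stronger model (Sonnet) then works
# from that condensed view instead of all 10,000 lines. Model ids are placeholders.
import anthropic

client = anthropic.Anthropic()
HAIKU = "claude-3-5-haiku-latest"    # placeholder
SONNET = "claude-3-5-sonnet-latest"  # placeholder

def extract_relevant(path: str, task: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        source = f.read()
    resp = client.messages.create(
        model=HAIKU, max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\n<file path='{path}'>\n{source}\n</file>\n\n"
                "Copy out, verbatim, only the functions and classes relevant to the task."
            ),
        }],
    )
    return resp.content[0].text

def propose_fix(path: str, task: str) -> str:
    relevant = extract_relevant(path, task)
    resp = client.messages.create(
        model=SONNET, max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\nRelevant excerpts from {path}:\n{relevant}\n\n"
                "Propose the exact code change to make."
            ),
        }],
    )
    return resp.content[0].text
```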
So I think that's a really interesting one. It turns out you did run 3.5 Haiku with your tools, and it scored a 40.6.

Yes, so it did very well; it itself is actually very smart, which is great. We haven't done any experiments combining the two models, but I think that's one of the exciting things: how well Haiku 3.5 did on SWE-bench shows that even our smallest, fastest model is very good at thinking agentically and working on hard problems. It's not just for writing simple text anymore.

And I know you're not going to talk about it, but Sonnet is not even supposed to be the best model; there's Opus, which we've kind of left back at 3. At some point I'm sure the new Opus will come out, and Opus plus this Sonnet sounds very, very good. There's a run with SWE-agent plus Opus, but that's the official SWE-bench folks doing it, and that was the older Opus. You didn't do yours?

I mean, you could just change the model name. I think we didn't submit it, but we included the score in our model card as a comparison. And Sonnet and Haiku, the new ones, I think both outperformed the original Opus.

Yeah, I did see that. It's a little bit hard to find.

Yeah, it's not an exciting score, so we didn't feel the need to submit it to the benchmark. We can cut over to computer use, if we're okay moving on. Do you have anything else?

I think we're good. I'm trying to think if there's anything else SWE-bench related. It doesn't have to be specifically SWE-bench, just your thoughts on building agents, because you're one of the few people who have reached this leaderboard on building a coding agent. This is the state of the art, and it's surprisingly not that hard to reach with some good principles, but there's obviously a ton of low-hanging fruit that we covered. So just your thoughts: if you were to build a coding agent startup, what next?

I think the really interesting question for me, for all the startups out there, is this divergence between the benchmarks and what real customers will want. So I'm curious: maybe the next time you have a coding agent startup on the podcast, you should ask them what differences they're seeing.

Perfect, yeah. I'm actually very curious what they'll say, because I also feel like it's slowed down a little bit; I don't see the startups submitting to SWE-bench that much anymore, because of the traces.

The traces, yeah. We had Cosine on; they had a fifty-something on SWE-bench Full, which is the hardest one, and they were rejected because they didn't want to submit their traces.

Yep, IP. That makes sense.

Actually, tomorrow we're talking to Bolt, which is a Claude customer — you guys actually published a case study with them. I assume you weren't involved with that, but they were very happy with Claude. One of the biggest launches of the year.

Totally.

We actually happen to be sitting in Adept's former office. My take is that Anthropic basically shipped Adept as a feature. It's still a beta feature, but yes. What was it like when you tried it for the first time? Was it obvious that Claude had reached the stage where you could do computer use?

It was somewhat of a surprise to me. I had actually been on vacation, and I came back and everyone was like, computer use works. So it was this very exciting moment. After the first, you know, "go to Google," I think I tried to have it play Minecraft or something, and it actually installed and opened Minecraft, and I was like, wow, this is pretty cool; this thing can actually use a computer. Certainly it's still beta, and there are certain things it's not very good at yet, but I'm really excited, most broadly not just for new things that weren't possible before, but as a much lower-friction way to implement tool use. One anecdote from my days at Cobalt Robotics: we wanted our robots to be able to ride elevators, to go between floors and fully cover a building. The first way we did this was API integrations with the elevator companies, and some of them actually had APIs: we could send a request and it would move the elevator. But each new company took like six months, because they were very slow; they didn't really care.

About an elevator.

Even once we had the agreement with the company, they would have to literally go install an API box on the elevator we wanted to use, and that would sometimes take six months. So, very slow. Eventually we were like, okay, this is slowing down all of our customer deployments, and I thought, what if we just add an arm to the robot? So I added this little arm that could literally go and press the elevator buttons, we used computer vision to do it, and we could deploy that in a single day and have the robot using the elevators. At the same time, it was slower than the API and not quite as reliable; sometimes it would miss and have to press the button again.
It would get there, just slower and a little less reliable. And I kind of see this as an analogy to computer use: anything you can do on a computer today, you could probably write tool use for and integrate with APIs up to the language model, but that's going to take a bunch of software engineering to write those integrations. With computer use, you just give the thing a browser that's logged into whatever you want it to integrate with, and it works immediately. I see that reduction in friction as incredibly exciting. Imagine a customer support team: okay, hey, you've got this customer support bot, but you need to go integrate it with all these things, and you don't have any engineers on your customer support team. If you can just give the thing a browser that's logged into the systems it needs access to, then suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that can go do all the actions you care about. So I think that's the most exciting thing for me about computer use: reducing that friction of integrations to almost zero.

Or farming in World of Warcraft.

Right, computer use, very high value use cases. What I'll say about this is that it's the oldest question in robotics or self-driving: do you drive by vision, or do you have special tools? And vision is the universal tool that can claim all the other tools. There are trade-offs, and there are situations in which each will win, but this week's podcast, the one we just put out, had Stan Polu from Dust saying he doesn't see a future where computer use is the significant workhorse. I think there could be a separation: for the high volume use cases you want APIs, and for the long tail you want computer use.

I totally agree. Or you'll prototype something with computer use, and then, hey, this is working, customers have adopted this feature, okay, let's go turn it into an API and it'll be faster and use fewer tokens.

Yeah. I'd be interested to see a computer use agent replace itself by figuring out the API and then just dropping out of the equation altogether.

That's really fun, actually. If I were running an RPA company (RPA, for people listening, is robotic process automation, where you script things that always show up in the same sequence so you don't need an LLM in the loop), basically what you'd do is train an LLM to write that script, and then you can naturally hand off from computer use to non-computer-use software.

Yeah, or have some way to turn Claude's computer use actions into a saved script that you can then run repeatedly.

It'd be interesting to record that. Why did you decide not to ship any sandbox harness for computer use? Is it kind of, hey, peace, run it at your own risk?

No, no, we launched it with, I think, a VM, a Docker system.

But it's not for your actual computer, right? It runs inside the Docker instance; it runs its own browser.

Yeah. I mean, the main reason for that is security: the model can do anything, so we wanted to give it a sandbox and not have people run it on their own computer, at least for our default experience. We really care about making the default safe, and I think that's the best way for us to do it.
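For readers who want to see what "just give it a browser" looks like from the API side, here's a minimal sketch of the computer-use agent loop, assuming the tool type and beta flag names from the October 2024 launch (check the current docs before relying on them). The task prompt and the `execute_in_sandbox` helper are hypothetical; in the reference Docker demo, that helper is whatever actually performs the click, keypress, or screenshot inside the container. It also records every action the model takes, in the spirit of the "turn the actions into a saved script" idea above.

```python
import anthropic

client = anthropic.Anthropic()
recorded_actions = []  # saved so a scripted, non-LLM replay is possible later


def execute_in_sandbox(tool_input):
    """Hypothetical stub: perform the click/type/screenshot described by
    tool_input inside the Docker sandbox and return a base64 PNG screenshot."""
    raise NotImplementedError


def run_turn(messages):
    return client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[{
            "type": "computer_20241022",   # tool type name as of the launch
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        messages=messages,
        betas=["computer-use-2024-10-22"],  # beta flag as of the launch
    )


messages = [{"role": "user",
             "content": "Open the support dashboard and export this week's tickets."}]
response = run_turn(messages)

while response.stop_reason == "tool_use":
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            recorded_actions.append(block.input)      # e.g. {"action": "left_click", ...}
            screenshot = execute_in_sandbox(block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image",
                             "source": {"type": "base64",
                                        "media_type": "image/png",
                                        "data": screenshot}}],
            })
    messages.append({"role": "user", "content": results})
    response = run_turn(messages)

print(response.content[0].text)  # the model's final summary of what it did
```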
And I mean, very quickly people made modifications to let you run it on your own desktop. That's fine, someone else can do that, but we don't want that to be the official Anthropic thing to run. I'd also say, from a product perspective, because this is still in beta, a sandbox is actually what you want for a lot of the most useful use cases: something where it can't mess anything up, it only has what I gave it. Also, if it's using your computer, you can't use your computer at the same time; I think you actually want it to have its own screen. It's like two people pair programming, but with only one laptop instead of two.

Everyone should totally have a side laptop where the computer use Claude is just doing its thing.

I think it's just a better experience. Unless there's something very explicit you want it to do for you on your own computer, it becomes like you're shelling into a remote machine and maybe checking in on it every now and then.

I have fun memories, and half our audience is going to be too young to remember this, of Citrix, the remote desktop experience where you were remoted into a machine someone else was operating, and for a long time that was how you did enterprise computing.

Yeah, it's coming back. Any other implications of computer use? Is it a fun demo, or is it the future of Anthropic?

I'm very excited about it. I think there's a lot of very repetitive work that computer use will be great for, and I've seen some examples of people building coding agents that then also test the front end that they made. So I think it's very cool to use computer use to close the loop on a lot of things that a terminal-based agent just can't do right now.

It's kind of like end-to-end testing.

Exactly, yeah. Front-end and web testing is something I'm very excited about.

I've seen Amanda talking about this too, this would be Amanda Askell, the head of Claude character: she goes on a lunch break and it generates research ideas for her. Giving it a name like "computer use" is very practical, like you're supposed to do things, but maybe sometimes it's not about doing things, it's about thinking, and in the process of thinking you're using the computer. In some way that's, you know, solving SWE-bench: you should be allowed to use the internet, or use a computer, and use your vision, use whatever. We're just shackling it with all these restrictions because we want to play nice for a benchmark, but really a full AI will be able to do all these things while it thinks.

Yeah, we'll definitely be able to Google search for things, pull down inspiration.

Before we wrap, can we do a robotics corner? People are always curious, especially with somebody who is not trying to hype their own company: what's the state of AI robotics? Underhyped, overhyped?

Yeah, and I'll say these are my opinions, not Anthropic's, and again, they're coming from a place of a burned-out robotics founder, so take everything with a grain of salt. Let's see, on the positives: there is really incredible progress that's happened in the last five years that I think will be a big unlock for robotics.
The first is just general-purpose language models. There was an old saying in robotics that if fully describing your task is harder than just doing the task, you can never automate it, because it's going to take more effort to tell the robot how to do the thing than to just do it yourself. LLMs solved that. I no longer need to exhaustively program in every little thing; the thing just has common sense. It's going to know how to make a Reuben sandwich, I'm not going to have to program that in, whereas before, the idea of even a cooking robot was, oh god, we're going to have a team of engineers hard-coding recipes, and the long tail of anything would be a disaster. So that's one thing: bringing common sense really solves this huge problem of describing tasks.

The second big innovation has been diffusion models for path planning. A lot of this work came out of Toyota Research, and there are a lot of startups now working on it, like Physical Intelligence (Pi), Chelsea Finn's startup out of Stanford. The basic idea, and I'd say it's maybe more inspiration from diffusion than diffusion models themselves, is a way to learn end-to-end motion control, whereas previously all robotics motion control was very hard-coded: either you're programming in explicit motions, or you're programming in an explicit goal and using an optimization library to find the shortest path to it. Now you just give it a bunch of demonstrations and, just like deep learning, it's basically learning from those examples what it means to go pick up a cup. And because, just like diffusion models, they're somewhat conditioned on text, you can have the same model learn many different tasks. The hope is that these start to generalize: if you've trained it on picking up coffee cups and picking up books, then when I say pick up the backpack, it knows how to do that too, even though you've never trained it on that. That's kind of the holy grail here: you train it on 500 different tasks and that's enough to really get it to generalize to do anything you would need. I think that's still a big TBD, and the people working on this have measured some degree of generalization, but at the end of the day it's also like LLMs: do you really care about the thing being able to do something no one has ever shown it in training data? For a home robot, there are going to be like a hundred things people really want it to do, and you can just make sure it has good training for those things. What you do care about is generalization within a task: oh, I've never seen this particular coffee mug before, can I still pick it up? And at that, the models do seem very good. So those are the two big things going for robotics right now: LLMs for common sense, and diffusion-inspired path planning algorithms.
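To make the diffusion-policy idea above a little more concrete, here's a toy sketch, not how Physical Intelligence or Toyota Research actually implement it: a small network learns to predict the noise that was added to demonstrated action chunks, conditioned on an observation embedding and a task (text) embedding. At inference you'd start from random noise and iteratively denoise into an action trajectory for the current observation and command. All sizes, the schedule, and the architecture are illustrative.

```python
import torch
import torch.nn as nn


class TinyDiffusionPolicy(nn.Module):
    """Denoise an action chunk (horizon x act_dim), conditioned on an
    observation embedding and a task/text embedding. Sizes are illustrative."""

    def __init__(self, act_dim=7, horizon=16, obs_dim=64, task_dim=64, hidden=256):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + obs_dim + task_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * act_dim),  # predicts the added noise
        )

    def forward(self, noisy_actions, obs_emb, task_emb, t):
        # noisy_actions: (B, horizon, act_dim); obs_emb: (B, obs_dim);
        # task_emb: (B, task_dim); t: (B,) normalized diffusion timestep.
        x = torch.cat([noisy_actions.flatten(1), obs_emb, task_emb, t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)


def training_step(policy, actions, obs_emb, task_emb, n_steps=100):
    """DDPM-style step: corrupt demonstrated actions, predict the noise back."""
    b = actions.shape[0]
    t = torch.randint(0, n_steps, (b,))
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / n_steps) ** 2  # toy schedule
    noise = torch.randn_like(actions)
    noisy = (alpha_bar.sqrt()[:, None, None] * actions
             + (1 - alpha_bar).sqrt()[:, None, None] * noise)
    pred = policy(noisy, obs_emb, task_emb, t.float() / n_steps)
    return ((pred - noise) ** 2).mean()
```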
I think this is very promising, but there's a lot of hype, and I think where we are right now is where self-driving cars were ten years ago. We have very cool demos that work. Ten years ago you had videos of people driving a car on the highway, driving a car on a street with a safety driver, but it's really taken a long time to go from there to "I took a Waymo here today." And even then, Waymo is only in SF and a few other cities, and it takes a long time for these things to actually get everywhere and get all the edge cases covered.

I think for robotics the limiting factor is going to be reliability. These models are really good at doing demos of things like doing laundry or doing dishes, but if they only work 99% of the time, that sounds good but is actually really annoying. Humans are really good at these tasks. Imagine if the robot broke one out of every hundred dishes it washed: you would not want that robot in your house, and you certainly wouldn't want it in your factory if it dropped one of every hundred boxes it moved and broke the things inside. So for these things to really be useful, they're going to have to hit a very, very high level of reliability, just like self-driving cars, and I don't know how hard it's going to be for these models to move from, say, 95% reliability to 99.9%. I think that's going to be the big thing.

I'm also a little skeptical of how good the unit economics of these things will be. These robots are going to be very expensive to build, and if you're just trying to replace labor one for one, that kind of sets an upper bound on how much you can charge, so it seems like it's not that great a business. I'm also worried about that for the self-driving car industry.

Do you see most of the applications actually taking over some of the older machinery, especially manufacturing machinery, which needs to be very precise, where being off by even a few millimeters kind of screws up the whole thing, and being able to adjust at the edge? Or do you think the net new use cases are maybe more interesting?

I think it'd be very hard to replace a lot of those traditional manufacturing robots, because everything relies on that precision. If you have a model that, again, only gets there 99% of the time, you don't want 1% of your cars to have the weld in the wrong spot; that's going to be a disaster. And a lot of manufacturing is all about getting rid of as much variance and uncertainty as possible.

And what about the hardware? A lot of my friends who work in robotics say one of the big issues is sometimes you just have a servo that fails, and it takes a bunch of time to fix that. Is that holding things back, or is the software still not there anyway?

I think both, but there's been a lot more progress in the software in the last few years, and I think a lot of the humanoid robot companies now are really trying to build amazing hardware. Hardware is just so hard. It's the classic thing: you build your first robot and it works, great. Then you build ten of them: five of them work, three of them work half the time, two of them don't work at all, and you built them all the same and you don't know why. The real world has this level of detail and difference that software doesn't have. Imagine if every for loop you wrote, some of them just didn't work and some were slower than others; imagine if in every binary you shipped to a customer, each of those for loops behaved a little differently. It becomes so hard to scale and maintain quality. That's what makes hardware really hard: not building one of something, but repeatedly building something and making it work reliably, where, again, you'll buy a batch of 100 motors and each of those motors will behave a little bit differently to the same input command.
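A toy illustration of that last point, not from the conversation, just to make the "batch of 100 motors" idea tangible: the same open-loop command lands differently on every unit because of per-unit gain variance, which is why robust systems end up calibrating per unit or closing the loop with feedback, exactly the kind of "robust despite the differences" work described next. The 5% gain spread is an arbitrary illustrative number.

```python
import random

random.seed(0)
motors = [1.0 + random.gauss(0, 0.05) for _ in range(100)]  # per-unit gain, ~5% spread

# Same open-loop command to every unit: the spread shows up directly in the output.
command = 100.0
open_loop = [g * command for g in motors]
print(f"open loop:   {min(open_loop):.1f} .. {max(open_loop):.1f}  (target 100)")


def closed_loop(gain, target=100.0, kp=0.5, steps=50):
    """Simple proportional feedback: correct the command until output hits target."""
    cmd, speed = 0.0, 0.0
    for _ in range(steps):
        speed = gain * cmd
        cmd += kp * (target - speed)   # per-tick correction absorbs the unit's gain
    return speed


closed = [closed_loop(g) for g in motors]
print(f"closed loop: {min(closed):.1f} .. {max(closed):.1f}  (target 100)")
```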
That's your lived experience at Cobalt, and robotics is all about how you build something that's robust despite these differences. Can't we get the tolerance of the motors down?

It's just everything, you know. One of my horror stories: at Cobalt, this was many years ago, we had a thermal camera on the robot that had a USB connection to the computer inside, which, first of all, is a big mistake. You're not supposed to use USB; it is not a reliable protocol. It's designed so that if there's a mistake, the user can just unplug it and plug it back in.

I see.

And so typically, things that are USB are not designed to the very high level of reliability you need, again, because they assume someone will just unplug it and replug it.

You're just saying someone told you this at some point.

I'd heard this too, and I didn't listen to it; I really wish I had. Anyway, at a certain point a bunch of these thermal cameras started failing and we couldn't figure out why. I asked everyone on the team, hey, what's changed? Did the software change around this? No. Did the hardware design change around this? No. I was investigating all this stuff, looking at kernel logs of what was happening, and finally the procurement person was like, oh yeah, well, I found this new vendor for USB cables last summer. What, you switched which vendor we were buying USB cables from? Yeah, it's the same exact cable, just a dollar cheaper. And it turns out this was the problem: the new cable had slightly worse resistance, or slightly more EMI, and it worked most of the time, but 1% of the time the cameras would fail and we'd need to reboot a big part of the system. It was all just because two USB cables with the same exact spec were slightly different. These are the kinds of things you deal with in hardware.

For listeners, we had an episode with Josh Albrecht of Imbue where they talked about buying tens of thousands of GPUs, and some of them will just not do math.

Yeah, it's the same thing: you run some tests to find the bad batch and then you return it to the vendor, because some GPUs just won't do math right. The real world has this level of detail. Eric Jang, he did AI at Google and then joined 1X; I see him post on Twitter occasionally with complaints about hardware and supply chains. We know each other, and we joke occasionally that we've switched: I went from robotics into AI, and he went from AI into robotics.

Look, it's very promising, and the TAM of the real world is unlimited, right, but it's also just a lot harder.

Yeah, and something I also tell people about why working on software agents is great: they're infinitely clonable, and they always work the same way. Mostly. Unless you're using Python.

And that's kind of the whole thesis. I'm also interested, you dropped a little bit of alpha there and I want to make sure we don't lose it: you're kind of skeptical about self-driving as a business, so I want to double-click on that, because I don't think it should get lost. We do have some public Waymo numbers; Waymo is pretty public with their stats. They're exceeding 100,000 trips a week, and if you assume a $25 average ride, that's about a $130 million revenue run rate.
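As a quick sanity check on that figure, both inputs as quoted in the conversation rather than independently verified:

```python
# Back-of-envelope on the Waymo run rate quoted above.
trips_per_week = 100_000   # "exceeding 100,000 trips a week"
avg_fare = 25              # "$25 ride average"

weekly_revenue = trips_per_week * avg_fare    # $2.5M per week
annual_run_rate = weekly_revenue * 52         # ~$130M per year
print(f"${annual_run_rate / 1e6:.0f}M annualized")   # -> $130M annualized
```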
At some point they will recoup their investment, right? So what are we talking about here, why the skepticism?

Again, I'm not an expert and I don't know their financials, but the thing I'm worried about is the comparison to an Uber. I don't know how much an Uber driver takes home in a year, but call that roughly the revenue a Waymo is going to make in that same year, and those cars are expensive. It's not just about whether you can hit profitability, it's about your cash conversion cycle: how cheap can you make building one Waymo, compared to how much you're earning, which is roughly the equivalent of what an Uber driver would take home? And remember, an Uber driver isn't keeping that whole revenue either; you have to think about the cost of the car and the depreciation of the car. I'm just not convinced how much profit Waymo can actually make per car. That's, I think, my skepticism.

They also need to recoup the cost of the Waymo itself, because the car is like 110 grand, something like that, plus the lidar. That's many years.

Yeah, exactly.

Anything else? Parting thoughts, calls to action, rants? The floor is yours.

Yeah, I'm very excited to see a lot more LLM agents out there in the world doing things, and I think the biggest limiting factor will start to become whether people trust the output of these agents. How do you trust the output of an agent that did five hours of work for you and is coming back with something? If you can't find some way to trust that agent's work, it kind of wasn't valuable at all. So I think that's going to be really important: not just doing the work, but doing the work in a trustable, auditable way, where you can also explain to the human, hey, here's exactly how this works and why, and how I came to it. I think that's going to be really important.

Thank you so much.

Thank you, this was great.