There are some unique properties of the problems we're solving. For example, our stack relies on really powerful spatial awareness of everything going on around you, but then you need to understand the world in 3D even better. Another thing you want to incorporate is very long memory over a scene; you often need to reason based on the history of several seconds or more, and that can be a lot of frames. You need to think about how to prevent hallucinations. So all of this, you know, these are still areas that are very much ripe for further work, even though we have these vision language models.

All right everyone, welcome to another episode of the TWIML AI Podcast. I am of course your host, Sam Charrington, and today I'm joined by Drago Anguelov. Drago is VP and head of research at Waymo. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Drago, welcome back to the podcast.

Thank you, pleasure to be back.

I'm really looking forward to digging into our conversation. I went back and checked: it's been four years and a month since we last spoke. It was February 2021, and I have got to imagine a ton has been happening on your end, so I'm really looking forward to catching up on all of that. We'll especially be digging into some of the work you've been doing to incorporate foundation models, and that whole idea, into the self-driving experience at Waymo. But before we get going, I'd love to have you jump in and share a little bit about your background for folks who weren't listening four years ago.

Well, a lot has happened in the last four years. I'm Drago; I've been at Waymo for almost seven years now. I have been leading the Waymo research team since summer of 2018, and we focus on pushing the state of the art in autonomous driving systems with machine learning, or AI, or both. I have been a machine learning researcher for over 20 years. I spent eight years at Alphabet working on image understanding with deep neural networks; we published some of the early architectures for image classification and object detection, and we won the ImageNet challenges. That was 11 years ago, 2014. I used to work on Street View as well, on 3D pose estimation and 3D vision. For Street View, as the vehicles drove around, we would reconstruct the world in 3D and align all the different trajectories of all the platforms. We didn't just have cars; we had snowmobiles, trucks, trikes, bicycles. Sometimes we would put a trike on a boat; we did all kinds of things at the time. And so you had to have accurate poses, so that when you navigate through all these photos, the panorama in 3D looks good, essentially. I led a small team that was enabling this for a few years, and then around 2015 I got into autonomous driving, initially as head of 3D perception at Zoox, another reasonably prominent autonomous driving company. I did that for two and a half years, and then I had the opportunity to start and form this Waymo research team, which I have been leading and evolving together with the team since then. As I said, soon it will be seven years.

That's amazing. I think when we spoke, you were in the early days in the Phoenix market, trying to get some on-the-ground test experience for the vehicles. Talk a little bit about how far you've come since then.

So in 2021 we had the Waymo One service in Phoenix; I think it was probably still with the older Pacifica vehicles. Since then, a lot has happened. Just recently, maybe two weeks ago, we announced that we are now giving over 200,000 trips a week to customers, in four major markets. These are fully autonomous trips, with paying customers, at reasonable scale. At this point it is a significant part of people's lives in these cities, and an option that they can clearly use anytime they would like. You can get the app.
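As an aside on the trajectory alignment Drago describes from his Street View days: at its core, aligning many platforms' poses so the 3D reconstruction looks consistent is a pose-graph least-squares problem. Here is a deliberately minimal, rotation-free 2D sketch of that idea; the function name, the toy constraints, and the whole setup are invented for illustration and are not Waymo's or Street View's actual pipeline.

```python
import numpy as np

def align_poses(n_poses, constraints, anchor=0):
    """Toy pose alignment: poses are 2D positions; constraints is a list of
    (i, j, dx) meaning pose_j - pose_i ≈ dx (a 2-vector), e.g. odometry within
    a trajectory plus cross-trajectory matches. Returns an (n_poses, 2) array
    of least-squares positions with pose[anchor] pinned at the origin."""
    rows, rhs = [], []
    for i, j, dx in constraints:
        row = np.zeros(n_poses)
        row[i], row[j] = -1.0, 1.0  # encodes pose_j - pose_i
        rows.append(row)
        rhs.append(np.asarray(dx, dtype=float))
    # Fix the gauge freedom (global translation) by anchoring one pose.
    row = np.zeros(n_poses)
    row[anchor] = 1.0
    rows.append(row)
    rhs.append(np.zeros(2))
    A, b = np.vstack(rows), np.vstack(rhs)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol

# Two short "trajectories" (poses 0-1-2 and 3-4) tied by one cross match.
constraints = [
    (0, 1, [1.0, 0.0]),  # odometry, trajectory A
    (1, 2, [1.0, 0.0]),
    (3, 4, [1.0, 0.0]),  # odometry, trajectory B
    (0, 3, [0.0, 1.0]),  # cross-trajectory match (same spot seen twice)
]
poses = align_poses(5, constraints)
```

The cross-trajectory constraint is what stitches independently collected runs into one consistent frame; with noisy real measurements the same solve simply distributes the error instead of matching exactly.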
You can just hail a Waymo directly in San Francisco, Phoenix, and LA, and now in Austin, in a partnership with Uber; there you need to go to the Uber app to hail a vehicle, and often, or reasonably often, it will be a Waymo. In these four cities we operate at reasonable scale, and in large territories too. In San Francisco we have the full city and even a little bit of, I believe, Daly City, so around 55 square miles. Phoenix is our oldest area, where we first provided the service, and it's the largest area, I think, generally in the Western world; I'm not sure how it goes in China, but Phoenix is over 300 square miles, and there we even provide rides to the airport directly. And then we have the other two cities: LA is maybe 90 square miles, and Austin, where we're relatively recent, is around 37, I believe.

When we were building it, we were not sure how enthusiastically it would be received and how much people would love being in an autonomous vehicle. I can tell you that, at least for myself, and I'm a biased user of course, because I contributed to a lot of the systems and features that either are part of the vehicle or help support the vehicle, it's an amazing experience. I think a lot of people also really appreciate the safe driving, the comfort, the fact that there is a significant amount of privacy, you're there by yourself, and safety as part of the experience. You can play your own music. Generally, of the people who try it, a lot of them are hooked pretty quickly.

And so it's validated the vision that autonomous vehicles can be compelling, they can be safe. The company recently published some updated safety numbers; can you talk a little bit about that?

Yeah, so we share safety studies and also safety numbers for our service as we operate in all these cities, and we update them regularly. There's a hub, the safety dashboard on the Waymo website, where one can keep checking in. The most recent update was at 50 million miles, and there we share statistics about our performance as we drove people for our first 50 million autonomously driven miles. Actually, every week we now drive customers more than 1 million miles with our fleet. In these 50 million miles, there are different groups of incidents, ranked in terms of severity. The most severe we track is, for example, incidents in which an airbag was deployed; these are fairly serious, high-impact incidents. There, we are over 80% safer than human drivers would normally perform, in our estimate, in the cities and conditions where we drive; when we adjust for all of this, we find we have, I think, 83% fewer crashes that result in airbag deployment than if humans drove these miles. And that's a nontrivial amount: given the miles, that's over 60 crashes we have potentially eliminated by having Waymos do this driving. The second category of incidents is ones where there was some kind of injury; there we're again around 81% safer than an estimate of what human drivers would do, given the domain where we drive. And the third class is incidents that are police reported, so more minor, but ones you would still often report to police; there we're about, I think, 64% fewer than our estimate of what humans would do. So these are significant improvements. Also, separately, we've had studies done by insurance companies, third parties, that confirmed that in terms of property damage or injury claims, we are about 80% lower than what humans would do for a comparable amount of driving; Swiss Re is one such company. So this is another, separate benchmark, not just our own benchmarking, and we will continue releasing these. And that's what a lot of us are in this space for, not just the experience. Many people who join the autonomous vehicle domain are
motivated by the great safety benefits, and the many other benefits, that these vehicles can bring to the communities they serve. But safety is the main one; there are tons of people for whom safety comes first. Our mission at Waymo is to be the world's most trusted driver, and it underpins everything we do.

Is it still the case that the vehicles are, to some degree, remotely operated, either on an intervention basis or otherwise? And do you publish stats around that, the number of incidents per mile or something along those lines?

In terms of whether an operator can inspect the vehicle: that is possible, but it happens rarely. The fact that we've scaled to such numbers of vehicles and such scope means you're not going to have people watching over every vehicle all the time; that's just not a good product. What did you eliminate? You moved the driver to sit behind a workstation; that doesn't work. So this is mostly for very rare cases. And also, just to be clear, the vehicles are not teleoperated; it's not like someone drives them remotely. In these cases, at most, occasionally a person can confirm something the vehicle wants to do, and usually the vehicle already needs to be stopped for this to happen.

Yeah, it seems like having a real-time loop for that kind of confirmation would be very difficult.

Yes, and you do not want it. But that means you actually need to understand and deal with these issues, or at least understand enough to stop for real, which is, again, very rare; that's also not a good experience, you don't want to do it. So while it's possible, it's very rare.

So tell us a little bit about how, from a research perspective, you've been tracking and thinking about incorporating all of the ideas around foundation models, LLMs, VLMs, that have sprouted up since the last time we spoke.

So yes, generally our team looks at the state of the art in machine learning and AI and tries to adapt the advances to our domain, and usually that has involved some creative adaptation. There are some unique properties of the problems we're solving, and traditionally we've had to make adjustments, often fairly creative adjustments, of the techniques we see. Every two years or so, in my experience in this space, and I've been in it since late 2015 or early 2016, there are big technological leaps, and a lot of them have to do with machine learning or AI. Generally in this space you want to keep reinventing your stack: as these advances occur, you digest them; you want to be at the forefront and take the benefits. And what's happened in the last two years is, if anything, an even bigger jump than before, with these multimodal large language models and with generative AI technology generally, which also spans multimedia, obviously: images, video understanding, audio handling, all of these things. It's been a bit of a revolution, and we are keenly aware of the advances and have been exploring them for the benefit of our Driver. We've engaged in understanding and leveraging technologies such as vision language models, and such as diffusion prediction for generative outputs, which can include videos but can also, for example, generate driving scenes for testing. We leverage advances in 3D reconstruction; there are a lot of very interesting, relatively new technologies there, Gaussian splatting, and before that NeRF, and these kinds of technologies are also very interesting, and we've been working on them. But generally, recently there has been this trend to scale the models: train them on a lot of data, scale the architectures, and over time add reinforcement learning to target even better reasoning capabilities, all of these things. I think
that's the general trend in the space, and you can see it with all the large models. We have been scaling our models for a while, and we've had transformer architectures for a while; if you look at our publications, we had a few published externally around 2021 and 2022 with transformers, and even before that for certain applications in perception. So we have been on this trend, but what has changed is the understanding of how much more scale we can leverage. So now we have gone and scaled the models and evolved them beyond what we used to do before, and we have seen clear benefits from doing it.

Now, one thing I want to describe is why our domain is different from the vanilla vision language model. And speaking of vision language models, we have experimented with them; we even published a paper called EMMA last year, where we took a state-of-the-art multimodal large language model, which is what the Gemini people like to call theirs, and tried to adapt it for driving tasks, reasonably successfully. We got some very nice results on the driving tasks.

And this is the idea of converting trajectories into tokens and passing them through the VLM?

So the way it works, mostly, is this. What does Gemini bring to you? It brings you world knowledge: it's pre-trained on the internet, and it's a large model, so it already understands a lot of concepts that you don't have to teach it directly with your own labels or data. And then you fine-tune it on your tasks: you bring your own data and your own tasks, and you adapt it to do well on the tasks you need, given all the knowledge it brought. Of course you need to adjust a bunch of recipes; you need to make sure the tasks are well represented in the Gemini inputs, that all the sensors are properly scaled, that the right datasets are created, and so on. But when you fine-tune it, it does a reasonable job on these tasks.

There are still limitations, though. For example, our stack relies, or likes to rely, on really powerful spatial awareness of everything going on around you; it's helpful for safety. You need to understand the world in 3D even better, and we have additional sensors, like lidar and radar, that are really superb for contributing to 3D spatial awareness, which you want to incorporate, and which a vanilla Gemini model does not do. Another thing you want to incorporate is very long memory over a scene: you often need to reason based on the history of several seconds or more, and that can be a lot of frames, so you need to incorporate this memory somehow, maintain it, structure it. And you need to think about how to prevent hallucinations: the model can predict things, but every once in a while, if you go to domains where it has not seen things, it can be wrong, so you need to think about what the mitigations are. We actually have a lot of experience launching powerful machine learning in every part of our stack, but doing it with a stack design that is built to mitigate all these issues of falling off the data manifold, hallucinating, and so on. So all of these are still areas that are very much ripe for further work, even though we have these vision language models. But we want their power; we want to bring in external knowledge. The internet has a lot of knowledge, and when you pre-train on it there's a lot of value in having these kinds of pre-trained components, and we can also benefit from some of the architectures and scale.

What are some of the specific tasks that you've explored using VLMs for?

For example, they can obviously be explored for understanding all kinds of perception outputs. They understand semantics very well; they're trained such that you show the model an image and you can ask a variety of questions, and a lot of them are
really powerful even on fairly obscure details. Traditionally, that's a property you want to benefit from, because otherwise, in the old days, maybe 5 to 10 years ago, you would use humans to label all these concepts; now the model understands a lot of them out of the box. That's very powerful.

Right, so the idea being, the model sees the scene, and as opposed to the old world of bounding boxes around humans, balls, scooters, that kind of thing, the model just knows what those things are and that they're important.

Yeah, and without us having to label all these concepts, so you can directly bootstrap yourself. The other power of all these models is that the more scaled they are, the larger they are, the better they generalize. Before, if some new variant of a concept appeared that you for some reason didn't label but need to handle, you would have to label a lot of examples for a small model to really capture what those things look like, and then for every new, particularly different-looking thing, you'd need to label more examples. Here, and for perception mostly, as I'm describing this, the large models, like Gemini or a lot of these other external vision language models, mostly understand what it is already; it's a big model and it generalizes. When it sees a trench, it can see a trench in a whole bunch of conditions and know it's a trench. For small models this is harder; it requires more bespoke engineering to ensure. So that's a nice benefit.

Now, the scaling part is appealing, but there's also this interesting question: how far can you push scaling, and how much of this knowledge can you bring into our model, such that we can now get data from new cities, from new countries, from different software platforms? You can move the cameras, you can change the lidars on the vehicle over time, which we do; we have a sixth-generation Driver that keeps evolving the sensor stack, which is then something to think about. With all these properties, the question is: can we have a model that just generalizes across all of these things? We want to scale to dozens of cities; we don't want to do that much bespoke work. And the best way to scale a model is to first try to scale it in the data center: remove the constraints of real-time response, which is needed for safe driving; remove the constraint of only seeing events as they unfold, since offline you can potentially see the future too; remove constraints on how much you can scale the model. Then, how much understanding can you get out of this model, how much can you generalize? We have been pushing to understand this question, and we have been aiming to build what we call the Waymo foundation model, which is not just a vanilla VLM adaptation; we want to bring in all these capabilities I talked about that expand on a standard vision language model, build it large, build it in the data center, and see how much of all the data we see and collect it can understand. The better this does, the more straightforwardly this model can then act as a teacher to the models in the car: you can distill it down. And of course, you still need to do this with thoughtful design, ensuring that the onboard stack fits all the safety constraints, all the latency constraints, and so on. But then this thing is just a source of knowledge that you can always ask, both to mine data and to understand the data you mine. There's a very interesting question of how far you can push it, but the more you push it, the more you accelerate the scaling, the seamless scaling, of the service. So we want to explore that. I think it's a very exciting direction, and it's in tune with the times; it's the age of these types of models. And it tackles one of the
historically challenging aspects of autonomous vehicles, and that is generalizability. You've got lots of data captured about a particular road during the day in clear weather, but the vehicle needs to operate on that same road at night, in bad weather. Historically you'd either need to capture that data manually or use some kind of data augmentation scheme to train the model so it can adapt to those variations. But if you're able to achieve sufficient generalization with a scaled-up model, that eliminates, I would think, a lot of the challenges of getting coverage in the markets you're in, and of getting into new markets.

So the reality is, we have good generalization and coverage in the existing markets already; we operate in them, clearly, and have passed a certain bar. But a new market, and the more different the new market is, or the more different the new market and sensor stack are, the more it pushes your ability. You can do it a lot faster if you have this kind of capability than with the process today, which would of course involve going to explore the market, collecting data, labeling some data, understanding anything that may be different, and making sure you're confident that you're safe. There is a standard procedure for how we ensure safety in a market, which is important, but we could accelerate it.

How long do you typically need to be testing and collecting data in a new market in order to launch in that market?

It's been a very much evolving process. The first time, say, when we did Phoenix, we did East Chandler first, which is a bit of a suburban area of Phoenix, and suburban areas have one set of challenges: for example, unprotected left turns on, not a freeway, but a reasonably high-speed road, 45 miles an hour, and people speed, so it can be a lot higher, and it's really important how you can go in front of traffic moving at such speed. There are a few other challenges there, but it took a while to sort out and make sure our system is robust. Then you go to San Francisco and you see, oh, it's totally different: there are all these crowds and totally different behavior, and the streets, in a big part of the city, are a lot narrower; it's just a different, dense urban scenario. So there it took a while too. And then we went to Los Angeles, which somewhat combines the properties of suburban Phoenix and San Francisco, and by then we had moved more into downtown Phoenix and so on, so we had a bit more of a span, everything from more suburban to urban to dense urban. When you combine those two, LA actually was fairly straightforward. So it's a bit about covering the design domain: the more you cover it, and on our path we've been doing this, the easier the next city is; models generalize, so every new part you collect helps you with the next part. It's an accelerating process too.

Now, separately, what do we work on a lot today? Freeways. Freeways offer a whole set of new challenges compared to the ones we've seen in, say, the more suburban areas or dense urban ones. There are different things to be concerned about, and you need to validate them: for example, on a freeway you need to be robust, at 70 or 65 miles per hour, to any kinds of failures that can occur, and you also need to worry about stopping, because stopping at an arbitrary position on the freeway is inherently unsafe, and sometimes there isn't even a shoulder on certain parts of the freeway, so you need to think about what you do there. That's different, bespoke work to think about. Another dimension, of course, would be snow and ice. There are a few of these, so I can't give you an exact estimate, because it really varies, but we have a robust process to ensure that we gain sufficient confidence in
the market before we launch.

In the markets that you're in now, do the vehicles operate on freeways, or only on surface roads?

For most customers today we do not provide fully driverless freeway driving, but we have actually been testing on freeways without a driver, in some capacity, for a while. It's still an area we are scaling and working on, and we would of course very much like to provide this to customers, but it's still an ongoing process. Take Phoenix: we're clearly motivated to do freeways, because Phoenix, for example, is 300 square miles, and going from one end of that area to the other, you really want to take the freeway; otherwise your trip would take a while compared to what is possible, and you want to provide good service. So it's clearly an area of high importance.

Kind of going back to the foundation model work: when you think about this challenge of scaling up the foundation model in the data center, one of the challenges you mentioned is incorporating all of the different sensor data that you might want, which isn't necessarily native to VLMs and to images. How do you approach that problem?

It's a very interesting problem, actually; it's a bit open-ended how to do it best. Maybe one way to think about it: audio is an example of this too. Initially, when people built these models, they tried images first, and later audio became a domain too, so you want to build a multimodal model. Now, how to actually build it, and how to insert data from a new sensor into a model, or pieces of a model, that were trained without it, is actually an interesting research problem, and there are many approaches. I can't tell you exactly how we do it; we're experimenting with several and learning along the way, but we find it's one of the key questions in our domain.

Is it building a foundation model on top of existing VLMs, where part of the motivation is taking advantage of the world knowledge embedded in them? Or the alternative might be training from the ground up a foundation model that understands these different modalities and is able to do the things you want, though it's not clear what exactly that would look like either.

Well, exactly. There's this interesting question: let's say you train something fully from scratch yourself; then you need to worry about how you bring in all the knowledge that people usually put into these models from the internet. You still want it somehow; maybe you need to get that data, and there's still some kind of fusion of modalities that needs to come together; it's just a different way of looking at the problem. It's also expensive, of course, to train a model fully from scratch, so you need to think about that; that's one aspect. The other aspect is how we can learn and build a model from our domain, using data from our domain directly. That's separate from leveraging world knowledge from existing models one way or the other, where you can think about how to use them to label data for you, or take pieces of them; many things are possible. Now the question is, how do I train a model in our domain to understand our sensors? And there's this interesting possibility: you can learn, in our domain, to predict the future. How do you pre-train large language models? You predict the next token in language. What can you do in our domain? You can predict what the sensors will give you in the future. Imagine if you could tokenize your sensors: you can predict the next sensor token, and the next sensor token. And if a model is
able to predict accurately how the sensors will look in the future, whether it's lidar or camera or radar, it needs to learn a lot about the environment, because it has to capture all the knowledge in there to predict how things evolve, so it can imagine how you will see them. That's very powerful pre-training; it's equivalent to next-word-prediction pre-training. The issue, of course, is that a word can be tokenized into a vocabulary of maybe 100,000 tokens, or something to that effect, and then you're predicting a distribution over 100,000 discrete tokens; that's fairly straightforward. Our sensors are higher resolution, and they're different, multimodal. So for those, what's the vocabulary? How do you actually predict them? That opens a whole bunch of interesting questions: do you need an autoregressive model, do you need diffusion-type predictors, or a mixture of those; how do you set it up; what is effective? There's work in this space already, in academia and in industry, trying to build what's called a world model: you're dreaming the futures of the world you're modeling, from your own sensors, and that can also be useful for training large models. How best to build it is a fascinating question too.

And of course, the richer the futures you dream, the larger the models you need, because now you need to capture a lot of nuance. Beyond a certain level, there's a difference between "can I scale a model to just predict what I should do, and maybe some high-level properties of the scene" versus predicting everything at super-high fidelity, at the pixel and lidar-scan level; you potentially need larger models to do that well. So there are trade-offs: how much compute are you going to spend to pre-train these very large models, how much of that actually results in driving improvement for you in the end, and how does the math work out? Is it compelling enough? That's also a very interesting question. But for good or bad, some of these world models are needed for the simulator, so you need to think about how to build them regardless. In our domain, it's not just about building the driver; there are two main problems. One is to build the driver, and I talked about this in my GTC talk as well. The other is to validate it: to ensure you are confident this model, or driver, performs well in the vast majority of cases we need to handle. The environments in which vehicles drive span all kinds of diversity: there are seasons, as we already talked about, different operational domains, and all kinds of human behaviors, some of them very rarely seen, that you need to deal with in front of you; you need to interact with humans. All of this, of course, is great to test in a simulator, and it's a great tool for testing at scale, but then the question is how you build the simulator, and now you want to use machine learning to build it.

Before we dive into validation, I want to dig into this idea of predicting future sensor values. When I think about that, I think there's a causal relationship between the vehicle trajectory, or the vehicle behavior, and what the sensors pick up as the vehicle progresses through a scene. I'm trying to wrap my head around what it means to predict future sensor values, as opposed to predicting the thing that is the root cause of that, the vehicle trajectory. How do you think about the relationship between future sensor prediction and the vehicle itself, and whether you use that as a control input or something?

So let's contrast the two a little bit and see if we're thinking of the same setup. In our EMMA paper, for example, you have a vision language model, and one of the tasks we can teach it to do, though we tried several, is this: you can predict the road
graph that's in front of you you can predict the 3D boxes that you see but one of the interesting properties the main one maybe you predict is what's the driving trajectory you're likely to follow and you can of course sample several and the start of that trajectory actually then you can turn in the controls the rest is more speculative kind of showing you what it likely to be like over time right and of course the longer you predict the trajectory by the way in our domain it's the more uncertainty it gets whether you're actually taking it because uh it also depends on certain Assumption of the world around like next word prediction or next Tok prediction but like you know I you can think I'm driving this trajectory for a second but if the vehicle in front of me SS I'll adjust it so it's all relative to what the others do a little bit as well that said it's a very good output and the start of it you always would drive right so it's it's it's related to planning quite strongly but the trajectory how do you train this model right in our domain you would actually observe uh either our drivers or potentially how others drive and you saw what they did in the future and you collected it and now you're teaching the model to predict just those tokens of how driving tokens for a few steps in the future that's a signal right that's the most topical signal for how you should drive you watch how people drive you see their tokens and you learn to predict them now when you think of sensors it's it's essentially the environment that goes with how you drove instead of predicting just how you drove now for examp example maybe a Next Level imagine is observe others and predict how they drive next to you a lot of the models do this already right so not just predicting yourself we have a model called motion LM published maybe two years ago and it's in even all the work uh showing how to turn motion into conversation and you can train the models with that level of abstraction so you 
say: I'm not just teaching a model to predict how I drove, I'm going to teach it to also predict the other people's motion tokens around us, and so you predict jointly how a whole group of them behaves. That's a richer signal; you're teaching the model more things from every example, and they're highly topical. Now you're teaching it not only how I behaved, but how everyone else reacts to me, and then maybe how I would react to them. You're modeling what we call the joint distribution of at least some group of agents around us, so that's a richer signal. Now imagine you go to sensors. That's maybe the very high-bandwidth, detailed variant of everything that happened before you. The environment not only changes because you, say, took a certain driving path; it also changes because others reacted to you and changed their behavior relative to you, and that's also reflected in the sensors. And beyond that, you need to think about how everything actually changes appearance as you move, how objects may look from the other side as you go around them, how reflections may work, and how a certain lidar intensity relates to the RGB pixels, because now you're dreaming the future of the environment, the full environment in its full fidelity. So there's an argument that it's a more grounded prediction, in that it reflects the actual environment the sensors are intended to capture. It's the most high-bandwidth prediction: this is all the information you actually had about the future, which is your full sensor stream, and that's the most you could try to predict. But then the issue is: are you overdoing it? For example, as a human, I don't spend my time thinking about how the tree looks from the other side when I drive. As I go, how would the branches occlude each other from a different viewpoint, or how will the shadow evolve on that pedestrian? I don't necessarily have to think about it.
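As an aside for readers, the motion-token idea described here can be sketched in a few lines. This is a toy illustration under assumed conventions (a hand-picked five-delta vocabulary and unit-grid motion, invented for this example), not Waymo's actual MotionLM tokenizer:

```python
# Toy motion tokenization: each agent's per-step (dx, dy) displacement is
# snapped to a small discrete vocabulary, turning multi-agent futures into
# token sequences a language model could be trained on.

MOTION_VOCAB = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

def tokenize_trajectory(points):
    """Map a list of (x, y) waypoints to motion-token ids (nearest vocab delta)."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        # pick the vocabulary delta closest to the observed displacement
        tokens.append(min(range(len(MOTION_VOCAB)),
                          key=lambda i: (MOTION_VOCAB[i][0] - dx) ** 2
                                      + (MOTION_VOCAB[i][1] - dy) ** 2))
    return tokens

def interleave_agents(per_agent_tokens):
    """Joint sequence: at each timestep, emit every agent's token in turn,
    so a model trained on it predicts the group's next move jointly,
    not each agent independently."""
    return [tok for step in zip(*per_agent_tokens) for tok in step]

ego = tokenize_trajectory([(0, 0), (1, 0), (2, 0), (3, 0)])    # driving east
other = tokenize_trajectory([(5, 5), (5, 4), (5, 3), (5, 2)])  # driving south
print(ego, other, interleave_agents([ego, other]))
```

Autoregressive prediction over such an interleaved sequence is one simple way to realize the "joint distribution over a group of agents" framing; richer vocabularies and conditioning on the scene would be needed in practice.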
So while there's a lot of knowledge inherent there that the model can capture, not all of it is relevant to driving; some of it is. And so now there's this interesting trade-off: do you really want to go all the way there, and how much benefit are you going to get? Clearly some version of fidelity is relevant. Predicting what people did a few steps into the future is the minimum you can do, and the most topical; then you broaden, and it becomes increasingly less relevant. You could model how the clouds move in the sky, and maybe it's relevant to whether it's raining in an hour, but I'm not sure it will help your driving. So there's this trade-off: there's a ton of signal in these sensors, and you could choose to model it. It's helpful in some cases, for a simulator, to model it, but it doesn't always help driving, and so we need to discover what the right balance is. Remind me, or maybe talk a little bit about your approach to end-to-end training versus subsystems that map to traditional approaches to robotics or vehicles. Are there separate control, perception, etc. systems, or are you moving towards or away from end-to-end modeling? I vaguely remember us talking about this a few years ago as one of the differentiators of your approach, but I don't remember which side of that divide you were on. We're on the practical side: we will take the thing that works best. That's what I would say. Now, end-to-end in general is a bit of an overloaded topic. I know it's been made into a big issue by a lot of people, and it's compelling in some ways, but let's untangle what end-to-end means. There's kind of a mild definition and a strong definition. The strong definition is: hey, you're going to pass gradients from the controls, what you want the system to
do, through the whole system, maybe all the way to the sensors, and along the way you learn features that you may not necessarily be able to inspect yourself or that have semantic meaning; they just emerge from the training process, and they may help you drive. End-to-end, in general, is a training strategy. It's orthogonal to whether you have modules in the stack: you can still have modules and train things end to end, if the models are connected properly and you can pass gradients between them, and you can still train many tasks at once. So it doesn't have to be sensors-to-controls. You can also potentially consolidate large pieces of the stack, so every piece of your stack becomes more end to end, because initially you had, say, ten different bespoke models handling certain tasks, and now you've consolidated them; that part became more end to end. So it's a complex topic. Now, I would say that what we've learned over time is that you want few large components if possible. It simplifies development, and it allows you to scale those components more and optimize them, as opposed to having a bunch of little custom things, each doing different stuff with different data generation. So you want few consolidated components, and now the question is how few. And of course, the other part is that if you have a few, you still ideally want to back-propagate through them. But how few? And can you just make do with sensors-to-controls, the extreme? Sensors-to-controls, I think, is what people are really excited about in academia, and it's very powerful in an academic setting. It's a compelling idea: just collect enough data, you've got some black box in the middle that has sufficient scale or number of parameters, you've got some desired output, and you let the middle part figure it out. And our EMMA paper does this too: we do train it end to end, and it predicts nice trajectories and generally does a good job with a fairly straightforward recipe.
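To make the "modular yet end to end" point concrete, here is a deliberately tiny numeric sketch (an assumed toy model with made-up scalar weights, nothing like a real driving stack): two "modules" stay separate, but the loss gradient flows through the planning module back into the perception weight via the chain rule:

```python
# Two scalar modules: perception (sensor -> feature) and planning
# (feature -> control). They are distinct components, but training is
# end to end because the gradient of the control error reaches the
# perception weight *through* the planning module.

def train_step(w_perc, w_plan, x, y_true, lr=0.01):
    h = w_perc * x            # perception module
    y = w_plan * h            # planning module
    err = y - y_true          # squared-error loss (y - y_true)**2
    grad_plan = 2 * err * h   # d(loss)/d(w_plan)
    grad_perc = 2 * err * w_plan * x  # chain rule through the planner
    return w_perc - lr * grad_perc, w_plan - lr * grad_plan

w_perc, w_plan = 0.5, 0.5
for _ in range(500):
    w_perc, w_plan = train_step(w_perc, w_plan, x=1.0, y_true=2.0)
print(round(w_perc * w_plan, 3))  # → 2.0: the composed system fits the target
```

The same structure scales up: replace the scalars with networks and the chain rule with automatic differentiation, and you have modules that remain inspectable units while being optimized jointly.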
So that's appealing, but the big questions there are: can you simulate it, and can you prove that this system does not hallucinate? Provability, controllability, introspection: how do you validate the system? Now I need a simulator that also goes from sensors to controls, and in our domain there are quite a lot of sensors, and they're not that easy to simulate. The technology for this has been evolving, but then you're really tied to it, and you need to scale it. Imagine driving over a million miles a day in the simulator and having to simulate everything you could possibly see in those sensors, and usually most companies have at least a couple of different sensor types. You need to run a million miles with over a dozen cameras, a few lidars, a radar, for example, and it needs to be really realistic, otherwise it doesn't count. How do you do it? That is the challenge: how do you prove to yourself that all these cases can be validated with just this setup? That's hard. And then there's the other big question: let's say you train a huge model and now you want to change it. Well, what do you do? The only thing you can do is change the data and retrain the model. Now you've fixed it, but maybe it broke something else. It's really hard to fix something quickly when all you have is one model that everything goes into. So there are all these considerations, and I would say there is reasonable consensus in our space that you need few large components, but there is no consensus yet on whether it should be one component, and I think a lot of that comes from testability. In our case, when we're doing fully autonomous driving, it's top of mind: you design your stack in a way that you're sure you can test. That's a constraint. So anything we do, we're doing with the mindset that
hey, we actually know how to test this at scale to actually deploy a driver. When you're doing driver assist, it's less demanding; you don't have to test it to this extent, so maybe there you can get by with a fully end-to-end model, as some people have. But in autonomous driving, when you have to release it to drive hundreds of thousands of trips and millions of miles a week, that's a high bar. Between that and the ability to fix things as they arise, these are top of mind when you build a stack that's fully end to end, so you need to have an answer. These are the trade-offs, and we're doing our best. I think it's appealing to do end to end generally, and it learns good features when you can do it; just prove to yourself you can validate it. Does the more modular approach that you've taken give you controllability, testability, introspection, quote-unquote, for free? Or are those all active research topics as well, like how to ensure a degree of, I guess, explainability, which is the word that was briefly escaping me? Explainability and controllability: if you have a model, even some big VLM like we did with EMMA, you want to have a way to detect and steer it when it's doing the wrong things and hallucinating. So you need a harness around it, and there are two kinds of harness you can have. You can build a harness around it that operates in the deployed system (there are ways to do this, and we have experience) that makes sure to catch this model when it hallucinates things it should not do, and controls it on top. The other part is you can catch it in the simulator, which we discussed, but then you need to build a fully realistic simulator. In both cases, you need to understand the world a lot deeper, and more symbolically, than just having sensors-to-controls.
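The runtime-harness idea can be illustrated with a toy gate (the checks and thresholds below are entirely hypothetical, invented for illustration, not Waymo's actual onboard system): before a model-proposed trajectory is executed, it is screened against simple symbolic constraints, with a fallback if every proposal fails:

```python
# Toy guardrail: reject model-proposed trajectories that violate basic
# physical or safety constraints, and fall back to a known-safe plan.

def plausible(trajectory, obstacles, max_step=2.0, min_clearance=1.0):
    """trajectory: list of (x, y) waypoints; obstacles: list of (x, y) points."""
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 > max_step:
            return False  # kinematically impossible jump: likely a hallucination
    for (x, y) in trajectory:
        for (ox, oy) in obstacles:
            if ((x - ox) ** 2 + (y - oy) ** 2) ** 0.5 < min_clearance:
                return False  # path passes too close to a known obstacle
    return True

def choose(proposals, obstacles, fallback):
    """Return the first proposal that passes the checks, else the safe fallback."""
    for traj in proposals:
        if plausible(traj, obstacles):
            return traj
    return fallback

good = [(0, 0), (1, 0), (2, 0)]
bad = [(0, 0), (9, 9)]  # teleporting: fails the kinematic check
print(choose([bad, good], obstacles=[(5, 5)], fallback=[(0, 0)]) == good)  # → True
```

The key design point matches the transcript: the harness reasons symbolically (distances, obstacles, kinematics) about an output produced by an otherwise opaque model.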
And you need to somehow get the confidence that you've run sufficient simulation samples to get the coverage you need. That's actually a science, and it's extremely hard. I think there's this other interesting aspect of the simulator. For example, as you collect more data and make your model larger, you can have it predict better trajectories for you, so it's data-efficient in some way: if you just want to improve over time and you get more data, you can keep adding data and the model will improve, at least in certain ways, over time. When you want to test it, you cannot do that. You actually need to think of all the cases that could go wrong, and you need to create ways to validate all of those. They can involve all kinds of hardware failures, sensors going wrong, rare cases, you name it. You need to somehow handle all of this in your release cycle, and that does not work as well as just taking data and feeding it into the simulator so the simulator tests you better; there's some truth to that, but it's not sufficient. So you need to be very thoughtful on the testing side. It's not purely a data-driven exercise: you need experts thinking about what could go wrong and working on mitigating it, and it scales less straightforwardly than something where you can just put in more and better data, suitably picked, and the driver improves. That is true, but your ability to test it is not similarly scalable. How do you characterize the test coverage, or the tested-ness, of a given model? Do you need to get to, provably, some standard of proven-valid, or is there more of a statistical determination of tested-ness? So I think you need to be very comprehensive; that's why it's hard: it's not just some data-driven model where you throw data at it and it somehow tests you. But you're also not trying to get to, like, NASA levels of mathematical proof of complete controllability, are you, I guess?
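On the statistics of "sufficient simulation samples": a standard back-of-the-envelope tool is the zero-failure confidence bound (this is generic statistics, not Waymo's actual validation methodology, and it treats miles as independent trials, which is itself an approximation). With zero failures observed in n trials, the exact one-sided 95% bound on the per-trial failure rate is 1 − 0.05^(1/n) ≈ 3/n, the so-called rule of three, which is one way to see why rare failure modes demand enormous or cleverly targeted test volumes:

```python
# Rule-of-three sketch: zero observed failures in n independent trials
# bounds the failure rate at roughly 3/n with 95% confidence.

def failure_rate_upper_bound(n_trials, n_failures=0, confidence=0.95):
    if n_failures == 0:
        # exact one-sided bound: solve (1 - p)**n = 1 - confidence for p
        return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)
    raise NotImplementedError("this sketch covers the zero-failure case only")

# To claim a failure rate below ~1 per million trials with ~95% confidence,
# you need roughly 3 million failure-free trials:
for n in (1_000, 1_000_000, 3_000_000):
    print(n, failure_rate_upper_bound(n))
```

The bound only shrinks linearly in test volume, which is why expert-designed rare-case injection, rather than raw mileage alone, matters so much for coverage.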
I mean, there's also this question of how you even prove it, because if people do very unreasonable things, some of them you cannot even mitigate. You're stopped and someone tailgates you at full speed; what are you going to do? You can't even get out. So it's already relative to what's reasonable to expect. But we have published our safety framework, and we are one of the few companies that have done it. It details the whole set of tests we subject our driver to when we want to release it, and it's a very broad set: it includes hardware testing and failure injection, all kinds of standard analysis techniques, the simulator, driving on closed courses, testing with drivers, and all kinds of things. It was actually difficult to develop; it's very comprehensive and takes a lot of effort, but we have shown that we can iterate over time and release improvements to the stack while going through this validation cycle, and I think that's one of the great things we've been able to accomplish. It's bespoke, it's diverse, many things are combined. But if you ask me what we are testing for, you need to test for a lot of things. One obvious one is that you should feel you are safer than the human driver. A second one is that you should be confident you don't get stuck all over the place, either. A third one is that you want to be thoughtful and mindful around, say, ambulances and police vehicles. And it goes on and on; the amount of things you need to be able to do is broad. In construction, for example, or if directed by an official with some gesture, you need to try to understand it. It's a broad set of cases you need to check and handle. Going back to the question about data-only and end to end: we didn't
explicitly talk about a rules-based approach. How do you incorporate that long list of scenarios into the model? If just throwing more examples into a dataset isn't good enough, how do you do it? There are different ways. Generally, we have a lot of experts with insight into what good driving is, motion-planning experts; that's one. Second, there's generally this question of how you define a reward function for driving. What is good driving, even? How do you define it, and how do you ensure your stack satisfies it? It's a very interesting question. So far we discussed maybe imitating some drivers; that's one way to define it, but it's not the only way. You can say you should not do a whole bunch of things, so you can write this rule book and ensure certain things are true and optimize for it. But it's a difficult endeavor to define it and then make sure the model takes it into account. It's an evolving space, and we have reasonable approaches. Generally, some things are easier to define than others; for example, "don't hit things" is somewhat straightforward. I'm thinking of a case from just yesterday: I was on a road, and there was some work happening on a transformer on the side of the road, so they had combined the two-lane highway into one lane, and they had signal people on each side with the sign that spins from slow to stop. And I'm wondering, in the context of this conversation: is that something where you just get enough data about those situations and the car figures out what to do? Or do you have some rule set somewhere that defines for the car, when this is happening, what's going on and how it's going to behave? Or does it follow more readily from what the car in front of you is doing than I might think?
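A rule-book-style objective can be sketched as hard constraints plus soft costs. The rules, weights, and field names below are entirely hypothetical, invented for illustration; the transcript only says that such rule books exist and are hard to define well:

```python
# Toy rule-book score: hard rules veto a candidate plan outright,
# soft terms trade progress against comfort.

def rulebook_score(plan):
    """plan: dict with 'min_gap_m', 'max_decel', 'progress_m', 'jerk'."""
    # hard rules: violating any disqualifies the plan
    if plan["min_gap_m"] < 0.5:     # "don't hit things" (with a buffer)
        return float("-inf")
    if plan["max_decel"] > 9.0:     # beyond plausible braking authority
        return float("-inf")
    # soft rules: reward progress, penalize discomfort
    return plan["progress_m"] - 2.0 * plan["jerk"] - 0.5 * plan["max_decel"]

cautious = {"min_gap_m": 2.0, "max_decel": 1.0, "progress_m": 20.0, "jerk": 0.5}
reckless = {"min_gap_m": 0.1, "max_decel": 3.0, "progress_m": 30.0, "jerk": 2.0}
best = max([cautious, reckless], key=rulebook_score)
print(best is cautious)  # → True: reckless is vetoed by the gap rule
```

The hard part the speaker alludes to is exactly what this toy hides: choosing the rules, their priorities, and their thresholds so that optimizing the score actually produces good driving.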
It's complicated, because with a lot of those things, well, normally you shouldn't cross the lane boundaries, but when they put the cones out, you actually should. But then, separate from the cones and the sign, the human actually gestures you to do something. That's what makes it very complicated. So we combine different approaches to make sure we can handle this as well as we can. It seems like an example of at least a failing of a fully end-to-end process: you need to collect a lot of data about what are probably individually unique situations that share this pattern in order for the vehicle to do the right thing on its own all the time, and that's limiting. Yes. If you did not see some situation, the model may just nail it, or it may not, and that's a challenge. Now we need to know: does it nail it often enough that you say it's good enough, or does it fail often enough that now you need to think of other things to do? That's part of convincing yourself, of validating the driver. You mentioned earlier, when we talked about NeRFs and 3D Gaussian splatting and diffusion models, that those were technologies you were looking into, and a lot of that comes up in the validation part of the equation. Can you talk about some of the ways you're using those technologies? I can say the following. So far I was trying to tell you: validation is difficult to scale, it's complicated, you need to handle all kinds of individual cases, and some of them need experts. That's true, but one of the best ways to scale validation is still to use machine learning to build the simulator better, because the simulator is one big part of your validation story. And now you need realistic scenarios, and interesting scenarios, to be played in this simulator. Traditionally, maybe ten years ago when I started, the
state-of-the-art in simulation was computer graphics methods. You collect a whole bunch of assets, houses and trees and bollards and whatever, and then, when you drove, you figured out those assets, where they are and how they look, and you replaced the scene with the assets, placing them there, ideally automatically. Then you have a 3D world; now you can drive in that 3D world, and you can use computer graphics technology to make it look real. There was a sim-to-real issue, where the models wouldn't react to the graphical versions of things in the same way they'd react to the real-world versions. It can easily be, and this is still true in a lot of robotics now. It's true that they're correlated; they look similar-ish, but they're not the same. Why? Because computer graphics, honestly, is again a certain approximation. Who set the properties of what the sun should be like, what the diffuse aspects of the environment are, whether there is fog? You need to set a whole bunch of knobs properly. What is the camera like in detail, is it grimy or not? You somehow need to do all of this, and it's really hard in a computer graphics environment; there are a lot of knobs, even though it's nominally powerful. And so there is this gap. There's also the question of how good you are at taking what you drove and placing all these assets: did you have all the assets or not? If you place different assets, if you set the conditions differently, again, the gap is there. So you're not guaranteed to be able to reconstruct well anything you drove: you can drive at night, in fog, in rain, and things may look different. Now, the dream for a scalable simulator is to be able to just reconstruct it from your own sensors. We have enough sensors, and they're extremely rich, cameras and 3D sensors, so you should be able to go a long way toward just building yourself the simulator from what
you drove. So the dream, and that's something we work to enable, is: any situation I drove with my car, I can build a simulation environment from. I may have to populate it with additional traffic, suitably; that's where generative models come in. I may need to evolve it in response to the things I do, because my decisions impact the decisions of others, and they respond. I need to model that correctly, otherwise you're not going to have realistic outcomes: if an agent drives exactly what it drove when you weren't in front of it, it will plow into you just because you did something different. You can't just replay what the agents did naively and get all the signal you wanted; you need intelligent agents too. So between reconstructing the world, being able to say "I want to be confident I can drive this intersection in San Francisco," and being able to populate it with agents that behave the way it is reasonable for pedestrians and vehicles to behave, that's a really interesting, challenging machine learning problem, and I'm inspired by it. It's fascinating. Some of these technologies, 3D Gaussian splats and NeRFs, are for the reconstructive side of things, so you can reconstruct a really good representation of the environment you drove. But those technologies don't necessarily allow you to augment it, or dream it, or turn day into night, or place new agents as easily as would, for example, something like a diffusion model. That's more of a generative type of model: it can dream new parts of the environment, or agent behaviors (complex ones where the appearance matters), or futures. But then again: do you dream free-form, or do you want to condition yourself and land at some specific intersection in San Francisco? How do you do the two together? There are some interesting questions here that I think the field is now making progress on, so it's exciting, and we are, alongside the field, making some exciting progress as well.
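The point about naive log replay can be shown in one dimension (a toy setup with made-up numbers, not Waymo's actual sim-agents models): when the ego deviates from the log, here by braking to a stop, a replayed agent drives straight through it, while even a crude reactive car-following rule keeps a gap:

```python
# Replay vs. reactive sim agents on a 1-D road (positions in meters).

def simulate(agent_step, ego_positions, agent_start, speed=2.0):
    pos, out = agent_start, []
    for ego in ego_positions:
        pos = agent_step(pos, ego, speed)
        out.append(pos)
    return out

def replay_agent(pos, ego, speed):
    return pos + speed  # blindly follows the logged speed, ignores the ego

def reactive_agent(pos, ego, speed, gap=2.0):
    # simple car-following rule: never close within `gap` of the ego ahead
    return min(pos + speed, ego - gap)

# Ego brakes to a stop at x=5 (a deviation from whatever was logged):
ego = [3, 4, 5, 5, 5, 5]
replayed = simulate(replay_agent, ego, agent_start=-2)
reactive = simulate(reactive_agent, ego, agent_start=-2)
print(replayed[-1] > ego[-1])         # → True: replay agent drove through the ego
print(reactive[-1] == ego[-1] - 2.0)  # → True: reactive agent holds the gap
```

Real reactive agents would of course be learned models of human behavior rather than a fixed following rule; the sketch only shows why reactivity, and not replay, is the requirement.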
One of the ways you make some of this progress is by offering challenges to the broader community, to help solve some of the things you're focusing on in a given year, and you recently announced those challenges for 2025. Can you talk a little bit about those? Thank you for mentioning this. We've been engaged with the academic community since 2018; we started working on this when I joined Waymo, and in 2019 we released a dataset, at the time the largest and most comprehensive, in terms of actually having high-quality Waymo sensors and quite a few scenarios, compared to what was reasonable for the time. We've kept upgrading it every year, adding more tasks or additional examples of interesting driving or things that need to be understood. And in parallel with the dataset work, we organize challenges on tasks which we believe are exciting and where there's headroom to make more progress in terms of interesting approaches and solutions. This is our sixth year. We actually have not announced them yet; we probably will have by the time this podcast airs, and they're supposed to launch at the end of March. Every year we do four challenges, so we pick certain tasks; we've done over 15 so far across the six years. We have some very exciting ones this year. One of the interesting ones we are aiming to launch, which relates to our conversation today a bit, is end-to-end driving from just camera inputs. We are sharing some very interesting scenarios among the ones Waymo has seen, and we want to see how people can take some of these large models and get them to generalize on reasonably rare conditions. So that's exciting. We also have a simulated agents challenge; we're unique in running such a challenge. It's a very interesting problem where you're building the agent models that populate the simulator, and we have ways to validate whether
they're good models; we have improved the metrics a bit this year. We have one on traffic generation, again something we discussed. Let's say I give you the road graph of an intersection: I want you to populate it with realistic traffic, where the relevant positions, velocities, and behaviors are reasonable and not ad hoc. We have ways to measure how realistic a traffic scenario is, and we want to see what people can do. And there's interaction modeling as well; that's a challenge we're bringing back, with a bit improved metrics, from maybe four years ago, which is when we last ran it, when we last talked. And here I am, back on your podcast four years later. That's an exciting one: modeling how agents react to one another well is an interesting challenge, and I think a lot has changed, so I'm excited to see how the new crop of methods people try does with it. Well, we will include a link to those in the show notes, as well as a link to the past conversation if anyone wants to go back to that one. But thanks for taking some time to chat with us about what you're up to there and how you're approaching this new world of foundation models; it's very cool stuff. Thank you so much for having me on; it's always a pleasure. And I look forward to you trying us and telling me how it goes as well, when you are in a city with the Waymo driver. For sure, for sure. Thanks so much, Drago. Thank you. Bye.