I'm going to talk to you about exciting trends in machine learning. It's going to be a very broad talk; it's not going to go into detail in any particular area, but I think it's important to understand what is happening in this field, what is exciting, what the opportunities are, and also what things we should be aware of as we build out this technology for everyone. I'm presenting the work of many, many people at Google: some of this is work I've been involved in as a co-author, and some of it is just cool work I think you should learn about. With that, I'm going to give you another glimpse of the Slido number: 22072019.

Ten or fifteen years ago, speech recognition kind of worked, but it wasn't really seamless; it made lots of errors. Computers didn't really understand images, from the pixel level, in terms of what was in an image. For language, there was a bunch of work in natural language processing, but it wasn't really a deep understanding of language concepts or of multilingual data. I think we've moved from that state to one where you actually expect computers to be able to see and perceive the world around us in a much better way than they could ten years ago, and that opens up all kinds of amazing opportunities in pretty much every field of human endeavor. Think about when animals evolved eyes; we're sort of at that stage in computing. We now have computers that can see and sense, and that's a completely different ball game.

The other observation is increasing scale: larger-scale use of compute resources, specialized compute (as I'll talk about), larger and more interesting and richer data sets, and larger-scale machine learning models. Scaling all of those things tends to deliver better results, and that's been true for the last 10 to 15 years. Every time we scale things up, things get better: all of a sudden new capabilities emerge, or the accuracy on some problem reaches a threshold where before it was kind of unusable and now all of a sudden it becomes usable, and that enables new kinds of things. Also, the kinds of computations we want to run because of these new machine-learning-based paradigms are pretty different from the traditional hand-written, twisty C++ code that a lot of basic CPUs were designed to run effectively. So we want different kinds of hardware in order to run these computations more efficiently, and we can in some sense focus on a narrower set of things we want computers to do, do them extremely well and extremely efficiently, and then make that increasing scale even more possible.

OK, so I mentioned some of these things, but there's been a decade of amazing progress in what computers can do. Going from the raw pixels of an image to a categorical label from one of a thousand or ten thousand different categories: computers didn't used to be able to do that a decade ago, and now they can. Going from audio waveforms to what was being said in that utterance of five seconds of audio: that's speech recognition, and we've made tremendous progress on that. Translation, "hello, how are you" to "bonjour...": being able to translate from one human language to another is an incredibly useful capability for computers to help us with. And we've even been able to go
from something like a photo of a cheetah on top of a Jeep to a description of that scene: not just a categorical label like "leopard," but a short sentence that describes what is going on in the scene. That's pretty amazing; we've made tremendous progress on this. What's more amazing, though, is that we've been able to reverse a lot of these arrows in the last few years. Going from a categorical label like "leopard" to the computer generating 50 or 100 different images of leopards; or going from "how cold is it outside?" to audio waveforms, that's just text-to-speech, which has been around for a while but has improved a lot. The translation reversal is not that surprising, but it's getting better and better. And we can even go from a short description of the image you want to an image, or sometimes now even a short video clip, or an audio clip when you describe a sound in language. These are capabilities that are now starting to emerge, and I think they're pretty exciting in terms of what we can build with computers now as opposed to a decade ago.

So let's look at the level of improvement we've had in the last decade. Stanford developed a benchmark called ImageNet, which many of you have heard of. You get training data of this form, a bunch of color images and labels, one of a thousand labels, and you can train your system on about a million images like that. Then you're given a bunch of images you've never seen before and you have to predict the actual label for those new images. A lot of machine learning work is about how you generalize from observations you've made on data to new settings, new images you've never seen before. In 2011, the first year that contest was run, the winning entry got 50.9% accuracy. The next time the contest was run, a very famous landmark paper, affectionately known as AlexNet, by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, made a giant leap forward in accuracy, a 13% or so improvement, which was just remarkable, and they were the only one of about 28 entrants that year that used a neural network. This was such a major improvement that the next year nearly all the entrants used a neural network, simply because it was such a revolutionary improvement, and clearly a really major approach: just learn from the actual raw data rather than trying to hand-engineer features that are indicative of a leopard. That's a really hard thing to do; what features would you hand-design to decide this is a leopard as opposed to a giraffe or a car? Learning from data actually makes that possible. So I think that's a pretty big improvement, but it's also easy to ignore the improvement that's happened since then: we've gone from 63% to now 91% accuracy on this task, and that is actually pretty amazing. We know human accuracy on this task is actually a bit below that level, because it's actually pretty hard: there are a thousand categories, 40 different breeds of dogs, and people don't actually know, if they're staring at a photo, which breed of dog that is. So that's pretty amazing; this was about a ten-year span, and it's revolutionized computer vision.

If you look at speech recognition, this is a popular open-source
benchmark for measuring speech recognition accuracy. Here it's measuring word error rate, the percentage of words that are wrong, so you want lower numbers, obviously. We've gone from 13.25% down to about 2.5%, and this is a much shorter span, only about five years. Basically, we've gone from roughly one word in six or seven being wrong to about one word in forty being wrong, and that makes a huge difference in the usability of these systems: all of a sudden you can rely on it, you can start to dictate your emails, and it mostly gets things right. Pretty great.

I mentioned that scaling things up actually improves the quality of these models, and so we want hardware that enables us to scale up more efficiently. How can we, for the same amount of dollars of compute hardware or the same amount of energy, get an even higher-quality model because the hardware is more efficient? It's really transforming how we design computers: machine-learning-optimized hardware is much more efficient, there have been major improvements generation to generation, and this enables these larger-scale models with lower economic and energy costs.

There are two really nice properties of neural networks, the kind of machine learning models that everyone is using these days. The first is that reduced precision is OK: if you carry out the computations in the machine learning model to one or two decimal digits of precision instead of six, that's fine. A lot of the optimization algorithms for these models actually introduce explicit noise in order to make the model learn better, so you can think of reduced precision as, in some sense, just adding a bit of noise to the learning process, and it sometimes actually works better. The other property is that all the computations, all the algorithms you're hearing so much noise about, are in some sense really just different ways of assembling linear algebra operations: things like matrix multiplies and vector operations of various kinds. That's really what those algorithms are, repeated applications of lots of different linear algebra primitives. So if you can make a computer that's really good at reduced-precision linear algebra, that's what you want for learning these really high-quality models at reduced computational and energy cost.

We've been doing this for a while at Google. We saw a real need in our systems, and built the initial version of what's called a Tensor Processing Unit, or TPU, which is really an architecture designed for low-precision linear algebra. The first one we built was for inference: when you already have a trained machine learning model and you want to apply it in a product setting, you need to apply all this compute in order to recognize what's in an image, or to have someone speak into their microphone and recognize what they're saying. So we built the first generation, which was really just a single-card system with one of these accelerators on it, TPU v1, and that was about a 30 to 80x improvement over using a CPU at the time, in terms of both energy efficiency and computational performance.
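As a small illustration of the "reduced precision is okay" property described above, here is a minimal NumPy sketch; the matrix sizes are arbitrary, and float16 stands in for the bfloat16-style formats that real accelerators use:

```python
import numpy as np

# Do the same matrix multiply at low and high precision and compare.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))

exact = a @ b                                                    # float64 reference
approx = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float64)

# The relative error is small even though float16 carries only ~3 decimal digits,
# which is roughly the level of "noise" the talk argues these models tolerate well.
rel_error = np.abs(exact - approx).max() / np.abs(exact).max()
print(f"max relative error at float16: {rel_error:.4f}")
```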
Later generations of the TPU then focused on larger-scale systems composed of multiple chips, designed for both training and inference. This is the TPU v2 board with four of these chips. The TPU v3 board is a close cousin of that one, but we added water cooling, so there's actually water going to the surface of the chips to help with cooling. For the TPU v4 board we added cool colors, which is nice. Those three later generations were designed to be assembled into larger systems that we call pods, and the pods increased in scale over the generations. They have very simple but high-bandwidth networks: in the first generation, each chip was connected to its four neighbors in a 2D mesh, so you have a 16-by-16 grid of chips in these racks, and every chip is connected to its neighbor with basically a wire. That means you don't have to do any routing in the network, so you can have very high-speed, very low-cost connections, because you're only trying to go six inches or so to the next chip. The next generation extended this to 1,024 chips in eight racks, and the generation after that used 64 racks of 64 chips each; it spans multiple data center rows and gives you 1.1 exaflops of lower-precision floating-point computation with 4,096 chips. The most recent generation, which we disclosed publicly at the end of last year, is the v5 series. It comes in two variants: one is oriented toward inference, with a pod of 256 chips, and the v5p has a lot more memory per chip, much more bandwidth between chips, and much more memory bandwidth. It delivers close to half a petaflop per chip of 16-bit floating-point performance, and double that for 8-bit integer performance, and one of these pods is also bigger: close to 9,000 chips, for roughly four exaflops of compute.

OK, so let's talk about language now. We've talked about image recognition and speech recognition advances, but language is actually one of the areas where I think people are seeing the most change in what computers can do. I've been excited about language models for a while, even before neural networks. I partnered with some people on our Google Translate team: they had a really capable system that produced very high-quality translations, but it was designed for a research contest where you only had to translate about 50 sentences in two weeks and then submit your entry. It would do a disk seek for each of the roughly 200,000 n-grams it needed to look up for every sentence it translated. So I said, well, if you have really high-quality translations, it would be good to actually bring these into real practice. We built a system that would serve an n-gram model: basically it kept statistics of how often every five-word sequence occurred in two trillion tokens, which gives you about 300 billion unique 5-grams, and we stored that in memory on a bunch of machines and looked things up in parallel for the 100,000 lookups you need to translate a sentence. And we came up with an innovative new algorithm called "stupid backoff" that kind of ignored the mathematically right thing to do and did something simpler: when you looked up a 5-gram and there was no data there, you would just look up the 4-gram that was its prefix and use that if it was there; if it wasn't there, you'd look up the 3-gram, and so on. That actually worked reasonably well compared to the fancier Kneser-Ney smoothing, which is what you really want to do but is actually kind of computationally hard.
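A minimal sketch of the stupid backoff scoring rule described above; the tiny corpus, the counts, and the backoff factor of 0.4 here are illustrative (the real system kept roughly 300 billion 5-grams sharded across many machines):

```python
from collections import Counter

ALPHA = 0.4  # backoff penalty applied each time we fall back to a shorter context

def stupid_backoff_score(ngram, counts, total_tokens):
    """Score the last word of `ngram` given its context; `ngram` is a tuple of words."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_tokens
    full = counts.get(ngram, 0)
    prefix = counts.get(ngram[:-1], 0)
    if full > 0 and prefix > 0:
        return full / prefix
    # No data for this n-gram: drop the earliest context word and try again.
    return ALPHA * stupid_backoff_score(ngram[1:], counts, total_tokens)

# Toy usage: count all 1- to 3-grams in a tiny corpus.
corpus = "the cat sat on the mat".split()
counts = Counter()
for n in range(1, 4):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1

print(stupid_backoff_score(("the", "cat"), counts, len(corpus)))  # seen bigram -> 0.5
print(stupid_backoff_score(("mat", "cat"), counts, len(corpus)))  # unseen -> backs off
```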
One lesson from this is that simple techniques over large amounts of data are very effective. That has been a lesson throughout my career: you can do very simple things, and the data speaks when you do that.

Then my colleague Tomas Mikolov was interested in distributed representations. Instead of representing a word as a discrete thing, you represent it as a high-dimensional vector, so different words get different, say, 100-dimensional vectors, and through a training process you try to move words that appear in similar contexts nearer to each other and push apart words that appear in different contexts. If you train over a very large amount of data with a relatively simple training objective that says, if these things appear in similar contexts push them closer, and if they're different push them apart, and you do that over trillions of tokens, then you end up with really nice properties. In that 100-dimensional space (a 100-dimensional space is a hard thing to wrap your head around), things that are very similar end up near each other: "mountain" and "hill" and "cliff" will all tend to be near each other. Points in the space are interesting, but perhaps more interestingly, directions are also meaningful, because there are a lot of different directions you can go in 100 dimensions. It turns out, for example, that if you look at where "king" is in this space and you want to get to "queen," you go in a certain direction, which you can compute by subtracting the vector for "king" from the vector for "queen," and that direction is roughly the same as the direction you would go to get from "man" to "woman." So directions are meaningful, and different directions mean different things: going from the present tense of a verb to the past tense is a particular direction, regardless of what the verb is. That says there's a lot of power in these distributed representations; they're encoding a lot of different kinds of information in the 100-dimensional vector that represents the word.
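A toy illustration of the "directions are meaningful" property described above; these 4-dimensional vectors are invented for illustration only, whereas real embeddings are learned from huge corpora and live in hundreds of dimensions:

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.2, 0.1, 0.3]),
    "man":   np.array([0.1, 0.8, 0.7, 0.2]),
    "woman": np.array([0.1, 0.2, 0.7, 0.2]),
    "hill":  np.array([0.2, 0.5, 0.1, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding has the highest cosine similarity."""
    best, best_sim = None, -1.0
    for word, v in emb.items():
        if word in exclude:
            continue
        sim = vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# king - man + woman should land near queen if the "gender" direction is consistent.
analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, exclude=("king", "man", "woman")))  # -> "queen"
```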
Then my colleagues Ilya Sutskever, Oriol Vinyals, and Quoc Le developed a model called sequence-to-sequence learning. Basically, this used a neural network where you have an input sequence; let's take the case of translation. You put in an English sentence one word at a time, and the system builds a representation from its own current state plus the new word it's seeing, and updates that state. In the same way that you have distributed representations for an individual word, you now have a distributed representation for the sentence you've seen so far, and you can update it with a recurrent neural network called a long short-term memory, an LSTM. When you hit an end-of-sentence marker, you train the model to spit out the correct translation of that sentence. You have a bunch of training data, like an English sentence and a French sentence that mean the same thing, and you train the model so that when it sees the English sentence it should spit out the French sentence, and you just repeat that process over large amounts of paired training data. Sure enough, you can use a neural encoder over the input sequence to initialize the state, which gets you to the point of "I've now absorbed the input sentence and I want to decode, a word at a time, the correct translated sentence," and you use that to initialize the state of the neural decoder. You scale this up and it works: you get major improvements in translation accuracy.

Oriol and Quoc then published a workshop paper showing that instead of translation, you could use the context of a multi-turn conversation. Basically, the sequence of interactions you've had, where one person or party says something, a computer model responds, and the person utters another response, over multiple turns, is your context: the previous multi-turn interactions. You can then train the model to generate a good reply in the context of the multiple turns that happened before. It's the same model, a sequence-to-sequence model, but now the sequence is initialized with the context of all the conversational turns that have happened, and it's possible then to have effective multi-turn interactions using a neural language model, which is pretty neat.

Then a collection of other Google researchers, plus an intern, came up with a model called the Transformer. Remember I said the earlier model is a recurrent model: you have some state, you take the next token, you do some processing to update the state to have absorbed that token, and then you go on with the new state to absorb another token and update the state again. That's a very sequential process, because in order to absorb the third word you need to have done the processing for the second word, and in order to do that you need to have done the processing for the first word. That's not so great; in computers we like to do things in parallel, not in sequence, if we can get away with it. What this model did was say: we're going to process all the words in the input in parallel, and then attend to different pieces of it, rather than trying to maintain a single state that we update sequentially going through the words. In other words, don't try to force everything into a single distributed representation; save the representations of all the tokens or words you've seen, and then attend to them: pay attention to the parts that make sense to focus on when you're translating this part of the sentence or that part of the sentence. And you get higher accuracy with 10 to 100x less compute. Remember I said all that stuff about computer hardware improving and specialized hardware giving us large, significant improvements over time; we're also seeing algorithmic improvements like this multiplying together with those hardware improvements. So through algorithmic advances plus machine learning hardware, we can now train much larger-scale models, and much more capable models because of that.

Then a group of people decided to scale up and train on conversational-style data using a Transformer model instead of a recurrent model, and that gave quite good results, and in particular a way of evaluating these systems so that they're both sensible in what they respond and also specific: you don't want your chatbot to be overly vague; you want it to actually say something sensible in response to what you said, because that makes it more engaging and useful.
OK, so I talked about some of these, but there's been a progression of neural language models and also a progression of neural chatbots: a neural conversational model, Meena, ChatGPT from OpenAI, and Bard, which we released about a year ago at Google. And there's been a progression of neural language models: the sequence-to-sequence work I talked about; GPT-2 from OpenAI (some of these have parameter counts, which you can think of as a rough sense of the scale of the model, so 1.5 billion parameters in 2019); the T5 work from some colleagues of mine at Google, 11 billion parameters, very capable. And the Transformer work I should mention underlies a lot of these: the "T" here and the "T" there is all for Transformer, so people really saw the advance in the Transformer model and architecture, that 10 to 100x improvement in computation, and moved to using it as the basis of these large language models. Then GPT-3; Gopher from some DeepMind colleagues; PaLM from Google Research; Chinchilla from DeepMind; PaLM 2 from Google Research; GPT-4 from OpenAI; and then Gemini, which is the project I co-lead with my colleague Oriol Vinyals. We have a large collection of people in lots of different research offices working on building capable multimodal models.

One of the things we wanted to do was move from a language-based model that understands only text to one that can deal with all the different modalities simultaneously: you can feed it text plus an image, or audio plus some text, and ask it to do something, and it will fluently and coherently deal with whatever kinds of modalities you want to give it. Our goal when we started this project about a year ago was to train the world's best multimodal models and use them all across Google. There's a blog post about Gemini, there's a website you can go to, and there's a tech report by the Gemini team, of which I'm a proud member.

Gemini was really multimodal from the beginning. As I mentioned, we didn't want to just deal with text; we wanted to deal with images and video and audio, and we turn those into a sequence of tokens that we then train a Transformer-based model on. We have a couple of different decoding paths: one we train to generate textual tokens, and for the other we initialize a decoder with the state that the Transformer has learned and then go from that state to a full set of pixels for an image. And we support interleaving these sequences: it's not that you give it one text input and one image input; you can interleave them. For video you might put in a video frame and some text describing it, then another video frame and more text, or the closed captions of the audio being said, and then have the Transformer use the fact that it's been exposed to all these modalities during training to build common representations across all the different modalities you want to give it.

We have a few different sizes, so the v1 generation of Gemini comes in
three different sizes. Ultra is the largest-scale and most capable model we have. Pro is a good size for running in a data center, and we use it in a lot of different product contexts; our Bard product, which has now been renamed Gemini, confusingly, runs on the Pro model, or on the Ultra model we just announced last week. And then there's the Nano model: you actually want a lot of these machine learning models to be able to run on device, on a small phone or a laptop, and the Nano model is very efficient for doing that and fits quite reasonably; you can quantize these things to make them even smaller, and so on.

One thing about our training infrastructure is that we wanted a very scalable fabric, where you specify a very high-level description of the computation you want and then a system maps that computation onto the available hardware. I mentioned we have these pods, so for example you might describe your computation as "I have these two parts that I care about, I don't care where you put them," and let the underlying Pathways software system that we've built decide where to put them. It might decide to put one part on one pod and the other part on another pod, and it knows where the chips are located and what the topology and bandwidth between them is. When this chip needs to communicate with that one, it'll use the very high-speed network I mentioned, and when one part of the model needs to communicate with a part over here, it will go up to the data center network, which has much less bandwidth, to send data from one place to another. But that happens seamlessly, and the machine learning researcher or developer doesn't have to worry about it, other than understanding that there are different performance characteristics.

One of the things about training large-scale models is that as you scale up, failures will happen: a machine will die, one of the TPU chips will overheat and start to malfunction in some way. So minimizing failures is really important. Some of the failures you want to minimize are almost human self-inflicted. For example, we had a sweeping way of upgrading kernels on our machines, which is a perfectly fine approach if those machines are running independent computations, but if they're all part of the same thousand-machine computation, you would actually prefer to take the machines down, upgrade all thousand kernels simultaneously, and bring them back up, rather than having rolling failures throughout. So we optimized some of our repair and upgrade processes. Once you've done that, you then want to minimize the time to recover, because the faster you can recover, the sooner you can be making useful forward progress. We have a metric we call "goodput," which is the percentage of time that model training is actually making useful forward progress, as opposed to recovering from a checkpoint or waiting for some other part of the system to be started. One of the things we use is rapid recovery of the model state from copies held in the memory of other machines, rather than going to a distributed file system to recover from a checkpoint, and that makes the recovery time a matter of five to ten seconds rather than several minutes.
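A small sketch of the goodput metric described above, the fraction of wall-clock time during which training is making useful forward progress; the event-log format here is invented purely for illustration:

```python
def goodput(events):
    """events: list of (phase, seconds) tuples, e.g. ('train', 3600)."""
    total = sum(seconds for _, seconds in events)
    useful = sum(seconds for phase, seconds in events if phase == "train")
    return useful / total if total else 0.0

# Recovering from a peer's in-memory copy (seconds) instead of a checkpoint
# on a distributed file system (minutes) is what keeps this ratio high.
log = [("train", 3600), ("recover_from_peer_memory", 8), ("train", 7200), ("wait", 120)]
print(f"goodput: {goodput(log):.3f}")  # ~0.988
```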
In terms of training data, we want this model to be multimodal, so we want to train it on a large collection of web documents, various kinds of books, various kinds of code in lots of different programming languages, plus image, audio, and video data. We have heuristics for filtering those data sets; some are hand-written heuristics, and some are model-based classifiers of whether we think something is a high-quality document in various ways. The final mixtures of the training data are determined through ablations on smaller models: we run smaller-scale models with different mixes (should we use 32% code or 27% code?) and then evaluate performance on a wide range of metrics to better understand the trade-offs. We've also done things like increasing the weight of domain-relevant data towards the end of training; for example, we enrich the mixture with more multilingual data towards the end of training in order to improve the multilingual capabilities. I do think data quality is a really interesting and important research area. We've seen that having really high-quality data makes a huge difference in the performance of the model on tasks you care about, so in some sense it's as important, or in some cases even more important, than the actual model architecture you're using. I think it's a pretty important area for future research: having the ability to learn curricula automatically seems important, along with identifying high-quality and low-quality examples, and so on.

There have also been a bunch of advances not only in training these models but in how you elicit the best qualities from them: how you actually ask questions in a way that causes the model to answer more effectively. For example, asking models to show their work improves both the accuracy of the model and its interpretability. Some of my colleagues came up with a technique called chain-of-thought prompting. If you remember back to third-grade math class, your teacher would always encourage you to show your work. The reason is both to see your thought process in getting to the answer and also to encourage you to think about the next steps and how to break a complicated problem down into a smaller set of steps. Usually you give the model an example of the kind of question you want answered, along with the actual answer for that question, and then you ask it a new question. So here's an example question, and the way the model was taught to respond is just to figure the answer out and give it; then there's a different question, and the model outputs "the answer is 50," which is wrong. But if you instead demonstrate to the model how to show its work, and you say, "OK, Shawn started with five toys; if he got two toys each, that is four more toys; five plus four is nine, so the answer is nine," that's work your third-grade math teacher would be really proud of. More importantly, if you do that, the model produces these more incremental steps to get to the answer, and it gets it right, because it now has longer to work, basically more time to think through the steps of getting to the right answer.
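Here is a sketch of the difference between standard and chain-of-thought prompting in the style described above, using the toy counting problem from the talk; the exact wording of the worked example is illustrative:

```python
standard_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: The answer is 11.\n"
    "Q: Shawn has five toys. For Christmas, he got two toys each from his "
    "mom and dad. How many toys does he have now?\n"
    "A:"
)

chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: Shawn has five toys. For Christmas, he got two toys each from his "
    "mom and dad. How many toys does he have now?\n"
    "A:"
)

# With the first prompt the model tends to blurt out a number (e.g. "50", which is
# wrong); with the worked example it is nudged to write out the steps: 5 toys, plus
# 2 from mom and 2 from dad is 4 more, and 5 + 4 = 9.
```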
And it's a pretty dramatic effect. These two lines are the same underlying models at different scales, on two different math-oriented benchmarks: the one on the right is eighth-grade math problems, and the other is a bunch of arithmetic problems. What you see is that the quality of responses is pretty bad when you just use standard prompting, but at some point the model scale becomes large enough that, when you ask with chain-of-thought prompting, the accuracy shoots up quite a lot. That says there's a really interesting science to how you ask these models questions in a way that both makes them more interpretable and also more likely to give you the right answer.

OK, so let's talk about multimodal reasoning in a Gemini model, and I think an example is a good way to understand what this model can do. Here's the prompt: "Here's a solution to a physics problem by a student," and then there's a picture of the problem and the student's written-out answer in handwriting. The rest of the prompt says: "Try to reason about the question step by step" (again, that chain-of-thought prompting style). "Did the student get the correct answer? If it's wrong, please explain what's wrong and solve the problem. Make sure to use LaTeX for math and round off the final answer to two decimal places." So that's the input: this kind of hokey image of handwriting, a skier going down a slope, conservation of energy, and so on. And this is the output of the model: the student didn't get the correct answer; the student made a mistake in the calculation of the potential energy at the start of the slope; the potential energy at the start is given by mgh, and the student used the length of the slope (the hypotenuse, I guess) instead of the height in the calculation; the correct solution is as follows, and then it works the problem out, substituting the values, to two decimal places. This is actually in LaTeX, but we've rendered it for your reading convenience.

Think about what this means: all of a sudden we can give models this kind of multimodal input, a complex picture of a whiteboard and a problem, and ask them to do something, and they can do it. They won't always get it right, but they can, and this can be an amazing educational tool. Think about a student trying to work things out on their own, taking pictures of their solution, and the system helping them figure out what they did wrong. We know that individualized tutoring produces outcomes that are two standard deviations better with a one-on-one human tutor than in a much broader-scale classroom setting. So could we get close to that in terms of individualized tutoring? I think that possibility is within our collective grasp.

OK, so evaluation. That was a qualitative example of Gemini's capabilities, but it's also good to look at how it compares on a bunch of different characteristics. Evaluation really helps us identify the model's strengths and weaknesses, and helps us understand whether training is going well, so we're constantly evaluating these metrics as we're training the model.
It helps us make decisions about what to change: is our math performance lower than we would hope, so maybe we should enrich the training mixture with more math-oriented data, but what will that do to multilingual performance? There are a lot of complicated trade-offs, some of which you make at the beginning of training and some of which you monitor online, trying to make principled decisions. It also helps you compare the capabilities to other models and systems.

The highest-level summary is that we looked at 32 academic benchmarks, and the Gemini Ultra model exceeded the state-of-the-art performance on 30 of the 32. If we delve into some of these in depth, there are a bunch of text-oriented or general-reasoning benchmarks, and if you compare Gemini Ultra with GPT-4, which was generally the prior state of the art on most of these problems, the entries in blue mark the state of the art, and we set the state of the art on seven of the eight. The 90% on MMLU is interesting, because that is a very broad set of questions in 57 different subjects: chemistry, math, international law, philosophy. The group that put the benchmark together measured human expert-level performance at 89.6%, I think, or maybe 89.8%, so this actually exceeds human expert-level performance across these 57 categories, which is quite nice; we're happy with that. Then there are a bunch of coding-related benchmarks and math-oriented ones as well.

If you look at image understanding benchmarks, these are now getting into the multimodal aspect: we got state-of-the-art results on eight of eight benchmarks. One of the nice things was that one benchmark came out a week before we published our paper and we'd never seen it before, so our eval team quickly added it to our eval set and discovered that we exceeded the state-of-the-art results by a reasonable margin. It's always nice to do well on a benchmark you've never seen before, because you're always worried about leakage of test data into the training set and so on. If you look at video understanding, again the multimodal capabilities of this model really shine: state of the art on six of six benchmarks, including the important English cooking video captioning benchmark, video question answering, and so on. And if you look at audio, the word error rates on four different public speech recognition benchmarks, plus a speech translation benchmark, along with the multilingual capabilities, are quite good; we're state of the art on four of the five. So first, I hope you appreciate our eval team, because this is a tremendous amount of work, to evaluate these models and really understand their capabilities at this level of detail, and that's pretty awesome. It does give us a firm idea that the Gemini model is pretty capable, and we also have measurements of Pro and Nano in the paper.

OK, so these large Transformer models can actually generate surprisingly coherent conversations, which is kind of the evolution of that neural conversational model and the Transformer-based versions of it. So if you look
at bard.google.com (I guess I have to update my slides; this will be gemini.google.com now): I was actually preparing a talk a few months ago, before we were using Gemini models in Bard, and I said, "Reverse the letters of 'hot chips' and 'tensor processing' for me," just to show what these models could do. It said, "Sure, the reversed strings are..." and there they are. Great. But then it went on to say, "I can also do this for you in Python; here's the code": define a reverse-string function, here's the string, print the reversed string of this and the reversed string of that. "Use code with caution," which I always recommend. Then it also goes on to explain the code: it first defines a function called reverse_string that takes a string as input; the reverse_string function works by looping through the string; the code then prints the reversed results. And it's always going to be helpful: "Is there anything else I can help you with?" So this is pretty amazing, right? Someone asked a question and it did what it was asked, but then it also said, by the way, here's this thing called programming, here's some Python code, and here's how you would write code to do this. I think that's pretty cool, and again a real educational opportunity.

"Is there anything else I can help you with?" Sure, tell me more about TPUs. The model has a fair amount of world knowledge. It knows that TPUs are, basically, what I've already told you: specialized hardware processors developed by Google to accelerate machine learning; they can help improve efficiency and performance; here are some of the benefits, faster training and inference; "I hope this helps." One of the fun things about these chatbots is that they can have different kinds of personalities; Bard is sort of your helpful friend and will help you answer lots of questions.

We actually put Gemini Pro into Bard, now Gemini, last month, and there's a public site called LMSYS that evaluates different chat agents, because there are now a lot of different chatbots in the world. The way they do that is they get users to write their own prompt, they pick two random chatbots that they have configured in their system, they send the prompt to both, and then they show the outputs anonymized, so you just say which is better, the left or the right. From that you can compute what's called an Elo score. Elo was, I believe, a Hungarian mathematician who was trying to develop ways to rank chess players: in a tournament setting, you get more Elo points when you defeat an opponent, and so you can compute an Elo score. Roughly, an Elo score that is higher by 100 means the stronger competitor will win 64% of the time, and 400 points is meant to be a 10-to-1 advantage. The Pro-level model actually achieves the second-highest Elo score on this leaderboard; there are probably 30 different models, and I'm showing the top of the leaderboard, which is pretty nice, because this is a free service and those other ones are not. So I tweeted about this, and then a person I'd never come across said it's actually really pretty good, check out this example. We have a way for people to share interactions with Bard that they liked and want to make public, and he decided to do that.
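As a quick check of the Elo numbers quoted above, here is the standard Elo expected-score formula, E = 1 / (1 + 10^(-d/400)) for a rating difference d, in a few lines of Python:

```python
def elo_win_probability(rating_diff):
    """Expected score of the higher-rated player given the rating difference."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

print(f"{elo_win_probability(100):.2f}")  # ~0.64: a 100-point gap wins about 64% of the time
print(f"{elo_win_probability(400):.2f}")  # ~0.91: a 400-point gap is roughly 10-to-1 odds
```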
If you click on that, he was asking: "Please estimate the number of companies per million residents in each of the following countries: the US, England, South Korea, Taiwan, and Singapore. Present your numbers in table form." And Gemini said, "Absolutely, here's a table showing the numbers," and there it is, the number of companies per million, just as he asked. It also interprets the results for you: as you can see, England has the highest number, followed by the US, South Korea, Taiwan, and then Singapore; it's important to note these are just estimates, and the actual numbers may vary depending on the source and the definition; I hope this helps; let me know if you have any other questions. He went on to ask where it found the numbers for the companies. "I apologize for not providing the sources; it's important to be transparent about where information comes from. The company numbers I used came from a combination of different sources: for the US I used the Dun & Bradstreet database, for England I used the Office for National Statistics," and so on, "and please note these sources have somewhat different definitions." So it actually understands the subtleties: there are different definitions of what constitutes a company, and it knew where to go to get all these different sources of data. It didn't have that memorized, but it was able to make use of it. Pretty neat.

Another trend I think is important is that further refinement of these general models can make amazing domain-specific models. Some of my colleagues took our earlier work on the PaLM model, and then the PaLM 2 model, which were general-purpose models trained on general text, and decided to enrich them and further train them on medical data: medical questions and medical articles. What they found was that the Med-PaLM model, the first one, actually exceeded the passing mark for the medical boards, and when, six months later, they trained the PaLM 2 model to make Med-PaLM 2, they got expert-level performance on the medical boards for this particular task. Now, this is not a fully general-purpose setting; it's a bunch of medical questions, but it does show the capability of taking a really capable general model and then training it in a domain-specific way for specific problems.

OK, I'm going to go quickly through generative models producing images and video; you've probably seen this as a trend in the world. We have a couple of different research projects, Parti and Imagen, and one of the cool things, as I mentioned, is that you can give prompts that describe what you want in visual imagery and have models generate images that are constrained by the encoded representation of that sentence; conditioned on that, the model generates the pixels for an image. "A steam train passes through a grand library, oil painting in the style of Rembrandt": there you go. "A giant cobra snake made from X," where X might be corn, pancakes, sushi, or salad; which is your favorite? I kind of like the ferocious lettuce-looking snake, but the corn one is pretty nice too. "A photo of a living room with a white couch and a fireplace; an abstract painting is on the wall, and bright light comes through the windows." So if you happen to need a picture like that for a presentation or something, like I did, you
can do that. The prompts can be pretty detailed descriptions: "a high-contrast photo of a panda riding a horse; the panda is wearing a wizard hat and is reading a book; the horse is standing on a street against a gray concrete wall; colorful flowers and the word 'peace'... DSLR photograph, daytime lighting." There are many plausible interpretations of that, but at least you get one example of what you asked for. This is now integrated into Bard, so the K-through-12 government school agency in Illinois was really excited about being able to create images of their mascot, Hyperlink the Hedgehog; there's Hyperlink surfing, riding this AI wave. And this person was very excited about the prompt "a human buying coffee at Costa Coffee in London" (Costa Coffee is a very popular coffee chain). One of the things these models have often struggled with is the fidelity of text, actually putting in the text you asked for and making it look like a real font, and here you see it does a pretty good job.

I won't talk through all the details, but essentially you put in a prompt, which gives you a distributed, vector-based representation of that sentence, and conditioned on that the model is trained to first generate a small-scale image. Then another model, designed to increase the resolution of an image, is conditioned on both that lower-resolution image and the text embedding, and we apply that one more time, conditioned on the larger image and the text embedding, to produce the full-scale 1024x1024 image. And you can really see the effects of scale. We trained four different models, from 350 million to 20 billion parameters, and gave them the same prompt: "a portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses, standing on the grass in front of the Sydney Opera House, holding a sign on its chest that says Welcome Friends." What you see is that at the smallest scale it kind of got the kangaroo aspect but not much else; there is a sign, but it struggled with the text, let's say. As you scale up a bit more, the kangaroo gets a little better; it now knows a bit more that the Sydney Opera House looks something like that, but it's kind of chunky and doesn't have a lot of detail; it's closer to "Welcome Friends," but it might say "veg me," I'm not sure. As you scale up further, you now get a pretty nice image of the Sydney Opera House and your kangaroo in the orange hoodie with the right text. So scale is an important aspect of this, and this is why you're seeing all these advances in the last decade: essentially, scale and better training methods and algorithms really contribute to higher-quality results. This graph effectively says the same thing, but I think the kangaroo says it better.

I think it's also important to realize that there's a lot of machine learning invisibly helping people in various ways, particularly on phones. A lot of camera features in modern smartphones have gotten significantly better over the years through combinations of computational photography methods and machine learning methods. Portrait mode, where you make the background all blurry so you look all fancy in the foreground, is a nice technique for some of
these portrait-style photos. Night Sight, where you're trying to take an image in very low-light conditions: you can take lots of readings from the sensor and integrate them in software to get much better lighting than the actual conditions under which you took the shot, and that also helps you take better astrophotography. Portrait blur and color pop are nice features sometimes when you want them. Magic Eraser: if the system actually understands images, you can point at one of the telephone poles and say "make these go away," and it can do that; maybe your waterfall photo had other tourists in front of it and you didn't want them there; you can erase them, and there they go.

There are a lot of features on the phone, and many of them are about how you transform one modality into another. Sometimes you want to screen a call: maybe you don't want to actually answer your phone, but have a computer-generated voice answer it for you, ask what the person is calling about, and then give you a transcript of what they said, so you can decide whether to accept the call or not. "Hold for Me" can listen on the phone for you, so you don't have to stay on hold with Bank of America when you're calling their customer support. Live Caption can take any video playing on your phone, listen to the audio, and give you captions of what's being said; maybe you're trying to watch a video in a lecture hall like this and you don't want the audio to disturb people. So there are a lot of cool features here, and many of them are running on people's phones without people necessarily realizing it or thinking about what technology is underneath. And this has amazing benefits for people in limited-literacy settings: you can point your camera at something and it can read to you what you're pointing at, or if you don't speak that language and you're trying to understand it, it can read it and translate it for you.

I'm going to go quickly here and maybe skip some of this section. I'll start here: I think materials science is a pretty interesting area where machine learning is starting to influence lots and lots of aspects of science, both through automated exploration of interesting parts of a scientific hypothesis space and through the creation of very rapid simulators that are learned, rather than traditional large-scale HPC-style computation. In some areas you've been able to learn a simulator that is the functional equivalent of a hand-coded simulator but is 100,000 times faster, and that means that all of a sudden you can search a space of 10 million possible chemicals or materials and identify ones that are interesting and promising and have certain properties, where you would normally have to apply a lot more compute. Some of my DeepMind colleagues were looking at interesting ways of searching the space of possible materials for those with interesting properties. They have a structural pipeline that represents a potential material as a graph neural network, and a compositional pipeline that can mutate
known structures into ones that are interesting and adjacent, and they use an existing database of materials to output energy models and a bunch of stable, interesting candidate compounds. This automated discovery of 2.2 million new crystal structures yields a lot of interesting candidates for actual synthesis in the lab, to see what properties they really have.

I also think there's huge potential for using machine learning in all aspects of healthcare. We've been doing a fair amount of work in the space of medical imaging and diagnostics for quite a while, and those problems range from ones where you have 2D images to ones where you have 3D volumes from MRIs or other kinds of 3D CT scans, from ones where you have a single view to ones where you have multiple views, and up to very large, high-resolution images for pathology, for example. There's been a fair body of work; I'm going to talk briefly about two examples.

One of the areas we've been working on the longest in this space is diabetic retinopathy. Diabetic retinopathy is a degenerative eye disease: if you catch it in time it's very treatable, but if you don't, you can suffer full or partial vision loss. People who are at risk, which is anyone with diabetes or pre-diabetes, really should be screened every year, but in a lot of parts of the world there just aren't enough ophthalmologists, people trained to interpret these retinal images, to do the screening. This is somewhere machine learning can help a lot, because you can train a model based on trained ophthalmologists annotating images: yes, that's a one, that's a three, that's a two, that's a five. If you train on labels from board-certified ophthalmologists, you get a model that is as effective as board-certified ophthalmologists. If you then get that same training data annotated by retinal specialists, who have a lot more expertise and experience with these cases, you can train a model that is on par with retinal specialists, which is kind of the gold standard of care in this space. There are very few retinal specialists in the world, but all of a sudden you can make the screening quality that of a retinal specialist using a GPU on a laptop. We've partnered with organizations in India, a network of Indian eye hospitals, and the government of Thailand, as well as groups in France and Germany, and we're doing lots and lots of screening every year.

And then DermAssist: dermatology is interesting because you don't need specialized equipment to gather data that is useful for assessing whether you have a dermatological condition or not. We've got a system deployed now where you can take a photo of something, as you see in the video, and it will give you a sense of what this might be and what other similar-looking images are in dermatological databases, and it can help give you a sense of whether this is something very serious or something fairly benign.

OK, and then finally, I think a deeper and broader understanding of machine learning methods, as we deploy them in more places in the world, is really, really important. As we've gone from doing basic research in machine
learning to using it in a lot of places in all of our products, we started to think about a set of principles for reasoning about the implications of using machine learning: what considerations should we have for the various ways in which we might apply it? In 2018 we published a set of principles we came up with; these were really designed to help educate our own internal teams about machine learning and the things you should be thinking about as you apply it to problems you care about. For example: avoid creating or reinforcing unfair bias. Often these models are trained on data from the real world, and that's the world as it is, not the world as we'd like it to be, so it's really important when deploying machine learning models that you don't train on data that is biased in unfair ways and then accelerate that, because now you can automate and make those decisions more rapidly. There are a bunch of techniques you can apply on an algorithmic basis to remove some kinds of bias, and what we strive to do is apply the best currently known techniques and also do research on advancing the state of the art in these areas. Another example is being accountable to people; we think making models interpretable is an important aspect of that. Being sensitive to privacy, when that makes sense in the setting you're deploying in. And being socially beneficial. I'll point out that a lot of these are active areas of research; we've published about 200 different papers in the last five or six years related to fairness, bias, privacy, or safety, and you can see those there.

OK, in conclusion: I think these are pretty exciting times for computing. I think there's a change underway from hand-coded software systems to ones that are learned and that can interact with the world, and with people, in various interesting ways. The modalities that computers can ingest, understand, and produce are growing, and I think that's going to make using computers much more seamless and natural. A lot of the time we restrict ourselves to typing on a keyboard or something like that, but we now have the ability to talk to a computing system in a very natural way; it will understand what we say, and it will be able to produce a natural-sounding voice in response, or a nice image if that's what we asked for. So I think that's pretty exciting. There's tremendous opportunity, for sure, but there's also a lot of responsibility in how we take this work forward, make sure it's socially beneficial, and really do good things in the world with it. And with that, thank you very much.

[Music]

I will put up one more plug for the Slido number. There it is.

Well, thank you very much for your talk. Please don't send more questions to Slido, though; it's a very nice idea, but we are overwhelmed at this point. So what we are going to do is this: there were some trends in the questions that appeared in Slido, so I'm going to ask you some of those questions, and then, for those of you who made it to this auditorium, we'll take one or two questions from the audience.
Okay, in conclusion: I think these are pretty exciting times for computing. There's a change underway from hand-coded software systems to ones that are learned and that can interact with the world, and with people, in interesting ways. The modalities that computers can now ingest, understand, and produce are growing, and I think that's going to make using computers much more seamless and natural. A lot of the time we restrict ourselves to typing on a keyboard, but we now have the ability to talk to a computing system in a very natural way; it will understand what we say and be able to produce a natural-sounding voice in response, or a nice image if that's what we asked for, and I think that's pretty exciting. So there's tremendous opportunity, for sure, but there's also a lot of responsibility in how we take this work forward, make sure it's socially beneficial, and really do good things in the world with it. And so with that, thank you very much. I will put up one more plug for the Slido number, there it is.

Well, thank you very much for your talk. Please don't send more questions to Slido, though; it's a very nice idea, but we are overwhelmed at this point. So what we're going to do is this: there were some trends in the questions that appeared in Slido, so I'm going to ask you some of those, and then for those of you who made it to this auditorium, we'll take one or two questions from the audience. So, one of the questions, let me start with one that you probably expect: more data, is it going to make your model better? Twice more data, are we going to see twice as good a performance?

Yeah, it's a good question, and it's not a simple answer. We've seen that more high-quality data absolutely makes the model perform better when you have the capacity to train on that larger amount of data, so it's important to think about the model's capacity; sometimes you need to increase the scale of the model as well when you have more training data. We've also seen more data actually hurt: if you add a lot of low-quality data, you can, for example, decrease the model's ability to do mathematics problems effectively, or things like that. So it's a nuanced thing, but in general more high-quality data and more capacity for the model will make the model better.

So a next question that emerged in Slido is: what is the future of LLMs, now that the vast majority of high-quality training data has been exhausted? How would you react to that?

I would disagree with that assertion a bit. I think we've not really begun to train on, say, video that much. We've done small amounts of video, but there's a huge amount of video data in the world, and I think actually understanding the world through visual and audio data will be different from training on a lot of language. You're going to want to do both, but I don't think we've really exhausted the training data in the world.

Yeah, I tend to agree with you, I think we still have a lot to go. Multimodal models, which you emphasized in your talk: do they achieve better performance on all domains than targeted models trained for each domain separately? Or you can paraphrase this question and answer a version of it.

I think in some cases they do. So the question is, as you add more modalities, does that improve the performance on the other modalities? You hope so, and generally we do see some of that. But if you have a narrow problem and you collect a very targeted dataset designed to tackle just that problem, that will often give you good performance on that problem. If you have a complicated problem, though, or it's hard to collect very specialized data, what you want is a model that has a huge amount of knowledge about lots of different things in the world, from language and images and audio, and the ability to apply that model to the problem you care about. If you have a little bit of data for that problem, you're going to want to start with that base model and then fine-tune it, or do in-context learning, or something like that, to make the performance quite good.

Maybe I could follow with another question, which is kind of related: today the cost of training large models prevents small startups from making an impact. What kind of projects should individuals with fewer resources work on? Would you like to comment on that?

Yeah, absolutely. There's a really large set of problems in the machine learning domain; I'm going to address it more from the perspective of what interesting research one can do in the broad area when you don't have access to large data centers of compute and so on, and I think there's a really wide-open set of things. I've mentioned the quality of data: automatic evaluation of data quality, or online curriculum learning, or optimization methods. A lot of these kinds of things can actually be demonstrated on one GPU, or a handful of GPUs under your desk, and still make pretty significant and innovative advances. The original Transformer work was done on 8 GPUs, I think, or the sequence-to-sequence model for sure was 8 GPUs. So I think there are advances to be had from clever ideas, good evaluation of them, and demonstration of them even at small scale.
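As a toy example of the sort of small-scale experiment being pointed at, here is a minimal curriculum-learning sketch; the difficulty proxy (sequence length) and the staging scheme are assumptions made for illustration, not a method described in the talk.

```python
# Hypothetical sketch: a simple curriculum that sorts examples by a cheap
# difficulty proxy and trains on progressively harder slices of the data.
def curriculum_batches(examples, difficulty, num_stages=3, batch_size=2):
    """Yield (stage, batch) pairs, widening the pool from easy to hard."""
    ranked = sorted(examples, key=difficulty)
    stage_size = max(1, len(ranked) // num_stages)
    for stage in range(1, num_stages + 1):
        pool = ranked[: stage * stage_size]          # easiest examples first
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i : i + batch_size]

examples = [
    "hi",
    "a cat",
    "short text",
    "the dog ran",
    "a long and rather complicated sentence",
    "an even longer sentence with many more words",
]
for stage, batch in curriculum_batches(examples, difficulty=len):
    print(stage, batch)
```

Replacing the length proxy with a learned difficulty score, or reordering the schedule online from the model's own losses, turns this into exactly the kind of research question that fits on a single GPU.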
Okay, another set of questions we got is: are LLMs everything? Are Transformers everything? What else is there? Should we be working on other kinds of models? Is the emphasis on LLMs stifling other work in machine learning?

Yeah, it is a worry, right? Are we crowding out other innovative ideas that are maybe not as fully developed, and so don't look as good as the things that have been much more fully explored, while we're now in a kind of gentle exploration of the space around what already works well, when maybe something over here would work really well? I think a lot of the time, showing even at a small scale that some other idea is a really interesting direction can be done with a modest amount of experimental evidence, and I think that's an important thing to pursue. I would also say I tend not to use the term LLM, because I think we're moving to a multimodal world, and I think multimodal is going to be more than just the human modalities of vision, audio, and language; there are other modalities that matter in the world, like time series of heart-rate sensor data for healthcare applications. There are probably 50 to 100 modalities of data you'd want to be able to deal with.

I see, I just saw the clock and we've really run over time, so I would like to end here by thanking Jeff for his talk. Thank you.