Hi everyone, thanks for tuning back in to Neural Notes. In this episode we're speaking to Krista Opsahl-Ong about MIPRO, which is almost an extension of the first interview we did with Omar Khattab about DSPy. Krista will chat with Simon about the work on MIPRO (am I pronouncing it right? She can correct us), what led to the work, and the ramifications of it. With that, I'll turn it over to Krista and Simon to kick us off.

Yeah, Krista, maybe just to get things started: can you tell us how you got into this wonderful world of prompt optimization?

Yeah, for sure. On the pronunciation side, we always say "MIPRO" and "DSPy," but everyone has their own way of pronouncing them, so I think it's all open to interpretation. It's funny, everyone has their own way, and I always find that fun. In terms of what got me into this work, I can start with some broader context on myself and then talk about what led me here. I'm currently a second-going-on-third-year PhD candidate at Stanford, and recently my work has focused a lot on exactly what we're talking about: prompt optimization, in particular for language model programs. These are pipelines, often composed of one or more calls to a language model along with other components like tools or a retrieval model, that solve some task. The way I got into this space was through my own experience working on a project prior to this one. In the first year of my PhD (and still today) I was really interested in AI applied to various clinical applications, and I was working with a clinician at the time, basically seeing if we could use these chained calls to language models to effectively extract relevant clinical insights from electronic health records. And, much to my honest disappointment, what
we found was that manually tweaking the prompts for our use case was what ended up achieving the best performance at the end of the day, as opposed to more systematic, elegant methods like LoRA or fine-tuning. Maybe this was just down to the small dataset we had, but that's what ended up working best. This got me thinking. I was naturally going through the back-and-forth of working with this clinician to craft these prompts, tweaking specific words and choosing specific few-shot examples, and it was very clear this was just not an optimal way to go about things. While we were achieving good performance from this process, it was really time-consuming and it just wasn't scalable. We were doing this for one condition that we were trying to extract insights for, but we needed to scale up to hundreds of conditions, and it wasn't going to be feasible to do that manually. That's what got me really interested in how we could go about this in a more systematic, scalable way. At the time, I was introduced to Michael Ryan, an amazing master's student at Stanford, who was interested in...

I like his Twitter feed.

Yeah, he's great. Michael's wonderful; anyone who doesn't follow him on X, definitely give him a follow. So we were chatting about this, and Michael had just gone to a talk at an NLP lunch by Omar, who was of course talking about DSPy and a lot of the work on prompt optimization that he was starting to look into. Michael was the one who made the intro to Omar, and we all started chatting together about things we could do in this space. That was how I got into DSPy and the space of prompt optimization, and the rest has been history. They've been wonderful collaborators, and MIPRO, I think, is
a product of one of those collaborations.

That's awesome. I think maybe that's a good place to start, before we get into prompt optimization: multi-layer, compound systems. I knew them as compound models when I was working on this. Could you give the audience a little background on DSPy, why it was necessary, and the kind of movement it sparked?

Yeah, for sure. I think DSPy tackles two main challenges. There's of course the challenge of prompt optimization and how to do that effectively, and then there's this trend we've been seeing toward multi-stage or multi-layer programs, the compound AI systems you're talking about, where we see a lot of performance improvements, and the ability to solve much more complex tasks, when we chain together different calls to an LM, or to a retrieval model or other tools, to work together to solve the task. What DSPy does is make it really easy to build these types of systems in a way where you're not worrying about all the different prompts composing your pipeline and tweaking them all to coerce them to work well together. You just focus on the logic of the broader pipeline you're building, and DSPy handles the rest: how to optimize all the parameters involved, whether that's the prompts in each call in your system or even the weights of the language model you're using for each call. DSPy can handle all of this so you can focus on architecting a pipeline that solves your task really well.

Awesome. So let's do a quick recap, in that case. Can you walk us through the core components of DSPy and how they enhance the efficiency of language models, just as a way to level-set, and then we can go into MIPRO after that?

Yeah, that sounds
great. I would say the best way to understand the core components of DSPy is really to check out an example. There's lots of documentation now, both by folks working directly on DSPy and more generally on the web, so that's a really concrete way to see the syntax and how it works. But at a high level, there are a few key abstractions that really make up DSPy. The first is the set of abstractions that let you build these language model pipelines, or programs, very easily. Basically, the way you build a language model program in DSPy is by using a module to express your language model call. You specify, in a declarative way, the inputs and outputs you would expect from this language model call, so it's really like a function call you've defined. You can then take this function call, or module, or LM call, whatever you want to call it, and compose it with other language model calls or other logic in your program to produce a final output. That's how DSPy lets you build these language model programs effectively. Then, once you've built your pipeline, you can optimize it using any of our off-the-shelf optimizers, of which MIPRO is one of the latest. This lets you just call optimizer.compile on your program, basically one line, and we return your program with optimized prompts.

You know, when we talk about optimization, there are so many levels to it. You can optimize the prompt, maybe the examples; the architecture of the model itself is a very interesting direction, I think. What are some of the cool discoveries you've made in your work on how best to optimize, where practitioners can get the most value?
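The declarative module-and-compile workflow just described can be sketched in plain Python. To be clear, this is a toy stand-in, not the real DSPy API: the `Module` class, the stub LM, and the retrieval placeholder below are all invented for illustration of the pattern (declare inputs and outputs per call, write only the pipeline logic, leave the prompts to an optimizer).

```python
def stub_lm(prompt: str) -> str:
    """Pretend LM: 'answers' by echoing the last whitespace-separated token."""
    return prompt.strip().split()[-1]

class Module:
    """A declarative LM call: named inputs -> one named output."""
    def __init__(self, inputs, output, lm=stub_lm):
        self.inputs, self.output, self.lm = inputs, output, lm
        self.instruction = ""   # parameter an optimizer could tune
        self.demos = []         # few-shot examples an optimizer could fill in

    def __call__(self, **kwargs):
        parts = [self.instruction] + [f"{k}: {kwargs[k]}" for k in self.inputs]
        return {self.output: self.lm("\n".join(parts))}

# Compose two declarative calls; the author writes only the pipeline logic.
generate_query = Module(inputs=["question"], output="query")
generate_answer = Module(inputs=["question", "context"], output="answer")

def pipeline(question):
    query = generate_query(question=question)["query"]
    context = f"retrieved({query})"   # stand-in for a retrieval model call
    return generate_answer(question=question, context=context)["answer"]
```

An optimizer's compile step would then mutate `instruction` and `demos` on each module while leaving `pipeline` itself untouched, which is the separation of concerns being described here.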
And the most efficiency for the cost? We'd love to hear your thoughts on that.

Yeah, it's a great question. I think there's still lots more to do in this space and lots of open questions, but in our most recent study we looked at benchmarking MIPRO against a number of other techniques we were investigating, to see what worked best for optimizing prompts for these different language model programs. What we broadly found is that MIPRO produced the best results overall, on five out of six of the tasks we were looking at. So that was exciting: something that came out ahead and worked well. What was also interesting is that we were curious to understand, like you mentioned, if we have a certain optimization budget, which components within a prompt are most worth optimizing. We've talked about instructions, a free-form string that specifies how to go about a task, and then there are the few-shot examples, which show or demonstrate how to perform the task. We wanted to understand which of these components would be most worth dedicating the optimization budget towards. When we evaluated this, what we found, maybe not surprising to some but interesting to us, is that optimizing these components jointly ended up yielding the best performance, at least with the optimization budget we were using. I think it's still worth doing a study at very low, medium, and high budgets to see what works best there, but joint optimization came out ahead. The interesting thing, though, is that few-shot examples were really key to achieving the best performance: even though optimizing the two components jointly was best, it was still only slightly better than just optimizing few-shot
examples for the majority of our tasks. Something interesting we found was that, in particular for few-shot examples, the optimization process itself was really key. There's a lot of variance in performance depending on the actual examples you choose for your prompt, so it's not enough to just say, "I'm going to use three-shot prompting, let me randomly sample three examples from my training set." Instead, bootstrapping these examples (I can talk more about how this process works if it's helpful) in a way that shows the intermediate steps you're trying to perform in your program, and then optimizing over these sets to find the best ones, yields quite a bit of improvement in end performance. So that was one thing that was interesting. The other insight we had: we were getting all these results showing that few-shot examples were just outperforming everything, and intuitively we felt there still had to be a place where instructions are important in prompting. What we ended up hypothesizing, and later finding, is that instructions matter in particular for cases where you have multiple conditional rules. For example, say you have some system where, given a complaint ticket, you have to route it to the right team at a company to fix it, and you have lots of conditions, like "if the ticket contains information about this, route it to team A; if it's about that, route it to team B." With all of these rules, it can be hard to express them effectively with just a limited set of few-shot examples. In these cases your instruction becomes really important, and optimizing it ends up yielding the best gains. So I thought that was an interesting insight that was nice to see actually proven by some of the tasks we benchmarked on.
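The bootstrapping idea mentioned here (keep an example's full trace of intermediate steps only when the program's final answer matches the gold label) can be sketched in a few lines. Everything below is a toy: the two "steps" are ordinary functions standing in for LM calls, and the arithmetic task is invented for illustration.

```python
def step_extract(text):
    """Stand-in for LM call 1: pull the arithmetic expression out."""
    return text.split("=")[0].strip()

def step_solve(expr):
    """Stand-in for LM call 2: compute the answer."""
    return str(eval(expr))

def run_with_trace(text):
    expr = step_extract(text)
    answer = step_solve(expr)
    # Record (step name, input, output) for every stage of the program.
    return answer, [("extract", text, expr), ("solve", expr, answer)]

def bootstrap_demos(trainset):
    """If the final answer matches gold, keep every step as a demonstration."""
    demos = []
    for text, gold in trainset:
        answer, trace = run_with_trace(text)
        if answer == gold:
            demos.append(trace)
    return demos

trainset = [("2 + 3 =", "5"), ("4 * 4 =", "16"), ("7 - 1 =", "99")]
demos = bootstrap_demos(trainset)  # the mislabeled third example is dropped
```

In a real optimizer, many candidate sets of these bootstrapped traces would then be sampled and evaluated (that is the "random search" part of the approach described above).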
I mean, do you think it's because, intuitively, if you have lots of forks, which is what conditional statements are, then it's hard to capture the full diversity of the behavior in a few-shot example, basically?

Exactly, and I think there are two pieces to that. Either it could just be impossible to include all of the cases in a limited-size prompt, or, even if you are able to include a diversity of examples that capture all the cases, it might still be confusing for the model to see all of these different examples that are all over the place and draw patterns from them. At least that's a hypothesis, but I think you're spot on with that intuition.

Speaking of which, this might be a dumb question, but with context windows getting bigger and bigger, can I just stuff in a bajillion few-shot examples and see massive improvement, or do I reach some kind of diminishing marginal return?

Yeah, it's a great question, and honestly something I have on my research bucket list to look into more. I think there was a Google paper recently, whose name I'm blanking on, that did look into this and essentially found that as you include more and more few-shot examples, you get better and better performance. But I would assume there are probably diminishing returns, and you could even reach a point (at least I feel like I've seen this, and again I haven't studied it systematically, but I've seen it in my own work) where if you include too much information in your prompt, your model almost gets confused and doesn't know what to do with all of it. So I think it would be interesting to study more, but apologies that I can't remember the exact name of that paper.

It's fine, you don't have to. You talked a
little bit about optimization as a problem. Can you talk to us about how you understand whether an output is good or not? This is a universal problem most practitioners face. There's obviously the vibes-based test of whether a language model is performing well, but can you talk about how you constructed metrics to even drive this optimization? What is the objective function you're optimizing for, and what are the trade-offs involved in picking one or the other?

Yeah, that's a great question, and one I wish I knew more about, candidly, but I at least have my own experiences to talk about. The metric at the end of the day is so important to the optimization process, because it's the north star guiding the optimization; your optimization can only be as good as the metric you choose. The nice thing with DSPy is that it supports classic metrics, like accuracy or exact match if you're solving a Q&A task, but you can also use another language model program as a judge. So you can actually optimize your metric too, if you want, with these same systems, which we've done: I think we did a use case where we tried to optimize a judge for evaluating tweets on different criteria. So there's definitely a lot you can do there. I think the whole LM-as-a-judge space is really interesting, because getting that to work well feels like it would unlock so many more complex and interesting tasks that we could then solve for. Right now I think metrics are a big limitation on what we're actually able to do and build with machine learning.
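Both kinds of metric mentioned here, a classic exact-match metric and a judge built from another model call, are just callables that score an output. The sketch below is illustrative only: the "judge" is a stub function with made-up criteria (short, no all-caps shouting), standing in for a real LM-as-judge program.

```python
def exact_match(gold: str, pred: str) -> float:
    """Classic metric: 1.0 iff the normalized strings match."""
    return float(gold.strip().lower() == pred.strip().lower())

def stub_judge_lm(prompt: str) -> float:
    """Stand-in for an LM judge. Real use would call a model; here the
    'criteria' are invented: at most 80 chars and not all-caps."""
    tweet = prompt.split("TWEET:")[-1].strip()
    return 1.0 if len(tweet) <= 80 and not tweet.isupper() else 0.0

def judge_metric(example, pred: str) -> float:
    """A metric built from another model call, as in the tweet example."""
    return stub_judge_lm(f"Rate this tweet on our criteria.\nTWEET: {pred}")
```

An optimizer simply maximizes whichever callable you hand it, which is why the judge program itself can also be optimized with the same machinery.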
That's a really interesting area of study that I haven't dived super deep into myself, but I know there are a lot of great folks working on it.

That's awesome. Maybe, when you were developing MIPRO, how did you think about benchmarks and evaluation? Were you using something like SWE-bench? I think the DSPy community also created another benchmark recently, if I remember correctly.

Yeah, so as part of this study we built out a benchmark that's now been open-sourced in DSPy, called LangProBe, the language model program benchmark, if I remember the acronym correctly. The way we thought about building it out: we wanted a suite of language model programs and the task pairs they're trying to solve, and we wanted to make sure we had good representation across tasks. So it's a diverse set of task types, with different evaluation metrics, that people can use to benchmark optimizers or other approaches for improving language model programs. We used a number of tasks that were already used in the original DSPy paper and then expanded out to other tasks we came across. This is something we're actively expanding further: I think some folks are looking into adding more agent-based use cases, because that's a big thing people are interested in for DSPy, as well as maybe code generation. So there's a lot of work still to be done on the benchmark, but at the time of the paper we had about six different tasks included.

Awesome. Well, let's dig one layer deeper. The cool thing about prompt optimization, at least for me, is that it's reminiscent of the
promise of self-improving AI: having your prompts be optimized autonomously, and then maybe LLMs can code their own LLMs, and so on. So could you dig one level deeper and describe how you built MIPRO? What were some of the key insights when thinking about optimization?

Yeah, great question. I can walk through, at a high level, how we got to MIPRO. When we started off this process (process sounds like a negative word; our collaboration and working together), we were at a point where DSPy offered one optimizer, BootstrapFewShotWithRandomSearch, which essentially created few-shot examples and searched over them, using a basic random search technique, to find a good set of few-shot examples for your program. This was one of the first, or only, methods that really worked for these multi-stage language model programs. Concurrently, there was a lot of work at the time on optimizing instructions, those free-form instruction strings we spoke about earlier, but specifically for single language model calls. That work wasn't really considering the case where we have multiple calls chained together. So what we became interested in was: how can we start by extending these methods to work for the multi-stage case? That led us to create COPRO, our first method, which was a basic exploration of extending OPRO, a method from DeepMind, I believe, that used a language model as a proposer for instructions and iteratively took a history of past proposals and their evaluations to propose better and better instructions over time. We essentially looked at how we could extend this to the multi-stage case, where we have multiple
components that are all influencing the end performance. We did this with a greedy approach, where we went through the program prompt by prompt: proposed a new set of instructions for each, evaluated those, chose the best, froze it, went on to the next one, and continued this process. That was the first optimizer we released in DSPy that supported instruction optimization. But as you can imagine, this was not very efficient, because it looked at each variable one at a time and evaluated the instructions for each prompt separately. So we became interested both in how to do this more efficiently and in how to optimize the few-shot examples and the instructions in the prompt jointly. That's what led us to MIPRO. First, we propose few-shot examples using an approach very similar to BootstrapFewShotWithRandomSearch: we take an example from our training set and run it through the program, which generates an intermediate trace of how the program goes about solving that example; if the end output is correct, we can assume each step it took was also correct, and keep that trace as a few-shot example for every step in our program. That's the first step in MIPRO, where we generate these few-shot examples. In the second step, we generate instructions, basically by synthesizing a lot of relevant information about the task as well as about each sub-step in the pipeline. We take the few-shot examples generated in the step before to show what the inputs and outputs for each step should be; we summarize the program code itself, to say "here's the logic of this pipeline, and here's what this exact module we're generating an instruction for is responsible for"; and create an
instruction accordingly, along with a few other pieces of information about the task. We use all of that to create a lot of candidate instructions. The third step, like I mentioned, is that we want to optimize these few-shot examples and instructions jointly, and do so more efficiently than greedy optimization. For that, the key insight was that we could use Bayesian optimization, which is really good at efficiently optimizing over multiple variables. That's the final step: finding the best combination of all these few-shot examples and all these instructions for your program. Let me know if any of that needs clarifying; it was a lot of information.

No, I think that was pretty good, and you walked us through why you made the design decisions you made, which is great. Maybe let's talk a little about other approaches to composing complex LLM outputs from simpler LLM outputs. We're aware of the work going on with LlamaIndex and LangChain, which in effect propose that you can string together, with instructions or orchestration, a variety of language models to solve for something. In some ways that's also programming multiple language models together. What would you say is the biggest difference in approach, if you can comment on this, between LlamaIndex or LangChain, that more deterministic workflow or orchestration approach, as you might call it, versus the joint approach that DSPy and MIPRO take?

Yeah, I might not be the best person to answer this, given that I've worked largely with DSPy as opposed to LangChain.
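To make the third MIPRO step just described concrete: the joint search picks an instruction and a few-shot set together, rather than freezing one variable at a time. The sketch below is a toy with invented candidates and an invented scoring function; and because this toy space has only nine combinations, plain enumeration stands in for the Bayesian optimization that the real method uses.

```python
import itertools

# Candidate instructions and bootstrapped few-shot sets for one module.
instructions = ["Answer concisely.", "Think step by step.", "Answer in one word."]
demo_sets = [["demo_a"], ["demo_b"], ["demo_a", "demo_b"]]

def evaluate(instruction, demos):
    """Stand-in for running the program on a dev set and scoring it."""
    score = len(demos)                      # pretend more demos help,
    if instruction == "Think step by step.":
        score += 2                          # and this instruction helps most
    return score

def joint_search():
    """Pick the best (instruction, demos) combination jointly.
    Real MIPRO replaces this enumeration with Bayesian optimization,
    since the real search space is far too large to score exhaustively."""
    return max(itertools.product(instructions, demo_sets),
               key=lambda cand: evaluate(*cand))

best_instruction, best_demos = joint_search()
```

The contrast with the greedy COPRO-style loop described earlier is that the combination is scored as a whole, so interactions between the instruction and the demos can be captured.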
But commenting on the broader differences and similarities: I think the key difference I see is that DSPy really tries to abstract away a lot of the prompting, keeping how you specify each language model call as simple as possible so you can focus on the high-level logic and how the different calls are chained together. LangChain is of course also a library for chaining together these calls, hence the name, but I think there's less of a focus on the optimization side of things. There's actually a basic implementation now for folks who want to build their program in LangChain and then optimize it using DSPy, and that's something we're totally open to investing more in if there's popular demand for it. I'd say that's one of the larger differences I'm aware of, but folks who have really used both libraries quite a bit might have a better sense of all the exact differences.

When you were working on MIPRO, did you consider optimizing the program structure itself, as opposed to just the instructions and examples? Am I making sense here? Like the steps, the architecture.

It does, yeah. I think it's a great question, and another thing on my research backlog that I'm excited to look into. In general it seems pretty clear that the architecture makes a massive difference in performance, sometimes even more than the prompt, or even the model, that you choose to use. So I think this is a major axis, in general, that people are starting to become aware of for improving overall performance. Right now it's more of an art than a science, being able
to construct these pipelines, and similar to how we saw neural architecture search come up as a field for neural network architectures, I think we'll probably see a similar thing for these next-level-up language model program architectures. So I think there are a lot of interesting questions there.

Like how we were hand-fiddling with handcrafted prompts, exactly. It felt so archaic, honestly stupid, doing that, and finally we can optimize with a programmatic approach. Maybe to bring this all the way back: where in practice do you see this being most applicable in the near future? Is it healthcare, is it finance, other particular verticals where you think these technologies could have a real impact?

Maybe I'm biased, but I see them having an impact everywhere. Anyone who can benefit from solving tasks more effectively, where you have some data available, can start benefiting from some of these technologies. We're seeing this already: Isaac Miller, who's a contributor to DSPy, wrote up a list of use cases where DSPy is being used today, and it's pretty impressive. There's quite a variety of academic groups and companies using DSPy now, everything from Databricks to VMware to Sephora, which is kind of cool; you wouldn't necessarily think of that. So I think there are tons of applications here where DSPy and MIPRO and these sorts of systems can be really useful. Two use cases I think are cool that came about recently: a group at the University of Toronto built and optimized a program in DSPy for doing, I believe, medical question answering, as well as
error correction in the answers. They built their program in DSPy, optimized it with MIPRO, and I think ended up winning a competition against 15-plus other teams by a margin of 20 points, which was kind of amazing. That was a big step change; you don't always see a leaderboard that looks like that. It was also cool because they had pretty detailed ablations and attributed a lot of their performance gains to our optimizers, which was exciting. Another example is Haize Labs, who I believe had a use case where they looked at doing red-teaming with DSPy. They essentially built a language model program to generate prompts that could be used to jailbreak another language model under test, and then they evaluated the output of that language model with yet another language model. It was really cool to see this chain of LMs being used to optimize for understanding what could jailbreak an LM. I think they started off with maybe a 10% success rate at generating a harmful response using a basic call to the LM; they got that up to maybe 26 or 27% using a multi-stage approach out of the box with DSPy; and then with optimization I think they got to maybe 44%. So you can see the step changes as the optimizer kicks in.

Yeah, they did. That's awesome. Well, I guess, what are the next steps you see in the field? And if you publish a paper, I hope we get credit for some of your ideas. I'm just kidding.

Yeah, you'll get a shout-out. So, next steps: as we've talked about, there are so many. Part of the challenge now is just organizing the backlog to prioritize things, because there's a lot to do. I think there are maybe
two broader directions that I see as interesting. On the prompt optimization side, there's a lot of work we could still do on just more fundamentally understanding what makes a good prompt. I mentioned the few-shot examples and how we saw huge variation in performance based on the exact set of examples chosen; it's still not entirely clear why certain examples are better than others, whether it's that they're more representative of the training set, or just a combination of words that happens to bias the model in a better direction, or the reasoning string used in chain of thought. So that's an interesting area to understand better, and the same goes for the instruction strings. Another area within prompt optimization that's exciting to me is real-time routing of prompts. Right now we're basically choosing the set of prompts that works best overall on your training set and using that at test time, but you can imagine that if we could understand which prompts work best for different types of examples, then at test time we could choose the best prompt for the specific example you're looking at right now, and that could improve performance quite a bit. That's what I see as the interesting next steps on the prompt optimization side, and of course anything we can do to make these optimizers better, faster, more efficient, and cheaper is a good route to pursue. Then on the compound AI system side, we chatted about this already, but I think it's really interesting to see what we can do with broader architecture search: someone could just specify the task they want to solve, with maybe just a few examples of that task, and we could
scale those examples out by creating synthetic data, then create architectures and optimize over all of this on the back end, and basically hand them a fully functioning ML solution. I think that would be very exciting as a future direction. So, lots to do and little time, but at least there's no shortage of fun ideas.

Which probably brings us to the close here. You're now in your second or third year working on prompt optimization and prompt engineering to make composable LLMs. What advice would you give to yourself three years ago, or to somebody who's now getting into this field, as a way of making the most of their research?

Yeah, great question. I would say, again, there's really no shortage of ideas, and if you're feeling stuck and don't know where to start, finding great mentors is honestly an amazing step: just people you really love working with. I feel very lucky to have that with the broader DSPy community and Omar and Michael and all the other folks I've had the opportunity to collaborate with. Once you start working with a group like that, and start on one project, there are so many next steps and exciting directions, and I think it's really about following whatever is most exciting to you at the time. Part of the challenge, and both the exciting thing and the challenge with where we're at now with AI, is that it feels like there are almost infinite things going on. It's hard to keep track of all the new papers coming out, so there can be some natural FOMO: "oh, what about this thing, and what about that thing." But what's really important is, again, that you have a group you love working with, and
that you're excited about what you're doing. There are lots of winners in this field; whatever you work on, there are so many exciting things to do. So that's one thing I would tell myself two years ago.

Well, we're VCs, we know all about FOMO.

Yeah, it's a familiar feeling. It's been great to talk to you, Krista, thanks so much. Where can people find you?

Thanks so much for having me on. You can find me by just Googling my name; I think my website or my LinkedIn will come up. Feel free to reach out on LinkedIn, Twitter, wherever. I'm happy to chat with folks who are interested in this space or interested in getting into research, and if people want to contribute to DSPy, we have a huge community and always welcome helping hands. So definitely feel free to reach out.

Great, thank you so much again.