[Applause] So, thanks Blaine, and thank you also to Noeon Research for sponsoring this. My name is Hoagy Cunningham; I'm currently a researcher at Anthropic. I'm mostly going to be presenting research from last year that I did with my collaborators at EleutherAI — Anthropic did some very similar work at the same time, which I've now joined them to work on. The topic is finding features in large language models with sparse autoencoders.

Before I get into the meat of what that means, I want to take a bit of time to get a sense of what this field is, since we've been seeing a lot of more theoretical work and we're going to have a few different talks on interpretability. So, what is interpretability? It's a pretty huge field. Someone recently said to me that interpretability is the science of trying to understand how models generalize. With any system, we're only ever going to observe some finite, limited number of cases in which the model is acting, and we want to know how it generalizes — we want to be able to predict what it's going to do in other situations. A lot of things can be cast as interpretability in this sense. When we run evaluations to see how smart a model is, we're getting a sense of how smart it will be in other situations, so we're understanding its generalization. You can go into more detail: for a given output you can look at the Shapley values of the tokens to get a sense of which tokens caused that output, and you can even do things like just playing around with the model, trying to understand the contours of its character. That is still interpretability, because you're improving your ability to predict what the model will do — the same way you predict a person by interacting with them.

Nonetheless, because these models are new and weird and they're going to change a lot over time, we can't just rely on talking to them, and there are lots of situations we wouldn't want to put them in even to test what they would do. So to get a deeper understanding we turn to mechanistic interpretability, which is the project of trying to look inside the box. We have this machine learning algorithm, we know exactly what code it runs on, we know exactly what all the parameters are, we know what all the vectors inside the model are — surely we can understand this. It's a very similar project to neuroscience, but with the advantage that you can run pretty much any test you could possibly think of, and you get to start with the exact values of every single parameter, rather than mapping out all the parameters in the brain being a gigantic, multi-decade task.

Again, there are lots of types of mechanistic interpretability. Some people try to understand the macro structure — do we find modularity in models? For example, in transformers we find that every unit is a residual unit: an MLP layer computes its neurons and then adds its output back to its inputs, and the same goes for attention. So you can see the model as consistently updating a single vector, which gets updated at every single layer, and it turns out you can do things like take a middle layer and go immediately to the output, and that works pretty well. So it's a machine which iteratively calculates its outputs in a step-by-step way — which didn't have to be the case.
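To make that residual-stream picture concrete, here is a schematic sketch — the block attributes and tensor names are illustrative assumptions, not any particular library's API:

```python
def transformer_forward(x, blocks, ln_final, unembed, read_at_layer=None):
    """Schematic residual-stream view of a decoder-only transformer.

    Each block reads the residual vector, computes an update (attention
    or MLP output), and adds it back into the stream. Reading the stream
    out at a middle layer and unembedding it directly is the "skip
    straight to the output" trick mentioned above.
    """
    resid = x  # [batch, seq, d_model] token embeddings (plus positions)
    for i, block in enumerate(blocks):
        resid = resid + block.attn(block.ln1(resid))  # attention writes into the stream
        resid = resid + block.mlp(block.ln2(resid))   # MLP writes into the stream
        if read_at_layer is not None and i == read_at_layer:
            return ln_final(resid) @ unembed          # early read-out of next-token logits
    return ln_final(resid) @ unembed                  # [batch, seq, vocab]
```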
So you can find some interesting things at the macro level, but what we're really trying to do is more like finding atoms: we want to know what the core units of thought are that these networks are working with. One thing you might think is a good candidate for these core units is the neurons themselves — it is a neural network, after all — though if you're a neuroscientist you might immediately suspect this is not the right unit to work with. To give a very brief history of mechanistic interpretability: a lot of the original work was done by Chris Olah and others at Google Brain, then OpenAI and Anthropic, with Nick Cammarata and others, and much of that work used neurons as the unit of analysis in vision models. What we have here is a real example — the dream of this working. On the left are three neurons in a previous layer, and on the right a neuron in the subsequent layer. For each of these neurons we can work out what kind of images it fires on, and we find something that looks a bit like windows, something a bit like car bodies, and something a bit like wheels; in the next layer we find something that looks a bit like cars. When we look at the weights that connect these up, the car neuron seems to be looking for windows at the top and wheels at the bottom. That's a really nice example of a clean, crisp story we can tell using neurons as the unit: how does this model detect cars? Well, it detects them like this. Obviously there's more to the story, but it's a pretty good starting point.

Unfortunately, when they moved from these vision models to transformers, and especially to language models, the neurons just don't seem to work as well. In vision models, neurons would consistently activate on quite clustered examples: when we look at when a neuron is active, we can understand the semantic cluster and immediately have a reasonable guess of what kinds of things will activate it. In language models this is much less the case — there are maybe multiple clusters, and sometimes it just doesn't seem to make sense at all. So there's a question: why would it be that in a vision model neurons seem to mean specific, single things, while in a language model they don't mean specific things in the same way?

The second paper in my very stylized history of mech interp is Toy Models of Superposition, which investigates when you would expect to see this phenomenon using lots of simple, small examples. The setup is: imagine there is some larger number of features — say 100 — and we try to compress them down into, say, a 10-dimensional space. How does the model do that? Does it actually manage to represent all 100 features, does it even try, or does it just pick the 10 most important features, give each of them a neuron, and ignore the rest? The determining property they found is that whether a single feature gets a single neuron depends on how sparse the features are. In one of their examples, each column represents a single neuron (it's not very clear on this screen, but the yellow regions are made up of lots of small blocks). In the top case the sparsity is zero, so all features are on all the time, and each neuron just represents a single concept; any concepts beyond the number of neurons simply aren't represented at all. In the other case the sparsity is 0.999, and because the features are so sparse, it becomes advantageous for the model to have each neuron represent multiple concepts, with each concept spread across multiple neurons. So there's a key relationship: when sparsity is low — features on often or all the time — features tend to get a dedicated dimension, perhaps a dedicated neuron; when features are really sparse, we move to distributed representations, or superposition, where neurons represent multiple different things and the units are spread across multiple neurons.
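As a rough sketch of that kind of toy setup — the tied-weights form, the feature count, and the importance weighting here are illustrative assumptions, not the paper's exact configuration:

```python
import torch

n_features, d_hidden = 100, 10                  # many candidate features, few dimensions
sparsity = 0.999                                # probability that any given feature is off
importance = 0.9 ** torch.arange(n_features)    # assumed: earlier features matter more

W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Synthetic data: each feature is on with probability (1 - sparsity), value in [0, 1].
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) > sparsity)
    h = x @ W.T                                  # compress into the small hidden space
    x_hat = torch.relu(h @ W + b)                # tied-weights reconstruction
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inspecting the columns of W afterwards shows whether each feature got its own
# direction/neuron or ended up sharing directions (superposition) at high sparsity.
```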
Okay, so given that we have this basic picture in mind, how should we recover it? If we don't want to use neurons as the basis, what should we do? As we've just said, we believe we're in a regime where there are far more features than there are neurons — than there are dimensions in any given activation space — but we also believe these features are sparse. That makes a lot of sense if you think about what's going on in a language model. In a vision model you have textures and shapes and colors, which are very common and exist in most images; but concepts like particular people, or even particular words or types of verbs, are going to be very rare, so very few of these features will be active at any one time.

We can convert this almost directly into a neural network architecture of our own. We take our activation vectors — whether from an MLP or from the residual stream, where outputs are added back into the input — and ask: if we really knew what the features were, what should we be able to do? We should be able to expand this vector out into a much wider space, where each dimension is one of these features, but that wider space should be very sparse — very few features on at once, far fewer even than the dimension of the much thinner activation layer. And because this captures essentially all of the information, we should be able to reconstruct the original activation vector from this wider, sparse space. That's the intuition, and we can cast it as a very simple machine learning problem.
We have x as our input. Our features are a ReLU of a matrix times x plus a bias vector, and we reconstruct with another matrix — that decoder matrix can also be thought of as a sum of feature vectors. The loss we apply is a mean squared error loss, because we want to be able to reconstruct the input, plus an L1 penalty, which is where the sparsity comes from. An L1 penalty just adds up all the activations: if two features are on, one with a magnitude of five and one with a magnitude of two, the L1 loss is seven. This is a little bit funny, because L1 is not really what we care about — what we really care about is the absolute number of features that are on — but L0 doesn't have a useful gradient and is very difficult to optimize, and there's a lot of research showing that optimizing L1 is a sufficiently good proxy that you recover properly sparse, L0-style solutions. So we use the L1 penalty as the sparsity penalty, and then we just train this. We take millions or billions of activations: we take a language model, run it over a big dataset, pull out the vectors at a particular layer — it can work on any layer, though attention needs some extra work; MLP and residual layers are straightforward — run them through this little model, and see what we get out.
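Here is a minimal sketch of that autoencoder and its loss in code — the width multiplier, learning rate, and L1 coefficient are placeholder values, and the activation stream below is a stand-in for vectors pulled from a real model:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Expand activations into a wider, sparse feature space and reconstruct them."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # features = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(d_features, d_model)  # reconstruction as a sum of feature vectors

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstructed activation vector
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    mse = ((x - x_hat) ** 2).mean()      # reconstruct the activations
    l1 = f.abs().sum(dim=-1).mean()      # differentiable proxy for "how many features are on"
    return mse + l1_coeff * l1

# Stand-in for a stream of residual-stream or MLP activation vectors collected
# by running the language model over a large text corpus.
activations = (torch.randn(4096, 512) for _ in range(1000))

sae = SparseAutoencoder(d_model=512, d_features=512 * 16)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for x in activations:
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f)
    opt.zero_grad(); loss.backward(); opt.step()
```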
And we get out a lot of things — the question is how we even start looking at them. The basic thing you can do is to take one feature, one dimension in this wider space, and ask: in which cases is this feature most active? This example is from the Anthropic one-layer paper. On the left, the feature seems to activate most strongly on nouns at the end of noun phrases, with a fair amount of clustering — "game", "station", "warehouse". It's not a perfectly clear unit that we immediately make sense of, but it's quite consistent and clustered, and this is the kind of thing we see. As you go to larger models you start to see much more interesting things — in a one-layer model you're not going to expect brilliant structure — but it's at least not a combination of obviously separate things.

Obviously, just looking at the top activations isn't going to tell you a huge amount. Even for neurons, the very top activations often look very clustered, and it's only when you dig down into the distribution that you find lots of other things going on. So we can look at that too. Here is a feature which seems to fire on Arabic text. This is a histogram: the x-axis is how strongly the feature fired, the height is how often it fired that much, red is Arabic text and blue is (probably) not Arabic text. Interestingly, first, the distribution is nothing like a Gaussian — there's this big lump, which suggests we've found some kind of structure. Second, it quite clearly separates out Arabic text, not just at the very top but throughout its range, and the feature seems to cut off roughly when the text stops being Arabic and goes back to normal text. There are probably some cases of Arabic text it doesn't catch, but it seems pretty good. So that's a sense of one good feature, perhaps.

We can then use this to understand a passage of text through the eyes of the autoencoder. This is John Keats's "Ode on a Grecian Urn", I think. We run the autoencoder over the whole passage, highlight the word "Earth", and see which clusters are active at that particular point; the coloring is the proportion of the maximum activation we've seen for each feature. The highest one is early modern, Shakespearean English — firing on words like "thee" and "ye" — which isn't about this word in particular, but tracks that the passage as a whole is in this very early modern English. The second is labeled "sun" and refers to celestial bodies — and obviously Earth is a celestial body. As we go down we get things that are somewhat related but less exactly on target: grass and farm-related concepts, descriptions of nature. So we can get a sense of what kinds of clusters the model is treating this input as containing. Again, this is a window into a one-layer model, but it suggests that for much larger models we might get a much richer kind of insight into what's really going on.

One problem with larger models is that we might find millions, tens of millions, hundreds of millions of features — how could we even look at them all? This is a kind of research where we can leverage large language models a lot to help us get a hold of these features. Following some work OpenAI did last year, we can take a feature we've learned, show a language model lots of examples of when that feature was active, and get it to write a few sentences describing when it thinks the feature is active. And not only can we get it to describe the feature, we can test the description: we separately give another language model just the description and ask it to predict when the feature will activate, and then we judge the quality of the description by the degree of correlation between the predicted and the actual activations. This both lets us feel confident that we can actually get a grasp on potentially tens or hundreds of millions of features, and — if the protocol is good enough, which I don't think we've demonstrated yet, though we're working on this internally at Anthropic — maybe lets us judge how good this sparse autoencoder is versus that sparse autoencoder. So I think there's a lot of promise in really bringing in the language models to help understand the language models.
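A rough sketch of that explain-then-score loop — `llm_explain` and `llm_predict_activations` are hypothetical stand-ins for calls out to a language model, and the correlation scoring captures the general idea rather than the exact protocol:

```python
import numpy as np

def score_feature(top_examples, held_out_examples, llm_explain, llm_predict_activations):
    """Explain a feature from its top activations, then test the explanation.

    `top_examples` / `held_out_examples` are assumed to carry .tokens and
    .activations for this feature; the two llm_* helpers wrap model calls.
    """
    # 1. One model writes a short description from examples of the feature firing.
    explanation = llm_explain(top_examples)

    # 2. A second model, given only the description, guesses per-token activations
    #    on held-out text.
    predicted = np.concatenate([
        np.asarray(llm_predict_activations(explanation, ex.tokens), dtype=float)
        for ex in held_out_examples
    ])
    actual = np.concatenate([np.asarray(ex.activations, dtype=float)
                             for ex in held_out_examples])

    # 3. Score the explanation by how well the predictions correlate with the truth.
    corr = np.corrcoef(predicted, actual)[0, 1]
    return explanation, corr
```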
But stepping back a little, I also want to talk about why we think we've found something real here. There's always a danger that we find something that's a cluster — an interesting fact about the model — but is it something fundamental to its computation? That's hard to answer, and I don't think we've answered it completely, but I want to go back to this Arabic feature. On this histogram, instead of the contexts in which the feature activates, what we have — this is just a one-layer model — is which tokens are most upweighted by this direction when the feature is active. If the feature is on, we can just multiply through the matrices and ask what direction it pushes the model's outputs in. On the far right, forming a separate peak, are the tokens corresponding to letters in Arabic, and they're really quite nicely separated from the others. So not only is this feature firing on Arabic text — detecting that the text is Arabic — it's acting to upweight the probability that the subsequent text is Arabic. And not only that: there is no neuron for which we could draw this graph. You can look through every single neuron, and there isn't one that separates these out nearly as cleanly — not in terms of the logit weights, and not in terms of detecting the Arabic in the first place. I think that's an indication that we're really separating out not just the information in the space, but what the model is actually doing — how the model is doing its computation. There's a lot more to answer here, but it's one indication that we're at least pointing towards something true and important about the model.
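A sketch of how you might read off that direct effect in a one-layer model — the weight names here are generic placeholders, and this only covers the direct path (it ignores layer norm and anything downstream):

```python
import torch

def feature_logit_weights(W_dec: torch.Tensor, W_U: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Direct effect of one SAE feature on the next-token logits.

    W_dec: [n_features, d_model] SAE decoder (each row is a feature's direction)
    W_U:   [d_model, vocab_size] model unembedding matrix
    Returns a [vocab_size] vector: how much turning this feature on pushes each token.
    """
    feature_direction = W_dec[feature_idx]   # [d_model]
    return feature_direction @ W_U           # [vocab_size]

# For the Arabic feature, the tokens with the largest weights in this vector were
# the Arabic-script tokens, forming the separate peak in the histogram.
```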
In terms of how we would go further: so far this is just finding a cluster in the model. We might say, aha, we've found a feature which is active when a person is lying. We might care about that — certainly if the model thinks it's lying itself, that would be interesting — but what we really want to know is how the model is going to act on these features and how it builds them up over time. This is still very much a work in progress — in fact there's been some really interesting work developing this in the last week, by Sam Marks, which I haven't had time to look at yet — but here is our initial version of it from our paper. On a six-layer model, we found in the last layer a feature which makes it particularly likely that a closing bracket will be predicted. So, we're going to predict a closing bracket — which features cause this? At the top we find features which just detect the open bracket; those influence features later in the model, and presumably later in the text, which track the fact that we're inside a bracket; and then later still we realize, ah, now it's time to close the bracket. We can't flesh that story out in perfect detail, but we can get a sense of how the model builds up complexity: from recognizing inputs, to tracking longer-term properties of the text, to converting that into action. So that's a very early idea of how we might get more serious statements about what the model is really doing.

So that's our basic picture. I think we're learning something really interesting, but there's a lot to do, and here is a set of things we're going to try next. As I just said, we want to do circuits, and there's been some great work on that literally this week, with a lot of really fun things coming out from various labs and independent people. We also just want to train these things better: train them on huge models, get the right hyperparameters — there may be many tricks and tweaks, and maybe this isn't even exactly the right formula for training. We also don't really know whether we can apply these ideas to attention properly. And interestingly, I don't think we yet have a perfect understanding of what we're finding. Early on I said that in vision models there's something which really pushes features to be concentrated in single neurons, and we'd expect some of that pressure to remain even in language models — features might not sit on a single neuron, but maybe on 10 or 20, not on all of them — whereas what we actually tend to find in these layers is something that's totally spread out. That's one pointer towards the fact that we don't have a perfect understanding of what we're finding here. So: lots of future work to be done, on scaling and training.

It's nearly time for questions, but I've given myself some pre-Q&A questions about general things I didn't really cover. The first is: this assumes everything is linear — if a feature is not linear, it won't be picked up by this model — isn't that a dangerous assumption? Oscar is talking about linear representations immediately after this, and I'd just say there's been a lot of interesting work from Oscar and others showing just how many rich things can be represented linearly — not just features in the classic sense, but even tasks: when we ask a model to do something, the nature of that ask can be represented linearly. Second: what have we really learned about the model? We're grouping these things together, but we definitely need circuits. And third: when might this actually help? I think there are lots of cases — we might just learn more about the model, because it's such an open door at the moment. One case where it could be really useful: companies are starting to offer fine-tuned versions of models, and we want a really quick way of understanding what changes are made when you fine-tune a model. I think this can maybe provide an answer to what has actually been changed when we've given the model just those few examples. And if these really are the atoms, I think there's an almost unlimited number of things we can do, so I think it's a really exciting direction. Anyway, those were my quick questions — now, your questions. [Applause]
Let's bring the lights up. Yes — straight away, from the front centre.

Thank you for the nice talk. You're essentially trying to decode these features, which you assume are sparse and somehow stacked through superposition in the real, trained model. It would be interesting to hear your thoughts on the choice of sparse autoencoder, and especially the size of the middle layer: what happens if it's too large, or too small? Could you comment on the failure modes in either case?

Yeah, this is a really great question. What we kind of hoped to find initially — something I could have covered here — was that there would be a discrete number of features, so that if the middle layer was too wide, maybe things would go wrong or the extra features would just never turn on (that would be the easiest version), and if it was too narrow it wouldn't quite work properly somehow. What we find instead is that it's much more smooth. You can make it as wide as you want. Sometimes features die because of the sparseness — they get to a region where they just never have a gradient — but that's more of a training failure mode; assuming they haven't actually died, the autoencoder just continually finds more and more fine-grained distinctions. I think that's evidence that, although we are finding something about the clustering of the model, it's not the perfectly discrete set of features we might have hoped for.

Cool — could you pass the microphone back, on your right, to Kai?

Hi. If we want to do mechanistic interpretability on multimodal models — say, taking audio, vision and text as input — do you expect you'd want to use different methods based on the layer or depth you're looking at, or would it be based more on the natural distribution of properties in the data? As you said, natural images contain different distributions to text data. And finally, if you had a multimodal space towards the end of the model, what sort of methods would you use for something that's integrating information from all of these different modalities?

Interesting question. With vision models we've seen that, especially in the early layers, just looking at the neurons seems to be pretty good. There's some really interesting work that hasn't yet been done — which I would love for someone to do — comparing at what point there's a crossover, and how well sparse autoencoders work in that early-vision setting. When it comes to audio, Ellena Reed did some interesting work on audio models and found that you could find syllables quite nicely in the residual stream, in a way you couldn't necessarily discover otherwise. So these methods seem promising for at least the inputs. As I alluded to, theoretically it seems like we should want to adjust something about the way we're doing this to match the statistics of the neurons and how sparse they are, and I don't think we necessarily know how to do that — so there may be adjustments to the techniques that we would make if we knew how. But at the moment this does seem to be the best technique I know of, regardless of model layer or depth.
Okay, I think that is sadly — oh, I think we could do one more, if you head to the microphone.

Hi, thank you. This is a follow-up to the previous question. Have you tried checking the goodness of fit? You're doing the encoding and then the decoding, and then you check how well you've been able to decode the original thing that you encoded — how well does that work, basically?

Yeah, great question. It really depends on the model. One-layer models have some advantages, because there's a kind of natural sparsity induced just by the fact that the tokens are right there, and generally in early layers you can get the reconstruction down pretty well. The measure we often use isn't exactly proportion of variance recovered or something like that — what we often do is run the model forward but going through our layer instead of the actual layer, and ask: what's the loss? How much model capability do we lose, relative to just taking out that layer entirely? For the one-layer model you can get up to 95%, maybe even higher; for a middle layer in a sizable model we're generally, at the moment, looking more at the 80–90% range. So there are some somewhat sizable losses, but it's a pretty large chunk.

I guess that can help you address how wide you need to make the internal layer?

It's definitely a good point — ideally there would be a point where we could draw a line and say, maybe this is how many features we need for perfect reconstruction. At the moment it's more that it flattens out, and often it flattens out quite aggressively: going from, say, 1 million to 2 million features doesn't necessarily give you that much. One of the big questions I'm working on at the moment is whether that's because we just aren't good at optimizing it, or whether it's something fundamental — and that's quite hard to figure out.

Thank you.

Okay, thank you, Sam, and thank you, Hoagy — thank you, Blaine — great talk. [Music]