Okay, hi everybody, thanks for coming. There are three groups of people who may be watching this. One is the people actually here: this is the eleventh part of the reading group for the Deep Learning book. There may be people watching on the livestream who haven't been coming to our book group so far, and there may be people watching the recorded video who also haven't been to the book group so far. I've been asked to do more of a presentation-style thing this week, so this actually has two purposes. The second is to do a bit of a trial run of some of the material I'll be presenting at the Data Institute certificate course on deep learning at the end of October, so any feedback about what went well or didn't go well, what was clear or not clear, would be really, really helpful.

So far we've been going through the Deep Learning book, which is quite mathematically intensive, which is to say there's no code, or really any discussion of computational issues, in the book.

(Can you talk a little bit louder?) Yeah, absolutely. I do have a mic, but I don't know how to hook it up to the speakers. [A couple of minutes of microphone troubleshooting with the A/V staff.] For those who don't know, that gentleman runs the Master of Science in Analytics here at USF, so he's a good person to go and find a mic. In the meantime I will speak up; thank you, Rachel.

So I want to take a bit of a different tack this week, which is to take a more coding-oriented approach and talk more about some of the computational issues, rather than focusing on the mathematical side. There are a number of reasons for that. One is that RNNs, I think, are quite neatly presented in code and are quite difficult to present in mathematical notation. Another is that, in general, I find it easier to think in terms of code than in math because of my background, so I thought it would be nice to present things in a way that I would like.

Another thing I'm going to do differently from how the book generally presents things is that I like to take an educational approach that's top-down rather than bottom-up. I generally start with a complete working thing using lots of abstractions, and then gradually work down to show how that thing works, and then how that works, and so on. For those of you who are more used to an academic approach, where you start with the simplest thing and gradually work up, you may find this slightly weird, because I won't explain all the details at first; it may require some patience to wait to hear all the details. But we will get to a full LSTM, or actually a GRU, written from scratch in Theano, so we are going to see the entire cycle.

(Is the code hosted somewhere?) Not at this stage, but it will be; we can mention that when it appears. It might not be until after the course happens, but we'll make sure it goes on the Meetup page.

I know quite a few of you from the book group; some of you I don't know. Whoever you are, please feel free to ask a question at any time, or add any comments. It was great last week with Ian Goodfellow presenting: there was some
really interesting discussion, so feel free to add anything at any time.

So I want to start right at the start, which is: why RNNs? The material I'm going to assume is that you know all the stuff up to chapter 9 at some level. In other words, you know what a neural network is, roughly how you train one, what a convnet is, and so on. If you're watching the video of this and you don't know those things, go and find that out first; if you're here live and you don't know those things, it might be heavy going.

So convnets and standard fully connected networks are not great at dealing with things that are of variable length. In the book they have a couple of examples of things you might want to try to analyze with an RNN: "I went to Nepal in 2009" versus "In 2009, I went to Nepal." The first thing to notice is that they are different-length sequences. With a convnet, if you want to have varying-length inputs you have to do some kind of hacking around with pooling layers, and pooling layers really just blur the data; they don't capture all of the detail. So if you want to fully utilize all of the information in a variable-length sequence, a CNN is not a great way to do it in general. There are some ways to get around that, but I'm speaking in broad terms.

(Why are those two different lengths? Do you just mean in general they could be different lengths?) I think one of them has an extra piece of punctuation in it. (Okay, so you're counting the characters.) Yes, exactly. Often when we look at language we can look at it at the word level or the character level, and today we'll mainly be looking at the character level.

More importantly, long-term dependencies: if you want to be able to parse or generate something like this comment, you're going to have to know that you're inside something like a percent tag and know to close it properly, and then, longer term, you're going to have to know that you're inside a comment block and close it properly. To be able to do something like that, you need to know about things that happened some arbitrary length of time ago. That's pretty difficult to do with standard convnets or fully connected layers. Related to that is this idea of memory: really, the thing that makes it possible to handle long-term dependencies is to have some concept of memory.

[At this point the projector and microphone cut out and there's a short interruption while the A/V setup is sorted out; the room will get a permanent setup before the next session, since I'm off to Japan on Thursday.]

Okay, so the fourth piece is what
I've called stateful representation. This piece is a little harder to explain intuitively, but basically, if you think about the difference between "I went to Nepal in 2009" and "In 2009, I went to Nepal" (okay, we've got the screen back up, thank you), "in 2009" is a concept that you want to be able to apply to me: when was I in Nepal? It was in 2009. So you want a representation that can handle the idea of "in 2009" as being a unit, one that could appear in different places, and that concept of "in 2009" needs to be kept around for long enough until we find out what happened in 2009. So these three areas, stateful representation, memory and long-term dependencies, are all very much connected. (Sorry, I'm lost again... okay, I'm back.)

I'm just going to explain these two approaches a little bit more. RNNs are very often used for natural language processing, and the reason is that natural language tends to have these kinds of long-term dependencies and variable-length sequences. There is actually some successful work in using convnets for natural language processing, but imagine we were trying to parse and understand this sentence with a filter of size three: we'd basically have three characters coming in at a time, and they would only eventually join up way down at the bottom of the network. So the ability of this letter to influence the interpretation of that letter is pretty limited, because it's going through a convolution of a convolution of a convolution of a convolution; it's going to be very hard for it to learn these long-term dependencies. (Another question: wouldn't you also represent a space as a character?) That's true, yes, absolutely. Thank you.

In an RNN we're going to represent things very differently. We're going to say there's some state, which we call the hidden state h, and there's going to be a state at each time period: t-1, t, t+1. At each time period we introduce another piece of input; these are our x's, so this might be x(t), this might be x(t-1), this might be x(t+1). In comes an "n", a "t", a space, and so on. What we do is take our input at t-1, multiply it by some weight matrix, put it through a nonlinearity, and that gives us our hidden state at that point in time. Then, to get our hidden state at the next point in time, we combine the previous hidden state with our next piece of input data to calculate the new hidden state, and then we send that across into the next hidden state, again bringing in the next character each time.

What happens here (we'll see lots more detail about this shortly) is that these three arrows all use exactly the same weight matrix, and these three all use exactly the same weight matrix as well. What that means is that, over time, if the network wants to remember the fact that it's "I" who went to Nepal, it can store that in the hidden state early on and make sure it stays there as long as it needs it. It doesn't have to try to combine everything into one big tree; we can just have the time steps going along like this. This is called the unfolded representation of our recurrent neural network; the folded representation is the one on the left-hand side, and these are two ways of writing the exact same thing.
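To make that unfolded picture concrete, here's a minimal NumPy sketch of the recurrence. The sizes, variable names and random weights are purely illustrative, not anything from the talk's notebook; the point is just that the same three weight matrices are reused at every time step.

```python
import numpy as np

# Illustrative only: made-up sizes and random weights.
n_input, n_hidden, n_output, n_steps = 4, 3, 4, 8

rng = np.random.RandomState(0)
Wx = rng.normal(scale=0.1, size=(n_input, n_hidden))    # input  -> hidden
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden -> hidden, reused every step
Wy = rng.normal(scale=0.1, size=(n_hidden, n_output))   # hidden -> output

def sigmoid(z): return 1 / (1 + np.exp(-z))

xs = rng.randint(0, 2, size=(n_steps, n_input)).astype(float)   # dummy inputs, one per step
h = np.zeros(n_hidden)                                          # initial hidden state
ys = []
for x in xs:                        # "unfolding": step through time with shared weights
    h = sigmoid(x @ Wx + h @ Wh)    # new hidden state from the input and the old state
    ys.append(sigmoid(h @ Wy))      # a prediction at every time step
```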
In fact, maybe we should go back another layer. You might remember from a couple of weeks ago we talked about this idea that there's one way of drawing a neural net where you say: here are all the inputs, here are all the activations (hidden nodes), here are the outputs, and there's a weight connecting each of these and a weight connecting each of those, and you could draw it all out like that, with some kind of activation function and then another bunch of weights. That's a pretty ugly way of drawing a neural network, so we learned that what we can do instead is just draw it like this. This is a way of describing what the book calls a computational graph, and in general, when the computational graph is a neural net, the lines tend to represent a matrix multiplication followed by some kind of nonlinearity, and the circles represent the matrix or tensor that you get as a result.

So here's a tensor. For those of you who don't know, a tensor simply refers to a multi-dimensional array: it could be a vector, it could be a matrix, or it could be any n-dimensional array. Any time you hear "tensor", if you come from a computer science background, you might prefer to think "multi-dimensional array" in your head. We've got some kind of input tensor; this line represents a matrix multiplication followed by a nonlinearity, which creates some other tensor; and then there's a feedback loop, which means a delay of one time period, again with a matrix multiplication and a nonlinearity, back to itself. If you take this at one particular time step, x and h, then this here is going from h(t-1) to h(t), so you can see how the right-hand side has unfolded the left-hand side. Generally speaking, the folded version is a pretty nice, clear way to draw things when you're explaining a network architecture to somebody; I find the one on the right-hand side a better way to draw things when I'm trying to work through the details of some calculation, particularly a derivative. So we're going to be going backwards and forwards between these two ways of looking at recurrent neural nets today.

Right, so let's look at an example. This is from Andrej Karpathy's blog, which always has great stuff. Here's a classic example of something we might be trying to model: we've got some input characters, h, e, l, l, and we're building a model that's trying to predict what the next character will be, so the target characters might be e, l, l, o, because the original word was "hello". We start with our letter h, and we have to turn it into a vector. The most common way to do that is a one-hot encoding, which probably most of you have come across already. It simply says: since in this case there are four possible letters, we're going to have four places, and we'll just put a one in one particular place, so an h always has a one in the first spot, an e in the second spot, an l in the third spot. This is called one-hot encoding, and you'll see it all the time whenever you're dealing with natural language processing. An important thing to mention is that this is used even if your input is a whole dictionary of words. There are about 160,000 words in the English language, so conceptually, for one-hot encoding, every word will
be a 160,000-element vector with a single one and everything else zero. I say "conceptually" because in real life you would never do that; there's no point storing it that way. If you think about it, a matrix multiplication between a one-hot encoded vector and any other matrix or vector is simply the same as looking up a single row of that matrix. So in practice, when you code these things, if you're using a one-hot encoding you don't actually create these vectors of ones and zeros; you just tell the computer "this is a one-hot encoded vector", and it knows that every time it would do a matrix multiplication it should instead just do an indexing operation. It's one of those things that sounds like a minor issue, but if you don't realize that's the case and you do it the clunky, pure-math way, your code is never going to finish running.

Okay, so we've got these four one-hot encoded vectors; that's our input. Once we've got our h, we're trying to predict that the next character might be an e. (You've chosen four units in your hidden layer; is that specifically always the length of your string minus one?) No, it's actually got one hidden layer with a recurrence in it, so this matrix, this matrix and this matrix are all the same weight matrix. (So it's doing its own self-propagation?) Yes. This is the unfolded version we're looking at here, and all of the arrows of the same color are using the same weight matrix: all the blues are W_y, all the reds are W_x, and all the greens are W_h. (So you don't have four separate neurons?) That's right; we've basically got one neuron that links back to itself, but if your input is of length four, you're going to have to step through this recurrence four times. (That wasn't clear when I was reading the chapter.)

So in this case we started with a vector of four and we've ended up with a vector of three, so there must have been a 4x3 matrix; that's our W_x, and then it's been put through some kind of nonlinearity to create this, which is our hidden state. Now, there's nothing magic about the fact that there are three things in this hidden state; you can pick any hidden state size you like, and you do that just by deciding how many columns to put in your matrix. Your input matrix can have any number of columns, just like in a CNN you get to pick how many filters you want. That basically decides how much information you're trying to store in this distributed representation. Then, to go from hidden layer to hidden layer, you're going to need (in this case) a 3x3 matrix; it starts with three and keeps three, and remember all of these green arrows represent the same matrix. And to get from my hidden layer to my output layer I need a 3x4 matrix, so out of here pop four numbers.

We know that the target is e, and we know that an e would be a one in the second spot; you can see that shown green and bolded, and that's the one we would like to be a high number. Unfortunately we were wrong: in this case the fourth one was actually the highest number. So this allows us to create a loss function: we can say what the target vector would have been, and then just do a standard maximum likelihood or whatever on the difference between these two vectors.
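Here's a small NumPy sketch of that "hello" example with made-up weights: a vocabulary of four characters, a hidden state of three, so the weight matrices are 4x3, 3x3 and 3x4. It also checks the point above that multiplying by a one-hot vector is just a row lookup. The names are mine, not Karpathy's or the notebook's.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
onehot = np.eye(4)                      # one one-hot row per character

rng = np.random.RandomState(1)
Wx = rng.normal(size=(4, 3))            # input  -> hidden
Wh = rng.normal(size=(3, 3))            # hidden -> hidden
Wy = rng.normal(size=(3, 4))            # hidden -> output

def sigmoid(z): return 1 / (1 + np.exp(-z))

# A one-hot vector times a matrix is just a row lookup, which is why real
# implementations index instead of materialising all the zeros.
i = vocab.index('e')
assert np.allclose(onehot[i] @ Wx, Wx[i])

h, loss = np.zeros(3), 0.0
for inp, target in zip('hell', 'ello'):            # predict the next character
    h = sigmoid(onehot[vocab.index(inp)] @ Wx + h @ Wh)
    y = sigmoid(h @ Wy)                             # four scores, one per character
    loss += np.sum((y - onehot[vocab.index(target)]) ** 2)   # sum-of-squares loss
```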
In fact, in most of today's examples I'm just going to use sum of squared errors for my loss function. It works pretty much as well, and it's a bit easier to reason about when that's what you're more used to. But in general you could have some kind of loss function here, and then all of these losses get added together to create the total loss for this particular input sequence. So that's a single example of one word going through an RNN.

(When you start at h, wouldn't you be trying different versions of the hidden layer, or do you just have one?) So this is assuming we already have some W_x, some W_y and some W_h; we haven't talked about how we're going to create those weight matrices. For now, let's just assume they're full of random numbers; given how crappy these outputs are, that may well be the case. In general, just like with convnets, we're going to start by initializing our weight matrices somehow (we'll talk about that soon), and then try to make them better and better, gradually, using some optimization method. You won't be surprised to hear it will be some kind of SGD.

(To go back to your loss function: can you say a bit about why you would use sum of squared errors versus log likelihood?) We would normally use log likelihood, normally a multi-class cross-entropy loss function. It's just that in some of the code I'm going to be writing from scratch later on, I'm going to use sum of squared errors, because I've found empirically it works pretty much as well, and I find it sometimes a little easier to explain because more people are familiar with it. (The stream is asking if you could repeat the questions.) Oh yeah, no problem; that question was probably obvious from context, but remind me if I forget.

Okay, so that's the background. I'm more into writing code than looking at slides, so let's get into some code. I always think code is easiest to understand when the problem is small enough that we can run it in a couple of seconds, so we can iterate lots of times, and the data is small enough and intuitive enough that we can look at it and debug it. One of the challenges I had this week in preparing this was that I couldn't find any really clear, simple LSTM or GRU examples on the internet anywhere, so I had to build one from scratch, and in doing so I really realized how incredibly hard it is to get it to work properly. As you'll see, it's not complex; it's just that until you get the structure right, it's very easy to screw it up. So it definitely helped that I decided to use a very small, simple example.

The small, simple example is that we're going to learn how to add binary digits. The idea for this problem comes from the website of a guy who calls himself "iamtrask", so thank you to that person (he or she; I think it's a he). The reason that adding binary digits seems like a great thing to use recurrence for is that you have to remember whether or not you have a carry digit: when you're adding binary numbers, you have to remember whether you have carried the one or not. Not a terribly hard thing to do, but you have to remember it for the length of the whole sequence.
Also, you might want to have 8-bit numbers, 16-bit numbers, 32-bit numbers, so it would be nice to have something that can handle varying lengths as well. So this is our goal: to get a computer to learn how to do this.

I should mention, as I said earlier, that this is a bit of a trial run for one of the later classes of the Data Institute course. If anybody here or online is interested, go to course.fast.ai and it will redirect to a page where you can sign up. It starts at the end of October, which I think is the day after we get back from Japan, so if I'm a little jet-lagged that day, apologies. And this is the chapter of the book we're very roughly going through today. There's lots of stuff in the book that I'm not going to cover; I'm happy to answer questions about it, but I'm trying to focus on the things I know from experience are actually most important. (Am I right that this mic is not working? Then I should take it off... fine, I won't wear it; it makes me look like Madonna or something.)

Okay, so step one is to remind ourselves how to fit weights in general. (Is the font size okay?) Yes, probably okay for us. I just want to remind ourselves how to train parameters using SGD. Funnily enough, in the book group so far we've never actually done it with code, only with math, so it's probably useful for all of us. I wanted to start with something super simple before we start optimizing a recurrent neural net: how do we optimize something which is trying to learn a*x + b?

So we define the linear function ax + b and we say the actual a and b are 3 and 8. We generate 30 random numbers, we generate the ax + b version of them, which gives us an x and a y, and, not surprisingly, the scatter plot is a line. If we wanted to go back and figure out what the 3 and the 8 were, not already knowing them, the way we would do that is to define a loss function (in this case I'm using the sum of squared errors), guess what a and b might be (which tells us our loss at that initial guess), figure out the derivatives of the loss function with respect to each of our two parameters, pick a learning rate, and then define how we update our guesses given some predictions. The answer, of course, is to find the derivative of the loss with respect to each of the two parameters and update each parameter by subtracting the learning rate times the derivative. If we now run that update 10 times and animate each step, we get... okay, there it goes.
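A rough reconstruction of that warm-up might look like the following. The starting guesses, learning rate and number of steps are my assumptions rather than the values used in the talk.

```python
import numpy as np

np.random.seed(0)
a_true, b_true = 3.0, 8.0
x = np.random.uniform(0, 1, 30)          # 30 random inputs
y = a_true * x + b_true                  # the line we want to recover

def sse(a, b):
    # sum-of-squared-errors loss for a candidate (a, b)
    return np.sum((a * x + b - y) ** 2)

a_guess, b_guess, lr = -1.0, 1.0, 0.01   # arbitrary starting guesses and learning rate
for step in range(500):
    pred = a_guess * x + b_guess
    # analytic derivatives of the SSE loss with respect to a and b
    dl_da = 2 * np.sum((pred - y) * x)
    dl_db = 2 * np.sum(pred - y)
    a_guess -= lr * dl_da                # the update: parameter -= lr * derivative
    b_guess -= lr * dl_db

print(a_guess, b_guess, sse(a_guess, b_guess))   # approaches 3, 8, 0
```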
So we're basically going to be doing the same thing, but rather than having two parameters we're going to have more parameters, and once you have more parameters you can't really plot it like that, which is why we started there.

Okay, so let's do the RNN version. In the RNN, remember, we are adding up binary digits, so conceptually we have two inputs, each input is eight time units long, and one output, which is also eight time units long. So here are those definitions: two inputs, one output, and eight digits. Then we can pick however many hidden units we want, just like picking how many filters we want in a CNN, so we're going to use 15. Clearly, remembering to carry the one when doing binary addition doesn't really require a 15x15 matrix, but there's no reason not to use a size a bit bigger than you need until you find you overfit. A really useful piece of advice I've mentioned in the book group before, and it's well worth repeating, is that any time you're building any kind of model, start by trying to overfit, because if you can't overfit, there's no use regularizing or whatever. So start with a model that can overfit, and then try to regularize it. If 15 turns out to be too much we can worry about that later; we'll know because our training error will be much lower than our test error.

So I'm going to create some random data: some random inputs between 0 and 127, and the corresponding outputs. (Let me turn off my f.lux... "turn off until sunrise", there you go.) So we create two random inputs and then add them together for our output. Here's an example of some outputs, and here's an example of some inputs: 44 + 10 = 54. Doing well so far. Then we need to turn them into binary, and here's an example of one of the binary pairs of numbers. What I'm going to do is reverse the binary numbers, and that's just because, rather than scanning from right to left, which is what you normally do when you add, I want to scan from left to right, because that makes the code easier; so I reverse the order of both of my binary numbers. There are a few examples of some inputs, and a few examples of some outputs. Nothing RNN-specific there; that's just setting up the problem.

So we're going to start at the highest level, and the highest level is probably to use a library called Keras. Keras is a wrapper that wraps around either Theano or TensorFlow, and it presents the same API regardless of which one you use. Quite a few things with RNNs tend to be faster if you use Theano; also, Theano works on Windows. So I use Theano as my Keras backend, but it's literally a single line of code to change, and it won't make any difference either way. Theano and TensorFlow are really computational graph builders (we're going to see a lot of Theano in a moment, and TensorFlow would look very similar). A computational graph builder is basically something that lets us describe a computation, and it then compiles it for the GPU; obviously this kind of stuff you generally want to run on the GPU. Keras sits on top of that and provides all kinds of neural-network functionality using the computational-graph functionality of Theano and TensorFlow.

This is easiest to explain with an example. I'm going to turn my input and my output into slightly changed versions called X and y; specifically, I'm just going to change the order of the axes a bit. Keras, and many RNN libraries, expect the first axis to be the items, in this case each pair of binary numbers; the second to be the time dimension, in this case the eight digits; and the third to be the input features, in this case our two binary digits. That's why I'm transposing it into that order, and I'm doing the same thing for my output. Even though my output has just one feature, I still need to create an axis for it so that the two tensors have the same dimensionality, so I just add this axis with a single element. (Excuse me, what is the zero dimension?) The zero dimension is the 10,000: I created 10,000 random pairs of numbers, so the first dimension, the zero dimension, is those 10,000 items. In fact, it might be easiest to show you the shape: you can see the shape of X is (10000, 8, 2) and y's shape is (10000, 8, 1). So it's which sample you're on.
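Something like the following sketch reproduces that data setup. The helper name, the random seed and the exact way of reversing the digits are mine, but the shapes match what's described above.

```python
import numpy as np

n_samples, n_digits = 10000, 8

def to_binary(n, n_digits=8):
    # least-significant digit first, e.g. 44 -> [0, 0, 1, 1, 0, 1, 0, 0]
    return [(n >> i) & 1 for i in range(n_digits)]

rng = np.random.RandomState(42)
a = rng.randint(0, 128, n_samples)
b = rng.randint(0, 128, n_samples)
c = a + b                                         # e.g. 44 + 10 = 54

# (samples, time steps, features): X is (10000, 8, 2), y is (10000, 8, 1)
X = np.stack([[to_binary(v) for v in a],
              [to_binary(v) for v in b]], axis=-1).astype(np.float32)
y = np.array([to_binary(v) for v in c])[..., None].astype(np.float32)

print(X.shape, y.shape)                           # (10000, 8, 2) (10000, 8, 1)
```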
Okay, so in Keras, defining an RNN is this simple. You pretty much always start with this word Sequential in Keras, which just says "I'm creating a neural network where I'm going to tell you each of the layers in sequence", and then you have an array where you list the layers. My first layer is the RNN layer, and my second layer is a Dense layer, meaning a fully connected layer. My RNN layer has an input shape of (8, 2) for each sample, my output dimensionality is n_hidden, which is 15, we're going to use a sigmoid activation function, and we're going to initialize the weights using this function, which is a normal with a mean of zero and the standard deviation given there. Then we take that and stick it into a fully connected layer, but it's going to be a slightly special kind of fully connected layer. If you think about it, the output of this RNN is going to be one element for every time step, that is, for each of the eight digits, so conceptually we want to compute eight separate dense layers. To tell Keras that we want it to distribute this layer's computation across the time dimension, you use this wrapper, wrappers.TimeDistributed; that says "create eight separate dense layers".

(Jeremy, aren't we then in fact keeping separate neurons, just smeared across time? It seemed like you were saying earlier that there's really just one neuron and the hidden activation is just passed back recursively.) Yes, kind of. The eight steps are done always with the same weight matrix from hidden unit to hidden unit, the same matrix coming in, and the same weight matrix coming out, but then we need to combine these together in order to calculate our loss, and that's where this Dense layer comes in. Actually, in this case it's even simpler than that: because I've said TimeDistributed, it's going to create eight separate dense matrices, and because the output is only of size one, this is nearly unnecessary; it's just going to take our 15 hidden units (since the RNN's output dimensionality is 15) and turn them into an output of size one. But it does keep each of the eight time steps separate (it's time-distributed), so they won't actually get added together until later on, when we finally compute a loss; that's where they get added together. Sorry, I stated that slightly wrong the first time.

You can see that defining a Keras model and running it is pretty simple, to say the least; there's not much code. Once we've defined it, we can get a summary of it, which shows us how many parameters there are at each layer and how many parameters in total; it's still a pretty small model. Once we've defined the model, we have to train it, because at first we have random weights. To train it we say: use SGD with this learning rate and this momentum, trying to optimize this loss, and then it says "compile".
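Here's a hedged sketch of that Keras model, written against the current Keras API rather than the Keras 1 + Theano setup used in the talk (so argument names like `learning_rate` vs. `lr` and `epochs` vs. `nb_epoch` may differ for older versions, and the custom normal initializer is omitted). `return_sequences=True` is needed so the recurrent layer emits an output at every time step for TimeDistributed to consume.

```python
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, TimeDistributed
from keras.optimizers import SGD

n_hidden, n_digits, n_input, n_output = 15, 8, 2, 1

model = Sequential([
    # one recurrent layer; return_sequences=True gives an output at each of the
    # eight time steps rather than only at the end
    SimpleRNN(n_hidden, input_shape=(n_digits, n_input),
              activation='sigmoid', return_sequences=True),
    # apply the same Dense(1) at every time step
    TimeDistributed(Dense(n_output, activation='sigmoid')),
])

model.summary()
model.compile(loss='mse', optimizer=SGD(learning_rate=0.1, momentum=0.9))
model.fit(X, y, batch_size=32, epochs=8)      # X, y as built in the data sketch above
preds = model.predict(X[:3])                  # compare against y[:3]
```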
Which is actually interesting, because it really is compiling: it literally takes the computational graph that it built back here and calls NVIDIA's nvcc compiler to compile CUDA code behind the scenes. So when I say model.fit, it's now not running Python, it's running CUDA, and it did it very quickly. Say we've got to an accuracy of one; we can now try running a prediction and check it against the actual values, and you can see, yep, that was pretty good: this one should have been zero and is actually 0.1, this one was 0.9 and should have been one, but overall it was pretty close. And I'm sure if we ran it for one or two more epochs... yeah, now it's there.

So that's the highest level, and for anybody who wants to code something up without worrying too much about the detail, that's all you need to know. If you want to go further, you need to understand what Keras is doing behind the scenes with Theano or TensorFlow, and that's what I'm going to show next. (Somewhere in there is this idea that you have to add up all the losses of the outputs, etc.?) Yes, that's all hidden away. The question was that hidden somewhere in there is the idea of adding up the eight time-distributed losses, and indeed that's hidden away inside this loss= argument: when there's more than one item in the output, it's going to add up the losses from all of those items. Great question.

Okay, so for me this is where it starts getting interesting, and this is also where it starts getting hard to find working code on the internet that actually shows you what's going on behind the scenes. So let's take a look. Theano does not give us all of these things to import: Sequential, SGD, Dense, Dropout, RNN and so forth. What Theano provides is functionality that's very similar to NumPy, or BLAS, or whatever linear algebra stuff you're used to, and we have to build our RNN and our SGD from that functionality. If we do so, we get the benefit of having it compiled and run on a GPU for free. The nice thing about using this instead of Keras is that we get to play with all kinds of other variations of the algorithm that maybe Keras doesn't currently have; also, Keras lets you provide an arbitrary Theano function pretty much anywhere (an initialization, a layer, a loss function), so when you understand Theano or TensorFlow, you can plug it into Keras. I think those are both good reasons.

So let's try to replicate what we just did, in Theano. Because we don't have the idea of layers and so forth, we're going to have to manually initialize each of our weight matrices to be the right size; that's why I've created a weight-initialization function. You provide some rows and some columns; it uses NumPy to create normally distributed random variables of the size you ask for, and then it wraps them in this Theano concept called "shared". Shared stands for shared variable, and what it does is create a computational-graph node that can be passed off to the GPU, or included in our computation, and it will be transferred to and from the GPU as necessary to make things work. It basically tells Theano "this is now your baby to manage", and I'll be able to use the result in Theano expressions. I also want to be able to initialize a bias. In general, for a bias, you either want to start with all zeros or all ones, depending on the situation; for most of the things we'll do here, it will
be all zeros, and that's why I have the bias initialization working that way. The reason you generally want to start with all zeros is that the sigmoid function, if that's what you're using, has its nicest gradient at that point, which is one good reason to start with zeros.

The nice thing about Theano compared to something like Caffe or Microsoft's toolkit is that we're using real Python here, so we can write functions on functions on functions and create our own abstractions. For example, we can create a function that builds our weights and biases all at once; that's what I'm doing here, so I get a tuple of weights and biases in one go. To give you a sense of how we can build a neural network with this functionality, you can see here that I'm creating all of the weights and biases we're going to need for a recurrent neural network. Here are my W_x's, the weight matrix that goes from the input to the hidden layer, so the weight matrix is n_input rows by n_hidden columns, and the bias has n_hidden columns. My W_y's go from the hidden layer to our outputs, so that needs to be n_hidden by n_output. And then finally we've got the W_h's. The W_h's, you'll see, I'm creating in a different way, and this is a really cool trick. There is an unfortunately not-well-enough-known paper, this one here, by Quoc Le and some guy called Geoffrey Hinton, which pointed out (not even a couple of years ago, I guess about eight months ago) that if you initialize a recurrent network using an identity matrix, it works incredibly well. In fact, they found that if they didn't use an LSTM but just used a standard recurrent network with an identity matrix as the starting point for the hidden weights, they got very good results on some real, large problems. So that's why, for my hidden weights, I'm calling a different function, the identity-and-bias one, and as you can see, all that's doing is calling np.eye (that's "eye" for identity) with our n, and initializing the bias as well. So that's a cool little trick. That's created all of our weights, and then I just chain them all together to create a complete list of all of the weights.

The other thing we do in Theano is tell it what all of the inputs to our algorithm are: we tell it that we need a matrix called "input", a matrix called "output", an "h0" that's going to be our initial hidden state, and a scalar called "learning rate". Notice we're not setting these to anything, because at this point all we're doing is compiling the computational graph; it just has to know the dimensionality of each of these tensors (that's why I call them t_-something: they're tensors). So we've told it that the arguments to our function are those four things. That's set up all of the weight matrices and all of the arguments, so at this point we can tell it how to do one step of a recurrent neural network.

One step of the recurrent neural network will need our input vector, our previous hidden vector, our hidden weights, our hidden bias, our input weights, our input bias, our output weights and our output bias. Then all we do to calculate our hidden state is: take our input, matrix-product it with our input weights, and add the bias; do the same thing with our previous hidden state and our hidden weights, and add that bias; and then stick all of that through a nonlinearity. To remind you, that's this part of the picture: input and hidden, added together, then the output; W_x, W_h, W_y.
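Here's a reconstruction of that setup under assumptions: the helper names (init_wgts, wgts_and_bias, id_and_bias), the initialization scale, and the exact argument order are mine, but the structure follows what's described above, with shared variables for the weights, zeros for the biases, the identity trick for W_h, and the step function itself.

```python
import numpy as np
import theano
import theano.tensor as T

n_input, n_hidden, n_output = 2, 15, 1

def init_wgts(rows, cols):
    # normally distributed weights, wrapped in a shared variable so that
    # Theano (and the GPU) owns them and can update them in place
    return theano.shared(np.random.normal(scale=0.01, size=(rows, cols))
                           .astype(theano.config.floatX))

def init_bias(n):
    # biases start at zero
    return theano.shared(np.zeros(n, dtype=theano.config.floatX))

def wgts_and_bias(rows, cols):
    return init_wgts(rows, cols), init_bias(cols)

def id_and_bias(n):
    # the Le & Hinton trick: start the hidden-to-hidden weights as the identity
    return theano.shared(np.eye(n, dtype=theano.config.floatX)), init_bias(n)

W_x, b_x = wgts_and_bias(n_input, n_hidden)   # input  -> hidden
W_h, b_h = id_and_bias(n_hidden)              # hidden -> hidden
W_y, b_y = wgts_and_bias(n_hidden, n_output)  # hidden -> output
w_all = [W_x, b_x, W_h, b_h, W_y, b_y]

def step(x, h_prev, W_x, b_x, W_h, b_h, W_y, b_y):
    # one time step: combine the new input with the previous hidden state,
    # squash it, and make a prediction from the new hidden state
    h = T.nnet.sigmoid(T.dot(x, W_x) + b_x + T.dot(h_prev, W_h) + b_h)
    y = T.nnet.sigmoid(T.dot(h, W_y) + b_y)
    return h, y
```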
So that's the entirety of the calculations for a simple recurrent neural net; there's not much there. We're going to return two things: the hidden state, because we'll need it to calculate the next set of hidden units, and our predictions, because that's what we're interested in. So that's one step.

How do we do lots of steps? There's a very handy function in Theano called scan. The scan function actually comes from APL, a language that goes back to about 1962, so it's a really old idea, but it's a great one. Scan is a function that takes as its argument another function, in this case our step function, and a sequence to apply it to. It's going to apply our step function to every element of our input sequence, and it's going to output an array for each of the two things we return. For one of those things it needs a starting point, so we give it a starting point; and our function also takes lots of other things, the "non-sequences", which in this case are all of these weights. So when we run theano.scan, it's going to call our step function with our input x, our hidden h, and all of our weight matrices and biases.

That's kind of neat; it's something you might normally use a for loop for. Why not use a for loop? This is actually quite fascinating, in my opinion. Think about how a GPU works: a GPU basically has thousands and thousands of threads running at the same time, and most of them have to be running the same operation at almost exactly the same time. In this case we've got something where, if we look at our step, we're calculating h, and we're calculating it based on h; it's like incrementing something. So we've got a calculation where h comes in, goes through some function, and a new h comes out. (Let me use a different color and draw on the board.) We've got an h going into some function and producing a new h. Now imagine you try to have a thousand threads do this at the same time: each of them is going to read the same value of h, calculate its own new value of h, and then all set the new value of h at the same time, and that's not going to do anything useful at all. This is one of the big challenges with programming GPUs: how do you parallelize things?

In 2007, somebody implemented on a GPU a really neat way of doing a parallel scan, and here's basically how it works. You take your input, in this case eight inputs, and eventually this is the output I'm trying to get. In this case it's doing a sum, but it doesn't have to be a sum; it's just something that repeatedly applies some kind of function, in this case plus. So how do I get from here to here in parallel? (To be clear, the bottom right, those are cumulative sums?) Exactly: you've got x0, the sum of x0 to x1, x0 to x2, x0 to x3, and so forth. Thanks, Rachel. Step one: we can add up zero and one, two and three, four and five, six and seven, and those can all be done in parallel; then in parallel we can add these pairs, and in parallel we can add those pairs. This is called the forward shuffle, or the forward pass, and in the
backward pass we do something pretty similar, with a combination of copies and additions, and you can see that using that process you end up with a cumulative sum, or a cumulative whatever, computed in parallel. The number of steps it requires, if your input is of size n, is log2(n), or at most twice that, so it's pretty efficient. That's why Theano doesn't have a for loop (you can't really run a for loop on a GPU) but it does have a scan. And scan is quite a convenient way to write this kind of thing anyway.

(Sorry, maybe I missed something, but there are still temporal dependencies, right? We cannot compute h1 before there is an h0, or h2 before there is an h1.) You absolutely can, because our function is defined such that it takes some previous state and some new input value as its parameters, and it spits out the new state for the next one. It's a pure function, which means that for every pair it will always return the same thing, and because of that, think about this case where the function is plus: you can add zero and one, two and three, four and five, six and seven; then you can add those pairs, and that's the same as adding zero through three; ditto here, adding this pair is the same as adding four through seven; and finally you add those two, and that's the same as adding up the whole lot. It's kind of counterintuitive, but when you think about the math, it works out. (So this works for sums, but does it work for more complex operations, where you don't have things that commute?) Exactly, you have to make sure that basic commutative property is there, and in the things we're doing, where it's just these additions and that nonlinearity, it works out fine. (Did you have a question? Okay.)

So there's our scan. In the next section I'll show you this in pure Python; originally I did it with a for loop, and I actually found it easier with a scan. (Does scan always take log n steps, or does it sometimes devolve into linear?) As far as I know it always does log n steps. I haven't looked at the Theano code, but in all of the GPU programming texts I've seen, whenever somebody talks about scan they always mean this implementation of scan. If you're interested, just Google for "GPU scan": there's a fantastic free book called GPU Gems (that picture I just showed you I stole from there) and it shows you the whole history of the scan algorithm and how it's implemented; it's a really terrific resource. Udacity also has a really great course on heterogeneous parallel programming which covers these topics in more detail.

Okay. So now that we've described how to create our outputs, we can calculate our error. Interestingly, you'll notice that so far we have not at any stage actually referred to any of our input values, any of our binary digits. Everything we've done has been on these symbolic variables and tensors, so this error is just a description of how to calculate an error; it's not actually doing anything yet, and it doesn't have any data.
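Continuing the sketch above (same caveats about names), the symbolic inputs, the scan over the sequence, and the sum-of-squared-errors loss might look like this. Nothing here touches real data yet; it only builds the graph.

```python
# symbolic placeholders for one training example: no data is attached yet
t_inp  = T.matrix('inp')    # one sequence of inputs:  (n_digits, n_input)
t_outp = T.matrix('outp')   # its target sequence:     (n_digits, n_output)
t_h0   = T.vector('h0')     # the initial hidden state
t_lr   = T.scalar('lr')     # the learning rate

# apply `step` to every time step of the input; the hidden state is fed back,
# the prediction is not, and the weights go in as non_sequences
[v_h, v_y], _ = theano.scan(step,
                            sequences=t_inp,
                            outputs_info=[t_h0, None],
                            non_sequences=w_all)

error = ((v_y - t_outp) ** 2).sum()   # sum of squared errors over the sequence
```

Note that `step`'s argument order (sequence element, previous output, then the non-sequences) has to match the order scan passes them in, which is the point made a little later about the scan documentation.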
(A question from the stream: do we have to take special precautions when defining a function like step to use with Theano's scan?) Honestly, I only use theano.scan for things where I know it's appropriate already; I haven't studied it closely enough. But in general... is it the commutative property, Rachel, that we'd be looking for? (Associative.) Associative, okay, sorry; Rachel's the mathematician. So I guess as long as it's an associative function it should work, and it needs to be a pure function as well, of course. The API itself also forces you to only use elements of a sequence and some previous calculation result, so to that extent it kind of ensures it for you. Thank you very much.

So here's the best bit. Remember last time when we calculated the gradient... actually, no, we haven't done that yet; we will look shortly at calculating gradients by hand. For now, to calculate the gradient we just type T.grad of the loss function with respect to all of our weights, and that gives us all of our gradients. How does it do it? It's doing automatic differentiation: it knows how to differentiate every operation in Theano, it knows the chain rule, and it just does it. So at this point we have a computational graph that calculates gradients.

Given that we know how to calculate gradients, we can now define our Theano function. We tell it what it takes (all of our arguments, so our initial state, our input, our output and our learning rate), it produces the error (the thing we calculated up here), and at each step I want it to make some updates to improve our predictions. The updates are in this dictionary, which is calculated by this function: for each weight, set it equal to weight minus gradient times learning rate, for every weight and gradient in all of our weights and gradients.

(A question pertaining back to scan: is there a way to have a dynamic computation in scan, for example h_t computed using h0, h1, ..., h_(t-1), for t = 1 to T?) Because h is using the previous h, you can keep whatever state you need there; you're not given all of the previous states, so it's up to you to aggregate them in some way.

Okay, so this defines stochastic gradient descent; basically, this defines our update rule. Here's our list of updates, and what Theano does with theano.function is: every time we call fn, it passes all of our arguments to our error function and then does all of these updates. (Question: where is the dropout or noise to make it stochastic?) You don't need dropout to be stochastic. Stochastic gradient descent is stochastic because we're not analytically solving for the minimum, and we're not even computing the gradient on the full sample. Dropout is a method of regularization, and I would not add regularization until I know I need it; first of all, try to overfit. (So you're not doing finite differences; it's symbolic, right?) Yes, it's symbolic: when we call T.grad, as I said earlier, it knows the analytical derivative of every operation and it knows the chain rule, so it's calculating full analytical derivatives for all of our weights. (So why say it's not analytical?) It's not doing an analytical minimization of the objective function; it is doing an analytical calculation of the gradient. Compare it to, say, using a matrix inverse to solve a linear regression: that's not SGD, whereas this is SGD.
(Can you show the code for that again?) Sure, it's exactly the same thing: take the weight and subtract the learning rate times the gradient. (In my mind this is just gradient descent from derivatives.) Yeah, and it becomes stochastic at the point where you're doing it on less than the whole batch.

(A really basic question: a variable like w_all, is that just a list of values, or a dictionary of variable name to value?) It's actually neither; I'll show you. Here's W_h, which is a tuple (it's a tuple because the weights-and-biases helper was defined to return a tuple), and it's a tuple of what? A CudaNdarray float32 matrix. So it has no data in it at all; it's just a symbolic concept of a float32 matrix. And w_all is a list of those CudaNdarray variables. (So how does it know which ones to match with the parameters to step?) The way it knows that is that the theano.scan documentation tells you it will pass them in the following order: first each element of "sequences", then each element of "outputs_info" that is not None, and then each element of "non_sequences", in the order you define them. As for the theano.function itself, the order of the items will be whatever this list is, the argument list I defined earlier: first h0, our initial hidden state, then our input, then our output, and then our learning rate.

So in the next step we actually go through each of our 10,000 binary-digit pairs, and we call our function, passing in our initial hidden state (which is just a bunch of zeros), our input item, our output item and our learning rate. It's going to run everything necessary to calculate the error, and it's going to do our updates, all of these, using the gradients calculated up here. That's actually all you need to do SGD; I've just added something that, every thousand iterations, anneals the learning rate, resets the error and prints out the error so far. So let's run that... there it goes.

You'll see that this is running a lot slower than Keras. Why is that? It's because in Keras we told it to use a mini-batch size of 32, so it was passing along 32 data items at a time and doing all of them in parallel on the GPU. Because that code would be much more complex in Theano, and I wanted to keep it simple for teaching purposes, we're doing one at a time. That means this code has to take every item, transfer it to the GPU, call the GPU kernel, and transfer it back to the CPU, again and again. So this kind of purely online gradient descent is actually going to be slower on a GPU than on a CPU, because the overhead of the transferring is higher than what you gain from the parallelization.
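Pulling the pieces above together, a sketch of the gradient, the update dictionary, the compiled function and the one-item-at-a-time training loop could look like this. The learning rate, annealing factor and printing interval are my guesses, not the talk's values.

```python
g_all = T.grad(error, w_all)          # d(error)/d(weight), for every weight at once

def upd_dict(wgts, grads, lr):
    # the SGD update rule: new_weight = weight - learning_rate * gradient
    return {w: w - lr * g for (w, g) in zip(wgts, grads)}

fn = theano.function([t_h0, t_inp, t_outp, t_lr], error,
                     updates=upd_dict(w_all, g_all, t_lr),
                     allow_input_downcast=True)

# pure online SGD: one 8-digit sequence at a time, X and y as built earlier
h0 = np.zeros(n_hidden)
l_rate, err = 0.3, 0.0
for i in range(len(X)):
    err += fn(h0, X[i], y[i], l_rate)
    if i % 1000 == 999:
        print('error: %.3f' % (err / 1000))
        err, l_rate = 0.0, l_rate * 0.95   # occasionally anneal the learning rate
```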
(Jeremy, it seems like this is going to be a lot harder to debug, because we're using Python to construct a symbolic representation that we can't really debug.) So, for those who don't know, Terence Parr is an expert in computer language design, and you've hit the nail on the head. Frédéric, the Theano guy, has designed a domain-specific language that is written in Python and then creates this compiled CUDA kernel, and attempting to debug Theano is... well, he has done an amazing job with the error messages. All the types know whether they're vectors or scalars or whatever, so if you have something where the rows and columns don't match, it will tell you "this row doesn't match this column". You'll also see that it optionally lets you name things, and if you do name things, it will use those names when it prints things out. It does a lot to try to help, but no, you're not going to be stepping through it. In fact the Theano compiler is pretty smart about trying to optimize things; it's an optimizing compiler, with all the complexity that involves, so this is difficult to debug. (Right, because the only place we can put a print statement is in that little for loop you built; there's no other place to go.) Actually, believe it or not, there is a symbolic print statement, so you can insert a symbolic print request. (But the point is, for the average Python programmer, we can't go in and ask Python what's in that matrix.) Yes, it's super difficult to do that, and for me, who loves graphical debuggers, forget it. NVIDIA actually provides a GPU debugger, which is fantastic; on Windows at least it runs inside Visual Studio and makes life a lot easier. Because Theano literally creates CUDA code and calls nvcc, it's actually possible to use the NVIDIA debugger on it. I've done that for other kinds of CUDA DSLs, though not for Theano, and you can also use the NVIDIA profiler. But in general, debugging Theano or TensorFlow is tricky. (So you just banged on this until it worked?) Yeah, exactly; this is what I was saying earlier, getting this thing to work took a whole lot of messing around, and sometimes you realize you've got an off-by-one error and it works, but not quite as well as it ought to. That's another reason I tried to do a lot of refactoring, so that rather than copying and pasting code I can fix things in one place.

(These parallel processing algorithms have been around for decades; are there other techniques people have used to debug them?) Shockingly, these have not been around that long. A lot of the parallel programming algorithms came with the rise of a company called Thinking Machines, which I guess was mid-to-late eighties; they were kind of the first. There were classes in this when I was doing my masters in computer science, and that was '88 or something. (I used it in '85.) Right, but the scan was not implemented on a GPU until 2007, and I don't think the algorithm was even described until the mid-nineties. I would also say there haven't been very many people doing massively parallel programming until very recently; a lot of the earlier parallel programming used MPI and message-passing stuff, which was very different, and it was all scientific computing, all "how do we get our linear algebra to work on this". (And GPUs are what we would call, or what you probably called, a SIMD machine: single instruction, multiple data. At least up until recently, aren't they?) Yeah, GPUs are quite interesting: they have big SIMD registers, so they can take a whole lot of numbers into a register and do a big vector operation on them, but they're also massively parallel in terms of having lots and lots of threads. They run in these things called warps, where you have something like 32 threads running at a time, and then lots of warps running at a time, and each one is doing the same thing at the same time, so there are many layers of parallelism. They're pretty neat things.
Okay, so after all that, we can compile a different Theano function that actually calculates our predictions rather than our errors; we can then call that function and confirm that, again, it's producing correct answers, so that's good. (Do you want to show what happens if you change some of the parameters a little bit?) Yeah, I might actually do that in my pure Python one now that I think about it, but thanks for reminding me.

So, just for fun, we're now going to pretend that that T.grad step did not exist and we had to do it ourselves, and after we've done this you will never want to do that again, I promise; I know I don't. Okay, so no Theano: just Python and NumPy. Let's build it again. We're going to need a sigmoid function and the derivative of the sigmoid function; there they are. So our activation function is going to be the sigmoid, and our derivative of the activation function is the derivative of the sigmoid; our distance function and the derivative of the distance function are there, and those are going to be our loss function and the derivative of the loss function. Set all that up.

Let's write scan. Because I'm using Python, and Python's parallel processing is such a piece of crap that we may as well not even try, we're going to do it in serial, and scan in serial is the world's simplest thing: we return an array, which is initially empty; we build it up by going through our sequence, grabbing each element, calling our function on the previous state and the next element of the sequence, appending that to our result, and setting "previous" equal to that result. That's scan in serial, not scan in parallel. Just to confirm, let's try a scan with plus, starting with zero, passing in nought to four.

So now let's do our one step; in this case it's one digit at a time. What we do is apply our activation function to our pre-hidden, where pre-hidden is the dot product of x and W_x plus the dot product of the hidden state and W_h. I'm not putting a bias in here because I'm too lazy; you probably should have a bias. Then our predictions are equal to the activation function applied to our pre-predictions, and our pre-predictions are the dot product of our hidden state and W_y. So this is exactly the same as before. Because we're using scan, we have to keep track of all of the state we need to calculate the next step, so we keep track of our total loss, our hidden state and our predictions; and because we're going to be doing backprop, we also need to keep track of everything we need to calculate the gradient, which is this pre-hidden and pre-prediction.
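A NumPy-only sketch of that forward machinery (again, all names and constants are mine, not the notebook's): the serial scan, the sigmoid and its derivative, and a one-step function that stashes the intermediate values the backward pass will need.

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def sigmoid_d(z):
    s = sigmoid(z)
    return s * (1 - s)

def scan(fn, start, seq):
    # the serial version: just a loop that threads the previous result through
    res, prev = [], start
    for s in seq:
        prev = fn(prev, s)
        res.append(prev)
    return res

print(scan(lambda a, b: a + b, 0, range(5)))   # [0, 1, 3, 6, 10]

n_input, n_hidden, n_output = 2, 15, 1
rng = np.random.RandomState(0)
w_x = rng.normal(scale=0.1, size=(n_input, n_hidden))
w_h = np.eye(n_hidden)                          # identity init again
w_y = rng.normal(scale=0.1, size=(n_hidden, n_output))

def one_char(prev, item):
    # prev carries everything needed later: the running loss, the
    # pre-activations (for the backward pass), the hidden state and the output
    total_loss, pre_hidden, pre_pred, hidden, ypred = prev
    x, y = item
    pre_hidden = x @ w_x + hidden @ w_h         # no biases, to keep it short
    hidden     = sigmoid(pre_hidden)
    pre_pred   = hidden @ w_y
    ypred      = sigmoid(pre_pred)
    total_loss = total_loss + np.sum((ypred - y) ** 2)
    return total_loss, pre_hidden, pre_pred, hidden, ypred
```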
And this is where we look at how to calculate the gradient. If you read Goodfellow's book, that's how you can gain some intuition for how the algorithm behaves mathematically; if that doesn't work for you, I'll draw a picture. (Give it another five seconds... okay, good, got it.)

So: here is time — it actually goes up to eight, but let's assume we've just got three time steps. Here's our input x, here's our hidden state, here's our predictions, here's our loss. So here is hidden one. How do we calculate hidden one? Hidden one is the activation function applied to the pre-hidden — the pre-hidden is here, and the activation function is a sigmoid. How do we calculate the pre-hidden? It's the dot product of x and Wx plus the dot product of the hidden state and Wh — so the input times Wx, and the hidden state times Wh. Our predictions are the activation function applied to the pre-predictions — that's our predictions, that's a sigmoid — and our pre-predictions are the dot product of the hidden state and Wy. Our loss is our loss function of y and ŷ, where the loss function was the distance — let's just call it loss, so this is loss(y, ŷ). This is going to be done lots of times to create lots of different losses — L1, L2 and so on — and then we're going to sum them all together.

"Can I ask a question — pre-prediction, what is that?" So, what is the pre-prediction: I'm just separating things out. If we go back to Theano, we calculated our prediction ŷ by first doing a dot product and then applying the sigmoid; I've just taken that middle bit and given it a name, and I'm calling it the pre-prediction. "So it's the pre-activation?" Exactly — the pre-prediction is just the thing we pass to the activation function to calculate our predictions.

The rest of it is exactly the same: to do time step two you're also going to need Wx and the sigmoid, your previous weights, your output weights, your sigmoid and your loss function — the same ones again. Notice that this pre-hidden at time zero requires a hidden state that doesn't exist, and that's why we had to define h0 as just a bunch of zeros — you'll see in my code that when I call this scan function, one of the things I pass in is a bunch of zeros. That's why you'll often see h0 as a parameter.

So how do we now calculate the derivative of this — let's change color — with respect to that? It's really quite easy to do when you put it graphically: you start at the end and you work backwards. We're going to go through every red arrow and create a blue arrow that goes backwards, and the blue arrow going backwards just calculates the derivative of whatever went forwards. To get from the loss to ŷ we're going to need the derivative of the loss; to get from ŷ back to the pre-prediction we're going to need the derivative of the activation function — and that would get us to here, which would be a really great first step. So let's take a look: here is one step backwards, and here is "the derivative of the loss with respect to the pre-prediction equals the derivative of the loss times the derivative of the activation". Why "times"? Because every time you take functions of functions, the derivative is a multiplication — it's just dy/dx = dy/du · du/dx. That's about the only piece of calculus you have to remember from school, and if you don't remember it, just remember that you can cross out the du's and you get dy/dx back.

So that's got us to that blue box, and now we want to be able to update Wy — I want to get from this blue box to here. Wy was being dot-producted with h, so to undo that we do the opposite, which is to take the dot product with the transpose of h. So here is our new Wy: take the transpose of the hidden state, dot it with the thing we just calculated, and then multiply by the learning rate.
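Written out as equations, the forward pass in that diagram and the first backward step look like this — a sketch, where ℓ is whatever distance function you picked and η is the learning rate:

```latex
h_t = \sigma(x_t W_x + h_{t-1} W_h), \qquad h_0 = 0        % pre-hidden, then activation
\hat{y}_t = \sigma(h_t W_y)                                 % pre-prediction, then activation
L = \sum_t \ell(y_t, \hat{y}_t)                             % one loss per time step, summed
% one step backwards (chain rule), then the W_y update:
\frac{\partial \ell_t}{\partial \mathrm{prepred}_t} = \ell'(y_t, \hat{y}_t)\,\sigma'(h_t W_y)
W_y \leftarrow W_y - \eta\, h_t^{\top}\,\frac{\partial \ell_t}{\partial \mathrm{prepred}_t}
```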
That's how we get our new Wy. How do we get our new Wx? That's over here — we're already up to here, so first we have to get to here, and to do that we multiply by the transpose of the output weight matrix. But notice the hidden unit has two arrows coming out of it, which means we need to create two arrows coming back into it. Notice also that these two arrows are not a function of a function — they're two totally separate directions — so the chain rule says you have to add them together. So the derivative here will be the d pre-hidden coming back from the next time step times the transpose of the weight matrix Wh, plus the d pre-prediction times the transpose of Wy; and then we can get this far by multiplying that by the derivative of the activation function. So here it is: (d pre-hidden times the transpose of Wh, plus d pre-prediction times the transpose of Wy), times the derivative of the activation function.

That gets us back to here, and this is going to be the thing we need in the next time step back — which is why we need to make sure that at the very end we return that value. And now we can take our Wx and update it: we've got to here, we want the derivative with respect to this, which means we multiply this by the transpose of the input — so it's the transpose of the input times d pre-hidden, times the learning rate. Getting to Wh is exactly the same process, so I won't lay it out in detail. One thing to notice is that at our last time step we would have to multiply by a derivative that doesn't exist yet, and that's why, before we start, we set the incoming d pre-hidden to a bunch of zeros. So that's it — you can learn it this way or that way; either is fine.

Okay, so that was a pain in the ass as well. A question? "Yes — please re-implement it all from scratch." Okay, no problem. "So backpropagation through time is nothing but backpropagation on an unrolled network?" Yes — backpropagation through time is indeed nothing but backpropagation on an unrolled network. You don't have to actually unroll it, and you'll notice that in my code I have not actually unrolled it; I've done what they would call the symbolic version — I've just looped through it. In Theano with Keras you can choose whether to do backprop through time symbolically or after unrolling: if you do it after unrolling, it uses more memory because it literally creates the unrolled matrices; if you do it symbolically, it uses less memory but may take a little longer if your batch size is small. With TensorFlow, I believe it only supports unrolling — at least the last time I checked. So, like many things — maybe most things — in deep learning, they give new names to things that are not in any way new: backprop through time is just backprop.

Okay, so now that we've defined all of these things, we're going to apply them to a whole bunch of pairs of input arrays and outputs — that's what this get-digits function returns. We've defined one-forward, which does a scan; we've defined one-backward; we can now set up our three weight matrices, and we can now run through our 10,000 examples doing one forward pass, calculating the error, doing one backward pass, and from time to time printing out how we're going. Interestingly, you'll see it's faster — and it's faster because we are not using the GPU and we're only doing one example at a time.
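Putting the backward step just described into code — a sketch with my own names, assuming x, the hidden states and the predictions are 2-D row vectors (or batches); `d_pre_hidden_next` is the gradient flowing back from the following time step, all zeros at the last one:

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def dist_deriv(y, pred):
    # Derivative of a squared-error style distance w.r.t. the prediction.
    return pred - y

def one_step_back(x, y, hidden, prev_hidden, pred, pre_hidden, pre_pred,
                  d_pre_hidden_next, w_x, w_h, w_y, lr):
    # Chain rule from the loss to the pre-prediction:
    # derivative of the loss times derivative of the activation.
    d_pre_pred = dist_deriv(y, pred) * sigmoid_deriv(pre_pred)
    # The hidden state has two arrows leaving it, so two gradients come back
    # into it and are added: one through w_y, one through w_h from the
    # following time step.
    d_hidden = np.dot(d_pre_pred, w_y.T) + np.dot(d_pre_hidden_next, w_h.T)
    d_pre_hidden = d_hidden * sigmoid_deriv(pre_hidden)
    # Each weight update uses the transpose of whatever was dot-producted
    # with it on the way forward.
    w_y -= lr * np.dot(hidden.T, d_pre_pred)
    w_h -= lr * np.dot(prev_hidden.T, d_pre_hidden)
    w_x -= lr * np.dot(x.T, d_pre_hidden)
    # Returned so the previous time step (the next one backwards) can use it.
    return d_pre_hidden
```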
You'll also notice — and interestingly, I'm glad about this — that it didn't work. And the reason it didn't work is that RNNs are fragile. I tried to find a learning rate and a scale that kind of worked; nine times out of ten this works — I'll do it again and it'll work... there you go. This is the thing about RNNs. I could have made it work more reliably by making my random numbers smaller and my learning rate smaller, and then it would be more likely to work, but it would work more slowly. Finding the point at which it trains in the amount of time you have and doesn't explode is quite difficult, because it's just super, super fragile. If I use a scale of five, say, and a learning rate of, I don't know, one — it doesn't get anywhere. And you'll notice I'm printing out the average derivative, which is something you always want to be doing. You can see the derivative is tiny. Why is it tiny? Because we're taking this derivative and multiplying it by a matrix over and over: if you imagine that the average of that matrix is a little bit smaller than one, and I'm doing that eight times, it's n to the eighth — and it's much worse with a longer sequence; with a sequence of 50 you've got n to the fiftieth. This is the thing they call vanishing or exploding gradients — it can go in either direction: they either get really, really big or really, really small. This is why Quoc Le's idea of using an identity matrix to initialize the hidden weights is so important: it makes it much less likely that they'll explode — but it still happens quite a bit.

Yes? "What is that parameter, scale?" The parameter scale is simply what I'm passing into my normal-distribution random generation. It takes three things — a location, a scale and a size — and the first two are just the mean and standard deviation parameters of your normal distribution. So you use mean zero and some standard deviation; if you use bigger numbers, it's got more to work with and it'll go faster — just like a higher learning rate goes faster — but it also means it's more likely that things will explode. So if I make this learning rate a little bit lower... again, there we go.

Yes? "Is there anything like batch normalization for RNNs?" There absolutely is. In fact, Ian Goodfellow told us last week that his view is that normal batch normalization should work fine as long as you do it in the right place. However, there is a more recent paper that works better, which I think is called layer normalization — it's from about six weeks ago, and again it's this Geoffrey Hinton guy; I guess he just writes a lot of papers. It's very similar to batch normalization, but it does work better with RNNs, and it actually seems to work better across the board. "So the vanishing gradients problem..." — yes, we would hope that the vanishing gradients problem might be much improved by using batch normalization or layer normalization, and that is absolutely right. We're not going to look at that today, though; instead, what we're going to look at today is something called LSTMs and GRUs, to try to tackle the same thing. Having said all that, as I said, if you just use an identity matrix, it runs very quickly and it's not too bad — so it's worth trying. So again we can just check that this looks okay by actually running the prediction function — and indeed, it's giving the right answer.
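For reference, here's what the scale parameter and the identity trick look like in NumPy — a sketch with made-up sizes; the identity initialization is, I believe, the Le/Jaitly/Hinton "IRNN" idea being referred to:

```python
import numpy as np

n_input, n_hidden, n_output = 8, 256, 8
scale = 0.1   # the "scale" being discussed: the std dev of the weight init

# np.random.normal(loc, scale, size): loc is the mean, scale the standard
# deviation. A bigger scale (like a bigger learning rate) learns faster but
# makes the gradients explode more easily.
w_x = np.random.normal(loc=0.0, scale=scale, size=(n_input, n_hidden))
w_y = np.random.normal(loc=0.0, scale=scale, size=(n_hidden, n_output))

# Identity initialization for the hidden-to-hidden weights: by default the
# hidden state is carried through unchanged, so repeated multiplication by
# w_h is much less likely to blow up or die away.
w_h = np.eye(n_hidden)
```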
Yes? Good question — they talk about the spectral radius of the weight matrix; has there been any work on using that as a regularizer, so you try to make the matrix stay in that region? So the question is about using the spectral radius of the hidden weight matrix and trying to keep it stable. In fact, the earlier attempts were basically something called gradient clipping, which was just to say: hey, let's take whatever the current proposed update is and normalize it. That works, but not terribly well — it doesn't work as well as batch normalization or layer normalization or LSTMs and so forth. As we'll see, things like LSTMs actually allow us to maintain memory for a much longer time, so the idea of gradient clipping doesn't seem to be something that people currently find they need to use.

All right. So, finally for this section — and we've got a tiny bit to go after this as well — let's talk about LSTMs and GRUs. They have a reputation for being terribly complex. They're really not, and indeed we're going to implement one from scratch to prove it. There are LSTMs and there are GRUs; we're going to focus on GRUs, because in general they are simpler, easier to understand, faster to compute, faster to code, and often give better results — so I don't see the point of using an LSTM rather than a GRU nowadays, although both are fine. "How recent are GRUs?" GRUs come from a paper which I think is from 2014. It's an amazing paper which introduced two things at once: it introduced GRUs, and it introduced what's nowadays called sequence-to-sequence learning, which we'll get to in a moment. Two of the most important steps in RNNs of the last ten years appeared in the same paper — and I think Kyunghyun Cho hadn't even finished his PhD at the time, so it was a pretty big deal.

So, GRUs — gated recurrent units. What they are is this: we're going to take this pair of squares in our RNN and make it a little bit more interesting. We've got something coming in — that's our input (it's drawn the other way around here) — and we're going to calculate our proposed new hidden state in much the same way as before: take our input through a weight matrix and stick it through a sigmoid, take our previous hidden state through a weight matrix and stick it through a sigmoid. But before we send the previous hidden state in to be added to the result of the input part, we're going to put it through a gate called r, which stands for the reset gate. r is a function — we're going to talk about how to calculate it in a moment. The important things to know are that it returns something between nought and one, that it's a function that is going to be learned, and that it's going to decide how much of our previous hidden state we want to remember: we're going to learn sometimes to remember it and sometimes not to. Our hidden state is a vector, so in fact it's going to decide sometimes to remember certain pieces of the hidden state and sometimes to forget certain pieces. Again, we're going to come back to how to calculate the reset function, but for now just recognize that it is a learned function. I think it's really important, in understanding modern deep learning, to be comfortable with the idea of a learned function without having to worry about how it's calculated — you can calculate them in all kinds of different ways; the important thing is that as long as it's differentiable, it can be learned.
Right — and we will look at how to do that in a moment. So h is going to be multiplied by r before it goes into our proposed new hidden unit. This isn't quite what we're going to use as our next hidden state, though. Instead, we're going to take it and combine it with our previous hidden state, and we're going to decide sometimes to use lots of the previous hidden state, sometimes lots of the new hidden state, and sometimes somewhere in between. It's basically a linear interpolation — a weighting of the previous hidden state and the new hidden state — again as a vector, so we can choose to largely keep our previous hidden state for some elements and largely keep the new proposed state for other elements. This uses a different gate, which for some reason here they've called z; more often you'll see it called u, for update. So we've got a reset gate and we've got an update gate; again it goes from nought to one, and it's learned. That is what's going to sit inside here, and everything else in our RNN is the same.

Before we look at the details, let's see how to do that in Keras. In Keras, we take our previous neural net, select SimpleRNN, type GRU over the top of it, press enter, and we're done. Because conceptually it's just replacing this square with a different square, there's nothing else to think about: the input shape is the same, the output shape is the same, the activation is the same, the TimeDistributed Dense is the same. So that's that.

Let's take a look. You'll see, just for fun, I've set the learning rate way up high and the scale is also high, so we're well into what used to be unstable territory. Because I'm using Keras I can easily use a nice big batch size, and you can see that very quickly it's produced a very good answer. And we could really mess around with this: maybe a scale of five — no problem; maybe a scale of one and a learning rate of 0.2 — it's a bit slower, but it's still getting there; let's run another four epochs — there we go. So you can see it's now pretty resilient to messing around with the hyperparameters, which is what we were hoping for. If we had a longer sequence, this would be even more important.

What's happened is that because we've now got something that can learn to forget things, or learn to ignore the new update — something that can learn to remember the information it needs when it needs it, and to use that information when it needs it — we've given the network a very simple option. A very simple way to see this is that we've now got a direct way of saying "just leave the hidden state unchanged". It's going to be much harder to get really exploding gradients when there's a path through the neural net which is the do-nothing path. This is really common in other kinds of deep nets too: ResNets have skip connections, Highway networks have skip connections. This idea of providing other ways to get through the neural network is a super popular way of training deeper nets — and a recurrent net is basically a super deep net once it's unrolled.
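As a concrete picture of that Keras change — a minimal sketch rather than the actual notebook code; the sizes are made up, and the exact import paths and the time-distributed output layer are spelled slightly differently across Keras versions:

```python
from keras.models import Sequential
from keras.layers import GRU, TimeDistributed, Dense, Activation
# from keras.layers import SimpleRNN  # the old line: swapping SimpleRNN -> GRU is the whole change

n_steps, n_input, n_hidden, n_output = 8, 8, 256, 8

model = Sequential([
    GRU(n_hidden, input_shape=(n_steps, n_input), return_sequences=True),
    TimeDistributed(Dense(n_output)),   # same output shape as before
    Activation('softmax'),              # same activation as before
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```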
Nikki is asking: is there a way to visualize what the RNN is learning — what each neuron is responding to? She notes that there has been similar research with convolutional networks — is there anything like this for RNNs, perhaps a standard way of doing it? So: can you visualize RNNs? Matt Zeiler did fantastic work on visualizing convnets, which I'm sure is what she's referring to — that's with Rob Fergus, a really great paper — and the other person who's done fantastic work there is Jason Yosinski, who actually has a toolkit you can download to visualize them. There's much less for RNNs, but Andrej Karpathy wrote a paper which, hopefully, we can find... projects, talks, teaching, papers... here it is: "Visualizing and Understanding Recurrent Networks". That's really the only one I know of. It's not nearly as deep as the stuff Jason Yosinski has done, but it's a start.

Okay. LSTMs are very similar to GRUs. They call it a cell rather than a hidden state, but the general idea is the same: you've got some stuff coming in and a previous state, which give you a state proposal; you've got a learned gate that determines how much of your previous state you keep — how much you forget your old state — and you've got another gate that determines how much of your state you use. It's got one more gate than a GRU — and I guess, why use three when you could use two — but it produces pretty similar results.

So how do we calculate z and r? The answer is: with a neural net. How do you learn anything in deep learning? With a neural net. The easiest way to show it, I think, is with code. "So do we just lump the parameters for the r and the update functions into the overall parameters of our recurrent net?" We're totally going to do that, yeah. Here they are: we've got a whole bunch of parameters — we've just got more of them now. Rather than just having weight matrices for hidden, input and output, we've also got reset-gate weights for hidden and input, and update-gate weights for hidden and input. "Right, so you've added these — it's not like a nested neural net, you're just adding weights to the model?" Well, I'll show you the code and we can decide whether it's a nested neural net — it certainly can be one.

One thing to notice is what I've done with my reset gate. This isn't something I've experimented with a lot, but I was just thinking about it in terms of common sense: what do we want that gate to be doing most of the time? My guess was that by default we probably want the gate closed — letting the previous hidden state through — so I've actually set this last parameter here, "do I want a zero bias?", to false, which means that the initial bias for my reset gate is going to be one. This question of whether you initialize your biases with zeros or ones has turned out to be super important for recurrent networks. "I know that in LSTMs, having the forget-gate bias set to one is really important — is that in general, or problem-specific?" In general — and if you think about it, it kind of makes sense. It depends a bit on what trouble you're having, but if you want to be able to train your network really easily, you probably want your initial path to be more like "leave h as it is"; on the other hand, you probably want your initial proposed h to actually use your information rather than ignore it. So you can think about what you think makes sense — but it's easy enough to try changing ones to zeros and vice versa, and it's something that not enough people seem to think about, in my opinion.

So, I just created a function called gate, which applies a sigmoid activation to the dot product of x and some weights, plus a bias, plus the dot product of h and some weights.
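The gate function being described, in NumPy — a sketch; the `zero_bias` flag mirrors the "do I want a zero bias?" parameter mentioned above, and the names are mine:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def init_bias(n, zero_bias=True):
    # zero_bias=False gives an initial bias of one, i.e. a gate whose sigmoid
    # starts out mostly letting the hidden state through.
    return np.zeros(n) if zero_bias else np.ones(n)

def gate(x, h, w_x, w_h, b):
    # A learned gate: sigmoid of (input . weights + bias + hidden . weights),
    # so it always returns values between 0 and 1.
    return sigmoid(np.dot(x, w_x) + b + np.dot(h, w_h))
```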
To that gate function, at each step, I just pass different things. The reset gate gets my inputs, my hidden state and my reset weights; the update gate gets exactly the same inputs but my update weights; my h-tilde — my proposed hidden matrix — gets my x, my h times the reset gate, and my hidden weights; and then my new hidden state is the update gate times my old h, plus one minus the update gate times my new h. This is how the GRU was able to get rid of a gate: the thinking was, if you're not using the new state, you're probably using the old one — so it's just "update" and "one minus update", which I think makes a lot of sense.

I'm going to step over this, though, and point something out: if you're adding together two dot products, that's exactly the same as stacking the two weight matrices on top of each other, concatenating the two inputs, and then doing a single dot product. If you've got a matrix here and a smaller matrix here, and a matrix here and a smaller matrix here, you can just concatenate them and you'll get exactly the same answer — you can convince yourself of that at home if you're not sure; for now, take my word for it. That means we can simplify this code to have far fewer parameters, because I can combine input plus hidden: now I've got one set of reset weights and one set of update weights, and my gate function no longer has to do the addition — it's just one dot product. And the nice thing is that now all of my functions use the same thing — they all use gate: the reset gate, the update gate, the new-h gate, the y gate. So once you finally get to it, the whole thing is like seven lines of code. It's actually not at all difficult once you put it together.

So let's try it. We initialize all of our reset weights, update weights, input weights and output weights, create our gate function, create our step function, and everything else is the same: the same scan function as before, the same updates, the same error, the same gradients. Can you imagine if Theano's gradient calculation didn't exist and we had to work out all of those gradients by hand? This is where, with something like an LSTM, you really want to be using Theano or something similar. There's our function, and now let's try it. We're going to run through our 10,000 examples — this will be a bit slow because we're using the GPU without using batches; the other reason it's a bit slow is that it actually has to compile. Here it comes — "is that a GRU or an LSTM?" — this is a GRU. There we go: we have a working GRU.

So I think the main thing to take away here is that neural network architectures that are thought of as complex aren't actually complex — they're really not much code — and I think a lot of people shy away from experimenting with architectures because they think they're harder than they are. As long as you're using something that can calculate the derivatives for you, and it's a real programming language — not a Caffe config file or a Microsoft CNTK script file — I think you're good. If you're using TensorFlow or Theano, I think you can absolutely contribute to the cutting edge of neural net research, because there are so many things to try and it's so easy to try them. You could take this code and try using different ones and zeros for the biases; you might find something else works better, and boom — that's a paper.
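Here's a sketch of the GRU step just described, using that concatenation trick so each gate is a single dot product. The names are mine, and I've kept the sigmoid for the proposed hidden state to match the description above (libraries more commonly use tanh there):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gate(inp, w, b):
    # inp is [x, h] concatenated; w is the two weight matrices stacked,
    # so a single dot product replaces the sum of two.
    return sigmoid(np.dot(inp, w) + b)

def gru_step(x, h, w_reset, b_reset, w_update, b_update, w_h, b_h, w_y, b_y):
    xh = np.concatenate([x, h], axis=-1)
    r = gate(xh, w_reset, b_reset)        # reset gate: how much old state to use
    z = gate(xh, w_update, b_update)      # update gate: old state vs proposed state
    xrh = np.concatenate([x, r * h], axis=-1)
    h_tilde = gate(xrh, w_h, b_h)         # proposed new hidden state
    h_new = z * h + (1 - z) * h_tilde     # per-element linear interpolation
    y = gate(h_new, w_y, b_y)             # prediction: same gate function, input is just h_new
    return h_new, y
```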
Okay, so what do we do with this? Well, one thing we could do with it is something kind of fun, which is to read in 600,000 characters of Nietzsche's philosophy. We turn each of those characters into a one-hot encoding, and then we put that into an LSTM — and I want to show you something interesting here: this LSTM is feeding into this LSTM. Why is that? Well, if you think about it, there's not very much going on in here. You've got outputs coming out here; the hidden state goes through one matrix product and one nonlinearity, and that becomes our new hidden state; our output goes through one matrix product and one nonlinearity, and that becomes our predictions. These are pretty dull functions — they can't do very much. But what if, instead of going to our output, we went to another recurrent net, and that went to our output? Now you've got all of that ability to stack nonlinearities together, and all the niceties of the universal approximation theorem that says this can now approximate pretty much anything.

I think it was Terrence who asked earlier whether these are real neural nets inside here — and the answer is that sometimes people stack recurrent nets on top of each other, and sometimes they actually take this input and turn it into a multi-layer neural net, or this input becomes a multi-layer neural net. Anywhere you've got one of these arrows, you can stick a neural net of whatever depth and architecture you like. The easiest thing to do when you're using a wrapper like Keras is just to feed a recurrent net into a recurrent net, and that seems to get all of the benefits. So here we've got a recurrent net feeding into a recurrent net. Nietzsche didn't write that much philosophy, so we need to use some dropout to regularize it; then we have our fully connected layer and an activation function, we compile it, and I actually trained it for an hour or so. One thing to note is that when you're training things for a long time, it's a good idea to put something in your training loop that saves the weights occasionally, because then you can come back, load in a different set of weights, and see how it looks. So I load in the weights, set up a seed string, pass that in as the initial state of my recurrent net, and ask it to start producing some predictions. And there you go.

"This is character-based, right?" This is character-based, yes. "So how come you got real words so fast?" Well, I trained it overnight — I didn't train it just now. If we go back and load the earlier weights — if we load lstm-1, this was after one epoch: not so good, although at least it's getting the concept of words and spaces and punctuation; whereas if we take epoch 60, "we should actually learn something about the nature of mankind" — trained on 600,000 characters.

So that's one thing you can do with it. We have a question over here — "is there some sort of temperature parameter?" That's right; I actually used a slightly different kind of parameter. I basically took my predictions to the power of something less than one, to make them a little bit more random, and then renormalized. Then, for each next character, I pick from all of the characters available with a probability according to those predictions. So yes, you can basically decide how random you want it — and it's a bit tricky, because if you make it not random enough, it ends up just repeating the text it was trained on.
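The sampling trick described in that last answer — raise the predicted character probabilities to a power below one, renormalize, then pick — might look roughly like this (a sketch; the notebook's actual details may differ):

```python
import numpy as np

def sample_char(preds, power=0.7):
    # preds: the network's probabilities over the character vocabulary.
    # A power < 1 flattens the distribution (more random); closer to 1 is
    # less random and more likely to just repeat the training text.
    preds = np.asarray(preds, dtype=np.float64) ** power
    preds /= preds.sum()
    return np.random.choice(len(preds), p=preds)
```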
Another hyperparameter question: how do you decide how many hidden units versus how many stacked layers? You just try things — creating architectures for any kind of neural net is total artisanship at the moment. Or you can use some kind of hyperparameter search — grid search, whatever — but that takes work. "Are there any particular conventions you use for designing architectures, as well as parameters?" I think you can get a feel for it, and the best way to get a feel for it is by looking at the state of the art in other papers. "Is Kaggle perhaps a good source for this?" Unfortunately, Kaggle hasn't had any decent recurrent-net-type competitions that I've seen — it's a great source for the convolutional-net state of the art. So I would just look at the best language-modeling papers and see what they used, to get a sense of how many layers you need; often they'll publish tables along the lines of "I tried these different numbers of hidden units and here are the results".

I think the most interesting thing you can do with RNNs is sequence-to-sequence learning. The normal approach with a convnet is something that takes a fixed-size input and creates a fixed-size output, such as a list of probabilities of object types. Sequence-to-sequence is interesting because it allows us to take some arbitrary-length input and map it to some arbitrary-length output, and the lengths can vary at training time and at inference time — for example, language translation: I want to take English and produce Hungarian. How do you do that? The answer is with a sequence-to-sequence network or, as they were originally called, an encoder-decoder network. As I mentioned, Kyunghyun Cho created GRUs and encoder-decoder networks all in one paper, and then Ilya Sutskever created sequence-to-sequence networks very shortly afterwards, in fairly parallel work.

Their basic approach is pretty cool. You've got some input sequence; you stick it into some kind of RNN, and at the end of it you have some hidden state. You take that hidden state and basically say: that is the essence of this sequence — a summary of the sequence. Now you feed that to a new recurrent neural network, which is called the decoder (the first one is the encoder), and the decoder takes that hidden state and creates an output. You'll see that, as well as creating an output, it feeds that output back in as additional input. There's a special "go" symbol and a special end-of-stream symbol — again, these are just things that are learned — so you can train this whole sequence-to-sequence architecture as an end-to-end differentiable program. This is the more detailed version: the encoder takes your various time steps, sticks them into an RNN, out comes the hidden state, and that creates this context; the context then becomes the initial hidden state of the decoder, and it also gets added as fresh input at every time step; the output, as we discussed, becomes the input to the next step, and the hidden state carries through as well. This is also what's used for image captioning: an RNN — or even a CNN — can be used to turn an image into some kind of state, and then that state becomes the context for the decoder. So this is encoder-decoder or, more generally nowadays, sequence-to-sequence. TensorFlow has a sequence-to-sequence module you can just use.
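Schematically, the encoder-decoder loop looks something like this — very much a sketch, with hypothetical `encoder_step` and `decoder_step` functions standing in for whichever RNN cell you use, and made-up ids for the go and end-of-stream symbols:

```python
import numpy as np

GO, EOS = 0, 1   # hypothetical ids for the special start / end-of-stream symbols

def encode(xs, h0, encoder_step):
    # Run the encoder RNN over the input; the final hidden state is the
    # "context" -- a summary of the whole sequence.
    h = h0
    for x in xs:
        h = encoder_step(x, h)
    return h

def decode(context, decoder_step, max_len=50):
    # The decoder starts from the context and the GO symbol; each output
    # token is fed back in as the next input until EOS is produced.
    h, token, out = context, GO, []
    for _ in range(max_len):
        h, probs = decoder_step(token, h, context)   # context also fed at every step
        token = int(np.argmax(probs))
        if token == EOS:
            break
        out.append(token)
    return out
```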
Finally, a last couple of things to mention briefly. The book talks about recursive neural networks. I'm not going to talk about them very much, but it's basically the idea that you could create a computational graph that is more like a tree, the idea being that the depth of the graph is the log of the sequence length rather than the full length. I'm not going to say much about them simply because they haven't had many good results. At first blush it looks like they have, because there are some published state-of-the-art results, but it's kind of a cheat: with a recursive neural network you can actually apply a label in the training data to every one of the sub-trees, so effectively you're training it with a lot more data. For example, this was used to create a state-of-the-art result for sentiment analysis, but the labellers actually created a sentiment label for every single sub-phrase in the corpus — so it's not a like-for-like comparison, and in the real world that just doesn't happen. Maybe there will be interesting uses of recursive neural networks yet to come, but I haven't seen them used in the wild for anything interesting.

The book also talks about memory networks and neural Turing machines. Again, this is something that's not quite ready for prime time, in a sense, but the general idea is very interesting. Take an RNN — this is basically an RNN here — and at each stage in your RNN you could learn something that can write out not just to one hidden unit but to multiple hidden units; and you could also learn another set of functions that learn to erase hidden units, and another that learns to read from hidden units. What these types of things generally do is, at this point here, use a softmax or something similar, which basically forces the network to mainly read, write or erase one area at a time. It's quite a neat idea, because it gives the recurrent neural network the ability to save different pieces of information in different areas, keep them for a while, erase them later or replace them, and then read them back at different times — and all of these are learned functions.

Again, though, you have to be a bit careful about some of the claims made here, because a lot of the "here are the results we got" tables do something a bit cheeky. Say they're doing a question-answering exercise: they basically say, here's a whole list of facts that we taught the neural net, and later on we asked it this question — here are the facts, here's the question, and it gives us this answer, and it looks really impressive. But what you realize when you look more closely is that not only do they give it the facts, they actually say, for each question, "this could be inferred using this, this and this" — they tell it which facts you would need to know to figure the answer out. That's really handy for building a network, because you can then create a reader unit that learns "oh, I'm interested in these things". But of course, in the real world you don't get given a corpus that says "here's everything we know about the world, and we know it because of this and this and this".
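That soft read — a softmax over memory locations, so the network mostly attends to one area at a time but everything stays differentiable — is tiny in code; a sketch with made-up shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_read(memory, scores):
    # memory: (n_slots, slot_width); scores: one learned score per slot.
    # The softmax concentrates most of the weight on one or a few slots.
    weights = softmax(scores)        # (n_slots,), sums to 1
    return np.dot(weights, memory)   # (slot_width,): a weighted read
```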
This is the difference between what I think they call strong versus weak memory networks, and the weak memory networks still don't do very well. But I'm sure this will change — all of this has been happening in the last year or two, and it's changing very quickly.

Okay, so I think that's everything I wanted to cover. As I said, this is some of the material I'm hoping to present at this course — if anybody's interested in doing it, please let me know, and if anybody has ideas that could help make it better, please also let me know.

Just a couple of questions from the livestream. First off, Cheryl — sorry if I mispronounced that — is asking: is unfolded backprop the only way to train RNNs, or are there other methods that might potentially be better or equivalent? And also: are RNNs the only way to model sequences and text? So, yes, there are other ways to train RNNs. The main one, which is actually discussed in the book, is the idea that rather than having the hidden state loop back to itself, the hidden state goes to the output and the output goes back to the next hidden state, which goes to the next output, and so on. The reason that's interesting is that you can then train all of the steps in parallel (the book calls this teacher forcing), so there's no gradient explosion or anything like that. It's much less rich in terms of what it can model, but for things that really do have that structure it could be a better approach. As for other ways of doing sequences besides RNNs: you can use convnets, and in fact I remember somebody mentioning, when Goodfellow was here last week, some recent results in natural language processing where convnets achieve state-of-the-art results. There are also time-delay networks, which are a kind of twenty-year-old version of RNNs — kind of like Hinton's original RNN ideas. "You mentioned — or the book mentions — that that kind of recurrent output is good at memorization but not a good generalizer?" Yeah, I don't recall that part of the book well enough to comment.

Nikki asks: the GRU learns a function for writing and reading — is there a way to learn the connections of the network itself, instead of just the feed-forward connections? I'm not sure I understand the question — do you? I didn't entirely either, so I'll ask her to clarify; she can email us or something.

And one last question: what's my opinion on RNN learning algorithms based on reinforcement learning, Hebbian learning, STDP, et cetera? Reinforcement learning is totally fascinating — we will get to it in a later discussion, fair enough. The fact that you can smooth things out over time is really cool; obviously the problems are things like credit assignment, and also just how slow these methods can be, but it seems like a really interesting line of inquiry.

Okay, all right — any more questions? I think we're good. "I guess the last question is: is this code available?" Not yet, but it will be — a lot of people were asking whether there's a repository or something.