We're very happy to have Mikkel in New York today. Mikkel was at the University of Copenhagen first, then at AT&T for many years, and then back at the University of Copenhagen, where he has very strong students and people visiting him. Mikkel is an expert on hash functions, with several STOC and FOCS papers on hash functions; I think that's one of the main things he has contributed to, and today he's going to talk about hash functions. So we're excited to have one of the world experts on hash functions, maybe the world expert on hash functions. Another exciting thing is that he worked at AT&T for many years, so a lot of the work Mikkel has done is not purely theory but also connected to practice, and I think that's something very cool to see in his work. So, Mikkel, please.

Well, thank you, Vincent. One of the great people I had at Copenhagen was Vincent, who did his first postdoc with me, or his last as well. Good. This work is based on lots of different papers, including STOC and FOCS papers.

First, I know it's boring to start with notation, but I use the bracket notation [r] for the set {0, ..., r-1}, just so you are warned. I'm going to talk about hash functions that map some large universe down to a smaller domain of hash values. I'm going to talk about fully random hash functions, which are basically uniform random variables over all such maps. They are great abstractly, and that's what we normally think about, but they are not implementable, in the sense that the space is at least as big as the size of the universe, because you would need the equivalent of a random variable for every single key in the universe. And especially if you're doing hash tables, the whole point is to map a very large universe into a much smaller space, so this would defeat the purpose.

Then we have something implementable, and the most classic implementable scheme is linear hashing: we work over a prime field that is big enough to contain the universe, and then we pick two random elements a and b as our random seed. They define the hash function, which is just the classic h(x) = ((a*x + b) mod p) mod r, where r is the range. The important thing here is that as soon as you have fixed a and b, you have fixed the hash value of every single key in the universe; we only had to pick two random things. So it is a function: when you know a and b, you know where x goes.

One of the nice things about this implementation is that it has one property in common with fully random hashing, namely low collision probability. Actually, if we cheat a little bit and say that a is not allowed to be zero, then the probability that two distinct keys x and y hash to the same value is strictly less than 1/r, where r is the range we map into. If we allow a to be zero, it becomes a little more than that, which is annoying, so it looks prettier this way. And why do we skip a = 0? Why is that a good idea? Because then every key hashes to the same value b, and that's where you get all the collisions, so this is really the one case to avoid. Sorry, I love asking questions during the talk.

The most classic use of hash functions is a hash table. We map every key into a chain of elements with the same hash value, so if we have x and it hashes into some list, then we just check whether x is in that list to see if it is among the things we have stored; if not, we can put it in.
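As a minimal sketch of the scheme just described, here is the 2-independent linear hash ((a*x + b) mod p) mod r with a != 0, plugged into a chained hash table. The prime, class names, and parameters are illustrative choices of mine, not from the talk.

```python
import random

P = (1 << 61) - 1          # a Mersenne prime large enough to contain the key universe

class LinearHash:
    def __init__(self, r):
        self.r = r
        self.a = random.randrange(1, P)   # a != 0 avoids mapping every key to b
        self.b = random.randrange(P)

    def __call__(self, x):
        return ((self.a * x + self.b) % P) % self.r

class ChainedTable:
    """Hash table with chaining: expected chain length stays small because
    Pr[h(x) = h(y)] < 1/r for any two distinct keys x != y."""
    def __init__(self, r):
        self.h = LinearHash(r)
        self.chains = [[] for _ in range(r)]

    def insert(self, x):
        chain = self.chains[self.h(x)]    # walk the chain of keys with the same hash value
        if x not in chain:
            chain.append(x)

    def contains(self, x):
        return x in self.chains[self.h(x)]
```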
The important thing here is that all we need is collision probability 1/r, where r is the range, to get that the expected chain length is small, and that holds both for fully random hashing and for linear hashing. This is what Carter and Wegman do with universal hashing; you can use it for signatures and a lot of things, so it's a very useful property.

The target of hash function research is to design simple and reliable hash functions, as opposed to research into hash tables, which are algorithms that use hash functions to distribute data in memory. We want algorithmically important properties similar to those of fully random hashing, but which we can easily implement, and linear hashing with low collision probability is one of the most classic examples of that, or the most classic example. That is what bridges theory, which assumes fully random hashing, so very abstract theory, with something you can actually implement.

I think this area is extremely important, because randomized algorithms are extremely popular in practice, but often people implement them using too simple hash functions, and this means you only have the guarantees that were proved when the input itself is random. If you have random input, you can use the identity as a hash function and it's great. This is a really bad situation that the theory community has left industry with, because simple hash functions work deceptively well on random input, but the real world is full of structured data on which they may fail with way too high probability. Often they work well most of the time, but it's the things that should never happen that happen way too often. Back at AT&T, what we saw was that linear probing, which is very popular in practice as a hash table implementation, was unreliable: quite often it didn't work so well, and the problems turned out to be typically associated with the hash function, basically leading to way too long runs; they even had a backup system they switched to when the probe lengths became too long.

So, just to make sure we're on the same page: the classic way of measuring how powerful a hash function is, is the k-independence paradigm, which says that any k keys get independent hash values. The classic implementation is a degree k-1 polynomial over a big enough prime field, a direct generalization of linear hashing, which is just k = 2. It's a beautiful thing: you pick these k random coefficients, and that is enough to guarantee that any k keys get independent hash values. This is a very nice paradigm, because now, for any algorithmic application, you can just ask how much independence you need, and then find the best implementation of that amount of independence. But we will do much, much better than that, using different kinds of hashing, not based on independence, not based on polynomials, but on tables.
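For concreteness, here is a minimal sketch of the degree-(k-1) polynomial construction mentioned above, evaluated with Horner's rule; the final reduction to a small range r is omitted, and the prime and names are my illustrative choices.

```python
import random

P = (1 << 61) - 1   # prime field large enough to contain the universe

def k_independent_hash(k):
    coeffs = [random.randrange(P) for _ in range(k)]   # k random coefficients
    def h(x):
        v = 0
        for c in reversed(coeffs):   # Horner: ((c_{k-1}*x + c_{k-2})*x + ...) mod P
            v = (v * x + c) % P
        return v
    return h

h2 = k_independent_hash(2)   # k = 2 recovers linear hashing a*x + b mod p
print(h2(42), h2(43))
```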
The running example will be Jaccard similarity, which for two sets A and B is the size of the intersection divided by the size of the union. The basic idea is that we use hashing to construct sketches: we'll define how to construct a small sketch of each set, and then, given two sets, we can compare how similar they are just by looking at the sketches. This might not seem like the natural thing to do; if you just had two sets, why not compare them directly? You could do some sorting or whatever and compare them quite easily. But the idea is that we have many sets and store just the sketches, and now, when a new set comes in, we can find a similar set just by looking at the sketch of the new set. So we have reduced the amount of data we have to store. The whole point is that the hashing is used to create comparable sketches; we have to agree on the random choices so that the sketch of A always looks the same, and we'll study how the hash function implementation impacts the quality of these sketches.

First I'll talk about the most classic way of doing this, by Broder from AltaVista. In everything I do in this talk, I'll first say how we would reason about it if we had fully random hashing available, and then I'll discuss how we get screwed if we make bad choices of hash function. So let's take the element of the union that minimizes the hash value; I'll just call it the smallest element, referring to the hash value. If the hash function is truly random, then this element is uniform in the union, so the probability that it lies in the intersection is exactly the Jaccard similarity. But we also have that the smallest element lies in the intersection if and only if A and B have the same smallest hash value, so the Jaccard similarity is simply the probability that A and B get the same smallest hash value. This means the minimum hash value is a sketch: this single value already tells us something. Of course, if you want concentration bounds, something you can trust, you should do this with several hash functions; then you get a whole vector of min hash values from each set, and that is something you can use, if it works. Estimating the Jaccard similarity is then just looking at the average number of matches between these vectors.

[Audience] I was just going to mention that Broder was at AltaVista when he created this, but he's at Google Research now. — Oh yes, great, another person I should know why is not here. — He's over there, 3,000 miles that way. — Okay, great.

Anyhow, the expected relative error is one divided by the square root of the expected number of matches, which is the Jaccard similarity times k. So what happens when you don't have fully random hash functions? What we really want to say is that for any element x in any set C, the probability that x gets the smallest hash value is exactly 1/|C|, but there may be some bias, so it's 1/|C| plus or minus some epsilon; that's epsilon-minwise hashing. And in fact, bias epsilon requires independence on the order of log(1/epsilon), so if you want sub-constant bias you need super-constant independence, which means super-constant running time for the hash function, and that doesn't improve when you do repetitions. Mitzenmacher and Vadhan have a very nice paper which, in some sense, says that if there is enough randomness in the input, then two-independence works: it doesn't have to be fully random, because if there's enough randomness, the two-independence kind of smears out the randomness, and then everything is as good as if you had fully random hashing. That's related to Nisan's pseudorandom generator and other things, very beautiful, but the real world is full of low-entropy data.
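Here is a minimal sketch of the k-times-repeated MinHash scheme described above. Fully random hashing is simulated by lazily memoized random values, which is fine for a demo but of course not implementable at scale, which is exactly the point of the talk; names and parameters are illustrative.

```python
import random

class FullyRandomHash:
    """Stand-in for a fully random hash function (memoized random values)."""
    def __init__(self):
        self.memo = {}
    def __call__(self, x):
        if x not in self.memo:
            self.memo[x] = random.random()
        return self.memo[x]

def minhash_sketch(S, hashes):
    return [min(S, key=h) for h in hashes]      # per coordinate: key with smallest hash value

def estimate_jaccard(sketch_a, sketch_b):
    k = len(sketch_a)
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / k   # fraction of matches

k = 256
hashes = [FullyRandomHash() for _ in range(k)]   # shared by all sets -> comparable sketches
A, B = set(range(1000)), set(range(500, 1500))   # true Jaccard similarity is 1/3
print(estimate_jaccard(minhash_sketch(A, hashes), minhash_sketch(B, hashes)))
```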
Here's a very simple test I ran. I'm considering the Jaccard similarity of two sets whose intersection is just consecutive numbers and whose symmetric difference consists of random numbers. This is in some sense very realistic, because if you use Huffman-style coding of sets, then common words get small numbers, which means they form a fairly dense set, whereas the uncommon words look more like random numbers; so this is an extreme version of a very common phenomenon. What you see, as I increase k, is that the estimate, a real run with increasing k, converges beautifully. There is only one issue: it converges to the wrong value, because the real value is down below. That's the bias. Imagine you're using this industrially; all you have is your estimates, you don't know the truth, because the reason you estimate was that you couldn't afford to compute the truth. It looks beautifully convincing, but it's just wrong. If you make any kind of major investment based on these estimates, you're in trouble. Nobody may ever know it, and if you're doing better than the competition you may be fine, but fundamentally this is not right.

Now for a similar sketch, which is very similar from a theory viewpoint; I think most people prefer the one I talked about before, but this is the bottom-k sketch, and it's in the same paper by Broder. Instead of taking k independent hash functions, you take just one, and you keep the k keys with the smallest hash values. This is simpler to implement; you just need a priority queue. To estimate the frequency of some subset, you look at the number of samples from that subset and divide by k. I talked about similarity, but that's really the same thing, just an application, because if we have two sets and their sketches, we can construct the sketch of the union: take the k smallest from each set, unite them, and take the k smallest of that, which are exactly the k smallest elements of the union altogether. So when we want to estimate the Jaccard similarity using this formula, we just look at the intersection of this union sketch with the sketches of A and B. This satisfies the target, which was that given the sketches of two sets, we can compute our estimate of the similarity. It's not as simple as before, where it was basically a dot product divided by k; it's something more complicated, but it's fully doable.

And surprisingly, this actually works with two-independence. I'd say it's kind of surprising, because bottom-1 is the same as 1-minwise: both just take the smallest element, so it's clearly biased, but somehow the bias vanishes as k grows. Same experiment as before: now we get even better convergence, because we're actually sampling without replacement, which is generally a better idea, and we converge to the right value. This should not be surprising given the math, but it's still nice to see it working so beautifully in the real world. So here's the comparison of the two methods: they were theoretically very similar, with similar concentration bounds and everything, yet this one works and this one doesn't if we use simple hash functions.
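Here is a minimal sketch of the bottom-k sketch and the Jaccard estimator just described: one shared hash function, the k smallest hash values per set kept with a heap, and the estimate computed from the bottom-k of the union. The hash stand-in and names are illustrative.

```python
import heapq, random

_memo = {}
def h(x):                       # stand-in for one shared (fully random) hash function
    if x not in _memo:
        _memo[x] = random.random()
    return _memo[x]

def bottom_k(S, k):
    """Keep the k keys of S with the smallest hash values (a priority queue does it)."""
    return set(heapq.nsmallest(k, S, key=h))

def estimate_jaccard(sk_a, sk_b, k):
    union_sketch = set(heapq.nsmallest(k, sk_a | sk_b, key=h))   # bottom-k of the union
    return len(union_sketch & sk_a & sk_b) / k                   # fraction in both sketches

k = 256
A, B = set(range(1000)), set(range(500, 1500))                   # true Jaccard similarity is 1/3
print(estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k))
```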
So really what I'm saying is that you need to understand what you're doing: you can use simple hash functions sometimes, but certainly not all the time.

So why does bottom-k work? If you have fully random hashing, there is a beautiful argument that works for almost any hashing-and-estimation scheme: if we select a set of k elements from X based on fully random hash values, then no matter how we do it, this set is a uniform choice of k keys from X. The proof is very simple, by the principle of deferred decisions: first we fix the set of hash values; we can do things in any order we want, so looking only at the hash values we can say "these are the k values we want", and only afterwards decide which key gets which of those hash values. We can do it in that order if the hashing is fully random; we couldn't with other kinds of hashing schemes, but with fully random hashing we have these super powerful arguments that make the analysis really simple in a very general way. So now we have strong concentration, because we know our sample is a uniform sample of k elements, and the fact that we sample without replacement means all the usual concentration bounds apply. It's not always strictly better, there are cases where it differs, but Chernoff works anyway; the negative correlation only helps. So we get the expected error I claimed.

But now we want the same with two-independence, and now it's not so simple, because the decision about whether to sample an element depends not just on the hash value of that element but on all the other hash values: we take the k smallest, it's a competition. So how do we analyze it? With a simple union bound. Again, these are just the definitions from before of how we do the estimate, and we want to bound the probability of getting too large a value, so we pick a couple of parameters, for reasons that will become clear, and ask whether the number of sampled elements from Y exceeds the frequency of Y times the number of samples. Now, instead of analyzing this competition, which is hard, we define a threshold sampling probability p: the idea is that you are threshold-sampled if your hash value is below p, and that does not depend on what happens to anybody else. So we have a sampling decision for each element that depends only on its own hash value; the decisions are not independent, but with two-independent hashing the indicator variables are uncorrelated, so we get all the variance and Chebyshev properties we want: the variance of the sum is the sum of the variances, and so on. Then we just have to argue that an overestimate implies that either we threshold-sampled too few elements from the big set X, or we threshold-sampled too many elements from the set Y we care about. The second part should be clear, because what we fear is getting too many samples from Y: either we got too many from Y, or we thought X was smaller than it is. So by the union bound, the probability of the error is at most the sum of these two probabilities, and both events are of the type "did I threshold-sample too many (or too few) elements from a given set?"
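To make the structure of that union bound explicit, here it is in symbols, as a sketch under the talk's notation, with the exact thresholds left unspecified: with frequency $f = |Y|/|X|$, threshold $p$ chosen so that roughly $k$ keys of $X$ are expected below it, and a slightly adjusted $\varepsilon'$,

$$\Pr\bigl[\widehat{f} > (1+\varepsilon) f\bigr] \;\le\; \Pr\bigl[\,|\{x \in X : h(x) < p\}| < k\,\bigr] \;+\; \Pr\bigl[\,|\{y \in Y : h(y) < p\}| > (1+\varepsilon')\, f\, k\,\bigr].$$

Each term on the right concerns a fixed set of 2-independent, hence uncorrelated, indicator variables, so Chebyshev applies and yields the $O\!\bigl(1/(\varepsilon^{2} f k)\bigr)$ bound quoted next.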
Since the choices are two-independent, we have variance bounds, hence Chebyshev, and using the Chebyshev bound we get these error probabilities; then you play around with the numbers and get what I claimed: if epsilon is not too big, then the probability that we overestimate by a factor 1 + epsilon is at most 4/(epsilon^2 f k), the same kind of error we would get with fully random hashing. So it was actually not too hard to show, but the beautiful clean argument we had with fully random hashing couldn't be used; we had to do something else, and it worked. The interesting thing is that with repeated minwise it totally doesn't work. So again, if we understand what we're doing, we pick the right one, and I don't think anybody in practice was aware of this difference; I'm just saying I don't think so, because it's hard to discover these things if you don't have the math, you don't even know how to ask the right questions. Anyhow, the analysis was just a union bound and Chebyshev.

Now I'm going to talk about something far more powerful, which is trying to get a vector sample: a vector of k samples from some set X. You could say, who cares whether it's a set or a vector? I still want to preserve similarity, but what I want is that, in order to see whether things are similar, I can compare them coordinate-wise; that's what we had with the k-MinHash scheme. Why is that important? It's simpler, and I also want concentration, but this alignment property is what we need if we want to do things like support vector machines. I basically claim that we can easily translate things into dot products: we have this vector of samples, and now we take another hash function that assigns to every element plus or minus 1/sqrt(k), with a 50/50 chance of each, and if each sample gets multiplied by its plus or minus 1/sqrt(k), then the expected dot product of the resulting vectors is exactly the Jaccard similarity. You could also just use Boolean dot products; the point is that if you want this nice linear algebra, you really need the vector sample, and it's equally important if you want to do efficient things with locality-sensitive hashing. So we are getting a vector instead of just a set, and this matters in various applications. With bottom-k sketches this isn't satisfied at all: we really just get a set of k elements, there is no vector, no sense in which the fifth element of A has anything to do with the fifth element of B; it's just a completely random set.

And then there's strong concentration. The reason I like strong concentration is that if we have a large number of sets in our database, and we come with a set A and want to find a similar set, we don't want the answer to be due to noise. If everybody were at about the same random distance and we had a bad hash function, it might make one set look much closer than another. So we really want these concentration bounds, and with exponential concentration the length of the vectors only has to be logarithmic; with only Chebyshev you would need vectors of length around the square root of n instead of log n to get the same error probability.
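To make the plus-or-minus 1/sqrt(k) trick concrete, here is a small self-contained sketch: given aligned vector samples from A and B (simulated directly here), a shared random sign per key and a 1/sqrt(k) scaling turn the match-counting estimator into a dot product. The helpers and toy data are illustrative, not from the talk.

```python
import math, random

_sign = {}
def sigma(x):                        # shared random sign hash, sigma(x) in {-1, +1}
    if x not in _sign:
        _sign[x] = random.choice((-1.0, 1.0))
    return _sign[x]

def signed_vector(sample_vec):
    k = len(sample_vec)
    return [sigma(x) / math.sqrt(k) for x in sample_vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Matching coordinates contribute exactly 1/k to the dot product; non-matching
# coordinates contribute +-1/k with mean zero, so the expected dot product equals
# the fraction of matching coordinates, i.e. the Jaccard similarity.
sample_A = [3, 17, 42, 8, 99, 5]
sample_B = [3, 21, 42, 8, 77, 5]     # 4 of 6 coordinates match -> estimate around 4/6
print(dot(signed_vector(sample_A), signed_vector(sample_B)))
```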
So what else do we want from vector sampling? We want consistency, which says that if the sample from a set C belongs to a subset A of C, then the sample from the subset should be that same sample. And we want the samples to be uniform. Consistency is exactly what implies that two coordinates match if and only if the sample from the union lies in the intersection, so the Jaccard estimator, the average number of matches, is just the fraction of samples from the union that land in the intersection, and the samples are unbiased thanks to uniformity. These are the fundamental properties we want.

So how do we design such vector samples? As usual, I first assume we have fully random hashing. The k-MinHash scheme does it, but in a very slow way: for each coordinate we want an independent hash function that orders all the elements, and we take the smallest; I put the smallest on top, like trees always growing downwards in computer science, and the sample is these top elements. The problem is that to compute this we have to do k hash computations for every single element, so the time is k times the set size, and if you want an error of epsilon, you typically need k to be on the order of log n divided by epsilon squared, which easily becomes a factor large enough that you would rather not pay it, in the thousands or more. So I'm going to talk about how we can get something that works even better, but that we can compute in time linear in the set size plus k log k, and typically I think of k as much smaller than the set size.

But first I want to talk about a beautiful, very simple idea, going back to Flajolet, called one permutation hashing, which appeared at NIPS a while ago. Now we have two hash functions: one that gives a hash value, as before, and one that gives every key a random bucket. So every key, like 52, gets placed in a random bucket and also gets a hash value, and the point is that for each coordinate we look at the bucket and take the smallest key there as our choice. The only problem is that some buckets may get no keys at all, so the coordinate is undefined. But it is wonderfully fast: just two hash computations per key, so linear time. And we have the coupon collector problem: if every key gets a random bucket and we have k buckets, how many keys do we need before all buckets are filled? About k ln k. So if we have that many keys, we get k samples without replacement, which is generally a good thing; obviously we can't get k samples if we have fewer than k keys, and then none of this works very well. It's consistent, it's uniform, we get Chernoff bounds, we get everything, but the undefined samples pose a problem, because we want sketches that don't depend on knowing how big the sets are. There were actually a number of ICML and NIPS papers about how to fill the empty buckets; one of the simple suggestions was that if you have an undefined entry, you just copy from the bucket to the left. That actually preserves unbiasedness: the expected value stays correct.
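Here is a minimal sketch of one permutation hashing as just described: two hash computations per key, one for the bucket and one for the value, keeping the smallest-valued key per bucket and leaving empty buckets undefined. The hash stand-ins and numbers are illustrative.

```python
import random

def make_hash():
    memo = {}
    def h(x):                               # stand-in for a shared (fully random) hash
        if x not in memo:
            memo[x] = random.random()
        return memo[x]
    return h

def one_permutation_sketch(S, k, bucket_h, value_h):
    sketch = [None] * k                     # None = undefined (empty bucket)
    for x in S:
        b = int(bucket_h(x) * k)            # random bucket in [k]
        if sketch[b] is None or value_h(x) < value_h(sketch[b]):
            sketch[b] = x                   # keep the "smallest" key seen in this bucket
    return sketch

bucket_h, value_h = make_hash(), make_hash()   # shared by all sets -> comparable sketches
A = set(range(2000))
sk = one_permutation_sketch(A, 64, bucket_h, value_h)
print(sum(s is None for s in sk), "empty buckets")   # coupon collector: ~k ln k keys fill all
```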
That follows from the general argument for fully random hashing, which says that as long as we select keys based only on the hash values, everything is fine. But we lose on the concentration bounds, because we get no extra information when we just copy from the left; you simply don't get as many samples that way.

Then we had a very simple fix a while back: if you didn't manage to hit every bucket, just try again. So we have the first one-permutation sketch, then the next one-permutation sketch, and we only look for buckets we haven't filled already; now we get something in the bucket that was missing from the first round, and we're done. We can get repetitions, because within each round we sample without replacement, but across later rounds the same key can appear again. And it's easy to prove that you don't need many rounds before everything is filled, so you can finish with k min-values, but I won't get into that.

Looking at a very simple example, three elements in the universe, I know it's a little bit easy, but we actually saw quite a big difference in how well we estimated things. This is some similarity experiment we did, and you can see this scheme actually worked better than MinHash; again the reason was the advantage of sampling without replacement, whereas the other schemes had way too heavy tails. So it made a difference that was very noticeable in experiments with small sets; if the sets are large, we know we effectively just do one permutation hashing, because everything gets filled in the first round.

[Audience] Why do the other histograms have a lot of mass at zero? — Oh, that's because things can disappear. What happens is that you do your first-round sample and you get something, and then, say, five is gone because it landed over here; if all you do afterwards is sample from elements you already picked, you'll never see five, no matter how many experiments you run. — I see. — Yes, that happens quite commonly. Good question.

And it's easy to see that the running time is |A| + k log k, because in each round every key gets a random bucket, so once we have thrown about k log k keys into random buckets, everything is filled; that bounds the computation time. So, to sum up what we have, with and without replacement: classic repeated minwise takes time |A| times k, because we have to evaluate k hash functions; one permutation hashing is great in terms of its linear time, but has problems with undefined entries when the set size is not so big; and this fast similarity sketch, which actually performed better than the first scheme, does a funny combination of with and without replacement: within each round it's without replacement, but across rounds things can repeat. It's efficient, and we have all the usual Chernoff bounds; they work perfectly, because if you look at the proofs of the Chernoff bounds based on Taylor expansion, negative correlation can only improve things.

So now I'll talk about how we can actually implement these things. I always start by saying what we can do assuming full randomness, which frees us when we design hash-function-based algorithms, but how well can we now implement them?
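Before the implementation discussion, here is a rough sketch of the try-again idea described above: run one-permutation rounds with fresh hash functions and only fill buckets that are still empty. This is a simplified illustration under my own naming, not the exact scheme from the paper.

```python
import random

def fresh_hash():
    memo = {}
    def h(x):
        if x not in memo:
            memo[x] = random.random()
        return memo[x]
    return h

def fast_similarity_sketch(S, k, round_hashes):
    sketch = [None] * k
    for bucket_h, value_h in round_hashes:        # shared per-round hash pairs
        if all(s is not None for s in sketch):
            break
        best = [None] * k                         # this round's per-bucket minima
        for x in S:
            b = int(bucket_h(x) * k)
            if best[b] is None or value_h(x) < value_h(best[b]):
                best[b] = x
        for b in range(k):
            if sketch[b] is None and best[b] is not None:
                sketch[b] = best[b]               # only fill buckets still missing
    return sketch

k = 64
rounds = [(fresh_hash(), fresh_hash()) for _ in range(10)]   # few rounds suffice w.h.p.
A = set(range(100))                               # small set: one round won't fill all 64
print(fast_similarity_sketch(A, k, rounds).count(None), "buckets left empty")
```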
To keep it simple, I assume we have enough elements that one permutation hashing works, that it fills all the buckets. I'll talk about red and blue balls: the red balls are those in the intersection, whose frequency we want to know, and the rest are blue. We want to build a vector of k samples, and we want the number of red samples to be concentrated around its expectation.

Generally speaking, the way I'll implement these things is with one hash function that generates plenty of bits: we have k buckets, so I use the first log2(k) bits for the bucket, and the rest as the local hash value that decides who is smallest within the bucket. Now let's see what happens if I use the classic ((a*x + b) mod p) to compute these hash values, and my blue set is just consecutive numbers while the red numbers are random, similar to the example that went wrong before. What happens? Say this point here is the hash of zero, which is just b. Where does one go? I add a. Where does two go? I add a again. Where does three go? I add a once more. [Audience] Adding a, you mean one copy of a? — Sorry, yes, one copy of a. So every time I go from one element to the next, because the numbers are consecutive, I add the same number a, so we move around and create this simple pattern that keeps twisting around a little bit. If I'm unlucky, some of the elements will at some stage get pretty close to each other, with reasonably high probability, and that small gap will be repeated over and over. This is a very good example of how the structure of the input is preserved in the hash values, and it goes really badly: if this happens, the hash values from the blue set are always close to each other, which means the chance of one of them capturing the minimum is much smaller, and it's completely correlated across buckets. To get concentration bounds we want different buckets to behave differently, but here all buckets behave almost the same, so we gain almost nothing from the repetitions. In this example, which of course is fabricated, I got way too many red wins compared with blue wins, and this is a real issue when you hit structured data.

So I want this concentration, and the basic problem is that I want independence between the buckets, but each bucket has many elements, so even with k buckets it's not clear that k-independence would help me at all; and I think of k as a large number, log n divided by epsilon squared, so it's not clear how much independence would give any result whatsoever. The same issue appears if you use, for example, the HyperLogLog of Flajolet et al.; it's the same kind of principle, but I think this example is more elegant, and I will fix all of them.

For this I want to use tornado tabulation hashing, and to explain it I'll first explain simple tabulation, then use that to define tornado tabulation, then show how to apply it, and then even show some experiments if we get that far.
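As a quick aside before tabulation hashing, the structured-input problem described above is easy to reproduce: under h(x) = (a*x + b) mod p, consecutive keys always differ by the same amount a, so the structure of the input survives into the hash values and hence into the bucket bits. The prime is an illustrative choice.

```python
import random

p = (1 << 31) - 1
a, b = random.randrange(1, p), random.randrange(p)
h = lambda x: (a * x + b) % p

# Every step between consecutive keys is identical: the set of differences is just {a}.
diffs = {(h(x + 1) - h(x)) % p for x in range(1000)}
print(diffs == {a})     # True
```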
Simple tabulation hashing goes back to chess computers in the 70s. You view a key as divided into characters; in this case a 32-bit key could be four 8-bit characters. For each character position, four in this case, we have a random table that maps the alphabet to random hash values. The big point is that these random tables are not large: they only have to cover the alphabet, not the universe. The hash value is then just the bitwise XOR of the looked-up values. This is super fast: the idea is that when the tables have size around the universe to the power 1/c, they typically fit in fast cache, and in the experiments I've run this is the fastest 3-independent hashing scheme, as fast as two multiplications. And I would argue it will always be a bit like that: any computation worth doing involves data, so having arithmetic circuits that are much faster than the fastest memory cache won't help you very much; why even build such a processor? So I'm claiming that if you can stay inside a small cache, this kind of scheme will always be fast.

But it has a lot more power. This was my original paper with Mihai: there are all these studies of which applications need more or less independence, like the minwise hashing we talked about, linear probing, cuckoo hashing, all needing quite a bit of independence, and yet simple tabulation works for all of them. The scheme is not even 4-independent, it's only 3-independent, but it solves all these problems beautifully. Still, I have more to say, because sometimes we need full randomness; for example, simple tabulation cannot toss coins. I don't know how important it is to toss coins, but it doesn't work if I toss coins and want the number of tails to match the expectation of n/2; it only works if I hash not into two values but into many values.

So now a little bit of linear algebra. I say a set of keys is a zero set if in each position every character appears an even number of times: for every position and every character, the number of keys having that character in that position is even. That's a strong requirement. An example is this set here: with two characters and an alphabet of size two, take all the different combinations, so we have four keys. If you have such a zero set, then no matter how I define the simple tabulation hash function, the XOR of the hash values over that set is exactly zero. Why? Because the hash of a key is the bitwise XOR of its characters' table entries, and when we XOR over the whole set, each character's table entry appears an even number of times (in the example, exactly twice), so it all cancels to zero. Now I say a set of keys is linearly independent if it contains no zero set; it's not enough for the set itself to be non-zero, there should be no zero set inside it. I also want the keys to be distinct, but that's part of being a set, otherwise you could just take the same key twice; so zero sets have size at least four.
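Here is a minimal sketch of simple tabulation hashing as described above: a 32-bit key viewed as four 8-bit characters, one small random table per position, and the bitwise XOR of the looked-up entries. Table sizes and names are illustrative.

```python
import random

CHAR_BITS, C = 8, 4
SIGMA = 1 << CHAR_BITS            # alphabet size 256; tables cover the alphabet, not the universe

tables = [[random.getrandbits(32) for _ in range(SIGMA)] for _ in range(C)]

def simple_tabulation(x):
    h = 0
    for i in range(C):
        ch = (x >> (CHAR_BITS * i)) & (SIGMA - 1)   # extract the i-th 8-bit character
        h ^= tables[i][ch]                          # XOR in its table entry
    return h

print(hex(simple_tabulation(0xDEADBEEF)))
```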
Then there's this lemma: a set of keys is linearly independent if and only if simple tabulation hashing is fully random on it. The zero set is the extreme case of simple tabulation not being fully random: we had four keys, and knowing the hash values of three of them, we just XOR them to get the hash value of the fourth. So it's not 4-independent, but that is the only thing that can go wrong, nothing else. Dependent sets are nasty, but if the alphabet is not too small and we have enough characters, then a random key set is most likely to be linearly independent. So the idea is to not just use simple tabulation: we first take our keys and somehow randomize them, and if we succeed in randomizing them in such a way that the derived keys become linearly independent, and then apply simple tabulation to the derived keys, then the composition is fully random on the original set. It's a two-step process: in the first step we hope to create linearly independent derived keys, and if that succeeds, we know everything is fully random. Tornado tabulation hashing does exactly that, in a very powerful way.

It looks weird, but let me try to explain why it's not weird at all. My dream is full randomness on a set of keys X. If I want full randomness on X, I need at least that much randomness in the system; I can't get randomness out of nothing. Since the alphabet size determines how big the random tables are, it's not unreasonable to ask for the alphabet size Sigma to be at least twice the size of X; we lose a factor two, and I'm cheating because we also lose the factor c, but the point is we're not using huge amounts of randomness compared with what we get out in the end. I also want Sigma to be at least 256, and that's because I'm doing exact mathematics here, not just throwing away lower-order terms; to make the lower-order terms disappear, Sigma must not be too small, but I still wanted the 8-bit case covered, because that example is important to me. Then the theorem says that the probability that the derived keys fail to be linearly independent, so that the set is not fully random, is given by this formula, and I claim it's not too weird. One term is just vanishing, exponentially small in Sigma, so even for Sigma = 256 it disappears. Then there's the main term, and the important part is that every time I add one more derived character, the probability of a linear dependence drops by a factor of 3/Sigma, and that is optimal, exactly right, because it relates to the number of matchings you can form on four keys. Basically, what we're proving is that the worst case is exactly the example with four keys, and then it becomes a question of how many matchings you can form on them; every time I add one new derived character, the probability drops by a factor of 3/256, or much faster for a bigger alphabet. So this term is optimal; actually the whole expression is optimal except for the leading factor of seven, and I know the seven is really 0.5, but that's a newer analysis. This really is the right answer.
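Here is a simplified sketch of the two-step recipe behind tornado (and mixed) tabulation as described in the talk: extend the key with d derived characters, each produced by tabulation lookups on the characters obtained so far, then apply simple tabulation to the derived key. This illustrates the general idea only; the exact tornado construction in the paper differs in its details, and all names here are mine.

```python
import random

CHAR_BITS, C, D = 8, 4, 3
SIGMA = 1 << CHAR_BITS

def char_table():
    return [random.getrandbits(CHAR_BITS) for _ in range(SIGMA)]

# derive[j][i] maps character i of the (so-far extended) key to a contribution
# for derived character j; final[i] are the tables for the last simple tabulation.
derive = [[char_table() for _ in range(C + j)] for j in range(D)]
final = [[random.getrandbits(64) for _ in range(SIGMA)] for _ in range(C + D)]

def tornado_style(x):
    chars = [(x >> (CHAR_BITS * i)) & (SIGMA - 1) for i in range(C)]
    for j in range(D):                       # step 1: derived characters
        dc = 0
        for i, ch in enumerate(chars):       # depends on original AND earlier derived chars
            dc ^= derive[j][i][ch]
        chars.append(dc)
    h = 0
    for i, ch in enumerate(chars):           # step 2: simple tabulation on the derived key
        h ^= final[i][ch]
    return h

print(hex(tornado_style(0xDEADBEEF)))
```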
[Audience] So Sigma is the alphabet size? — Yes, for example 256 if you use 8-bit characters; that's the smallest we allow. — And is this then saying that your entire key set has to have size at most 128? — We'll get to that later. What I'm saying is that if your dream is to get full randomness on a key set of a given size, then Sigma has to be at least twice as big; you're not going to get full randomness on the set using less randomness than the set size. But the whole point is that I don't know your key set: I generate my hash function without knowing it, and with this high probability I succeed in getting full randomness on it. — And this is just one half of tornado hashing, essentially? — One of the two halves, yes; it also has some concentration bounds that we get to later. So this half creates the linear independence, and then the simple tabulation makes it fully random.

Before, I would state things with O-notation and the like, but then you hide whether the error probability is even below one. I really enjoy these exact formulas: when you want to argue that something works, here is something where you can actually plug in the numbers and see that it works with high probability, so I can trust it. And the code is not too difficult. It might have looked weird, but this is the complete computation of both the derived keys and the hash values. The funny part is that the beginning is what you would do for simple tabulation: extract the first character, look it up; and when I get to the last characters, I move the key into the hash value, and then I extract characters from the hash value and shift them out. There are some shifts and such, but it's a very doable piece of code.

And the power of this kind of hashing is not just that it is highly random when Sigma is at least twice the number of keys we work with; I claim we don't even need that, because it has a certain local uniformity. What happens is that for linear probing, to get within Knuth's bound on how well it works: Knuth has this beautiful bound that if your table is filled to a fraction 1 - epsilon, then the expected probe length is only (1 + 1/epsilon^2)/2, and this really explains why linear probing is so efficient; if you only had it with O-notation, that wouldn't explain why linear probing can beat chaining, but this kind of bound is actually quite useful when epsilon is around a half or so. And we get that we don't need the alphabet to be as big as the number of stored elements, which would be a lot of space; we just need it to be bigger than the logarithm of the number of elements divided by epsilon squared. That's because in linear probing everything happens locally: the problems do not come from all the keys, and we can easily show that only keys in a limited range around where we hash matter to us, and that suffices. It also works for vector sampling, as long as Sigma is bigger than k, the number of samples we want, times a log factor, and an annoying factor of seven that will probably disappear with a better analysis; remember, we imagine streaming a huge number of keys but only wanting k samples.
The local uniformity theorem is a very weird theorem, but I've worked a lot with these things, and it seems to be extremely useful despite being weird, so let me try to explain it. We have a tornado hash function that gives us l-bit hash values, and we have a bit mask, a string of zeros, ones, and question marks. For a given key set, the hash function and the bit mask select a subset of the keys, namely the keys whose hash values match the mask. I think of the mask as a constant used only in the analysis; it is not known to the hash function. The theorem is that if Sigma is at least 256 and at least twice the expected size of the selected set, then with high probability the free hash bits, the bits in the question-mark positions, are fully random: some of the bits are used to do the selection, but the remaining bits are fully random. If my bit mask were all question marks, I would select the whole set, the expectation would just be the set size, and we would have exactly the theorem from before. The point is that for many applications I can show that I only need the selected keys to be fully random, and for the selected keys I also have strong concentration on how many get selected; it may not look as pretty, but we do have exponential concentration bounds on the number of selected keys. For a given application you then have to check that this suffices, but my experience is that this is incredibly powerful, and sort of the best we can hope for, because you can never get more randomness than you have in your tables; here we're saying that even for random selections, we still get full randomness within the selection.

Now let me show, in the couple of minutes I have, how this works for one permutation hashing, the scheme where we throw everything into a random bucket and pick the smallest key in each bucket. If the number of elements is smaller than Sigma/2, then everything is fully random with high probability and there is nothing to discuss; the interesting case is when we have a lot of keys and only want a few samples, so that's what I focus on, and in that case we expect one permutation hashing to fill all the buckets in the first round, which is all we have to worry about. So I use tornado tabulation with an alphabet of size roughly k times the logarithm of one over the desired error probability. Now here's how we do the selection: a key is selected if its hash value has a certain number of leading zeros; we don't care about the bucket bits and we don't care about the rest, so this is exactly a bit mask of the kind we just talked about. We're going to argue a few things, as below. First, the sampled keys are a subset of the selected keys; this means no other keys matter, that a key that isn't selected cannot affect the result, so there is a parallel between what tornado hashing does and what fully random hashing does: both just focus on the selected keys, and the rest don't matter. Second, the selection has almost the same distribution of red and blue keys under tornado hashing as under fully random hashing, thanks to the concentration bounds. Third, the remaining bits are completely fully random in both cases.
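Here is a tiny simulation of the selection idea sketched above: a key is "selected" if its value hash falls below a threshold (standing in for the leading-zeros bit mask), each bucket keeps the smallest-valued key, and as long as every bucket contains a selected key, every sampled key is automatically selected. All numbers and names are illustrative.

```python
import random

def make_hash():
    memo = {}
    def h(x):
        if x not in memo:
            memo[x] = random.random()
        return memo[x]
    return h

k = 64
bucket_h, value_h = make_hash(), make_hash()
keys = range(100_000)                 # many keys, few samples: the interesting case
threshold = 8 * k / 100_000           # expect ~8 selected keys per bucket, so all buckets hit w.h.p.

selected = {x for x in keys if value_h(x) < threshold}   # the "bit mask" selection
sample = [None] * k
for x in keys:
    b = int(bucket_h(x) * k)
    if sample[b] is None or value_h(x) < value_h(sample[b]):
        sample[b] = x                 # one permutation hashing: min-value key per bucket

buckets_hit = {int(bucket_h(x) * k) for x in selected}
print(len(selected), "selected keys;",
      "all buckets hit:", len(buckets_hit) == k, ";",
      "all samples selected:", all(s in selected for s in sample))
```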
Then we have the usual argument that we are just doing fully random sampling: everything depends only on these bits, and since it is fully random which key gets which hash value, we get a fully random sample, and therefore the distribution of red and blue keys in the sample is almost the same under tornado hashing as under fully random hashing. We have to go through a couple of steps, but these are the steps. I picked the number of selection bits so that the expected number of selected keys is half the alphabet size, because that is what gives tornado its power: by the local uniformity theorem, with high probability all the free hash bits, the ones in the question-mark positions, are fully random, and we also have Chernoff bounds saying the number of selected keys is strongly concentrated around its expectation. This has a nice consequence: since the free bits are fully random, the buckets of the selected keys are fully random. Now we're in the classic situation: we can look at any bucket and ask for the probability that it contains no selected key; that is just the usual small probability, because the selected keys, of which we have many with high probability, are distributed fully at random, so with high probability there are selected keys in all buckets. And that's all that matters, because then the smallest key in each bucket must itself be selected, so all samples are selected. I think that's enough detail; the rest is just using the fact that we could select exactly what was relevant for the application, and from then on the behavior is fully random.

Let me skip to a little bit about the experiments, actually run with mixed tabulation, which is not quite as powerful as tornado, and the running times we got. Here we have the really simple hash functions that are only two-independent; then something like MurmurHash, which is fairly popular for things that are supposed to look random; we know it's not crypto, we know it's broken and so on, but it's a popular generic pseudorandom choice. We actually see that mixed tabulation is more powerful, and the strength of mixed tabulation is that we have provable guarantees, things actually work, so I would love to see it as a generic replacement for MurmurHash. We also have things like BLAKE2, which has crypto properties, but you can see it's almost 100 times slower, so you don't want to use that. Crypto people will always say, why don't you use a crypto function, you'll become very rich or famous if you can break it; but you wouldn't even know that you broke it, because if all you see are estimates, you don't know they're wrong, so that's not much of a benefit, and the other thing is, why take something so slow when you have something that actually works, provably? Then there were these concentration experiments, where mixed tabulation works as well as the other schemes and much better than polynomial hashing and the rest. Then we have experiments with locality-sensitive hashing; I don't even remember exactly what the measure is, I just remember that being small is good, and mixed tabulation actually performs better than the other schemes even inside locality-sensitive hashing. That was a surprise for us: with such a complicated pipeline we thought there would be so much noise that you couldn't even see the
difference, but it was very distinct on these standard collections. And the last thing is feature hashing, where we also saw a pretty big difference between using this tornado-style tabulation and the other schemes.

So, to sum up: it's simple, it's pretty efficient, and it guarantees fully-random-like behavior for many things. It can be used for other things too: we've shown it works for fast Johnson-Lindenstrauss transforms, so for dimensionality reduction, and it also seems to work for feature hashing, though I haven't been able to prove that yet; the math gets pretty complicated for some of this. But I think it's much better than using these too simple hash functions that rely on randomness in the input. So that's it.