You find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items are arriving every minute and you’d really like a way of finding similar items that already exist in your dataset (either for duplication detection or finding related items).

Clearly we don’t want to scan our entire existing database of items every time we receive a new item but how do we avoid doing so? Minhashing to the rescue!

#### Minhashing

Minhashing is a technique for generating signatures for items. The more similar two items are, the higher the chance that they share the same minhash. Of course, we need a proper definition of similarity for this to be useful.

##### Similarity

For this we use the Jaccard Similarity:

That is, the Jaccard Similarity is the number of elements two sets have in common divided by the total number elements in both sets. Obviously, a similarity of 0 means two sets contain no elements in common and a similarity of 1 indicates the sets contain the same elements.

If we treat our items as sets and the item attributes (such as tags or keywords) as the set elements then we can use the above definition to compute the similarity between two items. It’s worth pointing out here that we are making the assumption that an item attribute can only appear once in an item, there are ways of dealing with this later though.

A quick example. If we have two items { cat, dog, fish, banana } and { lemon, fish, dog, cat, cake } then the Jaccard Similarity would be:

Now we have a way of computing the similarity between two sets, let us put it to one side and consider the actual process of generating a minhash. Before we get started, we need a hash function. Your programming language of choice’s standard library will almost certainly have a few acceptable ones within arm’s reach. We’ll call it . The actual process goes as follows:

- Set the to infinity
- For each element in the set S:
- Compute the element’s hash
- If then

- Return , the minimum hash

Simple. How does it work though?

First, let’s consider the hash function . The function hashes a value to one of possible buckets. We assume that the number of buckets greatly exceeds the number of elements or features we’re likely to see, . With a good hash function, the probably of a value being hashed to any one of the buckets is .

Now consider a set which we’re trying to compute the minimum hash for. Since each element from set has an equal probability of being hashed in to any one of the buckets, the probability that any given element is the minimum hash for the set is where is the number of elements in set .

Let’s introduce a second set B. What is , the probability that both sets share the same minimum hash? Since any element from combined elements of both sets has an equal probability of being the minimum hash then is equal to the ratio of the number of elements common to both sets over the total combined elements:

which is the Jaccard Similarity!

What we’ve worked out above is that if we pick an arbitary hash function (picked at random from a particular family, for example) and calculate the minimum hashes for two sets then the probability that they share the same minimum hash is equal to the ratio of their common elements to their total elements.

##### Practical usage

For a key-value database we can use minhashes for keys, storing a list of the keys for the items that resulted in that minhash as the value. I use plural for minhashes because by storing multiple minhashes for a particular item, it is possible to fine-tune the probabilities of minhash matches towards a particular goal.

If speed is a concern for the particular application then minimising false positives is crucial. For this, combining multiple minhashes such that two sets only share the resulting combination if all of their minhashes match, can allow a tuning of sensitivity.

If two sets have a Jaccard Similarity of and minhashes are generated for different hash functions then the probability of a match is simply . Choice of obviously depends on the desired characteristics of the application.

Alternatively, if ensuring a low false-negative is crucial then using minhashes generated from hash functions and selecting based on any match (effectively, the OR) will give a probability:

Obviously, these two strategies are not mutually exclusive and give you a range of parameters to tune depending on your application.