Data analysis frequently requires the detection of similarity between items in large transactional data sets. Canopy and k-means clustering perform well with physical data, but transactional data often requires less restrictive forms of analysis. Locality-sensitive hashing, or minhash, is a particularly effective way of clustering items based on the Jaccard metric of similarity.
Minhash assigns a pair of users to the same cluster with probability proportional to the overlap between the set of items that they have bought. Each user u is represented by a set of items that he or she has bought. The similarity between users u i and u j is defined as the overlap between their item sets, given by the intersection of the item sets divided by the union of the item sets. This quotient is called the Jaccard coefficient or Jaccard metric.
Minhash calculates one or more cluster identifiers for each user as the hash value (s) of a randomly chosen item from a permutation of the set of items that the user has bought. With a universal class of hash functions, the probability that two users are hashed to the same cluster identifier is their Jaccard coefficient (S).
If cluster identifiers are formed by concatenating p hash values, each generated by hashing a random item from the item set with multiple hash functions, then the probability that any two users have the same hash key is Sp.
If each user is assigned to multiple clusters, the probability that two users have the same hash key increases, causing more effective clustering. Therefore, minhash computes several cluster identifiers for each user. Minhash produces each cluster identifier by selecting an item from the user’s item set, hashing it with each of several hash functions, and concatenating p hash values. Therefore, p (the number of key groups) must be a divisor of the number of hash functions. (The item that minhash selects from the item set is the one that produces the minimum hash value for a particular hash function, hence the name of the algorithm.)