Finding similar items using minhashing

You find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items arrive every minute, and you’d really like a way to find similar items that already exist in your dataset, either to detect duplicates or to surface related items. Clearly we don’t want to scan […]