Suppose you have a very large dataset - far too large to hold in memory - with duplicate entries. You want to know how many duplicate entries there are, but your data isn't sorted, and it's big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It's easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.
I'd encourage you to give this a bit of thought before reading onwards, because the algorithms we'll discuss today are quite innovative - and while simple, they're far from obvious.
A simple and intuitive cardinality estimator
Let's launch straight in with a simple example. Suppose someone generates a dataset with the following procedure:
- Generate n evenly distributed random numbers
- Arbitrarily replicate some of those numbers an unspecified number of times
- Shuffle the resulting set of numbers arbitrarily
How can we estimate how many unique numbers the resulting dataset contains?
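One way to build intuition is to simulate the procedure above and try the simplest possible estimator: since the values are evenly distributed, the smallest value we observe tells us roughly how densely the range is populated. The sketch below assumes values uniform in [0, 1), where n unique values have an expected minimum near 1/(n+1), so 1/min - 1 estimates n; the function names are mine, not part of any established API.

```python
import random

def estimate_unique(values):
    """Estimate the number of unique uniform random values in [0, 1)
    by looking only at the minimum value seen.

    With n unique, evenly distributed values, the smallest one is
    expected to fall near 1 / (n + 1), so 1 / min - 1 estimates n.
    Duplicates don't change the minimum, so they don't bias the result.
    """
    smallest = min(values)
    return (1.0 / smallest) - 1

# Simulate the dataset-generation procedure described above.
random.seed(42)
n = 1000
unique = [random.random() for _ in range(n)]  # n uniform random numbers
data = unique * 3                             # replicate each number
random.shuffle(data)                          # shuffle arbitrarily

print(round(estimate_unique(data)))
```

Note that the estimator only ever needs the single smallest value, so it runs in one pass over the data with constant memory - exactly the regime the problem statement demands - though a single minimum is a very noisy estimate.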