Suppose you have a very large dataset - far too large to hold in memory - with duplicate entries. You want to know how many duplicate entries there are, but your data isn't sorted, and it's big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It's easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.
I'd encourage you to give this a bit of thought before reading onwards, because the algorithms we'll discuss today are quite innovative - and while simple, they're far from obvious.
A simple and intuitive cardinality estimator
Let's launch straight in with a simple example. Suppose someone generates a dataset with the following procedure:
- Generate n evenly distributed random numbers
- Arbitrarily replicate some of those numbers an unspecified number of times
- Shuffle the resulting set of numbers arbitrarily
How can we estimate how many unique numbers the resulting dataset contains?
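One way to build intuition is to simulate the procedure above and try the simplest possible estimator: since the values are evenly distributed, the smallest value we observe tells us roughly how densely the range is populated. The sketch below assumes values uniform in [0, 1), where n unique values have an expected minimum near 1/(n+1), so 1/min - 1 estimates n; the function names are mine, not part of any established API.

```python
import random

def estimate_unique(values):
    """Estimate the number of unique uniform random values in [0, 1)
    by looking only at the minimum value seen.

    With n unique, evenly distributed values, the smallest one is
    expected to fall near 1 / (n + 1), so 1 / min - 1 estimates n.
    Duplicates don't change the minimum, so they don't bias the result.
    """
    smallest = min(values)
    return (1.0 / smallest) - 1

# Simulate the dataset-generation procedure described above.
random.seed(42)
n = 1000
unique = [random.random() for _ in range(n)]  # n uniform random numbers
data = unique * 3                             # replicate each number
random.shuffle(data)                          # shuffle arbitrarily

print(round(estimate_unique(data)))
```

Note that the estimator only ever needs the single smallest value, so it runs in one pass over the data with constant memory - exactly the regime the problem statement demands - though a single minimum is a very noisy estimate.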