'Most popular' metrics in App Engine

One useful and common statistic to provide to users is a metric akin to "7 day downloads" or "7 day popularity". This appears across many sites and types of webapp, yet the best way to do this is far from obvious.

The naive approach is to record each download individually, and use something akin to "SELECT count(*) FROM downloads WHERE item_id = 123 AND download_date > seven_days_ago". But this involves counting each download individually: O(n) work in the number of downloads! Caching the count is an option, but still leads to an excessive amount of work at read time.
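
As a sketch, the naive approach might look something like this (the Download model and seven_day_downloads function are illustrative, not from any particular codebase):

import datetime

from google.appengine.ext import db

class Download(db.Model):
  # One entity per download event.
  item_id = db.IntegerProperty(required=True)
  download_date = db.DateTimeProperty(auto_now_add=True)

def seven_day_downloads(item_id):
  # Counts every matching entity - O(n) in the number of downloads.
  cutoff = datetime.datetime.now() - datetime.timedelta(days=7)
  q = (Download.all()
       .filter('item_id =', item_id)
       .filter('download_date >', cutoff))
  return q.count()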

Another option is to maintain an array of daily download counts, keeping the last 7. This is an improvement from a workload point of view, but leads to either discontinuities at the start of a new day, or to all counts being updated only once per day.
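
A minimal sketch of that approach, assuming a hypothetical per-item model with a rolling list of daily counts and a once-a-day job to rotate it:

class DailyCounts(db.Model):
  # counts[0] is today's running total; counts[6] is from six days ago.
  counts = db.ListProperty(long, default=[0] * 7)

  def record_download(self):
    self.counts[0] += 1

  def start_new_day(self):
    # Run once per day, e.g. from cron: the oldest day falls off the end.
    self.counts = [0] + self.counts[:-1]

  @property
  def weekly_total(self):
    return sum(self.counts)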

There is a third option, however, which has the performance of the second option, with the responsiveness of the first. To use it, however, we have to reconsider slightly what we mean by '7 day popularity'. The solution in question is to use an exponential decay process. Each time an event happens, we increase the item's popularity figure, which continuously decays over an extended period, ensuring that the figure provides a representation of recent popularity.

In order to figure out at what rate our popularity figure should decay, we can look to the 'mean lifetime' formula. If we think about the current popularity as a set, the mean lifetime figure allows us to calculate a decay rate such that an element (for example, an individual download) will remain in the set for a specified average amount of time - for example, 7 days.

The relevant formula is N(t) = N0 * e^(-t/τ), where N0 is the value at the start, t is the amount of time elapsed, and τ is the mean lifetime. (The decay rate is simply the inverse of the mean lifetime.) Here's how we calculate it in practice:

import datetime
import math

def timedelta_to_seconds(delta):
  # 86400 seconds per day.
  return delta.days * 86400 + delta.seconds + delta.microseconds / 1000000.0

MEAN_DOWNLOAD_LIFETIME = timedelta_to_seconds(datetime.timedelta(days=7))

def decay(value, elapsed):
  # What fraction of the mean lifetime has elapsed?
  decay_fraction = timedelta_to_seconds(elapsed) / MEAN_DOWNLOAD_LIFETIME
  return value * math.exp(-decay_fraction)

In the above code, we define MEAN_DOWNLOAD_LIFETIME as the number of seconds in 7 days, and the function decay 'decays' a value, given the amount of time elapsed since the value was last evaluated.
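
As a quick sanity check: after exactly one mean lifetime has elapsed, a value should have decayed to 1/e (about 37%) of its original magnitude:

print decay(1.0, datetime.timedelta(days=7))  # ~0.3679, i.e. 1/e
print decay(1.0, datetime.timedelta(days=1))  # ~0.8669, i.e. e ** (-1/7.0)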

Of course, in practice we can't have a value that continuously decays, much less store one in the datastore. The solution is to use an approximation: We will store the current value of the popularity metric, along with the time at which we last applied the decay function. Each time we increment the popularity metric, we'll first apply the decay function to bring it up to date. Here's an example model:

from google.appengine.ext import db

class File(db.Model):
  popularity = db.FloatProperty(required=True, default=0.0)
  last_decay = db.DateTimeProperty(required=True, auto_now_add=True)

  def update_popularity(self, delta):
    # Decay the stored figure to the present, then add the new events.
    now = datetime.datetime.now()
    self.popularity = decay(self.popularity, now - self.last_decay) + delta
    self.last_decay = now

The standard caveats when dealing with frequently updated values apply here: update_popularity should be called inside a transaction that fetches and puts the File object, and depending on the expected update frequency, you should use a technique to alleviate contention issues when updating the record - my unsharded counter recipe would be a good choice here.
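
As a minimal sketch (ignoring contention mitigation entirely), the transactional update might look like this; record_download and its argument are illustrative names, not part of the model above:

def record_download(file_key):
  def txn():
    # Fetch, update and put the File atomically.
    f = File.get(file_key)
    f.update_popularity(1)
    f.put()
  db.run_in_transaction(txn)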

The solution we have now, then, allows us to search and rank based on popularity in an efficient manner, at the cost of using a slightly different metric (exponential decay of popularity, rather than absolute download counts). You've probably realized that the ranking may not be entirely fair, however: A slightly less popular, but less recently updated entity may be ranked higher than a more popular but more recently updated entity. For the most part, this will be taken care of by regular updates to the entities, but we still need a way to clean up after the stragglers. This can be achieved with a scheduled task that applies the decay function to any particularly out-of-date entities:

def decay_old_entities():
  # Apply the decay function to entities that haven't been touched recently.
  age_threshold = datetime.datetime.now() - datetime.timedelta(hours=6)
  q = File.all().filter('last_decay <', age_threshold).order('last_decay')
  entities = q.fetch(500)
  for entity in entities:
    entity.update_popularity(0)
  db.put(entities)

Note that we're not bothering with a transaction here. The reasoning: any entity that hasn't been updated for 6 hours is unpopular enough that conflicts are unlikely, and in the event of a conflict the other update will win out, since it's transactional and is doing our work for us in any case.
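
For completeness, here's a sketch of the cron.yaml entry that might drive this task, assuming decay_old_entities is exposed via a request handler at a hypothetical /tasks/decay URL:

cron:
- description: decay the popularity of stale entities
  url: /tasks/decay
  schedule: every 6 hours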

A little room for further optimization still exists: every time the process above runs, it has to work its way through 'long tail' entries that haven't seen an update in a very long time, further decaying their popularity. We probably don't care much about the relative ranking of an item with a popularity of 0.01 versus one with 0.012, though, so there's no need to waste time updating them every 6 hours. We can avoid this by setting a threshold on what is considered popular enough to update, and skipping anything below it.

Since we can't do inequality queries on two properties (the last_decay and popularity fields), we need to create a new field that indicates whether the threshold is met. Updating our model, we can make use of aetycoon's TransformProperty for this:

import aetycoon

class File(db.Model):
  popularity = db.FloatProperty(required=True, default=0.0)
  last_decay = db.DateTimeProperty(required=True, auto_now_add=True)
  is_popular = aetycoon.TransformProperty(popularity, lambda p: p >= 1.0)
  # ...

And we make a straightforward change to our maintenance function to exclude entities too unpopular to warrant consideration:

def decay_old_entities():
  age_threshold = datetime.datetime.now() - datetime.timedelta(hours=6)
  # Combining an equality filter with an inequality filter and a sort order
  # requires a composite index on (is_popular, last_decay) in index.yaml.
  q = (File.all()
       .filter('is_popular =', True)
       .filter('last_decay <', age_threshold)
       .order('last_decay'))
  entities = q.fetch(500)
  for entity in entities:
    entity.update_popularity(0)
  db.put(entities)

If you want to see a system like this in action, this is exactly what my site, netboot.me, employs for its popularity metrics. The source is available here.
