Damn Cool Algorithms: Cardinality Estimation

Suppose you have a very large dataset - far too large to hold in memory - with duplicate entries. You want to know how many duplicate entries, but your data isn't sorted, and it's big enough that sorting and counting is impractical. How do you estimate how many unique entries the dataset contains? It's easy to see how this could be useful in many applications, such as query planning in a database: the best query plan can depend greatly on not just how many values there are in total, but also on how many unique values there are.

I'd encourage you to give this a bit of thought before reading onwards, because the algorithms we'll discuss today are quite innovative - and while simple, they're far from obvious.

A simple and intuitive cardinality estimator

Let's launch straight in with a simple example. Suppose someone generate a dataset with the following procedure:

  1. Generate n evenly distributed random numbers
  2. Arbitrarily replicate some of those numbers an unspecified number of times
  3. Shuffle the resulting set of numbers arbitrarily

How can we estimate how many unique ...

There's nothing quite like...

Witness the product of an idle hour or two at work, and time spent pondering the meaninglessness of the cliche "There's nothing quite like...":


Powered by App Engine. Naturally.

Migrating to Python 2.7, part 2: Webapp and templates

In part 1 of the migration series, we discussed changes to your app to take advantage of the 2.7 runtime's support for multithreading. Today, we'll start looking at the changes to the included libraries, and how we'll have to modify our app to use them.


First up is the replacement of App Engine's lightweight webapp framework with webapp2. Webapp2 is an external open-source project, which says it "follows the simplicity of webapp, but improves it in some ways: it adds better URI routing and exception handling, a full featured response object and a more flexible dispatching mechanism". Webapp2 does an excellent job of maintaining compatibility with webapp, which is why it was chosen as a replacement, but it also adds a number of useful improvements on it.

To see what's changed in concrete terms, let's assume we have a very straightforward app that looks something like this:

import os

from google.appengine.ext import webapp
from google.appengine.ext.webapp import template

class BaseHandler(webapp.RequestHandler):
  def render_template(self, filename, **template_args):
	path = os.path.join(os.path.dirname(__file__), 'templates', filename)
	self.response.out.write(template.render(path, template_args))

class IndexHandler(BaseHandler ...

Migrating to Python 2.7, part 1: Threadsafe

With the recent experimental release of Python 2.7 support, many people are starting to move their apps from the 2.5 runtime to 2.7. In this series of posts, I'll go over the various considerations for migrating your app in detail, starting with the most immediately obvious - and arguably biggest impact - of them all: threadsafe and multithreaded Python apps.

As you're probably aware, the 2.7 runtime supports making your Python app multithreaded, meaning a single instance of the app may service multiple user requests at the same time. Due to the Global Interpreter Lock, a multithreaded Python app still has limited concurrency, but since most of the wallclock time of a typical App Engine app is spent waiting for RPCs - during which the GIL is not held - the parallelism, and corresponding improvements in utilization, can be substantial.

Moving to Python 2.7 and enabling multithreading is pretty straightforward. First, you have to update your app.yaml. Suppose the start of your app.yaml looks something like this:

application: myapp
version: main
runtime: python
api_version: 1

To switch to Python 2.7, you change it to this (changes are highlighted):

application: myapp
version: main
runtime: python27 ...

Using the Channel API on App Engine for instant traffic analysis

In last week's post, we introduced Clio, a system for getting live insight into your site's traffic patterns, and we described how the Prospective Search API lets us filter the site's traffic to get just the records we care about.

This week, we'll cover the other part of the system: delivering results in real-time. For this, we'll be using the Channel API to stream new log entries to admin users in real-time. As with last week's post, where there's differences between our demo implementation and what you'd use in a real-world system, I'll point those out.

The admin interface

First up, we need to provide a simple admin interface to which we'll stream results. Here's the handler for that:

class IndexHandler(webapp.RequestHandler):
  """Serve up the Clio admin interface."""

  def get(self):
    client_id = os.urandom(16).encode('hex')
    channel_key = channel.create_channel(client_id)
    template_path = os.path.join(os.path.dirname(__file__),
                                 'templates', 'index.html')
    self.response.out.write(template.render(template_path, {
        'config': config,
        'client_id': client_id,
        'channel_key': channel_key,

The only thing of significance we do here relates to the Channel API. First, we generate a random client ID by getting some ...

Using the Prospective Search API on App Engine for instant traffic analysis

One of the really interesting new APIs released as part of App Engine recently is the Prospective Search API. Prospective search inverts the usual search paradigm, where you have a database of documents, and search queries match on those documents. In Prospective Search, you instead have a list of persistent search queries, and as new documents are created or updated, you match them against the queries. Twitter's live search interface is a good example of Prospective Search in action.

Today, in the first of a two post series, we'll be trying out the Prospective Search API with a sample application, Clio. Clio, named after muse of history, is designed to give administrators insight into the actual live traffic being served by their app. With it, you can see user request logs as they occur, and apply filters so you only see the hits that interest you - invaluable on a heavily trafficed site. Mystefied where people are getting to that 404 page from, and don't want to wait 12 hours for the analytics? Clio can help.

In this post, we'll go over the details of how to use the Prospective Search API to construct the server-side, query ...

Demystifying the App Engine request logs

People often ask about the App Engine request logs shown in the admin console: What do all the fields mean? How should they be interpreted? They're actually fairly easy to read once you know the format, so here's a quick-reference to help.

The basic format is based on the Apache combined log format, which is so widely used even outside Apache that it's become a de-facto standard of sorts for webserver request logs. In addition to that, App Engine adds a number of extra fields logging additional, App Engine specific stats. Suppose you're examining the following log entry: - foobiz [10/May/2011:17:26:28 -0700] "GET /page.html HTTP/1.1" 500 2297 "http://www.example.com/home.html" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4,gzip(gfe),gzip(gfe),gzip(gfe)" "www.example.com" ms=364 cpu_ms=23 api_cpu_ms=0 cpm_usd=0.001059

Here's what the fields mean, in order:

  1. Client's IP address (
  2. The RFC1413 identity of the client (in practice, always '-')
  3. The userid determined ...

Using the SM130 RFID/NFC reader

Long time no blog, I know - and this post isn't even my usual. Don't worry, more posts are coming soon!

I've recently been playing around with Arduino, and with NFC tags and readers. Sparkfun (and their Aussie distributors, Little Bird Electronics sell a rather nifty NFC reader/writer, the SM130, by Sonmicro, along with a corresponding Arduino Shield for it. It has a fairly easy to use serial protocol, and supports both regular UART serial and I2C.

In order to use I2C, though, you need to flash the reader with a new firmware. Sonmicro will provide you the firmware for free, b ut the application to flash it to the device is Windows only, which puts a bit of a crimp in the plans of those of us who don't use that platform. I wanted to use the reader in I2C mode, so I decided to try and solve this not just for myself, but anyone else in the same situation.

First step was to obtain the firmware. Sonmicro promptly sent it to me when I asked, and even kindly agreed to allow me to redistribute it - so here it is, in a gist.

The first ...

Modeling relationships in App Engine

One source of difficulty for people who are used to relational databases - and certain ORMs in particular - is how to handle references and relationships on App Engine. There's two basic questions here: First, what does a relationship entail, in any database system? And second, how do we use them in App Engine?

The nature of relationships

Many ORMs expose multiple 'types' of relationships as first-class entities - one-to-one, one-to-many, and many-to-many. This obscures the fact that, in reality, they are all built on the same building block, that of references. A reference is simply a field of an entity that contains the key of another entity - for example, if a Pet references an Owner, that simply means the Pet has a field that contains the key of its owner.

All relationship types simply devolve to references. A one-to-many relationship is a reference in its simplest form - each Pet has one Owner, and therefore an Owner can have multiple pets, all pointing to it. The Owner is not modified; it relies on the individual Pets naming it as the owner. One-to-one relationships are one-to-many relationships with the additional constraint that there will be only one Pet referencing each Owner; this is ...

Under the hood with App Engine APIs

In the past, I've discussed various details of the way various App Engine APIs work under the hood. If you've used certain tools, such as Appstats, too, you probably already have a basic overview of how the App Engine APIs function. Today, we'll take a closer look at the interface between the App Engine runtime and its APIs and how it works, and learn what that means for the platform.

If you're interested in App Engine purely to write straightforward webapps, you can probably stop reading now. If you're interested in low-level optimisations, or in the platform itself, or you want to write a library or tool that tinkers with the innermost parts of App Engine, then read on!

The generic API interface

Ultimately, every API call comes down to a single generic interface, with 4 arguments: the service name (for example, 'datastore_v3' or 'memcache'), the method name (for example, 'Get' or 'RunQuery'), the request, and the response. The request and response components are both Protocol Buffers, a binary format widely used at Google for exchanging structured data between processes. The specific type of the request and response protocol buffers for an API call depend ...