They're almost here!

The App Engine USB drives - in bright primary, Google(tm) colors - are finished! They're currently winging their way to the Dublin office (and a separate batch direct to the .astronomy venue. Can't wait to get my hands on them.

Want to get your hands on one of them too? Post a suggestion for what topic you'd like to see me write about - be it App Engine, Go, Damn Cool Algorithms, or something else - and I'll send a USB drive, loaded with App Engine goodies and a Wave invite, to the authors of the best few suggestions.

I'm going to be at .astronomy all next week, so I'm not going to be putting up new posts on my regular schedule. I will, however, be blogging about the conference, so look out for posts on some of the more interesting talks and breakout sessions/hackathons.

In one final, unrelated item, I'd like to draw your attention to an amazing bit geekery. Last week, I posted this code golf competition to Stack Overflow, for the shortest Fractran interpreter. As an extra challenge, I offered a bonus to anyone who could provide a Fractran interpreter in fractran ...

Implementing a DHT in Go, part 2

In the previous post, we started a straightforward implementation of a Kademlia Distributed Hash Table in Go. Today, we'll add in the real meat - interaction between peers to find nodes.

First, though, a correction to the previous post. There were a couple of errors in it, the most notable of which is that the routing table's FindClosest method failed to order the returned results correctly. The original implementation ordered results by their absolute node ID, but the correct ordering is by their ID xor the target ID. To my shame, my rudimentary unit tests did not catch thils. This is now fixed in the original article with the introduction of a 'ContactRecord' struct.

Let's start by defining a Kademlia struct to hold information about a Kademlia network:

type Kademlia struct {
  routes *RoutingTable;
  NetworkId string;
}

func NewKademlia(self *Contact, networkId string) (ret *Kademlia) {
  ret = new(Kademlia);
  ret.routes = NewRoutingTable(self);
  ret.NetworkId = networkId;
  return;
}

Note the presence of the 'NetworkId' field in the above code. This is an arbitrary string that should be unique for each deployment of our Kademlia implementation, to prevent different instances of the network merging together.

Go supports RPC calls using its built in ...

Implementing a DHT in Go, part 1

In order to further explore the capabilities and limitations of Go, I thought it would be interesting to try implementing something that was practical, non-trivial, and of interest on its own. With that in mind, we're going to spend the next few posts creating an implementation of the Kademlia DHT in Go.

A DHT, or Distributed Hash Table is a peer-to-peer networking primitive that, in its most basic formulation, permits storage and lookup of key, value pairs - that is, it's a hash table that's distributed across many nodes. The name is not entirely accurate for some newer DHTs, as some formulations permit many other operations besides those focused around data storage and retrieval.

Kademlia is a good example of a basic DHT, because unlike some competing algorithms, it's extremely simple. There are no explicit routing update messages, and the internal state it maintains is fairly straightforward and easy to understand. Lookups are also accomplished in an obvious - yet very efficient - manner. In return for this simplicity, Kademlia sacrifices a few of the features of competitors like Pastry and Chord - it's not as practical to implement other primitives such as pubsub over it.

The reference we ...

Enforcing data isolation with CurrentDomainProperty

In a previous post, we described how to implement API call hooks, and demonstrated a common use-case: Separating the datastore by domain, for multi-tenant apps.

It's not always the case that you want to partition your entire datastore along domain or user lines, however. Sometimes you may want to have only some models with restricted access per-domain, with others being common across all domains. You might also want a way to ensure that users can't read or modify each others' data. Fortunately, there's a way to implement all this at a higher level: Instead of defining API call hooks, we can define custom datastore properties to do the job for us.

Here's an implementation of a CurrentDomainProperty:

class InvalidDomainError(Exception):
  """Raised when something attempts to access data belonging to another domain."""


class CurrentDomainProperty(db.Property):
  """A property that restricts access to the current domain."""

  def __init__(self, allow_read=False, allow_write=False, *args, **kwargs):
    self.allow_read = allow_read
    self.allow_write = allow_write
    super(CurrentDomainProperty, self).__init__(*args, **kwargs)

  def __set__(self, model_instance, value):
    if not value:
      value = unicode(os.environ['HTTP_HOST'])
    elif (value != os.environ['HTTP_HOST'] and not self.allow_read
          and not users.is_current_user_admin()):
      raise InvalidDomainError(
          "Domain '%s' attempting ...

Recursion and concurrency with Go

Eager to jump on the bandwagon, I've been reading up on Go, the new language released by a group at Google. I had nothing to do with the development of the language, but several things about it interest me, in particular, its approach to concurrency.

Some time ago I came across a rather neat language (or rather, language extension) called cilk. Cilk is an extension of ANSI C with concurrency primitives. It implements a concurrency model based on 'spawning' functions, returning a deferred result, then 'syncing' in order to obtain the results. Most interestingly, cilk uses a 'work stealing' scheduler, which means that spawning a function is nearly as cheap as simply calling it.

One of the demonstrations of cilk which really captured my imagination was using it to search a game tree for perfect knowledge games such as chess. Parallelizing tree search is generally a really hard problem, because most of the function invocations only do a small amount of work, and context switching and thread creation overhead often overwhelms any benefits gained from parallelization. Cilk's work stealing strategy made it possible to parallelize these algorithms in an intuitive fashion and still see performance improvements from multiple ...

App Engine Java API call hooks

In a previous post, we discussed API call hooks for Python. It's possible to hook and modify RPC calls in Java, too. In this post, we'll demonstrate how.

All API calls in Java are handled by the class com.google.apphosting.api.ApiProxy. This class behaves similarly to the ApiProxyStubMap in the Python SDK, having makeSyncCall and makeAsyncCall functions that take care of invoking the relevant API calls. Unlike the Python runtime, however, all Java API calls are handled by a single delegate class, defined by the interface ApiProxy.Delegate. The active delegate for an App Engine app can be retrieved with ApiProxy.getDelegate, and set with ApiProxy.setDelegate.

Another difference from the Python API is that API calls are serialized into byte arrays before being passed to the ApiProxy, and responses are likewise deserialized by the caller. As a result, the makeSyncCall method takes a byte array as an argument, and returns a byte array.

Because all calls are routed to a single Delegate, the granularity of hooks supported is restricted to hooking all API calls, or none. To make things easier, we'll define a Delegate implementation that dispatches API calls to other Delegate subclasses based ...

Shiny!

Shiny, and headed my way (hopefully) in time for .astronomy:

Python Gotchas

A lot of App Engine developers are fairly new to Python as well, and so probably haven't encountered a few subtle 'gotchas' about the Python programming language. This post aims to sum up the ones you're most likely to encounter while programming for App Engine.

Mutable default arguments

First up is something every Pythonista learns sooner or later. What's wrong with this snippet?

def append(value, l=[]): l.append(value) return l

API call hooks for fun and profit

API call hooks are a technique that's reasonably well documented for App Engine - there's an article about it - but only lightly used. Today we'll cover some practical uses of API call hooks.

To start, let's define a simple logging handler. This can be useful when you're seeing some odd behaviour from the datastore, especially when you're using a library that may be modifying how it works. You can also use it to log any other API, such as the URLFetch or Task Queue APIs. For this, we just need a post-call hook:

def log_api_call(service, call, request, response): logging.debug("Call to %s.%s", service, call) logging.debug(request) logging.debug(response)

To install this for the datastore, for example, we call this:

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append( 'log_api_call', log_api_call, 'datastore_v3')

The arguments to Append are, in order, a name for the hook, the hook function itself, and, optionally, the service you want to hook. If you leave out the last argument, the hook is installed for all API calls.

Once a hook is installed, it remains installed as long as the runtime is loaded - including across requests, and for requests to ...

Damn Cool Algorithms: Spatial indexing with Quadtrees and Hilbert Curves

Last Thursday night at Oredev, after the sessions, was "Birds of a Feather" - a sort of mini-unconference. Anyone could write up a topic on the whiteboard; interested individuals added their names, and each group got allocated a room to chat about the topic. I joined the "Spatial Indexing" group, and we spent a fascinating hour and a half talking about spatial indexing methods, reminding me of several interesting algorithms and techniques.

Spatial indexing is increasingly important as more and more data and applications are geospatially-enabled. Efficiently querying geospatial data, however, is a considerable challenge: because the data is two-dimensional (or sometimes, more), you can't use standard indexing techniques to query on position. Spatial indexes solve this through a variety of techniques. In this post, we'll cover several - quadtrees, geohashes (not to be confused with geohashing), and space-filling curves - and reveal how they're all interrelated.

Quadtrees

Quadtrees are a very straightforward spatial indexing technique. In a Quadtree, each node represents a bounding box covering some part of the space being indexed, with the root node covering the entire area. Each node is either a leaf node - in which case it contains ...