Implementing a DHT in Go, part 1

In order to further explore the capabilities and limitations of Go, I thought it would be interesting to try implementing something that was practical, non-trivial, and of interest on its own. With that in mind, we're going to spend the next few posts creating an implementation of the Kademlia DHT in Go.

A DHT, or Distributed Hash Table, is a peer-to-peer networking primitive that, in its most basic formulation, permits storage and lookup of key/value pairs - that is, it's a hash table that's distributed across many nodes. The name is not entirely accurate for some newer DHTs, as some formulations permit many other operations beyond those focused on data storage and retrieval.
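To make the idea concrete, here's a deliberately naive sketch (in Python, purely for illustration - it has nothing to do with how Kademlia actually routes) of what "a hash table distributed across many nodes" means: each key is hashed, and the hash determines which node stores the value.

import hashlib

class NaiveDHT(object):
  """A toy 'distributed' hash table: keys are assigned to nodes by hash."""

  def __init__(self, nodes):
    self.nodes = nodes  # each 'node' is just a dict in this sketch

  def _node_for(self, key):
    digest = hashlib.sha1(key.encode('utf-8')).hexdigest()
    return self.nodes[int(digest, 16) % len(self.nodes)]

  def put(self, key, value):
    self._node_for(key)[key] = value

  def get(self, key):
    return self._node_for(key)[key]

A real DHT like Kademlia has no central list of nodes, of course - each node knows only a small fraction of the network, and lookups are routed towards the nodes responsible for a key.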

Kademlia is a good example of a basic DHT, because unlike some competing algorithms, it's extremely simple. There are no explicit routing update messages, and the internal state it maintains is fairly straightforward and easy to understand. Lookups are also accomplished in an obvious - yet very efficient - manner. In return for this simplicity, Kademlia sacrifices a few of the features of competitors like Pastry and Chord - it's not as practical to implement other primitives such as pubsub over it.

The reference we ...

Enforcing data isolation with CurrentDomainProperty

In a previous post, we described how to implement API call hooks, and demonstrated a common use case: Separating the datastore by domain for multi-tenant apps.

It's not always the case that you want to partition your entire datastore along domain or user lines, however. Sometimes you may want to have only some models with restricted access per-domain, with others being common across all domains. You might also want a way to ensure that users can't read or modify each other's data. Fortunately, there's a way to implement all this at a higher level: Instead of defining API call hooks, we can define custom datastore properties to do the job for us.

Here's an implementation of a CurrentDomainProperty:

import os

from google.appengine.api import users
from google.appengine.ext import db


class InvalidDomainError(Exception):
  """Raised when something attempts to access data belonging to another domain."""


class CurrentDomainProperty(db.Property):
  """A property that restricts access to the current domain."""

  def __init__(self, allow_read=False, allow_write=False, *args, **kwargs):
    self.allow_read = allow_read
    self.allow_write = allow_write
    super(CurrentDomainProperty, self).__init__(*args, **kwargs)

  def __set__(self, model_instance, value):
    # Default to the current request's domain if no value was supplied.
    if not value:
      value = unicode(os.environ['HTTP_HOST'])
    # Disallow values belonging to other domains unless reads are permitted
    # or the current user is an admin.
    elif (value != os.environ['HTTP_HOST'] and not self.allow_read
          and not users.is_current_user_admin()):
      raise InvalidDomainError(
          "Domain '%s' attempting ...

Recursion and concurrency with Go

Eager to jump on the bandwagon, I've been reading up on Go, the new language released by a group at Google. I had nothing to do with the development of the language, but several things about it interest me - in particular, its approach to concurrency.

Some time ago I came across a rather neat language (or rather, language extension) called Cilk. Cilk is an extension of ANSI C with concurrency primitives. It implements a concurrency model based on 'spawning' functions, returning a deferred result, then 'syncing' in order to obtain the results. Most interestingly, Cilk uses a 'work stealing' scheduler, which means that spawning a function is nearly as cheap as simply calling it.
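For those unfamiliar with the model, here's a very rough sketch of the spawn/sync pattern using Python's concurrent.futures - an analogy only, since Python threads are nowhere near as cheap as Cilk's spawns:

from concurrent.futures import ThreadPoolExecutor

def expensive(n):
  return n * n

with ThreadPoolExecutor() as pool:
  a = pool.submit(expensive, 3)  # 'spawn': returns a deferred result
  b = pool.submit(expensive, 4)
  # 'sync': block until both deferred results are available
  print(a.result() + b.result())  # 25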

One of the demonstrations of Cilk that really captured my imagination was using it to search a game tree for perfect knowledge games such as chess. Parallelizing tree search is generally a really hard problem, because most of the function invocations only do a small amount of work, and context switching and thread creation overhead often overwhelms any benefits gained from parallelization. Cilk's work stealing strategy made it possible to parallelize these algorithms in an intuitive fashion and still see performance improvements from multiple ...

App Engine Java API call hooks

In a previous post, we discussed API call hooks for Python. It's possible to hook and modify RPC calls in Java, too. In this post, we'll demonstrate how.

All API calls in Java are handled by the class com.google.apphosting.api.ApiProxy. This class behaves similarly to the ApiProxyStubMap in the Python SDK, having makeSyncCall and makeAsyncCall functions that take care of invoking the relevant API calls. Unlike the Python runtime, however, all Java API calls are handled by a single delegate class, defined by the interface ApiProxy.Delegate. The active delegate for an App Engine app can be retrieved with ApiProxy.getDelegate, and set with ApiProxy.setDelegate.

Another difference from the Python API is that API calls are serialized into byte arrays before being passed to the ApiProxy, and responses are likewise deserialized by the caller. As a result, the makeSyncCall method takes a byte array as an argument, and returns a byte array.

Because all calls are routed to a single Delegate, the granularity of hooks supported is restricted to hooking all API calls, or none. To make things easier, we'll define a Delegate implementation that dispatches API calls to other Delegate subclasses based ...

Python Gotchas

A lot of App Engine developers are fairly new to Python as well, and so probably haven't yet encountered some of the subtle 'gotchas' of the Python programming language. This post aims to sum up the ones you're most likely to encounter while programming for App Engine.

Mutable default arguments

First up is something every Pythonista learns sooner or later. What's wrong with this snippet?

def append(value, l=[]):
  l.append(value)
  return l
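If you're not sure, here's a quick sketch of what happens when you call it - the default list is created once, when the function is defined, and shared between calls:

print(append(1))  # [1]
print(append(2))  # [1, 2] - the previous call's value is still there!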

API call hooks for fun and profit

API call hooks are a technique that's reasonably well documented for App Engine - there's an article about it - but only lightly used. Today we'll cover some practical uses of API call hooks.

To start, let's define a simple logging handler. This can be useful when you're seeing some odd behaviour from the datastore, especially when you're using a library that may be modifying how it works. You can also use it to log any other API, such as the URLFetch or Task Queue APIs. For this, we just need a post-call hook:

import logging

def log_api_call(service, call, request, response):
  logging.debug("Call to %s.%s", service, call)
  logging.debug(request)
  logging.debug(response)

To install this for the datastore, for example, we call this:

from google.appengine.api import apiproxy_stub_map

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append(
    'log_api_call', log_api_call, 'datastore_v3')

The arguments to Append are, in order, a name for the hook, the hook function itself, and, optionally, the service you want to hook. If you leave out the last argument, the hook is installed for all API calls.
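For instance, to install the same hook for every API instead of just the datastore (an illustrative variation on the call above), simply omit the service name:

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append('log_api_call', log_api_call)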

Once a hook is installed, it remains installed as long as the runtime is loaded - including across requests, and for requests to ...

Damn Cool Algorithms: Spatial indexing with Quadtrees and Hilbert Curves

Last Thursday night at Oredev, after the sessions, was "Birds of a Feather" - a sort of mini-unconference. Anyone could write up a topic on the whiteboard; interested individuals added their names, and each group got allocated a room to chat about the topic. I joined the "Spatial Indexing" group, and we spent a fascinating hour and a half talking about spatial indexing methods, reminding me of several interesting algorithms and techniques.

Spatial indexing is increasingly important as more and more data and applications are geospatially-enabled. Efficiently querying geospatial data, however, is a considerable challenge: because the data is two-dimensional (or sometimes, more), you can't use standard indexing techniques to query on position. Spatial indexes solve this through a variety of techniques. In this post, we'll cover several - quadtrees, geohashes (not to be confused with geohashing), and space-filling curves - and reveal how they're all interrelated.

Quadtrees

Quadtrees are a very straightforward spatial indexing technique. In a Quadtree, each node represents a bounding box covering some part of the space being indexed, with the root node covering the entire area. Each node is either a leaf node - in which case it contains ...
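To make that concrete, here's a minimal sketch of a point quadtree in Python - the class names and the leaf capacity are illustrative only, not from any particular implementation:

class QuadTree(object):
  """A toy point quadtree: leaves hold points, internal nodes hold 4 children."""
  MAX_POINTS = 4  # illustrative leaf capacity

  def __init__(self, x0, y0, x1, y1):
    self.bounds = (x0, y0, x1, y1)
    self.points = []      # filled while this node is a leaf
    self.children = None  # four child nodes once subdivided

  def insert(self, x, y):
    if self.children is not None:
      self._child_for(x, y).insert(x, y)
      return
    self.points.append((x, y))
    if len(self.points) > self.MAX_POINTS:
      self._subdivide()

  def _subdivide(self):
    x0, y0, x1, y1 = self.bounds
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    self.children = [QuadTree(x0, y0, xm, ym), QuadTree(xm, y0, x1, ym),
                     QuadTree(x0, ym, xm, y1), QuadTree(xm, ym, x1, y1)]
    points, self.points = self.points, []
    for (x, y) in points:
      self._child_for(x, y).insert(x, y)

  def _child_for(self, x, y):
    x0, y0, x1, y1 = self.bounds
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return self.children[(1 if x >= xm else 0) + (2 if y >= ym else 0)]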

Server-side JavaScript with Rhino

I've only made limited use of the Java runtime for App Engine so far: The two runtimes are largely equivalent in terms of features, and my personal preference is for Python. There are, however, a few really cool things that you can only do on the Java runtime. One of them is embedding an interpreter for a scripting language.

Rhino is a JavaScript interpreter implemented in Java. It supports sandboxing, meaning you can create an environment in which it's safe to execute user-provided code, and it allows you to expose arbitrary Java objects to the JavaScript runtime, meaning you can easily provide interfaces through which the JavaScript code can carry out whatever actions it's designed to perform.

Rhino also supports several rather cool additional features. You can set callback functions that count the number of operations executed by the JavaScript code and terminate it if it takes too long; you can serialize a context or individual objects and deserialize them later; and you can even use continuations to pause execution when waiting for data, and pick it up again later - possibly even in another request! Hopefully we'll show off some of these features in future ...

Blogging on App Engine, part 10: Recap

This is part of a series of articles on writing a blogging system on App Engine. An overview of what we're building is here.

Over the last 9 posts, we've covered building a fully functional blogging system for App Engine from scratch, and I've even migrated this blog to it. In the process, we've covered many important components of App Engine, including the new deferred library (and through it, Task Queues), important design principles, such as pre-generation of content, and interesting technologies such as PubSubHubbub, CSEs, Disqus, and sitemaps.

There's also been significant community involvement, for which I am very grateful. Amongst the contributors were:

  • Moraes, who ported bloggart to werkzeug, calling it bloggartzeug
  • Sylvain, who ported bloggart to tornado, calling it bloggartornado
  • andialbrecht, who contributed - and continues to contribute - enhancements and bugfixes for bloggart
  • Everyone who contributed a name suggestion on the first post, and everyone who has provided feedback during the series
  • Everyone who filed bug reports and feature requests in the issue tracker

To give some sort of objective assessment of how well Bloggart measures up, we need to compare it to our original goals, outlined in the very first post:

Simple ...

Blogging on App Engine, part 9: Sitemaps and verification

This is part of a series of articles on writing a blogging system on App Engine. An overview of what we're building is here.

Today we're going to cover basic sitemap support and verifying your site with Google.

Sitemaps

Sitemaps are a recent innovation that aims to make it easier for search engines to find and index your site. The format is a very straightforward XML file. Several optional attributes can be present, such as the last-modified date and update frequency; for this first attempt we're not going to use any of them, and will just provide a basic listing of URLs. Future enhancements could provide more sitemap information, and break the sitemap into multiple files for extensibility.
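As a rough illustration of the format (this isn't the code we'll end up using - just a sketch, and the URL is a placeholder), a bare-bones sitemap can be rendered like this:

SITEMAP_TEMPLATE = u"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
%s
</urlset>"""

URL_TEMPLATE = u"  <url><loc>%s</loc></url>"

def render_sitemap(urls):
  """Renders a minimal sitemap: just a list of URLs, no optional attributes."""
  return SITEMAP_TEMPLATE % u"\n".join(URL_TEMPLATE % url for url in urls)

print(render_sitemap([u"http://example.com/first-post"]))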

For the purposes of generating a complete sitemap, we have a significant advantage: Our static serving infrastructure provides us with a convenient means of getting a list of all valid URLs. Not all URLs should be indexed, however, so we'll make it possible to specify which content to include. Add a new property to the StaticContent model in static.py:

indexed = db.BooleanProperty(required=True, default=True)

We'll need to enhance our set() method to take this ...