'Most popular' metrics in App Engine

One useful and common statistic to provide to users is a metric akin to "7 day downloads" or "7 day popularity". This appears across many sites and types of webapp, yet the best way to do this is far from obvious.

The naive approach is to record each download individually, and use something akin to "SELECT count(*) FROM downloads WHERE item_id = 123 AND download_date > seven_days_ago", but this involves counting each download individually - O(n) work with the number of downloads! Caching the count is an option, but still leads to excessive amounts of work at read-time.

Another option is to maintain an array of daily download counts, keeping the last 7. This is an improvement from a workload point of view, but leads to either discontinuities at the start of a new day, or to all counts being updated only once per day.
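
To make the trade-off concrete, here is a minimal sketch of the daily-bucket approach. The class name, the integer `day_number` argument, and the assumption that days never go backwards are all illustrative choices, not anything taken from a real library.

```python
from collections import deque


class DailyBuckets(object):
    """Keeps one download count per day, for the most recent `days` days.

    Assumes day numbers are integers that never decrease between calls.
    """

    def __init__(self, days=7):
        self.days = days
        self.buckets = deque()  # (day_number, count) pairs, oldest first

    def record(self, day_number):
        """Count one download on the given day."""
        if self.buckets and self.buckets[-1][0] == day_number:
            day, count = self.buckets.pop()
            self.buckets.append((day, count + 1))
        else:
            self.buckets.append((day_number, 1))
        self._expire(day_number)

    def _expire(self, today):
        """Drop buckets that have fallen outside the window."""
        while self.buckets and self.buckets[0][0] <= today - self.days:
            self.buckets.popleft()

    def total(self, today):
        """Return the number of downloads in the window ending today."""
        self._expire(today)
        return sum(count for _, count in self.buckets)
```

Note how a whole day's worth of downloads drops out of `total` in one step when the day rolls over, which is the discontinuity described above.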

There is a third option, however, which combines the performance of the second option with the responsiveness of the first. To use it, we have to reconsider slightly what we mean by '7 day popularity'. The solution in question is to use an exponential decay process. Each time an event happens, we increase the item's ...
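
The excerpt above cuts off before the details, so what follows is only one common way to build an exponentially decaying counter, not necessarily the scheme the full post goes on to describe. The 7 day half-life, the class name, and the use of plain seconds for timestamps are all assumptions made for illustration.

```python
import math

# Assumed for illustration: a score halves after 7 days with no new events.
HALF_LIFE_SECONDS = 7 * 24 * 3600
DECAY_RATE = math.log(2) / HALF_LIFE_SECONDS


class DecayingCounter(object):
    """A popularity score that decays exponentially between events."""

    def __init__(self, now):
        self.score = 0.0
        self.last_update = now  # timestamp in seconds

    def value(self, now):
        """Return the score decayed to time `now`, without changing state."""
        elapsed = now - self.last_update
        return self.score * math.exp(-DECAY_RATE * elapsed)

    def record_event(self, now, weight=1.0):
        """Decay the stored score to `now`, then add the new event."""
        self.score = self.value(now) + weight
        self.last_update = now
```

Only two numbers are stored per item, and both reads and writes are constant-time, however many events have been recorded.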

DIY USB preloading with *nix

Having recently received a large number of USB flash drives, I needed a solution for preloading them in bulk. Dedicated USB preloading/flashing devices are pricey - starting at over 500 euro for a small model - and while the preload services most companies offer (including Memotrek, the company we ordered the drives from) are handy, they add an extra 50c or so to the price of each drive, and the preload is quickly out of date. With that in mind, I decided to go the DIY route. This post documents my attempts and the final (successful) result.

To start, you need a lot of USB ports. I purchased two D-Link DUB-H7 7-port USB hubs, but any hubs ought to do, as long as the spacing between the ports is sufficient to accommodate a flash drive in every port. You won't need the included power bricks, as the power provided by the USB host is sufficient even for 7 USB flash drives.

The general process of bulk flashing goes something like this:

  1. Plug in one of your drives. Wipe it with "dd if=/dev/zero of=/dev/your-drive bs=1M", partition and format it, and write the data you want ...

.astronomy wrap-up

Wednesday was Hack Day at dot astronomy. I spent the day working on a tool that uses seadragon ajax and a modified Python tilecutter to allow people with large astronomical images (tens to hundreds of megapixels) to easily upload them to App Engine for viewing by users. This is useful because many really attractive astronomical images get released to the public, but often only in two versions: 'desktop wallpaper' and 'too big to view'. Ideally, with this tool (which I'm tentatively calling astrozoom), astronomers could make it easy for users to view and zoom the product of their work.

Further extensions would include integration with astrometry.net to automatically locate and annotate uploaded images, and support for clipping out and downloading certain sections of an image, not to mention community features like sharing with friends, comments, and embedding in other pages.

I got the basic upload-and-display functionality done on hack day, but due to lack of memory on my mac to run the tool on a decently sized image, I'm unable to show it off yet.

Other hack-day projects included a large team working on a project called Buried Data, for making datasets available for research that would ...

.astronomy so far

The first 3 days of .astronomy have been busy. So busy, I haven't had the time or energy to write about them until now! Here's a quick summary of what's happened:

I arrived at the conference a bit before 11AM on Monday, having taken the earliest flight available that day (I didn't fly in the night before, as that would've meant missing Video Games Live). When I came in, Robert Hollow was giving a talk on pulse@parkes, a fascinating program he runs to get students into astronomy by giving them real observing time on the Parkes radio telescope in Australia. He gave an engaging presentation, and made me wish I could give it a go myself.

Next up, Arfon Smith and Chris Lintott gave a talk on Galaxy Zoo, which has come a long way since I last looked at it. They described the architecture (Ruby, running on AWS), some of the project's successes, and some of the new projects they're working on, including Galaxy Zoo Mergers, a project dedicated to determining and documenting the details of galaxy mergers.

After lunch, I gave my Python 101 tutorial. Due to a screw-up on ...

They're almost here!

The App Engine USB drives - in bright primary, Google(tm) colors - are finished! They're currently winging their way to the Dublin office (and a separate batch direct to the .astronomy venue). Can't wait to get my hands on them.

Want to get your hands on one of them too? Post a suggestion for what topic you'd like to see me write about - be it App Engine, Go, Damn Cool Algorithms, or something else - and I'll send a USB drive, loaded with App Engine goodies and a Wave invite, to the authors of the best few suggestions.

I'm going to be at .astronomy all next week, so I'm not going to be putting up new posts on my regular schedule. I will, however, be blogging about the conference, so look out for posts on some of the more interesting talks and breakout sessions/hackathons.

In one final, unrelated item, I'd like to draw your attention to an amazing bit of geekery. Last week, I posted this code golf competition to Stack Overflow, for the shortest Fractran interpreter. As an extra challenge, I offered a bonus to anyone who could provide a Fractran interpreter in Fractran ...

Implementing a DHT in Go, part 2

In the previous post, we started a straightforward implementation of a Kademlia Distributed Hash Table in Go. Today, we'll add in the real meat - interaction between peers to find nodes.

First, though, a correction to the previous post. There were a couple of errors in it, the most notable of which is that the routing table's FindClosest method failed to order the returned results correctly. The original implementation ordered results by their absolute node ID, but the correct ordering is by their ID xor the target ID. To my shame, my rudimentary unit tests did not catch this. This is now fixed in the original article with the introduction of a 'ContactRecord' struct.
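
To illustrate the difference between the two orderings, here is a small sketch in Python rather than Go. It uses plain integers as node IDs and a hypothetical `find_closest` function. It is not the post's FindClosest or ContactRecord code, just the same ordering rule in miniature.

```python
def find_closest(node_ids, target_id, count):
    """Return up to `count` IDs from `node_ids`, nearest to `target_id` first.

    Distance is the Kademlia metric: the XOR of the two IDs, read as an
    unsigned integer.
    """
    return sorted(node_ids, key=lambda node_id: node_id ^ target_id)[:count]


known_nodes = [1, 4, 6, 13]
target = 5

# Ordering by absolute ID ignores the target entirely.
by_absolute_id = sorted(known_nodes)[:3]

# Ordering by XOR distance: 4^5 = 1, 6^5 = 3, 1^5 = 4, 13^5 = 8.
by_xor_distance = find_closest(known_nodes, target, 3)
```

Here `by_absolute_id` is `[1, 4, 6]`, while `by_xor_distance` is `[4, 6, 1]`: the same three nodes, but only the second list puts the node nearest the target first.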

Let's start by defining a Kademlia struct to hold information about a Kademlia network:

type Kademlia struct {
  routes *RoutingTable;
  NetworkId string;
}

func NewKademlia(self *Contact, networkId string) (ret *Kademlia) {
  ret = new(Kademlia);
  ret.routes = NewRoutingTable(self);
  ret.NetworkId = networkId;
  return;
}

Note the presence of the 'NetworkId' field in the above code. This is an arbitrary string that should be unique for each deployment of our Kademlia implementation, to prevent different instances of the network merging together.

Go supports RPC calls using its built in ...

Implementing a DHT in Go, part 1

In order to further explore the capabilities and limitations of Go, I thought it would be interesting to try implementing something that was practical, non-trivial, and of interest on its own. With that in mind, we're going to spend the next few posts creating an implementation of the Kademlia DHT in Go.

A DHT, or Distributed Hash Table, is a peer-to-peer networking primitive that, in its most basic formulation, permits storage and lookup of key/value pairs - that is, it's a hash table that's distributed across many nodes. The name is not entirely accurate for some newer DHTs, as some formulations permit many other operations besides those focused around data storage and retrieval.

Kademlia is a good example of a basic DHT, because unlike some competing algorithms, it's extremely simple. There are no explicit routing update messages, and the internal state it maintains is fairly straightforward and easy to understand. Lookups are also accomplished in an obvious - yet very efficient - manner. In return for this simplicity, Kademlia sacrifices a few of the features of competitors like Pastry and Chord - it's not as practical to implement other primitives such as pubsub over it.

The reference we ...

Enforcing data isolation with CurrentDomainProperty

In a previous post, we described how to implement API call hooks, and demonstrated a common use-case: Separating the datastore by domain, for multi-tenant apps.

It's not always the case that you want to partition your entire datastore along domain or user lines, however. Sometimes you may want to have only some models with restricted access per-domain, with others being common across all domains. You might also want a way to ensure that users can't read or modify each other's data. Fortunately, there's a way to implement all this at a higher level: Instead of defining API call hooks, we can define custom datastore properties to do the job for us.

Here's an implementation of a CurrentDomainProperty:

class InvalidDomainError(Exception):
  """Raised when something attempts to access data belonging to another domain."""

class CurrentDomainProperty(db.Property):
  """A property that restricts access to the current domain."""

  def __init__(self, allow_read=False, allow_write=False, *args, **kwargs):
    self.allow_read = allow_read
    self.allow_write = allow_write
    super(CurrentDomainProperty, self).__init__(*args, **kwargs)

  def __set__(self, model_instance, value):
    if not value:
      value = unicode(os.environ['HTTP_HOST'])
    elif (value != os.environ['HTTP_HOST'] and not self.allow_read
          and not users.is_current_user_admin()):
      raise InvalidDomainError(
          "Domain '%s' attempting ...

Recursion and concurrency with Go

Eager to jump on the bandwagon, I've been reading up on Go, the new language released by a group at Google. I had nothing to do with the development of the language, but several things about it interest me, in particular, its approach to concurrency.

Some time ago I came across a rather neat language (or rather, language extension) called Cilk. Cilk is an extension of ANSI C with concurrency primitives. It implements a concurrency model based on 'spawning' functions, returning a deferred result, then 'syncing' in order to obtain the results. Most interestingly, Cilk uses a 'work stealing' scheduler, which means that spawning a function is nearly as cheap as simply calling it.

One of the demonstrations of Cilk which really captured my imagination was using it to search a game tree for perfect-information games such as chess. Parallelizing tree search is generally a really hard problem, because most of the function invocations only do a small amount of work, and context switching and thread creation overhead often overwhelms any benefits gained from parallelization. Cilk's work stealing strategy made it possible to parallelize these algorithms in an intuitive fashion and still see performance improvements from multiple ...

No, I didn't just get married

It appears I've just inadvertently discovered a bug in Bloggart which causes it to 'republish' old posts, updating the last-modified time and causing them to show up in the Atom feed again. So no, I didn't just get married and then get stuck in Switzerland.

If I ever track down the author of Bloggart, I'll give him such a hiding he'll wish he'd never written it. ;)