App Engine Cookbook: On-demand Cron Jobs

Today's post is, by necessity, a brief one. I'm travelling to San Francisco for I/O at the moment, and my flight was delayed so much I missed my connection in Atlanta and had to stay the night; in fact, I'm writing and posting this from the plane, using the onboard WiFi!

In a previous post, I introduced a recipe for high concurrency counters, which used a technique that I believe deserves its own post, since it's a useful pattern on its own. That technique is what I'm calling "On-demand Cron Jobs"

It's not at all uncommon for apps to have a need to do periodic updates at intervals, where the individual updates are small, and may even shift in time. One example is deleting or modifying any entry that hasn't been modified in the last day. In apps that need to do this, it's not uncommon to see a cron job like the following:

cron:
- description: Clean up old data
  url: /tasks/cleanup
  schedule: every 1 minute

This works, but it potentially consumes a significant amount of resources checking repeatedly if there's anything to clean up. Using the task queue ...

Implementing a dropbox service with the Blobstore API (part 3): Multiple upload support

In the last part of this series, we demonstrated how to use plupload, a Javascript library with multiple backends for handling file uploads. The solution we demonstrated there only supported uploading a single file at a time, however, and required us to improvise our own progress indicators - far from optimal.

So now, the post you've all been waiting for, where we demonstrate how to do multiple file upload!

The basic trick is simple: Hook the event that's triggered before a file is uploaded, and update the URL to upload to when it's called. That way, ever uploaded file gets a new URL. Where do we get the URL from? We simply ask the server for one. Here's the Javascript for that:

      uploader.bind('UploadFile', function(up, file) {
        $.ajax({
            url: '/generate_upload_url',
            async: false,
            success: function(data) {
              up.settings.url = data;
            },
        });

Straightforward, right? The only subtlety here is that we have to make the request an asynchronous one, so that the uploading doesn't start until we've updated the URL. Here's the server-side code that generates those URLs:

class GenerateUploadUrlHandler(BaseHandler):
  @util.login_required
  def get(self):
    self.response.headers['Content-Type'] = 'text/plain'
    self.response.out.write ...

High concurrency counters without sharding

Sharded Counters are a well known technique for keeping counters with high update rates on App Engine. Less well known, however, are some of the alternatives, particularly in areas where you want to keep a reasonably accurate counter, but absolute accuracy isn't required. I discussed one option in this cookbook post - be sure to check the comments for an improved version - and today we'll discuss another option, which also makes use of memcache and the task queue.

The basic assumption is this: We want to keep as accurate a count as possible, but we're willing to accept that it may, in some cases, under-count. A good example of where this is true is counting downloads, or hits, or other such metrics.

Our solution has three major components:

  1. A 'permanent' count, stored in the datastore.
  2. A 'current' count, stored in memcache.
  3. A task queue task that updates the datastore with the total from memcache.

In order to implement this, we'll take advantage of the task queue's task name functionality, and 'tombstoned tasks' - the restriction that two tasks with the same name cannot be enqueued within a reasonable period (at least a week) of each other. Each ...

Pre- and post- put hooks for Datastore models

A number of people have asked about the possibility of pre- and post- put hooks for datastore models, to allow for changes or other processing before or after a model is stored to the datastore.

While such a feature isn't currently supported by App Engine, it's quite possible for us to implement it ourselves, using monkeypatching. This also gives us a good opportunity to show off how monkeypatching works, and how it can be used to make your own changes (at your own risk!) to the App Engine SDK.

One caveat of monkeypatching is that you have to be very careful to make sure that your patch is installed at all times. If it's not, the changes you made will be unavailable and cause errors - or worse, simply behave differently. This is particularly noticeable in the case of app-engine-patch, which monkeypatches models to change their kind name, causing operations on them to fail if the patch hasn't been imported.

The functionality we want is about as simple as you could ask for: We want to be able to define a method on our Model that gets called just before it is written to the datastore, and ...

Generating PDFs on App Engine Python, and introducing Mapvelopes

This is the first of two posts covering the technologies used to implement the Mapvelopes app, an App Engine app that generates customized printable envelopes with the map to your recipient on them.

While HTML is the lingua-franca of the web, it's not the be all and end all. Sometimes, you need your webapp to generate something slightly different, and often, that something is a PDF. PDFs have the major advantage that they're designed for printing: pagination is built in, and the PDF defines the page size, so nothing about the layout is left to chance. When you need to provide something for the user to print, especially when it's complex, using a PDF can make the difference between okay output and really excellent output. Hit 'Print' in a Google Docs spreadsheet, and you'll see this in action.

PDF generation on App Engine is something that's been left largely up to individual users to figure out. Depending on your runtime - Java or Python - and your specific needs, it may be quite straightforward, or rather complicated. In particular, if you want to include images in your PDF, you're going to have to jump through some ...

Task Queue task chaining done right

One common pattern when using the Task Queue API is known as 'task chaining'. You execute a task on the task queue, and at some point, determine that you're going to need another task, either to complete the work the current task is doing, or to start doing something new. Let's say you're doing the former, and your code looks something like this:

def task_func():
  # Do some stuff
  deferred.defer(task_func)
  florb # This line causes an error

I'm sure you can guess what happens here. You successfully do some work, successfully chain the next task, then you encounter an error. Your code throws an exception, and returns a non-200 status code to the task queue, which notes the failure and schedules your task for re-execution. When it re-executes, the whole thing happens all over again (if your error is persistent, instead of transient, like the above).

Meanwhile, the task you enqueued runs. Perhaps it also fails after chaining its next task. Now you have two repeatedly executing tasks. Soon you have 4 - then 8 - then 16 - and so forth. Disaster!

"Ah, " you may say smugly, "I don't do anything important after chaining the next task ...

ReferenceProperty prefetching in App Engine

This post is a brief interlude in the webapps on App Engine series. Fear not, it'll be back!

Frequently, we need to do a datastore query for a set of records, then do something with a property referenced by each of those records. For example, supposing we are writing a blogging system, and we want to display a list of posts, along with their authors. We might do something like this:

class Author(db.Model):
  name = db.StringProperty(required=True)

class Post(db.Model):
  title = db.TextProperty(required=True)
  body = db.TextProperty(required=True)
  author = db.ReferenceProperty(Author, required=True)


posts = Post.all().order("-timestamp").fetch(20)
for post in posts:
  print post.title
  print post.author.name

On the surface, this looks fine. If we look closer, however - perhaps by using Guido's excellent AppStats tool, we'll notice that each iteration of the loop, we're performing a get request for the referenced author entity. This happens the first time we dereference any ReferenceProperty, even if we've previously dereferenced a separate ReferenceProperty that points to the same entity!

Obviously, this is less than ideal. We're doing a lot of RPCs, and incurring a lot of ...

Enforcing data isolation with CurrentDomainProperty

In a previous post, we described how to implement API call hooks, and demonstrated a common use-case: Separating the datastore by domain, for multi-tenant apps.

It's not always the case that you want to partition your entire datastore along domain or user lines, however. Sometimes you may want to have only some models with restricted access per-domain, with others being common across all domains. You might also want a way to ensure that users can't read or modify each others' data. Fortunately, there's a way to implement all this at a higher level: Instead of defining API call hooks, we can define custom datastore properties to do the job for us.

Here's an implementation of a CurrentDomainProperty:

class InvalidDomainError(Exception):
  """Raised when something attempts to access data belonging to another domain."""


class CurrentDomainProperty(db.Property):
  """A property that restricts access to the current domain."""

  def __init__(self, allow_read=False, allow_write=False, *args, **kwargs):
    self.allow_read = allow_read
    self.allow_write = allow_write
    super(CurrentDomainProperty, self).__init__(*args, **kwargs)

  def __set__(self, model_instance, value):
    if not value:
      value = unicode(os.environ['HTTP_HOST'])
    elif (value != os.environ['HTTP_HOST'] and not self.allow_read
          and not users.is_current_user_admin()):
      raise InvalidDomainError(
          "Domain '%s' attempting ...

API call hooks for fun and profit

API call hooks are a technique that's reasonably well documented for App Engine - there's an article about it - but only lightly used. Today we'll cover some practical uses of API call hooks.

To start, let's define a simple logging handler. This can be useful when you're seeing some odd behaviour from the datastore, especially when you're using a library that may be modifying how it works. You can also use it to log any other API, such as the URLFetch or Task Queue APIs. For this, we just need a post-call hook:

def log_api_call(service, call, request, response): logging.debug("Call to %s.%s", service, call) logging.debug(request) logging.debug(response)

To install this for the datastore, for example, we call this:

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append( 'log_api_call', log_api_call, 'datastore_v3')

The arguments to Append are, in order, a name for the hook, the hook function itself, and, optionally, the service you want to hook. If you leave out the last argument, the hook is installed for all API calls.

Once a hook is installed, it remains installed as long as the runtime is loaded - including across requests, and for requests to ...

Server-side JavaScript with Rhino

I've only made limited use of the Java runtime for App Engine so far: The two runtimes are largely equivalent in terms of features, and my personal preference is for Python. There are, however, a few really cool things that you can only do on the Java runtime. One of them is embedding an interpreter for a scripting language.

Rhino is a JavaScript interpreter implemented in Java. It supports sandboxing, meaning you can create an environment in which it's safe to execute user-provided code, and it allows you to expose arbitrary Java objects to the JavaScript runtime, meaning you can easily provide interfaces with which the JavaScript code can carry out whatever actions it's designed to carry out.

Rhino also supports several rather cool additional features. You can set callback functions that count the number of operations executed by the JavaScript code, and terminate it if it takes too long, you can serialize a context or individual objects and deserialize them later, and you can even use continuations to pause execution when waiting for data, and pick it up again later - possibly even in another request! Hopefully we'll show off some of these features in future ...