Announcing the SQLite datastore stub for the Python App Engine SDK

For the past couple of weeks, I've been working on one of those projects that seems to suck up every available moment (and some that technically aren't). Now, however, it's largely done, and as an extra bonus, I've been given permission to release it as an early preview for those that are interested.

The code in question is a new implementation of the local datastore for the Python App Engine SDK. While some of you are probably delighted at the news, I expect most of you are puzzled. Why do we need a new local datastore implementation? Let me explain.

The purpose of the local stubs in the App Engine SDK is to exactly replicate the behaviour of the production environment, and in general they do that very well. A specific non-goal is replicating the performance characteristics of the production environment, or being as scalable as the production environment - the stubs are designed for testing, not production use.

The Python SDK's datastore implementation operates by storing the entire contents of your development datastore in memory. It writes changes to disk so that it can reload your datastore when the dev_appserver is restarted, but the in-memory ...

Handling downtime: The capabilities API and testing

After the unfortunate outage the other day, how to handle downtime with your App Engine app is a bit of a hot topic. So what better time to address proper error handling for situations where App Engine isn't performing at 100%?

There's three major topics to cover here: Handling timeouts from API calls, using the Capabilities API, and testing your app's support for handling failures. We'll go over them in order.

Handling timeouts

At the 'stub' level, timeouts and other exceptions are communicated by the stub throwing an google.appengine.runtime.apiproxy_errors.ApplicationError. ApplicationError instances have an 'application_error' field, which contains an ID, drawn from google.appengine.runtime.apiproxy_errors, which indicates the cause of the error. As you can see, DEADLINE_EXCEEDED is 4. Other errors of interest are OVER_QUOTA, which will occur if your app runs out of quota for a given API call or capability, and CAPABILITY_DISABLED, which is thrown if the API capability has been explicitly disabled (more on this later).

Each of the various APIs catches ApplicationErrors thrown by their stub, and wraps them in a higher level exception. The datastore, for example, has a function, _ToDatastoreError that maps different error codes to ...

Consuming RSS feeds with PubSubHubbub

Frequently, it's necessary or useful to consume an Atom or RSS feed provided by another application. Doing so, though, is rarely as simple as it seems: To do so robustly, you have to worry about polling frequency, downtime, badly formed feeds, multiple formats, timeouts, determining which items are new and other such issues, all of which distract from your original, seemingly simple goal of retrieving new updates from an Atom feed. You're not alone, either: Everyone ends up dealing with the same set of issues, and solving them in more or less the same manner. Wouldn't it be nice if there was a way to let someone else take care of all this hassle?

As you've no doubt guessed, I'm about to tell you that there is. I'm speaking, of course, of PubSubHubbub. I discussed publishing to PubSubHubbub as part of the Blogging on App Engine series, but I haven't previously discussed what's required to act as a subscriber. Today, we'll cover the basics of PubSubHubbub subscriptions, and how you can use them to outsource all the usual issues consuming feeds.

At this point, you may be wondering how this is ...

Implementing a non-relational database in Go

In a previous Damn Cool Algorithms post, I discussed log structured storage, and how it applies to databases. For a long time, I've wanted to implement a database based on log structured storage, and a few other nice mechanics from other database systems:

  • Tables are key:value mappings, with duplicate keys allowed (Bigtable, BDB)
  • Map-based views, also known as materialized views, for indexing (couchdb).
  • Reducer support for views (couchdb).

Since my previous posts about go have been generally well received, and because I want to explore the language a bit more, I'll be implementing all this in Go. The approach I'd like to take is one of gradually building up abstractions. We'll tackle each of the components in its own post:

  1. An interface for writing records to an append-only file or set of files.
  2. A B-Tree implementation, built on the record interface.
  3. Map-based / materialized views, based on the B-Tree implementation.
  4. Reducers for views.

Unlike previous series, this one is likely to be fairly fragmented. There's a fair chunk of functionality to be implemented here, so I won't be able to get it out at my usual three posts a week schedule. In the meantime ...

Writing a twitter service on App Engine

Services that consume or produce Twitter updates are popular apps these days, and there are more than a few on App Engine, too. Twitter provide an extensive API, which provides most of the features you might want to access.

Broadly, Twitter's API is divided into two distinct parts: The streaming API, and everything else. The streaming API is their recommended way to consume large volumes of updates in real-time; unfortunately, for a couple of reasons, using it on App Engine is not practical at the moment. The rest of their API, however, is well suited to use via App Engine, and covers things such as retrieving users' timelines, mentions, retweets, etc, sending new status updates (and deleting them, and retweeting them), and getting user information.

Authentication

Most of Twitter's API calls require authentication. Currently, Twitter support two different authentication methods: Basic, and OAuth. Basic authentication, as the name suggests uses HTTP Basic authentication, which requires prompting the user for their username and password. We won't be using this, since it's deprecated, and asking users for their credentials is a bad idea. The OAuth API makes it possible to call Twitter APIs on behalf of a user ...

Bulk updates with cursors

Last week, I blogged about cursors, a new feature in version 1.3.1 of the App Engine SDK. Today, I'm going to demonstrate a practical use for them: Bulk datastore updates.

In both the Remote API and deferred articles, I used a (perhaps poorly named) 'mapper' class as an example of ways to use these libraries. In neither case was the class intended to be anything other than a sample use case for the library, but nevertheless, people have used the examples in production. The introduction of cursors provides a prime opportunity to introduce a more robust, yet simpler, version of the bulk updater concept.

First, let's define a few requirements for our bulk updater:

  • Support for any query for which a cursor can be obtained
  • Handles failure of individual updates gracefully
  • Can fail the whole update process if enough errors are encountered
  • Handles timeout errors, service unavailability, etc, transparently
  • Can report completion to admins

As in the Remote API and Deferred articles, we'll implement the updater as an abstract class, which individual updater implementations should subclass. Here's the basic interface:

import logging
import time
from google.appengine.api import mail
from google.appengine.ext ...

Webapps on App Engine, part 6: Lazy loading

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

A major concern for many people developing for App Engine, particularly those building low-to-medium traffic sites, is instance load time. When App Engine serves the first request to a new instance of your app, it must import the request handler module you specified, which in turn imports all the other modules required to serve the request. In large apps, this can add up to quite a lot of additional overhead for loading requests, and substantially impact the experience for end users.

There are a number of things you can do to reduce loading times, including using lighter weight frameworks instead of all inclusive ones, and breaking seldom used components up into separate handlers - an approach taken by bloggart for the admin interface. One source of inefficiency stands out as a prime candidate for optimisation, though: unnecessary imports.

Many frameworks, including the built in webapp framework, require you to provide a list of handler classes that should be instantiated to serve requests, in a 'url map'. When a request comes in, the framework simply instantiates the relevant class and ...

New features in 1.3.1 prerelease: Cursors

Recently, the App Engine team announced that they'd be pre-releasing SDKs for testing and feedback, before they go live in production. With the first prerelease, 1.3.1, a number of new features are included in the SDK. Today we'll discuss cursors - how they work, and what they're useful for.

Cursors are a feature that many people have been waiting for with bated breath. As well as making pagination easier, they also provide a way around the "1000 result limit" that many people feel (in some cases correctly) makes it harder to achieve what they want to do on App Engine.

When it comes to investigating new features, there are two really useful tools: An interactive console - such as that on http://localhost:8080/_ah/admin/, http://shell.appspot.com/ or the remote_api console - and the source code. Many people forget that as an Open Source project, the App Engine SDK code is all available - and easily browseable on code.google.com.

Our first stop is google/appengine/ext/db/__init__.py. Of interest here is the cursor() method, which starts on line 1600. As you can see, when called on a query that's already been ...

Webapps on App Engine, Part 5: Sessions

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

Sessions are another component that's regularly required by webapps, but isn't really a core part of a framework. In this post, we'll discuss the session mechanisms available for App Engine and how they work, and settle on a recommendation for our own lightweight framework.

The basic mechanism behind a session library is straightforward: A random session ID is generated for the user, which is embedded in an HTTP cookie and sent to the user. Meanwhile, a record is created on the server with the same ID, containing any data the webapp wants to store about this user. When the user makes a subsequent request, the session library decodes the session ID from the cookie header, and loads the corresponding session record from permanent storage.

There are three major advantages of handling sessions this way, rather than naively storing session data directly in the cookie:

  • We can store data that the client shouldn't be able to modify, such as the user's access flags.
  • We can store data the client shouldn't even be able ...

Webapps on App Engine, part 4: Templating

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

In the first three posts of this series, we covered all the components of a bare bones webapp framework: routing, request/response encoding, and request handlers. Most people expect a lot more out of their web framework, however. Individual frameworks take different approaches to this, from the minimalist webapp framework, which provides the bare minimum plus some integration with other tools, to the all-inclusive django, to the 'best of breed' Pylons, which focuses on including and integrating the best libraries for each task, rather than writing their own.

For our framework, we're going to take an approach somewhere between webapp's and Pylons': While keeping our framework minimal and modular, we'll look at the best options to use for other components - specifically, templating and session handling. In this post, we'll discuss templating.

To anyone new to webapps, templates may seem somewhat unnecessary. We can simply generate the output direct from our code, right? Many CGI scripting languages used this approach, and the results are often messy. Sometimes, which page to be generated isn't clear ...