Storage options on App Engine

App Engine provides a number of ways for your app to store data. Some, such as the datastore, are well known, but others are less so, and all of them have different characteristics. This article is intended to enumerate the different options, and describe the pros and cons of each, so you can make more informed decisions about how to store your data.

Datastore

The best known, most widely used, and most versatile storage option is, of course, the datastore. The datastore is App Engine's non-relational database, and it provides robust, durable storage, as well as providing the most flexibility in how your data is stored, retrieved, and manipulated.

Pros

  • Durable - data stored in the datastore is permanent.
  • Read-write - apps can both read and write datastore data, and the datastore provides transaction mechanisms to enforce integrity.
  • Globally consistent - all instances of an app have the same view of the datastore.
  • Flexible - queries and indexing provide many ways to query and retrieve data.

Cons

  • Latency - because the datastore stores data on disk and provides reliability guarantees, writes need to wait until data is confirmed to be stored before returning, and reads often have to fetch data from disk.

Memcache

Memcache ...

Breaking things down to Atoms

Over the last few years, more and more sites have been providing RSS and Atom feeds. This has been a huge boon both for keeping up to date with content through feed readers, and for programmatically consuming data. There are, inevitably, a few holdouts, though. Notable amongst those holdouts -and particularly relevant to me at the moment - are property listing sites. Few, if any property listing sites provide any sort of Atom or RSS feed for listings, let alone a API of any kind.

To address this, I decided to put together a service for turning other types of notification into Atom feeds. I wanted to make the service general, so it can support many different formats, but the initial target will be email, since most property sites offer email notifications. The requirements for an email to Atom gateway are fairly straightforward:

  • Incoming email should be converted to entries in an Atom feed in such a fashion as to be easily interpretable in a feed reader
  • As much of the original message and metadata as possible should be preserved in the Atom feed entry
  • It should be possible to access the original message if desired

In addition, I had a ...

Modeling relationships in App Engine

One source of difficulty for people who are used to relational databases - and certain ORMs in particular - is how to handle references and relationships on App Engine. There's two basic questions here: First, what does a relationship entail, in any database system? And second, how do we use them in App Engine?

The nature of relationships

Many ORMs expose multiple 'types' of relationships as first-class entities - one-to-one, one-to-many, and many-to-many. This obscures the fact that, in reality, they are all built on the same building block, that of references. A reference is simply a field of an entity that contains the key of another entity - for example, if a Pet references an Owner, that simply means the Pet has a field that contains the key of its owner.

All relationship types simply devolve to references. A one-to-many relationship is a reference in its simplest form - each Pet has one Owner, and therefore an Owner can have multiple pets, all pointing to it. The Owner is not modified; it relies on the individual Pets naming it as the owner. One-to-one relationships are one-to-many relationships with the additional constraint that there will be only one Pet referencing each Owner; this is ...

Under the hood with App Engine APIs

In the past, I've discussed various details of the way various App Engine APIs work under the hood. If you've used certain tools, such as Appstats, too, you probably already have a basic overview of how the App Engine APIs function. Today, we'll take a closer look at the interface between the App Engine runtime and its APIs and how it works, and learn what that means for the platform.

If you're interested in App Engine purely to write straightforward webapps, you can probably stop reading now. If you're interested in low-level optimisations, or in the platform itself, or you want to write a library or tool that tinkers with the innermost parts of App Engine, then read on!

The generic API interface

Ultimately, every API call comes down to a single generic interface, with 4 arguments: the service name (for example, 'datastore_v3' or 'memcache'), the method name (for example, 'Get' or 'RunQuery'), the request, and the response. The request and response components are both Protocol Buffers, a binary format widely used at Google for exchanging structured data between processes. The specific type of the request and response protocol buffers for an API call depend ...

Google I/O playlist, day 8: Next gen queries

This is the eighth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them. Don't worry, we're nearly done!

First up, apologies for the lack of posts the last few days. I spent the weekend at Hack Camp, so Friday was spent preparing for it, and Monday and Tuesday were spent recovering and catching up!

Today's session is Alfred Fuller's session, Next Gen Queries. If you're interested in the continued evolution of the datastore API, this is a must-watch talk. It's completely language-agnostic, too. If you're looking for better background knowledge, Brett's talk from I/O '09 is well worth watching first, though.

Alfred goes into a lot of detail about some of the exciting improvements you can expect in future iterations of the Datastore API, largely due to his work on extending and improving the ZigZag merge join strategy in App Engine, as well as better support for MultiQuery. As an extra bonus, check out 31:29 for details of an improvement you can take advantage of ...

Implementing a dropbox service with the Blobstore API (part 2)

In part 1 of this series, we demonstrated what's necessary to build a very basic 'dropbox' type service for App Engine. Today, we're going to enhance that by adding support for 'rich' upload controls.

Various types of rich upload controls have sprung up in recent years in order to work around the weaknesses of the HTML standard file input element, which only allows selection of one file at a time, and doesn't support any form of progress notification. The most common widgets are written in Flash, but there are a variety of solutions available. With the ongoing browser adoption of HTML5, additional options are opening up, too!

Today we're going to use an excellent component called Plupload. Plupload consists of a Javascript component with a set of interchangeable backends. Backends include Flash, HTML5, Gears, old-fashioned HTML forms, and more. When you configure Plupload, you can specify which backends you want it to try, in which order, and it will stop when it finds one that works in the user's browser.

Different backends have different capabilities, and the ones you need will depend on your use-case. Check out the feature matrix on the Plupload homepage to ...

Implementing a dropbox service with the Blobstore API (Part 1)

The blobstore api is a recent addition to the App Engine platform, and makes it possible to upload and serve large files (currently up to 50MB). It's also one of the most complex APIs to use, as it has several moving parts. This short series will demonstrate how to implement a dropbox type file hosting service on App Engine, using the Blobstore API. To start, we'll cover the basics needed to upload files, keep track of them in the datastore, and serve them back to users.

First up is the upload form. This step is fairly straightforward: We create a standard HTML form, only we generate the URL to post to by calling blobstore.create_upload_url, and passing it the URL of the handler we want called by it. Here's the handler code:

class FileUploadFormHandler(BaseHandler):
  @util.login_required
  def get(self):
    self.render_template("upload.html", {
        'form_url': blobstore.create_upload_url('/upload'),
        'logout_url': users.create_logout_url('/'),
    })

Standard stuff - though it's worth pointing out that, for convenience, we're using the login_required decorator from the google.appengine.ext.webapp.util package to require users to be logged in (and redirect them to the login form if they're not). And here's ...

Announcing a robust datastore bulk update utility for App Engine

Note: This library is deprecated in favor of appengine-mapreduce, which is now bundled with the SDK.

I'm pleased to announce the release of bulkupdate, an unoriginally-named library for the App Engine Python runtime that facilitates doing bulk operations on datastore data. With bulkupdate, simple operations like bulk re-puts and bulk deletes are trivial, while more complex operations like schema transitions or even emailing all your users become much simpler.

The basic operation of bulkupdate is very similar to the 'map' phase of the well known 'mapreduce' pattern. To use it, you create a subclass of the 'Bulkupdater' class, and define two methods: get_query(), which returns the query to execute, and handle_entity(), which is called once for each entity returned by the query. For example, suppose you want to write a daily task that sends an XMPP message to everyone with new activity on their accounts - the updater class would look something like this:

class ActivityNotifier(bulkupdate.BulkUpdater):
  def __init__(self, date_threshold):
    self.date_threshold = date_threshold

  def get_query(self):
    return UserAccount.all().filter('last_update >', self.date_threshold)

  def handle_entity(self, user):
    if user.unread_messages > 0:
      xmpp.send_message(user.jid, "You have %s unread messages!" % user.unread_messages)

Running the job is even simpler ...

Announcing the SQLite datastore stub for the Python App Engine SDK

For the past couple of weeks, I've been working on one of those projects that seems to suck up every available moment (and some that technically aren't). Now, however, it's largely done, and as an extra bonus, I've been given permission to release it as an early preview for those that are interested.

The code in question is a new implementation of the local datastore for the Python App Engine SDK. While some of you are probably delighted at the news, I expect most of you are puzzled. Why do we need a new local datastore implementation? Let me explain.

The purpose of the local stubs in the App Engine SDK is to exactly replicate the behaviour of the production environment, and in general they do that very well. A specific non-goal is replicating the performance characteristics of the production environment, or being as scalable as the production environment - the stubs are designed for testing, not production use.

The Python SDK's datastore implementation operates by storing the entire contents of your development datastore in memory. It writes changes to disk so that it can reload your datastore when the dev_appserver is restarted, but the in-memory ...

'Most popular' metrics in App Engine

One useful and common statistic to provide to users is a metric akin to "7 day downloads" or "7 day popularity". This appears across many sites and types of webapp, yet the best way to do this is far from obvious.

The naive approach is to record each download individually, and use something akin to "SELECT count(*) FROM downloads WHERE item_id = 123 AND download_date > seven_days_ago", but this involves counting each download individually - O(n) work with the number of downloads! Caching the count is an option, but still leads to excessive amounts of work at read-time.

Another option is to maintain an array of daily download counts, keeping the last 7. This is an improvement from a workload point of view, but leads to either discontinuities at the start of a new day, or to all counts being updated only once per day.

There is a third option, however, which has the performance of the second option, with the responsiveness of the first. To use it, however, we have to reconsider slightly what we mean by '7 day popularity'. The solution in question is to use an exponential decay process. Each time an event happens, we increase the item's ...