Announcing a robust datastore bulk update utility for App Engine

Note: This library is deprecated in favor of appengine-mapreduce, which is now bundled with the SDK.

I'm pleased to announce the release of bulkupdate, an unoriginally-named library for the App Engine Python runtime that facilitates doing bulk operations on datastore data. With bulkupdate, simple operations like bulk re-puts and bulk deletes are trivial, while more complex operations like schema transitions or even emailing all your users become much simpler.

The basic operation of bulkupdate is very similar to the 'map' phase of the well-known 'mapreduce' pattern. To use it, you create a subclass of the 'BulkUpdater' class and define two methods: get_query(), which returns the query to execute, and handle_entity(), which is called once for each entity returned by the query. For example, suppose you want to write a daily task that sends an XMPP message to everyone with new activity on their accounts. The updater class would look something like this:

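import bulkupdate
from google.appengine.api import xmpp

class ActivityNotifier(bulkupdate.BulkUpdater):
  def __init__(self, date_threshold):
    # Only accounts updated after this datetime are considered.
    self.date_threshold = date_threshold

  def get_query(self):
    # UserAccount is the application's own datastore model, defined elsewhere.
    return UserAccount.all().filter('last_update >', self.date_threshold)

  def handle_entity(self, user):
    # Called once for each entity returned by get_query().
    if user.unread_messages > 0:
      xmpp.send_message(user.jid, "You have %s unread messages!" % user.unread_messages)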

Running the job is even simpler: you do something like this from inside a cron job:

job = ActivityNotifier(datetime.datetime.now() - datetime.timedelta(days=1))

The bulkupdate framework takes care of the rest, running tasks on the task queue and automatically chaining new ones when necessary. The real value of the library becomes apparent when you consider the need to monitor and debug jobs. The bulkupdate library handles this by providing an admin interface, allowing you to list current and past jobs, show statistics on them, and even cancel or delete them. Here's an example of the current version of the admin console for one of our jobs:

As you can see, along with general statistics, the console captures stack traces from any failed instances. Updaters can also log information using self.log, which is recorded to the same log and made available on the admin interface. This is an invaluable tool for outputting diagnostic information on the progress or contents of a bulkupdate job.
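
To make the task chaining and failure capture described above more concrete, here is a minimal, framework-free sketch of the general idea. It is not bulkupdate's actual implementation or API: SketchUpdater, Doubler, run_job and BATCH_SIZE are hypothetical names invented for illustration, a plain list stands in for a datastore query, and a deque stands in for the task queue. Each simulated task handles one small batch, records a stack trace for any entity that fails, and chains a follow-up task if entities remain.

```python
import collections
import traceback


class SketchUpdater(object):
  """Hypothetical stand-in for a bulk updater; not the bulkupdate API."""

  BATCH_SIZE = 2  # Entities handled per simulated task (assumed value).

  def __init__(self, entities):
    self.entities = list(entities)  # Stands in for a datastore query.
    self.log_lines = []             # Stands in for the per-job log.
    self.processed = 0
    self.failed = 0

  def log(self, message):
    self.log_lines.append(message)

  def handle_entity(self, entity):
    raise NotImplementedError()

  def run_task(self, offset, queue):
    """Handles one batch, then chains a follow-up task if work remains."""
    batch = self.entities[offset:offset + self.BATCH_SIZE]
    for entity in batch:
      try:
        self.handle_entity(entity)
        self.processed += 1
      except Exception:
        # Record the stack trace instead of aborting the whole job.
        self.failed += 1
        self.log(traceback.format_exc())
    next_offset = offset + len(batch)
    if next_offset < len(self.entities):
      queue.append(next_offset)  # "Chain" the next task.


def run_job(updater):
  """Drains the simulated task queue and returns how many tasks ran."""
  queue = collections.deque([0])
  tasks_run = 0
  while queue:
    updater.run_task(queue.popleft(), queue)
    tasks_run += 1
  return tasks_run


class Doubler(SketchUpdater):
  """Example job: doubles each value and rejects negative ones."""

  def __init__(self, entities):
    SketchUpdater.__init__(self, entities)
    self.results = []

  def handle_entity(self, entity):
    if entity < 0:
      raise ValueError("negative value: %d" % entity)
    self.results.append(entity * 2)
    self.log("doubled %d" % entity)


job = Doubler([1, 2, -3, 4, 5])
tasks = run_job(job)
print("%d tasks, %d processed, %d failed" % (tasks, job.processed, job.failed))
```

In this sketch a failing entity is counted and logged rather than stopping the job, which is one plausible way to end up with the kind of per-job statistics and stack traces the admin console displays. A real implementation would presumably resume from a query cursor rather than a list offset, and would persist its state between tasks.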

The bulkupdate library is still a work in progress, but it's fully functional and ready to be used today. Check out the documentation on the main bulkupdate page for more on how to get started.

Will you use this library? Do you have a special job in mind for it, or a feature request? Let me know in the comments!

