Advanced Bulk Loading, part 1: Converters

This is the second in a series of 'cookbook' posts describing useful strategies and functionality for writing better App Engine applications.

The bulk loader facilitates getting data in and out of App Engine, but many people don't realise just how powerful it can be. In this and subsequent posts, we'll explore some of the more advanced things you can do with the bulk loader, including:

  • Importing or exporting binary data
  • Customizing created entities
  • Loading from and exporting to relational databases

Custom Conversion Functions

The most straightforward way of using the bulkloader, as shown in the documentation, is to define a bulkloader.Loader subclass and override its __init__ method, supplying a list of converters for the fields in your input file. Here's the example from the documentation:

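import datetime
from google.appengine.tools import bulkloader

class AlbumLoader(bulkloader.Loader):
    def __init__(self):
        # Each tuple pairs a property name with the converter for that column.
        bulkloader.Loader.__init__(self, 'Album',
            [('title', str),
             ('artist', str),
             ('publication_date',
              lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
             ('length_in_minutes', int)
            ])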

Most of the converters look like declarations - title is a str(ing), as is artist; length_in_minutes is an int. The publication_date converter is the odd one out, though, and gives us a hint of the real power behind converters: they can be any function that takes a string and returns a valid value for the model we're loading into. The declaration above is a bit messy, though. Let's define our own 'custom_date' conversion function to clean things up. Here it is:

def custom_date(fmt):
    """Returns a converter function that parses the supplied date format."""
    def converter(s):
        return datetime.datetime.strptime(s, fmt).date()
    return converter

What we're doing here is making use of some of Python's flexibility: We're defining a function that returns another function. The inner function (converter) has access to variables from the outer function. This is known as a 'closure'. You can think of our new function as a 'converter generator' - when it's called with a date/time format, it returns a converter that accepts dates in that format and parses them.
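To see the closure at work outside the bulk loader, here is a minimal standalone sketch. It repeats custom_date so it can be run on its own; the variable names (us_date, iso_date) and the sample dates are made up for illustration.

```python
import datetime

def custom_date(fmt):
    """Returns a converter function that parses the supplied date format."""
    def converter(s):
        # fmt is not an argument of converter; it is remembered from the
        # enclosing custom_date call that created this function.
        return datetime.datetime.strptime(s, fmt).date()
    return converter

# Each call produces an independent converter with its own remembered format.
us_date = custom_date("%m/%d/%Y")
iso_date = custom_date("%Y-%m-%d")

release_a = us_date("03/15/1994")
release_b = iso_date("1994-03-15")

# Both converters yield the same datetime.date despite different input formats.
assert release_a == release_b
```

Each generated converter takes a single string, which is exactly the shape the loader expects for an entry in its converter list.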

With the help of our new function, our AlbumLoader now looks like this:

class AlbumLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Album',
            [('title', str),
             ('artist', str),
             ('publication_date', custom_date("%m/%d/%Y")),
             ('length_in_minutes', int)
            ])

A noticeable improvement - and we can reuse this function anywhere we're parsing dates, even if we use different formats in different places.

Converters can get more sophisticated than that, though. Suppose we want to load a set of images into the datastore so we can serve them to users. We can define a conversion function that takes a filename, and returns the contents of the file, like this:

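from google.appengine.ext import db
from google.appengine.tools import bulkloader

def file_loader(filename):
    """Converter that returns the contents of the named file."""
    fh = open(filename, "rb")
    try:
        return fh.read()
    finally:
        # Close the file even if the read fails.
        fh.close()

class DatastoreImage(db.Model):
    filename = db.StringProperty(required=True)
    data = db.BlobProperty(required=True)

class ImageLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'DatastoreImage',
            [('filename', str),
             ('data', file_loader)
            ])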

When we run the bulkloader with this configuration, we supply a CSV file with two fields: the filename we want the file to have on App Engine, and the path (relative to the directory we're running the bulkloader from) to the actual file to upload. Our file_loader conversion function takes the second of those fields and reads the file into memory, so it's uploaded as part of the entity being created - without us having to figure out a way to embed images in a CSV file!
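The sketch below imitates, in plain Python, what happens to one such CSV row: each field is passed through the converter in the matching position. It is a rough stand-in, not the bulkloader's actual implementation; convert_row, the PROPERTIES name, the temporary file and the 'logo.png' row are all hypothetical, introduced only so the example runs without the App Engine SDK.

```python
import os
import tempfile

def file_loader(filename):
    """Converter that returns the contents of the named file."""
    fh = open(filename, "rb")
    try:
        return fh.read()
    finally:
        fh.close()

# Same shape as the converter list passed to bulkloader.Loader.__init__.
PROPERTIES = [('filename', str), ('data', file_loader)]

def convert_row(row, properties):
    """Hypothetical helper: apply each converter to the matching field."""
    return dict((name, conv(value))
                for (name, conv), value in zip(properties, row))

# Write a small stand-in "image" to disk so there is something to load.
tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
tmp.write(b"\x89PNG fake image bytes")
tmp.close()

# One CSV row: the name to store, then the local path to read from.
row = ["logo.png", tmp.name]
entity_values = convert_row(row, PROPERTIES)
os.remove(tmp.name)

# 'filename' stays a string; 'data' now holds the file's raw bytes.
print(sorted(entity_values.keys()))
```

Note that the local path never reaches the datastore: only the value returned by the converter is stored.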

This approach isn't limited to images, of course - it can also be useful if you want to upload HTML files into the datastore, for example.

Using this method, we could even define converters that do exotic things like fetching a file over HTTP, or generating an image on the spot using the Python Imaging Library - not that I'd recommend either approach!

Keep an eye out for the next post, coming soon: Advanced bulk loading, part 2!
