Advanced Bulk Loading, part 3: Alternate datasources

This is the fifth in a series of 'cookbook' posts describing useful strategies and functionality for writing better App Engine applications.

The bulkloader automatically supports loading data from CSV files, but it's not restricted to that. With a little work, you can make the bulkloader load data from any datasource you can access with Python. The key to this is the generate_records method. This method accepts a filename, and is expected to yield a list of strings for each entity to be uploaded. By overriding this method, we can load from any datasource we wish - for example, a relational database such as MySQL. To make this reusable, let's define a base class we can use to load data from MySQL databases:

import MySQLdb class MySQLLoader(bulkloader.Loader): def __init__(self, kind_name, query, converters): self.query = query bulkloader.Loader.__init__(kind_name, converters) def initialize(self, filename, loader_opts) self.connect_args = dict(urlparse.parse_qsl(loader_opts)) def generate_records(self, filename): """Generates records from a MySQL database.""" db = MySQLdb.connect(self.connect_args) cursor = db.cursor() cursor.execute(self.query) return iter(cursor.fetchone, None)

What we've done here is extended the bulkloader's Loader class to load from a MySQL database instead of from a CSV file. We've modified the constructor to take an extra argument - the query to execute - and we've written our own generate_records method that connects to the MySQL database, executes the provided query, and returns an iterator of rows to be uploaded into the database. We've also made use of the 'loader_opts' command line argument, which gets passed to the Loader's initialize() method, to contain the connection options for MySQL. Also note the unusual use of 'iter' in the return statement: when used with two parameters, iter returns an iterator that will repeatedly call its first parameter as a zero argument function until it returns the value provided in the second parameter. Since cursor.fetchone returns None when there are no further results, we can use iter to create an iterator from it.

Here's a simple example of using MySQLLoader:

class BlogPost(db.Model): title = db.TextProperty(required=True) date = db.DateProperty(required=True, auto_now_add=True) body = db.TextProperty(required=True) class BlogPostLoader(MySQLLoader): def __init__(self): MySQLLoader.__init__('BlogPost', 'SELECT title, date, body FROM posts', [('title', str), ('date', custom_date('%m/%d/%Y')), ('body', str) ])

To upload data using this class, we'd use a command line like this:

appcfg.py upload_data --config_file=blogpost_loader.py --loader_opts="passwd=moonpie&db;=things" --kind=BlogPost

Using this technique, we can load from any data source that can be accessed from within Python. Using another database like PostgreSQL or SQLite is a trivial change; we can also write subclasses to load from less common datasources, such as another webservice, or a non-relational database such as CouchDB.

In the next post, we'll briefly cover ways we can extend the bulk exporter by adapting the techniques we've already covered for importing.

Comments

blog comments powered by Disqus