Advanced Bulk Loading, part 4: Bulk Exporting

This is the sixth in a series of 'cookbook' posts describing useful strategies and functionality for writing better App Engine applications.

In previous posts, we covered Advanced Bulk-loading. Now we'll cover, briefly, the reverse: advanced use of the Bulk Exporter functionality. Unfortunately, the Bulk Exporter is currently much more limited than the Bulk-Loader - though it's also much less mature, so we can expect that to change - but there are still customizations you can apply.

The simplest one is the same as what we covered in part 1 of our Bulk-loader series: custom conversion functions. Bulk exporter classes define a list of fields and conversion functions just like the importer; the difference is that these functions are expected to convert to strings, rather than from strings. Let's start by going over the equivalents of the two loader conversion functions we defined. First, one to format dates:

def export_date(fmt):
  """Returns a converter function that outputs the supplied date-time format."""
  def converter(d):
    return d.strftime(fmt)
  return converter
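
To make the closure concrete, here is a small standalone check of what the converter returns. The name to_us_date and the sample date are purely illustrative, not part of the exporter itself:

```python
import datetime

def export_date(fmt):
  """Returns a converter function that outputs the supplied date-time format."""
  def converter(d):
    return d.strftime(fmt)
  return converter

# export_date() is called once, when the field list is built; the function it
# returns is what gets called for each exported value.
to_us_date = export_date("%m/%d/%Y")
print(to_us_date(datetime.datetime(2009, 7, 4, 12, 30)))  # 07/04/2009
```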

So far so good. We use it the same way we used the converter function in the importer:

class AlbumExporter(bulkloader.Exporter):
  def __init__(self):
    bulkloader.Exporter.__init__(self, 'Album',
                                 [('title', str),
                                  ('artist', str),
                                  ('publication_date', export_date("%m/%d/%Y")),
                                  ('length_in_minutes', str),
                                 ])

A 'file exporter' converter function is similarly simple:

import hashlib

def file_exporter(data):
  """Writes the blob to a file named after its SHA-1 hash and returns the name."""
  filename = hashlib.sha1(data).hexdigest()
  with open(filename, "wb") as fh:
    fh.write(data)
  return filename

Here we're naming the file after its content-hash, and returning the derived filename; other solutions are possible if you know more about the files you're storing, of course. Here it is in use, again with the example from our first post:

class DatastoreImage(db.Model):
  filename = db.StringProperty(required=True)
  data = db.BlobProperty(required=True)

class ImageExporter(bulkloader.Exporter):
  def __init__(self):
    bulkloader.Exporter.__init__(self, 'DatastoreImage',
                                 [('filename', str),
                                  ('data', file_exporter),
                                 ])
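
The converter above writes into whatever the current directory happens to be. One possible variation, sketched here using the same closure pattern as export_date, takes a target directory instead. The name make_file_exporter is hypothetical, and the sketch assumes the directory already exists:

```python
import hashlib
import os

def make_file_exporter(directory):
  """Returns a converter that writes blobs into `directory`, named by SHA-1."""
  def converter(data):
    filename = hashlib.sha1(data).hexdigest()
    path = os.path.join(directory, filename)
    # Identical content hashes to the same name, so skip files already written.
    if not os.path.exists(path):
      with open(path, "wb") as fh:
        fh.write(data)
    return filename
  return converter
```

It would slot into the field list in place of file_exporter, for example ('data', make_file_exporter('images')).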

We can't use the filename provided in the datastore for two reasons: we don't know whether it's unique, and converter functions can't access properties other than the one they're converting. Next, we'll tackle how to modify the exported records and how they're serialized.

Our opportunities for further customization of the export process are more limited than in the importer class. The Exporter has three methods of interest: initialize and finalize, which do what their names imply (and hence won't be covered here), and output_entities.

output_entities is called once the bulkloader has downloaded all the records and stored them in a local sqlite database. It takes one argument, a generator that yields all the downloaded entities before they are stringified using the process above. This means that if you override output_entities, you need to either reuse the existing stringification code or supply your own output logic. Let's cover the latter, to demonstrate exporting directly to a relational database such as MySQL:

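import urlparse

import MySQLdb

class AlbumExporter(bulkloader.Exporter):
  def __init__(self):
    bulkloader.Exporter.__init__(self, 'Album', [])

  def output_entities(self, entity_generator):
    # Assumes the output filename given to the bulkloader is really a query
    # string of MySQLdb.connect() arguments, e.g. "host=localhost&user=me&db=music",
    # and that the base class's initialize() stored it as self.output_filename.
    conn = MySQLdb.connect(**dict(urlparse.parse_qsl(self.output_filename)))
    c = conn.cursor()
    for entity in entity_generator:
      c.execute("INSERT INTO albums (title, artist, publication_date, length) VALUES (%s, %s, %s, %s)",
                [entity.title, entity.artist, entity.publication_date, entity.length_in_minutes])
    conn.commit()
    conn.close()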

Presto - loading straight into a relational database. This is only an example - a more efficient one would use executemany() to pull batches of records from the iterator and insert them all at once, and would reuse or reimplement the stringification code to make it easier to specify fields to load and conversions to perform on them.
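
The batching idea can be sketched roughly as follows. This is only one way to do it: the helper names batches and insert_in_batches are hypothetical, and the caller supplies both the SQL and a to_row function, since how properties are read off an entity depends on your exporter. Any DB-API cursor with executemany() should work.

```python
import itertools

def batches(iterable, size):
  """Yields lists of up to `size` items drawn from `iterable`."""
  iterator = iter(iterable)
  while True:
    batch = list(itertools.islice(iterator, size))
    if not batch:
      return
    yield batch

def insert_in_batches(cursor, sql, entity_generator, to_row, batch_size=100):
  """Calls cursor.executemany() once per batch; returns the number of rows sent."""
  total = 0
  for batch in batches(entity_generator, batch_size):
    # to_row turns one entity into the parameter sequence the SQL expects.
    cursor.executemany(sql, [to_row(entity) for entity in batch])
    total += len(batch)
  return total
```

Inside output_entities, the loop would then shrink to a single call such as insert_in_batches(c, sql, entity_generator, lambda e: [e.title, e.artist]), followed by a commit. Because the generator is consumed a batch at a time, only one batch of rows is held in memory at once.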

In the next post, we'll discuss how users of App Engine for Java can use the bulkloader to load data into their apps.
