Using the new bulkloader

Posted by Nick Johnson | Filed under tech, app-engine, coding, bulkloader

Recently, Matthew Blain, of the App Engine team, announced the prerelease of a new bulkloader. The new bulkloader uses YAML files for configuration, and takes a 'declarative' rather than procedural approach to describing how data is downloaded and uploaded. As a result, you don't have to understand Python in order to configure and use the new bulkloader, which is a significant advantage for users of the Java App Engine runtime.

There are, of course, many other significant improvements, including autogeneration of config files, a built-in library of converters for common data types, support for input and output types other than CSV, and more. Today, we'll walk through basic usage of the new bulkloader, and demonstrate some of its features.

Configuration autogeneration

One of the most significant new features of the bulkloader is its support for autogenerating config files. It works like this: you point it at your production app, and it downloads the datastore stats and uses them to generate a configuration file for you. You edit the configuration file to fill in a few missing fields and tidy it up, and presto, you have a working bulkloader configuration. Let's see how that works out when we apply it to Tweet Engine.

First, we download and unpack the patched SDK from the bulkloader site. Changing into the google_appengine directory, we run the bulkloader with the --create_config flag:

$ ./bulkloader.py --create_config --url http://tweet-engine.appspot.com/remote_api --filename generated_bulkloader.yaml
[INFO    ] Logging to bulkloader-log-20100426.144319
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 100
[INFO    ] Opening database: bulkloader-progress-20100426.144319.sql3
[INFO    ] Opening database: bulkloader-results-20100426.144319.sql3
[INFO    ] Connecting to tweet-engine.appspot.com/remote_api
No handlers could be found for logger "google.appengine.tools.appengine_rpc"
[INFO    ] Downloading kinds: ['__Stat_PropertyType_PropertyName_Kind__']
.
[INFO    ] Have 64 entities, 0 previously transferred
[INFO    ] 64 entities (23986 bytes) transferred in 1.9 seconds

Now let's take a look at (part of) the generated_bulkloader.yaml file it created for us:

# Autogenerated bulkloader.yaml file.
# You're likely to have to do various edits to this file:
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
# Also:
#  * Review property_map.

The file starts with a fairly straightforward set of instructions. We'll see what they mean shortly.

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.db
- import: re
- import: base64

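The python_preamble lists the Python modules the bulkloader imports before it processes the rest of the file. That's how it finds the functions we reference later on, such as the ones in the transform module. For basic bulkloading, we shouldn't have to modify this.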

transformers:
- kind: Permission
  connector: # TODO: Choose a connector here: csv, simplexml, etc...
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: account
      external_name: account
      # Type: Key Stats: 119 properties of this type in this kind.
      import_transform: transform.create_foreign_key('TODO: fill in Kind name')
      export_transform: transform.key_id_or_name_as_string

    - property: invite_nonce
      external_name: invite_nonce
      # Type: NULL Stats: 100 properties of this type in this kind.

    # Warning: This property is a duplicate, but with a different type.
    # TODO: Edit this transform so only one property with this name remains.
    - property: invite_nonce
      external_name: invite_nonce
      # Type: String Stats: 19 properties of this type in this kind.

    - property: role
      external_name: role
      # Type: Integer Stats: 119 properties of this type in this kind.
      import_transform: transform.none_if_empty(int)

    - property: user
      external_name: user
      # Type: Key Stats: 119 properties of this type in this kind.
      import_transform: transform.create_foreign_key('TODO: fill in Kind name')
      export_transform: transform.key_id_or_name_as_string

The 'transformers' section forms the bulk of the file. I've only included one sample kind here, but the bulkloader generates an entry for each kind in your datastore. For each entry, we have a 'kind', which should be self-explanatory, a 'connector', which describes what format we want to load or dump the data for this kind in, and 'connector_options', which are arbitrary options to be passed to the connector.
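
Since the generated file flags everything that needs attention with a TODO marker, it's easy to list what's left to fix before trying to use it. Here's a minimal sketch of that in Python. The find_todos helper is my own name, not part of the bulkloader, and the excerpt is just the first few lines of the Permission entry above:

```python
def find_todos(text):
    """Return (line_number, stripped_line) for every line containing 'TODO'."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if "TODO" in line
    ]


# A few lines copied from the generated Permission entry shown above.
excerpt = """\
- kind: Permission
  connector: # TODO: Choose a connector here: csv, simplexml, etc...
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
"""

for number, line in find_todos(excerpt):
    print(number, line)
```

Run against the whole file (for example, by passing in open('generated_bulkloader.yaml').read()), this would also pick up the instruction comment at the top that mentions TODO, which is harmless.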

Customizing

The bulkloader README lists the available connectors: currently they are 'csv', 'simplexml', and 'simpletext' (which is export only). It's also possible to write your own connectors. We'll start by using the csv connector, and provide some basic options, replacing the existing header with this:

- kind: Permission
  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header

This specifies that we want to use the CSV connector, that the files are UTF-8 encoded, and that the column names come from a header line at the top of each file. Next let's look at the property map.

The property map consists of a set of 'property' entries, each of which specifies how to handle a particular property of the model on import and on export. For our Permission kind, the bulkloader has identified four properties, plus the __key__ pseudo-property. Each has an 'external_name', and optional import and export transforms, which specify how to translate between the App Engine datastore representation and an external representation. For the most part it got everything right, but it listed one property, invite_nonce, twice, because the stats show it stored with two different types (NULL and String), and we need to fill in the kind names of our reference keys. Here's the updated properties section:

  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: account
      external_name: account
      import_transform: transform.create_foreign_key('Account')
      export_transform: transform.key_id_or_name_as_string

    - property: invite_nonce
      external_name: invite_nonce

    - property: role
      external_name: role
      import_transform: transform.none_if_empty(int)

    - property: user
      external_name: user
      import_transform: transform.create_foreign_key('User')
      export_transform: transform.key_id_or_name_as_string

All we had to do here was to remove some of the boilerplate and the extraneous invite_nonce entry, and fill in the kind names for the two reference properties, and we're sorted.
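
If you're wondering what those transforms actually do, here's a rough sketch of the idea in plain Python. To be clear, these are simplified stand-ins I've written for illustration, not the SDK's code: the real functions live in google.appengine.ext.bulkload.transform and work with real datastore keys, whereas FakeKey here is a made-up placeholder, and 'example-account' is an invented value.

```python
from collections import namedtuple

# Hypothetical stand-in for a datastore key: just a kind plus an id or name.
FakeKey = namedtuple("FakeKey", ["kind", "id_or_name"])


def none_if_empty(fn):
    """Wrap fn so that an empty or missing value becomes None."""
    def wrapper(value):
        if value is None or value == "":
            return None
        return fn(value)
    return wrapper


def create_foreign_key(kind):
    """Return a function that turns an external value into a key of `kind`."""
    def to_key(value):
        if value is None:
            return None
        return FakeKey(kind, value)
    return to_key


def key_id_or_name_as_string(key):
    """Export direction: reduce a key to its id or name, as a string."""
    return str(key.id_or_name)


to_role = none_if_empty(int)                # like the 'role' import_transform
to_account = create_foreign_key("Account")  # like the 'account' import_transform

print(to_role("2"))   # 2
print(to_role(""))    # None
key = to_account("example-account")
print(key_id_or_name_as_string(key))  # example-account
```

The thing to notice is that each import transform is a function from the external value to the datastore value, and each export transform goes the other way. That pairing is what lets exported data be loaded back in.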

Exporting data

Exporting using the new bulkloader is just as straightforward:

$ ./bulkloader.py --download --url http://tweet-engine.appspot.com/remote_api --config_file generated_bulkloader.yaml --kind Permission --filename download_permission.csv
[INFO    ] Logging to bulkloader-log-20100426.160656
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 100
[INFO    ] Opening database: bulkloader-progress-20100426.160656.sql3
[INFO    ] Opening database: bulkloader-results-20100426.160656.sql3
[INFO    ] Connecting to tweet-engine.appspot.com/remote_api
[INFO    ] Downloading kinds: ['Permission']
.[INFO    ] Permission: No descending index on __key__, performing serial download
.
[INFO    ] Have 119 entities, 0 previously transferred
[INFO    ] 119 entities (42514 bytes) transferred in 5.8 seconds

As expected, download_permission.csv contains our data, in CSV form. And we didn't have to write a single line of Python code, or set up an app.yaml, or anything else Python-specific in order to achieve it! Further, the bulkloader took care of generating a mostly-finished configuration file for us, in a format that ensures the data we download can be re-uploaded again without loss of fidelity. It's like magic!
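
Once you have the file, any CSV reader can consume it. Here's a small sketch using Python's standard csv module, assuming the export starts with a header row naming the columns after the external_name values in our config. The sample row is made up for illustration (it is not real Tweet Engine data), and the column order is an assumption too:

```python
import csv
import io

# Made-up sample in the shape of the Permission export. The values are
# placeholders and the column order is assumed.
sample = io.StringIO(
    "key,account,invite_nonce,role,user\n"
    "1,example-account,,2,example-user\n"
)

reader = csv.DictReader(sample)  # column names are taken from the first line
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["role"])
```

To read the real thing, you'd open download_permission.csv in place of the in-memory sample. Note that everything comes back as a string, including an empty string for the missing invite_nonce. Converting those back into proper values is exactly the job the import transforms do.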

Got a particular feature of the new bulkloader you'd like to see discussed? Let us know in the comments!

26 April, 2010
