Advanced Bulk Loading Part 5: Bulk Loading for Java

This is the seventh in a series of 'cookbook' posts describing useful strategies and functionality for writing better App Engine applications.

When it comes to Bulk Loading, there's currently a shortage of tools for Java users, largely due to the relative newness of the Java platform for App Engine. Not all is doom and gloom, however: users of App Engine for Java can use the Python bulkloader to load data into their Java App Engine instances! App Engine treats each major version of an app as a completely separate entity, but all versions share the same datastore, so it's possible to upload a Python version specifically for the purpose of Bulk Loading. This won't interfere with serving your Java app to your users.

To follow this guide, you won't need to understand much of the Python platform, though you will need to know a little Python. If your bulkloading needs are straightforward, it's essentially connect-the-dots. If they are a little more complex, you'll need to understand some of the basics of programming in Python: defining and calling functions and methods, basically. Who knows, you may even discover you like it. ;)

Initial Setup

To start, we need to set up your app to be able to accept bulk uploaded data:

  1. Download the Python SDK from here.
  2. Unpack and/or install it, as appropriate for your platform.
  3. Create an empty directory somewhere on your hard drive. Under your current Java project directory is probably fine.
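  4. Create a file called 'app.yaml' in the directory you created in step 3, and in it write the following, adding your app's ID after 'application:' (it must be the same ID as your Java app, so that both versions share one datastore):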

application:
version: bulkload
runtime: python
api_version: 1

handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
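
Steps 3 and 4 can also be scripted. The following is only a sketch of one way to do that, not something the SDK provides: the helper name write_bulkload_app is made up for illustration, and 'filmdb' stands in for your own application ID.

```python
import os

# Same contents as the app.yaml above, with a slot for the application ID.
APP_YAML_TEMPLATE = """\
application: {app_id}
version: bulkload
runtime: python
api_version: 1

handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin
"""


def write_bulkload_app(directory, app_id):
    """Create `directory` if it is missing and write app.yaml into it.

    Hypothetical helper for illustration; returns the path of the file written.
    """
    if not os.path.isdir(directory):
        os.makedirs(directory)
    path = os.path.join(directory, 'app.yaml')
    with open(path, 'w') as f:
        f.write(APP_YAML_TEMPLATE.format(app_id=app_id))
    return path


if __name__ == '__main__':
    # 'bulkload' and 'filmdb' are placeholder values; use your own.
    print(write_bulkload_app('bulkload', 'filmdb'))
```

Keeping the version fixed at 'bulkload' matters more than the directory name, since the version is what keeps this Python code separate from the Java version you serve to users.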

Now, upload your new Python 'application'. If you're using the command-line tools, the command to do this is:

appcfg.py update path/to/dir

If you're using the Windows or Mac launchers, you'll need to add the application directory to the list (In OSX this is File -> Add Existing Application), then click 'Deploy'.

Writing Loaders

Now, you need to define your data model and upload specification. Create a file called 'models.py' - in the directory you created earlier, or anywhere else that suits. Data models in Python are expressed as classes, with fields listed as properties. For example:

from google.appengine.ext import db

class Film(db.Model):
  name = db.StringProperty()
  average_rating = db.FloatProperty()
  num_ratings = db.IntegerProperty(indexed=False)

The first line imports the 'db' module, which we will need in order to define our model. The kind name is the name of the class. Field names are on the left, while the data types are defined by the 'property' classes on the right. Field options can be specified by passing parameters to the Property, inside the parentheses. The only one likely to be of interest to you is 'indexed': pass indexed=False to specify that the field in question should not be indexed in the datastore, as in the num_ratings field above. A list of available property classes - the data types, on the right in the example above - is here, but the ones most likely to be of interest are:

  • db.StringProperty(), which defines short, indexed text strings.
  • db.TextProperty(), which is for unindexed text strings of any length.
  • db.BlobProperty(), for unindexed binary data.
  • db.IntegerProperty(), for integers.
  • db.FloatProperty(), for floating point numbers.
  • db.DateProperty(), db.TimeProperty(), db.DateTimeProperty(), for dates, times, and timestamps, respectively.
  • db.BooleanProperty(), for boolean values.
  • db.ReferenceProperty() for referencing other models.

The last one, ReferenceProperty, deserves extra attention. To use a ReferenceProperty, you must specify one argument, the model class you are referencing, and this model class must have been defined before the class containing the ReferenceProperty. For example:

class Owner(db.Model):
  name = db.StringProperty()

class Pet(db.Model):
  name = db.StringProperty()
  owner = db.ReferenceProperty(Owner)

If you need to create a reference to another object of the same class, use a db.SelfReferenceProperty(), which does not require any parameters.

If you need to see how your Java models map to these, look at your app's data in the Admin Console datastore viewer to see the kind and field names.

Now that you've specified your models to match the Java ones (or at least the ones you need to bulk-load), you can write the Loader class that specifies how to map your CSV (or other) data to the models you just defined. The best reference for this is the official documentation, here - the procedure is identical for loading into standard Python apps as for loading into your Java app. For example, here's a loader for our Film class:

from google.appengine.tools import bulkloader
import models  # the model classes defined in models.py

class FilmLoader(bulkloader.Loader):
  def __init__(self):
    # Kind name, then one (field name, converter) pair per CSV column, in column order.
    bulkloader.Loader.__init__(self, 'Film',
                               [('name', str),
                                ('average_rating', float),
                                ('num_ratings', int),
                               ])

loaders = [FilmLoader]

Put your loader(s) in a file called loader.py, in the same directory as your models.py file.

The first two lines of loader.py import the bulkloader module, which we need to define our loader, and the models module we defined earlier. The rest is the loader definition, as described in the docs linked above: the kind name, followed by one (field name, conversion function) pair for each column of the input file, in order. The last line provides a listing of loader classes to use.
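
Because the loader applies one conversion function per column, a malformed row will most likely surface only partway through an upload. As a rough sketch (an assumption about a useful workflow, not part of the bulkloader itself), you could run your rows through the same converters locally first. The names FILM_COLUMNS and check_rows are hypothetical, and the sample rows are invented.

```python
import csv

# Column order and converters mirror the FilmLoader definition above.
FILM_COLUMNS = [('name', str), ('average_rating', float), ('num_ratings', int)]


def check_rows(rows, columns=FILM_COLUMNS):
    """Return (row_number, message) pairs for rows that fail a local check.

    A row fails if it has the wrong number of columns or if one of its
    values cannot be converted by the matching conversion function.
    """
    problems = []
    for number, row in enumerate(rows, 1):
        if len(row) != len(columns):
            problems.append((number, 'expected %d columns, got %d'
                             % (len(columns), len(row))))
            continue
        for (field, convert), value in zip(columns, row):
            try:
                convert(value)
            except ValueError:
                problems.append((number, 'bad value %r for %s' % (value, field)))
    return problems


if __name__ == '__main__':
    # Invented sample data; csv.reader accepts any iterable of lines,
    # so an open film_data.csv file object works here as well.
    sample = ['Brazil,8.0,1200', 'Alien,eight,900']
    for number, message in check_rows(csv.reader(sample)):
        print('row %d: %s' % (number, message))
```

This only checks what the converters can check; it says nothing about whether the values make sense for your Java model.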

You may also want to read the previous articles in this series for advanced bulkloading techniques. Once you've defined your loader, you're ready to bulkload the data into the datastore. The process for this is described here. Because we're loading our data into a version other than the default - the 'bulkload' version we created way back at the start - we need to supply an extra argument, --url, that points at that version's remote_api handler. Here's an example command line for loading our Films into an app called 'filmdb':

appcfg.py upload_data --config_file=loader.py --filename=film_data.csv --kind=Film --url=http://bulkload.latest.filmdb.appspot.com/remote_api directory

Here, 'directory' is the directory containing app.yaml and our Python files. Running this will upload your data into your app, where it will be accessible from your Java application. Congratulations!
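
The URL in that command follows the pattern version.latest.appid.appspot.com, which is how a specific, non-default version of an app is addressed. If you bulkload into several apps or kinds, a small helper can assemble the command for you. This is only an illustrative sketch: remote_api_url and upload_command are made-up names, not part of the SDK, and the sketch builds the argument list without running it.

```python
def remote_api_url(app_id, version):
    """Build the remote_api URL for one specific version of an app."""
    return 'http://%s.latest.%s.appspot.com/remote_api' % (version, app_id)


def upload_command(app_id, version, kind, config_file, filename, directory):
    """Return the appcfg.py upload_data invocation as a list of arguments."""
    return [
        'appcfg.py', 'upload_data',
        '--config_file=%s' % config_file,
        '--filename=%s' % filename,
        '--kind=%s' % kind,
        '--url=%s' % remote_api_url(app_id, version),
        directory,
    ]


if __name__ == '__main__':
    # Placeholder values taken from the running example in this post.
    print(' '.join(upload_command('filmdb', 'bulkload', 'Film',
                                  'loader.py', 'film_data.csv', 'directory')))
```

With the values shown, this produces the options from the command above plus the application directory as the final argument. The list form can be handed to subprocess.call if you would rather launch the upload from a script.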

Next time, we'll discuss some useful examples of custom Property classes for Python.   
