ReferenceProperty prefetching in App Engine

This post is a brief interlude in the webapps on App Engine series. Fear not, it'll be back!

Frequently, we need to do a datastore query for a set of records, then do something with a property referenced by each of those records. For example, suppose we are writing a blogging system and want to display a list of posts along with their authors. We might do something like this:

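from google.appengine.ext import db

class Author(db.Model):
  name = db.StringProperty(required=True)

class Post(db.Model):
  title = db.TextProperty(required=True)
  body = db.TextProperty(required=True)
  author = db.ReferenceProperty(Author, required=True)
  # The query below sorts on this property, so the model needs to define it.
  timestamp = db.DateTimeProperty(auto_now_add=True)


# Fetch the 20 most recent posts and show each one with its author's name.
posts = Post.all().order("-timestamp").fetch(20)
for post in posts:
  print post.title
  print post.author.name  # dereferencing post.author triggers a datastore get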

On the surface, this looks fine. If we look closer, however, perhaps by using Guido's excellent AppStats tool, we'll notice that on each iteration of the loop we're performing a get request for the referenced author entity. This happens the first time we dereference any ReferenceProperty, even if we've previously dereferenced a separate ReferenceProperty that points to the same entity!

Obviously, this is less than ideal. We're doing a lot of RPCs, and incurring a lot of per-RPC overhead and delay. Further, since we're performing them serially, they take a lot longer than a batch fetch for the equivalent number of entities would take. Is there some way we can improve on this?

It turns out there is. There's a well-known mechanism for retrieving the key for a ReferenceProperty without dereferencing it: Property.get_value_for_datastore. For example:

key = Post.author.get_value_for_datastore(a_post)

Given a list of entities, then, we can get the keys they reference, and with those, we can fetch the referenced entities. How do we update the entities in the list with the retrieved references, though? The code for caching referenced entities is deep inside the ReferenceProperty class, and although we could monkey around with it, we really shouldn't - it's likely to break without notice.

There's a way around this impasse, however: we can simply set the ReferenceProperty to the entity we retrieved, as if we were modifying it. This causes the ReferenceProperty to store the referenced key again (which is unchanged) and to cache the entity for later dereferencing. Easy!

Here's the code:

def prefetch_refprop(entities, prop):
  ref_keys = [prop.get_value_for_datastore(x) for x in entities]
  ref_entities = dict((x.key(), x) for x in db.get(set(ref_keys)))
  for entity, ref_key in zip(entities, ref_keys):
    prop.__set__(entity, ref_entities[ref_key])
  return entities

Counting the def line as line 1: line 2 extracts the referenced key from each entity that was passed in, storing the keys in a list named ref_keys. Line 3 converts ref_keys to a set, eliminating any duplicates, retrieves the referenced entities with a single db.get(), and builds a dict mapping each entity's key to the retrieved entity. Line 4 iterates through the original entities and the keys they referenced, and line 5 sets the property on each entity to the retrieved value, looking it up in the dict we just constructed. At the end, we return the original list of entities, so we can use our function as a filter if we wish. Here's how it's used:

posts = Post.all().order("-timestamp").fetch(20)
prefetch_refprop(posts, Post.author)
for post in posts:
  print post.title
  print post.author.name
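To see the effect without a real datastore, here is a minimal sketch built on in-memory stand-ins. FakeDatastore, LazyReference, get_key and prefetch are hypothetical names invented for this illustration. They only imitate the behaviour described above (a fetch on first dereference, caching on assignment) and are not App Engine's implementation.

```python
class FakeDatastore(object):
    """In-memory stand-in for the datastore that counts round trips."""

    def __init__(self):
        self.rows = {}
        self.rpc_count = 0

    def put(self, key, entity):
        self.rows[key] = entity

    def get(self, key):
        self.rpc_count += 1  # one round trip per single get
        return self.rows[key]

    def get_multi(self, keys):
        self.rpc_count += 1  # one round trip for the whole batch
        return [self.rows[k] for k in keys]


STORE = FakeDatastore()


class LazyReference(object):
    """Toy descriptor: holds a key and fetches the target on first access."""

    def __init__(self, name):
        self.key_attr = "_" + name + "_key"
        self.cache_attr = "_" + name + "_cached"

    def get_key(self, instance):
        # Plays the role of get_value_for_datastore: no fetch happens here.
        return getattr(instance, self.key_attr)

    def __get__(self, instance, owner):
        if instance is None:
            return self
        if not hasattr(instance, self.cache_attr):
            target = STORE.get(getattr(instance, self.key_attr))
            setattr(instance, self.cache_attr, target)
        return getattr(instance, self.cache_attr)

    def __set__(self, instance, target):
        # Assigning an object records its key and caches the object itself.
        setattr(instance, self.key_attr, target.key)
        setattr(instance, self.cache_attr, target)


class Author(object):
    def __init__(self, key, name):
        self.key = key
        self.name = name


class Post(object):
    author = LazyReference("author")

    def __init__(self, title, author_key):
        self.title = title
        self._author_key = author_key


def prefetch(posts, prop):
    """Same shape as prefetch_refprop, written against the stand-ins."""
    keys = [prop.get_key(p) for p in posts]
    unique = list(set(keys))
    fetched = dict(zip(unique, STORE.get_multi(unique)))
    for post, key in zip(posts, keys):
        prop.__set__(post, fetched[key])
    return posts


def make_posts():
    # 20 posts spread over 3 authors.
    return [Post("post %d" % i, "author-%d" % (i % 3)) for i in range(20)]


for i in range(3):
    STORE.put("author-%d" % i, Author("author-%d" % i, "Author %d" % i))

# Naive loop: every post fetches its author separately.
STORE.rpc_count = 0
naive_names = [p.author.name for p in make_posts()]
naive_rpcs = STORE.rpc_count

# Prefetched loop: one batch get, then every dereference hits the cache.
STORE.rpc_count = 0
batched_names = [p.author.name for p in prefetch(make_posts(), Post.author)]
batched_rpcs = STORE.rpc_count
```

In this toy setup the naive loop makes one round trip per post, while the prefetched version makes a single batch call and returns the same names.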

This is looking really good - but what if we want to dereference multiple ReferenceProperty fields on the same set of entities? We could call prefetch_refprop once for each, but that's reintroducing some of the same inefficiency we wrote all this to combat. Can we do better? Naturally we can:

def prefetch_refprops(entities, *props):
  fields = [(entity, prop) for entity in entities for prop in props]
  ref_keys = [prop.get_value_for_datastore(x) for x, prop in fields]
  ref_entities = dict((x.key(), x) for x in db.get(set(ref_keys)))
  for (entity, prop), ref_key in zip(fields, ref_keys):
    prop.__set__(entity, ref_entities[ref_key])
  return entities

This is similar to the original function, but with a couple of added wrinkles. We've converted the "prop" argument to "*props", allowing us to pass any number of ReferenceProperty instances as additional arguments. On line 2 (again counting the def line as line 1), we create the list "fields", which consists of the Cartesian product of entities and properties: that is, every combination of entity and property in the input lists. Line 3 operates much the same as previously, except that both the property and the entity are taken from the fields list. Line 4 remains completely unchanged, while line 5, the loop, now zips together the fields list and the referenced keys. Line 6 behaves as previously.

Using this new function is exactly the same as using the original one, except that we can now pass multiple ReferenceProperty instances. If Post also had a category ReferenceProperty, for example, we could call "prefetch_refprops(posts, Post.author, Post.category)", and all the references would be fetched with a single datastore get.
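The way the fields list and the key list line up can be traced in plain Python. This is only a sketch: the dicts stand in for entities, the strings "author" and "category" stand in for ReferenceProperty instances, and the key values and "entity:" strings are made up for illustration.

```python
# Hypothetical stand-ins: each "entity" is a dict of reference keys, and each
# "property" is just the name of the field holding a key.
entities = [
    {"author": "a1", "category": "c1"},
    {"author": "a2", "category": "c1"},
    {"author": "a1", "category": "c2"},
]
props = ["author", "category"]

# Every (entity, property) combination, entity by entity.
fields = [(entity, prop) for entity in entities for prop in props]

# One referenced key per pair, in the same order as fields.
ref_keys = [entity[prop] for entity, prop in fields]

# Deduplicated keys: what a single batch get would be asked for.
unique_keys = set(ref_keys)

# Pretend batch fetch: map each key to a placeholder "entity".
fetched = dict((key, "entity:" + key) for key in unique_keys)

# Because fields and ref_keys share an order, zip pairs each slot with its key.
for (entity, prop), key in zip(fields, ref_keys):
    entity[prop + "_resolved"] = fetched[key]
```

Three entities with two properties each give six slots to fill, but only four distinct keys need to be fetched.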

One caveat if you intend to use this recipe: with regular dereferencing, two fields that reference the same entity will return different objects, which can be modified independently. With our recipe, though, if the keys are the same, the entities will be the same object, so modifying post1.author will also modify post2.author! Bear this in mind if you intend to modify the referenced entities.
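The sharing is ordinary Python aliasing, and can be seen with plain objects. The classes and values below are hypothetical stand-ins for illustration, not datastore models.

```python
class Author(object):
    def __init__(self, key, name):
        self.key = key
        self.name = name


class Post(object):
    def __init__(self, title, author):
        self.title = title
        self.author = author


# One fetched object per key, as in the ref_entities dict of the recipe.
ref_entities = {"a1": Author("a1", "Alice")}

# Both posts are handed the very same Author object.
post1 = Post("first", ref_entities["a1"])
post2 = Post("second", ref_entities["a1"])

# A change made through post1 is visible through post2.
post1.author.name = "Bob"
shared = post1.author is post2.author
name_seen_by_post2 = post2.author.name
```

Here shared is True, and post2 sees the renamed author even though only post1.author was touched.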
