Blogging on App Engine, part 3: Dependencies

This is part of a series of articles on writing a blogging system on App Engine. An overview of what we're building is here.

First, a couple of things of note. Between the last post and this one, I've snuck around behind your back and made a couple of minor changes. Don't worry, none of them are major. The most noticeable of these is that I've implemented a CSS design from the excellent site styleshout; our blog will now look halfway presentable. I've also refactored the existing admin code into a number of smaller modules; if you're browsing the source, you'll notice the code is now split between 'handlers.py' (the webapp.RequestHandlers), 'models.py' (the datastore models), and 'utils.py' (the utility functions such as those to generate content from templates).

I'm also pleased to announce that a couple of dedicated coders are following along with the series by writing their own ports of Bloggart. Sylvain is writing 'bloggartornado', a port of Bloggart to the Tornado framework; the source is here, and a demo can be seen at http://bloggartornado.appspot.com/. Rodrigo Moraes is writing 'bloggartzeug', a port of Bloggart to the Werkzeug framework; the source is here.

This post is going to be a long one. Are you sitting comfortably? Then let's begin.

The biggest challenge when writing a statically-generated blogging system such as ours is figuring out what pages need regenerating, and when. For that, we're going to use a dependency system. Our system will consist of a series of 'ContentGenerator' classes, each of which is responsible for (re)generating some specific part of the blog - such as the posts themselves, the index pages, the RSS feed, and so forth. We'll start by defining an interface for these classes in a new file, 'generators.py':

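import hashlib

from google.appengine.ext import db

import config
import static
import utils

# Every ContentGenerator subclass registers itself here so publish() can find it.
generator_list = []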

class ContentGenerator(object):
  """A class that generates content and dependency lists for blog posts."""

  @classmethod
  def name(cls):
    return cls.__name__

  @classmethod
  def get_resource_list(cls, post):
    raise NotImplementedError()

  @classmethod
  def get_etag(cls, post):
    raise NotImplementedError()

  @classmethod
  def generate_resource(cls, post, resource):
    raise NotImplementedError()

Together, these methods define the interface all our ContentGenerator subclasses will have. Note that they're all class methods - we don't need to create instances of ContentGenerator anywhere, because it has no state to speak of. Let's look at them in order:

  • name() is straightforward - it returns a unique name for the ContentGenerator. By default, this is the name of the class.
  • get_resource_list() takes a BlogPost and is expected to return a list of strings representing resources this post will appear in. For example, if we were implementing tags (we're not, but we will sooner or later), this would return a list of tags in the post: ["foo", "bar", "baz"]
  • get_etag() takes a BlogPost and returns a short string that uniquely identifies the state of the content this generator cares about. For example, a ContentGenerator for the blog's index page should return an ETag that only depends on the title and summary of the post, while the ContentGenerator for the post itself should return one that depends on the entire post. This lets us figure out if we need to regenerate all the existing resources for a post when it changes.
  • generate_resource() takes a BlogPost object and a resource as returned by get_resource_list(); it's expected to generate that resource for that post and update it in the static serving system.
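
To make the interface a little more concrete, here is a rough sketch of what the tag generator mentioned above might look like. Everything in it is hypothetical: FakePost stands in for BlogPost, the 'tags' attribute doesn't exist on our model yet, and instead of writing to the static serving system it just records which resources it was asked to generate. It also carries a trimmed copy of the base class so it runs on its own.

```python
import hashlib


class ContentGenerator(object):
    """Trimmed copy of the interface above, so the sketch runs on its own."""

    @classmethod
    def name(cls):
        return cls.__name__


class FakePost(object):
    """Hypothetical stand-in for BlogPost; 'tags' is not a real property yet."""

    def __init__(self, title, tags):
        self.title = title
        self.tags = tags


class TagContentGenerator(ContentGenerator):
    """Hypothetical generator: one resource per tag on the post."""

    generated = []  # records calls instead of writing to the static store

    @classmethod
    def get_resource_list(cls, post):
        return list(post.tags)

    @classmethod
    def get_etag(cls, post):
        # Assumption: a tag page only shows the title, so only the title
        # feeds the etag.
        return hashlib.sha1(post.title.encode('utf-8')).hexdigest()

    @classmethod
    def generate_resource(cls, post, resource):
        # A real version would rebuild the page for this tag here.
        cls.generated.append(resource)


post = FakePost(u"Hello", ["foo", "bar"])
resources = TagContentGenerator.get_resource_list(post)
etag = TagContentGenerator.get_etag(post)
for resource in resources:
    TagContentGenerator.generate_resource(post, resource)
```

Note that generate_resource() makes no assumption that the tag is still on the post: a generator can be asked to regenerate a resource the post has just been removed from.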

Now we need to make use of this interface to regenerate only the changed content. First, add a new property to our BlogPost model:

class BlogPost(db.Model):
  # ...
  deps = aetycoon.PickleProperty()

Once again we're making use of the extra property classes in aetycoon - here, we're using a PickleProperty to store the dependencies we've previously observed on this BlogPost. Now, replace the publish() method of the BlogPost with this:

  def publish(self):
    if not self.path:
      num = 0
      content = None
      while not content:
        path = utils.format_post_path(self, num)
        content = static.add(path, '', config.html_mime_type)
        num += 1
      self.path = path
    if not self.deps:
      self.deps = {}
    self.put()
    for generator_class in generators.generator_list:
      new_deps = set(generator_class.get_resource_list(self))
      new_etag = generator_class.get_etag(self)
      old_deps, old_etag = self.deps.get(generator_class.name(), (set(), None))
      if new_etag != old_etag:
        # If the etag has changed, regenerate everything
        to_regenerate = new_deps | old_deps
      else:
        # Otherwise just regenerate the changes
        to_regenerate = new_deps ^ old_deps
      for dep in to_regenerate:
        generator_class.generate_resource(self, dep)
      self.deps[generator_class.name()] = (new_deps, new_etag)
    self.put()

Starting at the top, we still have the code to find a path for posts that don't yet have one, but now instead of generating and publishing the content, we simply insert a blank page to hold the URL for us. Next, we check if self.deps is set; if it's not, we set it to an empty dictionary. You may also notice we're calling self.put() twice. This could be optimized down to a single put() call, but it would complicate the code, so for the purpose of demonstration we'll leave it as-is for now.

The next section of code is concerned with finding and regenerating changed dependencies. Iterating over each generator in a list that will be provided by our generators module, it does the following:

  1. Fetch the current list of resources and etag from the current ContentGenerator
  2. Fetch the stored list of resources and etag from self.deps
  3. If the etag has changed, we need to regenerate all resources - so we set to_regenerate to the union of the old and new resources.
  4. If the etag has not changed, we only need to regenerate added or removed resources - so we set to_regenerate to the symmetric difference of the old and new resources.
  5. For each resource that needs regenerating, we call generate_resource().
  6. Finally, we update the BlogPost's list of deps with the new set of resources and etag.
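
The choice between union and symmetric difference is easy to check in isolation. The sketch below pulls that decision out into a standalone function; resources_to_regenerate() is a name invented for this example and isn't part of Bloggart.

```python
def resources_to_regenerate(old_deps, old_etag, new_deps, new_etag):
    """Return the set of resources that need regenerating."""
    if new_etag != old_etag:
        # Content changed: everything the post was in or is now in is stale.
        return new_deps | old_deps
    # Content unchanged: only resources the post joined or left are stale.
    return new_deps ^ old_deps


old = set(["foo", "bar"])
new = set(["bar", "baz"])

changed = resources_to_regenerate(old, "etag-1", new, "etag-2")
unchanged = resources_to_regenerate(old, "etag-1", new, "etag-1")
first_publish = resources_to_regenerate(set(), None, new, "etag-1")
```

With a changed etag, all of 'foo', 'bar' and 'baz' are regenerated. With an unchanged etag only 'foo' and 'baz' are, since 'bar' contained the post both before and after. On a first publish there are no stored deps and the stored etag is None, so every new resource is generated.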

Now that we've seen how the dependency system works in theory, let's see it in action by converting the old rendering code to use the new system. Remove the render() method from the BlogPost class, and add the following to the end of generators.py:

class PostContentGenerator(ContentGenerator):
  @classmethod
  def get_resource_list(cls, post):
    return [post.path]

  @classmethod
  def get_etag(cls, post):
    return hashlib.sha1(db.model_to_protobuf(post).Encode()).hexdigest()

  @classmethod
  def generate_resource(cls, post, resource):
    assert resource == post.path
    template_vals = {
        'post': post,
    }
    rendered = utils.render_template("post.html", template_vals)
    static.set(post.path, rendered, config.html_mime_type)
generator_list.append(PostContentGenerator)

As you can see, get_resource_list() simply returns the path of the post - this is the only resource our PostContentGenerator knows about. get_etag() generates an etag for the post by running the SHA1 algorithm over the encoded contents of the post, as described in efficient model memcaching, thus ensuring that any change at all to the BlogPost entity results in regenerating the page. generate_resource() is almost identical to the render() method we just deleted; the only significant difference is that instead of returning the generated page to the caller, we instead update it in the static serving system ourselves. Finally, we add the new class to the generator_list, to ensure it gets processed.
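
To see why hashing the whole entity gives us the behaviour we want, here is a small sketch that runs outside App Engine. It swaps the protobuf encoding for a JSON dump of a plain dict, purely as a stand-in; whole_post_etag() and the field names are invented for the example.

```python
import hashlib
import json


def whole_post_etag(fields):
    """Hash every field, mimicking an etag computed over the full entity."""
    # sort_keys makes the serialization stable regardless of dict ordering.
    encoded = json.dumps(fields, sort_keys=True).encode('utf-8')
    return hashlib.sha1(encoded).hexdigest()


original = {"title": "Hello", "body": "First draft", "path": "/hello"}
edited = dict(original, body="Second draft")

etag_before = whole_post_etag(original)
etag_after = whole_post_etag(edited)
etag_same = whole_post_etag(dict(original))
```

Changing any one field produces a different etag, while an identical copy of the post hashes to the same value, so an unchanged post is never regenerated needlessly.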

If you try publishing or updating a blog post now, the system ought to behave exactly as it did before - but we've done too much work to stop when we're merely back where we started. Let's define a simple ContentGenerator to generate and update the index page of the blog, so we can finally have a homepage:

class IndexContentGenerator(ContentGenerator):
  """ContentGenerator for the homepage of the blog and archive pages."""

  @classmethod
  def get_resource_list(cls, post):
    return ["index"]

  @classmethod
  def get_etag(cls, post):
    # Encode before hashing, so non-ASCII titles or summaries don't break sha1()
    return hashlib.sha1((post.title + post.summary).encode('utf-8')).hexdigest()

  @classmethod
  def generate_resource(cls, post, resource):
    assert resource == "index"
    import models
    q = models.BlogPost.all().order('-published')
    posts = q.fetch(config.posts_per_page)
    template_vals = {
        'posts': posts,
    }
    rendered = utils.render_template("listing.html", template_vals)
    static.set('/', rendered, config.html_mime_type)
generator_list.append(IndexContentGenerator)

This one is slightly - but only slightly - more complicated than the PostContentGenerator. get_resource_list() always returns a list containing the single static string 'index', while get_etag() generates an etag that depends only on the title and the summary. Summary is a new property we've added to the BlogPost class; we haven't included it here for brevity, but you can see it in the source - it's very straightforward.

generate_resource() fetches a list of the most recent blog posts - the number of which is determined by a configuration option - and renders a template "listing.html" with them, storing the results to the root URL. Note the use of an import statement inside the method here. This is a nasty trick we need to pull because of the way Python handles imports. Because generators.py is imported by models.py, importing models at the top level of generators.py would create a circular import. Python permits this, but one of the two modules ends up seeing the other only partly initialized, which is fragile and easy to break. To work around this, we only import the models module inside the methods that need it, by which time both modules have finished loading.

Note that we don't even attempt to deal with posts that have scrolled off the bottom of the front page; that and other issues will be the subject of a future blog post. Finally, we need to define a template for our index page. Create 'listing.html' in the themes/default directory, and enter the following:

{% extends "base.html" %}
{% block title %}{{config.blog_name}}{% endblock %}
{% block body %}
  {% for post in posts %}
    <h2><a href="{{post.path}}">{{post.title}}</a></h2>
    {{post.summary|linebreaks}}
    <p class="postmeta">
      <a href="{{post.path}}" class="readmore">Read more</a> |
      <span class="date">{{post.published|date:"d F, Y"}}</span>
    </p>
  {% endfor %}
{% endblock %}

This is quite straightforward: after extending our base template, we iterate over each post in the list, outputting an h2 with the title, followed by the post's summary and a little bit of metadata about it.

Try authoring a new post or editing an existing post. Not only should the existing behaviour continue to work as it always has, but you should now see a fancy listing of recent blog posts on the index page (/) of your blog. This is starting to look like a real blogging system!

As always, you can see the blog-so-far at http://bloggart-demo.appspot.com/, and view the source of this stage here.

In the next post, we'll enhance our listing pages, and add Atom and RSS support.
