Blogging on App Engine, part 7: Migration

This is part of a series of articles on writing a blogging system on App Engine. An overview of what we're building is here.

We're finally going to tackle (at least part of) that big bugbear of blogging systems: migrating from the old system to the new one. In this post, we'll cover the necessary prerequisites, briefly cover the theory of importing from a blogging system hosted outside App Engine, then go over a practical example of migrating from Bloog (since that's what this blog is hosted on).

Regenerating posts

Before we can write migration or import scripts, we need to improve (again) our dependency regeneration code. One thing that's probably occurred to you if you've been following this series is that there's currently no easy way to regenerate all the resources when something global changes, such as the theme or the configuration. One could simply call .publish() on each blog post, but that would regenerate the common resources, such as the index and tags pages, over and over again - potentially hundreds of times. The same applies to migration: we could publish each new post as we process it, but this would again result in much redundant work.

In order to facilitate a more efficient regeneration process, we're going to do some refactoring. For a start, we'll refactor out the dependency generation part of BlogPost.publish():

    def publish(self):
        if not self.path:
            num = 0
            content = None
            while not content:
                path = utils.format_post_path(self, num)
                content = static.add(path, '', config.html_mime_type)
                num += 1
            self.path = path
            self.put()
        if not self.deps:
            self.deps = {}
        for generator_class, deps in self.get_deps():
            for dep in deps:
                if generator_class.can_defer:
                    deferred.defer(generator_class.generate_resource, None, dep)
                else:
                    generator_class.generate_resource(self, dep)
        self.put()

    def get_deps(self, regenerate=False):
        for generator_class in generators.generator_list:
            new_deps = set(generator_class.get_resource_list(self))
            new_etag = generator_class.get_etag(self)
            old_deps, old_etag = self.deps.get(generator_class.name(), (set(), None))
            if new_etag != old_etag or regenerate:
                # If the etag has changed, regenerate everything
                to_regenerate = new_deps | old_deps
            else:
                # Otherwise just regenerate the changes
                to_regenerate = new_deps ^ old_deps
            self.deps[generator_class.name()] = (new_deps, new_etag)
            yield generator_class, to_regenerate
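To make the set arithmetic in get_deps concrete, here is a small standalone sketch. The helper name deps_to_regenerate and the sample tag paths are invented for illustration; only the choice between union (|) and symmetric difference (^) mirrors the method above.

```python
def deps_to_regenerate(old_deps, old_etag, new_deps, new_etag, regenerate=False):
    """Return the set of resources that need rebuilding (illustrative helper)."""
    if new_etag != old_etag or regenerate:
        # Content changed (or a full rebuild was forced): rebuild every
        # resource the post used to appear on or appears on now.
        return new_deps | old_deps
    # Content unchanged: only resources the post joined or left need work.
    return new_deps ^ old_deps


# Hypothetical resource paths for a post whose tags were edited.
old = {'/tag/python', '/tag/appengine'}
new = {'/tag/python', '/tag/migration'}

# Same etag: only the tag pages that gained or lost the post.
print(sorted(deps_to_regenerate(old, 'abc', new, 'abc')))
# Different etag: every tag page, old and new.
print(sorted(deps_to_regenerate(old, 'abc', new, 'def')))
```

With an unchanged etag, '/tag/python' is skipped because the post is still listed there with identical content; passing regenerate=True forces the union regardless of the etag.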

Next, we'll write a class to handle regenerating all content. Add the following to handlers.py:

    class PostRegenerator(object):
        def __init__(self):
            self.seen = set()

        def regenerate(self, batch_size=50, start_key=None):
            q = models.BlogPost.all()
            if start_key:
                q.filter('__key__ >', start_key)
            posts = q.fetch(batch_size)
            for post in posts:
                for generator_class, deps in post.get_deps(True):
                    for dep in deps:
                        if (generator_class.__name__, dep) not in self.seen:
                            logging.warn((generator_class.__name__, dep))
                            self.seen.add((generator_class.__name__, dep))
                            deferred.defer(generator_class.generate_resource, None, dep)
                post.put()
            if len(posts) == batch_size:
                deferred.defer(self.regenerate, batch_size, posts[-1].key())

This class stores a set of dependencies that it's already generated in its 'seen' attribute. Run from the task queue, it fetches a batch of posts, then generates dependencies for them, and adds regeneration tasks to the task queue for any that haven't already been regenerated. After processing a batch of entries, it enqueues another invocation of itself to continue where it left off. This 'chaining' pattern is a common one when using the Task Queue - we've already seen it in use for generating listing pages, where each listing task generates one page, then creates a new task to generate the following one.
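The batch-and-chain pattern can be tried out without App Engine. The sketch below is a simulation, not the real Task Queue: a deque stands in for the queue, POSTS is made-up data, and appending to the generated list stands in for generate_resource. It assumes, as the deferred library does by pickling a task's arguments, that the 'seen' set travels with the chained task by value, so a copy is passed along.

```python
from collections import deque

queue = deque()   # stand-in for the task queue
generated = []    # records each (generator name, resource) "regeneration"

# Hypothetical data: post id -> dependencies as (generator name, resource).
POSTS = {
    1: [('Index', '/'), ('Tags', '/tag/a')],
    2: [('Index', '/'), ('Tags', '/tag/b')],
    3: [('Index', '/'), ('Tags', '/tag/a')],
}


def regenerate(seen, batch_size=2, start_key=None):
    # Fetch the next batch of posts in key order, after start_key if given.
    keys = sorted(k for k in POSTS if start_key is None or k > start_key)
    batch = keys[:batch_size]
    for key in batch:
        for dep in POSTS[key]:
            if dep not in seen:
                seen.add(dep)
                # Enqueue one regeneration task per unseen dependency.
                queue.append((generated.append, (dep,)))
    if len(batch) == batch_size:
        # A full batch may mean more posts remain: chain the next task,
        # handing it a copy of the dependencies seen so far.
        queue.append((regenerate, (set(seen), batch_size, batch[-1])))


# Kick off the first task, then drain the queue as a worker would.
queue.append((regenerate, (set(),)))
while queue:
    func, args = queue.popleft()
    func(*args)

print(generated)
```

Although the index page is a dependency of all three posts, it is enqueued only once, which is the saving the 'seen' set buys.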

A straightforward admin handler takes care of invoking PostRegenerator on demand:

    class RegenerateHandler(BaseHandler):
        def post(self):
            regen = PostRegenerator()
            deferred.defer(regen.regenerate)
            deferred.defer(post_deploy.post_deploy, post_deploy.BLOGGART_VERSION)
            self.render_to_response("regenerating.html")

Note that this handler also reruns the post_deploy task, making sure static pages also get regenerated. Once again, we've left out the template changes; you can see these in the repository.

Migrating from an external blog

Since I don't have an external blogging system to demonstrate this with, this section will be brief and theoretical, for now. In future, we may address this for specific blog systems as they arise.

The advanced bulk loading series will be of use here - in particular, loading from alternate data-sources. Here's a skeleton for a BlogPostLoader:

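    class BlogPostLoader(bulkloader.Loader):
        def __init__(self, query, converters):
            self.query = query
            # The unbound base-class initializer needs self passed explicitly.
            bulkloader.Loader.__init__(self, 'BlogPost', converters)

        def initialize(self, filename, loader_opts):
            # loader_opts is parsed as a query string of connection settings.
            self.connect_args = dict(urlparse.parse_qsl(loader_opts))

        def generate_records(self, filename):
            # Expand the parsed options into keyword arguments for connect().
            db = MySQLdb.connect(**self.connect_args)
            cursor = db.cursor()
            cursor.execute(self.query)
            # Yield rows until fetchone() returns None.
            return iter(cursor.fetchone, None)

        def finalize(self):
            regen = PostRegenerator()
            deferred.defer(regen.regenerate)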

Nothing fancy going on here, except the presence of a finalize method. The finalize method creates a PostRegenerator instance, and calls regen.regenerate() on the task queue. As of 1.2.7, remote_api supports the Task Queue API, making this a particularly easy and convenient way to start generating the static content once importing the posts finishes.
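The two-argument form of iter() used in generate_records may be unfamiliar, so here is a minimal sketch of it in isolation. It uses the standard library's sqlite3 in place of MySQLdb purely so that it runs without a database server; the table and rows are invented for the example.

```python
import sqlite3

# In-memory database with a couple of made-up posts.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE posts (title TEXT, body TEXT)')
db.executemany('INSERT INTO posts VALUES (?, ?)',
               [('First', 'Hello'), ('Second', 'World')])

cursor = db.cursor()
cursor.execute('SELECT title, body FROM posts ORDER BY title')

# iter(callable, sentinel) calls cursor.fetchone() repeatedly and stops
# as soon as it returns None, i.e. when the result set is exhausted.
rows = list(iter(cursor.fetchone, None))
print(rows)
```

Because the rows are pulled one at a time, a loader built this way never has to hold the whole result set in memory at once.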

Migrating from an App Engine blog

Migrating from a blogging system already hosted on App Engine is a different matter. For this, we'll use the task queue and a custom handler, rather than the bulk loader. Add the following to a new file, migrate.py:

    import logging

    from google.appengine.ext import db
    from google.appengine.ext import deferred

    import handlers
    import models


    class BloogBreakingMigration(object):
        class Article(db.Model):
            title = db.StringProperty()
            article_type = db.StringProperty()
            html = db.TextProperty()
            published = db.DateTimeProperty()
            updated = db.DateTimeProperty()
            tags = db.StringListProperty()

        @classmethod
        def migrate_one(cls, post_key):
            logging.debug("Migrating post with key %s", post_key)
            article = cls.Article.get(post_key)
            post = models.BlogPost(
                path=article.key().name(),
                title=article.title,
                body=article.html,
                tags=set(article.tags),
                published=article.published,
                updated=article.updated,
                deps={})
            post.put()

        @classmethod
        def migrate_all(cls, batch_size=20, start_key=None):
            q = cls.Article.all(keys_only=True)
            if start_key:
                q.filter('__key__ >', start_key)
            articles = q.fetch(batch_size)
            for key in articles:
                cls.migrate_one(key)
            if len(articles) == batch_size:
                # A full batch may mean more articles remain: chain another task.
                deferred.defer(cls.migrate_all, batch_size, articles[-1])
            else:
                logging.warn("Migration finished; starting rebuild.")
                regen = handlers.PostRegenerator()
                deferred.defer(regen.regenerate)

Note that like the Generator classes, everything here is a class method; we're merely using a class for containment. The class is called BloogBreakingMigration because it's for migrating from a blog based on the 'breaking' branch of my bloog fork; a similar 'BloogMigration' class would handle migration from regular bloog.

The method migrate_all iterates over all Bloog's 'Article' entities, using the same batch-and-chain method used above for post regeneration. Each Article is migrated by calling migrate_one, and at the end of a batch, we chain a new migration task if there are more posts left; otherwise, we start the regeneration process by creating and starting a PostRegenerator.

For now, we won't bother with a user interface for this. To start it off, fire up a remote_api_shell, and enter the following commands:

    notdot-blog> from google.appengine.ext import deferred
    notdot-blog> import datetime
    notdot-blog> import migrate
    notdot-blog> deferred.defer(migrate.BloogBreakingMigration.migrate_all, _eta=datetime.datetime.now()-datetime.timedelta(days=1))

Notice that we're setting an ETA in the past for this task - this is a nasty hack to get around the issue that the remote_api console uses the local timezone, while App Engine expects the ETA to be in UTC. Setting an ETA in the past ensures the task gets run immediately. This starts the migration process; keep an eye on the task queue for an idea of when it's done. Once finished, you can load up the blog homepage, and everything should have migrated over to the new system!
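To see why the ETA hack works, consider this toy model of the timezone mismatch. It is not the Task Queue's actual implementation: runs_immediately is a made-up helper, and the date and the ten-hour offset are arbitrary choices for the example.

```python
import datetime


def runs_immediately(eta, server_now_utc):
    """A task is due once its ETA is not later than the server's UTC clock."""
    return eta <= server_now_utc


# Hypothetical scenario: the server clock reads 12:00 UTC and the shell
# runs on a machine ten hours ahead of UTC.
server_now_utc = datetime.datetime(2009, 11, 1, 12, 0)
local_now = server_now_utc + datetime.timedelta(hours=10)

# Naive local time, interpreted as UTC, lands ten hours in the future.
print(runs_immediately(local_now, server_now_utc))

# Subtracting a day pushes the ETA into the past for any UTC offset,
# since no timezone is a full day ahead of UTC.
yesterday = local_now - datetime.timedelta(days=1)
print(runs_immediately(yesterday, server_now_utc))
```

In this model the first check fails and the second succeeds, which matches the behaviour described above: a local-time ETA can delay the task by hours, while an ETA a day in the past never does.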

You've probably noticed that we've said nothing about migrating comments to the new blog; doing so deserves a post all of its own, so we'll cover it in the next post.
