Blogging on App Engine, part 9: Sitemaps and verification

Posted by Nick Johnson | Filed under tech, app-engine, coding, bloggart

This is part of a series of articles on writing a blogging system on App Engine. An overview of what we're building is here.

Today we're going to cover basic sitemap support and verifying your site with Google.

Sitemaps

Sitemaps are a fairly recent innovation that aims to make it easier for search engines to find and index your site. The format is a very straightforward XML file. Each URL entry can carry several optional elements, such as the last-modified date and update frequency; for this first attempt we're not going to use any of them, and will just provide a basic listing of URLs. Future enhancements could provide more sitemap information, and break the sitemap into multiple files once a single file grows too large (the protocol caps each sitemap file at 50,000 URLs).
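
To make the optional elements concrete, here is a minimal standalone sketch (plain Python 3, not part of bloggart) that builds a sitemap with the standard library. The build_sitemap helper and the example URLs are assumptions made up for illustration; only the element names loc, lastmod and changefreq come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod, changefreq) tuples.

    lastmod and changefreq may be None, in which case the element is omitted.
    This helper is hypothetical; bloggart renders a template instead.
    """
    urlset = ET.Element('urlset', {'xmlns': SITEMAP_NS})
    for loc, lastmod, changefreq in entries:
        url = ET.SubElement(urlset, 'url')
        ET.SubElement(url, 'loc').text = loc
        if lastmod is not None:
            ET.SubElement(url, 'lastmod').text = lastmod
        if changefreq is not None:
            ET.SubElement(url, 'changefreq').text = changefreq
    # ElementTree entity-escapes text content (e.g. '&' becomes '&amp;').
    return ET.tostring(urlset, encoding='unicode')


sitemap_xml = build_sitemap([
    ('http://example.com/', '2009-10-23', 'daily'),
    ('http://example.com/search?q=a&page=2', None, None),
])
```

The second entry shows the bare form this post uses: a url element containing only loc.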

For the purposes of generating a complete sitemap, we have a significant advantage: Our static serving infrastructure provides us with a convenient means of getting a list of all valid URLs. Not all URLs should be indexed, however, so we should make it possible to specify what content should be indexed. Add a new property to the StaticContent model in static.py:

    indexed = db.BooleanProperty(required=True, default=True)

We'll need to enhance our set() method to take this additional argument, and to trigger a sitemap regeneration if it's set to True. Here's the new set() method; the changes are the new indexed argument and the deferred call that follows content.put():

def set(path, body, content_type, indexed=True, **kwargs):
  content = StaticContent(
      key_name=path,
      body=body,
      content_type=content_type,
      indexed=indexed,
      **kwargs)
  content.put()
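  if indexed:
    # Truncate to the minute so every change made in the same minute maps to
    # the same task name, then schedule 5 seconds into the next minute.
    now = datetime.datetime.now().replace(second=0, microsecond=0)
    eta = now + datetime.timedelta(seconds=65)
    try:
      deferred.defer(
          _regenerate_sitemap,
          _name='sitemap-%s' % (now.strftime('%Y%m%d%H%M'),),
          _eta=eta)
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
      # A regeneration is already queued (or has already run) for this minute.
      pass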
  return content

The pattern we're using for deferred should look familiar - we used it in part 6 to run our post-deploy function only once per deployment. Here we're using a slight variation on it - we're generating a task name based on the current minute, to ensure we don't regenerate the sitemap more than once a minute. We're also setting an ETA 5 seconds into the next minute - the extra 5 seconds is a 'fudge factor' to make sure it's not possible for a change to the content to sneak in at the last possible moment and not be reflected in the updated sitemap file.
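
The naming scheme is easy to check outside App Engine. The following is a small sketch in plain Python 3; sitemap_task_schedule is a hypothetical helper that only reproduces the name and ETA arithmetic described above and does not touch the task queue.

```python
import datetime


def sitemap_task_schedule(now):
    """Return (task_name, eta) for a regeneration requested at `now`.

    Hypothetical helper: it mirrors the arithmetic in set() but enqueues nothing.
    """
    minute = now.replace(second=0, microsecond=0)
    name = 'sitemap-%s' % minute.strftime('%Y%m%d%H%M')
    eta = minute + datetime.timedelta(seconds=65)
    return name, eta


# Two changes in the same minute produce the same task name and ETA...
a = sitemap_task_schedule(datetime.datetime(2009, 10, 23, 12, 30, 5))
b = sitemap_task_schedule(datetime.datetime(2009, 10, 23, 12, 30, 59))
# ...while a change in the next minute gets a fresh name.
c = sitemap_task_schedule(datetime.datetime(2009, 10, 23, 12, 31, 0))
```

Because a and b share a name, the second attempt to enqueue would be rejected by the task queue as a duplicate, which is exactly the error the except clause swallows.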

Also note that the new argument to set() is optional, meaning we don't need to update all our existing uses of the function - content will be included in the sitemap by default. You may want to modify the generator for the Atom feed and for the cse.xml definition to exclude them from the sitemap; you can see those changes in the updated repository code, but we won't include them here.

Next, we need some code to actually generate a sitemap. Add these two functions, also in static.py:

def _get_all_paths():
  keys = []
  cur = StaticContent.all(keys_only=True).filter('indexed', True).fetch(1000)
  while len(cur) == 1000:
    keys.extend(cur)
    q = StaticContent.all(keys_only=True)
    q.filter('indexed', True)
    q.filter('__key__ >', cur[-1])
    cur = q.fetch(1000)
  keys.extend(cur)
  return [x.name() for x in keys]


def _regenerate_sitemap():
  paths = _get_all_paths()
  rendered = utils.render_template('sitemap.xml', {'paths': paths})
  set('/sitemap.xml', rendered, 'application/xml', False)

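_get_all_paths uses keys-only queries to collect the keys of all StaticContent entities marked as indexed, fetching them 1000 at a time and resuming each batch after the last key of the previous one. _regenerate_sitemap uses the resulting list of paths to generate a sitemap from a template and store it. Here's the XML template we're using: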

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  {% for path in paths %}
    <url>
      <loc>http://{{config.host}}{{path}}</loc>
    </url>
  {% endfor %}
</urlset>
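
The batching loop in _get_all_paths can be tried without a datastore. This is a sketch in plain Python 3 under stated assumptions: fetch_after is a made-up stand-in for a keys-only query with a '__key__ >' filter, operating on a sorted in-memory list, and all_keys follows the same loop shape as _get_all_paths.

```python
import bisect


def fetch_after(sorted_keys, last_key, limit):
    """Stand-in for a keys-only query: up to `limit` keys greater than last_key."""
    start = 0 if last_key is None else bisect.bisect_right(sorted_keys, last_key)
    return sorted_keys[start:start + limit]


def all_keys(sorted_keys, batch_size):
    """Collect every key by fetching full batches until a short one arrives."""
    keys = []
    batch = fetch_after(sorted_keys, None, batch_size)
    while len(batch) == batch_size:
        keys.extend(batch)
        # Resume strictly after the last key seen so nothing is repeated.
        batch = fetch_after(sorted_keys, batch[-1], batch_size)
    keys.extend(batch)  # the final, possibly empty, partial batch
    return keys


store = sorted('/post/%04d' % i for i in range(2500))
collected = all_keys(store, 1000)
```

A full batch is the only signal that more results may exist, so when the total is an exact multiple of the batch size the loop issues one extra query that returns nothing.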

Finally, we'll generate a very simple robots.txt, and add it to our post-deploy generation script so search engines know where to find our sitemap:

Sitemap: http://{{config.host}}/sitemap.xml

And amend post_deploy to generate the new template:

post_deploy_tasks.append(generate_static_pages([
    ('/search', 'search.html', True),
    ('/cse.xml', 'cse.xml', False),
    ('/robots.txt', 'robots.txt', False),
]))

Edit: Putting the sitemap regeneration functions in static.py was a mistake, and can lead to intermittent errors regenerating the sitemap! They've been moved to utils.py in the repository.

Google Site Verification

Putting your sitemap in robots.txt is all very well, but if you want control over the indexing of your sitemap, it helps to add it in Webmaster Tools, and to do that, you need to verify your site. Google provides two ways to verify a site: via a meta tag in your index page, or via a dedicated HTML file. We'll use the latter, since we can make use of our post-deploy code to generate it.

Add the following to post_deploy.py:

def site_verification(previous_version):
  static.set('/' + config.google_site_verification,
             utils.render_template('site_verification.html'),
             config.html_mime_type, False)

if config.google_site_verification:
  post_deploy_tasks.append(site_verification)

Notice we're using a new config variable here, 'google_site_verification' - this should simply be set to the name of the HTML file Google prompts you to download. Since the contents of the file are formulaic, we don't need to download the file ourselves - we can generate it from the name. Here are the contents of site_verification.html:

google-site-verification: {{config.google_site_verification}}

And that's all that's required. When we upload a new version of our app, if the google_site_verification config setting is set to the correct file name, our post-deploy script will generate the file for Google to find. Now that you've set it up, you can redeploy your app and complete the verification in Webmaster Tools; then, you can add the sitemap directly in the control panel.
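
As a quick illustration of why the file can be generated from its name alone, here is a sketch in plain Python 3. The verification_file helper and the file name shown are hypothetical; the body format is the one given by the site_verification.html template above.

```python
def verification_file(filename):
    """Return the (path, body) pair for an HTML verification file.

    Hypothetical helper: bloggart does this via static.set() and a template.
    """
    path = '/' + filename
    body = 'google-site-verification: %s' % filename
    return path, body


# The file name below is made up; use the one Google gives you.
path, body = verification_file('google0123456789abcdef.html')
```

The path is what gets passed to static.set(), and the body is what the rendered template would contain.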

You can see the code for this stage here, and the latest version of bloggart at http://bloggart-demo.appspot.com.

In the next post, we'll review what we've done in this series of posts, how well it's worked out, and where to go from here.

23 October, 2009

