Consuming RSS feeds with PubSubHubbub

Frequently, it's necessary or useful to consume an Atom or RSS feed provided by another application. Doing so robustly, though, is rarely as simple as it seems: You have to worry about polling frequency, downtime, badly formed feeds, multiple formats, timeouts, determining which items are new, and other such issues, all of which distract from your original, seemingly simple goal of retrieving new updates from an Atom feed. You're not alone, either: Everyone ends up dealing with the same set of issues, and solving them in more or less the same manner. Wouldn't it be nice if there were a way to let someone else take care of all this hassle?

As you've no doubt guessed, I'm about to tell you that there is. I'm speaking, of course, of PubSubHubbub. I discussed publishing to PubSubHubbub as part of the Blogging on App Engine series, but I haven't previously discussed what's required to act as a subscriber. Today, we'll cover the basics of PubSubHubbub subscriptions, and how you can use them to outsource all the usual issues of consuming feeds.
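To give a feel for what acting as a subscriber involves, here is a minimal sketch of the two halves of a subscription: building the form-encoded request that is POSTed to the hub, and answering the hub's verification request by echoing its challenge. The parameter names (hub.mode, hub.callback, hub.topic, hub.verify, hub.verify_token, hub.challenge) come from the PubSubHubbub 0.3 draft; the function names and the example URLs are hypothetical, and no network calls are made.

```python
from urllib.parse import urlencode


def build_subscription_request(callback_url, topic_url, mode="subscribe",
                               verify="sync", verify_token=None):
    """Return the form-encoded body to POST to a hub's subscription endpoint."""
    if mode not in ("subscribe", "unsubscribe"):
        raise ValueError("mode must be 'subscribe' or 'unsubscribe'")
    params = [
        ("hub.mode", mode),
        ("hub.callback", callback_url),  # where the hub will deliver new entries
        ("hub.topic", topic_url),        # the feed we want updates for
        ("hub.verify", verify),          # 'sync' or 'async' verification
    ]
    if verify_token is not None:
        params.append(("hub.verify_token", verify_token))
    return urlencode(params)


def verification_response(params, expected_topic):
    """Answer the hub's verification GET: echo the challenge if we agree."""
    if params.get("hub.topic") == expected_topic and "hub.challenge" in params:
        return 200, params["hub.challenge"]
    return 404, ""  # we did not ask for this subscription


body = build_subscription_request(
    "http://example.com/callback",   # hypothetical subscriber endpoint
    "http://example.org/feed.atom",  # hypothetical feed URL
    verify_token="s3cret",
)
```

Once the subscription is verified, the hub POSTs new entries to the callback URL, so the subscriber never has to poll the feed itself.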

At this point, you may be wondering how this is ...

Implementing a non-relational database in Go

In a previous Damn Cool Algorithms post, I discussed log structured storage, and how it applies to databases. For a long time, I've wanted to implement a database based on log structured storage, and a few other nice mechanics from other database systems:

  • Tables are key:value mappings, with duplicate keys allowed (Bigtable, BDB)
  • Map-based views, also known as materialized views, for indexing (CouchDB)
  • Reducer support for views (CouchDB)

Since my previous posts about Go have been generally well received, and because I want to explore the language a bit more, I'll be implementing all this in Go. The approach I'd like to take is one of gradually building up abstractions. We'll tackle each of the components in its own post:

  1. An interface for writing records to an append-only file or set of files.
  2. A B-Tree implementation, built on the record interface.
  3. Map-based / materialized views, based on the B-Tree implementation.
  4. Reducers for views.
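As a rough illustration of the first component, here is a minimal sketch of an append-only record log. It is written in Python rather than Go, purely for illustration, and it is not the interface the series goes on to build: the RecordLog class, its method names, and the 4-byte length prefix are all assumptions. Each record is written at the end of the file, and the offset returned by append is the only handle needed to read it back later.

```python
import io
import struct

# Assumed on-disk format: a 4-byte big-endian length, then the record bytes.
_HEADER = struct.Struct(">I")


class RecordLog:
    """Hypothetical append-only record store over a seekable binary file."""

    def __init__(self, fileobj):
        self._f = fileobj

    def append(self, data):
        """Write a record at the end of the file and return its offset."""
        self._f.seek(0, io.SEEK_END)
        offset = self._f.tell()
        self._f.write(_HEADER.pack(len(data)))
        self._f.write(data)
        return offset

    def read(self, offset):
        """Read back the record that starts at the given offset."""
        self._f.seek(offset)
        header = self._f.read(_HEADER.size)
        if len(header) != _HEADER.size:
            raise EOFError("no record at offset %d" % offset)
        (length,) = _HEADER.unpack(header)
        data = self._f.read(length)
        if len(data) != length:
            raise EOFError("truncated record at offset %d" % offset)
        return data


log = RecordLog(io.BytesIO())   # an in-memory file stands in for a real one
first = log.append(b"hello")
second = log.append(b"world!")
```

Because existing bytes are never overwritten, a structure built on top (such as a B-Tree) can refer to its nodes simply by offset.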

Unlike previous series, this one is likely to be fairly fragmented. There's a fair chunk of functionality to be implemented here, so I won't be able to get it out on my usual three-posts-a-week schedule. In the meantime ...

Writing a Twitter service on App Engine

Services that consume or produce Twitter updates are popular apps these days, and there are more than a few on App Engine, too. Twitter provides an extensive API that exposes most of the features you might want to access.

Broadly, Twitter's API is divided into two distinct parts: The streaming API, and everything else. The streaming API is their recommended way to consume large volumes of updates in real time; unfortunately, for a couple of reasons, using it on App Engine is not practical at the moment. The rest of their API, however, is well suited to use via App Engine, and covers things such as retrieving users' timelines, mentions, and retweets; sending new status updates (and deleting or retweeting them); and getting user information.

Authentication

Most of Twitter's API calls require authentication. Currently, Twitter supports two different authentication methods: Basic and OAuth. Basic authentication, as the name suggests, uses HTTP Basic authentication, which requires prompting the user for their username and password. We won't be using this, since it's deprecated, and asking users for their credentials is a bad idea. The OAuth API makes it possible to call Twitter APIs on behalf of a user ...

Bulk updates with cursors

Last week, I blogged about cursors, a new feature in version 1.3.1 of the App Engine SDK. Today, I'm going to demonstrate a practical use for them: Bulk datastore updates.

In both the Remote API and deferred articles, I used a (perhaps poorly named) 'mapper' class as an example of ways to use these libraries. In neither case was the class intended to be anything other than a sample use case for the library, but nevertheless, people have used the examples in production. The introduction of cursors provides a prime opportunity to introduce a more robust, yet simpler, version of the bulk updater concept.

First, let's define a few requirements for our bulk updater:

  • Support for any query for which a cursor can be obtained
  • Handles failure of individual updates gracefully
  • Can fail the whole update process if enough errors are encountered
  • Handles timeout errors, service unavailability, etc, transparently
  • Can report completion to admins
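The first three requirements and the last one can be sketched independently of the datastore. The example below is a hypothetical illustration, not the class from this post: fetch_page stands in for running a query from a cursor, and BulkUpdater, ListUpdater, BATCH_SIZE and MAX_FAILURES are invented names. It does not attempt the fourth requirement (timeouts and service unavailability), which depends on how the work is scheduled.

```python
import logging


class BulkUpdater:
    """Hypothetical batch-update loop driven by an opaque cursor."""

    BATCH_SIZE = 2     # items fetched per page (tiny, for demonstration)
    MAX_FAILURES = 3   # abort the whole run once failures exceed this

    def fetch_page(self, cursor, limit):
        """Return (items, next_cursor); next_cursor is None when done."""
        raise NotImplementedError

    def update(self, item):
        """Apply the update to a single item; may raise on failure."""
        raise NotImplementedError

    def finish(self, updated, failed):
        """Report completion; a real updater might email the admins here."""
        logging.info("Bulk update done: %d updated, %d failed", updated, failed)

    def run(self):
        cursor, updated, failed = None, 0, 0
        while True:
            items, cursor = self.fetch_page(cursor, self.BATCH_SIZE)
            for item in items:
                try:
                    self.update(item)
                    updated += 1
                except Exception:
                    # One bad item should not stop the rest of the batch.
                    failed += 1
                    logging.exception("Update failed for %r", item)
                    if failed > self.MAX_FAILURES:
                        raise RuntimeError("too many failures; aborting")
            if cursor is None:
                break
        self.finish(updated, failed)
        return updated, failed


class ListUpdater(BulkUpdater):
    """Toy subclass: the 'query' is a list and the cursor is an index."""

    def __init__(self, data):
        self.data = data
        self.seen = []

    def fetch_page(self, cursor, limit):
        start = cursor or 0
        end = start + limit
        next_cursor = end if end < len(self.data) else None
        return self.data[start:end], next_cursor

    def update(self, item):
        if item < 0:
            raise ValueError("negative value")
        self.seen.append(item)


result = ListUpdater([1, 2, -3, 4, 5]).run()
```

Here the single failing item is logged and counted, and the remaining items are still processed.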

As in the Remote API and Deferred articles, we'll implement the updater as an abstract class, which individual updater implementations should subclass. Here's the basic interface:

import logging
import time
from google.appengine.api import mail
from google.appengine.ext ...

No post today

Sorry, but I'm totally mentally exhausted after a long week - including a talk at UCD on Wednesday - and I just don't have the energy to write up today's post. Look out for it on Monday, instead.

Webapps on App Engine, part 6: Lazy loading

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

A major concern for many people developing for App Engine, particularly those building low-to-medium traffic sites, is instance load time. When App Engine serves the first request to a new instance of your app, it must import the request handler module you specified, which in turn imports all the other modules required to serve the request. In large apps, this can add up to quite a lot of additional overhead for loading requests, and substantially impact the experience for end users.

There are a number of things you can do to reduce loading times, including using lighter weight frameworks instead of all inclusive ones, and breaking seldom used components up into separate handlers - an approach taken by bloggart for the admin interface. One source of inefficiency stands out as a prime candidate for optimisation, though: unnecessary imports.

Many frameworks, including the built in webapp framework, require you to provide a list of handler classes that should be instantiated to serve requests, in a 'url map'. When a request comes in, the framework simply instantiates the relevant class and ...
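One common way to avoid importing every handler module up front is to store dotted names in the url map and resolve them only when a request first needs them. The sketch below illustrates that general technique; it is not necessarily the approach this post goes on to describe. LazyHandler is a hypothetical name, and two standard library classes stand in for real handler classes so that the example is self-contained.

```python
import importlib


class LazyHandler:
    """Hypothetical url-map entry that imports its target on first use."""

    def __init__(self, dotted_path):
        # Split "package.module.ClassName" into module and attribute parts.
        self.module_name, self.attr_name = dotted_path.rsplit(".", 1)
        self._resolved = None

    def resolve(self):
        """Import the module (once) and return the named class."""
        if self._resolved is None:
            module = importlib.import_module(self.module_name)
            self._resolved = getattr(module, self.attr_name)
        return self._resolved

    def __call__(self, *args, **kwargs):
        """Instantiate the handler, just as calling the class itself would."""
        return self.resolve()(*args, **kwargs)


# Stand-in url map: the targets are stdlib classes, not real handlers.
url_map = [
    ("/encode", LazyHandler("json.JSONEncoder")),
    ("/parse", LazyHandler("html.parser.HTMLParser")),
]

encoder = url_map[0][1]()  # only the first entry's module is resolved here
```

A framework that accepts any callable in its url map can use such entries unchanged, and modules for rarely used handlers are never imported on instances that do not serve them.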

New features in 1.3.1 prerelease: Cursors

Recently, the App Engine team announced that they'd be pre-releasing SDKs for testing and feedback, before they go live in production. With the first prerelease, 1.3.1, a number of new features are included in the SDK. Today we'll discuss cursors - how they work, and what they're useful for.

Cursors are a feature that many people have been waiting for with bated breath. As well as making pagination easier, they also provide a way around the "1000 result limit" that many people feel (in some cases correctly) makes it harder to achieve what they want to do on App Engine.

When it comes to investigating new features, there are two really useful tools: An interactive console - such as that on http://localhost:8080/_ah/admin/, http://shell.appspot.com/ or the remote_api console - and the source code. Many people forget that as an Open Source project, the App Engine SDK code is all available - and easily browseable on code.google.com.

Our first stop is google/appengine/ext/db/__init__.py. Of interest here is the cursor() method, which starts on line 1600. As you can see, when called on a query that's already been ...

Webapps on App Engine, Part 5: Sessions

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

Sessions are another component that's regularly required by webapps, but isn't really a core part of a framework. In this post, we'll discuss the session mechanisms available for App Engine and how they work, and settle on a recommendation for our own lightweight framework.

The basic mechanism behind a session library is straightforward: A random session ID is generated for the user, which is embedded in an HTTP cookie and sent to the user. Meanwhile, a record is created on the server with the same ID, containing any data the webapp wants to store about this user. When the user makes a subsequent request, the session library decodes the session ID from the cookie header, and loads the corresponding session record from permanent storage.
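The mechanism just described fits in a few lines. This is a minimal sketch under stated assumptions: an in-memory dict stands in for permanent storage, the cookie name "sid" is arbitrary, and start_session and load_session are hypothetical helpers rather than the API of any session library discussed in this series.

```python
import secrets
from http.cookies import SimpleCookie

SESSION_COOKIE = "sid"  # assumed cookie name
_sessions = {}          # in-memory stand-in for permanent storage


def start_session(data):
    """Create a server-side record and return a Set-Cookie header value."""
    session_id = secrets.token_hex(16)  # random, unguessable ID
    _sessions[session_id] = dict(data)
    cookie = SimpleCookie()
    cookie[SESSION_COOKIE] = session_id
    cookie[SESSION_COOKIE]["httponly"] = True
    cookie[SESSION_COOKIE]["path"] = "/"
    return cookie[SESSION_COOKIE].OutputString()


def load_session(cookie_header):
    """Decode the session ID from a Cookie header and load its record."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(SESSION_COOKIE)
    if morsel is None:
        return None
    return _sessions.get(morsel.value)  # None for unknown or forged IDs


set_cookie = start_session({"user": "alice", "is_admin": False})
# The browser sends back only the name=value pair on later requests.
request_cookie = set_cookie.split(";", 1)[0]
session = load_session(request_cookie)
```

Only the random ID ever reaches the client; the data itself stays on the server.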

There are three major advantages of handling sessions this way, rather than naively storing session data directly in the cookie:

  • We can store data that the client shouldn't be able to modify, such as the user's access flags.
  • We can store data the client shouldn't even be able ...

Webapps on App Engine, part 4: Templating

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

In the first three posts of this series, we covered all the components of a bare bones webapp framework: routing, request/response encoding, and request handlers. Most people expect a lot more out of their web framework, however. Individual frameworks take different approaches to this, from the minimalist webapp framework, which provides the bare minimum plus some integration with other tools, to the all-inclusive Django, to the 'best of breed' Pylons, which focuses on including and integrating the best libraries for each task, rather than writing its own.

For our framework, we're going to take an approach somewhere between webapp's and Pylons': While keeping our framework minimal and modular, we'll look at the best options to use for other components - specifically, templating and session handling. In this post, we'll discuss templating.

To anyone new to webapps, templates may seem somewhat unnecessary. We can simply generate the output direct from our code, right? Many CGI scripting languages used this approach, and the results are often messy. Sometimes, which page is to be generated isn't clear ...
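To make the contrast concrete, here is a small sketch that produces the same page twice: once by concatenating strings in code, and once from a template. Python's built-in string.Template is used purely as a stand-in for a real templating engine, and the function names and markup are invented for illustration.

```python
from html import escape
from string import Template


def render_inline(title, items):
    """Build the page by concatenating strings: markup and logic are mixed."""
    out = "<html><head><title>" + escape(title) + "</title></head><body><ul>"
    for item in items:
        out += "<li>" + escape(item) + "</li>"
    out += "</ul></body></html>"
    return out


# With a template, the markup lives in one place and the code only fills gaps.
PAGE = Template(
    "<html><head><title>$title</title></head><body><ul>$items</ul></body></html>"
)
ITEM = Template("<li>$text</li>")


def render_template(title, items):
    """Build the same page by substituting values into the templates."""
    rows = "".join(ITEM.substitute(text=escape(item)) for item in items)
    return PAGE.substitute(title=escape(title), items=rows)


page = render_template("Posts", ["First", "<Second>"])
```

Both functions return identical output, but only the second lets the markup be read, and changed, without touching the code around it.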

Snow Sprint wrap-up, and introducing Tweet Engine

It's Friday evening, which means the Snow Sprint is wrapping up, and everyone's presenting their App Engine apps. There's been some pretty impressive work done in a mere 5 days...

Tweet Engine

First up is us! Myself, Jens Klein, and Sasha Vincic teamed up to write Tweet Engine, a Twitter webapp for collaborative tweeting. Many organisations - both companies and open source groups - have shared Twitter accounts. Using these shared accounts, however, can be a huge pain, especially if you have multiple accounts to manage. The goal of Tweet Engine is to make this more manageable.

Anyone can sign up by logging in with their Google account. Once signed up, you can add any number of Twitter accounts. We use the Twitter OAuth library, which allows us to obtain your permission without prompting you for your password.

Once you've added an account, you can give any number of other people permission to use it. Access is configurable, including full administrator access, just the ability to send and view tweets, or just the ability to suggest tweets for review and approval. Once a suggestion is submitted, anyone with sufficient permissions can approve or decline it. Scheduled ...