Webapps on App Engine part 1: Routing

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

The first part of a framework you encounter when using one is, more often than not, the routing code. With that in mind, it's what we'll be tackling first. There are several approaches to handling request routing, and we'll go on a quick tour of the libraries and approaches before we decide on one and implement it.

The built-in App Engine webapp framework takes a very simple approach: The incoming request's URL is compared to a list of regular expressions in order, and the first one that matches has the corresponding handler executed. As an enhancement, any captured groups in the regular expression are passed to the request handler as additional arguments. You can see the code that does all this here - it's short and easy to follow.
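To make the idea concrete, here's a minimal sketch of first-match regular expression dispatch. It is not the webapp module's actual code: the handler functions, URL_MAP and dispatch() are all names made up for illustration, and the handlers are plain functions rather than RequestHandler classes.

```python
import re


# Hypothetical handlers: each receives the captured groups as arguments.
def show_index():
    return "index"


def show_post(year, slug):
    return "post %s from %s" % (slug, year)


# Ordered list of (regex, handler) pairs, checked from top to bottom.
URL_MAP = [
    (re.compile(r"^/$"), show_index),
    (re.compile(r"^/(\d\d\d\d)/([^/]+)$"), show_post),
]


def dispatch(path):
    """Runs the first handler whose regex matches path, or returns None."""
    for regex, handler in URL_MAP:
        match = regex.match(path)
        if match:
            # Captured groups become positional arguments to the handler.
            return handler(*match.groups())
    return None
```

Because the list is checked in order, a more general pattern placed earlier will shadow a more specific one further down.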

The webapp module does one thing that I'm not a huge fan of: It ties the request routing in with handling the requests. The same function that finds the appropriate handler for a request also takes care of parsing the request and calling the appropriate methods on the RequestHandler subclass. As a result, RequestHandler classes are not WSGI applications, and you can't mix non-webapp apps in. With that caveat in mind, let's continue to look at some of the alternate approaches.

Django's approach to URL handling is documented here. It bears a remarkable resemblance to the webapp module's approach, and for good reason: A lot of App Engine's Python library was inspired by Django. One difference worth noting about the way Django handles things is that the second parameter in each tuple is the dotted string name of the function that will handle requests for that URL regex, rather than the handler object itself. This seems a little awkward, but provides a real benefit: It means we don't need to load every handler module in order to service a request to just one of them. We'll discuss optimisations like this in more detail in a later post.
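The benefit of naming handlers by string is easiest to see in code. The following is only a rough sketch of the idea, not Django's implementation: LazyHandler is a hypothetical class, and a standard library function stands in for a real handler.

```python
import importlib


class LazyHandler(object):
    """Resolves a dotted "module.attribute" name the first time it is called."""

    def __init__(self, dotted_name):
        self.dotted_name = dotted_name
        self._target = None  # Nothing is imported until the first call.

    def __call__(self, *args, **kwargs):
        if self._target is None:
            module_name, attr = self.dotted_name.rsplit(".", 1)
            module = importlib.import_module(module_name)
            self._target = getattr(module, attr)
        return self._target(*args, **kwargs)


# json.dumps stands in for a real handler function here.
handler = LazyHandler("json.dumps")
```

Creating the LazyHandler costs almost nothing; the import only happens if a request actually reaches that handler.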

In the realm of independent libraries, there are several prominent options. The Routes library is a port of the Rails routing system, and uses a rather sophisticated system based on expressions that symbolically represent the sort of URLs you want to match. For example, the string "/error/{action}/{id}" matches any URL that has 3 slash-separated components starting with "/error", and breaks out the last two into separate variables, 'action' and 'id'.

Somewhat counterintuitively, the Routes library doesn't actually route requests to individual WSGI applications or handlers; instead it simply takes care of parsing URL patterns into dictionaries mapping set keys to values found in the URL. In that respect it's very sophisticated, allowing a great deal of flexibility in how you specify your URL parsing, including default values, regular expressions, and other features. It's then the job of another (simpler) piece of WSGI middleware to take the information Routes produces and use it to dispatch to the appropriate handler.

Besides being typically easier to understand and more flexible than a naive regular-expression based system, Routes' approach has another advantage: It's possible to perform the reverse transformation, and generate a URL given a dict like the one Routes produces. As long as your apps use this to generate URLs whenever they're needed, you or your users can completely reorganise the URLs of your app without needing to make any changes to the rest of your code.
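As a rough illustration of the reverse transformation, here's a sketch that fills a template's placeholders from a dict. This is not how Routes does it; PLACEHOLDER and build_url() are hypothetical names, and the sketch skips URL quoting and validation entirely.

```python
import re

# Matches {name} or {name:regex} placeholders in a template string.
PLACEHOLDER = re.compile(r"\{(\w+)(?::[^}]+)?\}")


def build_url(template, values):
    """Fills each placeholder in template from the values dict.

    Raises KeyError if a placeholder has no corresponding value.
    """
    def substitute(match):
        return str(values[match.group(1)])
    return PLACEHOLDER.sub(substitute, template)


url = build_url("/error/{action}/{id}", {"action": "show", "id": 42})
# url is "/error/show/42"
```

If the template later changes to, say, "/errors/{id}/{action}", any code that builds its URLs this way picks up the new layout without modification.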

The Werkzeug framework provides a similar mechanism, enhancing it with 'converter' functions that specify the accepted characters and convert the matched value. Like Routes, it also supports using the mapping in the reverse direction to generate URLs.

Finally, WebOb's "do it yourself framework" demonstrates a method that is very similar to that defined by Routes, but has the substantial advantage that it's easily converted from the semantic form users enter into an ordinary regular expression. It is on this that we will base our own routing middleware.

First, let's take a look at the template_to_regex function from the WebOb documentation, which we will use without modification:

 >>> import re
 >>> var_regex = re.compile(r'''
 ...     \{          # The exact character "{"
 ...     (\w+)       # The variable name (restricted to a-z, 0-9, _)
 ...     (?::([^}]+))? # The optional :regex part
 ...     \}          # The exact character "}"
 ...     ''', re.VERBOSE)
 >>> def template_to_regex(template):
 ...     regex = ''
 ...     last_pos = 0
 ...     for match in var_regex.finditer(template):
 ...         regex += re.escape(template[last_pos:match.start()])
 ...         var_name = match.group(1)
 ...         expr = match.group(2) or '[^/]+'
 ...         expr = '(?P<%s>%s)' % (var_name, expr)
 ...         regex += expr
 ...         last_pos = match.end()
 ...     regex += re.escape(template[last_pos:])
 ...     regex = '^%s$' % regex
 ...     return regex

The workings of this function are described in detail on the WebOb site, but we'll go over the basics here. Our ultimate goal is to take strings that contain expressions of the form "{variable:regex}" and convert them into fully formed regular expressions. For example, a template such as "/{year:\d\d\d\d}/{month:\d\d}/{slug}" should be converted into the regular expression "^/(?P<year>\d\d\d\d)/(?P<month>\d\d)/(?P<slug>[^/]+)$". The parentheses in the regular expression are capturing subexpressions, meaning their contents will be available to us if the expression matches, while the "?P<name>" part gives that subexpression a label, allowing us to access it by name rather than by position.

The main part of the function is a loop over every template expression found in the input string. The output regular expression is built up in the variable named 'regex'. For each match, we first append the text between the previous match (if any) and the current one (after escaping it, so special characters aren't mistakenly interpreted as regular expression modifiers). Then, the variable name and regular expression are extracted from the template expression. If no regular expression was specified, the default expression "[^/]+", meaning one or more non-forward-slash characters, is used. The regular expression for this match is then appended to the regex-so-far, as a named sub-group as we described above. Finally, any remaining text is appended to the string, and the whole regular expression is wrapped in '^' and '$', the regular expression symbols that indicate the start and end of a string.
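To see the function in action, here's a short sketch that repeats the definitions above in ordinary module form (with var_regex condensed to one line) and tries the resulting pattern against a few paths. The example paths are made up for illustration.

```python
import re

var_regex = re.compile(r'\{(\w+)(?::([^}]+))?\}')


def template_to_regex(template):
    regex = ''
    last_pos = 0
    for match in var_regex.finditer(template):
        # Literal text before this placeholder, escaped.
        regex += re.escape(template[last_pos:match.start()])
        var_name = match.group(1)
        expr = match.group(2) or '[^/]+'
        regex += '(?P<%s>%s)' % (var_name, expr)
        last_pos = match.end()
    regex += re.escape(template[last_pos:])
    return '^%s$' % regex


pattern = re.compile(template_to_regex(r"/{year:\d\d\d\d}/{month:\d\d}/{slug}"))

good = pattern.match("/2009/07/routing")
# good.groupdict() maps year to '2009', month to '07' and slug to 'routing'.

bad_month = pattern.match("/2009/7/routing")   # None: month needs two digits
too_deep = pattern.match("/2009/07/a/b")       # None: slug can't contain "/"
```

One thing to be aware of: because the optional regex part is matched with "[^}]+", a regular expression containing a closing brace (such as "\d{4}") can't be used inside a template expression, which is why the example spells out "\d\d\d\d".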

And yes, if you think using a regular expression to parse bits of regular expression into a new regular expression is all a bit meta, you're not alone.

Now that we've sorted out what format we'll use for specifying routes, we're ready to write the routing code itself. We'll be using a system similar to Routes, whereby you can specify additional default named arguments with your route and handler, but our handlers will be regular WSGI applications, called directly, rather than using the extra layer of indirection provided by Routes. Also, although we have left the door open for being able to do the reverse transform of handlers back to URLs, we won't be doing so in our first iteration.

class WSGIRouter(object):
  def __init__(self):
    self.routes = []

  def connect(self, template, handler, **kwargs):
    """Connects URLs matching a template to a handler application.
    
    Args:
      template: A template string, consisting of literal text and template
        expressions of the form {label[:regex]}, where label is the mandatory
        name of the expression, and regex is an optional regular expression.
      handler: A WSGI application to execute when the template is matched.
      **kwargs: Additional keyword arguments to pass along with those parsed
        from the template.
    """
    route_re = re.compile(template_to_regex(template))
    self.routes.append((route_re, handler, kwargs))

As you can see, the basic code for our router is extremely simple. We define a method, connect(), which takes a template string, a handler, and optional keyword arguments. This method calls template_to_regex to generate a regular expression, compiles it, then appends it to the list of routes along with the handler and the additional arguments. The real work happens when our router object is called as a WSGI application:

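  def __call__(self, environ, start_response):
    for regex, handler, kwargs in self.routes:
      match = regex.match(environ['PATH_INFO'])
      if match:
        environ['router.args'] = dict(kwargs)
        environ['router.args'].update(match.groupdict())
        return handler(environ, start_response)
    # No route matched: send a plain 404 rather than returning None.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return ["Not found."]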

The name of the method, __call__, distinguishes this as a special method to Python. Normally it's not possible to call an object as you would a function, but if your class defines the __call__ method, this method is executed when someone calls your object. This allows objects of our WSGIRouter class to act as regular WSGI applications.
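If you haven't come across __call__ before, this tiny sketch shows the mechanism on its own. Greeter is a throwaway class invented for the example and has nothing to do with routing.

```python
class Greeter(object):
    def __init__(self, greeting):
        self.greeting = greeting

    def __call__(self, name):
        # Runs whenever an instance is called like a function.
        return "%s, %s!" % (self.greeting, name)


greet = Greeter("Hello")
message = greet("world")  # Equivalent to greet.__call__("world")
# message is "Hello, world!"
```

An object like this can carry state (here, the greeting) between calls, which is exactly what our router does with its list of routes.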

When our router is called, it iterates over each of the routes that were provided, and for each one attempts to match it against the PATH_INFO CGI variable. If it finds a match, it extends the WSGI environment by adding a variable called 'router.args'. This variable consists of any static arguments that were passed to the connect() method, in addition to the values of all the matched template parameters. The router then calls the selected WSGI app, returning its result to its own caller. It's up to whatever WSGI application is being called to extract the router.args variable from its environment if it needs it, and to act on it accordingly.

Let's define a simple webapp to test all this out:

from google.appengine.ext.webapp.util import run_wsgi_app


def hello_app(environ, start_response):
  # WSGI requires the status to be a string such as "200 OK".
  start_response("200 OK", [("Content-Type", "text/plain")])
  return ["Hello, world."]


def echo_app(environ, start_response):
  start_response("200 OK", [("Content-Type", "text/plain")])
  return [repr(environ['router.args'])]


router = WSGIRouter()
router.connect("/hello", hello_app)
router.connect("/echo/{foo}/{bar:[0-9]+}", echo_app, test="test")

def main():
  run_wsgi_app(router)

if __name__ == '__main__':
  main()

This app defines two handlers, mapped to two different URL patterns. If you go to '/hello', you should see "Hello, world.". If you go to a URL such as "/echo/bleh/123", you should see the complete router.args dict - in this example, something like "{'test': 'test', 'foo': 'bleh', 'bar': '123'}" (the order of the keys may vary).
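You can also exercise the router without deploying anything, by calling it directly the way a server would. The sketch below is self-contained: it includes a condensed copy of template_to_regex and the router (with a 404 fallback for unmatched paths, so the helper always gets an iterable back), and call_app() is a hypothetical test helper, not part of App Engine or WSGI. The echo handler here sorts its arguments so the output is predictable.

```python
import re

var_regex = re.compile(r'\{(\w+)(?::([^}]+))?\}')


def template_to_regex(template):
    regex = ''
    last_pos = 0
    for match in var_regex.finditer(template):
        regex += re.escape(template[last_pos:match.start()])
        regex += '(?P<%s>%s)' % (match.group(1), match.group(2) or '[^/]+')
        last_pos = match.end()
    regex += re.escape(template[last_pos:])
    return '^%s$' % regex


class WSGIRouter(object):
    def __init__(self):
        self.routes = []

    def connect(self, template, handler, **kwargs):
        route_re = re.compile(template_to_regex(template))
        self.routes.append((route_re, handler, kwargs))

    def __call__(self, environ, start_response):
        for regex, handler, kwargs in self.routes:
            match = regex.match(environ['PATH_INFO'])
            if match:
                environ['router.args'] = dict(kwargs)
                environ['router.args'].update(match.groupdict())
                return handler(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return ["Not found."]


def echo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Sorted so the output doesn't depend on dict ordering.
    return [repr(sorted(environ['router.args'].items()))]


def call_app(app, path):
    """Hypothetical helper: calls a WSGI app and returns (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured['status'] = status

    body = app({'PATH_INFO': path}, start_response)
    return captured.get('status'), ''.join(body)


router = WSGIRouter()
router.connect("/echo/{foo}/{bar:[0-9]+}", echo_app, test="test")

status, body = call_app(router, "/echo/bleh/123")
missing = call_app(router, "/echo/bleh/abc")  # bar must be digits
```

A real server passes a much fuller environ than the single PATH_INFO key used here, so treat this as a quick sanity check rather than a substitute for testing on the development server.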

That's it for routing! In the next post we'll handle decoding and encoding requests and responses.
