Python Gotchas

Posted by Nick Johnson | Filed under python, tech, app-engine, coding

A lot of App Engine developers are fairly new to Python as well, and so probably haven't encountered a few subtle 'gotchas' about the Python programming language. This post aims to sum up the ones you're most likely to encounter while programming for App Engine.

Mutable default arguments

First up is something every Pythonista learns sooner or later. What's wrong with this snippet?

def append(value, l=[]):
  l.append(value)
  return l

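Call it more than once without passing l, and every call appends to the same list. The default value is evaluated once, when the function is defined, rather than on each call. In effect, the function behaves as if we had written this: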

default_append = []

def append(value, l):
  l.append(value)
  return l

>>> append(1, default_append)
[1]
>>> append(2, default_append)
[1, 2]
...

The solution to this is to use an immutable placeholder value - usually None - as the default argument, and initialize the list inside our function:

def append(value, l=None):
  if l is None:
    l = []
  l.append(value)
  return l
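
Here's a minimal sketch that puts the two versions side by side. The names append_shared and append_fresh are made up for illustration. Note that the fixed version tests 'is None' rather than truthiness, so an empty list passed in by the caller is still the one that gets modified.

```python
def append_shared(value, l=[]):
    # The default list is created once, when the def statement runs.
    l.append(value)
    return l


def append_fresh(value, l=None):
    # None is a placeholder; a new list is built on each call that omits l.
    if l is None:
        l = []
    l.append(value)
    return l


first = append_shared(1)
second = append_shared(2)
print(first is second)   # True: both calls returned the same list
print(second)            # [1, 2]

print(append_fresh(1))   # [1]
print(append_fresh(2))   # [2]

mine = []
append_fresh(3, mine)    # an explicitly passed empty list is still used
print(mine)              # [3]
```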

Recursive imports

When you import a module for the first time, Python creates a new global scope, and executes the code of the module inside it, then assigns that global scope to the name you just imported the module as. If the module you're importing itself imports other modules, Python performs the same process for them, and so forth. On subsequent imports of a module, Python simply looks up the module in its internal list of loaded modules, and returns the existing scope.
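
The "internal list of loaded modules" is exposed as sys.modules, so you can see the caching for yourself. This is just a quick sketch using the standard library's json module as an arbitrary example:

```python
import sys
import json           # the first import in a process runs the module and caches it
import json as again  # later imports are looked up in sys.modules instead

print(sys.modules["json"] is json)  # True
print(again is json)                # True
```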

Where this becomes a problem is if you have a module, a, which imports another module, b, which in turn attempts to import a - in other words, recursive imports. When something imports a, it's executed, causing the import of b. b executes, but when it reaches the 'import a' statement, Python can't execute a again without creating an infinite loop, so it hands b the a it's currently importing, which isn't finished yet. Anything a hasn't defined by that point is missing, and code that tries to use it during the import fails with an exception.

This doesn't come up that often, because dependencies usually don't form cycles. But if you're developing a fairly tightly coupled system, you may well encounter this at some point. Sometimes, it's a sign you need to refactor your placement of code in modules to eliminate the loops, but that's not always possible. For example, in part 3 of the Blogging on App Engine series we encountered exactly this problem, with the generators module importing the models module, and vice-versa.

The workaround is simple, but surprising to people used to compiled languages. Suppose we have the following in a.py:

import b

def a_func():
  return 5

def uses_b():
  return b.b_func() + 2

print uses_b()

And the following in b.py:

import a

def b_func():
  return 10

def uses_a():
  return a.a_func() + 2

The solution is to modify one module - whichever is convenient - to move the 'import' statement inside the functions that need to reference the other module. For example, supposing we decide to modify b.py:

def b_func():
  return 10

def uses_a():
  import a
  return a.a_func() + 2

This works because at the time b.py is executed, the code in the 'uses_a' function is parsed, but not executed. By the time uses_a() is executed, both modules should have finished importing, so Python can resolve the 'import a' statement inside the function by simply fetching the already-imported module.
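
If you'd like to try the workaround without creating files by hand, here's a rough sketch that writes both modules to a temporary directory and imports them. The module names demo_a and demo_b are made up to avoid clashing with anything real; otherwise the code mirrors the example above.

```python
import os
import sys
import tempfile
import textwrap

# Hypothetical module sources, mirroring a.py and the modified b.py.
A_SOURCE = textwrap.dedent("""
    import demo_b

    def a_func():
        return 5

    def uses_b():
        return demo_b.b_func() + 2
""")

B_SOURCE = textwrap.dedent("""
    def b_func():
        return 10

    def uses_a():
        import demo_a  # deferred: runs only when uses_a() is called
        return demo_a.a_func() + 2
""")

module_dir = tempfile.mkdtemp()
for name, source in (("demo_a", A_SOURCE), ("demo_b", B_SOURCE)):
    with open(os.path.join(module_dir, name + ".py"), "w") as f:
        f.write(source)

sys.path.insert(0, module_dir)
import demo_a
import demo_b

print(demo_a.uses_b())  # 12
print(demo_b.uses_a())  # 7
```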

Edit: In my original version of this post, I had an example that looked like a recursive import, but wasn't. Python is smarter than me, and only throws an exception if you try to call a function in a recursively imported module where the function hasn't been parsed yet. I updated the example above to demonstrate this.

Iterating over query objects

Now to something App Engine specific. A common mistake - so common it's even shown up in the docs once or twice by accident - is to do something like the following:

entities = Entity.all().filter('foo =', bar)
for entity in entities:
  entity.number += 1
db.put(entities)

The overall pattern here is a good one: Updating a set of entities, then using a batch put to store them back to the datastore in a single operation. It's much more efficient than saving them all individually. The code above, however, does absolutely nothing: No records will be updated. To see why, we need to take a closer look at what happens.

The object returned by the expression "Entity.all().filter('foo =', bar)" is a Query object. Query objects expose several methods to refine the query, but they also act as iterables, meaning you can fetch an iterator from them to iterate over the results of the query. The db.put() function also accepts an iterable, and fetches all its elements, storing the results to the datastore.

What happens here, then, is that our 'for' loop gets an iterator object from entities and executes its body for each entity returned. Then, db.put also fetches an iterator from entities - a new iterator, which executes the query a second time, returning the original, unmodified entities, which db.put happily stores back to the datastore. Not only does this code do nothing, but it does it inefficiently!

The solution is a very small change to our code:

entities = Entity.all().filter('foo =', bar).fetch(1000)
for entity in entities:
  entity.number += 1
db.put(entities)

All we've done here is to switch from using the Query object's iterator protocol to fetching results explicitly. The .fetch() method returns a list of Entity objects. The for loop iterates over them, updating them, and then db.put() takes the same list, containing the entities we already modified, and stores them in the datastore. Since we're operating on the same list of entities in each step, everything works as expected.
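
You can reproduce the effect without the datastore at all. The sketch below uses a made-up FakeQuery class as a stand-in for a Query object. It is not the App Engine API, just an iterable that hands out fresh copies of its rows each time it's iterated, which is the behaviour that matters here.

```python
import copy


class FakeQuery(object):
    """Hypothetical stand-in for a query: each iteration 're-runs' it."""

    def __init__(self, stored):
        self._stored = stored

    def __iter__(self):
        # Every new iterator yields fresh copies of the stored rows.
        return iter([copy.copy(row) for row in self._stored])


stored = [{"number": 1}, {"number": 2}]

# Iterating the query twice: the second pass sees unmodified copies.
query = FakeQuery(stored)
for row in query:
    row["number"] += 1
second_pass = [row["number"] for row in query]
print(second_pass)  # [1, 2]

# Fetching the results into a list once: the same objects are reused.
rows = list(FakeQuery(stored))
for row in rows:
    row["number"] += 1
updated = [row["number"] for row in rows]
print(updated)  # [2, 3]
```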

Reserved module names

This is an issue that's come up a lot more since we added incoming email support to App Engine. Certain module names are used by the Python standard library, and attempting to use them yourself will lead to problems. A prominent example is 'email'. If you name your own module 'email.py', one of two things will happen, depending on the order of directories in the search path: Either every use of the 'email' standard library module will instead load your module, or vice-versa. Since people writing incoming email support on App Engine typically need to use the real 'email' module, neither option is a good one. Take care not to reuse module names from the Python standard library - and if in doubt, check the module list for confirmation. One easy way to avoid this gotcha is to put all your own code inside a package - then, you only need to check one name for conflicts.
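
If you suspect a name clash, one quick check is to look at where the module you imported was actually loaded from. This is only a diagnostic sketch: if the path printed points into your application directory rather than the Python standard library, your own module is shadowing the real one.

```python
import os
import email

# __file__ records the location the module was loaded from.
loaded_from = os.path.abspath(email.__file__)
print(loaded_from)
```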

Global variables and aliasing

For most code, Python has two scopes that matter - global, and local. Global scope is the scope of the module your code is in, while local scope is the scope of the current function or method. Python separates the two fairly strictly: Code in a local scope can read global variables, but can't, by default, assign to them. Attempts to assign to a global variable will lead to aliasing - creation of a local variable by the same name, which hides the global one. For example:

>>> a_global = 123
>>> 
>>> def test():
...   a_global = 456
... 
>>> print a_global
123
>>> test()
>>> print a_global
123

Python provides a way to explicitly state that you want to modify a global inside a local scope: The global keyword. It works like this:

>>> a_global = 123
>>> 
>>> def test():
...   global a_global
...   a_global = 456
... 
>>> print a_global
123
>>> test()
>>> print a_global
456

Note, however, that this is generally discouraged: Modifying global variables from within a function is seen as bad practice, and unpythonic. Also, remember that this restriction only prevents modifying the variables themselves, not their contents. For example, modifying a mutable list in the global scope is no problem without a 'global' keyword:

>>> a_list = []
>>> def append_list(x):
...   a_list.append(x)
... 
>>> print a_list
[]
>>> append_list(123)
>>> print a_list
[123]

First import of handler modules

To wrap up, we'll cover one final App Engine specific gotcha. The execution model of App Engine Python apps is that the first request to a given request handler module is handled by simply importing the module, cgi-style. Subsequent requests are handled by checking if the module defined a 'main' function. If it did, the main function is executed, instead of re-importing the entire module.

If you want to take advantage of this performance optimisation for requests after the first one, you need to do two things: Define a main() function, and make sure that you call that main function yourself on first import. The second part of this is handled with this bit of boilerplate, which you are probably used to seeing at the bottom of modules:

if __name__ == "__main__":
  main()

If you omit this bit of code, however, the first import of your module simply parses everything, then does nothing, returning a blank page to your user. Subsequent requests execute main and generate the page as normal, leading to a frustrating debugging experience. So always remember the two-line stanza from above!
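
Putting the two parts together, a bare-bones handler module might look something like the sketch below. The render_page helper and the response it builds are assumptions for illustration. The parts that matter are the main() function and the stanza that calls it.

```python
import sys


def render_page():
    # Hypothetical helper: a minimal CGI response (headers, blank line, body).
    return "Content-Type: text/plain\n\nHello, world!\n"


def main():
    # Run by the stanza below on first import, and called directly afterwards.
    sys.stdout.write(render_page())


if __name__ == "__main__":
    main()
```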

Got your own tips, tricks, or gotchas? Leave them in the comments!

13 November, 2009
