Handling downtime: The capabilities API and testing

After the unfortunate outage the other day, how to handle downtime with your App Engine app is a bit of a hot topic. So what better time to address proper error handling for situations where App Engine isn't performing at 100%?

There are three major topics to cover here: handling timeouts from API calls, using the Capabilities API, and testing your app's support for handling failures. We'll go over them in order.

Handling timeouts

At the 'stub' level, timeouts and other exceptions are communicated by the stub throwing a google.appengine.runtime.apiproxy_errors.ApplicationError. ApplicationError instances have an 'application_error' field, which contains an ID, drawn from google.appengine.runtime.apiproxy_errors, which indicates the cause of the error. DEADLINE_EXCEEDED, for example, is 4. Other errors of interest are OVER_QUOTA, which will occur if your app runs out of quota for a given API call or capability, and CAPABILITY_DISABLED, which is thrown if the API capability has been explicitly disabled (more on this later).

Each of the various APIs catches ApplicationErrors thrown by its stub and wraps them in a higher-level exception. The datastore, for example, has a function, _ToDatastoreError, that maps different error codes to exceptions from datastore_errors, which results in an ApplicationError(4) being transformed into a datastore_errors.Timeout exception. The urlfetch API similarly maps exceptions, with a timeout (along with some other errors) being represented as a DownloadError.
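To make the wrapping pattern concrete, here's a minimal sketch that runs without the SDK. Everything in it is a stand-in: ApplicationError, Error and Timeout only mimic the names mentioned above, and to_high_level_error and call_api are hypothetical helpers, not the SDK's own _ToDatastoreError.

```python
# Stand-ins for illustration only; the real classes live in the App Engine SDK.
DEADLINE_EXCEEDED = 4  # the value quoted in this post


class ApplicationError(Exception):
    """Mimics the low-level error: carries a numeric application_error code."""

    def __init__(self, application_error, error_detail=''):
        super(ApplicationError, self).__init__(application_error, error_detail)
        self.application_error = application_error
        self.error_detail = error_detail


class Error(Exception):
    """Hypothetical base class for the higher-level API's exceptions."""


class Timeout(Error):
    """Hypothetical higher-level exception for a timed-out call."""


# Maps low-level codes to high-level exception classes.
_ERROR_MAP = {DEADLINE_EXCEEDED: Timeout}


def to_high_level_error(err):
    """Converts an ApplicationError into the matching high-level exception."""
    exception_class = _ERROR_MAP.get(err.application_error, Error)
    return exception_class(err.error_detail)


def call_api(stub_call):
    """Runs a low-level call, re-raising any ApplicationError in wrapped form."""
    try:
        return stub_call()
    except ApplicationError as err:
        raise to_high_level_error(err)
```

Callers of call_api only ever see the high-level exceptions, which is why application code catches things like db.Timeout rather than inspecting raw error codes.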

The best way to handle timeouts varies from API to API. The datastore API now automatically retries timed out operations. If it cannot execute the operation even after multiple timeouts, it will raise a db.Timeout exception (or a db.TransactionFailedError if the exception occurred inside a transaction). For an in-depth description of how to handle datastore timeouts and why they happen, I recommend this excellent and well-written article.

Memcache, in contrast, generally won't return timeout errors on get operations, but will rather fail to return a value. Set operations return error codes rather than throwing exceptions, in conformance with the memcached API it imitates. See the memcache docs for details.
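Because a failed get looks just like a cache miss, the usual read-through pattern copes with a degraded cache without any special handling. Here's a rough sketch, assuming a cache object with memcache-style get and set methods; get_or_compute and UnavailableCache are hypothetical names introduced for illustration.

```python
import logging


def get_or_compute(cache, key, compute):
    """Returns the cached value for key, computing and storing it on a miss.

    `cache` is any object with memcache-style get(key) and set(key, value)
    methods, where get returns None on a miss or failure and set returns
    a false value when the item could not be stored.
    """
    value = cache.get(key)
    if value is not None:
        return value
    value = compute()
    if not cache.set(key, value):
        # Caching is best-effort: a failed set should not fail the request.
        logging.warning('Could not cache %r; serving uncached value', key)
    return value


class UnavailableCache(object):
    """Hypothetical stand-in that behaves like a cache during an outage."""

    def get(self, key):
        return None

    def set(self, key, value):
        return False


# Even with the cache entirely unavailable, the caller still gets a value.
greeting = get_or_compute(UnavailableCache(), 'greeting', lambda: 'hello')
```

The trade-off is that during an outage every request pays the cost of compute(), so this only helps if the underlying source can take the extra load.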

Wherever possible, you should handle exceptions on a call-by-call basis, and deal with them appropriately. Sometimes, however, an exception from a given API call simply means you're unable to service the user's request, and have to show them an error page and ask them to try again later. In such situations, it helps to have a catch-all exception handler, which gets invoked for any exceptions that make it to the top level of your app. The webapp framework provides just such a facility in the form of the handle_exception method, which gets called with the exception if your handler methods (get, post, etc) throw one. By default, this method calls self.error(500), logs the exception, and then prints the stacktrace to the output if debugging is enabled. Overriding this to present a nicer message to your users is probably a good idea - even better, override the error() method to display appropriate error pages for all the status codes your app can return!
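If you aren't using webapp, the same catch-all idea can be sketched as plain WSGI middleware. This is only an illustration under that assumption: catch_all_middleware is a hypothetical name, and the message text is a placeholder.

```python
import logging
import sys


def catch_all_middleware(application):
    """Wraps a WSGI app so uncaught exceptions produce a friendly 500 page."""

    def wsgi_app(environ, start_response):
        try:
            return application(environ, start_response)
        except Exception:
            # Log the full traceback for ourselves; show the user something kinder.
            logging.exception('Unhandled exception while serving request')
            body = b'Sorry, something went wrong. Please try again later.'
            headers = [
                ('Content-Type', 'text/plain'),
                ('Content-Length', str(len(body))),
            ]
            # Passing exc_info lets start_response replace headers that the
            # wrapped app may already have set before it failed.
            start_response('500 Internal Server Error', headers, sys.exc_info())
            return [body]

    return wsgi_app
```

Note that this only catches exceptions raised while the wrapped app is being called, not ones raised later while its response is being iterated.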

The Capabilities API

While it's important to have proper exception handling for API calls, that's not all you can do. With the Capabilities API, you can proactively query App Engine to check if a given API, capability, or specific method is available. Documentation, unfortunately, is rather light at the moment, so consult the source for details.

In general, calls to the Capabilities API take a service name - such as 'memcache' or 'datastore_v3' - and optionally either a 'capability', such as 'write', or a specific method, such as 'put'. The API then returns whether or not that entire API, capability, or individual method is available. For example:

from google.appengine.api import capabilities

images_enabled = capabilities.CapabilitySet('images').is_enabled()
datastore_write_enabled = capabilities.CapabilitySet('datastore_v3', capabilities=['write']).is_enabled()
memcache_get_enabled = capabilities.CapabilitySet('memcache', methods=['get']).is_enabled()

We can make use of this to, for example, create some WSGI middleware that automatically returns a friendly error message any time the datastore is entirely disabled (presuming our app is dependent on the datastore), and sets a flag in the WSGI environment if it's read-only:

def capability_middleware(application):
  def wsgi_app(environ, start_response):
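    if not capabilities.CapabilitySet('datastore_v3').is_enabled():
      # print_error_message is assumed to be a WSGI-style function that
      # renders the error page and returns the response body
      return print_error_message(environ, start_response)
    else:
      # Read-only means the datastore is up but writes are disabled
      environ['capabilities.read_only'] = not capabilities.CapabilitySet('datastore_v3', capabilities=['write']).is_enabled()
      return application(environ, start_response)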

  return wsgi_app

Obviously, far more sophisticated handling of disabled capabilities is possible - for example, you can use the 'read_only' flag the above middleware sets to disable any features of your site that require writing to the datastore, politely informing users that it's not available, rather than resorting to an error page.
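As a rough sketch of what consuming that flag might look like, here's a toy WSGI app. It assumes 'capabilities.read_only' is True when datastore writes are unavailable; guestbook_app and its messages are hypothetical and stand in for your real handlers.

```python
def guestbook_app(environ, start_response):
    """Hypothetical WSGI app that turns off its write feature in read-only mode."""
    read_only = environ.get('capabilities.read_only', False)
    if read_only and environ.get('REQUEST_METHOD') == 'POST':
        # Refuse writes politely instead of letting the datastore call fail.
        status = '503 Service Unavailable'
        body = b'Posting is temporarily disabled for maintenance.'
    elif read_only:
        # Reads still work, so serve the page without the posting form.
        status = '200 OK'
        body = b'Entries (posting is temporarily disabled)'
    else:
        status = '200 OK'
        body = b'Entries, plus a form for posting a new one'
    start_response(status, [('Content-Type', 'text/plain')])
    return [body]
```

Keeping the check in one place per feature means the rest of the handler never has to reason about capabilities at all.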

Testing timeouts and capabilities

Once you've implemented proper error handling, and you're using the capabilities API, the question inevitably arises: How do I test this? We can do this using hooks - specifically, using a pre-call hook that throws the exception we want to test. Here's a class that makes it simple to test for error returns from APIs:

from google.appengine.runtime import apiproxy
from google.appengine.runtime import apiproxy_errors

class APIErrorHook(object):
  def __init__(self):
    self.error_map = {}  # Maps (api, method) tuples to error statuses

  def set_error_code(self, service, method, code):
    self.error_map[(service, method)] = code

  def get_error_code(self, service, method):
    """Returns the error code to return for a service and method, or None."""
    return self.error_map.get((service, method), None)

  def _error_hook(self, service, method, request, response):
    error_code = self.get_error_code(service, method)
    if error_code:
      raise apiproxy_errors.ApplicationError(error_code)

  def install(self, apiproxy, unique_name):
    apiproxy.GetPreCallHooks().Append(unique_name, self._error_hook)

This should all be fairly straightforward: Once the hook is installed, calling set_error_code with a service and method name will cause all future invocations to raise an ApplicationError with that code. Calling set_error_code with None as the code will make the API call operate normally again. Here's an example of it in use:

from google.appengine.api import apiproxy_stub_map
from google.appengine.ext import db
from google.appengine.runtime import apiproxy

error_hook = APIErrorHook()
error_hook.install(apiproxy_stub_map.apiproxy, 'error_hook')

db.get(a_key)  # Works

error_hook.set_error_code('datastore_v3', 'Get', apiproxy.DEADLINE_EXCEEDED)
db.get(a_key)  # Throws a db.Timeout error

error_hook.set_error_code('datastore_v3', 'Get', None)
db.get(a_key)  # Works again

That's all there is to it. You can easily use this in your unit tests or for manual testing - just remember to install the API hook when you need it.
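In unit tests it's easy to forget to reset an error code, which then leaks into later tests. One way around that is a small context manager; this is just a sketch, and injected_error is a hypothetical helper that relies only on the hook having a set_error_code(service, method, code) method like the class above.

```python
import contextlib


@contextlib.contextmanager
def injected_error(hook, service, method, code):
    """Makes calls to service.method fail with code inside the with block."""
    hook.set_error_code(service, method, code)
    try:
        yield hook
    finally:
        # Always restore normal behaviour, even if the test body raises.
        hook.set_error_code(service, method, None)


# Intended usage inside a test, assuming an installed APIErrorHook named
# error_hook (not runnable without the SDK):
#
#   with injected_error(error_hook, 'datastore_v3', 'Get',
#                       apiproxy.DEADLINE_EXCEEDED):
#     self.assertRaises(db.Timeout, db.get, a_key)
```

Note that on exit this resets the code to None rather than to whatever was set before, so it isn't suitable for nesting on the same service and method.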

For testing the Capabilities API, we need a little more sophistication. The default implementation of the capability service always returns 'ENABLED' for every call. Normally, modifying this behaviour would require writing our own capability stub and reaching into the SDK to replace the default one with our implementation. Fortunately, however, the capabilities API is simple enough that we can instead register a post-call hook that changes the return value to whatever we want it to be. Here's an extension of the above class that adds the ability to enable and disable APIs:

class CapabilityHook(APIErrorHook):
  # Maps (service, method) tuples to the capability it depends on
  _CAPABILITY_MAP = {
    ('datastore_v3', 'Put'): 'write',
    ('datastore_v3', 'Delete'): 'write',
  }

  def __init__(self):
    self.disabled_capabilities = set()  # Set of (service, capability) tuples that are disabled
    super(CapabilityHook, self).__init__()

  def set_capability_disabled(self, service, capability, disabled):
    if disabled:
      self.disabled_capabilities.add((service, capability))
    else:
      self.disabled_capabilities.discard((service, capability))

  def get_error_code(self, service, method):
    if (service, '*') in self.disabled_capabilities:
      return apiproxy.CAPABILITY_DISABLED
    required_capability = CapabilityHook._CAPABILITY_MAP.get((service, method), None)
    if required_capability and (service, required_capability) in self.disabled_capabilities:
      return apiproxy.CAPABILITY_DISABLED
    return super(CapabilityHook, self).get_error_code(service, method)

  def _capability_hook(self, service, method, request, response):
    # 'service' here is always 'capability_service'; the service actually
    # being queried is the request's package.
    package = request.package()

    # Accumulate a mapping of (package, capability) tuples to enabled-ness.
    # This is named 'statuses' so it doesn't shadow the capabilities module.
    statuses = {}
    for capability in request.capability_list():
      if (package, capability) in self.disabled_capabilities:
        statuses[(package, capability)] = False
      else:
        statuses[(package, capability)] = True
    for call in request.call_list():
      required_capability = CapabilityHook._CAPABILITY_MAP.get((package, call), '*')
      if (package, required_capability) in self.disabled_capabilities or (package, '*') in self.disabled_capabilities:
        statuses[(package, required_capability)] = False
      else:
        statuses.setdefault((package, required_capability), True)

    # Add them to the response
    response.clear_config()
    for (config_package, capability), enabled in statuses.items():
      config = response.add_config()
      config.set_package(config_package)
      config.set_capability(capability)
      config.set_status(capabilities.IsEnabledResponse.ENABLED if enabled else capabilities.IsEnabledResponse.DISABLED)

    # Calculate the summary response
    response.set_summary_status(capabilities.IsEnabledResponse.ENABLED if False not in statuses.values() else capabilities.IsEnabledResponse.DISABLED)

  def install(self, apiproxy, unique_name):
    apiproxy.GetPostCallHooks().Append(unique_name, self._capability_hook, 'capability_service')
    super(CapabilityHook, self).install(apiproxy, unique_name)

This is substantially more complicated than the previous handler, because it has to implement the intricacies of the capabilities API. The basic operation is fairly straightforward, however. The set_capability_disabled method allows you to disable or enable specific capabilities. To disable or enable an entire service, use set_capability_disabled with a capability name of '*'. CapabilityHook then extends the get_error_code method to add additional checks: It returns CAPABILITY_DISABLED if the entire service is disabled, or if the method being requested is mentioned in its _CAPABILITY_MAP with a capability name, and that capability is disabled.

The class also implements a post-call hook for the capability service; this modifies responses by checking its own internal list of capabilities and assembling a response appropriately. The summary status is set to ENABLED iff all the individual queried capabilities and methods are themselves enabled.
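Stripped of the protocol buffer plumbing, the core of that post-call hook is a small piece of set logic. Here's a simplified model of it for the capability half of a query; summarize is a hypothetical function for illustration, and it leaves out the per-method lookups that the real hook also performs.

```python
def summarize(package, queried_capabilities, disabled):
    """Returns (per-capability statuses, summary) for one capability query.

    `disabled` is a set of (package, capability) tuples, in the same shape
    as CapabilityHook.disabled_capabilities.
    """
    statuses = dict(
        ((package, capability), (package, capability) not in disabled)
        for capability in queried_capabilities)
    # The summary is enabled only if every queried capability is enabled.
    return statuses, all(statuses.values())


disabled = set([('datastore_v3', 'write')])
write_statuses, write_summary = summarize('datastore_v3', ['*', 'write'], disabled)
read_statuses, read_summary = summarize('datastore_v3', ['*'], disabled)
```

Here write_summary comes out False because 'write' is disabled, while read_summary stays True because only the service-wide '*' entry was queried and it hasn't been disabled.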

Here's a simple example of it in use:

capability_hook = CapabilityHook()
capability_hook.install(apiproxy_stub_map.apiproxy, 'error_hook')

db.put(some_entity)  # Succeeds
db.get(some_key)  # Succeeds
capabilities.CapabilitySet('datastore_v3', capabilities=['write']).is_enabled()  # Returns True

capability_hook.set_capability_disabled('datastore_v3', 'write', True)
db.put(some_entity)  # Fails, raising datastore_errors.Error
db.get(some_key)  # Still succeeds
capabilities.CapabilitySet('datastore_v3', capabilities=['write']).is_enabled()  # Returns False

There you go: How to handle API call errors, how to detect disabled capabilities, and how to test both of them. Now you're prepared for the next scheduled maintenance, or the unlikely event of another unplanned outage!
