Behind the scenes with remote_api

Posted by Nick Johnson | Filed under python, remote_api, app-engine, internals

I've discussed remote_api in passing many times before on this blog, but never gone into detail about how it works, and the options you have for customizing it. Today, we'll remedy that, by taking a close look at its operation.

You may be wondering why anyone would want to customize remote_api - it seems like a fairly straightforward service, right? There are two main reasons you might want to do some degree of customization:

You're providing a software-as-a-service solution, and need to provide remote_api access to your customers, but want to limit what they can do.
You want to expose an API of your own via remote_api.

The first of these use-cases is particularly apt in the face of this nasty hack, which makes it possible to download a Python app's source if both the remote_api and deferred handlers are installed (and the user is an admin). You may want to use both of these libraries, but still keep your source to yourself. The second use-case is more complicated, and we'll only touch on it in passing.

How remote_api works

remote_api has two components: the client (otherwise known as the 'stub') and the server (otherwise known as the 'handler'). In order to understand how remote_api works, it helps to first have a basic understanding of how API calls on App Engine work in general. When you make an API call - such as fetching datastore records, making a urlfetch request, or enqueueing a task queue task - the following things happen:

The library you called constructs a protocol buffer that encapsulates the request you made.
The library calls apiproxy_stub_map.MakeSyncCall, passing it the name of the service (such as 'datastore_v3' or 'urlfetch'), the name of the API method (such as 'Put' or 'Fetch'), and the request protocol buffer.
The apiproxy_stub_map module uses the service name to look up its mapping from service names to stubs, returning a stub that can handle the RPC call being made.
Any pre-call hooks are run.
The stub is called with the parameters that were originally passed in. The stub may do anything it likes with the request.
Any post-call hooks are run.
The stub returns a response Protocol Buffer to the library code, which interprets it and returns a result to your code.
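
To make the steps above concrete, here's a toy model of the dispatch in plain Python. It is only a sketch, not the SDK's apiproxy_stub_map: the StubMap and EchoStub classes are made up for illustration, and plain dictionaries stand in for the request and response Protocol Buffers.

```python
# Toy model of the dispatch steps above; not the SDK's apiproxy_stub_map.
class EchoStub(object):
    """Hypothetical stub: fills in the response from the request."""

    def MakeSyncCall(self, service, method, request, response):
        response["echo"] = "%s.%s(%s)" % (service, method, request["arg"])


class StubMap(object):
    """Maps service names to stubs and runs hooks around each call."""

    def __init__(self):
        self._stubs = {}
        self.pre_hooks = []
        self.post_hooks = []

    def register_stub(self, service, stub):
        self._stubs[service] = stub

    def make_sync_call(self, service, method, request, response):
        stub = self._stubs[service]      # look up the stub by service name
        for hook in self.pre_hooks:      # run any pre-call hooks
            hook(service, method, request, response)
        stub.MakeSyncCall(service, method, request, response)
        for hook in self.post_hooks:     # run any post-call hooks
            hook(service, method, request, response)
        return response


calls = []
stub_map = StubMap()
stub_map.register_stub("urlfetch", EchoStub())
stub_map.pre_hooks.append(lambda s, m, req, resp: calls.append(("pre", s, m)))
stub_map.post_hooks.append(lambda s, m, req, resp: calls.append(("post", s, m)))
result = stub_map.make_sync_call(
    "urlfetch", "Fetch", {"arg": "http://example.com/"}, {})
```

The calling code only ever sees make_sync_call, so swapping EchoStub for a different implementation requires no changes on the caller's side.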

The reason for this abstraction is to allow for different implementations of services on different runtimes. On the dev_appserver, each service has a stub that provides a local, test implementation of the service in question. For example, here's the code for the urlfetch stub. In production, stubs are provided that send the RPC call off to the real service, and return its response. Either way, the calling code doesn't have to care about how it's implemented, just that it works as expected.

This layer of indirection provides a means for our remote_api service to work. For the client end, we implement a 'universal stub' that takes any API call for which it is registered, serializes it - which is really easy, since Protocol Buffers are designed for serialization - and sends it over HTTP to a server. On the server end, we write a handler that accepts serialized protocol buffers, deserializes them, and sends them to the 'real' stub responsible for executing them. Then, it serializes the response, and sends it back in the HTTP response body.
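
Here's a rough sketch of that round trip, under some loud assumptions: JSON stands in for Protocol Buffer serialization, a direct function call stands in for the HTTP request, and every name in it (UniversalStub, server_handle, real_stub) is hypothetical rather than taken from the SDK.

```python
import json

# Toy round trip: JSON stands in for Protocol Buffers, and a direct function
# call stands in for the HTTP POST. All names here are hypothetical.


def real_stub(service, method, request):
    """Stand-in for the 'real' stub on the server."""
    return {"handled": "%s.%s" % (service, method), "args": request}


def server_handle(body):
    """Server end: deserialize, hand to the real stub, serialize the reply."""
    envelope = json.loads(body.decode("utf-8"))
    response = real_stub(
        envelope["service"], envelope["method"], envelope["request"])
    return json.dumps(response).encode("utf-8")


class UniversalStub(object):
    """Client end: accepts any call and ships it to the server."""

    def __init__(self, transport):
        self._transport = transport  # callable taking bytes, returning bytes

    def MakeSyncCall(self, service, method, request):
        body = json.dumps(
            {"service": service, "method": method, "request": request})
        reply = self._transport(body.encode("utf-8"))
        return json.loads(reply.decode("utf-8"))


stub = UniversalStub(server_handle)
reply = stub.MakeSyncCall("datastore_v3", "Get", {"key": "abc"})
```

Note that the client stub never inspects the service or method name; it wraps whatever it is given, which is what makes it 'universal'.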

This sounds really simple, and it is, except for a couple of small complications. The first complication is that the server end receives a blob of data which it knows is a Protocol Buffer, but it doesn't know which one! So, we need some way to map between RPC calls and the request and response Protocol Buffers involved. With that in mind, let's start by taking a look at the server end of remote_api, which can be found here.

Ignore the RemoteDatastoreStub for now - we'll come to that shortly - and skip down to line 184. Here, we define a dictionary called "SERVICE_PB_MAP". This is the promised mapping between RPC calls and Protocol Buffer classes. It consists of a dictionary mapping service names to dictionaries, which themselves map method names to (request class, response class) tuples. Thus, retrieving the request and response Protocol Buffer classes for an API looks something like the following:

request_class, response_class = SERVICE_PB_MAP['datastore_v3']['Get']

Next, take a look at line 299, the post() method of the ApiCallHandler class. This is where the rubber meets the road. First, it decodes the body of the HTTP request as a remote_api_pb.Request protocol buffer, which forms a wrapper for the RPC call. This PB simply contains the service and method name, and the encoded body of the RPC call. Then, it calls the ExecuteRequest method. This method obtains the request and response Protocol Buffer classes, as we described above, and uses the request class to decode the body of the RPC. It then makes the RPC call to the 'real' stub, and returns the response body to post(). The post() method then simply packages it up into a remote_api_pb.Response PB and sends it back as the response body.
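
The flow just described can be sketched in a few lines. This is a simplified illustration and not the handler's actual code: GetRequest and GetResponse are hypothetical stand-ins for generated Protocol Buffer classes (the toy ParseFromString just decodes a string), and fake_make_sync_call stands in for the call to the 'real' stub.

```python
# Hypothetical message classes; the real map holds generated PB classes.
class GetRequest(object):
    def __init__(self):
        self.key = None

    def ParseFromString(self, data):
        # Toy decoding: the whole body is the key.
        self.key = data.decode("utf-8")


class GetResponse(object):
    def __init__(self):
        self.value = None


TOY_SERVICE_PB_MAP = {
    "datastore_v3": {
        "Get": (GetRequest, GetResponse),
    },
}

FAKE_DATASTORE = {"greeting": "hello"}


def fake_make_sync_call(service, method, request, response):
    """Stand-in for the call to the 'real' stub."""
    response.value = FAKE_DATASTORE.get(request.key)


def execute_request(service, method, encoded_body):
    """Look up the classes, decode the request, make the call."""
    methods = TOY_SERVICE_PB_MAP.get(service, {})
    if method not in methods:
        raise KeyError("%s.%s is not exposed" % (service, method))
    request_class, response_class = methods[method]
    request = request_class()
    request.ParseFromString(encoded_body)
    response = response_class()
    fake_make_sync_call(service, method, request, response)
    return response


response = execute_request("datastore_v3", "Get", b"greeting")
```

Because the map is consulted before anything is decoded, a call to a service or method that isn't listed fails early, which is the property the customizations later in this post rely on.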

Next, let's take a brief look at the client stub, available here. The interesting part starts at line 124, with the RemoteStub class. As we suggested earlier, this class simply accepts any incoming RPC call at all, packages it up in the by-now-familiar fashion, and ships it out over HTTP. When the response comes back, it decodes it and returns it to the app, modulo some error handling, which we'll ignore for now.

That's all there is to the core of remote_api. With a simple stub installed on the client end, and a handler on the server end, it's possible to use any App Engine API without your app having to be aware that it's using remote_api at all! Okay, yes, nearly any API.

You may have noticed that I said earlier that there were "a couple" of complications, but I only described one. Now that you understand the basic operation, though, I can reveal the reason for the second complication. The issue is that although most of the RPC calls you can make in App Engine are stateless - that is, they require no information other than that which is provided in the RPC itself - there are a few that aren't. Pretty much all of these occur in the datastore API, and they involve two things: retrieving data, and transactions.

This is where the rest of the code - and most of the complexity - in remote_api comes in. On the client side, we implement a custom stub, RemoteDatastoreStub, which is responsible for 'papering over' the requirement for state. It does this by making different calls to the server - for example, a request to fetch more results from a query is replaced by a request to execute the query all over again, with an offset corresponding to how far through the results the client has already read. This is assisted by a stub of the same name on the server side, which provides the services necessary for doing this.
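
The offset trick is easier to see in miniature. The sketch below is a toy under stated assumptions, not RemoteDatastoreStub itself: run_query is a hypothetical stateless server call over an in-memory list, and ClientCursor is a made-up class that keeps the position on the client.

```python
ROWS = ["a", "b", "c", "d", "e"]
server_calls = []


def run_query(offset, limit):
    """Stand-in for a stateless server-side query."""
    server_calls.append((offset, limit))
    return ROWS[offset:offset + limit]


class ClientCursor(object):
    """Keeps the cursor position on the client, so the server needs none."""

    def __init__(self, batch_size):
        self._batch_size = batch_size
        self._offset = 0

    def next_batch(self):
        # Re-run the whole query, skipping what has already been returned.
        batch = run_query(self._offset, self._batch_size)
        self._offset += len(batch)
        return batch


cursor = ClientCursor(batch_size=2)
batches = [cursor.next_batch() for _ in range(3)]
```

Every call to the server carries everything needed to answer it, at the cost of the server redoing the skipped work on each batch.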

In general, though, you shouldn't have to care about these patches, as they're only there to ensure that, for all intents and purposes, remote_api works exactly as it would if you were doing things locally. Let's put that aside for now, and look into how you'd customize remote_api.

Customizing remote_api

All customization of remote_api is going to be largely concerned with modifying the handler - the server end. As described earlier, there are two main reasons you might want to do this: to restrict access in a SaaS setting, or to implement your own API.

In order to do either of these, you need to customize the SERVICE_PB_MAP in handler.py. Doing this is actually very straightforward, with monkeypatching. Suppose we want to disable access to the taskqueue API via remote_api entirely. One reason to do this would be to make the hack we linked to, for recovering source code, impossible. Here's the complete code to do that:

from google.appengine.ext.remote_api import handler

del handler.SERVICE_PB_MAP['taskqueue']

if __name__ == "__main__":
  handler.main()

Once you've placed that in a module, simply modify your app.yaml to use that module as the handler for the /remote_api URL instead of the default one. This way, our monkeypatch will always be run before the remote_api handler, ensuring it's impossible to use the taskqueue API over remote_api.
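
If you want finer-grained restrictions than removing a whole service, one possible approach is to prune the map down to an allowlist of methods. The helper below is a hypothetical sketch, not part of the SDK, and it operates on a toy dictionary with placeholder strings where the real map holds Protocol Buffer classes; the method names are illustrative only.

```python
def restrict_service_map(pb_map, allowed):
    """Prune pb_map in place so only the allowed service methods remain.

    `allowed` maps a service name to the set of method names to keep.
    Services missing from `allowed` are removed entirely.
    """
    for service in list(pb_map):
        keep = allowed.get(service)
        if not keep:
            del pb_map[service]
            continue
        methods = pb_map[service]
        for method in list(methods):
            if method not in keep:
                del methods[method]
    return pb_map


# Toy stand-in for handler.SERVICE_PB_MAP; the tuples would be PB classes.
toy_map = {
    "datastore_v3": {
        "Get": ("GetRequest", "GetResponse"),
        "Put": ("PutRequest", "PutResponse"),
        "Delete": ("DeleteRequest", "DeleteResponse"),
    },
    "taskqueue": {"Add": ("AddRequest", "AddResponse")},
}
restrict_service_map(toy_map, {"datastore_v3": {"Get"}})
```

The pruning happens in place because a monkeypatch has to modify the same dictionary object the handler module already holds; building a filtered copy would leave the original untouched.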

As you might guess, you can make other modifications in this fashion, too, such as adding your own APIs. We won't go into detail about adding your own APIs here, as that requires custom Protocol Buffer classes, which get complicated - so we'll leave that as an exercise for the reader.

As a final note, if you wanted to install a custom stub for an existing service, perhaps because you want to examine and modify requests made over remote_api, you can do that by subclassing ApiCallHandler. Here's an example of that in action, with a stub that logs all the calls made through it:

import logging
import wsgiref.handlers

from google.appengine.api import apiproxy_stub_map
from google.appengine.ext import webapp
from google.appengine.ext.remote_api import handler


class RemoteLoggingStub(object):
  """Logs each call, then forwards it to the stub that normally handles it."""

  def MakeSyncCall(self, service, method, request, response):
    logging.info("Handling call %s.%s(%s)", service, method, request)
    apiproxy_stub_map.MakeSyncCall(service, method, request, response)


class ApiCallHandler(handler.ApiCallHandler):
  # Copy the parent's mapping and override the datastore entry.
  # dict.update() returns None, so build the copy in a single expression.
  LOCAL_STUBS = dict(handler.ApiCallHandler.LOCAL_STUBS,
                     datastore_v3=RemoteLoggingStub())


def main():
  application = webapp.WSGIApplication([('.*', ApiCallHandler)])
  wsgiref.handlers.CGIHandler().run(application)


if __name__ == '__main__':
  main()

Hopefully you've learned something new - if a little esoteric - today. Let us know in the comments if you've found it particularly useful!

14 May, 2010
