Announcing the SQLite datastore stub for the Python App Engine SDK

Posted by Nick Johnson | Filed under app-engine, python, coding, sqlite, tech, datastore

For the past couple of weeks, I've been working on one of those projects that seems to suck up every available moment (and some that technically aren't). Now, however, it's largely done, and as an extra bonus, I've been given permission to release it as an early preview for those that are interested.

The code in question is a new implementation of the local datastore for the Python App Engine SDK. While some of you are probably delighted at the news, I expect most of you are puzzled. Why do we need a new local datastore implementation? Let me explain.

The purpose of the local stubs in the App Engine SDK is to exactly replicate the behaviour of the production environment, and in general they do that very well. A specific non-goal is replicating the performance characteristics of the production environment, or being as scalable as the production environment - the stubs are designed for testing, not production use.

The Python SDK's datastore implementation operates by storing the entire contents of your development datastore in memory. It writes changes to disk so that it can reload your datastore when the dev_appserver is restarted, but the in-memory copy is the primary source for all operations. Queries are executed by iterating over every entity of the right kind and filtering based on the criteria you specify.

Despite the use of a linear scan, this straightforward implementation is more than sufficient for most developers. The amount of test data one typically wants to load is fairly small, and it's hard to beat the performance you get from having everything stored in memory. Problems arise, however, when you need to load a larger than normal amount of test data into your app. As soon as the amount of stored data (including storage and indexing overhead) exceeds your available physical memory, performance slows down dramatically, as the dev_appserver has to page bits of the datastore to and from disk. Startup performance can be impacted, too, since the datastore stub has to read the entire datastore off disk each time the dev_appserver is started.

The new local datastore implementation fixes both these issues by rewriting the datastore stub to use SQLite as a backend. Data is stored by SQLite on disk, so memory consumption should not be a problem, and startup times will likewise be unaffected by the size of the local datastore. Here's what you can expect with the new datastore stub:

Improved startup times if you store a lot of data in your local datastore.
Improved performance if you store a lot of data in your local datastore.
Lower memory consumption.

Equally importantly, though, here's what you shouldn't expect:

Performance improvements when dealing with small amounts of data. Although a linear scan for all queries sounds inefficient, it really is hard to beat keeping all your data in memory for speed - so if you have relatively little data, you may not see much or any improvement.
A relational schema. If you open up the SQLite database created by the dev_appserver, don't expect it to be laid out with a table per kind, and columns for properties. More on how the datastore is structured later, if you're interested.

Also, please bear in mind that this is pre-release code. It may wipe out all your data. It may cause the spontaneous generation of a black hole which swallows your cat. It may even work as expected! Nearly anything could happen.

Without further ado, you can get the patch and instructions on how to use it here. Feedback is appreciated - either in the groups, or right here on my blog.

How it works

What follows is a description of some of the internal workings of the datastore_sqlite_stub. If you just want to use it, you can stop now - otherwise, read on!

The basic structure of the SQLite stub is very similar to that used by the production datastore. An 'Entities' table holds all the entities, indexed by entity type and the 'path' component of their key. An 'EntitiesByProperty' table stores every indexed property for every entity, with columns for kind, property name, property value, and the path of the entity this row refers to.

All permitted queries can be executed by selecting from the Entities table, with some number of joins to the EntitesByProperty table. For example, a kind-only query or an ancestor-only query can be satisfied by the Entities table alone. A single property query requires a single join to the EntitiesByProperty table, constrained to find only entities with the correct kind, property name, and value (or range of values). Finally, multiple property queries are satisfied by using multiple joins to the EntitiesByProperty table, one for each required property. Unlike the production datastore, composite indexes aren't (currently) implemented, so queries that would normally require a composite indexes use the aforementioned strategy, leaving it up to the DB to find an efficient query plan.

Another difference from the production datastore is that data in the SQLite stub is partitioned by app ID - so each app has its own Entity and EntitiesByProperty tables.

Data representation and indexing

Data representation provides a particular challenge when it comes to indexing. App Engine provides a wide array of data types, and since the datastore is schemaless, any field can potentially contain any datatype.

Unlike most relational databases, SQLite uses a dynamic typing system, whereby any field can be of any data type - types declared in CREATE TABLE statements merely define 'affinities'. At first glance, this would appear to solve our problem - we can simply store the values as you would expect, and SQLite will allow us. We immediately run into a couple of problems, however.

The first problem is that SQLite doesn't provide the same sorting characteristics as the App Engine datastore. For example, while App Engine sorts all integers before all floats, SQLite sorts them by their magnitude. This could be remedied by splitting the column into two parts - the type and the value - and sorting by them and selecting on them both, were it not for the second issue...

The second issue is that SQLite doesn't implement all the data types App Engine provides - for example, Key and GeoPt. In order to be completely consistent with production, the ordering on these properties is important, too - it's not enough to be able to check for equality. Given that, it looks like we're going to need an alternative encoding scheme.

At first glance, Protocol Buffers seem like they should be it. Given the same values for fields, the Protocol Buffer encoder will always create the same binary PB, so we can definitely compare for equality. Individual fields of a Protocol Buffer are emitted in order, too, so thanks to the way the App Engine PBs are designed, in theory we ought to sort each data type correctly as well. This brilliant idea, alas, is torpedoed by several practical issues with sorting Protocol Buffers:

PBs encode fixed length fields like integers and floats little-endian, which doesn't sort correctly.
Variable length integers are encoded using a scheme that likewise does not preserve ordering of the number being encoded.
Strings are length-prefixed, which means strings are compared first based on size, then on contents - 'aaa' sorts after 'z'!

In production, App Engine solves this by using a special encoding for Protocol Buffers that preserves the desired ordering properties - and that's the approach the SQLite stub takes, too. Here's how it deals with each of the three issues outlined above:

Fixed length numbers are encoded big-endian, so they sort as expected. Further, the sign bit is flipped, and the entire value is inverted with a bitwise NOT if it's negative, to ensure that negative numbers sort smaller than positive numbers.
Variable length integers are encoded by a special scheme that uses the first byte to indicate the sign of the number and the number of bytes that follow. Small values (between -119 and 119) are encoded in a single byte to save space.
Strings are escaped instead of being length prefixed. Nulls in a string (\0) are replaced with the escape sequence \1\1, while \1 bytes are replaced with \1\2. Strings are then terminated with a null byte. This encoding is reversible, but ensures the strings still sort in the same order as they did previously.

With a Protocol Buffer encoder that implements those tweaks, we can simply encode the fields we want to index using it, and insert them as BLOB fields into the database. The encoding ensures that SQLite can sort them as byte values, without needing to interpret or understand them, and the sort order will come out as expected.

That's it for the description of the internals. If you decide to try out the new local datastore, please do provide feedback - here, or in the official group.

03 March, 2010

Previous Post Next Post

Nick's Blog