Guessing subreddits with the Prediction API

Edit: Now with a live demo!

I've written before about the new BigQuery and Prediction APIs, and promised to demonstrate them. Let's take a look at the Prediction API first.

The Prediction API, as I've explained, does a restricted form of machine learning as a web service. Currently, it supports categorizing textual and numeric data into a preset list of categories. The example given in the talk - language detection - is a good one, but I wanted to come up with something new. A few ideas presented themselves:

  1. Training on movie/book reviews to try to predict the score given, based on the review text
  2. Training on product descriptions to try to predict their rating
  3. Training on Reddit submissions to try to predict the subreddit a new submission belongs in

All three have promise, but the first could suffer from the fact that the Prediction API, as it currently stands, doesn't understand relationships between categories - it has no way to know that the '5 star' rating tag is 'closer to' the '4 star' tag than to the '1 star' one. The second seems very ambitious, and it's not clear there's enough information to do that ...
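Since the title gives away that the third option is the one I settled on, here's a rough Python sketch of what training and prediction look like against the Prediction API's REST interface. The bucket and object names are placeholders, and the endpoint paths and request format are my reading of the v1 docs - treat the details as assumptions to check against the documentation, not gospel:

    import json
    import urllib2

    AUTH_TOKEN = '...'  # a ClientLogin/OAuth token; auth details omitted
    # Placeholder: CSV training data already uploaded to Google Storage,
    # one row per submission: "subreddit","title of the submission"
    TRAINING_DATA = 'my-bucket%2Freddit-titles.csv'
    BASE = 'https://www.googleapis.com/prediction/v1/training'

    def start_training():
        # Kick off asynchronous training on the uploaded data.
        req = urllib2.Request('%s?data=%s' % (BASE, TRAINING_DATA), data='{}')
        req.add_header('Authorization', 'GoogleLogin auth=%s' % AUTH_TOKEN)
        req.add_header('Content-Type', 'application/json')
        return json.load(urllib2.urlopen(req))

    def predict_subreddit(title):
        # Ask the trained model which subreddit a title most resembles.
        body = json.dumps({'data': {'input': {'text': [title]}}})
        req = urllib2.Request('%s/%s/predict' % (BASE, TRAINING_DATA), data=body)
        req.add_header('Authorization', 'GoogleLogin auth=%s' % AUTH_TOKEN)
        req.add_header('Content-Type', 'application/json')
        return json.load(urllib2.urlopen(req))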

Google I/O playlist, day 4: The BigQuery and Prediction APIs

This is the fourth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those who haven't seen them.

Today's session is BigQuery and Prediction APIs. These are two awesome APIs that I described previously, and you can look forward to forthcoming posts exploring how they work and what they can be used for.

This is another language-agnostic video - the APIs, by their nature, are pretty indifferent to the language you access them with. They both depend on Google Storage for their storage needs, though, so you should probably watch that talk first.

If you're only interested in one API or the other, the BigQuery talk starts at 6:15, and the Prediction API talk starts at 24:40. The whole talk is definitely worth watching, though.

Have something you'd particularly like to see demonstrated using the Prediction or BigQuery APIs in a future post? Leave a comment!

Google I/O playlist, day 2: Google Storage for Developers

This is the second in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those who haven't seen them.

Today's video is on Google Storage for Developers. Google Storage for Developers is a new API for storing, serving, and sharing large numbers of files, of nearly any size. While Google Storage isn't directly linked to App Engine, it's likely to be of interest to many App Engine developers. It's also the foundation for the BigQuery and Prediction APIs, which I discussed earlier.

The talk is language-agnostic, so both Python and Java developers will find it of interest. Currently, the only officially supported client library is the Python one, but the protocol is deliberately compatible with similar existing services such as Amazon S3, so the Boto library works as-is, as will other such libraries. Further, the API is simple enough that I expect Google-Storage-specific libraries will quickly spring up in most major languages.
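To make that concrete, here's roughly what storing and retrieving an object looks like with boto. The bucket name and credentials are placeholders, and I'm relying on boto's connect_gs helper for Google Storage; everything else is exactly the code you'd write against S3:

    import boto

    # Interoperable access keys from the Google Storage console
    # (placeholders - substitute your own).
    conn = boto.connect_gs('GS_ACCESS_KEY_ID', 'GS_SECRET_ACCESS_KEY')

    # Create a bucket and upload an object, just as you would with S3.
    bucket = conn.create_bucket('my-example-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('Hello, Google Storage!')

    # And read it back.
    print key.get_contents_as_string()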

Google I/O day 2: BigQuery and Prediction APIs

First up this morning on the App Engine track is the BigQuery and Prediction APIs talk.

First up is BigQuery. BigQuery is a new API that lets you use Google's infrastructure to perform queries and analysis over large collections of read-only data. It's designed to scale to massive datasets, and integrates well with App Engine and other platforms.

To use it, you start by uploading your data to the new Google Storage service. Then, you import it into BigQuery tables, and you can run queries on those tables. Even though it handles billions of rows of data, there's no need to explicitly define indexes or to shard your data.

The query syntax should be familiar: it's based on SQL, and it's extremely flexible. Using the example of the database of all Wikipedia revisions, getting the 5 most edited titles is as simple as:

SELECT TOP(title, 5), COUNT(*) FROM [bigquery.test.001/tables/wikipedia] WHERE wp_namespace = 0;
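For the curious, here's a minimal Python sketch of issuing that query over the REST interface. The endpoint path and query parameter are assumptions on my part, modelled on the other v1 APIs, so check the BigQuery docs before relying on them:

    import json
    import urllib
    import urllib2

    AUTH_TOKEN = '...'  # auth token, as with the other APIs
    # Assumed endpoint - modelled on the other v1 APIs, not verified.
    QUERY_URL = 'https://www.googleapis.com/bigquery/v1/query'

    query = ('SELECT TOP(title, 5), COUNT(*) '
             'FROM [bigquery.test.001/tables/wikipedia] '
             'WHERE wp_namespace = 0;')
    req = urllib2.Request('%s?%s' % (QUERY_URL, urllib.urlencode({'q': query})))
    req.add_header('Authorization', 'GoogleLogin auth=%s' % AUTH_TOKEN)
    print json.load(urllib2.urlopen(req))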

The speed has to be seen to be believed - response times range from under a second to a few seconds over hundreds of millions to tens of billions of rows - seemingly regardless ...