Edit: Now with a live demo!
As I've explained, the Prediction API offers a restricted form of machine learning as a web service. Currently, it supports categorizing textual and numeric data into a preset list of categories. The example given in the talk - language detection - is a good one, but I wanted to come up with something new. A few ideas presented themselves:
- Training on movie/book reviews to try and predict the score given based on the text
- Training on product descriptions to try and predict their rating
- Training on Reddit submissions to try and predict the subreddit a new submission belongs in
All three have promise, but the first could suffer from the fact that the Prediction API, as it currently stands, doesn't understand any relationship between categories - it would have no way to know that the '5 star' rating tag is 'closer to' the '4 star' one than to the '1 star' tag. The second seems very ambitious, and it's not clear there's enough information to do that ...
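To make the 'closer to' problem concrete, here is a minimal sketch in plain Python. It does not call the Prediction API; the label strings and both helper functions are hypothetical, made up purely for illustration. It compares how a purely categorical view and an ordinal view would score a near miss and a far miss:

```python
# Hypothetical star-rating labels. A purely categorical classifier sees
# these as unrelated strings with no ordering between them.
LABELS = ["1 star", "2 star", "3 star", "4 star", "5 star"]


def categorical_error(actual: str, predicted: str) -> int:
    """Return 0 if the labels match and 1 otherwise; no notion of 'close'."""
    return 0 if actual == predicted else 1


def ordinal_error(actual: str, predicted: str) -> int:
    """Return the distance between the labels' positions on the scale."""
    return abs(LABELS.index(actual) - LABELS.index(predicted))


near_miss = ("5 star", "4 star")
far_miss = ("5 star", "1 star")

# Categorical view: both mistakes look equally wrong.
print(categorical_error(*near_miss), categorical_error(*far_miss))
# Ordinal view: the far miss is penalised more than the near miss.
print(ordinal_error(*near_miss), ordinal_error(*far_miss))
```

Under the categorical view both mistakes cost the same, which is the limitation described above; only the ordinal view, which the API as described does not offer, tells them apart.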
This is the fourth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.
Today's session is BigQuery and Prediction APIs. These are two awesome APIs that I described previously, and you can look forward to some forthcoming posts exploring how they work and what they can be used for.
This is another language-agnostic video - the APIs, by their nature, are pretty indifferent about what language you access them with. They both depend on Google Storage for their storage needs, though, so you should probably watch that talk first.
Have something you'd particularly like to see demonstrated using the Prediction or BigQuery APIs in a future post? Leave a comment!
First up this morning on the App Engine track is the BigQuery and Prediction APIs talk.
The talk starts with BigQuery. BigQuery is a new API that lets you make use of Google's infrastructure for performing queries and analysis over large collections of read-only data. It's designed to scale to massive datasets, and integrates well with App Engine and other platforms.
To use it, you start by uploading your data to the new Google Storage service. Then, you import it into BigQuery tables, and you can run queries on those tables. Despite the fact that it handles billions of rows of data, there's no need to explicitly define indexes, or to shard your data.
The query syntax should be familiar: it's based on SQL, and is extremely flexible. Using the example of the database of all Wikipedia revisions, getting the 5 most edited titles is as simple as:
SELECT TOP(title, 5), COUNT(*) FROM [bigquery.test.001/tables/wikipedia] WHERE wp_namespace = 0;
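For readers who don't have access to the service, here is a rough in-memory analogue in Python of what that query computes: count revisions per title within namespace 0 and keep the most frequent ones. The sample rows and the `top_titles` helper are hypothetical, and the sketch says nothing about how BigQuery actually executes the query or whether its results are exact at scale.

```python
from collections import Counter

# Hypothetical sample rows standing in for the Wikipedia revisions table;
# each row represents one revision of a page.
revisions = (
    [{"title": "Google", "wp_namespace": 0}] * 3
    + [{"title": "SQL", "wp_namespace": 0}] * 2
    + [{"title": "Python", "wp_namespace": 0}]
    + [{"title": "Talk:Google", "wp_namespace": 1}] * 4
)


def top_titles(rows, n):
    """Count revisions per title in namespace 0 and return the n largest."""
    counts = Counter(
        row["title"] for row in rows if row["wp_namespace"] == 0
    )
    return counts.most_common(n)


for title, count in top_titles(revisions, 2):
    print(title, count)
```

The `wp_namespace == 0` filter plays the role of the WHERE clause, so the heavily edited talk page is excluded even though it has the most revisions in the sample.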
The speed has to be seen to be believed - response times from under a second to a few seconds for hundreds of millions to tens of billions of rows - seemingly regardless ...