Damn Cool Algorithms, Part 1: BK-Trees

This is the first post in (hopefully) a series of posts on Damn Cool Algorithms - essentially, any algorithm I think is really Damn Cool, particularly if it's simple but non-obvious.

BK-Trees, or Burkhard-Keller Trees are a tree-based data structure engineered for quickly finding near-matches to a string, for example, as used by a spelling checker, or when doing a 'fuzzy' search for a term. The aim is to return, for example, "seek" and "peek" if I search for "aeek". What makes BK-Trees so cool is that they take a problem which has no obvious solution besides brute-force search, and present a simple and elegant method for speeding up searches substantially.

BK-Trees were first proposed by Burkhard and Keller in 1973, in their paper "Some approaches to best match file searching". The only copy of this online seems to be in the ACM archive, which is subscription only. Further details, however, are provided in the excellent paper "Fast Approximate String Matching in a Dictionary".

Before we can define BK-Trees, we need to define a couple of preliminaries. In order to index and search our dictionary, we need a way to compare strings. The canonical method for this is the Levenshtein ...

Serializing JavaScript objects with circular references

One problem we've been dealing with at work for a while is serializing JavaScript objects to a string representation, when objects can contain circular references. The simplest example of this is an object a, which contains one property (call it 'foo') which is a reference to a.

Mozilla provides a toSource() function on the Object class (and hence every object) which returns a serialized representation of that object, allowing it to be reconstituted with eval(). To handle circular references, any object or array can be tagged with a numeric identifier by prefixing it with "#x=" (where x is an integer), and a tagged object can be referenced with the syntax "#x#". Using this, any object graph can be serialized, and we can even make use of this system to compress the serialized text down further than Mozilla does on its own by using this syntax for all backreferences, instead of just circular ones. The example earlier, of an object referencing itself would serialize as "#1={foo:#1#}".

All fine and dandy, but unfortunately, IE doesn't implement the toSource() function. I don't know about other browsers either - toSource() is part of Netscape's original Javascript documentation, but not ...

Indexing directory structures

After much messing around with SQL DBs and table structures, I eventually resorted to writing my own indexing scheme and storing the index in memory for my FS-indexing needs.
Here's how I went about it:

1) Construct a tree out of the FS you're indexing. Each node in the tree needs a reference to its parent, but if you're just using it for this index, it doesn't actually need references to its children.
2) Construct a dictionary of lists of nodes to act as the index. Dictionary> in generics/templates speak.
3) Iterate through each item, and extract terms for that component. Terms are only extracted for the current component, not its parents - "c:\foo\bar" would only have 'bar' as a term.
4) Find the relevant item in the index for each term (or add it, if need be), and add the current node to the list of nodes against that item.

Once you've constructed the index, searching is as follows:
1) Extract the terms for your search query in the same manner as used for path components above
2) Obtain the list from the index for each of these terms. If any of ...


For some time now, I've been working, on and off, on a P2P filesharing system, which I've dubbed 'Attercop' (named after the term used for 'spider' used in Old English and LOTR). Simply put, it's a replacement for DC++, designed mostly for LAN environments. It uses multicast, so no central hub is required, and it improves on a number of the most glaring weaknesses in DC++. I'm writing it in C#.NET.

Like most projects I embark on, I started off hugely enthusiastic, spending as much time as I could spare on it. Normally, I'd continue like that until either I get the project completed, or I burn-out on it. Quite frequently, the latter has happened - I've burned out and lost interest, with the project unfinished.

However, since I've been spending all the time I've not been at work with Hayley, I've had very little time for it lately - a lack that's hard to mourn, since I consider time with Hayley much better spent - and I'd begun to lose enthusiasm for the project. I concluded that I was (unfortunately) suffering from the same sort of mid-project disinterest I often ...

Transforming ATOM 1.0 into HTML using XSLT

In response to Myke's request for a way to generate an HTML page containing the newest items from a bunch of RSS feeds, I made a suggestion: Use Google Reader to aggregate the feeds for you, export them using the 'sharing' functionality, and then simply transform the resulting ATOM feed into an HTML page using some XSLT on serverside.

I thought there'd be a ready-made XSLT stylesheet for transforming ATOM out there, but I had absolutely no luck finding one. So, I hacked together a basic one for the purpose. Here it is:


  • Published:


I've only tested it with google's atom, but it ought to work with any valid atom feed. I also haven't attempted to format the dates into something more user-friendly - I'm sure XSLT has the facility, but I'll be damned if I can find it right now. Finally, only the title, url, dates and summary are exposed, but it's pretty trivial to add more.

Optimum phone keypad layouts, or, what to do with a boring weekend.

So, I was a little bored, and had an idea. The commonly used (alphabetic) phone keypad layout is not particularly efficient when it comes to T9 (one digit per letter) style entry - there are a lot of collisions. What would an ideal keypad layout look like?

So I hunted up a computer-readable english dictionary with frequency scores (a 0-16 scale), and devised a way to score the efficiency of a layout: group words by the digits they translate into with that layout, and multiply together their frequency scores. So, for example, two colliding words with frequency 8 would score 64 points, while a collision between a word with frequency 1 and one with frequency 15 would score only 15 points. The idea is that collisions between words with greatly different frequencies are much easier to resolve - it'll almost always be the more common word - whilst collisions between similar words are generally bad. Then, sum up all the products to get the final score.

I then seeded a Genetic Algorithm with the default keypad layout, and ran generations of randomly mutating layouts (by reassigning letters at random) then scoring them, then selecting the few layouts with the lowest scores, and ...