Nick's Blog

Avatar

Because repeating myself sucks.

Damn Cool Algorithms, Part 1: BK-Trees

This is the first post in (hopefully) a series of posts on Damn Cool Algorithms - essentially, any algorithm I think is really Damn Cool, particularly if it's simple but non-obvious.

BK-Trees, or Burkhard-Keller Trees are a tree-based data structure engineered for quickly finding near-matches to a string, for example, as used by a spelling checker, or when doing a 'fuzzy' search for a term. The aim is to return, for example, "seek" and "peek" if I search for "aeek". What makes BK-Trees so cool is that they take a problem which has no obvious solution besides brute-force search, and present a simple and elegant method for speeding up searches substantially.

BK-Trees were first proposed by Burkhard and Keller in 1973, in their paper "Some approaches to best match file searching". The only copy of this online seems to be in the ACM archive, which is subscription only. Further details, however, are provided in the excellent paper "Fast Approximate String Matching in a Dictionary".

Before we can define BK-Trees, we need to define a couple of preliminaries. In order to index and search our dictionary, we need a way to compare strings. The canonical method for this is the Levenshtein Distance, which takes two strings, and returns a number representing the minimum number of insertions, deletions and replacements required to translate one string into the other. Other string functions are also acceptable (for example, one incorportating the concept of transpositions as an atomic operation could be used), as long as they meet the criteria defined below.

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space. Put simply, a metric space is any relationship that adheres to three basic criteria:

  • d(x,y) = 0 <-> x = y (If the distance between x and y is 0, then x = y)

  • d(x,y) = d(y,x) (The distance from x to y is the same as the distance from y to x)

  • d(x,y) + d(y,z) >= d(x,z)


The last of these criteria is called the Triangle Inequality. The Triangle Inequality states that the path from x to z must be no longer than any path that goes through another intermediate point (the path from x to y to z). Look at a triangle, for example: it's not possible to draw a triangle such that it's quicker to get from one point to another by going along two sides than it is by going along the other side.

These three criteria, basic as they are, are all that's required for something such as the Levenshtein Distance to qualify as a Metric Space. Note that this is far more general than, for example, a Euclidian Space - a Euclidian Space is metric, but many Metric Spaces (such as the Levenshtein Distance) are not Euclidian. Now that we know that the Levenshtein Distance (and other similar string distance functions) embodies a Metric Space, we come to the key observation of Burkhard and Keller.

Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.

From here, the construction of a BK-Tree is simple: Each node has a arbitrary number of children, and each edge has a number corresponding to a Levenshtein distance. All the subnodes on the edge numbered n have a Levenshtein distance of exactly n to the parent node. So, for example, if we have a tree with parent node "book" and two child nodes "rook" and "nooks", the edge from "book" to "rook" is numbered 1, and the edge from "book" to "nooks" is numbered 2.

To build the tree from a dictionary, take an arbitrary word and make it the root of your tree. Whenever you want to insert a word, take the Levenshtein distance between your word and the root of the tree, and find the edge with number d(newword,root). Recurse, comparing your query with the child node on that edge, and so on, until there is no child node, at which point you create a new child node and store your new word there. For example, to insert "boon" into the example tree above, we would examine the root, find that d("book", "boon") = 1, and so examine the child on the edge numbered 1, which is the word "rook". We would then calculate the distance d("rook", "boon"), which is 2, and so insert the new word under "rook", with an edge numbered 2.

To query the tree, take the Levenshtein distance from your term to the root, and recursively query every child node numbered between d-n and d+n (inclusive). If the node you are examining is within d of your search term, return it and continue your query.

The tree is N-ary and irregular (but generally well-balanced). Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node! Note that exact searching can also be performed fairly efficiently by simply setting n to 0.

Looking back on this, the post is rather longer and seems more involved than I had anticipated. Hopefully, you will agree after reading it that the insight behind BK-Trees is indeed elegant and remarkably simple.

Comments are closed

34 Comments

  1. Re: Article by MrTrick (2007-04-02)

    Needs more diagrams.

    I lost track after 'From here, the construction of a BK-Tree is simple.'
  2. Re: Article by Nick Johnson (2007-04-02)

    Good point. I'll add some if I get the chance.
  3. Re: Article by MrTrick (2007-04-02)

    Hrm... must be caching problems, I don't see your update...
  4. Re: Article by Ole (2007-04-23)

    Well done!, I have to say that information about bk-trees structure is extremly scarce on the Internet. Most web pages indexed by google :-) speaks about "using bk-trees" but not "how they are build".

    Congratulations.
  5. Re: Article by Fredrik (2007-10-20)

    [url=http://well-adjusted.de/~jrschulz/mspace/mspace.BKTree-class.html]BKTree by Jochen Schulz[/url] is a Python implementation of this algorithm, as a part of a larger library. Comments at the url cited above, source code available at [url]http://well-adjusted.de/~jrschulz/mspace/mspace-pysrc.html#BKTree[/url] and beyond.
  6. Re: Article by Daniel (2007-10-20)

    Hi Nick,

    very nice explanation. I enjoyed reading it.

    Thanks.
  7. Re: Article by Barry (2007-10-20)

    Thanks for this. I have some questions though:

    1) How long does it take on average to build a tree from a dictionary
    2) Assume I want to find best matches with an incorrectly spelt English word (lets assume the dictionary we will be using is the old familiar English dictionary), how would I be able to calculate its best match quickly (what would I use a root)?
    3) Lastly, how big (in terms of memory) is say a BK tree used for an average spell checker (I assume you have this face readily available).

    Thanks again for the fascinating topic.
  8. Re: Article by Anonymous (2008-04-02)

    Holy WTF.
  9. Re: Article by Alexander Jerusalem (2007-10-20)

    Thanks for that! It's very useful and well written. Obviously this could be used with any distance measure, not just Levenshtein. For instance, it seems interesting to use a hierarchical distance function based on unicode categories, so that distance values derived from diacritics and case are always smaller than those derived from variance in base characters.
  10. Re: Article by klausnrooster (2007-10-20)

    Pseudocode would clarify the build process. Damn interesting either way though!
  11. Re: Article by Anonymous (2007-10-20)

    "distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test."

    Eh, shouldn't that be:

    "all our results must have at most distance *n-d* and at least distance d-n from test."

    ?
  12. Re: Article by pp (2007-10-20)

    Sexy! Thanks.
  13. Re: Article by marcos (2007-10-20)

    Really interesting. Thank you.
  14. Re: Article by manthrax (2007-10-20)

    Awesome article. Thank you!
  15. Re: Article by Chris (2007-10-20)

    Nice read... Keep it up!
  16. Re: Article by Volkan YAZICI (2007-10-21)

    It'd be really awesome if somebody would implement BK-Tree's in PostgreSQL using GiST.
  17. Re: Article by Dan (2007-06-19)

    they are also called vp-trees vector space trees
  18. Re: Article by KiLVaiDeN (2007-10-22)

    Cool information, thank you ! A little bit of code would have been particularly nice to enhance your article though, but well already explaining the principle is good enough :)

    The major lack of this algorithm is speed... The Levenshtein distance algorithm is pretty complex.

    If you don't know about them, sometimes it's easier to use SOUNDEX kind of algorithms, which are based on phonetics and can provide good results. Lot of variants exist, and most databases include those by default !

    Cheers
    K
  19. Re: Article by DaveG (2007-10-22)

    Also lost it at "From here, the construction of a BK-Tree is simple"; you introduce 'edge' suddenly with no explanation, which appears to be critical as the article progresses.

    Prior to that point it was very well explained :)
  20. Re: Article by russ (2007-10-23)

    Sounds great but a quick search for a C++ implementation didn't brig up anything useful. Do you have any links to C/C++ implementations of BK-Trees?
  21. Re: Article by dan (2007-11-01)

    It's good to have some diagrams and pictures when explaining such things. Otherwise good post :)
  22. Re: Article by tonys (2007-11-07)

    *Very* neat algorithm - thanks for describing it !

    I've just tried out a slightly optimized version of it that reduces the maximum distance every time it finds a better 'match' so that the tree query function returns only a list of the best possible matches.

    Rough testing seems to indicate that the optimization only offers an occasional/slight improvement when the query distance is 2 but then gets progressively more useful as the distance gets bigger.

    The optimization is pretty obvious but doesn't seem to be in the "Fast Approximate.." paper - do you perhaps know if there are lurking problems ?
  23. Re: Article by Paul Battley (2007-12-22)

    After reading this post, I wrote a simple BK-Tree implementation in Ruby, and did a talk about it at Euruko this year.

    The source code is available via Subversion: svn co http://paulbattley.googlecode.com/svn/bktree/trunk/ bktree

    [url=http://po-ru.com/presentations/fun-with-trees/]My slides[/url] (which probably won't look good in anything other than Firefox, I'm afraid) are also online.
  24. Re: Article by regsitesph@yahoo.com (2008-02-25)

    thanks for describing it !
  25. BK-Trees implementation by Smart Corner (2008-05-13)

    BK-Trees implementation
  26. Re: Article by Anonymous (2008-05-14)

    C# implementation:
    http://www.smartcorner.eu/post/2008/05/BK-Trees-implementation.aspx
  27. Re: Article by Anonymous (2008-05-14)

    C# implementation:

    [url]http://www.smartcorner.eu/post/2008/05/BK-Trees-implementation.aspx[/url]
  28. Re: Article by Nathan (2007-06-22)

    Real good blog. Not a math person, but love to here about algo's like this. Worded well also. Easy to undestand. Keep up the good work
  29. Re: Article by robert (2007-07-06)

    hopefully providers such as c5 or powercollections will pick up on this and push it in as well....

    or maybe we will offer...

    great post sir...

    btw: http://www.itu.dk/research/c5/
  30. Re: Article by Rob Renaud (2007-10-20)

    I also don't follow. Maybe some illustrations of the data structure and handling a query would help.
  31. Re: Article by Close Protection World (2007-10-20)

    diagrams as mentioned would help alot
  32. Re: Article by Dom (2007-10-20)

    So Google doesn't want smart people to know about search engine theory? I wonder why????
  33. Re: Article by anon (2007-10-20)

    nitpicking

    [quote]Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space. Put simply, a metric space is any relationship that adheres to three basic criteria: (...) [/quote]

    These criteria define a *distance*. A metric space is a *tuple* of (S,d) where S is a set and d a distance.
  34. Re: Article by John (2007-10-20)

    Thanks! That was a great introduction to a structure i had not yet come across, but one that helps with a project i'm working on now! Cheers :)