Monday, April 2. 2007Damn Cool Algorithms, Part 1: BK-TreesComments
Display comments as
(Linear | Threaded)
Needs more diagrams.
I lost track after 'From here, the construction of a BK-Tree is simple.'
Well done!, I have to say that information about bk-trees structure is extremly scarce on the Internet. Most web pages indexed by google
Congratulations.
BKTree by Jochen Schulz is a Python implementation of this algorithm, as a part of a larger library. Comments at the url cited above, source code available at http://well-adjusted.de/~jrschulz/mspace/mspace-pysrc.html#BKTree and beyond.
Real good blog. Not a math person, but love to here about algo's like this. Worded well also. Easy to undestand. Keep up the good work
hopefully providers such as c5 or powercollections will pick up on this and push it in as well....
or maybe we will offer... great post sir... btw: http://www.itu.dk/research/c5/
I also don't follow. Maybe some illustrations of the data structure and handling a query would help.
So Google doesn't want smart people to know about search engine theory? I wonder why????
nitpicking
QUOTE: Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space. Put simply, a metric space is any relationship that adheres to three basic criteria: (...) These criteria define a *distance*. A metric space is a tuple of (S,d) where S is a set and d a distance.
Thanks! That was a great introduction to a structure i had not yet come across, but one that helps with a project i'm working on now! Cheers
Thanks for this. I have some questions though:
1) How long does it take on average to build a tree from a dictionary 2) Assume I want to find best matches with an incorrectly spelt English word (lets assume the dictionary we will be using is the old familiar English dictionary), how would I be able to calculate its best match quickly (what would I use a root)? 3) Lastly, how big (in terms of memory) is say a BK tree used for an average spell checker (I assume you have this face readily available). Thanks again for the fascinating topic.
Thanks for that! It's very useful and well written. Obviously this could be used with any distance measure, not just Levenshtein. For instance, it seems interesting to use a hierarchical distance function based on unicode categories, so that distance values derived from diacritics and case are always smaller than those derived from variance in base characters.
Pseudocode would clarify the build process. Damn interesting either way though!
"distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test."
Eh, shouldn't that be: "all our results must have at most distance n-d and at least distance d-n from test." ?
It'd be really awesome if somebody would implement BK-Tree's in PostgreSQL using GiST.
Cool information, thank you ! A little bit of code would have been particularly nice to enhance your article though, but well already explaining the principle is good enough
The major lack of this algorithm is speed... The Levenshtein distance algorithm is pretty complex. If you don't know about them, sometimes it's easier to use SOUNDEX kind of algorithms, which are based on phonetics and can provide good results. Lot of variants exist, and most databases include those by default ! Cheers K
Also lost it at "From here, the construction of a BK-Tree is simple"; you introduce 'edge' suddenly with no explanation, which appears to be critical as the article progresses.
Prior to that point it was very well explained
Sounds great but a quick search for a C++ implementation didn't brig up anything useful. Do you have any links to C/C++ implementations of BK-Trees?
It's good to have some diagrams and pictures when explaining such things. Otherwise good post
Very neat algorithm - thanks for describing it !
I've just tried out a slightly optimized version of it that reduces the maximum distance every time it finds a better 'match' so that the tree query function returns only a list of the best possible matches. Rough testing seems to indicate that the optimization only offers an occasional/slight improvement when the query distance is 2 but then gets progressively more useful as the distance gets bigger. The optimization is pretty obvious but doesn't seem to be in the "Fast Approximate.." paper - do you perhaps know if there are lurking problems ?
After reading this post, I wrote a simple BK-Tree implementation in Ruby, and did a talk about it at Euruko this year.
The source code is available via Subversion: svn co http://paulbattley.googlecode.com/svn/bktree/trunk/ bktree My slides (which probably won't look good in anything other than Firefox, I'm afraid) are also online.
C# implementation:
http://www.smartcorner.eu/post/2008/05/BK-Trees-implementation.aspx |
Calendar
QuicksearchArchivesCategoriesBlog Administration |
|||||||||||||||||||||||||||||||||||||||||||||||||
BK-Trees implementation
Tracked: May 13, 16:26