Saturday, October 20. 2007Update on Anagram TreesTrackbacks
Trackback specific URI for this entry
No Trackbacks
Comments
Display comments as
(Linear | Threaded)
Thanks for sharing.
Is this similar to trie or kd-tree? http://en.wikipedia.org/wiki/Trie http://en.wikipedia.org/wiki/Kd-tree
It's similar to both, but has differences with both, too. In a Trie, the node is entirely represented by its path from the root, which isn't the case here, and a Kd-tree is binary, whereas this tree has arity that is determined by the data.
An interesting idea, but my intuition is that it's only situationally superior to a straightforward suffix tree, and the suffix tree does strike me as being more suited to natural language. I'd be curious to see how they stack up against each other.
Absolutely, it's rather specialized, but for the purpose for which it's designed (anagram finding), it's absolutely faster than a suffix tree.
Anagrams, not dictionary lookups! Durrrrr. Reading comprehension FTW. That's what I wasn't an English major. Wait, I was an English major. Crap!
When you say "least frequently used letters," do you mean from a heuristic basis or just straight counting / branch weighting? Hmmm, now that I think about it, a really ridiculous/over-the-top way to optimize an anagram lookup algo would be to optimize it based on statistically-probable letter orderings. E.g., you should never be doing work to unscramble a "qu." But beyond that, it would also make sense to try suffix and prefix branches first ("pre-" "-ing" etc). If you really wanted to get snazzy you could even try cross-breeding a suffix tree with your anagram tree to generate a dynamic heuristic, I think. That might be putting too much effort into it though. But I bet it would work better
"That's what I wasn't an English major."
Why. WHY I wasn't an English major. You know what, screw it, I'm taking up panhandling.
hi Nick,
i was thinking of a simple binary tree, on the same lines as ur proposal. i am explaining the search for anagrams in the structure, not the creation of the structure. probably creation would become evident, after i explain the lookup. keep a binary tree, in concept similar to the tree u proposed. each node would be separating words which contain the alphabet, from the once which dont. given a word to lookup, we sort it and then for each alphabet if present in the word, we would traverse both the branches of the tree. one for words contain that alphabet, and one for words not containing the alphabet. for alphabets not in the word, we go along the path containing words which dont have that alphabet. this way we reach some leaves in the structure, which in essence contain a superset of words which we r looking for. now the leaf node contain another tree, not a subtree to the original one. lets call it leaf-tree. for the leaf-ree, the same method as u proposed should word (ex: the one with as many childs as repetition of the alphabet), only difference is that now the alphabets are a subset of english alphabets. which we reached using the main tree. if all this is right, then probably we could also use the optimization, like once mentioned in ur posts, for both the trees. this should probably give better performance results, as we r trying to get to relevant results sooner. the leaf-tree probably should be very small, and hence much more efficient to traverse. i dont have any results, actually am shying away from doing the hard part. let me know of what u think. thanks
What a neat algorithm! I've read about DAWG, anagram dictionary by Knuth, suffix trees, etc. I believe this is by far the most efficient algorithm in finding all sub-anagrams of a word. Thanks for sharing this.
I don't fully understand at the moment and wanted to try the code out, but only python version is available. Since I'm a C++ programmer with no knowledge of python, is there a C++ version of this algorithm available? Or would some kind soul translate the python code to C++? thanks, |
Calendar
QuicksearchArchivesCategoriesBlog Administration |
|||||||||||||||||||||||||||||||||||||||||||||||||