Damn Cool Algorithms, Part 3: Anagram Trees
When it comes to finding anagrams of words, a frequent approach is to use an anagram dictionary - simply put, sort the letters in your word to provide a unique key that all anagrams of a word have in common. Another approach is to generate a letter-frequency histogram for each letter in your word. (Both these approaches are more or less equivalent, in fact.) These approaches make the problem of finding exact single-word anagrams for strings very efficient - O(1) if you use a hashtable.
However, the problem of finding subset anagrams - a word that contains a subset of the letters in a string - is still rather inefficient, requiring either a brute force O(n) search through the dictionary, or looking up every substring of the sorted input string, which is O(2^l) with the number of letters in the input string. Finding subset anagrams is significantly more interesting, too, as it has applications in finding multi-word anagrams, as well as being applicable to problem domains such as scrabble.
However, with a little more effort, and the above observation that we can generate a histogram that uniquely represents a given set of letters, we can generate a tree structure that makes looking up subset anagrams much more efficient. To build the tree, we follow this simple procedure:
Assume we have the following information:
- A lexicon or dictionary of words to populate the tree with
- An alphabet for words in the lexicon
- The tree we are building
- A current node
- Generate a letter-frequency histogram for the term.
- Set the current node to the root of the tree.
- For each symbol in the alphabet:
- Get the frequency of the current symbol in the current term. Call it f
- Set the current node to the fth child node of the current node, creating it if it doesn't exist
- Append the current term to the list of words on the current (leaf) node
The result of following this simple procedure is a fixed-height tree, 27 nodes deep, with all the words in the leaf nodes, and each internal tier of the tree corresponding to a symbol from the alphabet. Here's an (abbreviated) example:
Once the tree is built, we find subset anagrams for an input string as follows:
Assume we have the following information:
- The tree we built using the above procedure.
- The alphabet we used above.
- A frontier set, initially empty.
- Initialize the frontier set to contain the root of the tree.
- Generate a letter-frequency histogram for the input string.
- For each symbol in the alphabet:
- Get the frequency of the current symbol in the input string. Call it f.
- For each node in the current frontier set, add the subnodes numbered 0 through f to the new frontier set.
- The frontier set now consists of leaf nodes, containing all the subset anagrams of the input string.
Runtime analysis of this algorithm is rather difficult, for me, at least. Intuitively and in practice, it's a lot faster than either of the brute-force approaches, but quantifying that in big-O notation is something that's escaped me. As an upper bound, it cannot be less efficient than O(n) - only a constant factor worse than the brute-force approach. As a lower bound, a lookup in which the frontier set always has one node, lookup time is proportional to the length of the alphabet, or O(1). The average case depends on how large a subset of the dictionary the input string selects. Quantifying by the size of the output, approximately O(m) operations are required. If anyone knows how to determine more solid bounds for runtime, please do let me know in the comments.
One disadvantage of this approach is that there is substantial memory overhead. Using my Python implementation of the algorithm, and importing /usr/share/dict/words, which is approximately 2MB on this machine results in over 300MB of memory consumed. Using the Pickle module to serialize to disk, the output file is over 30MB, and compresses with gzip down to about 7MB. I suspect part of the large memory overhead is due to the minimum size of Python's dictionaries; I will modify the implementation to use lists and update this post if I can make it more efficient.
Here's a few stats on the tree generated that may be of interest:
Total words: 234,936
Leaf nodes: 215,366
Internal nodes: 1,874,748
From this we can see that the average cardinality of internal nodes is very low - not much more than 1. A breakdown of the number of nodes in each tier helps clarify this:
| Tier | Number of nodes |
|---|---|
| 0 | 1 |
| 1 | 7 |
| 2 | 25 |
| 3 | 85 |
| 4 | 203 |
| 5 | 707 |
| 6 | 1145 |
| 7 | 1886 |
| 8 | 3479 |
| 9 | 8156 |
| 10 | 8853 |
| 11 | 10835 |
| 12 | 19632 |
| 13 | 28470 |
| 14 | 47635 |
| 15 | 73424 |
| 16 | 92618 |
| 17 | 94770 |
| 18 | 125018 |
| 19 | 156406 |
| 20 | 182305 |
| 21 | 195484 |
| 22 | 200031 |
| 23 | 203923 |
| 24 | 205649 |
| 25 | 214001 |
The cardinality of nodes towards the top of the tree is fairly high, but the tree quickly flattens out, and the last four tiers of the tree account for almost half of the total nodes. This suggests one possible space optimisation: Removing the last few tiers of the tree and concatenating their leaf nodes together. When performing lookups, check the selected nodes to ensure they are actually subset anagrams of the input string.
* It's possible I'm simply rediscovering something that's well known in the computer science community, or perhaps mentioned in a computer science paper 30 years ago. Significant searching hasn't turned up anyone using an algorithm like this, or anything else more efficient than the brute-force approaches outlined.
Edit: The source to my initial implementation is here.
Edit: Converting my Python implementation to use lists reduced memory consumption by roughly half. I'll post figures for the pickled tree and the source code when I have the opportunity.
Edit: More updates can be found here.
Comments are closed
20 Comments
Re: Article by JLearn (2007-10-23)
Re: Article by Daniel Wagner (2007-12-22)
~d
Suppose we have a word w. This word contains a a's, b b's, c c's, etc.
For example "aardvark", represented as "aaadkrv".
a = 3, b = 0, c = 0, ..., r = 2, ... z=0
We have a+1 ways of removing zero or more a's from the word:
aaadkrv
aadkrv
adkrv
dkrv
Then we have b+1 ways of removing zero or more b's. (since b=0 in this case, we have 0+1=1 ways of removing zero or more b's). Now, we have (a+1)*(b+1) ways of removing zero or more a's and zero or more b's. Continuing we have (a+1)*(b+1)*(c+1)*...*(z+1) ways of removing zero or more of any letter.
Let m be the length of the word. The worst case for the algorithm is a=b=c=...=z = m/26. In this case, the number of steps is (m/26 + 1)^26.
> However, the problem of finding subset anagrams - a word that contains a subset of the letters in a string - is still rather inefficient, requiring either a brute force O(n) search through the dictionary,
This claim is incorrect. You don't have to search the entire dictionary if you use the same strategy as you used for your algorithm. Build a dictionary of sorted words to lists of words, like:
"act" => ["act", "cat", ...]
"aaadkrv" => ["aardvark", ...]
Now look up the word with zero or more letters removed, like I described above. For aardvark:
lookup(dict, "aaadkrv") \/ lookup(dict, "aadkrv") \/ ... lookup(dict, "v")
This algorithm has the same time complexity as yours. Essentially your algorithm uses a trie [1] whereas this algorithm uses a hash table.
[1] http://en.wikipedia.org/wiki/Trie
Re: Article by Darryl Rees (2007-12-23)
Jules' explanation wasn't that clear to me on first reading either, but essentially he's using a *character sorted* hash key which is equivalent to your histogram.
eg. aaabbc == (a=3,b=2,c=1).
IMO the histogram is a more straightforward and flexible approach; eg. you can easily find permutations of groups of subset anagrams that add up to the same histogram as some given string.
> permutations of groups of subset anagrams that add up to the same histogram as some given string
Do you mean words like "aeenttrlp" gives "tree" & "plant"?
Re: Article by Darryl Rees (2007-12-23)
For aardvark, the hashtable algorithm requires:
a = 3
d = 1
r = 2
v = 1
k = 1
(a+1)(d+1)(r+1)(v+1)(k+1) = 96 lookups
I think your algorithm requires the same number of steps. You visit four branches of "a": the 0 branch, the 1 branch, the 2 branch and the 3 branch. You only visit the 0 branches of the second column (the "b" column), because there is no "b" in "aardvark". Only the 0 branches for "c". You visit the 0 and 1 branches in the "d" column. That's 4 "a" branches \* 2 "d" branches per "a" branch = 8 "d" branches. For "k" you visit the 0 and 1 branches. Now we have 8 "d" branches * 2 "k" branches per "d" branch = 16 "k" branches. 16*3 "r" branches = 48. 24*2 "v" branches = 96.
I hope that this explanation is clear enough. It's much easier to understand if you keep the picture of the tree you posted in your head, and follow the branches your algorithm visits in a breadth-first way.
Sorry, I just had to do that. Read about Direct Acyclic Word Graphs and implement one in C. It will use 90% less memory and be so much faster. You can share all prefixes and all suffixes. Walk through the tree starting at the root and progress to each node that you can add to your word, until you can add no more letters.
Here's what I did: use a bitfield for each of the 26 letters. Use DeBruijn sequences for bit positions. It is unbelievably fast.
For Scrabble you need to use lots of bitfields and bitwise AND operations.
This was all done 20+ years ago, and much better.
This will use 1 MB of RAM and run hundreds of times faster.
Read the World's Fastest Scrabble Algorithm by Appelson & Jacobs (I think).
The 20-year old algorithm not only runs faster than this, but probably runs faster on a 20-year old computer than yours does on a current computer. Well, maybe not.
I'll take a look at the suggested papers, though, thanks.
Re: Article by Darryl Rees (2007-12-23)
Re: Article by Anonymous (2007-12-23)
Oh it is quite possible. I used this several years ago (along with several other people) to solve the ITA addagram challenge years ago. I devised this on my own 6 years ago, and several others did too. I am sure it was around before then too.
Re: Article by Anonymous (2007-12-26)
[url=http://www.itasoftware.com/careers/puzzle_archive.html]ita software puzzle archive[/url]
[url=http://web.archive.org/web/20020802060904/www.itasoftware.com/careers/programmers-archive.php] archive of the solutions[/url]
Re: Article by Anonymous (2007-12-26)
Thanks
I rejected this idea, and thought about another structure, that i like to share. Please read carefully to fully understand.Please comment also and tell if it is good.
I subdivided a terr in several segments, in which i loaded a full dictionary
I took two types of self referential structures.
[1]alpha
[2]len
alpha has one character field, which contains an alphabet.the alphabet denotes the starting alphabet of the words which are under that node.
Simmilar nodes are linked together with a "next" field
Now again i have subdivided the words which have same starting alphabets, and same string length and those are grouped in another tree, under a perticular alpha node.
The alpha node has an array of pointers which points to len type data.There is a char array which denote which pointer array index contains a perticular length of string.
Example:
A main node contains only the words starting with 'a'
under that node one of the len pointer array points to a linear list which contains only 4 string length words starting with 'a' another pointer points to a list who points 6 string length words starting with 'a' etc.
now which pointer index points to which length (4 or 6) is stored inside the main node, in a char array (to reduce memory, 'sted of int, and is matched when searching)
Now the sublists of same string length are not just a linked list, they are linked chunks, or segments, the len nodes.
Each len node contains a 2-d char array each node is capable to store 50 words. When the 50 words are exceeded, a new simmilar len is allocated and then added, and new words are inserted in the new list.
This saves memory space which is taken by the pointer in 1-word per node type list, and also overcomes the problem of allocating large 2d arrays.
Like if there are 1200 words which starts with 'b' and is 6 char in length.then first the 'b' alpha node will be entered. and the 6-string-length node will be visited with the help of the array of pointers(will be created if it does not exist)
Then a len node will be allocated, and 50 words will be stored, after which another len type node will be allocated and more 50 words will be stored.
Like this we can save a lot memory consumed by pointers to indicate the next node.
And small sized arrays are available sparsely in the memory which we link.
Once this is loaded the searching is lightning fast.
like
'stop'
First it visits the node alpha, 's' Then it directly goes into the 4-string-length-word-list with the indexed pointer.
And then performs a linear search.
Then it visits the 't' alpha node and repets the same
And the 'o' and then 'p'
When a match is found, the base address of the string of the string in the node is stored in a pointer to pointer variable.
And at the end passed in the calling function.
I took care of some pitfalls. like in 'eye'
It prints 'eye' two times. As it gets the first 'eye' with the first 'e' and when it enters the 'e' node with the last 'e' it again finds 'eye'
I have kept a flag variable in the main alpha nodes.
Once a main alpha node is visited once the flag if set TRUE
When it is attempted to visit again and see the flag set to TRUE it will at once quit the search.
Like 'madam' (or mmaad whatever you say) it searches with 'm' and gets 'madam'. when visiting the 'm' alpha node it sets the flag variable to TRUE. simmilarly it scans with 'a' and 'd'.
When it comes to the next 'a' it quits the search at once as it finds the flag is set in the 'a' node. Similarly when the last 'm' comes it also finds the flag variable is set in the 'm' node and quits.
So the search cost is saved 33.33%
But for each new set of jumbled word, the flag variables has to be reset. as there is limited and fixed number or alpha nodes 'a' to 'z' (lower case only) so only 26 or such traversal is need with the "next" pointer.
the "down" pointer is used to indicate the len node subtrees (segmented lists)
The loading function is pretty complex, but it loads in lightning speed. And it loads instantly, (not almost, it is instant)And ofcource the search is fast.
But as the words are limited, and very few searches for a computer for nowadays, so i need to calculate the complexity.
The algorithm is under construction. I have thought to just load the len structure nodes with integers, which points to the particular file position, avoiding , loading of the words, which will save memory.
Or this tree can be stored in the file(in a easy decodable manner) and the required segment can be loaded only when it is needed.
I have written a program in C, it works really fine.
Still i have to do a lot of tuning, and reduce the comparisons and arithmetic operations.
Now how is that concept?
Re: Article by legrand (2008-10-23)