Nick's Blog

Avatar

Because repeating myself sucks.

Damn Cool Algorithms, Part 3: Anagram Trees

I hesitate to call this algorithm "damn cool", since it's something I invented* it myself, but I think it _is_ rather cool, and it fits the theme of my algorithms posts, so here it is anyway.

When it comes to finding anagrams of words, a frequent approach is to use an anagram dictionary - simply put, sort the letters in your word to provide a unique key that all anagrams of a word have in common. Another approach is to generate a letter-frequency histogram for each letter in your word. (Both these approaches are more or less equivalent, in fact.) These approaches make the problem of finding exact single-word anagrams for strings very efficient - O(1) if you use a hashtable.

However, the problem of finding subset anagrams - a word that contains a subset of the letters in a string - is still rather inefficient, requiring either a brute force O(n) search through the dictionary, or looking up every substring of the sorted input string, which is O(2^l) with the number of letters in the input string. Finding subset anagrams is significantly more interesting, too, as it has applications in finding multi-word anagrams, as well as being applicable to problem domains such as scrabble.

However, with a little more effort, and the above observation that we can generate a histogram that uniquely represents a given set of letters, we can generate a tree structure that makes looking up subset anagrams much more efficient. To build the tree, we follow this simple procedure:

Assume we have the following information:
  • A lexicon or dictionary of words to populate the tree with
  • An alphabet for words in the lexicon
  • The tree we are building
  • A current node
For each term in the lexicon:
  1. Generate a letter-frequency histogram for the term.
  2. Set the current node to the root of the tree.
  3. For each symbol in the alphabet:
    1. Get the frequency of the current symbol in the current term. Call it f
    2. Set the current node to the fth child node of the current node, creating it if it doesn't exist
  4. Append the current term to the list of words on the current (leaf) node

The result of following this simple procedure is a fixed-height tree, 27 nodes deep, with all the words in the leaf nodes, and each internal tier of the tree corresponding to a symbol from the alphabet. Here's an (abbreviated) example:


Once the tree is built, we find subset anagrams for an input string as follows:

Assume we have the following information:
  • The tree we built using the above procedure.
  • The alphabet we used above.
  • A frontier set, initially empty.
  1. Initialize the frontier set to contain the root of the tree.
  2. Generate a letter-frequency histogram for the input string.
  3. For each symbol in the alphabet:
    1. Get the frequency of the current symbol in the input string. Call it f.
    2. For each node in the current frontier set, add the subnodes numbered 0 through f to the new frontier set.
  4. The frontier set now consists of leaf nodes, containing all the subset anagrams of the input string.

Runtime analysis of this algorithm is rather difficult, for me, at least. Intuitively and in practice, it's a lot faster than either of the brute-force approaches, but quantifying that in big-O notation is something that's escaped me. As an upper bound, it cannot be less efficient than O(n) - only a constant factor worse than the brute-force approach. As a lower bound, a lookup in which the frontier set always has one node, lookup time is proportional to the length of the alphabet, or O(1). The average case depends on how large a subset of the dictionary the input string selects. Quantifying by the size of the output, approximately O(m) operations are required. If anyone knows how to determine more solid bounds for runtime, please do let me know in the comments.

One disadvantage of this approach is that there is substantial memory overhead. Using my Python implementation of the algorithm, and importing /usr/share/dict/words, which is approximately 2MB on this machine results in over 300MB of memory consumed. Using the Pickle module to serialize to disk, the output file is over 30MB, and compresses with gzip down to about 7MB. I suspect part of the large memory overhead is due to the minimum size of Python's dictionaries; I will modify the implementation to use lists and update this post if I can make it more efficient.

Here's a few stats on the tree generated that may be of interest:
Total words: 234,936
Leaf nodes: 215,366
Internal nodes: 1,874,748

From this we can see that the average cardinality of internal nodes is very low - not much more than 1. A breakdown of the number of nodes in each tier helps clarify this:
TierNumber of nodes
0 1
1 7
2 25
3 85
4 203
5 707
6 1145
7 1886
8 3479
9 8156
10 8853
11 10835
12 19632
13 28470
14 47635
15 73424
16 92618
17 94770
18 125018
19 156406
20 182305
21 195484
22 200031
23 203923
24 205649
25 214001

The cardinality of nodes towards the top of the tree is fairly high, but the tree quickly flattens out, and the last four tiers of the tree account for almost half of the total nodes. This suggests one possible space optimisation: Removing the last few tiers of the tree and concatenating their leaf nodes together. When performing lookups, check the selected nodes to ensure they are actually subset anagrams of the input string.

* It's possible I'm simply rediscovering something that's well known in the computer science community, or perhaps mentioned in a computer science paper 30 years ago. Significant searching hasn't turned up anyone using an algorithm like this, or anything else more efficient than the brute-force approaches outlined.

Edit: The source to my initial implementation is here.

Edit: Converting my Python implementation to use lists reduced memory consumption by roughly half. I'll post figures for the pickled tree and the source code when I have the opportunity.

Edit: More updates can be found here.

Comments are closed

20 Comments

  1. Re: Article by JLearn (2007-10-23)

    Nice one, do continue with your 'damn cool algorithms' postings. ;-)
  2. Re: Article by Daniel Wagner (2007-12-22)

    Neat! One possible optimization occurred to me. Suppose you do a preliminary pass on the lexicon to get a frequency analysis. Then, if you branch on unlikely letters earlier, you should keep the total branching down early in the tree (where it really matters). This is just a hypothesis, but thanks for the post -- it's a neat idea!

    ~d
  3. Re: Article by Nick Johnson (2007-12-22)

    Yup - check out the update post [url=http://blog.notdot.net/archives/39-Update-on-Anagram-Trees.html]here[/url] for exactly that. :)
  4. Re: Article by Jules (2007-12-22)

    The time complexity of your algorithm:

    Suppose we have a word w. This word contains a a's, b b's, c c's, etc.

    For example "aardvark", represented as "aaadkrv".
    a = 3, b = 0, c = 0, ..., r = 2, ... z=0

    We have a+1 ways of removing zero or more a's from the word:

    aaadkrv
    aadkrv
    adkrv
    dkrv

    Then we have b+1 ways of removing zero or more b's. (since b=0 in this case, we have 0+1=1 ways of removing zero or more b's). Now, we have (a+1)*(b+1) ways of removing zero or more a's and zero or more b's. Continuing we have (a+1)*(b+1)*(c+1)*...*(z+1) ways of removing zero or more of any letter.

    Let m be the length of the word. The worst case for the algorithm is a=b=c=...=z = m/26. In this case, the number of steps is (m/26 + 1)^26.

    > However, the problem of finding subset anagrams - a word that contains a subset of the letters in a string - is still rather inefficient, requiring either a brute force O(n) search through the dictionary,

    This claim is incorrect. You don't have to search the entire dictionary if you use the same strategy as you used for your algorithm. Build a dictionary of sorted words to lists of words, like:

    "act" => ["act", "cat", ...]
    "aaadkrv" => ["aardvark", ...]

    Now look up the word with zero or more letters removed, like I described above. For aardvark:

    lookup(dict, "aaadkrv") \/ lookup(dict, "aadkrv") \/ ... lookup(dict, "v")

    This algorithm has the same time complexity as yours. Essentially your algorithm uses a trie [1] whereas this algorithm uses a hash table.

    [1] http://en.wikipedia.org/wiki/Trie
  5. Re: Article by Nick Johnson (2007-12-23)

    Thanks for the runtime analysis; however, I don't see how the hashtable approach is equivalent to my trie approach - it seems to me that the hashtable approach requires a much larger number of tries to get all the candidate words.
  6. Re: Article by Darryl Rees (2007-12-23)

    Nick,

    Jules' explanation wasn't that clear to me on first reading either, but essentially he's using a *character sorted* hash key which is equivalent to your histogram.
    eg. aaabbc == (a=3,b=2,c=1).

    IMO the histogram is a more straightforward and flexible approach; eg. you can easily find permutations of groups of subset anagrams that add up to the same histogram as some given string.
  7. Re: Article by Jules (2007-12-23)

    Could you explain what you mean here:

    > permutations of groups of subset anagrams that add up to the same histogram as some given string

    Do you mean words like "aeenttrlp" gives "tree" & "plant"?
  8. Re: Article by Darryl Rees (2007-12-23)

    Yes, sorry it wasn't clearer.
  9. Re: Article by Jules (2007-12-23)

    Can you give an example of a try that you have to look up in the hashtable but you don't have to look up in the tree?

    For aardvark, the hashtable algorithm requires:

    a = 3
    d = 1
    r = 2
    v = 1
    k = 1

    (a+1)(d+1)(r+1)(v+1)(k+1) = 96 lookups

    I think your algorithm requires the same number of steps. You visit four branches of "a": the 0 branch, the 1 branch, the 2 branch and the 3 branch. You only visit the 0 branches of the second column (the "b" column), because there is no "b" in "aardvark". Only the 0 branches for "c". You visit the 0 and 1 branches in the "d" column. That's 4 "a" branches \* 2 "d" branches per "a" branch = 8 "d" branches. For "k" you visit the 0 and 1 branches. Now we have 8 "d" branches * 2 "k" branches per "d" branch = 16 "k" branches. 16*3 "r" branches = 48. 24*2 "v" branches = 96.

    I hope that this explanation is clear enough. It's much easier to understand if you keep the picture of the tree you posted in your head, and follow the branches your algorithm visits in a breadth-first way.
  10. Re: Article by Sam (2007-12-23)

    NOOO!!!!!!!!!!!!!

    Sorry, I just had to do that. Read about Direct Acyclic Word Graphs and implement one in C. It will use 90% less memory and be so much faster. You can share all prefixes and all suffixes. Walk through the tree starting at the root and progress to each node that you can add to your word, until you can add no more letters.

    Here's what I did: use a bitfield for each of the 26 letters. Use DeBruijn sequences for bit positions. It is unbelievably fast.

    For Scrabble you need to use lots of bitfields and bitwise AND operations.

    This was all done 20+ years ago, and much better.

    This will use 1 MB of RAM and run hundreds of times faster.

    Read the World's Fastest Scrabble Algorithm by Appelson & Jacobs (I think).

    The 20-year old algorithm not only runs faster than this, but probably runs faster on a 20-year old computer than yours does on a current computer. Well, maybe not.
  11. Re: Article by Nick Johnson (2007-12-23)

    I'm familiar with DAGs, but I don't believe they will help here - they still rely on order in the word you're searching for, which is entirely contrary to the point when you're searching for anagrams, and leads to the expected combinatorial explosion. Further, what you're suggesting requires evaluating a very large portion of the tree even for simple queries.

    I'll take a look at the suggested papers, though, thanks.
  12. Re: Article by Darryl Rees (2007-12-23)

    Appel and Jacobsen (http://www.nongnu.org/eliot/download/aj.pdf) uses a DAG to represent a minimal finite state recogniser for the lexicon. The DAG stores the ordering of the characters in each word in the lexicon - which is important for their scrabble algorithm but not for finding anagrams. It might use less storage to store the lexicon, but AFAICS it would be a lot slower than the histogram method for exhaustively finding anagrams - you would have to walk the entire graph.
  13. Re: Article by Anonymous (2007-12-23)

    [i]* It's possible I'm simply rediscovering something that's well known in the computer science community...[/i]


    Oh it is quite possible. I used this several years ago (along with several other people) to solve the ITA addagram challenge years ago. I devised this on my own 6 years ago, and several others did too. I am sure it was around before then too.
  14. Re: Article by Nick Johnson (2007-12-23)

    I'm not terribly surprised. However, I couldn't find any references anywhere, so hopefully this post will at least start some people thinking, if nothing else.
  15. Re: Article by Anonymous (2007-12-26)

    [url=http://www.google.com/search?hl=en&q=addagram+ITA+software&btnG=Google+Search]google search[/url]

    [url=http://www.itasoftware.com/careers/puzzle_archive.html]ita software puzzle archive[/url]

    [url=http://web.archive.org/web/20020802060904/www.itasoftware.com/careers/programmers-archive.php] archive of the solutions[/url]
  16. Re: Article by Nick Johnson (2007-12-26)

    If only I'd known to search for "addagram". :)
  17. Re: Article by Anonymous (2007-12-26)

    Just showing that this data structure is in the CS vernacular already. ITA and the developers did not patent it for the very same reasons :) It is a cool algorithm, but I just cringed a little when you used the word "invented"
  18. Re: Article by ie (2008-03-08)

    Very nice post giving an insight into anagrams.

    Thanks
  19. Re: Article by Arjun Pakrashi (2008-07-05)

    It is unbelievable, i thought almost the same concept and similar tree structure, but never tried to implement because it takes a huge memory, but it is fast.

    I rejected this idea, and thought about another structure, that i like to share. Please read carefully to fully understand.Please comment also and tell if it is good.
    I subdivided a terr in several segments, in which i loaded a full dictionary
    I took two types of self referential structures.
    [1]alpha
    [2]len

    alpha has one character field, which contains an alphabet.the alphabet denotes the starting alphabet of the words which are under that node.
    Simmilar nodes are linked together with a "next" field
    Now again i have subdivided the words which have same starting alphabets, and same string length and those are grouped in another tree, under a perticular alpha node.
    The alpha node has an array of pointers which points to len type data.There is a char array which denote which pointer array index contains a perticular length of string.
    Example:
    A main node contains only the words starting with 'a'
    under that node one of the len pointer array points to a linear list which contains only 4 string length words starting with 'a' another pointer points to a list who points 6 string length words starting with 'a' etc.
    now which pointer index points to which length (4 or 6) is stored inside the main node, in a char array (to reduce memory, 'sted of int, and is matched when searching)
    Now the sublists of same string length are not just a linked list, they are linked chunks, or segments, the len nodes.
    Each len node contains a 2-d char array each node is capable to store 50 words. When the 50 words are exceeded, a new simmilar len is allocated and then added, and new words are inserted in the new list.
    This saves memory space which is taken by the pointer in 1-word per node type list, and also overcomes the problem of allocating large 2d arrays.
    Like if there are 1200 words which starts with 'b' and is 6 char in length.then first the 'b' alpha node will be entered. and the 6-string-length node will be visited with the help of the array of pointers(will be created if it does not exist)
    Then a len node will be allocated, and 50 words will be stored, after which another len type node will be allocated and more 50 words will be stored.
    Like this we can save a lot memory consumed by pointers to indicate the next node.
    And small sized arrays are available sparsely in the memory which we link.

    Once this is loaded the searching is lightning fast.
    like
    'stop'
    First it visits the node alpha, 's' Then it directly goes into the 4-string-length-word-list with the indexed pointer.
    And then performs a linear search.

    Then it visits the 't' alpha node and repets the same
    And the 'o' and then 'p'
    When a match is found, the base address of the string of the string in the node is stored in a pointer to pointer variable.
    And at the end passed in the calling function.

    I took care of some pitfalls. like in 'eye'
    It prints 'eye' two times. As it gets the first 'eye' with the first 'e' and when it enters the 'e' node with the last 'e' it again finds 'eye'

    I have kept a flag variable in the main alpha nodes.
    Once a main alpha node is visited once the flag if set TRUE
    When it is attempted to visit again and see the flag set to TRUE it will at once quit the search.

    Like 'madam' (or mmaad whatever you say) it searches with 'm' and gets 'madam'. when visiting the 'm' alpha node it sets the flag variable to TRUE. simmilarly it scans with 'a' and 'd'.
    When it comes to the next 'a' it quits the search at once as it finds the flag is set in the 'a' node. Similarly when the last 'm' comes it also finds the flag variable is set in the 'm' node and quits.
    So the search cost is saved 33.33%
    But for each new set of jumbled word, the flag variables has to be reset. as there is limited and fixed number or alpha nodes 'a' to 'z' (lower case only) so only 26 or such traversal is need with the "next" pointer.
    the "down" pointer is used to indicate the len node subtrees (segmented lists)

    The loading function is pretty complex, but it loads in lightning speed. And it loads instantly, (not almost, it is instant)And ofcource the search is fast.

    But as the words are limited, and very few searches for a computer for nowadays, so i need to calculate the complexity.

    The algorithm is under construction. I have thought to just load the len structure nodes with integers, which points to the particular file position, avoiding , loading of the words, which will save memory.

    Or this tree can be stored in the file(in a easy decodable manner) and the required segment can be loaded only when it is needed.

    I have written a program in C, it works really fine.
    Still i have to do a lot of tuning, and reduce the comparisons and arithmetic operations.

    Now how is that concept?
  20. Re: Article by legrand (2008-10-23)

    Pour un *algorithme* de resolution d'anagrammes : tri des mots du dictionnaire dans l'ORDRE ALPHABETIQUE puis recherche textuelle .