Friday, October 19. 2007Damn Cool Algorithms, Part 3: Anagram TreesTrackbacks
Trackback specific URI for this entry
No Trackbacks
Comments
Display comments as
(Linear | Threaded)
Neat! One possible optimization occurred to me. Suppose you do a preliminary pass on the lexicon to get a frequency analysis. Then, if you branch on unlikely letters earlier, you should keep the total branching down early in the tree (where it really matters). This is just a hypothesis, but thanks for the post -- it's a neat idea!
~d
The time complexity of your algorithm:
Suppose we have a word w. This word contains a a's, b b's, c c's, etc. For example "aardvark", represented as "aaadkrv". a = 3, b = 0, c = 0, ..., r = 2, ... z=0 We have a+1 ways of removing zero or more a's from the word: aaadkrv aadkrv adkrv dkrv Then we have b+1 ways of removing zero or more b's. (since b=0 in this case, we have 0+1=1 ways of removing zero or more b's). Now, we have (a+1)*(b+1) ways of removing zero or more a's and zero or more b's. Continuing we have (a+1)*(b+1)*(c+1)*...*(z+1) ways of removing zero or more of any letter. Let m be the length of the word. The worst case for the algorithm is a=b=c=...=z = m/26. In this case, the number of steps is (m/26 + 1)^26. > However, the problem of finding subset anagrams - a word that contains a subset of the letters in a string - is still rather inefficient, requiring either a brute force O(n) search through the dictionary, This claim is incorrect. You don't have to search the entire dictionary if you use the same strategy as you used for your algorithm. Build a dictionary of sorted words to lists of words, like: "act" => ["act", "cat", ...] "aaadkrv" => ["aardvark", ...] Now look up the word with zero or more letters removed, like I described above. For aardvark: lookup(dict, "aaadkrv") \/ lookup(dict, "aadkrv") \/ ... lookup(dict, "v") This algorithm has the same time complexity as yours. Essentially your algorithm uses a trie [1] whereas this algorithm uses a hash table. [1] http://en.wikipedia.org/wiki/Trie
Thanks for the runtime analysis; however, I don't see how the hashtable approach is equivalent to my trie approach - it seems to me that the hashtable approach requires a much larger number of tries to get all the candidate words.
Nick,
Jules' explanation wasn't that clear to me on first reading either, but essentially he's using a character sorted hash key which is equivalent to your histogram. eg. aaabbc == (a=3,b=2,c=1). IMO the histogram is a more straightforward and flexible approach; eg. you can easily find permutations of groups of subset anagrams that add up to the same histogram as some given string.
Could you explain what you mean here:
> permutations of groups of subset anagrams that add up to the same histogram as some given string Do you mean words like "aeenttrlp" gives "tree" & "plant"?
Yes, sorry it wasn't clearer.
Can you give an example of a try that you have to look up in the hashtable but you don't have to look up in the tree?
For aardvark, the hashtable algorithm requires: a = 3 d = 1 r = 2 v = 1 k = 1 (a+1)(d+1)(r+1)(v+1)(k+1) = 96 lookups I think your algorithm requires the same number of steps. You visit four branches of "a": the 0 branch, the 1 branch, the 2 branch and the 3 branch. You only visit the 0 branches of the second column (the "b" column), because there is no "b" in "aardvark". Only the 0 branches for "c". You visit the 0 and 1 branches in the "d" column. That's 4 "a" branches * 2 "d" branches per "a" branch = 8 "d" branches. For "k" you visit the 0 and 1 branches. Now we have 8 "d" branches * 2 "k" branches per "d" branch = 16 "k" branches. 16*3 "r" branches = 48. 24*2 "v" branches = 96. I hope that this explanation is clear enough. It's much easier to understand if you keep the picture of the tree you posted in your head, and follow the branches your algorithm visits in a breadth-first way.
NOOO!!!!!!!!!!!!!
Sorry, I just had to do that. Read about Direct Acyclic Word Graphs and implement one in C. It will use 90% less memory and be so much faster. You can share all prefixes and all suffixes. Walk through the tree starting at the root and progress to each node that you can add to your word, until you can add no more letters. Here's what I did: use a bitfield for each of the 26 letters. Use DeBruijn sequences for bit positions. It is unbelievably fast. For Scrabble you need to use lots of bitfields and bitwise AND operations. This was all done 20+ years ago, and much better. This will use 1 MB of RAM and run hundreds of times faster. Read the World's Fastest Scrabble Algorithm by Appelson & Jacobs (I think). The 20-year old algorithm not only runs faster than this, but probably runs faster on a 20-year old computer than yours does on a current computer. Well, maybe not.
I'm familiar with DAGs, but I don't believe they will help here - they still rely on order in the word you're searching for, which is entirely contrary to the point when you're searching for anagrams, and leads to the expected combinatorial explosion. Further, what you're suggesting requires evaluating a very large portion of the tree even for simple queries.
I'll take a look at the suggested papers, though, thanks.
Appel and Jacobsen (http://www.nongnu.org/eliot/download/aj.pdf) uses a DAG to represent a minimal finite state recogniser for the lexicon. The DAG stores the ordering of the characters in each word in the lexicon - which is important for their scrabble algorithm but not for finding anagrams. It might use less storage to store the lexicon, but AFAICS it would be a lot slower than the histogram method for exhaustively finding anagrams - you would have to walk the entire graph.
* It's possible I'm simply rediscovering something that's well known in the computer science community...
Oh it is quite possible. I used this several years ago (along with several other people) to solve the ITA addagram challenge years ago. I devised this on my own 6 years ago, and several others did too. I am sure it was around before then too.
I'm not terribly surprised. However, I couldn't find any references anywhere, so hopefully this post will at least start some people thinking, if nothing else.
Just showing that this data structure is in the CS vernacular already. ITA and the developers did not patent it for the very same reasons
It is unbelievable, i thought almost the same concept and similar tree structure, but never tried to implement because it takes a huge memory, but it is fast.
I rejected this idea, and thought about another structure, that i like to share. Please read carefully to fully understand.Please comment also and tell if it is good. I subdivided a terr in several segments, in which i loaded a full dictionary I took two types of self referential structures. [1]alpha [2]len alpha has one character field, which contains an alphabet.the alphabet denotes the starting alphabet of the words which are under that node. Simmilar nodes are linked together with a "next" field Now again i have subdivided the words which have same starting alphabets, and same string length and those are grouped in another tree, under a perticular alpha node. The alpha node has an array of pointers which points to len type data.There is a char array which denote which pointer array index contains a perticular length of string. Example: A main node contains only the words starting with 'a' under that node one of the len pointer array points to a linear list which contains only 4 string length words starting with 'a' another pointer points to a list who points 6 string length words starting with 'a' etc. now which pointer index points to which length (4 or 6) is stored inside the main node, in a char array (to reduce memory, 'sted of int, and is matched when searching) Now the sublists of same string length are not just a linked list, they are linked chunks, or segments, the len nodes. Each len node contains a 2-d char array each node is capable to store 50 words. When the 50 words are exceeded, a new simmilar len is allocated and then added, and new words are inserted in the new list. This saves memory space which is taken by the pointer in 1-word per node type list, and also overcomes the problem of allocating large 2d arrays. Like if there are 1200 words which starts with 'b' and is 6 char in length.then first the 'b' alpha node will be entered. and the 6-string-length node will be visited with the help of the array of pointers(will be created if it does not exist) Then a len node will be allocated, and 50 words will be stored, after which another len type node will be allocated and more 50 words will be stored. Like this we can save a lot memory consumed by pointers to indicate the next node. And small sized arrays are available sparsely in the memory which we link. Once this is loaded the searching is lightning fast. like 'stop' First it visits the node alpha, 's' Then it directly goes into the 4-string-length-word-list with the indexed pointer. And then performs a linear search. Then it visits the 't' alpha node and repets the same And the 'o' and then 'p' When a match is found, the base address of the string of the string in the node is stored in a pointer to pointer variable. And at the end passed in the calling function. I took care of some pitfalls. like in 'eye' It prints 'eye' two times. As it gets the first 'eye' with the first 'e' and when it enters the 'e' node with the last 'e' it again finds 'eye' I have kept a flag variable in the main alpha nodes. Once a main alpha node is visited once the flag if set TRUE When it is attempted to visit again and see the flag set to TRUE it will at once quit the search. Like 'madam' (or mmaad whatever you say) it searches with 'm' and gets 'madam'. when visiting the 'm' alpha node it sets the flag variable to TRUE. simmilarly it scans with 'a' and 'd'. When it comes to the next 'a' it quits the search at once as it finds the flag is set in the 'a' node. Similarly when the last 'm' comes it also finds the flag variable is set in the 'm' node and quits. So the search cost is saved 33.33% But for each new set of jumbled word, the flag variables has to be reset. as there is limited and fixed number or alpha nodes 'a' to 'z' (lower case only) so only 26 or such traversal is need with the "next" pointer. the "down" pointer is used to indicate the len node subtrees (segmented lists) The loading function is pretty complex, but it loads in lightning speed. And it loads instantly, (not almost, it is instant)And ofcource the search is fast. But as the words are limited, and very few searches for a computer for nowadays, so i need to calculate the complexity. The algorithm is under construction. I have thought to just load the len structure nodes with integers, which points to the particular file position, avoiding , loading of the words, which will save memory. Or this tree can be stored in the file(in a easy decodable manner) and the required segment can be loaded only when it is needed. I have written a program in C, it works really fine. Still i have to do a lot of tuning, and reduce the comparisons and arithmetic operations. Now how is that concept? |
Calendar
QuicksearchArchivesCategoriesBlog Administration |
|||||||||||||||||||||||||||||||||||||||||||||||||