Indexing directory structures

After much messing around with SQL DBs and table structures, I eventually resorted to writing my own indexing scheme and storing the index in memory for my FS-indexing needs.
Here's how I went about it:

1) Construct a tree out of the FS you're indexing. Each node in the tree needs a reference to its parent, but if you're just using it for this index, it doesn't actually need references to its children.
2) Construct a dictionary of lists of nodes to act as the index. Dictionary<string, List<Node>> in generics/templates speak.
3) Iterate through each node and extract terms for its path component. Terms are only extracted for the current component, not its parents - "c:\foo\bar" would only have 'bar' as a term.
4) Find the relevant entry in the index for each term (or add it, if need be), and add the current node to the list of nodes against that entry. (A rough sketch of these steps follows below.)
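
For illustration, here's a minimal sketch of those steps in Python. The original code isn't shown (the Dictionary<> mention suggests it was .NET), so the names here - Node, extract_terms, build_index - and the tokenisation rule are my own assumptions, not the post's implementation:

    import os
    import re
    from collections import defaultdict

    class Node:
        """One path component; only the link back to the parent is needed."""
        __slots__ = ("name", "parent")
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent

    def extract_terms(component):
        # Hypothetical tokenisation: lower-case, split on non-alphanumerics.
        # The post doesn't say how terms are extracted, so this is a guess.
        return [t for t in re.split(r"[^a-z0-9]+", component.lower()) if t]

    def build_index(root_path):
        root_path = os.path.normpath(root_path)
        index = defaultdict(list)   # term -> list of Nodes
        nodes = {}                  # directory path -> Node, so children can find their parent
        for dirpath, _dirnames, filenames in os.walk(root_path):
            parent = nodes.get(os.path.dirname(dirpath))
            node = Node(os.path.basename(dirpath) or dirpath, parent)
            nodes[dirpath] = node
            # Terms come from this component only, never from its parents.
            for term in extract_terms(node.name):
                index[term].append(node)
            for filename in filenames:
                child = Node(filename, node)
                for term in extract_terms(filename):
                    index[term].append(child)
        return index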

Once you've constructed the index, searching is as follows:
1) Extract the terms for your search query in the same manner as used for path components above.
2) Obtain the list from the index for each of these terms. If any of the lists are empty, return immediately with an empty set - no results are possible.
3) For each node in the returned lists, follow the chain of parents, checking which of the search terms are present in the node and its ancestors. If you match all terms, add the node to the result set; if you reach the root of the tree without doing so, discard it. (Again, a sketch follows below.)
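
A matching sketch of the search, reusing the hypothetical Node / extract_terms / build_index helpers from the snippet above:

    def search(index, query):
        terms = set(extract_terms(query))        # 1) tokenise the query the same way
        term_lists = []
        for term in terms:
            nodes = index.get(term)
            if not nodes:                        # 2) a missing or empty list means no results
                return set()
            term_lists.append(nodes)
        results = set()
        for nodes in term_lists:                 # 3) consider every candidate from every list
            for node in nodes:
                found, current = set(), node
                while current is not None:       # walk the parent chain up to the root
                    found |= terms & set(extract_terms(current.name))
                    current = current.parent
                if found == terms:               # all query terms seen somewhere on the path
                    results.add(node)
        return results

Usage would be along the lines of results = search(build_index(r"c:\projects"), "foo bar") - the path is purely illustrative.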

I can't help thinking it could be more efficient. There's got to be a way to intersect the sets returned for each term, or to make use of the property that the end results for a search can only consist of items in each term's list or their descendants, but I can't think how to do it right now.
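
Purely speculative, but one way to read the intersection idea - not something from the scheme above - is to compute, for each term, the set of candidates whose ancestor path contains that term (memoising the ancestor walk so each node's path terms are worked out only once), and then intersect those sets. It should give the same results as the search sketched earlier:

    def search_by_intersection(index, query):
        # Speculative alternative, not the scheme described above: the search is
        # phrased as a per-term set intersection with memoised ancestor terms.
        terms = set(extract_terms(query))
        term_lists = [index.get(t, []) for t in terms]
        if not terms or any(not lst for lst in term_lists):
            return set()
        path_terms = {}                          # Node -> all terms on its ancestor path
        def terms_on_path(node):
            if node is None:
                return set()
            if node not in path_terms:
                path_terms[node] = set(extract_terms(node.name)) | terms_on_path(node.parent)
            return path_terms[node]
        candidates = set().union(*term_lists)    # results can only come from some term's list
        result = candidates
        for term in terms:                       # intersect the per-term matching subsets
            result &= {n for n in candidates if term in terms_on_path(n)}
        return result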

Performance tests: with an index of over 600,000 items, searches for 4 terms return in less than 1/100 of a second. Searches with more terms show only a linear increase in time taken, of course.
