Inverted Indexes – Inside How Search Engines Work

An Inverted Index is a structure used by search engines and databases to map search terms to the files or documents they appear in, trading extra work when writing a document into the index for fast searches against the index later on. There are two versions of an inverted index: a record-level index, which tells you which documents contain a term, and a fully inverted index, which tells you both which documents contain a term and where in each document it appears. For example, if you built a search engine to search the contents of sentences and it was fed these sentences:

{0} - "Turtles love pizza"
{1} - "I love my turtles"
{2} - "My pizza is good"

Then you would store them in an Inverted Index like this:

            Record Level     Fully Inverted
"turtles"   {0, 1}           { (0, 0), (1, 3) }
"love"      {0, 1}           { (0, 1), (1, 1) }
"pizza"     {0, 2}           { (0, 2), (2, 1) }
"i"         {1}              { (1, 0) }
"my"        {1, 2}           { (1, 2), (2, 0) }
"is"        {2}              { (2, 2) }
"good"      {2}              { (2, 3) }

The record-level sets hold just the ids of the documents where each word appears. In the fully inverted sets, the first number inside the parentheses is the document id and the second number is the word's position within that document.

So now if you wanted to search all three documents for the words “my turtles” you would grab the sets (looking at record level only):

"turtles"   {0, 1}
"my"        {1, 2}

Then you would intersect those sets, leaving document 1 as the only match. Using the Fully Inverted Index would also tell us that the word “my” appeared at position 2 and the word “turtles” at position 3, in case word position is important to your search.

There is no standard implementation for an Inverted Index, as it's more of a concept than an actual algorithm; this, however, gives you a lot of options.

For the index you can choose to use things like Hashtables, BTrees, or any other fast-lookup data structure.

The intersection becomes a more interesting problem. You can use Bloom Filters if 100% accuracy isn't needed, or you can brute force the problem with a full scan of each set, joining two sets in O(M+N) time. You can also try something a little more complicated: rumor has it that search engines like Google and Bing only merge results until they have enough for a search page and then dump the sets they are loading, though I know very little about how they actually solve this problem.
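
Here is a minimal sketch of that brute-force merge (my own illustration, not part of the implementation below), assuming the two posting lists are kept sorted by document id:

using System.Collections.Generic;

public static class PostingLists
{
    // Classic O(M+N) merge intersection over two sorted posting lists.
    public static List<int> IntersectSorted(IList<int> a, IList<int> b)
    {
        var result = new List<int>();
        int i = 0, j = 0;

        while (i < a.Count && j < b.Count)
        {
            if (a[i] == b[j])
            {
                result.Add(a[i]);
                i++;
                j++;
            }
            else if (a[i] < b[j])
            {
                i++; // advance whichever list is behind
            }
            else
            {
                j++;
            }
        }

        return result;
    }
}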

Here is an example of a simple Inverted Index written in C# that uses a Dictionary as the index and the LINQ Intersect function:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class InvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> _index = new Dictionary<string, HashSet<int>>();
    private readonly Regex _findWords = new Regex(@"[A-Za-z]+");

    public void Add(string text, int docId)
    {
        // Lowercase so "Turtles" and "turtles" map to the same term.
        var words = _findWords.Matches(text.ToLower());

        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;

            if (!_index.ContainsKey(word))
                _index[word] = new HashSet<int>();

            _index[word].Add(docId); // HashSet ignores duplicate ids
        }
    }

    public List<int> Search(string keywords)
    {
        var words = _findWords.Matches(keywords.ToLower());
        IEnumerable<int> rtn = null;

        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;
            if (_index.ContainsKey(word))
            {
                // Intersect the posting sets of every search term.
                rtn = rtn == null ? _index[word] : rtn.Intersect(_index[word]);
            }
            else
            {
                // A term missing from the index means nothing can match.
                return new List<int>();
            }
        }

        return rtn != null ? rtn.ToList() : new List<int>();
    }
}
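
Feeding it the three example sentences from earlier gives the match we worked out by hand (a quick usage sketch):

var index = new InvertedIndex();
index.Add("Turtles love pizza", 0);
index.Add("I love my turtles", 1);
index.Add("My pizza is good", 2);

var matches = index.Search("my turtles"); // -> { 1 }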
 

The Bloom Filter – Probably Answering The Question “Do I Exist?”

Originally conceived in 1970 by Burton Howard Bloom, the Bloom Filter is a probabilistic data structure that can determine whether something is likely to be part of a set. The upside of a Bloom Filter is that you can find out quickly if something definitely doesn't exist, and within a certain margin of error whether it does. In other words, you will get false positives, but never false negatives.

So how does it work? Well, to start, a Bloom Filter has three elements to it:

M – The number of bits used to track the existence of items.
N – The number of items expected to be entered into the Bloom Filter.
K – The number of hash functions the Bloom Filter will use.

To use these three elements, all we have to do is run each of the N items through all K hash functions, turn each hash into an index into the bit array of size M, and set all of those bits to 1.

For example, say I have three words I want to track in my filter, {“test”, “cow”, “pizza”}, making my problem set size N=3. I could track these in a 10 bit array (M=10) using three hashes (K=3): MD5, SHA512, and RIPEMD160.

Our array starts off looking like this:

 0  1  2  3  4  5  6  7  8  9
[0][0][0][0][0][0][0][0][0][0]

For each hash function we'll hash the word, convert the hash to a 64 bit integer, mod it by M (10), and take the absolute value of the result, then mark the array at that index with a 1. Here are the indexes for the above words and hashes:

      MD5   SHA512  RIPEMD160
test   7       8        8
cow    9       7        6
pizza  4       2        4
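
In code, the word-to-index conversion that produced this table looks roughly like this (a standalone sketch; the exact index depends on the real hash output, so treat the table values as illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

class HashIndexDemo
{
    static void Main()
    {
        const int m = 10; // size of the bit array

        using (var md5 = MD5.Create())
        {
            var bytes = md5.ComputeHash(Encoding.UTF8.GetBytes("test"));

            // Read the first 8 bytes as a 64 bit integer, mod by M, drop the sign.
            var index = (int)Math.Abs(BitConverter.ToInt64(bytes, 0) % m);
            Console.WriteLine(index); // some value in [0, 9]
        }
    }
}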

By marking our array using those indexes we end up with our bloom filter looking like this:

 0  1  2  3  4  5  6  7  8  9
[0][0][1][0][1][0][1][1][1][1]

Now if we check whether the words “pizza”, “trout”, and “blue” exist, all we need to do is run them through the same three hashes and make sure that all the indexes they point to are marked 1. Here's what the hashes produce:

      MD5   SHA512  RIPEMD160
pizza  4       2        4
trout  0       3        3
blue   8       8        4

And as you can see by matching those indexes against our array, the words “pizza” and “blue” appear to exist, while the word “trout” doesn't. The Bloom Filter works!

But wait, we never added “blue” to the array, so why did it come back as a match? It's because we've made the values of M and K too small, so our false positive rate is too large. How do we determine the false positive rate? Without getting into the math behind it, a good approximation for the error is:

p ≈ (1 - e^(-K(N + 0.5)/(M - 1)))^K

Running our numbers of M=10, K=3, and N=3 through this equation, we find our false positive rate is going to be 32.65%! To fix this all we need to do is either increase M or K (in this case M, since our array is already getting pretty saturated). So if we increase M to, say, 100 we'll drop our error rate to 0.1% for 3 items.
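
As a quick sanity check, here's that approximation evaluated in C# (the same variant of the formula that the FalsePositiveRate method in the implementation below uses):

using System;

class BloomErrorDemo
{
    // Approximate false positive rate for m bits, k hashes, and n items.
    static double Rate(int m, int k, int n)
    {
        return Math.Pow(1.0 - Math.Exp(-k * (n + 0.5) / (m - 1.0)), k);
    }

    static void Main()
    {
        Console.WriteLine(Rate(10, 3, 3));  // ~0.3265 -> 32.65%
        Console.WriteLine(Rate(100, 3, 3)); // ~0.0010 -> 0.1%
    }
}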

So how does this work in practice? Let's say we want a fast spell checker whose only concern is whether a word is spelled correctly, and we're not concerned with finding possible suggestions (this is useful in things like browsers, which mark misspelled words without showing possible matches unless the user takes action, at which point you can fall back to a BK-Tree). If you base this spell check on the enable1 dictionary, your problem set is 172,820 words. We can spell check against that list with 1 MB (8,388,608 bits) of memory and an error rate of 0.02% using only three hash functions. This nets an extremely fast (bounded mostly by hash function speed) and memory-efficient way to perform a simple spell check with an acceptable margin of error.

So where might you find this being used in the wild? As it turns out this structure is used all over the place, for instance in Bigtable, Cassandra, and Chrome.

Below is a working example of a simple Bloom Filter in .NET 4.5.

/*
    The MIT License (MIT)
    Copyright (c) 2013

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
    to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
    and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;
using System.Security.Cryptography;

public class BloomFilter
{
    public int M { get; private set; }

    public int N { get; private set; }

    public int K { get { return _hashAlgorithms.Count; } }

    private readonly BitArray _filter;

    private readonly List<HashAlgorithm> _hashAlgorithms;

    public BloomFilter(int size)
    {
        _filter = new BitArray(size);
        M = size;

        _hashAlgorithms = new List<HashAlgorithm> { MD5.Create(), SHA512.Create(), RIPEMD160.Create() };
    }

    public void Add(string word)
    {
        N++;
        foreach (var index in _GetIndexes(word))
        {
            _filter[index] = true;
        }
    }

    public bool Contains(string word)
    {
        // The word is only (probably) present if every hash index is set.
        var matchCount = 0;
        foreach (var index in _GetIndexes(word))
        {
            if (_filter[index])
                matchCount++;
        }

        return matchCount == K;
    }

    private IEnumerable<int> _GetIndexes(string word)
    {
        var hashIndexes = new List<int>();

        foreach (var hashAlgorithm in _hashAlgorithms)
        {
            // Hash the word, read the first 8 bytes as a 64 bit integer,
            // mod by M, and take the absolute value to get a bit index.
            var bytes = hashAlgorithm.ComputeHash(Encoding.UTF8.GetBytes(word));
            hashIndexes.Add((int)Math.Abs(BitConverter.ToInt64(bytes, 0) % M));
        }

        return hashIndexes;
    }

    public decimal FalsePositiveRate()
    {
        return Convert.ToDecimal(Math.Pow(1.0 - (Math.Exp(-K * (N + 0.5) / (M - 1))), K));
    }
}
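
A quick usage sketch mirroring the worked example above (whether “blue” really comes back as a false positive depends on the actual hash values, so treat that line as illustrative):

var filter = new BloomFilter(10); // M = 10, deliberately undersized

filter.Add("test");
filter.Add("cow");
filter.Add("pizza");

Console.WriteLine(filter.Contains("pizza")); // True  - it was added
Console.WriteLine(filter.Contains("trout")); // False - definitely not present
Console.WriteLine(filter.Contains("blue"));  // may be True: a false positive
Console.WriteLine(filter.FalsePositiveRate()); // ~0.33 for M=10, K=3, N=3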

The BK-Tree – A Data Structure for Spell Checking

Proposed by Burkhard and Keller in 1973, the BK-Tree is a data structure used for spell checking based on the Levenshtein Distance between two words, which is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into another. Some example distances:

LevenshteinDistance(cook, book) -> 1
LevenshteinDistance(cook, books) -> 2
LevenshteinDistance(what, water) -> 3

To make the last example clear, the distance between what and water comes from three moves: first drop the h, then add the e, and finally add the r. You can use this distance to work out how close someone is to spelling a word correctly. For instance, if I want to know which word in the set [cook, book, books, what, water] the misspelling wat is closest to, I could do a full scan of the list and find the words with the lowest distance:

LevenshteinDistance(wat, cook) -> 4
LevenshteinDistance(wat, book) -> 4
LevenshteinDistance(wat, books) -> 5
LevenshteinDistance(wat, what) -> 1
LevenshteinDistance(wat, water) -> 2

Based on that search I can determine the user probably meant the word what due to it having the lowest distance of 1.

This works for a small number of words, since O(n) isn't bad in that case; however, if I wanted to scan a full dictionary for the closest match, O(n) isn't optimal. This is where a BK-Tree comes in.
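
For reference, the brute-force scan is only a few lines (a sketch using the static LevenshteinDistance method from the BK-Tree implementation further down, with System.Linq in scope):

var words = new[] { "cook", "book", "books", "what", "water" };

// O(n): compute the distance to every word and keep the closest.
var best = words.OrderBy(w => BKTree.LevenshteinDistance("wat", w)).First();
// best == "what" (distance 1)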

To build a BK-Tree, all you have to do is take any word from your set and plop it in as your root node, then add words to the tree based on their distance to the root. For instance, if I built a tree from the word set [book, books, cake], making the word book my root node, it would look like this:

book
 ├─1─ books
 └─4─ cake

This is because:

LevenshteinDistance(book, books) -> 1
LevenshteinDistance(book, cake) -> 4

Of course now if we add the word boo to the mix we get a conflict because:

LevenshteinDistance(book, boo) -> 1

which collides with the edge already used by books. To handle this we chain boo off of books, making it a child of books keyed by their Levenshtein Distance.

LevenshteinDistance(books, boo) -> 2

book
 ├─1─ books
 │     └─2─ boo
 └─4─ cake

We continue to use this strategy as we add words, so if we throw in [Cape, Boon, Cook, Cart] then we get this:

book
 ├─1─ books
 │     └─2─ boo
 │           ├─1─ boon
 │           └─2─ cook
 └─4─ cake
       ├─1─ cape
       └─2─ cart

Now that we have our structure, the obvious question is how do we search it? This is simple: take the misspelled word and find all matches within a certain tolerance, which we'll call N. We do this by taking the Levenshtein Distance between our word and the root, then crawling every child edge whose distance is within ±N of that value.

For example, if we searched for the closest match to the misspelling caqe with a tolerance of 1 in the above structure, we would crawl it like this:

1) LevenshteinDistance(caqe, book) -> 4
    a) Check (4 <= 1) - Don't add book as a match
    b) Crawl all edges between 3 and 5 (4-1,4+1)
2) Crawl into edge path 4 from book
3) LevenshteinDistance(caqe, cake) -> 1
    a) Check (1 <= 1) - Add cake as a match
    b) Crawl all edges between 0 and 2 (1-1, 1+1)
4) Crawl into edge path 1 from cake
5) LevenshteinDistance(caqe, cape) -> 1
    a) (1 <= 1) - Add cape as a match
6) Crawl into edge path 2 from cake
7) LevenshteinDistance(caqe, cart) -> 2
    a) Check (2 <= 1) - Don't add cart as a match

From this example it appears we can now find misspellings in roughly O(log n) time, which is better than our O(n) from before. This however doesn't tell the whole story, because the tolerance can drastically increase the number of nodes we need to visit in order to do the search. A discussion over a BK-Tree implementation in Lucene seems to indicate a tolerance of 2 should limit a dictionary search to 5-10% of the tree, while others have run tests with random strings to get their own metrics (though random strings are a poor representation of real-world misspellings). In practice the speed improvement of this structure is going to be highly dependent on the set of words you are using and the typical inputs you run into, so independent research is needed to see if it applies to any specific problem (such as matching misspelled city names).

As always here’s a C# implementation under the MIT license:

/*
The MIT License (MIT)
Copyright (c) 2013

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public class BKTree
{
	private Node _Root;

	public void Add(string word)
	{
		word = word.ToLower();
		if (_Root == null)
		{
			_Root = new Node(word);
			return;
		}

		var curNode = _Root;

		// Walk down existing edges until this word's distance finds a free slot.
		var dist = LevenshteinDistance(curNode.Word, word);
		while (curNode.ContainsKey(dist))
		{
			if (dist == 0) return; // the word is already in the tree

			curNode = curNode[dist];
			dist = LevenshteinDistance(curNode.Word, word);
		}

		curNode.AddChild(dist, word);
	}

	public List<string> Search(string word, int d)
	{
		var rtn = new List<string>();
		if (_Root == null) return rtn;

		word = word.ToLower();
		RecursiveSearch(_Root, rtn, word, d);

		return rtn;
	}

	private void RecursiveSearch(Node node, List<string> rtn, string word, int d)
	{
		var curDist = LevenshteinDistance(node.Word, word);
		var minDist = curDist - d;
		var maxDist = curDist + d;

		if (curDist <= d)
			rtn.Add(node.Word);

		// Only child edges keyed within curDist ± d can contain matches.
		foreach (var key in node.Keys.Cast<int>().Where(key => minDist <= key && key <= maxDist))
		{
			RecursiveSearch(node[key], rtn, word, d);
		}
	}

	public static int LevenshteinDistance(string first, string second)
	{
		if (first.Length == 0) return second.Length;
		if (second.Length == 0) return first.Length;

		var lenFirst = first.Length;
		var lenSecond = second.Length;

		var d = new int[lenFirst + 1, lenSecond + 1];

		for (var i = 0; i <= lenFirst; i++)
			d[i, 0] = i;

		for (var i = 0; i <= lenSecond; i++)
			d[0, i] = i;

		for (var i = 1; i <= lenFirst; i++)
		{
			for (var j = 1; j <= lenSecond; j++)
			{
				var match = (first[i - 1] == second[j - 1]) ? 0 : 1;

				d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + match);
			}
		}

		return d[lenFirst, lenSecond];
	}
}

public class Node
{
	public string Word { get; set; }
	public HybridDictionary Children { get; set; }

	public Node() { }

	public Node(string word)
	{
		this.Word = word.ToLower();
	}

	public Node this[int key]
	{
		get { return (Node)Children[key]; }
	}

	public ICollection Keys
	{
		get
		{
			if(Children == null)
				return new ArrayList();
			return Children.Keys;
		}
	}

	public bool ContainsKey(int key)
	{
		return Children != null && Children.Contains(key);
	}

	public void AddChild(int key, string word)
	{
		if(this.Children == null)
			Children = new HybridDictionary();
		this.Children[key] = new Node(word);
	}
}
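
A quick usage sketch against the example tree from above (match order depends on traversal order):

var tree = new BKTree();
foreach (var w in new[] { "book", "books", "boo", "cake", "cape", "boon", "cook", "cart" })
    tree.Add(w);

var matches = tree.Search("caqe", 1); // -> ["cake", "cape"]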

The Trie Data Structure – A Prefix Tree for Autocompletes

After covering the GADDAG Data Structure it seems like a good idea to follow up on the structure it's based on, the Trie. The Trie is a specialized structure optimized for prefix searches, first proposed in 1960 by Edward Fredkin. It has the advantage of using less memory than a GADDAG, but the downside of being efficient for only one purpose. It just so happens that purpose is very common on websites: autocompletion.

A Trie for any one word looks like a linked list, as seen here:

[Figure: a Trie containing a single word: a chain of nodes from the root, one per letter, ending in a terminator node]

No matter how complex a Trie gets, every word will only appear once and it will appear like it does above somewhere in the tree.

Let’s look at a more complex structure containing the words “Call”, “Cat”, “Cater”, “Bat”, “Bake”, “Cake”, and “Cope”:

(root)
 ├─ c
 │   ├─ a
 │   │   ├─ l ─ l ─ $        (call)
 │   │   ├─ t ─ $            (cat)
 │   │   │    └─ e ─ r ─ $   (cater)
 │   │   └─ k ─ e ─ $        (cake)
 │   └─ o ─ p ─ e ─ $        (cope)
 └─ b
     └─ a
         ├─ t ─ $            (bat)
         └─ k ─ e ─ $        (bake)

Based on the example above, if you wanted to know every word that contains the prefix “CA” you can simply walk to the node “C” from the root, then down to “A”, and then crawl the rest of the subtree to grab all the words that are hanging off of that node.

A Trie has a very fast prefix search time of O(M), where M is the number of letters in the prefix, and a return time of O(M+N), where N is the number of nodes hanging off of that prefix. This makes the absolute worst case for a Trie search O(N), where N is the number of nodes in the Trie, which happens when you search with an empty string as the prefix.

The worst case can be avoided in situations like autocompletes, where you will likely not return every possible word hanging off the prefix but instead only the top N results. This is shown in the code below, and it drastically reduces the time needed to return when dealing with large tries and short prefixes, for example when you autocomplete against a dictionary and the user has just entered their first keystroke.

Here is a sample of a Trie implemented in .NET 4.5. I'm licensing it under the MIT License so it is free to use however you wish. This implementation does not include optimizations like using binary arrays to store characters or node compression (which would turn this into a radix tree); it was written to be simple to follow for those who want to walk through the operation of a Trie.

/*
    The MIT License (MIT)
    Copyright (c) 2013

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
    to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
    and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public class Trie
{
    public Node RootNode { get; private set; }

    public Trie()
    {
        RootNode = new Node { Letter = Node.Root };
    }

    public void Add(string word)
    {
        word = word.ToLower() + Node.Eow;
        var currentNode = RootNode;
        foreach(var c in word)
        {
            currentNode = currentNode.AddChild(c);
        }
    }

    public List<string> Match(string prefix, int? maxMatches)
    {
        prefix = prefix.ToLower();

        var set = new HashSet<string>();

        _MatchRecursive(RootNode, set, "", prefix, maxMatches);
        return set.ToList();
    }

    private static void _MatchRecursive(Node node, ISet<string> rtn, string letters, string prefix, int? maxMatches)
    {
        if (maxMatches != null && rtn.Count == maxMatches)
            return;

        // Null nodes mark the end of a word, so record the letters walked to get here.
        if (node == null)
        {
            rtn.Add(letters);
            return;
        }

        // Skip the root marker so results don't pick up a leading space.
        if (node.Letter != Node.Root)
            letters += node.Letter.ToString();

        if (prefix.Length > 0)
        {
            // Still consuming the prefix: follow only the matching child.
            if (node.ContainsKey(prefix[0]))
            {
                _MatchRecursive(node[prefix[0]], rtn, letters, prefix.Remove(0, 1), maxMatches);
            }
        }
        else
        {
            // Prefix consumed: crawl every child hanging off this node.
            foreach (char key in node.Keys)
            {
                _MatchRecursive(node[key], rtn, letters, prefix, maxMatches);
            }
        }
    }
}
public class Node
{
    public const char Eow = '$';
    public const char Root = ' ';

    public char Letter { get; set; }
    public HybridDictionary Children { get; private set; }

    public Node() { }

    public Node(char letter)
    {
        this.Letter = letter;
    }

    public Node this[char index]
    {
        get { return (Node)Children[index]; }
    }

    public ICollection Keys
    {
        get
        {
            // Guard against nodes that never had children added.
            if (Children == null)
                return new ArrayList();
            return Children.Keys;
        }
    }

    public bool ContainsKey(char key)
    {
        return Children != null && Children.Contains(key);
    }

    public Node AddChild(char letter)
    {
        if (Children == null)
            Children = new HybridDictionary();

        if (!Children.Contains(letter))
        {
            var node = letter != Eow ? new Node(letter) : null;
            Children.Add(letter, node);
            return node;
        }

        return (Node)Children[letter];
    }

    public override string ToString()
    {
        return this.Letter.ToString();
    }
}
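
A quick usage sketch (match order depends on how the dictionary enumerates its keys):

var trie = new Trie();
foreach (var w in new[] { "call", "cat", "cater", "bat", "bake", "cake", "cope" })
    trie.Add(w);

var all = trie.Match("ca", null); // -> call, cat, cater, cake
var top2 = trie.Match("ca", 2);   // stops crawling after two matches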

Gaddag Data Structure – A Way To Quickly Find Scrabble Words

GADDAG is a data structure proposed by Steven Gordon that is optimized for searching for words in games like Scrabble and Words With Friends. It's similar to a Trie prefix tree; however, it's organized so that all “hooks” (letter sequences already on the board) are reachable from the root node. This is done by storing every possible split of a word into a prefix and suffix, with the prefix stored reversed off of the root node and the two parts separated by a break node, typically represented by the “>” character.

For instance the word “Call” stored in a GADDAG looks like this:

c>all$
ac>ll$
lac>l$
llac>$

(each line is a path from the root; paths that start with the same letter share nodes)

The magic of this structure is that if you want to check whether a letter string exists, all you need to do is reverse the string and follow the branch starting from its first letter. For instance, here is a structure that contains “Call”, “Ball”, and “All”:

[Figure: a GADDAG containing “Call”, “Ball”, and “All”, with every hook reachable from the root]

You can find all words containing “L” by taking the “L” edge from the root and then crawling down all of its branches. Beyond a single letter, you can look for words that contain “ALL” by drilling down the nodes in reverse, traversing L→L→A.

One of the downsides to this structure is that it takes up a lot of memory; however, Gordon's paper gives instructions on how to perform a partial compression of the nodes by rejoining them after the break, as seen here:

[Figure: a partially compressed GADDAG, with the suffix nodes after the break merged back into shared paths]

Here is a sample GADDAG implemented in .NET 4.5. I'm licensing it under the MIT License so it is free to use however you wish:

/*
    The MIT License (MIT)
    Copyright (c) 2013

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
    to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
    and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */
using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public class Gaddag
{
    public Node RootNode { get; private set; }

    public Gaddag()
    {
        RootNode = new Node { Letter = Node.Root };
    }

    public void Add(string word)
    {
        var w = word.ToLower();
        var prevNode = new List<Node>();
        for (var i = 1; i <= w.Length; i++)
        {
            var prefix = w.Substring(0, i);
            var suffix = w.Substring(i); // empty when i == w.Length
            var addword = _StringReverse(prefix) + Node.Break + suffix + Node.Eow;

            var currentNode = RootNode;
            var breakFound = false;
            int j = 0;
            foreach (var c in addword)
            {
                //Long words can be joined back together after the break point, cutting the tree size in half.
                if (breakFound && prevNode.Count > j)
                {
                    currentNode.AddChild(c, prevNode[j]);
                    break;
                }

                currentNode = currentNode.AddChild(c);

                if (prevNode.Count == j)
                    prevNode.Add(currentNode);

                if (c == Node.Break)
                    breakFound = true;
                j++;
            }
        }
    }


    private static string _GetWord(string str)
    {
        var word = "";

        for (var i = str.IndexOf(Node.Break) - 1; i >= 0; i--)
            word += str[i];

        for (var i = str.IndexOf(Node.Break) + 1; i < str.Length; i++)
            word += str[i];

        return word;
    }

    public List<string> ContainsHookWithRack(string hook, string rack)
    {
        hook = hook.ToLower();
        hook = _StringReverse(hook);

        var set = new HashSet<string>();

        _ContainsHookWithRackRecursive(RootNode, set, "", rack, hook);
        return set.ToList();
    }

    private static void _ContainsHookWithRackRecursive(Node node, ISet<string> rtn, string letters, string rack, string hook)
    {
        // Null nodes represent the EOW, so return the word.
        if (node == null)
        {
            var w = _GetWord(letters);
            if (!rtn.Contains(w)) rtn.Add(w);
            return;
        }

        // If the hook still contains letters, process those first.
        if (!String.IsNullOrWhiteSpace(hook))
        {
            letters += node.Letter != Node.Root ? node.Letter.ToString() : "";

            if (node.ContainsKey(hook[0]))
            {
                var h = hook.Remove(0, 1); //Pop the letter off the hook
                _ContainsHookWithRackRecursive(node[hook[0]], rtn, letters, rack, h);
            }
        }
        else
        {
            letters += node.Letter != Node.Root ? node.Letter.ToString() : "";

            foreach (var key in node.Keys.Cast<char>().Where(k => rack.Contains(k) || k == Node.Eow || k == Node.Break))
            {
                var r = (key != Node.Eow && key != Node.Break) ? rack.Remove(rack.IndexOf(key), 1) : rack; //Pull the letter from the rack
                _ContainsHookWithRackRecursive(node[key], rtn, letters, r, hook);
            }
        }
    }

    private static string _StringReverse(string str)
    {
        var charArray = str.ToCharArray();
        Array.Reverse(charArray);
        return (new String(charArray));
    }
}
public class Node
{
    public const char Break = '>';
    public const char Eow = '$';
    public const char Root = ' ';

    public char Letter { get; set; }
    public HybridDictionary Children { get; private set; }

    public Node() { }

    public Node(char letter)
    {
        this.Letter = letter;
    }

    public Node this[char index]
    {
        get { return (Node)Children[index]; }
    }

    public ICollection Keys
    {
        get
        {
            // Guard against nodes that never had children added.
            if (Children == null)
                return new ArrayList();
            return Children.Keys;
        }
    }

    public bool ContainsKey(char key)
    {
        return Children != null && Children.Contains(key);
    }

    public Node AddChild(char letter)
    {
        if (Children == null)
            Children = new HybridDictionary();

        if (!Children.Contains(letter))
        {
            var node = letter != Eow ? new Node(letter) : null;
            Children.Add(letter, node);
            return node;
        }

        return (Node)Children[letter];
    }

    public Node AddChild(char letter, Node node)
    {
        if (Children == null)
            Children = new HybridDictionary();

        if (!Children.Contains(letter))
        {
            Children.Add(letter, node);
            return node;
        }

        return (Node)Children[letter];
    }

    public override string ToString()
    {
        return this.Letter.ToString();
    }
}
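
A quick usage sketch (the rack is assumed to be lowercase; match order depends on traversal order):

var gaddag = new Gaddag();
gaddag.Add("call");
gaddag.Add("ball");
gaddag.Add("all");

// Words containing the hook "all" that can be completed with rack letters "cb":
var words = gaddag.ContainsHookWithRack("all", "cb"); // -> all, call, ball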