Inverted Indexes – Inside How Search Engines Work

An Inverted Index is a structure used by search engines and databases to make search terms to files or documents, trading the speed writing the document to the index for searching the index later on. There are two versions of an inverted index, a record-level index which tells you which documents contain the term and a fully inverted index which tells you both the document a term is contained in and where in the file it is. For example if you built a search engine to search the contents of sentences and it was fed these sentences:

{0} - "Turtles love pizza"
{1} - "I love my turtles"
{2} - "My pizza is good"

Then you would store them in a Inverted Indexes like this:

            Record Level     Fully Inverted
"turtles"   {0, 1}           { (0, 0), (1, 3) }
"love"      {0, 1}           { (0, 1), (1, 1) }
"pizza"     {0, 2}           { (0, 2), (2, 1) }
"i"         {1}              { (1, 0) }
"my"        {1, 2}           { (1, 2), (2, 0) }
"is"        {2}              { (2, 2) }
"good"      {2}              { (2, 3) }

The record level sets represent just the document ids where the words are stored, and the fully inverted sets represent the document in the first number inside the parentheses and the location in the document is stored in the second number.

So now if you wanted to search all three documents for the words “my turtles” you would grab the sets (looking at record level only):

"turtles"   {0, 1}
"my"        {1, 2}

Then you would intersect those sets, coming up with the only matching set being 1. Using the Fully Inverted Index would also let us know that the word “my” appeared at position 2 and the word “turtles” at position 3, assuming the word position is important your search.

There is no standard implementation for an Inverted Index as it’s more of a concept rather than an actual algorithm, this however gives you a lot of options.

For the index you can choose to use things like  Hashtables, BTrees, or any other fast search data structure.

The intersection becomes a more interesting problem. You can try using Bloom Filters if accuracy isn’t 100% needed, you can brute force the problem by doing a full scan of each set for O(M+N) time for joining two sets. You can also try to do something a little more complicated. Rumor has it that search engines like Google and Bing only merge results until they have enough for a search page and them dump the sets they are loading, though I know very little about how they actually solve this problem.

Here is an example of a simple Inverted Index written in C# that uses a Dictionary as the index and the Linq Intersect function:

public class InvertedIndex
{
    private readonly Dictionary<string, HashSet<int>> _index = new Dictionary<string, HashSet<int>>();
    private readonly Regex _findWords = new Regex(@"[A-Za-z]+");

    public void Add(string text, int docId)
    {
        var words = _findWords.Matches(text);

        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;

            if (!_index.ContainsKey(word))
                _index[word] = new HashSet<int>();

            if (!_index[word].Contains(docId))
                _index[word].Add(docId);
        }
    }

    public List<int> Search(string keywords)
    {
        var words = _findWords.Matches(keywords);
        IEnumerable<int> rtn = null;

        for (var i = 0; i < words.Count; i++)
        {
            var word = words[i].Value;
            if (_index.ContainsKey(word))
            {
                rtn = rtn == null ? _index[word] : rtn.Intersect(_index[word]);
            }
            else
            {
                return new List<int>();
            }
        }

        return rtn != null ? rtn.ToList() : new List<int>();
    }
}
 

6 thoughts on “Inverted Indexes – Inside How Search Engines Work

  1. love’s fully inverted index should be “{ (0, 1), (1, 1) }”.
    You might want to fix it to avoid confusing your readers :)

    Reply
  2. That’s the base, but it get’s trickier, most full text search indices build also some sort of vector space model, because a word appears more than once and may or may not be relevant to the content of the text.

    Reply
  3. Pingback: Indexing in Neo4j | 2's Complement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s