The Bloom Filter – Probably Answering The Question “Do I Exist?”

Originally conceived of in 1970 by Burton Howard Bloom, the Bloom Filter is a probabilistic data structure that can determine if something is likely to be a part of a set. The upside of a Bloom filter is you can find out quickly if something definitely doesn’t exist, and within a certain margin of error if something does. In other words you will have false positives, but never false negatives.

So how does it work? Well to start Bloom Filter has three elements to it:

M – The number of bits it uses to track the existence of items.
N – The number of items that will likely be entered into the Bloom Filter
K – The number of hash functions the Bloom Filter will utilize.

To use these three elements all we have to do is hash all N elements with all K hash functions and turn them into indexes in the bit array of size M and mark all those bits to 1.

For example if I had three words I wanted to track in my filter {“test”, “cow”, “pizza”} this would make my problem set size N. I could track these in a 10 bit array (M=10) and track them using the three hashes (K=3) { MD5, SHA512, and RIPEMD160 }.

Our array starts off looking like this:

 0  1  2  3  4  5  6  7  8  9
[0][0][0][0][0][0][0][0][0][0]

For each has function we’ll hash the word, convert it to a 64 bit integer, mod it by M (10), take the absolute value of the result and mark the array at that index with a 1. Here are the indexes for the above words and hashes:

      MD5   SHA512  RIPEMD160
test   7       8        8
cow    9       7        6
pizza  4       2        4

By marking our array using those indexes we end up with our bloom filter looking like this:

 0  1  2  3  4  5  6  7  8  9
[0][0][1][0][1][0][1][1][1][1]

Now if we check to see if the words “pizza”, “trout”, and “blue” exist all we need to do is use all three hashes we used before and make sure that all the indexes they point two are marked one. Here’s what the hashes produce:

      MD5   SHA512  RIPEMD160
pizza  4       2        4
trout  0       3        3
blue   8       8        4

And as you can see by matching those indexes to our array the words “pizza” and “blue” exist, while the word “trout” doesn’t. The Bloom Filter works!

But wait, we never added “blue” to the array, so why did it come back as being correct? It’s because we’ve made the mistake of making the values of M and K too small, so our false positive rate is too large. How do we determine the false positive rate? Without getting into the math (if you want it look here), the equation for determining error is:

BloomError

Running our numbers of M=10, K=3, and N=3 into this equation we find out out false positive rate is going to be 32.65%! To fix this all we need to do is either increase M or K (in this case M since our array is already getting pretty saturated). So if we increase M to say 100 we’ll drop our error rate to 0.1% for 3 items.

So how does this work in practice? Let’s say we want a fast spell checker where our only concern is if a word is spelled correctly or not, and we’re not concerned with finding possible suggestions (this is useful in things like browsers which will mark misspelled words without showing possible matches unless the user takes action, at which point you can fall back to a BK-Tree). If you based this spell check off of the enable1 dictionary your problem set is 172,820 words. We can spell check that list with 1 MB (8,388,608 bits) of memory with an error rate of 0.02% using only three hash functions. This nets an extremely fast (based on hash function speed) and memory efficient way to perform a simple spell check with an acceptable margin of error.

So where might you find this being used in the wild? As it turns out this structure is used all over the place, for instance in Big Table, Cassandra, and Chrome.

Below is a working example of a simple Bloom Filter in .NET 4.5.

/*
    The MIT License (MIT)
    Copyright (c) 2013

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
    to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
    and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;
using System.Security.Cryptography;

public class BloomFilter
    {
        public int M { get; private set; }

        public int N { get; private set; }

        public int K { get { return _hashAlgorithms.Count; } }

        private readonly BitArray _filter;

        private readonly List<HashAlgorithm> _hashAlgorithms;

        public BloomFilter(int size)
        {
            _filter = new BitArray(size);
            M = size;

            _hashAlgorithms = new List<HashAlgorithm> { MD5.Create(), SHA512.Create(), RIPEMD160.Create() };
        }

        public void Add(string word)
        {
            N++;
            foreach (var index in _GetIndexes(word))
            {
                _filter[index] = true;
            }
        }

        public bool Contains(string word)
        {
            var matchCount = 0;
            foreach (var index in _GetIndexes(word))
            {
                if (_filter[index])
                    matchCount++;
            }

            return matchCount == K;
        }

        private IEnumerable<int> _GetIndexes(string word)
        {
            var hashIndexes = new List<int>();

            foreach (var hashAlgorithm in _hashAlgorithms)
            {
                var bytes = hashAlgorithm.ComputeHash(Encoding.UTF8.GetBytes(word));
                hashIndexes.Add(Math.Abs((int)BitConverter.ToInt64(bytes, 0) % M));
            }

            return hashIndexes;
        }

        public decimal FalsePositiveRate()
        {
            return Convert.ToDecimal(Math.Pow(1.0 - (Math.Exp(-K * (N + 0.5) / (M - 1))), K));
        }
    }
About these ads

One thought on “The Bloom Filter – Probably Answering The Question “Do I Exist?”

  1. Pingback: Inverted Indexes – Inside How Search Engines Work | NullWords Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s