Truly Awful Algorithms – Gnome Sort


During my pursuit to learn about new algorithms I came across one with a funny name, a simple implementation, and performance that is generally just awful in any real-world case. So here we have Gnome Sort. Proposed by Hamid Sarbazi-Azad in 2000, this algorithm’s best feature is that a single while loop does all the sorting. Here’s an implementation of it in C# as an extension method on an array of integers*:

public static class SortAlgorithms
{
    public static int[] GnomeSort(this int[] a)
    {
        // Nothing to do for empty or single-element arrays
        if (a.Length <= 1)
            return a;

        var i = 1;
        while (i < a.Length)
        {
            // If the current element is in order, step forward
            if (a[i] >= a[i - 1])
                i++;
            else
            {
                // Otherwise swap it backwards and step back to re-check
                var tmp = a[i];
                a[i] = a[i - 1];
                a[i - 1] = tmp;
                if (i > 1)
                    i--;
            }
        }
        return a;
    }
}

If you will, imagine you asked your garden gnome to sort your flower pots based on size. How would he go about it? Why, he would walk down the line of flower pots looking at the pot he is next to and the one behind it, and if the one he’s next to is smaller than the other one he’d swap them, and keep swapping it backwards down the line until the pot he’s moving is no longer smaller than the one before it. Then he just starts walking down the line again looking at pots. The best case for this sort is O(N) for an already sorted list, with an O(N²) worst case time for a reverse sorted list. The alternative name for this sort is Stupid Sort, not because of the algorithm itself, but because gnomes are stupid and take the most direct approach to solving problems. On the plus side you only had to spend one loop!
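If you want to try it out, here’s a quick usage sketch, assuming the SortAlgorithms extension class above is in scope:

// Sorts the array in place and also returns it
var pots = new[] { 5, 3, 8, 1, 9 };
pots.GnomeSort();
// pots is now { 1, 3, 5, 8, 9 }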

* Consider this code MIT Licensed if for some insane reason you want to use it

6 Mistakes That Slow Down Your Website

This is just some random musings on common problems I’ve seen over several years of web development.

1) No indexes on data (in databases or elsewhere)

It’s very easy for developers to use a small data set for their test cases, mainly because large data sets are hard to create, only to find out that something that ran quickly during development when you only had 10 records runs slowly when you have 1 million. What’s interesting about this problem is that skipping indexes is actually a reasonable pattern if you really will only ever have a very small amount of data. However if you won’t, then testing against a large data set needs to be part of your process so you can be certain that it won’t crush your site.

2) No caching or incorrect caching

Caching needs to be part of a website’s plan, and it’s a really easy area to overlook or to get wrong. There are obvious places to cache data; for instance, when you make a common database call it makes sense to store the result in local memory. However what if you are using 10 web servers and they all need access to the same cached data? Memcached seems like an obvious choice, but what if each server needs a particular piece of data 20 times per page load? In some situations it may make sense to cache to a memcache cluster, but also to pull the data into a local in-memory cache.
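As a rough illustration, here’s what that two-level lookup might look like; LocalCache, DistributedCache, Product, and LoadFromDatabase are hypothetical placeholders, not a specific library:

// A minimal two-level cache read: check in-process memory first, then the shared
// memcached cluster, and only then fall back to the database.
public static Product GetProduct(int id)
{
    var key = "product:" + id;

    var product = LocalCache.Get<Product>(key);        // per-server, in-process
    if (product != null)
        return product;

    product = DistributedCache.Get<Product>(key);      // shared across all web servers
    if (product == null)
    {
        product = LoadFromDatabase(id);                 // last resort
        DistributedCache.Set(key, product, TimeSpan.FromMinutes(10));
    }

    LocalCache.Set(key, product, TimeSpan.FromSeconds(30)); // short local TTL to stay fresh
    return product;
}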

This doesn’t only go for data that’s processed server side; resources such as javascript and css should be cached as well. The obvious place to cache these is the user’s web browser, however you can also cache them in other places. For instance you can instruct downstream servers to cache files for you, and you can also ask downstream proxies or the client software to revalidate cached items before using them. (More info)
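For example, in classic ASP.NET (System.Web) you might set the caching headers for a response roughly like this; other frameworks expose the same HTTP headers through different APIs:

// Allow browsers and downstream proxies to cache this response for a week,
// and expose a Last-Modified date so clients can revalidate with a cheap 304.
// lastModifiedUtc is whatever modification date you track for the resource.
Response.Cache.SetCacheability(HttpCacheability.Public);
Response.Cache.SetMaxAge(TimeSpan.FromDays(7));
Response.Cache.SetLastModified(lastModifiedUtc);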

You can also apply the same strategies used for javascript and css to the pages themselves. Some web technologies will allow you to cache dynamically created web pages so you don’t need to reprocess the page at all; instead you can keep the resulting page in memory and just serve it to whoever requests it. This is great for pages that don’t have user-specific data on them and don’t change very often.
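In ASP.NET MVC, for instance, this kind of whole-page caching can be as simple as an OutputCache attribute on the action; the duration here is just an illustration:

// Cache the rendered page for 5 minutes; any request in that window is served
// from memory without re-running the action at all.
[OutputCache(Duration = 300, VaryByParam = "none")]
public ActionResult About()
{
    return View();
}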

3) Too much stuff on a page

A lot of websites love to cram as much as they can onto individual pages, even if you didn’t request it. Every bit of information that is included on a page should be scrutinized, and if it doesn’t add value it shouldn’t be included. “Add value” can be very tricky, as many companies and web masters believe that if they are interested in something, adding it to a page must also be good for the user. However whenever you add anything to a webpage you may be hurting yourself in three ways:

1) There’s now an extra element competing for a user’s attention
2) You’re now using more bandwidth to deliver the page (increasing delivery time, as well as costs)
3) You’re likely degrading the user experience, since added complexity makes the page harder to use.

This isn’t to say that adding information can’t help a webpage. For instance having a product for sale without a description is obviously a bad idea. However having a product for sale with a form to search for houses on the same page is a bad idea, because it distracts from the user’s main purpose, and that distraction comes at a cost on the user’s end (increased load time) and on yours (increased server usage, loss of sales, etc).

Of course if you aren’t certain whether adding something to a webpage is a good idea you should A/B test it to see whether it’s used or whether it hurts your website’s performance, and actually honor the result of that test.

4) Loading too much data

A lot of frameworks nowadays make it very easy to access databases and other data sources, however sometimes they have a cost that’s invisible to developers. Sometimes developers will also screw up their data pulls, regardless of the framework. It’s important for all developers to understand how data is accessed so they can avoid some common problems, and also understand where the best place to do certain calculations is. For instance if you wanted the average housing value in an area it’s far better to ask your database for the average than to pull all your housing data to your web server and do the calculation there.
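To make that concrete, here’s a hedged sketch in LINQ-style C#; db.Houses, City, and Value are hypothetical entities, not a real schema:

// Bad: pulls every matching row across the wire, then averages on the web server.
var avgInApp = db.Houses.Where(h => h.City == city)
                        .ToList()
                        .Average(h => h.Value);

// Better: the database computes the average and only a single number comes back.
var avgInDb = db.Houses.Where(h => h.City == city)
                       .Average(h => h.Value);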

5) Making too many external calls during page creation

Some people, when they feel clever, will come up with a solution to query their database for a list of item keys, and then grab each item by its key and place it in a shared cache (so later queries can grab the same items without having to run the exact same query). This is a good idea, however if your query gets back 1000 results and they all need to be individually pulled and cached you’d probably be better off not caching at all, because now your optimized solution is making 1001 round trips to your database. This doesn’t only apply to database calls; making too many calls to anything not residing inside the application itself is a bad idea (including calls to memcached). Whenever possible external calls should be batched together. For instance, in the example above you could try to pull all the items from your original query from the cache in one go, keep a list of the items that aren’t cached, and then make one large pull for the remaining items.
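Here’s a rough sketch of that batching idea; Cache.GetAll, Cache.Set, Database.GetItemsByKeys, and Item are hypothetical helpers standing in for whatever cache and data layer you actually use:

// Two round trips total: one multi-get against the cache, one batched query for the misses.
public static List<Item> GetItems(List<int> keys)
{
    // Single multi-get returns whatever the cache already has, keyed by item key
    Dictionary<int, Item> found = Cache.GetAll<Item>(keys);

    var missingKeys = keys.Where(k => !found.ContainsKey(k)).ToList();
    if (missingKeys.Count > 0)
    {
        // One batched database query for everything the cache was missing
        foreach (var item in Database.GetItemsByKeys(missingKeys))
        {
            Cache.Set(item.Key, item);
            found[item.Key] = item;
        }
    }

    // Return the items in the original key order
    return keys.Select(k => found[k]).ToList();
}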

6) Making too many calls from the page itself

Building on point 5, including several dozen javascript, css, and image files will really bog down a webpage once the original HTML is delivered to the user. There are lots of technologies and techniques to combat this problem, such as Combres or CSS sprites, but generally speaking you want to make as few requests for resources as possible. To find out where you’re making too many calls, tools like YSlow can make all the difference.

The BK-Tree – A Data Structure for Spell Checking

Proposed by Burkhard and Keller in 1973, the BK-Tree is a data structure used for spell checking based on the Levenshtein Distance between two words, which is basically the minimum number of single-character changes you need to make to one word to turn it into another. Some example distances:

LevenshteinDistance(cook, book) -> 1
LevenshteinDistance(cook, books) -> 2
LevenshteinDistance(what, water) -> 3

To make the last example clear, the distance between what and water is based on three edits: first drop the h, then add the e, and finally add the r. You can use this distance to work out how close someone is to spelling a word correctly. For instance, if I want to know which word in the set [cook, book, books, what, water] the misspelling wat is closest to, I could do a full scan of the list and find the words with the lowest distance:

LevenshteinDistance(wat, cook) -> 4
LevenshteinDistance(wat, book) -> 4
LevenshteinDistance(wat, books) -> 5
LevenshteinDistance(wat, what) -> 1
LevenshteinDistance(wat, water) -> 2

Based on that search I can determine the user probably meant the word what due to it having the lowest distance of 1.

This works for a small number of words, since O(n) isn’t bad at that scale; however if I wanted to scan a full dictionary for the closest match, O(n) isn’t optimal. This is where a BK-Tree comes in.

To build a BK-Tree all you have to do is take any word from your set and plop it in as your root node, and then add words to the tree based on their distance to the root. For instance if I started a tree with the word set [book, books, cake] then my tree would look like this if I started by making the word book my root node:

[Figure: book as the root node, with books on edge 1 and cake on edge 4]

This is because:

LevenshteinDistance(book, books) -> 1
LevenshteinDistance(book, cake) -> 4

Of course now if we add the word boo to the mix we get a conflict because:

LevenshteinDistance(book, boo) -> 1

which collides with books. To handle this we now have to chain boo from books by making it a child of books based on their Levenshtein Distance.

LevenshteinDistance(books, boo) -> 2

[Figure: boo added as a child of books on edge 2]

We continue to use this strategy as we add words, so if we throw in [Cape, Boon, Cook, Cart] then we get this:

[Figure: the full tree after adding Cape, Boon, Cook, and Cart]

Now that we have our structure, the obvious question is how do we search it? It’s simple: we take our misspelled word and find matches within a certain tolerance, which we’ll call N. We do this by taking the Levenshtein Distance between our word and the root, then crawling every child edge whose distance is within ±N of that value, repeating the process at each node we visit.

For example, if we searched for the closest match to the misspelling caqe within the tolerance of 1 in the above structure we would crawl it like this:

1) LevenshteinDistance(caqe, book) -> 4
    a) Check (4 <= 1) - Don't add book as a match
    b) Crawl all edges between 3 and 5 (4-1,4+1)
2) Crawl into edge path 4 from book
3) LevenshteinDistance(caqe, cake) -> 1
    a) Check (1 <= 1) - Add cake as a match
    b) Crawl all edges between 0 and 2 (1-1, 1+1)
4) Crawl into edge path 1 from cake
5) LevenshteinDistance(caqe, cape) -> 1
    a) (1 <= 1) - Add cape as a match
6) Crawl into edge path 2 from cake
7) LevenshteinDistance(caqe, cart) -> 2
    a) Check (2 <= 1) - Don't add cart as a match

From this example it appears we can now find misspellings in O(log n) time, which is better than our O(n) from before. This however doesn’t tell the whole story, because our tolerance can drastically increase the number of nodes we need to visit in order to do the search. A discussion of a BK-Tree implementation in Lucene seems to indicate a tolerance of 2 should limit a dictionary search to 5-10% of the tree, while someone else ran tests with random strings to get their own metrics (though random strings are a poor representation of real-world misspellings). In practice the speed improvement of this structure is going to be highly dependent on the set of words you are using and the typical inputs you run into, so independent research is needed to see if it applies to any specific problem (such as matching misspelled city names).

As always here’s a C# implementation under the MIT license:

/*
The MIT License (MIT)
Copyright (c) 2013

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public class BKTree
{
	private Node _Root;

	// Adds a word to the tree: walk down existing edges whose distance matches
	// until we find a node with no child at that distance, then attach the word there.
	public void Add(string word)
	{
		word = word.ToLower();
		if (_Root == null)
		{
			_Root = new Node(word);
			return;
		}

		var curNode = _Root;

		var dist = LevenshteinDistance(curNode.Word, word);
		while (curNode.ContainsKey(dist))
		{
			// Distance 0 means the word is already in the tree
			if (dist == 0) return;

			curNode = curNode[dist];
			dist = LevenshteinDistance(curNode.Word, word);
		}

		curNode.AddChild(dist, word);
	}

	// Returns every word in the tree within distance d of the given word
	public List<string> Search(string word, int d)
	{
		var rtn = new List<string>();
		word = word.ToLower();

		if (_Root != null)
			RecursiveSearch(_Root, rtn, word, d);

		return rtn;
	}

	private void RecursiveSearch(Node node, List<string> rtn, string word, int d )
	{
		var curDist = LevenshteinDistance(node.Word, word);
		var minDist = curDist - d;
		var maxDist = curDist + d;

		// The node itself is a match if it is within the tolerance
		if (curDist <= d)
			rtn.Add(node.Word);

		// Only child edges within curDist ± d can possibly contain matches
		foreach (var key in node.Keys.Cast<int>().Where(key => minDist <= key && key <= maxDist))
		{
			RecursiveSearch(node[key], rtn, word, d);
		}
	}

	// Standard dynamic programming implementation of Levenshtein Distance
	public static int LevenshteinDistance(string first, string second)
	{
		if (first.Length == 0) return second.Length;
		if (second.Length == 0) return first.Length;

		var lenFirst = first.Length;
		var lenSecond = second.Length;

		var d = new int[lenFirst + 1, lenSecond + 1];

		for (var i = 0; i <= lenFirst; i++)
			d[i, 0] = i;

		for (var i = 0; i <= lenSecond; i++)
			d[0, i] = i;

		for (var i = 1; i <= lenFirst; i++)
		{
			for (var j = 1; j <= lenSecond; j++)
			{
				var match = (first[i - 1] == second[j - 1]) ? 0 : 1;

				d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + match);
			}
		}

		return d[lenFirst, lenSecond];
	}
}

public class Node
{
	public string Word { get; set; }
	public HybridDictionary Children { get; set; }

	public Node() { }

	public Node(string word)
	{
		this.Word = word.ToLower();
	}

	public Node this[int key]
	{
		get { return (Node)Children[key]; }
	}

	public ICollection Keys
	{
		get
		{
			if(Children == null)
				return new ArrayList();
			return Children.Keys;
		}
	}

	public bool ContainsKey(int key)
	{
		return Children != null && Children.Contains(key);
	}

	public void AddChild(int key, string word)
	{
		if(this.Children == null)
			Children = new HybridDictionary();
		this.Children[key] = new Node(word);
	}
}
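
To tie it back to the walkthrough above, usage of the class looks roughly like this (same word list as the figures):

// Build the tree from the walkthrough and search for the misspelling "caqe"
var tree = new BKTree();
foreach (var w in new[] { "book", "books", "boo", "cake", "cape", "boon", "cook", "cart" })
	tree.Add(w);

var matches = tree.Search("caqe", 1);   // -> cake, cape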

The Trie Data Structure – A Prefix Tree for Autocompletes

After covering the GADDAG Data Structure it seems like a good idea to follow up on the structure it’s based on, the Trie. The Trie is a special-case structure that’s optimized for prefix searches, first proposed in 1960 by Edward Fredkin. It has the advantage of using less memory than a GADDAG, but the downside of only being efficient for one purpose. It just so happens though that purpose is very common on websites: autocompletion.

A Trie for any one word looks like a linked list, as seen here:

[Figure: a single word stored as a chain of letter nodes]

No matter how complex a Trie gets, every word will only appear once and it will appear like it does above somewhere in the tree.

Let’s look at a more complex structure containing the words “Call”, “Cat”, “Cater”, “Bat”, “Bake”, “Cake”, and “Cope”:

[Figure: a Trie containing Call, Cat, Cater, Bat, Bake, Cake, and Cope]

Based on the example above, if you wanted to know every word that contains the prefix “CA” you can simply walk to the node “C” from the root, then down to “A”, and then crawl the rest of the tree to grab all the words hanging off of that node.

A Trie has a very fast prefix search time of O(M), where M is the number of letters in the prefix, and a retrieval time of O(M+N), where N is the number of nodes hanging off of that prefix. This makes the absolute worst case time for a Trie search O(N), where N is the number of nodes in the Trie, which happens when you search for an empty string as the prefix.

The worst case time can be avoided in situations like autocompletes where you will likely not return every possible word hanging off of the prefix, but will instead only return the top N results. This is shown in the code below, and it drastically reduces the time needed to return when dealing with large tries and short prefixes, for example when you want to autocomplete against a dictionary and the user has just entered their first keystroke.

Here is a sample of a Trie implemented in .NET 4.5. I’m licensing it under the MIT License so it is free to use however you wish. This implementation does not include optimizations like using binary arrays to store characters or node compression (which would turn this into a radix tree); it was written to be simple to follow for those who want to walk through the operation of a Trie.

/*
    The MIT License (MIT)
    Copyright (c) 2013

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
    to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
    and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public class Trie
{
    public Node RootNode { get; private set; }

    public Trie()
    {
        RootNode = new Node { Letter = Node.Root };
    }

    public void Add(string word)
    {
        word = word.ToLower() + Node.Eow;
        var currentNode = RootNode;
        foreach(var c in word)
        {
            currentNode = currentNode.AddChild(c);
        }
    }

    public List<string> Match(string prefix, int? maxMatches)
    {
        prefix = prefix.ToLower();

        var set = new HashSet<string>();

        _MatchRecursive(RootNode, set, "", prefix, maxMatches);
        return set.ToList();
    }

    private static void _MatchRecursive(Node node, ISet<string> rtn, string letters, string prefix, int? maxMatches)
    {
        // Stop once we've collected as many matches as the caller asked for
        if (maxMatches != null && rtn.Count == maxMatches)
            return;

        // A null node is the end-of-word marker, so the letters gathered so far form a word
        if (node == null)
        {
            rtn.Add(letters);
            return;
        }

        // Skip the root marker so results contain only the word's letters
        if (node.Letter != Node.Root)
            letters += node.Letter;

        if (prefix.Length > 0)
        {
            if (node.ContainsKey(prefix[0]))
            {
                _MatchRecursive(node[prefix[0]], rtn, letters, prefix.Remove(0, 1), maxMatches);
            }
        }
        else
        {
            foreach (char key in node.Keys)
            {
                _MatchRecursive(node[key], rtn, letters, prefix, maxMatches);
            }
        }
    }
}
public class Node
{
    public const char Eow = '$';
    public const char Root = ' ';

    public char Letter { get; set; }
    public HybridDictionary Children { get; private set; }

    public Node() { }

    public Node(char letter)
    {
        this.Letter = letter;
    }

    public Node this[char index]
    {
        get { return (Node)Children[index]; }
    }

    public ICollection Keys
    {
        get
        {
            // Guard against nodes with no children (e.g. an empty trie's root)
            if (Children == null)
                return new ArrayList();
            return Children.Keys;
        }
    }

    public bool ContainsKey(char key)
    {
        return Children != null && Children.Contains(key);
    }

    public Node AddChild(char letter)
    {
        if (Children == null)
            Children = new HybridDictionary();

        if (!Children.Contains(letter))
        {
            var node = letter != Eow ? new Node(letter) : null;
            Children.Add(letter, node);
            return node;
        }

        return (Node)Children[letter];
    }

    public override string ToString()
    {
        return this.Letter.ToString();
    }
}
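
As a quick usage sketch, using the word list from the figure above:

// Build the trie and ask for everything starting with "ca"
var trie = new Trie();
foreach (var w in new[] { "call", "cat", "cater", "bat", "bake", "cake", "cope" })
    trie.Add(w);

var matches = trie.Match("ca", 10);   // e.g. call, cat, cater, cake (order not guaranteed)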