Saturday, September 26, 2015

Radix sort: sorting integers (often) faster than std::sort.

This post will describe a very simple integer sorting algorithm: radix sort. Despite its simplicity, it is a very practical algorithm. As we will see, even a simple implementation can easily outperform std::sort from the C++ standard library.

Suppose we start with an array of numbers such as this:
853, 872, 265, 238, 199, 772, 584, 204, 480, 173,
499, 349, 308, 314, 317, 186, 825, 398, 899, 161
Counter-intuively, we begin by sorting it based on the least significant decimal digit:
480, 161, 872, 772, 853, 173, 584, 204, 314, 265,
825, 186, 317, 238, 308, 398, 199, 499, 349, 899
Now, we sort it based on the middle decimal digit. But we take care that we do this in a stable fashion, that is: for numbers that are tied on the middle digit, keep them in the current order.
204, 308, 314, 317, 825, 238, 349, 853, 161, 265,
872, 772, 173, 480, 584, 186, 398, 199, 499, 899
The numbers are now sorted by the last two digits. It is not hard to guess what we will do next. Once we have sorted them by the most significant digit, taking care not to change the order in case of ties, we will have sorted the array.

I haven't said how exactly we perform the sorting based on a single digit, so let's do this last round slowly. We use count-sort. Here is how it works:

Step 1. Go through the data and count how many times each top digit appears. 0 appears 0 times, 1 appears 4 times, etc.:

count: 0, 4, 3, 5, 2, 1, 0, 1, 4, 0

Step 2. Compute the prefix sums in count. This will give us, for each digit, the index of the first entry with that digit in the final sorted order.

position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20
For instance, we now know that numbers starting with the digit 4 will begin at index 12.

Step 3. Shuffle the data. For each number, we simply place it directly at the correct position! After placing a number we increment the index for the given digit.
X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 8, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, 314, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 9, 12, 14, 15, 15, 16, 20
161, 173, 186, 199, 204, 238, 265, 308, 314, 317,
349, 398, 480, 499, 584, 772, 825, 853, 872, 899

Step 4. We have shuffled the data into a new, temporary array. Move it back to the original array. In practice, we can simply swap pointers here.

Running time

Of course in practice we don't sort based on decimal digits. We could sort based on individual bits but we can do better than that. Sort based on groups of bits.

If we sort k bits at a time, there are 2k possible "digits". The count array will need to be of that length. Hence, let's make k ≤ log2 n, so that the helper array isn't longer than the data being sorted.

For added performance, it may be useful to make k somewhat smaller than log2 n. In our implementation below, we use k = ⌊log2 n / 3⌋. This increases the number of rounds 3-fold, but has several advantages that outweigh that:
  • The count array only uses n1/3 memory.
  • Computing the prefix sums in step 2 takes negligible time.
  • Counting in step 1 doesn't randomly increment counters all over memory. It randomly increments counters in a tiny section of memory, which is good for cache performance.
  • Shuffling in step 3 doesn't randomly write all over memory. It writes consecutively in only n1/3 different locations at a time, which also improves cache performance.
For the purposes of analyzing asymptotic performance, we simply say: k = Θ(log n). If the numbers being sorted have ω bits, we have Θ(ω / log n) rounds, each round is done in Θ(n) time, hence the total running time of radix sort is:

Θ(n ω / log n)

Note what this means. The larger the n, the less time we spend per element! This is in contrast with comparison-based sorts, where we spend Θ(log n) per element, which increases with n.

This indicates that there is a threshold: for small n it is better to use a comparison-based sort. For large n, radix sort is better.

What is the threshold? It should be about when ω / log n ≈ log n, that is, when n ≈ 2√ω. For instance, when ω=64,  n ≈ 28 = 256 or so should be the threshold. If n is significantly bigger than this, radix sort should start to dominate.

C++ implementation

template<class T>
void radix_sort(vector<T> &data) {
  static_assert(numeric_limits<T>::is_integer &&
                "radix_sort only supports unsigned integer types");
  constexpr int word_bits = numeric_limits<T>::digits;

  // max_bits = floor(log n / 3)
  // num_groups = ceil(word_bits / max_bits)
  int max_bits = 1;
  while ((size_t(1) << (3 * (max_bits+1))) <= data.size()) {
  const int num_groups = (word_bits + max_bits - 1) / max_bits;

  // Temporary arrays.
  vector<size_t> count;
  vector<T> new_data(data.size());

  // Iterate over bit groups, starting from the least significant.
  for (int group = 0; group < num_groups; ++group) {
    // The current bit range.
    const int start = group * word_bits / num_groups;
    const int end = (group+1) * word_bits / num_groups;
    const T mask = (size_t(1) << (end - start)) - T(1);

    // Count the values in the current bit range.
    count.assign(size_t(1) << (end - start), 0);
    for (const T &x : data) ++count[(x >> start) & mask];

    // Compute prefix sums in count.
    size_t sum = 0;
    for (size_t &c : count) {
      size_t new_sum = sum + c;
      c = sum;
      sum = new_sum;

    // Shuffle data elements.
    for (const T &x : data) {
      size_t &pos = count[(x >> start) & mask];
      new_data[pos++] = x;

    // Move the data to the original array.


I have generated arrays of random 64-bit integers and timed the time per element it takes to sort using std::sort and radix_sort.

3.3 ns
284.2 ns
6.1 ns
91.6 ns
1 000
19.3 ns
59.8 ns
10 000
54.8 ns
46.8 ns
100 000
66.9 ns
40.1 ns
1 000 000
81.1 ns
40.8 ns
10 000 000
95.1 ns
40.7 ns
100 000 000
108.4 ns
40.6 ns

We see the effect as predicted: for std::sort, the running time per element increases with n, for radix_sort it decreases with n. It's not exactly proportional and inversely proportional to log n due to various effects (mostly cache sizes), but the trend is there. Most importantly: for large n, radix_sort is clearly winning!

Further optimizations

More optimizations are possible which can lead to improvements in performance. In particular:
  • Optimize the number of rounds as a function of n. Taking log n / 3 bits at a time is a rough guess at what should work well.
  • Currently we scan the data array twice in each iteration: once to count, a second time to shuffle. It can be reduced to a single scan: while shuffling based on the current digit, we could also be counting the next digit at the same time.
These tweaks might improve the algorithm by a constant factor. Next time, I will describe how to get a better asymptotic running time. Until then!

Sunday, May 17, 2015

How fast can we sort?

We all know that we can sort n things in O(n log n) time and that is the best you can do. Problem solved. Right? Wrong!

Sorting is a fascinating topic and I am planning a whole series of posts about it. It also inspired the blog name.

Let's start with what most programmers already know.

Classic sorting algorithms

O(n log n) sorting algorithms have been known for a very long time. John von Neumann already implemented merge sort in 1945 on the EDVAC computer. We're talking about a computer built out of vacuum tubes, with 1000 words of RAM, capable of performing 1000 operations per second or so.

Later people discovered quicksort in 1962[1] and heapsort in 1964[2]. The only theoretical improvement here is that heapsort is in-place: it uses only O(1) additional memory outside the array. Of course the heap-based priority queue is nice in its own right.

We can easily implement another O(n log n) sorting algorithm, AVL-sort, using the code from my previous post on balanced binary search trees. We just put all the elements into an AVL tree, and the extract them back into a list:
avlSort :: Ord t => [t] -> [t]
avlSort = toList . fromList

Lower bound for comparison-based sorting

Suppose that we restrict ourselves to performing only one operation on the elements we're sorting: compare two of them to see which one is bigger. This is what we mean by "comparison-based sorting".

Every comparison operation returns a single yes or no answer. If we perform k such operations, we can get 2k different sequences of answers, and hence we can distinguish between 2k different permutations. Since there are n! permutations of n elements, it follows that we need at least log2 (n!) comparisons in the worst case (and also on average, which is slightly harder to see) to distinguish between them all.

log2 (n!) ≥ log2 (n/2)(n/2) = n/2 * (log2 n - 1) = Θ(n log n)

So there you go. That's the proof of the lower bound.

The significance of this lower bound has been overblown. Why would you ever so restrict yourself as to only use comparisons?? You can do much more with numbers, or other objects that you'd want to sort, than just compare two of them. You can add them, you can multiply them, you can index arrays with them... All these other operations don't seem all that useful for sorting at first sight. But it turns out they are useful! This is very unintuitive, and fascinating at the same time.

Again and again I have seen people cite this result when it does not apply. Most commonly, the claim is that n numbers can't be sorted faster than O(n log n). For example, people will claim that it's impossible to compute convex hulls in 2D faster than O(n log n), because that requires sorting coordinates. As we will see in a moment, this claim is false. Numbers can actually be sorted faster than this!

Faster sorting algorithms

Let's start talking about some asymptotically faster sorting algorithms. What are some of the things we normally want to sort?

The first group is numbers: integers, rational numbers, floating-point real numbers, complex numbers... OK maybe not complex numbers, they don't have any natural ordering.

The second group is sequences of characters or numbers, such as text strings (e.g. names) or big multi-precision integers.

Sorting integers

Important note here: we are talking about sorting integers that fit in a single machine word, such as 64-bit integers on a 64-bit computer. In general, we will be talking about sorting w-bit integers on a w-bit computer. This is not an artificial restriction: comparison-based sorting algorithms need this as well. Well, we can allow a constant multiple, such as 2w-bit integers on a w-bit computer. After all, we can implement operations on 2-word integers in O(1) time. But we're not talking about huge, multiple-precision integers. If you want to sort million-bit integers, you have to either find a 1000000-bit computer, or skip to the section about sorting sequences.

Here is a summary of various algorithms for sorting integers. I will be writing about some of them in my future posts.

long agomerge sortO(n log n)
long agoradix sortO(n w / log n)
1977van Emde Boas [3]O(n log w)
1983Kirkpatrick & Reisch [4]O(n log (w / log n))
1995Andersson, Hagerup, Nilsson, Raman [6]O(n log log n)
2002Han & Thorup [7]O(n (log log n)1/2)

Some of these run-times depend on w, but the last two are clearly better than O(n log n), independently of w.

Is sorting integers in O(n) time possible?

This is the big open problem. Nobody knows.

Radix sort does it when n is big compared to the word size (log n = Ω(w)).

In their 1995 paper [6], Andersson et al showed how to do this when n is small compared to the word size (log n = O(w1/2-ɛ) for some ɛ>0). There is still a gap in between where we just don't know.

Update: It has been pointed out in the comments that this knowledge gap has been slightly tightened on the "small n compared to the word size" side. As of 2014[8], we know how to sort in O(n) time when log n = O((w / log w)1/2).

What about other kinds of numbers?

Real numbers are typically represented in a floating point format. This seems harder to handle than integers, but it's really not. Floating point numbers are represented as two integers: an exponent and a mantissa. For instance, the IEEE double-precision floating point format, the most common representation out there, stores real numbers as an 11-bit exponent and a 52-bit mantissa. The representation is: 2exponent * (1 + mantissa * 2-52). A number with a higher exponent is bigger than a number with a smaller exponent. For equal exponents, the number with the bigger mantissa is bigger. So really, sorting these real numbers is equivalent to sorting 63-bit integers!

Sorting rational numbers can also be reduced to sorting integers. If you have rational numbers a/b, where a and b are w-bit integers, compute ⌊a * 22w / b⌋ for each, and sort the resulting 3w-bit integers. This works because a difference between two different rationals a/b and c/d is always at least 1/(bd) > 2-2w, so a precision of 2w bits after the binary point is sufficient.

This is another reason sorting integers is an interesting topic: if we can sort integers, we can sort all kinds of numbers.

Sorting strings lexicographically

The important thing to notice here is that we can't compare two elements in constant time any more in this case. Therefore, merge-sort does not work in O(n log n) time. It can be shown that it works in O(L log n) time, where L is the sum of lengths of the strings.

This too can be improved. In 1994, Andersson and Nillson [5] showed how to sort strings in O(L + (time to sort n characters)) time. If the alphabet is small (such as ASCII), we can use count sort to sort the n characters, which gives time complexity O(L + |Σ|), where |Σ| is the size of the alphabet. If the alphabet is large, we can use one of the integer sorting algorithms and arrive at, say, O(L + n (log log n)1/2) time.

This also works for sorting multi-precision integers. If you append their lengths at the start, this is just lexicographic sorting, where the "characters" are word-size numbers.


[1] Hoare, Charles AR. "Quicksort." The Computer Journal 5.1 (1962): 10-16.
[2] Williams, John William Joseph. "ALGORITHM-232-HEAPSORT." Communications of the ACM 7.6 (1964): 347-348.
[3] van Emde Boas, Peter. "Preserving order in a forest in less than logarithmic time and linear space." Information processing letters 6.3 (1977): 80-82.
[4] Kirkpatrick, David, and Stefan Reisch. "Upper bounds for sorting integers on random access machines." Theoretical Computer Science 28.3 (1983): 263-276.
[5] Andersson, Arne, and Stefan Nilsson. "A new efficient radix sort." Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on. IEEE, 1994.
[6] Andersson, Arne, et al. "Sorting in linear time?." Proceedings of the twenty-seventh annual ACM symposium on Theory of computing. ACM, 1995.
[7] Han, Yijie, and Mikkel Thorup. "Integer sorting in O (n√(log log n)) expected time and linear space." Foundations of Computer Science, 2002. Proceedings. The 43rd Annual IEEE Symposium on. IEEE, 2002.
[8] Belazzougui, Djamal, Gerth Stølting Brodal, and Jesper Sindahl Nielsen. "Expected Linear Time Sorting for Word Size Ω(log2 n log log n)." Algorithm Theory–SWAT 2014. Springer International Publishing, 2014. 26-37.

Tuesday, April 21, 2015

Balanced binary search trees: the easy way

The task of balancing binary search trees has a reputation of being very hard to implement. Data structure books seem to list dozens of different cases, especially for deleting elements.

Let's change this - it's not that hard! The analysis will be a little involved, but the code is going to be simple.

Here are some design choices:
  • We'll implement AVL trees. I initially started writing this in terms of 2-3 trees. Then I decided AVL was going to be even simpler.
  • All elements are stored in the leaves. This is unusual, people usually store values in internal nodes. But it makes things so much simpler!
  • Internal nodes store a reference to the smallest (left-most) element of the sub-tree. This will be useful to guide us down the tree.
  • Internal nodes also store the sub-tree height. AVL trees are usually described with nodes only storing the height difference of their children, but storing the height instead makes things easier still.
  • We'll implement this in Haskell. We like algebraic data types for trees!
Example AVL tree.

OK, let's get down to it!


Start by defining binary trees.
data Tree t = Empty | Leaf t | Node Int t (Tree t) (Tree t)
A tree of elements of type t is either empty, it is a leaf, or it is an internal node. A leaf contains a value of type t. An internal node contains: the height, the smallest element in the sub-tree, and the two children. Write some helper functions:
height :: Tree t -> Int
height Empty = error "height Empty"
height (Leaf _) = 0
height (Node h _ _ _) = h

smallest :: Tree t -> t
smallest Empty = error "smallest Empty"
smallest (Leaf x) = x
smallest (Node _ s _ _) = s

left, right :: Tree t -> Tree t
left  (Node _ _ a _) = a
left _ = error "only internal nodes have children"
right (Node _ _ _ b) = b
right _ = error "only internal nodes have children"

toList :: Tree t -> [t]
toList Empty = []
toList (Leaf x) = [x]
toList (Node _ _ a b) = toList a ++ toList b

The toList function is not optimal: it works in Θ(n log n) time (why?). Θ(n) is possible. I'll leave that as a coding exercise.

Time for some AVL-specific stuff. How do we build and balance trees? As a reminder, this is how AVL trees work:
  • All elements in the left sub-tree should be smaller (or equal) to all the elements in the right sub-tree.
  • The difference in height of the two sub-trees should be at most 1. We call this "similar height".
Create another little helper function: build a tree from two sub-trees of similar height:
node :: Tree t -> Tree t -> Tree t
node a b
  | abs (height a - height b) <= 1
    = Node (max (height a) (height b) + 1) (smallest a) a b
  | otherwise = error "unbalanced tree"
It's finally time to start manipulating our trees:

Merging trees

Wait, what? Merging trees as the first operation? What about insert?

We'll get to insert later. As you'll see, merge is going to be our basic operation. Everything else will be defined in terms of it!

What do we mean by merge? Suppose we have two trees, A and B, such that all elements of A are smaller or equal to all elements of B. The merged tree will contain all elements from both.

The "node" operation works when the two trees have similar heights. What if their heights differ by more than 1? Assume that A is the shorter tree. Some helpful notation that we will use:
  • A + B means A merged with B
  • A.L, A.R are the left and right sub-trees
  • A[h] is a tree of height h
  • A[h1..h2] is a tree whose height is between h1 and h2 (inclusive)
  • {A,B} is a tree whose children are A and B. A = {A.L, A.R}
So, again: we're trying to merge a tree A[0 .. h-2] with another tree B[h]. Our strategy is going to be as follows: merge A with B.L, then merge the resulting tree with B.R. Does this work?

Recursively merging a small tree with a larger tree.

A[0 .. h-2] + B[h] = (A[0 .. h-2] + B.L[h-2 .. h-1]) + B.R[h-2 .. h-1] = [h-2 .. h] + B.R[h-2 .. h-1]

How do we know that the tree resulting from the first merge has height in the range [h-2 .. h]? It's because merging two trees of heights h1 and h2 will always result in a tree of height either max(h1, h2) or max(h1, h2) + 1. This is certainly true for similar-height trees. Go ahead and verify that it will remain true in other cases as well.

OK, so what exactly happened when we tried to merge two trees with non-similar heights? After the recursive merge, we reduced the height difference to at most 2. Now what? Let's see if the same process works for a height difference of exactly 2:

A[h-2] + B[h] = (A[h-2] + B.L[h-2 .. h-1]) + B.R[h-2 .. h-1]
  = {A[h-2], B.L[h-2 .. h-1]} + B.R[h-2 .. h-1] = [h-1 .. h] + B.R[h-2 .. h-1]

The trees A and B.L now have similar heights, so their merge is easy. If B.R has height h-1, the final merge is also easy.

What if it has height h-2? Then B[h] = {B.L[h-1], B.R[h-2]} and:

Special case: bad!

A[h-2] + {B.L[h-1], B.R[h-2]} = (A[h-2] + B.L[h-1]) + B.R[h-2] = {[h-2], [h-1]} + [h-2]

Uh oh! Can you see what happened?

We tried to merge [h-2] with {[h-1], [h-2]}, and reduced the problem to merging {[h-2], [h-1]} with [h-2]. That's the mirror image of the original problem! No improvement. If we keep doing this, we'll get into an infinite loop!

We need to use a different strategy in this special case, to break the cycle. Fortunately, it's easy: A, B.L.L, B.L.R, B.R all have similar heights! Just pair them up.

Special case: good.

A[h-2] + {B.L[h-1], B.R[h-2]} = A[h-2] + B.L.L[h-3 .. h-2] + B.L.R[h-3 .. h-2] + B.R[h-2] = {{A, B.L.L}, {B.L.R, B.R}}

Here is the code:
merge :: Tree t -> Tree t -> Tree t
merge Empty x = x
merge x Empty = x
merge a b
  | abs (height a - height b) <= 1
    = node a b
  | height a == height b - 2 && height (right b) == height b - 2
    -- the special case: [h-2] + {[h-1],[h-2]}
    = let (bl,br) = (left b, right b)
          (bll,blr) = (left bl, right bl)
      in node (node a bll) (node blr br)
  | height a < height b
    = merge (merge a (left b)) (right b)
  | otherwise
    = merge (left a) (merge (right a) b)
That's it! Pretty short, huh? That's all we need.

How long does a merge take? We always reduce the difference in heights in the first recursive call, and finish by merging two trees of heights differing by at most 2, which takes constant time. Therefore, merging two trees of heights h1 and h2 takes O(|h1 - h2| + 1) time.

Splitting trees

Now we want to do the reverse of merging: split a tree into two smaller trees. OK but split where? We need some way of telling which elements belong to the left part and which elements belong to the right part.

This is accomplished by a functional argument: a function that returns "false" for the left elements and "true" for the right elements. The requirement is that the function is monotonic: we can't have x < y and x belonging to the right while y belongs to the left. Other than that, any function will do. Often, this will be something like: put x <= 5 on the left and x > 5 on the right.

Is splitting going to be more complicated than merging? No! It's going to be simpler. In fact, we will use merging to do splitting!

Here is where having the reference to the smallest element of a sub-tree comes in handy:
  • If the smallest element of the right sub-tree belongs to the left, the whole left sub-tree belongs to the left, and we only need to split the right sub-tree.
  • If the smallest element of the right sub-tree belongs to the right, the whole right sub-tree belongs to the right, and we only need to split the left sub-tree.

split :: Tree t -> (t->Bool) -> (Tree t, Tree t)
split Empty _ = (Empty, Empty)
split (Leaf x) isBig
  | isBig x   = (Empty, Leaf x)
  | otherwise = (Leaf x, Empty)

split (Node _ _ a b) isBig
  | isBig (smallest b)
    = let (a1,a2) = split a isBig
      in (a1, merge a2 b)
  | otherwise
    = let (b1,b2) = split b isBig
      in (merge a b1, b2)


How fast is it? Since each merge can take Θ(log n) time, are we running the risk of having split run in Θ(log2 n) time?

Fortunately, not.

Let T(h, h1, h2) be the maximum run-time of split of a tree of height h that will return two trees of heights h1, h2.

Suppose that we are in the first case of the code above: we are splitting the left sub-tree a, and then merging a2 with b. If the height of a2 is h3, then the run-time for the recursive split and the following merge is:
T(h-1 [or h-2], h1, h3) + O(h2 - h3 + 1)
Similarly in the second case.

Notice how the merge time is "paid for" by a corresponding decrease in one of the h's. In other words, when we have to do a long merge because of a small tree returned from a recursive call, it means we had less work to do in that recursive call.

By induction: T(h, h1, h2) = O(h + h1 + h2) = O(log n).

Insert, delete, etc

Now that we have split and merge, we can do anything we want, easily!

contains :: Ord t => Tree t -> t -> Bool
contains a x =
  case split a (>=x) of
    (_, Empty) -> False
    (_, b) -> smallest b == x

insert :: Ord t => Tree t -> t -> Tree t
insert a x =
  let (a1, a2) = split a (>=x)
  in merge a1 (merge (Leaf x) a2)

delete :: Ord t => Tree t -> t -> Tree t
delete a x =
  let (b, _) = split a (>=x)
      (_, c) = split a (>x)
  in merge b c

fromList :: Ord t => [t] -> Tree t
fromList = foldl insert Empty

Let's see if it works

$ ghci Tree.hs
GHCi, version 7.4.1:  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
[1 of 1] Compiling Tree             ( Tree.hs, interpreted )
Ok, modules loaded: Tree.
*Tree> let a = fromList [10, 5, 7, 18, 3]
*Tree> toList a
*Tree> toList (insert a 4)
*Tree> toList (delete a 10)
*Tree> contains a 12
*Tree> contains a 5
*Tree> let (b,c) = split a (\x -> x*x > 50)
*Tree> toList b
*Tree> toList c