Saturday, September 26, 2015

Radix sort: sorting integers (often) faster than std::sort.

This post will describe a very simple integer sorting algorithm: radix sort. Despite its simplicity, it is a very practical algorithm. As we will see, even a simple implementation can easily outperform std::sort from the C++ standard library.

Suppose we start with an array of numbers such as this:
853, 872, 265, 238, 199, 772, 584, 204, 480, 173,
499, 349, 308, 314, 317, 186, 825, 398, 899, 161
Counter-intuively, we begin by sorting it based on the least significant decimal digit:
480, 161, 872, 772, 853, 173, 584, 204, 314, 265,
825, 186, 317, 238, 308, 398, 199, 499, 349, 899
Now, we sort it based on the middle decimal digit. But we take care that we do this in a stable fashion, that is: for numbers that are tied on the middle digit, keep them in the current order.
204, 308, 314, 317, 825, 238, 349, 853, 161, 265,
872, 772, 173, 480, 584, 186, 398, 199, 499, 899
The numbers are now sorted by the last two digits. It is not hard to guess what we will do next. Once we have sorted them by the most significant digit, taking care not to change the order in case of ties, we will have sorted the array.

I haven't said how exactly we perform the sorting based on a single digit, so let's do this last round slowly. We use count-sort. Here is how it works:

Step 1. Go through the data and count how many times each top digit appears. 0 appears 0 times, 1 appears 4 times, etc.:

count: 0, 4, 3, 5, 2, 1, 0, 1, 4, 0

Step 2. Compute the prefix sums in count. This will give us, for each digit, the index of the first entry with that digit in the final sorted order.

position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20
For instance, we now know that numbers starting with the digit 4 will begin at index 12.

Step 3. Shuffle the data. For each number, we simply place it directly at the correct position! After placing a number we increment the index for the given digit.
X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 4, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 7, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, X, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 8, 12, 14, 15, 15, 16, 20
X, X, X, X, 204, X, X, 308, 314, X, X, X, X, X, X, X, X, X, X, X
position: 0, 0, 5, 9, 12, 14, 15, 15, 16, 20
161, 173, 186, 199, 204, 238, 265, 308, 314, 317,
349, 398, 480, 499, 584, 772, 825, 853, 872, 899

Step 4. We have shuffled the data into a new, temporary array. Move it back to the original array. In practice, we can simply swap pointers here.

Running time

Of course in practice we don't sort based on decimal digits. We could sort based on individual bits but we can do better than that. Sort based on groups of bits.

If we sort k bits at a time, there are 2k possible "digits". The count array will need to be of that length. Hence, let's make k ≤ log2 n, so that the helper array isn't longer than the data being sorted.

For added performance, it may be useful to make k somewhat smaller than log2 n. In our implementation below, we use k = ⌊log2 n / 3⌋. This increases the number of rounds 3-fold, but has several advantages that outweigh that:
  • The count array only uses n1/3 memory.
  • Computing the prefix sums in step 2 takes negligible time.
  • Counting in step 1 doesn't randomly increment counters all over memory. It randomly increments counters in a tiny section of memory, which is good for cache performance.
  • Shuffling in step 3 doesn't randomly write all over memory. It writes consecutively in only n1/3 different locations at a time, which also improves cache performance.
For the purposes of analyzing asymptotic performance, we simply say: k = Θ(log n). If the numbers being sorted have ω bits, we have Θ(ω / log n) rounds, each round is done in Θ(n) time, hence the total running time of radix sort is:

Θ(n ω / log n)

Note what this means. The larger the n, the less time we spend per element! This is in contrast with comparison-based sorts, where we spend Θ(log n) per element, which increases with n.

This indicates that there is a threshold: for small n it is better to use a comparison-based sort. For large n, radix sort is better.

What is the threshold? It should be about when ω / log n ≈ log n, that is, when n ≈ 2√ω. For instance, when ω=64,  n ≈ 28 = 256 or so should be the threshold. If n is significantly bigger than this, radix sort should start to dominate.

C++ implementation

template<class T>
void radix_sort(vector<T> &data) {
  static_assert(numeric_limits<T>::is_integer &&
                "radix_sort only supports unsigned integer types");
  constexpr int word_bits = numeric_limits<T>::digits;

  // max_bits = floor(log n / 3)
  // num_groups = ceil(word_bits / max_bits)
  int max_bits = 1;
  while ((size_t(1) << (3 * (max_bits+1))) <= data.size()) {
  const int num_groups = (word_bits + max_bits - 1) / max_bits;

  // Temporary arrays.
  vector<size_t> count;
  vector<T> new_data(data.size());

  // Iterate over bit groups, starting from the least significant.
  for (int group = 0; group < num_groups; ++group) {
    // The current bit range.
    const int start = group * word_bits / num_groups;
    const int end = (group+1) * word_bits / num_groups;
    const T mask = (size_t(1) << (end - start)) - T(1);

    // Count the values in the current bit range.
    count.assign(size_t(1) << (end - start), 0);
    for (const T &x : data) ++count[(x >> start) & mask];

    // Compute prefix sums in count.
    size_t sum = 0;
    for (size_t &c : count) {
      size_t new_sum = sum + c;
      c = sum;
      sum = new_sum;

    // Shuffle data elements.
    for (const T &x : data) {
      size_t &pos = count[(x >> start) & mask];
      new_data[pos++] = x;

    // Move the data to the original array.


I have generated arrays of random 64-bit integers and timed the time per element it takes to sort using std::sort and radix_sort.

3.3 ns
284.2 ns
6.1 ns
91.6 ns
1 000
19.3 ns
59.8 ns
10 000
54.8 ns
46.8 ns
100 000
66.9 ns
40.1 ns
1 000 000
81.1 ns
40.8 ns
10 000 000
95.1 ns
40.7 ns
100 000 000
108.4 ns
40.6 ns

We see the effect as predicted: for std::sort, the running time per element increases with n, for radix_sort it decreases with n. It's not exactly proportional and inversely proportional to log n due to various effects (mostly cache sizes), but the trend is there. Most importantly: for large n, radix_sort is clearly winning!

Further optimizations

More optimizations are possible which can lead to improvements in performance. In particular:
  • Optimize the number of rounds as a function of n. Taking log n / 3 bits at a time is a rough guess at what should work well.
  • Currently we scan the data array twice in each iteration: once to count, a second time to shuffle. It can be reduced to a single scan: while shuffling based on the current digit, we could also be counting the next digit at the same time.
These tweaks might improve the algorithm by a constant factor. Next time, I will describe how to get a better asymptotic running time. Until then!

1 comment:

  1. Ah, the beauty of the Radix sort... I have a soft spot for it, during an interview at Microsoft I chose it when asked to implement any sorting algorithm...

    Definitely you should play with #bits, good for cache locality. Also, cache is one thing, but TLB is just as, if not more important.

    Anyway, Radix-sort has been explored to great depths, by yours truly as well :) Take this, quite old now, paper "Optimizing database architecture for the new bottleneck: memory access". And this, very recent paper from one crazy Greek guy, is just beautiful: "Rethinking SIMD Vectorization for In-Memory Databases".