Saturday, August 4, 2012

Speeding up quicksort with template meta-programming

As you may have realized by now I am a little bit fanatic of C++ templates. I indeed have really a hard time listen to programmers saying: "I do program in C++ but I am not using templates". I swear I heard such statement several times in my life, and it is such a pity there is a number of C++ programmers avoiding templates like they are a disease. Templates are awesome and template meta-programming is way beyond awesomeness (at least for me). From my point of view (involved in compiler design), C++ meta-programming gives me a way to interact with the compiler in a much closer way... and this allows you to test out new compiler transformations before implementing them in the compiler itself. 
Anyhow, this week I spend some spare time trying to make a point (mostly just for the sake of this post), my point is the following: "if you want speed... you shall go crazy with meta-programming". Forget your beloved C (I know several people will not agree on this statement, but seriously guys, C is just a subset of C++). Therefore, I was thinking an algorithm I could speedup with template meta-programming and I thought the most relevant thing would be some kind of sorting algorithm. Indeed there is lot of research around sorting algorithms, mostly related to the computational complexity of algorithm. However, most of the time people forgets that their algorithm is running on top of a super-scalar architecture which, as we seen in the last post, is not that found of conditional jumps. 

Among the myriad of sorting algorithms I choose the quicksort because it's complexity is O(n logn) and in practice, because the sorting is done in-place therefore it plays well with the cache, it performs better than other sorting algorithm with same computational complexity. As prove of its relevance note that the C library provides the std::qsort function and the C++ STL std::sort method also uses a combination of quicksort and introsort. Here it follows the classic implementation of quicksort:  
I tested the execution time of the three sorting routines using the following code. I generate a vector of N elements and initialize each value randomly. Afterwards I sort the same array for 1 million times and sum the total execution time. At the end of the loop I divide by the number of iterations in order to obtain the execution time of a single invocation of the sorting routine. The skeleton of the testing code is the following (ITER = 1000000):
Compared to std::sort and std::qsort our method is not that bad. 

The major differences with the std::qsort (which is a C implementation) is that the C version uses a memcpy operation to perform the swap operation. This is very inefficient especially if compared to the C++ std::swap which uses move semantics (another wonder introduced with C++11) eliminating any need of creating a temporary copy (we will cover move semantics and R-value references in a later post). 

My implementation of quicksort is however not faster than std::sort for larger arrays, this might be caused by std::sort function to switch to the introsort algorithm for such arrays. 

Nevertheless this implementation is a good starting point for start playing with templates (YAY). Indeed the compiler is not able to inline recursive functions, however if we know statically the size of the array we are going to sort we can inline entirely the computation of the quicksort therefore optimizing the sorting for pipeline usage. It is worth noting that not all the control flow statements can be eliminated as the comparison between array elements (which are not known at compile time) needs to be performed at runtime when we actually know the value of the array elements. 

A way of writing the template version of the quicksort algorithm is the following:
Compiling this version of the quicksort is more demanding than the previous one. Indeed the recursive invocation is unrolled at compile time which means the compiler needs to generate an awful amount of code. For example compiling the quicksort for arrays of 50 elements requires 7 minutes and GCC uses around 1.3GB of main memory. The produced binary is 4 MB compared to 27KB which is the size of the executable obtained when the previous sorting function is invoked. 

However if we run the template-based quicksort (template_sort), we obtain the following performance results:

Clearly there is an advantage for arrays up to 25 elements. The template quicksort is 2 times faster than the standard std::sort. Note that the same algorithm is utilized, the only difference is that we optimized the pipeline usage by eliminating recursive function calls by completely unwinding the call stack at compile time. However there is a drawback, because of the size of the generated code, the program doesn't fit into the cache anymore and therefore we miss-utilize L1 Instruction cache. However until 25 elements the generated code is small enough to fit into the instruction cache and the benefits of unrolling are clearly visible. 

The nice thing is that we doubled the performance of the quicksort using an high level language without messing with any assembly instruction. Also the template code is not that different than the trivial implementation of the quicksort. 

C++ <3