Friday, July 27, 2012

A generic loop unroller based on template meta-programming

Loop unrolling (or unwinding) is code transformation used by compilers to improve the utilization of functional units present in modern super-scalar CPUs. Indeed, processors have a pipelined architecture consisting of multiple staged (minimum are 5). While the CPU is executing the instruction in one of the stages he can simultaneously load and decode the next operation pointed by the program counter. However, in the presence of branch instructions, the CPU needs to wait the decode stage in order to know whether the branch has been taken or not in order to adjust the program counter and correctly load the next assembly instruction. Over the years several architectural optimizations have been introduced to reduce the problem (e.g. branch prediction units), however in specific situation the CPU can loose up to 20 cycles because of a branch instruction. 

For this reason it is very important to reduce the amount of branched in the input code. This is the job of the compiler since it is the software agent closest to the actual hardware and it can produce code which better fits the underlying CPU. However, compilers are quite complex and often they even fail in applying elementary optimizations. An example is loop unrolling. Because the compiler often fails to produce such transformation, developers, especially in High Performance Computing (HPC), tend to tune their code by manually unroll loops. This is (in my opinion) a bad practice since the tuned code is not portable anymore (the optimal unroll factor for one machine can be bad for another). A second problem is that by manually unrolling a loop the body is replicated many times which is never a good thing if a bug shows up.

However we don't necessary have to renounce to performance if we want code readability and structure. With C++11 we can have them both. :) The idea is to have an meta-function which we call unroller, this class takes care of unrolling N invocations of a generic functor like below:

This simple function takes care of unrolling N function calls. Because we use inline, the generated code will actually not contain function calls. The next code snippet shows how we can use the unroller

And that's it. You can now control the unrolling factor using the pre-processor directive UnrollFactor, which means you can either define it in your code or provide a value through a Makefile in order to fit the target architecture. 

The next question comes natural, how slower this loop is going to be compared with the C-like version which uses no lambda and which can be unrolled by the underlying compiler? This is a legittimate question and that's why we are going to do some old good benchmarking right now! :)

The loop our unroller will fight against is the following:  
The unroller turned out to be pretty awesome. As expected we have a 2.5 speedup. The amount of computation in the loop body is enough that we start seeing improvements starting from an unrolling factor of 2 or 5. However depending on the number of iteration of the loop and the amount of computation the best unrolling factor may change.  

C++ <3