Loop unrolling (or unwinding) is code transformation used by compilers to improve the utilization of functional units present in modern super-scalar CPUs. Indeed, processors have a pipelined architecture consisting of multiple staged (minimum are 5). While the CPU is executing the instruction in one of the stages he can simultaneously load and decode the next operation pointed by the program counter. However, in the presence of branch instructions, the CPU needs to wait the decode stage in order to know whether the branch has been taken or not in order to adjust the program counter and correctly load the next assembly instruction. Over the years several architectural optimizations have been introduced to reduce the problem (e.g. branch prediction units), however in specific situation the CPU can loose up to 20 cycles because of a branch instruction.
For this reason it is very important to reduce the amount of branched in the input code. This is the job of the compiler since it is the software agent closest to the actual hardware and it can produce code which better fits the underlying CPU. However, compilers are quite complex and often they even fail in applying elementary optimizations. An example is loop unrolling. Because the compiler often fails to produce such transformation, developers, especially in High Performance Computing (HPC), tend to tune their code by manually unroll loops. This is (in my opinion) a bad practice since the tuned code is not portable anymore (the optimal unroll factor for one machine can be bad for another). A second problem is that by manually unrolling a loop the body is replicated many times which is never a good thing if a bug shows up.
However we don't necessary have to renounce to performance if we want code readability and structure. With C++11 we can have them both. :) The idea is to have an meta-function which we call unroller, this class takes care of unrolling N invocations of a generic functor like below:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
template <size_t N> struct uint_{ }; | |
template <size_t N, typename Lambda, typename IterT> | |
inline void unroller(const Lambda& f, const IterT& iter, uint_<N>) { | |
unroller(f, iter, uint_<N-1>()); | |
f(iter + N); | |
} | |
template <typename Lambda, typename IterT> | |
inline void unroller(const Lambda& f, const IterT& iter, uint_<0>) { | |
f(iter); | |
} |
This simple function takes care of unrolling N function calls. Because we use inline, the generated code will actually not contain function calls. The next code snippet shows how we can use the unroller:
And that's it. You can now control the unrolling factor using the pre-processor directive UnrollFactor, which means you can either define it in your code or provide a value through a Makefile in order to fit the target architecture.
The next question comes natural, how slower this loop is going to be compared with the C-like version which uses no lambda and which can be unrolled by the underlying compiler? This is a legittimate question and that's why we are going to do some old good benchmarking right now! :)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
std::vector<double> vec(N); | |
// loop body as a lambda | |
auto body = [&](const int& i) { vec[i] = i*(i-1)*4*(i+1); }; | |
assert(vec.size()%UnrollFact == 0 && "Vector size must be divisible by the Unrolling Factor"); | |
auto start = std::chrono::system_clock::now(); | |
for(size_t i = 0, size=vec.size(); i!=size; i+=UnrollFact) { | |
unroller( body, i, uint_<UnrollFact-1>() ); | |
} | |
auto end = std::chrono::system_clock::now(); | |
std::cout << "Elapsed time (in msecs): " | |
<< std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count() | |
<< std::endl; |
And that's it. You can now control the unrolling factor using the pre-processor directive UnrollFactor, which means you can either define it in your code or provide a value through a Makefile in order to fit the target architecture.
The next question comes natural, how slower this loop is going to be compared with the C-like version which uses no lambda and which can be unrolled by the underlying compiler? This is a legittimate question and that's why we are going to do some old good benchmarking right now! :)
The loop our unroller will fight against is the following:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
auto start = std::chrono::system_clock::now(); | |
for(size_t i = 0, size=vec.size(); i!=size; ++i) { | |
vec[i] = i*(i-1)*4*(i+1); | |
} | |
auto end = std::chrono::system_clock::now(); |
The unroller turned out to be pretty awesome. As expected we have a 2.5 speedup. The amount of computation in the loop body is enough that we start seeing improvements starting from an unrolling factor of 2 or 5. However depending on the number of iteration of the loop and the amount of computation the best unrolling factor may change.
C++ <3
C++ <3
No comments:
Post a Comment