Monday, September 17, 2012

The Wonderful World of C++11's Smart Pointers

While I was reading the C++11 standard my attention was caught by the std::weak_ptr<T> class. Let's rewind a little bit and explain what smart pointers are. C++03 introduced the std::auto_ptr<T> class, which was meant to be a basic implementation of a smart pointer. A smart pointer is an object which behaves like a pointer but doesn't need to be manually cleaned up, because the pointed area of memory is automatically freed when the pointer object is destroyed. It is quite handy, especially in those situations where you want to throw an exception and you don't want to leak any memory. 

Since its introduction, the std::auto_ptr<T> class has had several problems (a brief history of std::auto_ptr<T> I found interesting is on Scott Meyers' (C++ God level) blog). One of the main problems is that auto_ptr is not reference counted, which means that only one smart pointer can own a given memory location. Whenever an object a (of type std::auto_ptr<T>) is copied into b, the ownership of the pointer is transferred. 

That means that after the assignment the object a will be an invalid pointer (therefore it will not perform the cleanup of the memory), while the pointer b now owns the reference to the memory and should take care of performing the delete operation. This makes std::auto_ptr<T> unsuitable, for example, for use with containers. Indeed, every attempt to access element x of a std::vector<std::auto_ptr<int>> will transfer the ownership of the object in the array to the left side of the assignment, leading to disasters. 

C++11 finally solved this problem. First of all, C++11 introduces several smart pointers, exactly three: std::unique_ptr<T>, std::shared_ptr<T> and std::weak_ptr<T>.
  • std::unique_ptr<T> is very similar to std::auto_ptr<T> of C++03. However, thanks to the introduction of move semantics in C++11 (about which I promised to post something and still haven't), the implementation is safer. For example, it is now possible to have containers of unique_ptrs. In case you want more details on this, have a look at this blog post
  • std::shared_ptr<T> is an implementation of a smart pointer with reference counting. This means that, unlike std::unique_ptr<T>, a std::shared_ptr<T> can be copied and assigned. Every time the pointer is copied, the object keeps track of how many smart pointer objects are pointing to the same memory location. When a smart pointer goes out of scope and its destructor gets invoked, the object checks whether it is the last smart pointer owning a reference to that memory; if this is the case, the delete operation is invoked. 
And finally we arrive at std::weak_ptr<T>, which is the main reason why I started this post in the first place. :) When I first read the definition of a weak_ptr in the C++11 standard I said to myself... nah, you will never use this... what are the odds of you needing a weak_ptr?? And I was indeed wrong: a couple of days ago I stumbled upon a problem that really required the use of a weak_ptr, true story. 

It all started with a simple problem. I wanted to have an instance manager handling a number of objects. These objects need to refer to other instances held by the instance manager. I am not a fan of using raw pointers, therefore I started with a simple structure which turned out to be a disaster :) 

Can you see it? Yep, it always gets me. The main problem is that vectors expand during execution. When a vector expands, its content is relocated in memory. The vector begins its life with a certain capacity. Although the size is zero, the capacity is usually larger than that. Every time the push_back method is invoked the size of the vector increases. When the size reaches the capacity, the vector needs to be expanded so that more elements can be appended. The content of the vector is therefore copied over to the newly allocated memory and the original location is deleted. In my implementation I keep a pointer to an element of the vector, but surprise surprise... that pointer is going to become invalid if too many elements are pushed into the vector.

A workaround is to keep the instances in a std::vector<Instance*>. However, in the era of C++11 goodies we don't want to take care of pointers ourselves, therefore the use of smart pointers is screaming out. My second try was using shared_ptrs. Here is the implementation which uses shared_ptrs; where is the problem now?

I tweaked the code a little bit to make it clear what can go wrong with this implementation. When the instance manager goes out of scope it is freed. This means that the repos array gets deallocated. However, here something nasty happens: a circular dependency can arise among Instance objects. Indeed, we end up in a situation where the first two elements of the vector refer to each other; the situation in memory is the following:
When the first element of the array is removed, the pointer has reference count 2; the count is therefore decreased and the object is not deleted. We move on to the second element and the same repeats. We therefore end up losing any reference to the two instances and leak memory. Valgrind should be quite angry if you run this code :). What do we do now?

That's it. That's the reason why C++11 also has std::weak_ptr<T>. Its purpose is to be used in such situations to break cycles. This kind of smart pointer holds a non-owning ("weak") reference to an object that is managed by a std::shared_ptr<T>. It must be converted to a std::shared_ptr<T> in order to access the referenced object.
The weak_ptr is quite easy to use. You can construct a weak_ptr from a shared_ptr. Every time you want to access the pointed-to object you need to obtain a shared_ptr. This is done through the lock() method. The lock() method requires the object to still be valid; the validity of a weak_ptr can be tested using the expired() method. The use of weak_ptr makes Valgrind happy and concludes this post. 

C++ <3

Saturday, August 4, 2012

Speeding up quicksort with template meta-programming

As you may have realized by now, I am a little bit of a fanatic of C++ templates. I indeed have a really hard time listening to programmers saying: "I do program in C++ but I am not using templates". I swear I have heard such a statement several times in my life, and it is such a pity that a number of C++ programmers avoid templates like they are a disease. Templates are awesome and template meta-programming is way beyond awesomeness (at least for me). From my point of view (being involved in compiler design), C++ meta-programming gives me a way to interact with the compiler in a much closer way... and this allows you to test out new compiler transformations before implementing them in the compiler itself. 
Anyhow, this week I spent some spare time trying to make a point (mostly just for the sake of this post); my point is the following: "if you want speed... you shall go crazy with meta-programming". Forget your beloved C (I know several people will not agree on this statement, but seriously guys, C is just a subset of C++). Therefore, I was thinking of an algorithm I could speed up with template meta-programming and I thought the most relevant thing would be some kind of sorting algorithm. Indeed, there is a lot of research around sorting algorithms, mostly related to their computational complexity. However, most of the time people forget that their algorithm is running on top of a super-scalar architecture which, as we saw in the last post, is not that fond of conditional jumps. 

Among the myriad of sorting algorithms I chose quicksort because its complexity is O(n log n) and, in practice, because the sorting is done in place and therefore plays well with the cache, it performs better than other sorting algorithms with the same computational complexity. As proof of its relevance, note that the C library provides the qsort function (std::qsort in C++) and the C++ STL std::sort is typically implemented as introsort, a variation of quicksort. Here follows the classic implementation of quicksort:  
I tested the execution time of the three sorting routines using the following code. I generate a vector of N elements and initialize each value randomly. Afterwards I sort the same array 1 million times and sum the total execution time. At the end of the loop I divide by the number of iterations in order to obtain the execution time of a single invocation of the sorting routine. The skeleton of the testing code is the following (ITER = 1000000):
Compared to std::sort and std::qsort our method is not that bad. 

The major difference with std::qsort (which is a C implementation) is that the C version uses a memcpy operation to perform the swap. This is very inefficient, especially if compared to the C++ std::swap, which uses move semantics (another wonder introduced with C++11), eliminating any need to create a temporary copy (we will cover move semantics and R-value references in a later post). 


My implementation of quicksort is however not faster than std::sort for larger arrays; this might be caused by the std::sort function switching to the introsort strategy for such arrays. 


Nevertheless, this implementation is a good starting point to start playing with templates (YAY). Indeed, the compiler is not able to inline recursive functions; however, if we know statically the size of the array we are going to sort, we can entirely inline the computation of the quicksort, thereby optimizing the sorting for pipeline usage. It is worth noting that not all control flow statements can be eliminated, as the comparisons between array elements (which are not known at compile time) need to be performed at runtime, when we actually know the values of the array elements. 


A way of writing the template version of the quicksort algorithm is the following:
Compiling this version of the quicksort is more demanding than the previous one. Indeed, the recursive invocation is unrolled at compile time, which means the compiler needs to generate an awful amount of code. For example, compiling the quicksort for arrays of 50 elements takes 7 minutes and GCC uses around 1.3GB of main memory. The produced binary is 4MB, compared to the 27KB executable obtained when the previous sorting function is used. 


However if we run the template-based quicksort (template_sort), we obtain the following performance results:

Clearly there is an advantage for arrays of up to 25 elements. The template quicksort is 2 times faster than the standard std::sort. Note that the same algorithm is utilized; the only difference is that we optimized the pipeline usage by eliminating recursive function calls, completely unwinding the call stack at compile time. However, there is a drawback: because of the size of the generated code, the program doesn't fit into the cache anymore and therefore we under-utilize the L1 instruction cache. Up to 25 elements, though, the generated code is small enough to fit into the instruction cache and the benefits of unrolling are clearly visible. 


The nice thing is that we doubled the performance of the quicksort using a high-level language without messing with any assembly instructions. Also, the template code is not that different from the trivial implementation of the quicksort. 


C++ <3

Friday, July 27, 2012

A generic loop unroller based on template meta-programming

Loop unrolling (or unwinding) is a code transformation used by compilers to improve the utilization of the functional units present in modern super-scalar CPUs. Indeed, processors have a pipelined architecture consisting of multiple stages (a minimum of 5). While the CPU is executing the instruction in one of the stages, it can simultaneously load and decode the next operation pointed to by the program counter. However, in the presence of branch instructions, the CPU needs to wait for the decode stage in order to know whether the branch has been taken or not, so as to adjust the program counter and correctly load the next assembly instruction. Over the years, several architectural optimizations have been introduced to reduce the problem (e.g. branch prediction units); however, in specific situations the CPU can lose up to 20 cycles because of a branch instruction. 

For this reason it is very important to reduce the number of branches in the input code. This is the job of the compiler, since it is the software agent closest to the actual hardware and it can produce code which better fits the underlying CPU. However, compilers are quite complex and often they fail to apply even elementary optimizations. An example is loop unrolling. Because the compiler often fails to perform this transformation, developers, especially in High Performance Computing (HPC), tend to tune their code by manually unrolling loops. This is (in my opinion) a bad practice, since the tuned code is not portable anymore (the optimal unroll factor for one machine can be bad for another). A second problem is that by manually unrolling a loop the body is replicated many times, which is never a good thing if a bug shows up.

However, we don't necessarily have to give up performance if we want code readability and structure. With C++11 we can have them both. :) The idea is to have a meta-function, which we call unroller; this class takes care of unrolling N invocations of a generic functor, like below:

This simple function takes care of unrolling N function calls. Because we use inline, the generated code will actually not contain function calls. The next code snippet shows how we can use the unroller

And that's it. You can now control the unrolling factor using the pre-processor macro UnrollFactor, which means you can either define it in your code or provide a value through a Makefile in order to fit the target architecture. 

The next question comes naturally: how much slower is this loop going to be compared with the C-like version, which uses no lambdas and which can be unrolled by the underlying compiler? This is a legitimate question and that's why we are going to do some good old benchmarking right now! :)

The loop our unroller will fight against is the following:  
The unroller turned out to be pretty awesome. As expected, we get a 2.5x speedup. The amount of computation in the loop body is enough that we start seeing improvements from an unrolling factor of 2 or 5. However, depending on the number of iterations of the loop and the amount of computation, the best unrolling factor may change.  

C++ <3

Thursday, July 26, 2012

decltype Insanity... a.k.a. when the return type depends on the function itself?

In a previous post I introduced the basic usage of the auto and decltype keywords. In most cases the use of auto or decltype is a shortcut to avoid manually writing down the type of an object. However, these two new keywords enable new capabilities in the C++ language. 

During this week I stumbled upon one of those cases. I don't want to overload you with the details of the specific problem I was solving, therefore I will use an equivalent example just for the sake of presentation. What I want to achieve is the following: I want to write a meta-function which takes a variable number of types as input and returns a type structured as a (degenerate) binary tree. An example follows:
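The original example is missing; the intended mapping (my own illustration) is something like:

```cpp
// Four input types collapse into a degenerate, right-leaning binary
// tree of nested pairs:
//   makeTree(int, char, float, double)
//     -> std::pair<int, std::pair<char, std::pair<float, double>>>
```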
The way we want to write this in C++ is by defining a function which makes use of variadic templates (a feature introduced in C++11). In this way we can have a typed function accepting an unbounded number of arguments. However, if you start declaring the function signature, you will soon realize that you cannot write the return type of this function:
How do we get around this? First of all, we have to realize that the return type depends on the type and number of the parameters of the function. Fortunately, C++11 introduced so-called trailing return types, which, as the name suggests, enable the return type of a function to be specified after the input parameters (so that the actual function parameter types can be used in the return type). A second thing we must do, because meta-programming is heavily based on recursion, is define a termination case for the recursion, which in this case boils down to a function accepting 2 arguments and returning the composition of the two bundled into a pair object.
Now the only thing remaining is to write down the return type of this function. Basically, what we want to express in the return type is: my return type is a pair object where the first element is the type of Arg1, while the second element is the return type of the recursive invocation of the same function on the rest of the provided arguments. And, thanks to decltype, we can write it down as:
Now we can test whether everything is working by using the following simple main:
I have to admit I was quite impressed for a couple of seconds when I saw the first two tests working. I suppose at this point I have to point out which C++ compiler I am using: currently my machine has GCC 4.7.1 installed. Considering that many C++ compilers are still struggling to support variadic templates and GCC can already eat code like this, it is quite cool. As my research activity mostly focuses on compilers, I was wondering how GCC could internally determine the type of the makeTree function, considering that the type depends on itself. However, I soon found out that GCC is not that smart; indeed, if we try to invoke the function with more than 3 arguments the compiler generates the following error:
test.cpp: In function ‘int main(int, char**)’:
test.cpp:26:22: error: no matching function for call to ‘makeTree(int, int, int, int)’
test.cpp:26:22: note: candidates are:
test.cpp:8:24: note: template<class LhsTy, class RhsTy> std::pair<_T1, _T2> makeTree(const LhsTy&, const RhsTy&)
test.cpp:8:24: note:   template argument deduction/substitution failed:
test.cpp:26:22: note:   candidate expects 2 arguments, 4 provided
test.cpp:13:6: note: template<class Arg1, class Arg2, class Arg3, class ... Args> std::pair<Arg1, decltype (makeTree(arg2, arg3, makeTree::args ...))> makeTree(const Arg1&, const Arg2&, const Arg3&, const Args& ...)
test.cpp:13:6: note:   template argument deduction/substitution failed:
test.cpp: In substitution of ‘template<class Arg1, class Arg2, class Arg3, class ... Args> std::pair<Arg1, decltype (makeTree(arg2, arg3, args ...))> makeTree(const Arg1&, const Arg2&, const Arg3&, const Args& ...) [with Arg1 = int; Arg2 = int; Arg3 = int; Args = {int}]’:
test.cpp:26:22:   required from here
test.cpp:13:6: error: no matching function for call to ‘makeTree(const int&, const int&, const int&)’
test.cpp:13:6: note: candidate is:
test.cpp:8:24: note: template<class LhsTy, class RhsTy> std::pair<_T1, _T2> makeTree(const LhsTy&, const RhsTy&)
test.cpp:8:24: note:   template argument deduction/substitution failed:
test.cpp:13:6: note:   candidate expects 2 arguments, 3 provided
Which made me very sad. Now I wonder if the C++ code I wrote fails because of a faulty implementation in the GCC compiler or if what I wrote is not valid C++11 code. Comments are welcome. 

C++ <3

Wednesday, July 25, 2012

Printing tuples

One of the power tools of the C++11 standard library is tuples... a.k.a. std::tuple<Args...>. It was not possible to have such a utility in C++98 because of the lack of variadic templates. Indeed, the std::tuple object heavily relies on this feature, which was introduced with the C++11 standard.

Tuples are collections composed of heterogeneous objects of pre-arranged dimensions. A tuple can be considered a generalization of a struct's member variables. Its use is very similar to the std::pair class, which has been available since C++98; however, while pairs can only contain 2 generic elements, a tuple can be of arbitrary size. 

The standard way of using tuples in C++11 is the following:
Tuple t1 is constructed using the std::make_tuple function, a utility which eases the construction of tuples without worrying about the types of the single elements, which are instead inferred thanks to the template mechanism. Another way to build a tuple is shown for tuple t2, for which we use the tuple class constructor. It is worth noting that we use an initializer list ( { } ), which is again one of the new features of the C++11 standard (which we will cover one day). When building this tuple, the type of the first element is a const reference to the first element of tuple t1. Finally, tuple t3 is constructed using the std::tie function, which creates a tuple of lvalue references. Therefore, by writing to the first element of tuple t3 we indeed propagate the value to both t1 and t2, and the assert in the last line of the code snippet is satisfied. 


When working with tuples it is sometimes useful, for debugging purposes for example, to print their values to an output stream. This can be done by overloading the << operator; however, because access to the tuple's elements is strongly typed, we need some metaprogramming magic in order to be able to print each tuple element. The problem is the following: accessing an element of the tuple is only possible via the std::get function, which takes a constant template parameter representing the index of the value we want to access. Because this value needs to be a constant expression (otherwise the compiler would not be able to determine the return type of the function), we cannot simply iterate through the elements of the tuple using a loop iterator. What we need to do instead is use recursion:
This method is generic in the sense that it can be used to print any tuple; the output obtained by printing the tuples t1, t2 and t3 at the exit of the program is:
 
Easy, isn't it? :)


C++ <3

Tuesday, July 24, 2012

Once you go 'decltype'...

Since C++11 became available, several things have changed in C++, including the way you solve problems. It is indeed an interesting time for C++ developers, because it's time to forget old practices and embrace the new features, since C++11 is truly awesome.

I want to start this blog by introducing decltype, a new keyword of the language which, given an expression, returns its type.

For example, in C++11 you can write this:

The type of b will be derived from a, therefore it will be int. decltype accepts any expression, therefore it is also possible to write:

In this way we don't need to infer what the type of the expression would be; we let the compiler do the job, considering that it has this knowledge. Although useful, the syntax is quite cumbersome, since we have to write the expression twice, and this is probably not handy to use in real code. A better way to obtain the same result is by using the auto keyword, which is also part of the new C++11 standard. With auto, declaring b is as easy as follows:


The combination of auto and decltype allows the developer to take advantage of the type inference performed by the compiler. It also allows you to do more with much less code. An example I found myself using lately is the declaration of sets/maps, which requires explicitly providing a comparator. Before C++11 this was done basically in two ways: either by providing a function which implements the key comparator, e.g.:  

Or by using a functor object as follows:

The advantage of using a functor object is that the comparator can have state. Similar behavior can be obtained using a function only if static or global variables are used.  

In C++11, the same thing can be coded with less effort as follows:


We take advantage of the auto initializer to define a lambda function (which I will cover in the future) and, using the decltype keyword, we get its type (which would not be trivial to write down ourselves, considering that it could be implementation dependent). 


This is just a basic introduction to the wonders of decltype; there is much more to unveil. However, as we say in Italian... "you cannot start running if you didn't learn to walk in the first place". Indeed, decltype might seem pretty lame and not useful alone, but combined with other features of C++11 it becomes a very powerful tool. Stay tuned for more C++ love. 


C++ <3