Archive for the ‘Uncategorised’ Category

Modern C++ concurrency - parallel quick-sort with std::future

In this short lesson we will discuss how to parallelize a simple and rather inefficient (because this is not an in-place version) implementation of quick-sort using asynchronous tasks and futures.

We will perform some benchmarking and performance analysis and we will try to understand how we can further improve our implementation.

Quick sort

In this section, I will briefly refresh your memory on quick-sort. I will do so by showing you a simple and self-explicative Haskell version first. We will also write a C++ (serial) version of the same code implementation C++ that we will use as a basis for our parallelization.

Here it goes the Haskell version:

quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser = filter (< p) xs
        greater = filter (>= p) xs

It is beautifully simple: in order to sort a list with at least, p one element is only necessary to partition sort the rest of the elements xs into two sublists:

  • lesser: containing all the elements in xs smaller than p
  • greater: containing all the elements in xs greater than p
    Once both sublists are sorted we can finally return the whole sorted list using by simply returning gluing lesser, p and greater together in this order.

If you still have trouble understanding the quick-sort algorithm please refer to Wikipedia.

Quick-sort serial version

The following is the serial C++ implementation of the same idea described above. It should be pretty easy to map the following implementation to the Haskell one. Run it on Wandbox

template <typename T>
void quick_sort_serial(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;

//partition the list
  vector<T> lesser;
  copy_if(start_it, end_it, std::back_inserter(lesser),
          [&amp;](const T&amp; el) { return el < pivot; });

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

//solve subproblems
  quick_sort_serial(lesser);
  quick_sort_serial(greater);

//merge
  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

Parallelizing Quick-sort using std::future

In order to speed-up things we are going to use the fact that quick-sort is a divide and conquer algorithm. Each subproblem can be solved independently:
creating and sorting lesser and greater are two independent tasks. We can easily perform both on different threads.

The following is the first parallel version of the quick_sort_serial() above.
Run it on Wandbox

template <typename T>
void filter_less_than(const vector<T>&amp; v, vector<T>&amp; lesser, const int pivot) {
  for (const auto el : v) {
    if (el < pivot) lesser.push_back(el);
  }
}

template <typename T>
void quick_sort_parallel1(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;
  vector<T> lesser;
  auto fut1 = std::async(
        std::launch::async, 
        [&amp;]() {
            filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
            quick_sort_parallel1<T>(std::ref(lesser));
  });

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

  quick_sort_parallel1(greater);

  fut1.wait();

  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the creation and sorting of lesser and are performed in parallel. Each thread running an instance of quick_sort_parallel1() will create another thread running quick-sort on one of the two sub-problems while the other subproblem is solved by the current thread.

This is exactly what we are doing when we spawn the following async task:
we are creating a task that will populate lesser with all the elements from v less than pivot and, once ready, it will sort it.
Please note that everything we need to have modified by reference need to be wrapped in a std::ref as we discussed in the previous lessons.

The following picture shows how the execution unfolds for the unsorted list: [2,7,1,6,9,5,8,3,4,10]:

The following code shows hot to spawn an async thread solving the lesser subproblem:

  vector<T> lesser;
  auto fut1 = std::async([&amp;]() {
    filter<T>(std::ref(v), std::ref(lesser), pivot);
    quick_sort<T>(std::ref(lesser));
  });

While this task is running on the newly created thread, we can solve greater on the current thread.

The asynchronous task will recursively spawn other async tasks until a list of size <=1 is created, which is of course already sorted. There is nothing to do in this case.

Once the main thread is done with sorting the greater list, it waits for the asynchronous task to be ready using the std:.future::wait() function.
Once wait returns, both lists are sorted, and we can proceed with merging the result and finally, here it is, we have a sorted list.

Performance analysis

Let's quickly analyze our implementation. We will compare execution time for the single-thread and async-parallel versions above.

Let's start our analysis by taking looking at this graph depicting the execution time (average of 10 runs) for both versions above:

It might be a surprising result to see that the Async parallel version is way slower than the single threaded version, ~55x slower!
Why is that? The reason is that, the parallel version creates a new thread for every single subproblem, even for the ones that are quite small.
Threads are costly to manage by the OS, they use resources and need to be scheduled. For smaller tasks the overhead caused by the additional thread is larger than the gain in performance that we might get by processing the sublist in parallel. This is exactly what is happening.

In order to solve this issue, we want to modify the async code above so that a new thread is spawned only when the input list v is larger than a certain threshold. The code below implements the aforementioned idea:

template <typename T>
void quick_sort_async_lim(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;
  vector<T> lesser;

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

  if (v.size() >= THRESHOLD) {
    auto fut1 = std::async([&amp;]() {
      filter<T>(std::ref(v), std::ref(lesser), pivot);
      quick_sort_async_lim<T>(std::ref(lesser));
    });

    quick_sort_async_lim(greater);
    fut1.wait();

  } else {
    //problem is too small.
    //Do not create new threads
    copy_if(start_it, end_it, std::back_inserter(lesser),
            [&amp;](const T&amp; el) { return el < pivot; });
    quick_sort_async_lim(lesser);
    quick_sort_async_lim(greater);
  }

  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the only addition that this optimized version has is that a new thread is spawned only when the size of the input list is larger than THRESHOLD If the list is too small, then we fall back on the classic single-thread version.
The following pictures show the result for the optimized version above with value of THRESHOLD=4000. As you can notice the execution time drops sensibly w.r.t single thread version. We have achieved ~4x speedup with a minimal programming effort.

We have introduced a new parameter in our code, and we need to figure out what is the best value of THRESHOLD. In order to do so, let's analyze the performance of the code above for various values of the threshold.
The following graph depict the execution time for various values of THRESHOLD. Note that the y-axis is in log scale. The execution time drops quite abruptly from 0 to 300.

Conclusion

We have used std::future to parallelize the quick-sort algorithm. The async code differs slightly from the serial single thread implementation, but runs 4x faster. On the other hand, we have learned that running too many threads it is definitely not a good idea because each thread comes with an overhead: the OS needs to allocate resources and time to manage them.

Modern C++ concurrency - Returning values from Threads - std::future

Introduction

In this lesson we will talk about a way of returning values from threads, more precisely we will talk about std::future which is a mechanism that C++ offers in order to perform asynchronous tasks and query for the result in the future.
A future represents an asynchronous task, i.e. an operation that runs in parallel to the current thread and which the latter can wait (if it needs to) until the former is ready.
You can use a future all the time you need a thread to wait for a one-off event to happen. The thread can check the status of the asynchronous operation by periodically polling the future while still performing other tasks, or it can just wait for the future to become ready.

Read On…

Modern C++ Concurrency - Synchronizing threads - Condition Variables

Introduction

In the previous lesson we have seen how data can be protected using mutex. We now know how to make threads do their work concurrently without messing around with shared resources and data. But sometimes we need to synchronize their actions in time, meaning that we might want a thread t1 to wait until a certain condition is true before allowing it to continue its execution.

This lesson discusses the tools that we can use to achieve such behavior efficiently using condition variables.

Read On…

Modern C++ Concurrency - How to share data and resources between threads

In this lesson, we will cover the topic of data sharing and resources between threads. Imagine a scenario where an integer o needs to be modified by two threads t1 and t2. If we are not careful in handling this scenario a data race might occur. But what is a data race exactly?

Data Race

A data race occurs when two or more threads access some shared data and at least one of them is modifying such data. Because the threads are scheduled by OS, and scheduling is not under our control, you do not know upfront which thread is going to access the data first. The final result might depend on the order in which threads are scheduled by the OS.

Race conditions occur typically when an operation, in order to be completed, requires multiple steps or sub-operations, or the modification of multiple data. Since this sub-operations end up being executed by the CPU in different instructions, other threads can potentially mess up with the state of the data while the other's thread operation is still ongoing.

Read On…

Static and Dynamic Polymorphism - Curious Recurring Template Pattern and Mixin Classes

 A Few Words on Polymorphism

Polymorphism is the ability to assign a pointer to a base class to an instance of one its derived class. When a method is invoked on that pointer, the derived implementation, if provided, of the method is called otherwise, the inherited method is. The following is an example of such feature.

class Polygon {
  protected:
   double width, height;
  public:
   void Polygon(int a, int b) : width(a), height(b){};
  double area() const =0;
  int perimeter() const
   { return -1; }
};
};

class Rectangle: public Polygon {
public:
double area()
{ return width*height; }
int perimeter() const{ return width*2 + height*2;};
};

class Triangle: public Polygon {
public:
double area()
{ return width*height/2; }
};

int main () {
  Rectangle rect;
  Triangle trgl;
  Polygon * ppoly1 = &rect;
  Polygon * ppoly2 = &trgl;
  ppoly1->set_values (4,5);
  ppoly2->set_values (4,5);
  cout << rect.area() << '\n';
  cout << trgl.area() << '\n';
  return 0;
}

It is implemented with a cost in term of memory and time. For each class a virtual method table is stored and a pointer to it is added to the definition (transparently, by the compiler) of each class containing a virtual method (or deriving from a class that contains a virtual method). The table in turn contains pointer to the actual implementation of the virtual methods for the derived class. So the compiler only knows the object through a pointer to its base class, but it can still generate the correct code since it can indirect the calls to overridden methods via the virtual method table, then lookup for the method in the table and finally call it. So polymorphism comes at a cost of storing a virtual method table for each class, a pointer to it in each instance of a polymorphic class and two level in indirection when calling a virtual method.
Another pitfall is that since the indirection is required, usually virtual methods cannot be in-lined.

 

Curious Recurring Template Pattern

The key idea is: polymorphism without extra run-time cost. Static polymorphism.

Templates can mitigated performance problems of dynamic polymorphism via the so called static polymorphism, or simulated dynamic binding. Read On…

Tower of Hanoi - C++

Tower of Hanoi - C++

This brief article is about the tower of Hanoi. Wrote this super simple C++ for a student and thought maybe it could be helpful.

It works on the idea that in order to move n disks from pile 1 to 3 we need to first move the first n-1 disks to a support pole (choosing the right on is part of the solution, see the code for further detail), then move disk n in the correct position, and finally move the first n-1 disks from support pole to the correct location. Let the recursion does the magic!

The base case is when we have only one disk to move. Simply move the disk in the correct pile.

Tower of Hanoi - C++ Code

Read On…

Construct a binary Tree from its inorder and preorder traversal

Construct a Binary Tree from its inorder and preorder

This article will investigate the problem of reconstructing a binary tree given its inorder and preorder traversal.

Let's say for instance that we have the following binary tree (see figure)

Binary Tree

which has the following in order and preorder traversal respectively.

PRE = \{8,5,9,7,1,12,2,4,11,3\}

IN = \{9,5,1,7,2,12,8,3,11,4\}

 

Given IN and PRE how can we construct the original tree?

The key idea is to observe that the first element in PRE is the root of the tree and that the same element appears somewhere in IN, let's say at position k. This means that in order traversal has processed k element before to process the root, meaning that the number of nodes of the left tree of the root is k. Obviously, if the first k element belongs to the left subtree then all the others belongs to the right tree.

We will use this idea to write a recursive algorithm that builds the tree starting from its traversals. The algorithm works as follows: Read On…

Programming Interview Question - Set Row and Column if - C/C++

Programming Inteview Question

Question: write an algorithm that take as input a matrix of type T  and size M, a value of type T, (v) and a unary predicate P.

Then if holds, the entire rows and column are set to .

For examples if the following is used as input matrix

4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20

using  the following  equality predicate (==3) (i.e. returns true if the passed parameter is 3) and the resulting matrix is:

-1 9 14 19 24
-1 -1 -1 -1 -1
-1 7 12 17 22
-1 6 11 16 21
-1 5 10 15 20

Hint use template for make the procedure as general as possible.

Programming Inteview Question Solution

Read On…

Programming Inteview Question - Rotate Matrix - C/C++

Question: Given a square matrix of size M and type T, rotate the matrix by 90 degrees counterclockwise in place.

For example the algorithm should return the right matrix if is left one is passed as input.

0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20

 

 

 

 


 

Solution:

Read On…

Programming Interview Question- Unique Characters in a string - C++

Programming Intervew Question: Unique Characters in a String

Question: implement an algorithm to determine if a string has all unique characters

The idea behing this (fairly easy) question is that whenever we found that any chars is repeated at least twice we should return false. In order to do that we have to look at each char we are loloking for a solution which complexity is at least O(n). We are free to use any support data structure. 256 is the size of the ASCII charset.

Read On…