Archive for the ‘Average’ Category

Modern C++ concurrency - parallel quick-sort with std::future

In this short lesson we will discuss how to parallelize a simple and rather inefficient (because this is not an in-place version) implementation of quick-sort using asynchronous tasks and futures.

We will perform some benchmarking and performance analysis and we will try to understand how we can further improve our implementation.

Quick sort

In this section, I will briefly refresh your memory on quick-sort. I will do so by showing you a simple and self-explicative Haskell version first. We will also write a C++ (serial) version of the same code implementation C++ that we will use as a basis for our parallelization.

Here it goes the Haskell version:

quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where
        lesser = filter (< p) xs
        greater = filter (>= p) xs

It is beautifully simple: in order to sort a list with at least, p one element is only necessary to partition sort the rest of the elements xs into two sublists:

  • lesser: containing all the elements in xs smaller than p
  • greater: containing all the elements in xs greater than p
    Once both sublists are sorted we can finally return the whole sorted list using by simply returning gluing lesser, p and greater together in this order.

If you still have trouble understanding the quick-sort algorithm please refer to Wikipedia.

Quick-sort serial version

The following is the serial C++ implementation of the same idea described above. It should be pretty easy to map the following implementation to the Haskell one. Run it on Wandbox

template <typename T>
void quick_sort_serial(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;

//partition the list
  vector<T> lesser;
  copy_if(start_it, end_it, std::back_inserter(lesser),
          [&amp;](const T&amp; el) { return el < pivot; });

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

//solve subproblems
  quick_sort_serial(lesser);
  quick_sort_serial(greater);

//merge
  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

Parallelizing Quick-sort using std::future

In order to speed-up things we are going to use the fact that quick-sort is a divide and conquer algorithm. Each subproblem can be solved independently:
creating and sorting lesser and greater are two independent tasks. We can easily perform both on different threads.

The following is the first parallel version of the quick_sort_serial() above.
Run it on Wandbox

template <typename T>
void filter_less_than(const vector<T>&amp; v, vector<T>&amp; lesser, const int pivot) {
  for (const auto el : v) {
    if (el < pivot) lesser.push_back(el);
  }
}

template <typename T>
void quick_sort_parallel1(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;
  vector<T> lesser;
  auto fut1 = std::async(
        std::launch::async, 
        [&amp;]() {
            filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
            quick_sort_parallel1<T>(std::ref(lesser));
  });

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

  quick_sort_parallel1(greater);

  fut1.wait();

  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the creation and sorting of lesser and are performed in parallel. Each thread running an instance of quick_sort_parallel1() will create another thread running quick-sort on one of the two sub-problems while the other subproblem is solved by the current thread.

This is exactly what we are doing when we spawn the following async task:
we are creating a task that will populate lesser with all the elements from v less than pivot and, once ready, it will sort it.
Please note that everything we need to have modified by reference need to be wrapped in a std::ref as we discussed in the previous lessons.

The following picture shows how the execution unfolds for the unsorted list: [2,7,1,6,9,5,8,3,4,10]:

The following code shows hot to spawn an async thread solving the lesser subproblem:

  vector<T> lesser;
  auto fut1 = std::async([&amp;]() {
    filter<T>(std::ref(v), std::ref(lesser), pivot);
    quick_sort<T>(std::ref(lesser));
  });

While this task is running on the newly created thread, we can solve greater on the current thread.

The asynchronous task will recursively spawn other async tasks until a list of size <=1 is created, which is of course already sorted. There is nothing to do in this case.

Once the main thread is done with sorting the greater list, it waits for the asynchronous task to be ready using the std:.future::wait() function.
Once wait returns, both lists are sorted, and we can proceed with merging the result and finally, here it is, we have a sorted list.

Performance analysis

Let's quickly analyze our implementation. We will compare execution time for the single-thread and async-parallel versions above.

Let's start our analysis by taking looking at this graph depicting the execution time (average of 10 runs) for both versions above:

It might be a surprising result to see that the Async parallel version is way slower than the single threaded version, ~55x slower!
Why is that? The reason is that, the parallel version creates a new thread for every single subproblem, even for the ones that are quite small.
Threads are costly to manage by the OS, they use resources and need to be scheduled. For smaller tasks the overhead caused by the additional thread is larger than the gain in performance that we might get by processing the sublist in parallel. This is exactly what is happening.

In order to solve this issue, we want to modify the async code above so that a new thread is spawned only when the input list v is larger than a certain threshold. The code below implements the aforementioned idea:

template <typename T>
void quick_sort_async_lim(vector<T>&amp; v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;
  vector<T> lesser;

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&amp;](const T&amp; el) { return el >= pivot; });

  if (v.size() >= THRESHOLD) {
    auto fut1 = std::async([&amp;]() {
      filter<T>(std::ref(v), std::ref(lesser), pivot);
      quick_sort_async_lim<T>(std::ref(lesser));
    });

    quick_sort_async_lim(greater);
    fut1.wait();

  } else {
    //problem is too small.
    //Do not create new threads
    copy_if(start_it, end_it, std::back_inserter(lesser),
            [&amp;](const T&amp; el) { return el < pivot; });
    quick_sort_async_lim(lesser);
    quick_sort_async_lim(greater);
  }

  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the only addition that this optimized version has is that a new thread is spawned only when the size of the input list is larger than THRESHOLD If the list is too small, then we fall back on the classic single-thread version.
The following pictures show the result for the optimized version above with value of THRESHOLD=4000. As you can notice the execution time drops sensibly w.r.t single thread version. We have achieved ~4x speedup with a minimal programming effort.

We have introduced a new parameter in our code, and we need to figure out what is the best value of THRESHOLD. In order to do so, let's analyze the performance of the code above for various values of the threshold.
The following graph depict the execution time for various values of THRESHOLD. Note that the y-axis is in log scale. The execution time drops quite abruptly from 0 to 300.

Conclusion

We have used std::future to parallelize the quick-sort algorithm. The async code differs slightly from the serial single thread implementation, but runs 4x faster. On the other hand, we have learned that running too many threads it is definitely not a good idea because each thread comes with an overhead: the OS needs to allocate resources and time to manage them.

Solution to the Codility Common Prime divisors Set Problem

This article discusses (a problem that I recently solved on codility ).

The core of the problem is the following:
Given two non negative integers N and M, 1 \leq M \leq N \leq 2147483647, the task is to check whether they have the same set of prime divisors.
A prime divisor of an integer P is a prime d s.t. d \times k = P for some positive k. You are given up to 6 \times 10^3 of such queries, and should return the total number of them that evaluates to true.

For instance given if N = 156 and M = 78 then our function should return *true* because the set of prime divisor of N is equal the
the set of primal divisor of M i.e. \{2,13,3\} while for N=45 and M=120 the function should return *false*.

Read On…

Tree Vertex Cover Problem

Weighted Tree Vertex Cover Problem

Vertex cover of graph is defined as s.t. . In other word a subset of the vertices such that all vertices are incident to a vertex in the vertex cover.
We will derive an algorithm for finding the weight of a minimal (yes is not unique) vertex cover for a subclass of graphs i.e. tree (which are acyclic graphs with the property that only one path between each vertex exists).

Remember that vertex cover for graphs is an NP-Complete (NP-hard and NP, hard at least as all NP problems and it is an NP problem itself) problem i.e. no deterministic polynomial tyme algorithm was discovered (if you discover one, contact me, we will be millionaire).

Tree Vertex Cover - Problem Definition

Given a weighted tree with write an algorithm for computing a vertex cover with minimum weight i.e. V' is a vertex cover and the sum of the weight of its element is minimal.

The following is the tree structure that we will use throughout the article.

template<typename T,int DEGREE>
struct node{
       array<node*,DEGREE> children;

       T data;
       int weight;

        node(const T& v , int _weight) : weight(_weight){
               data=v;
               mincover=0;
       }
};

What if the weight is equal to the degree of the node?

The first observation we can make is that the root node can weather be or not be in the vertex cover. If we include it in the solution then we are sure that all the edges from it to its children have been covered and to solve the problem we need only to compute the cover of its children (which is a simpler problem). Read On…

List Cycle Detection

List Cycle Detection

Linked list cycle detection problem is a very instructive and fun problem to reason about. This article will state the problem first and then explains how we can solve it efficiently while giving some insight on the underlying math.

A list can gets corrupted and some node can be linked by more than one node, as in the following figure. This could lead to never ending traversal of the list. So it make sense to solve the following problem:

List Cycle Detection - Cicrular List

Cicular List

List Cycle Detection - Problem Statement

  1. Given  a linked list, detect if the list is circular i.e. contains a cycle
  2. Find the starting node of the cycle (the node with two inward arrows in the figure)

The problem is easily solvable in O(n^2) time andO(n) space considering that we can visit the list from the head and store in a separate list the visited nodes. As the visit continues we check if the node we are examining was previously visited (for each node we visit we asks the following question: is that node contained in the support list already?). If yes the list is circular and that node is the start point of the cycle. If we reach the tail of the list then the list is not circular.

We can lower the complexity of this approach down toO(nlog(n)) time using a more efficient support (set) data structure (like a tree set). But we can do much better, and the rest of the article will show how to obtain O(n) time and O(1) space complexity.

List Cycle Detection - Floyd’s algorithm

This algorithm uses the fact that, like clock's hands, things iterating on a cycle at different speed will eventually meet at some point in the future. Consider two runner R_1,R_2 with velocitiesV_1,V_2=2V_1 respectively, starting running from the same point in a circular stadium. They will meet again when the slower runner reach the starting point for the second time. Why? By the time the slower one has completed half circle the faster has completed a complete cycle and by the time the slower finishes his run, arriving at the starting point again, the faster has completed a second entire cycle.

Things are a bit more complicated in the list cycle detection problem because the iterators (the two runners) do not necessarily start they race from the circular part of the list

Consider two iterators p,q with velocities v_p=1,v_q=2  respectively. Suppose the cycle has length n and that it starts at node number A < n. When the slower iterator reaches A the faster is at location 2A. How many iteration k it will take before they meet? And at which node?

The situation is described by the following congruence:

  • A + kv_p \equiv 2A + k2v_p mod(n)
  • \Rightarrow 2A + k2v_p \equiv A + kv_p \; mod(n)
  • \Rightarrow A + k2v_p \equiv kv_p \; mod(n)
  • \Rightarrow A +kv_p \equiv 0 \;mod(n)
  • \Rightarrow A +k \equiv 0 \;mod(n)

which has solution k = n-A. This means that they will meet after k=n-A iteration of the slower iterator. This means that they will meet at A nodes before the beginning of the cycle and we can use this fact to count A nodes from the beginning of the list to deduce the starting point of the cycle. Once the iterators meet in the cycle, we can move the fast iterator back to the beginning of the list and iterate forward one node per step with both iterators until they match again. When we move the fast iterator back at the head of the list, both iterators are A nodes away from the beginning of the cycle. Because of this when we move both of them by one,  they will eventually meet exactly at that node A.

Let's consider now the case when A \geq n. This means that by the time the slower iterator reaches the beginning of the cycle the faster one has completed more that a cycle. What will be the starting point for the faster one? We argue that once p reaches A, q is at node 2A but since A > n, this means that it will be at position A + (A mod(n)). We can now use similar argument to the previous case and write:

  • A + kv_p \equiv A + (A mod(n)) + k2v_p \;mod(n)
  • A + (A mod(n)) + k2v_p \equiv A + kv_p\;mod(n)
  • (A mod(n)) + kv_p mod(n) \equiv 0mod(n)
  • (A mod(n)) + k mod(n) \equiv 0 mod(n) since v_p =1

which has solution k = n-(A mod(n)). This means that the meeting point is A mod(n) nodes before the beginning of the cycle. If we do the same operations as the previous case, A < n, we obtain the same result. Iterators will meet at the beginning of the cycle. Why? Well advancing q makes p cycles possibly several times ( remember that A \geq n  ) and it will clearly stops at A+(n-A mod(n)) + A mod(n) = A +n \;(mod (n))= A.

In other words the slower pointer is at first  at node number A+(n-A mod(n)). We can writeA = bn + r where r = A \;mod(n). After A advancing steps it will be at location  A+(n-A \;mod(n)) +bn +r (mod n) Since bn \; mod (n)=0 the result follows.

As an example consider a list with a cycle of length n=4 starting at node number 10. The first part of the algorithm tells us that the nodes will meet at node 10 + 4 - 10 \: mod(4) = 12. Moving the fast pointer back to the head of the list and iterating one node per time both iterators will lead the slower point to node:

  • 12 again after advancing of 4 nodes
  • 12 again after advancing of 4 nodes
  • 10 advancing of the remaining 2 nodes.

The following illustration depict how the algorithm would work on a list of 8 nodes with a cycle of length 4 starting at node number 4. After 5 steps the slow (blue) and fast (red) iterator points to the same node i.e. node number 6.

After that the fast pointer is moved to the head of the list and both iterator are incremented by 1 until they meet again. When they do, they will meet at the beginning of the cycle.

Execution of the Floyd's algorithm on a list of 8 nodes with a cycle of length 4 starting at node 4.

Yet Another one on Fibonacci serie - Knuth multiplication

Fibonacci serie - Knuth multiplication - Java

Leonardo Fibonacci,the Italian mathematician (from Pisa, 1170, one of the most important mathematician of all time) invented the famous serie defined recursively as follows:

    \[F(n) = F(n-1) + F(n-2)\]

    \[F(0) = 0\]

    \[F(1) = 1\]

We all know that as programmers once in life (at least) we will write some piece of code to compute the n_{th} element of the serie. Recursive implementation is straightforward but inefficient, due to the exponential number of recursive calls, that make "hard" to compute numbers bigger that 40. It is very easy to come up with some iterative or memoised implementation, that lowers the complexity to O(n) and allows to efficently compute "any" number of the serie.
In this post we will derive an implementation (JAVA) that have complexity O(log_2(n))1)using results from Knuth, Fibonacci multiplication,1988. Read On…

References   [ + ]

1. using results from Knuth, Fibonacci multiplication,1988