In this short lesson we will discuss how to parallelize a simple and rather inefficient (because this is **not** an in-place version) implementation of quick-sort using asynchronous tasks and futures.

We will perform some benchmarking and performance analysis and we will try to understand how we can further improve our implementation.

## Quick sort

In this section, I will briefly refresh your memory on quick-sort. I will do so by showing you a simple and self-explanatory Haskell version first. We will then write a serial C++ version of the same algorithm that we will use as a basis for our parallelization.

Here is the Haskell version:

```haskell
quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
  where
    lesser  = filter (< p) xs
    greater = filter (>= p) xs
```

It is beautifully simple: in order to sort a list `(p:xs)` with at least one element, it is only necessary to partition the rest of the elements `xs` into two sublists:

`lesser`
: containing all the elements in `xs` **smaller** than `p`

`greater`
: containing all the elements in `xs` greater than or equal to `p`

Once both sublists are sorted, we can finally return the whole sorted list by simply gluing `lesser`, `p` and `greater` together in this order.

If you still have trouble understanding the quick-sort algorithm please refer to Wikipedia.

## Quick-sort serial version

The following is the serial C++ implementation of the same idea described above. It should be pretty easy to map it to the Haskell version. Run it on Wandbox

```cpp
template <typename T>
void quick_sort_serial(vector<T>& v) {
    if (v.size() <= 1) return;

    auto start_it = v.begin();
    auto end_it = v.end();
    const T pivot = *start_it;

    // partition the list
    vector<T> lesser;
    copy_if(start_it, end_it, std::back_inserter(lesser),
            [&](const T& el) { return el < pivot; });

    vector<T> greater;
    copy_if(start_it + 1, end_it, std::back_inserter(greater),
            [&](const T& el) { return el >= pivot; });

    // solve subproblems
    quick_sort_serial(lesser);
    quick_sort_serial(greater);

    // merge
    std::copy(lesser.begin(), lesser.end(), v.begin());
    v[lesser.size()] = pivot;
    std::copy(greater.begin(), greater.end(), v.begin() + lesser.size() + 1);
}
```
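As a quick usage example, here is a minimal driver. This is only a sketch, not part of the original article: it assumes the `quick_sort_serial` definition above is pasted in where indicated, and uses `using namespace std;` as the snippet above implies.

```cpp
#include <algorithm>  // copy_if, copy
#include <iostream>
#include <iterator>   // back_inserter
#include <vector>

using namespace std;

// ... quick_sort_serial<T> as defined above ...

int main() {
    vector<int> v = {2, 7, 1, 6, 9, 5, 8, 3, 4, 10};
    quick_sort_serial(v);
    for (int x : v) cout << x << ' ';  // prints: 1 2 3 4 5 6 7 8 9 10
    cout << '\n';
}
```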

## Parallelizing Quick-sort using std::future

In order to speed things up, we are going to exploit the fact that quick-sort is a divide-and-conquer algorithm: each subproblem can be solved independently.

**creating and sorting lesser and greater are two independent tasks. We can easily perform both on different threads.**

The following is the first parallel version of the `quick_sort_serial()` shown above.

Run it on Wandbox

```cpp
template <typename T>
void filter_less_than(const vector<T>& v, vector<T>& lesser, const T& pivot) {
    for (const auto& el : v) {
        if (el < pivot) lesser.push_back(el);
    }
}

template <typename T>
void quick_sort_parallel1(vector<T>& v) {
    if (v.size() <= 1) return;

    auto start_it = v.begin();
    auto end_it = v.end();
    const T pivot = *start_it;

    // create and sort lesser asynchronously, on a new thread
    vector<T> lesser;
    auto fut1 = std::async(std::launch::async, [&]() {
        filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
        quick_sort_parallel1<T>(std::ref(lesser));
    });

    // create and sort greater on the calling thread
    vector<T> greater;
    copy_if(start_it + 1, end_it, std::back_inserter(greater),
            [&](const T& el) { return el >= pivot; });
    quick_sort_parallel1(greater);

    // wait for the async task before merging
    fut1.wait();

    std::copy(lesser.begin(), lesser.end(), v.begin());
    v[lesser.size()] = pivot;
    std::copy(greater.begin(), greater.end(), v.begin() + lesser.size() + 1);
}
```

As you can notice, the creation and sorting of `lesser` and `greater` are performed in parallel, each thread running an instance of `quick_sort_parallel1()` on its own sublist.

This is exactly what we are doing when we spawn the async task shown in the code below: we are creating a task that will populate `lesser` with all the elements from `v` that are less than `pivot` and, once that is done, sort it.

Please note that everything we need to have modified by reference needs to be wrapped in a `std::ref`, as we discussed in the previous lessons.

The following picture shows how the execution unfolds for the unsorted list *[2,7,1,6,9,5,8,3,4,10]*:

The following code shows how to spawn an async thread solving the `lesser` subproblem:

```cpp
vector<T> lesser;
auto fut1 = std::async(std::launch::async, [&]() {
    filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
    quick_sort_parallel1<T>(std::ref(lesser));
});
```

While this task is running on the newly created thread, we can take care of `greater` on the calling thread.

The asynchronous task will recursively spawn other async tasks until a list of size `<=1` is created, which is of course already sorted: there is nothing to do in this case.

Once the main thread is done with sorting `greater`, it calls `std::future::wait()` on `fut1` to wait for the `lesser` task to complete before merging the two sorted sublists back into `v`.
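Before measuring anything, it is worth a quick correctness check. Here is a minimal sketch (not from the original article): it assumes both sorting functions above are in scope, `<algorithm>` and `<cassert>` are included, and the snippet lives inside `main()`.

```cpp
// Sort the same data with both versions and verify the results agree.
vector<int> a = {2, 7, 1, 6, 9, 5, 8, 3, 4, 10};
vector<int> b = a;

quick_sort_serial(a);
quick_sort_parallel1(b);

assert(std::is_sorted(a.begin(), a.end()));
assert(a == b);  // both versions must produce the same sorted sequence
```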

## Performance analysis

Let's quickly analyze our implementation. We will compare execution time for the single-thread and async-parallel versions above.
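A simple way to reproduce such a measurement is a small timing harness built on `std::chrono`. The following is only a sketch: the input size, the random generator and the 10-run average are assumptions based on the description above, and it assumes both sorting functions are in scope (the input is kept deliberately modest because the naive parallel version creates one thread per recursive call).

```cpp
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Time one call of a sorting function on a private copy of the data (milliseconds).
template <typename Sorter>
double time_sort_ms(Sorter sort_fn, std::vector<int> data) {
    const auto start = std::chrono::steady_clock::now();
    sort_fn(data);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    // Build a random input vector (size and distribution are arbitrary choices).
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> input(10000);
    for (auto& x : input) x = dist(gen);

    constexpr int runs = 10;
    double serial_ms = 0.0, parallel_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        serial_ms   += time_sort_ms(quick_sort_serial<int>, input);
        parallel_ms += time_sort_ms(quick_sort_parallel1<int>, input);
    }
    std::cout << "serial:   " << serial_ms / runs   << " ms\n"
              << "parallel: " << parallel_ms / runs << " ms\n";
}
```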

Let's start our analysis by looking at this graph depicting the execution time (average of 10 runs) for both versions above:

It might be a surprising result to see that the async parallel version is way slower than the serial one: **~55x** slower!

Why is that? The reason is that this version spawns a new thread for every recursive call, no matter how small the sublist is.

Threads are costly for the OS to manage: they use resources and need to be scheduled. For smaller sublists, the overhead of creating and managing a thread far outweighs the actual sorting work.

In order to solve this issue, we want to modify the async code above so that a new thread is spawned only when the input list `v` is larger than a certain threshold. The code below implements the aforementioned idea:

```cpp
template <typename T>
void quick_sort_async_lim(vector<T>& v) {
    if (v.size() <= 1) return;

    auto start_it = v.begin();
    auto end_it = v.end();
    const T pivot = *start_it;

    vector<T> lesser;
    vector<T> greater;
    copy_if(start_it + 1, end_it, std::back_inserter(greater),
            [&](const T& el) { return el >= pivot; });

    if (v.size() >= THRESHOLD) {
        // the list is large enough: sort lesser on a new thread
        auto fut1 = std::async(std::launch::async, [&]() {
            filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
            quick_sort_async_lim<T>(std::ref(lesser));
        });
        quick_sort_async_lim(greater);
        fut1.wait();
    } else {
        // problem is too small.
        // Do not create new threads
        copy_if(start_it, end_it, std::back_inserter(lesser),
                [&](const T& el) { return el < pivot; });
        quick_sort_async_lim(lesser);
        quick_sort_async_lim(greater);
    }

    // merge
    std::copy(lesser.begin(), lesser.end(), v.begin());
    v[lesser.size()] = pivot;
    std::copy(greater.begin(), greater.end(), v.begin() + lesser.size() + 1);
}
```

As you can notice, the only change in this optimized version is that a new thread is spawned only when the size of the input list is at least `THRESHOLD`. If the list is too small, we fall back on the classic single-thread version.
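For reference, a possible declaration of the cutoff (a sketch; the article treats `THRESHOLD` as a tunable constant, and `4000` is the value used in the experiments below):

```cpp
// One possible way to define the cutoff; 4000 is the value used in the
// experiments discussed below.
constexpr std::size_t THRESHOLD = 4000;

// usage, assuming a vector<int> v filled with unsorted data:
// quick_sort_async_lim(v);
```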

The following pictures show the results for the optimized version above with `THRESHOLD=4000`: we obtain a `~4x` speedup with respect to the serial version.

We have introduced a new parameter in our code, and we need to figure out what the best value for `THRESHOLD` is. The following graph shows the execution time (on the `y-axis`) as a function of `THRESHOLD`.
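To pick a value empirically, one could sweep a few candidate thresholds and time each one. The sketch below is only an illustration: the candidate values are arbitrary, it assumes `quick_sort_async_lim` has been refactored to take the threshold as a parameter, and it reuses the `input` vector and timing approach from the harness shown earlier.

```cpp
// Sweep a few candidate thresholds (arbitrary example values) and report the
// average sorting time for each. Assumes quick_sort_async_lim(v, threshold) is
// a small refactor of the function above that takes the cutoff as a parameter.
for (std::size_t threshold : {250, 500, 1000, 2000, 4000, 8000}) {
    double total_ms = 0.0;
    for (int run = 0; run < 10; ++run) {
        auto data = input;  // fresh unsorted copy for every run
        const auto start = std::chrono::steady_clock::now();
        quick_sort_async_lim(data, threshold);  // hypothetical two-argument version
        const auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    std::cout << "THRESHOLD=" << threshold << ": " << total_ms / 10 << " ms\n";
}
```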

## Conclusion

We have used `std::async` and `std::future` to obtain a parallel version of quick-sort that is roughly **4x** faster than the serial one. On the other hand, we have learned that running too many threads is definitely not a good idea, because each thread comes with an overhead: the OS needs to allocate resources and time to manage them.