Concurrency in Modern C++ - Cumulative sum of a vector using N threads

tutorial2

Modern C++ Concurrency - Cumulative sum of a vector - Part 2

In this tutorial we will continue the exercice we started out in part-1 and we will:

  1. split the work among a number of threads that will be specified by the user via command line
  2. perform some benchmarking to see how our code scales as the number of threads increases. We will compare the execution time of the version of the program running on one thread versus the execution time when running on an increasing number of threads.

In order to ensure that all threads will have a equal amount of work we will give to each thread the same number of number to sum. Suppose we have N threads, t0,t1,t2,...,t_{N-1}$, and a list of M numbers. Each thread will get C=M/Nthreads and it will operate on the following interval:

Let id(t) be the id of the thread t, start(t) and end(t) be the start and end indexes relative the the list of number on which thread t operates, then:

  1. start(t) = id(t)*C
  2. end(t) = start(t) + C

This way a thread with

  • index 0 operates on [0,C)
  • index 2 operates on [C,2*C)
  • index 3 operates on [2*C,3*C)
  • index N-1 operates on [(N-1)C,N*C)

Eventually each threads will complete and in its sum_task structure will contain the sum for that specific chunch of numbers. The main thread will then collect these result and sum them up to get the answer for our problem.

Note that we are only interested in the execution time of the computation part and not in the time spent creating the list of numbers.

The following is the full listing for this tutorial.

#include <chrono>
#include <iostream>
#include <string>  
#include <thread>  
#include <vector>
#include <numeric>
#include <chrono>

using namespace std;
using namespace std::chrono;

template <typename It>
struct sum_task {
 private:
  const It start, finish;
  unsigned long long result;

 public:
  sum_task(const It& s, const It& f) 
  : start(s), finish(f), result(0)
  {};

  void operator()() {
    auto it = start;
    while (it != finish) {
      result += *it++;
    }
  }

  unsigned long long get_result(){
    return result;
  }

};


int main(int argc, char* argv[]) {
    const unsigned long nthreads = std::stoul(argv[1]);
    const unsigned long size = std::stoul(argv[2]);

    std::vector<int> numbers(size);
    std::iota (std::begin(numbers), end(numbers), 1);


    const size_t chunck = size/nthreads;
    if(chunck < 1){
        cerr<< " there is too little work for each thread. Decrease the number of threads and try again";
        exit(1);
    }

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    using It = decltype(numbers.begin());
    vector<sum_task<It>> tasks;
    //task for the last thread is set up outside this loop
    //to accomodate for the situation where
    //nthreads does not divide size evenly
    for (unsigned i = 0; i < nthreads - 1; ++i) {
      const size_t start = i * chunck;
      const size_t end = (i + 1) * chunck;

      tasks.push_back(
        sum_task<It>(
            numbers.begin() + start,
            numbers.begin() + end
            )
        );
    }
    const size_t start = (nthreads-1) * chunck;
    tasks.push_back(
        sum_task<It>(
            numbers.begin() + start,
            numbers.end()
            )
        );

    vector<thread> threads;
    for (unsigned i = 0; i < nthreads ; ++i) {
      threads.push_back(
        thread(std::ref(tasks[i]))
        );
    }

    //calculate reference result
    unsigned long long reference_res = (size*(size+1))/2;


    //main has no more work to do 
    //it wait for completition of the child threads
    for (unsigned i = 0; i < nthreads ; ++i) {
        threads[i].join();
    }

    //threads completed
    //main threads collect all the partial 
    //results from the threads
    unsigned long long thread_res = 0;
    for (unsigned  i = 0; i < nthreads ; ++i) {
        thread_res+= tasks[i].get_result();
    }


    if(reference_res!=thread_res){
        cout<<"Oops :( .... Result is wrong"<<endl;
        return 1;
    }

    const high_resolution_clock::time_point t2 =
        high_resolution_clock::now();
    const auto duration = 
        duration_cast<microseconds>(t2 - t1).count();
    cout << "Elapsed time: " << duration <<endl;
    return 0;
}

Compilation

As usual we can compile and execute this code using the following:

clang++ -O2 -std=c++14 -g -fsanitize=address -Wall  -Wextra -pthreads -o tutorial2 tutorial2.cpp

Benchmarking

Our experiments will use a list of 500000000 numbers and are executed on a machine equipped with the 8 processors of the following type:

model name  : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
cache size  : 8192 KB

The first thing we need to to is to calculate our reference time i.e. the time needed to perform this task using only one thread.
We will run each version 10 times and we will use the average of those in our analysis.

For the single thread version:

(01:18:10)[knotman@archazzo]for i in `seq 1 10`; do ./tutorial2 1 100000000  ; done
Elapsed time: 60251
Elapsed time: 59267
Elapsed time: 60402
Elapsed time: 59053
Elapsed time: 59278
Elapsed time: 59073
Elapsed time: 59191
Elapsed time: 59085
Elapsed time: 59436
Elapsed time: 59135

The whole benchmark is then performed using the following command

(01:47:26)[knotman@archazzo]  
for i in `seq 1 16` #num threads
do 
    echo "Threads $i" 
    for j in `seq 1 10`
    do 
        ./tutorial2 $i 500000000
    done
done

The following graph contains the result of this experiment.

As you can notice, the time drastically decrease when we go from one thread to two, we gain a speedup of 1.56 but it does not scale well when the number of threads further increases.

This is due to the fact that the actual CPU workload for this task is quite small. Despite the huge number of sum operations that we have to perform, the execution time is dominated by memory operations. The CPU is mostly idle waiting for data to arrive from memory so it can finally perform the sum operation.
This type of program are said to be memory bounded.
In order for our software to scale well as the number of threads increases is it necessary to have actual CPU work to be perfomed.

In the following lessons we will write code for problems that are less gentle with the CPU and we will see better speedup scaling graphs.

Be the first to leave a comment. Don’t be shy.

Join the Discussion

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>