C++ - Simple command line argument manager

Command-line argument management is tricky to get right, especially when the number of supported options and their possible combinations is large. For this kind of application there are already a number of very effective libraries and tools. One such library is Boost::program_options, which I highly encourage you to use, as it is awesome.

But there are other cases, such as quickly writing a prototype or when the number of options and possible configurations is small, where using Boost would be overkill. In such cases what we usually do is write ad-hoc command-line option management code that will never be reused and is tailored to the specific application at hand. This approach costs time and is boring once you have done it a couple of times.

For this reason, I wrote a super simple and minimalistic C++17 command-line argument manager which allows you to:

  • Check if a set of options has been specified
  • Get the argument of each element of a set of options (as a string).

The class is extremely easy to use. You only need to construct an instance of the class by passing the argc and argv variables, and then you are ready to go. No more need for ugly and error-prone command-line argument parsing. For instance, as you can see from the code snippet below, I check for the existence of the -h or --help options and of all the other required options. If one of -X, -O, -P is not present, or its required argument is not specified, a help message is printed. Otherwise, the arguments are retrieved and returned as a vector of strings ready to be used.

The example code is listed below or can be downloaded from here (gist on GitHub). You can also try it live on wandbox.

#include <string>
#include <algorithm>
#include <vector>
#include <iostream>
#include <optional>

void usage(const std::string& progName) {
  std::cout << progName << " -X FILE -P FILE -O DIRECTORY " << std::endl
            << "Options:"                        << std::endl
            << "-h  | --help    Print this help" << std::endl
            << "-X              Path to X file"  << std::endl
            << "-P              Path to P file"  << std::endl
            << "-O              Path to output directory." << std::endl;
}

class cmdline_args_parser {
  public:
    cmdline_args_parser(const int argc, char** argv) {
        for (int i = 0; i < argc; ++i)
            tokens.push_back(std::string(argv[i]));
    }

    std::optional<std::string> getCmdOption(const std::string& option) const {
        auto itr = std::find(tokens.begin(), tokens.end(), option);
        if (itr != tokens.end() && ++itr != tokens.end())
            return std::optional(*itr);
        return std::nullopt;
    }

    template <typename... Options>
    auto get_all_options(const Options... opts) const {
        std::vector<std::optional<std::string>> v;
        (v.push_back(getCmdOption(opts)), ...);
        return v;
    }

    bool cmdOptionExists(const std::string& option) const {
        return std::find(tokens.begin(), tokens.end(), option) != tokens.end();
    }

    template <typename... Options>
    bool all_options_exists(const Options... opts) const {
        return (... && cmdOptionExists(opts));
    }

    template <typename... Options>
    bool any_options_exists(const Options... opts) const {
        return (... || cmdOptionExists(opts));
    }

    const std::string& get_program_name() const { return tokens[0]; }

  private:
    std::vector<std::string> tokens;
};

auto process_args(const cmdline_args_parser& p) {
    std::vector<std::string> args;
    auto opts = p.get_all_options("-X", "-O", "-P");
    if (p.any_options_exists("-h", "--help") ||
        !std::all_of(opts.begin(), opts.end(), [](const auto& x) { return bool(x); })) {
        usage(p.get_program_name());
        return args;
    }
    for (const auto& opt : opts)
        args.push_back(*opt);
    return args;
}

int main(int argc, char** argv) {
    cmdline_args_parser p(argc, argv);
    auto args = process_args(p);
    // use args here...
    return 0;
}

Modern C++ concurrency - parallel quick-sort with std::future

In this short lesson we will discuss how to parallelize a simple and rather inefficient (because this is not an in-place version) implementation of quick-sort using asynchronous tasks and futures.

We will perform some benchmarking and performance analysis and we will try to understand how we can further improve our implementation.

Quick sort

In this section, I will briefly refresh your memory on quick-sort. I will do so by first showing you a simple and self-explanatory Haskell version. We will then write a serial C++ version of the same algorithm that we will use as the basis for our parallelization.

Here is the Haskell version:

quicksort [] = []
quicksort (p:xs) = (quicksort lesser) ++ [p] ++ (quicksort greater)
    where lesser  = filter (< p) xs
          greater = filter (>= p) xs

It is beautifully simple: in order to sort a list with at least one element, the head p (the pivot), it is only necessary to partition the rest of the elements xs into two sublists:

  • lesser: containing all the elements in xs smaller than p
  • greater: containing all the elements in xs greater than or equal to p

Once both sublists are sorted, we can return the whole sorted list by simply gluing lesser, p and greater together, in this order.

If you still have trouble understanding the quick-sort algorithm please refer to Wikipedia.

Quick-sort serial version

The following is the serial C++ implementation of the same idea described above. It should be pretty easy to map this implementation to the Haskell one. Run it on Wandbox.

template <typename T>
void quick_sort_serial(vector<T>& v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;

  // partition the list
  vector<T> lesser;
  copy_if(start_it + 1, end_it, std::back_inserter(lesser),
          [&](const T& el) { return el < pivot; });

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&](const T& el) { return el >= pivot; });

  // solve subproblems
  quick_sort_serial(lesser);
  quick_sort_serial(greater);

  // glue lesser, pivot and greater back together
  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

Parallelizing Quick-sort using std::future

In order to speed-up things we are going to use the fact that quick-sort is a divide and conquer algorithm. Each subproblem can be solved independently:
creating and sorting lesser and greater are two independent tasks. We can easily perform both on different threads.

The following is the first parallel version of the quick_sort_serial() above.
Run it on Wandbox

template <typename T>
void filter_less_than(const vector<T>& v, vector<T>& lesser, const T pivot) {
  for (const auto& el : v)
    if (el < pivot) lesser.push_back(el);
}

template <typename T>
void quick_sort_parallel1(vector<T>& v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;

  // create and sort lesser on a new thread
  vector<T> lesser;
  auto fut1 = std::async(std::launch::async, [&]() {
    filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
    quick_sort_parallel1(lesser);
  });

  // meanwhile, create and sort greater on the current thread
  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&](const T& el) { return el >= pivot; });
  quick_sort_parallel1(greater);

  // wait for lesser to be ready, then merge the results
  fut1.wait();
  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the creation and sorting of lesser and greater are performed in parallel. Each thread running an instance of quick_sort_parallel1() creates another thread running quick-sort on one of the two subproblems, while the other subproblem is solved by the current thread.

This is exactly what we are doing when we spawn the async task:
we create a task that populates lesser with all the elements of v less than pivot and, once that is done, sorts it.
Please note that everything we need modified by reference has to be wrapped in std::ref, as we discussed in the previous lessons.

The following picture shows how the execution unfolds for the unsorted list: [2,7,1,6,9,5,8,3,4,10]:

The following code shows how to spawn an async task solving the lesser subproblem:

  vector<T> lesser;
  auto fut1 = std::async([&]() {
    filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
    quick_sort_parallel1(lesser);
  });

While this task is running on the newly created thread, we can solve greater on the current thread.

The asynchronous task will recursively spawn other async tasks until a list of size <=1 is created, which is of course already sorted. There is nothing to do in this case.

Once the main thread is done sorting the greater list, it waits for the asynchronous task to be ready using std::future::wait().
Once wait() returns, both lists are sorted, and we can proceed with merging the results; finally, here it is, a sorted list.

Performance analysis

Let's quickly analyze our implementation. We will compare execution time for the single-thread and async-parallel versions above.

Let's start our analysis by looking at this graph depicting the execution time (average of 10 runs) for both versions above:

It might be surprising to see that the async parallel version is way slower than the single-threaded version, ~55x slower!
Why is that? The reason is that the parallel version creates a new thread for every single subproblem, even the quite small ones.
Threads are costly for the OS to manage: they use resources and need to be scheduled. For small tasks, the overhead caused by the additional thread is larger than the performance gained by processing the sublist in parallel. This is exactly what is happening here.

In order to solve this issue, we want to modify the async code above so that a new thread is spawned only when the input list v is larger than a certain threshold. The code below implements the aforementioned idea:

template <typename T>
void quick_sort_async_lim(vector<T>& v) {
  if (v.size() <= 1) return;
  auto start_it = v.begin();
  auto end_it = v.end();

  const T pivot = *start_it;
  vector<T> lesser;

  vector<T> greater;
  copy_if(start_it + 1, end_it, std::back_inserter(greater),
          [&](const T& el) { return el >= pivot; });

  if (v.size() >= THRESHOLD) {
    // problem is large enough: solve lesser asynchronously
    auto fut1 = std::async([&]() {
      filter_less_than<T>(std::ref(v), std::ref(lesser), pivot);
      quick_sort_async_lim(lesser);
    });
    quick_sort_async_lim(greater);
    fut1.wait();
  } else {
    // problem is too small.
    // Do not create new threads
    copy_if(start_it, end_it, std::back_inserter(lesser),
            [&](const T& el) { return el < pivot; });
    quick_sort_async_lim(lesser);
    quick_sort_async_lim(greater);
  }

  std::copy(lesser.begin(), lesser.end(), v.begin());
  v[lesser.size()] = pivot;
  std::copy(greater.begin(), greater.end(),
            v.begin() + lesser.size() + 1);
}

As you can notice, the only addition in this optimized version is that a new thread is spawned only when the size of the input list is larger than THRESHOLD. If the list is too small, we fall back on the classic single-threaded version.
The following picture shows the result for the optimized version above with THRESHOLD=4000. As you can notice, the execution time drops considerably with respect to the single-threaded version. We have achieved a ~4x speedup with minimal programming effort.

We have introduced a new parameter into our code, and we need to figure out the best value of THRESHOLD. In order to do so, let's analyze the performance of the code above for various values of the threshold.
The following graph depicts the execution time for various values of THRESHOLD. Note that the y-axis is in log scale. The execution time drops quite abruptly as THRESHOLD goes from 0 to 300.


We have used std::future to parallelize the quick-sort algorithm. The async code differs only slightly from the serial single-threaded implementation, but runs 4x faster. On the other hand, we have learned that running too many threads is definitely not a good idea, because each thread comes with an overhead: the OS needs to allocate resources and time to manage them.

Modern C++ concurrency - Returning values from Threads - std::future


In this lesson we will talk about a way of returning values from threads; more precisely, we will talk about std::future, a mechanism that C++ offers in order to perform asynchronous tasks and query for the result in the future.
A future represents an asynchronous task, i.e. an operation running in parallel to the current thread, which the latter can wait for (if it needs to) until the former is ready.
You can use a future whenever you need a thread to wait for a one-off event to happen. The thread can check the status of the asynchronous operation by periodically polling the future while still performing other tasks, or it can just wait for the future to become ready.

Read On…

Modern C++ Concurrency - Synchronizing threads - Condition Variables


In the previous lesson we saw how data can be protected using mutexes. We now know how to make threads do their work concurrently without messing around with shared resources and data. But sometimes we need to synchronize their actions in time, meaning that we might want a thread t1 to wait until a certain condition is true before allowing it to continue its execution.

This lesson discusses the tools that we can use to achieve such behavior efficiently using condition variables.

Read On…

Modern C++ Concurrency - How to share data and resources between threads

In this lesson, we will cover the topic of sharing data and resources between threads. Imagine a scenario where an integer o needs to be modified by two threads t1 and t2. If we are not careful in handling this scenario, a data race might occur. But what is a data race exactly?

Data Race

A data race occurs when two or more threads access some shared data and at least one of them is modifying it. Because the threads are scheduled by the OS, and scheduling is not under our control, we do not know upfront which thread is going to access the data first. The final result might depend on the order in which the threads are scheduled by the OS.

Race conditions typically occur when an operation, in order to be completed, requires multiple steps or sub-operations, or the modification of multiple data. Since these sub-operations end up being executed by the CPU as separate instructions, other threads can potentially mess with the state of the data while another thread's operation is still ongoing.

Read On…

Modern C++ Concurrency - How to use a thread object correctly and common pitfalls


Modern C++ Concurrency - How to use a thread object correctly.

This lesson is going to be more theory-focused because we will cover some important facts about how to correctly use the thread object.
For instance, we will talk about how to:

  1. Pass threads around
  2. Have side effects on objects passed to a thread by reference
  3. Avoid common dangling reference situations
  4. Identify threads uniquely by an id

Read On…

Concurrency in Modern C++ - Cumulative sum of a vector using N threads


Modern C++ Concurrency - Cumulative sum of a vector - Part 2

In this tutorial we will continue the exercise we started in part 1, and we will:

  1. split the work among a number of threads that will be specified by the user via command line
  2. perform some benchmarking to see how our code scales as the number of threads increases. We will compare the execution time of the version of the program running on one thread versus the execution time when running on an increasing number of threads.

Read On…

Concurrency in Modern C++ - Cumulative sum of a vector using two threads


Modern C++ Concurrency - Cumulative sum of a vector - Part 1

In this tutorial we will write C++ code that takes as input a large list of numbers and returns their cumulative sum.
In order to speed up the process, we will write the code so that it uses two threads. In the process we will learn how to use a callable object, with operator() redefined, to create and run a thread.

Read On…

Concurrency in Modern C++ - Hello world

Hello world Concurrency in C++

What is concurrency?

Let’s start off by answering the following question: what is concurrency? Intuitively, concurrency is the execution of operations at the same time. The key part here is at the same time.
Computers are concurrent machines. Nowadays PCs are equipped with several processors, which means we can exploit all of them at the same time to speed up our software.
Normally an executable runs on a single processor, meaning that at any given time only one of its instructions is executed.
Concurrency is all about executing several instructions at the same time within the same executable.

Hello world code

The C++11 standard introduced a new thread library that allows for standardized programming of concurrent software using threads, but it also offers a bunch of other tools to make concurrent programming safe: synchronization primitives and atomic operations, for instance. Do not worry if none of these words rings a bell yet. We will learn a lot about them.

Let’s start with writing our first concurrent code. Read On…

Solution to the Codility Common Prime divisors Set Problem

This article discusses a problem that I recently solved on Codility.

The core of the problem is the following:
Given two non-negative integers N and M, 1 \leq M \leq N \leq 2147483647, the task is to check whether they have the same set of prime divisors.
A prime divisor of an integer P is a prime d s.t. d \times k = P for some positive k. You are given up to 6 \times 10^3 such queries, and should return the total number of them that evaluate to true.

For instance, if N = 156 and M = 78 then our function should return *true* because the set of prime divisors of N, i.e. \{2,3,13\}, equals the set of prime divisors of M, while for N=45 and M=120 the function should return *false*.

Read On…

Static and Dynamic Polymorphism - Curious Recurring Template Pattern and Mixin Classes

A Few Words on Polymorphism

Polymorphism is the ability to use a pointer to a base class to refer to an instance of one of its derived classes. When a virtual method is invoked through that pointer, the derived implementation of the method, if provided, is called; otherwise the inherited one is. The following is an example of this feature.

#include <iostream>
using namespace std;

class Polygon {
  protected:
    double width, height;
  public:
    void set_values(double a, double b) { width = a; height = b; }
    virtual double area() const = 0;
    virtual int perimeter() const { return -1; }
};

class Rectangle : public Polygon {
  public:
    double area() const override { return width * height; }
    int perimeter() const override { return width * 2 + height * 2; }
};

class Triangle : public Polygon {
  public:
    double area() const override { return width * height / 2; }
};

int main() {
  Rectangle rect;
  Triangle trgl;
  Polygon* ppoly1 = &rect;
  Polygon* ppoly2 = &trgl;
  ppoly1->set_values(4, 5);
  ppoly2->set_values(4, 5);
  cout << ppoly1->area() << '\n';
  cout << ppoly2->area() << '\n';
  return 0;
}

It is implemented with a cost in terms of memory and time. For each class containing a virtual method (or deriving from a class that contains one), the compiler transparently stores a virtual method table and adds a pointer to it to the class definition. The table in turn contains pointers to the actual implementations of the virtual methods for the derived class. The compiler knows the object only through a pointer to its base class, but it can still generate correct code: it indirects calls to overridden methods via the virtual method table, looks the method up in the table, and finally calls it. So polymorphism comes at the cost of storing a virtual method table per class, a pointer to it in each instance of a polymorphic class, and two levels of indirection when calling a virtual method.
Another pitfall is that, since the indirection is required, virtual methods usually cannot be inlined.


Curious Recurring Template Pattern

The key idea is: polymorphism without the extra run-time cost. Static polymorphism.

Templates can mitigate the performance problems of dynamic polymorphism via so-called static polymorphism, or simulated dynamic binding. Read On…

Complex number in OpenCL

Complex number in OpenCL - cl_complex

Recently I've been involved in the development of OpenCAL, a library for the parallel execution of cellular automata and finite difference models.

I thought it would be fun to render some huge fractals with it, and so I ended up writing some OpenCL code for the generation of Julia sets. Unfortunately, OpenCL does not provide support for complex numbers (CUDA does, check the following link out: CUDA complex number example), so I had to write it myself.

The following might be useful to anyone needing support for complex number operations such as exponentiation, argument, modulus, etc. in OpenCL.


Here is a link to a 324 Megapixels Julia set Rendered image (warning, size >150 MB)

Julia set using OpenCL


Tree Vertex Cover Problem

Weighted Tree Vertex Cover Problem

A vertex cover of a graph G=(V,E) is a subset V' \subseteq V s.t. every edge in E is incident to at least one vertex in V'. In other words, a subset of the vertices such that all edges are covered by a vertex in the vertex cover.
We will derive an algorithm for finding the weight of a minimal (yes, it is not unique) vertex cover for a subclass of graphs, i.e. trees (which are acyclic graphs with the property that exactly one path exists between each pair of vertices).

Remember that vertex cover for general graphs is an NP-Complete problem (NP-hard, i.e. at least as hard as every problem in NP, and in NP itself), i.e. no deterministic polynomial-time algorithm is known (if you discover one, contact me, we will be millionaires).

Tree Vertex Cover - Problem Definition

Given a weighted tree, write an algorithm for computing a vertex cover with minimum weight, i.e. a subset V' of the nodes that is a vertex cover and such that the sum of the weights of its elements is minimal.

The following is the tree structure that we will use throughout the article.

template<typename T, int DEGREE>
struct node{
       array<node*, DEGREE> children;

       T data;
       int weight;

       node(const T& v, int _weight) : data(v), weight(_weight) {
           children.fill(nullptr);
       }
};

What if the weight is equal to the degree of the node?

The first observation we can make is that the root node can either be in the vertex cover or not. If we include it in the solution, then we are sure that all the edges from it to its children are covered, and to solve the problem we only need to compute the cover of its children (which is a simpler problem). Read On…

Programming Interview Question - Merge Intervals (InterviewCake#4)

Programming Interview Question - Merge Intervals

This post will explore the solutions to question #4 of the famous website Interview Cake.

Programming Interview Question - Merge Intervals - Problem Statement

Given a list of pairs of integers, return a list of merged (condensed) intervals.

Given for instance the following input list


your solution should return:


Your function should take care of corner cases, like merging two touching intervals such as (0,1),(1,2) into (0,2). Give an O(n^2) solution first; then try to solve it in O(n log(n)).

Read On…

Tower of Hanoi - C++

Tower of Hanoi - C++

This brief article is about the Tower of Hanoi. I wrote this super simple C++ implementation for a student and thought it might be helpful.

It works on the idea that, in order to move n disks from pole 1 to pole 3, we first need to move the top n-1 disks to a support pole (choosing the right one is part of the solution, see the code for further detail), then move disk n into the correct position, and finally move the n-1 disks from the support pole to the final location. Let the recursion do the magic!

The base case is when we have only one disk to move: simply move the disk to the correct pole.

Tower of Hanoi - C++ Code

Read On…

Construct a binary Tree from its inorder and preorder traversal

Construct a Binary Tree from its inorder and preorder

This article will investigate the problem of reconstructing a binary tree given its inorder and preorder traversal.

Let's say for instance that we have the following binary tree (see figure)

Binary Tree

which has the following in order and preorder traversal respectively.

PRE = \{8,5,9,7,1,12,2,4,11,3\}

IN = \{9,5,1,7,2,12,8,3,11,4\}


Given IN and PRE how can we construct the original tree?

The key idea is to observe that the first element of PRE is the root of the tree and that the same element appears somewhere in IN, say at position k. This means that the inorder traversal has processed k elements before processing the root, which tells us that the left subtree of the root contains exactly k nodes. Obviously, if the first k elements of IN belong to the left subtree, then all the remaining ones (past the root) belong to the right subtree.

We will use this idea to write a recursive algorithm that builds the tree starting from its traversals. The algorithm works as follows: Read On…

List Cycle Detection

List Cycle Detection

The linked list cycle detection problem is a very instructive and fun problem to reason about. This article states the problem first and then explains how we can solve it efficiently while giving some insight into the underlying math.

A list can get corrupted, and a node can be linked to by more than one node, as in the following figure. This can lead to a never-ending traversal of the list. So it makes sense to solve the following problem:

List Cycle Detection - Circular List

Circular List

List Cycle Detection - Problem Statement

  1. Given a linked list, detect if the list is circular, i.e. contains a cycle
  2. Find the starting node of the cycle (the node with two inward arrows in the figure)

The problem is easily solvable in O(n^2) time and O(n) space, considering that we can visit the list from the head and store the visited nodes in a separate list. As the visit continues, we check whether the node we are examining was previously visited (for each node we visit, we ask the following question: is this node contained in the support list already?). If yes, the list is circular and that node is the starting point of the cycle. If we reach the tail of the list, then the list is not circular.

We can lower the complexity of this approach down to O(n log(n)) time using a more efficient support data structure (like a tree set). But we can do much better, and the rest of the article will show how to obtain O(n) time and O(1) space complexity.

List Cycle Detection - Floyd’s algorithm

This algorithm uses the fact that, like a clock's hands, things iterating on a cycle at different speeds will eventually meet at some point in the future. Consider two runners R_1, R_2 with velocities V_1, V_2=2V_1 respectively, starting from the same point in a circular stadium. They will meet again when the slower runner reaches the starting point for the second time. Why? By the time the slower one has completed half a lap, the faster has completed a full lap, and by the time the slower finishes his run, arriving at the starting point again, the faster has completed a second entire lap.

Things are a bit more complicated in the list cycle detection problem because the iterators (the two runners) do not necessarily start their race from the circular part of the list.

Consider two iterators p, q with velocities v_p=1, v_q=2 respectively. Suppose the cycle has length n and that it starts at node number A < n. When the slower iterator reaches A, the faster is at location 2A. How many iterations k will it take before they meet? And at which node?

The situation is described by the following congruence:

  • A + kv_p \equiv 2A + 2kv_p \; mod(n)
  • \Rightarrow A + kv_p \equiv 0 \; mod(n)
  • \Rightarrow A + k \equiv 0 \; mod(n) \quad (since \; v_p = 1)

which has solution k = n-A. This means that they will meet after k=n-A iterations of the slower iterator, i.e. at A nodes before the beginning of the cycle, and we can use this fact to count A nodes from the beginning of the list to deduce the starting point of the cycle. Once the iterators meet inside the cycle, we move the fast iterator back to the beginning of the list and advance both iterators one node per step until they meet again. When we move the fast iterator back to the head of the list, both iterators are A nodes away from the beginning of the cycle. Because of this, when we advance both of them by one, they will eventually meet exactly at that node, the start of the cycle.

Let's now consider the case when A \geq n. This means that by the time the slower iterator reaches the beginning of the cycle, the faster one has completed more than one lap. What will be the starting point for the faster one? We argue that once p reaches A, q is at node A + (A \; mod(n)). We can now use an argument similar to the previous case and write:

  • A + kv_p \equiv A + (A \; mod(n)) + 2kv_p \; mod(n)
  • \Rightarrow (A \; mod(n)) + kv_p \equiv 0 \; mod(n)
  • \Rightarrow (A \; mod(n)) + k \equiv 0 \; mod(n) \quad (since \; v_p = 1)

which has solution k = n - (A \; mod(n)). This means that the meeting point is A \; mod(n) nodes before the beginning of the cycle. If we perform the same operations as in the previous case, A < n, we obtain the same result: the iterators will meet at the beginning of the cycle. Why? Well, advancing q makes p cycle possibly several times (remember that A \geq n) and it will clearly stop at A + (n - A \; mod(n)) + A \; mod(n) = A + n \; (mod(n)) = A.

In other words, the slower pointer is at first at node number A + (n - A \; mod(n)). We can write A = bn + r where r = A \; mod(n). After A advancing steps it will be at location A + (n - A \; mod(n)) + bn + r \; (mod(n)). Since bn \; mod(n) = 0, the result follows.

As an example consider a list with a cycle of length n=4 starting at node number 10. The first part of the algorithm tells us that the nodes will meet at node 10 + 4 - 10 \: mod(4) = 12. Moving the fast pointer back to the head of the list and iterating one node per time both iterators will lead the slower point to node:

  • 12 again after advancing of 4 nodes
  • 12 again after advancing of 4 nodes
  • 10 advancing of the remaining 2 nodes.

The following illustration depicts how the algorithm works on a list of 8 nodes with a cycle of length 4 starting at node number 4. After 5 steps, the slow (blue) and fast (red) iterators point to the same node, i.e. node number 6.

After that, the fast pointer is moved to the head of the list and both iterators are advanced by 1 until they meet again. When they do, they will meet at the beginning of the cycle.

Execution of the Floyd's algorithm on a list of 8 nodes with a cycle of length 4 starting at node 4.

Dynamic Message of the Day - motd - Fedora Linux

HOW-TO: Dynamic Message of the day

This article is about setting up a dynamic message of the day (possibly informative and fun) as the header of each newly opened shell.

The final result will be something like the following:


It mixes a fun message generated with fortune and cowsay, which you can install using:

sudo dnf install fortune-mod cowsay

with some informative details about the status of the system, such as:

  • System load
  • Ram and Swap available and used
  • Disk space
  • IP address


The script file can be easily configured and extended to suit your needs. Colors can also be easily customized.

Read On…

Distributed Hadoop installation - Fedora Linux

Distributed Hadoop and HBase installation - Fedora Linux

In this post I will describe how to get started with the latest versions of Hadoop and HBase, describing all the steps needed to obtain a working Hadoop installation. The steps described here can easily be reused to perform a working installation on a large cluster (even though that may require additional steps, such as a shared filesystem).


 sudo dnf install openssh openssh-askpass openssh-clients openssh-server 

Don't forget to start the ssh service using the following command:

 sudo service sshd start 

Add a dedicated Hadoop/HBase user (optional but recommended)

Read On…

Programming Interview Question - Set Row and Column if - C/C++

Programming Interview Question

Question: write an algorithm that takes as input a matrix of type T and size M, a value v of type T, and a unary predicate P.

Then, for every element m_{i,j} such that P(m_{i,j}) holds, the entire row i and column j are set to v.

For examples if the following is used as input matrix

4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20

using the equality predicate (==3) (i.e. a predicate that returns true iff the passed parameter is 3) and v = -1, the resulting matrix is:

-1 9 14 19 24
-1 -1 -1 -1 -1
-1 7 12 17 22
-1 6 11 16 21
-1 5 10 15 20

Hint: use templates to make the procedure as general as possible.
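A possible two-pass sketch (an illustrative implementation, not necessarily the author's solution; the matrix is represented as a vector of vectors):

```cpp
#include <vector>

// For every cell (i, j) where P(m[i][j]) holds, set row i and column j to v.
// First pass records the matching rows/columns, second pass overwrites them,
// so later writes cannot trigger the predicate themselves.
template <typename T, typename Pred>
void setRowsAndColumns(std::vector<std::vector<T>>& m, const T& v, Pred P) {
    const std::size_t M = m.size();
    std::vector<bool> rows(M, false), cols(M, false);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j)
            if (P(m[i][j])) { rows[i] = true; cols[j] = true; }
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j)
            if (rows[i] || cols[j]) m[i][j] = v;
}
```

On the example matrix above, with the predicate (==3) and v = -1, this marks row 1 and column 0.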

Programming Interview Question Solution

Read On…

Programming Interview Question - Rotate Matrix - C/C++

Question: Given a square matrix of size M and type T, rotate the matrix by 90 degrees counterclockwise in place.

For example, given the first matrix below as input, the algorithm should produce the second one.

Input:

0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24

Output:

4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20
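One possible in-place rotation, sketched below, works layer by layer with a four-way swap (an illustrative implementation, not necessarily the author's solution):

```cpp
#include <cstddef>

// Rotate an M x M matrix by 90 degrees counterclockwise in place.
// For each concentric layer, cycle four elements at a time:
// top <- right, right <- bottom, bottom <- left, left <- top.
template <typename T, std::size_t M>
void rotateCCW(T (&m)[M][M]) {
    for (std::size_t layer = 0; layer < M / 2; ++layer) {
        std::size_t first = layer, last = M - 1 - layer;
        for (std::size_t i = first; i < last; ++i) {
            std::size_t offset = i - first;
            T top = m[first][i];                                // save top
            m[first][i] = m[i][last];                           // right -> top
            m[i][last] = m[last][last - offset];                // bottom -> right
            m[last][last - offset] = m[last - offset][first];   // left -> bottom
            m[last - offset][first] = top;                      // top -> left
        }
    }
}
```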

Read On…

Programming Interview Question - String Permutation Test - C++

Programming Interview Question

Question: given two strings, write a method to decide if one is a permutation of the other.

This is a common question that can be easily solved if we know in advance the size of the alphabet. A straightforward approach is to sort both strings and then compare them; this has complexity O(n log n) due to the sorting. We can lower the complexity to O(n) if we reason as follows: take a boolean sequence of the same size as the alphabet, initially all false. For each char we find in the first string s, we negate the value of the corresponding bool in the sequence. We end up with true at position i if char i appeared an odd number of times and false otherwise. Now, if the other string v is a permutation of s, we all agree that it contains the same number of occurrences of each char. This means that if we repeat the same process on the boolean sequence using v as input, each bool is negated an even number of times overall, so its final value is the same as at the beginning of the process (all false). So if the sequence does not contain any true value, the strings contain the same elements with the same parities. Note that parity alone is a necessary but not sufficient condition (e.g. "aabb" and "ccdd" leave the sequence all false without being permutations of each other); keeping a counter per char instead of a single bool makes the test exact.


  1. s = "abbccd", v = "cbdabc"
  2. Apply negation using s
  3. Apply negation using v
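A minimal sketch of this procedure, using per-character counters instead of single bools (counters make the test exact, since parity alone cannot distinguish counts that differ by an even number); a 256-symbol ASCII alphabet is assumed:

```cpp
#include <array>
#include <string>

// Exact permutation test over a 256-symbol alphabet: add occurrences from s,
// cancel them with v; the strings are permutations iff every count is zero.
bool isPermutation(const std::string& s, const std::string& v) {
    if (s.size() != v.size()) return false;
    std::array<int, 256> count{};              // zero-initialized counters
    for (unsigned char c : s) ++count[c];
    for (unsigned char c : v) --count[c];
    for (int x : count)
        if (x != 0) return false;
    return true;
}
```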


A possible C++ implementation is shown here

Read On…

Programming Interview Question- Unique Characters in a string - C++

Programming Interview Question: Unique Characters in a String

Question: implement an algorithm to determine if a string has all unique characters.

The idea behind this (fairly easy) question is that whenever we find a char repeated at least twice, we should return false. Since we have to look at every char, any solution has complexity at least O(n). We are free to use any support data structure; 256 is the size of the ASCII charset.
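A minimal sketch of this idea, using a std::bitset over the 256-symbol ASCII charset as the support structure (illustrative, not necessarily the author's solution):

```cpp
#include <bitset>
#include <string>

// True iff every character in s is unique (256-symbol ASCII alphabet assumed).
bool allUnique(const std::string& s) {
    if (s.size() > 256) return false;   // pigeonhole: some char must repeat
    std::bitset<256> seen;
    for (unsigned char c : s) {
        if (seen[c]) return false;      // c was already encountered
        seen.set(c);
    }
    return true;
}
```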

Read On…

Bird Flocking on GPU - CUDA

Bird Flocking Simulation

Bird flocking is an extremely interesting natural phenomenon that has been widely studied, as witnessed by the number of papers in the literature. I will present here a work on the aggregate motion of a large number of boids in a virtual environment with the presence of predators, using CUDA as the computational framework.

Beautiful Flocking motion

Collective motion or flocking appears at different fields and scales in nature, and several mathematical tools have been developed for analyzing such motions:

  1. Organisms are treated as particles in Brownian motion combined with attraction/repulsion forces
  2. Differential equation models
  3. Agent-based models.

The model presented here is based on a work of Reynolds (1987) which is based on three key behavioral rules:

  • Cohesion: attempt to stay close to nearby flock-mates;
  • Collision avoidance: evade objects that are too close;
  • Velocity/Heading Matching: head in the same direction as nearby flock-mates

and extends it by adding predator avoidance and interactions among multiple species and bird groups.
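The three Reynolds rules can be sketched in plain C++ as follows (an illustrative host-side sketch with made-up names and weights, not the paper's CUDA implementation):

```cpp
#include <cmath>
#include <vector>

// Minimal 3D vector type for the sketch.
struct Vec3 {
    float x{}, y{}, z{};
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    Vec3 operator*(float s) const { return {x * s, y * s, z * s}; }
    float norm() const { return std::sqrt(x * x + y * y + z * z); }
};

struct Boid { Vec3 pos, vel; };

// Steering contribution of the three rules for boid b, given its visible
// neighbors; the weights blend cohesion, separation, and alignment.
Vec3 steer(const Boid& b, const std::vector<Boid>& neighbors,
           float wCohesion, float wSeparation, float wAlignment) {
    if (neighbors.empty()) return {};
    Vec3 centroid{}, separation{}, heading{};
    for (const Boid& n : neighbors) {
        centroid = centroid + n.pos;                 // cohesion: average position
        Vec3 away = b.pos - n.pos;                   // separation: push away,
        float d = away.norm();                       // stronger when closer
        if (d > 0) separation = separation + away * (1.0f / (d * d));
        heading = heading + n.vel;                   // alignment: average velocity
    }
    float inv = 1.0f / neighbors.size();
    Vec3 toCentroid = centroid * inv - b.pos;
    Vec3 matchHeading = heading * inv - b.vel;
    return toCentroid * wCohesion + separation * wSeparation
         + matchHeading * wAlignment;
}
```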

Bird Flocking Model

The environment is parameterized using a set of parameters that describe the size of the virtual environment and the duration of a timestep. What is really interesting is the set of bird parameters that describe how a bird behaves and reacts to events in its surroundings. Some notable parameters include the Field Of View (FOV), peak velocity v_p, thrust a and others (see figure).

Bird Parameters


The environment is partially observable, and the portion of space that is visible to each bird is defined by its FOV, which is defined as follows:

Let p'_n= \{p^x_n-v^x_o,p^y_n-v^y_o,p^z_n-v^z_o\} be the position vector of object n in o's frame of reference; then n is o's neighbor if and only if the following hold:

    \[\delta_s=|| p_o - p_n ||,\; \delta_s \leq d_s\]

    \[-\; \frac{s_h}{2} \leq \theta \leq \frac{s_h}{2}, \ \ \ \ - \; \frac{s_v}{2} \leq \phi \leq \frac{s_v}{2}\]

where s_h is the maximum horizontal range of view, s_v is the maximum vertical range of view and

    \[\phi = \arccos \left(\frac{p'^z_n}{\sqrt{(p'^x_n)^2 + (p'^y_n)^2 + (p'^z_n)^2}}\right)\]

    \[\theta = atan2 \left(\frac{p'^y_n}{p'^x_n} \right) \]


In formal terms {C}_b^i, the bird b's centroid at time i, is given by:

    \[C_b^i = \frac{1}{|\mathcal{N}_b|}\sum_{n=1}^{|\mathcal{N}_b|}{\vec{p}_n \frac{d_{i,j}}{d_s}} \]



which is basically a weighted average of the neighbors' positions.



A bird tries to keep a certain distance between itself and
its neighbors. Bird b's separation vector {S}_b^i at time i is given by:

    \[{S}_b^i =\begin{cases}\left[\sum_{j \in \mathcal{N}_b}{\frac{\vec{p}_b -\vec{p}_j}{||\vec{p}_b - \vec{p}_j||}} \;f_s\right] + a ,&\mbox{if } 0 < |\vec{S}_i|\leq v_p\\v_p, &\mbox{ otherwise }\end{cases} \]

where f_s determines how strong the repulsion from neighbor j is.



Bird's alignment is computed as follows

    \[\vec{A}_i = \left[\sum_{j \in \mathcal{N}'_b}{\vec{v_j}}\;f_a\right] + a, \;\;0 < |\vec{A}_i| \leq v_p\]

It is a weighted average of the neighbors' heading directions.

Other species and predator avoidance
Other-species avoidance is a behavior very similar to separation. The only difference is that only birds belonging to other species contribute to the result.

Predator avoidance is also a "flee or separation" behavior, but here we do not take into account the predator's current position; instead, birds try to "separate" from the predator's next position (predicted from the predator's current position, velocity and acceleration).

Predator Avoidance Flee/Separation Behavior

The predator avoidance vector \vec{\Gamma}_b^i is defined as follows:

    \[\vec{\Gamma}_b^i = \left[\sum_{j \in \mathcal{P}_b }{\frac{\vec{p}_i -(\vec{p}_j+\vec{v}_j)}{||\vec{p}_i - (\vec{p}_j+\vec{v}_j)||}} \;f_{p}\right] +a, \;\;0 < |\vec{\Gamma}_i| \leq v_p \notag \]


  •  \mathcal{P}_b is b's set of predators
  •  f_{p} = \begin{cases} 0 &\mbox{ if } d_{i,j} > r_p\\1 -\frac{d_{i,j}}{r_p} & \mbox{ otherwise}\end{cases}
    is the predator avoidance coefficient, where r_p is the minimum distance at which a bird starts avoiding the predator.

The model has been implemented in CUDA to speed up the simulation. The following is a short video which I used during my presentation at the PDP16 conference. The model and the implementation are described in much greater detail in the following slides (download here).

C/C++ - Byte to Number Conversion

Byte Number Conversion

For a project I'm working on these days I had to convert raw bytes to several numerical values (to and from binary format).

Bytes are represented in C/C++ as unsigned char, and std::string is often used as a byte buffer, as well as the old well-known raw arrays. I wrote a simple struct that relieves the pain of performing such conversions. It exposes two functions: fromBytes and toBytes. fromBytes takes an additional boolean parameter that takes care of different endianness.

// Author: Davide Spataro 2016
// Converts a byte buffer to type T and vice versa.

#include <cstddef>
#include <cstdio>

typedef unsigned char byte;

template <class T>
struct converter {

    static const size_t size = sizeof(T);

    union conv {
        T value;
        byte bytes[sizeof(T)];
    } c;

    // Builds a T from a raw byte buffer; endianness = true reverses the byte order.
    T fromBytes(const byte* bytes, bool endianness = false) {
        if (endianness)
            for (size_t i = 0; i < size; i++)
                c.bytes[size - 1 - i] = bytes[i];
        else
            for (size_t i = 0; i < size; i++)
                c.bytes[i] = bytes[i];
        return c.value;
    }

    // Returns a pointer to the raw bytes of value; endianness = true reverses them.
    byte* toBytes(const T& value, bool endianness = false) {
        c.value = value;
        if (endianness)
            reverse();
        return c.bytes;
    }

    void reverse() {
        for (size_t i = 0; i < size / 2; i++) {
            byte tmp = c.bytes[i];
            c.bytes[i] = c.bytes[size - 1 - i];
            c.bytes[size - 1 - i] = tmp;
        }
    }
};

template <class T>
void printHex(const T& key, size_t size) {
    for (size_t i = 0; i < size; i++) {
        if (i > 0) printf(":");
        printf("%02X", (unsigned char)key[i]);
    }
    printf("\n");
}

Usage is very simple: suppose for instance you have the following 8 bytes representing an unsigned 64-bit integer


converter<int64_t> c;
//binary value is 00:00:00:00:00:00:00:5E (8 bytes)
std::string binary = readBinary(); //read binary unsigned long long from somewhere
int64_t res =  c.fromBytes(reinterpret_cast<const unsigned char*>(binary.c_str()),true);
std::cout <<res<< " \n";

this will clearly output 94 (0x5E).

It works with most numerical types (all int flavors and floating-point types).

Music Meeting in Edinburgh

A great experience, which I hope will become a joyful habit, took place last week when a colleague of mine at the Department of Engineering at the University of Edinburgh (Kino) told me about a meeting among the researchers/musicians of the university.


I was invited to play some music, and I obviously accepted! It was great to meet new smart people and musicians from all over the world and share with them delicious food and wine. Several instruments were involved in the performances, such as viola, violin, cello and voice.

Here two (very) short videos from my performance:


The following was played to fulfill a special request from Alice (a Ph.D. colleague) from Milan.




Largest Prime Number

Largest Prime Number

Recently a team at the University of Central Missouri, headed by Curtis Cooper, announced, via press release from the Mersenne organization, the discovery of a new largest known prime: 2^74207281 - 1. The number has more than 22M digits (22,338,618, to be precise). It's so large that, writing 4 digits per centimeter, one would be able to cover the entire distance between Edinburgh and Glasgow! Here is the full number: http://www.filedropper.com/largestprimenumber.

The following is a Haskell micro-script for computing the number and writing it to a file (Haskell because its lazy input/output allows writing big files to disk without any concern about memory).

import Data.Time
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs  -- output filename from the command line
  let n = 2 ^ 74207281 - 1 :: Integer
  startTime <- getCurrentTime
  writeFile path (show n)
  stopTime <- getCurrentTime
  putStrLn ("ElapsedTime: " ++ show (diffUTCTime stopTime startTime))

Compile using: ghc --make

and execute passing the output filename as the only parameter.

The full number file is ~22MB; not surprising, as one char occupies one byte in memory.

Here the first digits of the number:

268344381703937005859988258738844104703265786972872467031538046586054465054455 ....


Programming Question - Integer Parity

Programming Question: Given an integer, compute its parity (easy)
The parity of an integer is true iff the number of set bits (1s) is odd, false otherwise. Example: 1234_{10} = 010011010010_2 has 5 set bits and hence its parity is true.
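One possible solution, sketched below, uses the classic XOR-folding trick (an illustrative implementation, not the only approach):

```cpp
#include <cstdint>

// Parity of an integer: true iff the number of set bits is odd.
// Each fold XORs the two halves of the word, so after all folds the
// lowest bit holds the parity of all 64 bits.
bool parity(std::uint64_t x) {
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
```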



Programming Question - Compute GCD

Programming Question: Compute the Greatest Common Divisor of two integers (easy)

This question is divided in two parts: you will first be asked to write a function that computes the GCD without any space/time constraints. The second part is more challenging, asking you not to use multiplication, division, or addition.

Part 1: write a function that takes two integers as input and returns their GCD using the famous Euclidean algorithm.
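A minimal sketch of the iterative Euclidean algorithm (illustrative, not necessarily the intended solution):

```cpp
// Euclid's algorithm: gcd(a, b) = gcd(b, a mod b), and gcd(a, 0) = a.
unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned r = a % b;  // remainder becomes the new second operand
        a = b;
        b = r;
    }
    return a;
}
```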

Part 2: compute the Greatest Common Divisor of two integers without using multiplication, addition, or division operators.
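One possible approach for Part 2, as a sketch: replace the modulo with repeated subtraction, which is not among the forbidden operators (illustrative, not necessarily the intended solution):

```cpp
// Subtraction-based Euclid: gcd(a, b) = gcd(a - b, b) when a > b.
// Uses only comparison and subtraction, i.e. no *, /, or +.
unsigned gcdNoDiv(unsigned a, unsigned b) {
    if (a == 0) return b;
    if (b == 0) return a;
    while (a != b) {
        if (a > b) a -= b;
        else       b -= a;
    }
    return a;  // when both values meet, they equal the GCD
}
```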