Complex number in OpenCL

Complex number in OpenCL - cl_complex

Recently I've been involved in the developing of a library OpenCAL capable of  parallel execution of cellular automata  and finite differences models.

I though it could have been fun to render some huge fractal with it and so I ended up writing some OpenCL code for the generation of Julia sets. Unfortunately OpenCL does not provide support for complex number (CUDA does, check the following link out: CUDA complex number example)  so I had to write it myself.

The following might be useful to anyone with need of support for complex number operations as exponentiation, argument, modulus etc. in OpenCL.


Here is a link to a 324 Megapixels Julia set Rendered image (warning, size >150 MB)

Julia set using OpenCL


Tree Vertex Cover Problem

Weighted Tree Vertex Cover Problem

Vertex cover of graph  G=(V,E) is defined as V'\subseteq V s.t.  \forall (u,v) \in E u \in V' \vee v \in V'. In other word a subset of the vertices such that all vertices are incident to a vertex in the vertex cover.
We will derive an algorithm for finding the weight of a minimal (yes is not unique) vertex cover for a subclass of graphs i.e. tree (which are acyclic graphs with the property that only one path between each vertex exists).

Remember that vertex cover for graphs is an NP-Complete (NP-hard and NP, hard at least as all NP problems and it is an NP problem itself) problem i.e. no deterministic polynomial tyme algorithm was discovered (if you discover one, contact me, we will be millionaire).

Tree Vertex Cover - Problem Definition

Given a weighted tree T = (V,E,f) with f : \mathcal{N^+} \rightarrow \mathcal{N^+} write an algorithm for computing a vertex cover with minimum weight i.e. V' is a vertex cover and the sum of the weight of its element is minimal.

The following is the tree structure that we will use throughout the article.

template<typename T,int DEGREE>
struct node{
       array<node*,DEGREE> children;

       T data;
       int weight;

        node(const T& v , int _weight) : weight(_weight){

What if the weight is equal to the degree of the node?

The first observation we can make is that the root node can weather be or not be in the vertex cover. If we include it in the solution then we are sure that all the edges from it to its children have been covered and to solve the problem we need only to compute the cover of its children (which is a simpler problem). Read On…

Programming Interview Question - Merge Intervals (InterviewCake#4)

Programming Interview Question - Merge Intervals

This post will explore the solutions to the question #4 of the famous website cakeinreview (here the link).

Programming Interview Question - Merge Intervals - Problem Statement

Given a list of pair of integers return a list of merged or condensed intervals.

Given for instance the following input list


your solution should return:


Your function should take care of corner cases like merging two intervals like the following (0,1),(1,2) in (0,2). Give a O(n^2) solution first. Then try to solve it in O(nlog(n)).

Read On…

Tower of Hanoi - C++

Tower of Hanoi - C++

This brief article is about the tower of Hanoi. Wrote this super simple C++ for a student and thought maybe it could be helpful.

It works on the idea that in order to move n disks from pile 1 to 3 we need to first move the first n-1 disks to a support pole (choosing the right on is part of the solution, see the code for further detail), then move disk n in the correct position, and finally move the first n-1 disks from support pole to the correct location. Let the recursion does the magic!

The base case is when we have only one disk to move. Simply move the disk in the correct pile.

Tower of Hanoi - C++ Code

Read On…

Construct a binary Tree from its inorder and preorder traversal

Construct a Binary Tree from its inorder and preorder

This article will investigate the problem of reconstructing a binary tree given its inorder and preorder traversal.

Let's say for instance that we have the following binary tree (see figure)

Binary Tree

which has the following in order and preorder traversal respectively.

PRE = \{8,5,9,7,1,12,2,4,11,3\}

IN = \{9,5,1,7,2,12,8,3,11,4\}


Given IN and PRE how can we construct the original tree?

The key idea is to observe that the first element in PRE is the root of the tree and that the same element appears somewhere in IN, let's say at position k. This means that in order traversal has processed k element before to process the root, meaning that the number of nodes of the left tree of the root is k. Obviously, if the first k element belongs to the left subtree then all the others belongs to the right tree.

We will use this idea to write a recursive algorithm that builds the tree starting from its traversals. The algorithm works as follows: Read On…

List Cycle Detection

List Cycle Detection

Linked list cycle detection problem is a very instructive and fun problem to reason about. This article will state the problem and explains how we can solve it efficiently while giving some insight on the underlying maths.

Well, list can gets corrupted and some node can be linked by one or more nodes (as in the following figure). This could lead to never ending traversal of the list. So it make sense to solve the following problem:

List Cycle Detection - Cicrular List

Cicular List

List Cycle Detection - Problem Statement

  1. Given  a linked list detect (if any) check if the list is circular i.e. contains a cycle
  2. Find the starting node of the cycle (the node with two inward arrows in the figure)

The problem is easily solvable in O(n^2) time andO(n) space considering that we can visit the list from the head and store in a separate list the visited nodes. As the visit continues we check if the node we are examining was previously visited (is that node contained in the support list?). If yes the list is circular and that node is the start point of the cycle. If we reach the tail of the list then the list is not circular. We can lower the complexity of this approach down toO(nlog(n)) time using a more efficient support (set) data structure (like a tree set). But we can do much better, and the rest of the article will show how to obtain O(n) time and O(1).

List Cycle Detection - Floyd’s algorithm

We use the fact that like clock's hands things iterating on a cicle will meet at some point in the future. Consider two runner R_1,R_2 with velocitiesV_1,V_2=2V_1 starting running from the same point in a circular stadium. They will meet again when the slower runner reach the starting point for the second time (Why? By the time the slower one has completed half circle the faster has completed a complete cycle and by the time the slower finishes his run, arriving at the starting point again, the faster has completed a second entire cycle).

Things are a bit more complicated in the List Cycle Detection problems because the iterators do not necessarily start from the same point. Consider two iterators p,q with velocities v_p=1,v_q=2v_p = 2  respectively. Suppose the cycle has length n and that it starts at node number A < n When the slower iterator reaches A the faster is at location 2A. How many iteration k it will take before they meet? And at which node?

The situation is described by the following congruence

  • A + kv_p \equiv 2A + k2v_p mod(n)
  • \Rightarrow 2A + k2v_p \equiv A + kv_p \; mod(n)
  • \Rightarrow A + k2v_p \equiv kv_p \; mod(n)
  • \Rightarrow A +kv_p \equiv 0 \;mod(n)
  • \Rightarrow A +k \equiv 0 \;mod(n)

which has solution k = n-A. This means that they will meet after k=n-A iteration of the slower iterator and since it starts from A this means that they will meet at A nodes before the beginning of the cycle. We can use that fact to count A nodes from the beginning of the list. Once the iterators matche in the cycle, we can move the fast iterator back to the beginning of the list and iterate forward one node per stepo with both iterators until they match again (They are far away A nodes from the beginning of the cycle, so they will meet exaclty in that point.


Let's consider now the case when A \geq n. This means that by the time the slower iterator reaches the beginning of the cycle the faster one has completed more that a cycle. What will be the starting point for the faster one? We argue that once p reaches A q is at node 2A but since A > n this means that some it will be at position A + (A mod(n)). We can now use similar argument to the previous case and write:

  • A + kv_p \equiv A + (A mod(n)) + k2v_p \;mod(n)
  • A + (A mod(n)) + k2v_p \equiv A + kv_p\;mod(n)
  •  (A mod(n)) + kv_p mod(n) \equiv 0mod(n)
  •  (A mod(n)) + k mod(n) \equiv 0 mod(n) since v_p =1

which has solution k = n-(A mod(n)). This means that the meetpoint is A mod(n) nodes before the beginning of the cycle. If we do the sape operation as the previous case we obtain the sme result. Itrators wil meet at the beginning of the cycle. Why? Well advancing q makes p cycles possibly several times ( remember that A \geq n  ) and it will clearly stops at  A+(n-A mod(n)) + A mod(n) = A +n \;(mod (n))= A.

In other words the slower pointer is at first  at node number A+(n-A mod(n)). We can write A = bn + r where r = A \;mod(n). After A advancing steps it will be at location   A+(n-A \;mod(n)) +bn +r (mod n) Since bn \; mod (n)=0 the result follows.

As an example consider in which we have a list with a cycle of length n=4 and starts at node number 10. The first part of the algorithm tells us that the nodes will meet at node 10 + 4 - 10 \: mod(4) = 12. Moving the fast pointer back to te head of the list and iterating one node per time both iterators will lead the slower point to node:

  • 12 again after advancing of 4 nodes
  • 12 again after advancing of 4 nodes
  • 10 advancing of the remaining 2 nodes.

Dynamic Message of the Day - motd - Fedora Linux

HOW-TO: Dynamic Message of the day

This article is  about setting up a dynamic message of the day (possibly informative and fun) as header of each newly opened shell.

The final result will be something like the following:


It mixes a fun message using fortune and cowsay which you can install using

sudo dnf install fortune-mod cowsay

utility and some informative info about the status of the system as:

  • System load
  • Ram and Swap available and used
  • Disk space
  • Ip address


The script file can be easily configured and extended to suits the your needs. Colors can also be easily customized.

Read On…

Distributed Hadoop installation - Fedora Linux

Distributed  Hadoop and HBase installation - Fedora Linux

In this post I will describe how to get started with the latest versions of hadoop and hbase describing all the step to obtain a working hadoop installation. The steps described here can be easily used to perform a working installation on a large cluster (even tough it can requires additional steps as shared filesystem for instance).


 sudo dnf install openssh openssh-askpass openssh-clients openssh-server 

Don't forget to start the ssh service using the following command:

 sudo service sshd start 

Add a dedicated Hadoop/HBase User (optional but reccomended)

Read On…

Programming Interview Question - Set Row and Column if - C/C++

Programming Inteview Question

Question: write an algorithm that take as input a matrix of type T  and size M \times NM, a value of type T, (v) and a unary predicate P.

Then if  p(M[i,j]) holds, the entire rows i and column j are set to v.

For examples if the following is used as input matrix

4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20

using  the following  equality predicate (==3) (i.e. returns true if the passed parameter is 3) and v=-1 the resulting matrix is:

-1 9 14 19 24
-1 -1 -1 -1 -1
-1 7 12 17 22
-1 6 11 16 21
-1 5 10 15 20

Hint use template for make the procedure as general as possible.

Programming Inteview Question Solution

Read On…

Programming Inteview Question - Rotate Matrix - C/C++

Question: Given a square matrix of size M and type T, rotate the matrix by 90 degrees counterclockwise in place.

For example the algorithm should return the right matrix if is left one is passed as input.

0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
4 9 14 19 24
3 8 13 18 23
2 7 12 17 22
1 6 11 16 21
0 5 10 15 20







Read On…

Programming Interview Question - String Permutation Test - C++

Programming Interview Question

Question: given two string write a method to decide if one is a permutation of the other.

This is a common question that could be easily solved if we know in advance which is the size of the alphabeth. A straightforward approach could be to sort both strings and then compare them. This approach has complexity  O(nlog(n)) due to sorting. We can do better and lower the complexity to  O(n) if we reason as follows: given a boolean sequence of the same size of the alphabet  initially false initialized.  Then what happen if  for each char we found in the  string s we negate the value of the correnspoding bool in the sequence? We end up having the True at position i if the char i appeared an odd number of times and False otherwise. Now if the other string v is a permutation of the first one, we all agree that for each char in s it contains the same number of occurrences. This means that if we apply the same process as before on the boolean sequence using as input the string v each bool is being negated an even number of times and this means that its value would be the same as the beginning of the process (all False). So if the sequence does not contains any true value, than it means that the strings contains the same elements i.e. they are permutation of each other.


  1. s:"abbccd" , v: "cbdabc" , S=\{False,False,False,False,False, False, False ...\}
  2. Apply negation using s
    • S=\{True,False,False,True,False, False, False \ldots\}
  3. Apply negation using v
    • S=\{False,False,False,False,False, False, False ...\}


A possible C++ implementation is shown here

Read On…

Programming Interview Question- Unique Characters in a string - C++

Programming Intervew Question: Unique Characters in a String

Question: implement an algorithm to determine if a string has all unique characters

The idea behing this (fairly easy) question is that whenever we found that any chars is repeated at least twice we should return false. In order to do that we have to look at each char we are loloking for a solution which complexity is at least O(n). We are free to use any support data structure. 256 is the size of the ASCII charset.

Read On…

Bird Flocking on GPU - CUDA

Bird Flocking Simulation

Bird flocking is an extremely interesting natural phenomenon that have been widely studied as witnessed by the number of papers in literature). I will present here a work on aggregate motion of large number of boids in a virtual environment with the presence of predators using CUDA as computational framework.

Beautiful Flocking motion

Collective motion or flocking appears at different fields and scales in nature and several mathematical tools have been developed for analyzing such motions

  1. organism are threaten as particles in Brownian motion combined with attraction/repulsion forces
  2. Differential equation models
  3. Agent based models.

The model presented here is based on a work of Reynolds (1987) which is based on three key behavioral rules:

  • Cohesion: to attempt to stay close to nearby flock-mates;
  • Collision avoidance: to evade objects that are too close;
  • Velocity/Heading Matching: to head in the same direction of nearby flock-mates

and extends it adding predator avoidance and multiple species and bird groups interaction

Bird Flocking Model

The environment is parameterized using the following a set of parameters that describe the size of the virtual environment and the duration of a timestep. What is really interesting is the set of bird's parameter that describe how a bird behave and react to event in its surrounding. Some notable parameters  include the Fyeld Of View (FOV), peak velocity $v_p$, thrust $a$ and others (see figure).

Bird Parameters

Bird Parameters


The environment is partially observable and the portion of space that is visible to each bird is defined by its FOV. FOV is defined as follows:

Let  p'_n= \{p^x_n-v^x_o,p^y_n-v^y_o,p^z_n-v^z_o\} the position vector of the object n in the o's frame of reference, then n is o's neighbor if and only if the followings hold:

\delta_s=|| p_o - p_n ||,\; \delta_s \leq d_s

-\; \frac{s_h}{2} \leq \theta \leq \frac{s_h}{2}, \ \ \ \ - \; \frac{s_v}{2} \leq \phi \leq \frac{s_v}{2}

where s_h is the maximum horizontal range of view, $s_v$ is the maximum vertical range of view and

\phi = \arccos \left(\frac{p'^z_n}{\sqrt{(p'^x_n)^2 + (p'^y_n)^2 + (p'^z_n)^2}}\right)

\theta = atan2 \left(\frac{p'^y_n}{p'^x_n} \right)


In formal terms {C}_b^i, the bird b's centroid at time i, is given by:

C_b^i = \frac{1}{|\mathcal{N}_b|}\sum_{n=1}^{|\mathcal{N}_b|}{\vec{p}_n \frac{d_{i,j}}{d_s}}



which is basically a weighted average of neighbors position.



A bird try to keep certain distance between itself and
its neighbors. Bird b's separation vector {S}_b^i at time i is given

{S}_b^i =\begin{cases}\left[\sum_{j \in \mathcal{N}_b}{\frac{\vec{p}_b -\vec{p}_j}{||\vec{p}_b - \vec{p}_j||}} \;f_s\right] + a ,&\mbox{if } 0 < |\vec{S}_i|\leq v_p\\v_p, &\mbox{ otherwise }\end{cases}

where fs determines how strong is the repulsion against to the neighbor j.



Bird's alignment is computed as follows

\vec{A}_i = \left[\sum_{j \in \mathcal{N}'_b}{\vec{v_j}}\;f_a\right] + a, \;\;0 < |\vec{A}_i| \leq v_p

It is a weighted average of the neighbors's heading direction.

Other species and predator avoidance
Other species avoidance is a behavior pretty much similar to the separation. The only difference is that only birds that belong to other species contribute to the result.

Predator avoidance is also a "flee or separation" behavior, but what happens here is that we do not take into account the current predator position, but instead, birds try to "separate" from predators's next position (the prediction is made from current position and velocity and acceleration of the predator).

Predator Avoidance Flee/Separation Behavior

The predator avoidance vector \vec{\Gamma}_b^i is defined as follows:

\vec{\Gamma}_b^i = \left[\sum_{j \in \mathcal{P}_b }{\frac{\vec{p}_i -(\vec{p}_j+\vec{v}_j)}{||\vec{p}_i - (\vec{p}_j+\vec{v}_j)||}} \;f_{p}\right] +a, \;\;0 < |\vec{\Gamma}_i| \leq v_p \notag


  •  \mathcal{P}_b is the b's set of predators
  •  f_{p} = \begin{cases} 0 &\mbox{ if } d_{i,j} > r_p\\1 -\frac{d_{i,j}}{r_p} & \mbox{ otherwise}\end{cases}
    is the predator avoidance coefficient, where r_p is the minimum distance bird avoid predator.

The model has been implemented in CUDA to speedup the simulation. The following is a short video which I used during my presentation at PDP16 conference.  The model and the implementation described in this article are much more greatly described in the following slides (download here).

C/C++ - Byte to Number Conversion

Byte Number Conversion

For a project I'm working on these days I was forced to convert raw byte to several numerical values ( to and from binary format).

Bytes are represented in C/C++ as  unsigned char, and std::string are often used as byte buffer as well as old well known raw arrays. I wrote a simple struct that relieve the pain in performing such conversions. It exposes two functions: fromBytes and toBytes.fromBytes takes an additional boolean parameter that takes care of different endianness.

Author: Davide Spataro 2016

Converts byte buffer to type T
typedef unsigned char byte;

template <class T>
struct converter{

 static const size_t size = sizeof(T);

    union conv{
        T value;
        byte bytes[ sizeof( T ) ];
    } c ;

    T fromBytes( const  byte *bytes, bool endianness = false){

            #pragma unroll
            for(int i=0;i<size;i++)
                c.bytes[size-1-i] = bytes[i];
          #pragma unroll
            for(int i=0;i<size;i++)
                c.bytes[i] = bytes[i];

        return c.value;

     byte* toBytes(const T& value,bool endianness = false){
        c.value =value;
        return c.bytes;

    void reverse(){
        #pragma unroll
        for(int i=0;i<size/2;i++){
            byte tmp = c.bytes[i];
            c.bytes[size-1-i] = tmp;



template<class T>
void printHex(const T& key, size_t size){
  std::cout.setf(std::ios::hex, std::ios::basefield);
    for (int i = 0; i < size; i++){ if (i > 0) printf(":");
            printf("%02X", (unsigned char)key[i]);


Usage is very simple: let's say for instance you have the following 8 bytes into an unsigned long long


converter<int64_t> c;
//binary value is 00:00:00:00:00:00:00:5E (8 bytes)
std::string binary = readBinary(); //read binary unsigned long long from somewhere
int64_t res =  c.fromBytes(reinterpret_cast<const unsigned char*>(binary.c_str()),true);
std::cout <<res<< " \n";

this will output clearly output 94.

It works with almost all numerical types (float, all int flavors and floating points)

Music Meeting in Ediburgh

A great experience, that I hope will became an joyful habit, took place last week when a colleague of mine at the Department of Engineering at University of Edinburgh (Kino) told me about a meeting among the researchers/musicians of the university.


I was invited to play some music, and I obviously accepted! It was great to meet new smart people and musician from all over the world and share with them delicious food and wine. There were several instrument involved in the performances such as viola, violin, cello and voice.

Here two (very) short videos from my performance:


The following was played to fulfill a special request from Alice (a P.hD collegue) from Milan.




Largest Prime Number

Largest Prime Number

Recently a team at the University of Central Missouri, headed by Curtis Cooper has announced, via press release from the Mersenne organization

to have discovered a new largest prime. The number 2^{742072811}-1 have more than 22M digits, 22338618 to be precise. It's so large that writing 4 digits per centimeter one would be able to cover the entire distance between Edinburgh and Glasgow! Here the full number.

The following is an Haskell micro-script for computing and writing the number to a file (using Haskell because it's lazy input/output allows to write big files to disk without any concern about memory).

import Data.Time
import System.Environment (getArgs)
main :: IO ()
main = do
let n = 2^74207281 -1
startTime <- getCurrentTime
writeFile path (show n)
stopTime <-getCurrentTime
putStrLn ("ElapsedTime: " ++ show (diffUTCTime stopTime startTime))

Compile using using : ghc --make

and execute passing the output filename as the only parameter.

The full number is ~22MB, not surprisingly as one char occupies one byte in memory.

Here the first digits of the number:

268344381703937005859988258738844104703265786972872467031538046586054465054455 ....


Programming Question - Integer Parity

Programming Question: Given an integer, compute its parity(easy)
The parity of an integer is true iff number of set bits (1) is odd, false otherwise. Example: 1234_{10} = 010011010010_2 has 5 set bits and hence its parity is true.



Programming Question - Compute GCD

Programming Question: Compute the Greatest Common Divisor of two integers (easy)

This question is divided in two parts: you will first asked to write a function that computes the  GCD withouth any space/time constraints. The second part will be more challenching, asking for not using multiplication or division or addition.

Part1: write a function that takes two integers as input and returns their gcd using the famous Euclidean algorithm.

Part 2: compute the Greatest Common Divisor of two integers without using multiplication addition or division operators (easy)


Programming Question - 2D array spiral Fashion

Programming Question: given a 2D array, print it in spiral fashion (easy)

Given a 2D array print it in spiral fashion as shown in the picture.


Yet Another one on Fibonacci serie - Knuth multiplication

Java Fibonacci serie - Knuth multiplication

Leonardo Fibonacci,the Italian mathematician (from Pisa, 1170, one of the most important mathematician of all time) invented the famous serie defined recursively as follows:
F(n) = F(n-1) + F(n-2)
F(0) = 0
F(1) = 1

We all know that as programmers once in life (at least) we will write some piece of code to compute the n_{th} element of the serie. Recursive implementation is straightforward but inefficient, due to the exponential number of recursive calls, that make "hard" to compute numbers bigger that 40. It is very easy to come up with some iterative or memoised implementation, that lowers the complexity to O(n) and allows to efficently compute "any" number of the serie.
In this post we will derive an implementation (JAVA) that have complexity O(log_2(n))1)using results from Knuth, Fibonacci multiplication,1988. Read On…

References   [ + ]

1. using results from Knuth, Fibonacci multiplication,1988

Haskell - Sudoku Solver - List Monad

Haskell Sudoku Solver - List Monad

In this article we will write a Sudoku solver in Haskell using the expressive and powerful List Monad to enumerate the solutions trasparently. I assume that the reader has a general understanding of monads (if not read

Su Doku means literally  number place is the name of a popular puzzle game that a special case of the old Latin Square (invented by Leonard Euler). Latin Square force each symbol to appear once in each row and column. In addition to these contraint Sudoku force each symbol to apper one in each 3 \times 3 sub-matrix. In the general case Sudoku is NP-Complete (easy reduction to graph coloring), but certain schemeas can be solved by deduction and logic only (in this case the puzzle is said well constructed).

At high level a Sudoku can be viewed as a list of cell

data Board= Board [Square]

and a Square must certanly carries information about its position and the digit it contains, or even better it could contain (according to the cells in same row, column and submatrix)

data Square= Square ColDigit RowDigit BoxDigit (Either [Digit] Digit)
    deriving (Show)

type RowDigit= Digit
type ColDigit= Digit
type BoxDigit= Digit
type Digit= Char

While solving the puzzle each square will contains either a list of possible digits that can be placed in the square OR (this is the Either logic) a fixed digit (i.e., the square was given by a clue or has been deduced).

The following Squares have respectively position  (4,1) and (4,2) , both lie in submatrix 4 and s1 can contain the digit 1 OR 5 OR 7 OR 4 while s2 is fixed to 6.

let s1=Square '4' '1' '4' (Left ['1','5','7','4'])
let s2=Square '4' '2' '4' (Right '6')

The solver works s.t. it return the list of possible solution for a schema (not well constructed Sudoku may have multiple solutions). In particular if no unresolved square are left (the one with the Left in the Ethier part) it means that the board is solved and does not need any further processing, otherwise we take the square with the minimum number of possible values and try solve the board trying, one at the time, all the values for it. This sort of backtracking is perfomed transparently by the List monad that is used to represent nondeterminism (i.e. multiple possibilities).

solveMany :: Board -> [Board]
solveMany brd =
	case getUnsolvedSquare brd of
		[] 	-> return brd --all squares correctly set up
		sqs -> do
		let selSquare@(Square c r b (Left ds)) : _ = sortBy (byLeftLenght) sqs
		sq <- [ Square c r b (Right d) | d <- ds ] -- Try all possible moves
		solveMany (setSquare sq brd)

Recall that do notation is just syntactic sugar for bind (>>=) operator and that for for list it is defined as following

(>>=) :: [a] -> (a->[b]) >[b]
xs >>= f = concat (map f xs)

It then applies f=solveMany (that takes a board) and produces [Board] for all the digits in the Left of selSquare.
Here the same function wrote with explicit bind operator application that makes much clear the monadiac operation that do the dirty job for us.

solveMany :: Board -> [Board]
solveMany brd =
	case getUnsolvedSquare brd of
		[] 	-> return brd --all squares correctly set up
		sqs ->
			let Square c r b (Left ds) : _ = sortBy (byLeftLenght) sqs
			in [ Square c r b (Right d) | d <- ds ] >>=(\b->solveMany (setSquare b brd))

getUnsolvedSquare :: Board -> [Square]
getUnsolvedSquare (Board sqs) = filter (isSolved) sqs
		isSolved (Square _ _ _ (Left _)) = True
		isSolved _						  = False

The setSquare function is the one devoted to the constraint propagation and is defined as follow:

setSquare :: Square-> Board ->Board
setSquare sq@(Square scol srow sbox (Right d))(Board sqs) = Board (map set sqs)
		set osq@(Square ocol orow obox ods)
			| scol==ocol && srow==orow  			 = sq
			| scol==ocol || srow==orow || sbox==obox = Square ocol orow obox (checkEither ods)
			| otherwise 							 = osq

		checkEither (Left  lds ) 			= Left (lds \\ [d])
		checkEither r@(Right d') | d==d'	= error "Already set this square"
		checkEither dd = dd
setSquare _ _ = error "Bad call to setSquare"

It places Square (with a Right Digit) on a Board and return the new Board taking care of removing the just inserted digit from the possible values for Squares of the same row, column and submatrix. This is how the constraint are propagated, cutting off the search space.

The complete source code is available here and on github.

Stream Compaction on GPU - Efficient implementation - CUDA

Stream Compaction - Introduction1)Markus Billeter et al. Efficient Stream Compaction on Wide SIMD Many-Core Architectures2)InK-Compact-: In kernel Stream Compaction and Its Application to Multi-kernel Data Visualization on GPGPU- D.M. Hughes

An efficient implementation of stream compaction on GPU is presented with code example. Full source code on github.

Stream compaction/reduction is commonly referred as the operation of removing unwanted elements in a collection. More formally, imagine we have a list of element A_{0...N} of N elements and a predicate p : A \to\left \{ True,False\right \} that bisects A in wanted and unwanted elements (ones that satisfy the predicates p and ones that don't). The stream compaction of A under p is an Read On…

References   [ + ]

1. Markus Billeter et al. Efficient Stream Compaction on Wide SIMD Many-Core Architectures
2. InK-Compact-: In kernel Stream Compaction and Its Application to Multi-kernel Data Visualization on GPGPU- D.M. Hughes

Haskell Seminar @ UNICAL

Haskell Seminar at University of Calabria

During the last year I got much interest in functional programming and in particular in Haskell.
Very few people write and study Haskell at University of Calabria (one of them is the manteiner of gnome for NixOs) and so I decided to do a introductive lesson (a big thanks go to the guys of the ciotoflow). It was a great experience, and people were enthusiast and curious.
All the material i produced is available for download and use, and hopefully will be gradually updated and improved. All the files are available on github.

Introductive slides here



Project Euler - Problem 24

In this article we want to briefly discuss a possible solution to the project euler problem 24 which asks:

A permutation is an ordered arrangement of objects. For example, 3124 is one possible permutation of the digits 1, 2, 3 and 4. If all of the permutations are listed numerically or alphabetically, we call it lexicographic order. The lexicographic permutations of 0, 1 and 2 are:

012 021 102 120 201 210

What is the millionth lexicographic permutation of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9?

One possible approach could obviously be to brute force all the lexicographical permutations and then take the millionth element out of the previously sorted collection of permutation (interpreted as number). The drawback of this solution is that unfortunately the number of permutations is more than exponential, is  O(n!) .

We want to show here a more mathematical/combinatorial approach. First of all start considering that how the permutations looks like.
First permutations are (thanks to Stefano Germano for typo correction)  \{0,1,2,3,4,5,6,7,8,9\}, \{0,1,2,3,4,5,6,7,9,8\},\{0,1,2,3,4,5,6,8,7,9\},...,\{0,9,8,7,6,5,4,3,2,1\} .
A deeper look at the permutations reveals that are all the sequences starting with 0 and concatenated with a permutation of the ramaining 9 elements  [1,2,3,4,5,6,7,8,9] . How many of them?  9! .
It means that the permutation we are looking for  has to be greater than the  9!=362880 th permutation:  \{0,9,8,7,6,5,4,3,2,1\} .
From this we can deduce that we can also skip all the sequences starting by 1 because they produces (again)  9! permutations and

 2 \times 9! < 10^6 . Moreover our millioth permutation is in the sequences starting with 2 because  3 \times 9! > 10^6 (hence is not in the permutations starting with 3). Read On…

CUDA - Julia Set example code - Fractals

This article present a CUDA parallel code for the generation of the famous Julia Set. Informally a point of the complex plane belongs to the set if given a function f(z) the serie z_{n+1} = f(z_n) does not tend to infinity. Different function and different initial condition give raise eventually to fractals. One of the most famous serie is z_{n+1} = z_n^2 +c  (the one that generated the video below).
Some nice picture may be obtained with the following initial conditions:

  1. c = 1j # dentrite fractal
  2. c = -0.123 + 0.745j # douady's rabbit fractal
  3. c = -0.750 + 0j # san marco fractal
  4. c = -0.391 - 0.587j # siegel disk fractal
  5. c = -0.7 - 0.3j # NEAT cauliflower thingy
  6. c = -0.75 - 0.2j # galaxies
  7. c = -0.75 + 0.15j # groovy
  8. c = -0.7 + 0.35j # frost

Here a video, showing a sequence of picture generated using different initial conditions. A nice example of how math and computer science can produce art!

CUDA Julia set code

The code is such that it is very easy to change the function and the initial condition (just edit the device function functor).


#define TYPE float
#define cTYPE cuFloatComplex
#define cMakecuComplex(re,i) make_cuFloatComplex(re,i)
#define TYPE double
#define cMakecuComplex(re,i) make_cuDoubleComplex(re,i)

__global__ void computeJulia(int* data,cTYPE c,float zoom){
	int i =  blockIdx.x * blockDim.x + threadIdx.x;
	int j =  blockIdx.y * blockDim.y + threadIdx.y;

	if(i<DIMX && j<DIMY){
		cTYPE p = convertToComplex(i,j,zoom);

cTYPE c0;
__device__ cTYPE juliaFunctor(cTYPE p,cTYPE c){
	return cuCaddf(cuCmulf(p,p),c);

__device__ int evolveComplexPoint(cTYPE p,cTYPE c){
	int it =1;
	while(it <= MAXITERATIONS && cuCabsf(p) <=4){
	return it;

__device__ cTYPE convertToComplex(int x, int y,float zoom){

	TYPE jx = 1.5 * (x - DIMX / 2) / (0.5 * zoom * DIMX) + moveX;
	TYPE jy = (y - DIMY / 2) / (0.5 * zoom * DIMY) + moveY;
	return cMakecuComplex(jx,jy);


The code above are the GPU instructions that are devoted to the computation of a single set. The kernel computeJulia() is executed on DIMX*DIMY (the dimension of the image) threads that indipendently compute the evolveComplexPoint() function on the converted to complex corrensponding pixel indices, the viewport (each threads compute a single pixel that has integer coordinates). The evolveComplexPoint() take care of checking how fast che point diverges.

The code is available here on github. It generates a number of png tha can be then merged in a video (as the one I present here) using a command line tool as ffmpeg.

In order to compile and run it need the CUDA SDK,a CUDA capable device and libpng installed on your system.
To compile and/or generate the video simply hit:

nvcc -I"/usr/local/cuda-6.5/samples/common/inc"  -O3 -gencode arch=compute_35,code=sm_35  -odir "src" -M -o "src/juliaSet.d" "../src/"

nvcc  -I"/usr/local/cuda-6.5/samples/common/inc" -O3 --compile --relocatable-device-code=false -gencode arch=compute_35,code=compute_35 -gencode arch=compute_35,code=sm_35  -x cu -o  "src/juliaSet.o" "../src/"

/usr/local/cuda-6.5/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_35,code=compute_35 -gencode arch=compute_35,code=sm_35 -link -o  "JuliaSetDynParallelism"  ./src/juliaSet.o ./src/main.o   -lpng

#video generation
ffmpeg -f image2 -i "julia_%05d.png" -r 40

Here another animation


Fold operator - a brief detour

This article aims to be a brief practical introduction to new haskell programmer about the fold high order function (which roughly speaking is a function that takes as input and/or produces as output a function) and its usage. It will also outline a bit of the theory that lies behind this operator (the term operator comes from the field of recursion theory (Kleene, 1952), even if, striclty speaking fold is a function in haskell).

Haskell Fold - Introduction and formal definition

Many problem in function programming are solved by means of tail recursive functions (one that do not perform any other operations that need some values from the recursive call). For instance let's say we want to get the product of integers stored in a list.
What we would do is to write down a tail recursive function (more or less) like this

tailmultiply :: [Int] -> Int -> Int
tailmultiply [] acc     = acc
tailmultiply (x:xs) acc = tailmultiply xs (acc * x)

Note, to stress the tail recursion, a definition like the one that follows would not be tail recursive.

Read On…

Kernel Linux List - An Efficient Generic Linked List

Linux Kernel Linked List

Linux Kernel especially in the past years was plenty of different and replicated implementation of generic data structures such as linked lists, stacks and queues.  Kernel's developers decided to provide a unique set of API for using these kind of software machineries. We will briefly look in this article  at the list.h file (in the linux source tree under /usr/src/linux-headers-`uname -r`/include/) that provides macros and function for utilizing a doubly linked circular list. It is a very smart written piece of code that is worth to take a look at as it is  an opportunity  to improve  C programming skills and at the same time to see an "unconventional" implementation for a common data structure.

From canonical list to Linux kernel linked list

The canonical implementation of a C's linked list consists of a structure and two (for a doubly linked) recursive pointer fields to the structure itself. The links to the previous and subsequent element of the list.

struct point2DNode{
 int x,y;
 struct point2DNode* prev, next;

Read On…

Francis Poulenc - Rapsodie Negre

To fall in the void as I fell: none of you knows what that means… I went down into the void, to the most absolute bottom conceivable, and once there I saw that the extreme limit must have been much, much farther below, very remote, and I went on falling, to reach it.” [Cosmicomics, Italo Calvino]

Apple Xserve xeon Linux/Debian armed

Finally at my Departement we succesfully converted an  unsed  Apple xserve  cluster to a full functional Linux/Debian machine.

The cluster is make up of eleven node connected by Myrinet 10G low latency network and each node is a composed by a dual 4x cores Intel Xeon 2.8 Ghz Harpertown with 10Gb of DDR2 for a total of 88 cores. It is a quite old and small machine and hence was unused. We thought that  it could have been useful if well re-configured as an ad-hoc machine for small sized scientific computation session, as  developing and testing platform of parallel application or for training of undergraduate students that can experience directly with a real parallel machine. So we decided to cast a new light on it by configuring it as a Debian cluster (according to google just few people did it). 10949741_766948060061020_241789111_n

Read On…