I was an intern at CilkArts this summer and my first engagement with Cilk++ was to solve the N-Queens problem with it. The N-Queens problem asks this question: given an n-by-n chessboard and n queens, is it possible to place the n queens on the chessboard without any conflict? Is so, how many different ways can they be placed? An arrangement is said to be conflicting if at least two queens conflict with each other. Two queens conflict if they share the same row, column, or diagonal. For example, the following is a solution for eight queens:
Wikipedia provides an excellent overview of the problem.
Algorithm
The classical solution to N-Queens problem is brute force enumeration with pruning. Brute force enumeration tries every possible arrangement of the n queens and checks if any of them satisfies the non-conflicting criteria. For example, there are n^2 locations on the board, so there are n^2 possible locations for the first queen, n^2-1 for the second, and so on. Thus we have choose(n^2, n) = (n^2)!/((n^2-n)! * n!) possible arrangements of queens. By enumerating all of these possibilities and checking each of them, all solutions can be found. For n=10, this number is ~1.73e13.
Because there could be only one queen per row in a non-conflicting arrangement, we can constrain the first queen to locations in the first row, the second queen to the second row, and so on. This will give n possible locations for each queen, resulting in n^n possible arrangements. For n=10, this is 1e10. Observe that there could be one queen per column as well; the i-th queen will have only at most n-i+1 valid locations, because the previous i-1 queens already took i-1 columns. The number of possibilities is then reduced to n!, or 3.6e6 for n=10.
The final point to reduce enumerations is to prevent queens from sharing a diagonal. While this is straightforward to code, it is hard to provide an actual estimate of the number of enumeration after using it. The following is a pseudo-code for this enumeration plus pruning algorithm:
queens=[]; //locations of already put queens
try (int i) //put i-th queen
{
if (i>n)
{
found_a_solution;
return;
}
for j=1..n, //possible locations of i-th queen in i-th row
{
if (putting i-th queen at location (i,j)
does not conflict with previously put queens)
{
queens[i]=j; //store location of i-th queen
try(i+1); //put (i+1)-th queen
}
}
}
The complexity of this algorithm is upper-bounded by n!.
Implementation
I implemented the above algorithm in C++. Storing the queens array variable is a bit tricky. There were three choices before me: an STL vector, an array, or pointers. There are definitely other ways this could be implemented, such as using an array for space and bookkeeping pointers manually, but here I only considered these three basic cases. They each have their pros and cons:
STL Vector
- Pro: don't need to manually allocate array space
- Con: more actual memory allocations performed
Array
- Pro: easy to implement, need only one allocation in every recursion
- Con: still redundant, the queens placed early are stored many times
Pointers, maintain a cactus stack
- Pro: no redundancy in storage
- Con: hard to implement, can result in many fine grained memory allocations
All three choices were implemented. Before multicore-enabling them, their serial versions were built first. After all, it's always good to have the serial version of a parallel implementation at hand. In the following, the black colored code is the serial version, and the modification to parallelize them with Cilk++ are highlighted in blue.
Parallelizing a piece of code with Cilk++ is easy. What I mainly needed to do was to insert cilk_spawn before every recursive call, and cilk_sync before I used the return value from such calls. Also I needed to remove any data races. The two main tools I used were Cilkscreen (a data race detector), and reducers (race-free global variables provided with Cilk++).
The following are the three implementation. (The STL version is illustrated below, and the other two implementations are available by clicking on the links.)
STL Vector Version
int n; // # of queens
int sn; // # of solutions
typedef std::vector<int> ilist_t; // queens array storage
//# of solutions, reducer
cilk::hyper_ptr<cilk::reducer_opadd<int> > count;
// qs: current queens. qs[i] is the column position of i-th queen
// newq: new queen column position
bool check(ilist_t qs, int newq)
{
int i, num=qs.size();
for (I = 0;I < num; i++)
{
int p = qs[i];
if (p == newq //same column
|| abs(p-newq) == num - i) //diagonal conflict
return false;
}
return true;
}
// naive way to clone a ilist_t
ilist_t clonelist (ilist_t *f)
{
ilist_t r; r.clear();
for (int i=0;isize();i++)
r.push_back((*f)[i]);
return r;
}
// search
// queens: already put queens
// row: row of the new queen to be put
void trysol(ilist_t* queens, int row)
{
if (row>=n) //got a solution
{
sn++;
(*count)++;
return;
}
for (int i=0; i<n; i++)
{
if (check(queens,i))
{
ilist_t* pnewl=new ilist_t(*queens);
pnewl->push_back(i);
cilk_spawn trysol(pnewl,row+1); //pass it as a pointer
}
}
cilk_sync;
delete queens;
}
- The Pointer Version (click to view code)
(This program maintains a cactus stack made of nodes dynamically allocated. It uses function return values, instead of a global variable, to keep the number of solutions.)
Results
All three versions give correct results. Following is a list of the number of solutions for each n from 4 to 14.
n # of solutions
4 2
5 10
6 4
7 40
8 92
9 352
10 724
11 2680
12 14200
13 73712
14 365596
Here the rotation or mirror of a solution is counted as a different solution.
Performance
The time these three programs spent on n=13 is summarized in Table 1 and Figure 1.
|
(seconds) |
1 core |
2 cores |
3 cores |
4 cores |
|
STL vector |
4.171 |
2.453 |
3.000 |
3.062 |
|
Array |
2.312 |
1.296 |
1.390 |
1.375 |
|
Pointer |
5.594 |
3.172 |
3.672 |
3.547 |
Table 1: Performance using the standard Windows memory manager
Figure 1: Performance using the standard Windows memory manager
Apparently all three methods encounter a bottleneck at 3 cores. By analysis, we found that all three implementations use malloc in the inner loop. The Windows malloc acquires an operating lock - and this can result in significant overhead.
At Cilk Arts, we have developed an alternate memory manager for Windows named "Miser," to remove this bottleneck. Table 2 and Figure 2 show the performance with Miser. As we can see in Figure 3, using the Miser memory manager the program achieves almost perfect speed-up.
|
(seconds) |
1 core |
2 cores |
3 cores |
4 cores |
|
STL vector |
3.718 |
1.875 |
1.265 |
0.953 |
|
Array |
2.235 |
1.687 |
0.782 |
0.578 |
|
Pointer |
5.062 |
2.625 |
1.734 |
1.500 |
Table 2: Performance using the Miser memory manager

Figure 2: Performance using the Miser memory manager

Figure 3: Speedup relative to 1 core (using the Miser memory manager)
Work/Span Measurement
It can be quite informative to see the work and span of a program, so I performed work and span measurement on the three implementations. (Charles gives a brief review of the terms in this blog post.)
There are two tools available to measure the work and span. One is a "sim-time" instrumentation tool (dubbed "Cilkometer" for the time being). It reads meta data inserted into the executable by the Cilk++ compiler and analyze the instruction flow of the executable offline, outputting a work/span estimate in number of instructions. It is useful for accurate analysis of the instruction flow.
The other tool is a real-time instrumentation tool. It uses function calls inserted into the executable. These functions keep track of the time spent in each stack frame, outputting a work/span estimation based on the actual execution of the program. The result is given in number of cycles. This latter tool is useful for estimating the actual performance of program.
They each have their pros and cons:
"Sim-time" instrumentation
- Pro: stable, always gives the same number
- Con: may vary from actual performance
Real-time instrumentation
- Pro: measures actual performance
- Con: the result can change from run to run
Tables 3 and 4, and Figure 4 summarize the instrumentation result on the STL vector implementation. For the sim-time instrumentation, it is measured during serial execution. For the real-time instrumentation, it is done on 4 cores on an Intel Core2 Quad 2.4Ghz machine.
|
work (instructions)
|
span (instructions)
|
parallelism |
number of strands |
|
n=4 |
579358 |
542326 |
1.0683 |
48 |
|
n=6 |
1140903 |
564009 |
2.0228 |
454 |
|
n=8 |
10192604 |
590459 |
17.2622 |
6078 |
|
n=10 |
196482482 |
624600 |
314.5733 |
105892 |
|
n=12 |
5514385181 |
676080 |
8156.4093 |
2554366 |
Table 3: Sim time instrumentation results on nQueens STL vector implementation version
|
work (cycles) |
span (cycles) |
parallelism |
number of strands |
|
n=4 |
262494 |
205596 |
1.28 |
33 |
|
n=6 |
1640781 |
269217 |
6.09 |
305 |
|
n=8 |
17259462 |
344457 |
50.11 |
4113 |
|
n=10 |
232806402 |
381951 |
609.52 |
71077 |
|
n=12 |
5300370828 |
950364 |
5577.20 |
1712377 |
Table 4: Real-time instrumentation results on N Queens STL vector implementation version
Figure 4: Measuring Work, Span, Parallelism, Strands
First we see that the results in both cases agree with each other almost perfectly. Second, the work grows exponentially as n increases, while span only grows linearly, resulting in exponentially growing parallelism (in other words, this problem lends itself very well to multithreading).
Summary
The Cilk++ language is studied using the N-Queens problem. The N-Queens problem can be solved by brute force search and pruning. I implemented the algorithm with several variations and tested them thoroughly. The result shows that performance is hampered by the Windows memory manager, but was excellent once Miser (a customized memory manager) replaced the Windows one. The instrumentation results, both offline and online, shows exponentially growing work and linearly growing span. Parallelizing my program with Cilk++ was straightforward; the change to the code was minimal, the speed up with the memory manager is great, and the instrumentation results provide insight into the amount of work, span, and parallelism in a particular program.