The following problem, slightly rephrased, was posed in the Intel TBB
forum.
You have N particles. Particle j exerts force
f(i,j) upon particle i. The usual action/reaction law
holds, i.e., f(i,j) == -f(j,i). Compute the total force acting
on each particle, in parallel. It turns out that the solution is
nontrivial, as we discuss in the rest of this article.
The natural sequential solution is something like this:
for (int i = 0; i < N; ++i)
for (int j = i; j < N; ++j) {
double t = f(i, j);
force[i] += t;
force[j] -= t;
}
The sequential implementation calls f(i,j) once to compute both
the force f(i,j) exerted by particle j upon particle
i, and the force -f(i,j) exerted by i upon
j. Calling f once is better than calling f
twice, especially because f tends to be expensive to compute.
(E.g., computing gravitational forces involves a division and a square
root.)
The serial solution does not parallelize easily, however. You cannot
execute the outer loop in parallel, because then we have a race
condition among parallel updates of force[j] for different
values of i. For a similar reason, parallelizing the inner
loop does not work either.
If you are willing to call f twice, both both as f(i,j)
and as f(j,i), then a simple parallel implementation looks like
this:
cilk_for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
force[i] += f(i, j);
That is, for each particle
i (in parallel) compute
(sequentially) the force exerted upon
i by all other
particles
j. The inner loop is sequential, which avoids the
race condition on the update of
force[i]. However, this
parallel solution calls
f twice as much as the serial one,
which is not good.
So how do you do it? My solution appears below. It calls f
only once for every pair (i,j), for i <= j, so it is
does not double the work, and it is race-free. It is a complete
Cilk++ program that you can run and certify to be race-free by using
Cilkscreen, our race-detecting tool.
#include <assert.h>
#define N 749 // number of particles
double force[N];
double f(int i, int j)
{
// compute your favorite force here. We use a constant
// force 1.0 for illustration purposes
return 1.0;
}
void basecase(int i, int j)
{
assert(i <= j);
double t = f(i, j);
force[i] += t;
force[j] -= t;
}
/* traverse the rectangle i0 <= i < i1, j0 <= j < j1 */
void rect(int i0, int i1, int j0, int j1)
{
int di = i1 - i0, dj = j1 - j0;
if (di > 1 && dj > 1) {
int im = i0 + di / 2;
int jm = j0 + dj / 2;
cilk_spawn rect(i0, im, j0, jm);
cilk_spawn rect(im, i1, jm, j1);
cilk_sync;
cilk_spawn rect(i0, im, jm, j1);
cilk_spawn rect(im, i1, j0, jm);
} else {
for (int i = i0; i < i1; ++i)
for (int j = j0; j < j1; ++j)
basecase(i, j);
}
}
/* traverse the triangle n0 <= i <= j < n1 */
void interact(int n0, int n1)
{
int dn = n1 - n0;
if (dn > 1) {
int nm = n0 + dn / 2;
cilk_spawn interact(n0, nm);
cilk_spawn interact(nm, n1);
cilk_sync;
rect(n0, nm, nm, n1);
} else if (dn == 1) {
basecase(n0, n0);
}
}
int cilk_main()
{
interact(0, N);
}
This program is not meant to be obvious, so let me explain what it
does.
The sequential program is equivalent to calling basecase(i,j)
for all 0 <= i <= j < N. Another way to look at it is that we
call basecase(i,j) for all points (i,j) in the triangle
0 <= i <= j < N, which looks like the shaded area in this figure:
Procedure interact traverses this triangle in parallel, and in
fact it is a little bit more general, because it traverses triangles
of the form n0 <= i <= j < n1. Initially, we set n0 = 0
and n1 = N in cilk_main.
Procedure interact works by recursively partitioning the
triangle. If the triangle consists of only one point, then it visits
the point (n0, n0) directly. Otherwise, the procedure cuts the
triangle into one rectangle and two triangles, like this:
The two smaller triangles can be executed in parallel, because one
consists only of points (i,j) such that i < nm and
j < nm, and the other consists only of points (i,j) such
that i >= nm and j >= nm. Thus, the two triangles
update nonoverlapping regions of the force array, and thus they
do not race with each other. However, the rectangle races with both
triangles, and thus we need the cilk_sync statement before
processing the rectangle.
To traverse a rectangle we use procedure rect, which also works
recursively. Specifically, if the rectangle is large enough, the
procedure cuts the rectangle i0 <= i < i1, j0 <= j < j1,
into four smaller subrectangles, like this:
The amazing thing is that the two black subrectangles can be traversed
in parallel with each other without races. Similarly, the two gray
subrectangles can be traversed in parallel with each other without
races. However, the black subrectangles race with the gray, so we
must use a
cilk_sync statement after processing the first pair
of subrectangles.
To see why there are no races between the two black subrectangles (the
same argument applies to the grey) observe that the i-ranges of
the two subrectangles do not overlap, because one is smaller than
im and the other is larger. For the same reason, the
j-ranges do not overlap either. In order for races not to
occur, however, we must also prove that the i-range of one
subrectangle does not overlap with the j-range of the other,
because the base case updates both force[i] and
force[j]. This property holds because when interact
calls rect initially, the i-range is n0 <= i < nm,
whereas the j-range is nm <= j < n1, so the two
ranges never overlap.