I've been interning at Cilk Arts this summer, working mainly on
designing and implementing a very interesting analysis tool for Cilk++ programs
(more on that later!). I have also, of course, gotten a chance to spend some time
writing Cilk++ code, and boy is it fun. My first impressions were that the
cilk_spawn and cilk_for keywords felt like magical "go faster!"
keywords -- as long as my loops or recursive functions were written without undue
global side-effects (which I tend to avoid anyway, because I'm fond of
functional programming), all I had to do was add in the Cilk keywords, and
presto, my code ran faster (on a multicore machine).
When I really did need global variables, it was pretty easy to use Cilk hyperobjects to still get race-free parallelism, without using any locks. And there is a great, visceral satisfaction in running Cilkscreen and having it report that your program is 100% race-free -- if you've ever written a program in Haskell and gotten it to compile, you know the good feeling I'm talking about.
Of course, I wasn't much satisfied with everything working well, so I immediately set out to see if I could break it. Here, I've
documented my adventures in writing some simple Cilk++ code, some surprises I encountered, and what we're doing over here to help you avoid the same pitfalls in your own
parallel code.
Matrix Multiplication
Let's take a look at a few algorithms for multiplying large, dense
matrices -- a good example of an algorithm that seems simple on the surface, but
that can go quite wrong if you're not careful. The most obvious (serial)
algorithm for doing this is to have three nested for-loops, like so:
void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C) {
for (int i = 0; i < A.rows; ++i) {
for (int j = 0; j < B.cols; ++j) {
for (int k = 0; k < A.cols; ++k)
C[i][j] += A[i][k] * B[k][j];
}
}
}
Now, this may seem like a perfectly good algorithm -- in fact, it looks really good, because the outer two loops are easily parallelizable using the
cilk_for keyword, and the inner loop can be parallelized as well with the use of a hyperobject. Let's put all those in:
void matrix_multiply_2(matrix_t A, matrix_t B, matrix_t C) {
cilk_for (int i = 0; i < A.rows; ++i) {
cilk_for (int j = 0; j < B.cols; ++j) {
hyper_ptr< reducer_opadd > cij;
cilk_for (int k = 0; k < A.cols; ++k)
*cij += A[i][k] * B[k][j];
C[i][j] = cij->get_value();
}
}
}
Okay, let's run it! I used these two algorithms to multiply a 687x837 matrix
by a 837x1107 matrix on a four-core desktop PC using an alpha version of
Cilk++. The second version is highly parallel, so we should get near-linear
speedup. However, when I tried it, not only did I not get linear speedup,
but the parallel version actually ran twice as slow on four cores
as the serial version did on one core!
So what's the problem? Has Cilk++ failed us? Not at all, in fact!
It turns out that our matrix_multiply_2 code is rather problematic, and
there are two big ways we could improve it.
Overhead in Parallelism
The first problem is that we shouldn't have used a hyperobject in that
innermost loop -- we should have just left it sequential, like so:
void matrix_multiply_3(matrix_t A, matrix_t B, matrix_t C) {
cilk_for (int i = 0; i < A.rows; ++i) {
cilk_for (int j = 0; j < B.cols; ++j) {
for (int k = 0; k < A.cols; ++k)
C[i][j] += A[i][k] * B[k][j];
}
}
}
Why not use a hyperobject here? Well, hyperobjects are
extraordinarily helpful when dealing with code that relies heavily on global variables, but
we don't want to get carried away using them just because we can. They do
introduce some overhead, and that inner loop is barely doing any work at all --
just a few thousand multiplies and adds -- so it's simply not worth it
here. We're much better off letting that loop be serial so that our compiler can
optimize the heck out of it, and besides, we've already got boatloads of
parallelism from the two outer cilk_for loops.
Running matrix_multiply_3 on my machine yields a 2.7x speedup over
matrix_multiply_1 on four cores. Ah, finally, we're getting
speedup instead of slowdown! Still, for an algorithm with as much parallelism as
this one, we should be doing better than that. What else are we doing wrong?
Choosing an Algorithm
Our other problem is that the serial algorithm we started with in
matrix_multiply_1 is actually a pretty terrible way to multiply large
matrices! It seems reasonable on the surface, but in practice it has
awful cache performance, and for a problem like matrix multiplication,
where we're doing much more memory-access than actual calculation, that makes a huge
difference. In this case, we should have used an algorithm with better
locality of reference, like the recursive algorithm shown here:
void matrix_multiply_4(matrix_t A, matrix_t B, matrix_t C,
int i0, int i1, int j0, int j1, int k0, int
k1) {
int di = i1 - i0;
int dj = j1 - j0;
int dk = k1 - k0;
if (di >= dj && di >= dk && di >= THRESHOLD) {
int mi = i0 + di / 2;
matrix_multiply_4(A, B, C, i0, mi, j0, j1, k0, k1);
matrix_multiply_4(A, B, C, mi, i1, j0, j1, k0, k1);
} else if (dj >= dk && dj >= THRESHOLD) {
int mj = j0 + dj / 2;
matrix_multiply_4(A, B, C, i0, i1, j0, mj, k0, k1);
matrix_multiply_4(A, B, C, i0, i1, mj, j1, k0, k1);
} else if (dk >= THRESHOLD) {
int mk = k0 + dk / 2;
matrix_multiply_4(A, B, C, i0, i1, j0, j1, k0, mk);
matrix_multiply_4(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
for (int i = i0; i < i1; ++i) {
for (int j = j0; j < j1; ++j) {
for (int k = k0; k < k1; ++k)
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
On my machine, matrix_multiply_4 runs about three times faster than
matrix_multiply_1, and ran slightly faster (using just one core) than
matrix_multiply_3 did on all four cores. Moreover, we can now
parallelize it
using the cilk_spawn keyword:
void matrix_multiply_5(matrix_t A, matrix_t B, matrix_t C,
int i0, int i1, int j0, int j1, int k0, int
k1) {
int di = i1 - i0;
int dj = j1 - j0;
int dk = k1 - k0;
if (di >= dj && di >= dk && di >= THRESHOLD) {
int mi = i0 + di / 2;
cilk_spawn matrix_multiply_5(A, B, C, i0, mi, j0, j1, k0, k1);
matrix_multiply_5(A, B, C, mi, i1, j0, j1, k0, k1);
cilk_sync;
} else if (dj >= dk && dj >= THRESHOLD) {
int mj = j0 + dj / 2;
cilk_spawn matrix_multiply_5(A, B, C, i0, i1, j0, mj, k0, k1);
matrix_multiply_5(A, B, C, i0, i1, mj, j1, k0, k1);
cilk_sync;
} else if (dk >= THRESHOLD) {
int mk = k0 + dk / 2;
// N.B. It's not safe to use a spawn here. Fun exercise: try putting
// it in and then running Cilkscreen to detect the resulting race.
matrix_multiply_5(A, B, C, i0, i1, j0, j1, k0, mk);
matrix_multiply_5(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
// The problem is now small enough that we can just do things serially.
for (int i = i0; i < i1; ++i) {
for (int j = j0; j < j1; ++j) {
for (int k = k0; k < k1; ++k)
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
Running matrix_multiply_5 on my machine gets a somewhat better
speedup than before -- about 3x on four cores -- and runs over nine times faster
than our original serial algorithm in matrix_multiply_1 and three times
faster than our parallel algorithm in matrix_multiply_3. Note that our speedup
still isn't quite linear, probably either due to memory bandwidth limits or
due to the fact that we aren't able to use a spawn in the third case of
matrix_multiply_5.
Now What?
What's the moral of this story?
- The first lesson is that parallelism
can't always make up for a poor algorithm.
- The second lesson is that
running code in parallel always has overhead (no matter what framework you're using),
and even though Cilk++ gets that overhead very low, it can still be hugely
significant if the overhead is bigger than the actual work we're doing (like it
was when we tried to use a hyperobject in a tight inner loop).
- The third lesson is that shared-memory-multiprocessor hardware can have problems with memory
bandwidth and false sharing, and we need to be aware of that.
So what is Cilk Arts doing to help you avoid problems like these? I
thought you'd never ask!
Let's look back a moment:
- Our first problem was that we were using a
hyperobject where we really didn't need it, and losing more to
overhead than we
were gaining from parallelism. It'd be nice to have a tool that could
calculate the work and span of our program, so that we could see just
how much
parallelism we had, what our granularity was, and how much time we
were losing
to overhead; then, we could easily determine whether the hyperobject
was worth
it.
- Our second problem was that we were having problems with
locality of
reference. It'd be nice to have a tool that could run our program
and look for
problems with false sharing.
Well wouldn't you know it, my project at Cilk Arts this summer was in
developing an analysis tool aimed at doing exactly these sorts of
things. Does
it really work? Well, the tool isn't ready to ship yet, and in fact
it hasn't even been officially announced. However, just recently I was able to
use it to measure and compare the work and spans of the matrix_multiply_2 and
matrix_multiply_3 function I gave above to decide whether using the
hyperobject was worth it. According to the tool's calculations,
using the hyperobject in that case does increase the parallelism of the function
somewhat, but it also makes the span several times larger (because
the inner loop does so little work), which means we can be sure that the whole
function will be much slower even if we had an infinite number of
processors. That's a useful result that we could not have gotten by empirical testing
alone!
The conclusion? Cilk++ is not a magic bullet that will solve all of
your problems for you -- you still need to be judicious when writing your
code.
However, it does make it very easy to expose parallelism in your
code, and with the help of the tools we're working on, it's not too hard to
figure out how to get the best possible parallel performance out of your program.