Subscribe by Email

Your email:

Cilk++ for Linux is Here!

(release overview)


Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

A Tale of Two Algorithms: Multithreading Matrix Multiplication

Posted by Matthew Steele on Thu, Aug 14, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

I've been interning at Cilk Arts this summer, working mainly on designing and implementing a very interesting analysis tool for Cilk++ programs (more on that later!). I have also, of course, gotten a chance to spend some time writing Cilk++ code, and boy is it fun. My first impressions were that the cilk_spawn and cilk_for keywords felt like magical "go faster!" keywords -- as long as my loops or recursive functions were written without undue global side-effects (which I tend to avoid anyway, because I'm fond of functional programming), all I had to do was add in the Cilk keywords, and presto, my code ran faster (on a multicore machine).

When I really did need global variables, it was pretty easy to use Cilk hyperobjects to still get race-free parallelism, without using any locks. And there is a great, visceral satisfaction in running Cilkscreen and having it report that your program is 100% race-free -- if you've ever written a program in Haskell and gotten it to compile, you know the good feeling I'm talking about.

Of course, I wasn't much satisfied with everything working well, so I immediately set out to see if I could break it. Here, I've documented my adventures in writing some simple Cilk++ code, some surprises I encountered, and what we're doing over here to help you avoid the same pitfalls in your own parallel code.

Matrix Multiplication

Let's take a look at a few algorithms for multiplying large, dense matrices -- a good example of an algorithm that seems simple on the surface, but that can go quite wrong if you're not careful. The most obvious (serial) algorithm for doing this is to have three nested for-loops, like so:

void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C) {
    for (int i = 0; i < A.rows; ++i) {
       for (int j = 0; j < B.cols; ++j) {
        for (int k = 0; k < A.cols; ++k)
            C[i][j] += A[i][k] * B[k][j];
                                }
                            }
                        }

Now, this may seem like a perfectly good algorithm -- in fact, it looks really good, because the outer two loops are easily parallelizable using the cilk_for keyword, and the inner loop can be parallelized as well with the use of a hyperobject. Let's put all those in:

void matrix_multiply_2(matrix_t A, matrix_t B, matrix_t C) {
    cilk_for (int i = 0; i < A.rows; ++i) {
        cilk_for (int j = 0; j < B.cols; ++j) {
         hyper_ptr< reducer_opadd > cij;
         cilk_for (int k = 0; k < A.cols; ++k)
            *cij += A[i][k] * B[k][j];
            C[i][j] = cij->get_value();
         }
    }
}

Okay, let's run it! I used these two algorithms to multiply a 687x837 matrix by a 837x1107 matrix on a four-core desktop PC using an alpha version of Cilk++. The second version is highly parallel, so we should get near-linear speedup. However, when I tried it, not only did I not get linear speedup, but the parallel version actually ran twice as slow on four cores as the serial version did on one core! So what's the problem? Has Cilk++ failed us? Not at all, in fact! It turns out that our matrix_multiply_2 code is rather problematic, and there are two big ways we could improve it.

Overhead in Parallelism

The first problem is that we shouldn't have used a hyperobject in that innermost loop -- we should have just left it sequential, like so:

void matrix_multiply_3(matrix_t A, matrix_t B, matrix_t C) {
    cilk_for (int i = 0; i < A.rows; ++i) {
     cilk_for (int j = 0; j < B.cols; ++j) {
      for (int k = 0; k < A.cols; ++k)
        C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Why not use a hyperobject here? Well, hyperobjects are extraordinarily helpful when dealing with code that relies heavily on global variables, but we don't want to get carried away using them just because we can. They do introduce some overhead, and that inner loop is barely doing any work at all -- just a few thousand multiplies and adds -- so it's simply not worth it here. We're much better off letting that loop be serial so that our compiler can optimize the heck out of it, and besides, we've already got boatloads of parallelism from the two outer cilk_for loops.

Running matrix_multiply_3 on my machine yields a 2.7x speedup over matrix_multiply_1 on four cores. Ah, finally, we're getting speedup instead of slowdown! Still, for an algorithm with as much parallelism as this one, we should be doing better than that. What else are we doing wrong?

Choosing an Algorithm

Our other problem is that the serial algorithm we started with in matrix_multiply_1 is actually a pretty terrible way to multiply large matrices! It seems reasonable on the surface, but in practice it has awful cache performance, and for a problem like matrix multiplication, where we're doing much more memory-access than actual calculation, that makes a huge difference. In this case, we should have used an algorithm with better locality of reference, like the recursive algorithm shown here:

void matrix_multiply_4(matrix_t A, matrix_t B, matrix_t C,
            int i0, int i1, int j0, int j1, int k0, int

k1) {
int di = i1 - i0;
int dj = j1 - j0;
int dk = k1 - k0;
if (di >= dj && di >= dk && di >= THRESHOLD) {
   int mi = i0 + di / 2;
   matrix_multiply_4(A, B, C, i0, mi, j0, j1, k0, k1);
   matrix_multiply_4(A, B, C, mi, i1, j0, j1, k0, k1);
} else if (dj >= dk && dj >= THRESHOLD) {
  int mj = j0 + dj / 2;
  matrix_multiply_4(A, B, C, i0, i1, j0, mj, k0, k1);
  matrix_multiply_4(A, B, C, i0, i1, mj, j1, k0, k1);
} else if (dk >= THRESHOLD) {
  int mk = k0 + dk / 2;
  matrix_multiply_4(A, B, C, i0, i1, j0, j1, k0, mk);
  matrix_multiply_4(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
  for (int i = i0; i < i1; ++i) {
   for (int j = j0; j < j1; ++j) {
     for (int k = k0; k < k1; ++k)
     C[i][j] += A[i][k] * B[k][j];
      }
     }
   }
}

On my machine, matrix_multiply_4 runs about three times faster than matrix_multiply_1, and ran slightly faster (using just one core) than matrix_multiply_3 did on all four cores. Moreover, we can now parallelize it using the cilk_spawn keyword:

void matrix_multiply_5(matrix_t A, matrix_t B, matrix_t C,
        int i0, int i1, int j0, int j1, int k0, int

k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
     int mi = i0 + di / 2;
     cilk_spawn matrix_multiply_5(A, B, C, i0, mi, j0, j1, k0, k1);
     matrix_multiply_5(A, B, C, mi, i1, j0, j1, k0, k1);
     cilk_sync;
} else if (dj >= dk && dj >= THRESHOLD) {
  int mj = j0 + dj / 2;
  cilk_spawn matrix_multiply_5(A, B, C, i0, i1, j0, mj, k0, k1);
  matrix_multiply_5(A, B, C, i0, i1, mj, j1, k0, k1);
  cilk_sync;
} else if (dk >= THRESHOLD) {
  int mk = k0 + dk / 2;
// N.B. It's not safe to use a spawn here. Fun exercise: try putting
// it in and then running Cilkscreen to detect the resulting race.
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, k0, mk);
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
// The problem is now small enough that we can just do things serially.
for (int i = i0; i < i1; ++i) {
  for (int j = j0; j < j1; ++j) {
   for (int k = k0; k < k1; ++k)
    C[i][j] += A[i][k] * B[k][j];
     }
   }
  }
}

Running matrix_multiply_5 on my machine gets a somewhat better speedup than before -- about 3x on four cores -- and runs over nine times faster than our original serial algorithm in matrix_multiply_1 and three times faster than our parallel algorithm in matrix_multiply_3. Note that our speedup still isn't quite linear, probably either due to memory bandwidth limits or due to the fact that we aren't able to use a spawn in the third case of matrix_multiply_5.


Now What?

What's the moral of this story?

  1. The first lesson is that parallelism can't always make up for a poor algorithm.
  2. The second lesson is that running code in parallel always has overhead (no matter what framework you're using), and even though Cilk++ gets that overhead very low, it can still be hugely significant if the overhead is bigger than the actual work we're doing (like it was when we tried to use a hyperobject in a tight inner loop).
  3. The third lesson is that shared-memory-multiprocessor hardware can have problems with memory bandwidth and false sharing, and we need to be aware of that.

So what is Cilk Arts doing to help you avoid problems like these? I thought you'd never ask! Let's look back a moment:

  1. Our first problem was that we were using a hyperobject where we really didn't need it, and losing more to overhead than we were gaining from parallelism. It'd be nice to have a tool that could calculate the work and span of our program, so that we could see just how much parallelism we had, what our granularity was, and how much time we were losing to overhead; then, we could easily determine whether the hyperobject was worth it.
  2. Our second problem was that we were having problems with locality of reference. It'd be nice to have a tool that could run our program and look for problems with false sharing.

Well wouldn't you know it, my project at Cilk Arts this summer was in developing an analysis tool aimed at doing exactly these sorts of things. Does it really work? Well, the tool isn't ready to ship yet, and in fact it hasn't even been officially announced. However, just recently I was able to use it to measure and compare the work and spans of the matrix_multiply_2 and matrix_multiply_3 function I gave above to decide whether using the hyperobject was worth it. According to the tool's calculations, using the hyperobject in that case does increase the parallelism of the function somewhat, but it also makes the span several times larger (because the inner loop does so little work), which means we can be sure that the whole function will be much slower even if we had an infinite number of processors. That's a useful result that we could not have gotten by empirical testing alone!

The conclusion? Cilk++ is not a magic bullet that will solve all of your problems for you -- you still need to be judicious when writing your code.

However, it does make it very easy to expose parallelism in your code, and with the help of the tools we're working on, it's not too hard to figure out how to get the best possible parallel performance out of your program.

Tags: , , , , , , , ,

COMMENTS

How does the Cilk execution compare to a hand tuned library like Goto or Intel's MKL?

posted @ Wednesday, November 05, 2008 10:51 PM by Theo


Theo, 
 
the program that Matt shows is an example for illustration purposes. As you know, any high-performance implementation needs an optimized kernel for the base case of the recursion, and Matt is not employing 
any optimized kernel. 
 
We have, however, at least two instances in which previous versions of Cilk were employed in conjunction with optimized kernels. Bradley Kuzsmaul 
used a variant of Matt's algorithm in his submission for the <a href="http://bradley.csail.mit.edu/~bradley/hpcc06/" 2006 HPCC challenge /a>, for which he reports about 1 Teraflop on a 256-processor SGI Altix, 
using the Intel MKL as a sequential kernel and the MIT version of Cilk for the parallel scheduling. 
 
A couple of years ago I went down a similar path on an IBM Power5 system, 
produced a kernel that was running at 98% of peak performance (4 flops/cycle) for in-cache problems, and plugged it into the 
algorithm that Matt is describing. The 
final program was running at 92% of peak performance for out of cache problems on one core, and scaling linearly up to 16 (I did not have more than 16 cores available to me). 
 
Regards, 
Matteo Frigo 
Cilk Arts

posted @ Thursday, November 06, 2008 2:30 PM by Matteo Frigo


Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Receive email when someone replies.