Subscribe Via Email

Your email:


Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

Thread Pools: The Simplest Concurrency Platform

Posted by Ilya Mirman on Wed, Aug 20, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

In a previous post, we made the argument that do-it-yourself multithreading makes it hard to obtrain three desirable properties of multicore software: scalability, code simplicity, and modularity.   So if not DIY, what other options are available to us?  The answer is, a concurrency platform - an abstraction layer of software that coordinates, schedules, and manages the multicore resources.

There are a variety of concurrency platforms available today, each with their strengths and weaknesses.  (The Wikipedia entry for “parallel programming model” has a nice summary.)  In the next series of blog posts, we will summarize some of the more popular ones, starting (today) with Thread Pools:

  • Thread pools
  • OpenMP
  • Intel's Threading Building Blocks
  • Message passing
  • Data-parallel languages
  • Cilk++

One of the problems with programming Pthreads or WinAPI threads directly is the overhead associated with creating and destroying them.  A thread pool is a strategy for minimizing that overhead and is possibly the simplest concurrency platform.  The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program.  When a task is created, it executes on a thread in the pool, returning the thread to the pool when the task is done.Thread Pools

One of the problems with programming Pthreads or WinAPI threads directly is the overhead associated with creating and destroying them.  A thread pool is a strategy for minimizing that overhead and is possibly the simplest concurrency platform.  The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program.  When a task is created, it executes on a thread in the pool, returning the thread to the pool when the task is done. 

As a practical matter, however, few thread pools are organized in this fashion.  The problem is, what happens when a task arrives to find that there are no threads left in the pool?  Consequently, most thread pools are implemented so that the newly created tasks are placed into a work queue, and when a thread finishes a task, it fetches a new task to execute from the work queue.  If the work queue is empty, it suspends and is woken up when a new task is placed on the work queue.  Synchronization using, for example, a mutual-exclusion lock, is necessary to ensure that tasks are removed atomically so that threads cooperate safely when removing tasks from the work queue.

Great for Independent Tasks 

Thread pools are commonly used for Internet servers, where the tasks represent independent client transactions executing on the server.  In this context, the main issue is scalability, since the work queue represents a bottleneck in the system.  Suppose that the average time to execute a task is x and that the time to remove a task from the queue is y.  Then, the number of threads that can productively use the thread pool is at most x/y, assuming each thread has a processing core to run on.  More than that, and some threads will be starved waiting on the work queue.  For small-scale systems, this limitation is unlikely to matter much, but most thread-pool implementations are not really “future-proof” for this reason.

Risk of Deadlocks 

For application programming, as opposed to server implementation, thread pools pose some concurrency risks.  The reason is that the tasks making up an application tend to be dependent on each other.  In particular, deadlock is a significant concern.  A deadlock occurs when a set of threads creates a cycle of waiting.  For example, suppose that thread 1 holds mutex lock A and is waiting to acquire mutex B, thread 2 is holding mutex B and is waiting to acquire mutex C, and thread 3 holds lock C and is waiting to acquire mutex A.  In this situation, none of the three threads can proceed.  Although deadlock is a concern in any asynchronous concurrency platform, thread pools escalate the concern.  In particular, a deadlock can occur if all threads are executing tasks that are waiting for another task on the work queue in order to  produce a result, but the other task cannot run because all threads are occupied with tasks that are waiting.

Next Time: OpenMP

Questions for You...

  1. What has been your experience with Thread Pools?
  2. When have you had good luck with them?  Bad luck?
  3. What's your favorite concurrency platform?

0 Comments Click here to Read/write comments

A Tale of Two Algorithms: Multithreading Matrix Multiplication

Posted by Matthew Steele on Thu, Aug 14, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

I've been interning at Cilk Arts this summer, working mainly on designing and implementing a very interesting analysis tool for Cilk++ programs (more on that later!). I have also, of course, gotten a chance to spend some time writing Cilk++ code, and boy is it fun. My first impressions were that the cilk_spawn and cilk_for keywords felt like magical "go faster!" keywords -- as long as my loops or recursive functions were written without undue global side-effects (which I tend to avoid anyway, because I'm fond of functional programming), all I had to do was add in the Cilk keywords, and presto, my code ran faster (on a multicore machine).

When I really did need global variables, it was pretty easy to use Cilk hyperobjects to still get race-free parallelism, without using any locks. And there is a great, visceral satisfaction in running Cilkscreen and having it report that your program is 100% race-free -- if you've ever written a program in Haskell and gotten it to compile, you know the good feeling I'm talking about.

Of course, I wasn't much satisfied with everything working well, so I immediately set out to see if I could break it. Here, I've documented my adventures in writing some simple Cilk++ code, some surprises I encountered, and what we're doing over here to help you avoid the same pitfalls in your own parallel code.

Matrix Multiplication

Let's take a look at a few algorithms for multiplying large, dense matrices -- a good example of an algorithm that seems simple on the surface, but that can go quite wrong if you're not careful. The most obvious (serial) algorithm for doing this is to have three nested for-loops, like so:

void matrix_multiply_1(matrix_t A, matrix_t B, matrix_t C) {
    for (int i = 0; i < A.rows; ++i) {
       for (int j = 0; j < B.cols; ++j) {
        C[i][j] = 0;
        for (int k = 0; k < A.cols; ++k)
            C[i][j] += A[i][k] * B[k][j];
                                }
                            }
                        }

Now, this may seem like a perfectly good algorithm -- in fact, it looks really good, because the outer two loops are easily parallelizable using the cilk_for keyword, and the inner loop can be parallelized as well with the use of a hyperobject. Let's put all those in:

void matrix_multiply_2(matrix_t A, matrix_t B, matrix_t C) {
    cilk_for (int i = 0; i < A.rows; ++i) {
        cilk_for (int j = 0; j < B.cols; ++j) {
         hyper_ptr< reducer_opadd > cij;
         cilk_for (int k = 0; k < A.cols; ++k)
            *cij += A[i][k] * B[k][j];
            C[i][j] = cij->get_value();
         }
    }
}

Okay, let's run it! I used these two algorithms to multiply a 687x837 matrix by a 837x1107 matrix on a four-core desktop PC using an alpha version of Cilk++. The second version is highly parallel, so we should get near-linear speedup. However, when I tried it, not only did I not get linear speedup, but the parallel version actually ran twice as slow on four cores as the serial version did on one core! So what's the problem? Has Cilk++ failed us? Not at all, in fact! It turns out that our matrix_multiply_2 code is rather problematic, and there are two big ways we could improve it.

Overhead in Parallelism

The first problem is that we shouldn't have used a hyperobject in that innermost loop -- we should have just left it sequential, like so:

void matrix_multiply_3(matrix_t A, matrix_t B, matrix_t C) {
    cilk_for (int i = 0; i < A.rows; ++i) {
     cilk_for (int j = 0; j < B.cols; ++j) {
      C[i][j] = 0;
      for (int k = 0; k < A.cols; ++k)
        C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Why not use a hyperobject here? Well, hyperobjects are extraordinarily helpful when dealing with code that relies heavily on global variables, but we don't want to get carried away using them just because we can. They do introduce some overhead, and that inner loop is barely doing any work at all -- just a few thousand multiplies and adds -- so it's simply not worth it here. We're much better off letting that loop be serial so that our compiler can optimize the heck out of it, and besides, we've already got boatloads of parallelism from the two outer cilk_for loops.

Running matrix_multiply_3 on my machine yields a 2.7x speedup over matrix_multiply_1 on four cores. Ah, finally, we're getting speedup instead of slowdown! Still, for an algorithm with as much parallelism as this one, we should be doing better than that. What else are we doing wrong?

Choosing an Algorithm

Our other problem is that the serial algorithm we started with in matrix_multiply_1 is actually a pretty terrible way to multiply large matrices! It seems reasonable on the surface, but in practice it has awful cache performance, and for a problem like matrix multiplication, where we're doing much more memory-access than actual calculation, that makes a huge difference. In this case, we should have used an algorithm with better locality of reference, like the recursive algorithm shown here:

void matrix_multiply_4(matrix_t A, matrix_t B, matrix_t C,
            int i0, int i1, int j0, int j1, int k0, int

k1) {
int di = i1 - i0;
int dj = j1 - j0;
int dk = k1 - k0;
if (di >= dj && di >= dk && di >= THRESHOLD) {
   int mi = i0 + di / 2;
   matrix_multiply_4(A, B, C, i0, mi, j0, j1, k0, k1);
   matrix_multiply_4(A, B, C, mi, i1, j0, j1, k0, k1);
} else if (dj >= dk && dj >= THRESHOLD) {
  int mj = j0 + dj / 2;
  matrix_multiply_4(A, B, C, i0, i1, j0, mj, k0, k1);
  matrix_multiply_4(A, B, C, i0, i1, mj, j1, k0, k1);
} else if (dk >= THRESHOLD) {
  int mk = k0 + dk / 2;
  matrix_multiply_4(A, B, C, i0, i1, j0, j1, k0, mk);
  matrix_multiply_4(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
  for (int i = i0; i < i1; ++i) {
   for (int j = j0; j < j1; ++j) {
     C[i][j] = 0;
     for (int k = k0; k < k1; ++k)
     C[i][j] += A[i][k] * B[k][j];
      }
     }
   }
}

On my machine, matrix_multiply_4 runs about three times faster than matrix_multiply_1, and ran slightly faster (using just one core) than matrix_multiply_3 did on all four cores. Moreover, we can now parallelize it using the cilk_spawn keyword:

void matrix_multiply_5(matrix_t A, matrix_t B, matrix_t C,
        int i0, int i1, int j0, int j1, int k0, int

k1) {
    int di = i1 - i0;
    int dj = j1 - j0;
    int dk = k1 - k0;
    if (di >= dj && di >= dk && di >= THRESHOLD) {
     int mi = i0 + di / 2;
     cilk_spawn matrix_multiply_5(A, B, C, i0, mi, j0, j1, k0, k1);
     matrix_multiply_5(A, B, C, mi, i1, j0, j1, k0, k1);
     cilk_sync;
} else if (dj >= dk && dj >= THRESHOLD) {
  int mj = j0 + dj / 2;
  cilk_spawn matrix_multiply_5(A, B, C, i0, i1, j0, mj, k0, k1);
  matrix_multiply_5(A, B, C, i0, i1, mj, j1, k0, k1);
  cilk_sync;
} else if (dk >= THRESHOLD) {
  int mk = k0 + dk / 2;
// N.B. It's not safe to use a spawn here. Fun exercise: try putting
// it in and then running Cilkscreen to detect the resulting race.
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, k0, mk);
   matrix_multiply_5(A, B, C, i0, i1, j0, j1, mk, k1);
} else {
// The problem is now small enough that we can just do things serially.
for (int i = i0; i < i1; ++i) {
  for (int j = j0; j < j1; ++j) {
   C[i][j] = 0;
   for (int k = k0; k < k1; ++k)
    C[i][j] += A[i][k] * B[k][j];
     }
   }
  }
}

Running matrix_multiply_5 on my machine gets a somewhat better speedup than before -- about 3x on four cores -- and runs over nine times faster than our original serial algorithm in matrix_multiply_1 and three times faster than our parallel algorithm in matrix_multiply_3. Note that our speedup still isn't quite linear, probably either due to memory bandwidth limits or due to the fact that we aren't able to use a spawn in the third case of matrix_multiply_5.


Now What?

What's the moral of this story?

  1. The first lesson is that parallelism can't always make up for a poor algorithm.
  2. The second lesson is that running code in parallel always has overhead (no matter what framework you're using), and even though Cilk++ gets that overhead very low, it can still be hugely significant if the overhead is bigger than the actual work we're doing (like it was when we tried to use a hyperobject in a tight inner loop).
  3. The third lesson is that shared-memory-multiprocessor hardware can have problems with memory bandwidth and false sharing, and we need to be aware of that.

So what is Cilk Arts doing to help you avoid problems like these? I thought you'd never ask! Let's look back a moment:

  1. Our first problem was that we were using a hyperobject where we really didn't need it, and losing more to overhead than we were gaining from parallelism. It'd be nice to have a tool that could calculate the work and span of our program, so that we could see just how much parallelism we had, what our granularity was, and how much time we were losing to overhead; then, we could easily determine whether the hyperobject was worth it.
  2. Our second problem was that we were having problems with locality of reference. It'd be nice to have a tool that could run our program and look for problems with false sharing.

Well wouldn't you know it, my project at Cilk Arts this summer was in developing an analysis tool aimed at doing exactly these sorts of things. Does it really work? Well, the tool isn't ready to ship yet, and in fact it hasn't even been officially announced. However, just recently I was able to use it to measure and compare the work and spans of the matrix_multiply_2 and matrix_multiply_3 function I gave above to decide whether using the hyperobject was worth it. According to the tool's calculations, using the hyperobject in that case does increase the parallelism of the function somewhat, but it also makes the span several times larger (because the inner loop does so little work), which means we can be sure that the whole function will be much slower even if we had an infinite number of processors. That's a useful result that we could not have gotten by empirical testing alone!

The conclusion? Cilk++ is not a magic bullet that will solve all of your problems for you -- you still need to be judicious when writing your code.

However, it does make it very easy to expose parallelism in your code, and with the help of the tools we're working on, it's not too hard to figure out how to get the best possible parallel performance out of your program.

0 Comments Click here to Read/write comments

87% efficient parallel conversion of donuts to code

Posted by John Carr on Wed, Aug 06, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 
One morning a few weeks ago I decided to grab a donut on my way to work instead of eating at home in the morning. I bought a donut and a box of 25 Munchkins (donut holes) at the Dunkin Donuts [1] in Auburndale, Massachusetts.

The donut I ate. The box of holes I left in the kitchen before 7:00 AM. By 11:30 three remained. By lunchtime, all were gone.

Wow, that was fast.

I wonder what would happen if I brought in twice as many?

The box of 50 was gone by lunch, but there were more remaining at 11:30 than the first time.

Wow, that's quite an appetite we have. I wonder what would happen if I brought in even more?

The following Monday I brought in 75 donut holes.

By 11:30 about 10 remained, and after lunch only four of the undesirable "plain" variety remained.

The result is, each increment of 25 donuts leaves 3.3 uneaten by 11:30 (round results to integers[2]). In other words, those of us who work mornings are 87% efficient at turning donuts into code. I believe I also proved that satiation is impossible because appetite scales with supply.

Note that we have three college students here working as summer interns. A fourth employee has requested that he be considered a college student for purposes of appetite.


[1] Our founder was recently reminding us to look beyond stereotypes. Those of you inclined to take that advice should know that I once saw a city police officer running a speed trap out of the parking lot of that Dunkin Donuts to meet his ticket quota.


[2] Donut holes are quantized. Half holes are rare.

With ordinary donuts fractions are not merely possible, they are regularly observed as people have an aversion to taking the last piece. So the sequence of donuts remaining may go 3, 2, 1, ½, ¼, ⅛, 0. The sequence terminates because the reward of a sixteenth of a donut is so small and the silliness so large that polite people refuse to split it and one of the rest of us finishes the job in a single bite.

1 Comments Click here to Read/write comments

Tags: 

e-Book on Multicore Programming

Posted by Ilya Mirman on Mon, Jul 28, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

Last week, we published a (free) e-Book, How to Survive the Multicore Software Revolution (or at Least Survive the Hype).

In this e-Book, we have tried to:

  1. Provide some background and context around the emergence of mainstream multicore processors;
  2. Identify the key challenges facing software developers; 
  3. Provide an introduction to multithreading; and 
  4. Review several programming tools and techniques available today, including OpenMP, Intel's TBB, MPI, Cilk++, Pthreads, and others.

We have tried hard to put together an informative, objective, and vendor-neutral resource, and the response has been pretty exciting.  First, our web traffic quadrupled within hours, and the e-Book was downloaded over 1,000 times in the first couple days.

But the other interesting phenomenon was that news of the e-Book was picked up by several bloggers, who drove tons of traffic. On a typical day, most of our traffic comes from Google search results; in the last several days, this has flipped dramatically - with over 75% of the traffic coming from a handful of viral sources, such as bloggers and other community sites:

 


 

So we are pretty psyched with the response -  but we obviously just scratched the surface of what folks might care about. With that in mind - we would love to hear from you:

  • Any suggestions on how we can improve the e-Book?

  • What other content might be of interest to you?

16 Comments Click here to Read/write comments

The Folly Of Do-It-Yourself Multithreading

Posted by Charles Leiserson on Tue, Jul 08, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

Many engineering organizations struggle with NIH ("not invented here") attitudes within their company, since most engineers would rather build than buy. Doing it yourself is more fun - or at least it seems that way when the project kicks off. In the arena of multithreaded software, the attraction of do-it-yourself (DIY) multithreading may sound compelling to some. If you're a top-flight developer, you may well be familiar with POSIX threads (pthreads) or WinAPI threads, and the challenge of building a multicore application from scratch using the primitives provided by these libraries may tantalize your salivary glands. Besides, in the words of Longfellow, won't doing it yourself ensure that it is well done? 

The alternative is to program atop a concurrency platform  — an abstraction layer of software that coordinates, schedules, and manages the multicore resources.  Concurrency platforms include any of various thread-pool libraries, such as .NET’s ThreadPool class; message-passing environments, such as MPI; data-parallel programming environments, such as NESL, RapidMind, or ZPL; task-parallel environments, such as Intel’s Threading Building Blocks (TBB) or Microsoft’s Task Parallel Library (TPL); dynamic multithreading environments, such as Cilk or Cilk++; or combinations, such as OpenMP.  Some concurrency platforms provide linguistic abstractions, some augment languages using pragmas, while others are simply implemented as library functions. 

Three desirable properties 

Although a concurrency platform may provide benefits over DIY multithreading with respect to application performance and software reliability, a strong case against DIY can be made purely on the basis of development time.  (In a previous blog post, I called the three requirements of application performance, software reliability, and development time the multicore-software triad .)  Indeed, as a rule, one can generally develop robust multithreaded software faster using a concurrency platform than doing it yourself with native threads.  The reason is that the DIY strategy makes it hard to obtain three desirable properties of multicore software:

  • Scalability
  • Code simplicity
  • Modularity 

A tiny example 

To illustrate the folly of DIY multithreading and its adverse impact on development time, let’s look at the simple example of parallelizing a recursive Fibonacci calculation.  The ith Fibonacci number is the ith number (indexed from 0) in the sequence 0, 1, 1, 2, 3, 5, 8, 13, 21,, where each number is the sum of the previous two. 

Although this example is tiny and artificial, it illustrates the key issues.  Here’s the original code in C/C++: 

1   #include <stdio.h>
2   #include <stdlib.h>
     
3   int fib(int n)
4   {
5     if (n < 2) return n;
6     else {
7       int x = fib(n-1);
8       int y = fib(n-2);
9       return x + y;
10     }
11   }
     
12   int main(int argc, char *argv[])
13   {
14     int n = atoi(argv[1]);
15     int result = fib(n);
16     printf("Fibonacci of %d is %d.\n", n, result);
17     return 0;
18   }

Incidentally, this algorithm is a terrible way to calculate Fibonacci numbers, since it continually recalculates already computed values and runs in exponential time.  Didactically, however, it’s short and good enough for our purposes.  Just remember that our goal is to parallelize this particular algorithm, not to replace it with a more efficient algorithm for computing Fibonacci numbers.

A version using native threads 

Below is a pthreaded version designed to run on 2 processor cores.  A WinAPI-threaded implementation would follow similar logic, only differing in the names of library functions.  I’m not going to argue that this is the simplest possible threaded code, but it is fairly typical of what you must do to parallelize the application for 2 processor cores.  Anyway, here’s the code:

1   #include <stdio.h>
2   #include <stdlib.h>
3   #include <pthread.h>
     
4   int fib(int n)
5   {
6     if (n < 2) return n;
7     else {
8       int x = fib(n-1);
9       int y = fib(n-2);
10       return x + y;
11     }
12   }
     
13   typedef struct {
14     int input;
15     int output;
16   } thread_args;
     
17   void *thread_func ( void *ptr )
18   {
19     int i = ((thread_args *) ptr)->input;
20     ((thread_args *) ptr)->output = fib(i);
21     return NULL;
22   }
23    
24   int main(int argc, char *argv[])
25   {
26     pthread_t thread;
27     thread_args args;
28     int status;
29     int result;
30     int thread_result;
31     if (argc < 2) return 1;
32