Subscribe Via Email

Your email:


Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

What the $#@! is Parallelism, Anyhow?

Posted by Charles Leiserson on Tue, May 27, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

I’m constantly amazed how many seemingly well-educated computer technologists bandy about the word parallelism without really knowing what they’re talking about. I can’t tell you how many articles and books I’ve read on parallel computing that use the term over and over without ever defining it. Many of these “authoritative” sources cite Amdahl’s Law (1), originally proffered by Gene Amdahl in 1967, but they seem blissfully unaware of the more general and precise quantification of parallelism provided by theoretical computer science. Since the theory really isn’t all that hard, it curious that it isn’t better known. Maybe it needs a better name — “Law” sounds so authoritative. In this blog, I’ll give a brief introduction to this theory, which incidentally provides a foundation for the efficiency of the Cilk++ runtime system.

Amdahl’s Law

First, let’s look at Amdahl’s Law and see what it says and what it doesn’t say. Amdahl made what amounts to the following observation. Suppose that 50% of a computation can be parallelized and 50% can’t. Then, even if the 50% that is parallel took no time at all to execute, the total time is cut at most in half, leaving a speedup of less than 2. In general, if a fraction p of a computation can be run in parallel and the rest must run serially, Amdahl’s Law upper-bounds the speedup by 1/(1–p).

This argument was used in the 1970’s and 1980’s to argue that parallel computing, which was in its infancy back then, was a bad idea — the implication being that most applications have long, inherently serial subcomputations that limit speedup. We now know from numerous examples that there are plenty of applications that can be effectively sped up by parallel computers, but Amdahl’s Law doesn’t really help in understanding how much speedup you can expect from your application. After all, few applications can be decomposed so simply into just a serial part and a parallel part. Theory to the rescue!

A model for multithreaded execution

As with much of theoretical computer science, we need a model of multithreaded execution in order to give a precise definition of parallelism. We can use the dag model for multithreading, which I talked about in my blog, “Are determinacy-race bugs lurking in your multicore application?” (A dag is a directed acyclic graph.) The dag model views the execution of a multithreaded program as a set of instructions (the vertices of the dag) with graph edges indicating dependences between instructions. We say that an instruction x precedes an instruction y, sometimes denoted x ≺ y, if x must complete before y can begin. In a diagram for the dag, x ≺ y means that there is a positive-length path from x to y. If neither x ≺ y nor y ≺ x, we say the instructions are in parallel, denoted x ∥ y. The figure below illustrates a multithreaded dag:

 


 

In the figure, we have, for example, 1 ≺ 2, 6 ≺ 12, and 4 ∥ 9.

Just by eyeballing, what would you guess is the parallelism of the dag? About 3? About 5? It turns out that two measures of the dag, called work and span, allow us to define parallelism precisely, as well as to provide some key bounds on performance. I’m going to christen these bounds “Laws,” so as to compete with the Amdahl cognoscenti. If I’ve learned anything about business, it’s the importance of marketing!

Work

The first important measure is work, which is what you get when you add up the total amount of time for all the instructions. Assuming for simplicity that it takes unit time to execute an instruction, the work for the example dag is 18, because there are 18 vertices in the dag. The literature contains extensions to this theoretical model to handle nonunit instruction times, caching, etc., but for now, dealing with these other effects will only complicate matters.

Let’s adopt a simple notation. Let TP be the fastest possible execution time of the application on P processors. Since the work corresponds to the execution time on 1 processor, we denote it by T1. Among the reasons that work is an important measure is because it provides a bound — Oops, I mean Law — on any P-processor execution time:

 

The Work Law holds, because in our model, each processor executes at most 1 instruction per unit time, and hence P processors can execute at most P instructions per unit time. Thus, to do all the work on P processors, it must take at least T1/P time.

We can interpret the Work Law in terms of speedup. Using our notation, the speedup on P processors is just T1/TP, which is how much faster the application runs on P processors than on 1 processor. Rewriting the Work Law, we obtain T1/TP ≤ P, which is to say that the speedup on P processors can be at most P. If the application obtains speedup proportional to P, we say that the speedup is linear. If it obtains speedup exactly P (which is the best we can do in our model), we say that the application exhibits perfect linear speedup. If the application obtains speedup greater than P (which can’t happen in our model due to the work bound, but can happen in models that incorporate caching and other processor effects), we say that the application exhibits superlinear speedup.

Span

The second important measure is span, which is the longest path of dependences in the dag. The span of the dag in the figure is 9, which corresponds to the path 1→ 2 → 3→ 6 → 7 → 8 → 11→ 12 → 18. This path is sometimes called the critical path of the dag, and span is sometimes referred to in the literature as critical-path length. Since the span is the theoretically fastest time the dag could be executed on a computer with an infinite number of processors (assuming no overheads for communication, scheduling, etc.), we denote it by T.

Like work, span also provides a bou…, uhhh, Law on P-processor execution time:

 

The Span Law holds for the simple reason that a finite number of processors cannot outperform an infinite number of processors, because the infinite-processor machine could just ignore all but P of its processors and mimic a P-processor machine exactly.

Parallelism

Parallelism is defined as the ratio of work to span, or T1/T. Why does this definition make sense? There are several ways to understand it:

  1. The parallelism T1/T is the average amount of work along each step of the critical path.
  2. The parallelism T1/T is the maximum possible speedup that can be obtained by any number of processors.
  3. Perfect linear speedup cannot be obtained for any number of processors greater than the parallelism T1/T. To see this third point, suppose that P> T1/T, in which case the Span Law TP ≥ T implies that the speedup T1/TP satisfies T1/TP T1/T< P. Since the speedup is strictly less than P, it cannot be perfect linear speedup. Note also that if P ≫ T1/T, then T1/TP ≪ P — the more processors you have beyond the parallelism, the less “perfect” the speedup.

For our example, the parallelism is 18/9 = 2. Thus, no matter how many processors execute the program, the greatest speedup that can be attained is only 2, which frankly isn’t much. Somehow, to my eye, it looks like more, but the math doesn’t lie.

Amdahl’s Law Redux

Amdahl’s Law for the case where a fraction p of the application is parallel and a fraction 1–p is serial simply amounts to the special case where T> (1–p) T1. In this case, the maximum possible speedup is T1/T< 1/(1–p). Amdahl’s Law is simple, but the Work and Span Laws are far more powerful.

In particular, the theory of work and span has led to an excellent understanding of multithreaded scheduling, at least for those who know the theory. As it turns out, scheduling a multithreaded computation to achieve optimal performance is NP-complete, which in lay terms means that it is computationally intractable. Nevertheless, practical scheduling algorithms exist based on work and span that can schedule any multithreaded computation near optimally. The Cilk++ runtime system contains such a near-optimal scheduler. I’ll talk about multithreaded scheduling in another blog, where I’ll show how the Work and Span Laws really come into play.

1. Amdahl, Gene. The validity of the single processor approach to achieving large-scale computing capabilities. Proceedings of the AFIPS Spring Joint Computer Conference. April 1967, pp. 483-485.

Tags: , , , ,

COMMENTS

Nice write-up on parallelism; I'll point the students in my algorithms courses towards it (and perhaps some of this will make it into a revision of the textbook at some point?). If anything, the work/span analysis is even more gloomy than Amdahl. For marketing purposes, you might want to rearrange the math so that you can prove that all software applications will scale linearly to billions of processors; some of the competitors to Cilk seem to be doing that ;-).
I'll note that Amdahl was not the first to call his observation a law; he talked about this a bit at an awards dinner last year. A few years after giving the AFIPS talk, he started hearing people talk about "Amdahl's Law," and he had to ask around to figure out what they were talking about!

posted @ Friday, June 20, 2008 6:23 PM by Patrick H. Madden


Patrick: Good point about gloominess. I guess I could have given it a little more "spin". Personally, I'm not gloomy about the amount of parallelism, because it tends to track memory size, which is much bigger than number of processors. As an example, image-processing applications have parallelism on the order of tens of thousands. An interactive game has perhaps a thousand entities and exhibits parallelism in the hundreds. Numerical problems can have parallelism in the millions.
I didn't know that about the naming of Amdahl's Law. I guess I shouldn't be surprised that he didn't name it. After all, who names a Law after one's self?

posted @ Sunday, June 22, 2008 11:10 AM by Charles E. Leiserson


It's also not as gloomy when you consider Gustafson's Law which assumes that the parallel portion of an application grows at a faster rate than it's serial portion. We have plenty of empirical evidence for this. In other words, problem sizes tend to increase over time. Amdahl assumed a fixed problem size. 
 
 
 
Also Charles, I think the span in your example is actually 8. I counted edges and not vertices.

posted @ Wednesday, July 02, 2008 10:09 AM by Michael Champigny


Michael, 
 
Good point! What Gustafson's Law observes is that for many problems, parallelism grows with N, and N grows with time. For example, quicksorting N numbers has parallelism of O(lg N), since the work is O(N lg N) and the span is order N (mainly due to the partitioning). That's not much parallelism, compared to matrix multiplication which has parallelism of almost N^3. Nevertheless, since we're dealing with larger and larger data sets as time goes on due to Moore's Law, the parallelism grows with time, as long as the parallelism isn't a constant, which it can be.  
 
Another reason Amdahl's Law is gloomier than reality is that people are often interested in response time more than throughput. If I have an interactive application for which most operations can be done serially, but one operation takes a long time but is parallelizable, it may be well worth the effort to parallelize the one operation, even though it contributes to only a small part of the workflow. Moreover, Amdahl analyzes a static case. Making an operation cheap can change the user's behavior, encouraging a workflow in which this operation occurs more frequently -- a positive feedback loop which results in a larger fraction of the workflow that is parallelizable. 
 
Regarding the dag, in this model the vertices represent instructions, not the edges, and so the span is indeed 9. You can actually model the execution both ways, with vertices as instructions or edges. We have a debate right now in Cilk Arts as to which we model we should adopt. The edges as instructions is actually a little more mathematically precise, but my experience in presenting the model is that people are more comfortable with vertices being instructions. Opinions welcome!

posted @ Thursday, July 03, 2008 12:33 PM by Charles E. Leiserson


Good point about parallelism changing the workflow. It does have a "viral" effect (in a positive way) on workflow in that the user can change their behaviors once you achieve a noticeable change in latency. If that change is an order of magnitude, you have a paradigm shift, which spurs new algorithms and research. It is a nice feedback loop. 
 
As for the SP-DAG model, my opinion is to go with vertices as spawn, sync, and sequence points. This complicates the math a bit because the span now must include the full path back from the leaf to the root (after all, the DAG must be traversed forwards and backwards for the computation to complete). You could offer a simplified model of span being 2 times the forward span if you collapse sequence edges in a thread. You'd only be off a constant factor anyway (it doesn't impact asymptotic behavior), so it might be worth the simplification.  
 
Either way, it's good to have a conceptual model in mind when working with a new language or framework to make back-of-the-envelope predictions on projected or new platforms. I have used the SP-DAG model frequently for quick analysis of average parallelism on future platforms for certain algorithm, since most of these recurrences can be solved easily with the Master Theorem.

posted @ Saturday, July 05, 2008 8:59 AM by MIchael Champigny


Back to that intuitive feeling that the parallelism should be greater than 2. I like to use the example of driving to my friend's home on the other side of town. No matter how hard I put my foot down on the gas pedal on the highway, my average speed is still limited by the number of red lights I have to stop at. Even with a very fast car!

posted @ Tuesday, July 15, 2008 5:43 AM by John Aynsley


Great article! If you do a followup, will you mention political parallelism as well?

posted @ Sunday, August 10, 2008 12:51 PM by Naturalist


Several things are missing: 
 
1) Latency. On some platforms, an Idle CPU can react more quickly to an event than a non-idle CPU. 
 
2) Caching. 2 CPUs have 2 times the RAM cache than 1 CPU of the same kind. If exploited well for some technical purpose, like games or graphics, it can mean that slow RAM speeds can be replaced by CPU cache reads, improving performance more times than the number of CPUs. Remember, RAM is often a bottleneck these days. 

posted @ Sunday, August 10, 2008 3:32 PM by Lars D


Btw - the span law often doesn't apply, because the OS or hardware may add overhead for each CPU in the system, making T8>TP. 
 
The laws that you mention, are only valid in theory. Reality is different, and most programmers do not control the full stack of hardware, OS, runtime, programming language and application. 

posted @ Sunday, August 10, 2008 3:58 PM by Lars D


Whatever, man. 
 
No matter what you say, it will still take a woman ~9 months to have a baby. Adding more women to the "team" doesn't make the baby faster. That's what Amdahl's Law is about.

posted @ Sunday, August 10, 2008 7:58 PM by Joe Chung


Joe, 
 
Great point. And I bet most obstetricians would agree that the computational dag of a pregnancy is such that the work and span are both 9, and thus can't be sped up. But let's say the result is triplets. suddenly, while the work triples, the time it takes to raise them and send them off to college takes ~18 years, whether or not it's one kid or three. So the first part of the app might be serial in nature, the second part of the app contains some parallelism.

posted @ Sunday, August 10, 2008 9:16 PM by Ilya Mirman


@Joe: If you need a newborn baby fast, borrow one. It's 9*30=270 times faster. That's why all modern CPUs have cache memory. You get a much slower algorithm, if your implementation doesn't exploit the cache.

posted @ Monday, August 11, 2008 6:38 AM by Lars D


@Ilya: You assume that the desired result is to have a 9 month process. What if the desired result is to have a newborn available for filming a commercial for baby food? 
 
A good real-life example, based on a true story: You want data to be sorted. You may start out with a standard sort algorithm. Suddenly you find, that it is slow. You find out, that you can increase the speed 5 times by replacing quicksort with mergesort. Then, you find out, that you can increase the speed 10 times more by using an implementation, that knows how to use the CPU cache well. Then, you realize, that Mergesort can use more CPUs, so you use more CPUs. Suddenly, your application is 100 times slower, because the writes to memory from one CPU is flushing the cache from the other CPU. This didn't happen on your development system, because the dual-CPU system was made in a different way... so you realize that you need to make the parallel programming differently. 
 
Finally, you find that you need to separate the lists, that you merge in your mergesort algorithm, better. After a while, your algorithm works well on your development system and on the target system. 
 
Finally satisfied, you put the parallel mergesort algorithm into all places where it makes sense. After a short while, the test department comes back, and reports that the application runs slower when your multithreaded sort algorithm is used. The reason is, that it now uses 2 CPU caches, leaving less cache memory for other parts of the application, slowing down these parts more, than the speed benefit of using 2 CPUs for sorting. 
 
You are finally asked to remove the parallel processing from your mergesort in order to increase the speed of the system. 

posted @ Monday, August 11, 2008 7:21 AM by Lars D


Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Receive email when someone replies.