COMMENTS
Nice write-up on parallelism; I'll point the students in my algorithms courses towards it (and perhaps some of this will make it into a revision of the textbook at some point?). If anything, the work/span analysis is even more gloomy than Amdahl. For marketing purposes, you might want to rearrange the math so that you can prove that all software applications will scale linearly to billions of processors; some of the competitors to Cilk seem to be doing that ;-).
I'll note that Amdahl was not the first to call his observation a law; he talked about this a bit at an awards dinner last year. A few years after giving the AFIPS talk, he started hearing people talk about "Amdahl's Law," and he had to ask around to figure out what they were talking about!
Patrick: Good point about gloominess. I guess I could have given it a little more "spin". Personally, I'm not gloomy about the amount of parallelism, because it tends to track memory size, which is much bigger than number of processors. As an example, image-processing applications have parallelism on the order of tens of thousands. An interactive game has perhaps a thousand entities and exhibits parallelism in the hundreds. Numerical problems can have parallelism in the millions.
I didn't know that about the naming of Amdahl's Law. I guess I shouldn't be surprised that he didn't name it. After all, who names a Law after one's self?
It's also not as gloomy when you consider Gustafson's Law which assumes that the parallel portion of an application grows at a faster rate than it's serial portion. We have plenty of empirical evidence for this. In other words, problem sizes tend to increase over time. Amdahl assumed a fixed problem size.
Also Charles, I think the span in your example is actually 8. I counted edges and not vertices.
Michael,
Good point! What Gustafson's Law observes is that for many problems, parallelism grows with N, and N grows with time. For example, quicksorting N numbers has parallelism of O(lg N), since the work is O(N lg N) and the span is order N (mainly due to the partitioning). That's not much parallelism, compared to matrix multiplication which has parallelism of almost N^3. Nevertheless, since we're dealing with larger and larger data sets as time goes on due to Moore's Law, the parallelism grows with time, as long as the parallelism isn't a constant, which it can be.
Another reason Amdahl's Law is gloomier than reality is that people are often interested in response time more than throughput. If I have an interactive application for which most operations can be done serially, but one operation takes a long time but is parallelizable, it may be well worth the effort to parallelize the one operation, even though it contributes to only a small part of the workflow. Moreover, Amdahl analyzes a static case. Making an operation cheap can change the user's behavior, encouraging a workflow in which this operation occurs more frequently -- a positive feedback loop which results in a larger fraction of the workflow that is parallelizable.
Regarding the dag, in this model the vertices represent instructions, not the edges, and so the span is indeed 9. You can actually model the execution both ways, with vertices as instructions or edges. We have a debate right now in Cilk Arts as to which we model we should adopt. The edges as instructions is actually a little more mathematically precise, but my experience in presenting the model is that people are more comfortable with vertices being instructions. Opinions welcome!
Good point about parallelism changing the workflow. It does have a "viral" effect (in a positive way) on workflow in that the user can change their behaviors once you achieve a noticeable change in latency. If that change is an order of magnitude, you have a paradigm shift, which spurs new algorithms and research. It is a nice feedback loop.
As for the SP-DAG model, my opinion is to go with vertices as spawn, sync, and sequence points. This complicates the math a bit because the span now must include the full path back from the leaf to the root (after all, the DAG must be traversed forwards and backwards for the computation to complete). You could offer a simplified model of span being 2 times the forward span if you collapse sequence edges in a thread. You'd only be off a constant factor anyway (it doesn't impact asymptotic behavior), so it might be worth the simplification.
Either way, it's good to have a conceptual model in mind when working with a new language or framework to make back-of-the-envelope predictions on projected or new platforms. I have used the SP-DAG model frequently for quick analysis of average parallelism on future platforms for certain algorithm, since most of these recurrences can be solved easily with the Master Theorem.
Back to that intuitive feeling that the parallelism should be greater than 2. I like to use the example of driving to my friend's home on the other side of town. No matter how hard I put my foot down on the gas pedal on the highway, my average speed is still limited by the number of red lights I have to stop at. Even with a
very fast car!
Several things are missing:
1) Latency. On some platforms, an Idle CPU can react more quickly to an event than a non-idle CPU.
2) Caching. 2 CPUs have 2 times the RAM cache than 1 CPU of the same kind. If exploited well for some technical purpose, like games or graphics, it can mean that slow RAM speeds can be replaced by CPU cache reads, improving performance more times than the number of CPUs. Remember, RAM is often a bottleneck these days.
Btw - the span law often doesn't apply, because the OS or hardware may add overhead for each CPU in the system, making T8>TP.
The laws that you mention, are only valid in theory. Reality is different, and most programmers do not control the full stack of hardware, OS, runtime, programming language and application.
Whatever, man.
No matter what you say, it will still take a woman ~9 months to have a baby. Adding more women to the "team" doesn't make the baby faster. That's what Amdahl's Law is about.
Joe,
Great point. And I bet most obstetricians would agree that the computational dag of a pregnancy is such that the work and span are both 9, and thus can't be sped up. But let's say the result is triplets. suddenly, while the work triples, the time it takes to raise them and send them off to college takes ~18 years, whether or not it's one kid or three. So the first part of the app might be serial in nature, the second part of the app contains some parallelism.
@Joe: If you need a newborn baby fast, borrow one. It's 9*30=270 times faster. That's why all modern CPUs have cache memory. You get a much slower algorithm, if your implementation doesn't exploit the cache.
@Ilya: You assume that the desired result is to have a 9 month process. What if the desired result is to have a newborn available for filming a commercial for baby food?
A good real-life example, based on a true story: You want data to be sorted. You may start out with a standard sort algorithm. Suddenly you find, that it is slow. You find out, that you can increase the speed 5 times by replacing quicksort with mergesort. Then, you find out, that you can increase the speed 10 times more by using an implementation, that knows how to use the CPU cache well. Then, you realize, that Mergesort can use more CPUs, so you use more CPUs. Suddenly, your application is 100 times slower, because the writes to memory from one CPU is flushing the cache from the other CPU. This didn't happen on your development system, because the dual-CPU system was made in a different way... so you realize that you need to make the parallel programming differently.
Finally, you find that you need to separate the lists, that you merge in your mergesort algorithm, better. After a while, your algorithm works well on your development system and on the target system.
Finally satisfied, you put the parallel mergesort algorithm into all places where it makes sense. After a short while, the test department comes back, and reports that the application runs slower when your multithreaded sort algorithm is used. The reason is, that it now uses 2 CPU caches, leaving less cache memory for other parts of the application, slowing down these parts more, than the speed benefit of using 2 CPUs for sorting.
You are finally asked to remove the parallel processing from your mergesort in order to increase the speed of the system.
Please, also read the following article:
http://blog.rednael.com/2009/02/05/ParallelProgrammingUsingTheParallelFramework.aspx It's an article about basic parallel programming. Examples in C# .Net included. Also, it describes a lightweight parallel framework to work with tasks. Opposed to some other frameworks, this one is very light and very easy to use.
After reading this article, you should be able to write code using parallelism.
Regards,
Martijn
I've read some articles about parallel programming with C# in a Spanish magazine "Solo Programadores", but they are in Spanish :-)
I am from Spain. However, I found a very interesting article in Packt Publishing's website:
http://www.packtpub.com/article/simplifying-parallelism-complexity-c-sharp
The author is an Spanish well known writer. It seems he is begining working in english as well.
The article is a chapter of his "C# 2008 and 2005 Threaded Programming" book. Sounds very interesting.
I am currently reading Joe Duffy's "Concurrent Programming on Windows". Highly recommended.
To completement it, I bought "C# 2008 and 2005 Threaded Programming", some reviewers said it had funny examples.
I love exploiting multicore CPUs, it is incredible to see tasks finishing in less time using many cores.
I am going to download Cilk++, however, I am not good with C++. I know C#.
Can you comment on the relative power of multithreaded parallelism, in the Cilk sense, versus nested data parallelism (NESL/DPHaskell/Ct)?
Are there things which you fundamentally can't express efficiently using nested data parallelism?
Bill,
In theory, they are equally powerful in expressing parallelism up to a constant factor. The theorem in the hard direction is that you can express the multithreaded (or any) computation as a circuit made out of NAND gates, and then you can simulate the execution of the circuit using vector operations. Guy Blelloch of CMU has studied this problem in depth. You'd never really want to do such a low-level simulation for generic multithreaded code, however, and thus the practical implications of the theorem are unclear.
In practice, data parallel tends to have a difficult time using caches effectively, compared with multithreading. Data parallel has the advantage, however, that it's easier to build fast processor pipelines for vector operations than for general code flows, and memory behavior is more predictable. I don't think that the processor pipeline will matter in the long run, because the cost of arithmetic operations is going to zero. I believe that the battle will be fought over memory architecture. Which will turn out to be more effective: locality or predictable accessing? Perhaps there's also a middle road.
From a legacy software point of view, however, it seems to me that multithreading looks like an easier path for general multicore enablement than rewriting code as vector operations.
Is work/span model you mentioned similar to average parallelism model referred in "Derek L. Eager, John Zahorjan, and Edward L. La- zowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408{
423, March 1989"? If so, it is also not accurate in case of high variance parallelism.
Sorry for my stupid question.
Yes, the work/span model is based on the work of Eager et al. Why do you say it is not accurate in the case of high-variance parallelism? The theorems say the bounds are tight to within a factor of 2 always, and they approach optimality as work per processor grows compared with span.
There are 2 reasons why I said it was not accurate in case of high variance parallelism. Here are my opinions.
First of all, it's intuitive to infer this because average parallelism is just an average number.Therefore, in case of high variance, it may not accurately depict the real speedup of our systems.
Secondly, to make it more accurate, in my opinion we should consider the variation of parallelism such as what was done in "Technical Report.A model for speedup of parallel programs" of A.B. Downey(Jan 1997.
Thank for your time and consideration.
To support what I said, I give an evidence.
In your writing, you specify " the theory of work and span has led to an excellent understanding of multithreaded scheduling"; however, in "Characterizations of parallelism in applications and their use in scheduling", Kenneth C. Sevcik pointed that "Intuitively, it is easy to see that this approach <average parallelism approuch> will work very well whenever the amount of variation in parallelism within the application is small... However, if the variation is large, then allocating each application a number of processors close to its average parallelism is good only when the overall system load is light to moderate".
Sorry me if I misunderstand your writing.
Thank for your time and consideration.
Allocating a number of processors equal to the parallelism can indeed produce a utilization of 50% over the course of the computation, which can be a substantial waste. If you allocate only 1/10 the parallelism, however, then the utilization is 90%. The key thing is to run with ample
parallel slackness, which damps the variation in instantaneous parallelism (the number of processors with work to do at any given moment).
In his paper, Sevcik generally talks about static scheduling, where a number of processors is allocated before the job begins and it maintains that number throughout the computation. Today's multicore operating systems give a varying number of processors to a job, and systems like Cilk++ yield their processors if they are not doing useful work.
In summary, parallelism is an excellent measure, but you want to run with ample parallel slackness so that you achieve high utilization and near-perfect linear speedup. Of course, there are a bunch of issues we haven't talked about, the most significant of which is probably memory bandwidth, which can limit performance if the cores on the chip cannot be fed from DRAM at a sufficient rate. But, from a scheduling point of view, the methodology of work and span go a long way to providing predictable performance. I invite you to experiment with the Cilkview scalability analyzer, which builds this theory into its analyses.
Thank you for your careful explanation.