Last week I went to the 2008 Programming Language Design and
Implementation (PLDI) conference to receive the award for the most influential
paper published in PLDI 1998. The award
committee had the following to say about our paper:
"The
1998 PLDI paper "Implementation of the Cilk-5 Multithreaded Language"
by Matteo Frigo, Charles E. Leiserson, and Keith H. Randall introduced an
efficient form of thread-local deques to control scheduling of multithreaded
programs. This innovation not only opened the way to faster and simpler
runtimes for fine-grained parallelism, but also provided a basis for simpler parallel
recursive programming techniques that elegantly extend those of sequential
programming. The stack-like side of a deque acts just like a standard procedure
stack, while the queue side enables breadth-first work-stealing by other
threads. The work-stealing techniques introduced in this paper are beginning to
influence common practice, such as the Intel Threading Building Blocks project,
an upcoming standardized fork-join framework for Java, and a variety of
projects at Microsoft."
At Cilk Arts we are really honored by this recognition. I would like to express my deepest gratitude
to Kathleen Fisher and to the rest of the award committee for honoring us in
this way.
Our 1998 paper is based on a couple of simple ideas that
taken together yield a powerful point of view on parallel programming.
The first idea is that creating a new thread does not have
to be expensive. For example, the spawn
keyword in the MIT Cilk implementation costs you something like 10-20
cycles. (Mohr, Kranz, and Halstead were
the first to show how to do this with their lazy task creation technique, which
we optimized by eliminating locks in the common case.)
From the point of view of the programmer, cheap expression
of parallelism eliminates the whole issue of thread granularity that you have
in, e.g., pthreads programs, where you must make sure that the thread that you
create performs enough work to amortize the overhead of the thread
creation. (Then you get into thread
pools, which are a nightmare to manage when you have more than one library
managing its own threads.) In Cilk, in
contrast, you do not have to worry too much about the cost of a spawn: If you
think that a function can be executed in parallel, just spawn it---it will
probably not slow down your program.
The second idea is harder to explain, and you will have to
read the paper for the details, but it boils down to this. In the execution of a parallel program, one
can distinguish between cycles that would be spent on a single-core execution
(called the “work term” in the paper), and cycles that are spent solely because
of parallelism, e.g., spent migrating work from one core to another (called the
“critical-path term” in the paper). When
implementing a parallel programming language, the implementer usually faces
tradeoffs between reducing overheads in the work term or reducing overheads in
the critical-path term. The idea is
that, in Cilk, we *postulate* that the work term is more important.
Why is the work term more important than the critical path
term? (By the way, in more recent
publications we don't say “critical path,” and use the word “span”
instead.) The paper explains that the
work term is more important whenever your program has more parallelism than the
number of availables cores. This tends
to be the case in Cilk programs because 1) expressing parallelism is cheap, so
Cilk programs tend to spawn millions of threads, and 2) if your program did not
have enough parallelism you would not bother running it on many cores anyway. This assumption of abundant “parallel
slackness” leads to several implementation choices explained in the paper, such
as the “two-clone” strategy and an asymmetric mostly lock-free protocol for
work-stealing.
These ideas form a powerful point of view that continues to
be fruitful to this day, and that has impacted the implementation of the new
Cilk++ features such as exception handling and hyperobjects. I look forward to writing another paper on
these topics some day.
PS: I would also like to mention another deserving paper
from PLDI 1998: “Optimizing direct-threaded code by selective inlining,” by Ian
Piumarta and Fabio Riccardi. This paper
describes the technology that was later used in QEMU, and it qualifies in my
opinion as one of the greatest hacks of all time.