Subscribe by Email

Your email:

Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

The Power of Well-Structured Parallelism (answering a FAQ about Cilk++)

Posted by John Carr on Fri, Jan 23, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

The Question...

A few people have asked, more or less, "How can I return from a function without waiting for functions I called to complete?"

The answer is, you can't.

In Cilk++ there is an implicit cilk_sync at the end of every function. These three functions behave the same for practical purposes:

 	void f1() { f2(); }

void f1() { cilk_spawn f2(); }

void f1() { cilk_spawn f2(); cilk_sync; }

In each case function f1 does not return until after f2 returns.

Why Sync?

The sync helps reduce complexity of the implementation, and it also discourages complexity in programs, but most importantly it makes Cilk++ superior to its predecessors in two ways that we believe are essential to effective resource management and ease of use.

First, the implicit sync ensures that resource use of a program does not grow out of proportion to the parallelism in the program. Some older parallel languages turned out to have unbounded resource use -- the longer your program runs, the more memory it uses. In contrast, Cilk++ has been proven to have bounded resource use. On a P-processor system, a Cilk++ program uses no more than P times as much stack memory as it does on a single-processor system. The proof of this property depends on the structure of the control flow graph, including the assumption that functions sync before returning.

Second, a sync at the end of every function helps give a race-free parallel program the same behavior as the corresponding serial program ("serial semantics"). A call to a function works the same regardless of whether the called function contains internal spawns. You don't have to worry about whether the function (which may be in a library you don't have source code for) left another function running in the background that you have to clean up after. Only the spawns visible in the current scope affect a function's behavior.

What Cilk++ is (and is not)

More generally, Cilk++ is not designed for explicit task management: cilk_spawn is not pthread_create. Explicit task management is complicated. Cilk++ is much simpler. Cilk++ keywords grant permission for parallel execution rather than commanding parallel execution. This seemingly small difference actually gives Cilk++ massive advantages.

One important implication is that cilk_spawn is cheap. A spawn just does a little bookkeeping: making a note in a data structure saying that a function call may execute in parallel and checking on return from the function whether it actually executed in parallel. In our current implementation, the cost of cilk_spawn is comparable to the overhead of about 4 function calls. In contrast, pthread_create costs about 1800 times the cost of a function call.

When Cilk++ does need to migrate work to load-balance among the processors, however, it can be expensive -- tens of thousands of cycles instead of tens of cycles. The Cilk++ runtime system guarantees thread migration to be rare if the application has sufficient parallelism. (What is sufficient is a topic we'll address in the future.) Consequently, the overhead of the Cilk++ runtime system for most applications is at most a couple percent, and performance of Cilk++ applications generally scales up linearly with the number of processors.

The requirement that every function syncs before returning means that certain styles of programming work better with Cilk++ than others. Divide-and-conquer techniques tend to work well, for example. Quicksort is an example of such an algorithm. A cilk_for loop is another -- when you change the keyword "for" to "cilk_for" the compiler rewrites your loop as a recursive function that repeatedly divides the range of the loop.

Some multithreaded programs, taken as a whole, do not map naturally to Cilk++. These programs may pass messages back and forth between threads or poll for changes made by another thread. The problem with these programs is that they do not have serial semantics. Serial C++ programs never interleave execution of different functions. A solution is to look for a part of the program with serial semantics. For example, if a GUI thread and a compute thread exchange messages, it may be possible to use Cilk++ within the part of the compute thread that does the work.

Because you can't force creation of a thread, you don't need to explicitly destroy threads either. Cilk++ garbage collects spawned functions. You can spawn off a million function calls in a long running loop without a cilk_sync, if you like. On a four core system your program won't use more than four times the stack space of a single-threaded program. It may be inefficient for one function to spawn a million times (another topic we'll talk about that later), but our system guarantees that the result will not be a million threads running in parallel.

What this means for Cilk++ developers

  • Give your functions serial semantics. A function does some work and returns when it's done.
  • Think about decomposing the problem into parts, not managing threads.

How best to divide the problem into parts, and what to do when your program doesn't seem to fit these principles, are subjects for future discussion.

0 Comments Click here to read/write comments

Four Reasons Why Parallel Programs Should Have Serial Semantics

Posted by Steve Lewin-Berlin on Fri, Dec 12, 2008
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Some parallel programming environments require the developer to relearn the fundamentals of programming in order to think in parallel. Cilk++ takes a different approach. One basic design principle of Cilk++ is that Cilk++ programs have serial semantics, that is, a Cilk++ program can be understood (and executed) as a serial program. The Cilk++ keywords were designed to make this serialization look similar to the parallel Cilk++ program.

This simple but powerful principle guides more than just the appearance of Cilk++ programs. It also simplifies the process of developing and testing a Cilk++ program, and it allows Cilk Arts to provide powerful, efficient, and provably correct tools.

(Serial semantics should not be confused with sequential consistency, which is a concept invented by Leslie Lamport and discussed extensively in the literature of parallel computing. A parallel computing system with sequentially consistent memory guarantees that, in Lamport's words, "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." In contrast, serial semantics means that the program can be executed using a single thread of control as an ordinary serial program - one does not need a conceptual parallel model to understand -- or execute -- the program.)

Cilk++ Serial Semantics

First, let's take a closer look at how Cilk++ provides serial semantics.

Cilk++ provides three new keywords: cilk_spawn, cilk_for, and cilk_sync. Each of these has an intuitive serial interpretation:

Keyword

Parallel meaning

Serial meaning

cilk_spawn

Allow a called function to run in parallel with its caller

Call a function normally, where the caller waits for the called function to return before resuming

cilk_sync

Wait for parallel activity to complete

Do nothing (there is no parallel activity)

cilk_for

Allow multiple iterations of the body of a loop to run in parallel

Execute a standard for loop in which loop iterations execute one at a time

As you can see, a Cilk++ program can be converted to a serial program quite easily: simply replace cilk_for with for, and delete the cilk_spawn and cilk_sync keywords. The Cilk++ development system provides a compile-time switch to build the "serialization" of a Cilk++ program. Using Cilk++ for Linux, the option [-fcilk-stub] is used with cilk++ to "stub out" Cilk features. Using Cilk++ for Windows, run cilkpp with the [/cilkp cpp] option.

Okay, I hear you saying, "Fine, your fancy parallel program can devolve into a serial program. So what? If I wanted the serial version, I would have just used C++ in the first place!" Glad you raised the issue! A parallel programming model with serial semantics maintains four real advantages over program models without serial semantics.

  1. The equivalent serial C++ program can easily be debugged and analyzed using existing development tools. This is an ideal way to debug a parallel program without worrying about parallel interaction, and without needing to consider multiple stacks and program counters. Also, as vanilla C++, all symbol and debug information is available to standard system and third-party tools. Tools that are easy to run on the serialization but might require custom versions to operate on the parallel program include performance profilers, code coverage tools, security analysis tools, memory leak checkers, and any other tools that work with C++ programs.
  2. The serial semantics of Cilk++ programs allows Cilk Arts to build better tools that analyze the performance and correctness of Cilk++ programs, and these tools offer strong guarantees about the program. For example, our Cilkscreen race detector runs the test program in a serialized mode on a single processor. With information about the parallel structure of the program (i.e., where the spawns and syncs occur), Cilkscreen can analyze all possible schedules of the program that might execute on any number of processors. In other words, Cilkscreen can detect whether race conditions exist that could manifest under any possible execution of the program. The serial semantics of Cilk++ permits Cilkscreen to analyze the parallel program while running in a single worker (the Cilk++ name for a thread of execution) on a single processor.
  3. The serial semantics of Cilk++ provides a real advantage to program testing and quality assurance. Cilk++ is designed to ensure that parallel programs written in Cilk++ can be deterministic, reliably producing results identical to the serialization. This guarantee makes it easy to leverage traditional testing tools that assume that multiple runs of a program always return the same result if given the same input. The serial semantics of Cilk++ eliminates the need to consider timing and scheduling considerations when evaluating program correctness, and it makes it simple to compare the output of multiple program runs for consistency. Repeatability is one of the key properties that make Cilk++ programs testable and reliable.
  4. Serial semantics provides high performance while allowing you to design your program without specifying the number of processors on which the program will execute. The Cilk++ runtime scheduler takes care of efficiently load balancing your program across however many processors are available. When a portion of your parallel computation executes on a single processor, Cilk++ can execute it just like ordinary C++ code, taking full advantage of all the compiler optimizations and runtime efficiencies that a good C++ system offers. By starting from good single-core performance, Cilk++ ensures that a program with sufficient parallelism gets good speedup whether it is run on a large number of processors or just a few.

What's the Catch?

Perhaps this all sounds too good to be true - and maybe now you're thinking, "Okay, so what‘s the catch?" All right, there are a few things I've glossed over. For one, since there are some kinds of parallelism that actually don't have serial semantics, you can't write programs using these parallelization strategies in Cilk++. For example, Cilk++ doesn't support producer/consumer, software pipelining, or message passing. Nevertheless, although we're giving something up to have to serial semantics, our experience is that less is more. The four advantages outlined above readily make up for any loss in generality.

There's a second catch, however. Note that in the third point above, I said that Cilk++ programs can be deterministic, not that they are deterministic. How so? Well, the output of a program with a determinacy race (see "Are Determinacy Races Lurking in Your Multicore Application") can depend on how the program is scheduled on multiple processors, and the race condition that leads to this nondeterministic behavior is usually a bug. In most programming environments, you'd be stuck at this point trying to find the race bug by laborious program inspection. In contrast, in the Cilk++ programming environment, Cilkscreen can help you find and eliminate these races, giving you your determinism back. Thus, this "catch," though problematic, can be dealt with effectively precisely because of serial semantics: in effect, Cilkscreen uses the serialization as a benchmark for correctness of a parallel execution.

Sometimes, however, a Cilk++ program is intentionally nondeterministic. For example, some search programs use a concurrent hash table to "remember" the results of intermediate subsearches so that the subsearch doesn't have to be reexecuted if it reoccurs. In this context, each slot of the hash table may contain a mutual-exclusion lock to allow parallel branches of a computation to safely access and update the items stored in the slot. Cilkscreen correctly ensures that races are not declared on operations protected by the same lock.

In practice, however, we have found that most Cilk++ programs require few, if any, locks. In particular, Cilk++'s parallel control constructs obviate the need for the locks that other parallel programming models require for interthread communication and synchronization. Moreover, for many common situations where locking would seem to be necessary to avoid a race, Cilk++ provides hyperobjects, a novel data structure that can resolve races on global variables without sacrificing performance or determinism. By avoiding locks and using hyperobjects, Cilk++ sidesteps many performance anomalies caused by locks, such as lock contention, which can slow down parallel programs significantly, and deadlock, which may cause your application to freeze midexecution.

The Bottom Line

To sum up, Cilk++'s serial semantics provide three key benefits - performance, reliability, and productivity:

  • Scalable performance comes from a model that doesn't rely on the programmer knowing the number of available processors in advance. Developers can use traditional serial performance tuning tools to find hotspots and improve both serial and parallel performance.
  • Reliability is enhanced through the testing advantages of repeatability and determinism, as well as the ability to build an efficient and provably correct race detector that can compare the parallel semantics of the program with the serial semantics provided by the program's serialization.
  • Finally, programmers can be more productive when they use familiar paradigms for new development, and of course, legacy serial code can be converted into Cilk++ easily, because it does not need to be rewritten to adopt a completely different programming model such as data-parallel or message-passing styles.

7 Comments Click here to read/write comments

All Posts