Cilk++ Runtime System

The Cilk++ Runtime System (RTS) enables a Cilk++ program to dynamically and automatically exploit an arbitrary number of available processor cores.  With sufficient parallelism and memory bandwidth, the RTS delivers near-perfect linear speed-up as the number of cores increases. On a single core, typical programs run with negligible overhead (less than 2%). The run-time library sits in the customer's application - above the operating system. It is not dependent on operating system specifics, and is not tightly integrated with the operating system.

The RTS guarantees bounds on stack space because of the way the Cilk++ scheduler works. On P processors, Cilk++ consumes at most P times the stack space of a single-processor execution, no matter how much parallelism there is in the code. This guarantee contrasts with that of a naive scheduler, which creates N copies of the work before executing the code even once.

Application examples:

Multicore-enabling the N-Queens Problem Using Cilk++

A Tale of Two Algorithms Multithreading Matrix Multiplication with Cilk++

Limitations of Other Schedulers

Most schedulers use a central scheduler which coordinates with their worker threads. The scheduler parcels out chunks of work and communicates with each worker to determine which processors are busy and which are available for new tasks. These approaches have limitations:

  1. Scheduling work in this fashion requires that the scheduler know, a-priori, how many processors are available to do work. This introduces a dependence on the number of processors.
  2. They introduce communication overhead between the central coordinating scheduler and the processor/workers. This overhead increases dramatically with the number of processors, making it difficult for these solutions to scale.

A Novel Work-Stealing Algorithm

The Cilk++ RTS takes a very different approach. First, there is no constraint on knowing the available resources: the "Cilk-ified" program runs on the resources allocated to it by the operating system - resources that might change dynamically.

When a processor executing a thread runs out of work, it "steals" work from the top of the work queue maintained by another processor. There is an algorithm for deciding which work to steal: the processor steals at random. As a result, almost all of the work is done by the otherwise idle processor, and the only communication required is if the processor happens to steal a piece of work that is being worked on. As a result, Cilk++ imposes very low overhead. Furthermore, if an application has sufficient parallelism, stealing is very infrequent - which is how the RTS achieves linear scaling.

Dynamic Load Balancing

The Cilk++ RTS enables the dynamic load-balancing capability demanded by real-world application environments. If a processor gets descheduled by the operating system (for example a gamer turns on his MP3 player), the work of that processor will automatically be stolen away by the RTS - automatic adaptation. Similarly, if processors become available to the Cilk++ job, they will automatically steal work and make themselves busy.

A particularly challenging aspect of the scheduler is figuring out how to move the state of a task between processors, for complex scenarios such as when that task might be buried several layers down in a nested loop. This is hard to get right, and that's why the Cilk++ RTS dramatically outperforms those of alternatives who have borrowed the MIT-Cilk concept of a work-stealing scheduler.

Cilk++ Components ("What's in the Box?")

In the following clip, Steve Lewin-Berlin, VP of Engineering, outlines "what's in the box?" when you get Cilk++ and start parallel programming, and what you ship to your customers. (Components include the Cilk++ compiler, runtime system, development tools such as the race detector and performance profiler, examples & docs.)