Along with our upcoming release of Cilk++ v.1.1 we are including a new tool to help you visualize application performance: Cilkview. Cilkview runs an application binary and generates performance data on sections you specify. It combines this data with performance estimates generated by the work/span calculator and produces a graph of the results.
Consider the following parallel algorithm for calculating fibonacci numbers:
The Directed Acyclic Graph
(DAG) that represents the call tree fib(35) is a binary tree of parallel nodes, so there is parallelism aplenty.
Furthermore, a look at the function reveals that it is basically CPU-bound. The only memory accesses are into the frame and they have the highest possible locality. Naturally, this is a terrible way to calculate fibonacci numbers, but it is a simple example of the class of algorithms that are easily parallelized. To make Cilkview predict speedup, insert the following code near the call site:
start() and stop() indicate the interesting sections of code to Cilkview. dump() will associate the collected data with a label that Cilkview will use to generate a graph. Compile your program and run Cilkview to generate a graph:
% cilkview ./fib

Cilkview runs your binary on a single core with instrumentation that calculates the work and span of the sections of code you specify. In this case, Cilkview made the same prediction that we did. The steeper line indicates perfect linear speedup up until the theoretical parallelism of the algorithm. The theoretical parallelism does not appear on the graph because it is too large. The more gently sloped line indicates Cilkview's prediction of more realistic performance, burdened speedup, based on the overhead of the Cilk++ runtime. Since the two lines are so close (nearly indistinguishable in this plot), Cilkview predicts that actual trials of this application should scale with nearly perfect linearity.
To test this, Cilkview can also run the trials themselves by specifying the "-trials all" option at the command line. No additional code is necessary:
% cilkview -trials all ./fib

The red stars indicate trial data. Cilkview by default runs one trial for each of 1..n Cilk workers where n is the number of cores. The plot, above, was generated on a computer with 8 cores. As predicted, the trial performances scaled with near-perfect linearity. Slight variations can be attributed to the limitation that only one trial was performed for each configuration.
But what about more common algorithms such as sorting? The following is a simple algorithm for parallel quicksort:
At first glance, this algorithm looks like it has plenty of parallelism, too. Even though there are lots of memory accesses the quicksort algorithm has good cache properties. But consider that the DAG would have a substantial sequential node at the root for the first partition(). Likewise, after the recursive call, each of the first parallel children have to run partition() on their portions of the array before they can do anything parallel. In fact, most of the work in quicksort is done in the partition() algorithm -- serially! Cilkview's analysis bears this out:

The plot represents qsort run on an array 10,000,000 integers. In this graph you can see the that the ideal parallelism (the horizontal line) is small enough to appear on the graph (for this or any reasonable size of input). Plotting the trials gives data that is consistent with this:

The trials fit well within the predicted margin.
Now consider the following code for parallel memcpy:
On the surface, this has lots of parallelism. All of the work is in the cilk_for, and the automatic grain size should handle the overhead of the implicit cilk_spawns with respect to the trivial work done at each iteration. Conceptually, memcpy() is similar to fib() in terms of parallelism. The DAG looks similar to that of fib() except the tree is balanced. Unfortunately, the prediction generated from work/span data is totally inconsistent with the actual performance data from the trials (parallel_memcpy() run on an array 400 million bytes long):

Why does the trial data not match with the prediction? Cilkview takes the work and span into account and estimates the overhead of the runtime to arrive at a reasonable estimation of parallelism. But the limitation was not in the amount of parallelism, here. A discrepancy between the prediction and the trial data indicates other factors at work. In this case, the memory bandwidth became saturated and the speedup ceased to scale as the processors began to queue for access to main memory. The computer on which this was run has two memory controllers, so the scaling would have been even worse on a typical desktop machine. Cilkview's parallelism prediction is useful for understanding whether your algorithm has enough parallelism to scale effectively. A graph in which the trial data falls below the lower-bound indicates that the algorithm is running up against another limitation.
A final, happier example is the parallel matrix multiplication algorithm developed by Matteo Frigo and Matthew Steele. This algorithm was intended as an alternative to more intuitive
parallel matrix multiplication algorithms that tend not to use cache
effectively:

Note that the algorithm does not actually get super-linear speedup as it might appear in the 8-worker case. Remember that a single trial was run at each level, so slight variations in runtime can cause visible changes in the graph.
Cilkview will ship with Cilk++ v.1.1 for both Windows and Linux. There is more functionality included with the tool (e.g., you can instrument multiple sections of code, generate .csv files, set the number of cores on which to predict performance, and more). The hope is that Cilkview will help users confirm or deny their intuitions about parallel algorithms. Also, it makes pretty pictures.