In an earlier post, I made the case for using a concurrency platform rather than building one yourself from a thread library such as Pthreads or WinAPI threads. My colleague at Cilk Arts, Ilya Mirman, outlined the pros and cons of thread pools, arguably the simplest concurrency platform. This time I'll give an overview of OpenMP, a popular open-source concurrency platform.
OpenMP (Open Multi-Processing) supports multithreaded programming through Fortran and C/C++ language pragmas (compiler directives). OpenMP compilers are provided by several companies, including Intel, Microsoft, Sun Microsystems, IBM, Hewlett-Packard, Portland Group, and Absoft, and OpenMP is also supported by the GNU gcc compiler. By inserting pragmas into the code, the programmer identifies the sections of code that are intended to run in parallel.
One of OpenMP's strengths is parallelizing loops of the kind found in many numerical applications. For example, consider the following C++ OpenMP code snippet, which sums the corresponding elements of two arrays:
    #pragma omp parallel for
    for (i=0; i<n; ++i) {
        c[i] = a[i] + b[i];
    }
The pragma indicates to the compiler that the iterations of the loop that follows can run in parallel. The loop specification must follow one of a restricted set of patterns in order to be parallelized, and OpenMP does not attempt to determine whether there are dependencies between loop iterations. If there are, the code has a race. An advantage, in principle, of the pragma strategy is that the code can run as ordinary serial code if the pragmas are ignored. Unfortunately, some of the OpenMP directives for managing memory consistency and local copies of variables affect the semantics of the serial code, compromising this desirable property unless the code avoids these pragmas.
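To make the dependency point concrete, here is a minimal sketch in C++. The function names `prefix_sum_serial` and `dot_with_reduction` are hypothetical, and the `reduction` clause is a standard OpenMP feature that the text above does not discuss. The first loop reads a value written by the previous iteration, so adding the pragma to it would create a race. The second accumulates into a single variable, which is safe here only because the `reduction` clause gives each thread its own partial sum and combines them at the end.

```cpp
#include <cassert>
#include <vector>

// Loop-carried dependency: iteration i reads out[i - 1], which iteration
// i - 1 writes. OpenMP would accept "#pragma omp parallel for" on this loop
// without complaint, but the result would be a race, so it stays serial.
std::vector<long long> prefix_sum_serial(const std::vector<int>& in) {
    std::vector<long long> out(in.begin(), in.end());
    const int n = static_cast<int>(out.size());
    for (int i = 1; i < n; ++i) {
        out[i] += out[i - 1];
    }
    return out;
}

// Every iteration updates total, which would also be a race with a plain
// "parallel for". The reduction clause (an addition for this sketch) makes
// total private per thread and sums the partial results afterwards.
long long dot_with_reduction(const std::vector<int>& a,
                             const std::vector<int>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());  // signed index for portability
    long long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<long long>(a[i]) * b[i];
    }
    return total;
}
```

If the compiler is run without OpenMP support, the pragma is ignored and both functions execute as ordinary serial code with the same result.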
OpenMP schedules the loop iterations using a strategy called work sharing. In this model, parallel work is broken into a collection of chunks which are automatically farmed out to the various processors when the work is generated. OpenMP allows the programmer to specify a variety of strategies for how the work should be distributed. Since work sharing induces communication and synchronization overhead whenever a parallel loop is encountered, the loop must contain many iterations in order to be worth parallelizing.
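The distribution strategies mentioned above are selected with the `schedule` clause. The following is a minimal sketch using the standard `static` and `dynamic` schedule kinds. The function names and the chunk size of 64 are illustrative assumptions, not recommendations.

```cpp
#include <cassert>
#include <vector>

// Static schedule: the iterations are divided among the threads up front in
// roughly equal contiguous blocks. Low overhead; suits uniform iterations.
void add_static(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    const int n = static_cast<int>(a.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Dynamic schedule: each thread takes a chunk of 64 iterations (an arbitrary
// size for this sketch) whenever it becomes idle. More scheduling overhead,
// but better load balance when iterations take uneven time.
void add_dynamic(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    const int n = static_cast<int>(a.size());
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

For a loop body this cheap, the static schedule is usually the natural choice; the dynamic variant is shown only to illustrate the syntax.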
Although OpenMP was designed primarily to support a single level of loop parallelization, alternating between serial sections and parallel sections, as illustrated on the right, some implementations of OpenMP support nested parallelism. The work-sharing scheduling strategy is often not up to the task, however, because it is sometimes difficult to determine at the start of a nested loop how to allocate thread resources. In particular, when nested parallelism is turned on, it is common for OpenMP applications to "blow out" memory at runtime because of the inordinate space demands. The latest-generation OpenMP compilers have started to address this problem.
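For reference, nested parallelism has the shape sketched below: a parallel loop whose body contains another parallel loop. The function name `add_matrices_nested` and the row-major storage layout are assumptions made for illustration. In many implementations the inner region runs on a single thread unless nesting is explicitly enabled (for example, through the `OMP_NESTED` environment variable in older versions of the specification).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds two rows x cols matrices stored in row-major order.
void add_matrices_nested(const std::vector<double>& a,
                         const std::vector<double>& b,
                         std::vector<double>& c, int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    const std::size_t count = static_cast<std::size_t>(rows) * cols;
    assert(a.size() == count && b.size() == count && c.size() == count);

    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        // Inner parallel region: with nesting enabled, each outer thread may
        // start its own team of threads here; otherwise this loop is
        // typically executed by the one thread that encountered it.
        #pragma omp parallel for
        for (int j = 0; j < cols; ++j) {
            const std::size_t k = static_cast<std::size_t>(i) * cols + j;
            c[k] = a[k] + b[k];
        }
    }
}
```

Each element of `c` is written by exactly one iteration, so the result is the same whether the loops run nested, with a single level of parallelism, or serially.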
In summary, if your code looks like a sequence of parallelizable Fortran-style loops, OpenMP will likely give good speedups. If your control structures are more involved, particularly if they involve nested parallelism, you may find that OpenMP isn't quite up to the job.
For more information: