The Journey Starts
I joined Cilk Arts three months ago and was truly enthused by the opportunity to work with very smart people developing an important technology, Cilk++, at a company with real potential for success. Adding to the excitement was the fact that I had prior experience developing and maintaining multithreaded applications. In fact, in moments of high self-esteem, I felt that I ranked as a "threads ace." Surely, Cilk++ was just another way of writing multithreaded applications; in short, "old wine in new bottles."
Mid-Journey Status
Three months have passed, I've learned a lot, and my expectations of Cilk Arts have been met.
However, while previous multithreading experience has certainly been helpful, the initial impressions vastly oversimplified the situation. By expanding my point of view about threads and parallelism, I've been able to exploit Cilk++ to accelerate performance in new and existing compute-bound programs simply and elegantly. There's a lot of good new wine in the bottles.
The Benefits
Anyone with threads experience, whether with Windows threads, Pthreads, or some other threading model, can quickly learn to use Cilk++, but some changed thinking will make you more productive more quickly. The benefits of learning Cilk++ and using it to eliminate unnecessary serial execution by parallelizing your single-threaded CPU-intensive applications include:
- Program performance will improve, exploiting application parallelism and the power of multi-core computer systems.
- Program conversion and development are straight-forward and involve the correct use of just three Cilk++ keywords and several additional Cilk++ features, such as reducers. Your parallelized code will be simpler and more elegant than anything you could write using Pthreads or Windows threads.
- Programs will be reliable if you use Cilk++ tools along with sound programming practice.
- The Cilk++ runtime will do all the hard work of scheduling for high performance on the multiple CPU cores.
What Do You Need to Know About Threads? Do You Need to be an Ace?
The more experience you have with multithreaded programming, the better. Even though many Cilk++ users have been successful with little prior threads experience, any experience creating, synchronizing, and managing concurrent tasks will give you a good head start, and your threading knowledge will not be obsolete. In short, you can "make new friends, but keep the old. One is silver, the other is gold."
In my own case, I've worked with threads for about a decade. Four chapters of my book, Windows System Programming, are dedicated to Windows threads, and I've developed and frequently delivered a multi-day Pthreads professional training course. Consulting activities have involved development of large multithreaded systems, and frequently I can help colleagues isolate their bugs involving deadlocks, missed signals (always a risk in Windows), synchronization issues, performance problems, and many other nasty threading problems. Maybe that's not quite "ace level," but I felt ready to tackle Cilk++. Perhaps my experiences can help other thread-aware software engineers get up to speed with Cilk++ quickly and painlessly, with fun and excitement.
What Cilk++ Does Very Well, and What It Doesn't Try to Do
A summary of Cilk++ and parallelization theory and practice is available in our e-Book
How to Survive the Multicore Software Revolution. Here, I'll describe Cilk++ from the point of view of the hypothetical "threads ace."
What Cilk++ Achieves
Cilk++ makes it easy for you to modify existing or new C++ applications to improve performance on systems with multiple CPUs (sometimes known as "shared-memory multiprocessor" or "SMP"). Cilk++ enhances application performance by accelerating serial portions of your code that are inherently parallel.
Cilk++ is easy to use as it uses just three key words which you add to the C++ code to create and synchronize "strands," which are the Cilk++ units of parallel execution (they are very similar to threads, although there are differences). As shown in the e-Book (p. 16), if you were using Pthreads or Windows threads to create parallel computation threads, you'd need to:
- Use complex, ugly and high-overhead system calls such as pthread_create() or Windows' _beginthreadex() (if you don't know already, you don't want to know).
- Manage system resources, such as handles.
- Specify attributes for the threads.
- Synchronize on all the thread handles.
Certainly, you can manage threads this way, and I've done it numerous times, often improving application performance. There are other problems with such a threaded solution, however, and Cilk++ solves these problems transparently, giving much better performance than the do-it-yourself multithreading approach. Most importantly:
- Memory Usage. Cilk++ uses memory efficiently. Suppose, for example, that you wanted to create 1,000 concurrent threads to perform a computational task; this is not uncommon. Each thread would require its own stack (usually 1MB by default), or 1GB total. Cilk++ creates the thousand strands, if necessary, and bounds memory usage as a linear function of the number of CPUs or active "workers." The Cilk++ runtime is also much more intelligent about strand creation than the brute force thread creation.
- Scheduling and Context Switching. Cilk++ work stealing schedules strands to run on available CPUs and switches strands much more quickly than thread creation and context switching and with less chance of page faults and better memory locality. The Implementation of the Cilk-5 Multithreaded Language describes how this works.
- Reliability. Cilk++'s Cilkscreen tool will detect race conditions in your parallel code. What is more, you can use built-in or custom reducers to eliminate races on global variables easily and efficiently.
- Thread Synchronization. A single cilk_sync is all you need to synchronize a collection of strands. A cilk_for statement does not require any explicit syncrhonization.
In summary, the threads ace needs to learn that simplicity is strength and performing acts of heroic or virtuoso thread management is counter-productive. Instead, understand the parallelization inherent to the application and use Cilk++ keywords to express that parallelization and to manage the details.
What Cilk++ Does Not Try to Do
Cilk++ is targeted towards parallelizing CPU-intensive serial code to improve performance and utilize multiple cores; it is not designed to solve all problems for which you might want to use threads. For example, Cilk++ is not an optimal tool if you are building a producer-consumer system or have tightly coupled threads sending each other messages and commands.
When first encountering Cilk++, I immediately noticed what seemed like missing features, such as events, condition variables, thread pools, and thread-local storage. In fact, these features are simply not required; they address problems not in Cilk++'s domain, or they are not necessary since Cilk++ uses different and more powerful techniques.
Expand Your Point of View (and Other Lessons)
Everything you already know and understand about concurrent operation, synchronization, race conditions, deadlocks, and so on, still applies when you use Cilk++. Nonetheless, when first encountering Cilk++, keep a few points in mind:
- Cilk++ is targeted at making your serial applications run faster. It is not appropriate for all multithreaded scenarios, such as transactional workloads, and this limitation is deliberate.
- Cilk++ can manage parallel computation more efficiently and reliably than is possible with a do-it-yourself approach.
- You can use Cilk++ to parallelize independent operations in loop iterations. You can also easily parallelize recursive, divide and conquer (RDC) algorithm implementations. So, if you are like me and shy away from RDC, think again. RDC expands your choices and is very Cilk++ friendly. The Cilk Arts site has sample sorting and matrix multiply programs that illustrate this point.
- Remember everything you learned as a threads ace. In particular, remember the risks and all the ways a program could fail with deadlocks, bad synchronization, and more. Use Cilk++ features to help you avoid these problems.
- This was the hardest for me - There is a well-developed theory of parallelism, with key concepts such as work and span. Cilk++ uses these concepts, and, as a software engineer, it is essential to take the time to think about and use these concepts when designing your system. (The e-Book summarizes the theory.)
There are a few old lessons to remember as well, as using Cilk++ to enhance performance pops old lessons right to the top of the stack:
- A buggy program will still be buggy under Cilk++. Logic flaws won't go away just because you convert to Cilk++. In fact, the logic flaws may be harder to find. Fortunately, Cilk++ lets you create the serial version (serialization) of your program so that you can debug and test as if the program were still written in C++.
- An inefficient program will almost certainly be inefficient in Cilk++. You still need to use good algorithms, give attention to memory access patterns to optimize cache usage, and even perform such mundane tasks as hand optimizing your code (optimizing compilers still miss things).
- Think simplicity. If your program is getting complex, maybe you are doing something wrong.
- Test and analyze your code.
- Be humble (bugs happen), and review the classic multithreaded maxims; I like the last chapter in Programming with POSIX® Threads, where we learn, for example, that you should "never bet on a thread race."