This utility program is a Cilk++ program providing the Unix/Linux "wc" word count utility functionality. wc-cilk processes individual files (specified on the command line) in parallel, whereas wc processes the files sequentially.
Unlike many Cilk++ examples, this example is not CPU intensive; the word counting involves a simple file scan, treating the file content as a text string. Nonetheless, wc-cilk performance scales very well with the number of CPUs, since application speed is limited by file access and memory access.
The implementation, described after the Performance Measurement section, also illustrates Cilk++ built-in reducers, but it only requires one Cilk++ keyword (a single cilk_for). The final section suggests some addtional experiments that I haven't tried (I'm CPU bound and only have one CPU). I'd love to hear from anyone who does try these experiments.
PERFORMANCE MEASUREMENTS
Here are some performance figures processing multiple 64 MB files with different processor counts. In all cases, the file contents are already mapped to memory, so there is no significant I/O.

Performance Summary: 1) Cilk++ and parallelism can provide performance gains, even with applications that are not CPU-bound. This application is limited by memory bandwith.
2) Performance scales well with the CPU count. It is nearly linear. However, using more workers that the CPU count is counter-productive.
3) The test are run repeatedly on the same files, so the files (total 1 GB) are in memory.
4) The processing requires scanning long strings in memory. The cache is not helpful.
IMPLEMENTATION NOTES
1) The implementation uses memory mapped I/O, so there is no direect I/O. This generally improves sequential I/O performance. The system-independent memory mapping functions are implemented in cilk_util.h
(The listing is at the end of this blog).
a) The Windows code uses the Windows API functions CreateFile(), CreateFileMapping(), MapViewOfFile(), GetFileSize(), CloseHandle(), and UnmapViewOfFile().
b) The GCC code uses the Posix functions open(), stat(), mmap(), close(), and munmap().
2) The implementation requires three reducers but only one Cilk++ keyword:
a) Three reducer_opadd objects to sum the character, word, and line counts. Alternatively, you could remove the reducers and perform the sums in a separate loop after the cilk_for loop.
b) A reducer_ostream object so that the main loop can output the results for each file, in order. Alternatively, you could remove this reducer and output the results after the cilk_for loop.
3) The word count function, executed by each strand, is a custom implementation. The results are not guaranteed to match wc exactly. Use at your own risk. Is it possible to improve the performance, or is the code about as fast as you can make it (please let me know if you do speed it up signficantly)?
4) This example and its implementation is adapted from a Windows thread version in the writer's book "Windows System Programming" (Addison-Wesley Microsoft Technology Series, 2004). The only significant change was that thread management (creation, synchronization, etc.) is handled by a single cilk_for statement, greatly simplifying the code.
Click to see the wc-cilk.cilk source code
Click to see the cilk_util.h header file
ADDITIONAL PERFORMANCE EXPERIMENTS
Here are some experiments I'd like to try when time permits. Notice that the previous experiments were run with the files fully mapped into physical memory. There was only 1GB of file data, and the tests ran on otherwise idle machines with many GB of memory.
1) What happens when the files are not in memory and must be read? This can be tricky to achieve; you need to be sure memory is flushed after each test run. You can do this by processing a very large file (larger than the
amount of memory).
- Can you achieve any performance gains even with a single disk? Perhaps the internal disk scheduling in a fragmented file system will result in some gains?
- What happens if the files are on different disk drives?
- What happens if some of the files are on one or more remote systems.
2) How many large files can you process with a single wc-cilk command?
- The Windows tests were run on a 32-bit system. I was not able to process more than 24 64MB files (1.5GB) with a single wc-cilk command. The memory mapping failed as there was insufficient virtual memory space.
- The GCC tests were run on a 64-bit system. It should be possible to process a very large number of very large files.
3) What would happen if you replaced the file memory mapping with conventional read/write I/O? Short experiments indicate that performance is about half that of the current implementation.
4) Compare wc-cilk performance with Linux/Cygwin.
AN UPDATE, BASED ON COMMENTS FROM JEFF HALUSKA BELOW:
Jeff's suggestions paid off; I was able to implement nearly everything, except for eliminating the '\n' test when the character is not space (I may try that later when there is time). We now call isspace() once per loop iteration, and there are two if statements.
The total performance improvement is about 30%. Here are some GCC numbers on an 8-core machine processing 16 64MB files.
# Workers Time (Seconds) Time before Oct 24 Speed up
optimization
16 1.99 2.73 27%
8 1.70 2.60 35%
4 2.91 4.00 27%
2 5.61 7.90 29%
1 11.10 16.17 31%
The modified version of the code can be seen here.