Subscribe by Email

Your email:

Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

Multicore-enabling the Unix/Linux "wc" word count utility

Posted by John Hart on Wed, Oct 08, 2008
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

This utility program is a Cilk++ program providing the Unix/Linux "wc" word count utility functionality. wc-cilk processes individual files (specified on the command line) in parallel, whereas wc processes the files sequentially.

Unlike many Cilk++ examples, this example is not CPU intensive; the word counting involves a simple file scan, treating the file content as a text string.  Nonetheless, wc-cilk performance scales very well with the number of CPUs, since application speed is limited by file access and memory access.

The implementation, described after the Performance Measurement section, also illustrates Cilk++ built-in reducers, but it only requires one Cilk++ keyword (a single cilk_for). The final section suggests some addtional experiments that I haven't tried (I'm CPU bound and only have one CPU).  I'd love to hear from anyone who does try these experiments.

PERFORMANCE MEASUREMENTS

Here are some performance figures processing multiple 64 MB files with different processor counts. In all cases, the file contents are already mapped to memory, so there is no significant I/O.

 


Performance Summary:

1) Cilk++ and parallelism can provide performance gains, even with applications that are not CPU-bound. This application is limited by memory bandwith.
2) Performance scales well with the CPU count. It is nearly linear. However, using more workers that the CPU count is counter-productive.
3) The test are run repeatedly on the same files, so the files (total 1 GB) are in memory.
4) The processing requires scanning long strings in memory. The cache is not helpful.

IMPLEMENTATION NOTES

1) The implementation uses memory mapped I/O, so there is no direect I/O. This generally improves sequential I/O performance. The system-independent memory mapping functions are implemented in cilk_util.h
(The listing is at the end of this blog).

a) The Windows code uses the Windows API functions CreateFile(), CreateFileMapping(), MapViewOfFile(), GetFileSize(), CloseHandle(), and UnmapViewOfFile().
b) The GCC code uses the Posix functions open(), stat(), mmap(), close(), and munmap().

2) The implementation requires three reducers but only one Cilk++ keyword:

a) Three reducer_opadd objects to sum the character, word, and line counts. Alternatively, you could remove the reducers and perform the sums in a separate loop after the cilk_for loop.
b) A reducer_ostream object so that the main loop can output the results for each file, in order. Alternatively, you could remove this reducer and output the results after the cilk_for loop.

3) The word count function, executed by each strand, is a custom implementation. The results are not guaranteed to match wc exactly. Use at your own risk. Is it possible to improve the performance, or is the code about as fast as you can make it (please let me know if you do speed it up signficantly)?

4) This example and its implementation is adapted from a Windows thread version in the writer's book "Windows System Programming" (Addison-Wesley Microsoft Technology Series, 2004).  The only significant change was that thread management (creation, synchronization, etc.) is handled by a single cilk_for statement, greatly simplifying the code.

Click to see the wc-cilk.cilk source code     Click to see the cilk_util.h header file

ADDITIONAL PERFORMANCE EXPERIMENTS

Here are some experiments I'd like to try when time permits. Notice that the previous experiments were run with the files fully mapped into physical memory. There was only 1GB of file data, and the tests ran on otherwise idle machines with many GB of memory.

1) What happens when the files are not in memory and must be read?  This can be tricky to achieve; you need to be sure memory is flushed after each test run. You can do this by processing a very large file (larger than the
amount of memory).
- Can you achieve any performance gains even with a single disk? Perhaps the internal disk scheduling in a fragmented file system will result in some gains?
- What happens if the files are on different disk drives?
- What happens if some of the files are on one or more remote systems.

2) How many large files can you process with a single wc-cilk command?
- The Windows tests were run on a 32-bit system. I was not able to process more than 24 64MB files (1.5GB) with a single wc-cilk command. The memory mapping failed as there was insufficient virtual memory space.
- The GCC tests were run on a 64-bit system. It should be possible to process a very large number of very large files.

3) What would happen if you replaced the file memory mapping with conventional read/write I/O? Short experiments indicate that performance is about half that of the current implementation.

4) Compare wc-cilk performance with Linux/Cygwin.

AN UPDATE, BASED ON COMMENTS FROM JEFF HALUSKA BELOW:

Jeff's suggestions paid off; I was able to implement nearly everything, except for eliminating the '\n' test when the character is not space (I may try that later when there is time). We now call isspace() once per loop iteration, and there are two if statements.

The total performance improvement is about 30%. Here are some GCC numbers on an 8-core machine processing 16 64MB files.

# Workers    Time (Seconds)    Time before Oct 24   Speed up
optimization
16 1.99 2.73 27%
8 1.70 2.60 35%
4 2.91 4.00 27%
2 5.61 7.90 29%
1 11.10 16.17 31%
    The modified version of the code can be seen here.

Tags: , , ,

COMMENTS

There are a some simple optimizations you've missed in the wcfunc. 
 
Allow char * pIn to persist past the scope of the for loop and then nc can be calculated by pIn - pInFile. This will remove the nc++ in your for loop. You can basically remove the nc variable completely. 
 
You are effectively calling isspace twice on every character. Once to find out if it is a space and another time to find out its not space. Save the result of isspace(c) and instead of assigning ch to the value of c, update it to equal the isspace return value (change the type too). Don't forget to update it when you assign it to be a space also. This will eliminate half your isspace calls. 
 
Another small speed gain would be to move the newline check up so it won't check it when it knows it is a space (I didn't put in my caching of the isspace for this example): 
 
Example: 
 
else if (!isspace(c)) { 
if (isspace(ch)) { 
nw++; 

 
if (c == '\n') { 
nl++; 
c = ' '; // Note this isn't ch; 


ch = c; 
 
Those are pretty simple and I don't think complicate the code at all. Would love to see if they made a significant speed increase.

posted @ Tuesday, October 21, 2008 10:41 AM by Jeff Haluska


Hey, I made an error in my understanding of isspace(). I thought a carriage return would return false, not true. Anyways, the same concept still applies but the above code needs to be written by adding an additional else if statement so the newline check only runs when isspace(c) passes.

posted @ Tuesday, October 21, 2008 10:49 AM by Jeff Haluska


Great idea. Thanks, Jeff. 
 
 
 
I implemented your suggestions, but it was even easier since I just used the file size for wc; there was no need to compute wc 
 
 
 
Here are some quick results; the improvement is definitely worthwhile. I'll run addtional tests, possibly with more code modifications, later and post the results. I'll also post the new code. 
 
 
 
The performance improvements ranged from 8% to 29%. 
 
 
 
1) Process 2 64MB files 
 
1 worker: OLD: 3.48s 
 
NEW: 3.01 
 
 
 
2 workers: OLD: 2.53s 
 
NEW: 1.79s 
 
 
 
 
 
1) Process 8 64MB files 
 
1 worker: OLD: 17.08 
 
NEW: 14.43 
 
 
 
2 workers: OLD: 9.73 
 
NEW: 8.92 
 
 
 

posted @ Tuesday, October 21, 2008 1:25 PM by John Hart


Good to hear it was worthwhile. I was afraid the memory would be a bottleneck and the speed improvements would be in the ms range. 
 
One other small modification would put the check for '\0' only when isspace(c) fails. You could eliminate this check if you use the file size, but the original code terminated when it hit the first '\0'. 
 
After thinking about it a little more, you should just try replacing the if statements and the isspace() with a switch statement. If the compiler implements it with a jump table it may be even faster and easier still.

posted @ Tuesday, October 21, 2008 1:57 PM by Jeff Haluska


You're right, again, Jeff. I removed the '\0' test since I'm using file length, and the test really was not appropriate, anyhow. 
 
 
 
There is addtional improvement, but we may be reaching the point of diminishing returns. The "New" times above, in order, changed as follows: 
 
NEW: 3.01 --> 2.68 
 
NEW: 1.79 --> 1.51  
 
NEW: 14.43 --> 13.36 
 
NEW: 8.92 --> 8.63 
 
 
 
Tomorrow, I'll try this on 8 and 16-core machines and publish the results here. 
 
 
 
Now, if someone has some good "grep" code, I can put that into this program and publish the results. The parallel structure is exactly the same; just replace the wcfunc() function. In fact, a number of Linux/UNIX utilities could be parallelized in this way. 
 

posted @ Tuesday, October 21, 2008 3:04 PM by John Hart


Jeff's suggestions are implemented. See the "AN UPDATE, BASED ON COMMENTS FROM JEFF HALUSKA BELOW" section, which has performance data and speed up (about 30%).

posted @ Friday, October 24, 2008 12:10 PM by John Hart


Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Receive email when someone replies.