Subscribe by Email

Your email:

Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

A Parallel bzip2

Posted by John Carr on Mon, Apr 27, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

We at Cilk Arts are releasing a parallel version of the popular compression tool bzip2. Our version of bzip2 compresses files several times faster on multicore processors and shows good scalability on many-core systems. Compared to the pthread-based pbzip2 it runs faster, works when reading from a pipe, and requires less source code change. It took little over a week to develop, most of that time in testing and debugging. Here's typical speed-up data on a 16-core system (2.2GHz AMD Barcelona CPU, four quad-core sockets):


Cilk's bzip2 is useful in its own right and also serves as a technology demonstration, helping answer the questions:

  • Where do you look for parallelism?
  • How do you parallelize?
  • What tradeoffs can be made to reduce span and increase parallelism?

Download our parallel bzip2!

Cilk's bzip2 is available as 64 bit Linux binary (built on Debian), 32 bit Linux binary, and source. Source diffs run about 40 KB. Links follow.

DownloadDescription
bzip2+Cilk-x64 Cilk++ parallel bzip2 for 64 bit Linux/x86 systems
bzip2+Cilk-x86 Cilk++ parallel bzip2 for 32 bit Linux/x86 systems
bzip2+Cilk-diff Partial diffs, showing key changes to Cilkify bzip2
bzip2+Cilk-source Complete source code for bzip2+Cilk

Where is the parallelism?

The slow part of bzip2 is sorting. (Sort the N possible rotations of N bytes of data.) Sorting can be highly parallel. Speed up sorting and everything speeds up. But that isn't where I added parallelism. After looking at the sorting code for a while I decided bzip's sorting was not a good match for Cilk++.

Sorting N N-byte strings in general takes O(N2 lg N) time on arbitrary data and can be parallelized. Unfortunately (or fortunately, depending on your point of view) data to be compressed is usually not random. Compression relies on order in its input. Also, in bzip2 the strings to be sorted are not independent of each other. Quicksort (the kind we like to parallelize) does worse than average on ordered inputs, while bzip2's inherently serial sorting function does better. Because of this data-dependent behavior it would be too easy to write a sorting function that was highly parallel yet slower than the optimized serial version.

I looked elsewhere for parallelism. bzip2 compresses blocks rather than infinite byte streams. Instead of compressing each block faster by sorting in parallel I compressed several blocks in parallel.

Output dependencies

Even though we can compress different parts of the file in parallel, the compressed data must be written in order. A block can be written immediately only if all earlier blocks have been written. Otherwise it must be held in memory until earlier blocks are done. The data structure to implement this behavior in Cilk++ is called a stream reducer. One of the Cilk++ product header files, reducer_ostream.h, implements the same concept for C++ ostreams.

The stream reducer for bzip2 has a very simple interface. Other than constructing an object, the only thing you can do is write data. If the data can not be written to output immediately, the reducer saves it in memory. The Cilk++ runtime calls the reduce method to merge the results of compressing adjacent blocks. Like the write method, the reduce method either writes the data immediately or saves it for later.

Such a reducer preserves logical program order at the cost of memory use. Any output that is generated out of order must be stored in memory until all previous output is written. The Cilk++ runtime guarantees that the peak number of instances of a reducer is comparable to the number of processors, but an instance of a stream reducer can hold an unbounded amount of storage. Disposing of an instance of a stream reducer only frees the container, not the data.

In practice we have not found memory use to be a problem in bzip2. Blocks tend to complete in approximately the order they were started. There are about as many out of order blocks in memory as there are processors.

But the worst case is, the entire output except the first block must be held in memory if the first block of the file takes a long time to compress. Because we have not observed this to be a problem in practice our bzip2 makes no attempt to avoid it. The simplest solution would be to track memory use and sync if a limit is exceeded. A more general solution is a research topic.

What tradeoffs can be made to reduce span and increase parallelism?

The first working implementation of parallel bzip2 did not have a lot of parallelism. Testing showed maximum parallel speedup around 4-5 times. Predictions of the Cilk Parallel Performance Analyzer were similar. This first version worked well enough on desktop systems with only a few cores but did not scale in the way we are accustomed to.

The main limit to scaling was too much serial work -- the "span" was too long.

bzip2 has four steps:

  1. run length encode the input (this may increase or decrease size)
  2. divide the encoded input into mostly fixed-size blocks
  3. compress each block
  4. concatenate the compressed blocks

The first two steps, combined, are inherently serial. You don't know what block a byte belongs in unless you have read all previous bytes. Reading and run length encoding the input are faster than compression, but not extraordinarily faster.

Fortunately the division into fixed size blocks is not an essential part of the algorithm. The file format only puts an upper limit on block size. The decompressor understands odd-sized blocks; it sees one at the end of almost every file, holding the remainder after dividing by the block size. We only need to use fixed size blocks if we need to replicate the original bzip2's output bit-for-bit.

What if we reverse the run length encoding and chunking steps? These operations do not commute. The run length encoded blocks will be different sizes, generally smaller in the more parallel version. Compression will be slightly reduced because longer blocks compress a little more. But now run length encoding runs in parallel, and with random access I/O breaking a file into pieces is fast. (This mechanism doesn't work with pipes.)

Our bzip2 has two parallel modes. bzip2 -p, the default, runs the "slow" parallel version which produces identical output to the original with parallelism around five. bzip2 -P runs a "fast" parallel version which produces different output and has parallelism greater than ten.

We could probably make our bzip2 even more parallel by recursively dividing the input into chunks. We are currently going against our our advice to avoid spawn in a serial loop. We have a reason to use that style. As I wrote above, out of order execution carries a memory use penalty. The serial spawn version has typical memory use O(P × block size) -- blocks are compressed approximately in order and each processor has to keep its block in memory. The more parallel version would have to store half to most of the file in memory because a compressed block can not be written to disk until all prior compressed blocks have been written to disk.

(Note that "top" may report that bzip2 is using most of physical memory to compress a large file even in our serial spawn version. This is misleading because bzip2 uses memory mapped I/O for input. Memory that would be charged to the operating system file cache when using read/write I/O is charged to the process when using mapped I/O.)

Conclusion

So how fast is it? How well does it compress?

On a 250 MB, relatively incompressible file the Cilk PPA reports:

$ cilk/bin/cilkscreen  -w ./bzip2 -P9v < big>/dev/null
(stdin): 1.221:1, 6.550 bits/byte, 18.12% saved, 250654720 in, 205224279 out.

Cilkscreen Parallel Performance Analyzer V1.0.3, Build 7085
1) Parallelism Profile
Work : 263,451,361,865 instructions
Span : 1,886,273,787 instructions
Burdened span : 1,886,429,787 instructions
Parallelism : 139.67
Burdened parallelism : 139.66
Number of spawns/syncs: 349
Average instructions / strand : 251,384,887
Strands along span : 55
Average instructions / strand on span : 34,295,887
Total number of atomic instructions : 12

2) Speedup Estimate
2 processors: 1.90 - 2.00
4 processors: 3.80 - 4.00
8 processors: 7.37 - 8.00
16 processors: 13.53 - 16.00
32 processors: 23.23 - 32.00

That was the highly parallel version. The moderately parallel version looks like

$ cilk/bin/cilkscreen  -w ./bzip2 -p9v < big>/dev/null
(stdin): 1.223:1, 6.543 bits/byte, 18.21% saved, 250654444 in, 204998273 out.

Cilkscreen Parallel Performance Analyzer V1.0.3, Build 7085
1) Parallelism Profile
Work : 286,569,081,860 instructions
Span : 24,176,935,450 instructions
Burdened span : 24,178,585,450 instructions
Parallelism : 11.85
Burdened parallelism : 11.85
Number of spawns/syncs: 277
Average instructions / strand : 344,433,992
Strands along span : 553
Average instructions / strand on span : 43,719,593
Total number of atomic instructions : 12

2) Speedup Estimate
2 processors: 1.75 - 2.00
4 processors: 2.80 - 4.00
8 processors: 3.99 - 8.00
16 processors: 5.08 - 11.85
32 processors: 5.88 - 11.85

Here's a real test -- if I run the highly parallel version on our 16 core server I can see the difference relative to the original bzip2:

$ time ./bzip2 -Pvc big > big1
big: 1.221:1, 6.550 bits/byte, 18.12% saved, 250654720 in, 205224279 out.

real 0m8.489s
user 2m5.836s
sys 0m7.636s
$ time bzip2 -vc big > big1
big: 1.223:1, 6.543 bits/byte, 18.21% saved, 250654720 in, 204998273 out.

real 1m49.648s
user 1m48.459s
sys 0m1.132s

Observed results are near the low end of the predicted range. This is expected -- the low end of the range is the best estimate, while the high end is the guaranteed not to exceed parallel speedup.

On my quad core desktop system I see these compression ratios (higher is better) and run times (lower is better) compressing 23 MB of LaTeX files and associated PDFs:

ProgramRatioTime
bzip2 serial2.5196.03
bzip2 -p2.5192.30
bzip2 -P2.5051.67

Tags: ,

COMMENTS

"Quicksort (the kind we like to parallelize) does worse than average on ordered inputs, while bzip2's inherently serial sorting function does better." 
 
Sure it does worse than average but if you can parallelize that over a bunch of cpus why doesn't that still win over running superior algorithm on just a single cpu? Knowing full well that the algorithm chosen and its time complexity can make all the difference in the world I'm wondering if perhaps your use of "worse than average" isn't perhaps a gross understatement? That might account for how the superior yet serial sort would win. But if the parallel algorithm scales linearly there must be some number of cpus at which it would pull ahead, no?

posted @ Monday, April 27, 2009 5:54 PM by Tracy Reed


Compared to the pthread-based pbzip2 it [...] works when reading from a pipe 
 
pbzip2-1.0.5 works when reading from a pipe, 
 
pbzip2 -c - <input >output.bz2

posted @ Monday, April 27, 2009 9:16 PM by lacos


@Tracy Reed 
 
 
 
Tracy, 
 
 
 
If we were targeting thousand core systems I might make a different tradeoff. On typical desktop systems I did not think I could use two or four way parallelism to make a slower general purpose algorithm decisively beat a special purpose algorithm with over a decade of tuning behind it. 
 
 
 
Imagine you're sorting items that are usually already sorted. 
 
Naive quicksort runs in O(N^2) time while a special purpose sorter runs in O(N) time. That's not the exact situation here; it's an example of why I wanted to be careful. 
 
 
 
bzip2's sorting algorithm obviously had some thought put into it. 
 
If it had exactly the same algorithmic complexity as quicksort on realistic inputs it could still be a faster by a large constant factor due to better optimization for its limited purpose. 
 
 
 
I also wanted to keep this simple and debuggable. The sorting code is complicated. While bare-bones quicksort is simple, each time you say "just randomize the input first" or "bin based on the first character before the main sort" you introduce a little extra complexity. You need more and more test cases. 
 
 
 
As it turned out, we got nearly linear speedup, comparable to what we get on quicksort. Without parallelizing the sorting stage the atomic unit of compression work is less than a second long. Your megabyte files don't compress any faster but your gigabyte files do.

posted @ Tuesday, April 28, 2009 5:58 AM by John Carr


What about parallelizing bzip2's decompression? This was being discussed on the <a href="http://realworldtech.com/forums/index.cfm?action=detail&id=98883&threadid=98430&roomid=2">Real World Tech forum recently.

posted @ Thursday, May 07, 2009 12:38 PM by Andrew Shewmaker


Andrew Shewmaker, please check out my website, namely lbzip2. Additionally, please see http://lists.debian.org/debian-mentors/2009/02/msg00135.html

posted @ Thursday, May 07, 2009 2:02 PM by lacos


Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Receive email when someone replies.