We at Cilk Arts are releasing a parallel version of the popular compression
tool bzip2. Our version of bzip2 compresses files several times faster
on multicore processors and shows good scalability on many-core systems.
Compared to the pthread-based pbzip2 it runs faster, works when reading
from a pipe, and requires less source code change. It took little over
a week to develop, most of that time in testing and debugging. Here's typical speed-up data on a 16-core system (2.2GHz AMD Barcelona CPU, four quad-core sockets):
Cilk's bzip2 is useful in its own right and also serves as a technology
demonstration, helping answer the questions:
- Where do you look for parallelism?
- How do you parallelize?
- What tradeoffs can be made to reduce span and increase parallelism?
Download our parallel bzip2!
Cilk's bzip2 is available as 64 bit Linux binary (built on Debian),
32 bit Linux binary, and source. Source diffs run about 40 KB.
Links follow.
Where is the parallelism?
The slow part of bzip2 is sorting. (Sort the N possible rotations
of N bytes of data.) Sorting can be highly parallel. Speed up
sorting and everything speeds up. But that isn't where I added
parallelism. After looking at the sorting code for a while I
decided bzip's sorting was not a good match for Cilk++.
Sorting N N-byte strings in general takes O(N2 lg N) time on
arbitrary data and can be parallelized. Unfortunately (or
fortunately, depending on your point of view) data to be compressed is
usually not random. Compression relies on order in its input. Also,
in bzip2 the strings to be sorted are not independent of each other.
Quicksort (the kind we like to parallelize) does worse than average on
ordered inputs, while bzip2's inherently serial sorting function does
better. Because of this data-dependent behavior it would be too easy
to write a sorting function that was highly parallel yet slower than
the optimized serial version.
I looked elsewhere for parallelism. bzip2 compresses blocks rather
than infinite byte streams. Instead of compressing each block faster
by sorting in parallel I compressed several blocks in parallel.
Output dependencies
Even though we can compress different parts of the file in parallel,
the compressed data must be written in order. A block can be written
immediately only if all earlier blocks have been written. Otherwise
it must be held in memory until earlier blocks are done. The data
structure to implement this behavior in Cilk++ is called a stream
reducer. One of the Cilk++ product header files, reducer_ostream.h,
implements the same concept for C++ ostreams.
The stream reducer for bzip2 has a very simple interface. Other than
constructing an object, the only thing you can do is write data. If
the data can not be written to output immediately, the reducer saves
it in memory. The Cilk++ runtime calls the reduce method to merge the
results of compressing adjacent blocks. Like the write method, the
reduce method either writes the data immediately or saves it for later.
Such a reducer preserves logical program order at the cost
of memory use. Any output that is generated out of order
must be stored in memory until all previous output is written.
The Cilk++ runtime guarantees that the peak number of instances
of a reducer is comparable to the number of processors, but an
instance of a stream reducer can hold an unbounded amount of storage.
Disposing of an instance of a stream reducer only frees the container,
not the data.
In practice we have not found memory use to be a problem in bzip2.
Blocks tend to complete in approximately the order they were started.
There are about as many out of order blocks in memory as there are
processors.
But the worst case is, the entire output except the first block must
be held in memory if the first block of the file takes a long time to
compress. Because we have not observed this to be a problem in
practice our bzip2 makes no attempt to avoid it. The simplest
solution would be to track memory use and sync if a limit is exceeded.
A more general solution is a research topic.
What tradeoffs can be made to reduce span and increase parallelism?
The first working implementation of parallel bzip2 did not have a lot
of parallelism. Testing showed maximum parallel speedup around 4-5
times. Predictions of the Cilk Parallel Performance Analyzer were
similar. This first version worked well enough on desktop systems
with only a few cores but did not scale in the way we are accustomed
to.
The main limit to scaling was too much serial work -- the "span" was
too long.
bzip2 has four steps:
- run length encode the input (this may increase or decrease size)
- divide the encoded input into mostly fixed-size blocks
- compress each block
- concatenate the compressed blocks
The first two steps, combined, are inherently serial. You don't know
what block a byte belongs in unless you have read all previous bytes.
Reading and run length encoding the input are faster than compression,
but not extraordinarily faster.
Fortunately the division into fixed size blocks is not an essential
part of the algorithm. The file format only puts an upper limit on
block size. The decompressor understands odd-sized blocks; it sees
one at the end of almost every file, holding the remainder after
dividing by the block size. We only need to use fixed size blocks
if we need to replicate the original bzip2's output bit-for-bit.
What if we reverse the run length encoding and chunking steps? These
operations do not commute. The run length encoded blocks will be
different sizes, generally smaller in the more parallel version.
Compression will be slightly reduced because longer blocks compress a
little more. But now run length encoding runs in parallel, and with
random access I/O breaking a file into pieces is fast. (This
mechanism doesn't work with pipes.)
Our bzip2 has two parallel modes. bzip2 -p, the default, runs the
"slow" parallel version which produces identical output to the original
with parallelism around five. bzip2 -P runs a "fast" parallel version
which produces different output and has parallelism greater than ten.
We could probably make our bzip2 even more parallel by recursively
dividing the input into chunks. We are currently going against our
our advice to avoid spawn in a serial loop. We have a reason to use
that style. As I wrote above, out of order execution carries a memory
use penalty. The serial spawn version has typical memory use
O(P × block size) -- blocks are compressed
approximately in order and each processor has to keep its block in
memory. The more parallel version would have to store half to most of
the file in memory because a compressed block can not be written to
disk until all prior compressed blocks have been written to disk.
(Note that "top" may report that bzip2 is using most of physical
memory to compress a large file even in our serial spawn version.
This is misleading because bzip2 uses memory mapped I/O for input.
Memory that would be charged to the operating system file cache when
using read/write I/O is charged to the process when using mapped I/O.)
Conclusion
So how fast is it? How well does it compress?
On a 250 MB, relatively incompressible file the Cilk PPA reports:
$ cilk/bin/cilkscreen -w ./bzip2 -P9v < big>/dev/null
(stdin): 1.221:1, 6.550 bits/byte, 18.12% saved, 250654720 in, 205224279 out.
Cilkscreen Parallel Performance Analyzer V1.0.3, Build 7085
1) Parallelism Profile
Work : 263,451,361,865 instructions
Span : 1,886,273,787 instructions
Burdened span : 1,886,429,787 instructions
Parallelism : 139.67
Burdened parallelism : 139.66
Number of spawns/syncs: 349
Average instructions / strand : 251,384,887
Strands along span : 55
Average instructions / strand on span : 34,295,887
Total number of atomic instructions : 12
2) Speedup Estimate
2 processors: 1.90 - 2.00
4 processors: 3.80 - 4.00
8 processors: 7.37 - 8.00
16 processors: 13.53 - 16.00
32 processors: 23.23 - 32.00
That was the highly parallel version. The moderately parallel version
looks like
$ cilk/bin/cilkscreen -w ./bzip2 -p9v < big>/dev/null
(stdin): 1.223:1, 6.543 bits/byte, 18.21% saved, 250654444 in, 204998273 out.
Cilkscreen Parallel Performance Analyzer V1.0.3, Build 7085
1) Parallelism Profile
Work : 286,569,081,860 instructions
Span : 24,176,935,450 instructions
Burdened span : 24,178,585,450 instructions
Parallelism : 11.85
Burdened parallelism : 11.85
Number of spawns/syncs: 277
Average instructions / strand : 344,433,992
Strands along span : 553
Average instructions / strand on span : 43,719,593
Total number of atomic instructions : 12
2) Speedup Estimate
2 processors: 1.75 - 2.00
4 processors: 2.80 - 4.00
8 processors: 3.99 - 8.00
16 processors: 5.08 - 11.85
32 processors: 5.88 - 11.85
Here's a real test -- if I run the highly parallel version on our 16
core server I can see the difference relative to the original bzip2:
$ time ./bzip2 -Pvc big > big1
big: 1.221:1, 6.550 bits/byte, 18.12% saved, 250654720 in, 205224279 out.
real 0m8.489s
user 2m5.836s
sys 0m7.636s
$ time bzip2 -vc big > big1
big: 1.223:1, 6.543 bits/byte, 18.21% saved, 250654720 in, 204998273 out.
real 1m49.648s
user 1m48.459s
sys 0m1.132s
Observed results are near the low end of the predicted range.
This is expected -- the low end of the range is the best estimate,
while the high end is the guaranteed not to exceed parallel speedup.
On my quad core desktop system I see these compression ratios
(higher is better) and run times (lower is better)
compressing 23 MB of LaTeX files and associated PDFs:
| Program | Ratio | Time |
| bzip2 serial | 2.519 | 6.03 |
| bzip2 -p | 2.519 | 2.30 |
| bzip2 -P | 2.505 | 1.67 |