Friday, January 14, 2011

Multi-threaded compression

For a few years now, most PC systems sold have featured multi-core processors. As a logical consequence, there is increasing pressure from users to create applications capable of using all these available cores, since sleeping silicon seems a waste.

Obviously, this trend has also reached compression utilities. Now the question is: how does all this work?

There are several methods, and we'll investigate the easiest one to begin with. Simple thinking leads us to split the source file into smaller blocks, and process these blocks in parallel. This is easy to understand and set up, and in practice it works well. Very well indeed, since this generic method scales to any number of cores, at least as many as there are blocks.
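To make the idea concrete, here is a minimal sketch in C using POSIX threads. The compress_block() function is a placeholder standing in for any block compressor (it is not LZ4's or Zhuff's actual API), and a real implementation would use a bounded pool of worker threads rather than one thread per block:

#include <pthread.h>
#include <stdlib.h>

#define BLOCK_SIZE ((size_t)1 << 20)   /* 1 MB blocks */

/* placeholder: compress srcSize bytes from src into dst, return compressed size */
size_t compress_block(const char* src, size_t srcSize, char* dst);

typedef struct {
    const char* src;      /* start of this block inside the source buffer */
    size_t      srcSize;  /* block size (the last block may be shorter)   */
    char*       dst;      /* private output buffer for this block         */
    size_t      cSize;    /* compressed size, filled by the worker        */
} Job;

static void* worker(void* arg)
{
    Job* job = (Job*)arg;
    job->cSize = compress_block(job->src, job->srcSize, job->dst);
    return NULL;
}

/* Cut the input into independent blocks and compress them in parallel.
   Each block is compressed without any knowledge of the others,
   which is exactly where the small ratio loss comes from. */
void compress_parallel(const char* src, size_t srcSize)
{
    size_t nbBlocks = (srcSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    pthread_t* th  = malloc(nbBlocks * sizeof(*th));
    Job*       job = malloc(nbBlocks * sizeof(*job));

    for (size_t i = 0; i < nbBlocks; i++) {
        size_t pos = i * BLOCK_SIZE;
        size_t remaining = srcSize - pos;
        job[i].src     = src + pos;
        job[i].srcSize = (remaining < BLOCK_SIZE) ? remaining : BLOCK_SIZE;
        job[i].dst     = malloc(BLOCK_SIZE + BLOCK_SIZE/255 + 16);  /* generous worst-case room */
        pthread_create(&th[i], NULL, worker, &job[i]);
    }
    for (size_t i = 0; i < nbBlocks; i++)
        pthread_join(th[i], NULL);

    /* ... write the compressed blocks out in order, then free everything ... */
}

Each block gets its own private output buffer, so workers never contend with one another; the only serial step left is writing the compressed blocks out in order.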

But there is a drawback: within a block, it is not possible to use information from another block to improve compression. And this is a real issue, with a direct impact on compression ratio.

Fortunately, for "low-depth" compressors such as LZ4 or Zhuff, this problem is not too severe. Since these compressors use the last 64KB of data to predict the next bytes, the inefficiency is confined to the first 64KB of each block. Which means that, for a big enough block, this inefficiency can be made small.
But how much is "big enough"?

I was asked this very question by Con Kolivas, who is working on his own lrzip compressor (only available for Linux for the time being). Since lrzip is a heavily multi-threaded implementation of 7zip on top of rzip, he wants to dispatch as many processes as possible, while keeping memory usage in check. Selecting too large a block size has its own drawbacks, such as consuming more memory and creating more latency before all cores are loaded. Therefore, we want the block size to be as small as possible, but not so small that the inefficiency gets out of hand.

Here are some results from LZ4 working on enwik8 with various block sizes, expressed as multiples of the search depth:

Block Size    Inefficiency
  2X              2.4%
  4X              1.2%
  8X              0.6%
 16X              0.3%
 32X              0.15%


As can be seen, the inefficiency is halved at each step, which is no accident: the "inefficient" part of the block is only the first 1X, therefore its relative weight is divided by 2 each time the block size is multiplied by 2.
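Put differently, if D is the search depth (64KB here) and the block size is N times D, only the first D bytes of each block work with a reduced window, so the loss shrinks roughly in proportion to 1/N. Taking the 2X measurement above as a baseline:

  inefficiency(NX)  ≈ inefficiency(2X) × 2/N
  inefficiency(16X) ≈ 2.4% × 2/16 = 0.3%   (matching the measured value)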

All in all, enwik8 is probably one of the worst cases (outside of specifically crafted ones), since it relies heavily on finding matches, so it is a good basis for selecting a threshold.

In my opinion, 16X is already quite good, with a low enough inefficiency ratio.
16X of 64KB means a 1MB block size. In a word, manageable.

Monday, January 10, 2011

Improving I/O : Zhuff v0.5

There is a subtle difference between Windows Seven and good old Windows XP when it comes to reading data from a disk: the prefetcher has matured a lot.

To put it simply, the prefetcher is in charge of loading data into memory before it is even requested by the program. Sounds like magic? Well, not really: when you watch a movie, for example, it's pretty obvious that the file will be read sequentially. So after the program has requested data from position 1M to position 2M, it's quite logical to expect the next read operation to cover positions 2M to 3M.

By prefetching HDD data into memory, the OS makes it (almost) instantly available to the program, smoothing video playback, or whatever the program wants to do with it.
Obviously, the simplest form of prefetching would be to load the entire file into memory; in this case, we call it a "cache". But this is not really practical: a movie is several GB in size, for example. So a more modest version is to simply load the next few KB of data, just in case they are requested by the program later.
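As a small illustration, this is roughly the access pattern the prefetcher is built to recognize; process() is just a placeholder for whatever the program does with the data:

#include <stdio.h>

#define CHUNK_SIZE (1 << 20)   /* read 1 MB at a time */

void process(const char* data, size_t size);   /* placeholder: compression, playback, etc. */

/* Plain sequential chunked reads: after a few iterations, the OS notices
   the pattern and starts loading chunk N+1 from the disk while the program
   is still busy with chunk N. */
void process_file(FILE* in)
{
    static char buffer[CHUNK_SIZE];
    size_t readSize;
    while ((readSize = fread(buffer, 1, CHUNK_SIZE, in)) > 0)
        process(buffer, readSize);
}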

In practice, this works relatively well. There are nonetheless some subtle differences between systems, one of which I just learned about this week while studying XP's behavior.

I was puzzled by the fact that the new v0.4 version of Zhuff was sometimes reported to be slower on some systems running XP, while the underlying compression algorithm was proven to be much faster. The only other identified change was a move from 512KB chunks to 1MB chunks.
This proved enough to overload the XP prefetcher, while the very same change produced no effect on my Windows Seven test system. Hence my surprise.

I worked my way through new I/O code, which chunks the data into smaller parts, down to 256KB. Below that point, there is a small compression penalty. Nothing too large though...

It required a bit more work than it sounds, decoupling the buffers from the read/write operations. Anyway, the results were better than expected. To give an example, using a RAM disk drive under Windows XP, one of my test files went down from 5.20s (v0.4) to 4.50s (v0.5). Quite a large improvement for just an I/O change.
By contrast, Windows Seven did not show any change: timing remained steady at 4.75s, whatever the chunk size.
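For the curious, here is one plausible sketch (in C) of what such a chunked I/O loop can look like: each 256KB chunk is read, compressed, and written before the next one, a size the XP prefetcher handles comfortably. The chunk size and the compress_and_write() call are illustrative only, not Zhuff's actual code or buffer management:

#include <stdio.h>

#define CHUNK_SIZE (256 << 10)   /* process the input 256 KB at a time */

/* illustrative placeholder: compress one chunk and write the result out */
void compress_and_write(const char* chunk, size_t size, FILE* out);

/* Read, compress and write in small chunks, keeping disk requests modest
   and sequential so the OS prefetcher can stay ahead of the program. */
void compress_file(FILE* in, FILE* out)
{
    static char chunk[CHUNK_SIZE];
    size_t readSize;
    while ((readSize = fread(chunk, 1, CHUNK_SIZE, in)) > 0)
        compress_and_write(chunk, readSize, out);
}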

Finally, you can get your hands on the result, a new version of Zhuff, which you can download here.
Improvements will be mostly visible with a very fast storage device, such as an SSD or a RAM drive.