Monday, June 6, 2011

LZ4 - Improved performance

Since Przemyslaw Skibinski provided a benchmark dedicated to comparing fast compressors, i've been interested in a direct comparison between LZ4 and Google's snappy.

Snappy is a very fast compressor, but its source code is quite more complex than LZ4. At first, it may look like this complexity will cost quite some CPU cycles, but this is not matched by the results : snappy proves to be a very speedy competitor.
Therefore, I've been having a closer look into LZ4, looking for some potential wasted cycles, but couldn't find a way to make the code even leaner than it already is. A few tiny % could be grabbed here and there, but the speed difference remained too large to be clearly explained by the algorithm alone.

It turns out that the difference comes mostly from a simple reason : cache policy.
LZ4 was designed with L2 cache limitation in mind. In order to ensure fast compression, all data should be retrieved from the L2 cache, which is relatively fast, at about 10 cycles per access.
Fair enough, but there is even faster : the L1 cache can provide access within 3 cycles or less. It looks like a small difference, but it does add up with each data access. In the end, the total speed difference can reach 20%, exactly what's missing to LZ4 to reach snappy's speed.

And ... that's all it takes. Modifying the performance parameter "HASH_LOG" to keep most data accesses within the L1 cache proved enough to cover the difference.
Therefore, starting from r10 revision, the default performance parameter is 12, for 16KB, which is just the right amount for Intel L1 data cache. AMD users may try 13 for 32KB, since their L1 data cache is twice bigger.
Addendum : note that the above results are correct only for 32 bits systems. 64 bits ones will spend twice more memory.

There is just a little catch : with less memory to fill its tables, LZ4 miss more opportunities to find matches, translating into worsened compression ratio. I've partly mitigated the effect with a more thorough search, but there is still a measurable compression ratio deficit. In any case, if the compression ratio is more important for you, just increase "HASH_LOG" value to were it was before (17, for 512KB). And if ratio is really very important, move away from fast scan, and start implementing a more intensive search routine.

LZ4 was not created with such memory requirements in mind, but it's refreshing to see this algorithm accommodate these new conditions so easily.

Another interesting side effect of using less memory for compression is faster decompression. At first, it seems counter intuitive, since the decompression algorithm is not affected by the change, and does not need any memory at all beyond input and output buffers. But a reason can be found : with less matches, the decompression algorithm will have less "segments" to copy, and its speed is tied to this number : better have a few large copy operations than many small ones.

Well, if you want to give it a try, just download the latest source code version from Google code. Typical speed gain is about 20% for compression and 5% for decompression, while ratio is typically 2% less, although obviously your mileage will vary depending on the file to be compressed.

Update : well, another update (r11), another performance improvement. This time, it targets GCC compiler optimization and benefits greatly to decompression speed.

Update 2 : a compression benchmark comparison has been completed by m^2 on his Atom-based system.