Saturday, February 19, 2011

Improved sampling for better compression

When building a classic compression algorithm, the usual approach is to use past data to populate the predictor for future data. This works pretty well, and by giving recent data a higher probability of representing future data, it also provides an automatic "adaptation" property, which is highly desirable.

Typically, most compressors (zip, 7zip, rar, etc.) use the entire search window to build a dictionary of references. For best compression, all elements in the window must be registered and sorted.
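To make this classic scheme concrete, here is a rough sketch in C of full registration with a hash table and chains; all names and sizes are illustrative, not taken from any of the compressors above.

    /* Sketch of the classic full-registration scheme: every position
       of the search window enters the match finder.  All names and
       sizes are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define HASH_LOG    16
    #define HASH_SIZE   (1 << HASH_LOG)
    #define WINDOW_SIZE (1 << 16)

    static uint32_t hashTable[HASH_SIZE];     /* head of each hash chain */
    static uint32_t chainTable[WINDOW_SIZE];  /* previous position with same hash */

    static uint32_t hash4(const uint8_t* p)
    {
        uint32_t v;
        memcpy(&v, p, 4);                            /* read 4 bytes */
        return (v * 2654435761u) >> (32 - HASH_LOG); /* multiplicative hash */
    }

    /* classic approach: called once for every single position */
    static void insertPosition(const uint8_t* base, uint32_t pos)
    {
        uint32_t h = hash4(base + pos);
        chainTable[pos & (WINDOW_SIZE - 1)] = hashTable[h];
        hashTable[h] = pos;
    }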

Now, all this costs a lot of memory and CPU cycles.

For better speed and memory savings, the dictionary can be made significantly smaller. This is one of the bases of LZ4, Zhuff and Etincelle results. But how to select which data should be part of the dictionary, and which should not?

There are several considerations at stake. First, about speed.
The more elements in the dictionary, the slower the algorithm. So we want as few as possible, to limit the number of updates and sort operations. Basically, each time a reference in the table is erased by a newer one without ever being used, it was a useless update.
Updating only the most probable references makes the best use of CPU cycles.
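As a contrast with the full-registration sketch above, here is what a sampled update could look like; the single-slot table and the "likely" predicate are illustrative assumptions, standing for whatever selection rule ends up being chosen.

    /* Sampled update, sketched (reusing hash4() and HASH_SIZE above):
       a single-slot table, no chains.  Only positions judged "likely"
       are registered; "likely" stands for whatever selection rule is
       chosen, it is a placeholder, not an actual API. */
    static uint32_t dictTable[HASH_SIZE];  /* one reference per slot */

    static void insertSampled(const uint8_t* base, uint32_t pos, int likely)
    {
        if (!likely) return;                 /* skip improbable positions: no wasted update */
        dictTable[hash4(base + pos)] = pos;  /* the newer reference erases the older one */
    }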

Second, compression.
For a pure LZ model, which uses offsets to identify the streams to copy, any missing position in the search window can only hurt compression, since it means that the "range of references" is not fully populated: with a 64KB window, for instance, offsets cost the same number of bits whether or not every position can actually be referenced, so skipped positions translate into wasted code space and missed matches. Sampling is therefore a tricky business here. We have to measure the effect of missing references on compression ratio, to check whether the speed gains are worth the cost.

For indirect models, things get better. An indirect model references a position through a table, instead of a directly calculable position. The table must be updated with exactly the same rule on both the compression and decoding sides, which obviously means more work and memory on the decoder side. This family includes the LZRW and ROLZ reference models.
This is much better, since by not referencing some positions, we create no penalty from "unused" slots. Indeed, if we intentionally skip "bad" positions in favor of "good" ones, we should even improve compression!
This is what effectively happened with Etincelle when the "incompressible segment detection" algorithm was added: it improved speed and compression ratio together. A good bet.
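Here is a hedged sketch of such an indirect, ROLZ-like scheme; the context choice, table sizes and update rule are assumptions for illustration, not the actual internals of Etincelle or any other compressor. The key point: the transmitted index only ranges over positions that were deliberately inserted, so skipped positions never waste code space.

    /* ROLZ-like indirect referencing, sketched.  The context (previous
       byte), table sizes and update rule are illustrative assumptions. */
    #include <stdint.h>

    #define CONTEXTS 256  /* one small table per context */
    #define SLOTS    16   /* a 4-bit index is transmitted, instead of a full offset */

    static uint32_t refTable[CONTEXTS][SLOTS];
    static unsigned refHead[CONTEXTS];

    /* must run identically on the compression and decoding sides */
    static void updateTable(unsigned context, uint32_t pos)
    {
        refHead[context] = (refHead[context] + 1) % SLOTS;
        refTable[context][refHead[context]] = pos;  /* only selected positions get in */
    }

    /* the transmitted index designates the i-th most recent reference */
    static uint32_t resolveReference(unsigned context, unsigned index)
    {
        return refTable[context][(refHead[context] + SLOTS - index) % SLOTS];
    }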

So, where are we with these early considerations?
Mostly, we know that selecting only the "good" positions to be part of the dictionary will offer an excellent speed improvement. But where are they?

After many runs, I ended up with a ton of log files and statistics, leading to the following general conclusion: positions most likely to become useful references are near the borders of segments. A segment, in this definition, is either a match sequence or a continuous stream of literals. The larger the segment, the more the rule applies.
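A possible translation of this observation into an update policy is sketched below; it reuses insertPosition() from the first sketch, and BORDER_DEPTH is an illustrative tuning parameter, not a measured optimum.

    /* Register only a few positions near the two borders of a segment
       (a match sequence, or a run of literals), skipping its interior.
       BORDER_DEPTH is an illustrative tuning parameter.  Reuses
       insertPosition() from the first sketch; the caller must ensure
       at least 4 readable bytes at each inserted position. */
    #define BORDER_DEPTH 2

    static void insertSegmentBorders(const uint8_t* base,
                                     uint32_t segStart, uint32_t segEnd)
    {
        uint32_t frontEnd  = segStart + BORDER_DEPTH;
        uint32_t backStart = (segEnd > segStart + BORDER_DEPTH) ?
                             segEnd - BORDER_DEPTH : segStart;
        uint32_t pos;

        if (frontEnd > segEnd) frontEnd = segEnd;
        if (backStart < frontEnd) backStart = frontEnd;  /* short segments: no double insert */

        for (pos = segStart; pos < frontEnd; pos++)      /* near the start border */
            insertPosition(base, pos);
        for (pos = backStart; pos < segEnd; pos++)       /* near the end border */
            insertPosition(base, pos);
        /* interior positions are skipped: the larger the segment, the bigger the savings */
    }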

I therefore tried to verify this result with Zhuff, and quickly witnessed some compression ratio improvements. Selecting the right balance between speed and compression ratio is another (complex) story, but at least it provides some new room for improvement.

Looking even further, the idea that some positions are "more likely" than others could lead to attributing a "probability" to references, instead of a simple index. A plain index is just a quick shortcut for a "flat probability" model: selecting one reference out of N equiprobable slots costs log2(N) bits, whereas a slot of probability p would cost -log2(p) bits, less for the likely ones. Now, that would be a deep change compared to the model currently in use within most LZ-based compressors. But let's keep that idea for a later note....