Wednesday, December 15, 2010

RZM

Once upon a time...
There was one of those incredibly talented programmer, which could produce in short time what could take years to amateurs as myself. During its short presence in the compression scene, he produced several excellent programs, featuring compression / speed ratio beyond anything that was available back then.

Among these little jewels, was RZM, and LZ-class compressor. Right from the first release, it became a serious competitor to 7-zip, the best reference LZ implementation so far. This was achieved without even using any external filter, nor any strange trick, just plain LZ.

Or was it ? Christian Martelock, the author, described its algorithm as being a "dull ROLZ order-1" compression engine, with 64K positions per context.
This description was close enough from Etincelle to deserve a try.

Etincelle is also an order-1 ROLZ; however, it gets its speed thanks to a quick scan into dictionary. This is in contrast with RZM, which makes a full search.

It was just a matter of little time to implement full search into Etincelle, and get some first results.
RZM is approximately 30% more powerful than Etincelle, therefore, i was wondering if such kind of result was readily reachable.

It was not, obvously. With full search, Etincelle features a 15% improvement, same as LZ4.
So where the other 15% could come from ?

Well, for a start, Christian clearly states that he uses Optimal Parsing to get this result. From experience, Optimal Parsing provides a fair boost to compression ratio, but not a huge one. 3-5% is a typical improvement to be expected.

Another, more obvious, modification would be to move away from Huffman entropy coder, and exchange it for a Range coder. With its improved precision, this would help quite a bit, especially on literal occurrence. I suspect an improvement in the range of 2-3%,

Now, that still leave us with something like 8% to find out. And this part is not so clearly explained.

On reading the thread on RZM at encode.ru, a repeated pattern became apparent : all encoded tokens are context-evaluated.
It's a step above my simple block-based implementations. While i'm simply separating data depending on their type, ensuring homogeneous probabilities, context-evaluation goes a few steps further : it separates probabilities based on one or several selected elements or statistics, which are considered "correlated" to the probability being calculated.
It does not need to be completely deterministic (in which case, we would get a perfect predictor). A loosely coupled correlation is good enough to squeeze the probability table in a way which makes it more accurate, therefore producing better compression.

Since we benefit more from better correlation, the choice of context is pretty important.
As an example, for literal, the previous character is usually a good indication. Hence it makes sense to use it as a context. That's called Order-1 entropy coding.
Obviously, many more measurable element can be selected. For example : was the previous position a literal, or the end of a match, what was the length of the previous match, and so on. They can also be combined.
But, the prediction will only be as better as the correlation can be. Therefore, mixing random elements together will prove a poor indicator.
The story becomes even more harder when considering that correlation elements may change overtime. This is quite normal if data type being compressed changes, even within a single file. Tracking such changes can become a significant burden, and therefore, faster alternative is to fix the probability model, hard-wiring the correlation rules. This is necessarily less optimal, but obviously much cheaper to calculate.

Another problem in creating too many contexts is sparsity : the more numerous the contexts, the more sparse the data in each of them. It becomes therefore difficult, if not dangerous, to try to extract some probabilities from a too small sample set; this is the "escape estimation" problem, well known since PPM era.
This problem is critical for hard-wired correlation rules, such as PPM. More modern techniques, with dynamic merging & splitting, can somewhat circumvent it.

Exploiting such correlation requires, as a basic minimum, a fully adaptive entropy coder, able to change its probabilities at each step. Although simple in principle, its implementation is not, and can become a serious speed bottleneck. There is, however, a specific sub-case : the binary range coder. With just two output values to take care of (1 or 0), probability updates are very simple. Hence its predominance in most, if not all, the best compression programs today.

So here can have a look at the different steps to progress towards an RZM-like compressor :
1) move to range coder
2) work on optimal parsing
3) move to fully adaptive entropy
4) play with hard wired contexts
5) play with context selection, mixing, and creation

Quite a long road ahead...

No comments:

Post a Comment