Tuesday, November 25, 2014

Portability woes : Endianness and Alignment (Part 2)


In the previous part, endianness and 32/64-bit detection were covered. Now we'll have a look at the more complex issue of memory alignment.

Memory access alignment is a lesser-known issue, but its impact can be huge : it will crash your program or result in a disproportionate slowdown.

Let's first summarize the problem. When accessing a single byte in memory, there is no restriction. But when trying to access 2 bytes at a time (short), or 4 bytes at a time (int), alignment gets in the way.
The alignment property means a 2-byte field must always be accessed at an even address (a multiple of 2), a 4-byte field at an address which is a multiple of 4, and so on.
From a performance perspective, accessing many bytes at a time is a win, as it makes better use of the memory bus. But when a multi-byte field is accessed at a non-aligned memory address, all sorts of problems can happen : bus width or addressing limitations, cache line overlap, memory segment border overlap, and so on. 
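
To make the problem concrete, here is the kind of access pattern at stake (an illustrative snippet, not LZ4 code) : it runs fine on x86/x64, but may trap or slow down dramatically on alignment-strict CPUs.

#include <stdint.h>

/* Illustration only : a 4-byte read at an address which is not a multiple of 4.
 * x86/x64 handles it transparently ; alignment-strict CPUs may raise a bus error. */
uint32_t read_at_offset_1(const char* buffer)
{
    return *(const uint32_t*)(buffer + 1);   /* unaligned 32-bit access */
}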

All these issues can be solved, and indeed, on the most widely known programming environment, x86/x64, they have been solved for a very long time.
But it has a cost : it makes the CPU more complex, and consumes some precious transistor space. As a consequence, several CPU vendors chose to be a bit lazy on these issues, deciding not to address them and leaving the problem in the hands of software developers. In such cases, if an unaligned memory access is nonetheless performed, the CPU raises an exception, typically resulting in a program crash.
To make matters more complex, some CPUs handle unaligned accesses, but in an inefficient manner, resulting in undesirably slow performance. Others handle them correctly for short (2 bytes) or int (4 bytes) but not long long (8 bytes).

The data alignment issue is well described, with many sources throughout the Internet. Unfortunately, finding a proper portable solution for it is not : many "advisers" simply tell you to avoid unaligned accesses altogether. Thanks, really.
But the above condition cannot be met in every circumstance. Consider how a compression algorithm works : it looks for similar patterns in past data. Such patterns can appear anywhere, not just at "aligned" addresses.

For portable code, this situation is a nightmare. The "safe" approach would be to always access data byte by byte, but then the impact on performance is huge, and for a speed-oriented application such as LZ4, this trade-off is unacceptable.

The way LZ4 has handled it up to now relies on the compiler. The basic idea is : the compiler should be fully aware of whether its target CPU can, or cannot, handle unaligned memory accesses.
This is achieved through the "pack" instruction, which, in a nutshell, tells the compiler "do not align these data", and therefore to generate appropriately cautious assembly code when accessing them.

It gives the following result :

#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)
#  define _PACKED __attribute__ ((packed))
#else
#  define _PACKED
#endif

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__IBMC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(1)
#  else
#    pragma pack(push, 1)
#  endif
#endif

typedef struct { U16 v; }  _PACKED U16_S;
typedef struct { U32 v; }  _PACKED U32_S;
typedef struct { U64 v; }  _PACKED U64_S;
typedef struct {size_t v;} _PACKED size_t_S;

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(0)
#  else
#    pragma pack(pop)
#  endif
#endif

#define A16(x)   (((U16_S *)(x))->v)
#define A32(x)   (((U32_S *)(x))->v)
#define A64(x)   (((U64_S *)(x))->v)
#define AARCH(x) (((size_t_S *)(x))->v)
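
With these definitions in place, reading a (potentially unaligned) 32-bit value looks like an ordinary field access, for example (illustrative usage, not a line from lz4.c) :

U32 value = A32(ptr);   /* ptr may point anywhere, aligned or not */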

If it looks a bit complex, that's because it is.
The first issue is that the "pack" instruction must be issued in a variety of ways, depending on the compiler and its version. This translates into this monstrous macro, trying to cover all situations reported so far. As you can guess, I regularly receive notice of new situations the macro cannot cope with.
This is about as bad as the previous stories regarding 32/64 bits and endianness.

But there is more.
Compilers are not that clever.
In many circumstances, the compiler will emit "safe" slow code to access unaligned data, even though the target CPU is able to handle the situation efficiently, resulting in a large speed drop. This is especially true for recent ARM CPUs.
To counterbalance this effect, the "pack" instruction has to be manually "turned off", in the above example by defining LZ4_FORCE_UNALIGNED_ACCESS.
Unfortunately, this manual switch is almost never used. Source code is, most of the time, compiled "as is", which is no surprise.

So we have 2 issues : issuing the "pack" instruction is complex and not future-proof, and compilers don't automatically make the best choice.

To save the day, maybe a new runtime check will help, like for previous issues ?
Alas, there is no portable runtime test available to check for alignment properties.
(Note : of course, one could imagine running an external program/process just for this purpose, but it's outside the scope of a small single-file library).

So we are stuck, aren't we ?

Well, that's the difficult part. To make some progress on the current situation, I'm going to change the logic : instead of relying on the compiler, the code takes explicit charge of handling unaligned accesses.

The core foundation of the new solution is the function below, already used within lz4frame :

static U32 LZ4_readLE32 (const BYTE* srcPtr)
{
    U32 value32 = srcPtr[0];
    value32 += (srcPtr[1]<<8);
    value32 += (srcPtr[2]<<16);
    value32 += (srcPtr[3]<<24);
    return value32;
}
What's good with this function ?
It handles both endianness and alignment in a safe way. The code is portable.

What's wrong with it ?
It's the safe approach, and therefore is slower than necessary when the CPU can correctly handle unaligned memory accesses.

So, we will now special-case CPUs which do support efficient unaligned access.

Some of them are easily detectable thanks to widely supported standard macros : __i386__, __x86_64__ and __ARM_ARCH_7__ are known architectures with good support for unaligned memory accesses. __ARM_ARCH_6__ is also able to handle it, but in a less efficient manner, so it's unclear whether it's really faster than the portable version.
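
To make the idea concrete, here is a sketch of how such a detection list and the resulting read function could look (my illustration of the intended logic, not necessarily the exact code in the branch) :

/* Sketch : enable the fast path on architectures known to handle unaligned access well.
 * The list is expected to grow as contributors report their results. */
#if !defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
#  if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86) \
   || defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__)
#    define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1
#  endif
#endif

static U32 LZ4_read32(const BYTE* srcPtr)   /* hypothetical wrapper name */
{
#if defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
    /* direct access : the listed targets tolerate any address and run little-endian in practice */
    return *(const U32*)srcPtr;
#else
    return LZ4_readLE32(srcPtr);   /* safe byte-by-byte version shown above */
#endif
}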

Finding a list of CPUs with efficient support for unaligned memory accesses (and their related detection macros) has proven an impossible task so far. One may have in mind that Linux faces a similar challenge, which is supposed to be solved thanks to the macro CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. Unfortunately, I couldn't find a single place where this macro is defined. It seems to be a decentralized methodology, where each architecture declares independently whether it's compatible or not. For the Linux kernel, it's likely the correct method. But it also means there is no central repository where this property is listed.

So I'm a bit stuck right now.
My expectation is that external contributors interested in LZ4 performance may benchmark the speed of the new version, tentatively enabling/disabling the prominent switch at the top of lz4.c when they see fit :

/*
 * CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS :
 * You can force the code to use unaligned memory access, should you know your CPU can handle it efficiently.
 * If it effectively results in better speed (up to 50% improvement can be expected)
 * please report your configuration to upstream (https://groups.google.com/forum/#!forum/lz4c)
 * so that an automatic detection macro can be added to mainline.
 */
/* #define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1 */

Each time a CPU is known to efficiently handle unaligned memory access, its standard detection macro can be added to the list, in order to automatically use the faster code path.

Is it any better than previous method ?
Well, sort of.

To begin with, there is no longer a need to fiddle with the different "pack" instructions depending on compilers and versions. That's one less complexity to manage, and one less dependency to worry about.
Taking formal charge of unaligned accesses also allows the introduction of dedicated functions. For example, a better-adapted "string comparison" function could be introduced.
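
For illustration, such a dedicated function could look like this (a sketch of the idea only ; names are hypothetical and this is not the routine shipped in lz4.c) :

#include <stddef.h>   /* size_t */

/* Sketch : count how many bytes match between two positions,
 * word by word when unaligned access is known to be efficient, byte by byte otherwise. */
static size_t countCommonBytes(const BYTE* p1, const BYTE* p2, const BYTE* p1Limit)
{
    const BYTE* const start = p1;
#if defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
    while (p1 + sizeof(size_t) <= p1Limit)
    {
        if (*(const size_t*)p1 != *(const size_t*)p2) break;   /* direct unaligned loads */
        p1 += sizeof(size_t);
        p2 += sizeof(size_t);
    }
#endif
    while ((p1 < p1Limit) && (*p1 == *p2)) { p1++; p2++; }   /* finish byte by byte */
    return (size_t)(p1 - start);
}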

More importantly, instead of crashing when detection fails, the library will now run slower, but still run correctly. This introduces a new risk though : many users may simply not notice the slowdown, and just use the library "as is", unaware of the latent performance improvement which could be unleashed. The hope is that, as long as a few contributors detect and report the performance issue, the situation can be solved for everybody with a similar setup.


The latest version of LZ4 using these new detection routines is accessible in the feature branch "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs though, it's still worth checking.

Monday, November 24, 2014

Portability woes : Endianness and Alignment (Part 1)

In creating ultra-portable code, able to be compiled on (almost) every platform, there are some unusual problems to take into consideration. Listing them in order of increasing complexity, this post reviews in detail address space and endianness, leaving alignment restrictions for part 2.

Detecting 32 vs 64 bits
This is the easiest part, now practiced by many programmers. It's very common for a program to be designed for a single target environment. But since PC programmers had to deal with the 32->64 bits transition, many solutions are available around the Internet. However, they are not all equivalent.
Consider the initial solution adopted by LZ4 :

/* 32 or 64 bits ? */
#if (defined(__x86_64__) || defined(_M_X64) || defined(_WIN64) \
  || defined(__64BIT__)  || defined(__mips64) \
  || defined(__powerpc64__) || defined(__powerpc64le__) \
  || defined(__ppc64__) || defined(__ppc64le__) \
  || defined(__PPC64__) || defined(__PPC64LE__) \
  || defined(__ia64) || defined(__itanium__) || defined(_M_IA64) \
  || defined(__s390x__) )   /* Detects 64 bits mode */
#  define LZ4_ARCH64 1
#else
#  define LZ4_ARCH64 0
#endif

It works. But you can easily spot the weakness : this is a long macro, with many special cases, and there is no guarantee additional cases won't show up in the future. And by the way, that is exactly what happened : the list was completed thanks to contributors adding target platforms one by one as they were discovered.
So it's complex, and not completely future-proof.
Let's compare to the new method :

static unsigned LZ4_64bits(void) { return sizeof(void*)==8; }

Yes, a single trivial line, and it is future-proof. It could be done as a macro too, but I prefer a static function, since it gives the compiler a chance to do something clever about it.

The initial feeling is that the macro is "runtime free" while the function costs a small comparison test every time it's called, and is therefore slower.
But that's not what a modern compiler is expected to do. A clever compiler will realize this function always returns the same value, and replace it with its result. When branches depend on the result, the compiler is also expected to resolve the test automatically, and remove the useless branch through dead code elimination.
And in practice, it works well.
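
For instance (a hypothetical snippet, not lifted from lz4.c), a branch built on top of LZ4_64bits() costs nothing once the compiler has folded the test :

#include <stddef.h>   /* size_t */

static unsigned LZ4_64bits(void) { return sizeof(void*)==8; }   /* as defined above */

/* sizeof(void*) is a compile-time constant, so the compiler folds LZ4_64bits() to 0 or 1
 * and removes the dead branch entirely */
static size_t LZ4_stepSize(void)
{
    if (LZ4_64bits()) return 8;   /* 64-bit mode : process 8 bytes at a time */
    return 4;                     /* 32-bit mode : process 4 bytes at a time */
}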

It doesn't solve everything though.
For example, with Visual Studio, the intrinsic function _BitScanForward64() is only accessible when compiling for x64. Compiling source code mentioning this function in 32-bit mode will fail at the link stage, even if the program never calls the function. That's a situation a runtime test cannot solve.
For this special case, it's still necessary to restrict the compiled source code through macro selection.

Detecting Endianness
While 32/64 bits is (mostly) a question of performance, endianness impacts result correctness. So it's damn important to detect it correctly.
Code able to handle different endianness is less common, but it's still relatively easy to find several solutions on the Internet. And once again, they are not all equivalent.

Initially, LZ4 adopted the macro detection approach :

/*
 * Little Endian or Big Endian ?
 * Overwrite the #define below if you know your architecture endianess
 */
#include <stdlib.h>   /* Apparently required to detect endianess */
#if defined (__GLIBC__)
#  include <endian.h>
#  if (__BYTE_ORDER == __BIG_ENDIAN)
#     define LZ4_BIG_ENDIAN 1
#  endif
#elif (defined(__BIG_ENDIAN__) || defined(__BIG_ENDIAN) || defined(_BIG_ENDIAN)) && !(defined(__LITTLE_ENDIAN__) || defined(__LITTLE_ENDIAN) || defined(_LITTLE_ENDIAN))
#  define LZ4_BIG_ENDIAN 1
#elif defined(__sparc) || defined(__sparc__) \
   || defined(__powerpc__) || defined(__ppc__) || defined(__PPC__) \
   || defined(__hpux)  || defined(__hppa) \
   || defined(_MIPSEB) || defined(__s390__)
#  define LZ4_BIG_ENDIAN 1
#else
/* Little Endian assumed. PDP Endian and other very rare endian format are unsupported. */
#endif
As you can see, it is a big mess. It depends on so many different parameters that it's hard to maintain, and it's difficult to guarantee it will always work. Indeed, it does not. Quite regularly, I receive external contributions regarding specific platforms which fail the test, in both directions (some little-endian ones declared as big-endian, and the other way round).
Add to this already complex situation the case of bi-endian CPUs, a growing list of hardware which can be configured to run little-endian or big-endian at will. That makes using the architecture as an endianness hint an unsustainable proposition.

As before, the intention is to replace this list of macros with a guaranteed runtime test. Here is the new method within LZ4 :

static unsigned LZ4_isLittleEndian(void)
{
    const union { U32 i; BYTE c[4]; } one = { 1 };   /* don't use static : performance detrimental */
    return one.c[0];
}

Once again, the runtime method is not just much shorter and more readable, it's also guaranteed to always produce the right result, whatever the CPU and its current mode.

However, this time, it's a little bit more difficult for the compiler to make the test "runtime free".

Here, unfortunately, your mileage may vary. In the code above, for example, I only succeeded in making the compiler realize the function always produces the same result after removing the "static" property from the variable "one". Other possibilities exist, such as filling a 32-bit value and then accessing it with a char* pointer. They all work. At the end of the day, the question is : which version is most likely to be reduced to a constant value by as many compilers as possible ?
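
For reference, the "fill a value, read its first byte" alternative mentioned above could look like this (an illustrative sketch, not the version used in LZ4) :

/* Alternative formulation : write a 32-bit value, then inspect its first byte through a char* pointer */
static unsigned isLittleEndian_alt(void)
{
    unsigned x = 1;
    return (unsigned)(*(const unsigned char*)&x);   /* 1 on little-endian, 0 on big-endian */
}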

The union-based version above has proved successful so far. I can only hope it will remain as successful for other compiler/target combinations. At least, it's no longer the correctness of the test which is at stake, only its performance.


In the next article, we'll review memory access alignment restrictions, which are, by far, the most complex issue.
In the meantime, should you wish to review and test the new detection methods, you can grab them in the latest LZ4 feature branch named "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs though, especially big-endian ones, it still deserves to be checked.

Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-byte shorts when using the U16 variant).

It seems a trivial task, and could indeed be realized by a single-line function, such as this one (assuming the table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.
(Edit : Performance numbers are measured  on a Core i5-3340M @2.7GHz configuration. Benchmark program is also freely available within the FSE project)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is a 3x performance hit, and more importantly, it makes counting a sizable portion of the overall time required to compress a block of data (remember FSE targets speeds of 400 MB/s overall, so if counting alone costs that much, it drags the entire compression process down).

What is happening ? This is where it becomes interesting. This is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be cached within registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached in L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, there is a high probability of counting the same byte twice in a row. So, when the CPU wants to add +1 to the appropriate table cell, the write commit delay gets in the way : +1 means the CPU has to perform a read and then a write at this memory address. If the previous +1 is still not entirely committed, the cache will make the CPU wait a bit longer before delivering the data. And the impact is noticeable, as measured by the benchmark.

So, how to escape this side-effect ?
A first idea is : don't read & write the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned up) as follows :

Counting1[*ip++]++;
Counting2[*ip++]++;
Counting3[*ip++]++;
Counting4[*ip++]++;

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added to 2 different memory cells, escaping the write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, the write commit delay will still be there, but at least the latency has been used to count 3 other bytes, instead of being wasted.
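
Here is a simplified, self-contained sketch of this multi-table strategy (not the exact FSE_count() source, which also performs input validation and handles the U16 variant) :

#include <stddef.h>   /* size_t */
#include <string.h>   /* memset */

typedef unsigned char BYTE;
typedef unsigned int  U32;

/* Simplified sketch of the 4-table counting strategy */
static void count4(U32 count[256], const BYTE* src, size_t srcSize)
{
    U32 c1[256], c2[256], c3[256], c4[256];
    const BYTE* ip = src;
    const BYTE* const end4 = src + (srcSize & ~(size_t)3);   /* multiple-of-4 boundary */
    const BYTE* const end  = src + srcSize;
    unsigned s;

    memset(c1, 0, sizeof(c1)); memset(c2, 0, sizeof(c2));
    memset(c3, 0, sizeof(c3)); memset(c4, 0, sizeof(c4));

    while (ip < end4)   /* main loop : 4 independent counters */
    {
        c1[*ip++]++;
        c2[*ip++]++;
        c3[*ip++]++;
        c4[*ip++]++;
    }
    while (ip < end) c1[*ip++]++;   /* last 0-3 bytes */

    for (s = 0; s < 256; s++)       /* regroup all results at the end */
        count[s] = c1[s] + c2[s] + c3[s] + c4[s];
}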

The function is more complex overall : more tables, hence more memory to initialize, special-casing of non-multiple-of-4 input sizes, and regrouping of all results at the end, so intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?

When counting random data, FSE_count() reaches 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. It's when the distribution gets squeezed that it makes the most difference, with the 90% distribution being counted at 1700 MB/s. Indeed, it still takes a hit, but a much smaller one, and the speed proves much more stable overall.


With an average speed > 1700 MB/s, counting may seem a fast enough operation. It is nonetheless still the second largest contributor to overall compression time, gobbling up approximately 15% of the budget on its own. That's perceptible, and still a tad too much, if you ask me, for such a simple task. But barring another great find, it's the fastest solution I could come up with.

Edit :
Should you be willing to test some new ideas for the counting algorithm, you may find it handy to get the benchmark program which produced the speed results mentioned in this article. The program is part of the "test" directory within the FSE project, as a single file named fullbench.c :
https://github.com/Cyan4973/FiniteStateEntropy/blob/master/test/fullbench.c

Edit 2 :
Thanks to recent comments, notably from gpdNathan and Henry Wong, a new and better explanation has been provided for the observed delay : store-to-load forwarding. I suggest reading the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by the store cache on modern CPUs. However, that introduces some new issues, related to pipelining, serial dependencies and prefetching, with remarkably similar consequences, except that the number of lost cycles at stake is much smaller.

Edit 3 :
Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. His entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial entry suggested in this blog, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability perspectives.

Saturday, July 19, 2014

xxHash wider : assessing quality of a 64-bits hash function

The initial version of xxHash was created in a bid to find a companion error detection algorithm for the LZ4 decoder. The objective was set after discovering that usual implementations of CRC32 were so slow that the overall process of decoding + error checking would be dominated by the error check.
The bet was ultimately successful, and borrowed some of its success from MurmurHash, most notably its test tool SMHasher, the best of its kind for measuring the quality of a hash algorithm. xxHash's speed advantage stems from its explicit use of ILP (Instruction Level Parallelism) to keep the multiple ALUs of modern CPU cores busy.

Fast forward to 2014 : the computing world has evolved a bit. Laptops, desktops and servers have massively transitioned to 64 bits, while 32 bits is still widely used, but mostly within smartphones and tablets. 64-bit computing is now part of the daily experience, and it becomes more natural to create algorithms targeting primarily 64-bit systems.

An earlier demo of XXH64 quickly proved that moving to 64 bits achieves much better performance, just by virtue of wider memory accesses. For some time however, I wondered whether that was a "good enough" objective, or whether XXH64 should also offer some additional security properties. It took the persuasion of Mathias Westerdhal to push me to create XXH64 as a simpler derivative of XXH32, which was, I now believe, the right choice.

XXH64 is therefore a fairly straightforward application of the XXH methodology to 64 bits : an inner loop with 4 interleaved streams, a tail sequence to handle input sizes which are not multiples of 32, and a final avalanche to ensure all bits are properly randomized. The bulk of the work was done by Mathias, while I mostly provided some localized elements, such as prime constants, shift sequences, and some optimizations for short inputs.

The quality of XXH64 is very good, but that conclusion was difficult to assess. A key problem with 64-bit algorithms is that measuring collisions properly requires generating and tracking too many results (you need 4 billion hashes for a 50% chance of getting 1 collision). So, basically, all tests must be perfect, ending with 0 collisions. Which is the case.
Since that's a bare minimum, and not a good enough objective to measure 64-bit quality, I also stared at the bias metric. The key idea is : any bit within the final hash must have a 50% chance of being 0 or 1. The bias metric finds the worst bit, the one which deviates most from 50%. Results are good, with a typical worst deviation around 0.1%, equivalent to perfect cryptographic hashes such as SHA1.
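
As an illustration of the principle, here is a rough sketch of such a bias measurement (not SMHasher's actual code ; it assumes the xxHash library is available for XXH64()) :

#include <stdio.h>
#include "xxhash.h"   /* assumes the xxHash library is available */

/* Rough sketch : hash many distinct inputs, count how often each of the 64 output bits is set,
 * and report the worst deviation from the ideal 50% */
int main(void)
{
    unsigned long long bitCount[64] = {0};
    unsigned long long n, nbHashes = 1000000;
    double worst = 0.0;
    int b;

    for (n = 0; n < nbHashes; n++)
    {
        unsigned long long input = n;                            /* arbitrary distinct inputs */
        unsigned long long h = XXH64(&input, sizeof(input), 0);
        for (b = 0; b < 64; b++) bitCount[b] += (h >> b) & 1;
    }
    for (b = 0; b < 64; b++)
    {
        double deviation = (double)bitCount[b] / (double)nbHashes - 0.5;
        if (deviation < 0) deviation = -deviation;
        if (deviation > worst) worst = deviation;
    }
    printf("worst bit deviation from 50%% : %.4f%%\n", worst * 100);
    return 0;
}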

Since I was still not totally convinced, I also measured each 32-bit part of the 64-bit hash (high and low) as an individual 32-bit hash. The theory is : if the 64-bit hash is perfect, any 32-bit part of it must also be perfect. And the good thing is : with 32 bits, collisions can be properly measured. The results are also excellent, each 32-bit part scoring perfectly in every possible metric.

But that was still not good enough. We could have 2 perfect 32-bit hashes coalesced together which are merely repetitions of each other, which of course would not make an excellent 64-bit hash. So I also measured the "Bit Independence Criterion", the ability to predict one bit from another one. On this metric too, XXH64 got a perfect score, meaning no bit can be used as a predictor for another bit.

So I believe we have gone as far as we could to measure the quality of this hash, and it looks good for production usage. The algorithm is delivered with a benchmark program, integrating a coherency checker, to ensure results remain the same across all possible architectures. It's automatically tested using the Travis continuous integration environment, including Valgrind memory access verification.

Note that 64-bit hashes are really meant for 64-bit programs. They get roughly double the speed thanks to the increased number of bits loaded per cycle. But if you try to fit such an algorithm into a 32-bit program, speed will plummet drastically, because emulating 64-bit arithmetic on 32-bit hardware is quite costly.

SMHasher speed test, compiled using GCC 4.8.2 on Linux Mint 64-bits. The reference system uses a Core i5-3340M @2.7GHz
Version   Speed on 64-bits   Speed on 32-bits
XXH64     13.8 GB/s          1.9 GB/s
XXH32      6.8 GB/s          6.0 GB/s






Tuesday, May 20, 2014

Streaming API for LZ4

For quite some time, the LZ4 Streaming API project has been repeatedly started and delayed, as other priorities stepped in the way. To be fair, one important delaying factor was the difficulty of defining a "clean enough" API, something that would be simple to use and understand, while also providing the level of fine-tuning required by advanced programmers (typically within embedded environments).

Today I feel quite close to breaking this shell, with an interface definition which matches my idea of a "clean enough" API. So let's share some preliminary results.

Current streaming interface

The current streaming API exposes the following functions :

void* LZ4_create (const char* inputBuffer);
int   LZ4_compress_continue (void* LZ4_Data, const char* source, char* dest, int inputSize);
char* LZ4_slideInputBuffer (void* LZ4_Data);
int   LZ4_free (void* LZ4_Data);

Although it works as intended, this API introduces some critical limitations.

First, there is a need to create a data structure which tracks the behavior of the stream.
This is void* LZ4_data, which will be generated by LZ4_create().

It looks fine, but a first issue is that allocation happens inside LZ4_create(). In some circumstances, this may not be acceptable : sometimes allocation must be fully controlled by the host process. For example, standard functions such as malloc() may be unavailable. For this use case, 2 more functions were created :

int LZ4_sizeofStreamState(void);
int LZ4_resetStreamState(void* state, const char* inputBuffer);

It looks fine too, but is unfortunately still not usable for static allocation purposes (on the stack, for example).

The second issue is the const char* inputBuffer argument, specified at creation time, because it remains fixed during the entire compression process. It's not possible to jump between memory segments. This is a major limitation, preventing "in-place" compression of scattered memory segments.

LZ4_compress_continue() looks fine as a function definition, but what is less so is the documented requirement : up to 64KB of previously processed data must stand *before* the data to be compressed. This is a major constraint, typically requiring data to be prepared in an intermediate buffer (which, in some circumstances, might prove an unaffordable burden).

Then there is LZ4_slideInputBuffer(), which arguably sits at the wrong level of responsibility. Its mission is to copy the last 64KB of previous source data to the beginning of the input buffer. It exists because the content of void* LZ4_Data is not accessible.

No function is defined to simply load a dictionary : to achieve that goal, one has to compress a segment using LZ4_compress_continue(), throw away the result, and compress the next segment using the data accumulated so far.

To summarize, yes it can work, but it's cumbersome. One has to accept the idea that the data to compress must be prepared in an intermediate buffer where the above conditions can be met.
It's not convenient, and may sometimes not be possible, for example due to allocation limitations.
It also has a negative impact on performance, due to the overhead of memory copy operations.

New streaming interface definition

While defining the streaming API, I constantly struggled to find the right balance between a "fully automated" API and a "user-controlled" one. The "fully automated" API would take charge of all kinds of workloads, generate data respecting the LZ4 Streaming Format with full headers and markers, and take care of allocating required buffers. On the other hand, the "user-controlled" API would serve the needs of host programs which require full control over resource allocation.

I therefore settled on providing both.
The "user-controlled" version will be part of the LZ4 library. It's the simpler one, and only takes care of generating the compressed block format, chained or independent. It relies on the host process to take care of all memory operations (allocation, position and availability) and to provide its own custom format linking the successive blocks.

The "fully automated" API will be part of a new library, which will be called LZ4S for "LZ4 Streaming". It is basically a user program of the previous API.

In this blog post, we will therefore concentrate on the first API, since it is also the underlying foundation of the 2nd one.

typedef struct { unsigned int table[LZ4_STREAMSIZE_U32]; } LZ4_stream_t;

int LZ4_compress_continue (void* LZ4_stream, const char* source, char* dest, int inputSize);

OK, let's focus on the most important differences with current API.

An LZ4_stream_t structure is exposed. It is provided to give the host program better control over memory management. For example, allocation can be static (stack, global) or dynamic, and can use any user-defined allocation function.

Obviously, the host program is both in charge of allocating and freeing this memory space. It may also duplicate it "at will", which is a good idea for "static dictionary compression" scenarios.

Cleaning (or resetting) LZ4_stream_t is only a matter of memset()-ing it to zero. Doing this once before using the structure is a requirement.

LZ4_stream_t's size is controlled by the value LZ4_STREAMSIZE_U32, which is checked at compile time thanks to a static-assert piece of code, as already used within xxHash. It mostly depends on LZ4's hash table (whose typical size is about 16KB). The hash table size is a parameter controlled by the defined value LZ4_MEMORY_USAGE. Up to now, this define was present in lz4.c. To be coherent with the new interface, it will be moved to lz4.h instead. I don't foresee any issue with that.
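
For reference, the compile-time check can be done with the classic "division by zero in a constant expression" trick (a sketch of the technique only ; LZ4_internalState_t is a hypothetical stand-in for the private structure kept inside lz4.c) :

/* hypothetical stand-in for the private structure defined inside lz4.c */
typedef struct { unsigned int hashTable[4096]; unsigned int currentOffset; unsigned int dictSize; } LZ4_internalState_t;

/* compilation fails when the condition is false (division by zero in a constant expression) */
#define LZ4_STATIC_ASSERT(c)   { enum { LZ4_static_assert = 1/(int)(!!(c)) }; }

static void LZ4_checkStreamSize(void)
{
    LZ4_STATIC_ASSERT(sizeof(LZ4_stream_t) >= sizeof(LZ4_internalState_t));   /* opaque table must be large enough */
}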

LZ4_stream_t is described as a table of unsigned int. This is intentional. The real usage of this memory bank remains hidden inside lz4.c. This way, internal variables within the structure cannot be used as a kind of implicit interface contract, allowing future modifications to happen without breaking compatibility with existing programs.

Once the streaming structure is created, you can start populating it by loading a static dictionary using :

int LZ4_loadDict (void* LZ4_stream, const char* source, int size);

This part is optional. Loading a dictionary flushes any prior data reference from LZ4_stream_t, if there is any, so you may also use this function to initialize an LZ4_stream_t structure, by simply passing a size of 0.

LZ4_compress_continue() looks the same as before, and is indeed compatible, but a major difference is that it no longer requires the next data block to be compressed "right after" the previous data. Previous and next data can be anywhere in memory. The LZ4_stream_t structure keeps track of it automatically.

One strong assumption of LZ4_compress_continue() is that previously compressed data is still available. Unfortunately, this might not be the case, and this function cannot guess that.
Should previously compressed data no longer be accessible, for example because you need the space for something else, or because you can't guarantee its availability, it's possible to solve that situation :
  • You can "save" the relevant data segment from the previous block (typically the last 64KB ; it can also be less than that) into a "safe" place. At that point, you will need to indicate this change of position :
int LZ4_moveDict (void* LZ4_stream, char* safeBuffer, int dictSize);

The memory space available at char* safeBuffer must be at least dictSize, since that is the amount of data LZ4_moveDict() will copy there (Note : the maximum amount of data that will be copied is 64KB, even if dictSize is larger).
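
Put together, a minimal usage sketch of the compression side could look as follows (based on the interface described in this post ; names and details may still evolve) :

#include <string.h>   /* memset */
#include "lz4.h"      /* assumed to expose the new streaming interface described above */

/* Usage sketch : compress 2 scattered memory segments as a chained stream.
 * dst1 and dst2 are assumed large enough (LZ4_compressBound() of their input sizes). */
int compressTwoSegments(const char* seg1, int size1, char* dst1,
                        const char* seg2, int size2, char* dst2)
{
    LZ4_stream_t lz4s;                    /* static, stack or heap : the host program decides */
    int cSize1, cSize2;

    memset(&lz4s, 0, sizeof(lz4s));       /* mandatory reset before first use */

    cSize1 = LZ4_compress_continue(&lz4s, seg1, dst1, size1);
    /* seg2 can be anywhere in memory : the structure tracks the previous block automatically */
    cSize2 = LZ4_compress_continue(&lz4s, seg2, dst2, size2);

    /* the host's custom container format must record both compressed sizes */
    return (cSize1 > 0 && cSize2 > 0) ? cSize1 + cSize2 : 0;
}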


Decompression

The current streaming decompression API is defined as follows :

int LZ4_decompress_safe_withPrefix64k (const char* source, char* dest, int compressedSize, int maxOutputSize);
int LZ4_decompress_fast_withPrefix64k (const char* source, char* dest, int originalSize);

Its documented requirement is that previously decompressed data must stand *right before* the memory address where the next data block is going to be decompressed. It works, but implies the creation of a temporary buffer to match the requirement.

The new streaming API gets rid of this limitation, allowing previous data to reside anywhere in accessible memory :

int LZ4_decompress_safe_usingDict (const char* source, char* dest, int compressedSize, int maxOutputSize, const char* dictStart, int dictSize);
int LZ4_decompress_fast_usingDict (const char* source, char* dest, int originalSize, const char* dictStart, int dictSize);

The dictionary definition is part of the argument list (const char* dictStart, int dictSize), therefore there is no need for a tracking structure, in contrast with compression.



A first implementation of this API is currently available in the "dev" branch of LZ4 on GitHub. It is still at an early stage, and this is probably the right time to provide your comments, should you have any.

[Edit] Added : LZ4_moveDict() as a potential replacement of LZ4_SetDictPos()
[Edit] Added : Paragraph on decompression
[Edit] Modified : move to LZ4_moveDict()
[Edit] Interface definition converges towards LZ4_compress_continue()
[Edit] "streaming" branch now merged into "dev" branch

Monday, April 7, 2014

Taking advantage of unequalities to provide better compression

When starting to investigate ANS properties, in November 2013, I stumbled upon the fact that positions in the table are not equivalent. Basically, the low states are more "worthy" than the higher states. Back then, this wasn't properly explained nor forecast by the theory.

At first glance, it may look like a nuisance : usual arithmetic coders offer clean flat probabilities, and this uneven layout seems like added "noise" on top of an already complicated algorithm. On second thought, it looks like a potential opportunity, even if a complex one, to improve accuracy by taking advantage of such differences.
With many other issues to solve to get a working ANS decoder out, this issue was left for a later investigation. And here we are today.

The same conclusion was later demonstrated by Charles Bloom, within his series of blog posts on ANS. He produced a graph which proved much clearer than the stats produced the year before.

(Figure : Probability of each state value)

Note that the "clean graph" is obtained through an average of multiple "synthetic" inputs, not real files, and requires the "precise distribution algorithm" for the layout of symbols. If you try to reproduce this result on your own, you are likely to obtain a graph which roughly produces the same perspective, but not with the same accuracy.

As stated by Charles, the graph follows a 1/X distribution quite closely. It's an important conclusion, because it hints at a formula able to "guess" the best spot for a singleton probability. It seems that, as long as P is between sqrt(2)/2 and sqrt(2), we can find its optimal position in the table.

It's an interesting result that we'll try to exploit. I made multiple tests with different P values, looking for the optimal position in the table, and the result was relatively close to the 1/X formula (the best position is usually slightly different, but the overall difference in file size is extremely small) :

  P    1/X  Best  Difference
0.707  511  499    4 bytes
0.850  339  351    2 bytes
1.000  212  209    2 bytes
1.150  117  115    1 byte
1.300   44   43    2 bytes
1.414    0    0    0 bytes
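
For reference, the "1/X" column above can be reproduced by a small formula (my own reconstruction from the table, assuming a 512-position region ; this is not code taken from FSE) :

#include <math.h>   /* sqrt */

/* Reconstruction of the "1/X" column : estimated best position for a singleton
 * of probability P (with P between sqrt(2)/2 and sqrt(2)), within a 512-position region */
static int bestSingletonPos(double P)
{
    const int region = 512;
    int pos = (int)((double)region * (sqrt(2.0) / P - 1.0));   /* follows the 1/X distribution */
    if (pos > region - 1) pos = region - 1;                    /* clamp at the top of the region */
    return pos;
}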

So we've got a way to improve the accuracy of singleton probabilities (those whose normalized frequency is 1). Can we use that ?

Unfortunately, we must also take into consideration the increased header cost of specifying the position accurately. If a symbol's probability is normalized to 1 (a singleton), it means its number of occurrences is relatively small. Therefore, we can't spend 10 bits to provide the exact best position of a rare symbol in the table ; it would be counterproductive.

In order to experiment, I targeted the easy gains, and took a look at low-probability symbols, those for which realP < 1. Such cases are unfortunately quite common. They are also the most wasteful ones, since they require the minimum normalized frequency, which is 1 (even if realP = 0.01, we can't round that down to zero, otherwise the compression would be lossy).
One way to minimize their loss would be to put them at the top of the table, where the probability is lowest (approximately sqrt(2)/2 ~= 0.707). The end effect is somewhat equivalent to what Charles Bloom did within his own tANS version, but in a more "directive" manner, since we are cherry-picking which symbols must sit at the top of the table.
The impact on header size is going to be very moderate, since now we just have to distinguish between a "normal 1" and a "low 1".

Let's check if this method produces some worthy gains :

Table  Low1     HC     Fast3  Gains    Fast1
4K    435286  435426  435398  0.026%  435723
2K    435677  436041  436072  0.091%  436783
1K    436691  437756  437875  0.271%  439181

Indeed, this is much better than anticipated. This new version handily beats the "perfect normalization algorithm", by taking advantage of the unequal probabilities within the ANS table. It is therefore now integrated into the open-source version of FSE on GitHub.

A great property is that this improved table initialization algorithm runs at the same speed as before, so it's basically an improvement we get for free.

Another important property is that the gain increases as the table size decreases. That's great : it means we can consider reducing the table size, hence its memory cost, while keeping the compression loss in check. Notice how we get better compression performance today than the earlier version (Fast1) produced using tables twice as big !

It's an important benefit, and it enables some critical RAM savings, essential to improving Zhuff performance.
But let's keep that part for a later blog post...