**zlib**'s compression method, an LZ77 variant called *deflation*,
emits compressed data as a sequence of blocks. Various block types are
allowed, one of which is the *stored* block; this consists simply of
the raw input data plus a few header bytes.
case, where all other block types would expand the data, deflation falls
back to stored blocks. Thus the only expansion is an overhead of 5 bytes
per 32 KB block (about 0.02%), plus a one-time overhead of six bytes for the
entire stream. In the absolute worst case of a single-byte input stream, the
overhead therefore amounts to 1100% (11 bytes of overhead, 1 byte of actual
data). For larger stream sizes, the overhead approaches the limiting value
of about 0.02%.
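
This worst case is easy to observe with Python's standard `zlib` module (a thin wrapper around the C library). The exact byte counts vary with the random input, but the bound holds:

```python
import os
import zlib

# Random bytes are effectively incompressible, so deflate falls back to
# stored blocks and the output is only marginally larger than the input.
data = os.urandom(100_000)
compressed = zlib.compress(data, 9)

overhead = len(compressed) - len(data)
print(f"{len(data)} bytes in, {len(compressed)} bytes out, "
      f"{overhead} bytes of overhead")
assert 0 < overhead < 100   # a few bytes per stored block plus 6 for the stream
```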

Empirically, the deflate method is capable of compression factors exceeding
1000:1. (The test case was a 50 MB file filled with zeros; it compressed to
roughly 49 KB.) Mark loves to calculate stuff like this and reports
that the theoretical limit for the zlib *format* (as opposed to its
*implementation* in the currently available sources) is 1032:1. To
quote him,

> The limit comes from the fact that one length/distance pair can represent at most 258 output bytes. A length requires at least one bit and a distance requires at least one bit, so two bits in can give 258 bytes out, or eight bits in give 1032 bytes out. A dynamic block has no length restriction, so you could get arbitrarily close to the limit of 1032:1.

He goes on to note that the current implementation limits its dynamic blocks to about 8 KB (corresponding to 8 MB of input data); together with a few bits of overhead, this implies an actual compression limit of about 1030.3:1. Not only that, but the compressed data stream is itself likely to be rather compressible (in this special case only), so running it through deflate again should produce further gains.
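
Both effects are easy to reproduce with Python's `zlib` bindings; the exact ratio below varies slightly with compression level and input size, so this is a sketch rather than a reproduction of the 50 MB test:

```python
import zlib

data = b"\x00" * 1_000_000          # a megabyte of zeros
once = zlib.compress(data, 9)
print(f"ratio: {len(data) / len(once):.0f}:1")
assert len(data) / len(once) > 500  # roughly 1000:1 in practice

# As noted above, this special-case output is itself quite compressible,
# so a second pass through deflate gains even more.
twice = zlib.compress(once, 9)
assert len(twice) < len(once)
```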

By way of comparison, note that a version of run-length encoding optimized
for this sort of unusual data file -- that is, by using 32-bit integers for
the lengths rather than the more usual 8-bit bytes or 16-bit words -- could
encode the test file in five bytes. That would be a compression factor of
10,000,000:1 (or 10.000.000:1 for you Europeans, or 10^{7}:1 for all
of those engineers and scientists whose browsers support superscripts).
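
Such an encoder is only a few lines. Here is a sketch; the format (a value byte followed by a 32-bit big-endian run length) is hypothetical, chosen only to match the arithmetic above, and runs of 2^{32} bytes or more are not handled:

```python
import struct
from itertools import groupby

def rle32_encode(data: bytes) -> bytes:
    """Run-length encode as (value byte, 32-bit big-endian length) pairs."""
    out = bytearray()
    for value, run in groupby(data):
        out.append(value)                              # the repeated byte
        out += struct.pack(">I", sum(1 for _ in run))  # run length (< 2**32)
    return bytes(out)

# A single run of a million zeros encodes in exactly five bytes; the
# 50 MB all-zero test file above would likewise be 5 bytes, i.e. 10^7:1.
encoded = rle32_encode(b"\x00" * 1_000_000)
assert len(encoded) == 5
```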

Finally, please note that this level of compression is *extremely* rare
and only occurs with really trivial files (e.g., a megabyte of zeros). More
typical zlib compression ratios are on the order of 2:1 to 5:1.

**zlib**'s memory footprint can also be specified fairly precisely. It
is larger for compression than for decompression, and the exact requirements
depend on how the library was compiled.

The memory requirements for compression depend on two parameters,
**windowBits** and **memLevel**:

deflate memory usage (bytes) = (1 << (windowBits+2)) + (1 << (memLevel+9))

For the default values of 15 and 8, respectively, this is 256 KB. Both windowBits and memLevel can be set to lower values at compile time via the MAX_WBITS and MAX_MEM_LEVEL macros, but only at a cost in compression efficiency.
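
As a quick sanity check of the formula (this helper is purely illustrative, not part of zlib):

```python
def deflate_memory(window_bits: int = 15, mem_level: int = 8) -> int:
    """Approximate deflate memory use in bytes, per the formula above."""
    return (1 << (window_bits + 2)) + (1 << (mem_level + 9))

assert deflate_memory() == 256 * 1024                 # defaults: 256 KB
assert deflate_memory(9, 1) == (1 << 11) + (1 << 10)  # minimal settings: 3 KB
```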

The memory requirements for decompression depend only on **windowBits**,
but this is, in a sense, a harsher limitation: whereas data streams
compressed with a smaller window will merely be a bit larger than they
would have otherwise, a reduced window size for decompression means that
streams compressed with larger windows *cannot be decompressed at all*.
Having said that:

inflate memory usage (bytes) = (1 << windowBits) + inflate_huft usage

Empirically, no more than 1041 inflate_huft structures are allocated: 166 for distance codes and 875 for literal/length codes. Each inflate_huft is the size of two pointers (i.e., four, eight or sixteen bytes, depending on whether pointers are 16-bit, 32-bit or 64-bit). Typically, therefore, inflate() requires no more than 42 KB of storage on a 32-bit machine -- this includes the 32768-byte sliding window, 84 bytes of initialization overhead, 1292 bytes of inflate_blocks overhead, 8328 bytes of inflate_huft structures, and 148 bytes of memory-allocation overhead (37 allocation calls, assuming 4 bytes' overhead per call). 64 KB is almost certainly a hard upper limit.
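
The same arithmetic can be checked directly, using the worst-case constants quoted above (an illustrative calculation, not zlib code):

```python
POINTER_SIZE = 4      # 32-bit machine
MAX_HUFTS = 1041      # empirical worst case: 166 distance + 875 literal/length

def inflate_memory(window_bits: int = 15) -> int:
    """Approximate worst-case inflate memory use in bytes."""
    window = 1 << window_bits
    hufts = MAX_HUFTS * 2 * POINTER_SIZE  # each inflate_huft is two pointers
    overhead = 84 + 1292 + 148            # init + inflate_blocks + allocator
    return window + hufts + overhead

total = inflate_memory()
print(f"{total} bytes")         # about 42 KB
assert total == 32768 + 8328 + 1524
assert total < 64 * 1024        # comfortably under the 64 KB hard limit
```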

Both Adler-32 and CRC-32 (*cyclic redundancy check*) are 32-bit checks.
But while the CRC can take on any 32-bit value (2^{32} possibilities),
Adler-32 is limited to 65521^{2} possibilities. So the probability
of a false positive on random errors for CRC-32 is
2.3283 x 10^{-10}, whereas it is very slightly higher for
Adler-32 at 2.3294 x 10^{-10}.
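
Both probabilities fall straight out of the sizes of the two value spaces:

```python
crc_space = 2 ** 32         # a CRC-32 can take on any 32-bit value
adler_space = 65521 ** 2    # both Adler sums are reduced modulo the prime 65521

p_crc = 1 / crc_space
p_adler = 1 / adler_space
print(f"CRC-32:   {p_crc:.4e}")     # 2.3283e-10
print(f"Adler-32: {p_adler:.4e}")   # 2.3294e-10
assert p_adler > p_crc              # very slightly likelier to miss an error
```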

The above assumes that all the values are accessible given the amount of data. That is true after only four bytes for the CRC-32, but Adler-32 requires, on the average, about 0.5 KB of data to get rolling -- or 1 KB if it's ASCII data (text). So if the Adler-32 is used on significantly less than about a kilobyte, it will be noticeably weaker than a CRC-32 on the same small block.

A properly constructed CRC-*n* has the nice property that fewer than
*n* bits in error are always detectable. This is not always true for
Adler-32, which can detect all one- or two-byte errors but can miss some
three-byte errors. However, Adler-32 has been constructed to minimize the ways
to make small changes in the data that result in the same check value, through
the use of sums significantly larger than the bytes and by using a prime
(65521) for the modulus. It is in this area that some analysis is deserved,
but it has not yet been done.
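
For concreteness, Adler-32 is just two running sums reduced modulo that prime. Here is a straightforward sketch; zlib's C implementation defers the modulo operations for speed, but the result is identical:

```python
import zlib

def adler32(data: bytes) -> int:
    MOD = 65521                     # largest prime below 2**16
    s1, s2 = 1, 0
    for byte in data:
        s1 = (s1 + byte) % MOD      # sum of the bytes (plus an initial 1)
        s2 = (s2 + s1) % MOD        # sum of the running sums
    return (s2 << 16) | s1          # both 16-bit sums packed into 32 bits

msg = b"any sample text will do"
assert adler32(msg) == zlib.adler32(msg)
```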

This last potential weakness is not a major concern in the application of
Adler-32 to **zlib** (or any other history-based compressor), since if
there is an error at some point in a stream, it will be massively propagated
after that. It would be of concern in an application with transmission or
storage that has a borderline signal-to-noise ratio, for which small numbers
of random errors are expected. For that sort of application one would
certainly want to use a CRC or, better yet, Reed-Solomon error-correction
coding. But even in this case, if the data being transmitted or stored uses
some sort of history-dependent compression (as in **zlib**) *and was
compressible to begin with*, then an Adler-32 used after decompression
would be adequate since the decompressor would significantly amplify any small
errors in the compressed stream. (For incompressible data, most modern
compressors operate in a pass-through mode, so the original comment about
using a CRC or ECC holds.)

The main reason for Adler-32 is, of course, speed in software implementations. The authors wanted a check on zlib's decompression, but not a significant speed penalty just for the check. So Mark came up with the Adler-32 as a faster but still effective alternative to the CRC-32.


Copyright © 1996-1998 Greg Roelofs and Mark Adler.