LZ77 Data Compression

Wil Baden 1994-12-09

Programmers are lousy lovers. They always try to get the job done faster than before. And when they do, they brag that they have better performance. Programmers are the only men who boast how small theirs is.

Since 1984 there has been amazing progress in data compression.

In the early 90's I got SALIENT SOFTWARE's AutoDoubler for the Macintosh. My 80 megabyte hard drive had 2 meg available when I installed the program. Since it was a Tuesday, I went out for lasagna, and when I got back an hour later I had 19 meg available.

My 80 meg hard drive soon held 108 megs' worth of data, with room for 25 to 50 more megabytes.

Not only that, but many programs loaded faster and read data faster. When a file takes only half as much disk space, the data can be read from disk in about half the time, which more than pays for the decompression.

How they do it is a trade secret, and Salient has applied for a patent on their technology. Many variations in detail are also possible.

However, I have a good idea about where to begin looking.

Modern methods of data compression all go back to J. ZIV and A. LEMPEL. In 1977 they published a paper in an engineering journal on a new approach to data compression.

J. ZIV and A. LEMPEL, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, 23:3 (1977), 337-343.

In 1978 they published a paper about a related and more elaborate method. In 1984 TERRY WELCH, then at Sperry (later Unisys), described and patented a version of the 1978 method suitable for programming. This is called LZW, for Lempel, Ziv, and Welch.

LZW is the basis of ARC and PKARC on the PC, compress in Unix, and the original StuffIt on the Mac.

Around 1988, after losing a lawsuit, PHIL KATZ (PKARC) came out with a better program, PKZIP. It is derived from the 1977 Ziv-Lempel paper. It turns out that the simpler method has better performance and is smaller. With additional processing, phenomenal results have been obtained.

All popular archivers - arj, lha, zip, zoo, stac, auto-doubler, current stuffit - are variations on the LZ77 theme.

The idea of LZ77 is very simple. It is explained in the FAQ (frequently asked questions) list for compression technology:

The LZ77 family of compressors

LZ77-based schemes keep track of the last n bytes of data seen, and when a phrase is encountered that has already been seen, they output a pair of values corresponding to the position of the phrase in the previously-seen buffer of data, and the length of the phrase.
In effect the compressor moves a fixed-size "window" over the data (generally referred to as a "sliding window" [or "ring buffer"]), with the position part of the ( position, length ) pair referring to the position of the phrase within the window.
The most commonly used algorithms are derived from the LZSS scheme described by JAMES STORER and THOMAS SZYMANSKI in 1982. In this the compressor maintains a window of size N bytes and a "lookahead buffer", the contents of which it tries to find a match for in the window:

      while( lookAheadBuffer not empty )
          {
          get pointer ( position, length ) to the longest match in the window
              for the lookahead buffer;

          if( length > MINIMUM_MATCH_LENGTH )
              {
              output a ( position, length ) pair;
              shift the window length characters along;
              }
          else
              {
              output the first character in the lookahead buffer;
              shift the window 1 character along;
              }
          }

Decompression is simple and fast: Whenever a ( position, length ) pair is encountered, go to that ( position ) in the window and copy ( length ) bytes to the output.
Sliding-window-based schemes can be simplified by numbering the input text characters mod N, in effect creating a circular buffer. The sliding window approach automatically creates the LRU effect which must be done explicitly in LZ78 schemes.
Variants of this method apply additional compression to the output of the LZSS compressor, including a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme, especially when the data are rather random and the LZSS compressor has little effect.
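
In Forth, the heart of that decompression step might be sketched as below. This is a sketch only, not part of the program given later: Window, W, wrap, and Write-Char are names made up for the example, and N must be a power of two for the AND to act as mod N.

    4096 CONSTANT N                      \ window size, a power of two
    CREATE Window   N CHARS ALLOT        \ the ring buffer
    VARIABLE W                           \ next write position in the window

    : wrap ( u -- u' )  N 1- AND ;       \ u mod N, as N is a power of two

    : put-byte ( char -- )               \ emit one byte and remember it
        DUP Window W @ CHARS + C!        \ keep it in the ring buffer
        W @ 1+ wrap W !
        Write-Char ;                     \ assumed character-output word

    : copy-match ( position length -- )  \ copy length bytes from the window
        0 ?DO
            DUP I + wrap CHARS Window + C@ put-byte
        LOOP DROP ;

Because a match may overlap the current write position, which is how a run of repeated characters gets encoded, the bytes must be copied one at a time, oldest first, exactly as the loop does.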

A copy of this FAQ is available by ftp from rtfm.mit.edu in /pub/usenet/news.answers as compression-faq/part[1-3]. The profane pseudocode given for LZ77 compression can be Forthed as:

    BEGIN
        look-ahead-buffer-used 0= not
    WHILE
        get-pointer(position,length)-to-longest-match
        length minimum-match-length > IF
            output-a-(position,length)-pair
            shift-the-window-length-characters-along
        ELSE
            output-first-character-in-lookahead-buffer
            shift-the-window-1-character-along
        THEN
    REPEAT

The bottleneck is finding the longest match quickly. A naïve brute-force method is hardly acceptable. "It's hardly acceptable" is a genteelism for "it sucks". Hashing, or binary search trees, or a combination of the two, is recommended.
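
To illustrate the hashing approach, and only to illustrate it, here is one common scheme: hash the next three bytes of the lookahead buffer to choose a short chain of earlier window positions, then compare the lookahead against those positions alone. All the names and sizes here are made up for the sketch; the program below uses binary search trees instead.

    4096 CONSTANT N                       \ window size, a power of two
    256 CONSTANT #Hash                    \ number of hash chains
    CREATE Window  N CHARS ALLOT
    CREATE Heads   #Hash CELLS ALLOT      \ newest position for each hash
    CREATE Links   N CELLS ALLOT          \ next-older position, same hash

    : init-chains ( -- )  #Hash 0 DO  -1 Heads I CELLS + !  LOOP ;

    : hash3 ( addr -- u )                 \ fold three bytes to a chain index
        DUP C@ 5 LSHIFT  OVER CHAR+ C@ XOR  5 LSHIFT
        SWAP 2 CHARS + C@ XOR  #Hash 1- AND ;

    : remember ( pos -- )                 \ link position pos into its chain
        DUP CHARS Window + hash3 CELLS Heads +  ( pos head )
        DUP @  2 PICK CELLS Links + !           ( pos head )
        ! ;

    : older ( pos -- pos' )  CELLS Links + @ ;  \ -1 marks the chain's end

The searcher hashes the first three lookahead bytes, starts at the position stored in Heads, and follows older until it reaches -1 or falls outside the window, keeping the position that matched the most bytes.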

A simple implementation of LZSS using binary search trees giving very good but not best performance was put into the public domain in 1988 by HARUHIKO OKUMURA. This implementation has inspired the high performance programs now in use.

Given here is a Standard Forth version of that program. It shows its genealogy by the unusually long Forth definitions. I believe that politically correct factoring would not help understanding and would degrade performance. This program is 8 to 10 times faster than the brute force implementation I gave at the 1992 FORML Conference. It can serve as material for studying data compression in Forth, as the original program did in C and Pascal.

As an example, here is the beginning of Green Eggs and Ham, copyright 1960, DR. SEUSS.

    That Sam-I-am!
    That Sam-I-am!
    I do not like that Sam-I-am!

    Do you like green eggs and ham?

    I do not like them, Sam-I-am.
    I do not like green eggs and ham.

Compressed with LZSS this becomes:

    |That Sam|-I-am!
    []|I do not| like t[]|
    Do you[]|green eg|gs and h|am?
    []em,|[].[][].

"|" represents a format byte. "[]" represents a two-byte position and length.

The program uses words from the Core and Core Extension word sets. It also uses READ-FILE and WRITE-FILE from the File Access word set. It presumes that R/O, R/W, W/O, BIN, OPEN-FILE, CREATE-FILE, and TO will be used appropriately for file assignment. The program also uses NOT, which can be equivalent to either 0= or INVERT.

Standard Forth file access for character-by-character input or output is hardly acceptable. The Read-Char used here is painfully defined with READ-FILE.

A better definition would buffer many characters of input at a time. The same should be done for Write-String.
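
A sketch of such a buffered Read-Char follows, assuming In-File holds the input file-id as in the test words below. In-Buf, Buf-Size, and the convention of returning -1 at end of file are choices made up for the example.

    1024 CONSTANT Buf-Size
    CREATE In-Buf   Buf-Size CHARS ALLOT
    VARIABLE In-Len   0 In-Len !         \ characters now in the buffer
    VARIABLE In-Pos   0 In-Pos !         \ next character to hand out

    : fill-buf ( -- )                    \ refill the buffer from In-File
        In-Buf Buf-Size In-File READ-FILE
        ABORT" Read error"  In-Len !  0 In-Pos ! ;

    : Read-Char ( -- char | -1 )         \ next character, -1 at end of file
        In-Pos @ In-Len @ < 0= IF fill-buf THEN
        In-Len @ 0= IF -1 EXIT THEN
        In-Buf In-Pos @ CHARS + C@
        1 In-Pos +! ;

Write-String can be buffered the same way: collect characters in a second buffer and flush it with a single WRITE-FILE when it fills and when the output file is closed.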

The spelling of each word is consistent, and no two words are distinguished only by letter case. It is immaterial whether letter case in your system is significant or insignificant.

The following was used to test the program.

    S" HD:Custom:Junk.LZ" DELETE-FILE DROP
    S" HD:Custom:Junk" DELETE-FILE DROP

    S" HD:Custom:Green-Eggs" R/O OPEN-FILE Checked TO In-File
    S" HD:Custom:Junk.LZ" W/O BIN CREATE-FILE Checked TO Out-File

    Encode

    S" HD:Custom:Junk.LZ" LISTING

    S" HD:Custom:Junk.LZ" R/O BIN OPEN-FILE Checked TO In-File
    S" HD:Custom:Junk" W/O CREATE-FILE Checked TO Out-File

    Decode

    In-File Closed    Out-File Closed

    S" HD:Custom:Junk" LISTING