LZ77 Data Compression

Wil Baden 1994-12-09

Programmers are lousy lovers. They always try to get the job done faster than before. And when they do, they brag that they have better performance. Programmers are the only men who boast how small theirs is.

Since 1984 there has been amazing progress in data compression.

In the early 90's I got SALIENT SOFTWARE's AutoDoubler for the Macintosh. My 80 megabyte hard drive had 2 meg available when I installed the program. Since it was a Tuesday, I went out for lasagna, and when I got back an hour later I had 19 meg available.

My 80 meg hard drive soon held 108 megs' worth of data, with room for 25 to 50 more megabytes.

Not only that, but many programs loaded faster and read data faster. When a file takes only half as much disk space, the data can be read from disk in about half the time, which more than pays for the decompression.

How they do it is a trade secret, and Salient has applied for a patent on their technology. Many variations in detail are also possible.

However, I have a good idea about where to begin looking.

Modern methods of data compression all go back to J. ZIV and A. LEMPEL. In 1977 they published a paper in an engineering journal on a new approach to data compression.

J. ZIV and A. LEMPEL, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, 23:3 (1977), 337-343.

In 1978 they published a paper about a related and more elaborate method. In 1984 TERRY WELCH, then at Sperry (later Unisys), described and patented a version of the 1978 method suitable for programming. This is called LZW, for Lempel, Ziv, and Welch.

LZW is the basis of ARC and PKARC on the PC, compress in Unix, and the original StuffIt on the Mac.

Around 1988, after losing a lawsuit, PHIL KATZ (PKARC) came out with a better program, PKZIP. It is derived from the 1977 Ziv-Lempel paper. It turns out that the simpler method has better performance and is smaller. With additional processing, phenomenal results have been obtained.

All popular archivers - arj, lha, zip, zoo, stac, auto-doubler, current stuffit - are variations on the LZ77 theme.

The idea of LZ77 is very simple. It is explained in the FAQ (frequently asked questions) list for compression technology:

The LZ77 family of compressors

LZ77-based schemes keep track of the last n bytes of data seen, and when a phrase is encountered that has already been seen, they output a pair of values corresponding to the position of the phrase in the previously-seen buffer of data, and the length of the phrase.
In effect the compressor moves a fixed-size "window" over the data (generally referred to as a "sliding window" [or "ring buffer"]), with the position part of the ( position, length ) pair referring to the position of the phrase within the window.
The most commonly used algorithms are derived from the LZSS scheme described by JAMES STORER and THOMAS SZYMANSKI in 1982. In this the compressor maintains a window of size N bytes and a "lookahead buffer", the contents of which it tries to find a match for in the window:

      while( lookAheadBuffer not empty )
          {
          get pointer ( position, length ) to the longest match in the window
              for the lookahead buffer;

          if( length > MINIMUM_MATCH_LENGTH )
              {
              output a ( position, length ) pair;
              shift the window length characters along;
              }
          else
              {
              output the first character in the lookahead buffer;
              shift the window 1 character along;
              }
          }

Decompression is simple and fast: Whenever a ( position, length ) pair is encountered, go to that ( position ) in the window and copy ( length ) bytes to the output.
Sliding-window-based schemes can be simplified by numbering the input text characters mod N, in effect creating a circular buffer. The sliding window approach automatically creates the LRU effect which must be done explicitly in LZ78 schemes.
Variants of this method apply additional compression to the output of the LZSS compressor, including a simple variable-length code (LZB), dynamic Huffman coding (LZH), and Shannon-Fano coding (ZIP 1.x), all of which result in a certain degree of improvement over the basic scheme, especially when the data are rather random and the LZSS compressor has little effect.
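
In Forth, the heart of that decompression step might be sketched as below. This is a sketch only, not part of the program given later: Window, W, wrap, and Write-Char are names made up for the example, and N must be a power of two for the AND to act as mod N.

    4096 CONSTANT N                      \ window size, a power of two
    CREATE Window   N CHARS ALLOT        \ the ring buffer
    VARIABLE W                           \ next write position in the window

    : wrap ( u -- u' )  N 1- AND ;       \ u mod N, as N is a power of two

    : put-byte ( char -- )               \ emit one byte and remember it
        DUP Window W @ CHARS + C!        \ keep it in the ring buffer
        W @ 1+ wrap W !
        Write-Char ;                     \ assumed character-output word

    : copy-match ( position length -- )  \ copy length bytes from the window
        0 ?DO
            DUP I + wrap CHARS Window + C@ put-byte
        LOOP DROP ;

Because a match may overlap the current write position, which is how a run of repeated characters gets encoded, the bytes must be copied one at a time, oldest first, exactly as the loop does.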

A copy of this FAQ is available by ftp from rtfm.mit.edu in /pub/usenet/news.answers as compression-faq/part[1-3]. The profane pseudocode given for LZ77 compression can be Forthed as:

    BEGIN
        look-ahead-buffer-used 0= not
    WHILE
        get-pointer(position,length)-to-longest-match
        length minimum-match-length > IF
            output-a-(position,length)-pair
            shift-the-window-length-characters-along
        ELSE
            output-first-character-in-lookahead-buffer
            shift-the-window-1-character-along
        THEN
    REPEAT

The bottleneck is finding the longest match quickly. A naïve brute-force method is hardly acceptable. "It's hardly acceptable" is a genteelism for "it sucks". Hashing, or binary search trees, or a combination of the two, is recommended.
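
To illustrate the hashing approach, and only to illustrate it, here is one common scheme: hash the next three bytes of the lookahead buffer to choose a short chain of earlier window positions, then compare the lookahead against those positions alone. All the names and sizes here are made up for the sketch; the program below uses binary search trees instead.

    4096 CONSTANT N                       \ window size, a power of two
    256 CONSTANT #Hash                    \ number of hash chains
    CREATE Window  N CHARS ALLOT
    CREATE Heads   #Hash CELLS ALLOT      \ newest position for each hash
    CREATE Links   N CELLS ALLOT          \ next-older position, same hash

    : init-chains ( -- )  #Hash 0 DO  -1 Heads I CELLS + !  LOOP ;

    : hash3 ( addr -- u )                 \ fold three bytes to a chain index
        DUP C@ 5 LSHIFT  OVER CHAR+ C@ XOR  5 LSHIFT
        SWAP 2 CHARS + C@ XOR  #Hash 1- AND ;

    : remember ( pos -- )                 \ link position pos into its chain
        DUP CHARS Window + hash3 CELLS Heads +  ( pos head )
        DUP @  2 PICK CELLS Links + !           ( pos head )
        ! ;

    : older ( pos -- pos' )  CELLS Links + @ ;  \ -1 marks the chain's end

The searcher hashes the first three lookahead bytes, starts at the position stored in Heads, and follows older until it reaches -1 or falls outside the window, keeping the position that matched the most bytes.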

A simple implementation of LZSS using binary search trees giving very good but not best performance was put into the public domain in 1988 by HARUHIKO OKUMURA. This implementation has inspired the high performance programs now in use.

Given here is a Standard Forth version of that program. It shows its genealogy by the unusually long Forth definitions. I believe that politically correct factoring would not help understanding and would degrade performance. This program is 8 to 10 times faster than the brute force implementation I gave at the 1992 FORML Conference. It can serve as material for studying data compression in Forth, as the original program did in C and Pascal.

As an example, here is the beginning of Green Eggs and Ham, copyright 1960, DR. SEUSS.

    That Sam-I-am!
    That Sam-I-am!
    I do not like that Sam-I-am!

    Do you like green eggs and ham?

    I do not like them, Sam-I-am.
    I do not like green eggs and ham.

Compressed with LZSS this becomes:

    |That Sam|-I-am!
    []|I do not| like t[]|
    Do you[]|green eg|gs and h|am?
    []em,|[].[][].

"|" represents a format byte. "[]" represents a two-byte position and length.

The program uses words from the Core and Core Extension word sets. It also uses READ-FILE and WRITE-FILE from the File Access word set. It presumes that R/O, R/W, W/O, BIN, OPEN-FILE, CREATE-FILE, and TO will be used appropriately for file assignment. The program also uses NOT, which can be equivalent to either 0= or INVERT.

Standard Forth file access for character-by-character input or output is hardly acceptable. The Read-Char used here is painfully defined with READ-FILE.

A better definition would buffer many characters of input at a time. The same should be done for Write-String.
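
A sketch of such a buffered Read-Char follows, assuming In-File holds the input file-id as in the test words below. In-Buf, Buf-Size, and the convention of returning -1 at end of file are choices made up for the example.

    1024 CONSTANT Buf-Size
    CREATE In-Buf   Buf-Size CHARS ALLOT
    VARIABLE In-Len   0 In-Len !         \ characters now in the buffer
    VARIABLE In-Pos   0 In-Pos !         \ next character to hand out

    : fill-buf ( -- )                    \ refill the buffer from In-File
        In-Buf Buf-Size In-File READ-FILE
        ABORT" Read error"  In-Len !  0 In-Pos ! ;

    : Read-Char ( -- char | -1 )         \ next character, -1 at end of file
        In-Pos @ In-Len @ < 0= IF fill-buf THEN
        In-Len @ 0= IF -1 EXIT THEN
        In-Buf In-Pos @ CHARS + C@
        1 In-Pos +! ;

Write-String can be buffered the same way: collect characters in a second buffer and flush it with a single WRITE-FILE when it fills and when the output file is closed.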

The spelling of each word is consistent, and no two words are distinguished only by letter case. It is immaterial whether letter case in your system is significant or insignificant.

The following was used to test the program.

    S" HD:Custom:Junk.LZ" DELETE-FILE DROP
    S" HD:Custom:Junk" DELETE-FILE DROP

    S" HD:Custom:Green-Eggs" R/O OPEN-FILE Checked TO In-File
    S" HD:Custom:Junk.LZ" W/O BIN CREATE-FILE Checked TO Out-File

    Encode

    S" HD:Custom:Junk.LZ" LISTING

    S" HD:Custom:Junk.LZ" R/O BIN OPEN-FILE Checked TO In-File
    S" HD:Custom:Junk" W/O CREATE-FILE Checked TO Out-File

    Decode

    In-File Closed    Out-File Closed

    S" HD:Custom:Junk" LISTING