- Only part one of this article was ever written. For further
- information on Lempel-Ziv compression and decompression, please
- refer to the source notes supplied with the DWC source code.
- -----------------------------------------------------------------
-
-
- Lempel-Ziv-Welch
- Compression Algorithm
-
- Dean W. Cooper
- Germantown, MD
-
- This is part one of a three-part article that explains a
- general-purpose compression algorithm that can compress text files
- by as much as 75%, and many other types of files as well. It is
- perhaps the best general-purpose algorithm and can be beaten
- consistently only by special-purpose algorithms designed for
- specific types of data.
-
- The Lempel-Ziv-Welch (LZW) compression algorithm is a simple
- and relatively fast way to compress many kinds of data including
- text files, object files and libraries, executable files, and
- graphic bitmaps. It is a general algorithm that assumes only one
- thing about the data it is compressing: that the data contains
- repeated sequences of symbols. Like the Huffman compression
- algorithm, LZW depends upon the frequency of occurrences of
- symbols. But where Huffman depends upon the frequency of
- discrete symbols, LZW depends upon the frequency of sequences of
- symbols.
- By definition, sequences of symbols are called strings. In an
- actual implementation, the set of all discrete symbols must be
- small, so usually the data is interpreted as a stream of bytes,
- allowing for 256 discrete symbols. These are called atomic
- strings or atoms for short.
- The general idea of LZW is to create a table of all strings
- that make up the data and then output an encoding for the
- strings. For example, let's look at this paragraph to see one
- way we could encode strings. It has seven sentences, so let's
- call them our strings. We can encode them simply enough: the
- first sentence is 1, the second is 2, etc. Thus, the encoded
- output is "1234567". That's all there is to the compressed data,
- that is, as long as the decompressor knows about the encoding we
- made. Unfortunately, we can't assume the decompressor knows our
- encoding, so we are forced to pass this information along to the
- decompressor.
- This is usually done by simply tacking the required information
- onto the front of the encoded data. For instance,
- the Huffman algorithm tacks onto the front of its encoded data
- the frequency histogram of the discrete symbols it compressed.
- This is just enough information for the decompressor to recreate
- the encoding used. LZW, on the other hand, is a little trickier
- and actually merges the encoded data with the information
- required for the decompressor to know the encoding.
- The trick is in how LZW defines a string. Simply, a string
- is either just an atom or a string followed by an atom. Thus,
- the sequence "abc" is a string because the atom 'c' follows the
- string "ab", which is a string because the atom 'b' follows the
- string "a", which is a string because it itself is an atom.
- Now, using this definition, the LZW algorithm reads in its
- data and creates a table of strings. Since every byte read in is
- an atom, adding the byte to the preceding string read in will
- create a new string one symbol longer. The only rule is to read
- in bytes, progressively making larger and larger strings, until a
- string is formed that is not already in the string table. When
- this happens, the last known string's encoding is output and the
- new string is added to the table. That's all there is to it!
- To start things off, we can immediately add to the string
- table the 256 atomic strings since the decompressor already knows
- these strings. Also, let's call the string already read in the
- predecessor and the last byte read in the follower. The
- predecessor plus the follower creates a new string that we must
- check to see if it already exists in the string table. As an
- example, let's compress the letters "ababc".
- The first byte read in, 'a', is a degenerate case since it
- has no predecessor. A null predecessor plus the follower 'a'
- makes the string "a", which is already in the string table, so we
- must continue. "a" now becomes the predecessor string and we
- read in 'b' as the follower. The string "ab" is not in the
- table, so we add it and output the code for the predecessor "a",
- which in ASCII is 97. The code given to the string "ab" is
- simply 256, that is, the next code available after the atomic
- strings 0-255.
- The new predecessor is now "b" and the next follower is 'a'.
- This again forms a string, "ba", which is not in the table, so we
- output the code for the predecessor, "b", which is 98, and add the
- string "ba" to the table with the code 257. Now, the predecessor
- is "a" and the next follower is 'b'. This time the string "ab"
- is in the table, so we continue. The predecessor becomes "ab" and
- the next follower is 'c'. This forms the string "abc", which is
- not in the table, so we output the code for the predecessor, which
- is 256, and add the string "abc" to the table. The new
- predecessor is "c", but now we are out of data, so there is no
- follower. This is another special case, and here we just output
- the code for the last predecessor, "c", which is 99.
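- Laid out step by step, the whole compression of "ababc" looks
- like this:
-
-     Predecessor:  Follower:  In table?  Output:  New entry:
-       (none)        'a'         yes        -          -
-       "a"           'b'         no         97      256 = "ab"
-       "b"           'a'         no         98      257 = "ba"
-       "a"           'b'         yes        -          -
-       "ab"          'c'         no        256      258 = "abc"
-       "c"          (end)        -          99          -
-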
- Thus, the encoding for the compressed sequence "ababc" is
- 97,98,256,99. Not so great for such a short sequence, but LZW
- does better as it creates a larger string table. Note how at
- least nine bits per code are needed, since new string codes start
- at 256, after the atomic strings. Also note how a code unknown to
- the decompressor is not used until the decompressor could figure
- out for itself what the code represents. The pseudo-code for the
- compression algorithm is as follows:
-
- Predecessor = Read first data byte
- While ( Follower = Read data byte exists )
-     if ( "Predecessor + Follower" is in string table )
-         Predecessor = Predecessor + Follower
-     else
-         Output code for Predecessor
-         Add "Predecessor + Follower" to string table
-         Predecessor = Follower
-
- Output code for Predecessor
-
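- As a concrete illustration of this pseudo-code, here is a minimal
- C sketch of the compression loop. It is not the author's
- implementation: it uses a slow linear scan of the string table
- instead of the hash table discussed later, it simply stops adding
- strings once the table is full, and output_code() just prints
- each code rather than packing it into nine or more bits.
-
-     #include <stdio.h>
-
-     #define TABLE_SIZE 4096     /* maximum number of string codes */
-
-     /* Each entry records the code of its predecessor string and
-        the follower byte that extends it (codes 0-255 are the
-        atomic strings and need no entry of their own).            */
-     int pred[TABLE_SIZE];
-     int foll[TABLE_SIZE];
-     int next_code = 256;        /* first code after the atoms     */
-
-     /* Linear scan for "predecessor + follower"; returns its code,
-        or -1 if that string is not yet in the table.              */
-     static int find_string(int predecessor, int follower)
-     {
-         int i;
-         for (i = 256; i < next_code; i++)
-             if (pred[i] == predecessor && foll[i] == follower)
-                 return i;
-         return -1;
-     }
-
-     /* Placeholder: a real compressor would pack each code into a
-        nine-bit (or wider) field in the output stream.            */
-     static void output_code(int code)
-     {
-         printf("%d\n", code);
-     }
-
-     void compress(FILE *in)
-     {
-         int predecessor, follower, code;
-
-         predecessor = getc(in);  /* first byte: no predecessor    */
-         if (predecessor == EOF)
-             return;              /* empty input, nothing to do    */
-
-         while ((follower = getc(in)) != EOF) {
-             code = find_string(predecessor, follower);
-             if (code != -1) {
-                 predecessor = code;         /* known string: grow */
-             } else {
-                 output_code(predecessor);
-                 if (next_code < TABLE_SIZE) {  /* add new string  */
-                     pred[next_code] = predecessor;
-                     foll[next_code] = follower;
-                     next_code++;
-                 }
-                 predecessor = follower;     /* restart at an atom */
-             }
-         }
-         output_code(predecessor); /* flush the final predecessor  */
-     }
-
-     int main(void)
-     {
-         compress(stdin);
-         return 0;
-     }
-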
- Each entry in the string table can simply be made up of the
- codes for the predecessor and follower strings that make up that
- entry. In our example above, the string table would be:
-
- Code:    Predecessor:     Follower:
-   0      NOPREDECESSOR       0        Atomic strings
-  ...                                        "
-  255     NOPREDECESSOR      255              "
-  256          97             98        String "ab"
-  257          98             97        String "ba"
-  258         256             99        String "abc"
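-
- In C, one plausible way to declare such an entry is shown below.
- This is a sketch: the field names and the NOPREDECESSOR sentinel
- are illustrative rather than taken from the DWC source, and the
- earlier loop sketch stored the same two fields in parallel arrays.
-
-     #define NOPREDECESSOR (-1)  /* marks the 256 atomic strings   */
-
-     /* A string-table entry: the code of the predecessor string
-        plus the follower atom that extends it.  The entry's own
-        code is simply its index in the table.                     */
-     struct string_entry {
-         int           predecessor;  /* code of preceding string   */
-         unsigned char follower;     /* atom appended to it        */
-     };
-
-     struct string_entry string_table[4096];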
-
- Although the string table could be scanned every time we need
- to check whether a string is present, this would be very slow, so
- we add a hash table. Given a predecessor code and a
- follower code, we can put these values through a hash function
- and obtain an index into the hash table. A hash table entry only
- needs to have the code of the string that it is associated with
- and a pointer to a linked list of other codes of strings that
- also hashed to the same hash table entry. The simplest way is to
- have the linked list use empty slots in the hash table itself.
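-
- A sketch of what that lookup might look like in C follows; the
- table size, hash mixing, and array names are illustrative (the
- pred[] and foll[] arrays are the string table from the earlier
- sketch), not details of the DWC implementation.
-
-     #define HASH_SIZE 5021     /* a prime larger than TABLE_SIZE  */
-     #define UNUSED    (-1)
-
-     extern int pred[], foll[]; /* string table from earlier       */
-
-     /* One hash slot: the code of the string that hashed here and
-        the index of the next slot in its collision chain.  Both
-        arrays start out filled with UNUSED; collision chains are
-        threaded through otherwise empty slots of this same table. */
-     static int slot_code[HASH_SIZE];
-     static int slot_next[HASH_SIZE];
-
-     /* Mix a predecessor code and follower byte into an index.    */
-     static int hash(int predecessor, int follower)
-     {
-         return ((predecessor << 5) ^ follower) % HASH_SIZE;
-     }
-
-     /* Return the code of "predecessor + follower", or UNUSED if
-        that string has not been seen yet.                         */
-     static int lookup(int predecessor, int follower)
-     {
-         int slot = hash(predecessor, follower);
-
-         while (slot != UNUSED && slot_code[slot] != UNUSED) {
-             int code = slot_code[slot];
-             if (pred[code] == predecessor && foll[code] == follower)
-                 return code;
-             slot = slot_next[slot];     /* follow collision chain */
-         }
-         return UNUSED;
-     }
-
- Inserting a new string follows the same path: hash the pair, walk
- the chain, and claim an empty slot when the end is reached.
-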
- Next time, I will elaborate on implementation details of the
- compression algorithm along with some optimization techniques.
-