- Only part one of this article was ever written. For further
- information on Lempel-Ziv compression and decompression, please
- refer to the source notes supplied with the DWC source code.
- -----------------------------------------------------------------
-
-
- Lempel-Ziv-Welch
- Compression Algorithm
-
- Dean W. Cooper
- Germantown, MD
-
- This is part one of a three-part article that explains a
- general-purpose compression algorithm that can compress text files
- by as much as 75%, and many other types of files as well. It is
- perhaps the best general-purpose algorithm and can be beaten
- consistently only by special-purpose algorithms designed for
- specific types of data.
-
- The Lempel-Ziv-Welch (LZW) compression algorithm is a simple
- and relatively fast way to compress many kinds of data including
- text files, object files and libraries, executable files, and
- graphic bitmaps. It is a general algorithm that assumes only one
- thing about the data it is compressing: that the data contains
- repeated sequences of symbols. Like the Huffman compression
- algorithm, LZW depends upon the frequency of occurrences of
- symbols. But where Huffman depends upon the frequency of
- discrete symbols, LZW depends upon the frequency of sequences of
- symbols.
- By definition, sequences of symbols are called strings. In an
- actual implementation, the set of all discrete symbols must be
- small, so usually the data is interpreted as a stream of bytes,
- allowing for 256 discrete symbols. These are called atomic
- strings or atoms for short.
- The general idea of LZW is to create a table of all strings
- that make up the data and then output an encoding for the
- strings. For example, let's look at this paragraph to see one
- way we could encode strings. It has seven sentences, so let's
- call them our strings. We can encode them simply enough: the
- first sentence is 1, the second is 2, etc. Thus, the encoded
- output is "1234567". That's all there is to the compressed data,
- that is, as long as the decompressor knows about the encoding we
- made. Unfortunately, we can't assume the decompressor knows our
- encoding, so we are forced to pass this information along to the
- decompressor.
- This is usually done by simply tacking the required information
- onto the front of the encoded data. For instance,
- the Huffman algorithm tacks onto the front of its encoded data
- the frequency histogram of the discrete symbols it compressed.
- This is just enough information for the decompressor to recreate
- the encoding used. LZW, on the other hand, is a little trickier
- and actually merges the encoded data with the information
- required for the decompressor to know the encoding.
- The trick is in how LZW defines a string. Simply, a string
- is either just an atom or a string followed by an atom. Thus,
- the sequence "abc" is a string because the atom 'c' follows the
- string "ab", which is a string because the atom 'b' follows the
- string "a", which is a string because it itself is an atom.
- Now, using this definition, the LZW algorithm reads in its
- data and creates a table of strings. Since every byte read in is
- an atom, adding the byte to the preceding string read in will
- create a new string one symbol longer. The only rule is to read
- in bytes, progressively making larger and larger strings, until a
- string is formed that is not already in the string table. When
- this happens, the last known string's encoding is output and the
- new string is added to the table. That's all there is to it!
- To start things off, we can immediately add to the string
- table the 256 atomic strings since the decompressor already knows
- these strings. Also, let's call the string already read in the
- predecessor and the last byte read in the follower. The
- predecessor plus the follower creates a new string that we must
- check to see if it already exists in the string table. As an
- example, let's compress the letters "ababc".
- The first byte read in, 'a', is a degenerate case since it
- has no predecessor. A null predecessor plus the follower 'a'
- makes the string "a", which is already in the string table, so we
- must continue. "a" now becomes the predecessor string and we
- read in 'b' as the follower. The string "ab" is not in the
- table, so we add it and output the code for the predecessor "a",
- which in ASCII is 97. The code given to the string "ab" is
- simply 256, that is, the next code available after the atomic
- strings 0-255.
- The new predecessor is now "b" and the next follower is 'a'.
- This again forms a string, "ba", which is not in the table, so we
- output the code for the predecessor, "b", which is 98, and add the
- string "ba" to the table with the code 257. Now, the predecessor
- is "a" and the next follower is 'b'. This time the string "ab"
- is in the table, so we continue. The predecessor becomes "ab" and
- the next follower is 'c'. This forms the string "abc", which is
- not in the table, so we output the code for the predecessor, which
- is 256, and add the string "abc" to the table. The new
- predecessor is "c", but now we are out of data, so there is no
- follower. This is another special case, and here we just output
- the code for the last predecessor, "c", which is 99.
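- Laid out step by step, the whole compression of "ababc" looks
- like this:
-
-     Predecessor:  Follower:  In table?  Output:  New entry:
-       (none)        'a'         yes        -          -
-       "a"           'b'         no         97      256 = "ab"
-       "b"           'a'         no         98      257 = "ba"
-       "a"           'b'         yes        -          -
-       "ab"          'c'         no        256      258 = "abc"
-       "c"          (end)        -          99          -
-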
- Thus, the encoding for the compressed sequence "ababc" is
- 97,98,256,99. Not so great for such a short sequence, but LZW
- does better as it creates a larger string table. Note how at
- least nine bits per code are needed, since new string codes start
- at 256, after the atomic strings. Also note how a code unknown to
- the decompressor is not used until the decompressor could figure
- out for itself what the code represents. The pseudo-code for the
- compression algorithm is as follows:
-
- Predecessor = Read first data byte
- While ( Follower = Read data byte exists )
-     if ( "Predecessor + Follower" is in string table )
-         Predecessor = Predecessor + Follower
-     else
-         Output code for Predecessor
-         Add "Predecessor + Follower" to string table
-         Predecessor = Follower
-
- Output code for Predecessor
-
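- As a concrete illustration of this pseudo-code, here is a minimal
- C sketch of the compression loop. It is not the author's
- implementation: it uses a slow linear scan of the string table
- instead of the hash table discussed later, it simply stops adding
- strings once the table is full, and output_code() just prints
- each code rather than packing it into nine or more bits.
-
-     #include <stdio.h>
-
-     #define TABLE_SIZE 4096     /* maximum number of string codes */
-
-     /* Each entry records the code of its predecessor string and
-        the follower byte that extends it (codes 0-255 are the
-        atomic strings and need no entry of their own).            */
-     int pred[TABLE_SIZE];
-     int foll[TABLE_SIZE];
-     int next_code = 256;        /* first code after the atoms     */
-
-     /* Linear scan for "predecessor + follower"; returns its code,
-        or -1 if that string is not yet in the table.              */
-     static int find_string(int predecessor, int follower)
-     {
-         int i;
-         for (i = 256; i < next_code; i++)
-             if (pred[i] == predecessor && foll[i] == follower)
-                 return i;
-         return -1;
-     }
-
-     /* Placeholder: a real compressor would pack each code into a
-        nine-bit (or wider) field in the output stream.            */
-     static void output_code(int code)
-     {
-         printf("%d\n", code);
-     }
-
-     void compress(FILE *in)
-     {
-         int predecessor, follower, code;
-
-         predecessor = getc(in);  /* first byte: no predecessor    */
-         if (predecessor == EOF)
-             return;              /* empty input, nothing to do    */
-
-         while ((follower = getc(in)) != EOF) {
-             code = find_string(predecessor, follower);
-             if (code != -1) {
-                 predecessor = code;         /* known string: grow */
-             } else {
-                 output_code(predecessor);
-                 if (next_code < TABLE_SIZE) {  /* add new string  */
-                     pred[next_code] = predecessor;
-                     foll[next_code] = follower;
-                     next_code++;
-                 }
-                 predecessor = follower;     /* restart at an atom */
-             }
-         }
-         output_code(predecessor); /* flush the final predecessor  */
-     }
-
-     int main(void)
-     {
-         compress(stdin);
-         return 0;
-     }
-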
- Each entry in the string table can simply be made up of the
- codes for the predecessor and follower strings that make up that
- entry. In our example above, the string table would be:
-
- Code:    Predecessor:     Follower:
-   0      NOPREDECESSOR       0        Atomic strings
-  ...                                        "
-  255     NOPREDECESSOR      255              "
-  256          97             98        String "ab"
-  257          98             97        String "ba"
-  258         256             99        String "abc"
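-
- In C, one plausible way to declare such an entry is shown below.
- This is a sketch: the field names and the NOPREDECESSOR sentinel
- are illustrative rather than taken from the DWC source, and the
- earlier loop sketch stored the same two fields in parallel arrays.
-
-     #define NOPREDECESSOR (-1)  /* marks the 256 atomic strings   */
-
-     /* A string-table entry: the code of the predecessor string
-        plus the follower atom that extends it.  The entry's own
-        code is simply its index in the table.                     */
-     struct string_entry {
-         int           predecessor;  /* code of preceding string   */
-         unsigned char follower;     /* atom appended to it        */
-     };
-
-     struct string_entry string_table[4096];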
-
- Although the string table could be scanned every time we need
- to check whether a string is present, this would be very slow, so
- we add a hash table. Given a predecessor code and a
- follower code, we can put these values through a hash function
- and obtain an index into the hash table. A hash table entry only
- needs to have the code of the string that it is associated with
- and a pointer to a linked list of other codes of strings that
- also hashed to the same hash table entry. The simplest way is to
- have the linked list use empty slots in the hash table itself.
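-
- A sketch of what that lookup might look like in C follows; the
- table size, hash mixing, and array names are illustrative (the
- pred[] and foll[] arrays are the string table from the earlier
- sketch), not details of the DWC implementation.
-
-     #define HASH_SIZE 5021     /* a prime larger than TABLE_SIZE  */
-     #define UNUSED    (-1)
-
-     extern int pred[], foll[]; /* string table from earlier       */
-
-     /* One hash slot: the code of the string that hashed here and
-        the index of the next slot in its collision chain.  Both
-        arrays start out filled with UNUSED; collision chains are
-        threaded through otherwise empty slots of this same table. */
-     static int slot_code[HASH_SIZE];
-     static int slot_next[HASH_SIZE];
-
-     /* Mix a predecessor code and follower byte into an index.    */
-     static int hash(int predecessor, int follower)
-     {
-         return ((predecessor << 5) ^ follower) % HASH_SIZE;
-     }
-
-     /* Return the code of "predecessor + follower", or UNUSED if
-        that string has not been seen yet.                         */
-     static int lookup(int predecessor, int follower)
-     {
-         int slot = hash(predecessor, follower);
-
-         while (slot != UNUSED && slot_code[slot] != UNUSED) {
-             int code = slot_code[slot];
-             if (pred[code] == predecessor && foll[code] == follower)
-                 return code;
-             slot = slot_next[slot];     /* follow collision chain */
-         }
-         return UNUSED;
-     }
-
- Inserting a new string follows the same path: hash the pair, walk
- the chain, and claim an empty slot when the end is reached.
-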
- Next time, I will elaborate on implementation details of the
- compression algorithm along with some optimization techniques.
-