============================================================================= LZW Compression by Bill Lucier (Blucier@ersys.edmonton.ab.ca, b.lucier1 on Genie) LZW is perhaps the most widely used form of data compression today. It is simple to implement and achieves very decent compression at a fairly quick pace. LZW is used in PKZIP (shrink),PKARC (crunch), gifs,V.42bis and unix's compress. This article will attempt to explain how the compression works with a short example and 6502 source code in Buddy format. Originally named lz78, it was invented by Jacob Ziv and Abraham Lempel in 1978 , it was later modified by Terry Welch to its present format. The patent for the LZW compression method is presently held by Unisys. LZW compresses data by taking phrases and compressing them into codes. The size of the codes could vary from 9 bits to 16 bits. Although for this implementation we will be using only 12 bits. As byte are read in from a file they are added to a dictionary. For a 12-bit implementation a dictionary will be 4k (2^12=4096) . Each entry in the dictionary requires five bytes, which will be built in the form of a tree. It is not a binary tree because each node may have more than two offsprings. In fact, because our dictionary can hold up to 4096 different codes it is possible to have one node with 3800 children nodes, although this is not likely to happen. The five bytes that make up our tree will be: The parent code: Each node has one and only one parent node. When the parent code is less then 255 it is the end of a phrase. The codes 0-255 do not actually exist in the tree. The following values do not appear either as they have special meaning: 256 : End of Stream-This marks the end of one compressed file 257 : This tells the decompressor to increase the number of bits its reading by one. 258 : Wipe out dictionary The code value : This is the code that will be sent to the compressor. The character : The value contained at this node. It have a value of 0-255. Initially we send out codes that are 9 bits long, this will cover the values 0-511. Once we have reached 511, we will need to increase the number of bits to write by 1. This will give room for code numbers 512-1023, or (2^10)-1. At this point we must ensure that the decompressor knows how bits to read in at once so a code number 257 is sent to indicate that the number of bits to be read is to be bumped up by one. The size of the dictionary is finite so at some point we do have to be concerned with what we will do when it does fill up. We could stop compiling new phrases and just compress with the ones that are already in the dictionary. This is not a very good choice, files tend to change frequently (eg. program files as they change from code to data) so sticking with the same dictionary will actually increase the size of the file or at best, give poor compression. Another choice is to wipe the dictionary out and start building new codes and phrases, or wipe out some of the dictionary leaving behind only the newer codes and phrases. For the sake of simplicity this program will just wipe out the dictionary when it becomes full. To illustrate how LZW works a small phrase will be compressed : heher. To start the first two characters would be read in. The H would be treated as the parent code and E becomes the character code. By means of a hashing routine (the hashing routine will be explained more fully in the source code) the location where HE should be is located. Since we have just begun there will be nothing there,so the phrase will be added to the dictionary. The codes 0-258 are already taken so we start using 259 as our first code. The binary tree would look something like this: node # 72 - H | node #3200 259 - E The node # for E is an arbitrary one. The compressor may not choose that location, 3200 is used strictly for demonstration purposes. So at node #3200 the values would be: Parent code - 72 code value - 259 character - E The node #72 is not actually used. As soon as a value less than 255 is found it is assumed to be the actual value. We can't compress this yet so the value 72 is sent to the output file(remember that it is sent in 9 bits). The E then becomes the parent code and a new character code ( H ) is read in. After again searching the dictionary the phrase EH is not found. It is added to the dictionary as code number 260. Then we send the E to the disk and H becomes the new parent code and the next E becomes the new character code. After searching the dictionary we find that we can compress HE into the code 259,we want to compress as much as possible into one code so we make 259 the parent code. There may be a longer string then HE that can be compressed. The R is read in as the new character code. The dictionary is searched for the a 259 followed a R, since it is not found it is added to the dictioary and it looks like this: node #72 - H node #69 - E | | node #3200 259 - E node #1600 260 - H | node #1262 261 - R Then the value 259 is sent to the output file (to represent the HE) and since that is the EOF the R is sent as well,as well as a 256 to indicate the EOF has been reached. Decompression is extremely simple. As long as the decompressor maintains the dictionary as the compressor did, there will be no problems,except for one problem that can be handled as an exceptional case. All of the little details of increasing the number of bits to read, and when to flush the dictionary are taken care of by the compressor. So if the dictionary was increased to 8k, the compressor would have to be set up to handle a larger dictionary, but the decompressor only does as the compressed file tells it to and will work with any size dictionary. The only problem would be that a larger dictionary will creep into the ram under the rom or possibly even use all available memory, but assuming that the ram is available the decompressor will not change. The decompressor would start out reading 9 bits at a time, and starts it free code at 259 as the compressor did. To use the above input from the compressor as an example, the output was: 72 - For the First H 69 - For the First E 259 - For the Compressed HE 82 - For the R 256 - Eof indicator To begin decompressing, two values are needed. The H and E are read in, (note they will both be 9 bits long). As they are both below 256 they are at the end of the string and are sent straight to the output file. The first free code is 259 so that is the value assigned to the phrase HE. Note when decompressing there is no need for the hashing routine, the codes are the absolute locations in the dictionary (i.e. If the dictionary was considered to be an array then the entry number 259 would be dictionary[259]), because of this, the code value is no longer needed. So the decompressor would have an entry that looks like this: Node # 259 Parent Code - H Character - E The decompressor will read in the next value (259). Because the node number is at the end of the compressed string we will have to take the code value and place it on a stack, and take them off in a Last-in,First-out (LIFO) fashion. That is to say that the first character to go on the stack (in this case the E) will be the last to come off. The size of the stack is dependent on the size of the dictionary, so for this implementation we need a stack that is 4k long. After all the characters from the string have been placed on the stack they are taken off and sent to the outputfile. There is one small error that is possible with LZW because of the way the compressor defines strings. Consider the compression dictionary that has the following in it: node # Code Parent character Value code ------ ------ ------ --------- 65 65 n/a A 723 259 65 C 1262 260 259 U 2104 261 260 T 2506 262 261 E Now if the compressor was to try to compress the string ACUTEACUTEA The compressor will find a match for the first five characters 'ACUTE' and will send a 262 to the output file. Then it will add the following entry to the dictionary: 3099 263 262 A Now it will try to compress the remaining characters, and it finds that it can compress the entire string with the code 263, but notice that the middle A, the one that was just added onto the end of the string 'ACUTE' was never sent to the output file. The decompressor will not have the code 263 defined in it's dictionary. The last code it will have defined will be 262. This problem is easily remedied though, when the decompressor does not have a code defined, it takes the first letter from the last phrase defined and tacks it onto the end of the last phrase. IE It takes the first letter (the A) from the phrase and adds it on to the end as code #263. This particular implementation is fairly slow because it reads a byte and then writes one, it could be made much faster with some buffering. It is also limited to compressing and decompressing one file at a time and has no error checking capabilities. It is meant strictly to teach LZW compression, not provide a full fledged compressor. And now for the code: SYS 4000 ; sys 999 on a 64 .DVO 9 ; or whatever drive used for output .ORG 2500 .OBJ "LZW.ML" TABLESIZE =5021 ; THE TABLESIZE IS ACTUALLY 5021, ABOUT 20% LARGER THEN 4K. THIS GIVES ; THE HASHING ROUTINE SOME ROOM TO MOVE. IF THE TABLE WAS EXACTLY 4K ; THERE WOULD BE FREQUENT COLLISIONS WHERE DIFFERENT COMBINATIONS OF ; CHARACTERS WOULD HAVE THE SAME HASH ADDRESS. INCREASING THE TABLE SIZE ; REDUCES THE NUMBER OF COLLISIONS. EOS =256 ; eos = End of stream This marks the end of file FIRSTCODE =259 MAXCODE =4096 BUMPCODE =257 ; Whenever a 257 is encountered by the decompressor it ; increases the number of bits it reads by 1 FLUSHCODE =258 TABLEBASE =14336 ; The location that the dictionary is located at DECODESTACK =9300 ; The location of the 4k LIFO stack ; ORG = DECOMPRESS FILE ; ORG + 3 = COMPRESS FILE JMP EXPANDFILE ;******************************** ; COMPRESSFILE ;******************************** COMPRESSFILE JSR INITDIC ; EMPTY THE DICTIONARY LDA #128 STA BITMASK LDA #0 STA RACK JSR GETCHAR ; GET A CHAR FROM THE INPUT FILE STA STRINGCODE ; INITIALIZE THE STRINGCODE (PARENT CODE) LDA #0 STA STRINGCODE+1 NEXTCHAR JSR GETCHAR STA CHARACTER JSR FINDNODE ; FINDNODE CALCULATES THE HASHED LOCATION OF LDA ($FE),Y ; THE STRINGCODE AND CHARACTER IN THE DICT. INY ; AND SETS $FE/$FF POINTING TO IT. IF THE ENTRY AND ($FE),Y ; HAS TWO 255 IN IT THEN IT IS EMPTY AND SHOULD CMP #255 ; BE ADDED TO THE DICTIONARY. BEQ ADDTODICT LDA ($FE),Y ; IT HAS A DEFINED PHRASE. STORE THE CODE VALUE IN STA STRINGCODE+1; THE PARENT CODE DEY LDA ($FE),Y STA STRINGCODE JMP EOF ADDTODICT LDY #0 - LDA NEXTCODE,Y STA ($FE),Y INY CPY #5 BNE - INC NEXTCODE ; INCREASE THE NEXTCODE BNE + INC NEXTCODE+1 + JSR OUTPUT LDA NEXTCODE+1 ; CHECK IF NEXTCODE=4096 IF SO THEN FLUSH THE CMP #>MAXCODE ; DICTIONARY AND START ANEW BNE CHECKBUMP LDA NEXTCODE CMP #FLUSHCODE ; DICTIONARY STA STRINGCODE+1 JSR OUTPUT JSR INITDIC JMP CHECKEOF CHECKBUMP LDA NEXTBUMP+1 CMP NEXTCODE+1 ; CHECKBUMP CHECK TO SEE IF THE NEXTCODE HAS BNE CHECKEOF ; REACHED THE MAXIMUM VALUE FOR THE CURRENT LDA NEXTBUMP ; NUMBER OF BITS BEING OUTPUT. CMP NEXTCODE ; FOR X BITS NEXTCODE HAS Y PHRASES BNE CHECKEOF ; -------- ----------------------- LDA #>BUMPCODE ; 9 511 STA STRINGCODE+1 ; 10 1023 LDA #EOS ; SEND A 256 TO INDICATE EOF STA STRINGCODE+1 LDA #FIRSTCODE STA NEXTCODE+1 LDA #512 STA NEXTBUMP+1 LDA #<512 STA NEXTBUMP LDA #TABLEBASE STA $FF LDA #TABLESIZE STA $FD - LDY #0 LDA #255 ; ALL THE CODE VALUES ARE INIT TO 255+256*255 STA ($FE),Y ; OR -1 IN TWO COMPLEMENT INY STA ($FE),Y CLC LDA #5 ; EACH ENTRY IN THE TABLE TAKES 5 BYTES ADC $FE STA $FE BCC + INC $FF + LDA $FC BNE + DEC $FD + DEC $FC LDA $FD ORA $FC BNE - RTS ;************************************ ; GETCHAR ;************************************ GETCHAR JSR $FFCC LDX #2 JSR $FFC6 JMP $FFCF ;************************************ ; OUTPUT ;************************************ OUTPUT LDA #0 ; THE NUMBER OF BITS OUTPUT CAN BE OF A VARIABLE STA MASK+1 ; LENGTH,SO THE BITS ARE ACCUMULATED TO A BYTE IS LDA #1 ; FULL AND THEN IT IS SENT TO THE OUTPUT FILE LDX CURRENTBITS DEX - ASL ROL MASK+1 DEX BNE - STA MASK MASKDONE LDA MASK ORA MASK+1 BNE + RTS + LDA MASK AND STRINGCODE STA 3 LDA MASK+1 AND STRINGCODE+1 ORA 3 BEQ NOBITON LDA RACK ORA BITMASK STA RACK NOBITON LSR BITMASK LDA BITMASK BNE + JSR $FFCC LDX #3 JSR $FFC9 LDA RACK JSR $FFD2 LDA #0 STA RACK LDA #128 STA BITMASK + LSR MASK+1 ROR MASK JMP MASKDONE ;****************************** ; FINDNODE ; THIS SEARCHES THE DICTIONARY TILL IT FINDS A PARENT NODE THAT MATCHES ; THE STRINGCODE AND A CHILD NODE THAT MATCHES THE CHARACTER OR A EMPTY ; NODE. ;******************************* ; THE HASHING FUNCTION - THE HASHING FUNCTION IS NEEDED BECAUSE ; THERE ARE 4096 X 4096 (16 MILLION) DIFFERENT COMBINATIONS OF ; CHARACTER AND STRINGCODE. BY MULTIPLYING THE CHARACTER AND STRINGCODE ; IN A FORMULA WE CAN DEVELOP OF METHOD OF FINDING THEM IN THE ; DICTIONARY. IF THE STRINGCODE AND CHARACTER IN THE DICTIONARY ; DON'T MATCH THE ONES WE ARE LOOKING FOR WE CALCULATE AN OFFSET ; AND SEARCH THE DICTIONARY FOR THE RIGHT MATCH OR A EMPTY ; SPACE IS FOUND. IF AN EMPTY SPACE IS FOUND THEN THAT CHARACTER AND ; STRINGCODE COMBINATION IS NOT IN THE DICTIONARY FINDNODE LDA #0 STA INDEX+1 LDA CHARACTER ; HERE THE HASHING FUNCTION IS APPLIED TO THE ASL ; CHARACTER AND THE STRING CODE. FOR THOSE WHO ROL INDEX+1 ; CARE THE HASHING FORMULA IS: EOR STRINGCODE ; (CHARACTER << 1) ^ STRINGCODE STA INDEX ; FIND NODE WILL LOOP TILL IT FINDS A NODE LDA INDEX+1 ; THAT HAS A EMPTY NODE OR MATCHES THE CURRENT EOR STRINGCODE+1 ; PARENT CODE AND CHARACTER STA INDEX+1 ORA INDEX BNE + LDX #1 STX OFFSET DEX STX OFFSET+1 JMP FOREVELOOP + SEC LDA #TABLESIZE SBC INDEX+1 STA OFFSET+1 FOREVELOOP JSR CALCULATE LDY #0 LDA ($FE),Y INY AND ($FE),Y CMP #255 BNE + LDY #0 RTS + INY - LDA ($FE),Y CMP STRINGCODE-2,Y BNE + INY CPY #5 BNE - LDY #0 RTS + SEC LDA INDEX SBC OFFSET STA INDEX LDA INDEX+1 SBC OFFSET+1 STA INDEX+1 AND #128 BEQ FOREVELOOP CLC LDA #TABLESIZE ADC INDEX+1 STA INDEX+1 JMP FOREVELOOP ;*************************** ; CALCULATE ; TAKES THE VALUE IN INDEX AND CALCULATES ITS LOCATION IN THE DICTIONARY ;**************************** CALCULATE LDA INDEX STA $FE LDA INDEX+1 STA $FF ASL $FE ROL $FF ASL $FE ROL $FF CLC LDA INDEX ADC $FE STA $FE LDA INDEX+1 ADC $FF STA $FF CLC LDA #TABLEBASE ADC $FF STA $FF LDY #0 RTS ;****************************** ; DECODESTRING ;****************************** DECODESTRING TYA ; DECODESTRING PUTS THE STRING ON THE STACK STA COUNT ; IN A LIFO FASHION. LDX #>DECODESTACK CLC ADC #EOS BNE + JMP CLOSE + JSR $FFCC LDX #3 JSR $FFC9 LDA OLDCODE ; Send oldcode to the output file JSR $FFD2 NEXT JSR INPUT LDA RETURN STA NEWCODE LDA RETURN+1 STA NEWCODE+1 CMP #1 ; All of the special codes Flushcode,BumpCode & EOS BNE ++ ; Have 1 for a MSB. LDA NEWCODE CMP #"C"ANDA$<>"E"THEN20 30 INPUT"NAME OF INPUT FILE";FI$:IFLEN(FI$)=.THEN30 40 INPUT"NAME OF OUTPUT FILE";FO$:IFLEN(FO$)=.THEN40 50 OPEN2,9,2,FI$+",P,R":OPEN3,9,3,FO$+",P,W" 60 IFA$="E"THENSYSEX 70 IFA$="C"THENSYSCO 80 END For those interested in learning more about data compression/decompression I recommend the book 'The Data Compression Book' written by Mark Nelson. I learned a great deal from reading this book. It explains all of the major data compression methods. (huffman coding, dictionary type compression such as LZW, arithmatic coding, speech compression and lossy graphics compression) Questions or comments are welcome, they may be directed to me at : Internet : Blucier@ersys.edmonton.ab.ca Genie : b.lucier1 ===============================================================================