CRUNCH.TXT
Background on CRUNCH, from author Steven Greenberg's documentation.
---------
In order to make a practical stand alone "cruncher" that was easy
to use, especially for those already familiar with squeezers, some
header information had to be included in the resulting "crunched" file
(e.g. the filename of the original file, etc.). I have defined a header
based on the time-tested squeezed file format, with some necessary
changes and a few additions. The additions are mostly to ensure that
files crunched now can always be uncrunched by future versions of the
uncruncher, no matter what enhancements are made. Those
familiar with the MS-DOS ARC.xxx program have probably seen this idea in
action. More on this later.
Another slight problem with LZWCOM and LZWUNC had to do with the
question of termination. When the input file was exhausted during
compression, it was unlikely the output file was on a sector boundary.
No matter what the rest of the final output sector was padded with
("1A"'s were used), the uncruncher would try to uncrunch those bytes
(since all data is conceivably valid). This resulted in occasional
extra sectors of garbage following an otherwise properly decoded file.
While this did not usually cause a problem, it was certainly not
desirable.
I have chosen to handle the termination problem the same way it
was handled with squeezed files: by dedicating a unique code to
represent EOF (End Of File). By only allowing 4095 instead of 4096
different codes (not a major shortcoming), code 000 can become a
dedicated EOF. As soon as it is encountered on the input file, the
decoding process is known to be complete. For those who are interested,
the exact code put out by CRUNCH can be duplicated by the "C" program
LZWCOM if table entry zero is "artificially" flagged as "used" (before
initializing the table). That ensures that the code will never come up,
except when manually inserted at the end of file.
The other functional difference from LZWCOM involves repeat byte
coding. CRUNCH converts the "physical input stream" into a "logical
input stream", which is then handed to the cruncher. The conversion
takes 3 or more contiguous occurrences of the same byte and encodes them
as <byte> "90H" <count> where "count" is the number of "additional"
occurrences of <byte> (i.e. total occurrences - 1). 90H itself is encoded
as "90H", "00". This scheme is identical to that used in standard
squeezing.
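     As a concrete illustration, here is a minimal C sketch of that
physical-to-logical conversion. Two details are assumptions of the
sketch rather than documented CRUNCH behavior: runs longer than 256
bytes are split into multiple tokens, and a literal 90H is never
run-encoded, only escaped as 90H 00.

    #include <stdio.h>

    #define DLE 0x90   /* repeat-byte marker shared with the squeezer format */

    static void rle90_encode(FILE *in, FILE *out)
    {
        int c = fgetc(in);
        while (c != EOF) {
            long run = 1;
            int next;
            while ((next = fgetc(in)) == c)
                run++;                          /* measure the whole run */

            if (c == DLE) {                     /* a literal 90H is sent as 90H 00 */
                while (run-- > 0) {
                    fputc(DLE, out);
                    fputc(0x00, out);
                }
            } else {
                while (run > 0) {
                    long chunk = run > 256 ? 256 : run;
                    if (chunk >= 3) {           /* long enough for a repeat token */
                        fputc(c, out);
                        fputc(DLE, out);
                        fputc((int)(chunk - 1), out);
                    } else {                    /* too short: copy literally */
                        long i;
                        for (i = 0; i < chunk; i++)
                            fputc(c, out);
                    }
                    run -= chunk;
                }
            }
            c = next;                           /* first byte of the next run */
        }
    }

    int main(void)
    {
        rle90_encode(stdin, stdout);
        return 0;
    }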
Crunching requires only one pass through the input file, while
squeezing requires two. While this is one of its significant
advantages, it does complicate the problem of including a checksum, if
one is desired, in the header of the result file (since the value is
not known until everything is done). A bad solution is to close the
finished output file, re-open it, insert the checksum, and close it
again. A good solution is to put the checksum at the end of the output
file right after the EOF. And that's where it is. With all this in
mind, herein follows a specification for the format of a crunched file.
---------------------------------------------
ID FIELD: Bytes 0 and 1 are always 076H and 0FEH, respectively.
This identifies the file as "crunched".
FILENAME: The filename field starts at byte 2. It is a field of
variable length, terminated by a zero byte. The field contains the
filename as ASCII characters, including an ASCII "." immediately
preceding the filename's extension. Fewer than eight characters may
precede the "."; there is no need to pad the filename with blanks.
Additional characters after the third extension character but before the
zero byte are specifically allowed and will be ignored by the current
uncruncher. This allows an area of unlimited size for date stamping, or
other miscellaneous information which a future cruncher or application
program might want to insert, for use or display by some uncrunching
program. By skipping over these bytes now, future incompatibilities are
eliminated.
After the zero byte come the following 4 bytes, in order:
REFERENCE REVISION LEVEL: 1 byte }
SIGNIFICANT REVISION LEVEL: 1 byte } described later
ERROR DETECTION TYPE: 1 byte }
SPARE: 1 byte }
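     A rough C sketch of a reader for the header as described so far
follows. The struct, field names, and buffer size are inventions of
the sketch; only the byte layout comes from the description above.

    #include <stdio.h>

    struct crunch_header {
        char filename[128];         /* zero-terminated name, "." included */
        unsigned char ref_rev;      /* reference revision level   */
        unsigned char sig_rev;      /* significant revision level */
        unsigned char err_type;     /* error detection type       */
        unsigned char spare;
    };

    /* Returns 0 on success, -1 if the ID field is wrong or the file is
     * truncated. */
    static int read_crunch_header(FILE *fp, struct crunch_header *h)
    {
        int b0 = fgetc(fp), b1 = fgetc(fp);
        if (b0 != 0x76 || b1 != 0xFE)
            return -1;                      /* not a crunched file */

        size_t n = 0;
        int c;
        while ((c = fgetc(fp)) > 0) {       /* filename runs up to the zero byte */
            if (n + 1 < sizeof h->filename)
                h->filename[n++] = (char)c; /* date stamps etc. kept if they fit */
        }
        if (c == EOF)
            return -1;                      /* ran out before the zero byte */
        h->filename[n] = '\0';

        int r[4];
        for (int i = 0; i < 4; i++)
            if ((r[i] = fgetc(fp)) == EOF)
                return -1;
        h->ref_rev  = (unsigned char)r[0];
        h->sig_rev  = (unsigned char)r[1];
        h->err_type = (unsigned char)r[2];
        h->spare    = (unsigned char)r[3];
        return 0;                           /* the crunched codes start here */
    }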
CRUNCHED OUTPUT: After the SPARE byte, the actual crunched output
finally begins. The crunched output is a series of 12-bit codes in
"natural" order. (Every other 12-bit code starts on a byte boundary
and includes the 4 ms bits of the next byte. The "odd" codes start in
the middle of a byte and include the whole following byte as the
remaining 8 ls bits). A 12-bit code of 000 is an EOF, as explained
above. If the EOF code itself ends in the middle of a byte, an
additional 4 bits of zero are padded on to get back on a byte boundary
for the checksum.
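     For reference, the packing described above works out to the
following, shown as a small self-contained C example (the helper
names are mine). This is the fixed 12-bit layout of this spec; the
later notes on CRUNCH v2.x describe variable-width codes instead.

    #include <stdio.h>

    /* Pack two 12-bit codes into three bytes in "natural" order: the first
     * code fills byte 0 plus the high nibble of byte 1; the second fills
     * the low nibble of byte 1 plus all of byte 2. */
    static void pack_pair(unsigned a, unsigned b, unsigned char out[3])
    {
        out[0] = (unsigned char)(a >> 4);
        out[1] = (unsigned char)(((a & 0x0F) << 4) | (b >> 8));
        out[2] = (unsigned char)(b & 0xFF);
    }

    static void unpack_pair(const unsigned char in[3], unsigned *a, unsigned *b)
    {
        *a = ((unsigned)in[0] << 4) | (unsigned)(in[1] >> 4);
        *b = ((unsigned)(in[1] & 0x0F) << 8) | in[2];
    }

    int main(void)
    {
        unsigned char buf[3];
        unsigned a, b;

        pack_pair(0x123, 0x000, buf);   /* 0x000 is the dedicated EOF code */
        unpack_pair(buf, &a, &b);
        printf("%03X %03X <- bytes %02X %02X %02X\n",
               a, b, (unsigned)buf[0], (unsigned)buf[1], (unsigned)buf[2]);
        return 0;
    }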
CHECKSUM: The next two bytes are the 16-bit checksum, least
significant byte first. The checksum is the modulo 2^16 sum of all the
bytes as input from the physical input stream, prior to repeat byte
encoding (or, in the case of uncrunching, as output to the physical
output stream, after repeat byte decoding).
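     In C terms a checksum of this kind amounts to nothing more than
the following (the function names are mine):

    #include <stdio.h>

    /* Modulo-2^16 running sum of the physical (un-encoded) byte stream. */
    static unsigned short crunch_checksum(FILE *fp)
    {
        unsigned short sum = 0;
        int c;
        while ((c = fgetc(fp)) != EOF)
            sum = (unsigned short)(sum + c);    /* wraps at 65536 */
        return sum;
    }

    /* Appended after the EOF code, least significant byte first. */
    static void put_checksum(unsigned short sum, FILE *out)
    {
        fputc(sum & 0xFF, out);
        fputc((sum >> 8) & 0xFF, out);
    }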
REMAINDER OF THE SECTOR: The remaining bytes in the sector
following the checksum are irrelevant. CRUNCH fills them with "1A"'s.
---------------------------------------------
These are the four bytes not fully described above:
"Reference Revision Level": The program/revision level of the
program that performed the crunch operation. This byte is put in for
general reference only. The current value is "20" (hex).
"Significant Revision Level": If the value of this byte in a
crunched data file exceeds the value contained within the uncrunching
program, the message "File requires newer revision of program" will be
displayed. If changes or enhancements are ever made to CRUNCH which
are significant enough to actually output an incompatible file, the
information in this byte will allow a new revision of UNCR to be
compatible with all existing data files, old or new. The error message
gets displayed only if someone tries to uncrunch a new file with an old
uncruncher which doesn't know about the "future" format yet. Current
value is "23" (hex).
"Error Detection Type": If this value is non-zero, the current
uncruncher will not examine the checksum or give an error associated
with it. This will permit a CRC type (or no error checking) value to be
used if circumstances warrant it. The current UNCR program is always
checking for "illegal" codes, which are ones which have not been defined
by previous codes. If any are encountered, the message "Invalid
Crunched File" is displayed. This inherent self-checking probably
precludes the necessity of more advanced error checking.
"Spare": The SPARE byte is a spare byte.
Notes from CRUNCH 2.3.
CRUNCH 1.x maintained a table representing up to 4096 strings of
varying lengths using the so called LZW algorithm, which has been
described in the earlier documentation. These strings were entered
into a table in a manner where the string's content was used to
determine the physical location (hashing), and that location was used
as the output code. Hash "collisions" were resolved by maintaining
another 12 bits of data per entry which was a "link", or pointer to
another entry.
In contrast, CRUNCH 2.x uses an "open-addressing, double hashing"
method similar to that employed in the UNIX COMPRESS. This method
involves a table of length 5003, where only 4096 entries are ever made,
ensuring the table never gets much more than about 80% full. When a
hash collision occurs, a secondary hash function is used to check a
series of additional entries until an empty entry is encountered. This
method creates a table filled with many criss-crossed "virtual" chains,
without the use of a "link" entry in the table.
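     A condensed sketch of that kind of table, in C, might look like
the following. The hash and step functions shown are illustrative
stand-ins in the style of UNIX COMPRESS, not CRUNCH's actual functions.

    #define TABLE_SIZE 5003     /* prime, comfortably larger than the 4096 codes */
    #define EMPTY      (-1)

    /* One slot holds the (prefix code, appended byte) pair that defines a
     * string, plus the code number assigned to it.  No link field needed. */
    struct entry {
        int prefix;             /* code of the string minus its last byte */
        int suffix;             /* the last byte */
        int code;               /* assigned code number, or EMPTY */
    };

    static struct entry table[TABLE_SIZE];

    static void table_init(void)
    {
        int i;
        for (i = 0; i < TABLE_SIZE; i++)
            table[i].code = EMPTY;
    }

    /* Open-addressing, double-hashing lookup: returns the slot holding
     * (prefix, suffix) if it exists, otherwise the empty slot where it
     * should be inserted.  Because the table never gets much past 80%
     * full and 5003 is prime, the probe sequence always terminates. */
    static int probe(int prefix, int suffix)
    {
        int h    = ((suffix << 6) ^ prefix) % TABLE_SIZE;   /* primary hash   */
        int disp = (h == 0) ? 1 : TABLE_SIZE - h;           /* secondary step */

        while (table[h].code != EMPTY &&
               (table[h].prefix != prefix || table[h].suffix != suffix)) {
            h -= disp;                      /* follow the "virtual" chain */
            if (h < 0)
                h += TABLE_SIZE;
        }
        return h;
    }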
One reason this is important is that [without using any additional
memory] the 1 1/2 bytes which were previously allocated as a link can
now become the [output] code number. This enables us to assign code
numbers, which are kept right alongside the entry itself, independently
of the entry's physical location. This allows the codes to be assigned
in order, permitting the use of 9-bit representations until there are
512 codes in the table, after which 10-bit representations are output,
etc.
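     In code, the essential bookkeeping is small. The bit order used
below (low bits first) is an assumption of the sketch; the widening
rule is the one described in the paragraph above.

    #include <stdio.h>

    static unsigned long bitbuf = 0;    /* bits waiting to be written */
    static int           bitcnt = 0;

    /* Append one code of the given width to the output, low bits first. */
    static void put_code(unsigned code, int width, FILE *out)
    {
        bitbuf |= (unsigned long)code << bitcnt;
        bitcnt += width;
        while (bitcnt >= 8) {
            fputc((int)(bitbuf & 0xFF), out);
            bitbuf >>= 8;
            bitcnt -= 8;
        }
    }

    /* Smallest width that can represent every code assigned so far:
     * 9 bits while the highest code is below 512, then 10, 11, 12. */
    static int code_width(int max_code)
    {
        int w = 9;
        while (max_code >= (1 << w) && w < 12)
            w++;
        return w;
    }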
The variable bit length method has three ramifications. It is
particularly helpful when encoding very short files, where the table
never even fills up. It also provides a fixed additional savings (not
insubstantial) even when the table does fill up. Thirdly, it reduces
the overhead associated with an "adaptive reset" to the point where it
becomes a very viable alternative. "Adaptive reset" simply means
throwing out the whole table and starting over. This can be quite
advantageous when used properly. CRUNCH v2.x employs this provision,
which was not incorporated in the V1.x algorithm.
"Code reassignment" is an advancement the author of CRUNCH, Steven
Greenberg, has introduced with the release of CRUNCH v2.0 based on
original work. It is not used in COMPRESS, any MS-DOS ARC program, or
any other data compression utility currently available. There are many
ways one might go about this (and at least as many possible pitfalls).
The algorithm selected seemed to represent a good tradeoff between
speed, memory used, and improved performance, while maintaining
"soundness of algorithm" (ie it works).
Briefly, it works as follows: Once the table fills up, the code
reassignment process begins. (At this same time, the possibility of
adaptive reset is also enabled). Whenever a new code would otherwise
be made (if the table weren't full), the entries along the hash chain
which would normally contain the entry are scanned. The first, if any,
of those entries which was made but never subsequently referenced is
bumped in favor of the new entry. The uncruncher, which would not
normally need to perform any hash type function, has an auxiliary
physical to logical translation table, where it simulates the hashing
going on in the cruncher. In this fashion it is able to exactly
reproduce the reassignments made by the cruncher, which is essential.
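     One plausible shape for that reassignment step, building on the
double-hashing sketch given earlier plus a per-entry "referenced"
flag, is shown below. The bump policy and the flag handling are a
reading of the description above, not the author's code.

    #define TABLE_SIZE 5003
    #define EMPTY      (-1)

    struct entry {
        int prefix, suffix;     /* the string this slot represents */
        int code;               /* assigned code number, or EMPTY */
        int referenced;         /* set when a longer string is later built on it */
    };

    static struct entry table[TABLE_SIZE];

    /* Called once the table is full, where a new entry for (prefix, suffix)
     * would otherwise have been made.  Walk the probe chain that entry
     * would occupy and take over the first slot whose entry was created
     * but never referenced since; its code number is recycled.  Returns
     * the recycled code, or EMPTY if nothing on the chain can be bumped. */
    static int reassign(int prefix, int suffix)
    {
        int h    = ((suffix << 6) ^ prefix) % TABLE_SIZE;
        int disp = (h == 0) ? 1 : TABLE_SIZE - h;

        while (table[h].code != EMPTY) {
            if (!table[h].referenced) {
                int code = table[h].code;
                table[h].prefix     = prefix;
                table[h].suffix     = suffix;
                table[h].referenced = 0;
                return code;
            }
            h -= disp;
            if (h < 0)
                h += TABLE_SIZE;
        }
        return EMPTY;
    }

The uncruncher would carry out the same walk on its side, using its
auxiliary physical-to-logical table to map the recycled slot back to
the recycled code, so that both ends stay in step.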