home *** CD-ROM | disk | FTP | other *** search
- $Header: /usr/people/tcl/src/uutar/RCS/README,v 1.3 1993/09/12 00:40:52 tcl Exp $
-
- ----------- What are encode/decode?
-
- Encode and decode are utilities which encode binary data into
- printable format suitable for transmission via email, posting to
- usenet, etc. They are intended to replace the aging uuencode and
- uudecode.
-
- ----------- Features:
-
- Encode features a very flexible encoding scheme which allows the user
- to specify exactly which printable characters to use in the output.
- The default is to use all 95 printable characters in the encoding
- process, as this produces the least expansion of the input data.
- However, for cases such as file transfer to a mainframe or to a
- foreign country where some characters may be modified en route, these
- characters can simply be removed from the output character set.
- Encoding is possible with as few as 2 characters in the output
- character set.
-
- Regardless of how many characters are specified in the output
- character set, encode only expands the data by a factor very close to
- the theoretical limit for that number of characters. (see next
- section)
-
- My implementation is simple (less than 500 lines total without
- comments) and efficient (runs at a speed comparable to
- uuencode/uudecode)
-
- ----------- Some theory on file expansion during encoding:
-
- The number of bits required to encode n distinct values is log2(n)
- (log base 2 of n). For example, to encode 256 distinct values, you
- need log2(256) = 8 bits. Let's think of the input file before encoding
- as a raw stream of bits without byte boundaries. If we want to
- represent this data with 256 distinct characters, we will consume 8
- bits of the input bitstream per output character. This is how files
- are normally encoded. However, if we can't use all 256 output
- characters, we will consume fewer than 8 input bits per output
- character, and thus we will require more output characters to
- represent the input bitstream than if we had 256 output characters.
- Thus, the process of encoding a binary file in printable format will
- necessarily expand the file. For example if we use the 95 printable
- characters, we'll consume an average of log2(95) = 6.57 bits in the
- input stream for each output character. Thus the file will be expanded
- by a factor of log2(256)/log2(95) = log(256)/log(95) = 1.217 or 21.7%.
- Note that this is a theoretical figure. In practice, we can't
- subdivide bits, but this figure does provide a theoretical estimate of
- the smallest amount of expansion we can hope to get with n output
- characters. In practice some coding schemes should be able to do
- better for select cases, but for a very large sample space of random
- data, no encoding scheme should ever be able to do better than this
- theoretical limit.
-
- Uuencode maps 3 input characters to 4 output characters for an
- expansion of 33% (not including control information). Lately several
- encoding schemes which map 4 input characters to 5 output characters
- have popped up, for an expansion of 25%.
-
- An analysis of encode shows that the average expansion over a very
- large input file of random data is
- 8 / (pb - 2 + 2n/p)
- where n is the number of output characters, p is the smallest power of
- 2 greater than or equal to n, and pb is log2(p), or the number of bits
- needed to represent p values. A graph of this function for values of n
- from 2 to 256 shows a very close approximation of the theoretical
- expansion of log(256)/log(n). For example, for n = 95, the expansion
- factor is
- 8 / (7 - 2 + 2*95/128) = 1.234 or 23.4%
-
- Note that all expansion factors given above fail to take into account
- the addition of newline characters to limit output width.
-
- ----------- The encoding process:
-
- The encoding process used by encode is simply to throw away the byte
- boundaries in the input bitstream and insert new byte boundaries in
- such a manner that there are only n distinct "tokens" in the input
- stream where n is the number of output characters. These tokens can
- then be mapped one-to-one with the output characters, both during
- encoding and decoding. A good example of this process is uuencode,
- which discards the byte boundaries which occur every 8 bits and
- inserts byte boundaries every 6 bits. The result is a series of tokens
- with a maximum of 64 possible values, each of which is mapped
- one-to-one with the output character set of 64 printable characters.
- This process is trivial for any n which is a power of two, you simply
- insert byte boundaries every log2(n) bits. When n is not a power of 2,
- however, the process is somewhat more complicated.
-
- We can no longer insert the byte boundaries at regular intervals of b
- bits, since this would imply 2^b output characters. If we select b
- such that 2^b < n, then we aren't using all n output characters, and
- we're expanding the file more than necessary. On the other hand if we
- select b such that 2^b > n, we don't have enough output characters to
- encode the data. The solution is to start with the smallest b such
- that 2^b >= n and then eliminate some of the input tokens until there
- are exactly n of them, then we can map one-to-one with the output
- characters. Input tokens can be eliminated by taking two input tokens
- and combining them to form a single, shorter token. This is best
- explained by giving an example.
-
- Let's say we have 6 output characters. We start with 8 input tokens:
- 000,001,010,011,100,101,110,111
- This set of tokens has the property that any input bitstream can
- be broken down to a series of these tokens in exactly one way.
- Now let's combine two of the tokens. The tokens to be combined must
- have identical bits except for the last bit, and the process of
- combining strips that bit from the tokens. e.g. 110 and 111 can be
- combined into the token 11, so we now have the token set
- 000,001,010,011,100,101,11
- If we combine two more tokens, 100 and 101 -> 10, we get
- 000,001,010,011,10,11
- This token set still has the property that any input bitstream can be
- broken down into a series of these tokens in exactly one way, and
- since there are 6 of them, we can map one-to-one with the output
- character set.
-
- The standard for the generation of these tokens will be as follows:
- Start with 2^b distinct tokens of length b bits, where b is the
- smallest integer such that 2^b >= n, where n is the number of output
- characters. Then, as above, while there are more than n tokens of any
- length, replace the two numerically greatest b length tokens with a
- single b-1 length token such that the b-1 length token is equivalent
- to the b-1 most significant bits of either b length token. (It is
- asserted that at any time in the procedure, the two numerically
- greatest b length tokens differ only in the least significant bit).
-
- The standard for the one-to-one mapping between tokens and output
- characters will be as follows: tokens will be sorted such that all b
- length tokens come first, in numerical order, followed by all b-1
- length tokens, in numerical order. Output characters will be sorted by
- ascii code in numerical order. A one-to-one mapping will be
- established between these two sets.
-
- The standard for the checksum will be as follows: The checksum will be
- computed on the decoded data. It will be 32 bits wide. For each
- character read from the input file during encoding or written to the
- output file diring decoding, the checksum will first be rolled 7 bits
- to the left (the 7 bits which slide off the MSB end will be reinserted
- into the LSB end) and then the character will be xor'd onto the low
- order 8 bits of the checksum.
-
- ----------- Implementation:
-
- Decoding with this scheme is trivial: you simply map the printable
- character from the input to the corresponding variable length token,
- and then append that token to the decoded bitstream.
-
- Encoding is a bit more tricky however, since the token length is
- variable, and the input bitstream has no token boundaries in it. The
- solution is to set up a 256 element array which is indexed by the next
- 8 bits in the input bitstream. Note that these 8 bits are not
- necessarily byte-aligned in the input file. The indexed element in the
- array will indicate how many bits should be consumed in the input, and
- what printable character to append to the output. For example, in
- order to recognize the token 010, all elements of the array whose
- index is 010xxxxx for all xxxxx should be set up to indicate that 3
- bits were seen and give the printable character that maps to 010. The
- input bitstream will then be advanced by 3 bits and the operation is
- repeated, using the next 8 bits to index the array again.
-
- My implementation of this encoding process is fairly simplistic and
- incorporates no more than the basic functionality provided by
- uuencode/uudecode. It is intended primarily to introduce this encoding
- scheme to the public in the hopes that it will be widely adopted.
- Should such adoption occur, this file should be used as a standard
- reference for the encoding algorithm.
-
-