home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
The Datafile PD-CD 1B
/
DATAFILE_PDCD1B.iso
/
_pocketbk
/
pocketbook
/
004
/
oplexamp_z
/
HUFFMAN.TXT
< prev
next >
Wrap
Text File
|
1994-05-07
|
3KB
|
54 lines
>Ah, excellent. The text in question is the bible, 4.5M long
>uncompressed. The dictionary derived from this is about 14000
>words, the dictionary file is 112k.
>I understand a 4M flash is due soon, but it will still be
>more costly than the 2M one of course.
You should definitely consider huffman-coding. For one thing, you need it.
I'm afraid you won't get sufficient compression without it - to fit in the
2M flash.
Secondly, it is very, very easy, because you have a known text. That means
you can hardwire the decoding tables, and the decoding logic will be
extremely simple.
How to do this: (this is actually a simplified version of an algorithm which
I think is in the comp.compression faq, the faq algorithm gives denser
tables with slightly more complex code).
Encode the most frequent x words, where x is the largest x for which the
longest huffman-code will fit in 8 bits (or 9 bits, or 7 bits, whatever
seems appropriate). Include in the decoding program a table of length 256
(2**8). To decode a symbol, just index the table with the 8 next bits in
input, to find in the table 1) The next symbol in input, and 2) The length
of this symbol. Advance input with the length, use the symbol, and repeat.
Writing the table: For each code c, eg. c=10111, representing a symbol s, do
for all possible values of xxx, let c' = xxxc = xxx11101
table[c'].length = length(c) = 5
table[c'].symbol = s
If I haven't said this already: Encode "letter-words" and "non-letters
words" separately. - That means with separate huffman tables, too.
Probably the space character will be so frequent that it's information
contents is a fraction of a bit, but spaces will still cost you 1 bit each
with huffman coding. Arithmetic coding would handle this, but is much more
complex. You might handle this heuristically.
The characters in literals (eg. in the dictionary) might also be
huffman-coded. If you don't do that, at least pack the 52 letters in 6
bits.
If you want your encoded text to include other things besides word indices,
such as literal words or end-of-text markers, then don't use flag bits:
This will be very wasteful, as word indices will be much more frequent than
anything else: Instead include these possibilities as "special symbols" in
the huffman code for word indices.
Hope this helps. Please give me some feedback on how this works for you;
frankly, I haven't implemented half of this myself.
Anders Munch | Department of Computer Science
juul@diku.dk | University of Copenhagen, Denmark