+HCU papers
Little essay about the various methods and viewpoints of crunching
~ version June 1999 ~
by Joa ----- Part. VIII (Burrows - Wheeler - Transformation (BWT)
Courtesy of fravia's page of reverse engineering
Well, Joa continues his fundamental paper on crunching, this is part VIII
enjoy!
12 December 98 |
Joa
| ~ |
crunchi1.htm
| Little essay about the various methods and viewpoints of crunching
| papers
| ~ | fra_0126 |
10 June 98 |
Joa
| ~ |
crunchi2.htm
| Little essay about the various methods and viewpoints of crunching II
| papers
| ~ | fra_0129 |
17 June 98 |
Joa
| ~ |
crunchi3.htm
| Little essay about the various methods and viewpoints of crunching III
| papers
| ~ | fra_012E |
17 June 98 |
Joa
| ~ |
crunchi4.htm
| Little essay about the various methods and viewpoints of crunching IV
| papers
| ~ | fra_012F |
17 June 98 |
Joa
| ~ |
crunchi5.htm
| Little essay about the various methods and viewpoints of crunching V
| papers
| ~ | fra_012F |
17 June 98 |
Joa
| ~ |
crunchi6.htm
| Little essay about the various methods and viewpoints of crunching VI
| papers
| ~ | fra_012F |
17 June 98 |
Joa
| ~ |
crunchi7.htm
| Little essay about the various methods and viewpoints of crunching VII
| papers
| ~ | fra_xxxx |
Little essay about the various methods and viewpoints of crunching.
Part VIII: Burrows - Wheeler - Transformation (BWT)
By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.
hello, hello,
well, shave my legs and call me Grandpa if that wasn't a long delay, but i still
have some serious difficulties accessing the Internet recently. I hope that i
can access you more frequently in the near future. And to sweet up the time till
then a little bit here my next essay about crunching methods:
THE BURROWS - WHEELER - TRANSFORMATION (tataaa :)
As we discussed last time the mechanism of Arithmetic Crunching some of you
asked me about the source (the input of the cruncher) and that sometimes it
would take the cruncher a long time to adjust to the source data. Yes, there is
a problem with this kind of crunching: the source itself should be transformed
to be better crunchable. But how do we transform a data to be better crunchable
with a fast algorithm?
Well, the arithmetic cruncher would love to crunch data like (all hex):
00, 01, 05, 02, 08, c2, 03, 01, c0, 0a, 12, 01, 02....
but how do we transform a sequence like {0a, f2, f2, 3c, 0a, 77, 44, 0a} into
such cruncherfriendly data?
Well, luckily for us, there is an algorithm available for us that does exactly
what we need. This alg is called Move-To-Front (MTF).
The MTF-Algorithm
MTF is - once you had the idea - a pretty simple algorithm. It transforms the
input data into some other data, it is reversible and it is pretty fast. The
basic idea is that we don't output the data itself but the index of the char
in some table (in our case the table has the usual possible 256 entries).
After the output we save the byte, shift the rest below one position up and
MOVE the saved byte TO the FRONT of the table. A simple char-Table will show
you what i mean:
Let's presume the input is {0a, f2, f2, 3c, 0a, 77, 44, 0a}
That's 8 entries. But we have only 5 distinct values. As we want a reference
table, we only need the distinct values, so for the sake of our example
we take a reference table with only 5 entries (which are - for the example -
filled with the distinct input values as starting values. In a real case the
table would have 256 entries filled with the 256 possible values for start):
Table Value
[ ]
0 0a
1 f2
2 3c
3 77
4 44
Now let's transform the first byte: 0a
We search our table for the value 0a and we find it at offset 00. And that's our
output: 00. Now we shift the rest one position upwards and move the 0a at offset
00 to offset 00. Luckily for us the loop terminates immediately and our table
looks now exactly like before:
Table Value
[ ]
0 0a
1 f2
2 3c
3 77
4 44
And now the second byte: f2
f2 is found at position 01 and this is our output: 01. Now we save the f2 at
position 01, move all other entries below the position 01 one up and move the
saved f2 to position 00. Let's watch the manipulation of the table in slow
motion:
Table Value Table Value Table Value
[ ] [ ] [ ]
0 0a f2 is saved. Copy 0 0a insert the -> 0 f2
1 f2 <- the entries below -> 1 0a saved f2 at 1 0a
2 3c one position up: 2 3c offset 00: 2 3c
3 77 3 77 3 77
4 44 4 44 4 44
The next byte is again: f2
As we find it at position 00 our output is 00. Our moving loop terminates
immediately and now let's have a look at our table and our output so far:
Table Value
[ ]
0 f2
1 0a
2 3c
3 77
4 44
Output so far: 00, 01, 00
Oh, ah, fascinating. Our input has surely been transformed into something more
likely to be crunched by an Arithmetic Cruncher system.
Let's continue with the next bytes:
Inputbyte is 3c
3c is found at offset 02. So we save the 3c at offset 02, move the rest below
position 02 one up and move 3c to offset 00:
Table Value Table Value Table Value
[ ] [ ] [ ]
0 f2 3c is saved. Copy 0 f2 insert the -> 0 3c
1 0a the entries below 1 f2 saved f2 at 1 f2
2 3c <- one position up: -> 2 0a offset 00: 2 0a
3 77 4 77 4 77
4 44 5 44 5 44
Output so far: 00, 01, 00, 02
The next input byte is 0a.
It is found at offset 02 which is our output. Now we move the 0a at position 02 to the front:
Table Value Table Value Table Value
[ ] [ ] [ ]
0 3c 0a is saved. Copy 0 3c insert the -> 0 0a
1 f2 the entries below 1 3c saved 0a at 1 3c
2 0a <- one position up: -> 2 f2 offset 00: 2 f2
3 77 4 77 4 77
4 44 5 44 5 44
Output so far: 00, 01, 00, 02, 02
Now for the next input byte: 77. 77 is found at offset 03 which is output. This
is how our table is altered:
Table Value Table Value Table Value
[ ] [ ] [ ]
0 0a 77 is saved. Copy 0 0a insert the -> 0 77
1 3c the entries below 1 0a saved 77 at 1 0a
2 f2 one position up: 2 3c offset 00: 2 3c
3 77 <- -> 4 f2 4 f2
4 44 5 44 5 44
Output so far: 00, 01, 00, 02, 02, 03
After the next input byte (44) our output is 04 and our table looks like this:
Table Value
[ ]
0 44
1 77
2 0a
4 3c
5 f2
Output so far: 00, 01, 00, 02, 02, 03, 04
After the final input byte (0a) our output is 02 and our table looks like this:
Table Value
[ ]
0 0a
1 44
2 77
4 3c
5 f2
Output so far: 00, 01, 00, 02, 02, 03, 04, 02
As you can see these output bytes are far more crunchable than the original
bytes. But is it really reversible? Let's retransform the first few bytes
together. At first we build the basic transformation table which is the same as
the cruncher uses (remember how cruncher and decruncher have to use the same
model?):
Table Value
[ ]
0 0a
1 f2
2 3c
3 77
4 44
Again, in a real version the table would be filled with the standard 256
possible values of the ASCII-TABLE.
Now let's retransform the first byte: 00
We look for the 00th byte in our table, find a 0a and output it. Then we take
the 00 as starting point for the shifting and moving and proceed as we already
know from the first pass. As the loop terminates immediately we can continue
with the next input byte.
The next input byte is a 01. Now, at offset 01 we find a f2, output it and
proceed: Save the char at offset 01, shift all entries below offset 01 one
position up and move the saved byte into position 00. After that our table looks
like this:
Table Value
[ ]
0 f2
1 0a
2 3c
3 77
4 44
As you can already see the table looks exactly like the table when we had our
first pass and transformed the char f2. If we would continue to scan thru the
rest of the input (the former output) we would see that the transformation
table would be exactly like the one in the first pass.
So, let's recapitulate: the MTF-Algorithm is fast, easy to implement and a good
preprocessor for Huffman or (even better) Arithmetic Crunching. It's also a nice
basic idea for crypting data.
Yes, but, ehm...
Yes, Watson?
Well, i thought, that this essay was about the Burrows-Wheeler-Transformation,
not about the MTF-algorithm.
You are right. But the authors of the BWT, Michael Burrows and David Wheeler
recommended that data should be crunched in such an order:
Data -> (maybe a simple RLE cruncher ->) BWT -> MTF -> Huffman or Arithmetic Cruncher
Ah, i see.
Yes, and now as the basic premises are all clear let's examine the BWT.
What is the BWT? Well, a short answer would be: an alg that sorts data, saves a
certain aspekt of the sorted data, saves a starting index and that's it.
What? No compression?
No, Watson, no compression. The transformation is just what it is: A
transformation. But think about it for a moment. A reversible transformation has
a big advantage for designers: You can inspect and/or manipulate the transformed
data in another way than you can with the original data. Take the MTF for
example. It transforms the input data into some other data. But this other data
is in most cases much better crunchable than the original data. And as the
process is completely reversible you don't have any disadvantage. You could
write a special cruncher sensitive to the output of MTF processes with which you
could achieve better results than achieved with normal data. You could say that
this kind of transforming data, manipulating that other data and retransforming
the data to the original kind is some form of indirection. Indirection is very
crucial for programmers (and philosophers); it enables you to solve and
discuss problems on a much more comfortable level than the original problem
stated.
Keep this in your mind: if you have some problems, viewing the problem from
a differenct angle gives you much more possibilities than trying to solve the
problem with brute force. But back to the BWT.
So, the BWT is just some kind of indirection. And it has to be reversible. But
how does this work together with sorting?
Well, the BWT algorithm gives you the possibility to unsort your data...
Yes, you read right. With BWT you can unsort the saved data you constructed by
BWT. OK, the data you save itself is not sorted, but is has some interesting
features:
1. It has in some way the FOLLOWING context of the original data.
2. It enables you to unsort the data and rebuild the original data.
3. The bigger the chunks of data you sort, the better your results for the
crunching stage.
So, speedwise it's a matter of how fast your sorting algorithm is.
crunchwise it's a matter of how much memory you have.
So, for example let's transform the word CRACKER.
To do this we need an array to all possible shifted forms of the word.
If you take the word CRACKER, it has 7 letters. So we can shift it
7 times, from shift 0 to shift 6. The result is something like this:
CRACKER
RACKERC
ACKERCR
CKERCRA
KERCRAC
ERCRACK
RCRACKE
How do arrange this? Well, one easy way would be to duplicate the source data
in RAM and just install enough charpointers (7 in that case).
< RAM-DATA >