Text File | 1993-01-07 | 24KB | 334 lines
Introduction to VVencode/VVdecode
Table of Contents
1 Introduction to VVcode
1.1 The Aston Archive
1.2 Specification for a Coding Scheme
1.3 The Search Commences
1.3.1 Portable Coding Schemes
1.3.2 Platform Specific Coding Schemes
1.4 VVcode is Born
1.4.1 A Production VVcode
1.4.2 Arguments against VVcode
1.5 Availability of VVcode
1 Introduction to VVcode
Reliable and faithful exchange of binary files between computers over networks is a well-known
problem, especially if the computers use different operating systems and are connected to different
networks via a gateway. Unfortunately inter-networking and electronic mail are very much children
of the 1960s: they might have had to wait until the 1970s for their naissance, but their progenitors
were mentally locked-in to the concept of the 7-bit ASCII code for conveying textual information.
The TeX community has long been aware of this problem when trying to exchange "machine-
independent" `.dvi' files and font-related data such as `.tfm' and `.pk' files. It has sometimes
been possible to exchange this binary data by using encoding schemes that allow the data to be
represented using a subset of the seven-bit ASCII character set.
Academics and authors in many fields have hitherto been able to pass `.tex' files back and forth
by electronic mail: apart from a few minor quirks and blemishes, such TeX source files
pass unharmed across the planet's networks. Problems are encountered when mail passes through
certain gateway machines which introduce irreversible character corruptions. Particularly notorious
is the Janet/Bitnet gateway which has the unfortunate habit of converting `^' to `"' and `"' to `%':
since it leaves `%' itself unaffected, this makes recovery of the original file a non-trivial exercise. It
sometimes also changes the brace characters `{}' into odd characters above 128: this is particularly
embarrassing, of course, for `.tex' files!
For some years many TeX users, particularly those working in languages other than English,
and thus familiar with character set encodings containing other than the basic ASCII set, have been
agitating for TeX to be able to handle input in their mother tongues, using their own languages'
character sets. In 1989, Knuth announced TeX V3, and implementors world-wide beavered away
to bring each implementation up-to-date. TeX V3 now supports eight-bit character sets and so
`.tex' source files are now effectively `binary' files and will therefore suffer from the same exchange
problems experienced with `.dvi' files.
All those authors that had previously been able to cooperate, despite being separated by
hundreds or thousands of miles, might once again be forced to entrust floppy disks to the vagaries of
the world's postal systems (although one shouldn't underestimate the bandwidth of the Royal [or
other] Mail system).
Unless or until the various e-mail protocols, networks and software are converted to support
uncorrupted transmission of character codes 0x20...0x7e and 0xa1...0xfe, it will have to become
the norm for `.tex' sources to be encoded for transmission by e-mail.
This problem is of course well known outside the TeX community.
1.1 The Aston Archive
The author is a volunteer assistant to Peter Abbott in running the world's principal repository
of TeX-related material at Aston University in Birmingham. The archive (host: TeX.Ac.Uk) holds
several hundred megabytes of text and binary files including:
o program sources for TeX, METAFONT, DVI drivers and many other utilities;
o binary executables for a variety of popular operating systems (e.g. Atari, Macintosh, MS-DOS,
Unix, VAX/VMS and VM/CMS);
o METAFONT sources for Computer Modern and other fonts;
o binary font files (mainly `.tfm' and `.pk') for a number of different output devices;
o text macro and style files.
The archive provides access to these files via the following services:
o NIFTP [1] from Janet hosts: typically 300 megabytes of data are transferred every month; this
would probably be much greater if we were not limited by the bandwidth of our 9600Bd
connection to Janet.
o FTP and Telnet access from Internet hosts.
o Interactive browsing service via Janet PAD, including the facility to send files out using NIFTP
(and later FTP).
o Interactive browsing service via dialup modem lines, including the facility to download files
using Kermit and similar protocols.
o An e-mail file server which typically sends 150 megabytes of data per month to sites all over
the world (though predominantly to EARN/Bitnet sites).
o A magnetic media distribution service via surface carriers. Copies of the entire archive have
been sent to embryonic TeX communities in Czechoslovakia, Hungary and Poland.
We have experienced many problems trying to support all of these file types, operating systems
and access methods. The e-mail file server clearly needs a reliable method of encoding files if its
many customers are not to be denied access to the non-text files in the archive.
Binary files such as `.pk' font files are stored in different ways to accommodate the requirements
of the different operating systems supported. Currently we maintain multiple font directory trees for
the Macintosh, MS-DOS, Unix and VAX/VMS with all the attendant problems of synchronization,
disk space and archivists' time. We need a single storage format which allows export to all of our
supported operating systems.
_________________________________________
[1] Network Independent File Transfer Protocol: in the UK, one does not perform the pseudo-login
that Internet users are accustomed to using with the FTP protocol: instead, one issues a
"transfer request" for a file to be sent to or from the remote machine; the transfer itself takes
place asynchronously. One nice consequence is that such transfers can be queued for overnight
execution, leaving daytime bandwidth free for e-mail and true remote interactive logins.
1.2 Specification for a Coding Scheme
In mid-1990, the archivists came to the conclusion that a universal encoding scheme was required
to accommodate the many different kinds of file and file organizations that needed to be supported
by the archive.
Niel Kempson formulated the first draft of this specification in mid-1990; the requirements of
the encoding scheme may be summarized as follows:
Preserving File Structure
It is insufficient, especially for an archive holding text and binary files for a variety of
machine types, merely to encode data simply as a stream of bytes:
o Virtually all operating systems (except Unix) make a distinction between binary
and text files, so the coding system should recognize and maintain this distinction.
o Unix and most PC-based operating systems treat files as streams of bytes with
no further structure imposed. On the other hand, certain widely-used operating
systems (e.g. VAX/VMS and VM/CMS) have record-oriented file systems where
different types of file are stored in a format appropriate to the type of file [2].
For these operating systems, we consider it essential that the encoding scheme
should identify, preserve and record the most commonly used file organizations.
The decoding program should be able to use this information to create the output
file using the organization appropriate to the operating system in use. If the
information is of no consequence to the receiving system, the default file structure
(if any) should be created. If the encoding system does not have structure in
its files, the receiving system may provide suitable defaults automatically. In all
cases the programs should permit the user to override or supplement file structure
information.
o Whenever possible, these details of structure should be determined automatically
by the encoding program; at the very least, an indication of whether the file is text
or binary shall be provided, even under an operating system such as Unix that need
make no such distinction for its own use, to allow decoding to an appropriate file
organization on those systems that do make such a distinction.
Coding Scheme
Whatever method is used must allow encoded data to be e-mailed:
o It should be possible to specify the coding table to be used to encode the data.
The coding table used shall be recorded with each part of the encoded data.
o If a recorded coding table is found while decoding, it should be used to construct
an appropriate decoding table. Simple one-to-one character corruptions should be
corrected as long as only one of the input characters is mapped to any one output
character.
o The recommended encoding uses only the following characters:
+-0123456789
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Such an encoding, as originally used for XXcode, has been shown to pass successfully
through all the gateways which are known to corrupt characters.
_________________________________________
[2] It is often argued that the increase in efficiency more than offsets the increase in complexity.
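The coding-table rules under "Coding Scheme" can be illustrated with a minimal Python sketch. It is not taken from the VVcode sources: the function names and the simulated gateway are hypothetical, and the sketch assumes that the table is transmitted as a single 64-character string in value order. Because the table passes through the same gateway as the data, any one-to-one corruption is undone simply by decoding with the table as received.

```python
# The recommended 64-character table quoted above (6 bits per character).
STANDARD_TABLE = (
    "+-0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
)


def build_decode_table(received_table, standard_table=STANDARD_TABLE):
    """Map each received character to its 6-bit value.

    received_table is the recorded table as it arrived, possibly corrupted.
    Raises ValueError if the corruption cannot be reversed.
    """
    if len(received_table) != len(standard_table):
        raise ValueError("table has the wrong length")
    if len(set(received_table)) != len(received_table):
        # Two different input characters arrived as the same output
        # character, so the mapping cannot be inverted.
        raise ValueError("two table characters were merged; cannot correct")
    return {ch: value for value, ch in enumerate(received_table)}


def decode_values(encoded, decode_table):
    """Translate encoded characters into their 6-bit values."""
    return [decode_table[ch] for ch in encoded]


def corrupt(text):
    """Hypothetical gateway that rewrites every '+' as '!'."""
    return text.replace("+", "!")


# Both the table and the data suffer the same corruption in transit.
table = build_decode_table(corrupt(STANDARD_TABLE))
values = decode_values(corrupt("+a+Z"), table)
```

A corruption that maps two table characters onto one (for example `+' onto `-') is rejected, since no decoding table can then tell the two apart.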
Integrity of Encoded Data
We want to ensure that the whole encoded file passes through the e-mail network.
o Encoded lines should be prefixed by an appropriate character string to distinguish
them from unwanted lines such as mail headers and trailers. Whilst not essential,
this feature does assist the decoding program in ignoring these spurious data.
o Lines should not end with whitespace characters as some mailers and operating
systems strip off trailing whitespace.
o The encoding program should calculate parameters of the input file such as the
number of bytes and CRC and record them at the end of the encoded data.
The decoding program should calculate the same parameters from the decoded
data and compare the values obtained with those recorded at the end of the
encoded data.
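The integrity checks above might look roughly as follows in Python. This is only a sketch under stated assumptions: the line prefix `vv:' is invented for illustration, and the CRC-32 computed by Python's zlib module stands in for whichever CRC VVcode actually uses, which this document does not specify.

```python
import zlib

PREFIX = "vv:"  # hypothetical marker, not VVcode's real line prefix


def summarize(data):
    """Return the (byte count, CRC) pair to record after the encoded data."""
    return len(data), zlib.crc32(data) & 0xFFFFFFFF


def extract_encoded_lines(message_lines):
    """Keep only prefixed lines, dropping mail headers and trailers."""
    return [line[len(PREFIX):] for line in message_lines
            if line.startswith(PREFIX)]


def verify(decoded, recorded_count, recorded_crc):
    """Recompute the parameters from the decoded data and compare them."""
    return summarize(decoded) == (recorded_count, recorded_crc)


original = b"\\documentstyle{article}\n"
count, crc = summarize(original)

intact = verify(original, count, crc)          # data arrived unchanged
truncated = verify(original[:-1], count, crc)  # last byte lost in transit

message = ["From: someone", "vv:abc", "", "vv:def", "-- signature"]
payload = extract_encoded_lines(message)
```

Recording the byte count as well as the CRC lets the decoder report a truncated file separately from a corrupted one.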
Making Files Mailable
A mechanism is needed to overcome some gateways' refusal to handle large files.
o The encoding program should be able to split the encoded output into parts, each
no larger than a maximum specified size. Splitting the output into smaller parts
is useful if the encoded data is to be transmitted using electronic mail or over
unreliable network links that do not stay up long enough to transmit a large file.
The recommended default maximum part size is 30kB.
o The decoding program should be able to decode a multi-part encoded file very
flexibly. It should not be necessary to:
1. strip out mail headers and trailers;
2. combine all of the parts into one file in the correct order;
3. process each part of the encoded data as a separate file.
o In addition any file specifications from the operating system on which the VVE
file was created must not prevent the file from being decoded.
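The splitting requirement can be sketched in Python as below. The function names are hypothetical, "30kB" is taken to mean 30 x 1024 bytes, and each line is assumed to cost one extra byte for its line terminator; none of these details is confirmed by the specification.

```python
def split_into_parts(lines, max_part_bytes=30 * 1024):
    """Group encoded lines into parts of at most max_part_bytes each.

    Sizes count one newline byte per line. Lines are never broken.
    """
    parts, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("ascii")) + 1
        if line_bytes > max_part_bytes:
            raise ValueError("a single line exceeds the maximum part size")
        if current and size + line_bytes > max_part_bytes:
            parts.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        parts.append(current)
    return parts


def join_parts(numbered_parts):
    """Reassemble (part_number, lines) pairs that arrive in any order."""
    ordered = sorted(numbered_parts, key=lambda pair: pair[0])
    return [line for _, lines in ordered for line in lines]


lines = ["x" * 60] * 1000                  # 61 bytes per line with newline
parts = split_into_parts(lines, max_part_bytes=610)
shuffled = list(reversed(list(enumerate(parts, start=1))))
restored = join_parts(shuffled)
```

Numbering each part is what allows the decoder to accept the parts in any order, as the requirement above demands.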
Miscellaneous
Further considerations include:
o Support for character sets other than ASCII is essential if the encoding scheme is to
be useful to IBM hosts. The encoding program should label the character set used
by the encoded data, and both encoder and decoder should enable the conversion
between the local character set and another character set. For example a user on
an EBCDIC host should be able to encode text files for transmission to another
EBCDIC host, or to convert them to ASCII before encoding and transmission to
an ASCII host. Similarly, that user should be able to decode text files from ASCII
and EBCDIC machines, creating EBCDIC output files.
o Where possible, the original file's timestamp should be encoded and used by the
decoding program when recreating the file: this will permit archives to retain the
originator's time of creation for files, and thus permit the users (not to mention
the archivists) to identify more clearly when a new version of a file has been made
available. Timezones should be supported where possible.
o The encoding and decoding schemes should be able to read and write files that are
compatible with one or more of the well established coding schemes (e.g. UUcode,
XXcode).
o The source code for the programs should be freely available. It should also be
portable and usable with as many computers, operating systems and compilers as
possible.
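The character set requirement in the first item above can be illustrated with Python's built-in codecs. This is a sketch only: `cp037' is one of several EBCDIC code pages and is chosen here purely as an assumption, since the document does not say which EBCDIC variant VVcode supports, and the function name is hypothetical.

```python
def convert_text(data, source_charset, target_charset):
    """Re-encode text bytes from one character set into another."""
    return data.decode(source_charset).encode(target_charset)


ascii_bytes = b"Hello, TeX"

# An ASCII host preparing text for an EBCDIC host (cp037 assumed).
ebcdic_bytes = convert_text(ascii_bytes, "ascii", "cp037")

# The EBCDIC host converting back for an ASCII recipient.
round_trip = convert_text(ebcdic_bytes, "cp037", "ascii")
```

Such conversion is only meaningful for text files, which is one reason the specification insists that the encoder record whether a file is text or binary.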
1.3 The Search Commences
Naturally, the first step was to examine the existing coding schemes in comparison with the above
ideal specification. Such schemes fell into two broad classes: portable schemes, which were intended
to permit the encoding of files on any computer architecture into a form that could be transmitted
electronically, and decoded on the same or a different architecture; and platform-specific schemes,
which provided rather better support for transferring files between two computers using the same
architecture and operating system.
1.3.1 Portable Coding Schemes
The most commonly used coding schemes supported by a variety of platforms are:
o BOO
o UU
o XX
Most implementations of these schemes known to the authors are designed for use with stream
file systems. These programs have no means of recording, let alone preserving, record structure
and are thus unsuitable for our purposes. This is not surprising since UUcode and its mutation
XXcode were developed specifically for exchanging files between Unix systems. In fairness to these
schemes, they are well suited to the transmission of text files and certain unstructured binary files.
Standard UUcode encodes files using the ASCII characters ` ' through `_' (space through underscore). This can result in one or
more spaces appearing at the ends of lines: some mailers decide that this is information not worth
transmitting, with consequent inability to reconstruct the original file.
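The trailing-space problem is easy to reproduce. The Python sketch below encodes a single line in the classic UUcode manner, using space for the value zero; the function name is hypothetical, and later UUcode variants that substitute a backquote for the space are not modelled.

```python
def uu_encode_line(chunk):
    """Encode 1 to 45 bytes as one classic UUcode line (space for zero)."""
    if not 0 < len(chunk) <= 45:
        raise ValueError("a UUcode line carries 1 to 45 bytes")
    padded = chunk + b"\0" * (-len(chunk) % 3)   # pad to a multiple of 3
    out = [chr(0x20 + len(chunk))]               # leading length character
    for i in range(0, len(padded), 3):
        b0, b1, b2 = padded[i:i + 3]
        for six in (b0 >> 2,
                    ((b0 & 0x03) << 4) | (b1 >> 4),
                    ((b1 & 0x0F) << 2) | (b2 >> 6),
                    b2 & 0x3F):
            out.append(chr(0x20 + six))          # 0..63 -> ' ' .. '_'
    return "".join(out)


line = uu_encode_line(b"\x01\x00\x00")
stripped = line.rstrip()   # what a mailer that trims whitespace delivers
```

Here the encoded line ends in two spaces, so a mailer that strips trailing whitespace delivers a shorter line than the length character promises.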
Files containing characters such as `^' are often irreversibly corrupted by mail gateways; this
problem led to the development of XXcode which uses a rather more robust character set, namely:
+-0123456789
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
The encoding table used is recorded with the encoded data to allow the detection of character
corruptions, and the correction of reversible character transpositions. Whilst superficially a step
forward, XXcode offered little more than most existing versions of UUcode, which already supported
coding tables. Its major contribution was in formalizing the encoding table, and in particular its
default table was proof against all the known gateway-induced corruptions.
1.3.2 Platform Specific Coding Schemes
Encoding schemes have been developed to support transfer of files possessing some structure
which therefore cannot be reconstructed correctly when encoded by the portable schemes. When
the encoding and decoding programs of such a platform specific scheme are each used on the same
computer and operating system type, files may be encoded and transmitted with a great deal
of confidence that the decoded file will reproduce the original's structure and attributes in their
entirety.
Examples of such programs are TELCODE and MFTU for VMS, NETDATA for IBM mainframes, and
Stuffit and MacBinary for the Macintosh. But these programs have the major disadvantage that
they have each been implemented only on the single architecture for which they were designed:
thus the only two of these schemes that could be used on the VMS-based Aston Archive would be
of minimal interest elsewhere!
The Archive's content is in some respects artificially inflated by the presence of `.hqx' files for
Macintoshes, `.boo' for MS-DOS, etc., which have to be held in pre-encoded form for transfer by
those requiring them.
1.4 VVcode is Born
Realizing that none of the existing portable schemes were close enough to our ideal, an early
version of our specification was circulated on various mailing lists by Niel Kempson towards the
end of 1990. When the anticipated "nil return" was all that resulted, Brian Hamilton Kelly went
ahead and created a rudimentary VVencode by modifying an existing VAX Pascal implementation
of uuencode. After generating the companion VVdecode, he then re-implemented the programs in
Turbo C under the MS-DOS operating system on the IBM-PC, and thereby was able to prove that
the new scheme was both viable and sufficient.
This version didn't support file formats, time stamping, file splitting, character sets or CRC
checking.
1.4.1 A Production VVcode
Following the minor feasibility study, Niel Kempson re-engineered the pair of programs from
scratch (adding certain features of the evolving specification), paying particular attention to making
the code portable across a wide variety of operating systems. Particular care was taken to avoid
the use of supposedly "standard" C functions that experience had shown behaved differently under
individual manufacturers' implementations, or were even non-existent in some. Therefore the code
may sometimes appear to be performing certain operations in a very long-winded way; it's very
easy to look at it and say "why didn't the author use the foo() function, which does this much
more efficiently?", but this function may not even exist under another implementation of C, or
may behave in a subtly different manner.
The core functions of VVcode are implemented as a collection of routines written in as portable
a fashion as possible, and a separate module of a few operating system specific routines for file
I/O, timestamping, command-line or other interface, etc. Porting VVcode to a new platform should
require only that this latter module be re-implemented, in most cases by adapting an existing one.
VVcode implements all of the features listed in the specification, apart from the ability to generate
UUcode and XXcode compatible files. However, the decoding program is backwards compatible and
can decode files generated by UUcode and XXcode.
1.4.2 Arguments against VVcode
When the advent of the VVcode system was first aired in the various electronic digests,
some heated debate followed along the lines that a new encoding scheme was unnecessary, since
UUcode/XXcode sufficed for them. However, all these correspondents were Unix users who had
interpreted the `VV' as meaning "Vax-to-Vax" by analogy with `uu' [3] and who felt that such a
scheme should be private to VAXen. The authors' reply was to the effect that the encoding scheme
was intended to support the needs of archives like Aston's, and as such, had to provide
1. an automated tool (it would be somewhat difficult to expect our users to be able to tell the
encoder what sort of file structure it was handling, when this concept was entirely alien to
many of them);
2. facilities to encode binaries for many operating systems;
3. mail server features, such as splitting of large files;
4. operation across the widest possible combination of platforms.
The overhead of using the VVcode system is at most a couple of hundred bytes over using
UUcode, and the extra functionality and universality with respect to UUcode or XXcode thereby
comes almost for free.
_________________________________________
[3] `V' was chosen simply because it followed `U'; at one time, we'd seriously considered calling it
YAFES: Yet Another File Encoding Scheme!
1.5 Availability of VVcode
At present, the VVcode system is only available in C, but it has been shown to run successfully
on the following combinations of hardware, operating system and compiler:
Macintosh
o At the time of writing (May 1991), John Rawnsley of the University of Warwick had
commenced development of a Macintosh port, which will encode the resource and data
forks in a manner that will permit the former to be ignored by non-Macintosh systems.
MS-DOS
o IBM PS/2, PC (and clones); MS-DOS 3.3, 4.01, 5.00; Borland Turbo C 1.5, 2.0,
Borland C++ 1.0, 2.0, 3.0 and Microsoft C 5.1, 6.0
OS/2
o IBM PS/2, PC (and clones); OS/2 2.0; Microsoft C 6.0 and GNU C 2.1
Unix
o Sun 3; SunOS 3.x and 4.0.3; native C and GNU C
o Sun Sparcstation 1; SunOS 4.1; native C and GNU C
o SCO Unix V/386 v3.2.2, Microsoft C compiler
VAX/VMS
o All VAXen; VMS 5.2-5.4-1; VAX C V3.0-V3.2 and GNU C 1.40
VM/CMS
o VM/CMS; Whitesmith C compiler v1.0 (This implementation was ported by
Rainer Schöpf; basing it upon the Unix implementation, this took him about one
day.)