Newsgroups: comp.sources.misc
From: jkl@osc.edu (Jan Labanowski)
Subject: v36i023: translit - transliterate foreign alphabets, Part01/10
Message-ID: <csm-v36i023=translit.163954@sparky.IMD.Sterling.COM>
X-Md4-Signature: 1fdf62718ac15c13f16020f8f731cbf8
Date: Fri, 19 Mar 1993 22:40:58 GMT
Approved: kent@sparky.imd.sterling.com
Submitted-by: jkl@osc.edu (Jan Labanowski)
Posting-number: Volume 36, Issue 23
Archive-name: translit/part01
Environment: UNIX, MS-DOS, VMS
Available-from: kekule.osc.edu (128.146.36.48) in /pub/russian/translit
Copyright-note: Yes, you have to distribute the complete package.
Translit is a general transliteration program. It transliterates
between different alphabet representations of different languages.
It is frequently necessary to convert from one representation to another
representation of the foreign alphabet. E.g., in the Library of Congress
transliteration, the Russian letter sha is transliterated as two Latin
letters "sh" while the popular word processors use a code 232 (decimal),
the RELCOM network uses a code 219, and the KOI7 set uses character "["
for the same letter. So if your screen driver, printer, word processor,
etc. uses different codes than the text file which you have, you need to
transliterate.
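
As a rough illustration of the single-code remapping described above, the
following Python sketch builds a 256-entry table between two code sets. It
is not part of translit: Python's built-in "cp866" and "koi8_r" codecs are
used here as stand-ins for the ALT and KOI8 tables, and the names
build_remap and ALT_TO_KOI8 are hypothetical.

```python
# Sketch only: Python's built-in "cp866" and "koi8_r" codecs stand in for
# the ALT and KOI8 tables; translit itself reads its own table files.

def build_remap(src_codec, dst_codec):
    """Return a 256-entry table mapping source byte codes to target codes."""
    table = bytearray(range(256))      # default: leave the code unchanged
    for code in range(128, 256):       # codes 0-127 are ASCII in both sets
        try:
            char = bytes([code]).decode(src_codec)
            table[code] = char.encode(dst_codec)[0]
        except UnicodeError:
            pass                       # no counterpart: keep the original code
    return bytes(table)


ALT_TO_KOI8 = build_remap("cp866", "koi8_r")

sha_alt = bytes([232])                     # sha as stored in the ALT set
sha_koi8 = sha_alt.translate(ALT_TO_KOI8)  # the same letter as a KOI8 code
sha_koi7 = bytes([sha_koi8[0] & 0x7F])     # for this letter, clearing the
                                           # high bit gives the KOI7 character
print(sha_alt[0], sha_koi8[0], sha_koi7.decode("ascii"))
```

Characters without a counterpart in the target set keep their original
code here, which is only one of several possible policies.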
The TRANSLIT program is a powerful tool for such tasks. It converts an input
file in one representation to the output file in another representation using
an appropriate, user-defined transliteration table. The transliteration table
allows for very elaborate transliteration tasks and includes provisions for plain
character sequences, character lists, regular expressions (flexible matches),
SHIFT-OUT/IN sequences and more. The program comes with documentation and
examples of popular transliteration schemes. The Russian language serves
as an example. Other files will be added with your collaboration.
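
The manual page included in this posting notes that, for multi-letter
transliterations, the longest sequences have to be substituted first. The
Python sketch below shows that idea with a single left-to-right pass that
always tries the longest table entry first. This is an assumed
simplification, not the algorithm translit uses; the table holds only the
six KOI8 codes listed in the manual, and PHONETIC_TO_KOI8 and
transliterate are hypothetical names.

```python
# Hypothetical table: the KOI8 codes are the ones listed in the manual's
# INTRODUCTION section; a real translit table file has a richer format.
PHONETIC_TO_KOI8 = {
    "shch": 221,
    "sh": 219,
    "ch": 222,
    "c": 195,
    "h": 200,
    "a": 193,
}


def transliterate(text, table=None):
    """Convert phonetic Latin text to KOI8 bytes, longest sequence first."""
    if table is None:
        table = PHONETIC_TO_KOI8
    # Try longer sequences before shorter ones at every position.
    keys = sorted(table, key=len, reverse=True)
    out = bytearray()
    pos = 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            # No table entry matches: pass the character through unchanged
            # (non-ASCII input is replaced by "?" in this sketch).
            out += text[pos].encode("ascii", errors="replace")
            pos += 1
    return bytes(out)


print(list(transliterate("shchah")))
```

If the shorter entries were tried first, "shchah" would be split into
"sh" + "ch" + "a" + "h" and the result would contain four codes instead
of three.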
The most current version of translit will be available from ftp kekule.osc.edu
(or ftp 128.146.36.48) in the directory /pub/russian/translit
Via E-mail, first retrieve the file readme.doc. It describes the files in
the program distribution and has detailed instructions on how to obtain the
program. Send the message:
send translit/readme.doc from russian
to OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET. The file readme.doc will be
forwarded to your mailbox.
Enjoy,
Author coordinates:
Jan Labanowski
P.O. Box 21821
Columbus, OH 43221-0821, USA
jkl@osc.edu, JKL@OHSTPY.BITNET
-------
#! /bin/sh
# This is a shell archive. Remove anything before this line, then feed it
# into a shell via "sh file" or similar. To overwrite existing files,
# type "sh file -c".
# Contents: translit.1
# Wrapped by kent@sparky on Fri Mar 19 16:00:08 1993
PATH=/bin:/usr/bin:/usr/ucb:/usr/local/bin:/usr/lbin ; export PATH
echo If this archive is complete, you will see the following message:
echo ' "shar: End of archive 1 (of 10)."'
if test -f 'translit.1' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'translit.1'\"
else
echo shar: Extracting \"'translit.1'\" \(56776 characters\)
sed "s/^X//" >'translit.1' <<'END_OF_FILE'
X.TH TRANSLIT JKL "23-Jan-1993" JKL "Version 1.0"
X.DA 20 Jan 1993
X.SH NAME
X.IP \fITRANSLIT\fR
XProgram to transliterate texts in different character sets. The program
Xconverts input character codes (or sequences of codes) to a different set
Xof output character codes (or sequences of codes). Intended for
Xtransliteration to/from phonetic representation of foreign letters with
XLatin letters from/to special national codes used for these letters.
XIt supports simple matches, character lists and flexible matches via
Xregular expressions. New transliteration schemes are easily added
Xby creating simple transliteration tables. Multiple character sets
Xare supported for input and output. It does not yet support UNICODE,
Xbut some day it will.
X
X.SH COPYRIGHT
XCopyright (c) 1993 Jan Labanowski and JKL Enterprises, Inc.
X.br
XYou may distribute the Software only as a complete set of files.
XYou may distribute the modified Software only if you retain the
XCopyright notice and you do not delete original code, data, documentation
Xand associated files.
XThe Software is copyrighted. You may not sell the software or incorporate
Xit in the commercial product without written permission from
XJan Labanowski or JKL Enterprises, Inc. You are allowed to charge for media
Xand copying if you distribute the whole unaltered package.
X
X.SH SYNOPSIS
X.B translit
X[
X.B -i
X.I inpfile
X][
X.B -o
X.I outfile
X][
X.B -d
X][
X.B -t
X.I transtbl \|\||\|\| transtbl
X]
X.br
X
X.SH OPTIONS
X.IP "\fB-i\fP \fIinpfile\fP"
X.I inpfile
Xis the name of the input file to be transliterated.
XIf "\fB-i\fP" is not specified, the input is taken from
Xstandard input.
X.IP "\fB-o\fP \fIoutfile\fP"
X.I outfile
Xis an output file, where the transliterated
Xtext is stored. If "\fB-o\fP" is not specified, the output is
Xdirected to the standard output. The program will not overwrite an existing
Xfile. If the file exists, you need to delete it first.
X.IP "\fB-d\fP"
XSome information on the character codes read from the transliteration table file
Xis sent to standard error ("\fIstderr\fP"). Useful when developing
Xnew transliteration tables.
X.IP "\fB-t\fP \fItranstbl\fP"
X.I transtbl
Xis a transliteration table file which you want to use. The "\fB-t\fP"
Xoption may be omitted if the \fItranstbl\fR
Xis specified as the last parameter on the
Xcommand line. The program first tries to locate \fItranstbl\fR
Xfile in the current directory, and if not found, it
Xsearches the directory chosen at compilation/installation time in
X"\fIpaths.h\fP". If no "\fItranstbl\fP" is given, the default file name
Xspecified in "\fIpaths.h\fP" is taken. The compile/installation
Xtime defaults in
X"\fIpaths.h\fR" for the search directory and the default
Xfile name can be overridden
Xby setting environment variables: TRANSP and TRANSF, respectively (see below).
X
X.SH ENVIRONMENT VARIABLES
XThe default path to the directory holding transliteration tables can
Xbe overridden by setting the environment variable TRANSP. The default name
Xfor the transliteration table can be overridden by setting the TRANSF environment
Xvariable. However, a transliteration file given on the command line
Xoverrides both the defaults and the environment settings.
XHere are some examples of setting environment
Xvariables for different operating systems:
X.sp
X.in +2m
X.br
X\fIUN*X System\fR
X.br
X.nf
X If you are using \fIcsh\fR (C-shell):
X setenv TRANSP /home/john/translit/
X setenv TRANSF koi8-tex.rus
X If you are using \fIsh\fR (Bourne Shell):
X TRANSP=/home/john/translit/
X export TRANSP
X TRANSF=koi8-tex.rus
X export TRANSF
X\fIVAX-VMS System\fR
X TRANSP:==SYS$USER:[JOHN.TRANSLIT]
X TRANSF:==KOI8-TEX.TBL
X\fIPC-DOS or MS-DOS\fR
X SET TRANSP=C:\|\\\|JOHN\|\\\|TRANSLIT\|\\
X SET TRANSF=KOI8-TEX.TBL
X.fi
X.in -2m
XNote that the directory path has to include the trailing
Xslash, \|\\\| or \|/\|\|.
X
X
X.SH EXAMPLES
X.ta 5m
X.br
X cat text.koi8 \|\||\|\| translit koi8-tex.rus > text.tex
X.br
Xin UN*X is equivalent to:
X.sp 1
X translit -t koi8-tex.rus -o text.tex -i text.koi8
X.br
Xand converts file text.koi8 to file text.tex using transliteration
Xspecified in the file koi8-tex.rus.
X.sp 1
X translit -i text.koi8 koi8-lc.rus
X.br
Xdisplays the converted text from file text.koi8 on your terminal. The
Xconversion table is koi8-lc.rus (KOI8 --> Library of Congress).
X.sp 1
X translit -i text.alt -t alt-koi8.rus \|\||\|\| translit -o text.tex -t koi8-tex.rus
X.br
Xis essentially equivalent to the following two commands in UN*X or MS-DOS:
X.br
X translit -i text.alt -o junkfile -t alt-koi8.rus
X.br
X translit -i junkfile -o text.tex -t koi8-tex.rus
X.br
Xand converts the file in ALT character set to a LaTeX file for printing.
X.sp
X translit -i russ.txt pho-koi8.rus \|\||\|\| translit -o russ.tex koi8-tex.rus
X.br
Xconverts file russ.txt from phonetic transliteration to LaTeX file russ.tex
Xfor printing.
X.sp 2
X
X.SH TRANSLITERATION TABLES
XThe following transliteration files are available with the current
Xdistribution. Consult the comments in the individual files for details.
X.IP \fIkoi8-tex.rus\fP
XConversion table which changes the file in KOI8 (8 bit character set
Xused by RELCOM news service) to a LaTeX file for printing with
X\fIAMS\fR WNCYR fonts.
X.IP \fItex-koi8.rus\fP
XConversion table for the LaTeX to KOI8 conversion. Note that it will not
Xhandle complicated cases, since LaTeX is a program, and only TeX can
Xconvert a LaTeX source to the characters. However, it should work OK
Xfor simple cases of text only files, and may need some editing for
Xcomplicated cases.
X.IP \fIalt-gos.rus\fP
XThis is a transliteration data file for converting from ALT (Bryabrin's
Xalternativnyj variant used in many popular word processors)
Xto GOSTSCII 84 (approx. ISO-8859-5?).
X.IP \fIalt-koi8.rus\fP
XThis is a transliteration data file for converting from ALT to KOI8.
XKOI8 is meant to be GOST 19768-74 (as used by RELCOM).
X.IP \fIgos-alt.rus\fP
XThis is a transliteration data file for converting GOSTSCII 84
X(approx. ISO-8859-5?) to ALT (Bryabrin's alternativnyj variant).
X.IP \fIgos-koi8.rus\fP
XThis is a transliteration data file for converting GOSTSCII 84
X(approx. ISO-8859-5?) to KOI8 as used by RELCOM.
XKOI8 is meant to be GOST 19768-74.
X.IP \fIkoi8-alt.rus\fP
XThis is a transliteration data file for converting from KOI8
Xto ALT (Bryabrin's alternativnyj variant).
XKOI8 is meant to be GOST 19768-74.
X.IP \fIkoi8-gos.rus\fP
XThis is a transliteration data file for converting from KOI8 (RELCOM)
Xto GOSTSCII 84 (approx. ISO-8859-5).
XKOI8 is meant to be GOST 19768-74.
X.IP \fIkoi8-7.rus\fP
XThis file converts from KOI8 to KOI7.
X.IP \fIkoi7-8.rus\fP
XThis file converts from KOI7 to KOI8. Before you attempt the conversion,
Xyou might need to perform a simple edit on your file. You MUST read the
Xcomments in \fIkoi7-8.rus\fR before you attempt this conversion.
X.IP \fIkoi7nl-8.rus\fP
XThis file assumes that there are only Russian letters (no Latin)
Xin the input file. If you have Latin letters, and you inserted SHIFT-OUT/IN
Xcharacters, use file \fIkoi7-8.rus\fP.
X.IP \fIkoi8-lc.rus\fP
XThis file converts KOI8 to the Library of Congress transliteration.
XSome extensions are added.
X.IP \fIkoi8-php.rus\fP
XThis file converts KOI8 to the Pokrovsky transliteration.
X.IP \fIphp-koi8.rus\fP
XThis file converts from Pokrovsky transliteration to KOI8.
X.IP \fIkoi8-phg.rus\fP
XThis file converts from KOI8 to GOST transliteration.
X.IP \fIphg-koi8.rus\fP
XThis file converts from GOST transliteration to KOI8.
X.IP \fIpho-koi8.rus\fP
XThis is a table which will convert from many "phonetic" transliteration
Xschemes to KOI8. It is elaborate and it takes a lot of time to
Xtransliterate the file using this table. Some transliterations are
Xhopeless and internally inconsistent (as humans...), so the results
Xcannot be bug free.
XYou might want to modify the file, if your transliteration
Xpatterns are different than those assumed in this file. You may also want
Xto simplify this file if the phonetic transliteration you are converting
Xis a sound one (most are not, e.g., they use e for je and e oborotnoye,
Xts for c and t-s, h for kha, i for i-kratkoe, etc.).
X.sp
X
X.SH INTRODUCTION
XIf you do not intend to write your own transliteration tables, you may
Xskip this description and go directly to the installation and
Xcopyright sections. However, you might want to read this material anyhow,
Xto better understand the traps and complexities of transliteration.
XIt is frequently necessary to transliterate text, i.e., to change one set
Xof characters (or composite characters, phonemes, etc.) to another set.
X.PP
XOn computers, the transliteration operation consists of converting the input
Xfile in some character set to the output file in another character set.
X.PP
XIn the simplest case, the single characters are transliterated, i.e., their
Xcodes are changed according to some transliteration table. This is called
Xremapping and, assuming the one-to-one mapping, the task can be accomplished
Xby a simple pseudo program:
X.br
X new_char_code = character_map[old_char_code];
X.PP
XIf the one-to-one correspondence does not exist (i.e., some codes may
Xbe present in one set, but do not have corresponding codes in another set),
Xprecise transliteration is not possible. In such cases there are 3 obvious
Xpossibilities:
X.br
X 1. skip characters which do not have counterparts,
X.br
X 2. retain unchanged codes of these characters,
X.br
X 3. convert the codes to multicharacter sequences.
X.br
XIn some cases, the file can contain more than one character set, e.g.,
Xthe file can contain Latin characters (e.g. English text) and Cyrillic
Xcharacters (e.g. Russian text). If the character codes assigned to
Xcharacters in different sets do not overlap, this is still a simple mapping
Xproblem. This is the case with the KOI8 or GOSTSCII character tables for Russian,
Xwhich reserve the lower 128 codes (0-127) for standard ASCII codes (which include
Xall Latin characters) and characters with codes above 127 for Cyrillic letters.
X.PP
XIf character codes overlap, there is a SHIFT-OUT/SHIFT-IN technique in
Xwhich the meaning of the character sequence is determined by an opening
Xcode (or sequence of characters codes). In this case, the meaning of the
Xseries of characters is determined by the SHIFT-OUT character (or sequence)
Xwhich precedes them. The SHIFT-IN character (or sequence) following the
Xseries of characters returns the "reader" to the default or previous status.
XTwo schemes are used:
X.br
X (char_set_1)(SHIFT-IN[1])(SHIFT-OUT[2])(char_set_2)...
X.br
Xor
X.br
X (char_set_1)(SHIFT-OUT[2])(char_set_2)(SHIFT-OUT[1])char_set_1...
X.br
X.sp 1
XSince computer keyboards, screens, printers, software, etc., are by necessity
Xlanguage specific (the most popular being ASCII), there is a problem of typing
Xforeign language text which contains letters not found in the standard Latin
Xalphabet. For this reason, many transliteration schemes use several Latin
Xletters to represent a single letter of a foreign alphabet, for example:
X.br
Xzh is used to represent cyrillic letter zhe, \|\\\|"o may be used to
Xrepresent the o umlaut, etc.
X
XIf there is a one-to-one mapping of such sequences to another alphabet, it
Xis also easy to process. However, it is necessary to substitute the longest
Xsequences first. For example, a frequently used transliteration
Xfor cyrillic letters:
X.br
X.ta 2mL 7mL 11mL 24mL
X \fIshch\fR --- letter \fBshcha\fR 221 (decimal KOI8 code)
X.br
X \fIsh\fR --- letter \fBsha\fR 219
X.br
X \fIch\fR --- letter \fBche\fR 222
X.br
X \fIc\fR --- letter \fBtse\fR 195
X.br
X \fIh\fR --- letter \fBkha\fR 200
X.br
X \fIa\fR --- letter \fBa\fR 193
X.PP
XObviously, in this case, we should proceed first with converting all \fIshch\fR
Xsequences to the \fBshcha\fR letter, then the two-character \fIsh\fR
Xand \fIch\fR, and then the single
Xcharacters \fIc\fR and \fIh\fR.
XGenerally, for the one-to-one transliteration, the longest
Xsequences should be processed first, and the order of conversion within
Xsequences of the same length makes no difference.
XFor example, converting the word "shchah" to KOI8 should proceed in the following
Xway:
X.br
X \fIshchah\fR --> (221)\fIah\fR, (221)\fIah\fR --> (221)(193)\fIh\fR, (221)(193)\fIh\fR --> (221)(193)(200)
X.br
XThere is a multitude of reasons why transliteration is done. I wrote this
Xprogram having in mind the following ones:
X.br
X 1) to print cyrillic text using TeX/LaTeX and cyrillic fonts
X.br
X 2) to read KOI8 encoded messages from Russia on my ASCII terminal.
X.br
XHowever, I was trying to make it flexible to accommodate other uses.
X
X.SH PROGRAM OPERATION
XThe program converts the input file to an output file using
Xtransliteration rules from the transliteration rule file which
Xyou specify with option \fB-t\fR.
XSome examples of transliteration rule files are enclosed.
XBefore the program can be used, the transliteration rules need to be specified.
X.PP
XThese are given as a file which consists of the following parts,
Xdescribed below:
X.br
X.in +2m
X.in +5m
X.ti -5m
X1) File format number (it is 1 at this moment)
X.ti -5m
X2) Delimiters used to enclose a) simple strings, b) character lists,
Xc) regular expressions
X.ti -5m
X3) Starting sequence for output
X.ti -5m
X4) Ending sequence for output
X.ti -5m
X5) Number of input "character sets"
X.ti -5m
X6) SHIFT-OUT/SHIFT-IN sequences for each input character set
X.ti -5m
X7) Number of output "character sets"
X.ti -5m
X8) SHIFT-OUT/SHIFT-IN sequences for each output character set
X.ti -5m
X9) Transliteration table
X.in -5m
X.in -2m
X.PP
X\fIGENERAL COMMENTS\fR
X.br
XThe transliteration rules file consists of comments and data.
XThe comments may be included in the file as:
X.in +5m
X.ti -2m
Xa) line comments --- lines starting with ! or # character (# or ! must be
Xin the first column of a line) are treated as comments and are not
Xread in by the program.
X.ti -2m
Xb) comments following all required entries on the line. They must be
Xseparated by at least one space from the last data entry on the line
Xand need not start with any particular character. These comments cannot
Xbe used within multiline sequences.
X.br
X.in -5m
X.PP
XThe data entries consist of integer numbers and strings.
XThe strings may represent:
X.br
X a) plain strings
X.br
X b) character lists
X.br
X c) regular expressions
X.br
X.PP
XAll strings which appear in the file, are processed through the
X"string processor", which allows entering unprintable characters as codes.
XThe character code is specified as a backslash "\|\\\|" followed by at least
X2 digits (i.e., \|\\\|01 produces code=1, but \|\|\\\|1 is passed unchanged). The
Xfollowing formats are supported:
X.br
X \|\\\|0123 character of octal code 123 (when leading zero present)
X.br
X \|\\\|123 character of decimal code 123 (when leading digit is not zero)
X.br
X \|\\\|0o123 or \|\\\|0O123 character of octal code 123
X.br
X \|\\\|0d123 or \|\\\|0D123 character of decimal code 123
X.br
X \|\\\|0xA3 or \|\\\|0XA3 or \|\\\|0xa3 character of hexadecimal code A3
X.br
X.PP
XThe allowed digits are 0-7 for octal codes, 0-9 for decimal codes and
X0-F (and/or 0-f) for hexadecimal codes.
XWhen a code has to be followed by a digit character,
Xyou need to enter the
Xdigit as a code. E.g., if you want character \|\\\|0xA3 followed by a letter C,
Xyou need to specify letter C as a code (\|\\\|0x43 or \|\\\|0103 or \|\\\|0o103 or \|\\\|0d67)
Xand type the sequence as, e.g., \|\\\|0xA3\|\\\|0103.
XA character resulting in a code 0 (zero) (e.g., \|\\\|00) is special. It tells:
X"skip everything that follows me in this string".
XIt does not make sense to use it, since you can always terminate the
Xsequence with a delimiter. When you use an empty string as a matching
Xsequence, remember that it does not match anything.
X.sp
XIf the line with entries is too long, you can break it between the
Xfields.
XIf the string is too long to fit a line, you can break it before any nonblank
Xcharacter by the \|\\\| (backslash) followed by white space (i.e., new lines,
Xspaces, tabs, etc.). The \|\\\| and the following white space will be removed
Xfrom the string by the string preprocessor. However, you are not allowed
Xto break the individual character codes (and you probably would not
Xdo it ever for aesthetic purposes).
XFor example:
X.br
X "experi\\
X.br
X mental design"
X.br
Xis equivalent to:
X.br
X "experimental design"
X.br
Xwhile:
X.br
X "experimental\\
X.br
X design"
X.br
Xis equivalent to:
X.br
X "experimentaldesign"
X.br
XIf you need to have \|\\\| followed by a space in your string, you need to
Xenter either a backslash or a space following it as an explicit character
Xcode, for example:
X.br
X "\|\\\|\|\\\|0o40"
X.br
Xwill produce a \|\\\| followed by the space, while the string:
X.br
X "\|\\\| "
X.br
Xwill be empty.
X.sp 1
XThe preprocessor knows only about comments, plain characters, character codes,
Xand continuation lines. However, some characters and their combinations
Xmay have a special meaning in lists and regular expressions.
X.sp 2
X\fIDETAILS OF FILE STRUCTURE\fR
X.sp
X.PP
X.in +3m
X.ti -3m
XAd.1) File format number. This is simply a digit 1 on a line by itself at the
Xmoment. This entry is included to allow future extensions of the
Xtransliteration description file without the need to modify older
Xtransliteration descriptions (program will read data according to
Xthe current file format number given in the file).
X.sp
X.ti -3m
XAd.2) String delimiters. The subsequent 3 lines specify pairs of
Xsingle character delimiters for 3 types of text data.
XThe line format is:
X.br
X opening_character closing_character.
X.br
XThese are needed to mark the beginning/end and the type of the text data.
XEach string (text datum) is saved starting from the first character after
Xopening delimiter, and ends at the last character before the closing
Xdelimiter. If you need to use the closing delimiter within a string,
Xyou need to specify it as its code (e.g., if you are using () pair as
Xdelimiters, specify ")" as \|\\\|0x29). The opening delimiter may be the same
Xor different from the closing delimiter.
X.sp
X.in +2m
X.ti -2m
Xa) The first line contains characters used to enclose (bracket)
Xa \fIplain string\fR. Plain strings are directly matched to input data or
Xdirectly sent to output.
XI suggest sticking to the " " pair for plain strings.
XThe ASCII code for " is \|\\\|0d34 = \|\\\|0x22 = \|\\\|0o42 if you need it inside the
Xstring itself.
X.sp
X.ti -2m
Xb) The second line contains characters to mark the beginning and the end
Xof the \fIlist\fR. Lists are used to translate single character codes.
XI suggest [ and ] delimiters for the list (ASCII code of "]" is:
X\|\\\|0d93 = \|\\\|0x5D = \|\\\|0o135). The lists may include ranges, for example:
X[a-zA-Z0-9] will include all Latin letters (small and capital) and digits.
XNote that order is important: [a-d] is equivalent to [abcd], while
X[d-a] will result in an error. If you want to include "-" (minus) in the
Xlist, you need to place it as the first or the last character. There are only
Xtwo special characters on the list, the "-" described above, and the "]"
Xcharacter. You need to enter the "]" as its code. E.g., for
XASCII character table [*--] is equivalent to [*+,-], is equivalent to
X[\|\\\|42\|\\\|43\|\\\|44\|\\\|45]. The order of characters in the list does not matter
Xunless the input list corresponds to the output list (this will be
Xexplained later). Empty lists do not make sense.
X.sp
X.ti -2m
Xc) The third line of delimiter specification contains delimiters for
X\fIregular expression\fRs and \fIsubstitution expression\fRs.
XThese strings are used for "flexible" matches
Xto the text in the input file. They are very similar to the ones used in
XUN*X for searching text in utilities like: grep, sed, vi, awk, etc., though
Xonly a subset of full UN*X regular expression syntax is used here.
XI suggest enclosing them within braces { and } (ASCII code for } is
X\|\\\|0d125 = \|\\\|0x7D = \|\\\|0o175). Actually, regular expressions can only
Xbe used for input sequences, and for output sequences the {} are
Xused to enclose substitution sequences. This will be explained
Xbelow. The description of the
Xsyntax for regular/substitution expressions is
Xadapted from the documentation for the regexp package of Henry
XSpencer, University of Toronto --- this regular expression package
Xwas incorporated, after minute modifications, into the program.
X.br
X.sp 2
X.ce
X\fBREGULAR EXPRESSION SYNTAX\fR
X.br
XA regular expression is zero or more branches, separated by
X`\|\||\|\|'. It matches anything that matches one of the branches.
XThe `\|\||\|\|' simply means "or".
X.ti +2m
XA branch is zero or more pieces, concatenated. It matches a
Xmatch for the first, followed by a match for the second,
Xetc.
X.ti +2m
XA piece is an atom possibly followed by `*', `+', or `?'.
XAn atom followed by `*' matches a sequence of 0 or more
Xmatches of the atom. An atom followed by `+' matches a
Xsequence of 1 or more matches of the atom. An atom followed
Xby `?' matches zero or one occurrences of atom.
X.ti +2m
XAn atom is a regular expression in parentheses (matching a
Xmatch for the regular expression), a range (see below), `.'
X(matching any single character), a `\|\\\|' followed by
Xa single character (matching that character), or a
Xsingle character with no other significance (matching that
Xcharacter).
X.ti +2m
XA range is a sequence of characters enclosed in `[\|\|]'. It
Xnormally matches any single character from the sequence. If
Xthe sequence begins with `^', it matches any single character
Xnot from the rest of the sequence. If two characters in
Xthe sequence are separated by `-', this is shorthand for the
Xfull list of ASCII characters between them (e.g. `[0-9]'
Xmatches any decimal digit). To include a literal `]' in the
Xsequence, make it the first character (following a possible
X`^'). To include a literal `-', make it the first or last
Xcharacter. The regular expression can contain subexpressions
Xwhich are enclosed in a (\|\|) pair. These subexpressions are numbered
X1 to 9 and can be nested. The numbering of subexpressions is
Xgiven in the order of their opening parentheses "(". For
Xexample:
X.br
X.ta 6mL
X (111)...(22(333)222(444)222)...(555)
X.br
XNote that expression 2 contains within itself expressions 3 and 4.
X.br
XThese subexpressions can be referenced in the substitution string, which
Xis described in the section below, or can be used to delimit
Xatoms.
X.in +2m
XExamples:
X.in +2m
X.ti -2m
X{[\|\\\|0d32\|\\\|0d09]\|\\\|0d10} --- will match space or tab followed by new line
X.ti -2m
X{[Tt][Ss]} --- will match TS, Ts, tS and ts
X.ti -2m
X{TS\|\||\|\|Ts\|\||\|\|tS\|\||\|\|ts} --- same as above
X.ti -2m
X{[\|\\\|0d09-\|\\\|0d15 ][^hH][^uU][a-zA-Z]*[\|\\\|0d09-\|\\\|0d15 ]} --- words whose
Xfirst character is not h or H and whose second character is not u or U
X(so words starting with hu, Hu, hU, HU are not matched). There is a space between
X\|\\\|0d15 and ].
X.br
XNote that specifying expressions like {.*} (i.e., match all characters)
Xdoes not make much sense, since it would mean here: match the whole input
Xfile. However, expressions like {A.*B} should be acceptable, since they
Xmatch a pair of A and B, and everything in between them, e.g. for a
Xstring like: "This is Mr. Allen and this is Mr. Brown." this expression
Xshould match the string: "Allen and this is Mr. B".
X.br
X.in -4m
XRemember to put a backslash "\|\\\|" in front of the following
Xcharacters: .\|\|[\|\|(\|\|)\|\||\|\|?\|\|+\|\|*\|\|\|\\\| if you want
Xtheir literal meaning outside the
Xrange enclosed in [\|\|]. Inside the range they have their literal meaning.
XIf you know the syntax of UN*X regular expressions, please note that
X\|\|^\|\| and \|$\| anchors are not supported and are treated as normal
Xcharacters (with the exception of \|\|^\|\| negation within [\|\|]).
X.sp
X.ce
X\fBSUBSTITUTION EXPRESSIONS\fR
X.br
XAfter finding a match for a regular expression in the input text,
Xa substitution is made.
XIt can be a simple substitution where the whole matching string
Xis replaced by another string, or it may reuse a portion or
Xthe whole matching string. The subexpressions (the ones enclosed
Xin parentheses) within the regular
Xexpression which matched the input text can be referenced in the
Xsubstitution expression.
XOnly the following characters have special meaning within substitution
Xexpression:
X.in +4m
X.ta 3m
X.br
X.ti -2m
X& --- will put the whole matching string.
X.ti -2m
X\|\\\|1 --- will put the match for the 1st subexpression in (\|\|).
X.ti -2m
X\|\\\|2 --- will put the string which matched 2nd subexpression,
Xetc.
X.ti -2m
X\|\\\|9 --- will place in a replacement string the 9th
Xsubexpression (provided that there were 9 (\|\|) pairs in
Xthe regular expression).
X.in -4m
X.sp
XOnly 9 subexpressions are allowed.
XAll other characters and sequences within the substitution expression
Xwill be placed in a substitution string as written. To be able to put
Xa single backslash there, you need to put two of them.
XTo be able to place the unchanged codes of the
Xabove characters (i.e., to make them literals), you need to precede them
Xwith a backslash "\|\\\|", i.e., to get & in the output string
Xyou need to write it as \|\\\|&. Similarly, to place literal
X\|\\\|1, \|\\\|2, etc., you need to enter it as \|\\\|\|\\\|1, \|\\\|\|\\\|2, etc.
XNote that characters .+[]()^, etc. which had a special meaning in
Xthe regular expressions, do not have any special meaning in the
Xsubstitution expression and will be output as written.
X.in +2m
XExample:
X.br
XThe regular expression:
X.in +2m
X.ti -2m
X{([Tt])([Ss])} and the corresponding substitution expression {\|\\\|1.\|\\\|2}
Xputs a period
Xbetween adjoining letters t and s preserving their letter case.
X.br
XThe expression:
X.ti -2m
X{([A-Za-z]+)-[ \|\\\|0x09]*([\|\\\|0x0A-\|\\\|0x0D]+)[ \|\\\|0x09]*([A-Za-z,.?;:"\|\\\|)'`!]+)[ \|\\\|0x09]}
X.br
Xand the substitution expression {\|\\\|1\|\\\|3\|\\\|2} dehyphenates words (when you
Xunderstand this one, you are a guru...). For example:
Xcon- (NL)cert is changed to concert(NL), where NL stands for New
XLine. It looks for one or more letters (saves them as substring 1)
Xfollowed by a hyphen (which may be followed by zero or more spaces
Xor tabs). The hyphen must be followed by a NewLine (ASCII characters
X0A-0D hex form various new line sequences) and saves NewLine sequence
Xas a subexpression 2.
XThen it looks for zero or more tabs and spaces (at the beginning of
Xthe line). Then it looks for the rest of the hyphenated word and
Xsaves it as substring 3. The word may have punctuation attached.
XThen it looks again for some spaces or tabs. The substitution expression
Xjunks all sequences which were not within (), i.e., hyphen and
Xspaces/tabs and inserts only substrings but in a different
Xorder. The \|\\\|1 (word beginning) is followed by \|\\\|3 (word end) and
Xfollowed by the NewLine --- \|\\\|2. The {\|\\\|2\|\\\|1\|\\\|3} would
Xbe probably equally good, though you would need to move the punctuation
Xmatching to the beginning of the regular expression.
X.in -6m
X.ti -3m
XAd.3) Starting sequence. This sequence will be sent to the output before
Xany text. It is enclosed in the pair of string delimiters. I use it
Xto output LaTeX preamble. However, it can be empty, if not used.
XThe (sequence) may contain any characters, including new lines, etc.
X.nf
X.ta 2m 4m
X Example:
X "" # empty sequence
X.sp
X Example:
X "\|\\\|documentstyle{article}
X \|\\\|input cyracc
X \|\\\|begin{document}
X "
X is right (note a new line at the end), but
X.br
X "\|\\\|documentstyle{article}
X \|\\\|input cyracc # this comment will be included!
X \|\\\|begin{document}" # while this will not
X is wrong.
X.sp
X.fi
X.ti -3m
XAd.4) Ending sequence. Similar to 3), but will be appended at the end of the
Xoutput file.
X.nf
X For example:
X "\|\\\|end{document}
X "
X.fi
X.sp
X.ti -3m
XAd.5) Number of input character sets. For example, in some incarnation of
XKOI7, there are two character sets: Latin and Cyrillic. Cyrillic
Xcharacter sequence follows SHIFT-OUT character (CTRL-N), \|\\\|0x0e,
Xand is terminated by SHIFT-IN character (CTRL-O), \|\\\|0x0f.
XAnother way of looking at it is that Latin characters follow
XCTRL-O and cyrillic ones follow CTRL-N.
X.sp
XIf there is only one character set on input you should specify 0
Xas a number of input char sets,
Xsince the input file obviously does not contain any SHIFT-OUT/IN
Xsequences.
X.sp
X.ti -3m
XAd.6) SHIFT-OUT/SHIFT-IN sequences for each input character set.
XThese lines appear only if you specified a nonzero number of character sets.
XThese lines also contain "nesting sequences", which will be
Xexplained later in this section.
XYou do not use "nesting sequences" frequently, so let us assume
Xfor a moment that the nesting data are empty strings.
XThe strings or regular expressions specified here are matched
Xagainst the contents of the input text. If a match is found, the matching sequence
Xis usually deleted from the input text and:
X.in +4m
X.ti -2m
Xa) for SHIFT-OUT sequence: the current input character set number is changed
Xto the new one corresponding to the SHIFT-OUT sequence, or
X.ti -2m
Xb) for SHIFT-IN sequence: the previous input character set number is restored,
X(i.e., the one which preceded the SHIFT-OUT sequence for the current set).
XNote that only the SHIFT-IN sequence for the current set is matched.
XThe SHIFT-IN sequences for other character sets than the current set are
Xnot matched.
XThe bracketing of sets is assumed
Xperfect. If the SHIFT-IN sequence for the current set is an empty string,
Xthe input set number is changed when SHIFT-OUT sequence of the new set
Xis detected.
X.in -4m
XFor each input character set, you have to specify a line consisting
Xof 6 strings/expressions separated by spaces:
X.br
X SO-match SO-subs NEST-up NEST-down SI-match SI-subs
X.br
Xwhere:
X.br
X.in +2m
X.ti -2m
XSO-match --- the string or regular expression for the SHIFT-OUT sequence
Xfor the current character set. If detected, the input character set is
Xchanged to this set.
X.ti -2m
XSO-subs --- this is usually an empty string (i.e., the input sequence
Xmatching SO-match is removed). But it can be a replacement string or
Xa substitution expression, which will substitute the original matching
XSHIFT-OUT sequence.
X.ti -2m
XNEST-up --- this string (or a regular expression) is usually an empty
Xstring. However, it can be used to count brackets for detection of the SHIFT-IN
Xbracket, if the SHIFT-IN sequence is not unique. Its use is explained below.
X.ti -2m
XNEST-down --- a counterpart of NEST-up. It is explained later.
X.ti -2m
XSI-match --- when a sequence in the input file matches the string or regular
Xexpression given as SI-match for the current input character set, the
Xinput character set number is restored to the previous set. Note that
Xonly the SI-match for the current set is matched against input characters.
X.ti -2m
XSI-subs --- this is usually an empty string (i.e., input sequence which
Xmatched SI-match is removed), but if it is not, the input characters which
Xmatched the SI-match are replaced with the SI-subs.
X.sp
X.in -2m
X.br
XThe KOI7 case described above may be specified as:
X.nf
X.ta 5m 10m 15m 20m 25m
X.nf
X 2 # 2 input sets
X ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 "\|\\\|017" ""\0\0\0\0 # Cyrillic(set 2)
X or
X 2 # 2 sets
X "\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 2)
X.fi
X.br
XBefore the input is processed, the program is initialized to the first
Xcharacter set. In the above case this is important, since the declaration:
X.nf
X 2 # 2 sets
X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 1)
X "\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 2)
X.br
X.fi
Xwould be wrong and would mess up the Latin characters preceding
Xthe first Cyrillic sequence.
X.sp 1
XThe nesting sequences are used only for specific situations. I needed them
Xto write a transliteration table from LaTeX to KOI8.
XIn LaTeX the { } pair is used for grouping and appears frequently in
Xthe text. The sequence of cyrillic characters is also a group
Xin LaTeX.
XThe SHIFT-OUT sequence for Russian letters in LaTeX is (at least in
Xmy case): "{\|\\\|cyr ", and the end
Xof the Russian letters is marked by "}", but the "}" has to be the
Xbracket matching the opening "{" in "{\|\\\|cyr ", not just any bracket.
XFor this reason, my SHIFT-OUT/IN entry was in this case:
X.br
X "{\|\\\|cyr " "" "{" "}" "}" "" # Cyrillic codes
X.br
XWhenever the "{\|\\\|cyr " is found, the program zeroes the counter.
XIt adds 1 to it when the NEST-up sequence (i.e., the "{" here) is found, and
Xsubtracts 1 from it when the NEST-down sequence (i.e., the "}") is found.
XThe checking for a SHIFT-IN sequence (i.e., the "}") for the cyrillic set
Xis done only when
Xthe counter value is zero (i.e., all pairs inside the cyrillic text are
Xmatched). In fact, the process is more
Xcomplicated than that (the counter for an opened character set is
Xplaced on the stack), but these are details you can find in the code
Xitself.
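X.sp
XThe counting rule above can be sketched in a few lines of Python. This is
Xonly my illustration of the idea, not the code used by the program: the
Xnames find_shift_in and cyrillic_spans are hypothetical, only plain strings
Xare handled (no regular expressions), and a single counter is kept instead
Xof the stack of counters mentioned above.
X.nf
```python
# A sketch of the NEST-up / NEST-down counting rule (assumed names, plain
# strings only, one counter instead of the stack the real program keeps).
BACKSLASH = chr(92)
SHIFT_OUT = "{" + BACKSLASH + "cyr "   # SO-match from the LaTeX entry above


def find_shift_in(text, start, nest_up="{", nest_down="}", si_match="}"):
    """Return the index of the SHIFT-IN sequence closing the set opened
    just before `start`, or -1 if the input ends first."""
    counter = 0                          # zeroed when SHIFT-OUT is found
    i = start
    while i < len(text):
        # SHIFT-IN is checked only while all inner pairs are matched.
        if counter == 0 and text.startswith(si_match, i):
            return i
        if text.startswith(nest_up, i):
            counter += 1
        elif text.startswith(nest_down, i):
            counter -= 1
        i += 1
    return -1


def cyrillic_spans(text):
    """Collect the text found between each SHIFT-OUT and its SHIFT-IN."""
    spans = []
    pos = text.find(SHIFT_OUT)
    while pos != -1:
        start = pos + len(SHIFT_OUT)
        end = find_shift_in(text, start)
        if end == -1:
            break
        spans.append(text[start:end])
        pos = text.find(SHIFT_OUT, end + 1)
    return spans
```
X.fi
XFor an input where the cyrillic group itself contains a { } pair, the inner
X"}" only decrements the counter, and the next "}" seen at counter zero is
Xtaken as the SHIFT-IN.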
X.sp
X.ti -3m
XAd.7) Number of output "character sets". This is analogous to the input case.
XThe characters sent to output may belong to different sets. For example,
Xwhen the character (or the sequence) from set 2 is followed by the character
X(or the sequence) from set 1,
Xthe program first sends the SHIFT-IN sequence for set 2 (if it is not
Xempty) and then the SHIFT-OUT sequence for set 1 (if it is not empty). If the
Xoutput character (or sequence) is assigned to set 0, then no SHIFT-IN/SHIFT-OUT
Xsequences are sent to output.
X.br
XIf there is only one set of output characters, you should specify 0.
XNote that you may have several input sets and several output sets, though
Xthis is rare. Usually, you have one input set and many
Xoutput character sets, or vice versa. Again, if you have only one output set,
Xyou do not have any SHIFT-IN/SHIFT-OUT sequences, since those are
Xsent to output only when a set number is changed.
XBut you are free to experiment.
X.sp
X.ti -3m
XAd.8) SHIFT-OUT/SHIFT-IN sequences for each output character set. It is
Xsimilar to the input case, however, the NEST-up and NEST-down sequences
Xare not used here. Again, before any text is sent to output, the
Xcharacter set specified as the first one is assumed. If SHIFT-OUT/IN
Xsequences are not used (i.e., you have only one output character set),
Xyou will not have any SHIFT-OUT/SHIFT-IN data lines.
XThe KOI8 (single character set containing all Latin and Russian letters)
Xto KOI7 (the set using overlapping codes switched by SHIFT-OUT/IN sequences)
Xconversion could be therefore accomplished by the following table:
X.br
X 2 # 2 output sets
X.br
X ""\0\0\0\0 ""\0\0\0\0 # Latin Letters
X.br
X "\|\\\|016" "\|\\\|017" # Russian Letters
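X.sp
XThe way SHIFT-IN/SHIFT-OUT sequences are sent to output when the set number
Xchanges can be sketched in Python as follows. This is an illustration under
Xmy own assumptions, not the program's code: the names OUTPUT_SETS and emit
Xare hypothetical, the two sets are those of the KOI7 table above, and
Xnothing is assumed about what happens at the end of the file.
X.nf
```python
# A sketch of output set switching (assumed names; sets as in the KOI7 table).
SO = chr(0x0e)   # SHIFT-OUT, CTRL-N, octal 016
SI = chr(0x0f)   # SHIFT-IN,  CTRL-O, octal 017

# set number -> (SHIFT-OUT sequence, SHIFT-IN sequence)
OUTPUT_SETS = {
    1: ("", ""),   # Latin letters
    2: (SO, SI),   # Russian letters
}


def emit(pieces, output_sets=OUTPUT_SETS, initial_set=1):
    """pieces is a list of (output_set_number, output_sequence) pairs."""
    out = []
    current = initial_set                     # the first set is assumed
    for set_no, seq in pieces:
        # Set 0 never changes the current output set.
        if set_no != 0 and set_no != current:
            out.append(output_sets[current][1])   # SHIFT-IN of the old set
            out.append(output_sets[set_no][0])    # SHIFT-OUT of the new set
            current = set_no
        out.append(seq)
    return "".join(out)
```
X.fi
XA piece assigned to set 0 (for example a space) passes through without any
Xshift sequences, whichever set is current.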
X.sp
X.ti -3m
XAd.9) Transliteration table for individual characters or their sequences.
XIt is the core of your transliteration data.
XThere are 4 columns in the transliteration
Xtable:
X.br
X.in +3m
X(inp_set_no) (inp_seq) (out_set_no) (out_seq)
X.br
X.in -3m
XThese 4 columns are separated by spaces. The (input_set_number)
Xcorresponds to the input character set number as specified above for
Xinput SHIFT-OUT/SHIFT-IN data, or zero.
XIf zero is used (even if the number of input sets is not zero), the
X(input_sequence) will always be matched, irrespective of the current
Xinput character set imposed by the SHIFT-OUT sequence. This is useful,
Xsince some characters are universal (e.g., new lines, spaces, pluses,
Xminuses, etc.), irrespective of the current character set.
XThe (input_sequence) is the sequence of characters to be matched with
Xcharacters in the input file, and if found (within the character set
Xspecified) it is replaced by the (output_sequence) and sent to output
X(i.e., the matching is interrupted, the (output_sequence) is sent to output,
Xthe input file pointer is moved to the first character after the
Xmatched sequence and matching resumes).
XThe (output_set_number) specifies the output character set. When the
Xoutput character set changes during transliteration, the appropriate SHIFT-IN
Xsequence of the previous set and the current set's SHIFT-OUT sequence are sent
Xto output. The (output_set_number) may also be zero (even if the number of
Xoutput sets is not zero). In this case, the current output set status
Xis not changed, and no SHIFT-IN/OUT sequences are sent to output. Lastly, the
Xoutput set code may be -1, -2 or -3.
XIn this case, the substitution is performed
Xwithin the input string that matched, but the output sequence is not sent to
Xthe output yet. Depending on the code, the following action is performed:
X.in +4m
X.ti -2m
X-1 --- program makes the substitution in the input string (i.e., substitutes
Xthe matching string with the output sequence in the input buffer).
XIt does not send the output sequence to the output, but
Xcontinues matching input sequences following the currently
Xmatched one.
X.ti -2m
X-2 --- like code -1, but matching is resumed from the first sequence on
Xthe list.
X.ti -2m
X-3 --- like code -1, but matching is resumed from the input SHIFT-OUT/IN
Xsequences.
X.in -4m
XE.g., if the unprocessed text in the input file is:
X.br
X mental procedure was not successful since..........
X.br
Xand there was a line in transliteration table:
X.br
X 0 "me" -1 "you"
X.br
Xthe input text would be changed to:
X.br
X yountal procedure was not successful since..........
X.br
Xand all remaining matching data would be applied to this text, rather than
Xoriginal text.
XThe -2 code backsteps to the point where the matching of
Xtransliteration starts.
XThe -3 code backsteps even further, to the point where the
Xinput SHIFT-OUT and SHIFT-IN sequences are matched.
XSince the order of sequences to match
Xis crucial here, for the case of output set code -1/-2/-3
Xeven one-character input sequences are matched in the order specified.
XBE CAREFUL HERE. You may create infinite loops. If you use
Xcode -2/-3, be sure that the resulting sequence after substitution
Xwith the code -2/-3, will not match previous sequences
Xwith codes -2/-3.
X.br
XThe (output_sequence)
Xis a sequence which substitutes the corresponding (input_sequence).
XIf (output_sequence) is "" (i.e., empty string) then (input_sequence)
Xis effectively deleted.
XThe (input_sequence)s are compared with input in the order specified
Xunless backstepping -2/-3 code is used (the matching is done from the
Xfirst sequence again). I use the code -1 e.g.,
Xto dehyphenate words when changing to LaTeX.
XCode -2 is useful if you want to skip next comparisons, and the resulting
Xsubstitution string will match earlier matching expressions.
XI do not see any use for the code -3, but you may have one.
XThe order for multicharacter sequences is
Xtherefore important (the single character sequences are always compared
Xafter all multicharacter sequences, and can be therefore put anywhere).
XThe longer multicharacter sequences should be specified before
Xshorter ones, unless they are some "preprocessing" steps with codes
X-1/-2/-3. The order may sometimes be crucial.
XIf you need single character sequences matched in a specific order,
Xenter them as regular expressions, i.e., as {c} instead of "c".
XIn short, the multicharacter input sequences and regular expressions
Xare matched to input text in the order specified. For the sake of
Xefficiency, the single character input sequences (with exception of
Xoutput set code -1/-2/-3) and input lists are handled as a case of remapping
Xand are matched in the order of character codes associated with them.
XIf you specify the same single input character twice for a given input set,
Xthe program will complain.
XThe following combinations of input and output sequences are allowed:
X.nf
X.ta 2m 24m
X Input Sequence Output Sequence
X "\fIplain string\fR" only "\fIplain string\fR"
X [\fIlist\fR] [\fIlist\fR] or "\fIplain string\fR"
X {\fIregular expression\fR} {\fIsubstitution expression\fR} or
X.br
X "\fIplain string\fR"
X.br
X.fi
XWhen a match is found, the matching sequence is removed and substituted
Xwith an output sequence. If this results in changing the current output
Xcharacter set, the appropriate SHIFT-IN/SHIFT-OUT pair is sent to the
Xoutput before the transliterated output sequence. If a list is
Xused as the input sequence, you may either use:
X.br
X.in +2m
X.ti -2m
Xa) plain string as output
Xsequence. In this case, if current input character belongs to the input list,
Xit is replaced by the output string. I use it to delete ranges of
Xcharacters which do not have any corresponding characters in the output
Xset (e.g., some graphics characters). In this case, the order of
Xcharacters on the input list is not important.
X.ti -2m
Xb) if the output string is also a
Xlist then it has to contain exactly the same number of characters as
Xthe input list. In this case, the 1st character from the input list
Xis replaced by the 1st character from the output list, the 2nd one
Xby the 2nd one, etc. Therefore, the order of characters is important.
X.br
X.in -2m
XTheoretically, if there is one-to-one correspondence between characters
Xin the input set and characters in the output set,
Xyou can make the conversion by
Xusing a single line consisting of two lists. But it looks ugly... And is
Xdifficult to read.
XAnd for the program, the substitution takes the same time, if
Xthe characters are specified separately, or when they are specified
Xas matching lists.
XIf a regular expression is used to match the input characters, the matching
Xsequence may be replaced by a plain string or a substitution string,
Xwhich was described above.
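X.sp
XThe two ways of using an input list, a) and b) above, amount to building
Xa remapping table. A minimal Python sketch follows. The names build_remap
Xand apply_remap are hypothetical and the lists are given as Python lists
Xof single characters. This is not the program's own code.
X.nf
```python
# A sketch of list handling: a) list -> plain string, b) list -> list.
def build_remap(input_list, output):
    """Return a dict mapping each input character to its replacement."""
    if isinstance(output, list):
        # Case b): both lists must have exactly the same length, and the
        # n-th input character maps to the n-th output character.
        if len(output) != len(input_list):
            raise ValueError("input and output lists differ in length")
        return dict(zip(input_list, output))
    # Case a): every character on the list maps to the same plain string,
    # so the order of characters on the input list does not matter.
    return {ch: output for ch in input_list}


def apply_remap(text, table):
    """Characters not present in the table are output as they are."""
    return "".join(table.get(ch, ch) for ch in text)
```
X.fi
XWith an empty string as the output, case a) deletes every character that
Xis on the input list.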
X.in +3m
XExamples:
X.br
X.ta 3m 10m 20m 30m 40m
X 2 "CCCP" 0 ""\0\0\0\0
X.br
Xwill delete all occurrences of CCCP from the input file (but not Cccp or
XCCCp) for input set 2.
X.sp 1
X 0 "\|\\\|0xD1" 0 "ya"
X.br
Xwill replace all occurrences of the character with the code \|\\\|0xD1 with a two
Xletter sequence "ya".
X.sp 1
X 0 \|\\\|0xD1 2 q
X.br
Xwill replace all characters \|\\\|0xD1 with a character "q" and output
XSHIFT-IN/OUT sequence if necessary.
X.sp 1
X 2 "q" 0 "\|\\\|0xD1"
X.br
Xwill replace letter q (if the current input set is 2) with a code \|\\\|0xD1.
X.sp 1
X 0 "\|\\\|0xD1" 2 "ya"
X.br
Xwill replace code \|\\\|0xD1 with a sequence ya (assuming that SHIFT-OUT
Xand SHIFT-IN sequences
Xfor output set 2 are: {\|\\\|cyr and }, respectively, you will get {\|\\\|cyr ya}).
X.sp
XIf a character is not specified in the transliteration table, it will
Xbe output as is, i.e., it corresponds to a line:
X.br
X 0 "c" 0 "c"
X.br
Xwhere c is the character. If you want to delete certain characters, you
Xneed to explicitly specify this, e.g.:
X.br
X 0 [a-z] 0 ""
X.br
Xwill delete all lower case Latin letters from the text.
X.in -3m
XBefore you decide to create your own transliteration file, please examine
Xexisting transliteration files. Do yourself (and others) a favor --- put
Xas many comments as possible there. If you allow others to use your
Xtransliteration files, please include your name and e-mail address
Xand file creation date.
X.in -4m
X.sp 2
XThe program matches the sequences in a specific order:
X.in +4m
X.ti -2m
X\01) Match/substitute input SHIFT-OUT sequences
X.ti -2m
X\02) If matched, save current set and start new one
X.ti -2m
X\03) If matched, zero nest counter for NEST sequences
X.ti -2m
X\04) Match/substitute current set SHIFT-IN-sequence
X.ti -2m
X\05) If matched, restore previous set number
X.ti -2m
X\06) If matched, restore previous set nest counter
X.ti -2m
X\07) Match/substitute transliteration sequences
X.ti -2m
X\08) If matched and code = -1 make substitution in input buffer and
Xcontinue matching the next sequence.
X.ti -2m
X\09) If matched and code = -2 make substitution and goto 7)
X.ti -2m
X10) If matched and code = -3 make substitution and goto 1)
X.ti -2m
X11) Match (no substitution) NEST-up and NEST-down to input buffer
X.ti -2m
X12) If NEST-up matched, increment counter for current set
X.ti -2m
X13) If NEST-down matched, decrement counter for current set
X.ti -2m
X14) If match in 7) send substitute sequence to output
X.ti -2m
X15) If no match in 7) (or code -1) output current input character
X.ti -2m
X16) Advance input pointer to point at new characters
X.ti -2m
X17) If End of File, break
X.ti -2m
X18) Goto 1)
X.br
X.fi
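X.sp
XSteps 7) to 10) and 14) to 16) of the list above can be sketched in Python
Xas shown below. This is a simplified illustration and not the program's
Xcode: the name transliterate is hypothetical, only non-empty plain strings
Xare matched, character sets and SHIFT-OUT/IN and NEST handling are left
Xout (so code -3 is not covered), and the max_restarts guard against
Xinfinite loops is my own addition.
X.nf
```python
# A sketch of the matching loop for plain strings with codes -1 and -2.
def transliterate(text, rules, max_restarts=1000):
    """rules is a list of (input_sequence, code, output_sequence) tuples,
    tried in the order given. Input sequences are assumed non-empty."""
    buf = text
    pos = 0
    out = []
    while pos < len(buf):
        i = 0
        restarts = 0
        emitted = False
        while i < len(rules):
            inp, code, outseq = rules[i]
            if buf.startswith(inp, pos):
                if code in (-1, -2):
                    # Substitute inside the input buffer, send nothing yet.
                    buf = buf[:pos] + outseq + buf[pos + len(inp):]
                    if code == -2:
                        restarts += 1
                        if restarts > max_restarts:
                            raise RuntimeError("probable infinite loop")
                        i = 0            # code -2: back to the first rule
                    else:
                        i += 1           # code -1: go on with the next rule
                    continue
                out.append(outseq)       # ordinary match: send to output
                pos += len(inp)          # and move past the matched text
                emitted = True
                break
            i += 1
        if not emitted and pos < len(buf):
            out.append(buf[pos])         # no match: copy the character
            pos += 1
    return "".join(out)
```
X.fi
XWith the single rule ("me", -1, "you") the input "mental" comes out as
X"yountal", as in the example given for code -1.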
X
X.PP
X.SH ASCII CHARACTER CODES
X.nf
X.ta 2m 6m 9m 13m 16m 20m 22m 26m 29m 33m 36m 40m
X dec hx oct ch dec hx oct ch
X
X \0\00 00 000 ^@ NUL \064 40 100 @
X \0\01 01 001 ^A SOH \065 41 101 A
X \0\02 02 002 ^B STX \066 42 102 B
X \0\03 03 003 ^C ETX \067 43 103 C
X \0\04 04 004 ^D EOT \068 44 104 D
X \0\05 05 005 ^E ENQ \069 45 105 E
X \0\06 06 006 ^F ACK \070 46 106 F
X \0\07 07 007 ^G BEL \071 47 107 G
X \0\08 08 010 ^H BS \072 48 110 H
X \0\09 09 011 ^I HT \073 49 111 I
X \010 0a 012 ^J LF \074 4a 112 J
X \011 0b 013 ^K VT \075 4b 113 K
X \012 0c 014 ^L FF \076 4c 114 L
X \013 0d 015 ^M CR \077 4d 115 M
X \014 0e 016 ^N SO \078 4e 116 N
X \015 0f 017 ^O SI \079 4f 117 O
X \016 10 020 ^P DLE \080 50 120 P
X \017 11 021 ^Q DC1 \081 51 121 Q
X \018 12 022 ^R DC2 \082 52 122 R
X \019 13 023 ^S DC3 \083 53 123 S
X \020 14 024 ^T DC4 \084 54 124 T
X \021 15 025 ^U NAK \085 55 125 U
X \022 16 026 ^V SYN \086 56 126 V
X \023 17 027 ^W ETB \087 57 127 W
X \024 18 030 ^X CAN \088 58 130 X
X \025 19 031 ^Y EM \089 59 131 Y
X \026 1a 032 ^Z SUB \090 5a 132 Z
X \027 1b 033 ^[ ESC \091 5b 133 [
X \028 1c 034 ^\\ FS \092 5c 134 \\
X \029 1d 035 ^] GS \093 5d 135 ]
X \030 1e 036 ^^ RS \094 5e 136 ^
X \031 1f 037 ^_ US \095 5f 137 _
X \032 20 040 SP \096 60 140 `
X \033 21 041 ! \097 61 141 a
X \034 22 042 " \098 62 142 b
X \035 23 043 # \099 63 143 c
X \036 24 044 $ 100 64 144 d
X \037 25 045 % 101 65 145 e
X \038 26 046 & 102 66 146 f
X \039 27 047 ' 103 67 147 g
X \040 28 050 ( 104 68 150 h
X \041 29 051 ) 105 69 151 i
X \042 2a 052 * 106 6a 152 j
X \043 2b 053 + 107 6b 153 k
X \044 2c 054 , 108 6c 154 l
X \045 2d 055 - 109 6d 155 m
X \046 2e 056 . 110 6e 156 n
X \047 2f 057 / 111 6f 157 o
X \048 30 060 0 112 70 160 p
X \049 31 061 1 113 71 161 q
X \050 32 062 2 114 72 162 r
X \051 33 063 3 115 73 163 s
X \052 34 064 4 116 74 164 t
X \053 35 065 5 117 75 165 u
X \054 36 066 6 118 76 166 v
X \055 37 067 7 119 77 167 w
X \056 38 070 8 120 78 170 x
X \057 39 071 9 121 79 171 y
X \058 3a 072 : 122 7a 172 z
X \059 3b 073 ; 123 7b 173 {
X \060 3c 074 < 124 7c 174 |
X \061 3d 075 = 125 7d 175 }
X \062 3e 076 > 126 7e 176 ~
X \063 3f 077 ? 127 7f 177 DEL
X
X.br
X
X.SH CONVERSION: DECIMAL<-->OCTAL<-->HEX.
X.nf
X.cs R 24
X 000 000 00 064 100 40 128 200 80 192 300 C0
X 001 001 01 065 101 41 129 201 81 193 301 C1
X 002 002 02 066 102 42 130 202 82 194 302 C2
X 003 003 03 067 103 43 131 203 83 195 303 C3
X 004 004 04 068 104 44 132 204 84 196 304 C4
X 005 005 05 069 105 45 133 205 85 197 305 C5
X 006 006 06 070 106 46 134 206 86 198 306 C6
X 007 007 07 071 107 47 135 207 87 199 307 C7
X 008 010 08 072 110 48 136 210 88 200 310 C8
X 009 011 09 073 111 49 137 211 89 201 311 C9
X 010 012 0A 074 112 4A 138 212 8A 202 312 CA
X 011 013 0B 075 113 4B 139 213 8B 203 313 CB
X 012 014 0C 076 114 4C 140 214 8C 204 314 CC
X 013 015 0D 077 115 4D 141 215 8D 205 315 CD
X 014 016 0E 078 116 4E 142 216 8E 206 316 CE
X 015 017 0F 079 117 4F 143 217 8F 207 317 CF
X 016 020 10 080 120 50 144 220 90 208 320 D0
X 017 021 11 081 121 51 145 221 91 209 321 D1
X 018 022 12 082 122 52 146 222 92 210 322 D2
X 019 023 13 083 123 53 147 223 93 211 323 D3
X 020 024 14 084 124 54 148 224 94 212 324 D4
X 021 025 15 085 125 55 149 225 95 213 325 D5
X 022 026 16 086 126 56 150 226 96 214 326 D6
X 023 027 17 087 127 57 151 227 97 215 327 D7
X 024 030 18 088 130 58 152 230 98 216 330 D8
X 025 031 19 089 131 59 153 231 99 217 331 D9
X 026 032 1A 090 132 5A 154 232 9A 218 332 DA
X 027 033 1B 091 133 5B 155 233 9B 219 333 DB
X 028 034 1C 092 134 5C 156 234 9C 220 334 DC
X 029 035 1D 093 135 5D 157 235 9D 221 335 DD
X 030 036 1E 094 136 5E 158 236 9E 222 336 DE
X 031 037 1F 095 137 5F 159 237 9F 223 337 DF
X 032 040 20 096 140 60 160 240 A0 224 340 E0
X 033 041 21 097 141 61 161 241 A1 225 341 E1
X 034 042 22 098 142 62 162 242 A2 226 342 E2
X 035 043 23 099 143 63 163 243 A3 227 343 E3
X 036 044 24 100 144 64 164 244 A4 228 344 E4
X 037 045 25 101 145 65 165 245 A5 229 345 E5
X 038 046 26 102 146 66 166 246 A6 230 346 E6
X 039 047 27 103 147 67 167 247 A7 231 347 E7
X 040 050 28 104 150 68 168 250 A8 232 350 E8
X 041 051 29 105 151 69 169 251 A9 233 351 E9
X 042 052 2A 106 152 6A 170 252 AA 234 352 EA
X 043 053 2B 107 153 6B 171 253 AB 235 353 EB
X 044 054 2C 108 154 6C 172 254 AC 236 354 EC
X 045 055 2D 109 155 6D 173 255 AD 237 355 ED
X 046 056 2E 110 156 6E 174 256 AE 238 356 EE
X 047 057 2F 111 157 6F 175 257 AF 239 357 EF
X 048 060 30 112 160 70 176 260 B0 240 360 F0
X 049 061 31 113 161 71 177 261 B1 241 361 F1
X 050 062 32 114 162 72 178 262 B2 242 362 F2
X 051 063 33 115 163 73 179 263 B3 243 363 F3
X 052 064 34 116 164 74 180 264 B4 244 364 F4
X 053 065 35 117 165 75 181 265 B5 245 365 F5
X 054 066 36 118 166 76 182 266 B6 246 366 F6
X 055 067 37 119 167 77 183 267 B7 247 367 F7
X 056 070 38 120 170 78 184 270 B8 248 370 F8
X 057 071 39 121 171 79 185 271 B9 249 371 F9
X 058 072 3A 122 172 7A 186 272 BA 250 372 FA
X 059 073 3B 123 173 7B 187 273 BB 251 373 FB
X 060 074 3C 124 174 7C 188 274 BC 252 374 FC
X 061 075 3D 125 175 7D 189 275 BD 253 375 FD
X 062 076 3E 126 176 7E 190 276 BE 254 376 FE
X 063 077 3F 127 177 7F 191 277 BF 255 377 FF
X.cs R
X.br
X.sp
X.fi
X
X.SH INSTALLATION
XThe program is distributed in source form. It was tested and ran under UN*X, VMS and
XMS-DOS systems. The file \fIreadme.doc\fR contains the details
Xon how to obtain the whole package. You can retrieve this file
Xfrom anonymous ftp on kekule.osc.edu in the directory /pub/russian/translit.
XYou can also obtain it via e-mail by sending a message:
X.br
X get translit/readme.doc from russian
X.br
Xto OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET.
X.sp
XThe source of the program consists of several files:
X.br
X.IP \fIpaths.h\fR
Xmust be edited before compilation. It contains its
Xown comments on what to do. The defines in this file relate to the operating
Xsystem you are using and the default path for searching for the transliteration
Xtable.
X.br
X.IP \fItranslit.c\fR
XIt contains the main program.
XThis was intended to be a portable code.
X.br
X.IP \fIreg_exp.h\fR
Xthe include file for regular expression matching
Xlibrary of Henry Spencer from the University of Toronto. This regular
Xexpression package was posted to comp.sources.misc (volume 3). Also 4 patches
Xwere posted (in volumes: 3, 4, 4, 10). I applied the patches to the original
Xcode and made small modifications to the code, which are marked in the
Xsource code.
X.br
X.IP \fIreg_exp.c\fR
Xthe regular expression library for compilation and
Xmatching of regular expressions.
X.br
X.IP \fIreg_sub.c\fR
Xthe regular expression substitution routine.
X.br
X.sp
X.PP
XBefore you compile this program you have to edit \fIpaths.h\fR.
XRead comments in the file.
XDuring compilation, all source code should reside in the
Xcurrent directory.
X.br
XThen you may compile the program under UN*X as (for example):
X.br
X cc -o translit translit.c reg_exp.c reg_sub.c
X.br
Xand copy the program \fItranslit\fR to some standard directory which is
Xin users' path (for example: /usr/local/bin). Then you need to copy
Xtransliteration tables to the directory which you have chosen in \fIpaths.h\fR.
XIf you get errors, then it is not OK. Please, report them to the author (with
Xall the gory details: error message, line number, machine, operating system,
Xetc.).
X.sp
XUnder VMS (VAXes) you need to compile it as:
X.br
X cc translit
X.br
X cc reg_exp
X.br
X cc reg_sub
X.br
X link translit+reg_exp+reg_sub,sys$library:vaxcrtl/lib
X.br
Xand before you can use the program, you need to type (or better put into your
XLOGIN.COM file) a line:
X.br
X translit == "$SYS$USER:[ME.TRA]TRANSLIT.EXE"
X.br
Xor whatever is the full path to the \fItranslit\fR executable image which
Xyou created with LINK. Note the quotes and the $ sign in front of program
Xpath.
X.sp
XOn an IBM-PC I used Microsoft C 5.1 as:
X.br
X.in +2m
X.ti -1m
Xcl /FeTRANSLIT /AL /FPc /W1 /F 5000 /Ox /Gs translit.c reg_exp.c reg_sub.c
X.in -2m
X.sp 2
X.SH RULES, CONDITIONS AND AUTHOR'S WISHES
XYou can distribute this code and associated files under these conditions:
X.br
X.in +4m
X.ti -2m
X 1) You will distribute all files (even if you
Xthink that they are garbage). You may get the complete set from anonymous
Xftp at kekule.osc.edu in /pub/russian/translit. You can also get the program
Xand associated files via e-mail. To get the instructions for e-mail
Xdistribution send a line:
X.br
X send translit/readme.doc from russian
X.br
Xto OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET.
XYou are not allowed to distribute an incomplete distribution. The following
Xfiles should be present in the distribution:
X.ta 2m 22n
X.nf
X alt-gos.rus - ALT to GOSTCII table
X alt-koi8.rus - ALT to KOI8 table
X example.alt.uu - uuencoded example in ALT
X example.ko8.uu - uuencoded example in KOI8
X example.pho - phonetic transliteration example
X example.tex - LaTeX example
X gos-alt.rus - GOSTCII to ALT table
X gos-koi8.rus - GOSTCII to KOI8 table
X koi7-8.rus - KOI7 to KOI8 table
X koi7nl-8.rus - KOI7 (no Latin) to KOI8 table
X koi8-7.rus - KOI8 to KOI7 table
X koi8-alt.rus - KOI8 to ALT table
X koi8-gos.rus - KOI8 to GOSTCII table
X koi8-lc.rus - KOI8 to Library of Congress table
X koi8-phg.rus - KOI8 to GOST transliteration
X koi8-php.rus - KOI8 to Pokrovsky transliteration
X koi8-tex.rus - KOI8 to LaTeX conversion
X order.txt - Order form for ordering the program
X paths.h - Include file for translit.c
X phg-koi8.rus - GOST transliteration to KOI8
X pho-8sim.rus - Simple phonetic to KOI8
X pho-koi8.rus - Various phonetic to KOI8
X php-koi8.rus - Pokrovsky to KOI8
X readme.doc - short description of the files
X reg_exp.c - regular expression code by Henry Spencer
X reg_exp.h - include for reg_exp.c and reg_sub.c
X reg_sub.c - regular expression code by H. Spencer
X tex-koi8.rus - LaTeX to KOI8
X translit.c - TRANSLIT main program
X translit.ps - TRANSLIT manual in PostScript
X translit.1 - TRANSLIT manual in *roff
X translit.txt - Plain ASCII TRANSLIT manual
X.sp 1
X.fi
X.ti -2m
X 2) You may expand/change the files and the program and distribute modified
Xfiles, provided that you do
Xnot delete anything (you can always comment the unnecessary portions out)
Xand clearly mark your changes. Please send the copy of the modified
Xversion to the author, though you are not required to do so.
XI will give you all the credit for your enhancements. I simply wish that
Xthere is a single point of distribution for this code, so it is maintained
Xto some extent. If you create additional transliteration definition files,
Xplease, send them to the author if you may. I will add them to the program
Xdistribution. I want to fix bugs and expand/optimize this code,
Xbut I need your help.
XI need your transliteration files for languages which I do not know or
Xdo not use currently.
XYour suggestions for improving documentation are most welcome (I am not
Xa native English speaker).
X.ti -2m
X3) You will not charge money for the program and/or associated files,
Xexcept for media and copying costs. If you want to sell it, contact the author
Xfirst. Bear in mind
Xthat the regular expression package by Henry Spencer has some
Xcopyright restrictions.
XBut there are other regular expression packages which do not have these
Xrestrictions (which are not violated by this offering).
X.ti -2m
X4) I will gladly help you with advice on compiling this software and
Xtry to fix bugs when time allows. However, if you want a ready to run
Xexecutable, you need to order it for a very nominal fee from
X\fIJKL ENTERPRISES, INC.\fR as described in the file \fIorder.txt\fR
Xwhich must be a part of a complete distribution.
X.in -4m
X
X.SH AUTHOR
XJan Labanowski, P.O. Box 21821, Columbus, OH 43221-0821, USA.
XE-mail: jkl@osc.edu, JKL@OHSTPY.BITNET.
X
END_OF_FILE
if test 56776 -ne `wc -c <'translit.1'`; then
echo shar: \"'translit.1'\" unpacked with wrong size!
fi
# end of 'translit.1'
fi
echo shar: End of archive 1 \(of 10\).
cp /dev/null ark1isdone
MISSING=""
for I in 1 2 3 4 5 6 7 8 9 10 ; do
if test ! -f ark${I}isdone ; then
MISSING="${MISSING} ${I}"
fi
done
if test "${MISSING}" = "" ; then
echo You have unpacked all 10 archives.
rm -f ark[1-9]isdone ark[1-9][0-9]isdone
else
echo You still must unpack the following archives:
echo " " ${MISSING}
fi
exit 0
exit 0 # Just in case...