ADDPAGE
Dimitri Vulis
Department of Mathematics
CUNY Graduate Center
33 W 42 St
New York, NY 10036-8099
USA
dlv@cunyvms1.bitnet
June 22, 1989
1. General overview:
MS/PC DOS 3.3 (and later) is distributed with the following code pages:
437 - USA
865 - Norway
860 - Portuguese
863 - Canada-French
850 - Multilingual
This package will allow you to patch your copy of DOS to add more code pages. A
Cyrillic code page, arbitrarily numbered 880, corresponding to the standards ISO
8859 part 5, ECMA 113 and GOST 19768-74 is provided with the package. There is
also an INCOMPLETE Greek code page, arbitrarily numbered 890, corresponding to
the standards ECMA 118 and ISO 8859-7 (and hence the appropriate ELOT standard)
(see below).
Hardware requirements: you have to have EGA, VGA or MCGA to use code pages. DOS
code pages don't work with CGA or Hercules.
Once the Cyrillic code page is loaded and selected, you will be able to
correctly display on your screen documents that contain Cyrillic text coded in
accordance with the above standards, in particular GOST-coded (Soviet) text.
In addition, a TSR program called KEYBRU will redefine your keyboard into a
standard Russian layout when ScrollLock is pressed. This program will allow you
to type in Russian text. Unfortunately, there is no easy way to patch DOS
(KEYBOARD.SYS) to add an additional keyboard layout.
In order to print out Cyrillic documents you need standard font(s) for your
particular printer. A downloadable font for Epson FX (9-pin) is included. One of
the updates for this program will include downloadable fonts for HP LJ. A
technique for printing out Cyrillic documents using TeX and the American
Mathematical Society's Cyrillic fonts is provided.
A Russian version of TeX, Donald Knuth's typesetting system, is currently being
developed at various sites. A set of Russian hyphenation patterns is included.
2. The code pages.
ISO 8859-5/ECMA-113 includes the characters necessary to handle Russian,
Bulgarian, Byelorussian, Macedonian, Serbo-Croatian and Ukrainian text.
The positions of the Russian letters coincide with GOST. The positions of
Serbian letters in ISO 8859-5 do NOT coincide with the Jugoslavian standard JUS
I.B1.003, ISO registration 146/147.
Several additional rasters that may be useful for Russian text (obsolete
letters and guillemets) have been placed in the unused positions. Depending on
your applications, you may want to replace some Serbian characters with them.
I made an effort to make the shapes of Russian letters distinct from the usually
identical Latin letters, e.g. Russian А and Latin A. The shapes generally resemble those
produced by Soviet DP equipment. If you improve the shape of any of the
letters in 880.ASM, please let me know.
The Greek code page includes the characters necessary to handle Modern Greek
text. The only accents are diaeresis and tonos. It does not include the asper,
lenis, acute, grave and circumflex accents and their combinations that are
necessary for Classical Greek.
I lack the expertise to draw the remaining Greek rasters. Although many people
have expressed interest in a Greek code page, no one has been willing to write
the rasters. I am distributing 890.ASM with most characters missing, hoping
that someone will volunteer to complete it.
3. The keyboard driver.
Only the Russian keyboard driver is provided. Scroll Lock switches between
Russian and Latin keyboards. The layout is:
№ § / " : , . ; ? ò < > (Alt-1, etc)
1 2 3 4 5 6 7 8 9 0 - =
й ц у к е н г ш щ з х ъ
q w e r t y u i o p [ ]
ф ы в а п р о л д ж э
a s d f g h j k l ; '
я ч с м и т ь б ю ё
z x c v b n m , . /
The Russian letters are displayed correctly only if you've installed the code
page properly; otherwise you will see funny symbols in their place.
Most newer Soviet keyboards have the /? key in the same place as US keyboards
do. I put the ё character here, like on a typewriter, since it is used by
students of Russian. If you don't need this character, you may want to edit
KEYBRU.ASM and comment out the 2 relevant lines. Soviet keyboards have 2 extra
keys (as do most European keyboards): <> and *. For this reason, in Russian
mode some punctuation marks can only be obtained using Alt- and a key in the
upper row.
Like any TSR (terminate and stay resident) program, KEYBRU may interfere with
the normal operation of your computer. Whenever this happens, try changing the
order in which your TSRs are loaded.
Caps Lock does not work properly for several keys. Also Scroll Lock is examined
when the character is dequeued, not when it is enqueued, which changes nothing,
unless you type ahead while switching between Russian and Latin modes. These
problems would not occur if I could include the keyboard in KEYBOARD.SYS.
Later versions of this package may include standard Serbian, Ukrainian, etc,
keyboard layouts as well. Please let me know if you know what these layouts
are.
The standard Greek keyboard layout includes dead keys, so it really has to be
included in KEYBOARD.SYS.
4. Installation.
Notation: DOS directory is usually the root. Work directory may be a floppy
disk (A:). Make sure the following files are available:
EGA.CPI in the DOS directory (from your DOS distribution disk).
DISPLAY.SYS in the DOS directory (from your DOS distribution disk).
MODE.COM on your PATH (from your DOS distribution disk).
KEYBRU.COM on your PATH (from this archive).
880.CP and ADDPAGE.BAS in the work directory (from this archive).
Run the basic program ADDPAGE as follows:
------------------->Type:
C> BASIC A:ADDPAGE
CPI filename: \DOS\EGA.CPI (or just \EGA.CPI etc)
Target codepage: 880
In or Out? I
CP filename: A:880.CP (or wherever)
Code page not in file---replacing...
<long wait>
Ok
SYSTEM
-----
Erase 880.CP and ADDPAGE.BAS from your hard disk; you don't need them anymore.
Use your favorite editor to add the following to your CONFIG.SYS:
DEVICE=C:\DOS\DISPLAY.SYS CON:=(EGA,437,(1,3))
If you are using EGA, not VGA, change the last 3 to 2.
(You have to say 'EGA' even if you're really using a VGA. 1 is the number of
code pages you are going to load; increase this number if you want to load
other pages. 3 is the number of font variants; 3 is for VGA (16, 14 and 8
pixels high); 2 is for EGA (only 14 and 8). See your DOS manual if you need
more info.)
Use your favorite editor to add the following to your AUTOEXEC.BAT:
MODE CON CP PREPARE=((880) \DOS\EGA.CPI)
MODE CON CP SELECT=880
KEYBRU
Reboot.
You can switch the code pages anytime using
MODE CON CP SELECT=437
for U.S. page and
MODE CON CP SELECT=880
for Cyrillic page. CHCP won't work.
5. Printing
To print Russian text on a 9-pin Epson FX printer, first send the downloadable
font in EPSON9.FNT to the printer: COPY/B EPSON9.FNT PRN. If you print from
within a word processor, make sure it does not reset the printer and delete the
fonts before you print.
The enclosed program TRR.C will translate Russian characters in STDIN into calls
to AMS Cyrillic fonts for TeX in STDOUT. It is assumed that \mcyr is defined as
a font or font family. (Just say \font\mcyr=mcyr10, if you're not sure.) This
has been tested with both Plain TeX and LaTeX. Unfortunately, this approach is
not compatible with TeX's hyphenation algorithm.
6. Hyphenation
The file RUSSHYPH.PAT contains an improved version of the patterns presented in
my M.A. thesis, "An Implementation of Liang's Algorithm for the Russian
language", submitted to CCNY in October of 1988. The patterns find all the valid
and no invalid hyphens in a 50,000+ word dictionary (including inflections). The
main improvements, compared with the thesis, are:
a) I fixed a few incorrect hyphenations, e.g., по-мнить, from пом-нить, etc. I
also keyed in the balance of the dictionary, and added a few words supplied by
A. Samarin (see below).
b) All the technical terms that I could find that are borrowed from German,
Dutch, English, etc, are hyphenated correctly.
c) The patterns won't split a single vowel off a part of many compound words.
Thus, прямоу-гольник, би-олог, нео-бычный, не-офашизм, etc, are now suppressed.
(Such breaks are not strictly illegal, but certainly offensive, and a good break
is just 1 letter away.)
d) The patterns will handle many common abbreviations, such as борт-проводница,
парт-учеба, комс-орг, etc. (Most such words are not considered to be part of the
language, but occur often in certain kinds of texts.)
Although (c) and (d) sound like neat tricks, such words usually fit one of the
common patterns. I used a slightly modified PATGEN, running on a Kouwei computer
from Barry Hu, Microstar, to generate the patterns. PATGEN is an extremely
powerful tool, and I would never have generated such good patterns without it.
As PATGEN says:
127685 good, 0 bad, 0 missed
In April of 1989 I was informed by Alexander Samarin of the Institute for High
Energy Physics in Serpukhov, USSR, currently at CEARN, that an algorithm for
automatic hyphenation of Russian words was developed at IHEP and a preprint was
published in 1983. I was unable to get the paper or the algorithm. Alexander
Samarin kindly sent me a file with about 21,000 inflected Russian words, for
which I am very grateful.
These patterns are 'final', in the sense that I don't expect to change or improve
them in the future. I am not aware of any Russian words that are not hyphenated
correctly by the patterns. It is possible to manufacture abbreviations (d) that
won't be broken up completely, although invalid breaks are unlikely. It is very
hard to find a compound word (c) where a single vowel might be split off. Of
course, if you use the patterns and find any word that's not fully and correctly
hyphenated, please let me know.
The patterns, meant as input to Liang's algorithm, consist of strings of
letters and digits, where a digit placed between two letters indicates a
`hyphenation value' for its position. Odd values permit breaks; even values
(including zero, assumed when the digit is omitted) prohibit breaks. The text
processing program finds all the patterns whose letters match part of the word,
takes the maximum hyphenation value for each position between letters and examines
its parity to exhibit the legal breaks.
I have been told that both Microsoft Word and WordPerfect use Liang's algorithm
for hyphenation, but have English patterns hardwired.
7. Possible problems
Q: The installation is too complex.
A: I cannot give out a patched EGA.CPI because it's copyrighted. Ask someone
to help.
Q: BASIC won't run on this machine.
A: Do the installation on another machine that has BASIC, and then copy
EGA.CPI.
Q: KEYBRU conflicts with other TSRs.
A: Sigh. Try changing the order in which the TSRs are loaded. A better solution
would be to add the Russian keyboard layout to KEYBOARD.SYS and to use a
vanilla DOS KEYB command; alas, I was unable to do it.
Q: Why does WordPerfect misinterpret some of the Russian letters?
A: I don't know. Typing the lowercase Russian р (code 224) causes WP to look for a file
"alth.wmp". This is a problem with WP, not with the keyboard driver.
Q: Is it possible to change the codes for some of the letters?
A: Yes, you can alter 880.ASM, and MASM/LINK/EXE2BIN it to obtain 880.CP, and
then repeat the installation. The resulting code page will not be compatible
with 880, so it should be given a different number. This is not a good idea.
8. Technical remark
Here is some C code that uses code pages:
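#include <stdio.h>
#include <dos.h>

/* For 16-bit DOS compilers (Microsoft C, Turbo C). The list returned by
   "query prepared code pages" holds: byte count, number of hardware
   pages, the pages, number of prepared pages, the pages. */
static unsigned cplist[64];

main()
{
    union REGS regs;
    struct SREGS sregs;
    unsigned filhandl=0;          /* or open /dev/con */
    unsigned far *foo=cplist;     /* far pointer, so any memory model will do */
    unsigned hwcpcount,prepcpcount;
    unsigned i;

    segread(&sregs);              /* start from the current segment registers */
    regs.x.bx=filhandl;           /* handle for STDIN, hopefully=/dev/con */
    regs.h.ah=0x44;               /* IOCTL */
    regs.h.al=0x00;               /* get info */
    intdosx(&regs,&regs,&sregs);
    if (regs.x.cflag || !(regs.x.dx&0x4000)) /* code page supported bit */
    {
        printf("Code page not supported by STDIN\n");
        return 1;
    }
    regs.h.ah=0x44;               /* IOCTL */
    regs.h.al=0x0c;               /* generic character IOCTL */
    regs.x.bx=filhandl;
    regs.h.ch=0x03;               /* console? */
    regs.h.cl=0x6b;               /* query prepared code pages */
    sregs.ds=FP_SEG(foo);         /* DS:DX -> buffer that receives the list */
    regs.x.dx=FP_OFF(foo);
    intdosx(&regs,&regs,&sregs);
    if (regs.x.cflag)
    {
        printf("Query failed, DOS error %u\n",regs.x.ax);
        return 1;
    }
    foo++;                        /* skip the # of bytes returned */
    hwcpcount=*foo++;
    printf("%u hardware pages: ",hwcpcount);
    for (i=0; i<hwcpcount; i++)
        printf("%u ",*foo++);
    prepcpcount=*foo++;
    printf("%u prepared pages: ",prepcpcount);
    for (i=0; i<prepcpcount; i++)
        printf("%u ",*foo++);
    printf("\n");
    return 0;
}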
9. Credits, acknowledgements, etc
The contents of this archive are placed in public domain; all copyright is
waived. You may use it as you please.
If you find this package useful, please let me know at:
DLV@CUNYVMS1.BITNET
or:
Dimitri Vulis
Department of Mathematics
CUNY Graduate Center
New York, NY 10036-8099
U.S. of A.
(Note: never use my old "529 W 111th" address listed in some directories!)
I may then notify you of updates (this is more likely if you provide an e-mail
address reachable from BITNET).
Feel free to comment on the letter shapes. I always appreciate constructive
criticism.
I would like to thank Burton Randol, Giorgio Mantzivis, Johann van
Wingen, Donald Parsons and my father L.N.Klyukvin for their help with
this project.
You may try contacting your DOS OEM and asking them to include the Cyrillic and
Greek code pages in their standard DOS distribution.