ADDPAGE

Dimitri Vulis
Department of Mathematics
CUNY Graduate Center
33 W 42 St
New York, NY 10036-8099 USA
dlv@cunyvms1.bitnet

June 22, 1989

1. General overview:

MS/PC DOS 3.3 (and later) is distributed with the following code pages:

437 - USA
865 - Norway
860 - Portuguese
863 - Canada-French
850 - Multilingual

This package will allow you to patch your copy of DOS to add more code pages. A Cyrillic code page, arbitrarily numbered 880, corresponding to the standards ISO 8859 part 5, ECMA 113 and GOST 19768-74, is provided with the package. There is also an INCOMPLETE Greek code page, arbitrarily numbered 890, corresponding to the standards ECMA 118 and ISO 8859-7 (and hence the appropriate ELOT standard) (see below).

Hardware requirements: you have to have EGA, VGA or MCGA to use code pages. DOS code pages don't work with CGA or Hercules.

Once the Cyrillic code page is loaded and selected, you will be able to correctly display on your screen documents that contain Cyrillic text coded in accordance with the above standards, in particular GOST-coded (Soviet) text.

In addition, a TSR program called KEYBRU will redefine your keyboard into a standard Russian layout when Scroll Lock is pressed. This program will allow you to type in Russian text. Unfortunately, there is no easy way to patch DOS (KEYBOARD.SYS) to add an additional keyboard layout.

In order to print out Cyrillic documents you need standard font(s) for your particular printer. A downloadable font for Epson FX (9-pin) is included. One of the updates for this program will include downloadable fonts for HP LJ.

A technique for printing out Cyrillic documents using TeX and the American Mathematical Society's Cyrillic fonts is provided. A Russian version of TeX, Donald Knuth's typesetting system, is currently being developed at various sites. A set of Russian hyphenation patterns is included.

2. The code pages.

ISO 8859-5/ECMA-113 includes the characters necessary to handle Russian, Bulgarian, Byelorussian, Macedonian, Serbo-Croatian and Ukrainian text. The positions of the Russian letters coincide with GOST. The positions of Serbian letters in ISO 8859-5 do NOT coincide with the Yugoslav standard JUS I.B1.003, ISO registration 146/147.

Several additional rasters that may be useful for Russian text (obsolete letters and guillemets) have been placed in the unused positions. Depending on your applications, you may want to replace some Serbian characters with them.

I made an effort to make the shapes of Russian letters distinct from the usually identical Latin letters, e.g. Russian А and Latin A. The shapes generally resemble those produced by Soviet DP equipment. If you improve the shape of any of the letters in 880.ASM, please let me know.

The Greek code page includes the characters necessary to handle Modern Greek text. The only accents are diaeresis and tonos. It does not include the asper, lenis, acute, grave and circumflex accents and their combinations that are necessary for Classical Greek.

I lack the expertise to draw the remaining Greek rasters. Although many people have expressed interest in a Greek code page, no one has been willing to write the rasters. I am distributing 890.ASM with most characters missing, hoping that someone will volunteer to complete it.

3. The keyboard driver.

Only the Russian keyboard driver is provided. Scroll Lock switches between Russian and Latin keyboards. The layout is:

  №  §  /  "  :  ,  .  ;  ?  ò  <  >     (Alt-1, etc)
  1  2  3  4  5  6  7  8  9  0  -  =
  й  ц  у  к  е  н  г  ш  щ  з  х  ъ
  q  w  e  r  t  y  u  i  o  p  [  ]
  ф  ы  в  а  п  р  о  л  д  ж  э
  a  s  d  f  g  h  j  k  l  ;  '
  я  ч  с  м  и  т  ь  б  ю  ё
  z  x  c  v  b  n  m  ,  .  /

The Russian letters are displayed correctly only if you've installed the code page properly; otherwise you will see funny symbols in their place.

Most newer Soviet keyboards have the /? key in the same place as US keyboards do. I put the ё character here, like on a typewriter, since it is used by students of Russian.
If you don't need this character, you may want to edit KEYBRU.ASM and comment out the 2 relevant lines.

Soviet keyboards have 2 extra keys (as do most European keyboards): <> and *. For this reason, in Russian mode some punctuation marks can only be obtained using Alt- and a key in the upper row.

Like any TSR (terminate and stay resident) program, KEYBRU may interfere with the normal operation of your computer. Whenever this happens, try changing the order in which your TSRs are loaded.

Caps Lock does not work properly for several keys. Also, Scroll Lock is examined when the character is dequeued, not when it is enqueued; this changes nothing unless you type ahead while switching between Russian and Latin modes. These problems would not occur if I could include the keyboard in KEYBOARD.SYS.

Later versions of this package may include standard Serbian, Ukrainian, etc, keyboard layouts as well. Please let me know if you know what these layouts are. The standard Greek keyboard layout includes dead keys, so it really has to be included in KEYBOARD.SYS.

4. Installation.

Notation: DOS directory is usually the root. Work directory may be a floppy disk (A:).

Make sure the following files are available:

EGA.CPI in the DOS directory (from your DOS distribution disk).
DISPLAY.SYS in the DOS directory (from your DOS distribution disk).
MODE.COM on your PATH (from your DOS distribution disk).
KEYBRU.COM on your PATH (from this archive).
880.CP and ADDPAGE.BAS in the work directory (from this archive).

Run the BASIC program ADDPAGE as follows:

------------------->Type:
C>                  BASIC A:ADDPAGE
CPI filename:       \DOS\EGA.CPI     (or just \EGA.CPI etc)
Target codepage:    880
In or Out?          I
CP filename:        A:880.CP         (or wherever)
Code page not in file---replacing...
<long wait>
Ok
                    SYSTEM
-----

Erase 880.CP and ADDPAGE.BAS from your hard disk; you don't need them anymore.

Use your favorite editor to add the following to your CONFIG.SYS:

DEVICE=C:\DOS\DISPLAY.SYS CON:=(EGA,437,(1,3))

If you are using EGA, not VGA, change the last 3 to 2. (You have to say 'EGA' even if you're really using a VGA. 1 is the number of code pages you are going to load; increase this number if you want to load other pages. 3 is the number of font variants; 3 is for VGA (16, 14 and 8 pixels high); 2 is for EGA (only 14 and 8). See your DOS manual if you need more info.)

Use your favorite editor to add the following to your AUTOEXEC.BAT:

MODE CON CP PREPARE=((880) \DOS\EGA.CPI)
MODE CON CP SELECT=880
KEYBRU

Reboot.

You can switch the code pages anytime using

MODE CON CP SELECT=437

for the U.S. page and

MODE CON CP SELECT=880

for the Cyrillic page. CHCP won't work.

5. Printing

To print Russian text on a 9-pin Epson FX printer, first send the downloadable font in EPSON9.FNT to the printer:

COPY/B EPSON9.FNT PRN

If you print from within a word processor, make sure it does not reset the printer and delete the fonts before you print.

The enclosed program TRR.C will translate Russian characters in STDIN into calls to AMS Cyrillic fonts for TeX in STDOUT. It is assumed that \mcyr is defined as a font or font family. (Just say \font\mcyr=mcyr10, if you're not sure.) This has been tested with both Plain TeX and LaTeX. Unfortunately, this approach is not compatible with TeX's hyphenation algorithm.

6. Hyphenation

The file RUSSHYPH.PAT contains an improved version of the patterns presented in my M.A. thesis, "An Implementation of Liang's Algorithm for the Russian language", submitted to CCNY in October of 1988. The patterns find all the valid and no invalid hyphens in a 50,000+ word dictionary (including inflections).

The main improvements, compared with the thesis, are:

a) I fixed a few incorrect hyphenations, e.g., по-мнить, from пом-нить, etc. I also keyed in the balance of the dictionary, and added a few words supplied by A. Samarin (see below).

b) All the technical terms that I could find that are borrowed from German, Dutch, English, etc, are hyphenated correctly.

c) The patterns won't split a single vowel off a part of many compound words. Thus, прямоу-гольник, би-олог, нео-бычный, не-офашизм, etc, are now suppressed. (Such breaks are not strictly illegal, but certainly offensive, and a good break is just 1 letter away.)

d) The patterns will handle many common abbreviations, such as борт-проводница, парт-учеба, комс-орг, etc. (Most such words are not considered to be part of the language, but occur often in certain kinds of texts.)

Although (c) and (d) sound like neat tricks, such words usually fit one of the common patterns.

I used a slightly modified PATGEN, running on a Kouwei computer from Barry Hu, Microstar, to generate the patterns. PATGEN is an extremely powerful tool, and I would never have generated such good patterns without it. As PATGEN reports:

127685 good, 0 bad, 0 missed

In April of 1989 I was informed by Alexander Samarin of the Institute for High Energy Physics in Serpukhov, USSR, currently at CEARN, that an algorithm for automatic hyphenation of Russian words was developed at IHEP and a preprint was published in 1983. I was unable to get the paper or the algorithm. Alexander Samarin kindly sent me a file with about 21,000 inflected Russian words, for which I am very grateful.

These patterns are 'final', in the sense that I don't expect to change or improve them in the future. I am not aware of any Russian words that are not hyphenated correctly by the patterns. It is possible to manufacture abbreviations (d) that won't be broken up completely, although invalid breaks are unlikely. It is very hard to find a compound word (c) where a single vowel might be split off. Of course, if you use the patterns and find any word that's not fully and correctly hyphenated, please let me know.

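To make the pattern format concrete, here is a minimal sketch in C of how patterns of this kind can be applied to a word: every matching pattern contributes its digits, the maximum is kept for each position, and odd values mark breaks. This is not code from the package. The names hyphenate, parse_pattern and MAXWORD are made up for the illustration, the word is assumed to be lowercase in a single-byte encoding (such as code page 880), and the sketch ignores refinements such as a minimum number of letters before or after a break. A '.' in a pattern is taken to stand for a word boundary.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAXWORD 64  /* longest word handled by this sketch (an assumption) */

/* Split a pattern such as "hen5at" into its letters ("henat") and the
   hyphenation value in front of each letter and after the last one
   (0 where no digit is written).  Returns the number of letters. */
static int parse_pattern(const char *pat, char *letters, int *values)
{
    int n = 0;

    values[0] = 0;
    for (; *pat != '\0'; pat++) {
        if (isdigit((unsigned char)*pat)) {
            values[n] = *pat - '0';
        } else {
            letters[n++] = *pat;
            values[n] = 0;
        }
    }
    letters[n] = '\0';
    return n;
}

/* Copy word to out, inserting '-' wherever the patterns permit a break.
   out must have room for at least 2 * strlen(word) bytes (minimum 1). */
void hyphenate(const char *word, const char **patterns, int npat, char *out)
{
    char dotted[MAXWORD + 3], letters[MAXWORD + 3];
    int score[MAXWORD + 3] = {0}, values[MAXWORD + 3];
    int len, dlen, n, p, i, k;

    assert(word != NULL && out != NULL);
    len = (int)strlen(word);
    if (len > MAXWORD) {            /* too long for the sketch: no breaks */
        strcpy(out, word);
        return;
    }

    /* Surround the word with '.' so patterns can refer to its edges. */
    dotted[0] = '.';
    memcpy(dotted + 1, word, (size_t)len);
    dotted[len + 1] = '.';
    dotted[len + 2] = '\0';
    dlen = len + 2;

    /* score[j] is the largest value seen for the gap before dotted[j]. */
    for (p = 0; p < npat; p++) {
        if (strlen(patterns[p]) > MAXWORD + 2)
            continue;               /* cannot match, and would not fit */
        n = parse_pattern(patterns[p], letters, values);
        if (n == 0)
            continue;
        for (i = 0; i + n <= dlen; i++) {
            if (strncmp(dotted + i, letters, (size_t)n) != 0)
                continue;
            for (k = 0; k <= n; k++)
                if (values[k] > score[i + k])
                    score[i + k] = values[k];
        }
    }

    /* An odd value between two letters of the word permits a break. */
    for (i = 0; i < len; i++) {
        *out++ = word[i];
        if (i < len - 1 && (score[i + 2] & 1))
            *out++ = '-';
    }
    *out = '\0';
}
```

For example, with the nine English patterns hy3ph, he2n, hena4, hen5at, 1na, n2at, 1tio, 2io and o2n (chosen only for illustration; they are not taken from RUSSHYPH.PAT), calling hyphenate on "hyphenation" yields "hy-phen-ation". The same function works unchanged on Russian words as long as the word and the patterns use the same single-byte encoding.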
The patterns, meant as input to Liang's algorithm, consist of strings of letters and digits, where a digit placed between two letters indicates a `hyphenation value' for its position. Odd values permit breaks; even values (including zero, assumed when the digit is omitted) prohibit breaks. The text processing program finds all the patterns whose letters match part of the word, takes the maximum hyphenation value for each position between letters and examines its parity to exhibit the legal breaks.

I have been told that both Microsoft Word and WordPerfect use Liang's algorithm for hyphenation, but have English patterns hardwired.

7. Possible problems

Q: The installation is too complex.
A: I cannot give out a patched EGA.CPI because it's copyrighted. Ask someone to help.

Q: BASIC won't run on this machine.
A: Do the installation on another machine that has BASIC, and then copy EGA.CPI.

Q: KEYBRU conflicts with other TSRs.
A: Sigh. Try changing the order in which the TSRs are loaded. A better solution would be to add the Russian keyboard layout to KEYBOARD.SYS and to use a vanilla DOS KEYB command; alas, I was unable to do it.

Q: Why does WordPerfect misinterpret some of the Russian letters?
A: I don't know. Typing lowercase р (224) causes WP to look for a file "alth.wmp". This is a problem with WP, not with the keyboard driver.

Q: Is it possible to change the codes for some of the letters?
A: Yes, you can alter 880.ASM, and MASM/LINK/EXE2BIN it to obtain 880.CP, and then repeat the installation. The resulting code page will not be compatible with 880, so it should be given a different number. This is not a good idea.

8. Technical remark

Here is some C code that uses code pages:

    #include <stdio.h>
    #include <dos.h>

    /* Compile with the compact memory model, so that data pointers
       are 32-bit (segment:offset). */
    static int buf[64];         /* receives the code page list */

    main()
    {
        union REGS regs;
        struct SREGS sregs;
        unsigned filhandl = 0;  /* or open /dev/con */
        int *foo = buf;         /* model = compact! A 32-bit pointer */
        short hwcpcount, prepcpcount;
        int i;

        segread(&sregs);        /* start from valid segment registers */

        regs.x.bx = filhandl;   /* handle for STDIN, hopefully = /dev/con */
        regs.h.ah = 0x44;       /* IOCTL */
        regs.h.al = 0x00;       /* get info */
        intdosx(&regs, &regs, &sregs);
        if (!(regs.x.dx & 0x4000))  /* code page supported bit */
        {
            printf("Code page not supported by STDIN\n");
            return;
        }

        regs.h.ah = 0x44;       /* IOCTL */
        regs.h.al = 0x0c;       /* generic character IOCTL */
        regs.x.bx = filhandl;
        regs.h.ch = 0x03;       /* console? */
        regs.h.cl = 0x6b;       /* query prepared code pages */
        sregs.ds = FP_SEG(foo); /* DS:DX -> buffer for the reply */
        regs.x.dx = FP_OFF(foo);
        intdosx(&regs, &regs, &sregs);
        if (regs.x.cflag)       /* carry set: the call failed */
        {
            printf("Query failed, error %d\n", regs.x.ax);
            return;
        }

        foo++;                  /* skip # bytes returned */
        hwcpcount = *foo++;
        printf("%d hardware pages: ", hwcpcount);
        for (i = 0; i < hwcpcount; i++)
            printf("%d ", *foo++);
        prepcpcount = *foo++;
        printf("\n%d prepared pages: ", prepcpcount);
        for (i = 0; i < prepcpcount; i++)
            printf("%d ", *foo++);
        printf("\n");
    }

9. Credits, acknowledgements, etc

The contents of this archive are placed in the public domain; all copyright is waived. You may use it as you please.

If you find this package useful, please let me know at:

DLV@CUNYVMS1.BITNET

or:

Dimitri Vulis
Department of Mathematics
CUNY Graduate Center
New York, NY 10036-8099
U.S. of A.

(Note: never use my old "529 W 111th" address listed in some directories!)

I may then notify you of updates (this is more likely if you provide an e-mail address reachable from BITNET).

Feel free to comment on the letter shapes. I always appreciate constructive criticism.

I would like to thank Burton Randol, Giorgio Mantzivis, Johann van Wingen, Donald Parsons and my father L.N.Klyukvin for their help with this project.

You may try contacting your DOS OEM and asking them to include the Cyrillic and Greek code pages in their standard DOS distribution.