Usenet 1994 January

Internet Message Format | 1990-07-13 | 23.5 KB

From: root@hobbes.UUCP (John Plocher)
Newsgroups: comp.sources.misc
Subject: string compare for 8-bit non-English characters (accents...)
Message-ID: <4194@ncoast.UUCP>
Date: 17 Aug 87 01:06:06 GMT
Sender: allbery@ncoast.UUCP
Organization: U of Wisconsin - Madison Spanish Department
Lines: 592
Approved: allbery@ncoast.UUCP
X-Archive: comp.sources.misc/8708/17

In response to the discussion in comp.std.internat and the amiga group
about how programmers don't take into account non-English character sets
and their properties, I submit the following:

A strcmp() replacement which correctly handles accented characters and
non-ASCII collating.

        -John Plocher

/* This is NOT a shar */
/****************************************************************************
 *
 *      stracmp.c       string compare with accented characters and
 *                      non-ASCII collating sequences
 *
 *      Copyright 1985, 1987, 1987 by John Plocher (plocher@hobbes.UUCP)
 *      May be used in any product as long as this notice is retained and
 *      credit is given.
 *
 ****************************************************************************
 *
 *      Revision Control Information
 *
 *      By:             $Author: plocher $
 *      $Revision: 1.3 $
 *      Last modified:  $Date: 87/08/15 01:32:36 $
 *      Source is in:   $Source: /usr/src/local/biblos/RCS/stracmp.c,v $
 *      Release state:  $State: Usenet $
 *
 *      Library module
 *
 *      Modification Log
 *      ----------------
 *
 * $Log:        stracmp.c,v $
 * Revision 1.3  87/08/15  01:32:36  plocher
 * fixed crt-independent 8-bit character output
 *
 * Revision 1.2  87/08/15  01:17:28  plocher
 * passes lint with no complaints
 *
 * Revision 1.1  86/04/12
 * Revision 1.0  85/05/27
 *
 *
 ****************************************************************************
 *
 *      Compile with
 *
 *          cc -c stracmp.c                     # for a library object file
 *       - or -
 *          cc -o stracmp -DMAIN stracmp.c      # for a standalone testbed
 *
 ****************************************************************************
 *
 *      stracmp() implements a string compare which correctly handles
 *      accented (non-English) characters which have been encoded using
 *      8-bit characters.  It uses character lookup tables for doing
 *      string compares when accented characters are present and/or a
 *      non-ASCII collating sequence is desired.
 *
 *      Also, because this is used in bibliographic lookups, this routine
 *      supports the concept of comments within a string.  Everything
 *      between [ and ] (inclusive) is ignored for all comparisons.
 *      Comments may NOT be nested.  A comment with no closing ] runs to
 *      the end of the string ('\0'), but that is not the "correct" way
 *      to end one.
 *
 *      Reference:
 *
 *      Gibaldi, Joseph, and Walter S. Achtert.
 *      _MLA_Handbook_for_Writers_of_Research_Papers_.
 *      New York: Modern Language Association of America, 1984.
 *      Page 76.
 *
 ****************************************************************************
 *
 * Theory:
 * The correct way of sorting (or comparing) strings which contain
 * accented characters is to first compare the strings with all accents
 * stripped.  If the two strings are the same, then and only then are the
 * accents used.
 * This second comparison involves only the accents.
 * You can think of this as comparing the two strings with all the letters
 * stripped.
 *
 * Also, there are times when the "normal" ASCII collating sequence is
 * not appropriate for lexical ordering.  (i.e. A <AE> B C <CEDILLA> D ...)
 *
 ****************************************************************************
 * Examples:
 *****
 *                       ,  :
 *  Comparing Junta and Junta (the second word has diacritical
 *                             marks over the two vowels)
 *
 *      first we compare("Junta", "Junta") which shows them EQUAL
 *      then we must compare("     ", " '  :")
 *
 *                            ,  :
 *  Thus, Junta comes before Junta in the lexical ordering of the two words.
 *
 *****
 *             ,         ,
 *  Comparing Junta and Junto (both words have accented 'u's)
 *
 *      first we compare("Junta", "Junto"); since they are
 *      different we do not need to do anything more with the accents:
 *    ,                    ,
 *  "Junta" is less than "Junto".
 *
 ****************************************************************************
 *
 * Implementation:
 *
 *  The accented string is broken into two strings:
 *      1) a string of letter values with accents stripped, and
 *      2) a string of accent values with letters stripped.
 *
 *  The comparison is table based in order to speed things up and
 *  allow arbitrary collating sequences.
 *
 *  For a given character x, translate[x] is the "value"
 *  used for sorting with strcmp(), and accent[x]
 *  tells whether the character carries an accent, should
 *  be ignored, or is a normal character.  If accent[] indicates
 *  that the character carries a diacritical, the value of accent[]
 *  is used to rank the accented character against the same letter
 *  with different diacritics:
 *                  ,              .
 *  i.e. The letter a differs from a; which is less depends on the
 *  values of accent[].  If the values in accent[] for these two letters
 *  are the same, the accented letters are considered identical.
 *
 *  The stracmp() routine is fully protected against NULL pointers
 *  being passed as parameters.
 *  All internal space needed is taken from the heap with a single malloc()
 *  and free()'d on exit.  The heap space needed is
 *      2 * ( strlen(s1) + strlen(s2) ) + 8
 *  The stack space needed is 3 ints and 4 pointers.
 *  There are two static 256 element arrays of unsigned chars used for
 *  defining the accents and collation sequence.
 *
 *  The run time is
 *              TIME( strlen(s1) )
 *            + TIME( strcpy(x,s1) ) * K
 *            + TIME( strlen(s2) )
 *            + TIME( strcpy(x,s2) ) * K
 *            + TIME( strcmp(t1,t2) )
 *            +[TIME( strcmp(a1,a2) )]  (* iff needed *)
 *            + TIME( malloc() )
 *            + TIME( free() )
 *      where 1 < K < 2
 *
 ****************************************************************************
 */

#define VERSION "$Revision: 1.3 $"

/* #define MAIN         /* compile as a test program, not a library     */
/* #define ON_IBMPC     /* iff MAIN is defined does your crt show       */
                        /* the IBM character set?                       */
#define BRACKET_COMMENTS /* if defined, stuff within [ ]'s is ignored   */

#define IBMPC_ROM       /* Tables match the IBM PC ROM tables           */
/* #define ISO_LATIN_1  /* Tables for ISO LATIN-1 (ISO 8859-1)          */

/***************************************************************************/

#if defined(IBMPC_ROM) + defined(ISO_LATIN_1) != 1
        One and only one of these may be defined.
#endif

#include <stdio.h>      /* NULL and printf() are used by stracmp() itself, */
                        /* not only by the confidence test                 */

#ifdef MAIN
# ifdef ON_IBMPC
#   define PRINT printf
# else
#   define PRINT crtaccents
# endif
#endif

#ifndef lint
static char rcsid[] = "$Header: stracmp.c,v 1.3 87/08/15 01:32:36 plocher Usenet $";
#endif

extern char *malloc();
extern void exit();
extern void free();

#ifdef IBMPC_ROM        /* IBM-PC ROM based character set */

/* The translate table maps from a printable character to a "value".  This
 * "value" is used to determine sorting order ( a smaller "value" is less
 * than a larger "value" ).
 *
 * Note that in the following table, the letters 'C' and <Cedilla> are
 * both given the same "value".
 * This is because these two letters are
 * "the same" WHEN ACCENT MARKS ARE IGNORED.  (Same for all other accented
 * characters - they share the same value with the underlying character.)
 *
 * The table following this has the entry for <Cedilla> flagged as an
 * accent, the entry for 'C' does not.  Therefore, when sorting, a
 * <Cedilla> will sort with, but following, the entries beginning with 'C'.
 *
 * The accent table is used solely to differentiate between letters which
 * have the same value in the translate table.  The reasons for two tables
 * instead of one table of shorts are that strcmp() works with char*'s, not
 * short*'s, and that the tables are easier to understand this way.
 *
 * One could also increment the values for 'D'..'~' by 1 and give the value of
 * <Cedilla> as value('C') + 1.  In this case the accent table would not be
 * needed to distinguish between the two.
 */

static unsigned char translate[256] = {
/*       0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  */
/*       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -  */
/*0*/    0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
/*1*/   16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
/*2*/   ' ','!',34, '#','$','%','&',39, '(',')','*','+',',','-','.','/',
/*3*/   '0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?',
/*4*/   '@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O',
/*5*/   'P','Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_',
/*6*/   '`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
/*7*/   'p','q','r','s','t','u','v','w','x','y','z','{','|','}','~',127,
/*8*/   'C','u','e','a','a','a','a','c','e','e','e','i','i','i','A','A',
/*9*/   'E',145,146,'o','o','o','u','u','y','O','U',155,156,157,158,159,
/*A*/   'a','i','o','u','n','N','a','o','?',169,170,171,172,'!',174,175,
/*B*/   176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
/*C*/   192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,
/*D*/   208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,
/*E*/   224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,
/*F*/   240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
};

/*
 * for a given character x, accent[x] determines if the
 * character should be ignored (0), or used as given by translate[x] but
 * marked as an accent (1..n).
 *
 * Accents have a sorting order given by the value stored in this table.
 * This feature is currently used in the following way:
 *   Accent value =
 *      0       Character is totally ignored in all sorting operations
 *      1       Normal unaccented character (ASCII)
 *      2..n    accents from the extended IBM charset
 */

static unsigned char accent[256] = {
/*      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  */
/*      -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  */
/*0*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* control       */
/*1*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*   chars       */
/*2*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* alphanumerics */
/*3*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*4*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*5*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*6*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*7*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, /* DEL           */
/*8*/   2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* accented chars*/
/*9*/   2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0,
/*A*/   2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, /* aeiou ? !     */
/*B*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* line graphics */
/*C*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* line graphics */
/*D*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* line graphics */
/*E*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* greek         */
/*F*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0  /* math          */
};
#endif

#ifdef ISO_LATIN_1      /* ISO Latin-1 character set */
/*
 * Attached is the repertoire of ISO Latin Alphabet Nr 1 (IS 8859-1).  I have
 * indicated an alternate name where there might be confusion in the U.S.
 *
 * List is from Tim Lasko, Digital Equipment Corporation, Maynard, MA
 *
 *   R/C - row/column of code table
 *   Dec - Decimal
 *   Oct - Octal
 *
 *   R/C   Dec Oct Symbol Name
 *
 *   02/00 032 040 SP     SPACE
 *     ... same as USASCII ...
 *   07/14 126 176 ~      TILDE
 *
 *   10/00 160 240 NBSP   NO-BREAK SPACE
 *   10/01 161 241        INVERTED EXCLAMATION MARK
 *   10/02 162 242        CENT SIGN
 *   10/03 163 243        POUND SIGN
 *   10/04 164 244        CURRENCY SIGN
 *   10/05 165 245        YEN SIGN
 *   10/06 166 246        BROKEN BAR
 *   10/07 167 247        PARAGRAPH SIGN, (U.S.) SECTION SIGN
 *   10/08 168 250        DIAERESIS
 *   10/09 169 251        COPYRIGHT SIGN
 *   10/10 170 252        FEMININE ORDINAL INDICATOR
 *   10/11 171 253        LEFT ANGLE QUOTATION MARK
 *   10/12 172 254        NOT SIGN
 *   10/13 173 255 SHY    SOFT HYPHEN
 *   10/14 174 256        REGISTERED TRADEMARK SIGN
 *   10/15 175 257        MACRON
 *
 *   11/00 176 260        RING ABOVE, DEGREE SIGN
 *   11/01 177 261        PLUS-MINUS SIGN
 *   11/02 178 262        SUPERSCRIPT TWO
 *   11/03 179 263        SUPERSCRIPT THREE
 *   11/04 180 264        ACUTE ACCENT
 *   11/05 181 265        MICRO SIGN
 *   11/06 182 266        PILCROW SIGN, (U.S.) PARAGRAPH
 *   11/07 183 267        MIDDLE DOT
 *   11/08 184 270        CEDILLA
 *   11/09 185 271        SUPERSCRIPT ONE
 *   11/10 186 272        MASCULINE ORDINAL INDICATOR
 *   11/11 187 273        RIGHT ANGLE QUOTATION MARK
 *   11/12 188 274        VULGAR FRACTION ONE QUARTER
 *   11/13 189 275        VULGAR FRACTION ONE HALF
 *   11/14 190 276        VULGAR FRACTION THREE QUARTERS
 *   11/15 191 277        INVERTED QUESTION MARK
 *
 *   12/00 192 300        LATIN CAPITAL LETTER A WITH GRAVE ACCENT
 *   12/01 193 301        LATIN CAPITAL LETTER A WITH ACUTE ACCENT
 *   12/02 194 302        LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
 *   12/03 195 303        LATIN CAPITAL LETTER A WITH TILDE
 *   12/04 196 304        LATIN CAPITAL LETTER A WITH DIAERESIS
 *   12/05 197 305        LATIN CAPITAL LETTER A WITH RING ABOVE
 *   12/06 198 306        CAPITAL DIPHTHONG AE
 *   12/07 199 307        LATIN CAPITAL LETTER C WITH CEDILLA
 *   12/08 200 310        LATIN CAPITAL LETTER E WITH GRAVE ACCENT
 *   12/09 201 311        LATIN CAPITAL LETTER E WITH ACUTE ACCENT
 *   12/10 202 312        LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
 *   12/11 203 313        LATIN CAPITAL LETTER E WITH DIAERESIS
 *   12/12 204 314        LATIN CAPITAL LETTER I WITH GRAVE ACCENT
 *   12/13 205 315        LATIN CAPITAL LETTER I WITH ACUTE ACCENT
 *   12/14 206 316        LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
 *   12/15 207 317        LATIN CAPITAL LETTER I WITH DIAERESIS
 *
 *   13/00 208 320        CAPITAL ICELANDIC LETTER ETH
 *   13/01 209 321        LATIN CAPITAL LETTER N WITH TILDE
 *   13/02 210 322        LATIN CAPITAL LETTER O WITH GRAVE ACCENT
 *   13/03 211 323        LATIN CAPITAL LETTER O WITH ACUTE ACCENT
 *   13/04 212 324        LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
 *   13/05 213 325        LATIN CAPITAL LETTER O WITH TILDE
 *   13/06 214 326        LATIN CAPITAL LETTER O WITH DIAERESIS
 *   13/07 215 327        MULTIPLICATION SIGN
 *   13/08 216 330        LATIN CAPITAL LETTER O WITH OBLIQUE STROKE
 *   13/09 217 331        LATIN CAPITAL LETTER U WITH GRAVE ACCENT
 *   13/10 218 332        LATIN CAPITAL LETTER U WITH ACUTE ACCENT
 *   13/11 219 333        LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
 *   13/12 220 334        LATIN CAPITAL LETTER U WITH DIAERESIS
 *   13/13 221 335        LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
 *   13/14 222 336        CAPITAL ICELANDIC LETTER THORN
 *   13/15 223 337        SMALL GERMAN LETTER SHARP s
 *
 *   14/00 224 340        LATIN SMALL LETTER a WITH GRAVE ACCENT
 *   14/01 225 341        LATIN SMALL LETTER a WITH ACUTE ACCENT
 *   14/02 226 342        LATIN SMALL LETTER a WITH CIRCUMFLEX ACCENT
 *   14/03 227 343        LATIN SMALL LETTER a WITH TILDE
 *   14/04 228 344        LATIN SMALL LETTER a WITH DIAERESIS
 *   14/05 229 345        LATIN SMALL LETTER a WITH RING ABOVE
 *   14/06 230 346        SMALL DIPHTHONG ae
 *   14/07 231 347        LATIN SMALL LETTER c WITH CEDILLA
 *   14/08 232 350        LATIN SMALL LETTER e WITH GRAVE ACCENT
 *   14/09 233 351        LATIN SMALL LETTER e WITH ACUTE ACCENT
 *   14/10 234 352        LATIN SMALL LETTER e WITH CIRCUMFLEX ACCENT
 *   14/11 235 353        LATIN SMALL LETTER e WITH DIAERESIS
 *   14/12 236 354        LATIN SMALL LETTER i WITH GRAVE ACCENT
 *   14/13 237 355        LATIN SMALL LETTER i WITH ACUTE ACCENT
 *   14/14 238 356        LATIN SMALL LETTER i WITH CIRCUMFLEX ACCENT
 *   14/15 239 357        LATIN SMALL LETTER i WITH DIAERESIS
 *
 *   15/00 240 360        SMALL ICELANDIC LETTER ETH
 *   15/01 241 361        LATIN SMALL LETTER n WITH TILDE
 *   15/02 242 362        LATIN SMALL LETTER o WITH GRAVE ACCENT
 *   15/03 243 363        LATIN SMALL LETTER o WITH ACUTE ACCENT
 *   15/04 244 364        LATIN SMALL LETTER o WITH CIRCUMFLEX ACCENT
 *   15/05 245 365        LATIN SMALL LETTER o WITH TILDE
 *   15/06 246 366        LATIN SMALL LETTER o WITH DIAERESIS
 *   15/07 247 367        DIVISION SIGN
 *   15/08 248 370        LATIN SMALL LETTER o WITH OBLIQUE STROKE
 *   15/09 249 371        LATIN SMALL LETTER u WITH GRAVE ACCENT
 *   15/10 250 372        LATIN SMALL LETTER u WITH ACUTE ACCENT
 *   15/11 251 373        LATIN SMALL LETTER u WITH CIRCUMFLEX ACCENT
 *   15/12 252 374        LATIN SMALL LETTER u WITH DIAERESIS
 *   15/13 253 375        LATIN SMALL LETTER y WITH ACUTE ACCENT
 *   15/14 254 376        SMALL ICELANDIC LETTER THORN
 *   15/15 255 377        LATIN SMALL LETTER y WITH DIAERESIS
 */

static unsigned char translate[256] = {
/*       0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  */
/*       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -  */
/*0*/    0, ' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
/*1*/   ' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
/*2*/   ' ','!',34, '#','$','%','&',39, '(',')','*','+',',','-','.','/',
/*3*/   '0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?',
/*4*/   '@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O',
/*5*/   'P','Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_',
/*6*/   '`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
/*7*/   'p','q','r','s','t','u','v','w','x','y','z','{','|','}','~',127,
/*8*/   ' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
/*9*/   ' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
/*A*/   160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,
/*B*/   176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
/*C*/   'A','A','A','A','A','A',198,'C','E','E','E','E','I','I','I','I',
/*D*/   208,'N','O','O','O','O','O',215,'O','U','U','U','U','Y',222,'s',
/*E*/   'a','a','a','a','a','a',230,'c','e','e','e','e','i','i','i','i',
/*F*/   240,'n','o','o','o','o','o',247,'o','u','u','u','u','y',254,'y'
};

static unsigned char accent[256] = {
/*      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  */
/*      -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  */
/*0*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* control       */
/*1*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*   chars       */
/*2*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* alphanumerics */
/*3*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*4*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*5*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*6*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/*7*/   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, /* DEL           */
/*8*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* control       */
/*9*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*   chars       */
/*A*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* punctuation   */
/*B*/   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* punctuation   */
/*C*/   2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* A C, E I      */
/*D*/   2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, /* O N U Y       */
/*E*/   2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* a c, e i      */
/*F*/   2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2  /* o n u y       */
};
#endif

#ifdef BRACKET_COMMENTS
#define REDUCE(ORIG, ACCENTS, ASCII) { \
        char *pa, *pt; \
        pa = ACCENTS; \
        pt = ASCII; \
        while (*ORIG) { \
            if ( *ORIG == '[' ) { \
                while ( *ORIG && *ORIG != ']' ) \
                    ORIG++;     /* ignore anything within []'s */ \
                if (*ORIG) \
                    ORIG++;     /* skip trailing ] */ \
                continue; \
            } \
            if (accent[ *ORIG ]) { \
                *pa++ = accent[ (unsigned)(*ORIG) ]; \
                *pt++ = translate[ (unsigned)(*ORIG) ]; /* set collating seq */ \
            } \
            ORIG++; \
        } \
        *pa = *pt = '\0'; \
    }
#else
#define REDUCE(ORIG, ACCENTS, ASCII) { \
        char *pa, *pt; \
        pa = ACCENTS; \
        pt = ASCII; \
        while (*ORIG) { \
            if (accent[ *ORIG ]) { \
                *pa++ = accent[ (unsigned)(*ORIG) ]; \
                *pt++ = translate[ (unsigned)(*ORIG) ]; /* set collating seq */ \
            } \
            ORIG++; \
        } \
        *pa = *pt = '\0'; \
    }
#endif

/* The size argument is parenthesized so the cast covers the whole
 * expression, and a failed allocation exits with a non-zero status.
 */
#define MALLOC( pointer, type, size ) \
        pointer = ( type *)malloc((unsigned)( size )); \
        if ( pointer == (type *)NULL) { \
            (void)printf("\n MALLOC returned NULL: pointer (size)"); \
            exit(1); \
        }

#define FREE( pointer ) \
        (void)free( (char *)pointer );

int stracmp(s1,s2)
unsigned char *s1, *s2;
{
    int value;
    unsigned int i1, i2;        /* length of given strings */
    char *as1, *as2;            /* accent strings */
    char *ts1, *ts2;            /* strings with accent marks stripped */

    if (s1 == NULL)             /* cover our ass */
        if (s2 == NULL)
            return 0;           /* NULL == NULL :-) */
        else
            return -1;          /* NULL < "anything" */
    else if (s2 == NULL)
        return 1;               /* "anything" > NULL */

    i1 = strlen((char *)s1) + 1;
    i2 = strlen((char *)s2) + 1;

    /* one block holds all four work strings: as1, ts1, as2, ts2 */
    MALLOC(as1, char, 2 * (i1 + i2) + 4);       /* accent chars */
    ts1 = as1 + i1 + 1;
    as2 = ts1 + i1 + 1;
    ts2 = as2 + i2 + 1;

    REDUCE( s1, as1, ts1);
    REDUCE( s2, as2, ts2);

    if ( (value = strcmp(ts1, ts2) ) ) {
        FREE(as1);
        return( value );        /* strings differ already */
    }

    /*
     * at this point, ts1 == ts2, and we need to decide if
     * the accents (if any) break the tie.
     */
    value = strcmp( as1, as2 );
    FREE(as1);
    return value;
}

#ifdef MAIN

#ifndef ON_IBMPC
#include <ctype.h>

void crtaccents(s)
unsigned char *s;
{
    while ( s && *s ) {
        switch( *s ) {
        case '\r': (void)printf("\\r"); break;
        case '\b': (void)printf("\\b"); break;
        case '\t': (void)printf("\\t"); break;
        case '\f': (void)printf("\\f"); break;
        default  : if (isascii( *s )) (void)putchar(*s);
                   else (void)printf("\\%03o",*s);
                   break;
        }
        s++;
    }
}
#endif

/* strcmp() only promises a negative, zero, or positive result, so the
 * result is reduced to -1, 0, or 1 before it is checked against "check".
 */
#define COMPARE( check, s1, s2 ) { \
        (void)PRINT(s1) ; \
        result = stracmp( (unsigned char *)s1, (unsigned char *)s2 ); \
        if (result < 0)      { result = -1; (void)printf(" < "); } \
        else if (result > 0) { result =  1; (void)printf(" > "); } \
        else                 (void)printf(" = "); \
        (void)PRINT(s2); (void)putchar('\t'); \
        if (result == check) (void)printf("OK\n"); \
        else                 (void)printf("WRONG!!!\n"); \
    }

main()
{
    int result;

    (void)printf("stracmp demo - version %s\n", VERSION);

    /* These tests assume IBM ROM tables */
    COMPARE( 0,"John Plocher", "John Plocher");                 /* = */
    COMPARE( 0,"John[ Michael] Plocher[@hobbes.UUCP]", "John Plocher"); /* = */
    COMPARE( 0,"John Plocher", "John[ Michael] Plocher");       /* = */
    COMPARE(-1,"John Plocher", "J\242hn Plocher");              /* < */
    COMPARE( 1,"J\242hn Pl\242cher", "J\242hn Plocher");        /* > */
    COMPARE( 1,"J\242hn P\242lcher", "J\242hn Pl\242cher");     /* > */
    COMPARE( 0,"J\242hn Pl\242cher", "J\242hn Pl\242cher");     /* = */
    COMPARE( 1,"J\242\242n Pl\242cher", "J\242hn Pl\242cher");  /* > */

    return(0);
}
#endif

-- 
John Plocher uwvax!geowhiz!uwspan!plocher  plocher%uwspan.UUCP@uwvax.CS.WISC.EDU
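
The "letters first, accents only to break a tie" rule described in the
Theory section can also be sketched without any heap space.  The code below
is NOT the posted stracmp(): it is a single-pass variant written in ANSI C
for illustration.  The names two_pass_cmp(), base_of() and rank_of() are
made up for this sketch, the byte values assume ISO Latin-1, and only a
handful of accented vowels are mapped; a real version would index the full
translate[] and accent[] tables instead.

```c
#include <string.h>

/* Hypothetical stand-in for translate[]: maps a few ISO Latin-1
 * accented vowels to their base letter; everything else maps to itself. */
static unsigned char base_of(unsigned char c)
{
    switch (c) {
    case 0xE0: case 0xE1: case 0xE2: case 0xE4: return 'a';
    case 0xF2: case 0xF3: case 0xF4: case 0xF6: return 'o';
    case 0xF9: case 0xFA: case 0xFB: case 0xFC: return 'u';
    default:                                    return c;
    }
}

/* Hypothetical stand-in for accent[]:
 * 0 = ignored (control chars, DEL), 1 = plain, 2 = accented. */
static unsigned char rank_of(unsigned char c)
{
    if (c < 0x20 || c == 0x7F)
        return 0;
    return (unsigned char)(base_of(c) != c ? 2 : 1);
}

/* Compare base letters; remember the first accent difference and use it
 * only if every base letter matches.  Always returns -1, 0, or 1. */
int two_pass_cmp(const char *s1, const char *s2)
{
    const unsigned char *p = (const unsigned char *)s1;
    const unsigned char *q = (const unsigned char *)s2;
    int tie = 0;                /* first accent difference seen, if any */

    for (;;) {
        while (*p && rank_of(*p) == 0) p++;     /* skip ignored bytes */
        while (*q && rank_of(*q) == 0) q++;
        if (!*p || !*q)
            break;
        if (base_of(*p) != base_of(*q))         /* letters decide at once */
            return base_of(*p) < base_of(*q) ? -1 : 1;
        if (tie == 0 && rank_of(*p) != rank_of(*q))
            tie = rank_of(*p) < rank_of(*q) ? -1 : 1;
        p++;
        q++;
    }
    if (*p) return 1;           /* s1 has letters left, so it sorts later */
    if (*q) return -1;
    return tie;                 /* all letters equal: accents break the tie */
}
```

With these assumptions, two_pass_cmp("Junta", "J\372nta") gives -1 (the
plain word sorts first), while two_pass_cmp("J\372nta", "Junto") also gives
-1, because the letters 'a' and 'o' decide before the accent on the 'u' is
ever consulted.  Bracket comments and NULL arguments are left out to keep
the sketch short.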