Source Code 1992 March

home *** CD-ROM | disk | FTP | other *** search

/ Source Code 1992 March / Source_Code_CD-ROM_Walnut_Creek_March_1992.iso / usenet / altsrcs / 1 / 1904 < prev next >

Wrap

Internet Message Format | 1990-12-28 | 3KB

From: cedman@lynx.ps.uci.edu (Carl Edman) Newsgroups: alt.sources Subject: Re: Fast strcmp() wanted. Message-ID: <CEDMAN.90Oct4171710@lynx.ps.uci.edu> Date: 5 Oct 90 00:17:10 GMT <CEDMAN.90Sep27075013@lynx.ps.uci.edu> <1990Sep27.151543.8025@ccs.carleton.ca> <CEDMAN.90Sep29091115@lynx.ps.uci.edu> <6003@hplabsz.HPL.HP.COM> <1145@exodus.Eng.Sun.COM> Organization: non serviam Lines: 61 Nntp-Posting-Host: lynx.ps.uci.edu In-reply-to: falk@peregrine.Sun.COM's message of 4 Oct 90 21:03:47 GMT In article <1145@exodus.Eng.Sun.COM> falk@peregrine.Sun.COM (Ed Falk) writes: In article <6003@hplabsz.HPL.HP.COM> sartin@hplabs.hp.com (Rob Sartin) writes: >In article <CEDMAN.90Sep29091115@lynx.ps.uci.edu> cedman@lynx.ps.uci.edu (Carl Edman) writes: >>string structure (or better class, long live C++ ! :-) which calculates >>a 32-bit CRC for each string the first time and stores it somewhere. >>Then only 1 (inlined) longword-compare will do the stringcomparisons >>for you. > >Afraid not. It'll give you an estimate of whether the strings match >(correctly identifying those that don't). You will need to then >actually compare the strings if they are the same. This method will >also be unable to reproduce strcmp's behavior (strcmp returns a signed >result indicated the <, =, > by being negative, zero, positive), it will >only return a boolean (match, no match). Also, these two strings "ab\0x" "ab\0y" (where x and y are any garbage that happens to be in memory after the terminating '\0') will be evaluated as unequal. There are workarounds for both problems, of course, but I think there won't be much of a performance improvement after you've done all it requires to get it right. Please , take note ! I did NOT suggest that this method is always the best, but I gave 3 or 4 different methods, and some hints, about when I would use which one. And I still hold that the CRC-method could be by far the most efficient under some conditions: - each word is checked often. Then a small overhead like f.e. cleaning up the garbage behind the end of the string for the one time calculation wouldn't matter - strings are long, so the overhead of 32-bit per string is justified - you only need to know if 2 strings are identical - most compares between unequal strings You can even imagine systems where it could be possible to remove the actual string from memory, and simply assume that if the 32-bit CRC match, the strings match. Such systems would have to be tolerant about an occasional mismatch. If memory serves correctly the above approach is used in some implementations for diff. (only to give one practical, real world example) Carl Edman Theorectial Physicist,N.:A physicist whose | Send mail existence is postulated, to make the numbers | to balance but who is never actually observed | cedman@golem.ps.uci.edu in the laboratory. | edmanc@uciph0.ps.uci.edu