Usenet 1994 January

Text File | 1991-10-19 | 26.4 KB | 630 lines

Newsgroups: comp.sources.misc
From: goer@midway.uchicago.edu (Richard L. Goerwitz)
Subject: v23i074: quranref - Holy Qur'an word and passage based retrievals, Part08/08
Message-ID: <1991Oct19.022406.13196@sparky.imd.sterling.com>
X-Md4-Signature: 5bad0c869b381ab6b4b5f8e598b14f45
Date: Sat, 19 Oct 1991 02:24:06 GMT
Approved: kent@sparky.imd.sterling.com

Submitted-by: goer@midway.uchicago.edu (Richard L. Goerwitz)
Posting-number: Volume 23, Issue 74
Archive-name: quranref/part08
Environment: Icon

---- Cut Here and feed the following to sh ----
#!/bin/sh
# this is quranref.08 (part 8 of a multipart archive)
# do not concatenate these parts, unpack them in order with /bin/sh
# file README continued
#
if test ! -r _shar_seq_.tmp; then
  echo 'Please unpack part 1 first!'
  exit 1
fi
(read Scheck
 if test "$Scheck" != 8; then
  echo Please unpack part "$Scheck" next!
  exit 1
 else
  exit 0
 fi
) < _shar_seq_.tmp || exit 1
if test ! -f _shar_wnt_.tmp; then
  echo 'x - still skipping README'
else
echo 'x - continuing file README'
sed 's/^X//' << 'SHAR_EOF' >> 'README' &&
X"and"). This tells Quranref that you want to perform an intersection
Xwith respect to another set of passages. After typing "a" and hitting
Xreturn, Quranref asks you for a unit (c = chapter, v = verse).
XNormally you would press "v." You are then asked for a range
X(normally 0). After entering a unit and range, you would enter a new
Xword or pattern to look for, and then press "f" to tell Quranref you
Xare finished. What Quranref would retrieve in this instance is a list
Xof all verses which contain both of the words, or word-patterns, that
Xyou specified. Note that if you had entered 1 as your range, you
Xwould have gotten a list of all passages containing word(s) matching
Xthe first pattern and which either contain, *or are adjacent to
Xanother passage containing*, a word matching the next pattern you
Xspecified.
X    In addition to "a" ("and"), Quranref also accepts "o" ("or")
Xand "n" ("and-not") directives. Also, words and patterns preceded by
Xan exclamation point and a space ("! ") are inverted (a la egrep -v).
XI would not recommend using the "! " much, though. It is slow, and
Xusually produces massive hit lists. If you want, say, all passages
Xcontaining the word "woman" that don't contain the word "child," then
Xformulate your search as "woman" and-not "child," rather than as
X"woman" and "! child." The only thing slower than a search for
X"woman" and "! child" would be to look for "woman" together with the
Xwords "the" and "child." There are about 4100 passages containing
X"the," and although I've used a cute trick to reduce the number that
Xhave to be stored, retrieving them all is still a mess (takes almost
X30 seconds on my machine).
X
X
X--------
X
X
XAdditional Notes:
X
X    As mentioned above, this package is really just a wrapper
Xaround a more general set of indexing and retrieval utilities I'm
Xusing for personal research. Despite the way they are used here,
Xthese utilities are *not* geared solely for the Quran. In fact, they
Xare set up so that they can be used with just about any text broken up
Xinto hierarchically arranged divisions. As noted above, this
Xdistribution is actually built on top of a similar package geared for
XChristian and Jewish Bible research. If you need help integrating a
Xnew text into the retrieve package (e.g. new Quran translations,
X'ahadith, biblical texts, etc.), drop me a line. If I'm not busy, and
Xthe job looks to be one I can help you out with, I'll be glad to do
Xso. If nothing else, I can at least get you started, and offer
Xpointers on how to proceed. If you don't have M. H. Shakir's Quran
Xtranslation, though, and can't ftp from the location specified earlier
Xon in this document (i.e. princeton.edu), please *DON'T* write to me
Xasking me to e-mail you the files, or to package them up on disks.
XI've been flooded with requests for biblical texts already, and can't
Xreasonably oblige them all.
X    It is with some reservation that I mention here several
Xfeatures Quranref possesses that I haven't fully documented. Most are
Xones 1) that I'm not likely to continue supporting, 2) that haven't
Xbeen tested, 3) that are too slow to be practical, or 4) that are
Xlikely to change. First of all, the "d" command can take a number
Xargument, which causes it to display the list whose position in the
Xglobal list of lists corresponds to that number. On the top level
X(and in some other places), the "!" command can also pass arguments to
Xa shell (/bin/sh, or the value of your SHELL environment variable).
XAlso, if you type "f lord god" at the main prompt, and press return,
Xyou'll get a list of verses containing the words "lord" and "god"
X(i.e. Quranref will perform a verse-based, range 0 "and" on the
Xrespective hit lists for these two words). Finally, when browsing
Xsearch lists, you can look at the first line of each verse by typing
X"l" and return (typing l+return again turns this feature off).
X    While I don't want to hide the existence of these marginal
Xfeatures, I don't want to encourage anyone to expect their presence in
Xlater versions, or to suggest that they will work properly in the
Xcurrent one. I'm particularly worried about the "l" and "f lord god"
Xexamples above. The "l" listing option is very slow. Also, telling
Xpeople that "f lord god" is okay might lead them to think that
XQuranref has a concept of word order within verses. In fact, this is
Xjust an alternate way of performing a set intersection on the hit
Xlists for "lord" and "god." If you use "undocumented" features such
Xas these, be aware that there may be difficulties inherent in their
Xuse, and that, in general, I've avoided mentioning them until now
Xprecisely because I'm not quite sure they are worthy of mention in the
Xfirst place.
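
The point that "f lord god" is only a set intersection, with no notion
of word order, can be illustrated with Icon's built-in set operators.
The sketch below is not Quranref's own code. It assumes, purely for
illustration, that a hit list is a plain Icon set of verse numbers, and
the numbers themselves are invented. It shows what a range 0 "and",
"or", and "and-not" amount to.

```icon
# Hypothetical illustration only: Quranref's real hit lists are bitmaps,
# and the verse numbers below are made up.

# Write a label followed by the members of set s in ascending order.
procedure show(label, s)
    writes(label, ":")
    every writes(" ", !sort(s))
    write()
end

procedure main()
    local lord, god
    lord := set([3, 7, 12, 40])        # verses containing "lord"
    god  := set([7, 9, 40, 51])        # verses containing "god"

    show("and    ", lord ** god)       # intersection: in both sets
    show("or     ", lord ++ god)       # union: in either set
    show("and-not", lord -- god)       # difference: "lord" without "god"
end
```

Verses 7 and 40 are the only members of both sets, so they are all the
"and" line lists. Nothing in the result records which word came first
in the verse.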
X
X
X--------
X
X
XProblems:
X
X    Doubtless you will find problems, more options not discussed
Xin the documentation, and just general indications that this program
Xwas written late at night after I was done with all my serious work
Xfor the day :-). If - no, when - this happens, I encourage you to
Xdrop me a line. I'd like to know about any flaws you run into,
Xespecially major, systemic ones.
X    I really hope that the bugs will not prove too annoying, and
Xthat the package will prove generally useful to you the user, and, if
Xyou place it in a public directory, to anyone else who might happen
Xto try it out.
X
X
X   -Richard L. Goerwitz              goer%sophist@uchicago.bitnet
X   goer@sophist.uchicago.edu         rutgers!oddjob!gide!sophist!goer
SHAR_EOF
echo 'File README is complete' &&
true || echo 'restore of README failed'
rm -f _shar_wnt_.tmp
fi
# ============= README.rtv ==============
if test -f 'README.rtv' -a X"$1" != X"-c"; then
  echo 'x - skipping README.rtv (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting README.rtv (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'README.rtv' &&
X--------
X
X
XName: retrieve
XLanguage: Icon
XContents: tools for word-based, indexed access to text files
XRequires: up-to-date Icon Program Library, up-to-date iconc/icont, UNIX
XFiles: binsrch.icn bmp2text.icn gettokens.icn indexutl.icn initfile.icn
X       makeind.icn retrieve.icn retrops.icn searchb.icn whatnext.icn
X       huffcode.icn huffadj.icn
X
X
X--------
X
X
XOverview:
X
X    Scholars have traditionally split so-called "Classics" - the
XQuran, the Bible, and generally any closely studied literary or
Xreligious text - into hierarchically arranged divisions (in the case
Xof the Bible, these are books, chapters, and verses). Such divisions
Xdrastically simplify the process of citation and reference.
XFortunately for those of us who need electronic access to these files,
Xthis hierarchical system of divisions permits easy representation
Xusing bit-fields, i.e. fixed-width series of binary digits. Such
Xrepresentations are compact, and allow the programmer to implement
Xhigh-level boolean operations and range-based searches using simple
Xshifts, additions, and subtractions.
X    The package with which this README file is packed - "retrieve"
X- offers a naive, but generalized and fairly high-level, tool for
Xindexing texts which are divided up in the manner just described, and
Xfor performing word-based searches on them. These word-based searches
Xoffer wildcard-based access to word patterns (e.g. "give me every
Xpassage containing a word with the letters 'NIX'"). The search
Xfacilities also permit boolean and range-based specifications (e.g.
X"give me every instance of word X occurring within eleven sections of
Xthe word Y"). One can also access passages by both absolute (e.g.
X"give me book 1, chapter 3, verse 4"), and relative, location (e.g.
X"give me the passage occurring before/after the one I just looked
Xat").
X    Retrieve performs only superficial compression, and is written
Xentirely in Icon. As a result it is something of a disk hog, and
Xtakes a long time to index files. Surprisingly, though, once set up,
Xfiles incorporated into the retrieve package can be accessed quite
Xrapidly. After a brief initialization process (takes 2-4 seconds on a
XSun4), absolute locations can be retrieved with no perceptible delay.
XThe same is true of relative locations (again, after a lag on first
Xinvocation). Regular expression based searches appear instantaneous
Xon a fast machine (there is a just perceptible delay on a Sun4 for a
Xfour megabyte indexed file, five to ten seconds on a Xenix/386 box
Xwith a relatively slow disk). Boolean and range-based searches take
Xthe longest, varying widely according to their complexity and the
Xnumber of "hits."
X
X
X--------
X
X
XInstallation:
X
X    Retrieve is really not a program as such. It is a set of
Xtools for indexing, and accessing indexed, files.
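
Before turning to the installation steps, it may help to see the
bit-field representation from the Overview in concrete form. The
following is a minimal sketch of how a three-field location might be
packed into one integer, and unpacked again, using shifts and masks.
The names pack_loc() and unpack_loc() and the 8-bit field width are
assumptions made for this example. They are not taken from the retrieve
sources, whose actual bitmap layout may differ.

```icon
# Hypothetical helpers, not part of the retrieve package.

# Pack a list of field values into one integer, width bits per field,
# with the first field in the most significant position.
procedure pack_loc(fields, width)
    local n, f
    n := 0
    every f := !fields do
        n := ior(ishift(n, width), f)
    return n
end

# Recover a list of nfields values from a packed integer.
procedure unpack_loc(n, nfields, width)
    local fields, mask
    fields := []
    mask := ishift(1, width) - 1
    every 1 to nfields do {
        push(fields, iand(n, mask))    # lowest field comes off first
        n := ishift(n, -width)         # negative count shifts right
    }
    return fields
end

procedure main()
    local loc, f
    loc := pack_loc([1, 2, 3], 8)      # book 1, chapter 2, verse 3
    write(loc)                         # 1*65536 + 2*256 + 3
    every f := !unpack_loc(loc, 3, 8) do writes(f, " ")
    write()
    # The verse is the low-order field, so adding 1 gives the next verse.
    every f := !unpack_loc(loc + 1, 3, 8) do writes(f, " ")
    write()
end
```

With this layout, comparing two packed locations as integers compares
them in book, chapter, verse order, which is what makes range tests
cheap.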
XInstallation consists of four basic steps:
X
X    1) creating an indexable file
X    2) indexing that file
X    3) writing a program using the retrieve interface
X    4) compiling and running what you wrote in (3)
X
XThese steps are discussed in detail in the following sections.
X
X
X--------
X
X
XStep 1: Creating an Indexable File
X
X    The format for indexable files must conform to a simple, but
Xstrict, set of guidelines. Basically, they must interleave series of
Xlocation designators (internally represented by so-called "bitmaps")
Xwith actual text:
X
X    ::001:001:001
X    This is text.
X    ::001:001:002
X    This is more text.
X
XThe initial :: (double colon) delimits lines containing the location
Xdesignators. These designators translate into integers divisible
Xinternally into (in this case) three bit-fields of length 10 (enough
Xto handle 999:999:999), which serve as location markers for the text
Xthat goes with them. Note that the translation process is invisible.
XAll you need to do is make sure,
X
X    a) that the location designators are correctly paired with
X       blocks of text, and
X    b) that the fields are numbered consistently, beginning with
X       the same low value (usually 1 or 0), and continuing in
X       ascending order until they roll over again to their low
X       value
X
X    Rather than speak merely in the abstract about the format, let
Xme offer a simple illustration taken from the King James Bible. The
Xfirst verse in the Bible is Genesis chapter 1 verse 1. This passage
Xmight be designated 1:1:1. Verses in Genesis chapter 1 would continue
Xin ascending order to verse 31 (1:1:31), after which chapter 2 would
Xbegin (i.e. 1:2:1). The resulting text would look like:
X
X    ::1:1:1
X    In the beginning God created the heaven and the earth.
X    ::1:1:2
X    And the earth was without form, and void; and darkness was
X    upon the face of the deep. And the Spirit of God moved upon
X    the face of the waters.
X    ::1:1:3
X    And God said, Let there be light: and there was light.
X    ::1:1:4
X    And God saw the light, that it was good: and God divided the
X    light from the darkness.
X    ::1:1:5
X    And God called the light Day, and the darkness he called
X    Night. And the evening and the morning were the first day.
X
X    ...
X
X    ::1:2:1
X    Thus the heavens and the earth were finished, and all the host
X    of them.
X
XAlthough you can use any number of fields you want or need, and can
Xuse any nonnumeric separator (e.g. 01-01-01-05-03), lines containing
Xlocation designators *must* begin with "::," and must be ordered
Xsequentially throughout the input file, paired with the correct text
Xblock in each instance.
X
X
X--------
X
X
XStep 2: Indexing the File
X
X    Indexing the file created in step (1) entails compiling and
Xinvoking a program called "makeind." The compilation end of this
Xprocess would typically be achieved by typing:
X
X    icont -o makeind makeind.icn gettokens.icn indexutl.icn
X
XOne of the files listed just above, gettokens.icn, is of particular
Xinterest. It contains the tokenizing routine to be used in creating
Xthe main word index. Should this routine prove unsatisfactory for one
Xreason or another you are free to replace it with something more to
Xyour liking. Just comment out the old gettokens() routine, and insert
Xthe new one in its place. Then recompile.
X    Once you have compiled makeind, you must invoke it for the
Xtext file you created in step (1). Invoking makeind involves
Xspecifying a file to be indexed, the number of fields in location
Xmarkers for that file, and the maximum value for fields. If you plan
Xon invoking passages by relative location, you must also use the -l
Xoption, which tells makeind to build a .LIM file, which records the
Xhigh values for a specific field throughout the file being indexed.
XLet us say you have examined Genesis 1:31 in the Bible, and want to
Xlook at the next verse.
XThe only easy way for the procedure that handles this particular
Xchore to know the maximum verse value for Genesis chapter 1 (31) is
Xto read it from a file in which that value has been stored. By
Xsupplying makeind with an -l argument, you are telling it to create a
Xfile to store such values.
X    Just for illustration's sake, let us suppose you want to index
Xthe King James Version (KJV). How might you invoke makeind to
Xaccomplish this? First you would need to determine the maximum field
Xvalue for your text. In the case of the Christian English Bible, this
Xis 176. The English Bible (including Apocrypha) contains 84 books (at
Xleast in the RSV). The Protestant KJV contains 66. The maximum
Xnumber of chapters in any book is 150 (Psalms; 151 for Catholics).
XThe maximum number of verses in any one chapter in any one book is 176
X(Psalm 119). 176 would therefore be the maximum value any field would
Xhave to contain. You would pass this information to makeind via the
X-m option. The total number of fields is three, naturally (book,
Xchapter, and verse). This value would be passed using the -n option.
XAs noted above, in order to use relative locations you would need to
Xtell makeind what field to record max values for. In our hypothesized
Xscenario, you would want makeind to store the max value for the verse
Xfield for every chapter of every book in the Bible. The verse field
X(field #3), in other words, is your "rollover" field, and would be
Xpassed to makeind using the -l option. Assuming "kjv" to be the name
Xof your indexable biblical text, this set of circumstances would imply
Xthe following invocation for makeind:
X
X    makeind -f kjv -m 176 -n 3 -l 3
X
XIf you were to want a case-sensitive index (not a good idea), you
Xwould add "-s" to the argument list above (the only disadvantage a
Xcase-insensitive index would bring is that it would obscure the
XLord/lord, and other similar, distinctions).
X    Actual English Bible texts usually take up 4-5 megabytes.
XIndexing one would require over three times that much core memory, and
Xwould take at least several hours on a fast CPU. The end result would
Xbe a set of data files occupying about 2 megabytes plus the original
Xfile (which Bibleref compresses down to about 3/4 its original size).
XThe Bible is hardly a small book. Once these data files were created,
Xthey could be moved, along with the compressed original source file,
Xto any platform you desired. The old input file is saved with a .BAK
Xextension in case you would like to keep it.
X    Having indexed, and having moved the files to wherever you
Xwanted them, you would then be ready for step 3.
X
X
X--------
X
X
XStep 3: Writing a Program to Access Indexed Files
X
X    When accessing text files such as the Bible, the most useful
Xunit for searches is normally the word. Let us suppose you are a
Xzealous lay-speaker preparing a talk on fire imagery and divine wrath
Xin the Bible. You would probably want to look for every passage in
Xthe text that contained words like
X
X    fire, fiery
X    burn
X    furnace
X    etc.
X
XTo refine the search, let us say that you want every instance of one
Xof these fire words that occurs within one verse of a biblical title
Xfor God:
X
X    God
X    LORD
X    etc.
X
XThe searches for fire, fiery, burn, etc. would be accomplished by
Xcalling a routine called retrieve(). Retrieve() takes three
Xarguments:
X
X    retrieve(pattern, filename, invert_search)
X
XThe first argument should be a string containing a regular expression
Xbased pattern, such as
X
X    fi(ery|re|eriness)|flam(e|ing)|burn.*?
X
XNote that the pattern must match words IN THEIR ENTIRETY. So, for
Xinstance, "fir[ie]" would not catch "fieriness," but rather only
X"fire." Likewise, if you want every string beginning with the
Xsequence "burn," the string "burn" will not work. Use "burn.*"
Xinstead. The filename argument supplies retrieve() with the name of
Xthe original text file. The last argument, if nonnull, inverts the
Xsense of the search (a la egrep -v). In the case of the fire words
Xmentioned above, one would invoke retrieve() as follows:
X
X    hits1 := retrieve("fi(ery|re|eriness)|flam(e|ing)|burn.*?", "kjv")
X
XFor the divine names, one would do something along these lines:
X
X    hits2 := retrieve("god|lord", "kjv")
X
X    Having finished the basic word searches, one would then
Xperform a set intersection on them. If we are looking for fire words
Xwhich occur at most one verse away from a divine name, then we would
Xspecify 1 as our range (as opposed to, say, zero), and the verse as
Xour unit. The utility you would use to carry out the search is
Xr_and(). R_and() would be invoked as follows:
X
X    hits3 := r_and(hits1, hits2, "kjv", 3, 1)
X
XThe last two arguments, 3 and 1, specify field 3 (the "verse" field)
Xas the unit and 1 as the range.
X    To display the text for your "hit list" (hits3 above), you
Xwould call bitmap_2_text():
X
X    every write(!bitmap_2_text(hits3, "kjv"))
X
XBitmap_2_text() converts the location designators contained in hits3
Xinto actual text.
X    The three basic functions mentioned above - retrieve(),
Xr_and(), and bitmap_2_text() - are contained in three distinct files
X(retrieve.icn, retrops.icn, and bmp2text.icn, respectively). Other
Xuseful routines are included in these files, and also in
Xwhatnext.icn. If you are planning on writing a retrieval engine for
Xserious work of some kind, you would probably want to construct a mini
Xinterpreter, which would convert strings typed in by the user at
Xrun-time into internal search and retrieval operations.
X    Note that I have included no routine to parse or expand
Xhuman-readable input (the nature of which will naturally vary from
Xtext to text). Again, using the Bible as our hypothetical case, it
Xwould be very useful to be able to ask for every passage in, say,
XGenesis chapters 2 through 4, and to be able to print these to the
Xscreen. Doing this would require a parsing routine to break down the
Xreferences, and map them to retrieve-internal format.
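
A parsing routine of the kind just described might begin along the
following lines. The sketch uses Icon string scanning to split a
reference into its numeric fields. The name parse_ref() and the input
strings are hypothetical, and the sketch deliberately stops short of
mapping the fields to retrieve's internal format, which is defined in
the package sources and not reproduced here.

```icon
# Hypothetical helper, not part of the retrieve package.
# Returns a list of the integers found in s, treating any run of
# non-digit characters as a separator. Fails if s contains no digits.
procedure parse_ref(s)
    local fields
    fields := []
    s ? {
        while tab(upto(&digits)) do
            put(fields, integer(tab(many(&digits))))
    }
    if *fields = 0 then fail
    return fields
end

procedure main()
    local f
    every f := !parse_ref("::1:2:1") do writes(f, " ")
    write()
    every f := !parse_ref("01-02-04") do writes(f, " ")
    write()
end
```

Because any nonnumeric separator is accepted, the same procedure
handles both the "::1:2:1" and the "01-02-04" styles of designator
mentioned in Step 1.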
XThe routine would then have to generate all valid locations from the
Xminimum value in chapter 2 above to the max in chapter 4. See the
Xfile whatnext.icn for some aids in accomplishing this sort of task.
X
X
X--------
X
X
XStep 4: Compiling and Running Your Program
X
X    Assuming you have written a search/retrieval program using the
Xroutines contained in retrieve.icn, retrops.icn, bmp2text.icn, and
Xwhatnext.icn, you would now be ready to compile it. In order to
Xfunction properly, these routines would need to be linked with
Xinitfile.icn and indexutl.icn. Specific dependencies are noted in the
Xindividual files in case there is any confusion.
X    If you have made significant use of this package, you probably
Xshould not worry about the exact dependencies, though. Just link
Xeverything in together, and worry about what isn't needed after you
Xhave fully tested your program:
X
X    icont -o yourprog yourprog.icn initfile.icn indexutl.icn \
X        retrieve.icn retrops.icn bmp2text.icn binsrch.icn
X
X
X--------
X
X
XProblems, bugs:
X
X    This is really an early test release of the retrieve package.
XI use it for various things. For instance, I recently retrieved a
Xtext file containing informal reviews of a number of science fiction
Xworks. My father likes SciFi, and it was close to Father's Day, so I
Xindexed the file, and performed cross-referenced searches for words
Xlike "very good," "brilliant," and "excellent," omitting authors my
Xfather has certainly read (e.g. Herbert, Asimov, etc.). I also had
Xoccasion to write a retrieval engine for the King James Bible (hence
Xthe many examples from this text), and to construct a retrieval
Xpackage for the Hebrew Bible, which I am now using to gather data for
Xvarious chapters of my dissertation. I'm happy, incidentally, to hand
Xout copies of my KJV retrieval program. It's a clean little program
Xthat doubtless many would find useful.
XThe Hebrew Bible retrieval package I'll hand out as well, but only to
Xfully competent Icon programmers who feel comfortable with Hebrew and
XAramaic. This latter retrieval package is a much less finished
Xproduct, and would almost certainly need to be hacked to work on
Xplatforms other than what I have here at home (a Xenix/386 setup with
Xa VGA).
X    In general, I hope that someone out there will find these
Xroutines useful, if for no other reason than that it will mean that I
Xget some offsite testing. Obviously, the whole package could have
Xbeen written/maintained in C or something that might offer much better
Xperformance. Doing so would, however, have entailed a considerable
Xloss of flexibility, and would have required a lot more time on my
Xpart. Right now, the retrieve package occupies about 70k of basic
Xsource files, probably half of which consists of comments. When
Xcompiled together with a moderate-size user interface, the total
Xpackage typically comes to about 150k. In-core size typically runs
Xabout 300k on my home machine here (a Xenix/386 box), with the basic
Xrun-time interpreter taking up a good chunk of that space all on its
Xown. It's not a small package, but I've found it a nice base for
Xrapid prototyping and development of small to medium-size search and
Xretrieval engines.
X
X   -Richard L. Goerwitz              goer%sophist@uchicago.bitnet
X   goer@sophist.uchicago.edu         rutgers!oddjob!gide!sophist!goer
SHAR_EOF
true || echo 'restore of README.rtv failed'
rm -f _shar_wnt_.tmp
fi
# ============= outbits.icn ==============
if test -f 'outbits.icn' -a X"$1" != X"-c"; then
  echo 'x - skipping outbits.icn (File already exists)'
  rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting outbits.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'outbits.icn' &&
X############################################################################
X#
X#  Name:     outbits.icn
X#
X#  Title:    output variable-length characters in byte-size chunks
X#
X#  Author:   Richard L. Goerwitz
X#
X#  Version:  1.5
X#
X############################################################################
X#
X#  In any number of instances (e.g. when outputting variable-length
X#  characters or fixed-length encoded strings), the programmer must
X#  fit variable and/or non-byte-sized blocks into standard 8-bit
X#  bytes. Outbits() performs this task.
X#
X#  Pass to outbits(i, len) an integer i, and a length parameter (len),
X#  and outbits will suspend byte-sized chunks of i converted to
X#  characters (most significant bits first) until there is not enough
X#  left of i to fill up an 8-bit character. The remaining portion is
X#  stored in a buffer until outbits() is called again, at which point
X#  the buffer is combined with the new i and then output in the same
X#  manner as before. The buffer is flushed by calling outbits() with
X#  a null i argument. Note that len gives the number of bits there
X#  are in i (or at least the number of bits you want preserved; those
X#  that are discarded are the most significant ones).
X#
X#  A trivial example of how outbits() might be used:
X#
X#      outtext := open("some.file.name","w")
X#      l := [1,2,3,4]
X#      every writes(outtext, outbits(!l,3))
X#      writes(outtext, outbits(&null,3))           # flush buffer
X#
X#  List l may be reconstructed with inbits() (see inbits.icn):
X#
X#      intext := open("some.file.name")
X#      l := []
X#      while put(l, inbits(intext, 3))
X#
X#  Note that outbits() is a generator, while inbits() is not.
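X#
X#  A worked trace of the outbits() example above (added as an
X#  illustration; the values follow from the bit arithmetic just
X#  described): the four 3-bit values 1, 2, 3, 4 form the bit string
X#
X#      001 010 011 100
X#
X#  Nothing is output for the first two values, since six bits do not
X#  fill a byte. When the third value arrives, the first eight bits,
X#  00101001, are suspended as char(41). The remaining four bits,
X#  1100, wait in the buffer until the flushing call, which pads them
X#  on the right with zeros and returns 11000000, i.e. char(192).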
X#
X############################################################################
X#
X#  Links: none
X#  See also: inbits.icn
X#
X############################################################################
X
X
Xprocedure outbits(i, len)
X
X    local old_part, new_part, window, old_byte_mask
X    static old_i, old_len, byte_length, byte_mask
X    initial {
X        old_i := old_len := 0
X        byte_length := 8
X        byte_mask := (2^byte_length)-1
X    }
X
X    # Mask off the old_len buffered bits and shift them to the top of
X    # a byte; window is the number of bit positions still free in it.
X    old_byte_mask := (0 < 2^old_len - 1) | 0
X    window := byte_length - old_len
X    old_part := ishift(iand(old_i, old_byte_mask), window)
X
X    # If we have a no-arg invocation, then flush buffer (old_i).
X    if /i then {
X        if old_len > 0 then {
X            old_i := old_len := 0
X            return char(old_part)
X        } else {
X            old_i := old_len := 0
X            fail
X        }
X    } else {
X        new_part := ishift(i, window-len)
X        # If i cannot fill the window, just add it to the buffer and fail.
X        len -:= (len >= window) | {
X            old_len +:= len
X            old_i := ior(ishift(old_part, len-window), i)
X            fail
X        }
X#       For debugging purposes.
X#       write("old_byte_mask = ", old_byte_mask)
X#       write("window = ", image(window))
X#       write("old_part = ", image(old_part))
X#       write("new_part = ", image(new_part))
X#       write("outputting ", image(ior(old_part, new_part)))
X        suspend char(ior(old_part, new_part))
X    }
X
X    # Output any further whole bytes remaining in i.
X    until len < byte_length do {
X        suspend char(iand(ishift(i, byte_length-len), byte_mask))
X        len -:= byte_length
X    }
X
X    # Buffer whatever is left over (fewer than 8 bits) for the next call.
X    old_len := len
X    old_i := i
X    fail
X
Xend
SHAR_EOF
true || echo 'restore of outbits.icn failed'
rm -f _shar_wnt_.tmp
fi
rm -f _shar_seq_.tmp
echo You have unpacked the last part
exit 0
exit 0 # Just in case...
-- 
Kent Landfield                   INTERNET: kent@sparky.IMD.Sterling.COM
Sterling Software, IMD           UUCP:     uunet!sparky!kent
Phone:    (402) 291-8300         FAX:      (402) 291-4362
Please send comp.sources.misc-related mail to kent@uunet.uu.net.