Usenet 1994 January

Text File | 1987-09-25 | 15.7 KB | 434 lines

This is a program identifier database package.  These tools provide a logical extension to ctags (which is limited in that it only stores the locations of function and type *definitions*).  The ID facility stores the locations of all uses of identifiers, pre-processor names, and numbers (in decimal, octal, or hex).

When fixing or enhancing a large program (particularly one that is unfamiliar), it is often necessary to audit the use of global data-structures in order to verify that the proposed modification will not trigger any hidden `gotchas'.  Often this entails grepping through many thousands of lines of source code spread over dozens and sometimes hundreds of source files in multiple sub-directories.  This process places a significant load on computing resources and takes a long time.  There is even the danger that a programmer will avoid doing a complete audit due to the perceived cost--he or she will rely on memory and hope that there are no booby traps.

The id-database is most useful for maintaining large programs that consist of many source files.  The database is simply a two-dimensional boolean array indexed by identifier-name and source-file-name.  For a given identifier and source-file, if the identifier occurs in the file, the boolean value is TRUE.  The database may be queried either by identifier-name or by file-name.  The following types of queries are supported:

* name lookup
    list all the files where an identifier occurs.  The name may be a regular expression.

* name apropos
    list all the files for all identifiers that have the sub-string name in them.  Matches are done in a case-insensitive manner.

* name `grep'
    search for an identifier in all the files where it occurs.  This is an optimized `grep' over all the sources--we only search the files that contain the identifier.

* name edit
    invoke an editor on the files where an identifier occurs, and use the identifier as an initial search string.

* file lookup
    list all identifiers that occur in a file, or list the identifiers that are common between two files.

* non-unique names
    list the names of all identifiers whose names are non-unique within some number of characters.  This is useful when porting a program from a `flexnames' system to one with more limited names.

* solo
    list all identifiers that occur exactly once in a software system.  This may be useful for locating identifiers that are declared but never used, or library functions that are used but never declared.

The first four queries are handled by one program.  The type of query is determined by the name the program was invoked with.  The four links are lid(1) for `lookup id', aid(1) for `apropos id', gid(1) for `grep id' and eid(1) for `edit id'.  One or more identifiers may be passed on the command line.  The identifiers may be literal strings or regular expressions.  Here are some examples:

    $ lid FILE
    FILE           extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c

    $ lid FILE$
    AF_FILE        mkid.c
    AF_IDFILE      mkid.c
    FILE           extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c
    IDFILE         id.h {fid,lid,mkid}.c
    IdFILE         {fid,lid}.c
    argFILE        mkid.c
    gidFILE        lid.c
    idFILE         {init,mkid}.c
    inFILE         {gets0,getsFF,scan-asm,scan-c}.c
    openSrcFILE    extern.h {idx,mkid,opensrc}.c
    srcFILE        {idx,mkid,opensrc}.c

    $ lid ^get
    get            opensrc.c
    getAdaId       getscan.c
    getAsmId       extern.h {getscan,scan-asm}.c
    getCId         extern.h {getscan,scan-c}.c
    getDirToName   extern.h {fid,lid,paths}.c
    getId          {idx,mkid}.c
    getLanguage    extern.h {getscan,idx,mkid}.c
    getLispId      getscan.c
    getPascalId    getscan.c
    getRoffId      getscan.c
    getSCCS        extern.h opensrc.c
    getScanner     extern.h {getscan,idx,mkid}.c
    getTeXId       getscan.c
    getTextId      getscan.c
    getc           {gets0,getsFF,lid,scan-asm,scan-c}.c
    getchar        lid.c
    getenv         extern.h lid.c
    gets           lid.c
    getsFF         extern.h {bitsvec,fid,getsFF,lid,mkid}.c

As you can see, when a regular expression is used, it is possible to get more than one line of output.
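
For readers who like to see the idea in code, here is a rough sketch of the lookup and apropos queries over a toy database.  It is written in Python for brevity and is NOT how the package itself (which is C) is implemented.  The names DATABASE, lookup, lookup_regex and apropos are made up for this illustration, and the sample entries are copied from the lid(1) output above.  How lid(1) decides whether an argument is a literal string or a regular expression is not stated here, so the sketch simply uses two separate functions.

```python
import re

# Hypothetical toy database: identifier -> set of files it occurs in.
# This is a sparse stand-in for the two-dimensional boolean array; an
# entry is "TRUE" when the file name is present in the identifier's set.
DATABASE = {
    "IDFILE": {"id.h", "fid.c", "lid.c", "mkid.c"},
    "IdFILE": {"fid.c", "lid.c"},
    "gidFILE": {"lid.c"},
    "idFILE": {"init.c", "mkid.c"},
    "getchar": {"lid.c"},
    "getenv": {"extern.h", "lid.c"},
}


def lookup(name):
    """Literal lookup: the files where exactly this identifier occurs."""
    return sorted(DATABASE.get(name, set()))


def lookup_regex(pattern):
    """Regular-expression lookup: one entry per matching identifier."""
    regex = re.compile(pattern)
    return {name: sorted(files)
            for name, files in sorted(DATABASE.items())
            if regex.search(name)}


def apropos(substring):
    """Apropos query: case-insensitive sub-string match on identifiers."""
    needle = substring.lower()
    return {name: sorted(files)
            for name, files in sorted(DATABASE.items())
            if needle in name.lower()}


if __name__ == "__main__":
    for name, files in lookup_regex("FILE$").items():
        print(name, " ".join(files))
```

A regular expression can select several identifiers, which is why lookup_regex returns one entry per matching name, just as lid(1) prints one line per name.
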
If you wish multiple lines to be merged into one, supply the `-m' option:

    $ lid -m ^get
    ^get           extern.h {bitsvec,fid,gets0,getsFF,getscan,idx,lid,mkid,opensrc,paths,scan-asm,scan-c}.c

The query program searches for numbers numerically rather than textually.  Therefore you may search for multiple representations of a number.  It is best to illustrate this with examples:

    $ lid -a 0x10
    020            numtst.c
    0x00010        numtst.c
    0x0010         scan-c.c
    0x10           {id,radix}.h {scan-asm,stoi}.c
    16             numtst.c

The `-a' argument tells lid(1) to look for 0x10 in all radixes.  (For numbers 0 through 7, lid(1) looks for all radixes by default.  For numbers greater than 7, lid(1) only looks for the radix that the argument is supplied in.)  It is also possible to restrict the search to selected radixes by supplying an argument consisting of one or more of the key-letters `o', `d', and `x' for octal, decimal, and hexadecimal respectively:

    $ lid -o 0x10
    020            numtst.c

    $ lid -x 16
    0x00010        numtst.c
    0x0010         scan-c.c
    0x10           {id,radix}.h {scan-asm,stoi}.c

    $ lid -d 020
    16             numtst.c

The grep interface behaves somewhat like the following command:

    $ grep -w -n `lid TRUE`

Here's some sample output for the equivalent gid command:

    $ gid TRUE
    bool.h:5: #define TRUE (0==0)
    lid.c:102: case 'm': forceMerge = TRUE; break;
    lid.c:170: Merging = TRUE;
    lid.c:204: crunching = TRUE;
    lid.c:553: hitDigits = TRUE;
    lid.c:787: return TRUE;
    mkid.c:117: Verbose = TRUE;
    mkid.c:191: keepLang = TRUE;
    scan-asm.c:79: static bool eatUnder = TRUE;
    scan-asm.c:80: static bool preProcess = TRUE;
    scan-asm.c:96: static bool newLine = TRUE;
    scan-asm.c:130: newLine = TRUE;
    scan-asm.c:141: newLine = TRUE;
    scan-asm.c:145: newLine = TRUE;
    scan-asm.c:150: newLine = TRUE;
    scan-asm.c:165: newLine = TRUE;
    scan-c.c:88: static bool eatUnder = TRUE;
    scan-c.c:101: static bool newLine = TRUE;
    scan-c.c:138: newLine = TRUE;
    scan-c.c:199: newLine = TRUE;
    scan-c.c:205: newLine = TRUE;
    scan-c.c:210: newLine = TRUE;
    wmatch.c:37: return TRUE;

Notice that each line is reported in the same format as a C-preprocessor error message.  This feature allows gid(1) lines to be digested by any program that parses error messages, such as error(1) and gnu-emacs.

If you want to edit all files that have an identifier, you may conveniently do so with eid(1):

    $ eid TRUE
    TRUE           bool.h {lid,mkid,scan-asm,scan-c,wmatch}.c
    Edit? [y1-9^S/nq]

Before the editor is invoked, you are given the lid(1) output to review and confirm.  If you want to edit all files listed, respond with a newline or with `y'.  If you want to skip some number of files into the argument list, respond with a single digit `1' through `9' to skip that many files, or do a string-search to the first file you want with `^S<string>' or `/<string>'.  If you don't want to edit anything, type `n' to go on to the next argument you gave to eid(1), or type `q' to quit altogether.

The behavior of the editing interface is controlled by three environment variables called EIDARG, EIDLDEL, and EIDRDEL.  The best way to illustrate their use is by example.  Here is how to define them for vi(1) (using /bin/sh syntax):

    EIDARG='+/%s/'    # printf(3) string for initial search-string argument
    EIDLDEL='\<'      # left word-delimiter
    EIDRDEL='\>'      # right word-delimiter

`EID[LR]DEL' are positioned around the identifier as left and right word-delimiters if your editor supports that notion.  Then the whole name-string is sprintf(3)'ed into `EIDARG' to construct the initial search-string argument to the editor.  If your editor can't digest such an argument, simply leave these variables undefined in the environment.

Some emacs users are appalled at the notion of starting up a fresh editor simply to follow an identifier.  For those who are fortunate enough to have a programmable emacs such as gnu-emacs, it is fairly simple to devise a command that invokes gid(1) and digests its output as though it were /lib/cpp error strings to be examined.  (Sorry, no such code is provided at this posting...)
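
To give a feel for what `digesting' gid(1) output involves, here is a minimal sketch in Python (not emacs lisp, and not part of this package) that splits lines of the form `file:line: text' into their parts and groups them by file.  The names GID_LINE, parse_gid_output and SAMPLE are invented for this illustration; the sample lines are copied from the gid output above.  The sketch assumes that file names contain no colons.

```python
import re
from collections import defaultdict

# One gid-style line: "file:line: text".  Assumes no ':' in file names.
GID_LINE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):\s?(?P<text>.*)$")


def parse_gid_output(output):
    """Group gid-style lines by file as (line_number, text) pairs."""
    hits = defaultdict(list)
    for raw in output.splitlines():
        match = GID_LINE.match(raw)
        if match is None:
            continue  # ignore anything not in file:line: form
        hits[match["file"]].append((int(match["line"]), match["text"]))
    return dict(hits)


SAMPLE = """\
bool.h:5: #define TRUE (0==0)
lid.c:102: case 'm': forceMerge = TRUE; break;
lid.c:170: Merging = TRUE;
wmatch.c:37: return TRUE;
"""

if __name__ == "__main__":
    for name, lines in parse_gid_output(SAMPLE).items():
        print(name, [number for number, _ in lines])
```

An editor command built on this idea would then visit each file and jump to each recorded line number in turn.
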

Another type of query is to find all identifiers that are non-unique within some number of characters.  This is useful for finding potential portability problems when moving to a system whose compiler or linker limits the number of significant characters in a name.  The `-u<n>' argument does the trick.  Here's a list of identifiers that may yield multiply-defined errors in a symbol table that only knows about the first 7 characters:

    $ lid -u7
    SCAN_TEX       getscan.c
    SCAN_TEXT      getscan.c
    idh_argc       id.h {init,mkid}.c
    idh_argo       id.h {init,mkid}.c
    idh_namc       id.h {fid,mkid}.c
    idh_namo       id.h {fid,init,lid,mkid}.c
    oldHashSize    mkid.c
    oldHashTable   mkid.c

Better yet, if you want to edit these, try

    $ eid -u7
    ^SCAN_TE       getscan.c
    Edit? [y1-9^S/nq] n
    ^idh_arg       getscan.c id.h {init,mkid}.c
    Edit? [y1-9^S/nq] n
    ^idh_nam       {fid,getscan}.c id.h {init,lid,mkid}.c
    Edit? [y1-9^S/nq] n
    ^oldHash       {fid,getscan}.c id.h {init,lid,mkid}.c
    Edit? [y1-9^S/nq] n

An additional feature of lid(1) is that pathnames are automatically adjusted for the current working directory.  Large programs such as the UNIX kernel are often partitioned into subsystems whose sources live in different directories.  What follows are several examples of the same search conducted from different points in the UNIX kernel source hierarchy:

    $ cd /src/uts/m68k
    $ lid bdevsw
    bdevsw         sys/conf.h cf/conf.c io/bio.c os/{fio,main,prf,sys3}.c
    $ cd io
    $ lid bdevsw
    bdevsw         ../sys/conf.h ../cf/conf.c bio.c ../os/{fio,main,prf,sys3}.c
    $ cd ../os
    $ lid bdevsw
    bdevsw         ../sys/conf.h ../cf/conf.c ../io/bio.c {fio,main,prf,sys3}.c

The database is built with mkid(1).  The user supplies pathnames either on the command line or on stdin.
Here's the output of the `verbose' option to mkid(1):

    $ mkid -v *.h *.c
    c: bitops.h
    c: bool.h
    c: extern.h
    c: id.h
    c: patchlevel.h
    c: radix.h
    c: string.h
    c: basename.c
    c: bitcount.c
    c: bitops.c
    c: bitsvec.c
    c: bsearch.c
    c: bzero.c
    c: document.c
    c: fid.c
    c: gets0.c
    c: getsFF.c
    c: getscan.c
    c: hash.c
    c: idx.c
    c: init.c
    c: lid.c
    c: mkid.c
    c: numtst.c
    c: opensrc.c
    c: paths.c
    c: scan-asm.c
    c: scan-c.c
    c: stoi.c
    c: tty.c
    c: uerror.c
    c: wmatch.c
    Compressing Hash Table...
    Sorting Hash Table...
    Writing `ID'...
    Names: 593, Numbers: 64, Strings: 43, Solo: 119, Total: 697
    Occurrances: 11.67, Load: 0.17, Probes: 1.07

Mkid(1) echoes the name of each file as it is scanned, prefixed by the name of the language it thinks the file is written in.  Mkid(1) reports how many unique names and numbers were found, how many names occurred only once, and the total for names and numbers.  It also reports the average number of occurrences for all names and numbers.  Next, there are some hash-table statistics on the load-factor and the average number of open-addressed probes.

Mkid(1) can take arguments from the command line, from stdin, or from a file.  A file full of filenames may also contain mkid options of the form -<option>.  Filenames and options appear in the file one-per-line.  Typical usage for this feature is as follows:

    $ find . -name '*.[chys]' -print >IDFILES
    $ mkid -aIDFILES

        -- or --

    $ find . -name '*.[chys]' -print |mkid -

Mkid(1) stashes the filenames and relevant arguments in the database itself.  It uses these to support the `incremental-update' option.  If invoked with `-u', mkid(1) checks the modification times of all constituent files, and only re-scans those that are newer than the database itself.  It is invoked like so:

    $ mkid -u

In summation, mkid(1) can get arguments from one of four places: 1) the command line, 2) a file, 3) stdin, 4) the database itself.

Mkid(1) accepts a number of scanner-specific arguments.
Generally, these are introduced with `-S<lang>' where <lang> is the name of a language, such as `c' or `asm'.  You can get a scanner-specific usage-report with `-S<lang>?'.  (Of course, the `?' must be escaped to get it past the shell.)  Here's the scanner-usage for the assembly language scanner:

    $ mkid -Sasm\?
    The Assembler scanner arguments take the form -Sasm<arg>, where
    <arg> is one of the following: (<cc> denotes one or more characters)
      -c<cc> . . . . <cc> introduce(s) a comment until end-of-line.
      (+|-)u . . . . (Do|Don't) strip a leading `_' from ids.
      (+|-)a<cc> . . Allow <cc> in ids, and (keep|ignore) those ids.
      (+|-)p . . . . (Do|Don't) handle C-preprocessor directives.
      (+|-)C . . . . (Do|Don't) handle C-style comments. (/* */)

`-Sasm-c<cc>' tells the scanner what characters are used to introduce comments that extend to end-of-line.

Use `-Sasm+u' if your C compiler prepends leading underscores to external names.  This way, mkid(1) will strip leading underscores, and the name `foo' in a C source will be correctly associated with the name `_foo' in an assembler source.  If your compiler doesn't prepend leading underscores, use `-Sasm-u'.

Many assemblers allow special characters to be mixed with alpha-numerics in label, constant and register names.  Common choices are `.', `%', and `$'.  Thus, a label such as `L%123' should be scanned as one token, not broken up into the name `L' and the number 123.  `-Sasm-a%.' tells the scanner to allow `%' and `.' in tokens, but to throw away tokens containing `%' or `.'.  `-Sasm+a%.' tells the scanner to keep such tokens and put them into the database.

`-Sasm+p' tells the scanner to handle `#include' and `#define' lines as in C source, and `-Sasm+C' tells it to ignore C-style comments.

Here's the scanner-usage for C:

    $ mkid -Sc\?
    The C scanner arguments take the form -Sc<arg>, where <arg>
    is one of the following: (<cc> denotes one or more characters)
      (+|-)u . . . . (Do|Don't) strip a leading `_' from ids in strings.
      -s<cc> . . . . Allow <cc> in string ids.

The `+u' argument is akin to the argument for the assembly-language scanner.  Mkid(1) keeps the contents of quoted-strings if the string contains a single valid C name and nothing else.  E.g. mkid(1) would keep the contents of "_proc".  Such strings are interesting because they may contain symbol names that a program uses for nlist lookups.  So, if your compiler prepends underscores to external symbols, use `-Sc+u' so that mkid(1) will strip them back off.

Mkid(1) normally throws away the contents of quoted strings that have anything other than a single name in them.  You can tell mkid(1) to accept additional characters in strings with `-Sc-s<cc>' where <cc> is one or more special characters.  E.g. `-Sc-s/.-:,' will include most of the strings containing pathnames that you are likely to encounter.

Another class of scanner argument allows you to associate a suffix with a language.  E.g. `-S.y=c' tells mkid(1) to use the C language scanner on all files ending with .y.  You can ask mkid(1) for the available scanners and associated suffixes like so:

    $ mkid -S\?=\?
    .c=c, .h=c, .y=c, .s=asm, .p=pascal, .pas=pascal

Please note, mkid(1) is lying to you about its Pascal prowess!  At the time of this posting, there are scanners for C and assembly language sources.  There are also stubs for Pascal, Ada and LISP.  The scanners are very fast.  The assembly language scanner knows how to throw away C-style comments as well as the traditional `comment-character-until-end-of-line' style.

In order to test new scanners, there is a scanner driver called idx(1).  Idx(1) simply calls the scanner to get identifiers one-at-a-time and prints them on stdout one-per-line.

For more information, read the manual pages!

Happy Hacking,
--
-- Greg McGary
-- P.O. Box 286
-- Lincoln, MA 01773
--
-- 9/15/87
--
-- Until the end of 1987,
-- Consulting to Sun's East Coast Division:
--     gmcgary@ecd.sun.com
--     gmcgary@suneast.uu.net
--
-- After that, probably consulting in Europe...