-
- This is a program identifier database package. These tools provide a
- logical extension to ctags, which is limited in that it stores only
- the locations of function and type *definitions*. The ID facility
- stores the locations of all uses of identifiers, pre-processor
- names, and numbers (in decimal, octal, or hex).
-
- When fixing or enhancing a large program (particularly one that is
- unfamiliar) it is often necessary to audit the use of global
- data-structures in order to verify that the proposed modification will
- not trigger any hidden `gotchas'. Often this entails grepping through
- many thousands of lines of source code spread over dozens and sometimes
- hundreds of source files in multiple sub-directories. This process
- places a significant load on computing resources, and takes a long
- time. There is even the danger that a programmer will avoid doing a
- complete audit due to the perceived cost--he or she will rely on memory
- and hope that there are no booby traps.
-
- The id-database is most useful for maintaining large programs that
- consist of many source files. The database is simply a two-dimensional
- boolean array indexed by identifier-name and source-file-name. For a
- given identifier and source-file, if the identifier occurs in the file,
- the boolean value is TRUE. The database may be queried either by
- identifier-name or file-name.
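-
- The boolean array can be pictured with a short Python sketch. It is an
- illustration only and says nothing about the on-disk format; the class
- `IdIndex' and its methods are hypothetical names. A set of files per
- identifier stands in for the TRUE cells of one row.
-
```python
from collections import defaultdict


class IdIndex:
    """Toy stand-in for the ID database: which identifiers occur in which files."""

    def __init__(self):
        # identifier -> set of files; a sparse form of the boolean matrix
        self.files_by_id = defaultdict(set)

    def add(self, identifier, filename):
        """Set the (identifier, filename) cell to TRUE."""
        self.files_by_id[identifier].add(filename)

    def occurs(self, identifier, filename):
        """Read one cell of the matrix."""
        return filename in self.files_by_id.get(identifier, set())

    def files_for(self, identifier):
        """Query by identifier-name: the TRUE cells of one row."""
        return sorted(self.files_by_id.get(identifier, set()))

    def ids_in(self, filename):
        """Query by file-name: the TRUE cells of one column."""
        return sorted(i for i, fs in self.files_by_id.items() if filename in fs)


index = IdIndex()
for ident, fname in [("TRUE", "bool.h"), ("TRUE", "lid.c"), ("getenv", "lid.c")]:
    index.add(ident, fname)
```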
-
- The following types of queries are supported:
-
- * name lookup
- list all the files where an identifier occurs. The name
- may be a regular expression.
-
- * name apropos
- list all the files for all identifiers that have the sub-string
- name in them. Matches are done in a case-insensitive manner.
-
- * name `grep'
- search for an identifier in all the files where it occurs.
- This is an optimized `grep' over all the sources--we only
- search on files that contain the identifier.
-
- * name edit
- invoke an editor on the files where an identifier occurs,
- and use the identifier as an initial search string.
-
- * file lookup
- list all identifiers that occur in a file, or list
- the identifiers that are common between two files.
-
- * non-unique names
- list the names of all identifiers whose names are non-unique
- within some number of characters. This is useful when porting
- a program from a `flexnames' system to one with more limited names.
-
- * solo
- list all identifiers that occur exactly once in a software
- system. This may be useful for locating identifiers that are
- declared but never used, or library functions that are used
- but never declared.
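-
- As an illustration of the `solo' query, here is a Python sketch that
- counts occurrences across a set of already-scanned files. The function
- name and the input layout (file name mapped to a list of tokens) are
- assumptions made for the example, not the package's own data structures.
-
```python
from collections import Counter


def solo_identifiers(tokens_by_file):
    """Return identifiers that occur exactly once across all files."""
    counts = Counter()
    for tokens in tokens_by_file.values():
        counts.update(tokens)
    return sorted(name for name, n in counts.items() if n == 1)


# Hypothetical scanner output for two small files.
example = {
    "a.c": ["helper", "main", "unused_var"],
    "b.c": ["helper", "printf"],
}
```
-
- Here `helper' occurs twice and is excluded; the other three names are solo.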
-
-
- The first four queries are handled by one program. The type of query
- is determined by the name the program was invoked with. The four links
- are lid(1) for `lookup id', aid(1) for `apropos id', gid(1) for `grep
- id' and eid(1) for `edit id'. One or more identifiers may be passed on
- the command line. The identifiers may be literal strings or regular
- expressions. Here are some examples:
-
- $ lid FILE
- FILE extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c
-
- $ lid FILE$
- AF_FILE mkid.c
- AF_IDFILE mkid.c
- FILE extern.h {fid,gets0,getsFF,idx,init,lid,mkid,opensrc,scan-asm,scan-c}.c
- IDFILE id.h {fid,lid,mkid}.c
- IdFILE {fid,lid}.c
- argFILE mkid.c
- gidFILE lid.c
- idFILE {init,mkid}.c
- inFILE {gets0,getsFF,scan-asm,scan-c}.c
- openSrcFILE extern.h {idx,mkid,opensrc}.c
- srcFILE {idx,mkid,opensrc}.c
-
- $ lid ^get
- get opensrc.c
- getAdaId getscan.c
- getAsmId extern.h {getscan,scan-asm}.c
- getCId extern.h {getscan,scan-c}.c
- getDirToName extern.h {fid,lid,paths}.c
- getId {idx,mkid}.c
- getLanguage extern.h {getscan,idx,mkid}.c
- getLispId getscan.c
- getPascalId getscan.c
- getRoffId getscan.c
- getSCCS extern.h opensrc.c
- getScanner extern.h {getscan,idx,mkid}.c
- getTeXId getscan.c
- getTextId getscan.c
- getc {gets0,getsFF,lid,scan-asm,scan-c}.c
- getchar lid.c
- getenv extern.h lid.c
- gets lid.c
- getsFF extern.h {bitsvec,fid,getsFF,lid,mkid}.c
-
- As you can see, when a regular expression is used, it is possible to
- get more than one line of output. If you wish multiple lines to be
- merged into one, supply the `-m' option:
-
- $ lid -m ^get
- ^get extern.h {bitsvec,fid,gets0,getsFF,getscan,idx,lid,mkid,opensrc,paths,scan-asm,scan-c}.c
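-
- The effect of `-m' can be sketched in Python as follows. This is a rough
- model, not the lid(1) implementation: every query is treated as a regular
- expression, and `lookup' and the sample index are hypothetical.
-
```python
import re


def lookup(index, pattern, merge=False):
    """Return (label, files) pairs for identifiers matching a regular expression.

    index maps identifier -> set of file names.
    """
    rx = re.compile(pattern)
    hits = [(name, sorted(files))
            for name, files in sorted(index.items())
            if rx.search(name)]
    if not merge or not hits:
        return hits
    # Merged form: one line, labelled with the pattern, listing the union.
    merged = sorted(set().union(*(files for _, files in hits)))
    return [(pattern, merged)]


sample = {
    "get": {"opensrc.c"},
    "getc": {"gets0.c", "lid.c"},
    "gets": {"lid.c"},
    "FILE": {"extern.h"},
}
```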
-
- The query program searches for numbers numerically rather than
- textually. Therefore you may search for multiple representations of a
- number. It is best to illustrate this with examples:
-
- $ lid -a 0x10
- 020 numtst.c
- 0x00010 numtst.c
- 0x0010 scan-c.c
- 0x10 {id,radix}.h {scan-asm,stoi}.c
- 16 numtst.c
-
- The `-a' argument tells lid(1) to look for 0x10 in all radixes. (For
- numbers 0 through 7, lid(1) looks for all radixes by default. For numbers
- greater than 7, lid(1) only looks for the radix that the argument is
- supplied in.) It is also possible to restrict the search to selected
- radixes by supplying an argument consisting of one or more of the
- key-letters `o', `d', and `x' for octal, decimal, and hexadecimal
- respectively:
-
- $ lid -o 0x10
- 020 numtst.c
-
- $ lid -x 16
- 0x00010 numtst.c
- 0x0010 scan-c.c
- 0x10 {id,radix}.h {scan-asm,stoi}.c
-
- $ lid -d 020
- 16 numtst.c
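-
- Numeric matching with a radix filter can be sketched in Python like this.
- The helper names are hypothetical, and the sketch does not model the
- default rule for numbers greater than 7; it compares values and then
- filters by the radix each token was written in.
-
```python
def parse_c_number(token):
    """Return (value, radix_letter) for a C-style integer literal."""
    text = token.lower()
    if text.startswith("0x"):
        return int(text, 16), "x"
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8), "o"
    return int(text, 10), "d"


def match_numbers(tokens, query, radixes="odx"):
    """Return the tokens equal in value to query and written in an allowed radix."""
    want, _ = parse_c_number(query)
    matches = []
    for tok in tokens:
        value, radix = parse_c_number(tok)
        if value == want and radix in radixes:
            matches.append(tok)
    return matches


# Spellings taken from the examples above, plus decimal 10 as a non-match.
tokens = ["020", "0x00010", "0x0010", "0x10", "16", "10"]
```
-
- With these definitions, match_numbers(tokens, "0x10", "o") selects only
- `020', much as `lid -o 0x10' does above.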
-
-
- The grep interface behaves somewhat like the following command:
-
- $ grep -w -n `lid TRUE`
-
- Here's some sample output for the equivalent gid command:
-
- $ gid TRUE
- bool.h:5: #define TRUE (0==0)
- lid.c:102: case 'm': forceMerge = TRUE; break;
- lid.c:170: Merging = TRUE;
- lid.c:204: crunching = TRUE;
- lid.c:553: hitDigits = TRUE;
- lid.c:787: return TRUE;
- mkid.c:117: Verbose = TRUE;
- mkid.c:191: keepLang = TRUE;
- scan-asm.c:79: static bool eatUnder = TRUE;
- scan-asm.c:80: static bool preProcess = TRUE;
- scan-asm.c:96: static bool newLine = TRUE;
- scan-asm.c:130: newLine = TRUE;
- scan-asm.c:141: newLine = TRUE;
- scan-asm.c:145: newLine = TRUE;
- scan-asm.c:150: newLine = TRUE;
- scan-asm.c:165: newLine = TRUE;
- scan-c.c:88: static bool eatUnder = TRUE;
- scan-c.c:101: static bool newLine = TRUE;
- scan-c.c:138: newLine = TRUE;
- scan-c.c:199: newLine = TRUE;
- scan-c.c:205: newLine = TRUE;
- scan-c.c:210: newLine = TRUE;
- wmatch.c:37: return TRUE;
-
- Notice that each line is reported in the same format as a
- C-preprocessor error message. This feature allows gid(1) lines to be
- digested by any program that parses error messages, such as error(1)
- and gnu-emacs.
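-
- A Python sketch of the same idea: search only the files the index names
- for an identifier, and report whole-word hits as `file:line: text'. The
- function `grep_identifier' and the in-memory `sources' mapping are
- hypothetical and stand in for reading real files.
-
```python
import re


def grep_identifier(name, sources):
    """Return 'file:line: text' strings for whole-word matches of name.

    sources maps file name -> file contents, and is assumed to hold only
    the files that the index lists for name.
    """
    word = re.compile(r"\b" + re.escape(name) + r"\b")
    hits = []
    for filename in sorted(sources):
        for lineno, line in enumerate(sources[filename].splitlines(), start=1):
            if word.search(line):
                hits.append(f"{filename}:{lineno}: {line}")
    return hits


sources = {
    "bool.h": "#define FALSE (0!=0)\n#define TRUE (0==0)\n",
    "wmatch.c": "int ok = TRUENESS;\nreturn TRUE;\n",
}
```
-
- The word-boundary test keeps `TRUENESS' from matching, which is the job
- `-w' does in the grep command shown earlier.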
-
- If you want to edit all files that have an identifier, you may
- conveniently do so with eid(1):
-
- $ eid TRUE
- TRUE bool.h {lid,mkid,scan-asm,scan-c,wmatch}.c
- Edit? [y1-9^S/nq]
-
- Before the editor is invoked, you are given the lid(1) output to review
- and confirm. If you want to edit all files listed, respond with a
- newline or with `y'. If you want to skip some number of files into the
- argument list, respond with a single digit `1' through `9' to skip that
- many files, or do a string-search to the first file you want with
- `^S<string>' or `/<string>'. If you don't want to edit anything, type
- `n' to go on to the next argument you gave to eid(1) or type `q' to
- quit altogether.
-
- The behavior of the editing interface is controlled by three
- environment variables called EIDARG, EIDLDEL, and EIDRDEL. The best
- way to illustrate their use is by example. Here is how to define them
- for vi(1) (using /bin/sh syntax):
-
- EIDARG='+/%s/' # printf(3) string for initial search-string argument
- EIDLDEL='\<' # left word-delimiter
- EIDRDEL='\>' # right word-delimiter
-
- `EID[LR]DEL' are positioned around the identifier as left and right
- word-delimiters if your editor supports that notion. Then the whole
- name-string is sprintf(3)'ed into `EIDARG' to construct the initial
- search-string argument to the editor. If your editor can't digest such
- an argument, simply leave these variables undefined in the
- environment.
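-
- The construction just described can be mimicked in Python with `%'
- formatting in place of sprintf(3). The function name is hypothetical and
- the environment is passed in as a plain dictionary for the example.
-
```python
def editor_search_arg(identifier, env):
    """Build the editor's initial search argument, or None if EIDARG is unset."""
    template = env.get("EIDARG")
    if template is None:
        return None
    # Wrap the identifier in the left and right word-delimiters, if any.
    pattern = env.get("EIDLDEL", "") + identifier + env.get("EIDRDEL", "")
    # Substitute the wrapped name for %s, as sprintf(3) would.
    return template % pattern


# The vi(1) settings shown above.
vi_env = {"EIDARG": "+/%s/", "EIDLDEL": r"\<", "EIDRDEL": r"\>"}
```
-
- For the identifier TRUE and the vi(1) settings, the result is the
- argument `+/\<TRUE\>/'.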
-
- Some emacs users are appalled at the notion of starting up a fresh editor
- simply to follow an identifier. For those who are fortunate enough to have
- a programmable emacs such as gnu-emacs, it is fairly simple to devise
- a command that invokes gid(1) and digests its output as though it were
- /lib/cpp error strings to be examined. (Sorry, no such code is provided
- at this posting...)
-
- Another type of query is to find all identifiers that are non-unique
- within some number of characters. This is useful for finding potential
- portability problems when moving to a system whose compiler or linker
- limits the number of significant characters in a name. The `-u<n>'
- argument does the trick. Here's a list of identifiers that may yield
- multiply-defined errors in a symbol table that only knows about the
- first 7 characters:
-
- $ lid -u7
- SCAN_TEX getscan.c
- SCAN_TEXT getscan.c
- idh_argc id.h {init,mkid}.c
- idh_argo id.h {init,mkid}.c
- idh_namc id.h {fid,mkid}.c
- idh_namo id.h {fid,init,lid,mkid}.c
- oldHashSize mkid.c
- oldHashTable mkid.c
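-
- The test behind `-u<n>' amounts to grouping names by their first n
- characters. A Python sketch under that assumption (the function name and
- the sample list are illustrative only):
-
```python
from collections import defaultdict


def non_unique(names, n):
    """Group names that collide when only the first n characters are significant."""
    groups = defaultdict(list)
    for name in sorted(set(names)):
        groups[name[:n]].append(name)
    # Keep only prefixes shared by two or more distinct names.
    return {prefix: group for prefix, group in groups.items() if len(group) > 1}


names = [
    "SCAN_TEX", "SCAN_TEXT", "SCAN_C",
    "idh_argc", "idh_argo",
    "oldHashSize", "oldHashTable",
    "getId",
]
```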
-
- Better yet, if you want to edit these, try
-
- $ eid -u7
- ^SCAN_TE getscan.c
- Edit? [y1-9^S/nq] n
- ^idh_arg getscan.c id.h {init,mkid}.c
- Edit? [y1-9^S/nq] n
- ^idh_nam {fid,getscan}.c id.h {init,lid,mkid}.c
- Edit? [y1-9^S/nq] n
- ^oldHash {fid,getscan}.c id.h {init,lid,mkid}.c
- Edit? [y1-9^S/nq] n
-
-
- An additional feature of lid(1) is that pathnames are automatically
- adjusted for the current working directory. Large programs such as the
- UNIX kernel are often partitioned into subsystems whose sources live in
- different directories. What follows are several examples of the same
- search conducted from different points in the UNIX kernel source
- hierarchy:
-
- $ cd /src/uts/m68k
- $ lid bdevsw
- bdevsw sys/conf.h cf/conf.c io/bio.c os/{fio,main,prf,sys3}.c
-
- $ cd io
- $ lid bdevsw
- bdevsw ../sys/conf.h ../cf/conf.c bio.c ../os/{fio,main,prf,sys3}.c
-
- $ cd ../os
- $ lid bdevsw
- bdevsw ../sys/conf.h ../cf/conf.c ../io/bio.c {fio,main,prf,sys3}.c
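-
- The path adjustment can be approximated in Python with
- posixpath.relpath. This is a sketch of the visible behavior only; the
- function `adjust' and the assumption that stored names are relative to
- the directory holding the database are mine.
-
```python
import posixpath


def adjust(paths, db_dir, cwd):
    """Rewrite database-relative paths so they are relative to cwd."""
    return [posixpath.relpath(posixpath.join(db_dir, p), start=cwd)
            for p in paths]


# A few of the file names from the bdevsw example.
stored = ["sys/conf.h", "cf/conf.c", "io/bio.c", "os/main.c"]
```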
-
- The database is built with mkid(1). The user supplies pathnames
- either on the command line or on stdin. Here's the output of the
- `verbose' option to mkid(1):
-
- $ mkid -v *.h *.c
- c: bitops.h
- c: bool.h
- c: extern.h
- c: id.h
- c: patchlevel.h
- c: radix.h
- c: string.h
- c: basename.c
- c: bitcount.c
- c: bitops.c
- c: bitsvec.c
- c: bsearch.c
- c: bzero.c
- c: document.c
- c: fid.c
- c: gets0.c
- c: getsFF.c
- c: getscan.c
- c: hash.c
- c: idx.c
- c: init.c
- c: lid.c
- c: mkid.c
- c: numtst.c
- c: opensrc.c
- c: paths.c
- c: scan-asm.c
- c: scan-c.c
- c: stoi.c
- c: tty.c
- c: uerror.c
- c: wmatch.c
- Compressing Hash Table...
- Sorting Hash Table...
- Writing `ID'...
- Names: 593, Numbers: 64, Strings: 43, Solo: 119, Total: 697
- Occurrances: 11.67, Load: 0.17, Probes: 1.07
-
- Mkid(1) echoes the name of each file as it is scanned, prefixed by the
- name of the language it thinks the file is written in. Mkid(1) reports
- how many unique names and numbers were found, how many names occurred
- only once, and the total for names and numbers. It also reports the
- average number of occurrences for all names and numbers. Next, there
- are some hash-table statistics on the load-factor and the average
- number of open-addressed probes.
-
- Mkid(1) can take arguments from the command line, from stdin, or from
- a file. A file full of filenames may also contain mkid options of the form
- -<option>. Filenames and options appear in the file one-per-line. Typical
- usage for this feature is as follows:
-
- $ find . -name '*.[chys]' -print >IDFILES
- $ mkid -aIDFILES
-
- -- or --
-
- $ find . -name '*.[chys]' -print |mkid -
-
- Mkid(1) stashes the filenames and relevant arguments in the database
- itself. It uses these to support the `incremental-update' option.
- If invoked with `-u', mkid(1) checks the modification times of all
- constituent files, and only re-scans those that are newer than the
- database itself. It is invoked like so:
-
- $ mkid -u
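-
- The modification-time check can be sketched in Python as below. Both
- function names are hypothetical, and the sketch covers only the choice
- of files to re-scan, not the update of the database.
-
```python
import os


def newer_than(db_mtime, source_mtimes):
    """Return the names whose modification time is later than the database's."""
    return sorted(p for p, m in source_mtimes.items() if m > db_mtime)


def files_to_rescan(database, sources):
    """Apply newer_than() to real files, using their current mtimes."""
    mtimes = {p: os.path.getmtime(p) for p in sources}
    return newer_than(os.path.getmtime(database), mtimes)
```
-
- A file whose time stamp equals that of the database is treated as up to
- date here; whether mkid(1) makes the same choice is not stated above.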
-
- In summation, mkid(1) can get arguments from one of four places:
- 1) the command line, 2) a file, 3) stdin, 4) the database itself.
-
- Mkid(1) accepts a number of scanner-specific arguments. Generally,
- these are introduced with `-S<lang>' where <lang> is the name of
- a language, such as `c' or `asm'. You can get a scanner-specific
- usage-report with `-S<lang>?'. (Of course, the `?' must be escaped
- to get it past the shell.)
-
- Here's scanner-usage for the assembly language scanner:
-
- $ mkid -Sasm\?
- The Assembler scanner arguments take the form -Sasm<arg>, where
- <arg> is one of the following: (<cc> denotes one or more characters)
- -c<cc> . . . . <cc> introduce(s) a comment until end-of-line.
- (+|-)u . . . . (Do|Don't) strip a leading `_' from ids.
- (+|-)a<cc> . . Allow <cc> in ids, and (keep|ignore) those ids.
- (+|-)p . . . . (Do|Don't) handle C-preprocessor directives.
- (+|-)C . . . . (Do|Don't) handle C-style comments. (/* */)
-
- `-Sasm-c<cc>' tells the scanner what characters are used to introduce comments
- that extend to end-of-line.
-
- Use `-Sasm+u' if your C compiler prepends leading underscores to external
- names. This way, mkid(1) will strip leading underscores, and the name
- `foo' in a C source will be correctly associated with the name `_foo'
- in an assembler source. If your compiler doesn't prepend leading
- underscores, use `-Sasm-u'.
-
- Many assemblers allow special characters to be mixed with
- alpha-numerics in label, constant and register names. Common choices
- are `.', `%', and `$'. Thus, a label such as `L%123' should be scanned
- as one token, not broken up into the name `L' and the number 123.
- `-Sasm-a%.' tells the scanner to allow `%' and `.' in tokens, but to throw
- away tokens containing `%' or `.'. `-Sasm+a%.' tells the scanner to keep such
- tokens and put them into the database.
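-
- The effect of the `u' and `a' arguments on tokenizing can be pictured
- with a small Python sketch. It is not the real scanner: it ignores
- comments and numbers, and the function and parameter names are
- hypothetical.
-
```python
import re


def scan_asm_ids(line, extra="", keep_extra=True, strip_underscore=False):
    """Split one line of assembler into identifier-like tokens.

    extra            -- additional characters allowed inside tokens
    keep_extra       -- True models +a (keep such tokens), False models -a
    strip_underscore -- True models +u (strip one leading `_')
    """
    special = re.escape(extra)
    token = re.compile("[A-Za-z_" + special + "][A-Za-z0-9_" + special + "]*")
    ids = []
    for tok in token.findall(line):
        if not keep_extra and any(ch in extra for ch in tok):
            continue  # token contains a special character: throw it away
        if strip_underscore and tok.startswith("_") and len(tok) > 1:
            tok = tok[1:]
        ids.append(tok)
    return ids
```
-
- With extra="%." the label `L%123' comes back as a single token; with no
- extra characters only the name `L' is found.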
-
-
- `-Sasm+p' tells the scanner to handle `#include' and `#define' lines as
- in C source, and `-Sasm+C' tells it to ignore C-style comments.
-
- Here's the scanner-usage for C:
-
- $ mkid -Sc\?
- The C scanner arguments take the form -Sc<arg>, where <arg>
- is one of the following: (<cc> denotes one or more characters)
- (+|-)u . . . . (Do|Don't) strip a leading `_' from ids in strings.
- -s<cc> . . . . Allow <cc> in string ids.
-
- The `+u' argument is akin to the argument for the assembly-language
- scanner. Mkid(1) keeps the contents of quoted-strings if the string
- contains a single valid C name and nothing else. E.g. mkid(1) would
- keep the contents of "_proc". Such strings are interesting because
- they may contain symbol names that a program uses for nlist lookups.
- So, if your compiler prepends underscores to external symbols, use
- `-Sc+u' so that mkid(1) will strip them back off.
-
- Mkid(1) normally throws away the contents of quoted strings that have
- anything other than a single name in them. You can tell mkid(1) to
- accept additional characters in strings with `-Sc-s<cc>' where <cc> is
- one or more special characters. E.g. `-Sc-s/.-:,' will include most of
- the strings containing pathnames that you are likely to encounter.
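-
- The rule for quoted strings can be sketched in Python as follows. The
- function `string_id' is hypothetical; it takes the contents of a string
- literal (without the quotes) and approximates the `u' and `s' arguments.
-
```python
import re


def string_id(contents, extra="", strip_underscore=False):
    """Return the identifier held by a string's contents, or None to discard it.

    extra            -- additional characters to accept, as with -Sc-s<cc>
    strip_underscore -- True models -Sc+u
    """
    allowed = "[A-Za-z0-9_" + re.escape(extra) + "]+"
    if not re.fullmatch(allowed, contents) or contents[0].isdigit():
        return None  # more than a single name: throw the string away
    if strip_underscore and contents.startswith("_") and len(contents) > 1:
        return contents[1:]
    return contents
```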
-
- Another class of scanner argument allows you to associate a suffix
- with a language. E.g. `-S.y=c' tells mkid(1) to use the C language
- scanner on all files ending with .y. You can ask mkid(1) for the
- available scanners and associated suffixes like so:
-
- $ mkid -S\?=\?
- .c=c, .h=c, .y=c, .s=asm, .p=pascal, .pas=pascal
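-
- The suffix table behaves like a simple mapping. A Python sketch, using
- the associations listed above as the starting table; the function names
- and the sample rule `-S.l=c' are made up for the illustration.
-
```python
import os

# Suffix-to-language associations as reported by the query above.
DEFAULT_SUFFIXES = {
    ".c": "c", ".h": "c", ".y": "c",
    ".s": "asm", ".p": "pascal", ".pas": "pascal",
}


def add_suffix_rule(table, arg):
    """Apply an argument of the form -S<suffix>=<lang> to a copy of table."""
    suffix, lang = arg[len("-S"):].split("=", 1)
    updated = dict(table)
    updated[suffix] = lang
    return updated


def language_for(table, filename):
    """Return the language associated with a file's suffix, or None."""
    return table.get(os.path.splitext(filename)[1])
```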
-
- Please note, mkid(1) is lying to you about its Pascal prowess!
- At the time of this posting, there are scanners for C and assembly
- language sources. There are also stubs for Pascal, Ada and LISP. The
- scanners are very fast. The assembly language scanner knows how
- to throw away C-style comments as well as the traditional `comment-
- character-until-end-of-line' style. In order to test new scanners,
- there is a scanner driver called idx(1). Idx(1) simply calls the
- scanner to get identifiers one-at-a-time and prints them on stdout one-per-line.
-
- For more information, read the manual pages!
-
- Happy Hacking,
- --
- -- Greg McGary
- -- P.O. Box 286
- -- Lincoln, MA 01773
- --
- -- 9/15/87
- --
- -- Until the end of 1987,
- -- Consulting to Sun's East Coast Division:
- -- gmcgary@ecd.sun.com
- -- gmcgary@suneast.uu.net
- --
- -- After that, probably consulting in Europe...
-