home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Crawly Crypt Collection 2
/
crawlyvol2.bin
/
apps
/
science
/
chempars
/
chempars.doc
next >
Wrap
Text File
|
1986-11-12
|
6KB
|
130 lines
CHEMPARSE - A Recursive Depth Parser for Organic Substructures
CHEMPARSE is a program which interprets chemical structures
and substructures you type by using a recursive descent parser. A
RDP parser is a good way to handle input that can be described
using a context-free grammar. A grammar in computer science is a
set of rules that can be used to tell whether a given input is
valid - sort of the same idea as English grammar, but more
rigorously defined. A context-free grammar is one where each
"phrase" (called a non-terminal) can be expanded the same way into
the final "words" (called terminals), regardless of its location.
Pascal and LISP are examples of computer languages that can
be descibed with context-free grammars. The rules for what you
can put inside a BEGIN..END block are the same, regardless of
whether it's part of a IF x THEN BEGIN..END statement, a WHILE x
DO BEGIN..END statement, or the main block of a program. English
is not a context-free grammar.
Bachus-Naur formalism is a good way to describe the rules of
a context-free grammar. For each non-terminal, a BNF sentence
describes how it can be expanded. Some examples from Pascal are:
IF-THEN-statement ::= "IF" logical-expression "THEN" statement.
statement ::= BEGIN-END-block | IF-THEN-statement | assignment-
statment | other-statments | empty-statement.
BEGIN-END-block ::= "BEGIN" statement { ";" statement } "END".
The symbols used in BNF are:
::= "is defined as"
| "or"
{ x } x may be present 0, 1, or more times
[ x ] x may be present 0 or 1 times but not more
The non-terminals are IF-THEN-statement, logical-expression,
statement, BEGIN-END-block, assignment-statement, other-
statements, and empty-statement. The terminals are IF, THEN,
BEGIN, END and ;. You might notice that BEGIN-END-block is used
to define STATEMENT and STATEMENT in turn is used to define BEGIN-
END-block. Although not done here, it's also possible to use a
non-terminal in its own definition. One example might be:
integer ::= digit [ integer ].
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" |
"9".
A RDP works by calling a separate parsing routine for each of
the BNF sentences. If non-terminals are found, their parsing
routines are called until eventually the system works its way down
to only terminals. RDP's are thus highly recursive (hence the
name), and are best written in languages such as Pascal and LISP
which handle recursion easily.
Context-free grammars and RDP can be used for more than just
describing computer languages. Chemical structures can also be
described in a linear notation using context-free grammar. This
linear notation has long been a convenient way for chemists to
describe a structure in text. Some examples are:
Ethanol (ethyl alcohol) CH3-CH2-OH
Ethyl acetate CH3-C(=O)-O-CH2-CH3
t-Butyl amine NH2-C(CH3)(CH3)CH3
Single bonds are represented with a dash, double bonds with an
equal sign. Branching is handled by placing the branch inside
parentheses.
We have been using this notation at the CASE (Computer-
Assisted Structure Elucidation) lab at the chemistry departement
of Arizona State University to enter structure and substructure
information (it's a lot cheaper than buying a graphics
workstation!). We have expanded the symbols used to allow for a
wider variety of structures and the limitations of ASCII
characters. Triple bonds are represented with a plus sign. Bonds
not explicitly shown may be any type (except bonds to hydrogen
atoms, which are always single and never shown). A decimal point
may be used to explicily show an undefined bond. Rings are shown
by "labeling" one atom and then connecting to the label from
another atom. Disjoint (in two or more fragments) substructures
are entered by separating each fragment with a semicolon. Some
more examples:
Acetonitile CH3-C+N
Cyclopropane 1:CH2-CH2-CH2-1
Benzene 1:CH=CH-CH=CH-CH=CH-1
Two ethyl groups CH3-CH2;CH3-CH2
Any nitrogen-oxygen bond NO or N.O
We are currently using a RDP for this notation with several
applications. The BNF is:
(Sub)structure ::= fragment [ ";" (sub)structure ].
fragment ::= atom-group [ branch ] [ bond fragment ].
branch ::= "(" bond fragment ")" [ branch ].
bond ::= [ "-" | "=" | "+" | "." ].
atom-group ::= digit | ( [ digit ":" ] atom [ "H" num-of-H ] ).
atom ::= "C" | "N" | "O" | "S" | "F" | "CL" | "BR" | "I" | "NO2" |
"A".
digit ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9".
num-of-H ::= { "0" | "1" | "2" | "3" }.
The CHEMPARS program reads a line of up to 60 characters,
sends it to the RDP routine, and produces a table listing the
non-hydrogen atoms, their connections, and possible atom types.
Entering "Q" quits the program. If you have a color monitor, be
sure to run it in medium resolution. The source code is written
in O.S.S. Personal Pascal. The procedures which parse each non-
terminal are:
Non-terminal Procedure
------------ ---------
(sub)structure P_SUBSTR
fragment P_FRAG
branch P_PAREN
bond P_BOND
atom-group P_ELG
atom P_ATOM
digit (handled in-line)
num-of-H P_NUMHS
Much of the above text is a result of discussions between
myself and Chris Chabris, who greatly clarified my muddy concepts
and pointed out the relationships between BNF, CFG, and RDP.
Thanks, Chris.
-- Brad Christie 76167,1461