REGEXP(3)              UNIX Programmer's Manual              REGEXP(3)

NAME
     regcomp, regexec, regsub, regerror - regular expression handlers

SYNOPSIS
     #include <regexp.h>

     regexp *
     regcomp(const char *exp)

     int
     regexec(const regexp *prog, const char *string)

     void
     regsub(const regexp *prog, const char *source, char *dest)

DESCRIPTION
     The regcomp(), regexec(), regsub(), and regerror() functions implement
     egrep(1)-style regular expressions and supporting facilities.

     The regcomp() function compiles a regular expression into a structure of
     type regexp, and returns a pointer to it. The space has been allocated
     using malloc(3) and may be released by free.

     The regexec() function matches a NUL-terminated string against the
     compiled regular expression in prog. It returns 1 for success and 0 for
     failure, and adjusts the contents of prog's startp and endp (see below)
     accordingly.

     The members of a regexp structure include at least the following (not
     necessarily in order):

           char *startp[NSUBEXP];
           char *endp[NSUBEXP];

     where NSUBEXP is defined (as 10) in the header file. Once a successful
     regexec() has been done using the regexp, each startp/endp pair
     describes one substring within the string, with the startp pointing to
     the first character of the substring and the endp pointing to the first
     character following the substring. The 0th substring is the substring of
     string that matched the whole regular expression. The others are those
     substrings that matched parenthesized expressions within the regular
     expression, with parenthesized expressions numbered in left-to-right
     order of their opening parentheses.
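
     As an illustration of the startp/endp convention, the following sketch
     copies substring n out of a matched structure. Because <regexp.h> may
     not be installed, it uses a stand-in type, regexp_sketch, holding only
     the two documented members; that type and the helper substring_copy()
     are hypothetical names, not part of the library. Treating a NULL startp
     or endp as "this substring took no part in the match" is likewise an
     assumption that this page does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Stand-in for the documented members of regexp (assumption: the real
 * structure in <regexp.h> has further, private members). */
typedef struct {
    char *startp[NSUBEXP];
    char *endp[NSUBEXP];
} regexp_sketch;

/*
 * Copy substring n into buf, truncating to buflen - 1 bytes, and
 * NUL-terminate it.  Returns the full length of the substring, or -1 if
 * n is out of range, buflen is 0, or the substring is unset (NULL).
 */
long
substring_copy(const regexp_sketch *prog, int n, char *buf, size_t buflen)
{
    size_t len, ncopy;

    assert(prog != NULL && buf != NULL);
    if (n < 0 || n >= NSUBEXP || buflen == 0)
        return -1;
    if (prog->startp[n] == NULL || prog->endp[n] == NULL)
        return -1;

    /* endp points one past the last character, so the difference is
     * the substring length. */
    len = (size_t)(prog->endp[n] - prog->startp[n]);
    ncopy = len < buflen - 1 ? len : buflen - 1;
    memcpy(buf, prog->startp[n], ncopy);
    buf[ncopy] = '\0';
    return (long)len;
}
```

     The copy is needed because the substrings lie inside the original
     string and are therefore not NUL-terminated in place.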

     The regsub() function copies source to dest, making substitutions
     according to the most recent regexec() performed using prog. Each
     instance of `&' in source is replaced by the substring indicated by
     startp[0] and endp[0]. Each instance of `\n', where n is a digit, is
     replaced by the substring indicated by startp[n] and endp[n]. To get a
     literal `&' or `\n' into dest, prefix it with `\'; to get a literal `\'
     preceding `&' or `\n', prefix it with another `\'.
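
     The substitution rules can be sketched as a single pass over source.
     This is not the library's code: regsub_sketch() and the stand-in type
     regexp_sketch are hypothetical names, and the destlen parameter is an
     addition for bounds checking that regsub() itself does not have.
     Skipping an unset (NULL) substring silently is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Stand-in for the documented members of regexp (assumption). */
typedef struct {
    char *startp[NSUBEXP];
    char *endp[NSUBEXP];
} regexp_sketch;

/*
 * Copy source to dest, expanding `&' and `\n' from prog's substrings.
 * Returns 0 on success, or -1 if dest (destlen bytes) is too small.
 */
int
regsub_sketch(const regexp_sketch *prog, const char *source,
    char *dest, size_t destlen)
{
    const char *src = source;
    size_t used = 0;            /* bytes written so far, excluding NUL */
    char c;

    assert(prog != NULL && source != NULL && dest != NULL);
    if (destlen == 0)
        return -1;

    while ((c = *src++) != '\0') {
        int no = -1;            /* substring number, or -1 for none */

        if (c == '&')
            no = 0;
        else if (c == '\\' && *src >= '0' && *src <= '9')
            no = *src++ - '0';

        if (no < 0) {
            /* Ordinary character; `\\' and `\&' yield the second one. */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            if (used + 1 >= destlen)
                return -1;
            dest[used++] = c;
        } else if (prog->startp[no] != NULL && prog->endp[no] != NULL) {
            size_t len = (size_t)(prog->endp[no] - prog->startp[no]);

            if (len >= destlen - used)
                return -1;
            memcpy(dest + used, prog->startp[no], len);
            used += len;
        }
    }
    dest[used] = '\0';
    return 0;
}
```

     With substring 0 set to `abc' and substring 1 set to `ab', the source
     `[&] <\1> \&' expands to `[abc] <ab> &'.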

     The regerror() function is called whenever an error is detected in
     regcomp(), regexec(), or regsub(). The default regerror() writes the
     string msg, with a suitable indicator of origin, on the standard error
     output and invokes exit(2). The regerror() function can be replaced by
     the user if other actions are desirable.

REGULAR EXPRESSION SYNTAX
     A regular expression is zero or more branches, separated by `|'. It
     matches anything that matches one of the branches.

     A branch is zero or more pieces, concatenated. It matches a match for
     the first, followed by a match for the second, etc.

     A piece is an atom possibly followed by `*', `+', or `?'. An atom
     followed by `*' matches a sequence of 0 or more matches of the atom. An
     atom followed by `+' matches a sequence of 1 or more matches of the atom.
     An atom followed by `?' matches a match of the atom, or the null string.

     An atom is a regular expression in parentheses (matching a match for the
     regular expression), a range (see below), `.' (matching any single
     character), `^' (matching the null string at the beginning of the input
     string), `$' (matching the null string at the end of the input string), a
     `\' followed by a single character (matching that character), or a single
     character with no other significance (matching that character).

     A range is a sequence of characters enclosed in `[]'. It normally
     matches any single character from the sequence. If the sequence begins
     with `^', it matches any single character not from the rest of the
     sequence. If two characters in the sequence are separated by `-', this
     is shorthand for the full list of ASCII characters between them (e.g.
     `[0-9]' matches any decimal digit). To include a literal `]' in the
     sequence, make it the first character (following a possible `^'). To
     include a literal `-', make it the first or last character.

AMBIGUITY
     If a regular expression could match two different parts of the input
     string, it will match the one which begins earliest. If both begin in
     the same place but match different lengths, or match the same length in
     different ways, life gets messier, as follows.

     In general, the possibilities in a list of branches are considered in
     left-to-right order, the possibilities for `*', `+', and `?' are
     considered longest-first, nested constructs are considered from the
     outermost in, and concatenated constructs are considered leftmost-first.
     The match that will be chosen is the one that uses the earliest
     possibility in the first choice that has to be made. If there is more
     than one choice, the next will be made in the same manner (earliest
     possibility) subject to the decision on the first choice. And so forth.

     For example, `(ab|a)b*c' could match `abc' in one of two ways. The first
     choice is between `ab' and `a'; since `ab' is earlier, and does lead to a
     successful overall match, it is chosen. Since the `b' is already spoken
     for, the `b*' must match its last possibility (the empty string), since
     it must respect the earlier choice.
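
     This example can be tried with the POSIX regcomp(3)/regexec(3) interface
     from <regex.h> (see regex(3)), which is a different library from the one
     this page describes even though two function names coincide. POSIX
     resolves ambiguity by a leftmost-longest rule rather than the
     ordered-choice rule above, so agreement on this one pattern is not
     evidence that the two agree in general. The helper first_group() is a
     hypothetical name introduced for this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Match the POSIX extended regular expression `pattern' against `text'
 * and copy parenthesized substring 1 into buf, NUL-terminated.
 * Returns the length of substring 1, or -1 on a compile failure, no
 * match, an unset substring, or a buffer that is too small.
 */
long
first_group(const char *pattern, const char *text, char *buf, size_t buflen)
{
    regex_t re;
    regmatch_t m[2];    /* m[0]: whole match, m[1]: first group */
    long len = -1;

    assert(pattern != NULL && text != NULL && buf != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so != -1) {
        /* rm_so and rm_eo are byte offsets into text. */
        size_t n = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (n < buflen) {
            memcpy(buf, text + m[1].rm_so, n);
            buf[n] = '\0';
            len = (long)n;
        }
    }
    regfree(&re);
    return len;
}
```

     Called as first_group("(ab|a)b*c", "abc", buf, sizeof buf), this is
     expected to leave `ab' in buf, which is the same choice as the one
     described above: the parenthesized branch takes `ab' and `b*' is left
     with the empty string.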

     In the particular case where no `|'s are present and there is only one
     `*', `+', or `?', the net effect is that the longest possible match will
     be chosen. So `ab*', presented with `xabbbby', will match `abbbb'. Note
     that if `ab*' is tried against `xabyabbbz', it will match `ab' just
     after `x', due to the begins-earliest rule. (In effect, the decision on
     where to start the match is the first choice to be made, hence subsequent
     choices must respect it even if this leads them to less-preferred
     alternatives.)
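
     The interaction of the begins-earliest rule with longest-first
     repetition can be seen in a minimal backtracking matcher. The sketch
     below handles only literal characters, `.', and `*' applied to a single
     character; it is not the library's implementation, and match_here(),
     match_star(), and find_match() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

const char *match_here(const char *pat, const char *text);

/*
 * Match c* followed by pat at text, longest-first: consume as many
 * copies of c as possible, then back off one at a time until the rest
 * of the pattern matches.  Returns the end of the match, or NULL.
 */
const char *
match_star(int c, const char *pat, const char *text)
{
    const char *t = text;

    while (*t != '\0' && (c == '.' || *t == c))
        t++;
    for (;;) {
        const char *end = match_here(pat, t);

        if (end != NULL)
            return end;
        if (t == text)
            return NULL;
        t--;
    }
}

/* Match pat anchored at text.  Returns the end of the match, or NULL. */
const char *
match_here(const char *pat, const char *text)
{
    if (pat[0] == '\0')
        return text;
    if (pat[1] == '*')
        return match_star(pat[0], pat + 2, text);
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return match_here(pat + 1, text + 1);
    return NULL;
}

/*
 * Begins-earliest: try each starting position from left to right and
 * accept the first one at which the pattern matches.  On success returns
 * 1 and sets *start and *end, using the same convention as startp and
 * endp; otherwise returns 0.
 */
int
find_match(const char *pat, const char *text,
    const char **start, const char **end)
{
    const char *s = text;

    assert(pat != NULL && text != NULL && start != NULL && end != NULL);
    for (;;) {
        const char *e = match_here(pat, s);

        if (e != NULL) {
            *start = s;
            *end = e;
            return 1;
        }
        if (*s == '\0')
            return 0;
        s++;
    }
}
```

     The starting position is fixed by the outer loop before any repetition
     is tried, which is why `ab*' against `xabyabbbz' stops at the first
     `ab' instead of continuing to the longer `abbb' further right.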

RETURN VALUES
     The regcomp() function returns NULL for a failure (regerror()
     permitting), where failures are syntax errors, exceeding implementation
     limits, or applying `+' or `*' to a possibly-null operand.

SEE ALSO
     ed(1), ex(1), expr(1), egrep(1), fgrep(1), grep(1), regex(3)

HISTORY
     Both code and manual page for regcomp(), regexec(), regsub(), and
     regerror() were written at the University of Toronto and appeared in
     4.3BSD-Tahoe. They are intended to be compatible with the Bell V8
     regexp(3), but are not derived from Bell code.

BUGS
     Empty branches and empty regular expressions are not portable to V8.

     The restriction against applying `*' or `+' to a possibly-null operand
     is an artifact of the simplistic implementation.

     Does not support egrep's newline-separated branches; neither does the V8
     regexp(3), though.

     Due to emphasis on compactness and simplicity, it's not strikingly fast.
     It does give special attention to handling simple cases quickly.

BSD Experimental                  April 19, 1991