PERLRE(1) User Contributed Perl Documentation PERLRE(1)
NAME
perlre - Perl regular expressions
DESCRIPTION
This page describes the syntax of regular expressions in
Perl. For a description of how to actually use regular
expressions in matching operations, plus various examples
of the same, see m// and s/// in the perlop manpage.
The matching operations can have various modifiers, some
of which relate to the interpretation of the regular
expression inside. These are:
    i   Do case-insensitive pattern matching.
    m   Treat string as multiple lines.
    s   Treat string as single line.
    x   Extend your pattern's legibility with whitespace and comments.
These are usually written as "the /x modifier", even
though the delimiter in question might not actually be a
slash. In fact, any of these modifiers may also be
embedded within the regular expression itself using the
new (?...) construct. See below.
The /x modifier itself needs a little more explanation.
It tells the regular expression parser to ignore
whitespace that is not backslashed or within a character
class. You can use this to break up your regular
expression into (slightly) more readable parts. The #
character is also treated as a metacharacter introducing a
comment, just as in ordinary Perl code. Taken together,
these features go a long way towards making Perl 5 a
readable language. See the C comment deletion code in the
perlop manpage.
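As a minimal sketch of the /x behavior just described, the pattern below is spread over several lines with a comment on each part. The input string and the variable names are illustrative assumptions, not taken from the manual. Note that \s is used for the whitespace after the label, because a raw space would be ignored under /x.

```perl
use strict;
use warnings;

# Hypothetical input line, chosen only for illustration.
my $line = "Time: 12:34:56";

my ($hours, $minutes, $seconds);
if ( $line =~ /
        Time: \s+     # the label, then whitespace
        (\d\d) :      # hours
        (\d\d) :      # minutes
        (\d\d)        # seconds
     /x ) {
    ($hours, $minutes, $seconds) = ($1, $2, $3);
}
print "$hours $minutes $seconds\n";
```

Keep the pattern delimiter out of the comments: Perl finds the end of the pattern before it looks at the comments, so a slash inside one would end the pattern early.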
Regular Expressions
The patterns used in pattern matching are regular
expressions such as those supplied in the Version 8 regexp
routines. (In fact, the routines are derived (distantly)
from Henry Spencer's freely redistributable
reimplementation of the V8 routines.) See the section on
Version 8 Regular Expressions for details.
In particular the following metacharacters have their
standard egrep-ish meanings:
    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class
13/Feb/96 perl 5.002 with 1
By default, the "^" character is guaranteed to match only
at the beginning of the string, the "$" character only at
the end (or before the newline at the end) and Perl does
certain optimizations with the assumption that the string
contains only one line. Embedded newlines will not be
matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will
match after any newline within the string, and "$" will
match before any newline. At the cost of a little more
overhead, you can do this by using the /m modifier on the
pattern match operator. (Older programs did this by
setting $*, but this practice is deprecated in Perl 5.)
To facilitate multi-line substitutions, the "." character
never matches a newline unless you use the /s modifier,
which tells Perl to pretend the string is a single
line--even if it isn't. The /s modifier also overrides
the setting of $*, in case you have some (badly behaved)
older code that sets it in another module.
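The effect of /m and /s can be sketched with a two-line string. The string and variable names here are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = ( $text =~ /^(\w+)/g );     # one element: "first"

# With /m, ^ also matches after the embedded newline.
my @multi = ( $text =~ /^(\w+)/mg );    # two elements: "first", "second"

# Without /s, "." cannot cross the newline, so this does not match.
my ($no_s)   = $text =~ /(first.*second)/;

# With /s, "." matches the newline too.
my ($with_s) = $text =~ /(first.*second)/s;

print "plain: @plain\n";
print "multi: @multi\n";
print "without /s: ", ( defined $no_s   ? "match" : "no match" ), "\n";
print "with /s: ",    ( defined $with_s ? "match" : "no match" ), "\n";
```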
The following standard quantifiers are recognized:
    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is
treated as a regular character.) The "*" modifier is
equivalent to {0,}, the "+" modifier to {1,}, and the "?"
modifier to {0,1}. n and m are limited to integral values
less than 65536.
By default, a quantified subpattern is "greedy", that is,
it will match as many times as possible (given a particular
starting location) without causing the rest of the pattern
not to match. If you want it to match the minimum number of
times possible, follow the quantifier with a "?". Note that
the meanings don't change, just the "gravity":
    *?      Match 0 or more times
    +?      Match 1 or more times
    ??      Match 0 or 1 time
    {n}?    Match exactly n times
    {n,}?   Match at least n times
    {n,m}?  Match at least n but not more than m times
Since patterns are processed as double quoted strings, the
following also work:
    \t      tab
    \n      newline
    \r      return
    \f      form feed
    \a      alarm (bell)
    \e      escape (think troff)
    \033    octal char (think of a PDP-11)
    \x1B    hex char
    \c[     control char
    \l      lowercase next char (think vi)
    \u      uppercase next char (think vi)
    \L      lowercase till \E (think vi)
    \U      uppercase till \E (think vi)
    \E      end case modification (think vi)
    \Q      quote regexp metacharacters till \E
In addition, Perl defines the following:
    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-word character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
Note that \w matches a single alphanumeric character, not
a whole word. To match a word you'd need to say \w+. You
may use \w, \W, \s, \S, \d and \D within character classes
(though not as either end of a range).
Perl defines the following zero-width assertions:
    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string (or before newline at the end)
    \G  Match only where previous m//g left off
A word boundary (\b) is defined as a spot between two
characters that has a \w on one side of it and a \W on
the other side of it (in either order), counting the
imaginary characters off the beginning and end of the
string as matching a \W. (Within character classes \b
represents backspace rather than a word boundary.) The \A
and \Z are just like "^" and "$" except that they won't
match multiple times when the /m modifier is used, while
"^" and "$" will match at every internal line boundary.
To match the actual end of the string, not ignoring
newline, you can use \Z(?!\n).
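The differences between these assertions can be illustrated with a short sketch. The strings and variable names are assumptions made for the example, and each result is stored as 1 or 0 so it is easy to inspect.

```perl
use strict;
use warnings;

my $text = "foo\nbar\n";

# Under /m, ^ and $ match at every line boundary.
my @each_line = ( $text =~ /^(\w+)$/mg );     # "foo", "bar"

# \A still anchors to the start of the whole string, even under /m.
my @first_only = ( $text =~ /\A(\w+)$/mg );   # "foo"

# \Z tolerates one trailing newline; \Z(?!\n) does not.
my $lenient = ( $text =~ /bar\Z/       ) ? 1 : 0;   # 1
my $strict  = ( $text =~ /bar\Z(?!\n)/ ) ? 1 : 0;   # 0

# \b needs a \w on one side and a \W (or a string edge) on the other.
my $whole_word = ( "a cat sat" =~ /\bcat\b/ ) ? 1 : 0;   # 1
my $inside     = ( "concat"    =~ /\bcat\b/ ) ? 1 : 0;   # 0

print "each line: @each_line\n";
print "first only: @first_only\n";
print "lenient=$lenient strict=$strict\n";
print "whole_word=$whole_word inside=$inside\n";
```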
When the bracketing construct ( ... ) is used, \<digit>
matches the digit'th substring. Outside of the pattern,
always use "$" instead of "\" in front of the digit. (The
\<digit> notation can on rare occasion work outside the
current pattern, but this should not be relied upon. See the
WARNING below.) The scope of $<digit> (and $`, $&, and $')
extends to the end of the enclosing BLOCK or eval string,
or to the next successful pattern match, whichever comes
first. If you want to use parentheses to delimit a
subpattern (e.g. a set of alternatives) without saving it
as a subpattern, follow the ( with ?: as in the (?:regexp)
extension described below.
You may have as many parentheses as you wish. If you have
more than 9 substrings, the variables $10, $11, ... refer
to the corresponding substring. Within the pattern, \10,
\11, etc. refer back to substrings if there have been at
least that many left parens before the backreference.
Otherwise (for backward compatibility) \10 is the same as
\010, a backspace, and \11 the same as \011, a tab. And
so on. (\1 through \9 are always backreferences.)
$+ returns whatever the last bracket match matched. $&
returns the entire matched string. ($0 used to return the
same thing, but not any more.) $` returns everything
before the matched string. $' returns everything after
the matched string. Examples:
    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
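To make the four special variables concrete, here is a small sketch that copies each one into a named variable right after a successful match. The sample string and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $string = "The quick brown fox";

my ($before, $matched, $after, $last_group);
if ( $string =~ /(quick) (brown)/ ) {
    $before     = $`;    # text before the match
    $matched    = $&;    # the entire matched text
    $after      = $';    # text after the match
    $last_group = $+;    # whatever the last bracket matched
}
print "<$before> <$matched> <$after> <$last_group>\n";
```

Copying the values immediately is a cautious habit, since the next successful match replaces them.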
You will note that all backslashed metacharacters in Perl
are alphanumeric, such as \b, \w, \n. Unlike some other
regular expression languages, there are no backslashed
symbols that aren't alphanumeric. So anything that looks
like \\, \(, \), \<, \>, \{, or \} is always interpreted
as a literal character, not a metacharacter. This makes
it simple to quote a string that you want to use for a
pattern but that you are afraid might contain
metacharacters. Simply quote all the non-alphanumeric
characters:
    $pattern =~ s/(\W)/\\$1/g;
You can also use the built-in quotemeta() function to do
this. An even easier way to quote metacharacters right in
the match operator is to say
    /$unquoted\Q$quoted\E$unquoted/
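The three quoting approaches can be compared side by side. In this sketch the literal text, the target strings and the variable names are all illustrative assumptions.

```perl
use strict;
use warnings;

my $literal = "a.b*c";            # text that contains metacharacters
my $target  = "xx a.b*c yy";      # contains the literal text
my $decoy   = "xx aXbbbc yy";     # matches only if "." and "*" are live

# Quote by hand and with the built-in function; both give a\.b\*c
(my $by_hand = $literal) =~ s/(\W)/\\$1/g;
my $by_func  = quotemeta($literal);

# Quote inline with \Q ... \E.
my $target_quoted = ( $target =~ /\Q$literal\E/ ) ? 1 : 0;   # 1
my $decoy_quoted  = ( $decoy  =~ /\Q$literal\E/ ) ? 1 : 0;   # 0

# Unquoted, the same text behaves as a pattern and matches the decoy.
my $decoy_raw     = ( $decoy  =~ /$literal/ )     ? 1 : 0;   # 1

print "$by_hand $by_func $target_quoted $decoy_quoted $decoy_raw\n";
```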
Perl 5 defines a consistent extension syntax for regular
expressions. The syntax is a pair of parens with a
question mark as the first thing within the parens (this
was a syntax error in Perl 4). The character after the
question mark gives the function of the extension.
Several extensions are already supported:
(?#text) A comment. The text is ignored. If the /x
switch is used to enable whitespace formatting,
a simple # will suffice.
(?:regexp)
This groups things like "()" but doesn't make
backreferences like "()" does. So
    split(/\b(?:a|b|c)\b/)
is like
    split(/\b(a|b|c)\b/)
but doesn't spit out extra fields.
(?=regexp)
A zero-width positive lookahead assertion. For
example, /\w+(?=\t)/ matches a word followed by
a tab, without including the tab in $&.
(?!regexp)
A zero-width negative lookahead assertion. For
example /foo(?!bar)/ matches any occurrence of
"foo" that isn't followed by "bar". Note
however that lookahead and lookbehind are NOT
the same thing. You cannot use this for
lookbehind: /(?!foo)bar/ will not find an
occurrence of "bar" that is preceded by
something which is not "foo". That's because
the (?!foo) is just saying that the next thing
cannot be "foo"--and it's not, it's a "bar", so
"foobar" will match. You would have to do
something like /(?!foo)...bar/ for that. We say
"like" because there's the case of your "bar"
not having three characters before it. You
could cover that this way:
/(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still
easier just to say:
    if (/bar/ && $` !~ /foo$/)
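The lookahead pitfall is easier to see when each pattern is run against a string with "foo" before "bar" and one without. This is a minimal sketch; the test strings and variable names are assumptions made for the example, and each result is stored as 1 or 0.

```perl
use strict;
use warnings;

# Negative lookahead: "foo" not followed by "bar".
my $ahead_ok  = ( "foobaz" =~ /foo(?!bar)/ ) ? 1 : 0;   # 1
my $ahead_no  = ( "foobar" =~ /foo(?!bar)/ ) ? 1 : 0;   # 0

# Not a lookbehind: at the "b" of "bar" the upcoming text is not "foo",
# so the assertion holds and "foobar" matches anyway.
my $not_behind = ( "foobar" =~ /(?!foo)bar/ ) ? 1 : 0;  # 1

# Looking at the three characters in front of "bar" rejects "foobar".
my $three_foo = ( "foobar" =~ /(?!foo)...bar/ ) ? 1 : 0;   # 0
my $three_baz = ( "bazbar" =~ /(?!foo)...bar/ ) ? 1 : 0;   # 1

# Two-step test: match "bar", then inspect the prematch in $`.
my $two_foo = ( "foobar" =~ /bar/ && $` !~ /foo$/ ) ? 1 : 0;   # 0
my $two_baz = ( "bazbar" =~ /bar/ && $` !~ /foo$/ ) ? 1 : 0;   # 1

print "$ahead_ok $ahead_no $not_behind $three_foo $three_baz $two_foo $two_baz\n";
```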
(?imsx) One or more embedded pattern-match modifiers.
This is particularly useful for patterns that
are specified in a table somewhere, some of
which want to be case sensitive, and some of
which don't. The case insensitive ones merely
need to include (?i) at the front of the
pattern. For example:
    $pattern = "foobar";
    if ( /$pattern/i )

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ )
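The "table of patterns" use case might look like the following sketch. The table contents, the input string and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pattern table: only the second entry ignores case.
my @patterns = ( 'foobar', '(?i)foobar' );
my $input    = "FooBar";

my %matched;
for my $pattern (@patterns) {
    # One match statement serves every row; there is no /i on the operator.
    $matched{$pattern} = ( $input =~ /$pattern/ ) ? 1 : 0;
}

for my $pattern (@patterns) {
    print "$pattern: $matched{$pattern}\n";
}
```

Because the modifier travels with the pattern text, the loop does not need to know which rows are case insensitive.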
The specific choice of question mark for this and the new
minimal matching construct was because 1) question mark is
pretty rare in older regular expressions, and 2) whenever
you see one, you should stop and "question" exactly what
is going on. That's psychology...
Backtracking
A fundamental feature of regular expression matching
involves the notion called backtracking, which is used
(when needed) by all regular expression quantifiers,
namely *, *?, +, +?, {n,m}, and {n,m}?.
For a regular expression to match, the entire regular
expression must match, not just part of it. So if the
beginning of a pattern containing a quantifier succeeds in
a way that causes later parts in the pattern to fail, the
matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.
Here is an example of backtracking: Let's say you want to
find the word following "foo" in the string "Food is on
the foo table.":
    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }
When the match runs, the first part of the regular
expression (\b(foo)) finds a possible match right at the
beginning of the string, and loads up $1 with "Foo".
However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1,
it realizes its mistake and starts over again one
character after where it had made the tentative match.
This time it goes all the way until the next occurrence of
"foo". The complete regular expression matches this time,
and you get the expected output of "table follows foo."
Sometimes minimal matching can help a lot. Imagine you'd
like to match everything between "foo" and "bar".
Initially, you write something like this:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }
Which perhaps unexpectedly yields:
    got <d is under the bar in the >
That's because .* was greedy, so you get everything
between the first "foo" and the last "bar". In this case,
it's more effective to use minimal matching to make sure
you get the text between a "foo" and the first "bar"
thereafter.
    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
    got <d is under the >
Here's another example: let's say you'd like to match a
number at the end of a string, and you also want to keep
the preceding part of the match. So you write this:
    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }
That won't work at all, because .* was greedy and gobbled
up the whole string. As \d* can match on an empty string
the complete regular expression matched successfully.
    Beginning is <I have 2 numbers: 53147>, number is <>.
Here are some variants, most of which don't work:
    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };
    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }
That will print out:
    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>
As you see, this can be a bit tricky. It's important to
realize that a regular expression is merely a set of
assertions that gives a definition of success. There may
be 0, 1, or several different ways that the definition
might succeed against a particular string. And if there
are multiple ways it might succeed, you need to understand
backtracking in order to know which variety of success you
will achieve.
When using lookahead assertions and negations, this can
all get even trickier. Imagine you'd like to find a
sequence of nondigits not followed by "123". You might
try to write that as
    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }
But that isn't going to match; at least, not the way
you're hoping. It claims that there is no 123 in the
string. Here's a clearer picture of why that pattern
matches, contrary to popular expectations:
    $x = 'ABC123' ;
    $y = 'ABC445' ;
    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
This prints
    2: got ABC
    3: got AB
    4: got ABC
You might have expected test 3 to fail because it just
seems to be a more general purpose version of test 1. The
important difference between them is that test 3 contains
a quantifier (\D*) and so can use backtracking, whereas
test 1 will not. What's happening is that you've asked
"Is it true that at the start of $x, following 0 or more
nondigits, you have something that's not 123?" If the
pattern matcher had let \D* expand to "ABC", this would
have caused the whole pattern to fail. The search engine
will initially match \D* with "ABC". Then it will try to
match (?!123) with "123" which, of course, fails. But
because a quantifier (\D*) has been used in the regular
expression, the search engine can backtrack and retry the
match differently in the hope of matching the complete
regular expression.
Well now, the pattern really, really wants to succeed, so
it uses the standard regexp backoff-and-retry and lets \D*
expand to just "AB" this time. Now there's indeed
something following "AB" that is not "123". It's in fact
"C123", which suffices.
We can deal with this by using both an assertion and a
negation. We'll say that the first part in $1 must be
followed by a digit, and in fact, it must also be followed
by something that's not "123". Remember that the
lookaheads are zero-width expressions--they only look, but
don't consume any of the string in their match. So
rewriting this way produces what you'd expect; that is,
case 5 will fail, but case 6 succeeds:
    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
    6: got ABC
In other words, the two zero-width assertions next to each
other work like they're ANDed together, just as you'd use
any builtin assertions: /^$/ matches only if you're at
the beginning of the line AND the end of the line
simultaneously. The deeper underlying truth is that
juxtaposition in regular expressions always means AND,
except when you write an explicit OR using the vertical
bar. /ab/ means match "a" AND (then) match "b", although
the attempted matches are made at different positions
because "a" is not a zero-width assertion, but a one-width
assertion.
One warning: particularly complicated regular expressions
can take exponential time to solve due to the immense
number of possible ways they can use backtracking to try
to match. For example, this will take a very long time to run:
    /((a{0,5}){0,5}){0,5}/
And if you used *'s instead of limiting it to 0 through 5
matches, then it would take literally forever--or until
you ran out of stack space.
Version 8 Regular Expressions
In case you're not familiar with the "regular" Version 8
regexp routines, here are the pattern-matching rules not
described above.
Any single character matches itself, unless it is a
metacharacter with a special meaning described here or
above. You can cause characters which normally function
as metacharacters to be interpreted literally by prefixing
them with a "\" (e.g. "\." matches a ".", not any
character; "\\" matches a "\"). A series of characters
matches that series of characters in the target string, so
the pattern blurfl would match "blurfl" in the target
string.
You can specify a character class, by enclosing a list of
characters in [], which will match any one of the
characters in the list. If the first character after the
"[" is "^", the class matches any character not in the
list. Within a list, the "-" character is used to specify
a range, so that a-z represents all the characters between
"a" and "z", inclusive.
Characters may be specified using a metacharacter syntax
much like that used in C: "\n" matches a newline, "\t" a
tab, "\r" a carriage return, "\f" a form feed, etc. More
generally, \nnn, where nnn is a string of octal digits,
matches the character whose ASCII value is nnn.
Similarly, \xnn, where nn are hexadecimal digits, matches
the character whose ASCII value is nn. The expression \cx
matches the ASCII character control-x. Finally, the "."
metacharacter matches any character except "\n" (unless
you use /s).
You can specify a series of alternatives for a pattern
using "|" to separate them, so that fee|fie|foe will match
any of "fee", "fie", or "foe" in the target string (as
would f(e|i|o)e). Note that the first alternative
includes everything from the last pattern delimiter ("(",
"[", or the beginning of the pattern) up to the first "|",
and the last alternative contains everything from the last
"|" to the next pattern delimiter. For this reason, it's
common practice to include alternatives in parentheses, to
minimize confusion about where they start and end. Note
however that "|" is interpreted as a literal within square
brackets, so if you write [fee|fie|foe] you're really only
matching [feio|].
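The scope of an alternative, and the literal "|" inside a character class, can be checked with a few matches. This is a sketch with made-up words and variable names; the (?:...) grouping is the non-capturing form described earlier in this page.

```perl
use strict;
use warnings;

my @words = qw(fee fie foe fum);

# Spelled-out and factored alternations accept the same three words.
my @spelled  = grep { /^(?:fee|fie|foe)$/ } @words;
my @factored = grep { /^f(e|i|o)e$/ }       @words;

# Ungrouped, ^ belongs only to "fee" and $ only to "foe", so a bare
# "fie" anywhere in the string is enough for a match.
my $ungrouped = ( "fiesta" =~ /^fee|fie|foe$/     ) ? 1 : 0;   # 1
my $grouped   = ( "fiesta" =~ /^(?:fee|fie|foe)$/ ) ? 1 : 0;   # 0

# Inside square brackets "|" is literal and the class matches one character.
my $pipe_char  = ( "|"   =~ /^[fee|fie|foe]$/ ) ? 1 : 0;   # 1
my $whole_word = ( "fee" =~ /^[fee|fie|foe]$/ ) ? 1 : 0;   # 0

print "spelled: @spelled\n";
print "factored: @factored\n";
print "$ungrouped $grouped $pipe_char $whole_word\n";
```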
Within a pattern, you may designate subpatterns for later
reference by enclosing them in parentheses, and you may
refer back to the nth subpattern later in the pattern
using the metacharacter \n (that is, a backslash followed
by the number n). Subpatterns are numbered
based on the left to right order of their opening
parenthesis. Note that a backreference matches whatever
actually matched the subpattern in the string being
examined, not the rules for that subpattern. Therefore,
(0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not
"0x1234 01234", since subpattern 1 actually matched "0x",
even though the rule 0|0x could potentially match the
leading 0 in the second number.
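The point that a backreference repeats the matched text, not the rule, can be tried directly. The first pattern is the one from the paragraph above. The third test string, the doubled-word pattern and all variable names are assumptions added for illustration.

```perl
use strict;
use warnings;

# Single quotes keep the backslashes intact until the pattern is compiled.
my $pattern = '(0|0x)\d*\s\1\d*';

my $same_prefix  = ( "0x1234 0x4321" =~ /$pattern/ ) ? 1 : 0;   # 1
my $mixed_prefix = ( "0x1234 01234"  =~ /$pattern/ ) ? 1 : 0;   # 0
my $plain_zero   = ( "01234 04321"   =~ /$pattern/ ) ? 1 : 0;   # 1

# A common use: find a word that is immediately repeated.
my ($doubled) = "this is is a test" =~ /\b(\w+)\s+\1\b/;

print "$same_prefix $mixed_prefix $plain_zero $doubled\n";
```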
WARNING on \1 vs $1
Some people get too used to writing things like
    $pattern =~ s/(\W)/\\\1/g;
This is grandfathered for the RHS of a substitute to avoid
shocking the sed addicts, but it's a dirty habit to get
into. That's because in PerlThink, the right-hand side of
a s/// is a double-quoted string. \1 in the usual
double-quoted string means a control-A. The customary Unix
meaning of \1 is kludged in for s///. However, if you get
into the habit of doing that, you get yourself into
trouble if you then add an /e modifier.
    s/(\d+)/ \1 + 1 /eg;
Or if you try to do
    s/(\d+)/\1000/;
You can't disambiguate that by saying \{1}000, whereas you
can fix it with ${1}000. Basically, the operation of
interpolation should not be confused with the operation of
matching a backreference. Certainly they mean two
different things on the left side of the s///.
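For contrast with the two problem cases above, here is a sketch of the same substitutions written with $1 on the right-hand side. The input strings and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# With /e the right-hand side is Perl code, so $1 is the natural spelling.
(my $incremented = "7 8 9") =~ s/(\d+)/ $1 + 1 /eg;

# ${1}000 keeps the group number separate from the digits that follow it.
(my $scaled = "42") =~ s/(\d+)/${1}000/;

print "$incremented\n";   # prints "8 9 10"
print "$scaled\n";        # prints "42000"
```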