htmlsrpl version 1.11, January 22 1995
Name:
htmlsrpl.pl - HTML-aware search-and-replace program that uses
either literal strings or regular expressions. It acts either
only outside HTML/SGML tags or only within tags; it can be
restricted to operate only within and/or only outside
specified elements; it can also upper-case tag names. Runs
under Perl.
Typical use:
perl htmlsrpl.pl [options] infile.html > outfile.html
Command-line options have the form "option=value" (with no whitespace
on either side of the `=' character), and all options must precede
filename arguments on the command line.
Basic command-line options:
old="..." String or expression to be replaced. Must be defined and
non-null (unless the upcase=1 option is specified).
new="..." The new replacement string or expression. If new= is
absent or null, the old="..." string is deleted.
intags=1 If this option is specified on the command line, strings
within tags are changed, but not text outside of tags. (The
default action, if this option is absent, is to only replace
text outside of tags.)
Element inclusion/exclusion command-line options:
inside=... The value of this option is a tagname or a comma-separated
list of tagnames (e.g. inside=A or inside=b,i). Search and
replace operations will only take place in material that is
contained within all the specified elements. So if inside=b,i
has been specified on the command line, only "Text3" in the
following input file would be subject to search and replace:
"Text1<B>Text2<I>Text3</I></B>". The order of inclusion makes
no difference (so that <B> nested inside <I> would be treated
exactly the same as <I> nested inside <B>).
outside=... Search and replace will only take place outside the tag or
(comma-separated) list of tags specified with this option. So
if outside=b,i is specified, nothing contained within a
<B>...</B> or <I>...</I> element will be subject to search and
replace.
inmost=... The same as inside=, except that search and replace only
occurs _immediately_ within the element specified (i.e.
inmost=b would mean that only "Text2" would be subject to
search and replace in "Text1<B>Text2<I>Text3</I></B>").
If more than one of these options is specified, search-and-replace only
takes place when all the conditions specified in the options are satisfied.
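How the three options combine can be sketched in Python. This is only an illustration, not code from htmlsrpl.pl: the function name eligible and its arguments are made up for the example, the list of enclosing elements is assumed to be known already, and inmost= is assumed to take a single tag name.

```python
def eligible(enclosing, inside=(), outside=(), inmost=None):
    """Return True if text with these enclosing elements may be changed.

    enclosing: names of the currently open elements, outermost first.
    inside, outside: tag names as given on the command line.
    inmost: a single tag name (an assumption of this sketch), or None.
    """
    open_names = [name.upper() for name in enclosing]
    # inside=: every listed element must be open, in any nesting order.
    if any(tag.upper() not in open_names for tag in inside):
        return False
    # outside=: none of the listed elements may be open.
    if any(tag.upper() in open_names for tag in outside):
        return False
    # inmost=: the named element must be the innermost open element.
    if inmost is not None:
        if not open_names or open_names[-1] != inmost.upper():
            return False
    return True
```

For "Text1<B>Text2<I>Text3</I></B>", eligible(["B", "I"], inside=("b", "i")) is true only for Text3, while eligible(["B"], inmost="b") is true only for Text2.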
This program uses a rather simple-minded algorithm for determining what
is contained within an element. There is a small list of known non-pairing
tags (such as <IMG>, <BR>, etc.). When any opening tag not on this list is
encountered, it is pushed onto a stack of presently-containing elements.
When any closing tag is encountered, the most-recently occurring matching
tagname is removed from the stack, along with everything above it in the
stack (if no matching opening tag has been encountered, htmlsrpl.pl exits
with an error -- use the htmlchek program in this package to help find the
HTML error). This means, for example, that a <P> element unclosed by a </P>
will often be considered to extend much farther than it should according
to the HTML DTD; also, in a list such as "<DL><DT>Text1<DD>Text2</DL>",
"Text2" is actually considered to be contained within a <DT> element.
Note that when the inside=, inmost=, or outside= options are used
together with the intags=1 option, a tag is never considered to be
contained within the element which it itself delimits (i.e. the inclusion
and exclusion relationships established by a tag come into force at the end
of the tag if it is an opening tag, and at the beginning of the tag if it
is a closing tag). Also, inclusions and exclusions are always calculated
from the unprocessed input, before any search and replace has taken place.
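The stack discipline described above can be sketched in Python. This is an illustration under stated assumptions, not the code of htmlsrpl.pl: the function name containment is made up, only the two non-pairing tags named in the text (IMG and BR) are listed, tags beginning with `!' are skipped as an assumption, and loose `>' characters in text are simply ignored.

```python
import re

# Only the two non-pairing tags named in the text; the real list is longer.
NON_PAIRING = {"IMG", "BR"}

def containment(html):
    """Return (text, stack) pairs; stack lists open elements, outermost first."""
    stack = []
    result = []
    for tag, text in re.findall(r"<([^<>]*)>|([^<>]+)", html):
        if text:
            result.append((text, list(stack)))
            continue
        words = tag.split()
        if not words:
            continue  # an empty "<>" tag
        name = words[0].upper()
        if name.startswith("!"):
            continue  # assumption: comments and declarations are skipped
        if name.startswith("/"):
            name = name[1:]
            if name not in stack:
                raise ValueError("no matching opening tag for </%s>" % name)
            # Remove the most recent matching element and everything above it.
            index = len(stack) - 1 - stack[::-1].index(name)
            del stack[index:]
        elif name not in NON_PAIRING:
            stack.append(name)
    return result
```

For "<DL><DT>Text1<DD>Text2</DL>" the sketch reports Text2 with the open elements DL, DT, DD, so Text2 still counts as being inside the <DT> element, which is the over-extended containment the text warns about.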
Regexp command-line options:
regexp=1 If this option is specified, old="..." is used as a Perl
regular expression, rather than as a simple literal string
(the default is that both old="..." and new="..." are handled
as simple literal strings). See the Perl documentation for
information on regular expressions. Special characters that
are shell metacharacters will have to be quoted on the
command line, to protect them from interpretation by the
shell. The `/' character should be escaped by a preceding
backslash, or should be written as "\057", since this
character is used as the delimiter in the Perl s/.../.../
construct.
regeval=1 If this option is specified, old="..." is used as a
regular expression, and new="..." is a statement to be
evaluated, as in the Perl s/.../statement/e construct.
Special variables such as $`, $&, $', $1 etc. can be used as
part of such a statement (remember that the "." operator is
used to concatenate string values). If you use an erroneous
expression, you will get a Perl error message (not an htmlsrpl
error message), which you will have to interpret using the Perl
manual.
case=1 If this option is specified along with the regexp=1,
regeval=1, or delete=1 options, matching is done without
regard to alphabetic case.
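The difference between the literal, regexp=1 and regeval=1 modes can be illustrated with a rough Python analogue. This is a sketch, not a port: the function name replace is made up, Python's re syntax differs from Perl's in places, a Python callable stands in for the evaluated Perl statement, and treating new= as a plain literal under regexp=1 is an assumption that the text does not confirm.

```python
import re

def replace(chunk, old, new="", regexp=False, regeval=False, case=False):
    """Apply one search-and-replace to a single chunk of text."""
    # case=1 only has an effect in the regular-expression modes.
    flags = re.IGNORECASE if case else 0
    if regeval:
        # new is a callable that receives the match object,
        # loosely like the statement in Perl's s/.../statement/e.
        return re.sub(old, new, chunk, flags=flags)
    if regexp:
        # old is a pattern; new is inserted literally (assumption).
        return re.sub(old, lambda match: new, chunk, flags=flags)
    # Default: both old and new are simple literal strings.
    return chunk.replace(old, new)
```

With old="a.b", the default mode changes only the literal text "a.b", whereas regexp=True also changes "axb", because `.' then matches any character.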
Command-line options that affect what is matched against:
lines=1 If this option is specified, the chunks of the input file
that will be individually searched and replaced are those
that result when tag beginnings (`<') and tag endings (`>')
are boundaries; these chunks can contain embedded newlines.
(Remember that in Perl the regexp /./ does not match newline
("\n"); you can use [^\000] instead.)
If the lines=1 option is not specified, then the default
behavior is that linebreaks are also boundaries; the chunks
then do not contain newlines. The `<' and `>' characters
themselves are never part of the chunks matched against (they
can only be altered by use of the delete=1 option), except
for `>' characters outside of tags, which are treated as
ordinary text.
slash=1 If this option is specified, then the `/' slash character
immediately following the `<' character of a closing tag is
not matched against, and is not affected by any search-and-
replace operation (except, of course, tag deletion with
delete=1). Implies intags=1.
delete=1 If this option is specified, old="..." is treated as a
regexp and is matched against tagnames (not against the entire
contents of tags); where tagnames match, the entire tag,
including the surrounding `<' and `>' characters, is deleted.
This option implies intags=1 and slash=1, and is incompatible
with regexp=1, regeval=1, or a non-null value of new=.
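The chunking rule behind lines=1 can be sketched as follows. This is a minimal Python illustration, not the program's own code: the function name chunks and the "tag"/"text" labels are made up, and the sketch only shows what would be matched against, without reassembling the output.

```python
def chunks(text, lines=False):
    """Split text into (kind, chunk) pairs, where kind is "tag" or "text"."""
    out = []
    buf = []
    in_tag = False

    def flush():
        # Emit the pending characters, labelled by where they were found.
        if buf:
            out.append(("tag" if in_tag else "text", "".join(buf)))
            buf.clear()

    for ch in text:
        if ch == "<":
            if in_tag:
                raise ValueError("`<' found inside a tag")
            flush()
            in_tag = True
        elif ch == ">" and in_tag:
            flush()
            in_tag = False
        elif ch == "\n" and not lines:
            flush()  # default: linebreaks are boundaries too
        else:
            buf.append(ch)  # includes loose `>' characters outside tags
    if in_tag:
        raise ValueError("unclosed tag at end of input")
    flush()
    return out
```

With the default setting, a tag whose attributes continue on a second line is seen as two separate chunks; with lines=True it is a single chunk containing a newline, which is why a pattern such as "^blink[^\000]*" needs lines=1 to cover multi-line tags.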
Uppercasing option:
upcase=1 If this option is present, then tag names (the sequence of
non-whitespace immediately following a `<' character) are
upper-cased. Does not upper-case tag options (attributes).
If old= is null or absent, then this is the only thing that
htmlsrpl.pl does, and any other command-line options are
ignored. Otherwise, uppercasing is done first, before any
specified search-and-replace operation (and the intags=1
option is assumed). Note that qualifiers like `inmost=' will
govern the scope of any search-and-replace operation that
accompanies uppercasing, but uppercasing itself always
affects all tags.
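The uppercasing rule is simple enough to show in one Python function. This is a sketch only: the name upcase_tag_names is made up, and stopping the tag name at `<' or `>' as well as at whitespace is an assumption added so that the closing `>' is never treated as part of the name.

```python
import re

def upcase_tag_names(html):
    """Upper-case the run of non-whitespace right after each `<'."""
    # Attributes are left alone because the match stops at the first
    # whitespace character (or at the end of the tag).
    return re.sub(r"<([^\s<>]+)", lambda m: "<" + m.group(1).upper(), html)
```

For example, upcase_tag_names('<a href="x.html">t</a>') gives '<A href="x.html">t</A>': the tag names change, but the attribute name and value do not.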
Final status message:
At the end of processing, if no errors occurred, htmlsrpl.pl outputs a
message to STDERR (either "Changed!" or "Unchanged") reporting whether
or not any substitutions were actually made.
Summary:
You can do some cute things by playing around with these options. For
example, ``perl htmlsrpl.pl regexp=1 old=".*"'' deletes all text (except
newlines) outside tags, while adding ``intags=1'' to this command line means
that all text inside tags is deleted instead (leaving ghostly ``<>'' markers
behind). The command line ``perl htmlsrpl.pl delete=1 case=1 old="blink"''
nukes any <BLINK> tags (yay!), while ``perl htmlsrpl.pl slash=1 case=1
lines=1 regexp=1 old="^blink[^\000]*" new="I"'' replaces every BLINK tag,
together with any attributes (possibly spread over multiple lines), with
the corresponding opening <I> or closing </I> tag. A command like ``perl
htmlsrpl.pl outside=cite,h1,h2,h3,h4,h5,h6,title old="Pride and Prejudice"
new="<cite>Pride and Prejudice</cite>"'' can be used to add mark-up in the
appropriate places.
Limitations:
A limitation of this program is that it always treats `<' and `>' in the
input file as tag-beginning and tag-ending characters (even in comments),
and terminates prematurely if `<' and `>' are found in inappropriate places
(except that loose `>' characters outside tags are harmless). In this case
a "die" message will be output to STDERR, and the last line of the output
will be "ERROR!".
If you misspell an option name, you will get either an error when Perl
tries to open a file with that name, or an uninformative
"No `old=' string was specified" error message.
The program writes the results for all files named on the command line to
STDOUT; to process a number of files individually, use the iteration
mechanism of your shell; for example:
for a in *.html ; do perl htmlsrpl.pl old=ABC new=XYZ "$a" > "otherdir/$a" ; done
in Unix sh, or:
for %a in (*.htm) do call htmlsrpl %a otherdir\%a
in MS-DOS, where htmlsrpl.bat is the following one-line batch file:
perl htmlsrpl.pl old=ABC new=XYZ %1 > %2
Author:
Copyright H. Churchyard 1994, 1995 -- freely redistributable. This code is
functional but not very well commented or aesthetic -- sorry! If you find
an error in this program, e-mail me at churchh@uts.cc.utexas.edu.