htmlsrpl version 1.1, January 7 1995
Name:
htmlsrpl.pl - HTML-aware search-and-replace program,
with either literal strings or regular expressions. Acts either only outside
HTML/SGML tags, or only within tags; can be restricted to operate only within
and/or only outside specified elements; can also upper-case tag
names. Runs under perl.
perl htmlsrpl.pl [options] infile.html > outfile.html
Where command-line options have the form "option=value" (without
whitespace on either side of the `=' character), and all options should
precede filename arguments on the command line.
- old="..."
- String or expression to be replaced. Must be defined and non-null
(unless the upcase=1 option is specified).
- new="..."
- The replacement string or expression. If new= is absent
or null, the old="..." string is deleted.
- intags=1
- If this option is specified on the command line, strings within
tags are changed, but not text outside of
tags. (The default action, if this option is absent, is to only replace text
outside of tags.)
- inside=...
- The value of this option is a tagname or a comma-separated list of
tagnames (e.g. inside=A or inside=b,i). Search and replace
operations will only take place in material that is contained within all the
specified elements. So if
inside=b,i has been specified on the command line, only
"Text3" in the following input file would be subject to search and
replace: "Text1<B>Text2<I>Text3</I></B>".
The order of inclusion makes no difference (so that <B> nested inside
<I> would be treated exactly the same as <I> nested inside
<B>).
- outside=...
- Search and replace will only take place outside the tag or
(comma-separated) list of tags specified with this option. So if
outside=b,i is specified, nothing contained within a
<B>...</B> or <I>...</I> element will be subject to
search and replace.
- inmost=...
- The same as inside=, except that search
and replace only occurs immediately within the element specified
(i.e. inmost=b would mean that only "Text2" would be subject
to search and replace in
"Text1<B>Text2<I>Text3</I></B>").
If more than one of these options is specified, search-and-replace only
takes place when all the conditions specified in the options are satisfied.
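The way inside=, outside= and inmost= combine can be sketched in a few lines of Python. This only illustrates the rules described above and is not the program's own Perl code. The function name in_scope, its parameters, and the representation of the currently enclosing elements as a list (outermost first) are assumptions made for the example, as is treating inmost= as a single tag name.

```python
def in_scope(stack, inside=(), outside=(), inmost=None):
    """Return True if text enclosed by `stack` is subject to search and replace.

    `stack` lists the currently open elements, outermost first (hypothetical
    representation). Tag names are compared without regard to case.
    """
    open_names = {name.upper() for name in stack}
    # inside=: every listed element must be open; nesting order is irrelevant.
    if any(tag.upper() not in open_names for tag in inside):
        return False
    # outside=: none of the listed elements may be open.
    if any(tag.upper() in open_names for tag in outside):
        return False
    # inmost=: the named element must be the innermost open element.
    if inmost is not None:
        if not stack or stack[-1].upper() != inmost.upper():
            return False
    return True
```

For the input "Text1<B>Text2<I>Text3</I></B>", "Text3" is enclosed by the list ["B", "I"], so in_scope(["B", "I"], inside=("b", "i")) is true, while in_scope(["B", "I"], inmost="b") is false.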
This program uses a rather simple-minded algorithm for determining what is
contained within an element. There is a small list of known non-pairing tags
(such as <IMG>,
<BR>, etc.). When any opening tag not
on this list is encountered, it is pushed onto a stack of presently-containing
elements. When any closing tag is encountered, the most-recently occurring
matching tagname is removed from the stack, along with everything above it in
the stack (if no matching opening tag has been encountered,
htmlsrpl.pl exits with an error -- use the
htmlchek program in this package to help
find the HTML error). This means, for example, that a <P> element
unclosed by a </P> will often be considered to extend much farther than
it should according to the HTML DTD; also, in a
list such as
"<DL><DT>Text1<DD>Text2</DL>",
"Text2" is actually considered to be contained within a <DT>
element.
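The stack handling described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not a transcription of htmlsrpl.pl: the names NON_PAIRING, TAG and containment are invented here, the list of non-pairing tags is cut down to the two named in the text (the program's real list is longer), and comments and declarations get no special treatment.

```python
import re

# Assumed, deliberately incomplete list of non-pairing tags.
NON_PAIRING = {"IMG", "BR"}

# Either a tag (optional slash, tag name, rest of tag) or a run of text.
TAG = re.compile(r"<(/?)([^\s>]*)[^>]*>|([^<]+)")

def containment(html):
    """Return (text, enclosing_elements) pairs for each run of text."""
    stack = []
    result = []
    for m in TAG.finditer(html):
        slash, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            result.append((text, tuple(stack)))
            continue
        name = name.upper()
        if slash:
            if name not in stack:
                raise ValueError("closing tag </%s> has no opening tag" % name)
            # Remove the most recent matching element and everything above it.
            idx = len(stack) - 1 - stack[::-1].index(name)
            del stack[idx:]
        elif name not in NON_PAIRING:
            stack.append(name)
    return result
```

Run on "<DL><DT>Text1<DD>Text2</DL>", this reports "Text2" as enclosed by DL, DT and DD, which reproduces the behavior noted above: the unclosed <DT> is still on the stack when "Text2" is reached.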
Note that when the inside=,
inmost=, or
outside= options are used together with the
intags=1 option, a tag is never considered to
be contained within the element which it itself delimits (i.e. the inclusion
and exclusion relationships established by a tag come into force at the end of
the tag if it is an opening tag, and at the beginning of the tag if it is a
closing tag). Also, inclusions and exclusions are always calculated from the
unprocessed input, before any search and replace has taken place.
- regexp=1
- If this option is specified, old="..." is
used as a Perl regular expression, rather than as a simple literal string
(the default is that both old="..." and
new="..." are handled as simple literal strings).
See the Perl documentation for information on regular expressions. Special
characters that are shell metacharacters will have to be quoted on the
command line, to protect them from interpretation by the shell. The `/'
character should be escaped by a preceding backslash, or should be written as
"\057", since this character is used as the delimiter in the Perl
s/.../.../ construct.
- regeval=1
- If this option is specified, old="..." is
used as a regular expression, and new="..." is a
statement to be evaluated, as in the Perl s/.../statement/e construct.
Special variables such as $`, $&, $', $1 etc. can be used as part of such
a statement (remember that the "." operator is used to concatenate string
values). If you use an erroneous expression, you will get a Perl
error message (not an htmlsrpl error message), which you will have to
interpret using the Perl manual.
- case=1
- If this option is specified along with the regexp=1, regeval=1, or
delete=1 options, then matching is done without regard to alphabetic
case.
- lines=1
- If this option is specified, the chunks of the input file that will be
individually searched and replaced are those that result when tag beginnings
(`<') and tag endings (`>') are boundaries; these chunks can contain
embedded newlines. (Remember that in Perl the regexp /./ does not
match newline ("\n"); you can use [^\000] instead.)
If the lines=1 option is not specified, then the default
behavior is that linebreaks are also boundaries; the chunks then do not
contain newlines. The `<' and `>' characters themselves are never part
of the chunks matched against (they can only be altered by use of the
delete=1 option), except for `>' characters
outside of tags, which are treated as ordinary text.
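As a rough illustration of how the input is cut into chunks with and without lines=1, here is a small Python sketch. It is a simplified model rather than the program's actual code: the names TOKEN and chunks, the ("tag" / "text") labels, and the choice to drop empty chunks are all assumptions made for the example.

```python
import re

# A tag body (between `<' and `>') or a run of text outside tags.
# Loose `>' characters outside tags fall into the text alternative.
TOKEN = re.compile(r"<([^<>]*)>|([^<]+)")

def chunks(html, lines=False):
    """Return (kind, chunk) pairs; kind is "tag" or "text"."""
    result = []
    for m in TOKEN.finditer(html):
        if m.group(1) is not None:
            kind, body = "tag", m.group(1)
        else:
            kind, body = "text", m.group(2)
        # Without lines=1, linebreaks are chunk boundaries too.
        pieces = [body] if lines else body.split("\n")
        result.extend((kind, piece) for piece in pieces if piece)
    return result
```

With the input "a > b<I\nCLASS=x>c\nd", chunks(..., lines=True) yields three chunks, two of which contain an embedded newline, while the default yields five chunks with no newlines in any of them.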
- slash=1
- If this option is specified, then the `/' character immediately
following the `<' character of a closing tag is not matched against, and
is not affected by any search-and-replace operation (except, of course, tag
deletion with delete=1). Implies intags=1.
- delete=1
- If this option is specified, old="..." is treated as a regexp and is
matched against tagnames (not against the entire contents of tags); where
tagnames match, the entire tag, including the surrounding `<' and `>'
characters, is deleted. This option implies intags=1 and slash=1, and is
incompatible with regexp=1, regeval=1, or a non-null value of new=.
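The effect of delete=1 can be approximated with the following Python sketch. It is only a model of the behavior described above: the function name delete_tags and its parameters are invented for the example, the pattern is assumed to be searched for anywhere in the tagname (unanchored), and Python's regular expression dialect differs from Perl's in some details.

```python
import re

def delete_tags(html, old, ignore_case=False):
    """Delete every tag whose name matches the regexp `old`."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(old, flags)

    def drop_if_match(m):
        # group(2) is the tagname without the leading slash of a closing
        # tag, mirroring the slash=1 behavior that delete=1 implies.
        name = m.group(2)
        return "" if pattern.search(name) else m.group(0)

    return re.sub(r"<(/?)([^\s<>]*)[^<>]*>", drop_if_match, html)
```

For example, delete_tags("a<BLINK>b</BLINK>c", "blink", ignore_case=True) corresponds to running the program with delete=1 case=1 old="blink" and leaves "abc".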
- upcase=1
- If this option is present, then tag names (the sequence of non-whitespace
immediately following a `<' character) are upper-cased. Does not
upper-case tag options (attributes). If
old= is null or absent, then this is the only thing
that htmlsrpl.pl does, and any other command-line options are
ignored. Otherwise, uppercasing is done first, before any specified
search-and-replace operation (and the intags=1
option is assumed). Note that qualifiers like
`inmost=' will govern the scope of any
search-and-replace operation that accompanies uppercasing, but uppercasing
itself always affects all tags.
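The uppercasing step on its own is simple enough to show in a short Python sketch. The function name upcase_tag_names is made up for the example, and the sketch relies only on the definition given above (a tag name is the run of non-whitespace characters immediately following a `<' character); it is not taken from the program's source.

```python
import re

def upcase_tag_names(html):
    """Upper-case tag names, leaving attributes and text untouched."""
    return re.sub(
        r"<([^\s<>]+)",
        lambda m: "<" + m.group(1).upper(),
        html,
    )
```

Applied to '<a href="x.html">t</a><br>', this gives '<A href="x.html">t</A><BR>': the names of opening and closing tags are upper-cased, while the href attribute is left as it was.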
You can do some cute things by playing around with these options. For
example, ``perl htmlsrpl.pl regexp=1
old=".*"'' deletes all text (except newlines) outside
tags, while adding ``intags=1'' to this command
line means that all text inside tags is deleted instead (leaving ghostly
``<>'' markers behind). The command line ``perl
htmlsrpl.pl delete=1 case=1
old="blink"'' nukes any <BLINK> tags (yay!),
while ``perl htmlsrpl.pl slash=1 case=1 lines=1 regexp=1
old="^blink[^\000]*" new="I"'' will replace all BLINK tags, together
with any accompanying attributes (possibly spread over multiple lines), with
the appropriate opening <I> and closing </I> tags. A command like
``perl htmlsrpl.pl
outside=cite,h1,h2,h3,h4,h5,h6,title
old="Pride and Prejudice"
new="<cite>Pride and Prejudice</cite>"''
can be used to add mark-up in the appropriate places.
A limitation of this program is that it always treats `<' and `>' in
the input file as tag-beginning and tag-ending characters (even in comments),
and terminates prematurely if `<' and `>' are found in inappropriate
places (except that loose `>' characters outside tags are harmless). In
this case a "die" message will be output to STDERR, and the last line of the
output will be "ERROR!".
If you misspell an option name, you will get either an error when Perl
tries to open a file with that name, or a generic "No `old=' string was
specified" error message.
The program processes all files on the command line to STDOUT; to process
a number of files individually, use the iteration mechanism of your shell; for
example:
for a in *.html ; do perl htmlsrpl.pl old=ABC new=XYZ "$a" > "otherdir/$a" ; done
in Unix sh, or:
for %a in (*.htm) do call htmlsrpl %a otherdir\%a
in MS-DOS, where htmlsrpl.bat is the following one-line batch
file:
perl htmlsrpl.pl old=ABC new=XYZ %1 > %2
Copyright H. Churchyard 1994, 1995 -- freely redistributable. This
code is functional but not very well commented or aesthetic -- sorry! If you
find an error in this program, e-mail me at
churchh@uts.cc.utexas.edu.