htmlchek version 4.1, February 20 1995
htmlchek.awk, htmlchek.pl - Syntactically
checks HTML 2.0 or 3.0 files for a number of possible errors; can do
local link cross-reference checking, and generate a rudimentary
reference-dependency map. Runs under awk
or perl. Includes a number of supplemental utilities for HTML
file processing.
Author: H. Churchyard
churchh@uts.cc.utexas.edu
README.40:
htmlchek version 4.0, January 17 1995
htmlchek -- Syntactically checks HTML 2.0 or 3.0 files for a
number of possible errors; can do local link
cross-reference checking, and generate a
rudimentary reference-dependency map. Runs
under awk or perl. Includes a number of
supplemental utilities for HTML file processing.
This release of htmlchek (version 4.0) is a moderately significant
upgrade to previous versions, and includes the following files:
(The documentation for all programs and shell scripts other than htmlsrpl.pl
is in htmlchek.man/htmlchek.html.)
README.40       This file
htmlchek.man    Documentation
htmlchek.html   HTML version of Documentation
htmlchek.awk    Awk version of htmlchek HTML error checker
htmlchek.pl     Port of htmlchek to perl
example.cfg     Sample htmlchek configuration file
html2dtd.cfg    Config. file for stricter compliance with 2.0 DTD
htmlqref.txt    Yet another HTML quick reference (plain text)
htmlqref.html   HTML version of yet another HTML quick reference
htmlsrpl.pl     HTML-aware search-and-replace program (perl)
htmlsrpl.man    Documentation for htmlsrpl.pl
htmlsrpl.html   HTML version of documentation for htmlsrpl.pl
xtraclnk.pl     Extracts links and link/title text from HTML files (perl)
makemenu.awk    Makes simple menu for HTML files using <TITLE>; can also
makemenu.pl       make table of contents using <H1>-<H6> (awk/perl)
dehtml.awk      Remove all HTML markup, preliminary to spell check (awk)
dehtml.pl       Perl version of dehtml
entify.awk      Replace high Latin 1 alphabetic characters with ampersand
entify.pl         entities for safe 7-bit transport (awk/perl)
metachar.awk    Trivial program to protect HTML/SGML "&<>" metacharacters
metachar.pl       in text to be included in an HTML file (awk/perl)

(Unix shell files:)

htmlchek.sh     Run htmlchek.awk under the best available interpreter,
                  and with options checking
htmlchkp.sh     Run htmlchek.pl with external options checking
runachek.sh     Do cross-reference checking using htmlchek.awk
runpchek.sh     Do cross-reference checking using htmlchek.pl
rducfila.sh     Reduce .NAME/.HREF files (external xref check, awk)
rducfilp.sh     Reduce .NAME/.HREF files (external xref check, perl)
makemenu.sh     Run makemenu.awk under the best available interpreter,
                  and with options checking
dehtml.sh       Run dehtml.awk under the best available interpreter
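To illustrate what "protecting" the "&<>" metacharacters means for the
metachar entry in the list above, here is a minimal sketch. It is not the
shipped metachar.awk; the function name protect_metachars is made up for
this example, and it assumes only that an awk interpreter is in your PATH.

```shell
# Hypothetical helper, not the shipped metachar.awk: escape the three
# HTML/SGML metacharacters in text read from standard input.
protect_metachars() {
    awk '{
        gsub(/&/, "\\&amp;")   # must run first, so entities added below are not re-escaped
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        print
    }'
}

# Example: prepare a line of source code for inclusion in an HTML file.
printf '%s\n' 'if (a < b && b > c)' | protect_metachars
```

The ampersand is replaced first because the other two replacements
themselves introduce ampersands.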
The htmlchek program checks for quite a number of possible defects
in the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used
on the World-Wide Web. (Preliminary HTML 3.0 files for the Arena
browser, or files with Netscape extensions, can also be checked by
specifying the appropriate options.) The program makes no claim to
understand all of SGML, but is easy and relatively simple to use,
gives lots of information (including about many stylistically bad
practices), can do local cross-reference checking and generate
rudimentary reference-dependency maps, and can be run on any platform
for which the language interpreter (awk or perl) is available.
This release of htmlchek also includes a number of supplemental
utilities, including the htmlsrpl.pl HTML-aware search-and-replace
program, which uses either literal strings or regular expressions;
acts either only outside HTML/SGML tags, or only within tags; can be
restricted to operate only within and/or only outside specified
elements; and can also upper-case tag names.
The accompanying .sh files are for greater ease of use under Unix
(actually, any Posix 1003.2 system, including VMS Posix), but nothing in
htmlchek.awk or htmlchek.pl themselves, or in the accompanying
supplemental programs, depends on the Unix operating system (in
particular, the perl programs do not use any of the Unix-specific
systems-programming features of the perl language), so that this
package can be used on non-Unix systems.
If you seem to get a million errors the first time you run htmlchek
on a file, don't be dismayed -- sometimes htmlchek can't compensate
for an error, so that the invalid HTML code it has encountered affects
its interpretation of valid HTML code later on in the file. Just go
back and fix the _first_ error, or first few errors, in the HTML file,
then run htmlchek again and see what you get. Iterate as necessary.
(However, I have tried to eliminate many of the cascades of redundant
error messages that some earlier versions of this program tended to
generate.)
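One way to follow this advice is to look at only the first few errors on
each pass. The sketch below is an assumption-laden illustration, not part
of the distribution: the function name first_errors is made up, and it
relies only on the fact that htmlchek marks error lines with the string
"ERROR!".

```shell
# Hypothetical helper: print only the first N lines containing "ERROR!"
# from checker output on standard input (N defaults to 3), then stop.
first_errors() {
    awk -v max="${1:-3}" '/ERROR!/ { print; if (++seen >= max) exit }'
}

# Intended use (assumes htmlchek.awk is in the current directory and
# page.html is the file being checked):
#   awk -f htmlchek.awk page.html | first_errors 2
```

Fix what it shows, run the same command again, and repeat until nothing
is printed.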
The htmlchek program performs a fairly comprehensive job of
checking for HTML errors, but does not always exactly follow the
official standard (currently this is version 1.22 of the HTML 2.0
DTD). Bad stylistic practices are warned against, as well as actual
HTML errors, and in some cases htmlchek is stricter than the standard,
in order to accommodate the peculiarities of some browsers. The idea
is that HTML code should be ruggedized for the real world, rather than
just being SGML-ically correct -- especially since the official
standard allows many SGML features which are hardly understood by any
HTML-specific applications; for example, according to the official
standard the following is a completely valid HTML 2.0 file (without
even any omitted tags!):
<><HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<![IGNORE[ </HTML>]]>/</>
Version 4.0 of the htmlchek distribution has the following new features:
Main changes to htmlchek: added internal cross-reference checking (not as
hard as I thought it would be!); added option of generating dependency
map; added command-line options to allow `<' and `>' characters within
quoted attribute values and <!-- --> comments, and `>' characters outside
tags. Other changes: added HTML quick reference, in plain text and .html
versions; added htmlsrpl.pl; added xtraclnk.pl; added makemenu.awk/
makemenu.pl; added metachar.awk/metachar.pl; added Perl version of
entify; enhanced the Unix/Posix-1003.2 shell scripts to redirect
non-program output to STDERR, detect non-zero exit status of awk/perl,
and add required trailing slashes automatically. Minor changes to
htmlchek: added sample configuration files; added check for content of
<ADDRESS> element; now detect multiple <HEAD> elements in document;
<OPTION>, <TEXTAREA>, and <TITLE> elements should not contain any tags;
<INPUT>, <SELECT> and <TEXTAREA> do not have to be _immediately_
contained within a <FORM> (inclusion exception); allow reqopts=
command-line option to specify multiple required attributes for a single
tag; added dlstrict= option and changed default strictness to that of
dlstrict=1; differentiated novalopts= from tagopts=; added subtract="..."
command-line option (to facilitate checking files outside current
directory); updated Arena/HTML3 language definition; tinkered with the
Netscape language definition (in the absence of any definitive
documentation); improved internal htmlchek.pl options checking; other
minor fixes and enhancements.
Both the awk program htmlchek.awk and a port of this awk program to
perl are included in the distribution (the original reason for doing
the perl port in the first place was to make it possible to add full
off-site cross-reference checking over the Web; however, this
project may never be completed, and at present the awk and perl
programs have the same functionality); similarly, most of the
supplemental programs also have both awk and perl versions. You might
use one or the other based on personal preference, or because some
vendor-supplied awks on Unix boxes have proven to exhibit unendearing
peculiarities (you can also get around this by using GNU gawk if it is
on your system, or getting it from one of the ftp sites listed at the
end of almost every posting to the Usenet group gnu.announce and
compiling it; the program htmlchek.sh will automatically run gawk in
preference to nawk or awk, if gawk is on your system and in your
PATH). Gawk for MS-DOS (and a pointer to OS/2 gawk) is available from
ftp://oak.oakland.edu/SimTel/msdos/awk/. (See awk-perl.html.)
The anonymous ftp site for htmlchek is at:
ftp://ftp.cs.buffalo.edu/pub/htmlchek/
The htmlchek documentation can be browsed online at:
http://uts.cc.utexas.edu/~churchh/htmlchek.html
Typical command lines:
awk -f htmlchek.awk [options] infiles.html > outfile.check
perl htmlchek.pl [options] infiles.html > outfile.check
The options are in the form "option=value" (see htmlchek.html or
htmlchek.man). Remember that on some Unix systems ``awk'' is an
archaic incompatible program, so you should use ``nawk'' or ``gawk''
instead; the shell script htmlchek.sh will do this automatically (and
do some options checking as well):
sh htmlchek.sh [options] infiles.html > outfile.check
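The interpreter preference described above (gawk first, then nawk, then
awk) can be sketched as follows. This is only an illustration of the
idea, not the actual contents of htmlchek.sh; the function name pick_awk
is made up for this example.

```shell
# Hypothetical sketch, not the shipped htmlchek.sh: print the name of the
# first awk interpreter found in PATH, preferring gawk, then nawk, then awk.
pick_awk() {
    for candidate in gawk nawk awk; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no awk interpreter found in PATH" >&2
    return 1
}

# Intended use (assumes htmlchek.awk is in the current directory):
#   "$(pick_awk)" -f htmlchek.awk [options] infiles.html > outfile.check
```

Note that this sketch does none of the options checking that htmlchek.sh
is described as doing; it only picks the interpreter.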
Author: Henry Churchyard churchh@uts.cc.utexas.edu
README.41:
htmlchek version 4.1, February 20 1995
This is a bugfix and update to version 4.0, adding several minor
features for greater convenience of use.
Changes are: Don't warn about null <TEXTAREA></TEXTAREA> element; only
check for inappropriate whitespace within elements commonly rendered
as underlined (<A> and <U>); check ordering of head tags before body
tags even in absence of explicit <head>...</head>; allow comments
between list items; only output non-numeric unquoted option values in
each file; corrected processing of HTML3 <LH>; updated HTML 3 language
definition to January 19 1995 draft; tinkered with Netscape extensions
language-definition yet again; added inline=1 command-line parameter;
added listfile=/lf= command-line parameter (especially for greater
MS-DOS convenience); allow cf= as abbreviation of configfile=;
ampersands followed by non-alphabetics generate warnings rather than
errors (so corresponding error message was removed from entify); added
"changed"/"unchanged" STDERR messages to htmlsrpl.pl output; added
.gif's to documentation; added awk-perl.html to documentation; added
index.html menu to documentation.
New files in this release are:
README.41       This file
index.html      HTML version of README.40, README.41, and menu
awk-perl.html   Where to obtain Awk and Perl
geterr.sh       Trivial script to extract only ERROR! messages
                  from htmlchek output
geterwrn.sh     Trivial script to extract only ERROR!/Warning!
                  messages from htmlchek output

awk.gif         .gif files used in htmlchek HTML documentation
camel.gif         (uuencoded as .uue files in the comp.sources.misc
ftp.gif           Usenet distribution)
htmlchek.gif
htmlchks.gif
valdhtml.gif
warning.gif
Author: Henry Churchyard churchh@uts.cc.utexas.edu