htmlchek version 4.0, January 17 1995

Name:

htmlchek.awk, htmlchek.pl - Syntactically checks HTML 2.0 or 3.0 files for a number of possible errors; can do local link cross-reference checking, and generate a rudimentary reference-dependency map. Runs under awk or perl. Includes a number of supplemental utilities for HTML file processing.

Typical Command Lines
Description
- HTML Error Messages
- Operating System Dependency (shell scripts)
Command-line Options
Cross-Reference Checking
- External Cross-Reference Checking
Language Customization Options
Supplemental HTML-file processing programs: dehtml, entify, and metachar
- dehtml (Removes all HTML markup, preliminary to spell check.)
- entify (Replaces high Latin-1 alphabetic characters with ampersand entities for safe 7-bit transport.)
- metachar (Trivial program to protect HTML/SGML metacharacters "&<>" in plain text that is to be included in an HTML file.)
- (See also the separate documentation for the htmlsrpl.pl HTML-aware search-and-replace program included in this package.)
Supplemental link extraction programs: makemenu and xtraclnk.pl
- makemenu (Makes simple menu for HTML files, based on each file's <TITLE>; can also make a simple table of contents based on <H1>-<H6> headings.)
- xtraclnk.pl (Extracts links/anchors from HTML files; isolates text contained in <A> and <TITLE> elements.)
Limitations
Author
- Common Problems
- Meaningless Bitmap Graphic

Typical Command Lines:

awk -f htmlchek.awk [options] infiles.html > outfile.check

perl htmlchek.pl [options] infiles.html > outfile.check

The options are in the form "option=value" (see the sections ``Command-line Options'' and ``Language Customization Options'' below). The following is an alternative invocation of htmlchek.awk under Unix (to ensure, as far as possible, that the program is not run under incompatible ``old awk''):

sh htmlchek.sh [options] infiles.html > outfile.check

(If the files htmlchek.awk, htmlchek.pl, or htmlchek.sh are not in the current directory, the pathname to where they are located will have to be prefixed -- but see ``shell scripts'' below.)

Description:

This program checks for quite a number of possible defects in the HTML (Hyper-Text Mark-up Language) version 2.0 SGML files used on the World-Wide Web. (Files with Netscape extensions, or with features from the preliminary Arena/HTML 3.0 document, can also be checked by specifying the appropriate options, as explained below.) Diagnostic messages are output to STDOUT and so generally appear on the terminal/window, unless they are redirected to an output file, as is done in the examples given above (of course, this all depends on the operating system -- the Macintosh doesn't even have a "command line" as such, but you can set up "droplets" with MacPerl).

The output of htmlchek is divided into two parts, for each file checked:

First, if any possible problems are detected, these are signaled by messages (one per line) on each problem. (Note that lines of output which signal errors and warnings all contain the character `!'.)

"ERROR!"
This string is included when there is a definite error in the input HTML source code. Sometimes multiple error messages can be generated by a single error (see ``Limitations'' below), in which case only the first message may be significant.
"Warning!"
This string is included in messages which point out stylistically deprecated HTML coding, or the absence of certain recommended features. Such messages are intended to be more or less advisory.
Second, at the end of each file's output, diagnostics are generated as to the tags used in the file and the options used with each tag, along with possible additional global warnings (these final diagnostics/warnings can be longer than 80 columns).

A very limited form of cross-reference checking (making sure that file-local <...HREF="#..."> references actually exist) is automatically performed within each file; for larger-scale cross-reference checking see the appropriate section below.

If you process more than one file at a time (by specifying multiple files or wildcards on the command line, e.g. ``perl htmlchek.pl *.html'' or ``awk -f htmlchek.awk *.html''), then errors are located by filename and line number.

HTML Error Messages:

Most of the error and warning messages should be fairly self-evident, assuming a familiarity with the basic HTML language documentation; the following is a basic glossary of terms used (note that tag "options" are what are called "attributes" in SGML):

An "element" is <X>...</X> (for example, <A HREF="#page2">Page 2</A>).
A "tag name" is <X...> (for example, "A" in <A HREF="#page2">).
An "option" is <...Y="..."> (for example, "HREF" in <A HREF="#page2">).
An "option value" is <...="Z"> (for example, "#page2" in <A HREF="#page2">).

One warning that may be obscure, "Jump from header level H0", means that the first heading in the file is not <H1>; to be consistent with a system of sub-sections, the first heading should be <H1>, there should only be one <H1> heading in a file, and the heading level should never increase in value by more than 1 between two successive headers, as in the following scheme:

  _____________________________________________________
 |                   [whole  document]                 |
 | H1 ------------------------------------------------ |
 |_____________________________________________________|
 |              [first-level subdivisions]             |
 | H2 ------------ | H2 ------------ | H2 ------------ |
 |_________________|_________________|_________________|
 |              [second-level subdivisions]            |
 | H3 --- | H3 --- | H3 --- | H3 --- | H3 --- | H3 --- |
 |________|________|________|________|________|________|

 etc.

To check whether or not the headings in a file reflect the file's logical organization, run the file through makemenu with the toc=1 command-line option, and see what you get.
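
The heading rule described above (start at <H1>, never go down more than one level at a time) can be illustrated with a short shell function. This is only a rough sketch and not htmlchek's own code: the function name check_headings is hypothetical, a POSIX awk (not ``old awk'') is assumed, and only opening heading tags that start and end on the same line are recognized.

```shell
# check_headings FILE...: report places where the heading level rises by more than 1.
# Hypothetical helper for illustration; not part of the htmlchek package.
check_headings() {
    awk '
    BEGIN { prev = 0 }                       # "H0" means no heading seen yet
    FNR == 1 { prev = 0 }                    # start afresh for each input file
    {
        line = $0
        # Find each opening <Hn> tag on the line, in either alphabetic case.
        while (match(line, /<[Hh][1-6][ >]/)) {
            level = substr(line, RSTART + 2, 1) + 0
            if (level > prev + 1)
                printf "%s:%d: jump from H%d to H%d\n", FILENAME, FNR, prev, level
            prev = level
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Calling ``check_headings index.html'' (with any file name of your own) prints one line per offending heading. The sketch does not check that there is only one <H1> in the file.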

An error that can sometimes be counter-intuitive is a ``<LI> outside list'' or ``<DT>/<DD> outside <DL>...</DL>'' error: in the sequence <UL><B><LI></B></UL>, the <LI> is actually not in the list, since it is not immediately contained within the <UL>...</UL> element (but is rather immediately contained within the non-list <B>...</B> element).

The htmlchek program performs a fairly comprehensive job of checking for HTML errors, but does not always exactly follow the official standard. Bad stylistic practices are warned against, as well as actual HTML errors, and in some cases htmlchek is stricter than the standard, in order to accommodate the peculiarities of some browsers (the idea is that HTML code should be ruggedized for the real world, not just SGML-ically correct -- see below under ``nowswarn=1'', ``nonrecurpair='', ``metachar='', the differences between the 1.21 and 1.22 DTD's, and SHORTTAG). And htmlchek doesn't bother to emulate some broken features of the standard (such as allowing <TEXTAREA> elements to recursively nest, etc.); it is also laxer than the standard in allowing <ADDRESS>, <HR>, and <H1>-<H6> headings within <LI> and <DD> list items (since these tags can occur legitimately within a <BLOCKQUOTE>...</BLOCKQUOTE> element itself in a list item -- though they are not supposed to occur directly within a list item). See further under ``Limitations'' below.

Operating System Dependency (shell scripts):

The files included in this package whose names end with ".sh" are shell scripts for greater ease of use under Unix or Posix 1003.2. However, nothing in the checking or supplemental programs themselves depends on the Unix operating system (with one minor exception -- see ``Limitations'' below), so that htmlchek.awk and htmlchek.pl can be run on any system where an awk or perl interpreter is available.

These Unix shell scripts are typically invoked by means of a command line of the form ``sh scriptname [script-options]'', with possible additional shell parameters such as output redirection (with the `>' character) or background execution (with the `&' character -- e.g. "sh htmlchek.sh infiles.html > outfile"). If you have set execute permission on the script files (e.g. by means of "chmod +x *.sh"), and the directory where they reside is specified in the PATH environment variable, then you can omit "sh" from the beginning of the command line.

For the shell scripts htmlchek.sh, htmlchkp.sh, runachek.sh, and runpchek.sh, if htmlchek.awk or htmlchek.pl is not in the current directory when the shell script is run, the environment variable HTMLCHEK should be set to the name of the directory where the program is located. See your shell documentation for information about how to set environment variables ("setenv HTMLCHEK /somedir/" in csh and tcsh, "HTMLCHEK=/somedir/; export HTMLCHEK" in sh and its offspring). The HTMLCHEK environment variable is also used for the shell scripts that run supplemental programs, namely makemenu.sh and dehtml.sh. (The rducfil?.sh scripts only use "one-liner" awk and perl programs, specified within the script files themselves, and so do not need $HTMLCHEK in the environment.)
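
As a small illustration of the HTMLCHEK setting for sh and its offspring, the following sketch sets the variable and then looks for the two checking programs in that directory. The directory /usr/local/lib/htmlchek/ is only an assumed location, and the loop is a hypothetical sanity check, not something the package scripts are known to do. The trailing `/' follows the "/somedir/" examples above.

```shell
# Assumed install directory; substitute the place where you unpacked htmlchek.
HTMLCHEK=/usr/local/lib/htmlchek/
export HTMLCHEK

# Hypothetical sanity check: are the programs where $HTMLCHEK says they are?
for prog in htmlchek.awk htmlchek.pl; do
    if [ -f "${HTMLCHEK}${prog}" ]; then
        echo "found ${HTMLCHEK}${prog}"
    else
        echo "missing ${HTMLCHEK}${prog}" >&2
    fi
done
```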

These shell scripts return 1 on exit if some detectable error occurred, and output an appropriate errormessage to STDERR; otherwise they return 0.
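
The exit status can be used from a calling script. The following is a minimal sketch, assuming only the 0/1 convention stated above; the function name run_checked and its argument order are hypothetical.

```shell
# run_checked SCRIPT OUTFILE [args...]: run a script with sh, keep its STDOUT
# in OUTFILE, and pass the script's exit status back to the caller.
# Hypothetical wrapper for illustration; not part of the htmlchek package.
run_checked() {
    script=$1
    outfile=$2
    shift 2
    if sh "$script" "$@" > "$outfile"; then
        echo "ok: results are in $outfile"
    else
        status=$?
        echo "failed with exit status $status; see the message on STDERR" >&2
        return "$status"
    fi
}
```

For example, ``run_checked htmlchek.sh outfile.check sugar=1 infiles.html'' would leave the diagnostics in outfile.check and return nonzero if the script reported an error.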

Command-line Options:

Options are in the form "option=value" (where the `=' should not have spaces on either side of it); options should be specified on the command line PRECEDING any names of HTML files to check (see ``Typical Command Lines'' above). Options which follow filenames will not necessarily take effect (they are silently ignored in Posix-compliant awk, and generate an error in perl). Also, misspelled options will be silently ignored in awk. (On Unix, the shell scripts htmlchek.sh and htmlchkp.sh automatically check for command-line option errors, so you don't have to worry about these problems.)
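
The ordering rule (all "option=value" units before any filename) can be checked mechanically. The sketch below is only an illustration of that rule; check_order is a hypothetical name, and this is not claimed to be how htmlchek.sh or htmlchkp.sh perform their own checks. Any argument containing `=' is treated as an option, so a filename containing `=' would be misclassified.

```shell
# check_order ARG...: succeed only if every option=value argument comes
# before the first filename argument.  Hypothetical helper for illustration.
check_order() {
    seen_file=0
    for arg in "$@"; do
        case $arg in
            *=*) [ "$seen_file" -eq 0 ] || return 1 ;;   # option after a filename
            *)   seen_file=1 ;;
        esac
    done
    return 0
}

if check_order sugar=1 nowswarn=1 index.html; then
    echo "options precede filenames"
fi
```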

Options that affect the definition of the HTML language used to interpret and check files are discussed in the ``Language Customization Options'' section below; the other options are "nowswarn=", "sugar=", "xref=", "map=", "refsfile=", "append=", "dirprefix=", "usebase=", and "subtract=".

Output Options:

The two options "nowswarn=" and "sugar=" control features of the output of htmlchek:

nowswarn=1: If this option is specified, it turns off messages that warn you about inappropriate whitespace (which may confuse browsers) in low-level mark-up elements. These warning messages can be numerous enough to make it difficult to pick out other warning and error messages.
sugar=1: If this option is specified, then "filename: linenumber:" is prefixed to non-file-final error and warning messages (for compatibility with editors such as emacs which use diagnostic output which is formatted in this way, from Unix tools such as ``cc'' and ``lint'').

Cross-reference Checking Output Options:

These options are connected with details of multi-file cross-reference checking; if you intend to do such cross-reference checking using the run?chek.sh shell scripts under Unix, you can ignore these options and jump to the next section below.

xref=1: If this option is specified, cross-reference checking is performed on the files that are checked. If the refsfile= option is not specified, then the results (unresolved locations and references) are put at the end of the STDOUT output. If refsfile=``prefixname'' is specified, then the cross-reference checking results are put in separate files: locations in the checked files which were not referenced from within these files are in a file named ``prefixname.NAME'', references from the checked files which are not to locations found within the files are in ``prefixname.HREF'', and references to in-line images are in ``prefixname.SRC''. (See further discussion under ``Cross-reference checking'' below.)
map=1: If this option is specified along with xref=1, information about which files refer to which other files and resources (i.e. a dependency map) is generated; this can be quite large. If refsfile=``prefixname'' is specified as well, the dependency map will be placed in a separate file called ``prefixname.MAP''.
refsfile=``prefixname'': If this option is specified along with xref=1, then the output of internal cross-reference checking is put in separate files. If refsfile= is specified without xref=1 also being specified, then raw lists of non-cross-checked references are output to separate files (this was used for external cross-reference checking in earlier versions of htmlchek, as can still be done using the rducfil?.sh scripts, if desired). All HREF="..." references contained in the HTML files being checked are output to a file named ``prefixname.HREF'', all references to in-line images contained in the HTML files are output to a file ``prefixname.SRC'', and the destination locations specified in the HTML files are output to a file ``prefixname.NAME''. (Note that <...HREF="..."> references to non-inline images will be found in the .HREF file, not the .SRC file.)
append=1: If this option is specified along with refsfile=, then the resulting three files (or four, if the optional .MAP file is generated) will be appended to, if they already exist from a previous run, instead of being replaced. This may be useful for cross-reference checking of files which are not in a single sub-directory tree on a single machine (see ``External Cross-Reference Checking'' below). A blank line is added to each file at the beginning of each run, so that the output due to successive runs can be separated (but these are not preserved when the rducfil?.sh scripts are run).

Cross-reference Checking URL Prefix Options:

dirprefix=``pathname'': When "xref=1" and/or "refsfile=" is also specified, then the value of ``pathname'' (which should be a valid absolute or quasi-absolute URL pathname beginning) is prefixed to destination locations and relative URL's. This can be useful in cross-reference checking, in order to resolve relative URL's to absolute URL's when the files you are checking sometimes cross-refer with absolute URL's (see the next section below).
usebase=1: When "usebase=1" is specified, the URL specified in <BASE HREF="..."> in each file is assumed to be the name of the file (and the "dirprefix=", if any, is ignored in the processing of the file after the <BASE> is found). This only takes effect after the <BASE> tag is encountered in each file, so that <BASE HREF="..."> should be the first of the tags with NAME, HREF, ID, etc. options in each file.
subtract=``pathname'': If this option is specified, then ``pathname'' is removed from the beginnings of filenames specified on the command line, before the dirprefix= or usebase=1 prefix is added, for URL purposes. This can be useful for running cross-reference checking on files not located in the current directory. If a filename is given which does not begin with the specified prefix, then htmlchek stops with an error.
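
The way subtract= and dirprefix= combine can be pictured as simple string manipulation on each filename: first the subtract= prefix is removed, then the dirprefix= string is put in front. The following is a sketch under that reading only; file_to_url is a hypothetical name, plain concatenation with no added or removed `/' characters is an assumption, and usebase=1 is not modelled.

```shell
# file_to_url FILENAME SUBTRACT DIRPREFIX: print the URL under which FILENAME
# would be known, or fail if FILENAME does not begin with SUBTRACT.
# Hypothetical helper for illustration; not htmlchek's own code.
file_to_url() {
    file=$1
    subtract=$2
    dirprefix=$3
    case $file in
        "$subtract"*) ;;
        *) echo "error: $file does not begin with $subtract" >&2
           return 1 ;;
    esac
    printf '%s%s\n' "$dirprefix" "${file#"$subtract"}"
}
```

With the arguments ``/home/me/public_html/sub/a.html'', ``/home/me/public_html/'' and ``http://myhost.edu/~myself/'' (all made-up values), the function prints ``http://myhost.edu/~myself/sub/a.html''.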

Cross-Reference Checking:

Here ``cross-reference checking'' does not mean traversing the Web and finding out whether off-site remote URL's actually exist. It only means gathering together all the locations and references in a local HTML file, or a collection of local HTML files, and finding all the locations which are unreferenced within these files, and all the references which are not to a location within this collection. (A dependency map of what HTML files reference what resources can also be optionally generated.) You should generally delay cross-reference checking until you have more or less debugged your HTML files and corrected syntactically malformed references.

The programs htmlchek.awk and htmlchek.pl do cross-reference checking when the xref=1 command-line option is specified. Under Unix or Posix 1003.2, the shell scripts runachek.sh (for cross-reference checking with htmlchek.awk) and runpchek.sh (for cross-reference checking with htmlchek.pl) take care of many of the details of specifying htmlchek command-line options for cross-reference checking, and invoke the Unix find utility to look at all the *.html files in a directory hierarchy. These scripts have the following syntax (where ``dirprefix'' and ``outfileprefix'' stand for the first two shell script command-line parameters, the presence of which is obligatory):

sh runachek.sh dirprefix outfileprefix [directory] [options]

sh runpchek.sh dirprefix outfileprefix [directory] [options]

The third, optional, command-line parameter ``directory'' is the path to the top directory of the tree in which all .html files are to be checked (for example, "$HOME/public_html"); if this parameter is not present, then the current directory is used by default. This path should not end with a trailing `/' character. If you specify an incorrect directory path and get an errormessage from the Unix `find' utility, then the awk or Perl interpreter may end up looking for input from STDIN (the keyboard); press control-D to get back to your shell.

The first parameter ``dirprefix'' should be either the null string (''), or an absolute or quasi-absolute URL pathname beginning. What ``dirprefix'' should be specified as, depends on how the files you are checking cross-refer to each other with <... HREF="..."> links. In the situation in which the files refer to each other strictly with simple relative URL's (i.e. which do not begin with "//" or "/" -- ignoring the optional access-method prefix), such as "subdir/otherfile.html#section1", then you don't need the ``dirprefix'' mechanism, and you can get away with specifying the first parameter of run?chek.sh as the null string (and skip the rest of this paragraph). However, if you have non-relative URL's in your cross references, then you need to specify a ``dirprefix'' (note that if there are files in more than one directory, and files in subordinate directories refer to files further up the hierarchy, then you may want to use non-relative URL's, since while "../" is legitimate in a URL, relative URL's beginning with "../" can sometimes cause problems). The value to use for ``dirprefix'' should be the string used, in cross-references among your files with non-relative URL's, to refer to the root of the tree of files being checked (i.e. the value of the optional ``directory'' parameter, or, if this is not specified, the current directory when run?chek.sh is being run). Whichever type of non-relative URL your documents use for this purpose (whether a host-local reference like "/~myself/subdir/", a full reference with access method like "http://myhost.edu/~myself/subdir/", or any intermediate form), you should use the appropriate prefix as your ``dirprefix'' string; if your files use an inconsistent mixture of these different reference types, then no single ``dirprefix'' can work, and cross-reference checking will partially fail. 
Finally, if each file has its own name specified in a <BASE HREF="..."> reference, you can let ``dirprefix'' be the null string, and use the option usebase=1.

The second parameter ``outfileprefix'' is the common base name of the files (with the extensions ".ERR", ".NAME", ".HREF", and ".SRC" -- and also ".MAP" if the map=1 parameter is included on the command line) in which the output of the HTML-checking and cross-referencing process will be put.

After these parameters, optional parameters that follow on the remainder of the command line can be any of the "option=value" pairs discussed in the ``Command-line Options'' section above or the ``Language Customization Options'' section below (except for refsfile=, dirprefix=, xref=1, and subtract=, which are specified within the run?chek.sh scripts).

So the following are some typical command lines (remember that putting the name of a shell script first in a command line, as in the first example, implies that you have set execute permission by running chmod):

runpchek.sh http://uts.cc.utexas.edu/~churchh/ check configfile=example.cfg

sh runachek.sh '' out $HOME/public_html map=1 &

The second example shows how cross-reference checking may be run as a background process.

If no error has occurred, then when the shell script has finished, non-cross-referencing errorcheck data is in the file ``outfileprefix.ERR'', locations in the checked files which were not referenced from within these files are in a file named ``outfileprefix.NAME'', references from the checked files which are not to locations found within the files are in ``outfileprefix.HREF'', and references to in-line images are in ``outfileprefix.SRC''.

If ``outfileprefix.HREF'' and ``outfileprefix.NAME'' have file lengths greater than one, this does not necessarily signal an error: ``outfileprefix.HREF'' will contain not only `dangling' references to local HTML files, but also references to non-inline images, sounds, etc., and external references (to files not in the directory tree being checked, including files on other WWW sites) as well. Similarly, the file ``outfileprefix.NAME'' contains locations which are not referenced locally, but these locations might be referenced from outside the local directory tree.

It would be nice to check for the existence of local images listed in ``outfileprefix.SRC'' (and also the local non-inline images in ``outfileprefix.HREF''), but in general the references to these images are in URL format there (rather than in local filesystem format), so that there is no way to do this at the Unix shell level.

External Cross-Reference Checking:

If you have several directory trees of HTML files which cross-refer, and each hierarchy needs a different ``dirprefix'', you can still do cross-reference checking, if you run cross-reference checking for each individual directory tree, specifying the same output file names for each run, and use "append=1" as one of the options. (You can even do cross-reference checking across multiple machines, if you have an account on each machine, and transfer the cumulative .NAME, .HREF, and .SRC files to each machine before running local cross-reference checking on that machine -- of course, the ``dirprefix'' string on each machine will have to include a hostname for this to work.) Under Unix, the rducfil?.sh shell scripts will then reduce the resulting .HREF and .NAME files, so that they only contain unresolved references, by removing all items which are contained in both the files (these shell scripts will also sort the .SRC file and collapse duplicate entries). The rducfil?.sh scripts take only one command-line parameter, the common prefix of the .NAME, .HREF, and .SRC files to be resolved.
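
The reduction step amounts to a two-way set difference between the .HREF and .NAME lists. The sketch below shows that idea with the standard sort and comm utilities; it is not the text of the rducfil?.sh scripts. It assumes one item per line, and it writes its results to new files with the made-up extensions ".unresolved" and ".unreferenced" instead of rewriting the input files.

```shell
# reduce_refs PREFIX: compare PREFIX.HREF with PREFIX.NAME and keep only the
# items that appear in one list but not the other.
# Hypothetical helper for illustration; not part of the htmlchek package.
reduce_refs() {
    prefix=$1
    # Sort, collapse duplicates, and drop the blank separator lines added by append=1.
    LC_ALL=C sort -u "$prefix.HREF" | sed '/^$/d' > "$prefix.HREF.tmp"
    LC_ALL=C sort -u "$prefix.NAME" | sed '/^$/d' > "$prefix.NAME.tmp"
    # Items only in the HREF list: references with no matching location.
    LC_ALL=C comm -23 "$prefix.HREF.tmp" "$prefix.NAME.tmp" > "$prefix.HREF.unresolved"
    # Items only in the NAME list: locations that nothing refers to.
    LC_ALL=C comm -13 "$prefix.HREF.tmp" "$prefix.NAME.tmp" > "$prefix.NAME.unreferenced"
    rm -f "$prefix.HREF.tmp" "$prefix.NAME.tmp"
}
```

The same locale (here the C locale) is used for sort and comm, since comm expects its inputs in the collating order it uses itself.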

Language Customization Options:

By default, htmlchek checks HTML files more or less according to versions 1.21 and 1.22 of the HTML 2.0 standard (with some departures, as discussed above). (The only difference between the two versions is that the 1.22 DTD was changed -- apparently for technical SGML reasons -- to allow null elements of the type <X></X>; since such null elements are undesirable on a number of grounds, htmlchek still warns about them.) However, htmlchek is not limited to checking HTML files according to a single language definition.

Language Extensions (Arena/HTML3, Netscape):

The following options add extensions to the HTML language checked for:

arena=1 or html3=1 or htmlplus=1: Specifying any of these options means that files are checked according to a preliminary (December 1994) version of the emerging HTML 3.0 specification, and not as HTML 2.0. (Note that htmlchek doesn't check for the differences between MATH and non-MATH.)
netscape=1: Specifying this option means that the Netscape extensions do not generate errormessages.

The HTML language will continue to evolve, the HTML 3.0 definition is still preliminary, and the Netscape extensions document is unclear on some points (and uses the word "tag" rather confusingly), so the language definitions coded in the htmlchek program are clearly not cast in stone. For this reason I have also provided htmlchek with the following command-line or configuration file options to customize many features of how htmlchek treats individual tags, and thus the language that is checked for:

Tag definition options:

nonpair=: Defines a tag or a list of tags as non-pairing (i.e. only <X> is encountered, never </X>). If more than one tag is to be defined as non-pairing, then they should be separated by commas: "nonpair=x,Y,z". (On the command line, there can be no whitespace on either side of the equals sign or commas; in the configuration file the syntax is less strict.) The alphabetic case of the tag names does not matter, as seen in this example, but the case of the option does ("NONPAIR=..." will not work -- on VMS I think this means you'll have to quote the whole "option=value" unit).
Non-pairing tags in HTML 2.0 include <BR>, <HR>, <IMG>, and <LINK>.
loosepair=: Defines a tag or a comma-separated list of tags as optionally pairing (a <X> can be followed by a matching </X>, but need not be).
Optionally pairing tags in HTML 2.0 include <P>, <LI>, <DD>, <DT>, and <OPTION>.
strictpair=: Defines a tag or a comma-separated list of tags as obligatorily pairing (a <X> must always be followed by a matching </X>). (So "strictpair=p" would cause <P> to be checked as a paragraph container.) Most tags in HTML are of this type.
nonrecurpair=: Defines a tag or a comma-separated list of tags as obligatorily pairing, and in addition specifies that each tag is non-self-nesting -- i.e. one occurrence of an <X>...</X> element can never occur inside another occurrence of <X>...</X> (no matter how many intervening levels of structure there are). Thus since <A> is a non-self-nesting tag, the sequence <A>...<B>...<A>...</A>...</B>...</A> is forbidden.
Tags which are specially defined as non-self-nesting in HTML 2.0 are <A> and <FORM>; also, a number of other tags turn out to be non-self-nesting (the headings <H1>-<H6>, <ADDRESS>, <PRE>, <DT>, <MENU>, and <DIR>).
Declaring an obligatorily-pairing tag to be non-self-nesting is a powerful technique for detecting missing closing tags, which unintentionally result in an element being much bigger than it should be (the other checks in htmlchek may only detect such errors much later on, possibly at the end of the document, while a self-nesting error will generally show up close to the site where the missing closing tag should be). For this reason, and because self-nesting is almost always the result of a mistake, I have defined most of the HTML 2.0 obligatorily-pairing tags as non-self-nesting in htmlchek, although this is stricter than the official standard (to restore the "official" behavior, the configuration file html2dtd.cfg can be used, as discussed below).
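
To see why a non-self-nesting check catches a missing closing tag early, consider the following sketch. It is a deliberately simplified illustration and not htmlchek's parser: check_self_nesting is a hypothetical name, a POSIX awk is assumed, and tags inside comments or quoted option values are counted like any others.

```shell
# check_self_nesting TAG FILE...: report each <TAG> that opens while an
# earlier <TAG> is still open.  Hypothetical helper for illustration.
check_self_nesting() {
    tag=$1
    shift
    awk -v want="$tag" '
    BEGIN { want = toupper(want); depth = 0 }
    FNR == 1 { depth = 0 }                   # start afresh for each input file
    {
        line = $0
        # Walk through every opening or closing tag name on the line.
        while (match(line, /<\/?[A-Za-z][A-Za-z0-9]*/)) {
            name = toupper(substr(line, RSTART + 1, RLENGTH - 1))
            if (name == want) {
                if (depth > 0)
                    printf "%s:%d: <%s> inside <%s>...</%s>\n", FILENAME, FNR, want, want, want
                depth++
            } else if (name == ("/" want) && depth > 0) {
                depth--
            }
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

If a file opens <A HREF="a"> on line 1 without closing it and the next anchor begins on line 3, then ``check_self_nesting A file.html'' reports line 3, which is close to where the missing </A> belongs.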

If a new tag is declared with any of the preceding four options, it becomes a "known" tag to htmlchek. The options in the following sub-section should only be applied to tags which have been declared in this way (or are already known to htmlchek), or the results may not be what you expect.

Other tag behavior options:

lowlevelpair=: Defines an obligatorily pairing tag, or a comma-separated list of such tags, as low-level markup. Low-level markup elements can generally only include each other (and not things such as lists, headings, paragraphs, and blockquotes).
Low-level markup tags in HTML 2.0 include <A>, <B>, <EM>, etc. (By special dispensation, the <A> element is allowed to contain <H1>-<H6> headings, though a warning is generated.)
lowlevelnonpair=: Defines a non-obligatorily-pairing tag, or a comma-separated list of such tags, to be allowable within low-level markup and non-block elements.
Non-obligatorily-pairing low-level markup tags in HTML 2.0 are <BR> and <IMG>. (By special dispensation, a <PRE> element is allowed to contain <HR>, and <ADDRESS> is allowed to contain <P>.)
nonblock=: Defines a pairing tag, or a comma-separated list of such tags, to only contain low-level markup (the difference from lowlevelpair= is that nonblock= elements cannot contain each other). Making an optionally-pairing tag (such as <P> in the default definition) a non-block tag will not in general work, since htmlchek will not assume an implicit closing tag (such as </P>) before lists, headings, blockquotes, etc. (<DT> and <LI> do work, since they're confined to lists).
Non-block tags in HTML 2.0 include <DT>, headings <H1>-<H6>, <PRE>, and <ADDRESS> (with the exceptions noted under the lowlevelnonpair= option), and also <LI> within a <MENU> or <DIR> list.
deprecated=: Defines a tag or a comma-separated list of tags as deprecated and obsolescent. If such a tag occurs in the file, there is a warning message in the file-final tag diagnostics.
Deprecated tags in HTML 2.0 include <LISTING>, <PLAINTEXT>, and <XMP> (note that htmlchek doesn't use the special deprecated tag-insensitive pseudo-SGML-"CDATA" mode in parsing within these elements).
tagopts=: Defines allowed options for tags. Uses a different syntax than the above options to htmlchek; here comma separated "tag,option" pairs are themselves separated by colons. So to allow the <P> tag to have the options ALIGN and NOWRAP, one could specify "tagopts=P,align:p,nowrap" on the command line, or in the configuration file.
novalopts=: Defines allowed options for tags, with the same syntax as tagopts=; the difference is that options defined with novalopts= are not required to have a value (like the HTML 2.0 options COMPACT, ISMAP, etc.).
reqopts=: Defines required options for tags. Uses the same syntax as tagopts=, and causes an implicit tagopts= definition. So "reqopts=IMG,WIDTH:img,height" means that IMG tags are required to have WIDTH and HEIGHT options (which will be included in HTML 3.0, and can greatly speed the display of documents in Netscape).
dlstrict=1 or dlstrict=2 or dlstrict=3: This option controls how <DT> and <DD> tags are distributed within a <DL>...</DL> list. With dlstrict=3, every <DD> must be immediately preceded by a <DT> (as in previous drafts of the HTML 2.0 standard; the SGML "content model" is "(DT,DD?)+"). With dlstrict=2, <DD> can be indirectly preceded by <DT> (SGML content model "(DT,DD*)+" or "DT,(DT|DD)*"). With dlstrict=1 (the default behaviour of htmlchek) <DD> and <DT> can be freely intermixed in the list (SGML content model "(DT|DD)+").
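
The colon-and-comma syntax shared by tagopts=, novalopts=, and reqopts= can be unpacked as shown in the following sketch, which may help when composing long values. It only illustrates the syntax: split_tagopts is a hypothetical name, and folding everything to upper case is an assumption based on the mixed-case examples above, not a statement about htmlchek's internals.

```shell
# split_tagopts VALUE: print one "TAG OPTION" pair per line for a value such
# as "P,align:p,nowrap"; exit nonzero if any unit is not a "tag,option" pair.
# Hypothetical helper for illustration; not part of the htmlchek package.
split_tagopts() {
    printf '%s\n' "$1" |
        tr ':' '\n' |                        # one "tag,option" unit per line
        tr '[:lower:]' '[:upper:]' |         # assumed: case does not matter
        awk -F, 'NF == 2 { print $1, $2; next }
                 { bad = 1 }
                 END { exit bad + 0 }'
}
```

For the value "P,align:p,nowrap" this prints the two pairs ``P ALIGN'' and ``P NOWRAP'', one per line.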

Beware that some of the above definitions have the effect of undefining things that are incompatible with what you are defining (to avoid logical inconsistencies). For example, if you define "lowlevelpair=p", then the tag <P> will be undefined as a loosely-pairing tag (since this is incompatible with ``lowlevelpair'' status). This means it will be treated as an unknown tag, unless you add an explicit "strictpair=p" or "nonrecurpair=p" declaration.

General parsing configuration options:

metachar=1 or metachar=2 or metachar=3: This option controls how htmlchek responds to `<' and `>' characters in tags. If metachar=3 is specified, then these characters are allowed within comments and quoted option values (following the SGML syntax), so that <IMG SRC="leftarrow.gif" ALT="->">, or a comment that itself contains `<' and `>' characters, would not cause errormessages. The default value, metachar=2, does not allow `<' and `>' in tags or comments (so that `>' inside a quoted option value will be interpreted as prematurely ending the tag); this more accurately reflects the behaviour of some HTML browsers. Finally, metachar=1 restricts comments further by requiring them to be on a single line (another limitation of some browsers); the warning "Complex comment" is then generated for multi-line comment constructs.
nogtwarn=1: If this option is specified, no warnings are generated for loose `>' characters outside of tags. Such loose `>' characters are bad style (it is better to use "&gt;"), and warning about them can be a useful error-detecting technique, but they are not actually incorrect SGML.

Configuration File:

Since it is cumbersome to specify long strings on the command line, there is an alternative configuration file mechanism. Specifying configfile=``filename'' on the command line will cause htmlchek to read in options from the file. The same "option=value" units that are recognized on the command line should be specified one per line in the configuration file (note that all lines in the configuration file which do not contain the `=' character are treated as comment lines and silently ignored).
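
As an illustration of the file format, the following sketch writes a small configuration file and counts the lines that contain `='. The file name and its contents are made up for the example (the option values are taken from examples elsewhere in this document), and the final command line is only printed, not run.

```shell
# Hypothetical file name; any name can be given to configfile=.
cfg=${TMPDIR:-/tmp}/my_htmlchek.cfg

cat > "$cfg" <<'EOF'
Local language tweaks (a line without an equals sign is a comment)
strictpair=p
tagopts=P,align:P,nowrap
nowswarn=1
EOF

# Only the lines containing the equals character count as options.
grep -c '=' "$cfg"

# Print (without running) a command line that would use the file.
echo "awk -f htmlchek.awk configfile=$cfg infiles.html > outfile.check"
```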

Two sample configuration files are included in the htmlchek distribution, example.cfg and html2dtd.cfg. If html2dtd.cfg is invoked (by using configfile=html2dtd.cfg on the command line), then htmlchek conforms more strictly to the official HTML 2.0 DTD (following the SGML treatment of the `<' and `>' characters, and allowing low-level mark-up tags to self-nest).

There are some differences between specifying options on the command line and in the configuration file. On the command line, if there are multiple instances of the same "xxx=" option, all but the last will be silently ignored, but in the configuration file such multiple definitions will have cumulative effect. Also, the relative order of evaluation on the command line is undefined (if you have both a "strictpair=p" and a "nonrecurpair=p" definition on the command line, you don't know which will override the other), while the order of statements in a configuration file is significant, since later definitions will override previous ones. Finally, there can be no spaces or tabs around the `=', `,' or `:' characters on the command line, but this requirement is relaxed in the configuration file.

You can include definitions both on the command line and in the configuration file, in which case command line definitions will override those in the configfile= (specify an "arena=off" on the command line to override an "arena=1" in the configuration file, and similarly with html3=, htmlplus=, and netscape=). The internal definitions invoked by "arena=1" etc. and "netscape=1" will override definitions specified in the configuration file, but not those on the command line.

Note that the options discussed in the ``Command-line Options'' section above (append=, dirprefix=, refsfile=, sugar=, and usebase=) cannot be specified in the configuration file (nor, obviously, can configfile= itself be specified there). This is because the configfile= is a language definition file, not a user preference file. (If I ever implement a user preference file in a future version of htmlchek, it will be separate from the configfile=.) Since nowswarn= is actually a language configuration option, it can be specified in the configuration file.

Supplemental HTML-file processing programs: dehtml, entify, and metachar

dehtml

dehtml removes all HTML markup from a file so you can spell-check the darn thing. The commoner ampersand entities are translated to the appropriate single characters, so you can spell check if you're writing in a non-English language, and your spelling checker understands 8-bit Latin-1 alphabetic characters. Note that dehtml makes no pretensions to being an intelligent HTML-to-text translator; it completely ignores everything within <...>, and passes everything outside <...> through completely unaltered (except known ampersand entities).

Typical command lines:

awk -f dehtml.awk infile.html > outfile.txt

perl dehtml.pl infile.html > outfile.txt

The shell script file dehtml.sh runs dehtml.awk using the best available interpreter (under Unix):

sh dehtml.sh infile.html > outfile.txt

This program processes all files on the command line to STDOUT; to process a number of files individually, use the iteration mechanism of your shell; for example:

for a in *.html ; do awk -f dehtml.awk "$a" > "otherdir/$a" ; done

in Unix sh, or:

for %a in (*.htm) do call dehtml %a otherdir\%a

in MS-DOS, where dehtml.bat is the following one-line batch file:

gawk -f dehtml.awk %1 > %2
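
To make the behaviour described at the start of this section concrete, here is a crude shell imitation of the dehtml idea: drop everything inside <...> and translate a few common ampersand entities. It is only a sketch, not the dehtml source; the function name strip_markup is invented, only three entities are handled, and each tag is assumed to start and end on the same line.

```shell
# Crude imitation of the dehtml idea (not the real program):
# remove everything inside <...>, then translate a few entities.
# Assumes each tag starts and ends on the same line.
strip_markup() {
    # "&amp;" is translated last so that "&amp;lt;" becomes "&lt;", not "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}

printf '%s\n' '<P>Fish &amp; <B>chips</B> cost &lt; 5</P>' | strip_markup
```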

entify

The relatively tiny entify program translates Latin-1 high alphabetic characters in a file to HTML ampersand entities for safety when moving the file through non-8-bit-safe transport mechanisms (principally non-MIME RFC-822 e-mail and Usenet). This is for the greater convenience of those writing European languages with editors which use Latin-1 characters; entify can be run just before distributing an HTML file externally.

Typical command lines:

awk -f entify.awk infile.8bit > outfile.html

perl entify.pl infile.8bit > outfile.html

(Note that entify doesn't help in checking whether an HTML file is OK, but is rather used as a precautionary measure to prevent the file from being mangled by archaic 7-bit software.)

metachar

This relatively trivial script protects the HTML/SGML metacharacters `&', `<' and `>' by replacing them with the appropriate ampersand entity references; it is useful for importing plain text into an HTML file. Typical command lines:

awk -f metachar.awk infile.text > outfile.htmltext

perl metachar.pl infile.text > outfile.htmltext
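
The transformation that metachar performs can be approximated with a one-line sed filter. This is a sketch of the idea only, not the script shipped in the package, and the function name protect_metachars is invented for the example.

```shell
# Replace the three HTML/SGML metacharacters with entity references.
# "&" must be handled first; otherwise the "&" introduced by "&lt;"
# and "&gt;" would itself be escaped a second time.
protect_metachars() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

printf '%s\n' 'if a < b && b > c' | protect_metachars
```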

While dehtml and entify aren't primarily error-checking programs, if they do happen to find errors connected with their functioning, then the error messages are on lines beginning "&&^" which are intermixed with the non-error output.

Supplemental link extraction programs: makemenu and xtraclnk.pl

makemenu

This program creates a simple menu for HTML files specified on the command line; the text in each input file's <TITLE>...</TITLE> element is placed in a link to that file in the output menu file. If the toc=1 command-line option is specified, makemenu also includes a simple table of contents for each input file in the menu, based on the file's <H1>-<H6> headings, interpreted as a system of sub-sections, with appropriate indenting.

If there are links inside headings, then makemenu will attempt to preserve the validity of <A HREF="..."> references, and transform an <A NAME="..."> into an <A HREF="..."> link to the heading from the menu file; however, makemenu is limited by the fact that it does not examine each <A> tag in a heading individually, but only does global search-and-replace operations on the whole <Hn>...</Hn> element (for this reason, the values of <A HREF=> and <A NAME=> are only operated on if they are quoted).

In general, makemenu is a small and somewhat simple program, and not an error-checker, so it is possible to confuse it by giving it erroneous or bizarrely-formatted HTML input. The following are typical command lines (makemenu.sh is a Unix shell script to run makemenu.awk under the best available awk interpreter, and with options checking):

awk -f makemenu.awk [options] infiles.html > menu.html

perl makemenu.pl [options] infiles.html > menu.html

sh makemenu.sh [options] infiles.html > menu.html

Further documentation is included as comments at the beginning of the makemenu.awk and makemenu.pl source files.
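
The basic idea behind makemenu (take each file's <TITLE> text and wrap it in a link to that file) can be sketched in a few lines of shell. Everything here is an assumption made for illustration: the function name menu_entry, the <LI> layout of the output line, and the requirement that an uppercase <TITLE>...</TITLE> sits on a single line. The real program's output format and parsing are not shown in this document.

```shell
# Print one menu line for an HTML file: its <TITLE> text in a link to the file.
# Assumes an uppercase <TITLE>...</TITLE> on one line; the <LI> layout is invented.
menu_entry() {
    title=$(sed -n 's/.*<TITLE>\(.*\)<\/TITLE>.*/\1/p' "$1" | head -n 1)
    printf '<LI><A HREF="%s">%s</A>\n' "$1" "$title"
}

# Try it on a throwaway file in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' '<HTML><HEAD><TITLE>My Page</TITLE></HEAD>' > "$tmpdir/page.html"
( cd "$tmpdir" && menu_entry page.html )
```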

xtraclnk.pl

This program extracts links and anchors from HTML files, and isolates text contained in <A> and <TITLE> elements. It copies <A HREF="...">Text</A> references from an input file to the output, and takes the text in <TITLE>Text</TITLE> elements and <A NAME="...">Text</A> anchors in an input file, and outputs them as references to the input file in which they were found. So the output of xtraclnk.pl contains basically only <A HREF="...">Text</A> links (which can be optionally sandwiched by a minimal HTML header and footer, so that the output is itself a valid HTML file).

This was suggested by an idea of John Harper at Toronto U.; what he had in mind, I think, was to use this as part of a CGI script which would dynamically construct an HTML document with links to all files with a title or anchors that contain text matching a user-specified search pattern. However, xtraclnk.pl also has some value as an HTML style debugging tool: if you have used a lot of context-dependent titles like "Intro", and meaningless link text like "Click Here", this will be very apparent when you view the HTML document (derived with xtraclnk.pl using the title= option) which contains only the text inside titles and anchors in your other HTML documents. This program can also be used to enforce consistency in link text: if there is random variation between different <A HREF="...">LinkText</A> elements which all point towards the same resource, this will be apparent when the output of xtraclnk.pl is sorted.

Also, by looking over the sorted output of xtraclnk.pl, it becomes relatively easy to detect mistaken links, that point to someplace other than what was intended.

Further documentation is in comments at the beginning of the source file itself (command line options other than title= and loc= work the same way as the options of htmlchek).

Limitations:

The classification of each problem as being an "ERROR!" or a "Warning!" is sometimes a somewhat subjective decision on my part.

Htmlchek does not pretend to handle the full range of constructs that the HTML 2.0 declaration and DTD would make available in SGML, but which are not understood by most HTML applications. Thus the weird and wonderful syntactic features enabled by MINIMIZE SHORTTAG YES in the HTML declaration are not supported (for example, "<><HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<FORM/<TEXTAREA NAME=a ROWS=1 COLS=9/<></>///</>" is technically a completely valid HTML 2.0 document, with no tags omitted!). And SGML meta-constructs such as <!ENTITY...> etc. are not checked for, and can result in error messages.

Ampersand codes like &amp; are not checked against any fixed list of approved HTML/SGML entities, since such lists are in principle rather extensible. If you want to check if any ampersand entities outside of the currently most-commonly recognized HTML set are used, then you can run a file through the separate program dehtml and see if any ampersand codes are left (on Unix, do something like "sh dehtml.sh infile.html | egrep '&[0-9a-zA-Z.]*;'").
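
The pattern in that pipeline can be tried on its own. The sketch below feeds a few invented lines, standing in for dehtml output, through the same regular expression; grep -E is used as the equivalent of egrep, and the function name leftover_entities is hypothetical.

```shell
# Print lines that still contain something shaped like an ampersand entity.
leftover_entities() {
    grep -E '&[0-9a-zA-Z.]*;'
}

# Sample text standing in for dehtml output (invented content).
printf '%s\n' 'plain line' 'an unknown &foo.bar; code' 'fish & chips' \
    | leftover_entities
```

Only the second sample line is printed, since a bare `&' followed by a space does not match the pattern.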

This program doesn't check that tag options (attributes) such as ALIGN actually have values taken from the approved set (since no part of the evolving HTML standards is more fluid); but as an aid to typo-detection, the option=value pairs in which the value is unquoted (such as ALIGN=BOTTOM) are output as part of each file's tag diagnostics; this allows you to pick out incorrect pairs like ALIGN=BOTOMM. (In order to get a single set of tag diagnostics for multiple files, you can do something like "cat *.html | awk -f htmlchek.awk" on Unix -- however, this allows the errors in one file to affect the interpretation of following files.)
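
The typo-spotting idea can be imitated outside htmlchek with standard tools. The sketch below is not htmlchek's own diagnostic output: it simply pulls anything shaped like an unquoted option=value pair out of some invented sample lines and counts each distinct pair, so that a one-off spelling stands out. It assumes a grep that supports the -o option, and it would also pick up `=' signs in ordinary text.

```shell
# List unquoted option=value pairs with a count of how often each occurs.
# Quoted values are skipped because the value may not start with '"'.
unquoted_pairs() {
    grep -oE '[A-Za-z]+=[^" >]+' | sort | uniq -c
}

printf '%s\n' \
    '<IMG SRC="a.gif" ALIGN=BOTTOM>' \
    '<IMG SRC="b.gif" ALIGN=BOTOMM>' \
    '<IMG SRC="c.gif" ALIGN=BOTTOM>' | unquoted_pairs
```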

Only the commonly-used double-quote character (") is recognized in quoting option values, though the single-quote character (') should theoretically also be able to be used.

Htmlchek tries to enforce the <HTML><HEAD>...</HEAD> <BODY>...</BODY> </HTML> format; if you don't want to add these constructs to your files, then you'll have to learn to ignore the warnings that will inevitably be produced. (Marcus E. Hennecke <marcush@leland.stanford.edu> has posted a perl script, old2newhtml, which automatically adds these tags to an HTML file.) However, I've tried to cut down the output, so that a warning is generally produced only for the first item of each type that is not contained in an <HTML>, <HEAD>, or <BODY> element, and not for every uncontained item in a file.

Error messages are output at the earliest point at which an error becomes detectable, which is not necessarily always the point where the bad code actually occurs (this is particularly true for "improper nesting errors" -- i.e. where there is a pending <x>, so that </x> is expected, but </y> is found instead). Complaints about encountering text where no text should be found (such as immediately within a <HEAD>...</HEAD> element) are deferred until the start of the first following tag.
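
Why a nesting error is reported at the closing tag rather than at the bad opening tag can be seen from a toy stack-based checker. This is a sketch of the general technique, not htmlchek's code: the function name check_nesting and the wording of the message are invented, every tag is assumed to be strictly pairing (so real tags such as <IMG> or <P> would confuse it), and a grep that supports the -o option is assumed.

```shell
# Toy nesting check: push each opening tag, and on each closing tag
# compare it with the most recent pending opening tag.
check_nesting() {
    grep -oE '</?[A-Za-z0-9]+' | awk '
        /^<\// {
            name = toupper(substr($0, 3))
            if (n > 0 && stack[n] == name) {
                n--                      # matches the pending tag: pop it
            } else {
                printf "expected </%s> but found </%s>\n", (n > 0 ? stack[n] : "?"), name
            }
            next
        }
        { stack[++n] = toupper(substr($0, 2)) }   # opening tag: push it
    '
}

printf '%s\n' '<B><I>text</B></I>' | check_nesting
```

The mistake in the sample input is only noticed when </B> arrives while <I> is still pending, which is the first point at which it becomes detectable.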

As with almost any parser or lint-type program that doesn't just give up at the first error, the presence of one real error can generate a cascade of subsequent bogus errors. I've tried to eliminate some of the more redundant repeated errormessages that earlier versions of this program tended to generate in such cases. However, it is still true that sometimes htmlchek can't compensate for an error (particularly a self-nesting error), so that the invalid HTML code it has encountered affects its interpretation of valid HTML code later on in the file -- and some of the subsequent errormessages and warnings for that file may not be useful. The only remedy is to fix the first error (which is the real one), and run the check again.

Checking the same HTML file twice with htmlchek.awk can result in a different ordering of the final tag diagnostics etc., due to the indeterminacy of the awk "for (x in array) {...}" looping construct. In general, wherever there is a list in the output, the list will be sorted with htmlchek.pl and unsorted with htmlchek.awk.
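
The effect is easy to reproduce with any awk: the order in which "for (x in array)" visits the keys is not specified, so a list printed that way is only stable if it is sorted afterwards. The sketch below uses invented keys and the hypothetical function name list_keys; piping through sort is one simple way to get repeatable output, though it is not something htmlchek.awk does for you.

```shell
# Print the keys of an awk array in whatever order awk chooses.
list_keys() {
    awk 'BEGIN {
        a["TITLE"] = 1; a["BODY"] = 1; a["HEAD"] = 1
        for (x in a) print x
    }'
}

# Sorting afterwards makes the listing repeatable from run to run.
list_keys | sort
```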

If you run cross-reference checking on a non-Unix operating system, there won't be any problem for files in a single directory, but cross reference checking across a directory hierarchy does take advantage of the fact that the URL format coincides with Unix-specific filename conventions. For example, if an <A NAME="XXX"> reference occurs in a file which is not in the directory at the top of the tree in which HTML files are being checked, then what is output to the .NAME file is the concatenation of the string specified in the dirprefix= option, with the name of the current file on the command line (after removing any initial "./" generated by the Unix find utility, and also any prefix specified with the subtract="..." option), and then with the `#' character, and finally with the NAME="..." string. On Unix, if the dirprefix= string is a valid URL beginning (for example "http://myhost.edu/~myself/"), then the result of this concatenation will also be a valid URL (e.g. "http://myhost.edu/~myself/subdir/somefile#XXX"). On VMS this would produce "http://myhost.edu/~myself/[.subdir]somefile#XXX" or worse. For HREF="..." references, things can get more complicated. Under MS-DOS, you can get around this problem by using forward slashes `/' in filenames on the command line, not backslashes (`\').
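
The concatenation described above can be written out as a small shell function. This is a sketch of the rule as stated in the text, not code taken from htmlchek; the function name name_url and its argument order are invented, and the order in which the "./" and the subtract= prefix are removed is an assumption.

```shell
# Build the string written to the .NAME file for an <A NAME="..."> anchor.
# Arguments: dirprefix, file name as given on the command line, NAME value,
# and optionally a subtract= prefix to strip from the file name.
name_url() {
    dirprefix=$1
    file=${2#./}              # drop an initial "./" as generated by find
    file=${file#"${4-}"}      # drop the subtract= prefix, if one was given
    printf '%s%s#%s\n' "$dirprefix" "$file" "$3"
}

name_url 'http://myhost.edu/~myself/' './subdir/somefile' 'XXX'
```

With the Unix-style file name from the example above, this prints the same URL as given in the text; a VMS-style name would pass through unchanged, which is exactly the problem described.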

Version History:

The version history is contained in the file htmlchek.awk

Author:

Copyright H. Churchyard 1994, 1995 -- freely redistributable. This code is functional but not very well commented (and readability has not been improved in htmlchek.awk by pushing all lines which would overflow 80 columns flush left) -- sorry! If you find an error in this program, e-mail me at churchh@uts.cc.utexas.edu.

Common Problems:

If you get an awk error under Unix, the most common problem that people seem to be having is inadvertently running incompatible ``old awk''; also, some vendor-supplied awks under Unix have problems (which can be avoided by using gawk -- see the file README.40). Try using htmlchek.sh (or dehtml.sh and makemenu.sh for dehtml.awk and makemenu.awk) and see if the problem goes away.

Meaningless Bitmap Graphic:

No Web document would be complete without including a meaningless bitmap graphic.
