htmlchek.awk, htmlchek.pl - Syntactically checks HTML 2.0 or 3.0 files for a number of possible errors; can do local link cross-reference checking, and generate a rudimentary reference-dependency map. Runs under awk or perl. Includes a number of supplemental utilities for HTML file processing.
awk -f htmlchek.awk [options] infiles.html > outfile.check
perl htmlchek.pl [options] infiles.html > outfile.check
The options are in the form "option=value" (see the sections ``Command-line Options'' and ``Language Customization Options'' below). The following is an alternative invocation of htmlchek.awk under Unix (to ensure, as far as possible, that the program is not run under incompatible ``old awk''):
sh htmlchek.sh [options] infiles.html > outfile.check
(If the files htmlchek.awk, htmlchek.pl, or htmlchek.sh are not in the current directory, the pathname to where they are located will have to be prefixed -- but see ``shell scripts'' below.)
This program checks for quite a number of possible defects in HTML (HyperText Markup Language) version 2.0 SGML files used on the World-Wide Web. (Files with Netscape extensions, or with features from the preliminary Arena/HTML 3.0 document, can also be checked by specifying the appropriate options, as explained below.) Diagnostic messages are output to STDOUT, and so generally appear on the terminal/window unless they are redirected to an output file, as is done in the examples given above (of course, this all depends on the operating system -- the Macintosh doesn't even have a "command line" as such, but you can set up "droplets" with MacPerl).
The output of htmlchek is divided into two parts, for each file checked:
A very limited form of cross-reference checking (making sure that file-local <...HREF="#..."> references actually exist) is automatically performed within each file; for larger-scale cross-reference checking see the appropriate section below.
If you process more than one file at a time (by specifying multiple files or wildcards on the command line, e.g. ``perl htmlchek.pl *.html'' or ``awk -f htmlchek.awk *.html''), then errors are located by filename and line number.
Most of the error and warning messages should be fairly self-evident, assuming a familiarity with the basic HTML language documentation; the following is a basic glossary of terms used (note that tag "options" are what are called "attributes" in SGML):
One warning that may be obscure, "Jump from header level H0", means that the first heading in the file is not <H1>; to be consistent with a system of sub-sections, the first heading should be <H1>, there should only be one <H1> heading in a file, and the heading level should never increase in value by more than 1 between two successive headers, as in the following scheme:
 _____________________________________________________
|                  [whole document]                   |
| H1 ------------------------------------------------ |
|_____________________________________________________|
|             [first-level subdivisions]              |
| H2 ------------ | H2 ------------ | H2 ------------ |
|_________________|_________________|_________________|
|            [second-level subdivisions]              |
| H3 --- | H3 --- | H3 --- | H3 --- | H3 --- | H3 --- |
|________|________|________|________|________|________|
etc.
To check whether or not the headings in a file reflect the file's logical organization, run the file through makemenu with the toc=1 command-line option, and see what you get.
An error that can sometimes be counter-intuitive is a ``<LI> outside list'' or ``<DT>/<DD> outside <DL>...</DL>'' error: in the sequence <UL><B><LI></B></UL>, the <LI> is actually not in the list, since it is not immediately contained within the <UL>...</UL> element (but is rather immediately contained within the non-list <B>...</B> element).
The htmlchek program performs a fairly comprehensive job of checking for HTML errors, but does not always exactly follow the official standard. Bad stylistic practices are warned against, as well as actual HTML errors, and in some cases htmlchek is stricter than the standard, in order to accommodate the peculiarities of some browsers (the idea is that HTML code should be ruggedized for the real world, not just SGML-ically correct -- see below under ``nowswarn=1'', ``nonrecurpair='', ``metachar='', and SHORTTAG). And htmlchek is also laxer than the standard in allowing <ADDRESS>, <HR>, and <H1>-<H6> headings within <LI> and <DD> list items (since these tags can occur legitimately within a <BLOCKQUOTE>...</BLOCKQUOTE> element itself in a list item -- though they are not supposed to occur directly within a list item). Similarly htmlchek does not check for <IMG> directly in <PRE> (which is not allowed by the official standard), since the standard does allow <IMG> indirectly in <PRE>. See further under ``Limitations'' below.
Options are in the form "option=value" (where the `=' should not have spaces on either side of it); options should be specified on the command line PRECEDING any names of HTML files to check (see ``Typical Command Lines'' above). Options which follow filenames will not necessarily take effect (they are silently ignored in Posix-compliant awk, and generate an error in perl). Also, misspelled options will be silently ignored in awk. (On Unix, the shell scripts htmlchek.sh and htmlchkp.sh automatically check for command-line option errors, so you don't have to worry about these problems.)
Options that affect the definition of the HTML language used to interpret and check files are discussed in the ``Language Customization Options'' section below; the other options are "inline=1", "nowswarn=", "sugar=", "xref=", "map=", "refsfile=", "append=", "dirprefix=", "usebase=", and "subtract=".
The three options "inline=1", "nowswarn=", and "sugar=1" control features of the output of htmlchek:
These options are connected with details of multi-file cross-reference checking; if you intend to do such cross-reference checking using the run?chek.sh shell scripts under Unix, you can ignore these options and jump to the next section below.
Here ``cross-reference checking'' does not mean traversing the Web and finding out whether off-site remote URL's actually exist. It only means gathering together all the locations and references in a local HTML file, or a collection of local HTML files, and finding all the locations which are unreferenced within these files, and all the references which are not to a location within this collection. (A dependency map of what HTML files reference what resources can also be optionally generated.) You should generally delay cross-reference checking until you have more or less debugged your HTML files and corrected syntactically malformed references.
The programs htmlchek.awk and htmlchek.pl do cross-reference checking when the xref=1 command-line option is specified. Under Unix or Posix 1003.2, the shell scripts runachek.sh (for cross-reference checking with htmlchek.awk) and runpchek.sh (for cross-reference checking with htmlchek.pl) take care of many of the details of specifying htmlchek command-line options for cross-reference checking, and invoke the Unix find utility to look at all the *.html files in a directory hierarchy. These scripts have the following syntax (where ``dirprefix'' and ``outfileprefix'' stand for the first two shell script command-line parameters, the presence of which is obligatory):
sh runachek.sh dirprefix outfileprefix [directory] [options]
sh runpchek.sh dirprefix outfileprefix [directory] [options]
The third, optional, command-line parameter ``directory'' is the path to the top directory of the tree in which all .html files are to be checked (for example, "$HOME/public_html"); if this parameter is not present, then the current directory is used by default. This path should not end with a trailing `/' character. If you specify an incorrect directory path and get an error message from the Unix `find' utility, then the awk or Perl interpreter may end up looking for input from STDIN (the keyboard); press control-D to get back to your shell.
The first parameter ``dirprefix'' should be either the null string (''), or the beginning of an absolute or quasi-absolute URL pathname. What ``dirprefix'' should be depends on how the files you are checking cross-refer to each other with <... HREF="..."> links. If the files refer to each other strictly with simple relative URL's (i.e. ones which do not begin with "//" or "/" -- ignoring the optional access-method prefix), such as "subdir/otherfile.html#section1", then you don't need the ``dirprefix'' mechanism, and you can get away with specifying the first parameter of run?chek.sh as the null string (and can skip the rest of this paragraph). However, if you have non-relative URL's in your cross references, then you need to specify a ``dirprefix'' (note that if there are files in more than one directory, and files in subordinate directories refer to files further up the hierarchy, then you may want to use non-relative URL's, since while "../" is legitimate in a URL, relative URL's beginning with "../" can sometimes cause problems). The value to use for ``dirprefix'' should be the string used, in cross-references among your files with non-relative URL's, to refer to the root of the tree of files being checked (i.e. the value of the optional ``directory'' parameter, or, if this is not specified, the current directory when run?chek.sh is being run). Whichever type of non-relative URL your documents use for this purpose (whether a host-local reference like "/~myself/subdir/", a full reference with access method like "http://myhost.edu/~myself/subdir/", or any intermediate form), you should use the appropriate prefix as your ``dirprefix'' string; if your files use an inconsistent mixture of these different reference types, then no single ``dirprefix'' can work, and cross-reference checking will partially fail.
Finally, if each file has its own name specified in a <BASE HREF="..."> reference, you can let ``dirprefix'' be the null string, and use the option usebase=1.
The second parameter ``outfileprefix'' is the name of the files (with the extensions ".ERR", ".NAME", ".HREF", and ".SRC" -- and also ".MAP" if the map=1 parameter is included on the command line) in which the output of the HTML-checking and cross-referencing process will be put.
After these parameters, optional parameters that follow on the remainder of the command line can be any of the "option=value" pairs discussed in the ``Command-line Options'' section above or the ``Language Customization Options'' section below (except for refsfile=, dirprefix=, xref=1, and subtract=, which are specified within the run?chek.sh scripts, and listfile=).
So the following are some typical command lines (remember that putting the name of a shell script first in a command line, as in the first example, implies that you have set execute permission by running chmod):
runpchek.sh http://uts.cc.utexas.edu/~churchh/ check configfile=example.cfg
sh runachek.sh '' out $HOME/public_html map=1 &
The second example shows how cross-reference checking may be run as a background process.
If no error has occurred, then when the shell script has finished, non-cross-referencing error-check data is in the file ``outfileprefix.ERR'', locations in the checked files which were not referenced from within these files are in a file named ``outfileprefix.NAME'', references from the checked files which are not to locations found within the files are in ``outfileprefix.HREF'', and references to in-line images are in ``outfileprefix.SRC''.
If ``outfileprefix.HREF'' and ``outfileprefix.NAME'' have file lengths greater than one, this does not necessarily signal an error: ``outfileprefix.HREF'' will contain not only `dangling' references to local HTML files, but also references to non-inline images, sounds, etc., and external references (to files not in the directory tree being checked, including files on other WWW sites) as well. Similarly, the file ``outfileprefix.NAME'' contains locations which are not referenced locally, but these locations might be referenced from outside the local directory tree.
It would be nice to check for the existence of local images listed in ``outfileprefix.SRC'' (and also the local non-inline images in ``outfileprefix.HREF''), but in general the references to these images are in URL format there (rather than in local filesystem format), so that there is no way to do this at the Unix shell level.
If you have several directory trees of HTML files which cross-refer, and each hierarchy needs a different ``dirprefix'', you can still do cross-reference checking, if you run cross-reference checking for each individual directory tree, specifying the same output file names for each run, and use "append=1" as one of the options. (You can even do cross-reference checking across multiple machines, if you have an account on each machine, and transfer the cumulative .NAME, .HREF, and .SRC files to each machine before running local cross-reference checking on that machine -- of course, the ``dirprefix'' string on each machine will have to include a hostname for this to work.) Under Unix, the rducfil?.sh shell scripts will then reduce the resulting .HREF and .NAME files, so that they only contain unresolved references, by removing all items which are contained in both the files (these shell scripts will also sort the .SRC file and collapse duplicate entries). The rducfil?.sh scripts take only one command-line parameter, the common prefix of the .NAME, .HREF, and .SRC files to be resolved.
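The core reduction step can be sketched with sort and comm (a sketch only: the .NAME and .HREF file contents below are invented for illustration, and the real rducfil?.sh scripts additionally sort the .SRC file and collapse its duplicate entries):

```shell
# Invented example data: one location/reference pair that resolves
# ("a.html#top"), one unresolved reference, one unreferenced location.
printf 'a.html#top\na.html#end\n' | sort > out.NAME
printf 'a.html#top\nb.html\n'     | sort > out.HREF
# Items appearing in both files are resolved; keep only the rest.
comm -13 out.NAME out.HREF   # references with no local target: b.html
comm -23 out.NAME out.HREF   # locations never referenced: a.html#end
```

Since comm requires sorted input, both files are sorted first; this is why specifying the same output file names with "append=1" across several runs still reduces correctly.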
Another way of doing cross-reference checking across multiple directory trees on a single machine is to generate an external file containing a list of the HTML files to be checked; this external list can then be specified by the listfile= or lf= command-line option, as explained in the ``MS-DOS'' section below (though the listfile=/lf= parameter is not MS-DOS specific).
By default, htmlchek checks HTML files more or less according to version 1.24 of the HTML 2.0 standard (with some departures, as discussed above). However, htmlchek is not limited to checking HTML files according to a single language definition.
The following options add extensions to the HTML language checked for:
The HTML language will continue to evolve, the HTML 3.0 definition is still preliminary, and the Netscape extensions document is unclear on some points (and uses the word "tag" rather confusingly) -- so the language definitions coded in the htmlchek program are clearly not cast in stone. For this reason I have also provided htmlchek with the following command-line or configuration-file options to customize many features of how htmlchek treats individual tags, and thus the language that is checked for:
If a new tag is declared with any of the preceding four options, it becomes a "known" tag to htmlchek. The options in the following sub-section should only be applied to tags which have been declared in this way (or are already known to htmlchek), or the results may not be what you expect.
Beware that some of the above definitions have the effect of undefining things that are incompatible with what you are defining (to avoid logical inconsistencies). For example, if you define "lowlevelpair=p", then the tag <P> will be undefined as a loosely-pairing tag (since this is incompatible with ``lowlevelpair'' status). This means it will be treated as an unknown tag, unless you add an explicit "strictpair=p" or "nonrecurpair=p" declaration.
Since it is cumbersome to specify long strings on the command line, there is an alternative configuration file mechanism. Specifying configfile=``filename'' on the command line will cause htmlchek to read in options from the file (cf= is recognized as an abbreviated synonym for configfile=). The same "option=value" units that are recognized on the command line should be specified one per line in the configuration file (note that all lines in the configuration file which do not contain the `=' character are treated as comment lines and silently ignored).
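For instance, a configuration file might look like the following (a sketch: the file name mysite.cfg and the particular option choices are invented for illustration, but the one-option-per-line syntax is as described above):

```shell
# Create a hypothetical configuration file.  Any line without an
# equals sign is treated as a comment and silently ignored.
cat > mysite.cfg <<'EOF'
Configuration for my site -- this whole line is a comment.
netscape=1
strictpair=NEWTAG
EOF
# It would then be used with a command line such as:
#   awk -f htmlchek.awk cf=mysite.cfg infile.html > outfile.check
grep -c '=' mysite.cfg    # only the two option lines contain `='
```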
Two sample configuration files are included in the htmlchek distribution, example.cfg and html2dtd.cfg. If html2dtd.cfg is invoked (by using configfile=html2dtd.cfg on the command line), then htmlchek conforms more strictly to the official HTML 2.0 DTD (following the SGML treatment of the `<' and `>' characters, and allowing low-level mark-up tags to self-nest).
There are some differences between specifying options on the command line and in the configuration file. On the command line, if there are multiple instances of the same "xxx=" option, all but the last will be silently ignored, but in the configuration file such multiple definitions will have cumulative effect. Also the relative order of evaluation on the command line is undefined (if you have both a "strictpair=p" and a "nonrecurpair=p" definition on the command line, you don't know which will override the other), while the order of statements in a configuration file is significant, since later definitions will override previous ones. Also, there can be no spaces or tabs around the `=', `,' or `:' characters on the command line, but this requirement is relaxed in the configuration file.
You can include definitions both on the command line and in the configuration file, in which case command line definitions will override those in the configfile= (specify an "arena=off" on the command line to override an "arena=1" in the configuration file, and similarly with html3=, htmlplus=, and netscape=). The internal definitions invoked by "arena=1" etc. and "netscape=1" will override definitions specified in the configuration file, but not those on the command line.
Note that the options discussed in the ``Command-line Options'' section above (append=, dirprefix=, refsfile=, sugar=, and usebase=) cannot be specified in the configuration file (nor, obviously, can configfile= or cf= itself be specified there). This is because the configfile= is a language definition file, not a user preference file. (If I ever implement a user preference file in a future version of htmlchek, it will be separate from the configfile=.) Since nowswarn= is actually a language configuration option, it can be specified in the configuration file.
dehtml removes all HTML markup from a file so you can spell-check the darn thing. The more common ampersand entities are translated to the appropriate single characters, so you can spell-check even if you're writing in a non-English language, provided your spelling checker understands 8-bit Latin-1 alphabetic characters. Note that dehtml makes no pretensions to being an intelligent HTML-to-text translator; it completely ignores everything within <...>, and passes everything outside <...> through completely unaltered (except known ampersand entities).
Typical command lines:
awk -f dehtml.awk infile.html > outfile.txt
perl dehtml.pl infile.html > outfile.txt
The shell script file dehtml.sh runs dehtml.awk using the best available interpreter (under Unix):
sh dehtml.sh infile.html > outfile.txt
This program processes all files on the command line to STDOUT; to process a number of files individually, use the iteration mechanism of your shell; for example:
for a in *.html ; do awk -f dehtml.awk $a > otherdir/$a ; done
in Unix sh, or:
for %a in (*.htm) do call dehtml %a otherdir\%a
in MS-DOS, where dehtml.bat is the following one-line batch file:
gawk -f dehtml.awk %1 > %2
While dehtml isn't primarily an error-checking program, if it does happen to find errors connected with its functioning (or encounter HTML code beyond its capacity to handle), then the error messages are on lines beginning "&&^" which are intermixed with the non-error output.
The relatively tiny entify program translates Latin-1 high alphabetic characters in a file to HTML ampersand entities for safety when moving the file through non-8-bit-safe transport mechanisms (principally non-Mime RFC-822 e-mail and Usenet). This is for the greater convenience of those writing European languages with editors which use Latin-1 characters; entify can be run just before distributing an HTML file externally.
Typical command line:
awk -f entify.awk infile.8bit > outfile.html
perl entify.pl infile.8bit > outfile.html
(Note that entify doesn't help in checking whether an HTML file is OK, but is rather used as a precautionary measure to prevent the file from being mangled by archaic 7-bit software.)
This relatively trivial script protects the HTML/SGML metacharacters `&', `<' and `>' by replacing them with the appropriate ampersand entity references; it is useful for importing plain text into an HTML file. Typical command lines:
awk -f metachar.awk infile.text > outfile.htmltext
perl metachar.pl infile.text > outfile.htmltext
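The same protection can be sketched as a sed one-liner (an illustration, not the actual metachar implementation; note that `&' must be handled first, or the replacements generated for `<' and `>' would themselves be corrupted):

```shell
# Invented sample input: plain text containing all three metacharacters.
printf 'if (a < b && b > c)\n' > snippet.txt
# Replace & first, then < and >, producing HTML-safe text.
sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' snippet.txt
```

This prints "if (a &lt; b &amp;&amp; b &gt; c)", which can be pasted into an HTML file verbatim.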
This program creates a simple menu for HTML files specified on the command line; the text in each input file's <TITLE>...</TITLE> element is placed in a link to that file in the output menu file. If the toc=1 command-line option is specified, makemenu also includes a simple table of contents for each input file in the menu, based on the file's <H1>-<H6> headings, interpreted as a system of sub-sections, with appropriate indenting.
If there are links inside headings, then makemenu will attempt to preserve the validity of <A HREF="..."> references, and transform an <A NAME="..."> into an <A HREF="..."> link to the heading from the menu file; however, makemenu is limited by the fact that it does not examine each <A> tag in a heading individually, but only does global search-and-replace operations on the whole <Hn>...</Hn> element (for this reason, the values of <A HREF=> and <A NAME=> are only operated on if they are quoted).
In general, makemenu is a small and somewhat simple program, and not an error-checker, so it is possible to confuse it by giving it erroneous or bizarrely-formatted HTML input. The following are typical command lines (makemenu.sh is a Unix shell script to run makemenu.awk under the best available awk interpreter, and with options checking):
awk -f makemenu.awk [options] infiles.html > menu.html
perl makemenu.pl [options] infiles.html > menu.html
sh makemenu.sh [options] infiles.html > menu.html
Further documentation is included as comments at the beginning of the makemenu.awk and makemenu.pl source files.
This program extracts links and anchors from HTML files, and isolates text contained in <A> and <TITLE> elements. It copies <A HREF="...">Text</A> references from an input file to the output, and takes the text in <TITLE>Text</TITLE> elements and <A NAME="...">Text</A> anchors in an input file, and outputs them as references to the input file in which they were found. So the output of xtraclnk.pl contains basically only <A HREF="...">Text</A> links (which can be optionally sandwiched by a minimal HTML header and footer, so that the output is itself a valid HTML file).
This was suggested by an idea of John Harper at Toronto U.; what he had in mind, I think, was to use this as part of a CGI script which would dynamically construct an HTML document with links to all files with a title or anchors that contain text matching a user-specified search pattern. However, xtraclnk.pl also has some value as an HTML style debugging tool: if you have used a lot of context-dependent titles like "Intro", and meaningless link text like "Click Here", this will be very apparent when you view the HTML document (derived with xtraclnk.pl using the title= option) which contains only the text inside titles and anchors in your other HTML documents. This program can also be used to enforce consistency in link text: if there is random variation between different <A HREF="...">LinkText</A> elements which all point towards the same resource, this will be apparent when the output of xtraclnk.pl is sorted.
Also, by looking over the sorted output of xtraclnk.pl, it becomes relatively easy to detect mistaken links, that point to someplace other than what was intended.
Further documentation is in comments at the beginning of the source file itself (command line options other than title= and loc= work the same way as the options of htmlchek).
The files included in this package whose names end with ".sh" are shell scripts for greater ease of use under Unix or Posix 1003.2. However, nothing in the checking or supplemental programs themselves depends on the Unix operating system (with one minor exception -- see ``Sub-directory hierarchy cross-reference checking'' below), so that htmlchek.awk and htmlchek.pl can be run on any system where an awk or perl interpreter is available.
These Unix shell scripts are typically invoked by means of a command line of the form ``sh scriptname [script-options]'', with possible additional shell parameters such as output redirection (with the `>' character) or background execution (with the `&' character -- e.g. "sh htmlchek.sh infiles.html > outfile"). If you have set execute permission on the script files (e.g. by means of "chmod +x *.sh"), and the directory where they reside is specified in the PATH environment variable, then you can omit "sh" from the beginning of the command line.
For the shell scripts htmlchek.sh, htmlchkp.sh, runachek.sh, and runpchek.sh, if htmlchek.awk or htmlchek.pl is not in the current directory when the shell script is run, the environment variable HTMLCHEK should be set to the name of the directory where the program is located. See your shell documentation for information about how to set environment variables ("setenv HTMLCHEK /somedir/" in csh and tcsh, "HTMLCHEK=/somedir/; export HTMLCHEK" in sh and its offspring). The HTMLCHEK environment variable is also used for the shell scripts that run supplemental programs, namely makemenu.sh and dehtml.sh. (The rducfil?.sh scripts only use "one-liner" awk and perl programs, specified within the script files themselves, and so do not need $HTMLCHEK in the environment.)
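For example, in a Bourne-style shell (the directory /usr/local/lib/htmlchek/ is a purely hypothetical install location):

```shell
# Point the wrapper scripts at the directory holding htmlchek.awk
# and htmlchek.pl, then run the checker from any working directory.
HTMLCHEK=/usr/local/lib/htmlchek/
export HTMLCHEK
# sh htmlchek.sh infiles.html > outfile.check
echo "$HTMLCHEK"
```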
These shell scripts return 1 on exit if some detectable error occurred, and output an appropriate error message to STDERR; otherwise they return 0.
If you run cross-reference checking on a non-Unix operating system, there won't be any problem for files in a single directory, but cross reference checking across a directory hierarchy does take advantage of the fact that the URL format coincides with Unix-specific filename conventions. For example, if an <A NAME="XXX"> reference occurs in a file which is not in the directory at the top of the tree in which HTML files are being checked, then what is output to the .NAME file is the concatenation of the string specified in the dirprefix= option, with the name of the current file on the command line (after removing any initial "./" generated by the Unix find utility, and also any prefix specified with the subtract="..." option), and then with the `#' character, and finally with the NAME="..." string. On Unix, if the dirprefix= string is a valid URL beginning (for example "http://myhost.edu/~myself/"), then the result of this concatenation will also be a valid URL (e.g. "http://myhost.edu/~myself/subdir/somefile#XXX"). On VMS this would produce "http://myhost.edu/~myself/[.subdir]somefile#XXX" or worse. For HREF="..." references, things can get more complicated. Under MS-DOS, you can get around this problem by using forward slashes `/' in filenames on the command line, not backslashes (`\') -- see the next section.
Under MS-DOS, there is a problem with checking all the HTML files in a directory hierarchy at once, as is done in the run?chek.sh scripts with the Unix find utility, especially since the MS-DOS command line is limited to 126 characters. (Also, some MS-DOS ports of awk do not have a command-line wildcard-expansion library built in, so that one cannot use wildcards to say "awk -f htmlchek.awk *.html".)
For these reasons, htmlchek can read an external file containing a list of names of HTML files to be checked; this external list is specified with the listfile= command-line option (which can also be abbreviated as lf=). (While particularly useful for MS-DOS, the listfile=/lf= option can be used on other platforms.) The list file can be generated automatically by Rahul Dhesi's stuff utility (which uses forward slashes `/', appropriate for multi-directory cross-reference checking with htmlchek), available at ftp://oak.oakland.edu/SimTel/msdos/zoo/stuff2.zip, with a command line of the following form:
stuff [directory1] [directory2] ... -name *.htm > list
Note that in awk the value of listfile= must be available within the BEGIN{...} block; this means that with Posix-1003.2-compliant awks (such as gawk), this option should be first on the command-line, and must be preceded by -v (e.g. "-v listfile=xxx"). Specifying listfile= will not work in some older awk/nawk programs which do not correctly handle manipulation of ARGV/ARGC. (Perl does not suffer from these difficulties.)
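On Unix, an equivalent list file can be built with the find utility (a sketch; the directory names below are invented, and the list is sorted only to make the output predictable):

```shell
# Build a throwaway directory tree with two empty HTML files, then
# list them, one pathname per line with forward slashes, into a file.
mkdir -p site/sub
touch site/a.html site/sub/b.html
find site -name '*.html' -print | sort > list
cat list
# The list would then be used with, e.g., a Posix-compliant awk:
#   gawk -v listfile=list -f htmlchek.awk > outfile.check
```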
The classification of each problem as being an "ERROR!" or a "Warning!" is sometimes a somewhat subjective decision on my part.
Htmlchek does not pretend to handle the full range of constructs that the HTML 2.0 declaration and DTD would make available in SGML, but which are not understood by most HTML applications. Thus the weird and wonderful syntactic features enabled by MINIMIZE SHORTTAG YES in the HTML declaration are not supported (for example, "<><HEAD/<TITLE///<BODY/text<IMG TOP SRC=x.gif<![IGNORE[ </HTML>]]>/</>" is technically a completely valid HTML 2.0 document, with no tags omitted!).
And SGML meta-constructs such as <!ENTITY...> etc. are not checked for, and can result in error messages.
Ampersand entity codes are not checked against any fixed list of approved HTML/SGML entities, since such lists are in principle rather extensible. If you want to check whether any ampersand entities outside of the currently most-commonly recognized HTML set are used, then you can run a file through the separate program dehtml and see if any ampersand codes are left (on Unix, do something like "sh dehtml.sh infile.html | egrep '&[0-9a-zA-Z.]*;'").
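As a self-contained illustration of this check (using a crude sed stand-in for dehtml, which here translates only &amp; -- the real dehtml handles many more entities):

```shell
# Invented sample file: one made-up entity (&foobar; is deliberately
# not a real HTML entity) and one common one (&amp;).
printf '<P>totally &foobar; text &amp; more</P>\n' > sample.html
# Strip tags, translate &amp;, then look for surviving entity codes.
sed -e 's/<[^>]*>//g' -e 's/&amp;/\&/g' sample.html |
  egrep -o '&[0-9a-zA-Z.]*;'
```

The only survivor printed is "&foobar;", flagging it for manual review.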
This program doesn't check that tag options (attributes) such as ALIGN actually have values taken from the approved set (since no part of the evolving HTML standards is more fluid); but as an aid to typo-detection, the option=value pairs in which the value is unquoted (such as ALIGN=BOTTOM) are output as part of each file's tag diagnostics; this allows you to pick out incorrect pairs like ALIGN=BOTOMM. (In order to get a single set of tag diagnostics for multiple files, you can do something like "cat *.html | awk -f htmlchek.awk" on Unix -- however, this allows the errors in one file to affect the interpretation of following files.)
Only the commonly-used double-quote character (") is recognized in quoting option values, though the single-quote character (') should theoretically also be usable.
Htmlchek tries to enforce the <HTML><HEAD>...</HEAD> <BODY>...</BODY> </HTML> format; if you don't want to add these constructs to your files, then you'll have to learn to ignore the warnings that will inevitably be produced. (Marcus E. Hennecke has a perl script, old2newhtml, which automatically adds these tags to an HTML file, available at ftp://ftp.crc.ricoh.com/pub/www/old2newhtml.) However, I've tried to cut down the output, so that a warning is generally produced only for the first item of each type that is not contained in an <HTML>, <HEAD>, or <BODY> element, and not for every uncontained item in a file.
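For reference, the enforced skeleton looks like the following (written out with a shell here-document; the title and body text are placeholders):

```shell
# Write a minimal file with the HTML/HEAD/BODY arrangement that
# htmlchek expects; HTML 2.0 also requires a TITLE inside HEAD.
cat > skeleton.html <<'EOF'
<HTML>
<HEAD><TITLE>Document title here</TITLE></HEAD>
<BODY>
Body text here.
</BODY>
</HTML>
EOF
grep -c '^<' skeleton.html   # five lines begin with markup
```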
Error messages are output at the earliest point at which an error becomes detectable, which is not necessarily always the point where the bad code actually occurs (this is particularly true for "improper nesting errors" -- i.e. where there is a pending <x>, so that </x> is expected, but </y> is found instead). Complaints about encountering text where no text should be found (such as immediately within a <HEAD>...</HEAD> element) are deferred until the start of the first following tag.
As with almost any parser or lint-type program that doesn't just give up at the first error, the presence of one real error can generate a cascade of subsequent bogus errors. I've tried to eliminate some of the more redundant repeated error messages that earlier versions of this program tended to generate in such cases. However, it is still true that sometimes htmlchek can't compensate for an error (particularly a self-nesting error), so that the invalid HTML code it has encountered affects its interpretation of valid HTML code later on in the file -- and some of the subsequent error messages and warnings for that file may not be useful. The only remedy is to fix the first error (which is the real one), and run the check again.
Checking the same HTML file twice with htmlchek.awk can result in a different ordering of the final tag diagnostics etc., due to the indeterminacy of the awk "for (x in array) {...}" looping construct. In general, wherever there is a list in the output, the list will be sorted with htmlchek.pl and unsorted with htmlchek.awk. Also, awk error messages connected with the running of htmlchek.awk itself (such as an incorrect configuration file name, etc.) are output to STDOUT (and so are intermixed with messages about HTML errors in the files being checked), since there is no portable way of outputting to STDERR in awk.
The version history is contained in the file htmlchek.awk.
If you get an awk error under Unix, the most common cause is inadvertently running incompatible ``old awk''; also, some vendor-supplied awks under Unix have problems (which can be avoided by using gawk -- see the file README.40). Try using htmlchek.sh (or dehtml.sh and makemenu.sh for dehtml.awk and makemenu.awk) and see if the problem goes away.
No Web document would be complete without including a meaningless bitmap graphic.
htmlchek version 4.1, February 20 1995