Last update: Thu Apr 4 15:44:41 2002
In April 1995, I searched the net for SGML parsers, with the
hope of being able to combine one with ostensible HTML
grammars, and thereby get a rigorous syntax checker for HTML
files. I tracked down three: arcsgml (from Charles Goldfarb,
a leading architect of GML and SGML), asp-sgml (from the
Amsterdam Compiler Kit), and sgmls (from Jim Clark, who is
also the author of GNU groff). sgmls is a descendant of both
arcsgml and asp-sgml.
Note added [17-Jan-1996]: Regrettably, the site ftp.ex.ac.uk
no longer welcomes outside access. The SP system described
below now provides a satisfactory alternative, and a North
American distribution site with precompiled binaries for many
systems has recently been established.
Based on the dates of the software I found, I suspected that
sgmls was newer, and correspondence with Joachim Schrod in
Darmstadt, who is an SGML expert, confirmed that sgmls is the
parser of choice, although Jim Clark is working on a new one,
called SP, that may eventually replace sgmls.
I was able to get sgmls installed on all of our local
architectures without much problem, but I was then stymied
for two weeks in trying to figure out how to run it. The
sgmls man pages are rather cryptic, and the output is even
more so, so it was not until 1-May-1995 that I located the
on-line HTML validation service via the substantial WWW
archive at UC Irvine, and from it, the HTML Check Toolkit and
the html-check utility. That night, I found how the
html-check script runs sgmls, and that provided the clue to
getting it all working.
The html-check distribution includes a binary executable of
sgmls for a system of your choice, so you may not need to do
an sgmls installation, unless you want to target multiple
architectures, as I do, or you feel more secure about
building programs directly from source code yourself.
html-check
As a result of this work, on our local machines, you can now type
html-check *.html
and get a rigorous validation of your HTML files. There is,
alas, no manual page written yet for html-check.
There are several reasons for many of the troubles I've had
trying to make sgmls work with HTML, and all of them could
have been resolved much earlier had the documentation been
better, and had HTML developers taken more care in providing
so-called HTML grammar files. I hope that these notes can
spare others some of the hours of grief and frustration that
I've gone through.
sgmls needs at least four files to run:
In some of the HTML `grammar files' that I found on the net,
the declaration file, the grammar file, and some garbage
HTML were embedded into a single file. sgmls requires
that these be provided as separate
files, and unless one is already quite familiar with SGML,
it is not at all obvious that the net files need to be
split.
The html-check script conceals files (1), (2), and (4), by
running the command
    /usr/local/bin/sgmls -s \
        -m /usr/local/lib/html-check/lib/catalog \
        /usr/local/lib/html-check/lib/html.decl \
        *.html
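For readers who would rather drive sgmls from a script of their own, the command above can be assembled along the following lines. This is only a sketch: the function name build_sgmls_command and its parameters are my own invention for illustration, and the default paths are simply the ones from our local installation shown above.

```python
from pathlib import PurePosixPath


def build_sgmls_command(html_files,
                        lib_dir="/usr/local/lib/html-check/lib",
                        sgmls="/usr/local/bin/sgmls"):
    """Return the argument list for one sgmls validation run.

    Hypothetical helper: it only mirrors the command line shown in
    the text (-s, then -m with the catalog file, then the declaration
    file, then the HTML files to be checked).
    """
    lib = PurePosixPath(lib_dir)
    return [
        sgmls,
        "-s",
        "-m", str(lib / "catalog"),
        str(lib / "html.decl"),
        *html_files,
    ]


if __name__ == "__main__":
    print(" ".join(build_sgmls_command(["index.html"])))
```

The resulting list can be handed to subprocess.run() without going through a shell; in that case the *.html wildcard has to be expanded beforehand, for example with glob.glob().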
Emacs, Reduce, SGML, and TeX are all confronted with a similar problem: they consist of a low-level engine written in some standard programming language, but acquire much of their functionality by run-time loading of a large collection of commands written in a secondary language.
To avoid an onerous startup time, Emacs, Reduce, and TeX all handle this problem by a one-time preloading step at installation time that consumes the secondary language files, and produces a fast-loading binary file. The program version that the user actually runs then already has the secondary language code loaded, or else can do so quickly behind the scenes at startup; all that the user needs to provide the program with is the name of his/her own file.
SGML parsers at present do not do this, so several files are needed at every run:
The second argument of the DOCTYPE declaration is
the document type. The third is either PUBLIC or
SYSTEM, to indicate the interpretation of the fourth
argument. If it is SYSTEM, then the fourth argument
is the actual filename of the grammar file. Since
filenames are system-dependent, this use is
generally deprecated, in favor of PUBLIC, which says
that the filename can be determined by searching for
the fourth argument in the SGML catalog file, which
was provided to sgmls by the
-m /usr/local/lib/html-check/lib/catalog
command-line option.
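The argument positions just described can be picked apart with a few lines of Python. This is a rough sketch under the assumption that the declaration fits on one line and quotes its fourth argument with double quotes; the names DOCTYPE_RE and parse_doctype are hypothetical, and a real SGML parser accepts many more forms than this pattern does.

```python
import re

# Matches <!DOCTYPE name PUBLIC|SYSTEM "identifier">, allowing optional
# white space before the closing angle bracket.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<doctype>\S+)\s+'
    r'(?P<kind>PUBLIC|SYSTEM)\s+'
    r'"(?P<identifier>[^"]*)"\s*>',
    re.IGNORECASE,
)


def parse_doctype(text):
    """Return (document type, PUBLIC or SYSTEM, identifier), or None."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        return None
    return (
        match.group("doctype"),
        match.group("kind").upper(),
        match.group("identifier"),
    )


if __name__ == "__main__":
    print(parse_doctype('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">'))
```

With PUBLIC, the third element of the result is the string to look up in the catalog file; with SYSTEM, it is the filename of the grammar file itself.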
The peculiar string "-//IETF//DTD HTML Level 2//EN"
is called a public identifier, and is not well
described in either the sgmls man pages, or the SGML
books of Martin Bryan and Eric van Herwijnen (I don't
yet have Charles Goldfarb's book, or a copy of the
ISO 8879:1986 SGML standard).
The public identifier typically contains 3 or 4
double-slash-separated parts. The first two parts
are typically an ISO standard number, or an owner
identifier. In the above example, IETF stands for
Internet Engineering Task Force. The
second-to-last part, DTD HTML Level 2, here
means Document Type Definition HyperText Markup
Language version 2. The last part, EN,
is the ISO 639 two-letter abbreviation for
the language in which the grammar file is written:
in this case, ENglish.
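The anatomy of a public identifier described above can be shown by splitting the string at its double slashes. In this sketch the function name and the dictionary keys (owner_parts, description, language) are labels of my own choosing, not terms from the SGML standard.

```python
def split_public_identifier(public_id):
    """Split a public identifier into its double-slash-separated parts.

    The last part is taken as the language code, the second-to-last as
    the description of the public text, and whatever precedes them as
    the owner part(s). The key names are hypothetical labels.
    """
    parts = [part.strip() for part in public_id.split("//")]
    if len(parts) < 3:
        raise ValueError("expected at least 3 parts: %r" % public_id)
    return {
        "owner_parts": parts[:-2],
        "description": parts[-2],
        "language": parts[-1],
    }


if __name__ == "__main__":
    print(split_public_identifier("-//IETF//DTD HTML Level 2//EN"))
```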
The file named at the end of the catalog file entry is taken to be relative to the location of the catalog file itself, so the grammar file, or Document Type Definition file in SGMLese, is found in our case in /usr/local/lib/html-check/lib/html.dtd .
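The lookup that sgmls performs can be imitated roughly as follows. This is a sketch, not a description of how sgmls is implemented: it assumes catalog entries of the form PUBLIC "public identifier" filename, the sample catalog text in the code is made up for illustration, and the names ENTRY_RE and resolve_public_id are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed entry form: PUBLIC "public identifier" filename
# (the filename may or may not be quoted).
ENTRY_RE = re.compile(r'PUBLIC\s+"([^"]+)"\s+"?([^"\s]+)"?')


def resolve_public_id(catalog_text, catalog_path, public_id):
    """Map a public identifier to a grammar file via catalog entries.

    The filename in an entry is taken relative to the directory that
    holds the catalog file. Returns None when there is no entry.
    """
    catalog_dir = PurePosixPath(catalog_path).parent
    for entry_id, filename in ENTRY_RE.findall(catalog_text):
        if entry_id == public_id:
            return str(catalog_dir / filename)
    return None


if __name__ == "__main__":
    # Made-up catalog contents, for illustration only.
    sample = (
        'PUBLIC "-//IETF//DTD HTML//EN" html.dtd\n'
        'PUBLIC "-//IETF//DTD HTML 3.0//EN" html-3.dtd\n'
    )
    print(resolve_public_id(sample,
                            "/usr/local/lib/html-check/lib/catalog",
                            "-//IETF//DTD HTML 3.0//EN"))
```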
sgmls
can map to a grammar .dtd file. Current
WWW browsers are not based on rigorous SGML grammars (or
at least, if they are, the grammar is hardcoded into the
browser, rather than defined in a file). They therefore
don't require a DOCTYPE indication; they just assume
HTML based on the file extension. However, if you want
to use other SGML tools on your .html
files, you need to make them conformant to the grammar,
and that means having a DOCTYPE declaration at the
start.
My own home page index.html file uses features of HTML 3.0 (formerly known as HTMLPlus), so it begins
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
which the catalog file maps to the file html-3.dtd, found in the catalog directory /usr/local/lib/html-check/lib.
The catalog file supplied with html-check
also provides mappings to a stricter grammar
for each version of HTML (i.e. one that doesn't
recognize obsolete or deprecated features); however,
the one for version 3.0, html-3s.dtd,
is missing from the html-check version 0.1
distribution. I've had communication with the
package authors about it, and it turns out that the
WWW developers have not yet implemented this file.
However, as soon as I hear that they have, I will
tighten up my files to conform to the strict
grammars, since they then have a better chance of
being handled correctly by a wide variety of SGML
and HTML software, including WWW browsers.
Because the HTML grammar requires a HEAD containing a TITLE, and a BODY, the minimal grammar-conformant .html file looks like this:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <HEAD>
    <TITLE>the title</TITLE>
    <BODY>
This uses tag minimization, which is detrimental to clarity,
and the use of simple tools such as
html-pretty,
so it is better written with closing tags, and with a LINK
element to identify the author:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" >
    <HEAD>
    <TITLE>the title</TITLE>
    <LINK REV="made" HREF="mailto:beebe@math.utah.edu">
    </HEAD>
    <BODY>
    </BODY>
This can in turn be filtered by html-pretty
to
look like this:
    <!-- -*-html-*- -->
    <!-- Prettyprinted by html-pretty lex version 0.07 [20-Apr-1995] -->
    <!-- on Tue May 2 10:25:50 1995 -->
    <!-- for Nelson H. F. Beebe (beebe@chamberlin.math.utah.edu) -->
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" >
    <HEAD>
    <TITLE>
    the title
    </TITLE>
    <LINK REV="made" HREF="mailto:beebe@math.utah.edu">
    </HEAD>
    <BODY>
    </BODY>
WARNING: The leading space that normally makes these entries easier to read has been lost, because this file is written according to the HTML 2.0 specification, which has no representation for a visible space, and doesn't permit a verbatim <PRE> ... </PRE> environment to be contained within an anchor definition <A NAME="..."> ... </A>.
HTML 3.0 remedies this with the entity for a non-breakable space, and more flexible environment nesting.
These entries are taken from an extensive bibliography on SGML and HTML that I maintain for the TeX Users Group and the benefit of the SGML and WWW community.