rethinking the HTML DTD.

Dan Connolly (connolly@pixel.convex.com)
Tue, 14 Jul 92 13:39:53 CDT


I have been troubled by the fact that HTML documents look like SGML documents,
but technically, they are not. So I have tried to come up with a DTD that
captures the features of HTML.

I have come to the conclusion that HTML has very little structure, and that this
is by design.

I am beginning to wonder how much the needs of WWW have in common with the
features of SGML.

It seems to me that SGML is the technology of choice when you have a community
of information consumers and producers that share a common structure. For
example, the construction industry might use SGML to exchange bills of
materials, parts lists, inventories, etc. The SGML parser would be used to
verify part numbers, make sure every widget has a corresponding gadget, etc.

The WWW project, however, is a form of electronic publishing, and publishing is
a natural application of SGML. But even there, the value of SGML is that you can
verify the structure of the text. A publisher can specify in his DTD the format
of references, bibliography entries, the placement of the abstract, etc.
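
To make "verify the structure" concrete, here is a minimal sketch in Python
rather than SGML. The section names and the policy itself are made up for
illustration; a real SGML parser would derive such checks from the DTD, not
from a hand-written list.

```python
# Hypothetical editorial policy: these top-level sections are all
# required, and they must appear in this order.
REQUIRED_ORDER = ["title", "abstract", "body", "bibliography"]


def check_structure(sections):
    """Return a list of complaints about a document's top-level sections.

    `sections` is the list of section names in document order.
    An empty result means the document satisfies the policy.
    """
    problems = []

    # Every required section must be present.
    for name in REQUIRED_ORDER:
        if name not in sections:
            problems.append("missing " + name)

    # The required sections that are present must follow the policy order.
    # A repeated section also trips this check.
    present = [s for s in sections if s in REQUIRED_ORDER]
    expected = [s for s in REQUIRED_ORDER if s in sections]
    if present != expected:
        problems.append("sections out of order")

    return problems
```

A document with a title and a body but nothing else would be rejected with two
complaints, one for the abstract and one for the bibliography. That is the kind
of enforcement a publisher wants.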

The WWW project has no such editorial policies to enforce. The editorial
policies set forth in the HTML tag set are things like "you can have a title, if
you want, and we'll keep it visible for the user; you can have headings and
paragraphs and glossaries and lists and menus, and as long as you use them
in pretty much the traditional way, they'll be formatted reasonably. And
you can have anchors -- references from/to other documents."

The question that recently came into my mind is: why is the WWW project
defining such a tag set? The practical answer is that the NeXT implementation
has a nifty editor, and we'd like to be able to write nicely formatted documents
and display them nicely on nice terminals and simply on simple terminals.

Honestly, for that purpose, RTF is a more mature technology. The NeXT has
extensive support for RTF, and the Mac and the PC have some support.

I think all we're lacking is public implementations of RTF->ASCII,
RTF->PostScript, and RTF->X Windows renderers. MS Word and NeXT
edit would be fine editors. Really, for the kind of casual documents
the WWW project deals with, SGML is not a good match. Who really
uses all the "format independent" features of WWW? I haven't seen
anything that the RTF stylesheet features can't handle.

Unless we want some part of the WWW system to verify the structure
of documents, why are we using SGML (and using it poorly)?

Granted, RTF doesn't have very good hypertext and multimedia features,
but that's what the WWW project is all about: experimenting with
hypertext and multimedia. We could standardize multimedia conventions
for RTF just as well as we have done for SGML.

Comments?

Dan