HyperText Markup Language (HTML) is the pervasive data format for the World Wide Web. While HTML provides an outstanding mechanism to deliver simple documents over the Web, its simplicity imposes limitations that significantly raise the cost of deploying complex websites.
Because of the lack of SGML support in mainstream Web browsers, most applications that deliver SGML information over the Web convert the SGML to HTML. This down-translation removes much of the intelligence of the original SGML information. That lost intelligence virtually eliminates information flexibility and poses a significant barrier to reuse, interchange, and automation.
HTML's limitations become evident when trying to deploy large and complex business-critical applications, which are usually intranet/extranet applications as opposed to standard websites. These limitations include:
Limited structure - Most of HTML's limitations can be traced to its fixed set of tags, which primarily serve to specify the formatting of documents delivered on the Web. In other words, HTML tags support only a fixed and trivially simple structure. In this, HTML shares the limitations of other presentation-specific markup languages, such as RTF, which is designed for documents that are delivered in print. SGML was invented, in part, to separate information from formatting and so provide a powerful and extensible way to mark up information. HTML's lack of structure creates significant barriers to using HTML for applications beyond simple browsing, such as reuse, interchange, and automation. Each of these is covered below.
Limited reuse - Many organizations publish the same information in multiple forms; it's very common to have both printed and Web forms of the same data. Information originally created in HTML can be reused for printing, and information originally created for printing can be reused for Web delivery. However, reuse requires a conversion that's usually followed by manual intervention to fix up the appearance (i.e., the formatting) of the resulting document. That means that each time the source information changes, the conversion and fix-up process must be repeated. This expensive, time-consuming, and labor-intensive process is one of the reasons that organizations with lots of data to distribute have adopted SGML.
Limited interchange - Because the Internet is simple and ubiquitous, it provides an ideal medium for organizations that want to interchange data. However, HTML undermines interchange because its small, fixed set of tags indicates little more than the appearance of an element of a document. HTML provides nothing to identify the data within a document, which cripples attempts to reuse that data. For example, a computer manufacturer may wish to capture semiconductor data from its suppliers and feed that data into its computer-aided design (CAD) systems. Its CAD systems require data such as the function, tolerances, and timing of each pin of an integrated circuit. HTML provides no way to tag such data unambiguously. In fact, even if the original source data contains the necessary tagging to eliminate uncertainty, which is likely to be the case if the source data is in SGML, the resulting down-translation to HTML strips all the intelligence away.
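The pin-data example can be made concrete with a small sketch. It is only an illustration under stated assumptions: XML stands in for SGML because Python's standard library can parse it, and every element name, attribute name, and value below (component, pin, function, tolerance, timing, the part number) is hypothetical rather than taken from any real supplier format.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup for two pins of an integrated circuit.
# All element and attribute names are invented for this illustration.
TAGGED = """
<component part="XYZ-100">
  <pin number="1">
    <function>clock input</function>
    <tolerance unit="percent">5</tolerance>
    <timing unit="ns">12.5</timing>
  </pin>
  <pin number="2">
    <function>data output</function>
    <tolerance unit="percent">2</tolerance>
    <timing unit="ns">8.0</timing>
  </pin>
</component>
"""

# The same information after down-translation to presentation markup.
HTML_LIKE = """
<table>
  <tr><td>1</td><td>clock input</td><td>5</td><td>12.5</td></tr>
  <tr><td>2</td><td>data output</td><td>2</td><td>8.0</td></tr>
</table>
"""


def read_pins(xml_text):
    """Return one dict per pin, keyed by names the markup itself supplies."""
    root = ET.fromstring(xml_text)
    pins = []
    for pin in root.iter("pin"):
        pins.append({
            "number": int(pin.get("number")),
            "function": pin.findtext("function"),
            "tolerance_percent": float(pin.findtext("tolerance")),
            "timing_ns": float(pin.findtext("timing")),
        })
    return pins


def read_cells(html_text):
    """Return rows of anonymous strings; their meaning must be guessed from position."""
    root = ET.fromstring(html_text)
    return [[td.text for td in tr.iter("td")] for tr in root.iter("tr")]


pins = read_pins(TAGGED)
cells = read_cells(HTML_LIKE)
print(pins[0])   # named fields a CAD import could consume directly
print(cells[0])  # bare cell text with no indication of what each value is
```

With the descriptive markup, the receiving program asks for a pin's timing by name. With the table, it has to assume that the fourth cell is timing and that the unit is nanoseconds, and that assumption breaks as soon as a supplier reorders its columns.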
Limited automation - Automation saves labor, reduces costs, speeds delivery, and improves quality. There are many opportunities for adding automation to the use of the Web, particularly for intranets and extranets. Examples include almost any forms-based application, such as insurance enrollments, medical claims processing, and online banking. However, HTML poses a significant barrier to achieving automation. All highly automated processes are built on a data format that's highly expressive and absolutely consistent. HTML lacks the necessary expressiveness, since it's limited to a fixed set of presentation-oriented tags, and it lacks absolute consistency as well, since there's no way to impose a rigorous data structure on top of those tags.
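To show what "imposing a rigorous data structure" might look like in practice, here is a minimal sketch of a structure check for an insurance enrollment record. The record layout is an assumption made for illustration: the element names (enrollment, applicant, policy, start_date) are hypothetical, XML stands in for SGML, and the hand-written rule "each required child appears exactly once" is a much simpler device than a real SGML document type definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical content rule: every enrollment record must contain
# exactly one of each of these child elements.
REQUIRED_CHILDREN = ("applicant", "policy", "start_date")


def structure_errors(xml_text):
    """Return a list of human-readable structure problems (empty if none)."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "enrollment":
        errors.append(f"unexpected root element <{root.tag}>")
    for name in REQUIRED_CHILDREN:
        # Count direct children with this element name.
        count = sum(1 for child in root if child.tag == name)
        if count != 1:
            errors.append(f"expected exactly one <{name}>, found {count}")
    return errors


good = (
    "<enrollment>"
    "<applicant>A. Jones</applicant>"
    "<policy>P-17</policy>"
    "<start_date>1998-01-01</start_date>"
    "</enrollment>"
)
bad = "<enrollment><applicant>A. Jones</applicant></enrollment>"

print(structure_errors(good))
print(structure_errors(bad))
```

A downstream process can refuse any record for which the list is non-empty. HTML offers no equivalent hook, because a paragraph or table cell carries no declaration of which fields it is supposed to contain.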
Searching produces too many "hits" - One of the most valuable capabilities of the Web is provided by search engines that allow a user to find everything on the Web related to an inquiry. As the volume of information available on the Web continues to skyrocket, however, the amount of data retrieved for a typical search has risen to unusable proportions. Searchers must choose between queries so narrow that relevant information may be omitted from the results and queries so general that they produce far too many hits to be useful. Web searches turn up too many hits because they typically cover all the content of every page. Although searches can be limited to titles, those searches are almost certain to exclude relevant hits. One of the best ways to improve Web searching would be to provide content-specific elements. For example, the word "bonds" could be tagged as a name, or a chemical term, or a financial term. Then searches for content related to "bonds" could be limited to a specific domain of inquiry.
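The "bonds" example can be sketched as two searches over the same three documents, one over all text and one restricted to tagged terms. The markup is an assumption for illustration only: the term and name elements, the domain attribute, and the document contents are hypothetical, with XML again standing in for SGML-style descriptive tagging.

```python
import xml.etree.ElementTree as ET

# Three hypothetical documents that all contain the word "bonds".
DOCS = {
    "finance.xml": (
        "<doc><p>Municipal <term domain='finance'>bonds</term> "
        "pay interest twice a year.</p></doc>"
    ),
    "chemistry.xml": (
        "<doc><p>Covalent <term domain='chemistry'>bonds</term> "
        "share electron pairs.</p></doc>"
    ),
    "minutes.xml": "<doc><p><name>Bonds</name> chaired the meeting.</p></doc>",
}


def full_text_search(word):
    """Match the word anywhere in the text, as a conventional search does."""
    hits = []
    for doc_name, text in DOCS.items():
        content = "".join(ET.fromstring(text).itertext()).lower()
        if word.lower() in content:
            hits.append(doc_name)
    return hits


def domain_search(word, domain):
    """Match the word only where it is tagged as a term in the given domain."""
    hits = []
    for doc_name, text in DOCS.items():
        for term in ET.fromstring(text).iter("term"):
            same_word = (term.text or "").lower() == word.lower()
            if same_word and term.get("domain") == domain:
                hits.append(doc_name)
                break
    return hits


print(full_text_search("bonds"))          # every document matches
print(domain_search("bonds", "finance"))  # only the financial document matches
```

The full-text search returns all three documents, while the domain-restricted search returns only the one where the word is tagged as a financial term.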
Moving target: HTML 2.0 to 3.2 to 4.0 to ?? - Since HTML is an evolving standard, its capabilities are continually being extended through the introduction of new tags. For those who are maintaining large amounts of information in HTML, the release of new revisions of HTML usually requires reviewing and retagging the existing data. In fact, many webmasters are relieved that Microsoft and Netscape have increased the intervals between new versions of their browsers from six months to one year, because that means that they don't have to retag their websites as often. To avoid the retagging problem entirely, many organizations create their source information in SGML and down-translate to HTML. The level of effort for changing an SGML-to-HTML translator may be as little as a few hours, while the effort to retag hundreds or thousands of pages can stretch into many weeks.
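The reason a translator is cheap to change can be seen in a minimal sketch of table-driven down-translation. Everything here is assumed for illustration: the source element names (manual, title, para, warning), the mapping table, and the use of XML in place of SGML. A production translator would handle attributes, entities, and far more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from descriptive source elements to HTML tags.
TAG_MAP = {
    "manual": "div",
    "title": "h1",
    "para": "p",
    "warning": "b",
}

SOURCE = (
    "<manual><title>Pump Service</title>"
    "<para>Check the seal. <warning>Disconnect power first.</warning></para>"
    "</manual>"
)


def down_translate(source_xml, tag_map):
    """Rename each source element to its HTML counterpart and serialize."""
    root = ET.fromstring(source_xml)
    for element in root.iter():
        # Unknown elements fall back to a neutral inline container.
        element.tag = tag_map.get(element.tag, "span")
        element.attrib.clear()
    return ET.tostring(root, encoding="unicode", method="html")


html_v1 = down_translate(SOURCE, TAG_MAP)

# When a new HTML revision prefers a different tag, one table entry changes.
# The source documents themselves are not touched.
html_v2 = down_translate(SOURCE, {**TAG_MAP, "warning": "strong"})

print(html_v1)
print(html_v2)
```

Retargeting the output means editing the mapping once and rerunning the translator over the whole collection, instead of retagging each page by hand. The sketch also shows the cost described earlier: once a warning has become bold text, the output no longer records that it was a warning.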
Despite its limitations, HTML has several qualities that make it appealing to today's Web users:
Very simple - HTML makes Web cruising so simple that most people can train themselves. All you have to learn is how to click on the blue, underlined text. For those who want to create simple Web pages, HTML is easy enough to learn in just a few hours.
Built-in style - The screen formatting that is built into HTML is likewise very simple. Even though HTML formatting has a lot of limitations, it's far better than the plain text display of the Internet before the advent of HTML and the World Wide Web. HTML's limited formatting options make Web publishing even easier, because you don't have to deal with balancing multiple columns, positioning graphics to achieve attractive page breaks, and so on.
Easy, standard linking - HTML's flexible and powerful hypertext linking is easy to set up, but it, too, has limitations that complicate large-scale implementations.
Forms support - You can easily set up simple forms applications with HTML. Today's Web editors make it possible for you to set up your first form in an hour or two. After you've done your first form, you can do additional forms in much less time.
Simple programming - Finally, HTML forms can hand their input to CGI scripts running on the Web server, which makes basic programming easy. Although you can't do everything with CGI, you can do a decent amount with little effort.