The relationships among HTML, XML and XHTML are an area of considerable confusion on the web. We often see questions on the webkit-dev mailing list where people wonder why their seemingly XHTML documents result in HTML output. Or we’re asked why an XML construct like <b /> doesn’t actually close the bold tag.
This article will attempt to clear up some of that confusion.
You may be wondering what the subtitle has to do with the title. Well, the HTML/XHTML distinction may seem like an obscure topic, but it can have significant practical effects. In particular, it is likely to affect Dashboard Widget developers in a huge way in upcoming WebKit versions. I’ll explain further at the end.
What are HTML, XML and XHTML?
The original language of the World Wide Web is HTML (HyperText Markup Language), often referred to by its current version, HTML 4.01 or just HTML4 for short. HTML was originally an application of SGML (Standard Generalized Markup Language), a sort of meta-language for making markup languages. SGML is quite complicated, and in practice most browsers do not actually follow all of its oddities. HTML as actually used on the web is best described as a custom language influenced by SGML.
Another important thing to note about HTML is that all HTML user agents (a catchall term for programs that read HTML, including web browsers, search engine web crawlers, and so forth) have extremely lenient error handling. Many technically illegal constructs, like misnested tags or bad attribute names, are allowed or safely ignored. This error handling is relatively consistent between browsers in common cases, but it is not documented or part of any standard, so there are lots of differences in edge cases. This is why it is a good idea to validate your documents.
XML and XHTML are quite different. XML (eXtensible Markup Language) grew out of a desire to be able to use more than just the fixed vocabulary of HTML on the web. It is a meta-markup language, like SGML, but one that simplifies many aspects to make it easier to make a generic parser. XHTML (eXtensible HyperText Markup Language) is a reformulation of HTML in XML syntax. While very similar in many respects, it has a few key differences.
First, XML requires every tag to be closed, and has a special syntax for tags that have no content and so need no separate close tag. In HTML, some tags, such as img, are always assumed to be empty and close themselves. Others, like p, may close implicitly based on other content. And others, like div, always need to have a close tag. In XML (including XHTML), any tag can be made self-closing by putting a slash before the closing angle bracket, for example <img src="funfun.jpg"/>. In HTML that would just be <img src="funfun.jpg">.
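The difference in closing rules can be checked with a short sketch. It uses Python's standard xml.etree.ElementTree module as a stand-in for a generic XML parser; the helper name parse_xml_or_none is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parse_xml_or_none(text):
    """Return the root element if text is well-formed XML, else None.

    Hypothetical helper for this illustration only.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


# XML self-closing syntax: slash before the closing angle bracket.
self_closed = parse_xml_or_none('<img src="funfun.jpg"/>')

# The same element written with an explicit close tag.
explicit = parse_xml_or_none('<img src="funfun.jpg"></img>')

# HTML-style empty tag with no slash and no close tag: not well-formed XML.
html_style = parse_xml_or_none('<img src="funfun.jpg">')

# In XML any tag can be self-closed, not only the ones HTML treats as empty.
self_closed_div = parse_xml_or_none('<div/>')

print(self_closed.tag, self_closed.attrib)
print(explicit.tag, explicit.attrib)
print(html_style)
print(self_closed_div.tag)
```

To an XML parser the first two forms describe the same empty element, while the HTML form is rejected because the tag is never closed.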
Second, XML has draconian error-handling rules. In contrast to the leniency of HTML parsers, XML parsers are required to fail catastrophically if they encounter even the simplest syntax error in an XML document. This gives you better odds of generating valid XML, but it also makes it very easy for a trivial error to completely break your document.
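A minimal sketch of the contrast, under some assumptions: Python's xml.etree.ElementTree stands in for a conforming XML parser, and Python's html.parser stands in for a lenient HTML reader. The latter is only a tolerant tokenizer, not a browser engine, so it shows the leniency but not the tree a real browser would build. The names MISNESTED, is_well_formed_xml, TagLogger and lenient_events are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Misnested markup: </b> appears before </i> is closed.
MISNESTED = "<p><b><i>bold italic</b></i></p>"


def is_well_formed_xml(text):
    """True if an XML parser accepts the text, False if it gives up."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Records start and end tags in the order the lenient parser sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def lenient_events(text):
    """Run the tolerant parser over the text and return the tag events."""
    parser = TagLogger()
    parser.feed(text)
    parser.close()
    return parser.events


print(is_well_formed_xml(MISNESTED))   # the XML parser refuses the whole document
print(lenient_events(MISNESTED))       # the lenient parser keeps going
```

The XML side produces no document at all for the misnested input, whereas the lenient side reports every tag it encounters and leaves the cleanup to whoever consumes the events.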
HTML-compatible XHTML
When XML and XHTML were first standardized, no browser supported them natively. To enable at least partial use of XHTML, the W3C came up with something called “HTML-compatible XHTML”. This is a set of guidelines for making valid XHTML documents that can still more or less be processed as HTML. The basic idea is to use self-closing syntax for tags where HTML doesn’t want a close tag, like img, br or link, with an extra space before the slash. So our ever-popular image example would look like this: <img src="funfun.jpg" />. The details are described in Appendix C of the XHTML 1.0 standard.
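As a rough illustration of why this form can serve both worlds, the sketch below feeds the Appendix C style tag to an XML parser and to a lenient HTML tokenizer. It assumes Python's standard xml.etree.ElementTree and html.parser modules as stand-ins; html.parser is not a browser and its tolerance only approximates the de facto behavior of browser HTML parsers. The names StartTagCollector and html_start_tags are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

APPENDIX_C_STYLE = '<img src="funfun.jpg" />'   # slash with a space before it
PLAIN_HTML_STYLE = '<img src="funfun.jpg">'     # ordinary HTML empty tag


class StartTagCollector(HTMLParser):
    """Collects (tag, attributes) for every start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<img ... />" through the default handle_startendtag.
        self.start_tags.append((tag, attrs))


def html_start_tags(text):
    """Return the start tags seen by the lenient tokenizer."""
    collector = StartTagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


# As XML, the Appendix C form is a well-formed empty element.
xml_element = ET.fromstring(APPENDIX_C_STYLE)
print(xml_element.tag, xml_element.attrib)

# The lenient tokenizer reports the same tag and attributes for both spellings.
print(html_start_tags(APPENDIX_C_STYLE))
print(html_start_tags(PLAIN_HTML_STYLE))
```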
It’s important to note that this is kind of a hack, and depends on the de facto error handling behavior of HTML parsers. They don’t really understand the XML self-closing syntax, but writing thing
Answered by Abhi Singh, an ibibo Master, at 2:25 PM on October 08, 2008