Zumi's Scratchpad

XML for websites: the deep dive


You may have been introduced to XML as a sort of sister to HTML, since both descend from the "standard generalized markup language" (SGML) formalized back in the 80s. That is, they share the same kind of markup components: tags (which can have attributes) and their content (which can itself be even more tags). The difference is that you could use XML to mark up literally anything. Notes, books, pages, cars... XML, however, has a stricter parser than HTML. That got people thinking: what if we could combine the two? Especially since web development was getting really messy and needed to be reined back in.

Enter stage left, XHTML...

XHTML was basically just HTML 4 but parsed using the stricter XML parser. Despite W3C's high hopes for the thing, no one really wanted or liked it, especially since it was too verbose for their liking (have fun dealing with both lang and xml:lang lmao). Not to mention, XHTML was regarded as an easy way to break websites through malicious user input, since a single piece of malformed markup makes the strict parser reject the whole page. People who had to use XHTML came up with an "ingenious" way to reduce the pain inflicted by the strict parsing (and to make browsers like IE compatible with it at the time): tricking the browser into loading it as plain HTML by manually setting the MIME type to text/html. The WHATWG was formed to maintain plain HTML, and today XHTML is mostly seen as a relic from times gone by.

As the web became bigger and more painful on the client side, some people began to rethink what HTML has become. It's a "living standard" with no versioning anymore, so what works and what doesn't isn't as clear as it used to be. Some people also point out that the WHATWG is mostly made up of corporations and big browser vendors, so any change to HTML will be in their favor (hello DRM!). So, in an attempt to return sanity to their world of web authoring, some people returned to XHTML, served with its intended MIME type of application/xhtml+xml. Their reasoning was that it's stable (I guess because no one cares about it anymore, but hey, at least support for XHTML is across the board!) and that the W3C still had the wheel on this one. Their decision inspired me to try something out for myself.

Maybe I've explained it before in webdev rant containment, but it was interesting to try and create a modern-looking website whose markup belongs in the year 2005. Modern in the sense that, yes, it's using current-year CSS3 and JavaScript, while also minding progressive enhancement. Three things became niceties for me in attempting to use XHTML 1.1:

Despite all that, some things become deal-breakers:

There were some new tags introduced with XHTML 1.1, such as the <ruby> tag, which annotates a run of text with its pronunciation, shown in small type alongside the main text. This tag survives to this day in HTML 5, and you could think of some useful things to do with it...

XHTML 2: ELECTRIC BOOGALOO

XHTML version 2 was going to introduce some breaking changes, such as the use of XForms instead of HTML forms. I wouldn't have liked it either. Even more tags were to be introduced, which I imagine were derided as potentially redundant with existing tags:

<nl>

Essentially a <ul> specifically to mark up navigation menus. Difference is that you have a <label>, used kind of like <caption> in tables, to define the list's heading. Examples here.

This one eventually morphed into a generic <nav> container you can put a <ul> into.

<section>

Yeah, basically the same tag from HTML 5. What's interesting though is the following tag that's used alongside it...

<h>

This is supposed to replace <h1> through <h6> to mark up headings. The idea is that instead of having to track which heading level you're on, you rely on <section> to do that for you:

<section>
    <h>First</h>
    <section>
        <h>Second</h>
    </section>
</section>

Instead of just:

<h1>First</h1>
<h2>Second</h2>

Yeah. I can see why this wouldn't be considered a favorite. More examples here.
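To make the "let <section> track the level for you" idea concrete, here's a rough Python sketch that walks an XHTML 2-style fragment and works out which <h1> to <h6> each <h> would correspond to. The heading_levels helper and the clamping to 1..6 are my own assumptions for illustration, not anything taken from the XHTML 2 drafts.

```python
import xml.etree.ElementTree as ET


def heading_levels(xml_text):
    """Return (level, text) pairs for every <h>, based on <section> nesting."""
    found = []

    def walk(node, depth):
        if node.tag == "section":
            # Each enclosing <section> pushes headings one level deeper.
            depth += 1
        elif node.tag == "h":
            # Clamp to the h1..h6 range (an assumption of this sketch).
            level = max(1, min(depth, 6))
            found.append((level, (node.text or "").strip()))
        for child in node:
            walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return found


SAMPLE = """
<section>
    <h>First</h>
    <section>
        <h>Second</h>
    </section>
</section>
"""

print(heading_levels(SAMPLE))  # [(1, 'First'), (2, 'Second')]
```

So the author never writes a level number; it falls out of how deep the <section> nesting goes.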

<separator/>

Apparently this is what <hr/> got renamed to, because people were treating the tag as a literal line across the screen, instead of a thematic break between paragraphs.

There were also other things like a global src attribute to make everything an <img> (wow), the earliest appearance of role, property... Things got even messier pretty quickly, and I imagine this was why XHTML 2 went the way of ECMAScript 4, and was promptly scrapped for a more incremental revision of the previous version.

Client-side templates, without JS

Now, you don't need to manually write every page in HTML when you could just write a bunch of XML and then apply an XSL stylesheet to it! XSLT (XSL Transformations) for short. These are to markup what CSS is to display. That is, XSLT is essentially a templating system that browsers have supported out of the box since the early 00's! Does this mean I don't have to use all those static site generators? Well, let's just see...

Let's say I wanted to write this blog in XML. What do we need here? We just need the actual contents and maybe some metadata... I'd mark up something like this:

<?xml version="1.0" encoding="utf-8"?>
<blog>
    <info>
        <name>XML for websites: the deep dive</name>
        <date>2021-11-23</date>
        <summary>
            In which I take a look at the carcasses left behind
            by the sheer enthusiasm of early and mid 2000s WWW.
        </summary>
        <tags>
            <tag>xml</tag>
            <tag>webdev</tag>
        </tags>
    </info>
    <content>
        <p>
            You may have been introduced to XML as a sort of sister to HTML,
            being both applications of the "standard generalized markup language"
            (SGML) formalized back in the 80s. The difference is that you
            could use XML to mark up literally <em>anything</em>.
        </p>

        <sec>Enter stage left, XHTML...</sec>

        <p>
            Write more content here...
        </p>
    </content>
</blog>

Using an XSLT file I can define some rules to insert contents of a tag inside another tag. Here, the blog's title should be an <h2> in the generated HTML, since the <h1> is already taken by the site title.

In XSLT, you do that using:

<xsl:template match="name">
    <h2><xsl:apply-templates/></h2>
</xsl:template>

Yeah, of course it's verbose. It's XML, after all. This defines a template to use whenever the processor encounters any <name> tag (that's our match attribute) in our little example, and puts whatever content is inside of it into an <h2> in our HTML. match takes a pattern written in XPath syntax, which, like CSS selectors, is a way to refer to elements in an XML tree. Though I wonder if we can use those instead. Anyway, if you wanted to specify that the <name> tag must be inside of <blog><info> to match, you'd write:

<xsl:template match="blog/info/name">
    <h2><xsl:apply-templates/></h2>
</xsl:template>

Either way, from <info><name>XML for websites: the deep dive</name></info> you'd get <h2>XML for websites: the deep dive</h2>.
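If the difference between the two patterns feels abstract, here's a small Python sketch using the standard library's ElementTree, which understands a limited subset of XPath. It isn't XSLT, just a way to poke at which elements a loose path and a specific path pick up. The <author> element is hypothetical, added only so there's a second <name> to trip over.

```python
import xml.etree.ElementTree as ET

# A cut-down blog document. <author> is a made-up element for this sketch.
DOC = """
<blog>
    <info>
        <name>XML for websites: the deep dive</name>
    </info>
    <content>
        <author><name>Zumi</name></author>
    </content>
</blog>
"""

root = ET.fromstring(DOC)  # root is the <blog> element

# Roughly what match="name" cares about: any <name>, wherever it sits.
any_name = [el.text for el in root.findall(".//name")]

# Roughly what match="blog/info/name" cares about: only <name> directly
# under <info> under <blog>. The path is relative to <blog> here.
info_name = [el.text for el in root.findall("info/name")]

print(any_name)   # ['XML for websites: the deep dive', 'Zumi']
print(info_name)  # ['XML for websites: the deep dive']
```

The more specific path is what keeps a stray <name> elsewhere in the document from being turned into a second <h2>.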

Now <xsl:apply-templates/> is applied recursively, so we can turn a <blog> directly into an HTML page, using:

<xsl:template match="blog">
    <html>
        <head>
            <title>
                <xsl:value-of select="info/name"/>
                - Zumi's Scratchpad
            </title>
        </head>
        <body>
            <header>
                <h1>Zumi's Scratchpad</h1>
                <nav><ul id="main-menu">
                    <li><a href="#">Home</a></li>
                    <li><a href="#">Bloge</a></li>
                </ul></nav>
            </header>
            <article>
                <xsl:apply-templates/>
            </article>
            <footer>
                © 2021 Zumi
            </footer>
        </body>
    </html>
</xsl:template>

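There's something new here. <xsl:value-of/> evaluates an XPath expression and outputs the text of what it finds; here we're using it to grab our blog's info <name> and put it inside the <title> tag of the HTML. The path is resolved relative to the node the template matched.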

This isn't a complete XSLT by any means, since we need to redefine the tags I put inside <content> and render it back as actual HTML (this is left as an exercise for the reader), but let's go ahead and suture it up. Wrap all three of the code blocks above inside of this block:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" doctype-system="about:legacy-compat" encoding="utf-8" indent="yes"/>

    <!-- XSLT content here... -->

</xsl:stylesheet>

And there you go. Now regarding adding a HTML5 doctype so that browsers wouldn't complain... It's been suggested that we use:

<xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>

But that's not reliable: Chrome, Chrome-likes and IE support disable-output-escaping while Firefox doesn't. The doctype-system="about:legacy-compat" approach, however, did work, and it made all browsers happy.

EDIT 2023-08-21: Viatrix brought to my attention this section from the "HTML Living Standard" as well as this section from XSLT 1.0 standard. doctype-system="about:legacy-compat" previously read doctype-public="". Thanks for the input!

All that's left to do is to link it to your XML, right after the XML declaration:

<?xml-stylesheet href="YOUR_TEMPLATE.xsl" type="text/xsl"?>

And you're gold! Take a look at what the page looks like here. As for CSS, you can do it inside the XSL template.
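If you'd rather check a stylesheet without opening a browser, something like the following Python sketch should do, assuming lxml is installed (it wraps libxslt). The render helper and the trimmed-down stylesheet and document are stand-ins I made up for this example; they are not the full template from this post.

```python
from lxml import etree

# A trimmed-down stylesheet in the spirit of the one above (an assumption,
# not the complete template).
XSLT_TEXT = b"""<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" doctype-system="about:legacy-compat"
        encoding="utf-8" indent="yes"/>

    <xsl:template match="blog">
        <html>
            <head>
                <title><xsl:value-of select="info/name"/></title>
            </head>
            <body>
                <article><xsl:apply-templates select="info/name"/></article>
            </body>
        </html>
    </xsl:template>

    <xsl:template match="name">
        <h2><xsl:apply-templates/></h2>
    </xsl:template>
</xsl:stylesheet>
"""

BLOG_TEXT = b"""<?xml version="1.0" encoding="utf-8"?>
<blog>
    <info><name>XML for websites: the deep dive</name></info>
    <content/>
</blog>
"""


def render(xml_bytes, xslt_bytes):
    """Apply the stylesheet to the document and return the output as text."""
    transform = etree.XSLT(etree.fromstring(xslt_bytes))
    result = transform(etree.fromstring(xml_bytes))
    return str(result)


html = render(BLOG_TEXT, XSLT_TEXT)
print(html)
```

This only tells you what libxslt makes of the stylesheet. Browsers ship their own XSLT engines, so it's still worth loading the real page in a couple of them.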

Now, given the power that you have with this thing... Maybe it won't be surprising why people might not want to use this and would rather use static site generators:

But this is still a really neat technology that every modern browser has by default, and it'd be really cool to see more things built with it.

Another thing catches my eye, and it's that mysterious thing that should be on top of every HTML page...

DTDs. What's the point?

See, they're supposed to define clearly what should be in a file. They did have their uses in SGML in general, and HTML did have an actual DTD with an associated doctype. In practice, browsers never check an HTML file against its DTD, and just use the doctype to decide whether or not to render the page in quirks mode. Then again, HTML was always a looser application of SGML than XML is, and it changed rather quickly. People who insist on validating anyway end up flooding W3C servers, because the system identifier is a URL and their tools fetch it every time. DTDs are kinda hard to verify without specialized tools you'd pay for, and they're cryptic too. It's no wonder they fell by the wayside, with their validation role taken over by XML Schema Definition (XSD).

But how do you get these things?

Right, let's make a DTD for our blog XML example. Let's assume we want <tags> in the info tag to be optional, while everything else must be present. A basic DTD without the HTML tags would look something like this:

<!--
    Blogs must contain meta content
    and main content, in that order
-->
<!ELEMENT blog (
    info,
    content
)>

<!--
    Meta element must contain a blog title,
    blog date, and a summary. Tags may
    be there but they're optional.

    Ideally there'd be exactly one of each, in
    any order, but XML DTDs have no "each once,
    in any order" connector. A strict sequence
    (a, b, c) fixes the order, so the usual
    workaround is a repeated choice.

    So here we only enforce that at least one
    of them is present.
-->
<!ELEMENT info (
        name | date | summary | tags
)+>

<!--
    Tags must only contain tags, and at least
    one must be there.
-->
<!ELEMENT tags (
    tag
)+>

<!--
    These elements should only contain
    plain text.
-->
<!ELEMENT name (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT summary (#PCDATA)>
<!ELEMENT tag (#PCDATA)>

<!--
    Think of these as "constants", let's define
    a subset of HTML to add here.
-->
<!ENTITY % markup
    "em | i | strong | b "
> <!-- TODO: add more -->

<!ENTITY % block
    "p | sec "
>

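<!--
    ...and then we define the markup themselves.
    You can nest these things, it turns out.
    #PCDATA has to come first so that plain text
    is allowed alongside the inline tags.
-->
<!ELEMENT p (
    #PCDATA | %markup;
)*>

<!ELEMENT sec (
    #PCDATA | %markup;
)*>

<!ELEMENT em (#PCDATA | %markup;)*>
<!ELEMENT i (#PCDATA | %markup;)*>
<!ELEMENT strong (#PCDATA | %markup;)*>
<!ELEMENT b (#PCDATA | %markup;)*>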

<!--
    Maybe we shouldn't allow lone markup to
    be here, and must be on one of our block tags
    (p and sec)
-->
<!ELEMENT content (
    %block;
)*>

<!-- Maybe I forgot something, but you get the idea. -->
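One way to see the "any order" trade-off is to treat a content model as a regular expression over the names of the child elements. The Python sketch below is only an analogy (the accepts helper and both patterns are made up for illustration, and this is not how a validator is actually implemented), but it shows what the repeated choice lets through compared to a strict sequence.

```python
import re

# Analogy only: each child element becomes its name plus a trailing space.
# Repeated choice, like (name | date | summary | tags)+
REPEATED_CHOICE = re.compile(r"(?:name |date |summary |tags )+")
# Strict sequence, like (name, date, summary, tags?)
STRICT_SEQUENCE = re.compile(r"name date summary (?:tags )?")


def accepts(pattern, children):
    """Check whether a list of child element names fits the pattern."""
    text = "".join(f"{child} " for child in children)
    return pattern.fullmatch(text) is not None


# Any order is fine with the repeated choice...
print(accepts(REPEATED_CHOICE, ["date", "name", "summary"]))  # True
# ...but so are duplicates and missing elements, which is the catch.
print(accepts(REPEATED_CHOICE, ["name", "name"]))             # True

# The strict sequence enforces one of each, but only in a fixed order.
print(accepts(STRICT_SEQUENCE, ["name", "date", "summary"]))  # True
print(accepts(STRICT_SEQUENCE, ["date", "name", "summary"]))  # False
```

So with a DTD you pick your poison: a fixed order with exact counts, or a free order with loose counts.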

And then you include it with your XML by adding this to the top, after your xml-stylesheet processing instruction:

<!DOCTYPE blog SYSTEM "my_blog.dtd">

The blog here names the top-level element, so don't go thinking this can be anything lol.

That one declares a "private" DTD, using SYSTEM. Public DTDs, like the ones that come with HTML 4.01, XHTML and RSS 0.91, use PUBLIC with a public identifier instead. Do you really need that right now?
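To see what the two flavors of doctype actually carry, here's a small sketch with Python's built-in minidom, which records the doctype's name and identifiers without trying to download anything. The describe_doctype helper is my own name for illustration. The second document uses the XHTML 1.1 identifiers as an example of a public DTD.

```python
from xml.dom import minidom


def describe_doctype(xml_text):
    """Return (root element name, public identifier, system identifier)."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.name, doctype.publicId, doctype.systemId


PRIVATE = '<!DOCTYPE blog SYSTEM "my_blog.dtd"><blog/>'
PUBLIC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html/>'
)

print(describe_doctype(PRIVATE))
# ('blog', None, 'my_blog.dtd')
print(describe_doctype(PUBLIC))
# ('html', '-//W3C//DTD XHTML 1.1//EN', 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd')
```

The public identifier is just a well-known name that tools can map to a local copy of the DTD, while the system identifier says where to fetch it from otherwise.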

And now, the most important question. How do you validate these things without buying an XML IDE or something? The most reliable method I've found so far is what's being laid out here, using Python and lxml:

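from lxml.etree import XMLParser, parse

# dtd_validation=True makes the parser load the DTD named in the
# document's DOCTYPE and raise XMLSyntaxError if validation fails.
parse(
    "YourMarkup.xml",
    XMLParser(dtd_validation=True)
)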
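If you want to experiment without juggling files, lxml can also load a DTD from a string and validate an already-parsed tree against it. The sketch below assumes lxml is installed and uses a shortened DTD with the parameter entities written out by hand. The check helper and both sample documents are made up for illustration.

```python
from io import StringIO

from lxml import etree

# A shortened version of the blog DTD, with the entities expanded by hand.
DTD_TEXT = """
<!ELEMENT blog (info, content)>
<!ELEMENT info (name | date | summary | tags)+>
<!ELEMENT tags (tag)+>
<!ELEMENT name (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT summary (#PCDATA)>
<!ELEMENT tag (#PCDATA)>
<!ELEMENT content (p | sec)*>
<!ELEMENT p (#PCDATA | em | i | strong | b)*>
<!ELEMENT sec (#PCDATA | em | i | strong | b)*>
<!ELEMENT em (#PCDATA | em | i | strong | b)*>
<!ELEMENT i (#PCDATA | em | i | strong | b)*>
<!ELEMENT strong (#PCDATA | em | i | strong | b)*>
<!ELEMENT b (#PCDATA | em | i | strong | b)*>
"""

GOOD = "<blog><info><name>Hi</name></info><content><p>Some <em>text</em></p></content></blog>"
BAD = "<blog><content/></blog>"  # missing the required <info>


def check(xml_text, dtd_text=DTD_TEXT):
    """Validate xml_text against dtd_text; return (is_valid, error messages)."""
    dtd = etree.DTD(StringIO(dtd_text))
    is_valid = dtd.validate(etree.fromstring(xml_text))
    errors = [entry.message for entry in dtd.error_log.filter_from_errors()]
    return is_valid, errors


good_ok, _ = check(GOOD)
bad_ok, bad_errors = check(BAD)
print(good_ok, bad_ok)
for message in bad_errors:
    print(message)
```

Unlike the parser option above, this doesn't need a DOCTYPE in the document at all, which is handy while the DTD itself is still in flux.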

Enough for today

Gee, that's a lot of XML nerdery. Sorry if I made your head spin. I think that looking at what people back then thought was the "future of the web, now!", and then looking back at it in the current year with great irony, makes me view current web dev with slightly more skepticism.

Anyway, the most useful resource I've found for getting your feet wet with these "ancient" things — if you REALLY want — is the Edutech Wiki, from TECFA at the University of Geneva: