Getting to know XML

ArticleCategory: [Article category]

Applications

AuthorImage:[Here we need a little image from you]

[Floris Lambrechts]

TranslationInfo:[Author and translation history]

original in en Floris Lambrechts

AboutTheAuthor:[About the author]

I have been the LinuxFocus/Nederlands main editor for years. I'm studying 'industrial engineer in electronics' in Leuven, Belgium and waste my time toying with Linux, PHP, XML and LinuxFocus, while reading books like those by Stephen Hawking and (at the moment:) Jef Raskin, 'The Humane Interface'.

Abstract:[Here you write a little summary]

This is a pretty short introduction to XML. You will meet Eddy the meta cat, the XML syntax police, and some DTDs. Don't worry, we'll explain ;-)

ArticleIllustration:[This is the title picture for your article]

[Illustration: xml]

ArticleBody:[The actual article]

Introduction

In the summer of 2001, some of _LF_'s editors came together in Bordeaux during the LSM. Many talks and discussions at the LSM's documentation special-interest-group turned out to have the same subject: XML. Long (and fun) hours went into explaining what exactly XML is, what it's good for and how one can use it. In case you're interested, that is also exactly what this article will try to discuss.

I'd like to thank Egon Willighagen and Jaime Villate for introducing me to XML. This article is somewhat based on information in Jaime's articles, which you can find in the links below.

What is XML

We documentation guys all knew what XML was about, more or less. After all, its syntax is very similar to HTML and it's just another markup language like SGML and (again) HTML, right? Right. But there is more to it.
XML has some properties that make it a useful data format for almost anything. It almost seems like it can describe the most complex things, and yet it remains easy to read for humans, and easy to parse for programs. How can that be? Let's investigate this odd language.

Eddy, the meta cat

First of all, XML is a markup language. Documents written in a markup language contain basically two things: data, and metadata. If you know what data exactly is, please let me know, but until then let's talk about the metadata ;). Simply said: metadata is extra information that adds a context, or a meaning, to the data itself. A simple example: take the sentence 'My cat is called Eddy'. A human like you knows that 'cat' is the name of a species of animal, and that 'Eddy' is the name of this particular cat. Computer programs, however, are not human and don't know all that. So we use metadata to add meaning to the data (with XML syntax, of course!):

 <sentence>
   My <animal>cat</animal> is called <name>Eddy</name>.
 </sentence>

Now even a dumb computer program can tell that 'cat' is a species, and that 'Eddy' is a name. If we want to produce a document where all names are printed in blue, and all species in red, then XML makes it really simple for us. Just for the fun of it, this is what we would get (with 'cat' in red and 'Eddy' in blue):

 My cat is called Eddy.

Now, theoretically, we can put the layout information (the colors in this case) in a separate file, a so-called stylesheet. When we do that, we are actually separating the layout information from the content, something that is considered the Holy Grail of Web design(TM) by some. So far we have done nothing special: adding metadata is what markup languages are designed to do. So then, what makes XML so special?
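
To make the stylesheet idea a bit more concrete before moving on, here is a minimal sketch of what such a separate file could look like. It is written in XSLT, which is only one of several stylesheet options and is an assumption on our part, not necessarily what you would pick. It turns the Eddy sentence into HTML with species in red and names in blue:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- The whole sentence becomes one HTML paragraph -->
  <xsl:template match="sentence">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Species are printed in red -->
  <xsl:template match="animal">
    <span style="color: red"><xsl:apply-templates/></span>
  </xsl:template>

  <!-- Names are printed in blue -->
  <xsl:template match="name">
    <span style="color: blue"><xsl:apply-templates/></span>
  </xsl:template>
</xsl:stylesheet>
```

If the sentence is saved as eddy.xml and the stylesheet as colors.xsl (both file names are made up for this example), a processor such as xsltproc can combine them with 'xsltproc colors.xsl eddy.xml'. The XML file itself never mentions a color.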

The syntax police

First of all, XML has a very strict syntax. For example, in XML every <tag> must have a closing </tag>. [ Note: since it's a little stupid to write <tag></tag> when there's nothing in between, you can also write <tag /> and, over time, save a couple of minutes of your life].
Another rule is that you cannot 'mix' tags. You have to close the tags in the reverse order that you opened them. Something like this is not well-formed XML:

<B> Bold text <I>Bold and italic text </B> italic text </I>

The syntax rules say that you should close the <I> tag before you close <B>.
And, be aware, _all_ the elements in an XML document should be contained in other tags (except the two outer tags, of course!), so one single outer pair of tags wraps the whole document. That is why, in the example above, we have written <sentence> tags around the sentence. Without them, some of the words in the sentence would not be included between tags, and that, like so many things, makes the XML syntax police really mad.
[Illustration: Mozilla screenshot. Mozilla's syntax police @work...]
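
For comparison, here is a sketch of how the bold-and-italic example could be rewritten so that the syntax police stay calm. The <text> wrapper and the <BR/> element are made-up names, added only to show the single outer element and the short form for empty tags:

```xml
<text>
  <!-- <I> is opened inside <B>, so it is closed before </B> -->
  <B>Bold text <I>bold and italic text</I></B>
  <!-- a fresh <I> element carries the italic-only part -->
  <I>italic text</I>
  <!-- an empty element, written in the short form -->
  <BR/>
</text>
```

Every opening tag has its closing tag, tags are closed in the reverse order of opening, and one outer element contains everything else.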

But a strong police force sure has its advantages: it brings order. Since XML follows such strict syntax rules, it is very easy for programs to read. Also, the data in your XML documents is very structured, which makes it easy to read and write for humans.
Please note that the 'theoretical' assets of XML are not always realizable in practice. For example, most current XML parsers are far from fast, and often really big. So it appears that XML is not that easy to read for software at all. Let's just say it's certainly not a good idea to do *everything* in XML, just because you can. For applications where you need to do a lot of lookups in a document, or where you have really huge documents, XML is often not the right choice. But that doesn't mean it's impossible to use XML for those purposes.
A nice example of the power of XML, but also of its slowness, is the fact that you can write databases in it (try that with HTML! :p). That's exactly what Egon Willighagen has done for the Dutch LinuxFocus section; his article about that system is available in the links at the bottom of this page. In this case the flexibility and extensibility of a homebrew file format were chosen above pure speed (say, MySQL).
Concerning the strict syntax of XML: if you manage to become good friends with the syntax checkers, then there may be some ways you can let the police actually do some of your work. If you want to do so, you'll have to make clever use of a DTD...

The DTD

In our little 'Eddy the meta-cat' example above, we have invented our own XML tags. Of course, such a creative act is not tolerated by the police force! The 'men in blue' want to know what you are doing, how, when and (if possible) why. Well, no problem, you can explain everything with the DTD...

A DTD allows you to 'invent' new tags. In fact, it allows you to invent completely new languages, as long as they follow the XML syntax.
The DTD, or Document Type Definition, is a file that contains a description of an XML language. It is actually a listing of all the possible tags, their possible attributes, and their possible combinations. The DTD describes what is possible in your XML language, and what's not. So, when we talk about an 'XML language', we are actually talking about a specific DTD.
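
As a minimal sketch, this is what a DTD for the Eddy example might look like. To keep the example in one piece, the DTD is embedded at the top of the document between the square brackets; normally it would live in a file of its own, referenced with something like <!DOCTYPE sentence SYSTEM "sentence.dtd"> (a hypothetical file name):

```xml
<?xml version="1.0"?>
<!DOCTYPE sentence [
  <!-- a sentence is text mixed with any number of animal and name elements -->
  <!ELEMENT sentence (#PCDATA | animal | name)*>
  <!-- animal and name contain plain text only -->
  <!ELEMENT animal (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
]>
<sentence>
  My <animal>cat</animal> is called <name>Eddy</name>.
</sentence>
```

A validating parser can now check the document against these rules. With xmllint, for instance, 'xmllint --valid --noout eddy.xml' stays silent when everything is in order (eddy.xml being an assumed file name). A tag that is not declared, say <color>, would be reported as an error.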

Put the police to work

Sometimes the DTD will force you to do something at a specific place. For example, the DTD can force you to include a tag that contains the title of the document. What's so nice about this is that there is actually software (e.g. an emacs module) that can write the required tags automatically.
That way, some parts of your document's structure get filled in automatically. Because the syntax is so strict and well-defined, the DTD can guide you through the process of writing a document. And when you make mistakes, such as forgetting to place an end tag, the police will inform you. So in the end, the cops are not that 'mad' at all; where the real-world cops say 'You have the right to remain silent', the XML police tell you in a very friendly way about a 'Syntax error @ line xx : '... :)
And while the police do all that work for you, of course *you* can just go on and concentrate on the content.
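
A small sketch of such a 'forcing' DTD, again embedded in the document for brevity. The element names (article, title, para) and the lang attribute are invented for this illustration and are not taken from any real DTD:

```xml
<?xml version="1.0"?>
<!DOCTYPE article [
  <!-- exactly one title comes first, followed by one or more paragraphs -->
  <!ELEMENT article (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
  <!-- every article must also state its language -->
  <!ATTLIST article lang CDATA #REQUIRED>
]>
<article lang="en">
  <title>Getting to know XML</title>
  <para>This is a pretty short introduction to XML.</para>
</article>
```

Leave out the <title>, put it after a <para>, or forget the lang attribute, and a validating parser will complain. This is also the information an editor needs in order to insert the required tags for you.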

In the mix

One last great feature of XML is its ability to use several DTDs at once. This means you can use several different data types at the same time in one document.

This 'mixing' is done with XML namespaces. For example, you can include the DocBook DTD in your .xml document (bound to the 'dbk' prefix in this example).
All of DocBook's tags are then ready to be used in your document in this form (let's say there is a DocBook tag <just_a_tag>):

 <dbk:just_a_tag> just some words </dbk:just_a_tag>

Using the namespace system, you can use any tag and any attribute of any XML DTD. It opens up a world of possibilities, as you can see in the next section...
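
What the one-line example above leaves out is where the 'dbk' prefix comes from. Here is a minimal sketch, assuming the namespace name that DocBook 5 uses (older, purely DTD-based DocBook versions do not define one) and the DocBook tag <emphasis>:

```xml
<?xml version="1.0"?>
<!-- xmlns:dbk binds the 'dbk' prefix to a namespace name (a URI) -->
<sentence xmlns:dbk="http://docbook.org/ns/docbook">
  My <animal>cat</animal> is called <name>Eddy</name>.
  <!-- this element belongs to DocBook, not to our own language -->
  <dbk:emphasis>He is a meta cat.</dbk:emphasis>
</sentence>
```

The prefix itself is arbitrary: what identifies the language is the URI it is bound to. Note also that the declaration only names the namespace; it does not by itself load or check any DTD.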

Available DTDs

Here is a small collection of DTDs that are already (partly) in use.

Links

The W3C, or World Wide Web Consortium
They have info on XML, MathML, CML, RDF, SVG, SOAP, XHTML, namespaces...
www.w3.org

Some stuff by Jaime Villate (you may need an online translator to read the first two:)
Introduction to XML(in Spanish)
How to generate HTML with XML(in Spanish)
LSM-slides

HTML tidy, the program:
www.w3.org/People/Raggett/tidy

DocBook
www.docbook.org

Mozilla.org's SVG project
www.mozilla.org/projects/svg

Relevant LinuxFocus articles:
Using XML and XSLT to build LinuxFocus.org(/Nederlands)
Making PDF documents with DocBook