Daily Archives: 2015.04.10 10:48

XML: Xtensively Mucked-up Lists (or “How A Committee Screwed Up Sexps”)

Some folks are puzzled about why I avoid XML. They just can’t understand why I dodge it whenever I can and do crazy things like write ASN.1 specs, use native language terms when possible (like Python config files consisting of Python dicts, Erlang configs consisting of Erlang terms, etc.), consider YAML/JSON a decent last resort, and regard XML as a non-option.

I maintain that XML sucks. I believe that it is, to date, the most perfectly horrible corruption of one of the most universal and simple concepts in computer science: sexps.

ZOMG! Someone screwed up sexps!

Let that one sink in. What a thing to say! How in the world would one even propose to screw up such a simple idea? Let’s consider an example…

Can you identify the semantic difference among the following examples?
(Inspired by the sample XML in the Python xml.etree docs)

Version 1

<country name="Liechtenstein">
  <rank>1</rank>
  <year>2008</year>
  <gdppc>141100</gdppc>
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>

Version 2

<country>
  <name>Liechtenstein</name>
  <rank>1</rank>
  <year>2008</year>
  <gdppc>141100</gdppc>
  <neighbor>
    <name>Austria</name>
    <direction>E</direction>
  </neighbor>
  <neighbor>
    <name>Switzerland</name>
    <direction>W</direction>
  </neighbor>
</country>

Version 3

<country name="Liechtenstein" rank="1" year="2008" gdppc="141100">
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>

Version 4

And here there is a deliberate semantic difference, meant to be illustrative of a certain property of trees… which is supposedly the whole point.

<entries>
  <country rank="1" year="2008" gdppc="141100">
    <name>Liechtenstein</name>
    <neighbors>
      <name direction="E">Austria</name>
      <name direction="W">Switzerland</name>
    </neighbors>
  </country>
</entries>

Which one should you choose for your application? Which one is obvious to a parser? From which could you more than likely write a general parsing routine that could pull out data that meant something? Which one could you turn into a program by defining the identifier tags as functions somewhere?

Consider the last two questions carefully. The so-called “big data” people are hilarious, especially when they are XML people. There is a difference between “not a large enough sample to predict anything specific” and “a statistically significant sample from which generalities can be derived”, certainly, but that has a lot more to do with representative sample data (or rather, how representative the sample is) than the sheer number of petabytes you have sitting on disk somewhere. “Big Data” should really be about “Big Meaning”, but we seem to be so obsessed over the medium that we miss the message. Come to think of it, this is a problem that spans the techniverse — it just happens to be particularly obvious and damaging in the realm of data science.

I hate XML because the complexity and ambiguity introduced in an effort to make the X in XML mean something have crippled its clarity. What good is a data format if it confuses the semantics of the data? XML is unnecessarily ambiguous to the people who have to parse (or design, document, discuss, edit, etc.) XML schemas, and it kills any hope of readily converting generic data represented as XML into a program that can extract its meaning without the extra work of researching a schema, which throws the entire concept of “universality” right out the window.
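To make the schema problem concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree on Versions 1 through 3 above. The reader functions (read_v1, read_v2, read_v3) are hypothetical names made up for this illustration, and only a few of the fields are extracted. The point is that three documents carrying the same data need three different extraction routines, and each routine silently comes back with nothing when handed the wrong layout.

```python
import xml.etree.ElementTree as ET

# The three "same data" layouts from the examples above.
V1 = """<country name="Liechtenstein">
  <rank>1</rank>
  <year>2008</year>
  <gdppc>141100</gdppc>
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>"""

V2 = """<country>
  <name>Liechtenstein</name>
  <rank>1</rank>
  <year>2008</year>
  <gdppc>141100</gdppc>
  <neighbor>
    <name>Austria</name>
    <direction>E</direction>
  </neighbor>
  <neighbor>
    <name>Switzerland</name>
    <direction>W</direction>
  </neighbor>
</country>"""

V3 = """<country name="Liechtenstein" rank="1" year="2008" gdppc="141100">
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>"""


def read_v1(root):
    # Hypothetical reader: name is an attribute, rank is a child element.
    return {
        "name": root.get("name"),
        "rank": root.findtext("rank"),
        "neighbors": [(n.get("name"), n.get("direction"))
                      for n in root.findall("neighbor")],
    }


def read_v2(root):
    # Hypothetical reader: everything is a child element.
    return {
        "name": root.findtext("name"),
        "rank": root.findtext("rank"),
        "neighbors": [(n.findtext("name"), n.findtext("direction"))
                      for n in root.findall("neighbor")],
    }


def read_v3(root):
    # Hypothetical reader: everything is an attribute.
    return {
        "name": root.get("name"),
        "rank": root.get("rank"),
        "neighbors": [(n.get("name"), n.get("direction"))
                      for n in root.findall("neighbor")],
    }


# With the matching reader, all three layouts yield the same record.
records = [reader(ET.fromstring(doc))
           for reader, doc in ((read_v1, V1), (read_v2, V2), (read_v3, V3))]
assert records[0] == records[1] == records[2]

# With a mismatched reader, the data quietly goes missing instead.
mismatched = read_v1(ET.fromstring(V3))
assert mismatched["rank"] is None
```

Nothing in the documents themselves says which reader applies; that knowledge lives in a schema (or in somebody's head), which is exactly the extra research a supposedly universal format was meant to spare us.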

It’s all a lie. A tremendous amount of effort has been wasted over the years producing tools that do nothing more than automate away the mundane annoyances of dealing with the stupid way in which the structure is serialized. These efforts have been trumpeted as a major triumph, and yet they don’t tell us anything about the resulting structure, which itself is still more ambiguous than plain old sexps would have been. It’s not just that it’s a stupid angle-bracket notation when serialized (that’s annoying, but forgivable: most sexp notations are annoying paren, obnoxious semantic whitespace, or confusing ant-poop delimited; there just is no escape from the tyranny of ASCII). XML structure is broken and ambiguous, no matter what representation it takes as characters in a file.