Monthly Archives: December 2014

JSON and YAML: Not a pair that fits every foot (but XML sucks)

It is good to reflect on exactly how hard a problem it is to define a consistent cross-platform data representation. Most of the time (especially on the web) we just shovel data around, let things be inconsistent, avoid conflicts by pretending they don’t happen, and maintain a general disregard for data consistency. This attitude is, sadly, what has come to characterize “NoSQL” in my mind, though in a strict sense that is not fair at all (GIS and graph databases aren’t SQL systems, and some are very solid — PostGIS being the exception in that it is a surprisingly well made extension to a surprisingly solid SQL-based RDBMS).

Obviously this isn’t a good attitude to have when dealing with things more important than small games or social media distractions. That said, most of the code written today seems to fall into those two categories, and many a career is spent exclusively roaming the range between them (whether most of the crap that constitutes the web is itself a “game” is worth thinking about, too, whether we measure it in SEO, mindshare in the blogosphere, StackExchange rep, Facebook likes/friends/whatever, pingbacks, comment counts, etc.). We focus so much on these trivial and often meaningless cases that an entire generation of would-be programmers has no idea what the shape of data is really about.

When you really need a consistent data representation that can survive the network (ouch! that’s no mean feat!), can consistently be coerced into a known, predictable, serialized representation, and can be handled by generated code in nearly any language, you need ASN.1.
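For flavor, here is a minimal sketch of what an ASN.1 definition looks like; the module and field names are invented for illustration, not taken from any real project:

```asn.1
-- Hypothetical module: every field has an explicit type and constraints,
-- and the wire encoding (BER/DER/PER/...) is chosen separately from this.
EmployeeSketch DEFINITIONS AUTOMATIC TAGS ::= BEGIN

    Employee ::= SEQUENCE {
        id      INTEGER (0..4294967295),
        name    UTF8String (SIZE (1..128)),
        active  BOOLEAN DEFAULT TRUE
    }

END
```

From a definition like this a compiler can generate encoders and decoders in C, Java, Python, or whatever else you need, and every one of them agrees on exactly what an Employee is.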

But ASN.1 is hard to learn (or even find resources on outside of telecom projects), and JSON and YAML are easy to reference and (initially) use. XML was made unnecessarily hard, I think as a cosmic joke on people who never heard the term “S-expression”, but very basic XML seems easy, even if it’s something you would never want to type by hand (though that always seems to wind up being necessary, despite our best efforts at tooling…).

Why not just use JSON, YAML or XML everywhere? That bit above, about a consistent representation: that’s why. Well, that’s part of why. Another part of why is that despite your best efforts to define things in XML or nest explicit declarations in YAML/JSON, you will always wind up either missing something or finding yourself needing to change some type information you embedded as a nested element in your data definition, at which point you get to write a sed or awk script just to make the change (and if you’re the type who thinks “ah, a simple search/replace in my IDE…” and “IDE” to you doesn’t basically equate to “Emacs” or your shell itself, you’re going to a gunfight with boxing gloves on; if you need a better IDE to manage your language then you really need a better language).

The problems with YAML/JSON/XML are twofold. First, the structure of your data is not formally defined anywhere, so while you may have a standard of sorts somewhere, there is no way to enforce that standard. The alternative is to include type information everywhere: as attributes within your tags (in XML), as nested tagged groups, or as a massive reference chain of type -> reference pointer -> data entry in YAML (or by nesting everything to an insane degree in JSON). But then, second, making a change to the type of a field in a record type you have 20 million instances of is… problematic.
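To make the enforcement problem concrete, here is a small Python sketch (the field names are invented): JSON will happily parse anything vaguely shaped like your record, and type distinctions quietly vanish on the round trip.

```python
import json

# A record we *intend* to have a fixed shape: an int id, a tuple of floats.
record = {"id": 7, "position": (1.5, 2.5)}

encoded = json.dumps(record)
decoded = json.loads(encoded)

# The tuple silently became a list -- JSON has no tuple type.
print(type(decoded["position"]).__name__)  # prints "list"

# And nothing stops a "wrong" document from parsing just fine.
imposter = json.loads('{"id": "seven", "position": "somewhere"}')
print(imposter["id"])  # prints "seven" -- no schema, no complaint
```

Any enforcement has to live in code you write and maintain yourself, on every end of every wire, in every language that touches the data.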

And we haven’t even discussed representation. “But it’s all text, right?” Oh… you silly person…

Everything is numbers. Two numbers, actually: 0 and 1. We know this, and we sort of forget that there are several ways of interpreting those numbers as bigger numbers, those bigger numbers as textual components, those textual components as actual text, and that actual text (finally) as the glyphs you see when you use “the typewriter part” or look at “the TV part” (or do anything with the little touchscreens we use everywhere these days, for which nobody seems to have worked out a genuinely solid interface solution just yet).

Every layer of that chain of interpretation I mentioned above can be done several ways. Every layer. Think about that for a second. Now, if you live purely in a single world (like modern Linuxes and probably newer versions of OSX) where there is only UTF-8, then about half the possible permutations are eliminated. If you only ever deal with unaccented characters that fall within the 128 defined by ASCII, then several more permutations are eliminated — and you should dance with joy.
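A tiny Python demonstration of the layers disagreeing: the same two bytes are one character under one interpretation and two under another, while 7-bit ASCII is the one region where the common encodings all agree.

```python
# The two bytes 0xC3 0xA9 are one character in UTF-8, two in Latin-1.
data = b"\xc3\xa9"

print(data.decode("utf-8"))    # prints "é"  (one code point, U+00E9)
print(data.decode("latin-1"))  # prints "Ã©" (two code points)

# Pure 7-bit ASCII is where the common encodings coincide:
ascii_text = "plain"
assert (ascii_text.encode("utf-8")
        == ascii_text.encode("latin-1")
        == ascii_text.encode("ascii"))
```

Neither decode raises an error; both produce perfectly legal text. Which one is *right* is a fact that lives outside the bytes themselves.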

Unless you deal with a bit of non-textual data in addition to the textual stuff. You know, like pictures and sounds and application-produced opaque binary data and whatnot. If that’s the case, you should tremble. Or… oh god, no… what if your data doesn’t stand alone? What if all those letters are supposed to actually mean something? “We have lots of data” isn’t nearly as important to customers as “we have lots of meanings” — but don’t ask a customer about that directly; they will have no idea what you mean, because all the text stuff already means something to them.
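JSON makes the textual/binary divide concrete: it has no binary type at all, so opaque bytes have to be smuggled through yet another layer of interpretation, usually base64. A Python sketch (the key name is invented):

```python
import base64
import json

blob = bytes([0, 159, 146, 150])  # opaque binary data, not valid UTF-8

# json.dumps cannot serialize bytes at all; it raises TypeError.
try:
    json.dumps({"thumbnail": blob})
except TypeError:
    pass  # exactly the problem: "it's all text" stops being true here

# The usual workaround: one more layer of interpretation, base64.
doc = json.dumps({"thumbnail": base64.b64encode(blob).decode("ascii")})
restored = base64.b64decode(json.loads(doc)["thumbnail"])
assert restored == blob
```

Every producer and consumer now has to know, out of band, that this particular string is not really a string — one more unenforced convention riding on top of the data.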