Tag Archives: coding

Zomp/zx: Yet Another Repository System

I’ve been working on a from-source repo system for Erlang on and off for the last few months, contributing time to it pretty much whenever real-life is not interfering. I’m getting close to making a release. Now that my main data bits are worked out, the rest isn’t all that hard. I need to figure out what I want to say in an announcement.

The problem is that I’m really horrible at announcements and this system does things in a pretty different way to other repository systems out there, so I’m not sure what things are going to be important about it to users (worth putting into an announcement) and what things are going to be important to only me because I’m the one who wrote it (and am therefore obsessed with its externally inconsequential internals). What is internally interesting about a project is almost never what is externally interesting about it. Marketing; QED. So I need to sort that out, and writing sometimes helps me sort that kind of thing out.

I’m making this deliberately half-baked, disorganized, over-long post public because Joe Armstrong gave me some food for thought the other day. I had written him my thoughts on a subject that had come up on a mailing list, but sent the message privately. I kept it off-list for two reasons: first, I wasn’t comfortable with my way of expressing the idea just yet; and second, I am busy with real-life stuff and side projects, including the repo system, and don’t want to get sucked into online chatter that might amount to nothing more than bikeshedding. (I’m a world-class bikeshedder!) Joe wrote me back asking why I had made the reply private, I told him my reasons, and he changed my mind. He hopes that more people will publish their ideas all the time, good or bad, fully baked or still soggy, because the only way we ever find other people’s interesting ideas these days is by searching for them, usually in text, somewhere on the net. It isn’t like we can’t go back and revise, but whether or not we do go back and clean up our literary messes, the availability of core ideas and exposure of thought processes are more important than polish. He’s been on a big drive to make sure that he posts most of his thoughts to public mailing lists or blogs so that his ideas get at least indexed and archived. On reflection I agree with him.

So here I am, trying to publicly organize my thoughts on my repository system.

I should start with the goals of the system.

This system is intended to smooth over a few points of pain experienced when trying to get a new Erlang project off the ground, and in particular avert the path of pain peculiar to Erlang newcomers when they encounter the “how to set up a project” problem. Erlang’s tooling is great but a bit crufty (deeply featured, but confusing to interface with) and not at all what the kool kids expect these days. And anyway I’m really just trying to scratch my own itch here.

At the moment we have two de facto standards for publishing Erlang systems: erlang.mk and Rebar. I like both of these, especially erlang.mk, but they do one thing that annoys me and never seems to quite fit my need: they build Erlang releases.

Erlang releases are great. They cut all the cruft of a release out and pack everything needed to actually run a system into a single blob of digits that you can move, in a single shot, to a new target system — including the Erlang runtime itself. Awesome! Self-contained deployment and it never misses. This has been an Erlang feature since before people even realized that they needed repeatable deployment infrastructure outside of the classic “let’s build a monolithic, static binary executable” approach. (Erlang is perpetually ahead of its time, even by today’s standards. I look at the poor kids stubbing their toes with Docker and language du jour and just shake my head — though part of that is because many shops are using Docker to solve concurrency issues that they haven’t even become cognizant of, thinking that they are experiencing “scaling” problems but missing the point entirely.)

Erlang releases are awesome when the deployment target is an embedded system, but not so awesome if the target is a full-blown operating system, VM, container, or virtual environment fully stocked with gobs of memory and storage and flush with system utilities and resources. Erlang releases sort of kitchen-sink the deployment itself. What if you want to run several different Erlang programs, all delivered as releases, all depending on the same library? You’ve got tons of copies of that library. Which is OK, but still sort of weird, because you also have tons of copies of the runtime (among other things). Each release is self-contained and lean, but in aggregate this is a bit odd.

Erlang releases make sense when you’re deploying to a phone switch or a sensor device in the middle of nowhere and the runtime is basically acting as its own operating system. Erlang releases are, in that context, analogous to putting a Gentoo stage 3 binary image on a system to leapfrog most of the toolchain process. Very cool when you’re in that situation, but a bit tinker-tacky when you’re just trying to run, say, a client program written in Erlang or test a web front-end for something that uses YAWS or Cowboy.

So that’s the siloed-kitchen-sink issue. The other issue is that newcomers are perpetually confused about releases. This makes teaching elementary Erlang hard. In my view we should really focus on escript for beginner code — just let the new guy run something out of a single file the way he is used to doing when learning a new language, instead of showing him pages of really slick code, then some interpreter stuff, and then leaping straight from that to a complex and advanced packaging setup necessarily tailored for conducting embedded deployments to slim hardware devices. Seriously. WTF. Escripts give beginners all the power of Erlang they need for exploring the more interesting bits of code and the refactoring involved in learning sequential Erlang, with the major advantage of being able to interface with the system the same way programmers from other environments are used to dealing with language runtimes like Bash, AWK, Python, Ruby, Perl, etc.

But what about that gap between scripts and full-blown production deployments for embedded hardware?

Erlang has… nothing.

That’s right! There is no agreed-upon way to deploy or even run Erlang code in the same manner a Python coder would expect to execute a python program. There is no virtualenv type system, there is no standard answer to the question “if I’m in the project directory and type ./do_thingy it will just work, right?” The answer is always “Well, it depends…” and what actually winds up happening is that people either roll a whole release just to crank a trivial amount of code up or (quite often) implement an ad hoc way to get the same effect in a lighter-weight way. (erlang.mk shines here, actually.)

Erlang does provide a number of ways to make a system run locally from source or .beam files — and actually has quite reasonable built-in resources for this — but nothing has been built around these tools that also deals with external dependencies, argument passing in a standard way, or any of the other little things you really need if you want to claim a complete solution. Hence all the ad hoc solutions that “work on my machine” but certainly aren’t something you expect your users to use (not with broad success, anyway).

This wouldn’t be such a big problem if it weren’t for the fact that not having any standard way to “just run a program” also means that there really isn’t any standard way to deal with client side code in Erlang. This is a big annoyance for me because much of what I do is client-side code. In Erlang.

In fact, it totally boggles my mind that client-side Erlang isn’t more common, especially considering that AMD is already fielding zillion-core processors for desktops, yet most languages are fundamentally single-threaded. That doesn’t mean you can’t do concurrency and parallelism in other languages, but most problems are not parallel in nature to begin with (parallel problems are easy to write solutions to in any language) while most real-world problems are concurrent. But concurrent systems are hard to write in almost every language. Concurrent problems are the bulk of the interesting problems we’re still not very good at solving with computers. AMD is putting the hardware for much more interesting concurrent processing onto the client side (which means Intel will soon start pouring its gajillions worth of blood diamond money into a similar effort), but most languages and environments have no good way to make use of it there. (Do you see why I hear Lady Fortune knocking?)

Browsers? Oh yeah. That’s a great plan. Have you noticed that most sites slowly move toward the “Single Page App” design over time (read as: the web sucks, so now we write full-but-crippled client-programs and deliver them over the web), invest heavily in do-sneaky-things-without-telling-you JavaScript and try to hog every core your system has if you allow it the slightest permission to do so? No. In the age of bitcoin miners embedded in nearly every ad this is not the direction I think we should be envisioning things going.

I want to take better advantage of the cores users have available, and that doesn’t necessarily mean make more efficient use of every cycle as much as it means to make scheduling across processes more efficient to reduce latency throughout the system overall. That’s something users care about quite a lot. This is the problem Erlang has already solved in a way no other runtime out there has. So I want to capitalize on it.

And yet, there is still no standard-ish way of dealing with code from source, running it locally, declaring or resolving dependencies, or even launching a client-side program at all.

So… how am I approaching it?

I have a project called “zomp” which is a repository system. It is a distributed repository system, so not everything has to be held in one place. Code in the zomp universe is held in little semantic silos called “realms”. Each realm can have whatever packages the owner (sysop) wants it to have. Each realm must have one server node somewhere that is its “prime” — the node in charge of that realm. That node is where system operator tasks for that realm take place, where packagers and maintainers submit code for inclusion, where the package index is built, and where the canonical copy of everything is stored. Other nodes configured to see that realm connect to the prime node, receive a copy of the current indexes, and, once tested for availability, are published as resources for querying indexes or downloading packages.

When too many subordinate nodes connect to a prime, the prime redirects new nodes to one of its subordinates; when a subordinate gets “full” of subordinates itself, it picks one of its own subordinates for new redirects, and so on. Each realm thus winds up forming a resource tree of mirror nodes that connect back to the realm prime by a single path. A single node might be prime for several realms, other nodes may act as prime for different realms, and any node can be configured to become part of any number of realm trees.
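
To make the tree-forming behavior concrete, here is a minimal sketch of the redirect decision (not the actual zomp code; the record, function names, and child limit are all invented for illustration):

%%% Hypothetical sketch of the mirror-tree redirect decision.
%%% The record, names, and child limit are invented for illustration.

-define(MAX_CHILDREN, 8).

-record(node, {children = []}).

%% A node accepts a new subordinate if it has room; otherwise it
%% redirects the newcomer to one of its own subordinates, which then
%% makes the same decision. The result is a tree rooted at the prime.
accept_or_redirect(NewPeer, State = #node{children = Children}) ->
    case length(Children) < ?MAX_CHILDREN of
        true  -> {accept, State#node{children = [NewPeer | Children]}};
        false -> {redirect, pick_child(Children), State}
    end.

pick_child(Children) ->
    lists:nth(rand:uniform(length(Children)), Children).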

That’s the high-level code division.

The zomp constellation is interfaced with via the “zx” program (short for “zomp explorer”, or “zomp exchanger”, or “Zomp eXtreem!”, or an homage to the Sinclair ZX-81, or whatever else you might want to make up that lends itself to the letters “zx” — I actually forget what it originally stood for, but it is remarkably convenient to type so it’s staying that way).

zx is configured to have visibility on zomp realms the same way a zomp node is (in fact, they use the same configuration files and it isn’t weird to temporarily host a zomp node on your desktop the same way you might host a torrent node for a while — the only extra effort is that you do have to open a port, zomp doesn’t (yet) do hole punching magic).

You can tell zx to run a program using the highly counter-intuitive command:

zx run Realm-ProgramName[-Version]

It breaks the program name down into:

  • Realm (optional, defaulting to the main realm of public FOSS packages called “otpr”)
  • Name (necessary — sort of the whole point)
  • Version (which is optional and can also be partial: “1.0.3” vs just “1.0” or “1”, defaulting to the latest in a series or latest overall)

With those components it then contacts any zomp node it knows provides the needed realm, resolves the latest version number of the requested program, downloads and unpacks it, checks and downloads any missing dependencies, builds the program, and launches it. (And if it doesn’t know any active mirrors it asks the prime node and is seeded with known mirror nodes in addition to getting its query answered.)
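
For a rough idea of what that breakdown looks like, here is a sketch of parsing a package identifier into its parts. This is not zx’s actual implementation, the package name is made up, and it glosses over the ambiguity of a name given with a version but no realm:

%%% Illustrative only, not the real zx parser.
%%% "otpr-zappy-1.0.3" -> {"otpr", "zappy", "1.0.3"}
%%% "zappy"            -> {"otpr", "zappy", latest}
parse_package_id(String) ->
    case string:split(String, "-", all) of
        [Name]                 -> {"otpr", Name, latest};
        [Realm, Name]          -> {Realm, Name, latest};
        [Realm, Name, Version] -> {Realm, Name, Version}
    end.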

The packages are kept in a local cache stored at the user level, not the system level (sort of like how browsers keep their JS and page caches) — though if you want to daemonize zomp and run it as a permanent service (if you run a realm prime, for example) then you would want to create an unprivileged system user specifically for the purpose. If you specify a fully-qualified “realm-name-version” for execution and the packages already exist and are built, zx just launches the code directly (which is the majority case, so no delay there — fast startup).

All zomp nodes carry a complete index of their configured realms and can answer queries with very little overhead, but only the prime node has a copy of all the packages for that realm.

Zomp realms are write-only. There is no facility for removing a package from a realm entirely, only for upgrading the versions of packages whenever necessary. (Removal is, of course, possible, but requires manual intervention by the sysop.)

When a zx client or zomp node asks an upstream node for a package and the upstream node does not have a copy it will query its upstream until the request reaches a node that does have a copy. Once found a “found” notice goes back down to the client telling it how many hops away the package is, and new “hops away” notices are sent as the package is passed downstream toward the original requestor (avoiding timeouts and allowing the user to get some feedback about what is going on). The package is cached at each node along the way, so subsequent requests for that same package will be handled immediately without any more relay downloading.

Because the tree of nodes is expected to be relatively ephemeral and in a constant state of flux, the tendency is for package stores on mirror nodes to be populated by only the latest, most popular packages. This avoids the annoying situation where an old realm has gobs of packages that nobody uses anymore but mirror hosts are burdened with storing them all anyway.

But why not just keep the latest of everything and ditch old packages?

Ever heard of “version shear”? Yeah. Me too. It sucks. That’s why.

There are no “up to” or “greater than” or “abstract version 3” type dependency declarations in zomp package metadata. As a package maintainer you must explicitly declare the complete version of each dependency in your system. In the case of diamond-shaped dependencies (where two packages in your system depend on slightly different versions of the same package) the burden is on the packagers to declare a version that works for a given release of that package. There are no dependency trees for this reason. If your package depends on X, and X depends on Y and Z then your package must be defined as depending on X, Y and Z — and fully specify the versions involved.

Semver is strictly enforced, by the way. That is, all release numbers are “Major.Minor.Patch”. And that’s it. No more, no less. This is one of the primary criteria for inclusion into a public realm and central to the way both zx and zomp interpret package semantics. If an upstream project has some other numbering scheme the packager will need to create a semver standard of his own, and this actually turns out not to be very hard in practice. There is one weird side effect of full, static dependency version declarations plus semver: updating dependencies increments your package’s patch number, so a program whose own code hasn’t changed in a long time but whose many dependencies are under heavy development may wind up on version 2.3.257 with little changing other than the {deps, PackageIDs}. line in the package meta file.
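
As an illustration (a mock-up, not necessarily the exact zomp.meta layout), a fully pinned dependency list might look something like the following; bump any one of those pinned versions and your own patch number goes up with it:

%%% Mocked-up meta file fragment. The keys and package names here are
%%% illustrative; the point is that every dependency carries a full
%%% Major.Minor.Patch version.
{name,    "frobnicator"}.
{realm,   "otpr"}.
{version, "2.3.257"}.
{deps,    ["otpr-foolib-1.4.2",
           "otpr-barlib-0.9.12",
           "otpr-bazlib-3.0.1"]}.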

zx helps make you aware of these situations, so solving them has not been particularly difficult in practice.

Why do things this way?

The “static dependencies forever and ever, amen” decision is a tradeoff between the fully repeatable builds Erlang releases are famous for (to the point of bug-compatibility between deployment sites — which is critical in production) and the flexibility users and developers have come to expect from source repository systems like pip/PyPI, CPAN, etc. Because each realm is write-only there is no danger that a package will be superseded and disappear. The way trickle-down caching works for mirror zomp nodes does not unduly burden the subordinate realm mirrors, and the local caching behavior of zx itself at launch time tends to make all of this mostly delay-free for zx clients while still giving them the option to always run the “latest available version” if they want.

And on the note of “latest version”…

Client-side programs are not expected to run for terribly long stretches at a time. People shut desktop programs down, restart computers, update their kernels, etc. So even a long-running client program (think web browsers, email clients, IRC clients, certain games, crypto wallets/miners, torrent nodes, Freenet, Tor, etc.) will still get a chance to restart every few days or weeks and check for a new version (if it is invoked in a way that omits the version number so that it always queries the latest).

But what about long-running, server-side type programs? When zx starts, a launch script checks the initial environment and then starts the Erlang runtime with zx as its target application, passing it the package ID of the desired program to run along with that program’s arguments. That last sentence was odd. An example is helpful:

zx run foo-bar arg1 arg2 arg3

This invokes the launching script (a Bash script on Linux, BSD and OSX, a batch file on Windows — so the actual command is zx.bash or zx.cmd) with the arguments run foo-bar arg1 arg2 arg3. zx receives the instruction “run” and then breaks “foo-bar” into {Realm, Name} = {"foo", "bar"}. Everything after that is passed through as strings and winds up as the input arguments to the program being run (“foo-bar” in this case).

zx registers a process called zx_daemon which remains resident in the runtime and waits for a subscription request or zomp query. Any Erlang program written with the intention of being used with zx can send a message to zx_daemon and ask it to maintain a connection to the program’s parent realm and enroll for update notifications. If the target program itself is the subject of a realm index update then it will get a message letting it know what has changed. The program can respond any way the author wants to such a notification.

In this way it is possible to write a client-side or server-side application that can enroll to become aware of updates to itself without any extra infrastructure and a minimal amount of code. In some programs I’ve used this to cause a pop up notification to appear to desktop users so they know that a new version has become available and they should restart the program (the way Firefox does on Windows). It could also be used to initiate a restart on its own, or whatever else you might come up with.
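
The enrollment itself needs very little code. The message shapes below are guesses for illustration (the post does not spell out the actual zx_daemon protocol), but the idea from the program’s side is roughly:

%%% Hypothetical sketch; the real message format zx_daemon expects may differ.

subscribe_to_updates() ->
    zx_daemon ! {subscribe, self()},
    ok.

%% Somewhere in a gen_server-ish process loop:
handle_info({zx_update, PackageID}, State) ->
    %% React however you like: pop a notification, schedule a restart, etc.
    %% (notify_user_of_update/1 is a made-up placeholder.)
    notify_user_of_update(PackageID),
    {noreply, State}.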

There are several benefits to developers of using this system as well.

As a developer I can start a new project by doing zx init app [Realm-Name] or zx init lib [Realm-Name] in an existing project root directory and a zomp.meta file will be generated for it, or a new project template directory will be created (populated with a functioning sample skeleton project). I can do zx dialyze and zx will make sure a generally relevant PLT exists (building or refreshing it if it is missing or out of date) and use it to check the typespecs of the project and its dependencies. zx create package [Path] will create a zomp package, sign it, and populate the metadata for it. zomp keygen will generate the kind of keys necessary to interact with a zomp server. zomp submit PackageFilePath will submit a package for review.

And so on. It is a lot easier to do most things now, and that’s the main point.

(There are commands for reviewing, approving, or rejecting package submissions, adding packagers and maintainers to package projects, adding dependencies to projects, X.Y.Z version incrementing, etc. as well.)

This is about 90% of the way I want it to be, but that means about 90% of the effort remains (pessimistically assuming the 90/10 rule, because life sucks and nobody cares). Most of that is probably going to be finagling some network lunacy, but a lot of the effort is going to be in putting polish to it.

Zomp/zx is based on a similar project I wrote for use within Tsuriai a few years ago that has much sparser features but does basically the same thing: eases packaging and repeatable deployment from source to client systems. I would never release that version publicly because it has a lot of “works for me!” level functionality, but very little polish and requires manually diddling quite a few settings files in error-prone ways (which is fine because it was just us diddling them).

My intention here is to Cadillac this out a bit so that newcomers can slide into the new language and just focus on that language after learning a minimum of tooling commands or environmental details. I think zx init app foo-bar and zx runlocal are a low enough bar for entry.

(Personal) Guidelines for Software Projects

A few guidelines for non-trivial, large projects you actually care about and want to maintain for more than a month or so.

1. Typespecs

Learn to use them. If you are writing a large, complex project in a language that doesn’t support this or have tooling for it then use a different language. Yes, it actually saves so much heartache that it is important enough to switch.

Why? Because for-real type checking can tell you, without the futility or religious interference of unit testing, whether or not your program is valid. A valid program is not necessarily a correct program, but an invalid program is necessarily an incorrect one. (Also, it is worth keeping in mind that classes are not types. There is a subtle, and critical, difference.)
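
In Erlang terms this means writing -type and -spec declarations and letting Dialyzer check them against a PLT. A trivial sketch of what that looks like:

%%% A typespec turns "I think this takes a key list" into something
%%% Dialyzer can actually check.

-type key()   :: binary().
-type value() :: term().

-spec lookup(key(), [{key(), value()}]) -> {ok, value()} | not_found.

lookup(Key, Pairs) ->
    case lists:keyfind(Key, 1, Pairs) of
        {Key, Value} -> {ok, Value};
        false        -> not_found
    end.

Run Dialyzer over the module and it will complain when some caller feeds lookup/2 something that can never match the spec.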

2. Property testing, not unit testing

Don’t simply write a few “unit tests” and assume things work. They don’t. As Rich Hickey (the creator of Clojure) so aptly put it: “What is the one thing that is true about all bugs found in the wild? Every one of them passed all the tests!” It can be useful to engage in regression testing, but regression testing is a subset of integration testing and even crosses over with user testing (the ultimate of all) and project documentation and history management.

When you write code, it has bugs.

  • Some are syntactic: You forgot some ant poop somewhere (things like: : ; . ,), failed to close a brace or paren, or misspelled something.
  • Some are structural: You passed in a foo type but the function is defined as accepting bar (statistically this is the greatest category of compilable, invisible errors — reference point 1 above).
  • Some are scheduling and timing: You have races and deadlocks all over the place and never knew it because they don’t usually get triggered and are super complex to work out in your head.
  • Some are semantic: The program does precisely what you told it to do, but you told it to do the wrong thing (the most frequent place where protocol failures creep in).

You write every one of these kinds of bugs into your programs every time you write a non-trivial program. I can’t just tell you to knock it off and tighten your shot group, because I do the same stuff; it is impossible to avoid! And if you write all these stupid bugs into your programs, what do you think lurks in your hand-written test code? MORE BUGS!

So what do?

In the same way that we can write a type specification for a function (declare its domain and codomain, basically) we can also write a specification for the function’s valid inputs and outputs and the rules the output is expected to follow (its range and image, basically). This defines the properties of the function.

Neat-O. But what would we do with such a specification? Property declarations are like me explaining to you what a function does, but not how it manages to do it. To test whether our implementation of the function does the expected thing and lacks corner cases, we can use a property-based testing system to generate tests for us on the fly and run them to check whether the expected properties of the function hold true. Better still, smart property-based testing systems not only find bugs (values that are defined as valid but produce invalid results that violate the property specification) but can quite often home in on specific broken cases and give you a good indication of what sorts of values are problematic. That is to say, a property-based testing engine equipped with good property definitions can locate the corner cases for you.

Why wouldn’t we do this by hand? Because typically unit tests cover a handful of most-common cases with their expected values and that’s about it. Property based testing is much less merciful and also much less prone to error because a property based tester will generate an endless stream of tests according to the provided properties and run them for as much CPU time as you’re willing to give for testing. You are never going to write millions of different test cases for your code. A property based testing engine will do precisely that if you give it the CPU time to do so. Compared to how testing is done in most projects this is like having nuclear power in the age of wooden stoves.
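
In Erlang land the usual tool for this is PropEr (or Quviq QuickCheck if you have a license). A property is just a statement that must hold for all generated inputs. For example, a round-trip property for a hypothetical encode/decode pair might look like this sketch:

%%% A PropEr-style property: decoding an encoded term must return the
%%% original term, for any term. encode/1 and decode/1 are hypothetical.
-include_lib("proper/include/proper.hrl").

prop_roundtrip() ->
    ?FORALL(Term, any(), decode(encode(Term)) =:= Term).

%% From the shell: proper:quickcheck(prop_roundtrip(), [{numtests, 10000}]).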

This is magical.

3. DO USER TESTING

When you release something that has only been shown to work for you so far, that deserves about as much confidence as an alpha release. “Works for me!” are the bold last words of many an abandoned project.

Don’t be That Guy. Don’t release That Project as a final. Be clear that it’s a beta or even an alpha, and that development is an ongoing thing, forever. Manage expectations; your users (paying or community) will reward you for being honest.

When you release a project understand that this is your beta period, even if you’re on a relatively mature version. In a sense all significant features go through their own little beta phase. This is true in part because you’ve no clue if power users are going to find a way to break it (they will) or if it will be instantly appreciated and adopted by the userbase (random gamble there). Whatever you think is important or intuitive might have never even occurred to them.

Power users are going to push the button the wrong way and not know how to deal. That’s actually a good thing if you maintain a relationship with your users, because you’re basically getting directions straight from the affected party about how to make your program better. This is important whether you’re doing community open source for some sweet Ego Points, or trying to feed the kids at your soul-crushing job.

No amount of unit testing (which we’ve already sort of debunked — write typespecs, don’t blindly churn out unit tests) or property testing (which is vastly superior to unit testing, but misses a lot of side-effecty issues, which are often the central purpose of your program) can catch everything. No amount of integration testing will uncover everything that is wrong with your program. None of these tests will tell you whether your program sucks to use and is reviled by users. But user testing will.

4. Don’t be afraid to change stuff

You have a version control system for your code. You use git. Or something. It doesn’t matter, though, because you have something that does version control for you and creating a new branch is painless. (Unless you’re not using a version control system… then you really need to start. You don’t have to submit to the dark cabal of Ruby hipsters that controls github, but you should at least be using git locally.)

If you have an idea try it out. It is probably a great idea in spirit but won’t be so great in reality until you’ve shaken a bit of the stupid, self-indulgent fantasy out of it. You can’t do that without exploring the idea in actual implementation and that sort of exploration requires hacking up your pristine project a bit until you discover exactly why, in mechanical terms, the Universe hates your idea. Once you know exactly why the Universe hates you and your ideas you can adjust your plan to accommodate the whims of the math gods, tame the vagaries of digital magika, and tap out the proper incantations in much less time than you could had you just held endless meetings about it.

Break stuff. Remember the Cardinal Rule of Hacking:

“If you understand what you’re doing, you’re not learning anything.”
– Some guy (who was not actually Abraham Lincoln)

Sometimes the best sign of progress is a change in the error messages you are getting.

Simplicity follows complexity. Until you write a godawful fugly version of your solution you don’t really understand the problem. If you don’t fully grok the problem how can you ever hope to come up with a solution? Only after you have encountered all the little gotchas that made the code ugly in the first place are you ready to rewrite that steaming pile of (working) poo into an elegant solution that is almost guaranteed to have fewer bugs if for no other reason than increased transparency and better organization of the code.

(But note that you could stick with the ugly version for a bit in a pinch — so not all is lost. Getting something working at all is better than having a bunch of great ideas that don’t exist in reality.)

5. Don’t be afraid of new languages

At this point in my life I’ve written code in about 30 or 40 languages. I don’t know the exact number. I have written a lot of code and gained intimacy with about 10 of those. That’s a lot of languages by some standards and not many at all by others. It is enough, though, that I have come to realize that most languages are minor syntactic variations on a couple of basic paradigms, and really none of that crap matters too much.

It’s all shitty. All languages suck. Some suck a little less than others. Try to find one from the handful that sucks dramatically less than others in a specific domain, then get comfortable with it as a go-to tool for that domain. But remember that it is just a tool. Jackhammers are tools, but I don’t see anyone building houses with them.

When you hop on to a new project that someone is already working on you’re going to have to pretty much adhere to the rules of their house, and that means dealing with whatever annoying language they wrote their awesome project in.

Want to hack on Freenet‘s core implementation? Better not mind dealing with network code and file operations in Java (eek!). And what if you don’t even know Java, or it has been years since you saw it last and everything is different now? This is the concern that should worry you the least of all.

If you squint a little, projects basically are languages. They have their own semantics (the project libs, its functions, its type specifications, its class definitions, its decision tables, its… whatever it’s got that is relevant). They have their own sort of syntax. In fact, every very large project I’ve ever worked on tended to follow Greenspun’s Tenth Rule, and concurrent systems (so common today) even tend to follow Virding’s First Rule. (That becomes less of a joke and more of a law of nature the longer you do this and the more you know about both Lisp and OTP.)

What does this mean? It means that learning the language a program is written in is the easy part. Learning the libs of that language tends to take about twice as long as learning the language itself. Learning the internals of a large project, however, tends to take about ten times longer than that. So where is the real cost in effort here? It isn’t in the adoption of a new language. It is in the adoption of a new project, because every project is a tarbaby.

6. JUST OPEN YOUR EDITOR YOU PROCRASTINATING SACK OF POO!

Getting started is the hardest part of writing anything, whether prose, code, or poetry: sitting down and actually typing something out.

How to tackle the procrastination problem? Easier said than done: OPEN YOUR EDITOR

3 to 5 letters is all you need: `vim` or `emacs` and away you go!

Once you’re fully in the Matrix, write a function or spec or something. It doesn’t matter what you try to do: it will be wrong. And then you’ll have been wrong, but not exhausted yet. And suddenly you’ll realize that you are the one being wrong on the internet today and that situation just cannot stand. So you’ll start fixing it. And tinkering on it. And before you know it you’ll actually have done something productive, the curse of social media will be temporarily suspended, and you’ll finally stop feeling so crap about yourself (for a few minutes, anyway).

Web Designers: Stop making SPAs for inherently web 1.0 style sites

It is 2017. What’s with the drive to make everything an SPA whether it needs to be or not? This is getting a little ridiculous. I’m going to ramble on below a bit because I’ve got a hankering to do so — pay this no mind.

All around the web I see sites that are best represented as a collection of inter-linked documents, and all around the web I see many of those being changed into single-page applications (SPAs). Even more stupid is when the SPA in question was built by some naive dope who included a little bit of almost every JS framework in existence — including a random selection from the thousands of obsolete and dead ones.

What is the goal? What’s the deal? Do web authors today not know how the web was actually intended to work originally? That document publication is actually its reason for existence in the first place and that “web applications” are a new thing that is a backhack to an incomplete standard that only sorta-kinda-works?

Granted, the reason it only sorta-kinda-works is due mostly to the problems inherent in the fact that only a single language is allowed in scripts… which is ridiculous. Was nobody paying attention to the Guile2 approach all those years? The only lesson learned from the Java applet and Flash experience seems to have been that “it sucks to force users to install runtimes as plugins”. Ugh.

Anyway, back to web applications…

I get it. For the moment we don’t have a solid distinction between “a document browser” and “an application browser” so we are stuck with this insufficient worst-of-both-worlds nether region of “applications that masquerade as documents”. And that drives anyone nuts who has given this much thought.

Not that a lot of people have considered the difference deeply. I imagine that is probably because very few new coders today have ever written more than a line or two of code intended to run natively on a user’s local system. Nearly everyone has written thousands of lines of code intended to run natively on server-side systems, but even that is getting wonky because many youngsters today don’t know how to deploy without using Docker yet lack the faintest inkling as to what problems Docker actually is intended to solve and wind up bypassing better solutions when they exist.

Tools shine when they are used in a focused way, performing the job for which they were intended. The web is the same way. Yes, it is a big jumble of crap. So let’s just leave that there. Networks are a big jumble of crap, too, and so are human societies — so we’ve adopted dirty ways of dealing with the dirt. The jumbly pile of shit that is the web is one of our ways of dealing with that. Everything times out. Everything is sent in text. Protocols are bloated and redundant. There isn’t even a proper definition of what “valid” HTML and XML and JSON and whatever else is in most cases. It’s all racing toward a singularity where everything is uniformly stupid. But… whatever, it sort of kind of still works — and humans just barely work themselves, so that’s par for the course.

The original web was designed to function as an insecure document publication system where documents could be interlinked. We realized that we could include more interesting stuff by expanding the definition of “document” to include more than just text, and quite recently with HTML5 the way in which documents can be written is only a few orders of magnitude behind, say, LaTeX, in its ability to arrange things on the screen (that feature lag is not entirely the fault of the HTML5 authors).

This gives a lot of freedom to website authors — perhaps too much.

If a website is a set of news articles or academic papers (or even tweets) then you really don’t need a SPA; you need a more traditional sort of “web site”. It can be dressed up all pretty with shiny things sprinkled around, of course, but we don’t want a SPA that mysteriously changes state in ways that keep users from bookmarking things or easily sending one another links to specific resources (something Twitter got right despite some initial confusion over how to frame their content), etc.

If a website is actually just a delivery front end for a graphical RPG, well, obviously the game part of the site is probably best designed as a SPA, but the rest of the site — the forums, armory, character pages, bestiary, fan wiki, manual, guild rankings, lore pages, etc. — are absolutely best presented outside of that SPA as an actual website.

See the difference?

The game example is actually quite useful to contemplate for a variety of reasons. I’ll probably come back and cut this post down to just that part. Either that or eventually come back and rewrite the first bits to more accurately convey the humor with which I, as a graybeard resident in cyberspace for about 30 years now, view the state of the web today.

Whatever you do, dear reader, have fun coding, and remember: Don’t outsmart yourself!

Erlangers! USE LABELS! (aka “Stop Writing Punched-in-the-Face Code Blocks”)

Do you write lambdas directly inline in the argument list of various list functions or list comprehensions? Do you ever do it even though the fun itself, or the other arguments or return assignment/assertion for the call are too long and force you to scrunch that lambda’s definition up into an inline-multiline ball of wild shit? YOU DO? WTF?!?!? AHHHH!

First off, realize this makes you look like a douchebag for not being polite to other people or your future self whenever you do it. There is a big difference for the human reading between:

%%% From shitty_inline.erl

do_whatever(Keys, SomeParameter) ->
    lists:foreach(fun(K) -> case external_lookup(K) of
                  {ok, V} -> do_side_effecty_thing(V, SomeParameter);
                  {error, R} -> report_some_failure(R)
                end
          end, Keys
    ).

and

%%% From shitty_listcomp.erl

do_whatever(Keys, SomeParameter) ->
    [fun(K) -> case external_lookup(K) of
        {ok, V} -> do_side_effecty_thing(V, SomeParameter);
        {error, R} -> report_some_failure(R) end end(Key) || Key <- Keys],
    ok.

and

%%% From less_shitty_listcomp.erl

do_whatever(Keys, SomeParameter) ->
    ExecIfFound = fun(K) -> case external_lookup(K) of
            {ok, V} -> do_side_effecty_thing(V, SomeParameter);
            {error, R} -> report_some_failure(R)
        end
    end,
    [ExecIfFound(Key) || Key <- Keys],
    ok.

and

%%% From labeled_lambda.erl

do_whatever(Keys, SomeParameter) ->
    ExecIfFound =
        fun(Key) ->
            case external_lookup(Key) of
                {ok, Value}     -> do_side_effecty_thing(Value, SomeParameter);
                {error, Reason} -> report_some_failure(Reason)
            end
        end,
    lists:foreach(ExecIfFound, Keys).

and

%%% From isolated_functions.erl

-spec do_whatever(Keys, SomeParameter) -> ok
    when Keys          :: [some_kind_of_key()],
         SomeParameter :: term().

do_whatever(Keys, SomeParameter) ->
    ExecIfFound = fun(Key) -> maybe_do_stuff(Key, SomeParameter) end,
    lists:foreach(ExecIfFound, Keys).

maybe_do_stuff(Key, Param) ->
    case external_lookup(Key) of
        {ok, Value}     -> do_side_effecty_thing(Value, Param);
        {error, Reason} -> report_some_failure(Reason)
    end.

Which versions force your eyes to do less jumping around? How about which version lets you most naturally understand each component of the code independently? Which is more universal? What does code like this translate to after erlc has a go at it?

Are any of these difficult to read? No, of course not. Every version of this is pretty darn basic and common — you need a listy operation but require a closure over some in-scope state to make it work right, so you really do need a lambda instead of being able to look all suave with a fun some_function/1 type thing. So we agree, taken by itself, any version of this is easy to comprehend. But when you are reading through hundreds of these sorts of things at once to understand wtf is going on in a project while also remembering a bunch of other shit code that is lying around and has side effects while trying to recall some detail of a standard while the phone is ringing… things change.

Do I really care which way you do it? In a toy case like this, no. In actual code I have to care about forever and ever — absolutely, yes I do. The fifth version is my definite preference, but the fourth will do just fine also.

(Or even the third, maybe. I tend to disagree with the semantic confusion of using a list comprehension to effect a loop over a list of values only for the side effects without returning a value — partly because this is semantically ambiguous, and also because whenever possible I like every expression of my code to either be an assignment or an assertion (so every line should normally have a = on it). In other words, use lists:foreach/2 in these cases, not a list comp. I especially disagree with using a listcomp here because the main utility of a list comprehension is normally to achieve a closure over local state, but here we are just calling another closure — so semantic fail there, twice.)

But what about my lolspeed?!?

I don’t know, but let’s see. I’ve created five modules, based on the above examples:

  1. shitty_inline.erl
  2. shitty_listcomp.erl
  3. less_shitty_listcomp.erl
  4. labeled_lambda.erl
  5. isolated_functions.erl

These all call the same helpers that do basically nothing important other than having actual side effects when called (they call io:format/2). What we are interested in here is the generated assembler. What is the cost of introducing these labels that help the humans out VS leaving things all messy the way we imagine might be faster for the runtime?

It turns out that just like with using assignments to document your code, there is zero cost to labeling your lambdas. For example, here is the assembler for shitty_inline.erl side-by-side with labeled_lambda.erl:

Oooh, look. The exact same stuff!

(This is a screenshot, a text file with the contents shown is here: label_example_comparison.txt)

See? All that annoying-to-read inline lambdaness buys you absolutely nothing. You’re not helping the compiler, you’re not helping the runtime, and you are hurting your future self and anyone you want to work with on the same code later. (Note: You can generate precompiler output with erlc -P and erlc -E, and assembler output with erlc -S. Here is the manpage. Play around with it a bit, BEAM and EVM are amazing platforms, wide open for exploration!)

So use labels.

As for execution speed… all of these perform basically the same, except for the last one, isolated_functions.erl. Here is the assembler for that one: isolated_functions.S. This outperforms the others, though to a relatively insignificant degree. Of course, it is only an “insignificant degree” until that part of the program is the most critical part of whatever your program does — then even a 10% difference may be a really huge win for you. In those cases it is worth it to refactor to test the speed of different representations against each version of the runtime you happen to be using — and all thoughts on mere style have to take a backseat. But this is never the case for the vast majority of our code.

(I’ve read reports in the past that indicate 99% of our performance bottlenecks tend to reside in less than 1% of our code by line count — but I can’t recall the names of any just now. If you happen to find a reference, let me know so I can update this little parenthetical blurb with some hard references.)

My point here is that breaking every lambda out into a separate named function isn’t always worth it — sometimes an in-place lambda really is more idiomatic and easier to understand simply because you can see everything right there in the same function body. What you don’t want to see is a multi-line lambda squashed into an argument list, making things hard to read while compiling down to the exact same result as labeling that lambda with a meaningful variable name on another line and referring to it where it is invoked.

zUUID: An Example Erlang/OTP Project

I was talking with a friend of mine yesterday about how UUID v2 seems to have evaporated. We looked into things further and found it’s not actually included in RFC 4122! One thing led to another and I wound up writing an example project that is yet another UUID generator/utility in Erlang — but this time it actually has duplicate v1 and v2 detection/correction, and implements something as close as I can find to what is actually defined as UUID version 2 values.

As there are already plenty of UUID projects around I focused on making this one as readable as I possibly could — to include exported documentation, in-source documentation, obvious variable names, full typespecs, my silly little “pure” notation, blatantly obvious bitstring syntax, and the obligatory github presence.
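
As a taste of why the bitstring syntax matters here (this is a generic illustration of a random version 4 UUID, not code lifted from zUUID):

%%% Generic illustration only; zUUID itself does considerably more.
v4() ->
    %% 122 random bits, with the version (4) and variant (binary 10) bits set.
    <<A:48, _:4, B:12, _:2, C:62>> = crypto:strong_rand_bytes(16),
    <<A:48, 4:4, B:12, 2:2, C:62>>.

to_string(<<A:32, B:16, C:16, D:16, E:48>>) ->
    lists:flatten(io_lib:format("~8.16.0b-~4.16.0b-~4.16.0b-~4.16.0b-~12.16.0b",
                                [A, B, C, D, E])).

The bit layout is right there in the pattern, which is exactly the kind of readability the project is going for.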

Hopefully some folks newish to Erlang will come along and explain to me what confuses them about that code, the process of writing it, the documentation conventions, etc. so that I can become a better literate programmer. Of course, since the last thing the world needs is another UUID implementation I suppose I would have had better luck with something at least peripherally related to the web. (>.<)

Binary Search: Random Windowing Over Large Sets

Yesterday I came across a blog post from 2010 that said less than 10% of programmers can write a binary search. At first I thought “ah, what nonsense” and then I realized I probably hadn’t ever written one myself, at least not since BASIC and Pascal were what the cool kids were up to in the ’80s.

So, of course, I had a crack at it. There was an odd stipulation that made the challenge interesting — you couldn’t test the algorithm until you were confident it was correct. In other words, it had to work the first time.

I was wary of fencepost errors (perhaps being self-aware that spending more time in Python and Lisp(s) than C may have made me lazy with array indexes lately) so, on a whim, I decided to use a random window index to guarantee that I was in bounds each iteration. I also wrote it in a recursive style, because it just makes more sense to me that way.

Two things stuck out to me.

Though I was sure what I had written was an accurate representation of what I thought binary search was all about, I couldn’t actually recall ever seeing an implementation, having never taken a programming or algorithm class before (and still owning zero books on the subject, despite a personal resolution to remedy this last year…). So while I was confident that my algorithm would return the index of the target value, I wasn’t at all sure that I was implementing a “binary search” to the snob-standard.

The other thing that made me think twice was simply whether or not I would ever breach the recursion depth limit in Python on really huge sets. Obviously this is possible, but was it likely enough that it would occur in the span of a few thousand runs over large sets? Sometimes what seems statistically unlikely can pop up as a hamstring slicer in practical application. In particular, were the odds good that a random guess would lead the algorithm to follow a series of really bad guesses and therefore occasionally blow up? On the other hand, were the odds better that random guesses would occasionally be so good that on average a random index is better than a halved one (of course, the target itself is always random, so how does this balance)?

I didn’t do any paperwork on this to figure out the probabilities, I just ran the code several thousand times and averaged the results — which were remarkably uniform.

[binsearch results chart]

I split the process of choosing the next index into two different procedures, one that narrows the window to be searched randomly, and another that does it by dividing it in two. Then I made it iterate over ever larger random sets (converted to sorted lists) until I ran out of memory — turns out a list sort needs more than 6GB at around 80,000,000 members or so.

I didn’t spend any time cleaning things up to pursue larger lists (appending guaranteed-larger members instead of sorting would probably permit astronomically huge lists to be searched within 6GB of memory), but the results were pretty interesting when comparing binary search by window halving against binary search by random window narrowing. It turns out that halving is quite consistently better, but not by much, and the gap may narrow at larger sizes (though I’m not going to write a super huge list generator to test that idea just now).

It seems like something about these results is exploitable. But even if it were, the difference between iterating 24 times instead of 34 times over a list of more than 60,000,000 members to find a target item isn’t much in the grand scheme of things. That said, it’s mind-boggling how far from Python’s recursion depth limit you stay, even when searching such a large list.

Here is the code (Python 2).

from __future__ import print_function
import random

def byhalf(r):
    # Classic bisection: pick the midpoint of the current window.
    return (r[0] + r[1]) / 2

def byrand(r):
    # Random windowing: pick any index within the current window.
    return random.randint(r[0], r[1])

def binsearch(t, l, r=None, z=0, op=byhalf):
    # Search for target t in sorted list l, narrowing the window r with
    # the supplied op; z counts the iterations taken so far.
    if r is None:
        r = (0, len(l) - 1)
    i = op(r)
    z += 1

    if t > l[i]:
        return binsearch(t, l, (i + 1, r[1]), z, op)
    elif t < l[i]:
        return binsearch(t, l, (r[0], i - 1), z, op)
    else:
        return z

def doit(z, x):
    l = list(set((int(z * random.random()) for i in xrange(x))))
    l.sort()

    res = {'half': [], 'rand': []}
    for i in range(1000):
        if x > 1:
            target = l[random.randrange(len(l) - 1)]
        elif x == 1:
            target = l[0]
        res['half'].append(binsearch(target, l, op=byhalf))
        res['rand'].append(binsearch(target, l, op=byrand))
    print('length: {0:>12} half:{1:>4} rand:{2:>4}'\
                    .format(len(l),
                            sum(res['half']) / len(res['half']),
                            sum(res['rand']) / len(res['rand'])))

for q in [2 ** x for x in range(27)]:
    doit(1000000000000, q)

Something just smells exploitable about these results, but I can’t put my finger on why just yet. And I don’t have time to think about it further. Anyway, it seems that the damage done by using a random index to make certain you stay within bounds doesn’t actually hurt performance as much as I thought it would. A perhaps useless discovery, but personally interesting nonetheless.

Object-Relation Mismatch: Comparing Strawberries and Sunglasses

I’ve been spending a lot of time lately writing a rather large suite of business applications. The original customer was a construction company which needed a replacement for their estimation system. Then the same customer needed a facility pass management system to tame the insane amount of bit-shoveling/paperwork involved in getting security clearances for workers to perform work at secure sites. Then a human resources system. Then a subcontract management system. Then a project scheduling system. Then an invoicing system.

The point here is, they really liked my initial work, and suddenly I got further orders. Pretty soon after discussing the first few add-on requirements with the customer it became apparent that I was either going to be writing a bunch of independent systems that would eventually have to learn how to talk to each other, or a modular system that covered down on office work as much as possible and could pull data from associated modules as necessary (but by the strictest definition is not an “expert” or ERP system — note that “ERP” is now a buzzword void of meaning just like “cloud”). Obviously, a modular design is the preferred way to go here, and what that costs me in effort making sure that dependencies don’t become big globby cancer balls buys me enormous gains selling the same work, reconfigured, to other customers later and makes it really easy to quickly write add-ons to fill further needs from the same customer.

Typical story. But how am I doing it and what does this have to do with the Dreaded Object-Relation “Impedance” Mismatch? Tools, man, tools. Most of the things I wrote in the past were system level utilities, subsystems, security toys, games, one-off utilities for myself to make my previous office work disappear[1], patches to my own systems, and other odds and ends. I’d never sat down and written a huge system to automate away someone else’s problems, though it turns out this is a lot more fun than you might expect provided you actually take the time to grasp what the customers need beyond what they have the presence of mind to actually say. (And this last point is worthy of an entire series of books no one will ever pay me to write.)

And so tools. I looked around and found a sweet toolkit for ERP called Tryton. I tried it out. It’s pretty cool, but the biggest stepping stones Tryton gives you out of the box are a bunch of pre-defined models. That’s fine, at first, but they are almost exclusively based on non-normalized (as opposed to denormalized) data models. This looked good going in, but turned out to suck horribly as time passed.

Almost all of the problems ultimately trace back to the loose way in which the term “model” is used in ORM. “Model” means both the object definitions and the tables that feed them[2]. Which is downright mad because object member variables are not table columns, not by a mile, and tables can’t do things. This leads to a lot of wacky stuff.

Sometimes you can’t tell if it makes sense to add a method to a model, or to write a function and call it with arguments, because what you’re trying to do isn’t an inherent function of the modeled concept itself (and if you’ve been conned into using Java life sucks even more because this decision has already been made for you regardless of your situation). And then later you forget some of the things you wrote (or rather, where they are located, which equates to forgetting how to call them) because it was never clear from the outset what should be a function, what should be a method, and what is data and what is an object. This makes it unclear what should be an inherited behavior and what should be part of a library (and I’ll avoid ranting about the pointlessness of libraries of Java/struct-based objects). And all this because we love OOP so much that we’re willing to overlook, at the expense of sane project semantics, the obvious fact that business rules are actually all about data and processes, not about objects and methods.

In other words, business rules are about data, most easily conceptualized as nouns, and not really about verbs, most easily conceptualized as functions (and this is the beginning of why using OOP for things other than interface and simulation design is stupid — because it’s impossible to properly subordinate verbs to nouns or vice versa).

Beginning with this conceptual problem you start running into all sorts of weirdness which principally revolves around the related problem that every ORM-based business handling system out there tries to force data into a highly un-normalized data model. I think this is in an effort to make business data modeling “easy”, but it results in conscious efforts by framework designers to prevent their users (the application developers) from ever touching or knowing about SQL. To do that, though, it is necessary to make every part of data constraint, validation, verification, consistency, integrity (even referential integrity!), etc. into methods and functions and processes which live in the application. Instead of building on the fascinating advancements that have been made in data rule systems this approach deliberately tosses them aside and reinvents the wheel, but much worse. This relegates the database itself to actually just being a million-dollar file system[3].

For example, starting out with the estimation stuff wasn’t too hard, and Tryton has a fairly easy-to-use set of invoicing, receiving, accounting and tax configuration modules you can stack on to get some sweet functionality for free. It also has a customer management model and a generalized personal information manager that is supposed to form the basis for human resources management stuff you can build yourself. So this is great, right?

Wrong. Not just wrong because of the non-normalized data (I'll get to that in a moment), but primarily wrong because nearly everything in the system attempts to be object oriented and real data just doesn't work that way at all. I didn't realize this at first, being inexperienced with business applications development. At first I thought, “Ah, if we base our person model on the existing person-party-address chain we can save a lot of writing by simply spending time understanding what's already here.” That sort of worked out. Until the pass management request came in. (That basing the estimation module on the existing sales/orders/invoices chain would be a ridiculous prospect was a far less obvious problem.)

Now I had a new problem. Party objects are one table in the database. People objects are a child class in the application that inherits Party, but in the database they are a separate table that doesn't inherit from the party table (it has a pass-up key instead, to keep the framework portable to database backends that don't support inheritance or other useful features — more on that mess later). And addresses are represented in the database as a child table of the party table, but as independent objects within the OO system at the application server level.

Still doesn’t sound horrible, accept that it requires a lot of gymnastics to do handle security checks and passes this way. In particular getting security clearances for workers involves explaining two things in excruciating detail: family relationships and address histories.

The first problem has absolutely no parallel in Tryton, so writing my own solution was the only way to proceed. This actually turned out to be easier than tackling the second problem, specifically because it let me write a data model first that was unencumbered by any design assumptions inherent in the system (other than fighting with the basic OOP one-table-per-model silliness). What was really required was to understand what constitutes a family. You can't adopt a sibling, but a parent can adopt you, and reproduction is what makes people to begin with, which requires an M+F pair, plus an extra slot in each direction for adoption/step relationships. So every person who shares a parent with you is a sibling. Label them based on sex and distance and voilà! we've got a self-mapping family model. Cake. Oh wait, that's only cake in SQL. It's actually really, really ugly to do that from within OOP ORM code. But enough about families. That was the easy part.
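To make the “cake in SQL” bit a little more concrete, here is a minimal sketch of the idea with invented table and column names (the real model obviously carries more detail): parentage is the only thing stored, and siblinghood simply falls out of a self-join.

    -- Hypothetical names; a simplified sketch, not the production schema.
    CREATE TABLE person (
        id              integer PRIMARY KEY,
        name            text    NOT NULL,
        sex             char(1) NOT NULL CHECK (sex IN ('M', 'F')),
        mother          integer REFERENCES person (id),
        father          integer REFERENCES person (id),
        adoptive_mother integer REFERENCES person (id),
        adoptive_father integer REFERENCES person (id)
    );

    -- Anyone who shares at least one parent with you is a sibling.
    -- (NULL parents never compare equal, so unknown parentage creates no false siblings.)
    CREATE VIEW sibling AS
    SELECT a.id  AS person,
           b.id  AS sibling,
           b.sex AS sibling_sex
      FROM person a
      JOIN person b
        ON a.id <> b.id
       AND (a.mother = b.mother OR a.father = b.father);

Doing the same thing through an ORM means materializing objects and walking relations in application code, which is exactly the ugliness complained about above.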

Addresses were way more problematic. Most software systems written in Western languages were developed in (surprise!) the West. The addressing systems in the West vary greatly, and dealing with this variance is a major PITA, so most software is written to completely ignore the interesting problem worth solving and instead pretend that addresses are always just three text strings (usually called something like “address_1”, “address_2” and “postal_code”). In line with the trend of ignoring the nature of the data being dealt with, most personnel/party data management models plop those three address elements directly into the “person” (or “party” or “partner”, etc.) table. This is what Tryton does.

But there’s a bunch of problems here.

For one we’ve completely removed any chance of a person or party having two addresses without either adding more columns (the totally stupid, but most common approach) or adding a separate table and letting our existing columns wither on the vine. “Why not remove them?” — because removing columns in a pre-fab OOP ORM can have weird ripple effects because other objects expect the availability of those member variables on the person or party objects and the interface bits usually rely on the availability of related objects methods, etc.

Another problem is that such designs train users wrong by teaching them that whenever a person changes addresses the world has actually changed as well, and that the right thing to do is erase the old data and replace it with something new. Which is crazy — because the old address is still the correct label for a location that didn't move in the real world, and erasing it doesn't mirror reality at all.

And the last statement above reveals the root problem: this isn't how addresses really work at all. Addresses are limited in scope by time. A person or party occupies a location for a time, but the location was already there — so we need a start/end time span, not just a record linking a party and an address. Moving further, addresses are merely labels for locations. So we need a model of locations — which should be hierarchical and boundless, because that's how real locations are. Then we need an address builder on top of that which can assemble an address by walking up the chain. This solves a ton of problems — for one, we don't have to care whether a building has a street number or even a street at all (in Japan, for example, we don't have street names, we have nested blocks of zones that define our address elements). It also solves the translation problem — which is really important for me again here, because English addresses are written from smallest element to largest and Japanese addresses from largest to smallest. But these representations are not the locations, nor are they the addresses themselves — they are merely different notations for the same address.
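As a hedged sketch of what that could look like in Postgres (again with illustrative names, not the schema I actually ended up with): the location tree is a plain self-referencing table, and each notation of an address is just a walk up the chain rendered in whichever order a locale expects.

    -- Locations nest arbitrarily deep; a leaf might be a building, a lot, a room.
    CREATE TABLE location (
        id     integer PRIMARY KEY,
        parent integer REFERENCES location (id),  -- absent only at the top of the tree
        label  text    NOT NULL
    );

    -- Assemble one address by walking up the chain from a given leaf.
    WITH RECURSIVE chain AS (
        SELECT id, parent, label, 0 AS depth
          FROM location
         WHERE id = 42                            -- some concrete building or lot
        UNION ALL
        SELECT l.id, l.parent, l.label, c.depth + 1
          FROM location l
          JOIN chain c ON l.id = c.parent
    )
    SELECT string_agg(label, ', ' ORDER BY depth)      AS smallest_first, -- English-style
           string_agg(label, ' '  ORDER BY depth DESC) AS largest_first   -- Japanese-style
      FROM chain;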

So all this stuff above is totally ignored by the typical software model of addressing — which really puts a kink in any prospect of working within the existing framework to write a background information check and pass management system. These kinds of incomplete conceptual assumptions pervade every framework I've dealt with, not just Tryton, and make life within OOP ORM frameworks very difficult when you need to do something the original authors didn't think about.

This article is about mismatches, so I'll point out that the obvious one we're already overlooking is that the data doesn't match reality — or even come close. And we're only talking about addresses. This goes beyond the Object-Relation Mismatch — it's the Data-Reality Mismatch. It just so happens that the Object-Relation Mismatch greatly enables the naive coder in creating ever deeper Data-Reality Mismatches.

Given the way addresses are handled in most software systems, we have a new data input and verification problem. With no concept of locations there is no way to let someone doing input link parties to common addresses. This is stupid for a lot of reasons; for one thing, consider how much easier it is for a user to trace down an existing location tree until they reach a level that doesn't exist in the database yet and then input just the new parts, rather than typing in whole addresses each time.

“But typing addresses is easy!” you say. Not true. We have to track four different scripts per address element (Latin, two forms of kana, and kanji) and they all have to come out the same way every time for the police computers to accept them. One of the core problems here is validating that person A's address #2 is identical in every detail to address #4 of person B (his brother), which spans the same dates, so that the police background checker won't spit out an error (they already have this data, so yours had better be right). Trusting that every user is always going to input the exact same long address string all four times and never make a mistake is ridiculous. It's even more stupid when you consider that they are referencing the same places in the real world against data you already have, so why on earth wouldn't your software system just let them link to existing data rather than force them to enter unique, error-prone new stuff?

So assuming you do the right thing and create a real data model in your database, where locations are part of a tree structure, addresses are assembled strings linked against locations, records carry a time reference, and so on, how does all this manifest in the object code? Not at all the way it presents in the database. Consider trying to define a person's “current address”.

There are two naive ways to do this and two right ways. The most common stupid approach is to just put a boolean on it, “is_current” or something similar, and call it good. The other stupid way is to treat any NULL end date as “current” and call it good. But NULL is supposed to mean “unknown”, and an unknown end date is likely to be the true state of affairs at least some of the time anyway, in which case NULL is an accurate representation of known fact — so overloading it to mean “current” is wrong. And even more interestingly, how do we declare that a person can only have one current address? Without a programmatic rule you can't, because making the “is_current” boolean part of a UNIQUE constraint means a person can't have more than one false value either, so they can only ever have one current and one non-current address (just two), which is silly. Removing the constraint means that either the client code (really stupid) or a database trigger (sort of stupid) has to check for and reject anything more than a single true value per person.

The better way to handle this is to have an independent “current address” table where the foreign key to person or party is UNIQUE and a separate “address” table where you dump anything that isn’t current. This has the advantage of automatic partitioning — since you will almost never refer to old addresses anyway, you can get snappy responses to current address queries because the current address table is only as large as your person table. The other right way to do this is to create a “current address” table that doesn’t contain any address data at all but rather just a unique reference to a party and a (not unique) reference to an address. This approach is the easiest to retro-fit onto an existing schema and is probably the right solution for a datastore that isn’t going to get more than a million addresses to store anyway.
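A sketch of the second approach, with invented names (the party and address tables are assumed from the discussion above): the history lives in one table, and “current” is a separate table whose UNIQUE party column guarantees at most one current address per party, enforced by structure rather than by a boolean, a trigger, or application code.

    -- Full address history for a party; the time span lives here.
    CREATE TABLE party_address (
        party    integer NOT NULL REFERENCES party (id),
        address  integer NOT NULL REFERENCES address (id),
        moved_in date    NOT NULL,
        PRIMARY KEY (party, address, moved_in)
    );

    -- No address data at all: just a unique reference to a party
    -- and a (not unique) reference to an address.
    CREATE TABLE current_address (
        party   integer NOT NULL UNIQUE REFERENCES party (id),
        address integer NOT NULL REFERENCES address (id)
    );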

But wait… you can't really do that in an ORM. I mean, you can make an ORM play along with the idea, but you can't actually create this structure in a simple way from within ORM code, and from OOP ORM code it is a much bigger PITA to coerce the database into giving you what you want than to just write your tables and rules in SQL yourself, plus some views to massage them into a complete answer for easy coexistence with an ORM. In particular, it's easiest to have the objects keep an “is_current” boolean and have the database just lie to the ORM, telling it that this is the case on the database end as well. Without knowing anything about how databases work, though, you'd never know that this is the right way to do things, and you'd never know that the ORM is actually obstructing you from doing a good job at data modeling instead of enabling it.
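Building on the sketch above, one way to let the ORM keep its “is_current” member while the base tables stay honest is a view; the column the ORM maps to is computed, not stored (names are still illustrative).

    -- The ORM is pointed at this view instead of a real table; is_current is
    -- derived from whether a matching row exists in current_address.
    CREATE VIEW party_address_for_orm AS
    SELECT pa.party,
           pa.address,
           pa.moved_in,
           (ca.address IS NOT NULL) AS is_current
      FROM party_address pa
      LEFT JOIN current_address ca
        ON ca.party = pa.party
       AND ca.address = pa.address;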

So here’s another mismatch: good data design predicts that objects are inherited one way in Python and the tables follow a significantly different schema in the database. Other than the problem above (which is really a problem of forcing addresses to be children of parties/people and not children of a separate concept of location as we have it in the real world) the object/relation weirdness creates a lot of situations where you’re trying to query something that is conceptually simple, but winds up requiring a lot of looping or conditional logic in the application to sort things out.

As for the looping — here be dragons. If you trust the ORM completely, each iteration may well involve one query, which is really silly once you think about it. And if you do think about it (I did) you'll write a larger query domain initially, loop over that in the application, and save yourself a bunch of round trips. But either way this is silly, because isn't SQL itself designed to be a language for asking detailed questions of data in the first place? Why am I doing this stuff in Python (or Ruby or Lisp or Haskell or whatever)?
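The same point in miniature, using the invented tables from the sketches above: one set-based query instead of an application loop that fires a query per party (the classic N+1 pattern an ORM quietly produces when you trust it).

    -- One round trip for any number of parties, instead of one query each.
    SELECT p.id, pa.address, pa.moved_in
      FROM party p
      LEFT JOIN party_address pa ON pa.party = p.id
     WHERE p.id IN (1, 2, 3);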

But I digress. Let me briefly return to the fact that the tables are inherited one way and the objects another. The primary database used by Tryton is Postgres. This is a great choice. It shows that somebody thought about things before pulling the trigger. Tryton was rewritten from the old TinyERP/OpenERP (the word “open” here is misleading, by the way — OpenERP's terms don't come close to adhering to the OSS guidelines, whereas TinyERP actually did, or was very close), and the main project leader spent a lot of time cleaning out funky cruft — another great sign. But somewhere in there a heavy impulse to be “database agnostic” or “portable” or some other dreamy urge got in and screwed things up.

See, Tryton supports MySQL and a few other database systems besides, ones that don't have a very complete feature set. What this means is that, to make the ORM-generated SQL for Postgres similar to the ORM-generated SQL for MySQL, you have to settle for the lowest common feature set between the two. Any given cool feature you could really benefit from in one that doesn't exist in the other must be ditched from all database backend code, or else maintaining the ORM becomes a nightmare.

This means that each time you say you want your framework to be “portable” across databases you are ditching every advanced feature that one system has and the others don't, resulting in least-common-denominator system design. So every benefit of using Postgres is gone. Poof. Every detriment of a fast, naive system like MySQL is inherited. Every benefit of a fast, naive system like MySQL is also gone, because nothing is actually written against the retrieval-speed optimizations built into that system in exchange for losing all the Big Kid features of a system like Postgres. Given this environment, paying enormous fees for Oracle isn't just stupid because Postgres can very nearly match it anyway — it's doubly stupid because you're not even going to use any of the cool features any database provides if you write “database agnostic” framework code.

I had many a shitty epiphany over time as I learned more about data storage concepts in general, relational database systems in particular, and Postgres, Oracle, DB2 and MySQL specifically. (And in that process I grew to love Postgres and generally like DB2.)

So there is a lesson here not related directly to the OOP/relational theme, but worth stating in a general way because it's important to nearly all software projects that depend on some infrastructure piece external to the project itself:

Pick a winner. If someone else in your project wants to use system X because they like it, they can spend the time making the ORM code work with it, but that should be an extension to the subsystem list, not a guarantee of the project, because you've got more important things to do. This could be MySQL vs Postgres or Windows vs Linux. It doesn't matter — pick one and specialize. Even better, pick whichever one gives the biggest boost to a specific layer of your application stack and use it there.

So far the above thinking has had me settling on Postgres over anything else for the data layer, and on Qt over anything else at the application level.

Back to my story. The addressing thing introduced enough problems that I eventually had to ditch it entirely and write my own module, based on normalized location data that carries natural data (the parent-child relationships within the hierarchy of physical locations), an address table that carries human-invented administrative data about those locations (whether they have a postal code, and other trivia), and a junction table that connects parties (people or organizations) to those locations via the addresses and carries the timeline and other data.
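Roughly how those layers could hang together, continuing the invented names from the sketches above (and simplified; the real module is stricter about things like avoiding stored NULLs): location carries the natural hierarchy, address carries the administrative data about a location, and party_address, sketched earlier, is the junction that ties a party to a location through its address and carries the timeline.

    -- An address is an administrative label attached to a natural location.
    CREATE TABLE address (
        id       integer PRIMARY KEY,
        location integer NOT NULL UNIQUE REFERENCES location (id)
    );

    -- Administrative trivia lives alongside, recorded only where it actually exists.
    CREATE TABLE postal_code (
        address integer NOT NULL UNIQUE REFERENCES address (id),
        code    text    NOT NULL
    );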

When I did this and mentioned it to some other Tryton folks they flipped out. Not because I had done this in the core project — no, this was my own substitute module — but because:

  1. I had written SQL, and not just dabbled in some CREATE TABLE statements
  2. I had normalized the data model (well, a very small part of it)

I wrote the SQL to carry the definitions where the ORM just didn't have a way to express what I wanted (or was really funky to grok once it was written). Apparently this was a big taboo in ORM Land, though I didn't know that going in. SQL seems to have this forbidden quality these days that excites as much as it instills fear, but I have no idea why. Again, I'm a n00b, so maybe I just don't get why ORM is so much better. Mind you, there was no hostility from anyone, just shock and some sense of the aghast query “what have you done?” (The Tryton community is actually a very warm place to play around and the project leader is enormously helpful, and despite me being an American (and a Texan, no less!) living in Japan and them all being snooty Euro types, we got along swell. If any FOSS ERP system has a glimmer of hope as of July 2012, it's Tryton.)

Writing SQL deeper than a raw() query here and there is one thing, but normalizing the data model is something on an altogether different plane of foul according to the rites of the Holy ORM. I was continually told that this would hurt me in the future if I continued on with Tryton. But on the other hand, they weren't looking at the small mountain of application code I would need to maintain and forward-port forever to get around the non-normalized data issue(s). And anyway, once you normalize data all the way, you don't normalize it further. There actually is a conclusion to that exercise. I've found that my normalized data models tend to endure, and changes wind up being handled by additions instead of the painful process of moving things around (which still seems mysteriously, wonderfully magical and relieving to me — probably because I'm not actually educated in relational algebra and so can't see the underlying reason why normalized data is so easy to extend; I mean, conceptually it's obvious, but how, precisely?).

Their arguments about “the future” disregarded the application layer entirely, because they were only thinking about Tryton, but for me it wasn't just one place where non-normalized data was hurting (their argument also disregarded that, by its own logic, I'd wind up leaving Tryton anyway). The original concept for the estimation program didn't really jibe with the way a(nother) very obvious customer need could be served by putting meaningful links between what was contained in the CAD files, what existed in the product database, and how the units of measure get computed among them. This meant that my real need wasn't a single application so much as a single data store that remained coherent regardless of which application happened to be talking to it at the time (I'm not even going to get into security in this post, but that is another thing that is enormously simplified by submitting to The Postgres Way instead of resisting it).

And this brings me to another problem — in fact, the real kicker. I started realizing as I wrote all these things that while the Tryton client program is pretty slick, it's not the end of the road for handling all needs. For one thing, a lot of it involves writing screens in XML. Yuk. That's about as annoying as it gets, and I'll leave it there. But most importantly, there was no way I was ever going to be able to port the Tryton client to, say, Android (and maintain it), or embed the CAD programs we're using (one easy-to-port C++/Qt program, and one black-box Windows contraption we run in Wine that is currently a must-have), and make things run smoothly. I was also going to have my work cut out for me if I wanted to use the same data store to drive dashboard snapshot reporting over HTTP, or especially to provide some CRUD capabilities over the Web for guys out of the office (and this issue goes all the way down to the security model as well).

Anyway, long(er) story short, Tryton just didn't meet my needs going forward. I could have forced it to fit at a greater cost in time than I was willing to pay, but it wasn't a total fit, and part of that was the way data in objects doesn't really jibe with how data in the real world works.

But the fact that I could code this stuff up in SQL in a sane way without any magic intrigued me. Greatly. The bit I did with addresses made so much sense compared to every other model I've seen for addresses that I couldn't ignore it. In reality people move, but locations stay right where they are. Address definitions might change, but that is an administrative concern which leaves a historical record. My model captures this exactly, and it now permits asking questions about the location, about the parties involved with the location, and even about the administrative situation surrounding the location over time (and the fact that questions of proximity are easily answered as well, and nest cleanly with the PostGIS extensions, is magical and worth noting). All without a long string of crazy dot-joined noSQL stuff going on, and all without a single null value stored anywhere. It was really easy to see that this made sense. Beyond that, I didn't have a bunch of metadata documenting my code, which should be incidental anyway; I just had a hard definition of how my data should look. From there I could do whatever I wanted in whatever application I wanted. Having a truly sane data model started making so much sense to me that I tried writing a few different applications on top of the same data model as an experiment. And it worked amazingly well.

Writing a PyQt application, for example, I can just ask the database for some information against a view. I can get the query back as a dictionary or a list or whatever I want and display it or manipulate it any way I want. Doing it from a Django web face is pretty easy, too. Actually, really easy. Django has an ORM that can make life easier if you ditch the “this class is a table” idea and make them all unmanaged models which actually just call views in the database. It's even easier overall if they are the exact same views your other applications call (or not, but usually this winds up being a useful situation). If you remember not to do any processing in the application code, but instead have the database build your view for you and just let Django be the way it gets into a web page, then you're really cooking with gas and can safely take advantage of all the automatic stuff Django can do. (Or, even better than web pages, use Django to render OpenDocument files for you, which turns out to be a super easy way to woo your customers because it's so much more useful than generating web pages. I should probably write a post about how to do this later because it's just that cool.) It's even cooler to do this from Snap than Django — but that's a whole 'nother story.

This was just retrieving data, though. I got curious about entering data. And it's really easy as well. But it involves a few extra things: careful definition of the data model (ensure actual normalization, which is sometimes surprisingly counter-intuitive in how intuitive it is), multi-column unique constraints, check constraints, really understanding what a foreign key is for, etc., all while still leaving room for a (now otherwise meaningless) numeric ID column for frameworks that may require it — and this whole numeric-keys-for-everything bit will seem weirder the longer you spend dealing with solid data models.
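For the flavor of what that constraint work looks like, here is a small invented example (the facility table is hypothetical): a multi-column UNIQUE, a CHECK rule, real foreign keys, and a surrogate id kept only because some framework may insist on one.

    CREATE TABLE pass (
        id          serial  PRIMARY KEY,           -- only for frameworks that demand it
        party       integer NOT NULL REFERENCES party (id),
        facility    integer NOT NULL REFERENCES facility (id),
        valid_from  date    NOT NULL,
        valid_until date    NOT NULL,
        CHECK (valid_from <= valid_until),          -- a pass cannot end before it starts
        UNIQUE (party, facility, valid_from)        -- no duplicate passes from the same day
    );

Violations come back as database exceptions, which is exactly the single set of error conditions the next paragraph talks about handling in the application.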

Basically, use all the tools in the Postgres bag and your life will get easier. And that’s actually not hard at all. The Postgres feature list (even the DB2 feature list) is pretty small compared to the vastness of the entire Python API coupled with the combined might (and confusion, usually) of whatever framework(s) you’re writing around. Doing it right also requires that you learn how to handle the various exceptions that the database will throw back at you as a result of your constraints and rules and things you’ve put in the database. But this makes programming the application layer really easy. Like incredibly easy. And anyway, learning how to handle a single set of database exceptions is a lot easier than trying to remember every stupid little exception condition your framework can produce multiplied by the number of frameworks you have.

And this is what is solving my core problem. I'm discovering that not only is SQL pretty darn easy, but that it solves my core business logic problems without my actually writing any business logic. I think this is what the relational guys at IBM knew they were onto decades ago when they thought this idea up in the first place.

Consider the “current address” issue above. I didn’t use booleans, logical processes or any other trick to figure out whether an address was current or not, nor did I have to write a special rule that states that a person can only have a single current address at once but any arbitrary number of non-current addresses, nor did I have to write a single spot of application code. The problem is solved by the structure of the data alone — which is always the most efficient solution since it involves zero processing.

This blows all that “use this framework to build your apps in 5 easy steps with Rails!” bullshit away. But I am a little put out that the concepts themselves don't have more support within the trendier parts of the software development world. It seems everyone is jumping on the out-of-control bandwagon that marketers overloaded with Java Beans and Hopes and Dreams all those years ago and sent tumbling down the hill. It's like the Obama campaign infected the software industry (because he totally earned that Nobel Prize and Hawking doesn't deserve one). It's still rocketing down the hill, distracting faculty, investors, budding programmers and the marketing world almost completely. It's really amazing. I am a little upset that discovering a really sane way to manage data was so hard and took so long amid the enormous volume of siren screams and other noise on the wire in the development community. Of course, now that I know what I'm looking for, locating good discussions and resources isn't that hard — though it is a little odd to note that the copyright dates on most of them predate my own existence.

So now, as I convert a mishmash of previously written, independent application models into a central data concept, I am finding something amazing: I haven't found a single business rule yet that isn't actually easier to express in terms of data structure than in application code. I'm also finding that importing the data from the (now legacy) application databases is usually not that hard either, though it requires more mental effort than anything else on my plate right now.

Most amazing of all is the ease of writing application code. Even if I’m writing one application in C++/Qt, another in PyQt, another in Django, another in CL and another in Haskell that run variously across the spectrum of servers, tablets, phones and desktops[4], they can all live under the same guarantees and are super easy to understand because of the extreme lightness of all their code. I’m not doing anything but showing stuff to the user from the database, and putting stuff back in the database, and adjusting based on whether or not the database accepted what was given.

This makes application development fun again. Previously I had been bogged down in trying to define business logic rules as processes, and that was boring, especially since the magic sauce really should have just been a data model forcing me to be correct in the first place instead of me chasing exceptional cases through a bunch of logical code paths in the application (which had to be duplicated across all the applications!). That effort also tended to put horse-blinders on me as far as interfaces went. Once I wrote a web interface, the enormous freedom that native application development gives you became suddenly invisible, and I was thinking in terms of “what is the parallel widget in Qt or GTK to the HTML SELECT” or whatever. That's just lame. But it's what starts happening when you spend so much brainpower worrying about conditional business logic that you forget all the cool stuff you can do in a native application (like 3D flowcharts, or 3/4D projections of project management data throughout time that you can “paw through” with the mouse or even a game controller, or a million other kickass ideas we usually only ever get to see in vidya games).

Getting your data model right not only gives you the mental freedom to start exploring what native UI can do that goes so far beyond the pitiful bag of cheap tricks that “web app development” has made standard today (or the convoluted mess of JavaScript and AJAX trash that supports it), it also gives you the confidence to step out and do some cool stuff in your client applications because, hey, the data model part of the problem is already solved. All you have to do is serialize the data in your application — which means that if you want objects in the application, go for it, but make sure they are based on a view of derived data, not a 1-for-1 mapping of objects to relations. The fact that serialization becomes an easy problem gives you the focus to do cool stuff nobody else is doing — and it all comes down to doing data right and escaping from the ridiculous house of mirrors that ORMs lead you into.

The conceptual gap between the object world and the relational world is so vast that it is not worth trying to bridge. What I'm saying is that there isn't really an Object-Relation Mismatch. The two just aren't the same kind of thing at all, so how could we ever have thought that comparing them against the same criteria made sense to begin with?

 

[1. Both when I was a desk jockey for a while and when I was still in the Army — being an SF engineer involves a good bit of math that you know about going in (and no calculators, so programming is no help there anyway), but also huge amounts of paperwork that they never tell you about until after you walk thousands of miles to get your floppy green hat.]

[2. This is every bit as damaging as the way that leftist political thinkers loosely throw around the word “society”, as in “you owe it to society”, “society owes it to them” or “society must regulate X, Y, and Z”, when what they really mean is in some cases “your community” and in other cases “the government” — which convolutes the discussion enough that obviously unacceptable things can seem acceptable. It is similar to how obviously non-OO things have been massaged into an OO-ish shape in the minds of thousands, yet remain just as procedural or functional as ever they were in reality.]

[3. This mistake is somewhat comically enshrined in the new NoSQL stuff, which consists principally of reinventing pre-relational data systems IBM already worked on and largely discarded decades ago.]

[4. In fairness, almost everything is running on Linux, and this makes development much easier than if I were trying to tackle the full spectrum of device OSes out there. Who wants to write a 3D reporting face for Blackberry that needs to work and look the same way it does on Android, KDE on Linux, or iOS (or Windows Phone… haha!)?]