Posts Tagged ‘Programming’

Binary Search: Random Windowing Over Large Sets

Wednesday, August 14th, 2013

Yesterday I came across a blog post from 2010 that said fewer than 10% of programmers can write a binary search. At first I thought “ah, what nonsense” and then I realized I probably haven’t written one myself, at least not since BASIC and Pascal were what the cool kids were up to in the 80’s.

So, of course, I had a crack at it. There was an odd stipulation that made the challenge interesting — you couldn’t test the algorithm until you were confident it was correct. In other words, it had to work the first time.

I was wary of fencepost errors (perhaps being self-aware that spending more time in Python and Lisp(s) than C may have made me lazy with array indexes lately) so, on a whim, I decided to use a random window index to guarantee that I was in bounds each iteration. I also wrote it in a recursive style, because it just makes more sense to me that way.

Two things stuck out to me.

Though I was sure what I had written was an accurate representation of what I thought binary search was all about, I couldn’t actually recall ever seeing an implementation, having never taken a programming or algorithm class before (and still owning zero books on the subject, despite a personal resolution to remedy this last year…). So while I was confident that my algorithm would return the index of the target value, I wasn’t at all sure that I was implementing a “binary search” to the snob-standard.

The other thing that made me think twice was simply whether or not I would ever breach the recursion depth limit in Python on really huge sets. Obviously this is possible, but was it likely enough that it would occur in the span of a few thousand runs over large sets? Sometimes what seems statistically unlikely can pop up as a hamstring slicer in practical application. In particular, were the odds good that a random guess would lead the algorithm to follow a series of really bad guesses, and therefore occasionally blow up? On the other hand, were the odds better that random guesses would be occasionally so good that on average a random index is better than a halved one (of course, the target itself is always random, so how does this balance)?

I didn’t do any paperwork on this to figure out the probabilities, I just ran the code several thousand times and averaged the results — which were remarkably uniform.

I split the process of assignment into two different procedures, one that narrows the window to be searched randomly, and another that does it by dividing by two. Then I made it iterate over ever larger random sets (converted to sorted lists) until I ran out of memory — turns out a list sort needs more than 6GB at around 80,000,000 members or so.

I didn’t spend any time rewriting to clean things up to pursue larger lists (appending guaranteed larger members instead of sorting would probably permit astronomically huge lists to be searched within 6GB of memory) but the results were pretty interesting when comparing the methods of binary search by window halving and binary search by random window narrowing. It turns out that halving is quite consistently better, but not by much, and the gap may possibly narrow at larger values (but I’m not going to write a super huge list generator to test this idea on just now).

Something about these results seems exploitable. But even if it is, iterating 24 times instead of 34 over a list of over 60,000,000 members to find a target item isn’t much of a difference in the grand scheme of things. That said, it’s mind-boggling how far from Python’s recursion depth limit one stays, even when searching such a large list.
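As a rough sanity check on those figures, the probe counts can be estimated without building any lists. The sketch below is a back-of-the-envelope calculation, not part of the original experiment. It assumes the target is a uniformly random member of the list, in which case random narrowing behaves like a successful search in a random binary search tree, whose average cost is 2(1 + 1/n)H_n - 3. The function names are made up for illustration.

```python
from __future__ import division, print_function
import math
import sys

EULER_GAMMA = 0.5772156649015329

def harmonic(n):
    # H_n = 1 + 1/2 + ... + 1/n; exact sum for small n, asymptotic estimate otherwise.
    if n < 1000:
        return sum(1.0 / k for k in range(1, n + 1))
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

def halving_worst_case(n):
    # Most probes a halving search can need on n sorted items: floor(log2(n)) + 1.
    return n.bit_length()

def random_split_average(n):
    # Average probes when every split index is uniformly random and the target
    # is a uniformly random member (random binary search tree analogy).
    return 2 * (1 + 1.0 / n) * harmonic(n) - 3

if __name__ == '__main__':
    n = 60000000
    print('recursion limit:', sys.getrecursionlimit())
    print('halving, worst case:', halving_worst_case(n))
    print('random split, average: %.1f' % random_split_average(n))
```

For 60,000,000 members this gives at most 26 probes for halving and an average of roughly 34 for random narrowing, both far below CPython's usual default recursion limit of 1000. A random search can in principle need as many probes as there are members, but the streak of bad guesses that requires is vanishingly unlikely.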

Here is the code (Python 2).

```python
from __future__ import print_function
import random

def byhalf(r):
    # Midpoint of the inclusive window (lo, hi); // keeps the index an integer.
    return (r[0] + r[1]) // 2

def byrand(r):
    # Random index inside the inclusive window (lo, hi).
    return random.randint(r[0], r[1])

def binsearch(t, l, r=None, z=0, op=byhalf):
    # Returns the number of probes needed to find t in the sorted list l.
    # Assumes t is actually a member of l.
    if r is None:
        r = (0, len(l) - 1)
    i = op(r)
    z += 1

    if t > l[i]:
        return binsearch(t, l, (i + 1, r[1]), z, op)
    elif t < l[i]:
        return binsearch(t, l, (r[0], i - 1), z, op)
    else:
        return z

def doit(z, x):
    # Build a sorted list of up to x distinct random integers below z.
    l = list(set(int(z * random.random()) for i in xrange(x)))
    l.sort()

    res = {'half': [], 'rand': []}
    for i in range(1000):
        # random.choice can pick any member, including the last one,
        # and also works when the list has a single member.
        target = random.choice(l)
        res['half'].append(binsearch(target, l, op=byhalf))
        res['rand'].append(binsearch(target, l, op=byrand))
    # Integer averages of the probe counts for each method.
    print('length: {0:>12} half:{1:>4} rand:{2:>4}'
          .format(len(l),
                  sum(res['half']) // len(res['half']),
                  sum(res['rand']) // len(res['rand'])))

for q in [2 ** x for x in range(27)]:
    doit(1000000000000, q)
```
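The listing above counts probes and assumes the target is always in the list. If the index itself is what you want, a loop version along the following lines would also cope with missing targets. This is only a sketch: `find_index` and `pick_random` are names invented here, and the loop replaces the recursion purely so that an empty window is easy to detect.

```python
import random

def find_index(target, items, pick=None):
    """Return the index of target in the sorted list items, or None if absent."""
    if pick is None:
        pick = lambda lo, hi: (lo + hi) // 2  # plain halving
    lo, hi = 0, len(items) - 1
    while lo <= hi:  # an empty window means the target is not present
        i = pick(lo, hi)
        if target > items[i]:
            lo = i + 1
        elif target < items[i]:
            hi = i - 1
        else:
            return i
    return None

def pick_random(lo, hi):
    # Random index within the inclusive window, in the spirit of byrand.
    return random.randint(lo, hi)
```

Because the chosen index always lies inside the inclusive window and the window shrinks by at least one member per pass, both variants stay in bounds and terminate.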

Something just smells exploitable about these results, but I can’t put my finger on why just yet. And I don’t have time to think about it further. Anyway, it seems that using a random index to make certain you stay within bounds doesn’t actually hurt performance as much as I thought it would. A perhaps useless discovery, but personally interesting nonetheless.

Don’t Get Class Happy

Tuesday, August 13th, 2013

If you find yourself writing a class and you can’t explain the difference between the class and an instance of that class, just stop. You should be writing a function.

This antipattern — needless classes everywhere, for everything — is driving me crazy. Lately I see it in Python and Ruby a good bit, where it really shouldn’t occur. I feel like it’s a mental contagion that has jumped species from Java to other languages.

In particular I see classes used as inter-module namespaces, which is odd since it’s not like there is a tendency to run out of meaningful names within a single source file (of reasonable length). Code importing the module from outside can use `from foo import *`, or `import foo`, or `from foo import name`, or even `import foo as bar` (ah ha!) and make flexible use of namespacing where it is needed most — in the immediate vicinity of use.

So don’t write useless, meaningless, fluffy, non-state&behavior classes. Any time you can avoid writing classes, just don’t do it. Write a function. It’s good discipline to see how far you can get writing zero classes in an implementation, and then write it with some classes and see which is actually less complex/more readable. In any case writing two solutions will give you a chance to examine the problem from multiple angles, and that’s nearly always an effort that results in better code since it forces you to value grokness over hacking together a few lines that sort of cover the need for the moment (and forgetting to ever come back and do it right).

Or maybe I should be admonishing people to seek flatness in namespaces, since this implies not writing classes that are stateless containers of a bunch of function definitions which incidentally get referred to as “methods” even though they don’t belong to a meaningful object in the first place.
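To make the contrast concrete, here is a small Python sketch. Every name in it is invented for illustration. The first class is the kind of stateless container described above, the function after it does the same job with no class at all, and the last class is one where the difference between the class and its instances actually means something.

```python
# A stateless "namespace" class: no instance ever holds any data.
class TextTools(object):
    @staticmethod
    def shout(s):
        return s.upper() + '!'

# The same behavior as a plain module-level function.
def shout(s):
    return s.upper() + '!'

# A class that earns its keep: instances differ because they carry state.
class Counter(object):
    def __init__(self):
        self.count = 0

    def bump(self):
        self.count += 1
        return self.count
```

Two `Counter` instances can hold different counts, so the class/instance distinction is real. Two `TextTools` instances could never differ in any way, which is a hint that the module itself was already the namespace.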

Or maybe I should be railing against the oxymoronic concept of “stateless objects/classes”.

Or maybe I should be screaming about keeping our programming vocabulary straight so that we can finally say what we mean and mean what we say. (Sound advice in all walks of life.)

This last idea is perhaps the most indirect yet powerful of all. But it implies “standards” and so far in my experience “standards” and “computing” don’t tend to turn out very well once we get beyond, say, TCP/IP or ANSI and Unicode.

Development Speed VS Quality

Thursday, June 13th, 2013

I’ve been working under some pretty insane time constraints lately. Two things jump out at me upon review of my work:

1. I can, when cornered, churn out thousands upon thousands of lines of functioning code in a flash.
2. The code works, but it is not particularly insightful or brilliant — it merely works.

The first bit is sort of cool: Typo Monster and Syntax Bear are no longer a part of my life (at least not in my Big Five). Nice.

On the other hand, the second bit isn’t so nice. It means that insightful development requires far more time than people tend to imagine. This has been a point of discussion for decades in engineering and especially software development, but it’s hard to fully appreciate until you get a chance to compare your own code, written once under duress, with code written for a similar problem domain at a pace that allows more time for grokness.

Now that I think of it, it’s not the pace exactly which is the problem, it is the way that certain forms of deadline and environmental pressures can re-tune the time budget in a way that extends typing/coding/input time at the expense of grok time.

This is particularly pronounced when it comes to data modeling — no matter if this means in the sense of managed state, a relational database schema, a no-schema blobbulation, a serialization scheme of some sort, or an object/class model.

This is a particularly touchy thing, since rearranging a data model of any sort entails major surgery when more than one thing relies on the way data is stored, but refactoring functions is relatively lightweight (so long as you understand what the intended mapping was to begin with).

I suspect that the vast majority of production code is written under deadline and that most of it relies on expedient and trashy data models. I further suspect that most corporate settings urge programmers to “be programming” and officially establish that “programming” means “typing” instead of “understanding”.

It is shocking to revisit something written in a hurry and simplify it in a flash of insight, and then contemplate what just happened. The surprise is all the more resonant once you fully realize how much less time is required to input insightful code than rushed, verbose-but-functional trash.

Looking to my left and right, it is not particularly evident that much production code benefits from a culture of sacred Grok Time.

Language Preachers: A Language’s Worst Enemy

Monday, November 12th, 2012

Nothing crushes a language’s chance at becoming popular like the activities of its political and religious agitators. It doesn’t matter if the situation of the language is that it’s on the rise, on the decline or on the rebound. Language preachers suck, always. The only chance a language has for a successful PR campaign is when that campaign is based completely on lies, is extremely well funded, and appears to provide a remedy-by-aversion to the sort of problems incompetent instructors and consultants can’t think through and yet still encourages the sale of completely abstract content about abstractions without actually having to write useful code (Java). It doesn’t matter if this campaign is a complete farce, it’ll work if it permits enough of the stupid to perceive themselves as now smart enough because it can generate an entire sub-industry based on trying to prove that something can be done in this newly minted language for the incompetent — a heavy side of brand new buzz words helps, too (if only the “cloud” had been a language…).

But that’s just Java. Other languages don’t have the benefit of an academic revolution to assist in shoving a particular interpretation of a single programming paradigm down our throats (complete with a language that enforces just that one paradigm). Most languages just have to make it on their own either on merit (Haskell, a few lisps, Python), be the language of the year’s hottest killer app (PHP and the original web, Ruby and Rails), be the language of a killer app (Javascript and Mozilla/Firefox), or be both strong on merits and be the mother of a whole family of killer apps (C being the obvious example of the closest thing to modern immortality).

Recently I’ve been involved in a few discussions about the relative merits of a few different languages, database systems, distro flavors, kernel details, etc. Of all these topics the top two likely to start a religious war are choice of language and choice of database. My made-up estimate is that language discussions are twice as likely to incite digital jihad as database discussions.

Maybe it is because we spend most of our time dealing directly in a language and actually thinking in that language, so it is internalized to a level nothing else is. I don’t “think” in LibreOffice (though I’ve seen a few people who can think in Calc functions and others in Excel functions — and I hope the two types never meet to talk about it), I don’t carry on an internal monologue of sorts in Firefox or Nautilus or Gnome — but I do talk to myself in a sort of visual way in Python, Guile, Haskell, C, Bash, Perl, plpgsql, opcodes and a few other languages when the need arises. Programmers are prone to do this — and it turns out that they are most likely to do it only in a single language, ever (pick one of Java, Perl, Javascript, PHP, or Ruby for the best religious wars).

To someone who doesn’t speak enough languages to have evolved a sort of quiet disdain for them all a weird form of moral relativism seems to pervade. All languages other than the one they already know well are at once “the best for some job, just depends on what you’re doing” and completely inferior at everything compared to \$the_only_lang_i_know_well. And of course the complete n00b who doesn’t even know one language well yet will swear that LangX — which he will never take the time to study but will take the time to flood IRC with airy questions about — is the cure for cancer, war and was handed down by Mayan priests without actually knowing anything about it. Obviously there is a conflict here.

Most languages suck, and some suck far less than others. A very few suck so much less that they are almost good. That list is very short. The list of languages I think to myself in regularly is a mix of languages that I don’t completely despise and ones that I do despise but must use because useful projects are already written in them. It used to grind on me that I had to use inferior tools so often, but I’ve gotten past it by accepting that most programmers are either really bad at what they do or are compelled by economic circumstances to abandon both their code and the quest for decent tools prematurely. Meh.

The last few years I’ve noticed a lot of Perl people getting zealous about their language. I’m not writing to bash Perl or its community — different communities go through their own inner social cycles as well. But the last six years or so have seen Perl people transition from a group that still self-assuredly represented the original web/CGI hackers and were held in guruly esteem to a group that is panicked about the low percentage of new projects written in Perl compared to the late 90’s and early 00’s. They’ve gone from a group that is solving problems to a group that is actively trying to tell everyone that Perl is still the thing to use, that it’s still hip and cool and better than everything else out there.

This is hurting Perl more than it is helping. Perl is indeed a good language, but the days when it was the de facto standard for a large subset of new project types are over. Panicking about it won’t help — but writing useful programs sure might. Today there are many alternatives that are at least as good as Perl ever was and many that are just better. Algol was and still is a great language, but I’m not about to start a new project in it. C is just better, and so far hasn’t been surpassed in that particular space (though its time may come, too). I make the decision to not use Algol out of an informed regard for the difficulty of maintaining Algol code in 2012 versus maintaining other code in 2012. If someone were constantly bombarding me with pro-Algol propaganda and biased trash benchmark comparisons then I would avoid Algol on principle instead of even learning enough about it to know whether or not it could have possibly been a good choice to begin with.

If Perl is the awzumest evar, then there is no need to proselytize. If it is on the decline, then there is a huge need to defend the temple — and this is the perception I have of this sort of behavior. It makes Perl people sound like Java people when it was a crap language trying to gain some mindshare — though in defense of that tactic Java did take over the shallow end of the pool, and that turned out to be the largest part of the pool by a vast margin.

The Lisp community screwed itself by engaging in Java-style cheerleading for a time. This drove a wedge between the people who really could have benefitted from learning it and the community of language snobs who were already privy to The Great Mystery. The Postgres community was that way as well until very recently. I’m a card carrying lisper myself, but I admit the community just got retarded when it came to interacting with others. The only thing that has redeemed Lisp in recent years is that some of the best commercial Lisp developers also happen to be excellent writers. If it weren’t for that there would have been no revival because the inner community mixed with the outer community like water and oil over this language preaching business.

Perl is a genuinely good language and it has a genuinely good runtime (and the two are totally different issues). If it is fading, let it fade with majesty as one of the traditional uber-hacker languages of yore, bringing out your Perl magic as a way to intrigue and enlighten n00bs instead of making the word “Perl” synonymous with a noisy cult that is more interested in evangelizing than writing solutions to problems.

Don’t make the same mistake the Lispers did when they felt their language was threatened — it’s the difference between newcomers deliberately avoiding Perl and newcomers simply not having tried it yet.

Fedora: A Study in Featuritis

Thursday, October 11th, 2012

It’s a creeping featurism! No, it’s a feeping creaturism! No, it’s an infestation of Feature Faeries! No, it’s Fedora!

I’ve been passively watching this thread (link to thread list) on the Fedora development list and I just can’t take anymore. I can’t bring myself to add to the symphony, either, because it won’t do any good — people with big money have already funded people with big egos to push forward with the castration of Fedora, come what may. So I’m writing a blog post way out here in the wilds of the unread part of the internet instead, mostly to satisfy my own urge to scream. Even if alone in the woods. Into a pillow. Inside a soundproof vault.

I already wrote an article about the current efforts to neuter Unix, so I won’t completely rehash all of that here. But it’s worth noting that the post about de-Nixing *nix generated a lot more support than hatred. When I write about political topics I usually get more hate mail than support, so this was unique. “But Unix isn’t politics” you might naively say — but let’s face it, the effort to completely re-shape Unix is nothing but politics; there is very little genuinely new or novel tech going on there (assloads of agitation, no change in temperature). In fact, that has ever been the Unix Paradox — that most major developments are political, not technical in nature.

As an example, in a response to the thread linked above, Konstantin Ryabitsev said:

So, in other words, all our existing log analysis tools have to be modified if they are to be of any use in Fedora 18?

In a word, yes. But what is really happening is that we will have to replace all existing *nix admins or at a minimum replace all of their training and habits. Most of the major movement within Fedora from about a year ago is an attempt to un-nix everything about Linux as we know it, and especially as we knew it as a descendant in the Unix tradition. If things keep going the way they are OS X might wind up being more “traditional” than Fedora in short order (never thought I’d write that sentence — so that’s why they say “never say ‘never'”).

Log files won’t even be really plain text anymore? And not “just” HTML, either, but almost definitely some new illegible form of XML by the time this is over — after all, the tendency toward laughably obfuscated XML is almost impossible to resist once angle brackets have made their way into any format for any reason. Apparently having log files sorted in Postgres wasn’t good enough.

How well will this sit with embedded systems, existing utilities, or better, embedded admins? It won’t, and they aren’t all going to get remade. Can you imagine hearing phrases like this and not being disgusted/amused/amazed: “Wait, let me fire up a browser to check what happened in the router board that only has a serial terminal connection and can’t find its network devices”; or even better, “Let me fire up a browser to check what happened in this engine’s piston timing module”?

Unless Fedora derived systems completely take over all server and mobile spaces (and hence gain the “foist on the public by fiat” advantage Windows has enjoyed in spite of itself) this evolutionary branch is going to become marginalized and dumped by the community because the whole advantage of being a *nix admin was that you didn’t have to retrain everything every release like with Windows — now that’s out the window (oops, bad pun).

There was a time when you could pretty well know what knowledge was going to be eternal (and probably be universal across systems, or nearly so) and what knowledge was going to change a bit per release. That was always one of the biggest cultural differences between Unix and everything else. But those days are gone, at least within Fedoraland.

The original goals for systemd (at least the ones that allegedly sold FESCO on it) were to permit parallel service boot (biggest point of noise by the lead developer initially, with a special subset of this noise focused around the idea of Fedora “going mobile” (advanced sleep-states VS insta-boot, etc.)) and sane descendant process tracking (second most noise and a solid idea), with a little “easy to multi-seat” on the side to pacify everyone else (though I’ve seen about zero evidence of this actually getting anywhere yet). Now systemd goals and features have grown to cover everything, including logging. The response from the systemd team would likely be “but how can it not include logging?!?” Of course, that sort of reasoning is how you get monolithic chunk projects that spread like cancer. It’s ironic to me that when systemd was introduced HAL was held up as such a perfect example of what not to do when writing a sub-system specifically because it became such an octopus — but at least HAL stayed within its govern-device-thingies bounds. I have no idea where the zone of responsibility for systemd ends and the kernel or userland begins anymore. That’s quite an achievement.

And there has been no end to resistance to systemd, and not just because of init script changeover and breakages. There have been endless disputes about the philosophy underlying its basic design. But don’t let that stop anybody and make them think. Not so dissimilar to the Gnome3/Unity flop.

I no longer see a future where this distro and its commercially important derivative is the juggernaut in Linux IT — particularly since it really won’t be Linux as we understand it, it will be some other operating system running atop the same kernel.

Come to think of it, changing the kernel would go over better than making all these service and subsystem changes — because administrators and users would at least still know what was going on for the most part and with a change in kernel the type of things that likely would be different (services) would be expected and even well-received if they represented clear improvements over whatever had preceded them.

Consider how similar administering Debian/Hurd is to administering Debian/Linux, or Arch/Hurd to Arch/Linux. And how similar administering AIX and HP-UX is to administering, say, RHEL 6. We’re making such invasive changes through systemd that a change of kernel from a monolithic kernel to a microkernel is actually more sensible — after all, most of the “wrangle services atop a kernel a new way” ideas are already managed in a more robust way as part of the kernel design, not as an intermediate wonder-how-it’ll-work-this-week subsystem.

Maybe that is simpler. But it doesn’t matter, because this is about deliberately divisive techno politicking on one side (in the vain hope that “if our wacko system dominates the market, we’ll own the training market by default even if Scientific Linux and CentOS still dominate in raw numbers!”), and ego masturbation on the other (“I’ll be such a rebel if I shake up the Unix community by repeatedly deriding so-called ‘Unix traditions‘ as outdated superstitions and generally giving the Unix community the bird!”).

Here’s a guide to predicting the most likely outcomes:

• To read the future history* of how these efforts work out as a business tactic, check the history of Unix from the mid-1980’s to early 2000’s and see how well “diversification” in the interest of carving out corporate empires works. I find it strikingly suitable that political abuse of language has found its way into this effort — conscious efforts at diversification (defined as branching away from every other existing system, even your own previous releases) are always performed under the label of “standardization” or “conformance to existing influences and trends”. Har har. Joke’s on you, though, Fedora. (*Yeah, it’s already written, so you can just read this one. Easy.)
• To predict the future history of a snubbed Unix community, consider that the Unix community is so used to getting flipped the bird by commercial interests that lose their way that it banded together to write Linux and the entire GNU constellation from scratch. Consider also that the original UNIX was started by developers who were snubbed and felt ill at ease with another, related system whose principal flaw was (ironically) none other than the same featuritis the Linux community is now enduring.

I don’t see any future where Fedora succeeds in any of its logarithmically expanding goals as driven by Red Hat. And with that, I don’t see a bright future for Red Hat beyond v7 if they don’t get this and other priorities sorted**. As a developer who wishes for the love of everything holy that I could just focus on developing consumer business applications, I’m honestly sad to say that I’m having to look for a new “main platform” to develop for, because this goose looks about cooked.

** (sound still doesn’t work reliably — Ekiga is broken out of the box, Skype is owned by Microsoft now — Fedora/Red Hat don’t have a prayer at getting on mobile (miracles aside) — nobody is working on anything solid to stand a business on once the “cloud” dream bubble pops — virtualization is already way overinvested in and done better elsewhere already anyway — easy-to-fix media issues aren’t being looked at — a new init system makes everything above worse, not better, and is distracting and requires admins to completely retrain besides…)

Monday, August 27th, 2012

Doing computations in the database is almost always the right answer. Writing your own database procedures instead of relying on an ORM framework is even better. These go hand in hand since ORMs don’t allow you enough control over your data schema to define calculations in the first place (most can’t even properly handle multi-column primary keys and instead invent meaningless integer IDs for everything; this should tell you something). Many, maybe most (at least that I’ve met), web developers don’t even know what sort of calculations are available to be performed in the database because they’ve been taught, in an absolute vacuum of personal experience, that “SQL hard. Relational thinking hard. OOP good. Trust framework”. There is so much missing here.
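A minimal sketch of what that looks like in practice, using Python's bundled `sqlite3` module as a stand-in. SQLite has no stored procedures, so this shows only the "let the database compute" half of the argument plus a natural multi-column primary key. The table, columns, and `order_totals` function are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        line_no    INTEGER NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_cents INTEGER NOT NULL,
        PRIMARY KEY (order_id, line_no)   -- natural multi-column key
    );
    INSERT INTO order_line VALUES (1, 1, 2, 5);
    INSERT INTO order_line VALUES (1, 2, 1, 10);
    INSERT INTO order_line VALUES (2, 1, 3, 4);
""")

def order_totals(conn):
    # The database does the arithmetic and the grouping;
    # the application only reads finished results.
    rows = conn.execute("""
        SELECT order_id, SUM(quantity * unit_cents) AS total_cents
        FROM order_line
        GROUP BY order_id
        ORDER BY order_id
    """)
    return rows.fetchall()

totals = order_totals(conn)
```

The composite key also lets the database itself reject a second line 1 for order 1, which is a rule an invented integer ID would not express.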

Frameworks try to be “database agnostic”. This is fundamentally flawed thinking. This implies that the data layer is merely there for “persistence” and that the “persistence layer” can be whatever — all RDBMS systems “aren’t OO and therefore suck” and so any old database will do. If this is true then it follows that frameworks and applications should be designed to work against any database system and not delve too deeply into any specific feature sets — after all, the application functionality is the focus, not the data, right? This is exactly backwards.

Even forgetting that this condemns you to least-common-denominator data design, this is still exactly backwards. Let me put the right way on a line all by itself, because it is just that important:

Data designs should strive to be application agnostic.

Data drives everything. Your functions are what you can change around easily, but your data schema is critical and represents everything about your system logic. If you show me a well-labeled data schema I can probably guess what you are trying to do, but if you show me just your functions and objects I’ll require either a code tour or a lot of familiarization time before getting anything serious done (that project documentation will be lacking is a truism not worth addressing here).

Consider that changing your app code is cake whereas changing your data schema is major project surgery. OOP has us in such a stupor that we think if we just get our objects right everything will be fine. It won’t. Ever. As long as you think that you’ll also believe other crap like that each object should map directly to a table. There are certain basic truths about certain types of data. It is striking that I can give a data requirement to two DBAs schooled on two different RDBMSes and ask for a normalized data model (let’s just say 3NF for argument) and get back two very similar looking schemas, but I can give a feature requirement to two Java programmers and get back radically different system designs.

This should tell us something. In fact, it screams the truth that data is a foundation from which you must work up toward the application code, not the other way around. The database layer is the most important place to make sound choices. The choice of database system itself should be based on project requirements, because that choice matters. Most critically, I’ll say it again here because it is so important and implies so much on contemplation: the database designs should strive to be application agnostic.

On Looping and Recursion

Sunday, August 12th, 2012

This post was prompted by a discussion about various programming languages on the Scientific Linux Forum. The discussion is in the members’ sub-forum, so I can’t link it here very effectively.

wearetheborg wrote:

I don’t understand the statement “…develop a sense time being slices along the t-axis (similar to thinking transactionally) instead of dealing in terms of state.” Can you elaborate on this? I have been envisioning state as the state of the various variables and objects.

That is the way most people think of state. This is normal in procedural and structured programming languages (everything from assembler to Fortran to C to Java). It doesn’t have to always work that way, though.

Consider a web request. HTTP is stateless by design. We’ve backhacked the bejeezus out of it to try making it have state (cookies, AJAX, etc) but in the end we’re always fighting against that protocol instead of just accepting that documents aren’t application interfaces. But remember the original Perl CGI stuff? That was all based on the way databases treat time — as transaction points along the t-axis. This is very different than inventing the notion of ongoing state. So a user fills out a form, and when he submits it there is a database transaction that completely succeeds or completely fails. Each request for a page view generates a new page via a function whose input is the request URL (and whatever detailed data lies within the GET string), which then calls another function whose input is a complete response from the database containing a snapshot of data from a point in time, and whose output is the web page requested.

There is no necessity for state here. The input to the function is the request URL string. The function breaks that down and returns a complete output (which could be an error message). But there is nothing ambiguous about this and executing the same functions with the same inputs would always return the exact same output, every time. There are no objects carrying state between requests. There are no variables that live beyond the request -> response lifetime.
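To make the request-in, page-out idea concrete, here is a minimal sketch in Python. The names `parse_request` and `render_page`, the `/article` path and the dictionary standing in for a database snapshot are all hypothetical, invented for illustration; the point is only that the output depends on nothing but the two inputs.

```python
from urllib.parse import urlsplit, parse_qs


def parse_request(url):
    """Break a request URL into a path and a dict of GET parameters."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


def render_page(url, snapshot):
    """Pure function: the same URL and the same snapshot give the same page."""
    path, params = parse_request(url)
    if path != "/article":
        return "404: no such page"
    record = snapshot.get(params.get("id"))
    if record is None:
        return "404: no such article"
    return "<h1>{title}</h1><p>{body}</p>".format(**record)


# The snapshot stands in for "a complete response from the database".
snapshot = {"7": {"title": "On Looping", "body": "Time as slices."}}
page = render_page("/article?id=7", snapshot)
```

Nothing survives between calls: a later edit to the article simply shows up as a different `snapshot` argument on the next request.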

“But timestamps and things might change and if someone updates a page then a request before and after would be different!” you might say. And this is true (and indeed the whole point of such CGI scripts) but all that is external to the functions themselves. The content of the pages, including the timestamps, page templates and record data, are input to the functions, not persistent state variables that are evolving with time. The functions don’t care what the inputs are, they only need to do their job — the data is a completely separate concern (remembering this will minimize fights with your DBA, by the way… if you know an asshole of a DBA, consider that he’s probably not trying to be an asshole, but rather trying to help you simplify your life despite yourself).

All that this functions-that-don’t-carry-state business means is that we don’t need variables; we need symbolic assignment. Further, it’s OK if that symbolic assignment is immutable, or even if that assignment never happens and is merely functions passing their results (or themselves) along to one another.

This is the foundation for transactional thinking and it treats time as slices or points along the t-axis instead of a stateful concept that is ongoing. It’s also the foundation of quite a bit of functional programming. There are no variables internal to the functions themselves that have any influence on the output of the program. Every case of a given input results in an exactly defined output, every time. This simplifies programming in general and debugging in particular.

It’s not the best for everything, of course. In simulations and games you need a concept of state to cover the spans between transactions or time-slicing periods. For example, in a game you are fighting a mob. The fight isn’t instant; it is a process of its own and the immediate gameplay is based on the state carried in the various objects that make up your character, the mob, equipped items, world objects, etc. But it doesn’t really matter in a grander sense whether or not each strike or point of time in the fight is actually transacted to the data persistence layer. If the server crashed just then you’d probably prefer not to log back in to the middle of a boss fight while your screen is still loading, or while half your party hasn’t logged back in yet.

So the fight is its own discrete subsystem which is intimately concerned with state and OOP design principles and is therefore completely expendable: it matters not if you won or lost that particular fight if the server crashes in the middle of it. The overall world, however, is designed based on transactional concepts of time slices. It does matter what the last non-combat status and position snapshot of your character was, and you probably care quite a bit about certain potentially monumental events like when your character binds himself to some epic loot, levels up or applies a new skill point (that better go in the database, right?).

The vast majority of user applications aren’t games or simulations, though. Almost everything on the web is text string manipulation and basic arithmetic. Almost everything in business applications development amounts to the same. So we don’t need stateful functions and objects, in fact we get confused by them (the exception here being the interface itself: objects make awesome widgets and windows). If I make an object to represent, say, a customer invoice, and I’m doing calculations on that invoice or within it using methods, to really bug test that object I have to test every method under every possible condition, which means every permutation of state possible. If that object is a subclass of another object, or interacts with another object in the system, like say a customer record object or a discount coupon object, I have to test both against each other in every combination of possible states. That’s impossible in the lifetime of this universe, so nobody does it. The closest we come is testing a tiny subset of the most likely cases out of a tiny subset of what’s possible, and then we are constantly surprised at lingering bugs users report in modules later because they passed all the tests (or even dumber, we aren’t surprised at all, which says something about how blindly we stick to methodology in the face of contrary evidence).

But we don’t need a stateful concept for, say, a customer invoice. We need a snapshot of what it looked like before, and what we want it to look like next. This “next” result (which is just the output of a function), once confirmed (transacted in the database) becomes the next snapshot and you discard the intermediate concept of “state” entirely. Line item calculations and changes should be functions that operate per line on input data from that line. Invoice sums, averages, discounts, etc. should be functions concerned only with relevant input data as well, namely the aggregate result of each line item.
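As a rough illustration of the snapshot-in, snapshot-out approach, here is a small Python sketch. The field names (`quantity`, `unit_price`, `lines`) and the flat discount rate are assumptions made up for the example, not anything from a real invoicing system.

```python
from decimal import Decimal


def line_total(line):
    """A line-item calculation that looks only at that line's own data."""
    return line["quantity"] * line["unit_price"]


def invoice_subtotal(lines):
    """An aggregate that depends only on the per-line results."""
    return sum((line_total(line) for line in lines), Decimal("0"))


def apply_discount(subtotal, discount_rate):
    return subtotal * (Decimal("1") - discount_rate)


def next_snapshot(previous, new_lines, discount_rate):
    """Build the next invoice snapshot; the previous one is never mutated."""
    lines = previous["lines"] + new_lines
    subtotal = invoice_subtotal(lines)
    return {
        "lines": lines,
        "subtotal": subtotal,
        "total": apply_discount(subtotal, discount_rate),
    }


before = {
    "lines": [{"quantity": 2, "unit_price": Decimal("10.00")}],
    "subtotal": Decimal("20.00"),
    "total": Decimal("20.00"),
}
after = next_snapshot(
    before, [{"quantity": 1, "unit_price": Decimal("5.00")}], Decimal("0.10")
)
```

Once `after` is transacted it simply becomes the next `before`. Each function can be tested on its inputs alone, with no object state to set up first.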

None of this involves a stateful requirement and shouldn’t involve stateful objects because that state is an unnecessary complication to the system. It also causes problematic architectural questions (is an invoice an object? is a line item an object? are they both objects? what do they inherit from? do they have a common ancestor? how do we make relational data sorta fit with our object model?) What are all these class declarations and the attendant statefulness actually doing for you? Nothing but permitting you to write yourself into a hole with mistaken side-effects (oops, that method wasn’t checking to clear the “discount” boolean in this object before it does `self.foo()`).

Even more interesting, sticking with the web example above and moving it forward to 2012 where everyone is mad for Django and Rails, these ORM-based frameworks almost never maintain persistent state objects between calls. When they do it often causes a peculiar type of unique-per-thread bug. So what is all this business about class/model declarations in the first place, since objects are first and foremost data structures with behaviors? These class declarations are trying very hard to be `CREATE TABLE` and `ALTER TABLE` SQL commands but without the benefit of actually being SQL commands. In other words, they are a weak form of data dictionary that sacrifices both the clean presentation of S-expressions or YAML trees and the full power of your favorite database’s native language.

Dealing with the database on its own terms and letting your functions be stateless makes it enormously easier to test your system because each function has a knowable output for each knowable input. It does mean that you must have a small constellation of functions to do the job, but you were going to write at least as many methods against your objects anyway (and often duplicate, or very nearly duplicate, methods across non-sibling objects that do very nearly the same thing, or even worse, inherit several layers deep in a ripped fishing net pattern, which makes future changes to parent classes a horror). If you are dealing with data from a database, trying to force your relational data into objects is poor thinking in most cases anyway, not least because none of that object architecture gets you one iota closer to accomplishing your task, despite all the beautiful design work that has to go into making relational data work with OOP. Unless you’re writing a simulation (as in, a MUD, an MMORPG or a flight simulator) there is no benefit here.

The easiest way to envision slices of time for people who got into programming later than 1994 or so is usually to recall the way that database-to-webpage functions worked in the CGI days as referenced above. They don’t maintain state, and if they do it is just as static variables which exist long enough to dump content into a webpage substitution template before being deallocated (and if written differently this wouldn’t always be necessary, either — the following are not the same: `return (x + 1);`, `return x++;`, `x = x + 1; return x;`). The only thing that matters to such functions is the input, not any pre-existing state. You can, of course, include cookies and SSL tokens and Kerberos tickets and whatnot in the mix — they merely constitute further input, not stateful persistence as far as the function itself is concerned.

There are some consequences to this for functional programs. For one thing, most loops wind up working best as recursively defined functions. In an imperative OOP language like Java this is horrible because each iteration requires a new stack frame, and often a new object with its own state, and therefore becomes a wild resource hog and executes really slowly. In a functional language, however, tail-recursive functions often perform better than the alternative loop would. The reason for this is that (most) functional languages are optimized for tail-recursion. This means that a recursive function that calls itself as the last step in its evaluation doesn’t need to return state to its previous iterations; it is dependent only on its input. The machine can load the function one time and jump to its start on each iteration with a new value (whatever the arguments are) in the register without even needing to hit the cache, mess with iterator variables, check, evaluate or reform state, etc. (Note that this is possible in languages like C as well, but doing it by hand usually requires use of the “evil” goto statement. In higher level languages, however, this sort of thing is usually completely off limits.)

Let’s look at a stateful countdown program of the type you’d probably write on the first day of class (small caveat here: I’ve never had the privilege to attend a class, but I’m assuming this is the sort of thing that must go on the first day):

```
#include <stdio.h>

int main(void)
{
    int x = 10;

    /* Count down by mutating x until it reaches zero. */
    while (x > 0)
    {
        printf("%d\n", x);
        x--;
    }

    printf("Blastoff!\n");
    return 0;
}
```

Of course, this could be done in a for() loop or whatever, but the principle is the same. Looping is king in C, and for a good reason. Here is an equivalent program in Guile (a Scheme interpreter that’s just one “yum install guile” away if you want to play with it):

```
(define (countdown x)
  (begin
    (display x)
    (newline)
    (if (> x 1)
        (countdown (1- x))
        (display "Blastoff!\n"))))

(countdown 10)
```

In this program there is no loop. Instead, the function calls itself with an argument equal to the initial argument decremented by 1, and we are not actually dealing with variables at all here.

In fact, the expression `(1- x)` does not change the value of x at all, because it is merely an expression which returns “the result of ‘decrement x by 1′” (an equivalent would be `(- x 1)`, which may or may not be equal in efficiency depending on the lisp environment) and not an assignment statement. (There are assignment statements in most functional languages, and they are primarily (ab)used by folks coming to functional languages from non-functional ones. Not saying you never want variables, but more often than not it’s more sensible to deal with expressions that say what you mean exactly once than with stateful variables.)

Being a toy example I’m using a “begin” statement to guarantee execution order in the interest of formatting text output. This sort of thing isn’t present in most functions, since whatever is required for the return value will be executed. Here we are generating an output side effect. We could eliminate the “begin” in the interest of being “more functional” while still emitting something to stdout as a side effect, but it might be harder to read for someone coming from an imperative background:

```
(define (countdown x)
  (display (string-append (number->string x) "\n"))
  (if (> x 1)
      (countdown (1- x))
      (display "Blastoff!\n")))
```

If displaying an integer argument with a newline appended (or more generally, any argument value with a newline appended) was a common requirement (like for log files) the line beginning with `display` would become its own function, like `display\n` or something and be called more naturally as `(display\n x)`.

This is a very simple example of how stateless recursion works in place of loops. You can implement loops in functional languages, which is usually a bad idea, or you can implement tail-recursive functions in most imperative languages, which is also usually a bad idea (especially when it involves objects, unless you start inlining assembler in your C with the appropriate jumps/gotos… which isn’t worth the trouble compared to a loop). Having demonstrated the equivalence of loops and recursion for most purposes, it bears mentioning that within the lisp community (and probably others) using tail-recursive functions in place of `while()`/`for()` loops is so common that very often when someone says “loop” what they mean is a recursive loop, not an iterative loop.
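For readers more at home in Python, the same equivalence can be sketched as below. This is an illustration only, with function names of my own choosing, and both versions return the output lines instead of printing them so they can be compared directly. Note that CPython does not eliminate tail calls, so the recursive form consumes one stack frame per step and will hit the interpreter’s recursion limit on a large enough argument; it is exactly the kind of “usually a bad idea” mentioned above.

```python
def countdown_loop(x):
    """Iterative version: one mutable counter and one mutable list."""
    lines = []
    while x > 0:
        lines.append(str(x))
        x -= 1
    lines.append("Blastoff!")
    return lines


def countdown_recursive(x, lines=()):
    """Tail-recursive version: nothing is reassigned, the accumulated
    output travels along as an argument to the next call."""
    if x < 1:
        return list(lines) + ["Blastoff!"]
    return countdown_recursive(x - 1, lines + (str(x),))
```

Calling either function with the same argument yields the same list of lines; only the way the intermediate values are carried differs.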

I say this to illustrate that when comparing languages it’s not about whether you can do something in language X or not, but whether a certain style of thinking matches the way the compiler or execution environment works. Step outside of the pocket that the language or runtime has built for you and you wind up very quickly in a nasty, panicky, twisty place where unmaintainable hacked up speed optimizations start feeling necessary. The futility of such speed optimizations is usually evidence that you’ve exited the pocket of your language, and is why old timers often are heard saying things like “if you need all those speed hacks, you need to reconsider your design overall”. It’s not because there is a speedier way to make that one method or function work, per se, but more typically what they are getting at is that you are thinking against the paradigms or assumptions that your language or environment was designed around.

In cases not involving simulation I find that carrying lots of state in variables, structs and the like complicates even simple things unnecessarily and makes testing confusing by comparison to stateless functions. OOP is even worse in this regard because you’ve got functionality intertwined with your state vehicles, and inheritance makes that situation even more convoluted. When dealing with lots of state, stack traces can’t give you a clear picture of what happens every time with any given input, only what happened this time. Understanding what happened when you’re dealing with state is not as simple as looking at the arguments; you have to consider the state of the environment and objects within it at the moment of execution. This gets to the heart of why I recommend thinking of time as slices or points or snapshots instead of as continuous state, and is a very important idea to understand if you want to grok functional programming.

Object-Relation Mismatch: Comparing Strawberries and Sunglasses

Tuesday, July 31st, 2012

I’ve been spending a lot of time lately writing a rather large suite of business applications. The original customer was a construction company which needed a replacement for their estimation system. Then the same customer needed a facility pass management system to tame the insane amount of bit-shoveling/paperwork involved in getting security clearances for workers to perform work in secure sites. Then a human resources system. Then a subcontract management system. Then a project scheduling system. Then an invoicing system.

The point here is, they really liked my initial work, and suddenly I got further orders. Pretty soon after discussing the first few add-on requirements with the customer it became apparent that I was either going to be writing a bunch of independent systems that would eventually have to learn how to talk to each other, or a modular system that covered down on office work as much as possible and could pull data from associated modules as necessary (but by the strictest definition is not an “expert” or ERP system — note that “ERP” is now a buzzword void of meaning just like “cloud”). Obviously, a modular design is the preferred way to go here, and what that costs me in effort making sure that dependencies don’t become big globby cancer balls buys me enormous gains selling the same work, reconfigured, to other customers later and makes it really easy to quickly write add-ons to fill further needs from the same customer.

Typical story. But how am I doing it and what does this have to do with the Dreaded Object-Relation “Impedance” Mismatch? Tools, man, tools. Most of the things I wrote in the past were system level utilities, subsystems, security toys, games, one-off utilities for myself to make my previous office work disappear[1], patches to my own systems, and other odds and ends. I’d never sat down and written a huge system to automate away someone else’s problems, though it turns out this is a lot more fun than you might expect provided you actually take the time to grasp what the customers need beyond what they have the presence of mind to actually say. (And this last point is worthy of an entire series of books no one will ever pay me to write.)

And so tools. I looked around and found a sweet toolkit for ERP called Tryton. I tried it out. It’s pretty cool, but the biggest stepping stones Tryton gives you out of the box are a bunch of pre-defined models. That’s fine, at first, but they are almost exclusively based on non-normalized (as opposed to denormalized) data models. This looked good going in, but turned out to suck horribly as time passed.

Almost all of the problems ultimately trace back to the loose way in which the term “model” is used in ORM. “Model” means both the object definitions and the tables that feed them[2]. Which is downright mad because object member variables are not table columns, not by a mile, and tables can’t do things. This leads to a lot of wacky stuff.

Sometimes you can’t tell if it makes sense to add a method to a model, or to write a function and call it with arguments because what you’re trying to do isn’t an inherent function of the modeled concept itself (and if you’ve been conned into using Java life sucks even more because this decision has already been made for you regardless of your situation). And then later you forget some of the things you wrote (or rather, where they are located, which equates to forgetting how to call them) because it was never clear from the outset what should be a function, what should be a method, and what is data and what is an object. This makes it unclear what should be an inherited behavior and what should be part of a library (and I’ll avoid ranting about the pointlessness of libraries of Java/struct-based objects). And this is all because we love OOP so much that we’re willing to overlook the obvious fact that business rules are actually all about data and processes, not about objects and methods, at the expense of sane project semantics.

In other words, business rules are about data, most easily conceptualized as nouns, and not really about verbs, most easily conceptualized as functions (and this is the beginning of why using OOP for things other than interface and simulation design is stupid: because it’s impossible to properly subordinate verbs to nouns or vice versa).

Beginning with this conceptual problem you start running into all sorts of weirdness which principally revolves around the related problem that every ORM-based business handling system out there tries to force data into a highly un-normalized data model. I think this is in an effort to make business data modeling “easy”, but it results in conscious efforts by framework designers to prevent their users (the application developers) from ever touching or knowing about SQL. To do that, though, it is necessary to make every part of data constraint, validation, verification, consistency, integrity (even referential integrity!), etc. into methods and functions and processes which live in the application. Instead of building on the fascinating advancements that have been made in data rule systems this approach deliberately tosses them aside and reinvents the wheel, but much worse. This relegates the database itself to actually just being a million-dollar file system[3].

For example, starting out with the estimation stuff wasn’t too hard, and Tryton has a fairly easy-to-use set of invoicing, receiving, accounting and tax configuration modules you can stack on to get some sweet functionality for free. It also has a customer management model and a generalized personal information manager that is supposed to form the basis for human resources management stuff you can build yourself. So this is great, right?

Wrong. Not just wrong because of the non-normalized data, I’ll get to that in a moment, but primarily wrong because nearly everything in the system attempts to be object oriented and real data just doesn’t work that way at all. I didn’t realize this at first, being inexperienced with business applications development. At first I thought, “Ah, so if we make our person model based off the person-party-address chain we can save a lot of time writing by simply spending time understanding what’s already here”. That sort of worked out. Until the pass management request came in. (That basing the estimation module off of the existing sales/orders/invoices chain would be a ridiculous prospect was a far less obvious problem.)

Now I had a new problem. Party objects are one table in the database, people objects are a child class in the application that inherits Party but is represented in the database as a separate table that doesn’t inherit the party one (but has a pass-up key instead to make the framework portable to database backends that don’t support inheritance or other useful features — more on that mess later) and addresses are represented in the database as being a child table to the party table, but as independent objects within the OO system at the application server level.

Still doesn’t sound horrible, except that it requires a lot of gymnastics to handle security checks and passes this way. In particular, getting security clearances for workers involves explaining two things in excruciating detail: family relationships and address histories.

The first problem has absolutely no parallel in Tryton, so writing my own solution was the only way to proceed. This actually turned out to be easier than tackling the second problem, specifically because it let me write a data model first that was unencumbered by any design assumptions inherent in the system (other than fighting with the basic OOP one-table-per-model silliness). What was really required was to understand what constitutes a family. You can’t adopt a sibling, but a parent can adopt you, and reproduction is what makes people to begin with which requires a M+F pair, and we need an extra slot each direction for adoption/step relationships. So every person who shares a parent with you is a sibling. Label them based on sex and distance and voilà! we’ve got a self-mapping family model. Cake. Oh wait, that’s only cake in SQL. It’s actually really, really ugly to do that from within OOP ORM code. But enough about families. That was the easy part.
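To give a feel for why this is “cake in SQL”, here is a minimal sketch using Python’s built-in `sqlite3` module as a stand-in database. The table layout, column names and sample people are all invented for illustration, and the sketch leaves out the extra adoption/step slots and the distance labels mentioned above; it only shows the “anyone who shares a parent with you is a sibling” rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    sex       TEXT NOT NULL CHECK (sex IN ('M', 'F')),
    mother_id INTEGER REFERENCES person (id),
    father_id INTEGER REFERENCES person (id)
);
INSERT INTO person VALUES
    (1, 'Hanako', 'F', NULL, NULL),
    (2, 'Taro',   'M', NULL, NULL),
    (3, 'Ichiro', 'M', 1, 2),
    (4, 'Jiro',   'M', 1, 2),
    (5, 'Yuki',   'F', 1, NULL);
""")

# A sibling is any other person sharing at least one known parent.
# Comparisons against NULL are never true, so unknown parents match nobody.
SIBLINGS = """
SELECT s.name,
       CASE s.sex WHEN 'M' THEN 'brother' ELSE 'sister' END AS label
  FROM person AS p
  JOIN person AS s
    ON s.id <> p.id
   AND (s.mother_id = p.mother_id OR s.father_id = p.father_id)
 WHERE p.id = ?
 ORDER BY s.id
"""


def siblings(person_id):
    """Return (name, label) pairs for everyone sharing a parent with person_id."""
    return conn.execute(SIBLINGS, (person_id,)).fetchall()
```

The relationship is never stored anywhere; it falls out of a self-join on the parent columns, which is the part that gets ugly when it has to be expressed through object methods instead.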

Addresses were way more problematic. Most software systems written in Western languages were developed in (surprise!) the West. The addressing systems in the West vary greatly and dealing with this variance is a major PITA, so most software is written to just completely ignore the interesting problem worth solving and instead pretend that addresses are always just three text strings (usually called something like “address_1”, “address_2” and “postal_code”). In line with the trend of ignoring the nature of the data that’s being dealt with, most personnel/party data management models plop the three address elements directly into the “person” (or “party” or “partner”, etc.) table directly. This is what Tryton does.

But there’s a bunch of problems here.

For one we’ve completely removed any chance of a person or party having two addresses without either adding more columns (the totally stupid, but most common approach) or adding a separate table and letting our existing columns wither on the vine. “Why not remove them?” — because removing columns in a pre-fab OOP ORM can have weird ripple effects because other objects expect the availability of those member variables on the person or party objects and the interface bits usually rely on the availability of related objects methods, etc.

Another problem is that such designs train users wrong by teaching them that whenever a person changes addresses the world actually changed as well and the right thing to do is erase the old data and replace it with something new. Which is crazy — because the old address is still the correct label for a location that didn’t move in the real world and so erasing it doesn’t mirror reality at all.

So all this stuff above is totally ignored by the typical software model of addressing, which really puts a kink in any prospect of working within the existing framework to write a background information check and pass management system. These kinds of incomplete conceptual assumptions pervade every framework I’ve dealt with, not just Tryton, and make life within OOP ORM frameworks very difficult when you need to do something that the original authors didn’t think about.

This article is about mismatches, so I’ll point out that the obvious one we’re already overlooking is that the data doesn’t match reality, or even come close. And we’re only talking about addresses. This goes beyond the Object-Relation Mismatch: it’s the Data-Reality Mismatch. It just so happens that the Object-Relation Mismatch greatly enables the naive coder in creating ever deeper Data-Reality mismatches.

Given the way addresses are handled in most software systems we have a new data input and verification problem. With no concept of locations there is no way to let someone who is doing input link parties to common addresses. This is stupid for a lot of reasons, for one thing consider how much easier it is for a user to trace down an existing location tree until they get to a level that doesn’t exist in the database yet and then input just the new parts rather than typing in whole addresses each time.

“But typing addresses is easy!” you say. Not true. We have to track four different scripts per address element (Latin, two forms of kana, and kanji) and they all will have to come out the same way every time for the police computers to accept them. One of the core problems here is validating that person A’s address #2 and person B’s (his brother’s) address #4, which span the same dates, are the same in all details so that the police background checker won’t spit out an error (because they already have this data so yours had better be right). Trusting that every user is always going to input the exact same long address string all four times and never make a mistake is ridiculous. It’s even more stupid when you consider that they are referencing the same places in the real world against data you already have, so why on earth wouldn’t your software system just let them link to existing data rather than force them to enter unique, error-prone new stuff?
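A rough sketch of the “link to existing locations” idea, again using Python’s `sqlite3` module as a stand-in database. The `location` and `residence` tables, their columns and the placeholder place names are assumptions invented for this example (a real version would also carry the four scripts per element). Two people point at the same location row, and the address string is assembled by walking up the tree instead of being typed twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES location (id),
    name      TEXT NOT NULL
);
CREATE TABLE residence (
    person     TEXT NOT NULL,
    location   INTEGER NOT NULL REFERENCES location (id),
    start_date TEXT NOT NULL,
    end_date   TEXT
);
INSERT INTO location VALUES
    (1, NULL, 'Prefecture X'),
    (2, 1,    'City Y'),
    (3, 2,    'Block 1-2-3');
INSERT INTO residence VALUES
    ('person A', 3, '2005-04-01', '2009-03-31'),
    ('person B', 3, '2005-04-01', '2009-03-31');
""")

# Walk from the given location up to the root, then list root first.
ADDRESS = """
WITH RECURSIVE chain (id, parent_id, name, depth) AS (
    SELECT id, parent_id, name, 0 FROM location WHERE id = ?
    UNION ALL
    SELECT l.id, l.parent_id, l.name, c.depth + 1
      FROM location AS l
      JOIN chain AS c ON l.id = c.parent_id
)
SELECT name FROM chain ORDER BY depth DESC
"""


def address(location_id):
    """Assemble an address string from the location tree."""
    rows = conn.execute(ADDRESS, (location_id,)).fetchall()
    return ", ".join(name for (name,) in rows)
```

Because both residence rows reference location 3, the two brothers’ addresses cannot drift apart through typing mistakes, and an old residence keeps pointing at a place that still exists.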

So assuming that you do the right thing and create a real data model in your database where locations are part of a tree structure, addresses are assembled strings linked against locations and have a time reference, etc., how does all this manifest in the object code? Not at all the way that it presents in the database. Consider trying to define a person’s “current address”.

There are two naive ways to do this and two right ways to do this. The most common stupid approach is to just put a boolean on it “is_current” or something similar and call it good. The other stupid way to do it is to present any NULL end dates as “current” and call it good. But what about the fact that NULL is supposed to mean “unknown” — which would most likely be the case at least some of the time anyway and therefore an accurate representation of known fact. And even more interestingly, how do we declare that a person can only have one current address? Without a programmatic rule you can’t, because making the “is_current” boolean a UNIQUE means that a person can’t have more than one false value, either, which means they can only ever have a current and a not current address (just two) and this is silly. Removing the constraint means that either the client code (really stupid) or a database trigger (sort of stupid) should check for and reject any more than a single true value per person.

But wait… you can’t really do that in an ORM. I mean, you can make an ORM play along with the idea, but you can’t actually create this idea in a simple way from within ORM code, and from OOP ORM code it is really a much huger PITA to coerce the database into giving you what you want than just writing your tables and rules in SQL yourself and some views to massage them into a complete answer for easy coexistence with an ORM. In particular, it’s easiest to have the objects actually have an “is_current” boolean and the database just lie to the ORM and tell it that this is the case on the database end as well. Without knowing anything about how databases work, though, you’d never know that this is the right way to do things, and you’d never know that the ORM is actually obstructing you from doing a good job at data modeling instead of enabling you to do a good job.
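One way the “database lies to the ORM” trick might look is sketched below with Python’s `sqlite3` module. The table, the view name `orm_address` and the rule that “current” means the latest start date per person are all assumptions chosen for the example, not a description of how Tryton or any particular ORM does it. The real table stores only facts; the view derives the boolean the objects expect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE residence (
    person     TEXT NOT NULL,
    address    TEXT NOT NULL,
    start_date TEXT NOT NULL,
    UNIQUE (person, start_date)
);

-- The ORM is pointed at this view. is_current is computed, never stored,
-- and the UNIQUE constraint above means at most one row per person wins.
CREATE VIEW orm_address AS
SELECT r.person,
       r.address,
       r.start_date = (SELECT MAX(x.start_date)
                         FROM residence AS x
                        WHERE x.person = r.person) AS is_current
  FROM residence AS r;

INSERT INTO residence VALUES
    ('person A', 'old place',  '2001-01-01'),
    ('person A', 'new place',  '2010-06-15'),
    ('person B', 'only place', '2008-09-01');
""")


def current_addresses():
    """What an ORM would see when filtering on the is_current column."""
    query = ("SELECT person, address FROM orm_address "
             "WHERE is_current ORDER BY person")
    return conn.execute(query).fetchall()
```

With this arrangement nobody has to remember to flip a flag when a new residence is recorded, since the “one current address per person” rule is a consequence of the data and not a value someone maintains.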

So here’s another mismatch: good data design predicts that objects are inherited one way in Python and the tables follow a significantly different schema in the database. Other than the problem above (which is really a problem of forcing addresses to be children of parties/people and not children of a separate concept of location as we have it in the real world) the object/relation weirdness creates a lot of situations where you’re trying to query something that is conceptually simple, but winds up requiring a lot of looping or conditional logic in the application to sort things out.

As for the looping — here be dragons. If you just trust the ORM completely each iteration may well involve one query, which is really silly once you think about it. And if you do think about it (I did) you’ll write a larger query domain initially and loop over that in the application and save yourself a bunch of round trips. But either way this is silly, because isn’t SQL itself designed to be a language that permits the asking of detailed data questions in the first place? Why am I doing this stuff in Python (or Ruby or Lisp or Haskell or whatever)?
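The round-trip problem can be sketched as follows, once more with Python’s `sqlite3` module and made-up `party` and `invoice` tables standing in for whatever the ORM manages. The first function issues one query per party, which is roughly what naive ORM iteration does behind the scenes; the second asks the whole question in a single SQL statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE party (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoice (
    id       INTEGER PRIMARY KEY,
    party_id INTEGER NOT NULL REFERENCES party (id),
    amount   INTEGER NOT NULL
);
INSERT INTO party VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO invoice VALUES (1, 1, 100), (2, 1, 250), (3, 2, 75);
""")


def totals_one_query_per_row():
    """Loop in the application: one extra round trip for every party."""
    totals = {}
    parties = conn.execute("SELECT id, name FROM party ORDER BY id").fetchall()
    for party_id, name in parties:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM invoice WHERE party_id = ?",
            (party_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_single_query():
    """Let SQL answer the whole question in one statement."""
    query = """
        SELECT p.name, COALESCE(SUM(i.amount), 0)
          FROM party AS p
          LEFT JOIN invoice AS i ON i.party_id = p.id
         GROUP BY p.id, p.name
    """
    return dict(conn.execute(query).fetchall())
```

Both functions return the same totals; the difference is that the first one’s query count grows with the number of parties while the second stays at one.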

But I digress. Let me briefly return to the fact that the tables are inherited one way and the objects another. The primary database used for Tryton is Postgres. This is a great choice. That shows that somebody thought about things before pulling the trigger. Tryton was rewritten from old TinyERP/OpenERP (the word “open” here is misleading, by the way — OpenERP’s terms don’t come close to adhering to the OSS guidelines whereas TinyERP actually did, or was very close) and the main project leader spent a lot of time cleaning out funky cruft — another great sign. But somewhere in there a heavy impulse to be “database agnostic” or “portable” or some other dreamy urge got in there and screwed things up.

See, Tryton supports MySQL and a few other database systems besides that don’t have a very complete feature set. What this means is that to make the ORM-generated SQL Postgres uses similar to the ORM-generated SQL that MySQL uses you have to settle for the lowest-common feature set between the two. So any given cool feature that you could really benefit from in one that doesn’t exist in the other must be ditched for all database backend code or else maintaining the ORM becomes a nightmare.

This means that each time you say you want your framework to be “portable” across databases you are ditching every advanced feature that one system has got that any of the others don’t, resulting in least-common-denominator system design. So every benefit to using Postgres is gone. Poof. Every detriment to using a fast, naive system like MySQL is inherited. Every benefit to a fast, naive system like MySQL is also gone, because nothing is actually written against the retrieval speed optimizations built into that system at the expense of losing all the Big Kid features in a system like Postgres. Given this environment, paying enormous fees for Oracle isn’t just stupid because Postgres can very nearly match it anyway — it’s doubly stupid because you’re not even going to use any cool features that any database provides anyway if you write “database agnostic” framework code.

I had many a shitty epiphany over time as I learned more about data storage concepts in general, relational database systems in particular, and Postgres, Oracle, DB2 and MySQL specifically. (And in that process I grew to love Postgres and generally like DB2.)

So there is a lesson here not related directly to the OOP/relational theme, but worth stating in a general way because it’s important to nearly all software projects that depend on some infrastructure piece external to the project itself:

Pick a winner. If someone else in your project wants to use systemX because they like it, they can spend time making the ORM code work, but that should be an extension to the subsystem list, not a guarantee of the project because you’ve got more important things to do. This could be MySQL vs Postgres or Windows vs Linux. It doesn’t matter — pick one and specialize. Even better, pick whichever one gives the biggest boost to a specific layer of your application stack and use that there.

So far the above thinking has had me settling more on Postgres over anything else and more on Qt at the application level than anything else.

Back to my story. The addressing thing introduced enough problems that I eventually had to ditch it entirely and write my own that was based on normalized location data that carried natural data (parent-child relationships within the hierarchy of physical locations) with an address table that carried human-invented administrative data about those locations (if they have a postal code, and other trivia) and a junction table that connects parties (people or organizations) to those locations via the addresses and carries timeline and other data.

When I did this and mentioned it to some other Tryton folks they flipped out. Not because I had done this in the core project — no, this was my own substitute module — but because:

1. I had written SQL, and not just dabbled in some CREATE TABLE statements
2. I had normalized the data model (well, a very small part of it)

I wrote the SQL to carry the definitions where the ORM just didn’t have a way to express what I wanted (or was really funky to grok once it was written). Apparently this was a big taboo in ORM Land, though I didn’t know that going in. SQL seems to have this forbidden quality that excites as much as it instills fear these days, but I have no idea why. Again, I’m a n00b, so maybe I just don’t get why ORM is so much better. Also, mind you, there was no hostility from anyone, just shock and some sense of the aghast query “what have you done?” (The Tryton community is actually a very warm place to play around and the project leader is enormously helpful, and despite me being an American (and a Texan, no less!) living in Japan and them all snooty Euro types, we got along swell. If any FOSS ERP system has some glimmer of hope as of July 2012 it’s Tryton.)

Writing SQL deeper than a raw() query here and there is one thing, but normalizing the data model is something altogether on a different plane of foul according to the rites of the Holy ORM. I was continually told that this would hurt me in the future if I continued on with Tryton. But on the other hand, they weren’t looking at the small mountain of application code I would need to maintain and forward port forever to get around the non-normalized data issue(s). And anyway, once you normalize data all the way, you don’t normalize it further. There actually is a conclusion to that exercise. I’ve found that my normalized data models tend to endure and changes wind up being modified by additions instead of the painful process of moving things around (and this still seems mysteriously, wonderfully magical and relieving to me — but probably because I’m not actually educated in relational algebra and so can’t see the underlying sense to why normalized data is so easy to extend (I mean, conceptually it’s obvious, but how, precisely?)).

Their arguments about “the future” disregarded the application layer entirely because they were only thinking about Tryton, but for me it wasn’t just one place where non-normalized data started hurting me (it also disregarded that this predicted that I’d wind up leaving Tryton). The original concept for the estimation program didn’t really jibe with the way that a(nother) very obvious customer need could be served by putting meaningful links between what was contained in CAD files, what existed in the product database, and how the units of measure get computed among them. This meant that my real need wasn’t a single application as much as it was a single data store that remained coherent regardless what application happened to be talking to it at the time (I’m not even going to get into security in this post, but that is another thing that is enormously simplified by submitting to The Postgres Way instead of resisting).

And this brings me to another problem — in fact, the real kicker. I started realizing as I wrote all these things that while the Tryton client program is pretty slick, it’s not the end of the road to handle all needs. For one thing, a lot of it involves writing screens in XML. Yuk. That’s about as annoying as it gets, and I’ll leave that there. But most importantly there was no way I was ever going to be able to port the Tryton client to, say, Android (and maintain it) or embed the CAD programs we’re using (one easy to port C++/Qt program, and one black-box Windows contraption we run in Wine that is currently a must-have) and make things run smoothly. I was also going to have my work cut out for me if I wanted to use the same data store to start doing things like drive dashboard snapshot reporting over http or especially provide some CRUD capabilities over the Web for guys out of the office (and this issue goes all the way to the security model here as well).

Anyway, long(er) story short, Tryton just didn’t meet my needs going forward. I could have forced it to fit at a greater cost in time than I am willing to pay, but it just wasn’t a total fit for my needs, and part of that was the way that data in objects don’t really jibe with how data in the real world works.

Writing a PyQt application, for example, I can just ask the database for some information against a view. I can get the query back as a dictionary or a list or whatever I want and display it or manipulate it any way I want. Doing it from a Django web face is pretty easy too. Actually, really easy. Django has an ORM that can make life easier if you ditch the “this class is a table” idea and make them all unmanaged models which actually just call views in the database. It’s even easier overall if they are the exact same views that your other applications call (or not, but usually this winds up being a useful situation). If you remember to not do any processing in the application code, but instead have the database build your view for you and just let Django be the way it gets into a web page then you’re really cooking with gas and can safely take advantage of all the automatic stuff Django can do. (Or even better than web pages, use Django to render OpenDocument files for you, which turns out to be a super easy way to woo your customers because it’s so much more useful than generating web pages. I should probably write a post about how to do this later because it’s just that cool.) It’s even more cool to do this from Snap than Django — but that’s a whole ‘nother story.
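As a rough illustration of "ask the database for a view and get dictionaries back", here is a small sketch. It uses the standard-library `sqlite3` module purely so it can run anywhere; with Postgres the SQL idea is the same but the driver differs. The `product`, `stock` and `product_stock` names and the `fetch_view()` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can now be read by column name
conn.executescript("""
    CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE stock (
        sku TEXT NOT NULL REFERENCES product(sku),
        quantity INTEGER NOT NULL
    );
    INSERT INTO product VALUES ('A1', 'Bracket'), ('B2', 'Hinge');
    INSERT INTO stock VALUES ('A1', 5), ('A1', 7), ('B2', 3);

    -- The database, not the application, does the processing.
    CREATE VIEW product_stock AS
        SELECT p.sku, p.name, SUM(s.quantity) AS on_hand
        FROM product AS p JOIN stock AS s ON s.sku = p.sku
        GROUP BY p.sku, p.name;
""")

def fetch_view(connection, view_name):
    """Return every row of a view as a list of plain dictionaries."""
    # view_name is interpolated into the SQL, so only pass trusted,
    # hard-coded names here, never user input.
    cursor = connection.execute(f"SELECT * FROM {view_name} ORDER BY 1")
    return [dict(row) for row in cursor]

report = fetch_view(conn, "product_stock")
```

The application code never adds anything up. Any other client that can run a SELECT against `product_stock` would see the same totals.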

This was just retrieving data, though. I got curious about entering data. And it’s really easy as well. But it involves a few extra things. Like careful definitions of the data model (ensure actual normalization, which is sometimes surprisingly counter-intuitive in how intuitive it is), multi-column unique constraints, check constraints, really understanding what a foreign-key is for, etc. all while still leaving room for a (now otherwise meaningless) numeric ID column for frameworks that may require it — and this whole numeric-keys-for-everything bit will seem more weird the longer you spend dealing with solid data models.

Basically, use all the tools in the Postgres bag and your life will get easier. And that’s actually not hard at all. The Postgres feature list (even the DB2 feature list) is pretty small compared to the vastness of the entire Python API coupled with the combined might (and confusion, usually) of whatever framework(s) you’re writing around. Doing it right also requires that you learn how to handle the various exceptions that the database will throw back at you as a result of your constraints and rules and things you’ve put in the database. But this makes programming the application layer really easy. Like incredibly easy. And anyway, learning how to handle a single set of database exceptions is a lot easier than trying to remember every stupid little exception condition your framework can produce multiplied by the number of frameworks you have.
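Here is a hedged sketch of what "handle the exceptions the database throws back" can look like. Again it leans on `sqlite3` so that it runs as-is; a Postgres driver raises its own exception classes, so treat the exception name as specific to this example. The `unit_price` table and `add_price()` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit_price (
        id INTEGER PRIMARY KEY,          -- numeric ID for frameworks that want one
        sku TEXT NOT NULL,
        valid_from TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents > 0),
        UNIQUE (sku, valid_from)         -- multi-column unique constraint
    );
""")

def add_price(connection, sku, valid_from, price_cents):
    """Try the insert and report whether the database accepted it."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO unit_price (sku, valid_from, price_cents) "
                "VALUES (?, ?, ?)",
                (sku, valid_from, price_cents),
            )
        return "accepted"
    except sqlite3.IntegrityError as error:
        # The constraint did the validation; the application only reacts.
        return f"rejected: {error}"

first = add_price(conn, "A1", "2012-07-01", 1500)
duplicate = add_price(conn, "A1", "2012-07-01", 1600)
negative = add_price(conn, "A1", "2012-08-01", -5)
```

The first insert is accepted, and the other two come back as rejections (one from the unique constraint, one from the check constraint) without a single line of validation logic in the application.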

And this is what is solving my core problem. I’m discovering that not only is SQL pretty darn easy, but that it solves my core business logic problems without actually writing any business logic. I think this is what the relation guys at IBM knew they were on to decades ago when they thought this idea up in the first place.

Consider the “current address” issue above. I didn’t use booleans, logical processes or any other trick to figure out whether an address was current or not, nor did I have to write a special rule that states that a person can only have a single current address at once but any arbitrary number of non-current addresses, nor did I have to write a single spot of application code. The problem is solved by the structure of the data alone — which is always the most efficient solution since it involves zero processing.
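A minimal sketch of one structure with this property follows. The table and column names are hypothetical and this is not necessarily the schema described above; it only shows the general idea that a timeline column plus a key can answer "which address is current?" without flags or application rules. It uses `sqlite3` so it runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE residence (
        party TEXT NOT NULL,
        address TEXT NOT NULL,
        moved_in TEXT NOT NULL,          -- ISO date, so text order is date order
        PRIMARY KEY (party, moved_in)    -- at most one move per party per date
    );

    -- "Current" is simply the latest row per party. Since only one row can
    -- be the latest, nobody can have two current addresses.
    CREATE VIEW current_residence AS
        SELECT r.party, r.address, r.moved_in
        FROM residence AS r
        WHERE r.moved_in = (
            SELECT MAX(moved_in) FROM residence WHERE party = r.party
        );

    INSERT INTO residence VALUES
        ('Alice', '1 Elm St, Austin', '2005-03-01'),
        ('Alice', '2-3 Sakura, Osaka', '2010-09-15'),
        ('Bob', '9 Rue Cler, Paris', '2008-01-20');
""")

current = dict(conn.execute("SELECT party, address FROM current_residence"))
```

Recording a later move is a plain INSERT. The older rows become history automatically and the view picks up the new current address with no flag to flip.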

This blows all that “use this framework to build your apps in 5 easy steps with Rails!” bullshit away. But I am a little put out that the concepts themselves don’t have more support within the trendier parts of the software development world. It seems everyone is jumping on the out of control bandwagon that marketers overloaded with Java Beans and Hopes and Dreams all those years ago and sent tumbling down the hill. It’s like the Obama campaign infected the software industry (because he totally earned that Nobel Prize and Hawking doesn’t deserve one). It’s still rocketing down the hill, distracting faculty, investors, budding programmers and the marketing world almost completely. It’s really amazing. I am a little upset that discovering a really sane way to manage data was so hard and took so long among the enormous volume of siren screams and other noise on the wire in the development community. Of course, now that I know what I’m looking for locating good discussions and resources isn’t that hard — though it is a little odd to note that the copyright dates on most of them predate my own existence.

So now, as I convert a mishmash of previously written independent application models into a central data concept I am finding something amazing: I haven’t found a single business rule yet that isn’t actually easier to express in terms of data structure than it is to put in application code. I’m also finding that importing the data from the (now legacy) application databases is also usually not that hard, either, but requires more mental effort than anything else on my plate now.

Most amazing of all is the ease of writing application code. Even if I’m writing one application in C++/Qt, another in PyQt, another in Django, another in CL and another in Haskell that run variously across the spectrum of servers, tablets, phones and desktops[4], they can all live under the same guarantees and are super easy to understand because of the extreme lightness of all their code. I’m not doing anything but showing stuff to the user from the database, and putting stuff back in the database, and adjusting based on whether or not the database accepted what was given.

This makes application development fun again. Previously I had been bogged down in trying to define business logic rules as processes, and that was boring, especially since the magic sauce really should have just been a data model forcing me to be correct in the first place instead of me chasing exceptional cases through a bunch of logical code paths in the application (which had to be duplicated across all applications!). Also, this effort tended to put horse-blinders on me as far as interface went. Once I wrote a web interface, the enormous freedom that native application development gives you is suddenly invisible and you’re thinking in terms of “what is the parallel widget in Qt or GTK to the HTML SELECT” or whatever. That’s just lame. But it’s what starts happening when you spend so much brainpower worrying about conditional business logic that you forget all the cool stuff you can do in a native application (like 3D flowcharts, or 3/4D projections of project management data throughout time that you can “paw through” with the mouse or even a game controller, or a million other kickass ideas we usually only ever get to see in vidya games).

Getting your data model right gives you not only the mental freedom to start exploring what native UI can do that goes so far beyond the pitiful bag of cheap tricks that “web app development” has made standard today (or the convoluted mess of JavaScript and AJAX trash that supports it), it also gives you the confidence to step out and do some cool stuff in your client applications because, hey, the data model part of the problem is already solved. All you have to do is serialize the data in your application — which means in the application if you want to have objects, go for it, but make sure they are based on a view of derived data, not a 1-for-1 mapping of objects to relations. That serialization is an easy problem to have gives you the focus to do cool stuff nobody else is doing — and it all comes down to doing data right and escaping from the ridiculous house of mirrors that ORMs lead you into.

There is a conceptual mismatch between the object world and the relational world that is so vast that it is not worth trying to bridge. I’m saying there isn’t an Object-Relation Mismatch. They just aren’t even the same thing, so how could we have ever thought that comparing them against the same criteria ever made sense to begin with?

[1. Both when I was a desk jockey for a while and when I was still in the Army — being an SF engineer involves a good bit of math that you know about going in (and no calculators, so programming is no help there anyway) but is also huge amounts of paperwork that they never tell you about until after you walk thousands of miles to get your floppy green hat.]

[2. This is every bit as damaging as the way that leftist political thinkers loosely throw around the word “society”. As in “you owe it to society” and “society owes it to them” or “society must regulate X, Y, and Z” when what they really mean in some cases is actually “your community” and other cases as “the government”, which convolutes the discussion enough that obviously unacceptable things can seem acceptable — which is similar to how obviously non-OO things have been massaged into a OO-ish shape in the minds of thousands, but still remain just as procedural or functional as ever they were in reality.]

[3. This mistake is somewhat comically enshrined in the new NoSQL stuff, which consists principally of reinventing pre-relational data systems IBM already worked on and largely discarded decades ago.]

[4. In fairness, almost everything is running on Linux, and this makes development much easier than if I were trying to tackle the full spectrum of device OSes out there. Who wants to write a 3D reporting face for Blackberry that needs to work and look the same way it does on Android, KDE on Linux, or iOS (or Windows Phone… haha!).]

Sane Version Numbering

Sunday, July 1st, 2012

Version numbers should have definite meanings — in particular they should have meanings that provide some concrete semantics other than simple collative comparison. In open source this is usually the case, though there are certainly projects packed with wacky numbering, but that’s usually discouraged by the community strongly enough to not happen unless external and well-monied parties foist it on the project in the interest of “marketing”. There is an understood — and should be a natural and explicit — difference between major, minor and sub-minor version numbers, and what increases to each represents.

A sub-minor version number increase (ver. 3.4.x to 3.4.x++) represents improvements to the current state of the software. This means bug fixes, more complete documentation, translation improvements, code refactoring that does not impact the API in any way (bug fix, optimization, etc.) or other changes that do not interfere with interface expectations (human, machine or construct) or anything currently accepted as input or expected as output by the program.

A minor version number increase (ver. 3.x to 3.x++) represents new functionality and improvements that do change the API definitions, but are backwards compatible. A minor version increase might also include interface changes, so long as they don’t invalidate user expectations (if human interfaces) or invalidate what was considered valid input or expected output. So adding a menu item is OK, but moving menus around or changing the semantics of existing menu items is not.

A major version number increase (ver. X to X++) represents new functionality, fixes, etc. but does break backwards compatibility. Anyone moving to a new major version should expect to read the docs and check for changes deliberately or re-read the manual to find out what is different. If the software is a library then other programs which rely on that library cannot be expected to automatically work with the new major version because some or even all semantics or syntax might now be different. Any program’s output might be subject to change, so a program that always output “[username] [size] [time]” might now output “[time] [username] [size]” by default, so other programs that rely on that output (like parsers or data importers) cannot rely on the output always being the same without checking whether it is first (it is always polite to include a “legacy output” switch if this is the case, and accordingly sound coding practice to explicitly declare output mode switches in programs whenever they are available to reduce breakage like this later on).

[NOTE: The following examples are in Python and refer to a library example, but that doesn’t matter in the slightest. This concept applies to all projects of all sorts in all languages, even (or especially) projects written in assembler.]

Consider for example, that we had a library API that had originally defined a function `find_prime()` this way:

`def find_prime(max):`

This was fine, but has some specific problems. For one thing it couldn’t do negative primes and would crash or return garbage values (a crash is probably better than getting garbage, actually). So in version X.Y.Z a lot of code broke or was tricky to write well because programmers would have to remember to pass it only values that were tested for being positive before calling the function to be safe, and that is both hard to remember to do every time on an infrequent function call and annoying to do anyway because if it happens every call then the positives-only check should really be a part of the function. So in version X.Y.Z++ the definition changes to provide an “IsNegative” exception to be nice to programmers and make the function more general. So now calling `find_prime(100)` works like it always did, but `find_prime(-100)` produces a much more human-friendly exception case instead of randomly terrorizing your program by letting it continue to function with garbage values scattered about within it.
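A sketch of what the X.Y.Z++ version might look like is below. The post never shows the body of `find_prime()`, so the behaviour here (return every prime up to and including `max`) is an assumption, as is making `IsNegative` a subclass of `ValueError`. The parameter keeps the name `max` to match the signature above, even though that shadows the built-in.

```python
class IsNegative(ValueError):
    """Raised when find_prime() is asked for primes below zero."""

def find_prime(max):
    """Return every prime p with 2 <= p <= max (assumed behaviour)."""
    if max < 0:
        raise IsNegative(f"find_prime() needs a non-negative max, got {max}")
    # Sieve of Eratosthenes: cross off the multiples of each prime found.
    sieve = [True] * (max + 1)
    primes = []
    for n in range(2, max + 1):
        if sieve[n]:
            primes.append(n)
            for multiple in range(n * n, max + 1, n):
                sieve[multiple] = False
    return primes
```

Existing calls with a positive argument behave exactly as before, which is what makes this a sub-minor change. Only the previously broken negative case now fails loudly.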

The original function suffered from another problem, though. In version X.Y the function could accept a “max” argument, but couldn’t define a “min” argument, so programmers would have to generate primes up to “max” and then test for and cut off all the ones below the “min” value they really wanted. This is a general problem, and so it should be part of the function. So in version X.Y++ the definition was expanded to:

`def find_prime(max, min=False):`

This change permits the old call of `find_prime(100)` just fine, but also allows it to be called as `find_prime(100, 30)` or even better, the more natural `find_prime(min=30, max=100)`. So this change actually adds a feature, not just a bug fix, and so makes sense to include in a minor version increase.
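Here is one way the X.Y++ definition could be filled in, as a sketch only. The return value (a list of primes in the inclusive range) is assumed, the parameter names deliberately shadow the built-ins to match the signature above, and a plain `ValueError` stands in for the friendlier exception discussed earlier so that the block is self-contained.

```python
def find_prime(max, min=False):
    """Return primes p with min <= p <= max; min=False means no lower bound."""
    if max < 0:
        raise ValueError("max must not be negative")
    lower = 2 if min is False else min
    start = 2 if lower < 2 else lower  # nothing below 2 is prime
    primes = []
    for n in range(start, max + 1):
        # Trial division by every candidate divisor up to sqrt(n).
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            primes.append(n)
    return primes

old_style = find_prime(100)               # the pre-existing call still works
positional = find_prime(100, 30)          # new, but reads backwards
by_keyword = find_prime(min=30, max=100)  # new, and reads naturally
```

The default value is what keeps the change backwards compatible: nothing written against the old one-argument form has to be touched.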

Eventually the time came around for a major version increase, so in version X++ nobody has to care about backwards compatibility anymore, so the project is clear to break everything and not look back. The old `find_prime(100, 30)` type call was really driving everyone nuts because positionally `min` should come before `max`, and having positional arguments seem out of order is annoying, and some people hate using keyword arguments, and religious wars break out over such things anyway. So it’s clear this needs to be changed. But this couldn’t have been changed in previous minor versions of the library because a lot of the old code written against that library relied on constructs like `find_prime(100, 30)` being correct and breaking all that code wouldn’t have made sense over such a petty issue. But in a major version increase we don’t care. So this time around the definition was fixed to:

`def find_prime(min, max, *args, **kwargs):`

This is a considerably expanded definition, and this function actually includes a lot more functionality with finding primes than the old one did, which is great. It is easy to understand for people who used the original version, and easy to port forward to (or even write a parser to automatically check for and forward port calls in most circumstances).
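The "write a parser to forward port calls" idea can be sketched with Python's standard `ast` module. This is only an illustration under some loud assumptions: it handles bare `find_prime(...)` calls (not `module.find_prime(...)`), it swaps only the unambiguous two-positional-argument form, and it uses `ast.unparse()`, which needs Python 3.9 or later and does not preserve comments or original formatting. The class and function names are invented.

```python
import ast

class SwapFindPrimeArgs(ast.NodeTransformer):
    """Rewrite old-style find_prime(max, min) calls as find_prime(min, max)."""

    def __init__(self):
        self.needs_review = []  # line numbers of calls left for a human

    def visit_Call(self, node):
        self.generic_visit(node)  # port any nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "find_prime":
            if len(node.args) == 2 and not node.keywords:
                node.args.reverse()
            elif node.args:
                # A lone positional or a positional/keyword mix is ambiguous
                # under the new signature, so flag it instead of guessing.
                self.needs_review.append(node.lineno)
        return node

def forward_port(source):
    """Return (ported_source, line_numbers_needing_review)."""
    tree = ast.parse(source)
    porter = SwapFindPrimeArgs()
    tree = porter.visit(tree)
    return ast.unparse(tree), porter.needs_review

old_code = (
    "a = find_prime(100, 30)\n"
    "b = find_prime(min=30, max=100)\n"
    "c = find_prime(100)\n"
)
ported, review = forward_port(old_code)
```

Keyword-only calls pass through untouched because they mean the same thing under both signatures, which is one small argument in favor of keyword arguments when an API is expected to change.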

This toy example above may be simple, but it represents how version numbers and program changes should be in tune. There are some caveats, however. Usually in version 0.x of a program, or especially version 0.0.x of a program the “0” actually means “all bets are off”, meaning that the architecture and project semantics are still under development and should not be relied upon as stable to write code or scripts against. Essentially, a project’s release of version 1.0 should be an implicit promise of interface stability, whether that means the API, user or other construct interfaces — and whether that interface represents input or output. This is particularly important in database driven applications, as major-version increases are the only possible chance a project has to change the data schema around. This last point is something worth pondering.

Now folks might wonder “what if I want to make a trivial change in a forward minor version X.Y++ but aspects of that change make sense to backport into the original X.Y that is used by many?” Look at how the Python project itself handled this sort of situation for the answer. Non-destructive improvements in Python 3 have been backported to Python 2.6 and 2.7 as sub-minor version increases. Some changes only made it into Python 2.7, however, and those changes and a few others actually define the differences between 2.6 and 2.7, and this makes perfect sense since they represent backwards-compatible API changes.

There are no hard and fast rules, however. If your project is widely used and users and dependent projects nearly always keep pace with the latest release *and* the minor-version releases are far enough between that porting isn’t an issue *and* a change that breaks backwards compatibility doesn’t affect a major feature (and this is where the situation can fall under much debate) then a transitional set of minor releases can, over time, bridge the gap and eventually introduce changes that break backwards compatibility. But this should be the exception, not routine practice. The Django project has this to say about API stability and version numbers.

The whole point is that your version numbers and what level they reside within the version number string should have a distinct meaning to your users and downstream projects without you having to spell everything out every time. Sure you write documentation, but a developer should generally be safe to look at code written against version X.Y.Z and know that building against X.Y.Z++ will still work out OK. Maybe not in every case, but this should be generally true. Users should also expect real changes between major versions, not just a few pixels moved around on the exact same program.
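The rules above are simple enough to write down as code. The following is a small sketch, not any packaging tool's real algorithm: it assumes plain three-part "X.Y.Z" strings, treats 0.x as "all bets are off", and the function names are made up.

```python
def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of three integers."""
    major, minor, sub_minor = (int(part) for part in text.split("."))
    return major, minor, sub_minor

def should_still_work(built_against, available):
    """Guess whether code written against one version can use another."""
    built = parse_version(built_against)
    have = parse_version(available)
    if built[0] == 0 or have[0] == 0:
        # 0.x promises nothing, so trust only the exact same release.
        return built == have
    if built[0] != have[0]:
        return False  # a major bump is allowed to break everything
    # Same major: an equal or later minor/sub-minor should be compatible.
    return have[1:] >= built[1:]
```

For example, `should_still_work("3.4.1", "3.5.0")` is true while `should_still_work("3.4.1", "4.0.0")` is false. A check this short is only possible when the project actually keeps the promises its version numbers imply.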

Anyway, even in open source you sometimes see projects breaking this rule, and usually it’s a sign that things will go downhill later on. For example, Firefox took over 7 years to go from version 0.x to 3.2.x. Then they decided that didn’t look cool enough because Chrome goes through a “major” version every 3 months, so Firefox should as well. So in one year they went from version 3.2.x to version 8 (or 10, depending on how you’re accounting for their initial “rapid release” year). And now in that project major versions don’t mean shit and the code base is going to crap because of it. Chrome’s code base is full of security problems and major code cancer (it’s like Google hired a bunch of noobs, seriously… that project won’t be sustainable at this rate) as well, and I suppose we don’t even need to get started talking about Internet Explorer. In short, the browser market is full of nutbags and I suppose in the browser world it’s OK to just be fucking retarded.

Contrast this with, say, the way just about any closed source company either just makes up version numbers, or goes as far as inventing incompatibility with its former self just to drive/force sales (and I’ll go ahead and point a finger directly at MS Office on this), and you’ll start wondering just how confusing it must be to work within a closed source project during its sustainment cycle. As a side note on closed source, though, the game industry usually has surprisingly well thought out version numbers — but I suppose that game sequels vs current game releases is what insulates them from that, and anyway the differences between sequels is a lot more obvious to a gamer than the difference between two versions of the same business software product in many cases.

A Note on “Web Applications”

Friday, June 1st, 2012

This is a subject worthy of a series of its own, but a short scolding will have to do for now…

The “Web” is not an applications environment. At all. Web security is a joke. The whole idea is a joke. That’s like trying to have “newspaper security” because the whole point of HTTP is untrusted, completely public, unthrottled publication of textual data and that’s it. Non-textual data is linked in, not even native within a document (ever heard of, say, an <img> tag or that whole series that starts <a something=””>?) and various programs that can use HTTP to read HTML and related markup can (or can’t) fetch extra stuff, but the text is the core of it. Always.

People get wrapped around the axle trying to develop applications that run within a browser and always hit funny walls when trying to deliver interactivity. We’ve got a whole constellation of backhacks to HTTP now that are used to pretend that a website can be an application — in fact at this point I think probably more time is spent working within this kludged edge case than any other specific area of computing, and that’s really saying something. It says a lot about the need for an applications protocol that really can do the things we wish the Web could do and it speaks volumes about how little computing is actually understood by the majority of people who call themselves developers today.

If “security” means being selective about who can see what, then you need to not be using the web at all. Websites are all-or-nothing, whether we delude ourselves that we can put layers of “security” backhacks over it to act against its nature or not.

If “security” means being selective about who can make changes to the data presented on a website, you need to make modifications to data by means other than web forms. And yes, I mean you should be doing the “C” “U” and “D” parts of CRUD from within a real program working over a protocol that is made for this purpose (and therefore inherently securable, not hacked up to be sort-of securable like HTTPS) and ditch your web admin interfaces*. That means the Web can be used as a presentation face for your data, as it was intended (well, dynamic pages is a bit of a hack of its own, but it doesn’t fundamentally alter the concept underlying the design of the protocols involved), and your data input and interactivity faces are just applications other than the browser. Think World of Warcraft, the game, the WoW Armory, and the WoW Forums. Very different faces to match very different use cases and security contexts, but all are connected by a unified data concept centered on character.

Piggybacking ssh is a snap and the forgotten ASN.1 protocol is actually easier to write against than JSON or XML-RPC, but you might have to write your own library handler, though this is true even for JSON as well, depending on your environment.

[*And a note here: the blog part of this site is currently running on WordPress, and that’s a conscious decision against having real security based on what I believe the use case to be and the likelihood of someone taking a hard look at screwing my site over. This approach requires moderation and a vigilant eye. My business servers don’t work this way at all and can serve pages built from DB data and serve static pages built from other data, but that’s it — there is no web interface that permits data alteration at all on those sites I consider to be actually important. Anyway, even our verbiage is screwed up these days. Think about how I just used the word “site”. (!= site server service application)]

I can hear you whining now, “But that requires writing real applications that can touch the database or file system or whatever and that’s *hard* and requires actual study! And maybe I’ll have to write an actual data model instead of rolling with whatever crap $ORM_FRAMEWORK spatters out and maybe that means I have to actually know something about databases and maybe that means I’ll discover that MySQL is not good at handling normalized data and that might lead me to use Postgres or DB2 on projects and management and marketing droids don’t understand things that aren’t extensively gushed over by the marketing flacks at EC trade shows and… NOOOoooo~~~!” But I would counter that all the layers of extra bullshit that are involved in running a public-facing “Web 2.0” site today are way more complicated, convoluted and extremely removed from a sane computing stack and therefore, at this point, much more involved, costly and “hard” to develop and maintain than real applications are.

In fact, since all of your “web applications” are built on top of a few basic parts, but in between most web developers use gigantic framework sets that are a lot larger and harder to learn and way shorter-lived than the underlying stack, your job security and applications security will both improve if you actually learn the underlying SQL, protocol workings, and interfaces involved in the utility constellation that comprises your “application” instead of just leaving everything up to the mystery of whatever today’s version of Rails, Dreamweaver or $PRODUCT is doing to/with your ideas.

I’m not bashing all frameworks; some of them are helpful, and anything that speeds proper development is probably a good thing (and “proper” is a deliberately huge and gray area here, which is why development is still more art than science and probably always will be). I am, however, bashing the ideas that:

1. The web is an applications development environment
2. Powertools can be anything other than necessarily complex

At this point the mastery burden for web development tools in most cases outstrips the mastery burden for developing normal applications. This means they defeat their own purpose, but people just don’t know that because so few have actually developed a real application or two and discovered how that works. Instead most have spent lots of time reading up on this or that web framework and showing off websites to friends and marks telling them they’re working on a “new app”. Web apps feel easier for reasons of familiarity, not because they are actually less complex.

Hmm… this is a good sentence. Meditate on the last sentence in the above paragraph.

The dangerously seductive thing about web development is that the web is very good at giving quick, shallow results measurable in pixels. The “shallow and quick” part being the dangerous bit and the “pixels” being the seductive bit. There is a reason that folks like Fred Brooks have insinuated that the drive to “just show pixels” is the bane of good programming and also called pixels and pretty screens “chicken lipstick”. Probably the biggest challenge to professional software development is the fact that if you’re really developing original works, screens are usually the last thing you have to show, after a huge amount of the work is already done. Sure, you can show how the program logic works and you can trace through routines in an interpreter if you’re using a script language somewhere in the mix like Python or Scheme, but that sort of progress, concrete as it is from a programmer’s perspective, doesn’t connect to most clients’ concerns or imaginations. And so we have this drive to show pixels for everything.

Fortunately I’m the boss where I work so I can get around this to some degree, but it’s extremely easy to see how this drive to show pixels starts dicking things up right from the beginning of a project. It’s very tempting to tell a client “yeah, give me a week and I’ll show you what things should look like”. Because that’s easy. It’s a completely baseless lie, but it’s easy. As a marketer, project manager, or anyone else who contacts clients directly, it is important to learn how to manage expectations while still keeping the client excited. A lot of that has to do with being blunt and honest about everything, and also explaining at least a little of how things actually work. That way instead of saying “I’ll show you how it will look” you can say “I’ll show you a cut-down of how it will work” and then go write a script that does basically what your client needs. You can train them to look for the right things from an interpreter, and by showing them a working model instead of just screens (which are completely functionless for all they know), as a developer you get a chance to do two awesome things at once: prototype and discover early on where your surprise problems are probably going to come from, and, if you work hard enough at it, write your “one to throw away” right out of the gate at customer sale prototyping time.

But web screens are still seductive, the only thing most people read books about these days (if they read at all) and the mid-term future of general computing looks every bit as stupid, shallow and rat-race driven as the Web itself.

I can’t wait to destroy all this.

Haha, just kidding… but seriously…