The Intellectual Wilderness There is nothing more useless than doing efficiently that which should not be done at all.

2012.11.19 14:37

ODFapper 0.04 for Django on Unix

Filed under: Computing — Tags: , , , , , — zxq9 @ 14:37

I’ve pulled the most common bits out of the Django views where I had rendered ODFs before and have the itty-bitty beginnings of an ODF handling/rendering library started now. “Library” is a bit of a stretch, since its only a few functions and a Bash script, but it abstracts the most mysterious/tedious parts of ODF handling already.

This is just a simple tarball right now. It unpacks like a Django app would, but only contains a copy of odfapper, and a templatetags/ But since we are doing template tag registration you need to include it in your INSTALLED_APPS in and add a new variable ODFAPPER_PATH which needs to be a string with the absolute path to /your/project/location/odfapper/odfapper. Importing into a (or wherever) that you want to render ODFs in is done with from odfapper.funcs import render_odf, and I go into a bit more detail in the README included in the tarball. At the moment this post and the stuff in README (which is just an expansion of internal notes) is all the documentation.

I’ve got a ton more work I’d like to do on this. If there is any interest I could put a GitHub repo up — but other (paying) work calls to which this isn’t central, so… let me know if this is a direction anyone wants to head in and I’ll keep it going. There are a bajillion tiny things that are not so hard to do that would make ODF handling enormously more intuitive.

Link to 0.04 archive: odfapper_django-0.04.bz2
Link to just the shell update: odfapper-0.04.bz2

It bears mentioning that this is tested against Django 1.4, but not 1.5 yet. Unless the template loader classes have changed a lot then this should still work fine anyway, though.

2012.08.14 05:52

Using Unmanaged Django Models of Postgres Views in a Pre-existing Schema

Filed under: Computing — Tags: , , , , , , — zxq9 @ 05:52

This discussion covers the way Django 1.4 and Postgres 9.1 work as of mid 2012. If you have trouble with this or very similar code let me know and we can post your examples.

Some of us like Python. Some of us are subjected to working with the Web, at least from time to time. Despite the fact that we would all love to write native applications against a sane API that we’re allowed to read in a single language in an environment we control with a single data handling layer, that’s just not always reality. We’d all love to commit to a solid data model that someone smart was permitted the time to work on before the FAD design crew got spun up. But sometimes we hit weird cases where there’s just not time or money or scheduling is stupid or there are unanticipated complications or the platform changes based on a buzzword or the situation doesn’t fit because you’re working behind a three-star programmer who quit last week. Ha ha, just kidding — I meant that’s all the time every time.

Since that’s the norm, I’m discussing using Django to make reporting faces for existing database applications. In particular database applications that rely on a non-trivial normalized or properly denormalized data model. A lot of folks dive straight to the “Hire a guy to build a Django/PHP/TG/Rails/whatever application with a separate database that syncs or queries-to-sync what it needs from the existing database” conclusion. This is wasteful since most of the time all we want is to display some tabular data along with some custom math on that data. Usually the data set is not enormous and going full-bore data warehouse prepared to carry billions of records in a large set of denormalized, parallel schema is way, way overkill.

So how do we do this with a web framework that uses an ORM (noooo~!) and expects a 1-to-1 mapping between classes and tables? We use views against unmanaged Django models.

When I say “views” I mean in the database sense, not the ubiquitous Django stuff. As in the good ole’ Postgres CREATE VIEW foo AS SELECT [stuff]. This could be materialized views, calculated views, a denormalized table triggered to always be current (see “materialized view”), etc. Database views, whatever the flavor, will serve our purposes.

A lot of folks get tripped up on this particular subject because of the way frameworks like Django and Rails nearly always rely on meaningless integers as primary keys by default, can’t handle multi-column natural keys and try to add an integer “ID” to any model even if its not the primary key and therefore completely superfluous. Any non-trivial, normalized data model won’t have integer primary keys and any DBA who administers such a data store won’t add them to satisfy your stupid web framework. So forget about adding integer primary keys, we’re not going to use them and adding them would be an exercise worthy of something the government would pay for. But that means that until true support for natural, multi-column keys in Django gets finalized we need a new approach (and anyone stuck working with Django at least up to 1.5 will need a different approach anyway). No worries, read on.

We’re going to create a view against an unmanaged Django model class, and feed Django a dummy “id” column to keep it quiet. Of course, we have to remember that the unmanaged model isn’t safe to use as a target for models.ForeignKey(), but as we’ll see this is a pretty simple limitation to work around since any existing data model will already have pre-defined primary keys. This is workable even if we need to treat the view as a real model in the future.

Note: this is a toy example for necessary reasons. I’m only writing this to illustrate a point, not to write a full-bore business application on my blog. That said, feel free to use these ideas or any code presented here in any way you like.

So let’s get a few tables down. Let’s say there is a cost/price margin history schema in a database that various applications use to keep track of how much stuff costs today and how much it used to cost at a given point in time. Let’s say also that some of the items in the list are for sale and the history of the sale price needs to be tracked in a similar manner to the costs of things. Since a lot of applications access the data store and many of them are designed either strictly for sales/ordering-only or cost/inventory-only the cost and price histories are completely independent. Different applications can already tell the managers and analysts what the margin is right now, but all this historical data is difficult to compare over time because its tedious for humans to run the comparisons by hand, and a real pain to keep straight even in a spreadsheet application.

Sounds like a perfect job for a computer! Further, since this is tabular data display and essentialy a document-export function it is a perfect fit for something that is accessible from a browser.

So far we’ve got two tables on the costing/inventory side: inventory.item and inventory.cost_history. They look like this (note, this isn’t quite the way psql will display the table data, but just go with it):

                Table "inventory.item"
     Column     |       Type         |    Modifiers
 nsn            | varchar(50)        | primary key
 is_active      | boolean            | not null

                         Table "inventory.cost_history"
     Column     |       Type         |              Modifiers
 item           | varchar(50)        | foreign key (item.nsn), unique together with start_date
 start_date     | timestamp          | primary key with "item"
 end_date       | timestamp          |
 value          | money              | not null

Over on the product/pricing side we have two more tables to compliment those above: sales.product and sales.price_history. They look like this:

                Table "sales.product"
     Column     |       Type         |    Modifiers
 item           | varchar(50)        | foreign key (item.nsn), unique
 sku            | varchar(30)        | primary key
 name           | varchar(100)       | not null
 is_active      | boolean            | not null

                         Table "sales.price_history"
     Column     |       Type         |              Modifiers
 product        | varchar(50)        | foreign key (sales.sku), unique together with start_date
 start_date     | timestamp          | primary key with "product"
 end_date       | timestamp          |
 value          | money              | not null

Enormously simple — to the point that if I hadn’t explained the scenario you’d wonder why we even have a “sales.product” table. But let’s just accept that not every item in the company’s inventory is for sale, and those that are need to be treated differently so tying everything to inventory.item records isn’t a good idea. (It bears mentioning here that sales.price_history.product could just as easily point to inventory.item.nsn as sales.product.sku: natural connectivity with the “real” NSN vs more frequent reference by/to SKU from application queries.)

So this is the existing schema we’re hoping to write a small dashboard-ish web thingy in Django against without modification. How do we do it without driving the DBA nuts, making one query per iteration, burying a lot of huge raw() queries in our application code, or forcing things into some nooberiffic “One Class, to rule them all” paradigm? First we need something Django can hold on to deal with on its own terms — something that behaves enough like a real table to pass a sort of duck check.

We need a concept of “things for sale”. We don’t care about the items not for sale in this case, because the whole point is demonstrating a margin history — so we can, for the purposes of our dashboard — lump sales.product and inventory.item together:

class ProductRecord(models.Model):
    nsn       = models.CharField('NSN', max_length=50)
    sku       = models.CharField('SKU', max_length=30)
    name      = models.CharField(_('Name'), max_length=100)
    is_active = models.BooleanField(_('Active'))

    class Meta:
        db_table = 'sales\".\"product_record'
        managed = False

Pretty predictable so far. But it could use a better name than sales.product_record. I suck at naming things without thinking a bit first. With that in mind, note that I named the class “ProductRecord” to stay in tune with the db_table value. At the moment we’re just writing a dashboard, but if that requirement grows later on you’d hate to remember that in every other app “Product” refers to sales.product but in this app its something different because you already used the name “Product” for something else. We’ve lumped together a lot of pretty obvious data in one place from sales.product and inventory.item. Now let’s lump the money stuff together:

class ProductValueHistory(models.Model):
    sku        = models.CharField('SKU', max_length=30)
    cost       = models.DecimalField(_('Cost'), max_digits=15, decimal_places=5)
    price      = models.DecimalField(_('Price'), max_digits=15, decimal_places=5)
    start_date = models.DateTimeField(_('Start Date'))
    end_date   = models.DateTimeField(_('End Date'))

    class Meta:
        db_table = 'sales\".\"product_value_history'
        managed = False

Also fairly predictable. (The DecimalField thing and fixed-point math for money handling is a subject for another time.) You probably noticed the lack of a Django models.ForeignKey on both of these models. We don’t have them because we don’t have anything to tie them to that Django can understand because of that silly litter integer-as-a-primary-key assumption that nearly every ORM seems to universally make. We could add it, but that would require altering the original tables to accommodate this dashboard, and that goes against the principles of being a good guest in someone else’s digital house. Besides, a lot of applications access this data store — doing anything that changes things up could have ripple down effects all over the place; much better to forget about all that mess. Never let your tools drive your design — otherwise you’re being a tool for the sake of a cosmic “in Soviet Russia…” joke.

We could declare something else to be the primary key, but that would only affect how Django would generate SQL table creation code when running syncdb, and since we’re never going to run that on an unmanaged model and Django tries to add an integer ID to everything whether or not you’ve already got a natural primary key defined, that’s pointless

Now lets write our views for the database. This works in Postgres 9.1. I don’t know about other databases — and if you’re doing this in MySQL you probably don’t want to do things this way (hint: you should migrate):

CREATE VIEW sales.product_record AS
      row_number() OVER (ORDER BY AS id,
      i.nsn AS nsn, s.product AS sku, AS name, s.is_active AS is_active
      inventory.item AS i, sales.product AS s
      i.nsn = s.item;

Now we’ve got a “fake” table in the database that Django thinks is a real one. It even has an “id” column generated for us by the row_number() window function. This is silly, and only present to fool Django into accepting our model, but compared to how a lot of other ORMs work, this is a pretty small ORM tax to pay. The price goes up a little if we want to be allowed to do insertions and modifications from the Django class to this view instead of the real tables (we’d have to write rules), but even that isn’t so hard.

So now let’s get what we want out of the price/cost history combination. I’m not 100% happy with the way this query turns out, to be honest (partly because I deliberately made this situation a little complex by not making sales.price_history reference the inventory.item.nsn at all, so we require an extra join), but it does work fine — and there is a way around even the inefficiency in my (probably bad) SQL view code here:

CREATE VIEW sales.product_value_history AS
      row_number() OVER (ORDER BY p.start_date DESC, c.start_date DESC) AS id,
      p.sku AS sku,
        WHEN p.start_date < c.start_date
          THEN p.start_date
          ELSE c.start_date
        AS start_date,
        WHEN p.end_date < c.end_date
          THEN p.end_date
          ELSE c.end_date
        AS end_date,
      c.value AS cost,
      p.value AS price
      sales.price_history AS p,
      ( SELECT
            product.sku AS sku,
            cost.value AS value,
            cost.start_date AS start_date,
            cost.end_date AS end_date
          FROM sales.product AS product, inventory.cost_history AS cost
          WHERE product.item = cost.item) AS c
        p.product = c.sku
        (   (p.start_date, p.end_date)
            (c.start_date, c.end_date)
          (p.end_date IS NULL AND c.end_date IS NULL));

The query represented by this view goes a touch beyond what the Django ORM provides access to, but isn’t that hard to understand. The subquery where we join sales.product and inventory.cost_history is the link that provides us the necessary connection between a product’s SKU and its parent item’s NSN and returns a table called c. Joining that to the sales.price_history table for matching SKUs gives us all the costs and prices associated with a given product (and no results for items in inventory that are not for sale), and the extra WHERE clause using OVERLAPS lines up our price histories so we don’t have NULL-value gaps across spans of time when either the cost or price changed but the other didn’t.

We did the same “fake id” trick in this query using the row_number() window function so we can use Django’s ORM to pull results from this table like any other model. Because the window function already sorted the results in descending order, we don’t need to sort the results to know they are in chronological order.

Now where to put this bit of SQL? Of course, check the Django docs, but as of Django 1.4 the .sql files should go in a directory located at project/app/sql/ . Once there it should execute when you run syncdb — and if it doesn’t or you need to re-do something manually you can invoke it from within psql quite easily by doing \i /path/to/project/app/sql/filename.sql. (If you do it manually from within psql, remember to ALTER VIEW view_name OWNER TO django_role_name or Django probably won’t have permission to query it.)

So how do we use it to get a history for a specific product? Since we don’t have any primary key/foreign key relations set up in Django, we can’t just do:

product = ProductRecord.objects.get(id=number)
value_history = product.price_history.all()

This is a limitation many ORMs have, but we can work around it easily enough since the database itself has other primary keys that are guaranteed to be unique and therefore safe to use with object.get() or get_object_or_404():

product = ProductRecord.objects.get(nsn=value)
value_history = ProductValueHistory.objects.filter(sku=product.sku)

The results are already sorted so we can also do the following without any order_by() clause in Django:

current_value = ProductValueHistory.objets.filter(sku=product.sku)[0]

This gets us right up to the point of needing to write the math in to actually get margin totals and calculate percentages and make pie charts and all the other stuff business types like to do with this kind of data. At this point you can do all that in your Django view in Python (very easy and comfortable) or write a few more columns into the views that do this processing for you before it ever leaves the database, or even write entirely new views and new models that do whatever it is that you want.

Faced with the options above, when there is no obvious right answer I prefer to put it into the database as calculated columns or new views and get a complete answer per query instead of processing raw data in the application code. I (usually) take this approach because processing in the application makes that logic unavailable to any other applications that access the same data store which might want the same information in the future, thus leaving the writer of those other applications (probably me) with nothing left but to reinvent the wheel to get the exact same result (and looking ahead at maintenance centralizing logic is always better, whether in a library, database or whatever). Another good reason I’ve found to do things this way is to avoid accidentally writing an iterative processing routine at the application level that calls the database on each iteration of an arbitrarily large loop (I know you think “Yikes!” but people do this all over the place without even realizing it!).

But this view would be rather inefficient on very large rows of tables because its not eliminating rows based on an indexed item before it does the join processing and also because the OVERLAPS bit is pretty compute intensive on large sets. Indexes on the view can mitigate that to some degree, but there will come a point when materialized views/denormalized tables trump having faster hardware.

I’m constantly amazed at how fast hardware is these days and how much Postgres performance increases with each release, but that’s no excuse for a rather blatant query inefficiency that can be fixed easily. Its also absolutely no help for people stuck using Postgres < 7.1 or the legions of people stuck with MySQL or the poor fools stuck using a crap, deliberately crippled “home” or “small business” version of a proprietary database.

There are two possible ways out of this. You can write a raw() SQL query into Django (sometimes the easy answer), or you can make sales.product_value_history into a real table in the database that updates itself whenever the sales.price_history or inventory.cost_history tables are modified.

The second idea is the most interesting and involves a deliberate denormalization, which in this case I think is probably appropriate. This is called making a “materialized view”. Its available as a native feature in DB2 and Oracle, but not in Postgres just yet (I think 9.2 or 9.3, probably 2013 or 2014). However, in Postgres we can write triggers which keep our price/cost history table updated automatically whenever either of the sponsoring tables is modified. This ability is why while DBAs love conveniences like built-in materialized views features like this tend to take a lower priority than the serious stuff like JOIN efficiency, window functions and query optimizations. A word of caution: it is pretty common for folks whose first experience with databases like Postgres was through an ORM framework to try keeping tables synced by writing routines in at the application level — but this is a bad idea and defeats the purpose of using the database as a layer of abstraction. Leaky abstractions suck the further you travel with them and always remind me of the Turkish proverb “No matter how far you’ve gone down the road, turn back.”

I can feel your boredom burning into my fingers through a quantum time warp in the net, so I’ll end this here. Yes, I’m leaving the materialized views implementation as a loose end (there are great resources on the net for this), but the main point was how to put Django in meaningful touch with a database schema that is both non-trivial and doesn’t use arbitrary integer values as primary keys.

2012.07.31 02:00

Object-Relation Mismatch: Comparing Strawberries and Sunglasses

I’ve been spending a lot of time lately writing a rather large suite of business applications. The original customer was a construction company which needed a replacement for their estimation system. Then the same customer needed a facility pass management system to make the insane amount of bit-shoveling/paperwork involved in getting security clearances for workers to perform work in secure sites. Then a human resources system. Then a subcontract management system. Then a project scheduling system. Then an invoicing system.

The point here is, they really liked my initial work, and suddenly I got further orders. Pretty soon after discussing the first few add-on requirements with the customer it became apparent that I was either going to be writing a bunch of independent systems that would eventually have to learn how to talk to each other, or a modular system that covered down on office work as much as possible and could pull data from associated modules as necessary (but by the strictest definition is not an “expert” or ERP system — note that “ERP” is now a buzzword void of meaning just like “cloud”). Obviously, a modular design is the preferred way to go here, and what that costs me in effort making sure that dependencies don’t become big globby cancer balls buys me enormous gains selling the same work, reconfigured, to other customers later and makes it really easy to quickly write add-ons to fill further needs from the same customer.

Typical story. But how am I doing it and what does this have to do with the Dreaded Object-Relation “Impedance” Mismatch? Tools, man, tools. Most of the things I wrote in the past were system level utilities, subsystems, security toys, games, one-off utilities for myself to make my previous office work disappear[1], patches to my own systems, and other odds and ends. I’d never sat down and written a huge system to automate away someone else’s problems, though it turns out this is a lot more fun than you might expect provided you actually take the time to grasp what the customers need beyond what they have the presence of mind to actually say. (And this last point is worthy of an entire series of books no one will ever pay me to write.)

And so tools. I looked around and found a sweet toolkit for ERP called Tryton. I tried it out. Its pretty cool, but the biggest stepping stones Tryton gives you out of the box are a bunch of pre-defined models. That’s fine, at first, but they are almost exclusively based on non-normalized (as opposed to denormalized) data models. This looked good going in, but turned out to suck horribly as time passed.

Almost all of the problems ultimately trace back to the loose way in which the term “model” is used in ORM. “Model” means both the object definitions and the tables that feed them[2]. Which is downright mad because object member variables are not table columns, not by a mile, and tables can’t do things. This leads to a lot of wacky stuff.

Sometimes you can’t tell if it makes sense to add a method to a model, or to write a function and call it with arguments because what you’re trying to do isn’t an inherent function of the modeled concept itself (and if you’ve been conned into using Java life sucks even more because this decision has already been made for you regardless your situation). And then later you forget some of the things you wrote (or rather, where they are located, which equates to forgetting how to call them) because it was never clear from the outset what should be a function, what should be a method, and what is data and what is an object. This makes it unclear what should be an inherited behavior and what should be part of a library (and I’ll avoid ranting about the pointlessness of libraries of Java/struct-based objects). And this all because we love OOP so much that we’re willing to overlook the obvious fact that business rules are actually all about data and processes, not about objects and methods at the expense of sane project semantics.

In other words, business rules are about data, most easily conceptualized as nouns, and not really about verbs, most easily conceptualized as functions (and this is the beginning of why using OOP for things other than interface and simulation design is stupid — because its impossible to properly subordinate verbs to nouns or vice versa).

Beginning with this conceptual problem you start running into all sorts of weirdness which principally revolves around the related problem that every ORM-based business handling system out there tries to force data into a highly un-normalized data model. I think this is in an effort to make business data modeling “easy”, but it results in conscious efforts by framework designers to prevent their users (the application developers) from ever touching or knowing about SQL. To do that, though, it is necessary to make every part of data constraint, validation, verification, consistency, integrity (even referential integrity!), etc. into methods and functions and processes which live in the application. Instead of building on the fascinating advancements that have been made in data rule systems this approach deliberately tosses them aside and reinvents the wheel, but much worse. This relegates the database itself to actually just being a million-dollar file system[3].

For example, starting out with the estimation stuff wasn’t too hard, and Tryton has a fairly easy-to-use set of invoicing, receiving, accounting and tax configuration modules you can stack on to get some sweet functionality for free. It also has a customer management model and a generalized personal information manager that is supposed to form the basis for human resources management stuff you can build yourself. So this is great, right?

Wrong. Not just wrong because of the non-normalized data, I’ll get to that in a moment, but primarily wrong because nearly everything in the system attempts to be object oriented and real data just doesn’t work that way at all. I didn’t realize this at first, being inexperienced with business applications development. At first I thought, “Ah, so if we make our person model based off the person-party-address chain we can save a lot of time writing by simply spending time understanding what’s already here”. That sort of worked out. Until the pass management request came in. (That basing the estimation module off of the existing sales/orders/invoices chain would be a ridiculous prospect was a far less obvious problem.)

Now I had a new problem. Party objects are one table in the database, people objects are a child class in the application that inherits Party but is represented in the database as a separate table that doesn’t inherit the party one (but has a pass-up key instead to make the framework portable to database backends that don’t support inheritance or other useful features — more on that mess later) and addresses are represented in the database as being a child table to the party table, but as independent objects within the OO system at the application server level.

Still doesn’t sound horrible, accept that it requires a lot of gymnastics to do handle security checks and passes this way. In particular getting security clearances for workers involves explaining two things in excruciating detail: family relationships and address histories.

The first problem has absolutely no parallel in Tryton, so writing my own solution was the only way to proceed. This actually turned out to be easier than tackling the second problem, specifically because it let me write a data model first that was unencumbered by any design assumptions inherent in the system (other than fighting with the basic OOP one-table-per-model silliness). What was really required was to understand what constitutes a family. You can’t adopt a sibling, but a parent can adopt you, and reproduction is what makes people to begin with which requires a M+F pair, and we need an extra slot each direction for adoption/step relationships. So every person who shares a parent with you is a sibling. Label them based on sex and distance and viola! we’ve got a self-mapping family model. Cake. Oh wait, that’s only cake in SQL. Its actually really, really ugly to do that from within OOP ORM code. But enough about families. That was the easy part.

Addresses were way more problematic. Most software systems written in Western languages were developed in (surprise!) the West. The addressing systems in the West vary greatly and dealing with this variance is a major PITA, so most software is written to just completely ignore the interesting problem worth solving and instead pretend that addresses are always just three text strings (usually called something like “address_1”, “address_2” and “postal_code”). In line with the trend of ignoring the nature of the data that’s being dealt with, most personnel/party data management models plop the three address elements directly into the “person” (or “party” or “partner”, etc.) table directly. This is what Tryton does.

But there’s a bunch of problems here.

For one we’ve completely removed any chance of a person or party having two addresses without either adding more columns (the totally stupid, but most common approach) or adding a separate table and letting our existing columns wither on the vine. “Why not remove them?” — because removing columns in a pre-fab OOP ORM can have weird ripple effects because other objects expect the availability of those member variables on the person or party objects and the interface bits usually rely on the availability of related objects methods, etc.

Another problem is that such designs train users wrong by teaching them that whenever a person changes addresses the world actually changed as well and the right thing to do is erase the old data and replace it with something new. Which is crazy — because the old address is still the correct label for a location that didn’t move in the real world and so erasing it doesn’t mirror reality at all.

And the last statement above reveals the root problem: this isn’t how addresses really work at all. Addresses are limited in scope by time. A person or party occupies a location for a time, but the location was already there — so we need a start/end time span, not just a record linking a party and an address. Moving further, addresses are merely labels for locations. So we need a model of locations — which should be hierarchal and boundless, because that’s how real locations are. Then we need an address builder on top of that that can assemble an address by walking up the chain. This solves a ton of problems — for one we don’t have to care if a building has a street number or even a street at all (in Japan, for example, we don’t have street names, we have nested blocks of zones that define our address elements). It also solves the translation problem — which is really important for me again here because in English addresses are written from smallest element to largest, and in Japanese they are written from largest to smallest. But these representations are not the locations nor are they actually the addresses themselves — they are merely different forms of notation for the same address.

So all this stuff above is totally ignored by the typical software model of addressing — which really puts a kink in any prospect of working within the existing framework to write a background information check and pass management system. These kinds of incomplete conceptual assumptions pervade every framework I’ve dealt with, not just Tryton and make life within OOP ORM frameworks very difficult when you need to do something that the original authors didn’t think about.

This article is about mismatches, so I’ll point out that the obvious one we’re already overlooking is that the data doesn’t match reality — or come even close. And we’re only talking about addresses. This goes beyond the Object-Relation Mismatch — its the Data-Reality Mismatch. It just so happens that the Object-Relation Mismatch greatly enables the naive coder in creating ever deeper Data-Reality mismatches.

Given the way addresses are handled in most software systems we have a new data input and verification problem. With no concept of locations there is no way to let someone who is doing input link parties to common addresses. This is stupid for a lot of reasons, for one thing consider how much easier it is for a user to trace down an existing location tree until they get to a level that doesn’t exist in the database yet and then input just the new parts rather than typing in whole addresses each time.

“But typing addresses is easy!” you say. Not true. We have to track four different scripts per address element (Latin, two forms of kana, and kanji) and they all will have to come out the same way every time for the police computers to accept them. One of the core problems here is validating that person A’s address #2 which extends from the same dates as person B’s (his brother) address #4 which spans the same dates is the same in all details so that the police background checker won’t spit out an error (because they already have this data so yours had better be right). Trusting that every user is always going to input the exact same long address string all four times and never make a mistake is ridiculous. Its even more stupid when you consider that they are referencing the same places in the real world against data you already have so why on earth wouldn’t your software system just let them link to existing data rather than force them to enter unique, error-prone new stuff?

So assuming that you do the right thing and create a real data model in your database where locations are part of a tree structure and address assembled strings linked against locations and have a time reference, etc. how does all this manifest in the object code? Not at all the way that they present in the database. Consider trying to define a person’s “current address”.

There are two naive ways to do this and two right ways to do this. The most common stupid approach is to just put a boolean on it “is_current” or something similar and call it good. The other stupid way to do it is to present any NULL end dates as “current” and call it good. But what about the fact that NULL is supposed to mean “unknown” — which would most likely be the case at least some of the time anyway and therefore an accurate representation of known fact. And even more interestingly, how do we declare that a person can only have one current address? Without a programmatic rule you can’t, because making the “is_current” boolean a UNIQUE means that a person can’t have more than one false value, either, which means they can only ever have a current and a not current address (just two) and this is silly. Removing the constraint means that either the client code (really stupid) or a database trigger (sort of stupid) should check for and reject any more than a single true value per person.

The better way to handle this is to have an independent “current address” table where the foreign key to person or party is UNIQUE and a separate “address” table where you dump anything that isn’t current. This has the advantage of automatic partitioning — since you will almost never refer to old addresses anyway, you can get snappy responses to current address queries because the current address table is only as large as your person table. The other right way to do this is to create a “current address” table that doesn’t contain any address data at all but rather just a unique reference to a party and a (not unique) reference to an address. This approach is the easiest to retro-fit onto an existing schema and is probably the right solution for a datastore that isn’t going to get more than a million addresses to store anyway.

But wait… you can’t really do that in an ORM. I mean, you can make an ORM play along with the idea, but you can’t actually create this idea in a simple way from within ORM code, and from OOP ORM code it is really a much huger PITA to coerce the database into giving you what you want than just writing your tables and rules in SQL yourself and some views to massage them into a complete answer for easy coexistence with an ORM. In particular, its easiest to have the objects actually have an “is_current” boolean and the database just lie to the ORM and tell it that this is the case on the database end as well. Without knowing anything about how databases work, though, you’d never know that this is the right way to do things, and you’d never know that the ORM is actually obstructing you from doing a good job at data modeling instead of enabling you to do a good job.

So here’s another mismatch: good data design predicts that objects are inherited one way in Python and the tables follow a significantly different schema in the database. Other than the problem above (which is really a problem of forcing addresses to be children of parties/people and not children of a separate concept of location as we have it in the real world) the object/relation weirdness creates a lot of situations where you’re trying to query something that is conceptually simple, but winds up requiring a lot of looping or conditional logic in the application to sort things out.

As for the looping — here be dragons. If you just trust the ORM completely each iteration may well involve one query, which is really silly once you think about it. And if you do think about it (I did) you’ll write a larger query domain initially and loop over that in the application and save yourself a bunch of round trips. But either way this is silly, because isn’t SQL itself designed to be a language that permits the asking of detailed data questions in the first place? Why am I doing this stuff in Python (or Ruby or Lisp or Haskell or whatever)?

But I digress. Let me briefly return to the fact that the tables are inherited one way and the objects another. The primary database used for Tryton is Postgres. This is a great choice. That shows that somebody thought about things before pulling the trigger. Tryton was rewritten from old TinyERP/OpenERP (the word “open” here is misleading, by the way — OpenERP’s terms don’t come close to adhering to the OSS guidelines whereas TinyERP actually did, or was very close) and the main project leader spent a lot of time cleaning out funky cruft — another great sign. But somewhere in there a heavy impulse to be “database agnostic” or “portable” or some other dreamy urge got in there and screwed things up.

See, Tryton supports MySQL and a few other database systems besides that don’t have a very complete feature set. What this means is that to make the ORM-generated SQL Postgres uses similar to the ORM-generated SQL that MySQL uses you have to settle for the lowest-common feature set between the two. So any given cool feature that you could really benefit from in one that doesn’t exist in the other must be ditched for all database backend code or else maintaining the ORM becomes a nightmare.

This means that each time you say you want your framework to be “portable” across databases you are ditching every advanced feature that one system has got that any of the others don’t, resulting in least-common-denominator system design. So every benefit to using Postgres is gone. Poof. Every detriment to using a fast, naive system like MySQL is inherited. Every benefit to a fast, naive system like MySQL is also gone, because nothing is actually written against the retrieval speed optimizations built into that system at the expense of losing all the Big Kid features in a system like Postgres. Given this environment, paying enormous fees for Oracle isn’t just stupid because Postgres can very nearly match it anyway — its doubly stupid because you’re not even going to use any cool features that any database provides anyway if you write “database agnostic” framework code.

I had many a shitty epiphany over time as I learned more about data storage concepts in general, relational database systems in particular, and Postgres, Oracle, DB2 and MySQL specifically. (And in that process I grew to love Postgres and generally like DB2.)

So there is a lesson here not related directly to the OOP/relational theme, but worth stating in a general way because its important to nearly all software projects that depend on some infrastructure piece external to the project itself:

Pick a winner. If someone else in your project wants to use systemX because they like it, they can spend time making the ORM code work, but that should be an extension to the subsystem list, not a guarantee of the project because you’ve got more important things to do. This could be MySQL vs Postgres or Windows vs Linux. It doesn’t matter — pick one and specialize. Even better, pick whichever one gives the biggest boost to a specific layer of your application stack and use that there.

So far the above thinking has had me settling more on Postgres over anything else and more on Qt at the application level than anything else.

Back to my story. The addressing thing introduced enough problems that I eventually had to ditch it entirely and write my own that was based on normalized location data that carried natural data (parent-child relationships within the hierarchy of physical locations) with an address table that carried human-invented administrative data about those locations (if they have a postal code, and other trivia) and a junction table that connects parties (people or organizations) to those locations via the addresses and carries timeline and other data.

When I did this and mentioned it to some other Tryton folks they flipped out. Not because I had done this in the core project — no, this was my own substitute module — but because:

  1. I had written SQL, and not just dabbled in some CREATE TABLE statements
  2. I had normalized the data model (well, a very small part of it)

I wrote the SQL to carry the definitions where the ORM just didn’t have a way to express what I wanted (or was really funky to grok once it was written). Apparently this was a big taboo in ORM Land, though I didn’t know that going in. SQL seems to have this forbidden quality that excites as much as it instills fear these days, but I have no idea why. Again, I’m a n00b, so maybe I just don’t get why ORM is so much better. Also, mind you, there was no hostility from anyone, just shock and some sense of the aghast query “what have you done?” (The Tryton community is actually a very warm place to play around and the project leader is enormously helpful, and despite me being an American (and a Texan, no less!) living in Japan and them all snooty Euro types, we got along swell. If any FOSS ERP system has some glimmer of hope as of July 2012 its Tryton.)

Writing SQL deeper than a raw() query here and there is one thing, but normalizing the data model is something altogether on a different plane of foul according to the rites of the Holy ORM. I was continually told that this would hurt me in the future if I continued on with Tryton. But on the other hand, they weren’t looking at the small mountain of application code I would need to maintain and forward port forever to get around the non-normalized data issue(s). And anyway, once you normalize data all the way, you don’t normalize it further. There actually is a conclusion to that exercise. I’ve found that my normalized data models tend to endure and changes wind up being modified by additions instead of the painful process of moving things around (and this still seems mysteriously, wonderfully magical and relieving to me — but probably because I’m not actually educated in relational algebra and so can’t see the underlying sense to why normalized data is so easy to extend (I mean, conceptually its obvious, but how, precisely?)).

Their arguments about “the future” disregarded the application layer entirely because they were only thinking about Tryton, but for me it wasn’t just one place where non-normalized data started hurting me (it also disregarded that this predicted that I’d wind up leaving Tryton). The original concept for the estimation program didn’t really jibe with the way that a(nother) very obvious customer need could be served by putting meaningful links between what was contained in CAD files, what existed in the product database, and how the units of measure get computed among them. This meant that my real need wasn’t a single application as much as it was a single data store that remained coherent regardless what application happened to be talking to it at the time (I’m not even going to get into security in this post, but that is another thing that is enormously simplified by submitting to The Postgres Way instead of resisting).

And this brings me to another problem — in fact, the real kicker. I started realizing as I wrote all these things that while the Tryton client program is pretty slick, its not the end of the road to handle all needs. For one things a lot of it involves writing screens in XML. Yuk. That’s about as annoying as it gets, and I’ll leave that there. But most importantly there was no way I was ever going to be able to port the Tryton client to, say, Android (and maintain it) or embed the CAD programs we’re using (one easy to port C++/Qt program, and one black-box Windows contraption we run in Wine that is currently a must-have) and make things run smoothly. I was also going to have my work cut out for me if I wanted to use the same data store to start doing things like drive dashboard snapshot reporting over http or especially provide some CRUD capabilities over the Web for guys out of the office (and this issue goes all the way to the security model here as well).

Anyway, long(er) story short, Tryton just didn’t meet my needs going forward. I could have forced it to fit at a greater cost in time than I am willing to pay, but it just wasn’t a total fit for my needs, and part of that was the way that data in objects don’t really jibe with how data in the real world works.

But the fact that I could code this stuff up in SQL in a sane way without any magic intrigued me. Greatly. The bit that I did with the addresses made so much sense compared to every other model I’ve seen for addresses that I couldn’t ignore it. In reality people move, but locations stay right where they are. Address definitions might change, but this is an administrative concern which leaves a historical record. My model perfectly captures this and permits now the asking of questions both about the location, about the parties which were involved with the location, and even about the administrative situation surrounding the location over time (and that questions of proximity are easily answered as well and nest cleanly with PostGIS extensions is magical and worth noting). All without a long string of crazy dot-joined noSQL stuff going on and all without a single null value stored anywhere. It was really easy to see that this made sense. Beyond that, I didn’t have a bunch of meta data that documented my code, which should be incidental, but instead just a hard definition of how my data should look. From there I could do whatever I wanted in whatever application I wanted. Having a truly sane data model started making so much sense to me that I tried writing a few different applications on top of the same data model as an experiment. And it worked amazingly well.

Writing a PyQt application, for example, I can just ask the database for some information against a view. I can get the query back as a dictionary or a list or whatever I want and display it or manipulate it any way I want. Doing it from a Django web face is pretty easy to. Actually, really easy. Django has an ORM that can make life easier if you ditch the “this class is a table” idea and make them all unmanaged models which actually just call views in the database. Its even easier overall if they are the exact same views that your other applications call (or not, but usually this winds up being a useful situation). If you remember to not do any processing in the application code, but instead have the database build your view for you and just let Django be the way it gets into a web page then you’re really cooking with gas and can safely take advantage of all the automatic stuff Django can do. (Or even better than web pages, use Django to render OpenDocument files for you, which turns out to be a super easy easy way to woo your customers because its so much more useful than generating web pages. I should probably write a post about how to do this later because its just that cool.) Its even more cool to do this from Snap than Django — but that’s a whole ‘nother story.

This was just retrieving data, though. I got curious about entering data. And its really easy as well. But it involves a few extra things. Like careful definitions of the data model (ensure actual normalization, which is sometimes surprisingly counter-intuitive in how intuitive it is), multi-column unique constraints, check constraints, really understanding what a foreign-key is for, etc. all while still leaving room for a (now otherwise meaningless) numeric ID column for frameworks that may require it — and this whole numeric-keys-for-everything bit will seem more weird the longer you spend dealing with solid data models.

Basically, use all the tools in the Postgres bag and your life will get easier. And that’s actually not hard at all. The Postgres feature list (even the DB2 feature list) is pretty small compared to the vastness of the entire Python API coupled with the combined might (and confusion, usually) of whatever framework(s) you’re writing around. Doing it right also requires that you learn how to handle the various exceptions that the database will throw back at you as a result of your constraints and rules and things you’ve put in the database. But this makes programming the application layer really easy. Like incredibly easy. And anyway, learning how to handle a single set of database exceptions is a lot easier than trying to remember every stupid little exception condition your framework can produce multiplied by the number of frameworks you have.

And this is what is solving my core problem. I’m discovering that not only is SQL pretty darn easy, but that it solves my core business logic problems without actually writing any business logic. I think this is what the relation guys at IBM knew they were on to decades ago when they thought this idea up in the first place.

Consider the “current address” issue above. I didn’t use booleans, logical processes or any other trick to figure out whether an address was current or not, nor did I have to write a special rule that states that a person can only have a single current address at once but any arbitrary number of non-current addresses, nor did I have to write a single spot of application code. The problem is solved by the structure of the data alone — which is always the most efficient solution since it involves zero processing.

THis blows all that “use this framework to build your apps in 5 easy steps with Rails!” bullshit away. But I am a little put out that the concepts themselves don’t have more support within the trendier parts of the software development world. It seems everyone is jumping on the out of control bandwagon that marketers overloaded with Java Beans and Hopes and Dreams all those years ago and sent tumbling down the hill. Its like the Obama campaign infected the software industry (because he totally earned that Nobel Prize and Hawking doesn’t deserve one). Its still rocketing down the hill, distracting faculty, investors, budding programmers and the marketing world almost completely. Its really amazing. I am a little upset that discovering a really sane way to manage data was so hard and took so long among the enormous volume of siren screams and other noise on the wire in the development community. Of course, now that I know what I’m looking for locating good discussions and resources isn’t that hard — though it is a little odd to note that the copyright dates on most of them predate my own existence.

So now, as I convert a mishmash of previously written independent application models into a central data concept I am finding something amazing: I haven’t found a single business rule yet that isn’t actually easier to express in terms of data structure than it is to put in application code. I’m also finding that importing the data from the (now legacy) application databases is also usually not that hard, either, but requires more mental effort than anything else on my plate now.

Most amazing of all is the ease of writing application code. Even if I’m writing one application in C++/Qt, another in PyQt, another in Django, another in CL and another in Haskell that run variously across the spectrum of servers, tablets, phones and desktops[4], they can all live under the same guarantees and are super easy to understand because of the extreme lightness of all their code. I’m not doing anything but showing stuff to the user from the database, and putting stuff back in the database, and adjusting based on whether or not the database accepted what was given.

This makes application development fun again. Previously I has been bogged down in trying to define business logic rules as processes, and that was boring, especially since the magic sauce really should have just been a data model forcing me to be correct in the first place instead of me chasing exceptional cases through a bunch of logical code paths in the application (which had to be duplicated across all applications!). Also, this effort tended to put horse-blinders on me as far as interface went. Once I wrote a web interface, the enormous freedom that native application development gives you is suddenly invisible and you’re thinking in terms of “what is the parallel widget in Qt or GTK to the HTML SELECT” or whatever. That’s just lame. But its what starts happening when you spend so much brainpower worrying about conditional business logic that you forget all the cool stuff you can do in a native application (like 3D flowcharts, or 3/4D projections of project management data throughout time that you can “paw through” with the mouse or even a game controller, or a million other kickass ideas we usually only ever get to see in vidya games).

Getting your data model right gives you not only the mental freedom to start exploring what native UI can do that goes so far beyond the pitiful bag of cheap tricks that “web app development” has made standard today (or the convoluted mess of JavaScript and AJAX trash that supports it), it also gives you the confidence to step out and do some cool stuff in your client applications because, hey, the data model part of the problem is already solved. All you have to do is serialize the data in your application — which means in the application if you want to have objects, go for it, but make sure they are based on a view of derived data, not a 1-for-1 mapping of objects to relations. That serialization is an easy problem to have gives you the focus to do cool stuff nobody else is doing — and it all comes down to doing data right and escaping from the ridiculous house of mirrors that ORMs lead you into.

There is a conceptual mismatch between the object world and the relational world that is so vast that it is not worth trying to bridge. I’m saying there isn’t an Object-Relation Mismatch. They just aren’t even the same thing, so how could we have ever thought that comparing them against the same criteria ever made sense to begin with?


[1. Both when I was a desk jockey for a while and when I was still in the Army — being an SF engineer involves a good bit of math that you know about going in (and no calculators, so programming is no help there anyway) but is also huge amounts of paperwork that they never tell you about until after you walk thousands of miles to get your floppy green hat.]

[2. This is every bit as damaging as the way that leftist political thinkers loosely throw around the word “society”. As in “you owe it to society” and “society owes it to them” or “society must regulate X, Y, and Z” when what they really mean in some cases is actually “your community” and other cases as “the government”, which convolutes the discussion enough that obviously unacceptable things can seem acceptable — which is similar to how obviously non-OO things have been massaged into a OO-ish shape in the minds of thousands, but still remain just as procedural or functional as ever they were in reality.]

[3. This mistake is somewhat comically enshrined in the new NoSQL stuff, which consists principally of reinventing pre-relational data systems IBM already worked on and largely discarded decades ago.]

[4. In fairness, almost everything is running on Linux, and this makes development much easier than if I were trying to tackle the full spectrum of device OSes out there. Who wants to write a 3D reporting face for Blackberry that needs to work and look the same way it does on Android, KDE on Linux, or iOS (or Windows Phone… haha!).]

Powered by WordPress