This discussion covers the way Django 1.4 and Postgres 9.1 work as of mid 2012. If you have trouble with this or very similar code let me know and we can post your examples.
Some of us like Python. The lot of us that started in an Algol like C (or Algol) and moved on to a LISP like Scheme (or Common Lisp) find Python an easy middle, especially when working with others who don’t share the same background. Its easy enough to catch on to that even first-time use merely requires reading a sample and a link to the API reference — its just that easy to get started. OOP is optional and you get first-class functions, but you don’t get a lot of the functional goodies and you’re insulated from the platform a bit. Meh. I’d call it a generally reasonable trade for a happy medium language that feels a lot like executable psuedocode. If nothing else, it is phenomenally clean compared to C-syntax-where-it-doesn’t-belong languages — a lot the way YAML does when compared to XML.
Some of us are subjected to working with the Web, at least from time to time. Despite the fact that we would all love to write native applications against a sane API that we’re allowed to read in a single language in an environment we control with a single data handling layer, that’s just not always reality. We’d all love to commit to a top-down conceptual design with humans first in mind (especially humans who can express their problem in detail and don’t “hey, could you just…” halfway through) and bottom-up design from a solid data model (that someone smart was permitted the time to work on before the budget started paying for interface people to get busy). But sometimes we hit weird cases where there’s just not time or money or scheduling is stupid or there are unanticipated complications or the platform changes based on a buzzword or the situation doesn’t fit because you’re working behind a three-star programmer who quit last week on something you didn’t build. Or maybe that’s almost always.
Since that’s the norm, I’m discussing using Django to make reporting faces for existing database applications. In particular database applications that rely on a non-trivial normalized or properly denormalized data model. A lot of folks dive straight to the “Hire a guy to build a Django/PHP/TG/Rails/whatever application with a separate database that syncs or queries-to-sync what it needs from the existing database” conclusion. This is wasteful since most of the time all we want is to display some tabular data along with some custom math on that data. Usually the data set is not enormous and going full-bore data warehouse prepared to carry billions of records in a large set of denormalized, parallel schema is way, way overkill.
So how do we do this with a web framework that uses an ORM (noooo~!) and expects a 1-to-1 mapping between classes and tables? We use views against unmanaged Django models.
When I say “views” I mean in the database sense, not the ubiquitous Django view.py stuff. As in the good ole’ Postgres CREATE VIEW foo AS SELECT [stuff]. This could be materialized views, calculated views, a denormalized table triggered to always be current (see “materialized view”), etc. Database views, whatever the flavor, will serve our purposes.
A lot of folks get tripped up on this particular subject because of the way frameworks like Django and Rails nearly always rely on meaningless integers as primary keys by default, can’t handle multi-column natural keys and try to add an integer “ID” to any model even if its not the primary key and therefore completely superfluous. Any non-trivial, normalized data model won’t have integer primary keys and any DBA who administers such a data store won’t add them to satisfy your stupid web framework. So forget about adding integer primary keys, we’re not going to use them and adding them would be an exercise worthy of something the government would pay for. But that means that until true support for natural, multi-column keys in Django gets finalized we need a new approach (and anyone stuck working with Django at least up to 1.5 will need a different approach anyway). No worries, read on.
We’re going to create a view against an unmanaged Django model class, and feed Django a dummy “id” column to keep it quiet. Of course, we have to remember that the unmanaged model isn’t safe to use as a target for “models.ForeignKey()”, but as we’ll see this is a pretty simple limitation to work around since any existing data model will already have pre-defined primary keys. This is workable even if we need to treat the view as a real model in the future.
Note: this is a toy example for necessary reasons. I’m only writing this to illustrate a point, not to write a full-bore business application on my blog. That said, feel free to use these ideas or any code presented here in any way you like.
So let’s get a few tables down. Let’s say there is a cost/price margin history schema in a database that various applications use to keep track of how much stuff costs today and how much it used to cost at a given point in time. Let’s say also that some of the items in the list are for sale and the history of the sale price needs to be tracked in a similar manner to the costs of things. Since a lot of applications access the data store and many of them are designed either strictly for sales/ordering-only or cost/inventory-only the cost and price histories are completely independent. Different applications can already tell the managers and analysts what the margin is right now, but all this historical data is difficult to compare over time because its tedious for humans to run the comparisons by hand, and a real pain to keep straight even in a spreadsheet application.
Sounds like a perfect job for a computer! Further, since this is tabular data display and essentialy a document-export function it is a perfect fit for something that is accessible from a browser.
So far we’ve got two tables on the costing/inventory side: inventory.item and inventory.cost_history. They look like this (note, this isn’t quite the way psql will display the table data, but just go with it):
Table "inventory.item" Column | Type | Modifiers ----------------+--------------------+----------------- nsn | varchar(50) | primary key is_active | boolean | not null Table "inventory.cost_history" Column | Type | Modifiers ----------------+--------------------+-------------------------------------------------------- item | varchar(50) | foreign key (item.nsn), unique together with start_date start_date | timestamp | primary key with "item" end_date | timestamp | value | money | not null
Over on the product/pricing side we have two more tables to compliment those above: sales.product and sales.price_history. They look like this:
Table "sales.product" Column | Type | Modifiers ----------------+--------------------+------------------------------- item | varchar(50) | foreign key (item.nsn), unique sku | varchar(30) | primary key name | varchar(100) | not null is_active | boolean | not null Table "sales.price_history" Column | Type | Modifiers ----------------+--------------------+---------------------------------------------------------- product | varchar(50) | foreign key (sales.sku), unique together with start_date start_date | timestamp | primary key with "product" end_date | timestamp | value | money | not null
Enormously simple — to the point that if I hadn’t explained the scenario you’d wonder why we even have a “sales.product” table. But let’s just accept that not every item in the company’s inventory is for sale, and those that are need to be treated differently so tying everything to inventory.item records isn’t a good idea. (It bears mentioning here that sales.price_history.product could just as easily point to inventory.item.nsn as sales.product.sku: natural connectivity with the “real” NSN vs more frequent reference by/to SKU from application queries.)
So this is the existing schema we’re hoping to write a small dashboard-ish web thingy in Django against without modification. How do we do it without driving the DBA nuts, making one query per iteration, burying a lot of huge raw() queries in our application code, or forcing things into some nooberiffic “One Class, to rule them all” paradigm? First we need something Django can hold on to deal with on its own terms — something that behaves enough like a real table to pass a sort of duck check.
We need a concept of “things for sale”. We don’t care about the items not for sale in this case, because the whole point is demonstrating a margin history — so we can, for the purposes of our dashboard — lump sales.product and inventory.item together:
class ProductRecord(models.Model): nsn = models.CharField('NSN', max_length=50) sku = models.CharField('SKU', max_length=30) name = models.CharField(_('Name'), max_length=100) is_active = models.BooleanField(_('Active')) class Meta: db_table = 'sales\".\"product_record' managed = False
Pretty predictable so far. But it could use a better name than sales.product_record. I suck at naming things without thinking a bit first. With that in mind, note that I named the class “ProductRecord” to stay in tune with the db_table value. At the moment we’re just writing a dashboard, but if that requirement grows later on you’d hate to remember that in every other app “Product” refers to sales.product but in this app its something different because you already used the name “Product” for something else. We’ve lumped together a lot of pretty obvious data in one place from sales.product and inventory.item. Now let’s lump the money stuff together:
class ProductValueHistory(models.Model): sku = models.CharField('SKU', max_length=30) cost = models.DecimalField(_('Cost'), max_digits=15, decimal_places=5) price = models.DecimalField(_('Price'), max_digits=15, decimal_places=5) start_date = models.DateTimeField(_('Start Date')) end_date = models.DateTimeField(_('End Date')) class Meta: db_table = 'sales\".\"product_value_history' managed = False
Also fairly predictable. (The DecimalField thing and fixed-point math for money handling is a subject for another time.) You probably noticed the lack of a Django models.ForeignKey on both of these models. We don’t have them because we don’t have anything to tie them to that Django can understand because of that silly litter integer-as-a-primary-key assumption that nearly every ORM seems to universally make. We could add it, but that would require altering the original tables to accommodate this dashboard, and that goes against the principles of being a good guest in someone else’s digital house. Besides, a lot of applications access this data store — doing anything that changes things up could have ripple down effects all over the place; much better to forget about all that mess. Never let your tools drive your design — otherwise you’re being a tool for the sake of a cosmic “in Soviet Russia…” joke.
We could declare something else to be the primary key, but that would only affect how Django would generate SQL table creation code when running manage.py syncdb, and since we’re never going to run that on an unmanaged model and Django tries to add an integer ID to everything whether or not you’ve already got a natural primary key defined, that’s pointless
Now lets write our views for the database. This works in Postgres 9.1. I don’t know about other databases — and if you’re doing this in MySQL you probably don’t want to do things this way (hint: you should migrate):
CREATE VIEW sales.product_record AS SELECT row_number() OVER (ORDER BY s.name) AS id, i.nsn AS nsn, s.product AS sku, s.name AS name, s.is_active AS is_active FROM inventory.item AS i, sales.product AS s WHERE i.nsn = s.item;
Now we’ve got a “fake” table in the database that Django thinks is a real one. It even has an “id” column generated for us by the row_number() window function. This is silly, and only present to fool Django into accepting our model, but compared to how a lot of other ORMs work, this is a pretty small ORM tax to pay. The price goes up a little if we want to be allowed to do insertions and modifications from the Django class to this view instead of the real tables (we’d have to write rules), but even that isn’t so hard.
So now let’s get what we want out of the price/cost history combination. I’m not 100% happy with the way this query turns out, to be honest (partly because I deliberately made this situation a little complex by not making sales.price_history reference the inventory.item.nsn at all, so we require an extra join), but it does work fine — and there is a way around even the inefficiency in my (probably bad) SQL view code here:
CREATE VIEW sales.product_value_history AS SELECT row_number() OVER (ORDER BY p.start_date DESC, c.start_date DESC) AS id, p.sku AS sku, CASE WHEN p.start_date < c.start_date THEN p.start_date ELSE c.start_date END AS start_date, CASE WHEN p.end_date < c.end_date THEN p.end_date ELSE c.end_date END AS end_date, c.value AS cost, p.value AS price FROM sales.price_history AS p, ( SELECT product.sku AS sku, cost.value AS value, cost.start_date AS start_date, cost.end_date AS end_date FROM sales.product AS product, inventory.cost_history AS cost WHERE product.item = cost.item) AS c WHERE p.product = c.sku AND ( (p.start_date, p.end_date) OVERLAPS (c.start_date, c.end_date) OR (p.end_date IS NULL AND c.end_date IS NULL));
The query represented by this view goes a touch beyond what the Django ORM provides access to, but isn’t that hard to understand. The subquery where we join sales.product and inventory.cost_history is the link that provides us the necessary connection between a product’s SKU and its parent item’s NSN and returns a table called c. Joining that to the sales.price_history table for matching SKUs gives us all the costs and prices associated with a given product (and no results for items in inventory that are not for sale), and the extra WHERE clause using OVERLAPS lines up our price histories so we don’t have NULL-value gaps across spans of time when either the cost or price changed but the other didn’t.
We did the same “fake id” trick in this query using the row_number() window function so we can use Django’s ORM to pull results from this table like any other model. Because the window function already sorted the results in descending order, we don’t need to sort the results to know they are in chronological order.
Now where to put this bit of SQL? Of course, check the Django docs, but as of Django 1.4 the .sql files should go in a directory located at project/app/sql/ . Once there it should execute when you run manage.py syncdb — and if it doesn’t or you need to re-do something manually you can invoke it from within psql quite easily by doing \i /path/to/project/app/sql/filename.sql. (If you do it manually from within psql, remember to ALTER VIEW view_name OWNER TO django_role_name or Django probably won’t have permission to query it.)
So how do we use it to get a history for a specific product? Since we don’t have any primary key/foreign key relations set up in Django, we can’t just do:
product = ProductRecord.objects.get(id=number) value_history = product.price_history.all()
This is a limitation many ORMs have, but we can work around it easily enough since the database itself has other primary keys that are guaranteed to be unique and therefore safe to use with object.get() or get_object_or_404():
product = ProductRecord.objects.get(nsn=value) value_history = ProductValueHistory.objects.filter(sku=product.sku)
The results are already sorted so we can also do the following without any order_by() clause in Django:
current_value = ProductValueHistory.objets.filter(sku=product.sku)
This gets us right up to the point of needing to write the math in to actually get margin totals and calculate percentages and make pie charts and all the other stuff business types like to do with this kind of data. At this point you can do all that in your Django view in Python (very easy and comfortable) or write a few more columns into the views that do this processing for you before it ever leaves the database, or even write entirely new views and new models that do whatever it is that you want.
Faced with the options above, when there is no obvious right answer I prefer to put it into the database as calculated columns or new views and get a complete answer per query instead of processing raw data in the application code. I (usually) take this approach because processing in the application makes that logic unavailable to any other applications that access the same data store which might want the same information in the future, thus leaving the writer of those other applications (probably me) with nothing left but to reinvent the wheel to get the exact same result (and looking ahead at maintenance centralizing logic is always better, whether in a library, database or whatever). Another good reason I’ve found to do things this way is to avoid accidentally writing an iterative processing routine at the application level that calls the database on each iteration of an arbitrarily large loop (I know you think “Yikes!” but people do this all over the place without even realizing it!).
But this view would be rather inefficient on very large rows of tables because its not eliminating rows based on an indexed item before it does the join processing and also because the OVERLAPS bit is pretty compute intensive on large sets. Indexes on the view can mitigate that to some degree, but there will come a point when materialized views/denormalized tables trump having faster hardware.
I’m constantly amazed at how fast hardware is these days and how much Postgres performance increases with each release, but that’s no excuse for a rather blatant query inefficiency that can be fixed easily. Its also absolutely no help for people stuck using Postgres < 7.1 or the legions of people stuck with MySQL or the poor fools stuck using a crap, deliberately crippled “home” or “small business” version of a proprietary database.
There are two possible ways out of this. You can write a raw() SQL query into Django (sometimes the easy answer), or you can make sales.product_value_history into a real table in the database that updates itself whenever the sales.price_history or inventory.cost_history tables are modified.
The second idea is the most interesting and involves a deliberate denormalization, which in this case I think is probably appropriate. This is called making a “materialized view”. Its available as a native feature in DB2 and Oracle, but not in Postgres just yet (I think 9.2 or 9.3, probably 2013 or 2014). However, in Postgres we can write triggers which keep our price/cost history table updated automatically whenever either of the sponsoring tables is modified. This ability is why while DBAs love conveniences like built-in materialized views features like this tend to take a lower priority than the serious stuff like JOIN efficiency, window functions and query optimizations. A word of caution: it is pretty common for folks whose first experience with databases like Postgres was through an ORM framework to try keeping tables synced by writing routines in at the application level — but this is a bad idea and defeats the purpose of using the database as a layer of abstraction. Leaky abstractions suck the further you travel with them and always remind me of the Turkish proverb “No matter how far you’ve gone down the road, turn back.”
I can feel your boredom burning into my fingers through a quantum time warp in the net, so I’ll end this here. Yes, I’m leaving the materialized views implementation as a loose end (there are great resources on the net for this), but the main point was how to put Django in meaningful touch with a database schema that is both non-trivial and doesn’t use arbitrary integer values as primary keys.