Tag Archives: Postgres

Recursive Lineage Queries in SQL and RyuQ

Without explaining every detail, here is a “proper” (as in, non-NULL-dependent) method for handling tree data within a recursive query. In the example below table foo has an id, a parent and a creation timestamp (created_on). Obviously real-world examples will have more fields, but this illustrates the technique. This sort of thing is quite handy in a huge variety of situations, as quite a lot of data is hierarchical in nature.

Adding a layer type attribute on the view is also a good way to accomplish things like, say, providing visual layout variables for threaded comments on a website, even if your (stupid) web framework doesn’t understand such things — you can bypass the framework garbage by defining a lineage-type view directly in the database and a “model” on top of it.

CREATE FUNCTION default_parent() RETURNS TRIGGER AS
  $$
  BEGIN
    IF NEW.parent IS NULL THEN
      NEW.parent := NEW.id;
    END IF;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

CREATE TABLE foo
 (id         SERIAL PRIMARY KEY,
  parent     integer REFERENCES foo ON UPDATE CASCADE ON DELETE SET NULL NOT NULL,
  created_on timestamptz DEFAULT current_timestamp NOT NULL);

CREATE TRIGGER foo_default_parent
  BEFORE INSERT ON foo
  FOR EACH ROW EXECUTE PROCEDURE default_parent();

CREATE TRIGGER foo_reset_parent
  BEFORE UPDATE ON foo
  FOR EACH ROW EXECUTE PROCEDURE default_parent();

CREATE VIEW lineage AS
  WITH RECURSIVE
  parents AS
    (SELECT * FROM foo WHERE parent = id),
  children AS
    (SELECT * FROM foo WHERE parent != id),
  tree AS
   (SELECT id,
           id::text AS branch,
           1 AS layer,
           parent
      FROM parents
   UNION ALL
    SELECT c.id,
           t.branch || '.' || c.id::text AS branch,
           t.layer + 1 AS layer,
           c.parent
    FROM children AS c, tree AS t
    WHERE c.parent = t.id)
  SELECT * FROM tree ORDER BY branch;

CREATE OR REPLACE FUNCTION show_descendants(parent_id integer) RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    WITH RECURSIVE
    parents AS
      (SELECT * FROM foo WHERE id = $1),
    children AS
      (SELECT * FROM foo WHERE parent != id),
    tree AS
     (SELECT id,
             id::text AS branch,
             0 AS layer,
             parent
        FROM parents
     UNION ALL
      SELECT c.id,
             t.branch || '.' || c.id::text AS branch,
             t.layer + 1 AS layer,
             c.parent
      FROM children AS c, tree AS t
      WHERE c.parent = t.id)
    SELECT * FROM tree ORDER BY branch;
  $$ LANGUAGE SQL STABLE;

CREATE OR REPLACE FUNCTION
  get_lineage(parent_id integer DEFAULT 0, hide_parent boolean DEFAULT false)
  RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    BEGIN
      IF parent_id > 0 AND hide_parent IS true THEN
        RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) OFFSET 1 $query$
          USING parent_id;
      ELSIF parent_id > 0 AND hide_parent IS false THEN
        RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) $query$
          USING parent_id;
      ELSE
        RETURN QUERY EXECUTE $query$ SELECT * FROM lineage $query$;
      END IF;
    END;
  $$ LANGUAGE plpgsql STABLE;

Let’s see some invocations:

insert into foo values (default);
insert into foo values (default);
insert into foo values (default);
insert into foo values (default, 1);
insert into foo (parent) values (1);
insert into foo (parent) values (1);
insert into foo (parent) values (1);
insert into foo (parent) values (2);
insert into foo (parent) values (3);
insert into foo (parent) values (3);
insert into foo (parent) values (4);
insert into foo (parent) values (5);
insert into foo (parent) values (7);
insert into foo (parent) values (7);
insert into foo (parent) values (10);
insert into foo (parent) values (15);

select * from lineage;
 id |   branch   | layer | parent 
----+------------+-------+--------
  1 | 1          |     1 |      1
  4 | 1.4        |     2 |      1
 11 | 1.4.11     |     3 |      4
  5 | 1.5        |     2 |      1
 12 | 1.5.12     |     3 |      5
  6 | 1.6        |     2 |      1
  7 | 1.7        |     2 |      1
 13 | 1.7.13     |     3 |      7
 14 | 1.7.14     |     3 |      7
  2 | 2          |     1 |      2
  8 | 2.8        |     2 |      2
  3 | 3          |     1 |      3
 10 | 3.10       |     2 |      3
 15 | 3.10.15    |     3 |     10
 16 | 3.10.15.16 |     4 |     15
  9 | 3.9        |     2 |      3
(16 rows)

select * from get_lineage(3);
 id |  lineage   | layer | parent 
----+------------+-------+--------
  3 | 3          |     0 |      3
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(5 rows)

select * from get_lineage(3, true);
 id |  lineage   | layer | parent 
----+------------+-------+--------
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(4 rows)

Notice that there are no NULLs. Most formulations of tree-type data use NULL to represent not having a parent — but this is wrong. NULL means something is unknown. If the top-level nature of an entity is known, then NULL is the wrong value and complicates any logic involving that field unnecessarily. If, instead, we define a top-level entity as any entity which is its own parent, then we have no ambiguity in the structure, and we are providing a correct model. For this reason the parent column above is NOT NULL. If we wanted to provide the possibility of maintaining records where the parent was actually unknown, we could remove the NOT NULL constraint and drop the triggers.

That may sound nitpicky, but there are several places where this choice simplifies things, and the NULL-based alternative can make query execution downright unreliable because of a dependence on conditional logic shortcuts. Consider the common SQL function solution comparable to show_descendants() above:

CREATE OR REPLACE FUNCTION show_descendants(parent_id integer) RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    WITH RECURSIVE tree AS
     (SELECT id,
             id::text AS branch,
             0 AS layer,
             parent
        FROM foo
        WHERE (parent IS NULL AND $1 = 0) OR (parent = $1)
     UNION ALL
      SELECT c.id,
             t.branch || '.' || c.id::text AS branch,
             t.layer + 1 AS layer,
             c.parent
      FROM foo AS c, tree AS t
      WHERE c.parent = t.id)
    SELECT * FROM tree ORDER BY branch;
  $$ LANGUAGE SQL STABLE;

This works just fine in Postgres 9.2 — but that is a quirk of implementation. SQL is supposed to be declarative in nature, and as such there is no guarantee which side of the OR condition in WHERE will be executed first. It is cheaper to check the single condition on the right side than the AND condition on the left, so it is likely a future execution optimization will detect that and run the query this way. The problem is that the right side expression can be true in invalid cases. There is no way to write this query safely as long as NULL is a possible value for parent.

Another problem is that using a NULL parent value in any calculation will nuke the entire operation because of NULL propagation. Say you want to do “parent::text || foo” somewhere later. Surprise! Any NULL is going to kill the entire string. There are convoluted ways to avoid that, but we’re fighting against the nature of an unnatural data definition instead of providing a natural definition at the outset and letting the positive consequences follow.
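
To make that concrete, here is a small illustration (not part of the original example) of how one NULL swallows an entire concatenation, and the usual COALESCE contortion needed to dodge it:

SELECT NULL::integer::text || '.' || '42';                  -- returns NULL, not '.42'
SELECT COALESCE(NULL::integer::text, '0') || '.' || '42';   -- returns '0.42'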

An argument can be made that handling for-real parent values instead of just using NULL definitions complicates the code above unnecessarily. It’s true, in SQL you have to define a trigger to handle the top-level case and be a little more specific in the view and function definitions that follow. But that’s the end of it. We’ve trapped the NULL and completely contained the complexity involved in modeling an accurate parent relationship, not to mention guarded against the undefined nature of logical OR execution (this is a huge issue on its own). If we were to permit a NULL here and didn’t really intend for that to mean unknown, then we would have traded the convenience of writing a quick and dirty definition for letting the complex and unpredictable side effects escape our model and propagate throughout the rest of the schema and every application that touches this table. With that in mind it is obviously a better solution to pay the complexity tax up front, write the insert trigger and function definitions, and nip the issue in the bud.

In RyuQ this can be defined as:

create [relation]
  foo
    attributes
      id         Serial
      parent     Integer
      created_on TimestampTZ
    conditions
      pk id
      fks
        parent -> foo.id
      defaults
        parent = id
        created_on = current_timestamp

And define some elabels to operate on the data:

tree = union
         project [id, lineage = id::text, layer = 1, parent]
           select [parent == id] foo
         project [id = foo.id,
                  lineage = tree.lineage + '.' + foo.id,
                  layer = tree.layer + 1,
                  parent = foo.parent]
           join
             tree
             select [parent != id] foo

branch ID =
       union
         project [id, lineage = id::text, layer = 0, parent]
           select [id == ID] foo
         project [id = foo.id,
                  lineage = branch.lineage + '.' + foo.id,
                  layer = branch.layer + 1,
                  parent = foo.parent]
           join
             branch
             select [parent != id] foo

headless ID = select [parent != id] branch ID

And that’s it. Notice that there are no NULLs here either; we are not using any MaybeType attributes. If we wanted to make the definition of a top-level entity “Something without a parent” then using the type NoneInteger would be correct. If we wanted to include the possibility that a parent relationship is unknown (but that this does not define a top-level entity) we could use MaybeInteger. If we want to include both then we should define a domain that includes integers, None, and _|_ as members, possibly calling it MaybeNoneInteger.

Invocation:

insert [foo]
  {id}
  (default), (default), (default)
  {parent}
  (1), (1), (1), (1), (2), (3), (3), (4), (5), (7), (7), (10), (15)

tree
 id |  lineage   | layer | parent 
----+------------+-------+--------
  1 | 1          |     1 |      1
  4 | 1.4        |     2 |      1
 11 | 1.4.11     |     3 |      4
  5 | 1.5        |     2 |      1
 12 | 1.5.12     |     3 |      5
  6 | 1.6        |     2 |      1
  7 | 1.7        |     2 |      1
 13 | 1.7.13     |     3 |      7
 14 | 1.7.14     |     3 |      7
  2 | 2          |     1 |      2
  8 | 2.8        |     2 |      2
  3 | 3          |     1 |      3
 10 | 3.10       |     2 |      3
 15 | 3.10.15    |     3 |     10
 16 | 3.10.15.16 |     4 |     15
  9 | 3.9        |     2 |      3
(16 rows)

branch 3
 id |  lineage   | layer | parent 
----+------------+-------+--------
  3 | 3          |     0 |      3
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(5 rows)

headless 3
 id |  lineage   | layer | parent 
----+------------+-------+--------
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(4 rows)

Pretty simple. I find the RyuQ version easier and clearer, but obviously I’m biased.

Thinking About a Data Language

I’ve been thinking a lot lately about how a query and data definition language would look if I were able to write one myself. Well, it turns out I can write one myself; it just takes a lot of time. I don’t have a lot of time, but I’ve written down some of my ideas to clarify them and placed them here. This is mostly just a text-friendly way of writing relational algebra, but it has a few extras that would make it much nicer to use than SQL.

The mid-term goal is to implement at least some of the query language either as part of a runtime that sits outside of Postgres and feels like psql with something better than SQL, or hack an alternate parser into Postgres that would provide parsed Query trees to the optimizer. I haven’t decided which will be more time consuming in the long run just yet.

I’ll keep that area updated as I have time to work on the language spec more. There are quite a few ideas I’ve got left to commit to writing but just can’t yet due to time.

Lean on Your Database

Doing computations in the database is almost always the right answer. Writing your own database procedures instead of relying on an ORM framework is even better. These go hand in hand, since ORMs don’t allow you enough control over your data schema to define calculations in the first place (most can’t even properly handle multi-column primary keys and instead invent a meaningless integer ID for everything; this should tell you something). Many, maybe most, web developers (at least that I’ve met) don’t even know what sort of calculations are available to be performed in the database because they’ve been taught, in an absolute vacuum of personal experience, that “SQL hard. Relational thinking hard. OOP good. Trust framework”. There is so much missing here.

Frameworks try to be “database agnostic”. This is fundamentally flawed thinking. This implies that the data layer is merely there for “persistence” and that the “persistence layer” can be whatever — all RDBMS systems “aren’t OO and therefore suck” and so any old database will do. If this is true then it follows that frameworks and applications should be designed to work against any database system and not delve too deeply into any specific feature sets — after all, the application functionality is the focus, not the data, right? This is exactly backwards.

Even forgetting that this condemns you to least-common-denominator data design, this is still exactly backwards. Let me put the right way on a line all by itself, because it is just that important:

Data designs should strive to be application agnostic.

Data drives everything. Your functions are what you can change around easily, but your data schema is critical and represents everything about your system logic. If you show me a well-labeled data schema I can probably guess what you are trying to do, but if you show me just your functions and objects I’ll require either a code tour or a lot of familiarization time before getting anything serious done (that project documentation will be lacking is a truism not worth addressing here).

Consider that changing your app code is cake whereas changing your data schema is major project surgery. OOP has us so in a stupor that we think if we just get our objects right everything will be fine. It won’t. Ever. As long as you think that, you’ll also believe other crap like that each object should map directly to a table. There are certain basic truths about certain types of data. It is striking that I can give a data requirement to two DBAs schooled on two different RDBMSes and ask for a normalized data model (let’s just say 3NF for argument) and get back two very similar looking schemas, but I can give a feature requirement to two Java programmers and get back radically different system designs.

This should tell us something. In fact, it screams the truth that data is a foundation from which you must work up toward the application code, not the other way around. The database layer is the most important place to make sound choices. The choice of database system itself should be based on project requirements, because that choice matters. Most critically, I’ll say it again here because it is so important and implies so much on contemplation: the database designs should strive to be application agnostic.

Using Unmanaged Django Models of Postgres Views in a Pre-existing Schema

This discussion covers the way Django 1.4 and Postgres 9.1 work as of mid 2012. If you have trouble with this or very similar code let me know and we can post your examples.

Some of us like Python. Some of us are subjected to working with the Web, at least from time to time. Despite the fact that we would all love to write native applications against a sane API that we’re allowed to read in a single language in an environment we control with a single data handling layer, that’s just not always reality. We’d all love to commit to a solid data model that someone smart was permitted the time to work on before the FAD design crew got spun up. But sometimes we hit weird cases where there’s just not time or money or scheduling is stupid or there are unanticipated complications or the platform changes based on a buzzword or the situation doesn’t fit because you’re working behind a three-star programmer who quit last week. Ha ha, just kidding — I meant that’s all the time every time.

Since that’s the norm, I’m discussing using Django to make reporting faces for existing database applications. In particular database applications that rely on a non-trivial normalized or properly denormalized data model. A lot of folks dive straight to the “Hire a guy to build a Django/PHP/TG/Rails/whatever application with a separate database that syncs or queries-to-sync what it needs from the existing database” conclusion. This is wasteful since most of the time all we want is to display some tabular data along with some custom math on that data. Usually the data set is not enormous and going full-bore data warehouse prepared to carry billions of records in a large set of denormalized, parallel schema is way, way overkill.

So how do we do this with a web framework that uses an ORM (noooo~!) and expects a 1-to-1 mapping between classes and tables? We use views against unmanaged Django models.

When I say “views” I mean in the database sense, not the ubiquitous Django views.py stuff. As in the good ole’ Postgres CREATE VIEW foo AS SELECT [stuff]. This could be materialized views, calculated views, a denormalized table triggered to always be current (see “materialized view”), etc. Database views, whatever the flavor, will serve our purposes.

A lot of folks get tripped up on this particular subject because of the way frameworks like Django and Rails nearly always rely on meaningless integers as primary keys by default, can’t handle multi-column natural keys and try to add an integer “ID” to any model even if it’s not the primary key and therefore completely superfluous. Any non-trivial, normalized data model won’t have integer primary keys and any DBA who administers such a data store won’t add them to satisfy your stupid web framework. So forget about adding integer primary keys; we’re not going to use them, and adding them would be an exercise worthy of something the government would pay for. But that means that until true support for natural, multi-column keys in Django gets finalized we need a new approach (and anyone stuck working with Django at least up to 1.5 will need a different approach anyway). No worries, read on.

We’re going to create a view against an unmanaged Django model class, and feed Django a dummy “id” column to keep it quiet. Of course, we have to remember that the unmanaged model isn’t safe to use as a target for models.ForeignKey(), but as we’ll see this is a pretty simple limitation to work around since any existing data model will already have pre-defined primary keys. This is workable even if we need to treat the view as a real model in the future.

Note: this is a toy example for necessary reasons. I’m only writing this to illustrate a point, not to write a full-bore business application on my blog. That said, feel free to use these ideas or any code presented here in any way you like.

So let’s get a few tables down. Let’s say there is a cost/price margin history schema in a database that various applications use to keep track of how much stuff costs today and how much it used to cost at a given point in time. Let’s say also that some of the items in the list are for sale and the history of the sale price needs to be tracked in a similar manner to the costs of things. Since a lot of applications access the data store and many of them are designed either strictly for sales/ordering-only or cost/inventory-only, the cost and price histories are completely independent. Different applications can already tell the managers and analysts what the margin is right now, but all this historical data is difficult to compare over time because it’s tedious for humans to run the comparisons by hand, and a real pain to keep straight even in a spreadsheet application.

Sounds like a perfect job for a computer! Further, since this is tabular data display and essentially a document-export function, it is a perfect fit for something that is accessible from a browser.

So far we’ve got two tables on the costing/inventory side: inventory.item and inventory.cost_history. They look like this (note, this isn’t quite the way psql will display the table data, but just go with it):

                Table "inventory.item"
     Column     |       Type         |    Modifiers
----------------+--------------------+-----------------
 nsn            | varchar(50)        | primary key
 is_active      | boolean            | not null

                         Table "inventory.cost_history"
     Column     |       Type         |              Modifiers
----------------+--------------------+--------------------------------------------------------
 item           | varchar(50)        | foreign key (item.nsn), unique together with start_date
 start_date     | timestamp          | primary key with "item"
 end_date       | timestamp          |
 value          | money              | not null

Over on the product/pricing side we have two more tables to complement those above: sales.product and sales.price_history. They look like this:

                Table "sales.product"
     Column     |       Type         |    Modifiers
----------------+--------------------+-------------------------------
 item           | varchar(50)        | foreign key (item.nsn), unique
 sku            | varchar(30)        | primary key
 name           | varchar(100)       | not null
 is_active      | boolean            | not null

                         Table "sales.price_history"
     Column     |       Type         |              Modifiers
----------------+--------------------+----------------------------------------------------------
 product        | varchar(50)        | foreign key (product.sku), unique together with start_date
 start_date     | timestamp          | primary key with "product"
 end_date       | timestamp          |
 value          | money              | not null

Enormously simple — to the point that if I hadn’t explained the scenario you’d wonder why we even have a “sales.product” table. But let’s just accept that not every item in the company’s inventory is for sale, and those that are need to be treated differently so tying everything to inventory.item records isn’t a good idea. (It bears mentioning here that sales.price_history.product could just as easily point to inventory.item.nsn as sales.product.sku: natural connectivity with the “real” NSN vs more frequent reference by/to SKU from application queries.)

So this is the existing schema we’re hoping to write a small dashboard-ish web thingy in Django against without modification. How do we do it without driving the DBA nuts, making one query per iteration, burying a lot of huge raw() queries in our application code, or forcing things into some nooberiffic “One Class, to rule them all” paradigm? First we need something Django can hold on to and deal with on its own terms — something that behaves enough like a real table to pass a sort of duck check.

We need a concept of “things for sale”. We don’t care about the items not for sale in this case, because the whole point is demonstrating a margin history — so we can, for the purposes of our dashboard, lump sales.product and inventory.item together:

from django.db import models
from django.utils.translation import ugettext_lazy as _


class ProductRecord(models.Model):
    nsn       = models.CharField('NSN', max_length=50)
    sku       = models.CharField('SKU', max_length=30)
    name      = models.CharField(_('Name'), max_length=100)
    is_active = models.BooleanField(_('Active'))

    class Meta:
        db_table = 'sales\".\"product_record'
        managed = False

Pretty predictable so far. But it could use a better name than sales.product_record. I suck at naming things without thinking a bit first. With that in mind, note that I named the class “ProductRecord” to stay in tune with the db_table value. At the moment we’re just writing a dashboard, but if that requirement grows later on you’d hate to remember that in every other app “Product” refers to sales.product but in this app it’s something different because you already used the name “Product” for something else. We’ve lumped together a lot of pretty obvious data in one place from sales.product and inventory.item. Now let’s lump the money stuff together:

class ProductValueHistory(models.Model):
    sku        = models.CharField('SKU', max_length=30)
    cost       = models.DecimalField(_('Cost'), max_digits=15, decimal_places=5)
    price      = models.DecimalField(_('Price'), max_digits=15, decimal_places=5)
    start_date = models.DateTimeField(_('Start Date'))
    end_date   = models.DateTimeField(_('End Date'))

    class Meta:
        db_table = 'sales\".\"product_value_history'
        managed = False

Also fairly predictable. (The DecimalField thing and fixed-point math for money handling is a subject for another time.) You probably noticed the lack of a Django models.ForeignKey on both of these models. We don’t have them because we don’t have anything to tie them to that Django can understand, because of that silly little integer-as-a-primary-key assumption that nearly every ORM seems to universally make. We could add it, but that would require altering the original tables to accommodate this dashboard, and that goes against the principles of being a good guest in someone else’s digital house. Besides, a lot of applications access this data store — doing anything that changes things up could have ripple-down effects all over the place; much better to forget about all that mess. Never let your tools drive your design — otherwise you’re being a tool for the sake of a cosmic “in Soviet Russia…” joke.

We could declare something else to be the primary key, but that would only affect how Django would generate SQL table creation code when running manage.py syncdb, and since we’re never going to run that on an unmanaged model (Django tries to add an integer ID to everything whether or not you’ve already got a natural primary key defined), that’s pointless.

Now let’s write our views for the database. This works in Postgres 9.1. I don’t know about other databases — and if you’re doing this in MySQL you probably don’t want to do things this way (hint: you should migrate):

CREATE VIEW sales.product_record AS
  SELECT
      row_number() OVER (ORDER BY s.name) AS id,
      i.nsn AS nsn, s.sku AS sku, s.name AS name, s.is_active AS is_active
    FROM
      inventory.item AS i, sales.product AS s
    WHERE
      i.nsn = s.item;

Now we’ve got a “fake” table in the database that Django thinks is a real one. It even has an “id” column generated for us by the row_number() window function. This is silly, and only present to fool Django into accepting our model, but compared to how a lot of other ORMs work, this is a pretty small ORM tax to pay. The price goes up a little if we want to be allowed to do insertions and modifications from the Django class to this view instead of the real tables (we’d have to write rules), but even that isn’t so hard.
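
For instance, and this is only a rough sketch against the toy schema above rather than anything from a real application, an ON INSERT rule could redirect writes aimed at the view into the two real tables:

-- Sketch only: let INSERTs against the view land in the underlying tables.
CREATE RULE product_record_insert AS
  ON INSERT TO sales.product_record
  DO INSTEAD (
    INSERT INTO inventory.item (nsn, is_active)
      VALUES (NEW.nsn, NEW.is_active);
    INSERT INTO sales.product (item, sku, name, is_active)
      VALUES (NEW.nsn, NEW.sku, NEW.name, NEW.is_active)
  );

UPDATE and DELETE would want similar rules (or, on 9.1 and later, INSTEAD OF triggers) if the Django side ever needs to write through this model.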

So now let’s get what we want out of the price/cost history combination. I’m not 100% happy with the way this query turns out, to be honest (partly because I deliberately made this situation a little complex by not making sales.price_history reference the inventory.item.nsn at all, so we require an extra join), but it does work fine — and there is a way around even the inefficiency in my (probably bad) SQL view code here:

CREATE VIEW sales.product_value_history AS
  SELECT
      row_number() OVER (ORDER BY p.start_date DESC, c.start_date DESC) AS id,
      p.sku AS sku,
      CASE
        WHEN p.start_date < c.start_date
          THEN p.start_date
          ELSE c.start_date
      END
        AS start_date,
      CASE
        WHEN p.end_date < c.end_date
          THEN p.end_date
          ELSE c.end_date
      END
        AS end_date,
      c.value AS cost,
      p.value AS price
    FROM
      sales.price_history AS p,
      ( SELECT
            product.sku AS sku,
            cost.value AS value,
            cost.start_date AS start_date,
            cost.end_date AS end_date
          FROM sales.product AS product, inventory.cost_history AS cost
          WHERE product.item = cost.item) AS c
    WHERE
        p.product = c.sku
      AND
        (   (p.start_date, p.end_date)
          OVERLAPS
            (c.start_date, c.end_date)
        OR
          (p.end_date IS NULL AND c.end_date IS NULL));

The query represented by this view goes a touch beyond what the Django ORM provides access to, but isn’t that hard to understand. The subquery where we join sales.product and inventory.cost_history is the link that provides us the necessary connection between a product’s SKU and its parent item’s NSN and returns a table called c. Joining that to the sales.price_history table for matching SKUs gives us all the costs and prices associated with a given product (and no results for items in inventory that are not for sale), and the extra WHERE clause using OVERLAPS lines up our price histories so we don’t have NULL-value gaps across spans of time when either the cost or price changed but the other didn’t.
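
If OVERLAPS is unfamiliar, a quick standalone example (nothing to do with this schema) shows the behavior the WHERE clause leans on:

SELECT (DATE '2012-01-01', DATE '2012-03-01')
       OVERLAPS (DATE '2012-02-01', DATE '2012-04-01');   -- true: the periods share time

SELECT (DATE '2012-01-01', DATE '2012-02-01')
       OVERLAPS (DATE '2012-03-01', DATE '2012-04-01');   -- false: no shared span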

We did the same “fake id” trick in this query using the row_number() window function so we can use Django’s ORM to pull results from this table like any other model. Because the window function already sorted the results in descending date order, we don’t need to sort the results to know the most recent ones come first.

Now where to put this bit of SQL? Of course, check the Django docs, but as of Django 1.4 the .sql files should go in a directory located at project/app/sql/ . Once there it should execute when you run manage.py syncdb — and if it doesn’t or you need to re-do something manually you can invoke it from within psql quite easily by doing \i /path/to/project/app/sql/filename.sql. (If you do it manually from within psql, remember to ALTER VIEW view_name OWNER TO django_role_name or Django probably won’t have permission to query it.)
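
For reference, that manual touch-up amounts to something like the following, assuming the Django role is literally named django (substitute whatever role your settings.py actually connects as):

ALTER VIEW sales.product_record OWNER TO django;
ALTER VIEW sales.product_value_history OWNER TO django;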

So how do we use it to get a history for a specific product? Since we don’t have any primary key/foreign key relations set up in Django, we can’t just do:

product = ProductRecord.objects.get(id=number)
value_history = product.price_history.all()

This is a limitation many ORMs have, but we can work around it easily enough since the database itself has other primary keys that are guaranteed to be unique and therefore safe to use with objects.get() or get_object_or_404():

product = ProductRecord.objects.get(nsn=value)
value_history = ProductValueHistory.objects.filter(sku=product.sku)

The results are already sorted so we can also do the following without any order_by() clause in Django:

current_value = ProductValueHistory.objects.filter(sku=product.sku)[0]

This gets us right up to the point of needing to write the math in to actually get margin totals and calculate percentages and make pie charts and all the other stuff business types like to do with this kind of data. At this point you can do all that in your Django view in Python (very easy and comfortable) or write a few more columns into the views that do this processing for you before it ever leaves the database, or even write entirely new views and new models that do whatever it is that you want.

Faced with the options above, when there is no obvious right answer I prefer to put it into the database as calculated columns or new views and get a complete answer per query instead of processing raw data in the application code. I (usually) take this approach because processing in the application makes that logic unavailable to any other applications that access the same data store which might want the same information in the future, thus leaving the writer of those other applications (probably me) with nothing left but to reinvent the wheel to get the exact same result (and looking ahead at maintenance, centralizing logic is always better, whether in a library, database or whatever). Another good reason I’ve found to do things this way is to avoid accidentally writing an iterative processing routine at the application level that calls the database on each iteration of an arbitrarily large loop (I know you think “Yikes!” but people do this all over the place without even realizing it!).
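
As a rough sketch of that first option (the view name and margin column here are made up for illustration, not part of the schema above), the margin itself can live in the database so every consumer gets the same number:

CREATE VIEW sales.product_margin_history AS
  SELECT h.*,
         (h.price - h.cost) AS margin
    FROM sales.product_value_history AS h;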

But the product_value_history view would be rather inefficient on tables with very large row counts because it’s not eliminating rows based on an indexed item before it does the join processing, and also because the OVERLAPS bit is pretty compute intensive on large sets. Indexes on the underlying tables can mitigate that to some degree, but there will come a point when materialized views/denormalized tables trump having faster hardware.

I’m constantly amazed at how fast hardware is these days and how much Postgres performance increases with each release, but that’s no excuse for a rather blatant query inefficiency that can be fixed easily. It’s also absolutely no help for people stuck using Postgres < 7.1 or the legions of people stuck with MySQL or the poor fools stuck using a crap, deliberately crippled “home” or “small business” version of a proprietary database.

There are two possible ways out of this. You can write a raw() SQL query into Django (sometimes the easy answer), or you can make sales.product_value_history into a real table in the database that updates itself whenever the sales.price_history or inventory.cost_history tables are modified.

The second idea is the most interesting and involves a deliberate denormalization, which in this case I think is probably appropriate. This is called making a “materialized view”. It’s available as a native feature in DB2 and Oracle, but not in Postgres just yet (I think 9.2 or 9.3, probably 2013 or 2014). However, in Postgres we can write triggers which keep our price/cost history table updated automatically whenever either of the sponsoring tables is modified. This ability is why, while DBAs love conveniences like built-in materialized views, features like this tend to take a lower priority than the serious stuff like JOIN efficiency, window functions and query optimizations. A word of caution: it is pretty common for folks whose first experience with databases like Postgres was through an ORM framework to try keeping tables synced by writing routines at the application level — but this is a bad idea and defeats the purpose of using the database as a layer of abstraction. Leaky abstractions suck the further you travel with them and always remind me of the Turkish proverb “No matter how far you’ve gone down the road, turn back.”
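
To show the shape of the idea without actually solving it, here is a deliberately naive sketch (the table and function names are invented for illustration): materialize the view into a plain table and re-fill it from a statement-level trigger on each sponsoring table.

CREATE TABLE sales.product_value_history_mat AS
  SELECT * FROM sales.product_value_history;

CREATE FUNCTION sales.refresh_product_value_history() RETURNS trigger AS
  $$
  BEGIN
    -- Naive full rebuild; a real implementation would touch only the affected rows.
    TRUNCATE sales.product_value_history_mat;
    INSERT INTO sales.product_value_history_mat
      SELECT * FROM sales.product_value_history;
    RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

CREATE TRIGGER price_history_refresh
  AFTER INSERT OR UPDATE OR DELETE ON sales.price_history
  FOR EACH STATEMENT EXECUTE PROCEDURE sales.refresh_product_value_history();

CREATE TRIGGER cost_history_refresh
  AFTER INSERT OR UPDATE OR DELETE ON inventory.cost_history
  FOR EACH STATEMENT EXECUTE PROCEDURE sales.refresh_product_value_history();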

I can feel your boredom burning into my fingers through a quantum time warp in the net, so I’ll end this here. Yes, I’m leaving the materialized views implementation as a loose end (there are great resources on the net for this), but the main point was how to put Django in meaningful touch with a database schema that is both non-trivial and doesn’t use arbitrary integer values as primary keys.

tl;dr Instructions for Drupal7 on rpm-based Systems

I threw a few sites up for folks in need the other day, one of them being my own (lots of folks’ businesses were wrecked during the tsunami two months ago). From doing that I realized the Drupal 7 documentation for rpm-based systems is a bit lacking (especially in Japanese). There are a few reasons for this. For one thing, Drupal 6 is still what’s in the Fedora repositories (at least as of F15 — and it’s orphaned?!?). So the Fedora wiki has basic install instructions for 6, but not 7. The Fedora-ized version also places things in a symlinked area under /usr, which may be more secure, but it confuses new Drupal users about the “how to get rid of the example.com/drupal/” problem under Apache. And… lastly, SELinux requires a few adjustments to let httpd and Drupal work well together. Most users’ reaction to the first sign of SELinux issues is to hurriedly turn it off… emphasis on the hurr in “hurriedly” there, because that is stupid.

Even a lot of professional web developers do this, which should drive home the point I’ve made elsewhere that web people != systems people (though systems people might be capable web developers). As of this writing Ubuntu doesn’t even field SELinux by default and their millions of users don’t get chewed up as easily as Windows folks do, but the majority use case for Ubuntu is not (and should not be, imo) server deployment. But why risk it when you have such a powerful security tool right in front of you? For the enterprise I just don’t find it prudent to abandon such a great and easy tool. That’s like choosing not to learn iptables — which some folks have also opted out of.

So, without further ado, here is the quick and dirty to get Drupal 7 working on a Fedora/CentOS/RHEL type OS with SELinux intact:

[Please note these instructions assume three things: 1) a completely fresh minimal install, 2) you have control of the server, and 3) you are able to execute commands as root or through sudo. Also note that I have removed foreign language setup from this, as I doubt anyone who reads my blog really needs Japanese but me.]

[root@localhost example.com]# yum install postgresql postgresql-server php-pgsql php-xml \
    php-pear php-devel gcc zlib-devel libcurl-devel make wget httpd php-mbstring
[root@localhost ~]# cd /var/www/html
[root@localhost html]# wget http://ftp.drupal.org/files/projects/drupal-7.0.tar.gz
[root@localhost html]# tar -zxf ./drupal-7.0.tar.gz
[root@localhost html]# mv drupal-7.0 drupal7
[root@localhost html]# cd drupal7
[root@localhost drupal7]# pecl install pecl_http
[root@localhost drupal7]# pecl install uploadprogress
[root@localhost drupal7]# echo extension=http.so > /etc/php.d/php_http.ini
[root@localhost drupal7]# echo extension=uploadprogress.so  >> /etc/php.d/php_http.ini
[root@localhost drupal7]# service postgresql initdb
[root@localhost drupal7]# service postgresql start
[root@localhost drupal7]# chkconfig postgresql on
[root@localhost drupal7]# chkconfig httpd on
[root@localhost drupal7]# setsebool -P httpd_can_network_connect_db=1
[root@localhost drupal7]# setsebool -P httpd_can_sendmail=1
[root@localhost drupal7]# setsebool -P httpd_unified=1
[root@localhost drupal7]# cp sites/default/default.settings.php sites/default/settings.php
[root@localhost drupal7]# chmod 666 sites/default/settings.php
[root@localhost drupal7]# su postgres
bash-4.1$ createuser --pwprompt --encrypted --no-superuser --no-createdb drupal
Enter password for new role:
Enter it again:
Shall the new role be allowed to create more new roles? (y/n) n
bash-4.1$ createdb --encoding=UNICODE --owner=drupal drupaldb
bash-4.1$ exit
[root@localhost drupal7]# cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf.original
[root@localhost drupal7]# vi /var/lib/pgsql/data/pg_hba.conf

Add the following line at about line 71 or so, just after the local all all ident line (check first, don’t blindly dump this in with sed because this could all be wrong if you’re running a different version of Postgres or reading this far in the future):

host    drupaldb    drupal    127.0.0.1/32    md5

Remove the Apache and OS identification tags on server-generated error messages (such as the default Apache ### error messages):

[root@localhost drupal7]# vi /etc/httpd/conf/httpd.conf

Replace “ServerTokens OS” with “ServerTokens Prod”
Replace “ServerSignature On” with “ServerSignature Off”

And since I’m paranoid and use my servers only as servers (and prefer to send logs to a separate logging server), I also change “LogLevel Warn” to “LogLevel Info” and let my parsing scripts do the work of finding the important stuff. That makes forensics a lot easier later on down the road (though more compute intensive).

Add the following lines to your httpd.conf file if you are running multiple websites on a single server (on a single IP address that is hosting multiple domain names):

NameVirtualHost *:80

<VirtualHost *:80>
    ServerAdmin admin@example.com
    DocumentRoot /var/www/html/drupal7
    ServerName example.com
    ErrorLog logs/example.com-error_log
</VirtualHost>

<VirtualHost *:80>
    ServerName anothersite.example.com
    DocumentRoot /var/www/html/anothersite.example.com
</VirtualHost>

If you are not running virtual servers, or if you are running them in /home/user/public_html or whatever, adjust the way the file is written.

Now input the iptables rules necessary to open port 80 for web traffic from outside, and allow httpd to access Postgres:

[root@localhost drupal7]# iptables -vI INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
[root@localhost drupal7]# iptables -vI INPUT -m state --state NEW -m udp -p udp --dport 5353 -j ACCEPT

Now point a browser at the server and do your Drupal installation setup. Once you’re done there reset the permissions on sites/default/settings.php and remove the last iptables rule:

[root@localhost drupal7]# chmod 644 sites/default/settings.php
[root@localhost drupal7]# iptables -vD INPUT -m state --state NEW -m udp -p udp --dport 5353 -j ACCEPT

If everything went correctly you should be able to use Drupal 7 with SELinux in Enforcing mode, with your iptables intact aside from however you remote login for administration (SSH, if you use it; if you don’t, then close off port 22 and do chkconfig sshd off as well).

The next time you reboot you will notice you can still log in to your shell, but you can’t access the website with a browser. That is because the iptables rule fell off (rules don’t persist unless you tell them to). Once everything works the way it should, commit the iptables rule that is letting port 80 stay open:

iptables-save > /etc/sysconfig/iptables

And while we’re messing with security… let’s go ahead and turn off a PHP feature that the cracker wannabes have recently learned about: allow_url_fopen.

[root@localhost drupal7]# vi /etc/php.ini

Replace “allow_url_fopen On” with “allow_url_fopen Off”

It is very unlikely that you will need allow_url_fopen to be active, as very few modules use it (there is a Drupal-specific alternative to this, so it seems). Of course, if your site breaks it would be good to check whether you actually did need this, but otherwise I’d leave it turned off until things go wrong.

Now we need to restart Apache:

[root@localhost drupal7]# apachectl restart

Given that the biggest fans of tutorials such as these are the sort of folks who would never spend the time to research the meaning of all this themselves and given that you wouldn’t be reading this if you already knew how to do the above in the first place, I must remind you to head to your favorite search engine and do searches for things like “hardening sshd”, “hardening httpd”, “turning off root login”, “using public key encryption with sshd” and anything else that might strike your fancy (protip: read as much as you can about SELinux and iptables).