Recursive Lineage Queries in SQL and RyuQ

Without explaining every detail, here is a “proper” (as in, non-NULL dependent) method for handling tree data within a recursive query. In the example below table foo has an id, a parent and a dtg (date-time-group). Obviously real world examples will have more fields, but this illustrates the technique. This sort of thing is quite handy in a huge variety of situations as quite a lot of data is hierarchical in nature.

Adding a layer type attribute on the view is also a good way to accomplish things like, say, providing visual layout variables for threaded comments on a website, even if your (stupid) web framework doesn’t understand such things — you can bypass the framework garbage by defining a lineage-type view directly in the database and a “model” on top of it.

CREATE FUNCTION default_parent() RETURNS TRIGGER AS
  $$
  BEGIN
    IF NEW.parent IS NULL THEN
      NEW.parent := NEW.id;
    END IF;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

CREATE TABLE foo
 (id         SERIAL PRIMARY KEY,
  parent     integer REFERENCES foo ON UPDATE CASCADE ON DELETE SET NULL NOT NULL,
  created_on timestamptz DEFAULT current_timestamp NOT NULL);

CREATE TRIGGER foo_default_parent
  BEFORE INSERT ON foo
  FOR EACH ROW EXECUTE PROCEDURE default_parent();

CREATE TRIGGER foo_reset_parent
  BEFORE UPDATE ON foo
  FOR EACH ROW EXECUTE PROCEDURE default_parent();

CREATE VIEW lineage AS
  WITH RECURSIVE
  parents AS
    (SELECT * FROM foo WHERE parent = id),
  children AS
    (SELECT * FROM foo WHERE parent != id),
  tree AS
   (SELECT id,
           id::text AS branch,
           0 AS layer,
           parent
      FROM parents
   UNION ALL
    SELECT c.id,
           t.branch || '.' || c.id::text AS branch,
           t.layer + 1 AS layer,
           c.parent
    FROM children AS c, tree AS t
    WHERE c.parent = t.id)
  SELECT * FROM tree ORDER BY branch;

CREATE OR REPLACE FUNCTION show_descendants(parent integer) RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    WITH RECURSIVE
    parents AS
      (SELECT * FROM foo WHERE id = $1),
    children AS
      (SELECT * FROM foo WHERE parent != id),
    tree AS
     (SELECT id,
             id::text AS branch,
             0 AS layer,
             parent
        FROM parents
     UNION ALL
      SELECT c.id,
             t.branch || '.' || c.id::text AS branch,
             t.layer + 1 AS layer,
             c.parent
      FROM children AS c, tree AS t
      WHERE c.parent = t.id)
    SELECT * FROM tree ORDER BY branch;
  $$ LANGUAGE SQL STABLE;

CREATE OR REPLACE FUNCTION
  get_lineage(parent_id integer DEFAULT 0, hide_parent boolean DEFAULT false)
  RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    BEGIN
      IF parent_id > 0 AND hide_parent IS true THEN
        RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) OFFSET 1 $query$
          USING parent_id;
      ELSIF parent_id > 0 AND hide_parent IS false THEN
        RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) $query$
          USING parent_id;
      ELSE
        RETURN QUERY EXECUTE $query$ SELECT * FROM lineage $query$;
      END IF;
    END;
  $$ LANGUAGE plpgsql STABLE;

Let’s see some invocations:

insert into foo values (default);
insert into foo values (default);
insert into foo values (default);
insert into foo values (default, 1);
insert into foo (parent) values (1);
insert into foo (parent) values (1);
insert into foo (parent) values (1);
insert into foo (parent) values (2);
insert into foo (parent) values (3);
insert into foo (parent) values (3);
insert into foo (parent) values (4);
insert into foo (parent) values (5);
insert into foo (parent) values (7);
insert into foo (parent) values (7);
insert into foo (parent) values (10);
insert into foo (parent) values (15);

select * from lineage;
 id |   branch   | layer | parent 
----+------------+-------+--------
  1 | 1          |     1 |      1
  4 | 1.4        |     2 |      1
 11 | 1.4.11     |     3 |      4
  5 | 1.5        |     2 |      1
 12 | 1.5.12     |     3 |      5
  6 | 1.6        |     2 |      1
  7 | 1.7        |     2 |      1
 13 | 1.7.13     |     3 |      7
 14 | 1.7.14     |     3 |      7
  2 | 2          |     1 |      2
  8 | 2.8        |     2 |      2
  3 | 3          |     1 |      3
 10 | 3.10       |     2 |      3
 15 | 3.10.15    |     3 |     10
 16 | 3.10.15.16 |     4 |     15
  9 | 3.9        |     2 |      3
(16 rows)

select * from get_lineage(3);
 id |  lineage   | layer | parent 
----+------------+-------+--------
  3 | 3          |     0 |      3
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(5 rows)

select * from get_lineage(3, true);
 id |  lineage   | layer | parent 
----+------------+-------+--------
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(4 rows)

Notice that there are no NULLs. Most formulations of tree type data uses NULL to represent not having a parent — but this is wrong. NULL means something is unknown. If the top-level nature of an entity is known, then NULL is the wrong value and complicates any logic involving that field unnecessarily. If, instead, we define a top level entity as any entity which is its own parent, then we have no ambiguity in the structure, and we are providing a correct model. For this reason the parent column above is NOT NULL. If we wanted to provide the possibility of maintaining records where the parent was actually unknown, we could remove the NOT NULL constraint and unset the trigger.

That may sound nitpicky, but there are several places where this simplifies things or makes query execution downright unreliable because of a dependence on conditional logic shortcuts. Consider the common SQL function solution comparable to show_descendants() above:

CREATE OR REPLACE FUNCTION show_descendants(parent integer) RETURNS TABLE
   (id      integer,
    lineage text,
    layer   integer,
    parent  integer) AS
  $$
    WITH RECURSIVE tree AS
     (SELECT id,
             id::text AS branch,
             0 AS layer,
             parent
        FROM foo
        WHERE (parent IS NULL AND $1 = 0) OR (parent = $1)
     UNION ALL
      SELECT c.id,
             t.branch || '.' || c.id::text AS branch,
             t.layer + 1 AS layer,
             c.parent
      FROM children AS c, tree AS t
      WHERE c.parent = t.id)
    SELECT * FROM tree ORDER BY branch;
  $$ LANGUAGE SQL STABLE;

This works just fine in Postgres 9.2 — but that is a quirk of implementation. SQL is supposed to be declarative in nature, and as such there is no guarantee which side of the OR condition in WHERE will be executed first. It is cheaper to check the single condition on the right side than the AND condition on the left, so it is likely a future execution optimization will detect that and run the query this way. The problem is that the right side expression can be true in invalid cases. There is no way to write this query safely as long as NULL is a possible value for parent.

Another problem is that using a NULL parent value in any calculation will nuke the entire operation because of NULL propagation. Say you want to do “parent::text || foo” somewhere later. Surprise! Any NULL is going to kill the entire string. There are convoluted ways to avoid that, but we’re fighting against the nature of an unnatural data definition instead of providing a natural definition at the outset and letting the positive consequences follow.

An argument can be made that handling for-real parent values instead of just using NULL definitions complicates the code above unnecessarily. Its true, in SQL you have to define a trigger to handle the top-level case and be a little more specific in the view and function definitions that follow. But that’s the end of it. We’ve trapped the NULL and completely contained the complexity involved in modeling an accurate parent relationship, not mention guarded against the undefined nature of logical OR execution (this is a huge issue on its own). If we were to permit a NULL here and didn’t really intend for that to mean unknown, then we would have traded the convenience of writing a quick and dirty definition for letting the complex and unpredictable side effects escape our model and propagate throughout the rest of the schema and every application that touches this table. With that in mind it is obviously a better solution to pay the complexity tax up front and write the insert trigger and function definitions and nip the issue in the bud.

In RyuQ this can be defined as:

create [relation]
  foo
    attributes
      id         Serial
      parent     Integer
      created_on TimestampTZ
    conditions
      pk id
      fks
        parent -> foo.id
      defaults
        parent = id
        created_on = current_timestamp

And define some elabels to operate on the data:

tree = union
         project [id, lineage = id::text, layer = 1, parent]
           select [parent == id] foo
         project [id = foo.id,
                  lineage = tree.lineage + '.' + foo.id,
                  layer = tree.layer + 1,
                  parent = foo.parent]
           join
             tree
             select [parent != id] foo

branch ID =
       union
         project [id, lineage = id::text, layer = 1, parent]
           select [id == ID] foo
         project [id = foo.id,
                  lineage = branch.lineage + '.' + foo.id,
                  layer = branch.layer + 1,
                  parent = foo.parent]
           join
             branch
             select [parent != id] foo

headless ID = select [parent != id] branch ID

And that’s it. Notice that there are no NULLs here either; we are not using any MaybeType attributes. If we wanted to make the definition of a top-level entity “Something without a parent” then using the type NoneInteger would be correct. If we wanted to include the possibility that a parent relationship is unknown (but that this does not define a top-level entity) we could use MaybeInteger. If we want to include both then we should define a domain that includes integers, None, and _|_ as members, possibly calling it MaybeNoneInteger.

Invokation:

insert [foo]
  {id}
  (default), (default), (default)
  {parent}
  (1), (1), (1), (2), (3), (3), (4), (5), (7), (7), (10), (15)

tree
 id |   branch   | layer | parent 
----+------------+-------+--------
  1 | 1          |     1 |      1
  4 | 1.4        |     2 |      1
 11 | 1.4.11     |     3 |      4
  5 | 1.5        |     2 |      1
 12 | 1.5.12     |     3 |      5
  6 | 1.6        |     2 |      1
  7 | 1.7        |     2 |      1
 13 | 1.7.13     |     3 |      7
 14 | 1.7.14     |     3 |      7
  2 | 2          |     1 |      2
  8 | 2.8        |     2 |      2
  3 | 3          |     1 |      3
 10 | 3.10       |     2 |      3
 15 | 3.10.15    |     3 |     10
 16 | 3.10.15.16 |     4 |     15
  9 | 3.9        |     2 |      3
(16 rows)

branch 3
 id |  lineage   | layer | parent 
----+------------+-------+--------
  3 | 3          |     0 |      3
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(5 rows)

headless 3
 id |  lineage   | layer | parent 
----+------------+-------+--------
 10 | 3.10       |     1 |      3
 15 | 3.10.15    |     2 |     10
 16 | 3.10.15.16 |     3 |     15
  9 | 3.9        |     1 |      3
(4 rows)

Pretty simple. I find the RyuQ easier and more clear, but obviously I’m biased.

2 thoughts on “Recursive Lineage Queries in SQL and RyuQ

  1. joe.mazlin

    I didn’t realize that null propagates through expressions like that. I didn’t realize what is required to capture it at the definition level in SQL, either. How in the world has this not been corrected for so long? Most business software and websites are really based on a data standard with this kind of stupid hole in it?

    And why in the world isn’t RyuQ (which looks like its your own language) or something like it not taking over the data world?

    Reply
  2. mrn

    @joe
    why? because of inertia. This isn’t my style of sql, but the post is accurate and the null propagation thing is a very real problem and so is the need to selectively implement an sql standard where sometimes &-type comparisons work a certain way despite sql being supposedly a purely declarative language. As for the ryuq “taking over the world” it looks incomplete, and more importantly nobody knows about it or this website.

    @zxq9
    Hire a publicist or something. this really is the intellectual wilderness out here. not even twitter or fb? really?

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>