Without explaining every detail, here is a “proper” (as in, non-NULL dependent) method for handling tree data within a recursive query. In the example below table foo has an id, a parent and a dtg (date-time-group). Obviously real world examples will have more fields, but this illustrates the technique. This sort of thing is quite handy in a huge variety of situations as quite a lot of data is hierarchical in nature.
Adding a layer type attribute on the view is also a good way to accomplish things like, say, providing visual layout variables for threaded comments on a website, even if your (stupid) web framework doesn’t understand such things — you can bypass the framework garbage by defining a lineage-type view directly in the database and a “model” on top of it.
CREATE FUNCTION default_parent() RETURNS TRIGGER AS $ BEGIN IF NEW.parent IS NULL THEN NEW.parent := NEW.id; END IF; RETURN NEW; END; $ LANGUAGE plpgsql; CREATE TABLE foo (id SERIAL PRIMARY KEY, parent integer REFERENCES foo ON UPDATE CASCADE ON DELETE SET NULL NOT NULL, created_on timestamptz DEFAULT current_timestamp NOT NULL); CREATE TRIGGER foo_default_parent BEFORE INSERT ON foo FOR EACH ROW EXECUTE PROCEDURE default_parent(); CREATE TRIGGER foo_reset_parent BEFORE UPDATE ON foo FOR EACH ROW EXECUTE PROCEDURE default_parent(); CREATE VIEW lineage AS WITH RECURSIVE parents AS (SELECT * FROM foo WHERE parent = id), children AS (SELECT * FROM foo WHERE parent != id), tree AS (SELECT id, id::text AS branch, 0 AS layer, parent FROM parents UNION ALL SELECT c.id, t.branch || '.' || c.id::text AS branch, t.layer + 1 AS layer, c.parent FROM children AS c, tree AS t WHERE c.parent = t.id) SELECT * FROM tree ORDER BY branch; CREATE OR REPLACE FUNCTION show_descendants(parent integer) RETURNS TABLE (id integer, lineage text, layer integer, parent integer) AS $ WITH RECURSIVE parents AS (SELECT * FROM foo WHERE id = $1), children AS (SELECT * FROM foo WHERE parent != id), tree AS (SELECT id, id::text AS branch, 0 AS layer, parent FROM parents UNION ALL SELECT c.id, t.branch || '.' || c.id::text AS branch, t.layer + 1 AS layer, c.parent FROM children AS c, tree AS t WHERE c.parent = t.id) SELECT * FROM tree ORDER BY branch; $ LANGUAGE SQL STABLE; CREATE OR REPLACE FUNCTION get_lineage(parent_id integer DEFAULT 0, hide_parent boolean DEFAULT false) RETURNS TABLE (id integer, lineage text, layer integer, parent integer) AS $ BEGIN IF parent_id > 0 AND hide_parent IS true THEN RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) OFFSET 1 $query$ USING parent_id; ELSIF parent_id > 0 AND hide_parent IS false THEN RETURN QUERY EXECUTE $query$ SELECT * FROM show_descendants($1) $query$ USING parent_id; ELSE RETURN QUERY EXECUTE $query$ SELECT * FROM lineage $query$; END IF; END; $ LANGUAGE plpgsql STABLE;
Let’s see some invocations:
insert into foo values (default); insert into foo values (default); insert into foo values (default); insert into foo values (default, 1); insert into foo (parent) values (1); insert into foo (parent) values (1); insert into foo (parent) values (1); insert into foo (parent) values (2); insert into foo (parent) values (3); insert into foo (parent) values (3); insert into foo (parent) values (4); insert into foo (parent) values (5); insert into foo (parent) values (7); insert into foo (parent) values (7); insert into foo (parent) values (10); insert into foo (parent) values (15); select * from lineage; id | branch | layer | parent ----+------------+-------+-------- 1 | 1 | 1 | 1 4 | 1.4 | 2 | 1 11 | 1.4.11 | 3 | 4 5 | 1.5 | 2 | 1 12 | 1.5.12 | 3 | 5 6 | 1.6 | 2 | 1 7 | 1.7 | 2 | 1 13 | 1.7.13 | 3 | 7 14 | 1.7.14 | 3 | 7 2 | 2 | 1 | 2 8 | 2.8 | 2 | 2 3 | 3 | 1 | 3 10 | 3.10 | 2 | 3 15 | 3.10.15 | 3 | 10 16 | 3.10.15.16 | 4 | 15 9 | 3.9 | 2 | 3 (16 rows) select * from get_lineage(3); id | lineage | layer | parent ----+------------+-------+-------- 3 | 3 | 0 | 3 10 | 3.10 | 1 | 3 15 | 3.10.15 | 2 | 10 16 | 3.10.15.16 | 3 | 15 9 | 3.9 | 1 | 3 (5 rows) select * from get_lineage(3, true); id | lineage | layer | parent ----+------------+-------+-------- 10 | 3.10 | 1 | 3 15 | 3.10.15 | 2 | 10 16 | 3.10.15.16 | 3 | 15 9 | 3.9 | 1 | 3 (4 rows)
Notice that there are no NULLs. Most formulations of tree type data uses NULL to represent not having a parent — but this is wrong. NULL means something is unknown. If the top-level nature of an entity is known, then NULL is the wrong value and complicates any logic involving that field unnecessarily. If, instead, we define a top level entity as any entity which is its own parent, then we have no ambiguity in the structure, and we are providing a correct model. For this reason the parent column above is NOT NULL. If we wanted to provide the possibility of maintaining records where the parent was actually unknown, we could remove the NOT NULL constraint and unset the trigger.
That may sound nitpicky, but there are several places where this simplifies things or makes query execution downright unreliable because of a dependence on conditional logic shortcuts. Consider the common SQL function solution comparable to show_descendants() above:
CREATE OR REPLACE FUNCTION show_descendants(parent integer) RETURNS TABLE (id integer, lineage text, layer integer, parent integer) AS $ WITH RECURSIVE tree AS (SELECT id, id::text AS branch, 0 AS layer, parent FROM foo WHERE (parent IS NULL AND $1 = 0) OR (parent = $1) UNION ALL SELECT c.id, t.branch || '.' || c.id::text AS branch, t.layer + 1 AS layer, c.parent FROM children AS c, tree AS t WHERE c.parent = t.id) SELECT * FROM tree ORDER BY branch; $ LANGUAGE SQL STABLE;
This works just fine in Postgres 9.2 — but that is a quirk of implementation. SQL is supposed to be declarative in nature, and as such there is no guarantee which side of the OR condition in WHERE will be executed first. It is cheaper to check the single condition on the right side than the AND condition on the left, so it is likely a future execution optimization will detect that and run the query this way. The problem is that the right side expression can be true in invalid cases. There is no way to write this query safely as long as NULL is a possible value for parent.
Another problem is that using a NULL parent value in any calculation will nuke the entire operation because of NULL propagation. Say you want to do “parent::text || foo” somewhere later. Surprise! Any NULL is going to kill the entire string. There are convoluted ways to avoid that, but we’re fighting against the nature of an unnatural data definition instead of providing a natural definition at the outset and letting the positive consequences follow.
An argument can be made that handling for-real parent values instead of just using NULL definitions complicates the code above unnecessarily. Its true, in SQL you have to define a trigger to handle the top-level case and be a little more specific in the view and function definitions that follow. But that’s the end of it. We’ve trapped the NULL and completely contained the complexity involved in modeling an accurate parent relationship, not mention guarded against the undefined nature of logical OR execution (this is a huge issue on its own). If we were to permit a NULL here and didn’t really intend for that to mean unknown, then we would have traded the convenience of writing a quick and dirty definition for letting the complex and unpredictable side effects escape our model and propagate throughout the rest of the schema and every application that touches this table. With that in mind it is obviously a better solution to pay the complexity tax up front and write the insert trigger and function definitions and nip the issue in the bud.
In RyuQ this can be defined as:
create [relation] foo attributes id Serial parent Integer created_on TimestampTZ conditions pk id fks parent -> foo.id defaults parent = id created_on = current_timestamp
And define some elabels to operate on the data:
tree = union project [id, lineage = id::text, layer = 1, parent] select [parent == id] foo project [id = foo.id, lineage = tree.lineage + '.' + foo.id, layer = tree.layer + 1, parent = foo.parent] join tree select [parent != id] foo branch ID = union project [id, lineage = id::text, layer = 1, parent] select [id == ID] foo project [id = foo.id, lineage = branch.lineage + '.' + foo.id, layer = branch.layer + 1, parent = foo.parent] join branch select [parent != id] foo headless ID = select [parent != id] branch ID
And that’s it. Notice that there are no NULLs here either; we are not using any MaybeType attributes. If we wanted to make the definition of a top-level entity “Something without a parent” then using the type NoneInteger would be correct. If we wanted to include the possibility that a parent relationship is unknown (but that this does not define a top-level entity) we could use MaybeInteger. If we want to include both then we should define a domain that includes integers, None, and _|_ as members, possibly calling it MaybeNoneInteger.
Invokation:
insert [foo] {id} (default), (default), (default) {parent} (1), (1), (1), (2), (3), (3), (4), (5), (7), (7), (10), (15) tree id | branch | layer | parent ----+------------+-------+-------- 1 | 1 | 1 | 1 4 | 1.4 | 2 | 1 11 | 1.4.11 | 3 | 4 5 | 1.5 | 2 | 1 12 | 1.5.12 | 3 | 5 6 | 1.6 | 2 | 1 7 | 1.7 | 2 | 1 13 | 1.7.13 | 3 | 7 14 | 1.7.14 | 3 | 7 2 | 2 | 1 | 2 8 | 2.8 | 2 | 2 3 | 3 | 1 | 3 10 | 3.10 | 2 | 3 15 | 3.10.15 | 3 | 10 16 | 3.10.15.16 | 4 | 15 9 | 3.9 | 2 | 3 (16 rows) branch 3 id | lineage | layer | parent ----+------------+-------+-------- 3 | 3 | 0 | 3 10 | 3.10 | 1 | 3 15 | 3.10.15 | 2 | 10 16 | 3.10.15.16 | 3 | 15 9 | 3.9 | 1 | 3 (5 rows) headless 3 id | lineage | layer | parent ----+------------+-------+-------- 10 | 3.10 | 1 | 3 15 | 3.10.15 | 2 | 10 16 | 3.10.15.16 | 3 | 15 9 | 3.9 | 1 | 3 (4 rows)
Pretty simple. I find the RyuQ easier and more clear, but obviously I’m biased.
I didn’t realize that null propagates through expressions like that. I didn’t realize what is required to capture it at the definition level in SQL, either. How in the world has this not been corrected for so long? Most business software and websites are really based on a data standard with this kind of stupid hole in it?
And why in the world isn’t RyuQ (which looks like its your own language) or something like it not taking over the data world?
@joe
why? because of inertia. This isn’t my style of sql, but the post is accurate and the null propagation thing is a very real problem and so is the need to selectively implement an sql standard where sometimes &-type comparisons work a certain way despite sql being supposedly a purely declarative language. As for the ryuq “taking over the world” it looks incomplete, and more importantly nobody knows about it or this website.
@zxq9
Hire a publicist or something. this really is the intellectual wilderness out here. not even twitter or fb? really?