2013-05-16
What follows is a (still) disjointed, unnamed and nebulous query language concept. A bit more exists than what is represented here but has not been documented due to time constraints.

# Data Control Language

Not dreamed up yet. Whatever it does, though, should fit in with how Postgres does things, and probably just provide some shortcuts to commonly annoying operations in the standard DCL. The goal of any DCL definition must be to provide a programmatic (and obvious) way of automating the tedious nature of GRANT and REVOKE which seems to so often inspire DBAs to completely ignore authorization at the DB level itself and start writing (insane and usually flimsy) "access control" code in the application, with the application server role (or sometimes even the application!) having superuser access to the database.

# Data Definition Language

Basically any phrase that starts with [noun, name, colon]

Other than stripped down syntax, the real meat of changes here is to introduce a stronger, more explicit type/domain definition system. In particular, introduce Maybe domain types to make it explicit what a missing value's type is (NULL isn't sufficient), and in so doing get away from all this "shall we embrace, or whine, about NULLs?" nonsense.

Note that this does nothing to avoid ternary logic, it just makes it obvious where ternary logic is involved and provides alternatives to the nuclear NULL option by providing access to, say, a numerically typed _|_ value instead of just NULL. The other changes to NULL handling come into play with the query language, where we don't lose track of which unknown value came from where, so that the exact same unknown is equal to itself, but no others.

Define Integer and MaybeInteger as distinct domain types. That way these are different:


table foo:
num    Integer

table bar:
num    MaybeInteger



and the first is equivalent to:


CREATE TABLE foo
(num    integer NOT NULL);



All rows are unique, so all tables are created with the equivalent of a UNIQUE statement that captures every row, regardless the shape of the table and independent of primary keys (unless the whole row is the primary key, which happens sometimes). Ideally NULL != NULL, so the following two statements should be equivalent:


table foo:
a Integer PK
b MaybeInteger
c Text

CREATE TABLE foo
(a    integer PRIMARY KEY,
b    integer NULL,
c    text NOT NULL,
UNIQUE (a, b, c));



I'm leaving out UPDATE, INSERT, DELETE, UPSERT and ALTER at the moment because I don't have time to think about them just now. The overriding concern with these functions is that their syntax be similar. One benefit to that is that if UPDATE and INSERT have similar syntax, then UPSERT won't be arcane, and if its ordinary enough to be understood, it will be used. ALTER is a mixed bag of magic, because ALTER can change so much stuff that it should probably be broken down.

# Query Language

Nothing exciting. This is mostly just a text-friendly notation for relational algebra that is optionally parenthesized to be friendly for use in a REPL/commandline or semantically indented to be pretty in a source file. What is exciting about it is that un-exciting relational algebra is much easier to think through than nested statements in an arcane natural-language-ish syntax. A lack of sexy here renders complex constructs much easier to understand.

## Evaluation

The default operation when no other is given. Generally, this means to evaluate to a set and return it:


(table1)
table1
(label1)
label1



## Select

Takes an optional predicate. With no predicate it is equivalent in most cases to evaluation. Note that predicates can include other expressions (operations over a column, for example) as long as the end result is a boolean value. Non-boolean results are errors. NULL is treated as _|_ and evaluates to False (or specifically, it is Not True).


(select [a >= 10] (table1))
select(table1)
select [parent == 'Danny', age(dob) >= 18] table1



## Project

Column-picker. Generally, project [columns..] target, but there are some wilder invokations where an indicated column is a nested project() statement of its own.

Projection can occur by column inclusion or by column exclusion. To exclude columns the predicate list is "negated".


(project [a, b, c] (table1))
project -[a] (table1)
project [a, b] table1
project -[a, b]
table1
project [first_name, last_name]
select [parent == 'Danny', age(dob) >= 18] table1



## Product

R1 * R2 style cartesian product. Can accept an argument list of arbitrary length, though this of course makes the product grow at a (probably) undesirable rate.


(product (table1, table2))
product(table1, table2, label3)
product table1 table2 label3
product
table1
table2
label3



## Union

Basic set union. Obviously this has to follow set typing rules, etc...


(union (table1, table2, label3))
union(table1, table2, label3)
union table1 table2 label3
union
table2
table2
label3



## Difference

Set subtraction R1 - R2. (set(a) - set(b)) turns out to be quite useful in practice. This doesn't work well when the operands are bags instead of sets, though. It is an error to try diffing bags instead of sets (so you need a unique guarantee of some sort over each operand).


(difference (table1, table2))
difference(table1, table2)
difference table1 table2
difference
table1
table2



## Rename

A function name (rename()) and an operator "\". Explicitly changes column names in place. The "\" operator can be used within a project() predicate as well. Because anonymous columns and duplicate names are not allowed in results this must be used to assign proper names to derived values in a result list.

The presence of the "\" operator makes rename() redundant because the same thing can always be achieved with (project [a\b] (table1)). Note that neither rename() nor "\" removes access to the original, fully qualified base values -- it just adds an addtional label/pointer to the column (see join() for more on this).


(rename [a \ c, b \ d] (table1))
rename [a \ c, b \ d] (table1)
rename [a \ c, b \ d] table1
rename [a\c, b\d]
project [a, b] table1
project [a\c, b\d] table1
project [first_name, last_name, age(dob) \ years_old]
select [parent == 'Danny', age(dob) >= 18] table1



## Intersection

Basic intersection. R1 - (R1 - R2). Once again, typing rules are in play.


(intersect (table1, label2))
intersect(table1, label2)
intersect table1 label2
intersect
table1
label2



## Division

R1 / R2.


(division (table1, table2))
division(table1, table2)
division table1 table2
division
table1
table2



## Equi/Theta Join

R1 * R2 [R1.a = R2.a]

Essentially shorthand for doing a select over a renamed natural join over ... blah. Which is why this notation is useful. Note that the original base table is still a proper qualifier within a nested join. This permits us to (in theory) keep track of specific _|_ values so that a specific row's unknown can be equal to itself, but not other unknowns (as opposed to tanking the entire operation the way NULL propagation does). Columns subject to rename() can be referenced by their (simple) new name or by their fully qualified name in later joins. In other words, each join() does not produce a new table independent of the original values.


(join [table1.a==table2.b] (table1, table2))
join [a==b] (table1, table2)
join [table1.a >= avg(label2.b)] table1 label2
join [t1.a == t2.b]
t1
join [t2.c == t3.d] t2 t3
join [schema1.t1.a == foo]
t1
join [t2.c == t3.d]
rename [b\foo] t2
t3



An actual semijoin() and antijoin() are not included in this spec, but may be later. Anyway, they could be defined as follows:


semijoin R1 R2 = project [R1.*] (naturaljoin(R1, R2))
antijoin R1 R2 = project [R1.*] (difference(R1, naturaljoin(R1, R2)))



## Natural Join

R1 naturaljoin R2

Obviously this only works when columns/types match.

Natural join FIRST checks the data dictionary (db catalog, whatever) for FK relationships between arguments and join along them while ignoring column names. If no FK relationships are found then it will default to the SQL-ish NATURAL JOIN behavior and try to dumb-match along names/types (which seems far less useful, since most non-trivial schemas where natural join would be really useful don't have the same column names on both sides of PK-FK relationships).

Type mistmatch is an error -- no attempt at casting will be attempted.

(Should this degenerate to (product (table1, table2) in the case nothing is in common? Is that unexpected behavior despite being part of the algebraic definition? Is this set in stone in the algebra?)

Natural join can take more than two arguments, and tries to find a complete directed graph among its arguments that permits a FK-based join. If that fails, then it tries to go down the list in order, SQL style.


(naturaljoin (table1, table2))
naturaljoin(table1, label1)
naturaljoin table1 label2 label3
naturaljoin
table1
table2
label3



## Sort

Sorts based on a column name list or the result of a mapped function in the predicate. Default sort is ASCENDING according to the columns collation, DESCENDING is indicated by "negating" the column. Sort priority is from left to right, so the equivalent of SQL's SELECT * FROM foo1 ORDER BY a ASC, b, DESC, c; is (sort [a, -b, c] (foo1)).


(sort [a] (table1))
sort [-b, a] (table1)
sort [-age(dob)] table1
sort [age(dob), name]
label1



## Nest

A subsuming join.

Places a list of relations fitting some criteria from R2 into a new column on R1. One important thing to note is that the default name of the new column is the name of the superior table double-dot-concatenated to the name of the nested table. So accessing the nested value of b below would be "a..b" or "public.a..public.b" and accessing a single column on the nested b is "a..b.foo".


(nest [a.name == b.parent] (a, b))
nest [a.name == b.parent] (a, b)
nest [a.name == b.parent] a b



### Further clarification:

A join:

(join [a.name == b.parent] (a, b))
a.name    |   b.name   |   b.parent   |   b.dob
---------------------------------------------------
craig     |   anna     |    craig     |  1999-03-10
craig     |   selina   |    craig     |  2001-03-13
fred      |   john     |    fred      |  1985-12-07


A nest:

(nest [a.name == b.parent] (a, b))
a.name    |                           a..b
------------------------------------------------------------------------
craig     | [(anna, craig, 1999-03-10), (selina, craig, 2001-03-13)]
fred      | [(john, fred, 1985-12-07)]


A renamed nest:

(rename [a..b \ children] (nest [a.name == b.parent] (a, b)))
a.name    |                         children
------------------------------------------------------------------------
craig     | [(anna, craig, 1999-03-10), (selina, craig, 2001-03-13)]
fred      | [(john, fred, 1985-12-07)]


A projected nest:

(project [a.name, (project [name, dob] (a..b)) \ children] (nest [a.name == b.parent] (a, b)))
a.name    |                  children
----------------------------------------------------------
craig     | [(anna, 1999-03-10), (selina, 2001-03-13)]
fred      | [(john, 1985-12-07)]


Obviously it is cleaner to perform projection prior to the nesting action.

## Slice

Chop the result set into pieces.

Provides a way to implement the SQL OFFSET and LIMIT functionality, but in a more syntactically consistent way that is more familiar to developers:


(slice [:5] (table1))
slice [101:200] (label1)
slice [100:] (project [a, b] table1)
slice [101:] (slice [:200] (project [a,b] table1))



# Transactions

## Open/Begin a transaction

begin

## Open a labelled transaction


(begin (foo))
begin (foo)
begin foo



## Rollback/cancel out

rollback

## Rollback and cancel out of a labeled transaction


(rollback (foo))
rollback (foo)
rollback foo



## Create a savepoint


(savepoint (bar))
savepoint (bar)
savepoint bar



## Rollback just to the savepoint


(rollback (bar))
rollback (bar)
rollback bar



## Commit

Commit with no argument refers to the current transaction, commit with an argument can refer to a parent and will commit all intermediate transactions in the nested transaction chain.


commit
(commit)
(commit (foo))
commit (foo)
commit foo



## Nested Transactions

Transactions can be nested by opening a new one within another. Cancelling the outer transaction rollback and kills every child transaction in process.


begin foo
begin bar
rollback foo



is the same as


begin
begin
rollback
rollback



Savepoints and transaction labels share the same namespace. A rollback can only access its own transaction or a parent.

In this environment typical garbage collection doesn't work because it is not possible to tell when something goes "out of scope" and there is no terminal reference to indicate that a labeled object is now unused. So instead there are a few rules to memory use:

• All transient labels die at the end of the connection. To survive longer they would have to become stored procedures.
• A transaction defines a scope. Global labels are accessible from within a transaction, and labels defined within a parent transaction are accessible from within a child transaction, but not the other way around.
• When a transaction ends all associated labels are destroyed.
• An explicit del() function can clear a label
• Redifinition of a label frees whatever resources were tied to it if there are no other references to the underlying object.
• Caches are subject to more stringent rules of survival because they may be really large. A MAX_MEM and MAX_SIZE of is set in the environment. MAX_MEM indicates how much core can be allocated to cached items, and MAX_SIZE indicates how much scribble space is availble on disk to essentially save caches in temporary tables. Once MAX_SIZE is reached, caches are destroyed in FIFO fashion. If an attempt is made to cache a single result that is, by itself, larger than MAX_SIZE then the cache attempt fails with a warning, and the assignment is reverted to being a function assignment rather than a cache execution.
• Assignments that are invalidated by subsequent DDL statements are dropped, as are any dependent assignments.

cache1 := select [a > 100] really_huge_table
thingy1 = select [a > 100] really_huge_table
del cache1 thingy1



The square brackets that follow some keywords are lists of terms. Nothing more special. Most lists are optional and most operators have a defined behavior in absence of them. But including the list is always more explicit than not -- and guarantees conveyed meaning to a reader. It also ensures expected evaluation results in the event the execution environment has been changed or the definition of a function has been altered. Some functions have rules about their lists, though. For example, a select() or join() can only accept boolean values in its list, whereas rename() or project() can only accept labels.

Because all lists are always bracketed they can be formatted any way the user desires. Indentation and parenthetical checks begin/resume after the closing square bracket. This is a deliberate decision because, particularly in the case of huge, verbose project() calls (the equivalent of large SELECT a, b, c... FROM ([endless labrynth of sub-selects])) the column lists and predicate conditions and things can get unreadable really quickly and screw up the flow of SQL. I want to avoid that entirely by making the bracket list free-form and encourage users to write things like:


label1 = sort [last_name, first_name] (select [age(dob) >= 18] people)

project [first_name \ fn,      last_name \ ln,
age(dob) \ age,       birth_city,
hair_color,           eye_color,
height \ height_cm,   cm2in(height) \ height_in]
label1




project [first_name\fn,last_name\ln,age(dob)\age,birth_city,hair_color,eye_color,height\height_cm, cm2in(height)\height_in]
sort [last_name,first_name] (select [age(dob) >= 18] people)



# Comparison VS Assignment

Equality comparison is ==, inequality is != or /=, assignment is =. These are different from SQL because SQL doesn't really have proper assignment to begin with. Additionally, most application languages treat == as comparison and = as assignment, and switching the meaning in the middle of writing the same source file is silly. Better, methinks, to conform to the way most languages do things than try to impose my own system.

It strikes me as interesting that despite SQL's entirely free-form nature there is practially no such thing as "pretty" or "well-formed" non-trivial SQL code out there. The closest thing is hiding the ugliest bits in stored procedures and hoping you never have to look at them again.

Quoting follows the Postgres rules. Object selection is not quite the same, though. It is legal to reference a single column in dot-separated fashion:

schema1.table1.column1

Absolute references begin with a dot (and currently begin at the schema level, but this might change to the database in the future), references that begin with [:alnum:] are searched for either from the standard search path (and if not found error out) or searched for from the present working node (within most brackets this is a column list) and then if not found from the standard search path.


.schema1.table1.column1
table1.column1
project [column1] .schema1.table1



# Syntax and style

Parenthesization is optional, as is significant whitespace. That is to say, once you commit a phrase to parentisization whitespace will be disregarded, but without parenthesization each phrase will be evaluated assuming whitespace is significant. This way a REPL can be less annoying to implement, code generators can be built whichever way the host language lends itself most readily to, and people can indent however they want with parenthesized phrases, or go the "standard" semantic indentation way if they want for source:


(sort [a, -b] (project [a, b, c] union (t4, (join [t1.a=t3.e] (t3, (join [t1.a=t2.d] (t1, t2))), t5))))

sort [a, -b]
project [a, b, c]
union
t4
join [t1.a==t3.e]
t3
join [t1.a==t2.d] t1 t2
t5



Labels can be assigned to expressions and reused for the life of the connection (until the label is del()'d or assigned to something else). In a sense this makes everything that is assigned a sort of closure.


v1 = union (t4, join [t1.a==t3.e] (t3, join [t1.a==t2.d] (t1, t2)), t5)
v2 = project [a, b, c] v1
sort [a, b] v2



A lack of assignment indicates evaluation (and return output), an assignment with = indicates assignment only. v1 and v2 still exist and can be referenced again. Referencing them without assignment still indicates evaluation. The following will evaluate (project [a, b, c] v1) again:

v2

Each evaluation is fresh -- results are not cached. If something changed in the base fact tables that make up the v2 result the output will be different than it was the first time around. The := assignment operator can be used to cause caching of a result:

v3 := project [a, b, c] v1

The := operator indicates immediate evaluation and a cached result. The refresh command is shorthand for updating the cache variable without re-writing everything. So refresh v3 is equivalent to v3 := project [a, b, c] v1 again.