The Intellectual Wilderness There is nothing more useless than doing efficiently that which should not be done at all.

2012.08.27 11:55

Lean on Your Database

Filed under: Computing — Tags: , , , — zxq9 @ 11:55

Doing computations in the database is almost always the right answer. Writing your own database procedures instead of relying on an ORM framework is even better. These go hand in hand since ORMs don’t allow you enough control over your data schema to define calculations in the first place (most can’t even properly handle multi-column primary keys and instead invent meaningless integer ID for everything; this should tell you something). Many, maybe most (at least that I’ve met), web developers don’t even know what sort of calculations are available to be performed in the database because they’ve been taught, in an absolute vacuum of personal experience, that “SQL hard. Relational thinking hard. OOP good. Trust framework”. There is so much missing here.

Frameworks try to be “database agnostic”. This is fundamentally flawed thinking. This implies that the data layer is merely there for “persistence” and that the “persistence layer” can be whatever — all RDBMS systems “aren’t OO and therefore suck” and so any old database will do. If this is true then it follows that frameworks and applications should be designed to work against any database system and not delve too deeply into any specific feature sets — after all, the application functionality is the focus, not the data, right? This is exactly backwards.

Even forgetting that this condemns you to least-common-denominator data design, this is still exactly backwards. Let me put the right way on a line all by itself, because it just that important:

Data designs should strive to be application agnostic.

Data drives everything. Your functions are what you can change around easily, but your data schema is critical and represents everything about your system logic. If you show me a well-labeled data schema I can probably guess what you are trying to do, but if you show me just your functions and objects I’ll require either a code tour or a lot of familiarization time before getting anything serious done (that project documentation will be lacking is a truism not worth addressing here).

Consider that changing your app code is cake whereas changing your data schema is major project surgery. OOP has us so in a stupor that we think if we just get our objects right everything will be fine. It won’t. Ever. As long as you think that you’ll also believe other crap like that each object should map directly to a table. There are certain basic truths about certain types of data. It is striking that I can give a data requirement to two DBAs schooled on two different RDBMSes and ask for a normalized data model (let’s just say NF3 for argument) and get back two very similar looking schemas, but I can give a feature requirement to two Java programmers and get back radically different system designs.

This should tell us something. In fact, it screams the truth that data is a foundation from which you must work up toward the application code, not the other way around. The database layer is the most important place to make sound choices. The choice of database system itself should be based on project requirements, because that choice matters. Most critically, I’ll say it again here because it is so important and implies so much on contemplation: the database designs should strive to be application agnostic.

2012.08.19 21:38

Fred Brooks: Design Requirements

This overlooked resource is a talk that Fred Brooks gave to a very small audience June 4, 2007 — the first day of the WSOM Design Requirements Workshop. He does not discuss requirements or elicitation of requirements (which is what everybody else talked about) but rather discusses design as a process and how that is inseparable from the process of divining requirements. His primary points are:

  • You can’t know what you want to build until you already have something concrete. Everything is a throw-away until it is “good enough”.
  • Since you can’t know what you want to build, you don’t know the requirements, even after you start work. This indicates that the process of design and the process of divining requirements cannot be treated separately the way that academic literature treats the subjects.
  • The design process sometimes occurs in a more complex environment than in past years because of the late 20th century concept of “collaborative design”. Further, this is not the best design model for every project.

Unlike many other Fred Brooks talks there is a bit of audience interruption — sometimes with valid points, sometimes not — and you can observe both Fred’s unscripted response as well as others in the audience responding to audience comments and questions. It is interesting from a human perspective for this reason as well.

2012.08.14 05:52

Using Unmanaged Django Models of Postgres Views in a Pre-existing Schema

Filed under: Computing — Tags: , , , , , , — zxq9 @ 05:52

This discussion covers the way Django 1.4 and Postgres 9.1 work as of mid 2012. If you have trouble with this or very similar code let me know and we can post your examples.

Some of us like Python. Some of us are subjected to working with the Web, at least from time to time. Despite the fact that we would all love to write native applications against a sane API that we’re allowed to read in a single language in an environment we control with a single data handling layer, that’s just not always reality. We’d all love to commit to a solid data model that someone smart was permitted the time to work on before the FAD design crew got spun up. But sometimes we hit weird cases where there’s just not time or money or scheduling is stupid or there are unanticipated complications or the platform changes based on a buzzword or the situation doesn’t fit because you’re working behind a three-star programmer who quit last week. Ha ha, just kidding — I meant that’s all the time every time.

Since that’s the norm, I’m discussing using Django to make reporting faces for existing database applications. In particular database applications that rely on a non-trivial normalized or properly denormalized data model. A lot of folks dive straight to the “Hire a guy to build a Django/PHP/TG/Rails/whatever application with a separate database that syncs or queries-to-sync what it needs from the existing database” conclusion. This is wasteful since most of the time all we want is to display some tabular data along with some custom math on that data. Usually the data set is not enormous and going full-bore data warehouse prepared to carry billions of records in a large set of denormalized, parallel schema is way, way overkill.

So how do we do this with a web framework that uses an ORM (noooo~!) and expects a 1-to-1 mapping between classes and tables? We use views against unmanaged Django models.

When I say “views” I mean in the database sense, not the ubiquitous Django stuff. As in the good ole’ Postgres CREATE VIEW foo AS SELECT [stuff]. This could be materialized views, calculated views, a denormalized table triggered to always be current (see “materialized view”), etc. Database views, whatever the flavor, will serve our purposes.

A lot of folks get tripped up on this particular subject because of the way frameworks like Django and Rails nearly always rely on meaningless integers as primary keys by default, can’t handle multi-column natural keys and try to add an integer “ID” to any model even if its not the primary key and therefore completely superfluous. Any non-trivial, normalized data model won’t have integer primary keys and any DBA who administers such a data store won’t add them to satisfy your stupid web framework. So forget about adding integer primary keys, we’re not going to use them and adding them would be an exercise worthy of something the government would pay for. But that means that until true support for natural, multi-column keys in Django gets finalized we need a new approach (and anyone stuck working with Django at least up to 1.5 will need a different approach anyway). No worries, read on.

We’re going to create a view against an unmanaged Django model class, and feed Django a dummy “id” column to keep it quiet. Of course, we have to remember that the unmanaged model isn’t safe to use as a target for models.ForeignKey(), but as we’ll see this is a pretty simple limitation to work around since any existing data model will already have pre-defined primary keys. This is workable even if we need to treat the view as a real model in the future.

Note: this is a toy example for necessary reasons. I’m only writing this to illustrate a point, not to write a full-bore business application on my blog. That said, feel free to use these ideas or any code presented here in any way you like.

So let’s get a few tables down. Let’s say there is a cost/price margin history schema in a database that various applications use to keep track of how much stuff costs today and how much it used to cost at a given point in time. Let’s say also that some of the items in the list are for sale and the history of the sale price needs to be tracked in a similar manner to the costs of things. Since a lot of applications access the data store and many of them are designed either strictly for sales/ordering-only or cost/inventory-only the cost and price histories are completely independent. Different applications can already tell the managers and analysts what the margin is right now, but all this historical data is difficult to compare over time because its tedious for humans to run the comparisons by hand, and a real pain to keep straight even in a spreadsheet application.

Sounds like a perfect job for a computer! Further, since this is tabular data display and essentialy a document-export function it is a perfect fit for something that is accessible from a browser.

So far we’ve got two tables on the costing/inventory side: inventory.item and inventory.cost_history. They look like this (note, this isn’t quite the way psql will display the table data, but just go with it):

                Table "inventory.item"
     Column     |       Type         |    Modifiers
 nsn            | varchar(50)        | primary key
 is_active      | boolean            | not null

                         Table "inventory.cost_history"
     Column     |       Type         |              Modifiers
 item           | varchar(50)        | foreign key (item.nsn), unique together with start_date
 start_date     | timestamp          | primary key with "item"
 end_date       | timestamp          |
 value          | money              | not null

Over on the product/pricing side we have two more tables to compliment those above: sales.product and sales.price_history. They look like this:

                Table "sales.product"
     Column     |       Type         |    Modifiers
 item           | varchar(50)        | foreign key (item.nsn), unique
 sku            | varchar(30)        | primary key
 name           | varchar(100)       | not null
 is_active      | boolean            | not null

                         Table "sales.price_history"
     Column     |       Type         |              Modifiers
 product        | varchar(50)        | foreign key (sales.sku), unique together with start_date
 start_date     | timestamp          | primary key with "product"
 end_date       | timestamp          |
 value          | money              | not null

Enormously simple — to the point that if I hadn’t explained the scenario you’d wonder why we even have a “sales.product” table. But let’s just accept that not every item in the company’s inventory is for sale, and those that are need to be treated differently so tying everything to inventory.item records isn’t a good idea. (It bears mentioning here that sales.price_history.product could just as easily point to inventory.item.nsn as sales.product.sku: natural connectivity with the “real” NSN vs more frequent reference by/to SKU from application queries.)

So this is the existing schema we’re hoping to write a small dashboard-ish web thingy in Django against without modification. How do we do it without driving the DBA nuts, making one query per iteration, burying a lot of huge raw() queries in our application code, or forcing things into some nooberiffic “One Class, to rule them all” paradigm? First we need something Django can hold on to deal with on its own terms — something that behaves enough like a real table to pass a sort of duck check.

We need a concept of “things for sale”. We don’t care about the items not for sale in this case, because the whole point is demonstrating a margin history — so we can, for the purposes of our dashboard — lump sales.product and inventory.item together:

class ProductRecord(models.Model):
    nsn       = models.CharField('NSN', max_length=50)
    sku       = models.CharField('SKU', max_length=30)
    name      = models.CharField(_('Name'), max_length=100)
    is_active = models.BooleanField(_('Active'))

    class Meta:
        db_table = 'sales\".\"product_record'
        managed = False

Pretty predictable so far. But it could use a better name than sales.product_record. I suck at naming things without thinking a bit first. With that in mind, note that I named the class “ProductRecord” to stay in tune with the db_table value. At the moment we’re just writing a dashboard, but if that requirement grows later on you’d hate to remember that in every other app “Product” refers to sales.product but in this app its something different because you already used the name “Product” for something else. We’ve lumped together a lot of pretty obvious data in one place from sales.product and inventory.item. Now let’s lump the money stuff together:

class ProductValueHistory(models.Model):
    sku        = models.CharField('SKU', max_length=30)
    cost       = models.DecimalField(_('Cost'), max_digits=15, decimal_places=5)
    price      = models.DecimalField(_('Price'), max_digits=15, decimal_places=5)
    start_date = models.DateTimeField(_('Start Date'))
    end_date   = models.DateTimeField(_('End Date'))

    class Meta:
        db_table = 'sales\".\"product_value_history'
        managed = False

Also fairly predictable. (The DecimalField thing and fixed-point math for money handling is a subject for another time.) You probably noticed the lack of a Django models.ForeignKey on both of these models. We don’t have them because we don’t have anything to tie them to that Django can understand because of that silly litter integer-as-a-primary-key assumption that nearly every ORM seems to universally make. We could add it, but that would require altering the original tables to accommodate this dashboard, and that goes against the principles of being a good guest in someone else’s digital house. Besides, a lot of applications access this data store — doing anything that changes things up could have ripple down effects all over the place; much better to forget about all that mess. Never let your tools drive your design — otherwise you’re being a tool for the sake of a cosmic “in Soviet Russia…” joke.

We could declare something else to be the primary key, but that would only affect how Django would generate SQL table creation code when running syncdb, and since we’re never going to run that on an unmanaged model and Django tries to add an integer ID to everything whether or not you’ve already got a natural primary key defined, that’s pointless

Now lets write our views for the database. This works in Postgres 9.1. I don’t know about other databases — and if you’re doing this in MySQL you probably don’t want to do things this way (hint: you should migrate):

CREATE VIEW sales.product_record AS
      row_number() OVER (ORDER BY AS id,
      i.nsn AS nsn, s.product AS sku, AS name, s.is_active AS is_active
      inventory.item AS i, sales.product AS s
      i.nsn = s.item;

Now we’ve got a “fake” table in the database that Django thinks is a real one. It even has an “id” column generated for us by the row_number() window function. This is silly, and only present to fool Django into accepting our model, but compared to how a lot of other ORMs work, this is a pretty small ORM tax to pay. The price goes up a little if we want to be allowed to do insertions and modifications from the Django class to this view instead of the real tables (we’d have to write rules), but even that isn’t so hard.

So now let’s get what we want out of the price/cost history combination. I’m not 100% happy with the way this query turns out, to be honest (partly because I deliberately made this situation a little complex by not making sales.price_history reference the inventory.item.nsn at all, so we require an extra join), but it does work fine — and there is a way around even the inefficiency in my (probably bad) SQL view code here:

CREATE VIEW sales.product_value_history AS
      row_number() OVER (ORDER BY p.start_date DESC, c.start_date DESC) AS id,
      p.sku AS sku,
        WHEN p.start_date < c.start_date
          THEN p.start_date
          ELSE c.start_date
        AS start_date,
        WHEN p.end_date < c.end_date
          THEN p.end_date
          ELSE c.end_date
        AS end_date,
      c.value AS cost,
      p.value AS price
      sales.price_history AS p,
      ( SELECT
            product.sku AS sku,
            cost.value AS value,
            cost.start_date AS start_date,
            cost.end_date AS end_date
          FROM sales.product AS product, inventory.cost_history AS cost
          WHERE product.item = cost.item) AS c
        p.product = c.sku
        (   (p.start_date, p.end_date)
            (c.start_date, c.end_date)
          (p.end_date IS NULL AND c.end_date IS NULL));

The query represented by this view goes a touch beyond what the Django ORM provides access to, but isn’t that hard to understand. The subquery where we join sales.product and inventory.cost_history is the link that provides us the necessary connection between a product’s SKU and its parent item’s NSN and returns a table called c. Joining that to the sales.price_history table for matching SKUs gives us all the costs and prices associated with a given product (and no results for items in inventory that are not for sale), and the extra WHERE clause using OVERLAPS lines up our price histories so we don’t have NULL-value gaps across spans of time when either the cost or price changed but the other didn’t.

We did the same “fake id” trick in this query using the row_number() window function so we can use Django’s ORM to pull results from this table like any other model. Because the window function already sorted the results in descending order, we don’t need to sort the results to know they are in chronological order.

Now where to put this bit of SQL? Of course, check the Django docs, but as of Django 1.4 the .sql files should go in a directory located at project/app/sql/ . Once there it should execute when you run syncdb — and if it doesn’t or you need to re-do something manually you can invoke it from within psql quite easily by doing \i /path/to/project/app/sql/filename.sql. (If you do it manually from within psql, remember to ALTER VIEW view_name OWNER TO django_role_name or Django probably won’t have permission to query it.)

So how do we use it to get a history for a specific product? Since we don’t have any primary key/foreign key relations set up in Django, we can’t just do:

product = ProductRecord.objects.get(id=number)
value_history = product.price_history.all()

This is a limitation many ORMs have, but we can work around it easily enough since the database itself has other primary keys that are guaranteed to be unique and therefore safe to use with object.get() or get_object_or_404():

product = ProductRecord.objects.get(nsn=value)
value_history = ProductValueHistory.objects.filter(sku=product.sku)

The results are already sorted so we can also do the following without any order_by() clause in Django:

current_value = ProductValueHistory.objets.filter(sku=product.sku)[0]

This gets us right up to the point of needing to write the math in to actually get margin totals and calculate percentages and make pie charts and all the other stuff business types like to do with this kind of data. At this point you can do all that in your Django view in Python (very easy and comfortable) or write a few more columns into the views that do this processing for you before it ever leaves the database, or even write entirely new views and new models that do whatever it is that you want.

Faced with the options above, when there is no obvious right answer I prefer to put it into the database as calculated columns or new views and get a complete answer per query instead of processing raw data in the application code. I (usually) take this approach because processing in the application makes that logic unavailable to any other applications that access the same data store which might want the same information in the future, thus leaving the writer of those other applications (probably me) with nothing left but to reinvent the wheel to get the exact same result (and looking ahead at maintenance centralizing logic is always better, whether in a library, database or whatever). Another good reason I’ve found to do things this way is to avoid accidentally writing an iterative processing routine at the application level that calls the database on each iteration of an arbitrarily large loop (I know you think “Yikes!” but people do this all over the place without even realizing it!).

But this view would be rather inefficient on very large rows of tables because its not eliminating rows based on an indexed item before it does the join processing and also because the OVERLAPS bit is pretty compute intensive on large sets. Indexes on the view can mitigate that to some degree, but there will come a point when materialized views/denormalized tables trump having faster hardware.

I’m constantly amazed at how fast hardware is these days and how much Postgres performance increases with each release, but that’s no excuse for a rather blatant query inefficiency that can be fixed easily. Its also absolutely no help for people stuck using Postgres < 7.1 or the legions of people stuck with MySQL or the poor fools stuck using a crap, deliberately crippled “home” or “small business” version of a proprietary database.

There are two possible ways out of this. You can write a raw() SQL query into Django (sometimes the easy answer), or you can make sales.product_value_history into a real table in the database that updates itself whenever the sales.price_history or inventory.cost_history tables are modified.

The second idea is the most interesting and involves a deliberate denormalization, which in this case I think is probably appropriate. This is called making a “materialized view”. Its available as a native feature in DB2 and Oracle, but not in Postgres just yet (I think 9.2 or 9.3, probably 2013 or 2014). However, in Postgres we can write triggers which keep our price/cost history table updated automatically whenever either of the sponsoring tables is modified. This ability is why while DBAs love conveniences like built-in materialized views features like this tend to take a lower priority than the serious stuff like JOIN efficiency, window functions and query optimizations. A word of caution: it is pretty common for folks whose first experience with databases like Postgres was through an ORM framework to try keeping tables synced by writing routines in at the application level — but this is a bad idea and defeats the purpose of using the database as a layer of abstraction. Leaky abstractions suck the further you travel with them and always remind me of the Turkish proverb “No matter how far you’ve gone down the road, turn back.”

I can feel your boredom burning into my fingers through a quantum time warp in the net, so I’ll end this here. Yes, I’m leaving the materialized views implementation as a loose end (there are great resources on the net for this), but the main point was how to put Django in meaningful touch with a database schema that is both non-trivial and doesn’t use arbitrary integer values as primary keys.

2012.08.12 17:27

On Looping and Recursion

Filed under: Computing,Science & Tech — Tags: , , , , , , — zxq9 @ 17:27

This post was prompted by a discussion about various programming languages on the Scientific Linux Forum. The discussion is in the member’s sub-forum, so I can’t link it here very effectively.

wearetheborg wrote:

I don’t understand the statement “…develop a sense time being slices along the t-axis (similar to thinking transactionally) instead of dealing in terms of state.” Can you elaborate on this? I have been envisioning state as the state of the various variables and objects.

That is the way most people think of state. This is normal in procedural and structured programming languages (everything from assembler to Fortran to C to Java). It doesn’t have to always work that way, though.

Consider a web request. HTTP is stateless by design. We’ve backhacked the bejeezus out of it to try making it have state (cookies, AJAX, etc) but in the end we’re always fighting against that protocol instead of just accepting that documents aren’t application interfaces. But remember the original Perl CGI stuff? That was all based on the way databases treat time — as transaction points along the t-axis. This is very different than inventing the notion of ongoing state. So a user fills out a form, and when he submits it there is a database transaction that completely succeeds or completely fails. Each request for a page view generates a new page via a function whose input is the request URL (and whatever detailed data lies within the GET string), which then calls another function whose input is a complete response from the database containing a snapshot of data from a point in time, and whose output is the web page requested.

There is no necessity for state here. The input to the function is the request URL string. The function breaks that down and returns a complete output (which could be an error message). But there is nothing ambiguous about this and executing the same functions with the same inputs would always return the exact same output, every time. There are no objects carrying state between requests. There are no variables that live beyond the request -> response lifetime.

“But timestamps and things might change and if someone updates a page then a request before and after would be different!” you might say. And this is true (and indeed the whole point of such CGI scripts) but all that is external to the functions themselves. The content of the pages, including the timestamps, page templates and record data, are input to the functions, not persistent state variables that are evolving with time. The functions don’t care what the inputs are, they only need to do their job — the data is a completely separate concern (remembering this will minimize fights with your DBA, by the way… if you know an asshole of a DBA, consider that he’s probably not trying to be an asshole, but rather trying to help you simplify your life despite yourself).

All this functions-that-don’t-carry-state business means is that we don’t need variables, we need symbolic assignment. Further, its OK if that symbolic assignment is immutable, or even if that assignment never happens and is merely functions passing their results (or themselves) along to one another.

This is the foundation for transactional thinking and it treats time as slices or points along the t-axis instead of a stateful concept that is ongoing. Its also the foundation of quite a bit of functional programming. There are no variables internal to the functions themselves that have any influence on the output of the program. Every case of a given input results in an exactly defined output, every time. This simplifies programming in general and debugging in particular.

Its not the best for everything, of course. In simulations and games you need a concept of state to cover the spans between transactions or time-slicing periods. For example, in a game you are fighting a mob. The fight isn’t instant, it is a process of its own and the immediate gameplay is based on the state carried in the various objects that make up your character, the mob, equipped items, world objects, etc. But it doesn’t really matter in a grander sense whether or not each strike or point of time in the fight is actually transacted to the data persistence layer — if the server crashed just then you’d probably prefer to not log in back to the middle of a boss fight while your screen is still loading, or half your party isn’t logging in at the same time with them, etc.

So the fight is its own discreet subsystem which is intimately concerned with state and OOP design principles and is therefore completely expendable — it matters not if you won or lost that particular fight if the server crashes in the middle of it. The overall world, however, is designed based on transactional concepts of time slices — it does matter what the last non-combat status and position snapshot of your character was, and you probably care quite a bit about certain potentially monumental events like when your character binds himself to some epic loot, level up or apply a new skill point (that better go in the database, right?).

The vast majority of user applications aren’t games or simulations, though. Almost everything on the web is text string manipulation and basic arithmetic. Almost everything in business applications development amounts to the same. So we don’t need stateful functions and objects, in fact we get confused by them (the exception here being the interface itself — objects make awesome widgets and windows). If I make an object to represent, say, a customer invoice, and I’m doing calculations on that invoice or within it using methods, to really bug test that object I have to test every method under every possible condition, which means every permutation of state possible. If that object is a subclass of anther object, or interacts with another object in the system, like say a customer record object or a discount coupon object, I have to test both against each other in every combination of possible states. That’s impossible in the lifetime of this universe, so nobody does it. The closest we come is testing a tiny subset of the most likely cases out of a tiny subset of what’s possible, and then we are constantly surprised at lingering bugs users report in modules later because they passed all the tests (or even dumber, we aren’t surprised at all, which says something about how blindly we stick to methodology in the face of contrary evidence).

But we don’t need a stateful concept for, say, a customer invoice. We need a snapshot of what it looked like before, and what we want it to look like next. This “next” result (which is just the output of a function), once confirmed (transacted in the database) becomes the next snapshot and you discard the intermediate concept of “state” entirely. Line item calculations and changes should be functions that operate per line on input data from that line. Invoice sums, averages, discounts, etc. should be functions concerned only with relevant input data as well, namely the aggregate result of each line item.

None of this involves a stateful requirement and shouldn’t involve stateful objects because that state is an unnecessary complication to the system. It also causes problematic architectural questions (is an invoice an object? is a line item an object? are they both objects? what do they inherit from? do they have a common ancestor? how do we make relational data sorta fit with our object model?) What are all these class declarations and the attendant statefulness actually doing for you? Nothing but permitting you to write yourself into a hole with mistaken side-effects (oops, that method wasn’t checking to clear the “discount” boolean in this object before it does

Even more interesting, sticking with the web example above and moving it forward to 2012 where everyone is mad for Django and Rails, these ORM-based frameworks almost never maintain persistent state objects between calls. When they do it often causes a peculiar type of unique-per-thread bug. So what is all this business about class/model declarations in the first place, since objects are first and foremost data structures with behaviors? These class declarations are trying very hard to be CREATE TABLE and ALTER TABLE SQL commands but without the benefit of actually being SQL commands — in other words, they are a weak form of data dictionary that sacrifice both the clean presentation of S-expressions or YAML trees and the full power of your favorite database’s native language.

Dealing with the database on its own terms and letting your functions be stateless makes it enormously easier to test your system because each function has a knowable output for each knowable input. It does mean that you must have a small constellation of functions to do the job, but you were going to write at least as many methods against your objects anyway (and often duplicate, or very nearly duplicate methods across non-sibling objects that do very nearly the same thing — or even worse, inherit several layers deep in a ripped fishing net pattern, which makes future changes to parent classes a horror). If you are dealing with data from a database trying to force your relational data into objects is poor thinking in most cases anyway, not least because none of that object architecture gets you one iota closer to accomplishing your task, despite all the beautiful design work that has to go into making relational data work with OOP. Unless you’re writing a simulation (as in, a MUD, an MMORPG or a flight simulator) there is no benefit here.

The easiest way to envision slices of time for people who got into programming later than 1994 or so is usually to recall the way that database-to-webpage functions worked in the CGI days as referenced above. They don’t maintain state, and if they do it is just as static variables which exist long enough to dump content into a webpage substitution template before being deallocated (and if written differently this wouldn’t always be necessary, either — the following are not the same: return (x + 1);, return x++;, x = x + 1; return x;). The only thing that matters to such functions is the input, not any pre-existing state. You can, of course, include cookies and SSL tokens and Kerberos tickets and whatnot in the mix — they merely constitute further input, not stateful persistence as far as the function itself is concerned.

There are some consequences to this for functional programs. For one thing, most loops wind up working best as recursively defined functions. In a imperative OOP language like Java this is horrible because each iteration requires instantiating a new object with its own state and therefore becomes a wild resource hog and executes really slow. In a functional language, however, tail-recursive functions often perform better than the alternative loop would. The reason for this is that (most) functional languages are optimized for tail-recursion — this means that a recursive function that calls itself as the last step in its evaluation doesn’t need to return state to its previous iterations; it is dependent only on its input. The machine can load the function one time and jump to its start on each iteration with a new value (whatever the arguments are) in the register without even needing to hit the cache, mess with iterator variables, check, evaluate or reform state, etc. (Note that this is possible in languages like C as well, but usually requires use of the “evil” goto statement. In higher level languages, however, this sort of thing is usually completely off limits.)

Let’s look at a stateful countdown program of the type you’d probably write on the first day of class (small caveat here: I’ve never had the privilege to attend a class, but I’m assuming this is the sort of thing that must go on the first day):

#include <stdio.h>

int main()
    int x = 10;

    while (x > 0)
        printf("%d\n", x);

    return 0;

Of course, this could be done in a for() loop or whatever, but the principle is the same. Looping is king in C, and for a good reason. Here is an equivalent program in Guile (a Scheme interpreter that’s just one “yum install guile” away if you want to play with it):

(define (countdown x)
    (display x)
    (if (> x 1)
      (countdown (1- x))
      (display "Blastoff!\n"))))

(countdown 10)

In this program there is no loop, there is a call to itself with an argument equal to the initial argument decremented by 1, but we are not actually dealing with variables at all here.

In fact, the expression (1- x) does not change the value of x at all, because it is merely an expression which returns “the result of ‘decrement x by 1′” (an equivalent would be (- x 1), which may or may not be equal in efficiency depending on the lisp environment) and not an assignment statement. (There are assignment statements in most functional languages, and they are primarily (ab)used by folks coming to functional languages from non-functional ones. Not saying you never want variables, but more often than not its more sensible to deal with expressions that say what you mean exactly once than stateful variables).

Being a toy example I’m using a “begin” statement to guarantee execution order in the interest of formatting text output. This sort of thing isn’t present in most functions, since whatever is required for the return value will be executed. Here we are generating an output side effect. We could eliminate the “begin” in the interest of being “more functional” while still emitting something to stdout as a side effect, but might be harder to read for someone coming from an imperative background:

(define (countdown x)
  (display (string-append (number->string x) "\n"))
  (if (> x 1)
    (countdown (1- x))
    (display "Blastoff!\n")))

If displaying an integer argument with a newline appended (or more generally, any argument value with a newline appended) was a common requirement (like for log files) the line beginning with display would become its own function, like display\n or something and be called more naturally as (display\n x).

This is a very simple example of how stateless recursion works in place of loops. You can implement loops in functional languages, which is usually a bad idea, or you can implement tail-recursive functions in most imperative languages, which is also usually a bad idea (especially when it involved objects, unless you start inlining assembler in your C with the appropriate jumps/gotos… which isn’t worth the trouble compared to a loop). Having demonstrated the equivalence of loops and recursion for most purposes, it bears mentioning that within the lisp community (and probably others) using tail-recursive functions in place of while()/for() loops is so common that very often when someone says “loop” what they mean is a recursive loop, not an iterative loop.

I say this to illustrate that when comparing languages its not about whether you can do something in language X or not, but whether a certain style of thinking matches the way the compiler or execution environment work. Step outside of the pocket that the language or runtime has built for you and you wind up very quickly in a nasty, panicky, twisty place where unmaintainable hacked up speed optimizations start feeling necessary. The futility of such speed optimizations is usually evidence that you’ve existed the pocket of your language, and is why old timers often are heard saying things like “if you need all those speed hacks, you need to reconsider your design overall” — its not because there is a speedier way to make that one method or function work, per se, but more typically what they are getting at is that you are thinking against the paradigms or assumptions that your language or environment was designed around.

In cases not involving simulation I find that carrying lots of state in variables, structs and the like complicates even simple things unnecessarily and makes testing confusing by comparison to stateless functions. OOP is even worse in this regard because you’ve got functionality intertwined with your state vehicles, and inheritance makes that situation even more convoluted. When dealing with lots of state stack traces can’t give you a clear picture of what happens every time with any given input, only what happened this time, because understanding what happened when you’re dealing with state is not as simple as looking at the arguments, you have to consider the state of the environment and objects within it at the moment of execution. This gets to the heart of why I recommend thinking of time as slices or points or snapshots instead of as continuous state, and is a very important idea to understand if you want to grok functional programming.

Powered by WordPress