
Using Unmanaged Django Models of Postgres Views in a Pre-existing Schema

This discussion covers the way Django 1.4 and Postgres 9.1 work as of mid 2012. If you have trouble with this or very similar code let me know and we can post your examples.

Some of us like Python. Some of us are subjected to working with the Web, at least from time to time. Despite the fact that we would all love to write native applications against a sane API that we’re allowed to read in a single language in an environment we control with a single data handling layer, that’s just not always reality. We’d all love to commit to a solid data model that someone smart was permitted the time to work on before the FAD design crew got spun up. But sometimes we hit weird cases where there’s just not time or money or scheduling is stupid or there are unanticipated complications or the platform changes based on a buzzword or the situation doesn’t fit because you’re working behind a three-star programmer who quit last week. Ha ha, just kidding — I meant that’s all the time every time.

Since that’s the norm, I’m discussing using Django to make reporting faces for existing database applications; in particular, database applications that rely on a non-trivial normalized or properly denormalized data model. A lot of folks dive straight to the “Hire a guy to build a Django/PHP/TG/Rails/whatever application with a separate database that syncs or queries-to-sync what it needs from the existing database” conclusion. This is wasteful, since most of the time all we want is to display some tabular data along with some custom math on that data. Usually the data set is not enormous, and going full-bore data warehouse prepared to carry billions of records in a large set of denormalized, parallel schemas is way, way overkill.

So how do we do this with a web framework that uses an ORM (noooo~!) and expects a 1-to-1 mapping between classes and tables? We use views against unmanaged Django models.

When I say “views” I mean in the database sense, not the ubiquitous Django views.py stuff. As in the good ol’ Postgres CREATE VIEW foo AS SELECT [stuff]. This could be materialized views, calculated views, a denormalized table triggered to always be current (see “materialized view”), etc. Database views, whatever the flavor, will serve our purposes.

A lot of folks get tripped up on this particular subject because of the way frameworks like Django and Rails nearly always rely on meaningless integers as primary keys by default, can’t handle multi-column natural keys and try to add an integer “ID” to any model even if it’s not the primary key and therefore completely superfluous. Any non-trivial, normalized data model won’t have integer primary keys and any DBA who administers such a data store won’t add them to satisfy your stupid web framework. So forget about adding integer primary keys; we’re not going to use them, and adding them would be an exercise worthy of something the government would pay for. But that means that until true support for natural, multi-column keys in Django gets finalized we need a new approach (and anyone stuck working with Django at least up to 1.5 will need a different approach anyway). No worries, read on.

We’re going to create a view against an unmanaged Django model class, and feed Django a dummy “id” column to keep it quiet. Of course, we have to remember that the unmanaged model isn’t safe to use as a target for models.ForeignKey(), but as we’ll see this is a pretty simple limitation to work around since any existing data model will already have pre-defined primary keys. This is workable even if we need to treat the view as a real model in the future.

Note: this is a toy example for necessary reasons. I’m only writing this to illustrate a point, not to write a full-bore business application on my blog. That said, feel free to use these ideas or any code presented here in any way you like.

So let’s get a few tables down. Let’s say there is a cost/price margin history schema in a database that various applications use to keep track of how much stuff costs today and how much it used to cost at a given point in time. Let’s say also that some of the items in the list are for sale and the history of the sale price needs to be tracked in a similar manner to the costs of things. Since a lot of applications access the data store and many of them are designed either strictly for sales/ordering-only or cost/inventory-only, the cost and price histories are completely independent. Different applications can already tell the managers and analysts what the margin is right now, but all this historical data is difficult to compare over time because it’s tedious for humans to run the comparisons by hand, and a real pain to keep straight even in a spreadsheet application.

Sounds like a perfect job for a computer! Further, since this is tabular data display and essentially a document-export function, it is a perfect fit for something that is accessible from a browser.

So far we’ve got two tables on the costing/inventory side: inventory.item and inventory.cost_history. They look like this (note, this isn’t quite the way psql will display the table data, but just go with it):

                Table "inventory.item"
     Column     |       Type         |    Modifiers
----------------+--------------------+-----------------
 nsn            | varchar(50)        | primary key
 is_active      | boolean            | not null

                         Table "inventory.cost_history"
     Column     |       Type         |              Modifiers
----------------+--------------------+--------------------------------------------------------
 item           | varchar(50)        | foreign key (item.nsn), unique together with start_date
 start_date     | timestamp          | primary key with "item"
 end_date       | timestamp          |
 value          | money              | not null

Over on the product/pricing side we have two more tables to complement those above: sales.product and sales.price_history. They look like this:

                Table "sales.product"
     Column     |       Type         |    Modifiers
----------------+--------------------+-------------------------------
 item           | varchar(50)        | foreign key (item.nsn), unique
 sku            | varchar(30)        | primary key
 name           | varchar(100)       | not null
 is_active      | boolean            | not null

                         Table "sales.price_history"
     Column     |       Type         |              Modifiers
----------------+--------------------+----------------------------------------------------------
 product        | varchar(50)        | foreign key (product.sku), unique together with start_date
 start_date     | timestamp          | primary key with "product"
 end_date       | timestamp          |
 value          | money              | not null
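
If it helps to see the schema as DDL, here is a rough sketch of how these four tables might be declared. The column types come straight from the listings above, but the exact constraint spellings are my guesses, not a dump of anyone’s real schema:

CREATE TABLE inventory.item (
    nsn       varchar(50) PRIMARY KEY,
    is_active boolean     NOT NULL
);

CREATE TABLE inventory.cost_history (
    item       varchar(50) REFERENCES inventory.item (nsn),
    start_date timestamp,
    end_date   timestamp,
    value      money       NOT NULL,
    PRIMARY KEY (item, start_date)
);

CREATE TABLE sales.product (
    item      varchar(50)  UNIQUE REFERENCES inventory.item (nsn),
    sku       varchar(30)  PRIMARY KEY,
    name      varchar(100) NOT NULL,
    is_active boolean      NOT NULL
);

CREATE TABLE sales.price_history (
    product    varchar(50) REFERENCES sales.product (sku),
    start_date timestamp,
    end_date   timestamp,
    value      money       NOT NULL,
    PRIMARY KEY (product, start_date)
);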

Enormously simple — to the point that if I hadn’t explained the scenario you’d wonder why we even have a “sales.product” table. But let’s just accept that not every item in the company’s inventory is for sale, and those that are need to be treated differently, so tying everything to inventory.item records isn’t a good idea. (It bears mentioning here that sales.price_history.product could just as easily point to inventory.item.nsn as to sales.product.sku: it’s a trade-off between natural connectivity with the “real” NSN and the more frequent reference by SKU in application queries.)

So this is the existing schema we’re hoping to write a small dashboard-ish web thingy in Django against without modification. How do we do it without driving the DBA nuts, making one query per iteration, burying a lot of huge raw() queries in our application code, or forcing things into some nooberiffic “One Class, to rule them all” paradigm? First we need something Django can hold on to and deal with on its own terms — something that behaves enough like a real table to pass a sort of duck test.

We need a concept of “things for sale”. We don’t care about the items not for sale in this case, because the whole point is demonstrating a margin history — so we can, for the purposes of our dashboard, lump sales.product and inventory.item together:

from django.db import models
from django.utils.translation import ugettext_lazy as _

class ProductRecord(models.Model):
    nsn       = models.CharField(_('NSN'), max_length=50)
    sku       = models.CharField(_('SKU'), max_length=30)
    name      = models.CharField(_('Name'), max_length=100)
    is_active = models.BooleanField(_('Active'))

    class Meta:
        # The escaped quotes make Django emit sales"."product_record,
        # which Postgres reads as a schema-qualified table name.
        db_table = 'sales\".\"product_record'
        managed  = False
Pretty predictable so far. But it could use a better name than sales.product_record. I suck at naming things without thinking a bit first. With that in mind, note that I named the class “ProductRecord” to stay in tune with the db_table value. At the moment we’re just writing a dashboard, but if that requirement grows later on you’d hate to have to remember that in every other app “Product” refers to sales.product but in this app it’s something different because you already used the name “Product” for something else. We’ve lumped together a lot of pretty obvious data in one place from sales.product and inventory.item. Now let’s lump the money stuff together:

class ProductValueHistory(models.Model):
    sku        = models.CharField(_('SKU'), max_length=30)
    cost       = models.DecimalField(_('Cost'), max_digits=15, decimal_places=5)
    price      = models.DecimalField(_('Price'), max_digits=15, decimal_places=5)
    start_date = models.DateTimeField(_('Start Date'))
    end_date   = models.DateTimeField(_('End Date'))

    class Meta:
        db_table = 'sales\".\"product_value_history'
        managed  = False

Also fairly predictable. (The DecimalField thing and fixed-point math for money handling is a subject for another time.) You probably noticed the lack of a Django models.ForeignKey on both of these models. We don’t have them because we don’t have anything to tie them to that Django can understand, thanks to that silly little integer-as-a-primary-key assumption that nearly every ORM seems to make. We could add it, but that would require altering the original tables to accommodate this dashboard, and that goes against the principles of being a good guest in someone else’s digital house. Besides, a lot of applications access this data store — doing anything that changes things up could have ripple effects all over the place; much better to forget about all that mess. Never let your tools drive your design — otherwise you’re being a tool for the sake of a cosmic “in Soviet Russia…” joke.

We could declare something else to be the primary key, but that would only affect the table-creation SQL Django generates when running manage.py syncdb. Since we’re never going to run that on an unmanaged model, and since Django tries to add an integer ID to everything whether or not you’ve already got a natural primary key defined, that’s pointless.

Now let’s write our views for the database. This works in Postgres 9.1. I don’t know about other databases — and if you’re doing this in MySQL you probably don’t want to do things this way (hint: you should migrate):

CREATE VIEW sales.product_record AS
  SELECT
      row_number() OVER (ORDER BY s.name) AS id,
      i.nsn AS nsn, s.sku AS sku, s.name AS name, s.is_active AS is_active
    FROM
      inventory.item AS i, sales.product AS s
    WHERE
      i.nsn = s.item;

Now we’ve got a “fake” table in the database that Django thinks is a real one. It even has an “id” column generated for us by the row_number() window function. This is silly, and only present to fool Django into accepting our model, but compared to how a lot of other ORMs work, this is a pretty small ORM tax to pay. The price goes up a little if we want to be allowed to do insertions and modifications from the Django class to this view instead of the real tables (we’d have to write rules), but even that isn’t so hard.
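
For the curious, an insert rule on the view might look something like the sketch below. It is only a sketch (no conflict handling, and a real schema would have more columns than these toy tables), but it shows the flavor:

CREATE RULE product_record_insert AS
  ON INSERT TO sales.product_record
  DO INSTEAD (
    -- Parent row first, so the foreign key on sales.product is satisfied.
    INSERT INTO inventory.item (nsn, is_active)
      VALUES (NEW.nsn, NEW.is_active);
    INSERT INTO sales.product (item, sku, name, is_active)
      VALUES (NEW.nsn, NEW.sku, NEW.name, NEW.is_active);
  );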

So now let’s get what we want out of the price/cost history combination. I’m not 100% happy with the way this query turns out, to be honest (partly because I deliberately made this situation a little complex by not making sales.price_history reference the inventory.item.nsn at all, so we require an extra join), but it does work fine — and there is a way around even the inefficiency in my (probably bad) SQL view code here:

CREATE VIEW sales.product_value_history AS
  SELECT
      row_number() OVER (ORDER BY p.start_date DESC, c.start_date DESC) AS id,
      p.product AS sku,
      CASE
        WHEN p.start_date < c.start_date
          THEN p.start_date
          ELSE c.start_date
      END
        AS start_date,
      CASE
        WHEN p.end_date < c.end_date
          THEN p.end_date
          ELSE c.end_date
      END
        AS end_date,
      c.value AS cost,
      p.value AS price
    FROM
      sales.price_history AS p,
      ( SELECT
            product.sku AS sku,
            cost.value AS value,
            cost.start_date AS start_date,
            cost.end_date AS end_date
          FROM sales.product AS product, inventory.cost_history AS cost
          WHERE product.item = cost.item) AS c
    WHERE
        p.product = c.sku
      AND
        (   (p.start_date, p.end_date)
          OVERLAPS
            (c.start_date, c.end_date)
        OR
          (p.end_date IS NULL AND c.end_date IS NULL));

The query represented by this view goes a touch beyond what the Django ORM provides access to, but isn’t that hard to understand. The subquery where we join sales.product and inventory.cost_history is the link that provides us the necessary connection between a product’s SKU and its parent item’s NSN and returns a table called c. Joining that to the sales.price_history table for matching SKUs gives us all the costs and prices associated with a given product (and no results for items in inventory that are not for sale), and the extra WHERE clause using OVERLAPS lines up our price histories so we don’t have NULL-value gaps across spans of time when either the cost or price changed but the other didn’t.

We did the same “fake id” trick in this query using the row_number() window function so we can use Django’s ORM to pull results from this table like any other model. And because the window function assigned those ids in descending date order, the results come back newest-first: reverse chronological order without any sorting on our end. (Strictly speaking a view doesn’t guarantee row order, so if you’re paranoid, order by id.)

Now where to put this bit of SQL? Of course, check the Django docs, but as of Django 1.4 the .sql files should go in a directory located at project/app/sql/ . Once there it should execute when you run manage.py syncdb — and if it doesn’t or you need to re-do something manually you can invoke it from within psql quite easily by doing \i /path/to/project/app/sql/filename.sql. (If you do it manually from within psql, remember to ALTER VIEW view_name OWNER TO django_role_name or Django probably won’t have permission to query it.)

So how do we use it to get a history for a specific product? Since we don’t have any primary key/foreign key relations set up in Django, we can’t just do:

product = ProductRecord.objects.get(id=number)
value_history = product.price_history.all()

This is a limitation many ORMs have, but we can work around it easily enough since the database itself has other primary keys that are guaranteed to be unique and therefore safe to use with objects.get() or get_object_or_404():

product = ProductRecord.objects.get(nsn=value)
value_history = ProductValueHistory.objects.filter(sku=product.sku)

The results are already sorted so we can also do the following without any order_by() clause in Django:

current_value = ProductValueHistory.objects.filter(sku=product.sku)[0]

This gets us right up to the point of needing to write the math in to actually get margin totals and calculate percentages and make pie charts and all the other stuff business types like to do with this kind of data. At this point you can do all that in your Django view in Python (very easy and comfortable) or write a few more columns into the views that do this processing for you before it ever leaves the database, or even write entirely new views and new models that do whatever it is that you want.
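
To illustrate the in-Python route, a dashboard view along these lines is about all it takes. This is a minimal sketch: the app name, URL parameter and template path are made up for the example, and the zero-cost guard is there because dividing by a zero cost would otherwise blow up:

from django.shortcuts import get_object_or_404, render

from dashboard.models import ProductRecord, ProductValueHistory

def margin_history(request, nsn):
    product = get_object_or_404(ProductRecord, nsn=nsn)
    history = ProductValueHistory.objects.filter(sku=product.sku)

    rows = []
    for h in history:
        margin = h.price - h.cost
        # Decimal arithmetic throughout; skip the percentage on zero-cost rows.
        pct = (margin / h.cost * 100) if h.cost else None
        rows.append({'record': h, 'margin': margin, 'margin_pct': pct})

    return render(request, 'dashboard/margin_history.html',
                  {'product': product, 'rows': rows})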

Faced with the options above, when there is no obvious right answer I prefer to put it into the database as calculated columns or new views and get a complete answer per query instead of processing raw data in the application code. I (usually) take this approach because processing in the application makes that logic unavailable to any other applications that access the same data store and might want the same information in the future, leaving the writer of those other applications (probably me) nothing to do but reinvent the wheel for the exact same result (and looking ahead at maintenance, centralizing logic is always better, whether in a library, the database or wherever). Another good reason to do things this way is to avoid accidentally writing an iterative processing routine at the application level that calls the database on each pass of an arbitrarily large loop (I know you think “Yikes!”, but people do this all over the place without even realizing it!).
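
As an example of the calculated-columns approach, the margin math can ride along in another thin view layered over the one we already have (sales.product_margin_history is a hypothetical name; the cast is needed because dividing one money value by another yields a double precision in Postgres):

CREATE VIEW sales.product_margin_history AS
  SELECT *,
         price - cost AS margin,
         -- NULL rather than a division-by-zero error on zero-cost rows.
         CASE WHEN cost = '0'::money THEN NULL
              ELSE round(((price - cost) / cost * 100)::numeric, 2)
         END AS margin_pct
    FROM sales.product_value_history;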

But sales.product_value_history would be rather inefficient on very large tables, because it’s not eliminating rows on an indexed column before it does the join processing, and also because the OVERLAPS bit is pretty compute intensive on large sets. Indexes on the underlying tables can mitigate that to some degree (a plain view can’t carry indexes of its own), but there will come a point when materialized views/denormalized tables trump having faster hardware.

I’m constantly amazed at how fast hardware is these days and how much Postgres performance increases with each release, but that’s no excuse for a rather blatant query inefficiency that can be fixed easily. It’s also absolutely no help for people stuck on Postgres versions before 8.4 (no window functions there), or the legions of people stuck with MySQL, or the poor fools stuck using a crap, deliberately crippled “home” or “small business” version of a proprietary database.

There are two possible ways out of this. You can write a raw() SQL query into Django (sometimes the easy answer), or you can make sales.product_value_history into a real table in the database that updates itself whenever the sales.price_history or inventory.cost_history tables are modified.

The second idea is the more interesting one and involves a deliberate denormalization, which in this case I think is probably appropriate. This is called making a “materialized view”. It’s available as a native feature in DB2 and Oracle, but not in Postgres just yet (I think 9.2 or 9.3, probably 2013 or 2014). However, in Postgres we can write triggers which keep our price/cost history table updated automatically whenever either of the sponsoring tables is modified. That ability is why, while DBAs love conveniences like built-in materialized views, features like this tend to take a lower priority than the serious stuff like JOIN efficiency, window functions and query optimizations. A word of caution: it is pretty common for folks whose first experience with databases like Postgres was through an ORM framework to try keeping tables synced by writing routines at the application level — but this is a bad idea and defeats the purpose of using the database as a layer of abstraction. Leaky abstractions suck the further you travel with them and always remind me of the Turkish proverb “No matter how far you’ve gone down the road, turn back.”
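
To give a taste of the trigger approach, here is a bare sketch. It assumes the view above gets renamed to sales.product_value_history_src and a real table with the view’s columns takes over the name sales.product_value_history, so the Django model keeps working unchanged. A production version would rebuild only the affected product’s rows instead of the whole table:

CREATE OR REPLACE FUNCTION sales.refresh_value_history() RETURNS trigger AS $$
BEGIN
  -- Crude but correct: rebuild the denormalized table from the view logic
  -- whenever either sponsoring table changes.
  TRUNCATE sales.product_value_history;
  INSERT INTO sales.product_value_history
    SELECT * FROM sales.product_value_history_src;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER price_history_refresh
  AFTER INSERT OR UPDATE OR DELETE ON sales.price_history
  FOR EACH STATEMENT EXECUTE PROCEDURE sales.refresh_value_history();

CREATE TRIGGER cost_history_refresh
  AFTER INSERT OR UPDATE OR DELETE ON inventory.cost_history
  FOR EACH STATEMENT EXECUTE PROCEDURE sales.refresh_value_history();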

I can feel your boredom burning into my fingers through a quantum time warp in the net, so I’ll end this here. Yes, beyond the rough sketch above I’m leaving the materialized views implementation as a loose end (there are great resources on the net for this), but the main point was how to put Django in meaningful touch with a database schema that is both non-trivial and doesn’t use arbitrary integer values as primary keys.

A Note on “Web Applications”

This is a subject worthy of a series of its own, but a short scolding will have to do for now…

The “Web” is not an applications environment. At all. Web security is a joke. The whole idea is a joke. That’s like trying to have “newspaper security”, because the whole point of HTTP is untrusted, completely public, unthrottled publication of textual data, and that’s it. Non-textual data is linked in, not even native within a document (ever heard of, say, an <img> tag, or that whole series that starts <a something="">?), and the various programs that can use HTTP to read HTML and related markup can (or can’t) fetch the extra stuff, but the text is the core of it. Always.

People get wrapped around the axle trying to develop applications that run within a browser and always hit funny walls when trying to deliver interactivity. We’ve got a whole constellation of backhacks to HTTP now that are used to pretend that a website can be an application — in fact at this point I think probably more time is spent working within this kludged edge case than any other specific area of computing, and that’s really saying something. It says a lot about the need for an applications protocol that really can do the things we wish the Web could do and it speaks volumes about how little computing is actually understood by the majority of people who call themselves developers today.

If “security” means being selective about who can see what, then you need to not be using the web at all. Websites are all-or-nothing, whether we delude ourselves that we can put layers of “security” backhacks over it to act against its nature or not.

If “security” means being selective about who can make changes to the data presented on a website, you need to make modifications to data by means other than web forms. And yes, I mean you should be doing the “C”, “U” and “D” parts of CRUD from within a real program working over a protocol that is made for this purpose (and therefore inherently securable, not hacked up to be sort-of securable like HTTPS) and ditch your web admin interfaces*. That means the Web can be used as a presentation face for your data, as it was intended (well, dynamic page generation is a bit of a hack of its own, but it doesn’t fundamentally alter the concept underlying the design of the protocols involved), and your data input and interactivity faces are just applications other than the browser. Think World of Warcraft, the game, the WoW Armory, and the WoW Forums. Very different faces to match very different use cases and security contexts, but all are connected by a unified data concept centered on character.

Piggybacking ssh is a snap, and the nearly forgotten ASN.1 encoding is actually easier to write against than JSON or XML-RPC, though you might have to write your own library handler (this can be true even for JSON, depending on your environment).

[*And a note here: the blog part of this site is currently running on WordPress, and that’s a conscious decision against having real security, based on what I believe the use case to be and the likelihood of someone taking a hard look at screwing my site over. This approach requires moderation and a vigilant eye. My business servers don’t work this way at all and can serve pages built from DB data and serve static pages built from other data, but that’s it — there is no web interface that permits data alteration at all on those sites I consider to be actually important. Anyway, even our verbiage is screwed up these days. Think about how I just used the word “site”. (site != server != service != application)]

I can hear you whining now, “But that requires writing real applications that can touch the database or file system or whatever and that’s *hard* and requires actual study! And maybe I’ll have to write an actual data model instead of rolling with whatever crap $ORM_FRAMEWORK spatters out and maybe that means I have to actually know something about databases and maybe that means I’ll discover that MySQL is not good at handling normalized data and that might lead me to use Postgres or DB2 on projects and management and marketing droids don’t understand things that aren’t extensively gushed over by the marketing flacks at EC trade shows and… NOOOoooo~~~!” But I would counter that all the layers of extra bullshit that are involved in running a public-facing “Web 2.0” site today are way more complicated, convoluted and extremely removed from a sane computing stack and therefore, at this point, much more involved, costly and “hard” to develop and maintain than real applications are.

In fact, since all of your “web applications” are built on top of a few basic parts, while in between most web developers use gigantic framework sets that are larger, harder to learn and far shorter-lived than the underlying stack, your job security and your application’s security will both improve if you actually learn the underlying SQL, protocol workings, and interfaces involved in the utility constellation that comprises your “application”, instead of leaving everything up to the mystery of whatever today’s version of Rails, Dreamweaver or $PRODUCT is doing to/with your ideas.

I’m not bashing all frameworks, some of them are helpful and anything that speeds proper development is probably a good thing (and “proper” is a deliberately huge and gray area here — which is why development is still more art than science and probably always will be). I am, however, bashing the ideas that:

  1. The web is an applications development environment
  2. Powertools can be anything other than necessarily complex

At this point the mastery burden for web development tools in most cases outstrips the mastery burden for developing normal applications. This means they defeat their own purpose, but people just don’t know that because so few have actually developed a real application or two and discovered how that works. Instead most have spent lots of time reading up on this or that web framework and showing off websites to friends and marks telling them they’re working on a “new app”. Web apps feel easier for reasons of familiarity, not because they are actually less complex.

Hmm… this is a good sentence. Meditate on the last sentence in the above paragraph.

The dangerously seductive thing about web development is that the web is very good at giving quick, shallow results measurable in pixels. The “shallow and quick” part being the dangerous bit and the “pixels” being the seductive bit. There is a reason that folks like Fred Brooks have insinuated that the drive to “just show pixels” is the bane of good programming and have called pixels and pretty screens “chicken lipstick”. Probably the biggest challenge to professional software development is the fact that if you’re really developing original works, screens are usually the last thing you have to show, after a huge amount of the work is already done. Sure, you can show how the program logic works and you can trace through routines in an interpreter if you’re using a script language somewhere in the mix like Python or Scheme, but that sort of concrete progress from a programmer’s perspective doesn’t connect with most clients’ concerns or imaginations. And so we have this drive to show pixels for everything.

Fortunately I’m the boss where I work so I can get around this to some degree, but it’s extremely easy to see how this drive to show pixels starts dicking things up right from the beginning of a project. It’s very tempting to tell a client “yeah, give me a week and I’ll show you what things should look like”. Because that’s easy. It’s a completely baseless lie, but it’s easy. As a marketer, project manager, or anyone else who contacts clients directly, it is important to learn how to manage expectations while still keeping the client excited. A lot of that has to do with being blunt and honest about everything, and also explaining at least a little of how things actually work. That way instead of saying “I’ll show you how it will look” you can say “I’ll show you a cut-down of how it will work” and then go write a script that does basically what your client needs. If you train them to look for the right things from an interpreter and show them a working model instead of just screens (which are completely functionless for all they know), then as a developer you get a chance to do two awesome things at once: prototype and discover early on where your surprise problems are probably going to come from, and, if you work hard enough at it, write your “one to throw away” right out of the gate at customer-sale prototyping time.

But web screens are still seductive (and the only thing most people read books about these days, if they read at all), and the mid-term future of general computing looks every bit as stupid, shallow and rat-race driven as the Web itself.

I can’t wait to destroy all this.

Haha, just kidding… but seriously…

A Note to a Web Retailer about JavaScript-or-Else Pages

I was cruising computer hardware retailers on the web today because I’ve got to make a few orders. Not so many that I need to pester the manufacturers directly, but a large order compared to what most folks buy online with a credit card. Specifically, I need to replace one motherboard and build six new computers (complete with screens, etc.). Checking through a few shops I decided to order from one of the top 3 online stores for computer gear — and was rejected by their website for even trying to browse the product listings because I don’t have JavaScript enabled by default.

This got me to thinking about how stupid that was. JavaScript isn’t necessary for any of that site’s core functionality, particularly when just browsing. I hadn’t put anything in the cart, hadn’t done anything that requires “interactivity” (a laughable concept in a website if you know anything about the history of port 80, but whatever) or anything like that — I was just checking specs and prices, which worked just fine, but every single page is set to a five-second redirect delay if and only if the customer doesn’t have JavaScript enabled.

So I wrote the webmaster a note — and moved on to make my purchase somewhere else. Here’s the note:

Your website works just fine without JavaScript enabled but after a few seconds redirects *everyone* with JS disabled to a “turn JS on or you can’t browse our site” page. There is irony here because most people who browse the web with it disabled by default are IT professionals like myself who
1- Order huge amounts of hardware online and
2- Already know that they have JS disabled and how to enable it when they want to.

So that’s the situation and my complaint. Here’s a suggestion to solve it:
Test for JS and redirect to the JS-or-Else page only when the user hits ordering or shopping cart dialogues or something else to which JS is actually central. (Almost nothing, in fact, really requires JS; it’s just the only way a lot of web developers know how to think at this point.)

Compromising by making your site broadly accessible and then only forcing JS when the user actually needs a client-side script to run is a much better design policy than doing what you’re doing currently — which is drive customers elsewhere when they are just beginning to browse.

It’s too easy to go somewhere that doesn’t do that, and so that is what I’m doing now after writing you this note. Consider that I’d seriously have thought about turning JS on if I’d gone so far as to pick out items, fill the cart up, etc. and had *then* been told that I can’t process the purchase without JS turned on.

tl;dr Instructions for Drupal7 on rpm-based Systems

I threw a few sites up for folks in need the other day, one of them being myself (lots of folks’ businesses were wrecked during the tsunami two months ago). From doing that I realized the Drupal 7 documentation on rpm-based systems is a bit lacking (especially in Japanese). There are a few reasons for this. For one thing, Drupal 6 is still what’s in the Fedora repositories (at least as of F15 — and it’s orphaned?!?). So the Fedora wiki has basic install instructions for 6, but not 7. The Fedora-ized version also places things in a symlinked area under /usr, which may be more secure, but it confuses new Drupal users about the “how to get rid of the example.com/drupal/” problem under Apache. And… lastly, SELinux requires a few adjustments to let httpd and Drupal work well together. Most users’ reaction to the first sign of SELinux issues is to hurriedly turn it off… emphasis on the hurr in “hurriedly” there, because that is stupid.

Even a lot of professional web developers do this, which should drive home the point I’ve made elsewhere that web people != systems people (though systems people might be capable web developers). As of this writing Ubuntu doesn’t even field SELinux by default, and their millions of users don’t get chewed up as easily as Windows folks do, but the majority use case for Ubuntu is not (and should not be, imo) server deployment. But why risk it when you have such a powerful security tool right in front of you? For the enterprise I just don’t find it prudent to abandon such a great and easy tool. That’s like choosing not to learn iptables — which some folks have opted out of as well.

So, without further ado, here is the quick and dirty to get Drupal 7 working on a Fedora/CentOS/RHEL type OS with SELinux intact:

[Please note these instructions assume three things: 1) a completely fresh minimal install, 2) you have control of the server, and 3) you are able to execute commands as root or through sudo. Also note that I have removed foreign language setup from this, as I doubt anyone who reads my blog really needs Japanese but me.]

[root@localhost example.com]# yum install postgresql postgresql-server php-pgsql php-xml \
    php-pear php-devel gcc zlib-devel libcurl-devel make wget httpd php-mbstring
[root@localhost ~]# cd /var/www/html
[root@localhost html]# wget http://ftp.drupal.org/files/projects/drupal-7.0.tar.gz
[root@localhost html]# tar -zxf ./drupal-7.0.tar.gz
[root@localhost html]# mv drupal-7.0 drupal7
[root@localhost html]# cd drupal7
[root@localhost drupal7]# pecl install pecl_http
[root@localhost drupal7]# pecl install uploadprogress
[root@localhost drupal7]# echo extension=http.so > /etc/php.d/php_http.ini
[root@localhost drupal7]# echo extension=uploadprogress.so  >> /etc/php.d/php_http.ini
[root@localhost drupal7]# service postgresql initdb
[root@localhost drupal7]# service postgresql start
[root@localhost drupal7]# chkconfig postgresql on
[root@localhost drupal7]# chkconfig httpd on
[root@localhost drupal7]# setsebool -P httpd_can_network_connect_db=1
[root@localhost drupal7]# setsebool -P httpd_can_sendmail=1
[root@localhost drupal7]# setsebool -P httpd_unified=1
[root@localhost drupal7]# cp sites/default/default.settings.php sites/default/settings.php
[root@localhost drupal7]# chmod 666 sites/default/settings.php
[root@localhost drupal7]# su postgres
bash-4.1$ createuser --pwprompt --encrypted --no-adduser --no-createdb drupal
Enter password for new role:
Enter it again:
Shall the new role be allowed to create more new roles? (y/n) n
bash-4.1$ createdb --encoding=UNICODE --owner=drupal drupaldb
bash-4.1$ exit
[root@localhost drupal7]# cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf.original
[root@localhost drupal7]# vi /var/lib/pgsql/data/pg_hba.conf

Add the following line at about line 71 or so, just after the local all all ident line (check first, don’t blindly dump this in with sed because this could all be wrong if you’re running a different version of Postgres or reading this far in the future):

host    drupaldb    drupal    127.0.0.1/32    md5

Remove the Apache and OS identification tags on server-generated error messages (such as the default Apache ### error messages):

[root@localhost drupal7]# vi /etc/httpd/conf/httpd.conf

Replace “ServerTokens OS” with “ServerTokens Prod”
Replace “ServerSignature On” with “ServerSignature Off”

And since I’m paranoid and use my servers only as servers (and prefer to send logs to a separate logging server), I also change “LogLevel Warn” to “LogLevel Info” and let my parsing scripts do the work of finding the important stuff. That makes forensics a lot easier later on down the road (though more compute intensive).

Add the following lines to your httpd.conf file if you are running multiple websites on a single server (on a single IP address that is hosting multiple domain names):

NameVirtualHost *:80

<VirtualHost *:80>
    ServerAdmin admin@example.com
    DocumentRoot /var/www/html/drupal7
    ServerName example.com
    ErrorLog logs/example.com-error_log
</VirtualHost>

<VirtualHost *:80>
    ServerName anothersite.example.com
    DocumentRoot /var/www/html/anothersite.example.com
</VirtualHost>

If you are not running virtual servers, or if you are running them in /home/user/public_html or whatever, adjust the way the file is written.

Now input the iptables rules necessary to open port 80 for web traffic from outside, and allow httpd to access Postgres:

[root@localhost drupal7]# iptables -vI INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
[root@localhost drupal7]# iptables -vI INPUT -m state --state NEW -m udp -p udp --dport 5353 -j ACCEPT

Now point a browser at the server and do your Drupal installation setup. Once you’re done there, reset the permissions on sites/default/settings.php and remove the last iptables rule:

[root@localhost drupal7]# chmod 644 sites/default/settings.php
[root@localhost drupal7]# iptables -vD INPUT -m state --state NEW -m udp -p udp --dport 5353 -j ACCEPT

If everything went correctly you should be able to use Drupal 7 with SELinux in Enforcing mode, with your iptables intact aside from however you log in remotely for administration (SSH, if you use it; if you don’t, close off port 22 and do chkconfig sshd off as well).

The next time you reboot you will notice you can still log in to your shell, but you can’t access the website with a browser. That is because the iptables rules fell off (they don’t persist unless you tell them to). Once everything works the way it should, commit the iptables rule that is letting port 80 stay open:

iptables-save > /etc/sysconfig/iptables

And while we’re messing with security… let’s go ahead and turn off a PHP feature that the cracker wannabes have recently learned about: allow_url_fopen.

[root@localhost drupal7]# vi /etc/php.ini

Replace “allow_url_fopen On” with “allow_url_fopen Off”

It is very unlikely that you will need allow_url_fopen to be active, as very few modules use it (there is a Drupal-specific alternative to this, or so it seems). Of course, if your site breaks it would be good to check whether you actually did need this, but otherwise I’d leave it turned off until things go wrong.
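
If you want to confirm the change took, something like this will do (php -i dumps the effective configuration; the first value is the local setting, the second the master):

[root@localhost drupal7]# php -i | grep allow_url_fopen
allow_url_fopen => Off => Off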

Now we need to restart Apache:

[root@localhost drupal7]# apachectl restart

Given that the biggest fans of tutorials such as these are the sort of folks who would never spend the time to research the meaning of all this themselves and given that you wouldn’t be reading this if you already knew how to do the above in the first place, I must remind you to head to your favorite search engine and do searches for things like “hardening sshd”, “hardening httpd”, “turning off root login”, “using public key encryption with sshd” and anything else that might strike your fancy (protip: read as much as you can about SELinux and iptables).