I’ve been working on my little instructional project for the last few days and today finally got around to putting a very minimal, but working, chat system into the ErlMUD “scaffolding” code. (The commit with original comment is here. Note the date. By the time this post makes its way into Google things will probably be a lot different.)
I commented the commit on GitHub, but felt it was significant enough to reproduce here (lightly edited and linked). The state of the “raw Erlang” ErlMUD codebase as of this commit is significant because it clearly demonstrates the need for many Erlang community conventions, and even more significantly why OTP was written in the first place. Not only does it demonstrate the need for them, the non-trivial nature of the problem being handled has incidentally given rise to some very clear patterns which are already recognizable as proto-OTP usage patterns (without the important detail of having written any behaviors just yet). Here is the commit comment:
Originally chanman had been written to monitor channel processes, but not to link to them or trap their exits [example]. At first glance this appears acceptable; after all, the chanman doesn’t have any need to restart channels since they are supposed to die when they hit zero participants, and upon death the participant count winds up being zero.
But this assumes that the chanman itself will never die. This is always a faulty assumption. As a user it might be mildly inconvenient to suddenly be kicked from all channels, but it isn’t unusual for chat services to hiccup and it is easy to re-join whatever died. Resource exhaustion and an inconsistent channel registry are worse. If orphaned channels are left lying about, the output of \list can never match reality, and identically named ones can be created in ways that don’t make sense. Even a trivial chat service with a tiny codebase like this can wind up with system partitions and inconsistent states (oh no!).
All channels crashing with the chanman might suck a little, but a corrupted server state is unrecoverable without a restart. That requires taking the game and everything else down with it just because the chat service had a hiccup. This is totally unacceptable. Here we have one of the most important examples of why supervision trees matter: they create a direct chain of command, and enforce a no-orphan policy by annihilation. Notice that I have been writing “managers” not “supervisors” so far. This is to force me to (re)discover the utility of separating the concepts of process supervision and resource management (they are not the same thing, as we will see later).
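To make the link-and-trap idea concrete, here is a minimal sketch of a manager that links to its channels and traps their exits. It is not the actual ErlMUD chanman: the module name `chanman_sketch`, the `{open, Name}` and `list` messages, and the stub `channel/0` process are all assumptions made up for illustration.

```erlang
-module(chanman_sketch).
-export([start/0, init/0, channel/0]).

%% Hypothetical sketch, not ErlMUD's real chanman.

start() ->
    spawn(?MODULE, init, []).

init() ->
    %% Trapping exits turns a linked channel's death into an
    %% {'EXIT', Pid, Reason} message instead of killing the manager.
    process_flag(trap_exit, true),
    loop([]).

%% Channels is a list of {Name, Pid} pairs (an assumed registry shape).
loop(Channels) ->
    receive
        {From, {open, Name}} ->
            case lists:keyfind(Name, 1, Channels) of
                {Name, Pid} ->
                    From ! {self(), {ok, Pid}},
                    loop(Channels);
                false ->
                    %% The link ties the channel's lifetime to the manager:
                    %% if the manager dies abnormally, the channel dies too.
                    Pid = spawn_link(?MODULE, channel, []),
                    From ! {self(), {ok, Pid}},
                    loop([{Name, Pid} | Channels])
            end;
        {From, list} ->
            From ! {self(), [Name || {Name, _} <- Channels]},
            loop(Channels);
        {'EXIT', Pid, _Reason} ->
            %% A channel exited; drop it so the registry matches reality.
            loop(lists:keydelete(Pid, 2, Channels))
    end.

%% Stub channel process: it only waits to be told to stop.
channel() ->
    receive
        stop -> ok
    end.
```

Killing the manager with `exit(Pid, kill)` takes every linked channel down with it, which is the "no-orphan policy by annihilation" in miniature. Note that a link only propagates abnormal exits to processes that are not trapping them; a manager that exits with reason `normal` would still leave its channels running.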
Now that most of the “scaffolding” bits have been written in raw Erlang it is a good time to sit back and check out just how much repetitive code has been popping up all over the place. The repetitions aren’t resulting from some mandatory framework or environment boilerplate — I’m deliberately making an effort to write really “low level” Erlang, so low that there are no system or framework imposed patterns — they are resulting from the basic, natural fact that service workers form constellations of similarly defined processes and supervision trees provide one of the only known ways to guarantee fallback to a known state throughout the entire system without resorting to global restarts.
Another very important thing to notice is how inconsistent my off-the-cuff implementation of several of these patterns has been. Sometimes a loop has a single State variable that wraps the state of a service, sometimes bits are split out, sometimes it was one way to begin with and switched a few commits ago (especially once the argument list grew long enough to annoy me when typing). Some code_change/N functions have flipped back and forth along with this, and that required hand tweaking code that really could have been easier had every loop accepted a single wrapped State (or at least some standard structure that didn’t change every time I added something to the main loop without messing with code_change). Some places I start with a monitor and wind up with a link or vice versa, etc.
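As a rough illustration of the "single wrapped State" shape, the sketch below keeps everything the loop knows in one map and hands that same value to `code_change/1`. The module name `loop_sketch`, the map keys, and the message formats are assumptions of mine, and a map is just one possible wrapper (a record or tuple would do as well).

```erlang
-module(loop_sketch).
-export([start/0, loop/1, code_change/1]).

%% Hypothetical sketch: all service state travels in one State value.

start() ->
    spawn(?MODULE, loop, [#{channels => []}]).

loop(State) ->
    receive
        {From, {add, Name}} ->
            #{channels := Channels} = State,
            From ! {self(), ok},
            loop(State#{channels := [Name | Channels]});
        {From, status} ->
            From ! {self(), State},
            loop(State);
        code_change ->
            %% The fully qualified call jumps into the newest loaded
            %% version of this module.
            ?MODULE:code_change(State);
        stop ->
            ok
    end.

%% Because the loop takes exactly one argument, adding a field to
%% State never changes this function's arity. Migrate the shape of
%% State here if it differs between versions.
code_change(State) ->
    loop(State).
```

With this shape, growing the state means adding a key to the map rather than another argument to `loop`, so `code_change/1` keeps the same signature from commit to commit.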
While the proper selection of OTP elements is more an art than a science in many cases, having commonly used components of a known utility already grouped together avoids the need for all this dancing about in code to figure out just what I want to do. I suppose the most damning point about all this is that none of the code I’ve been flip-flopping on has been essential to the actual problem I’m trying to solve. I didn’t set out to write a bunch of monitor or link or registry management code. The only message handler I care about is the one that sends a chat message to the right people. Very little of my code has been about solving that particular problem, and instead I consumed a few hours thinking through how I want the system to support itself, and spent very little time actually dealing with the problem I wanted to treat. Of course, writing this sort of thing without the help of any external libraries in any other language or environment I can think of would have been much more difficult, but the commit history today is a very strong case for making an effort to extract the common patterns used and isolate them from the actual problem solving bits.
The final thing to note is something I commented on a few commits ago, which is just how confusing tracing message passage can be when not using module interface functions. The send and receive locations are distant in the code, so checking for where things are sent from and where they are going to is a bit of a trick in the more complex cases (and fortunately none of this has been particularly complex, or I probably would have needed to write interface functions just to get anything done). One of the best things about using interface functions is the ability to glance at them for type information while working on other modules, use tools like Dialyzer (which we won’t get into until we get into “pure Erlang” in v0.2), and easily grep or let Emacs or an IDE find calling sites for you. This is nearly impossible with pure ad hoc messaging. Ad hoc messaging is fine when writing a stub or two to test a concept, but anything beyond that starts getting very hard to keep track of, because the locations significant to the message protocol are both scattered about the code (seemingly at random) and can’t be defined by any typing tools.
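For comparison, here is a small sketch of what interface functions might look like for a channel process. Everything in it is hypothetical (the module name `chan_api_sketch`, the `join/2` and `say/2` functions, and the message shapes are not taken from the ErlMUD code); the point is only that the message protocol lives in one place, behind functions that can carry `-spec` annotations.

```erlang
-module(chan_api_sketch).
-export([start/0, join/2, say/2]).

%% Hypothetical sketch: callers use these functions and never
%% build the underlying messages themselves.

-spec start() -> pid().
start() ->
    spawn(fun() -> loop([]) end).

%% Synchronous: waits until the channel confirms the join.
-spec join(Chan :: pid(), Member :: pid()) -> ok.
join(Chan, Member) ->
    call(Chan, {join, Member}).

%% Asynchronous: fire and forget.
-spec say(Chan :: pid(), Text :: string()) -> ok.
say(Chan, Text) ->
    Chan ! {say, self(), Text},
    ok.

%% Request/reply helper; the monitor keeps the caller from
%% waiting forever if the channel is already dead.
call(Chan, Request) ->
    Ref = erlang:monitor(process, Chan),
    Chan ! {self(), Ref, Request},
    receive
        {Ref, Reply} ->
            erlang:demonitor(Ref, [flush]),
            Reply;
        {'DOWN', Ref, process, Chan, Reason} ->
            exit({channel_down, Reason})
    end.

loop(Members) ->
    receive
        {From, Ref, {join, Member}} ->
            From ! {Ref, ok},
            loop([Member | Members]);
        {say, Sender, Text} ->
            lists:foreach(fun(M) -> M ! {chat, Sender, Text} end, Members),
            loop(Members)
    end.
```

A caller in another module writes `chan_api_sketch:say(Chan, "hello")`, which is easy to grep for and gives Dialyzer something to check, instead of a bare `Chan ! {say, self(), "hello"}` scattered wherever a message happens to be sent.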
I think this code proves three things:
- Raw Erlang is amazingly quick for hacking things together that are more difficult to get right in other languages, even when writing the “robust” bits and scaffolding without the assistance of external libraries or applications. I wrote a robust chat system this afternoon that can be “hot” updated, from scratch, all by hand, with no framework code. That’s sort of amazing. Doing it sucked more than it needed to since I deliberately avoided adhering to most coding standards, but it was still possible and relatively quick. I wouldn’t want to have to maintain this two months from now, though, and that’s the real sticking point if you want to write production code.
- Code convention recommendations from folks like Joe Armstrong (who actually does a good bit of by-hand, pure Erlang server writing, but is usually rather specific about how he does it) and standard sets of utilities like OTP exist for an obvious reason. Just look at the mess I’ve created!
- Deployment clearly requires a better solution than this. We won’t touch on this issue for a while yet, but seriously, how in the hell would you automate deployment of a scattering of files like this?