Tag Archives: erlang

Erlangers! USE LABELS! (aka “Stop Writing Punched-in-the-Face Code Blocks”)

Do you write lambdas directly inline in the argument list of various list functions or list comprehensions? Do you ever do it even though the fun itself, or the other arguments or return assignment/assertion for the call are too long and force you to scrunch that lambda’s definition up into an inline-multiline ball of wild shit? YOU DO? WTF?!?!? AHHHH!

First off, realize this makes you look like a douchebag for not being polite to other people or your future self whenever you do it. There is a big difference for the human reading between:

%%% From shitty_inline.erl

do_whatever(Keys, SomeParameter) ->
    lists:foreach(fun(K) -> case external_lookup(K) of
                  {ok, V} -> do_side_effecty_thing(V, SomeParameter);
                  {error, R} -> report_some_failure(R)
          end, Keys


%%% From shitty_listcomp.erl

do_whatever(Keys, SomeParameter) ->
    [fun(K) -> case external_lookup(K) of
        {ok, V} -> do_side_effecty_thing(V, SomeParameter);
        {error, R} -> report_some_failure(R) end end(Key) || Key <- Keys],


%%% From less_shitty_listcomp.erl

do_whatever(Keys, SomeParameter) ->
    ExecIfFound = fun(K) -> case external_lookup(K) of
            {ok, V} -> do_side_effecty_thing(V, SomeParameter);
            {error, R} -> report_some_failure(R)
    [ExecIfFound(Key) || Key <- Keys],


%%% From labeled_lambda.erl

do_whatever(Keys, SomeParameter) ->
    ExecIfFound =
        fun(Key) ->
            case external_lookup(Key) of
                {ok, Value}     -> do_side_effecty_thing(Value, SomeParameter);
                {error, Reason} -> report_some_failure(Reason)
    lists:foreach(ExecIfFound, Keys).


%%% From isolated_functions.erl

-spec do_whatever(Keys, SomeParameter) -> ok
    when Keys          :: [some_kind_of_key()],
         SomeParameter :: term().

do_whatever(Keys, SomeParameter) ->
    ExecIfFound = fun(Key) -> maybe_do_stuff(Key, SomeParameter) end,
    lists:foreach(ExecIfFound, Keys).

maybe_do_stuff(Key, Param) ->
    case external_lookup(Key) of
        {ok, Value}     -> do_side_effecty_thing(Value, Param);
        {error, Reason} -> report_some_failure(Reason)

Which versions force your eyes to do less jumping around? How about which version lets you most naturally understand each component of the code independently? Which is more universal? What does code like this translate to after erlc has a go at it?

Are any of these difficult to read? No, of course not. Every version of this is pretty darn basic and common — you need a listy operation by require a closure over some in-scope state to make it work right, so you really do need a lambda instead of being able to look all suave with a fun some_function/1 type thing. So we agree, taken by itself, any version of this is easy to comprehend. But when you are reading through hundreds of these sort of things at once to understand wtf is going on in a project while also remembering a bunch of other shit code that is laying around and has side effects while trying to recall some detail of a standard while the phone is ringing… things change.

Do I really care which way you do it? In a toy case like this, no. In actual code I have to care about forever and ever — absolutely, yes I do. The fifth version is my definite preference, but the fourth will do just fine also.

(Or even the third, maybe. I tend to disagree with the semantic confusion of using a list comprehension to effect a loop over a list of values only for the side effects without returning a value – partly because this is semantically ambiguous, and also because whenever possible I like every expression of my code to either be an assignment or an assertion (so every line should normally have a = on it). In other words, use lists:foreach/2 in these cases, not a list comp. I especially disagree with using a listcomp when we the main utility of using a list comprehension is normally to achieve a closure over local state, but here we are just calling another closure — so semantic fail there, twice.)

But what about my lolspeed?!?

I don’t know, but let’s see. I’ve created five modules, based on the above examples:

  1. shitty_inline.erl
  2. shitty_listcomp.erl
  3. less_shitty_listcomp.erl
  4. labeled_lambda.erl
  5. isolated_functions.erl

These all call the same helpers that do basically nothing important other than having actual side effects when called (they call io:format/2). What we are interested in here is the generated assembler. What is the cost of introducing these labels that help the humans out VS leaving things all messy the way we imagine might be faster for the runtime?

It turns out that just like with using assignments to document your code, there is zero cost to label functions. For example, here is the assembler for shitty_inline.erl side-by-side with labeled_lambda.erl:

Oooh, look. The exact same stuff!

(This is a screenshot, a text file with the contents shown is here: label_example_comparison.txt)

See? All that annoying-to-read inline lambdaness buys you absolutely nothing. You’re not helping the compiler, you’re not helping the runtime, and you are hurting your future self and anyone you want to work with on the same code later. (Note: You can generate precompiler output with erlc -P and erlc -E, and assembler output with erlc -S. Here is the manpage. Play around with it a bit, BEAM and EVM are amazing platforms, wide open for exploration!)

So use labels.

As for execution speed… all of these perform basically the same, except for the last one, isolated_functions.erl. Here is the assembler for that one: isolated_functions.S. This outperforms the others, though to a relatively insignificant degree. Of course, it is only an “insignificant degree” until that part of the program is the most critical part of whatever your program does — then even a 10% difference may be a really huge win for you. In those cases it is worth it to refactor to test the speed of different representations against each version of the runtime you happen to be using — and all thoughts on mere style have to take a backseat. But this is never the case for the vast majority of our code.

(I’ve read reports in the past that indicate 99% of our performance bottlenecks tend to reside in less than 1% of our code by line count — but I can’t recall the names of any just now. If you happen to find a reference, let me know so I can update this little parenthetical blurb with some hard references.)

My point here is that breaking every lambda out into a separate named functions isn’t always worth it — sometimes an in-place lambda really is more idiomatic and easier to understand simply because you can see everything right there in the same function body. What you don’t want to see is multi-line lambdas squashed into argument lists that make things hard to read and give you the exact same result once compiled as labeling that lambda with a meaningful variable name on another line in the code and then referring to it where it is invoked later.

The most basic Erlang service ⇒ worker pattern

There has been some talk about identifying “Erlang design patterns” or “functional design patterns”. The reason this sort of talk rarely gets very far (just refer to any of the thousands of aborted ML and forums threads on the subject) is because generally speaking “design patterns” is a phrase that means “things you have to do all the time that your language provides both no primitives to represent, and no easy way to write a library function behind which to hide an abstract implementation”. OOP itself, being an entire paradigm built around a special syntax for writing dispatching closures, tends to lack a lot of primitives we want to represent today and has a litany of design patterns.

NOTE: This is a discussion of a very basic Erlang implementation pattern, and being very basic it also points out a few places new Erlangers get hung up on, like what context a specific call is made in — because that’s just not obvious if you’re not already familiar with concurrency at the level Erlang does it. If you’re already a wizard, this article probably isn’t for you.

But what about Erlang? Why have so few design patterns (almost none?) emerged here?

The main reason is what would have been design patterns in Erlang have mostly become either functional abstractions or OTP (“OTP” in this use generally referring to the framework that is shipped with Erlang). This is about as far as the need for patterns has needed to go in the most general case. (Please note that it very often is possible to write a framework that implements a pattern, though it is very difficult to make such frameworks completely generic.)

But there is one thing the ole’ Outlaw Techno Psychobitch doesn’t do for us that quite a few of us do have a common need for but we have to discover for ourselves: how to create a very basic arrangement of service processes, supervisors, and workers that spawn workers according to some ongoing global state or node configuration. (Figuring this out is almost like a rite of passage for Erlangers.)

The case I will describe below involves two things:

  • There is some service you want to create that is represented by a named process that manages it and acts as its sole interface.
  • There is some configurable state that is relevant to the service as a whole, should be remembered, and you should not be forced to pass in as arguments every time you call for this work to be done.

For example, let’s say we have an artificial world written in Erlang. Let’s say its a game world. Let’s say mob management is abstracted behind a single mob manager service interface. You want to spawn a bunch of monster mobs according to rules such as blahlblahblah… (Who cares? The game system should know the details, right?) So that’s our task: spawning mobs. We need to spawn a bunch of monster mob controller processes, and they (of course) need to be supervised, but we shouldn’t have to know all the details to be able to tell the system to create a mob.

The bestiary is really basic config data that shouldn’t have to be passed in every time you call for a new monster to be spawned. Maybe you want to back up further and not even want to have to specify the type of monster — perhaps the game system itself should know generally what the correct spawn/live percentages are for different types of mobs. Maybe it also knows the best way to deal with positioning to create a playable density, deal with position conflicts, zone conflicts, leveling or phasing influences, and other things. Like I said already: “Who cares?”

Wait, what am I really talking about here? I’m talking about sane defaults, really. Sane defaults that should rule the default case, and in Erlang that generally means some sane options that are comfortably curried away in the lowest-arity calls to whatever the service functions are.  But from whence come these sane defaults? The service state, of course.

So now that we have our scenario in mind, how does this sort of thing tend to work out? As three logical components:

  • The service interface and state keeper, let’s call it a “manager” (typically shortened to “man”)
  • The spawning supervisor (typically shortened to “sup”)
  • The spawned thingies (not shortened at all because it is what it is)

How does that typically look in Erlang? Like three modules in this imaginary-but-typical case:

  • game_mob_man.erl
  • game_mob_sup.erl
  • game_mob.erl

The game_mob_man module represents the Erlang version of a singleton, or at least something very similar in nature: a registered process. So we have a definite point of contact for all requests to create mobs: calling game_mob_man:spawn_mob/0,1,... which is defined as

spawn_mob() ->

spawn_mob(Options) ->
    gen_server:cast(?MODULE, {beget_mob, Options}).


Internally there is the detail of the typical

handle_cast({beget_mob, Options}, State) ->
    ok = beget_mob(Options, State),
    {noreply, State};

and of course, since you should never be putting a bunch of logic or side-effecty stuff in directly in your handle_* function clauses beget_mob/2 is where the work actually occurs. Of course, since we are talking about common patterns, I should point out that there are not always good linguistic parallels like “spawn” ⇒ “beget” so a very common thing to see is some_verb/N becomes a message {verb_name, Data} becomes a call to an implementation do_some_verb(Data, State):

spawn_mob(Options) ->
    gen_server:cast(?MODULE, {spawn_mob, Options}).


handle_cast({spawn_mob, Options}, State) ->
    ok = do_spawn_mob(Options, State),
    {noreply, State};

% ...

do_spawn_mob(Options, State = #s{stuff = Stuff}) ->
    % Actually do work in the `do_*` functions down here

The important thing to note above is that this is the kind of registered module that is registered under its own name, which is why the call to gen_server:cast/2 is using ?MODULE as the address (and not self(), because remember, interface functions are executed in the context of the caller, not the process defined by the module).

Also, are the some_verb/N{some_verb, Data}do_some_verb/N names sort of redundant? Yes, indeed they are. But they are totally unambiguous, inherently easy to grep -n and most importantly, give us breaks in the chain of function calls necessary to implement abstractions like managed messaging and supervision that underlies OTP magic like the gen_server itself. So don’t begrudge the names, its just a convention. Learn the convention so that you write less annoyingly mysterious code; your future self will thank you.

So what does that have to do with spawning workers and all that? Inside do_spawn_mob/N we are going to call another registered process, game_mob_sup. Why not just call game_mob_sup directly? For two reasons:

  1. Defining spawn_mob/N within the supervisor still requires acquisition of world configuration and current game state, and supervisors do not hold that kind of state, so you don’t want data retrieval tasks or evaluation logic to be defined there. Any calls to a supervisor’s public functions are being called in the context of the caller, not the supervisor itself anyway. Don’t forget this. Calling the manger first gives the manager a chance to wrap its call to the supervisor in state and pass the message along — quite natural.
  2. game_mob_sup is just a supervisor, it is not the mob service itself. It can’t be. OTP already dictates what it is, and its role is limited to being a supervisor (and in this particular case of dynamic workers, a simple_one_for_one supervisor at that).

So how does game_mob_sup look inside? Something very close to this:


%%% Interface
spawn_mob(Conf) ->
    supervisor:start_child(?MODULE, [Conf]).

%%% Startup
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    RestartStrategy = {simple_one_for_one, 5, 60},
    Mob = {game_mob,
           {game_mob, start_link, []},
    Children = [Mob],
    {ok, {RestartStrategy, Children}}.

(Is it really necessary to define these things as variables in init/1? No. Is it really necessary to break the tuple assigned to Mob vertically into lines and align everything all pretty like that? No. Of course not. But it is pretty darn common and therefore very easy to catch all the pieces with your eyes when you first glance at the module. Its about readability, not being uber l33t and reducing a line count nobody is even aware of that isn’t even relevant to the compiled code.)

See what’s going on in there? Almost nothing. That’s what. The interesting part to note is that very little config data is going into the supervisor at all, with the exception of how supervision is set to work. These are mobs: if they crash they shouldn’t come back to life, better to leave them dead and signal whatever keeps account of them so it can decide what to do (the game_mob_man, for example, which would probably be monitoring these). Setting them as permanent workers can easily (and hilariously) result in a phenomenon called “highly available mini bosses” — where a crash in the “at death cleanup” routine or the mistake of having the mob’s process retire with an exit status other than 'normal' causes it to just keep coming back to life right there, in its initial configuration (i.e. full health, full weapons, full mana, etc.).

But what stands above this? Who supervises the supervisor?

Generally speaking, a component like mob monsters would be a part of a larger concept of world objects, so whatever the world object “service” concept is would sit above mobs, and mobs would be one component of world entities in general.

To sum up, here is a craptastic diagram:

Yes, my games involve wildlife and blonde nurses.

Yes, my games involve wildlife and blonde nurses.

The diagram above shows solid lines for spawn_link, and dashed lines to indicate the direction of requests for things like spawn_link. The diagram does not show anything else. So monitors, messages, etc. are all just not there. Imagine them. Or don’t. That’s not the point of this post.

“But wait, I see what you did there… you made a bigger diagram and cut a bunch of stuff out!”

Yep. I did that. I made an even huger, much crappier, more inaccurate diagram because I wasn’t sure at first where I wanted to fit this into my imaginary game system.

And then I got carried away and diagrammed a lot more of the supervision tree.

And then I though “Meh, screw it, I’ll just push this up to a rough imagining of what it might look like pushed all the way back to the SuperSup”.

Here is the result of that digression:

It wouldn't look exactly like this, so use your imagination.

It wouldn’t look exactly like this, so use your imagination.


Yep. All that. Right there. That’s why its called a “supervision tree” instead of a “supervision list”. Any place in there you don’t have a dependency between parts, a thing can crash all by itself and not bring down the system. Consider this: the entire game can fail and chat will still work, users will still be logged in, etc. Not nearly as big a deal to restart just that one part. But what about ItemReg? Well, if that fails, we should probably squash the entire item system (I’ve got guns, but no bullets! or whatever) because game items are critical data. Are they really critical data? No. But they become critical because gamers are much more willing to accept a server interruption than they are losing items and having bad item data stored.

And with that, I’m out! Hopefully I was able to express a tiny little bit about one way supervision can be coupled with workers in the context of an ongoing, configured service that lives within a larger Erlang system and requires on-the-fly spawning of supervised workers.

(Before any of you smarties that have been around a while and point out how I glossed over a few things, or how spawning a million items as processes might not be the best idea… I know. That’s not the point of this post, and the “right approach” is entirely context dependent anyway. But constructive criticism is, as always, most welcome.)

zUUID: An Example Erlang/OTP Project

I was talking with a friend of mine yesterday about how UUID v2 seems to have evaporated. We looked into things further and found its not actually included in RFC 4122! One thing led to another and I wound up writing an example project that is yet another UUID generator/utility in Erlang — but this time it actually has duplicate v1 and v2 detection/correction and implements as close to what I can find is defined as UUID version 2 values.

As there are already plenty of UUID projects around I focused on making this one as readable as I possibly could — to include exported documentation, in-source documentation, obvious variable names, full typespecs, my silly little “pure” notation, blatantly obvious bitstring syntax, and the obligatory github presence.

Hopefully some folks newish to Erlang will come along and explain to me what confuses them about that code, the process of writing it, the documentation conventions, etc. so that I can become a better literate programmer. Of course, since the last thing the world needs is another UUID implementation I suppose I would have had better luck with something at least peripherally related to the web. (>.<)

Pure Declarations in Erlang

Over the last year or so I’ve gone back and forth in my mind and in discussions with other Erlangers about type systems in Erlang, or rather, I’ve been going back and forth about its lack of one and the way Dialyzer acts as our bandaid in this area. Types are useful enough that we need Dialyzer, but the pursuit of functional puritanism gets insane enough that its simply not worth it in a language intended for real-world production use, especially in the messy, massively concurrent, let-it-crash, side-effecty, message-centric world of Erlang.

But… types and pure functions are still really useful and setting a goal of making as much of a program as possible into provable, bounded, typed, pure functions tends to result in easy to understand, test and maintain code. So there is obviously some stress here.

What I would like to do is add a semantic that the compiler (or Dialyzer, but would prefer this be a compiler check, tbh) be aware of what functions are pure and which are not. The way I would do this is by using a different “arrow”, in particular the Prolog-style declaration indicator: :-

[Edit after further discussion…] What I would like to do is add a directive that Dialyzer can interpret according to a simply purity rule. Adding this to Dialyzer makes more sense than putting it in the compiler — Dialyzer is already concerned with checking; the compiler is already concerned with compiling.

The directive would be -pure(Name/Arity) (a compliment to -spec). The rule would be very simple: only guard-permissible BIFs and other pure functions are legal from within the body of a pure function. This is basically just an extension of the current guard rule (actually, I wonder why this version isn’t already the guard rule… other than the fact that unless something like this is implemented the compiler itself wouldn’t have any way of checking for purity, so currently it must blindly accept a handful of BIFs known to be pure and nothing else).

For example, here is a pure function in Erlang, but neither the compiler nor Dialyzer can currently know this:

-spec increment(integer()) -> integer().
increment(A) ->
    A + 1.

Here is the same function declared to be pure:

-spec increment(integer()) -> integer().
increment(A) ->
    A + 1.

Pretty simple change.

“ZOMG! The whold standard library!” And yes, this is true — the whole thing is out. Except that the most important bits of it (the data structures like lists, dict, maps, etc.) could be easily converted to pure functions with little more than changing -> to :- adding a single line to the definition.

Any pure function could be strongly typed and Dialyzer could adhere to strong types instead of looser “success types” in these cases. Some code that is currently written to take an input from a side-effecty function, pass it through a chain of non-returning and possible side-effecty functions as a way to process or act on the value, and ultimately then call some side-effecty final output function would instead change to a form where the side-effects are limited to a single function that does both the input and output, and all the processing in-between would be done in pure functions.

This makes code inherently more testable. In the first case any test of the code is essentially an integration test — as to really know how things will work requires knowing at least one step into side effects (and very often we litter our code with side-effects without a second thought, something prayer-style monadisms assist greatly with). In the second case, though, the majority of the program is pure and independently testable, with no passthrough chain of values that have to be checked. I would argue that in many cases such passthrough is either totally unnecessary, or when it really is beneficial passing through in functions is not as useful as passing through in processes — that is to say, that when transformational passthrough is desired it is easier to reason about an Erlang program as a series of signal transformations over a message stream than a chain of arbitrarily side-effecty function calls that collectively make a recursive tail-call (and that’s a whole different ball of wax, totally orthogonal to the issue of functional purity).

Consider what we can know about a basic receive loop:

loop(State) ->
    {process, Data} ->
        {ok, NewState} = do_process(Data, State),
    {send_state, From} ->
        From ! State,
    halt ->
    Message ->
        ok = log(unexpected, Unexpected),

-spec do_process(term(), #state{}) -> {ok, #state{}} | {error, term()}.
do_process(Data, State) :-
    % Do purely functional stuff

-spec log(category(), term()) -> ok.
log(Cat, Data) ->
    % Do side-effecty stuff

We can see exactly what cases result in another iteration and which don’t. Compare that with this:

loop(State) ->
    {process, Data}     -> do_process(Data, State);
    {send_state, Asker} -> tell(Asker, State);
    quit                -> exit(normal);
    Message             -> handle_unexpected(Message, State)

do_process(Data, State) ->
    % Do stuff.
    % Mutually recursive tail call; no return type.

tell(Asker, State) ->
    % Do stuff; another tail call...

handle_unexpected(Message, State) ->
    ok = log(unexpected, Message),
    % Do whatever else; end with tail call to loop/1...

I like the way the code lines up visually in the last version of loop/1, sure, but I can’t know nearly as much about it as a process. Both styles are common, but the former lends itself to readability and testing while the latter is a real mixed bag. Pure functions would keep us conscious of what we are doing and commit our minds in ways to the definite-return form of code where our pure functions and our side-effecty ones are clearly separated. Of course, anyone could also continue to write Erlang any old way they used to — this would just be one more tool to assist with breaking complexity down and adding some compile-time checking in large systems.

I would love to see this sort of thing happen within Erlang eventually, but I am pretty certain that its the sort of change that won’t happen if I don’t roll up my sleeves and do it myself. We’ve got bigger fish to fry, in my opinion, (and I’ve certainly got higher priorities personally right now!) but perhaps someday…

Iterators? We Don’t NEED No Stinking Iterators!

Every so often a request for “implementation of iterators for maps” over hashes/maps/dicts or some other K-V data structure appears on mailing list for a functional langauge. I’ve spent years making heavy use of iterators in imperative languages, and the way they fit into Python is really great. For Python. I totally understand where some of these folks are coming from, they just don’t realize that functional languages are not where they came from.

So… “Is this post the result of some actual event”? Yeah, you got me. It is. On the erlang-questions mailing list someone asked “Are maps ever going to get an iterator?” Again.

Erlang is definitely not Kansas, but people thinking either that it is or (more dangerously) that it should be and then trying to influence the maintainers to make it that way (and then the powers-that-be getting in a panic over “market share” and doing horrible things to the language…) worries me a bit.

There is no faster way to paint a functional language into a corner than to try making it occasionally imperative. Conversely, consider the syntactic corner C++ and Java have painted themselves into by trying to include functional features as after-thoughts where they really didn’t belong.

(I know, I know, death-by-kitchen-sink is a proud C++ tradition. It is becoming one for Java. Even though I hate Java there is no sense in making it worse by cluttering its syntax and littering it with gotchas and newbie-unfriendly readability landmines in the interest of providing features few Java coders understand the point of, especially when the whole concept of state management in a bondage-and-discipline OOP language like Java is to keep everything in structs with legs (not anonymous closures over state that is temporarily in scope…). The lack of such problems were previously one of the main points that favored Java over C++… well, that and actual encapsulation. Hopefully Rust and D can resist this temptation.)

This frustrates me. It is almost as if instead of picking a tool that matches a given job, people learn one tool and then try over time to make a super-powered Swiss Army knife of it. This never turns out well. The result is more Frankenstein’s Monster than Swiss Army knife and in the best case it winds up being hard to learn, confusing to document and crap at everything.

What’s worse, people assume that the first tool they learned well is the standard by which everything else should be judged (if so, then why are they learning anything else?). It follows, then, that if a newly studied LangX does not have a feature of previously used LangY then it must be introduced because it is “missing”. (I do admit, though, to wishing other languages had pattern matching in function heads… but I don’t bring this up on mailing lists as if its a “missing feature”; I do, however, cackle insanely when overloading is compared with matching.)

Let’s say we did include iterators for maps into Erlang — whatever an “iterator” is supposed to mean in a list-are-conses type functional language. What would that enable?

-spec foreach(fun(), map()) -> ok.

That sort of looks pointless. Its exactly the same as lists:foreach(Fun, maps:to_list(Map)) or maybe lists:foreach(Fun, maps:values(Map)). Without knowing whether we’re trying to build a new map based on the old one or get some side effect out of Fun then its hard to know what the point is.


-spec map(fun(), OldMap :: map()) -> {ok, NewMap :: map()}.

But… wait, isn’t that just maps:map/2 all over again?

I think I know where this is going, though. These people really wish maps were ordered dictionaries, because they want keys to be ordered. So they want something like this:

-spec side_effects_in_order_dammit(fun(), map()) -> ok.
side_effects_in_order_dammit(F, M) ->
    Ordered = [{K, maps:get(K, M)} || K <- lists:sort(maps:keys(M))],
    ok = lists:foreach(F, Ordered).

But wait, what order should the keys be in, anyway?

This is a slow, steady march to insanity. “Give me iterators” begets “Let’s have ordered maps” begets “Let’s have ordered iterators for maps” and so on, and eventually you wind up with most of the Book of Genesis in the Devil’s Bible of Previously Decent Functional Languages. All the while, totally forgetting that these things already exist in another form. There are more data structures than just maps for a reason.

This just gets ridiculous, and it isn’t even what hashes are about to begin with.

Horrible, Drunk Code (but it works and demonstrates a point)

Over on StackOverflow otopolsky was asking about how to make an Erlang program that could read selected lines in a huge file, offset from the bottom, without exploding in memory (too hard).

I mentioned the standard bit about front-loading and caching the work of discovering the linebreak locations, the fact that “a huge text file” nearly always means “a set of really huge log files” and that in this case tokenizing the semantics of the log file within a database is a Good Thing, etc. (my actual answer is here).

He clearly knew most of this, and was hoping that there was some shortcut already created. Well, I don’t know that there is, but it bothered me that his initial stab at following my advice about amortization of linebreak discovery resulted in an attempt to read a 400MB text file in to run a global match over it, and that this just ate up all his memory and made his machine puke. Granted, my initial snippet was a naive implementation that didn’t take size into account at all, but…

400MB? Eating ALL your memory? NO WAY. Something must be done… A call to action!

The main problem is I’m already a bit jeezled up because my wife broke out some 泡盛 earlier (good business day today)… so any demo code I will produce will be, ahem, a little less than massive-public-display worthy (not that the 6 or 7 guys on the planet who actually browse Erlang questions on SO would care, or don’t already know who I am). So… I’m posting here:


linebreaks(File) ->
    {ok, FD} = file:open(File, [raw, read_ahead, binary]),
    Step = 1000,
    Count = 1,
    Loc = 1,
    Indexes = [],
    Read = file:read(FD, Step),
    {ok, Result} = index(FD, Read, Count, Loc, Step, Indexes),
    ok = file:close(FD),
    [{1, Loc} | Result].

index(FD, {ok, Data}, Count, Loc, Step, Indexes) ->
    NewLines = binary:matches(Data, <<$\n>>),
    {NewCount, Found} = indexify(NewLines, Loc, Count, []),
    Read = file:read(FD, Step),
    index(FD, Read, NewCount, Loc + Step, Step, [Found | Indexes]);
index(_, eof, _, _, _, Indexes) ->
    {ok, lists:reverse(lists:flatten(Indexes))};
index(_, {error, Reason}, _, _, _, Indexes) ->
    {error, Reason, Indexes}.

indexify([], _, Count, Indexed) ->
    {Count, Indexed};
indexify([{Pos, Len} | Rest], Offset, Count, Indexed) -> 
    NewCount = Count + 1,
    indexify(Rest, Offset, NewCount, [{Count, Pos + Len + Offset} | Indexed]).

As ugly as that is, it runs in constant space and the index list produced on a 7,247,560 line 614,754,920 byte file appears to take a bit of space (a few dozen MB for the 7,247,560 element list…), and temporarily requires a bit more space during part of the return operation (very sudden, brief spike in memory usage right at the end as it returns). But it works, and returns what we were looking for in a way that won’t kill your computer. And… it only takes a 14 seconds or so on the totally crappy laptop I’m using right now (an old dual-core Celeron).

Much better than what otopolsky ran into when his computer ran for 10 minutes before it started swapping after eating all his memory!


lceverett@changa:~/foo/hugelogs$ ls -l
合計 660408
-rw-rw-r-- 1 ceverett ceverett       928  9月  3 01:31 goofy.erl
-rw-rw-r-- 1 ceverett ceverett  61475492  9月  2 23:17 huge.log
-rw-rw-r-- 1 ceverett ceverett 614754920  9月  2 23:19 huger.log
ceverett@changa:~/foo/hugelogs$ erl
Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:2:2] [async-threads:10] [kernel-poll:false]

Running $HOME/.erlang.
Eshell V7.0  (abort with ^G)
1> c(goofy).
2> {HugeTime, HugeIndex} = timer:tc(goofy, linebreaks, ["huge.log"]).
3> {HugerTime, HugerIndex} = timer:tc(goofy, linebreaks, ["huger.log"]).
4> HugerTime / 1000000.
5> HugeTime / 1000000.
6> lists:last(HugeIndex).
7> lists:last(HugerIndex).

Rather untidy code up there, but I figure it is readable enough that otopolsky can get some ideas from this and move on.

Erlang: Writing Terms to a File for file:consult/1

I notice that there are a few little helper functions I seem to always wind up writing given different contexts. In Erlang one of these is an inverse function for file:consult/1, which I have to write any time I use a text file to store config data*.

Very simply:

write_terms(Filename, List) ->
    Format = fun(Term) -> io_lib:format("~tp.~n", [Term]) end,
    Text = lists:map(Format, List),
    file:write_file(Filename, Text).

[Note that this *should* return the atom 'ok' — and if you want to check and assertion or crash on failure, you want to do ok = write_terms(Filename, List) in your code.]

This separates each term in a list by a period in the text file, which causes file:consult/1 to return the same list back (in order — though this detail usually does not matter because most conf files are used as proplists and are keysearched anyway).

An annoyance with most APIs is a lack of inverse functions where they could easily be written. Even if the original authors of the library don’t conceive of a use for an inverse of some particular function, whenever there is an opportunity for this leaving it out just makes an API feel incomplete (and don’t get me started on “web APIs”… ugh). This is just one case of that. Why does Erlang have a file:consult/1 but not a file:write_terms/2 (or “file:deconsult/2” or whatever)? I don’t know. But this bugs me in most libs in most languages — this is the way I usually deal with this particular situation in Erlang.

[* term_to_binary/1 ←→ binary_to_term/1 is not an acceptable solution for config data!]

Erlang: Maps, Comprehensions and Side-effecty Iteration

In Erlang it is fairly common to want to perform a side-effecty operation over a list of values, not because you want to collect an aggregate (fold), actually map the input list to the output list (map), or build a new list in some way (list comprehension) but because you just want the side-effects of the operation.

The typical idiom (especially for broadcasting messages to a list of processes) is to use a list comprehension for this, or sometimes lists:map/2:

%% Some procedural side-effect, like spawning list of processes or grabbing
%% a list of resources:
[side_effecty_procedure(X) || X <- ListOfThings],

%% Or, for broadcasting:
[Pid ! SomeMessage || Pid <- ListOfPids],

%% And some old farts still do this:
lists:map(fun side_effecty_procedure/1, ListOfThings),

%% Or even this (gasp!) which is actually made for this sort of thing:
lists:foreach(fun side_effecty_procedure/1, ListOfThings),
%% but lacks half of the semantics I describe below, so this part
%% of the function namespace is already taken... (q.q)

That works just fine, and is so common that list comprehensions have been optimized to handle this specific situation in a way that avoids creating a return list value if it is clearly not going to be assigned to anything. I remember thinking this was sort of ugly, or at least sort of hackish before I got accustomed to the idiom, though. “Hackish” in the sense that this is actually a syntax intended for the construction of lists and only incidentally a useful way to write a fast side-effect operation over a list of values, and “ugly” in the sense that it is one of the few places in Erlang you can’t force an assertion to check the outcome of a side-effecty operation.

For example, there is no equivalent to the assertive ok = some_procedure() idiom, or even the slightly more complex variation used in some other situations:

case foo(Bar) of
    ok    -> Continue();
    Other -> Other

a compromise could be to write an smap/2 function defined something like

smap(F, List) ->
    smap(F, 1, List).

smap(F, N, []) -> ok;
smap(F, N, [H|T]) ->
    case F(H) of
        ok              -> smap(F, N + 1, T);
        {error, Reason} -> {error, {N, Reason}}

But this now has the problem of requiring that whatever is passed as F/1 have a return type of ok | {error, Reason} which is unrealistic without forcing a lot of folks to wrap existing side-effecty functions in something that coerces the return type to match this. Though that might not be bad practice ultimately, its still perhaps more trouble than its worth.

It isn’t like most Erlang code is written in a way where side-effecty iteration over a list is likely to fail, and if it actually does fail the error data from a crash will contain whatever was passed to the side-effecty function that failed. But this still just doesn’t quite sit right with me — that leaves the prospect of list iteration in the interest of achieving a series of side effects as at least a little bit risky to do in the error kernel (or “crash kernel”) of an application.

On the other hand, the specific case of broadcasting to a list of processes would certainly be nice to handle exactly the same way sending to a single process works:

%% To one message
Pid ! SomeMessage,
%% To a list of messages
Pids ! SomeMessage,

Which seems particularly obvious syntax, considering that the return of the ! form of send/2 is the message itself, meaning that the following two would be equivalent if the input to the ! form of send/2 also accepted lists:

%% The way it works now
Pid1 ! Pid2 ! SomeMessage,
%% Another way I wish it worked
[Pid1, Pid2] ! SomeMessage,

In any case, this is clearly not the sort of language wart that gets much attention, and its never been a handicap for me. It just seems a bit hackish and ugly to essentially overload the semantics of list comprehensions or (ab)use existing list operations like maps and folds to achieve the side effect of iterating over a list instead of having a function like smap/2 which is explicitly designed to achieve side-effecty iteration.