Archive for February 1st, 2018

Confounding Beginner Question: What is an Erlang Atom and Why is it Useful?

Thursday, February 1st, 2018

Like other Erlangers, I tend to take the atom data type for granted. Coming from another language, however, you might be puzzled at why we have all these little strings that aren’t really strings.

The common definition you’ll hear most frequently is something like:

An atom is a label. Its only meaning is itself.

Well, that’s true, but that also sounds a bit useless to someone coming from Python or R or JavaScript or whatever. So let’s break that down: what is a “label” useful for in programs?

  • Variable names are labels.
  • Function names are labels.
  • Module names are labels.
  • The strings you use as keys in a key/value data structure are labels.
  • The enums and label macros you might use in C for semantically significant internal values are almost exactly like atoms

OK, so we use labels all the time, why don’t any of those other languages have atoms, though? Let’s examine those last two reasons for a moment for a hint why.

In Python strings are objects and while building them is expensive, hashing them can be done ahead of time as a cached operation. This means comparing two strings of arbitrary length for equality is extremely cheap, because it is reduced to a large integer comparison for equality. This is not true in, say, C or Erlang or Lisp unless you build your own data structure to carry around the pre-hashed data. In Python it is simple enough to say:

if 'foo' in some_dict:
  # stuff
else:
  # other stuff

In C, however, string comparison is a bit of a hassle and dealing with string data in a cross-platform environment at all can be super annoying depending the age of the systems you might be interacting with or running/building your code on. In Erlang the syntax of string comparison is super simple, but the overhead is not pre-paid like in Python. So what is the alternative?

We use integer values to represent keys that are semantically meaningful to the program at the time it is written. But integers are hard to remember, so instead of having magic numbers floating all around the place we typically have semantically significant integer values aliased from a text label as a macro. This is helpful so that I don’t have to remember the meaning of code like

if (condition == 42) launch_missiles();
if (condition == 86) eat_kittens();

Instead I can write code like:

#define UNDER_ATTACK    42
#define VILE_UNDERBEAST 86

if (condition == UNDER_ATTACK)    launch_missiles();
if (condition == VILE_UNDERBEAST) eat_kittens();

It is extremely common in programs to have variables or arguments like condition in the above example. It doesn’t matter whether your language has matching (like Erlang, Rust, logic languages, etc.) or uses explicit conditionals like the fake C example above — there will always be a huge number of micro datatypes that carry great semantic significance within your program and only within your program and it is as useful to be able to label these enumerated values in a way that the human coders can understand and remember as it is useful for the computer to be able to compare them as simple integers instead of going to the trouble of string comparison every time your code needs to make a decision (because string comparison entails an arbitrarily long sequence of integer comparisons every single time you compare two strings).

In C we use those macros like above (well, not always; C actually does have super convenient enums that work a lot like atoms, but didn’t when I started using it as a kid in the stone age). In Erlang we just use an atom right there in place. You don’t need a declaration or definition anywhere, the runtime just keeps track of these things for you.

Underneath the hood Erlang maintains a running table of atom label values and translates them to integer values on the way into the system and on the way out of the system. The integer each atom actual resolves to is totally unimportant to you, so Erlang abstracts that detail away, but leaves the machine comparing integer values instead of doing full-string comparisons all over the place.

“But Erlang maps don’t do string comparisons on keys!” you might say.

And indeed, you would be right. Because map keys might be any arbitrary value each key is hashed on the way in, and every time keys are compared the comparing term is hashed the same way, so the end comparison is super fast, but we have to hash the input value first for it to mean anything. With atoms, though, we have a shortcut, because we already know they are both unambiguous integer values throughout the system, and this is a slight win over having to hash first before comparing keys.

In other situations where the comparison values cannot be hashed ahead of time, like function-head matching, however, atoms are a huge win over string comparisons:

-module(atoms).
-export([foo/1, bar/1]).

foo("Some string value that I don't really recall") ->
    {ok, 1};
foo("Some string value that I don't really care about") ->
    {ok, 2};
foo("Where is my cheeseburger?") ->
    {ok, 3};
foo(_) ->
    {error, wonky_input}.

bar(dont_recall) ->
    {ok, 1};
bar(dont_care) ->
    {ok, 2};
bar(cheeseburger) ->
    {ok, 3};
bar(_) ->
    {error, wonky_input}.

I’ve slowed the clockspeed of the system so that we can notice any difference here in microseconds.

1> timer:tc(fun() -> atoms:foo("Some string value that I don't really care about.") end).
{16,{error,wonky_input}}
2> timer:tc(fun() -> atoms:foo("Where is my cheeseburger?") end).
{13,{ok,3}}
3> timer:tc(fun() -> atoms:foo("arglebargle") end).
{12,{error,wonky_input}}
4> timer:tc(fun() -> atoms:bar(dont_care) end).
{9,{ok,2}}
5> timer:tc(fun() -> atoms:bar(cheeseburger) end).                                      
{10,{ok,3}}
6> timer:tc(fun() -> atoms:bar(arglebargle) end).                                        
{10,{error,wonky_input}}

See what happened? The long string that varies only at the tail end from two options in the function head takes 16 microsecond to compare and return a value. The string that differs at the head is evaluated as a bad match for the first two options the moment the very first character is compared. The total mismatch is our fastest return because that string never need be traversed even a single time to know that it doesn’t match any of the available definitions of foo/1. With atoms, however, we see a pretty constant speed of comparison. That speed would not change at all even if the atoms were a hundred characters long in text, because underneath they are all just integer values.

Now take a look back and consider the return values defined for foo/1 and bar/1. They don’t return naked values, they return pairs where the first member is an atom. This is a pretty common technique in Erlang when writing either a library intended for 3rd party use or when defining functions that have side-effecty operations that might fail (here we have pure functions, but I’m just using this as an example). Remember, the equal sign in Erlang is both an assignment operator and an assertion operator, when calling a function that nests its return values you have the freedom to decide whether to crash the current process on an unexpected value or to handle the “error” (in which case for your program it becomes an expected condition and not an exception).

blah(Condition) ->
    {ok, Value} = foo(Condition),
    do_stuff(Value).

The code above will crash if the tuple {error, wonky_input} is returned, because the expected atom 'ok' does not match the actually returned atom ‘error’.

bleh(Condition) ->
   case foo(Condition) of
       {ok, Value}          -> do_stuff(Value);
       {error, wonky_input} -> get_new_condition()
   end.

The code above now does not crash on that error return value and instead moves on to get another condition to try out, because the error tuple matches one of the case conditions that is defined as a return value. All this can happen really fast because atoms comparisons are really integer comparisons, and that means we save a ton of processor time (and space) by avoiding string/list or binary comparisons all over the place.

In addition to atoms being a much nicer and dramatically more flexible version of global enumerated types that let us write code in a more natural style that uses normal-language labels for program semantics, it turns out that function and module names are also atoms. This is a really nice feature in itself, because it allows us to write highly dynamic code with a lot less confusion about what types both sides of a call needs to be as well as making the code easier to read. I can even implement my own version of apply/3:

my_apply(Module, Function, Args) ->
    Module:Function(Args).

Of course, there is a whole pile of reasons why you will never want to actually write a function like this in a real program, but that’s the sort of power we have without doing any type casting magic, introspection, or on-the-fly modification of our program, references or memory space.

Once you get used to using atoms and matching you’ll really start to miss them in other languages and wonder how you ever got along without them. Now run off and start writing some code to practice thinking with atoms. They will become natural to you before the day is out.