Archive for the ‘Computing’ Category

How the Internet of Things Will Change the World: Not by Much

Monday, November 30th, 2015

Are you ready for the enormous, revolutionary, ground-shattering changes coming with the IoT?

If you said “yes” and by “yes” you meant you were prepared for breathtaking changes, you are a naive child wading in a murky pool of lampreys, your will putty in the hands of the same charlatans who brought you terms like “cloud computing” — a term that has yet to be defined in any concrete technical sense.

If you said “yes” and by “yes” you meant that you felt that the more things change the more they stay the same — then you are indeed prepared.

Cold War II, civil war in China, the breakup of the EU, the abolition of American drug laws and the DEA, and an end to the Mexican civil war all at once — those are the kinds of things that will have a measurable impact on life. The so-called “internet of things” concept as heard in internet marketing is… well, not at all what the guy who coined the term “Internet of Things” meant.

We already have an internet of things. Has it cured cancer yet? Nope. But if we put RFID in every part of our bodies we will certainly be even more exposed to the will of outside actors. Not that the public has demonstrated that it cares about complete loss of its privacy, especially when “Google style conveniences in exchange for your life’s data” can be backed up by the rhetoric of fear necessitated by government “anti-terrorism” funding. (Yes, I mock this, and yes, I was a Green Beret in the US Army for 6 years — the direction that rhetoric is headed is toward government empowerment, and the government is exactly the least well equipped element of society to deal with terrorism.)

Want to see an internet of things? Tesla cars receive system updates across the network now, and can turn in performance data to help the maker improve on their designs and software. Open-water jetski robots can follow automated routes and report hydrographic and bathymetric data back to a data processing facility to chart change over time. I was working on a (now defunct, but promising) design project to develop spotting scopes that were intelligent enough to peer data amongst one another within an operational space, change “spotter calls” into more generally interesting “shot requests”, and aggregate shot providers in the area to engage targets based on type, effect and follow-on damage reports. Whenever any peers had a network connection they could aggregate data externally.

Dude, we’re already there.

What we lack is standards. Oh, wait, nevermind… we actually have tens of thousands of those. What we lack is standards that people can actually use — standards that aren’t harder to learn and comply with than the basic user problems at hand. These problems will mostly never be solved, not really. Truly understandable data must have a semantic foundation, and semantics are essentially arbitrary in most ways that matter in data processing. That means data must either be tagged in a non-trivial way or be placed into a schema where the relationships are what carry meaning.

Note that “tagged in a non-trivial way” above means taking tagging systems to such extremes that they become their own ontologies. Think about that. It should make your face turn pale. That’s at least as difficult as developing an arbitrary formal language. (In case you didn’t notice, an “arbitrary formal” language is an oxymoron — though consortia and governments alike love nothing more than funding committee efforts to formalize the syntax of futile efforts in this area.) Writing even trivial software using such a tagging system would require that programmers at every step of the system learn this arbitrary formal language of tagging before they do much of anything, and that’s a lot harder overall than just continuing on with our pile-of-ad-hoc-systems approach. Schema-based systems, while having some different tradeoffs (computationally natural description of data as a “shape”, for example, is a really big win in practical application), ultimately suffer from the same complexity explosion at some level. In particular, a schema designed in the context of one problem domain will very often not fit the context of another problem domain — and fully normalizing all data, ever, would eventually require an infinite (and ever-growing) number of relational definitions. Yech.

So… Internet of things? Yeah. We’re already living it. Don’t get too excited and don’t give into the hype. Just because you technically can read data from remote sensors or activate your house’s appliances with your phone (hint: you already can) doesn’t mean this is something you will want to do, or that the venture capitalists of the world will peel their lips off the adsearch cock for long enough to realize that there are more interesting things they could be funding than bounce-under ads and invisible iframe reclick-to-click javascript tech.

Rest easy. Humanity’s material circumstances will continue to get incrementally better (save the occasional dip due to predictably stupid things we do to ourselves) until The Singularity when we are all either suddenly eliminated due to obsolescence, or drive ourselves into a new mode of existence-as-slavery to whatever Google turns into (when all data is network accessible, privacy does not exist, all data is the private IP of a single aggregate, the rights of conscious uploaded entities don’t exist, the definition of “life” is still “way way way after birth”, and continued conscious existence equates to paying a service charge — that’s not really life). None of this is particularly dependent upon the hype surrounding the “Internet of Things”.

Almost Always Stupid: Parallel Quicksort

Monday, November 30th, 2015

So you’ve been thinking about applying your awesome parallel programming skills to some problem that is just complex enough to impress the noobs, but just trivial enough to code in a few lines? Cool. Knock yourself out.

But stop trying to parallelize quicksort.

Posting articles, tutorials, SO questions, and blog posts with the term “parallel quicksort” in the title only serves to encourage excitable, well-meaning, naive people to copypasta whatever code they find there and screw up what would have otherwise been straightforward, easy-to-read, relatively performant beginner-style code.

A naive quicksort implementation is pretty darn fast in the common cases found in real programs. There are edge cases where an intelligently chosen pivot can be a significant optimization — but the act of choosing the pivot intelligently is not free, which means before committing to this you have to know a lot about the context in which that particular sort will be used so you can be confident that it won’t be an overall pessimization.
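
For reference, the sort of naive implementation in question — a minimal sketch, not anyone’s production code:

qsort([]) ->
    [];
qsort([Pivot | Rest]) ->
    %% Sort everything under the pivot, then everything at or over it.
    qsort([X || X <- Rest, X < Pivot]) ++ [Pivot | qsort([X || X <- Rest, X >= Pivot])].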

A naive parallel quicksort implementation can only ever be slower than a naive serial quicksort. Why? Think through the serial algorithm: each branch depends on the results of its sub-branches. When we parallelize that, the same dependencies exist, but now you’ve added the overhead of spawning and messaging to the rollup of values at the end. Even in Erlang, where spawning and messaging (on a single node) are amazingly lightweight and fast compared to other environments, spawning and messaging are still more overhead in terms of time and memory than a simple comparison. So a naive parallelized quicksort, while perhaps impressive because you get to show that you’re spawning and messaging everywhere, is never going to be as fast as a normal naive quicksort.
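
To make the overhead concrete, here is a hedged sketch of the naive spawn-everything version (illustrative only; it assumes nothing else is sending lo/hi-tagged messages to these processes):

pqsort([]) ->
    [];
pqsort([Pivot | Rest]) ->
    Parent = self(),
    %% Two spawns and two messages per recursion level, just to do what
    %% a comparison and a cons would have done serially.
    spawn(fun() -> Parent ! {lo, pqsort([X || X <- Rest, X < Pivot])} end),
    spawn(fun() -> Parent ! {hi, pqsort([X || X <- Rest, X >= Pivot])} end),
    Lo = receive {lo, L} -> L end,
    Hi = receive {hi, H} -> H end,
    Lo ++ [Pivot | Hi].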

What about an intelligent parallel quicksort? Ah. Now here we have something potentially interesting. This can indeed be a significant performance win — even bigger than an intelligently implemented serial quicksort in a guaranteed case of lopsided values where you know where the optimal pivot is located. (Consider for a moment how utterly tailored to the situation the pivot choice must be, and how familiar one must be with aspects of that specific use case that have nothing to do with quicksort, to know that any of this is true.)

Let’s assume everything is going your way and you already have a lot of information about the list you need to sort — enough that the overhead of finding smart pivots is outweighed by the overall running time of the sort. Then it can indeed be possible to write a much faster parallel sort. But not by forking more workers than you have schedulers.

The naive parallel quicksort forks/spawns every branch. That’s insane. A smart parallel quicksort will only fork until it has occupied the available scheduling resources of the system (cores, threads, schedulers, whatever the environment has). Why? Because forking any further will only incur scheduling overhead that amounts to busywork the platform has to do just to keep up with multiprocess accounting. That means an intelligent parallel quicksort will have to know not only how to pick an optimum pivot, it will also have to know how to stop forking once it has maxed out the scheduling resources of the system, and it will have to pull one more not-so-simple trick: it will have to know enough about the data being sorted to continue picking smart pivots until it is finished forking (to be faster than a simply partitioned two-threaded implementation).
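
A hedged sketch of what the fork-limiting part alone might look like in Erlang — note that it punts entirely on smart pivot selection, which is the genuinely hard part described above:

psort(List) ->
    %% Stop spawning once 2^Depth reaches the number of online schedulers.
    Schedulers = erlang:system_info(schedulers_online),
    psort(List, trunc(math:log2(Schedulers))).

psort(List, 0) ->
    lists:sort(List);    % depth budget spent: finish serially
psort([], _) ->
    [];
psort([Pivot | Rest], Limit) ->
    Parent = self(),
    spawn(fun() -> Parent ! {lo, psort([X || X <- Rest, X < Pivot], Limit - 1)} end),
    %% Sort the high half ourselves so the parent process isn't idle.
    Hi = psort([X || X <- Rest, X >= Pivot], Limit - 1),
    receive {lo, Lo} -> Lo ++ [Pivot | Hi] end.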

That is a significant amount of arcane information needed about the input, and a not entirely obvious pile of code needed to continue picking pivots, communicate among threads, and stop forking once the optimum number of threads for the environment has been reached. I know how to find out how many schedulers are available in an Erlang node, and I know how to find out how many cores are available on a Linux or BSD system in C. I don’t know how to find this out in Go or Java. I’m not really sure how to do it in Python, either, other than just making a subprocess call — but oops, see, once I do that the code isn’t really Python, it is “Python that works on platform X” (and never mind that most sorts will complete in less time than the subprocess call itself takes… defeating the point). That’s a programmer pessimization. Spending a huge amount of time to write code that by definition probably can’t be made generic and reused later is not a very intelligent use of time (neither is writing this blog post, for that matter).

What else happens, though, when we parallelize quicksort? When we parallelize anything? Nothing is free, so there must be a cost. What are some cases where I want to sort things, and of those what cases are good candidates for parallel quicksort? This list isn’t exhaustive (sorting is pretty common) but thinking through the tradeoffs in just a few common situations may be illustrative. Remember: optimizations are usually tradeoffs. Once a routine has been optimized for the general case the remaining options are about shifting burdens away from dear resources and onto cheap ones (or, in the happiest case, not doing anything).

  1. Sorting a list before passing it to a function that requires a sorted list.
  2. Before an efficient update/merge of two lists (a special case of the above).
  3. Ordering client-side data for presentation.
  4. Organizing log data before flush (can be thought of as a variation of the above).
  5. Building indexes.
  6. Organizing variable length search data (usually text search; a variant of the above).

In any of the cases above is it likely that I might have a set of data that is so huge that I might need to optimize the sort operation? Maybe in cases #1, #2, #4, #5, and very often in #6. So that’s entirely possible.

In any of the cases above, is it likely that I will know enough to be able to build a customized parallel quicksort that is smart enough to pick smart pivots? Hrm… that is a lot trickier. Assuming you might already know something extra about the nature of the input, it is possible that this could be the case in #4 or #5 — but that is not the general case. You’d have to know an awful lot to get this right. The most likely candidate here is actually just #6.

Of the remaining cases (and for the vast majority of actual code, that is really just going to be #6), are there any situations where I want to slow down the whole system just to make a single sort go faster? Is the system doing anything else that might be important? Will this degrade overall performance when run in a cron job? Is perhaps the whole point of having lots of cores that maybe we don’t want the entire system to be affected by a single long-running operation? Consider a database that needs to build a text search index and pass it somewhere else based on whatever has changed in the last 12 hours (or an adsearch index builder that needs to sort billions of items of campaign change data). Is it acceptable for the whole system to experience a performance degradation? I don’t know — only you can answer that question. It might be a big win, it might not. I’ve only encountered a single case* where this was true in actual production code.

Once. Ever.

Now go do a search for “parallel quicksort” and skim over the results with new eyes. You almost never need this. Most articles on parallel quicksort are exactly as useful as the articles on parallel Fibonacci functions (or any Fibonacci functions, really). Sure, some of the code examples are fun to read and muse about, but they don’t get you any closer to solving whatever your actual user problems are.

[* Actually, it was adsearch campaign data, which is why I mentioned it.]

Methodologies in open source development

Sunday, November 15th, 2015

Prompted by an article about scrum in foss projects. This is an incomplete, impulsive rough draft and requires some major revisions.

Nothing makes me facepalm faster than when I get hired to work on a project and the first thing anyone tells me is “This project is going to be great because we are using the best methodologies, you know, like TDD and Scrum and everything. Totally perfect.”

WTF!? How about telling me about the problem the project solves or why it will be an improvement over whatever came before or talking about the architecture or something? There are only three conclusions I can draw from seeing that the only thing someone can think to say when I first enter a project is a word about methodology:

  1. They have no clue what the project actually does.
  2. There are no project goals.
  3. They are hiring me specifically because they have not yet encountered any other competent developers and they know it, yet (amazingly) they feel confident in dictating whatever they decided are “best practices” to me.


Often this results in me backing out of a job, or leaving if I had tentatively agreed to it already — unless they are just going to pay enough to make watching a project tie itself in knots worth my time.

(Incidentally, I was talking with a guy from Klarna the other day and had the exact opposite experience. They know what problem their platform tackles and why it is a really good idea. So it’s not hopeless everywhere, just most places. What is really amazing is that the guy I was speaking to wasn’t even a developer, but he understood what their main product is all about. That’s almost a reason to have actual hope for the future of that particular company. I don’t think I’ve ever written the word “hope” without prefacing it with a word like “unreasonable”, “false”, “sad” or “vain” in an article about software development.)

Today there is a problem with methodologies infecting and then blinding software projects (foss and otherwise). It’s like competing flavors of ISIS taking over the ethos of management and teams alike while sapping the reason from their adherents. It’s not that Scrum or Agile or TDD are necessarily bad — there are a lot of good ideas that can be pulled from them — the problem is that things like Scrum have become religions. Let’s put that in the big letters:

The problem isn’t this or that particular practice, it is the religion part.
— Me, just now

When it is absolutely unacceptable to question the utility of a methodology or specific coding practice, you have a major problem. When you select a particular set of practices without any regard for the context within which those methods or techniques are to be applied, you are simply making uncritical, faith-based decisions. That is like hoping that a man who lives in the clouds will make everything better for you because you move a wooden block back and forth three times a day or kiss the ground a lot.

No, nevermind, actually it is worse than that, because you’ll only lose a few moments of your day moving a little block of wood around, and kissing the ground is pretty quick. Neither has to bother anyone else who is trying to be productive right then. The trouble is when you schedule daily or weekly meetings about block-moving techniques, force everyone to take a “vacation/working retreat” (oxymoron much?) to the tune of hundreds of thousands of dollars in company money to attend seminars given by charlatans about ground-kissing, and then schedule weekend work time and “fun events” like 24-hour “hackathons” and “weekend sprints” to make up for the time lost in coordinating all of the above. That’s way worse.

(Note that in the ranty paragraph above I’m not calling all scrum/agile/whatever coaches charlatans. The real coaches I’ve met (who have themselves written successful software) would agree entirely with what I’m writing here and tend to say straight out that chosen practices must match the context of the project. I am calling the seminar circuit and “methodology certification” guys charlatans. Those shitbags have learned that they can make crazy money by telling sweet, long, loud lies to management in a culture desperate for something to point value-blind investors at as a reason to throw good money after bad. Of course, this also implies that quite a bit of tech management is incompetent and most tech investors are just shooting in the dark.)

Without the ability to question a practice it becomes impossible to evaluate its impact on your goals. For example, sometimes TDD makes a lot of sense, especially when the product is a library. Srsly, if you write libraries and you don’t do TDD then you had better have some other word for something that amounts to the same thing. But sometimes tests — especially when they are, by necessity, integration tests of a non-trivial, comprehensive nature — can be a massive distraction, totally unhelpful, and a noticeable net loss to a project (especially project components that are literally untestable before the heat death of the universe).

The choice of a methodology or technique has to be a goal-based decision. Very often this means a business decision, but sometimes it’s a project-goals-and-ethos type decision. Business decisions are the easiest because they are the most straightforward and the goals are relatively simple. It is a bit different in a foss project where adherence to a particular practice might be an underlying goal or a core part of the social value system that surrounds it. But even in a foss project it isn’t so hard to pin down a goal and determine how a particular practice, when imposed as a rule, will either help, hinder, or confuse the goal.

There is a general incongruity between the context scrum was designed around and the context within which most foss projects exist. Scrum (and agile in general) is about customer-facing code; specifically the case where the customer is a paying entity that is inexpert in the software being developed, while the developers are inexpert in the problem domain being solved. That is the primary Agile use case. It works well there — but this does not describe the situation of most foss projects.

Most foss projects are intended to be used by other software developers or techie users (system operators, interface specialists, DBAs, etc.), and are especially focused around infrastructure. In this case Agile simply doesn’t fit. Some documentation policies don’t even fit: there is a lot of blending among the ideas that “code should be documented”, “code should be commented”, “comments should adhere to a markup that generates documentation”, “the API documentation is the manual” and “there is a product manual”. These are not at all the same things in certain markets, but in the open source world there are shades of meaning there spanning “the code is the documentation” to “literate code” to “we have a separate documentation project”.

I use foss software almost exclusively for work, but my customers don’t. They have no idea of the difference, really, and don’t care over the span of a purchasing decision whether they are getting foss software or proprietary software — they only care that it works to solve their immediate business problem (that closed source sometimes introduces a different, longer-term business problem is not something they are generally prepared to understand the arguments for). In this case I have non-developer users paying for my (incidentally foss) software — and scrum/agile/whatever works very well there as a set of guidelines and practices to draw from when outlining my group’s workflow.

But for the infrastructure stuff we do that sits behind all that, scrum/agile/whatever is totally insane. There is no “sitting in with the customer to develop use cases” or holding a daily scrum meeting when it comes to the network protocols that underlie the whole system, or the crypto suites that render the whole thing safe, or the database schemas that decompose the raw data into meanings relevant on the back-end. That sort of stuff is generally core competency for foss developers, but it doesn’t fit the scrum or agile methodologies at all.

Only when you are working on a wicker-castle of a web project where concerns are so totally interwoven and mishmashed together that front-end customer-facing concerns become backend concerns does agile make any sense — but in this case it only makes sense because you are shipping a giant ball of mud and your hair is on fire every single day because you’ll never catch up with breakages (welcome to the actual modern paradigm that rules 90% of software projects I’ve seen…).

I’ve poked at a few different sides of development. What I’m really getting at is that there are different worlds of software development, and each world is suited better or worse by the different methodologies that have been developed over time. (As an extreme case, consider the lunacy of scrum in most embedded projects.) No methodology I’ve ever seen fits any given project very well, and all of them absolutely work against the interests of a project that exists outside the context from which they originated. That could be because of time constraints (consider the infamous “docs don’t pay” attitude — which shouldn’t be dismissed outright, because it is actually a valid business concern sometimes), because the context of use doesn’t fit, or whatever.

In the same way that we (should) try to pick the best tools (languages, data stores, data structures, network paradigms, etc.) for a given job based on the nature of the task, we should approach selection of development methodology and team workflow based on the nature and context of the project.

Iterators? We Don’t NEED No Stinking Iterators!

Tuesday, September 29th, 2015

Every so often a request for “iterators” over hashes/maps/dicts or some other K-V data structure appears on the mailing list of a functional language. I’ve spent years making heavy use of iterators in imperative languages, and the way they fit into Python is really great. For Python. I totally understand where some of these folks are coming from; they just don’t realize that functional languages are not where they came from.

So… “Is this post the result of some actual event”? Yeah, you got me. It is. On the erlang-questions mailing list someone asked “Are maps ever going to get an iterator?” Again.

Erlang is definitely not Kansas, but people thinking either that it is or (more dangerously) that it should be and then trying to influence the maintainers to make it that way (and then the powers-that-be getting in a panic over “market share” and doing horrible things to the language…) worries me a bit.

There is no faster way to paint a functional language into a corner than to try making it occasionally imperative. Conversely, consider the syntactic corner C++ and Java have painted themselves into by trying to include functional features as after-thoughts where they really didn’t belong.

(I know, I know, death-by-kitchen-sink is a proud C++ tradition. It is becoming one for Java, too. Even though I hate Java, there is no sense in making it worse by cluttering its syntax and littering it with gotchas and newbie-unfriendly readability landmines in the interest of providing features few Java coders understand the point of — especially when the whole concept of state management in a bondage-and-discipline OOP language like Java is to keep everything in structs with legs (not anonymous closures over state that is temporarily in scope…). The lack of such problems was previously one of the main points that favored Java over C++… well, that and actual encapsulation. Hopefully Rust and D can resist this temptation.)

This frustrates me. It is almost as if instead of picking a tool that matches a given job, people learn one tool and then try over time to make a super-powered Swiss Army knife of it. This never turns out well. The result is more Frankenstein’s Monster than Swiss Army knife and in the best case it winds up being hard to learn, confusing to document and crap at everything.

What’s worse, people assume that the first tool they learned well is the standard by which everything else should be judged (if so, then why are they learning anything else?). It follows, then, that if a newly studied LangX does not have a feature of previously used LangY then that feature must be introduced because it is “missing”. (I do admit, though, to wishing other languages had pattern matching in function heads… but I don’t bring this up on mailing lists as if it’s a “missing feature”; I do, however, cackle insanely when overloading is compared with matching.)

Let’s say we did include iterators for maps in Erlang — whatever an “iterator” is supposed to mean in a lists-are-conses type of functional language. What would that enable?

-spec foreach(fun(), map()) -> ok.

That sort of looks pointless. It’s exactly the same as lists:foreach(Fun, maps:to_list(Map)), or maybe lists:foreach(Fun, maps:values(Map)). Without knowing whether we’re trying to build a new map based on the old one or get some side effect out of Fun, it’s hard to know what the point is.


-spec map(fun(), OldMap :: map()) -> {ok, NewMap :: map()}.

But… wait, isn’t that just maps:map/2 all over again?

I think I know where this is going, though. These people really wish maps were ordered dictionaries, because they want keys to be ordered. So they want something like this:

-spec side_effects_in_order_dammit(fun(), map()) -> ok.
side_effects_in_order_dammit(F, M) ->
    Ordered = [{K, maps:get(K, M)} || K <- lists:sort(maps:keys(M))],
    ok = lists:foreach(F, Ordered).

But wait, what order should the keys be in, anyway?

This is a slow, steady march to insanity. “Give me iterators” begets “Let’s have ordered maps” begets “Let’s have ordered iterators for maps” and so on, and eventually you wind up with most of the Book of Genesis in the Devil’s Bible of Previously Decent Functional Languages. All the while, totally forgetting that these things already exist in another form. There are more data structures than just maps for a reason.
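
That other form already ships with OTP. If ordered traversal is the actual requirement, gb_trees keeps keys ordered and provides a real iterator — a quick sketch:

%% gb_trees is ordered by key, and hands out an honest iterator:
Tree = gb_trees:from_orddict([{a, 1}, {b, 2}, {c, 3}]),
Iter = gb_trees:iterator(Tree),
{a, 1, Next} = gb_trees:next(Iter).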

This just gets ridiculous, and it isn’t even what hashes are about to begin with.

Horrible, Drunk Code (but it works and demonstrates a point)

Thursday, September 3rd, 2015

Over on StackOverflow otopolsky was asking about how to make an Erlang program that could read selected lines in a huge file, offset from the bottom, without exploding in memory (too hard).

I mentioned the standard bit about front-loading and caching the work of discovering the linebreak locations, the fact that “a huge text file” nearly always means “a set of really huge log files” and that in this case tokenizing the semantics of the log file within a database is a Good Thing, etc. (my actual answer is here).

He clearly knew most of this, and was hoping that there was some shortcut already created. Well, I don’t know that there is, but it bothered me that his initial stab at following my advice about amortization of linebreak discovery resulted in an attempt to read a 400MB text file in and run a global match over it, and that this just ate up all his memory and made his machine puke. Granted, my initial snippet was a naive implementation that didn’t take size into account at all, but…

400MB? Eating ALL your memory? NO WAY. Something must be done… A call to action!

The main problem is I’m already a bit jeezled up because my wife broke out some 泡盛 (awamori) earlier (good business day today)… so any demo code I will produce will be, ahem, a little less than massive-public-display worthy (not that the 6 or 7 guys on the planet who actually browse Erlang questions on SO would care, or don’t already know who I am). So… I’m posting here:


%% Build an index mapping line numbers to the (1-based) byte position
%% where each line starts, reading the file in fixed-size chunks so the
%% whole thing never has to be in memory at once.
linebreaks(File) ->
    {ok, FD} = file:open(File, [raw, read_ahead, binary]),
    Step = 1000,
    Count = 1,
    Loc = 1,
    Indexes = [],
    Read = file:read(FD, Step),
    {ok, Result} = index(FD, Read, Count, Loc, Step, Indexes),
    ok = file:close(FD),
    [{1, Loc} | Result].

index(FD, {ok, Data}, Count, Loc, Step, Indexes) ->
    NewLines = binary:matches(Data, <<$\n>>),
    {NewCount, Found} = indexify(NewLines, Loc, Count, []),
    Read = file:read(FD, Step),
    index(FD, Read, NewCount, Loc + Step, Step, [Found | Indexes]);
index(_, eof, _, _, _, Indexes) ->
    {ok, lists:reverse(lists:flatten(Indexes))};
index(_, {error, Reason}, _, _, _, Indexes) ->
    {error, Reason, Indexes}.

indexify([], _, Count, Indexed) ->
    {Count, Indexed};
indexify([{Pos, Len} | Rest], Offset, Count, Indexed) ->
    NewCount = Count + 1,
    %% The byte after this newline is where line NewCount begins.
    %% (Tagging with NewCount, not Count, keeps these entries aligned
    %% with the {1, 1} entry prepended in linebreaks/1.)
    indexify(Rest, Offset, NewCount, [{NewCount, Pos + Len + Offset} | Indexed]).

As ugly as that is, it runs in constant space, and the index list produced on a 7,247,560 line, 614,754,920 byte file takes a bit of space itself (a few dozen MB for the 7,247,560 element list…), and temporarily requires a bit more during part of the return operation (a very sudden, brief spike in memory usage right at the end as it returns). But it works, and returns what we were looking for in a way that won’t kill your computer. And… it only takes 14 seconds or so on the totally crappy laptop I’m using right now (an old dual-core Celeron).

Much better than what otopolsky ran into when his computer ran for 10 minutes before it started swapping after eating all his memory!


ceverett@changa:~/foo/hugelogs$ ls -l
合計 660408
-rw-rw-r-- 1 ceverett ceverett       928  9月  3 01:31 goofy.erl
-rw-rw-r-- 1 ceverett ceverett  61475492  9月  2 23:17 huge.log
-rw-rw-r-- 1 ceverett ceverett 614754920  9月  2 23:19 huger.log
ceverett@changa:~/foo/hugelogs$ erl
Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:2:2] [async-threads:10] [kernel-poll:false]

Running $HOME/.erlang.
Eshell V7.0  (abort with ^G)
1> c(goofy).
2> {HugeTime, HugeIndex} = timer:tc(goofy, linebreaks, ["huge.log"]).
3> {HugerTime, HugerIndex} = timer:tc(goofy, linebreaks, ["huger.log"]).
4> HugerTime / 1000000.
5> HugeTime / 1000000.
6> lists:last(HugeIndex).
7> lists:last(HugerIndex).

Rather untidy code up there, but I figure it is readable enough that otopolsky can get some ideas from this and move on.
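
For instance, a hedged sketch of what the index is actually for — seeking straight to line N instead of scanning (the positions above are 1-based, while file:position/2 counts from 0, hence the -1):

read_line_at(File, Index, N) ->
    %% Crashes with badmatch if N isn't in the index, which is fine for a demo.
    {N, Pos} = lists:keyfind(N, 1, Index),
    {ok, FD} = file:open(File, [raw, read, read_ahead, binary]),
    {ok, _} = file:position(FD, Pos - 1),
    {ok, Line} = file:read_line(FD),
    ok = file:close(FD),
    Line.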

XML: Xtensively Mucked-up Lists (or “How A Committee Screwed Up Sexps”)

Friday, April 10th, 2015

XML sucks. I believe that it is, to date, the most perfectly horrible corruption of one of the most universal and simple concepts in computer science: sexps.

ZOMG! Someone screwed up sexps!

Let that one sink in. What a thing to say! How in the world would one even propose to screw up such a simple idea? Let’s consider an example…

Can you identify the semantic difference among the following examples?
(Inspired by the sample XML in the Python xml.etree docs)

Version 1

<country name="Liechtenstein">
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>

Version 2

<country>
  <name>Liechtenstein</name>
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>

Version 3

<country name="Liechtenstein" rank="1" year="2008" gdppc="141100">
  <neighbor name="Austria" direction="E"/>
  <neighbor name="Switzerland" direction="W"/>
</country>

Version 4

And here there is a deliberate semantic difference, meant to be illustrative of a certain property of trees… which is supposedly the whole point.

  <country rank="1" year="2008" gdppc="141100">
      <name direction="E">Austria</name>
      <name direction="W">Switzerland</name>
  </country>

Which one should you choose for your application? Which one is obvious to a parser? From which could you more than likely write a general parsing routine that could pull out data that meant something? Which one could you turn into a program by defining the identifier tags as functions somewhere?

Consider the last two questions carefully. The so-called “big data” people are hilarious, especially when they are XML people. There is a difference between “not a large enough sample to predict anything specific” and “a statistically significant sample from which generalities can be derived”, certainly, but that has a lot more to do with representative sample data (or rather, how representative the sample is) than the sheer number of petabytes you have sitting on disk somewhere. “Big Data” should really be about “Big Meaning”, but we seem to be so obsessed over the medium that we miss the message. Come to think of it, this is a problem that spans the techniverse — it just happens to be particularly obvious and damaging in the realm of data science.
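
For contrast with all four versions above, here is the same country data as unadorned structured terms (Erlang tuples standing in for the sexps the title is mourning) — a sketch, not a schema proposal. There is one obvious shape, and the “attribute or child element?” question never comes up:

%% One obvious shape for the data, no attribute/element ambiguity:
Country = {country, "Liechtenstein",
           [{rank, 1}, {year, 2008}, {gdppc, 141100}],
           [{neighbor, "Austria", east},
            {neighbor, "Switzerland", west}]}.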

The reason I so hate XML is that the complexity and ambiguity introduced in an effort to make the X in XML mean something has crippled it in terms of clarity. What is a data format worth if it confuses the semantics of the data? XML is unnecessarily ambiguous to the people who have to parse (or design, document, discuss, edit, etc.) XML schemas, and it kills any hope of readily converting some generic data represented as XML into a program that can extract its meaning without the extra work of researching a schema — which throws the entire concept of “universality” right out the window.

It’s all a lie. A tremendous amount of effort has been wasted over the years producing tools that do nothing more than automate away the mundane annoyances of dealing with the stupid way in which the structure is serialized. These efforts have been trumpeted as a major triumph, and yet they don’t tell us anything about the resulting structure, which itself is still more ambiguous than plain old sexps would have been. It’s not just that it’s a stupid angle-bracket notation when serialized (that’s annoying, but forgivable: most sexps are delimited by annoying parens, obnoxious semantic whitespace, or confusing ant-poop — there just is no escape from the tyranny of ASCII). XML structure is broken and ambiguous no matter what representation it takes as characters in a file.

Erlang: Writing Terms to a File for file:consult/1

Sunday, March 8th, 2015

I notice that there are a few little helper functions I seem to always wind up writing given different contexts. In Erlang one of these is an inverse function for file:consult/1, which I have to write any time I use a text file to store config data*.

Very simply:

write_terms(Filename, List) ->
    Format = fun(Term) -> io_lib:format("~tp.~n", [Term]) end,
    Text = lists:map(Format, List),
    file:write_file(Filename, Text).

[Note that this *should* return the atom 'ok' — and if you want to assert success (crashing on failure), you want to do ok = write_terms(Filename, List) in your code.]

This separates each term in a list by a period in the text file, which causes file:consult/1 to return the same list back (in order — though this detail usually does not matter because most conf files are used as proplists and are keysearched anyway).
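
A quick round trip with a made-up proplist (the file name and keys here are hypothetical):

%% Write, then read back; the final match asserts the round trip is lossless.
Conf = [{port, 5432}, {host, "localhost"}, {retries, 3}],
ok = write_terms("app.conf", Conf),
{ok, Conf} = file:consult("app.conf").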

An annoyance with most APIs is a lack of inverse functions where they could easily be written. Even if the original authors of the library don’t conceive of a use for an inverse of some particular function, whenever there is an opportunity for this leaving it out just makes an API feel incomplete (and don’t get me started on “web APIs”… ugh). This is just one case of that. Why does Erlang have a file:consult/1 but not a file:write_terms/2 (or “file:deconsult/2” or whatever)? I don’t know. But this bugs me in most libs in most languages — this is the way I usually deal with this particular situation in Erlang.

[* term_to_binary/1 ←→ binary_to_term/1 is not an acceptable solution for config data!]

Erlang: Maps, Comprehensions and Side-effecty Iteration

Saturday, February 28th, 2015

In Erlang it is fairly common to want to perform a side-effecty operation over a list of values, not because you want to collect an aggregate (fold), actually map the input list to the output list (map), or build a new list in some way (list comprehension) but because you just want the side-effects of the operation.

The typical idiom (especially for broadcasting messages to a list of processes) is to use a list comprehension for this, or sometimes lists:map/2:

%% Some procedural side-effect, like spawning list of processes or grabbing
%% a list of resources:
[side_effecty_procedure(X) || X <- ListOfThings],

%% Or, for broadcasting:
[Pid ! SomeMessage || Pid <- ListOfPids],

%% And some old farts still do this:
lists:map(fun side_effecty_procedure/1, ListOfThings),

%% Or even this (gasp!) which is actually made for this sort of thing:
lists:foreach(fun side_effecty_procedure/1, ListOfThings),
%% but lacks half of the semantics I describe below, so this part
%% of the function namespace is already taken... (q.q)

That works just fine, and is so common that list comprehensions have been optimized to handle this specific situation in a way that avoids creating a return list value if it is clearly not going to be assigned to anything. I remember thinking this was sort of ugly, or at least sort of hackish before I got accustomed to the idiom, though. “Hackish” in the sense that this is actually a syntax intended for the construction of lists and only incidentally a useful way to write a fast side-effect operation over a list of values, and “ugly” in the sense that it is one of the few places in Erlang you can’t force an assertion to check the outcome of a side-effecty operation.

For example, there is no equivalent to the assertive ok = some_procedure() idiom, or even the slightly more complex variation used in some other situations:

case foo(Bar) of
    ok    -> Continue();
    Other -> Other
end

A compromise could be to write an smap/2 function, defined something like:

smap(F, List) ->
    smap(F, 1, List).

smap(_, _, []) -> ok;
smap(F, N, [H|T]) ->
    case F(H) of
        ok              -> smap(F, N + 1, T);
        {error, Reason} -> {error, {N, Reason}}
    end.

But this now has the problem of requiring that whatever is passed as F/1 have a return type of ok | {error, Reason}, which is unrealistic without forcing a lot of folks to wrap existing side-effecty functions in something that coerces the return type to match. Though that might not be bad practice ultimately, it’s still perhaps more trouble than it’s worth.
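
Such a wrapper is small, at least. A hedged sketch, catching crashes and coercing them into the contract smap/2 wants:

%% Turn any side-effecty fun into one returning ok | {error, Reason}.
coerce(F) ->
    fun(X) ->
        try
            _ = F(X),
            ok
        catch
            Class:Reason -> {error, {Class, Reason}}
        end
    end.

%% Usage: smap(coerce(fun side_effecty_procedure/1), ListOfThings).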

It isn’t like most Erlang code is written in a way where side-effecty iteration over a list is likely to fail, and if it actually does fail the error data from a crash will contain whatever was passed to the side-effecty function that failed. But this still just doesn’t quite sit right with me — that leaves the prospect of list iteration in the interest of achieving a series of side effects as at least a little bit risky to do in the error kernel (or “crash kernel”) of an application.

On the other hand, the specific case of broadcasting to a list of processes would certainly be nice to handle exactly the same way sending to a single process works:

%% To one process
Pid ! SomeMessage,
%% To a list of processes
Pids ! SomeMessage,

Which seems particularly obvious syntax, considering that the return of the ! form of send/2 is the message itself, meaning that the following two would be equivalent if the input to the ! form of send/2 also accepted lists:

%% The way it works now
Pid1 ! Pid2 ! SomeMessage,
%% Another way I wish it worked
[Pid1, Pid2] ! SomeMessage,
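
Failing a change to ! itself, a two-line shim gets the list behavior while preserving the “send returns the message” contract so it still chains:

%% Broadcast a message and return it, like send/2 does for one process.
send_all(Pids, Message) when is_list(Pids) ->
    _ = [Pid ! Message || Pid <- Pids],
    Message.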

In any case, this is clearly not the sort of language wart that gets much attention, and it’s never been a handicap for me. It just seems a bit hackish and ugly to essentially overload the semantics of list comprehensions, or (ab)use existing list operations like maps and folds, to achieve the side effect of iterating over a list, instead of having a function like smap/2 that is explicitly designed for side-effecty iteration.

OpenSSL RSA DER public key PKCS#1 OID header madness

Thursday, February 26th, 2015

(Wow, what an utterly unappealing post title…)

I have run into an annoying issue with the DER output of RSA public keys in OpenSSL. The basic problem is that OpenSSL adds an OID header to its ASN.1 DER output, but other tools are not expecting this (be it a keystore in Java, the iPhone keystore, or Erlang’s public_key module).

I first noticed this when experiencing decode failures in Erlang trying to use the keys output by an openssl command sequence like:

openssl genpkey \
    -algorithm rsa \
    -out $keyfile \
    -outform DER \
    -pkeyopt rsa_keygen_bits:8192

openssl rsa \
    -inform DER \
    -in $keyfile \
    -outform DER \
    -pubout \
    -out $pubfile

Erlang’s public_key:der_decode('RSAPrivateKey', KeyBin) would give me the correct #'RSAPrivateKey' record, but public_key:der_decode('RSAPublicKey', PubBin) would choke and give me an asn1 parse error (ML thread I posted about this). Of course, OpenSSL is expecting the OID header, so it works fine there, just not anywhere else.

Most folks probably don’t notice this, though, because the primary use case for most folks is either to use OpenSSL directly to generate and then use keys, use a tool that calls OpenSSL through a shell to do the same, or use a low-level tool to generate the keys and then use the same library to use the keys. Heaven forbid you try to use an OpenSSL-generated public key in DER format somewhere other than OpenSSL, though! (Another reason this sort of thing usually doesn’t cause problems is that almost nobody fools around with crypto stuff in their tools to begin with, leaving this sort of thing as an issue far to the outside of mainstream hackerdom…)

Of course, the solution could be to chop off the header, which happens to be 24 bytes long:

1> {ok, OpenSSLBin} = file:read_file("").
2> {ok, ErlangBin} = file:read_file("").
3> <<_:24/binary, ChoppedBin/binary>> = OpenSSLBin.
4> ChoppedBin = ErlangBin.

But… that’s pretty darn arbitrary. It turns out the header is always the same ASN.1 prelude: an outer SEQUENCE tag, an AlgorithmIdentifier SEQUENCE carrying the rsaEncryption OID (1.2.840.113549.1.1.1) and a NULL, and finally the BIT STRING tag that wraps the actual PKCS#1 key data.
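
So a match that names the parts is at least possible — a hedged sketch, where the two-byte long-form lengths assume a key as large as the 8192-bit one generated above:

<<16#30, 16#82, _TotalLen:16,        % outer SEQUENCE, long-form length
  16#30, 16#0D,                      % AlgorithmIdentifier SEQUENCE
  16#06, 16#09, _OID:9/binary,       % OID 1.2.840.113549.1.1.1 (rsaEncryption)
  16#05, 16#00,                      % NULL parameters
  16#03, 16#82, _BitLen:16, 16#00,   % BIT STRING wrapper, zero unused bits
  ChoppedBin/binary>> = OpenSSLBin.
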
I could match on that, but it still feels weird — it’s the same chop, just with labels on it — so matching on the header is still the same hack. I could, of course, go around it by writing just the public key as PEM, loading it, encoding it to DER, and saving that as the DER file from within Erlang (thus creating a same-tool-to-same-tool situation). But that’s goofy too: PEM is just a base64-encoded DER binary wrapped in a text header/footer! It’s completely arbitrary that PEM should work but DER shouldn’t! Anyway, that goofy workaround looks like this:


start([Prefix]) ->
    PemFile = string:concat(Prefix, ".pub.pem"),
    KeyFile = string:concat(Prefix, ".key.der"),
    PubFile = string:concat(Prefix, ".pub.der"),
    {ok, PemBin} = file:read_file(PemFile),
    [PemData] = public_key:pem_decode(PemBin),
    Pub = public_key:pem_entry_decode(PemData),
    PubDer = public_key:der_encode('RSAPublicKey', Pub),
    ok = file:write_file(PubFile, PubDer),
    io:format("Wrote private key to: ~ts.~n", [KeyFile]),
    io:format("Wrote public key to:  ~ts.~n", [PubFile]),
    case check(KeyFile, PubFile) of
        true  ->
            ok = file:delete(PemFile),
            io:format("~ts and ~ts agree~n", [KeyFile, PubFile]);
        false ->
            io:format("Something has gone wrong.~n")
    end.

check(KeyFile, PubFile) ->
    {ok, KeyBin} = file:read_file(KeyFile),
    {ok, PubBin} = file:read_file(PubFile),
    Key = public_key:der_decode('RSAPrivateKey', KeyBin),
    Pub = public_key:der_decode('RSAPublicKey', PubBin),
    TestMessage = <<"Some test data to sign.">>,
    Signature = public_key:sign(TestMessage, sha512, Key),
    public_key:verify(TestMessage, sha512, Signature, Pub).

This is so silly it’s maddening — and apparently folks from the iOS, PHP and Java worlds have essentially built themselves hacks to handle DER keys that deal directly with this.

I finally found a way that is both semantically meaningful (well, somewhat) and is sure to either generate a proper PKCS#1 DER public RSA key or fail, telling me that it’s looking at badly-formed ASN.1 (or binary trash): the magical command “openssl asn1parse -strparse [offset]”.

A key generation script therefore looks something like this:

#! /bin/bash

prefix=${1:?"Cannot proceed without a file prefix."}
keyfile="${prefix}.key.der"
pubfile="${prefix}.pub.der"

# Generate an 8192-bit RSA key
openssl genpkey \
    -algorithm rsa \
    -out $keyfile \
    -outform DER \
    -pkeyopt rsa_keygen_bits:8192

# OpenSSL's PKCS#1 ASN.1 output adds a 24-byte header that
# other tools (Erlang, iOS, Java, etc.) choke on, so we clip
# the OID header off with asn1parse.
openssl rsa \
    -inform DER \
    -in $keyfile \
    -outform DER \
    -pubout \
| openssl asn1parse \
    -strparse 24 \
    -inform DER \
    -out $pubfile

The output on stdout will look something like:

$ openssl asn1parse -strparse 24 -inform DER -in 
    0:d=0  hl=4 l=1034 cons: SEQUENCE          
    4:d=1  hl=4 l=1025 prim: INTEGER           :CAA78299F24DC4FCA78E9F110D459429A1322C8AF0BF558B7E49B21A30D5EFDF0FFF512C15E5292CD85209EBB21861235250ABA3FBD48F0830CC3F355DDCE5BA467EA03F58F91E12985E54336309D8F954525802949FA68A5A742B4933A5F26F148B912CF60D5A140C52C2E72D1431FA2D40A3E3BAFA8C166AF13EBAD774E7FEB17CAE94EB12BE97F3CABD668833691D2A2FB3001BF333725E65081F091E46597028C3D50ABAFAF96FA2DF16E6AEE2AE4B8EBF4360FBF84C1312001A10056AF135504E02247BD16C5A03C5258139A76D3186753A98B7F8E7022E76F830863536FA58F08219C798229A23D0DDB1A0B57F1A4CCA40FE2E5EF67731CA094B03CA1ADF3115D14D155859E4D8CD44F043A2481168AA7E95CCC599A739FF0BB037969C584555BB42021BDB65C0B7EF4C6640CCE89A860D93FD843FE77974607F194206462E15B141A54781A5867557D8A611D648F546B72501E5EAC13A4E01E8715DFE0B8211656CFA67F956BC93EA7FB70D4C4BCFEC764128EAE8314F15F2473FCF4C6EEA29299D0B13C2CCC11A73BF58209A056CB44985C81504013529C75B6563CED3CC27988C70733080B01F78726A3ABBCD8765538D515D2282EF79F4818A807A9173CC751E46BDB930E87F8DA64C6ABDD8127900D94A20EE87F435716F4871463854AD0B7443243BDA0682F999D066EA213952F296089DA4577B7B313D536185E7C37E68F87D43A53B37F1E0FB7969760F31C291FDB99F63F6010E966EC418A10EFE7D4EBD4B494693965412F0E78150150B61A4FF84AB26355FC607697B86BFB014E6450793D6BF5EA0565C63161D670CD64E99FFD9AE9CCF90C6F77949A137E562E08787D870604825EF955B3A26F078CBA5E889F9AFEEF48FE7F36A51D537E9ADF0746EA81438A91DF088C6C048622AC372AEEBD11A2427DFCBC2FE163FD62BE3DCBDC4961ECD685CBAD55898B22E07A8C7E3FB4AEA8AC212F9CA6D6FE522EC788FEDD08E403BD70D1ABEDE03B760057920B27B70CBDCB3C1977A7B21B5CD5AF3B7D57A0B2003864C4A56686592CBFC1BBAF746E35937A3587664A965B806C06C94EC78119A8F3BB18BCFCEB1E5193C0EFD025A86AD6C2AE81551475A14B81D5A153E993612E1D7AED44AB38E9A4915D4A9B3A3B4B2DF2C05C394FC65B46E3983EA0F951B04B7558099DB04E73F550D08A453020EFD7E1A4F2383C90E8F92A2A44CC03A5E0B4F024656741464D86E89EA66526A11193435EB51AED58438AA73A4A74ABF3CDE1DE645D55A09E33F3408AD340890A72D2C6E4958669C54B87B7C146611BFD6594519CDC9D9D76DF4D5F9D391F4A8D851FF4DAE5207585C0A544A817801868AA5106869B995C66FB45D3C27EE27C078084F58476A7286E9EE5C1EF76A6CF17F8CDD829EFBB0C5CED4AD3D1C2F101AB90DC68CEA93F7B891B217
 1033:d=1  hl=2 l=   3 prim: INTEGER           :010001

And if we screw it up and don’t actually align with the ASN.1 header properly things explode:

$ openssl asn1parse -strparse 32 -inform DER -in 
Error parsing structure
140488074528416:error:0D07207B:asn1 encoding routines:ASN1_get_object:header too long:asn1_lib.c:150:
140488074528416:error:0D068066:asn1 encoding routines:ASN1_CHECK_TLEN:bad object header:tasn_dec.c:1306:
140488074528416:error:0D06C03A:asn1 encoding routines:ASN1_D2I_EX_PRIMITIVE:nested asn1 error:tasn_dec.c:814:

Now my real question is… why isn’t any of this documented? This was a particularly annoying issue to work my way around, has obviously affected others, and yet is obscure enough that almost no mention of it can be found in documentation anywhere. GHAH!

JSON and YAML: Not a pair that fits every foot (but XML sucks)

Friday, December 12th, 2014

It is good to reflect on exactly how hard a problem it is to define a consistent cross-platform data representation. Most of the time (especially on the web) we just shovel data around, let things be inconsistent, avoid conflicts by pretending they don’t happen, and carry a general disregard for data consistency. This attitude is, sadly, what has come to characterize “NoSQL” in my mind, though in a strict sense that is not true at all (GIS and graph databases aren’t SQL systems, and some are very solid — PostGIS being the exception in that it is a surprisingly well-made extension to a surprisingly solid SQL-based RDBMS).

Obviously this isn’t a good attitude to have when dealing with things more important than small games or social media distractions. That said, most of the code written today seems to fall into those two categories, and many a career is spent exclusively roaming the range between these two (and whether we should consider most of the crap that constitutes the web a “game” itself is worth thinking about, whether we think of SEO, mindshare in the blogosphere, StackExchange rep, Facebook likes/friends/whatever, pingbacks, comment counts, etc.). We focus so much on these trivial and often meaningless cases that an entire generation of would-be programmers has no idea what the shape of data is really about.

When you really need a consistent data representation that can survive the network (ouch! that’s no mean feat!), can consistently be coerced into a known, predictable, serialized representation, and can be handled by generated code in nearly any language you need ASN.1.

But ASN.1 is hard to learn (or even find resources on outside of telecom projects), while JSON and YAML are easy to reference and (initially) use. XML was made unnecessarily hard, I think as a cosmic joke on people who never heard the term “S-expression”, but very basic XML seems easy, even if it’s something you would never want to type by hand (though that always seems to wind up being necessary, despite our best efforts at tooling…).

Why not just use JSON, YAML or XML everywhere? That bit above, about a consistent representation — that’s why. Well, that’s part of why. Another part is that despite your best efforts to define things in XML or nest explicit declarations in YAML/JSON, you will always wind up either missing something, or finding yourself needing to change some type information you embedded as a nested element in your data definition — and then needing to write a sed or awk script just to modify it all later. (And if you’re the type who thinks “ah, a simple search/replace in my IDE…”, and “IDE” to you doesn’t basically equate to “Emacs” or your shell itself, you’re going to a gunfight with boxing gloves on — if you need a better IDE to manage your language then you really need a better language.)

The problem with YAML/JSON/XML is twofold: your data types are not defined anywhere, so while you may have a standard of sorts somewhere, there is no way to enforce that standard. An alternative is to include type information everywhere — within your tags as attributes (in XML), or by nesting tagged groups, or by creating a massive reference chain of type -> reference pointer -> data entry in YAML (or nesting everything to an insane degree in JSON) — but then making changes to the type of a field in a record type you have 20 million instances of is… problematic.

And we haven’t even discussed representation. “But its all text, right?” Oh… you silly person…

Everything is numbers. Two numbers, actually, 0 and 1. We know this, and we sort of forget that there are several ways of interpreting those numbers as bigger numbers, and those bigger numbers as textual components, and those textual components as actual text, and that actual text (finally) as the glyphs you see when you use “the typewriter part” or look at “the TV part” (or do anything with the little touchscreens we use everywhere these days, for which nobody seems to have worked out a genuinely solid interface solution just yet).

Every layer of that chain of interpretation I mentioned above can be done several ways. Every layer. Think about that for a second. Now, if you live purely in a single world (like modern Linuxes and probably newer versions of OSX) where there is only UTF-8, then about half the possible permutations are eliminated. If you only ever deal with unaccented characters that fall in the primary 127 defined by ASCII, then several more permutations are eliminated — and you should dance with joy.

Unless you deal with a bit of non-textual data in addition to the textual stuff. You know, like pictures and sounds and application-produced opaque binary data and whatnot. If that’s the case, you should tremble. Or… oh god, no… what if your data doesn’t stand alone? What if all those letters are supposed to actually mean something? “We have lots of data” isn’t nearly as important to customers as “we have lots of meanings” — but don’t ask a customer about that directly; they have no idea what you mean, because all the text stuff already means something to them.