Horrible, Drunk Code (but it works and demonstrates a point)

Over on StackOverflow otopolsky was asking about how to make an Erlang program that could read selected lines in a huge file, offset from the bottom, without exploding in memory (too hard).

I mentioned the standard bit about front-loading and caching the work of discovering the linebreak locations, the fact that “a huge text file” nearly always means “a set of really huge log files” and that in this case tokenizing the semantics of the log file within a database is a Good Thing, etc. (my actual answer is here).

He clearly knew most of this, and was hoping that there was some shortcut already created. Well, I don’t know that there is, but it bothered me that his initial stab at following my advice about amortization of linebreak discovery resulted in an attempt to read a 400MB text file in to run a global match over it, and that this just ate up all his memory and made his machine puke. Granted, my initial snippet was a naive implementation that didn’t take size into account at all, but…

400MB? Eating ALL your memory? NO WAY. Something must be done… A call to action!

The main problem is I’m already a bit jeezled up because my wife broke out some 泡盛 earlier (good business day today)… so any demo code I will produce will be, ahem, a little less than massive-public-display worthy (not that the 6 or 7 guys on the planet who actually browse Erlang questions on SO would care, or don’t already know who I am). So… I’m posting here:

-module(goofy).
-export([linebreaks/1]).

linebreaks(File) ->
    {ok, FD} = file:open(File, [raw, read_ahead, binary]),
    Step = 1000,
    Count = 1,
    Loc = 1,
    Indexes = [],
    Read = file:read(FD, Step),
    {ok, Result} = index(FD, Read, Count, Loc, Step, Indexes),
    ok = file:close(FD),
    [{1, Loc} | Result].

index(FD, {ok, Data}, Count, Loc, Step, Indexes) ->
    NewLines = binary:matches(Data, <<$\n>>),
    {NewCount, Found} = indexify(NewLines, Loc, Count, []),
    Read = file:read(FD, Step),
    index(FD, Read, NewCount, Loc + Step, Step, [Found | Indexes]);
index(_, eof, _, _, _, Indexes) ->
    {ok, lists:reverse(lists:flatten(Indexes))};
index(_, {error, Reason}, _, _, _, Indexes) ->
    {error, Reason, Indexes}.

indexify([], _, Count, Indexed) ->
    {Count, Indexed};
indexify([{Pos, Len} | Rest], Offset, Count, Indexed) -> 
    NewCount = Count + 1,
    indexify(Rest, Offset, NewCount, [{Count, Pos + Len + Offset} | Indexed]).

As ugly as that is, it runs in constant space and the index list produced on a 7,247,560 line 614,754,920 byte file appears to take a bit of space (a few dozen MB for the 7,247,560 element list…), and temporarily requires a bit more space during part of the return operation (very sudden, brief spike in memory usage right at the end as it returns). But it works, and returns what we were looking for in a way that won’t kill your computer. And… it only takes a 14 seconds or so on the totally crappy laptop I’m using right now (an old dual-core Celeron).

Much better than what otopolsky ran into when his computer ran for 10 minutes before it started swapping after eating all his memory!

Results:

lceverett@changa:~/foo/hugelogs$ ls -l
合計 660408
-rw-rw-r-- 1 ceverett ceverett       928  9月  3 01:31 goofy.erl
-rw-rw-r-- 1 ceverett ceverett  61475492  9月  2 23:17 huge.log
-rw-rw-r-- 1 ceverett ceverett 614754920  9月  2 23:19 huger.log
ceverett@changa:~/foo/hugelogs$ erl
Erlang/OTP 18 [erts-7.0] [source] [64-bit] [smp:2:2] [async-threads:10] [kernel-poll:false]

Running $HOME/.erlang.
Eshell V7.0  (abort with ^G)
1> c(goofy).
{ok,goofy}
2> {HugeTime, HugeIndex} = timer:tc(goofy, linebreaks, ["huge.log"]).
{1404370,
 [{1,1},
  {2,119},
  {3,220},
  {4,...},
  {...}|...]}
3> {HugerTime, HugerIndex} = timer:tc(goofy, linebreaks, ["huger.log"]).
{14245673,
 [{1,1},
  {2,119},
  {3,220},
  {4,...},
  {...}|...]}
4> HugerTime / 1000000.
14.245673
5> HugeTime / 1000000.
1.40437
6> lists:last(HugeIndex).
{724757,61475493}
7> lists:last(HugerIndex).
{7247561,614754921}

Rather untidy code up there, but I figure it is readable enough that otopolsky can get some ideas from this and move on.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.