Yesterday I came across a blog post from 2010 that said fewer than 10% of programmers can write a binary search. At first I thought “ah, what nonsense” and then I realized I probably haven’t written one myself, at least not since BASIC and Pascal were what the cool kids were up to in the ’80s.

So, of course, I had a crack at it. There was an odd stipulation that made the challenge interesting — you couldn’t test the algorithm until you were confident it was correct. In other words, it had to work the first time.

I was wary of fencepost errors (perhaps being self-aware that spending more time in Python and Lisp(s) than C may have made me lazy with array indexes lately) so, on a whim, I decided to use a random window index to guarantee that I was in bounds each iteration. I also wrote it in a recursive style, because it just makes more sense to me that way.
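Roughly, the idea looks like the sketch below. This is an illustrative reconstruction, not the code that was run for the measurements: the names `find_index` and `pick`, and the `None` return for a missing target, are assumptions added for the example.

```python
import random

def find_index(target, items, lo=0, hi=None, pick=None):
    """Return the index of target in the sorted list items, or None if absent."""
    if hi is None:
        hi = len(items) - 1
    if lo > hi:
        # Empty window: the target is not in the list.
        return None
    # pick(lo, hi) must return an index inside the inclusive window,
    # so items[i] can never be out of bounds.
    i = pick(lo, hi) if pick is not None else (lo + hi) // 2
    if target > items[i]:
        return find_index(target, items, i + 1, hi, pick)
    if target < items[i]:
        return find_index(target, items, lo, i - 1, pick)
    return i

if __name__ == "__main__":
    data = [2, 3, 5, 7, 11, 13]
    print(find_index(7, data))                       # midpoint pivot
    print(find_index(7, data, pick=random.randint))  # random pivot
    print(find_index(8, data))                       # target not present
```

Passing `random.randint` as `pick` works because it returns an integer from the inclusive range it is given, which is exactly the window being searched.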

Two things stuck out to me.

Though I was sure what I had written was an accurate representation of what I *thought* binary search was all about, I couldn’t actually recall ever seeing an implementation, having never taken a programming or algorithm class before (and *still* owning zero books on the subject, despite a personal resolution to remedy this last year…). So while I was confident that my algorithm would return the index of the target value, I wasn’t at all sure that I was implementing a “binary search” to the snob-standard.

The other thing that made me think twice was simply whether or not I would ever breach the recursion depth limit in Python on really huge sets. Obviously this is possible, but was it likely enough that it would occur in the span of a few thousand runs over large sets? Sometimes what seems statistically unlikely can pop up as a hamstring slicer in practical application. In particular, were the odds good that a random guess would lead the algorithm to follow a series of really bad guesses, and therefore occasionally blow up? On the other hand, were the odds better that random guesses would occasionally be *so* good that, on average, a random index beats a halved one (of course, the target itself is always random, so how does this balance)?

I didn’t do any paperwork on this to figure out the probabilities; I just ran the code several thousand times and averaged the results, which were remarkably uniform.
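For the curious, the paperwork can be approximated without running a single search. The sketch below is not something measured in this post. It assumes the random-window search behaves like a lookup in a randomly built binary search tree, for which the textbook average cost of a successful search is 2(1 + 1/n)H_n - 3 probes, and it counts the midpoint case exactly by recursion. All function names are hypothetical.

```python
import math
import sys

EULER_GAMMA = 0.5772156649015329

def harmonic(n):
    """n-th harmonic number: exact sum for small n, asymptotic estimate otherwise."""
    if n < 10000:
        return sum(1.0 / k for k in range(1, n + 1))
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

def avg_probes_random(n):
    """Expected probes with a uniformly random pivot and a uniformly random target.

    Assumption: same cost as a successful lookup in a randomly built
    binary search tree, i.e. 2(1 + 1/n)H_n - 3.
    """
    return 2.0 * (1.0 + 1.0 / n) * harmonic(n) - 3.0

_memo = {0: 0}

def total_probes_halving(n):
    """Probes summed over every possible target when the pivot is the midpoint."""
    if n not in _memo:
        left = (n - 1) // 2    # members below the midpoint (lo + hi) // 2
        right = n - 1 - left   # members above it
        _memo[n] = n + total_probes_halving(left) + total_probes_halving(right)
    return _memo[n]

def avg_probes_halving(n):
    return total_probes_halving(n) / float(n)

if __name__ == "__main__":
    n = 2 ** 26
    print("halving: %.2f" % avg_probes_halving(n))
    print("random:  %.2f" % avg_probes_random(n))
    print("recursion limit: %d" % sys.getrecursionlimit())
```

Under those assumptions the averages for a list of 2**26 members come out near 25 and 34, which lines up with the measured figures quoted further down (those are truncated by integer division). Both sit far below CPython's usual default recursion limit of 1000. The worst case for a random pivot is still one probe per member, just vanishingly unlikely.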

I split the choice of index into two different procedures: one that narrows the window to be searched randomly, and another that does it by dividing by two. Then I made it iterate over ever larger random sets (converted to sorted lists) until I ran out of memory. It turns out a list sort needs more than 6 GB at around 80,000,000 members or so.

I didn’t spend any time rewriting to clean things up to pursue larger lists (appending guaranteed larger members instead of sorting would probably permit astronomically huge lists to be searched within 6 GB of memory), but the results were pretty interesting when comparing binary search by window halving with binary search by random window narrowing. It turns out that halving is quite consistently better, but not by much, and the gap may narrow at larger values (but I’m not going to write a super huge list generator to test this idea just now).
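The "append guaranteed larger members" idea could look something like the sketch below. It is untested against the memory limits mentioned here, and the name `sorted_random_list` and the `max_step` and `seed` parameters are made up for illustration.

```python
import random

def sorted_random_list(n, max_step=1000, seed=None):
    """Build a strictly increasing list of n integers without sorting.

    Each member is the previous one plus a random positive step, so the
    list is sorted and duplicate-free by construction.
    """
    rng = random.Random(seed)
    out = []
    value = 0
    for _ in range(n):
        value += rng.randint(1, max_step)  # step >= 1 keeps members unique
        out.append(value)
    return out

if __name__ == "__main__":
    l = sorted_random_list(10, seed=42)
    print(l)
```

The values are distributed differently from a sorted uniform sample, but the search only ever compares members, so the probe count depends on the target's position and the chosen indexes, not on the values themselves.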

It seems like something about these results is exploitable. But even if it were, the difference between iterating 24 times instead of 34 over a list of over 60,000,000 members to find a target item isn’t much in the grand scheme of things. That said, it’s mind-boggling how far from Python’s recursion depth limit one stays, even when searching such a large list.

Here is the code (Python 2).

    from __future__ import print_function
    import random

    def byhalf(r):
        # Midpoint of the inclusive window r = (low, high).
        return (r[0] + r[1]) // 2

    def byrand(r):
        # Random index inside the inclusive window r = (low, high).
        return random.randint(r[0], r[1])

    def binsearch(t, l, r=None, z=0, op=byhalf):
        # Returns z, the number of probes needed to find t in sorted list l.
        # Assumes t is present in l.
        if r is None:
            r = (0, len(l) - 1)
        i = op(r)
        z += 1
        if t > l[i]:
            return binsearch(t, l, (i + 1, r[1]), z, op)
        elif t < l[i]:
            return binsearch(t, l, (r[0], i - 1), z, op)
        else:
            return z

    def doit(z, x):
        # Build a sorted list of up to x unique random integers below z.
        l = list(set((int(z * random.random()) for i in xrange(x))))
        l.sort()
        res = {'half': [], 'rand': []}
        for i in range(1000):
            # Pick the target from the list itself so every search succeeds.
            # random.choice can also select the last member and copes with
            # a one-member list.
            target = random.choice(l)
            res['half'].append(binsearch(target, l, op=byhalf))
            res['rand'].append(binsearch(target, l, op=byrand))
        # Integer division: the averages are truncated to whole probes.
        print('length: {0:>12} half:{1:>4} rand:{2:>4}'
              .format(len(l),
                      sum(res['half']) // len(res['half']),
                      sum(res['rand']) // len(res['rand'])))

    for q in [2 ** x for x in range(27)]:
        doit(1000000000000, q)

Something just smells exploitable about these results, but I can’t put my finger on why just yet, and I don’t have time to think about it further. Anyway, it seems that using a random index to make certain you stay within bounds doesn’t hurt performance as much as I thought it would. A perhaps useless discovery, but personally interesting nonetheless.

The moment I saw it, I went “aha” – it provides an answer to a question I had for a long time about usability vs. performance concerns in scrolling of lists of non-uniform item size, where finding out item size is expensive. Say, laying out a paragraph of text just to figure out how tall it is.

Your finding translates to:

It’s perfectly fine to generate text editor widget scrollbars as if every paragraph in a long, multi-paragraph text had the same size. From a user perspective – a perspective of an instinctively done binary search – it will merely make the search for a particular piece of text slightly longer, but not overly so. Twice as long, if the paragraph length distribution were uniform.

It’d be interesting to run this on various distributions and see if any of them are particularly pathological.

I think I’ll code this insight into QListView to get it to work reasonably on a large number of items. But, more importantly, this optimization really belongs in QPlainTextEdit, which is painfully slow on any changes (even append!).

@Kuba Ober

“It’d be interesting to run this on various distributions and see if any of them are particularly pathological.”

That had not occurred to me, since most of the data I deal with is, at least from a very high level, either sequential or unpredictably clustered (and I haven’t gone to the trouble to profile what the clustering types might be). This is quite an interesting idea.