Tag Archives: algorithm

Binary Search: Random Windowing Over Large Sets

Yesterday I came across a blog post from 2010 that said less than 10% of programmers can write a binary search. At first I thought “ah, what nonsense” and then I realized I probably haven’t written one myself, at least not since BASIC and Pascal were what the cool kids were up to in the 80’s.

So, of course, I had a crack at it. There was an odd stipulation that made the challenge interesting — you couldn’t test the algorithm until you were confident it was correct. In other words, it had to work the first time.

I was wary of fencepost errors (perhaps being self-aware that spending more time in Python and Lisp(s) than C may have made me lazy with array indexes lately) so, on a whim, I decided to use a random window index to guarantee that I was in bounds each iteration. I also wrote it in a recursive style, because it just makes more sense to me that way.

Two things stuck out to me.

Though I was sure what I had written was an accurate representation of what I thought binary search was all about, I couldn’t actually recall ever seeing an implementation, having never taken a programming or algorithm class before (and still owning zero books on the subject, despite a personal resolution to remedy this last year…). So while I was confident that my algorithm would return the index of the target value, I wasn’t at all sure that I was implementing a “binary search” to the snob-standard.

The other thing that made me think twice was simply whether or not I would ever breach the recursion depth limit in Python on really huge sets. Obviously this is possible, but was it likely enough that it would occur in the span of a few thousand runs over large sets? Sometimes what seems statistically unlikely can pop up as a hamstring slicer in practical application. In particular, were the odds good that a random guess would lead the algorithm to follow a series of really bad guesses, and therefore occasionally blow up. On the other hand, were the odds better that random guesses would be occasionally so good that on average a random index is better than a halved one (of course, the target itself is always random, so how does this balance)?

I didn’t do any paperwork on this to figure out the probabilities, I just ran the code several thousand times and averaged the results — which were remarkably uniform.

binsearch

I split the process of assignment into two different procedures, one that narrows the window to be searched randomly, and another that does it by dividing by two. Then I made it iterate over ever larger random sets (converted to sorted lists) until I ran out of memory — turns out a list sort needs more than 6Gb at around 80,000,000 members or so.

I didn’t spend any time rewriting to clean things up to pursue larger lists (appending guaranteed larger members instead of sorting would probably permit astronomically huge lists to be searched within 6Gb of memory) but the results were pretty interesting when comparing the methods of binary search by window halving and binary search by random window narrowing. It turns out that halving is quite consistently better, but not by much, and the gap may possibly narrow at larger values (but I’m not going to write a super huge list generator to test this idea on just now).

It seems like something about these results are exploitable. But even if they were, the difference between iterating 24 instead of 34 times over a list of over 60,000,000 members to find a target item isn’t much difference in the grand scheme of things. That said, its mind boggling how not even close to Python’s recursion depth limit one will get, even when searching such a large list.

Here is the code (Python 2).

from __future__ import print_function
import random

def byhalf(r):
    return (r[0] + r[1]) / 2

def byrand(r):
    return random.randint(r[0], r[1])

def binsearch(t, l, r=None, z=0, op=byhalf):
    if r is None:
        r = (0, len(l) - 1)
    i = op(r)
    z += 1

    if t > l[i]:
        return binsearch(t, l, (i + 1, r[1]), z, op)
    elif t < l[i]:
        return binsearch(t, l, (r[0], i - 1), z, op)
    else:
        return z

def doit(z, x):
    l = list(set((int(z * random.random()) for i in xrange(x))))
    l.sort()

    res = {'half': [], 'rand': []}
    for i in range(1000):
        if x > 1:
            target = l[random.randrange(len(l) - 1)]
        elif x == 1:
            target = l[0]
        res['half'].append(binsearch(target, l, op=byhalf))
        res['rand'].append(binsearch(target, l, op=byrand))
    print('length: {0:>12} half:{1:>4} rand:{2:>4}'\
                    .format(len(l),
                            sum(res['half']) / len(res['half']),
                            sum(res['rand']) / len(res['rand'])))

for q in [2 ** x for x in range(27)]:
    doit(1000000000000, q)

Something just smells exploitable about these results, but I can’t put my finger on why just yet. And I don’t have time to think about it further. Anyway, it seems that the damage done by using a random index to make certain you stay within bounds doesn’t actually hurt performance as much as I thought it would. A perhaps useless discovery, but personally interesting nonetheless.