The Intellectual Wilderness There is nothing more useless than doing efficiently that which should not be done at all.

2013.04.20 20:12

Version String Comparison in Python

Filed under: Computing — Tags: , , , , — zxq9 @ 20:12

Comparing and sorting version strings in Python scripts has come up a few times lately. Here are some simple approaches. These examples assume that the version strings will be numeric representations and not include elements like “rc-1” or “alpha” or whatever. If your problem includes these kinds of elements, don’t worry — they are solvable by applying a touch of regex-fu to the processes below.

First, sorting a bunch of version number strings. The problem is that many version strings reported by various packages are just that: strings. Usually the only way to get version information is to ask for it (“[prog] –version” in a shell script, or “SELECT version();” from a database, or whatever) and interpret whatever gets sent to stdout. Those strings don’t mean the same thing to comparison operators that they mean to us so long as they remain strings. For example, the string ‘3.5.10’ is greater alphabetically than ‘3.15.1’ but is a higher version. So we need to convert them to tuples of integers to make comparison of them natural (again, all these examples assume integer-only version strings, but minor changes to the process can allow you to compare anything — and assigning a custom collation order can allow you to sort against any arbitrary order of arbitrary symbols, but that’s beyond the scope of the basic nature of the problem I’m addressing here):

>>> vers
['3.5.10', '3.15.1', '2.5.7', '0.20.0', '2.12.5', '10.4.3']
>>> s_vers = [tuple([int(x) for x in n.split('.')]) for n in vers]
>>> s_vers
[(3, 5, 10), (3, 15, 1), (2, 5, 7), (0, 20, 0), (2, 12, 5), (10, 4, 3)]
>>> vers[0] > vers[1]
>>> s_vers[0] > s_vers[1]
>>> cmp(vers[0], vers[1])
>>> cmp(s_vers[0], s_vers[1])
>>> s_vers.sort()
>>> s_vers
[(0, 20, 0), (2, 5, 7), (2, 12, 5), (3, 5, 10), (3, 15, 1), (10, 4, 3)]

The list comprehension (actually, two nested list comprehensions) assignment to s_vers is the important part of this. Once that is done you can compare whatever you want. If the version number is buried as an element in a dict or larger list (likely) you can do this conversion in place by adding a new element to the contained structures and then sort the greater list based on that element:

>>> packages
[{'version': '3.5.10', 'name': 'foo'}, {'version': '3.15.1', 'name': 'foo'}, {'version': '2.5.7', 'name': 'foo'}, {'version': '0.20.0', 'name': 'foo'}, {'version': '2.12.5', 'name': 'foo'}, {'version': '10.4.3', 'name': 'foo'}]

OK, that’s pretty ugly (uglier depending on how your browser renders <pre> type text), so I’ll print them in order so we can watch the list change more easily.

>>> for p in packages:
...   print p
{'version': '3.5.10', 'name': 'foo'}
{'version': '3.15.1', 'name': 'foo'}
{'version': '2.5.7', 'name': 'foo'}
{'version': '0.20.0', 'name': 'foo'}
{'version': '2.12.5', 'name': 'foo'}
{'version': '10.4.3', 'name': 'foo'}
>>> for p in packages:
...   p.update({'version_tuple': tuple([int(x) for x in p['version'].split('.')])})
>>> for p in packages:
...   print p
{'version_tuple': (3, 5, 10), 'version': '3.5.10', 'name': 'foo'}
{'version_tuple': (3, 15, 1), 'version': '3.15.1', 'name': 'foo'}
{'version_tuple': (2, 5, 7), 'version': '2.5.7', 'name': 'foo'}
{'version_tuple': (0, 20, 0), 'version': '0.20.0', 'name': 'foo'}
{'version_tuple': (2, 12, 5), 'version': '2.12.5', 'name': 'foo'}
{'version_tuple': (10, 4, 3), 'version': '10.4.3', 'name': 'foo'}
>>> packages.sort(key = lambda x:x['version_tuple'])
>>> for p in packages:
...   print p
{'version_tuple': (0, 20, 0), 'version': '0.20.0', 'name': 'foo'}
{'version_tuple': (2, 5, 7), 'version': '2.5.7', 'name': 'foo'}
{'version_tuple': (2, 12, 5), 'version': '2.12.5', 'name': 'foo'}
{'version_tuple': (3, 5, 10), 'version': '3.5.10', 'name': 'foo'}
{'version_tuple': (3, 15, 1), 'version': '3.15.1', 'name': 'foo'}
{'version_tuple': (10, 4, 3), 'version': '10.4.3', 'name': 'foo'}

We started out with a list of dictionaries, each containing a package name and a version string. The first loop updates each dictionary to include a version tuple, and the next orders the dictionaries within the list by the tuple values. Viola! We have a list of dictionaries sorted by version number. Of course, if there are more than one package name involved you will want to sort on the package name first, then the version tuple as a secondary criteria (so you don’t compare versions of package ‘foo’ against versions of package ‘bar’, or sort glibc against firefox, for example).

If lambdas are unfamiliar to you, don’t be scared off by the package.sort() line up there — lambdas are perfectly safe, reliable and quite concise once you understand the way they are used.

From here writing a sort function for lists of version strings should be pretty obvious. And… that means that writing a comparison function for two individual elements that works the same way the built-in cmp() function works is trivial:

>>> def ver_tuple(z):
...   return tuple([int(x) for x in z.split('.') if x.isdigit()])
>>> def ver_cmp(a, b):
...   return cmp(ver_tuple(a), ver_tuple(b))
>>> vers
['3.5.10', '3.15.1', '2.5.7', '0.20.0', '2.12.5', '10.4.3']
>>> ver_cmp(vers[0], vers[1])
>>> ver_cmp(vers[0], vers[0])
>>> ver_cmp(vers[3], vers[4])

Nice and easy.

Now I can’t figure out why comparison functions I’ve seen floating around occupy so much space and are hard to follow — full of class declarations and exec loops within exec loops (!!!) and other nonsense. At the most you will need to add some regular expression matching to extract/split on the correct substrings from the version string. That means you would have to import the re module and the list comprehension will grow by a few (maybe 10) characters.


  1. Nice writeup! This makes more sense to me now. I thought iterating a sort of compare/escape down the line of elements was the way to do this, but its way more code. But why don’t you address sorted() and only use sort()? Does it not work the same way?

    Comment by avraml — 2013.04.21 00:39 @ 00:39

  2. I didn’t really think of sorted(), actually, because I rarely use it. Only lists have a sorting method of their own, sort(), while sorted() is a function that works on any iterable object.

    There is a catch to that: sorted() is a function that returns a sorted list, sort() is a method that operates on its instance in place. In the cases above I have no need of a copy of the original list, I just want the results and would prefer if they persist in the most idiot-proof manner available. To that end having a list sorted in place is more brainfart resistant than receiving a sorted copy because there is only one canonical reference to it and the referred object itself has been sorted. But that’s not always desirable.

    You can replace sort() with sorted() in any of the example cases. In fact, you can replace sort() with sorted() in even more cases than the ones above (a tuple of version tuples, for example, but since tuples are immutable to get a sorted version of them you have to create a copy anyway, so this is a self-evident issue — and the order of occurrence in dicts/hashes is undefined (on purpose) so don’t sort those in place, because doing so is against the purpose of dicts).

    The key thing to note there is really that both sort() and sorted() accept a “key” argument and in the case of sorting a larger data structure (sometimes much larger) by version number the ability to use a phrase like this is invaluable:

    sorted(list_of_things, key = lambda x:x['version_tuple'])

    Two interesting resources (but three links, for completeness):
    The Python Wiki HowTo on sorting
    Itertools Documentation (Python 2.7)
    Itertools Documentation (Python 3.2)
    Itertools is interesting for a number of reasons, many of which a central to the design philosophy of the language itself.

    Comment by zxq9 — 2013.04.21 01:00 @ 01:00

  3. I get that, I guess. It just seems weird that there is a sort() method and a sorted() function that are really the same thing. I wonder if sort() actually calls out to sorted()?

    I have a different question about applying this practically. Sometimes things like patches need to be applied in order or evaluated that way and are named things like x.x.x.patch or pkg-name-x.x.x.patch or whatever. How would you sort those from a directory listing like this without building all that class and inner loop stuff? I think people build all that infrastructure because ordering numbered filenames is the main use case and your solution doesn’t seem to really help me there.

    Comment by avraml — 2013.04.22 11:11 @ 11:11

  4. Two main approaches, depending on the use case: one is to create a short-lived tuple of the converted tuple and original string so you can sort on the tuple and then eliminate it, and the other would be to create a VersionNumber class that pre-computed the tuple and string values and returned whichever you needed at the moment. The short-lived tuple solution is just fine for most cases, but wouldn’t be very efficient for huge sets (like I wouldn’t write the version sorter for a huge repository manager this way); the class version would trade memory for the advantage of a single conversion regardless how many times you have to compare against the tuple. And lastly, how to get just the numbers? Regexs, as I mentioned above. Another, somewhat naive way is to just use fnmatch — its naive but works fine if there is a definite naming convention to the versioned file names.

    Here is a (really dirty) example of a temporary tuple sort:

    import fnmatch
    # Only get versions newer than prev_version
    prev_version = (1, 3, 7)
    files = fnmatch.filter(os.listdir(path), '[0-9]*.patch')
    filtered = [z for z in [(ver_tuple(x), x) for x in files] if z[1] >= prev_version]
    updates = [f[1] for f in filtered]  # Discard the tuple you sorted on

    Of course, the regex approach is vastly superior to the fnmatch method above. Something like r’.*([0-9]+\.*){2,4}.*’ or some such. (Don’t rely on that as-is — I haven’t tested it and am not really good at making regex bits up on the fly — you get the idea, though.)

    The class-based concept is very similar, it just provides a place for us to precompute and store the filename, version string and version tuple and methods to present or compare against them a bit more cleanly than writing them into a list of tuples. But a list of tuples and a list of “VersionNumber” objects would amount to the exact same thing, really, since either could be sorted. It would just be annoying to have to remember which index number was what later on (though you could use named tuples for this as well — you could also use a list of dictionaries, but sorting via lambdas all the time gets verbose).

    The situation dictates which way fits best. Sometimes you really need to keep the filename, version string and comparable version tuple around to avoid a ton of in-place conversions as you compare, execute, copy and write to logging processes. In any case, the principles are all the same — you don’t need tens of lines of code for this.

    Comment by zxq9 — 2013.04.23 01:52 @ 01:52

  5. I wrote up something for attacking version number problems with bash. I came to three general conclusions:

    * when you see a floating point number, do not assume it is floating point

    * when you see a number, try operating on it as a string instead

    * look for structure in data which may suggest optimizations

    Comment by Rich — 2013.05.17 09:51 @ 09:51

  6. @Rich
    Doing the whole thing in Bash seems fragile, but workable. More pointedly, though, I’m not sure that some of the folks I deal with would be as familiar as you are with the idosyncracies of Bash, and that makes readability a problem. On the other hand, my lambda-reliant solution above isn’t readable for some Python coders, either, so maybe this is just a tricky problem in general.

    It seems like it would require more massaging to make sure you pull out just the version string from, say, a package file name in Bash than in awk or Python. If I had to do this in Bash, I’d probably write a separate awk script and call it from my Bash script instead, or if I was going to go that route, add a function that verifies that we have extracted a valid version number string to the two sorting functions above, and just call the Python script from my Bash script, and have it feed ordered values back (like read back an ordered list of versions or file names or whatever the subject is into a Bash array).

    But that’s me — Bash is great glue, but it is similar to SQL and a few other highly idiosyncratic languages: the more I code in them the less I find myself wanting to code in them.

    Comment by zxq9 — 2013.05.17 11:34 @ 11:34

RSS feed for comments on this post.

Leave a comment

Powered by WordPress