So you’ve been thinking about applying your awesome parallel programming skills to some problem that is just complex enough to impress the noobs, but just trivial enough to code in a few lines? Cool. Knock yourself out.
But stop trying to parallelize quicksort.
Posting articles, tutorials, SO questions, and blog posts with the term “parallel quicksort” in the title only serves to encourage excitable, well-meaning, naive people to copypasta whatever code they find there and screw up what would have otherwise been straightforward, easy-to-read, relatively performant beginner-style code.
A naive quicksort implementation is pretty darn fast in the common cases found in real programs. There are edge cases where an intelligently chosen pivot can be a significant optimization — but the act of choosing the pivot intelligently is not free, which means before committing to this you have to know a lot about the context in which that particular sort will be used so you can be confident that it won’t be an overall pessimization.
A naive parallel quicksort implementation can only ever be slower than a naive serialized quicksort. Why? Think through the serial algorithm: each branch depends on the results of the subsequent branches. When we parallelize that the same dependencies exist but you’ve added the overhead of spawning and messaging to the rollup of values at the end. Even in Erlang where spawning and messaging (on a single node) are amazingly lighter weight and faster than in other environments, spawning and messaging are still more overhead in terms of time and memory than simple comparison. So a naive parallelized quicksort, while perhaps impressive because you get to show that you’re spawning and messaging everywhere, is never going to be as fast as a normal naive quicksort.
What about an intelligent parallel quicksort? Ah. Now here we have something potentially interesting. This can indeed be a significant performance win, even bigger than the case of an intelligently implemented quicksort in a guaranteed case of lopsided values where you know where the optimal pivot is located (consider for a moment how utterly tailored to the situation the choice must be, and how utterly familiar with things about that specific use case that are not quicksort one must be to know that any of this is true).
Let’s assume everything is going your way and you can already have a lot of information about the list you need to sort — enough that the overhead of finding smart pivots is outweighed by the overall running time of the sort — it can indeed be possible to write a much faster parallel sort. But not by forking more workers than you have schedulers.
The naive parallel quicksort forks/spawns every branch. That’s insane. A smart parallel quicksort will only fork until it has occupied the available scheduling resources of the system (cores, threads, schedulers, whatever the environment has). Why? Because forking any further will only incur scheduling overhead that amounts to busywork the platform has to do just to keep up with multiprocess accounting. That means an intelligent parallel quicksort will have to know not only how to pick an optimum pivot, it will also have to know how to stop forking once it has maxed out the scheduling resources of the system, and it will have to pull one more not-so-simple trick: it will have to know enough about the data being sorted to continue picking smart pivots until it is finished forking (to be faster than a simply paritioned two-threaded implementation).
That is a significant amount of arcane information needed about the input, and a not entirely obvious pile of code needed to both continue to pick pivots, communicate among threads and stop forking once the optimum number of threads for the environment has been reached. I know how to find out how many schedulers are available in an Erlang node, and I know how to find out how many cores are available on a Linux or BSD system in C. I don’t know how to find this out in Go or Java. I’m not really sure how to do it in Python, either, other than just making a subprocess call — but oops, see, once I do that the code isn’t really Python, it is “Python that works on platform X” (and not to mention, most sorts will complete in less time than a call via subprocess will… defeating the point). That’s a programmer pessimization. Spending a huge amount of time to write code that by definition probably can’t be made generic and reused later is not a very intelligent use of time (neither is writing this blog post, for that matter).
What else happens, though, when we parallelize quicksort? When we parallelize anything? Nothing is free, so there must be a cost. What are some cases where I want to sort things, and of those what cases are good candidates for parallel quicksort? This list isn’t exhaustive (sorting is pretty common) but thinking through the tradeoffs in just a few common situations may be illustrative. Remember: optimizations are usually tradeoffs. Once a routine has been optimized for the general case the remaining options are about shifting burdens away from dear resources and onto cheap ones (or, in the happiest case, not doing anything).
- Sorting a list before passing it to a function that requires a sorted list.
- Before an efficient update/merge of two lists (a special case of the above).
- Ordering client-side data for presentation.
- Organizing log data before flush (can be thought of as a variation of the above).
- Building indexes.
- Organizing variable length search data (usually text search; a variant of the above).
In any of the cases above is it likely that I might have a set of data that is so huge that I might need to optimize the sort operation? Maybe in cases #1, #2, #4, #5, and very often in #6. So that’s entirely possible.
In any cases above is it likely that I will know enough to be able to build a customized parallel quicksort that is smart enough to pick smart pivots? Hrm… that is a lot trickier. Assuming you already might know something extra about the nature of the input it is maybe possible that this could be the case in #4, or #5 — but that is not the general case. You’d have to know an awful lot to get this right. The most likely candidate here is actually just #6.
Of the remaining cases (and for the vast majority of actual code, that is really just going to be #6), are there any situations where I want to slow down the whole system just to make a single sort go faster? Is the system doing anything else that might be important? Will this degrade overall performance when run in a cron job? Is perhaps the whole point of having lots of cores that maybe we don’t want the entire system to be affected by a single long-running operation? Consider a database that needs to build a text search index and pass it somewhere else based on whatever has changed in the last 12 hours (or an adsearch index builder that needs to sort billions of items of campaign change data). Is it acceptable for the whole system to experience a performance degradation? I don’t know — only you can answer that question. It might be a big win, it might not. I’ve only encountered a single case* where this was true in actual production code.
Once. Ever.
Now go do a search for “parallel quicksort” and skim over the results with new eyes. You almost never need this. Most articles on parallel quicksort are exactly as useful as the articles on parallel Fibonacci functions (or any Fibonacci functions, really). Sure, some of the code examples are fun to read and muse about, but they don’t get you any closer to solving whatever your actual user problems are.
[* Actually, it was adsearch campaign data, which is why I mentioned it.]