This is one of those posts that I've been wanting to write for ages. The inspiration came from reading Kent Beck's Extreme Programming Explained and from a post by Stephan Schmidt.
Let's say you need to write a program that crunches a large set of inputs. There are three different calculations that need to be carried out for each input value. Each calculation is relatively complicated and relies on data structures for looking up previously computed values.
In a single-core setting, things are simple. You can either feed each input into all three algorithms, or run the first algorithm over all inputs and then turn to the other two. Either way, your throughput will be roughly the same (modulo caching issues).
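To make the setup concrete, here is a minimal sketch of the two single-core orderings. The three `algorithm*` methods and their internal lookup tables are hypothetical placeholders, not part of any real API:

```java
// Hypothetical algorithms; each one consults its own table of previously
// computed values (not shown).
void algorithm1(int input) { /* ... */ }
void algorithm2(int input) { /* ... */ }
void algorithm3(int input) { /* ... */ }

// Ordering 1: feed each input into all three algorithms.
void byInput(int[] inputs) {
    for (int x : inputs) {
        algorithm1(x);
        algorithm2(x);
        algorithm3(x);
    }
}

// Ordering 2: run each algorithm, in turn, over all inputs.
void byAlgorithm(int[] inputs) {
    for (int x : inputs) algorithm1(x);
    for (int x : inputs) algorithm2(x);
    for (int x : inputs) algorithm3(x);
}
```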
The story is different in multi-core machines. By and large, there are two design options:
- Option 1 - Decomposition by Algorithms
Create three threads, one for each algorithm. Each thread scans the input and feeds the values into its own algorithm.
There are almost no data races: the data structures supporting algorithm 1 are accessed only by thread 1, which obviates the need to guard against other threads. Cache behavior also seems quite good: each thread touches only a part of the heap, which promotes data locality and cache-friendliness. (A code sketch of both options appears below.)
- Option 2 - Decomposition by Inputs
Create N threads (N == #cores). Each thread takes the next value from the input and feeds it, in turn, into each of the three algorithms.
This design puts a greater burden on the programmer. Each thread touches all data structures, thus requiring careful locking. Locality is also not as good: after accessing the data structures of algorithm 1, a thread will start executing algorithm 2 (for the same input), thus touching different data structures and incurring more cache misses.
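Here is a minimal sketch of both decompositions, reusing the hypothetical `algorithm1`/`algorithm2`/`algorithm3` methods from above. The pool size and the shape of the input are placeholders, and in option 2 the lookup tables behind the algorithms would of course need proper locking:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Option 1: one thread per algorithm; each thread scans the entire input.
// The lookup tables of algorithm i are touched only by thread i, so no
// locking is needed between the threads.
void decomposeByAlgorithms(int[] inputs) throws InterruptedException {
    Thread t1 = new Thread(() -> { for (int x : inputs) algorithm1(x); });
    Thread t2 = new Thread(() -> { for (int x : inputs) algorithm2(x); });
    Thread t3 = new Thread(() -> { for (int x : inputs) algorithm3(x); });
    t1.start(); t2.start(); t3.start();
    t1.join(); t2.join(); t3.join();   // wait for all three algorithms to finish
}

// Option 2: N worker threads (N == number of cores); every work item runs
// all three algorithms on a single input, so the shared lookup tables must
// now be thread-safe (locks, concurrent collections, etc.).
void decomposeByInputs(int[] inputs) throws InterruptedException {
    int cores = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(cores);
    for (int x : inputs) {
        pool.submit(() -> { algorithm1(x); algorithm2(x); algorithm3(x); });
    }
    pool.shutdown();                          // no more work items
    pool.awaitTermination(1, TimeUnit.DAYS);  // wait for the queue to drain
}
```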
- Scalability. Option 1 is great if you have three cores on your machine. If your code runs on a machine with (say) 8 cores, you need to redesign it to use five more threads, possibly by decomposing existing algorithms into smaller, concurrently running sub-algorithms. This incurs a substantial engineering effort.
Option 2, on the other hand, is highly scalable. If you're running on more cores, you just spawn more threads. You can even let your code figure out the optimal number of threads dynamically, thereby allowing it to adjust to unforeseen circumstances (such as the O/S starting a CPU-intensive process that leaves your program with fewer cores).
- Reduced starvation. In option 1, if the execution times of the various algorithms are not equal, some threads will finish before the others and sit idle, making overall throughput sub-optimal.
In option 2, a thread may go idle only at the very end of the execution, when the number of remaining inputs drops below #cores, which is a tiny fraction of the total number of iterations.
- Fewer up-front estimations. In option 1, one needs to estimate the execution times of the various algorithms, across all possible target machines, in order to minimize the effect of starvation.
In option 2, such estimations are practically unnecessary, thanks to the reduced starvation.
- Progress monitoring. Given that the number of inputs is far larger than the number of algorithms in a program, prediction of time-to-completion is more accurate in option 2 (cf. the Law of Large Numbers).
- Easy On/Off. In option 2, if you need to stop in the middle, you just stop dispatching work items to the threads; pretty soon all threads will stop. In option 1, the programmer needs to build shutdown logic into each thread/algorithm (e.g., inspect a shared atomic boolean and bail out if it goes to false), as sketched after this list.
In much the same way, it is also easy to restart later: you know that all inputs (up to a certain point) were fully processed, and all remaining inputs are fully un-processed. Thus, you have a perfectly valid (though partial) output.
An "Abort" in option 1 leaves each thread in a different location, thereby making it difficult to resume later. The outputs, across the algorithms, are non-aligned (in addition to being partial).
Of course, decomposition by inputs is not always better than decomposition by algorithms. Inter-thread locking may, in certain cases, be so costly that option 1 is still faster. Nonetheless, I argue that, in general, option 2 is likely to be more scalable, yield better throughput (due to less starvation), and so on.
Please re-read the above text, after applying the following substitutions:
- Threads, Cores -> Programmers
- Inputs -> Features
- Algorithm -> Module, Subsystem (or any other large piece of a program)
- Execution Time -> Development Time
- Cache miss -> Developer switching between different parts of the code
(I believe that this is just another manifestation of the super-linearity axiom: in the world of software development, smaller tasks are more cost-effective than larger ones).