Hi,
Dennis wrote: I'm sending a list of commands ("code" from now on) to one random CPU of the 10.
This one CPU should split up the code to (equally) distribute the burden over all CPUs.
How does this happen?
Typically there are some parts of software that can't be done in parallel, some parts that can be done in parallel with carefully planned synchronisation and/or locking, and some parts that are "embarrassingly parallel". Even for the embarrassingly parallel stuff (where a hypothetical compiler might be able to distribute it equally over all CPUs), sometimes the communication costs (e.g. network latency) outweigh the benefits.
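To get a feel for how much the "can't be done in parallel" parts matter on their own, here's a minimal sketch of Amdahl's law in C (the 80% figure and the loop over CPU counts are just made-up illustration values, and it completely ignores communication costs, which only make things worse):

Code:
#include <stdio.h>

/* Amdahl's law: if a fraction "p" of the work can be parallelised and the
 * rest is serial, then the best possible speedup on "n" CPUs is
 * 1 / ((1 - p) + p / n), no matter how well the parallel part is split up. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    double p = 0.8;     /* hypothetical: 80% of the work is parallelisable */

    for (int n = 1; n <= 10; n++) {
        printf("%2d CPUs: best case speedup = %.2f\n", n, amdahl_speedup(p, n));
    }
    return 0;
}

Even with 80% of the work being parallelisable, 10 CPUs only give a "best case" speedup of about 3.57.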
For an example, consider Floyd–Steinberg dithering. Imagine your compiler has been given this pseudocode (like the code from the Wikipedia page, but extended to handle RGB rather than monochrome):
Code:
for (each y from top to bottom) {
    for (each x from left to right) {
        old_red = pixel[x][y].red;
        old_green = pixel[x][y].green;
        old_blue = pixel[x][y].blue;
        new_red = find_closest_red(old_red);
        new_green = find_closest_green(old_green);
        new_blue = find_closest_blue(old_blue);
        pixel[x][y].red = new_red;
        pixel[x][y].green = new_green;
        pixel[x][y].blue = new_blue;
        quant_error_red = old_red - new_red;
        quant_error_green = old_green - new_green;
        quant_error_blue = old_blue - new_blue;
        pixel[x+1][y].red = pixel[x+1][y].red + 7/16 * quant_error_red;
        pixel[x-1][y+1].red = pixel[x-1][y+1].red + 3/16 * quant_error_red;
        pixel[x][y+1].red = pixel[x][y+1].red + 5/16 * quant_error_red;
        pixel[x+1][y+1].red = pixel[x+1][y+1].red + 1/16 * quant_error_red;
        pixel[x+1][y].green = pixel[x+1][y].green + 7/16 * quant_error_green;
        pixel[x-1][y+1].green = pixel[x-1][y+1].green + 3/16 * quant_error_green;
        pixel[x][y+1].green = pixel[x][y+1].green + 5/16 * quant_error_green;
        pixel[x+1][y+1].green = pixel[x+1][y+1].green + 1/16 * quant_error_green;
        pixel[x+1][y].blue = pixel[x+1][y].blue + 7/16 * quant_error_blue;
        pixel[x-1][y+1].blue = pixel[x-1][y+1].blue + 3/16 * quant_error_blue;
        pixel[x][y+1].blue = pixel[x][y+1].blue + 5/16 * quant_error_blue;
        pixel[x+1][y+1].blue = pixel[x+1][y+1].blue + 1/16 * quant_error_blue;
    }
}
If you try to spread it equally across many CPUs, then each CPU will have to wait for the previous CPUs to spread the error to nearby pixels and it'll be a massive disaster.
One way to do it is to have one CPU for each colour component (e.g. one CPU does red, one does green, one does blue). However, because these colour components are packed close together you're going to end up with massive "cache line thrashing" problems and using 3 CPUs will be slower than using one. To avoid those cache line thrashing problems, each CPU would have to have a separate copy of the source data and would have to put its results into a separate buffer, and then afterwards you'd need something to combine the data from each CPU (e.g. "red results", "green results" and "blue results") back into a single "combined RGB results".
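As a rough sketch of that "separate copy per colour component, then combine" idea (with hypothetical WIDTH/HEIGHT, "int" working buffers so error values can temporarily go outside 0..255, and a plain 1-bit threshold standing in for the find_closest() functions above; each extract + dither pair would run on its own CPU):

Code:
#include <stdint.h>

/* Hypothetical image dimensions - illustration only. */
#define WIDTH  640
#define HEIGHT 480

/* Copy one colour component out of the shared interleaved RGB image into a
 * private buffer, so this CPU never writes to cache lines the others use. */
static void extract_channel(const uint8_t *rgb, int channel, int *out)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++)
        out[i] = rgb[i * 3 + channel];
}

/* Floyd-Steinberg on one colour component, reduced to 1-bit output for brevity. */
static void dither_channel(int *c)
{
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int old_val = c[y * WIDTH + x];
            int new_val = (old_val < 128) ? 0 : 255;
            int err = old_val - new_val;

            c[y * WIDTH + x] = new_val;
            if (x + 1 < WIDTH)                   c[y * WIDTH + x + 1]       += err * 7 / 16;
            if (x > 0 && y + 1 < HEIGHT)         c[(y + 1) * WIDTH + x - 1] += err * 3 / 16;
            if (y + 1 < HEIGHT)                  c[(y + 1) * WIDTH + x]     += err * 5 / 16;
            if (x + 1 < WIDTH && y + 1 < HEIGHT) c[(y + 1) * WIDTH + x + 1] += err * 1 / 16;
        }
    }
}

/* After all three workers have finished, pack their private results back
 * into a single interleaved RGB image. */
static void combine(const int *r, const int *g, const int *b, uint8_t *rgb_out)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        rgb_out[i * 3 + 0] = (uint8_t)r[i];
        rgb_out[i * 3 + 1] = (uint8_t)g[i];
        rgb_out[i * 3 + 2] = (uint8_t)b[i];
    }
}

Each worker only ever touches its own private buffers, so the CPUs don't fight over cache lines until the final combine pass (which is cheap, sequential copying).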
A slightly more advanced way to do it would be to have local buffers to track the error values (so that each CPU doesn't need to modify the source data), and local buffers for the resulting data; where a CPU would process a horizontal line of pixels to create a horizontal line of results, and would then acquire a mutex for the line in the actual destination data and merge data from its local results buffer into the actual destination data. This would work for SMP, but things like mutexes don't work well for distributed systems (network latency).
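A very cut-down sketch of that (pthreads, hypothetical sizes, and a plain threshold standing in for the real per-line dither just to keep it short - the real worker would also carry a private error buffer from one line to the next):

Code:
#include <pthread.h>
#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480

static uint8_t src_rgb[HEIGHT][WIDTH][3];    /* shared read-only source image  */
static uint8_t dest_rgb[HEIGHT][WIDTH][3];   /* shared final destination       */
static pthread_mutex_t line_lock[HEIGHT];    /* one mutex per destination line */

/* Placeholder for the real per-line dither of one colour component. */
static void dither_one_line(int channel, int y, uint8_t line_result[WIDTH])
{
    for (int x = 0; x < WIDTH; x++)
        line_result[x] = (src_rgb[y][x][channel] < 128) ? 0 : 255;
}

/* Briefly lock one line of the destination and merge this worker's channel in. */
static void merge_line(int channel, int y, const uint8_t line_result[WIDTH])
{
    pthread_mutex_lock(&line_lock[y]);
    for (int x = 0; x < WIDTH; x++)
        dest_rgb[y][x][channel] = line_result[x];
    pthread_mutex_unlock(&line_lock[y]);
}

/* One worker thread per colour component (0 = red, 1 = green, 2 = blue). */
static void *worker(void *arg)
{
    int channel = (int)(intptr_t)arg;
    uint8_t line_result[WIDTH];

    for (int y = 0; y < HEIGHT; y++) {
        dither_one_line(channel, y, line_result);
        merge_line(channel, y, line_result);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[3];

    for (int y = 0; y < HEIGHT; y++)
        pthread_mutex_init(&line_lock[y], NULL);
    for (int c = 0; c < 3; c++)
        pthread_create(&t[c], NULL, worker, (void *)(intptr_t)c);
    for (int c = 0; c < 3; c++)
        pthread_join(t[c], NULL);
    return 0;
}

Each mutex is only held for one short per-line merge, so contention stays low on SMP; but the moment those workers are on different machines a mutex like this stops being practical.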
For 4 separate (single-CPU) computers talking via networking, you could have 3 computers calculating the red, green and blue and sending their data to the fourth computer; where the fourth computer merges the red, green, blue data from the other computers to create the final RGB result. In this case there are no locks/mutexes; and as long as asynchronous communication is used (e.g. the first 3 computers aren't wasting most of their time waiting for an "ACK") it avoids most of the network latency problem. As a bonus, this approach would also work for SMP.
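To illustrate just the receiving half of that (the fourth computer), assuming three already-connected TCP sockets (one per worker computer) and that each worker simply streams its finished rows for its colour component without waiting for any application-level reply:

Code:
#include <stdint.h>
#include <unistd.h>

/* Hypothetical image dimensions - illustration only. */
#define WIDTH  640
#define HEIGHT 480

/* read() may return short counts, so loop until a whole row has arrived. */
static int read_row(int fd, uint8_t *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0)
            return -1;      /* connection closed or failed */
        got += (size_t)n;
    }
    return 0;
}

/* The fourth computer: pull one row of red, green and blue from each worker
 * and interleave them into the final RGB result, row by row. */
int merge_from_workers(int fd_red, int fd_green, int fd_blue, uint8_t *rgb_out)
{
    uint8_t r[WIDTH], g[WIDTH], b[WIDTH];

    for (int y = 0; y < HEIGHT; y++) {
        if (read_row(fd_red, r, WIDTH) != 0 ||
            read_row(fd_green, g, WIDTH) != 0 ||
            read_row(fd_blue, b, WIDTH) != 0)
            return -1;
        for (int x = 0; x < WIDTH; x++) {
            rgb_out[(y * WIDTH + x) * 3 + 0] = r[x];
            rgb_out[(y * WIDTH + x) * 3 + 1] = g[x];
            rgb_out[(y * WIDTH + x) * 3 + 2] = b[x];
        }
    }
    return 0;
}

The workers never wait for the merger (TCP buffering soaks up differences in speed), so network latency mostly just delays when the final image is finished rather than slowing the dithering itself down.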
However, the chance of you ever seeing (or writing) a compiler capable of automatically converting the original code into the "4 CPUs/processes with asynchronous communication" version is zero. Compilers just aren't that smart. Also note that even if such a compiler were possible, the load still can't be spread equally - for 10 CPUs you might have 3 CPUs at 100% load, one CPU at 50% load, and 6 CPUs doing nothing.
Cheers,
Brendan