Hi,
Dennis wrote: I'm sending a list of commands ("code" from now on) to one random CPU of the 10.
This one CPU should split up the code to (equally) distribute the burden over all CPUs.
How does this happen?
Typically there are some parts of software that can't be done in parallel, some parts that can be done in parallel with carefully planned synchronisation and/or locking, and some parts that are "embarrassingly parallel". Even for the embarrassingly parallel stuff (where a hypothetical compiler might be able to distribute it equally over all CPUs), sometimes the communication costs (e.g. network latency) outweigh the benefits.
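To get a feel for how much the "can't be done in parallel" parts matter on their own, here's a minimal sketch of Amdahl's law in C (the 80% figure and the loop over CPU counts are just made-up illustration values, and it completely ignores communication costs, which only make things worse):

Code:
#include <stdio.h>

/* Amdahl's law: if a fraction "p" of the work can be parallelised and the
 * rest is serial, then the best possible speedup on "n" CPUs is
 * 1 / ((1 - p) + p / n), no matter how well the parallel part is split up. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    double p = 0.8;     /* hypothetical: 80% of the work is parallelisable */

    for (int n = 1; n <= 10; n++) {
        printf("%2d CPUs: best case speedup = %.2f\n", n, amdahl_speedup(p, n));
    }
    return 0;
}

Even with 80% of the work being parallelisable, 10 CPUs only give a "best case" speedup of about 3.57.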
For an example, consider Floyd–Steinberg dithering. Imagine your compiler has been given this pseudocode (like the code from the Wikipedia page, but extended to handle RGB rather than monochrome):
Code:
for (each y from top to bottom) {
    for (each x from left to right) {
        old_red = pixel[x][y].red;
        old_green = pixel[x][y].green;
        old_blue = pixel[x][y].blue;
        new_red = find_closest_red(old_red);
        new_green = find_closest_green(old_green);
        new_blue = find_closest_blue(old_blue);
        pixel[x][y].red = new_red;
        pixel[x][y].green = new_green;
        pixel[x][y].blue = new_blue;
        quant_error_red = old_red - new_red;
        quant_error_green = old_green - new_green;
        quant_error_blue = old_blue - new_blue;
        pixel[x+1][y].red = pixel[x+1][y].red + 7/16 * quant_error_red;
        pixel[x-1][y+1].red = pixel[x-1][y+1].red + 3/16 * quant_error_red;
        pixel[x][y+1].red = pixel[x][y+1].red + 5/16 * quant_error_red;
        pixel[x+1][y+1].red = pixel[x+1][y+1].red + 1/16 * quant_error_red;
        pixel[x+1][y].green = pixel[x+1][y].green + 7/16 * quant_error_green;
        pixel[x-1][y+1].green = pixel[x-1][y+1].green + 3/16 * quant_error_green;
        pixel[x][y+1].green = pixel[x][y+1].green + 5/16 * quant_error_green;
        pixel[x+1][y+1].green = pixel[x+1][y+1].green + 1/16 * quant_error_green;
        pixel[x+1][y].blue = pixel[x+1][y].blue + 7/16 * quant_error_blue;
        pixel[x-1][y+1].blue = pixel[x-1][y+1].blue + 3/16 * quant_error_blue;
        pixel[x][y+1].blue = pixel[x][y+1].blue + 5/16 * quant_error_blue;
        pixel[x+1][y+1].blue = pixel[x+1][y+1].blue + 1/16 * quant_error_blue;
    }
}
If you try to spread it equally across many CPUs, then each CPU will have to wait for the previous CPUs to spread the error to nearby pixels and it'll be a massive disaster.
One way to do it is to have one CPU for each colour component (e.g. one CPU does red, one does green, one does blue). However, because these colour components are packed close together you're going to end up with massive "cache line thrashing" problems and using 3 CPUs will be slower than using one. To avoid those cache line thrashing problems, each CPU would have to have a separate copy of the source data and would have to put its results into a separate buffer, and then afterwards you'd need something to combine the data from each CPU (e.g. "red results", "green results" and "blue results") back into a single "combined RGB results".
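As a rough sketch of that "separate copy per colour component, then combine" idea (with hypothetical WIDTH/HEIGHT, "int" working buffers so error values can temporarily go outside 0..255, and a plain 1-bit threshold standing in for the find_closest() functions above; each extract + dither pair would run on its own CPU):

Code:
#include <stdint.h>

/* Hypothetical image dimensions - illustration only. */
#define WIDTH  640
#define HEIGHT 480

/* Copy one colour component out of the shared interleaved RGB image into a
 * private buffer, so this CPU never writes to cache lines the others use. */
static void extract_channel(const uint8_t *rgb, int channel, int *out)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++)
        out[i] = rgb[i * 3 + channel];
}

/* Floyd-Steinberg on one colour component, reduced to 1-bit output for brevity. */
static void dither_channel(int *c)
{
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int old_val = c[y * WIDTH + x];
            int new_val = (old_val < 128) ? 0 : 255;
            int err = old_val - new_val;

            c[y * WIDTH + x] = new_val;
            if (x + 1 < WIDTH)                   c[y * WIDTH + x + 1]       += err * 7 / 16;
            if (x > 0 && y + 1 < HEIGHT)         c[(y + 1) * WIDTH + x - 1] += err * 3 / 16;
            if (y + 1 < HEIGHT)                  c[(y + 1) * WIDTH + x]     += err * 5 / 16;
            if (x + 1 < WIDTH && y + 1 < HEIGHT) c[(y + 1) * WIDTH + x + 1] += err * 1 / 16;
        }
    }
}

/* After all three workers have finished, pack their private results back
 * into a single interleaved RGB image. */
static void combine(const int *r, const int *g, const int *b, uint8_t *rgb_out)
{
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        rgb_out[i * 3 + 0] = (uint8_t)r[i];
        rgb_out[i * 3 + 1] = (uint8_t)g[i];
        rgb_out[i * 3 + 2] = (uint8_t)b[i];
    }
}

Each worker only ever touches its own private buffers, so the CPUs don't fight over cache lines until the final combine pass (which is cheap, sequential copying).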
A slightly more advanced way to do it would be to have local buffers to track the error values (so that each CPU doesn't need to modify the source data), and local buffers for the resulting data; where a CPU would process a horizontal line of pixels to create a horizontal line of results, and would then acquire a mutex for the line in the actual destination data and merge data from its local results buffer into the actual destination data. This would work for SMP, but things like mutexes don't work well for distributed systems (network latency).
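A very cut-down sketch of that (pthreads, hypothetical sizes, and a plain threshold standing in for the real per-line dither just to keep it short - the real worker would also carry a private error buffer from one line to the next):

Code:
#include <pthread.h>
#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480

static uint8_t src_rgb[HEIGHT][WIDTH][3];    /* shared read-only source image  */
static uint8_t dest_rgb[HEIGHT][WIDTH][3];   /* shared final destination       */
static pthread_mutex_t line_lock[HEIGHT];    /* one mutex per destination line */

/* Placeholder for the real per-line dither of one colour component. */
static void dither_one_line(int channel, int y, uint8_t line_result[WIDTH])
{
    for (int x = 0; x < WIDTH; x++)
        line_result[x] = (src_rgb[y][x][channel] < 128) ? 0 : 255;
}

/* Briefly lock one line of the destination and merge this worker's channel in. */
static void merge_line(int channel, int y, const uint8_t line_result[WIDTH])
{
    pthread_mutex_lock(&line_lock[y]);
    for (int x = 0; x < WIDTH; x++)
        dest_rgb[y][x][channel] = line_result[x];
    pthread_mutex_unlock(&line_lock[y]);
}

/* One worker thread per colour component (0 = red, 1 = green, 2 = blue). */
static void *worker(void *arg)
{
    int channel = (int)(intptr_t)arg;
    uint8_t line_result[WIDTH];

    for (int y = 0; y < HEIGHT; y++) {
        dither_one_line(channel, y, line_result);
        merge_line(channel, y, line_result);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[3];

    for (int y = 0; y < HEIGHT; y++)
        pthread_mutex_init(&line_lock[y], NULL);
    for (int c = 0; c < 3; c++)
        pthread_create(&t[c], NULL, worker, (void *)(intptr_t)c);
    for (int c = 0; c < 3; c++)
        pthread_join(t[c], NULL);
    return 0;
}

Each mutex is only held for one short per-line merge, so contention stays low on SMP; but the moment those workers are on different machines a mutex like this stops being practical.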
For 4 separate (single-CPU) computers talking via networking, you could have 3 computers calculating the red, green and blue and sending their data to the fourth computer; where the fourth computer merges the red, green, blue data from the other computers to create the final RGB result. In this case there are no locks/mutexes; and as long as asynchronous communication is used (e.g. the first 3 computers aren't wasting most of their time waiting for an "ACK") it avoids most of the network latency problem. As a bonus, this approach would also work for SMP.
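To illustrate just the receiving half of that (the fourth computer), assuming three already-connected TCP sockets (one per worker computer) and that each worker simply streams its finished rows for its colour component without waiting for any application-level reply:

Code:
#include <stdint.h>
#include <unistd.h>

/* Hypothetical image dimensions - illustration only. */
#define WIDTH  640
#define HEIGHT 480

/* read() may return short counts, so loop until a whole row has arrived. */
static int read_row(int fd, uint8_t *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0)
            return -1;      /* connection closed or failed */
        got += (size_t)n;
    }
    return 0;
}

/* The fourth computer: pull one row of red, green and blue from each worker
 * and interleave them into the final RGB result, row by row. */
int merge_from_workers(int fd_red, int fd_green, int fd_blue, uint8_t *rgb_out)
{
    uint8_t r[WIDTH], g[WIDTH], b[WIDTH];

    for (int y = 0; y < HEIGHT; y++) {
        if (read_row(fd_red, r, WIDTH) != 0 ||
            read_row(fd_green, g, WIDTH) != 0 ||
            read_row(fd_blue, b, WIDTH) != 0)
            return -1;
        for (int x = 0; x < WIDTH; x++) {
            rgb_out[(y * WIDTH + x) * 3 + 0] = r[x];
            rgb_out[(y * WIDTH + x) * 3 + 1] = g[x];
            rgb_out[(y * WIDTH + x) * 3 + 2] = b[x];
        }
    }
    return 0;
}

The workers never wait for the merger (TCP buffering soaks up differences in speed), so network latency mostly just delays when the final image is finished rather than slowing the dithering itself down.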
However, the chance of you ever seeing (or writing) a compiler capable of automatically converting the original code into the "4 CPUs/processes with asynchronous communication" version is zero. Compilers just aren't that smart. Also note that even if such a compiler were possible, the load still can't be spread equally - for 10 CPUs you might have 3 CPUs at 100% load, one CPU at 50% load, and 6 CPUs doing nothing.
Cheers,
Brendan