Batch Processing?
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
IMO, efficient multitasking of multiple CPU-intensive threads on a single CPU is a pointless exercise.
If my answers always seem like this, it's because I know you're trying to make BCOS a commercial OS. If this were just a hobby, I'd keep my mouth shut, but when it comes to commercial software projects, I can't help but point out instances of requirements scope creep.
Stop thinking about threads and start thinking about systems. Are the minor gains you're going to see from this really worth the effort or added complexity? Is the world clamouring for this feature? I just don't see it. Sometimes the right answer is: Break that fence into 2, 4, or 8 pieces and have a painter on each one. The world will be drowning in painters before you know it -- might as well figure out how to use them all (or more to the point, encourage application developers to do so).

If a service uses multiple threads to complete a request, then the scheduler sees separate threads and schedules each independently. If a service uses one thread per CPU, then (hopefully) this results in all CPUs doing the request, but the scheduler still sees separate batch processing threads.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
I have just one simple question dealing with the painter analogy. What if the one painter has to prime the fences first and then has to wait for the primer to dry? Does the painter A - wait for the paint to dry and then continue, or B - go on to the next fence while waiting for the paint to dry? If he chooses A then there would be a lot of wasted time waiting for the paint to dry. Just a thought.
Hi,
Colonel Kernel wrote: IMO, efficient multitasking of multiple CPU-intensive threads on a single CPU is a pointless exercise.

The same theory applies to efficient multi-tasking (for suitable CPU loads) in general, including on multi-CPU systems...
Colonel Kernel wrote: Stop thinking about threads and start thinking about systems. Are the minor gains you're going to see from this really worth the effort or added complexity? Is the world clamouring for this feature? I just don't see it. Sometimes the right answer is: Break that fence into 2, 4, or 8 pieces and have a painter on each one. The world will be drowning in painters before you know it -- might as well figure out how to use them all (or more to the point, encourage application developers to do so).

Unfortunately some things just can't be broken into smaller pieces in practice (despite the large amount of research currently being undertaken to find ways to make it happen), and sometimes programmers don't break things up when they could have. Even when things can be broken into smaller pieces, those pieces still need to be scheduled somehow.
Colonel Kernel wrote: If my answers always seem like this, it's because I know you're trying to make BCOS a commercial OS. If this were just a hobby, I'd keep my mouth shut, but when it comes to commercial software projects, I can't help but point out instances of requirements scope creep.

I just don't believe any OS can compete against existing OSs (Windows and Unix variants) by using similar techniques and providing similar features - an OS would need to be extremely superior to these OSs to get past the momentum of their existing market share (and I don't think that there's enough room for improvement left for this to be possible). Because of this I deliberately look for different techniques, and for features that other OSs don't have.

I don't want to spend years developing a "new" OS and then have people say "Why not just use Linux?".
frank wrote: I have just one simple question dealing with the painter analogy. What if the one painter has to prime the fences first and then has to wait for the primer to dry? Does the painter A - wait for the paint to dry and then continue, or B - go on to the next fence while waiting for the paint to dry? If he chooses A then there would be a lot of wasted time waiting for the paint to dry. Just a thought.

The painter would do another fence while the paint dries, but as soon as the paint has dried he'd return. Basically the scheduler always gives CPU time to the oldest batch processing job that can run.
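To make the "oldest runnable job first" idea concrete, here's a minimal sketch of how a batch-policy run queue might pick the next thread (the structure and function names are hypothetical, for illustration only):

Code:

/* Hypothetical sketch: a batch-policy queue kept in creation order.
   The scheduler always runs the oldest thread that is ready; a thread
   that blocks (waiting for the "paint to dry") is skipped, and as soon
   as it becomes ready again it is picked before any younger thread. */

#include <stddef.h>

typedef enum { READY, BLOCKED } thread_state;

typedef struct batch_thread {
    struct batch_thread *next;    /* singly linked list, oldest first */
    thread_state state;
} batch_thread;

static batch_thread *batch_queue_head = NULL;   /* the oldest job */

/* Return the oldest runnable batch thread, or NULL if they're all blocked. */
batch_thread *pick_next_batch(void)
{
    batch_thread *t;

    for (t = batch_queue_head; t != NULL; t = t->next) {
        if (t->state == READY) {
            return t;
        }
    }
    return NULL;   /* nothing to do - fall back to lower priority work or idle */
}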
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Combuster
- Member
- Posts: 9301
- Joined: Wed Oct 18, 2006 3:45 am
- Libera.chat IRC: [com]buster
- Location: On the balcony, where I can actually keep 1½m distance
- Contact:
I tend to support Colonel Kernel's attitude: the added complexity might not be worth the gains, if any. To add to that, things that sound nice in theory might not work out in practice. You'd need lots of benchmarks, comparisons and live users to get a decent overview of the impact of a different scheduling system.
I have some more brain teasers regarding practicality:
Consider an IO-bound batch program (real world example: computing the MD5 of a DVD image for some p2p application), and a CPU-bound program (the user is running bittorrent in the background while playing with blender).
Now the download completes and the hashing starts, and shortly after the user wants to render his model. One would get the following:
The scheduler will hop from task to task when a disk track read completes, causing lots of scheduling overhead and TLB flushes as the raytracer is scheduled out each time disk data is ready. A more stock scheduler would give the raytracer a better share of CPU time, then return to the MD5 which by that time has a sensible amount of disk data to work on.
The other brain teaser is about ported GNU tools: they are not scheduler-aware and will always run in parallel mode, giving cause to just that which you are trying to avoid (which was also the reason why I said I wasn't satisfied by your explanation).
The last one (more philosophical) is whether or not there are always corner cases to be found that ruin the advantages of a particular scheduling algorithm.
Dilemmas like these are a reason why I have pluggable schedulers on the wish list for my own kernel...
Hi,
Combuster wrote: I have some more brain teasers regarding practicality:
Consider an IO-bound batch program (real world example: computing the MD5 of a DVD image for some p2p application), and a CPU-bound program (the user is running bittorrent in the background while playing with blender).
Now the download completes and the hashing starts, and shortly after the user wants to render his model. One would get the following:
The scheduler will hop from task to task when a disk track read completes, causing lots of scheduling overhead and TLB flushes as the raytracer is scheduled out each time disk data is ready. A more stock scheduler would give the raytracer a better share of CPU time, then return to the MD5 which by that time has a sensible amount of disk data to work on.

I/O bound tasks aren't really suited to batch processing in general.
Worst case would be a heavy processing task that is started just before an I/O bound task is started. For example, imagine if the heavy processing task takes 10 seconds of CPU time, and the I/O bound task takes 1 second of CPU time but spends 9 seconds waiting for I/O. In this case the heavy processing task would take 10 seconds from start to finish, but the I/O bound task would take 20 seconds from start to finish because it wouldn't start until the heavy processing completed. The average time between start and finish for both tasks would be 15 seconds.
If the I/O bound task used a higher priority, then the heavy processing task would run while the I/O bound task is waiting. In this case the I/O bound task would take 10 seconds from start to finish and the heavy processing task would take 11 seconds from start to finish. The average time between start and finish would be 10.5 seconds.
Based on this, the task that computes the MD5 checksum should be using Policy 3a while the renderer uses Policy 3b (assuming that both tasks are of equal importance, which I'm not so sure is a correct assumption).
Combuster wrote: The other brain teaser is about ported GNU tools: they are not scheduler-aware and will always run in parallel mode, giving cause to just that which you are trying to avoid (which was also the reason why I said I wasn't satisfied by your explanation).

GNU tools aren't scheduler aware, aren't optimised for a fully asynchronous system, aren't designed for my user interface and/or file system, and probably won't even compile. They don't belong on my OS as they'd suck without major changes.
Combuster wrote: The last one (more philosophical) is whether or not there are always corner cases to be found that ruin the advantages of a particular scheduling algorithm.

For all scheduling algorithms there are corner cases. Round robin and variable time slice suck for low latency (but handle CPU hogs well), prioritized co-operative scheduling sucks for CPU hogs (but handles low latency well), and batch processing sucks for I/O bound tasks and tasks that don't complete (but is good for reducing average time until completion).
IMHO a good scheduler uses a mixture of scheduling algorithms so that a task can use the algorithm that suits it (and so that corner cases can be avoided). This would imply that a worse scheduler uses fewer algorithms and can't avoid as many corner cases.
I have "kernel modules", where you could have many schedulers and choose which to use when the OS is installed or during boot. In a real system this doesn't make as much difference as you'd think.Combuster wrote:Dilemmas like these are a reason why I have pluggable schedulers on the wish list for my own kernel...
Basically any software needs to work with any scheduler, and you end up defining an interface between software and the scheduler that everything needs to comply with. If the "scheduler interface standard" states that there's a set of scheduling policies and Policy 3b is for low priority batch processing, then software can expect to be able to use low priority batch processing and schedulers need to handle it in a suitable manner.
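As a sketch of what such a "scheduler interface standard" could look like from the application side, something along these lines would let software request a policy without caring which scheduler module implements it. The policy numbers and the spawn_thread_ex() call are hypothetical, invented here for illustration:

Code:

/* Hypothetical policy identifiers matching the policies discussed in this thread. */
typedef enum {
    POLICY_0  = 0,   /* co-operative, device drivers / IRQ handlers        */
    POLICY_1A = 1,   /* high priority, variable time slice                 */
    POLICY_1B = 2,   /* high priority, batch (time limit enforced)         */
    POLICY_2A = 3,   /* normal priority, variable time slice               */
    POLICY_2B = 4,   /* normal priority, batch (time limit enforced)       */
    POLICY_3A = 5,   /* low priority, variable time slice                  */
    POLICY_3B = 6    /* low priority, batch (no time limit)                */
} sched_policy;

typedef int threadID;

/* A hypothetical thread creation call: whichever scheduler module is loaded
   must map these policy numbers onto something sensible. */
threadID spawn_thread_ex(void (*entry)(void), sched_policy policy, int priority);

/* Example: a renderer asking for low priority batch processing:
       threadID t = spawn_thread_ex(&render_scene, POLICY_3B, 0);           */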
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
brendan wrote: Policy 0 - Co-operative scheduling for device drivers (e.g. IRQ handlers), where the highest priority task in this policy gets all CPU time (but where the scheduler monitors time used and terminates the thread if it uses more than N ms).

Co-operative scheduling with a scheduler monitoring and interrupting? That's plain preemption with a yield added to the drivers, which IMHO would happen automatically if they block on semaphores.
brendan wrote: Policy 1a - High priority variable time slice for normal high priority threads, where all threads within this policy get N ms (where N depends on the thread's priority) and threads take turns.
Policy 1b - High priority batch processing, where the oldest thread that can run is run and gets all CPU time (but where the scheduler monitors time used and terminates a thread if it uses a total of more than N ms).

For 1b: is this the same as preempted FIFO, which is again a round-robin scheduler?
brendan wrote: Policy 2a - Normal priority variable time slice for normal threads, where all threads within this policy get N ms (where N depends on the thread's priority) and threads take turns.
Policy 2b - Normal priority batch processing, where the oldest thread that can run is run and gets all CPU time (but where the scheduler monitors time used and terminates a thread if it uses a total of more than N ms).

For 2b: is this the same as preempted FIFO, which is again a round-robin scheduler?
brendan wrote: Policy 3a - Low priority variable time slice, where all threads within this policy get N ms (where N depends on the thread's priority) and threads take turns.
Policy 3b - Low priority batch processing, where the oldest thread that can run is run and gets all CPU time (and the scheduler doesn't monitor time used).

3b actually makes sense because batch processing is in general low priority. Think of batches running overnight. I also think a book I have read mentions that batch processing is executed in idle time.
Author of COBOS
Hi,
os64dev wrote: Co-operative scheduling with a scheduler monitoring and interrupting? That's plain preemption with a yield added to the drivers, which IMHO would happen automatically if they block on semaphores.

The scheduler is monitoring the thread and terminates it if it hogs the CPU too much. It's not preempting it.
Imagine you're talking to a friend on the phone, and your mom asks you to open a jar of jam - you put the phone down, open the jar, then pick the phone up again and keep talking. This is preemption (the phone conversation was preempted by your mom).
Now imagine you're talking to a friend on the phone and your mom comes along with an axe. She smashes the phone into millions of tiny pieces, then proceeds to chop you into mincemeat and feeds your remains to the dog so that there's no evidence left for the police to find. This is termination (the phone conversation was terminated by your mom).
Consider what happens if a Policy 0 thread locks up. In this case it'd hog all CPU time and the entire OS would grind to a halt. The user wouldn't even be able to use "control+alt+delete" if the faulty thread is higher priority than the keyboard driver. If the scheduler thinks a Policy 0 thread has locked up (i.e. if that thread uses too much CPU time) it'll terminate the thread (i.e. completely remove it from memory and make sure that it never uses any CPU time ever again) so that the entire OS doesn't grind to a halt.
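A rough sketch of what that watchdog could look like inside the kernel's timer tick; the field names, the budget value and the helper function are invented for illustration, not taken from BCOS:

Code:

/* Hypothetical per-tick accounting for a Policy 0 (co-operative) thread.
   Instead of preempting the thread when it overruns, the scheduler kills
   it outright so the rest of the OS keeps running. */

#define POLICY0_BUDGET_MS  5          /* "N ms" - assumed value for illustration */

struct thread {
    int policy;                       /* 0 for co-operative device driver threads */
    unsigned int ms_used;             /* CPU time consumed since it started running */
};

void terminate_thread(struct thread *t);   /* hypothetical kernel function */

void timer_tick(struct thread *current)
{
    current->ms_used++;

    if (current->policy == 0 && current->ms_used > POLICY0_BUDGET_MS) {
        /* The thread has probably locked up: remove it completely so the
           rest of the OS (keyboard driver included) isn't starved. */
        terminate_thread(current);
    }
}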
os64dev wrote: 3b actually makes sense because batch processing is in general low priority. Think of batches running overnight. I also think a book I have read mentions that batch processing is executed in idle time.

Policy 1b and Policy 2b make more sense once you understand that "terminate" isn't the same as "preempt".
Batch processing would normally be done as the lowest priority because usually there's no guarantee that the thread will complete in any amount of time. If you can guarantee that a thread will complete in 1 ms then there's no reason why you can't use higher priority batch processing for that thread. Of course if something goes wrong and the thread locks up it'd be good if the scheduler notices and terminates the faulty thread before it causes problems for the rest of the OS.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan, what are you actually trying to do? I mean, what is the fundamental goal of this all?
As far as I can see, modern hardware is fast enough that there are basically three types of software where anything matters anymore:
1. Realtime - latency is everything, some CPU will necessarily be wasted.
2. High traffic servers - must scale as far as possible, I/O bandwidth is everything, and O(n) per client will eventually kill you
3. Interactive desktop - highly demanding in short bursts, rest of the time idle, somehow you must try to look responsive
I'm going to claim that if you manage to support any two of the above well, batch processing performance is going to be a complete non-issue.
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
Brendan wrote: Policy 1b and Policy 2b make more sense once you understand that "terminate" isn't the same as "preempt".

OK, I thought that you would put the batch back into the batch queue after the time slice is finished, but it seems that the time you give to the batch is actually its lifetime. So after that time the batch dies whether it finished its work or not. How do you handle the early death situation then? Because the batch has to be submitted again, I guess.
Author of COBOS
mystran wrote: As far as I can see, modern hardware is fast enough

I think this sums it up. Task switching used to be a costly operation in the time taken to invoke the scheduler, switch to supervisor mode, flip the stacks, change cr3 and flip back. Now this is less relevant as these operations are handled so quickly.
As I see it, the problem is changing. As computers become more capable of executing code quickly, the code tends to operate on larger pieces of memory (e.g. video conversion - a typical batch job). The issue is that when task switching away from and back to such a task, most of the cost is no longer in changing CPU registers, but in invalidating and then having to rebuild the TLB cache.
Therefore, (just a thought off the top of my head) if your system allows you to relocate programs at will, would relocating a batch job to a virtual area not used by, e.g. the rapid-response user interface, prevent the need to invalidate everything, e.g. by setting the pages to global (even though they're not)?
edit: or would that just bypass page protection? I assume it would unless you can guarantee that the processes won't try to read each other's address space.
Regards,
John.
Hi,
mystran wrote: Brendan, what are you actually trying to do? I mean, what is the fundamental goal of this all?

The main goal is scalability.
In general, as a system becomes more scalable it tends to use more (and smaller) tasks that each do less work, which means the scheduler also needs to deal with more of these small tasks.
Traditional schedulers are designed more for a smaller number of threads that do more, and therefore additional scheduling algorithms like "batch processing" would be less effective for these systems.
For an (over-simplified) example, consider this:
Code:
/* Single-threaded version: sums the integers 0 .. max-1. */
int find_sum(int max) {
    int i;
    int result = 0;

    for(i = 0; i < max; i++) {
        result += i;
    }
    return result;
}
This is standard single-threaded code that doesn't scale at all. To make this more scalable, you could try something like:

Code:
/* Multi-threaded version: splits the range across worker threads using the
   messaging primitives (spawnThread, sendMessage, getMessage, etc.). */
int find_sum(int max) {
    int workers;
    int range;
    int i;
    threadID *threadList;
    int result = 0;

    /* One worker per spare CPU; the calling thread does a share of the work too. */
    workers = get_number_of_CPUs() - 1;
    if( max / 4 < workers) {
        workers = max / 4;
    }

    /* Too small to be worth splitting up - just do it single-threaded. */
    if(workers == 0) {
        for(i = 0; i < max; i++) {
            result += i;
        }
        return result;
    }

    range = max / (workers + 1);
    threadList = malloc( sizeof(threadID) * workers );

    for(i = 0; i < workers; i++) {
        threadList[i] = spawnThread( &workerThread );
    }
    for(i = 0; i < workers; i++) {
        sendMessage(threadList[i], SUM, i * range, (i + 1) * range);
    }

    /* This thread sums the left-over range while the workers are busy. */
    for(i = workers * range; i < max; i++) {
        result += i;
    }

    /* Collect one reply per worker. */
    while(workers > 0) {
        getMessage();
        if(message.type == WORKER_RESULT) {
            result += message.data1;
            workers--;
        }
    }
    free(threadList);
    return result;
}

void workerThread(void) {
    threadID client;
    int start = 0;
    int end = 0;
    int i;
    int result = 0;

    /* Wait for a SUM request telling us which range to add up. */
    while(end == 0) {
        getMessage();
        if(message.type == SUM) {
            client = message.senderID;
            start = message.data1;
            end = message.data2;
        }
    }
    for(i = start; i < end; i++) {
        result += i;
    }
    sendMessage(client, WORKER_RESULT, result, 0);
    terminateThread();
}

So now you've got code that scales to the number of CPUs in the computer. This is good, but what about other computers?
For a distributed system, you could create a (multi-threaded) service that calculates the sum of a range of numbers, and then run this service on each computer. In this case you could have a main thread that sends a range of numbers to each service/computer and combines the replies into a final result. Of course then you could add redundancy - send each range of numbers to 2 different computers and check to make sure both computers return the same result.
Note: I know this example is a stupid example. Rather than using brute force a little thinking can go a long way (e.g. "result = max * (max - 1) / 2").
os64dev wrote: OK, I thought that you would put the batch back into the batch queue after the time slice is finished, but it seems that the time you give to the batch is actually its lifetime. So after that time the batch dies whether it finished its work or not. How do you handle the early death situation then? Because the batch has to be submitted again, I guess.

I use a system for "critical errors".
When code does something wrong and the kernel can't just return an error code, it causes a critical error. This includes exceptions, but also other detectable problems: read or write errors on memory mapped files, read errors on swap space, and exceeding time limits set by the scheduler.
When a critical error occurs the kernel adds some details in the kernel log and may do some optional things (create a core dump, send an email to the developers, etc). In any case the thread itself is always terminated. Each thread has a "has recovery" flag. If this flag is set for the terminated thread and the process has a valid "recovery thread", then the kernel sends details of the critical error to the recovery thread (and the recovery thread is responsible for determining what happened, and either recovering from the mess left behind or terminating the rest of the process). Otherwise (if the flag is clear for the terminated thread or the process doesn't have a valid recovery thread), the kernel terminates the entire process.
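As a sketch, the decision described above might look roughly like this in the kernel's critical-error path. The structure layout and the helper functions are invented here for illustration; the real kernel would obviously differ:

Code:

/* Hypothetical sketch of the critical error handling just described. */

struct process;

struct thread {
    int has_recovery;                  /* per-thread "has recovery" flag       */
    struct process *owner;
};

struct process {
    struct thread *recovery_thread;    /* NULL if the process doesn't have one */
};

void kernel_log(int error_code);                                   /* hypothetical */
void terminate_thread(struct thread *t);                           /* hypothetical */
void terminate_process(struct process *p);                         /* hypothetical */
void send_error_details(struct thread *recovery, int error_code);  /* hypothetical */

void critical_error(struct thread *t, int error_code)
{
    struct process *p = t->owner;
    int recoverable = t->has_recovery && (p->recovery_thread != NULL);

    kernel_log(error_code);            /* always log the details               */
    /* ...optional actions: core dump, email the developers, etc...            */

    terminate_thread(t);               /* the faulting thread is always killed */

    if (recoverable) {
        /* Let the process try to clean up after the terminated thread. */
        send_error_details(p->recovery_thread, error_code);
    } else {
        terminate_process(p);          /* otherwise the whole process dies     */
    }
}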
Other threads (belonging to other processes) that may have been waiting for results can elect to receive "obituaries" (where if any thread or process is terminated the kernel sends a message saying so), and use these obituaries to detect when something they're waiting for is terminated. Alternatively other threads can use time-outs (i.e. if a thread hasn't received the reply in N seconds it can retry). This sort of thing is mostly required anyway, as something like unplugging a network cable can cause undeliverable messages.
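A client-side sketch of the time-out alternative, using the same hypothetical messaging primitives as the find_sum example above (getMessageTimeout() is an assumed variant of getMessage() that gives up after a deadline):

Code:

/* Hypothetical: send a request and retry if no reply arrives in time,
   instead of subscribing to obituaries. */

#define REPLY_TIMEOUT_MS  5000
#define MAX_RETRIES       3

int request_with_retry(threadID service, int type, int data1, int data2)
{
    int attempt;

    for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
        sendMessage(service, type, data1, data2);
        if (getMessageTimeout(REPLY_TIMEOUT_MS) == 0) {
            return message.data1;      /* got a reply in time */
        }
        /* No reply - the service may have been terminated, or the network
           cable unplugged; try again, or give up after MAX_RETRIES. */
    }
    return -1;                         /* caller decides what "failed" means */
}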
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Hi,
jnc100 wrote: Therefore, (just a thought off the top of my head) if your system allows you to relocate programs at will, would relocating a batch job to a virtual area not used by, e.g. the rapid-response user interface, prevent the need to invalidate everything, e.g. by setting the pages to global (even though they're not)?
edit: or would that just bypass page protection? I assume it would unless you can guarantee that the processes won't try to read each other's address space.

It wouldn't bypass the page protection, but you'd need to invalidate all TLB entries that are affected when you relocate the batch job and when you set or clear the global flag.
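For reference, the "global" flag being discussed is bit 8 of an x86 page table entry: globally-marked pages survive the TLB flush caused by reloading CR3 on a task switch (when CR4.PGE is enabled), which is why they have to be invalidated explicitly with invlpg whenever the mapping changes. A minimal sketch (the unmap_global_page() helper is hypothetical):

Code:

#include <stdint.h>

#define PTE_PRESENT  0x001u
#define PTE_WRITABLE 0x002u
#define PTE_USER     0x004u
#define PTE_GLOBAL   0x100u   /* bit 8: entry survives CR3 reloads (needs CR4.PGE) */

/* Invalidate a single page's TLB entry - required for global pages, because
   reloading CR3 on a task switch won't evict them. */
static inline void invlpg(void *addr)
{
    __asm__ volatile("invlpg (%0)" : : "r"(addr) : "memory");
}

/* Changing or removing a global mapping therefore looks roughly like this: */
void unmap_global_page(uint32_t *pte, void *vaddr)
{
    *pte &= ~(uint32_t)(PTE_PRESENT | PTE_GLOBAL);
    invlpg(vaddr);   /* without this, a stale global entry could linger in the TLB */
}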
To be honest, I'm skeptical about the costs of TLB flushing in user space. I'm not saying that TLB flushes caused by task switches don't affect performance (they do), but I am saying that it's not really a huge problem.
If you're switching from one process to another then TLB entries for the old process won't be used by the new process anyway, so you lose nothing by flushing the (user space) TLB entries.
It's only when you switch from process A to process B and then switch back to process A that the old TLB entries for process A would be useful. Of course there's also a chance that older TLB entries have been replaced by newer TLB entries while process B is running (given that the CPU uses a "least recently used" algorithm when it needs to evict TLB entries to make space).
Also, CPUs are designed to hide the cost of TLB misses (part of being "out of order"), and AFAIK the page table walk caused by a TLB miss may be satisfied from data left in the L1 or L2 caches (and might not result in a full read from the memory chips themselves).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Brendan wrote: Also, CPUs are designed to hide the cost of TLB misses (part of being "out of order"), and AFAIK the page table walk caused by a TLB miss may be satisfied from data left in the L1 or L2 caches (and might not result in a full read from the memory chips themselves).

Ah, hadn't thought about that. I was just trying to find a way to cause the least inconvenience to a background batch job when a minor UI update takes place, e.g. updating the clock in the corner. It surely shouldn't be necessary to invalidate the entire TLB just to change such a small area of the screen, but I suppose it is.
Regards,
John.