Hi,
Colonel Kernel wrote:
Brendan wrote: To be honest, I'm skeptical about the costs of TLB flushing in user space. I'm not saying that TLB flushes caused by task switches don't affect performance (they do), but I am saying that it's not really a huge problem.
As you've pointed out before, it is indeed the re-filling of the TLB that's expensive, not the actual flushing. This is just a terminology issue though -- the cost is still there.
Consider an Opteron with 1088 TLB entries running an OS that uses 4 KB pages. The TLB is capable of holding the translations for 4.25 MB of an address space.
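(1088 entries * 4 KB per entry = 4352 KB, or 4.25 MB.)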
Now imagine you've got something like this:
Code:
#include <string.h>
#define MB (1024 * 1024)

/* Each function's working set is larger than the 4.25 MB the TLB can map. */
static char buffer1[5 * MB];
static char buffer2a[5 * MB / 2], buffer2b[5 * MB / 2];

void doFunction1(void) {
    memset(buffer1, 0, sizeof(buffer1));              /* touches 5 MB */
}

void doFunction2(void) {
    memcpy(buffer2a, buffer2b, sizeof(buffer2b));     /* touches 2 x 2.5 MB */
}

int main(void) {
    doFunction1();
    doFunction2();
}
In this case doFunction1 and doFunction2 both trash the TLB: nothing that was in the TLB before either of these functions runs gets reused, and everything that was in the TLB beforehand has been evicted by the time either function finishes.
If these functions were implemented as separate tasks, it wouldn't make any difference whether the OS used a single address space or not. A TLB flush caused by switching from one address space to another wouldn't matter, because the working set of each task is enough to cause any old TLB entries to be evicted anyway.
The problem in this example isn't address space switching, it's switching from one "working set" to another.
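To make that concrete, here's a minimal sketch (hypothetical - it uses fork() so that the two tasks really do get separate address spaces) of the same work split into two tasks. Each task's working set is still larger than the 4.25 MB the TLB can map, so the old TLB entries get evicted whether or not an address space switch happens in between:
Code:
/* Hypothetical illustration: the same work as two tasks in separate
 * address spaces, created with fork(). Each task still touches more
 * memory than the 4.25 MB the TLB can map, so old TLB entries are
 * evicted regardless of the address space switch. */
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define MB (1024 * 1024)

static char buffer1[5 * MB];
static char buffer2a[5 * MB / 2], buffer2b[5 * MB / 2];

int main(void) {
    if (fork() == 0) {                                /* task 1, own address space */
        memset(buffer1, 0, sizeof(buffer1));
        _exit(0);
    }
    memcpy(buffer2a, buffer2b, sizeof(buffer2b));     /* task 2 */
    wait(NULL);
}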
Colonel Kernel wrote:
About your distributed processing example -- I don't really see how that fits with your batch processing idea. Do you have a "distributed scheduler" of sorts...? As soon as you bring distributed processing into the equation, the cost of a few extra context switches gets lost in the noise IMO.
Basically, what I'm trying to say is that as systems become more scalable, traditional scheduling techniques become less suitable and different techniques (like batch processing) become more suitable.
Single-threaded code uses one thread that runs for a relatively long time, so a scheduler aimed at single-threaded software should work well with a small number of threads that each run for a relatively long time. In this case, making it look like all threads are making progress is more important, and threads have to be preempted to create that illusion.
As code becomes more scalable it tends to use more threads that each run for a shorter time, and it's more likely that something is waiting for several threads to complete. In this case, making it look like all threads are making progress isn't important; what matters is letting a thread run long enough to complete the work that something else is waiting for. Preempting threads is therefore less suitable.
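As a hypothetical example with made-up numbers: if two jobs each need 10 ms of CPU time and something is waiting for each of them, running them to completion one after the other gives completion times of 10 ms and 20 ms (an average of 15 ms), while preemptively interleaving them with fine-grained time slices gives completion times of roughly 20 ms and 20 ms (an average of about 20 ms). Preemption makes the average worse without making either job finish sooner.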
The "batch processing" idea doesn't give the illusion that all threads are making progress, doesn't use preemption at all, and instead optimises the average time until a job completes. It's intended for systems where there's lots of threads that run for shorter lengths of time (scalable systems) rather than being intended for systems with a few threads that run for longer lengths of time.
Cheers,
Brendan