Hi,
Colonel Kernel wrote:
Brendan wrote: To be honest, I'm skeptical about the costs of TLB flushing in user space. I'm not saying that TLB flushes caused by task switches don't affect performance (they do), but I am saying that it's not really a huge problem.
As you've pointed out before, it is indeed the re-filling of the TLB that's expensive, not the actual flushing. This is just a terminology issue though -- the cost is still there.
Consider an Opteron with 1088 TLB entries running an OS that uses 4 KB pages. The TLB is capable of holding the translations for 4.25 MB of an address space.
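(1088 entries * 4 KB per entry = 4352 KB, or 4.25 MB.)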
Now imagine you've got something like this:
Code:
#include <string.h>
#define MB (1024 * 1024)

/* Each function's working set is larger than the 4.25 MB the TLB can map. */
static char buffer1[5 * MB];
static char buffer2a[5 * MB / 2], buffer2b[5 * MB / 2];

void doFunction1(void) {
    memset(buffer1, 0, sizeof(buffer1));              /* touches 5 MB */
}

void doFunction2(void) {
    memcpy(buffer2a, buffer2b, sizeof(buffer2b));     /* touches 2 x 2.5 MB */
}

int main(void) {
    doFunction1();
    doFunction2();
}
In this case doFunction1 and doFunction2 both trash the TLB: nothing that was in the TLB before either of these functions runs gets reused, and everything that was in the TLB beforehand has been evicted by the time either function finishes.
If these functions were implemented as separate tasks, it wouldn't make any difference whether the OS used a single address space or not. A TLB flush caused by switching from one address space to another wouldn't matter, because the working set of each task is enough to cause any old TLB entries to be evicted anyway.
The problem in this example isn't address space switching, it's switching from one "working set" to another.
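To make that concrete, here's a minimal sketch (hypothetical - it uses fork() so that the two tasks really do get separate address spaces) of the same work split into two tasks. Each task's working set is still larger than the 4.25 MB the TLB can map, so the old TLB entries get evicted whether or not an address space switch happens in between:
Code:
/* Hypothetical illustration: the same work as two tasks in separate
 * address spaces, created with fork(). Each task still touches more
 * memory than the 4.25 MB the TLB can map, so old TLB entries are
 * evicted regardless of the address space switch. */
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define MB (1024 * 1024)

static char buffer1[5 * MB];
static char buffer2a[5 * MB / 2], buffer2b[5 * MB / 2];

int main(void) {
    if (fork() == 0) {                                /* task 1, own address space */
        memset(buffer1, 0, sizeof(buffer1));
        _exit(0);
    }
    memcpy(buffer2a, buffer2b, sizeof(buffer2b));     /* task 2 */
    wait(NULL);
}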
Colonel Kernel wrote:
About your distributed processing example -- I don't really see how that fits with your batch processing idea. Do you have a "distributed scheduler" of sorts...? As soon as you bring distributed processing into the equation, the cost of a few extra context switches gets lost in the noise IMO.
Basically, what I'm trying to say is that as systems become more scalable, traditional scheduling techniques become less suitable and different techniques (like batch processing) become more suitable.
Single-threaded code uses one thread that runs for a relatively long time, so a scheduler aimed at single-threaded software should work well with a small number of threads that each run for a relatively long time. In this case, making it look like all threads are making progress is more important, and threads have to be preempted to create that illusion.
As code becomes more scalable it tends to use more threads that each run for a shorter time, and it's more likely that something is waiting for several threads to complete. In this case, making it look like all threads are making progress isn't important; what matters is letting a thread run long enough to complete the work that something else is waiting for. Preempting threads is therefore less suitable.
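As a hypothetical example with made-up numbers: if two jobs each need 10 ms of CPU time and something is waiting for each of them, running them to completion one after the other gives completion times of 10 ms and 20 ms (an average of 15 ms), while preemptively interleaving them with fine-grained time slices gives completion times of roughly 20 ms and 20 ms (an average of about 20 ms). Preemption makes the average worse without making either job finish sooner.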
The "batch processing" idea doesn't give the illusion that all threads are making progress, doesn't use preemption at all, and instead optimises the average time until a job completes. It's intended for systems where there's lots of threads that run for shorter lengths of time (scalable systems) rather than being intended for systems with a few threads that run for longer lengths of time.
Cheers,
Brendan