Hi,
I have had nicely working 'test' task switching routines in the past (using the separate stack method) and am now looking at where to go from here. I would like to support MMX/SSE(2/3) and the extended 64-bit registers, and wondered how best to do this.
As per usual, I would like to spend as little time as possible in the task switching routine, but it looks like there will inevitably be some branching involved here. So, what is the best way to do this?
1) Have a flag in my program control structure indicating whether these registers are used, and push/pop based on that flag?
2) Just Push/Pop these registers for each task anyway?
3) A mixture of the two (for example R8-R15 may be pushed regardless of whether they are used, but MMX/SSE may be an option)?
Thanks for your thoughts,
Adam
Task Switching with Extended Registers
Hi,
For R8 to R15, it's fairly safe to assume that all 64-bit processes will use these registers. If your OS also supports 32-bit processes (e.g. for compatibility with processes designed for the 32-bit version of your OS), then the cost of always saving these registers is likely to be less than the cost of potential branch mispredictions, especially as you'd be looking at up to 2 mispredicted branches: one test for whether the old task was 64-bit (before saving R8 to R15) and one for whether the new task is 64-bit (before loading them).
It may be better for performance to always save/load R8 to R15 to avoid these potential branch mispredictions, unless you need the branches for some other reason.
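As a rough illustration of the "always save/load" choice, here is a minimal sketch in C that uses a simulated register file instead of real assembly. The names `cpu_regs_t`, `task_t` and `switch_gprs` are made up for this example; a real kernel would do the same thing with push/pop or mov instructions in its assembly switch routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated general purpose register file: slots 0-7 stand for the
 * legacy registers, slots 8-15 for R8..R15 (layout is an assumption). */
#define NUM_GPRS 16

typedef struct {
    uint64_t gpr[NUM_GPRS];
} cpu_regs_t;

typedef struct {
    cpu_regs_t saved;   /* same layout for 32-bit and 64-bit tasks */
} task_t;

/* Save and load all 16 registers unconditionally. There is no
 * "is this task 64-bit?" test, so there is nothing to mispredict. */
void switch_gprs(cpu_regs_t *cpu, task_t *old_task, const task_t *new_task)
{
    assert(cpu != NULL && old_task != NULL && new_task != NULL);
    memcpy(&old_task->saved, cpu, sizeof *cpu);
    memcpy(cpu, &new_task->saved, sizeof *cpu);
}
```

The price is 8 extra stores and 8 extra loads for 32-bit tasks, in exchange for one fixed context layout and a branch-free path.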
For FPU/MMX/SSE state, you could always load/save this state during the task switch. This method causes the least overhead if most/all tasks use FPU/MMX/SSE, and is also the easiest method to implement.
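A minimal sketch of the "always" method, assuming the state is kept in a 512-byte, 16-byte aligned area of the kind FXSAVE/FXRSTOR use. The real instructions are replaced by `memcpy` stand-ins (`sim_fxsave`, `sim_fxrstor`) so the example runs anywhere; all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FXSAVE_SIZE 512   /* FXSAVE/FXRSTOR use a 512-byte, 16-byte aligned area */

typedef struct {
    _Alignas(16) uint8_t bytes[FXSAVE_SIZE];
} fpu_state_t;

typedef struct {
    fpu_state_t fpu;      /* this task's saved FPU/MMX/SSE state */
} task_t;

/* Stand-ins for FXSAVE and FXRSTOR; "live" models the state held in the CPU. */
static void sim_fxsave(fpu_state_t *dst, const fpu_state_t *live)
{
    memcpy(dst, live, sizeof *dst);
}

static void sim_fxrstor(fpu_state_t *live, const fpu_state_t *src)
{
    memcpy(live, src, sizeof *live);
}

/* "Always" method: unconditional save and load on every task switch,
 * with no flags and no branches. */
void switch_fpu_always(fpu_state_t *live, task_t *old_task, const task_t *new_task)
{
    assert(live != NULL && old_task != NULL && new_task != NULL);
    sim_fxsave(&old_task->fpu, live);
    sim_fxrstor(live, &new_task->fpu);
}
```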
You could use the CPU's built-in support for delayed FPU/MMX/SSE state saving/loading. In this case the task switch code never saves/loads FPU/MMX/SSE state and contains no test/branching for it, but if the task actually does use FPU/MMX/SSE you get a "Device not available" exception and need to do the FPU/MMX/SSE state saving/loading within the exception handler. The interesting thing here is that FPU/MMX/SSE state is never loaded or saved when it's unnecessary. This method causes the least overhead if most/all tasks don't use FPU/MMX/SSE (but is worse if most tasks do use FPU/MMX/SSE, because each use then costs an exception on top of the save/load). Also, because the task switch code never needs to save/load the FPU/MMX/SSE state, the task switch itself is always faster than with any other method (at the expense of exception handling overhead outside the task switch). So if you want fast task switching but don't care about total overhead (e.g. you disable interrupts during the task switch and are worried about interrupt latency), it's the best option. However...
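To make the delayed scheme concrete, here is a small single-CPU simulation in C. It is only a sketch: the `ts` field models the CR0.TS flag (FPU/MMX/SSE instructions raise "Device not available" while it is set), `memcpy` stands in for FXSAVE/FXRSTOR, and `cpu_t`, `fpu_owner` and the `lazy_*` functions are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { unsigned char bytes[512]; } fpu_state_t;
typedef struct { fpu_state_t fpu; } task_t;

typedef struct {
    fpu_state_t live;      /* state currently held in the FPU/SSE registers */
    bool ts;               /* models CR0.TS: FPU use traps while this is set */
    task_t *current;       /* running task */
    task_t *fpu_owner;     /* task whose state is in "live", or NULL */
    unsigned saves, loads; /* counters, for illustration only */
} cpu_t;

/* Task switch: touches no FPU state and has no FPU-related branch. */
void lazy_switch(cpu_t *cpu, task_t *next)
{
    cpu->current = next;
    cpu->ts = true;
}

/* Models the "Device not available" exception handler. */
void lazy_nm_handler(cpu_t *cpu)
{
    cpu->ts = false;                                   /* CLTS */
    if (cpu->fpu_owner == cpu->current)
        return;                                        /* state is still loaded */
    if (cpu->fpu_owner != NULL) {
        memcpy(&cpu->fpu_owner->fpu, &cpu->live, sizeof cpu->live);  /* FXSAVE */
        cpu->saves++;
    }
    memcpy(&cpu->live, &cpu->current->fpu, sizeof cpu->live);        /* FXRSTOR */
    cpu->loads++;
    cpu->fpu_owner = cpu->current;
}

/* Models a task executing an FPU/MMX/SSE instruction. */
void lazy_use_fpu(cpu_t *cpu, unsigned char value)
{
    assert(cpu->current != NULL);
    if (cpu->ts)
        lazy_nm_handler(cpu);
    cpu->live.bytes[0] = value;
}
```

If task A uses the FPU, task B runs without touching it, and A runs again, the second use by A traps but moves no state at all, because A's state never left the registers.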
The method documented by Intel for using the CPU's built-in support for delayed FPU/MMX/SSE state saving/loading does not work for multi-CPU (unless tasks use CPU affinity to ensure they're always run on the same CPU): when the "Device not available" exception handler needs to load a task's FPU/MMX/SSE state, this state may still be in another CPU (and not in RAM). For multi-CPU there's a way to make it work by using the hardware in a way it wasn't intended to be used, so that FPU/MMX/SSE state is still never loaded or saved when it's unnecessary. The basic idea is for the "Device not available" exception handler to load the task's FPU/MMX/SSE state and set a "FPU/MMX/SSE was used" flag, and for the task switch code to test whether the "FPU/MMX/SSE was used" flag is set and only save this state if it was used.
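The multi-CPU variant can be sketched the same way. This is a simulation under the same assumptions as before (`ts` models CR0.TS, `memcpy` stands in for FXSAVE/FXRSTOR), and `fpu_used`, `mp_switch` and the other names are hypothetical. The point to notice is that a task's state is always back in RAM by the time the task stops running, so it can be picked up by any CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { unsigned char bytes[512]; } fpu_state_t;

typedef struct {
    fpu_state_t fpu;    /* saved state, up to date whenever the task is not running */
    bool fpu_used;      /* the "FPU/MMX/SSE was used" flag */
} task_t;

typedef struct {
    fpu_state_t live;   /* state held in this CPU's registers */
    bool ts;            /* models CR0.TS */
    task_t *current;
} cpu_t;

/* "Device not available" handler: load the state and mark it as used. */
void mp_nm_handler(cpu_t *cpu)
{
    cpu->ts = false;                                            /* CLTS */
    memcpy(&cpu->live, &cpu->current->fpu, sizeof cpu->live);   /* FXRSTOR */
    cpu->current->fpu_used = true;
}

/* Task switch: one test, and a save only if the old task used the FPU. */
void mp_switch(cpu_t *cpu, task_t *next)
{
    task_t *old = cpu->current;
    if (old != NULL && old->fpu_used) {
        memcpy(&old->fpu, &cpu->live, sizeof cpu->live);        /* FXSAVE */
        old->fpu_used = false;
    }
    cpu->ts = true;
    cpu->current = next;
}

/* Model a task writing to / reading from an FPU register. */
void mp_write_fpu(cpu_t *cpu, unsigned char value)
{
    assert(cpu->current != NULL);
    if (cpu->ts)
        mp_nm_handler(cpu);
    cpu->live.bytes[0] = value;
}

unsigned char mp_read_fpu(cpu_t *cpu)
{
    assert(cpu->current != NULL);
    if (cpu->ts)
        mp_nm_handler(cpu);
    return cpu->live.bytes[0];
}
```

Compared with the single-CPU version, the save moves from the exception handler into the task switch, at the cost of one branch there.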
You could allow tasks to enable/disable FPU/MMX/SSE access and only load/save this state during the task switch if the task has FPU/MMX/SSE access enabled. This gives you 2 branches in the task switch code, but avoids exception handling overheads and sometimes avoids loading/saving FPU/MMX/SSE state. The problem here is that FPU/MMX/SSE may be enabled but not used during the task's time slice, in which case the state is loaded/saved for no reason. So if most/all tasks use FPU/MMX/SSE the overhead would be worse than always loading/saving but better than delayed loading/saving, and if most/all tasks don't use FPU/MMX/SSE the overhead would be better than always loading/saving but worse than delayed loading/saving.
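A sketch of the enable/disable approach, again as a simulation with `memcpy` standing in for FXSAVE/FXRSTOR. The `fpu_enabled` flag, `task_set_fpu_access` (imagined as the body of a system call) and `optin_switch` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { unsigned char bytes[512]; } fpu_state_t;

typedef struct {
    fpu_state_t fpu;
    bool fpu_enabled;   /* set/cleared by the task itself */
} task_t;

/* Hypothetical system call body: the task opts in or out of FPU/MMX/SSE access. */
void task_set_fpu_access(task_t *task, bool enabled)
{
    assert(task != NULL);
    task->fpu_enabled = enabled;
}

/* Task switch with the 2 branches. Returns the number of 512-byte
 * state transfers performed (0, 1 or 2), for illustration only. */
unsigned optin_switch(fpu_state_t *live, task_t *old_task, const task_t *new_task)
{
    unsigned transfers = 0;
    if (old_task->fpu_enabled) {
        memcpy(&old_task->fpu, live, sizeof *live);   /* FXSAVE */
        transfers++;
    }
    if (new_task->fpu_enabled) {
        memcpy(live, &new_task->fpu, sizeof *live);   /* FXRSTOR */
        transfers++;
    }
    return transfers;
}
```

A real kernel would presumably also need FPU/MMX/SSE instructions to fault for tasks that have access disabled; that part is left out here.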
Of course for FPU/MMX/SSE state there are probably other ways to handle it. In general there is no "best" method for all loads; mostly it depends on how many tasks use FPU/MMX/SSE, and how many of those tasks actually use it during every time slice. If you're feeling brave, you could support several different methods (e.g. the "always" method and the "delayed" method) and then dynamically determine which method to use based on load (e.g. switch from "delayed" to "always" if the number of tasks using FPU/MMX/SSE rises above some threshold, and switch back to "delayed" if it falls below some threshold).
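The dynamic selection could look something like the sketch below. The two percentage thresholds (60 and 40) are arbitrary values picked for the example, and using two different thresholds so the policy doesn't flip back and forth is an assumption of this sketch; `fpu_policy_t` and `choose_fpu_policy` are hypothetical names.

```c
#include <assert.h>

typedef enum { FPU_POLICY_DELAYED, FPU_POLICY_ALWAYS } fpu_policy_t;

/* Arbitrary example thresholds, as a percentage of tasks using FPU/MMX/SSE. */
#define ALWAYS_ABOVE_PERCENT  60u
#define DELAYED_BELOW_PERCENT 40u

/* Decide which method to use from now on, given the current method and
 * how many of the tasks are known to use FPU/MMX/SSE. */
fpu_policy_t choose_fpu_policy(fpu_policy_t current,
                               unsigned fpu_tasks, unsigned total_tasks)
{
    assert(fpu_tasks <= total_tasks);
    if (total_tasks == 0)
        return current;                       /* nothing to base a decision on */

    unsigned percent = fpu_tasks * 100u / total_tasks;

    if (current == FPU_POLICY_DELAYED && percent > ALWAYS_ABOVE_PERCENT)
        return FPU_POLICY_ALWAYS;
    if (current == FPU_POLICY_ALWAYS && percent < DELAYED_BELOW_PERCENT)
        return FPU_POLICY_DELAYED;
    return current;                           /* between thresholds: no change */
}
```

With these values, 7 FPU-using tasks out of 10 moves "delayed" to "always", 5 out of 10 leaves either policy unchanged, and 3 out of 10 moves "always" back to "delayed".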
Also, eventually there may be other things to save/load - e.g. the debug registers (DR0 to DR3 and DR7), performance monitoring MSRs, the address of the "Debug Store" buffer, etc. These sorts of things can probably be ignored for now (usually people don't add any support for things like debugging, profiling and performance monitoring/tuning until much later).
Cheers,
Brendan
Code (the two branches from the R8 to R15 example above):
if (old task was 64-bit) {
    save R8 to R15
}
if (new task is 64-bit) {
    load R8 to R15
}
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.