FPU state on multicore processors
I just discovered that my FPU state handler will fail on multicore processors, so I need to update it, although I'm not sure how.
Current (single-core) functionality:
1. Every task switch sets the TS flag in the CR0 register (previously this was done automatically when using hardware task switching).
2. Exception 7 checks whether the FPU state belongs to the current thread (a single variable, math_tss, holds the last owner). If not, an fsave is done to the thread in math_tss, followed by an frstor from the new thread, and then math_tss is set to the new thread.
Of course, this logic will malfunction on a multicore processor since there is only a single current FPU thread per system.
However, simply making the current FPU thread a per-core variable will not work smoothly. If an FPU-using thread is moved from one core to another, the new core cannot easily fetch the state from the old core (IPIs would be needed). Saving the state every time a thread might be moved to another core would fix this, but that situation is not easily detected.
Currently, I think I might do it like this:
1. Create a new flag per core, HAS_FPU_STATE, and clear it during processor initialization.
2. Set TS flag during processor initialization.
3. Exception 7 will set HAS_FPU_STATE, load the FPU state for the thread, and execute CLTS.
4. The task-switcher will no longer set the TS flag unconditionally; instead it will check HAS_FPU_STATE. If it is set, it will save the FPU state, clear HAS_FPU_STATE, and set the TS flag.
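In C-like pseudocode the scheme would look roughly like this (all identifiers and helper functions here are made up for illustration; they are not actual kernel code):
Code: Select all
/* Illustrative sketch only -- struct, field and helper names are invented. */
struct thread { unsigned char fpu_save_area[108]; };   /* 32-bit FSAVE image */

struct core {
    int            has_fpu_state;   /* HAS_FPU_STATE for this core             */
    struct thread *fpu_owner;       /* thread whose state is live in this FPU  */
};

void clts(void);                    /* clear CR0.TS                            */
void set_ts(void);                  /* set CR0.TS                              */
void fsave(void *area);             /* fsave  [area]                           */
void frstor(const void *area);      /* frstor [area]                           */

/* Step 3: exception 7 (device not available) handler. */
void device_not_available(struct core *c, struct thread *current)
{
    clts();                              /* FPU instructions allowed again     */
    frstor(current->fpu_save_area);      /* load the faulting thread's state   */
    c->fpu_owner     = current;
    c->has_fpu_state = 1;
}

/* Step 4: called from the task-switcher instead of always setting TS. */
void switch_out_fpu(struct core *c)
{
    if (c->has_fpu_state) {
        fsave(c->fpu_owner->fpu_save_area);  /* write the state back to its owner    */
        c->has_fpu_state = 0;
        set_ts();                            /* next FPU use faults with exception 7 */
    }
}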
How would (have) others solved this problem?
- Combuster
Re: FPU state on multicore processors
There are two factors to consider:
- what is the cost of moving the FPU state from one core to another
- how often does the FPU state have to switch from cores
If the state swaps over often, it is probably faster to save on each task switch.
If the state swaps over sparingly, it is probably faster to interrupt the remote processor.
Incidentally, the same holds for the processor caches, meaning that every core swap is a performance hit in its own right, and you will want to minimize them as much as possible. For instance, a dumb scheduler will cause a thread to hop cores when it's scheduled with probability (n_cores - 1) / n_cores, i.e. 50% on a dual-core and a whopping 87.5% on a 4x2 Core i7 (8 logical CPUs).
After you've solved that problem, you can also predict a cross-core FPU switch and have the old core store its state before the new process ever considers retrieving it (reducing the total waiting time).
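As a sketch of that last idea (names are invented and locking is ignored): when the scheduler decides to migrate a task whose FPU state is still live on its old core, it can poke that core immediately, so the state is usually back in memory before the task runs again:
Code: Select all
/* Sketch only: fields and helpers are invented for illustration. */
struct task { int fpu_live; int fpu_cpu; /* ... */ };

void send_fpu_flush_ipi(int cpu);         /* asks that core to fsave its live state */
void enqueue(struct task *t, int cpu);

void migrate_task(struct task *t, int new_cpu)
{
    if (t->fpu_live && t->fpu_cpu != new_cpu)
        send_fpu_flush_ipi(t->fpu_cpu);   /* old core saves t's FPU state right away */
    enqueue(t, new_cpu);                  /* the lazy-load path must still cope with
                                             the save not having finished yet        */
}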
Re: FPU state on multicore processors
As per a previous discussion about cores and threads, the probability that any task will switch cores is low (less than 10-20%), but it will happen. Therefore, an FPU-using task MIGHT switch cores, which means this possibility must be accounted for if the FPU state is not saved every time a thread has modified it.
There are other factors involved as well:
1. If FPU state might switch between cores, there is a need to make sure that the logic works under all possible conditions. If this is rare, testing the logic will become very hard. Untested FPU saving logic could become a nightmare.
2. It takes time to do checks for FPU state saving in the mainline scheduler, which is bad if these tests seldom succeed.
Re: FPU state on multicore processors
It seems to work now.
Task-switcher code:
Code: Select all
;
; AX is new thread to load
;
test fs:ps_flags,PS_FLAG_FPU
jz load_fpu_ok ; Thread did not use FPU in last time-slice
;
mov bx,fs:ps_math_thread
cmp ax,bx
je load_fpu_ok ; Same thread as previously
;
push ds
mov ds,bx ; DS = segment of the thread that owns the FPU state
mov bx,OFFSET p_math_control ; BX -> its FPU save area
clts
db 9Bh, 66h, 0DDh, 37h ; 32-bit fsave [bx]
pop ds
;
lock and fs:ps_flags,NOT PS_FLAG_FPU
;
mov eax,cr0
or al,8
mov cr0,eax ; set TS flag so next FPU operation faults.
load_fpu_ok:
The FPU exception looks like this:
Code: Select all
GetThread ; AX = current thread
mov ds,ax ; DS = current thread's segment
mov bx,OFFSET p_math_control ; BX -> its FPU save area
clts
db 9Bh, 66h, 0DDh, 27h ; 32-bit frstor [bx]
;
mov bx,core_data_sel
mov ds,bx
mov ds:ps_math_thread,ax ; remember which thread's state is in this core's FPU
lock or ds:ps_flags,PS_FLAG_FPU ; mark that this core now holds live FPU state
- gravaera
Re: FPU state on multicore processors
Yo:
Thanks for pointing this out, I hadn't thought of it myself. The solution that came to me is: assuming you have per-logical-CPU run queues, a thread will rarely be rescheduled to a new logical CPU; however, when a thread is to be rescheduled, the general thing is to take it off its CPU's run queue and place it into the "global" scheduler, which will select a new CPU for it.
So when you are taking a task off its current CPU's scheduler queues (migration), something similar to the pseudocode below should suffice. Appreciate any comments I get on that, thanks
Code: Select all
if (task is running on current cpu)
{
    if (cpu has task's FPU context) {
        save task FPU context;
    }
}
else
{
    Find out which CPU the task is running on;
    Send a message to that CPU to check if it has that task's context and save it if it does;
    // Synch and wait for IPI to finish.
}
migrate the task off its current CPU.
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
- Combuster
Re: FPU state on multicore processors
This would be the most straightforward approach as far as I'm concerned:
EDIT: removed platform-specific code.
Code: Select all
void restore_fpu_state()
{
    if (!task->fpu_live)
    {
        save_current_fpu_state();
        load_fpu_state(task->fpu_state);
    }
    else
    {
        if (current_cpu() == task->fpu_cpu)
        {
            // do nothing
        }
        else
        {
            send_remote_fpu_ipi(); // you can also do this when a thread with a live FPU state is migrated to a different cpu.
            while (task->fpu_live)
                waste_time();
            save_current_fpu_state();
            load_fpu_state(task->fpu_state);
        }
    }
    task->fpu_cpu = current_cpu();
    task->fpu_live = true;
}
Re: FPU state on multicore processors
Combuster wrote: This would be the most straightforward approach as far as I'm concerned
You must do "clts" before accessing the FPU, otherwise the code will risk exceptions.
Besides, the FPU save/restore is not really a procedure; it has to be implemented as an event handler (the exception 7 handler) to work effectively.
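Concretely, using the same names as in the previous post, the start of that routine would need to look something like this, and it should be reached from the exception 7 handler rather than being called directly:
Code: Select all
void restore_fpu_state()        /* entered via the #NM / exception 7 handler */
{
    clts();                     /* must come first: with CR0.TS still set,
                                   fsave/frstor would themselves fault       */
    /* ... the rest as in the previous post ... */
}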
Re: FPU state on multicore processors
Hi,
At which point does the overhead of doing the "delayed FPU state loading and saving" become so high that it's simply not worth doing it at all? I'd suggest that as soon as IPIs need to be broadcast to all other CPUs (and maybe even just sent to one other CPU), you've gone beyond that point (especially on systems with lots of CPUs). I'd also suggest that if you've minimised thread migration to reduce the chance of IPIs caused by "delayed FPU state loading and saving" (and you've crippled the scheduler's ability to schedule tasks on available CPUs effectively) then you've also gone too far.
With this in mind, the first thing I'd be doing (for multi-CPU) is saying "FPU state is saved during task switches whenever the previous task used it". That avoids all synchronisation, IPI overhead and task migration problems. It also means I'd only be considering "delayed FPU state loading" (and not saving).
When a "device not available" exception occurs you load the FPU state, and set a flag so the scheduler knows that the FPU state needs to be saved when there's a task switch.
The second step would be tracking how often each task uses the FPU. If you detect that a task uses the FPU most of the time, then you can avoid the overhead of a likely "device not available" exception by pre-loading the FPU state during the task switch.
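For example (the counters and the 75% threshold below are invented, just to show the shape of the heuristic):
Code: Select all
/* Sketch: decide at switch-in whether to pre-load or stay lazy. */
void fpu_on_switch_in(struct task *next)
{
    /* fpu_slices: timeslices in which the task touched the FPU;
       run_slices: timeslices it has run at all.                  */
    if (next->run_slices >= 16 &&
        next->fpu_slices * 4 >= next->run_slices * 3) {
        clts();                          /* FPU used in >=75% of its timeslices:   */
        frstor(next->fpu_state);         /* load now and skip the near-certain #NM */
        next->fpu_used = 1;
    } else {
        set_ts();                        /* stay lazy: load via #NM if needed      */
    }
}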
The next step would be "delayed FPU state initialisation". When a task is created, set a flag saying "FPU state not initialised", and if/when the task uses the FPU, initialise the FPU state in the "device not available" exception handler.
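Extending the #NM handler from the sketch above, delayed initialisation is just one extra branch (names invented as before):
Code: Select all
/* Sketch: the first FPU use of a task initialises instead of restoring. */
void fninit(void);                       /* wrapper around the FNINIT instruction */

void device_not_available(struct task *current)
{
    clts();
    if (!current->fpu_initialised) {
        fninit();                        /* fresh default FPU state; no memory
                                            image is needed until the first save */
        current->fpu_initialised = 1;
    } else {
        frstor(current->fpu_state);
    }
    current->fpu_used = 1;
}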
Of course all of the above applies to MMX (and SSE and AVX) too.
However, for SSE it may be possible to also use the "OSFXSR" flag in CR4 to detect if a task actually uses SSE; and avoid loading and saving SSE state for tasks that only use FPU/MMX and don't use SSE. When a task is created you'd set a flag saying "FPU state not initialised" and another flag saying "SSE state not initialised". When you get the first "device not available" exception you initialise FPU state, set the "FPU state initialised" flag and return (like before); and if/when you get an "invalid opcode" exception you check if it was an SSE instruction, initialise SSE state and set the "SSE state initialised" flag (and also initialise FPU state if it hasn't already been initialised). If SSE state has been initialised, when you switch to the task you'd set TS and OSFXSR (and any "device not available" exception would cause both FPU and SSE to be loaded, rather than just FPU).
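The switch-in side of that idea could look roughly like this (flag names and helpers are invented; saving/restoring the SSE state itself with fxsave/fxrstor is left out):
Code: Select all
/* Sketch: choose, per task, whether SSE traps with #UD or loads via #NM. */
void cr4_set_osfxsr(void), cr4_clear_osfxsr(void);   /* CR4.OSFXSR wrappers */

void simd_on_switch_in(struct task *next)
{
    set_ts();                    /* FPU (and SSE) state loads lazily via #NM      */
    if (next->sse_initialised)
        cr4_set_osfxsr();        /* SSE allowed; the #NM handler loads FPU + SSE  */
    else
        cr4_clear_osfxsr();      /* SSE instructions raise #UD; the #UD handler
                                    checks for an SSE opcode and, if it finds one,
                                    initialises SSE state and sets the flag       */
}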
After all that comes AVX, where things get messy (but the basic idea behind "avoid loading and saving SSE state" should work for AVX too, just with XGETBV, XSETBV and XCR0 instead of OSFXSR alone).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: FPU state on multicore processors
Brendan wrote: With this in mind, the first thing I'd be doing (for multi-CPU) is saying "FPU state is saved during task switches whenever the previous task used it". That avoids all synchronisation, IPI overhead and task migration problems. It also means I'd only be considering "delayed FPU state loading" (and not saving).
Exactly. Avoiding IPIs, especially to many CPUs, is essential both for performance and for ease of debugging and getting it to work under all conditions.
Brendan wrote: When a "device not available" exception occurs you load the FPU state, and set a flag so the scheduler knows that the FPU state needs to be saved when there's a task switch. The second step would be tracking how often each task uses the FPU. If you detect that a task uses the FPU most of the time, then you can avoid the overhead of a likely "device not available" exception by pre-loading the FPU state during the task switch.
This is my current logic, except that I also check whether the next scheduled thread is the same one that owns the FPU context.
Brendan wrote: The next step would be "delayed FPU state initialisation". When a task is created, set a flag saying "FPU state not initialised", and if/when the task uses the FPU, initialise the FPU state in the "device not available" exception handler.
I'd initialize the state in the thread control block at thread creation time instead. There is no need to execute the "finit" operation in the new task context anyway; basically, setting the tag register to 0xFFFF will do it.
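For reference, a sketch of that pre-initialised 32-bit FSAVE image in the thread control block (the struct and field names are mine; only the three non-zero values matter):
Code: Select all
#include <string.h>

/* 32-bit FSAVE/FRSTOR memory image: 7 dwords of environment plus 8 x 10-byte
   registers = 108 bytes. Only the low 16 bits of cw/sw/tw are used.           */
struct fsave_image {
    unsigned int  cw, sw, tw;
    unsigned int  fip, fcs;          /* last FPU instruction pointer */
    unsigned int  fdp, fds;          /* last FPU operand pointer     */
    unsigned char st[8][10];         /* ST0..ST7                     */
};

void init_fpu_image(struct fsave_image *img)
{
    memset(img, 0, sizeof *img);
    img->cw = 0x037F;                /* finit defaults: all exceptions masked,
                                        extended precision, round to nearest   */
    img->sw = 0x0000;
    img->tw = 0xFFFF;                /* all registers tagged empty             */
}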
Brendan wrote: Of course all of the above applies to MMX (and SSE and AVX) too.
Messy. I don't bother with MMX, SSE and AVX yet; they are not used by any tasks.
- tom9876543
Re: FPU state on multicore processors
db 9Bh, 66h, 0DDh, 27h ; 32-bit frstor [bx]
Please excuse my ignorance, but why was it necessary to manually encode the instruction? Is there a problem with the assembler?
- Combuster
Re: FPU state on multicore processors
He wrote his own tools, with the obvious consequences
Re: FPU state on multicore processors
tom9876543 wrote: Please excuse my ignorance, but why was it necessary to manually encode the instruction? Is there a problem with the assembler?
It is an artifact from the time when I used TASM to assemble. It couldn't correctly generate some 32-bit instructions, especially not in 16-bit segments. I think WASM can handle this, but I haven't tested it, as the code was written before I switched to WASM.
OTOH, I wonder how the operand-size (in a 16-bit segment) would be given to frstor? How would this be coded in NASM for instance?
- Combuster
Re: FPU state on multicore processors
You can override the operand and address size with o16/o32 and a16/a32 when the instruction does not imply any such prefix. You get constructs like o32 frstor [bx] and a32 rep stosw.
Re: FPU state on multicore processors
Combuster wrote: You can override the operand and address size with o16/o32 and a16/a32 when the instruction does not imply any such prefix. You get constructs like o32 frstor [bx] and a32 rep stosw.
Which more or less corresponds to "db 66h" and "db 67h" with TASM/WASM. The only problem is that the instruction coding starts with a "wait", and the "db 66h" should be after the wait. OTOH, I'm not sure if wait really is needed here.
Re: FPU state on multicore processors
Turns out the logic doesn't work after all. The reason is that the task can be scheduled on a new core before the old core loads a new thread, which creates really nasty problems, and it takes about a week before this bug triggers in the terminal setup. I changed the logic so that on multicore the FPU state is saved when the ordinary CPU registers are saved (if the FPU exception has occurred), and re-inserted the "good" logic on single-core so the FPU state is only saved when a new thread starts using the FPU.
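In other words (sketch only, with invented names), the multicore path now does roughly this whenever a thread's register context is saved:
Code: Select all
/* Sketch: on multicore, flush the FPU together with the ordinary registers, so
   a thread can never be picked up by another core while its FPU state is still
   sitting in the old core's FPU.                                               */
struct thread { unsigned char fpu_save_area[108]; int used_fpu; };

void save_cpu_regs(struct thread *t);    /* the ordinary register save */
void fsave(void *area);
void set_ts(void);

void save_thread_context(struct thread *prev)
{
    save_cpu_regs(prev);
    if (prev->used_fpu) {                /* the thread took exception 7 earlier    */
        fsave(prev->fpu_save_area);      /* state leaves with the thread, so no
                                            other core ever has to fetch it        */
        prev->used_fpu = 0;
        set_ts();                        /* back to lazy loading via exception 7   */
    }
}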
Started a new week-long test to see if this fixes this really nasty bug.