The processor is built to support lazy switching of coprocessor state. If you look at the FP and vector instructions (x87, MMX, SSE, AVX), you'll notice that essentially all of them raise an exception when CR0.TS is set. CR0.TS is set automatically by a hardware task switch, or you can set it manually by writing to CR0. You clear it with the CLTS instruction.
The routine is roughly as follows:
1) process A uses the FPU
2) process A gets scheduled out
3) the scheduler sets CR0.TS
4) process B runs
5) process B gets scheduled out
6) process C runs
7) process C uses the FPU
8) The coprocessor exception is raised (#NM, Device Not Available, vector 7)
9) The FPU state for process A is saved
10) The FPU state for process C is loaded
11) CLTS is executed to clear CR0.TS
12) process C runs, and can now use the FPU
Notice that there is virtually no FPU-related overhead for process B. And if process A were the only one using the FPU, there would be no floating point save/restore overhead at all.
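To make the sequence concrete, here is a small user-space model of it in C. It is only a sketch, not kernel code: the `struct cpu`, `struct task`, and all function names are made up for illustration, a single `double` stands in for the real FPU register file and save area, and the `ts` flag stands in for CR0.TS. Real code would run in ring 0, use FXSAVE/FXRSTOR or XSAVE/XRSTOR, and take the #NM exception through the IDT.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: all names below are invented for this sketch. */
struct task {
    const char *name;
    double fpu_state;        /* stand-in for the task's FPU save area */
};

struct cpu {
    bool ts;                 /* models CR0.TS */
    double fpu_regs;         /* models the live FPU registers */
    struct task *fpu_owner;  /* task whose state is currently in fpu_regs */
    struct task *current;    /* task currently running */
    int saves;               /* number of FPU state saves performed */
    int restores;            /* number of FPU state loads performed */
};

/* Scheduler side: switch tasks and set TS. The FPU itself is not touched. */
static void context_switch(struct cpu *c, struct task *next)
{
    c->current = next;
    c->ts = true;
}

/* Models the #NM handler: swap FPU state only if the owner changed. */
static void device_not_available(struct cpu *c)
{
    if (c->fpu_owner != c->current) {
        if (c->fpu_owner != NULL) {
            c->fpu_owner->fpu_state = c->fpu_regs;  /* save old owner */
            c->saves++;
        }
        c->fpu_regs = c->current->fpu_state;        /* load new owner */
        c->restores++;
        c->fpu_owner = c->current;
    }
    c->ts = false;                                  /* models CLTS */
}

/* Models any FP instruction: it traps first if TS is set. */
static void fpu_add(struct cpu *c, double x)
{
    if (c->ts)
        device_not_available(c);
    c->fpu_regs += x;
}
```

Driving this with A using the FPU, then switching to B, then to C, and only then calling `fpu_add` for C, produces exactly one save (A's state), and B never causes any FPU work. If A is switched out and back in with nobody else touching the FPU, the handler finds that A still owns the registers and only clears the flag.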