Why don't we use the TSS for Task Switching?

Primis · Post by **Primis** » Sun Jul 06, 2014 11:13 am

I was reading up on the TSS, and I'm confused as to why I would want to do software task switching as opposed to using TSS entries in the GDT. Is it slower? Is it missing something? I've searched the wiki and the forums, and I can't seem to find anything substantial on the topic.

Brendan · Post by **Brendan** » Sun Jul 06, 2014 11:33 am

Hi,

Primis wrote:I was reading up on the TSS, and I'm confused as to why I would want to do software task switching as opposed to using TSS entries in the GDT. Is it slower? Is it missing something? I've searched the wiki and the forums, and I can't seem to find anything substantial on the topic.

Normally, something happens (IRQ or kernel API call) that causes the CPU to switch to kernel code; then that kernel code switches from one task to another. Task switches themselves typically only switch from one task that's running kernel code to another task that's running kernel code.

When switching from one task that's running kernel code to another task that's running kernel code; part of the CPU's state is already saved somewhere (e.g. on the stack) and part of it is constant (e.g. the kernel's segment registers). In this case the amount of state that actually needs to be saved and restored is "almost none" (e.g. ESP and a few general purpose registers).

The hardware task switching mechanism saves and loads everything, even though most of that saving and loading is unnecessary. This makes it slower than necessary by default. Worse, segment register loads involve expensive lookups and protection checks. Because the segment register loads are unnecessary and hardware task switching doesn't avoid them, this makes hardware task switching a lot slower for no reason.
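To make "saves and loads everything" concrete, here is a minimal sketch of the 32-bit TSS layout next to what a kernel-to-kernel software switch might keep per task. The field names and the `sw_context` struct are illustrative assumptions (a real kernel may keep more, and typically pushes callee-saved registers on the task's own kernel stack before recording ESP); only the 104-byte layout itself follows the architecture manuals.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* 32-bit TSS. Selector slots are architecturally 16 bits wide; they are
 * declared as 32-bit fields here so the reserved upper halves are covered
 * without needing packed attributes. Field names are illustrative. */
struct tss32 {
    uint32_t prev_task_link;
    uint32_t esp0, ss0;            /* ring 0 stack */
    uint32_t esp1, ss1;            /* ring 1 stack */
    uint32_t esp2, ss2;            /* ring 2 stack */
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t debug_trap;           /* bit 0 is the T (debug trap) flag */
    uint16_t iomap_base;
};

static_assert(sizeof(struct tss32) == 104, "32-bit TSS is 104 bytes");
static_assert(offsetof(struct tss32, cr3) == 28, "CR3 slot at offset 28");

/* Hypothetical per-task record for a software switch between two tasks
 * that are both already in kernel code: the remaining registers are
 * assumed to be on the task's kernel stack, so only ESP is stored. */
struct sw_context {
    uint32_t esp;
};
```

The size difference is only a rough proxy for cost; the segment register reloads mentioned above are what the software path avoids entirely.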

Next; there are things that hardware task switching does not do. For example, most OSs keep track of how much time each task has consumed, and hardware task switching does not save or restore FPU/MMX/SSE/AVX state. This means that hardware task switching alone is not enough.

Finally; hardware task switching only works for 32-bit 80x86. It's not supported on 64-bit 80x86 or other CPUs (ARM, PowerPC, ...).

Basically; it's slow, inadequate and not portable. There's no valid reason to use it (excluding special purposes, like possibly the double fault exception handler).

Cheers,

Brendan

Primis · Post by **Primis** » Sun Jul 06, 2014 12:22 pm

Brendan wrote: Basically; it's slow, inadequate and not portable. There's no valid reason to use it (excluding special purposes, like possibly the double fault exception handler).

What purpose would it serve in the double fault exception handler? On a side note, can you assign a specific TSS to a specific interrupt such as the double fault?

Nable · Post by **Nable** » Sun Jul 06, 2014 12:25 pm

Primis wrote:On a side note, can you assign a specific TSS to a specific interrupt such as the double fault?

http://wiki.osdev.org/IDT#I386_Task_Gate

Brendan · Post by **Brendan** » Sun Jul 06, 2014 12:46 pm

Hi,

Primis wrote:
Brendan wrote: Basically; it's slow, inadequate and not portable. There's no valid reason to use it (excluding special purposes, like possibly the double fault exception handler).
What purpose would it serve in the double fault exception handler?

A double fault occurs when the CPU fails to start one of the other exception handlers; which typically happens when the kernel is buggy - e.g. either the kernel's stack isn't valid, or the current virtual address space doesn't contain the exception handlers.

If you don't use a task gate for the double fault exception handler then the problem that caused the double fault still exists when the CPU tries to start the double fault exception handler, so the CPU can't start the double fault exception handler and ends up doing "triple fault" (reset the computer). Using a task gate for the double fault exception handler forces the CPU to switch to a "known good kernel state" (e.g. with a different kernel stack, different virtual address space, etc).
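As a rough illustration of what "using a task gate" means at the descriptor level, the sketch below encodes a 32-bit IDT task gate as a raw 64-bit value. The function name and the example selector are assumptions made for illustration; the bit positions (TSS selector in bits 16..31, access byte in bits 40..47 with type 0101b) follow the standard task gate format.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a 32-bit IDT task gate as a raw 64-bit descriptor.
 * Bits 16..31 hold the TSS selector; bits 40..47 hold the access byte:
 * P (bit 7), DPL (bits 5..6), and type 0101b (task gate).
 * The offset fields of a task gate are unused and left zero. */
uint64_t make_task_gate(uint16_t tss_selector, unsigned dpl)
{
    assert(dpl <= 3);
    uint64_t access = 0x80u | (dpl << 5) | 0x05u;
    return ((uint64_t)tss_selector << 16) | (access << 40);
}
```

With a hypothetical double fault TSS at GDT selector 0x28, a kernel could store `make_task_gate(0x28, 0)` in IDT slot 8 (the double fault vector), so that the CPU loads that TSS's stack and CR3 instead of pushing onto the possibly broken current stack.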

Of course it's hard to say if using hardware task switching for the double fault handler is justified or not. It's possibly easier to make sure that the other exception handlers aren't buggy (to ensure double fault doesn't happen in the first place).

Note: In long mode (where there is no hardware task switching), there's special support ("IST") for forcing a stack switch for cases where the kernel's stack may be invalid. It serves mostly the same purpose (but does less, with a lot less overhead).
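For comparison, a long mode IDT entry carries a 3-bit IST index that selects one of seven known-good stack pointers held in the 64-bit TSS. The sketch below shows one way to fill such an entry; the struct and function names are hypothetical, while the field layout and the 0x8E type byte (present, DPL 0, 64-bit interrupt gate) follow the architectural format.

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte long mode IDT gate descriptor. All fields are naturally
 * aligned, so no packing attribute is needed. */
struct idt_entry64 {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* kernel code segment selector */
    uint8_t  ist;          /* bits 0..2: IST index, 0 = no forced switch */
    uint8_t  type_attr;    /* 0x8E = present, DPL 0, interrupt gate */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
};

static_assert(sizeof(struct idt_entry64) == 16, "long mode gates are 16 bytes");

/* Build an interrupt gate that forces a switch to stack IST[ist_index]. */
struct idt_entry64 make_interrupt_gate64(uint64_t handler,
                                         uint16_t code_selector,
                                         uint8_t ist_index)
{
    assert(ist_index <= 7);
    struct idt_entry64 e = {0};
    e.offset_low  = (uint16_t)(handler & 0xFFFFu);
    e.selector    = code_selector;
    e.ist         = ist_index;
    e.type_attr   = 0x8E;
    e.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFFu);
    e.offset_high = (uint32_t)(handler >> 32);
    return e;
}
```

Unlike a task gate, this only replaces the stack pointer; the address space and all other registers stay as they were, which is why it is so much cheaper.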

Cheers,

Brendan

Czernobyl · Post by **Czernobyl** » Mon Jul 28, 2014 4:49 pm

Actually, the idea that X86-style hardware task switching (using task gates) is too slow to consider is a myth. There used to be some basis to the myth _way back_, when CPUs (80286, 80386) did not contain cache memory and motherboards did not have cache on them either, for economic reasons. Intel never intended X86 protected mode to be usable without at least some amount of (external) cache, and it was expecting to make good money from the sale of static RAM... but IBM and the PC compatible market decided otherwise.

Since processors started to have large caches on die, speed (lack of) has ceased to be a valid reason not to base an OS on native X86 task switches. The real reasons were designers' sloth, and/or a desire not to rely on methods supported only on X86 in order to remain portable. The same kind of reasons played against making use of X86 segmentation in 32 bit code.

As others said, AMD64 has finally practically removed HW tasking and segmentation. Some may find this unfortunate. Anyway, for a pet 32-bit OS, nothing prevents you (OP) from experimenting with a design that includes full featured task gates, call gates and up to 4 privilege levels (rings). X86 style floating point is not a problem, contrary to what someone wrote above, as a hardware task switch sets CR0.TS, so a processor exception (#NM) is raised when trying to execute the first FP instruction after a task switch...
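The exception-on-first-use behaviour is what makes lazy FPU switching possible. The following is a small user-space model of that logic, not kernel code: `cr0_ts` stands in for the CR0.TS flag, `fpu_regs` for the live FPU registers, and every name in it is hypothetical. It assumes the kernel's #NM handler clears TS, saves the previous owner's state and restores the current task's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
    int fpu_state;              /* stand-in for a saved FPU image */
};

struct cpu_model {
    bool cr0_ts;                /* models CR0.TS */
    int fpu_regs;               /* models the live FPU registers */
    struct task *current;
    struct task *fpu_owner;     /* task whose state is in fpu_regs */
    int nm_faults;              /* number of #NM exceptions taken */
};

/* Models the part of a task switch that matters here: TS gets set. */
void model_task_switch(struct cpu_model *cpu, struct task *next)
{
    cpu->current = next;
    cpu->cr0_ts = true;
}

/* Models the #NM (device not available) handler. */
void model_device_not_available(struct cpu_model *cpu)
{
    cpu->nm_faults++;
    cpu->cr0_ts = false;                               /* like CLTS */
    if (cpu->fpu_owner == cpu->current)
        return;                                        /* state still live */
    if (cpu->fpu_owner != NULL)
        cpu->fpu_owner->fpu_state = cpu->fpu_regs;     /* like FSAVE */
    cpu->fpu_regs = cpu->current->fpu_state;           /* like FRSTOR */
    cpu->fpu_owner = cpu->current;
}

/* Any FP instruction faults first if TS is set. */
void model_fpu_store(struct cpu_model *cpu, int value)
{
    if (cpu->cr0_ts)
        model_device_not_available(cpu);
    cpu->fpu_regs = value;
}

int model_fpu_load(struct cpu_model *cpu)
{
    if (cpu->cr0_ts)
        model_device_not_available(cpu);
    return cpu->fpu_regs;
}
```

The model starts with `cr0_ts` set and no owner, so the first FP use by any task goes through the handler. A task that never touches the FPU costs no save or restore at all, which is the point of the lazy scheme.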

HTH

Brendan · Post by **Brendan** » Mon Jul 28, 2014 4:52 pm

Hi,

Czernobyl wrote:Since processors started to have large caches on die, speed (lack of) has ceased to be a valid reason not to base an OS on native X86 task switches. The real reasons were designers' sloth, and/or a desire not to rely on methods supported only on X86 in order to remain portable. The same kind of reasons played against making use of X86 segmentation in 32 bit code.

Have you got one of those mythical CPUs with "infinitely fast" caches then? I hear they're expensive...

Cheers,

Brendan

Czernobyl · Post by **Czernobyl** » Mon Jul 28, 2014 5:32 pm

Czernobyl wrote:Since processors started to have large caches on die, speed (lack of) has ceased to be a valid reason not to base an OS on native X86 task switches.

Brendan wrote:Have you got one of those mythical CPUs with "infinitely fast" caches then? I hear they're expensive...

No need for infinitely fast. The point is that any frequently used TSS (a relatively small structure) will most likely "live" in on-die cache when switching tasks, making memory access irrelevant. Memory access was the dominant cost of task switching on a 286 or 386... BTDT

Anyhow, I'm not preaching a religion here, just saying that anyone interested in designing a _non-conventional_ kernel is welcome to learn how full segmentation and native task switching are really supposed to work, and experiment... Fun to be had, guaranteed.

Combuster · Post by **Combuster** » Tue Jul 29, 2014 12:36 am

Invalid argument, proof wanted. I don't see how only memory access times would have made hardware task switching slower on 386 when the equivalent software implementation on the same processor would demand at least the same amount of memory accesses (actually more for the additional code involved).

linguofreak · Post by **linguofreak** » Tue Jul 29, 2014 1:16 am

Combuster wrote:Invalid argument, proof wanted. I don't see how only memory access times would have made hardware task switching slower on 386 when the equivalent software implementation on the same processor would demand at least the same amount of memory accesses (actually more for the additional code involved).

Yeah. If neither hardware nor software task switching suffers a cache miss, the ratio of the time taken to do software task switching to the time taken to do hardware switching should be fairly close (if not identical) to the ratio with no cache at all.

bluemoon · Post by **bluemoon** » Tue Jul 29, 2014 3:20 am

linguofreak wrote:
Combuster wrote:Invalid argument, proof wanted. I don't see how only memory access times would have made hardware task switching slower on 386 when the equivalent software implementation on the same processor would demand at least the same amount of memory accesses (actually more for the additional code involved).
Yeah. If neither hardware nor software task switching suffers a cache miss, the ratio of the time taken to do software task switching to the time taken to do hardware switching should be fairly close (if not identical) to the ratio with no cache at all.

How about the hundreds of protection checks? At least on the 386, the descriptor cache is flushed on every hardware task switch, and those checks are far slower than accessing cache or even memory.

Combuster · Post by **Combuster** » Tue Jul 29, 2014 3:45 am

bluemoon wrote:How about the hundreds of protection checks?

Combuster wrote:the equivalent software implementation

Here's a (rhetorical) pop quiz:

1) Which checks would be avoided in either case?
2) What does that argument have to do with memory and the presence/absence of caches (besides the fact that the 386 already has various function-specific caches: the descriptor cache, TLB, prefetch queue, ...)?

It is certainly true that in some cases code can be written to avoid DS/ES loads when the OS doesn't require them, giving software switching a significant work advantage in those cases, but that doesn't hold for all models. The thing is that hardware switching was slower and less useful in general, and because everybody kept doing software switching afterwards, hardware switching has been gravely neglected and has only become progressively worse in comparison. None of this has anything to do with memory and improvements thereof, but rather with CPU silicon.

Czernobyl · Post by **Czernobyl** » Tue Jul 29, 2014 3:55 am

It goes without saying that, on the same hardware, HW task switching would've been an order of magnitude slower if it were emulated in software instructions. However, if you chose to do software task switching, you did not want to reproduce the behavior of HW TS accurately, so you could save some time by saving/restoring exactly what was needed. What that bought you was some flexibility.

Whatever... my point is that the argument of slowness (is this a word?) given against hardware task switches is not quite valid by itself. What's at stake is (in)flexibility: the choice of doing HW tasking constrains your design. In exchange you get nice co-routine like mechanics for free (almost).

That would be my answer to the OP: don't be deterred from studying TS, and form your own opinion.

A last thought concerning efficiency again: there's been sort of a chicken and egg problem there. Had mainstream OSes adopted HW task switches more enthusiastically, Intel engineers would undoubtedly have added dedicated caching hardware for TSSes to their CPU designs, like there are dedicated caches for segmenting, paging (TLBs), etc. However, I think even OS/2 1.0 did not do much task switching, did it?

Combuster · Post by **Combuster** » Tue Jul 29, 2014 5:00 am

Czernobyl wrote:It goes without saying that, on the same hardware, HW task switching would've been an order of magnitude slower if it were emulated in software instructions.

Ahem. Let's take the 486:

Software: int@71 + iret@31, pusha@9 + popa@9, 5 data segments = 5 * (9(load) + 3(save)), save cr3@4 + load cr3@4
= 188 cycles
Hardware task switch: 309 cycles.
That gives the software emulation 121 clock cycles to do other bits of administration.

Myth: busted.
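The sum above can be checked mechanically. This sketch only re-adds the per-instruction 486 cycle figures quoted in this post; the constant and function names are made up for illustration, and nothing here measures real hardware.

```c
#include <assert.h>

/* Nominal 486 cycle counts exactly as quoted in the post above. */
enum {
    CYC_INT = 71,
    CYC_IRET = 31,
    CYC_PUSHA = 9,
    CYC_POPA = 9,
    CYC_SEG_LOAD = 9,
    CYC_SEG_SAVE = 3,
    NUM_DATA_SEGMENTS = 5,
    CYC_CR3_SAVE = 4,
    CYC_CR3_LOAD = 4,
    CYC_HW_TASK_SWITCH = 309
};

/* Total for the software path: entry/exit, general registers,
 * five data segment save/load pairs, and the CR3 swap. */
int software_switch_cycles(void)
{
    return CYC_INT + CYC_IRET
         + CYC_PUSHA + CYC_POPA
         + NUM_DATA_SEGMENTS * (CYC_SEG_LOAD + CYC_SEG_SAVE)
         + CYC_CR3_SAVE + CYC_CR3_LOAD;
}

/* Cycles left over before the software path costs as much as the
 * hardware task switch. */
int software_headroom_cycles(void)
{
    return CYC_HW_TASK_SWITCH - software_switch_cycles();
}
```

With the quoted figures, `software_switch_cycles()` comes to 188 and `software_headroom_cycles()` to 121, matching the totals in the post.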

Czernobyl · Post by **Czernobyl** » Tue Jul 29, 2014 5:11 am

Combuster wrote: Let's take the 486:

Software: int@71 + iret@31, pusha@9 + popa@9, 5 data segments = 5 * (9(load) + 3(save)), save cr3@4 + load cr3@4
= 188 cycles
Hardware task switch: 309 cycles.
That gives the software emulation 121 clock cycles to do other bits of administration.

Switching TSSs has much more work to do than your software (non-)equivalent.

And those cycle counts mean nothing in practice: they do NOT take into account possible waiting for memory accesses, which may dominate. Those waits depend on the relative speed and physical organisation of main memory, motherboard cache (if any), chipset, wait states and so on. And on the processor's side, on the pipeline, caching again, and so on.

On anything more modern than an 8086/8088 (if even that), the only practical way to do meaningful code timing is to do... precise measurements. The manual's cycle counts are bogus (rather, they are no use for actual timing as soon as memory and/or external device access is involved). This is why Intel finally stopped publishing those counts as annexes to their processor manuals, starting with the Pentium (TM).

What was your point, anyway?
