OSDev.org

Posted: **Sun Apr 11, 2021 3:06 am**

Hello,

I wrote a preemptive multitasking kernel that fully uses Intel (i386) hardware tasks.
Exceptions and hardware interrupts are handled through task gates. Consequently when an exception or a hardware interrupt is triggered, the CPU automatically switches to the corresponding task (handler).

The scheduler is a task as well. It is triggered by the hardware timer (PIT) at a frequency of 100Hz.
My kernel is very simple: the number of tasks is fixed and the scheduler simply switches to the next task to be executed in a round-robin fashion.
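
For reference, a 100Hz tick is normally obtained by dividing the PIT's nominal 1.193182 MHz input clock. The post does not show how the PIT is programmed, so the helper name `pit_divisor` below is hypothetical; this is only a sketch of the arithmetic:

```c
#include <stdint.h>

/* Nominal PIT input clock in Hz. */
#define PIT_BASE_HZ 1193182u

/* Hypothetical helper: 16-bit reload value for the requested tick rate,
 * rounded to nearest. Only meaningful for rates from 19 Hz up to
 * PIT_BASE_HZ, since lower rates would need more than 16 bits. */
static uint16_t pit_divisor(uint32_t hz)
{
    return (uint16_t)((PIT_BASE_HZ + hz / 2) / hz);
}
```

For 100Hz this gives a reload value of 11932, which would be written to channel 0 low byte first, then high byte.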

To execute the next task, here is what I do:
1) Update the EFLAGS register to set the NT (Nested Task) bit to 1: this tells the CPU we're in a nested task
2) Set the busy flag in the TSS descriptor of the next task we're about to switch to
3) Make the next task to be the current task's parent, by updating the current task's TSS previous task link with the TSS selector of the next task. Note that the current task is the scheduler itself.
4) Set the scheduler task to be the parent of the next task to schedule so that executing iret in the next task will get back to the scheduler
5) Execute the iret instruction to switch to the next task
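
The data updates in steps 1 to 4 can be sketched as follows. This is not the poster's code: the struct and function names are made up for illustration, only the first TSS field is modeled, and the final `iret` (step 5) has to be issued from ring 0 assembly, so it is left out.

```c
#include <stdint.h>

#define EFLAGS_NT      (1u << 14)  /* Nested Task flag in EFLAGS */
#define TSS_DESC_BUSY  0x02u       /* busy bit in the descriptor's access byte */

/* Only the first TSS field matters here: the previous task link. */
struct tss_head {
    uint16_t prev_task_link;
    uint16_t reserved;
};

/* Step 1: EFLAGS value with NT set (loaded with pushf/popf in a real kernel). */
static uint32_t with_nt(uint32_t eflags)
{
    return eflags | EFLAGS_NT;
}

/* Step 2: `gdt_entry` points at the 8-byte TSS descriptor of the next task.
 * Type 0x9 (available 32-bit TSS) becomes 0xB (busy 32-bit TSS). */
static void mark_tss_busy(uint8_t gdt_entry[8])
{
    gdt_entry[5] |= TSS_DESC_BUSY;
}

/* Steps 3 and 4: store a TSS selector in a task's previous task link. */
static void set_back_link(struct tss_head *tss, uint16_t selector)
{
    tss->prev_task_link = selector;
}
```

With NT set, `iret` reads the previous task link of the current TSS and switches to that task, which is why the link has to name the next task before step 5.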

This works well, each task gets scheduled properly in a round-robin fashion as expected.
However, this implementation suffers from two issues:
- Executing iret in a task to go back to the scheduler triggers a general protection fault (GPF)
- Once in the exception handler which is also a task, I'm not able to go back to the scheduler more than once without triggering a double fault exception.

In the exception handler, I would like to "kill" the offending task that triggered the exception (such as GPF). The idea is to mark this task as non-schedulable anymore and switch to the scheduler.
These are the steps I wrote to do this:
1) Mark the task as not schedulable anymore (it's a boolean I keep in the meta-data of each task); the scheduler will simply not schedule this task anymore.
2) Update the EFLAGS register to set the NT (Nested Task) bit to 1: this tells the CPU we're in a nested task
3) Set the busy flag in the TSS descriptor of the scheduler task we're about to switch to
4) Make the scheduler task to be the current task's parent (the exception handler), by updating the current task's TSS previous task link with the TSS selector of the scheduler task. Note that the current task is the exception handler.
5) Execute the iret instruction to switch to the scheduler task
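
Step 1 combined with the round-robin scheduler described earlier amounts to skipping dead entries in a fixed table. A minimal sketch, assuming a hypothetical `struct task` with a `schedulable` flag and a table size `MAX_TASKS` that do not come from the post:

```c
#include <stdbool.h>

#define MAX_TASKS 4   /* hypothetical fixed table size */

struct task {
    bool schedulable;   /* cleared when the task is killed or has exited */
};

/* Return the index of the next schedulable task after `current`,
 * scanning circularly, or -1 if no task can run.
 * `current` must be in the range [0, MAX_TASKS). */
static int next_task(const struct task tasks[MAX_TASKS], int current)
{
    for (int step = 1; step <= MAX_TASKS; step++) {
        int candidate = (current + step) % MAX_TASKS;
        if (tasks[candidate].schedulable)
            return candidate;
    }
    return -1;
}
```

The scan includes `current` itself as the last candidate, so a lone runnable task keeps being rescheduled.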

This works once: the CPU switches to the scheduler, which then schedules the next task.
All is well until a new exception is triggered (for instance by a task executing a privileged instruction). When another task triggers an exception and the code above is executed again, the CPU generates a double fault!

Does anyone have an idea what I do wrong above?
- Why does executing iret in a task to go back to the scheduler trigger a general protection fault?
- Why doesn't the way I "kill" a task seem to work?

From my understanding of how the CPU works, what I do should work, but clearly there is something I don't understand.

Thanks a lot for your help

Posted: **Sun Apr 11, 2021 6:16 am**

thxbb12 wrote:
This works well, each task gets scheduled properly in a round-robin fashion as expected.
However, this implementation suffers from two issues:
- Executing iret in a task to go back to the scheduler triggers a general protection fault (GPF)

If your scheduler is a separate task, why do you iret to get to the scheduler? Shouldn't you call into the scheduler via a TASK gate? That will set NT and fill in the previous task link field of the scheduler's TSS, which your scheduler will overwrite to select the next task.

thxbb12 wrote:
- Once in the exception handler which is also a task, I'm not able to go back to the scheduler more than once without triggering a double fault exception.

Is the scheduler already locked? If your exception is invoked while the scheduler is in use, you can't then switch back to the scheduler except by returning to it using the nested task mechanism. So if your scheduler is already in the back link chain, and you try to switch to the scheduler again via a task gate, its TSS will already be marked as in use and trigger your exception, which could well result in a double fault if the resulting GPF also cannot be handled.
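
The busy-flag rules behind this can be written down as a small state model. It is a simplification of the task-switch rules in the Intel manual rather than anything from the poster's kernel, and every name in it is made up for illustration:

```c
#include <stdbool.h>

enum switch_kind { SWITCH_JMP, SWITCH_CALL_OR_GATE, SWITCH_IRET };

struct task_model {
    bool busy;   /* busy bit of the task's TSS descriptor */
    bool nt;     /* NT flag in the task's EFLAGS image */
};

/* Apply one hardware task switch from `from` to `to`.
 * Returns false when the CPU would raise a fault instead of switching. */
static bool hw_task_switch(struct task_model *from, struct task_model *to,
                           enum switch_kind kind)
{
    if (kind == SWITCH_IRET) {
        if (!to->busy)
            return false;        /* iret needs a busy target */
        from->busy = false;      /* the returning task becomes available */
        from->nt = false;        /* NT is cleared in its saved EFLAGS */
        return true;
    }
    if (to->busy)
        return false;            /* jmp, call and gates need an available target */
    to->busy = true;
    if (kind == SWITCH_JMP)
        from->busy = false;      /* jmp does not nest */
    else
        to->nt = true;           /* call or task gate: nested, back link set */
    return true;
}
```

Under this model, once the scheduler has used `iret` to enter a task, the scheduler's own descriptor is no longer busy, so a later `iret` back to it faults unless the busy bit is set again by hand. Whether that is what happened in the kernel above cannot be told from the post.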

thxbb12 wrote:
Does anyone have an idea what I do wrong above?
- Why does executing iret in a task to go back to the scheduler trigger a general protection fault?
- Why doesn't the way I "kill" a task seem to work?

From my understanding of how the CPU works, what I do should work, but clearly there is something I don't understand.

Thanks a lot for your help

There is a reason most people [citation required] avoid hardware task switching on the x86 - it's just a nasty mess.

Unless you're particularly interested in the x86 hardware task mechanism itself, I'd recommend dumping it and switching to a software mechanism, for the following benefits:

- It is simpler.
- It will be easier to debug as a result.
- It will probably perform better as a result (less state to save).
- It is more portable conceptually to other architectures, if and when you move beyond x86 (that includes amd64, which doesn't have TSS based task switching in long mode.)
- It is more scalable, you're not limited in the number of tasks by hardware imposed GDT limits.

Using software task switching, you just need a single TSS per CPU, which is required because the TSS contains the ESP for the kernel stack to use when entering the kernel via an interrupt from user mode.
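
For orientation, here is the 32-bit TSS layout as a C struct, with the one field pair a software-switching kernel normally touches. The field names and the helper `set_kernel_stack` are illustrative choices, not taken from any kernel in this thread:

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit TSS: 104 bytes when no I/O permission bitmap follows it. */
struct tss32 {
    uint16_t prev_task_link, reserved0;
    uint32_t esp0;  uint16_t ss0, reserved1;   /* ring 0 stack */
    uint32_t esp1;  uint16_t ss1, reserved2;
    uint32_t esp2;  uint16_t ss2, reserved3;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, reserved4, cs, reserved5, ss, reserved6;
    uint16_t ds, reserved7, fs, reserved8, gs, reserved9;
    uint16_t ldt_selector, reserved10;
    uint16_t debug_trap, iomap_base;
};

_Static_assert(sizeof(struct tss32) == 104, "TSS must be 104 bytes");
_Static_assert(offsetof(struct tss32, esp0) == 4, "esp0 lives at offset 4");

/* The only per-switch update needed with software task switching. */
static void set_kernel_stack(struct tss32 *tss, uint32_t stack_top)
{
    tss->esp0 = stack_top;
}
```

Viewed as an array of 32-bit words, `esp0` is element 1, which is presumably what the `tss[1]` assignment in the code further down relies on.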

When you switch tasks in a software manner, all you have to do is:

- Switch callee saved registers as defined by your C ABI. On x86, that'd be ebx, edi, esi, ebp and esp. All other registers are scratch registers in the default x86 ABI.
- Put the new kernel stack into the TSS.esp0, for use when switching from user mode to kernel mode in the incoming thread.
- Switch to your new thread's page table by updating cr3.

That's it!

If your incoming thread has user level state, it'll be reachable via the incoming kernel stack, so user state other than the page table doesn't have to be explicitly switched.
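
The bookkeeping half of those three steps can be sketched in portable C. The register swap itself needs assembly (or setjmp/longjmp, as below) and is left out; `struct cpu` is a stand-in for real hardware state so that the sketch runs in user space, and all names here are assumptions:

```c
#include <stdint.h>

struct address_space { uint32_t page_dir_phys; };

struct thread {
    uint32_t kernel_stack_top;    /* value for TSS.esp0 */
    struct address_space *as;     /* NULL for kernel-only threads */
};

/* Stand-in for per-CPU hardware state. */
struct cpu {
    uint32_t tss_esp0;
    uint32_t cr3;
    struct thread *current;
};

static void switch_bookkeeping(struct cpu *cpu, struct thread *next)
{
    /* Kernel stack to use on the next entry from user mode. */
    cpu->tss_esp0 = next->kernel_stack_top;
    /* Reload cr3 only when the address space really changes; a
     * kernel-only thread keeps running on whatever is loaded. */
    if (next->as && next->as->page_dir_phys != cpu->cr3)
        cpu->cr3 = next->as->page_dir_phys;
    cpu->current = next;
}
```

Skipping the cr3 write when the address space is unchanged avoids a needless TLB flush.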

Contrast that to your current hardware task switching, where you read/write all of:

- General purpose registers - eax, ebx, ecx, edx, edi, esi, ebp, esp
- Segment registers - cs, ds, ss, es, fs, gs - with corresponding segment loads for the new values
- eip and eflags

In my kernel, my software task switch is simply a setjmp/longjmp, plus an update of TSS.esp0 and cr3, and even those latter two can be optionally deferred until returning to user mode (or user memory access is required from the kernel.)

My entire task switch is thus:

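void arch_thread_switch(thread_t * thread)
{
	thread_t * old = arch_get_thread();

	/* Switching to ourselves: nothing to save or restore. */
	if (old == thread) {
		thread->state = THREAD_RUNNING;
		return;
	}

	/* A thread that was still running stays eligible to run again. */
	if (old->state == THREAD_RUNNING) {
		old->state = THREAD_RUNNABLE;
	}

	/* Change address space only when crossing a process boundary. */
	if (old->process != thread->process) {
		if (thread->process) {
			vmap_set_asid(thread->process->as);
		} else {
			vmap_set_asid(0);
		}
	}

	/* setjmp returns 0 here. It returns a second time, non-zero, when a
	 * later switch longjmps back to old, which then leaves this function. */
	if (0 == setjmp(old->context.state)) {
		if (thread->state == THREAD_RUNNABLE) {
			thread->state = THREAD_RUNNING;
		}
		/* Kernel stack top for the next entry from user mode (TSS.esp0). */
		tss[1] = (uint32_t)thread->context.stack + ARCH_PAGE_SIZE;
		current = thread;
		longjmp(thread->context.state, 1);
	}
}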

Posted: **Sun Apr 11, 2021 9:27 am**

thewrongchristian wrote:
If your scheduler is a separate task, why do you iret to get to the scheduler? Shouldn't you call into the scheduler via a TASK gate? That will set NT and fill in the previous task link field of the scheduler's TSS, which your scheduler will overwrite to select the next task.

In theory I could do both but there must be a bug in my code somewhere. It's true that an explicit switch using jmp is probably simpler.
Update: I changed my code to do just that, and now ending a task through an exit syscall, which marks the task as "completed" before switching to the scheduler, works fine.

thewrongchristian wrote:
Is the scheduler already locked? If your exception is invoked while the scheduler is in use, you can't then switch back to the scheduler except by returning to it using the nested task mechanism. So if your scheduler is already in the back link chain, and you try to switch to the scheduler again via a task gate, its TSS will already be marked as in use and trigger your exception, which could well result in a double fault if the resulting GPF also cannot be handled.

I have now changed my code to not use iret anymore. I simply switch tasks with a far jmp to the target task's TSS selector.
If an exception is triggered, I terminate the offending task similarly to what I describe above (mark the task as "completed" and switch to the scheduler using jmp).

However, it doesn't always work properly.
For instance if I have a user task that triggers exception 0 (division by 0), everything works fine.

Now, if a user task executes an invalid operation (writing to a forbidden port or executing an "int" instruction), it triggers exception 13. In the corresponding exception handler, I "kill" the user task exactly as described in the div by 0 case above.
It's all good, however it only works once.
The second time the user task executes the illegal instruction which would trigger exception 13, it in fact triggers exception 6 (invalid opcode). Furthermore, the exception is not triggered by the user task itself but by the kernel (I still have to find out which task it is).
I don't see why the behavior would be different whether it's exception 0 or 13 that is raised...

About why I'm using hardware tasks, well I originally did it to understand how the hardware task switching works. To implement a super basic mono-task kernel with static address spaces (pedagogical project) it's very simple to implement.
Now I'm working on extending it to support multi-tasking. I still think the code is fairly simple, but I'll certainly explore a software implementation in the future.
Thanks for your useful info and code sample

Posted: **Sun Apr 11, 2021 2:30 pm**

You shouldn't let general exception handlers be tasks. Only double fault should be a task, and when you get that, you cannot go back anyway.

Posted: **Sun Apr 11, 2021 2:38 pm**

thewrongchristian wrote: Unless you're particularly interested in the x86 hardware task mechanism itself, I'd recommend dumping it and switching to a software mechanism, for the following benefits:

- It is simpler.
- It will be easier to debug as a result.
- It will probably perform better as a result (less state to save).
- It is more portable conceptually to other architectures, if and when you move beyond x86 (that includes amd64, which doesn't have TSS based task switching in long mode.)
- It is more scalable, you're not limited in the number of tasks by hardware imposed GDT limits.

Using software task switching, you just need a single TSS per CPU, which is required because the TSS contains the ESP for the kernel stack to use when entering the kernel via an interrupt from user mode.

When you switch tasks in a software manner, all you have to do is:

- Switch callee saved registers as defined by your C ABI. On x86, that'd be ebx, edi, esi, ebp and esp. All other registers are scratch registers in the default x86 ABI.
- Put the new kernel stack into the TSS.esp0, for use when switching from user mode to kernel mode in the incoming thread.
- Switch to your new thread's page table by updating cr3.

That's it!

I think that is incorrect. If you switch tasks with a timer it's not enough to just save the C ABI. You must save everything that can be changed. Also, most C ABIs will use FS or GS as TLS pointer, so you must save (or at least restore) that too.

Also, I want to be able to show register state when I single step stuff (which causes task-switches), and then all the registers, including segment registers, must be saved in the task's control block. Having them saved on the stack is quite awkward. So, in that way using hardware task switching would be better if it wasn't for the problem that you cannot do it in two steps (save state & reload state).

Posted: **Sun Apr 11, 2021 3:13 pm**

rdos wrote:You shouldn't let general exception handlers be tasks. Only double fault should be a task, and when you get that, you cannot go back anyway.

Why shouldn't general handlers be tasks?

Here is what the Intel Manual says about it (section 5.12.2 Interrupt Tasks):

When an exception or interrupt handler is accessed through a task gate in the IDT, a task switch results. Handling an exception or interrupt with a separate task offers several advantages:
• The entire context of the interrupted program or task is saved automatically.
• A new TSS permits the handler to use a new privilege level 0 stack when handling the exception or interrupt. If an exception or interrupt occurs when the current privilege level 0 stack is corrupted, accessing the handler through a task gate can prevent a system crash by providing the handler with a new privilege level 0 stack.
• The handler can be further isolated from other tasks by giving it a separate address space. This is done by giving it a separate LDT.
The disadvantage of handling an interrupt with a separate task is that the amount of machine state that must be saved on a task switch makes it slower than using an interrupt gate, resulting in increased interrupt latency.

Posted: **Sun Apr 11, 2021 5:19 pm**

rdos wrote:
thewrongchristian wrote: When you switch tasks in a software manner, all you have to do is:

- Switch callee saved registers as defined by your C ABI. On x86, that'd be ebx, edi, esi, ebp and esp. All other registers are scratch registers in the default x86 ABI.
- Put the new kernel stack into the TSS.esp0, for use when switching from user mode to kernel mode in the incoming thread.
- Switch to your new thread's page table by updating cr3.

That's it!
I think that is incorrect. If you switch tasks with a timer it's not enough to just save the C ABI. You must save everything that can be changed. Also, most C ABIs will use FS or GS as TLS pointer, so you must save (or at least restore) that too.

That is something I've been promising myself to look at. At the moment, I implement my own TLS API, which is key based, more like TlsSetValue/TlsGetValue or pthread_getspecific/pthread_setspecific. Statically allocated TLS using FS/GS is something I'm aware of, and would like to integrate at some point, but not something I need while I have my own TLS interface, even though it would be more optimal. But even then, it's one extra segment register to update, so still much less work than moving data to/from the TSS.

I can switch tasks in response to a timer, but task switching is something that is done just before returning from an interrupt to user space only, at which point I know the thread state is stable. Otherwise, all intra-kernel task switching is co-operative, so again under complete and deterministic control of the kernel and thread state is known and stable.

If you're talking about user state, then yes, I save all that (including segment registers) on the interrupt stack. I make the distinction between kernel task state (which my setjmp/longjmp solution is sufficient to use for task switching) and user state. User state includes the user address space, and it will also include the FP/MMX state as and when I get that far, but that state is not the current CPU state when the task switch actually occurs.

In fact, this dislocation between user state and kernel state comes in useful if I ever decide to not use a 1:1 user/kernel threading model. Having user state distinct allows the easy use of N:M threading, as user state is no longer tied to a specific kernel thread.

rdos wrote: Also, I want to be able to show register state when I single step stuff (which causes task-switches), and then all the registers, including segment registers, must be saved in the task's control block. Having them saved on the stack is quite awkward. So, in that way using hardware task switching would be better if it wasn't for the problem that you cannot do it in two steps (save state & reload state).

Not that awkward. My task control block already contains a pointer to my stack, and I know where on my stack that data is (it's at the top of the stack just underneath the interrupt frame, so it's easy to find). I can even have a pointer to it in the TCB if I so desire, and when I need it (my user debug facilities are not yet that advanced) it'll be easy to add.
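
As a rough illustration of finding saved user state from the TCB, here is a sketch that assumes one particular entry stub: the CPU frame for an interrupt from ring 3 without an error code, then `pusha`, then the four data segment registers. The actual layout in thewrongchristian's kernel is not shown in the thread, so `struct user_frame`, `struct tcb` and `tcb_user_frame` are all hypothetical:

```c
#include <stdint.h>

/* Hypothetical saved user state, lowest address first. */
struct user_frame {
    uint32_t gs, fs, es, ds;                    /* pushed by the entry stub */
    uint32_t edi, esi, ebp, esp_dummy;          /* pusha, low half */
    uint32_t ebx, edx, ecx, eax;                /* pusha, high half */
    uint32_t eip, cs, eflags, user_esp, user_ss; /* pushed by the CPU */
};

struct tcb {
    uint8_t *kernel_stack;        /* lowest address of the kernel stack */
    uint32_t kernel_stack_size;   /* in bytes, a multiple of 4 */
};

/* The frame sits at the very top of the kernel stack, because entry
 * from user mode always starts pushing at TSS.esp0. */
static struct user_frame *tcb_user_frame(const struct tcb *t)
{
    uint8_t *top = t->kernel_stack + t->kernel_stack_size;
    return (struct user_frame *)(top - sizeof(struct user_frame));
}
```

A debugger can then read or patch `eip`, `eflags` and the segment registers of a stopped thread through this pointer without any extra copy into the TCB.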

Posted: **Mon Apr 12, 2021 1:49 am**

thxbb12 wrote:
rdos wrote:You shouldn't let general exception handlers be tasks. Only double fault should be a task, and when you get that, you cannot go back anyway.
Why shouldn't general handlers be tasks?

Here is what the Intel Manual says about it (section 5.12.2 Interrupt Tasks):

When an exception or interrupt handler is accessed through a task gate in the IDT, a task switch results. Handling an exception or interrupt with a separate task offers several advantages:
• The entire context of the interrupted program or task is saved automatically.
• A new TSS permits the handler to use a new privilege level 0 stack when handling the exception or interrupt. If an exception or interrupt occurs when the current privilege level 0 stack is corrupted, accessing the handler through a task gate can prevent a system crash by providing the handler with a new privilege level 0 stack.
• The handler can be further isolated from other tasks by giving it a separate address space. This is done by giving it a separate LDT.
The disadvantage of handling an interrupt with a separate task is that the amount of machine state that must be saved on a task switch makes it slower than using an interrupt gate, resulting in increased interrupt latency.

Generally, exception handlers need to access the context of the current thread, and so are best implemented as trap gates rather than task gates. If the stack is corrupt, this will typically lead to a double fault, and so double fault should be handled with a task so it runs in a known valid context. Particularly the page fault handler must be a trap gate, while protection fault probably could be either. I use the protection fault handler to patch syscalls and so for me, it's better to have it as a trap gate since I often re-execute the faulting instruction. If protection fault is always a serious error, then it could just as well be handled with a task.
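
The gate types being compared differ only in a few bits of the IDT entry. A minimal encoder for 32-bit protected mode, with constant and function names chosen here for illustration:

```c
#include <stdint.h>

#define GATE_TASK       0x85u   /* present, DPL 0, task gate */
#define GATE_INTERRUPT  0x8Eu   /* present, DPL 0, 32-bit interrupt gate */
#define GATE_TRAP       0x8Fu   /* present, DPL 0, 32-bit trap gate */

/* Encode one 8-byte IDT entry (stored little-endian in memory).
 * For a task gate, `selector` is a TSS selector and the offset is
 * not used by the CPU, so pass 0. */
static uint64_t idt_gate(uint32_t offset, uint16_t selector, uint8_t type_attr)
{
    return  (uint64_t)(offset & 0xFFFFu)
          | (uint64_t)selector << 16
          | (uint64_t)type_attr << 40
          | (uint64_t)(offset >> 16) << 48;
}
```

For example, `idt_gate(handler_address, 0x08, GATE_TRAP)` would describe a handler that runs on the current task's stack, while `idt_gate(0, tss_selector, GATE_TASK)` forces a full task switch; `handler_address` and `tss_selector` are placeholders. An interrupt gate behaves like a trap gate except that it also clears IF on entry.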

Posted: **Mon Apr 12, 2021 1:58 am**

thewrongchristian wrote: I can switch tasks in response to a timer, but task switching is something that is done just before returning from an interrupt to user space only, at which point I know the thread state is stable. Otherwise, all intra-kernel task switching is co-operative, so again under complete and deterministic control of the kernel and thread state is known and stable.

If you're talking about user state, then yes, I save all that (including segment registers) on the interrupt stack. I make the distinction between kernel task state (which my setjmp/longjmp solution is sufficient to use for task switching) and user state. User state includes the user address space, and it will also include the FP/MMX state as and when I get that far, but that state is not the current CPU state when the task switch actually occurs.

In fact, this dislocation between user state and kernel state comes in useful if I ever decide to not use a 1:1 user/kernel threading model. Having user state distinct allows the easy use of N:M threading, as user state is no longer tied to a specific kernel thread.

With such a model it should work. In my model, threads can always be preempted, regardless if they are running in user mode or kernel mode. Some threads even run exclusively in kernel mode, but they are no different from threads started in user mode.

thewrongchristian wrote: Not that awkward. My task control block already contains a pointer to my stack, and I know where on my stack that data is (it's at the top of the stack just underneath the interrupt frame, so it's easy to find). I can even have a pointer to it in the TCB if I so desire, and when I need it (my user debug facilities are not yet that advanced) it'll be easy to add.

Certainly, but my user-level debugger can trace user-mode threads into kernel space at the source level. I also have a built-in kernel debugger where all threads, even those that run in the system process in the kernel, can be single-stepped. The kernel debugger runs in its own process so it has a known-valid environment.

Posted: **Tue Apr 13, 2021 12:19 am**

rdos wrote: Generally, exception handlers need to access the context of the current thread, and so are best implemented as trap gates rather than task gates. If the stack is corrupt, this will typically lead to a double fault, and so double fault should be handled with a task so it runs in a known valid context. Particularly the page fault handler must be a trap gate, while protection fault probably could be either. I use the protection fault handler to patch syscalls and so for me, it's better to have it as a trap gate since I often re-execute the faulting instruction. If protection fault is always a serious error, then it could just as well be handled with a task.

I still don't understand why I only encounter an issue when a GPF is raised, as other exceptions are handled just fine.
Furthermore, if I make sure eip is reset to point to the beginning of the task code before any GPF is triggered, everything works as it should.
Anyway, there is something I don't seem to understand with hardware task switching.

I think I'll switch to software task switching instead.

Thanks!

Posted: **Tue Apr 13, 2021 12:59 am**

thxbb12 wrote:
rdos wrote: Generally, exception handlers need to access the context of the current thread, and so are best implemented as trap gates rather than task gates. If the stack is corrupt, this will typically lead to a double fault, and so double fault should be handled with a task so it runs in a known valid context. Particularly the page fault handler must be a trap gate, while protection fault probably could be either. I use the protection fault handler to patch syscalls and so for me, it's better to have it as a trap gate since I often re-execute the faulting instruction. If protection fault is always a serious error, then it could just as well be handled with a task.
I still don't understand why I only encounter an issue when a GPF is raised, as other exceptions are handled just fine.
Furthermore, if I make sure eip is reset to point to the beginning of the task code before any GPF is triggered, everything works as it should.
Anyway, there is something I don't seem to understand with hardware task switching.

I think I'll switch to software task switching instead.

Thanks!

Good idea. You should use a task gate for double fault, but in that case, you never should try to re-execute the instruction as a double fault is unrecoverable. Instead, just dump the state of the faulting thread and hang. So, you don't need to understand the details of hardware task switching.

Still, I think your problem is that your GPF handler is not coded as a loop. When you switch task your current EIP is saved in the TSS, and the next time you get to the same task, EIP will point after the switch, so you need a jump back to the entry point. You should think of your GPF handler as a task that loops, waiting for a new exception at the point where you exited the last one.
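
The effect described here can be shown with a toy model in which a handler task is a list of steps plus a saved instruction index that plays the role of EIP in the TSS. Nothing below is real kernel code, and all names are invented:

```c
#include <stdbool.h>

enum step { HANDLE, IRET, JMP_ENTRY, END_OF_CODE };

struct handler_task {
    const enum step *code;
    int saved_eip;       /* index of the next step, like EIP in the TSS */
    int handled;         /* number of exceptions actually handled */
};

/* Enter the task through its gate: run from the saved eip until the
 * task leaves with IRET. Returns false if execution runs off the code. */
static bool enter_handler(struct handler_task *t)
{
    int eip = t->saved_eip;
    for (;;) {
        switch (t->code[eip]) {
        case HANDLE:      t->handled++; eip++; break;
        case IRET:        t->saved_eip = eip + 1; return true;
        case JMP_ENTRY:   eip = 0; break;
        case END_OF_CODE: return false;
        }
    }
}
```

A handler laid out as `HANDLE, IRET, JMP_ENTRY` can be entered any number of times, while `HANDLE, IRET` followed by nothing works exactly once and then executes whatever bytes come after the `iret`.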

Posted: **Wed Apr 14, 2021 7:16 am**

rdos wrote:
Good idea. You should use a task gate for double fault, but in that case, you never should try to re-execute the instruction as a double fault is unrecoverable. Instead, just dump the state of the faulting thread and hang. So, you don't need to understand the details of hardware task switching.

Still, I think your problem is that your GPF handler is not coded as a loop. When you switch task your current EIP is saved in the TSS, and the next time you get to the same task, EIP will point after the switch, so you need a jump back to the entry point. You should think of your GPF handler as a task that loops, waiting for a new exception at the point where you exited the last one.

My exception handlers are all infinite loops, since they are tasks, as you said.

There is no double fault that's triggered. I think I wasn't clear enough in my explanation.
Here is what happens:

I run a task that purposely triggers a GPF (in order to validate the proper termination of the task once it faults). My GPF handler terminates the offending task by marking it as "exited" and switches to the scheduler, which no longer schedules that specific task (in other words, it "kills" it).
This works, and the scheduler schedules the next task, as it should. Then, when I run the same task again, which should trigger a GPF, it triggers an "invalid opcode" exception (6).
I don't understand why it's the case.

I have attached a screenshot of my OS to illustrate the issue:
1) I run "gpf_io.exe" -> this task purposely generates a GPF (by writing to a forbidden I/O port).
2) The kernel successfully kills the task as expected.
3) I run "hello.exe" which displays a message and terminates.
4) I run "gpf_io.exe" again -> this time around it triggers exception 6 (invalid opcode)

What I noticed is that if between steps 2) and 4) I reset tss->eip of the gpf_io.exe task, then everything is fine. It appears that after 1), eip for that given task is not reset although I make sure that when I load the task again in 4), eip is initialized correctly.

Note that I don't do any dynamic memory allocation. Everything is allocated statically and my tasks reside in structures allocated statically at compile time. All I do is use these "slots" to load my tasks' code + data into. When a task finishes, I mark the given slot as free again so the next task can be loaded into that slot.
It's all very primitive on purpose.

Posted: **Wed Apr 14, 2021 10:09 am**

thxbb12 wrote: it triggers an "invalid opcode" exception (6)

Just as a suggestion, you might want to have that exception handler print out the value of EIP that the CPU placed in the exception handler stack frame. It may not automatically point to the exact problem, but if the address is neither the range used by the user process's code, nor the range used by the kernel's code, it might suggest some kind of stack corruption.
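
A check along those lines could look like the sketch below. The address ranges are placeholders for whatever the kernel's own memory map says; none of them come from the thread, and the type and function names are made up:

```c
#include <stdint.h>

struct range { uint32_t start, end; };   /* half-open: [start, end) */

enum eip_origin { EIP_USER, EIP_KERNEL, EIP_UNKNOWN };

static int in_range(struct range r, uint32_t addr)
{
    return addr >= r.start && addr < r.end;
}

/* Classify the EIP taken from the exception frame (or, with task
 * gates, from the TSS of the interrupted task). */
static enum eip_origin classify_eip(uint32_t eip, struct range user_code,
                                    struct range kernel_code)
{
    if (in_range(user_code, eip))   return EIP_USER;
    if (in_range(kernel_code, eip)) return EIP_KERNEL;
    return EIP_UNKNOWN;
}
```

An `EIP_UNKNOWN` result from the invalid opcode handler would point towards a corrupted stack or a stale saved EIP rather than a genuinely bad instruction in the task's code.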

Posted: **Sat Apr 17, 2021 4:28 am**

A quick update.

I modified my kernel to handle exceptions with interrupt gates and hardware interrupts with interrupt gates as well.
I just kept one task gate for the IRQ0 (the timer) in order to implement my scheduler as a task.
Now, everything works perfectly!

Thanks again everyone for your helpful remarks.

Multi-tasking issue with hardware task switching (Intel)
