Page 1 of 1

Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Tue Apr 05, 2022 12:06 pm
by yasar11732
I am trying to read and understand Brendan's Multi-tasking Tutorial before trying to implement my own kernel threads.

I was able to follow the tutorial, until I got to this version of PIT handler;

Code: Select all

uint64_t time_slice_remaining = 0;
 
uint64_t time_since_boot = 0;
uint64_t time_between_ticks = 1000000;  // 1000 Hz = 1 ms between ticks = 1000000 nanoseconds between ticks
 
void PIT_IRQ_handler(void) {
    thread_control_block * next_task;
    thread_control_block * this_task;
 
    lock_stuff();
    time_since_boot += time_between_ticks;
 
    // Move everything from the sleeping task list into a temporary variable and make the sleeping task list empty
 
    next_task = sleeping_task_list;
    sleeping_task_list = NULL;
 
    // For each task, wake it up or put it back on the sleeping task list
 
    while(next_task != NULL) {
        this_task = next_task;
        next_task = this_task->next;
 
        if(this_task->sleep_expiry <= time_since_boot) {
            // Task needs to be woken up
            unblock_task(task);
        } else {
            // Task needs to be put back on the sleeping task list
            task->next = sleeping_task_list;
            sleeping_task_list = task;
        }
    }
 
    // Handle "end of time slice" preemption
 
    if(time_slice_remaining != 0) {
        // There is a time slice length
        if(time_slice_remaining <= time_between_ticks) {
            schedule();
        } else {
            time_slice_remaining -= time_between_ticks;
        }
    }
 
    // Done, unlock the scheduler (and do any postponed task switches!)
 
    unlock_stuff();
}
I will explain what I think is happening step by step so you can point out mistakes in my reasoning.

This function is called PIT_IRQ_handler, so I suppose it is called from an interrupt gate assembly wrapper. So, at the beginning of function interrupts are disabled by CPU. Also, PIC is also disabled waiting for "End Of Interrupt" command. Since EOI is not seen here, I will also assume it is sent by assembly wrapper after PIT_IRQ_handler is returned.

First thing this function does is to call lock_stuff() function. This function disabled interrupts and postpones task switching. It is implemented as below;

Code: Select all

void lock_stuff(void) {
#ifndef SMP
    CLI();
    IRQ_disable_counter++;
    postpone_task_switches_counter++;
#endif
}
Disabling interrupts does nothing in this case because it is already disabled, but it doesn't cause any harm.

Next, function unblocks any sleeping threads. This part has nothing to do with my question, so I skip to next part.

In the next part, if current thread used up all its timeslice, schedule function is called. This is the version of schedule function that is supposed to be called at that point in tutorial.

Code: Select all

void schedule(void) {
    thread_control_block * task;
 
    if(postpone_task_switches_counter != 0) {
        task_switches_postponed_flag = 1;
        return;
    }
 
    if( first_ready_to_run_task != NULL) {
        task = first_ready_to_run_task;
        first_ready_to_run_task = task->next;
        switch_to_task(task);
    } else if( current_task_TCB->state == RUNNING) {
        // Do nothing, currently running task can keep running
    } else {
 
        // Currently running task blocked and there's no other tasks
 
        task = current_task_TCB;                           // "task" is the task that we're borrowing while idle
        current_task_TCB = NULL;                           // Make sure other code knows we're not actually running the task we borrowed
        uint64_t idle_start_time = get_time_since_boot();  // For future power management support
 
        // Do nothing while waiting for a task to unblock and become "ready to run".  The only thing that is going to update 
        // first_ready_to_run_task is going to be from a timer IRQ (with a single processor anyway), but interrupts are disabled.
        // The timer must be allowed to fire, but do not allow any task changes to take place.  The task_switches_postponed_flag
        // must remain set to force the timer to return to this loop.
 
 
        do {
            STI();          // enable interrupts to allow the timer to fire
            HLT();          // halt and wait for the timer to fire
            CLI();          // disable interrupts again to see if there is something to do
        } while( (first_ready_to_run_task == NULL);
 
        // Stop borrowing the task
 
        current_task_TCB = task;                          
 
        // Switch to the task that unblocked (unless the task that unblocked happens to be the task we borrowed)
 
        task = first_ready_to_run_task;
        first_ready_to_run_task = task->next;
        if(task != current_task_TCB) {
            switch_to_task(task);
        }
    }
}
Since lock_stuff() was called earlier, this function will just set task_switches_postponed_flag and return immediately. However, unlock_stuff() from PIT_IRQ_handler() will be called next. This is where my confusion begins. This is how it is implemented.

Code: Select all

void unlock_stuff(void) {
#ifndef SMP
    postpone_task_switches_counter--;
    if(postpone_task_switches_counter == 0) {
        if(task_switches_postponed_flag != 0) {
            task_switches_postponed_flag = 0;
            schedule();
        }
    }
    IRQ_disable_counter--;
    if(IRQ_disable_counter == 0) {
        STI();
    }
#endif
}
Since task_switches_postponed_flag was set earlier, unlock_stuff() function will call schedule() and it will most probably call switch_task() function.

switch_task() function will return from another thread. However, we are yet to return from PIT_IRQ_handler. If next task we are switching to was called from PIT_IRQ_handler, things might work out fine. But if it is not, we won't be able to send EOI to PIC, so we can't preempt no more. It is possible for threads to block themselves to sleep for some time, so it is a real possibility.

Another problem I am having is this part in unblock_stuff() function;

Code: Select all

    IRQ_disable_counter--;
    if(IRQ_disable_counter == 0) {
        STI();
    }
This enables interrupts again before we return from IRQ handler. This seems problematic as it is supposed be enabled after iret called from interrupt handler. However, since PIC is waiting for EOI anyways, this might actually be harmless. What do you think about this?

Looking for your input,

Best regards

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Tue Apr 05, 2022 12:53 pm
by nullplan
Wow, I would not have expected Brendan's code to make the same mistake I see so many beginners make: Intermingling IRQ handling and task handling. These two things ought to be separate.

The points you raised are valid. The code does wrongly enable interrupts in the wrong place. It appears the counter was supposed to counteract this, but failed to adjust for the effect of interrupt gates. No matter, it would be better if he had primitives that read out the flags register before doing a cli, and restored that register instead of unconditionally enabling interrupts.

Code: Select all

static inline unsigned irqdisable(void) {
  unsigned flags;
  asm volatile ("pushfd; cli; popl %0" : "=r"(flags) :: "memory");
  return flags;
}

static inline void irqrestore(unsigned flags) {
  asm("pushl %0; popfd" :: "r"(flags) : "memory", "cc");
}
That way you could nest things more elegantly. Come to think of it, he actually needs a spinlock mechanism to actually protect the variables he wants to protect, because the whole thing will not work as soon as a second CPU gets into the mix.

It is true that Brendan's code does not emit EOI until later, however, with interrupts enabled, there is the potential for stack overflow even from this small window (PIC can re-interrupt before CPU reaches iret instruction). And in modern PCs, PIC is by far not the only interrupt source, which also opens the door to stack overflow.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Tue Apr 05, 2022 3:17 pm
by yasar11732
nullplan wrote: Intermingling IRQ handling and task handling. These two things ought to be separate.
That is what I thought too. But, I also don't see a way to do preemptive multitasking outside timer interrupts.
nullplan wrote: No matter, it would be better if he had primitives that read out the flags register before doing a cli, and restored that register instead of unconditionally enabling interrupts.
I re-read relevant section again, and he actually mentions that option, but opts for using a counter to make it look like a lock so we can change it later when we support multiple CPU's.
nullplan wrote: because the whole thing will not work as soon as a second CPU gets into the mix.
To be fair, it was supposed to be for single-CPU only.

That aside, I am still trying to figure out a reasonable explaination to why this code should work because everything else is very carefully thought out to have an obvious bug.

One possible explanation I could think of is that, EOI is sent before calling PIT_IRQ_handler and interrupts are supposed to be enabled without returning from interrupt handler because everything else depends on it.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Wed Apr 06, 2022 11:59 am
by nullplan
yasar11732 wrote:That is what I thought too. But, I also don't see a way to do preemptive multitasking outside timer interrupts.
Separate out the concepts. First of all, I'd have the timer mechanism and the callback be separate. I'd have a priority queue of timed events, ordered by deadline, with the timer dynamically set to the smallest time, and then when it has arrived, just run all the expired callbacks. A cyclic event is then just re-inserted. This uncouples the timer from the timed event, and eases adoption of newer or better timers (e.g. HPET, LAPIC timer). So that way you would handle the PIT interrupt by handling all expired timers, then setting the PIT to fire on the smallest deadline, then EOI (and you insert a new timer by resetting the PIT if need be). And what the timers do is written somewhere else.

Then, for the actual scheduling, I would just add a task flag. Have a flags field in the task struct. You need one anyway to distinguish kernel and user tasks, or to distinguish FPU tasks. When the timer triggers, just set a "timeout" flag. This is handled in the code that returns from interrupt back to userspace. It checks to see if the return is to userspace, and if so and the timeout flag is set, it clears the timeout flag and then calls schedule(). This gets you out of the perilous interrupt state: Since you can return to userspace only on the highest level of the stack, that code cannot have interrupted other kernel code, and therefore can take locks and allocate memory and schedule processes. It also ensures that any interrupt controller is out of interrupt state before scheduling.

And finally for timekeeping: Just use a hardware counter. No need to count timer interrupts in software.
yasar11732 wrote:One possible explanation I could think of is that, EOI is sent before calling PIT_IRQ_handler and interrupts are supposed to be enabled without returning from interrupt handler because everything else depends on it.
Another possibility is that the chances of something going wrong in reality are slim. There is a theoretical possibility of a stack overflow, but for real conditions to cause such a thing would require an interrupt storm, and unless such a thing happened, it would just not be noticed.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Wed Apr 06, 2022 12:45 pm
by Octocontrabass
nullplan wrote:There is a theoretical possibility of a stack overflow, but for real conditions to cause such a thing would require an interrupt storm, and unless such a thing happened, it would just not be noticed.
I wouldn't call it theoretical. Linux stopped allowing nested interrupt handlers because someone encountered real conditions that could cause a stack overflow.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 4:02 am
by rdos
Octocontrabass wrote:
nullplan wrote:There is a theoretical possibility of a stack overflow, but for real conditions to cause such a thing would require an interrupt storm, and unless such a thing happened, it would just not be noticed.
I wouldn't call it theoretical. Linux stopped allowing nested interrupt handlers because someone encountered real conditions that could cause a stack overflow.
I just find this to be evidence that Linux (and Windows) nested interrupt structure is too complicated and some people are unable to write decent IRQ handlers that might cause stack overflows.

Generally speaking, an IRQ handler should only remove the interrupt condition and signal wakeup conditions for server threads. I see no reason why this would need to be run with interrupts disabled.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 8:58 am
by Ethin
I am curious... Is there a better tutorial than Brendans available that anyone knows about discussing the implementation of multitasking? Or is Brendans still the "best"/"preferred" reference? I've always wondered how a system that uses (say) lottery scheduling would function under preemption...

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 9:53 am
by yasar11732
I have another question about the same tutorial, it is about this function:

Code: Select all

;C declaration:
;   void switch_to_task(thread_control_block *next_thread);
;
;WARNING: Caller is expected to disable IRQs before calling, and enable IRQs again after function returns
 
switch_to_task:
 
    ;Save previous task's state
 
    ;Notes:
    ;  For cdecl; EAX, ECX, and EDX are already saved by the caller and don't need to be saved again
    ;  EIP is already saved on the stack by the caller's "CALL" instruction
    ;  The task isn't able to change CR3 so it doesn't need to be saved
    ;  Segment registers are constants (while running kernel code) so they don't need to be saved
 
    push ebx
    push esi
    push edi
    push ebp
 
    mov edi,[current_task_TCB]    ;edi = address of the previous task's "thread control block"
    mov [edi+TCB.ESP],esp         ;Save ESP for previous task's kernel stack in the thread's TCB
 
    ;Load next task's state
 
    mov esi,[esp+(4+1)*4]         ;esi = address of the next task's "thread control block" (parameter passed on stack)
    mov [current_task_TCB],esi    ;Current task's TCB is the next task TCB
 
    mov esp,[esi+TCB.ESP]         ;Load ESP for next task's kernel stack from the thread's TCB
    mov eax,[esi+TCB.CR3]         ;eax = address of page directory for next task
    mov ebx,[esi+TCB.ESP0]        ;ebx = address for the top of the next task's kernel stack
    mov [TSS.ESP0],ebx            ;Adjust the ESP0 field in the TSS (used by CPU for for CPL=3 -> CPL=0 privilege level changes)
    mov ecx,cr3                   ;ecx = previous task's virtual address space
 
    cmp eax,ecx                   ;Does the virtual address space need to being changed?
    je .doneVAS                   ; no, virtual address space is the same, so don't reload it and cause TLB flushes
    mov cr3,eax                   ; yes, load the next task's virtual address space
.doneVAS:
 
    pop ebp
    pop edi
    pop esi
    pop ebx
 
    ret                           ;Load next task's EIP from its kernel stack
This is the code that switches between tasks. First it saves some registers to stack, it saves stack pointer and cr3 to task's structure. Next, it changes stack for next tasks stack, pop saved registers and exits.

I don't see flags register being saved somewhere. Wouldn't it cause problems if, for example, a task is stopped after 'cmp' and before 'jz'. Wouldn't other tasks be able to alter flags registers while there are running?

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 10:09 am
by nullplan
yasar11732 wrote:I don't see flags register being saved somewhere. Wouldn't it cause problems if, for example, a task is stopped after 'cmp' and before 'jz'. Wouldn't other tasks be able to alter flags registers while there are running?
flags is a volatile register. See, this code is called synchronously, so it doesn't have to save everything. And at the asynchronous boundary, namely the interrupt handler, you already are saving everything. In particular, flags is part of the interrupt frame, and restored by iret.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 11:48 am
by Octocontrabass
rdos wrote:Generally speaking, an IRQ handler should only remove the interrupt condition and signal wakeup conditions for server threads. I see no reason why this would need to be run with interrupts disabled.
There can be dozens or even hundreds of interrupt sources. If interrupt handlers can be interrupted, your stack must be big enough for all interrupt handlers to nest at the same time.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 12:25 pm
by rdos
Octocontrabass wrote:
rdos wrote:Generally speaking, an IRQ handler should only remove the interrupt condition and signal wakeup conditions for server threads. I see no reason why this would need to be run with interrupts disabled.
There can be dozens or even hundreds of interrupt sources. If interrupt handlers can be interrupted, your stack must be big enough for all interrupt handlers to nest at the same time.

Not really. The IRQ stub will enable interrupts, but it won't do EOI until after all handlers have run. This means only higher priority IRQs can run. Besides, highly loaded IRQs should be running on their own cores.

Re: Brendan's multi-tasking tutorial - Preemptive Scheduling

Posted: Thu Apr 07, 2022 3:06 pm
by rdos
yasar11732 wrote:
This is the code that switches between tasks. First it saves some registers to stack, it saves stack pointer and cr3 to task's structure. Next, it changes stack for next tasks stack, pop saved registers and exits.

I don't see flags register being saved somewhere. Wouldn't it cause problems if, for example, a task is stopped after 'cmp' and before 'jz'. Wouldn't other tasks be able to alter flags registers while there are running?
Generally, I think it is a bad idea and a mistake to leave the register set of a thread on an interupt stack. Registers should be saved & loaded from the thread control block. This also makes it easier to implement a kernel debugger, and it decouples scheduling from IRQs.

The code also lacks the actual scheduling. Typically, when an IRQ occurs, it might result in preemption of a thread, or switching to a high priority server thread. There is seldom a well-known thread to run next. That was the problem with hardware task-switching too, which assumed a switch from one thread to another with a call or jmp, but it didn't work well since there is a requirement for logic between having saved the state of the current thread and loading the state of the next (scheduling). Running the scheduler / task-switcher with interrupts disabled is a bad idea too. It should have some kind of lock instead, and run with interrupts enabled.

My scheduler has one dedicated stack per core that it will switch to after having saved the current thread. It can then run the scheduler function on a known-valid stack. The stack of the next thread to run is loaded as part of loading it's register context.