Scheduler thinks it has no more tasks, stuck in idle

ccleaverr
Posts: 3
Joined: Fri Aug 12, 2022 4:46 am
Location: Christchurch, New Zealand

Scheduler thinks it has no more tasks, stuck in idle

Post by ccleaverr »

I have written a basic scheduler for my system which uses 4 SMP cores on x86_64.
One scheduler is initialised for each CPU, and new tasks are just scheduled on whichever processor has the fewest scheduled tasks.
The code definitely needs some work, but I'm trying to get it working first.

I have a series of tests built into the kernel which run when initialisation is complete. Some of these schedule multiple processes in quick succession, wait for them to set some global variable, and then exit. Sometimes these processes have the same entry point, sometimes not. In any case, about 95% of the time my system will initialise, run all the tests, and pass. I can even run the tests repeatedly in a loop without problems.

But if I do a 'make clean; make' in a loop, eventually one of my tests will hang. In my debugging adventures, I've found that despite having just called schedule(), the processor the task was scheduled on is stuck on the idle task. Interrupts continue to occur, but SwitchToNextTask() only ever chooses the idle task. Meanwhile the processor that called schedule() is stuck calling `YieldUntil(someData == TEST_SUCCESS_VAL)`, but that condition never occurs.

For some background, I keep the task list as one circular (doubly) linked list for each CPU. SwitchToNextTask() is called both by my timer interrupt and by yield(), which is a software interrupt.

When a task is scheduled, memory is allocated for the stack and the process control block. I place a pointer to the control block, as well as a pointer to my DestroyTask function, on the stack, so that when a process returns from its entry function it will return to DestroyTask. DestroyTask can then just grab the control block off the stack. It checks whether the process is joinable; if so, it saves the return value and waits for it to be joined, and otherwise it changes the state to be destroyed. The memory allocator in the scheduler is a 'lockless' slab allocator used exclusively by the scheduler within scheduler spinlocks.

Here is the code for initialising the scheduler, called once for each CPU:

Code: Select all

void
knsc_InitialiseScheduler(void *(*continuation)(void *),
                         void *continuationArgs,
                         bool bootstrapProcessor)
{
   SpinlockAcquire(&sched);

   static bool dejaVu = false;
   if (!dejaVu) {
      knsc_InitialiseSchedulerAllocator();
      dejaVu = true;
   }

   u8 thisCpu = knsm_ThisCpuId();
   numScheduledTasksFor[thisCpu] = 1;

   first[thisCpu] = knsc_AllocateProcess();
   kmemset(first[thisCpu], 0x00, sizeof *first[thisCpu]);

   knsc_threadAttr_t attr = 1; /* NOT_JOINABLE */
   InitProcessStructCommonElements(first[thisCpu], (u64) IdleTask, 0, &attr);

   current[thisCpu] = knsc_AllocateProcess();
   kmemset(current[thisCpu], 0x00, sizeof *current[thisCpu]);

   current[thisCpu]->next = first[thisCpu];
   current[thisCpu]->prev = first[thisCpu];
   first[thisCpu]->next   = current[thisCpu];
   first[thisCpu]->prev   = current[thisCpu];

   SpinlockRelease(&sched);

   if (continuation != NULL) {
      knsc_Schedule(&attr, continuation, continuationArgs); /* Only executed on BSP, other processors wait in the idle task for something to do */
   }

   if (bootstrapProcessor) {
      knar_RegisterSchedYieldHandler(SwitchToNextTask);
      knar_RegisterForTimerInterruptNotifications(SwitchToNextTask);
      kprintf("Initialised scheduler\n");
   }

   ++numSchedulersRunning;
}
Here is the code for scheduling new tasks:

Code: Select all

knsc_thread_t
knsc_Schedule(knsc_threadAttr_t const *attr, void *(*entry)(void *), void *args)
{
   knsc_thread_t id = { 0 };

   SpinlockAcquire(&sched);
   u8 cpu = WhichCpuShouldTaskBeScheduledOn(); /* Whichever CPU has the shortest task list */

   struct process *new = knsc_AllocateProcess();
   kmemset(new, 0x00, sizeof *new);

   new->prev              = first[cpu]->prev;
   new->next              = first[cpu];
   first[cpu]->prev->next = new;
   first[cpu]->prev       = new;

   InitProcessStructCommonElements(new, (u64) entry, (u64) args, attr);

   ++numScheduledTasksFor[cpu];
   id = new->processId;

   SpinlockRelease(&sched);

   return id;
}
The code which initialises the control block:

Code: Select all

static void
InitProcessStructCommonElements(struct process *p,
                                u64 entry,
                                u64 argsP,
                                knsc_threadAttr_t const *attr)
{
   p->state      = RUNNABLE;
   p->processId  = currentProcessId++;

   if (attr != NULL) {
      p->attributes = *attr;
   } else {
      p->attributes = 0;
   }

   p->regs.rip = entry;
   p->regs.rsp = (u64) knsc_AllocateDefaultStack() + DEFAULT_STACK_SIZE;
   p->originalRsp = p->regs.rsp;

   p->regs.cs     = DEFAULT_CS; /* 0x08 */
   p->regs.rflags = DEFAULT_RFLAGS; /* 0x202 */
   p->regs.ss     = DEFAULT_SS; /* 0x00 */

   p->regs.rdi = argsP;

   p->regs.rsp -= sizeof(u64); /* Store the control block on the stack so DestroyTask can recover it */
   *((u64 *) p->regs.rsp) = (u64) p;

   p->regs.rsp -= sizeof(u64); /* Store the DestroyTask address so we'll return to it when the thread exits */
   extern void knsc_DestroyTask_Stub(void); 
   *(u64 *) (p->regs.rsp) = (u64) knsc_DestroyTask_Stub;
}
SwitchToNextTask, called by timer interrupts and yield():

Code: Select all

static void
SwitchToNextTask(struct registers *regs, u64 _unused(ticksSinceBoot))
{
   if (!SpinlockTry(&sched)) {
      return;
   }

   u8 thisCpu = knsm_ThisCpuId();
   kmemcpy(&current[thisCpu]->regs, regs, sizeof current[thisCpu]->regs);

   do {
      struct process *proc = current[thisCpu];

      current[thisCpu] = current[thisCpu]->next;

      /* Remove the task from the task list and free its memory */
      if (proc->state == AWAITING_DESTROY) {
         void *stackMem = (void *) (proc->originalRsp - DEFAULT_STACK_SIZE);

         proc->prev->next = proc->next;
         proc->next->prev = proc->prev;

         knsc_FreeDefaultStack(stackMem);
         knsc_FreeProcess(proc);
      }

   } while (current[thisCpu]->state != RUNNABLE);

   kmemcpy(regs, &current[thisCpu]->regs, sizeof *regs);
   SpinlockRelease(&sched);
}

DestroyTask, called via the assembly stub below (whose address is placed on the stack):

Code: Select all

void
knsc_DestroyTask(struct process *process, void *rax)
{
   SpinlockAcquire(&sched);
   u8 thisCpu = knsm_ThisCpuId();

   /* Don't free the memory yet even if the task is awaiting destroy, or we'll end up with a use-after-free.
       The memory will be freed after it's been removed from the task list */
   process->returnVal = rax;
   process->state = (process->attributes & NOT_JOINABLE) 
                  ? AWAITING_DESTROY : AWAITING_JOIN;

   --numScheduledTasksFor[thisCpu];

   SpinlockRelease(&sched);
   for (;;) {
      yield(); /* Once this task is destroyed, we won't come back here */
   }
}

Code: Select all

knsc_DestroyTask_Stub: /* Pass the control block and return value to the real DestroyTask */
   movq (%rsp), %rdi
   movq %rax, %rsi
   call knsc_DestroyTask

Code: Select all

void
knsc_JoinProcess(knsc_thread_t id, void **retval)
{
   SpinlockAcquire(&sched);
   struct process *proc = NULL;
   for (usize cpu=0; cpu<numSchedulersRunning; ++cpu) {
      struct process *p = first[cpu]->next;
      while (p != first[cpu]) {
         if (p->processId == id) {
            proc = p;
            goto out;
         }

         p = p->next;
      }
   }
out:
   SpinlockRelease(&sched);
   if (proc == NULL) { /* Probably don't need this anymore, tasks will be kept alive as long as they're joinable */
      if (retval != NULL) {
         *retval = NULL;
      }
      return;
   }

   YieldUntil(proc->state == AWAITING_JOIN); /* Wait for the task to finish */
   SpinlockAcquire(&sched);

   if (retval != NULL) {
      *retval = proc->returnVal;
   }

   proc->state = AWAITING_DESTROY;
   SpinlockRelease(&sched);
}
My idle task just executes a `hlt` instruction in an infinite loop.
This is all currently just for kernel threads, where all processors use the same page tables.
And when he came back to, he was flat on his back on the beach in the freezing sand, and it was raining out of a low sky, and the tide was way out.
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Scheduler thinks it has no more tasks, stuck in idle

Post by thewrongchristian »

Do your spin locks block interrupts? If not, your CPU can take interrupts while holding your scheduler lock, and if interrupt handlers can unblock waiting tasks (putting them back on the scheduler queue), then you could easily corrupt your scheduler linked list.

You could write paranoid debug code which walks your linked list forwards and backwards each time it is modified, to check that you can walk in both directions back to your first item.

That way, if your linked list is corrupted in either direction, your test walk will hopefully hang, and you can then break in with a debugger and see what's going on.

I have code in my kernel which checks only the forward and back links between the new item and its neighbours, but it usually catches linked-list corruption. I like the walk idea above, though, so I might add that as well/instead.

*edit*

Looking at your code again, you spin on what I assume is a global "sched" spinlock. All your CPUs therefore contend on that single spinlock whenever they schedule, so I'd have to ask: why have per-CPU scheduler queues if you're serialising using a single lock?
ccleaverr
Posts: 3
Joined: Fri Aug 12, 2022 4:46 am
Location: Christchurch, New Zealand

Re: Scheduler thinks it has no more tasks, stuck in idle

Post by ccleaverr »

thewrongchristian wrote:Do your spin locks block interrupts? If not, your CPU can take interrupts while holding your scheduler lock, and if interrupts handlers can unblock waiting tasks (putting them back on the scheduler queue,) then you could easily corrupt your scheduler linked list.
They did not disable interrupts; they do now. As I expected, it didn't help, due to the (paranoid) global lock on the scheduler, which I should get around to replacing.
thewrongchristian wrote:That way, if your linked list is corrupted in either direction, your test walk will hopefully hang, and you can then break in with a debugger and see what's going on.
I tried this and it wasn't the problem, but I have fixed the issue now. It turns out I had a memory leak elsewhere (creating a thread as joinable but never joining it) which masked the real issue and made it harder to debug. After fixing the memory leak, I found that some of my tests were missing locks, as I had written them back in pre-SMP days. Indeed, running QEMU with only one processor made the issue disappear. Once I added the necessary locks, the problem disappeared under SMP too.