Buggy scheduling?

itportal · Post by **itportal** » Thu Sep 20, 2012 8:02 am

Hello everyone,

I am having a bit of a problem with my scheduler and I am not sure if my code is wrong or this is some kind of a qemu issue.

It is a smp kernel, where each core has its own scheduling and idle threads. The scheduling thread contains the scheduling policies (which are for now all the same) - it picks an unlocked thread form a global linked list of threads (also locks it) and schedules it using a system call. If no thread is free for scheduling, the idle thread for this core is scheduled. The whole thing is something like a 2-way scheduling.

Code: Select all

void scheduler(){
	while(true){
		volatile ll* p = thread_queue;
		while(p != NULL){
			asm("nop");
			volatile thread* t = (thread*)p->object; // it dies here trying to access the object pointer
			volatile lock* l = &(t->state);
			if(try_lock(l))
				asm("int 0x23" : : "a"(t));
			p = p->next;
		}
		// no suitable thread, schedule idle thread
		asm("int 0x23" : : "a"(cpu->idle));
	}

}

It works, I have tested it on bochs, qemu and real hardware. What is giving me headaches is (probably) the idle thread:

Code: Select all

void idle(){
	while(true){
		asm volatile("hlt");
	}
}

When I run the kernel with this idle thread, it works on both bochs and real hardware. But on qemu I get strange errors when I increase the number of cores (>= 4). A page fault occurs sometimes on one or several (but not all) cores in the scheduling thread when it tries to pick up an unlocked thread. I traced the instruction and it seems the problem is when trying to access the linked list and more precisely the pointer to the actual thread data structure (see comment in above code). Now, as I am using a global linked-list and the other scheduling threads are doing well, I assume the data structure is not broken. It is more likely that the stack of the scheduling thread is somehow broken, but I am not sure.

However, if I make the idle thread without a hlt instruction and let it be just an infinite loop, it works fine on qemu as well.

Another interesting thing - with the hlt instruction I do not get the error, if I make my lapic timer interrupt fire not so quickly.

The problem occurs when I have 4 or more cores. Unfortunately, I do not have real hardware with that many cores and cannot test it. But on bochs with 32 cores is working fine.

The problem might be somewhere else, not the idle thread, but I really have no clue. I tested and tried code for like 3 days now and would appreciate any ideas from your side.

PS. int 0x23 schedules the thread pointed in eax. All threads run in ring 0, user space still not used at this stage.

bluemoon · Post by **bluemoon** » Thu Sep 20, 2012 9:38 am

Hello, I have not look into your problem but I found some potential issue in your code (see inline comment)

Code: Select all

void scheduler(){
	while(true){
		volatile ll* p = thread_queue;
		while(p != NULL){
			asm("nop");

// Can stuff pointed by p be changed by other? you don't have any lock here.

		        volatile thread* t = (thread*)p->object; // it dies here trying to access the object pointer

// What if the thread terminated? do t still be accessible?

			volatile lock* l = &(t->state);
			if(try_lock(l))
				asm("int 0x23" : : "a"(t));
			p = p->next;
		}
		// no suitable thread, schedule idle thread
		asm("int 0x23" : : "a"(cpu->idle));
	}

}

itportal · Post by **itportal** » Thu Sep 20, 2012 11:20 am

Code: Select all

// Can stuff pointed by p be changed by other? you don't have any lock here.

The thread queue is built before scheduling is enabled and is afterwards not changed. Therefore, I left the locks out.

Code: Select all

// What if the thread terminated? do t still be accessible?

For testing purposes threads cannot terminate. So t is always accessible.

itportal · Post by **itportal** » Tue Sep 25, 2012 4:01 am

Is it possible that I receive an interrupt during I am processing another interrupt? I am entering the first interrupt through a interrupt gate and suppose I get another interrupt in the middle which is messing up my stacks.

bluemoon · Post by **bluemoon** » Tue Sep 25, 2012 4:07 am

itportal wrote:Is it possible that I receive an interrupt during I am processing another interrupt?

Yes.
The actual behaviour depends on the underlying mechanism of chips.
For old fashion PIC, same interrupt number won't trigger until an ack, but a higher priority interrupt (which may also be chained to another PIC) may still fire, in PC that's how master and slave PIC work.
CPU exceptions are also possible to fire within an interrupt handler or exception handler.

itportal · Post by **itportal** » Tue Sep 25, 2012 4:16 am

How about the APIC? I disable the PIC in my kernel.

bluemoon · Post by **bluemoon** » Tue Sep 25, 2012 4:45 am

IOAPIC should work almost the same as PIC, except there is usually no chain.
However I suggest you to check the manual.

Brendan · Post by **Brendan** » Tue Sep 25, 2012 10:20 am

Hi,

itportal wrote:Is it possible that I receive an interrupt during I am processing another interrupt? I am entering the first interrupt through a interrupt gate and suppose I get another interrupt in the middle which is messing up my stacks.

For an Interrupt Gate, when the CPU transfers control to the IRQ handler it stores EFLAGS on the stack then disables interrupts. Unless your IRQ handler enables IRQ again, this effectively means that no normal IRQs can interrupt an interrupt handler. Of course this won't prevent NMI or Machine Check, or other exceptions, or software interrupts; but it's extremely unlikely that you're seeing an NMI or machine check (and I'd assume you'd know if something was causing an exception or using a software interrupt).

For your code, I don't know what is causing the current problem because I don't know how "try_lock()" is implemented or what "int 0x23" does (or why your kernel is using software interrupts to begin with - does "kernel" run at CPL=3 or something?).

However, bluemoon is right - there are design flaws, which mean that as soon as anything modifies the linked list of threads it will become broken. The code needs to be redesigned/rewritten sooner or later to fix this design flaw; and redesigning/rewriting will get rid of the current bug (and create completely different bugs!). Basically, there's no point finding/fixing the current bug.

To fix the design flaw: a global queue needs a global lock.

Cheers,

Brendan

OSDev.org

Buggy scheduling?

Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?

Re: Buggy scheduling?