Buggy scheduling?
Posted: Thu Sep 20, 2012 8:02 am
Hello everyone,
I am having a problem with my scheduler, and I am not sure whether my code is wrong or this is some kind of QEMU issue.
It is an SMP kernel in which each core has its own scheduling thread and idle thread. The scheduling thread contains the scheduling policy (for now the same on every core): it picks an unlocked thread from a global linked list of threads, locks it, and schedules it using a system call. If no thread is free for scheduling, the core's idle thread is scheduled. The whole thing is something like two-way scheduling.
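To make the design above concrete, here is a minimal user-space sketch of one pass of a per-core scheduling thread. It assumes details the post does not show: `lock` is taken to be a C11 `atomic_bool`, `try_lock` is assumed to be a single atomic exchange, and `int 0x23` is replaced by recording the thread that would be dispatched. `schedule_pass` and the field layouts are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's types; only the fields the
   scheduler touches are modelled. */
typedef atomic_bool lock;            /* true = claimed by some core */

typedef struct thread {
    lock state;
    int id;                          /* placeholder for other thread data */
} thread;

typedef struct ll {
    void* object;                    /* points to a thread */
    struct ll* next;
} ll;

/* Assumed try_lock: one atomic exchange; returns true only for the
   caller that changed the lock from false to true. */
static bool try_lock(lock* l) {
    return !atomic_exchange(l, true);
}

/* One pass of the per-core scheduling loop. Instead of executing
   int 0x23, it records in out[] each thread it would dispatch and
   ends with the idle thread. Returns the number of entries written. */
static size_t schedule_pass(ll* head, thread* idle, thread** out, size_t max) {
    size_t n = 0;
    for (ll* p = head; p != NULL && n + 1 < max; p = p->next) {
        thread* t = (thread*)p->object;
        if (try_lock(&t->state))
            out[n++] = t;            /* would be: int 0x23 with eax = t */
    }
    if (n < max)
        out[n++] = idle;             /* end of list: run this core's idle thread */
    return n;
}
```

With a three-entry list whose second thread is already locked by another core, one pass dispatches the first and third threads and then the idle thread; a second pass finds every thread locked and dispatches only the idle thread.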
The kernel works; I have tested it on Bochs, QEMU and real hardware. What is giving me headaches is (probably) the idle thread (code below).
With this idle thread, the kernel works on both Bochs and real hardware. On QEMU, however, I get strange errors once I increase the number of cores to 4 or more: a page fault sometimes occurs on one or several (but not all) cores in the scheduling thread while it tries to pick an unlocked thread. I traced the faulting instruction, and the problem seems to be the access to the linked list, more precisely to the pointer to the actual thread data structure (see the comment in the scheduler code below). Since the linked list is global and the other scheduling threads are doing well, I assume the data structure itself is not broken. It seems more likely that the stack of the scheduling thread is somehow corrupted, but I am not sure.
However, if I remove the hlt instruction from the idle thread and leave it as a plain infinite loop, it works fine on QEMU as well.
Another interesting thing: even with the hlt instruction, I do not get the error if I make the LAPIC timer interrupt fire less frequently.
The problem occurs when I have 4 or more cores. Unfortunately, I do not have real hardware with that many cores, so I cannot test it there. On Bochs, however, it works fine even with 32 cores.
The problem might be somewhere other than the idle thread, but I really have no clue. I have been testing and trying code for about three days now and would appreciate any ideas.
PS: int 0x23 schedules the thread pointed to by eax. All threads run in ring 0; user space is not used yet at this stage.
```c
void scheduler(){
    while(true){
        volatile ll* p = thread_queue;      // global linked list of threads
        while(p != NULL){
            asm("nop");
            volatile thread* t = (thread*)p->object; // it dies here trying to access the object pointer
            volatile lock* l = &(t->state);
            if(try_lock(l))                 // claim the thread unless another core holds it
                asm("int 0x23" : : "a"(t)); // schedule the thread pointed to by eax
            p = p->next;
        }
        // no suitable thread, schedule idle thread
        asm("int 0x23" : : "a"(cpu->idle));
    }
}
```
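The scheduler relies on `try_lock` so that each thread is claimed by at most one core, but the post does not show how `try_lock` is implemented. The sketch below is a user-space model under that assumption: POSIX threads stand in for the per-core scheduling threads, a plain array stands in for the global list, and `try_lock` is assumed to be one atomic exchange. `run_cores`, `core_main`, `NCORES`, `NTHREADS` and the `claims` counter are hypothetical. Note that `volatile` alone does not make a check-then-set atomic.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes: NCORES simulated scheduling threads race over
   NTHREADS thread entries. */
enum { NTHREADS = 64, NCORES = 8 };

typedef atomic_bool lock;            /* true = claimed by some core */

typedef struct thread {
    lock state;
    atomic_int claims;               /* cores that claimed it; should end at 1 */
} thread;

/* Static storage: every lock starts false and every counter at 0. */
static thread threads[NTHREADS];

/* Assumed try_lock: a single atomic read-modify-write, so two cores can
   never both see "unlocked" for the same entry. */
static bool try_lock(lock* l) {
    return !atomic_exchange(l, true);
}

/* Body of one simulated core: walk every entry and claim what it can. */
static void* core_main(void* arg) {
    (void)arg;
    for (int i = 0; i < NTHREADS; i++) {
        if (try_lock(&threads[i].state))
            atomic_fetch_add(&threads[i].claims, 1);   /* stands in for int 0x23 */
    }
    return NULL;
}

/* Starts the simulated cores, waits for them, and returns how many
   entries were claimed by exactly one core. */
static int run_cores(void) {
    pthread_t cores[NCORES];
    int started = 0;
    while (started < NCORES &&
           pthread_create(&cores[started], NULL, core_main, NULL) == 0)
        started++;
    for (int c = 0; c < started; c++)
        pthread_join(cores[c], NULL);
    int exactly_once = 0;
    for (int i = 0; i < NTHREADS; i++)
        exactly_once += (atomic_load(&threads[i].claims) == 1);
    return exactly_once;
}
```

Compiled with `-pthread`, `run_cores()` should report all `NTHREADS` entries claimed exactly once. If `try_lock` were instead a plain read followed by a separate write, two cores could both see an entry as unlocked and claim it, and `run_cores` could then report fewer than `NTHREADS`.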
```c
void idle(){
    while(true){
        asm volatile("hlt"); // halt this core until the next interrupt arrives
    }
}
```