Interrupts in SMP system

Post by mutex »

Hi,

I have implemented multiprocessor init in my kernel, and it works with my scheduler and timer (the APIC timer). Each CPU sets its own APIC timer when exiting the scheduler; the timeout depends on the scheduling parameters of the current thread/process.

I implemented mutexes and locks to isolate the critical sections, and everything works fine. Threads run in parallel and are scheduled in and out just as before, when I only ran on one CPU.

All CPUs can interrupt each other with an IPI if needed: either an IPI to the scheduler, so that all CPUs enter the scheduler, or an IPI that makes each CPU synchronize in the kernel and read a message.

I use the same IDT and GDT for all processors so that they can all access the same routines.

I guess I could deliver the interrupts to different CPUs at random, or by priority, or something. I have not thought this through yet.

But I have a question. I'm having trouble figuring out how to handle the keyboard IRQ etc. Does anyone have a graphical schematic of how the PIC/APIC/IOAPIC are connected, so that I can understand it? Should one IRQ always go to the same CPU, or should it vary? The same question probably applies to all IRQs from external hardware.

-
Thomas

Post by 01000101 »

I'm far too overtired to help out much right now, but I can link to a few threads that helped me when I was writing similar code.

IO-APIC Info
Enable local-apic interrupts?

Hope those can help a bit, they did for me. :)
I'll see if I can muster up some info tomorrow.

Post by Brendan »

Hi,
thomasnilsen wrote: Anyone have a graphical schematic on how PIC/APIC/IOAPIC are connected so that I can understand it.
There are a few generic diagrams in Intel's "Multi-Processor Specification", but you won't find anything detailed for modern computers because the motherboard manufacturer can connect any IRQ to any I/O APIC input, and there's stuff like "MSI" (Message Signalled Interrupts) that bypasses the I/O APIC inputs and sends the interrupt's details directly.
thomasnilsen wrote: Should one IRQ go to the same CPU always, or should it vary? The same issue probably goes for all IRQs from external hardware.
For IRQs from hardware devices there are lots of things that affect "optimal" IRQ balancing. In general I'm mostly in favour of using "send to lowest priority" (and maintaining each CPU's "Task Priority Register") so that IRQs are handled by CPUs that are running the least important code (unless it's necessary) and so that CPUs that are sleeping (e.g. in a power saving state) aren't woken up (unless it's necessary).
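
As a rough illustration of the "send to lowest priority" approach, the sketch below builds a 64-bit I/O APIC redirection table entry with the delivery mode set to "lowest priority" and a logical destination. The bit positions follow the standard I/O APIC redirection entry layout; the macro and function names are made up for this example, and polarity and trigger mode are left at zero (active high, edge triggered), which is an assumption that suits an ISA-style IRQ such as the keyboard.

```c
#include <stdint.h>

/* Hypothetical names for fields of an I/O APIC redirection table entry. */
#define IOAPIC_DELIVERY_LOWEST_PRIORITY (UINT64_C(1) << 8)   /* delivery mode 001b   */
#define IOAPIC_DEST_MODE_LOGICAL        (UINT64_C(1) << 11)  /* 0 = physical         */
#define IOAPIC_DEST_SHIFT               56                   /* destination, 63:56   */

/*
 * Build an unmasked, edge-triggered, active-high entry that delivers
 * `vector` to the lowest-priority CPU among those whose logical ID
 * matches `logical_dest`.
 */
static uint64_t ioapic_lowest_priority_entry(uint8_t vector, uint8_t logical_dest)
{
    return (uint64_t)vector
         | IOAPIC_DELIVERY_LOWEST_PRIORITY
         | IOAPIC_DEST_MODE_LOGICAL
         | ((uint64_t)logical_dest << IOAPIC_DEST_SHIFT);
}
```

For the arbitration to mean anything, the kernel would also have to update each CPU's Task Priority Register whenever it switches threads.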

On top of that I'd want to be able to create groups of CPUs, and send IRQs to the lowest priority CPU within a specific group of CPUs. This has advantages for NUMA (e.g. run the IRQ handler on the lowest priority CPU that's "close" to the device), and can also have advantages for cache efficiency (e.g. run the IRQ handler on the lowest priority CPU in a group of CPUs that share an L3 cache).

For IPIs, the basic rule is to avoid sending them (if possible) and minimise the total number of CPUs that receive the IPI if the IPI must be sent.

As a very rough estimate, I'd say that receiving an IPI costs the receiving CPU around 400 cycles of overhead (flushing pipelines, fetching and decoding the interrupt handler's code, potential cache and TLB misses, etc), which can have a massive effect on scalability. For example, if there are 2 CPUs that each send a "broadcast to all but self" IPI every 10000 cycles then the overhead is negligible (1 * 400 cycles of overhead per 10000 cycles = who cares), but if there are 64 CPUs that each send a "broadcast to all but self" IPI every 10000 cycles then it's a complete disaster (63 * 400 = 25200 cycles of overhead per 10000 cycles = impossible to get any work done :D ).
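
The arithmetic above can be written out as a small helper. This is only a restatement of the estimate, with the 400-cycle cost and the 10000-cycle period passed in as parameters; the function names are hypothetical.

```c
/*
 * Cycles of IPI-handling overhead that each CPU pays per period, assuming
 * every CPU sends one "broadcast to all but self" IPI per period, so each
 * CPU receives (cpus - 1) IPIs.
 */
static unsigned long ipi_overhead_cycles(unsigned cpus, unsigned long cost_per_ipi)
{
    return (unsigned long)(cpus - 1) * cost_per_ipi;
}

/* The same overhead as a percentage of the period (may exceed 100). */
static unsigned long ipi_overhead_percent(unsigned cpus,
                                          unsigned long cost_per_ipi,
                                          unsigned long period_cycles)
{
    return ipi_overhead_cycles(cpus, cost_per_ipi) * 100 / period_cycles;
}
```

With 2 CPUs this gives 400 cycles (4% of a 10000-cycle period); with 64 CPUs it gives 25200 cycles (252%), i.e. more overhead than there are cycles available.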

For an example, imagine if you've changed a page table entry in user space and need to do the "TLB shootdown" (to give other CPUs that may be running other threads that belong to the affected process a chance to do INVLPG). Here, you could check how many threads the process has (if the process only has one thread, then no other CPU can be running other threads that belong to the process, and no IPI is necessary). Then you could check the process' CPU affinity - if the process is only capable of running on a specific group of CPUs, then maybe you can avoid sending the IPI to CPUs that aren't part of that group.
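
A minimal sketch of that filtering, assuming at most 64 CPUs and a hypothetical per-process structure that records the thread count and an affinity bitmask (neither is from any particular kernel):

```c
#include <stdint.h>

/* Hypothetical per-process bookkeeping, for illustration only. */
struct process {
    unsigned thread_count;    /* number of threads in the process        */
    uint64_t affinity_mask;   /* bit n set = process may run on CPU n    */
};

/*
 * Return a bitmask of the CPUs that need a TLB shootdown IPI after a
 * user-space page table entry of `p` has changed. `self_cpu` must be < 64.
 */
static uint64_t shootdown_targets(const struct process *p,
                                  unsigned self_cpu,
                                  uint64_t online_mask)
{
    /* A single-threaded process cannot be running on any other CPU. */
    if (p->thread_count <= 1)
        return 0;

    /* Only CPUs the process is allowed to run on can hold stale entries. */
    uint64_t targets = p->affinity_mask & online_mask;

    /* The sending CPU invalidates locally and needs no IPI. */
    targets &= ~(UINT64_C(1) << self_cpu);
    return targets;
}
```

An empty result means no IPI has to be sent at all.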

For both of these purposes (IPIs and IRQs), it's a good idea to learn how Logical Destination Mode works, and be very clever when you're deciding which bit/s in the "Logical Destination Register" will be used for what. For example, in a NUMA system you might decide that bit 3 in the "Logical Destination Register" will be set for all CPUs that are "close" to a certain I/O hub, so that IRQs from devices connected to that I/O hub can use Logical Destination Mode to send an interrupt to the lowest priority CPU that has bit 3 in its "Logical Destination Register" set.
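
To make the bit-3 example concrete, here is a small software model of flat logical destination matching combined with lowest-priority selection. It is a simplification of what the hardware does (real arbitration uses the processor priority, not just the TPR), and the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical choice: bit 3 marks CPUs that are "close" to one I/O hub. */
#define LDR_BIT_NEAR_IO_HUB (1u << 3)

/* Simplified per-CPU state: 8-bit logical ID (flat model) and task priority. */
struct cpu {
    uint8_t ldr;
    uint8_t tpr;
};

/*
 * Model of lowest-priority delivery in flat logical mode: a CPU is a
 * candidate if its logical ID shares at least one set bit with the
 * message destination `mda`; among candidates the lowest TPR wins.
 * Returns the index of the chosen CPU, or -1 if none matches.
 */
static int pick_lowest_priority(const struct cpu *cpus, int count, uint8_t mda)
{
    int best = -1;
    for (int i = 0; i < count; i++) {
        if ((cpus[i].ldr & mda) == 0)
            continue;                      /* not addressed by this message */
        if (best < 0 || cpus[i].tpr < cpus[best].tpr)
            best = i;
    }
    return best;
}
```

An IRQ from a device behind that I/O hub would then use LDR_BIT_NEAR_IO_HUB as its destination, so only the nearby CPUs compete for it.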

Unfortunately, for x2APIC the "Logical Destination Register" is a pre-set read-only register, so (IMHO) one of the best ways to improve interrupt handling efficiency has died :(. Fortunately, in some/most cases it's still possible to use similar "logical destination mode" tricks (but, your hands are tied in some cases).


Cheers,

Brendan

Post by mutex »

Thanks for the replies, guys.

I decided to give it a go, delivering the interrupts to the CPU running the lowest-priority thread (using the task priority in the APIC). I was also thinking about implementing some sort of performance counters to measure the time spent in ISR handlers and in general kernel work, so that I can try different approaches, but using the lowest-priority CPU seems like a good idea.

The IPI is so far only used for forcing CPUs to refresh their page directory/tables. I guess I could let the CPU page-fault and do it there instead. I have simple message passing with IPIs so that I can tell the other CPUs to halt etc., basically just for testing purposes at the moment.

For the NUMA part it's of course important to take device and memory distance into account for the various CPU groups, but I'm not there just yet ;)

I have a problem, though. It seems that I get a race condition in the scheduler ISR handler. Basically I get a GPF on one of the CPUs, and then the other ones follow with GPFs as well. I have tried to isolate the problem but have not found it yet. I added some printf("CPU %i entered scheduler") calls, but then the problem does not occur. So it looks like it might be due to the mutex lock/unlock or the spinlock in the mutex.

My ISR handler calls PsScheduler(), and inside there I call mutex_lock(&shared_mutex_variable), do my thread switch / change the thread pointer for the current CPU, then call mutex_unlock(..) and exit the ISR.

I saw Bochs complain about a "lock prefix invalid on instruction". I found out that I had a "lock xchgl", where the prefix was not needed. Anyway, I removed it, and that is not the problem because the issue is still there.

I added an "if(cpu == 0) FindNextThread();" inside my scheduler so that only the BSP cycles through all the threads. This works, while CPU 1 only enters and exits again.

I really have no clue what this is. Single-stepping the whole thing is probably the next step. Arrrggghhh.

I'm still not sure about my mutex implementation. The lock seems to work, but might I have a race condition here? I tried both with and without the "rep; nop" CPU hint. That should not affect this, I think.

Basically it looks like this:

Code:

int mutex_scheduler;

void PsScheduler()
{
    mutex_lock(&mutex_scheduler);		// -- Critical section begin --
    int cpu = GetCpuId();				// Find out which CPU is running this code (using the TSS trick)
    PsGetNext(cpu);
    mutex_unlock(&mutex_scheduler);		// -- Critical section end --
}

void mutex_lock(int *lock)
{
    while (test_and_set(1, lock)) __asm__ __volatile__ ("rep; nop\n");
}

void mutex_unlock(int *lock)
{
    *lock = 0;
}

int test_and_set(int value, int *ptr)
{
    __asm__ __volatile__ ("xchgl %%eax,(%%edx) ": "=a"(value) : "a"(value), "d"(ptr));
}
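
A few things in the code above stand out, though none of them is confirmed as the cause of the GPF: test_and_set() is declared to return int but has no return statement, the inline asm has no "memory" clobber (so the compiler is free to move memory accesses across it), and mutex_unlock() writes through a plain, non-volatile pointer. The sketch below shows one way to write the same spinlock using the GCC/Clang __atomic builtins instead of the hand-written xchgl. It keeps the original function names, but the cpu_relax() helper and the choice of builtins are assumptions for illustration, not the poster's eventual fix.

```c
/* Sketch only: GCC/Clang __atomic builtins stand in for the hand-written xchgl. */

/* Spin-wait hint; "pause" has the same encoding as "rep; nop" on x86. */
static inline void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

/* Atomically store `value` in *ptr and return what was there before. */
static inline int test_and_set(int value, volatile int *ptr)
{
    return __atomic_exchange_n(ptr, value, __ATOMIC_ACQUIRE);
}

static inline void mutex_lock(volatile int *lock)
{
    while (test_and_set(1, lock)) {
        /* Wait on plain loads until the lock looks free, then retry. */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))
            cpu_relax();
    }
}

static inline void mutex_unlock(volatile int *lock)
{
    /* Release ordering keeps the critical section's writes before the unlock. */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

Because the lock is taken inside an interrupt handler, it is also worth checking that nothing can take the same lock on the same CPU while it is already held there, since that would spin forever.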

Post by gravaera »

thomasnilsen wrote: Thanks for reply guys,

I really have no clue what this is.. Singlestepping the whole thing is probably next step.. Arrrggghhh.
I laughed a little here :wink:
Debugging is only half the fun, I suppose.

-Good luck,
gravaera