OSDev.org

Posted: **Fri Jul 15, 2011 2:45 pm**

Hello, OSDevers!

I've run into a question: what should happen when (in my case) 3 CPUs/cores send an IPI to the same LAPIC and interrupt vector simultaneously (or almost simultaneously)? To my understanding, the target CPU should receive the interrupt (and call the corresponding interrupt handler) 3 times in a row. My kernel, however, disagrees, and I receive only one interrupt. I tried it on QEMU and on a real quad-core computer; the results are the same. When I add a slight delay (around 1 ms, different for each core), the problem disappears.

Basically it's an AP wake-up process: the BSP sets up the trampoline environment, sends the usual INIT-SIPI-SIPI (in my case INIT1-INIT2-INIT3-delay-SIPI1-SIPI2-SIPI3, no second SIPI), and then the APs rush to Long Mode. Once they are there and have initialized enough, they send an IPI back to the BSP, saying "I'm alive". The BSP waits for those IPIs and, when all 3 have been received, proceeds to clean up the trampoline and move on to the scheduler.

AFAIK, the other cores wake up and are set up correctly: they can print their LAPIC IDs, have unique stacks, and can use their own LAPIC timers (to add the delay I mentioned before). It looks like a race condition, so I tried to protect the whole IPI-sending routine with spinlocks. No effect.

Of course I could come up with a different way to detect when my APs are up: spin on a (spinlock-protected) variable, use a delay on the BSP, or keep those debugging delays on the APs. But my current approach raises a few questions: is my assumption of 3 interrupts in a row wrong, and are those IPIs somehow aggregated into one? Or is there a bug somewhere in my code?

I have read Intel's manuals on the APIC a couple of times and still have no clue. Please advise.

Posted: **Fri Jul 15, 2011 5:42 pm**

Hi,

Velko wrote:I've run into a question: what should happen when (in my case) 3 CPUs/cores send an IPI to the same LAPIC and interrupt vector simultaneously (or almost simultaneously)? To my understanding, the target CPU should receive the interrupt (and call the corresponding interrupt handler) 3 times in a row.

You could get anywhere from one to 3 interrupts, depending on exact timing.

For one IPI, think of it as 4 steps:

1. Local APIC receives the IPI and sets the corresponding flag in its "Interrupt Request Register" (IRR), then
2. Local APIC searches for the highest priority set flag in its "Interrupt Request Register", then
3. Local APIC handles the interrupt, by:
   - Sending the interrupt to the CPU core
   - Clearing the flag in its "Interrupt Request Register"
   - Setting the flag in its "In-Service Register" (ISR)
4. CPU sends EOI to local APIC, which clears the flag in its "In-Service Register" (and causes it to check for the highest priority set flag in its "Interrupt Request Register" again)

Now consider what happens if more interrupts (for the same vector) are received before the flag in the "Interrupt Request Register" is cleared:

1. Local APIC receives the first IPI and sets the corresponding flag in its "Interrupt Request Register", then
2. Local APIC receives the second IPI and sets the corresponding flag in its "Interrupt Request Register" (but it's already set), then
3. Local APIC receives the third IPI and sets the corresponding flag in its "Interrupt Request Register" (but it's already set), then
4. Local APIC searches for the highest priority set flag in its "Interrupt Request Register", then
5. Local APIC handles the interrupt, by:
   - Sending the interrupt to the CPU core
   - Clearing the flag in its "Interrupt Request Register"
   - Setting the flag in its "In-Service Register"
6. CPU sends EOI to local APIC, which clears the flag in its "In-Service Register" (and causes it to check for the highest priority set flag in its "Interrupt Request Register" again)

In this case you only get one interrupt.

Now consider what happens if more interrupts (for the same vector) are received before the flag in the "In Service Register" is cleared:

1. Local APIC receives the first IPI and sets the corresponding flag in its "Interrupt Request Register", then
2. Local APIC searches for the highest priority set flag in its "Interrupt Request Register", then
3. Local APIC handles the interrupt, by:
   - Sending the interrupt to the CPU core
   - Clearing the flag in its "Interrupt Request Register"
   - Setting the flag in its "In-Service Register"
4. Local APIC receives the second IPI and sets the corresponding flag in its "Interrupt Request Register", then
5. Local APIC receives the third IPI and sets the corresponding flag in its "Interrupt Request Register" (but it's already set), then
6. CPU sends EOI to local APIC, which clears the flag in its "In-Service Register"
7. Local APIC searches for the highest priority set flag in its "Interrupt Request Register", then
8. Local APIC handles the interrupt, by:
   - Sending the interrupt to the CPU core
   - Clearing the flag in its "Interrupt Request Register"
   - Setting the flag in its "In-Service Register"
9. CPU sends EOI to local APIC, which clears the flag in its "In-Service Register" (and causes it to check for the highest priority set flag in its "Interrupt Request Register" again)

In this case you get 2 interrupts.
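
The two scenarios above can be replayed with a toy model. The sketch below is only an illustration, not how any real APIC is implemented: it tracks a single vector with one pending bit and one in-service bit, and all names (`lapic_vector`, `lapic_receive_ipi`, and so on) are made up for this example.

```c
#include <stdbool.h>

/* Toy model of ONE vector in a local APIC (hypothetical names).
 * Real hardware keeps 256 bits each for the request and in-service registers. */
struct lapic_vector {
    bool irr;            /* request pending */
    bool isr;            /* handler entered, EOI not yet sent */
    unsigned delivered;  /* number of times the CPU was interrupted */
};

/* An IPI arrives: the pending bit is set. If it was already set,
 * nothing changes, which is where IPIs get merged. */
static void lapic_receive_ipi(struct lapic_vector *v)
{
    v->irr = true;
}

/* Deliver to the CPU if something is pending and the same vector
 * is not already being serviced. */
static void lapic_dispatch(struct lapic_vector *v)
{
    if (v->irr && !v->isr) {
        v->irr = false;
        v->isr = true;
        v->delivered++;
    }
}

/* The CPU signals end-of-interrupt; the APIC then looks for pending work. */
static void lapic_eoi(struct lapic_vector *v)
{
    v->isr = false;
    lapic_dispatch(v);
}
```

Calling `lapic_receive_ipi` three times before the first `lapic_dispatch` leaves `delivered` at 1; letting the second and third IPI arrive between dispatch and EOI leaves it at 2. The count of interrupts is therefore not a reliable count of senders.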

Velko wrote:Basically it's an AP wake-up process: the BSP sets up the trampoline environment, sends the usual INIT-SIPI-SIPI (in my case INIT1-INIT2-INIT3-delay-SIPI1-SIPI2-SIPI3, no second SIPI), and then the APs rush to Long Mode.

The extra INIT IPIs and the extra SIPI IPIs are a waste of time and probably do more harm than good. Also, on a lot of CPUs the second SIPI isn't needed, and they begin executing instructions after the first SIPI. This can lead to problems. For example, if the AP increments a "number of CPUs started" counter, it could increment the counter after receiving the first SIPI, increment it again after the second SIPI (and again after the third), and you end up thinking there are more CPUs than there really are.

Velko wrote:Once they are there and have initialized enough, they send an IPI back to the BSP, saying "I'm alive". The BSP waits for those IPIs and, when all 3 have been received, proceeds to clean up the trampoline and move on to the scheduler.

You're saying that the timing is so exact that the BSP only receives one IPI. The only way that is possible is if you're broadcasting the "INIT-SIPI-SIPI" sequence to all CPUs at the same time. DO NOT broadcast the "INIT-SIPI-SIPI" sequence to all CPUs at the same time. Broadcasting starts every CPU, including CPUs that the user disabled (typical for CPUs with hyper-threading where the user disabled hyper-threading in the BIOS) and faulty CPUs that failed testing, and is therefore wrong and dodgy (unless you're writing firmware and not an OS). You must only attempt to start CPUs that the firmware listed in the ACPI "APIC" table (the MADT) or the "MultiProcessor Specification" table (and not any others that might be present but aren't listed); and the only way to do that is to send the "INIT-SIPI-SIPI" sequence to each CPU separately.
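
As a rough illustration of "only CPUs the firmware listed", the sketch below walks the entries of an ACPI MADT (the table with signature "APIC") and collects the APIC IDs of entries whose Enabled flag is set. It assumes the table has already been located and mapped, handles only the 8-byte "Processor Local APIC" entry (type 0), and ignores x2APIC entries. The function name and parameters are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* 36-byte SDT header + 4-byte local APIC address + 4-byte flags */
#define MADT_ENTRIES_OFFSET  44u
#define MADT_TYPE_LOCAL_APIC 0u
#define MADT_LAPIC_ENABLED   0x1u

/* Hypothetical helper: copies the APIC IDs of enabled type-0 entries
 * into out[] and returns how many were written (at most max_out). */
static size_t madt_enabled_apic_ids(const uint8_t *madt, size_t madt_len,
                                    uint8_t *out, size_t max_out)
{
    size_t count = 0;
    size_t off = MADT_ENTRIES_OFFSET;

    while (off + 2 <= madt_len) {
        uint8_t type = madt[off];      /* entry type */
        uint8_t len  = madt[off + 1];  /* entry length, header included */

        if (len < 2 || off + len > madt_len)
            break;  /* malformed entry: stop instead of looping forever */

        if (type == MADT_TYPE_LOCAL_APIC && len >= 8) {
            uint8_t apic_id = madt[off + 3];
            /* flags are a little-endian 32-bit field at bytes 4..7 */
            uint32_t flags = (uint32_t)madt[off + 4]
                           | (uint32_t)madt[off + 5] << 8
                           | (uint32_t)madt[off + 6] << 16
                           | (uint32_t)madt[off + 7] << 24;
            if ((flags & MADT_LAPIC_ENABLED) && count < max_out)
                out[count++] = apic_id;
        }
        off += len;
    }
    return count;
}
```

The BSP's own APIC ID also appears in this list, so the caller would skip it before sending INIT and SIPI to the remaining IDs one at a time.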

The correct way to do it is something like:

Code: Select all

    for(each AP mentioned by BIOS) {
        AP_status = NOT_STARTED;
        send_INIT_to_AP();
        wait(10ms);
        send_SIPI_to_AP();
        timeout_remaining = 5ms;    /* counts down by itself, e.g. read from a timer */
        while( timeout_remaining > 0) {
            if(AP_status == STARTED) goto started;
        }
        send_SIPI_to_AP();          /* second SIPI only if the first got no response */
        timeout_remaining = 10ms;
        while( timeout_remaining > 0) {
            if(AP_status == STARTED) goto started;
        }
        printf("AP CPU failed to start\n");
        continue;

started:
        AP_status = ACKNOWLEDGED;   /* releases the AP from its wait loop */
    }

The AP CPUs would do something like:

Code: Select all

AP_init () {
    AP_status = STARTED;            /* tell the BSP that this AP is running */
    while(AP_status != ACKNOWLEDGED) { /* Do nothing */ }

    /* Start CPU initialisation here */
}

In this pseudo-code, "AP_status" would be a volatile variable that must be accessed atomically, and "timeout_remaining" would be something that decreases over time (e.g. a value derived from the local APIC timer's "current count" register).
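
One possible way to write the "poll AP_status until a timeout expires" loop in C11 is sketched below. It is an assumption-laden example, not the code from this post: the time source is passed in as a callback (`clock_us_fn`, hypothetical), and `fake_now_us` is a host-side stand-in that simply advances 100 µs per call so the sketch can be run outside a kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum ap_state { AP_NOT_STARTED, AP_STARTED, AP_ACKNOWLEDGED };

/* Hypothetical time source: monotonically increasing microseconds. */
typedef uint64_t (*clock_us_fn)(void);

/* Poll *status until it reads AP_STARTED or timeout_us has elapsed.
 * Returns true if the AP reported in before the timeout. */
static bool wait_for_ap_started(atomic_int *status, clock_us_fn now_us,
                                uint64_t timeout_us)
{
    uint64_t start = now_us();
    do {
        if (atomic_load_explicit(status, memory_order_acquire) == AP_STARTED)
            return true;
    } while (now_us() - start < timeout_us);
    return false;
}

/* Stand-in clock for trying the sketch on a host: each call "takes" 100 us.
 * A kernel would read a real timer here instead. */
static uint64_t fake_now_us(void)
{
    static uint64_t t;
    t += 100;
    return t;
}
```

In the BSP loop this would be called once with a 5 ms budget after the first SIPI and, only if it returns false, again with 10 ms after the second SIPI.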

Finally; if each CPU takes 11 ms to start and you've got 127 CPUs to start, then it'd take 1397 ms to start all of them. That's a significant increase in boot times. This can be improved a lot by doing it in parallel. For example, the first CPU could start the second CPU; then the first and second CPUs could start the third and fourth CPUs; then all 4 started CPUs could start 4 more CPUs; then 8 CPUs start 8 more CPUs, etc. For parallel startup, if each CPU takes 11 ms to start and you've got 127 CPUs to start, then it'd take 77 ms to start all of them. Of course once you start looking at lots of CPUs you'd also need to consider supporting x2APIC.
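
The two figures can be checked with a few lines of C. This is only the arithmetic from the paragraph above, under its assumption that every started CPU (BSP included) starts exactly one more CPU per 11 ms round; the function names are made up.

```c
#include <stdint.h>

/* One CPU at a time: every AP costs a full startup period. */
static uint32_t sequential_startup_ms(uint32_t ap_count, uint32_t per_cpu_ms)
{
    return ap_count * per_cpu_ms;
}

/* Doubling scheme: each round, every running CPU starts one more CPU,
 * so the number of running CPUs doubles until all are up. */
static uint32_t doubling_startup_ms(uint32_t ap_count, uint32_t per_cpu_ms)
{
    uint32_t total = ap_count + 1;  /* APs plus the BSP */
    uint32_t running = 1;           /* only the BSP at first */
    uint32_t rounds = 0;

    while (running < total) {
        running *= 2;
        rounds++;
    }
    return rounds * per_cpu_ms;
}
```

With 127 APs and 11 ms each, the first function gives 1397 ms and the second gives 77 ms (7 rounds: 1, 2, 4, ... 128 CPUs).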

Cheers,

Brendan

Posted: **Sat Jul 16, 2011 6:19 am**

Thanks for detailed explanation, Brendan!

Brendan wrote:
1. Local APIC receives the first IPI and sets the corresponding flag in its "Interrupt Request Register", then
2. Local APIC receives the second IPI and sets the corresponding flag in its "Interrupt Request Register" (but it's already set), then
3. Local APIC receives the third IPI and sets the corresponding flag in its "Interrupt Request Register" (but it's already set), then

Well, that settles it. I thought that the local APIC would not accept the second and third IPIs until it was done with the first one. I understand now.

Brendan wrote:The extra INIT IPIs and the extra SIPI IPIs are a waste of time and probably do more harm than good.
...
DO NOT broadcast the "INIT-SIPI-SIPI" sequence to all CPUs at the same time

I guess I did not make clear what the "INIT1-INIT2-INIT3-delay-SIPI1-SIPI2-SIPI3, no second SIPI" sequence means. It turns out it is not that "usual" after all.

I am not broadcasting INIT-SIPI-SIPI. I am, however, firing them at each AP in rapid succession.

Pseudocode:

Code: Select all

foreach(AP in BIOS) {
        send_INIT($AP);
}
wait(10ms);
foreach(AP in BIOS) {
        send_SIPI($AP);
}
wait_For_Woke_IPIs_Or_TimeOut();
/* no second SIPI */
cleanupTrampoline();
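
The thread does not show what `send_INIT` and `send_SIPI` do internally. A minimal sketch for xAPIC (memory-mapped) mode might look like the following. The function names, the `lapic_base` pointer and the trampoline address are assumptions for illustration; the register offsets (0x300 and 0x310) and delivery-mode encodings follow the Interrupt Command Register layout in the Intel SDM.

```c
#include <stdint.h>

#define LAPIC_ICR_LOW        0x300u      /* writing this register sends the IPI */
#define LAPIC_ICR_HIGH       0x310u      /* destination APIC ID in bits 24..31 */
#define ICR_DELIVERY_INIT    (5u << 8)
#define ICR_DELIVERY_STARTUP (6u << 8)
#define ICR_LEVEL_ASSERT     (1u << 14)

static uint32_t icr_high_for(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

static uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* The SIPI vector is the 4 KiB page number of the trampoline,
 * so the trampoline must be page-aligned and below 1 MiB. */
static uint32_t icr_low_sipi(uint32_t trampoline_phys)
{
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT
         | ((trampoline_phys >> 12) & 0xFFu);
}

/* Hypothetical helper: lapic_base points at the mapped local APIC registers. */
static void lapic_send_ipi(volatile uint32_t *lapic_base, uint8_t apic_id,
                           uint32_t icr_low)
{
    lapic_base[LAPIC_ICR_HIGH / 4] = icr_high_for(apic_id);  /* target first */
    lapic_base[LAPIC_ICR_LOW / 4]  = icr_low;                /* then trigger */
}
```

A real implementation would also wait for the delivery status bit (bit 12 of the low register) to clear before sending the next IPI; that step is left out here to keep the sketch short.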

The APs then wake up (almost) simultaneously, run into some spinlocks on their way (which probably synchronizes them even more), and finally send Woke_IPI back to the BSP.

That was my idea for improving startup times: why wait if you can fire some more IPIs in the meantime? It seems to work fine, except for that Woke_IPI thing. But now that I know what causes it, it's not that hard to work around. Also, I should probably implement an array of AP_status or something similar to see whether a second SIPI is needed.
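
One way such an AP_status array could look is sketched below, as an assumption rather than Velko's actual code. Each AP sets its own flag before (optionally) sending Woke_IPI, and the BSP counts flags instead of counting interrupts, so merged IPIs no longer lose information. `MAX_APS` and the function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_APS 255  /* assumed upper bound for this sketch */

/* One flag per AP; static storage starts out as all false. */
static atomic_bool ap_alive[MAX_APS];

/* Called by an AP once it has finished its early initialization. */
static void ap_report_alive(size_t ap_index)
{
    if (ap_index < MAX_APS)
        atomic_store_explicit(&ap_alive[ap_index], true, memory_order_release);
}

/* Called by the BSP (from its wait loop or the Woke_IPI handler):
 * counts how many of the first ap_count APs have reported in. */
static size_t bsp_count_alive(size_t ap_count)
{
    size_t alive = 0;
    for (size_t i = 0; i < ap_count && i < MAX_APS; i++) {
        if (atomic_load_explicit(&ap_alive[i], memory_order_acquire))
            alive++;
    }
    return alive;
}
```

Any AP whose flag is still clear when the timeout expires is a candidate for the second SIPI.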

But if you think my AP wake-up sequence is not such a good idea, I'll go back to starting them one by one.

Thanks again,
Velko

Posted: **Mon Jul 18, 2011 11:18 am**

Velko wrote: Pseudocode:

Code: Select all

foreach(AP in BIOS) {
        send_INIT($AP);
}
wait(10ms);
foreach(AP in BIOS) {
        send_SIPI($AP);
}
wait_For_Woke_IPIs_Or_TimeOut();
/* no second SIPI */
cleanupTrampoline();

That works quite well and cuts the bootup time by a good amount (especially on a system with 16 cores). Pure64 used to send the INIT IPI to the first core, wait 10 ms, send the SIPI, wait 2 ms, and then repeat for each of the other APs. That time adds up with multiple cores, and you could see the pause during bootup. I have adopted the method that you detailed above with no issues. Thanks for posting this.

-Ian

Simultaneous IPIs to same target