SMP initialization?

bzt · Post by **bzt** » Mon Feb 01, 2021 4:29 am

This is going to be one of the very rare occasions when I ask a question instead of answering one.

Does anybody have a reliable SMP implementation? I'm asking because there are no clear-cut instructions in the MP spec, and even worse, the spec changed significantly over time. I've checked a lot of source code, and there seems to be no consensus on how to do it properly: one sets the warm reset code in CMOS, others don't; one sets up MSRs, others don't; some mask the LAPIC registers, others don't; some disable NMI, others don't; some have wait loops for delivery, others don't; some implement delays, others don't. One in particular even writes to mysterious IO ports, and it also does an LAPIC ID read after every single LAPIC register write, which I haven't seen in any other code. I've studied the Linux kernel, but its SMP code is a real mess, with contradicting implementations and callbacks that are hard to follow. It's almost impossible to figure out which method is used and which functions are called on a particular machine during boot. The comments don't help either: one comment, for example, says that delays are not needed (right above the code that does the delay, of course).

My current SMP code works most of the time: on all virtual machines, and on all my testbeds. BUT. From time to time an issue pops up where, on certain hardware, one boot out of 20 fails and the machine freezes or reboots instead. I couldn't figure out exactly why. And especially, why does it only fail occasionally when all the parameters are supposedly the same?

So, does anyone have a good tutorial / source code / specification which clearly describes what to do? I mean something that says: if your machine is older than X, follow these steps; if it's model Y, do this; otherwise do Z. When do we need to set up CMOS registers? Probably not on UEFI, which doesn't have those. When is there a need to configure MSRs? Which CPU families require the delays, and which ones don't? etc. Anything a bit more readable than the Linux source would do.

Thanks,
bzt

8infy · Post by **8infy** » Mon Feb 01, 2021 4:42 am

bzt wrote:one in particular even writes to mysterious IO ports too.

Those aren't mysterious IO ports, they're clearly described in the MP specification.

bzt · Post by **bzt** » Mon Feb 01, 2021 5:07 am

8infy wrote:Those aren't mysterious IO ports, they're clearly described in the MP specification.

I know what IMCR is. What I meant is: when do you have to set it? What if there's no PIC in the machine in the first place (only emulated by the IOAPIC)? Do you still have to write the IMCR if you have disabled the IOAPIC? Does the UEFI firmware set it up? Should it? My machine doesn't have MP tables, only ACPI; does MADT.flags & 1 mean the same thing as PCMP.features & 0x80? These are exactly the kind of unanswered questions that make the MP spec and the whole of SMP so fuzzy...

So is there a complete guide / tutorial / source code / fan-made doc with all the details for contemporary machines?

Cheers,
bzt

8infy · Post by **8infy** » Mon Feb 01, 2021 5:37 am

bzt wrote:Do you still have to write the IMCR if you have disabled the IOAPIC?

Yes, because IMCR is about LAPIC/PIC, not IOAPIC.

bzt wrote:What I meant is: when do you have to set it? What if there's no PIC in the machine in the first place (only emulated by the IOAPIC)? Do you still have to write the IMCR if you have disabled the IOAPIC? Does the UEFI firmware set it up? Should it?

You have to set it only if you use MP tables and you find that bit 0x80 is set. I don't think I've ever seen a machine that has that bit set (most have placeholder MP tables with missing cores, the default configuration bit set, or missing interrupt assignments).

bzt wrote:does MADT.flags & 1 mean the same thing as PCMP.features & 0x80?

No. To answer this question let's look at the ACPI spec 6.4, which says:

A one indicates that the system also has a PC-AT-compatible
dual-8259 setup. The 8259 vectors must be disabled (that is,
masked) when enabling the ACPI APIC operation.

As you can see, there isn't a single mention of the IMCR and I tend to believe specifications. (Perhaps if your machine has ACPI it for sure doesn't have IMCR, but that's just my guess)
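To make the two checks discussed above concrete, here is a minimal user-space sketch, not a definitive implementation. It assumes the IMCRP bit is bit 7 of MP feature byte 2 and uses the IMCR ports 22h/23h from the MP specification and the usual 8259 data ports 21h/A1h. The function name, the `outb_fn` hook and the recording stub are hypothetical, and treating a machine without a MADT as having 8259s is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MP_FEATURE2_IMCRP 0x80u /* MP feature byte 2, bit 7: IMCR present */
#define MADT_PCAT_COMPAT  0x01u /* ACPI MADT Flags, bit 0: dual 8259 present */

/* Hypothetical port-write hook; a kernel would pass its real outb here. */
typedef void (*outb_fn)(uint16_t port, uint8_t value);

/* Recording stub so the sketch can be exercised in user space. */
struct io_write { uint16_t port; uint8_t value; };
struct io_write io_log[8];
int io_count;

void record_outb(uint16_t port, uint8_t value)
{
    if (io_count < 8) {
        io_log[io_count].port = port;
        io_log[io_count].value = value;
        io_count++;
    }
}

void setup_legacy_interrupt_routing(bool have_mp_table, uint8_t mp_feature2,
                                    bool have_madt, uint32_t madt_flags,
                                    outb_fn outb)
{
    /* Assumption: a machine without a MADT is treated as having 8259s. */
    if (!have_madt || (madt_flags & MADT_PCAT_COMPAT)) {
        outb(0x21, 0xFF); /* mask all IRQs on the master 8259 */
        outb(0xA1, 0xFF); /* mask all IRQs on the slave 8259 */
    }
    /* The IMCR is touched only when the MP table reports it (IMCRP). */
    if (have_mp_table && (mp_feature2 & MP_FEATURE2_IMCRP)) {
        outb(0x22, 0x70); /* select the IMCR */
        outb(0x23, 0x01); /* route INTR/NMI through the APIC */
    }
}
```

On an ACPI-only machine this reduces to masking the 8259s when MADT.flags bit 0 is set, which matches the ACPI text quoted above; the IMCR branch never runs without an MP table.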

thewrongchristian · Post by **thewrongchristian** » Mon Feb 01, 2021 6:07 am

bzt wrote: I've checked a lot of source code, and there seems to be no consensus on how to do it properly: one sets the warm reset code in CMOS, others don't; one sets up MSRs, others don't; some mask the LAPIC registers, others don't; some disable NMI, others don't; some have wait loops for delivery, others don't; some implement delays, others don't. One in particular even writes to mysterious IO ports, and it also does an LAPIC ID read after every single LAPIC register write, which I haven't seen in any other code.

As I understand it, xv6's SMP implementation was lifted from Plan 9:

https://github.com/mit-pdos/xv6-public/blob/master/README wrote: xv6 borrows code from the following sources:
JOS (asm.h, elf.h, mmu.h, bootasm.S, ide.c, console.c, and others)
Plan 9 (entryother.S, mp.h, mp.c, lapic.c)
FreeBSD (ioapic.c)
NetBSD (console.c)

Whether the Plan 9 implementation works any more reliably than your implementation, I couldn't say. But it might be worth looking at.

I suspect that, as and when I tackle SMP, I won't be using Linux as a reference implementation.

I find NetBSD's source code easy to read and navigate.

As to a canonical source of how to "do it properly", I suspect you'll be disappointed. The Linux source is probably as messy as it is as a result of people adding tweaks to address the very issues you've highlighted.

feryno · Post by **feryno** » Mon Feb 01, 2021 12:45 pm

I have something to share about the delays used in SMP initialization.
The delays in the CPU manuals are too long; modern CPUs manage to perform the init much faster than old CPUs. Shortening the delays on modern CPUs makes startup faster (boot, but also resume from ACPI sleep states). There is also another speed-up approach: the bootstrap CPU does not initialize all application CPUs but only a few of them, and these activated APs activate other APs (a principle similar to an avalanche, a nuclear chain reaction or a branching tree).
I've been a hypervisor developer for about the past 10 years, which is somewhat similar to OS development. The first versions of the hypervisor were loaded from a running OS; later I developed loading before the OS (using UEFI / BIOS). So the hypervisor is loaded first (this includes using the UEFI MP protocols to start the APs, or the good old way of sending INIT-SIPI if UEFI fails), and this always ran flawlessly. Then the OS is loaded, which initializes the APs again early on; this again ran flawlessly, e.g. on Fedora 22 (so this experience of mine is a few years old). But later, when testing Fedora 25, the OS ended up running on a single CPU: the application CPUs failed to activate, and I saw error messages during OS startup like "smpboot: do_boot_cpu failed(-1) to wakeup CPU#1" (with something like a 20 second delay for each AP).
So I compared the kernels used in the two versions, and this important thing changed. The older kernel had:

apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
mdelay(10);
apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

and the newer one has:

apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
udelay(init_udelay);
apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

and also this important code:

/*
 * The Multiprocessor Specification 1.4 (1997) example code suggests
 * that there should be a 10ms delay between the BSP asserting INIT
 * and de-asserting INIT, when starting a remote processor.
 * But that slows boot and resume on modern processors, which include
 * many cores and don't require that delay.
 *
 * Cmdline "init_cpu_udelay=" is available to over-ride this delay.
 * Modern processor families are quirked to remove the delay entirely.
 */
#define UDELAY_10MS_DEFAULT 10000

static unsigned int init_udelay = UINT_MAX;

static int __init cpu_init_udelay(char *str)
{
    get_option(&str, &init_udelay);

    return 0;
}
early_param("cpu_init_udelay", cpu_init_udelay);

static void __init smp_quirk_init_udelay(void)
{
    /* if cmdline changed it from default, leave it alone */
    if (init_udelay != UINT_MAX)
        return;

    /* if modern processor, use no delay */
    if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
        ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF))) {
        init_udelay = 0;
        return;
    }
    /* else, use legacy delay */
    init_udelay = UDELAY_10MS_DEFAULT;
}

So the fix was trivial: setting cpu_init_udelay=10 as a boot parameter.
Even cpu_init_udelay=1 always worked on all test PCs.
Setting cpu_init_udelay=0 again caused the legendary error message "smpboot: do_boot_cpu failed(-1) to wakeup CPU#1".

While cpu_init_udelay=0 ran flawlessly on bare metal, the CPU did not manage to process the INIT in time when running under virtualization: a VM exit slowed down the initialization of the APs, so the BSP did not wait long enough and fired the INIT de-assert while the AP had still not finished processing the INIT assert. The INIT de-assert was necessary on very old CPUs, but it still persists in CPU manuals and therefore in OS kernels.

So yes, there are possible improvements for modern CPUs, which are much faster, but such improvements can cause unexpected behavior in specific situations.
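For readers who want to see the sequence outside the kernel, the following is a rough user-space sketch of INIT assert, a configurable delay, INIT de-assert and up to two SIPIs for a single AP. The ICR offsets and bit values are the standard xAPIC ones; everything else (`struct lapic_ops`, `start_ap`, the stub LAPIC, the `alive` flag that the trampoline is assumed to set, and the 10 ms polling budget) is hypothetical and only illustrates why confirming that the AP is running is safer than assuming a zero delay always suffices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LAPIC_ICR_LOW     0x300u      /* xAPIC register offsets */
#define LAPIC_ICR_HIGH    0x310u
#define ICR_DM_INIT       0x00000500u /* delivery mode: INIT */
#define ICR_DM_STARTUP    0x00000600u /* delivery mode: Start-up (SIPI) */
#define ICR_DM_MASK       0x00000700u
#define ICR_SEND_PENDING  0x00001000u /* delivery status */
#define ICR_LEVEL_ASSERT  0x00004000u
#define ICR_TRIG_LEVEL    0x00008000u

/* Hypothetical hardware hooks; a kernel would supply MMIO and timer code. */
struct lapic_ops {
    uint32_t (*read)(uint32_t reg);
    void (*write)(uint32_t reg, uint32_t value);
    void (*udelay)(uint32_t usec);
};

/* User-space stub: logs ICR writes and "boots" the AP on the first SIPI. */
uint32_t icr_log[8];
int icr_count;
uint32_t delay_total_usec;
volatile uint32_t ap_alive;

uint32_t stub_read(uint32_t reg) { (void)reg; return 0; }
void stub_udelay(uint32_t usec) { delay_total_usec += usec; }
void stub_write(uint32_t reg, uint32_t value)
{
    if (reg != LAPIC_ICR_LOW)
        return;
    if (icr_count < 8)
        icr_log[icr_count++] = value;
    if ((value & ICR_DM_MASK) == ICR_DM_STARTUP)
        ap_alive = 1;
}

static void send_ipi(const struct lapic_ops *ops, uint8_t apic_id,
                     uint32_t icr_low)
{
    ops->write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
    ops->write(LAPIC_ICR_LOW, icr_low); /* writing the low dword sends it */
    /* Bounded wait for the delivery-status bit to clear. */
    for (int i = 0; i < 1000 &&
         (ops->read(LAPIC_ICR_LOW) & ICR_SEND_PENDING); i++)
        ops->udelay(1);
}

bool start_ap(const struct lapic_ops *ops, uint8_t apic_id,
              uint32_t trampoline_phys, uint32_t init_udelay,
              volatile uint32_t *alive)
{
    /* The SIPI vector is a page number: 4 KiB aligned, below 1 MiB. */
    if ((trampoline_phys & 0xFFFu) || trampoline_phys >= 0x100000u)
        return false;

    send_ipi(ops, apic_id, ICR_DM_INIT | ICR_TRIG_LEVEL | ICR_LEVEL_ASSERT);
    ops->udelay(init_udelay);
    send_ipi(ops, apic_id, ICR_DM_INIT | ICR_TRIG_LEVEL); /* de-assert */

    for (int attempt = 0; attempt < 2; attempt++) {
        send_ipi(ops, apic_id, ICR_DM_STARTUP | (trampoline_phys >> 12));
        /* Poll the flag the trampoline is assumed to set; ~10 ms budget. */
        for (uint32_t waited = 0; waited < 10000; waited += 10) {
            if (*alive)
                return true;
            ops->udelay(10);
        }
    }
    return *alive != 0;
}
```

With the stub, `start_ap(&ops, 1, 0x8000, 10000, &ap_alive)` writes 0x0000C500, 0x00008500 and 0x00000608 to the ICR low dword. On real hardware the `init_udelay` argument is where the 10 ms versus 0 trade-off described above would be made.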

And also one old curiosity, which is not very useful today because it is suitable only for an OS loaded in real mode by the BIOS via the MBR and is unsuitable for UEFI:
I saw an example where the SMP initialization was done not with INIT-SIPI but with NMI. The INIT-SIPI is done by the firmware in the boot phase, when it counts the CPUs and initializes them, excluding defective CPUs/cores or activating only the CPUs enabled by the user in the setup menu (that info is stored in CMOS, or in NVRAM for UEFI with CSM). After this is done, the BSP usually puts the APs into a HLT state with interrupts disabled, from which they can be woken up not only by INIT-SIPI but also by NMI.
So the halted APs were activated from real mode (that's why it is not suitable for UEFI) by hooking interrupt 2 (NMI) in the real-mode IDT and then sending an NMI from the BSP to the APs using the APIC ICR.

bzt · Post by **bzt** » Mon Feb 01, 2021 8:55 pm

Thank you guys!

8infy wrote:Perhaps if your machine has ACPI it for sure doesn't have IMCR, but that's just my guess

Yeah, these things like "perhaps" and "my guess" are what I'm after.

It would be good to finally have a reliable code.

/* if modern processor, use no delay */
if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
    ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF))) {
    init_udelay = 0;
    return;
}

This is exactly the kind of information I was looking for.

So no one knows an all-in-one solution, and the breadcrumbs have to be collected from different sources, most notably Plan 9 and NetBSD. Not the answer I was hoping for, but thanks!

Cheers,
bzt

Octocontrabass · Post by **Octocontrabass** » Mon Feb 01, 2021 9:50 pm

bzt wrote:When do we need to set up CMOS registers?

Only on ancient (486/Pentium) hardware that doesn't support SIPI.
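For that legacy case, the setup is small. This is a sketch under stated assumptions rather than a tested recipe for any particular board: it uses the conventional CMOS ports 70h/71h, shutdown status register 0Fh with code 0Ah, and the warm reset vector at 40:67. `set_warm_reset_vector`, the `outb_fn` hook, the stub and the `low_mem` pointer (standing in for however the kernel maps the first kilobytes of physical memory) are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define CMOS_INDEX_PORT      0x70
#define CMOS_DATA_PORT       0x71
#define CMOS_SHUTDOWN_STATUS 0x0F  /* CMOS register holding the shutdown code */
#define SHUTDOWN_JMP_467     0x0A  /* code: jump via the far pointer at 40:67 */
#define WARM_RESET_VECTOR    0x467 /* physical address of that far pointer */

/* Hypothetical port-write hook; a kernel would pass its real outb here. */
typedef void (*outb_fn)(uint16_t port, uint8_t value);

/* User-space stand-ins for the CMOS chip and the first bytes of RAM. */
uint8_t fake_low_mem[0x500];
uint8_t cmos_index, cmos_value;

void stub_outb(uint16_t port, uint8_t value)
{
    if (port == CMOS_INDEX_PORT)
        cmos_index = value;
    else if (port == CMOS_DATA_PORT)
        cmos_value = value;
}

void set_warm_reset_vector(uint8_t *low_mem, uint32_t trampoline_phys,
                           outb_fn outb)
{
    uint16_t segment = (uint16_t)(trampoline_phys >> 4);

    /* Tell the BIOS reset path to jump through 40:67 instead of POSTing. */
    outb(CMOS_INDEX_PORT, CMOS_SHUTDOWN_STATUS);
    outb(CMOS_DATA_PORT, SHUTDOWN_JMP_467);

    /* Real-mode far pointer, little-endian: offset word, then segment word. */
    low_mem[WARM_RESET_VECTOR + 0] = 0;
    low_mem[WARM_RESET_VECTOR + 1] = 0;
    low_mem[WARM_RESET_VECTOR + 2] = (uint8_t)(segment & 0xFF);
    low_mem[WARM_RESET_VECTOR + 3] = (uint8_t)(segment >> 8);
}
```

For a trampoline at physical 0x8000 this stores the far pointer 0800:0000, so a CPU that is reset by INIT and does not understand SIPI still lands in the trampoline.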

bzt wrote:When is there a need to configure MSRs?

When you've changed the BSP MSRs, or when firmware didn't initialize the AP MSRs correctly.

bzt wrote:What CPU family requires the delays, and which one don't?

Just use the same delays for every CPU family. If you want to start APs faster, start them in parallel.
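One way to read "start them in parallel" is to send each IPI to every AP before waiting, so the delays are paid once rather than once per AP. The sketch below illustrates only that idea: the 10 ms and 200 µs values are the commonly cited ones from the MP specification example, while `struct ipi_ops`, `start_aps_batched` and the counting stubs are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICR_INIT_ASSERT 0x00004500u /* INIT, edge-triggered, assert */
#define ICR_STARTUP     0x00000600u /* Start-up IPI; low byte is the vector */

/* Hypothetical hooks: send one IPI to one APIC ID, and busy-wait. */
struct ipi_ops {
    void (*send)(uint8_t apic_id, uint32_t icr_low);
    void (*udelay)(uint32_t usec);
};

/* Counting stubs so the sketch can be exercised in user space. */
int ipis_sent;
uint32_t last_icr;
uint32_t delay_total_usec;

void stub_send(uint8_t apic_id, uint32_t icr_low)
{
    (void)apic_id;
    ipis_sent++;
    last_icr = icr_low;
}
void stub_udelay(uint32_t usec) { delay_total_usec += usec; }

void start_aps_batched(const struct ipi_ops *ops, const uint8_t *apic_ids,
                       size_t count, uint8_t sipi_vector)
{
    size_t i;

    for (i = 0; i < count; i++)
        ops->send(apic_ids[i], ICR_INIT_ASSERT);
    ops->udelay(10000); /* one 10 ms wait shared by all APs */

    for (i = 0; i < count; i++)
        ops->send(apic_ids[i], ICR_STARTUP | sipi_vector);
    ops->udelay(200);   /* one 200 us wait shared by all APs */

    /* Second SIPI for any AP that missed the first one. */
    for (i = 0; i < count; i++)
        ops->send(apic_ids[i], ICR_STARTUP | sipi_vector);
}
```

The total delay stays at 10.2 ms whether there are 3 APs or 300. The cost is that all APs execute the trampoline at the same time, so each one needs its own stack (for example, one selected by its LAPIC ID) before it touches any shared state.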

feryno wrote:using UEFI MP protocols to start at AP CPUs

UEFI explicitly forbids this: you must return control of the APs to firmware before calling ExitBootServices.

feryno wrote:I saw an example where the SMP initialization was not done using INIT-SIPI, but using NMI.

Wow!

I wouldn't trust the BIOS enough to try that!

feryno · Post by **feryno** » Tue Feb 02, 2021 8:39 am

Octocontrabass wrote:
feryno wrote:using UEFI MP protocols to start at AP CPUs
UEFI explicitly forbids this: you must return control of the APs to firmware before calling ExitBootServices.

Yes, exactly: they return control to the firmware using the RET instruction.
I must leave the ExitBootServices call available to the OS, which starts later.

Octocontrabass wrote:
feryno wrote:I saw an example where the SMP initialization was not done using INIT-SIPI, but using NMI.
Wow! I wouldn't trust the BIOS enough to try that!

That was only a curiosity. I came across it when reversing a real-mode sample; it took me some time to understand that it was initializing the APs by hooking real-mode IDT vector 2 (NMI) and then sending an NMI to all APs.
I was quite surprised and somewhat happy to see something unusual... I wouldn't do it that way, I just saw the beauty and a certain level of creativity in it... every CPU manual describes how to do it, but someone tried another way.
I can't find the binary anymore, but I tried to remember the idea and put it in the attached sample (it is just the bare idea: the necessary CPU checks are missing and there is only minimal setup). Don't do it that way; I posted it only so you can enjoy something unusual. You could very likely use this idea in your OS to reset a hung CPU core (but in that case use a watchdog timer for that purpose).

kzinti · Post by **kzinti** » Tue Feb 02, 2021 1:02 pm

feryno wrote:
Octocontrabass wrote:
feryno wrote:using UEFI MP protocols to start at AP CPUs
UEFI explicitly forbids this: you must return control of the APs to firmware before calling ExitBootServices.
yes, exactly, they return control back to the firmware using the RET instruction
I must leave ExitBootServices call available to OS which starts later

It previously sounded like you were suggesting that one can use UEFI to start the APs. But now you are agreeing that the APs need to be returned to the UEFI firmware, and thus they aren't running once your kernel starts.

Is something missing from this picture? Can UEFI help in some way with starting up APs here?

bzt · Post by **bzt** » Tue Feb 02, 2021 1:31 pm

kzinti wrote:
feryno wrote:
Octocontrabass wrote:UEFI explicitly forbids this: you must return control of the APs to firmware before calling ExitBootServices.
yes, exactly, they return control back to the firmware using the RET instruction
I must leave ExitBootServices call available to OS which starts later
It previously sounded like you were suggesting that one can use UEFI to start the APs. But now you are agreeing that the APs need to be returned to the UEFI firmware, and thus they aren't running once your kernel starts.

I actually have implemented that. With the USE_MP_SERVICES define set to 1, BOOTBOOT compiles with support for it. It works on all virtual machines, and on all TianoCore-compliant firmware. But on real machines, there's a 33% chance that calling ExitBootServices with running APs will crash or freeze and not return to your app.

kzinti wrote:Is something missing from this picture? Can UEFI help in some way with starting up APs here?

Sadly no. By adding the requirement that all APs must be stopped when ExitBootServices gets called, the UEFI Forum rendered the whole EFI_MP_SERVICES completely and utterly useless, degrading it to just another piece of unnecessary bloat in the firmware. It would be too good to be true if the firmware could start up the cores reliably for your kernel...

Cheers,
bzt

xeyes · Post by **xeyes** » Tue Feb 02, 2021 2:29 pm

bzt wrote:From time to time, an issue pops up that on a certain hardware, one time out of 20 it doesn't boot up

This sounds like a hardware compatibility/support issue rather than unreliable boot code?

Linux has similar issues. "Veteran" Linux users know that they must wait a few months before putting it on new hardware, especially laptops, so that the devs can blaze the trail and fix severe bugs like "failure to boot" first. "Noob" Linux users instead ask questions like "I bought this newly released laptop and can't get Linux to even boot on it" all over the place.

Only Windows has this somewhat under control, as MSFT has turned the tables and solved it from the other direction. Instead of trying to outguess the hardware OEMs by adding all sorts of "reliability enhancements" to their boot code, they created the "certification sticker". The OEMs need to "work with MSFT" to bring up Windows and do compatibility tests before they get to put the shiny Windows logo stickers on their machines.

If you are really keen on solving this, you can also emulate this process by printing some "certified for bzt" stickers and only putting them on hardware that you've tested and found working, or for which you've made the needed fixes. Then tell your users to submit hardware for "certification" before they attempt to run your stuff on it, otherwise "anything can happen, including but not limited to loss of data, personal injury and failure to boot".

bzt · Post by **bzt** » Tue Feb 02, 2021 4:37 pm

xeyes wrote:
From time to time, an issue pops up that on a certain hardware, one time out of 20 it doesn't boot up
This sounds like a hardware compatibility/support issue rather than boot code unreliable?

Maybe. However, after many painful trial-and-error cycles I've concluded that the issue must be some system state. I mean something I haven't thought of that is left uninitialized. When that thing has a lucky default value, everything works. When it is left with an unlucky value, the boot code freezes on the next reboot. That's why I'm asking these questions: I'd like to figure out what it is that my init code fails to configure.

xeyes wrote:Linux has similar issues. "Veteran" Linux users know that they must wait a few months before putting it on new hardware, especially laptops, so that the devs can blaze the trail and fix severe bugs like "failure to boot" first.

I'm aware, but unfortunately that's no good for me if I can't figure out what hacks they might have added to the Linux kernel.

xeyes wrote: If you are really keen on solving this, you can also emulate this process by printing some "certified for bzt" stickers and only putting them on hardware that you've tested and found working, or for which you've made the needed fixes. Then tell your users to submit hardware for "certification" before they attempt to run your stuff on it, otherwise "anything can happen, including but not limited to loss of data, personal injury and failure to boot".

Good joke

Cheers,
bzt

feryno · Post by **feryno** » Wed Feb 03, 2021 11:10 am

kzinti wrote:It previously sounded like you were suggesting that one can use UEFI to start the APs. But now you are agreeing that the APs need to be returned to the UEFI firmware, and thus they aren't running once your kernel starts.

Is something missing from this picture? Can UEFI help in some way with starting up APs here?

kzinti - yes, exactly: starting the APs using the UEFI service is not very useful for OS development.
I need it to initialize the hypervisor on all CPUs / CPU cores and then give control to the OS loader (Linux, MS Windows, etc.).
