Hi,
kemosparc wrote: Thanks a lot for your helpful comments earlier.
I was able to startup all the cores from Long mode.
I will start on the synchronization and locking mechanisms now.
Be careful - sometimes an AP will start on the first SIPI. This means that if (for example) the AP CPUs increment some sort of "number of CPUs started" counter, then an AP can start on the first SIPI, increment the counter, be reset by the second SIPI, then increment the counter again.
It also means that if the BSP can detect that the AP has started after the first SIPI, then it can skip the second SIPI (and cancel the time delay between sending the first and second SIPIs).
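One way to make the "AP started" bookkeeping safe against an AP that runs its startup code twice is to record a per-CPU bit instead of bumping a counter. The sketch below is only an illustration using C11 atomics; `MAX_CPUS`, `ap_mark_started()` and `cpus_started()` are made-up names, and it assumes each AP already knows a unique index below 64 (for example, one the BSP left in the trampoline).

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_CPUS 64u  /* assumption: one 64-bit mask is enough for this sketch */

/* One bit per CPU; setting a bit twice has no extra effect. */
static _Atomic uint64_t started_mask;

/* Called by an AP early in its startup code.
 * Returns 1 the first time this CPU reports in, 0 if it already had
 * (e.g. because the second SIPI restarted it). cpu_index must be < MAX_CPUS. */
int ap_mark_started(unsigned cpu_index)
{
    uint64_t bit = UINT64_C(1) << cpu_index;
    uint64_t old = atomic_fetch_or(&started_mask, bit);
    return (old & bit) == 0;
}

/* Number of distinct CPUs that have reported in so far. */
unsigned cpus_started(void)
{
    uint64_t mask = atomic_load(&started_mask);
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;  /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Because setting a bit is idempotent, an AP that is restarted by the second SIPI and runs this code again does not change the count.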
Also, you probably want to allocate a stack for the AP CPU, store the "address of your stack" in the trampoline, then start the AP CPU (where the AP CPU gets the "address of your stack" from the trampoline and uses it to set ESP, or RSP in long mode).
Also, it's entirely possible (and not difficult) to switch to protected mode or long mode without using a stack. This makes it easier to allocate a stack for the AP CPU, as the stack can be anywhere (e.g. above 1 MiB, and maybe even in the virtual address space).
This means that the full "AP CPU startup" sequence might go like this:
- BSP allocates stack for AP CPU, and stores the "top of stack" in the trampoline
- BSP sends INIT IPI to AP CPU
- BSP waits for 10 ms
- BSP sends first SIPI IPI to AP CPU
- BSP checks an "AP CPU started" flag in a loop, with a 200 us timeout
- If "AP CPU started" flag was not set:
    - BSP sends second SIPI IPI to AP CPU
    - BSP checks an "AP CPU started" flag in a loop, with a much longer timeout (e.g. 1 second)
    - If "AP CPU started" flag was not set:
        - BSP reports "AP CPU failed to start" error and gives up on this AP
- BSP sets an "AP can continue" flag (so that AP CPU knows BSP has noticed that it started)
- BSP waits for AP to set an "AP ready" flag (or increment a "number of CPUs" counter), so that it knows the AP CPU doesn't need the trampoline any more (and that it's safe to start the next AP CPU).
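The BSP side of that sequence could be sketched roughly like this in C. Everything named here is an assumption made for illustration: the `ap_mailbox` layout, the `ap_hooks` function pointers (a real kernel would program the local APIC and a timer instead), and the `sim_*` stand-ins at the bottom, which only fake an AP so the logic can be exercised on an ordinary hosted compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox kept inside the trampoline. */
struct ap_mailbox {
    uint64_t    stack_top;        /* BSP writes this before sending INIT */
    atomic_bool ap_started;       /* AP -> BSP: "I'm running" */
    atomic_bool ap_can_continue;  /* BSP -> AP: "I noticed you started" */
    atomic_bool ap_ready;         /* AP -> BSP: "trampoline is free" */
};

/* Hypothetical platform hooks (local APIC access and a delay routine). */
struct ap_hooks {
    void (*send_init)(uint32_t apic_id);
    void (*send_sipi)(uint32_t apic_id, uint8_t vector);
    void (*delay_us)(uint32_t us);
};

enum ap_result { AP_OK, AP_FAILED_TO_START };

/* Poll a flag roughly every 10 us until it is set or the timeout expires. */
static bool wait_flag(atomic_bool *flag, const struct ap_hooks *hw, uint32_t timeout_us)
{
    for (uint32_t waited = 0; ; waited += 10) {
        if (atomic_load_explicit(flag, memory_order_acquire)) return true;
        if (waited >= timeout_us) return false;
        hw->delay_us(10);
    }
}

enum ap_result bsp_start_ap(struct ap_mailbox *mb, const struct ap_hooks *hw,
                            uint32_t apic_id, uint8_t vector, uint64_t stack_top)
{
    mb->stack_top = stack_top;                  /* "top of stack" for this AP */
    atomic_store(&mb->ap_started, false);
    atomic_store(&mb->ap_can_continue, false);
    atomic_store(&mb->ap_ready, false);

    hw->send_init(apic_id);
    hw->delay_us(10000);                        /* 10 ms after INIT */
    hw->send_sipi(apic_id, vector);             /* first SIPI */
    if (!wait_flag(&mb->ap_started, hw, 200)) { /* 200 us timeout */
        hw->send_sipi(apic_id, vector);         /* second SIPI only if needed */
        if (!wait_flag(&mb->ap_started, hw, 1000000))
            return AP_FAILED_TO_START;          /* give up on this AP */
    }
    atomic_store_explicit(&mb->ap_can_continue, true, memory_order_release);
    while (!atomic_load_explicit(&mb->ap_ready, memory_order_acquire))
        hw->delay_us(10);                       /* wait until trampoline is free */
    return AP_OK;
}

/* ---- Hosted stand-ins that fake an AP; not part of a real kernel ---- */
struct ap_mailbox *sim_mb;   /* mailbox the fake AP writes to */
int sim_sipi_count;          /* SIPIs seen since the last INIT */
int sim_start_on_sipi;       /* which SIPI the fake AP reacts to (0 = never) */

static void sim_send_init(uint32_t id) { (void)id; sim_sipi_count = 0; }
static void sim_send_sipi(uint32_t id, uint8_t vector)
{
    (void)id; (void)vector;
    if (++sim_sipi_count == sim_start_on_sipi)
        atomic_store(&sim_mb->ap_started, true);
}
static void sim_delay_us(uint32_t us)
{
    (void)us;  /* the fake AP reports "ready" once the BSP has released it */
    if (atomic_load(&sim_mb->ap_can_continue))
        atomic_store(&sim_mb->ap_ready, true);
}
const struct ap_hooks sim_hooks = { sim_send_init, sim_send_sipi, sim_delay_us };
```

Note that the second SIPI is only sent when the first 200 us wait times out, so an AP that starts on the first SIPI is never restarted.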
The AP CPU goes a little like this:
- AP sets the "AP CPU started" flag to tell BSP it's running
- AP waits for BSP to set the "AP can continue" flag
- AP switches to protected mode or long mode (possibly including enabling paging)
- AP loads its stack from the trampoline
- AP sets an "AP ready" flag (or increments a "number of CPUs" counter), so BSP knows it's finished using the trampoline
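The ordering on the AP side can be modelled like this. In a real trampoline the first steps would be 16-bit assembly, so treat the C below purely as a sketch of the ordering; the `ap_mailbox` layout, `ap_handshake()` and the `enter_final_mode` callback are all hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox kept inside the trampoline. */
struct ap_mailbox {
    uint64_t    stack_top;        /* written by BSP before the AP is started */
    atomic_bool ap_started;       /* AP -> BSP */
    atomic_bool ap_can_continue;  /* BSP -> AP */
    atomic_bool ap_ready;         /* AP -> BSP */
};

/* Runs the AP's half of the handshake and returns the stack top the AP
 * should load into its stack pointer. enter_final_mode stands in for the
 * switch to protected mode or long mode and may be NULL in this model. */
uint64_t ap_handshake(struct ap_mailbox *mb, void (*enter_final_mode)(void))
{
    /* Tell the BSP we're running. */
    atomic_store_explicit(&mb->ap_started, true, memory_order_release);

    /* Wait until the BSP has noticed. */
    while (!atomic_load_explicit(&mb->ap_can_continue, memory_order_acquire)) {
        /* spin; real code would use PAUSE here */
    }

    if (enter_final_mode)
        enter_final_mode();

    /* Copy the stack address out of the trampoline before releasing it. */
    uint64_t stack_top = mb->stack_top;

    /* From here on the BSP may reuse the trampoline for the next AP. */
    atomic_store_explicit(&mb->ap_ready, true, memory_order_release);
    return stack_top;
}
```

The important detail is that the stack address is read before "AP ready" is set, because after that point the BSP is free to overwrite the trampoline.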
When there are a large number of CPUs, it can take a while to start them all one at a time. For example, with 32 CPUs it could take about a third of a second (and with 256 CPUs it could take about 2.7 seconds). There are 2 ways to reduce that.
For the first way, the 10 ms delay after sending the INIT IPI doesn't need to be that exact. This means that you could send the INIT IPI to (up to) 16 CPUs; then have one 10 ms delay; then do the remainder of the startup sequence one AP CPU at a time. This would mean that for 32 CPUs it'd take about 2*10ms + 32*400us = 32.8 ms (and for 256 CPUs it'd take about 262 ms).
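The arithmetic for the one-at-a-time case and for the batched INIT case can be written out as a small C sketch. The function and macro names are invented here, and the 400 us per-CPU figure is simply the one used in the estimate above, not a measured value.

```c
#include <stdint.h>

#define INIT_DELAY_US 10000u  /* 10 ms wait after INIT IPI */
#define SIPI_PHASE_US 400u    /* per-CPU SIPI cost assumed in the estimate */

/* Every CPU pays the full INIT delay plus the SIPI phase. */
uint64_t sequential_us(uint32_t cpus)
{
    return (uint64_t)cpus * (INIT_DELAY_US + SIPI_PHASE_US);
}

/* INIT is sent to up to `batch` CPUs at once, so the 10 ms delay is paid
 * once per batch; the SIPI phase is still paid once per CPU. */
uint64_t batched_init_us(uint32_t cpus, uint32_t batch)
{
    uint32_t batches = (cpus + batch - 1) / batch;  /* round up */
    return (uint64_t)batches * INIT_DELAY_US + (uint64_t)cpus * SIPI_PHASE_US;
}
```

With a batch size of 16, `batched_init_us(32, 16)` gives 32800 us and `batched_init_us(256, 16)` gives 262400 us, matching the 32.8 ms and 262 ms figures.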
For the second way; the BSP can start the first AP CPU; then the BSP and the AP CPU can start 2 more AP CPUs; then all 4 CPUs can start 4 more CPUs, then all 8 can start 8 more; and so on. This would mean that for 32 CPUs it'd take 5 steps or 5*10.4 = 52 ms (and for 256 CPUs it'd take 8 steps, or about 83.2 ms). The only problem here is that you need multiple trampolines (with one set of flags for synchronisation and "address for AP CPU stack" in each separate trampoline).
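The step count for the doubling approach can be checked with a short C sketch, counting the BSP as the single CPU running before the first step and using 10.4 ms (10 ms INIT delay plus 400 us for the SIPIs) per step. The names are invented for this example.

```c
#include <stdint.h>

#define STEP_US 10400u  /* 10 ms INIT delay + 400 us SIPI phase per step */

/* Number of doubling steps needed to go from 1 running CPU (the BSP)
 * to at least `target` running CPUs. */
uint32_t doubling_steps(uint32_t target)
{
    uint32_t steps = 0;
    uint64_t running = 1;
    while (running < target) {
        running *= 2;  /* every running CPU starts one more CPU */
        steps++;
    }
    return steps;
}

/* Estimated time for the whole doubling sequence. */
uint64_t doubling_us(uint32_t target)
{
    return (uint64_t)doubling_steps(target) * STEP_US;
}
```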
These methods can be combined. For example, BSP could send the INIT IPI to 15 CPUs, then start those 15 CPUs one at a time; then all 16 CPUs could send the INIT IPI to 16 CPUs each (256 total), then each of the first 16 CPUs would start its "16 more" CPUs one at a time. In this way, you'd have 16 CPUs running after about 16 ms, and have (up to) 16+256 CPUs running after another 16.4 ms (and have up to 4368 CPUs running after another 16.4 ms).
However; because of the 8-bit field in the SIPI it's impossible to use more than 256 trampolines at a time, and some of them aren't usable because they correspond to ROM, etc. This means that in practice you probably can't use more than about 128 trampolines at the same time. This means that the fastest possible sequence in practice ends up being something like this:
- BSP sends INIT IPI to 15 CPUs
- BSP does the remainder of the startup sequence for those 15 CPUs one at a time
-- About 16 ms passed so far; 16 CPUs running --
- BSP and 15 AP CPUs send INIT IPI to 16 CPUs each
- BSP and 15 AP CPUs do the remainder of startup sequence for those 256 CPUs (16 at a time)
-- About 32.4 ms passed so far; 272 CPUs running --
-- Limited by the max. number of trampolines from here on --
- 128 CPUs send INIT IPI to 16 CPUs each
- 128 CPUs do the remainder of startup sequence for those 2048 CPUs (128 at a time)
-- About 48.8 ms passed so far; 2320 CPUs running --
- 128 CPUs send INIT IPI to 16 CPUs each
- 128 CPUs do the remainder of startup sequence for those 2048 CPUs (128 at a time)
-- About 65.2 ms passed so far; 4368 CPUs running --
...
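The running totals in that sequence can be reproduced with a small C model. It assumes (as the numbers above do) that each stage costs one 10 ms INIT delay plus 400 us per CPU started by each starter, and that at most 128 CPUs can act as starters at once because of the trampoline limit. `boot_stage` and `run_stage()` are made-up names for this sketch.

```c
#include <stdint.h>

#define INIT_DELAY_US 10000u  /* 10 ms wait after INIT IPI */
#define SIPI_PHASE_US 400u    /* per-CPU SIPI cost assumed in the estimate */

struct boot_stage {
    uint64_t elapsed_us;  /* time since the BSP began starting APs */
    uint32_t running;     /* CPUs running so far, including the BSP */
};

/* One stage: up to max_trampolines running CPUs each send INIT to `fanout`
 * CPUs, wait once, then start their own CPUs one at a time. All starters
 * work in parallel, so the stage takes as long as one starter takes. */
struct boot_stage run_stage(struct boot_stage s, uint32_t fanout,
                            uint32_t max_trampolines)
{
    uint32_t starters = s.running < max_trampolines ? s.running : max_trampolines;
    s.elapsed_us += INIT_DELAY_US + (uint64_t)fanout * SIPI_PHASE_US;
    s.running += starters * fanout;
    return s;
}
```

Starting from `{0, 1}` (just the BSP), one stage with a fanout of 15 followed by three stages with a fanout of 16 (all with a limit of 128) gives 16, 272, 2320 and 4368 CPUs at 16 ms, 32.4 ms, 48.8 ms and 65.2 ms.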
Finally; for systems with lots of CPUs you can expect NUMA. With this in mind it may be a good idea to take NUMA into account. For example, BSP might start one logical CPU in each NUMA domain; then each of those CPUs starts the remaining CPUs in its NUMA domain.
Note: Yes, I did go a little overboard here (there aren't too many computers with more than 32 CPUs at the moment). However; next year Intel will be releasing the next version of Xeon Phi; which will be the first version that can be used as a main CPU. Each Xeon Phi will have up to 72 cores with 4 logical CPUs per core (or up to 288 logical CPUs per Xeon Phi chip). It's entirely conceivable that within 12 months we will start seeing (e.g.) computers with 4-socket motherboards that have 4 Xeon Phi chips and a total of 1152 logical CPUs.
Cheers,
Brendan