Hi,
johnsa wrote: I've narrowed down the cause somewhat.
For now I'm using INIT-SIPI-SIPI with delays until I put in the lock/waits. The second SIPI seems to cause the problem.
If I remove it qemu appears to work perfectly regardless of what I put in the trampoline code. With the second SIPI the results become somewhat random.
1) Sometimes changing just the attribute bytes causes it to hang.
2) Sometimes any change in the trampoline code at all.
For a lot of computers (including real computers and virtual machines) the second SIPI isn't actually necessary (and may just be there in case the first SIPI wasn't received correctly). If you don't have any synchronisation then the AP CPU can:
- Start on the first SIPI
- Begin executing the trampoline and possibly whatever code comes after that
- Receive the second SIPI and start executing the trampoline code a second time
If (for a simple example) your trampoline does "lock inc dword [totalCPUs]" and you start 3 CPUs then "totalCPUs" can be incremented 6 times.
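To make the arithmetic concrete, here's a minimal sketch in C. It isn't taken from anyone's trampoline: "trampoline_entry" and "count_after_startup" are hypothetical names, and it models the worst case where every AP acts on both SIPIs.

```c
#include <stdatomic.h>

static atomic_uint totalCPUs;

/* Stand-in for the trampoline's "lock inc dword [totalCPUs]". */
static void trampoline_entry(void)
{
    atomic_fetch_add(&totalCPUs, 1);
}

/* Model ap_count APs that each restart the trampoline once per SIPI received. */
unsigned count_after_startup(unsigned ap_count, unsigned sipis_per_ap)
{
    atomic_store(&totalCPUs, 0);
    for (unsigned ap = 0; ap < ap_count; ap++)
        for (unsigned sipi = 0; sipi < sipis_per_ap; sipi++)
            trampoline_entry();
    return atomic_load(&totalCPUs);
}
```

Calling "count_after_startup(3, 2)" gives 6 even though only 3 APs exist, while "count_after_startup(3, 1)" gives the expected 3.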
Now; the INIT resets the CPU to a default state, but the SIPI mostly doesn't. This increases the potential for problems. For a simple example, imagine if your trampoline does something that relies on DS being set to zero then sets DS to a non-zero value, and then the second SIPI arrives and it starts the trampoline again. In this case the second time the trampoline is executed DS may still be non-zero, causing problems.
Basically; you need some sort of synchronisation.
In my experience the best synchronisation might be something like this (at the start of your trampoline):
Code:
        mov dword [cs:startupFlag],1     ;Tell the BSP we've started
.l1:    cmp dword [cs:startupFlag],1     ;Has the BSP acknowledged that we've started?
        je .l1                           ; no, wait until the BSP has acknowledged
For the other side; the BSP would do something like:
- Set the "startupFlag" to zero
- Send the INIT IPI and do its 10 ms delay
- Send the first SIPI
- Loop/wait until either "startupFlag" becomes non-zero, or a (short, 200 us) time-out expires. If the time-out expires:
    - Send the second SIPI
    - Loop/wait until either "startupFlag" becomes non-zero, or a (long, 2 seconds) time-out expires. If the time-out expires; assume the AP CPU is faulty, display an error and don't touch that AP ever again.
- If "startupFlag" becomes non-zero (the AP CPU did start) at either of the 2 points above; set "startupFlag" back to zero to tell the AP CPU that it can continue.
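The BSP-side steps above can be sketched in C roughly like this. It's only an illustration under assumptions: "send_init", "send_sipi" and "delay_us" are hypothetical stand-ins (here they simulate an AP instead of touching the local APIC or a timer), "sim_ap_responds_on" exists purely for the simulation, and the 10 us polling step is an arbitrary choice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared with the trampoline: the AP writes 1, the BSP writes 0 to release it. */
static volatile uint32_t startupFlag;

/* --- Simulation only (hypothetical): stand-ins for local APIC and timer code --- */
static int sipis_sent;          /* SIPIs sent since the last INIT */
static int sim_ap_responds_on;  /* which SIPI the fake AP starts on; 0 = never */

static void send_init(uint32_t apic_id)
{
    (void)apic_id;
    sipis_sent = 0;
}

static void send_sipi(uint32_t apic_id, uint8_t vector)
{
    (void)apic_id;
    (void)vector;
    if (++sipis_sent == sim_ap_responds_on)
        startupFlag = 1;        /* what the trampoline's first instruction does */
}

static void delay_us(uint32_t us)
{
    (void)us;                   /* a real kernel would wait on a timer here */
}

/* Poll startupFlag in 10 us steps until it is non-zero or the time-out expires. */
static bool wait_for_ap(uint32_t timeout_us)
{
    for (uint32_t waited = 0; waited < timeout_us; waited += 10) {
        if (startupFlag != 0)
            return true;
        delay_us(10);
    }
    return startupFlag != 0;
}

/* Returns true if the AP started and was released, false if it looks faulty. */
bool start_ap(uint32_t apic_id, uint8_t vector)
{
    startupFlag = 0;
    send_init(apic_id);
    delay_us(10000);                    /* 10 ms delay after INIT */

    send_sipi(apic_id, vector);         /* first SIPI */
    if (!wait_for_ap(200)) {            /* short time-out: 200 us */
        send_sipi(apic_id, vector);     /* second SIPI, only if needed */
        if (!wait_for_ap(2000000))      /* long time-out: 2 seconds */
            return false;               /* caller reports the error and skips this AP */
    }

    startupFlag = 0;                    /* acknowledge: the AP can continue */
    return true;
}
```

With "sim_ap_responds_on = 2" the simulated AP misses the first SIPI, so "start_ap" sends the second one and then releases the AP; with 0 it gives up after the long time-out and returns false. The point of the structure is that the second SIPI is never sent to an AP that has already started.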
Please note that this is different to the startup sequence that Intel describes. It does work well on every (real and virtual) computer I've tested it on (while Intel's sequence fails in some cases), and it's also typically a little faster than Intel's sequence (because Intel's "200 us" delays are conservative).
Cheers,
Brendan