Brendan wrote:Hi,
rdos wrote:Why is this a problem only on Core Duo and not on AMD or on dual core Intel Atom?
Edit: At the time the interrupt is dispatched, one of the cores is halted, while one is executing. No IRR or ISR is set in the local APIC of the executing core. The core that is halted has TPR = 0xFF, and it first executes cli and then hlt.
CLI then HLT puts the CPU into a special "shutdown the CPU and wait for INIT-SIPI-SIPI startup sequence" state. If an interrupt is received it's blocked (CLI), and when you send the INIT-SIPI-SIPI startup sequence to start the CPU again, the INIT resets the local APIC (so any pending interrupts are lost).
Yes, I suspected that, and it initially was an issue on other CPUs (AMDs), but this was apparently solved by loading TPR with 0xFF on those CPUs. The processor really shouldn't claim a lowest priority delivery interrupt when TPR is set to 0xFF.
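For readers following along, the TPR write described above might look roughly like this in C. This is only a sketch, not RDOS code: `lapic_write`, `lapic_read` and `lapic_block_all_priorities` are hypothetical names, and it assumes the local APIC is in xAPIC (memory-mapped) mode with `base` pointing at its register page.

```c
#include <stdint.h>

/* Task Priority Register offset (bytes from the local APIC base). */
#define LAPIC_TPR 0x080u

/* Hypothetical helper: write one 32-bit local APIC register.
   `base` is assumed to point at the memory-mapped APIC page. */
static inline void lapic_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4] = value;
}

/* Hypothetical helper: read one 32-bit local APIC register. */
static inline uint32_t lapic_read(volatile uint32_t *base, uint32_t offset)
{
    return base[offset / 4];
}

/* Raise the task priority to its maximum (class 15), which is meant to
   stop the core from accepting ordinary fixed and lowest-priority
   interrupts. NMI, SMI and INIT are not affected by TPR. */
static inline void lapic_block_all_priorities(volatile uint32_t *base)
{
    lapic_write(base, LAPIC_TPR, 0xFFu);
}
```

In a kernel this would run on the core being parked, before the cli / hlt pair.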
Brendan wrote:Clearing the CPU's Logical Destination Register so it won't receive any IRQs sent with logical delivery (while making sure there's at least one CPU that will still be able to receive them)
I don't do that. I'll check if this helps. Since BSP can never be put to sleep, and it always handles all logical destinations, this should be no problem.
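Clearing the logical destination, as suggested in the quote, could be sketched as below. It assumes xAPIC mode, where the 8-bit logical APIC ID sits in bits 31:24 of the LDR at offset 0xD0. The function names are made up for illustration.

```c
#include <stdint.h>

/* Logical Destination Register offset (bytes from the local APIC base). */
#define LAPIC_LDR         0x0D0u
#define LAPIC_LDR_ID_MASK 0xFF000000u  /* logical APIC ID, bits 31:24 */

/* Return the LDR value with the logical ID cleared, so the core no
   longer matches any logical-destination interrupt. */
static inline uint32_t ldr_without_logical_id(uint32_t ldr)
{
    return ldr & ~LAPIC_LDR_ID_MASK;
}

/* Hypothetical helper: apply the change to a memory-mapped local APIC.
   `base` is assumed to point at the APIC register page. */
static inline void lapic_clear_logical_id(volatile uint32_t *base)
{
    base[LAPIC_LDR / 4] = ldr_without_logical_id(base[LAPIC_LDR / 4]);
}
```

The previous logical ID would have to be saved somewhere if the core is later brought back without a full re-initialization.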
Brendan wrote:Reconfiguring any IRQ that uses fixed delivery so they are sent to other CPU/s
Since fixed delivery is only to the BSP, and the BSP can never sleep, this should not be an issue.
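To make the fixed versus lowest-priority distinction concrete, here is a sketch of how the low and high parts of an I/O APIC redirection entry encode it. The helper name `ioapic_rte` and the vectors used in the example are assumptions for illustration only; the bit positions follow the standard I/O APIC layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Delivery modes in bits 10:8 of a redirection entry. */
#define RTE_DELIVERY_FIXED  0u
#define RTE_DELIVERY_LOWEST 1u

/* Build a 64-bit I/O APIC redirection table entry.
   bits 7:0   vector
   bits 10:8  delivery mode
   bit  11    destination mode (0 = physical, 1 = logical)
   bits 63:56 destination (APIC ID, or logical mask in logical mode)
   Polarity, trigger mode and the mask bit are left at 0 here. */
static inline uint64_t ioapic_rte(uint8_t vector, unsigned delivery_mode,
                                  bool logical, uint8_t dest)
{
    return (uint64_t)vector
         | ((uint64_t)(delivery_mode & 7u) << 8)
         | ((uint64_t)(logical ? 1u : 0u) << 11)
         | ((uint64_t)dest << 56);
}
```

For example, `ioapic_rte(0x40, RTE_DELIVERY_FIXED, false, 0)` targets physical APIC ID 0 (typically the BSP), while `ioapic_rte(0x41, RTE_DELIVERY_LOWEST, true, 0xFF)` lets any core whose logical ID matches the mask claim the interrupt.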
Brendan wrote:Making sure no other CPU will send any type of IPI to the CPU you're disabling
I do this with a flag bit in the per-core scheduler selector.
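A flag-based guard of this kind could look something like the following. This is a guess at the general shape, not the actual RDOS scheduler structure: `struct core_state`, `CORE_FLAG_OFFLINE` and the function names are all hypothetical, and the counter stands in for the real ICR write.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CORE_FLAG_OFFLINE 0x1u

/* Hypothetical per-core state; a real kernel would keep much more here. */
struct core_state {
    atomic_uint flags;
    unsigned ipis_sent;  /* stand-in for actually sending an IPI */
};

static inline void core_init(struct core_state *c)
{
    atomic_init(&c->flags, 0u);
    c->ipis_sent = 0;
}

/* Set by the core itself before it parks. */
static inline void core_mark_offline(struct core_state *c)
{
    atomic_fetch_or(&c->flags, CORE_FLAG_OFFLINE);
}

/* Cleared once the core has been restarted and re-initialized. */
static inline void core_mark_online(struct core_state *c)
{
    atomic_fetch_and(&c->flags, ~CORE_FLAG_OFFLINE);
}

/* Senders check the flag first and skip cores that are offline.
   Note: the check and the send are not atomic together, so a real
   kernel still has to handle a core going offline in between. */
static inline bool core_try_send_ipi(struct core_state *c)
{
    if (atomic_load(&c->flags) & CORE_FLAG_OFFLINE)
        return false;
    c->ipis_sent++;
    return true;
}
```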
Brendan wrote:Avoiding race conditions for all of the above (e.g. in case there's any pending interrupts that were sent before they were disabled, have a small delay with interrupts enabled to give the CPU a chance to handle any pending IRQs)
This should not be an issue as the code for disabling the CPU runs in the idle (null) thread, and the null thread will only run if there are no local threads available for scheduling, and no pending ISRs.
Brendan wrote:Disable paging to flush TLB contents (some CPUs have bugs where stale TLB contents may be used after reset/init)
That would be hard to do. It means the processor must be brought back towards real mode or something. The TLB will be flushed as part of the initialization code anyway, and the page directory is loaded with the system process, which is the same process it will run in when it has finished booting.
Brendan wrote:Disable caches (e.g. in CR0)
I'll look into that as well.
Brendan wrote:Do a WBINVD to flush anything left in caches (disabled CPUs may not respond to snoop traffic, and not disabling/flushing cache contents can cause corruption)
I do that already.
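The two cache steps quoted above (disable in CR0, then WBINVD) can be sketched together. The bit positions are the architectural ones (CD is bit 30, NW is bit 29); the function names are hypothetical, the inline assembly assumes GCC or Clang on x86, and the privileged parts only work in ring 0.

```c
#define CR0_NW (1UL << 29)  /* Not Write-through */
#define CR0_CD (1UL << 30)  /* Cache Disable */

/* Compute the CR0 value that disables caching: set CD, clear NW. */
static inline unsigned long cr0_cache_off(unsigned long cr0)
{
    return (cr0 | CR0_CD) & ~CR0_NW;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Privileged helpers: these fault if executed outside ring 0. */
static inline unsigned long read_cr0(void)
{
    unsigned long v;
    __asm__ volatile("mov %%cr0, %0" : "=r"(v));
    return v;
}

static inline void write_cr0(unsigned long v)
{
    __asm__ volatile("mov %0, %%cr0" : : "r"(v) : "memory");
}

/* Disable caching first, then write back and invalidate, so nothing
   dirty remains in the cache once the core stops responding. */
static inline void cpu_cache_disable_and_flush(void)
{
    write_cr0(cr0_cache_off(read_cr0()));
    __asm__ volatile("wbinvd" : : : "memory");
}
#endif
```

Doing the CR0 write before WBINVD matters: flushing first would leave a window in which lines can be refilled before caching is turned off.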
Brendan wrote:Consider switching back to real mode - it shouldn't be necessary (but can't hurt and might potentially avoid problems caused by CPU errata)
OK, I might look into that.
Brendan wrote:Bringing a CPU back online would be the reverse of this, with a few extra steps thrown in (e.g. reconfigure MTRRs before enabling caches in case the OS changed them while the CPU was offline, LGDT, LIDT, etc).
Bringing it online is the same procedure as when the AP is started at initialization time. In fact, when the computer boots, it will boot the APs and then disable them. This is because each core needs some initial data. It is the power management driver that decides to boot additional APs when load is sufficiently high.
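For reference, the INIT-SIPI-SIPI sequence used to start (or restart) an AP can be sketched as below. This assumes xAPIC mode with a memory-mapped local APIC; `ap_start`, `lapic_send_ipi` and the `delay_us` callback are hypothetical names, and the 10 ms / 200 us delays are the commonly used values rather than anything taken from this thread. A real implementation would also poll the delivery-status bit in the ICR after each write.

```c
#include <stddef.h>
#include <stdint.h>

#define LAPIC_ICR_LOW        0x300u
#define LAPIC_ICR_HIGH       0x310u
#define ICR_DELIVERY_INIT    (5u << 8)
#define ICR_DELIVERY_STARTUP (6u << 8)
#define ICR_LEVEL_ASSERT     (1u << 14)

/* ICR low dword for an INIT IPI. */
static inline uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* ICR low dword for a Startup IPI. The vector field is the 4 KiB page
   number of the real-mode entry point, which must lie below 1 MiB. */
static inline uint32_t icr_low_sipi(uint32_t entry_phys)
{
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | ((entry_phys >> 12) & 0xFFu);
}

/* Write destination first; writing the low dword triggers the send. */
static inline void lapic_send_ipi(volatile uint32_t *lapic, uint8_t apic_id, uint32_t icr_low)
{
    lapic[LAPIC_ICR_HIGH / 4] = (uint32_t)apic_id << 24;
    lapic[LAPIC_ICR_LOW / 4]  = icr_low;
}

/* INIT, wait, then two SIPIs. `delay_us` may be NULL (e.g. in tests). */
static inline void ap_start(volatile uint32_t *lapic, uint8_t apic_id,
                            uint32_t entry_phys, void (*delay_us)(unsigned))
{
    lapic_send_ipi(lapic, apic_id, icr_low_init());
    if (delay_us) delay_us(10000);
    for (int i = 0; i < 2; i++) {
        lapic_send_ipi(lapic, apic_id, icr_low_sipi(entry_phys));
        if (delay_us) delay_us(200);
    }
}
```

Because the INIT resets the target's local APIC, any interrupt that core had accepted but not yet serviced is gone by the time the trampoline at `entry_phys` runs.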