IPI send packet deadlock? Get BSOD clock watchdog timeout
Posted: Thu Dec 17, 2015 1:43 pm
Hi OSDev,
My Windows BSOD gave a dump file. From my analysis the root cause appears to be a result of an IPI send packet deadlock. Would like comments and/or confirmation of the root cause and possible solution.
Using windbg:
0: kd> !analyze -v
* Bugcheck Analysis *
************************************
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 00000031, Clock interrupt time out interval in nominal clock ticks.
Arg2: 00000000, 0.
Arg3: 8db00120, The PRCB address of the hung processor.
Arg4: 00000002, 0.
1: kd> !ipi
IPI State for Processor 0
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 82971cb4 SignalDone 00000000 RequestSummary 0
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 85d2800c Parameter[2] 85d28008
IPI State for Processor 1
As a sender, awaiting packet completion from processors 2, and 3.
TargetSet c PacketBarrier d IpiFrozen 2 [Frozen]
IpiFrame 807e3cb4 SignalDone 00000000 RequestSummary 0
Packet State Active WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 82973540 Parameter[2] 82973538
IPI State for Processor 2
As a receiver, the following unhandled requests are pending: [Packet] [DPC] [Freeze]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 8d71ccb4 SignalDone 807c7120 RequestSummary 6 [DPC] [Freeze]
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 82973540 Parameter[2] 82973538
IPI State for Processor 3
As a receiver, the following unhandled requests are pending: [Packet] [DPC] [Freeze]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 8d752cb4 SignalDone 807c7120 RequestSummary 6 [DPC] [Freeze]
Packet State Stale WorkerRoutine nt!KiFlushTargetProcessTb
Parameter[0] 00000000 Parameter[1] 00000000 Parameter[2] 00000000
NOTE: in dual core with hyper-threading ON,
2 cores: core 0 and core 1
4 logical procs: Processor 0, Processor 1, Processor 2, Processor 3
Processor 0, Processor 1 form Core 0.
Processor 2, Processor 3 form Core 1.
Basically, !ipi is saying that core 0 (either proc 0 or 1) is trying to send a
packet to all but self (send to core 1 (proc 2 & 3)) In this case it is proc 1
sending to proc 2 & 3 and waiting for completion.
It is not completing. Why? Because proc 2 is busy. And it has to be busy enough
that an IPI request from another proc can block/wait.
For proc 2 to be busy enough to block an IPI request, it has to be doing the
same thing (or something even more important).
From book "Windows Internals, part1, 6th edtion", page 89
<<Each interrupt level has a specific purpose. For example, the kernel issues an interprocessor
interrupt (IPI) to request that another processor perform an action, such as dispatching a particular
thread for execution or updating its translation look-aside buffer (TLB) cache.>>
Verified that proc 0 is send IPI packet for TLB. (from call traces)
Verified that proc 2 is send IPI packet for dispatching a particular thread for execution. (from raw stack)
In the raw stack for proc 2, find call trace that is doing IPI request.
8db52a94 82c21ecb 8296a8c8 00000000 00000206 hal!HalpSendFlatIpi+0x88 (FPO: [Non-Fpo])
8db52aa8 82c21f07 00000000 8db52b10 000000e1 hal!HalpRequestIpiSpecifyVector+0x6b (FPO: [Non-Fpo])
8db52abc 8287cd99 00000000 8db52b10 8db52b1c hal!HalRequestIpi+0x17 (FPO: [Non-Fpo])
8db52ad0 8287cb6e 86b08660 86143d48 8287cc7e nt!KiIpiSend+0x31 (FPO: [0,2,0])
8db52b4c 8287d69d 86143d48 00000000 86143d48 nt!KiDeferredReadyThread+0x7e2
8db52b68 8287d65b 8db52ba4 8db36120 8db37a88 nt!KiReadyThread+0x30
You may say that, with multiple procs, shouldn't Windows have synchronization for this thing to avoid contention.
This is true and yes there is. But this only works if everybody is using the SAME method to send IPI packet.
In this case here, it is not same and it opens the case for deadlock.
The problem is that proc 0 uses a different call to send IPI compared to proc 2.
Proc 0 uses nt!KiIpiSendPacket
Proc 2 uses nt!KiIpiSend
nt!KiIpiSendPacket has something that looks like a spin lock and set and checks
flags with a lock.
(828c5e25 f00fb196a4180000 lock cmpxchg dword ptr [esi+18A4h],edx)
nt!KiIpiSend has only something that set a flag with a lock but nothing looking
like a spin lock.
(8287cd79 f00990a0180000 lock or dword ptr [eax+18A0h],edx)
But the flag is not in the same place 18A4h vs. 18A0h ! Can this still work?
That looks like the problem.
Please comment whether this conclusion make sense or not.
Ultimately, I want to fix the problem but if this is root cause the fix is not
obvious.
Regards,
Andrew
My Windows BSOD gave a dump file. From my analysis the root cause appears to be a result of an IPI send packet deadlock. Would like comments and/or confirmation of the root cause and possible solution.
Using windbg:
0: kd> !analyze -v
* Bugcheck Analysis *
************************************
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 00000031, Clock interrupt time out interval in nominal clock ticks.
Arg2: 00000000, 0.
Arg3: 8db00120, The PRCB address of the hung processor.
Arg4: 00000002, 0.
1: kd> !ipi
IPI State for Processor 0
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 82971cb4 SignalDone 00000000 RequestSummary 0
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 85d2800c Parameter[2] 85d28008
IPI State for Processor 1
As a sender, awaiting packet completion from processors 2, and 3.
TargetSet c PacketBarrier d IpiFrozen 2 [Frozen]
IpiFrame 807e3cb4 SignalDone 00000000 RequestSummary 0
Packet State Active WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 82973540 Parameter[2] 82973538
IPI State for Processor 2
As a receiver, the following unhandled requests are pending: [Packet] [DPC] [Freeze]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 8d71ccb4 SignalDone 807c7120 RequestSummary 6 [DPC] [Freeze]
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleRangeTb
Parameter[0] 00000000 Parameter[1] 82973540 Parameter[2] 82973538
IPI State for Processor 3
As a receiver, the following unhandled requests are pending: [Packet] [DPC] [Freeze]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame 8d752cb4 SignalDone 807c7120 RequestSummary 6 [DPC] [Freeze]
Packet State Stale WorkerRoutine nt!KiFlushTargetProcessTb
Parameter[0] 00000000 Parameter[1] 00000000 Parameter[2] 00000000
NOTE: in dual core with hyper-threading ON,
2 cores: core 0 and core 1
4 logical procs: Processor 0, Processor 1, Processor 2, Processor 3
Processor 0, Processor 1 form Core 0.
Processor 2, Processor 3 form Core 1.
Basically, !ipi is saying that core 0 (either proc 0 or 1) is trying to send a
packet to all but self (send to core 1 (proc 2 & 3)) In this case it is proc 1
sending to proc 2 & 3 and waiting for completion.
It is not completing. Why? Because proc 2 is busy. And it has to be busy enough
that an IPI request from another proc can block/wait.
For proc 2 to be busy enough to block an IPI request, it has to be doing the
same thing (or something even more important).
From book "Windows Internals, part1, 6th edtion", page 89
<<Each interrupt level has a specific purpose. For example, the kernel issues an interprocessor
interrupt (IPI) to request that another processor perform an action, such as dispatching a particular
thread for execution or updating its translation look-aside buffer (TLB) cache.>>
Verified that proc 0 is send IPI packet for TLB. (from call traces)
Verified that proc 2 is send IPI packet for dispatching a particular thread for execution. (from raw stack)
In the raw stack for proc 2, find call trace that is doing IPI request.
8db52a94 82c21ecb 8296a8c8 00000000 00000206 hal!HalpSendFlatIpi+0x88 (FPO: [Non-Fpo])
8db52aa8 82c21f07 00000000 8db52b10 000000e1 hal!HalpRequestIpiSpecifyVector+0x6b (FPO: [Non-Fpo])
8db52abc 8287cd99 00000000 8db52b10 8db52b1c hal!HalRequestIpi+0x17 (FPO: [Non-Fpo])
8db52ad0 8287cb6e 86b08660 86143d48 8287cc7e nt!KiIpiSend+0x31 (FPO: [0,2,0])
8db52b4c 8287d69d 86143d48 00000000 86143d48 nt!KiDeferredReadyThread+0x7e2
8db52b68 8287d65b 8db52ba4 8db36120 8db37a88 nt!KiReadyThread+0x30
You may say that, with multiple procs, shouldn't Windows have synchronization for this thing to avoid contention.
This is true and yes there is. But this only works if everybody is using the SAME method to send IPI packet.
In this case here, it is not same and it opens the case for deadlock.
The problem is that proc 0 uses a different call to send IPI compared to proc 2.
Proc 0 uses nt!KiIpiSendPacket
Proc 2 uses nt!KiIpiSend
nt!KiIpiSendPacket has something that looks like a spin lock and set and checks
flags with a lock.
(828c5e25 f00fb196a4180000 lock cmpxchg dword ptr [esi+18A4h],edx)
nt!KiIpiSend has only something that set a flag with a lock but nothing looking
like a spin lock.
(8287cd79 f00990a0180000 lock or dword ptr [eax+18A0h],edx)
But the flag is not in the same place 18A4h vs. 18A0h ! Can this still work?
That looks like the problem.
Please comment whether this conclusion make sense or not.
Ultimately, I want to fix the problem but if this is root cause the fix is not
obvious.
Regards,
Andrew