Bizarre QEMU bug

Post by johnsa »

If (for testing purposes) you send out startup IPIs and your trampoline code does something like:

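	cli                  ; no interrupts while in the trampoline
	cld                  ; make stosw move upwards through memory
	mov ax,0b800h        ; segment of VGA colour text memory
	mov es,ax
	xor di,di            ; start at the first character cell
	mov ax,0x1e43        ; attribute 0x1e, character 0x43 ('C')
	mov cx,500           ; number of cells to fill
	rep stosw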
If and only if the attribute byte (0x1e in this case) differs from what is already in display memory, QEMU freezes or hangs at random places (sometimes in code that runs well before I even reach the APIC init).
I'm not sure whether this is a locking issue, since the BSP and AP may be writing to display memory simultaneously, or whether the attribute byte simply has to match.
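
For reference, the word stored by the trampoline packs a character and an attribute into one 16-bit text-mode cell. The following is a minimal host-side sketch in C of how such a cell splits up; the names `vga_cell`, `decode_cell` and `cells_to_rows` are made up for illustration, and the 80-column width used in the usage note is an assumption (the post does not say which text mode is active).

```c
#include <assert.h>
#include <stdint.h>

/* One VGA text-mode cell: low byte = character, high byte = attribute. */
struct vga_cell {
    uint8_t ch;     /* character code */
    uint8_t fg;     /* foreground colour, attribute bits 0-3 */
    uint8_t bg;     /* background colour, attribute bits 4-6 */
    uint8_t blink;  /* attribute bit 7 (blink or bright background, depending on mode) */
};

/* Split a 16-bit cell value, as written by "rep stosw", into its fields. */
static struct vga_cell decode_cell(uint16_t word)
{
    struct vga_cell c;
    uint8_t attr = (uint8_t)(word >> 8);

    c.ch = (uint8_t)(word & 0xff);
    c.fg = (uint8_t)(attr & 0x0f);
    c.bg = (uint8_t)((attr >> 4) & 0x07);
    c.blink = (uint8_t)(attr >> 7);
    return c;
}

/* How many full rows, plus leftover cells, a run of `cells` words covers. */
static void cells_to_rows(unsigned cells, unsigned columns,
                          unsigned *rows, unsigned *extra)
{
    assert(columns != 0);
    *rows = cells / columns;
    *extra = cells % columns;
}
```

With the value from the snippet, `decode_cell(0x1e43)` gives character 'C', foreground 14 (yellow) and background 1 (blue) in the standard palette. If the screen is 80 columns wide, `cells_to_rows(500, 80, ...)` reports 6 full rows plus 20 cells, so the fill overwrites whatever the BSP printed at the top of the screen.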

This doesn't seem to be an issue in Bochs with SMP or in VirtualBox.

Post by FallenAvatar »

johnsa wrote:...
Can you provide a minimal (single file, preferably in assembly [GAS or AT&T syntax]) code set that reproduces this 100% of the time?

If yes, please post it here for us to test out and/or post it to QEMU's bug tracker.

- Monk

Post by johnsa »

I've narrowed down the cause somewhat.

For now I'm using INIT-SIPI-SIPI with delays until I put in the lock/waits. The second SIPI seems to cause the problem.
If I remove it, QEMU appears to work perfectly regardless of what I put in the trampoline code. With the second SIPI the results become somewhat random.
1) Sometimes changing just the attribute bytes causes it to hang.
2) Sometimes any change in the trampoline code at all.

The "hang" then appears to occur well before the AP startup even happens (judging by the display: a number of text outputs that happen before the call are never shown).
So the only thing I can assume here is that QEMU JITs/runs ahead somehow, and the AP startup code actually runs either before or in parallel with QEMU updating the "virtual" VGA output, which makes it appear as if the whole thing is running out of order.

If that is the case, the moral of the story is don't assume you'll see text output sequentially in the display before one of these types of calls (possibly any h/w interaction or maybe just MP startup).

These same issues are not occurring for me on real h/w or VirtualBox (so it seems to be QEMU-specific).
I am running QEMU x64 as a Windows build, so the bug (if any) may not occur in the normal Linux build.

I don't have a Bochs SMP setup for Windows at the moment: the VS2013 solution refuses to build properly and the Cygwin configure script just doesn't seem to work. When I can get it to build with the right config I'll try it there too.

Post by Brendan »

Hi,
johnsa wrote:I've narrowed down the cause somewhat.

For now I'm using INIT-SIPI-SIPI with delays until I put in the lock/waits. The second SIPI seems to cause the problem.
If I remove it qemu appears to work perfectly regardless of what I put in the trampoline code. With the second SIPI the results become somewhat random.
1) Sometimes changing just the attribute bytes causes it to hang.
2) Sometimes any change in the trampoline code at all.

For a lot of computers (including real computers and virtual machines) the second SIPI isn't actually necessary (and may just be there in case the first SIPI wasn't received correctly). If you don't have any synchronisation then the AP CPU can:
  • Start on the first SIPI
  • Begin executing the trampoline and possibly whatever code comes after that
  • Receive the second SIPI and start executing the trampoline a second time
If (for a simple example) your trampoline does "lock inc dword [totalCPUs]" and you start 3 CPUs then "totalCPUs" can be incremented 6 times.
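
The arithmetic can be checked with a small host-side model in C. This is only a sketch of the worst case, where every SIPI restarts the trampoline; `trampoline_entry` and `count_after_startup` are illustrative names, and whether a particular CPU or emulator really acts on the second SIPI varies.

```c
#include <assert.h>

static unsigned total_cpus;  /* stands in for the "totalCPUs" counter */

/* Model of the trampoline entry: runs once for every SIPI that is acted on. */
static void trampoline_entry(void)
{
    total_cpus++;  /* the "lock inc dword [totalCPUs]" step */
}

/* Deliver `sipis_per_ap` SIPIs to each of `ap_count` APs, with no handshake
 * to stop an AP that is already running from being restarted. */
static unsigned count_after_startup(unsigned ap_count, unsigned sipis_per_ap)
{
    assert(sipis_per_ap >= 1);

    total_cpus = 0;
    for (unsigned ap = 0; ap < ap_count; ap++)
        for (unsigned sipi = 0; sipi < sipis_per_ap; sipi++)
            trampoline_entry();
    return total_cpus;
}
```

`count_after_startup(3, 2)` returns 6 and `count_after_startup(3, 1)` returns 3, which is the difference between an unsynchronised INIT-SIPI-SIPI and a sequence where each AP enters the trampoline exactly once.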

Now; the INIT resets the CPU to a default state, but the SIPI mostly doesn't. This increases the potential for problems. For a simple example, imagine if your trampoline does something that relies on DS being set to zero then sets DS to a non-zero value, and then the second SIPI arrives and it starts the trampoline again. In this case the second time the trampoline is executed DS may still be non-zero, causing problems.
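
The DS example can be modelled the same way. The sketch below is a deliberately tiny C model that tracks only DS and the SIPI entry segment; `ap_state`, `deliver_init`, `deliver_sipi` and `run_trampoline` are hypothetical names, and 0x1000 is an arbitrary non-zero value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the pieces of AP state that matter for this example. */
struct ap_state {
    uint16_t ds;  /* data segment register */
    uint16_t cs;  /* code segment the AP was last started at */
};

/* INIT puts the CPU back into its default state (modelled as DS = 0). */
static void deliver_init(struct ap_state *ap)
{
    assert(ap);
    ap->ds = 0;
}

/* SIPI only redirects execution to (vector << 8):0000; DS is left alone. */
static void deliver_sipi(struct ap_state *ap, uint8_t vector)
{
    assert(ap);
    ap->cs = (uint16_t)(vector << 8);
}

/* A trampoline that relies on DS == 0 on entry and then changes DS.
 * Returns true if its entry assumption held. */
static bool run_trampoline(struct ap_state *ap)
{
    assert(ap);
    bool ds_was_zero = (ap->ds == 0);
    ap->ds = 0x1000;  /* arbitrary non-zero value set by the trampoline */
    return ds_was_zero;
}
```

After `deliver_init` and one `deliver_sipi`, `run_trampoline` returns true. A second `deliver_sipi` followed by `run_trampoline` returns false, because nothing between the two runs put DS back to zero.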

Basically; you need some sort of synchronisation.

In my experience the best synchronisation might be something like this (at the start of your trampoline):

     mov dword [cs:startupFlag],1   ;Tell the BSP we've started
.l1: cmp dword [cs:startupFlag],1   ;Has the BSP acknowledged that we've started?
     je .l1                         ; no, wait until the BSP has acknowledged
For the other side; the BSP would do something like:
  • Set the "startupFlag" to zero
  • Send the INIT IPI and do its 10 ms delay
  • Send the first SIPI
  • Loop/wait until either "startupFlag" becomes non-zero, or a (short, 200 us) time-out expires. If the time-out expires:
    • Send the second SIPI
    • Loop/wait until either "startupFlag" becomes non-zero, or a (long, 2 seconds) time-out expires. If the time-out expires; assume the AP CPU is faulty, display an error and don't touch that AP ever again.
  • If "startupFlag" becomes non-zero (the AP CPU did start) at either of the 2 points above; set "startupFlag" back to zero to tell the AP CPU that it can continue.
Please note that this is different to the startup sequence that Intel describes. It does work well on every (real and virtual) computer I've tested it on (while Intel's sequence fails in some cases), and it's also typically a little faster than Intel's sequence (because Intel's "200 us" delays are conservative).
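
The BSP-side steps listed above could be sketched in C roughly as follows. This is one possible reading of the list, not Brendan's actual code: the `smp_ops` hooks, `start_ap`, `wait_for_ap`, the 10 us polling step and the `sim_*` stand-ins are all assumptions made for illustration. In a real kernel `startup_flag` would have to be the same dword the trampoline reaches as `[cs:startupFlag]`, and the hooks would program the local APIC and a timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform hooks; none of these names come from the thread. */
struct smp_ops {
    void (*send_init)(uint8_t apic_id);
    void (*send_sipi)(uint8_t apic_id, uint8_t vector);
    void (*delay_us)(uint32_t us);
};

/* Shared with the trampoline, which writes 1 and spins until it reads 0. */
static volatile uint32_t startup_flag;

/* Poll in 10 us steps; true if the AP checked in before the time-out. */
static bool wait_for_ap(const struct smp_ops *ops, uint32_t timeout_us)
{
    for (uint32_t waited = 0; waited < timeout_us; waited += 10) {
        if (startup_flag != 0)
            return true;
        ops->delay_us(10);
    }
    return startup_flag != 0;
}

/* Returns true if the AP started; false means treat it as faulty. */
static bool start_ap(const struct smp_ops *ops, uint8_t apic_id, uint8_t vector)
{
    assert(ops && ops->send_init && ops->send_sipi && ops->delay_us);

    startup_flag = 0;
    ops->send_init(apic_id);
    ops->delay_us(10000);                 /* 10 ms delay after INIT */

    ops->send_sipi(apic_id, vector);      /* first SIPI */
    if (!wait_for_ap(ops, 200)) {         /* short time-out: 200 us */
        ops->send_sipi(apic_id, vector);  /* second SIPI only if needed */
        if (!wait_for_ap(ops, 2000000))   /* long time-out: 2 seconds */
            return false;
    }
    startup_flag = 0;                     /* tell the AP it can continue */
    return true;
}

/* Simulated platform so the logic can be tried on a host (illustrative only). */
static unsigned sim_sipis_sent;
static unsigned sim_responds_on;  /* AP checks in on this SIPI number; 0 = never */

static void sim_send_init(uint8_t apic_id)
{
    (void)apic_id;
    sim_sipis_sent = 0;
}

static void sim_send_sipi(uint8_t apic_id, uint8_t vector)
{
    (void)apic_id;
    (void)vector;
    if (++sim_sipis_sent == sim_responds_on)
        startup_flag = 1;  /* what the first trampoline instruction does */
}

static void sim_delay_us(uint32_t us)
{
    (void)us;  /* no real waiting in the simulation */
}

static const struct smp_ops sim_ops = { sim_send_init, sim_send_sipi, sim_delay_us };
```

With `sim_responds_on = 1`, `start_ap(&sim_ops, 1, 0x08)` succeeds after a single SIPI, so an AP that is already running never sees a second one. With `sim_responds_on = 2` the second SIPI is sent and the call still succeeds, and with `sim_responds_on = 0` it returns false after both time-outs expire.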


Cheers,

Brendan

Post by johnsa »

Thanks for the useful insights!

I will definitely be putting the locking/sync in, probably tonight if I get a chance. I am still curious about what happens behind the scenes in QEMU specifically: whether the MP startup somehow affects code that runs before it, or whether QEMU runs the video update (to the real display) "out of sequence" with the code it's JITing/interpreting.