Get the performance monitoring interrupt on Qemu-Kvm

Post by parfait »

I am having trouble catching the performance monitoring interrupt (PMI), specifically for the instruction counter, on qemu-kvm. The code below works fine on a real machine (Intel Core TM i5-4300U), but on qemu-kvm (qemu-system-x86_64 -cpu host) I do not see a single PMI, even though the counter itself works normally: I can see it incrementing.
However, I have tested with the Linux kernel, and it catches the overflow interrupt very well on the same qemu-kvm. So there is obviously a step I am missing when configuring the performance monitoring counter on qemu-kvm.
Can someone point it out to me?
Here is the pseudo-code:

Code: Select all

#define LAPIC_SVR           0xF0
#define LAPIC_LVT_PERFM     0x340
#define CPU_LOCAL_APIC      0xFFFFFFFFBFFFE000
#define NMI_DELIVERY_MODE   (0x4 << 8)                           //NMI
#define MSR_PERF_GLOBAL_CTRL    0x38F
#define MSR_PERF_FIXED_CTRL     0x38D
#define MSR_PERF_FIXED_CTR0     0x309
#define MSR_PERF_GLOBAL_OVF_CTRL 0x390

/*Configure LAPIC*/
apic_base = Msr::read<Paddr>(Msr::IA32_APIC_BASE);
map(CPU_LOCAL_APIC, apic_base & 0xFFFFF000)                                                                // No caching, etc.
Msr::write (Msr::IA32_APIC_BASE, apic_base | 0x800);
write (LAPIC_SVR, read (LAPIC_SVR) | 0x100);
*reinterpret_cast<uint32 volatile *>(CPU_LOCAL_APIC + LAPIC_LVT_PERFM) = NMI_DELIVERY_MODE;

/*Configure MSR_PERF_FIXED_CTR0 to have overflow interrupt*/
Msr::write(Msr::MSR_PERF_GLOBAL_CTRL, Msr::read<uint64>(Msr::MSR_PERF_GLOBAL_CTRL) | (1ull<<32));          // enable IA32_PERF_FIXED_CTR0
Msr::write(Msr::MSR_PERF_FIXED_CTRL, 0xa);                                                                 // configure IA32_PERF_FIXED_CTR0 to count in user mode and interrupt on overflow
Msr::write(Msr::MSR_PERF_FIXED_CTR0, (1ull<<48) - 0x1000);                                                 // overflow after 0x1000 instructions (48-bit counter assumed)
Msr::write(Msr::MSR_PERF_GLOBAL_OVF_CTRL, Msr::read<uint64>(Msr::MSR_PERF_GLOBAL_OVF_CTRL) & ~(1UL<<32));  // clear overflow condition
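
The constants written to the MSRs above can be derived rather than hard-coded. The following is only a minimal sketch: the helper names are hypothetical (they are not part of the code above), and the 48-bit counter width is an assumption that should really be read from CPUID leaf 0AH.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; the names are illustrative only. */

/* 4-bit control field for fixed counter `idx` in IA32_FIXED_CTR_CTRL (0x38D):
 * bit 0 = count at CPL 0, bit 1 = count at CPL > 0, bit 3 = PMI on overflow. */
uint64_t fixed_ctrl_field(unsigned idx, int count_os, int count_usr, int pmi)
{
    assert(idx < 16);
    uint64_t field = (count_os ? 1u : 0u) | (count_usr ? 2u : 0u) | (pmi ? 8u : 0u);
    return field << (4 * idx);
}

/* Value to preload so that a counter of the given width wraps to zero
 * after `events` increments. */
uint64_t counter_preload(unsigned width_bits, uint64_t events)
{
    assert(width_bits > 0 && width_bits < 64);
    uint64_t modulus = 1ull << width_bits;
    assert(events > 0 && events < modulus);
    return modulus - events;
}

/* Bit for fixed counter `idx` in IA32_PERF_GLOBAL_CTRL (0x38F). The same bit
 * position is used in IA32_PERF_GLOBAL_OVF_CTRL (0x390). */
uint64_t fixed_global_bit(unsigned idx)
{
    assert(idx < 32);
    return 1ull << (32 + idx);
}
```

With these, `fixed_ctrl_field(0, 0, 1, 1)` gives the 0xa used above and `counter_preload(48, 0x1000)` gives the preload value. As far as I read the Intel SDM, an overflow flag is cleared by writing a 1 to its bit in MSR 0x390, not by clearing that bit, so the last line of the pseudo-code may not do what its comment says.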

Post by Brendan »

Hi,

A couple of questions...

Which CPU is this for (and which CPU does Qemu emulate)? Note that performance monitoring counters are almost entirely "model specific" and code that works on one CPU will not work on a different CPU.

Why is it necessary to use something as painful and nasty as NMI (which has multiple unsolvable problems and should never be used for anything ever); especially for this case (where the performance monitoring counter only counts things that happen in user-space and can therefore only overflow while user-space code is running where interrupts should never be masked)?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by Schol-R-LEA »

Brendan wrote:Which CPU is this for (and which CPU does Qemu emulate)? Note that performance monitoring counters are almost entirely "model specific" and code that works on one CPU will not work on a different CPU.
I don't know if the OP edited it since you posted this or not, or if you simply missed it, but the shell invocation given

Code: Select all

qemu-system-x86_64 -cpu host
shows that it is emulating¹ an x86-64² CPU, and that it is to use the host CPU (which the OP stated was an Intel Core™ i5-4300U³) virtualized as-is rather than trying to imitate a different model.

Footnotes
1. Or in this case, virtualizing, since it is the same type as the host and KVM is enabled.
2. Most current installs of QEMU, on Linux at any rate, provide specialized versions for different CPU types to reduce the number of options necessary; the 'system' part indicates that it should emulate/virtualize as a full stand-alone system, rather than just an application compiled for a different ISA.
3. I think we can safely ignore the trademark notice here, or at least, render it correctly rather than making it look like it is part of the model name.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.

Post by Brendan »

Hi,
Schol-R-LEA wrote:
Brendan wrote:Which CPU is this for (and which CPU does Qemu emulate)? Note that performance monitoring counters are almost entirely "model specific" and code that works on one CPU will not work on a different CPU.
I don't know if the OP edited it since you posted this or not, or if you simply missed it, but the shell invocation given

Code: Select all

qemu-system-x86_64 -cpu host
shows that it is emulating¹ an x86-64² CPU, and that it is to use the host CPU (which the OP stated was an Intel Core™ i5-4300U³) virtualized as-is rather than trying to imitate a different model.
I honestly don't know what Qemu emulates (I rarely use Qemu); but I have a hunch that it emulates "generic ancient CPU with chronologically incorrect additional instruction set extensions", partly so that it doesn't need a huge amount of code to (e.g.) handle all the MSRs the same as all the different potential host CPUs might (including when the host CPU is ARM or PowerPC or something else that isn't 80x86 at all).

For example, for all I know, Qemu's CPUID might always return the same "family:model:stepping" as a Pentium Pro (with chronologically incorrect additional instruction set extensions for 64-bit/long mode, SSE, AVX, ...), including performance monitoring counters that are compatible with a Pentium Pro (with no attempt to be even slightly compatible with the performance monitoring in an Intel Core i5-4300U), where specifying "-cpu host" affects a few feature flags and the brand name string and nothing else.


Cheers,

Brendan

Post by Schol-R-LEA »

Ah, OK.

QEMU actually emulates... well, first off, it usually isn't emulating in the first place, at least when the guest is in the same family as the host. Contrary to the name, it is primarily a hypervisor, so it will use a virtualized sandbox on the host if the guest system matches (or is a subset of) the host. It can still be used in emulation mode if the host and guest match, in which case it can use 'hardware-accelerated emulation' to speed things up. This is the same way VirtualBox works, BTW, which isn't surprising as VirtualBox's DRC module was originally based in part on QEMU's.

Under Linux, both virtualization and hardware-accelerated emulation require that Kernel-based Virtual Machine (KVM) support is enabled, but AFAICT the Windows and MacOS versions can use it out of the box, using a system developed by Intel called Hardware Accelerated Execution Manager (HAXM). I have no idea what QEMU does in its ports to FreeBSD and other x86 OSes, nor how it works on non-x86 systems.

However, it can emulate a number of ISAs as well, and has support for several specific models for x86, ARM, MIPS, and SPARC. Support for Itanium and PowerPC is a good deal sparser IIUC, but it does have modules for them, and in some cases can run on hosts of those types as well.

While the Wikipedia page mentions Sandy Bridge emulation right now, this information is out of date: this changelog from a few versions back indicates that it can handle up to Skylake (or at least had been updated for some Skylake features at that point), and I am guessing that Kaby Lake and Coffee Lake updates are in progress. I will probably go correct that.

My understanding is that it is a modular system, so in principle it will work for any guest for which an ISA to internal representation module exists, and any host for which an IR to ISA module exists. QEMU is like GCC in this regard: at its heart, QEMU's emulation system is a protocol rather than a core function.

In emulation mode, it works by on-the-fly binary translation via dynamic recompilation rather than per-instruction interpretation (in other words, it is like the JVM JIT compiler rather than the JVM interpreter). This means that performance takes a bit of a hit up front as new parts of the foreign machine code get accessed, but overall performance is quite good - it will cache previously translated sections in order to avoid re-translation in larger programs or repeatedly invoked libraries, while still keeping the memory footprint from growing uncontrollably.

Post by parfait »

Hello all and thank you for your respective replies.
To answer @Brendan
Brendan wrote: Which CPU is this for (and which CPU does Qemu emulate)?
As @Schol-R-LEA said, Qemu is used here with the option -cpu host and KVM enabled; that is, it emulates the host CPU and exposes all its features to the guest as-is.
However, do you think that Qemu would have changed the performance monitoring MSR register addresses? This is mentioned nowhere.
Brendan wrote: Why is it necessary to use something as painful and nasty as NMI (which has multiple unsolvable problems and should never be used for anything ever); especially for this case (where the performance monitoring counter only counts things that happen in user-space and can therefore only overflow while user-space code is running where interrupts should never be masked)?
I know; I used NMI just to be sure to catch the performance monitoring interrupt, however slight the signal may be.
I also tried (actually at the beginning) a vectored interrupt (vector 164), with the same result: no interrupt is received.

Post by Brendan »

Hi,
parfait wrote:
Brendan wrote:Which CPU is this for (and which CPU does Qemu emulate)?
As @Schol-R-LEA said, Qemu is used here with the option -cpu host and KVM enabled; that is, it emulates the host CPU and exposes all its features to the guest as-is.
That page says that it "passes all available host processor features to the guest". This is completely different to "makes the guest CPU look and behave identical to the host CPU".

Specifically, "pass all available host processor features to the guest" could mean that it gets the feature flags from the host CPU (then ANDs it with a mask to remove features that Qemu or KVM can't/doesn't support for future proofing); then slaps those features into an ancient Pentium Pro with ancient Pentium Pro performance monitoring (if the "performance monitoring supported" feature flag was set).

So let's look at available facts:
  • Your code works on a real Intel Core TM i5-4300U
  • Your code does not work on a virtual "Intel Core TM i5-4300U" emulated by KVM
  • There must be a difference between a real Intel Core TM i5-4300U and the "Intel Core TM i5-4300U" emulated by Qemu/KVM
  • Qemu/KVM must not make the guest CPU look and behave identical to the host CPU
  • It should be easy for you to obtain the results of "CPUID, eax = 0x00000001" and "CPUID, eax = 0x0000000A" on the real CPU and on the "Intel Core TM i5-4300U" emulated by KVM and compare them; to see if there is a difference or not.
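
To make the comparison in the last point concrete, here is a minimal sketch of how the EAX value returned by "CPUID, eax = 0x00000001" splits into family, model and stepping, following the field layout documented for CPUID leaf 1. The struct and function names are made up for illustration, and the values would still have to be collected on both the host and the guest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded CPU signature. */
struct cpu_signature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Decode EAX of CPUID leaf 1 into display family/model/stepping. */
struct cpu_signature decode_signature(uint32_t eax)
{
    struct cpu_signature s;
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;

    s.stepping = eax & 0xF;
    /* The extended family is only added when the base family is 0xF. */
    s.family = (base_family == 0xF) ? base_family + ext_family : base_family;
    /* The extended model only applies to base families 0x6 and 0xF. */
    s.model = (base_family == 0x6 || base_family == 0xF)
                  ? (ext_model << 4) | base_model
                  : base_model;
    assert(s.model <= 0xFF);
    return s;
}

/* Returns 1 when host and guest report the same family/model/stepping. */
int same_signature(uint32_t host_eax, uint32_t guest_eax)
{
    struct cpu_signature h = decode_signature(host_eax);
    struct cpu_signature g = decode_signature(guest_eax);
    return h.family == g.family && h.model == g.model && h.stepping == g.stepping;
}
```

Even if `same_signature` reports a match, the contents of leaf 0x0000000A can still differ between host and guest, so both leaves are worth comparing.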
parfait wrote:However, do you think that Qemu would have changed the performance monitoring MSR register addresses? This is mentioned nowhere.
The MSR might not exist, and if it does exist it might not support the same events; but it's extremely unlikely that the MSR exists at a different address.

For example; maybe Qemu/KVM says that the performance monitoring feature is supported in "CPUID, eax = 0x00000001", but then also says that the instructions retired event is not supported in "CPUID, eax = 0x0000000A"; and maybe Linux auto-detects this and falls back to "non-architectural performance monitoring" and continues to work correctly because of this, and maybe your code doesn't work because you're assuming "guest is identical to host" and not doing any auto-detection.
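
The kind of auto-detection described above could look roughly like the sketch below. It assumes the CPUID leaf 0AH layout from the Intel SDM, where EAX[31:24] gives the length of the EBX bit vector and a set bit in EBX means the event is not available. The function name and the enum are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in CPUID.0AH:EBX for two architectural events. */
enum {
    ARCH_EVENT_CORE_CYCLES   = 0,
    ARCH_EVENT_INSTR_RETIRED = 1
};

/* Returns 1 if the architectural event at `event_bit` is reported as
 * available, given EAX and EBX from CPUID leaf 0AH. */
int arch_event_available(uint32_t eax, uint32_t ebx, unsigned event_bit)
{
    unsigned version    = eax & 0xFF;          /* 0 = no architectural perfmon */
    unsigned vector_len = (eax >> 24) & 0xFF;  /* number of valid EBX bits */

    assert(event_bit < 32);
    if (version == 0)
        return 0;
    if (event_bit >= vector_len)
        return 0;
    /* A set bit means "event NOT available". */
    return ((ebx >> event_bit) & 1u) == 0;
}
```

A kernel would call this with the real register values before programming the counter, and fall back to something else when it returns 0.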

Note that your pseudo-code looks correct to me; which means that if my hunch is wrong it's going to be hard to figure out what is happening. In other words, I'm mostly just using Occam's razor - testing the simplest (and easiest to check) theory first.


Cheers,

Brendan

Post by linuxyne »

The code shown below was slapped on the LongMode routine, taking advantage of the default configuration to avoid configuring APIC, etc. Some instructions were removed to make place for identity mapping the APIC PA, as well as to keep the binary size under 512 bytes.

Code: Select all

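        mov rcx, 0xfee00000             ; local APIC base, VA == PA == 0xfee00000
        mov dword [rcx + 0x340], 0x400  ; LVT performance counter entry: NMI delivery, unmasked

        mov eax, 0xff                   ; MSR 0x38f (global control): bits 0-7 enable general-purpose counters,
        mov edx, 0x1                    ; bit 32 enables fixed counter 0 (instructions retired)
        mov ecx, 0x38f
        wrmsr

        mov edx, 0xffff                 ; MSR 0x309 (fixed counter 0) = 0xffff_ffffffff,
        mov eax, 0xffffffff             ; i.e. one increment away from wrapping a 48-bit counter
        mov ecx, 0x309
        wrmsr

        xor edx, edx
        mov eax, 0xb                    ; MSR 0x38d (fixed counter control): count at CPL 0 and CPL > 0, PMI on overflow
        mov ecx, 0x38d
        wrmsr

loop:
        nop
        mov ecx, 0x38e                  ; MSR 0x38e (global status): read the overflow flags
        rdmsr
        hlt
jmp loop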

The qemu monitor shows the lapic settings. See LVTPC.

Code: Select all

QEMU 2.10.1 monitor - type 'help' for more information
(qemu) info lapic
info lapic
dumping local APIC state for CPU 0 

LVT0	 0x00008700 active-hi level                             ExtINT (vec 0)
LVT1	 0x00008400 active-hi level                             NMI   
LVTPC	 0x00000400 active-hi edge                              NMI
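
The raw LVT values in the monitor output can be decoded by hand. A minimal sketch, assuming the local vector table layout from the Intel SDM (vector in bits 7:0, delivery mode in bits 10:8, trigger mode in bit 15, mask in bit 16); the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded view of a local APIC LVT register. */
struct lvt_entry {
    unsigned vector;          /* bits 7:0  */
    unsigned delivery_mode;   /* bits 10:8 */
    int      level_triggered; /* bit 15    */
    int      masked;          /* bit 16    */
};

struct lvt_entry decode_lvt(uint32_t raw)
{
    struct lvt_entry e;
    e.vector          = raw & 0xFF;
    e.delivery_mode   = (raw >> 8) & 0x7;
    e.level_triggered = (raw >> 15) & 0x1;
    e.masked          = (raw >> 16) & 0x1;
    return e;
}

/* Human-readable name for an LVT delivery mode. */
const char *delivery_mode_name(unsigned mode)
{
    assert(mode < 8);
    switch (mode) {
    case 0:  return "Fixed";
    case 2:  return "SMI";
    case 4:  return "NMI";
    case 5:  return "INIT";
    case 7:  return "ExtINT";
    default: return "Reserved";
    }
}
```

For the LVTPC value 0x00000400 this yields delivery mode 4 (NMI), vector 0, edge triggered and not masked, which agrees with the monitor's own annotation.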

The detailed kvm traces show the NMI being accepted:

Code: Select all

qemu-system-x86-17546 [001] d.h. 383715.640069: kvm_apic_accept_irq: apicid 0 vec 0 (NMI|edge)
The above trace is printed by the function __apic_accept_irq in KVM, within the kernel. The function injects the NMI into the vcpu.

The VM goes into a reboot loop. Adjusting the number of instructions set within msr 0x309 gives control over avoiding or manifesting that reboot loop.
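
For anyone adjusting the value in msr 0x309, the arithmetic is sketched below: wrmsr takes the value split across edx:eax, and the counter wraps once it has been incremented past its maximum. This assumes a 48-bit counter, and the function names are made up for illustration. The interrupt itself may be recognized somewhat after the wrap.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the edx:eax pair used by wrmsr/rdmsr into one 64-bit value. */
uint64_t msr_value(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Number of increments before a counter of `width_bits` bits, currently
 * holding `value`, wraps around to zero. */
uint64_t increments_until_wrap(uint64_t value, unsigned width_bits)
{
    assert(width_bits > 0 && width_bits < 64);
    uint64_t modulus = 1ull << width_bits;
    assert(value < modulus);
    return modulus - value;
}
```

With edx = 0xffff and eax = 0xffffffff, as in the code above, the counter is a single increment away from wrapping, which is why the NMI arrives almost immediately.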

Post by Schol-R-LEA »

I was all set to suggest a slew of tests to narrow down the cause of the problem (rather than assuming that QEMU was garbage as Brendan appears to be doing); but when digging deeper, I found that the OP already got an answer about this over a month ago on the QEMU-Discuss forum.

TL;DR: Performance counters aren't virtualizable, meaning that for a hypervisor to support them, it has to trap the operations on them and emulate the behavior. Since the counters are model and family specific, this would require exhaustive support for the exact models being virtualized. QEMU has limited support for this, as do VMware and VirtualBox (I can't find any indication that Hyper-V covers them at all, but that only means I didn't find references to it), but none support it fully for all processors, even on processors they otherwise support well.

For more on this, I direct the reader to this post on the VMware site, which discusses the problem in more general terms. While the thread is about the support for Skylake PMI in VMWare, it explains that it is because the hypervisor needs to support the specific model families in detail for it to work properly.

Brendan: I am dismayed that you seemed to be so dismissive of QEMU, not because I have any stake in it (I am planning to write my own hypervisor, after all, and have no intention of even looking at its codebase, so seeing a fault in QEMU is no skin off my nose) but because I thought better of you than such high-handed behavior. The fact that you seemed to keep taking the (not particularly applicable) 'Emu' part of the name literally points to you making unwarranted assumptions and refusing to correct them when they are pointed out. While you were arguably correct in seeing it as a problem with QEMU, the flaw is one shared by all x86 hypervisors to some degree, and rooted in how PMI itself works, rather than a problem specific to QEMU.

I will admit, this is my impression of what you said, and I may be mis-reading you here; if so, I apologize for this. I will also admit that I failed in my own due diligence in my earlier post by focusing on the topic of what QEMU is or is not, rather than trying to address the OP's question.

Oh, and the "TM" isn't part of the model name, as I stated before; it is the trademark symbol. The way it appears in the initial post was an artifact of the way the text "Intel Core™ i5-4300U" got pasted into the forum in the first post (possibly due to a bug in the forum software, but I can't be sure). There is no "TM" series of Core processors AFAICT.

Post by Brendan »

Hi,
Schol-R-LEA wrote:Brendan: I am dismayed that you seemed to be so dismissive of QEMU, not because I have any stake in it (I am planning to write my own hypervisor, after all, and have no intention of even looking at its codebase, so seeing a fault in QEMU is no skin off my nose) but because I thought better of you than such high-handed behavior. The fact that you seemed to keep taking the (not particularly applicable) 'Emu' part of the name literally points to you making unwarranted assumptions and refusing to correct them when they are pointed out. While you were arguably correct in seeing it as a problem with QEMU, the flaw is one shared by all x86 hypervisors to some degree, and rooted in how PMI itself works, rather than a problem specific to QEMU.
I wasn't being dismissive of Qemu; I know that it's almost impossible for any emulator or virtual machine to support performance monitoring counters 100% correctly. I was merely trying to gather more information about what the problem was, starting from (what I considered as) the most likely problem (because it's almost impossible for any emulator or virtual machine to support performance monitoring counters 100% correctly).

For an example; if you're planning to write your own hypervisor; think about how you're going to implement an "instructions retired at any privilege level" performance monitoring counter that doesn't miscount (e.g. that doesn't add several hundred instructions to the counter when a single instruction like CPUID is executed).
linuxyne wrote:The VM goes into reboot loop. Adjusting the # of instructions set within msr 0x309 gives control over avoiding or manifesting that reboot loop.
In that case; I'd suspect a race condition in your code.

For example, maybe you configure the performance monitoring counter and then install an NMI exception handler immediately after that, and on real hardware the NMI occurs after the NMI exception handler has been set up so there's no problem; but on Qemu (where the counter probably counts "host instructions at CPL=3" where Qemu itself is a normal process running at CPL=3, and where the counter could overflow before Qemu is finished handling one of the final MSR accesses) the NMI might occur before you install an NMI exception handler.


Cheers,

Brendan

Post by linuxyne »

Brendan wrote:
linuxyne wrote:The VM goes into reboot loop. Adjusting the # of instructions set within msr 0x309 gives control over avoiding or manifesting that reboot loop.
In that case; I'd suspect a race condition in your code.

For example, maybe you configure the performance monitoring counter and then install an NMI exception handler immediately after that, and on real hardware the NMI occurs after the NMI exception handler has been set up so there's no problem; but on Qemu (where the counter probably counts "host instructions at CPL=3" where Qemu itself is a normal process running at CPL=3, and where the counter could overflow before Qemu is finished handling one of the final MSR accesses) the NMI might occur before you install an NMI exception handler.
No handlers of any type were installed. The only piece of code which was not pasted earlier contained changes to the page tables. The msr 0x38d is set up to count instructions in /both/ kernel and user mode. Since only kernel mode is active, the counter counts the appropriate number of instructions (at cpl = 0) before it overflows and causes the NMI to be triggered. However, if the msr is set up to count only user-mode instructions, the NMI is not triggered.

My point in mentioning the reboot loop was that it is probably expected behaviour for QEMU when dealing with an NMI. For instance, if the LongMode binary is taken as is and run under QEMU, and if an NMI is injected manually (through the monitor command 'nmi'), the VM resets itself, runs the same binary and ends up once again at the screen displaying the 'hello world' message. The loop, as seen under the effect of the additional PMU settings, was taken as evidence that the NMI is being recognized by the vcpu.

Post by Schol-R-LEA »

In that case, I reaffirm my earlier, pre-emptive apology. I will also have to recall that piece of advice later, thank you.

Post by Brendan »

Hi
linuxyne wrote:
Brendan wrote:
linuxyne wrote:The VM goes into reboot loop. Adjusting the # of instructions set within msr 0x309 gives control over avoiding or manifesting that reboot loop.
In that case; I'd suspect a race condition in your code.

For example, maybe you configure the performance monitoring counter and then install an NMI exception handler immediately after that, and on real hardware the NMI occurs after the NMI exception handler has been set up so there's no problem; but on Qemu (where the counter probably counts "host instructions at CPL=3" where Qemu itself is a normal process running at CPL=3, and where the counter could overflow before Qemu is finished handling one of the final MSR accesses) the NMI might occur before you install an NMI exception handler.
No handlers of any type were installed. The only piece of code which was not pasted earlier contained changes to the page tables. The msr 0x38d is set up to count instructions in /both/ kernel and user mode. Since only kernel mode is active, the counter counts the appropriate number of instructions (at cpl = 0) before it overflows and causes the NMI to be triggered. However, if the msr is set up to count only user-mode instructions, the NMI is not triggered.

My point in mentioning the reboot loop was that it is probably expected behaviour for QEMU when dealing with an NMI. For instance, if the LongMode binary is taken as is and run under QEMU, and if an NMI is injected manually (through the monitor command 'nmi'), the VM resets itself, runs the same binary and ends up once again at the screen displaying the 'hello world' message. The loop, as seen under the effect of the additional PMU settings, was taken as evidence that the NMI is being recognized by the vcpu.
Oh, OK.

Would you mind clearly summarising the symptoms? For example:
  • On Qemu/KVM, it causes an expected reboot loop ("works") if and only if the counter is configured such that "number of instructions until overflow" is high enough (or low enough?)
  • On Qemu/KVM, it fails to cause an expected reboot loop if and only if the counter is configured such that "number of instructions until overflow" is too low (or too high?)
  • On real hardware, it always causes an expected reboot loop regardless of how "number of instructions until overflow" is configured
  • It makes no difference (or does make some difference?) on real hardware or Qemu/KVM if a normal (e.g. "fixed delivery") IRQ is used instead of NMI

Cheers,

Brendan

Post by linuxyne »

I may be able to answer only some of Brendan's questions: The PoCs have not been tested when running raw (i.e. without qemu-kvm or an OS) on a physical machine. The behaviour of the msr 0x309 is: If we set a value -x into it, the PMI fires after the CPU retires (x+1) instructions of the appropriate type.



Below is a PoC which sets the msr 0x309 to -2147483648 for user-mode instructions.
The PMI fires after 2147483649 instructions are executed by the cpu in user mode. I did not actually count the instructions, but as we reduce the value of (x), the delay between the cpu resets reduces accordingly. If the number of instructions is kept small enough, one can verify the point when the PMI fires.

I believe that this PoC is sufficient for the OP to ascertain that qemu-kvm does indeed fire the PMIs after the desired number of user-mode instructions are retired. The OP may add NMI/IRQ handlers to further verify that it does or does not work. Whether or not IRQs other than NMI work needs to be verified (OP?).

The code is based on this article.
It was run on a qemu-kvm running on an i5-3330.

main.asm

Code: Select all

%define FREE_SPACE 0x9000
 
ORG 0x7C00
BITS 16
 
; Main entry point where BIOS leaves us.
 
Main:
    jmp 0x0000:.FlushCS               ; Some BIOS' may load us at 0x0000:0x7C00 while other may load us at 0x07C0:0x0000.
                                      ; Do a far jump to fix this issue, and reload CS to 0x0000.
 
.FlushCS:   
    xor ax, ax
 
    ; Set up segment registers.
    mov ss, ax
    ; Set up stack so that it starts below Main.
    mov sp, Main
 
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax
    cld
 
    ; Point edi to a free space bracket.
    mov edi, FREE_SPACE
    ; Switch to Long Mode.
    jmp SwitchToLongMode
 
 
BITS 64
.Long:
    hlt
    jmp .Long
 
 
BITS 16
%include "lm.asm"

; Pad out file.
times 510 - ($-$$) db 0
dw 0xAA55
lm.asm

Code: Select all

%define PAGE_PRESENT    (1 << 0)
%define PAGE_WRITE      (1 << 1)
%define PAGE_USER      (1 << 2)
%define PAGE_SIZE      (1 << 7)
 
%define CODE_SEG     0x0008
%define DATA_SEG     0x0010
 
ALIGN 4
IDT:
    .Length       dw 0
    .Base         dd 0
 
; Function to switch directly to long mode from real mode.
; Identity maps the first 2MiB.
; Uses Intel syntax.
 
; es:edi    Should point to a valid page-aligned 16KiB buffer, for the PML4, PDPT, PD and a PT.
; ss:esp    Should point to memory that can be used as a small (1 uint32_t) stack
 
SwitchToLongMode:
    ; Zero out the 16KiB buffer.
    ; Since we are doing a rep stosd, count should be bytes/4.   
    push di                           ; REP STOSD alters DI.
    mov ecx, 0x1000
    xor eax, eax
    cld
    rep stosd
    pop di                            ; Get DI back.
 
 
    ; Build the Page Map Level 4.
    ; es:di points to the Page Map Level 4 table.
    lea eax, [es:di + 0x1000]         ; Put the address of the Page Directory Pointer Table in to EAX.
    or eax, PAGE_PRESENT | PAGE_WRITE | PAGE_USER; Or EAX with the flags - present flag, writable flag, user flag.
    mov [es:di], eax                  ; Store the value of EAX as the first PML4E.


    ; Build the Page Directory Pointer Table.
    lea eax, [es:di + 0x2000]         ; Put the address of the Page Directory in to EAX.
    or eax, PAGE_PRESENT | PAGE_WRITE | PAGE_USER; Or EAX with the flags - present flag, writable flag, user flag.
    mov [es:di + 0x1000], eax         ; Store the value of EAX as the first PDPTE.


    ; Build the Page Directory.
    xor eax, eax                      ; Physical address 0: a single 2MiB page instead of a Page Table.
;    lea eax, [es:di + 0x3000]         ; Put the address of the Page Table in to EAX.
    or eax,  PAGE_SIZE | PAGE_PRESENT | PAGE_WRITE | PAGE_USER; Or EAX with the flags - 2MiB page, present flag, writable flag, user flag.
    mov [es:di + 0x2000], eax         ; Store the value of EAX as the first PDE.
 
    ; Map the local APIC at 0xfee00000 (VA == PA).
    ; The address splits into PML4 index 0, PDPT index 3, PD index 0x1f7 (9-bit index fields, then the 12-bit offset):
    ;  000000000 000000011 111110111 000000000 000000000000

    ; Add a second Page Directory to the Page Directory Pointer Table.
    lea eax, [es:di + 0x3000]         ; Put the address of the second Page Directory in to EAX.
    or eax, PAGE_PRESENT | PAGE_WRITE ; Or EAX with the flags - present flag, writable flag.
    mov [es:di + 0x1000 + 0x18], eax  ; Store the value of EAX as PDPTE 3 (offset 3 * 8 = 0x18).

    mov eax, 0xfee00000
    or eax,  PAGE_SIZE | PAGE_PRESENT | PAGE_WRITE ; Or EAX with the flags - 2MiB page, present flag, writable flag.
    mov [es:di + 0x3000 + 0xfb8], eax ; Store the value of EAX as PDE 0x1f7 (offset 0x1f7 * 8 = 0xfb8).

    ; Disable IRQs
    mov al, 0xFF                      ; Out 0xFF to 0xA1 and 0x21 to disable all IRQs.
    out 0xA1, al
    out 0x21, al
 
    lidt [IDT]                        ; Load a zero length IDT so that any NMI causes a triple fault.
 
    ; Enter long mode.
    mov eax, 10100000b                ; Set the PAE and PGE bit.
    mov cr4, eax
 
    mov edx, edi                      ; Point CR3 at the PML4.
    mov cr3, edx
 
    mov ecx, 0xC0000080               ; Read from the EFER MSR. 
    rdmsr    
 
    or eax, 0x00000100                ; Set the LME bit.
    wrmsr
 
    mov ebx, cr0                      ; Activate long mode -
    or ebx,0x80000001                 ; - by enabling paging and protection simultaneously.
    mov cr0, ebx                    
 
    lgdt [GDT.Pointer]                ; Load GDT.Pointer defined below.
 
    jmp CODE_SEG:LongMode             ; Load CS with 64 bit segment and flush the instruction cache
 
 
    ; Global Descriptor Table
ALIGN 4
GDT:
.Null:
    dq 0x0000000000000000             ; Null Descriptor - should be present.
 
.Kernel:
    dq 0x00209A0000000000             ; 64-bit code descriptor (exec/read).
    dq 0x0000920000000000             ; 64-bit data descriptor (read/write).
.User:
    dq 0x0020FA0000000000             ; 64-bit code descriptor (exec/read).
    dq 0x0000F20000000000             ; 64-bit data descriptor (read/write).
 
.Pointer:
    dw $ - GDT - 1                    ; 16-bit Size (Limit) of GDT.
    dd GDT                            ; 32-bit Base Address of GDT. (CPU will zero extend to 64-bit)
 
 
[BITS 64]      
LongModeUser:
	jmp LongModeUser

JumpUser:	
	mov rax, rsp
	push qword 0x23;ss
	push rax; rsp
	pushfq;
	push qword 0x1b; cs
	push qword LongModeUser; rip
	iretq;

LongMode:

	mov rcx, 0xfee00000
	mov dword [rcx + 0x340], 0x400

	mov eax, 0xff
	mov edx, 0x1
	mov ecx, 0x38f
	wrmsr
	mov edx, 0xffff
	mov eax, 0x80000000
	mov ecx, 0x309
	wrmsr

	xor edx, edx
	mov eax, 0xa; track only user-mode instructions
	mov ecx, 0x38d
	wrmsr
	jmp JumpUser

;not reached
loop: 
	hlt
    jmp loop                     ; You should replace this jump to wherever you want to jump to.
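
The offsets 0x18 and 0xfb8 used in lm.asm to map the local APIC follow from the virtual address 0xfee00000. A small sketch of that index arithmetic for 4-level paging, with hypothetical names and covering only addresses in the lower canonical half:

```c
#include <assert.h>
#include <stdint.h>

/* Indices into the four paging levels for one virtual address. */
struct pt_indices {
    unsigned pml4;
    unsigned pdpt;
    unsigned pd;
    unsigned pt;
};

/* Split a virtual address into its 9-bit table indices. */
struct pt_indices split_vaddr(uint64_t va)
{
    struct pt_indices i;
    assert(va < (1ull << 47));   /* lower canonical half only */
    i.pml4 = (va >> 39) & 0x1FF;
    i.pdpt = (va >> 30) & 0x1FF;
    i.pd   = (va >> 21) & 0x1FF;
    i.pt   = (va >> 12) & 0x1FF;
    return i;
}

/* Byte offset of entry `index` inside a 4KiB table of 8-byte entries. */
unsigned entry_offset(unsigned index)
{
    assert(index < 512);
    return index * 8;
}
```

For 0xfee00000 this gives PDPT index 3 (offset 0x18) and PD index 0x1f7 (offset 0xfb8). Because the PDE has the page size bit set, it maps a 2MiB page directly and no Page Table is consulted.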
Compilation step:

Code: Select all

./nasm -fbin main.asm -o hd
Running step:

Code: Select all

       qemu-system-x86_64 -cpu host -monitor tcp::4444,server,nowait \
                -d int,guest_errors,unimp,pcall,cpu_reset -hda hd -enable-kvm

Post by parfait »

Thank you all for your helpful contributions.
@linuxyne, I tried your code, and I got a reboot even before full execution.
When I traced the execution, it appeared that the faulty instruction is the wrmsr on line 131 of lm.asm:

Code: Select all

123 LongMode:
124
125   mov rcx, 0xfee00000
126   mov dword [rcx + 0x340], 0x400
127
128   mov eax, 0xff
129   mov edx, 0x1
130   mov ecx, 0x38f
131   wrmsr 
So, I suspect that the MSR 0x38f does not exist or is not supported as Brendan said (weird, isn't it?).
This is the result of CPUID leaf 0x0A (in the kernel, on qemu-system-x86_64 -cpu host --enable-kvm):
EAX 7300402
EBX 0
ECX 0
EDX 603
According to the Intel documentation, this means:
EAX[bits 7:0] = 2 => the architectural performance monitoring version is 2
But EBX = 0 => so none of the architectural events is supported
EDX[bits 4:0] = 3 => it has 3 fixed-function performance counters
However, EDX[bits 12:5] = 0x60 (96) seems very weird to me (too high?).
These values confuse me; any explanation is welcome.
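
As a cross-check of the reading above, here is a small sketch that decodes the four values according to the CPUID leaf 0AH layout in the Intel SDM, assuming the numbers printed are hexadecimal. The struct and function names are hypothetical. Under that layout EDX[12:5] has to be shifted right by 5 before masking, which gives 0x30 (48) rather than 0x60, and a zero bit in EBX means the event is available, so EBX = 0 would mean that no event is reported as missing.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID leaf 0AH that describe the performance counters. */
struct pmu_leaf {
    unsigned version;        /* EAX[7:0]   */
    unsigned gp_counters;    /* EAX[15:8]  */
    unsigned gp_width;       /* EAX[23:16] */
    unsigned ebx_vector_len; /* EAX[31:24] */
    unsigned fixed_counters; /* EDX[4:0]   */
    unsigned fixed_width;    /* EDX[12:5]  */
};

struct pmu_leaf decode_leaf_0a(uint32_t eax, uint32_t edx)
{
    struct pmu_leaf p;
    p.version        = eax & 0xFF;
    p.gp_counters    = (eax >> 8) & 0xFF;
    p.gp_width       = (eax >> 16) & 0xFF;
    p.ebx_vector_len = (eax >> 24) & 0xFF;
    p.fixed_counters = edx & 0x1F;
    p.fixed_width    = (edx >> 5) & 0xFF;
    return p;
}

/* Count the architectural events flagged as NOT available in EBX. */
unsigned unavailable_arch_events(uint32_t eax, uint32_t ebx)
{
    unsigned len = (eax >> 24) & 0xFF;
    unsigned count = 0;
    assert(len <= 32);
    for (unsigned bit = 0; bit < len; bit++)
        count += (ebx >> bit) & 1u;
    return count;
}

/* Enable bits that exist in MSR 0x38f for the given counter counts. */
uint64_t global_ctrl_valid_mask(unsigned gp_counters, unsigned fixed_counters)
{
    assert(gp_counters < 32 && fixed_counters < 32);
    return ((1ull << gp_counters) - 1) | (((1ull << fixed_counters) - 1) << 32);
}
```

With EAX = 0x07300402 and EDX = 0x603 this decodes to version 2, 4 general-purpose counters of 48 bits, a 7-bit EBX vector, and 3 fixed counters of 48 bits. One possible, unverified explanation for the fault at line 131: with 4 general-purpose and 3 fixed counters the valid bits of MSR 0x38f would be 0x70000000f, while the PoC writes 0x1000000ff, which also sets bits 4 to 7.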