OSDev.org

Posted: **Mon Feb 07, 2022 11:01 pm**

We are currently working on the rv6 project, which ports MIT's educational operating system xv6 to Rust. Our code is located here.
We run rv6 with qemu on qemu's virt platform, and it works well under plain qemu.
The command we run on the arm machine is:

Code:

RUST_MODE=release TARGET=arm KVM=yes GIC_VERSION=3
qemu-system-aarch64 -machine virt -kernel kernel/kernel -m 128M -smp 80 -nographic -drive file=fs.img,if=none,format=raw,id=x0,copy-on-read=off -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0 -cpu cortex-a53  -machine gic-version=3  -net none

To experiment with the speedup from KVM, we made rv6 support the arm architecture so that it can run on an arm machine. The arm architecture's driver code is located here.
The problem is that performance drops significantly when we use qemu with kvm.
The command we run on the arm machine with KVM is:

Code:

qemu-system-aarch64 -machine virt -kernel kernel/kernel -m 128M -smp 80 -nographic -drive file=fs.img,if=none,format=raw,id=x0,copy-on-read=off -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0 -cpu host -enable-kvm  -machine gic-version=3  -net none

We repeated the following benchmarks:
1. Write syscall (500 bytes) 10,000 times: kvm disabled: 4,500,000 us, kvm enabled: 29,000,000 us. (> 5 times)
2. Open/Close syscall 10,000 times: kvm disabled: 12,000,000 us, kvm enabled: 29,000,000 us. (> 5 times)
3. Getppid syscall 10,000 times: kvm disabled: 735,000 us, kvm enabled: 825,000 us. (almost the same)
4. Simple calculation (a = a * 1664525 + 1013904223) 100 million times: kvm disabled: 2,800,000 us, kvm enabled: 65,000,000 us. (> 20 times)
The elapsed time was measured with the uptime_as_micro syscall in rv6.
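
For readers who want to reproduce benchmark 4 outside rv6, here is a minimal host-side sketch of the same loop. It is not the rv6 benchmark code: the function name `lcg_iterate`, the 32-bit width of `a`, the seed, and the use of `std::time::Instant` in place of the uptime_as_micro syscall are all assumptions made for illustration.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Applies a = a * 1664525 + 1013904223 (mod 2^32) `n` times.
/// The 32-bit width is an assumption; the post does not state the type of `a`.
fn lcg_iterate(seed: u32, n: u64) -> u32 {
    let mut a = seed;
    for _ in 0..n {
        // Wrapping arithmetic makes the intended 32-bit overflow explicit.
        a = a.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
    }
    a
}

fn main() {
    let iterations: u64 = 100_000_000;
    let start = Instant::now();
    // black_box discourages the compiler from optimizing the loop away.
    let result = black_box(lcg_iterate(black_box(1), iterations));
    let elapsed = start.elapsed();
    println!("result = {result}, elapsed = {} us", elapsed.as_micros());
}
```

Because the loop touches almost no memory beyond the stack and the instruction stream, a large slowdown here points at how instruction and data accesses are cached rather than at syscall or device overhead.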

These results were hard to understand. Finding the bottleneck while running a user program was difficult, so we first looked for a bottleneck in rv6's boot process.
The first noticeable bottleneck in the boot process was here:

Code:

run.as_mut().init();
self.runs().push_front(run.as_ref());

As far as we know, this part just initializes a list and pushes an element. So we suspected that, for some reason, KVM is not actually working and is making the result worse. This code also runs before any interrupts are turned on, so we think arm's GIC and other interrupt-related things are unrelated to the problem.

So, how can I get better performance when using kvm with qemu?

To solve this problem, we have already tried the following:
1. Changing the qemu version (4.2, 6.2) and the virt version, and changing qemu-kvm options such as cpu, drive cache, copy-on-read, kernel_irqchip, and the number of cpu cores.
2. Looking for a kvm hypercall to use, but none exists on arm64.
3. Running lmbench on ubuntu under qemu with kvm to check that KVM itself is okay. Ubuntu with KVM was much faster than with qemu alone.
4. Checking whether the 16550a UART print code is slow with KVM enabled and skews the benchmark results. Without the bottleneck code, rv6's boot time was almost the same with KVM enabled or disabled.
5. Looking for other people with the same problem, but the suggestion on this superuser page did not work. Our clocksource is arch_sys_counter.

Our Environment
qemu-system-aarch64 version: 4.2.1 (Debian 1:4.2-3ubuntu6.19)
CPU model: Neoverse-N1
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 80
Ubuntu 20.04.3 LTS(Focal Fossa)
/dev/kvm exists

Thank you for your help.

Posted: **Tue Feb 08, 2022 2:18 am**

Much of this (AArch64) is still quite new to me, and I haven't dug too deep into your code, but when I hear "slow on KVM, fast on TCG", my first thought is memory configuration.

KVM, as an interface to the platform's hardware virtualization, will generally oblige in honoring whatever caching (and cache invalidation) you ask of it. TCG (qemu's software emulation mode) will not. Counterintuitively, this can mean that if you've set up your memory attributes incorrectly, or are trying to purge TLBs or invalidate instruction caches too often, then TCG can be immensely faster than KVM.

I haven't gone through and hunted down all the macros and functions, but while you set up the MAIR register, it doesn't look like you're setting memory attributes in your page entries; since you've got attribute 0 configured as device memory, your pages are all configured as device memory. If I'm not missing something and this is the case, it's likely to be responsible for your performance discrepancies. (It may also break some instructions, like exclusive loads, on some platforms, as they're only valid on normal memory)

Posted: **Tue Feb 08, 2022 4:51 am**

klange wrote:If I'm not missing something and this is the case, it's likely to be responsible for your performance discrepancies. (It may also break some instructions, like exclusive loads, on some platforms, as they're only valid on normal memory)

That is most likely the case indeed, device memory (nGnRnE specifically) disables caching, reordering, coalescing writes, and speculative accesses.

The exclusive store (which is the part that checks the lock) most likely breaks because the exclusive monitor is a part of the cache subsystem on most microarchitectures, like the Cortex-A72 used in the RPi4 SoC for example (but that is not specified by the ARMARM afaics).

Posted: **Wed Feb 09, 2022 12:14 am**

Thank you very much klange, qookie.

As you said, the main cause of the problem was MAIR_EL1.
I changed attribute 0 from device memory to normal memory, like this:

Code:

MAIR_EL1::Attr0_Device::nonGathering_nonReordering_EarlyWriteAck

to

Code:

MAIR_EL1::Attr0_Normal_Outer::WriteBack_NonTransient_ReadWriteAlloc +
MAIR_EL1::Attr0_Normal_Inner::WriteBack_NonTransient_ReadWriteAlloc

and now kvm is much faster than plain qemu (at least 10 times) in many benchmarks.
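
To make the change concrete, the sketch below builds the raw MAIR_EL1 value by hand rather than through the register crate used above. MAIR_EL1 holds eight 8-bit attribute fields; Device-nGnRE is encoded as 0x04, and Normal memory with inner and outer Write-Back Non-transient Read/Write-Allocate is encoded as 0xFF. The helper `set_mair_attr` and the constant names are hypothetical and exist only for this illustration.

```rust
/// Attribute byte for Device-nGnRE memory.
const DEVICE_NGNRE: u8 = 0x04;
/// Attribute byte for Normal memory, inner and outer Write-Back
/// Non-transient, Read-Allocate and Write-Allocate.
const NORMAL_WB_RW_ALLOC: u8 = 0xFF;

/// Returns `mair` with attribute field `index` (0..=7) replaced by `attr`.
/// Hypothetical helper; field n occupies bits [8n+7:8n] of MAIR_EL1.
fn set_mair_attr(mair: u64, index: u8, attr: u8) -> u64 {
    assert!(index < 8, "MAIR_EL1 has only eight attribute fields");
    let shift = 8 * u32::from(index);
    (mair & !(0xFFu64 << shift)) | (u64::from(attr) << shift)
}

fn main() {
    // Before the fix: attribute 0 describes device memory.
    let before = set_mair_attr(0, 0, DEVICE_NGNRE);
    // After the fix: attribute 0 describes cacheable normal memory.
    let after = set_mair_attr(0, 0, NORMAL_WB_RW_ALLOC);
    println!("before = {before:#x}, after = {after:#x}");
}
```

The value still has to be written to MAIR_EL1 with an `msr` instruction before the MMU is enabled; that step is left out here because it only runs on the target.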

I have one more question, klange.
I couldn't understand why you referred to this page.
Is there something more I should do?

Posted: **Wed Feb 09, 2022 12:22 am**

junsookun wrote: I have one more question, klange.
I couldn't understand why you referred to this page.
Is there something more I should do?

Some of the bits in the page entries are the index into the attribute table that will apply to that page. It might have been better for me to also link to here where you can see how not setting this index implies index 0. While you can set attribute 0 to be normal memory, you could also set all of your pages to use a different attribute index that is set as normal memory. In a more complete system, you want to pick and choose attributes based on what the memory is being used for.
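
As a rough sketch of that relationship: in an AArch64 stage 1 block or page descriptor, the AttrIndx field sits in bits [4:2] and selects one of the eight MAIR_EL1 attributes, so a descriptor that leaves those bits clear uses attribute 0. The helper names and the example descriptor value below are hypothetical and are not taken from rv6.

```rust
/// AttrIndx occupies bits [4:2] of a stage 1 block or page descriptor.
const ATTR_INDX_SHIFT: u32 = 2;
const ATTR_INDX_MASK: u64 = 0b111 << ATTR_INDX_SHIFT;

/// Reads which MAIR_EL1 attribute a descriptor selects.
fn attr_index(desc: u64) -> u8 {
    ((desc & ATTR_INDX_MASK) >> ATTR_INDX_SHIFT) as u8
}

/// Returns `desc` with its AttrIndx field set to `index` (0..=7).
fn with_attr_index(desc: u64, index: u8) -> u64 {
    assert!(index < 8, "AttrIndx is a 3-bit field");
    (desc & !ATTR_INDX_MASK) | (u64::from(index) << ATTR_INDX_SHIFT)
}

fn main() {
    // Hypothetical page descriptor: output address 0x4000_0000,
    // access flag (bit 10) set, low bits 0b11 marking a valid page.
    let desc: u64 = 0x4000_0000 | (1 << 10) | 0b11;
    // AttrIndx was never set, so the page uses MAIR attribute 0.
    println!("default index = {}", attr_index(desc));
    // Point the page at attribute 1 instead, e.g. a normal-memory entry.
    let normal = with_attr_index(desc, 1);
    println!("new descriptor = {normal:#x}, index = {}", attr_index(normal));
}
```

With a layout like this, attribute 0 can stay as device memory for MMIO mappings while RAM pages select a normal-memory attribute through their own index.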

Posted: **Wed Feb 09, 2022 12:36 am**

klange wrote: Some of the bits in the page entries are the index into the attribute table that will apply to that page. It might have been better for me to also link to here where you can see how not setting this index implies index 0. While you can set attribute 0 to be normal memory, you could also set all of your pages to use a different attribute index that is set as normal memory. In a more complete system, you want to pick and choose attributes based on what the memory is being used for.

Oh, now I understand.
Thank you for your time and detailed explanation. It was very helpful.

Using qemu+kvm is slow than using qemu in rv6(xv6 rust port)
