simple code line is much slower on rv6 (Rust OS) than on Ubuntu

junsookun
Posts: 6
Joined: Mon Feb 07, 2022 9:24 pm

simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by junsookun »

We are currently working on the rv6 project, which ports MIT's educational operating system xv6 to Rust. Our code is located here.
We use QEMU's virt platform to run rv6, and it works well under QEMU.
An example command for running rv6:

Code: Select all

RUST_MODE=release TARGET=arm KVM=yes GIC_VERSION=3; # compile
qemu-system-aarch64 -machine virt -kernel kernel/kernel -m 128M -smp 1 -nographic -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0 -cpu host -enable-kvm -machine gic-version=3
We are now comparing the system call speed (specifically, elapsed wall-clock time) of qemu+rv6+kvm against qemu+ubuntu 18.04+kvm using lmbench.
For Ubuntu, the qemu command is:

Code: Select all

qemu-system-aarch64 -cpu host -enable-kvm -device rtl8139,netdev=net0 -device virtio-scsi-device -device scsi-cd,drive=cdrom -device virtio-blk-device,drive=hd0 -drive "file=${iso},id=cdrom,if=none,media=cdrom" -drive "if=none,file=${img_snapshot},id=hd0" -m 2G -machine "virt,gic-version=3,its=off" -netdev user,id=net0 -nographic -pflash "$flash0" -pflash "$flash1" -smp 1
Our goal is to make rv6 perform similarly to, or faster than, Ubuntu for relatively simple system calls like getppid().
By "relatively simple system call" we mean one whose actual work is trivial, as in the case of getppid(); such a benchmark mainly measures the time of the user space -> kernel space -> user space transition.
We therefore expected rv6 to show performance similar to or better than Ubuntu's on the getppid() syscall, since rv6 is a much simpler system.
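For reference, "lat_syscall null" effectively measures something like the loop below. This is only a minimal sketch of the idea, not lmbench's actual code, and the iteration count is illustrative:

Code: Select all

/* Minimal sketch of a null-syscall latency test: time the round trip
 * of a trivial syscall, user -> kernel -> user. Not lmbench's code. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        getppid();                      /* trivial syscall round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per getppid()\n", ns / ITERS);
    return 0;
}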

The most important problem, described in detail below, is that a single simple line of code in the getppid path takes a very long time in the kernel (about 0.08 microsec, which is about 70% of Ubuntu's entire getppid() syscall time). So we wonder whether there is a problem with our qemu or kvm settings.

First, the measured results for lmbench's "lat_syscall null", which internally calls getppid(), are:
- rv6, Rust opt-level: 1, smp 3 (qemu), gcc optimization level: -O -> average 1.662 microsec
- ubuntu, smp 3, gcc optimization level: -O -> average 0.126 microsec
So rv6 is more than 10x slower than Ubuntu.

To find rv6's bottleneck, we used Linux perf and divided the execution path into four stages:
Stage 1: from the getppid call in user space until just before the trap handler is called
Stage 2: from the end of stage 1 until just before the start of the code specific to sys_getppid
Stage 3: from the end of stage 2 to the end of the actual sys_getppid function
Stage 4: from the end of stage 3 to the point where the getppid syscall returns to user space
The perf results were:
- ubuntu: 0.042 microsec / 0.0744 microsec / 0.00985 microsec / 0 -> total 0.126 microsec
- rv6: ? / ? / 0.3687 microsec / ? -> total 1.662 microsec
- We assumed Ubuntu's stage 4 time to be zero.
- The question marks mean we could not use perf on rv6, so only stage 3 has been measured for now; we checked the stage 3 part manually.

From these results, rv6's stage 3 alone already consumes nearly 3x Ubuntu's total syscall time, and at least 30x Ubuntu's stage 3.
This is quite bad, so we tried several things to investigate the problem:
- Checked whether rv6's timer interrupt affects execution time: the interval is 100 ms, which is very long, so it seems unrelated.
- To check user-space execution speed, we wrote a simple quicksort program and checked whether rv6's user space is significantly slower than Ubuntu's (a sketch of the test follows below).
- Over 100,000 runs, rv6 (smp 1, opt-level 1) took 3.2 s vs Ubuntu (smp 1) at 2.7 s.
- Although rv6 is 20% slower, this is negligible compared to the lmbench gap, so we judged it not to be the big problem.
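The test was conceptually shaped like the sketch below. This is not our exact program (and on rv6 the timing and sorting go through its own user library rather than libc), just an illustration of the kind of test we ran; the array size is illustrative:

Code: Select all

/* Sketch of the user-space CPU test: repeatedly quick-sort a small
 * array and time the whole run. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    1000       /* illustrative size, not our exact parameter */
#define RUNS 100000

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static int src[N], buf[N];
    for (int i = 0; i < N; i++)
        src[i] = rand();

    clock_t start = clock();
    for (int r = 0; r < RUNS; r++) {
        memcpy(buf, src, sizeof buf);   /* reset to unsorted input */
        qsort(buf, N, sizeof buf[0], cmp_int);
    }
    clock_t end = clock();

    printf("%.2f s total\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}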

- Next we checked the code for rv6's stage 3: https://github.com/kaist-cp/rv6/blob/ma ... cs.rs#L468
- The lock is acquired twice, at line 469 and line 472, whereas in the corresponding Ubuntu code the lock is taken only once. So we noted that restructuring the code to take the lock only once should improve speed.
- There is also a big problem at line 470. We measured line 470 with the CNTPCT_EL0 register (see the sketch after this list) and found that at least 0.08 microsec is consumed on that line alone.
- So Ubuntu's entire stage 3 takes about 0.01 microsec, but line 470 of rv6 alone, which has no complicated logic (and takes no lock), consumes about 8x Ubuntu's whole stage 3.
- We therefore concluded that there may be a problem with the kvm settings on the kernel side, or with some other settings.
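For reference, we timestamp a single line with the ARM generic timer roughly as in the sketch below. It is shown in C for brevity (rv6 itself would use Rust inline asm), and the conversion to nanoseconds via CNTFRQ_EL0 is only indicated in a comment:

Code: Select all

/* Sketch: timestamping one line of kernel code with the ARM generic
 * timer. The ISB keeps the counter read from being reordered with
 * the code under measurement. */
#include <stdint.h>

static inline uint64_t read_cntpct(void)
{
    uint64_t val;
    asm volatile("isb; mrs %0, cntpct_el0" : "=r"(val) : : "memory");
    return val;
}

/* Usage around the suspect line:
 *
 *   uint64_t t0 = read_cntpct();
 *   ...line 470 under test...
 *   uint64_t t1 = read_cntpct();
 *   // nanoseconds = (t1 - t0) * 1000000000 / CNTFRQ_EL0
 */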

Do you have any ideas about this problem? Thank you for your help.

Our Environment
qemu-system-aarch64 version: 4.2.1 (Debian 1:4.2-3ubuntu6.19)
CPU model: Neoverse-N1
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 80
Ubuntu 20.04.3 LTS(Focal Fossa)
/dev/kvm exists
Last edited by junsookun on Wed Apr 06, 2022 9:39 pm, edited 1 time in total.
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: getppid is much slower on rv6 (xv6 Rust port) than qemu+ubuntu

Post by thewrongchristian »

junsookun wrote: From these results, rv6's stage 3 alone already consumes nearly 3x Ubuntu's total syscall time, and at least 30x Ubuntu's stage 3.
This is quite bad, so we tried several things to investigate the problem:
- Checked whether rv6's timer interrupt affects execution time: the interval is 100 ms, which is very long, so it seems unrelated.
- To check user-space execution speed, we wrote a simple quicksort program and checked whether rv6's user space is significantly slower than Ubuntu's.
- Over 100,000 runs, rv6 (smp 1, opt-level 1) took 3.2 s vs Ubuntu (smp 1) at 2.7 s.
- Although rv6 is 20% slower, this is negligible compared to the lmbench gap, so we judged it not to be the big problem.

- Next we checked the code for rv6's stage 3: https://github.com/kaist-cp/rv6/blob/ma ... cs.rs#L468
- The lock is acquired twice, at line 469 and line 472, whereas in the corresponding Ubuntu code the lock is taken only once. So we noted that restructuring the code to take the lock only once should improve speed.
- There is also a big problem at line 470. We measured line 470 with the CNTPCT_EL0 register and found that at least 0.08 microsec is consumed on that line alone.
- So Ubuntu's entire stage 3 takes about 0.01 microsec, but line 470 of rv6 alone, which has no complicated logic (and takes no lock), consumes about 8x Ubuntu's whole stage 3.
- We therefore concluded that there may be a problem with the kvm settings on the kernel side, or with some other settings.

Do you have any ideas about this problem? Thank you for your help.
Don't be too hard on yourself. Linux has had 30 years of optimisation effort behind it.

Looking at the Linux implementation, stage 3 for Linux doesn't even take a full lock. It does an RCU read of the parent task structure, which is a very cheap operation. An atomic pointer read might be all that is required to get the parent task, from which the parent pid can be returned.
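From memory, the implementation looks roughly like the following. Treat it as a paraphrase of kernel/sys.c rather than an exact copy:

Code: Select all

/* Linux's getppid, approximately as in kernel/sys.c: no spinlock is
 * taken; the parent task pointer is read under RCU, which is close
 * to free on the read side. */
SYSCALL_DEFINE0(getppid)
{
	int pid;

	rcu_read_lock();
	pid = task_tgid_vnr(rcu_dereference(current->real_parent));
	rcu_read_unlock();

	return pid;
}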

Very few applications spend much time doing such trivial system calls as getppid. Whilst getppid is good for determining the overhead of trivial system calls, most system calls spend most of their time doing stuff other than entering and leaving the kernel.

You'd be better off profiling a representative workload, and optimising where your kernel is actually spending its time.
junsookun
Posts: 6
Joined: Mon Feb 07, 2022 9:24 pm

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by junsookun »

Thank you for your reply, thewrongchristian.

I think I misled you with some of the content of the post, so I have edited it.

The main point of this post is that a single simple line of code in the rv6 kernel consumes far more time than it does on Ubuntu, so there may be a problem with the kvm settings on the kernel side or with some other settings, as stated in:
- There is also a big problem at line 470. We measured line 470 with the CNTPCT_EL0 register and found that at least 0.08 microsec is consumed on that line alone.
- So Ubuntu's entire stage 3 takes about 0.01 microsec, but line 470 of rv6 alone, which has no complicated logic (and takes no lock), consumes about 8x Ubuntu's whole stage 3.
Could you please let me know if you have any ideas?
Ethin
Member
Posts: 625
Joined: Sun Jun 23, 2019 5:36 pm
Location: North Dakota, United States

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by Ethin »

Honestly (and other OSDevers, correct me if I'm wrong), you shouldn't worry about syscall speed, especially right now. Syscalls are so fast that you'd have to make billions of them within a second or so to actually make them "slow". What you should concern yourself with, especially right now, is the speed of the code the syscall runs. For example, you should optimize your file I/O routines, because those need to be as fast as possible. The code your syscall runs is going to take far longer than the syscall transition itself, so that's what you need to optimize. Of course, you should profile under a "typical" load (e.g. a full UI running, with background processes and such), and you should never optimize blindly.
junsookun
Posts: 6
Joined: Mon Feb 07, 2022 9:24 pm

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by junsookun »

Thank you for your reply, Ethin.

In fact, we compared the performance of rv6 and Ubuntu on many other syscalls, such as sys_read, sys_stat, sys_fstat, sys_open, sys_write, sys_fork, etc., in addition to getppid, and almost all results showed rv6 to be at least 10 times slower than Ubuntu.
We are therefore analyzing the problem starting from the simplest syscall, getppid, as an entry point for system optimization.
This is because we suspect the overall slowness of rv6 may be due to some setting of qemu, of rv6, or of some specific kernel configuration.
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by linuxyne »

You can write, in C instead of in Rust, an extremely small "OS" with support
for just enough constructs to measure the time taken by the syscall
(or the parts of it) in which you are interested.

This should allow you to focus your tests on the areas you choose.
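For example, the measurement core of such a minimal kernel's user program could look something like the sketch below. This is only an illustration: it assumes the EL1 SVC handler does nothing but return, and that CNTKCTL_EL1.EL0VCTEN has been set so EL0 may read cntvct_el0:

Code: Select all

/* Sketch: user-side (EL0) loop timing N trivial SVCs, to isolate the
 * bare EL0 -> EL1 -> EL0 round-trip cost. */
#include <stdint.h>

static inline uint64_t read_cntvct(void)
{
    uint64_t v;
    asm volatile("isb; mrs %0, cntvct_el0" : "=r"(v) : : "memory");
    return v;
}

static inline void null_svc(void)
{
    asm volatile("svc #0" : : : "memory");  /* handler just erets */
}

/* Returns average counter ticks per syscall round trip. */
uint64_t measure_null_syscall(uint64_t iters)
{
    uint64_t t0 = read_cntvct();
    for (uint64_t i = 0; i < iters; i++)
        null_svc();
    return (read_cntvct() - t0) / iters;
}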
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by nullplan »

Dumb question, but if you think the choice of language may be the problem here, and rv6 is a translation of something that already exists in another language (namely xv6), then why not compare these two? That way you would know if it is Rust or your way of translating from C to Rust that's doing you in, or if you have other problems.
Carpe diem!
linuxyne
Member
Posts: 211
Joined: Sat Jul 02, 2016 7:02 am

Re: simple code line is much slower on rv6 (Rust OS) than on Ubuntu

Post by linuxyne »

nullplan wrote:Dumb question, but if you think the choice of language may be the problem here, and rv6 is a translation of something that already exists in another language (namely xv6), then why not compare these two? That way you would know if it is Rust or your way of translating from C to Rust that's doing you in, or if you have other problems.
That depends on whether xv6 supports aarch64 (assuming the official version is still in C). If it does, then yes, they can compare the two aarch64 systems. Otherwise, it depends on how easy it is to port xv6 to aarch64 in C, which would still duplicate effort, since they are already working on such a port, albeit in Rust.

Still, one can continue along the road already taken by writing a small enough OS in Rust (or stripping the existing implementation down to a bare minimum) and checking the syscall performance, thereby avoiding any change to the chosen language. The choice of language only came into the picture because they are comparing Linux (written in C) with their system rv6 (written in Rust). Writing something small enough that just jumps back and forth between user and kernel should not be very difficult, in either Rust or C. If the minimal Rust example shows no extra delays, then they can at least conclude that qemu-kvm is working as expected.

---

I remember that kvm supports tracing, which might be useful here, especially if the guest (i.e. rv6/xv6) doesn't support the ARM perf counters. This should allow you to compare rv6's performance, though you will have to test whether the traces are of any use to you in the first place.

Edit: Just a thought: for GICv3, do you use memory-mapped IO to access the GIC registers, or the system registers (ICC_)? Although it remains to be seen whether either method affects the behaviour targeted here, such differences may show up as differences in timing/delay. In any case, the kvm traces, if they capture relevant information, should help you determine the time spent at various points, to the extent such traces allow.

Edit 2: How is the memory in use mapped? What are its cacheability and shareability settings?