Profiling tools for system level

nexos
Member
Posts: 1081
Joined: Tue Feb 18, 2020 3:29 pm
Libera.chat IRC: nexos

Profiling tools for system level

Post by nexos »

Hello,
I want to try to profile some parts of my OS (PMM, interrupts, etc.). What tools could be used for this? I know GNU has gprof, but that won't work at system level, AFAIK. Any ideas?
Thanks,
nexos
"How did you do this?"
"It's very simple — you read the protocol and write the code." - Bill Joy
Projects: NexNix | libnex | nnpkg
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Profiling tools for system level

Post by nullplan »

Easiest: if you just want to know the run times of selected algorithms, measure the TSC before and after, and log the difference. Otherwise, there are performance counters you can use. You would have to read the documentation on those; I can't really help you there. Apparently they let you log in very fine detail exactly what is going on in your code.

Beyond that, there isn't a whole lot. You can register a timer interrupt and record where the timer interrupt hit. To be useful, the timer needs to fire fairly often, but that limits what you can do inside the interrupt, and if you want to do things like record a stack trace, that could easily exceed your allotted time. The more you do in the timer interrupt, the bigger the observer effect.

But for starters, TSC differences can really help you find your hot spots in the code, and make an improvement actually quantifiable.
Carpe diem!
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Profiling tools for system level

Post by Korona »

nullplan made some good points, let me expand on performance monitoring counters (PMCs). For the actual implementation, I can point you to our code for Intel and AMD, but I won't explain them in detail unless there are questions (the details are in the manuals anyway).
nullplan wrote:Beyond that, there isn't a whole lot. You can register a timer interrupt, and measure where the timer interrupt hit. But to be useful, you need that timer to happen pretty often, but that limits what you can do inside the interrupt. And if you would like to do things like record the stack trace, that could easily go over your allotted time. The more you do in the timer int, the bigger the observer effect.
When using PMCs, this is the strategy that you want to use. In a kernel, it can be useful to make the PMC (or timer) fire an NMI, so that you can also profile code that runs with interrupts disabled. The only difficulty is that the resulting profiling data is much harder to process in NMI context. One strategy is to push the data to a ring buffer that you drain from non-NMI context. This is the strategy we have implemented in Managarm (I've written about this before). We then dump the data to virtio-console and use a program on the host to parse it and look up the instructions in the source code (using addr2line).
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Profiling tools for system level

Post by thewrongchristian »

nexos wrote:Hello,
I want to try to profile some parts of my OS (PMM, interrupts, etc.). What tools could be used for this? I know GNU has gprof, but that won't work at system level, AFAIK. Any ideas?
Thanks,
nexos
My plan (I have lots of plans, have you noticed?) is to profile based on IRQ stack traces. Interrupts won't arrive at uniform intervals, but as they're also not correlated with the foreground task, they should yield reasonably representative profile data without having to force extra profiling timer interrupts.

My design will be:

- Maintain a map of function ptr -> profile data for each kernel symbol.
- When profiling is enabled, on an IRQ, generate a stack trace. For each function in the stack trace, find its profile information in the above map and increment a usage count. I'm not sure yet how to handle recursive functions, as they'd be counted multiple times.

The map will also double as a useful symbol table lookup, as the profile information can also contain the function name details, so I can use the data for more than just profiling (panic or exception back traces come to mind).

Then I'll just need a mechanism to snapshot and read the profile information to whatever will consume it.

In terms of cost, I can't see it being massively expensive to maintain such information. The map will have one entry per function, and that's fewer than 1,000 symbols in my current kernel, so with a balanced BST it takes fewer than 10 comparisons per back-trace pointer to look up the profile information. A back trace might be tens of functions deep in the worst case, but more than likely execution isn't in the kernel at all (probably in user code or the idle loop).

If you want to use GCC's profiling support, read up on the -pg option, which instruments code with calls to a function called mcount, which the gprof support library provides (and which you'd have to provide yourself) and which records the necessary profile information. See this page for details.
nexos
Member
Posts: 1081
Joined: Tue Feb 18, 2020 3:29 pm
Libera.chat IRC: nexos

Re: Profiling tools for system level

Post by nexos »

nullplan wrote:Easiest: If you just want to know the run times of selected algorithms, measure TSC before and after, and log the difference.
That's probably what I'm going to do. It makes the most sense to use the TSC.
"How did you do this?"
"It's very simple — you read the protocol and write the code." - Bill Joy
Projects: NexNix | libnex | nnpkg
vvaltchev
Member
Posts: 274
Joined: Fri May 11, 2018 6:51 am

Re: Profiling tools for system level

Post by vvaltchev »

nexos wrote:
nullplan wrote:Easiest: If you just want to know the run times of selected algorithms, measure TSC before and after, and log the difference.
That's probably what I'm going to do. It makes the most sense to use the TSC.
That's what I do at the moment too. My project is not big enough to "deserve" a proper profiling infrastructure (Linux didn't get ftrace until 2.6.27), so using the TSC is good enough.

When it's about algorithms that can run outside of the kernel, like the memory allocator for example, I have some performance tests that are runnable as unit tests. So I build that code for the host architecture (or for the target arch as well, if TARGET=x86) and run it on Linux. Not only is the TSC useful there (beware of noise: always check that the measurements are stable), but I can also use more advanced profiling tools like perf. When almost the same code can run as a regular usermode program on Linux, we have a ton of tools for profiling our algorithms.

Obviously, the most accurate measurements are made on Tilck itself, while running on real hardware. In that case I use just RDTSC, but the benchmark code is the same as in the unit tests. The good news is that, after comparing my measurements on Tilck + real HW with the benchmarks in the unit tests, I get very similar results. Linux's overhead is low, if:

- the test does enough iterations
- I run the test multiple times anyway checking that results are stable
- the CPU governors are set to performance instead of powersave
- the machine does nothing else at the moment
Tilck, a Tiny Linux-Compatible Kernel: https://github.com/vvaltchev/tilck