GPF and IRQ performance


GPF and IRQ performance

Post by Brendan »

Hi,

I've been thinking about a new IRQ handling scheme. For clarity I'll describe the "standard" method, and then the new method...


Standard Method

This involves a separate IDT descriptor and a separate assembly handler for each IRQ. Assume the IRQ handlers all look like this (where NN is the IRQ number):

Code: Select all

IRQ_handler_NN:
    push eax
    mov eax, NN                     ;eax = IRQ number
    jmp common_IRQ_handler

    ...

common_IRQ_handler:
Now assume the CPU has been busy and contains nothing useful in its caches when an IRQ occurs. In this case I'd expect a cache miss when the CPU looks up the IDT entry, another cache miss when it gets to the assembly handler, and a third cache miss when it jumps to "common_IRQ_handler".
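
(For reference, a minimal sketch of what "common_IRQ_handler" might do for the legacy PICs - this is only an illustration; a real handler would save more registers and dispatch to the right driver before sending the EOI:)

Code: Select all

common_IRQ_handler:
    cmp eax,8                   ;eax = IRQ number (set by the stub)
    jb .master_only
    mov al,0x20
    out 0xA0,al                 ;non-specific EOI to the slave PIC (IRQ 8 to 15)
.master_only:
    mov al,0x20
    out 0x20,al                 ;non-specific EOI to the master PIC
    pop eax                     ;restore the eax pushed by the stub
    iretd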


GPF Method

For the alternative, the IDT limit is set short and the IDT itself only contains 36 descriptors. The IRQs are mapped to interrupt vectors that are above the IDT limit.
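
(A minimal sketch of how the short IDT might be loaded - NASM syntax, with "idt_base" as a placeholder for wherever the 36 descriptors actually live:)

Code: Select all

idt_ptr:
    dw 36 * 8 - 1               ;limit = 0x11F, so only vectors 0 to 35 are valid
    dd idt_base                 ;linear address of the 36-entry IDT

    lidt [idt_ptr]              ;any vector above the limit now causes a GPF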

When an IRQ occurs, the CPU checks the vector against the IDT limit and raises a GPF instead of using an IDT descriptor. The vector number is automatically stored on the stack by the CPU as part of the GPF error code (bits 3 to 15 hold the vector, bit 1 is set because the reference was to the IDT, and bit 0 is set because the event was external). The GPF handler begins with code like this:

Code: Select all

GPF_handler:
    push eax
    test byte [esp + 4], 1          ;EXT bit set in the error code (was it an IRQ)?
    je real_GPF_handler             ; no, it's a real GPF

    mov eax, [esp + 4]              ;eax = error code
    shr eax, 3                      ;eax = vector number (IRQ number + 36)
In this case (assuming the CPU contains nothing useful in its caches) I'd expect a cache miss when the CPU looks up the IDT entry and another cache miss when it gets to the GPF handler (or 2 cache misses instead of 3).

Because the GPF handler would be re-used for all IRQs, it's also more likely that it would already be in the cache.

On modern CPUs a cache miss is costly, but I also expect that (if cache miss penalties are ignored, or everything is already in the CPU's cache) a GPF will take longer than a normal interrupt.


Questions

So, which method would be faster on average?

How could the difference be measured with RDTSC (both with and without cache misses)?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Re:GPF and IRQ performance

Post by Colonel Kernel »

This sounds like premature optimization to me. It sounds interesting, but is it really worth spending that much time on?
Brendan wrote: So, which method would be faster on average?
Who knows... the only way is to benchmark it.
How could the difference be measured with RDTSC (both with and without cache misses)?
<shrug/> I thought you were the expert on RDTSC. ;)

Generally speaking, you have to be careful here when measuring the impact of the change because your OS doesn't spend all its time handling IRQs. You would need to figure out some "common" scenarios that combine the expected hardware configurations with the expected workloads and see if the change makes a difference. Only you know what is "expected" for your OS.
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!

Re:GPF and IRQ performance

Post by Cjmovie »

Also, with cache sizes increasing yearly, I don't see something like IRQs (where you're going to, for instance, get a PIT interrupt at LEAST 18.2 times a second) getting out of the cache. I think the cache is smart enough to keep it around. Then again, I could be wrong.

Next generation processors are going to have 1 MB cache, STANDARD.

I agree, it is an optimization I wouldn't want to spend too much time on.

Sorry if I sound on a rant, rather frustrated right now with other stuff.....

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Colonel Kernel wrote:This sounds like premature optimization to me. It sounds interesting, but is it really worth spending that much time on?
The problem with premature optimization is that it usually complicates things too early.

For this, it simplifies the IDT a little and removes the need for a collection of assembly stubs, so IMHO it doesn't make things more complicated.

More important is that I find it interesting - it's something I haven't tried before (I probably wouldn't have considered it if it wasn't for a note at the bottom of the IDT page in the Wiki). At the moment I'm rewriting/updating code and none of it is really new or exciting, and I need something interesting! :)
Colonel Kernel wrote:Generally speaking, you have to be careful here when measuring the impact of the change because your OS doesn't spend all its time handling IRQs. You would need to figure out some "common" scenarios that combine the expected hardware configurations with the expected workloads and see if the change makes a difference. Only you know what is "expected" for your OS.
I couldn't really figure out how the OS will be used in advance - someone might use it to edit text files (an occasional IRQ) while someone else might use it for a server with four 1000 Mbps ethernet cards going flat out (lots of IRQs).

Instead, I'd want to measure the time between the CPU receiving the IRQ and the common IRQ handling code being started.

The most reliable way of measuring it that I can think of would be to get the local APIC to send an IPI to itself - that way I can know when the IRQ is about to happen and any hardware delays between triggering the IRQ and the IRQ arriving at the CPU would be as small as possible.

Something like:

Code: Select all

.preload:
    trigger IPI                  ;take one interrupt first so the handler path is "warm"
    hlt
    mov ecx,0x10000              ;number of measured interrupts
.next:
    cli
    trigger IPI
;   wbinvd                       ;uncomment to flush the caches before each measurement
    rdtsc
    mov ebx,eax                  ;esi:ebx = TSC when the IRQ was triggered
    mov esi,edx
    sti
    hlt                          ;wait for the interrupt (handler returns the delta in edx:eax)
    add [counter],eax            ;accumulate the 64 bit total
    adc [counter + 4],edx
    loop .next
    mov eax,[counter + 2]        ;total / 0x10000 = average cycles per interrupt
    ret

    times 256 nop  ;Make sure it's a different cache line

IRQ_handler:
    push eax                     ;dummy dword so the stack matches the GPF (error code) case
    jmp commonIRQhandler

    times 256 nop  ;Make sure it's a different cache line

GPFhandler:
    test dword [esp],1           ;EXT bit set in the error code?
    je .skip                     ; no - not an IRQ
commonIRQhandler:
    rdtsc
    sub eax,ebx                  ;edx:eax = cycles from "trigger IPI" to here
    sbb edx,esi
.skip:
    add esp,4                    ;discard the dummy dword / error code
    iretd
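
(The "trigger IPI" lines above are pseudocode. A rough sketch of what they could expand to - a self-IPI written to the local APIC's ICR at its default base of 0xFEE00000, using a hypothetical vector number, and assuming the code can actually address that physical location:)

Code: Select all

APIC_ICR_LOW equ 0xFEE00300
TEST_VECTOR  equ 0x40                  ;hypothetical vector the test IRQ is mapped to

    mov dword [APIC_ICR_LOW], 0x00040000 | TEST_VECTOR
                                       ;fixed delivery, destination shorthand = self
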
With a bit of luck I'll give this a go later today.


Thanks,

Brendan

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote: Also, with cache sizes increasing yearly, I don't see something like IRQs (where you're going to, for instance, get a PIT interrupt at LEAST 18.2 times a second) getting out of the cache. I think the cache is smart enough to keep it around. Then again, I could be wrong.
AFAIK for the latest Pentium 4s, the L1 instruction cache (the trace cache) stores up to 12 thousand micro-ops (including the models with a 1 MB L2 cache). Very roughly, a 3 GHz CPU could probably wipe this in 10 µs.

For Intel's 1 MB L2 cache, it's 8-way set associative, which basically means that you only need 8 "unlucky" cache loads (all hitting the same set) and the cache line you wanted to keep is evicted. If you assume that you're neither lucky nor unlucky, the CPU would need to load almost 1 MB of new data before your cache lines are lost. Modern CPUs can easily exceed 1 GB per second though, so it'd be possible to wipe the 1 MB cache in less than 1 ms.

In the end, as long as the GPF method doesn't make performance worse then I'll use it (it simplifies my IDT and saves around 2 KB of RAM - a full 256-entry IDT costs 2 KB while a 36-entry IDT only costs 288 bytes, and the assembly stubs go away too ;) )...


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Colonel Kernel »

Brendan wrote:The problem with premature optimization is that it usually complicates things too early.
Not just. It also takes developer time (and money, if you're like me and develop software for a living) away from the end goal of completing a project/shipping a product/etc. :)

Try the scheme out, let us know how well it works.

Re:GPF and IRQ performance

Post by Cjmovie »

Yes, although I do not support the idea, it will be fun to see just what a difference it will make.

The only problem is, how to make sure it isn't in the cache? Or really, to be more realistic, how to simulate the standard functioning of a kernel? You're going to have to have it doing a lot of stuff while you run the test, hopefully a few large applications (which could do nothing but loop copying random memory locations from some chunk - something that accesses large areas of RAM).
Clearing the cache by hand would be useless, because you'd either have to do it every time (unrealistic) or you'd only do it once (after one interrupt is handled, it's back in the cache).

Of course, I'm no expert, so I'm probably way off on all this :P.

Re:GPF and IRQ performance

Post by Brendan »

Hi,

Ok, I followed the algorithm I posted earlier, but I actually did it in real mode. I used "wbinvd" to flush the CPU's caches.

The actual code I used (it's not too pretty) can be found at:
http://bcos.hopto.org/ugly.html

For each test I've recorded the raw total count and the cycles per interrupt, and then worked out how much longer the GPF method took compared to the standard method. After this I worked out how much longer standard interrupt handling without caching takes compared to the GPF method with everything cached. The results are:


1.6 GHz Pentium 4 Willamette with 100 MHz RAM (known to perform badly):

GPF, no cache flush: 0x0563FF8C, 1380 cycles/int, 152%
Standard, no cache flush: 0x038BFFBC, 908 cycles/int, 100%

GPF with cache flush: 0x080B20C8, 2059 cycles/int, 133%
Standard with cache flush: 0x060B3C50, 1547 cycles/int, 100%

112% difference between standard without cache and GPF with cache


Dual Pentium II 400 MHz:

GPF, no cache flush: 0x00C100BA, 193 cycles/int, 137%
Standard, no cache flush: 0x008D00D1, 141 cycles/int, 100%

GPF with cache flush: 0x01AB29B0, 428 cycles/int, 108%
Standard with cache flush: 0x018ACF68, 395 cycles/int, 100%

204% difference between standard without cache and GPF with cache


Pentium Pro 200:

GPF, no cache flush: 0x00B80042, 184 cycles/int, 140%
Standard, no cache flush: 0x008300C4, 131 cycles/int, 100%

GPF with cache flush: 0x01AB3B1A, 427 cycles/int, 104%
Standard with cache flush: 0x019B49E5, 411 cycles/int, 100%

223% difference between standard without cache and GPF with cache


Ok, from this it's fair to say my Pentium 4 sucks! :) Because it's the first CPU Intel did with the netburst architecture, and because later Pentium 4s are (AFAIK) much better, I'm not too sure if these Pentium 4 results represent Pentium 4s in general.

Apart from this the results are mostly as expected, although I didn't think a GPF would be that much slower.

What I really want to measure is the GPF with everything in the cache compared to the standard method when the IDT entry and stub code isn't in the cache (but the common code is in the cache). Under these circumstances I'd estimate that the standard method would take roughly 30% longer for the "average" CPU, but...

IMHO when IRQs don't happen often the GPF method would be better, but when IRQs are frequent the overhead of the GPF itself becomes a problem and the standard method is better. However, the performance of IRQ handling isn't very important if there aren't many IRQs, and is much more important when there are a lot of them.

From the results above, a GPF just has too much overhead for this idea to make sense. I honestly thought the GPF method would only be about 5% slower when everything is in the cache.
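
(As an aside, for that "IDT entry and stub not cached, but common code cached" measurement, one option would be to evict only the relevant cache lines with CLFLUSH instead of using WBINVD - a rough sketch, with "idt_base", "vector" and "IRQ_stub" as placeholder names, and assuming a CPU that reports CLFLUSH support in CPUID - the Pentium Pro and Pentium II above don't have it:)

Code: Select all

    clflush [idt_base + vector*8]     ;evict only the line holding that IDT descriptor
    clflush [IRQ_stub]                ;evict only the line holding the assembly stub
    mfence                            ;make sure the flushes have completed before the test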


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

If you'd provide a floppy image, I'd be happy to double those results on my AMD64 (1.8 GHz OC'ed to 2.16 GHz, btw 1 MB cache :D ), my 486DX (40 MHz, lol), my AMD Sempron (800 MHz), and my AMD Duron (1.4 GHz). And, if I can convince my brother (yeah, right!), an Intel Celeron (2.8 GHz).

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote: If you'd provide a floppy image, I'd be happy to double those results on my AMD64 (1.8 GHz OC'ed to 2.16 GHz, btw 1 MB cache :D ), my 486DX (40 MHz, lol), my AMD Sempron (800 MHz), and my AMD Duron (1.4 GHz). And, if I can convince my brother (yeah, right!), an Intel Celeron (2.8 GHz).
Ok :)

Some notes! First, there are four different boot images, one for each test. I called them "gc.img" for the GPF method (no cache flushing), "gnc.img" for the GPF method with cache flushing, "ic.img" for the standard method (no cache flushing), and "inc.img" for the standard method with cache flushing. It's painful but I didn't have time to write a nice GUI front end :D.

For each boot disk, boot it and you should see a blue screen with the line "Starting: Physical Memory Manager Boot Code" followed by a 64 bit number in hexadecimal. This is the total cycle count. If you ignore the last 4 hexadecimal digits you'll get the "cycles per interrupt" value (the test runs 0x10000 interrupts, so dropping 4 hex digits divides the total by 0x10000 - e.g. 0x0563FF8C becomes 0x0563, or about 1380 cycles).

At the very start of the test it wipes out the PIC masks, so even though it's in real mode "control + alt + delete" won't do anything - use the on/off switch or your reset button :).

Also, it doesn't do any checking - it just assumes that the CPU supports RDTSC and that a local APIC is present at 0xFEE00000. If this isn't the case it'll probably lock up waiting for an interrupt that never comes. This rules out your 80486 (no RDTSC). I'm not sure about the others (the Sempron's local APIC may have been disabled by the BIOS, for example).
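
(If anyone wants to add a sanity check first, something like this would do it - CPUID leaf 1 returns the feature flags in EDX, with bit 4 = TSC and bit 9 = on-chip local APIC; the "no_tsc"/"no_apic" labels are hypothetical, and it assumes CPUID itself exists, which rules out older 486s anyway:)

Code: Select all

    mov eax,1
    cpuid
    test edx,1 << 4                   ;bit 4 set = RDTSC supported
    jz no_tsc
    test edx,1 << 9                   ;bit 9 set = on-chip local APIC available
    jz no_apic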

I put all of the boot images into a zip file, which can be downloaded from http://bcos.hopto.org/tests.zip.

You'll also be the first person to see the beginnings of my new Boot Manager! :)


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

I've been trying to write the images to a floppy for the last half an hour, but it wouldn't work.

Then I noticed I had floppy drive A disabled in my hardware profile.

:P

So, I'm off to test now. First, I've got an AMD Athlon XP 2000+ in addition to those. It, ahem, replaces the 486 (lol).

Re:GPF and IRQ performance

Post by Cjmovie »

Here are the (somewhat interesting) results from the Athlon XP 2000+ machine:

GPF Without Cache: 0x101 cycles, or 257 cycles. (Watch out Intel!) (000000000101E445)
GPF With Cache: IDK. It never said (Umm..odd)

Old Method Without Cache: IDK. It never said (umm...Odd)
Old Method With Cache: Guess? IDK. It never said!

I'm not really sure why it worked one time and not on the others.

I can come to one conclusion: The GPF method is much safer, if you feel like clearing the cache once a second! lol.

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote:I'm not really sure why it worked one time and not on the others.
Hmmm - that is unusual, even though it wasn't "bullet-proof" code...

It does show the difference an inbuilt memory controller can make though :)


Thanks,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

Hmm....My floppy for the IC test got messed up, going to have to re-do that one. But here are the results so far:

AMD Athlon 64 3000+ (currently 1.81 GHz)

INC - 804 Cycles (0x3241469)
IC - ?
GNC - 307 Cycles (0x13309C0)
GC - 696 Cycles (0x2B80DC4)

Rating:
With INC at 100%, GNC is 38%.

Those are some very intriguing results....Need to test IC.

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote:Hmm....My floppy for the IC test got messed up, going to have to re-do that one. But here are the results so far:

AMD Athlon 64 3000+ (currently 1.81 GHz)

INC - 804 Cycles (0x3241469)
IC - ?
GNC - 307 Cycles (0x13309C0)
GC - 696 Cycles (0x2B80DC4)

Rating:
With INC at 100%, GNC is 38%.

Those are some very intriguing results....Need to test IC.
That is very interesting...

Firstly, the GPF method with cache flushing (307 cycles) shouldn't be faster than the GPF method with full cache use (696 cycles). Based on this (and the results from the Athlon XP 2000+), I'm wondering if I labelled the boot images wrong and got "GC" and "GNC" around the wrong way. I'll check when I get home...

It's also odd that the standard method with cache flushing is slower than either of the GPF method tests, as it contradicts the earlier (Intel) results - even though it is possible, since it should cause one extra cache miss compared to the GPF method with cache flushing.


Thanks,

Brendan