GPF and IRQ performance


GPF and IRQ performance

Post by Brendan »

Hi,

I've been thinking about a new IRQ handling scheme. For clarity I'll describe the "standard" method, and then the new method...


Standard Method

This involves a separate IDT descriptor and a separate assembly handler for each IRQ. Assume the IRQ handlers all look like this (where NN is the IRQ number):

Code: Select all

IRQ_handler_NN:
    push eax
    mov eax, NN                     ;eax = IRQ number
    jmp common_IRQ_handler

    ...

common_IRQ_handler:
Now assume the CPU has been busy and contains nothing useful in its caches when an IRQ occurs. In this case I'd expect a cache miss when the CPU looks up the IDT entry, another cache miss when it gets to the assembly handler, and a third cache miss when it jumps to "common_IRQ_handler".
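
(For reference, a minimal sketch of what "common_IRQ_handler" might do for the legacy PICs - this is only an illustration; a real handler would save more registers and dispatch to the right driver before sending the EOI:)

Code: Select all

common_IRQ_handler:
    cmp eax,8                   ;eax = IRQ number (set by the stub)
    jb .master_only
    mov al,0x20
    out 0xA0,al                 ;non-specific EOI to the slave PIC (IRQ 8 to 15)
.master_only:
    mov al,0x20
    out 0x20,al                 ;non-specific EOI to the master PIC
    pop eax                     ;restore the eax pushed by the stub
    iretd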


GPF Method

For the alternative, the IDT limit is set short and the IDT itself only contains 36 descriptors. The IRQs are mapped to interrupt vectors that are above the IDT limit.
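
(A minimal sketch of how the short IDT might be loaded - NASM syntax, with "idt_base" as a placeholder for wherever the 36 descriptors actually live:)

Code: Select all

idt_ptr:
    dw 36 * 8 - 1               ;limit = 0x11F, so only vectors 0 to 35 are valid
    dd idt_base                 ;linear address of the 36-entry IDT

    lidt [idt_ptr]              ;any vector above the limit now causes a GPF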

When an IRQ occurs, the CPU checks the vector against the IDT limit and raises a GPF instead of using an IDT descriptor. The vector number is automatically stored on the stack by the CPU as part of the GPF error code (bits 3 to 15 hold the vector, bit 1 is set because the reference was to the IDT, and bit 0 is set because the event was external). The GPF handler begins with code like this:

Code: Select all

GPF_handler:
    push eax
    test byte [esp + 4], 1          ;EXT bit set in the error code (was it an IRQ)?
    je real_GPF_handler             ; no, it's a real GPF

    mov eax, [esp + 4]              ;eax = error code
    shr eax, 3                      ;eax = vector number (IRQ number + 36)
In this case (assuming the CPU contains nothing useful in its caches) I'd expect a cache miss when the CPU looks up the IDT entry and another cache miss when it gets to the GPF handler (or 2 cache misses instead of 3).

Because the GPF handler would be re-used for all IRQs, it's also more likely that it would already be in the cache.

On modern CPUs a cache miss is costly, but I also expect that (if cache miss penalties are ignored, or everything is already in the CPU's cache) a GPF will take longer than a normal interrupt.


Questions

So, which method would be faster on average?

How could the difference be measured with RDTSC (both with and without cache misses)?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Re:GPF and IRQ performance

Post by Colonel Kernel »

This sounds like premature optimization to me. It sounds interesting, but is it really worth spending that much time on?
Brendan wrote: So, which method would be faster on average?
Who knows... the only way is to benchmark it.
How could the difference be measured with RDTSC (both with and without cache misses)?
<shrug/> I thought you were the expert on RDTSC. ;)

Generally speaking, you have to be careful here when measuring the impact of the change because your OS doesn't spend all its time handling IRQs. You would need to figure out some "common" scenarios that combine the expected hardware configurations with the expected workloads and see if the change makes a difference. Only you know what is "expected" for your OS.
Top three reasons why my OS project died:
  1. Too much overtime at work
  2. Got married
  3. My brain got stuck in an infinite loop while trying to design the memory manager
Don't let this happen to you!

Re:GPF and IRQ performance

Post by Cjmovie »

Also, with cache sizes increasing yearly, I don't see something like IRQs (where you're going to, for instance, get a PIT interrupt at LEAST 18.2 times a second) getting out of the cache. I think the cache is smart enough to keep it around. Then again, I could be wrong.

Next generation processors are going to have 1 MB cache, STANDARD.

I agree, it is an optimization I wouldn't want to spend too much time on.

Sorry if I sound on a rant, rather frustrated right now with other stuff.....

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Colonel Kernel wrote:This sounds like premature optimization to me. It sounds interesting, but is it really worth spending that much time on?
The problem with premature optimization is that it usually complicates things too early.

For this, it simplifies the IDT a little and removes the need for a collection of assembly stubs, so IMHO it doesn't make things more complicated.

More important is that I find it interesting - it's something I haven't tried before (I probably wouldn't have considered it if it wasn't for a note at the bottom of the IDT page in the Wiki). At the moment I'm rewriting/updating code and none of it is really new or exciting, and I need something interesting! :)
Colonel Kernel wrote:Generally speaking, you have to be careful here when measuring the impact of the change because your OS doesn't spend all its time handling IRQs. You would need to figure out some "common" scenarios that combine the expected hardware configurations with the expected workloads and see if the change makes a difference. Only you know what is "expected" for your OS.
I couldn't really figure out how the OS will be used in advance - someone might use it to edit text files (an occasional IRQ) while someone else might use it for a server with four 1000 Mbps ethernet cards going flat out (lots of IRQs).

Instead, I'd want to measure the time between the CPU receiving the IRQ and the common IRQ handling code being started.

The most reliable way of measuring it that I can think of would be to get the local APIC to send an IPI to itself - that way I can know when the IRQ is about to happen and any hardware delays between triggering the IRQ and the IRQ arriving at the CPU would be as small as possible.

Something like:

Code: Select all

.preload:
    trigger IPI                  ;take one interrupt first so the handler path is "warm"
    hlt
    mov ecx,0x10000              ;number of measured interrupts
.next:
    cli
    trigger IPI
;   wbinvd                       ;uncomment to flush the caches before each measurement
    rdtsc
    mov ebx,eax                  ;esi:ebx = TSC when the IRQ was triggered
    mov esi,edx
    sti
    hlt                          ;wait for the interrupt (handler returns the delta in edx:eax)
    add [counter],eax            ;accumulate the 64 bit total
    adc [counter + 4],edx
    loop .next
    mov eax,[counter + 2]        ;total / 0x10000 = average cycles per interrupt
    ret

    times 256 nop  ;Make sure it's a different cache line

IRQ_handler:
    push eax                     ;dummy dword so the stack matches the GPF (error code) case
    jmp commonIRQhandler

    times 256 nop  ;Make sure it's a different cache line

GPFhandler:
    test dword [esp],1           ;EXT bit set in the error code?
    je .skip                     ; no - not an IRQ
commonIRQhandler:
    rdtsc
    sub eax,ebx                  ;edx:eax = cycles from "trigger IPI" to here
    sbb edx,esi
.skip:
    add esp,4                    ;discard the dummy dword / error code
    iretd
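
(The "trigger IPI" lines above are pseudocode. A rough sketch of what they could expand to - a self-IPI written to the local APIC's ICR at its default base of 0xFEE00000, using a hypothetical vector number, and assuming the code can actually address that physical location:)

Code: Select all

APIC_ICR_LOW equ 0xFEE00300
TEST_VECTOR  equ 0x40                  ;hypothetical vector the test IRQ is mapped to

    mov dword [APIC_ICR_LOW], 0x00040000 | TEST_VECTOR
                                       ;fixed delivery, destination shorthand = self
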
With a bit of luck I'll give this a go later today.


Thanks,

Brendan

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote: Also, with cache sizes increasing yearly, I don't see something like IRQs (where you're going to, for instance, get a PIT interrupt at LEAST 18.2 times a second) getting out of the cache. I think the cache is smart enough to keep it around. Then again, I could be wrong.
AFAIK for the latest Pentium 4s, the L1 instruction cache (the trace cache) stores up to 12 thousand micro-ops (including the models with a 1 MB L2 cache). Very roughly, a 3 GHz CPU could probably wipe this in 10 µs.

For Intel's 1 MB L2 cache, it's 8-way set associative, which basically means that you only need 8 "unlucky" cache loads (all hitting the same set) and the cache line you wanted to keep is evicted. If you assume that you're neither lucky nor unlucky, the CPU would need to load almost 1 MB of new data before your cache lines are lost. Modern CPUs can easily exceed 1 GB per second though, so it'd be possible to wipe the 1 MB cache in less than 1 ms.

In the end, as long as the GPF method doesn't make performance worse then I'll use it (it simplifies my IDT and saves around 2 KB of RAM - a full 256-entry IDT costs 2 KB while a 36-entry IDT only costs 288 bytes, and the assembly stubs go away too ;) )...


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Colonel Kernel »

Brendan wrote:The problem with premature optimization is that it usually complicates things too early.
Not just. It also takes developer time (and money, if you're like me and develop software for a living) away from the end goal of completing a project/shipping a product/etc. :)

Try the scheme out, let us know how well it works.

Re:GPF and IRQ performance

Post by Cjmovie »

Yes, although I do not support the idea, it will be fun to see just what a difference it will make.

The only problem is, how to make sure it isn't in the cache? Or really, to be more realistic, how to simulate the standard functioning of a kernel? You're going to have to have it doing a lot of stuff while you run the test, hopefully a few large applications (which could do nothing but loop copying random memory locations from some chunk - something that accesses large areas of RAM).
Clearing the cache by hand would be useless, because you'd either have to do it every time (unrealistic) or you'd only do it once (after one interrupt is handled, it's back in the cache).

Of course, I'm no expert, so I'm probably way off on all this :P.

Re:GPF and IRQ performance

Post by Brendan »

Hi,

Ok, I followed the algorithm I posted earlier, but I actually did it in real mode. I used "wbinvd" to flush the CPU's caches.

The actual code I used (it's not too pretty) can be found at:
http://bcos.hopto.org/ugly.html

For each test I've recorded the raw total count and the cycles per interrupt, and then worked out how much longer the GPF method took compared to the standard method. After this I worked out how much longer standard interrupt handling without caching takes compared to the GPF method with everything cached. The results are:


1.6 GHz Pentium 4 Willamette with 100 MHz RAM (known to perform badly):

GPF, no cache flush: 0x0563FF8C, 1380 cycles/int, 152%
Standard, no cache flush: 0x038BFFBC, 908 cycles/int, 100%

GPF with cache flush: 0x080B20C8, 2059 cycles/int, 133%
Standard with cache flush: 0x060B3C50, 1547 cycles/int, 100%

112% difference between standard without cache and GPF with cache


Dual Pentium II 400 MHz:

GPF, no cache flush: 0x00C100BA, 193 cycles/int, 137%
Standard, no cache flush: 0x008D00D1, 141 cycles/int, 100%

GPF with cache flush: 0x01AB29B0, 428 cycles/int, 108%
Standard with cache flush: 0x018ACF68, 395 cycles/int, 100%

204% difference between standard without cache and GPF with cache


Pentium Pro 200:

GPF, no cache flush: 0x00B80042, 184 cycles/int, 140%
Standard, no cache flush: 0x008300C4, 131 cycles/int, 100%

GPF with cache flush: 0x01AB3B1A, 427 cycles/int, 104%
Standard with cache flush: 0x019B49E5, 411 cycles/int, 100%

223% difference between standard without cache and GPF with cache


Ok, from this it's fair to say my Pentium 4 sucks! :) Because it's the first CPU Intel did with the netburst architecture, and because later Pentium 4s are (AFAIK) much better, I'm not too sure if these Pentium 4 results represent Pentium 4s in general.

Apart from this the results are mostly as expected, although I didn't think a GPF would be that much slower.

What I really want to measure is the GPF with everything in the cache compared to the standard method when the IDT entry and stub code isn't in the cache (but the common code is in the cache). Under these circumstances I'd estimate that the standard method would take roughly 30% longer for the "average" CPU, but...

IMHO when IRQs don't happen often the GPF method would be better, but when IRQs are frequent the overhead of the GPF itself becomes a problem and the standard method is better. However, the performance of IRQ handling isn't very important if there aren't many IRQs, and is much more important when there are a lot of them.

From the results above, a GPF just has too much overhead for this idea to make sense. I honestly thought the GPF method would only be about 5% slower when everything is in the cache.
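
(As an aside, for that "IDT entry and stub not cached, but common code cached" measurement, one option would be to evict only the relevant cache lines with CLFLUSH instead of using WBINVD - a rough sketch, with "idt_base", "vector" and "IRQ_stub" as placeholder names, and assuming a CPU that reports CLFLUSH support in CPUID - the Pentium Pro and Pentium II above don't have it:)

Code: Select all

    clflush [idt_base + vector*8]     ;evict only the line holding that IDT descriptor
    clflush [IRQ_stub]                ;evict only the line holding the assembly stub
    mfence                            ;make sure the flushes have completed before the test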


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

If you'd provide a floppy image, I'd be happy to double those results on my AMD64 (1.8 GHz OC'ed to 2.16 GHz, btw 1 MB cache :D ), my 486DX (40 MHz, lol), my AMD Sempron (800 MHz), and my AMD Duron (1.4 GHz). And, if I can convince my brother (yeah, right!), an Intel Celeron (2.8 GHz).

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote: If you'd provide a floppy image, I'd be happy to double those results on my AMD64 (1.8 GHz OC'ed to 2.16 GHz, btw 1 MB cache :D ), my 486DX (40 MHz, lol), my AMD Sempron (800 MHz), and my AMD Duron (1.4 GHz). And, if I can convince my brother (yeah, right!), an Intel Celeron (2.8 GHz).
Ok :)

Some notes! First, there are four different boot images, one for each test. I called them "gc.img" for the GPF method (no cache flushing), "gnc.img" for the GPF method with cache flushing, "ic.img" for the standard method (no cache flushing), and "inc.img" for the standard method with cache flushing. It's painful but I didn't have time to write a nice GUI front end :D.

For each boot disk, boot it and you should see a blue screen with the line "Starting: Physical Memory Manager Boot Code" followed by a 64 bit number in hexadecimal. This is the total cycle count. If you ignore the last 4 hexadecimal digits you'll get the "cycles per interrupt" value (the test runs 0x10000 interrupts, so dropping 4 hex digits divides the total by 0x10000 - e.g. 0x0563FF8C becomes 0x0563, or about 1380 cycles).

At the very start of the test it wipes out the PIC masks, so even though it's in real mode "control + alt + delete" won't do anything - use the on/off switch or your reset button :).

Also, it doesn't do any checking - it just assumes that the CPU supports RDTSC and that a local APIC is present at 0xFEE00000. If this isn't the case it'll probably lock up waiting for an interrupt that never comes. This rules out your 80486 (no RDTSC). I'm not sure about the others (the Sempron's local APIC may have been disabled by the BIOS, for example).
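
(If anyone wants to add a sanity check first, something like this would do it - CPUID leaf 1 returns the feature flags in EDX, with bit 4 = TSC and bit 9 = on-chip local APIC; the "no_tsc"/"no_apic" labels are hypothetical, and it assumes CPUID itself exists, which rules out older 486s anyway:)

Code: Select all

    mov eax,1
    cpuid
    test edx,1 << 4                   ;bit 4 set = RDTSC supported
    jz no_tsc
    test edx,1 << 9                   ;bit 9 set = on-chip local APIC available
    jz no_apic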

I put all of the boot images into a zip file, which can be downloaded from http://bcos.hopto.org/tests.zip.

You'll also be the first person to see the beginnings of my new Boot Manager! :)


Cheers,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

I've been trying to write the images to a floppy for the last half an hour, but it wouldn't work.

Then I noticed I had floppy drive A disabled in my hardware profile.

:P

So, I'm off to test now. First, I've got an AMD Athlon XP 2000+ in addition to those. It, ahem, replaces the 486 (lol).

Re:GPF and IRQ performance

Post by Cjmovie »

Here are the (somewhat interesting) results from the Athlon XP 2000+ machine:

GPF Without Cache: 0x101 cycles, or 257 cycles. (Watch out Intel!) (000000000101E445)
GPF With Cache: IDK. It never said (Umm..odd)

Old Method Without Cache: IDK. It never said (umm...Odd)
Old Method With Cache: Guess? IDK. It never said!

I'm not really sure why it worked one time and not on the others.

I can come to one conclusion: The GPF method is much safer, if you feel like clearing the cache once a second! lol.

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote:I'm not really sure why it worked one time and not on the others.
Hmmm - that is unusual, even though it wasn't "bullet-proof" code...

It does show the difference an inbuilt memory controller can make though :)


Thanks,

Brendan

Re:GPF and IRQ performance

Post by Cjmovie »

Hmm....My floppy for the IC test got messed up, going to have to re-do that one. But here are the results so far:

AMD Athlon 64 3000+ (currently 1.81 GHz)

INC - 804 Cycles (0x3241469)
IC - ?
GNC - 307 Cycles (0x13309C0)
GC - 696 Cycles (0x2B80DC4)

Rating:
With INC at 100%, GNC is 38%.

Those are some very intriguing results....Need to test IC.

Re:GPF and IRQ performance

Post by Brendan »

Hi,
Cjmovie wrote:Hmm....My floppy for the IC test got messed up, going to have to re-do that one. But here are the results so far:

AMD Athlon 64 3000+ (currently 1.81 GHz)

INC - 804 Cycles (0x3241469)
IC - ?
GNC - 307 Cycles (0x13309C0)
GC - 696 Cycles (0x2B80DC4)

Rating:
With INC at 100%, GNC is 38%.

Those are some very intriguing results....Need to test IC.
That is very interesting...

Firstly, the GPF method with cache flushing (307 cycles) shouldn't be faster than the GPF method with full cache use (696 cycles). Based on this (and the results from the Athlon XP 2000+), I'm wondering if I labelled the boot images wrong and got "GC" and "GNC" around the wrong way. I'll check when I get home...

It's also odd that the standard method with cache flushing is slower than either of the GPF method tests, as it contradicts the earlier (Intel) results - even though it is possible, since it should cause one extra cache miss compared to the GPF method with cache flushing.


Thanks,

Brendan