Optimizing MSI and IRQ stubs

rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Optimizing MSI and IRQ stubs

Post by rdos »

I'm currently redesigning my IRQ stubs to minimize latency while optimizing the speed of the IRQ handlers. The stubs have some basic requirements that cannot be changed: they must save the segment registers, and the IRQ handlers themselves have far 48-bit addresses and are passed DS as a parameter. The IRQ handlers should execute with interrupts enabled.

Here is the scheduler code that makes it possible to execute handlers with interrupts enabled:

Code:


; This locks the scheduler so it cannot switch threads. This macro is exported by the scheduler.

EnterInt    MACRO
    mov ax,core_data_sel                ; load the unique per-core GDT selector
    mov ds,ax
    add ds:ps_nesting,1                 ; disable scheduling
    push ds:ps_sel                      ; save the core selector
            ENDM

; This unlocks the scheduler and checks if any actions need to be performed after the IRQ.
; Should be run with interrupts disabled.

LeaveInt    MACRO
    local tucDone
    local tucSwap

    pop ds                              ; restore core selector
;
    sub ds:ps_nesting,1                 ; enable scheduler if ps_nesting becomes -1
    jnc tucDone                         ; if we are a nested interrupt, just exit
;
    mov ax,ds:ps_curr_thread            ; check if there is a current thread. If not, the scheduler was running
    or ax,ax
    jz tucDone
;
    test ds:ps_flags,PS_FLAG_TIMER OR PS_FLAG_PREEMPT   ; check for pending actions
    jnz tucSwap
;
    mov ax,ds:ps_wakeup_list            ; check for pending thread wakeups
    or ax,ax
    jz tucDone

tucSwap:
    add ds:ps_nesting,1                 ; something needs to be handled, so lock again
    sti
    mov ax,ds
    mov fs,ax
    OsGate irq_schedule_nr              ; schedule

tucDone:
            ENDM
The simplest handler is the MSI handler. MSIs can never be shared, and cannot have IRQ detection either, so the stub contains only the basic code:

Code:


; This is the handler header, with local variables that store the entry DS and the 48-bit handler procedure

msi_handler_struc   STRUC

msi_linear          DD ?              ; linear address of code segment so it can easily be modified by the setup-code, but not at runtime
msi_handler_ads     DD ?,?        ; handler procedure
msi_handler_data    DW ?         ; handler data

msi_handler_struc   ENDS

MsiStart:

msi_handler     msi_handler_struc <>        ; put the header first in the code segment

MsiEntry:
    push ds
    push es
    push fs
    pushad
;
    EnterInt
    sti
;       
    mov ds,cs:msi_handler_data
    call fword ptr cs:msi_handler_ads
;
    cli    
    mov ax,apic_mem_sel
    mov ds,ax
    xor eax,eax
    mov ds:APIC_EOI,eax                         ; EOI APIC
    LeaveInt
;
    popad
    pop fs
    pop es
    pop ds
    iretd

MsiEnd:
The MSI handler is constructed by allocating a piece of memory and copying the above code to it. This makes it possible to construct every MSI handler with the same procedure, since the code above contains no MSI- or interrupt-specific references. The memory is then mapped to a GDT code selector, which makes it possible to access the handler data at fixed addresses at the beginning of the segment without loading flat or other selectors. Then the code segment is mapped into the IDT, and finally the handler address (procedure + data) is stored in the header.
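
In pseudo-assembly, the construction looks something like this (a sketch only: AllocateLinear, CopyTemplate, CreateCodeSelector and SetIdtGate are illustrative stand-ins, not the actual RDOS kernel API):

Code:

; Sketch of MSI stub construction. All called procedures are
; hypothetical placeholders for the real kernel primitives.

SetupMsiStub    Proc near
    mov ecx,OFFSET MsiEnd - OFFSET MsiStart
    call AllocateLinear             ; alloc ecx bytes, linear address -> edx
    call CopyTemplate               ; copy MsiStart..MsiEnd to linear edx
;
    call CreateCodeSelector         ; wrap edx/ecx in a GDT code selector -> bx
    mov eax,OFFSET MsiEntry - OFFSET MsiStart
    call SetIdtGate                 ; point the MSI vector's IDT gate at bx:eax
;
    ; finally, write msi_handler_ads / msi_handler_data into the header
    ; at the start of the segment (through a writable alias, since the
    ; code selector itself is not writable at runtime)
    ret
SetupMsiStub    Endp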
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Optimizing MSI and IRQ stubs

Post by Owen »

For a start, take all of the scheduler actions out of the line of the interrupt fast path (i.e. returning without invoking the scheduler). x86 CPUs predict forward branches as unlikely and backwards branches as likely, so you are penalizing the presumably most common case.

Contemplate saving fewer GPRs; all eight are probably not necessary. Also, multiple pushes cannot be executed in parallel (they end up serialized by the rSP writes), so "sub rSP, xxx; mov [rSP+xxx], reg" is faster, though I'm not sure how this compares for segment register pushes.
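
For illustration, the idea looks something like this (a sketch with 32-bit registers; the offsets and register choice are illustrative):

Code:

; Instead of four pushes that serialize on the eSP updates:
    sub esp,16                  ; one stack-pointer adjustment
    mov [esp],eax               ; independent stores; the CPU can
    mov [esp+4],ebx             ; execute these in parallel
    mov [esp+8],ecx
    mov [esp+12],edx

    ; ... handler body ...

    mov edx,[esp+12]            ; independent loads on the way out
    mov ecx,[esp+8]
    mov ebx,[esp+4]
    mov eax,[esp]
    add esp,16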
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

Owen wrote:For a start, take all of the scheduler actions out of the line of the interrupt fast path (i.e. returning without invoking the scheduler). x86 CPUs predict forward branches as unlikely and backwards branches as likely, so you are penalizing the presumably most common case.
Actually, as far as I can see, the most common case in LeaveInt is that no branches are taken. The most common scenario in a typical IRQ handler in RDOS is to wake up a server thread, which puts something in ps_wakeup_list. It is also uncommon for interrupts to happen in the scheduler, for interrupts to be nested, and for the preemption timer to have expired, so those branches are all uncommon.
Owen wrote:Contemplate saving fewer GPRs; all eight are probably not necessary. Also, multiple pushes cannot be executed in parallel (they end up serialized by the rSP writes), so "sub rSP, xxx; mov [rSP+xxx], reg" is faster, though I'm not sure how this compares for segment register pushes.
You probably have a point here. It might be better to save only the registers that the IRQ/MSI stubs use themselves, and require the IRQ handlers to save the registers they use instead.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

Now I have also changed the IO-APIC IRQ stubs in a similar manner. An extra benefit is that the IO-APIC interrupts are initially set up to detect interrupt activity. This is useful for autodetecting IRQs, especially for serial ports, but possibly also for other devices. As soon as an IRQ fires, the stub masks the interrupt and sets a bit in a mask to indicate that the IRQ happened. When a real handler is installed, it simply overwrites the detection handler. Sharing IRQs is done by adding a "share part" at the end of the IRQ handler and linking it with a near jmp. Multiple "share parts" can be installed, so handlers can be chained to any depth.

With these new IRQ handler stubs, the 2-core AMD works perfectly well. I've stressed it for six hours with the random GUI test app, and it still works.

This is what the IO-APIC stubs look like:

Code:

irq_handler_struc   STRUC

irq_linear          DD ?                     ; linear address of handler memory (used for resize when chaining)
irq_handler_ads     DD ?,?               ; handler address. Initially set to IrqDetect
irq_handler_data    DW ?
irq_chain           DW ?                   ; chain. Initially set to IrqExit
irq_detect_nr       DB ?                  ; IRQ number for detect. 

irq_handler_struc   ENDS

IrqStart:

irq_handler     irq_handler_struc <>        ; start with the header

IrqEntry:
    push ds
    push es
    push fs
    pushad
;
    EnterInt
    sti
;       
    mov ds,cs:irq_handler_data                  ; call the first handler (detect or IRQ handler)
    call fword ptr cs:irq_handler_ads
;
    mov bx,OFFSET IrqEnd - OFFSET IrqStart      ; set up bx for chaining
    jmp cs:irq_chain                            ; chain

IrqExit:
    cli    
    mov ax,apic_mem_sel
    mov ds,ax
    xor eax,eax
    mov ds:APIC_EOI,eax
    LeaveInt
;
    popad
    pop fs
    pop es
    pop ds
    iretd

IrqDetect:
    mov ax,SEG data
    mov ds,ax
    movzx bx,cs:irq_detect_nr
    shl bx,3
    add bx,OFFSET global_int_arr
    mov al,ds:[bx].gi_ioapic_id
    mov es,ds:[bx].gi_ioapic_sel
;       
    mov bl,10h                                           ; disable IRQ in IO-APIC
    add bl,al
    add bl,al
;    
    LockIoApic                                          ; take IO-APIC spinlock
    mov es:ioapic_regsel,bl
    mov eax,10000h
    mov es:ioapic_window,eax
;
    inc bl
    mov es:ioapic_regsel,bl
    xor eax,eax
    mov es:ioapic_window,eax
    UnlockIoApic                                     ; release IO-APIC spinlock
;
    movzx dx,cs:irq_detect_nr
    cmp dx,24
    jae IrqDetectDone
;    
    mov bx,OFFSET detected_irqs
    bts ds:[bx],dx                                        ; set mask bit

IrqDetectDone:
    retf32

IrqEnd:
The chaining code is not very complex:

Code:

irq_chain_struc  STRUC

irch_handler_ads     DD ?,?                ; chain handler
irch_handler_data    DW ?
irch_chain           DW ?                    ; next in chain

irq_chain_struc ENDS

IrqChainStart:

irch_handler      irq_chain_struc <>    ; start block with chain header (bx points here)

IrqChainEntry:
    push bx
    mov ds,cs:[bx].irch_handler_data
    call fword ptr cs:[bx].irch_handler_ads        ; call handler
    pop bx
;
    mov si,bx
    add bx,OFFSET IrqChainEnd - OFFSET IrqChainStart    ; setup for next chain
    jmp cs:[si].irch_chain                              ; chain

IrqChainEnd:
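
Hooking a second handler into the chain then amounts to patching two chain words (a sketch of the mechanism only; es being a writable alias of the handler's code segment and bx being the offset where the new chain block was copied are assumed conventions, not the real RDOS API):

Code:

; Link a freshly copied chain block into the handler. es = writable
; alias of the handler code segment (assumed), bx = offset of the new
; block within it (assumed).

LinkChainBlock  Proc near
    mov ax,OFFSET IrqExit
    mov es:[bx].irch_chain,ax       ; the new block terminates the chain
;
    mov ax,bx
    add ax,OFFSET IrqChainEntry - OFFSET IrqChainStart
    mov si,OFFSET irq_handler
    mov es:[si].irq_chain,ax        ; previous end of chain now jumps here
                                    ; (with deeper chains, patch the last
                                    ; block's irch_chain instead)
    ret
LinkChainBlock  Endp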
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

Obviously, there is a need for a third stub as well: a level-triggered IO-APIC stub (basically for PCI). This stub needs to run the handlers twice, once before EOI and once after EOI. It might be a good idea to let each handler indicate whether it handled something, and remember this for the second run (handlers that didn't handle anything don't need a second run). I also want to out-compete Linux by providing support for IRQ detection even when handlers are installed: the IRQ detection stub would run when nobody handled the IRQ. :mrgreen:
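
The two-pass dispatch could look roughly like this (a sketch of the idea, not the final stub; the convention that a handler returns a nonzero "handled" status in ax is an assumption):

Code:

PciEntry:
    ; (segment saves and EnterInt as in the other stubs)
;
    mov ds,cs:irq_handler_data
    call fword ptr cs:irq_handler_ads   ; first pass, before EOI
    push ax                             ; ax nonzero if handled (assumed)
;
    cli
    mov ax,apic_mem_sel
    mov ds,ax
    xor eax,eax
    mov ds:APIC_EOI,eax                 ; EOI the APIC
    sti
;
    pop ax
    or ax,ax                            ; skip pass two if nothing was handled
    jz PciDone
    mov ds,cs:irq_handler_data
    call fword ptr cs:irq_handler_ads   ; second pass, after EOI

PciDone:
    ; (LeaveInt, register restores and iretd as in the other stubs)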

Additionally, I might remove support for chaining from non-PCI (edge-triggered ISA) interrupts, as those are normally not shareable anyway. I might support the old RequestPrivateIrqHandler / ReleasePrivateIrqHandler interface for them instead.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

The special PCI stub is ready and tested, but it doesn't solve the issue with the RTL8168/8111 network controller. :evil:

I shouldn't need timeouts in the driver for network packets to be handled correctly, but I don't seem to be able to solve it any other way.

Edit: This is probably relevant regarding this issue: http://ubuntuforums.org/showthread.php?t=1022411&page=9

It seems like Linux has about the same problems as I do. There is probably some kind of bug in the chip.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

It is a chip bug. Even if they don't admit it, the fix can be found in Realtek's own driver for Linux. At the beginning of their IRQ handler they do something really strange: they clear the interrupt mask. At the end, they reprogram the correct interrupt mask. When I added this to my IRQ handler, everything suddenly started working a lot better. :P
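
In stub form, the workaround amounts to something like this (a sketch: IntrMask at 3Ch and IntrStatus at 3Eh follow the public RTL8168 register layout; es mapping the NIC's register space and rtl_int_mask holding the normal mask are assumptions):

Code:

RTL_INTR_MASK   EQU 3Ch             ; IMR, 16-bit
RTL_INTR_STATUS EQU 3Eh             ; ISR, 16-bit, write-1-to-clear

rtl_irq:
    xor ax,ax
    mov es:[RTL_INTR_MASK],ax       ; the Realtek trick: mask everything first
;
    mov ax,es:[RTL_INTR_STATUS]
    mov es:[RTL_INTR_STATUS],ax     ; ack whatever is pending
;
    ; ... process rx/tx rings here ...
;
    mov ax,ds:rtl_int_mask          ; reprogram the correct mask at the end
    mov es:[RTL_INTR_MASK],ax
    retf32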

Actually, it works 100%, and it works with lowest priority delivery as well.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

I'm now sure that the IRQs work with the RTL8168 NIC on both my 2-core (IO-APIC delivery) and 6-core (MSI delivery) machines. I stressed them by sending almost 1 million IPC-over-IP messages between the machines, and the NICs still worked after that.
rdos
Member
Posts: 3276
Joined: Wed Oct 01, 2008 1:55 pm

Re: Optimizing MSI and IRQ stubs

Post by rdos »

I reverted from using the special PCI IRQ handlers. They don't seem to be needed and only degrade performance. After finding the chip bug in the RTL NIC, I no longer need these handlers there either.

I've also found the major bug in the AHCI/SATA driver that forced me to use timers. It was not related to IRQs, but rather to improper signalling. There are still some minor issues in the AHCI driver, but I'm pretty sure they are unrelated to IRQs.

IOW, it seems like the IRQs are now working on multiple machines, with lowest-priority delivery for both the IO-APIC PCI and MSI versions.