Cyclical GP fault on iretq of timer interrupt in 64 bit OS

Posted: Sun Aug 12, 2018 3:39 am
by adrianmay
I'm trying to write a 64-bit OS. It raises a GP fault on the iretq from the timer interrupt handler, then keeps raising further GP faults from the iretq of the GP handler itself.

I know this because my generic handler prints the ISR number on the serial port, and it goes 32, 13, 13, 13, ...

The error code for the first GP is 0x10, which is my data segment selector.

I'm debugging it in qemu, so I can see quite a bit. Here's the situation at the iretq from the timer handler:

    (gdb) disas isr_common,isr_head_2                                                
    Dump of assembler code from 0x8189 to 0x81c4:                                    
       0x0000000000008189 <isr_common+0>:   callq  0x8125 <sayN100>                  
       0x000000000000818e <isr_common+5>:   cmp    $0x20,%eax                        
       0x0000000000008191 <isr_common+8>:   jl     0x81a8 <isr_common.no_more_acks>  
       0x0000000000008193 <isr_common+10>:  cmp    $0x30,%eax                        
       0x0000000000008196 <isr_common+13>:  jge    0x81a8 <isr_common.no_more_acks>  
       0x0000000000008198 <isr_common+15>:  cmp    $0x28,%al                         
       0x000000000000819a <isr_common+17>:  jl     0x81a2 <isr_common.ack_master>    
       0x000000000000819c <isr_common+19>:  push   %rax                              
       0x000000000000819d <isr_common+20>:  mov    $0x20,%al                         
       0x000000000000819f <isr_common+22>:  out    %al,$0xa0                         
       0x00000000000081a1 <isr_common+24>:  pop    %rax                              
       0x00000000000081a2 <isr_common.ack_master+0>:        push   %rax              
       0x00000000000081a3 <isr_common.ack_master+1>:        mov    $0x20,%al         
       0x00000000000081a5 <isr_common.ack_master+3>:        out    %al,$0x20         
       0x00000000000081a7 <isr_common.ack_master+5>:        pop    %rax              
       0x00000000000081a8 <isr_common.no_more_acks+0>:      cmp    $0x24,%ax         
       0x00000000000081ac <isr_common.no_more_acks+4>:      pop    %rax              
       0x00000000000081ad <isr_common.no_more_acks+5>:      pop    %rax              
    => 0x00000000000081ae <isr_common.end+0>:       iretq                            
       0x00000000000081b0 <isr_head_0+0>:   pushq  $0x55    ;DUMMY ERROR CODE                         
       0x00000000000081b2 <isr_head_0+2>:   mov    $0x0,%eax                         
       0x00000000000081b7 <isr_head_0+7>:   push   %rax                              
       0x00000000000081b8 <isr_head_0+8>:   jmp    0x8189 <isr_common>               
       0x00000000000081ba <isr_head_1+0>:   pushq  $0x55    ;DUMMY ERROR CODE                                                  
       0x00000000000081bc <isr_head_1+2>:   mov    $0x1,%eax                         
       0x00000000000081c1 <isr_head_1+7>:   push   %rax                              
       0x00000000000081c2 <isr_head_1+8>:   jmp    0x8189 <isr_common>
That also shows a couple of the "isr_head" stubs that are entered in the IDT. Each one pushes a dummy error code where the CPU does not supply one, pushes its vector number, and jumps to isr_common.

The stack looks correct to me:

    (gdb) bt                                   
    #0  0x00000000000081ae in isr_common.end () 
    #1  0x0000000000008123 in LongMode.Nirv ()  
    #2  0x0000000000000010 in ?? ()             
    #3  0x0000000000000216 in ?? ()             
    #4  0x0000000000015000 in Pd ()             
    #5  0x0000000000000010 in ?? ()             
    #6  0x000000b8e5894855 in ?? ()             
    #7  0x78bf00000332e800 in ?? ()             
    #8  0x000003e3e8000000 in ?? ()                        
where:

    0x0000000000008122 <LongMode.Nirv+0>:        hlt                           
    0x0000000000008123 <LongMode.Nirv+1>:        jmp    0x8122 <LongMode.Nirv> 
To be thorough, here are the registers and the top of the stack:

    (gdb) info registers                              
    rax            0x55     85                        
    rbx            0x80000011       2147483665        
    rcx            0xc0000080       3221225600        
    rdx            0x3f8    1016                      
    rsi            0xb      11                        
    rdi            0x3fc    1020                      
    rbp            0x0      0x0                       
    rsp            0x14fd8  0x14fd8 <Pd+36824>        
    r8             0x0      0                         
    r9             0x0      0                         
    r10            0x0      0                         
    r11            0x0      0                         
    r12            0x0      0                         
    r13            0x0      0                         
    r14            0x0      0                         
    r15            0x0      0                         
    rip            0x81ae   0x81ae <isr_common.end>   
    eflags         0x97     [ CF PF AF SF ]           
    cs             0x8      8                         
    ss             0x10     16                        
    ds             0x10     16                        
    es             0x10     16                        
    fs             0x10     16                        
    gs             0x10     16
    
    (gdb) x/32xg 0x14f00                                               
    0x14f00 <Pd+36608>:     0x0000000000841f0f      0x000000841f0f2e66 
    0x14f10 <Pd+36624>:     0x00841f0f2e660000      0x1f0f2e6600000000 
    0x14f20 <Pd+36640>:     0x2e66000000000084      0x0000000000841f0f 
    0x14f30 <Pd+36656>:     0x000000841f0f2e66      0x00841f0f2e660000 
    0x14f40 <Pd+36672>:     0x1f0f2e6600000000      0x2e66000000000084 
    0x14f50 <Pd+36688>:     0x0000000000841f0f      0x000000841f0f2e66 
    0x14f60 <Pd+36704>:     0x00841f0f2e660000      0x1f0f2e6600000000 
    0x14f70 <Pd+36720>:     0x2e66000000000084      0x0000000000841f0f 
    0x14f80 <Pd+36736>:     0x000000841f0f2e66      0x00841f0f2e660000 
    0x14f90 <Pd+36752>:     0x1f0f2e6600000000      0x2e66000000000084 
    0x14fa0 <Pd+36768>:     0x0000000000000020      0x0000000000008144 
    0x14fb0 <Pd+36784>:     0x0000000080000011      0x0000000000000020 
    0x14fc0 <Pd+36800>:     0x0000000000000020      0x0000000000000020 
    0x14fd0 <Pd+36816>:     0x0000000000000055      0x0000000000008123 
    0x14fe0 <Pd+36832>:     0x0000000000000010      0x0000000000000216 
    0x14ff0 <Pd+36848>:     0x0000000000015000      0x0000000000000010 
Now I'll let it run to the GP handler head:

    (gdb) break isr_head_13                                  
    Breakpoint 3 at 0x8236                                   
    (gdb) c                                                  
    Continuing.                                              
                                                             
    Breakpoint 3, 0x0000000000008236 in isr_head_13 ()       
    (gdb) bt                                                 
    #0  0x0000000000008236 in isr_head_13 ()                 
    #1  0x0000000000000010 in ?? ()                          
    #2  0x00000000000081ae in isr_common.no_more_acks ()     
    #3  0x0000000000000008 in ?? ()                          
    #4  0x0000000000000097 in ?? ()                          
    #5  0x0000000000014fd8 in Pd ()                          
    #6  0x0000000000000010 in ?? ()                          
    #7  0x0000000000000055 in ?? ()                          
    #8  0x0000000000008123 in LongMode.Nirv ()               
    #9  0x0000000000000010 in ?? ()                          
    #10 0x0000000000000216 in ?? ()                          
    #11 0x0000000000015000 in Pd ()                          
    #12 0x0000000000000010 in ?? ()
We see that the CPU pushed the error code 0x10 on top of the usual frame (SS, RSP, flags, CS and return address), but the interesting thing is that my dummy error code from the timer (0x55) is back from the dead.
We already know it was popped before the first iretq, and I didn't push it this time:

    (gdb) disas isr_head_13                                   
    Dump of assembler code for function isr_head_13:          
    => 0x0000000000008236 <+0>:     mov    $0xd,%eax          
       0x000000000000823b <+5>:     push   %rax               
       0x000000000000823c <+6>:     jmpq   0x8189 <isr_common>
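I guess that's just the CPU's 16-byte stack alignment, which I have no control over. The stack was 16-byte aligned before the timer went off, but the CPU pushed an odd number of quadwords (five), so RSP was only 8-byte aligned at the iretq. When the GP was delivered, the CPU aligned RSP down first and so skipped the slot that still held the stale 0x55.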
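
A minimal sketch, in C, of the alignment arithmetic described above. It assumes the long-mode rule that the CPU rounds RSP down to a 16-byte boundary before pushing SS, RSP, RFLAGS, CS, RIP and (for some exceptions) an error code. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Where the CPU starts pushing: the old RSP rounded down to 16 bytes. */
uint64_t frame_top(uint64_t old_rsp)
{
    return old_rsp & ~(uint64_t)0xF;
}

/* RSP seen by the first instruction of the handler.
 * The CPU pushes 5 quadwords (SS, RSP, RFLAGS, CS, RIP),
 * plus 1 more when the exception supplies an error code. */
uint64_t rsp_at_handler_entry(uint64_t old_rsp, int has_error_code)
{
    uint64_t qwords = has_error_code ? 6 : 5;
    return frame_top(old_rsp) - 8 * qwords;
}
```

With the values from the dumps, the timer interrupt (no error code) arriving at RSP 0x15000 leaves RSP at 0x14fd8, and the GP (with error code) arriving at RSP 0x14fd8 starts its frame at 0x14fd0, leaving the old contents of the slot at 0x14fd0 untouched.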

So why would it crash? The Intel docs say that GP with a selector means it tried to pop something out of range, but I see no such problem.
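
For reference, a rough C sketch of how a selector-style error code such as the 0x10 on the stack breaks down into fields. The bit layout is the standard one for these error codes; the struct and function names are hypothetical, chosen only for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of a selector-style error code (pushed for #GP, #TS, #NP, #SS). */
struct selector_error {
    unsigned ext;   /* bit 0: event originated outside the program */
    unsigned idt;   /* bit 1: index refers to a gate in the IDT */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT (only meaningful if idt == 0) */
    unsigned index; /* bits 3..15: index into the selected table */
};

struct selector_error decode_selector_error(uint32_t code)
{
    struct selector_error e;
    e.ext   = code & 1u;
    e.idt   = (code >> 1) & 1u;
    e.ti    = (code >> 2) & 1u;
    e.index = (code >> 3) & 0x1FFFu;
    return e;
}

/* Print the decoded fields in a form that is easy to read on a serial log. */
void print_selector_error(uint32_t code)
{
    struct selector_error e = decode_selector_error(code);
    const char *table = e.idt ? "IDT" : (e.ti ? "LDT" : "GDT");
    printf("error code 0x%x: %s index %u, ext=%u\n",
           (unsigned)code, table, e.index, e.ext);
}
```

Decoding 0x10 this way gives GDT index 2 with EXT clear, which is the same descriptor that selector 0x10 names.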

Any help much appreciated.

Re: Cyclical GP fault on iretq of timer interrupt in 64 bit

Posted: Sun Aug 12, 2018 5:51 am
by Brendan
Hi,
adrianmay wrote: So why would it crash? The Intel docs say that GP with a selector means it tried to pop something out of range, but I see no such problem.
It's very likely that the stack is messed up when you IRETQ (e.g. you forgot to POP something), causing the CPU to complain because the value for CS or SS that it's trying to load from the stack isn't where the CPU thinks it should be.

If that's the problem, then it's extremely unlikely that the compiler would have generated wrong code, which means that it's likely that the problem is in your assembly stubs (and the common interrupt handler if that's also in assembly).

Would you mind posting the original assembly source code for the stubs (and the common interrupt handler if that's also in assembly), so we can see the whole thing (and not just fragments excluding "not taken" branches)?

Note that this looks wrong:

    (gdb) disas isr_head_13                                   
    Dump of assembler code for function isr_head_13:          
    => 0x0000000000008236 <+0>:     mov    $0xd,%eax          
       0x000000000000823b <+5>:     push   %rax               
       0x000000000000823c <+6>:     jmpq   0x8189 <isr_common>
...because it's modifying RAX before pushing it (causing the original value in RAX from the interrupted code to be trashed), but that can't cause a GPF by itself.


Cheers,

Brendan

Re: Cyclical GP fault on iretq of timer interrupt in 64 bit

Posted: Sun Aug 12, 2018 6:10 am
by Octocontrabass
adrianmay wrote: The stack looks correct to me:

    #2  0x0000000000000010 in ?? ()             
That looks an awful lot like your data selector was in CS when the IRQ occurred, which might explain the initial GPF.
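
A minimal C sketch of that check, assuming (from the register dump, where cs is 0x8 and ss is 0x10) that 0x08 is the kernel code selector and 0x10 the data selector. The struct, macro and function names are hypothetical.

```c
#include <stdint.h>

/* The five quadwords IRETQ pops, lowest address first. */
struct iretq_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Assumed selectors, taken from the cs/ss values in the register dump. */
#define KERNEL_CODE_SELECTOR 0x08u
#define KERNEL_DATA_SELECTOR 0x10u

/* Returns 1 if the saved CS is the expected code selector, 0 otherwise. */
int frame_cs_is_code_selector(const struct iretq_frame *f)
{
    return (f->cs & 0xFFFFu) == KERNEL_CODE_SELECTOR;
}
```

Filled in from the posted stack dump (RIP 0x8123, CS 0x10, RFLAGS 0x216, RSP 0x15000, SS 0x10), this check fails, because the saved CS holds the data selector.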

Try running your OS in Bochs. Bochs logs a lot of detail by default, including which protection check is causing each GPF. It's often enough to pinpoint the issue, but if not, you can post the log here along with a link to your code.

Re: Cyclical GP fault on iretq of timer interrupt in 64 bit

Posted: Mon Aug 13, 2018 5:35 am
by adrianmay
Indeed, it was because I had the data segment selector on the stack where the code segment selector should have been. When I put 8: in front of the target of an earlier jump, so that CS was loaded with the code selector, the problem went away.
Thanks everybody!