[SOLVED] Strange Pagefault on hardware interrupt.
Posted: Fri Sep 13, 2013 6:05 am
Hey!
I have a small hobby micro-kernel (surprise!) found at [1]. I'm relatively far already (physical memory, virtual memory, threading, and (the newest one) user mode threads. Now that i have added user mode i'm facing a strange problem with the local APIC timer, and a page-fault which i completely fail to explain.
Some details on the problem, i really hope somebody has a good idea what i could check ... :
* I have two kernel threads that print out the current time in microseconds. If i let only those two threads run, the whole system is stable (at least for a few hours ). [To be clear]: the local APIC timer code works flawlessly in this scenario.
* Now i added a user mode thread, that writes "hello userspace" using a system call (through int 50). If it writes this only a few times (number varies), all is ok, and even exiting and cleaning up the thread works well. If it writes the same in a loop more than ~5-20 times, things get strange (explanation of output below):
from the above I can see:
* the kernel thread 3 is running ("hello thread")
* the user mode thread is running and printing "hello userspace" using int 50 (i added trace output to see that interrupt dispatcher gets triggered).
* this LAPIC timer is scheduled to somewhere close to 181989 (timestamp of log output = /very/ roughly the timestamp the LAPIC uses for calculation) + 5234 (the "scheduled for" output). means it should fire somewhere around 187223 (again, roughly. usually log timestamps are ~200-300 microseconds off).
* after a few short microseconds, i get this not-so-cool page-fault at 0x00006fffffff006a <0x00006fffffff0045:uapi_sysc_call + 0x25> while writing to 0xfffffffffffffff8. looking at the disassembly there:
* as you can see, the page-fault happens at a "mov %rax,%rax". i don't trust this. i suspect something with interrupt handling is wrong...?!
* as an additionaly issue, the local APIC never fires anymore (to be clear: until it stops, it already fired a few times fine, and if i only run kernel threads, it fires fine all the time). I use the local APIC as only scheduler clock source. I use it in one-shot mode with a small timer framework to calculate the next fire time (so i'm more or "tickless"). Currently the scheduler is mostly alone with programming the timer to after the threads current time slice expired. But because the timer interrupt seems to be lost (can the be due to another disabled interrupts while the timer fires? i thought the interrupt will be delivered when i re-enable interrupts?) the timer never gets re-programmed, and thus will never fire again
maybe also useful information:
* the same happens on both qemu and bochs. there is no helpful information from both qemu and bochs ( usually there is at least /something/).
* on qemu with gdb attached, if i let the kernel run on (executing now kernel thread 3 ("hello thread") all the time, as timer does not trigger scheduler anymore), the kernel hangs after another ~50 rounds, and gdb breaks completely when trying to investigate (invalid package reply, ...). This does not happen with bochs. There it just keeps on running that single thread.
* i thought about maybe having some parallelism somewhere in the interrupt handling, but i cannot imagine this, as the first stmt in the ISR is cli and the last sti; iretq;
If you need any more details, please ask! I'd really appreciate any kind of help or hint.
[1] https://github.com/mduft/tachyon3
Cheers, Thanks,
Markus
I have a small hobby micro-kernel (surprise!) found at [1]. I'm relatively far already (physical memory, virtual memory, threading, and (the newest one) user mode threads. Now that i have added user mode i'm facing a strange problem with the local APIC timer, and a page-fault which i completely fail to explain.
Some details on the problem, i really hope somebody has a good idea what i could check ... :
* I have two kernel threads that print out the current time in microseconds. If i let only those two threads run, the whole system is stable (at least for a few hours ). [To be clear]: the local APIC timer code works flawlessly in this scenario.
* Now i added a user mode thread, that writes "hello userspace" using a system call (through int 50). If it writes this only a few times (number varies), all is ok, and even exiting and cleaning up the thread works well. If it writes the same in a loop more than ~5-20 times, things get strange (explanation of output below):
Code: Select all
[176264] info: 3: hello thread 176307 (0xffffffff7ffe1f98)
[181638] trace: dispatch 35
[181752] trace: switching: 1:0 (6)
[181989] trace: scheduled for 5234
[182459] trace: dispatch 50
hello userspace
[183330] trace: dispatch 50
hello userspace
[184663] trace: dispatch 50
hello userspace
[185732] trace: dispatch 50
hello userspace
[186767] trace: dispatch 50
hello userspace
[188076] trace: dispatch 14
[188489] trace: page-fault at 0x00006fffffff006a <0x00006fffffff0045:uapi_sysc_call + 0x25> while writing to 0xfffffffffffffff8
[189366] trace: CPU registers after fault:
RAX: 0x0000000000000000, RBX: 0x0000000000000000
RCX: 0x0000000000000003, RDX: 0xffffffff8013fd8a
RBP: 0x00007fffffffff98, RSP: 0x00007fffffffff70
[190089] trace: fault occured in thread 0, process 1
[190388] trace: no translation for the given page was available!
[191146] error: don't know anything more to do. page-fault @ 0x00006fffffff006a (code 0x2) without resolution!
[193326] error: thread 0 in process 1 aborted!
[0xffffffff80127d91] thr_abort + 0x9b
[0xffffffff8011d9a7] pgflt_handler + 0x7d2
[0xffffffff8011ff99] intr_dispatch + 0x1a3
[0xffffffff8013ce1e] _x86_64_isr14 + 0x47
[------ intr ------]
[0x00006fffffff006a] uapi_sysc_call + 0x25
[0x00006fffffff0038] test_thr2 + 0x38
[0x00006fffffff0097] uapi_thr_trampoline + 0x20
[0x0000000000000000] <unknown>
* the kernel thread 3 is running ("hello thread")
* the user mode thread is running and printing "hello userspace" using int 50 (i added trace output to see that interrupt dispatcher gets triggered).
* this LAPIC timer is scheduled to somewhere close to 181989 (timestamp of log output = /very/ roughly the timestamp the LAPIC uses for calculation) + 5234 (the "scheduled for" output). means it should fire somewhere around 187223 (again, roughly. usually log timestamps are ~200-300 microseconds off).
* after a few short microseconds, i get this not-so-cool page-fault at 0x00006fffffff006a <0x00006fffffff0045:uapi_sysc_call + 0x25> while writing to 0xfffffffffffffff8. looking at the disassembly there:
Code: Select all
uintptr_t SECTION(SECT_USER_CODE) uapi_sysc_call(syscall_t call, volatile uintptr_t p0, volatile uintptr_t p1) {
6fffffff0045: 55 push %rbp
6fffffff0046: 48 89 e5 mov %rsp,%rbp
6fffffff0049: 48 83 ec 28 sub $0x28,%rsp
6fffffff004d: 89 7d ec mov %edi,-0x14(%rbp)
6fffffff0050: 48 89 75 e0 mov %rsi,-0x20(%rbp)
6fffffff0054: 48 89 55 d8 mov %rdx,-0x28(%rbp)
volatile uintptr_t res;
// TODO: sysenter, etc.?
asm volatile(
6fffffff0058: 48 8b 4d e0 mov -0x20(%rbp),%rcx
6fffffff005c: 48 8b 55 d8 mov -0x28(%rbp),%rdx
6fffffff0060: 8b 45 ec mov -0x14(%rbp),%eax
6fffffff0063: 89 c7 mov %eax,%edi
6fffffff0065: 48 89 ce mov %rcx,%rsi
6fffffff0068: cd 32 int $0x32
6fffffff006a: 48 89 c0 mov %rax,%rax
6fffffff006d: 48 89 45 f8 mov %rax,-0x8(%rbp)
"\tint %1\n"
"\tmov %%rax, %0\n"
: "=a"(res)
: "i"(SYSC_INTERRUPT), "D"(call), "S"(p0), "d"(p1));
return res;
6fffffff0071: 48 8b 45 f8 mov -0x8(%rbp),%rax
}
6fffffff0075: c9 leaveq
6fffffff0076: c3 retq
* as an additionaly issue, the local APIC never fires anymore (to be clear: until it stops, it already fired a few times fine, and if i only run kernel threads, it fires fine all the time). I use the local APIC as only scheduler clock source. I use it in one-shot mode with a small timer framework to calculate the next fire time (so i'm more or "tickless"). Currently the scheduler is mostly alone with programming the timer to after the threads current time slice expired. But because the timer interrupt seems to be lost (can the be due to another disabled interrupts while the timer fires? i thought the interrupt will be delivered when i re-enable interrupts?) the timer never gets re-programmed, and thus will never fire again
maybe also useful information:
* the same happens on both qemu and bochs. there is no helpful information from both qemu and bochs ( usually there is at least /something/).
* on qemu with gdb attached, if i let the kernel run on (executing now kernel thread 3 ("hello thread") all the time, as timer does not trigger scheduler anymore), the kernel hangs after another ~50 rounds, and gdb breaks completely when trying to investigate (invalid package reply, ...). This does not happen with bochs. There it just keeps on running that single thread.
* i thought about maybe having some parallelism somewhere in the interrupt handling, but i cannot imagine this, as the first stmt in the ISR is cli and the last sti; iretq;
If you need any more details, please ask! I'd really appreciate any kind of help or hint.
[1] https://github.com/mduft/tachyon3
Cheers, Thanks,
Markus