64-bit mode questions
Posted: Sat Apr 15, 2006 4:46 am
by Candy
I'm actively messing about with my code now (hoping to make a release at the end of the weekend - probably silly to hope for that) and I was converting it all to 64-bit mode. Now, as I was rewriting the interrupt handling code to use the IST mechanism, I wondered whether the rsp[0..2] do anything in 64-bit mode when always using an IST offset in the IDT entries. If not, it'll save me some memory.
Does anybody know whether the rsp[0..2] is still used?
Is the TSS entry in the GDT being used after the LTR call or can I safely assume that it's never going to be viewed or changed (and thereby clean up the GDT) ?
As a future-expecting thing, I named the thread so I could ask more questions here, I expect more will come.
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 5:24 am
by Brendan
Hi,
Candy wrote:I'm actively messing about with my code now (hoping to make a release at the end of the weekend - probably silly to hope for that) and I was converting it all to 64-bit mode. Now, as I was rewriting the interrupt handling code to use the IST mechanism, I wondered whether the rsp[0..2] do anything in 64-bit mode when always using an IST offset in the IDT entries. If not, it'll save me some memory.
Does anybody know whether the rsp[0..2] is still used?
RSP0 is used for interrupts that don't use the IST (sometimes). If all of your IDT entries use the IST, then you don't need RSP0.
This is difficult to do in practice though, as interrupts that use the IST are not re-entrant. Basically, if an interrupt does use the IST then RSP is
always changed when the interrupt occurs, regardless of which CPL or which stack the CPU was using before the interrupt occurred. If 2 IRQ handlers use the same stack/IST entry and the first IRQ handler can be interrupted by the second, then the second IRQ handler will trash the first IRQ handler's stack.
If an interrupt doesn't use the IST and the CPU is at CPL=3, then there will be a stack switch to RSP0.
If an interrupt doesn't use the IST and the CPU is at CPL=0, then there will be no stack switch. Because of the way SYSCALL works (no stack change), this means an interrupt could end up using a CPL=3 stack.
IMHO if you're trying to write a fully interruptable kernel, this is all a big mess.
I'm not sure how I'm going to deal with it yet - either not allowing IRQ nesting at all, or doing dynamic stack switching in software.
I'm also not sure what I'm going to do in the page fault handler, which must use an IST (to avoid triple faults caused by trying to use "not-present" pages in the CPL=3 stack), but must also be interruptable/re-entrant (for paging data to/from disk).
Candy wrote:Is the TSS entry in the GDT being used after the LTR call or can I safely assume that it's never going to be viewed or changed (and thereby clean up the GDT) ?
I would assume that once you've done LTR, the GDT entry is no longer used (same as protected mode if hardware task switching isn't used).
Cheers,
Brendan
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 5:35 am
by Candy
Brendan wrote:
RSP0 is used for interrupts that don't use the IST (sometimes). If all of your IDT entries use the IST, then you don't need RSP0.
This is difficult to do in practice though, as interrupts that use the IST are not re-entrant. Basically, if an interrupt does use the IST then RSP is always changed when the interrupt occurs, regardless of which CPL or which stack the CPU was using before the interrupt occurred. If 2 IRQ handlers use the same stack/IST entry and the first IRQ handler can be interrupted by the second, then the second IRQ handler will trash the first IRQ handler's stack.
If an interrupt doesn't use the IST and the CPU is at CPL=3, then there will be a stack switch to RSP0.
If an interrupt doesn't use the IST and the CPU is at CPL=0, then there will be no stack switch. Because of the way SYSCALL works (no stack change), this means an interrupt could end up using a CPL=3 stack.
IMHO if you're trying to write a fully interruptable kernel, this is all a big mess.
I'm not sure how I'm going to deal with it yet - either not allowing IRQ nesting at all, or doing dynamic stack switching in software.
Disallow IRQ nesting, figure out in which ways exceptions may cause other exceptions, and handle them with different IST entries. That's my approach at the moment. Disallowing IRQ nesting also means that IRQ handlers must be VERY short, on the order of hundreds of cycles as opposed to some multi-thousand-cycle operations I've seen at times.
I'm also not sure what I'm going to do in the page fault handler, which must use an IST (to avoid triple faults caused by trying to use "not-present" pages in the CPL=3 stack), but must also be interruptable/re-entrant (for paging data to/from disk).
Why not swap the IST stack for the PF handler out with the thread, and restore it just the same? PF's aren't nestable by design so that's ok.
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 6:45 am
by Brendan
Hi,
Candy wrote:Brendan wrote:I'm not sure how I'm going to deal with it yet - either not allowing IRQ nesting at all, or doing dynamic stack switching in software.
Disallow IRQ nesting, figure out in which ways exceptions may cause other exceptions, and handle them with different IST entries. That's my approach at the moment. Disallowing IRQ nesting also means that IRQ handlers must be VERY short, on the order of hundreds of cycles as opposed to some multi-thousand-cycle operations I've seen at times.
For me, the kernel handles all IRQs by sending a message to any threads that have registered for that IRQ. Not allowing IRQ nesting means not allowing the "send_message()" code to be interrupted, and as "send_message()" may need to allocate memory it also means the memory manager/s become non-interruptable. That's a lot of uninterruptable code.
It also doesn't help when there isn't enough free RAM to send a message (trying to send pages to swap space while running uninterruptable code), or what effect it'd have on "worst case" interrupt latency (repeatedly trying to acquire the message queue spinlocks when other CPUs have already acquired them).
I'm thinking one option for IRQ handlers might be to use the default stack (either the CPL=3 stack or RSP0) and make sure the page fault handler can deal with "page not present" in the CPL=3 stack from within the kernel's IRQ handler and messaging code. This implies a different RSP0 for each thread.
I've been adding up the overhead though - 2 stacks per thread gets expensive when there's lots of threads. That's why I've been considering dynamic stack switching in software.
The idea here is that you work out how much can be nested (N), and then have "N + 1" kernel stacks per CPU. In this case every IDT entry is an "interrupt gate" that uses the same IST entry. On interrupt handler entry interrupts are disabled by the CPU. Before the interrupt handler does anything it finds the next free kernel stack and does a stack switch, then enables interrupts. This gives a very short uninterruptable period at the start of an interrupt handler, but allows for everything else to be interruptable.
I'd be looking at 256 kernel stacks per CPU (worst case). This means that if the OS is running more than 256 threads per CPU then dynamic stack switching saves memory. It also helps with cache locality, as the same "kernel stack pages" would be constantly re-used (i.e. less cache misses).
For very fast interrupts (where the interrupt handler is relatively fast compared to the software stack switch), the interrupt handler could remain uninterruptable and use the IST stack. I'm thinking of IPI handling for TLB shootdown, timer IRQs, etc.
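The bookkeeping behind the dynamic stack switch described above could be sketched like this in plain C (a userspace sketch: the names `acquire_kernel_stack`/`release_kernel_stack` are hypothetical, and a real kernel would perform the actual RSP switch in the assembly entry stub while interrupts are still disabled):

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_STACKS 8        /* the "N + 1" kernel stacks for this CPU */
#define STACK_SIZE 4096

static uint8_t stacks[NUM_STACKS][STACK_SIZE];
static uint8_t stack_in_use[NUM_STACKS];

/* Called with interrupts disabled, on the shared IST stack: find the
   next free kernel stack and mark it used. Returns the initial stack
   top (x86 stacks grow down), or NULL if more than N interrupts have
   nested -- which would be a design bug. */
void *acquire_kernel_stack(void)
{
    for (int i = 0; i < NUM_STACKS; i++) {
        if (!stack_in_use[i]) {
            stack_in_use[i] = 1;
            return &stacks[i][STACK_SIZE];  /* top of stack i */
        }
    }
    return NULL;
}

/* Called with interrupts disabled again, just before IRETQ. */
void release_kernel_stack(void *top)
{
    size_t i = ((uint8_t *)top - &stacks[0][0] - STACK_SIZE) / STACK_SIZE;
    stack_in_use[i] = 0;
}
```

Only the scan-and-mark step needs interrupts off, which is what keeps the uninterruptable window short.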
Candy wrote:I'm also not sure what I'm going to do in the page fault handler, which must use an IST (to avoid triple faults caused by trying to use "not-present" pages in the CPL=3 stack), but must also be interruptable/re-entrant (for paging data to/from disk).
Why not swap the IST stack for the PF handler out with the thread, and restore it just the same? PF's aren't nestable by design so that's ok.
PF's can be nested - for example, the kernel before my last kernel had a page fault handler that would try to map a page into a page table and generate a second page fault if the page table isn't present (the second page fault would map a page table and return, allowing the first page fault to complete). I decided this was taking "allocation on demand" a little too far though... BTW double faults only occur when the CPU is trying to invoke the exception handler, and don't occur after the exception handler has been started.
I'm thinking of using software stack switching here too....
Cheers,
Brendan
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 7:18 am
by Candy
Hi,
Brendan wrote:
For me, the kernel handles all IRQs by sending a message to any threads that have registered for that IRQ. Not allowing IRQ nesting means not allowing the "send_message()" code to be interrupted, and as "send_message()" may need to allocate memory it also means the memory manager/s become non-interruptable. That's a lot of uninterruptable code.
It also doesn't help when there isn't enough free RAM to send a message (trying to send pages to swap space while running uninterruptable code), or what effect it'd have on "worst case" interrupt latency (repeatedly trying to acquire the message queue spinlocks when other CPUs have already acquired them).
Hence my objective of limiting the reach of an IRQ handler. If it can only run for a predetermined number of cycles, you can't call malloc, free or any other such function. That means you have to design around such short interrupts, making your driver much more stable. It also means that, since the interrupts are very short, interrupt code is a lot more testable and you can get away with a very small stack (perhaps even a single page). Make your drivers set up memory regions for target writing before registering for the interrupt. It also makes the behaviour more deterministic.
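The "set up memory regions for target writing before registering" idea can be sketched as a pre-allocated single-producer ring buffer: the IRQ handler only writes into memory the driver allocated in advance, so it never allocates or blocks. (A hedged sketch with made-up names and sizes; a real SMP handler would also need the appropriate memory barriers.)

```c
#include <stdint.h>

#define RING_SIZE 256  /* power of two, allocated before the IRQ is registered */

struct irq_ring {
    volatile uint32_t head;   /* written only by the IRQ handler */
    volatile uint32_t tail;   /* written only by the driver thread */
    uint8_t data[RING_SIZE];
};

/* The whole IRQ handler body: stash one byte and return.
   No allocation, no locks -- a few dozen cycles. */
static inline int irq_push(struct irq_ring *r, uint8_t byte)
{
    uint32_t next = (r->head + 1) & (RING_SIZE - 1);
    if (next == r->tail)
        return -1;            /* ring full: drop; the driver recovers later */
    r->data[r->head] = byte;
    r->head = next;
    return 0;
}

/* The driver thread drains the ring at leisure, outside interrupt context. */
static inline int irq_pop(struct irq_ring *r, uint8_t *byte)
{
    if (r->tail == r->head)
        return -1;            /* empty */
    *byte = r->data[r->tail];
    r->tail = (r->tail + 1) & (RING_SIZE - 1);
    return 0;
}
```

All the expensive work (waking threads, allocating, messaging) then happens in schedulable driver code rather than in the handler itself.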
I'm thinking one option for IRQ handlers might be to use the default stack (either the CPL=3 stack or RSP0) and make sure the page fault handler can deal with "page not present" in the CPL=3 stack from within the kernel's IRQ handler and messaging code. This implies a different RSP0 for each thread.
I've been adding up the overhead though - 2 stacks per thread gets expensive when there's lots of threads. That's why I've been considering dynamic stack switching in software.
That's very unpredictable behaviour. Allowing nested interrupt calls means that your stack must accommodate at least the number of interrupts times their maximum stack usage, both of which are large and hard to predict. I've seen more than one commercially produced and used system crash because the stack overflowed at some off-chance.
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 7:18 am
by Candy
The idea here is that you work out how much can be nested (N), and then have "N + 1" kernel stacks per CPU. In this case every IDT entry is an "interrupt gate" that uses the same IST entry. On interrupt handler entry interrupts are disabled by the CPU. Before the interrupt handler does anything it finds the next free kernel stack and does a stack switch, then enables interrupts. This gives a very short uninterruptable period at the start of an interrupt handler, but allows for everything else to be interruptable.
I'd be looking at 256 kernel stacks per CPU (worst case). This means that if the OS is running more than 256 threads per CPU then dynamic stack switching saves memory. It also helps with cache locality, as the same "kernel stack pages" would be constantly re-used (i.e. less cache misses).
It also means useless overhead when you have fewer than 256 threads.
I'm looking at two thread-specific stacks atm: the normal one (which is huge and growing) and one for cases when the normal stack can't be depended upon. Then there are some stacks that are "global": the hwint stack, the abort-exception stack and so on. They are similarly small, and there's only one of each per CPU.
Then I'll work out how big the page fault stack needs to be and fix it at that size (with some proper coding on my part it should stay below a page in size).
I'm thinking of IPI handling for TLB shootdown, timer IRQs, etc.
I'm going to use IPIs for TLB shootdown and thread-switch IRQs. All threading decisions are taken upon a thread block or end, where the thread is either removed entirely or added at a different location. The values are updated and the thread update lock is released. When the first processor receives its APIC interrupt to do a task quantum decrease and possible switch, it spins on the thread schedule lock and then does its work. It then sends an IPI to the next processor in line to do its task switch. The last processor in the line does its work and then calls the thread schedule method, which decides which threads can be scheduled next. When that one is done, the lock is released. I'm still unsure about some specifics.
PF's can be nested - for example, the kernel before my last kernel had a page fault handler that would try to map a page into a page table and generate a second page fault if the page table isn't present (the second page fault would map a page table and return, allowing the first page fault to complete). I decided this was taking "allocation on demand" a little too far though... BTW double faults only occur when the CPU is trying to invoke the exception handler, and don't occur after the exception handler has been started.
I'm not going to allow recursive page faults. If you can't do many recursive or nested things, you don't get stack overflows quite as quickly.
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 9:19 am
by Brendan
Hi,
Candy wrote:
The idea here is that you work out how much can be nested (N), and then have "N + 1" kernel stacks per CPU. In this case every IDT entry is an "interrupt gate" that uses the same IST entry. On interrupt handler entry interrupts are disabled by the CPU. Before the interrupt handler does anything it finds the next free kernel stack and does a stack switch, then enables interrupts. This gives a very short uninterruptable period at the start of an interrupt handler, but allows for everything else to be interruptable.
I'd be looking at 256 kernel stacks per CPU (worst case). This means that if the OS is running more than 256 threads per CPU then dynamic stack switching saves memory. It also helps with cache locality, as the same "kernel stack pages" would be constantly re-used (i.e. less cache misses).
It also means useless overhead when you have fewer than 256 threads.
It means wasted RAM when there's less than N threads (but I'm intending to run many threads anyway), and I'd still get better cache locality.
BTW I would calculate the number of stacks dynamically (e.g. "number_of_stacks = number_of_IRQs + number_of_exception_handlers + K + number_of_IPI_mechanisms + 1", where K is to account for exception handler nesting). For a "normal" system I'd assume the number of stacks will be closer to 32 (the "256 stack worst case" was derived from one stack per IDT entry, but it'd be very rare to need all 256 IDT entries).
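The estimate above is a straight sum; written out as a trivial helper (the parameter names and the example numbers are illustrative, not from the thread):

```c
/* Brendan's estimate: one stack per IRQ, one per exception handler in use,
   K extra slots for exception nesting, one per IPI mechanism, plus one. */
static unsigned number_of_stacks(unsigned irqs, unsigned exceptions,
                                 unsigned nesting_slack, unsigned ipis)
{
    return irqs + exceptions + nesting_slack + ipis + 1;
}
```

For example, 16 IRQs, 8 exception handlers, K = 4 and 2 IPI mechanisms give 31 stacks, which matches the "closer to 32 for a normal system" figure.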
Candy wrote:I'm looking at two thread-specific stacks atm, the normal one (which is huge and growing) and one for cases when this stack can't be depended upon. Then there are some stacks that are "global", the hwint stack, the abort-exception stack and stuff like that. They are small just the same and only one per CPU.
Then, I'll work out how big the pagefault stack needs to be and fix it at that size (and with some proper coding on my behalf it should stick below a page in size).
That should work, as long as IRQs aren't nested and the IRQ handlers don't mess up IRQ latency, and as long as nothing can keep a global stack during thread switches (including the page fault handler)...
Cheers,
Brendan
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 9:55 am
by Candy
Do 64-bit chips have a speed penalty when accessing a 64-bit int as opposed to a 32-bit int?
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 11:02 am
by Brendan
Hi,
Candy wrote:Do 64-bit chips have a speed penalty when accessing a 64-bit int as opposed to a 32-bit int?
For accessing a 64 bit value from memory there's no speed penalty, as long as the access is aligned to an 8 byte boundary (and probably even when it's not aligned if it fits within one cache line).
Accessing an array of 64 bit values would double the chance of cache misses though, compared to an equivalent array of 32 bit values...
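The doubling is just cache-line arithmetic; a small helper makes it concrete (assuming 64-byte cache lines and a line-aligned array, which is typical but not universal):

```c
/* How many 64-byte cache lines an N-element array touches,
   assuming the array starts on a cache-line boundary. */
static unsigned lines_touched(unsigned nelems, unsigned elem_size)
{
    return (nelems * elem_size + 63) / 64;
}
```

1024 32-bit values fit in 64 lines, while 1024 64-bit values span 128 lines, so a streaming pass over the 64-bit array can incur twice the misses.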
Cheers,
Brendan
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 11:07 am
by Candy
Brendan wrote:
Candy wrote:Do 64-bit chips have a speed penalty when accessing a 64-bit int as opposed to a 32-bit int?
For accessing a 64 bit value from memory there's no speed penalty, as long as the access is aligned to an 8 byte boundary (and probably even when it's not aligned if it fits within one cache line).
Accessing an array of 64 bit values would double the chance of cache misses though, compared to an equivalent array of 32 bit values...
OK, let me reformulate slightly.
Is there an additional speed penalty between 32-bit and 64-bit values when compared with the speed penalty between 16-bit and 32-bit values?
If I understand correctly, there's no direct speed penalty. So, if I use uint64_t instead of int everywhere in the code, it should be just about the same speed, assuming no arrays?
Re:64-bit mode questions
Posted: Sat Apr 15, 2006 11:36 am
by Brendan
Hi,
Candy wrote:OK, let me reformulate slightly.
Is there an additional speed penalty between 32-bit and 64-bit values when compared with the speed penalty between 16-bit and 32-bit values?
If I understand correctly, there's no direct speed penalty. So, if I use uint64_t instead of int everywhere in the code, it should be just about the same speed, assuming no arrays?
AFAIK there is no direct speed penalty for accessing 8-bit, 16-bit, 32-bit, 64-bit, 80-bit or 128-bit values in memory on any 80x86 CPU in any operating mode (if the CPU supports these sizes). There are indirect penalties (data cache efficiency/alignment, .text/.data/.bss section sizes, etc.).
If you change all of your ints to uint64_t then there won't be any direct speed penalties in 64 bit code (but there may be indirect penalties that can usually be completely ignored).
For uint64_t in 32 bit code there would be penalties caused by general registers being 32 bit and operations being doubled (e.g. an "ADD" followed by an "ADC" to do 64 bit addition using 32 bit general registers).
Also, for some CPUs there are indirect penalties for not using the entire general register. For example, loading a 16-bit integer into the lower half of a 32-bit general register can be slower because the CPU may need to wait for an earlier instruction to retire before it can get the higher 16 bits, which creates a register dependency. This isn't the case for 32-bit integers on 64-bit CPUs, as the higher bits are zeroed.
Cheers,
Brendan
Re:64-bit mode questions
Posted: Wed Oct 04, 2006 10:04 am
by frbk
hi
My question is how to code an interrupt routine in AMD long mode.
Up to now, the routine for the timer is entered when interrupts are enabled (there is a specific output on screen coming from this code), but after that the system stops with a general protection exception message.
It seems to me as if the return does not work, but it could also be that something is going wrong on entry.
Can anyone help? Did anybody code such a thing?
thank you
frbk