Hi,
Hangin10 wrote:Brendan wrote:One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?
Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
I always envisioned a kernel stack per thread. I didn't even realize there were other options; I can see how difficult those could be (at the very least it hurts my head to think about it).
In this case (one kernel stack per thread), you probably need to reserve space for "max number of threads * kernel stack size" bytes in the address space. However, I should probably point out that in some situations you can reserve space for only one kernel stack, and map different physical pages into this same area during thread switches (e.g. lots of kernel stacks all mapped to the same linear address).
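As a rough sketch of the "same linear address" approach (with a made-up KSTACK_PAGES constant, and assuming the new thread's saved page table entries are at ESI and the live page table entries that map the shared stack area are at EDI, e.g. found via a recursive page directory mapping), the thread switch code might do something like:
Code:
KSTACK_PAGES equ 4                     ;Pages per kernel stack (hypothetical)

;Hypothetical sketch: remap the shared kernel stack area during a thread switch
;  esi = address of the new thread's saved page table entries
;  edi = address of the page table entries that map the shared stack area
switchKernelStack:
    mov ecx,KSTACK_PAGES
.nextPage:
    lodsd                              ;eax = new thread's next page table entry
    stosd                              ;Store it into the live page table
    loop .nextPage
    mov eax,cr3                        ;Reload CR3 to flush stale TLB entries
    mov cr3,eax                        ; (INVLPG on each page would be cheaper)
    ret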
Hangin10 wrote:Brendan wrote:For applications, if the stack isn't large enough the application can change the location of its stack. For this reason I'd probably give applications a tiny stack (e.g. 4 KiB, or just enough so it can allocate a larger stack if it needs to).
My intent was to only give it a page, and let the page fault handler handle growth. If the application suddenly decides to put its stack elsewhere, I doubt that would affect my kernel's functioning at all (the pf handler would only grow the stack in the defined area). Then, if an application puts its stack elsewhere (or even if it doesn't; that's its own problem), it would always be allowed to request a page allocation in that space, like the rest of "application space". Perhaps I should include a syscall to specifically allow an application to request an amount of additional "standard" stack space?
Your kernel's linear memory manager could allow processes to convert any arbitrary area in user space to "allocate on demand" (where the area looks like it contains zeros, and writes to pages cause new pages to be allocated). For example, you could have one page in user space for the default 4 KiB stack (e.g. from 0x00001000 to 0x00001FFF) that's used when a process is first started; then the process can tell the kernel to make the area from 0x12345000 to 0x23456FFF "allocate on demand" and then do "mov esp,0x23456780" to use this area for a larger stack. Also don't forget that when a process spawns a new thread, the new thread will need a stack of its own (and when a process spawns a new thread it could tell the kernel to use <some_address> for the new thread's stack).
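To illustrate the user-space side (with a made-up function number and a made-up kernel API mechanism, just to show the idea):
Code:
SYSCALL_ALLOC_ON_DEMAND equ 5          ;Made-up function number for this sketch
STACK_AREA_START        equ 0x12345000
STACK_AREA_END          equ 0x23456FFF

    mov eax,SYSCALL_ALLOC_ON_DEMAND    ;Hypothetical "make area allocate on demand" function
    mov ebx,STACK_AREA_START           ;Start of the area
    mov ecx,STACK_AREA_END             ;End of the area
    int 0x80                           ;Hypothetical kernel API mechanism
    mov esp,0x23456780                 ;Use the new area as the stack (grows down from here)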
Hangin10 wrote:Brendan wrote:The kernel stack can grow - the main problem is when the CPU gets a page fault while trying to start an exception handler because the stack needs to be larger. If you rely on the page fault handler to increase the size of a kernel stack then you'd need to use a task gate for the page fault handler, or use a task gate for the double fault handler (where the double fault exception handler will increase the size of the kernel stack if the page fault handler couldn't be started by the CPU). Of course there's also the "no free RAM left" problem to deal with...
For kernel stack size, I've used 2 KiB kernel stacks before without any real problem, but that was a micro-kernel written in assembly. If you're using a high level language, or if your code uses recursion, or if it's a monolithic kernel, or if it supports nested IRQs, then you'll probably need larger kernel stacks. For example, as far as I know, Linux uses 8 KiB kernel stacks.
How can the kernel stack grow without the page fault handler?
One way would be to use a segment register, so that if the stack reaches its current limit you get a general protection fault. Another method is to use explicit tests - for example, before calling any function you could check if there's enough space free on the current stack (and increase the size of the stack before calling the function if there isn't enough space). Of course explicit tests might only be necessary in some cases (e.g. before calling a specific kernel function that is known to use lots of stack space, or in recursive code).
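For example, a hypothetical explicit test (assuming [currentStackLimit] holds the lowest address the current kernel stack may reach, and a made-up STACK_SPACE_NEEDED constant) might look like:
Code:
STACK_SPACE_NEEDED equ 1024            ;Worst-case stack usage of the call (hypothetical)

    mov eax,esp
    sub eax,STACK_SPACE_NEEDED         ;eax = lowest address the call might touch
    cmp eax,[currentStackLimit]
    jae .enoughStack                   ;Enough space free, just do the call
    call growKernelStack               ;Hypothetical: map more pages onto this stack first
.enoughStack:
    call someStackHungryFunction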
Hangin10 wrote:To cut a ramble short, I'm mostly comfortable with the 16 KiB kernel stacks (my initial inspiration to write this OS was Win98 running on a K6 with 32 MB of RAM. The second you open IE, it's out of RAM without any other processes running except explorer, IIRC), but I might reconsider 8 KiB later.
Doing something like "#define KERNEL_STACK_PAGES 4" is a good idea...
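In an assembly kernel the same idea might look like:
Code:
KERNEL_STACK_PAGES equ 4
KERNEL_STACK_SIZE  equ KERNEL_STACK_PAGES * 4096   ;16 KiB, trivial to change later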
Hangin10 wrote:Brendan wrote:If you do use one kernel stack per thread, then you want to use small kernel stacks, and you also want to make sure your kernel code is careful about stack usage (e.g. no recursive code); but if the kernel stacks are small you could allocate RAM for them so that you don't need to worry about them growing. Also, it's possible to have one kernel stack per thread plus an additional stack per IRQ, and when the IRQ handler starts you switch to the IRQ handler's own stack at the start of the IRQ handler and switch back to the previous stack at the end of the IRQ handler (this allows you to support nested IRQs without needing kernel stacks that are all large enough to handle many IRQ handlers at the same time).
So have about ~1 MB of virtual address space set aside, with one page for each IRQ's stack (except for page fault, double fault, and the IRQ0 timer), and each IRQ can switch to its own stack. It would have to copy the saved interrupt information from the old stack to the new stack, then it doesn't need to switch back. I'm confused, then, as that ~1 MB ends up being per thread, unless it only exists until needed, but then how to know when to map it out?
There are only 256 entries in the IDT, and some of those are reserved for other purposes (e.g. exceptions, kernel API, IPIs). This means that in the absolute worst case you're limited to a maximum of about 224 IRQs. So, how much stack space does an IRQ handler need? IRQ handlers should be small and fast, so 512 bytes of stack space per IRQ is probably enough.
If you're using extra "per IRQ" stacks, then maybe each thread only needs 2 KiB of kernel stack and each IRQ handler only needs 512 bytes of stack. In this case, if the kernel supports a maximum of 16384 threads, you'd need up to 32768 KiB for kernel stacks plus up to 112 KiB for IRQ stacks (224 IRQs * 512 bytes), or up to about 32.11 MiB of total stack space. In the same situation but without "per IRQ" stacks, each thread's kernel stack would need to be large enough for all 224 IRQ handlers to nest on top of it (2 KiB + 224 * 512 bytes = 114 KiB per thread), which works out to about 1.78 GiB of total stack space.
If you're using "per thread" kernel stack with additional "per IRQ" stacks, then there's no need to copy anything between stacks. For an example, your IRQ handlers might look something like this:
Code:
IRQhandler:                            ;Interrupts disabled by CPU on entry
    push ebp                           ;Save EBP
    mov ebp,esp                        ;EBP = interrupted code's ESP
    mov esp,IRQ0_stacktop              ;ESP = this IRQ handler's private stack
    push ebp                           ;Interrupted code's ESP pushed onto the private stack
    sti                                ;Enable IRQs (allow nesting)

    /* normal IRQ handling stuff here */

    cli                                ;Disable IRQs (the private stack must not be reused while we're still unwinding it)
    /* sendEOI */
    pop esp                            ;ESP = interrupted code's ESP
    pop ebp                            ;Restore EBP
    iretd
Of course this is only a simple example.
For a more advanced example, consider having a stack of IRQ stacks for each CPU:
Code:
IRQhandler:                                        ;Interrupts disabled by CPU on entry
    push ebp                                       ;Save EBP
    push eax                                       ;Save EAX
    mov ebp,esp                                    ;EBP = interrupted code's ESP
    mov eax,[this_CPU_number]                      ;eax = the current CPU's number
    mov esp,[nextStackTop+eax*4]                   ;esp = the address of this CPU's next free IRQ stack
    add dword [nextStackTop+eax*4],IRQ_STACK_SIZE  ;Update the "next stack address" in case we're interrupted
    push ebp                                       ;Interrupted code's ESP pushed onto this IRQ stack
    sti                                            ;Enable IRQs (allow nesting)

    /* normal IRQ handling stuff here */

    cli                                            ;Disable IRQs (prevent nesting)
    mov eax,[this_CPU_number]                      ;eax = the current CPU's number again (EAX was clobbered above)
    /* sendEOI */
    sub dword [nextStackTop+eax*4],IRQ_STACK_SIZE  ;Release this stack now that we're finished with it
    pop esp                                        ;ESP = interrupted code's ESP
    pop eax                                        ;Restore EAX
    pop ebp                                        ;Restore EBP
    iretd
This might seem like extra overhead, and it will cost you more space (e.g. with 16 CPUs you'll need 16 times as much space for IRQ stacks). However, it does take advantage of cache locality (helps to avoid cache misses and TLB misses). Basically, if you get several different IRQs one after the other (which is likely enough if you're using IRQ priorities), then they'll all use the same IRQ stack (which is going to already be in the cache). There are also advantages for NUMA systems (make sure the RAM used by the CPU is close/fast to that CPU).
Of course this could be improved more - e.g. reduce the number of IRQ stacks without affecting cache locality too much. For example, you might want one set of IRQ stacks per core (e.g. for hyper-threading where logical CPUs share L2 caches), or one set of IRQ stacks per chip (e.g. where multiple cores share the same L2 or L3 caches), or one set of IRQ stacks per NUMA domain (to minimize the number of IRQ stacks while still getting the "RAM used is close/fast" advantage).
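For completeness, here's a rough sketch of how the per-CPU "next stack top" table used by the advanced example might be set up (the MAX_CPUS and MAX_IRQ_NESTING values are made up; the real limit depends on how deeply you allow IRQs to nest):
Code:
IRQ_STACK_SIZE  equ 512                        ;Bytes per IRQ stack
MAX_IRQ_NESTING equ 16                         ;Deepest IRQ nesting allowed per CPU (hypothetical)
MAX_CPUS        equ 16                         ;Hypothetical

section .bss
IRQstacks:    resb MAX_CPUS * MAX_IRQ_NESTING * IRQ_STACK_SIZE
nextStackTop: resd MAX_CPUS                    ;One "next free stack top" per CPU

section .text
initIRQstacks:                                 ;Point each CPU at its first stack
    xor ecx,ecx                                ;ecx = CPU number
.nextCPU:
    mov eax,ecx
    imul eax,MAX_IRQ_NESTING * IRQ_STACK_SIZE  ;eax = offset of this CPU's region
    add eax,IRQstacks + IRQ_STACK_SIZE         ;eax = top of this CPU's first stack
    mov [nextStackTop+ecx*4],eax
    inc ecx
    cmp ecx,MAX_CPUS
    jb .nextCPU
    ret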
Cheers,
Brendan