
Where to put the user stack

Posted: Wed Sep 09, 2009 8:54 pm
by thepowersgang
An issue that has always bugged me is where to place the user's first stack.

Most of what I've seen suggests putting it at the end of the user's address space (for my kernel, 0xBFFF...), but where does one put shared libraries then, if the stack and the heap are converging on each other?
In my old kernel I gave shared libraries the range from 0xB000... to 0xBFFF..., started the user stack at 0xB000..., and worked downwards, but I don't know if that was the best approach.

Any ideas?

Re: Where to put the user stack

Posted: Wed Sep 09, 2009 9:55 pm
by Hangin10
It's probably not the best idea, and I wouldn't be surprised if I made a serious mistake while mapping this out, but my planned memory map for my OS goes something like this:

Code:

0x0000:0000 - Unmapped page for NULL pointer page faults.
0x0000:1000 - Identity mapped for 8086 emulation.
0x0010:0000 - Low memory. Usable pages, but allocated last. Also used for ISA DMA (ie mainly floppies).
0x0100:0000 - End of low memory. Nothing mapped until:
0x03FF:FFFC - Application stack start
0x0400:0000 - Application code / data / bss
0x2000:0000 - Application heap
0x6000:0000 - Application dynamic linked libraries
0x7000:0000 - Pages for shared application message queues
0x8000:0000 - Pages for shared kernel message queues
0x9000:0000 - Pages for page-in points for misc kernel items

0xBFFF:0000 - Kernel stack end
0xBFFF:FFFC - Kernel stack start
0xC000:0000 - Kernel code
0xE000:0000 - Kernel heap
0xFFC0:0000 - Paging data. (ie Page directory mapped to self)
With this kind of arrangement the stack doesn't grow into the heap, and it still gets something like 48MB of address space.
How much stack space do you think apps and the kernel should get?
(The kernel stack can't grow unless the page fault IDT entry is a task gate, I think; is that right?
Maybe I haven't given it enough space.)

Re: Where to put the user stack

Posted: Wed Sep 09, 2009 11:45 pm
by Brendan
Hi,
Hangin10 wrote:It's probably not the best idea, and I wouldn't be surprised if I made a serious mistake while mapping this out, but my planned memory map for my OS goes something like this:

Code:

0x0000:0000 - Unmapped page for NULL pointer page faults.
0x0000:1000 - Identity mapped for 8086 emulation.
0x0010:0000 - Low memory. Usable pages, but allocated last. Also used for ISA DMA (ie mainly floppies).
0x0100:0000 - End of low memory. Nothing mapped until:
0x03FF:FFFC - Application stack start
You probably won't need to have identity mapped pages for 8086 emulation or low memory mapped in every address space - you could map them if the process needs them, and use this space for something else if the process doesn't need them.
Hangin10 wrote:

Code:

0x0400:0000 - Application code / data / bss
0x2000:0000 - Application heap
The application's heap could start just above the end of the application's bss. Normally a kernel doesn't care about the application's heap - it just has a large area where the application can allocate/free pages, and the application implements its own heap (e.g. the code to manage the application's heap might be in a shared library), where the application's heap management code uses the kernel API to allocate/free pages. It's also entirely possible for an application to work with pages directly and not use a heap at all (I've written applications like this in the past), or for different applications to use different alternatives (e.g. "objstacks"), or a mixture of several techniques (e.g. one area for a heap, one area for objstacks, one area for explicitly managed pages, and a fourth area reserved for memory mapped file/s).
Hangin10 wrote:

Code:

0xBFFF:0000 - Kernel stack end
0xBFFF:FFFC - Kernel stack start
One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?

Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
Hangin10 wrote:

Code:

0xC000:0000 - Kernel code
0xE000:0000 - Kernel heap
Kernel heap could start immediately after the kernel code.
Hangin10 wrote:With this kind of arrangement, the stack doesn't grow into the heap, but it does get like 48MB of address space.
How much stack do you think should apps/kernel get?
For applications, if the stack isn't large enough the application can change the location of its stack. For this reason I'd probably give applications a tiny stack (e.g. 4 KiB, or just enough so it can allocate a larger stack if it needs to).
Hangin10 wrote:(The kernel stack can't grow unless the page fault IDT entry is a task gate, I think, is this right?
Maybe I haven't given it enough space.)
The kernel stack can grow - the main problem is when the CPU gets a page fault while trying to start an exception handler because the stack needs to be larger. If you rely on the page fault handler to increase the size of a kernel stack then you'd need to use a task gate for the page fault handler, or use a task gate for the double fault handler (where the double fault exception handler will increase the size of the kernel stack if the page fault handler couldn't be started by the CPU). Of course there's also the "no free RAM left" problem to deal with...

For kernel stack size, I've used 2 KiB kernel stacks before without any real problem, but that was a micro-kernel written in assembly. If you're using a high level language, or if your code uses recursion, or if it's a monolithic kernel, or if it supports nested IRQs, then you'll probably need larger kernel stacks. For example, as far as I know, Linux uses 8 KiB kernel stacks.

However, you'd need to decide how many kernel stacks you need. The more kernel stacks you need the more important the size of each kernel stack is (for example, with one kernel stack per thread, 131072 threads with 8 KiB per kernel stack adds up to 1 GiB). With only one kernel stack, or with only one kernel stack per CPU, you can probably allow 16 MiB of space (for example) per stack without worrying about it too much.

If you do use one kernel stack per thread, then you want to use small kernel stacks, and you also want to make sure your kernel code is careful about stack usage (e.g. no recursive code); but if the kernel stacks are small you could allocate RAM for them so that you don't need to worry about them growing. Also, it's possible to have one kernel stack per thread plus an additional stack per IRQ, and when the IRQ handler starts you switch to the IRQ handler's own stack at the start of the IRQ handler and switch back to the previous stack at the end of the IRQ handler (this allows you to support nested IRQs without needing kernel stacks that are all large enough to handle many IRQ handlers at the same time).


Cheers,

Brendan

Re: Where to put the user stack

Posted: Thu Sep 10, 2009 1:08 am
by Hangin10
Brendan wrote:You probably won't need to have identity mapped pages for 8086 emulation or low memory mapped in every address space - you could map them if the process needs them, and use this space for something else if the process doesn't need them.
Whoops. I kinda put the physical layout in the virtual layout. What I meant was that physical pages with 1MB <= x < 16MB get a separate physical allocation bitmap, and are searched only if there is no free physical page >= 16MB. Also, now that I think about it, I should probably handle the 8086 emulation by just allocating pages dynamically when they get touched by the code running in the emulator; that way more than one process can run the emulator, as it already passes around a pointer to a "context" struct (no idea what I was thinkin' there).
Brendan wrote:The application's heap could start just above the end of the application's bss. Normally a kernel doesn't care about the application's heap - it just has a large area where the application can allocate/free pages, and the application implements its own heap (e.g. the code to manage the application's heap might be in a shared library), where the application's heap management code uses the kernel API to allocate/free pages. It's also entirely possible for an application to work with pages directly and not use a heap at all...
That's an excellent point. That certainly simplifies the whole "application area" from the kernel's viewpoint.
Brendan wrote:One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?

Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
I always envisioned a kernel stack per thread. I didn't even realize there were other options; I can see how difficult those could be (at the very least it hurts my head to think about it).
Brendan wrote:For applications, if the stack isn't large enough the application can change the location of its stack. For this reason I'd probably give applications a tiny stack (e.g. 4 KiB, or just enough so it can allocate a larger stack if it needs to).

My intent was to only give it a page, and let the page fault handler handle growth. If the application decides to put the stack elsewhere, I doubt that would affect my kernel's functioning at all (the page fault handler would only grow the stack within the defined area). An application would then still be allowed to request page allocations in that space (whether or not it moved its stack; that's its own problem), just like in the rest of "application space". Perhaps I should include a syscall to specifically let an application request an additional amount of "standard" stack space?
Brendan wrote:The kernel stack can grow - the main problem is when the CPU gets a page fault while trying to start an exception handler because the stack needs to be larger. If you rely on the page fault handler to increase the size of a kernel stack then you'd need to use a task gate for the page fault handler, or use a task gate for the double fault handler (where the double fault exception handler will increase the size of the kernel stack if the page fault handler couldn't be started by the CPU). Of course there's also the "no free RAM left" problem to deal with...

For kernel stack size, I've used 2 KiB kernel stacks before without any real problem, but that was a micro-kernel written in assembly. If you're using a high level language, or if your code uses recursion, or if it's a monolithic kernel, or if it supports nested IRQs, then you'll probably need larger kernel stacks. For example, as far as I know, Linux uses 8 KiB kernel stacks.
How can the kernel stack grow without the page fault handler? I don't anticipate my kernel recursing much (if ever); the only thing I anticipate at this point is an extreme amount of linked list / queue iteration. It's going to be kind of a hybrid, I think, as things like ATA/ATAPI, floppy, and a few file systems will probably be included monolithically, but anything else would have to run as a process (whether a request is fulfilled directly or via an IPC message to a FS process should be transparent to the application that opened and read the file). To cut a ramble short, I'm mostly comfortable with 16 KiB kernel stacks (my initial inspiration to write this OS was Win98 running on a K6 with 32MB of RAM; the second you open IE it's out of RAM, without any other processes running except Explorer, IIRC), but I might reconsider 8 KiB later.
Brendan wrote:If you do use one kernel stack per thread, then you want to use small kernel stacks, and you also want to make sure your kernel code is careful about stack usage (e.g. no recursive code); but if the kernel stacks are small you could allocate RAM for them so that you don't need to worry about them growing. Also, it's possible to have one kernel stack per thread plus an additional stack per IRQ, and when the IRQ handler starts you switch to the IRQ handler's own stack at the start of the IRQ handler and switch back to the previous stack at the end of the IRQ handler (this allows you to support nested IRQs without needing kernel stacks that are all large enough to handle many IRQ handlers at the same time).
So have about ~1MB of virtual address space set aside, with one page for each IRQ's stack (except for the page fault, double fault, and IRQ0 timer handlers), and each IRQ can switch to its own stack. It would have to copy the interrupt-saved information from the old stack to the new stack, and then it wouldn't need to switch back. I'm confused, though: that ~1MB ends up being per thread, unless it only exists until needed, but then how would I know when to unmap it?

Thank you.

Re: Where to put the user stack

Posted: Thu Sep 10, 2009 3:23 am
by Brendan
Hi,
Hangin10 wrote:
Brendan wrote:One kernel stack that's used by everything would mean that only one thread can be in the kernel at a time, which has serious performance implications. For example, you might have 16 CPUs where one of the CPUs is running kernel code, and a second CPU might get an IRQ. In this case, does the second CPU need to wait until the first CPU leaves the kernel before it can use the kernel stack?

Maybe you need one kernel stack per CPU, or one kernel stack per process, or one kernel stack per thread? One kernel stack per thread is much easier to do, especially if your kernel can be pre-empted.
I always envisioned a kernel stack per thread. I didn't even realize there were other options; I can see how difficult those could be (at the very least it hurts my head to think about it).
In this case (one kernel stack per thread), then you probably need to reserve space for "max number of threads * kernel stack size" bytes in the address space. However, I probably should point out that in some situations you can reserve space for only one kernel stack, and map different physical pages into this same area during thread switches (e.g. lots of kernel stacks all mapped to the same linear address).
Hangin10 wrote:
Brendan wrote:For applications, if the stack isn't large enough the application can change the location of its stack. For this reason I'd probably give applications a tiny stack (e.g. 4 KiB, or just enough so it can allocate a larger stack if it needs to).
My intent was to only give it a page, and let the page fault handler handle growth. If the application suddenly decides to
put the stack elsewhere, I doubt that would affect my kernel's functioning at all (the pf handler would only grow in the defined area). Then, if an application puts the stack elsewhere (or even if it didn't; that's its own problem) always be allowed to request a page allocation in that space, like the rest of "application space". Perhaps I should include a syscall to specifically allow an application to request an amount additional "standard" stack space?
Your kernel's linear memory manager could allow processes to convert any arbitrary area in user space to "allocate on demand" (where the area looks like it contains zeros, and writes to pages cause new pages to be allocated). For example, you could have one page in user space for the default 4 KiB stack (e.g. from 0x00001000 to 0x00001FFF) that's used when a process is first started; then the process can tell the kernel to make the area from 0x12345000 to 0x23456000 "allocate on demand" and then do "mov esp,0x23456780" to use this area for a larger stack. Also don't forget that when a process spawns a new thread, the new thread will need a stack of its own (and when a process spawns a new thread it could tell the kernel to use <some_address> for the new thread's stack).
Hangin10 wrote:
Brendan wrote:The kernel stack can grow - the main problem is when the CPU gets a page fault while trying to start an exception handler because the stack needs to be larger. If you rely on the page fault handler to increase the size of a kernel stack then you'd need to use a task gate for the page fault handler, or use a task gate for the double fault handler (where the double fault exception handler will increase the size of the kernel stack if the page fault handler couldn't be started by the CPU). Of course there's also the "no free RAM left" problem to deal with...

For kernel stack size, I've used 2 KiB kernel stacks before without any real problem, but that was a micro-kernel written in assembly. If you're using a high level language, or if your code uses recursion, or if it's a monolithic kernel, or if it supports nested IRQs, then you'll probably need larger kernel stacks. For example, as far as I know, Linux uses 8 KiB kernel stacks.
How can the kernel stack grow without the page fault handler?
One way would be to use a segment register, so that if the stack reaches its current limit you get a general protection fault. Another method is to use explicit tests - for example, before calling any function you could check if there's enough space free on the current stack (and increase the size of the stack before calling the function if there isn't enough space). Of course explicit tests might only be necessary in some cases (e.g. before calling a specific kernel function that is known to use lots of stack space, or in recursive code).
Hangin10 wrote:To cut a ramble short, I'm mostly comfortable with 16 KiB kernel stacks (my initial inspiration to write this OS was Win98 running on a K6 with 32MB of RAM; the second you open IE it's out of RAM, without any other processes running except Explorer, IIRC), but I might reconsider 8 KiB later.
Doing something like "#define KERNEL_STACK_PAGES 4" is a good idea...
Hangin10 wrote:
Brendan wrote:If you do use one kernel stack per thread, then you want to use small kernel stacks, and you also want to make sure your kernel code is careful about stack usage (e.g. no recursive code); but if the kernel stacks are small you could allocate RAM for them so that you don't need to worry about them growing. Also, it's possible to have one kernel stack per thread plus an additional stack per IRQ, and when the IRQ handler starts you switch to the IRQ handler's own stack at the start of the IRQ handler and switch back to the previous stack at the end of the IRQ handler (this allows you to support nested IRQs without needing kernel stacks that are all large enough to handle many IRQ handlers at the same time).
So have about ~1MB of virtual address space set aside, with one page for each IRQ's stack (except for the page fault, double fault, and IRQ0 timer handlers), and each IRQ can switch to its own stack. It would have to copy the interrupt-saved information from the old stack to the new stack, and then it wouldn't need to switch back. I'm confused, though: that ~1MB ends up being per thread, unless it only exists until needed, but then how would I know when to unmap it?

There are only 256 entries in the IDT, and some of those are reserved for other purposes (e.g. exceptions, kernel API, IPIs). This means that in the absolute worst case you're limited to a maximum of about 224 IRQs. So, how much stack space does an IRQ handler need? IRQ handlers should be small and fast, so 512 bytes of stack space per IRQ is probably enough.

If you're using extra "per IRQ" stacks, then maybe each thread only needs 2 KiB of kernel stack and each IRQ handler only needs 512 bytes of stack. In this case if the kernel supports a maximum of 16384 threads, then you'd need up to 32768 KiB for kernel stacks plus up to 112 KiB for IRQ stacks (224 IRQs * 512 bytes), or up to about 32.11 MiB of total stack space. In the same situation but without using "per IRQ" stacks, you'd need up to 114 KiB per thread (2 KiB plus 224 * 512 bytes), or about 1.78 GiB of total stack space.

If you're using "per thread" kernel stack with additional "per IRQ" stacks, then there's no need to copy anything between stacks. For an example, your IRQ handlers might look something like this:

Code:

IRQhandler:                     ;Interrupts disabled by CPU on entry
    push ebp                          ;Save EBP
    mov ebp,esp                       ;EBP = interrupted code's ESP
    mov esp,IRQ0_stacktop             ;ESP = IRQ handler's private stack
    push ebp                          ;Interrupted code's ESP pushed onto IRQ handler's private stack
    sti                               ;Enable IRQs (allow nesting)

    /* normal IRQ handling stuff here */

    cli                               ;Disable IRQs (after the EOI the same IRQ could re-enter and clobber this private stack)

    /* sendEOI */

    pop esp                           ;ESP = interrupted code's ESP
    pop ebp                           ;Restore EBP
    iretd
Of course this is only a simple example.

For a more advanced example, consider having a stack of IRQ stacks for each CPU:

Code:

IRQhandler:                                        ;Interrupts disabled by CPU on entry
    push ebp                                             ;Save EBP
    push eax                                             ;Save EAX
    mov ebp,esp                                          ;EBP = ESP for the interrupted stack (after the pushes above)
    mov eax,[this_CPU_number]                            ;eax = the current CPU's number
    mov esp,[nextStackTop+eax*4]                         ;esp = the address for this CPU's next stack
    add dword [nextStackTop+eax*4],IRQ_STACK_SIZE        ;Update the "next stack address" in case we're interrupted
    push ebp                                             ;Old ESP pushed onto the new stack so it can be restored later
    sti                                                  ;Enable IRQs (allow nesting)

    /* normal IRQ handling stuff here */

    mov eax,[this_CPU_number]                            ;eax = the current CPU's number
    cli                                                  ;Disable IRQs (prevent nesting)

    /* sendEOI */

    sub dword [nextStackTop+eax*4],IRQ_STACK_SIZE        ;Restore the "next stack address" now that we're finished with the stack
    pop esp                                              ;ESP = the interrupted stack's ESP (after the initial pushes)
    pop eax                                              ;Restore EAX
    pop ebp                                              ;Restore EBP
    iretd
This might seem like extra overhead, and it will cost you more space (e.g. with 16 CPUs you'll need 16 times as much space for IRQ stacks). However, it does take advantage of cache locality (helps to avoid cache misses and TLB misses). Basically if you get several different IRQs one after the other (which is likely enough if you're using IRQ priorities), then they'll all use the same IRQ stack (which is going to already be in the cache). There are also advantages for NUMA systems (make sure the RAM used by the CPU is close/fast to that CPU).

Of course this could be improved more - e.g. reduce the number of IRQ stacks without affecting cache locality too much. For example, you might want one set of IRQ stacks per core (e.g. for hyper-threading where logical CPUs share L2 caches), or one set of IRQ stacks per chip (e.g. where multiple cores share the same L2 or L3 caches), or one set of IRQ stacks per NUMA domain (to minimize the number of IRQ stacks while still getting the "RAM used is close/fast" advantage).


Cheers,

Brendan