kmalloc in a ukernel: overkill?
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
kmalloc in a ukernel: overkill?
I'm designing some of the innards of memory management in my ukernel and I have a question about the possible uses for kmalloc/kfree (i.e. -- an in-kernel heap implementation).
In this ukernel, the types of structures that need to be allocated are determined at design-time, i.e. -- there are no such things as kernel-loadable modules or kernel-mode drivers. The ukernel is completely "sealed" in this sense.
Therefore, it ought to be possible (and some people have done this) to pre-allocate areas of the kernel's virtual address space for particular kinds of structures (e.g. -- an address range for thread control blocks, one for kernel stacks, etc.) and commit physical pages to these regions on-demand. The upside I see to this approach is that it's simpler, but the downside is that it imposes certain limits on the number of resources of each type that can be allocated (limits which may be difficult to choose properly at design time). It is also wasteful of kernel address space IMO.
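To make that concrete, here is a rough sketch of what the pre-allocated-regions approach could look like (the region bases/sizes and the alloc_phys_page/map_page helpers are made-up placeholders for illustration, not from any real kernel):

```c
/* Hypothetical kernel virtual address map, fixed at design time. */
#define TCB_AREA_BASE     0xF0000000u   /* thread control blocks      */
#define TCB_AREA_SIZE     0x04000000u   /* 64 MB -> 16384 4 KB TCBs   */
#define KSTACK_AREA_BASE  0xF4000000u   /* kernel stacks              */
#define KSTACK_AREA_SIZE  0x04000000u

/* Provided elsewhere by the physical/linear memory managers. */
extern unsigned long alloc_phys_page(void);
extern void map_page(unsigned long virt, unsigned long phys);

/* A TCB's address is pure arithmetic -- no allocator involved. */
static inline void *tcb_from_id(unsigned int thread_id)
{
    return (void *)(TCB_AREA_BASE + ((unsigned long)thread_id << 12));
}

/* Commit a physical page the first time a slot in a region is touched
 * (this could equally be done lazily from the page fault handler). */
static void commit_page(unsigned long virt)
{
    map_page(virt & ~0xFFFul, alloc_phys_page());
}
```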
On the other hand, an in-kernel heap is more complex and probably less time-efficient (if more space-efficient).
Any thoughts? Do those of you with ukernels bother implementing an in-kernel heap allocator?
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:kmalloc in a ukernel: overkill?
Hi,
In my experience (and I've always used the "pre-determined areas" approach), the benefits more than make up for the disadvantages (at least for micro-kernels - I wouldn't attempt a monolithic kernel without kmalloc).
Consider managing "thread control blocks". When I'm spawning a thread I find a free "thread ID", and then find the area for the thread control block with "mov reg, threadID; shl reg, 12; add reg, TCBareaStart". There is no way "kmalloc()" can be as fast.
Then there's finding the data again next time you need it. For pre-determined areas you know where it is and you can just do the "shift add" again to find the data. If you use kmalloc instead then it could be anywhere - for those thread control blocks you'd probably need an additional array of pointers so that you could do "mov reg, threadID; lea reg, [reg * 4 + address]" to find the address of the structure.
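In C, the two lookups might look roughly like this (the names and the 4 KB-per-TCB layout are assumptions for illustration):

```c
#define TCB_AREA_START 0xF0000000u   /* assumed base of the TCB region */

struct tcb;                          /* contents don't matter here */

/* Pre-determined area: the address is computed, no memory touched. */
static inline struct tcb *tcb_lookup_fixed(unsigned int thread_id)
{
    return (struct tcb *)(TCB_AREA_START + ((unsigned long)thread_id << 12));
}

/* kmalloc'd TCBs: an extra table of pointers (one memory access,
 * and a potential cache miss) is needed to find the structure. */
extern struct tcb *tcb_table[65536];

static inline struct tcb *tcb_lookup_heap(unsigned int thread_id)
{
    return tcb_table[thread_id];
}
```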
For other types of structures the "pre-determined areas" approach also gives the kernel more flexibility. For example, my thread control block pages use allocation on demand but are never freed, so that when the same thread ID is re-used the page is already present (i.e. the amount of memory used for thread control blocks grows to the maximum amount needed, and after that the memory management for them costs nothing). The "CPU control blocks" are allocated during boot only (any access to this area after boot causes a page fault). The linear memory manager also has a "special" system for optimizing pages used for message queue data, where pages aren't freed unless the OS starts running out of memory (so if there's plenty of memory it becomes "zero cost" like the thread control blocks, but if memory is running low they are returned to the system).
For the "kmalloc" approach you can't choose to treat things differently, and can't "tune" the memory management to suit each specific purpose - everything is just a chunk on the heap.
For the disadvantages, you're right - it does cost more space and can limit resources. In practice this isn't as bad as it seems. For example, I choose to use 1 GB of kernel space because of the way PAE works (1 GB corresponds to one page directory). For my last OS this was enough space to get 65536 threads belonging to 32768 processes (using 8 KB per thread and 4 KB per process) with over half of the space left over for message queue data. It's extremely likely that the computer will run out of RAM well before these limits are reached.
For the other things mapped into kernel space the limits are mostly hardware limits - 65536 I/O ports, up to 256 CPUs, the BIOS 8 * 8 and 8 * 16 font data - plus a couple of design limits, like NUMA domain data (256 entries) and the maximum amount of video display memory the kernel can access (43000 KB).
Some of my older kernels used similar limits with leaner data structures and no PAE, with only 512 MB of kernel space, leaving 2.5 GB for processes. Each design is different, including yours.
Having fixed limits can also be a bonus. For example, I can guarantee that the system will handle 65536 threads in all circumstances (as long as the computer has enough RAM), while if the kernel used kmalloc instead I couldn't guarantee anything (the kernel's address space may have been used up by other stuff).
Just for fun (research), work out what limits you'd want and how you would set up the pre-allocated areas for each. It'll take a little insight into how you want to implement things, but it'll be worth it (even if you decide to use kmalloc anyway).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:kmalloc in a ukernel: overkill?
Well, if you're going for a 'table' mode, you're probably better off making sure that you enforce locality of the allocated structures.
E.g. giving each thread 4 KB (a whole page) is IMHO a bad idea, especially if - out of those 4 KB - 3 KB could be 'room for messages in waiting'. And if you have one page allocated for threads, you're better off filling it first before you allocate another one (IMHO).
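A simple way to get that locality (purely illustrative; the 256-byte record size and the bitmap scheme are assumptions, not Clicker's or anyone's actual layout):

```c
#include <stdint.h>

#define PAGE_SIZE     4096
#define TCB_SIZE      256                     /* assumed per-thread record */
#define TCBS_PER_PAGE (PAGE_SIZE / TCB_SIZE)  /* 16 TCBs share one page    */

/* One bitmap word per page of TCBs, kept outside the page itself. */
struct tcb_page {
    void     *page;        /* 4 KB page holding TCBS_PER_PAGE records */
    uint16_t  used_mask;   /* one bit per slot                        */
};

/* Fill the current page completely before allocating another one. */
static void *tcb_alloc_in_page(struct tcb_page *pg)
{
    for (int i = 0; i < TCBS_PER_PAGE; i++) {
        if (!(pg->used_mask & (1u << i))) {
            pg->used_mask |= (uint16_t)(1u << i);
            return (uint8_t *)pg->page + i * TCB_SIZE;
        }
    }
    return 0;   /* page full -- only now grab a fresh page */
}
```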
Re:kmalloc in a ukernel: overkill?
Hi,
Pype.Clicker wrote: Well, if you're going for a 'table' mode, you're probably better off making sure that you enforce locality of the allocated structures. E.g. giving each thread 4 KB (a whole page) is IMHO a bad idea, especially if - out of those 4 KB - 3 KB could be 'room for messages in waiting'. And if you have one page allocated for threads, you're better off filling it first before you allocate another one (IMHO).
For my last version, each thread had a 4 KB page containing information about that thread, where about 2.5 KB was used for the thread's kernel stack. Each process also had a 4 KB page where about 3 KB was used for a list of (16 bit) thread IDs - no space wasted in either (and a limit of around 1536 threads per process).
I also remember doing a version with 2 KB per thread instead, with about 1 KB of kernel stack. I could probably still do this but I'd need to be a bit more careful than I've become (if the kernel uses too much stack it happily trashes the thread's information).
Each "message queue page" was filled before another page was needed. Everything else naturally filled N pages or was arranged to share a page - e.g. the I/O port owner table consumed 256 KB while the GDT and IDT shared a page. It's not too hard to arrange with a little planning...
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Re:kmalloc in a ukernel: overkill?
Brendan wrote: Some of my older kernels used similar limits with leaner data structures and no PAE, with only 512 MB of kernel space, leaving 2.5 GB for processes. Each design is different, including yours.
You mean 3.5 GB for processes, right? That's what I'm aiming for.
Brendan wrote: Just for fun (research), work out what limits you'd want and how you would set up the pre-allocated areas for each. It'll take a little insight into how you want to implement things, but it'll be worth it (even if you decide to use kmalloc anyway).
Heh... that's the problem. It's hard for me to foresee the details of everything I will need at this early stage. I have a pretty good idea of how I want things to be at a high level, but I want to explore the details through code rather than through research, so that I have something concrete to tell me whether I'm on the right track. This is not normally how I develop software -- it's just an easier way for me to learn about all these details.
Food for thought, anyway.
Are there any ukernel devs here who did implement a kernel heap?
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:kmalloc in a ukernel: overkill?
Well, I *do* have a heap, but I'm not sure many people here would still call Clicker a 'microkernel' ... especially knowing that it has modules and in-kernel driver management (which may all change after we have better in-kernel IPC, actually).
Re:kmalloc in a ukernel: overkill?
Maybe you should have a look at a slab allocator. I'm using one and I think that it is a compromise between a kmalloc and the way Brendan described it! One more benefit is that you have one allocator for all same-sized structs, so you don't have to write different functions for every different struct.
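For reference, the core idea of a slab cache is small enough to sketch in a few lines (a toy illustration based on Bonwick's paper, not anyone's actual code here; page_alloc() is an assumed lower-level page allocator):

```c
#include <stddef.h>

#define PAGE_SIZE 4096

extern void *page_alloc(void);            /* assumed: returns one mapped page */

/* One cache per object size; free objects are threaded through themselves. */
struct slab_cache {
    size_t obj_size;                      /* rounded up to >= sizeof(void *)  */
    void  *free_list;
};

static void slab_grow(struct slab_cache *c)
{
    char  *page = page_alloc();
    size_t n = PAGE_SIZE / c->obj_size;

    for (size_t i = 0; i < n; i++) {      /* carve the page into objects */
        void *obj = page + i * c->obj_size;
        *(void **)obj = c->free_list;     /* push onto the free list     */
        c->free_list = obj;
    }
}

static void *slab_alloc(struct slab_cache *c)
{
    if (!c->free_list)
        slab_grow(c);
    void *obj = c->free_list;
    c->free_list = *(void **)obj;         /* pop */
    return obj;
}

static void slab_free(struct slab_cache *c, void *obj)
{
    *(void **)obj = c->free_list;         /* push back */
    c->free_list = obj;
}
```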
Re:kmalloc in a ukernel: overkill?
Brendan wrote: Consider managing "thread control blocks". When I'm spawning a thread I find a free "thread ID", and then find the area for the thread control block with "mov reg, threadID; shl reg, 12; add reg, TCBareaStart". There is no way "kmalloc()" can be as fast.
How do you determine the next available threadId?
--Jeff
Re:kmalloc in a ukernel: overkill?
Hi,
carbonBased wrote: How do you determine the next available threadId?
Because the "thread control blocks" are 4 KB and there could be up to 65536 of them, I don't use them for deciding which thread to run because it'd make a mess of CPU caches.
Instead there's a "thread state table", which contains a 4 byte entry for each thread (a 16 bit thread state and a 16 bit "next thread ID on a linked list"). When a thread becomes ready to run it's placed on one of the scheduler's "ready to run" queues and the thread state table is updated, so that the "next thread ID" refers to the next thread on the queue.
The scheduler does almost everything using the thread state table entries to minimize cache misses. The information in the "thread control block" is only ever accessed when the thread is running or during a thread switch, where it's more likely to be in the cache anyway.
There are exceptions to this, but not where performance matters. For example, you can find a thread's details (thread name, amount of time it has used, parent process ID, etc) from its thread control block at any time, but this sort of thing is rare.
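For illustration, the thread state table described above might be declared roughly like this (the field and constant names are mine, not Brendan's):

```c
#include <stdint.h>

#define MAX_THREADS 65536
#define NO_THREAD   0xFFFF          /* end-of-queue marker */

/* 4 bytes per thread: walking the ready queues touches far less cache
 * than walking 4 KB thread control blocks would. */
struct thread_state_entry {
    uint16_t state;                 /* free, ready, running, blocked, ... */
    uint16_t next;                  /* next thread ID on the same queue   */
};

static struct thread_state_entry thread_state[MAX_THREADS];

/* Push a thread onto the head of a ready-to-run queue. */
static void enqueue_ready(uint16_t *queue_head, uint16_t tid, uint16_t ready_state)
{
    thread_state[tid].state = ready_state;
    thread_state[tid].next  = *queue_head;
    *queue_head = tid;
}
```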
Cheers,
Brendan
Because the "thread control blocks" are 4 KB and there could be up to 65536 of them, I don't use them for deciding which thread to run because it'd make a mess of CPU caches.carbonBased wrote:How do you determine the next available threadId?Brendan wrote: Consider managing "thread control blocks". When I'm spawning a thread I find a free "thread ID", and then find the area for the thread control block with "mov reg, threadID; shl reg, 12; add reg, TCBareaStart". There is no way "kmalloc()" can be as fast.
Instead there's a "thread state table", which contains a 4 byte entry for each thread (a 16 bit thread state and a 16 bit "next thread ID on a linked list"). When a thread becomes ready to run it's placed on one of the scheduler's "ready to run" queues and the thread state table is updated, so that the "next thread ID" refers to the next thread on the queue.
The scheduler does almost everything using the thread state table entries to minimize cache misses. The information in the "thread control block" is only ever accessed when the thread is running or during a thread switch, where it's more likely to be in the cache anyway.
There are exceptions to this, but not where performance matters. For example, you can find a thread's details (thread name, amount of time it has used, parent process ID, etc) from its thread control block at any time, but this sort of thing is rare.
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Re:kmalloc in a ukernel: overkill?
FlashBurn wrote: Maybe you should have a look at a slab allocator. I'm using one and I think that it is a compromise between a kmalloc and the way Brendan described it! One more benefit is that you have one allocator for all same-sized structs, so you don't have to write different functions for every different struct.
I found a paper on Sun's slab allocator -- that looks really cool! Space efficient, time efficient, cache efficient, very debuggable, quick reclamation, etc. I'm not sure the object caching capabilities are really worthwhile for a ukernel, but the rest of it seems like a really good idea.
Two things about Brendan's approach still beat the slab allocator though. The slab allocator doesn't prevent requests for one type of resource from starving the address space and physical RAM required to allocate other types of resources (whereas Brendan's scheme at least prevents pilfering of address space, if not RAM). Also, Brendan's scheme allows for the quick lookup of kernel resources based on computation instead of memory lookup, which is more cache-efficient (this reminds me of how thread IDs are manipulated on L4). With a general-purpose allocator, you'd need some kind of "handle table" that adds a layer of indirection between the userland-visible resource ID and the actual resource.
One question I have that is somewhat unrelated -- in the case of the slab allocator, who is responsible for deciding where in the kernel's portion of the address space to put allocated slabs? Does the VMM decide, or does the slab allocator decide, or is there some intermediate "kernel address space allocator"?
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:kmalloc in a ukernel: overkill?
Colonel Kernel wrote: One question I have that is somewhat unrelated -- in the case of the slab allocator, who is responsible for deciding where in the kernel's portion of the address space to put allocated slabs? Does the VMM decide, or does the slab allocator decide, or is there some intermediate "kernel address space allocator"?
I guess there is still a need for a separate virtual memory allocator below the slab allocator - unlike a 'real' heap allocator, this one can work with page granularity instead of byte granularity, however. In a monolithic kernel you'd need this anyway (for allocating virtual pages which are not yet mapped somewhere); I don't know about your uKernel though...
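A minimal sketch of what such a page-granularity allocator could look like (just a bump allocator over a fixed kernel range; the constants and the alloc_phys_page/map_page helpers are assumptions):

```c
/* Kernel virtual range reserved for slabs (example numbers only). */
#define SLAB_AREA_BASE 0xE8000000u
#define SLAB_AREA_END  0xF0000000u
#define PAGE_SIZE      4096u

extern unsigned long alloc_phys_page(void);
extern void map_page(unsigned long virt, unsigned long phys);

static unsigned long slab_area_next = SLAB_AREA_BASE;

/* Hand out 'count' fresh, mapped kernel pages (never freed in this toy). */
static void *kernel_pages_alloc(unsigned int count)
{
    if (slab_area_next + count * PAGE_SIZE > SLAB_AREA_END)
        return 0;                             /* out of kernel address space */

    unsigned long base = slab_area_next;
    for (unsigned int i = 0; i < count; i++) {
        map_page(slab_area_next, alloc_phys_page());
        slab_area_next += PAGE_SIZE;
    }
    return (void *)base;
}
```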
cheers Joe
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Re:kmalloc in a ukernel: overkill?
JoeKayzA wrote: In a monolithic kernel you'd need this anyway (for allocating virtual pages which are not yet mapped somewhere); I don't know about your uKernel though...
Argh... This probably means I have to design exactly how user-mode virtual memory management will work before my question can be answered. :-\ I was thinking specifically about where responsibility lies for managing the kernel's own address space (0xE0000000-0xFFFFFFFF in my case).
My intention was to implement something very similar to L4's hierarchical address spaces, which implies a "mapping database" that could be composed of nodes allocated from the slab allocator. The kernel should be unaware of how each userland process is using its address space, since this is policy determined by user-mode pagers. Unless I'm missing something (I won't know until I start implementing it, hence my chicken-and-egg problem here...).
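Purely as a sketch of the kind of object such a mapping database might allocate from a slab cache (the fields are guesses at what an L4-style mapping node needs, not L4's actual layout):

```c
#include <stdint.h>

/* One node per mapping; together they form the tree that records how a
 * page was mapped from a pager's address space into its children. */
struct map_node {
    uint32_t         space_id;   /* address space holding this mapping  */
    uint32_t         virt_page;  /* page number within that space       */
    struct map_node *parent;     /* mapping this one was derived from   */
    struct map_node *child;      /* first mapping derived from this one */
    struct map_node *sibling;    /* next mapping with the same parent   */
};
```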
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
- Pype.Clicker
- Member
- Posts: 5964
- Joined: Wed Oct 18, 2006 2:31 am
- Location: In a galaxy, far, far away
- Contact:
Re:kmalloc in a ukernel: overkill?
Colonel Kernel wrote: I found a paper on Sun's slab allocator -- that looks really cool! Space efficient, time efficient, cache efficient, very debuggable, quick reclamation, etc. I'm not sure the object caching capabilities are really worthwhile for a ukernel, but the rest of it seems like a really good idea.
I suppose you're referring to "The Slab Allocator: An Object-Caching Kernel Memory Allocator", by Jeff Bonwick, Sun Microsystems.
I have an object cache in Clicker (actually, it preallocates blocks and splits them into 'cells' ready for use by objects), but I shall admit I never thought of 'reusing state' between objects, though the paper remains vague on how different types of objects are handled if they have the same size.
- beyond infinity
- Member
- Posts: 1600
- Joined: Wed Oct 18, 2006 11:59 am
- Location: Vienna/Austria
- Contact:
Re:kmalloc in a ukernel: overkill?
I've often considered replacing my current kmalloc/kfree suite with something faster and more reliable - for example, a slab allocator. A cool thing: you tell it to create a slab cache for each object type you ever intend to create and voila ... it allocates a portion of memory, initializes buffers the size of your object, and there you go ... it fetches more memory when a slab runs out and gives memory back if the VMM needs some. I like this approach; I've just wondered how to implement this kind of thing.
I use something similar to what Brendan is doing for the page management: I keep the stack in a realm of preallocated memory and the pages are linked together, already initialized, and if I free some, they are just dumped on that stack - it's just a push to dump and a pop to get one.
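That push/pop scheme is compact enough to show (a toy array-based variant and made-up names, not exactly the linked layout described above):

```c
#include <stdint.h>

#define MAX_FREE_PAGES (1u << 20)     /* enough entries for 4 GB of 4 KB pages */

/* Stack of physical page frame addresses, living in preallocated memory. */
static uint32_t free_page_stack[MAX_FREE_PAGES];
static uint32_t free_page_top;        /* number of entries on the stack */

/* Freeing a page is just a push... */
static void phys_page_free(uint32_t phys_addr)
{
    free_page_stack[free_page_top++] = phys_addr;
}

/* ...and allocating one is just a pop. */
static uint32_t phys_page_alloc(void)
{
    if (free_page_top == 0)
        return 0;                     /* out of physical memory */
    return free_page_stack[--free_page_top];
}
```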
I'll come back as soon as I've read that article and revise this post.
Stay Safe.
@colonel: You'd need an allocator for kernel objects anyway, so just go on and design one. You can use it to allocate address space management stuff too; that's the best thing.
... the osdever formerly known as beyond infinity ...
BlueillusionOS iso image
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Re:kmalloc in a ukernel: overkill?
beyond infinity wrote: @colonel: You'd need an allocator for kernel objects anyway, so just go on and design one. You can use it to allocate address space management stuff too; that's the best thing.
It's a little more complicated than that. I'm really gathering requirements for my kernel's VMM (I guess what Brendan calls the linear memory manager). I haven't even implemented my physical memory manager yet, although it's all designed.
I got stuck in the most unlikely of places. I'm designing the initialization sequence for the physical memory manager, which has turned out to be pretty painful. My kernel PMM is going to use a free page stack, which needs some memory allocated and mapped in. It determines what memory it can allocate for the free page stack by running through initial lists of free and used physical address space regions.
These two lists come from the MultibootInfo structures. The free list comes from the "RAM regions" in the memory map or the "lower" and "upper" memory fields, and the used list comes from the "reserved" regions in the memory map, the module list, and a few extras. In order to avoid overwriting the MultibootInfo structure (which I plan to keep around), one of the extras in the "used" list is the structure itself.
However, at this point during initialization, the Multiboot structure has already been mapped into the higher half and its pointers tweaked. So, I was looking to design the linear memory manager so I could obtain the physical addresses from the mapped MultibootInfo's virtual addresses (which I figured was better than just "un-tweaking" the pointers and spreading more magic numbers like 0xE0000000 around my code). Probably not the easiest way to do things.
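For what it's worth, one way to keep that translation in a single place instead of scattering 0xE0000000 around (a sketch; it assumes the higher-half mapping really is a simple linear offset from KERNEL_VIRT_BASE):

```c
#include <stdint.h>

/* Where the kernel (and the relocated MultibootInfo) is mapped. */
#define KERNEL_VIRT_BASE 0xE0000000u

/* The magic offset lives in exactly one place. */
static inline uint32_t virt_to_phys(const void *virt)
{
    return (uint32_t)(uintptr_t)virt - KERNEL_VIRT_BASE;
}

static inline void *phys_to_virt(uint32_t phys)
{
    return (void *)(uintptr_t)(phys + KERNEL_VIRT_BASE);
}
```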
<mini-rant>
If the guys who wrote the Multiboot spec had made promises like "the structures will be entirely contiguous in memory" and "the structures will always fit in lower memory", this would have been a lot easier. >:(
</mini-rant>
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager