stack vs bitmap
I know this has been asked many times before, and I've read many forum posts/guides on it, but I wanted to gather my thoughts and ask anyway.
From what I read, it seems that for physical memory managers a bitmap would be a better choice than a stack, because of the "looseness" of allocation with stacks (meaning whichever pages happen to get freed also happen to get allocated next) - which is bad for contiguous allocations - and because you need multiple stacks for different address ranges (0-16MB, 16MB-??) for ISA DMA and some PCI "stuff" (I didn't find many posts bringing up the PCI bits). Also, it seems you could make bitmap allocations as fast as stack allocations by keeping a pointer to the last allocated bit and then allocating from there.
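Something like this next-fit search is what I have in mind - just a rough sketch with made-up names, untested:
Code: Select all
/* sketch of a next-fit bitmap allocator: resume the search from the last
   allocation instead of always starting at bit 0 (all names made up) */
typedef unsigned int u32;
#define PMAP_WORDS 32768u                /* 1 bit per 4 KB page, 4 GB worth */

static u32 pmap[PMAP_WORDS];             /* 1 = used, 0 = free */
static u32 last_bit = 0;                 /* where the previous allocation landed */

u32 pmap_alloc_page(void)                /* returns physical address, 0 = none */
{
    u32 total = PMAP_WORDS * 32;
    for (u32 i = 0; i < total; i++) {
        u32 bit = (last_bit + i) % total;            /* wrap around once */
        if (!(pmap[bit / 32] & (1u << (bit % 32)))) {
            pmap[bit / 32] |= 1u << (bit % 32);      /* mark used */
            last_bit = bit;
            return bit * 0x1000;
        }
    }
    return 0;                            /* out of memory */
}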
thinking that the goal of my OS is just to have ring3 multitasking and the ability to write drivers without having to learn another OS's API for it... does it make sense for me to fight with stacks? I know they can be useful if you are going to mess with full NUMA and zones - things I studied in Linux previously - but it seems like a lot of overhead for a very basic OS.
Also, I read posts where people knocked bitmaps and the method of keeping a pointer to the last allocated page, because it causes a lot of fragmentation, and I was thinking of ideas to reduce the fragmentation:
1) instead of keeping a pointer to the last allocated page, keep one say 5-10 bits back, so that even if those pages are all still allocated, "walking" past them will not be expensive. If 5-10 is successful, maybe larger amounts could help more.
2) (very expensive - made up as I was typing this) make an idle task, running in ring0, that when the system is very inactive walks the bitmap array from both ends; when it notices a gap in the bitmap, it could swap the page at the highest used bitmap index into the gap. I realize this would be very expensive (a TLB flush for the page swapped, maybe other stuff?). The task would also have to watch for DMA ranges or other areas that need contiguous memory, but assuming it dealt only with virtual address ranges it could help to clean up the bitmap a lot.
I realize this was more of a ramble than a question, but what I was hoping to get out of it was:
1) does a bitmap make more sense, and were my reasons on target?
2) did either of my proposed cleanups for the bitmap make any sense, and/or are there other ways people deal with the fragmentation?
thanks
Hi,
blound wrote: 1) does a bitmap make more sense and were my reasons on target?
Bitmaps are easier to implement AFAIK, and as far as I can see stacks don't give a massive performance boost. I use bitmaps myself, but if you build your kernel correctly from self-contained modules (avoiding spaghetti code), you should be able to plug a new physical memory allocator in at a later stage without cost. I would stick with the easiest solution until you are up and running.
blound wrote: 2) did either of my proposed cleanups for the bitmap make any sense and/or are there other ways to deal with the fragmentation?
As far as I can tell, physical memory fragmentation really isn't a problem outside of DMA-capable areas. Your second suggestion really did seem like overkill, and you would have to ensure massive amounts of locking when swapping those page allocations around - and what about on multiprocessor systems? What if another processor is currently running the process you want to "optimise"?
Hello,
I am currently writing a kernel and a second stage boot loader (it actually does some more stuff between the end of GRUB and the start of my Long Mode OS).
For the boot loader, a bitmap is pertinent. I often want to allocate contiguous memory and have things identity mapped - the bitmap gives me this flexibility and control. For the Kernel, I am definitely going down the stack route. Here, the boot loader has already marked pages used for e.g. Memory Mapped IO as "used". When I create the stack, I therefore know that all the RAM I am adding is definitely free to the system.
If you are intending to use paging, it really doesn't matter that physical memory is fragmented. It makes no difference to system speed. If anything, fragmented physical memory may speed up your physical page frame allocations (if you are using a bitmap-walking algorithm). With the stack method, fragmentation makes no difference whatsoever.
ATM, I have three stacks in total - one sub 1MB, one sub 16MB and one 16MB+. Memory allocation and freeing is a matter of:
Code: Select all
return(page_stack[stackptr--]);
and
page_stack[++stackptr] = ptr;
I may, in fact, get rid of the separate 1-16MB stack and change to a simple two-stack system.
To answer your questions properly:
1) If you want to perform many a) contiguous allocations or b) allocations from a known location, choose a bitmap. If you are using virtual memory and it really doesn't matter where the physical page frames come from, use a stack.
2) Again, if you are using paging, don't worry about physical fragmentation.
Cheers,
Adam
two quick replies!
AJ wrote: Here, the boot loader has already marked pages used for e.g. Memory Mapped IO as "used". When I create the stack, I therefore know that all the RAM I am adding is definitely free to the system.
I understand everything except this part about stacks. I know that you can get all the addresses that are given by GRUB and used by your kernel/original page entries/video memory etc., and that you can also get the amount of available RAM from GRUB.
So then you have the unusable addresses and the total size. Do you then walk from, say, "min available address" to "highest available", skipping unusable areas, and add them (page aligned) to the array?
The bitmap seems easier, because you initialize the bitmap array to 0 and then, as you get the unusable addresses, mark them with 1 - but with the stack it's like you have to collect them all first and then work backwards.
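To make that concrete, here's roughly the init I mean - a sketch with made-up names and a made-up reserved-range list, untested:
Code: Select all
/* sketch: everything starts free (0), then each unusable range gets
   marked used (1) - exactly the init described above, names invented */
typedef unsigned int u32;
#define MAX_PAGES 0x100000u              /* 4 GB / 4 KB */
static u32 pmap[MAX_PAGES / 32];         /* 1 bit per page: 1 = used */

void pmap_mark_used(u32 base, u32 length)
{
    u32 first = base / 0x1000;                    /* round start down */
    u32 last  = (base + length + 0xFFF) / 0x1000; /* round end up */
    for (u32 p = first; p < last && p < MAX_PAGES; p++)
        pmap[p / 32] |= 1u << (p % 32);
}

void pmap_init(void)
{
    for (u32 i = 0; i < MAX_PAGES / 32; i++) pmap[i] = 0;
    /* then, for each reserved range (from GRUB's memory map, the kernel
       image, page tables, video memory, ...): */
    pmap_mark_used(0x00000000, 0x00100000);       /* e.g. everything below 1MB */
    /* pmap_mark_used(kernel_start, kernel_size); ...and so on */
}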
I couldn't find any code implementing the stack method, so I am confused about how the stack is originally filled, though I understand everything after that. So either a walkthrough of filling the original stack, or code, would be very appreciated.
blound wrote: So then you have the unusable addresses and the total size. Do you then walk from, say, "min available address" to "highest available", skipping unusable areas, and add them (page aligned) to the array?
Absolutely right. And yes - the bitmap is easier to initialise, but the stack is capable of providing much faster allocations once set up. As a rough idea:
1. Caracal Boot Loader (CBoot) initialises the memory bitmap from GRUB's memory map.
2. CBoot is a "mini-os" - it can run tasks, has a CLI and provides kernel debugging tools if necessary ([ad]latest version will be released inside the next month![/ad]). It also does VESA mode switches and sets up system timers. All this is done using its own heap manager on top of the RAM bitmap.
3. Just before it launches the kernel (and when all CBoot-specific pages have been released), CBoot converts the bitmap to a stack, using the following mechanism (this is only at proof of concept stage):
Code: Select all
void pmem_create_stack()
{
	/* each u32 in the bitmap covers 32 pages = 0x20000 bytes of RAM */
	for(u32 page=0; page < pmap_size; page++)
	{
		for(u32 offset = 0; offset < 32; offset++)
			if(!(pmap[page]&(1<<offset)))	/* clear bit = free page */
				stack_free((void*)((page * 0x20000) + offset * 0x1000));
	}
}
inline void stack_free(void *pptr)
{
	u32 phys = (u32)pptr - ((u32)pptr % 0x1000);	/* page-align the address */
	if(phys<0x100000)				/* below 1MB: low stack */
		vcb->pmap_low_base[vcb->pmap_low_ptr++] = phys;
	else if (phys <= vcb->pmem_size)		/* otherwise: high stack */
		vcb->pmap_high_base[vcb->pmap_high_ptr++] = phys;
}
4. CBoot launches the kernel, passing it an information structure - included in this are parameters containing physical stack pointers.
That's the complicated bit - but it only happens once at boot time. What is important to me is system speed after boot. The kernel now just has to push and pop to/from the physical memory stack to free and allocate pages, which is fast.
Compare, for example, a bitmap. Each 32-bit integer represents 0x20000 bytes (32 pages) of physical RAM. Suppose you have 4 GB of physical RAM and only the last page is free. If you have a simple crawling allocation algorithm, you have to trawl through 32,768 integers before you realise that there is one free page, and then check which of the 32 offsets is marked as free. This cannot be as fast as the inline function or macro:
Code: Select all
return(pmap[--pmap_ptr]);
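Even a smarter word-at-a-time search still has that worst case - a sketch, reusing the pmap/pmap_size names from above:
Code: Select all
/* sketch: skip fully-used words, then test the 32 bits of the first word
   containing a free page - still a full scan in the worst case */
u32 pmap_find_free(void)                 /* returns page number, -1 = none */
{
    for (u32 word = 0; word < pmap_size; word++) {
        if (pmap[word] == 0xFFFFFFFF) continue;      /* all 32 pages used */
        for (u32 bit = 0; bit < 32; bit++)
            if (!(pmap[word] & (1u << bit)))
                return word * 32 + bit;
    }
    return (u32)-1;
}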
I certainly don't want to tread on the toes of anyone who has chosen to use a bitmap, but that is the reason that my OS goes through that extra bit of boot-time hassle to create a stack.
Cheers,
Adam
Re: stack vs bitmap
Hi,
blound wrote: Also it seems you could make bitmap allocations as fast as stack allocations by keeping a pointer to the last allocated bit and then allocating from there.
You can't make bitmaps as fast as stack operations - you *always* need to do some searching in some cases.
For example, imagine your bitmap looks like this (with a pointer to a free page):
Code: Select all
111111101110111
-------^
You allocate a free page and get this:
Code: Select all
111111111110111
-------^
Now, do you search for the next free page immediately (so the pointer always points to the next free page), and get this:
Code: Select all
111111111110111
-----------^
Or, do you have a pointer to the "next *possibly* free page" to avoid searching, and get this:
Code: Select all
111111111110111
--------^
The second option is the best option, because a page may be freed before you need to search, and the pointer may be reset to a page that is free. For example, if a page is freed (regardless of what your pointer points to) you might end up with:
Code: Select all
110111111110111
--^
Despite this, if no pages are freed you need to search - either before doing some allocations ("pointer to the next possibly free page" method) or after doing some allocations ("pointer to the next free page" method). For free page stacks, you never need to search.
Then there's RAM overhead. For a free page stack you need a pointer to the stack top (and possibly, but not necessarily, a pointer to the stack bottom). Each free page can contain a pointer to the next free page on the stack (which can be ignored), so you really only need several bytes of data to manage many GB of RAM. On the other hand, for bitmaps you need at least one bit per page, so for 4 GB of RAM you'd need at least 128 KB of data.
The next question is contiguous allocations - when do you need to allocate contiguous pages? Assuming you use paging, almost all allocations don't need to be contiguous. The only cases where contiguous pages are needed are buffers used for ISA DMA and bus mastering. However, most PCI cards that can use bus mastering also use "scatter-gather" and don't need contiguous pages.
ISA DMA can't access physical RAM above 0x01000000, so (for ISA DMA) you only need a maximum of 16 MB of RAM that can be allocated as contiguous ISA DMA buffers.
However, each ISA DMA channel is wired to a specific device. If you assume each device driver that uses an ISA DMA channel only uses one buffer per DMA channel (and IMHO there's no sane reason to use multiple buffers for one DMA channel), then you can reduce this maximum much more. The ISA DMA controllers only have 7 usable channels (one is for "cascade"). Four of them are 8-bit channels that can only handle 64 KB transfers and the other 3 are 16-bit channels that can only handle 128 KB transfers. This means you'd really only need a maximum of 640 KB (or "4 * 64 KB + 3 * 128 KB") of RAM that can be allocated as contiguous ISA DMA buffers.
For PCI devices with bus mastering, I really don't know how many devices (if any) don't use "scatter-gather" (disk controllers, modern video cards and USB controllers all use "scatter-gather"). Let's allow a few MB for them, just in case.
Now, taking all this information into account, for a computer with 4 GB of RAM would you:
a) use bitmaps for 100% of RAM, even though it's slower and has more overhead
b) use free page stacks for 100% of RAM, even though they can't be used for contiguous allocations
c) manage some RAM with bitmaps (just enough to satisfy device drivers, plus some spare just in case) and manage the rest of RAM with free page stacks, and get the best of both methods?
blound wrote: thinking that the goal of my OS is just to have ring3 multitasking and the ability to write drivers without having to learn another OS's API for it... does it make sense for me to fight with stacks?
If simplicity is your only goal, then just use bitmaps.
blound wrote: also I read posts where people knocked bitmaps and the method of keeping a pointer to the last allocated page because it causes a lot of fragmentation, and I was thinking of ideas to reduce the fragmentation:
If you use paging, then fragmentation doesn't matter at all, and (except for those DMA/bus mastering buffers) you could ignore physical memory fragmentation completely. Also, those DMA/bus mastering buffers are typically allocated during boot, before physical memory has a chance to become fragmented, so you don't need to worry about that much either.
Physical memory fragmentation may mean slightly less cache efficiency (see "page/cache coloring" on Wikipedia for a quick description), but page/cache coloring can be implemented on top of bitmaps or by using a free page stack for each page/cache colour.
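As a rough sketch (invented names), colouring on top of a bitmap just means searching with a stride:
Code: Select all
/* sketch: to get a page of a given colour from a plain bitmap, step
   through it with a stride of COLOURS pages */
typedef unsigned int u32;
#define COLOURS 16                       /* e.g. cache size / ways / page size */

extern u32 pmap[];                       /* 1 bit per page, 1 = used */
extern u32 total_pages;

u32 alloc_coloured(u32 colour)           /* returns page number, -1 = none */
{
    for (u32 p = colour; p < total_pages; p += COLOURS)
        if (!(pmap[p / 32] & (1u << (p % 32)))) {
            pmap[p / 32] |= 1u << (p % 32);
            return p;
        }
    return (u32)-1;
}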
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
thanks for the awesome replies.. I will go with a bitmap, because my design is meant to be simple... now I have some justification on both sides, since before this everything I found was pro-bitmap
I think I cared too much about contiguous allocations before (not sure why), but after reading these replies and thinking on it, I also see now that it's not so important
I personally use a stack. The initialization is a bit tricky, 'cos I keep the stack as a linear array of page addresses, yet the pages that hold the array are allocated from the stack itself (and freed into the stack again when they contain no useful entries), such that the whole thing essentially uses one pointer's worth of (non-free) memory. Well, and the memory for the code of course.
I wouldn't use the method where pages contain pointers to the next free page (= linked list), because in 32-bit this might mean having to modify virtual memory mappings just to find the page in the list, and even in 64-bit, where you could map all memory for the purpose, it doesn't play nicely with caches (or TLBs for that matter). When using a linear array of addresses you only need to do a map/unmap for every 1025 pages freed/allocated, and even when several pages are allocated at the same time, the array doesn't trash more than a couple of processor cache lines. Plus it's IMHO easier.
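A minimal sketch of what I mean (4 KB pages holding 1024 entries each, helper names invented, bootstrap elided) - the page being freed becomes the next chunk of the array whenever the current chunk fills up:
Code: Select all
typedef unsigned int u32;
#define ENTRIES_PER_PAGE 1024            /* 4096 / sizeof(u32) */

static u32 *stack_page;                  /* fixed virtual address of current chunk */
static u32  stack_count;                 /* entries used in the current chunk */

/* assumed helpers: map_at(vaddr, phys), unmap(vaddr), phys_of(vaddr) */

void stack_push(u32 phys)                /* free one page */
{
    if (stack_count == 0 || stack_count == ENTRIES_PER_PAGE) {
        /* chunk full (or none yet): the freed page itself becomes the
           next chunk - the only map/unmap in 1025 frees */
        u32 prev = stack_count ? phys_of(stack_page) : 0;
        if (stack_count) unmap(stack_page);
        map_at(stack_page, phys);
        stack_page[0] = prev;            /* entry 0 links to the previous chunk */
        stack_count = 1;
        return;
    }
    stack_page[stack_count++] = phys;
}

u32 stack_pop(void)                      /* allocate one page, 0 = none left */
{
    if (stack_count == 0) return 0;
    if (stack_count > 1)
        return stack_page[--stack_count];
    /* only the back-link remains: hand out the chunk's own page */
    u32 prev = stack_page[0];
    u32 page = phys_of(stack_page);
    unmap(stack_page);
    stack_count = 0;
    if (prev) { map_at(stack_page, prev); stack_count = ENTRIES_PER_PAGE; }
    return page;
}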
The real problem with goto is not with the control transfer, but with environments. Properly tail-recursive closures get both right.
Hi,
mystran wrote: I wouldn't use the method where pages contain pointers to the next free page (= linked list), because in 32-bit this might mean having to modify virtual memory mappings just to find the page in the list...
I use the method where a free page contains a pointer to the next free page, and it is a little tricky. However, when a free page is being allocated it needs to be mapped into an address space after it's allocated, and when a page is being freed it's already in an address space before it's freed - there is no unnecessary map/unmap.
It goes a little like this:
Code: Select all
allocateVirtualPage(virtual_address) {
	lock_free_page_stack();
	/* map the top free page at the target virtual address */
	map_physical_into_virtual(virtual_address, address_of_first_page_on_stack);
	/* the first word of the page holds the link to the next free page */
	address_of_first_page_on_stack = *virtual_address;
	unlock_free_page_stack();
}

freeVirtualPage(virtual_address) {
	lock_free_page_stack();
	/* store the old stack top in the page itself, then make this page the new top */
	*virtual_address = address_of_first_page_on_stack;
	address_of_first_page_on_stack = convert_virtual_to_physical(virtual_address);
	unlock_free_page_stack();
	set_page_table_entry(virtual_address, 0);	/* finally unmap it */
}
What makes it messy is separating the virtual memory manager from the physical memory manager, but this is possible - "allocPhysicalPage()" needs to return a function pointer that's called by "allocVirtualPage()" after the page is mapped; "freePhysicalPage()" needs to know where the page is in the virtual address space before it's freed; and you need to make sure the physical memory manager's functions are called in the correct order.
When a page is freed its virtual address may already be in the TLB and its data may already be in the cache; and when a page is allocated its TLB entry and data may need to be loaded after the page is allocated anyway (especially for "allocation on demand", where you know the page fault handler will return to an instruction that accesses the virtual page), so the effect on the TLB and caches isn't as bad as it seems.
Also, for page/cache coloring, NUMA, keeping track of dirty pages and clean pages separately, and having separate sets of stacks for below 4 GB and above 4 GB, I use up to 1024 free page stacks. Using 4 KB for each stack would cost me up to 4 MB of kernel space, which would reduce the chance that the information is in the TLB and caches (of the correct CPU) when it's needed.
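To illustrate (a made-up sketch): with the links stored in the free pages themselves, each of those stacks only needs a tiny descriptor, so 1024 of them fit in 4 KB instead of 4 MB of arrays:
Code: Select all
typedef unsigned int u32;

struct free_stack {
    u32 top;                             /* physical address of first free page, 0 = empty */
};

static struct free_stack stacks[1024];   /* colour x zone x clean/dirty combinations */

/* picking a stack is just indexing (this layout is invented): */
static struct free_stack *pick(u32 colour, u32 zone, u32 dirty)
{
    return &stacks[(zone * 2 + dirty) * 256 + colour];
}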
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: stack vs bitmap
Brendan wrote: The ISA DMA controllers only have 7 usable channels (one is for "cascade"). Four of them are 8-bit channels that can only handle 64 KB transfers and the other 3 are 16-bit channels that can only handle 128 KB transfers. This means you'd really only need a maximum of 640 KB
AFAIK, there are only 6 usable ISA DMA channels -- the 8-bit channel #0 is also unusable -- it is pre-allocated for RAM-to-RAM DMA, but disconnected on all PCs.
Brendan wrote: If you assume each device driver that uses an ISA DMA channel only uses one buffer per DMA channel (and IMHO there's no sane reason to use multiple buffers for one DMA channel)
A too-simple implementation of ISA DMA for SoundBlaster cards will jitter unless you double-buffer. (You can "double-buffer" within just one buffer, though.) Which might imply a need for 2 buffers for every DMA channel. (And the floppy, ISA DMA #2, doesn't need the full amount.)
I say: theoretical PCI DMA devices without scatter-gather can fight with ISA DMA devices for fixed memory allocations inside this space. (There won't really be any of either, on any modern machine, of course.) But you don't need a bitmap to allocate a tiny handful of fixed buffers like this.
Brendan wrote: What makes it messy is separating the virtual memory manager from the physical memory manager, but this is possible - "allocPhysicalPage()" needs to return a function pointer that's called by "allocVirtualPage()" after the page is mapped...
I basically do it like you have; however, another method which I've used for completely separating the physical and virtual managers is:
Code: Select all
physaddr_t PhysMemStack::BeginPop()
{
	if(this->top == 0)	/* stack empty - caller must handle NULL */
		return NULL;
	Lock();			/* held until EndPop() */
	return this->top;
}

physaddr_t PhysMemStack::EndPop(physaddr_t paddr)
{
	/* 'paddr' is the virtual address the popped page is now mapped at;
	   its first word holds the link to the next free page */
	this->top = *((physaddr_t *)paddr);
	Unlock();
	return paddr;
}

uintptr_t VirtMemManager::AllocPage(uintptr_t virtual_location)
{
	physaddr_t phys_mem = pmem->BeginPop();
	MapPage(phys_mem, virtual_location);		/* map it first... */
	pmem->EndPop((physaddr_t)virtual_location);	/* ...then read the link */
	return virtual_location;
}
Regards,
John.
Hi,
bewing wrote: AFAIK, there are only 6 usable ISA DMA channels -- the 8-bit channel #0 is also unusable -- it is pre-allocated for RAM-to-RAM DMA, but disconnected on all PCs.
You're right - only 6 channels. The first 8-bit channel was originally used for RAM chip refresh (8086, XT), but became unused not long after. Some chipsets (mostly 80286-era, IIRC) did allow RAM-to-RAM DMA transfers, but AFAIK it was never a standard - there was no way to detect whether the chipset supported it or not, and it was much slower than a "rep movsd" anyway.
bewing wrote: A too-simple implementation of ISA DMA for SoundBlaster cards will jitter unless you double-buffer. (You can "double-buffer" within just one buffer, though.) Which might imply a need for 2 buffers for every DMA channel. (And the floppy, ISA DMA #2, doesn't need the full amount.)
bewing wrote: I say: theoretical PCI DMA devices without scatter-gather can fight with ISA DMA devices for fixed memory allocations inside this space. (There won't really be any of either, on any modern machine, of course.) But you don't need a bitmap to allocate a tiny handful of fixed buffers like this.
That depends on your OS design. For me, for example, device drivers can be started and stopped at any time, so contiguous buffers need to be allocated and freed dynamically; and the memory managers need to keep track of RAM used for PCI bus mastering in case the driver is "force terminated" in the middle of a DMA transfer (where the kernel can't stop the transfer and needs to put the pages involved onto a "cool down" list, so they can't be re-allocated until enough time has passed to ensure any bus mastering has completed).
jnc100 wrote: I basically do it like you have; however, another method which I've used for completely separating the physical and virtual managers is:
That is much cleaner, but doesn't work for more complicated situations - for example, if there are 1024 free page stacks and a bitmap, and the allocated page may or may not need to be "cleaned" (filled with zeros and optionally tested for RAM faults). In this case it's easier for "allocPhysPage()" to return a function pointer that points to one of many different "post-allocation" routines.
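A very rough sketch of that idea (all names invented):
Code: Select all
/* sketch: allocPhysPage() picks a page from whichever stack/bitmap applies
   and returns the post-allocation routine that page needs; the virtual
   memory manager calls it once the page is mapped */
typedef unsigned int u32;
typedef void (*post_alloc_fn)(void *vaddr);

extern u32 pop_some_stack(int *dirty);   /* assumed helper: picks a stack, pops */
extern void map_page(void *vaddr, u32 phys);

static void post_nothing(void *vaddr) { (void)vaddr; }   /* page already clean */

static void post_zero(void *vaddr)                        /* page was dirty */
{
    for (u32 i = 0; i < 4096 / sizeof(u32); i++)
        ((u32 *)vaddr)[i] = 0;
}

post_alloc_fn allocPhysPage(u32 *phys_out)
{
    int dirty;
    *phys_out = pop_some_stack(&dirty);
    return dirty ? post_zero : post_nothing;
}

void *allocVirtualPage(void *vaddr)
{
    u32 phys;
    post_alloc_fn post = allocPhysPage(&phys);
    map_page(vaddr, phys);               /* map first... */
    post(vaddr);                         /* ...then clean through the new mapping */
    return vaddr;
}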
FlashBurn wrote: I would be interested if anyone here uses 4mb/2mb pages? And if he does, how he organized the memory.
IMHO large pages (including AMD's new "1 GB pages") aren't very useful unless you're mapping a large amount during boot (e.g. the physical memory mapping that Linux has in kernel space) and/or you're mapping a large memory-mapped I/O area (e.g. video display memory).
For dynamically allocated large pages, in a simple implementation the large pages will end up being split into 4 KB pages after the OS runs for a while (as software allocates and frees pages). This will give you a slight performance improvement (fewer TLB misses) for a short time after boot, until all the large pages are split up and there are none left for new allocations.
A more complex implementation would prevent this by recombining freed 4 KB pages into a free large page where possible. This adds overhead to the code that frees pages and prevents faster physical memory management methods from being used (e.g. free page stacks). It would also rely on pure luck (you'd be hoping that the right 4 KB pages are freed so that you can make a free large page), as it only takes one "in use" 4 KB page to prevent a free large page from being created. On average, the overhead of combining free 4 KB pages into free large pages will probably be more than the slight performance improvement you get from having fewer TLB misses.
An even more complex implementation could actively recombine 4 KB pages into free large pages to avoid relying on pure luck. For example, if you've almost got enough free 4 KB pages to make a free large page, you could find the remaining ("in use") 4 KB pages and replace them, so that you can make a free large page. This will probably give you the worst performance - you'd have the most large pages in use and the fewest TLB misses, but you'd also have the highest memory management overhead.
There is one other way to use large pages. If the computer has a huge amount of RAM, then it might make sense to use large pages and nothing else. This means a lot less memory management overhead than "4 KB pages only" (mostly because you'd be allocating far fewer pages for the same amount of RAM). It also reduces the cost of TLB misses, because the CPU needs to do less work (e.g. find the page directory entry without needing to find the page table entry). However, AFAIK modern CPUs aren't designed for this - for example, a CPU might have 16 TLB entries for large pages and 16384 TLB entries for 4 KB pages, and a "large pages only" OS won't use any of the 16384 TLB entries for 4 KB pages and will end up with more TLB misses because of it. Of course it will also waste more RAM (e.g. if you need 8 KB of RAM, you get a 2 MB page and waste 2040 KB). In general this might become a good idea in the future, but IMHO it isn't a good idea now.
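For the "recombining" variants, the obvious bookkeeping (sketched here with invented names and assumed helpers) is a free-page counter per large-page region:
Code: Select all
/* sketch: count the free 4 KB pages in each 2 MB region; when a count hits
   512 the whole region can be handed out as one large page */
typedef unsigned int u32;
#define PAGES_PER_LARGE 512              /* 2 MB / 4 KB */

extern u32 free_count[];                 /* one counter per 2 MB region */
extern void reclaim_region_as_large_page(u32 region);   /* assumed helper */

void on_page_freed(u32 phys)
{
    u32 region = phys / 0x200000;
    if (++free_count[region] == PAGES_PER_LARGE)
        reclaim_region_as_large_page(region);   /* pull its 512 pages out of
                                                   the 4 KB allocator */
}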
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
I'm working on such an algorithm that uses both (4 KB and 4 MB pages). In my asm OS I did this with bitmaps, and now I have up to 2046 lists, going from 1 free page up to 2046 free pages, and I allocate only from the lowest list (1 and up). So the chances that I get many 4 MB pages are good. The problems are that allocating a 4 MB page isn't as fast as allocating a 4 KB page, and deallocating a 4 KB page can take "a while".
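If it helps to picture it, a sketch of that kind of segregated-list scheme (invented names) - regions keyed by how many free 4 KB pages they contain, allocating from the least-free region first:
Code: Select all
/* sketch: regions ordered by their count of free 4 KB pages; allocating
   from the least-free regions first keeps fully-free regions intact, so
   they stay available as large pages */
typedef unsigned int u32;
#define MAX_FREE 2046                    /* per the post above */

struct region {
    struct region *next;
    u32 free_pages;                      /* free 4 KB pages in this region */
    /* ...per-region bitmap of pages... */
};

static struct region *lists[MAX_FREE + 1];   /* lists[n] = regions with n free pages */

struct region *pick_region(void)
{
    /* search upward from the lowest non-empty list */
    for (u32 n = 1; n <= MAX_FREE; n++)
        if (lists[n])
            return lists[n];             /* caller allocates a page here and
                                            moves the region to lists[n-1] */
    return 0;                            /* nothing free */
}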