
Re: Physical page allocation

Posted: Tue Oct 16, 2012 11:49 am
by rdos
So I've implemented the basics, and it seems to work well. The nice thing is that this new scheme is pretty easy to debug in a standard kernel thread, because the managed physical memory is not accessed. The next step is to replace the old allocation functions with the new ones.

I've also figured out a few optimizations I'll probably implement.

First, I think the search algorithm for a good bitmap should break when it has found a bitmap with over 50% free pages. No sense in searching for bitmaps that are better than that.
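For illustration, a sketch (my own, not actual RDOS code) of the search with that early-out, assuming the 4-byte header holds a 16-bit free-page count:

Code: Select all

/* Sketch only: find the first bitmap with the largest number of free
 * pages, but stop at the first bitmap that is more than 50% free.
 * Assumes each 4K bitmap covers 32768 pages and that its header
 * exposes a free-page count. */
int find_active_bitmap(const unsigned short *free_count, int nr_bitmaps)
{
    int best = -1;
    unsigned int best_free = 0;

    for (int i = 0; i < nr_bitmaps; i++) {
        if (free_count[i] > best_free) {
            best = i;
            best_free = free_count[i];
            if (best_free > 32768 / 2)  /* over 50% free: good enough */
                break;
        }
    }
    return best;  /* -1 if every bitmap is full */
}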

Second, I think there is a very effective optimization that could be done. By providing a cache (which uses a stack) per processor-core that contains a fixed number of entries, the core can quickly reclaim recently freed pages. That would improve performance in scenarios with many allocate-free pairs. Each cache entry could contain:

Code: Select all

struct PhysCacheEntry
{
    short int BitmapNr;           /* which 4K bitmap the page belongs to */
    short int BitNr;              /* bit (page) number within that bitmap */
    struct PhysCacheEntry *Next;  /* next entry in the per-core stack */
};
That's 8 bytes per entry. A single page could contain 512 entries (covering 2MB), which might be enough. BitmapNr and BitNr make it possible to quickly update the bit in the bitmap (this would still be done in order to detect double-frees) and to calculate the physical address. As this cache is per core, it won't need any spinlocks, and the links could be updated using cli/sti.

While a possible optimization might be to only link the pages, and not free them in the bitmap, this also has some disadvantages. One is that double-frees could not be detected, and another is that the entries could only be allocated from the core they were freed on. So I think they always should be freed. A complication of freeing them is that another core could allocate them, but that is detected when trying to clear the bit in the bitmap with btr. If CY is not set, the entry is discarded, and another one tried instead.
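To make that concrete, here is a sketch of the reclaim path (my assumptions, not RDOS code: bitmap_word and phys_addr are invented helper names):

Code: Select all

#include <stdint.h>

/* Assumed helpers: the 32-bit bitmap word holding a given bit, and the
 * physical address that a (bitmap, bit) pair corresponds to. */
extern uint32_t *bitmap_word(int bitmap_nr, int bit_nr);
extern void *phys_addr(int bitmap_nr, int bit_nr);

void *phys_cache_alloc(struct PhysCacheEntry **head)
{
    for (;;) {
        /* The list is per-core, so cli/sti is enough to guard it. */
        __asm__ volatile ("cli");
        struct PhysCacheEntry *e = *head;
        if (e)
            *head = e->Next;
        __asm__ volatile ("sti");
        if (!e)
            return NULL;   /* cache empty: fall back to the bitmap search */

        /* Atomically clear the "free" bit; on x86 this is the lock btr /
         * check CY sequence described above. If the bit was already
         * clear, another core took the page, so discard the entry. */
        uint32_t mask = 1u << (e->BitNr & 31);
        uint32_t old = __atomic_fetch_and(bitmap_word(e->BitmapNr, e->BitNr),
                                          ~mask, __ATOMIC_SEQ_CST);
        if (old & mask)
            return phys_addr(e->BitmapNr, e->BitNr);
        /* (recycling e onto an entry free list is omitted here) */
    }
}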

An additional advantage is that cache-misses are reduced when the same physical page is reused again.

Re: Physical page allocation

Posted: Tue Oct 16, 2012 10:01 pm
by Brendan
Hi,
rdos wrote:I'll use the 4K bitmap approach, with a 4 byte header for each bitmap, as this can be implemented without spinlocks, which is a great advantage.
If there are 20 different ways of implementing something (where some methods use spinlocks and some don't); you have to compare each method to determine the advantages/disadvantages of each. You can't assume that "no spinlocks" is an advantage - it's possible that one of the methods that does use spinlocks is better than all of the methods that don't use spinlocks.
rdos wrote:I think I will use 8MB linear address space for the page bitmaps, which would support 256GB of physical memory.
Is there any reason why you can't determine how many page bitmaps you need during boot? You've said that your OS supports old computers and you've said the OS is mostly for embedded systems, and an old embedded system that only has 16 MiB of RAM would only need (part of) 1 bitmap and doesn't need to have half the usable RAM wasted.
rdos wrote:Switching active block is done by finding the first block with the largest number of free pages. This is done like that because allocation speed is faster the more free pages a bitmap contains. This search algorithm is lockless.
If each CPU was using a different active block; then sooner or later 2 or more CPUs would go looking for the first block with the largest number of free pages at the same/similar time and find the same active block. Once 2 or more CPUs are using the same active block then they end up "accidentally synchronised" - that block becomes full and the CPUs that were using it search for a new active block at the same/similar time and are likely to find the same block as each other again. Once you've got groups of CPUs that happen to be using the same active block, when 2 or more groups of CPUs search for a new active block at the same/similar time then the groups combine.

For example, for a system with 16 CPUs that's been running for a while, you might end up with a 50% chance of 4 CPUs using the same active block.

When CPUs have the same active block, one will modify the block (causing the cache line to be invalidated in all other CPUs' caches) and then modify the bitmap header (causing another cache line to be invalidated in all other CPUs' caches). Then the next CPU gets 2 cache misses and invalidates 2 cache lines, then the next, etc. This mostly means that when multiple CPUs happen to be using the same active block (which is relatively likely), you end up with a high chance of many cache misses.

These cache misses would obviously be bad for performance. For a NUMA system (where different CPUs are on different physical chips) it's worse - you end up with a lot of extra traffic across a hyper-transport or quick-path link as the cache line bounces between the physical chips. Under heavy load (lots of CPUs allocating pages at the same time), this extra traffic across the hyper-transport/quick-path link can consume a lot of the link's bandwidth and harm performance of other CPUs that are trying to use the link's bandwidth for other things. For example, for a 16-CPU NUMA system, if 8 CPUs are trying to allocate a lot of pages at the same time then the remaining 8 CPUs (which aren't allocating any pages) may be starved and (depending on what they're doing) end up with half the performance due to constantly stalling while waiting for data to come over a congested hyper-transport/quick-path link.

However, if the bitmap headers are only 4 bytes and are packed together (e.g. 16 of these 4-byte headers packed into each 64-byte cache line) then one CPU can modify one bitmap's header and invalidate the cache line that other CPUs are using for different bitmap headers - CPUs don't need to be using the same active block. Consider a computer with 2 GiB of RAM - for this you'd never use more than 16 bitmaps, so every CPU may be pounding the same "header cache line" regardless of which active block it uses. Also consider cache misses caused by searching for a new active block. Note: This false sharing can be avoided by placing the 4-byte bitmap headers on separate cache lines (e.g. make them "4 bytes with 60 bytes or 124 bytes of unused padding").
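For example, the padded variant might look like this (a sketch; the field names are illustrative, not a real layout):

Code: Select all

/* One bitmap header per 64-byte cache line, so that CPUs updating the
 * headers of different bitmaps never share a cache line. */
struct BitmapHeader {
    unsigned short free_pages;  /* free pages left in this 4K bitmap */
    unsigned short flags;
    unsigned char pad[60];      /* padding up to a full cache line */
} __attribute__((aligned(64)));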

Next; for a pathological worst case, allocating a page may take an infinite amount of time. Imagine that there is no bitmap with more than one free page. A CPU searches its active bitmap and finds that it's full, then finds the block with the largest number of free pages (and finds a block with 1 free page because there is nothing with more); but immediately after this a different CPU allocates the last free page in that bitmap. Your CPU searches its new active bitmap and finds that it's full, then finds a new block with the largest number of free pages (and finds another block with 1 free page); but immediately after, the last page of that bitmap is allocated. This could happen indefinitely - e.g. for a system with 16 CPUs, 2 of those CPUs might be allocating the last page in a block and freeing 1 page in previously full blocks; and you could have 14 "very unlucky" CPUs that are all trying (and constantly failing) to allocate a page; even when there are many free pages that could be allocated.
rdos wrote:I've also figured out a few optimizations I'll probably implement.

First, I think the search algorithm for a good bitmap should break when it has found a bitmap with over 50% free pages. No sense in searching for bitmaps that are better than that.
This won't reduce the "higher than expected chance of 2 or more CPUs using the same active block" problem (different CPUs searching for a new active block at the same/similar time would still find the same block).
rdos wrote:Second, I think there is a very effective optimization that could be done. By providing a cache (which uses a stack) per processor-core that contains a fixed number of entries, the core can quickly reclaim recently freed pages.
If the per-CPU cache is too small, then this won't help much to reduce the chance of the "cache thrash when 2 or more CPUs are using the same active block" problem. If the per-CPU cache is too large it will create a new "out-of-memory even though lots of pages are free" problem. Deciding how large the per-CPU cache should be may be a bit like trying to find a value that is greater than 4 and also less than 2 - there might not be any good cache size.

The per-CPU cache also doesn't prevent the "allocating a page may take an infinite amount of time" pathological worst case (a pattern of allocations/de-allocations that causes indefinite starvation still exists).


Cheers,

Brendan

Re: Physical page allocation

Posted: Wed Oct 17, 2012 1:04 am
by rdos
Brendan wrote:If there are 20 different ways of implementing something (where some methods use spinlocks and some don't); you have to compare each method to determine the advantages/disadvantages of each. You can't assume that "no spinlocks" is an advantage - it's possible that one of the methods that does use spinlocks is better than all of the methods that don't use spinlocks.
Agreed. However, there are several design issues involved:
1. The design must be able to handle more than 4GB of memory without exhausting the below-4G address space, and must do so in protected mode
2. The design should at least provide a way to debug double-frees, even if that is not part of the production release
3. It is preferable if it can be implemented locklessly, as this relaxes the demands on the page-fault handler with regard to deadlocks
Brendan wrote:
rdos wrote:I think I will use 8MB linear address space for the page bitmaps, which would support 256GB of physical memory.
Is there any reason why you can't determine how many page bitmaps you need during boot? You've said that your OS supports old computers and you've said the OS is mostly for embedded systems, and an old embedded system that only has 16 MiB of RAM would only need (part of) 1 bitmap and doesn't need to have half the usable RAM wasted.
It is determined during boot. This is kind of tricky because there need to be a few free physical pages available when paging is started, so I allocate 3 pages in this linear region (the page directory, the first page of bitmap headers and the first bitmap page covering the first 128MB), and then add all physical memory up to 0x40000 as free. That means that a system with 16MB RAM would only use 3 pages (12k) for physical memory allocation. The rest of the linear address space would not be mapped. I plan to fill up the initial bitmaps after starting paging, letting the page-fault handler automatically allocate physical memory for bitmaps as needed. That's why I need a small pool of free memory to be set up before paging is started. This requires a lockless physical memory manager, as the free operation itself might call allocate in the page-fault handler.
Brendan wrote:If each CPU was using a different active block; then sooner or later 2 or more CPUs would go looking for the first block with the largest number of free pages at the same/similar time and find the same active block. Once 2 or more CPUs are using the same active block then they end up "accidentally synchronised" - that block becomes full and the CPUs that were using it search for a new active block at the same/similar time and are likely to find the same block as each other again. Once you've got groups of CPUs that happen to be using the same active block, when 2 or more groups of CPUs search for a new active block at the same/similar time then the groups combine.

For example, for a system with 16 CPUs that's been running for a while, you might end up with a 50% chance of 4 CPUs using the same active block.

When CPUs have the same active block, one will modify the block (causing the cache line to be invalidated in all other CPUs' caches) and then modify the bitmap header (causing another cache line to be invalidated in all other CPUs' caches). Then the next CPU gets 2 cache misses and invalidates 2 cache lines, then the next, etc. This mostly means that when multiple CPUs happen to be using the same active block (which is relatively likely), you end up with a high chance of many cache misses.

These cache misses would obviously be bad for performance. For a NUMA system (where different CPUs are on different physical chips) it's worse - you end up with a lot of extra traffic across a hyper-transport or quick-path link as the cache line bounces between the physical chips. Under heavy load (lots of CPUs allocating pages at the same time), this extra traffic across the hyper-transport/quick-path link can consume a lot of the link's bandwidth and harm performance of other CPUs that are trying to use the link's bandwidth for other things. For example, for a 16-CPU NUMA system, if 8 CPUs are trying to allocate a lot of pages at the same time then the remaining 8 CPUs (which aren't allocating any pages) may be starved and (depending on what they're doing) end up with half the performance due to constantly stalling while waiting for data to come over a congested hyper-transport/quick-path link.
I agree. This is a problem, but it can at least partially be solved by the per-core cache.
Brendan wrote:However, if the bitmap headers are only 4 bytes and are packed together (e.g. 16 of these 4-byte headers packed into each 64-byte cache line) then one CPU can modify one bitmap's header and invalidate the cache line that other CPUs are using for different bitmap headers - CPUs don't need to be using the same active block. Consider a computer with 2 GiB of RAM - for this you'd never use more than 16 bitmaps, so every CPU may be pounding the same "header cache line" regardless of which active block it uses. Also consider cache misses caused by searching for a new active block. Note: This false sharing can be avoided by placing the 4-byte bitmap headers on separate cache lines (e.g. make them "4 bytes with 60 bytes or 124 bytes of unused padding").
That would mean the headers would use 64 bytes instead (which would be 32 pages of linear address space for 256GB instead of 2). There is also another drawback when it comes to caching. If a single CPU traverses the headers, it would experience fewer cache misses with 4-byte headers than with 64-byte headers (access would have greater locality), and this is exactly the operation performed when looking for new bitmaps. I think the latter issue is more relevant than the former, especially if caching per core is used to reduce load on the bitmaps.
Brendan wrote:Next; for a pathological worst case, allocating a page may take an infinite amount of time. Imagine that there is no bitmap with more than one free page. A CPU searches its active bitmap and finds that it's full, then finds the block with the largest number of free pages (and finds a block with 1 free page because there is nothing with more); but immediately after this a different CPU allocates the last free page in that bitmap. Your CPU searches its new active bitmap and finds that it's full, then finds a new block with the largest number of free pages (and finds another block with 1 free page); but immediately after, the last page of that bitmap is allocated. This could happen indefinitely - e.g. for a system with 16 CPUs, 2 of those CPUs might be allocating the last page in a block and freeing 1 page in previously full blocks; and you could have 14 "very unlucky" CPUs that are all trying (and constantly failing) to allocate a page; even when there are many free pages that could be allocated.
That case seems so odd that I don't think it is something you need to plan for. I think chances are about 99.99% that this would lead to a reboot, given that the production release of RDOS does a reboot when physical memory is exhausted. :mrgreen:
Brendan wrote:
rdos wrote:Second, I think there is a very effective optimization that could be done. By providing a cache (which uses a stack) per processor-core that contains a fixed number of entries, the core can quickly reclaim recently freed pages.
If the per-CPU cache is too small, then this won't help much to reduce the chance of the "cache thrash when 2 or more CPUs are using the same active block" problem. If the per-CPU cache is too large it will create a new "out-of-memory even though lots of pages are free" problem. Deciding how large the per-CPU cache should be may be a bit like trying to find a value that is greater than 4 and also less than 2 - there might not be any good cache size.
The cache entries would only be advisory. The core still needs to allocate the page in the bitmap before returning the physical address. The cache size then becomes more a question of how much memory is worth keeping in it.

Re: Physical page allocation

Posted: Fri Oct 19, 2012 6:38 am
by rdos
I had a temporary setback in regard to physical memory handling. The kernel overflowed 64k, so I had to do something. I solved this by moving the scheduler out of the kernel, so now the kernel is about 48k, and the scheduler can be ported to a 32-bit device driver (and thus might also incorporate C code). The scheduler is now also better isolated from the rest of the kernel, as it runs in its own code selector.

Soon I'll be checking the new bitmap-based page allocation code.

Edit: The new memory handler is now running. There doesn't seem to be any noticeable speed difference, although the boot takes longer because setting up the bitmaps seems slower, especially when the routine needs to enable/disable paging for each 4k page.

Re: Physical page allocation

Posted: Thu Dec 06, 2012 10:07 am
by mrvn
My first thought was a bitmap. But that takes constant space, and with multiple memory regions it becomes messy. It is also not so nice for multiple cores.

My second thought was to make a simple linked list, each empty page containing a pointer to the next. Freeing a page adds it to the front of the list, allocating a page removes the first page from the list. For multiple cores each core can have its own list, and when one core runs dry a global lock and redistribution can be used. The problem there is that every page has to be written to on free() and read from on alloc(), which means they need to be mapped somewhere. I don't want to keep every page mapped (in case the kernel has a bug and tries to access the wrong address), and mapping and unmapping them as needed is complex and takes time.

So third thought: A page is rather large to store a single next pointer. Why not store more pointers per page?

Code: Select all

// untested; map_page, unmap_page, bzero, global_rebalance and
// out_of_memory are assumed kernel helpers. virt_to_phys is an assumed
// helper added here so alloc_page can return the physical address of
// the container page (the original returned the virtual address p
// after unmapping it, which looks like a bug).
#include <stdint.h>

enum {MAX_FREE_PER_PAGE = 510};
struct free_page {
  struct free_page *next;          // next container in this core's list
  uint64_t num_entries;            // how many slots of page[] are used
  void *page[MAX_FREE_PER_PAGE];   // physical addresses of free pages
};
struct core {
  struct core *next;
  struct core *prev;
  uint64_t core_id;
  struct free_page *free_page;     // head container of this core's list
  uint64_t num_free_pages;         // counts the container pages too
};

void free_page(struct core *core, void *page) {
  struct free_page *p = core->free_page;
  if (p == NULL || p->num_entries == MAX_FREE_PER_PAGE) {
    // no container yet, or the head container is full: the freed page
    // itself becomes the new head container
    p = map_page(page, MAP_KERNEL); // map the page somewhere within the kernel region
    bzero(p, sizeof(struct free_page));
    p->next = core->free_page;
    core->free_page = p;
  } else {
    p->page[p->num_entries++] = page;
  }
  ++core->num_free_pages;
}

void *alloc_page(struct core *core) {
  if (core->num_free_pages == 0) {
    global_rebalance();
    if (core->num_free_pages == 0)
      out_of_memory();
  }
  --core->num_free_pages;
  struct free_page *p = core->free_page;
  if (p->num_entries == 0) {
    // the container itself is the last free page of this chunk:
    // unmap it and hand out its physical address
    core->free_page = p->next;
    void *page = virt_to_phys(p);   // assumed helper
    unmap_page(p);
    return page;
  } else {
    return p->page[--p->num_entries]; // fixed: was p->page[p->mum_entries--]
  }
}
As you can see, for every 511 free pages I use one of them to manage the remaining 510 pages. That significantly reduces the amount of mapping and unmapping and the number of exposed pages. As pages are allocated, the managing pages are also used. I only "waste" space when there are free pages.

In the global balancing I would first search for the core with the most free pages. Since small errors in the count would be irrelevant, that wouldn't need a lock. Then lock down that core and steal roughly half its pages. I would steal half the struct free_page structures unless the core has only one. That would mean finding the middle, setting empty_core->free_page = p->next and p->next = NULL, and readjusting num_free_pages for both cores.
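A sketch of that split (assuming the structs above, with the victim's list already locked and the thief's list empty because it ran dry):

Code: Select all

/* Untested sketch: move roughly half of the victim's free_page chain
 * to the thief, using a slow/fast walk to find the middle. */
void steal_half(struct core *thief, struct core *victim)
{
    struct free_page *slow = victim->free_page;
    if (slow == NULL)
        return;                        /* victim has nothing to give */

    if (slow->next == NULL) {
        /* Only one free_page structure: hand the whole thing over
         * (splitting its entries instead is left out of this sketch). */
        thief->free_page = slow;
        victim->free_page = NULL;
    } else {
        struct free_page *fast = slow->next;
        while (fast->next && fast->next->next) {
            slow = slow->next;
            fast = fast->next->next;
        }
        thief->free_page = slow->next; /* second half goes to the thief */
        slow->next = NULL;             /* first half stays with the victim */
    }

    /* Recount what moved; each container accounts for its entries plus
     * the container page itself. */
    uint64_t moved = 0;
    for (struct free_page *p = thief->free_page; p; p = p->next)
        moved += p->num_entries + 1;
    victim->num_free_pages -= moved;
    thief->num_free_pages += moved;
}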

Improvements:
If a core has a full struct free_page and a process frees a page and allocates a page in a loop, the core would continuously map, bzero and unmap the page. So it might make sense to add a small array of free pages to struct core, as sketched below. When the top struct free_page is full, put the page into the struct core. Only when that is full too, create a new page and transfer all pages from struct core to that page. That way the map/bzero/unmap cost gets spread out.
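For example (illustrative only), struct core could grow a small array that absorbs the free/alloc churn before a new container page is created:

Code: Select all

enum { CORE_CACHE = 16 };          /* small, tunable per-core buffer */
struct core {
  struct core *next;
  struct core *prev;
  uint64_t core_id;
  struct free_page *free_page;
  uint64_t num_free_pages;
  void *cache[CORE_CACHE];         /* freed pages parked here first */
  unsigned int cache_used;
};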

The tail of the free_page list doesn't need to be mapped; only the head is needed at any one time. If free_page.next is changed to be the page number of the next page instead of its virtual address, then the tail could be unmapped and remapped as needed. Rebalancing would become harder though, as you would then have to remap each page to follow the next pointer.

Hope that helps someone.
Mrvn

Re: Physical page allocation

Posted: Thu Dec 06, 2012 10:26 am
by rdos
If you steal pages between cores, you will need locks. My implementation is lock-free (and thus can handle page faults in ISRs and things like that).

I have not yet implemented the local core cache of recently used pages (which also can be implemented lock-free since only the core itself has access to it), simply because I want to be 100% sure that the bitmap implementation is stable.

Re: Physical page allocation

Posted: Thu Dec 06, 2012 1:45 pm
by bluemoon
Instead of stealing pages from other cores, how about having low-priority thread(s) that manage VM and usable memory for all cores and balance them before any of them actually runs out of memory and needs to "lock and steal"?

"lock and steal" case may still exist, but not very often then.

Re: Physical page allocation

Posted: Thu Dec 06, 2012 5:05 pm
by rdos
There are two possible ways to keep pages local to a core.

The obvious one is to allocate a few (or many) pages and keep them in a (private?) list per core. That's the implementation that might need to steal pages between cores in low-memory situations.

The other way to implement it is to have a list of suggested pages that have recently been used (on the particular core), but have since been freed. In this case the cache only works as a hint about where in the bitmap there might be a free page. The allocator still needs to allocate it the usual way. The benefit is that the page is probably still in the CPU cache, and there is no need to search the bitmaps for free pages (which could be slow in low-memory situations).

It is the latter algorithm that I will eventually implement, and not the former.

Re: Physical page allocation

Posted: Fri Dec 07, 2012 3:46 am
by mrvn
rdos wrote:If you steal pages between cores, you will need locks. My implementation is lock-free (and thus can handle page faults in ISRs and things like that).

I have not yet implemented the local core cache of recently used pages (which also can be implemented lock-free since only the core itself has access to it), simply because I want to be 100% sure that the bitmap implementation is stable.
Yes, it will need locks, unless you use a lock-free work-stealing queue implementation, but that takes constant space and gets complex. But note that in most cases a CPU will not run dry, and if it does then only one other CPU needs to be locked. Or rather, the free_page structure of that other CPU needs to be locked. Unless more than half your cores run dry at the same time, you will always find a core whose free_page list you can lock and steal from. Another thing is that taking and putting a page is a very short operation. The time the structure will be locked will be negligible compared to the time an ISR takes just for the context switch and register saving. But the worst case would be if only one CPU has pages, in which case it would take O(log cores) steps with rebalancing on each one. Still probably not a problem.

I think the real problem, and the one that somewhat kills your ISR idea, is when you actually run out of pages and need to swap something out or flush buffers to disk. There is no way around suspending the thread and waiting for the IO to complete then.

Mrvn

Re: Physical page allocation

Posted: Fri Dec 07, 2012 3:55 am
by mrvn
bluemoon wrote:Instead of stealing pages from other cores, how about having low-priority thread(s) that manage VM and usable memory for all cores and balance them before any of them actually runs out of memory and needs to "lock and steal"?

"lock and steal" case may still exist, but not very often then.
I was thinking more of a high-priority thread, one per core. When a core runs low on pages (no need to wait till it is completely dry), it finds a core with many pages and sends that core's thread a "give me some pages" message. The next time the other core schedules, the thread runs, takes some free pages and sends a reply back.

But that requires a good, potentially lock-free, inter-core message-passing mechanism.

Mrvn