
Re: Physical page allocation

Posted: Tue Sep 11, 2012 5:40 am
by gerryg400
rdos wrote:Primarily because it allows double-free detection. Additionally, because I don't want to duplicate physical memory handling in 64-bit mode, and because I don't want to connect memory allocation with page table allocation.
If you really only need the bitmap for double-free detection then why not have both? Bitmaps are cheap, but they are very slow when memory is short and it's clumsy to try to make them scale on lots of cores. Of course, once all the double-free bugs are detected during testing you can remove the bitmap code from the production build.
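For illustration only, a debug build could wire such a check into the free path roughly like this (a minimal sketch with one bit per 4 KiB page covering 4 GiB; all names are invented):

Code:

/* Debug-only double-free check: a set bit means "this page is free".
 * The allocator would clear the bit again when it hands the page out. */
#include <stdint.h>
#include <assert.h>

static uint32_t free_bitmap[1 << 15];   /* 32768 dwords = 1 Mi bits = 4 GiB of 4 KiB pages */

void debug_free_page(uint32_t phys_addr)
{
    uint32_t page = phys_addr >> 12;
    uint32_t idx  = page / 32;
    uint32_t mask = 1u << (page % 32);

    /* If the page is already marked free, this is a double free. */
    assert((free_bitmap[idx] & mask) == 0);
    free_bitmap[idx] |= mask;
}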

Re: Physical page allocation

Posted: Tue Sep 11, 2012 5:46 am
by Griwes
rdos wrote:Also, when allocating / freeing memory on behalf of a 64-bit process, this will need to be done in a 64-bit context, which would duplicate the code. It is far easier to just call the 32-bit kernel and ask it to allocate / free a physical memory page.
It would be far easier to just write a proper 64-bit memory manager and call it from the 32-bit kernel.

After that, you would be getting closer and closer to getting rid of that segmentation and protected mode specific hell :lol:

Re: Physical page allocation

Posted: Tue Sep 11, 2012 5:50 am
by rdos
gerryg400 wrote:
Primarily because it allows double-free detection. Additionally, because I don't want to duplicate physical memory handling in 64-bit mode, and because I don't want to connect memory allocation with page table allocation.
If you really only need the bitmap for double-free detection then why not have both? Bitmaps are cheap, but they are very slow when memory is short and it's clumsy to try to make them scale on lots of cores. Of course, once all the double-free bugs are detected during testing you can remove the bitmap code from the production build.
That's a possibility. Although I also don't know how to allocate multiple consecutive pages with Brendan's method. It would need a massive amount of page flushes and TLB misses in order to find adjacent pages that are free. It might be possible to solve this by having a small bitmap just for this function, but without that it would be very complex and slow.

I'm also not sure that any modern computer with 4 GB+ of memory would ever get into a low-memory state. Therefore, such configurations should not suffer from this problem, especially not when the page bitmap with the most free pages is used for allocation.

The scaling issue can be (at least partially) solved by letting each core have its own current page bitmap.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 5:55 am
by rdos
Griwes wrote:It would be far easier to just write a proper 64-bit memory manager and call it from the 32-bit kernel.
Not so, as the main target is low-end PCs that have no support for x86-64 or PAE. It would be nice to be able to run x86-64 applications, but that will not be the primary target anytime soon.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 6:38 am
by Brendan
Hi,
rdos wrote:That's a possibility. Although I also don't know how to allocate multiple consecutive pages with Brendan's method. It would need a massive amount of page flushes and TLB misses in order to find adjacent pages that are free. It might be possible to solve this by having a small bitmap just for this function, but without that it would be very complex and slow.
How many cases can you think of where you need adjacent pages? The main one is ancient hardware that uses the ISA DMA chips. There might also be a few/rare PCI cards that don't support scatter/gather. I'd be surprised if you could find a reasonably modern system (e.g. something that supports long mode) where allocations of adjacent pages adds up to more than 1 MiB.

For a system with only 4 GiB of RAM, that works out to less than 0.025% of pages that need to be allocated as adjacent pages, and a "slightly larger" 99.975% of pages where nobody cares about adjacent pages. Why not make allocations for that 99.975% faster, and just have a small bitmap (e.g. for the RAM below 0x01000000 that all hardware can access) for the negligible 0.025% that needs it?
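As a quick sanity check of those figures (not kernel code; the 1 MiB total is just the estimate from the previous paragraph):

Code:

/* Back-of-envelope check: share of "must be adjacent" pages in 4 GiB, and
 * the size of a bitmap covering the RAM below 0x01000000 (16 MiB). */
#include <stdio.h>

int main(void)
{
    double contiguous = 1.0 * 1024 * 1024;             /* ~1 MiB of adjacent-page allocations */
    double total      = 4.0 * 1024 * 1024 * 1024;      /* 4 GiB of RAM */
    printf("%.3f%% of all pages\n", 100.0 * contiguous / total);   /* about 0.024% */

    unsigned low_pages = 16u * 1024 * 1024 / 4096;     /* pages below 0x01000000 */
    printf("low-RAM bitmap: %u bytes\n", low_pages / 8);            /* 512 bytes */
    return 0;
}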
rdos wrote:I'm also not sure that any modern computer with 4 GB+ of memory would ever get into a low-memory state. Therefore, such configurations should not suffer from this problem, especially not when the page bitmap with the most free pages is used for allocation.
This depends on what the OS is being used for. Something like an FTP/HTTP server could easily consume 4 GiB just for file caches. Anything involving 3D games, CAD, movie editing, etc could consume plenty of RAM too. For most embedded systems you could probably get by with 2 MiB of RAM (or less if there's no video buffer), but then you wouldn't need 64-bit or SMP (or an OS) for most of those.


Cheers,

Brendan

Re: Physical page allocation

Posted: Tue Sep 11, 2012 6:58 am
by rdos
Brendan wrote:How many cases can you think of where you need adjacent pages? The main one is ancient hardware that uses the ISA DMA chips. There might also be a few/rare PCI cards that don't support scatter/gather. I'd be surprised if you could find a reasonably modern system (e.g. something that supports long mode) where allocations of adjacent pages adds up to more than 1 MiB.
Actually, many modern chips require this, but only for the initialization code.

Some examples are RTL8139, RTL8169, AHCI and a couple of (modern) audio devices.
Brendan wrote: For a system with only 4 GiB of RAM, that works out to less than 0.025% of pages that need to be allocated as adjacent pages, and a "slightly larger" 99.975% of pages where nobody cares about adjacent pages. Why not make allocations for that 99.975% faster, and just have a small bitmap (e.g. for the RAM below 0x01000000 that all hardware can access) for the negligible 0.025% that needs it?
That's reasonable.
Brendan wrote: This depends on what the OS is being used for. Something like an FTP/HTTP server could easily consume 4 GiB just for file caches. Anything involving 3D games, CAD, movie editing, etc could consume plenty of RAM too. For most embedded systems you could probably get by with 2 MiB of RAM (or less if there's no video buffer), but then you wouldn't need 64-bit or SMP (or an OS) for most of those.
It is better to be preventive here. You should never allow file system caches to use all available physical RAM. You should always keep some free for application usage. This means you should preventively decrease FS caches as free physical memory runs low.

And I require this, as the animations consume much more memory than is available, so the FS caches need to be purged regularly.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 9:34 am
by evoex
Brendan wrote:Hi,
How many cases can you think of where you need adjacent pages? The main one is ancient hardware that uses the ISA DMA chips. There might also be a few/rare PCI cards that don't support scatter/gather. I'd be surprised if you could find a reasonably modern system (e.g. something that supports long mode) where allocations of adjacent pages adds up to more than 1 MiB.
On the Wiki (http://wiki.osdev.org/ISA_DMA) it says:
A typical 1.44 MB floppy disk can easily transfer 36 sectors of data in a single transfer. This is only 18 KB. The biggest internal floppy with worst-case formatting may be able to transfer 84 sectors at once. This is still only 42 KB. A Soundblaster card may run best with a 64 KB buffer. It is never necessary to try to double-buffer ISA DMA transfers, because they are so slow anyway. There are at most 6 usable DMA channels, and it is not necessary to allocate a full 64 KB to each of them. Putting all of this together, any OS should be easily able to allocate all the ISA DMA physical memory that it needs from a 256 KB pool, or even only half of that.
After reading this, my plan was to simply pre-allocate a set amount... As 256k ought to be negligible for anyone...
In that case, can't you avoid the requirement of allocating sequential pages? You could simply make a much simpler memory allocator for DMA.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 9:34 am
by Owen
rdos wrote:It is better to be preventive here. You should never allow file system caches to use all available physical RAM. You should always keep some free for application usage. This means you should preventively decrease FS caches as free physical memory runs low.

And I require this, as the animations consume much more memory than is available, so the FS caches need to be purged regularly.
The FS cache occupying all the system's RAM is no problem. After all, the kernel can simply discard entries on demand as needed

Re: Physical page allocation

Posted: Tue Sep 11, 2012 11:15 am
by rdos
Owen wrote:The FS cache occupying all the system's RAM is no problem. After all, the kernel can simply discard entries on demand as needed
That's not efficient, and could lead to deadlocks. It is mostly the page fault handler that allocates memory, and it can be called for almost everything, including on behalf of some device in kernel space (including the disc cache manager). That's why I prefer a lockless physical memory allocation algorithm. If memory allocation needs to scan disc or file buffers for free memory, allocation cannot be lockless, and could become a real bottleneck.

In my current page allocation algorithm, I never allow a thread that requests physical memory to try to free physical memory. If there is no free memory I either reboot or sleep and retry. Freeing caches is done by an independent thread that can never allocate memory itself.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 2:19 pm
by FlashBurn
As for contiguous memory, has anybody thought of other architectures? I think you will need this on the ARM architecture, or why would there be code in the Linux kernel for this if it isn't used/needed?!

Re: Physical page allocation

Posted: Tue Sep 11, 2012 2:48 pm
by Owen
rdos wrote:
Owen wrote:The FS cache occupying all the system's RAM is no problem. After all, the kernel can simply discard entries on demand as needed
That's not efficient, and could lead to deadlocks. It is mostly the page fault handler that allocates memory, and it can be called for almost everything, including on behalf of some device in kernel space (including the disc cache manager). That's why I prefer a lockless physical memory allocation algorithm. If memory allocation needs to scan disc or file buffers for free memory, allocation cannot be lockless, and could become a real bottleneck.
It simply plucks the first page from the kernel's least-recently-used page list, zeros its contents, and returns it. No need to take any non-MM locks.

Re: Physical page allocation

Posted: Tue Sep 11, 2012 9:10 pm
by Brendan
Hi,
evoex wrote:On the Wiki (http://wiki.osdev.org/ISA_DMA) it says:
A typical 1.44 MB floppy disk can easily transfer 36 sectors of data in a single transfer. This is only 18 KB. The biggest internal floppy with worst-case formatting may be able to transfer 84 sectors at once. This is still only 42 KB. A Soundblaster card may run best with a 64 KB buffer. It is never necessary to try to double-buffer ISA DMA transfers, because they are so slow anyway. There are at most 6 usable DMA channels, and it is not necessary to allocate a full 64 KB to each of them. Putting all of this together, any OS should be easily able to allocate all the ISA DMA physical memory that it needs from a 256 KB pool, or even only half of that.
6 * 64 KiB is the theoretical maximum amount of RAM that could be in use by the ISA DMA chips at any point in time. Nothing prevents a single driver from using multiple DMA buffers at different times.

For example, in theory you could have a floppy driver that allocates up to 160 separate 12 KiB DMA buffers (one "21 sectors * 512 bytes" buffer - about 10.5 KiB, rounded up to three 4 KiB pages - for each track on each side of the disk).

For another example, a device driver might use 2 DMA buffers. While one is being used by hardware, software might be preparing the next DMA transfer in the second buffer, so that as soon as the first DMA transfer completes the driver can switch buffers (such that the hardware uses the second buffer and software uses the first buffer to prepare the next transfer - a bit like "page flipping" for video, except maybe with audio data or something).
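As a toy illustration of that buffer-swapping idea (nothing here touches real hardware; the point is only the role swap on each completed transfer):

Code:

/* Double-buffering sketch: software fills one buffer while the (imaginary)
 * hardware transfers the other, then the two swap roles. */
#include <stdio.h>

#define BUF_SIZE 4096

int main(void)
{
    static char buffers[2][BUF_SIZE];
    int hw = 0, sw = 1;                 /* which buffer each side currently owns */

    for (int transfer = 0; transfer < 4; transfer++) {
        /* Prepare the next transfer while the hardware is busy with the other buffer. */
        snprintf(buffers[sw], BUF_SIZE, "data for transfer %d", transfer);

        /* On the "transfer complete" interrupt, swap the roles. */
        int tmp = hw; hw = sw; sw = tmp;
        printf("hardware now transferring: %s\n", buffers[hw]);
    }
    return 0;
}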
evoex wrote:After reading this, my plan was to simply pre-allocate a set amount... As 256k ought to be negligible for anyone...
In that case, can't you avoid the requirement of allocating sequential pages? You could simply make a much simpler memory allocator for DMA.
That's only ISA DMA.

There are a few PCI cards that don't support scatter/gather. For example, RDOS is right about RTL8139 ethernet cards - they need a minimum of "8 KiB plus 16 bytes" for a receive ring buffer (and the driver might want a larger ring buffer - e.g. up to "64 KiB plus 16 bytes"). If you've got 10 of these cards in a computer (maybe some sort of network router?) then that's up to 680 KiB.

I guess what I'm saying is that:
  • a very small amount of RAM will be allocated as contiguous physical pages; and there's no need to support allocating contiguous physical pages for all RAM
  • most of these allocations will probably be done during device driver startup (unlike normal allocations where pages are constantly being allocated/freed by processes, etc); and the speed of your "contiguous physical pages" allocations is far less important than the speed of allocating normal pages
  • you can't predict which PCI devices and other things will need contiguous physical pages when the kernel is being designed (unless the kernel is being designed for one specific "fixed hardware" computer - e.g. Xbox)
  • you might not be able to predict which PCI devices and other things will need contiguous physical pages during boot - e.g. if the OS supports hot-plug PCI; or if the OS allows missing drivers to be installed and started without a reboot; or if any drivers (including drivers for legacy/ISA devices and drivers for PCI devices) do dynamically allocate/free buffers.

Cheers,

Brendan

Re: Physical page allocation

Posted: Tue Sep 11, 2012 10:44 pm
by Brendan
Hi,
rdos wrote:
Owen wrote:The FS cache occupying all the system's RAM is no problem. After all, the kernel can simply discard entries on demand as needed
That's not efficient, and could lead to deadlocks. It is mostly the page fault handler that allocates memory, and it can be called for almost everything, including on behalf of some device in kernel space (including the disc cache manager). That's why I prefer a lockless physical memory allocation algorithm. If memory allocation needs to scan disc or file buffers for free memory, allocation cannot be lockless, and could become a real bottleneck.
I don't think there's one way that suits everyone - something like a monolithic kernel (where disk caches are handled by the kernel and can be freed very quickly) is going to be very different to something like a micro-kernel (where there may be nothing in kernel space that can be freed).

For a micro-kernel, I'd be tempted to have some pages reserved for emergencies. For example, if a normal application asks to allocate physical pages but there are fewer than <threshold> pages left, then you could put that task into a "blocked waiting for memory" state (and send messages to VFS, etc asking them to free some cache) and unblock the task when the pages become available later.
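A minimal sketch of that reserve idea, assuming a simple free-page counter (the threshold, the "critical" flag and all names are invented for illustration):

Code:

/* Ordinary allocations are refused (in a real kernel: the task is blocked and
 * the VFS is asked to free cache) once only the emergency reserve remains;
 * critical callers may still dip into the reserve. */
#include <stdbool.h>
#include <stddef.h>

#define EMERGENCY_RESERVE 32            /* pages held back for critical work */

static size_t free_pages = 1000;        /* toy stand-in for the real free-page count */

bool alloc_page(bool critical)
{
    if (free_pages == 0)
        return false;                   /* truly out of memory */
    if (!critical && free_pages <= EMERGENCY_RESERVE)
        return false;                   /* would block here and message the VFS */
    free_pages--;
    return true;
}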

I'd also say that for the most efficient memory management you'd need to consider the whole system. For example, sending an application's pages to swap space so that those pages can be used to cache file data may make sense (e.g. if the application's pages aren't likely to be used soon and if the file data is likely to be needed). You couldn't just say "file cache is always less important than an application's data".

I personally like the "system of weights" idea. For each thing that a physical page could be used for, calculate some sort of score (e.g. "score = chance data will be needed soon * cost of not having that data in RAM if it is needed"); then for a system with a total of N physical pages of RAM, use them for the N uses with the highest score. Of course this is "theoretically ideal page usage" that can't be achieved in practice (it'd involve far too much overhead); but you could (should?) attempt to implement a "less than perfect" emulation of this ideal behaviour (find a compromise between "perfect" and overhead).
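Purely as an illustration of that scoring idea (the numbers and names are made up):

Code:

/* "System of weights": score = chance the data is needed soon * cost of not
 * having it in RAM; physical pages go to the uses with the highest scores. */
#include <stdio.h>

struct candidate {
    const char *what;
    double chance_needed_soon;          /* 0.0 .. 1.0 */
    double cost_if_missing;             /* e.g. microseconds to fetch it back */
};

int main(void)
{
    struct candidate c[] = {
        { "hot file-cache block",    0.90,   8000.0 },
        { "idle application page",   0.05, 120000.0 },
        { "active application page", 0.70, 120000.0 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-26s score = %.0f\n", c[i].what,
               c[i].chance_needed_soon * c[i].cost_if_missing);
    return 0;
}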

I'd also say that there are only five categories of virtual pages (sketched as an enum after the list):
  • pages that are in RAM, that can't be freed because they're needed to fetch data (e.g. parts of kernel, disk drivers, etc)
  • pages that are in RAM, that can be freed immediately because they contain data that is already stored somewhere else (e.g. file system, swap)
  • pages that are in RAM, that would need to be stored somewhere before they're freed (e.g. modified data that would need to be saved to swap space)
  • pages that aren't in RAM, that need to be fetched from somewhere (e.g. file system, swap) before they can be used
  • pages that are full of zeros (where many virtual pages can refer to the exact same physical page full of zeros)
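Written out as an enum, purely as a summary of the list above (the names are invented):

Code:

enum page_category {
    PAGE_PINNED_IN_RAM,    /* needed to fetch data: kernel, disk drivers, ... */
    PAGE_CLEAN_IN_RAM,     /* already stored somewhere else, can be freed immediately */
    PAGE_DIRTY_IN_RAM,     /* must be written to swap/file system before freeing */
    PAGE_NOT_IN_RAM,       /* must be fetched from file system or swap before use */
    PAGE_ZERO_FILLED       /* many virtual pages can share one physical zero page */
};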

Cheers,

Brendan

Re: Physical page allocation

Posted: Wed Sep 12, 2012 6:09 am
by Owen
Brendan wrote:
rdos wrote:
Owen wrote:The FS cache occupying all the system's RAM is no problem. After all, the kernel can simply discard entries on demand as needed
That's not efficient, and could lead to deadlocks. It is mostly the page fault handler that allocates memory, and it can be called for almost everything, including on behalf of some device in kernel space (including the disc cache manager). That's why I prefer a lockless physical memory allocation algorithm. If memory allocation needs to scan disc or file buffers for free memory, allocation cannot be lockless, and could become a real bottleneck.
I don't think there's one way that suits everyone - something like a monolithic kernel (where disk caches are handled by the kernel and can be freed very quickly) is going to be very different to something like a micro-kernel (where there may be nothing in kernel space that can be freed).

For a micro-kernel, I'd be tempted to have some pages reserved for emergencies. For example, if a normal application asks to allocate physical pages but there are fewer than <threshold> pages left, then you could put that task into a "blocked waiting for memory" state (and send messages to VFS, etc asking them to free some cache) and unblock the task when the pages become available later.
Given that the kernel is keeping track of a page's clean/dirty state, it knows that any non-dirty pages are ones for which there is a backing store, and therefore it is entirely acceptable for it to discard their contents when an application needs more memory.

That's not the full story, of course; it needs to be able to send "Flush dirty page" requests to the VFS and such (preferably with priorities, e.g. I_REALLY_NEED_MEMORY_NOW vs SPECULATIVELY_WRITE_COPY_TO_BACKING_STORE)

(My design is a microkernel, BTW)

Re: Physical page allocation

Posted: Tue Oct 16, 2012 2:47 am
by rdos
I'll start a major rewrite (basically from scratch) of the physical memory allocation scheme. I've discovered that this part of the kernel can't be ported to PAE paging without great difficulty, so I need to do this before I continue with PAE.

I'll use the 4K bitmap approach, with a 4-byte header for each bitmap, as this can be implemented without spinlocks, which is a great advantage.

The header will look like this:

Code:

struct BitMapHeader
{
    short int FreePages;
    short int CurrentPos;
};
The FreePages field can be updated without spinlocks (lock add FreePages,1 and lock sub FreePages,1), and it also gives an indication that the current bitmap is empty so it can be switched away from. CurrentPos is only advisory, so it doesn't need a lock prefix. It indicates where to start searching for free pages. It is advanced when the current dword is 0 (no free pages), and when the position wraps around, the block is switched away from.

Allocating a bit is done with lock btr. If the operation fails (the bit was already 0), a new attempt is made. The choice of which bit to try is speculative and thus doesn't need locking (the algorithm will probably read a dword at CurrentPos, find a set bit (bsf), and try to allocate it).
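A rough C sketch of that allocation path, using GCC atomic builtins in place of lock btr and bsf; the function, its parameters and the wrap-around handling (omitted here) are illustrative assumptions, not the actual kernel code:

Code:

#include <stdint.h>

struct BitMapHeader {                     /* the 4-byte header described above */
    short int FreePages;
    short int CurrentPos;
};

/* Try to allocate one page from a single bitmap block. Returns the page index
 * within the block, or -1 if the block appears to have no free pages. */
static int try_alloc_page(uint32_t *bitmap, int ndwords, struct BitMapHeader *hdr)
{
    for (int i = hdr->CurrentPos; i < ndwords; i++) {
        uint32_t word = bitmap[i];
        while (word != 0) {
            int bit = __builtin_ctz(word);               /* like bsf: lowest set bit */
            uint32_t mask = 1u << bit;
            /* Atomically clear the bit; the allocation only succeeds if the
             * bit was still set (this stands in for lock btr). */
            if (__atomic_fetch_and(&bitmap[i], ~mask, __ATOMIC_SEQ_CST) & mask) {
                __atomic_fetch_sub(&hdr->FreePages, 1, __ATOMIC_SEQ_CST);
                hdr->CurrentPos = (short int)i;          /* advisory only */
                return i * 32 + bit;
            }
            word = bitmap[i];                            /* lost the race, retry */
        }
    }
    return -1;                                           /* caller switches to another block */
}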

Switching the active block is done by finding the first block with the largest number of free pages. This is done because allocation is faster the more free pages a bitmap contains. The search algorithm is lockless.
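For the block switching, a lockless "pick the block with the most free pages" search might look roughly like this (simplified to an array of headers; names invented):

Code:

/* Plain reads of FreePages are enough here: the result is only advisory, so a
 * slightly stale count does no harm. Ties keep the first (lowest) block. */
static int pick_active_block(struct BitMapHeader *headers, int nblocks)
{
    int best = -1;
    int best_free = 0;

    for (int i = 0; i < nblocks; i++) {
        int nfree = headers[i].FreePages;
        if (nfree > best_free) {
            best_free = nfree;
            best = i;
        }
    }
    return best;                         /* -1 means no block currently has free pages */
}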

The current active block is kept per processor core, which decreases contention. Later, it might be possible to improve this further by changing the block-switching algorithm to favour different blocks for different cores.

As an option, I might later switch allocation models when physical memory is scarce. Then I might let all frees push the pages onto a stack instead, from which allocate can retrieve them.

I think I will use 8 MB of linear address space for the page bitmaps, which would support 256 GB of physical memory.