Is it a good idea to use only 4KiB pages

Xeno · **Joined:** Tue Oct 10, 2023 7:40 pm **Posts:** 51

I am currently writing my memory manager for my 64 bit OS, I am writing in C now as assembly began to be more complex then C lol. So I was wondering if it would be a good or bad idea to only use 4 KiB pages insted of mixing 2 MiB and/or 1 GiB pages? I understand it would be more page table overhead, but if I understand correctly it would allow me to use a single bitmap to keep track of all physical memory, also it would allow me to never defragment memory to make more 2MiB pages avalible.

If that's not right please correct me as paging is something I'm still trying to fully comprehend.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

Xeno wrote:

So I was wondering if it would be a good or bad idea to only use 4 KiB pages insted of mixing 2 MiB and/or 1 GiB pages?

It depends. Some things perform better when bigger pages are available, other things don't. (Also, you do have the option of using only 2MiB pages.)

Xeno wrote:

I understand it would be more page table overhead, but if I understand correctly it would allow me to use a single bitmap to keep track of all physical memory, also it would allow me to never defragment memory to make more 2MiB pages avalible.

That's correct. You can also allow mixed page sizes without defragmenting memory, although that works best in situations where you're always running the same set of applications (e.g. a server in a datacenter) rather than a typical PC.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

Xeno wrote:

So I was wondering if it would be a good or bad idea to only use 4 KiB pages insted of mixing 2 MiB and/or 1 GiB pages?

Personally, I always use the largest page size possible for any given mapping. However, it turns out that in a lot of cases, the largest page size possible is 4kB. Sometimes memory types get in the way (if you use huge pages, the entire range must have the same memory type), but most of the time, the other parts of the OS don't want that much memory. My heap grows in 4kB increments, my DMA allocator rarely wants anything more than a couple of bytes, and in the end nothing needs all that much memory.

Plus, for userspace I intend to use the SysV ABI for x86-64, and that requires at least a capability for 4kB pages. Most programs are not big enough to justify a huge page.

Xeno wrote:

I understand it would be more page table overhead,

How? A huge page just short-circuits the page table lookup, so if anything, it decreases page table overhead. They may increase the amount of code you have to write, yes, but that's not overhead. They may also complicate your demand-paging code (as if that was necessary these days), and you may need to think differently about your data structures, but really, it isn't that much worse.

Xeno wrote:

but if I understand correctly it would allow me to use a single bitmap to keep track of all physical memory,

Again I scratch my head. A single 64-bit number is never enough to keep track of all physical memory. So a bitmap allocator always needs to use multiple words. And then it barely matters how many. If you use 4kB pages, you need one bit for every 4kB, with 2MB pages it would be 1/512 of that. But even at the 4kB level, you only need the 16384th part of the memory for the bitmap. If you have 16GB of RAM, you need 1MB of bitmap. As if that would matter.

BTW, I am using a memblock allocator, which is sort of like a bitmap allocator, but the bitmap is compressed with run-length encoding.

Xeno wrote:

also it would allow me to never defragment memory to make more 2MiB pages avalible.

Yes, if you run with 2MB as basic allocation size, you never need to move things to make 2MB available. However, if memory allocation gets so catastrophic that no continuous 2MB are available anymore, your demand pager has already overslept. We live in a time when a toy PC has 16GB of RAM and nothing to fill it with. I think you are going to be fine whatever pagesize you use.

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 403

Xeno wrote:

I am currently writing my memory manager for my 64 bit OS, I am writing in C now as assembly began to be more complex then C lol. So I was wondering if it would be a good or bad idea to only use 4 KiB pages insted of mixing 2 MiB and/or 1 GiB pages? I understand it would be more page table overhead, but if I understand correctly it would allow me to use a single bitmap to keep track of all physical memory, also it would allow me to never defragment memory to make more 2MiB pages avalible.

If that's not right please correct me as paging is something I'm still trying to fully comprehend.

If you parametrize your page size (default to 4K, but make it a CPP macro), then define an API to get/set page mappings, you can be largely page size agnostic in most of your code.

You can work with mapping virtual address ranges to physical pages (a lower and upper bound of the virtual address to physical address), then your API doesn't in fact care how big pages are, and you can vary and even mix page sizes.

For example, frame buffers are a great candidate for large page mappings, and you can map a frame buffer in a single API call using the virtual page range of the mapped frame buffer, along with the frame buffer physical address, and let the API decompose that to as many large 2MB page mappings as possible, with any non-aligned mappings using 4K page mappings.

The only constraint is that the virtual page range is aligned to the minimum page size.

In terms of managing (rather than mapping) free physical memory, in order to handle multiple page sizes, a buddy allocator is ideal to manage contiguous physical pages if you want to be able to allocate more than a single page.

But I don't know how much benefit that brings, to be honest. Outside of large direct mapped areas (such as the frame buffer example), I would wonder about the performance benefits of anything like 2MB pages for normal process text or data, against the massively increased internal fragmentation and inevitable memory wastage such a large page size would result in.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

nullplan wrote:

Xeno wrote:

I understand it would be more page table overhead,

How? A huge page just short-circuits the page table lookup, so if anything, it decreases page table overhead.

Using only 4kiB pages increases page table overhead, assuming you missed opportunities to use larger pages.

nullplan wrote:

A single 64-bit number is never enough to keep track of all physical memory. So a bitmap allocator always needs to use multiple words. And then it barely matters how many.

But it does matter if you want your bitmap allocator to be thread-safe. A bitmap that represents 4kiB pages requires 512 bits to represent a 2MiB page, and x86 can only perform atomic operations on 128 bits at once. (Some old CPUs are limited to 64 bits, even in 64-bit mode!)

thewrongchristian wrote:

Outside of large direct mapped areas (such as the frame buffer example), I would wonder about the performance benefits of anything like 2MB pages for normal process text or data, against the massively increased internal fragmentation and inevitable memory wastage such a large page size would result in.

For typical desktop PC applications, 2MiB pages usually hurt performance rather than helping. The only programs that benefit are ones that need to randomly access very large amounts of memory, such as databases.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3195

An bitmap allocator with 4k pages can implement 2M allocations using lock free methods even in 32 bit OS. You can use exchange with 32 bit operations and if nonzero results are returned you simply release the allocated pages again and retry.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

Octocontrabass wrote:

But it does matter if you want your bitmap allocator to be thread-safe. A bitmap that represents 4kiB pages requires 512 bits to represent a 2MiB page, and x86 can only perform atomic operations on 128 bits at once. (Some old CPUs are limited to 64 bits, even in 64-bit mode!)

I thought we were talking about physical memory? I only have one PMM for the whole system, and it works in 4kB pages, and I see little value in allocating larger blocks and then subdividing them further.

Thread safety is generally achieved with spinlocks or lock-free algorithms. In case of a bitmap, you can use cmpxchg or even bts to allocate a page. Though thinking about it, I think a spinlock is probably the way to go. It won't be held for long, too.

rdos wrote:

You can use exchange with 32 bit operations and if nonzero results are returned you simply release the allocated pages again and retry.

But if you read from the bitmap and get a nonzero result, then someone else has already allocated the page. So you cannot release it since it's not yours. I'd go for cmpxchg, then I don't change anything if something else changed concurrently.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3195

nullplan wrote:

Thread safety is generally achieved with spinlocks or lock-free algorithms. In case of a bitmap, you can use cmpxchg or even bts to allocate a page. Though thinking about it, I think a spinlock is probably the way to go. It won't be held for long, too.

I use an lock-free algorithm since I use the page fault handler to allocate physical memory for both pages and directory entries. These page faults could happen in scheduler or IRQs, so a lock-free algorithm makes sure these operations can never block.

nullplan wrote:

rdos wrote:

You can use exchange with 32 bit operations and if nonzero results are returned you simply release the allocated pages again and retry.

But if you read from the bitmap and get a nonzero result, then someone else has already allocated the page. So you cannot release it since it's not yours. I'd go for cmpxchg, then I don't change anything if something else changed concurrently.

Allocation is done by doing 32-bit xchg with zero on 16 dwords. If the result is not all one's, you release the allocated pages that were changed from 1 (available) to 0 (allocated). This is done by a locked or operation that for the first revert use the result from the failed xchg and in the following by using all 1s. I actually do a check before-hand to make sure all pages are free before trying to allocate. For single 4k allocation, I use bsf and a locked btr operation.

Xeno · **Joined:** Tue Oct 10, 2023 7:40 pm **Posts:** 51

Thank you everyone for your answers.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

nullplan wrote:

I thought we were talking about physical memory?

We are. Using 2MiB pages requires finding a naturally-aligned 2MiB chunk of physical memory, and that means either locking the bitmap (preventing any allocations from happening in parallel) or doing several lock-free partial allocations and rolling back if another CPU manages to sneak in an allocation (easy to mess up).

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1605

Octocontrabass wrote:

We are. Using 2MiB pages requires finding a naturally-aligned 2MiB chunk of physical memory, and that means either locking the bitmap (preventing any allocations from happening in parallel) or doing several lock-free partial allocations and rolling back if another CPU manages to sneak in an allocation (easy to mess up).

Ah! Now I finally understand what rdos was on about. OK, yes this makes sense. Of course, I just have a memblock allocator. It is impossible to use it in a lock-free manner, so I don't bother. Allocation means iterating over the intersection between the "available" blocks and the holes between the "reserved" blocks, finding a block that contains something that suits your needs, then adding that range to the list of "reserved" blocks.

Personally, I would suggest to use locking even if you have a bitmap allocator, since with the strategy presented here, a failing allocation results in transient resource usage, which can cause other allocations to fail for no real reason. With locking, at least the answer you get is the correct answer for that moment in time.

devc1 · **Posted:** Fri Nov 10, 2023 8:36 am

Yes, and I think its better to use locking once instead of using it for every bit. Lock with instructions makes them much slower they were like 7 times slower in my tests.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5146

devc1 wrote:

Yes, and I think its better to use locking once instead of using it for every bit. Lock with instructions makes them much slower they were like 7 times slower in my tests.

Are you talking about locking or the x86 LOCK prefix? Those are two different things.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3195

devc1 wrote:

Yes, and I think its better to use locking once instead of using it for every bit. Lock with instructions makes them much slower they were like 7 times slower in my tests.

I think you are wrong about that. Acquiring a spinlock, doing some tests, and releasing the spinlock is a lot slower than a single lock btr instruction. Also, btr sets carry flag in a way that makes it possible to verify if the allocation succeeded or not. Of course, that only works with assembly code (or inline assembly).

The bitmap approach (which can be done lock-free) is a lot better in protected mode since the overhead is much smaller, and it also means you don't have to map physical memory in linear memory, which linked lists requires. Linked lists cannot be implemented lock-free. Another advantage of bitmaps is that it can detect double-frees, something that typically corrupts a link list.

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3195

nullplan wrote:

Personally, I would suggest to use locking even if you have a bitmap allocator, since with the strategy presented here, a failing allocation results in transient resource usage, which can cause other allocations to fail for no real reason. With locking, at least the answer you get is the correct answer for that moment in time.

Failing is an exception. For 4k allocations, I know the bit is free, and so it will only fail if there is contention, which is an exception. Same with the 2M allocator. It will check so the memory area is free before trying to allocate it. I typically only use the 2M allocator for special cases.

I agree that lock-free operation needs to be carefully implemented, and is only possible with a bitmap allocator, and not a linked list approach. However, once it's in place and working, it will consume less physical memory and less linear memory, and will not expose physical memory to kernel space.

OSDev.org

Is it a good idea to use only 4KiB pages

Who is online