OSDev.org

Posted: **Fri Jul 12, 2024 2:21 pm**

Hi all,

Currently in my 32 bit kernel I have the top GB of virt. address space allocated for kernel use, and have that offset and identity mapped to the first GB of physical memory for easy translation when working with process page dirs and tables (reserving the top 4mb of kernel space for special physical accesses).

I found that a lot of people use recursive mapping to access the page dir/table instead for 32 bit, which I found strange - considering that in x86_64 Linux they just map the entire physical address space above some offset - and this is just a minimal version of that for the address space that the kernel occupies.

Is there a reason why people use recursive mapping for x86 instead of just an offset identity map? Are there any unexpected pitfalls I'm getting into? I already account for reserved regions in my kernel memory manager off of the GRUB memory map, and I can't see any benefit of using recursive paging over an offset identity map.

Posted: **Fri Jul 12, 2024 9:15 pm**

On x86_64, you have way more virtual address space than physical, so you can just map all memory linearly to the start of kernel space, and still have enough left over to allow userspace free reign over its half and have a virtual mapping for kernel space.

On x86, you have 4GB of virtual address space and, with PAE, 4 EB of physical address space. Of course, when you shovel 10 pounds of stuff into a 5 pound bag, something will break, so a pure identity mapping will not work. Even without PAE, you have 4GB of physical address space, and linear mapping 4GB to the 3GB line will also not work.

In order to work around this and enable the OS to manage use of physical address space, there are only two solutions I have ever seen in use. One is the recursive mapping, and the other is to swap out parts of the kernel address space strategically as needed. Linux on 32-bit architectures pursues the latter strategy. It will linear map 768 MB to the 3GB line, and the last 256 MB it will map dynamically. The allocator, instead of returning pointers, will return handles, that can be turned into pointers with a map function, and should be unmapped when no longer in immediate use.

BTW, some dynamic mapping will always be necessary with the linear mapping approach, because devices will give you straight up physical addresses to work with, and you need to map those into the address space as well. And on the PC, those addresses will often be between 3GB and 4GB in the physical realm.

Recursive mapping has the advantage of being simpler to start with. I think that is why many people are drawn to it, particularly in a place calling itself "the place to start of operating system developers." The disadvantages are numerous, though:

All of the implementations I have seen will link an uninitialized page into the paging structure and then clear it. This has the potential of polluting the TLBs with nonsense, and I have railed against this before.
It becomes more complicated as soon as multiple CPUs enter the picture. Depending on your virtual memory model, you may need a lot of TLB shootdowns.
Recursive mapping always eats one full entry in the top level of your paging structure. While with 32-bit legacy paging, this is only 1/1024 of virtual address space, and with 64-bit paging, it is only 1/512, with PAE paging it is 1/4. That is typically what most OSes want to allocate for the entire kernel space in total.
With recursive mappings, you can only adjust the currently loaded page tables. With stuff like demand paging, the whole idea is to write pages that haven't been used in a while to swap space and set them to "not present" in the page table, but with recursive mappings you have to load the old CR3 again to do this.

Have I mentioned I am not a fan of recursive mappings? I have it easy, since I am developing a 64-bit OS and don't need to swap out parts of the address space.

Posted: **Fri Jul 12, 2024 11:24 pm**

Are there any mainstream OSs that have actually used recursive page tables? I find it strange how common it is in OS dev teaching material considering it's both more complex than offset mapping (I think a fairly big hurdle, conceptually, for beginners) and, seemingly, not widely used in industry.

It's particularly odd in tutorials written well after 64-bit CPUs became common. While I'm sure there are good reasons to use recursive mapping I feel like there's a sort of "cargo cult" effect at work.

Posted: **Sat Jul 13, 2024 11:45 am**

Hi,

Windows for 32 bit targets incorporated several strategies: one of which is recursive mapping. By default, it would map kernel space at 2GB however can be changed to 3GB. It used two memory pools for allocations, one of which are for pageable kernel objects. Additionally, kernel components and drivers can define pageable code and data segments that the IO Manager would detect and the Memory Manager can use to determine what is safe to page out. I am pretty confident Linux also used recursive mapping at 3GB along with pageable code and data segments. For 64 bit targets I believe both use linear mapping probably while supporting the other strategies.

Posted: **Sat Jul 13, 2024 12:28 pm**

neon wrote: ↑Sat Jul 13, 2024 11:45 am I am pretty confident Linux also used recursive mapping at 3GB along with pageable code and data segments.

It doesn't now, it hasn't since I first looked at the code a decade ago, and it would be pretty unintuitive for them to first introduce and then abolish the recursive mapping, though I wouldn't put it past the Linux developers. Linux started out with a linear map, and that did work out OK as long as RAM sizes were below 768 MB. At that point, its strategy no longer worked (because IO memory also takes up some space). However, that time was also the time PAE got traction. As mentioned already, recursive mapping with PAE is terrible, since it eats a whole 1GB of virtual address space, and Linux already only reserves 1GB of virtual address space for kernel space on x86_32.

The docs don't mention anything about a recursive mapping. Only the highmem stuff I already explained.

Posted: **Mon Jul 15, 2024 6:26 am**

I don't see why PAE would eat 1GB linear memory. PAE needs four physical pages at the top level instead of just one as with legacy paging. Thus, the top recursive mapping in the page directory for legacy paging contains one page and for PAE four pages. PAE needs to map twice as much linear memory since each entry is twice as large. In my design, I decide on PAE dynamically, but the recursive mapping is at a fixed address, so I always allocate space for PAE. In fact, PAE has the same overhead as long mode paging (1/512) since it uses the same page layout.

My design has a "double" recursive mapping. I both map the current process page tables and the "system" mapping of kernel page tables. This is because page directory entries in kernel are allocated dynamically, and thus another process could already have allocated a kernel page directory entry, and so when doing these allocations they are first done in the system mapping, and then copied to the process mapping. By requiring PAE, and having a 1GB kernel space, it would be possible to optimize this by setting the last top level PAE entry to a fixed kernel entry. I've not done this, but it would save linear kernel memory. In fact, this would provide the same 1/512 footprint for both legacy and PAE paging since PAE would only need one recursive mapping while legacy mode needs two.

An additional issue is how to manage physical memory. In 32-bit mode, with many GBs of physical memory, you need a scheme that doesn't consume lots of linear memory. Direct mapping of all physical memory into linear memory obviously doesn't work when there are several GBs of physical memory. The most effective design is the bitmap approach that uses one bit per 4k physical memory, resulting in a 1/32768 ratio.

I don't see how a usable design for 32-bit mode could use anything else than a recursive mapping. Linux and Windows both seems to have poor 32-bit designs that originated when physical RAM was low, and then have not been properly updated. Instead, these designs are now obsolete and replaced with 64-bit versions. The major issues in 32-bit mode is that you cannot map all physical memory and you cannot map disc caches (or any other potentially large objects) in linear kernel memory.

In kernel space, I have two basic memory allocators: A byte-aligned allocator (malloc) and a page-aligned allocator. The former has 128MB reserved linear address space and the latter has 768MB. It's typically the latter that will be problematic, particularly since it was used to cache disc data. That's why I have changed to a microkernel mode for file systems, which keeps caches in user-mode space and not in kernel space. I also no longer map disc data buffers in linear memory, rather keep lists of physical addresses to disc data. This change makes a big difference in memory usage in kernel, and allows to keep large disc caches far larger than 4G.

The issue with lots of TLB invalidations when physical addresses needs to be mapped to linear addresses can be solved by creating a kernel object cache that allow fast conversions between linear and physical addresses in typical memory schedules of modern devices that also solve the problem of same-size object allocation. The cache is page based, and keeps the physical address in the header. The cache should be created per device, and not be a general kernel feature. The linear to physical conversion (and the reverse) can thus easily be implemented by comparing with the physical address in the header, and require no mappings or TLB invalidations. An additional benefit is that the physical address space is more protected when it is not mapped. What you cannot access, you cannot corrupt.

Posted: **Mon Jul 15, 2024 8:55 am**

rdos wrote: ↑Mon Jul 15, 2024 6:26 am I don't see why PAE would eat 1GB linear memory.

You align the PDPT with a page boundary, then set one of the four entries to the address of the PDPT. That is a classic recursive mapping, and it eats 1/4 of address space, since the PDPT only has 4 entries.

rdos wrote: ↑Mon Jul 15, 2024 6:26 am the top recursive mapping in the page directory for legacy paging contains one page and for PAE four pages.

So you have a recursive mapping in the PDTs back to the PDTs. No idea how you manipulate the PDPT then, but this way you can manipulate the PDTs and PTs the same way as before. OK, that is feasible, and it does lower the overhead. I had not thought of that before.

Posted: **Mon Jul 15, 2024 9:10 am**

I don't think you need to manipulate the PDPT entries. You just allocate all four at process creation time. Ideally, it would be better to only allocate three, and then map the last one to a shared kernel entry (assuming 1G kernel space).

Posted: **Mon Jul 15, 2024 4:31 pm**

rdos wrote: ↑Mon Jul 15, 2024 6:26 am I don't see how a usable design for 32-bit mode could use anything else than a recursive mapping.

I honestly don't see how a usable design can use recursive mapping - if a page directory is only mapped in virtual address space when its currently in use, that means for things like copy-on-write, you would have to switch to the processes address space and edit its entries from there - it seems like it introduces a lot of avoidable problems. Is there any actual advantage over just identity mapping (with an offset) the regions where paging structures are created?

nullplan wrote: ↑Fri Jul 12, 2024 9:15 pm Linux on 32-bit architectures pursues the latter strategy. It will linear map 768 MB to the 3GB line, and the last 256 MB it will map dynamically. The allocator, instead of returning pointers, will return handles, that can be turned into pointers with a map function, and should be unmapped when no longer in immediate use.

When you say it will linear map 768 MB to the 3GB line, do you mean it will identity/offset map 768MB of memory from 3GB to 3GB+768MB in the virtual address space (where it places the paging structures for easy translation)? And is there a link you could provide where I can read about the handles that kmalloc() provides.. can't seem to find anything about that.

Posted: **Mon Jul 15, 2024 9:25 pm**

thexpiredpear wrote: ↑Mon Jul 15, 2024 4:31 pm When you say it will linear map 768 MB to the 3GB line, do you mean it will identity/offset map 768MB of memory from 3GB to 3GB+768MB in the virtual address space (where it places the paging structures for easy translation)?

Exactly. This means for everything in low memory, it knows the exact translation between physical and virtual address.

thexpiredpear wrote: ↑Mon Jul 15, 2024 4:31 pm And is there a link you could provide where I can read about the handles that kmalloc() provides.. can't seem to find anything about that.

Actually I misremembered that. kmalloc() always provides a pointer. It is alloc_pages() that provides a struct page, and that can be turned into a pointer using kmap(): https://elixir.bootlin.com/linux/latest ... rnal.h#L48

For further reading, I can only offer the kernel docs on memory allocation: https://www.kernel.org/doc/html/latest/ ... ation.html

Posted: **Tue Jul 16, 2024 2:43 am**

thexpiredpear wrote: ↑Mon Jul 15, 2024 4:31 pm I honestly don't see how a usable design can use recursive mapping - if a page directory is only mapped in virtual address space when its currently in use, that means for things like copy-on-write, you would have to switch to the processes address space and edit its entries from there - it seems like it introduces a lot of avoidable problems. Is there any actual advantage over just identity mapping (with an offset) the regions where paging structures are created?

I have a PC with 256GB of physical memory that I use for high-speed data acquisition using a PCIe board based on a FPGA with a two channel 1G sample/second ADC. My OS must definitely support this configuration and should be able to use all physical memory. Besides, basically all modern PCs come with 4G or more of physical memory, and so an 32-bit OS that can only use 768MB of physical memory is quite obsolete. To have a 256 MB "dynamic" space seems a lot like those horrible XMS and EMS standards of the past.

I don't see the problem with copy-on-write. I have a fork function that doesn't support all the stranger scenarios, but it works well enough if the code is well-behaved. it doesn't rely on the ability to access other address spaces. Copy-on-write uses an available bit in the page directory entry in the current address space. Additionally, it has other data structures to keep track of fork history and reference counts. Since my main mode of process creation is "CreateProcess", and not fork, fork is implemented as an add-on feature and doesn't need to be super-effective or 100% POSIX compatible. Therefore, page reference counts are only kept when a fork is active.

I do have a 3:rd recursive mapping in kernel that can be used to manipulate other address spaces. However, this is mostly used for debugging, which requires the ability to read-out and modify data in the debugged process since the debug server is running in it's own process. This is not used for copy-on-write or any other critical function in kernel.

OSDev.org

Identity mapping vs. Recursive mapping for x86

Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86

Re: Identity mapping vs. Recursive mapping for x86