Physical memory manager for virtual memory and paging issue

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
iman
Member
Posts: 84
Joined: Wed Feb 06, 2019 10:41 am
Libera.chat IRC: ImAn

Physical memory manager for virtual memory and paging issue

Post by iman »

Hi.

I have some misconceptions about using a physical memory allocator for virtual memory allocation and paging.

The code for the physical memory allocator resides in the kernel (below 1 MB) and is mapped into every process address space, but the physical addresses that the allocator hands out as 4KB frames are in the range 0x01000000 to 0xC0000000. It does not matter who is asking for a frame of memory; the allocator just returns a void* addressing a 4KB block somewhere in this range.

Here comes the mixing up of concepts.
- You turn on paging after you've set up your initial page directory, identity-mapped your kernel, and loaded its address into the CR3 register.
- Every time you create a new process, you will need a page_directory and possibly a page_table to be allocated by your physical allocator. When you call your allocator and it wants to access an address in the range 0x01000000 to 0xC0000000, there will be a page fault, and the address has to be mapped. Additionally, if the process itself asks for memory, the physical memory allocator once again faces a page fault.

In principle, in this design, will you get a page fault whenever your physical memory manager (or allocator) gets called?

Best.
Iman.
Iman Abdollahzadeh
Github
Codeberg
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Physical memory manager for virtual memory and paging issue

Post by thewrongchristian »

iman wrote:Hi.

I have some misconceptions about using a physical memory allocator for virtual memory allocation and paging.

The code for the physical memory allocator resides in the kernel (below 1 MB) and is mapped into every process address space, but the physical addresses that the allocator hands out as 4KB frames are in the range 0x01000000 to 0xC0000000. It does not matter who is asking for a frame of memory; the allocator just returns a void* addressing a 4KB block somewhere in this range.

Here comes the mixing up of concepts.
- You turn on paging after you've set up your initial page directory, identity-mapped your kernel, and loaded its address into the CR3 register.
- Every time you create a new process, you will need a page_directory and possibly a page_table to be allocated by your physical allocator. When you call your allocator and it wants to access an address in the range 0x01000000 to 0xC0000000, there will be a page fault, and the address has to be mapped. Additionally, if the process itself asks for memory, the physical memory allocator once again faces a page fault.

In principle, in this design, will you get a page fault whenever your physical memory manager (or allocator) gets called?

Best.
Iman.
You get page faults when a page isn't mapped. If you've already identity-mapped your kernel, it shouldn't page fault. When you create your new page table, make sure you set up the shared kernel mappings in it before switching to it.
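To make that concrete, copying the shared kernel mappings into a fresh page directory can be as simple as this. A minimal sketch, assuming a 1024-entry page directory and a higher-half split at 0xC0000000; for a low, identity-mapped kernel the entries to copy are the low ones instead, and kernel_page_directory is just a made-up name for wherever the master kernel mappings live:

Code:

#include <stdint.h>
#include <string.h>

#define PDE_COUNT        1024
#define KERNEL_PDE_START 768   /* first PDE of the kernel half: 0xC0000000 / 4MB */

extern uint32_t kernel_page_directory[PDE_COUNT]; /* master copy of the kernel mappings */

void init_process_page_directory(uint32_t *new_pd)
{
    /* User half starts out empty: every access there faults until it is mapped. */
    memset(new_pd, 0, KERNEL_PDE_START * sizeof(uint32_t));

    /* Kernel half is shared: reuse the same page tables as every other address
     * space, so the allocator's code and data stay reachable after the switch. */
    memcpy(&new_pd[KERNEL_PDE_START], &kernel_page_directory[KERNEL_PDE_START],
           (PDE_COUNT - KERNEL_PDE_START) * sizeof(uint32_t));
}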

You might also want to look at:

- Not identity mapping physical memory to virtual memory
- Having a higher half mapping for your kernel

The reasons for the former are that relying on such a mapping constrains how much physical memory you can address to the size of your kernel virtual address space. Perhaps not a problem for an embedded system, but a problem for a modern desktop system in 32-bit (though not in 64-bit). Also, not all platforms have physical memory starting at 0; some platforms start physical memory higher, and an identity-mapped kernel will sit just as high in virtual memory and waste virtual address space (again, more of a problem in 32-bit than in 64-bit).

For the latter, it's nicer to have user processes occupy the lower end of virtual memory. That way, you can move the user/kernel boundary without impacting backward compatibility. For example, Windows used to have the kernel address space start at 0x80000000, but optionally moved that boundary to 0xc0000000 (thus giving user processes more address space) without affecting compatibility for user programs. (Hmm, now that I write it, that shouldn't affect compatibility either, at least if the boundary is moving up and giving the user process more space.)
iman
Member
Posts: 84
Joined: Wed Feb 06, 2019 10:41 am
Libera.chat IRC: ImAn

Re: Physical memory manager for virtual memory and paging issue

Post by iman »

thewrongchristian wrote:You get page faults when a page isn't mapped. If you've already identity-mapped your kernel, it shouldn't page fault. When you create your new page table, make sure you set up the shared kernel mappings in it before switching to it.
Yes, the shared kernel has been identity-mapped. But every time the kernel calls alloc_phys_frame(), there is something like:

Code:

#define PHYS_MEM_BASE 0xA0000000 //for instance
static unsigned long next_free_frame; //byte offset of the next unused frame

void* alloc_phys_frame(void)
{
    void* frame = (void*)(PHYS_MEM_BASE + next_free_frame);
    next_free_frame += 4096; //bump to the next 4KB frame for the next call
    return frame;
}
which will end up in a page fault (PHYS_MEM_BASE resides far above the kernel code below 1MB) if the page for 0xA0000000 does not yet exist, right?
Iman Abdollahzadeh
Github
Codeberg
reapersms
Member
Posts: 48
Joined: Fri Oct 04, 2019 10:10 am

Re: Physical memory manager for virtual memory and paging issue

Post by reapersms »

That won't page fault unless you actually dereference the pointer you constructed. Until you do that, as far as the CPU is concerned that's just another arbitrary value sitting in a register.

I would expect the code that calls this to take that value, and chop it up to go in the page table entry it's filling out...
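Something like this, purely as a sketch (map_page, pte_for and the flag values are made up for illustration, not anyone's actual API):

Code:

#include <stdint.h>

#define PTE_PRESENT  0x001u
#define PTE_WRITABLE 0x002u

void     *alloc_phys_frame(void);   /* returns a physical address, never dereferenced */
uint32_t *pte_for(void *virt);      /* locates the page table entry for a virtual address */

void map_page(void *virt)
{
    uint32_t phys = (uint32_t)alloc_phys_frame();   /* still just a number at this point */

    *pte_for(virt) = (phys & ~0xFFFu) | PTE_PRESENT | PTE_WRITABLE;

    /* Only after this mapping exists may the kernel dereference 'virt';
     * dereferencing 'phys' itself was never the plan. */
}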
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Physical memory manager for virtual memory and paging issue

Post by nullplan »

And I keep telling people that physical addresses are not pointers. This is why. For the purposes of a higher-half kernel, a physical address is just some number (with some important properties I won't detail here). It is the corresponding virtual address which is a pointer and can be dereferenced, but not the physical address. In particular, further statements from you indicate you are in 32-bit mode, in which physical addresses don't even need to have the same size as virtual ones. So I have a special type for this, but you probably want something like this:

Code:

#include <inttypes.h>
#if defined CONFIG_PAE || defined __x86_64__
typedef uint64_t phys_addr_t;
#else
typedef uint32_t phys_addr_t;
#endif
And then exclusively use that type when referring to physical addresses.
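For illustration only (the function names are made up), the interfaces on top of it end up looking like this:

Code:

#include <stdint.h>

#if defined CONFIG_PAE || defined __x86_64__
typedef uint64_t phys_addr_t;   /* the typedef from above */
#else
typedef uint32_t phys_addr_t;
#endif

/* The allocator only ever hands out phys_addr_t values, and the mapping layer
 * is the single place where a frame is turned into something dereferenceable. */
phys_addr_t phys_frame_alloc(void);
void       *vm_map(phys_addr_t frame, unsigned flags);

Passing a phys_addr_t where a pointer is expected then at least draws a compiler diagnostic, which is exactly the confusion in the original question.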
iman wrote:- You turn on paging after you've set up your initial page directory, identity-mapped your kernel, and loaded its address into the CR3 register.
Yes, that is also something I first had to learn. You can use a temporary environment: allocate some memory to house the initial page directory and page tables, create a linear map of kernel memory, and activate paging. Now what? Now you are stuck with a linear mapping of kernel memory to the higher half. You can create new paging structures and switch to those instead, as long as they contain the same linear map (that is, a "mov cr3, xxx" instruction must always be at an address that translates to the same physical address in both the old and the new address space). Turning on paging is merely a special case of this, where the old address space is "identity-map everything".
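As a sketch of that special case: the initial page directory can simply contain the kernel's 4MB region twice, once at its physical address and once at its link address, so the instruction stream survives the moment paging is switched on (assumes CR4.PSE for 4MB pages; the constants are examples, not anyone's real layout):

Code:

#include <stdint.h>

#define PDE_PRESENT  0x001u
#define PDE_WRITABLE 0x002u
#define PDE_4MIB     0x080u          /* requires CR4.PSE */
#define KERNEL_PHYS  0x00100000u     /* where the bootloader put the kernel (example) */
#define KERNEL_VIRT  0xC0100000u     /* where the kernel is linked (example) */

__attribute__((aligned(4096)))
uint32_t boot_page_directory[1024];

void build_boot_directory(void)
{
    uint32_t pde = (KERNEL_PHYS & 0xFFC00000u) | PDE_PRESENT | PDE_WRITABLE | PDE_4MIB;

    /* Identity entry: only there so the code that loads CR3/CR0 keeps executing. */
    boot_page_directory[KERNEL_PHYS >> 22] = pde;

    /* Higher-half entry: the linear map the kernel actually runs from afterwards. */
    boot_page_directory[KERNEL_VIRT >> 22] = pde;
}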
iman wrote:- Every time you create a new process, you will need a page_directory and possibly a page_table to be allocated by your physical allocator. When you call your allocator and it wants to access an address in the range 0x01000000 to 0xC0000000, there will be a page fault, and the address has to be mapped. Additionally, if the process itself asks for memory, the physical memory allocator once again faces a page fault.
And this is where we get into territory where I changed my mind, but it doesn't really matter to me. I saw that the Linux kernel will just linearly map a lot of memory (like 768MB, which is most of the 1GB the kernel reserves for itself) to the 3GB line. I used to think this was a hack, but it is actually really clever. It allows the kernel to always look at all physical memory below a certain line, and it keeps complexity in check. Now, you can only deal with 768 MB of RAM with this approach, but that is justifiable, given that these days, you either have a lot of RAM and a 64-bit machine, or little RAM and a 32-bit machine. Little RAM and 64-bits might happen as well, but lots of RAM and 32-bits is a combination not really seen in the wild anymore. And yes, 768MB of memory is still plenty, especially for a hobby OS. And the reason it isn't 1 GB is because you also need to map I/O memory somewhere.
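With such a linear map, converting between physical and kernel-virtual addresses is just an offset. A minimal sketch, assuming the first 768MB of RAM are mapped at the 3GB line (names are illustrative):

Code:

#include <stdint.h>

#define KERNEL_LINEAR_BASE 0xC0000000u   /* the 3GB line */
#define LINEAR_MAP_SIZE    0x30000000u   /* 768MB covered by the map */

static inline void *phys_to_virt(uint32_t phys)
{
    /* Only valid for physical addresses below LINEAR_MAP_SIZE. */
    return (void *)(phys + KERNEL_LINEAR_BASE);
}

static inline uint32_t virt_to_phys(const void *virt)
{
    return (uint32_t)(uintptr_t)virt - KERNEL_LINEAR_BASE;
}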
The "hack" part I changed my mind about is this: This model means you must hardcode the location your kernel will be loaded to into the linker script, because that plus 3GB is where the kernel will end up being mapped. And it always bothered me to do it this way. Ideally, with virtual addressing, it shouldn't matter where the binary is located in physical memory, or even if it is contiguous. But in 32-bit mode, there is simply no space for a linear mapping and an additional "correct" kernel mapping. And doing this the "proper" way means adding temporary mappings to kernel space, where you end up mapping in a physical page for a short time, changing something, mapping it out again, and so forth. Which ends up complicating all memory management from the PMM down, and once multiple CPUs get into the mix you start tearing your hair out.

Now I'm working in 64-bit mode, and don't have that issue. I have more than enough virtual memory space to map all physical memory that might ever be in the system, and then some, but I still can, and indeed must, map my kernel "properly", because I am compiling in the "kernel" code model, which presupposes that all link-time addresses will be in the last 2GB of address space. And I still have more than enough virtual memory left over to do my dynamic allocations with (kmalloc memory is virtually contiguous but might be physically fragmented, and there is absolutely no downside to that)
Carpe diem!
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Physical memory manager for virtual memory and paging issue

Post by thewrongchristian »

nullplan wrote:... It is the corresponding virtual address which is a pointer and can be dereferenced, but not the physical address. In particular, further statements from you indicate you are in 32-bit mode, in which physical addresses don't even need to have the same size as virtual ones. So I have a special type for this, but you probably want something like this:

Code:

#include <inttypes.h>
#if defined CONFIG_PAE || defined __x86_64__
typedef uint64_t phys_addr_t;
#else
typedef uint32_t phys_addr_t;
#endif
And then exclusively use that type when referring to physical addresses.
I use something similar, but my physical addresses are physical page numbers (not including the offset), so when I move to 64-bit I'll be able to keep my physical addresses 32-bit for the foreseeable future.
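Roughly like this, as a sketch (the names are made up): with 4KB pages, 32-bit frame numbers cover 16TB of physical memory, which is plenty for now.

Code:

#include <stdint.h>

typedef uint32_t pfn_t;        /* physical page frame number */
#define PAGE_SHIFT 12

static inline pfn_t    phys_to_pfn(uint64_t phys) { return (pfn_t)(phys >> PAGE_SHIFT); }
static inline uint64_t pfn_to_phys(pfn_t pfn)     { return (uint64_t)pfn << PAGE_SHIFT; }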
nullplan wrote:
iman wrote:- You turn on paging after you've set up your initial page directory, identity-mapped your kernel, and loaded its address into the CR3 register.
Yes, that is also something I first had to learn. You can use a temporary environment: allocate some memory to house the initial page directory and page tables, create a linear map of kernel memory, and activate paging. Now what? Now you are stuck with a linear mapping of kernel memory to the higher half. You can create new paging structures and switch to those instead, as long as they contain the same linear map (that is, a "mov cr3, xxx" instruction must always be at an address that translates to the same physical address in both the old and the new address space). Turning on paging is merely a special case of this, where the old address space is "identity-map everything".
...
And this is where we get into territory where I changed my mind, but it doesn't really matter to me. I saw that the Linux kernel will just linearly map a lot of memory (like 768MB, which is most of the 1GB the kernel reserves for itself) to the 3GB line. I used to think this was a hack, but it is actually really clever. It allows the kernel to always look at all physical memory below a certain line, and it keeps complexity in check. Now, you can only deal with 768 MB of RAM with this approach, but that is justifiable, given that these days, you either have a lot of RAM and a 64-bit machine, or little RAM and a 32-bit machine. Little RAM and 64-bits might happen as well, but lots of RAM and 32-bits is a combination not really seen in the wild anymore. And yes, 768MB of memory is still plenty, especially for a hobby OS. And the reason it isn't 1 GB is because you also need to map I/O memory somewhere.
I still think it's a hack. It was a reasonable hack in the early 90s, and fits in nicely with platforms like 32-bit MIPS.

But it's not very general, and once I decided to move my kernel boundary to 0xf0000000, I didn't have a huge amount of virtual memory to map physical memory with anyway, so I went entirely the other way, and none of my kernel is identity-mapped other than the static portions (basically the bits loaded by the bootloader and a bootstrap heap).

Now, both my user and kernel address spaces are covered by region descriptors that can resolve page faults, including my heap, so no memory other than the bootstrap memory is reserved and constrained for kernel use.

Part of my problem with the kernel->physical map plan is just managing which physical pages are used elsewhere, and thus off limits to the kernel. How do you manage that? Do you unmap pages from your kernel map as they are doled out?
nullplan wrote: The "hack" part I changed my mind about is this: This model means you must hardcode the location your kernel will be loaded to into the linker script, because that plus 3GB is where the kernel will end up being mapped. And it always bothered me to do it this way. Ideally, with virtual addressing, it shouldn't matter where the binary is located in physical memory, or even if it is contiguous. But in 32-bit mode, there is simply no space for a linear mapping and an additional "correct" kernel mapping. And doing this the "proper" way means adding temporary mappings to kernel space, where you end up mapping in a physical page for a short time, changing something, mapping it out again, and so forth. Which ends up complicating all memory management from the PMM down, and once multiple CPUs get into the mix you start tearing your hair out.
What's wrong with temporary mappings? So long as the temporary VA areas can be 'owned' by the threads that need them, you don't even have to worry about tracking (or even unmapping) old mappings. The owning thread will know all the details of the temporary mapping, and presumably once we're done, you just put the VM area back into a pool for reuse when you need temporary mappings again.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Physical memory manager for virtual memory and paging issue

Post by nullplan »

thewrongchristian wrote:once I decided to move my kernel boundary to 0xf0000000,
Why would you do that? BTW, did you know that it is undefined behavior in C to have any single object on a 32-bit system that is larger than 2GB? Because that means that pointer differences within that object no longer work. Giving userspace this much virtual memory only means it can violate this rule more often.
thewrongchristian wrote:What's wrong with temporary mappings?
The added complexity. I am sharing as much as possible of the paging structures on the kernel side. With 32-bit PAE paging, this would mean the kernel-side PDP is the same in all threads (on 64-bit, it means the second half of the PML4T is copied to be the same everywhere). Any change of the kernel memory mapping table must therefore be synchronized. So, anything that touches on kernel-VM must grab a spinlock. With temporary mappings, you end up having to do that for a whole lot of things you didn't in the past. Example: Examining a foreign paging structure. People love to use recursive page tables to examine their own structures, but when you have identified a memory block to be swapped out, the whole point of that is that the memory block is in possession of another process. So you end up having to mark those page table entries as "not present". Since they are no longer always mapped in, you have to map in the PDT, find the PT, map out the PDT, map in the PT, mark the PT as not present, map out the PT.

But that is not the worst of it. The worst is when multiple CPUs get into the mix. Now when you add a kernel-side mapping, you actually don't have to do anything. The mapping will propagate to the other CPUs with page faults. But when you remove or change a mapping, now you have to do a TLB shootdown (and with temporary mappings, you have to remove mappings as often as you add them). So you have to send IPIs to all other CPUs and make them invalidate a TLB, and then you have to wait for all of them to actually do that, and if you thought a spinlock was a bottleneck, try looking at CPU barriers. These things also don't scale. The current threadripper has 128 threads or so. How long am I supposed to wait for? And then there is the thorny issue of timeouts. Do I add a timeout to the process? That would leave the possibility that one CPU was somehow blocked for too long and didn't get the shootdown request, and is still accessing the wrong memory. Leading to fun times debugging the issue.
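In skeleton form, the wait I am complaining about looks something like this (every name here is a hypothetical stand-in; a real implementation would also carry the address range along and deal with the timeout question above):

Code:

#include <stdatomic.h>

#define IPI_TLB_FLUSH 0xFD

extern int  cpu_count;
extern int  this_cpu_id(void);
extern void send_ipi(int cpu, int vector);
extern void flush_tlb_local(void);       /* INVLPG or a CR3 reload on this CPU */

static atomic_int shootdown_pending;

void tlb_shootdown(void)
{
    atomic_store(&shootdown_pending, cpu_count - 1);

    for (int cpu = 0; cpu < cpu_count; cpu++)
        if (cpu != this_cpu_id())
            send_ipi(cpu, IPI_TLB_FLUSH);

    /* This is the part that does not scale: every mapping change stalls until
     * the slowest of the other CPUs has taken the interrupt and acknowledged. */
    while (atomic_load(&shootdown_pending) > 0)
        ;   /* spin */
}

/* Executed by each target CPU from its IPI handler: */
void tlb_flush_ipi_handler(void)
{
    flush_tlb_local();
    atomic_fetch_sub(&shootdown_pending, 1);   /* acknowledge */
}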

Or, you could just avoid all of this, and map all memory linearly to the start of kernel space. Now TLB shootdowns are no longer necessary. If I end up changing the virtual memory of a process with multiple threads, I can simply tell the other CPUs to schedule a new process once I am done. Userspace mappings are not global, so they will be flushed out of the TLB once scheduling happens. And for the schedule IPI, I don't need any additional memory or any synchronization. It is just fire-and-forget.
Carpe diem!
iman
Member
Posts: 84
Joined: Wed Feb 06, 2019 10:41 am
Libera.chat IRC: ImAn

Re: Physical memory manager for virtual memory and paging issue

Post by iman »

reapersms wrote:That won't page fault unless you actually dereference the pointer you constructed
nullplan wrote:And I keep telling people that physical addresses are not pointers. This is why.
That solves my confusion. Even if I always use physical addresses as uint32_t in protected mode, or uint64_t in long mode, my confusion was about to what extent something like return (void*)address causes a page fault.
Iman Abdollahzadeh
Github
Codeberg
Octocontrabass
Member
Posts: 5572
Joined: Mon Mar 25, 2013 7:01 pm

Re: Physical memory manager for virtual memory and paging issue

Post by Octocontrabass »

nullplan wrote:BTW, did you know that it is undefined behavior in C to have any single object on a 32-bit system that is larger than 2GB? Because that means that pointer differences within that object no longer work.
Taking the difference between two pointers to such a large object is undefined, but I don't see any reason why other operations wouldn't work. Most programs don't allocate individual objects that large, and taking the difference of two pointers to different objects is always undefined.
nullplan wrote:Since they are no longer always mapped in, you have to map in the PDT, find the PT, map out the PDT, map in the PT, mark the PT as not present, map out the PT.
I haven't tried it myself, but I have wondered if it's possible to map the page tables all the time specifically for situations like this. Page tables are relatively small, so there might be enough virtual address space to make it work.

Also, some AMD CPUs have a feature (Translation Cache Extension) that might reduce the impact of frequent INVLPG.
nullplan wrote:But when you remove or change a mapping, now you have to do a TLB shootdown (and with temporary mappings, you have to remove mappings as often as you add them).
If you reserve portions of your address space for each CPU to use for temporary mappings, you can lock the thread to the CPU where it's performing the temporary mapping and skip the IPIs. The other CPUs may have stale TLB entries for the temporary mappings, but those aren't very important since the other CPUs will never write to those addresses or explicitly read from them (though they may still prefetch or speculatively read using the stale TLB entries).

This only works for TLBs and not paging structure caches; you still need IPIs if you're modifying paging structure entries that point to further paging structures.
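A sketch of what I mean (names and constants are made up, and the thread must stay pinned to the CPU for the lifetime of the mapping):

Code:

#include <stdint.h>

#define TEMP_MAP_BASE 0xFF800000u   /* reserved region, one 4KB page per CPU */
#define PAGE_SIZE     4096u

extern int       this_cpu_id(void);
extern uint32_t *pte_for(void *virt);   /* locates the page table entry for a virtual address */

static inline void invlpg(void *virt)
{
    __asm__ volatile("invlpg (%0)" :: "r"(virt) : "memory");
}

void *temp_map(uint32_t phys_frame)     /* caller stays pinned to this CPU */
{
    void *virt = (void *)(TEMP_MAP_BASE + (uint32_t)this_cpu_id() * PAGE_SIZE);

    *pte_for(virt) = (phys_frame & ~0xFFFu) | 0x3;   /* present + writable */
    invlpg(virt);                                    /* local flush only, no IPIs needed */
    return virt;
}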
nullplan wrote:Do I add a timeout to the process? That would leave the possibility that one CPU was somehow blocked for too long and didn't get the shootdown request, and is still accessing the wrong memory. Leading to fun times debugging the issue.
The VMware hypervisor panics if not all CPUs respond to a TLB shootdown before the timeout.
nullplan wrote:If I end up changing the virtual memory of a process with multiple threads, I can simply tell the other CPUs to schedule a new process once I am done.
Huh, that's really clever. What happens if there are no other processes available to be scheduled?
nullplan wrote:It is just fire-and-forget.
As long as you ensure the TLB and paging structure caches have been flushed before you reuse the memory.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Physical memory manager for virtual memory and paging issue

Post by nullplan »

Octocontrabass wrote:Taking the difference between two pointers to such a large object is undefined, but I don't see any reason why other operations wouldn't work.
I'm not sure either; however, that was the reasoning given when I reported to the musl mailing list that their qsort() code can overflow if the object given is extremely large, resulting in an infinite loop. musl implements the Smoothsort algorithm, and will calculate all Leonardo numbers necessary as the very first thing. Those numbers will be scaled by element size, so they cannot just be precomputed. The algorithm must calculate until it reaches the Leonardo number (scaled by element size) that exceeds the array size. That number will never be used, but it is the signal to stop. If the input array is larger than the largest representable Leonardo number (times scale), the calculation overflows and the loop condition never becomes false, thus looping forever (or, more accurately, until all stack is overwritten and the process crashes). When I suggested to also stop the loop on overflow, that was rejected by saying that such large objects are generally undefined. Musl's allocator also prohibits allocating anything larger than PTRDIFF_MAX for the same reason. Pointer differences are very important to Smoothsort, BTW.
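The precomputation is roughly this (not musl's literal code; my suggested fix would have added an extra stop condition for wraparound, e.g. when the newly computed value comes out smaller than the previous one):

Code:

#include <stddef.h>

/* Leonardo numbers L(0)=L(1)=1, L(n)=L(n-1)+L(n-2)+1, scaled by the element
 * width, computed until one reaches the total array size in bytes. If 'size'
 * exceeds the largest representable scaled Leonardo number, the addition
 * wraps around, the condition never becomes false, and 'i' runs off the end
 * of lp[], trashing the stack. */
static void leonardo_sizes(size_t lp[], size_t width, size_t size)
{
    size_t i;
    lp[0] = lp[1] = width;
    for (i = 2; (lp[i] = lp[i-2] + lp[i-1] + width) < size; i++)
        ;   /* stops at the first value >= size -- unless it overflowed */
}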
Octocontrabass wrote:I haven't tried it myself, but I have wondered if it's possible to map the page tables all the time specifically for situations like this. Page tables are relatively small, so there might be enough virtual address space to make it work.
I don't know and it is no skin off my nose either way. I'm in 64-bit mode.
Octocontrabass wrote:The VMware hypervisor panics if not all CPUs respond to a TLB shootdown before the timeout.
Super, exactly what I wanted: More calls to panic(). Because when that function is called, the user is going to be so very pleased. Especially when it is about something they can do nothing about.
Octocontrabass wrote:Huh, that's really clever. What happens if there are no other processes available to be scheduled?
Huh, hadn't thought of that. I'll add special code to always at least reload CR3 in that case, to force a TLB flush.
Octocontrabass wrote:As long as you ensure the TLB and paging structure caches have been flushed before you reuse the memory.
Dangit, you're right. The paging structures have been changed by the currently running CPU by the time the syscall returns, but the TLBs in the other processors will only get flushed once those get around to it. A quick reuse might cause a race condition where the memory is used by two processes for a short time. That is mitigated by the fact that the interrupt flag is always active in user mode. So by the time the syscall returns, the current CPU will have sent IPIs to all other CPUs, which should cause them to switch to the interrupt handler immediately. If they are running in user mode, anyway. If they are running in kernel mode, they cannot be accessing user memory. Unless they are executing a system call, but system calls always have interrupts active, since they are always at the top of kernel stack. (That is, before they grab spinlocks, anyway, but user mode memory access must be done without holding spinlocks. Otherwise, deadlocks can result)

So you had me going there for a moment. But user memory can only be accessed while interrupts are enabled, so those fire-and-forget IPIs will either immediately interrupt anything that might access user memory, or will do so once interrupts are re-enabled when the CPU is finished doing whatever as the interrupts are disabled. Even if there are still TLBs open on other processors when the memory is reused, the TLB flush will happen before they could be used.
Carpe diem!
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Physical memory manager for virtual memory and paging issue

Post by thewrongchristian »

nullplan wrote:
thewrongchristian wrote:once I decided to move my kernel boundary to 0xf0000000,
Why would you do that? BTW, did you know that it is undefined behavior in C to have any single object on a 32-bit system that is larger than 2GB? Because that means that pointer differences within that object no longer work. Giving userspace this much virtual memory only means it can violate this rule more often.
To be honest, what user space does is not my problem. As pointed out by Octocontrabass, taking the difference between pointers to two different objects is undefined anyway, and I'm unlikely to have such a large single object. But pointer arithmetic is done in units of the size of the object being pointed to, so an array of objects that are 128 chars in size can span the entire 3.75GB user process space and still yield valid pointer differences between the first and last element. In fact, even an int is 4 chars, and so an entire 4GB address space filled with a single array of int will produce a maximum pointer difference of 2^30-1, still a valid pointer difference.
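To put numbers on it (toy example):

Code:

#include <stdio.h>
#include <stddef.h>

struct rec { char bytes[128]; };

int main(void)
{
    /* Enough 128-byte records to fill a 3.75GB user address space. */
    size_t n = 0xF0000000u / sizeof(struct rec);

    /* About 31.5 million: the maximum pointer difference within such an
     * array, counted in elements, is far below PTRDIFF_MAX even on 32-bit. */
    printf("max pointer difference: %zu elements\n", n - 1);
    return 0;
}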

But, it was also because I can, and I wanted to ensure my code wasn't tied to any particular kernel cut-off boundary. In fact, the only assumption I currently make is that the kernel boundary is on a page directory boundary due to how I initialize my bootstrap page table, so I can make the boundary any value in 4MB increments.
nullplan wrote:
thewrongchristian wrote:What's wrong with temporary mappings?
The added complexity. I am sharing as much as possible of the paging structures on the kernel side. With 32-bit PAE paging, this would mean the kernel-side PDP is the same in all threads (on 64-bit, it means the second half of the PML4T is copied to be the same everywhere). Any change of the kernel memory mapping table must therefore be synchronized. So, anything that touches on kernel-VM must grab a spinlock. With temporary mappings, you end up having to do that for a whole lot of things you didn't in the past. Example: Examining a foreign paging structure. People love to use recursive page tables to examine their own structures, but when you have identified a memory block to be swapped out, the whole point of that is that the memory block is in possession of another process. So you end up having to mark those page table entries as "not present". Since they are no longer always mapped in, you have to map in the PDT, find the PT, map out the PDT, map in the PT, mark the PT as not present, map out the PT.
Given the nature of temporary mappings, it is also unlikely that the memory being temporarily mapped is going to be swapped out. In fact, as it's in use by the kernel, it is likely going to be explicitly locked against being swapped out.

My current temporary mapping is confined to a page cleaning routine, which takes as an argument a page to be cleaned, maps the page to its temporary mapping, and does a memset to set the page to all 0s. The only thing that can interrupt that is an interrupt, and interrupts currently don't trigger the pre-emption of a kernel thread, so I'm guaranteed to finish before the temporary mapping will be used next.
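In outline it is just this (the names are illustrative, not my actual code):

Code:

#include <stdint.h>
#include <string.h>

#define TEMP_CLEAN_VA 0xFFBFF000u   /* one reserved kernel page for this purpose */
#define PAGE_SIZE     4096u

extern void map_page_at(uint32_t virt, uint32_t phys_frame, unsigned flags);
extern void unmap_page_at(uint32_t virt);

void clean_page(uint32_t phys_frame)
{
    map_page_at(TEMP_CLEAN_VA, phys_frame, 0x3 /* present + writable */);
    memset((void *)TEMP_CLEAN_VA, 0, PAGE_SIZE);

    /* Nothing else runs on this mapping before the next use (no kernel
     * pre-emption here), so no cross-CPU invalidation is needed. */
    unmap_page_at(TEMP_CLEAN_VA);
}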

Also, as currently implemented, I have a limited cache of address contexts anyway, so I recursively map in all the page tables to the top of kernel memory. I think I currently have between 2 and 8 (compile time option), and they're recycled in an LRU fashion, but I have successfully tested with 1 and think probably 16 or 32 would provide ample address contexts certainly for anything I'm likely to meet in the next couple of years. A process sleeping for a long time will lose its address context under context pressure, so all its mappings will be lost, so page table mappings are purely transitory and disposable in my kernel, and all VM information is managed in the platform independent virtual memory manager. The upshot being that the actual amount of live mapping state is bound to the number of contexts, which are all recursively mapped into the kernel address space and can be copied, synced and cleaned easily.
nullplan wrote: But that is not the worst of it. The worst is when multiple CPUs get into the mix. Now when you add a kernel-side mapping, you actually don't have to do anything. The mapping will propagate to the other CPUs with page faults. But when you remove or change a mapping, now you have to do a TLB shootdown (and with temporary mappings, you have to remove mappings as often as you add them). So you have to send IPIs to all other CPUs and make them invalidate a TLB, and then you have to wait for all of them to actually do that, and if you thought a spinlock was a bottleneck, try looking at CPU barriers. These things also don't scale. The current threadripper has 128 threads or so. How long am I supposed to wait for? And then there is the thorny issue of timeouts. Do I add a timeout to the process? That would leave the possibility that one CPU was somehow blocked for too long and didn't get the shootdown request, and is still accessing the wrong memory. Leading to fun times debugging the issue.
A per-CPU address will suffice to prevent locking requirements, and given that the temporary address is not used for anything else, there should be no TLB shoot-down required. Other CPUs just won't care.

I can't remember how many places I actually use temporary mappings. It might be just the one place (zeroing a page), but even with a threadripper and 128 vCPUs, that's only 512K of address space to set aside for per-CPU temporary mappings (one 4K page per CPU). But of course, with a threadripper system, it's also unlikely that I'll be running in 32-bit mode either, so it's horses for courses.

Moot in my kernel anyway, which is currently UP only. When I make it to SMP, though, I'll let you know how that pans out ;)
nullplan wrote: Or, you could just avoid all of this, and map all memory linearly to the start of kernel space. Now TLB shootdowns are no longer necessary. If I end up changing the virtual memory of a process with multiple threads, I can simply tell the other CPUs to schedule a new process once I am done. Userspace mappings are not global, so they will be flushed out of the TLB once scheduling happens. And for the schedule IPI, I don't need any additional memory or any synchronization. It is just fire-and-forget.
As I said, I can see the benefits of mapping the entirety of physical memory, which was practical when Linux was first designed, and is practical again now with 64-bit, but I don't want to limit my kernel to just small 32-bit or large 64-bit configurations. I want it to be usable with large memory 32-bit configurations as well, so I need something that works across all of them.

Besides, modern x64 CPUs, and plenty of RISC CPUs since the 80s, have an address space ID that would still necessitate TLB shootdowns even for user processes, so rescheduling won't help if you're using the ASID feature. And CPUs such as PowerPC don't even have the concept of multiple address spaces, so they would similarly require TLB shootdowns of user mappings.
Octocontrabass
Member
Posts: 5572
Joined: Mon Mar 25, 2013 7:01 pm

Re: Physical memory manager for virtual memory and paging issue

Post by Octocontrabass »

nullplan wrote:When I suggested to also stop the loop on overflow, that was rejected by saying that such large objects are generally undefined. Musl's allocator also prohibits allocating anything larger than PTRDIFF_MAX for the same reason.
I did some further research, and it turns out that neither GCC nor Clang supports such large objects, even though they seem to be allowed by the C standard.
nullplan wrote:But user memory can only be accessed while interrupts are enabled, so those fire-and-forget IPIs will either immediately interrupt anything that might access user memory, or will do so once interrupts are re-enabled when the CPU is finished doing whatever as the interrupts are disabled. Even if there are still TLBs open on other processors when the memory is reused, the TLB flush will happen before they could be used.
Hold on a minute, that's just the TLBs. What about the paging structure caches? The CPU can still walk the page tables using stale entries in the paging structure caches, and when it performs those walks it may attempt to write the "accessed" bit in the tables it finds. Your IPI doesn't stop that from happening.