nullplan wrote:
thewrongchristian wrote:
once I decided to move my kernel boundary to 0xf0000000,
Why would you do that? BTW, did you know that it is undefined behavior in C to have any single object larger than 2GB on a 32-bit system? That's because pointer differences within such an object no longer work. Giving userspace this much virtual memory only means it can violate this rule more often.
To be honest, what user space does is not my problem. As Octocontrabass pointed out, taking the difference between pointers into two different objects is undefined anyway, and I'm unlikely to have such a large single object. Besides, pointer arithmetic is done in units of the size of the pointed-to type, so an array of objects 128 chars in size can span the entire 3.75GB user process space and still yield a valid pointer difference between its first and last elements. Even an int is 4 chars, so an entire 4GB address space filled with a single array of int produces a maximum pointer difference of 2^30 - 1, which is still a valid pointer difference.
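To make the units concrete, here's a tiny standalone illustration (hosted C, nothing from my kernel) of pointer subtraction counting elements rather than bytes:

[code]
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    int arr[1024];

    /* Pointer subtraction counts elements, not bytes: these two elements
     * are 4092 bytes but only 1023 ints apart. */
    ptrdiff_t d = &arr[1023] - &arr[0];
    printf("%td elements apart\n", d);

    /* Scaled up: a 4GB array of 4-byte ints has 2^30 elements, so the
     * largest possible difference is 2^30 - 1, comfortably inside
     * PTRDIFF_MAX (2^31 - 1 on a typical 32-bit system). */
    return 0;
}
[/code]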
But it was also because I can, and I wanted to ensure my code wasn't tied to any particular kernel cut-off boundary. In fact, the only assumption I currently make is that the boundary falls on a page directory boundary, due to how I initialize my bootstrap page table, so I can place it at any value in 4MB increments.
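The assumption amounts to nothing more than an alignment check; something like this sketch (constants illustrative, not my actual code):

[code]
/* Illustrative sketch: the only constraint on the boundary is
 * page-directory alignment (4MB with classic 32-bit paging). */
#define PDE_SPAN    0x00400000u   /* address range covered by one PDE: 4MB */
#define KERNEL_BASE 0xF0000000u   /* the boundary I happen to use */

_Static_assert(KERNEL_BASE % PDE_SPAN == 0,
               "kernel boundary must lie on a page directory boundary");
[/code]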
nullplan wrote:
thewrongchristian wrote:
What's wrong with temporary mappings?
The added complexity. I am sharing as much as possible of the paging structures on the kernel side. With 32-bit PAE paging, this means the kernel-side PDP is the same in all threads (on 64-bit, it means the second half of the PML4T is copied to be the same everywhere). Any change to the kernel memory mapping must therefore be synchronized, so anything that touches kernel VM has to grab a spinlock. With temporary mappings, you end up doing that for a whole lot of things you didn't before. Example: examining a foreign paging structure. People love to use recursive page tables to examine their own structures, but when you have identified a memory block to be swapped out, the whole point is that the block belongs to another process. So you end up having to mark its page table entries as "not present". Since the tables are no longer always mapped in, you have to map in the PDT, find the PT, map out the PDT, map in the PT, mark the entry as not present, and map out the PT.
Given the nature of temporary mappings, it is also unlikely that the memory being temporarily mapped will itself be swapped out. In fact, since it's in use by the kernel, it will likely be explicitly locked against swapping.
My current temporary mapping is confined to a page cleaning routine, which takes a page to be cleaned as an argument, maps it at its temporary address, and memsets it to all zeroes. The only thing that can interrupt that is an interrupt, and interrupts currently don't trigger pre-emption of a kernel thread, so I'm guaranteed to finish before the temporary mapping is next used.
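In outline, the routine is something like this (helper names are invented for the sketch, not my real identifiers):

[code]
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Assumed helpers, named for illustration only. */
extern void *tmp_map_slot(void);               /* this CPU's reserved VA */
extern void map_page(void *va, uintptr_t pa);  /* install PTE, flush local TLB entry */

void page_clean(uintptr_t phys)
{
    void *va = tmp_map_slot();
    map_page(va, phys);        /* point the temporary slot at the frame */
    memset(va, 0, PAGE_SIZE);  /* zero the frame through the mapping */
    /* No unmap needed: the next caller overwrites the slot, and (per the
     * above) the kernel thread can't be pre-empted mid-routine. */
}
[/code]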
Also, as currently implemented, I have a limited cache of address contexts anyway, and I recursively map all the page tables into the top of kernel memory. I currently have between 2 and 8 contexts (a compile-time option), recycled in LRU fashion; I have successfully tested with just 1, and I think 16 or 32 would provide ample contexts for anything I'm likely to meet in the next couple of years. A process that sleeps for a long time will lose its address context under context pressure, and with it all its mappings, so page table mappings are purely transitory and disposable in my kernel; all VM information is managed in the platform-independent virtual memory manager. The upshot is that the amount of live mapping state is bounded by the number of contexts, all of which are recursively mapped into the kernel address space and can be copied, synced and cleaned easily.
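A rough sketch of that recycling, with names and details invented for illustration:

[code]
#include <stddef.h>
#include <stdint.h>

#define NCONTEXTS 8          /* compile-time option; 2 to 8 in practice */

struct process;              /* opaque to this sketch */

struct addrctx {
    uintptr_t pgdir_phys;    /* physical address of this context's page directory */
    struct process *owner;   /* process currently bound to it, or NULL */
    unsigned last_used;      /* LRU stamp */
};

static struct addrctx contexts[NCONTEXTS];
static unsigned lru_clock;

/* Return the process's context, or recycle the least recently used one.
 * Evicted mappings are simply discarded: they are transitory, and the
 * platform-independent VM manager repopulates them on demand. */
struct addrctx *ctx_acquire(struct process *p)
{
    struct addrctx *victim = &contexts[0];
    for (size_t i = 0; i < NCONTEXTS; i++) {
        if (contexts[i].owner == p) {
            contexts[i].last_used = ++lru_clock;
            return &contexts[i];
        }
        if (contexts[i].last_used < victim->last_used)
            victim = &contexts[i];
    }
    victim->owner = p;       /* evict; the old mappings are disposable */
    victim->last_used = ++lru_clock;
    return victim;
}
[/code]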
nullplan wrote:
But that is not the worst of it. The worst is when multiple CPUs get into the mix. When you add a kernel-side mapping, you actually don't have to do anything: the mapping will propagate to the other CPUs via page faults. But when you remove or change a mapping, you have to do a TLB shootdown (and with temporary mappings, you remove mappings as often as you add them). So you have to send IPIs to all the other CPUs, make them invalidate the TLB entry, and then wait for all of them to actually do it, and if you thought a spinlock was a bottleneck, try looking at CPU barriers. These things don't scale: the current Threadripper has 128 threads or so. How long am I supposed to wait? And then there is the thorny issue of timeouts. Do I add a timeout to the wait? That would leave the possibility that one CPU was blocked for too long, never got the shootdown request, and is still accessing the wrong memory, leading to fun times debugging the issue.
A per-CPU temporary address suffices to avoid the locking, and given that the temporary address is not used for anything else, there should be no TLB shootdown required either: other CPUs simply won't care.
I can't remember how many places I actually use temporary mappings. It might be just the one (zeroing a page), but even with a Threadripper and 128 vCPUs, one page per CPU is only 512K of address space to set aside for temporary mappings in total. Of course, on a Threadripper system it's also unlikely that I'll be running in 32-bit mode, so it's horses for courses.
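The carve-out is just fixed arithmetic; roughly this (base address illustrative, not my real layout):

[code]
#include <stdint.h>

#define PAGE_SIZE    4096u
#define TMP_MAP_BASE 0xFFE00000u   /* illustrative kernel VA */

/* One reserved page per CPU: 128 CPUs x 4K = 512K of address space total. */
static inline void *tmp_map_slot_for(unsigned cpu)
{
    return (void *)(uintptr_t)(TMP_MAP_BASE + cpu * PAGE_SIZE);
}
[/code]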
Moot in my kernel anyway, which is currently UP only. When I make it to SMP, though, I'll let you know how that pans out.
nullplan wrote:
Or, you could just avoid all of this, and map all memory linearly to the start of kernel space. Now TLB shootdowns are no longer necessary. If I end up changing the virtual memory of a process with multiple threads, I can simply tell the other CPUs to schedule a new process once I am done. Userspace mappings are not global, so they will be flushed out of the TLB once scheduling happens. And for the schedule IPI, I don't need any additional memory or any synchronization. It is just fire-and-forget.
As I said, I can see the benefits of mapping the entirety of physical memory, which was practical when Linux was first designed and is practical again now with 64-bit. But I don't want to limit my kernel to small 32-bit or large 64-bit configurations; I want it to be usable on large-memory 32-bit configurations as well, so I need something that works across all of them.
Besides, modern x64 CPUs (with PCIDs), and plenty of RISC CPUs since the 80s, have an address space ID that would still necessitate TLB shootdowns even for user mappings, so rescheduling won't help if you're using the ASID feature. And CPUs such as PowerPC don't even have the concept of multiple address spaces, so they would similarly require TLB shootdowns of user mappings.
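For concreteness, here's roughly the shootdown shape that ASID-tagged TLBs force on you, with every primitive name assumed for the sketch. The point is that a reschedule alone can't evict entries tagged with the old ASID, so the initiator must broadcast and wait:

[code]
#include <stdatomic.h>
#include <stdint.h>

/* Assumed primitives, named for illustration. */
extern unsigned ncpus;
extern unsigned this_cpu(void);
extern void send_ipi(unsigned cpu, int vector);
extern void invlpg_asid(unsigned asid, uintptr_t va); /* arch-specific invalidate */

#define IPI_TLB_SHOOTDOWN 0xFD     /* illustrative vector number */

static _Atomic unsigned pending;
static uintptr_t shoot_va;
static unsigned shoot_asid;

/* Initiator: broadcast the invalidation and spin until every CPU acks.
 * This is exactly the wait nullplan objects to. */
void tlb_shootdown(unsigned asid, uintptr_t va)
{
    shoot_asid = asid;
    shoot_va = va;
    atomic_store(&pending, ncpus - 1);
    for (unsigned cpu = 0; cpu < ncpus; cpu++)
        if (cpu != this_cpu())
            send_ipi(cpu, IPI_TLB_SHOOTDOWN);
    while (atomic_load(&pending) != 0)
        ;  /* a reschedule can't clear entries tagged with the old ASID */
}

/* IPI handler, run on each remote CPU. */
void tlb_shootdown_handler(void)
{
    invlpg_asid(shoot_asid, shoot_va);
    atomic_fetch_sub(&pending, 1);
}
[/code]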