Hellbender wrote: My 2c about the subject (edit: reading back, I see Combuster kind of said this already..)
Unfortunately, I do not feel convinced by some of the arguments. It looks like what you want is quite different from what I do, but I will still answer from my own perspective. This is not comprehensive, but it’s as long a coherent text as I’m currently able to write; sorry if the narrative is sloppy.
Hellbender wrote:
I assume kernel has linearly mapped all the memory into its own address space, so that it never needs to do any changes for itself.
There are systems where this is possible; some (MIPS) even have a built-in kernel-only “unpaged” area in memory. However, x86-32 + PAE is one example where it is not possible to have the whole physical memory space mapped at the same time: virtual addresses are 32 bits (addressing at most 4 GiB), while physical addresses are 36 bits (capable of addressing 64 GiB of memory). Even though not all of the physical address space is actual memory, it can realistically have big holes in it. Couple that with not wanting a separate address space for the kernel, and you’re in trouble.
In theory, x86-64 in its current form suffers from the same problem: (canonical) virtual addresses are 48 bits wide, while physical addresses can be up to 52 bits. I imagine this will be of only theoretical concern for years to come, though.
To cover these cases consistently, I feel it is cleaner for the kernel to maintain a ‘private mapping area’ for temporary mappings. If this area is per-CPU, it won’t need TLB shootdowns at all: the kernel doesn’t need protection from itself. I’m not sure all cases of kernel-private mappings can be handled in this manner, though.
Hellbender wrote:
Third, I feel like normal processes would never need to remove/update page mapping in their virtual address space. They should just be growing heap, growing stack, or mapping shared or private memory into their address space. Unmapping memory does not need to happen immediately as long as the virtual address space is not needed for anything else.
This is true in today’s systems. However, even with current approaches, cheap unmapping could be used to improve some things. For example, implementing read barriers for GC using page faults is not unheard of, but that work dates back some 20 years, so I don’t know whether it is still feasible today.
The main scenario I imagine is different, though. If a process submits data to a device driver using shared memory, the driver might let the process write the data in its final form, but will still want to check its consistency before submitting it to the device itself. Even if this is not needed, letting the user program overwrite data that is being read by a device is probably not a good idea. This means that the submitting process must give up the write rights until the device is done with it. Voilà: decreasing access rights.
Example: for zero-copy networking, let the process itself write TCP, IP and possibly even Ethernet headers, but check that it has the permission to use the specified port and protocol numbers and addresses before pointing the hardware to the data. You will need to ensure that the things you’ve checked won’t change in the interim.
Current systems show that this can be worked around, but I’d still like to have cheap unmapping if possible.
In any case, there will be an unmapping call available from user space, so the machinery needs to be in place. This is still problematic. On one hand, a process should be able to do whatever it wants with its own mappings. On the other hand, if any process that happens to have the right to occupy one processor at a particular moment can cause all processors in the system to stall for thousands of cycles (an IPI-based TLB shootdown), we have a DoS vulnerability, or at least a resource accounting problem.
Hellbender wrote:
Thus the problem reduces to the case when a kernel running on one core decides to remove a mapping of a process address space and should eventually notify other cores about the change.
In my world, the kernel makes almost no decisions by itself, so this is an infrequent case.
Hellbender wrote:
when a process dies, the heap is removed and physical pages are reused.
Make address spaces, not processes: don’t mix protection domains with execution states. Then ‘delete address space’ is an explicit user-level call, because even if nothing is currently running in an address space, that doesn’t mean it’s dead.
Hellbender wrote:
when a thread dies, the stack is removed and physical pages are reused.
The kernel needn’t occupy itself with threads. See the scheduler activations paper for evidence that kernel threads impede user-level threading performance, or the Haskell language runtime for an example of just how lightweight user-level threads can be. (Spoiler: it can handle millions of them without breaking a sweat, and this is enormously useful.) As always, moving stuff out of the kernel generally improves things, but the shrinking amount of trusted code means you still have to ensure the coherency of page protection.
Hellbender wrote:
Finally, virtual address space reuse feels even easier to avoid, just map new stuff to a different address space (48-bits should be enough for everything, right?). If other thread uses the now-unmapped space in another core, it is ill behaving anyway and can be let to access the old physical page (as long as it is not re-used yet).
I must be missing some crucial point here, because I feel like real time TLB invalidation it not really needed that much, as long as the system has enough RAM.
I think I know what the point is: if this is applied in full generality, either the user-level allocator will go mad or it will need to implement an additional level of address mapping just to remember where it put things.