OSDev.org

Posted: **Wed May 10, 2023 10:00 am**

I am *trying* to write an exokernel as a learning project. It has been great so far, I've been learning a lot (and restarting a lot, too, but I'm pretty sure it is part of the process). For now, I'm focusing on the x86_64 architecture only.

Being an exokernel, my OS (and by that, I mean "I") wants to let applications choose the mapping between physical memory and virtual memory.

Originally, I had two system calls:

// Populates `dst` with at most `len` segments of free physical memory.
size_t get_memory(free_segment_t *dst, size_t len);

// Maps physical page at `phys` to virtual page at `virt`. Flags are not important for the point of this post.
size_t map_memory(uintptr_t phys, uintptr_t virt, int flags);

However, having to query all available physical memory seems expensive, and it gets even worse when considering the race between `get_memory` and `map_memory`. Indeed, if the memory we want to map has been acquired by another process between the call to `get_memory` and the call to `map_memory`, we'd have to query all available memory again to find another free segment. That's pretty bad.

So, I thought, maybe it would be better if the kernel just shared the whole list of free segments with all userspace processes (by mapping it in their address space at a well-known address in the higher half, along with the rest of the kernel) with read-only permissions. This way, they wouldn't have to pay the cost of an extra system call when acquiring memory. They'd just have to read the list directly, and then perform the `map_memory` system call. Note that there's still a race, but it's less of a problem.

Unless...

Unless the kernel starts writing to the list. The list is mapped with read-only permissions, so processes can't really take a lock to look at the list. And even if they could, it seems like a pretty bad idea to allow untrusted processes to acquire global locks such as this one.

So here comes my question:

Is there a way to synchronize kernel writes with userspace processes without the kernel needing to wait for them to finish a read?

If there is an obvious answer, I'd be glad to learn about it. Otherwise, I've thought about it for a bit, and came up with something, which I'm not sure is really sound.

1. When the kernel starts writing to one of the elements of the list, atomically increment an epoch counter. (maybe even compare_exchange for concurrent access with other CPU cores)
2. When it is done, increment it again.
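The two kernel-side steps above can be sketched in C11 as follows. This is only an illustration under stated assumptions: the layout of `free_segment_t`, the `shared_segment_t` wrapper, and the name `kernel_update_segment` are hypothetical (the thread does not define them), and the sketch assumes a single writer at a time. The payload is written with plain stores, which a stricter reading of the C memory model would want replaced by relaxed atomic or volatile accesses.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one list element; the thread does not define it. */
typedef struct {
    uintptr_t base;   /* physical address of the first free page */
    size_t    pages;  /* number of contiguous free pages */
} free_segment_t;

/* Hypothetical shared entry: the epoch guards the segment stored next to it. */
typedef struct {
    _Atomic uint64_t epoch;
    free_segment_t   segment;
} shared_segment_t;

/* Kernel side. Assumes only one writer at a time touches `s`. */
void kernel_update_segment(shared_segment_t *s, free_segment_t value) {
    atomic_fetch_add(&s->epoch, 1);  /* step 1: epoch becomes odd */
    s->segment = value;              /* the write a reader might see half-done */
    atomic_fetch_add(&s->epoch, 1);  /* step 2: epoch becomes even again */
}
```

Both increments use the default sequentially consistent ordering, so the payload store cannot drift outside the odd-epoch window.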

With that, in order to read an element from the list, userspace processes must:

1. Read the epoch counter once atomically
2. Read the value.
3. Read the epoch counter again and verify that it has not changed (there's still a race if it actually has wrapped around, but that seems unlikely enough).
4. If the epoch has changed, retry with the new epoch. Otherwise, do whatever with the value.

Would that even be sound? And how do atomic operations interact with multitasking?

PS: This is my first post here. I've been reading a lot since I started trying to write a kernel, and I must say, I wouldn't have made it this far without you all. And I'm still at the very beginning...

Posted: **Tue May 16, 2023 12:09 am**

nilsmathieu wrote:Is there a way to synchronize kernel writes with userspace processes without the kernel needing to wait for them to finish a read?

Perhaps a combined epoch counter and lock? The lock only needs to be one bit, which leaves you 63 bits for the counter.
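One way to picture the combined word is to treat the lowest bit as the lock and the upper 63 bits as the counter. The sketch below is an assumption about how that could look, not something spelled out in the thread; `EPOCH_LOCK_BIT`, `epoch_try_lock`, and the other helper names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EPOCH_LOCK_BIT UINT64_C(1)  /* bit 0 = lock, bits 1..63 = counter */

static inline bool epoch_is_locked(uint64_t word) {
    return (word & EPOCH_LOCK_BIT) != 0;
}

static inline uint64_t epoch_count(uint64_t word) {
    return word >> 1;
}

/* Kernel side: try to take the write lock; fails if another core holds it. */
bool epoch_try_lock(_Atomic uint64_t *word) {
    uint64_t seen = atomic_load(word);
    if (epoch_is_locked(seen))
        return false;
    /* Only succeeds if nobody changed the word since we looked at it. */
    return atomic_compare_exchange_strong(word, &seen, seen | EPOCH_LOCK_BIT);
}

/* Kernel side: adding 1 to a locked (odd) word clears the lock bit
 * and carries into the counter, so one operation does both jobs. */
void epoch_unlock(_Atomic uint64_t *word) {
    atomic_fetch_add(word, 1);
}
```

Readers never write to the word; they only load it, which is what makes this usable through a read-only mapping.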

nilsmathieu wrote:Would that even be sound?

No. Userspace may complete a read without observing the epoch counter changing while the kernel is busy with a write. You need a lock to prevent that from happening.
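The failure can be replayed deterministically on a single thread by freezing the writer halfway through. In the sketch below, `shared_pair_t`, `read_unchecked`, `read_checked`, and `kernel_half_write` are hypothetical names made up for the illustration; the value is assumed to be two words that the kernel always keeps equal.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t epoch;
    uint64_t lo, hi;  /* hypothetical two-word value; the kernel keeps lo == hi */
} shared_pair_t;

/* The original proposal: only compares the two epoch reads. */
bool read_unchecked(shared_pair_t *s, uint64_t *lo, uint64_t *hi) {
    uint64_t before = atomic_load(&s->epoch);
    *lo = s->lo;
    *hi = s->hi;
    return atomic_load(&s->epoch) == before;
}

/* Same, but refuses to read while the epoch is odd (write in progress). */
bool read_checked(shared_pair_t *s, uint64_t *lo, uint64_t *hi) {
    uint64_t before = atomic_load(&s->epoch);
    if (before & 1)
        return false;
    *lo = s->lo;
    *hi = s->hi;
    return atomic_load(&s->epoch) == before;
}

/* Simulates a kernel write that is interrupted after updating only `lo`. */
void kernel_half_write(shared_pair_t *s, uint64_t v) {
    atomic_fetch_add(&s->epoch, 1);  /* epoch is now odd */
    s->lo = v;
    /* stopped here: s->hi and the second increment have not happened yet */
}
```

After `kernel_half_write`, `read_unchecked` reports success with `lo != hi`, because the epoch was already odd at the first read and stayed the same at the second. `read_checked` rejects the read.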

nilsmathieu wrote:How do atomic operations interact with multitasking?

What do you mean? Atomic operations can be used to guarantee that all tasks have a consistent view of shared data. The only interaction I can think of at a userspace level is that a task may yield instead of wasting CPU cycles polling a lock that's unlikely to be released soon.

Posted: **Tue May 16, 2023 12:46 am**

Thanks for the reply!

Octocontrabass wrote:No. Userspace may complete a read without observing the epoch counter changing while the kernel is busy with a write. You need a lock to prevent that from happening.

Oh yeah, hadn't thought about that. So a userspace program should perform the read only after checking the lock (checking that the epoch is even would be enough in that case), ensuring that no one is currently writing. Does reading the value need to be atomic too? I've seen things like atomic memcpys; is that something I need here? I feel like the lock is enough and that the critical section can use regular reads, but I don't want to settle for something I just feel™ works.

EDIT: (code example)

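extern Value shared; // assumed to contain an atomic 64-bit `epoch` field

int try_read(Value *val) {
    uint64_t epoch = atomic_load_acquire(&shared.epoch);
    if ((epoch & 1) == 1)
        return 0; // odd epoch: the kernel is in the middle of a write
    memcpy(val, &shared, sizeof(Value)); // copy from the address of `shared`
    if (atomic_load_acquire(&shared.epoch) != epoch)
        return 0; // we lost the race and need to try again
    return 1;
}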

Octocontrabass wrote:What do you mean?

Yeah, my bad, that was poorly phrased. I think that comes from me not really getting which part of an atomic operation load/store affects compilers, and which part affects CPUs themselves. That's not the topic here, though, so I'll just research that on my own.

Posted: **Tue May 16, 2023 2:20 am**

nilsmathieu wrote:Does reading the value need to be atomic too? I've seen things like atomic memcpys; is that something I need here?

No and no. C and C++ atomics are sequentially consistent by default, so all you need to do is read the epoch, read your data structure (assuming it wasn't locked), then read the epoch again. You don't even need to do anything special to read the epoch, just declare it as an atomic type. Your compiler will handle the rest.

Your example code looks pretty reasonable, but I'm not sure acquire memory ordering makes a strong enough guarantee.
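To make the ordering concern concrete: an acquire load keeps later accesses from moving before it, but it does not keep earlier accesses (here, the copy) from moving after it. One commonly described arrangement puts an acquire fence between the copy and the second epoch load. The sketch below assumes that arrangement; `payload_t`, `shared_value_t`, and `try_read_fenced` are hypothetical names, and the copy still uses plain reads, which the C memory model formally treats as a data race if a write overlaps.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t base; uint64_t pages; } payload_t;  /* hypothetical */

typedef struct {
    _Atomic uint64_t epoch;
    payload_t        data;
} shared_value_t;

bool try_read_fenced(shared_value_t *s, payload_t *out) {
    uint64_t before = atomic_load_explicit(&s->epoch, memory_order_acquire);
    if (before & 1)
        return false;  /* odd epoch: a write is in progress */

    memcpy(out, &s->data, sizeof *out);

    /* Keeps the copy above ordered before the epoch load below. */
    atomic_thread_fence(memory_order_acquire);

    uint64_t after = atomic_load_explicit(&s->epoch, memory_order_relaxed);
    return after == before;  /* false: the copy may be torn, retry */
}
```

On x86_64 the hardware does not reorder loads with other loads, so there the fence mainly constrains the compiler; on weaker architectures it would also emit a barrier.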

nilsmathieu wrote:For now, I'm focusing on the x86_64 architecture

Atomics are very architecture-dependent. I forgot to mention it earlier, but this design is only possible on architectures that are capable of atomic 64-bit reads, such as x86. If one of your future targets doesn't make that guarantee, you'll have to come up with something else.
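A port to another target could check this assumption at compile time. The sketch below relies on the standard `ATOMIC_LLONG_LOCK_FREE` macro from `<stdatomic.h>`, which is 2 when atomic `long long` is always lock-free; it assumes `uint64_t` and `long long` have the same size on the target, and `read_epoch` is a hypothetical helper name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: uint64_t is as wide as long long on this target. */
_Static_assert(sizeof(uint64_t) == sizeof(long long),
               "uint64_t and long long differ in size");

/* Refuse to build where 64-bit atomics might fall back to a hidden lock. */
_Static_assert(ATOMIC_LLONG_LOCK_FREE == 2,
               "64-bit atomics are not always lock-free on this target");

/* With the checks above, this load is a single indivisible read. */
static inline uint64_t read_epoch(_Atomic uint64_t *epoch) {
    return atomic_load(epoch);
}
```

The lock-free requirement matters here because a lock-based fallback would need the reader to write to a lock, which a read-only mapping shared with the kernel cannot offer.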

Posted: **Tue May 16, 2023 8:40 am**

You know, the normal way to solve stuff like this is to reserve the memory before returning it. get_memory() marks the pages it hands out as used, and no other get_memory() call can return the same pages until some new function free_memory() marks them as usable again. This is normal allocation stuff, and a variety of designs is available. The easiest would be to have a global allocator lock that is taken whenever anything changes the allocator data structures, though faster/more parallel designs might be better in the long run. The nice thing about abstraction is that you can start with the hacky solution to get going and look for a better one later.
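As a rough sketch of the "global allocator lock" starting point, assuming a tiny fixed pool tracked by one flag per page: `PAGE_COUNT`, `reserve_page`, and `release_page` are hypothetical names chosen for the illustration, and a real kernel would track far more pages with a denser structure.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_COUNT 64        /* hypothetical: tiny pool for illustration */
#define NO_PAGE    SIZE_MAX  /* returned when nothing is free */

static atomic_flag allocator_lock = ATOMIC_FLAG_INIT;
static bool page_used[PAGE_COUNT];

static void allocator_acquire(void) {
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&allocator_lock,
                                             memory_order_acquire)) {
    }
}

static void allocator_release(void) {
    atomic_flag_clear_explicit(&allocator_lock, memory_order_release);
}

/* Marks the first free page as used and returns its index, or NO_PAGE. */
size_t reserve_page(void) {
    size_t found = NO_PAGE;
    allocator_acquire();
    for (size_t i = 0; i < PAGE_COUNT; i++) {
        if (!page_used[i]) {
            page_used[i] = true;
            found = i;
            break;
        }
    }
    allocator_release();
    return found;
}

/* Marks a page as usable again. */
void release_page(size_t index) {
    allocator_acquire();
    page_used[index] = false;
    allocator_release();
}
```

Because a page is marked used before its index is returned, two callers can never be handed the same page, which removes the race described in the first post.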

Posted: **Tue May 16, 2023 11:56 am**

nullplan wrote: You know, the normal way to solve stuff like this is to reserve the memory before returning it.

Yeah, I know x)

This is mostly me experimenting with exokernel principles, which attempt to expose the physical hardware (such as physical memory) to userspace processes as much as possible.
My `get_memory` was not a function that allocates memory, but a way to retrieve a list of all usable physical pages.

Example:

size_t virt = grow_heap();

free_segment_t segments[10];
size_t count = get_memory(segments, 10); // the array decays to free_segment_t *
if (count == SIZE_MAX)
    panic("error");
size_t phys = choose_physical_page_to_use(segments, count);
if (map_memory(phys, virt, MAPMEM_WRITE | MAPMEM_READ) == SIZE_MAX)
    panic("error");

And I want to make it look more like this, removing the need to perform a system call:

// Always mapped somewhere in all userspace processes.
extern free_segment_t segments[];
extern size_t segment_count;

size_t virt = grow_heap();
size_t phys = choose_physical_page_to_use(segments, segment_count);
if (map_memory(phys, virt, MAPMEM_WRITE | MAPMEM_READ) == SIZE_MAX)
    panic("error");

The idea is to allow untrusted processes to choose their memory mapping themselves.

PS: Thanks Octocontrabass for your help!

Posted: **Thu May 18, 2023 3:44 am**

I think that if you let user processes map any physical address they like into their address space, then there is no need to run user processes in their own ring, and you might just as well let them run in kernel space, which gives them access to everything in kernel. That's because not allowing mapping of physical memory is at the heart of ring protection and protecting kernel space from user space and different processes from each other. When user processes run in kernel space, there is no need for syscalls, and trying to protect stuff with read-only attributes is quite meaningless since userspace can just make a new mapping of the same physical address that is read-write. You might even consider running without paging (faster), or using unity mapping. That will make it easier to support memory schedules of modern PCI devices.

As an alternative, if you want exokernel features but not all the drawbacks above, you can add specific syscalls that allow a user process to map PCI BARs in its address space, and that allow the construction of memory schedules for modern PCI devices. While this will never become completely safe, it's at least a whole lot better than letting userspace map any physical address. For normal allocation, you then tell the kernel that a page is allocated, and when it's accessed, the page fault handler will allocate a physical page and map it. I don't think there is any reason whatsoever why a userspace process should be able to map ordinary physical memory.

Posted: **Thu May 18, 2023 12:38 pm**

rdos wrote:I think that if you let user processes map any physical address they like into their address space,

I don't think anyone is suggesting that. The user process may choose the physical address, but the exokernel will reject any requests that break address space isolation.
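A minimal sketch of such a check, assuming the kernel keeps one owner entry per physical frame: the table, the owner ids, and the name `check_and_claim_frame` are all hypothetical, and a real kernel would also need locking around the table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define FRAME_COUNT  16  /* hypothetical tiny machine */
#define OWNER_FREE   0   /* hypothetical owner ids; process ids start at 2 */
#define OWNER_KERNEL 1

/* One owner per physical frame; frame 0 belongs to the kernel here. */
static uint32_t frame_owner[FRAME_COUNT] = { [0] = OWNER_KERNEL };

/* Returns true if `pid` may map the frame at `phys`.
 * A free frame is claimed for `pid` as a side effect. */
bool check_and_claim_frame(uint32_t pid, uintptr_t phys) {
    if (phys & (((uintptr_t)1 << PAGE_SHIFT) - 1))
        return false;  /* not page-aligned */

    size_t frame = phys >> PAGE_SHIFT;
    if (frame >= FRAME_COUNT)
        return false;  /* outside known physical memory */

    if (frame_owner[frame] == OWNER_FREE) {
        frame_owner[frame] = pid;
        return true;
    }
    return frame_owner[frame] == pid;  /* reject frames owned by anyone else */
}
```

With a flat table like this, the check is a single array lookup regardless of how many pages the process already has mapped.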

rdos wrote:I don't think there is any reason whatsoever why a user space process should be able to map ordinary physical memory.

Cache coloring. Userspace can choose pages according to anticipated access patterns to minimize cache contention.
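As an illustration of what a userspace page chooser could compute, assuming a physically indexed set-associative cache and 4 KiB pages: the number of colors is the cache size divided by (associativity × page size), and a page's color is its frame number modulo that count. The helper names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of distinct page colors for a physically indexed cache. */
static inline size_t color_count(size_t cache_bytes, size_t ways) {
    return cache_bytes / (ways * PAGE_SIZE);
}

/* Color of the physical page that contains `phys`. */
static inline size_t page_color(uintptr_t phys, size_t colors) {
    return (phys / PAGE_SIZE) % colors;
}

/* Returns the first candidate page with the wanted color,
 * or UINTPTR_MAX if none of the candidates has it. */
uintptr_t pick_page_with_color(const uintptr_t *pages, size_t n,
                               size_t wanted, size_t colors) {
    for (size_t i = 0; i < n; i++) {
        if (page_color(pages[i], colors) == wanted)
            return pages[i];
    }
    return UINTPTR_MAX;
}
```

For example, a 2 MiB 16-way cache gives 32 colors, so physical pages `0x1000` and `0x21000` share color 1 and compete for the same cache sets.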

Posted: **Fri May 19, 2023 2:11 am**

Octocontrabass wrote:
rdos wrote:I think that if you let user processes map any physical address they like into their address space,
I don't think anyone is suggesting that. The user process may choose the physical address, but the exokernel will reject any requests that break address space isolation.

That would impose a lot of overhead on mapping, particularly for processes with many pages allocated. Additionally, are things like PCIe BARs acceptable for a user process to map? If not, I don't see how the exokernel feature of allowing low-level access is supported if this is not possible.

Octocontrabass wrote:
rdos wrote:I don't think there is any reason whatsoever why a user space process should be able to map ordinary physical memory.
Cache coloring. Userspace can choose pages according to anticipated access patterns to minimize cache contention.

I think cache coloring can be solved without allowing user space to decide which physical addresses to map.

Posted: **Fri May 19, 2023 6:57 am**

rdos wrote:Additionally, are things like PCIe BARs acceptable for a user process to map? If not, I don't see how the exokernel feature of allowing low-level access is supported if this is not possible.

A driver does not need to access the BARs to use MMIO. But also, an exokernel does not need to allow MMIO access either. The defining feature of an exokernel is the lack of software abstractions such as filesystems. Hardware abstractions such as block device drivers may still be part of the kernel.

rdos wrote:I think cache coloring can be solved without allowing user space to decide which physical addresses to map.

Sure, but the point of an exokernel is to give applications the power to make those decisions because applications know their own access patterns. The kernel has to guess, and sometimes it will guess wrong.

On sharing memory between the kernel and userland processes.
