mapping kernel address space into user address space


Re:mapping kernel address space into user address space

Post by Brendan »

Hi,
Candy wrote: Pages sent to swap, and unmapped pages of memory-mapped files, were not in use recently. When they get mapped again, you first load the page, then map it, and then INVLPG yourself. No problemo.
Sure - the pages weren't in use recently, but that doesn't always mean they won't be in use soon. Also, if a CPU is halted (waiting for something to do) it may retain TLB information indefinitely (especially entries marked global).
Candy wrote: Theoretical situation about a race condition: You have a process A, a process B, and a process C. A uses 3 pages (0-2), B uses 3 pages (3-5). Process A is reading from all three pages on processor 0, B is reading from all three pages on processor 1, and C has no relevant issues other than not being A. Processor 0 switches from A to C; processor 1 sees B needing a page and unmaps page #2 from process A's space - the process isn't active, so no IPIs. Processor 0 switches back to A, still using the old TLB entry mapping page 2 of A to the page in the cache, and thereby overwrites B's data.

You want to signal all processors.
Still not (normally) necessary - when processor 0 switches back to process A it has to reload CR3 (a different address space), which will flush anything not marked global.
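
As a minimal sketch of what that implicit flush looks like on x86 (the struct and function names here are illustrative, not from any particular kernel):

/* Switching address spaces reloads CR3, which implicitly flushes every
 * TLB entry not marked global on this CPU. No IPI is needed for the
 * CPU doing the switch itself. */
typedef struct process {
    unsigned long page_dir_phys;   /* physical address of the page directory */
} process_t;

static inline void load_cr3(unsigned long phys)
{
    /* Writing CR3 flushes all non-global TLB entries on this CPU. */
    __asm__ volatile ("mov %0, %%cr3" : : "r" (phys) : "memory");
}

void switch_address_space(process_t *next)
{
    load_cr3(next->page_dir_phys);
    /* Any stale entries left over from the previous address space are
     * now gone, except those with the Global bit set. */
}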

Candy wrote:
My processes (applications, drivers, etc.) have a flag in the executable file header that says whether they support multi-processor. If this flag is clear, the scheduler will not allow more than one thread from the process to be running. I expect that most processes will not set this flag. Also, if the process only has 1 thread there's no problem. As a very rough estimate, I guess that I'll only need to worry about TLBs on other CPUs in process space about 15% of the time.
That's very ugly. Multiprocessor things also occur with ALL multithreading programs, and if you don't multithread there's nothing the other processor can do about it. What's the use of the bit?
Most people who write multi-threaded processes/applications don't worry about using semaphores/mutexes for shared data or anything - they assume it's single-CPU multi-threading. As a simple example, if a multi-threaded application (that doesn't support multi-processor) is the only application the user is using, then the keyboard driver, mouse driver, VFS, user interface and video driver may all be doing something because of it (in my OS all of these things are separate processes). In addition, any process can have idle threads (including the kernel) which only run when there's nothing else to do. This means other CPUs will have plenty to keep them busy, even though the application is only being run on one CPU.

If the application does support multi-processor then the OS knows that it does use semaphores, mutexes, etc for shared data and it can allow multiple threads belonging to the process to be scheduled on different CPUs at the same time.
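
A hypothetical sketch of that scheduling rule in C (the flag and field names are invented for illustration, not from any real kernel):

/* A process whose executable header lacks the "supports SMP" flag may
 * have at most one thread on a CPU at any instant. */
#define PROC_FLAG_SMP_SAFE  0x0001

typedef struct process {
    unsigned int header_flags;    /* copied from the executable header */
    int          running_threads; /* threads currently on a CPU */
} process_t;

int may_schedule_thread(process_t *proc)
{
    if (proc->header_flags & PROC_FLAG_SMP_SAFE)
        return 1;                      /* SMP-safe: no restriction */
    return proc->running_threads == 0; /* otherwise one thread at a time */
}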
Candy wrote:
In addition it's a micro-kernel (device drivers, VFS, etc are processes) - it's not going to be changing pages as much as a monolithic kernel.
What's different between a monolithic kernel and a microkernel that makes you say this? I actually dare say you'll keep getting TLB flushes. You might differ from a traditional monolithic kernel in that you don't load code you never use in the first place. That doesn't make you any better though: all your processes are in separate pages, giving a load of overhead a monolithic kernel can beat easily. (Yes, an optimized microkernel can be faster than a non-optimized monolithic kernel - pretty *duh* if you ask me.) I'm still going for hybrid :)
With a monolithic kernel all the data used by device drivers, etc. would be in kernel space - in this case INVLPG would need to be done on all CPUs. As my device drivers, etc. are implemented as separate processes, the INVLPG won't normally be needed on all CPUs, so in general the micro-kernel design reduces the overhead of TLB flushing. My most complete kernel contained less than 30 KB of code and used around 200 KB of data. The overhead of IPC and other things is increased though, so I expect the overall overhead of the micro-kernel to be worse. I'm using a micro-kernel for other reasons (sacrificing some performance for other benefits).
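
A rough sketch of what such an explicit shootdown might look like; send_ipi_all() and the interrupt wiring are invented placeholders for whatever IPI mechanism the platform provides:

#include <stdatomic.h>

extern void send_ipi_all(void);         /* placeholder: platform IPI */

static atomic_int shootdown_pending;    /* CPUs that still need to flush */
static void      *shootdown_addr;       /* page being invalidated */

static inline void invlpg(void *addr)
{
    __asm__ volatile ("invlpg (%0)" : : "r" (addr) : "memory");
}

/* Called by the CPU that unmapped the page. */
void tlb_shootdown(void *addr, int other_cpus)
{
    shootdown_addr = addr;
    atomic_store(&shootdown_pending, other_cpus);
    send_ipi_all();                     /* raise the shootdown interrupt */
    while (atomic_load(&shootdown_pending) > 0)
        ;                               /* spin until everyone has flushed */
    invlpg(addr);                       /* flush our own TLB entry too */
}

/* IPI handler, runs on each of the other CPUs. */
void tlb_shootdown_ipi_handler(void)
{
    invlpg(shootdown_addr);
    atomic_fetch_sub(&shootdown_pending, 1);
}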

Cheers,

Brendan
Re:mapping kernel address space into user address space

Post by Candy »

Brendan wrote: Sure - the pages weren't in use recently, but that doesn't always mean they won't be in use soon. Also, if a CPU is halted (waiting for something to do) it may retain TLB information indefinitely (especially entries marked global).
CPUs are required to either snoop TLB changes, catch IPIs, or invalidate the TLB. That is, AFAIK.
Still not (normally) necessary - when processor 0 switches back to process A it has to reload CR3 (different address space) which will flush anything not marked global.
OK, forgot about the CR3 reload.
Most people who write multi-threaded processes/applications don't worry about using semaphores/mutexes for shared data or anything - they assume it's single-CPU multi-threading. As a simple example, if a multi-threaded application (that doesn't support multi-processor) is the only application the user is using, then the keyboard driver, mouse driver, VFS, user interface and video driver may all be doing something because of it (in my OS all of these things are separate processes). In addition, any process can have idle threads (including the kernel) which only run when there's nothing else to do. This means other CPUs will have plenty to keep them busy, even though the application is only being run on one CPU.

If the application does support multi-processor then the OS knows that it does use semaphores, mutexes, etc for shared data and it can allow multiple threads belonging to the process to be scheduled on different CPUs at the same time.
You need mutexes and semaphores to correctly run two threads on any preemptive system. Each thread can be preempted after reading a value but before writing back the increased value, so your idea wouldn't work. The only place where it would work is when you schedule per process, and then start the same thread as last time until it blocks. Still, that defeats the purpose of making it multithreaded :).
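
For illustration, here is that classic lost-update race in userland C with pthreads: counter++ compiles to a load, an increment and a store, and a thread preempted between the load and the store loses the other thread's update, even on a single CPU. With the lock the final count is exactly 200000; without it, it is usually less:

#include <pthread.h>

static long            counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;          /* safe: load and store happen under the lock */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;               /* counter == 200000, thanks to the mutex */
}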
With a monolithic kernel all the data used by device drivers, etc would be in kernel space - in this case INVLPG would need to be done on all CPUs. As my device drivers, etc are implemented as seperate processes the INVLPG won't normally be needed on all CPUs, so in general the micro-kernel design reduces the overhead of TLB flushing. My most complete kernel contained less than 30 Kb of code and used around 200 Kb of data. The overhead of IPC and other things is increased though, so I expect the overall overhead of the micro-kernel to be worse. I'm using a micro-kernel for other reasons (sacrificing some performance for other benefits).
I'm suggesting that in the case of a microkernel you constantly switch context, so you actually have a higher overhead with constant TLB invalidation. That also means it doesn't matter that you skipped over these points.

Re:mapping kernel address space into user address space

Post by Brendan »

Hi,
Candy wrote: You need mutexes and semaphores to correctly run two threads on any preemptive system.
Curse those air bubbles! :-)
Candy wrote: I'm suggesting that in the case of a microkernel you constantly switch context, so you actually have a higher overhead with constant TLB invalidation. That also means it doesn't matter that you skipped over these points.
Why? I don't see a relationship between TLB invalidation and context switches...

Cheers,

Brendan

Re:mapping kernel address space into user address space

Post by Candy »

Brendan wrote:
Candy wrote: I'm suggesting that in the case of a microkernel you constantly switch context, so you actually have a higher overhead with constant TLB invalidation. That also means it doesn't matter that you skipped over these points.
Why? I don't see a relationship between TLB invalidation and context switches...
In a microkernel each kernel driver has its own process (address space area). True or not?

When you switch between address space areas you invalidate the entire TLB, save for the global entries. True or not?

So, when you use a microkernel, you invalidate the entire TLB (exc... ) for each swap between kernel level threads, not just for page removals. True or not?

If you had a microkernel sample with more than one thread in a single process area, they'd still have to be synchronized, using IPIs or the like.

Re:mapping kernel address space into user address space

Post by Brendan »

Hi,
Candy wrote:So, when you use a microkernel, you invalidate the entire TLB (exc... ) for each swap between kernel level threads, not just for page removals. True or not?
Sort of :-) We seem to be talking about slightly different things here. Explicit TLB invalidation, where a CPU unmaps a page and has to do the IPI/spinlock/INVLPG (other CPUs stopped) to make sure other CPUs' TLBs are consistent, is one thing. Implicit TLB invalidation, where the CPU automatically invalidates TLB entries when CR3 is reloaded, is another. For microkernels, explicit TLB invalidation is (or at least can be) done less often than it would be in a monolithic kernel, while implicit TLB invalidation (which doesn't involve stopping other CPUs) is done more often.

Cheers,

Brendan

Re:mapping kernel address space into user address space

Post by Candy »

Brendan wrote: We seem to be talking about slightly different things here. ... For microkernels, explicit TLB invalidation is (or at least can be) done less often than it would be in a monolithic kernel, while implicit TLB invalidation (which doesn't involve stopping other CPUs) is done more often.
Yes. I tend to group them under the term "TLB invalidation" since they both, in a way, invalidate the TLB. Both cause a breakdown in performance, the implicit more so than the explicit. I'd rather look at it as the number of TLB misses you get because of a certain choice or operation - or rather, how your performance degrades at no gain in security.

I do not imply that monolithic is better because it is faster.

I'm doing hybrid: modules can be either, and all modules end up using the same kernel. Using the same base kernel you can make a microkernel and a monolithic kernel, with no difference.
BI lazy

Re:mapping kernel address space into user address space

Post by BI lazy »

I'd keep *drivers* in kernel space - sorta kernel threads. Eases life in many ways.
qb

Re:mapping kernel address space into user address space

Post by qb »

proxy wrote: A thought just occurred to me. So we implement independent address spaces by each process having its own page directory, right? And it is common practice to have the kernel address space mapped directly into the user space to simplify things... well, when the kernel does dynamic allocation, does that mean we have to iterate through all processes and add these new pages to their memory maps as well, or is there some neat trick I am unaware of?
proxy
1. You can have shared page tables in the kernel area. So if you change a page table entry, it automatically becomes visible in the other address spaces.

2. For non-shared stuff like 4 MB pages, which live in page directories, you need another solution.
At boot, create a master kernel page directory. Whenever you change the current page directory, add the change to the master kernel directory as well.
After a switch to another page directory, when a page fault occurs, the page fault handler looks at the master kernel directory.
If a valid entry exists in the master directory, the handler copies it into the current page directory and returns immediately.
This is called lazy updating, so you don't need to update every address space.
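
A sketch of that lazy-update scheme for 32-bit x86 with 4 MB kernel mappings; the constants and names here are illustrative:

/* New kernel PDEs go into a boot-time master directory; other address
 * spaces pick them up on demand in the page fault handler. */
#define PD_ENTRIES      1024
#define KERNEL_PDE_BASE 768            /* e.g. kernel at 3 GB and above */
#define PDE_PRESENT     0x1

unsigned long master_kernel_pd[PD_ENTRIES];  /* filled at boot */

extern unsigned long *current_pd;            /* this address space's PD */

/* Called from the page fault handler; returns 1 if the fault was just a
 * stale kernel PDE we could copy from the master directory, 0 if it is
 * a real fault that needs normal handling. */
int lazy_kernel_fault(unsigned long fault_addr)
{
    unsigned long idx = fault_addr >> 22;    /* page directory index */

    if (idx < KERNEL_PDE_BASE)
        return 0;                            /* not a kernel address */
    if (!(master_kernel_pd[idx] & PDE_PRESENT))
        return 0;                            /* genuinely unmapped */

    current_pd[idx] = master_kernel_pd[idx]; /* copy the valid entry */
    /* No INVLPG needed: the CPU doesn't cache not-present entries. */
    return 1;                                /* retry the faulting access */
}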