OSDev.org

Posted: **Sun Apr 18, 2010 5:16 pm**

Hi,
I just finished the threading code in my kernel and now want to get to usermode as quickly as possible, to finally get out of the kernel.

My design is going to be a microkernel.
The address space layout is inspired by Mac OS X, where every process and the kernel each have their own virtual address space. The kernel isn't mapped anywhere in the usermode address space.

This leads to a problem on x86:
If an interrupt occurs, the processor needs to jump somewhere. Unlike on PowerPC, the CPU neither switches the page directory to the kernel's directory nor turns off paging, so it jumps to a memory location that isn't mapped. Bad!
So I need to switch it myself. My idea was to dedicate one page as a trampoline, mapped into every address space. The IDT points to several entry points inside the trampoline page; the code there switches the page directory if necessary and then jumps to the real handling code.
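
A minimal sketch of how IDT gates could point at entry stubs inside such a trampoline page. The base address `TRAMPOLINE_BASE`, the per-stub size `STUB_SIZE` and the selector `KERNEL_CS` are assumptions made up for illustration and do not come from the thread; only the 8-byte layout of a 32-bit interrupt gate is fixed by the architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the trampoline page is mapped at the same virtual
 * address in every address space, with one fixed-size stub per vector. */
#define TRAMPOLINE_BASE 0xFFFFF000u  /* assumed address of the shared page */
#define STUB_SIZE       8u           /* assumed bytes reserved per entry stub */
#define KERNEL_CS       0x08u        /* assumed kernel code segment selector */

/* 32-bit protected-mode interrupt gate descriptor (8 bytes). */
struct idt_gate {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  reserved;     /* must be zero */
    uint8_t  type_attr;    /* 0x8E = present, DPL 0, 32-bit interrupt gate */
    uint16_t offset_high;  /* handler address, bits 16..31 */
};

/* Address of the stub for a given vector inside the trampoline page. */
uint32_t stub_address(unsigned vector)
{
    assert(vector < 256);
    return TRAMPOLINE_BASE + vector * STUB_SIZE;
}

/* Build the gate that sends a vector to its trampoline stub. */
struct idt_gate make_gate(unsigned vector)
{
    uint32_t addr = stub_address(vector);
    struct idt_gate g;

    g.offset_low  = (uint16_t)(addr & 0xFFFFu);
    g.selector    = KERNEL_CS;
    g.reserved    = 0;
    g.type_attr   = 0x8E;
    g.offset_high = (uint16_t)(addr >> 16);
    return g;
}
```

Because every gate targets the shared page, the IDT contents stay the same no matter which page directory is active when the interrupt arrives. The stubs themselves (saving registers, loading CR3, jumping on) would have to be written in assembly and are not shown here.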
Mac OS X on x86 uses this approach, too. I opened a thread about it a few weeks ago because I couldn't find the relevant source code, and I still can't figure out how all this works in XNU.

Now there are several questions about how to do that:
How would you write such a trampoline? It needs to be loaded with the kernel and configured at runtime: both the page directory to switch to and the real entry points of the interrupt handlers have to be set. All this with as little work as possible.
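
One possible shape for the runtime configuration, as a rough sketch: a small data block that the trampoline stubs read, filled in once the kernel knows its page directory and handler addresses. The names `trampoline_config`, `trampoline_configure` and `trampoline_set_handler` are hypothetical, and the alignment check assumes CR3 is loaded without any cache-control flag bits set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 256

/* Hypothetical data block consumed by the trampoline stubs. */
struct trampoline_config {
    uint32_t kernel_cr3;             /* physical address of the kernel page directory */
    uint32_t handlers[NUM_VECTORS];  /* real handler entry points, valid after the CR3 switch */
};

/* Set the page directory to switch to and give every vector a default handler. */
void trampoline_configure(struct trampoline_config *cfg,
                          uint32_t kernel_cr3, uint32_t default_handler)
{
    assert(cfg != NULL);
    assert((kernel_cr3 & 0xFFFu) == 0);  /* page directory is 4 KiB aligned */

    cfg->kernel_cr3 = kernel_cr3;
    for (unsigned i = 0; i < NUM_VECTORS; i++)
        cfg->handlers[i] = default_handler;
}

/* Register the real entry point for a single vector. */
void trampoline_set_handler(struct trampoline_config *cfg,
                            unsigned vector, uint32_t handler)
{
    assert(cfg != NULL);
    assert(vector < NUM_VECTORS);
    cfg->handlers[vector] = handler;
}
```

With a table like this the stub code never needs patching: it loads `kernel_cr3` and then jumps through `handlers[vector]`, so all runtime configuration is plain data writes.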
Another thing is the x86 TSS: when an interrupt occurs in usermode, the CPU switches to the stack segment and stack pointer stored in the current TSS. If the kernel stack is only mapped in the kernel page directory, this is a problem: the CPU tries to push the return information onto a stack that isn't mapped, and faults.
If I only map it in the process's usermode page directory, the stack becomes invalid as soon as the trampoline switches address spaces, and the CPU would fault, too.
I don't really like the obvious solution, mapping this stack into both address spaces, since it brings a lot of management work with it. I would have to handle two address spaces, keep the location unique for every process in both of them, etc.
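
To make the constraint concrete, here is a hedged sketch of the ring-0 stack fields in the TSS. It assumes one arrangement that is not proposed in the thread: a single small entry stack in a page shared by all address spaces, with the handler moving to the real kernel stack only after the CR3 switch. `ENTRY_STACK_BASE`, `ENTRY_STACK_SIZE` and `tss_set_entry_stack` are hypothetical names; the 104-byte TSS layout itself is architectural.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page that is mapped at the same address in every address space. */
#define ENTRY_STACK_BASE 0xFFFFE000u
#define ENTRY_STACK_SIZE 0x1000u

/* 32-bit TSS without an I/O permission bitmap (104 bytes). */
struct tss32 {
    uint32_t prev_task;                 /* only the low 16 bits are used */
    uint32_t esp0, ss0;                 /* stack loaded on a ring 3 -> ring 0 switch */
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt;
    uint16_t trap;                      /* bit 0: debug trap flag */
    uint16_t iomap_base;
};

/* Point the ring-0 stack at the shared entry stack. The asserts encode the
 * requirement from the post: the stack must be mapped both before and after
 * the trampoline switches page directories. */
void tss_set_entry_stack(struct tss32 *tss, uint16_t kernel_ss, uint32_t stack_top)
{
    assert(tss != 0);
    assert(stack_top > ENTRY_STACK_BASE);
    assert(stack_top - ENTRY_STACK_BASE <= ENTRY_STACK_SIZE);
    assert((stack_top & 3u) == 0);      /* keep the stack dword aligned */

    tss->ss0  = kernel_ss;
    tss->esp0 = stack_top;
}
```

In this arrangement only one fixed location has to be kept valid in both address spaces, rather than a separate kernel stack per process. Whether that is less management work in practice would depend on the rest of the kernel.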

Any ideas about that?

Posted: **Sun Apr 18, 2010 9:08 pm**

Andy,

I read your old thread. There is another thread somewhere that describes the minimum required to do what you're suggesting. (idt, gdt, tss etc) but I cannot find it.

Do you know why they did it this way? Is there any advantage over mapping the kernel in every address space? There must certainly be a cost involved.

BTW, I don't mean to discourage you from doing it. On the contrary, I'm very interested to see the result.

- gerryg400

Posted: **Sun Apr 18, 2010 9:22 pm**

gerryg400 wrote:I read your old thread. There is another thread somewhere that describes the minimum required to do what you're suggesting. (idt, gdt, tss etc) but I cannot find it.

I'll try to find it by myself. Maybe I'm luckier than you

gerryg400 wrote:Do you know why they did it this way? Is there any advantage over mapping the kernel in every address space? There must certainly be a cost involved.

I think Apple did this just because they wanted to maintain compatibility between PowerPC and x86 with as little work as possible. To me it seems they just didn't want to redesign the whole low-level interrupt handling.
Sure, they are trashing the TLB every time, but nevertheless it seems they are pretty fast.

gerryg400 wrote:BTW, I don't mean to discourage you from doing it. On the contrary, I'm very interested to see the result.

And you wouldn't be successful with it

I just thought to give this approach a try. It's something new and relatively uncommon. I hope the reason for that isn't that it's slow and unusable

I just remembered that a German book, "Linux Kernelarchitektur", which describes the implementation of the Linux kernel, mentions a Linux patch that gives the Linux kernel its own address space.
I just hope it still exists and I can find it.

edit:
Found the Linux patch: http://lwn.net/Articles/39283/
Let's have a look...

Posted: **Mon Apr 19, 2010 1:18 am**

Hi,

Andy1988 wrote:
gerryg400 wrote:I read your old thread. There is another thread somewhere that describes the minimum required to do what you're suggesting. (idt, gdt, tss etc) but I cannot find it.
I'll try to find it by myself. Maybe I'm luckier than you

That may have been my post: http://forum.osdev.org/viewtopic.php?f= ... 89#p172589

This discusses the absolute minimum you'd need in all processes from a theoretical perspective; and there's probably better ways with slightly less overhead and slightly more in each address space. For example, using a full IDT with 256 entries would be more sane and would only cost about 2 KiB extra.
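
The "about 2 KiB" figure follows from the descriptor size: each protected-mode IDT gate is 8 bytes, so a full table of 256 entries takes 2048 bytes. A small sketch of that arithmetic, with the helper name `idt_bytes` made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define IDT_GATE_BYTES 8u  /* size of one 32-bit protected-mode gate descriptor */

/* Memory needed for an IDT with the given number of entries. */
size_t idt_bytes(unsigned entries)
{
    assert(entries >= 1 && entries <= 256);
    return (size_t)entries * IDT_GATE_BYTES;
}
```

For example, `idt_bytes(256)` gives 2048, while a table covering only the 32 exception vectors would take `idt_bytes(32)`, which is 256 bytes.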

Cheers,

Brendan

Posted: **Mon Apr 19, 2010 2:38 am**

I'm not sure there's any real advantage to what you're doing - you're not really creating a clean design because you still need to use part of the address space for the trampoline itself. Is the extra overhead (even if small) worth the few extra KiBs (you mentioned a microkernel design)? I would think not.

Posted: **Mon Apr 19, 2010 7:37 am**

I'm not sure if this technique will give you good enough performance for a true microkernel. XNU is not really a microkernel, since all drivers and OS services run in the kernel's address space. For a real microkernel, there will be a lot more TLB thrashing since the kernel has to be involved in every IPC operation between processes.

Posted: **Mon Apr 19, 2010 8:08 am**

Andy1988 wrote:So I need to switch it by myself. My idea was do dedicate one page as a trampoline, mapped into every address space. The IDT points to several entrypoints inside the trampoline page and switches the page directory if necessary and jumps to the real handling code after that.

Can't this be done by using Task Gates in the IDT?

Posted: **Mon Apr 19, 2010 8:42 am**

For the performance impact: I think I'll have it anyway.
All the communication between processes will be done over IPC. The only processes which do syscalls (and are allowed to) are the system daemons and drivers.

If a process wants to open a file it sends a message to the filesystem server which handles this appropriately and returns the expected values (if possible and security allows it). I'll have to switch the context anyway for that.

The reason for this "everything-is-done-with-messages"-approach is that it is possible to distribute several services over a network/serial-line etc. I don't care if it's practical or already exists or will be used by hundreds of other people. I just want to do it like that

And this would be even possible across different architectures.
And if you look at the projects page in the wiki, you will not find one hobby os which is distributable.
Sure, it will take me some time to get a message delivered to another process, even on the same host, but it's a hobby. Other people are working several years to look at a model train doing its rounds on a half completed track in some half-painted landscape

And if this trampoline approach goes wrong, I'll just drop it. It's not about efficiency or getting the code done to meet some kind of a deadline because I'm getting money for it. It's about learning things in computer science and trying new stuff.
There is a great feature in git, which is called branches. Very usable for fooling around with your source code and dropping or merging it into the master branch afterwards

@Hobbes
Yes, seems to be doable.
I can set CR3 in the "IRQ-TSS" and do a far jump to the task gate. The page directory should get switched then.
Thanks.

Posted: **Mon Apr 19, 2010 2:26 pm**

Hi,

Andy1988 wrote:The reason for this "everything-is-done-with-messages"-approach is that it is possible to distribute several services over a network/serial-line etc. I don't care if it's practical or already exists or will be used by hundreds of other people. I just want to do it like that And this would be even possible across different architectures.

It is possible, it has been done before, and the most well known example is MPI (which has been ported to just about every OS).

For something like allocating RAM for a process, using messaging is insane. First you need to allocate RAM for the message somehow, then the message must go to the same computer (no point allocating "remote RAM" for a process). I have similar objections to using messaging for accessing the scheduler's functions (although depending on how it's done it may make sense). In both cases, just use a kernel API and be done with it (the cost of using a kernel API to send the message will probably be double the overhead of using a kernel API to access memory manager and scheduler functions directly).

For file I/O, it's important to realise that processes pound the daylights out of the VFS (and "stat()" is often the most frequently used file I/O function, not "open()", "read()" or "write()"). For this reason it's very important for the VFS to cache things locally (especially directory information). This means that messages for file I/O go from a process to the local VFS/cache (and not directly to a local or remote file system, or directly to a remote VFS), partly because the VFS cache is important to avoid constant disk access, but also (for distributed systems) to avoid network latency.

For other cases, you end up with groups of processes that all talk to each other a lot (but don't talk to processes outside of this group much). In this case it's important to keep the processes within the group on the same computer. In the same way, in some situations it's better to duplicate processes to avoid networking. For example, if several computers want to use the same "font engine" service, don't be afraid to run multiple "font engine" services where each is "close" to where it's being used.

Of course none of this has anything to do with putting the kernel in its own address space (or using a micro-kernel).

Andy1988 wrote:And if you look at the projects page in the wiki, you will not find one hobby os which is distributable.

I know there's at least one distributed OS project on the wiki's projects page (and I'd be surprised if there weren't others).

Andy1988 wrote:Sure, it will take me some time to get a message delivered to another process, even on the same host, but it's a hobby. Other people are working several years to look at a model train doing its rounds on a half completed track in some half-painted landscape

How many people poke themselves in the eye just to see what happens? I'd guess almost none - the outcome is too easy to predict, and the most likely outcome isn't "good". Even if poking yourself in the eye was a hobby, it still wouldn't make sense.

How many people design micro-kernels where the kernel is in its own address space, just to see what happens? I'd guess almost none, for the same reason that people don't go around poking themselves in the eye all the time.

Andy1988 wrote:Yes, seems to be doable.
I can set CR3 in the "IRQ-TSS" and do a far jump to the task gate. The page directory should get switched then.
Thanks.

Task gates (and the hardware task switching mechanism in general) are not reentrant. This means you'll probably end up with a TSS for each interrupt handler for each CPU (e.g. for 100 interrupt handlers and a quad-core CPU, that's 400 TSSs for the kernel; all mapped into every address space). On top of that you'd need a TSS for each process (which don't need to be mapped into every address space; but do need to be mapped at the same virtual address in the kernel's address space and the process' address space). It's a logistical nightmare...
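
The count in the example above can be worked through as follows. This sketch assumes the minimum 32-bit TSS size of 104 bytes (no I/O permission bitmap) and ignores the per-process TSSs and the GDT descriptors each TSS also needs; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TSS32_MIN_BYTES 104u  /* 32-bit TSS without an I/O permission bitmap */

/* One TSS per interrupt handler per CPU, since task gates are not reentrant. */
size_t kernel_tss_count(unsigned handlers, unsigned cpus)
{
    assert(handlers > 0 && cpus > 0);
    return (size_t)handlers * cpus;
}

/* Lower bound on the memory those TSSs occupy in every address space. */
size_t kernel_tss_bytes(unsigned handlers, unsigned cpus)
{
    return kernel_tss_count(handlers, cpus) * TSS32_MIN_BYTES;
}
```

For 100 handlers on a quad-core machine, `kernel_tss_count(100, 4)` is 400 and `kernel_tss_bytes(100, 4)` is 41600 bytes, all of which would have to be mapped into every address space.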

Cheers,

Brendan

Posted: **Mon Apr 19, 2010 2:54 pm**

Brendan wrote: For something like allocating RAM for a process, using messaging is insane. First you need to allocate RAM for the message somehow, then the message must go to the same computer (no point allocating "remote RAM" for a process). I have similar objections to using messaging for accessing the scheduler's functions (although depending on how it's done it may make sense). In both cases, just use a kernel API and be done with it (the cost of using a kernel API to send the message will probably be double the overhead of using a kernel API to access memory manager and scheduler functions directly).

OK. "Everything is a message" may be overkill. That's true for tasks that only happen locally.

Brendan wrote:For file I/O, it's important to realise that processes pound the daylights out of the VFS (and "stat()" is often the most frequently used file I/O function, not "open()", "read()" or "write()"). For this reason it's very important for the VFS to cache things locally (especially directory information). This means that messages for file I/O go from a process to the local VFS/cache (and not directly to a local or remote file system, or directly to a remote VFS), partly because the VFS cache is important to avoid constant disk access, but also (for distributed systems) to avoid network latency.

Caching may be implemented in the responsible servers themselves.
The first thing to achieve would be doing operations on a remote file, before implementing caching. And as I'm not even really in usermode yet, I just don't care at the moment

I just wanted to say that the messaging stuff will be modular (at least I hope so) with several backends for local function calls, shared memory, serial, TCP/IP, $whatever as a communication channel.
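
A rough sketch of what such a pluggable messaging layer could look like in C: a table of function pointers per backend, plus one trivial in-memory backend standing in for the "local" case. All names here (`msg_backend`, `loopback_backend`, `msg_send`, `msg_recv`) are hypothetical and not taken from the poster's kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Interface every transport (local call, shared memory, serial, TCP/IP)
 * would implement. */
struct msg_backend {
    const char *name;
    int (*send)(void *ctx, const void *buf, size_t len);  /* 0 on success, -1 on error */
    int (*recv)(void *ctx, void *buf, size_t cap);        /* bytes received, -1 if none */
};

/* Trivial local backend: a single-slot in-memory mailbox. */
struct loopback {
    unsigned char data[64];
    size_t len;
    int full;
};

static int loopback_send(void *ctx, const void *buf, size_t len)
{
    struct loopback *lb = ctx;
    if (lb->full || len > sizeof lb->data)
        return -1;
    memcpy(lb->data, buf, len);
    lb->len = len;
    lb->full = 1;
    return 0;
}

static int loopback_recv(void *ctx, void *buf, size_t cap)
{
    struct loopback *lb = ctx;
    if (!lb->full || cap < lb->len)
        return -1;
    memcpy(buf, lb->data, lb->len);
    lb->full = 0;
    return (int)lb->len;
}

const struct msg_backend loopback_backend = {
    "loopback", loopback_send, loopback_recv
};

/* Callers only see these two functions; the transport is chosen by `be`. */
int msg_send(const struct msg_backend *be, void *ctx, const void *buf, size_t len)
{
    assert(be != NULL && be->send != NULL);
    return be->send(ctx, buf, len);
}

int msg_recv(const struct msg_backend *be, void *ctx, void *buf, size_t cap)
{
    assert(be != NULL && be->recv != NULL);
    return be->recv(ctx, buf, cap);
}
```

A serial or TCP/IP backend would fill in the same two function pointers, so code that sends a message would not need to know whether the receiver is local or remote.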

Brendan wrote:For other cases, you end up with groups of processes that all talk to each other a lot (but don't talk to processes outside of this group much). In this case it's important to keep the processes within the group on the same computer. In the same way, in some situations it's better to duplicate processes to avoid networking. For example, if several computers want to use the same "font engine" service, don't be afraid to run multiple "font engine" services where each is "close" to where it's being used.

Sure, I didn't mean to use one computer for every process running on the whole system. How and if you communicate with services over a remote connection is up to the user who configures all this stuff.
If the user wants to move a whole lot of graphics data onto another node which does some 3D calculation there, and wants the rendered images delivered back over a high-latency connection, he may do it.

Brendan wrote:
Andy1988 wrote:Sure, it will take me some time to get a message delivered to another process, even on the same host, but it's a hobby. Other people are working several years to look at a model train doing its rounds on a half completed track in some half-painted landscape
How many people poke themselves in the eye just to see what happens? I'd guess almost none - the outcome is too easy to predict, and the most likely outcome isn't "good". Even if poking yourself in the eye was a hobby, it still wouldn't make sense.

How many people design micro-kernels where the kernel is in its own address space, just to see what happens? I'd guess almost none, for the same reason that people don't go around poking themselves in the eye all the time.

Currently I have two branches: one with the kernel living at 0x0 and the other at 0xC0000000. The code isn't that different; only some constants and the paging initialization change.
The only thing I need to implement for testing it is this trampoline. After getting a process running in userspace I could easily do some benchmarks.
I'm not doing a total redesign of my kernel.
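
As an illustration of how little has to differ between the two branches, the kernel base could be a single build-time constant from which the paging code derives its indices. This sketch assumes classic 32-bit paging with 4 KiB pages and no PAE; `KERNEL_BASE` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed build-time switch: 0xC0000000u for a higher-half kernel,
 * 0x0u for a kernel that lives in its own address space. */
#ifndef KERNEL_BASE
#define KERNEL_BASE 0xC0000000u
#endif

/* Index into the page directory (top 10 bits of the address). */
unsigned pd_index(uint32_t vaddr)
{
    return vaddr >> 22;
}

/* Index into the page table (middle 10 bits of the address). */
unsigned pt_index(uint32_t vaddr)
{
    return (vaddr >> 12) & 0x3FFu;
}

/* First page directory entry used by the kernel image. */
unsigned kernel_first_pde(uint32_t kernel_base)
{
    assert((kernel_base & 0x3FFFFFu) == 0);  /* base must be 4 MiB aligned */
    return pd_index(kernel_base);
}
```

With the kernel at 0xC0000000 the first kernel entry is page directory slot 768; with the kernel at 0x0 it is slot 0, and the paging initialization only has to start filling entries from a different index.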

But anyway, I think I'm going to implement the APIC-stuff and SMP first since these features are very big changes and it's better to do it as early as possible.

Brendan wrote:
Andy1988 wrote:Yes, seems to be doable.
I can set CR3 in the "IRQ-TSS" and do a far jump to the task gate. The page directory should get switched then.
Thanks.
Task gates (and the hardware task switching mechanism in general) are not reentrant. This means you'll probably end up with a TSS for each interrupt handler for each CPU (e.g. for 100 interrupt handlers and a quad-core CPU, that's 400 TSSs for the kernel; all mapped into every address space). On top of that you'd need a TSS for each process (which don't need to be mapped into every address space; but do need to be mapped at the same virtual address in the kernel's address space and the process' address space). It's a logistical nightmare...

Not a good idea, you are right.


Writing an interrupt trampoline
