OSDev.org

Posted: **Sun Jul 24, 2005 12:45 am**

I'm thinking about how to tackle memory management in my ukernel, and I've been reading up on L4. In a nutshell, it has 3 MM system calls -- map, grant, and unmap. These allow for the recursive construction of address spaces by user-level pagers.

Has anyone else looked at the techniques L4 uses for memory management? If so, I'm curious what you think about its advantages and disadvantages. What kind of techniques or abstractions do you use in your ukernel to allow for user-level pagers (if any)?

Posted: **Sun Jul 24, 2005 8:53 am**

Hello,
I've had a quite extensive look at the L4 kernel and in my opinion the memory management system is pretty good, although not perfect. You've already mentioned the main advantage which is that L4 allows user-level memory management, something that is a pretty unique feature for an OS, allowing the implementation of specialized managers which would boost the development of new MM policies (the main reason why inovation is rather rare in OS development is that kernels are to static). Since the abstractions grant, map & flush/unmap (don't know which of the two terms is more up to date) are very low-level all kinds of policies are possible and can be implemented in a reasonable efficient way. A second adavantage is of course that the kernel itself is kept policy free and slim because the actual memory management was out-sourced to a user-space sever. This also means that it's less vulnerarable to bugs and therefore more stable, one of the main goals of the ?kernel design.
Of course there are also some problems with the design, but they are rather small and some can probably be fixed. On of the disadvantages is, for example, that there is no clear cut between physical memory and virtual memory since the three memory management primitives work with virtual addresses only. In most cases it doesn't matter where in memory you are, but for some purposes such as DMA it is vital to know the physical address of a page. This doesn't mean that DMA is impossible in L4, because there's still sigma0 (memory root server) which holds the whole physical-memory mapped idem-potentialy, but it certainly makes it more complex. Another downside is that memory management is intimatly intertwined with IPC in L4 as grant, map aren't system-calls but part of a message sent to the mapping receiver. While this increasing performance significantly, it's also a bit confusing for somebody new to the OS. This is however more of a problem of the L4 implementation, not of the concept itself.

What kind of techniques or abstractions do you use in your ukernel to allow for user-level pagers (if any)?

I'm right now working on a concept for an exo-kernel:
Physical memory will be divided into 4KB page-frames and each of them will be associated with a 32bit capability that serves as a key needed for access. Then a quota is created that spans the whole memory range allowing it's owner to allocate fpages and is sent to the root memory-server. The root-server can now either divide it's quota and pass it on to subsequent memory managers or allow applications to allocate memory using it's quota:

[pre]
system-calls:
cap AllocatePageFrame(fpage , quota)
ret MapPageFrame(context, fpage, cap)

user-lib:
fpage CreateFlexPage(IO|MEMORY, base, size)
[/pre]

examples:
(App wants to allocate page-frame 0x100000 and map it to 0x123456 in it's address-space)

1) Send message to the root-server and ask for the page-frame
2) Root server uses it's quota to allocate the page-frame and returns the received capability to the user
3) User maps the page-frame to it's address-space using the capability

(App now divides to share the page with another app)
1) App 1 sends the capability to app 2
2) App 2 maps it to it's address-space

(Data-base app wants to allocate some memory every now and then)
1) App asks the memory-server to create a quote of a specific size
2) Root creates the quote by forging it from its own
3) App can now directly allocate page-frames using its own quota without having to call the mem-server and thus saving context switches

The quota is meant to give apps more liberties in order to increase performance by avoiding stacked managers. Unfortunately MM isn't a very good example to show this, but with other tasks, such as hard-disk management (disk:patition:fs:file) the advantage should get obvious.

regards,
gaf

Posted: **Sun Jul 24, 2005 10:42 am**

Hi,

Ok - I'm confused.

Imagine a thread is using allocation on demand. In a more traditional OS, if the thread accesses a non-present page you end up with page fault (user->kernel transition) followed by a page being mapped and then the page fault returns (kernel->user transition). For L4 you could end up with something like:

a) user->kernel transition (initial page fault)
b) sendMessage, thread switch and kernel->user transition (kernel asks the pager what to do)
c) user->kernel transition (pager calls the MM)
d) thread switch, kernel->user transition (MM returns to initial thread)

The "recursive" nature of it means that this pager may not be able to handle the request and may need to ask it's parent pager, who may need to ask it's parent pager, all the way up to the root address space - the overhead could be multiplied by the number of user-level pagers.

Now am I wrong or would L4's recursive user-level pagers add a huge amount of overhead (excessive thread switches and user<->kernel transitions)?

gaf wrote:You've already mentioned the main advantage which is that L4 allows user-level memory management, something that is a pretty unique feature for an OS, allowing the implementation of specialized managers which would boost the development of new MM policies (the main reason why inovation is rather rare in OS development is that kernels are to static).

This is what I figured - the main benefit is "research" rather than anything practical.

gaf wrote:A second adavantage is of course that the kernel itself is kept policy free and slim because the actual memory management was out-sourced to a user-space sever.

I'd call a "policy free" kernel a disadvantage. The idea behind an OS is to provide a consistant environment for application software, which remains uneffected by variations in hardware or other software. With a policy free kernel this consistancy can be easily broken - good for research, but bad for practical purposes (bad for that poor programmer at the end of the line who's trying to figure out what went wrong, or why it worked on computer A but not computer B).

gaf wrote:This also means that it's less vulnerarable to bugs and therefore more stable, one of the main goals of the ?kernel design.

Yes and no. The kernel itself would be less vulnerable to bugs, but user level software would be vulnerable to all the bugs in all the pagers above it.

I'd also be concerned with security. If your software doesn't trust the user-level pager (and all of the user-level pagers above it) how can it prevent the user-level pager from accessing it's data? For a traditional kernel you need to be able to trust the kernel itself, but for L4 it seems anyone would be able to write a user-level pager and make it run above your software somewhere.

Another problem I see is that initially all physical memory is mapped within the root address space. For some computers physical memory is larger than the virtual address space - e.g. 32 bit 80x86 with 4 Gb virtual address spaces and up to 64 GB of RAM. I assume L4 would have a lot of trouble supporting PAE...

Cheers,

Brendan

Posted: **Sun Jul 24, 2005 12:00 pm**

Brendan wrote:The "recursive" nature of it means that this pager may not be able to handle the request and may need to ask it's parent pager, who may need to ask it's parent pager, all the way up to the root address space - the overhead could be multiplied by the number of user-level pagers.

I suppose someone could always set up a pathological case like that. I would imagine that for more practical applications, there would be only one user-level pager (at least, that's what I'd do).

Now am I wrong or would L4's recursive user-level pagers add a huge amount of overhead (excessive thread switches and user<->kernel transitions)?

In the pathological case, maybe. In a more typical case with only one pager managing "sigma zero" (the initial address space), I think the context switch overhead would pale in comparison to the overhead of the disk I/O required to handle the page fault.

I'd call a "policy free" kernel a disadvantage. The idea behind an OS is to provide a consistant environment for application software, which remains uneffected by variations in hardware or other software. With a policy free kernel this consistancy can be easily broken

You're confusing "kernel" and "OS" here. Of course an OS has policies -- L4 by itself is not an OS. I think the idea is that you would take your microkernel, combine it with some "application-specific" servers (including pagers), and the result is a particular OS targeted for a particular purpose (desktop, server, embedded, etc.)

gaf wrote:This also means that it's less vulnerarable to bugs and therefore more stable, one of the main goals of the ?kernel design.

Yes and no. The kernel itself would be less vulnerable to bugs, but user level software would be vulnerable to all the bugs in all the pagers above it.

Which is why for practical purposes I'd only have one user-level pager.

I'd also be concerned with security. If your software doesn't trust the user-level pager (and all of the user-level pagers above it) how can it prevent the user-level pager from accessing it's data? For a traditional kernel you need to be able to trust the kernel itself, but for L4 it seems anyone would be able to write a user-level pager and make it run above your software somewhere.

It's up to the system as a whole to prevent someone from installing a malicious pager. This implies that at least the root-level pager must be trusted and can't allow for any funny business. The kernel by itself cannot guarantee that, but it's not the whole "OS" anyway.

Another problem I see is that initially all physical memory is mapped within the root address space. For some computers physical memory is larger than the virtual address space - e.g. 32 bit 80x86 with 4 Gb virtual address spaces and up to 64 GB of RAM. I assume L4 would have a lot of trouble supporting PAE...

Yep, good point. Also it seems to me that "small spaces" in L4 won't work in long mode, since there's no more segmentation.

Posted: **Sun Jul 24, 2005 12:28 pm**

Hi guys!

Now am I wrong or would L4's recursive user-level pagers add a huge amount of overhead (excessive thread switches and user<->kernel transitions)?

I think in OS development whenever you try to add excessive stability, flexibility or security, you trade away some speed...and microkernels generally do. But, OTOH, how often does a page fault really happen during normal execution, compared to the number of cycles spent on running the app's code? If you want to build a faster system on top of L4, you simply should avoid stacking pagers too much.

This is what I figured - the main benefit is "research" rather than anything practical.

Hasn't everything practical we are using today once started as a research project?

I'd call a "policy free" kernel a disadvantage. The idea behind an OS is to provide a consistant environment for application software, which remains uneffected by variations in hardware or other software.

I think kernel != OS. As an app developer, you probably wouldn't write software for L4, you write it for a specific OS, built on top of L4 (or an OS subsystem). And when an installation claims to support a specific subsystem, this should guarantee to run your software without any problems.

For the security and trust issue:
Again, the kernel doesn't form the OS. When an app trusts a pager, it must also respect that this pager might trust another set of pagers above itself. It isn't like anyone could log into the system, run a random pager and make it the root pager. The OS will give a process only as much privilege as it deserves, in order to remain a trusted entity itself.

OK, these are my thoughts on this.

I think L4 is forcing policy out of the kernel a bit too much. I'd prefer to have at least capability support in the kernel, a'la CoyotOS, which integrates memory management with capabilities then.

EDIT: @Colonel Kernel: We seem to have written our postings in parallel

, but we have similar opinions.

cheers
Joe

Posted: **Sun Jul 24, 2005 1:06 pm**

The "recursive" nature of it means that this pager may not be able to handle the request and may need to ask it's parent pager, who may need to ask it's parent pager, all the way up to the root address space - the overhead could be multiplied by the number of user-level pagers.

When a page-fault occures, L4 switches to a page-fault handler that was defined when the task was created. This handler is expected to fix the problem by either bringing the page back in from disk or terminating the app if an access-violation occured. In the first case it's neccessary to first get the data from the disk-driver and then load it to a page that was allocated from the apps pager. It is not expected that the pager has to refer to any of its parent memory-managers but rather that it simply uses some page from a internal pool. You also over-estimate the number of pagers a normal system has, more than 2 levels (root, servers) are seldomly necessary.
A page-fault will therefore mean 4 context switches, which might sound like quite a lot, but is not that bad if you consider that L4 has an extremly fast context transition mechanism that requires several times fewer cycles than that of any other ?kernel, leave alone monlithic kernels.

Nevertheless I agree that stacked managers are one of the biggest problems of the ?kernel design (eg disk-management), which is one of the reason why I'm going for an exo-kernel.

This is what I figured - the main benefit is "research" rather than anything practical.

No, the main advantage is that results from research can now be used practicaly. What I wanted to say is that there are a lot of new ideas and concepts in OS dev (->SOSP) but that it's very hard to add them to todays operating systems because of their monolithic and inflexible design.
Apart from that the possiblity of having several managers running simulatiously allows different policies that can fit the needs of apps much better than the single policy offered by a mono-kernel which always has to be a compromise that works for all, so that overall-performance will improve.

I'd call a "policy free" kernel a disadvantage. The idea behind an OS is to provide a consistant environment for application software, which remains uneffected by variations in hardware or other software. With a policy free kernel this consistancy can be easily broken - good for research, but bad for practical purposes (bad for that poor programmer at the end of the line who's trying to figure out what went wrong, or why it worked on computer A but not computer B).

Can you give me any reason why this consistency shouldn't be provided by user-space servers or libraries ?
What changes for the app programmer if he now has to rely on a server rather than the kernel ?
Mone-kernels are only consider to be consistent because they don't evolve, but there are better means to do so than to cease any inovation (backward-compability, COM-like interfaces, etc).

Yes and no. The kernel itself would be less vulnerable to bugs, but user level software would be vulnerable to all the bugs in all the pagers above it.

Yes, but I still prefer a user-level server crashing over a triple-fault, because then there's still the possibility to recover. Apart from that it's not very funny at all to debug the kernel..

I'd also be concerned with security. If your software doesn't trust the user-level pager (and all of the user-level pagers above it) how can it prevent the user-level pager from accessing it's data? For a traditional kernel you need to be able to trust the kernel itself, but for L4 it seems anyone would be able to write a user-level pager and make it run above your software somewhere.

Same as for "consistency by the kernel". I don't see why anything should change just because we are now in user-space. Due to the modular nature of micro-kernels you'd normally use somekind of name-server that acts as a central instance that coordinates the servers and helps apps to find their pager etc..

regards,
gaf

Posted: **Sun Jul 24, 2005 2:05 pm**

Hi,

Colonel Kernel wrote:
Now am I wrong or would L4's recursive user-level pagers add a huge amount of overhead (excessive thread switches and user<->kernel transitions)?
In the pathological case, maybe. In a more typical case with only one pager managing "sigma zero" (the initial address space), I think the context switch overhead would pale in comparison to the overhead of the disk I/O required to handle the page fault.

If the only thing the pager handles is the swap, there's not much point in allowing multiple pagers (different methods of determining which page to send to disk when free RAM runs out, perhaps). I was thinking of things like allocation on demand, shared pages and normal page allocation/free (where there is no disk IO).

In a more typical case with only one pager managing "sigma zero", you might as well build that pager into the kernel and save yourself the additional overhead. Saying that all the disadvantages of allowing multiple pagers don't apply if you only use one pager doesn't make a lot of sense to me...

One benefit that I've overlooked is the ability to emulate other OS's memory managers - one root pager, plus another pager for windows executables, another pager for freeBSD executables, etc. This is the only case where L4's scheme does make some sense to me, but I'd still prefer one pager built into the kernel that can handle all the different address space types.

Colonel Kernel wrote:
I'd call a "policy free" kernel a disadvantage. The idea behind an OS is to provide a consistant environment for application software, which remains uneffected by variations in hardware or other software. With a policy free kernel this consistancy can be easily broken
You're confusing "kernel" and "OS" here. Of course an OS has policies -- L4 by itself is not an OS. I think the idea is that you would take your microkernel, combine it with some "application-specific" servers (including pagers), and the result is a particular OS targeted for a particular purpose (desktop, server, embedded, etc.)

Ok, so you'd have a general purpose pager (supporting the OS's standard MM policy) for all native L4 applications, and then additional "applications specific" pagers for applications with special needs that the general purpose pager can't satisfy. What are those special needs, and why can't the general purpose pager support them? This is where I'm having trouble understanding how the benefits justify the overheads...

Colonel Kernel wrote:Also it seems to me that "small spaces" in L4 won't work in long mode, since there's no more segmentation.

Doh! I forgot about that - in L4 (with small address spaces) sending a message to another thread can be as fast as a far jump from CPL=3 to CPL=3 with a few checks (best case). No more small address spaces means this'll change to a CPL=3 to CPL=0 transition, an address space switch (and TLB flush) and a CPL=0 to CPL=3 transition. That'll slow their IPC down...

Cheers,

Brendan

Posted: **Mon Jul 25, 2005 8:20 am**

If the only thing the pager handles is the swap, there's not much point in allowing multiple pagers (different methods of determining which page to send to disk when free RAM runs out, perhaps). I was thinking of things like allocation on demand, shared pages and normal page allocation/free (where there is no disk IO).

In fact choosing the right page to be evicted from memory is the core of any paging algorithmn and therefore it does make a difference. And of course you can also use pagers for the purposes you have proposed - they are the very reason why L4 supports user-level pager: Allowing you to simply write a new pager based on your own idea.

In a more typical case with only one pager managing "sigma zero", you might as well build that pager into the kernel and save yourself the additional overhead. Saying that all the disadvantages of allowing multiple pagers don't apply if you only use one pager doesn't make a lot of sense to me...

Even if there's only one user-level pager, the user still has the choice of chosing which one shall be used. Nevertheless I agree that it's doesn't make much sense to support multiple pagers if only one is run anyway.

A normal L4 system would therefore use two levels of pagers:
1. Physical Memory Manager (->sigma0)
2. User-Level pagers

[pre]
sigma0
|
o-----o-----o
| | |
p1 p2 p3
[/pre]

With this configuration an unlimited number of pagers can run in parallel without causing issues due to overhead caused by stacking. At system start-up each of the user-space managers has the possibility to allocate a small pool of pages from the sigma0 server which it can then use to answer requests from apps. Further communication with sigma0 is only neccessary when the pool ran empty or doesn't suffice to satisfiy the request.

What are those special needs, and why can't the general purpose pager support them?

A gereral purpose pager has to use a policy that allows all different kind of tasks to run (database, gaming, calculation...). Since these tasks have very different requirements, this policy neccessarily has to be a trade-off that is acceptable for all, which however also means that no app is supported perfectly. If apps are given the possiblity to chose their own policy their performance can be increased significantly because they won't have to pay for all the abstraction and compability overhead.

Examples:
(1) File Managemant
App a = game
large data packets, long access times acceptable

App b = video-capture
continuous stream of data (-> realtime)

Here a trade-off has to be made between access-time and throughput. Apart from that all applications are forced to use the file abstractions offered by the OS although the video-capturing task could life very well without it.

(2) Task Management
App a = Interactive or I/O bound
Wants to be scheduled as fast as possible to keep the device busy or to respond to a user command

App b = Calculations (CPU bound)
Doesn't want to be interrupted in order to minimize overhead

Since it's impossible to have it both, the scheduling policy in the kernel has to be based on a compromise:
-> Interactive task have to wait and get a worse response time
-> CPU bound task will be interrupted and need longer to finish

Doh! I forgot about that - in L4 (with small address spaces) sending a message to another thread can be as fast as a far jump from CPL=3 to CPL=3 with a few checks (best case). No more small address spaces means this'll change to a CPL=3 to CPL=0 transition, an address space switch (and TLB flush) and a CPL=0 to CPL=3 transition. That'll slow their IPC down...

I'm sorry having to disappoint you, but even then it will be much faster than any other IPC mechanism in use. Apart from that you can't blame the L4 designers for AMD's extremly poor 64bit design: In my opinion it's just amazing that they still refuse to realized that todays high task-switch costs *are* a problem for all operating systems and that it will get even worse in the future. Funny enough that something as simple as a tagged TLB, which can't be that expensive because IA-64 has it, would solve the problem and make me a bit more happy

regards,
gaf

Posted: **Mon Jul 25, 2005 11:10 am**

Hi,

gaf wrote:In fact choosing the right page to be evicted from memory is the core of any paging algorithmn and therefore it does make a difference. And of course you can also use pagers for the purposes you have proposed - they are the very reason why L4 supports user-level pager: Allowing you to simply write a new pager based on your own idea.

There's a major difference between evicting the "best" page that the current pager controls, and evicting the "best" page in the whole system (regardless of which pager). How L4 gets around this I've no idea - I can imagine a system where all pagers nominate their best page to evict and give it a score, and the page with the highest score is the one that's actually evicted, but it does make me wonder if L4 evicts the "best" page in the whole system, or if it compromises.

I guess my main problem with L4 was/is the overhead, because I'm used to IPC costing more than a system call. If IPC is faster than a system call then overhead is reduced by keeping as much as possible at CPL=3, which makes L4's memory management a better way to do things (even if you never usee more than one user-pager). Of course this argument breaks for page faults, which are effectively a forced system call (ie. allocation on demand).

gaf wrote:If apps are given the possiblity to chose their own policy their performance can be increased significantly because they won't have to pay for all the abstraction and compability overhead.

Examples:
(1) File Managemant
App a = game
large data packets, long access times acceptable

App b = video-capture
continuous stream of data (-> realtime)

Here a trade-off has to be made between access-time and throughput. Apart from that all applications are forced to use the file abstractions offered by the OS although the video-capturing task could life very well without it.

You're talking of memory mapped files? For App B I wouldn't use memory mapped files at all, but despite this you're making a good argument for allowing the block size to be adjustable and for file IO priorities. For example "int madvise (void *addr, size_t length, int advice, int prefered_block_size, int IO_priority)". I've probably missed something, but I'm not seeing a need for multiple pagers here.

(2) Task Management
App a = Interactive or I/O bound
Wants to be scheduled as fast as possible to keep the device busy or to respond to a user command

App b = Calculations (CPU bound)
Doesn't want to be interrupted in order to minimize overhead

Since it's impossible to have it both, the scheduling policy in the kernel has to be based on a compromise:
-> Interactive task have to wait and get a worse response time
-> CPU bound task will be interrupted and need longer to finish

Scheduling policy doesn't concern us here. What would concern us is the page eviction stategy, as it can constitute worse response times for App A and interruptions for App B. If one application is more important than the other then the compromise is easy (assuming the page eviction strategy takes the importance of all applications into account). If both aplications have equally high importance the solution is to buy more RAM, and nothing the an OS can do would change that. I'm still not seeing a case for multiple pagers.

I'm sorry having to disappoint you, but even then it will be much faster than any other IPC mechanism in use.

Here I'm more interested in IPC becoming slower than system calls. Once that happens I'd assume half of L4's design decisions would start to look shaky, and people will probably consider reversing some of them (for e.g. building the sigma0 pager into the kernel). It also means L4's "microkernels are faster than monolithic kernels" becomes false, which (IMHO) will put additional pressure on L4 to find a way to support small address spaces in long mode.

The ironic thing is that long mode supports "RIP relative addressing", making position independant code much easier/faster. Implementing small address spaces is possible in long mode if you're willing to dispose of all security/protection between the small address spaces and make all software use position independant code. It will be interesting to see if L4's developers consider fast IPC more important than security/protection...

An alternative would be for L4 not to use long mode on CPUs that support it. In this case L4 would still be limited to < 4 GB of RAM (or the sigma0 pager breaks) - please note that IIRC a 32 bit OS can use PAE to access all RAM on a 64 bit computer without long mode.

Cheers,

Brendan

Posted: **Mon Jul 25, 2005 1:52 pm**

There's a major difference between evicting the "best" page that the current pager controls, and evicting the "best" page in the whole system (regardless of which pager). How L4 gets around this I've no idea - I can imagine a system where all pagers nominate their best page to evict and give it a score, and the page with the highest score is the one that's actually evicted, but it does make me wonder if L4 evicts the "best" page in the whole system, or if it compromises.

Page eviction algorithms can in general be divided into the two categories 'local' (least frequently used page for a single task) and 'global' (whole system). For an OS that supports multiple pagers a third category could be introduced that tries to find the LRU page for all the tasks that use a pager.
Locally limited (local, pager) algorithms can perform pretty well but have problems when the size of a working-set varies during the life-span of a process. Some communication between the pagers is therefore needed in order to coordinate and make sure that the pages are shared fair among all tasks. The protocol for this is not part of the L4 kernel but has to be based on a user-space standard.
How it's done exactly in L4 is out of my knowledge, I can't even guaranty you that it already supports something like this because the kernel is still more of an "experimental" nature..

You're talking of memory mapped files? For App B I wouldn't use memory mapped files at all, but despite this you're making a good argument for allowing the block size to be adjustable and for file IO priorities. For example "int madvise (void *addr, size_t length, int advice, int prefered_block_size, int IO_priority)". I've probably missed something, but I'm not seeing a need for multiple pagers here.

I actually meant real files (readme.txt) which are also an abstraction by the kernel that is redundant for some tasks. Without them and the associated overhead (eg caching police), the video-capturing task might get less dropped frames.
Maybe it was a bit too radical an example, but in exo-kernels (L4 is in between micro- and exo-kernel) there are indeed no files but only sectors that can be written to freely.

Back to the topic..

By adding fields for block-size and IO-priority to your system-call, you parameterized it and thus started to slowly move the policy from kernel to user:
The more parameters you add, the more control is passed to the user and eventually your kernel will be entirely policy free. In fact the development from early monolithic systems to the still "exotic" exo-kernels can be seen as an evolutionary process:

*monolithic-kernel*
much policy in the kernel, few user control through paramters for system-calls

*1st gen ?-kernels (mach)*
policy partly outsourced (eg window sub-system), more user control through both parameters for system-calls and user-space managers

*2nd gen ?-kernels (L4, QNX)*
almost policy free, apps can decide almost everything by user their prefered manager

*exo-kernels*
no kernel abstractions, apps have full control over policies, user-space managers is reduced to the minimum so that apps have even more liberties

When giving apps more liberties by offering additional parameters to specify their requirements, problems occure pretty soon when malicious/buggy/ignorant software exploit these new liberties (eg waste resources of any kind). It is therefore neccessary that a central instance acts as a neutral arbitrator that ensures some minmum policy needed to make sure that every app is treated fair. This is what user-space managers are all about..

- The kernel is policy free and only offers mechanisms that are protected so that they can only be used by authorized tasks
- User-Space Managers ensure some basic policy to make sure that all apps can run
- Apps have much more liberties and can define more precisly what they need

(post goes on..)

Posted: **Mon Jul 25, 2005 1:59 pm**

Scheduling policy doesn't concern us here. What would concern us is the page eviction stategy, as it can constitute worse response times for App A and interruptions for App B. If one application is more important than the other then the compromise is easy (assuming the page eviction strategy takes the importance of all applications into account). If both aplications have equally high importance the solution is to buy more RAM, and nothing the an OS can do would change that. I'm still not seeing a case for multiple pagers.

Scheduling isn't that off-topic because L4 also allows it to be done in user-space. I just wanted to show that the idea applies to all parts of the OS..

As it goes for the page-eviction policy:
You extended the algorithm to take the task's privilege-level into account, but there are thousands of other factors that the user might want to have considered. For example he might want I/O tasks or tasks not to be paged-out to keep the system more responsible or the task's memory footprint to be considered. This would also lead to a policy free system with...pagers.

Here I'm more interested in IPC becoming slower than system calls. Once that happens I'd assume half of L4's design decisions would start to look shaky, and people will probably consider reversing some of them (for e.g. building the sigma0 pager into the kernel). It also means L4's "microkernels are faster than monolithic kernels" becomes false, which (IMHO) will put additional pressure on L4 to find a way to support small address spaces in long mode.

L4's IPC mechanism might be very fast but it naturaly can't be faster than a system-call because it's based on one. This means that it takes more time to, for example, switch to the pager, but you're IMHO dramatically over-estimating the results for the over-all system performance (normally <10%, I guess). L4 trades this performance for modularity and stability, and you should ask yourself if performance is really everything and if these "features" don't have a value aswell.
For the best possible performance you would have to use a DOS-like OS that runs all tasks in kernel mode and relies on cooperation (XBox for example) - everything in between is just is wishy-washy

Small-Spaces:
--------------
4GB address-space is divided into smaller sections using segmentation
[pre]o-----------------------o----o--------------o
| large space | k | small spaces |
o-----------------------o----o--------------o[/pre]

1. app(LS) -> kernel
2. kernel changs app's descriptors to cover the small-spaces portion of the address-space
3. kernel -> app(SS)

It will be interesting to see if L4's developers consider fast IPC more important than security/protection...

I can answer you this question straight away: No
Anything else would betray the principles of L4 and the ?-kernel design in general and therefore can't bet a serious option (might sounds pathetic, but they didn't create such a complex kernel design just to change their mind when a small performance problem arises).

An alternative would be for L4 not to use long mode on CPUs that support it. In this case L4 would still be limited to < 4 GB of RAM (or the sigma0 pager breaks) - please note that IIRC a 32 bit OS can use PAE to access all RAM on a 64 bit computer without long mode.

Since only very few computers have more than 4GB of phys memory, the normal 32bit compability-mode should do for the moment. Using PAE is however a good idea for the future that I didn't yet consider..

regards,
gaf

Posted: **Mon Jul 25, 2005 5:02 pm**

Hi,

gaf wrote:Scheduling isn't that off-topic because L4 also allows it to be done in user-space. I just wanted to show that the idea applies to all parts of the OS..

Scheduler policy seems partially determined by IPC (each message sent involves a thread switch directly to the message receiver) rather than being controlled entirely by the scheduler/s (which would require message queues). Now for true modularity, you'd need the option of replacing the IPC, or perhaps allowing different IPC modules to co-exist.

gaf wrote:L4's IPC mechanism might be very fast but it naturaly can't be faster than a system-call because it's based on one.

I thought they'd optimize IPC to use a cpl=3 to cpl=3 "jmp far" if the message sender and receiver share a page directory. The paper http://i30www.ira.uka.de/research/docum ... c-perf.pdf (Table 2, page 3) states 121 cycles for IPC on a 166 MHz Pentium, where 72 cycles are for the system call/ret (over half the cycles). I'm not sure what changes have been made since 1997...

In any case, if IPC is slower than system calls then using system calls instead of IPC would be better for performance (ie. building more into the kernel). IMHO this can be done without reducing flexibility in some cases (e.g. the sigma0 pager, which would also allow for PAE support).

This means that it takes more time to, for example, switch to the pager, but you're IMHO dramatically over-estimating the results for the over-all system performance (normally <10%, I guess). L4 trades this performance for modularity and stability, and you should ask yourself if performance is really everything and if these "features" don't have a value aswell.

Performance isn't everything, and modularity and stability are important, but like most things excessive use can lead to diminishing returns. I think we'd both agree that a stack of 50 user-pagers wouldn't be more stable, and having a huge amount of tiny modules wouldn't work well either. These examples are extreme, but the best possible compromise always exists between the extremes.

Cheers,

Brendan

Posted: **Tue Jul 26, 2005 8:54 am**

Scheduler policy seems partially determined by IPC (each message sent involves a thread switch directly to the message receiver) rather than being controlled entirely by the scheduler/s (which would require message queues).

You're right about that, IPC is really a problem and makes it very hard to move the scheduler to user-space. L4 therefore gives a scheduler the choice if it really wants to be called every single time a task has to re-scheduled, or if it would rather want to use the internal scheduler which is based on a simple round robin algorithm that supports priorities and can serve as a base for a more advanced user policy. The external scheduler can define the time-slice length and the priority class of its tasks while the internal scheduler does the task-switching.

Now for true modularity, you'd need the option of replacing the IPC, or perhaps allowing different IPC modules to co-exist.

In some way this is already supported because L4 IPC is so low-level that it's merely a mechanism that doesn't impose any policy. Any reasonable programmer would use some library or server on top of it to get more comfort and support for asynchronious messages, etc.

I thought they'd optimize IPC to use a cpl=3 to cpl=3 "jmp far" if the message sender and receiver share a page directory. The paper http://i30www.ira.uka.de/research/docum ... c-perf.pdf (Table 2, page 3) states 121 cycles for IPC on a 166 MHz Pentium, where 72 cycles are for the system call/ret (over half the cycles). I'm not sure what changes have been made since 1997...

I don't really understand in what ways the paper differs from what I've said:

1. Task switches to the kernel (systemcall)
2. Kernel executes some IPC code
3. Task's descriptors are changed to span the small-space
4. Kernel returns to the small-space

The big advantage of this scheme is that no context switch is needed and therefore a TLB flush (which would take several hundred to thousand cycles) isn't neccessary.

If it was possible to directly switch between large-space and small-space(s) using a far-jmp, security would be broken because the tasks could read/write each-other's data.

In any case, if IPC is slower than system calls then using system calls instead of IPC would be better for performance (ie. building more into the kernel). IMHO this can be done without reducing flexibility in some cases (e.g. the sigma0 pager, which would also allow for PAE support).

It might be better for performance but I'd also mean that the kernel had to be monolithic. The goal never was to create something faster than system-calls, but rather to get IPC fast enough to at least compete with them and thus allowing external managers.

regards,
gaf

Posted: **Tue Jul 26, 2005 1:40 pm**

Hi,

gaf wrote:
Scheduler policy seems partially determined by IPC (each message sent involves a thread switch directly to the message receiver) rather than being controlled entirely by the scheduler/s (which would require message queues).
You're right about that, IPC is really a problem and makes it very hard to move the scheduler to user-space. L4 therefore gives a scheduler the choice if it really wants to be called every single time a task has to re-scheduled, or if it would rather want to use the internal scheduler which is based on a simple round robin algorithm that supports priorities and can serve as a base for a more advanced user policy. The external scheduler can define the time-slice length and the priority class of its tasks while the internal scheduler does the task-switching.

That's not quite what I meant. Imagine you want to make sure that the highest priority thread that can run is always the thread that does run. Every time a high priority thread sends a message to a low priority thread the IPC would cause a thread switch to the low priority thread, so regardless of how the scheduler is implemented the low priority thread gets CPU time when it shouldn't.

Because IPC is used for almost everything, I'd guess that IPC often has more control over which threads get CPU time than the scheduler does, and the scheduler is often ignored (regardless of how it's implemented).

To implement any scheduling policy accurately you need to prevent IPC from causing undesirable thread switches. The only way this can be done is with message queues.

gaf wrote:
I thought they'd optimize IPC to use a cpl=3 to cpl=3 "jmp far" if the message sender and receiver share a page directory. The paper http://i30www.ira.uka.de/research/docum ... c-perf.pdf (Table 2, page 3) states 121 cycles for IPC on a 166 MHz Pentium, where 72 cycles are for the system call/ret (over half the cycles). I'm not sure what changes have been made since 1997...
I don't really understand in what ways the paper differs from what I've said:

The paper doesn't differ from what you said - I included it to show that the cost of the system call is over 50% of the total IPC costs (when a message is sent from one small space to another small space within the same page directory).

You are right - to allow the "far jump" to be used instead of the expensive system call (when a message is sent from one small space to another small space within the same page directory) you'd need to allow the sending thread to load DS, ES, SS, SP and CS:EIP for the receiving thread, which would be a security problem. There are ways to minimize the security problem, but it can't be made 100% secure.

gaf wrote:
In any case, if IPC is slower than system calls then using system calls instead of IPC would be better for performance (ie. building more into the kernel). IMHO this can be done without reducing flexibility in some cases (e.g. the sigma0 pager, which would also allow for PAE support).
It might be better for performance but I'd also mean that the kernel had to be monolithic. The goal never was to create something faster than system-calls, but rather to get IPC fast enough to at least compete with them and thus allowing external managers.

Building the sigma0 pager (or a physical memory manager) into the kernel wouldn't suddenly make the entire OS monolithic.

From http://encycl.opentopia.com/term/L4_microkernel_family:

"One of L4's goals was to demonstrate that the microkernel concept was not a flawed concept, and that acceptable performance was achievable simply by focusing on fixing the problems and making the system more machine specific. That is, by dropping the constraint of being entirely machine neutral and tuning each implementation to a particular platform, the overhead of the microkernel would no longer be a real concern. L4, and its predecessor L3, appear to have proven the point, and the overhead is generally about 5% or less."

This is of course a rather debatable statement - from my perspective L4 doesn't prove that micro kernels can be fast enough compared to monolithic kernels, but instead only proves that machine specific code is faster than portable code. Further, I don't think that the microkernel concept was ever flawed to begin with (as long as you accept that performance is sacrificed for other benefits that can't be acheived with monolithic kernels).

My point is that it's possible to take concepts from both micro-kernels and monolithic kernels, choosing performance or other benefits where it makes the most sense.

Cheers,

Brendan

Posted: **Tue Jul 26, 2005 11:36 pm**

This has been an interesting read. Unfortunately, I'm no closer to deciding what to do in my own kernel.

A general comment about L4 based on what I've read -- the point of it being so minimalist is not just for the added flexibility, but also to keep its cache footprint as small as possible. RAM is the most critical performance bottleneck in modern hardware, and it's only getting worse. Cache and TLB thrashing seem to be the worst performance killers. Sounds to me like a good reason to keep the kernel as small as possible... Just my $0.02.

I do have another question though. Let's say that I decide I don't need to have multiple pagers in user-space, and that I want most of memory management in the microkernel. How do I handle page faults that require disk I/O? The disk and filesystem drivers will be in user-space, and it would be awkward for the kernel to have special knowledge of them... Any suggestions?

OSDev.org

Minimalist Microkernel Memory Management

Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management

Re:Minimalist Microkernel Memory Management