Virtual mm solutions

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt, post here.
rootel77

Re:Virtual mm solutions

Post by rootel77 »

If it is absolutely necessary to support faulting in kernel pages, then you should use Brendan's solution. Note that when you choose to map the entire physical memory (the maximum possible) into kernel space and opt for the 4 MB page size (which is the preferable option for preserving TLB entries - x86 uses a separate TLB for 4 MB pages), there is no simple way to recover from a kernel page fault (i.e. the page fault concerns a 4 MB page, but you only need to recover a 4 KB portion of it from disk).
Even if you intend to support a maximum number of processes on modest hardware, you should take into consideration the "practical limit" for your OS, i.e. the limit beyond which the user would choose to move to new hardware because the system has become "intolerable".
User avatar
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re:Virtual mm solutions

Post by Brendan »

Hi,
JAAman wrote:you should never have to update globally-shared page tables, as these will be the same in all address spaces - you just point all address spaces to the same actual tables, and all address spaces will be updated when you update the current address space
For plain 32 bit paging, unless every kernel page table is pre-allocated during boot you need to be able to allocate a new kernel page table and put it into every page directory. I don't like pre-allocating all kernel page tables - it costs roughly 1 MB (if kernel space is 1 GB) which is too much on limited hardware. For example, on an 80486 with 4 MB of RAM (the "minimum hardware requirement" for my OS) this would leave 2688 KB of usable RAM for everything else (as the area from 0x000A0000 to 0x00100000 isn't usable).
rootel77 wrote:If it is absolutely necessary to support faulting in kernel pages, then you should use Brendan's solution.
The "static sized table containing kernel page table entries that is used to update address spaces after a task switch" method mentioned by Proxy (above) as a solution to the problems of the "Update the address space after a task switch" method by JAAman (also above) is good - I couldn't think of any significant problems for it :).
rootel77 wrote:Note that when you choose to map the entire physical memory (the maximum possible) into kernel space and opt for the 4 MB page size (which is the preferable option for preserving TLB entries - x86 uses a separate TLB for 4 MB pages), there is no simple way to recover from a kernel page fault (i.e. the page fault concerns a 4 MB page, but you only need to recover a 4 KB portion of it from disk).
How did 4 KB of a 4 MB page end up on the swap? Normally you'd split a 4 MB page into 1024 smaller 4 KB pages just before you send 4 KB of it to swap - when this page needs to be loaded back from swap the 4 MB page doesn't exist.

Of course you could mean that you can't send 4 KB of the kernel's "physical memory mapping" to the swap, but you should never do this anyway - the physical memory mapping should always be a mapping of physical memory and should never change for any reason.
rootel77 wrote:Even if you intend to support a maximum number of processes on modest hardware, you should take into consideration the "practical limit" for your OS, i.e. the limit beyond which the user would choose to move to new hardware because the system has become "intolerable".
How do you calculate the practical limit?

Let's say I write a utility to send SMS reminders. It's a command line utility that lets the user set a target mobile phone number, a time that the message is to be sent and the message itself - something like "sendsms 0404123456 10:30 5/5/06 Don't forget your dentist appointment at 12:00!" or "sendsms 0404123456 12:30 1/4/2110 It's your 100th wedding anniversary next week!".

The utility starts, reads its command line arguments, calculates the length of time it needs to sleep and then does "sleep()". When it wakes up it sends the message and terminates. For a 66 MHz 80486 with 4 MB of RAM and 10 GB of swap space, what is the maximum number of "sendsms" processes that can be running at the same time?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
rootel77

Re:Virtual mm solutions

Post by rootel77 »

The "static sized table containing kernel page table entries that is used to update address spaces after a task switch" method mentioned by Proxy (above) as a solution to the problems of the "Update the address space after a task switch" method by JAAman (also above) is good - I couldn't think of any significant problems for it
Of course, when I say "if it is absolutely necessary..." we are in the method 1 context (map the entire physical memory into kernel space), and therefore we don't need to sync kernel page tables because they are already allocated, so JAAman's and Proxy's solutions are unnecessary in this context (i.e. because the kernel PDE entries never change).
How do you calculate the practical limit?

Let's say I write a utility to send SMS reminders. It's a command line utility that lets the user set a target mobile phone number, a time that the message is to be sent and the message itself - something like "sendsms 0404123456 10:30 5/5/06 Don't forget your dentist appointment at 12:00!" or "sendsms 0404123456 12:30 1/4/2110 It's your 100th wedding anniversary next week!".

The utility starts, reads its command line arguments, calculates the length of time it needs to sleep and then does "sleep()". When it wakes up it sends the message and terminates. For a 66 MHz 80486 with 4 MB of RAM and 10 GB of swap space, what is the maximum number of "sendsms" processes that can be running at the same time?
(Fortunately, I am an accountant - no, I am serious :))
First, we can't calculate the practical limit (which would mean an exact figure); all we can do is estimate it.
So first we need to know the "cost per process" in terms of memory. This corresponds to the permanent per-process structures that must reside in the kernel address space, which gives us an estimate of the kernel's needs for a given number of processes. Then come the process's own needs, which may vary in size between different processes; there we can estimate an average cost (if you need a more elaborate approach, you could consider the universal 80/20 law, where 20% of elements (or people) monopolize 80% of the resources).
Now you can calculate the theoretical limit of your OS (still in memory terms), which corresponds to the total physical memory plus the total swap space.
The practical limit takes into consideration the performance of each memory type (RAM and disk) and how frequently pages move between the two - what I would call "the swap cost". The more you swap, the more performance decreases and the longer the user must wait. So the practical limit is a function of the swap frequency, but also of the "user tolerance" (which depends on the type of application being executed, and on the user's patience of course).
When you choose to swap out kernel space, the swap cost may increase enormously and the user experience decreases in the same proportion, so it is highly probable that you will reach the practical limit.
JAAman

Re:Virtual mm solutions

Post by JAAman »

Of course, when I say "if it is absolutely necessary..." we are in the method 1 context (map the entire physical memory into kernel space), and therefore we don't need to sync kernel page tables because they are already allocated, so JAAman's and Proxy's solutions are unnecessary in this context (i.e. because the kernel PDE entries never change).
That is exactly what I said (but I do not support mapping physical memory - I believe it to be a bad idea).

@brendan:
I don't like pre-allocating all kernel page tables - it costs roughly 1 MB (if kernel space is 1 GB) which is too much on limited hardware
who said anything about pre-allocation?
What I do:

Every process starts with an address space; in every process, the entries pointing into (shared) kernel space point to the same page tables (since they will be exactly the same in every process).
If I need to allocate more shared kernel space, I simply add entries to the current page tables (and invalidate); these are marked global and reused in every address space - these page tables are never changed when switching address spaces (I also have a local-kernel region, which is not marked global, and does change with CR3).

The only time I would ever have to change a page table in another address space would be for page sharing or page passing (both can be used as forms of IPC).
rootel77

Re:Virtual mm solutions

Post by rootel77 »

That is exactly what I said (but I do not support mapping physical memory - I believe it to be a bad idea).
To avoid confusion, let's say we are on the x86 32-bit architecture.
Identity mapping the entire (maximum) physical memory has several advantages:

- you can use the 4 MB page size, so you save the 4 KB of space for each page table, and you preserve TLB entries for use by process pages
- memory management in the kernel becomes easy, as physical addresses are also virtual addresses (possibly via a translation)
- contiguous physical pages have good locality in the L1 cache (I'm not absolutely sure about this, but I think L1 caches are physically indexed)
- since the kernel page directory entries never change, you don't have to synchronize kernel page directory entries between processes
- contiguous physical page allocation is necessary for some devices

Now about the disadvantages:
- you can't make contiguous virtual regions from discontiguous physical pages, so you must implement contiguous memory allocation in the low-level physical allocator; in other words, you must deal with external fragmentation in the physical allocator
- the maximum physical memory that can be mapped is limited by the size of the kernel address space

Linux uses this identity mapping, and here are the solutions it adopts:

- to deal with external fragmentation, it uses the buddy system
- of course, the buddy system is subject to severe internal fragmentation (i.e. wasted internal space after allocation); for this reason Linux implements a slab allocator on top of the physical allocator - thus, although the buddy allocation time may not be constant, the slab allocation time is constant
- for high memory (space that can't be mapped directly), a "normal" virtual allocation can be used

Personally, I think Linux adopts the best solution for the current 32-bit architectures, both in terms of memory and cache usage.

Note that even if you choose not to identity map the physical memory, this does not mean you have eliminated the external fragmentation problem - you have just pushed it to the virtual allocation stage. Thus, even if the discontiguous physical allocation may take constant time (e.g. via a stack or a doubly linked list), the contiguous virtual memory allocation will inevitably be non-constant.

Re:Virtual mm solutions

Post by Brendan »

Hi,
JAAman wrote:
I don't like pre-allocating all kernel page tables - it costs roughly 1 MB (if kernel space is 1 GB) which is too much on limited hardware
who said anything about pre-allocation?
What I do:

Every process starts with an address space; in every process, the entries pointing into (shared) kernel space point to the same page tables (since they will be exactly the same in every process).
If I need to allocate more shared kernel space, I simply add entries to the current page tables (and invalidate); these are marked global and reused in every address space - these page tables are never changed when switching address spaces (I also have a local-kernel region, which is not marked global, and does change with CR3).

The only time I would ever have to change a page table in another address space would be for page sharing or page passing (both can be used as forms of IPC).


What do you do when you need to allocate more shared kernel space but all of the kernel's page tables are full? Do you allocate a new page table, or avoid this problem by pre-allocating all page tables that you might want?

I can't think of any other alternatives here, but both of these alternatives contradict what you've said...


Cheers,

Brendan

Re:Virtual mm solutions

Post by Brendan »

Hi,
rootel77 wrote:personnally i think linux adopts the best solution for the actual 32bit architectures both on terms of memory and cache usage.
Personally, I think Linux is a big steaming pile of hacks held together with developers' sweat and tears (that happens to work reliably due to perseverance rather than good design).

AFAIK when Linux was first written it was loaded at physical address 0x00101000, then all of physical memory was mapped at 0xC0000000, and the kernel is never moved from this initial mapping so that it always existed at linear address 0xC0001000 (and physical address 0x00101000).

Since then there's been a series of different hacks to get around the problems this causes - support for memory above 1 GB (and major changes to device drivers that expected to be able to access any physical page using the "physical memory mapping", but broke when a page was above 1 GB and this stopped working), different ratios between user space and kernel space (1 GB/3 GB, 2 GB/2 GB, etc), support for PAE, etc. The latest series of hacks are for better NUMA support (where the first 1 GB might all be in one NUMA domain and slow to access from all other NUMA domains).

It would've been really nice and clean when Linus first wrote it. I'm sure if he designed a new version of Linux from scratch he'd avoid making the mistake a second time.


Cheers,

Brendan
rootel77

Re:Virtual mm solutions

Post by rootel77 »

Personally, I think Linux is a big steaming pile of hacks held together with developers' sweat and tears (that happens to work reliably due to perseverance rather than good design).
Of course, this is a subjective viewpoint. Other people may think (and do think) that Linux rather exhibits a good design, one that allows separate teams to work efficiently - for example, all subsystems export the same interface regardless of the architecture being used. I can't quite imagine what a better design would look like.
AFAIK when Linux was first written it was loaded at physical address 0x00101000, then all of physical memory was mapped at 0xC0000000, and the kernel is never moved from this initial mapping so that it always existed at linear address 0xC0001000 (and physical address 0x00101000).

Since then there's been a series of different hacks to get around the problems this causes - support for memory above 1 GB (and major changes to device drivers that expected to be able to access any physical page using the "physical memory mapping", but broke when a page was above 1 GB and this stopped working), different ratios between user space and kernel space (1 GB/3 GB, 2 GB/2 GB, etc), support for PAE, etc. The latest series of hacks are for better NUMA support (where the first 1 GB might all be in one NUMA domain and slow to access from all other NUMA domains).
I don't think that all the memory problems relate to the 0xC0000000 address. Even if the kernel chose more virtual space for itself, there would always be the high memory problem.
If you take a look at the kernel source, you may notice there is never an explicit reference to 0xC0000000, only to a constant called "PAGE_OFFSET". I have no idea how drivers make their assumptions about this constant, but I still can't figure out where we would get blocked by decreasing the PAGE_OFFSET constant.

Re:Virtual mm solutions

Post by Brendan »

Hi,
rootel77 wrote:
Personally, I think Linux is a big steaming pile of hacks held together with developers' sweat and tears (that happens to work reliably due to perseverance rather than good design).
Of course, this is a subjective viewpoint. Other people may think (and do think) that Linux rather exhibits a good design, one that allows separate teams to work efficiently - for example, all subsystems export the same interface regardless of the architecture being used. I can't quite imagine what a better design would look like.
There's a reason for that - there is no "best" design. Instead there's many alternatives with different advantages/disadvantages.

For comparison, my OS has a physical memory manager that is used during boot and then discarded. This is used to dynamically allocate pages for the initial address space/s. The OS has a different copy of the kernel (and several other data structures) for each NUMA domain, where all copies of the kernel use dynamically allocated physical pages. This means that these things all use physical memory that is "close" to the CPUs that will be using it (minimizing bus traffic between NUMA domains).

There is no physical memory mapping in kernel space. To solve the "modifying other address spaces" problem this causes, the kernel uses "method 2" above (mapping parts of the paging structures into themselves - e.g. mapping the page directory as the highest page directory entry for 32 bit/plain paging). This avoids the "physical memory size > kernel space" problem entirely.

Of course this means that device drivers can't access physical memory. For example, if an application asks to write 4 KB of data to disk, it sends this data via messaging and the disk driver copies it to a bounce buffer and then sends it to the hardware. This is bad for performance, but good for protection. It also means that a 32 bit device driver will work fine on a 64 bit kernel (I won't need separate 64 bit device drivers).

The end result is that all of it can work the same regardless of what type of paging is used, how much physical memory there is, how many NUMA domains there are, etc - it's "clean".


Cheers,

Brendan
JAAman

Re:Virtual mm solutions

Post by JAAman »

@brendan:

I don't quite know what you mean:
when your current page tables are full, you just add a new one.

for me:

0-2GB = user space
2-3GB = kernel local space
3-4GB = kernel global space

When I create the process, the top-level page points to the kernel shared page(s) (this is especially easy when using PAE); these always exist.

When I need more space in shared kernel space, I simply allocate it in the lower pages (you might be right though about non-PAE -- the top-level pages must always exist -- hadn't thought about that).




@rootel77:
Identity mapping the entire (maximum) physical memory has several advantages:

- since the kernel page directory entries never change, you don't have to synchronize kernel page directory entries between processes
You don't have to sync page directory entries between processes anyway -- if you use the same pages in all processes. Be sure to read my entire post rather than just one sentence -- which you seem to have taken completely out of context.
rootel77

Re:Virtual mm solutions

Post by rootel77 »

@Jaaman
When a kernel-mode thread is actually running, in which memory context (= current page directory) does it run? If it runs in the current process's memory context and makes a change to a kernel page directory entry within the kernel address space, the change is only visible in that memory context (i.e. in the current page directory) - all other page directories, for all other processes, contain the old entry's data. How do you update the kernel page directory entries for all the other processes?
Of course, if you preallocate all kernel page directory entries at the beginning (i.e. preallocate all kernel page tables), there will never be any change, so you don't have to sync.

Re:Virtual mm solutions

Post by Brendan »

Hi,
JAAman wrote:for me:

0-2GB = user space
2-3GB = kernel local space
3-4GB = kernel global space

When I create the process, the top-level page points to the kernel shared page(s) (this is especially easy when using PAE); these always exist.

When I need more space in shared kernel space, I simply allocate it in the lower pages (you might be right though about non-PAE -- the top-level pages must always exist -- hadn't thought about that).
Ahh, that makes much more sense (I didn't realize you were using PAE).

For me, for PAE, it goes like this:

0 to X GB = process space (user space)
X to 3 GB = thread space (user space)
3 to Y GB = kernel global space
Y to 4 GB = domain specific kernel space

The kernel has a separate page directory for each NUMA domain, where all page tables for kernel global space must be mapped into each domain specific page directory. This means that I have the same problem in PAE and plain 32 bit paging - the global kernel page table needs to be inserted into multiple page directories.


Cheers,

Brendan
rootel77

Re:Virtual mm solutions

Post by rootel77 »

@JAaman
Sorry, I hadn't realized that you are using PAE (my English isn't great, and I hadn't understood the exact meaning of your last post). All my replies were related to two-level paging in 32-bit addressing mode.
JAAman

Re:Virtual mm solutions

Post by JAAman »

@rootel77:

Actually, it doesn't matter if I'm using PAE -- if the pages are reused in all address spaces, then they don't have to be updated. Even in non-PAE they may have to pre-exist, but they don't have to be added to multiple address spaces.


@brendan:

Actually, I still have some work to do on my mem-man, but I do intend to support non-PAE also (though I too am building for 32bit and 64bit at the same time -- so using PAE wherever it may be allowed makes a lot of sense to me), as I intend to support a minimum spec of 386, 4MB RAM (hopefully).

My current design is theoretical right now (as I am starting over, but haven't had much time to implement -- too many projects demanding my attention!!)

Re:Virtual mm solutions

Post by Brendan »

Hi,
JAAman wrote:Actually, I still have some work to do on my mem-man, but I do intend to support non-PAE also (though I too am building for 32bit and 64bit at the same time -- so using PAE wherever it may be allowed makes a lot of sense to me), as I intend to support a minimum spec of 386, 4MB RAM (hopefully).

My current design is theoretical right now (as I am starting over, but haven't had much time to implement -- too many projects demanding my attention!!)
I'm rarely sure what I'm referring to when I'm talking about my OS (the currently implemented code, the code implemented for the previous version/s, or the design itself).

I guess my current design is also theoretical (I still haven't started any part of the kernel/s for the current rewrite), although it's all similar to the previous/implemented version.

As for PAE vs. plain 32 bit paging, my currently implemented/rewritten boot code will detect what the computer needs and auto-select. If the BIOS's "int 0x15, eax=E820" function returns any RAM area that ends above 4 GB, or if the ACPI SRAT table returns any hot pluggable RAM area that ends above 4 GB, or if the "!ForcePAE" boot variable is set to true, then PAE will be used. Otherwise it will use plain 32 bit paging. The reason for this is memory overhead - for PAE, the paging structures aren't as efficient (a page table or page directory contains 512 entries instead of 1024, so you need more of them).

Currently I'm doing graphics code in real mode for the boot menu system (should be complete in a day or so). After that I'll do an optional "default video" module in real mode (to allow a default video mode to be selected), and add a pile of code to the 3 kernel setup modules to support different colour depths. I'm guessing it'll take another 2 months before I'm ready to begin work on kernel modules.

I started this rewrite in September last year, so that makes it about 8 months of work before work on any part of any kernel begins.


Cheers,

Brendan