
Re:Paging ...

Posted: Mon May 08, 2006 1:22 pm
by viral
Hi,
I'm doing it the same way BI has just come up with. My process control block contains a queue (a linked list) of the physical addresses of the pages currently allocated to the process. So whenever a new page is mapped into the process address space, its physical address is pushed onto this queue.

And when the process has to be removed, I just free all the pages one by one by popping the queue - no need to convert from virtual to physical.
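Something like this, as a rough sketch - kmalloc/kfree/free_phys_page are just placeholders for whatever your kernel's heap and physical allocator look like:

#include <stddef.h>
#include <stdint.h>

/* Placeholders for the kernel heap and physical page allocator. */
extern void *kmalloc(size_t size);
extern void kfree(void *ptr);
extern void free_phys_page(uint32_t phys_addr);

/* One node per physical frame handed to the process. */
struct page_node {
    uint32_t phys_addr;           /* physical address of the frame */
    struct page_node *next;
};

struct pcb {
    /* ... other process state ... */
    struct page_node *pages;      /* head of the per-process page queue */
};

/* Called whenever a new page is mapped into the process address space. */
void pcb_track_page(struct pcb *p, uint32_t phys_addr)
{
    struct page_node *n = kmalloc(sizeof(*n));
    n->phys_addr = phys_addr;
    n->next = p->pages;
    p->pages = n;
}

/* Called when the process is destroyed: free every frame by popping. */
void pcb_free_all_pages(struct pcb *p)
{
    while (p->pages != NULL) {
        struct page_node *n = p->pages;
        p->pages = n->next;
        free_phys_page(n->phys_addr);
        kfree(n);
    }
}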

Re:Paging ...

Posted: Mon May 08, 2006 2:14 pm
by Brendan
Hi,
viral wrote: I'm doing it the same way BI has just come up with. My process control block contains a queue (a linked list) of the physical addresses of the pages currently allocated to the process. So whenever a new page is mapped into the process address space, its physical address is pushed onto this queue.

And when the process has to be removed, I just free all the pages one by one by popping the queue - no need to convert from virtual to physical.
What do you do if you need to free some pages from the middle of this queue - for example, if the kernel needs to send some pages to swap space, or if the process allocates 4 huge chunks of RAM and then frees the first huge chunk?

Why do you need anything to convert virtual addresses into physical addresses when you can use the page directory and page table entries anyway?
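For reference, the lookup itself is tiny - something like this for plain 32 bit paging with 4 KB pages, assuming the page directory and page tables can actually be read at those addresses (e.g. identity mapped or via a recursive mapping), and ignoring 4 MB pages:

#include <stdint.h>

#define PAGE_PRESENT 0x1

/* Walk the page directory and page table to turn a virtual address into
   a physical address.  Returns 0 if the page isn't present. */
uint32_t virt_to_phys(uint32_t *page_dir, uint32_t virt)
{
    uint32_t pde = page_dir[virt >> 22];              /* top 10 bits */
    if (!(pde & PAGE_PRESENT))
        return 0;

    uint32_t *page_table = (uint32_t *)(pde & 0xFFFFF000);
    uint32_t pte = page_table[(virt >> 12) & 0x3FF];  /* next 10 bits */
    if (!(pte & PAGE_PRESENT))
        return 0;

    return (pte & 0xFFFFF000) | (virt & 0xFFF);       /* frame + offset */
}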

The only reason I can think of is to speed up code that changes an address space that isn't currently being used (i.e. when the page tables, etc can't be directly accessed), but this doesn't happen often and can be avoided.

The reverse is a much larger problem - finding the virtual address (or addresses) that use a specific physical page. I can't think of a situation where this would be needed...


Cheers,

Brendan

Re:Paging ...

Posted: Mon May 08, 2006 2:49 pm
by paulbarker
@Brendan:

I'm thinking of putting together a Physical Region Manager (PRM) like your PASM, where each region can be managed in its own way (possibly a list, bitmap, buddy bitmaps or as a single resource). I want the design of this to survive modification to the OS, so I was thinking about NUMA. Would this idea work with NUMA?

I don't know much about NUMA. Is it a system where physical addresses 0-1 GB belong to CPU0, 1-2 GB to CPU1, etc., or is it a system where 0-1 GB on CPU0 refers to different memory than 0-1 GB on CPU2?

Re:Paging ...

Posted: Mon May 08, 2006 4:24 pm
by Brendan
Hi,
paulbarker wrote:I'm thinking of putting together a Physical Region Manager (PRM) like your PASM, where each region can be managed in its own way (possibly a list, bitmap, buddy bitmaps or as a single resource). I want the design of this to survive modification to the OS, so I was thinking about NUMA. Would this idea work with NUMA?
The idea will work with NUMA - all you need to do is give each range a "domain" number and split any contiguous RAM ranges that cross NUMA boundaries.

For example, the "int 0x15, eax = 0xE820" memory map might say there's one chunk of RAM from 1 MB to 8 GB, but the SRAT ACPI table might say that everything below 4 GB belongs to NUMA domain 0 and everything from 4 GB to 8 GB belongs to NUMA domain 1. In this case you'd have to split the area into 2 separate entries, one for each NUMA domain.
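Roughly, the splitting step looks something like this - assuming the E820 map and the SRAT memory affinity entries have already been parsed, that the SRAT entries are sorted by base address and cover all of RAM, and that the names here are just illustrative:

#include <stdint.h>

struct mem_range {
    uint64_t base;
    uint64_t length;
    uint32_t domain;
};

/* Split one physical RAM range wherever it crosses a NUMA domain
   boundary, tagging each piece with its domain number. */
void split_by_domain(uint64_t base, uint64_t length,
                     const struct mem_range *srat, int srat_count,
                     struct mem_range *out, int *out_count)
{
    uint64_t end = base + length;

    for (int i = 0; i < srat_count && base < end; i++) {
        uint64_t d_end = srat[i].base + srat[i].length;

        if (base >= d_end || end <= srat[i].base)
            continue;                        /* no overlap with this entry */

        uint64_t piece_end = (end < d_end) ? end : d_end;
        out[*out_count].base = base;
        out[*out_count].length = piece_end - base;
        out[*out_count].domain = srat[i].domain;
        (*out_count)++;

        base = piece_end;                    /* carry on with the remainder */
    }
}

For the example above, the 1 MB to 8 GB range would come out as two entries: 1 MB to 4 GB in domain 0 and 4 GB to 8 GB in domain 1.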
paulbarker wrote:I don't know much about NUMA. Is it a system where physical addresses 0-1 GB belong to CPU0, 1-2 GB to CPU1, etc., or is it a system where 0-1 GB on CPU0 refers to different memory than 0-1 GB on CPU2?
There are two different types of NUMA. The first is "ccNUMA" (cache-coherent NUMA), where the physical address space is identical for all CPUs (but the access times aren't). There's also "non-cache-coherent NUMA", where the physical address space is not identical for all CPUs.

In practice, almost all NUMA computers are ccNUMA, and the general term "NUMA" is synonymous with ccNUMA. This is because "non-cache-coherent NUMA" is incredibly difficult to program for. For "non-cache-coherent NUMA" it's much better for software to treat each NUMA domain as an entirely separate machine, so that it looks like a cluster of separate SMP/UP computers (which is probably why no-one really makes non-cache-coherent NUMA machines to begin with - it's easier to make "clustered hardware" too).

I should also point out that devices in a NUMA machine are often also specific to a certain NUMA domain, and that one or more CPUs can share the same NUMA domain. This means that (for example) 4 out of 16 CPUs might be able to access a specific device faster than the other 12 CPUs.

For memory mapped devices there's normally no restriction on where the device is mapped into the physical address space. This means that (for example) RAM from 0 GB to 2 GB might belong to NUMA domain 0 while there's a video card mapped at 1 GB that belongs to NUMA domain 3.

There's also "less common" system topologies. For example, you could have 4 NUMA domains that contain CPUs and nothing else, and 2 more NUMA domains that both contain some I/O devices and some memory.

For larger systems there can also be "meta-domains". For example, there could be 2 "groups of NUMA domains", where each group consists of 4 NUMA domains. In this case CPUs in domains 0 and 1 access memory in domains 2 and 3 with a small performance penalty, and access memory in domains 4, 5, 6 and 7 with a larger performance penalty. This means the OS needs to maintain a list of "relative distances" - for example, if CPU 0 needs to allocate a page of RAM then it'd check for free RAM in domain 0, then domain 1, then domains 2 and 3, then domains 4, 5, 6 and 7. The idea of this is to allocate the first free page that is closest to the CPU that will be using it, even when the closest domain/s have no free RAM left, so that the performance penalties are minimized.
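Roughly, the allocation path then looks like this - assuming each CPU has been given a list of domains sorted from nearest to farthest (similar in spirit to the distances in the ACPI SLIT table), and with made-up names:

#include <stdint.h>

#define MAX_DOMAINS 128

/* Returns the physical address of a free page in the given domain, or 0
   if that domain has no free RAM - placeholder for the real per-domain
   allocator. */
extern uint64_t alloc_page_from_domain(int domain);

struct cpu_info {
    int domain_order[MAX_DOMAINS];   /* domains sorted by distance */
    int domain_count;
};

/* Allocate the first free page from the domain closest to this CPU,
   falling back to more distant domains as needed. */
uint64_t alloc_page_near(const struct cpu_info *cpu)
{
    for (int i = 0; i < cpu->domain_count; i++) {
        uint64_t page = alloc_page_from_domain(cpu->domain_order[i]);
        if (page != 0)
            return page;
    }
    return 0;                        /* no free RAM anywhere */
}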

Another problem is the kernel itself - some "not-so-smart" OS's load it at a fixed physical address (e.g. at 1 MB) which means that CPUs in one NUMA domain access kernel code faster than CPUs in other NUMA domains because those other CPUs have continual "distance" penalties when accessing kernel code. In addition you can have similar problems with the kernel's data, which can lead to re-entrancy lock starvation, etc. IMHO any OS that reserves the first 16 MB for the kernel's code and data and then pretends to support NUMA properly should be shot. ;)


Cheers,

Brendan

Re:Paging ...

Posted: Mon May 08, 2006 4:47 pm
by paulbarker
Thanks for the info. That's made me much less likely to write NUMA support myself, but I want to write an OS that at least allows someone else to add NUMA support if they want to. Also, I see SMP/NUMA and clustered systems increasing in popularity over the coming years, for many reasons, so maybe one day I will be writing code for these sorts of platforms.

Re:Paging ...

Posted: Mon May 08, 2006 4:49 pm
by Kemp
Would you suggest the kernel be copied into each domain? I can imagine this eating up a lot of RAM once the kernel starts growing in size, but it's the only way I can think of off-hand to avoid performance drains entering the kernel across domains.

Edit:
Actually that's a stupid idea, right? You'd have to enter a different copy depending on which CPU you were running on, which I'd imagine could get quite complex and more than outweigh the cross-domain performance penalties it avoids.

Re:Paging ...

Posted: Mon May 08, 2006 5:31 pm
by Colonel Kernel
Brendan wrote: The reverse is a much larger problem - finding the virtual address (or addresses) that use a specific physical page. I can't think of a situation where this would be needed...
That's because (IIRC) you don't implement shared memory. Windows' "page frame database" has exactly that reverse mapping.
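Roughly, each physical frame gets an entry along these lines (just an illustration of the idea, not Windows' actual layout):

#include <stdint.h>

/* One entry per physical page frame.  The back-pointer is what makes the
   "reverse" lookup possible, e.g. when evicting a shared page. */
struct pfn_entry {
    uint32_t ref_count;        /* how many mappings use this frame */
    uint32_t flags;            /* free, active, modified, ... */
    uint32_t *pte;             /* back-pointer to (one of) the PTEs that map it */
    struct pfn_entry *next;    /* link for free/modified/standby lists */
};

/* pfn_db[frame_number] describes physical page frame_number. */
extern struct pfn_entry pfn_db[];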

Re:Paging ...

Posted: Mon May 08, 2006 5:46 pm
by Brendan
Hi,
Kemp wrote:Would you suggest the kernel be copied into each domain? I can imagine this eating up a lot of RAM once the kernel starts growing in size, but it's the only way I can think of off-hand to avoid performance drains entering the kernel across domains.

Edit:
Actually that's a stupid idea, right? You'd have to enter a different copy depending on which CPU you were running on, which I'd imagine could get quite complex and more than outweigh the cross-domain performance penalties it avoids.
For my OS, "kernel space" is split into 2 parts - there's a "shared/global" part and a "domain specific" part.

The domain specific part contains a copy of every "kernel module" and all fixed kernel data (any kernel data that doesn't change after boot, like details for all CPUs, kernel API function tables, the GDT and IDT (the IDT doesn't change after boot for my OS), NUMA domain information, etc.).

For a computer with N domains, there's N copies of the domain specific area (or N copies of the kernel modules and N copies of the fixed kernel data). The domain specific areas are all mapped into the same part of every address space. Because nothing in the domain specific area changes, kernel space looks identical from every CPU.

For the CPU's TLB, all pages in kernel space can still be marked as "global" because each CPU always sees the same pages in the domain specific area (another CPU might have different TLB entries for the same area of kernel space, but that doesn't matter).

The only real problem is that if a process is shifted from one NUMA domain to another NUMA domain then the page directory entries used for the domain specific area need to be corrected (but processes don't get migrated to different NUMA domains often because it stuffs up all of the memory accesses in user-space, which would have been tuned to suit the old NUMA domain when they were allocated).
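Roughly, pointing a process at its home domain's copy is just a matter of which page table the relevant page directory entries refer to - something like this, assuming one pre-built page table per domain covering the area (the names are just illustrative):

#include <stdint.h>

#define DOMAIN_AREA_PDE 1022        /* illustrative slot in the page directory */
#define PDE_PRESENT     0x001

/* Physical address of the pre-built page table for each domain's copy of
   the domain specific area, filled in at boot. */
extern uint32_t domain_area_page_table_phys[];

/* Point the domain specific part of kernel space at the copy belonging to
   the given NUMA domain.  The same virtual range then maps a different
   per-domain copy of the kernel modules and fixed kernel data. */
void map_domain_area(uint32_t *page_dir, int domain)
{
    page_dir[DOMAIN_AREA_PDE] = domain_area_page_table_phys[domain] | PDE_PRESENT;
}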

As for eating up lots of RAM, this would be a problem for a monolithic kernel - for example, if there's 64 NUMA domains then using 6 MB per domain wouldn't be too practical. My OS is a micro-kernel though...

For me, if every kernel module (physical memory manager, scheduler, messaging code, etc.) is the maximum size of 128 KB, and if I find a use for all 16 "kernel module slots", and if the fixed kernel data area is also completely full, then for 32 bit systems it adds up to "16 * 128 KB + 660 KB" per domain and for 64 bit systems it adds up to "16 * 128 KB + 668 KB" per domain. On top of this there's one page table for 32 bit paging or 2 page tables for PAE or long mode.

This means the worst case is 2724 KB per domain for long mode, 2716 KB per domain for PAE and 2712 KB per domain for plain 32 bit paging. Fortunately, this worst case won't happen in practice - I only intend to use half of the "kernel module slots", each "kernel module" will only be about 20 KB, and even if there are 255 CPUs (all in separate domains) I won't fill all of the tables in the fixed kernel data area (I allowed for a lot of "what if" expansion space). In reality it's going to be around 512 KB per domain for a huge computer (255 CPUs with 128 NUMA domains), and less than 200 KB per domain for a small computer (2 CPUs and 2 NUMA domains).
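Spelling that worst case out:

16 modules * 128 KB   = 2048 KB
fixed kernel data     =  660 KB (32 bit) or 668 KB (64 bit)
page tables           =    4 KB (32 bit) or 8 KB (PAE / long mode)

plain 32 bit paging:  2048 + 660 + 4 = 2712 KB per domain
PAE:                  2048 + 660 + 8 = 2716 KB per domain
long mode:            2048 + 668 + 8 = 2724 KB per domain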

If I didn't have multiple copies, then one copy of this data would still need to exist somewhere (which is what happens for a computer that doesn't support NUMA, which is actually treated as a NUMA computer with one domain). This means that for a massive computer (255 CPUs with 128 NUMA domains) I'm looking at (roughly) a total of 63.5 MB extra, and for a small computer (2 CPUs and 2 NUMA domains) I'm looking at 200 KB extra.

Of course for that "massive computer", nothing has ever been manufactured that is this large, and everything that has been manufactured that comes slightly close has an equally huge amount of RAM. It's also worth pointing out that as systems become larger the cost of accessing RAM in other NUMA domains tends to become worse, and ways to avoid this become even more important to avoid scalability problems.

The other bonus for micro-kernels in a NUMA system is for the device drivers themselves, which can be assigned to (or confined within) the NUMA domain that is closest to the hardware devices. This means you don't end up with a device driver running in the context of an application in NUMA domain 0 trying to work with a device that is physically connected to NUMA domain 7. This speeds up access to the device and also minimizes the traffic between NUMA domains (which reduces contention on these "inter-domain" links).


Cheers,

Brendan

Re:Paging ...

Posted: Mon May 08, 2006 6:03 pm
by Brendan
Hi,
Colonel Kernel wrote:
Brendan wrote: The reverse is a much larger problem - finding the virtual address (or addresses) that use a specific physical page. I can't think of a situation where this would be needed...
That's because (IIRC) you don't implement shared memory. Windows' "page frame database" has exactly that reverse mapping.
Surely you'd only need something to keep track of shared memory regions, not individual pages, and not "non-shared" pages...

For example, something that says there's N pages in process A beginning at location X that are shared by process B beginning at location Y, and by process C at location Z.
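Something like this, roughly (the names are just made up for illustration):

#include <stdint.h>

struct process;                       /* whatever the kernel's process type is */

/* Where one process has the region mapped. */
struct shared_mapping {
    struct process *proc;
    uint32_t virt_base;               /* start of the region in that process */
    struct shared_mapping *next;
};

/* One descriptor per shared region, instead of per-page reverse mappings. */
struct shared_region {
    uint32_t page_count;              /* N pages in the region */
    struct shared_mapping *mappings;  /* process A at X, B at Y, C at Z, ... */
};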


Cheers,

Brendan

Re:Paging ...

Posted: Tue May 09, 2006 11:28 am
by paulbarker
The problem here comes with COW regions, where only untouched pages are shared. This means a region may be private overall but still have some pages within it shared. I will be using a system of "Backing Sets", and this is how that system will deal with COW and other shared regions:

Say processes A & B have a COW region with some pages shared. A & B each have a backing set for the region, which has a reference count of 1 and contains every page of physical memory (backing pages / pages of backing store) that is owned only by that process. They also each reference a third backing set, which has a reference count of 2 and contains the pages shared by both processes. Thus only backing sets, and not individual pages, need reference counts.

For 3 processes with backing sets A, B and C: pages shared between A & B are in a backing set D, pages shared between A & C are in another backing set E, pages shared between B & C go in F, and pages shared between A, B & C go in G - if that makes sense. A backing set will only exist if it actually has pages in it.
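Roughly, a backing set and a region would look something like this (just a sketch of the description above, details still to be worked out):

#include <stdint.h>

/* A backing set owns a group of physical pages and carries one reference
   count for the whole set. */
struct backing_set {
    uint32_t ref_count;       /* 1 = private, 2 = shared by two processes, ... */
    uint32_t page_count;
    uint32_t *pages;          /* physical addresses of the backing pages */
};

/* A process's view of a COW region is just the list of backing sets it
   references, e.g. {A, D, E, G} for process A in the example above. */
struct region {
    uint32_t set_count;
    struct backing_set **sets;
};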

Going from a page to its owner or a backing set to its owner will not be supported as I see no need for it.

This whole thing is in the future, after I'm happy with my physical memory manager and have written my multithreading support, when I start to work on processes.