Hi,
Kemp wrote:Would you suggest the kernel be copied into each domain? I can imagine this eating up a lot of RAM once the kernel starts growing in size, but it's the only way I can think of off-hand to avoid performance drains entering the kernel across domains.
Edit:
Actually that's a stupid idea, right? You'd have to enter the different copies depending which CPU you were running on, which I'd imagine could get quite complex and more than substitute for the performance penalties.
For my OS, "kernel space" is split into 2 parts - there's a "shared/global" part and a "domain specific" part.
The domain specific part contains a copy of every "kernel module" and all fixed kernel data (any kernel data that doesn't change after boot, like details for all CPUs, kernel API function tables, the GDT and IDT (the IDT doesn't change after boot for my OS), NUMA domain information, etc.).
For a computer with N domains, there are N copies of the domain specific area (or N copies of the kernel modules and N copies of the fixed kernel data). The domain specific areas are all mapped into the same part of every address space. Because nothing in the domain specific area changes, kernel space looks identical from every CPU.
For the CPU's TLB, all pages in kernel space can still be marked as "global" because each CPU always sees the same pages in the domain specific area (another CPU might have different TLB entries for the same area of kernel space, but that doesn't matter).
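To make this a little more concrete, here's a rough sketch of how the per-domain mapping could be set up at boot. The names, the addresses and the map_page() helper are all made up for illustration (this isn't my actual code):

Code:
/* Sketch only: each NUMA domain has its own physical copy of the
 * "domain specific" area, but every copy is mapped at the same virtual
 * address, so kernel space looks identical from every CPU.  Pages are
 * marked "global" because the mapping never changes after boot. */

#define MAX_DOMAINS        128
#define DOMAIN_AREA_VIRT   0xF0000000u   /* assumed virtual base */
#define DOMAIN_AREA_SIZE   0x00400000u   /* assumed size (4 MB)  */
#define PAGE_SIZE          0x1000u

#define PTE_PRESENT        0x001u
#define PTE_GLOBAL         0x100u        /* "G" bit in the page table entry */

/* One set of kernel paging structures per NUMA domain. */
extern unsigned int domain_kernel_pd[MAX_DOMAINS][1024];

/* Assumed helper that inserts one page mapping into a page directory. */
extern void map_page(unsigned int *pd, unsigned int virt,
                     unsigned int phys, unsigned int flags);

/* Map this domain's physical copy of the area at the shared virtual base. */
void map_domain_area(int domain, unsigned int phys_copy_base)
{
    unsigned int offset;

    for (offset = 0; offset < DOMAIN_AREA_SIZE; offset += PAGE_SIZE) {
        map_page(domain_kernel_pd[domain],
                 DOMAIN_AREA_VIRT + offset,
                 phys_copy_base + offset,
                 PTE_PRESENT | PTE_GLOBAL);
    }
}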
The only real problem is that if a process is shifted from one NUMA domain to another NUMA domain then the page directory entries used for the domain specific area need to be corrected (but processes don't get migrated to different NUMA domains often because it stuffs up all of the memory accesses in user-space, which would have been tuned to suit the old NUMA domain when they were allocated).
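A migration like that only needs to touch the page directory entries that cover the domain specific area - something like this (again just a sketch with made up names and sizes):

Code:
/* Sketch only: when a process is shifted to a different NUMA domain,
 * re-point the page directory entries covering the domain specific
 * area at the new domain's page tables.  Nothing else in the process's
 * address space changes, and no TLB shootdown is needed - CPUs in the
 * new domain already have "global" TLB entries for their own copy. */

#define MAX_DOMAINS             128
#define DOMAIN_AREA_FIRST_PDE   960   /* assumed: first PDE covering the area */
#define DOMAIN_AREA_PDE_COUNT   1     /* assumed: number of PDEs it spans     */

/* Master copy of the kernel PDEs for each domain (filled in at boot). */
extern unsigned int domain_kernel_pde[MAX_DOMAINS][DOMAIN_AREA_PDE_COUNT];

void fix_domain_pdes(unsigned int *process_pd, int new_domain)
{
    int i;

    for (i = 0; i < DOMAIN_AREA_PDE_COUNT; i++) {
        process_pd[DOMAIN_AREA_FIRST_PDE + i] =
            domain_kernel_pde[new_domain][i];
    }
}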
As for eating up lots of RAM, this would be a problem for a monolithic kernel - e.g. if there are 64 NUMA domains then using 6 MB per domain wouldn't be too practical. My OS is a micro-kernel though...
For me, if every kernel module (physical memory manager, scheduler, messaging code, etc) is the maximum size of 128 KB, if I find a use for all 16 "kernel module slots", and if the fixed kernel data area is also completely full, then it adds up to "16 * 128 KB + 660 KB" per domain for 32 bit systems and "16 * 128 KB + 668 KB" per domain for 64 bit systems. On top of this there's one page table for plain 32 bit paging, or 2 page tables for PAE or long mode.
This means the worst case is 2724 KB per domain for long mode, 2716 KB per domain for PAE and 2712 KB per domain for plain 32 bit paging. Fortunately, this worst case won't happen in practice - I only intend to use half of the "kernel module slots", each "kernel module" will only be about 20 KB, and even if there are 255 CPUs (all in separate domains) I won't fill all of the tables in the fixed kernel data area (I allowed for a lot of "what if" expansion space). In reality it's going to be around 512 KB per domain for a huge computer (255 CPUs with 128 NUMA domains), and less than 200 KB per domain for a small computer (2 CPUs and 2 NUMA domains).
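For reference, spelling that worst case arithmetic out:

Code:
Plain 32 bit paging:  16 * 128 KB + 660 KB + 1 * 4 KB = 2712 KB per domain
PAE:                  16 * 128 KB + 660 KB + 2 * 4 KB = 2716 KB per domain
Long mode:            16 * 128 KB + 668 KB + 2 * 4 KB = 2724 KB per domain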
If I didn't have multiple copies, then one copy of this data would still need to exist somewhere (which is what happens for a computer that doesn't support NUMA, which is actually treated as a NUMA computer with one domain). This means that for a massive computer (255 CPUs with 128 NUMA domains) I'm looking at (roughly) a total of 63.5 MB extra (127 extra copies at around 512 KB each), and for a small computer (2 CPUs and 2 NUMA domains) I'm looking at 200 KB extra (one extra copy).
Of course nothing as large as that "massive computer" has ever been manufactured, and everything that comes even slightly close has an equally huge amount of RAM. It's also worth pointing out that as systems become larger the cost of accessing RAM in other NUMA domains tends to get worse, so avoiding those accesses becomes even more important for scalability.
The other bonus for micro-kernels in a NUMA system is for the device drivers themselves, which can be assigned to (or confined within) the NUMA domain that is closest to the hardware devices. This means you don't end up with a device driver running in the context of an application in NUMA domain 0 trying to work with a device that is physically connected to NUMA domain 7. This speeds up access to the device and also minimizes the traffic between NUMA domains (which reduces contention on these "inter-domain" links).
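As a rough illustration (the names and helpers here are made up, not the real interface), the code that starts a driver only needs to know which NUMA domain the device is attached to:

Code:
/* Sketch only: when a device driver is started, ask which NUMA domain
 * the device is physically attached to and confine the driver to that
 * domain, so the driver never talks to its device across an
 * inter-domain link. */

struct device_info {
    int numa_domain;          /* domain the device is wired to */
    /* ... bus/slot/IRQ details ... */
};

extern struct device_info *find_device(const char *name);
extern int spawn_process(const char *path, int home_domain);

int start_driver(const char *driver_path, const char *device_name)
{
    struct device_info *dev = find_device(device_name);

    if (dev == 0) {
        return -1;                        /* no such device */
    }

    /* The driver process is created with its "home" NUMA domain set to
     * the device's domain; the scheduler keeps its threads on CPUs in
     * that domain and the memory manager allocates its pages there. */
    return spawn_process(driver_path, dev->numa_domain);
}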
Cheers,
Brendan