Multiple copies of kernel code on NUMA?


Multiple copies of kernel code on NUMA?

Post by Brendan »

Hi,

I'm re-implementing my paging code, and I'm wondering if it's worth storing the kernel's code in separate physical pages for each NUMA domain/memory range.

The advantage would be that each CPU would be able to run kernel code faster because it would be using memory that is "closer" to that CPU. The disadvantages would be additional memory usage and that it would be harder to implement (making sure that each address space uses the most appropriate copy of the kernel code).

If I only had one copy of the kernel's code, then one CPU would run faster than the others (if the kernel used physical pages from only one NUMA memory range). Alternatively I could mix pages from all NUMA memory ranges so that each CPU would run each kernel function at different speeds (depending on which functions use pages from which memory ranges).

I also have similar problems with the kernel's data, which must use the same physical memory for all CPUs.

My kernel will end up being roughly 128 Kb, so I'd end up using an extra 128 Kb on a 2 way computer, 384 Kb on a 4 way computer, etc.
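In rough pseudo-C the idea would be something like this (a completely untested sketch; alloc_page_in_domain(), phys_to_virt() and map_page() are made-up names for whatever my physical memory manager and paging code end up providing):

Code: Select all

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     4096u
#define KERNEL_SIZE   (128u * 1024u)              /* ~128 Kb of kernel code */
#define KERNEL_PAGES  (KERNEL_SIZE / PAGE_SIZE)
#define MAX_DOMAINS   8
#define KERNEL_VIRT   0xC0000000u                 /* kernel's virtual base  */

typedef uintptr_t phys_t;

extern uint8_t kernel_code_start[];               /* the copy loaded at boot */

phys_t alloc_page_in_domain(int domain);                        /* made up */
void  *phys_to_virt(phys_t pa);                                 /* made up */
void   map_page(void *addr_space, uint32_t virt, phys_t phys);  /* made up */

/* One list of physical pages per NUMA domain. */
static phys_t kernel_copy[MAX_DOMAINS][KERNEL_PAGES];

/* Run once per domain during boot: duplicate the kernel's code into pages
 * that belong to that domain's memory range. */
void replicate_kernel_code(int domain)
{
    for (unsigned i = 0; i < KERNEL_PAGES; i++) {
        phys_t pa = alloc_page_in_domain(domain);
        memcpy(phys_to_virt(pa), kernel_code_start + i * PAGE_SIZE, PAGE_SIZE);
        kernel_copy[domain][i] = pa;
    }
}

/* When an address space is created for a CPU in 'domain', map the kernel's
 * virtual range onto that domain's copy instead of the original. */
void map_local_kernel(void *addr_space, int domain)
{
    for (unsigned i = 0; i < KERNEL_PAGES; i++)
        map_page(addr_space, KERNEL_VIRT + i * PAGE_SIZE, kernel_copy[domain][i]);
}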

Anyone have any thoughts?


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Re:Multiple copies of kernel code on NUMA?

Post by Candy »

Brendan wrote: I'm re-implementing my paging code, and I'm wondering if it's worth storing the kernel's code in separate physical pages for each NUMA domain/memory range.

The advantage would be that each CPU would be able to run kernel code faster because it would be using memory that is "closer" to that CPU. The disadvantages would be additional memory usage and that it would be harder to implement (making sure that each address space uses the most appropriate copy of the kernel code).

If I only had one copy of the kernel's code, then one CPU would run faster than the others (if the kernel used physical pages from only one NUMA memory range). Alternatively I could mix pages from all NUMA memory ranges so that each CPU would run each kernel function at different speeds (depending on which functions use pages from which memory ranges).
MOESI states -> using the S (shared) state, you can have the kernel in every processor's cache without much delay. Of course, spending an extra 384k to make a 4-way system run faster is no waste; doing the same with a Windows-ish kernel is plain suicidal (384M extra?).

It is faster to copy.
Brendan wrote: I also have similar problems with the kernel's data, which must use the same physical memory for all CPUs.

My kernel will end up being roughly 128 Kb, so I'd end up using an extra 128 Kb on a 2 way computer, 384 Kb on a 4 way computer, etc.
No.

You DO NOT have to keep all the data in the same memory. Think fairly strong processor affinity, which would be similar to a normal computer in terms of storing and swapping processes, but with 5 spaces (for a 4-way box), one of which is the hard disk, and each processor has its own queue and list of processes. You don't have to run all the processes in the same memory; they only have to be able to communicate using memory. The same technique used to do that across a network (map a page non-present for the reader; on a read fault request a fresh copy from the writer, which marks its page read-only; on a write send a message to invalidate the remote copies) works locally too, although locally most of it is automated by the hardware.
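In pseudo-C, the fault-handler side of that protocol would look roughly like this (untested sketch, every helper name is made up):

Code: Select all

#include <stdint.h>
#include <stdbool.h>

enum page_state { PAGE_INVALID, PAGE_READ_ONLY, PAGE_WRITABLE };

struct shared_page {
    uint32_t        vaddr;
    enum page_state state;
    int             owner_node;           /* node holding the writable copy */
};

int  current_node(void);                                /* made up */
void request_copy_from_owner(struct shared_page *p);    /* made up: pull the data   */
void send_invalidate(struct shared_page *p);            /* made up: tell other nodes */
void remap(struct shared_page *p, enum page_state s);   /* made up: fix the PTE      */

/* Called from the generic page-fault handler for pages in a shared region. */
void shared_page_fault(struct shared_page *p, bool is_write)
{
    if (!is_write && p->state == PAGE_INVALID) {
        /* Read fault: fetch a copy; the owner downgrades its mapping to
         * read-only so a later write there will fault and invalidate us. */
        request_copy_from_owner(p);
        remap(p, PAGE_READ_ONLY);
        p->state = PAGE_READ_ONLY;
    } else if (is_write) {
        /* Write fault: invalidate every other copy, then take ownership. */
        send_invalidate(p);
        remap(p, PAGE_WRITABLE);
        p->state = PAGE_WRITABLE;
        p->owner_node = current_node();
    }
}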

Personally, I'd go for a bunch of processes that use a single processor, and only let them switch when the processor load is at least very unfair to the other processors (5-1 in terms of processes, i.e. when it's logical to shovel one off).

Of course, NUMA systems work very well with microkernel designs, since you don't have to shut down the services and you'd naturally be communicating anyway. Still, it works well with monolithic kernels (and derivatives) too, since most can be entered by more than one CPU at a time, and if written for NUMA they keep multiple copies of certain data structures so the mutexing isn't necessary. The monolithic system would be transformed into a system with a bunch of local services plus a few global services that communicate through microkernel means, to achieve good performance.

Re:Multiple copies of kernel code on NUMA?

Post by Brendan »

Hi,
Candy wrote: Personally, I'd go for a bunch of processes that use a single processor, and only let them switch when the processor load is at least very unfair to the other processors (5-1 in terms of processes, i.e. when it's logical to shovel one off).
Here my OS gets complicated, as I've got 2 different types of CPU affinity. The first I call "hard affinity", which is where the thread is not allowed to run on any other CPU for any reason. Hard affinity was originally intended for unmatched CPUs. For example, if the thread uses MMX and there's a pair of Pentium chips, one that supports MMX and one that doesn't, the OS would force the use of hard affinity to prevent the code from running on the CPU that doesn't support MMX.

The second type of affinity I call "soft affinity"; it's intended to maximize cache re-use, so that a thread will run on the same logical CPU it ran on last time unless that CPU is overloaded. In that case the thread will be run on a different logical CPU in the same chip (for hyper-threading and/or multi-core chips). If all else fails, a thread with soft affinity runs on any CPU it can.
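The CPU selection for soft affinity ends up being a simple fallback chain, roughly like this (untested sketch, helper names made up):

Code: Select all

#include <stdbool.h>

struct thread { int last_cpu; };

bool cpu_overloaded(int cpu);        /* made up: load check      */
int  chip_of(int cpu);               /* made up: topology lookup */
int  cpu_count(void);                /* made up                  */

int pick_cpu_soft_affinity(struct thread *t)
{
    /* 1. Same logical CPU as last time: best cache reuse / NUMA locality. */
    if (!cpu_overloaded(t->last_cpu))
        return t->last_cpu;

    /* 2. Another logical CPU in the same chip (hyper-threading/multi-core). */
    for (int cpu = 0; cpu < cpu_count(); cpu++)
        if (cpu != t->last_cpu && chip_of(cpu) == chip_of(t->last_cpu)
            && !cpu_overloaded(cpu))
            return cpu;

    /* 3. If all else fails, any CPU that isn't overloaded. */
    for (int cpu = 0; cpu < cpu_count(); cpu++)
        if (!cpu_overloaded(cpu))
            return cpu;

    return t->last_cpu;              /* everything is busy: stay put */
}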

Then there are threads that don't use any CPU affinity at all. These are threads that don't need much execution time but do need quick response times (e.g. IRQ handlers/device drivers).

Soft affinity would be the default for all threads (hard affinity and no affinity would be options). For normal multi-CPU machines this means the OS/scheduler would try to maximize CPU cache reuse. On NUMA machines it means the OS/scheduler would try to run a thread on the correct CPU for its physical memory. The only difference NUMA makes is that the OS/scheduler would make "soft affinity" harder (but not as hard as "hard affinity"). In this way, if one CPU is overloaded, some threads would be run (less efficiently) on a not-too-busy CPU for some time slices in an attempt to balance CPU load.

Each process may also have affinity. If the process has hard affinity then all of its threads will also use hard affinity. If a process has soft affinity then the OS will (usually) try to give all of its threads the same soft affinity, unless hard affinity is asked for when the thread is spawned. Also, when a thread is spawned, "reverse affinity" (soft or hard) can be requested, where the thread will run on any CPU that is not the CPU used by the rest of the process. In this case a logical CPU in the same chip (or same NUMA memory domain) would be used if possible. Reverse affinity would be used when a process wants several threads running on several CPUs at the same time, but the OS will ignore a request for reverse affinity if the process is using hard affinity.

Please note that on my OS a process has part of the address space and each thread uses the rest. On most other OSs the process has all of the address space so there's no point having thread affinity in addition to process affinity.

When you apply all this to using multiple copies of the kernel code on NUMA machines, it means that when the scheduler switches threads it may need to adjust the thread's linear address space to correct which copy of the kernel is about to be used for the thread's time slice, which is easy if PAE is being used (but harder when PAE isn't).
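With PAE, the fix-up at thread switch time is just replacing one PDPT entry and reloading CR3. Roughly (untested sketch; the per-domain page directory array and names are made up):

Code: Select all

#include <stdint.h>

#define MAX_DOMAINS   8
#define PDPTE_PRESENT 0x1ull

/* Physical address of a page directory, one per NUMA domain, each mapping
 * the kernel's virtual range onto that domain's copy of the kernel code. */
static uint64_t kernel_page_dir_phys[MAX_DOMAINS];

struct address_space {
    uint64_t pdpt[4] __attribute__((aligned(32)));  /* one PDPTE per 1 GB     */
    uint32_t pdpt_phys;                             /* physical addr, for CR3 */
};

static inline void load_cr3(uint32_t phys)
{
    __asm__ volatile ("mov %0, %%cr3" :: "r"(phys) : "memory");
}

/* Called by the scheduler just before running a thread on a CPU that belongs
 * to 'domain': point the kernel's gigabyte at the domain-local copy. */
void switch_address_space(struct address_space *as, int domain)
{
    as->pdpt[3] = kernel_page_dir_phys[domain] | PDPTE_PRESENT;

    /* The 4 PDPTEs are only read when CR3 is loaded, so reload it. */
    load_cr3(as->pdpt_phys);
}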

Hope this isn't too confusing..

Cheers,

Brendan

Re:Multiple copies of kernel code on NUMA?

Post by Brendan »

Hi again,

[Offtopic: Does anyone else think the 4 Kb limit on posts is too small??]
Candy wrote: You DO NOT have to keep all the data in the same memory. Think fairly strong processor affinity, which would be similar to a normal computer in terms of storing and swapping processes, but with 5 spaces (for a 4-way box), one of which is the hard disk, and each processor has its own queue and list of processes. You don't have to run all the processes in the same memory; they only have to be able to communicate using memory. The same technique used to do that across a network (map a page non-present for the reader; on a read fault request a fresh copy from the writer, which marks its page read-only; on a write send a message to invalidate the remote copies) works locally too, although locally most of it is automated by the hardware.
The kernel can/will be optimized for NUMA, so that (where possible) data is stored in the most appropriate physical memory pages. For example, message queues would use memory suited to the queue owner's affinity. Some things (the IO port owner table, IRQ receiver lists, thread state table, etc) would still need to share the same physical memory. Having several copies of these tables would involve too much overhead when they are modified (for example, two entries in the thread state table are modified for every thread switch). For these data structures it'd be faster to use the wrong NUMA physical memory pages than to update several copies. Something like the IO port owner table is debatable, as it's read more often than it's modified. For this I might use a master table and several copies, where the copies are marked invalid when the master is modified. If the master table/s are bound to a specific CPU, the OS could prefer that CPU over the others when a process that uses IO ports and/or IRQs is started (and prefer any other CPU when the process won't use them).
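For the master/copies idea, something roughly like this is what I have in mind (untested sketch, names and sizes made up):

Code: Select all

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define MAX_NODES   8
#define TABLE_BYTES 8192            /* e.g. one owner word per group of I/O ports */

static uint8_t master_table[TABLE_BYTES];
static uint8_t node_copy[MAX_NODES][TABLE_BYTES];   /* ideally in node-local RAM */
static volatile bool copy_valid[MAX_NODES];

void lock_master(void);             /* made up: spinlock around the master */
void unlock_master(void);

/* Read path: cheap and node-local; only refreshed after an invalidation. */
const uint8_t *table_for_node(int node)
{
    if (!copy_valid[node]) {
        lock_master();
        memcpy(node_copy[node], master_table, TABLE_BYTES);
        copy_valid[node] = true;
        unlock_master();
    }
    return node_copy[node];
}

/* Write path: rare; update the master, then mark every copy stale. */
void table_update(unsigned offset, uint8_t value)
{
    lock_master();
    master_table[offset] = value;
    for (int n = 0; n < MAX_NODES; n++)
        copy_valid[n] = false;
    unlock_master();
}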


Cheers,

Brendan

Re:Multiple copies of kernel code on NUMA?

Post by Candy »

Uhm... how do you figure any BIOS will be written that allows incompatible CPUs to be used simultaneously? At least, I never saw any, and the only practical situation I can think of (in the Intel world) would be hot-swapping Xeons around, having 2 types when you're halfway done switching. Still, it's damn unlikely anyone ever even upgrades those servers, let alone runs an OS on them that isn't spread worldwide.

Assuming it's done more often, what do you do with a 16-way Xeon system in which one CPU is older? That wrecks the dynamic properties of the 16-way system, because one of the ways isn't fully functional.

My suggestion would be to always check, using some procdata mechanism and a bunch of feature flags (possibly directly those from CPUID), whether the current CPU can run that thread, and otherwise skip ahead to the next one. This still allows you to soft-tie a thread to any processor, but you don't have to hard-tie it because of an incompatible feature set. BTW, you could use a bitmap stating the processors on which it may run :)
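In C that check is only a few lines, roughly (untested sketch, the structs are made up):

Code: Select all

#include <stdint.h>
#include <stdbool.h>

struct cpu_info {
    uint32_t features_edx;      /* CPUID.1:EDX, corrected for known errata */
    uint32_t features_ecx;      /* CPUID.1:ECX */
};

struct thread_procdata {
    uint32_t needs_edx;         /* feature bits this thread requires */
    uint32_t needs_ecx;
    uint32_t allowed_cpus;      /* bit n set => may run on CPU n     */
};

bool can_run_here(const struct thread_procdata *t,
                  const struct cpu_info *cpu, int cpu_num)
{
    if (!(t->allowed_cpus & (1u << cpu_num)))
        return false;                       /* not in this thread's CPU set */

    /* Every required feature bit must be present on this CPU. */
    return (t->needs_edx & ~cpu->features_edx) == 0 &&
           (t->needs_ecx & ~cpu->features_ecx) == 0;
}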

For the thread scheduling, consider researching (and implementing, of course) band scheduling. If you always run all of the related threads simultaneously, complex threaded code runs faster (and non-complex code runs at the same speed regardless). More info in Tanenbaum's Modern Operating Systems.



Re to the second post:

I wasn't implying copying the entire table around or anything. Just make each per-CPU scheduler keep its own list of the threads it manages: those with hard affinity to it, those with soft affinity that are currently here (last ran here), and those with no affinity that happen to be in its address space. When a CPU is underloaded, it broadcasts an I-want-more-to-do IPI, whereupon the rest of the CPUs check whether they have any threads with no affinity that they want to dump (making the switching between processors explicit, but unlinked), or with soft affinity if they're overloaded themselves.

For the band scheduling, you could make an interrupt handler that decides the next thread for ALL CPUs, while the rest get an IPI requesting that they check their procdata for the new thread they're going to run. Less overhead, less cache wrecking, and it keeps it all manageable; why isn't this in any book?

Re the last point of your first post: that's true if you put everything in the same address space. For 32-bit machines there's no gain in putting in more than 2 CPUs, so you can still put it all (both CPUs) in the same address space. 64-bit gives you a huge address space, so again there's no reason to do that.

Grtz, Candy

PS: the 4K limit is way too small; any decent discussion will need double posting by all participants

PPS: quotes left out to reduce size

Re:Multiple copies of kernel code on NUMA?

Post by Brendan »

Hi,

<- BIOSes and incompatible CPUs ->

There are hardware limitations, BIOS limitations, and software/OS limitations. I can't do anything about the first 2, but I intend to remove as many software/OS limitations as possible. You'd be surprised how many things are software/OS limitations. For CPUs, a Pentium 166 without MMX is almost identical to a Pentium 166 with MMX, and as the BIOS wouldn't use MMX it shouldn't care if one CPU has it and another doesn't. The same goes for a Pentium IV with and without SSE3, or with different cache sizes. Current motherboards/BIOSes will handle these slight differences. How about a 2-way motherboard with an 80486DX at 33 MHz and an 80486DX2 at 66 MHz? Both chips use the same pins and external clock frequencies, and I doubt that the BIOS will compare CPU speeds and complain. There are a number of internal CPU differences that the BIOS and motherboard wouldn't/shouldn't have problems with.

IMHO most OSs are SMP, and therefore most multi-CPU computers use identical CPUs (i.e. software imposing limits on hardware). My OS is multi-CPU (not SMP) and allows any combination of CPUs. So much so that if someone were silly enough to build a 4-way computer with an 80486, a dual core Opteron, a hyper-threading Pentium IV and a Pentium Pro, my OS would happily work and take advantage of the features of all chips. The only exception here is that the paging data structures would use the lowest common denominator (32-bit linear address spaces, no global bit, etc). In reality, minor differences between CPUs are more likely.

From Intel's "MultiProcessor Specification", Version 1.4, "B.8 Supporting Unequal Processors":
Some MP operating systems that exist today do not support processors of different types, speeds, or capabilities. However, as processor lifetimes increase and new generations of processors arrive, the potential for dissimilarity among processors increases. The MP specification addresses this potential by providing an MP configuration table to help the operating system configure itself. Operating system writers should factor in processor variations, such as processor type, family, model, and features, to arrive at a configuration that maximizes overall system performance. At a minimum, the MP operating system should remain operational and should support the common features of unequal processors.
<- Procdata mechanism ->

This is a great idea :) I already have a data structure for each CPU that contains the corrected feature flags and is permanently referenceable via GS (i.e. "[gs:CPULISTentryStruct.features1]"). The flags taken from CPUID (if CPUID is supported) aren't always sane (there's errata from all vendors). E.g. I've got a Cyrix CPU that returns the ASCII characters "tead" instead of feature flags 2 (ECX returned by CPUID/eax=1). A 32-byte bitmap would be easy too. This would also allow me to forget about "hard affinity"...
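In C-ish terms the per-CPU structure looks roughly like this (untested sketch; the layout and field names are made up, and GS is given a different base on each CPU so "gs:offset" always hits that CPU's own entry):

Code: Select all

#include <stdint.h>
#include <stddef.h>

struct cpu_list_entry {
    uint32_t features1;         /* corrected CPUID.1:EDX flags  */
    uint32_t features2;         /* corrected CPUID.1:ECX flags  */
    uint8_t  allowed_map[32];   /* 32-byte CPU bitmap, 256 CPUs */
    /* ... per-CPU scheduler queues, statistics, etc ... */
};

/* Read a 32-bit field of the current CPU's entry through the GS segment. */
static inline uint32_t percpu_read32(uint32_t offset)
{
    uint32_t value;
    __asm__ volatile ("movl %%gs:(%1), %0" : "=r"(value) : "r"(offset));
    return value;
}

static inline uint32_t this_cpu_features1(void)
{
    return percpu_read32(offsetof(struct cpu_list_entry, features1));
}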

<- CPU load balancing ->

The only reason a thread would have no affinity is if response time is more important than efficiency. How about a single queue of threads without any affinity, and several queues (one per CPU) of threads using soft affinity? A CPU would run the highest priority thread from either the no-affinity queue or its soft affinity queue. An underloaded CPU would run more of the threads without affinity, but could run a thread from another CPU's soft affinity queue if it's still underloaded (thread stealing). If a CPU is overloaded it'd end up running threads from its own soft affinity queue and nothing else. If the computer does have NUMA then "thread stealing" would be for one time slice only (to improve NUMA memory accesses). If there is no NUMA then a "stolen" thread will belong to the new CPU (to improve CPU cache re-use). In both cases, stealing threads from a CPU on the same physical chip would be preferred.
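In rough pseudo-C the scheduler's pick would be something like this (untested sketch, all names made up):

Code: Select all

#include <stddef.h>

struct thread { int priority; /* ... */ };

struct thread *peek(int queue);        /* made up: highest priority, or NULL */
struct thread *pop(int queue);         /* made up */
int  no_affinity_queue(void);          /* made up */
int  soft_queue_of(int cpu);           /* made up */
int  cpu_count(void);                  /* made up */

struct thread *pick_next(int this_cpu)
{
    struct thread *global = peek(no_affinity_queue());
    struct thread *local  = peek(soft_queue_of(this_cpu));

    /* Prefer whichever runnable thread has the higher priority. */
    if (local && (!global || local->priority >= global->priority))
        return pop(soft_queue_of(this_cpu));
    if (global)
        return pop(no_affinity_queue());

    /* Nothing of our own: steal from another CPU's soft affinity queue
     * (on NUMA this would be for one time slice only). */
    for (int cpu = 0; cpu < cpu_count(); cpu++)
        if (cpu != this_cpu && peek(soft_queue_of(cpu)))
            return pop(soft_queue_of(cpu));

    return NULL;                       /* idle */
}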

<- Band Scheduling ->

This would mean that all CPUs use the shortest time slice of any running thread, which would dramatically increase the number of thread switches (and IPIs) required. A lot of threads in my OS need very short amounts of time (e.g. receive message -> handle message -> block waiting for the next message, where "handle message" is often very short).


Cheers,

Brendan

Re:Multiple copies of kernel code on NUMA?

Post by Candy »

<- Incompatible CPUs ->
Ok, didn't know that was in there. Time for me to get that added to the design (not hard).

<- Band scheduling ->
The idea of band scheduling is that threads do not just block on message retrieval, they spin. Since all the threads run simultaneously, there is a big chance that a message will be sent and arrive immediately, and that this happens a bunch of times in a row. In the other case you have constant switches, because they all block after EACH message. My guess would be that this is faster, but I have no benchmarks.
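Roughly, the receive path becomes spin-then-block (untested sketch; the queue functions and the spin count are made up):

Code: Select all

#include <stddef.h>

#define SPIN_TRIES 1000

struct message;
struct msg_queue;

struct message *queue_pop(struct msg_queue *q);   /* made up: NULL if empty         */
void block_on(struct msg_queue *q);               /* made up: sleep until a message */

struct message *receive(struct msg_queue *q)
{
    for (;;) {
        /* Spin for a bounded number of attempts first, on the assumption
         * that the sender is running on another CPU right now. */
        for (int i = 0; i < SPIN_TRIES; i++) {
            struct message *m = queue_pop(q);
            if (m)
                return m;
            __asm__ volatile ("pause");            /* be polite to a sibling thread */
        }
        /* Still nothing: fall back to blocking as usual. */
        block_on(q);
    }
}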

<- procdata ->
I just knew I took the easy way :D. Only officially supporting AMD processors that have an APIC does pay off in the end. It limits the CPU count somewhat, but it makes programming one hell of a lot easier. Since most CPU quirks will be hidden in patch-boxes (a module for that CPU), it should be OK to just ship a plain version.

By the way, my procdata struct is in src/kernel/core/pm/procdata.h, if you're interested (or anyone else). Code at www.sf.net/projects/atlantisos

<- CPU load balancing ->
Imagine the case with 3 threads and 2 CPUs, all CPU-bound at the same priority. With a non-swapping, non-separated scheduler they're all going to ping-pong back and forth like there's no tomorrow.

I stand by my point that, especially in the case of NUMA, having explicit thread swaps with soft affinity is a good thing. Also, penalizing a processor for running a thread from a different processor's memory is a good thing, so that it gets swapped back ASAP. Care should be taken that processes are preferably started on the CPU with the lightest load of long-running tasks, preventing one processor from being overloaded (or the other processors from being penalized).

Re:Multiple copies of kernel code on NUMA?

Post by Brendan »

Hi,
Candy wrote: Ok, didn't know that was in there. Time for me to get that added to the design (not hard).
Not too hard, but there are some complications. For example, Intel's MP Specification will give you feature flags 1 (from CPUID), but not feature flags 2 or the extended feature flags (CPUID, eax=0x80000001). Of these, my OS needs the extended feature flags while still in 16-bit mode, so it knows whether all CPUs support long mode (rather than just the BSP) and whether the 64-bit setup code and 64-bit kernel should be used. Therefore I need to detect & start all CPUs and get this information during boot, which means parsing both the ACPI and MP tables (one of the reasons I've got about 160 Kb of 16-bit code & data). On top of this there's the affinity stuff, and "does CPU N support feature X". My OS takes it a little further, in that if none of the CPUs support feature X, the process may be started on a different computer in the cluster that does support it.
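The long mode check itself is trivial once CPUID is known to exist. Roughly (this shows the logic only; in reality it's 16-bit assembly in my boot code, and the function names are made up):

Code: Select all

#include <stdint.h>
#include <stdbool.h>

static inline void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                         uint32_t *c, uint32_t *d)
{
    __asm__ volatile ("cpuid"
                      : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                      : "a"(leaf));
}

/* Run on each CPU as it is started during boot. */
bool cpu_supports_long_mode(void)
{
    uint32_t a, b, c, d;

    cpuid(0x80000000, &a, &b, &c, &d);
    if (a < 0x80000001)             /* no extended leaves at all */
        return false;

    cpuid(0x80000001, &a, &b, &c, &d);
    return (d & (1u << 29)) != 0;   /* EDX bit 29 = long mode    */
}

/* The BSP ANDs the results together; any CPU lacking long mode forces the
 * 32-bit kernel (lowest common denominator, as described above). */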
Candy wrote: The idea of band scheduling is that threads do not just block on message retrieval, they spin. Since all the threads run simultaneously, there is a big chance that a message will be sent and arrive immediately, and that this happens a bunch of times in a row. In the other case you have constant switches, because they all block after EACH message. My guess would be that this is faster, but I have no benchmarks.
I don't like spinning where it can be avoided (unless it's for very short/finite amounts of time). In general there's a good chance that the messages will be sent to a thread that isn't running anyway (especially with a low number of CPUs), as you can't easily predict where a thread will send messages to.

It'd probably be better to redefine the format of sent messages so that a single message can convey all information and multiple messages in quick succession aren't necessary. On my OS a message can be up to 32 Mb, so a list of requests can be sent as a single message (although this was originally done to allow a complete frame of video to be sent as a message).


Cheers,

Brendan