Hi,
Virtlink wrote:When I read the posts in this thread, I understand it as this:
- No process should be limited in the number of CPUs that they use at one time.
Hmm - I'd say "No process should be limited in the number of CPUs that they
could use at one time". There are plenty of good reasons for a process to be limited to a subset of all available CPUs, either by the OS/kernel or by the process itself. These reasons include situations where the process uses features that only some of the CPUs support (e.g. a process that uses MMX on a "mixed Pentium" system where only some of the CPUs support MMX), NUMA efficiency (where a process is limited to CPUs that belong to the same NUMA domain), etc.
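As a rough sketch of how a kernel might represent such a restriction, here's a simple "allowed CPUs" bitmask check (all names here are invented for illustration; a real kernel would have something more elaborate):

```c
#include <stdint.h>

/* Hypothetical per-process affinity mask: bit N set = CPU N allowed. */
typedef struct {
    uint64_t allowed_cpus;   /* supports up to 64 CPUs in this sketch */
} process_t;

/* Scheduler-side check: may this process run on the given CPU? */
static int can_run_on(const process_t *proc, unsigned cpu)
{
    return (proc->allowed_cpus >> cpu) & 1;
}

/* Example: restrict a process to CPUs 0-3 (e.g. one NUMA domain,
   or the subset of CPUs that support a needed feature like MMX). */
static void restrict_to_numa_domain0(process_t *proc)
{
    proc->allowed_cpus = 0x0Fu;
}
```

The scheduler would simply skip any CPU for which `can_run_on()` returns 0 when deciding where a thread may run.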
Virtlink wrote:- CPUs are very likely to share caches.
As a rough guide, logical CPUs within the same core (hyper-threading) share all levels of cache, including the L1 caches; cores within the same chip might share L2 and/or L3 caches; but no caches are ever shared between CPUs that are in different chips (except in rare/unusual NUMA systems where there's some sort of "cache line state" caching done by the chipset to reduce cache miss and RFO costs).
Virtlink wrote:Is it feasible for the OS to know which caches are shared?
Yes. For modern Intel CPUs you can use "CPUID, EAX=0x00000004" to get the "Deterministic Cache Parameters", which includes the number of logical CPUs sharing each cache and a lot more information (cache size, associativity, etc). For AMD CPUs there isn't an equivalent way to find out which CPUs share a cache (CPUID 0x80000005 and 0x80000006 report cache sizes and associativity, but not sharing), but AFAIK you can assume that L1 and L2 caches aren't shared, and the L3 cache (if present) is shared by all cores within the chip.
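To illustrate, here's a minimal decoder for the registers returned by one "Deterministic Cache Parameters" sub-leaf (the bit-field layout is from Intel's documentation; the struct and function names are just for this sketch):

```c
#include <stdint.h>

typedef struct {
    unsigned type;       /* 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;      /* 1 = L1, 2 = L2, ... */
    unsigned sharing;    /* logical CPUs sharing this cache */
    uint32_t size;       /* total cache size in bytes */
} cache_info_t;

/* Decode EAX/EBX/ECX as returned by CPUID with EAX=4 and ECX=sub-leaf index. */
static cache_info_t decode_cache_leaf(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    cache_info_t c;
    c.type    = eax & 0x1F;                           /* EAX[4:0]       */
    c.level   = (eax >> 5) & 0x7;                     /* EAX[7:5]       */
    c.sharing = ((eax >> 14) & 0xFFF) + 1;            /* EAX[25:14] + 1 */
    uint32_t ways       = ((ebx >> 22) & 0x3FF) + 1;  /* EBX[31:22] + 1 */
    uint32_t partitions = ((ebx >> 12) & 0x3FF) + 1;  /* EBX[21:12] + 1 */
    uint32_t line_size  = (ebx & 0xFFF) + 1;          /* EBX[11:0]  + 1 */
    uint32_t sets       = ecx + 1;
    c.size = ways * partitions * line_size * sets;
    return c;
}
```

One caveat: the EAX[25:14] field is really the "maximum number of addressable IDs" for logical CPUs sharing the cache, which may be rounded up to a power of 2, so real code should combine it with APIC IDs to work out which specific CPUs actually share which cache.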
Virtlink wrote:If so, the scheduler could schedule multiple threads of the same process to CPUs with a shared TLB.
AFAIK TLB entries are never shared.
For hyper-threading, logical CPUs that belong to the same core "competitively share" the TLB, which means that one logical CPU can't use a TLB entry that is owned by another logical CPU (even if CR3 is the same in both logical CPUs, and even if the TLB entry is marked as "global"). This means that running threads that belong to the same process on logical CPUs within the same core or on cores that belong to the same chip won't reduce the chance of TLB misses.
Note: "won't reduce the chance of TLB misses" is not the same as "won't reduce the cost of TLB misses"....
Virtlink wrote:Are there other means to optimize scheduling for the other caches (L1, L2, etc...)?
Yes - the general idea is to keep caches "warm"; i.e. schedule threads on CPUs that may still have cached data that the thread needs. For example, if a thread was running on CPU#0 for a while and blocks, then when that thread is started again its data might still be in CPU#0's cache, and might also still be in other CPUs' caches if those caches are shared with CPU#0.
However, if you always schedule threads to run on the same CPU that they ran on last time (or the same group of CPUs that share cache) then you won't be able to effectively balance CPU load - e.g. you could have 100 threads waiting to use CPU#0 while other CPUs are doing nothing, which is bad for performance. Therefore it's a good idea to take into account the chance of a thread's data still being in the CPU's cache. CPU caches use a "least recently used" replacement algorithm, so if a thread was running on CPU#0 but CPU#0 has done a lot of other work since, then that thread's data probably won't be in CPU#0's cache anymore, and it may be a bad idea to schedule that thread on CPU#0 again if other CPUs have less load.
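One way to put a number on "probably still in the cache" is a per-CPU work counter: remember the counter's value when a thread last ran on a CPU, and treat a large difference as a cold cache. A sketch (the names, the counter granularity and the threshold are all invented for illustration):

```c
#include <stdint.h>

#define NCPUS 4
#define WARMTH_THRESHOLD 1000   /* arbitrary: "work units" before we assume cold */

static uint64_t cpu_work[NCPUS];   /* incremented as each CPU does work */

typedef struct {
    int last_cpu;
    uint64_t work_at_last_run;  /* cpu_work[last_cpu] when this thread left it */
} thread_t;

/* Prefer the thread's previous CPU while its data is likely still cached;
   otherwise fall back to the least-loaded CPU. */
static int pick_cpu(const thread_t *t, const uint64_t load[NCPUS])
{
    uint64_t delta = cpu_work[t->last_cpu] - t->work_at_last_run;
    if (delta < WARMTH_THRESHOLD)
        return t->last_cpu;            /* cache probably still warm */
    int best = 0;
    for (int i = 1; i < NCPUS; i++)
        if (load[i] < load[best])
            best = i;
    return best;                       /* cache probably cold: just balance load */
}
```

This captures the trade-off from the paragraph above: stickiness while the cache is likely warm, plain load balancing once enough other work has evicted the thread's data.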
To complicate this more, threads that belong to the same process probably share some data. If 2 threads that belong to the same process are running at the same time, it would improve performance to run those threads on CPUs that share a cache; and if a CPU was used to run a thread recently then it might still have data in its cache that a different thread belonging to the same process might need.
Note: this sort of cache optimization can indirectly affect TLB miss costs. For example, if 2 CPUs share the same L3 cache but don't share TLBs, then a TLB miss on one CPU might be satisfied by page table data that's in the L3 cache because of the other CPU's TLB miss. Basically, even though TLBs aren't shared, optimizing for L1/L2/L3 cache sharing can also indirectly reduce the amount of time it takes to handle TLB misses.
For load balancing on CPUs with hyper-threading things get even more complicated, because work done on one logical CPU affects the performance of the other logical CPU. For example, if you've got 2 threads that belong to the same process and nothing else that's ready to run, and 2 separate chips that both support hyper-threading (4 logical CPUs total), then it's better for performance to schedule one thread on each chip (with no chance of cache sharing) instead of scheduling both threads on the same chip (where they can share cache) and leaving the other chip idle.
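The "spread across chips first" rule from that example could look something like this (the topology table and names are invented; real code would build the core/chip map from CPUID):

```c
#define NCPUS 4

/* Logical CPU -> physical chip, for 2 chips x 2 hyper-threads:
   CPUs 0,1 are siblings on chip 0; CPUs 2,3 are siblings on chip 1. */
static const int chip_of[NCPUS] = { 0, 0, 1, 1 };

/* Pick an idle logical CPU, preferring one whose whole chip is idle, so
   two runnable threads land on different chips instead of competing for
   the execution resources of one chip. */
static int pick_idle_cpu(const int busy[NCPUS])
{
    /* First pass: an idle CPU on a chip with no busy sibling. */
    for (int i = 0; i < NCPUS; i++) {
        if (busy[i]) continue;
        int sibling_busy = 0;
        for (int j = 0; j < NCPUS; j++)
            if (j != i && chip_of[j] == chip_of[i] && busy[j])
                sibling_busy = 1;
        if (!sibling_busy) return i;
    }
    /* Second pass: any idle CPU at all. */
    for (int i = 0; i < NCPUS; i++)
        if (!busy[i]) return i;
    return -1;  /* nothing idle */
}
```

A smarter version would weigh this against the cache-sharing benefit described earlier, since the two goals pull in opposite directions for threads of the same process.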
Of course all of the above is about optimizing performance. Power management (e.g. optimizing for heat) is a different problem. It's not an unrelated problem though; most modern CPUs will reduce their performance when they get too hot (it's "self defense") and can suddenly drop down to 25% or even 12.5% of normal performance. To optimize for heat, you want to schedule threads on the coolest CPU (or maybe on the least loaded CPU if you can't get temperature information), which means forgetting about all of the performance optimizations above. A "very good" scheduler would take both performance and power management into account and optimize for either depending on a variety of factors (and not just optimize for performance alone). For power management, other factors include whether or not the computer is running on batteries (mobile systems like laptops *and* servers running from a UPS during a power failure), and whether or not the computer is in a crowded office or someone's home (where CPU fan noise/speed may matter more than performance).
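A combined policy might score each CPU on load, temperature and cache warmth, weighting temperature more heavily when running on batteries. Everything below is an invented sketch with made-up weights, not a tested policy:

```c
#define NCPUS 4

/* Lower score = better choice. The weights are arbitrary tuning knobs. */
static int pick_cpu_scored(const int temp_c[NCPUS],  /* per-CPU temperature */
                           const int load[NCPUS],    /* per-CPU queue length */
                           int warm_cpu,             /* CPU with warm cache, or -1 */
                           int on_battery)
{
    int best = 0, best_score = 0x7FFFFFFF;
    for (int i = 0; i < NCPUS; i++) {
        /* Heat counts for more when power matters (battery/UPS). */
        int score = load[i] * 10 + temp_c[i] * (on_battery ? 4 : 1);
        if (i == warm_cpu)
            score -= 20;   /* bonus for a probably-warm cache */
        if (score < best_score) { best_score = score; best = i; }
    }
    return best;
}
```

The point isn't these particular numbers, but that performance and power management end up as terms in one decision rather than two separate schedulers.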
Cheers,
Brendan