Hi,
bewing wrote:
Small processes don't need an entire address space, so the number of address space switches is reduced.
meh. Then you need to relocate them, one way or another, if they are sharing an address space. And there is no inter-process memory protection. I don't think it's worth it, just to save an address space switch.
A 32-bit OS can use segmentation to implement "small address spaces", where segmentation provides the relocation (segment base) and protection (segment limit). In this case you'd need to mark GDT/LDT descriptors that belong to other processes as "not present", so that CPL=3 code can't load those segments and access another process's data (which in turn might mean using a different GDT for each CPU), but apart from that it's easy enough to do. AFAIK the L4 kernel supports this.
For a 64-bit OS running 32-bit (or 16-bit) processes the same principle applies. For a 64-bit OS running 64-bit processes it won't work though - you can use RIP-relative addressing for relocation, but you can't provide the protection without messing with paging structures (e.g. marking page directories as "not present" to prevent access), which involves TLB flushing and defeats the purpose.
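For the 32-bit segmentation case, here's roughly what one of those descriptors looks like and how you'd mark it "not present" (just a sketch - the struct layout follows the Intel manuals, and the helper name is something I made up for illustration):
Code: Select all
#include <stdint.h>

/* One protected-mode GDT/LDT segment descriptor. The base is where the
   "small address space" starts, the limit is its size, and the present
   bit (bit 7 of the access byte) is what you'd clear for descriptors
   belonging to other processes. */
typedef struct {
    uint16_t limit_low;         /* limit bits 0..15 */
    uint16_t base_low;          /* base bits 0..15 */
    uint8_t  base_mid;          /* base bits 16..23 */
    uint8_t  access;            /* type, DPL, present bit (bit 7) */
    uint8_t  limit_high_flags;  /* limit bits 16..19 plus granularity/size flags */
    uint8_t  base_high;         /* base bits 24..31 */
} __attribute__((packed)) gdt_descriptor;

/* Clear the present bit so CPL=3 code can't load this segment. */
static void mark_not_present(gdt_descriptor *d) {
    d->access &= ~0x80;
}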
bewing wrote:
However, this doesn't work very well with multiple threads, because they run sequentially on the same CPU.
Important disagreement #1: I think it works quite excellently with multiple threads, for precisely that reason. There should always be enough processes/address spaces running on the machine to saturate the total number of cores. There is no reason to assume that any one process will ever have such a high priority that it will ever need to be granted exclusive use of a core, plus exclusive use of that core's hyperthread, PLUS MORE.
Huh? Firstly this has nothing to do with priority - a process with n low priority threads should still be able to use n CPUs (if present). Otherwise you end up with entire CPUs doing nothing while one CPU does everything.
bewing wrote:To postulate a situation where one process is requiring that much CPU time, and all other processes are completely idle is a) ridiculous, and b) a situation where it is quite acceptable for an OS to provide less than 100% performance from the hardware. (However, if it is vital to create such a gluttonous app, then the programmer should create it as multiple processes in independent address spaces, communicating via IPC or the filesystem, if necessary.)
Huh? A situation where one process is requiring that much CPU time, and all other processes are completely idle is a) extremely likely. Think of something like a ray-tracer, where the frame being generated is divided into n areas for n CPUs. You could try to force something like this to use multiple processes instead of just using multiple threads, but in this case it'd just end up using shared memory mappings which ends up being almost identical to multi-threaded anyway (and if your OS doesn't support shared memory mappings, then they'll find a different OS that does let them use all the CPUs).
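As a deliberately simplified example of that sort of ray-tracer, using POSIX threads (the CPU count, frame height and the per-row work are all placeholders):
Code: Select all
#include <pthread.h>
#include <stddef.h>

#define NUM_CPUS     4      /* placeholder - normally detected at run time */
#define FRAME_HEIGHT 1024   /* placeholder frame size */

/* Each worker renders one horizontal strip of the frame. All workers share
   the same address space (and frame buffer), which is exactly the case
   where threads are simpler than separate processes plus shared memory. */
typedef struct { int first_row; int last_row; } strip;

static void *render_strip(void *arg) {
    strip *s = arg;
    for (int row = s->first_row; row < s->last_row; row++) {
        /* trace_row(row);  -- hypothetical per-row ray-tracing work */
    }
    return NULL;
}

int main(void) {
    pthread_t worker[NUM_CPUS];
    strip work[NUM_CPUS];

    for (int i = 0; i < NUM_CPUS; i++) {
        work[i].first_row = i * FRAME_HEIGHT / NUM_CPUS;
        work[i].last_row  = (i + 1) * FRAME_HEIGHT / NUM_CPUS;
        pthread_create(&worker[i], NULL, render_strip, &work[i]);
    }
    for (int i = 0; i < NUM_CPUS; i++) {
        pthread_join(worker[i], NULL);
    }
    return 0;
}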
bewing wrote:Some logical CPUs share all or most of their caches. For the benefits of "locality" you want only the hardware that shares caches to be sharing address spaces. Period.
You never want to extend a process beyond the shared caches of the logical CPUs. If you do, you get RFOs, which are guaranteed to slow your system to a crawl -- and it will crawl EVEN SLOWER, on a faster machine with thousands of cores, than it will on a system with two. This is negative scalability.
It's not so simple. For example, on multi-core CPUs the L2 cache is often shared between two cores.
If there's 8 logical CPUs (e.g. a 4-core CPU with hyper-threading) and a process is only allowed to use one pair of logical CPUs, then that process will end up running at about 125% of the speed it'd get from one CPU (because the speedup from hyper-threading is nothing like the speedup from multiple cores). If the same process was allowed to use all logical CPUs but suffered a 10% performance penalty because of things like cache line sharing, etc, then it'd end up running at about 430% of single-CPU speed. IMHO 430% is greater than 125% by a significant margin...
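To show where those sorts of figures come from (the hyper-threading gain and the sharing penalty here are assumptions, so treat the output as ballpark only):
Code: Select all
#include <stdio.h>

int main(void) {
    /* Assumed (not measured) figures: hyper-threading adds roughly 20% over
       one core, and letting the process span all cores costs roughly 10%. */
    double ht_gain = 1.20;
    double penalty = 0.90;

    double one_pair = 1.0 * ht_gain;              /* one core + its hyperthread */
    double all_cpus = 4.0 * ht_gain * penalty;    /* 4 cores with HT, minus the penalty */

    printf("one pair: ~%.0f%%, all CPUs: ~%.0f%%\n", one_pair * 100, all_cpus * 100);
    return 0;
}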
bewing wrote:You can guarantee that a hyperthread shares ALL the caches of its associated CPU. So that is the obvious candidate for sharing address spaces between logical CPUs. Beyond that, I think it is a bad idea -- unless you get really clever about writing the code, and about compiling it -- and you do not have that degree of control over user apps.
You don't need to be very clever about writing code - it's relatively easy to design a process where one thread does one type of work and another thread does another type of work, or where threads do the same type of work on different data (with very little data sharing); and it's also relatively easy to design a process consisting of many I/O bound threads (where caches are cold by the time any thread gets CPU time again anyway). Memory that's never modified can happily be in many CPUs' caches at the same time (this can actually improve performance, because it's faster for one CPU to get data from another CPU's cache than it is to read it from RAM).
It's only when data that is used by different CPUs is modified that you get potential performance problems. However, if a piece of data needs to be shared then a piece of data needs to be shared, and an RFO is much better than forcing programmers to rely on messaging, pipes or file I/O (or any other form of inter-process communication that doesn't involve shared memory mappings and RFOs).
Basically what you're doing is assuming you know more about the process than the person writing the process, even though you don't even know what that process is. If the programmer writing the process wants to be limited to a subset of available CPUs, then let them use CPU affinity to reduce the number of CPUs that the process can use; or alternatively, if the programmer writing the process doesn't want to be limited to a subset of available CPUs, then let them use CPU affinity to increase the number of CPUs the process can use. In this case you only need to worry about the most sensible "default" behavior; and IMHO the most sensible default behavior is to let a process use all CPUs within a NUMA domain.
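For what it's worth, on an existing OS that's only a few lines - e.g. using Linux's sched_setaffinity() (shown purely as an illustration; your own OS would provide its own equivalent):
Code: Select all
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Restrict the calling process to CPUs 0 and 1 (illustration only). */
int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    CPU_SET(1, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}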
bewing wrote:
what do the rest of the CPUs do?
One runs the kernel. One runs each manager/executive/system service. One runs each interrupt. One runs each other process on the machine. Until you fill them all up. Then you start scheduling.
I wouldn't necessarily do that either. Having one CPU that's idle 20% of the time and 7 more CPUs in a low power sleep state is much better than having 8 CPUs that are idle 90% of the time (and that are constantly being taken out of any low power state they manage to get into). Even better might be to have 2 CPUs going at 80% load using software-controlled throttling (e.g. CPU clocks halved) with the other 6 CPUs in a low power sleep state, where every five seconds or so the CPUs are rotated so that the heat is spread across all available heatsinks (and so that no single chip generates all the heat, leaving every CPU better able to sustain a longer period of "full throttle" activity before becoming hot).
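Here's a very rough sketch of that "rotate the active CPUs every few seconds" idea - every function name in it is made up (there's no real API behind them):
Code: Select all
#define TOTAL_CPUS  8
#define ACTIVE_CPUS 2

/* Hypothetical kernel hooks - stubbed out here, names are assumptions. */
static void cpu_wake(int cpu)              { (void)cpu; }
static void cpu_sleep(int cpu)             { (void)cpu; }
static void migrate_work(int from, int to) { (void)from; (void)to; }

/* Called from a (hypothetical) 5 second timer: shift the work onto the next
   pair of CPUs and put the old pair to sleep, so no single chip stays hot. */
static void rotate_active_cpus(void) {
    static int first_active = 0;
    int next_first = (first_active + ACTIVE_CPUS) % TOTAL_CPUS;

    for (int i = 0; i < ACTIVE_CPUS; i++) {
        int old_cpu = (first_active + i) % TOTAL_CPUS;
        int new_cpu = (next_first + i) % TOTAL_CPUS;
        cpu_wake(new_cpu);
        migrate_work(old_cpu, new_cpu);
        cpu_sleep(old_cpu);
    }
    first_active = next_first;
}

int main(void) {
    rotate_active_cpus();   /* in a kernel this would be driven by a timer instead */
    return 0;
}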
My suggestion is to use some sort of calculation to give each potential CPU a score and then make the thread run on the CPU with the best score. That way you can dramatically change scheduling characteristics by changing the function that calculates the score. I'd start by using something like this to calculate the score:
Code: Select all
int find_best_CPU_to_use_for_thread(THREAD_ID thread) {
    int best_CPU = -1;
    int best_score = -1;

    for(int potential_CPU = 0; potential_CPU < max_CPU; potential_CPU++) {
        if(potential_CPU_is_allowed_in_CPU_affinity_mask) {
            int score = 100;
            if(thread_was_running_recently) {    // Previously used CPU's caches might still be warm
                if(potential_CPU == previously_used_CPU) score += 20;
                else if(potential_CPU_shares_L1_cache_with_previously_used_CPU) score += 10;
                else if(potential_CPU_shares_L2_cache_with_previously_used_CPU) score += 5;
            }
            if(process_has_a_default_NUMA_domain) {    // Prefer CPUs close to the process's memory
                score -= NUMA_distance_table[potential_CPU_NUMA_domain][default_NUMA_domain];
            }
            if(potential_CPU_temperature == HOT) score -= 25;    // Avoid CPUs that are already hot
            if(potential_CPU_temperature > COLD) {
                if(potential_CPU_state == sleeping) score -= 50; // Avoid waking a sleeping CPU unless it's cold
            }
            score -= potential_CPU_current_load;                 // Prefer CPUs with less current load
            if(best_score < score) {
                best_score = score;
                best_CPU = potential_CPU;
            }
        }
    }
    return best_CPU;
}
Of course you'd want to do some intensive performance tuning on this...
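For example, a caller might look something like this (put_thread_on_ready_queue() and send_IPI_to_CPU() are just hypothetical names for whatever your scheduler uses for its ready queues and reschedule IPIs):
Code: Select all
/* Hypothetical caller - picks a CPU for a thread that just became ready. */
void make_thread_ready(THREAD_ID thread) {
    int cpu = find_best_CPU_to_use_for_thread(thread);
    if(cpu == -1) return;                    // No allowed CPU found
    put_thread_on_ready_queue(cpu, thread);
    send_IPI_to_CPU(cpu);                    // Ask that CPU to reschedule if needed
}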
Cheers,
Brendan