Scheduling in multi-core/thread systems
Scheduling in multi-core/thread systems
When working with a system that supports hyper-threading, I am aware there are some optimizations that should be done regarding hyper-threading (i.e., running the same process on both logical processors in a hyper-threading pair) and have implemented them. Do these same optimizations apply to multi-core systems? For example, if I have a system with 4 quad-core processors, should I try to schedule processes on the same physical package so that cache coherency traffic is limited to the physical package, or will this have no effect/a very small effect?
Re: Scheduling in multi-core/thread systems
I'd imagine it would be optimal to schedule related processes on the same physical processor, as that would cut down on bus-lock demand.
Also, the cache would probably be better utilized if Program A and all its threads are using one physical processor.
Re: Scheduling in multi-core/thread systems
The reason you want to handle CPU affinity has to do with shared caches. Hyper-threads share all caches, including the TLB, with their sibling logical CPU. So it is good if they also share processes, because then they will not evict each other's data from any of those caches.
But there are several levels of cache: L1, L2, and maybe L3.
Which of those caches are shared between cores on the same package, and to what extent, depends on whether you are running AMD or Intel chips. However, no matter the manufacturer, at least some of the cache space is indeed shared between cores.
So the answer to your question is yes: if you need to run a single process on multiple cores, there are significant advantages to keeping the process localized to cores on the same physical package.
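If you don't want to hard-code the topology, the scheduler can discover it at boot. As a minimal user-space sketch (assuming GCC or Clang on an Intel CPU; AMD reports similar data via CPUID leaf 0x8000001D instead), CPUID leaf 4 tells you how many logical CPUs share each cache level:

/* Enumerate Intel's deterministic cache parameters (CPUID leaf 4)
 * and print how many logical processors share each cache level.
 * Assumes GCC/Clang (for <cpuid.h>) running on an Intel CPU. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    for (unsigned i = 0; ; i++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, i, &eax, &ebx, &ecx, &edx))
            break;                                /* leaf 4 unsupported */
        unsigned type = eax & 0x1F;               /* 0 = no more caches */
        if (type == 0)
            break;
        unsigned level   = (eax >> 5) & 0x7;
        unsigned sharing = ((eax >> 14) & 0xFFF) + 1;
        printf("L%u %s cache: shared by up to %u logical CPUs\n",
               level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               sharing);
    }
    return 0;
}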
Re: Scheduling in multi-core/thread systems
Hi,
For maximum performance on SMP (multi-chip and multi-core), you want to run tasks on the CPU(s) that may still hold the task's data in their caches. For multi-core, CPUs can share caches. For example, if you're deciding which CPU should run a task and that task was last running on CPU#1, then you'd prefer to run it on CPU#1 again; but if CPU#2 shares the same L2 cache as CPU#1 then it's a good second option, and if CPU#3 and CPU#4 share the same L3 cache as CPU#1 then they're good third options. Of course if 2 tasks share the same data (e.g. different threads that belong to the same process) then you'd take into account the CPUs the other threads are/were running on too.
However, the scheduler must do load balancing too - you don't want one CPU doing everything while the rest sit around bored, just because one CPU might have data in its caches. If the task (or all threads that belong to the same process) was blocked for a while, then it's likely none of its data is still in any CPU's caches, and it's more important to schedule the task on the least loaded CPU (rather than on a CPU that used to have the task's data in its caches).
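Something like this toy, compilable sketch (the topology, weights and decay threshold are all made up for illustration - a real scheduler would get them from the detected cache topology):

/* Prefer a CPU whose caches may still hold the task's data; once the
 * task has been blocked long enough that its data has probably been
 * evicted, fall back to picking the least-loaded CPU. */
#include <stdio.h>

#define NUM_CPUS 4
#define AFFINITY_DECAY 50   /* ticks until we assume the caches are cold */

static int load[NUM_CPUS]     = { 7, 2, 5, 1 };  /* runnable tasks per CPU */
static int l2_group[NUM_CPUS] = { 0, 0, 1, 1 };  /* CPUs 0+1 and 2+3 share an L2 */

struct task { int last_cpu; int ticks_blocked; };

static int pick_cpu(const struct task *t)
{
    int best = 0, best_score = -(1 << 30);
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        int score = -load[cpu];                      /* load-balancing term */
        if (t->ticks_blocked < AFFINITY_DECAY) {     /* cache-affinity term */
            if (cpu == t->last_cpu)
                score += 30;                         /* warm L1 */
            else if (l2_group[cpu] == l2_group[t->last_cpu])
                score += 15;                         /* warm shared L2 */
        }
        if (score > best_score) { best_score = score; best = cpu; }
    }
    return best;
}

int main(void)
{
    struct task hot  = { .last_cpu = 0, .ticks_blocked = 3 };
    struct task cold = { .last_cpu = 0, .ticks_blocked = 900 };
    printf("recently-run task -> CPU %d\n", pick_cpu(&hot));  /* sticks with CPU 0 */
    printf("long-blocked task -> CPU %d\n", pick_cpu(&cold)); /* least loaded: CPU 3 */
    return 0;
}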
For maximum performance with hyper-threading, you still want to run tasks on the CPU(s) that may still hold the task's data in their caches. However, when some logical CPUs are idle you also want to spread the "idleness" around to reduce resource sharing. This means that (for example) if there are 2 single-core chips with 2 logical CPUs each and 3 threads to run, where 2 of those threads belong to the same process, then you want both threads of that process running on the same chip (for cache sharing); but if the other thread blocks, you'd want the 2 remaining threads running on different chips (to reduce resource sharing).
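For the "spread the idleness" part, a purely illustrative rule: prefer a logical CPU whose hyper-thread sibling is idle before doubling up on a busy core:

/* Two single-core chips with 2 logical CPUs each; siblings are 0+1
 * and 2+3. Prefer a logical CPU on a fully idle core, so running
 * tasks don't compete for one core's execution resources. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_CPUS 4
static bool busy[NUM_CPUS] = { true, false, false, false };
static int sibling_of(int cpu) { return cpu ^ 1; }

static int pick_idle_cpu(void)
{
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (!busy[cpu] && !busy[sibling_of(cpu)])
            return cpu;          /* whole core idle: best choice */
    for (int cpu = 0; cpu < NUM_CPUS; cpu++)
        if (!busy[cpu])
            return cpu;          /* fall back to sharing a busy core */
    return -1;                   /* nothing idle */
}

int main(void)
{
    /* CPU 1 is idle, but its sibling (CPU 0) is busy, so pick CPU 2. */
    printf("next task goes to logical CPU %d\n", pick_idle_cpu());
    return 0;
}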
Then there's the "turbo mode" in Intel's Core i7 chips. Systems with 2 (or more?) Core i7 chips are meant to be released on the 24th of March (about a week from now). Optimizing for this is similar to optimizing for hyper-threading - if you can, it's better to have one core in each chip busy (and getting the turbo boost) than to have one chip completely idle while the other chip misses out on the turbo boost. This can get tricky, and I need to do more research on this myself - for example, if there are only 2 tasks to run, is it better to run both tasks on different logical CPUs in the same core and get the turbo boost, or to run them on different cores (to avoid sharing the resources of a core) but miss out on the turbo boost? I'd guess the second option is better for performance, but it's a guess - I honestly don't know...
For NUMA (and not forgetting that all multi-chip computers will be NUMA in the not-too-distant future), you want to allocate memory that's close to the CPU(s) the task/process is running on, and schedule tasks on the CPU(s) that are close to the memory they allocated.
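On the allocation side, that can be as simple as keeping free pages per node and preferring the node the requesting CPU belongs to (the counts and CPU-to-node mapping below are made up for illustration):

/* Satisfy each page allocation from the requesting CPU's own NUMA
 * node, falling back to remote nodes only when the local one is empty. */
#include <stdio.h>

#define NUM_NODES 2
static int free_pages[NUM_NODES] = { 1, 4096 };     /* node 0 nearly empty */
static int node_of_cpu(int cpu) { return cpu / 4; } /* 4 cores per chip/node */

static int alloc_page_node(int cpu)
{
    int local = node_of_cpu(cpu);
    if (free_pages[local] > 0) { free_pages[local]--; return local; }
    for (int n = 0; n < NUM_NODES; n++)             /* remote fallback */
        if (free_pages[n] > 0) { free_pages[n]--; return n; }
    return -1;                                      /* out of memory */
}

int main(void)
{
    printf("CPU 1's 1st page comes from node %d\n", alloc_page_node(1)); /* local node 0 */
    printf("CPU 1's 2nd page comes from node %d\n", alloc_page_node(1)); /* falls back to node 1 */
    return 0;
}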
If you do all of the above, then for something like a multi-chip Core i7 system (which is very interesting, because it's NUMA with multi-core and hyper-threading) you can probably get up to 100% more performance than a simple scheduler would in some situations.
Optimizing for performance is complicated enough, but...
For modern computers, optimizing for performance isn't always what you want to do - often you want to optimize for heat instead.
For servers you want to minimize heat (to save on running costs, including air-conditioning). For offices you want to minimize noise (e.g. keep CPU fan speeds low). For high-performance gaming machines you want to avoid thermal throttling (although avoiding thermal throttling is a good idea for any computer). For laptops you want to maximize battery life (and this applies to a desktop/server running from a UPS during a power failure too).
If you're optimizing for heat, then most of the things you'd do to optimize for performance are a bad idea. For example (for hyper-threading), if there are 2 single-core chips with 2 logical CPUs each and only 2 tasks to run, then you want one chip running both tasks so you can put the other chip into a low-power state.
Now, imagine an OS that has 2 sliders: one that goes from "minimum power consumption" to "maximum performance", and another that goes from "minimum noise" to "maximum performance", where the user (and/or system administrator) can use these sliders to influence the scheduler. Where would you set them for your computer(s)? None of my computers would ever be set to "maximum performance, maximum performance" - that'd be like asking for "hot, noisy and annoying"...
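Purely as an illustration of how one such slider could bias the scheduler, a 0..100 "performance" knob could trade packing tasks onto an already-awake chip (so the other chip can sleep) against spreading them across chips (so nothing shares a core's resources):

/* One knob: 0 = minimum power consumption, 100 = maximum performance.
 * Low settings pack tasks onto a chip that's already awake; high
 * settings prefer a completely idle chip. All numbers are made up. */
#include <stdio.h>

#define NUM_CHIPS 2
static int busy_on_chip[NUM_CHIPS] = { 1, 0 };

static int pick_chip(int perf_slider /* 0..100 */)
{
    int best = 0, best_score = -1;
    for (int chip = 0; chip < NUM_CHIPS; chip++) {
        int pack   = (busy_on_chip[chip] > 0) ? 100 : 0;  /* chip already awake */
        int spread = (busy_on_chip[chip] == 0) ? 100 : 0; /* no resource sharing */
        int score  = (100 - perf_slider) * pack + perf_slider * spread;
        if (score > best_score) { best_score = score; best = chip; }
    }
    return best;
}

int main(void)
{
    printf("slider at 10 (save power)  -> chip %d\n", pick_chip(10)); /* packs onto chip 0 */
    printf("slider at 90 (performance) -> chip %d\n", pick_chip(90)); /* wakes chip 1 */
    return 0;
}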
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.