Hi,
nooooooooos wrote:Do I have to optimize my OS especially for HT or is it the same as for Dual-Core?
All optimizations are always optional.
For hyperthreading the same physical CPU is pretending to be 2 logical CPUs. This doesn't mean you end up with 2 logical CPUs that have 50% of the performance of the physical CPU. If CPU #1 is doing HLT waiting for an IRQ, then CPU #2 will be able to use 100% of the physical CPU's resources, while if CPU #1 is doing something then CPU #2 will only get 50% of the physical CPU's resources.
Even if both CPUs are doing work, hyperthreading still improves performance by hiding latencies, improving pipeline usage and reducing instruction dependencies. For example, without hyperthreading if a CPU needs to wait for data to come from RAM then the physical CPU does nothing until the data arrives. With hyperthreading, if CPU #1 is waiting for data to come from RAM then CPU #2 can still execute instructions and the physical CPU does useful work. If one logical CPU is doing integer operations and another logical CPU is doing floating point operations, then both logical CPUs are using different pipelines and won't be competing for pipelines. If one logical CPU needs the result of one instruction before it can do the next instruction, then the CPU can execute instructions from another logical CPU instead of waiting for the first instruction to complete.
There are also costs - mostly related to scalability and re-entrancy locking. More CPUs means more CPUs trying to acquire the same re-entrancy locks and higher lock contention (e.g. more chance of CPUs doing nothing until they can acquire a lock). If the OS has poor scalability, then hyper-threading might make performance worse.
Also, logical CPUs compete equally for the physical CPU's resources. For example, if CPU #1 is running a high priority thread and CPU #2 is running a low priority thread, then you might want to run HLT on CPU #2 instead of the low priority thread to improve the performance of the high priority thread.
For optimization, always use the PAUSE instruction in spinloops - it reduces the CPU resources used by the spinning CPU and increases the CPU resources that can be used by other logical CPUs, and does no harm on CPUs that don't have hyperthreading, including CPUs that don't have the PAUSE instruction (it's the same as a NOP in that case).
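For example, a spinlock's acquire loop might look something like this (a minimal sketch using GCC-style atomics and inline asm - the names are illustrative):

/* Minimal spinlock sketch (GCC/Clang, x86). The "pause" instruction tells the
 * CPU we're in a spin-wait loop, so the other logical CPU in the same core
 * gets more of the shared execution resources. On CPUs without hyperthreading
 * (or without PAUSE) it behaves like a NOP. */
typedef volatile int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) {
        /* Spin read-only until the lock looks free, pausing each iteration */
        while (*lock) {
            __asm__ __volatile__("pause");
        }
    }
}

static inline void spin_unlock(spinlock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}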
For the scheduler, if there's 2 physical CPUs with 2 logical CPUs each (4 logical CPUs total), then it's better for performance to have one logical CPU in each physical CPU idle (with no logical CPUs competing for resources) than to have both logical CPUs in the same physical CPU idle (with logical CPUs in the other physical CPU competing for resources). For power management the reverse can be true - e.g. it might be better to have both logical CPUs in the same physical CPU idle so that the physical CPU isn't doing anything and can be put into a power saving state to save battery power and reduce heat.
There's also cache sharing. If both logical CPUs are using the same address space, then they can share the cache instead of competing for cache (e.g. getting 50% of the cache each). This means that if 2 threads belong to the same process, then running those threads on the same physical CPU can improve cache efficiency.
The scheduler also needs to decide which CPU a thread should run on. This is where things start to get complicated...
For plain SMP, a CPU might still have some data in its cache from last time a thread was run, and you can improve performance by running the thread on the same CPU the next time it gets a time slice. For hyperthreading, you get the same benefit from running the thread on any logical CPU within the physical CPU it ran on last time. For multi-core some caches can also be shared - for example (depending on which CPU), the L2 cache might be shared by all cores in the chip, while the L1 caches are shared by all logical CPUs in the same core. This gives an order of preference - it would be better to run the thread on the same core as last time, but if you can't then it'd be better to run the thread on the same chip as last time.
However, you also need to think about load balancing - it's bad for performance if one CPU has heaps of work to do (because that's where the threads ran last time) while 3 other CPUs are doing nothing.
What I do when I'm deciding which CPU a thread should use for its next time slice is calculate a score for each CPU and select the CPU with the best score. The code that calculates the score could consider things like the chance of the thread's data still being in a CPU's cache (how long ago the thread got CPU time), cache sharing, CPU load, NUMA domain, the priority of the thread, power management policy, CPU temperature, etc.
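As a rough sketch of that idea (all the field names and weights below are made up for illustration - a real scheduler would gather this data itself and tune the weights):

/* Sketch of "calculate a score for each CPU, pick the best". The fields are
 * assumed to already be filled in relative to the thread being scheduled. */
struct cpu_score_info {
    int load;             /* 0..100, current load estimate for this logical CPU */
    int cache_bonus;      /* thread (or a thread of the same process) ran here recently */
    int numa_penalty;     /* thread's memory lives in a different NUMA domain */
    int temperature;      /* degrees, for power/thermal policy */
};

static int score_cpu(const struct cpu_score_info *cpu)
{
    int score = 0;

    score += cpu->cache_bonus;      /* data may still be in this CPU's caches */
    score -= cpu->load;             /* prefer lightly loaded CPUs (load balancing) */
    score -= cpu->numa_penalty;     /* prefer the NUMA domain holding the thread's memory */
    score -= cpu->temperature / 10; /* crude power management / thermal policy */

    return score;
}

/* The scheduler loops over all logical CPUs, calls score_cpu() for each, and
 * gives the thread's next time slice to the CPU with the highest score. */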
nooooooooos wrote:When I don't want to use the shutdown code, is it possible to skip the sending of the Init-IPI?
You always need the Init-IPI (IIRC, if you don't send the Init-IPI the CPU will ignore the Startup-IPI).
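For reference, the usual sequence looks something like this (a rough sketch - the trampoline address, delays and helper functions are assumptions, and the "wait for delivery status" polling and the second Startup-IPI that the MP spec recommends are omitted):

/* Rough sketch of starting an AP via the local APIC's Interrupt Command
 * Register (ICR). Assumes the local APIC is mapped at lapic_base, the AP
 * trampoline is at physical address 0x8000, and delay_us() exists elsewhere. */
#include <stdint.h>

extern volatile uint32_t *lapic_base;   /* memory-mapped local APIC */
extern void delay_us(unsigned us);      /* some timing source (e.g. PIT) */

#define LAPIC_ICR_LOW   (0x300 / 4)
#define LAPIC_ICR_HIGH  (0x310 / 4)

static void start_ap(uint8_t apic_id)
{
    /* Init-IPI - the AP won't accept a Startup-IPI without it */
    lapic_base[LAPIC_ICR_HIGH] = (uint32_t)apic_id << 24;
    lapic_base[LAPIC_ICR_LOW]  = 0x00004500;   /* INIT, level assert */
    delay_us(10000);                           /* ~10 ms */

    /* Startup-IPI (vector 0x08 => AP starts executing at 0x8000) */
    lapic_base[LAPIC_ICR_HIGH] = (uint32_t)apic_id << 24;
    lapic_base[LAPIC_ICR_LOW]  = 0x00004608;   /* STARTUP, vector 0x08 */
    delay_us(200);
}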
nooooooooos wrote:My last question....: Does it make sense to support APIC even when there aren't any SMP or ACPI tables?
For I/O APICs, you can't configure them without MPS or ACPI tables.
For local APICs you can try to enable them and/or manually probe to see if it's present, so it can make sense (if you're careful with your enabling and probing).
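A sketch of that sort of probing (assuming the CPU is new enough to have CPUID and MSRs - older CPUs need extra checks first):

/* CPUID leaf 1, EDX bit 9 indicates an on-chip local APIC, and the
 * IA32_APIC_BASE MSR (0x1B) gives its base address and global enable bit. */
#include <stdint.h>

static int has_local_apic(void)
{
    uint32_t eax, ebx, ecx, edx;

    __asm__ __volatile__("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(1));
    return (edx >> 9) & 1;
}

static uint64_t read_apic_base_msr(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(0x1B));
    return ((uint64_t)hi << 32) | lo;   /* bit 11 = APIC enabled, bits 12+ = base address */
}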
Unless you have several separate kernels or separate HALs (e.g. one for SMP, one for single CPU without APICs, one for single CPU with APICs, etc), the OS would support APICs regardless of whether the APICs are used or not (and regardless of whether there's ACPI and/or MPS tables)...
Cheers,
Brendan