Core i7

LoseThos · Post by **LoseThos** » Tue Dec 30, 2008 12:15 am

Woo-hoo, I'm getting a core i7 system in about 25 days. Does anyone know what changes I'll likely have to make? It has hyperthreading... does that behave just like multiple cores -- you start it up with the APIC? I remember I had some confusion on the logical CPU number. It seemed like it wanted a bit value instead of a number. Well, there are 8 bits, but somehow I don't think I'll be so lucky. Is it clusters of CPU's or something?

The cool thing is my system will have Intel HD audio. That means I have documentation, I'm pretty sure. If I can get that working, I'll go from internal PC speaker straight to top-of-the-line audio! Has anyone done Intel HD audio?

It has an obnoxious amount of RAM -- 12 Gig! Woo-hoo! Can't think of what to use it for.

Brendan · Post by **Brendan** » Tue Dec 30, 2008 1:09 am

Hi,

LoseThos wrote:Woo-hoo, I'm getting a core i7 system in about 25 days. Does anyone know what changes I'll likely have to make?

You shouldn't need to make any changes, but you might want to add some optimizations and support for additional instructions (e.g. CRC32).
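As a rough sketch of what "support for additional instructions" could involve: CRC32 is part of SSE4.2, which CPUID reports in bit 20 of ECX for leaf 1. The helper name below is made up for illustration, and it assumes the caller has already executed CPUID and passes in the ECX value.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports SSE4.2 (which includes the CRC32 instruction)
 * in ECX bit 20. */
#define CPUID_1_ECX_SSE42 (UINT32_C(1) << 20)

/* Hypothetical helper: 'cpuid_1_ecx' is the ECX value returned by CPUID
 * with EAX=1. Executing CPUID itself is left to the caller. */
bool cpu_has_crc32(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_SSE42) != 0;
}
```

A kernel would typically check this once at boot and pick a software CRC routine when the bit is clear.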

LoseThos wrote:It has hyperthreading... does that behave just like multiple cores -- you start it up with the APIC? I remember I had some confusion on the logical CPU number. It seemed like it wanted a bit value instead of a number. Well, there are 8 bits, but somehow I don't think I'll be so lucky. Is it clusters of CPU's or something?

An OS can treat hyper-threading the same as it treats multiple cores (same startup code, same APIC code, etc). However, you need to use the ACPI tables to detect CPUs because logical CPUs aren't listed by the MP Specification tables. Also, logical CPUs don't have the same performance characteristics as separate cores - for example, with hyper-threading 2 logical CPUs share the same core's resources (pipelines, caches, etc), and work done by one logical CPU will take resources away from the other logical CPU in the same core. Because of this you can optimize the scheduler for performance (e.g. make sure one logical CPU in each core is busy before you give work to the second logical CPU in any core), or optimize the scheduler for power management (e.g. make sure both logical CPUs in a core are given work before you give work to an idle core), or optimize for both and let the user/administrator decide; or let the scheduler make dynamic decisions based on a variety of things (load, CPU temperatures, user settings, etc).
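The two scheduler policies described above can be sketched roughly as follows. This is only an illustration under stated assumptions: 2-way hyper-threading, a made-up numbering where logical CPUs 2n and 2n+1 share core n, and hypothetical names (`pick_cpu`, `sched_policy`). A real scheduler would take sibling relationships from its own topology data.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 2   /* assumption: 2-way hyper-threading */

enum sched_policy { POLICY_PERFORMANCE, POLICY_POWER };

/* Pick an idle logical CPU, or return -1 if every logical CPU is busy.
 * busy[] holds one flag per logical CPU; logical CPUs
 * core*THREADS_PER_CORE and core*THREADS_PER_CORE+1 share a core. */
int pick_cpu(const bool busy[], int n_cores, enum sched_policy policy)
{
    int fallback = -1;

    for (int core = 0; core < n_cores; core++) {
        int first = core * THREADS_PER_CORE;
        int n_busy = 0, idle_cpu = -1;

        for (int t = 0; t < THREADS_PER_CORE; t++) {
            if (busy[first + t])
                n_busy++;
            else if (idle_cpu < 0)
                idle_cpu = first + t;
        }
        if (idle_cpu < 0)
            continue;                       /* core is completely busy */

        /* Performance: prefer a core with no busy sibling.
         * Power: prefer a core that is already partly busy. */
        bool preferred = (policy == POLICY_PERFORMANCE) ? (n_busy == 0)
                                                        : (n_busy > 0);
        if (preferred)
            return idle_cpu;
        if (fallback < 0)
            fallback = idle_cpu;            /* remember the first usable CPU */
    }
    return fallback;
}
```

With logical CPU 0 busy on a two-core system, the performance policy returns CPU 2 (the idle core) while the power policy returns CPU 1 (the busy core's sibling). Letting the user choose, as suggested, amounts to making `policy` a runtime setting.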

For APICs, there's a completely new "x2APIC" which completely changes the way the OS deals with the local APIC. Fortunately this has a (default) "backward compatible" mode where it behaves the same as a normal ("xAPIC") local APIC; so unless you've got more than 255 logical CPUs you won't need to worry about it (although there are probably performance advantages involved with x2APIC support, even on "single-chip" computers like yours).
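For anyone who does want to opt in to x2APIC, a minimal sketch of the bits involved, assuming the kernel has its own CPUID and RDMSR/WRMSR wrappers (not shown). CPUID leaf 1 reports x2APIC support in ECX bit 21, and the mode is selected through the IA32_APIC_BASE MSR (0x1B). The function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_1_ECX_X2APIC  (UINT32_C(1) << 21)  /* CPUID leaf 1, ECX bit 21 */
#define IA32_APIC_BASE_MSR  0x1B
#define APIC_BASE_EXTD      (UINT64_C(1) << 10)  /* x2APIC mode enable */
#define APIC_BASE_EN        (UINT64_C(1) << 11)  /* APIC global enable */

/* 'cpuid_1_ecx' is the ECX value returned by CPUID with EAX=1. */
bool cpu_has_x2apic(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_X2APIC) != 0;
}

/* Given the current IA32_APIC_BASE value, return the value to write back
 * to switch the local APIC into x2APIC mode. Reading and writing the MSR
 * is left to the caller. */
uint64_t apic_base_with_x2apic(uint64_t apic_base)
{
    return apic_base | APIC_BASE_EN | APIC_BASE_EXTD;
}
```

If the CPUID bit is clear, or the OS never sets EXTD, the local APIC stays in the backward compatible xAPIC mode described above.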

LoseThos wrote:It has an obnoxious amount of RAM -- 12 Gig! Woo-hoo! Can't think of what to use it for.

File system cache is probably the main thing it'd be used for...

Cheers,

Brendan

LoseThos · Post by **LoseThos** » Tue Dec 30, 2008 5:03 am

What's this MP table you speak of? There's the CPUID instruction which has a processor number, there's the APIC register with an ID in bits 24-31 and there's an ID supposedly reachable with an index register and data register. The IO APIC accessed through the index register is read/write! So, I'm supposed to use the CPUID to set the register accessed through an INDEX register and set the register with bits 24-31? I might have the wrong base address or something, because that r/w register set through the index won't take.

Disk cache! My distribution has a 32 Meg harddrive footprint. My personal install has a 50 Meg footprint. If I get sound going I might have reason for more disk cache. I create RAM drives for CD-ROM burning.

Actually, I could use larger disk cache if I accessed other partitions on my drive, but there's not much reason. I do make videos with screen shots, still 12 Gig is a lot!!

Brendan · Post by **Brendan** » Tue Dec 30, 2008 5:53 am

Hi,

LoseThos wrote:What's this MP table you speak of?

It's a specification written by Intel that was used by operating systems for SMP detection before ACPI was introduced. Download it here if you like...

LoseThos wrote:There's the CPUID instruction which has a processor number, there's the APIC register with an ID in bits 24-31 and there's an ID supposedly reachable with an index register and data register.

The APIC ID from CPUID is the APIC ID that the CPUs negotiated at power-on (before any code is executed - e.g. before any BIOS code is executed). For an OS, it's entirely useless *except* for determining CPU topology (working out which CPUs are in which chip, core, etc). Note: For Core i7 Intel extended CPUID so that it returns a lot more information for CPU topology detection, mostly because Core i7 is NUMA (where previous Intel CPUs mostly weren't).
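A sketch of the topology decoding mentioned above, assuming the extended information comes from CPUID leaf 0x0B: EAX[4:0] of sub-leaf 0 (SMT level) and sub-leaf 1 (core level) give the number of bits to shift the ID right to reach the next level. The struct and function names are made up, and running CPUID is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_topology {
    uint32_t smt_id;      /* logical CPU within its core */
    uint32_t core_id;     /* core within its package */
    uint32_t package_id;  /* physical chip */
};

/* smt_shift and core_shift are the EAX[4:0] values returned by CPUID leaf
 * 0x0B for sub-leaf 0 (SMT level) and sub-leaf 1 (core level). */
struct cpu_topology decode_apic_id(uint32_t apic_id,
                                   unsigned smt_shift, unsigned core_shift)
{
    /* Sanity check so the shifts below stay well-defined. */
    assert(smt_shift <= core_shift && core_shift < 32);

    struct cpu_topology t;
    t.smt_id     = apic_id & ((UINT32_C(1) << smt_shift) - 1);
    t.core_id    = (apic_id >> smt_shift)
                   & ((UINT32_C(1) << (core_shift - smt_shift)) - 1);
    t.package_id = apic_id >> core_shift;
    return t;
}
```

For example, with shifts of 1 and 4, an ID of 0x13 decodes to logical CPU 1 of core 1 in package 1.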

The current APIC ID in the local APIC (the one that's reachable with an index register and data register) is the one that the OS needs to use for starting the CPU, IPIs, IRQs, etc. This register is (sometimes) "read/write", but there's absolutely no sane reason for any OS to ever change it - the only requirement is that the APIC IDs are unique, but if the APIC IDs aren't unique then the OS wouldn't be able to start the CPU anyway. Just use whatever APIC IDs the BIOS/firmware set in the local APIC (which is probably the same as the APIC ID returned by CPUID, but may not be).

Also note that (from the Intel Manual, Section 8.4.6):

Intel wrote:.... However, the ability of software to modify the APIC ID is processor model specific. Because of this, operating system software should avoid writing to the local APIC ID register.
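To make "just use whatever APIC IDs the firmware set" concrete, here is a rough sketch of the values an OS would build when starting another CPU in xAPIC mode with physical destination mode, where the destination is a plain APIC ID number rather than a bit mask. The register offsets and bit positions are the standard xAPIC ones; the function names and the trampoline address parameter are made up, and the actual memory-mapped reads and writes are left out.

```c
#include <stdint.h>

#define APIC_REG_ID        0x020  /* local APIC ID register, ID in bits 24-31 */
#define APIC_REG_ICR_LOW   0x300  /* writing this dword sends the IPI */
#define APIC_REG_ICR_HIGH  0x310  /* destination APIC ID in bits 24-31 */

#define ICR_DELIVERY_INIT     (UINT32_C(5) << 8)
#define ICR_DELIVERY_STARTUP  (UINT32_C(6) << 8)
#define ICR_LEVEL_ASSERT      (UINT32_C(1) << 14)

/* Extract the APIC ID from a value read from the local APIC ID register. */
uint32_t apic_id_from_reg(uint32_t id_reg)
{
    return id_reg >> 24;
}

/* ICR high dword: the target's APIC ID as a number. */
uint32_t icr_high_for(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* ICR low dword for an INIT IPI. */
uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* ICR low dword for a startup IPI. The vector is the 4 KiB page number of
 * the (hypothetical) real-mode trampoline, which must lie below 1 MiB. */
uint32_t icr_low_startup(uint32_t trampoline_phys)
{
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT
           | ((trampoline_phys >> 12) & 0xFF);
}
```

The OS writes the high dword first and then the low dword, with the usual delays between the INIT and startup IPIs. Nothing here ever writes the local APIC ID register.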

LoseThos wrote: Disk cache! My distribution has a 32 Meg harddrive footprint. My personal install has a 50 Meg footprint. If I get sound going I might have reason for more disk cache. I create RAM drives for CD-ROM burning.

Actually, I could use larger disk cache if I accessed other partitions on my drive, but there's not much reason. I do make videos with screen shots, still 12 Gig is a lot!!

I'm currently in a similar position - 8 GiB of RAM, where a few GiB of RAM probably hasn't been used since I replaced Vista with Gentoo (which happened the day I got the computer). I'm too lazy to upgrade computers later on though, and DDR2 was very cheap at the time (and it's still very cheap - about a quarter of the price of DDR3).

For your 12 GiB computer, you'll probably have 6 GiB that's never used, and then in a few years the price of DDR3 will have dropped by a huge amount. It might make sense to just get 6 GiB now, and then upgrade to 48 GiB in a year or 2 (after "supply and demand" has tipped in your favour)...

Cheers,

Brendan

Love4Boobies · Post by **Love4Boobies** » Thu Jan 15, 2009 9:07 am

Brendan wrote:Because of this you can optimize the scheduler for performance (e.g. make sure one logical CPU in each core is busy before you give work to the second logical CPU in any core)

Ah -- but here you are wrong. The way to gain performance (contrary to what was thought back in the single-core days) is not to reach 100% CPU utilization, but to distribute your workload well. I can get back on this if anyone's interested.

quok · Post by **quok** » Thu Jan 15, 2009 10:10 am

Love4Boobies wrote:
Brendan wrote:Because of this you can optimize the scheduler for performance (e.g. make sure one logical CPU in each core is busy before you give work to the second logical CPU in any core)
Ah -- but here you are wrong. The way to gain performance (contrary to what was thought back in the single-core days) is not to reach 100% CPU utilization, but to distribute your workload well. I can get back on this if anyone's interested.

I don't think Brendan meant each core had to be 100% utilized before giving work to another core, but rather that it's better to utilize one logical cpu per core in all available cores before starting to use the additional logical cpus available in those same cores. Since hyperthreading does take a small performance hit over 'real' cpus, for performance it is probably better to utilize all the real cpus before starting to use the other available cpus. It's not quite as good for power consumption, though. I generally find that performance and power consumption are mutually exclusive, but a good compromise can usually be found. Since the Core i7 has enhanced power saving states, reaching that compromise shouldn't be that hard.

But, please, do explain more on what you mean by distributing the workload. One last thing I'd like to point out, though, is that some processors are designed to be most efficient when fully loaded (the UltraSparc T1 and T2, for instance) and have horrible performance when they're running with a light load. Any good scheduler will need to take this into account as well.

Hyperdrive · Post by **Hyperdrive** » Thu Jan 15, 2009 12:10 pm

quok wrote:It's not quite as good for power consumption, though. I generally find that performance and power consumption are mutually exclusive, but a good compromise can usually be found.

That's not completely true. There are papers out there (I currently have none at hand) that show that power consumption of a fully loaded multicore on a single socket is lower than on a comparable multisocket system with singlecores. So for power consumption you'd first try to assign work to all logical cores of a physical core, then to all physical cores in a physical package (socket) and then to all sockets in a system. This will be complicated if you allow the processors to operate at different power levels (not only 0% and 100%). And there will probably be more considerations for NUMA systems (I didn't think about that too much, so I'll leave this out for now).

quok wrote:Since the Core i7 has enhanced power saving states, reaching that compromise shouldn't be that hard.

For Core i7 you have the "Turbo Mode", which is something like an on-demand, hardware-based overclocking feature. I'm pretty sure that has some impact on optimizing power consumption.

quok wrote:But, please, do explain more on what you mean by distributing the workload.

One thing Love4Boobies might mean is you should assign different types of work to logical core siblings. HyperThreading for example allows for major performance boosts - but only if you have, say, a process/thread/<put here your schedulable entity> with mostly integer operations on the first logical core and another with mostly floating point operations on the other logical core. If you'd load both logical cores with the same type of workload you won't have a big speedup.

There are also papers on that. IIRC the following papers have some information on it (but I may be mistaken and they may only cover basic SMT considerations):

Tullsen, Dean M. ; Eggers, Susan J. ; Emer, Joel S. ; Levy, Henry M. ; Lo, Jack L. ; Stamm, Rebecca L.: Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In: ISCA '96: Proceedings of the 23rd annual international symposium on Computer architecture. New York, NY, USA : ACM, 1996, pp. 191-202.
Tullsen, Dean M. ; Eggers, Susan J. ; Levy, Henry M.: Simultaneous multithreading: maximizing on-chip parallelism. In: ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture. New York, NY, USA : ACM, 1995, pp. 392-403.

Regards,
Thilo

Love4Boobies · Post by **Love4Boobies** » Thu Jan 15, 2009 12:23 pm

Hyperdrive wrote:HyperThreading for example allows for major performance boosts - but only if you have, say, a process/thread/<put here your schedulable entity> with mostly integer operations on the first logical core and another with mostly floating point operations on the other logical core. If you'd load both logical cores with the same type of workload you won't have a big speedup.

Also be aware that HT may also decrease performance in some (somewhat rare) cases. Something that few know about is the replay system, which signals the scheduler if an operation sent for execution fails (say due to a cache miss or something) and then kicks in, repeating that instruction over and over until it finally succeeds. When HT is turned on, the replay unit will prevent the other logical CPU from functioning. AFAIK there is also a thing called the replay queue so it will not hog the execution units as much.

All the notes above are true for HT (Intel's implementation of SMT). I am not sure about other implementations.

Love4Boobies · Post by **Love4Boobies** » Thu Jan 15, 2009 12:57 pm

Also, SMT in itself is not entirely safe, since the L1 cache is shared between the two logical CPUs. For instance, a process might do something like this to monitor the cache:

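Code:

mov ecx,buffstart            ; ecx = base of the probe buffer
sub buffsize,2000h
rdtsc
mov esi,eax                  ; esi = previous timestamp (low 32 bits)
xor edi,edi                  ; edi = offset into the buffer
repeat:
prefetcht2 [ecx+edi+2800h]
add cx,[ecx+edi]             ; four loads, 800h bytes apart
imul ecx,1
add cx,[ecx+edi+800h]
imul ecx,1
add cx,[ecx+edi+1000h]
imul ecx,1
add cx,[ecx+edi+1800h]
imul ecx,1
rdtsc
sub eax,esi                  ; eax = cycles since the previous timestamp
mov [ecx+edi],ax             ; record the 16-bit timing in the buffer
add esi,eax                  ; esi = current timestamp
imul ecx,1
add edi,40h                  ; advance by 40h bytes
test edi,7c0h
jnz repeat
sub edi,7feh                 ; rewind 800h, then move 2 bytes further in
test edi,3eh
jnz repeat
add edi,7c0h
sub buffsize,800h
jge repeat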

Brendan · Post by **Brendan** » Fri Jan 16, 2009 6:21 am

Hi,

Love4Boobies wrote:Also, SMT in itself is not entirely safe, since the L1 cache is shared between the two logical CPUs. For instance, a process might do something like this to monitor the cache:

If you set the "TSD" (Time Stamp Disable) bit in CR4, then you get a general protection exception if CPL=3 code tries to use the RDTSC or RDTSCP instructions. This means your kernel can completely prevent access to the time stamp counter (e.g. terminate the process if it tries to use RDTSC or RDTSCP), or maybe virtualize the time stamp counter.
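A small sketch of both options, assuming the kernel has its own routines for reading and writing CR4 and its own general protection fault handler (not shown). TSD is bit 2 of CR4. The coarsening policy in `virtual_tsc` is only one possible way to virtualize the counter and is an assumption for illustration, not something taken from this thread.

```c
#include <stdint.h>

#define CR4_TSD (UINT64_C(1) << 2)   /* Time Stamp Disable */

/* Given the current CR4 value, return the value to load so that RDTSC and
 * RDTSCP fault when executed at CPL=3. Loading CR4 is left to the caller. */
uint64_t cr4_with_tsd(uint64_t cr4)
{
    return cr4 | CR4_TSD;
}

/* Hypothetical policy for the fault handler: hand user code a timestamp
 * with the low 'granularity_shift' bits cleared, so that small timing
 * differences are hidden. granularity_shift must be below 64. */
uint64_t virtual_tsc(uint64_t real_tsc, unsigned granularity_shift)
{
    return real_tsc & ~((UINT64_C(1) << granularity_shift) - 1);
}
```

The handler would decode the faulting instruction, place the virtualized value in EDX:EAX and skip past it, or simply terminate the process as suggested above.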

Cheers,

Brendan

OSDev.org
