
CPU Performance

Posted: Thu Feb 05, 2015 5:34 am
by johnsa
Hey,

It's been a very long time since I was last here or experimenting with OS code (about 6 years, I think!). In any event, out of interest I pulled out my old code: it compiled, I put it on a USB stick, it booted.. amazing :)
I really just wanted to try something out, which was to profile some blocks of code and compare my OS/system startup against the same code in Windows.

Under Windows my setup is:
Core i7 quadcore (Sandy Bridge)
Windows 7 Ultimate x64
Test code assembled using JWASM

My OS (If you can really call it that):
Code assembled with FASM
Long Mode, basic identity mapped mem, no ints enabled etc.

My theory was that, given there should be no interrupts, no context switches, and less cache thrashing or TLB misses, the code blocks should run at least as fast as under Windows, possibly (hopefully) a bit faster (not sure if anyone has tried this sort of benchmark before?).
However, this is not the case: my first test code takes 33 seconds to execute under Windows and 47 seconds under my long mode startup.

So I have a couple of suspicions for what might cause this, listed in order of likelihood:

1) I am enabling the caches (first during boot with BIOS calls to int 15h), then on the long mode switch with:

Code:

	;------------------------------------------------------------------------------
	; Clear CD and NW flags in CR0 to enable CPU caches.
	;------------------------------------------------------------------------------
	mov rax,cr0
	and eax,9fffffffh						; clear CD (bit 30) and NW (bit 29); writing eax also zeroes rax[63:32]
	mov cr0,rax
Could it be that more is required to fully enable all the caches, perhaps L3 after SMP startup?
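
As a sanity check on the mask used above, here is a minimal Python sketch (the helper name `enable_caches` is made up for illustration) showing that 9fffffffh is exactly "all bits except CD (bit 30) and NW (bit 29)", and what it does to a sample CR0 value. It only models the bit arithmetic, nothing about the hardware side effects.

```python
CR0_NW = 1 << 29  # Not Write-through
CR0_CD = 1 << 30  # Cache Disable


def enable_caches(cr0: int) -> int:
    """Return CR0 with CD and NW cleared, truncated to 32 bits like `and eax, imm32`."""
    return cr0 & ~(CR0_CD | CR0_NW) & 0xFFFFFFFF


if __name__ == "__main__":
    # The mask from the assembly snippet, rebuilt from the two flag bits.
    print(hex(0xFFFFFFFF & ~(CR0_CD | CR0_NW)))
    # Sample CR0 with PG, CD, NW, ET and PE set (hypothetical value).
    print(hex(enable_caches(0xE0000011)))
```

If both bits are already clear on entry (firmware normally leaves them that way), the write to CR0 changes nothing, which would fit with caching not being the culprit.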

2) Under Windows I run in high performance power mode, perhaps on boot via ACPI the CPU is not running at maximum?
(If for example in Windows I switch to balanced power plan the same code takes 1 minute to run).

3) The memory address I'm using in the test under Windows is obtained via LEA (just a small array in the code), so its physical address is probably somewhere higher up in memory than the test address of 0x90000 I'm using in the OS code.
Is it possible that this area (being under 640K, not that that really applies anymore) might be non-cacheable or have different attributes?

4) A problem with attributes set on the paging structures

5) Windows does some secret magic to get more performance out of the machine (highly doubt this as my below figures indicate that Windows is running the test at about the optimal level)

For reference, here is the test block, which in theory should be well balanced, as the CPU and memory bandwidth max out at about the same point.
Assuming a 64-bit bus, 8-byte reads and a 3 GHz CPU, it should achieve 24 GB/s (which is very close to the theoretical limit of my RAM).
Given my timing of 47 seconds, that equates to a read throughput of about 17 GB/s.
Windows achieves exactly that: 24 GB/s.

Code:


	mov rcx,100000000
outerloop:	
	mov rdi,90000h
	mov rdx,1000
@@:
	mov rax,[rdi]
	add rdi,8
	dec rdx
	jnz short @B
	dec rcx
	jnz short outerloop
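
The throughput figures quoted for this loop can be re-derived from its iteration counts. A rough Python sketch, assuming decimal gigabytes (1e9 bytes) and the 33 s / 47 s timings given above; the function names are just for illustration:

```python
OUTER_ITERATIONS = 100_000_000  # rcx in the test block
INNER_ITERATIONS = 1_000        # rdx in the test block
BYTES_PER_READ = 8              # one 64-bit load per inner iteration


def total_bytes_read() -> int:
    """Total number of bytes loaded by the whole test."""
    return OUTER_ITERATIONS * INNER_ITERATIONS * BYTES_PER_READ


def throughput_gb_per_s(seconds: float) -> float:
    """Read throughput in decimal gigabytes per second."""
    return total_bytes_read() / seconds / 1e9


if __name__ == "__main__":
    for label, seconds in (("Windows", 33.0), ("long mode", 47.0)):
        print(f"{label}: {throughput_gb_per_s(seconds):.1f} GB/s")
```

The loop reads 800 GB in total, so these are gigabytes (not gigabits) per second.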

Yes the test is contrived and not real world, but the fact that I'm under-performing makes it a useful measure to ensure that I've got everything setup correctly.
Anyone have any thoughts or suggestions as to where to look?

Re: CPU Performance

Posted: Thu Feb 05, 2015 7:50 am
by Brendan
Hi,

The code mostly just reads the same 8000 byte area repeatedly. The first time through you'd get a bunch of cache misses and TLB misses, but 8000 bytes fits in the CPU's L1 data cache so after that RAM speed should be irrelevant.

If caching wasn't happening (e.g. caches disabled in the page table), then RAM is about 16 times slower than L1 data cache and the speed difference would be much larger (like, without caches it'd probably take 10 times as long as it does on Windows). Therefore I'd assume caches are working correctly. Note: as far as I can tell, 8000 bytes accessed like that is also immune to cache aliasing conflicts.
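
To illustrate the aliasing point, here is a small Python sketch that counts how many cache lines of the 8000-byte buffer land in each cache set. The cache geometry (32 KiB, 8-way, 64-byte lines) is an assumption about a typical L1 data cache of that generation, not something stated in the thread, and the function names are hypothetical.

```python
from collections import Counter

# Assumed L1 data cache geometry (not taken from the thread).
LINE_SIZE = 64
WAYS = 8
CACHE_SIZE = 32 * 1024
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)  # 64 sets


def lines_per_set(base: int, length: int) -> Counter:
    """Count how many distinct cache lines of [base, base+length) map to each set."""
    first_line = base // LINE_SIZE
    last_line = (base + length - 1) // LINE_SIZE
    return Counter(line % NUM_SETS for line in range(first_line, last_line + 1))


def fits_without_conflicts(base: int, length: int) -> bool:
    """True if no set needs more lines than the cache has ways."""
    return max(lines_per_set(base, length).values()) <= WAYS


if __name__ == "__main__":
    usage = lines_per_set(0x90000, 8000)
    print("cache lines touched:", sum(usage.values()))
    print("worst-case lines in one set:", max(usage.values()))
    print("fits without conflicts:", fits_without_conflicts(0x90000, 8000))
```

Under those assumptions no set holds more than two of the buffer's lines, well under the eight ways available, so after the first pass every load should hit L1.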

That only leaves one thing I'd consider likely: power management. Maybe the BIOS leaves the CPU in its "P1" state and Windows' high performance power mode shifts to the CPU's "P0" state, maybe it's something else (different turbo boost and/or thermal throttling thresholds or something).


Cheers,

Brendan

Re: CPU Performance

Posted: Thu Feb 05, 2015 3:28 pm
by palk
You need to set the MTRRs up; the default is uncacheable, so your original suspicion is correct: you're not running out of the cache. Read volume 3, chapter 11 (specifically the sections on caching and the MTRRs)

Re: CPU Performance

Posted: Thu Feb 05, 2015 4:16 pm
by Brendan
Hi,
palk wrote:You need to set the MTRRs up; the default is uncacheable, so your original suspicion is correct: you're not running out of the cache. Read volume 3, chapter 11 (specifically the sections on caching and the MTRRs)
The default at power on (before the firmware initialises MTRRs) doesn't apply for OS code (that runs after the firmware initialises MTRRs).


Cheers,

Brendan

Re: CPU Performance

Posted: Thu Feb 05, 2015 4:42 pm
by johnsa
I've found a few MSRs which I thought I'd play with to see if they helped. So far some of the settings just cause a crash, possibly because I'm not yet checking CPUID to see whether all the bits are valid. But FYI:

Code:

MSR_IA32_PERF_STATUS equ 198h ; Current Performance State R/O. bits 15-0

MSR_IA32_PERF_CTL    equ 199h ; R/W 
;15:0 Target performance State Value
;31:16 Reserved
;32 IDA Engage. (R/W) When set to 1: disengages IDA 06_0FH (Mobile)
;63:33 Reserved


MSR_IA32_MISC_ENABLE equ 1a0h ; Enable Misc CPU Features R/W.
;0 Fast-Strings Enable. When set, the fast-strings feature (for REP MOVS and REP STORS) is enabled (default); when clear, faststrings are disabled.
;2:1 Reserved.
;3 Automatic Thermal Control Circuit Enable. (R/W) 1 = Setting this bit enables
;the thermal control circuit (TCC) portion of the Intel Thermal Monitor feature. This allows the processor to automatically reduce power consumption in response to TCC activation.. 0 = Disabled (default).
;6:4 Reserved
;7 Performance Monitoring Available. (R) 1 = Performance monitoring enabled 0 = Performance monitoring disabled
;9 Hardware Prefetcher Disable. (R/W)
;When set, disables the hardware prefetcher operation on streams of data. When clear (default), enables the prefetch queue.
;Disabling of the hardware prefetcher may impact processor performance.
;10 Shared FERR# Multiplexing Enable. (R/W) 1 = FERR# asserted by the processor to indicate a pending break event within the processor
; 0 = Indicates compatible FERR# signaling behavior This bit must be set to 1 to support XAPIC interrupt model usage.
;11 Branch Trace Storage Unavailable. (RO) 1 = Processor doesn’t support branch trace storage (BTS) 0 = BTS is supported
;12 Precise Event Based Sampling (PEBS) Unavailable. (RO) 1 = PEBS is not supported; 0 = PEBS is supported.
;13 Shared TM2 Enable. (R/W)
;When this bit is set (1) and the thermal sensor indicates that the die temperature is at the pre-determined threshold, the Thermal Monitor 2 mechanism is engaged. TM2 will
;reduce the bus to core ratio and voltage according to the value last written to MSR_THERM2_CTL bits 15:0. 
;When this bit is clear (0, default), the processor does not change the VID signals or the bus to core ratio when the processor enters a thermally managed state. The BIOS must enable this feature if the TM2 feature flag (CPUID.1:ECX[8]) is set; if the TM2
;feature flag is not set, this feature is not supported and BIOS must not alter the contents of the TM2 bit location. The processor is operating out of specification if both this bit and the TM1 bit are set to 0.
;15:14 Reserved
;16 Enhanced Intel SpeedStep Technology Enable. (R/W) 0= Enhanced Intel SpeedStep Technology disabled 1 = Enhanced Intel SpeedStep Technology enabled
;17 Reserved
;18 ENABLE MONITOR FSM. (R/W) When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0). This indicates that MONITOR/MWAIT are not supported.
;Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0. When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).
;If the SSE3 feature flag ECX[0] is not set (CPUID.01H:ECX[bit 0] = 0), the OS must not attempt to alter this bit. BIOS must leave it in the default state. Writing this bit when the
;SSE3 feature flag is set to 0 may generate a #GP exception.
;19 Shared Adjacent Cache Line Prefetch Disable. (R/W)
;When set to 1, the processor fetches the cache line that contains data currently required by the processor. When set to 0, the processor fetches cache lines that comprise a
;cache line pair (128 bytes). Single processor platforms should not set this bit. Server platforms should set or clear this bit based on platform performance observed in validation and testing.
;BIOS may contain a setup option that controls the setting of this bit.
;20 Shared Enhanced Intel SpeedStep Technology Select Lock. (R/WO)
;When set, this bit causes the following bits to become read-only:
;• Enhanced Intel SpeedStep Technology Select Lock (this bit),
;• Enhanced Intel SpeedStep Technology Enable bit. The bit must be set before an Enhanced Intel SpeedStep Technology transition is requested. This bit is cleared on reset.
;21 reserved.
;22 Limit CPUID Maxval. (R/W) When this bit is set to 1, CPUID.00H returns a maximum value in EAX[7:0] of 3. BIOS should contain a setup
;question that allows users to specify when the installed OS does not support CPUID functions greater than 3. Before setting this bit, BIOS
;must execute the CPUID.0H and examine the maximum value returned in EAX[7:0]. If the maximum value is greater than 3, the bit is supported.
;Otherwise, the bit is not supported. Writing to this bit when the maximum value is greater than 3 may generate a #GP exception. Setting this bit may cause unexpected behavior in
;software that depends on the availability of CPUID leaves greater than 3.
;23 xTPR Message Disable. (R/W) Valid if CPUID.01H:ECX[14] = 1. When set to 1, xTPR messages are disabled. xTPR messages are optional messages that allow the processor to inform the
;chipset of its priority.
;33:24 Reserved
;34 XD Bit Disable. (R/W) Valid if CPUID.80000001H:EDX[20] = 1. When set to 1, the Execute Disable Bit feature (XD Bit) is disabled and the XD Bit extended feature flag will be
;clear (CPUID.80000001H:EDX[20]=0). When set to 0 (default), the Execute Disable Bit
;feature (if available) allows the OS to enable PAE paging and take advantage of data only pages. BIOS must not alter the
;contents of this bit location if XD bit is not supported. Writing this bit to 1 when the XD Bit extended feature flag is set to 0 may generate a #GP exception.
;37 Unique DCU Prefetcher Disable. (R/W)
;When set to 1, The DCU L1 data cache prefetcher is disabled. The default value after reset is 0. BIOS may write ‘1’ to disable this feature. The DCU prefetcher is an L1 data cache
;prefetcher. When the DCU prefetcher detects multiple loads from the same line done within a time limit, the DCU prefetcher assumes the next line will be required. The next line is prefetched in to the L1 data cache from memory or L2.
;38 Shared IDA Disable. (R/W)
;When set to 1 on processors that support IDA, the Intel Dynamic Acceleration feature (IDA) is disabled and the IDA_Enable feature flag will be clear (CPUID.06H: EAX[1]=0). When set to a 0 on processors that support
;IDA, CPUID.06H: EAX[1] reports the processor’s support of IDA is enabled. Note: the power-on default value is used by BIOS to detect hardware support of IDA. If
;power-on default value is 1, IDA is available in the processor. If power-on default value is 0, IDA is not available.
;39 Unique IP Prefetcher Disable. (R/W)
;When set to 1, The IP prefetcher is disabled. The default value after reset is 0. BIOS may write ‘1’ to disable this feature.
;The IP prefetcher is an L1 data cache prefetcher. The IP prefetcher looks for sequential load history to determine whether to prefetch the next expected data into the L1 cache from memory or L2.
;63:40 Reserved.

MSR_IA32_ENERGY_PERF_BIAS equ 1b0h 
;Performance Energy Bias Hint (R/W)
;3:0 Power Policy Preference: 0 indicates preference to highest performance. 15 indicates preference to maximize energy saving.
; 63:4 Reserved

Unfortunately setting the ones which don't cause issues for me hasn't had any effect on the performance.
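
Since writing unsupported MSR bits can fault, one way to narrow things down is to test the relevant CPUID feature bits first. A minimal Python sketch of the bit tests only: the register values would have to come from executing CPUID in the OS itself, and the helper names are made up for illustration. The bit positions used are CPUID.01H:ECX[7] (Enhanced SpeedStep), CPUID.06H:EAX[1] (IDA / Turbo Boost available) and CPUID.06H:ECX[3] (IA32_ENERGY_PERF_BIAS supported).

```python
def has_eist(cpuid_01_ecx: int) -> bool:
    """CPUID.01H:ECX[7]: Enhanced Intel SpeedStep Technology."""
    return bool(cpuid_01_ecx & (1 << 7))


def has_turbo(cpuid_06_eax: int) -> bool:
    """CPUID.06H:EAX[1]: Intel Dynamic Acceleration / Turbo Boost available."""
    return bool(cpuid_06_eax & (1 << 1))


def has_energy_perf_bias(cpuid_06_ecx: int) -> bool:
    """CPUID.06H:ECX[3]: IA32_ENERGY_PERF_BIAS MSR (1B0h) is supported."""
    return bool(cpuid_06_ecx & (1 << 3))


if __name__ == "__main__":
    # Hypothetical register values, for illustration only.
    print(has_eist(0x80), has_turbo(0x02), has_energy_perf_bias(0x08))
```

These checks do not cover every bit in the listing above; several of those bits are model-specific and have no CPUID flag at all.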

Code:

	mov ecx,MSR_IA32_ENERGY_PERF_BIAS
	rdmsr
	xor edx,edx
	xor eax,eax		   						; Specify CPU BIAS Hint to Maximum Performance.
	wrmsr

	mov ecx,MSR_IA32_MISC_ENABLE			; Make for happy CPU Time
	rdmsr

;THIS crashes with bits 16 and 19 set. A value of 10000h (just SpeedStep) works. I'm not sure why the cache-line pair prefetch bit would be unsupported on my CPU, and I don't see any CPUID feature bits to test for it either.
	or eax,90000h							; set bits 16 and 19 (SpeedStep enable; per the listing above, setting bit 19 disables the cache-line pair prefetch).

;THESE cause a crash..
;	and eax,0fffffdffh
;	and edx,0ffffff1fh  					; clear bit 9,37,38,39 ;(enable hw prefetcher, enable dcu prefetcher, enable Intel Dynamic Acceleration, enable unique ip prefetch)
	wrmsr

	mov ecx,MSR_IA32_PERF_STATUS
	rdmsr
	xor edx,edx
	mov ecx,MSR_IA32_PERF_CTL				;Ensure IDA/Turbo boost is enabled (clear bit 32).
	wrmsr
With regards to the MTRRs, I would assume that by the time my code is running the firmware has already set up the memory types for physical memory correctly (so the range I'm working with would be WB). I guess I could verify this from the MSRs for the fixed ranges. What this does make me think I should double check as well is the PAT and the page table/directory entries, to ensure they're mapping the correct type too. The most conservative type will always be used, so if these entries happen to be UC and the MTRRs for the physical memory are set up as WC or WB, UC will win.

Failing this.. I believe my only other option is to check/get the processor for P1 to P0 performance state.. so far i've not found any info on how to do this and I would assume it's going to require a lot of ACPI filth unless someone has a short-cut :)

Re: CPU Performance

Posted: Thu Feb 05, 2015 5:15 pm
by johnsa
Volume 3A, page 140: the PWT and PCD flags in each paging-structure entry (starting from CR3 and moving down the page table hierarchy) determine which PAT entry to use; each PAT entry has a memory type (WC, WB, UC, etc.).

Then page 574 defines the PAT.
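
A rough Python sketch of that lookup for a 4 KiB page table entry, assuming the PAT MSR (IA32_PAT, 277h) still holds its power-on default value. The PAT index is built from the PAT (bit 7), PCD (bit 4) and PWT (bit 3) flags of the entry; for 2 MiB and 1 GiB pages the PAT flag sits at bit 12 instead, which this sketch does not handle. The function names are hypothetical, and the result is only the page-level type: the effective type also depends on the MTRRs.

```python
# Memory type encodings used in IA32_PAT entries.
PAT_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB", 7: "UC-"}

# Power-on default of IA32_PAT: PA0=WB, PA1=WT, PA2=UC-, PA3=UC, then repeated.
DEFAULT_PAT = 0x0007040600070406


def pat_index(pte: int) -> int:
    """PAT entry index selected by a 4 KiB page table entry."""
    pwt = (pte >> 3) & 1
    pcd = (pte >> 4) & 1
    pat = (pte >> 7) & 1
    return (pat << 2) | (pcd << 1) | pwt


def pat_memory_type(pte: int, pat_msr: int = DEFAULT_PAT) -> str:
    """Memory type named by the PAT entry that this PTE selects."""
    entry = (pat_msr >> (8 * pat_index(pte))) & 0x7
    return PAT_TYPES.get(entry, "reserved")


if __name__ == "__main__":
    for pte in (0x03, 0x0B, 0x13, 0x1B):  # present + writable, varying PWT/PCD
        print(hex(pte), pat_memory_type(pte))
```

So an identity map built with only the present and writable bits set selects PA0, which is WB unless the PAT has been reprogrammed.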

Re: CPU Performance

Posted: Thu Feb 05, 2015 5:39 pm
by Brendan
Hi,
johnsa wrote:Failing this.. I believe my only other option is to check/get the processor for P1 to P0 performance state.. so far i've not found any info on how to do this and I would assume it's going to require a lot of ACPI filth unless someone has a short-cut :)
Sadly, Intel don't document it (and it's different for different CPU models).

The shortcut is to find tools to extract and disassemble the firmware's AML (e.g. Intel's AML assembler/disassembler from the ACPICA project) and determine the mystic incantations needed for your CPU from the disassembly.


Cheers,

Brendan

Re: CPU Performance

Posted: Thu Feb 05, 2015 6:03 pm
by johnsa
I wouldn't even know where to begin with that... a lot of reading up, and I'd prefer to be able to do p1->p0 for at least a bunch of CPUs not just my own .. full ACPI layer sounds more and more like it's required.. as horrible as it is.

Re: CPU Performance

Posted: Thu Feb 05, 2015 7:47 pm
by Brendan
Hi,
johnsa wrote:I wouldn't even know where to begin with that... a lot of reading up, and I'd prefer to be able to do p1->p0 for at least a bunch of CPUs not just my own .. full ACPI layer sounds more and more like it's required.. as horrible as it is.
There is one other alternative: ignore it. For most software P0 is just a waste of power anyway; which is why "balanced" is Windows' default setting (and probably why P1 is the firmware's default setting too).

The main problem is that your benchmark is closer to a pathological case than it is to a realistic workload (a tight loop, no cache or TLB misses, no IRQs, no task switching, few branch mispredictions, no IO of any kind, no load on the other CPUs, etc). That's why you get a much more noticeable difference.

Of course I'd still be tempted to figure out if power management definitely is the cause of the difference, just to satisfy curiosity (and double check that something else isn't causing it).


Cheers,

Brendan

Re: CPU Performance

Posted: Fri Feb 06, 2015 3:33 am
by johnsa
Agreed 100%
While it's not a real world test, the entire exercise of having your own bare metal long mode OS/system is null and void if it's going to run 30% slower than under Windows without any real workload, IRQs, drivers or other devices going on.

My objective never was, nor would be, to build a full OS anyway; it would be far too ambitious given the mess that is hardware these days. What I would like to achieve is a 64-bit single tasking host (with other cores enabled so that I can use them in a non multi-tasking sort of way, very much what I do under Windows with my own per core scheduler for fine grained concurrent task execution) with support for HD Audio, USB HID + mass storage, PCI, SATA/AHCI, one network adapter and one video card, probably one of the new Intel HD Graphics chips.. basically pretending my PC has a fixed architecture.. har har wouldn't that be nice.

Oddly, even if I run the test under Windows set to Balanced mode it still takes 33 odd seconds instead of 47, so either Windows is pushing the CPU to p0 even in balanced mode or that is not the cause.

For comparison under Windows with some other settings:
Balanced mode with max cpu performance set to 90%: 55 sec
Balanced mode with max cpu performance set to 95%: 48 sec (almost identical at this point), if there was a way to check the current power state after boot and find it to default to 95% that would explain the difference.

Re: CPU Performance

Posted: Fri Feb 06, 2015 3:58 am
by johnsa
I believe the default setting for my CPU is to run at 2.3 GHz;
when Turbo Boost kicks in it goes up to 3.2 GHz.
2.3 / 3.2 = 0.72
0.72 * 47 sec = 34 sec.. coincidence? (Turbo Boost takes about 1-2 seconds to fully ramp up as well, which could account for variable measurements of between 34 and 36 seconds under Windows.)
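
The same estimate as a tiny Python sketch, assuming run time scales inversely with core clock (reasonable for a loop that stays in L1 cache) and using the 2.3 GHz and 3.2 GHz figures quoted above; the function name is made up:

```python
def scaled_runtime(seconds: float, from_ghz: float, to_ghz: float) -> float:
    """Predicted run time after a clock change, assuming time is proportional to 1/frequency."""
    return seconds * from_ghz / to_ghz


if __name__ == "__main__":
    print(f"{scaled_runtime(47.0, 2.3, 3.2):.1f} s")
```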

Re: CPU Performance

Posted: Fri Feb 06, 2015 5:33 am
by johnsa
Ok.. some success

Firstly.. I found this (which apparently isn't supposed to work under Win7 .. but works fine):
http://sourceforge.net/projects/perfins ... p_redirect

it has an msr command line tool to read/write msr values under windows x64..
So I checked the value of IA32_PERF_CTL (0x199) and found it to be set to 0x00000000:0x00002100
The top part EDX = 0 indicates that speed step or Turbo boost is active, the bottom dword in EAX I can't find out what the value is supposed to be. The documentation from Intel says this is the desired target EIST performance state in bits 0:15, but doesn't give a range of values.

Intel Turbo Boost in theory is controlled by this same interface/MSR.
So plugging that into my code and running on real h/w.. I now run in 33 seconds as opposed to 47 .. which is 1-3 seconds faster than under Windows, which I'd expect as I have no IRQS, devices or context switches.

Code:

	mov ecx,MSR_IA32_ENERGY_PERF_BIAS
	rdmsr
	xor edx,edx
	xor eax,eax		   						; Specify CPU BIAS Hint to Maximum Performance.
	wrmsr

	mov ecx,MSR_IA32_MISC_ENABLE			; Make for happy CPU Time
	rdmsr
	or eax,10000h							; Set bit 16, Enable Speed Step.
	wrmsr

	xor edx,edx								; bit 32 clear: IDA/Turbo Boost stays engaged
	mov eax,2100h							; target performance state, as read from IA32_PERF_CTL under Windows
	mov ecx,MSR_IA32_PERF_CTL
	wrmsr
Now to find a meaning of the 0x00002100...

In addition, a new feature in the latest chips is HWP, which means the CPU can in theory manage all this jazz autonomously, without ACPI and convoluted OS involvement. All the OS has to do in this case is supply hints, select the desired mode and possibly supply workload estimates. I only found this by updating my volume 3 Intel manual to the latest Jan 2015 edition.

[UPDATE]:

Found this:
http://download.intel.com/design/networ ... 117401.pdf

which says:

The 16-bit encoding defining valid operating points is model-specific and Intel proprietary. See your Intel
representative to obtain documentation outlining the required encoding.

199H (409 decimal), IA32_PERF_CTL: bits 15:0 indicate the target frequency and voltage operating point.

In theory however from ACPI tables we should be able to find:

ACPI 2.0 object table usage:
_PCT: Identifies location of I/O mapped MSRs for status and control
_PSS: Lists the possible processor frequency and voltage operating states (these should map to the above?)
_PPC: Reflects the capabilities of the platform

Further:

http://sourceforge.net/p/freedos/mailma ... /31894268/

So the encoding is 8 bits for the frequency multiplier and 8 bits for the voltage,
so 0x2100 would be no change to the voltage and a base clock multiplier of 0x21 (33)?
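
Going by that guess, the value can be split up as in the Python sketch below. Everything here is an assumption: the encoding is model-specific (as the Intel document quoted above says), the field layout is just the one proposed in this post, and the 100 MHz base clock used to turn the ratio into a frequency is a typical figure rather than something read from the machine. The names are hypothetical.

```python
ASSUMED_BCLK_MHZ = 100  # assumption: typical base clock, not measured


def decode_perf_ctl(value: int) -> dict:
    """Split an IA32_PERF_CTL value using the layout guessed in the thread."""
    return {
        "ratio": (value >> 8) & 0xFF,                 # bits 15:8, guessed multiplier
        "low_byte": value & 0xFF,                     # bits 7:0, guessed voltage field
        "turbo_disengaged": bool((value >> 32) & 1),  # bit 32, IDA/Turbo disengage
    }


def target_mhz(value: int, bclk_mhz: int = ASSUMED_BCLK_MHZ) -> int:
    """Target core frequency implied by the guessed multiplier field."""
    return decode_perf_ctl(value)["ratio"] * bclk_mhz


if __name__ == "__main__":
    print(decode_perf_ctl(0x2100), target_mhz(0x2100), "MHz")
```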

Re: CPU Performance

Posted: Fri Feb 06, 2015 6:09 am
by johnsa
CPU-Z usefully shows my multiplier, which accurately reflects the MSR findings and performance.
It lists the range as 12-33. I still have no idea what the VID (voltage) options would be, but so far I've not seen Windows change that at all, so perhaps it can be safely ignored.
Obviously the OS is actively controlling these values, I would assume by probing the MSRs and thermal monitors and applying load algorithms, feeding this in via either the chipset driver or ACPI, so it's a lot more complex to manage than my example of setting the max value (specific to my CPU).

Re: CPU Performance

Posted: Fri Feb 06, 2015 7:02 am
by johnsa
FYI:

Code:


MSR_THERMAL_TARGET equ 1a2h ; Only in Sandy Bridge+
;15:0 Reserved.
;23:16 Temperature Target (R)
;The minimum temperature at which PROCHOT# will be asserted.
;The value is degree C.
;63:24 Reserved.

By reading this MSR and then MSR_IA32_THERM_STATUS (19ch), and subtracting the latter's digital readout from the TjMax value above, you have the core temperature (requires Sandy Bridge or later).
I would look to implement this assuming HWP isn't present. That, combined with the TCC and thermal interrupt warnings, would indicate when I should scale back the multiplier.
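
A minimal Python sketch of that calculation, working on raw 64-bit MSR values that would have been read with rdmsr. It assumes the usual IA32_THERM_STATUS layout (digital readout in bits 22:16, reading-valid flag in bit 31) together with the temperature target field in bits 23:16 listed above; the function names and sample values are made up.

```python
from typing import Optional


def tj_max(thermal_target: int) -> int:
    """Temperature target (TjMax) in degrees C, bits 23:16 of MSR_THERMAL_TARGET."""
    return (thermal_target >> 16) & 0xFF


def digital_readout(therm_status: int) -> int:
    """Degrees C below TjMax, bits 22:16 of IA32_THERM_STATUS."""
    return (therm_status >> 16) & 0x7F


def core_temp_c(thermal_target: int, therm_status: int) -> Optional[int]:
    """Core temperature in degrees C, or None if the reading-valid bit (31) is clear."""
    if not (therm_status >> 31) & 1:
        return None
    return tj_max(thermal_target) - digital_readout(therm_status)


if __name__ == "__main__":
    target = 98 << 16                # hypothetical TjMax of 98 C
    status = (1 << 31) | (40 << 16)  # hypothetical valid reading, 40 C below TjMax
    print(core_temp_c(target, status))
```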

I've updated the code to read the max FID and request that instead of hard-coded 0x2100

Code:

	mov ecx,MSR_IA32_ENERGY_PERF_BIAS
	rdmsr
	xor edx,edx
	xor eax,eax		   						; Specify CPU BIAS Hint to Maximum Performance.
	wrmsr

	mov ecx,MSR_IA32_MISC_ENABLE			; Make for happy CPU Time
	rdmsr
	or eax,10000h							; Set bit 16, Enable Speed Step.
	wrmsr

	mov ecx,MSR_IA32_PERF_STATUS
	rdmsr
	mov eax,edx								; high dword of IA32_PERF_STATUS
	and eax,0ff00h							; keep the max FID (replaces the hard-coded mov eax,2100h)
	xor edx,edx								; bit 32 clear: IDA/Turbo Boost stays engaged
	mov ecx,MSR_IA32_PERF_CTL
	wrmsr
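
For anyone following the bit shuffling in that last step, here is the same extraction as a Python sketch. It simply mirrors the `mov eax,edx` / `and eax,0ff00h` pair: take the high dword of the IA32_PERF_STATUS value and keep bits 15:8 of it. Whether that field really holds the maximum ratio is model-specific and is taken on trust from this post; the function name and sample value are hypothetical.

```python
def perf_ctl_from_status(perf_status: int) -> int:
    """Build the IA32_PERF_CTL low dword the way the assembly does."""
    high_dword = (perf_status >> 32) & 0xFFFFFFFF  # edx after rdmsr
    return high_dword & 0xFF00                     # and eax,0ff00h


if __name__ == "__main__":
    # Hypothetical PERF_STATUS value with 21h in bits 47:40.
    sample = (0x2100 << 32) | 0x1234
    print(hex(perf_ctl_from_status(sample)))
```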

Re: CPU Performance

Posted: Fri Feb 06, 2015 8:17 am
by embryo
johnsa wrote:So I checked the value of IA32_PERF_CTL (0x199) and found it to be set to 0x00000000:0x00002100
The top part EDX = 0 indicates that speed step or Turbo boost is active, the bottom dword in EAX I can't find out what the value is supposed to be. The documentation from Intel says this is the desired target EIST performance state in bits 0:15, but doesn't give a range of values.
Thanks for an interesting example of hidden performance-boosting capabilities! It's nice to discover such a thing without a long study of the documentation :)