Speed over Memory

Programming, for all ages and all languages.
54616E6E6572
Member
Posts: 47
Joined: Tue Aug 18, 2009 12:52 pm
Location: Kansas City

Speed over Memory

Post by 54616E6E6572 »

So I've officially started development on my OS again, and after the first compile I started debugging my code, only to realize that even with full optimizations turned on I was getting 'sub-optimal' code. I tried several different 'optimizing' compilers (e.g. MSVC, GCC, Open64), and even with the AMD distribution of Open64 producing the best code of the three, I was still disappointed. So I'm developing completely in assembly instead.

This led me to start thinking about some of the small things in my OS that, while wasting memory, could make my code run much faster or more efficiently. I would appreciate your thoughts on the following aspects of my OS.

Optimization:
Allocate memory on 32-byte boundaries. (This forces the minimum allocation to be 32 bytes.)
Rationale:
Enables faster memory operations by using XMM registers to work with 128 bits at a time, and hides pointer/counter arithmetic and branch latencies.
Example:
Loop w/o latency takes 6 + (9 * C) clock cycles, where C = # of times we looped.
movdqa xmm0, [mem] has a 2:1 throughput (2 instructions per clock cycle).
Will take 2 instruction fetches (the CPU caches 32 bytes of instructions at a time).
Assumes memcpy is 16-byte-aligned, but not 32-byte-aligned, to maximize branch performance.

Code:

	memcpy:	; void* memcpy(void* src, void* dest, size_t size)
		shr	r8, 5					; Determine how many times to loop
		mov	rax, rdx				; Set the return value to dest
		db 0x66, 0x66, 0x90			; Align branch window to 16-bytes
		db 0x66, 0x66, 0x90			;
		db 0x66, 0x66, 0x90			;
	 .copyLoop:			;
		movdqa	xmm0, [rcx + 0x00]	; Read 32-bytes from source
		movdqa	xmm1, [rcx + 0x10]	; 
		add		rcx, 32				; Adjust source buffer
		movdqa	[rdx + 0x00], xmm0	; Write 32-bytes to destination
		movdqa	[rdx + 0x10], xmm1	; 
		add		rdx, 32				; Adjust destination buffer
		dec		r8					; Repeat the copy until complete
		jnz		.copyLoop			;
		ret		0					; Return to caller
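(As a side note on the allocator itself: rounding a requested size up to the next 32-byte multiple is cheap. A minimal sketch, assuming the requested size arrives in rax:)

Code:

	add	rax, 31				; Round up...
	and	rax, -32			; ...by clearing the low five bits
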
Optimization:
Use 2MB pages as default instead of 4KB pages.
Rationale:
The average application running on Windows or Linux uses 1MB+ of memory. By switching to 2MB pages we can reduce the memory required to store page tables.
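(Back-of-the-envelope illustration, assuming x86-64 long-mode paging and a fully mapped, virtually contiguous 1MB region:)

Code:

	4KB pages: 1MB / 4KB = 256 PTEs -> one 4KB page table (plus the PD/PDPT/PML4 pages above it)
	2MB pages: the region fits in a single PDE -> no page table at all, but 1MB of the 2MB page goes unused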

Optimization:
Require all applications to be distributed in a platform agnostic form (binary source modules).
Rationale:
Normal applications are built for the 'lowest common denominator' and cannot support new instruction sets without providing separate code paths. Distributing code in binary source modules will allow it to be compiled a second time, upon installation, for the machine it will run on. Furthermore, it will allow static type checking and security verification of the code.
The 2nd Doctor: "I have no doubt that you could augment an earwig to the point where it could understand nuclear physics, but it would still be a very stupid thing to do!"
NickJohnson
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: Speed over Memory

Post by NickJohnson »

54616E6E6572 wrote:So I've officially started development on my OS again, and after the first compile I started debugging my code, only to realize that even with full optimizations turned on I was getting 'sub-optimal' code. I tried several different 'optimizing' compilers (e.g. MSVC, GCC, Open64), and even with the AMD distribution of Open64 producing the best code of the three, I was still disappointed. So I'm developing completely in assembly instead.
Have you tried clang? You can use it to do very fancy stuff like LLVM link-time optimization.
Optimization:
Allocate memory on 32-byte boundaries. (This forces the minimum allocation to be 32 bytes.)
Rationale:
Enables faster memory operations by using XMM registers to work with 128 bits at a time, and hides pointer/counter arithmetic and branch latencies.
Example:
Loop w/o latency takes 6 + (9 * C) clock cycles, where C = # of times we looped.
movdqa xmm0, [mem] has a 2:1 throughput (2 instructions per clock cycle).
Will take 2 instruction fetches (the CPU caches 32 bytes of instructions at a time).
Assumes memcpy is 16-byte-aligned, but not 32-byte-aligned, to maximize branch performance.

Code:

	memcpy:	; void* memcpy(void* src, void* dest, size_t size)
		shr	r8, 5					; Determine how many times to loop
		mov	rax, rdx				; Set the return value to dest
		db 0x66, 0x66, 0x90			; Align branch window to 16-bytes
		db 0x66, 0x66, 0x90			;
		db 0x66, 0x66, 0x90			;
	 .copyLoop:			;
		movdqa	xmm0, [rcx + 0x00]	; Read 32-bytes from source
		movdqa	xmm1, [rcx + 0x10]	; 
		add		rcx, 32				; Adjust source buffer
		movdqa	[rdx + 0x00], xmm0	; Write 32-bytes to destination
		movdqa	[rdx + 0x10], xmm1	; 
		add		rdx, 32				; Adjust destination buffer
		dec		r8					; Repeat the copy until complete
		jnz		.copyLoop			;
		ret		0					; Return to caller
How are you going to perform memcpy() on non-aligned data (like in structs or on the stack)? If the main problem is the speed of copying, a few instructions checking and accounting for the alignment will probably not add too much overhead. Also, on Linux, all allocations are already 16-byte aligned, for cache reasons.
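(For instance, something as small as the following at the top of memcpy - the fallback label here is hypothetical - would do:)

Code:

	test	cl, 15				; Is the source pointer 16-byte aligned?
	jnz	.unalignedCopy			; If not, take a fixup/fallback path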
Optimization:
Use 2MB pages as default instead of 4KB pages.
Rationale:
The average application running on Windows or Linux uses 1MB+ of memory. By switching to 2MB pages we can reduce the memory required to store page tables.
Yeah, but is all of that allocated memory virtually contiguous? You may have a lot of memory allocated for shared libraries/DLLs, which are separate from the normal code/data, and have to be separate pages. If you have five shared libraries, that makes your program take up at least 2*6 = 12 megabytes! And you'd be surprised how many programs run in the background (at least on Linux) and take almost no memory: these would be blown up tenfold in memory usage.

Also, the maximum amount of memory you can use to store 4KB paging structures is 4MB, which is a complete worst-case scenario (nearly impossible on a multitasking system, too), and nothing compared to how much memory you'll be wasting by using 2/4MB pages.
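(For reference, the arithmetic behind that 4MB figure, assuming 32-bit non-PAE paging with 4-byte PTEs:)

Code:

	4GB address space / 4KB per page = 1,048,576 pages
	1,048,576 PTEs * 4 bytes each    = 4MB of page tables (absolute worst case)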
Optimization:
Require all applications to be distributed in a platform agnostic form (binary source modules).
Rationale:
Normal applications are built for the 'lowest common denominator' and cannot support new instruction sets without providing separate code paths. Distributing code in binary source modules will allow it to be compiled a second time, upon installation, for the machine it will run on. Furthermore, it will allow static type checking and security verification of the code.

Well, this is why you usually write app code in C, not assembly: you can recompile it from source for optimizations on different processor models. Also, have you taken a look at LLVM? It seems to do the sort of optimization thing you're looking for, and has a good infrastructure.

Final comment: is your OS actually done? If not, this is definitely premature optimization.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Speed over Memory

Post by Owen »

54616E6E6572 wrote:Optimization:
Allocate memory on 32-byte boundaries. (This forces the minimum allocation to be 32 bytes.)
Rationale:
Enables faster memory operations by using XMM registers to work with 128 bits at a time, and hides pointer/counter arithmetic and branch latencies.
This isn't even an optimization. It's a requirement to make AVX go fast.
54616E6E6572 wrote:

Code:

		db 0x66, 0x66, 0x90			; Align branch window to 16-bytes
		db 0x66, 0x66, 0x90			;
		db 0x66, 0x66, 0x90			;
Why 3-byte NOPs? Instructions can be up to 15 bytes; fewer would be faster...
54616E6E6572 wrote: Optimization:
Use 2MB pages as default instead of 4KB pages.
Rationale:
The average application running on Windows or Linux uses 1MB+ of memory. By switching to 2MB pages we can reduce the memory required to store page tables.
Negative optimization. Massive memory waste, plus it reduces the effectiveness of the TLB drastically.

You're much better off reserving the 2MB pages for where they'll be of benefit - mapping things like framebuffers and the kernel code - the first because it is a big benefit there, the second because it increases how long your kernel code stays in the TLB.
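(For concreteness, a 2MB mapping is just a page-directory entry with the page-size bit set. A minimal NASM sketch, where page_directory and the framebuffer address are hypothetical:)

Code:

	mov	rax, 0xFD000000			; Hypothetical 2MB-aligned framebuffer physical address
	or	rax, 0x83			; Present (0x1) | writable (0x2) | 2MB page size (0x80)
	mov	[page_directory + 0 * 8], rax	; First PDE of this page directory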
54616E6E6572 wrote: Optimization:
Require all applications to be distributed in a platform agnostic form (binary source modules).
Rationale:
Normal applications are built for the 'lowest common denominator' and cannot support new instruction sets without providing separate code paths. Distributing code in binary source modules will allow it to be compiled a second time, upon installation, for the machine it will run on. Furthermore, it will allow static type checking and security verification of the code.
You've already discovered that compilers don't optimize well. Yours won't either.

And I think you're overthinking the inefficiencies of compilers. Really, how much time will programs spend in your kernel? Particularly when you consider that on x86 privilege level changes take ~50 cycles each way minimum...
54616E6E6572
Member
Posts: 47
Joined: Tue Aug 18, 2009 12:52 pm
Location: Kansas City

Re: Speed over Memory

Post by 54616E6E6572 »

Owen wrote:Why 3-byte NOPs? Instructions can be up to 15 bytes; fewer would be faster...
You would be surprised. First, the use of more than three prefixes limits decoder performance. The only more optimal solution would be "nop word [rax + rax + 0]", which assembles to "db 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00" (I actually should have used this). Finally:
AMD64 Software Optimization Guide wrote:Although there are several possible multibyte NOP-equivalent instructions that do not change the processor state (other than rIP), combinations of the operand-size override and the multibyte NOP instruction are more efficient.
Owen wrote:You've already discovered that compilers don't optimize well. Yours won't either.

And I think you're overthinking the inefficiencies of compilers. Really, how much time will programs spend in your kernel? Particularly when you consider that on x86 privilege level changes take ~50 cycles each way minimum...
First point: I don't expect my compiler to be the best, or even perfect; I expect any application running on my OS to know the environment it is running in and thus run closer to its full potential.
Second point: I don't know about anyone else, but to me at least, every clock cycle counts. If a compiler generates a loop that completes in 44, 29, 14, or even 10 clock cycles, and the loop is called, let's say, 1000 times every second, but you can write or generate code that allows it to complete in 9 clock cycles, then you save 1,000 to 35,000 clock cycles every second, which could be used to do so much more (e.g. make up for the time that a poorly written program takes to execute).
NickJohnson wrote:is your OS actually done? If not, this is definitely premature optimization.
It's complete to the point of where it provides a Memory Manager, ACPI, PCI, VESA, and PnP support, as well as rudimentary Multiprocessing and Process Scheduling.
NickJohnson wrote:Also, have you taken a look at LLVM? It seems to do the sort of optimization thing you're looking for, and has a good infrastructure.
I have, and it is actually what I am thinking of using to write my high-level applications (GCC and Open64 both have support to output and read LLVM object files).
NickJohnson wrote:How are you going to perform memcpy() on non-aligned data (like in structs or on the stack)? If the main problem is the speed of copying, a few instructions checking and accounting for the alignment will probably not add too much overhead. Also, on Linux, all allocations are already 16-byte aligned, for cache reasons.
I should actually replace every "movdqa" with "movdqu", since the unaligned version provides performance equal to the aligned version when the data is actually aligned.
Also, looking over the loop, I could restructure it to take (6 + (9 * C)) clock cycles on any data block whose size is a multiple of 32 bytes, and (20 + (9 * C)) (worst case) on any that is not.
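A minimal sketch of what that restructuring might look like, keeping the register usage of the original (rcx = src, rdx = dest, r8 = size) and adding a hypothetical byte-wise tail for sizes that are not a multiple of 32:

Code:

	memcpy:	; void* memcpy(void* src, void* dest, size_t size)
		mov	rax, rdx			; Set the return value to dest
		mov	r9, r8
		shr	r9, 5				; r9 = number of 32-byte blocks
		jz	.tail				; Size < 32: skip the vector loop
	.copyLoop:
		movdqu	xmm0, [rcx + 0x00]	; Unaligned loads cost the same as
		movdqu	xmm1, [rcx + 0x10]	; aligned ones when the data is aligned
		add	rcx, 32
		movdqu	[rdx + 0x00], xmm0
		movdqu	[rdx + 0x10], xmm1
		add	rdx, 32
		dec	r9
		jnz	.copyLoop
	.tail:
		and	r8, 31				; Remaining bytes (0..31)
		jz	.done
	.byteLoop:
		mov	r9b, [rcx]			; Simple byte-at-a-time tail copy
		mov	[rdx], r9b
		inc	rcx
		inc	rdx
		dec	r8
		jnz	.byteLoop
	.done:
		ret	0				; Return to caller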
The 2nd Doctor: "I have no doubt that you could augment an earwig to the point where it could understand nuclear physics, but it would still be a very stupid thing to do!"
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Speed over Memory

Post by bewing »

I agree that the output I get from fully "optimized" GCC is about three times bigger and slower than my hand-optimized ASM. It would be difficult -- but should not be impossible -- to do much better than that, as a project. I'm going to take a shot at it someday.
Solar
Member
Posts: 7615
Joined: Thu Nov 16, 2006 12:01 pm
Location: Germany

Re: Speed over Memory

Post by Solar »

The sad thing about it is that it does not matter.

Outside very special niches, that is.

It mattered for quite some time (6502, 386, 68030), and legendary feats were done by skilled people of that era.

Today, the average user has so many useless apps running on his machine without being aware of it that optimizations like these don't register.

The non-average user just gets a CPU that's twice as fast, next year.

:cry:
Every good solution is obvious once you've found it.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: Speed over Memory

Post by Candy »

I'm able to outperform GCC by a factor of 4-5 on code that's assembly-friendly and GCC-unfriendly.

Last time I tried, GCC with plain "-O3" was able to outperform my best optimization effort by about 10 percent. Using its output as input, I was able to beat that again by another 10% by adding prefetches.
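(For anyone wondering what "using prefetches" looks like in practice, a hypothetical streaming-copy fragment:)

Code:

	copyLoop:
		prefetchnta	[rcx + 512]	; Hint: start fetching a few cache lines ahead of the loads
		movdqa	xmm0, [rcx]
		movdqa	[rdx], xmm0
		add	rcx, 16
		add	rdx, 16
		dec	r8
		jnz	copyLoop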

Add to that the fact that 90% of your code is not on the critical path. That means you're wasting a lot of time.
bewing
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Re: Speed over Memory

Post by bewing »

I disagree that it's a waste of time. Coding in ASM (and then thinking about it) shows you how to design your data structures for efficiency in the first place. And if you are building your OS for your own use, then you are causing yourself endless future pain by writing crummy code that you have to live with for the rest of your life. People complain about fatware. They notice.
IanSeyler
Member
Posts: 326
Joined: Mon Jul 28, 2008 9:46 am
Location: Ontario, Canada

Re: Speed over Memory

Post by IanSeyler »

Why is this in the Auto-Delete forum? There is some good discussion in here.

So in order to maximize the effectiveness of the TLB, you should use 4KiB pages instead of 2MiB? Currently I am using a PML4 with 2MiB pages. My goal is raw speed, so if 4KiB pages are better then I will switch. My OS is mono-tasking and the kernel is only 16KiB.

-Ian
BareMetal OS - http://www.returninfinity.com/
Mono-tasking 64-bit OS for x86-64 based computers, written entirely in Assembly
54616E6E6572
Member
Posts: 47
Joined: Tue Aug 18, 2009 12:52 pm
Location: Kansas City

Re: Speed over Memory

Post by 54616E6E6572 »

So I have another optimization now, for integer-to-hex conversion.

Essentially, the following code converts integers to their hexadecimal form; what's unique about it is that it converts them 128 bits at a time using SSE2. There is probably a lot more I could do to improve the code, but for now it's meant to be an accurate way to quickly convert Int2Hex during my kernel's DontPanic() hex dump function.

It takes the input value in XMM0 and returns the converted values in HexBuffer, which is 32 bytes in size. The code contains no loops, and it compiles and runs on any x86-64 CPU (it should also compile and run on any x86 CPU with SSE2 support). It would take a minimum of 47 clock cycles to execute, assuming you don't factor in memory latency, cache misses, contention for resources, or fetching and decoding (it could actually take less if you also factor in SIMD vector throughput and whether the CPU uses full-width 128-bit operations (later AMD64) or decomposes them into internal 64-bit operations (early AMD64 and, AFAIK, all Intel)).

Code:

SECTION .text

ALIGN 16
Hex2Ascii:
   movdqa   oword [HexBuffer + 0x00], xmm1                  ; Save XMM1
   movdqa   oword [XmmBuffer + 0x00], xmm2                  ; Save XMM2
   movdqa   xmm1, xmm0                                      ; Save XMM0
   pand   xmm0, oword [XmmBuffer + 0x10]                    ; Isolate the low nibble of each byte
   paddb   xmm0, oword [XmmBuffer + 0x30]                   ; Make first ASCII adjustment
   movdqa   xmm2, xmm0                                      ; Find which values need another adjustment
   pcmpgtb   xmm2, oword [XmmBuffer + 0x40]                 ; 
   pand   xmm2, oword [XmmBuffer + 0x50]                    ; 
   paddb   xmm0, xmm2                                       ; Make second ASCII adjustment
   pand   xmm1, oword [XmmBuffer + 0x20]                    ; Isolate the high nibble of each byte
   psrlw   xmm1, 4                                          ; Shift each high nibble down into its byte's low nibble (otherwise the 0x30 adjustment overflows)
   paddb   xmm1, oword [XmmBuffer + 0x30]                   ; Make third ASCII adjustment
   movdqa   xmm2, xmm1                                      ; Find which values need another adjustment
   pcmpgtb   xmm2, oword [XmmBuffer + 0x40]                 ;
   pand   xmm2, oword [XmmBuffer + 0x50]                    ;
   paddb   xmm1, xmm2                                       ; Make final ASCII adjustment
   movdqa   oword [HexBuffer + 0x10], xmm1                  ; Store the converted high-nibble characters
   movdqa   xmm1, oword [HexBuffer + 0x00]                  ; Restore XMM1
   movdqa   oword [HexBuffer + 0x00], xmm0                  ; Store the converted low-nibble characters
   movdqa   xmm2, oword [XmmBuffer + 0x00]                  ; Restore XMM2
   ret      0x0000                                          ; Return to Caller
   
; ------------------------------------------------------------- ;
; ============================================================= ;
; ------------------------------------------------------------- ;
   
SECTION .data

ALIGN 16
HexBuffer:   dq 0x0000000000000000, 0x0000000000000000
             dq 0x0000000000000000, 0x0000000000000000

ALIGN 16
XmmBuffer:   dq 0x0000000000000000, 0x0000000000000000
             dq 0x0F0F0F0F0F0F0F0F, 0x0F0F0F0F0F0F0F0F
             dq 0xF0F0F0F0F0F0F0F0, 0xF0F0F0F0F0F0F0F0
             dq 0x3030303030303030, 0x3030303030303030   
             dq 0x3939393939393939, 0x3939393939393939
             dq 0x0707070707070707, 0x0707070707070707	
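Hypothetical usage, converting the value in RAX during a hex dump (interleaving the two output vectors into printable digit order is left to the caller):

Code:

	movq	xmm0, rax			; Low 64 bits = the value to convert
	call	Hex2Ascii
	; [HexBuffer + 0x00] now holds the converted low-nibble characters,
	; [HexBuffer + 0x10] the converted high-nibble characters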
The 2nd Doctor: "I have no doubt that you could augment an earwig to the point where it could understand nuclear physics, but it would still be a very stupid thing to do!"
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Re: Speed over Memory

Post by JamesM »

54616E6E6572 wrote:So I've officially started development on my OS again, and after the first compile I started debugging my code, only to realize that even with full optimizations turned on I was getting 'sub-optimal' code. I tried several different 'optimizing' compilers (e.g. MSVC, GCC, Open64), and even with the AMD distribution of Open64 producing the best code of the three, I was still disappointed. So I'm developing completely in assembly instead.

This led me to start thinking about some of the small things in my OS that, while wasting memory, could make my code run much faster or more efficiently. I would appreciate your thoughts on the following aspects of my OS.

Optimization:
Allocate memory on 32-byte boundaries. (This forces the minimum allocation to be 32 bytes.)
Rationale:
Enables faster memory operations by using XMM registers to work with 128 bits at a time, and hides pointer/counter arithmetic and branch latencies.
Example:
Loop w/o latency takes 6 + (9 * C) clock cycles, where C = # of times we looped.
movdqa xmm0, [mem] has a 2:1 throughput (2 instructions per clock cycle).
Will take 2 instruction fetches (the CPU caches 32 bytes of instructions at a time).
Assumes memcpy is 16-byte-aligned, but not 32-byte-aligned, to maximize branch performance.

Code:

	memcpy:	; void* memcpy(void* src, void* dest, size_t size)
		shr	r8, 5					; Determine how many times to loop
		mov	rax, rdx				; Set the return value to dest
		db 0x66, 0x66, 0x90			; Align branch window to 16-bytes
		db 0x66, 0x66, 0x90			;
		db 0x66, 0x66, 0x90			;
	 .copyLoop:			;
		movdqa	xmm0, [rcx + 0x00]	; Read 32-bytes from source
		movdqa	xmm1, [rcx + 0x10]	; 
		add		rcx, 32				; Adjust source buffer
		movdqa	[rdx + 0x00], xmm0	; Write 32-bytes to destination
		movdqa	[rdx + 0x10], xmm1	; 
		add		rdx, 32				; Adjust destination buffer
		dec		r8					; Repeat the copy until complete
		jnz		.copyLoop			;
		ret		0					; Return to caller
Optimization:
Use 2MB pages as default instead of 4KB pages.
Rationale:
The average application running on Windows or Linux uses 1MB+ of memory. By switching to 2MB pages we can reduce the memory required to store page tables.

Optimization:
Require all applications to be distributed in a platform agnostic form (binary source modules).
Rationale:
Normal applications are built for the 'lowest common denominator' and cannot support new instruction sets without providing separate code paths. Distributing code in binary source modules will allow it to be compiled a second time, upon installation, for the machine it will run on. Furthermore, it will allow static type checking and security verification of the code.
Hi,

Can I just ask - how are you measuring code optimality in this context? Are you just looking at the optimised code generated by GCC and inspecting it against hand-tuned assembly, or are you actually performing accurate benchmarks?

The only reason I ask is that GCC gets a lot of input from Intel; it's extremely well tuned and maintained. You also mention cycle times - these are an utterly pointless metric on the x86. The pipeline is so complex that Intel stopped specifying cycle counts in their manuals ages ago.

The instruction scheduler also takes into account hazards that cause pipeline stalls and flushes, numbers of ALUs/Load-store units/multipliers, cache line size and alignment, prefetch tendencies and data dependencies.

It also takes into account which instructions are most tuned - this tends to be the MOV and LEA instructions wherever possible (hence why optimised compiled code has a metric ton of MOVs in it and uses LEA for arithmetic operations where possible).
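(A trivial illustration of the LEA point: a multiply-and-add folded into a single instruction that doesn't even touch the flags:)

Code:

	lea	rax, [rbx + rbx*4 + 7]		; rax = rbx*5 + 7, flags untouched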

Basically, some metric and numbers would be nice to see! :-)

Cheers,

James
JamesM
Member
Posts: 2935
Joined: Tue Jul 10, 2007 5:27 am
Location: York, United Kingdom

Re: Speed over Memory

Post by JamesM »

54616E6E6572 wrote:I have, and it is actually what I am thinking of using to write my high-level applications (GCC and Open64 both have support to output and read LLVM object files).
GCC cannot output or read LLVM bitcode files. GCC uses GIMPLE as its IR, LLVM uses its own bitcode. llvm-gcc does the GIMPLE->LLVM conversion, and it is not symmetric.
Candy
Member
Posts: 3882
Joined: Tue Oct 17, 2006 11:33 pm
Location: Eindhoven

Re: Speed over Memory

Post by Candy »

Holidays make your response times so much worse.
bewing wrote:People complain about fatware. They notice.
People notice software that is running slower than *they* expect it to run. What defines their expectations? Mostly trained habit: how fast something should feel on a slow/medium/fast computer. I recall having a 33k6 modem loading web pages in about 10 seconds and thinking *that's fast*. Yesterday in the train I was loading a web page (on my phone, over some crappy wireless connection) and it didn't show within 3 seconds, and I was thinking *that's slow*. Same action, comparable web page, faster load, worse perceived response.

Users expect computers to be as fast in doing X in your application as it was in all the other applications. For the rest, they make a rough estimate and try to compare.

If searching mail in X takes 2 seconds, your app should not take 20. If copying a VHS tape to another VHS tape takes about 1 hour at double speed, your DVD-to-file conversion program is going to get a similar expectation from them. They won't expect it to be done in 5 seconds, but if it takes over 2 hours it's going to be considered "slow".

Technically, you can make so much code faster that it's nearly pointless to discuss. Whether users are going to notice, and how much, is another thing entirely. The best starting point IMO is to get an API set that is agreed upon and commonly in use, and then to make the implementation of that API set as fast as possible. I haven't seen anybody succeed at doing #1, so I see little value in doing #2.