Speed over Memory
Posted: Wed Aug 11, 2010 5:44 am
So I've officially started development on my OS again. After the first compile I started debugging my code, only to realize that even with full optimizations turned on I was getting 'sub-optimal' code. I tried several different 'optimizing' compilers (MSVC, GCC, Open64), and even though the AMD distro of Open64 produced the most optimal code of the three, I was still disappointed. So I'm developing completely in assembly instead.
This led me to start thinking about some of the small things in my OS that, while wasting memory, could make my code run much faster or more efficiently. I would appreciate your thoughts on the following things in my OS.
Optimization:
Allocate memory on 32-byte boundaries. (This forces the minimum allocation to be 32 bytes.)
Rationale:
Speeds up memory operations by letting them use XMM registers to work with 128 bits at a time and hide pointer/counter arithmetic and branch latencies.
Example:
Ignoring latency, the loop takes 6 + (9 * C) clock cycles, where C is the number of loop iterations.
movdqa xmm0, [mem] has a 2:1 throughput (2 instructions per clock cycle).
The loop takes 2 instruction fetches (the CPU caches 32 bytes of instructions at a time).
Assumes memcpy itself is 16-byte aligned but not 32-byte aligned, to maximize branch performance.
Code:
memcpy:                             ; void* memcpy(void* src, void* dest, size_t size)
                                    ; In: rcx = src, rdx = dest, r8 = size
                                    ; Requires: src and dest 16-byte aligned (movdqa),
                                    ;           size a non-zero multiple of 32
    shr     r8, 5                   ; Determine how many times to loop (size / 32)
    mov     rax, rdx                ; Set the return value to dest
    db      0x66, 0x66, 0x90        ; Align branch window to 16 bytes
    db      0x66, 0x66, 0x90        ;
    db      0x66, 0x66, 0x90        ;
.copyLoop:                          ;
    movdqa  xmm0, [rcx + 0x00]      ; Read 32 bytes from source
    movdqa  xmm1, [rcx + 0x10]      ;
    add     rcx, 32                 ; Adjust source buffer
    movdqa  [rdx + 0x00], xmm0      ; Write 32 bytes to destination
    movdqa  [rdx + 0x10], xmm1      ;
    add     rdx, 32                 ; Adjust destination buffer
    dec     r8                      ; Repeat the copy until complete
    jnz     .copyLoop               ;
    ret     0                       ; Return to caller
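To make the memory cost of the 32-byte rule concrete, here is a minimal C sketch of what the allocator side might look like. Everything named here (GRANULE, ARENA_SIZE, round_up_32, alloc32, the bump-allocator arena) is a hypothetical illustration and not part of my actual OS code; the cycle estimate simply restates the 6 + (9 * C) formula from the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE    32u     /* allocation granularity proposed in the post */
#define ARENA_SIZE 4096u   /* hypothetical arena size, for illustration only */

/* Round a request up to the next multiple of 32 bytes.
   The caller must keep size small enough that size + 31 does not overflow. */
static size_t round_up_32(size_t size)
{
    return (size + (GRANULE - 1)) & ~(size_t)(GRANULE - 1);
}

/* Number of iterations the 32-byte copy loop runs for a given size. */
static size_t copy_iterations(size_t size)
{
    return size >> 5;
}

/* The post's own estimate, 6 + (9 * C) cycles, ignoring latency. */
static size_t copy_cycles_estimate(size_t size)
{
    return 6 + 9 * copy_iterations(size);
}

/* Hypothetical bump allocator: every block starts on a 32-byte boundary
   and occupies a whole number of 32-byte granules. */
static unsigned char arena[ARENA_SIZE + GRANULE];
static size_t arena_used;

static void *alloc32(size_t size)
{
    /* First 32-byte aligned address inside the arena. */
    uintptr_t base = ((uintptr_t)arena + (GRANULE - 1)) & ~(uintptr_t)(GRANULE - 1);
    size_t need;

    if (size == 0 || size > ARENA_SIZE)
        return NULL;
    need = round_up_32(size);
    if (need > ARENA_SIZE - arena_used)
        return NULL;

    assert(((base + arena_used) & (GRANULE - 1)) == 0);
    arena_used += need;
    return (void *)(base + arena_used - need);
}
```

With this sketch a 1-byte request consumes 32 bytes and a 40-byte request consumes 64, which is the memory being traded for a copy loop that never has to handle a tail.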
Optimization:
Use 2MB pages as the default instead of 4KB pages.
Rationale:
The average application running on Windows or Linux uses 1MB+ of memory. By switching to 2MB pages we can reduce the memory required to store page tables.
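A rough C sketch of the arithmetic behind this, assuming x86-64 long-mode paging (4KB tables holding 512 eight-byte entries, with a page directory entry able to map a 2MB page directly) and a single contiguous region that starts on a 1GB boundary. The function names are made up for illustration, and the PDPT and PML4 levels are left out because they cost the same in both schemes.

```c
#include <assert.h>
#include <stdint.h>

#define KIB         (1024ull)
#define MIB         (1024ull * KIB)
#define GIB         (1024ull * MIB)
#define TABLE_BYTES (4ull * KIB)   /* one paging structure = 512 entries * 8 bytes */

static uint64_t div_ceil(uint64_t a, uint64_t b)
{
    assert(b != 0);
    return (a + b - 1) / b;
}

/* Bytes of page tables plus page directories needed with 4KB pages:
   one page table per 2MB mapped, one page directory per 1GB mapped. */
static uint64_t table_bytes_4k(uint64_t region)
{
    uint64_t page_tables      = div_ceil(region, 2 * MIB);
    uint64_t page_directories = div_ceil(region, GIB);
    return (page_tables + page_directories) * TABLE_BYTES;
}

/* With 2MB pages the page directory entries map memory directly,
   so the page-table level disappears. */
static uint64_t table_bytes_2m(uint64_t region)
{
    return div_ceil(region, GIB) * TABLE_BYTES;
}

/* Memory mapped but not requested when rounding up to whole 2MB pages. */
static uint64_t unused_tail_2m(uint64_t region)
{
    return div_ceil(region, 2 * MIB) * (2 * MIB) - region;
}
```

Under these assumptions a 1MB application needs 8KB of paging structures at these two levels with 4KB pages and 4KB with 2MB pages, but the single 2MB page leaves 1MB mapped and unused, which is the memory-for-speed trade this post is about.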
Optimization:
Require all applications to be distributed in a platform-agnostic form (binary source modules).
Rationale:
Normal applications are built for the 'lowest common denominator' and cannot support new instruction sets without providing separate code paths. Distributing code in binary source modules will allow code to be compiled a second time upon installation for the machine it will be run on. Furthermore, it will allow static type checking and security verification of code.
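As a sketch of the install-time step, the installer could pick the best code-generation target once, from whatever features it detects, instead of every application carrying its own runtime dispatch. The feature bits, target names and pick_target below are purely hypothetical; I'm assuming the installer would fill in the mask from CPUID, and the only hard fact relied on is that every x86-64 CPU has SSE2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits an installer might fill in from CPUID. */
enum cpu_feature {
    FEAT_SSE2  = 1u << 0,
    FEAT_SSE41 = 1u << 1,
    FEAT_AVX   = 1u << 2
};

/* A code-generation target and the features the machine must have for it. */
struct codegen_target {
    const char *name;
    unsigned    required;
};

/* Ordered from most capable to baseline; names are illustrative only. */
static const struct codegen_target targets[] = {
    { "avx",    FEAT_SSE2 | FEAT_SSE41 | FEAT_AVX },
    { "sse4.1", FEAT_SSE2 | FEAT_SSE41 },
    { "sse2",   FEAT_SSE2 }
};

/* Return the first (best) target whose requirements are all present. */
static const char *pick_target(unsigned detected)
{
    size_t count = sizeof targets / sizeof targets[0];
    size_t i;

    /* x86-64 guarantees SSE2, so the baseline target always matches. */
    assert((detected & FEAT_SSE2) != 0);

    for (i = 0; i < count; i++) {
        if ((detected & targets[i].required) == targets[i].required)
            return targets[i].name;
    }
    return targets[count - 1].name;
}
```

The module would then be compiled once for the returned target, so the installed binary contains a single code path for that machine.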