OSDev.org

Posted: **Wed Mar 04, 2009 3:39 am**

Hey all,

So I thought I'd start doing some performance testing (just for fun really), comparing identical code (MASM) running under Windows and inside my kernel (32-bit protected mode).
I'll also be testing the difference of paged vs non paged with different types of memory operations.

So far the results I've come up with are quite interesting:
1) zeroing and copying small blocks of memory (256 bytes / 512 bytes / 1 KB), 3 billion iterations:
- Windows XP SP3/MASM: 22 seconds
- Custom Kernel: 15 seconds

2) zeroing and copying large blocks of memory (1 MB / 2 MB / 4 MB), 300 thousand iterations:
- Windows: 61 seconds
- Custom Kernel: 58 seconds

Timings were done via rdtsc in the custom kernel and QueryPerformanceCounter/QueryPerformanceFrequency under Windows.

So the results would indicate that small memory transfers under Windows are far from optimal, but once the blocks start getting larger the difference is negligible. This would also lead me to suspect that paging won't make much, if any, difference to overall performance.
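For anyone wanting to reproduce this kind of measurement, here is a minimal sketch in C of a timed zero-and-copy loop. It is not the MASM code used for the numbers above. The names `read_ticks` and `time_zero_and_copy` are made up for illustration, and it assumes GCC or Clang on x86 for `__rdtsc()`, falling back to the standard `clock()` elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the CPU time-stamp counter (assumes GCC/Clang on x86). */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for other targets: processor time from the C library. */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/*
 * Hypothetical helper: zero `src`, copy it to `dst`, and repeat
 * `iterations` times. Returns elapsed ticks as reported by read_ticks().
 */
uint64_t time_zero_and_copy(unsigned char *dst, unsigned char *src,
                            size_t block_size, uint64_t iterations)
{
    volatile unsigned char sink = 0;
    uint64_t start, end, i;

    assert(dst != NULL && src != NULL && block_size > 0);

    start = read_ticks();
    for (i = 0; i < iterations; i++) {
        memset(src, 0, block_size);
        memcpy(dst, src, block_size);
        /* Read one byte back to discourage the compiler from dropping the work. */
        sink ^= dst[i % block_size];
    }
    end = read_ticks();

    (void)sink;
    return end - start;
}
```

Calling `time_zero_and_copy(dst, src, 256, n)` with two 256-byte buffers corresponds to the smallest block size in test 1. The tick count only turns into seconds once the counter frequency is known, which this sketch does not try to determine.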

Posted: **Wed Mar 04, 2009 6:12 am**

johnsa wrote: Hey all,

So I thought I'd start doing some performance testing (just for fun really), comparing identical code (MASM) running under Windows and inside my kernel (32-bit protected mode).
I'll also be testing the difference of paged vs non paged with different types of memory operations.

So far the results I've come up with are quite interesting:
1) zeroing and copying small blocks of memory (256 bytes / 512 bytes / 1 KB), 3 billion iterations:
- Windows XP SP3/MASM: 22 seconds
- Custom Kernel: 15 seconds

2) zeroing and copying large blocks of memory (1 MB / 2 MB / 4 MB), 300 thousand iterations:
- Windows: 61 seconds
- Custom Kernel: 58 seconds

Timings were done via rdtsc in the custom kernel and QueryPerformanceCounter/QueryPerformanceFrequency under Windows.

So the results would indicate that small memory transfers under Windows are far from optimal, but once the blocks start getting larger the difference is negligible. This would also lead me to suspect that paging won't make much, if any, difference to overall performance.

Am I missing something here? Small memory transfers incur zero OS overhead, unless the transfers page fault (have you managed to rig it so they do?).

Otherwise, you're just testing the processor, FSB and RAM. So you should get identical results, which makes me think that your timing system is off.

Posted: **Wed Mar 04, 2009 6:26 am**

Hi,

The other thing that I'm wondering about is task switches. As well as the time while your task is sleeping, you have things like TLB misses due to reloads of CR3 and so on. Although how you take that into account in your timer, I have no idea...

Cheers,
Adam

Posted: **Wed Mar 04, 2009 6:40 am**

Essentially there should be no page faulting, with the TLBs primed and the small blocks in CPU cache. The first part was to see if any time was lost to paging overhead, and the second was to see how much effect Windows time slicing, register reloads and cache misses due to task swaps would have. For that very reason I'm not trying to exclude those things from the timing, as their cost is what I wanted to see.

Bottom line really is: given a machine of performance n, what is the theoretical maximum, and what is possible while under Windows? I.e. by having Windows as your OS, you're losing about 30% performance on small memory operations. Granted, that's not representative of an entire OS with all facets running, and any other OS will also have certain overheads, but it serves as a benchmark to measure against.
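The "30%" figure can be checked against the timings in the first post. The sketch below is only arithmetic on those four numbers, and the function names are made up for illustration. With the custom kernel as the baseline, 22 s against 15 s works out to roughly 32% lower throughput (or about 47% more time), while 61 s against 58 s is roughly 5% lower throughput.

```c
#include <assert.h>

/*
 * Percentage of throughput lost relative to a baseline, given the time
 * each environment took to do the same amount of work.
 * Throughput is proportional to 1/time, so loss = 1 - baseline/measured.
 */
double throughput_loss_percent(double baseline_seconds, double measured_seconds)
{
    assert(baseline_seconds > 0.0 && measured_seconds > 0.0);
    return (1.0 - baseline_seconds / measured_seconds) * 100.0;
}

/* Percentage of extra wall-clock time taken relative to the baseline. */
double extra_time_percent(double baseline_seconds, double measured_seconds)
{
    assert(baseline_seconds > 0.0 && measured_seconds > 0.0);
    return (measured_seconds - baseline_seconds) / baseline_seconds * 100.0;
}
```

Which of the two percentages to quote depends on the question being asked: "how much slower is the work" points to the first, "how much longer do I wait" to the second.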

Posted: **Wed Mar 04, 2009 10:57 am**

@johnsa, is it not possible to post the test code, so we can give you results from our own OS designs? Yes, we may need to modify the code to run on our OSes, but the timer, data moves etc. would not need modifying.
My OS is single-tasking, non-paging etc.
Thanks

Posted: **Wed Mar 04, 2009 6:59 pm**

johnsa wrote: Essentially there should be no page faulting, with the TLBs primed and the small blocks in CPU cache. The first part was to see if any time was lost to paging overhead, and the second was to see how much effect Windows time slicing, register reloads and cache misses due to task swaps would have. For that very reason I'm not trying to exclude those things from the timing, as their cost is what I wanted to see.

Bottom line really is: given a machine of performance n, what is the theoretical maximum, and what is possible while under Windows? I.e. by having Windows as your OS, you're losing about 30% performance on small memory operations. Granted, that's not representative of an entire OS with all facets running, and any other OS will also have certain overheads, but it serves as a benchmark to measure against.

It's a bad benchmark because it doesn't take the differences between the environments into account. That's like comparing a normal OS's memcpy with that of an OS with transactional distributed memory. Yes, it will almost certainly be slower on the second one, but you're paying that extra performance cost for a lot of extra functionality. Your kernel wasn't competing for anything, while your Windows program was competing with the GUI, the display driver, the disk driver, the network driver, and anything else that was running on your system. I'm willing to bet that on small memory operations you would get similar results on a Linux or Unix system.

Posted: **Thu Mar 05, 2009 1:59 am**

That's exactly the point of the test, to compare apples with oranges.

I wanted to see what sort of difference it would make to build a single-tasking ring0 kernel, paged or non-paged, as opposed to running a consumer OS which does paging/ring3/task-switching/time-slicing etc., in order to gauge how much that sort of functionality costs.
I'm thinking along the lines of implementing something similar to what we used to use on the Amiga for demo/game coding: some custom startup code that basically told the OS to suspend multitasking and resume it at the end, allowing full control to the running application (we had to do a few other things like storing copper lists, DMA state etc.). In essence the OS provided a mechanism to allow apps to run in a single-tasked mode at supervisor level. So the question was: how much would the code in that app benefit from running in such a way?

Posted: **Thu Mar 05, 2009 2:51 am**

Hi,

I would suggest that if you are interested in how much difference that makes, you could perhaps play around with priorities. I don't know how you would do that in Linux, but in Windows, you can set any task's priority from "Low" to "Realtime" via the task manager. The problem is that (on my system anyway), Windows can very rarely set a task to "Realtime" due to other running processes.

Even if you play around with this system on Windows, you will get very poor repeatability, because it will vary depending on how many tasks are running and so on...

Cheers,
Adam
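One common way to cope with poor repeatability is to run the same benchmark several times and report the fastest run together with the spread, on the assumption that the fastest run is the one least disturbed by other tasks. Nobody in this thread has said they do this, so the following C sketch is only an illustration, and `run_summary` and `summarize_runs` are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Fastest run, slowest run, and the gap between them as a percentage of the fastest. */
typedef struct {
    double min_seconds;
    double max_seconds;
    double spread_percent;
} run_summary;

/* Summarize `count` repeated timings of the same benchmark. */
run_summary summarize_runs(const double *seconds, size_t count)
{
    run_summary s;
    size_t i;

    assert(seconds != NULL && count > 0);

    s.min_seconds = seconds[0];
    s.max_seconds = seconds[0];
    s.spread_percent = 0.0;

    for (i = 1; i < count; i++) {
        if (seconds[i] < s.min_seconds) s.min_seconds = seconds[i];
        if (seconds[i] > s.max_seconds) s.max_seconds = seconds[i];
    }

    /* Leave the spread at zero if the fastest run has no measurable duration. */
    if (s.min_seconds > 0.0) {
        s.spread_percent =
            (s.max_seconds - s.min_seconds) / s.min_seconds * 100.0;
    }
    return s;
}
```

A large spread under Windows and a small one in a single-tasking kernel would itself be a measure of how much the scheduler and other tasks interfere.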

Posted: **Thu Mar 05, 2009 1:40 pm**

johnsa wrote: That's exactly the point of the test, to compare apples with oranges.
I wanted to see what sort of difference it would make to build a single-tasking ring0 kernel, paged or non-paged, as opposed to running a consumer OS which does paging/ring3/task-switching/time-slicing etc., in order to gauge how much that sort of functionality costs.
I'm thinking along the lines of implementing something similar to what we used to use on the Amiga for demo/game coding: some custom startup code that basically told the OS to suspend multitasking and resume it at the end, allowing full control to the running application (we had to do a few other things like storing copper lists, DMA state etc.). In essence the OS provided a mechanism to allow apps to run in a single-tasked mode at supervisor level. So the question was: how much would the code in that app benefit from running in such a way?

And that's why you should post the code, so I can test it on my OS, which is described like this:

From the start, as you would expect from an OS based on a games console OS, optimizing for speed has been of paramount importance in the overall design. To this end there's no virtual memory paging, and only a single process is allowed (though that process can spawn multiple threads). The program runs in ring0, and you have direct access to all hardware (including CPU and graphics). Memory allocation is the responsibility of the app; there's no front-end memory allocation. The entire OS will fit into less than 100k.

Why reinvent the wheel?
Just place your code here:

[code]
;====================================================;
; TEST  .                                            ;
;====================================================;
use32
	ORG   0x400000				       ; where our program is loaded to
	jmp   start				       ; jump to the start of program.
	db    'DEX2'				       ; We check for this, to make sure it is a valid DexOS file.

 ;----------------------------------------------------;
 ; Start of program.                                  ;
 ;----------------------------------------------------;
start:
	mov   ax,18h
	mov   ds,ax
	mov   es,ax
 ;----------------------------------------------------;
 ; Get calltable address.                             ;
 ;----------------------------------------------------;
	mov   edi,Functions			       ; this is the interrupt
	mov   al,0				       ; we use to load the Dex.inc
	mov   ah,0x0a				       ; with the address to dexOS functions.
	int   40h

	mov   esi,Message1
	call  [PrintString]			       ; Print  message1
	call  [WaitForKeyPress]			       ; Wait for a key press
; *** PUT YOUR CODE HERE *****
	ret					               ; Exit.



 ;----------------------------------------------------;
 ; calltable include goes here.                       ;
 ;----------------------------------------------------;
Message1: db ' Press any key to start test',13,0
include 'Dex.inc'					; Dex inc file
[/code]

Assemble it with fasm: c:\fasm test.asm test.dex <enter>
Let me know if you want Dex.inc

Performance Tests Round #1