Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best OS!
Re: Best processor for 32-bit OS
Hi rdos,
I have:
(1) AMD Phenom II X6, 64-bit (hex core), 2.8 GHz
(2) AMD Sempron, 32-bit single core, roughly 2 GHz
(3) Intel dual core, 32-bit, 2.2 GHz
(4) AMD K6-2, 500 MHz
(5) Intel Celeron, 900 MHz (in my Asus Eee PC)
Please let me know which tests you want me to run. If you provide images, I will test them for you. I can try (1) and (5) immediately, but the others are in my home town, and I may not get there for another 2 weeks. Is that okay?
--Thomas
EDIT - Found the location of the images. If you are interested, I will try (1) and (5) this evening or tomorrow.
Re: Best processor for 32-bit OS
brain wrote: This is all well and good, but compiler wars aside, benchmarks are not a good indicator of real-world performance. In real code the syscalls should be few and far between, meaning any performance issues you note through these tests should be pretty much a non-issue for all but corner cases.
I'm afraid not. When I used the performance monitor on a real-world application (our terminal), it was obvious that it loaded the AMD E-300 portable a lot more than the AMD Geode, even though the former runs at 1400MHz and the latter at 500MHz. I first assumed it had to do with reading the LFB buffer, but even after this was removed, the difference remained more or less the same. The reason is probably the lousy implementation of segmentation in the AMD E-300 and some other processors.
Re: Best processor for 32-bit OS
Thomas wrote: Hi rdos,
I have:
(1) AMD Phenom II X6, 64-bit (hex core), 2.8 GHz
(2) AMD Sempron, 32-bit single core, roughly 2 GHz
(3) Intel dual core, 32-bit, 2.2 GHz
(4) AMD K6-2, 500 MHz
(5) Intel Celeron, 900 MHz (in my Asus Eee PC)
Please let me know which tests you want me to run. If you provide images, I will test them for you. I can try (1) and (5) immediately, but the others are in my home town, and I may not get there for another 2 weeks. Is that okay?
--Thomas
EDIT - Found the location of the images. If you are interested, I will try (1) and (5) this evening or tomorrow.
Sure. Try as many as you want, although (1) seems a lot like the one I already tested, which so far has the best RDOS performance.
Re: Best processor for 32-bit OS
Hi,
Brendan wrote: Why are you testing call gates on modern CPUs?
rdos wrote: Because call gate performance is the best estimator of how fast RDOS (and any other older OS, like Windows XP), will perform on a particular processor. It is also the best estimator of interrupt latencies and interrupt performance, as using gates in IDT is the only way of handling interrupts in 32-bit mode.
Windows XP supports SYSENTER. Considering that XP was released 11 years ago (2001) that's not too bad.
Brendan wrote: About 15 years ago, Intel and AMD realised that call gates suck (due to multiple GDT lookups and protection checks) and created "less silly" alternatives. Unfortunately Intel's alternative (SYSENTER) wasn't the same as AMD's alternative (SYSCALL) and for a while each manufacturer wouldn't/didn't support the other's competing alternative. For modern CPUs this isn't a problem - all modern CPUs support SYSENTER in 32-bit code and SYSCALL in long mode.
rdos wrote: SYSENTER/SYSCALL is not an alternative for me. While I haven't tested, I'm pretty sure that a SYSENTER followed by a far call in kernel is not faster than a call gate on a modern processor.
I don't believe you. Why would you need to do a far call after SYSENTER has loaded CS with the kernel's (initial?) code segment; but not have to do a far call after a call gate has loaded CS with the kernel's (initial?) code segment? Either the far call is needed regardless, or the far call isn't needed.
The only other scenario I could imagine is that RDOS uses multiple different call gates for multiple different APIs (with multiple different code segments). If this is the case, then you could still use SYSENTER for the most frequently used API (and continue using the other call gates for the other APIs).
Brendan wrote: To support old CPUs it's easy to emulate SYSENTER (and/or SYSCALL) within the invalid opcode handler, and to support old software it's easy to also support call gates; but legacy stuff like this isn't really important for performance tests.
rdos wrote: Why would anybody want to do such a bad thing? On older processors, at least up to AMD Geode, using call gates is the fastest method of calling kernel.
There are 4 cases:
- The normal/expected case (applications and CPUs support SYSENTER/SYSCALL). This probably covers 99.9% of everything, so caring about the performance of other cases is just plain stupid.
- The first "obsolete piece of crud" case (applications don't support SYSENTER/SYSCALL even though the CPU does). In this case the kernel provides the crusty old call gate interface anyway; and while you don't get the performance improvement you don't care as it's not important (and you don't get any extra overhead either).
- The second "obsolete piece of crud" case (applications support SYSENTER/SYSCALL but the CPU doesn't). In this case the kernel emulates the SYSCALL/SYSENTER instruction; and while you don't get the performance improvement (and you do get extra overhead) you don't care because it's not important. The important thing is to prevent stupid application developers from using the call gate interface "just in case", and hurting performance for the normal/expected case where it does matter.
- The third "obsolete piece of crud" case (neither the applications nor the CPUs support SYSENTER/SYSCALL). In this case you can't do anything to improve performance (but wouldn't be doing anything to hurt performance by adding "unusable support" for SYSENTER/SYSCALL either).
rdos wrote: What I would want to do is to put up performance lists for people wanting to run legacy OSes (and RDOS), and warn them that some modern processors really suck at this and should be avoided.
For legacy OSs you should be benchmarking SYSENTER (the instruction is 15 years old, and is officially deprecated in favour of SYSCALL and long mode). You should be benchmarking interrupts too (as they're important for IRQs, and software interrupts can also be good for reducing the size of "less frequently used" code in modern applications).
I'd also suggest benchmarking SYSENTER and SYSCALL so that you've got a complete picture of how badly call gates are hurting performance. That way you'd be able to take your finding to your customers and show them why they shouldn't be using RDOS (and why they should pay you some extra $$$ for the new and improved "RDOS 2").
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Best processor for 32-bit OS
Brendan wrote: I don't believe you. Why would you need to do a far call after SYSENTER has loaded CS with the kernel's (initial?) code segment; but not have to do a far call after a call gate has loaded CS with the kernel's (initial?) code segment? Either the far call is needed regardless, or the far call isn't needed.
The kernel has only one CS, but every device driver lives in its own CS. That's why any call to a device driver that entered the kernel with SYSENTER would need a far call to reach its final destination.
Brendan wrote:The only other scenario I could imagine is that RDOS uses multiple different call gates for multiple different APIs (with multiple different code segments). If this is the case, then you could still use SYSENTER for the most frequently used API (and continue using the other call gates for the other APIs).
That has occurred to me. However, one problem is deciding which module should set up the default CS that SYSENTER uses. While the kernel is an attractive option (and also has a lot of defined syscalls), the graphics device driver would probably be the best option performance-wise.
Another problem is that all syscall-handlers end with a retf. In order to support the SYSENTER interface, all procedures must support returning with a retn, otherwise the benefit would cease to exist. Certainly possible, but requires parts of the code to be reworked.
Brendan wrote: The normal/expected case (applications and CPUs support SYSENTER/SYSCALL). This probably covers 99.9% of everything, so caring about the performance of other cases is just plain stupid.
Actually, applications know absolutely nothing about how to make a syscall. They are linked with an invalid call (call 002:0000xxxx). This protection-faults on first use and is then replaced by the call gate. It would be possible to replace it with a SYSENTER plus some identification of the function number instead. The kernel could decide what to replace it with, and the application wouldn't care.
Brendan wrote: I'd also suggest benchmarking SYSENTER and SYSCALL so that you've got a complete picture of how badly call gates are hurting performance.
I might do.
Brendan wrote: That way you'd be able to take your finding to your customers and show them why they shouldn't be using RDOS (and why they should pay you some extra $$$ for the new and improved "RDOS 2").
More likely they would switch to Linux.
Re: Best processor for 32-bit OS
Having read up on SYSENTER (and SYSCALL), I've more or less concluded that my original statement was true. They would not be of any benefit.
List of the problems:
1. SS:ESP is loaded with a fixed value regardless of thread (how about core?), and interrupts are disabled (as if cli were executed). This means the kernel SS:ESP must be manually loaded from the TSS, since I don't want syscalls to run to completion with interrupts disabled (the same kernel stack is used for all threads).
2. Upon return, CS and SS are loaded with a zero base. This would only work in the case when the caller is a flat application (the usual case), and when the application has a zero base (mostly also the usual case)
3. CS is loaded with a zero base. This basically means that ALL entrypoints must do a far call, since neither kernel nor any device-driver executes in a code segment with a zero base. Additionally, limit checking in the code segment of the syscalls in question would be disabled unless a far call is executed.
4. ECX and EDX are used in the interface. This means that EAX (function number), ECX and EDX must be saved and restored on the application stack. This creates overhead that is not otherwise necessary.
All in all, the segment registers (SS and CS) that SYSENTER aims not to have to load from descriptor tables would need to be loaded anyway both on entry, and sometimes also on exit depending on configuration. Additionally, considerable overhead with loading function number in EAX, and preserving and restoring registers during/after the function call, means it is almost a given that this will not improve anything.
An interesting alternative might be to only use the SYSEXIT instruction, as it maps more cleanly to the environment. After all, it could load the correct selectors (if they have a zero base). However, that would require rebuilding the return stack, and having a special clean-up handler at the user end that pops ECX and EDX.
Last edited by rdos on Fri Apr 13, 2012 8:09 am, edited 1 time in total.
Re: Best processor for 32-bit OS
Brendan wrote: Why are you testing call gates on modern CPUs?
rdos wrote: Because call gate performance is the best estimator of how fast RDOS (and any other older OS, like Windows XP), will perform on a particular processor. It is also the best estimator of interrupt latencies and interrupt performance, as using gates in IDT is the only way of handling interrupts in 32-bit mode.
Windows XP (NT 5.1) uses SYSENTER/SYSEXIT. Windows XP x64/Windows Server 2003 x64 (NT 5.2) uses SYSCALL/SYSRET. Whether NT 5.2 added support for SYSCALL/SYSRET in 32-bit code I don't know.
(EDIT: How the hell did I not notice the second page? -_-)
Last edited by Owen on Fri Apr 13, 2012 9:33 am, edited 1 time in total.
Re: Best processor for 32-bit OS
This thread should be renamed to "Best processor for RDOS" to avoid confusion, since the statements and statistics apply to RDOS only but the results are somehow wrongly generalized to all 32-bit OSes.
It would still be interesting to know what affects RDOS and what rdos has to consider, and it would free us from arguing about a universally best processor, which simply does not make sense (for the same reason that there is no universally best language).
Re: Best processor for 32-bit OS
bluemoon wrote: This thread should be renamed to Best processor for RDOS to avoid confusion. Since the statements and statistics apply to RDOS only, but somehow the result is wrongly generalized to all 32-bit OSes.
While call gate performance doesn't affect all 32-bit OSes directly, it does so indirectly because of the link to interrupt performance. All the specs for instruction timings indicate a strong correlation between call gate performance and interrupt gate performance. If a processor sucks on call gates, it will also suck on interrupt performance, which is a strong reason to avoid it.
Re: Best processor for 32-bit OS
There is no evidence listed that other major OSes are bottlenecked by call gates or interrupt gates.
As others mentioned, those OSes probably avoid gates where possible, so there is not much to be concerned about.
In short, you're right about RDOS, and I'm pretty interested in the statistics as well;
but I'd say there is a lack of evidence that the results cover other OSes, until someone bored enough runs the tests on those OSes.
Re: Best processor for 32-bit OS
Another sample:
1-core AMD Athlon, 1.2GHz:
near: 15.3 million calls per second
gate: 3.8 million calls per second
Re: Best processor for 32-bit OS
I did a rough test on my Mac Pro (hexa-core, 3.33GHz) in the Mac's terminal and got 511 million near calls per second:
Code:
$ ./a
Near: 511118865
Why do the numbers differ so much?
By the way, I can't run the gate benchmark on the Mac.
The C code below is similar to your test case and was linked with gcc-apple-4.2 -lpthread -o a a.c. The assembler output after it was generated with gcc-apple-4.2 -S a.c -o a.S.
Code:
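#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

/* Shared phase flag; volatile so the busy-wait loops re-read it on every pass. */
volatile int sync_val = 0;

/* Timer thread: opens a one-second measurement window (phase 1 to phase 2). */
void* sync_thread(void* param) {
    sync_val = 1;
    usleep(1000*1000);
    sync_val = 2;
    usleep(1000*1000);
    sync_val = 3;
    return NULL;
}

/* Empty procedure whose call overhead is being measured. */
void NullProc() {
}

int main() {
    int near_count = 0;
    pthread_t thread;
    pthread_create( &thread, NULL, sync_thread, NULL );
    /* Wait for the timer thread to open the window. */
    while (sync_val != 1)
        ;
    /* Count near calls until the window closes. */
    while (sync_val != 2) {
        NullProc();
        near_count++;
    }
    printf ("Near: %d\n", near_count );
    return 0;
}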
Assembler output (gcc-apple-4.2 -S a.c -o a.S):
.globl _main
_main:
LFB6:
pushq %rbp
LCFI5:
movq %rsp, %rbp
LCFI6:
subq $16, %rsp
LCFI7:
movl $0, -4(%rbp)
leaq -16(%rbp), %rdi
movl $0, %ecx
leaq _sync_thread(%rip), %rdx
movl $0, %esi
call _pthread_create
L6:
movl _sync_val(%rip), %eax
cmpl $1, %eax
jne L6
jmp L8
L9:
movl $0, %eax
call _NullProc
incl -4(%rbp)
L8:
movl _sync_val(%rip), %eax
cmpl $2, %eax
jne L9
movl -4(%rbp), %esi
leaq LC0(%rip), %rdi
movl $0, %eax
call _printf
movl $0, %eax
leave
ret
Re: Best processor for 32-bit OS
bluemoon wrote: I do a rough test on my mac pro hexa-core 3.33GHz within mac's terminal, and got 511 million near call per second.
Why the number differ so much?
Quite likely because of all the junk that Watcom generates in NullProc. It has a stack-check call, a grow call inside the stack-check procedure, and then some more junk in the body. I could compile an optimized version, but then the other benchmarks I've already done would not be comparable. For me, it is the gate result that is important.
Re: Best processor for 32-bit OS
I've dug up some of my very old CPU boards.
I've found my 386SX motherboard. I'd need real luck to get it to even boot, much less to run RDOS, as I'd need the floating point emulator since it doesn't have a numeric processor. No recent Windows or Linux would probably run on it. It only supports 4MB RAM.
Then I found my Cyrix 486DX motherboard. That one at least has a math processor, but the memory limits are similar.
If I can make any of these boot, I'll post the results.
Re: Best processor for 32-bit OS
My last one at home (apart from the really old ones):
1-core Intel Celeron (2.66GHz)
near: 16.3 million calls per second
gate: 3.0 million calls per second