Page 3 of 3
Re: flat memory model
Posted: Thu Apr 12, 2012 4:26 am
by rdos
Not a valid reference.
1. The figures are measured in Linux, which doesn't even have call gates, and never has had.
2. It could be any nut that put this up. There is not even a domain address!
Re: flat memory model
Posted: Thu Apr 12, 2012 4:35 am
by Combuster
rdos wrote:Not a valid reference.
The world disagrees with that sentiment.
In other words:
This debate is over. You lose.
Re: flat memory model
Posted: Thu Apr 12, 2012 4:58 am
by iansjack
rdos wrote:iansjack wrote:A flat memory model without segmentation doesn't mean that user programs can access kernel memory, any more than they can in a segmented model. Protection is provided at the page level.
My system calls can't fool the kernel into writing or reading kernel memory (well - that's not quite true. Most system calls involve some reading or writing of kernel data by the kernel itself.) That would be poor design if it were possible. I don't think that any of my system calls say "write this data to this memory location", even indirectly. I'm afraid that I just don't see the problems that you refer to as being real.
Example:
WriteFile API function. Takes these parameters:
1. File handle
2. Buffer (a pointer to memory)
3. Size
My implementation:
bx = file handle
es:edi = buffer
ecx = size
es:edi cannot map kernel memory, as the flat ES selector has a limit that is 0xE000000, and all kernel data is above that limit.
OK, I see what you are on about. But it is a trivial check to ensure that buffers passed lie in user data memory. Even easier if all of your user memory occupies a fixed virtual address range, the same for every process. The overhead of that sort of simple comparison is far less than that of loading a segment.
I would say that you are replacing a very simple check with a complicated mechanism that will occupy many more clock cycles.
I guess the answer is that I always thought that segmented memory addressing was a kludge. It was necessary in ancient processors, to allow them to address a decent amount of memory, but it has no place in a modern design.
Re: flat memory model
Posted: Thu Apr 12, 2012 6:02 am
by rdos
Combuster wrote:Not a valid reference.
The world disagrees with that sentiment.
In other words:
This debate is over. You lose.
Your ideas of proof are quite ridiculous.
OK, so I tested this on two different machines.
Source code for experiment:
http://rdos.net/vc/viewvc.cgi/trunk/tes ... iew=markup
First the code calls a local (32-bit near) procedure, then it does a syscall via a call gate. The syscall simply returns with a retf32.
AMD Geode:
near: 5.9 million calls per second.
gate: 4.0 million calls per second.
AMD dual core (modern portable):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.
IOW, on AMD Geode it takes only 40% longer to execute a kernel call compared to a near C call, while on the dual core AMD, the processor executes 7 near C calls in the same time that it executes one kernel call.
Re: flat memory model
Posted: Thu Apr 12, 2012 6:22 am
by iansjack
I'm afraid I would have to question the validity of your test methodology. Apart from the fact that repeated loops like this are not typical of real programs, there is no way for us to know what optimisations your compiler applied when compiling this C code, or how much overhead the other instructions it produces add.
If you are going to try to test the timing of instructions, then I feel you should at least be doing your test in assembler to rule out that second variable. But the tests will still mean very little in real-world terms.
Re: flat memory model
Posted: Thu Apr 12, 2012 7:11 am
by rdos
At least this is a lot more relevant than testing something in Linux, which doesn't support call gates but uses the interrupt method plus all the decoding that goes with it.
But I'll give Owen a point: call gate speed has degraded considerably in modern processors. I'll check with my Intel Atom and 6-core AMD Phenom as well, but I suspect the results will be similar.
Re: flat memory model
Posted: Thu Apr 12, 2012 7:34 am
by gerryg400
rdos wrote:AMD dual core (modern portable):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.
Rdos, is there something wrong with your CPU? In a quick, not very scientific test, on my laptop I can do more than 500 million call/ret pairs in user-space per second.
Code:
#include <stdio.h>
int test_func(void);
volatile long count;
int main(int argc, char *argv[]) {
    long i;
    for (i = 0; i < 500000000; ++i) {
        test_func();
    }
    printf("count=%ld\n", count);
}

int test_func() {
    ++count;
}
disassembled..
Code:
_main:
0000000100000e8c pushq %rbp
0000000100000e8d movq %rsp,%rbp
0000000100000e90 subq $0x20,%rsp
0000000100000e94 movl %edi,0xec(%rbp)
0000000100000e97 movq %rsi,0xe0(%rbp)
0000000100000e9b movq $0x00000000,0xf8(%rbp)
0000000100000ea3 jmp 0x00000eae
0000000100000ea5 callq 0x00000ed5
0000000100000eaa incq 0xf8(%rbp)
0000000100000eae cmpq $0x1dcd64ff,0xf8(%rbp)
0000000100000eb6 jle 0x00000ea5
0000000100000eb8 leaq 0x000001a9(%rip),%rax
0000000100000ebf movq (%rax),%rsi
0000000100000ec2 leaq 0x0000005b(%rip),%rdi
0000000100000ec9 movl $0x00000000,%eax
0000000100000ece callq 0x00000efa
0000000100000ed3 leave
0000000100000ed4 ret
_test_func:
0000000100000ed5 pushq %rbp
0000000100000ed6 movq %rsp,%rbp
0000000100000ed9 leaq 0x00000188(%rip),%rax
0000000100000ee0 movq (%rax),%rax
0000000100000ee3 leaq 0x01(%rax),%rdx
0000000100000ee7 leaq 0x0000017a(%rip),%rax
0000000100000eee movq %rdx,(%rax)
0000000100000ef1 leave
0000000100000ef2 ret
Re: flat memory model
Posted: Thu Apr 12, 2012 8:24 am
by iansjack
Fascinating though this discussion is, it's rather irrelevant to 64-bit programming, which is my interest. Talk of segment limits is irrelevant, as is talk of the ES and DS registers. You use a flat memory model whether you like it or not. Intel and AMD have clearly decided (quite correctly IMO) that the segmented memory model is not relevant to today's processors.
Re: flat memory model
Posted: Thu Apr 12, 2012 10:03 am
by rdos
Results for my 6-core AMD Phenom: (at 2.8 GHz)
near: 44.7 million calls per second
gate: 12.0 million calls per second
And for my 2-core Intel Atom (at 3GHz)
near: 24.4 million calls per second
gate: 2.4 million calls per second
That means that the Phenom, by a large margin, is the best processor for use with RDOS. This is also evident in various test programs.
It would be interesting to test on the Intel Core Duo as well, but I wonder if it would boot.
Re: flat memory model
Posted: Thu Apr 12, 2012 10:07 am
by rdos
gerryg400 wrote:Rdos, is there something wrong with your CPU ? In a quick, not very scientific test, on my laptop I can do more than 500 million call/ret in user-space per second.
I disassembled my code too, and it actually contains 3 calls and a lot of other junk (it was compiled in debug mode with stack checks). That probably explains the discrepancy.
BTW, what speed and processor does your laptop have?
Re: flat memory model
Posted: Thu Apr 12, 2012 12:22 pm
by Kazinsal
rdos wrote:That means that Phenom, by a large margin, is the best processor for usage with RDOS.
How is this relevant to the thread at all?
Re: flat memory model
Posted: Thu Apr 12, 2012 12:50 pm
by Rudster816
rdos wrote:Results for my 6-core AMD Phenom: (at 2.8 GHz)
near: 44.7 million calls per second
gate: 12.0 million calls per second
And for my 2-core Intel Atom (at 3GHz)
near: 24.4 million calls per second
gate: 2.4 million calls per second
That means that Phenom, by a large margin, is the best processor for usage with RDOS. This also is evident in various test programs
It would be interesting to test on the Intel core duo as well, but I wonder if it booted?
You tested two microarchitectures with one very specific (and, some might say, useless) benchmark, and you declare that a Phenom is the best CPU to use with RDOS? Rather bold. I'd be happy to crush your benchmarks if you're willing to provide an image of your OS.
Normal calls without segment/privilege changes should only take a couple of clock cycles. This makes it very difficult to measure with any degree of accuracy how long they actually take in userspace.
After some tweaking of the parameters, this is the best benchmark I could come up with:
Code:
#include <time.h>
#include <iostream>
using namespace std;
int main()
{
    unsigned long long total = 0;
    unsigned long long low = 0xFFFFFFFFFFFFFFF;
    unsigned long long high = 0;
    for (int i = 0; i < 1000; i++)
    {
        clock_t start = clock();
        _asm
        {
            push ecx
            mov ecx, 11000000
        callfunc:
            call function
            sub ecx, 1
            jnz callfunc
            jmp done
        function:
            ret
        done:
            pop ecx
        }
        clock_t end = clock();
        total += end - start;
        if (low > (end - start))
            low = end - start;
        else if (high < (end - start))
            high = end - start;
    }
    cout << "Total: " << total << endl;
    cout << "Average: " << (double)total / 1000.0 << endl;
    cout << "Low: " << low << endl;
    cout << "High: " << high << endl;
    return 0;
}
Running this gave the result:
Total: 16510
Average: 16.51
Low: 2
High: 31
Those times are in milliseconds, BTW (CLOCKS_PER_SEC = 1000). Pretty inconsistent results, but using the low of 2 ms for 11M function calls yields 5.5 billion function calls per second. That doesn't seem possible on my 4 GHz CPU, but I assume it's down to clock() rounding. Even taking 3 ms as the best case for 11M function calls gives 3.66 billion function calls per second, which is a little more reasonable, but still barely over 1 clock cycle per call/ret pair. The average across the whole test was 666 million function calls per second, which is 6 cycles per call/ret pair.
Given this data, I think it's reasonable to say a cached call/ret pair takes ~4 clock cycles on a modern CPU (this test was run on a Nehalem CPU).
Re: flat memory model
Posted: Thu Apr 12, 2012 1:34 pm
by Owen
I confess I misremembered something: CALL near is actually 3 cycles vector path (i.e. nothing may execute in parallel)
Code:
INSTRUCTION OPCODE DECODE TYPE MIN. LATENCY (Where 2 values 32bit-64bit)
CALL disp16/32 (near, displacement) E8h VectorPath 3
CALL pntr16:16/32 (far, direct, no CPL change) 9Ah VectorPath 33
CALL pntr16:16/32 (far, direct, CPL change) 9Ah VectorPath 150
RET near imm16 C2h VectorPath 5
RET near C3h Double 5
RET far imm16 (no CPL change) CAh VectorPath 31–44
RET far imm16 (CPL change) CAh VectorPath 57–72
RET far (no CPL change) CBh VectorPath 31–44
RET far (CPL change) CBh VectorPath 57–72
INT imm8 (CPL change) CDh VectorPath 91–112
IRET, IRETD, IRETQ (from 64-bit to 64-bit) CFh VectorPath 91
IRET, IRETD, IRETQ (from 64-bit to 32-bit) CFh VectorPath 111
IRET, IRETD, IRETQ (from 64-bit to 32-bit) ... no data
MOV sreg, mreg16/32/64 8Eh VectorPath 8
MOV sreg, mem16 8Eh VectorPath 10
SYSCALL 0Fh 05h VectorPath 27
SYSENTER 0Fh 34h VectorPath ~
SYSEXIT 0Fh 35h VectorPath ~
SYSRET 0Fh 07h VectorPath 35
Additional:
- Add 1 cycle to any memory op with SIB addressing where Index+Base+Offset is used & the associated segment has Base != 0 (Adder only 3 operand)
- Add 1 cycle to any relative jump in a code segment which has Base != 0 (Adder only 2 operand)
Citation. I'd use a more recent document... except AMD have stopped telling us (which is still more helpful than Intel, who tell us nothing).