flat memory model

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to check whether your question is answered in the wiki first! When in doubt, post here.
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: flat memory model

Post by rdos »

Combuster wrote:Now for actual verifiable proof:
ftp://202.201.0.209/pub/events/rtlws-20 ... mgartl.pdf
Get your ruler and nitpick what you want
Not a valid reference.

1. The figures were measured on Linux, which doesn't even have call gates, and never has.
2. It could be any nut who put this up. There isn't even a domain name!
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: flat memory model

Post by Combuster »

Not a valid reference.
The world disagrees with that sentiment.

In other words:

This debate is over. You lose.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
iansjack
Member
Posts: 4711
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: flat memory model

Post by iansjack »

rdos wrote:
iansjack wrote:A flat memory model without segmentation doesn't mean that user programs can access kernel memory, any more than they can in a segmented model. Protection is provided at the page level.

My system calls can't fool the kernel into writing or reading kernel memory (well - that's not quite true. Most system calls involve some reading or writing of kernel data by the kernel itself.) That would be poor design if it were possible. I don't think that any of my system calls say "write this data to this memory location", even indirectly. I'm afraid that I just don't see the problems that you refer to as being real.
Example:
WriteFile API function. Takes these parameters:
1. File handle
2. Buffer (a pointer to memory)
3. Size

My implementation:
bx = file handle
es:edi = buffer
ecx = size

es:edi cannot map kernel memory, as the flat ES selector has a limit of 0xE000000, and all kernel data is above that limit.
OK, I see what you are on about. But it is a trivial check to ensure that passed buffers lie in user data memory. It's even easier if all of your user memory occupies a fixed virtual address range, the same for every process. The overhead of that sort of simple comparison is far less than that of loading a segment register.
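For illustration, a minimal sketch of that kind of range check in C; the KERNEL_BASE constant and the validate_user_buffer helper are hypothetical placeholders, not taken from any particular kernel:

Code: Select all

#include <stddef.h>
#include <stdint.h>

/* Hypothetical split: everything below KERNEL_BASE is user space. */
#define KERNEL_BASE 0xC0000000UL

/* Returns nonzero if [buf, buf + size) lies entirely in user space. */
static int validate_user_buffer(const void *buf, size_t size)
{
    uintptr_t start = (uintptr_t)buf;

    if (size == 0)
        return 1;
    if (start >= KERNEL_BASE)           /* starts in kernel space */
        return 0;
    if (size > KERNEL_BASE - start)     /* crosses into kernel space (or wraps) */
        return 0;
    return 1;
}
Two compares and a subtraction per system call, versus reloading a segment register.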

I would say that you are replacing a very simple check with a complicated mechanism that will occupy many more clock cycles.

I guess the answer is that I always thought that segmented memory addressing was a kludge. It was necessary in ancient processors, to allow them to address a decent amount of memory, but it has no place in a modern design.
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: flat memory model

Post by rdos »

Combuster wrote:
Not a valid reference.
The world disagrees with that sentiment.

In other words:

This debate is over. You lose.
Your ideas of proof are quite ridiculous.

OK, so I tested this on two different machines.

Source code for experiment: http://rdos.net/vc/viewvc.cgi/trunk/tes ... iew=markup

First the code calls a local (32-bit near) procedure, then it does a syscall via a call gate. The syscall simply returns with a retf32.

AMD Geode:
near: 5.9 million calls per second.
gate: 4.0 million calls per second.

AMD dual core (modern portable):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.

IOW, on the AMD Geode it takes only about 50% longer to execute a kernel call compared to a near C call, while on the dual-core AMD the processor executes about 7 near C calls in the time it takes to execute one kernel call.
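For readers who cannot reach the linked source, a rough sketch of what such a measurement loop might look like in C; the iteration count, the clock()-based timing and the near_target function are placeholders, and the call-gate variant would replace the near call with a far call through a gate selector, which needs inline assembly and an OS-installed gate:

Code: Select all

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000UL

/* volatile sink so the trivial call target is not optimised away */
volatile unsigned long sink;

static void near_target(void)
{
    sink++;
}

int main(void)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < ITERATIONS; i++)
        near_target();
    clock_t end = clock();

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("near: %.1f million calls per second\n",
           ITERATIONS / seconds / 1e6);
    /* The call-gate case repeats the same loop with a far call
       through the gate selector and divides the same way. */
    return 0;
}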
iansjack
Member
Posts: 4711
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: flat memory model

Post by iansjack »

I'm afraid that I would have to question the validity of your test methodology. Apart from the fact that repeated loops like this are not typical of real programs, there is no way we can know what optimisations your compiler applied when compiling this C code, or what the overhead of the other instructions it produces is.

If you are going to try to test the timing of instructions then I feel you should at least be doing your test in assembler to rule out the second variable. But the tests will still mean very little in real-world terms.
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: flat memory model

Post by rdos »

At least this is a lot more relevant than testing something on Linux, which doesn't support call gates but uses the interrupt method plus all the required decoding that goes with it.

But I'll give Owen a point that call gate speed has degraded considerably on modern processors. I'll check with my Intel Atom and 6-core AMD Phenom as well, but I suspect the results will be similar.
gerryg400
Member
Posts: 1801
Joined: Thu Mar 25, 2010 11:26 pm
Location: Melbourne, Australia

Re: flat memory model

Post by gerryg400 »

rdos wrote:AMD dual core (modern portable):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.
Rdos, is there something wrong with your CPU? In a quick, not very scientific test, on my laptop I can do more than 500 million call/ret in user-space per second.

Code: Select all

#include <stdio.h>

int test_func(void);

volatile long count;

int main(int argc, char *argv[]) {

	long i;

	for (i=0; i<500000000; ++i) {
		test_func();
	}
	printf("count=%ld\n", count);
}

int test_func() {

	++count;
}
Disassembled:

Code: Select all

_main:
0000000100000e8c	pushq	%rbp
0000000100000e8d	movq	%rsp,%rbp
0000000100000e90	subq	$0x20,%rsp
0000000100000e94	movl	%edi,0xec(%rbp)
0000000100000e97	movq	%rsi,0xe0(%rbp)
0000000100000e9b	movq	$0x00000000,0xf8(%rbp)
0000000100000ea3	jmp	0x00000eae
0000000100000ea5	callq	0x00000ed5
0000000100000eaa	incq	0xf8(%rbp)
0000000100000eae	cmpq	$0x1dcd64ff,0xf8(%rbp)
0000000100000eb6	jle	0x00000ea5
0000000100000eb8	leaq	0x000001a9(%rip),%rax
0000000100000ebf	movq	(%rax),%rsi
0000000100000ec2	leaq	0x0000005b(%rip),%rdi
0000000100000ec9	movl	$0x00000000,%eax
0000000100000ece	callq	0x00000efa
0000000100000ed3	leave
0000000100000ed4	ret
_test_func:
0000000100000ed5	pushq	%rbp
0000000100000ed6	movq	%rsp,%rbp
0000000100000ed9	leaq	0x00000188(%rip),%rax
0000000100000ee0	movq	(%rax),%rax
0000000100000ee3	leaq	0x01(%rax),%rdx
0000000100000ee7	leaq	0x0000017a(%rip),%rax
0000000100000eee	movq	%rdx,(%rax)
0000000100000ef1	leave
0000000100000ef2	ret
If a trainstation is where trains stop, what is a workstation ?
iansjack
Member
Posts: 4711
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: flat memory model

Post by iansjack »

Fascinating though this discussion is, it's rather irrelevant to 64-bit programming, which is my interest. Talk of segment limits is irrelevant, as is talk of the ES and DS registers. You use a flat memory model whether you like it or not. Intel and AMD have clearly decided (quite correctly IMO) that the segmented memory model is not relevant to today's processors.
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: flat memory model

Post by rdos »

Results for my 6-core AMD Phenom: (at 2.8 GHz)
near: 44.7 million calls per second
gate: 12.0 million calls per second

And for my 2-core Intel Atom (at 3GHz)
near: 24.4 million calls per second
gate: 2.4 million calls per second

That means that the Phenom, by a large margin, is the best processor for use with RDOS. This is also evident in various test programs :wink:

It would be interesting to test on the Intel Core Duo as well, but I'm not sure it will boot.
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: flat memory model

Post by rdos »

gerryg400 wrote:Rdos, is there something wrong with your CPU? In a quick, not very scientific test, on my laptop I can do more than 500 million call/ret in user-space per second.
I disassembled my code too, and it actually contains 3 calls and a lot of other junk (it was compiled in debug mode with stack checks). That probably explains the discrepancy.

BTW, what speed and processor does your laptop have?
Kazinsal
Member
Posts: 559
Joined: Wed Jul 13, 2011 7:38 pm
Libera.chat IRC: Kazinsal
Location: Vancouver

Re: flat memory model

Post by Kazinsal »

rdos wrote:That means that Phenom, by a large margin, is the best processor for usage with RDOS.
How is this relevant to the thread at all?
Rudster816
Member
Posts: 141
Joined: Thu Jun 17, 2010 2:36 am

Re: flat memory model

Post by Rudster816 »

rdos wrote:Results for my 6-core AMD Phenom: (at 2.8 GHz)
near: 44.7 million calls per second
gate: 12.0 million calls per second

And for my 2-core Intel Atom (at 3GHz)
near: 24.4 million calls per second
gate: 2.4 million calls per second

That means that the Phenom, by a large margin, is the best processor for use with RDOS. This is also evident in various test programs :wink:

It would be interesting to test on the Intel Core Duo as well, but I'm not sure it will boot.
You tested two microarchitectures with one very specific (and, some might say, useless) benchmark, and you declare that a Phenom is the best CPU to use with RDOS? Rather bold. I'd be happy to crush your benchmarks if you're willing to provide an image of your OS for me 8)


Normal calls without segment/privilege changes should only take a couple of clock cycles. This makes it very difficult to measure with any degree of accuracy how long they actually take in userspace.

After some tweaking of the parameters, this is the best benchmark I could come up with

Code: Select all

#include <time.h>
#include <iostream>
using namespace std;


int main()
{
	unsigned long long total = 0;
	unsigned long long low = 0xFFFFFFFFFFFFFFF;
	unsigned long long high = 0; 

	for (int i = 0; i < 1000; i++)
	{
		clock_t start = clock();

		_asm
		{
			push ecx
			mov ecx, 11000000
	
			callfunc:
				call function
				sub ecx, 1
				jnz callfunc
				jmp done

			function:
				ret
			done:
				pop ecx
		}

		clock_t end = clock();
		total += end - start;
		if (low > (end - start))
			low = end - start;
		if (high < (end - start))
			high = end - start;
	}

	cout << "Total: " << total << endl;
	cout << "Average: " << (double)total / 1000.0 << endl;
	cout << "Low: " << low << endl;
	cout << "High: " << high << endl;

	return 0;
}

Running this gave the result:
Total: 16510
Average: 16.51
Low: 2
High: 31

Those times are in milliseconds BTW (CLOCKS_PER_SEC = 1000). Pretty inconsistent results, but using the low of 2 ms for 11M function calls yields 5.5 billion function calls per second. That doesn't seem possible on my 4 GHz CPU, but I assume that's because of clock() rounding. But even taking 3 ms as the best case for 11M function calls gives us 3.66 billion function calls per second, which is a little bit more reasonable, but it's still barely over 1 clock cycle per call/ret pair. The average across the whole test was 666 million function calls per second, which is 6 cycles per call/ret pair.

Given this data, I think it's reasonable to say a cached call/ret pair takes ~4 clock cycles on a modern CPU (this test was run on a Nehalem CPU).
User avatar
Owen
Member
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: flat memory model

Post by Owen »

I confess I misremembered something: CALL near is actually 3 cycles vector path (i.e. nothing may execute in parallel)

Code: Select all

INSTRUCTION                                      OPCODE   DECODE TYPE  MIN. LATENCY (where 2 values: 32-bit / 64-bit)
CALL disp16/32 (near, displacement)              E8h      VectorPath   3
CALL pntr16:16/32 (far, direct, no CPL change)   9Ah      VectorPath   33
CALL pntr16:16/32 (far, direct, CPL change)      9Ah      VectorPath   150
RET near imm16                                   C2h      VectorPath   5
RET near                                         C3h      Double       5
RET far imm16 (no CPL change)                    CAh      VectorPath   31-44
RET far imm16 (CPL change)                       CAh      VectorPath   57-72
RET far (no CPL change)                          CBh      VectorPath   31-44
RET far (CPL change)                             CBh      VectorPath   57-72

INT imm8 (CPL change)                            CDh      VectorPath   91-112
IRET, IRETD, IRETQ (from 64-bit to 64-bit)       CFh      VectorPath   91
IRET, IRETD, IRETQ (from 64-bit to 32-bit)       CFh      VectorPath   111
IRET, IRETD, IRETQ (from 64-bit to 32-bit)       ...      no data

MOV sreg, mreg16/32/64                           8Eh      VectorPath   8
MOV sreg, mem16                                  8Eh      VectorPath   10

SYSCALL                                          0Fh 05h  VectorPath   27
SYSENTER                                         0Fh 34h  VectorPath   ~
SYSEXIT                                          0Fh 35h  VectorPath   ~
SYSRET                                           0Fh 07h  VectorPath   35
Additional:
  • Add 1 cycle to any memory op with SIB addressing where Index+Base+Offset is used and the associated segment has Base != 0 (the address adder only takes three operands)
  • Add 1 cycle to any relative jump in a code segment which has Base != 0 (the adder only takes two operands)
Citation. I'd use a more recent document... except AMD have stopped telling us (which is still more helpful than Intel, who tell us nothing).