flat memory model

rdos · Post by **rdos** » Thu Apr 12, 2012 1:51 am

Owen wrote:The best case I've seen for syscall is about 20 cycles, similar for sysret, and this is a true cost best case, because syscall always hits microcode and can't be executed in parallel with any other instruction. Additionally, the branch predictor doesn't predict over a syscall, so it's very possible that the destination TLB entry is not present and that the instructions are not in cache. And, of course, the kernel can't trust any addresses it has been given by userspace, so there is additional overhead to deal with there.

INT 0xNN and CALL FAR via call gate both tend to be ~100 cycles, with IRET being similar. Segment loads and privilege checks are expensive operations.

I think you are wrong. Getting the instruction timings for modern processors is kind of hard, but I have the timings for 386DX:

CALL:
within segment: 7 + m (m is number of components of the next instruction)
to different segment: 34 + m
via gate to same privilege: 52 + m
via gate to different privilege (no parameters): 86 + m
via gate to different privilege (x parameters): 94 + 4x + m

RET:
within segment: 10 + m
to different segment: 32 + m
to different privilege: 69

As can be seen, not even on the 386DX, call / ret with callgates take 100 cycles, so I suspect that Owen is wrong about that. If not, please state the processor you claim has this characteristic, along with a PDF that proves your point.

Using interrupts is even worse:

INT xx:
To same privilege: 59
To different privilege: 99

IRET:
Same privilege: 38
Different privilege: 82

Since the 386DX doesn't have sysenter and sysexit, I cannot compare with that.

With a 386DX, you can do just above 10 calls within segment in the same time you do one call to kernel, which is not that bad. I hardly think that 64-bit mode has better relative performance here, because if you use sysenter / sysexit or interrupts, you also need code to validate pointers and decode destinations, which probably means you end up in the same range. The validations that hardware does for you with call gates need to be done in software in long mode and flat mode, and this must be taken into account.
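The comparison above can be redone as a call-plus-return round trip. The sketch below only re-adds the 386DX figures quoted in this post; the helper names are made up for illustration, and the value of m is an assumption supplied by the caller. With m = 1 the round trips come to 19 and 156 cycles (a ratio of about 8), and with m = 0 to 17 and 155 (about 9).

```c
#include <assert.h>

/* 386DX cycle counts as quoted in the post above.
 * m is "number of components of the next instruction"; its value is an
 * assumption passed in by the caller, not something the post specifies. */
struct call_cost {
    int call; /* cycles for the CALL */
    int ret;  /* cycles for the matching RET */
};

/* CALL/RET within segment: 7 + m and 10 + m. */
static struct call_cost near_call_386(int m)
{
    assert(m >= 0);
    struct call_cost c = { 7 + m, 10 + m };
    return c;
}

/* CALL via gate to different privilege (no parameters): 86 + m,
 * RET to different privilege: 69. */
static struct call_cost gate_call_386(int m)
{
    assert(m >= 0);
    struct call_cost c = { 86 + m, 69 };
    return c;
}

static int round_trip(struct call_cost c)
{
    return c.call + c.ret;
}

/* How many near call/return pairs fit into one gate call/return pair. */
static double gate_to_near_ratio_386(int m)
{
    return (double)round_trip(gate_call_386(m)) /
           (double)round_trip(near_call_386(m));
}
```

Counting only the CALL and not the RET gives a somewhat higher ratio, so the exact figure depends on what is being compared.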

rdos · Post by **rdos** » Thu Apr 12, 2012 2:16 am

iansjack wrote:Protection of memory at the malloc() level is certainly a problem, but I can't see that segmentation is much help here. You're hardly going to create a new segment every time you malloc() 1 byte, are you?

In debug mode, I actually do. In production, I won't, because the cost is too high, and I might run out of selectors. That is in kernel. Today I have no 32-bit executable format for applications that would allow me to allocate an LDT selector for each malloc, but I will work on implementing one in the near future. I think it might be possible to use a 32-bit compact memory model with an executable format for a flat memory model (for instance PE). This is because the segmentation issues are in the compiler and runtime library, and not in the executable, and the compact memory model has one code segment and one data segment just as the flat memory model has. The difference is the handling of malloc, and the segment setups. While the flat memory model would set up overlapping code and data segments, the compact memory model would create a code selector that starts at zero, with a precise limit, and a data selector that starts at zero, with a precise limit. That way, the application cannot use the code selector to execute code outside of the code segment, and cannot use the data selector to access anything outside of the default data segment. Additionally, it would be good to map the first page(s) in the code and data segments as invalid with paging to catch null pointers.

iansjack wrote:It's a two-fold problem. One is preventing buffer overruns, but the more serious problem is protecting the data structures that keep track of malloc() allocations. In my OS I use linked lists with the link being immediately in front of the allocated memory. But any buffer overrun here will corrupt the whole data chain. (At least it is only the data chain for the one process. You might even argue that once a buffer overrun has happened the most favourable outcome is for the process to immediately crash.)

The implementations I've seen put the chains in application memory (usually before the allocated block), and don't use syscalls other than to allocate large chunks of memory. Using a syscall for every malloc would be terribly inefficient.
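As a rough illustration of the layout being discussed (bookkeeping stored immediately in front of each allocated block, with the kernel only asked for large chunks), here is a minimal sketch. Everything in it is hypothetical: `toy_malloc`, `toy_header`, the static `arena` standing in for a chunk obtained from the kernel, and the absence of any free operation. It is not the allocator of any OS mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping that sits directly in front of every payload. */
struct block_hdr {
    struct block_hdr *next; /* previously allocated block */
    size_t size;            /* payload size requested by the caller */
};

#define ARENA_SIZE 4096
/* Round n up to the strictest fundamental alignment. */
#define ALIGN_UP(n) \
    (((n) + _Alignof(max_align_t) - 1) & ~(_Alignof(max_align_t) - 1))

/* Stand-in for one large chunk obtained from the kernel. */
static _Alignas(max_align_t) unsigned char arena[ARENA_SIZE];
static size_t arena_used;
static struct block_hdr *block_list;

static void *toy_malloc(size_t size)
{
    size_t hdr = ALIGN_UP(sizeof(struct block_hdr));

    if (size > ARENA_SIZE)
        return NULL; /* also keeps ALIGN_UP(size) from overflowing */

    size_t total = hdr + ALIGN_UP(size);
    if (total > ARENA_SIZE - arena_used)
        return NULL; /* a real allocator would request another chunk here */

    struct block_hdr *h = (struct block_hdr *)(arena + arena_used);
    h->next = block_list;
    h->size = size;
    block_list = h;
    arena_used += total;
    return (unsigned char *)h + hdr; /* payload starts right after the header */
}

/* Recover the header from a payload pointer. */
static struct block_hdr *toy_header(void *p)
{
    assert(p != NULL);
    return (struct block_hdr *)((unsigned char *)p -
                                ALIGN_UP(sizeof(struct block_hdr)));
}
```

Because the next block's header begins right where the previous payload (plus padding) ends, writing past the end of one allocation lands in the next header, which is the chain corruption iansjack describes.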

iansjack · Post by **iansjack** » Thu Apr 12, 2012 2:22 am

if you use sysenter / sysexit or interrupts, you also need code to validate pointers and decode destinations

I'm not sure what you mean by this. What pointers do you need to validate, and what destinations do you need to decode (other than a simple indirect JMP in your system calls routine)? SYSCALL is a very efficient instruction (timings as mentioned by Owen excepted) and doesn't even require any stack access, unlike subroutines or interrupts. (Although you will probably be calling subroutines from the system call routines.)

rdos · Post by **rdos** » Thu Apr 12, 2012 2:33 am

iansjack wrote:
if you use sysenter / sysexit or interrupts, you also need code to validate pointers and decode destinations
I'm not sure what you mean by this. What pointers do you need to validate, and what destinations do you need to decode (other than a simple indirect JMP in your system calls routine)? SYSCALL is a very efficient instruction (timings as mentioned by Owen excepted) and doesn't even require any stack access, unlike subroutines or interrupts. (Although you will probably be calling subroutines from the system call routines.)

Typical ways of doing this include using some register for call number (say EAX). Then you need to load EAX with the syscall number in the calling routine. In the syscall entry in kernel you need to validate EAX (because it might be out-of-bounds), and then jmp / call to the server routine from some call table. This is all done for you by hardware with call-gates. Additionally, you use one register for syscall index, which means that if you use register call conventions, there is one less register available for parameters. If you use a flat memory model, you also need to validate that all pointers passed to kernel are valid.
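The register-indexed dispatch described here can be sketched in C. This is only an illustration of the bounds check plus table jump; the handler names, the three-argument signature, and the `SYS_ENOSYS` error value are assumptions, and a real kernel entry point would be written around the actual SYSCALL / SYSENTER / INT entry code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYS_ENOSYS (-1) /* hypothetical "no such call" error code */

typedef int32_t (*syscall_fn)(uint32_t a0, uint32_t a1, uint32_t a2);

/* Two placeholder handlers, purely for illustration. */
static int32_t sys_nop(uint32_t a0, uint32_t a1, uint32_t a2)
{
    (void)a0; (void)a1; (void)a2;
    return 0;
}

static int32_t sys_add(uint32_t a0, uint32_t a1, uint32_t a2)
{
    (void)a2;
    return (int32_t)(a0 + a1);
}

/* The call table indexed by the syscall number. */
static const syscall_fn syscall_table[] = { sys_nop, sys_add };
#define SYSCALL_COUNT (sizeof syscall_table / sizeof syscall_table[0])

/* nr plays the role of EAX: it comes from user space, so it is
 * range-checked before it is used as a table index. */
static int32_t syscall_dispatch(uint32_t nr, uint32_t a0, uint32_t a1,
                                uint32_t a2)
{
    if (nr >= SYSCALL_COUNT)
        return SYS_ENOSYS;
    assert(syscall_table[nr] != NULL);
    return syscall_table[nr](a0, a1, a2);
}
```

The software cost being argued about is the compare, the conditional branch, and the indirect call in `syscall_dispatch`; with a call gate the selector lookup takes the place of that table index.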

Using the 386DX as a reference, it is clear that call gates are superior performance-wise: they are faster than ints, and they are done with their task before the int variant is, even disregarding the extra validation needed with ints.

I'm sure the picture is different for modern processors, but on old (386 and 486), and possibly half-old (Pentium), call gates are the superior method.

Combuster · Post by **Combuster** » Thu Apr 12, 2012 2:42 am

rdos wrote:As can be seen, not even on the 386DX, call / ret with callgates take 100 cycles, so I suspect that Owen is wrong about that. If not, please state the processor you claim has this characteristic, along with a PDF that proves your point.

All your facts come within 10% of Owen's rough estimate. I think that's proof enough.

rdos · Post by **rdos** » Thu Apr 12, 2012 2:49 am

Combuster wrote:
rdos wrote:As can be seen, not even on the 386DX, call / ret with callgates take 100 cycles, so I suspect that Owen is wrong about that. If not, please state the processor you claim has this characteristic, along with a PDF that proves your point.
All your facts come within 10% of Owen's rough estimate. I think that's proof enough.

Owen wasn't talking about the 386DX, so your comparison to his estimates means nothing.

BTW, the 7 + m cycle timing for the within-segment call comes nowhere near Owen's estimate, and I'm 100% sure all these timings are shorter on the 486DX, but I cannot find my manual right now.

Unless you can provide reference documentation, I don't believe you. Facts come from proofs, which are in the instruction timings of processors. So let's see the "facts"!

Combuster · Post by **Combuster** » Thu Apr 12, 2012 2:58 am

Owen wrote:INT 0xNN and CALL FAR via call gate both tend to be ~100 cycles, with IRET being similar

rdos wrote:Unless you can provide a reference documentation

I wrote:All your facts come within 10% of Owen's rough estimate

rdos wrote:the timings for 386DX

CALL
via gate to different privilege (no parameters): 86 + m
via gate to different privilege (x parameters): 94 + 4x + m
INT
To different privilege: 99
IRET
Different privilege: 82

rdos · Post by **rdos** » Thu Apr 12, 2012 3:08 am

I've found a lot more interesting instruction timings now. These are for AMD Geode, a popular processor in modern low-end embedded boards.

CALL:
within segment: 3
different segment: 14
via gate, same privilege: 24
via gate, different privilege: 45
via gate, with parameters: 51 + 2x

RET:
within segment: 3
different segment, same privilege: 13
different segment, different privilege: 35

INT xx:
same privilege: 33
different privilege: 55

IRET:
same privilege: 20
different privilege: 39

sysenter/sysexit is not supported.

Here, the relationship between intertask calls and kernel calls is also around 10, possibly a little lower. Call gates are considerably faster than ints, disregarding the extra decoding needed.
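For reference, the Geode figures listed in this post can be added up as enter-plus-leave round trips. The sketch below uses only the numbers quoted above; the constant and function names are made up for illustration. The sums come to 6 cycles for a near call/return, 80 for a call gate to a different privilege level (without parameters), and 94 for INT/IRET to a different privilege level, so the gate saves roughly 15% over the interrupt.

```c
#include <assert.h>

/* AMD Geode cycle counts as quoted in the post above. */
enum {
    GEODE_NEAR_CALL = 3,  /* CALL within segment */
    GEODE_NEAR_RET  = 3,  /* RET within segment */
    GEODE_GATE_CALL = 45, /* CALL via gate, different privilege */
    GEODE_GATE_RET  = 35, /* RET, different segment and privilege */
    GEODE_INT       = 55, /* INT xx, different privilege */
    GEODE_IRET      = 39  /* IRET, different privilege */
};

/* Cost of entering and leaving again. */
static int geode_round_trip(int enter, int leave)
{
    assert(enter > 0 && leave > 0);
    return enter + leave;
}

/* Whole-percent saving of the faster path, rounded toward zero. */
static int percent_saved(int slower, int faster)
{
    assert(slower > 0);
    return (slower - faster) * 100 / slower;
}
```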

Combuster, do you regard these timings to be within 10% of Owen's estimate as well?

iansjack · Post by **iansjack** » Thu Apr 12, 2012 3:10 am

rdos wrote:
iansjack wrote:
if you use sysenter / sysexit or interrupts, you also need code to validate pointers and decode destinations
I'm not sure what you mean by this. What pointers do you need to validate, and what destinations do you need to decode (other than a simple indirect JMP in your system calls routine)? SYSCALL is a very efficient instruction (timings as mentioned by Owen excepted) and doesn't even require any stack access, unlike subroutines or interrupts. (Although you will probably be calling subroutines from the system call routines.)
Typical ways of doing this include using some register for call number (say EAX). Then you need to load EAX with the syscall number in the calling routine. In the syscall entry in kernel you need to validate EAX (because it might be out-of-bounds), and then jmp / call to the server routine from some call table. This is all done for you by hardware with call-gates. Additionally, you use one register for syscall index, which means that if you use register call conventions, there is one less register available for parameters. If you use a flat memory model, you also need to validate that all pointers passed to kernel are valid.

Using the 386DX as a reference, it is clear that call gates are superior performance-wise: they are faster than ints, and they are done with their task before the int variant is, even disregarding the extra validation needed with ints.

I'm sure the picture is different for modern processors, but on old (386 and 486), and possibly half-old (Pentium), call gates are the superior method.

I'm not convinced that you understand the SYSCALL/SYSRET instructions.

As for validating pointers, well you need to do that whatever mechanism you use to call kernel routines. Why wouldn't you validate pointers passed via a call gate? (Well, you don't have to - you can just let the kernel terminate a process if it produces a page fault.) And, if you don't want your program to crash you have to validate addresses in user programs as well.

rdos · Post by **rdos** » Thu Apr 12, 2012 3:28 am

iansjack wrote:I'm not convinced that you understand the SYSCALL/SYSRET instructions.

I think I do.

BTW, the relative timing between int and sysenter proposed in this article doesn't support Owen's claim either: http://www.codemachine.com/article_syscall.html

They claim that sysenter is 3 times faster than ints, while Owen claims it is a factor of 5 (20 vs 100).

iansjack wrote:As for validating pointers, well you need to do that whatever mechanism you use to call kernel routines. Why wouldn't you validate pointers passed via a call gate? (Well, you don't have to - you can just let the kernel terminate a process if it produces a page fault.) And, if you don't want your program to crash you have to validate addresses in user programs as well.

Not so. In my design, a user application's code and data selectors don't map the kernel, and all pointers must be passed as segment:offset in registers. Thus, an application cannot fool the kernel into accessing kernel data, since it cannot load a selector that maps the kernel. Therefore, there is no pointer validation.

Combuster · Post by **Combuster** » Thu Apr 12, 2012 3:41 am

1: What use is referencing a rather unique processor core with a negligible market share in relation to an estimated number?
2: Your reference does not cite its sources or experiments
3: Your reference contradicts the AMD manual
4: You are so obnoxious as to force the segmented memory model once again into a topic called "flat memory model". I'm assuming you plan on using that later to say that syscall instructions suck because your OS can't ever deal with them.

rdos · Post by **rdos** » Thu Apr 12, 2012 3:54 am

Combuster wrote:1: What use is referencing a rather unique processor core with a negligible market share in relation to an estimated number?

Not negligible in the embedded market, rather quite common.

You are welcome to reference a processor that supports your claims.

Combuster wrote:2: Your reference does not cite its sources or experiments

Source: http://support.amd.com/us/Embedded_TechDocs/gx1_ds.pdf

Combuster wrote:3: Your reference contradicts the AMD manual

It does? Source, please!

An additional source: http://www.agner.org/optimize/instruction_tables.pdf

The above source lists instruction timings for many modern processors, and on none of them does a call gate take 100 cycles / uops. For instance, on Intel Atom (another popular processor for embedded systems), call near takes 1 uop, and call far takes 37 uops. However, the document doesn't specify which type of call far they mean. Sysenter is not part of their document either AFAIK.

iansjack · Post by **iansjack** » Thu Apr 12, 2012 4:18 am

A flat memory model without segmentation doesn't mean that user programs can access kernel memory, any more than they can in a segmented model. Protection is provided at the page level.

My system calls can't fool the kernel into writing or reading kernel memory (well - that's not quite true. Most system calls involve some reading or writing of kernel data by the kernel itself.) That would be poor design if it were possible. I don't think that any of my system calls say "write this data to this memory location", even indirectly. I'm afraid that I just don't see the problems that you refer to as being real.

Combuster · Post by **Combuster** » Thu Apr 12, 2012 4:21 am

Now for actual verifiable proof:
ftp://202.201.0.209/pub/events/rtlws-20 ... mgartl.pdf
Get your ruler and nitpick what you want

Interprivilege INT:
P1 ~50 cycles
P2 ~100 cycles
P3 ~90 cycles
P4 ~500 cycles (yes!)
Athlon ~90 cycles

Averaged over all the CPU models mentioned, that gives about 140 cycles per INT.
I also checked the difference between int and syscall in the P3 table, arriving at a 4.5x speed improvement.

Not negligible in the embedded market, rather quite common.

Figures wanted. It's not currently available here on the desktop market, and ARM has the embedded market.

Source

Now the other one, silly.

It does? Source, please!

Volume 2, page 150 wrote:As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions

rdos · Post by **rdos** » Thu Apr 12, 2012 4:24 am

iansjack wrote:A flat memory model without segmentation doesn't mean that user programs can access kernel memory, any more than they can in a segmented model. Protection is provided at the page level.

My system calls can't fool the kernel into writing or reading kernel memory (well - that's not quite true. Most system calls involve some reading or writing of kernel data by the kernel itself.) That would be poor design if it were possible. I don't think that any of my system calls say "write this data to this memory location", even indirectly. I'm afraid that I just don't see the problems that you refer to as being real.

Example:
WriteFile API function. Takes these parameters:
1. File handle
2. Buffer (a pointer to memory)
3. Size

My implementation:
bx = file handle
es:edi = buffer
ecx = size

es:edi cannot map kernel memory, as the flat ES selector has a limit that is 0xE000000, and all kernel data is above that limit.
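For comparison, this is roughly the check a flat-model kernel would perform in software on the buffer and size arguments, and which the segment limit performs in hardware in the design above. It is a sketch under assumptions: `user_range_ok` and `USER_LIMIT` are hypothetical names, the limit value is copied as written in this post, and the limit is treated as the last valid offset, the way a segment limit is.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limit value as written in the post above; treated here as the
 * highest offset a user pointer may touch (an assumption). */
#define USER_LIMIT 0xE000000u

/* True if every byte of [addr, addr + size) lies at or below limit.
 * Written so that addr + size never has to be computed directly,
 * because that sum can wrap around in 32 bits. */
static bool user_range_ok(uint32_t addr, uint32_t size, uint32_t limit)
{
    if (addr > limit)
        return false;
    if (size == 0)
        return true;
    /* The last byte accessed is addr + size - 1. */
    return (size - 1) <= (limit - addr);
}
```

A hypothetical flat-model WriteFile entry would call something like `user_range_ok(buffer, size, USER_LIMIT)` before touching the buffer. A range that wraps past the top of the address space is rejected because the subtraction form never overflows.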
