
Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Tue Apr 17, 2012 3:11 pm
by rdos
Combuster wrote:
I can even see exactly what the customer does
Including his credit/debit card number?
Of course not. Those are not even accessible to our application.

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Tue Apr 17, 2012 8:04 pm
by bubach
Combuster wrote:
I can even see exactly what the customer does
Including his credit/debit card number?
I doubt that's even possible with today's card handling, with chip & PIN, 3-D Secure and everything else put in place to protect against fraud. Well, except for the US, where ancient stuff like cheques/checks is still in use (it's been like 40 years since they were widely used here).

Re: Best processor for 32-bit OS

Posted: Wed Apr 18, 2012 12:21 am
by Solar
rdos wrote:That's what you have log files for.
Logging is the domain of the application, not the OS. The OS provides the logging framework, the application decides what to log where. And don't give me the "it's different in embedded programming" stuff. It's perhaps different in RDOS, but it's bad design any way you put it.
rdos wrote:Presenting errors to end-users of embedded systems is just plain stupid.
If your syscall failed fatally, you do present an error to the end user, because you don't have a choice. It might be a screen message, it might be a reboot, or a flashing LED, or a beeping sound, but it's an error message.
rdos wrote:Besides, you would not log filesystem error codes in a log, as that would not make any sense to the typical support guy. [...] The trick is to provide useful information, and not to log things that only programmers understand.
Correct. That's why the logging shouldn't be done by the filesystem, or the GUI, but meaningful errors be reported to the application, so that the application (which is the part of the system knowing what it was actually doing at the time) can generate a meaningful log message.

Whether the solution is to retry, or reboot, or whatever, doesn't really matter. In order to fix the problem at the cause, you need all the information you can get. The OS knows how something failed, but only the application knows what it was that failed. That is why a simple boolean success / fail return code of syscalls is suboptimal, which is what Brendan pointed out, and which is where you lapsed (for the umpteenth time) into your standard defensive pattern of "I am not wrong, because that is the way RDOS does it".
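
To illustrate the point about boolean return codes, here is a minimal sketch in C. Everything in it is hypothetical: the `fs_status` codes, `fake_fs_write` (a stand-in for a syscall) and `save_receipt` (a stand-in for application code) are not taken from RDOS or any real API. It only shows the division of labour described above: the OS side reports how the call failed, and the application adds what it was trying to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes; not the error model of RDOS or any real OS. */
typedef enum {
    FS_OK = 0,
    FS_ERR_NOT_FOUND,
    FS_ERR_NO_SPACE,
    FS_ERR_IO
} fs_status;

/* Stand-in for a syscall: reports HOW it failed, not merely that it failed. */
static fs_status fake_fs_write(const char *path, size_t len)
{
    if (path == NULL || path[0] == '\0')
        return FS_ERR_NOT_FOUND;
    if (len > 4096)                 /* pretend the volume is tiny */
        return FS_ERR_NO_SPACE;
    return FS_OK;
}

static const char *fs_status_text(fs_status s)
{
    switch (s) {
    case FS_OK:            return "ok";
    case FS_ERR_NOT_FOUND: return "path not found";
    case FS_ERR_NO_SPACE:  return "volume full";
    case FS_ERR_IO:        return "I/O error";
    }
    return "unknown";
}

/* The application knows WHAT it was doing, so it builds the log line. */
static fs_status save_receipt(int pump, const char *path, size_t len,
                              char *log, size_t log_size)
{
    assert(log != NULL && log_size > 0);
    fs_status s = fake_fs_write(path, len);
    if (s != FS_OK)
        snprintf(log, log_size, "pump %d: could not save receipt to %s: %s",
                 pump, path, fs_status_text(s));
    else
        log[0] = '\0';              /* nothing to report */
    return s;
}
```

With a plain success/fail flag the log line could only say that saving failed; with a status code it can say that the volume was full while saving a receipt for pump 3.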

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 12:30 am
by rdos
Solar, I have 15+ years of experience in professional embedded system development for petrol stations, and I know what works and what doesn't, and there are no design limitations in RDOS in this regard. The API was mostly designed during the last 15 years, and adapted to what I regard as best practices for such applications. So, the API is the consequence of my experience in the area, not baggage to overcome. Therefore, I don't need to defend anything.

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 1:13 am
by Combuster
You just failed your logic exam.

Again.

I don't need to defend that statement: you are the only one who disagrees and does not understand.

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 1:32 am
by Solar
rdos wrote:Solar, I have 15+ years of experience in professional embedded system development for petrol stations...
And I have 12+ years of experience mopping up the truly mediocre stuff others have left behind, be it out of ignorance, attempts at "job security", or being locked up in an ivory tower.

Can we pull up our pants again?

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 4:13 am
by rdos
Updated results on my 2-core AMD E-300 portable (at 1.2GHz):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.
syscall: 3.8 million calls per second.

IOW, there is a 40% performance improvement when SYSENTER/SYSEXIT is used instead of a call gate. OTOH, this processor is still slower than AMD Geode at 500MHz!

This is the code tested:

Code:

; patched code

    nop                                   ; lead-byte (1 byte)
    call gate_entry                   ; a near call to a dynamically created user-level gate entry (5 bytes)
    nop
    nop

; The dynamic entry. These are placed in application space with read only access.

gate_nr        DD ?

gate_entry Proc near
    push eax
    push ecx
    push edx
    mov ecx,esp
    mov edx,OFFSET gate_leave
    sysenter
gate_leave:
    pop edx
    pop ecx
    pop eax
    ret
gate_entry Endp

; in kernel

gate_nr         = -9

app_eax         = 8
app_ecx         = 4
app_edx         = 0

; Each core will set up its own sysenter handler. This can be used to define the processor block linear address

sysenter_entry:
    mov eax,OFFSET proc_linear           ; patched at initialization time to contain linear address of processor block
    mov ss,cs:[eax].ps_syscall_ss0       ; get ss0 of current thread
    mov esp,stack0_size                      ; load top of stack
    sti
    push edx                                    ; push return-point (application EIP)
    push ecx                                    ; put application ESP on kernel stack
    mov eax,ds:[edx].gate_nr              ; get gate # from just before the current procedure
    push dword ptr cs:[eax].ret           ; push sysleave offset 
    push dword ptr cs:[eax].sel           ; push handler selector
    push dword ptr cs:[eax].offset       ; push handler offset
    mov eax,ds:[ecx].app_eax
    mov edx,ds:[ecx].app_edx
    mov ecx,ds:[ecx].app_ecx
    retf32                                          ; jump to syscall handler

; in a device-driver module

dummy_gate  Proc near
    ret
dummy_gate  Endp

; exit procedure:

sysleave_entry16:
    push ecx
    mov ecx,ss:[esp+6]                   ; get application ESP
    mov ds:[ecx].app_edx,edx          ; return registers to caller
    mov ds:[ecx].app_eax,eax
    pop ds:[ecx].app_ecx
    pop dx                                      ; pop unused high part of entry-point EIP
    pop ecx                                    ; pop application ESP
    pop edx                                    ; pop application EIP
    sysexit

This alternative sysentry code provides an even larger boost:

Code:

sysenter_entry:
    push edx                                    ; push return-point (application EIP)
    push ecx                                    ; put application ESP on kernel stack
    mov eax,ds:[edx].gate_nr              ; get gate # from just before the current procedure
    push dword ptr cs:[eax].ret           ; push sysleave offset 
    push dword ptr cs:[eax].sel           ; push handler selector
    push dword ptr cs:[eax].offset       ; push handler offset
    mov eax,ds:[ecx].app_eax
    mov edx,ds:[ecx].app_edx
    mov ecx,ds:[ecx].app_ecx
    retf32                                          ; jump to syscall handler
This version does 6.5 million calls per second, which is 2.4 times the call gate performance.

There are several issues that must be solved before these results are usable. One issue is that since it is the application that sets up both EIP and ESP, it is possible for an application to forge addresses within kernel space. A second issue is that RDOS cannot handle the stack being loaded with a 32-bit flat stack pointer, and will panic when the code faults or is debugged. A third issue is that the switch from user to kernel with sysenter when debugging now will go through lots of irrelevant code, which makes it harder to debug.

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 5:40 am
by rdos
This alternative provides the required safety from applications inserting random sysenter instructions. Instead of loading edx with the return address, edx is loaded with the gate #. The gate number can then be evaluated, and the correct handler return is then hardcoded for sysexit.

EDIT: Removed EAX from saved registers as it is possible to do the sysenter handler without modifying EAX.

Code:

; patched code

    nop                                   ; lead-byte (1 byte)
    call gate_entry                   ; a near call to a dynamically created user-level gate entry (5 bytes)
    nop
    nop

; The dynamic entry. These are placed in application space with read only access.

gate_entry Proc near
    push ecx
    push edx

    mov ecx,esp
    mov edx,gate_nr                       ; code is patched with gate # at creation time
    sysenter

gate_leave:
    pop edx
    pop ecx
    ret
gate_entry Endp

; in kernel

app_ecx         = 4
app_edx         = 0

; Each core will set up its own sysenter handler. This can be used to define the processor block linear address

sysenter_entry:
    mov ss,cs:ps_syscall_ss0                ; get ss0 of current thread. Patched at creation time
    mov esp,stack0_size                      ; load top of stack
    sti
    push ecx                                    ; put application ESP on kernel stack

    cmp edx,usergate_entries
    jae sysenter_fail                           ; check that gate # is within limits

    mov ecx,cs:[4*edx].gate_linear      ; get handler-address of this entry in kernel space. Gate linear is patched at creation time
    push ecx                                    ; push application return-point
    shl edx,GATE_SHIFT                     ; get to correct entry
    add edx,OFFSET gate_table           ; add linear address to entry table (patched)
    push dword ptr cs:[edx].ret           ; push sysleave offset 
    push dword ptr cs:[edx].sel           ; push handler selector
    push dword ptr cs:[edx].offset       ; push handler offset
    mov ecx,ss:[esp+16]                  ; reload application ESP (ECX was overwritten with the return-point above)
    mov edx,ds:[ecx].app_edx             ; get user EDX
    mov ecx,ds:[ecx].app_ecx             ; get user ECX
    retf32                                         ; jump to syscall handler

; exit procedure for 32-bit code:

sysleave_entry32:
    xchg ecx,ss:[esp+4]                  ; get application ESP, and save return ECX
    mov ds:[ecx].app_edx,edx          ; write-back application EDX
    mov edx,ss:[esp+4]                  ; get return ECX
    mov ds:[ecx].app_ecx,edx          ; write-back application ECX
    mov edx,ss:[esp]                   ; get application EIP
    sysexit

; exit procedure for 16-bit code:

sysleave_entry16:
    xchg ecx,ss:[esp+6]                  ; get application ESP, and save return ECX
    mov ds:[ecx].app_edx,edx          ; write-back application EDX
    mov edx,ss:[esp+6]                  ; get return ECX
    mov ds:[ecx].app_ecx,edx          ; write-back application ECX
    mov edx,ss:[esp+2]                   ; get application EIP
    sysexit
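
As a rough model of the check that sysenter_entry performs on EDX, the following C sketch treats the gate number as an index into a kernel-owned table. The table size, the handler names and `GATE_FAIL` are made up for illustration; in the listing above the table entries hold a selector and an offset rather than C function pointers.

```c
#include <stddef.h>
#include <stdint.h>

#define USERGATE_ENTRIES 4u     /* hypothetical table size */
#define GATE_FAIL        (-1)   /* hypothetical "sysenter_fail" result */

typedef int (*gate_handler)(uint32_t arg);

/* Two illustrative handlers; real gates would live in device-driver modules. */
static int gate_get_version(uint32_t arg) { (void)arg; return 9; }
static int gate_add_one(uint32_t arg)     { return (int)arg + 1; }

/* Kernel-owned table: the caller supplies only an index, never an address. */
static const gate_handler gate_table[USERGATE_ENTRIES] = {
    gate_get_version, gate_add_one, NULL, NULL
};

static int dispatch_gate(uint32_t gate_nr, uint32_t arg)
{
    /* Unsigned compare, like "cmp edx,usergate_entries" / "jae sysenter_fail". */
    if (gate_nr >= USERGATE_ENTRIES)
        return GATE_FAIL;

    gate_handler handler = gate_table[gate_nr];
    if (handler == NULL)        /* slot exists but no gate is registered */
        return GATE_FAIL;

    return handler(arg);
}
```

Because the comparison is unsigned, a forged value such as 0xFFFFFFFF is rejected by the same test that rejects an index one past the end of the table.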


Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 8:27 am
by rdos
The tamper-safe versions end up with these timings:

2-core AMD E-300 portable (at 1.2GHz):
near: 20.0 million calls per second.
gate: 2.7 million calls per second.
syscall (loading ss:esp): 3.6 million calls per second.
syscall (not loading ss:esp): 6.0 million calls per second.

IOW, creating support for using a flat kernel stack in the production release would increase syscall performance considerably on this processor.

Perhaps the most interesting thing is that loading ss:esp takes much longer than just loading a general segment register like GS. When an additional load of GS is added to the version that doesn't load ss:esp, performance drops to 5.3 million calls per second, not to the value measured when loading ss:esp, indicating that loading ss is horribly slow on this processor.
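
The relative cost of the two segment loads can be estimated from the call rates above. The sketch below is plain arithmetic in C and assumes the core really runs at 1.2 GHz during the test, which the thread does not verify; the function names are illustrative.

```c
#include <assert.h>

/* Cycles per call implied by a clock rate and a measured call rate. */
static double cycles_per_call(double clock_hz, double calls_per_sec)
{
    assert(clock_hz > 0.0 && calls_per_sec > 0.0);
    return clock_hz / calls_per_sec;
}

/* Extra cycles per call when the rate drops from base_rate to slower_rate. */
static double added_cycles(double clock_hz, double base_rate, double slower_rate)
{
    return cycles_per_call(clock_hz, slower_rate)
         - cycles_per_call(clock_hz, base_rate);
}
```

Under that assumption, `added_cycles(1.2e9, 6.0e6, 5.3e6)` puts the extra GS load at roughly 26 cycles per call, while `added_cycles(1.2e9, 6.0e6, 3.6e6)` puts loading ss:esp at roughly 133.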

Re: Best processor for 32-bit OS

Posted: Wed Apr 18, 2012 9:20 am
by turdus
Solar wrote:Logging is the domain of the application, not the OS.
No way! None of your systems would pass any kind of audit!

If it were the application's duty to log that it's trying to do something nasty, of course it wouldn't log it! Design failure!

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 9:21 am
by rdos
Ignoring all off-topic posts above ^^

Re: Best processor for 32-bit OS

Posted: Wed Apr 18, 2012 9:44 am
by Solar
turdus wrote:
Solar wrote:Logging is the domain of the application, not the OS.
No way! None of your systems would pass any kind of audit!

If it were the application's duty to log that it's trying to do something nasty, of course it wouldn't log it! Design failure!
Erm... what? I think we're seriously misunderstanding each other, here.

Example: sshd, the SSH server. If I attempt to break in to that noteable, it's sshd that will write the log entry about it. Not the kernel, not the pam module handling the actual auth request, but the application that knows what's going on overall.

To be precise, even the logging isn't done by the OS itself, but by an application (syslog-ng, in my case).

Of course my hacker script won't write a log about what it's doing...

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 10:34 am
by bubach
I think you all got away from the issue. If I understand it correctly, Brendan commented on using the carry flag as an error signal instead of returning proper error codes. Returning different error codes from each function would be a good idea, no matter whose responsibility it is to log it, print it or discard it.

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 11:01 am
by Brendan
Hi,
rdos wrote:IOW, there is a 40% performance improvement when SYSENTER/SYSEXIT is used instead of a call gate. OTOH, this processor is still slower than AMD Geode at 500MHz!
rdos wrote:This alternative sysentry code provides an even larger boost:
rdos wrote:This version does 6.5 million calls per second, which is 2.4 times the call gate performance.
20.0 million "near calls" per second on a 1.2 GHz CPU works out to about 60 cycles per "near call". I'd expect that a near call actually costs about 4 cycles, so this first test indicates that the loop overhead is probably about 56 cycles per iteration (probably because the compiler is crap - a decent compiler would have inlined the "do nothing" function, then decided that "sync_val" never changes because it's not volatile and generated a "jmp $" infinite loop).
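
The inlining concern can be made concrete with a small C sketch. This is not the benchmark rdos ran (that code is not shown in the thread); `do_nothing`, `call_target` and `measure_calls_per_second` are illustrative names. Routing the call through a volatile function pointer is one portable way to keep a compiler from inlining or deleting the empty function.

```c
#include <assert.h>
#include <time.h>

static void do_nothing(void) { }

/* The pointer is volatile, so it must be re-read and called on every
   iteration; the compiler cannot inline or drop the empty function. */
static void (*volatile call_target)(void) = do_nothing;

/* Returns calls per second, or 0.0 if the clock is too coarse to tell. */
static double measure_calls_per_second(unsigned long iterations)
{
    assert(iterations > 0);

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        call_target();
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)iterations / seconds : 0.0;
}
```

Note that this times an indirect call plus the loop around it, so the call itself still cannot be separated from the loop overhead without a second measurement of the empty loop.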

2.7 million "call gates" per second on a 1.2 GHz CPU works out to about 444 cycles per "call gate". By subtracting the "about 56 cycles per iteration" loop overhead from above this gives us an actual figure closer to 388 cycles for the call gate alone.

3.8 million "sysenters" per second on a 1.2 GHz CPU works out to about 316 cycles per "sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 260 cycles for the sysenter alone. This is about 50% faster than the call gate method.

6.5 million "alternative sysenters" per second on a 1.2 GHz CPU works out to about 185 cycles per "alternative sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 129 cycles for the alternative sysenter alone. This is about 3 times as fast as the call gate method (388 / 129), and about twice as fast as the original "sysenter" method (260 / 129).
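
The same arithmetic written as a small C sketch, so the ratios can be checked directly. The helper names are illustrative, and the 56-cycle overhead is the estimate from this post, not a measured value.

```c
#include <assert.h>

/* Cycles per call after removing a fixed per-iteration overhead.
   clock_mhz / mcalls_per_sec gives raw cycles per call. */
static double net_cycles(double clock_mhz, double mcalls_per_sec, double overhead)
{
    assert(mcalls_per_sec > 0.0);
    return clock_mhz / mcalls_per_sec - overhead;
}

/* How many times as fast method B is compared with method A,
   given each method's cost in cycles per call. */
static double times_as_fast(double cycles_a, double cycles_b)
{
    assert(cycles_b > 0.0);
    return cycles_a / cycles_b;
}
```

With 1200 MHz, a 56-cycle overhead and rates of 2.7, 3.8 and 6.5 million calls per second, the net costs come out near 388, 260 and 129 cycles, so the alternative sysenter is roughly 3.0 times as fast as the call gate and roughly 2.0 times as fast as the first sysenter version.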

The only difference between the "sysenter" and "alternative sysenter" methods is that the former loads a different SS:ESP while the latter doesn't. Because the alternative method is about twice as fast, this means that loading a different SS:ESP must halve the performance. Loading a different value into ESP is just a normal "mov" and should only cost about 2 cycles. Therefore loading a different value into the SS register must be costing about 126 cycles all by itself. Loading a different value into CS would cost about the same. Therefore, without these (CS and SS) segment loads (at 126 cycles each) the cost of the sysenter and sysexit instructions alone would be about 5 cycles.

This is undeniable proof that if RDOS didn't use segmentation, system calls would be about as fast as a near call. :lol:
rdos wrote:There are several issues that must be solved before these results are usable. One issue is that since it is the application that sets up both EIP and ESP, it is possible for an application to forge addresses within kernel space.
If a flat application does attempt to forge dodgy values for EIP or ESP it'd only cause a page fault due to the correct use of the supervisor/user flag in page table entries, and would be no worse than the same application doing "jmp somewhere_in_kernel" or "mov esp,somewhere_in_kernel". My main concern (if I understand RDOS enough) would be segmented applications using the SYSENTER interface to break their segments. For example, if you have several segmented applications in the same virtual address space, then one of them could use SYSENTER to modify its SS and then use its SS:ESP to read a different application's data.

This is undeniable proof that if RDOS didn't use segmentation it would be more secure. :lol:


Cheers,

Brendan

Re: Best processor for 32-bit [rd]OS - a.k.a RDOS-OS is best

Posted: Wed Apr 18, 2012 11:32 am
by rdos
Brendan wrote:20.0 million "near calls" per second on a 1.2 GHz CPU works out to about 60 cycles per "near call". I'd expect that a near call actually costs about 4 cycles, so this first test indicates that the loop overhead is probably about 56 cycles per iteration (probably because the compiler is crap - a decent compiler would have inlined the "do nothing" function, then decided that "sync_val" never changes because it's not volatile and generated a "jmp $" infinite loop).
That's probably not correct. The overhead is not in the loop, but in the C procedure that saves registers, checks the stack and so on. I think it is reasonable to set loop overhead to 10 cycles and procedure overhead to 46 instead.
Brendan wrote:2.7 million "call gates" per second on a 1.2 GHz CPU works out to about 444 cycles per "call gate". By subtracting the "about 56 cycles per iteration" loop overhead from above this gives us an actual figure closer to 388 cycles for the call gate alone.
That would be 434 using the corrected figure.
Brendan wrote:3.8 million "sysenters" per second on a 1.2 GHz CPU works out to about 316 cycles per "sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 260 cycles for the sysenter alone. This is about 50% faster than the call gate method.
And that would be 306. Just above 40% faster.
Brendan wrote:6.5 million "alternative sysenters" per second on a 1.2 GHz CPU works out to about 185 cycles per "alternative sysenter". By subtracting the "about 56 cycles per iteration" loop overhead again this gives us an actual figure closer to 129 cycles for the alternative sysenter alone. This is about 3 times as fast as the call gate method (388 / 129), and about twice as fast as the original "sysenter" method (260 / 129).
Overhead would be 175 cycles, and that is 150% faster.
Brendan wrote:The only difference between the "sysenter" and "alternative sysenter" methods is that the former loads a different SS:ESP while the latter doesn't. Because the alternative method is about twice as fast, this means that loading a different SS:ESP must halve the performance. Loading a different value into ESP is just a normal "mov" and should only cost about 2 cycles. Therefore loading a different value into the SS register must be costing about 126 cycles all by itself. Loading a different value into CS would cost about the same. Therefore, without these (CS and SS) segment loads (at 126 cycles each) the cost of the sysenter and sysexit instructions alone would be about 5 cycles.
I think loading CS is also a lot faster than loading SS (probably similar to loading a general segment register), and SYSENTER/SYSEXIT don't take 5 cycles, but a lot more.
Brendan wrote:This is undeniable proof that if RDOS didn't use segmentation, system calls would be about as fast as a near call. :lol:
Yeah, and unreliable. :mrgreen:
Brendan wrote:If a flat application does attempt to forge dodgy values for EIP or ESP it'd only cause a page fault due to the correct use of the supervisor/user flag in page table entries, and would be no worse than the same application doing "jmp somewhere_in_kernel" or "mov esp,somewhere_in_kernel".
Not so, since these are used to load/save stack state in application space from within the kernel. User/supervisor flags are useless when the operations take place in the kernel.
Brendan wrote:My main concern (if I understand RDOS enough) would be segmented applications using the SYSENTER interface to break their segments. For example, if you have several segmented applications in the same virtual address space, then one of them could use SYSENTER to modify its SS and then use its SS:ESP to read a different application's data.
The last version provides full protection. Besides, for segmented applications, the SYSENTER interface could not be used (application CS/SS are not flat with a zero base), and thus would default to call gates only.

There is one issue though: the CS and SS that are set up by SYSEXIT have an incorrect limit, which means that CS and SS could be used to address the kernel. However, this is not a big issue, as RDOS has supervisor-only access to kernel pages. If you read the code carefully, you can see that I deliberately use DS (which is loaded with a limit that excludes the kernel) when I address the user-supplied stack, so if the user forges ECX, the stack operations will fault in the kernel. For the same reason I use a CS override for data that is located in the kernel.
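
For comparison, a kernel that cannot lean on a segment limit typically has to make the equivalent check in software before touching a user-supplied pointer. A minimal C sketch follows, assuming a hypothetical split where user space ends at `USER_LIMIT`; that value and the function name are illustrative and are not RDOS's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boundary: user space is [0, USER_LIMIT), kernel sits above. */
#define USER_LIMIT 0xC0000000u

/* True if [addr, addr + len) lies entirely in user space. Written to avoid
   32-bit wraparound, which a naive "addr + len <= USER_LIMIT" would miss. */
static bool user_range_ok(uint32_t addr, uint32_t len)
{
    if (addr >= USER_LIMIT)
        return false;
    return len <= USER_LIMIT - addr;
}
```

A forged ECX such as 0xFFFFFFF0 with a 0x20-byte access fails this test, which is the software counterpart of the DS-relative access faulting on the segment limit.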