Owen wrote:
The best case I've seen for syscall is about 20 cycles, similar for sysret, and this is a true cost best case, because syscall always hits microcode and can't be executed in parallel with any other instruction. Additionally, the branch predictor doesn't predict over a syscall, so it's very possible that the destination TLB entry is not present and that the instructions are not in cache. And, of course, the kernel can't trust any addresses it has been given by userspace, so there is additional overhead to deal with there.
INT 0xNN and CALL FAR via call gate both tend to be ~100 cycles, with IRET being similar. Segment loads and privilege checks are expensive operations.

I think you are wrong. Getting the instruction timings for modern processors is kind of hard, but I have the timings for the 386DX:
CALL:
within segment: 7 + m (m is the number of components of the next instruction)
to different segment: 34 + m
via gate to same privilege: 52 + m
via gate to different privilege (no parameters): 86 + m
via gate to different privilege (x parameters): 94 + 4x + m
RET:
within segment: 10 + m
to different segment: 32 + m
to different privilege: 69
As can be seen, not even on the 386DX do call / ret through call gates take 100 cycles each, so I suspect that Owen is wrong about that. If not, please state the processor you claim has this characteristic, along with a PDF that proves your point.
Using interrupts is even worse:
INT xx:
To same privilege: 59
To different privilege: 99
IRET:
Same privilege: 38
Different privilege: 82
Since the 386DX doesn't have sysenter and sysexit, I cannot compare with that.
With a 386DX, you can do just above 10 calls within a segment in the same time it takes to do one call to the kernel, which is not that bad. I hardly think that 64-bit mode has better relative performance here: if you use sysenter / sysexit or interrupts, you also need code to validate pointers and decode destinations, which probably means you end up in the same range. The validations that the hardware does for you with call gates need to be done in software in long mode and flat mode, and this must be taken into account.
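To make the comparison easier to check, here is a rough sketch that just adds up the 386DX figures listed above into round-trip costs. The function names and the default m = 2 are my own assumptions for illustration; only the cycle counts come from the list in this post.

```python
# 386DX protected-mode cycle counts as listed in this post.
# m = number of components of the next instruction (m=2 below is an assumed example value).
# x = number of parameters copied through the call gate.

RET_DIFF_PRIV = 69    # RET to different privilege
INT_DIFF_PRIV = 99    # INT xx to different privilege
IRET_DIFF_PRIV = 82   # IRET to different privilege


def call_near(m):
    """CALL within segment: 7 + m."""
    return 7 + m


def ret_near(m):
    """RET within segment: 10 + m."""
    return 10 + m


def call_gate_diff_priv(m, x=0):
    """CALL via gate to different privilege: 86 + m, or 94 + 4x + m with x parameters."""
    return 86 + m if x == 0 else 94 + 4 * x + m


def round_trips(m=2, x=0):
    """Total cycles for entering and leaving via each mechanism."""
    return {
        "near call + ret": call_near(m) + ret_near(m),
        "call gate + ret": call_gate_diff_priv(m, x) + RET_DIFF_PRIV,
        "int + iret": INT_DIFF_PRIV + IRET_DIFF_PRIV,
    }


if __name__ == "__main__":
    costs = round_trips()
    near = costs["near call + ret"]
    for name, cycles in costs.items():
        print(f"{name}: {cycles} cycles ({cycles / near:.1f}x a near call + ret)")
```

This only sums the documented instruction timings. It does not model the pointer validation or dispatch code that a sysenter or interrupt based kernel entry would need in software, which is the part I argue evens things out.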