Clarify how x86 interrupts work

Post by StudlyCaps »

I am looking at interrupts as a syscall mechanism and I think I understand how they work, but one thing stumped me for a bit and I want to clarify it:

I am in protected mode, ring 3. I do a software interrupt (int 0x80 for example). My IDT has an interrupt gate installed in entry 0x80. The CPU invokes my handler.

The point that gave me difficulty is this: when the code in my handler begins to execute, SS and ESP are taken from the SS0 and ESP0 fields of the TSS currently pointed to by the Task Register, correct? So although no hardware task switch has occurred, by pointing the Task Register at a TSS that points to the kernel stack, and by having the selector field in the IDT entry point to the kernel code segment, we effectively switch from ring 3 (using the user stack) to ring 0 (using the kernel stack). Is this right or am I barking up the wrong tree?
This seems to be how other OSs do it but the Intel manuals make it seem like this isn't the way the processor is designed to work.

Thanks!

Post by Brendan »

Hi,
StudlyCaps wrote:I am looking at interrupts as a syscall mechanism and I think I understand how they work, but one thing stumped me for a bit and I want to clarify it:

I am in protected mode, ring 3. I do a software interrupt (int 0x80 for example). My IDT has an interrupt gate installed in entry 0x80. The CPU invokes my handler.

The point that gave me difficulty is this: when the code in my handler begins to execute, SS and ESP are taken from the SS0 and ESP0 fields of the TSS currently pointed to by the Task Register, correct? So although no hardware task switch has occurred, by pointing the Task Register at a TSS that points to the kernel stack, and by having the selector field in the IDT entry point to the kernel code segment, we effectively switch from ring 3 (using the user stack) to ring 0 (using the kernel stack). Is this right or am I barking up the wrong tree?
That is right.
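As a rough illustration of the setup described in the question, here is a minimal C sketch of the two pieces involved: the IDT gate for vector 0x80 and the SS0/ESP0 fields of the TSS. The selector values (0x08 and 0x10), the struct names and the function names are assumptions made for this example, not something taken from the thread. The gate uses DPL=3 because a software "int" from ring 3 is only allowed through a gate whose DPL is at least the caller's CPL.

```c
#include <stdint.h>

/* 32-bit protected-mode IDT gate descriptor (8 bytes). */
struct idt_gate {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector loaded into CS */
    uint8_t  reserved;     /* must be zero */
    uint8_t  type_attr;    /* present bit, DPL and gate type */
    uint16_t offset_high;  /* handler address, bits 16..31 */
};
_Static_assert(sizeof(struct idt_gate) == 8, "IDT gate must be 8 bytes");

/* First fields of the 32-bit TSS; the remaining fields are omitted here. */
struct tss32_head {
    uint32_t prev_task_link;
    uint32_t esp0;         /* stack pointer loaded on entry to ring 0 */
    uint32_t ss0;          /* stack segment loaded on entry to ring 0 */
};

#define GATE_PRESENT 0x80u
#define GATE_DPL3    0x60u  /* gate may be invoked by "int n" from ring 3 */
#define GATE_INT32   0x0Eu  /* 32-bit interrupt gate */

#define KERNEL_CS 0x08u     /* hypothetical GDT selector for kernel code */
#define KERNEL_DS 0x10u     /* hypothetical GDT selector for kernel data */

/* Build the gate that would be stored in IDT entry 0x80. */
struct idt_gate make_syscall_gate(uint32_t handler)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = KERNEL_CS;
    g.reserved    = 0;
    g.type_attr   = GATE_PRESENT | GATE_DPL3 | GATE_INT32;  /* 0xEE */
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Tell the CPU which stack to switch to when ring 3 enters ring 0. */
void set_kernel_stack(struct tss32_head *tss, uint32_t kernel_stack_top)
{
    tss->ss0  = KERNEL_DS;
    tss->esp0 = kernel_stack_top;
}
```

A kernel that gives each thread its own kernel stack would typically call something like set_kernel_stack() on every thread switch, so that the next ring 3 to ring 0 transition lands on the right stack.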
StudlyCaps wrote:This seems to be how other OSs do it but the Intel manuals make it seem like this isn't the way the processor is designed to work.
The Intel manuals only really describe how the CPU reacts to various things, and not how an OS could/should use them. As for how an OS could/should do things: typically you'd use a register (e.g. EAX) to select a function, and inside the kernel you'd have a table of function pointers (e.g. "call [myTable + eax*4]" to call whichever kernel API function was requested). This makes it relatively easy to support multiple different ways of calling the kernel API, because every entry path ends with the same "call [myTable + eax*4]".
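The table-of-function-pointers idea can be sketched in C roughly as follows. The function names, the two-argument signature and the error value are hypothetical, and the bounds check is an addition of this sketch (the one-line assembly form above does not show it).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by every kernel API function. */
typedef int32_t (*syscall_fn)(uint32_t arg0, uint32_t arg1);

/* Two placeholder kernel API functions, for illustration only. */
static int32_t sys_nop(uint32_t a, uint32_t b) { (void)a; (void)b; return 0; }
static int32_t sys_add(uint32_t a, uint32_t b) { return (int32_t)(a + b); }

/* The equivalent of "myTable": indexed by the value passed in EAX. */
static const syscall_fn syscall_table[] = { sys_nop, sys_add };

#define SYSCALL_COUNT (sizeof syscall_table / sizeof syscall_table[0])
#define SYSCALL_BAD_NUMBER (-1)   /* hypothetical error code */

/* Common tail of every entry path (int, call gate, SYSCALL, SYSENTER). */
int32_t syscall_dispatch(uint32_t number, uint32_t arg0, uint32_t arg1)
{
    if (number >= SYSCALL_COUNT)   /* never index the table with an unchecked value */
        return SYSCALL_BAD_NUMBER;
    return syscall_table[number](arg0, arg1);
}
```

Each entry stub would only need to save registers, call syscall_dispatch() and return with the matching instruction (iret, retf, sysret or sysexit).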

In general, the possibilities (from slowest to fastest) are:
  • An exception (e.g. "int3" to trigger breakpoint exception, "ud2" to trigger undefined opcode exception, etc). Relatively abnormal option, that can be messy and slow; but can also cost 1 byte for caller to use (potentially unbeatable for code size).
  • Software interrupt. Slowest "normal" option (due to touching both IDT and GDT, and all the protection checks involved). Costs 2 bytes for the caller to use.
  • Call gate. Slightly faster than a software interrupt; but costs 7 bytes for the caller to use (worst for code size)
  • SYSENTER and SYSCALL. Fastest options and also 2 bytes for caller to use (so good for code size too). Not supported on some CPUs (including modern CPUs).
Mostly; it's relatively easy to support "any or all of the above" (including emulating SYSENTER and/or SYSCALL in the invalid opcode exception handler if the CPU doesn't support it); and therefore it's easy to let the process decide for itself what it wants (e.g. maybe software interrupt for "rarely executed" code where code size matters more, and something else in "frequently executed" code where speed matters more).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.

Post by Antti »

Brendan, have you thought about using a page fault?

    mov eax, [10]           ; make a system call 10
    mov eax, [11]           ; make a system call 11
Please note that a "mov eax, [mem]" instruction is shorter than e.g. a "mov ecx, [mem]" instruction. The real trick is to put as much information as possible into the address value that will trap. A total of five bytes and you select the function and make the system call.

Post by Antti »

...continued.

Opcode range for this purpose could be from 0xA0 to 0xA3. If user space does not use the range from 0xFF000000 to 0xFFFFFFFF, it would be possible to do:

    db 0xA0, func, param1, param2, 0xFF     ; SyscallType0(uint8_t func, uint8_t param1, uint8_t param2)
    db 0xA1, func, param1, param2, 0xFF     ; SyscallType1(uint8_t func, uint8_t param1, uint8_t param2)
    db 0xA2                                 ; SyscallType2(void *pointer)
    dd (pointer >> 8) | 0xFF000000          ;   32-bit address operand, so it needs dd rather than db
    db 0xA3                                 ; SyscallType3(void *pointer)
    dd (pointer >> 8) | 0xFF000000          ;   same packing as SyscallType2
Nice "atomic" syscall instructions? For pointers, validity should be checked before making the system call but that is not strictly necessary (user space programs may do anything they like anyway). Now the "pointer = 0xFF123456" could mean "pointer = 0x12345600" but that could be a documented feature?
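To make the packing concrete, here is a small C sketch of how the 32-bit address operand could be built on the user side and taken apart again on the kernel side (for example from the faulting address in CR2). The function names and the SYSCALL_WINDOW constant are made up for this example; the layout follows the byte sequences above, where the address bytes are stored little-endian.

```c
#include <stdint.h>

#define SYSCALL_WINDOW 0xFF000000u   /* top byte 0xFF marks a packed syscall address */

/* Types 0 and 1: bytes "func, param1, param2, 0xFF" read as a little-endian address. */
uint32_t pack_args(uint8_t func, uint8_t param1, uint8_t param2)
{
    return SYSCALL_WINDOW | ((uint32_t)param2 << 16) | ((uint32_t)param1 << 8) | func;
}

void unpack_args(uint32_t address, uint8_t *func, uint8_t *param1, uint8_t *param2)
{
    *func   = (uint8_t)(address & 0xFFu);
    *param1 = (uint8_t)((address >> 8) & 0xFFu);
    *param2 = (uint8_t)((address >> 16) & 0xFFu);
}

/* Types 2 and 3: the low 8 bits of the pointer are dropped, so the pointer
 * has to be 256-byte aligned for the round trip to be exact. */
uint32_t pack_pointer(uint32_t pointer)
{
    return (pointer >> 8) | SYSCALL_WINDOW;
}

uint32_t unpack_pointer(uint32_t address)
{
    return (address & 0x00FFFFFFu) << 8;
}

/* Emit the 5-byte instruction: one opcode byte (0xA0..0xA3) plus the address. */
void emit_syscall(uint8_t opcode, uint32_t address, uint8_t out[5])
{
    out[0] = opcode;
    out[1] = (uint8_t)(address & 0xFFu);
    out[2] = (uint8_t)((address >> 8) & 0xFFu);
    out[3] = (uint8_t)((address >> 16) & 0xFFu);
    out[4] = (uint8_t)(address >> 24);
}
```

With this layout, unpack_pointer(0xFF123456) gives 0x12345600, which is the "documented feature" mentioned above.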

Post by StudlyCaps »

Brendan, thanks a lot for the detailed reply. I feel much more confident now. :D

Post by onlyonemac »

Antti wrote:Brendan, have you thought about using a page fault?

    mov eax, [10]           ; make a system call 10
    mov eax, [11]           ; make a system call 11
Please note that a "mov eax, [mem]" instruction is shorter than e.g. a "mov ecx, [mem]" instruction. The real trick is to put as much information as possible into the address value that will trap. A total of five bytes and you select the function and make the system call.
Too much magic. A "call", "int", or other similar instruction looks like a function call to someone reading or writing the code. "mov eax, [10]" just looks like an invalid move, which is presumably a bug rather than a system call.
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing

Post by LtG »

onlyonemac wrote:Too much magic. A "call", "int", or other similar instruction looks like a function call to someone reading or writing the code. "mov eax, [10]" just looks like an invalid move, which is presumably a bug rather than a system call.
While I don't have a strong opinion on the use of paging as a form of syscall, for the "magic":
- Anyone using assembly deserves it
- Anyone not commenting their code deserves it

Just in case you get the wrong impression, I'm not against assembly, but unless there's a good reason to use it, don't.

For the "normal" case, the above "page fault magic syscall" would be hidden in a C level syscall function so it shouldn't bother anyone. Just not sure if I see a benefit compared to using a SYSCALL. Antti, was there some benefit other than doing it in an unconventional way?

Brendan, were there some specific "modern" CPUs without SYSCALL/SYSENTER that you were referring to?

Post by Korona »

At least you would have to heavily benchmark that. I suspect that page faults are much slower than software interrupts. The CPU will try to issue a page walk on each such mov as non-present pages are not cached in the TLB. The CPU will also speculatively assume mov always succeeds; violation of that assumption might incur additional overheads.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].

Post by Antti »

onlyonemac wrote:A "call", "int", or other similar instruction looks like a function call to someone reading or writing the code.
This depends on your assembler and mnemonics it uses. A simple macro could hide the actual byte sequence.
LtG wrote:Antti, was there some benefit other than doing it in an unconventional way?
I did not know when I wrote the post, but I do know something now after thinking about it. That is partially the point of bringing up some "crazy" ideas. If we optimally pack the information into those 5 bytes, our "atomic" system calls could have some interesting features. Having a one-byte-shorter instruction than, e.g. "mov eax, 123" & "int3", is an advantage in itself. Perhaps that is not enough so let's innovate.

Brendan has mentioned his "batch processing" of system calls. What if we think about the idea of not returning to ring 3 immediately? If other system calls immediately follow, why don't we just interpret them? Writing a simple interpreter is easy if every system call starts with a byte 0xA0, 0xA1, 0xA2, or 0xA3 and has a constant length.
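A minimal sketch of such an interpreter, in C, might look like the following. It assumes the 5-byte encoding from the earlier post (opcode 0xA0..0xA3 followed by a little-endian address whose top byte is 0xFF). The callback type and all names are hypothetical, and a real kernel would have to read the user code through whatever safe user-memory access mechanism it has, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define SYSCALL_INSN_LEN 5u

/* Hypothetical callback that performs one decoded system call. */
typedef void (*packed_handler)(uint8_t opcode, uint32_t address, void *ctx);

/* True if the 5 bytes at insn look like one of the packed syscall instructions. */
static int is_syscall_insn(const uint8_t *insn)
{
    return insn[0] >= 0xA0 && insn[0] <= 0xA3 && insn[4] == 0xFF;
}

/* Handle consecutive packed syscalls starting at code; stop at the first
 * instruction that is something else. Returns the number of bytes consumed,
 * i.e. how far the saved EIP could be advanced before returning to ring 3. */
size_t interpret_syscall_run(const uint8_t *code, size_t len,
                             packed_handler handle, void *ctx)
{
    size_t offset = 0;

    while (len - offset >= SYSCALL_INSN_LEN && is_syscall_insn(code + offset)) {
        const uint8_t *p = code + offset;
        uint32_t address = (uint32_t)p[1]
                         | ((uint32_t)p[2] << 8)
                         | ((uint32_t)p[3] << 16)
                         | ((uint32_t)p[4] << 24);
        handle(p[0], address, ctx);
        offset += SYSCALL_INSN_LEN;
    }
    return offset;
}

/* Example callback: just counts how many calls were interpreted. */
void count_calls(uint8_t opcode, uint32_t address, void *ctx)
{
    (void)opcode;
    (void)address;
    ++*(size_t *)ctx;
}
```

Because every instruction has the same length and a recognisable first and last byte, the loop needs no real instruction decoder, which is the point being made above.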

Post by Brendan »

Hi,
Antti wrote:Brendan, have you thought about using a page fault?
That's clever, and for code size it would beat something like "int3" (use one less byte). However, for performance it'd be hideous, partly because most CPUs don't create TLB entries for "not present" pages (so you'd start with TLB miss costs), and partly because you'd need multiple checks to determine if it's a system call or a bug (e.g. check if CR2 is in a certain range, check if the access was a read, check if CPU was running at CPL=3, check if EIP is sane, check if EIP points to a special instruction). The other problems are that the page fault handler typically ends up being relatively complex/messy (to handle various virtual memory management tricks) and I'd prefer not to add more complexity to that; and that there are potentially valid reasons for wanting that page to be valid (e.g. maybe some kind of virtual machine where you want to do "host virtual address = guest physical address" for performance).
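To give a feel for the extra checks listed above, here is a rough C sketch of the test a page fault handler would need before treating a fault as a system call. The error code bits are the architectural ones (bit 0 = page was present, bit 1 = write access, bit 2 = user-mode access); the window base, the function name and the decision to accept only the two load opcodes 0xA0 and 0xA1 are assumptions of this sketch. The "is EIP sane" check is left out because it depends on how the kernel reads user memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural page fault error code bits. */
#define PF_PRESENT 0x01u   /* 0 = page not present, 1 = protection violation */
#define PF_WRITE   0x02u   /* 0 = read access, 1 = write access */
#define PF_USER    0x04u   /* 1 = the access was made from user mode */

#define SYSCALL_WINDOW_BASE 0xFF000000u   /* hypothetical reserved address range */

/* Decide whether a page fault is a "page fault system call" or an ordinary fault.
 * opcode_at_eip is the first byte of the faulting instruction, fetched by the caller. */
bool fault_is_syscall(uint32_t error_code, uint32_t cr2, uint8_t opcode_at_eip)
{
    if (cr2 < SYSCALL_WINDOW_BASE)      /* faulting address outside the reserved range */
        return false;
    if (error_code & PF_WRITE)          /* only loads count in this sketch */
        return false;
    if (!(error_code & PF_USER))        /* the kernel faulting here is a bug, not a call */
        return false;
    if (opcode_at_eip != 0xA0 && opcode_at_eip != 0xA1)
        return false;                   /* not one of the special load instructions */
    return true;
}
```

All of this runs on every page fault before the normal virtual memory handling can start, which is part of the complexity cost described above.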
onlyonemac wrote:Too much magic. A "call", "int", or other similar instruction looks like a function call to someone reading or writing the code. "mov eax, [10]" just looks like an invalid move, which is presumably a bug rather than a system call.
You're right - if you're not deliberately trying to make it harder to understand a disassembly (and if "doesn't look like a function call" isn't a tiny advantage); then something like "call 10" would look more like a function call than a software interrupt does, and you could even trick compilers into thinking it's a normal "extern" function (if the rest of the ABI matches). However, this would still have all of the same (performance and complexity) problems.


Cheers,

Brendan

Post by LtG »

Antti wrote:That is partially the point of bringing up some "crazy" ideas.
I like outside the box ideas, though not all have immediate use =)

Antti wrote: If we optimally pack the information into those 5 bytes, our "atomic" system calls could have some interesting features. Having a one-byte-shorter instruction than, e.g. "mov eax, 123" & "int3", is an advantage in itself. Perhaps that is not enough so let's innovate.
Is it really any more "atomic" than your "normal":

mov al, 0x15; Select syscall function
syscall
It's certainly a lot slower.. And you have to pack the data into the byte, of course if that's done at compile time then there's not much of an issue but at runtime it becomes even slower and possibly bigger.. My above example also supports 256 syscall functions, but can't remember how many bytes the move to AL takes..
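On the byte counts: in 32-bit code, "mov al, imm8" encodes as B0 ib (2 bytes) and "syscall" as 0F 05 (2 bytes), so the pair above is 4 bytes. The sketch below simply writes out the encodings being compared in this thread as C arrays so the sizes can be checked; the array and function names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit mode encodings of the instruction sequences discussed in the thread. */
static const uint8_t mov_al_imm8[]   = { 0xB0, 0x15 };                    /* mov al, 0x15  */
static const uint8_t syscall_insn[]  = { 0x0F, 0x05 };                    /* syscall       */
static const uint8_t mov_eax_moffs[] = { 0xA1, 0x0A, 0x00, 0x00, 0x00 };  /* mov eax, [10] */
static const uint8_t mov_eax_imm32[] = { 0xB8, 0x7B, 0x00, 0x00, 0x00 };  /* mov eax, 123  */
static const uint8_t int3_insn[]     = { 0xCC };                          /* int3          */

/* "mov al, 0x15" followed by "syscall". */
size_t size_mov_al_syscall(void)
{
    return sizeof mov_al_imm8 + sizeof syscall_insn;
}

/* The single faulting load used as a page fault system call. */
size_t size_page_fault_call(void)
{
    return sizeof mov_eax_moffs;
}

/* "mov eax, 123" followed by "int3". */
size_t size_mov_eax_int3(void)
{
    return sizeof mov_eax_imm32 + sizeof int3_insn;
}
```

So the faulting load (5 bytes) is one byte shorter than "mov eax, 123" plus "int3" (6 bytes), but one byte longer than "mov al, 0x15" plus "syscall" (4 bytes).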
Antti wrote: Brendan has mentioned his "batch processing" of system calls. What if we think about the idea of not returning to ring 3 immediately? If other system calls immediately follow, why don't we just interpret them? Writing a simple interpreter is easy if every system call starts with a byte 0xA0, 0xA1, 0xA2, or 0xA3 and has a constant length.
I thought Brendan's "batch processing" of syscalls is message based, in which case you could just provide the kernel/syscall handler with a message list, and thus you don't need any "interpreter" and again it's faster and easier..

Just trying to figure out if there are any use cases for using VAS (virtual address space) for syscalls, possibly saving one byte with significant runtime overhead just doesn't seem like it's worth it..

As for the TLB concerns the others raised, can't those be avoided by marking the page(s) present (so cached in TLB) and also marking it ring0 or something else to cause a #PF?

Post by Brendan »

Hi,
LtG wrote:
Antti wrote:Brendan has mentioned his "batch processing" of system calls. What if we think about the idea of not returning to ring 3 immediately? If other system calls immediately follow, why don't we just interpret them? Writing a simple interpreter is easy if every system call starts with a byte 0xA0, 0xA1, 0xA2, or 0xA3 and has a constant length.
I thought Brendan's "batch processing" of syscalls is message based, in which case you could just provide the kernel/syscall handler with a message list, and thus you don't need any "interpreter" and again it's faster and easier..
Above I mentioned the idea of supporting multiple different "kernel system call" methods (software interrupt, call gate, SYSCALL/SYSENTER, etc) where all of them end up doing something like "call [functionTable + eax*4]" and where the kernel's functions themselves don't care which method was used by the calling thread.

For "batch system calls" the original idea was for a thread to create a list of entries (with one entry per kernel function) and call a "do this list of function calls" kernel API function; where the kernel does something like "for each entry in the list { load input values from entry into registers; call [functionTable + eax*4]; store output values from registers into list; }". In this way, the kernel API functions themselves don't care if the calling thread used batch system calls (in the same way that they don't care if the calling thread used software interrupt or call gate or ....).
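The loop described above can be sketched in C along these lines. The entry layout, the two placeholder kernel functions and the error value are all hypothetical; the only part taken from the description is the shape of the loop (load inputs from the entry, call through the function table, store the output back into the list). Copying each entry into a local variable before using it is an extra precaution added in this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one entry in the caller's list. */
struct batch_entry {
    uint32_t function;   /* what would normally be passed in EAX */
    uint32_t arg0;
    uint32_t arg1;
    int32_t  result;     /* filled in by the kernel */
};

typedef int32_t (*kernel_fn)(uint32_t arg0, uint32_t arg1);

/* Placeholder kernel API functions, for illustration only. */
static int32_t k_add(uint32_t a, uint32_t b) { return (int32_t)(a + b); }
static int32_t k_mul(uint32_t a, uint32_t b) { return (int32_t)(a * b); }

static const kernel_fn function_table[] = { k_add, k_mul };
#define FUNCTION_COUNT (sizeof function_table / sizeof function_table[0])
#define BATCH_BAD_FUNCTION (-1)   /* hypothetical error code */

/* The "do this list of function calls" kernel API function.
 * Returns how many entries were dispatched to a kernel function. */
size_t run_batch(struct batch_entry *list, size_t count)
{
    size_t done = 0;

    for (size_t i = 0; i < count; i++) {
        struct batch_entry e = list[i];          /* work on a private copy of the inputs */

        if (e.function >= FUNCTION_COUNT) {
            list[i].result = BATCH_BAD_FUNCTION;
            continue;
        }
        list[i].result = function_table[e.function](e.arg0, e.arg1);
        done++;
    }
    return done;
}
```

The kernel functions called through function_table are the same ones the ordinary entry paths would call, so they do not need to know whether they were reached through a batch or through a single system call.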

Essentially, it didn't use messages and only used "lists of entries in memory somewhere". However, dealing with "pointers to things in user-space" is messy (partly because kernel has to do sanity checks, and other threads in the process that are running on other CPUs can modify the data after the kernel has done sanity checks) and this gets a little more messy for batch system calls because one system call might do something that modifies the list (e.g. free the page that was used to store the list itself). For my micro-kernels I have special "message buffer" areas that are a lot less messy (they are "per thread" where one thread can't access another thread's message buffer area; and can never be part of a memory mapped file or shared memory area or ...). For this reason, even though the "batch system call" didn't really have anything to do with messaging, I probably did say something about using my "message buffer" area (and probably confused everyone ;) ).
LtG wrote:As for the TLB concerns the others raised, can't those be avoided by marking the page(s) present (so cached in TLB) and also marking it ring0 or something else to cause a #PF?
Yes - you could make the area "present, supervisor only" to avoid the TLB miss. For this case, if the kernel is buggy (e.g. dereferences a null pointer) you won't get a page fault.
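A simplified C model of that trade-off is shown below. The page table entry bits are the architectural ones for 32-bit paging (bit 0 = present, bit 1 = writable, bit 2 = user accessible); the function names are hypothetical, and the model deliberately ignores the page directory entry, write permissions and features such as SMEP/SMAP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of a 32-bit page table entry. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_USER    0x004u

/* Map a frame as "present, supervisor only": the translation can be cached
 * in the TLB, but any access from ring 3 still raises a page fault. */
uint32_t make_supervisor_only_pte(uint32_t frame_phys)
{
    return (frame_phys & 0xFFFFF000u) | PTE_PRESENT;   /* PTE_USER left clear */
}

/* Simplified model of whether a read through this entry faults. */
bool read_faults(uint32_t pte, bool user_mode)
{
    if (!(pte & PTE_PRESENT))
        return true;                     /* not present: faults for everyone */
    if (user_mode && !(pte & PTE_USER))
        return true;                     /* ring 3 touching a supervisor page */
    return false;                        /* ring 0 reads succeed silently */
}
```

The last case is the downside mentioned above: if the reserved area includes page 0, a kernel read through a null pointer no longer faults.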


Cheers,

Brendan

Post by Antti »

Thank you for the discussion. The idea was not very good but it helped me to think about the details that may reveal something worthwhile. For example, what could be done with an Alignment Check (#AC) exception? It is always ring 3 only so kernel space could not trigger it. :D

If an #AC comes before a page fault, this could be usable. The payload for

    mov eax, [packedData | 1]
is excellent.
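For reference, alignment checking only takes effect when CR0.AM (bit 18) and EFLAGS.AC (bit 18) are both set and the code is running at CPL 3. The C sketch below models that condition and the "packedData | 1" trick; the function names are made up for the example. One thing to keep in mind is that #AC does not store the faulting address in CR2 (only a page fault does that), so a handler would have to recover the payload by decoding the instruction at the saved EIP.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_AM    (1u << 18)   /* alignment mask bit in CR0 */
#define EFLAGS_AC (1u << 18)   /* alignment check flag in EFLAGS */

/* Alignment checking is active only with CR0.AM=1, EFLAGS.AC=1 and CPL=3. */
bool alignment_check_enabled(uint32_t cr0, uint32_t eflags, unsigned cpl)
{
    return (cr0 & CR0_AM) != 0 && (eflags & EFLAGS_AC) != 0 && cpl == 3;
}

/* True if an access of the given size (a power of two) is misaligned. */
bool is_misaligned(uint32_t address, uint32_t size)
{
    return (address & (size - 1u)) != 0;
}

/* Force bit 0 to 1 so that a 4-byte load from the packed address is always
 * misaligned; the remaining 31 bits are free to carry information. */
uint32_t pack_ac_payload(uint32_t packed_data)
{
    return packed_data | 1u;
}
```

Whether the #AC is actually delivered in preference to a page fault for the same access is exactly the open question raised above, and this sketch does not try to answer it.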

Post by LtG »

Antti wrote:For example, what could be done with an Alignment Check (#AC) exception? It is always ring 3 only so kernel space could not trigger it. :D
I doubt you'll find anything with better performance than SYSCALL, but out of curiosity, is there some _real_ use for the #AC? For what purpose is it intended?

I guess you could enable it before profiling some program to see if there's mis-aligned access and then fix the code for better performance but doing so should be relatively simple for a profiler to do by analyzing the code itself.

Antti, have you checked the cost of exceptions? I'm not sure how high that is on modern CPUs; once you have that figured out, it should help narrow down what you might want to use the exceptions for.

Alternatively you could approach it from the other end, what needs are there, beyond SYSCALL..? For me everything is relatively simple since I'm planning a micro-kernel and purely messages (for now at least) so SYSCALL is the only one I need and AFAIK it has the best performance.

Though curious if you can find something useful with #PF, #AC and friends.

Post by onlyonemac »

LtG wrote:While I don't have a strong opinion on the use of paging as a form of syscall, for the "magic":
- Anyone using assembly deserves it
- Anyone not commenting their code deserves it

Just in case you get the wrong impression, I'm not against assembly, but unless there's a good reason to use it, don't.

For the "normal" case, the above "page fault magic syscall" would be hidden in a C level syscall function so it shouldn't bother anyone. Just not sure if I see a benefit compared to using a SYSCALL. Antti, was there some benefit other than doing it in an unconventional way?
What if you're looking at a disassembly in a debugger, and you see an invalid "mov" and think "that must be the problem"? Unless you know that it's a syscall, and keep this in mind whenever you're reading disassemblies, you're asking for confusion.