Question about system calls

Posted: Wed Jul 03, 2019 9:46 am
by 0xBADC0DE
I have just finished writing system calls in my operating system, but I am unsure whether I have done it correctly. I've googled many things, but cannot find any definitive answers.

My first question is: do system calls always have to generate software interrupts? I am using interrupt 0x80 as Linux does. I have coded a system call handler, part in C and part in Assembly, that takes a system call number from the eax (32 bit OS) register (which is stored in a struct which is populated from the register values pushed onto the stack when an interrupt occurs), and then uses that as an index into a system call table. It then sets up the stack to pass all the values to the functions. This works perfectly. Initially, I was testing this using inline assembly, but once I got that sorted, I moved on to actually writing the system calls. So, for example, I got a system call "sys_open", which just calls "vfs_open", and returns the FILE structure associated with that file. In user mode programs, a user would typically use "fopen", so in my case, I just made "fopen" call "sys_open" (which subsequently calls "vfs_open"). In this way, there is some sort of protection, as the kernel can still check for certain errors/bugs/whatever. But I am unsure if this is the correct way of doing this. Is it OK that I just have functions calling other functions, instead of just generating interrupts, or is there a better way that I am supposed to be doing this? If I do have to use interrupts for all system calls, do part(s) of the system call have to be coded in assembly, or can it all be done in C?

In this approach I have taken, there is no switching to kernel mode, which leads onto my next question: do I really need to switch to kernel mode? And if so, how? Or is it possible just to set things up in the kernel initially, and then switch to and stay in user mode the entire time (I feel like this is similar to one of the kernel designs)? If I do have to switch to kernel mode each time, I haven't found anything explaining how that would be done (I have managed to switch to user mode successfully, although generating an interrupt in user mode does cause a double fault, and subsequently, a crash. Any ideas why?). I understand that user mode is restricted from doing certain things that only the kernel can do. I read that this includes IO port access, but I still seem to be able to read from ports in user mode (unless I need to still set the IOPL?).

I currently have no virtual memory manager, but would like to implement this in the future. I know that memory management is the job of the kernel, so wouldn't this require switching to kernel mode, for example when mapping different pages in and out?

I thought about just hiding some functions from the user, or rather, do some sort of checking inside functions, such as system calls, to check whether the process has sufficient privileges etc, but I'm not sure if this is thorough enough, or the most correct way, of preventing privileged operations from being run in user mode instead of kernel mode.

The code can be found at https://github.com/aaron2212/XOS. It is under the branch "system-calls"

Re: Question about system calls

Posted: Wed Jul 03, 2019 1:30 pm
by nullplan
0xBADC0DE wrote:My first question is: do system calls always have to generate software interrupts?
No, you don't have to, but it is likely going to be very similar. First of all, there are the SYSCALL and SYSENTER instructions, which are sufficiently documented elsewhere. These basically cause software interrupts in a funny hat.

Also, there is the possibility of creating a fault. Linux AMD64 had a mechanism called "vsyscalls", where user processes would basically "call" magic addresses (in kernel space). This would fail with a page fault, which the kernel would then handle by running the requested syscall. But that mechanism was discontinued for performance reasons.
0xBADC0DE wrote:So, for example, I got a system call "sys_open", which just calls "vfs_open", and returns the FILE structure associated with that file.
Oh god, I seriously hope that FILE* is not part of your kernel ABI. Because usually that is userspace stuff, at least most of it. UNIX traditionally worked with file descriptors, and Windows uses HANDLEs, which encapsulates the problem better.
0xBADC0DE wrote:But I am unsure if this is the correct way of doing this. Is it OK that I just have functions calling other functions, instead of just generating interrupts, or is there a better way that I am supposed to be doing this? If I do have to use interrupts for all system calls, do part(s) of the system call have to be coded in assembly, or can it all be done in C?
You are going to have to write the entry into the kernel and the exit to the userspace in assembly. But I am uncertain as to why you think "functions calling other functions" would be a problem. That is, in fact, encouraged.
0xBADC0DE wrote:In this approach I have taken, there is no switching to kernel mode
Yes there is, once you execute the softint.
0xBADC0DE wrote:(I have managed to switch to user mode successfully, although generating an interrupt in user mode does cause a double fault, and subsequently, a crash. Any ideas why?)
You set kernel ESP to 0 in the TSS. And you say you have no vmem, so kernel stack is at the top of address space? At 4GB and below? Well, isn't there usually IO memory there? So the soft int tries to write into IO memory, which might fault, hence the DF.
0xBADC0DE wrote:I read that this includes IO port access, but I still seem to be able to read from ports in user mode (unless I need to still set the IOPL?).
I do hope you haven't set IOPL to 3, because then user mode can do anything.
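For reference, IOPL lives in bits 12 and 13 of EFLAGS, so a raw EFLAGS value can be decoded by hand or with a couple of lines of C. The sketch below is only an illustration: the helper names are made up, and it models just the CPL versus IOPL comparison. When CPL is greater than IOPL, the CPU additionally consults the I/O permission bitmap in the TSS, which this sketch does not cover.

```c
#include <stdint.h>

/* EFLAGS bit positions (architectural). */
#define EFLAGS_IOPL_SHIFT 12
#define EFLAGS_IOPL_MASK  (3u << EFLAGS_IOPL_SHIFT)

/* Hypothetical helper: extract the I/O privilege level (0..3) from EFLAGS. */
static inline unsigned eflags_iopl(uint32_t eflags)
{
    return (eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
}

/* Hypothetical helper: port instructions bypass the TSS I/O bitmap check
 * only when the current privilege level is numerically <= IOPL. */
static inline int port_io_unrestricted(unsigned cpl, uint32_t eflags)
{
    return cpl <= eflags_iopl(eflags);
}
```

With IOPL at 0, ring 3 code fails this check, while ring 0 code always passes it.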
0xBADC0DE wrote:I thought about just hiding some functions from the user, or rather, do some sort of checking inside functions, such as system calls, to check whether the process has sufficient privileges etc, but I'm not sure if this is thorough enough, or the most correct way, of preventing privileged operations from being run in user mode instead of kernel mode.
The usual way to go about this is to use virtual memory (even with identity mapping, for starters), to make it so user mode code can only access its own memory. And just RAM. Then user code always has to call the kernel to do anything besides write into its own memory. Then the kernel has all the data it needs to make a decision. User mode code literally cannot do anything without the kernel's approval.

Re: Question about system calls

Posted: Thu Jul 04, 2019 4:50 am
by 0xBADC0DE
@nullplan thanks for the reply!

I am not planning on using SYSENTER/SYSEXIT etc, so I'll just stick with software interrupts. So when an interrupt is generated, the CPU switches to kernel mode automatically?

And regarding "vfs_open", yea, I do have it as returning a FILE structure. In there I store the file's descriptor, but I didn't realize that "vfs_open" should not return a FILE structure, but just the file descriptor instead. I'll change this, thanks for that :D

And as for writing system calls, say I have "sys_open" again. Will the initial part of "sys_open" be written in assembly, as in "fopen" would call "sys_open", which is an assembly function which sets up the stack, puts the appropriate system call number in the eax register, and then calls the actual C function that opens the file (through the use of a system call table, with eax as the index)? And then for the final part, would I just use my "switch_to_user_mode" function to get back to user mode?

And I have not explicitly set IOPL to 3, so I don't think it should be. I checked the value of the eflags register in qemu by doing "print $eflags", and I get a result of 0x2. I checked https://wiki.osdev.org/CPU_Registers_x8 ... S_Register and it seems that my IOPL is set to 0 in the eflags register. I also did "info registers" in qemu, and I am getting a DPL of 0 for all the segment registers even though I have made the switch to user mode. Perhaps my "enter_usermode" function is incorrect? I am using inline assembly for it, and it looks like the following:

Code:

asm(
        "cli\n"
        "mov $0x23, %ax\n" // User mode data selector is 0x20. Add 3 for user mode (ring 3)
        "mov %ax, %ds\n"
        "mov %ax, %es\n"
        "mov %ax, %fs\n"
        "mov %ax, %gs\n"
        ""
        "mov %esp, %eax\n"
        "pushl $0x23\n"
        "pushl %eax\n"
        "pushf\n"
        "pushl $0x1b\n"
        "push $1f\n"
        "iret\n"
        "1:\n"
        ""
    );
And as for kernel ESP, what should I set that to? In my start.asm file, I have a symbol called "stack_space", which I think is just the start of the stack area for the kernel, but I'm not sure about that. So I tried using that symbol as the kernel ESP, but I am still getting a DF. I remembered that my kernel's entry point is at 0x100000, so I tried setting the kernel ESP to a number like 0x105000. That didn't cause a crash, but now the system call is not being called when I do

Code:

asm(
        "mov $0, %eax\n"
        "int $0x80"
    );
The system is just hanging and is never entering the shell. I know that would be a silly address to use, since there could be anything there, and I have not reserved space for the kernel stack. Is this something I need to do?
My operating system's memory looks like this at the moment:

Code:

kernel (0x100000 entry point) -> heap (allocated as 5% of memory, not sure if this is the best way to do it) -> file system (loaded as a GRUB module, and then copied over into memory) -> user space code. 
So then would I have to explicitly reserve some space (after the heap for example) for a kernel stack? If so, what is a recommended size to make the stack? I know that the stack grows in user mode processes, but I'm thinking it should be kept as a fixed size, so that I don't have to keep moving other things around in memory.
Thanks for the help

Re: Question about system calls

Posted: Thu Jul 04, 2019 10:59 am
by nullplan
0xBADC0DE wrote:I am not planning on using SYSENTER/SYSEXIT etc, so I'll just stick with software interrupts. So when an interrupt is generated, the CPU switches to kernel mode automatically?
Depends. x86 is extremely flexible there -- much to its detriment, because all that flexibility has to be configured, and has to be read and processed by the processor. In this case, when you execute "int 128", first the CPU will check the DPL of the IDT entry, to see if you were even allowed to call that int. Therefore, for the syscall interrupt, you must set the DPL to 3.

Then the CPU will load CS with the segment reference out of the IDT entry. If that segment is a CPL 0 segment, then that's what it switches to.
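To make the DPL requirement concrete, here is a rough sketch of how a 32-bit IDT gate could be encoded so that ring 3 is allowed to execute "int 128". The struct and helper names are invented for illustration, and the kernel code selector 0x08 used in the usage note is an assumption (the thread only shows the user selectors).

```c
#include <stdint.h>

/* 32-bit IDT gate descriptor layout (8 bytes). */
struct idt_gate {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector loaded into CS */
    uint8_t  zero;         /* reserved, must be 0 */
    uint8_t  type_attr;    /* present bit | DPL | gate type */
    uint16_t offset_high;  /* handler address, bits 16..31 */
} __attribute__((packed));

#define GATE_PRESENT 0x80u
#define GATE_DPL(n)  (((n) & 3u) << 5)
#define GATE_INT32   0x0Eu   /* 32-bit interrupt gate */

/* Hypothetical helper: build a gate for `handler`, callable from privilege
 * level `dpl` or anything more privileged. */
static struct idt_gate make_gate(uint32_t handler, uint16_t selector, unsigned dpl)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.zero        = 0;
    g.type_attr   = (uint8_t)(GATE_PRESENT | GATE_DPL(dpl) | GATE_INT32);
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

Under these assumptions, the syscall entry would be something like make_gate(handler_address, 0x08, 3), giving an attribute byte of 0xEE. Hardware interrupts and exceptions would keep DPL 0 (0x8E) so user code cannot raise them with an int instruction.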
0xBADC0DE wrote:And regarding "vfs_open", yea, I do have it as returning a FILE structure. In there I store the file's descriptor, but I didn't realize that "vfs_open" should not return a FILE structure, but just the file descriptor instead.
Depends. "vfs_open" is an implementation detail. It can return whatever you like, as long as you know that it cannot return a userspace type. "sys_open", on the other hand, is public ABI, and should return something that can be used to write and read the file. Not a FILE*, however, because that's a type name reserved for libc.
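One common shape for this is a per-process descriptor table: "sys_open" hands back a small integer, and the kernel maps it to its internal object on every later call. The following is a minimal sketch, not the layout of any particular kernel; struct vfs_file, struct process, MAX_FDS and both helpers are hypothetical names.

```c
#include <stddef.h>

#define MAX_FDS 16   /* assumed per-process limit, chosen arbitrarily */

/* Kernel-internal file object; user space never sees this type. */
struct vfs_file {
    unsigned long offset;   /* placeholder field for the sketch */
};

struct process {
    struct vfs_file *fds[MAX_FDS];   /* index = file descriptor */
};

/* Store a kernel file object in the lowest free slot.
 * Returns the descriptor, or -1 if the table is full. */
static int fd_install(struct process *p, struct vfs_file *f)
{
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (p->fds[fd] == NULL) {
            p->fds[fd] = f;
            return fd;
        }
    }
    return -1;
}

/* Translate a descriptor coming from user space back into the kernel
 * object. Returns NULL for out-of-range or unused descriptors. */
static struct vfs_file *fd_lookup(const struct process *p, int fd)
{
    if (fd < 0 || fd >= MAX_FDS)
        return NULL;
    return p->fds[fd];
}
```

In this arrangement "sys_open" would call "vfs_open", pass the result to fd_install(), and return the integer. A libc fopen() can then wrap that integer in its own FILE structure entirely in user space.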
0xBADC0DE wrote:Will the initial part of "sys_open" be written in assembly, as in "fopen" would call "sys_open", which is an assembly function which sets up the stack, puts the appropriate system call number in the eax register, and then calls the actual C function that opens the file (through the use of a system call table, with eax as the index)? And then for the final part, would I just use my "switch_to_user_mode" function to get back to user mode?
Usual way to do this is to have fopen() call a userspace helper that calls the syscall. Something like:

Code:

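#include <fcntl.h>   /* O_CREAT (from this libc's own headers) */
#include <stdarg.h>  /* va_list, va_start, va_arg, va_end */

int open(const char *fn, int flags, ...) {
  va_list ap;
  int mode = 0;
  /* The mode argument is only passed when the file may be created. */
  if (flags & O_CREAT) {
    va_start(ap, flags);
    mode = va_arg(ap, int);
    va_end(ap);
  }
  long r;
  /* eax = syscall number, ebx/ecx/edx = arguments; the result comes back in eax. */
  asm volatile("int $128" : "=a"(r) : "a"(__NR_open), "b"(fn), "c"(flags), "d"(mode) : "memory");
  return __syscall_ret(r);
}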
And then __syscall_ret() extracts errno info if need be.
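What that errno extraction might look like depends on the convention the kernel picks. The sketch below assumes the Linux-style one, where a return value in the range -4095 to -1 is a negated error code and everything else is a valid result. The function is named syscall_ret here only to stay out of the reserved double-underscore namespace; MAX_ERRNO is likewise an assumed constant.

```c
#include <errno.h>

/* Assumed convention: results in [-MAX_ERRNO, -1] are negated errno codes. */
#define MAX_ERRNO 4095

/* Convert a raw kernel return value into the usual libc form:
 * on error set errno and return -1, otherwise pass the value through. */
static long syscall_ret(long r)
{
    /* As an unsigned value, the error range sits at the very top. */
    if ((unsigned long)r >= (unsigned long)-MAX_ERRNO) {
        errno = (int)-r;
        return -1;
    }
    return r;
}
```

The unsigned comparison keeps large legitimate results (for example, high addresses returned by a memory-mapping call) from being mistaken for errors.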

On the other side, you have the int 128 handler in the kernel, which does something like:

Code:

int128:
  addl $-44, %esp
  movl %gs, 40(%esp)
  movl %fs, 36(%esp)
  movl %es, 32(%esp)
  movl %ds, 28(%esp)
  movl %esi, 24(%esp)
  movl %edi, 20(%esp)
  movl %edx, 16(%esp)
  movl %ecx, 12(%esp)
  movl %ebx, 8(%esp)
  movl %eax, 4(%esp)

  movl $KERNEL_DS, %eax
  movl %eax, %ds
  movl %eax, %es
  movl %eax, %fs
  movl %eax, %gs

  leal 4(%esp), %eax
  movl %eax, (%esp)
  call sys_main

  movl 28(%esp), %eax
  movl %eax, %ds
  movl 32(%esp), %eax
  movl %eax, %es
  movl 36(%esp), %eax
  movl %eax, %fs
  movl 40(%esp), %eax
  movl %eax, %gs

  movl 4(%esp), %eax
  movl 8(%esp), %ebx
  movl 12(%esp), %ecx
  movl 16(%esp), %edx
  movl 20(%esp), %edi
  movl 24(%esp), %esi

  addl $44, %esp
  iretl
If you want to support signals later, add more code after the call instruction above, testing if the current process has signals pending, and calling a handler function for that, too.

Anyway, the kernel then needs the main syscall function:

Code:

struct regs {
  long ax, bx, cx, dx, di, si, ds, es, fs, gs, ip, cs, flags, sp, ss;
};

static long sys_ni_syscall(void) { return -ENOSYS; }

static long (*const syscall_table[NUM_SYSCALLS])() = {
  /* Using a GCC extension here (range designator; needs spaces around "...") */
  [0 ... NUM_SYSCALLS - 1] = sys_ni_syscall,
  [__NR_open] = sys_open,
  [__NR_foo] = sys_foo,
  [__NR_bar] = sys_bar,
  /* etc. pp. */
};

void sys_main(struct regs* r)
{
  /* Compare as unsigned so a negative number from user space cannot index before the table. */
  if ((unsigned long)r->ax < NUM_SYSCALLS)
    r->ax = syscall_table[r->ax](r->bx, r->cx, r->dx, r->di, r->si);
  else
    r->ax = -ENOSYS;
}
Again, sys_main() is bound to become more complicated, once you add code to track the time spent in system or user space, and add a facility to track which system calls a process calls.
0xBADC0DE wrote:I am getting a DPL of 0 for all the segment registers even though I have made the switch to user mode
Evidently not. Did you load the correct segments? Your enter_usermode() function has a few problems, though: The usermode stack will be the same as the kernel mode one (bad idea, usually), and the code will continue in the same place. Usually you'd allocate a separate stack for the user to use and screw up to their liking, and take as code pointer something user-specified as argument. Also, you first CLI, then PUSHF, meaning the IF will be disabled in the flags you pushed. So the iret will also not set it, meaning you enter user mode with interrupts disabled. Really bad idea. Usually, you'd

Code:

pushfl
orl $(1<<9), (%esp)
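Putting those points together, the frame that iret pops on a return to ring 3 consists of five values: EIP, CS, EFLAGS, ESP and SS. The sketch below builds them as a plain C struct so the contents are easy to inspect. It is an illustration only: the struct and function names are hypothetical, the selectors 0x1b and 0x23 are the ones from the enter_usermode code above, and the entry point and user stack are whatever the caller supplies.

```c
#include <stdint.h>

#define EFLAGS_RESERVED1 (1u << 1)   /* bit 1 always reads as 1 */
#define EFLAGS_IF        (1u << 9)   /* interrupt enable flag */

#define USER_CS 0x1Bu   /* user code selector with RPL 3, as in the post */
#define USER_DS 0x23u   /* user data selector with RPL 3, as in the post */

/* What iret pops when returning to a lower privilege level,
 * lowest address (top of stack) first. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

/* Hypothetical helper: describe a first entry into user mode at `entry`,
 * running on a separate user stack, with interrupts enabled. */
static struct iret_frame make_user_frame(uint32_t entry, uint32_t user_stack_top)
{
    struct iret_frame f;
    f.eip    = entry;
    f.cs     = USER_CS;
    f.eflags = EFLAGS_RESERVED1 | EFLAGS_IF;   /* IF set, IOPL left at 0 */
    f.esp    = user_stack_top;
    f.ss     = USER_DS;
    return f;
}
```

The assembly stub would push these fields in reverse order (SS first, EIP last) and then execute iret.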
0xBADC0DE wrote:And as for kernel ESP, what should I set that to? In my start.asm file, I have a symbol called "stack_space", which I think is just the start of the stack area for the kernel, but I'm not sure about that.
stack_space is in fact the stack top that you need. That's what you can use there. But that means you abandon your stack once you enter user mode, which is normal, you just have to be aware of it.

Later on, allocate a new kernel stack for each new thread. You switch them out in the TSS when you switch processes. And your boot stack can be the stack for the init process, which never ends, anyway, right?
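A rough sketch of that per-thread arrangement, assuming software task switching with a single TSS: each thread owns a fixed-size kernel stack, and the scheduler rewrites esp0 whenever it switches threads. All names here (struct tss32, struct thread, KSTACK_SIZE, tss_switch_to) are made up for illustration, the 4 KiB stack size is a placeholder rather than a recommendation, and the kernel data selector 0x10 in the usage note is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit TSS (104 bytes). Only esp0/ss0 matter for ring 3 -> ring 0 entry. */
struct tss32 {
    uint32_t prev_task;
    uint32_t esp0;        /* stack pointer loaded when entering ring 0 */
    uint32_t ss0;         /* stack segment loaded together with esp0 */
    uint32_t unused[22];  /* remaining fields, not needed by this sketch */
    uint16_t trap;
    uint16_t iomap_base;
} __attribute__((packed));

#define KSTACK_SIZE 4096u   /* assumed size, not a recommendation */

struct thread {
    uint8_t kstack[KSTACK_SIZE] __attribute__((aligned(16)));
};

static struct tss32 tss;

/* x86 stacks grow downwards, so the initial stack pointer is the address
 * one past the end of the buffer. The cast is lossless on a 32-bit target. */
static uint32_t kstack_top(const struct thread *t)
{
    return (uint32_t)(uintptr_t)(t->kstack + KSTACK_SIZE);
}

/* Called by the scheduler when `t` becomes the running thread. */
static void tss_switch_to(const struct thread *t, uint16_t kernel_ds)
{
    tss.ss0  = kernel_ds;
    tss.esp0 = kstack_top(t);
}
```

A context switch would then include a call such as tss_switch_to(next, 0x10), so the next interrupt or system call from user mode lands on that thread's own kernel stack.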