pass parameter in a syscall

ITchimp · Post by **ITchimp** » Mon Jun 05, 2023 5:15 pm

I was always under the impression that in x86, syscall parameters are passed using eax, ebx, ecx, edx, edi, esi, but lately I was reading a virtualization paper in which the syscall parameters are pushed onto the user-mode stack. It was done in BSD... how could this work? After you execute int 0x80, the stack pointer switches to the kernel stack, and five values (ss3, esp3, eflags, cs3, eip3) are first pushed onto the kernel stack... does bsd use esp0 to retrieve syscall parameters?

Code: Select all

; Allegedly, this is bsd's syscall open()
open:
push dword mode
push dword flags
push dword path
mov eax, 5      ; syscall number for open
push eax        ; dummy dword in the slot where a return address would sit
int 80h

Octocontrabass · Post by **Octocontrabass** » Mon Jun 05, 2023 5:55 pm

ITchimp wrote:I was always under the impression that in x86, syscall parameters are passed using eax, ebx, ecx, edx, edi, esi,

That's one way to do it, but it's not enforced by the CPU architecture. Linux uses those six registers plus EBP.

ITchimp wrote:does bsd use esp0 to retrieve syscall parameters?

I think you mean ESP3, and yes, it does use the ring 3 stack pointer to retrieve the parameters.
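A minimal sketch of what "using ESP3" can look like, assuming a 32-bit trap frame laid out the way the CPU pushes it on a ring 3 to ring 0 transition, and the calling sequence from the first post (arguments pushed right to left, then one extra dword on top). User memory is simulated with a plain array so the code runs as an ordinary program; `trap_frame`, `user_mem`, `user_read32`, `user_push` and `fetch_stack_arg` are made-up names for illustration, not BSD's.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the CPU pushes on the kernel stack for int 0x80 from ring 3,
 * listed from the lowest address (pushed last) to the highest. */
struct trap_frame {
    uint32_t eip;     /* user return address */
    uint32_t cs;      /* user code segment */
    uint32_t eflags;
    uint32_t esp;     /* user stack pointer ("ESP3") */
    uint32_t ss;      /* user stack segment */
};

/* Stand-in for the user address space: user address N is user_mem[N]. */
static uint8_t user_mem[256];

/* Read a dword from simulated user memory. A real kernel would use a
 * fault-tolerant copy routine here instead of an assert and memcpy. */
static uint32_t user_read32(uint32_t addr)
{
    uint32_t v;
    assert(addr <= sizeof user_mem - sizeof v);
    memcpy(&v, &user_mem[addr], sizeof v);
    return v;
}

/* Simulate "push dword v" on the user stack. */
static void user_push(struct trap_frame *tf, uint32_t v)
{
    tf->esp -= 4;
    memcpy(&user_mem[tf->esp], &v, sizeof v);
}

/* Argument i sits just above the extra dword the caller pushed last. */
static uint32_t fetch_stack_arg(const struct trap_frame *tf, unsigned i)
{
    return user_read32(tf->esp + 4 + 4 * i);
}
```

The kernel never has to switch back to the user stack: the saved `esp` in the frame is just a user address, and the arguments are read through it like any other user memory.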

ITchimp · Post by **ITchimp** » Mon Jun 05, 2023 6:06 pm

but how could a stack based approach work? isn't the user stack not accessible to kernel?
or maybe manipulate esp3 directly?

Octocontrabass · Post by **Octocontrabass** » Mon Jun 05, 2023 6:13 pm

ITchimp wrote:isn't the user stack not accessible to kernel?

Why wouldn't the kernel be able to access the user stack?

nullplan · Post by **nullplan** » Mon Jun 05, 2023 9:28 pm

The problem with this approach is that it is memory based. And the kernel must be careful in accessing user memory, always be prepared that the pointer given might point to Nirvana. So in this case, the kernel must already call get_user() (or something like it) six times just to read the arguments. Another issue is that failure to read those arguments is not necessarily a mistake. The call could be well formed, and the stack is just very empty. You would actually have to figure out how many arguments a call has before reading them.

So that's why it is typically a better idea to have the arguments in registers.

thewrongchristian · Post by **thewrongchristian** » Tue Jun 06, 2023 2:32 pm

nullplan wrote:The problem with this approach is that it is memory based. And the kernel must be careful in accessing user memory, always be prepared that the pointer given might point to Nirvana. So in this case, the kernel must already call get_user() (or something like it) six times just to read the arguments. Another issue is that failure to read those arguments is not necessarily a mistake. The call could be well formed, and the stack is just very empty. You would actually have to figure out how many arguments a call has before reading them.

So that's why it is typically a better idea to have the arguments in registers.

If the syscall arguments are packaged into a structure, you only have to validate and get_user for the structure itself, if at all.

I'm in the process of redesigning my syscall interface to do just that, using structures generated at compile time to marshal syscall parameters into structures, and pass the pointers to the structure in registers.

This allows me to generalise the syscall mechanism making fewer architecture dependent assumptions (like number of registers, stack frame layout etc.) and make most of the system call marshalling code platform agnostic. The structure can be copied atomically from user to kernel space.

Now, this is all slower than passing the equivalent arguments in registers, that's for sure, but in theory I can still pass the entire arguments structure in the registers. Something like, for the read syscall:

Code: Select all

union syscall_args
{
  /* Mapped to x86 registers ebx, ecx, edx, edi, esi, ebp */
  uintptr_t regs[6];
  struct {
    int fd;
    void * buffer;
    size_t count;
  } read_args;
  /* Add other syscalls here */
  ...
};

At the user side, read() will fill in the read_args structure, then pass the whole thing to the syscall dispatcher, which will put each of syscall_args::regs into registers.

In the kernel, we do the opposite, putting the registers into the kernel side syscall_args::regs, the result being the syscall arguments passed to the kernel without doing an explicit memory copy.

The kernel will still have to validate the buffer pointer, for example, but we'd have to do that anyway. And the interface for all this can be hidden in the platform specific code, so the packing/unpacking of memory to/from registers can be done generically once per platform, and the syscall argument structures populated and used with generated code from the syscall definitions.
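A rough sketch of that round trip, assuming six argument registers and simulating the register file with a struct so it runs in user space. Only the read arguments are included in the union here; `cpu_regs`, `user_marshal` and `kernel_unmarshal` are hypothetical names, not taken from any real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

union syscall_args {
    uintptr_t regs[6];          /* one slot per argument register */
    struct {
        int fd;
        void *buffer;
        size_t count;
    } read_args;
};

/* Every argument struct has to fit in the register image; if one did
 * not, the union would grow past six words and this would fire. */
static_assert(sizeof(union syscall_args) == 6 * sizeof(uintptr_t),
              "argument struct larger than the register image");

/* Simulated registers that survive the user -> kernel transition. */
struct cpu_regs {
    uintptr_t r[6];
};

/* User side: load the six argument registers from the union. */
static void user_marshal(const union syscall_args *a, struct cpu_regs *cpu)
{
    for (size_t i = 0; i < 6; i++)
        cpu->r[i] = a->regs[i];
}

/* Kernel side: rebuild the union from the saved registers. */
static union syscall_args kernel_unmarshal(const struct cpu_regs *cpu)
{
    union syscall_args a;
    for (size_t i = 0; i < 6; i++)
        a.regs[i] = cpu->r[i];
    return a;
}
```

Zero-initialising the union before filling in `read_args` keeps stale user data out of the unused registers. The `static_assert` is where a struct with more than six words of arguments would be caught at compile time.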

In fact, this is exactly what I'm doing at the moment. Syscall functions in my source are annotated with a macro which defines the system call name, and the structure for the arguments is based on the function arguments:

Code: Select all

SYSCALL(read) ssize_t file_read(int fd, void * buf, size_t count)
{
 ...
}

Generates:

Code: Select all

struct read_params_t {
    int fd;
    void *buf;
    size_t count;
};

which can be serialised to/from the registers quite easily, along with eax to specify the system call number, but all that is hidden from the generated code.

By passing the arguments in registers, the code generator would have to be much more complex, and be able to, for example, marshal 64-bit parameters across multiple 32-bit registers. Not impossible, but more effort than I can be bothered with at this time.

nullplan · Post by **nullplan** » Wed Jun 07, 2023 9:46 am

thewrongchristian wrote:If the syscall arguments are packaged into a structure, you only have to validate and get_user for the structure itself, if at all.

Yes, but then you have a different length for each syscall. So now you need to know the length of the structure before calling the syscall, or else get the structure in each syscall. You could conceivably have the length be an element of the structure, but then you have to determine if the length fits the requested call.

thewrongchristian wrote:The structure can be copied atomically from user to kernel space.

I doubt the claim of atomicity, but yes, you can copy the whole thing.

thewrongchristian wrote: /* Add other syscalls here */

Not gonna lie, I am not a fan of the union with 1000 substructures. You know Linux is up to almost 500 syscalls by now, right? And sure, many of them are legacy, but that is still hundreds of things in there.

thewrongchristian wrote:By passing the arguments in registers, the code generator would have to be much more complex, and be able to, for example, marshal 64-bit parameters across multiple 32-bit registers. Not impossible, but more effort than I can be bothered with at this time.

Meanwhile, how does Linux solve this? Userspace puts arguments into registers. All arguments are of type "long" (except on x32, where it is "long long", and you mustn't sign-extend pointers in the conversion). There are up to six arguments on all architectures, some support seven (I think it was only MIPS). Therefore, all syscalls are designed to require at most six arguments. If more are needed, some must be passed through memory.

A bunch of macros is used to hide the complexity of calling on the userspace side, so that a libc can portably define

Code: Select all

ssize_t read(int fd, void *buf, size_t len) { return syscall(SYS_read, fd, buf, len); }

On x86_64, this will expand (after finitely many steps) to

Code: Select all

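long __syscall_ret(unsigned long x);

ssize_t read(int fd, void *buf, size_t len) {
  long ret;
  /* The syscall instruction itself overwrites rcx and r11, so they must be listed as clobbers. */
  __asm__ volatile("syscall" : "=a"(ret) : "a"(SYS_read), "D"(fd), "S"(buf), "d"(len) : "rcx", "r11", "memory", "cc");
  return __syscall_ret(ret);
}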

long __syscall_ret(unsigned long x) {
  if (x > -4096UL) {
    errno = -x;
    x = -1UL;
  }
  return x;
}

That second function just for illustration. It's the same everywhere.
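To make the error window concrete, here is a standalone variant of that helper that can be compiled and poked at without making a real system call. It is renamed `syscall_ret` to stay out of the reserved-identifier namespace; the logic is the same comparison against -4096UL.

```c
#include <assert.h>
#include <errno.h>

/* -1 must land inside the error window for the comparison below to work. */
static_assert((unsigned long)-1 > -4096UL, "-1 lies inside the error window");

/* Map a raw kernel return value to the libc convention: raw values in
 * [-4095, -1] are negated error codes, anything else is a real result. */
static long syscall_ret(unsigned long raw)
{
    if (raw > -4096UL) {
        errno = (int)-raw;   /* unsigned negation recovers the positive code */
        return -1;
    }
    return (long)raw;
}
```

The single unsigned comparison is what lets large results that merely look negative when viewed as signed (high addresses returned by a mapping call, for instance) pass through untouched, since only the top 4095 values are treated as errors.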

On the kernel side, there is a syscall table, admittedly autogenerated, that looks something like

Code: Select all

typedef long syscall_t();
static syscall_t *const syscall_tbl[__NR_syscalls] = {
[0 ... __NR_syscalls-1] = sys_ni_syscall,
...
[SYS_read] = sys_read,
...
};

The code that calls this is arch specific, but then, it is interrupt handling code, so that is always arch specific. For x86_64, something like:

Code: Select all

void handle_syscall(struct regs *regs) {
  if (regs->rax < __NR_syscalls)
    regs->rax = syscall_tbl[regs->rax](regs->rdi, regs->rsi, regs->rdx, regs->r10, regs->r8, regs->r9);
  else
    regs->rax = -ENOSYS;
}

In order for this to work, an ABI must be used that is OK with passing too many arguments. Luckily, all Linux ABIs are such ABIs. Windows stdcall would not work here.
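For experimenting outside a kernel, here is a sketch of the same table dispatch that sidesteps the ABI question: every handler is declared with six `long` parameters and ignores the ones it does not need, so no call passes more arguments than the callee declares. That is a deliberate simplification, not how the real table is declared. The syscalls `sys_add3` and `sys_getanswer`, their numbers, and the cut-down `struct regs` are invented for the example, and empty table slots are NULL instead of pointing at a `sys_ni_syscall`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { SYS_add3 = 0, SYS_getanswer = 1, NR_SYSCALLS = 4 };

/* Only the registers the dispatcher looks at. */
struct regs {
    long rax, rdi, rsi, rdx, r10, r8, r9;
};

typedef long syscall_fn(long, long, long, long, long, long);

/* Uses three of its six arguments. */
static long sys_add3(long a, long b, long c, long d, long e, long f)
{
    (void)d; (void)e; (void)f;
    return a + b + c;
}

/* Uses none of its arguments. */
static long sys_getanswer(long a, long b, long c, long d, long e, long f)
{
    (void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
    return 42;
}

/* Slots without an initializer are NULL, meaning "not implemented". */
static syscall_fn *const syscall_tbl[NR_SYSCALLS] = {
    [SYS_add3]      = sys_add3,
    [SYS_getanswer] = sys_getanswer,
};

static_assert(sizeof syscall_tbl / sizeof syscall_tbl[0] == NR_SYSCALLS,
              "one slot per syscall number");

static void handle_syscall(struct regs *r)
{
    /* Unsigned compare, so a negative number in rax is rejected too. */
    unsigned long nr = (unsigned long)r->rax;

    if (nr < NR_SYSCALLS && syscall_tbl[nr])
        r->rax = syscall_tbl[nr](r->rdi, r->rsi, r->rdx, r->r10, r->r8, r->r9);
    else
        r->rax = -ENOSYS;
}
```

The cost of the uniform signature is boilerplate in every handler; relying on the ABI to tolerate extra arguments, as described above, avoids that boilerplate.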

In order to marshal 64-bit arguments on 32-bit platforms, each platform has its own definitions. In general, though, the argument is likely to be split into two registers, and possibly padded. E.g. on PowerPC, such arguments are passed in an even/odd register pair, with the higher half in the even register. So userspace can define

Code: Select all

off_t lseek(int fd, off_t off, int whence) {
#ifdef SYS__llseek
  int ret = syscall(SYS__llseek, fd, SC_LL_E(off), &off, whence);
  if (ret) off = ret;
  return off;
#else
  return syscall(SYS_lseek, fd, off, whence);
#endif
}

Where SC_LL_E is the identity on x86_64, and on PowerPC it is defined as

Code: Select all

#define SC_LL_E(x) (long)((x) >> 32), (long)(x)
#define SC_LL_O(x) 0, SC_LL_E(x)

To fix this up in kernel space, on PowerPC the syscall table has its landing pad for llseek in an arch specific function

Code: Select all

long sys_ppc_llseek(int fd, unsigned long off_hi, unsigned long off_lo, off_t *off_ret, int whence)
{
  return sys_llseek(fd, (0ULL+off_hi) << 32 | off_lo, off_ret, whence);
}
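The split and rejoin can be checked in isolation. A small sketch, with `split64` and `join64` as illustrative names (they are not kernel or libc functions), using fixed-width types so it behaves the same on 32-bit and 64-bit hosts:

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "a 64-bit value occupies exactly two 32-bit registers");

/* User side: split a 64-bit value into the two words that go into registers. */
static void split64(uint64_t v, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(v >> 32);
    *lo = (uint32_t)v;
}

/* Kernel side: widen the high half BEFORE shifting. Shifting a 32-bit
 * value left by 32 would be undefined, which is what the 0ULL+off_hi
 * idiom in the wrapper above guards against. */
static uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Negative offsets survive the trip as well, because the halves are moved as plain bit patterns and only reinterpreted as signed after rejoining.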

There, that is it. That is the entire extent of syscall parameter passing in Linux. The syscall dispatcher is arch specific, as is the syscall table, and maybe a few arch specific wrappers. But the syscalls themselves are arch independent (unless, of course, the syscall itself is arch-dependent: modify_ldt(), for example, only exists on x86). And in userspace, you only need a couple of macros to run the syscall instruction with a variety of parameters, and they differ only in the number of arguments, not in their types.

The problem of passing a 64-bit number directly to a syscall occurs surprisingly rarely. Most of the time, you pass such numbers through memory again.

thewrongchristian · Post by **thewrongchristian** » Wed Jun 07, 2023 4:02 pm

nullplan wrote: The problem of passing a 64-bit number directly to a syscall occurs surprisingly rarely. Most of the time, you pass such numbers through memory again.

True, it's rare, but how much of that is because it's a PITA?

To be honest, I'm torn. I like the union idea (it's generated code, and a write-once mechanism that is portable across architectures) as it makes it much easier to add syscalls as required, and to generate the user-side stub automatically, directly from the kernel function definition.

I'm not worried about the size of the union, I can just copy sizeof(syscall_args), which will copy more memory than most syscalls will require.

I also want to experiment with asynchronous syscall mechanisms, and having syscalls as effectively messages leaves registers free to provide support for the asynchronous message management.

Imagine the only syscalls being something like:

Code: Select all

status syscall(void * params, size_t sizeparams, void * result, size_t sizeresult);

* params - Is the message that encapsulates the syscall. Like the POSIX read example.
* result - Is a result pointer where the asynchronous call will deposit result information.
* status - Return status indicating failure, or whether syscall has completed already or is completing asynchronously.

No syscalls will block, other than syscalls explicitly designed to synchronise. It would be almost trivial to implement N:1 or N:M user threading on top of this.
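A toy user-space model of that interface, purely to show the shape of the control flow. Everything here is hypothetical: the single `OP_STRLEN` request, the one-slot queue, the status values, and `kernel_complete_pending()` standing in for the kernel finishing the work later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum status { ST_FAILED = -1, ST_DONE = 0, ST_PENDING = 1 };

/* Hypothetical request: compute the length of a string. */
enum { OP_STRLEN = 1 };
struct strlen_msg    { int opcode; const char *s; };
struct strlen_result { int ready; size_t len; };

/* One-slot "kernel" queue holding a copy of the submitted message. */
static struct {
    struct strlen_msg msg;
    struct strlen_result *result;
    int busy;
} pending;

/* Submit a message. Never blocks: it either queues the request or fails. */
static enum status async_syscall(const void *params, size_t sizeparams,
                                 void *result, size_t sizeresult)
{
    struct strlen_msg msg;

    if (params == NULL || result == NULL)
        return ST_FAILED;
    if (sizeparams != sizeof msg || sizeresult != sizeof(struct strlen_result))
        return ST_FAILED;
    memcpy(&msg, params, sizeof msg);   /* snapshot, as a kernel would copy in */
    if (msg.opcode != OP_STRLEN || msg.s == NULL || pending.busy)
        return ST_FAILED;

    pending.msg = msg;
    pending.result = result;
    pending.busy = 1;
    return ST_PENDING;
}

/* Stand-in for the kernel completing the queued request some time later. */
static void kernel_complete_pending(void)
{
    if (!pending.busy)
        return;
    assert(pending.result != NULL);
    pending.result->len = strlen(pending.msg.s);
    pending.result->ready = 1;
    pending.busy = 0;
}
```

A call that could be satisfied immediately would return `ST_DONE` instead of queueing. A user-level thread scheduler would sit between the submit and the check of `ready`, running another thread instead of blocking.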

Food for thought.

Gigasoft · Post by **Gigasoft** » Thu Jun 08, 2023 12:51 pm

nullplan wrote:
thewrongchristian wrote:If the syscall arguments are packaged into a structure, you only have to validate and get_user for the structure itself, if at all.
Yes, but then you have a different length for each syscall. So now you need to know the length of the structure before calling the syscall, or else get the structure in each syscall. You could conceivably have the length be an element of the structure, but then you have to determine if the length fits the requested call.

Presumably, the author of an operating system knows the number of arguments that each syscall takes, which he can put into a table. You're making this out to be much harder than it needs to be.
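Such a table might look like the sketch below, which also shows why the count matters: a one-argument call made with almost nothing on the stack is still valid, while blindly reading six words would run off the end of user memory. The syscall numbers, the 64-byte simulated address space and the helper names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ARGS 6
enum { SYS_exit = 1, SYS_read = 3, SYS_open = 5, NSYS = 6 };

/* Number of argument words per syscall; unlisted numbers take none. */
static const uint8_t sys_nargs[NSYS] = {
    [SYS_exit] = 1,
    [SYS_read] = 3,
    [SYS_open] = 3,
};

/* Simulated user address space: user address N is user_mem[N]. */
static uint8_t user_mem[64];

/* Checked copy from user memory: 0 on success, -1 if the range is bad. */
static int copy_from_user(void *dst, uint32_t uaddr, size_t len)
{
    if (uaddr > sizeof user_mem || len > sizeof user_mem - uaddr)
        return -1;
    memcpy(dst, &user_mem[uaddr], len);
    return 0;
}

/* Fetch exactly as many words as the syscall takes, in a single copy.
 * uargs is the user address of the first argument word. */
static int fetch_args(unsigned nr, uint32_t uargs, uint32_t args[MAX_ARGS])
{
    if (nr >= NSYS)
        return -1;
    assert(sys_nargs[nr] <= MAX_ARGS);
    return copy_from_user(args, uargs, sys_nargs[nr] * sizeof args[0]);
}
```

With the first argument word at address 60 of the 64-byte space, `fetch_args(SYS_exit, 60, args)` succeeds, while a three-word fetch from the same address fails the range check.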

nullplan · Post by **nullplan** » Thu Jun 08, 2023 1:30 pm

Gigasoft wrote:Presumably, the author of an operating system knows the number of arguments that each syscall takes, which he can put into a table. You're making this out to be much harder than it needs to be.

Sure, it isn't a big problem, just one more thing you have to do. It's not that it is hard, it's that it adds one more step to the whole context switch rigamarole. And the register based mechanism can make do without that.

OSDev.org
