thewrongchristian wrote: If the syscall arguments are packaged into a structure, you only have to validate and get_user for the structure itself, if at all.
Yes, but then you have a different length for each syscall. So now you need to know the length of the structure before calling the syscall, or else get the structure in each syscall. You could conceivably have the length be an element of the structure, but then you have to determine if the length fits the requested call.
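To make the "length as an element of the structure" variant concrete, here is a rough sketch. Everything in it is made up for illustration (struct args_hdr, struct read_args, fetch_args), and memcpy stands in for whatever get_user/copy_from_user equivalent the kernel actually has. Note that the header is read twice, so the size has to be pinned after the second copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>

/* Hypothetical layout: every argument struct starts with its own size. */
struct args_hdr { uint32_t size; };

struct read_args {
    struct args_hdr hdr;
    int32_t  fd;
    uint64_t buf;   /* user pointer carried as an integer */
    uint64_t len;
};

/*
 * Copy an argument struct from "user" memory into a kernel buffer.
 * expected is the size this particular syscall wants.
 * memcpy is a stand-in for the real user-copy primitive.
 * Returns 0 on success or a negative errno value.
 */
static int fetch_args(void *dst, const void *user_src, size_t expected)
{
    struct args_hdr hdr;

    if (expected < sizeof hdr)
        return -EINVAL;

    /* First copy: only the header, to learn the claimed size. */
    memcpy(&hdr, user_src, sizeof hdr);
    if (hdr.size != expected)
        return -EINVAL;

    /* Second copy: the whole struct. */
    memcpy(dst, user_src, expected);

    /* User memory may have changed between the two copies,
     * so keep the size that was actually validated. */
    ((struct args_hdr *)dst)->size = hdr.size;
    return 0;
}
```

Each syscall would then call fetch_args with sizeof its own argument struct, which is exactly the per-syscall length bookkeeping I was complaining about.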
thewrongchristian wrote: The structure can be copied atomically from user to kernel space.
I doubt the claim of atomicity, but yes, you can copy the whole thing.
thewrongchristian wrote: /* Add other syscalls here */
Not gonna lie, I am not a fan of the union with 1000 substructures. You know Linux is up to almost 500 syscalls by now, right? And sure, many of them are legacy, but that is still hundreds of things in there.
thewrongchristian wrote: By passing the arguments in registers, the code generator would have to be much more complex, and be able to, for example, marshal 64-bit parameters across multiple 32-bit registers. Not impossible, but more effort than I can be bothered with at this time.
Meanwhile, how does Linux solve this? Userspace puts arguments into registers. All arguments are of type "long" (except on x32, where it is "long long", and you mustn't sign-extend pointers in the conversion). There are up to six arguments on all architectures, some support seven (I think it was only MIPS). Therefore, all syscalls are designed to require at most six arguments. If more are needed, some must be passed through memory.
A bunch of macros is used to hide the complexity of calling on the userspace side, so that a libc can portably define
Code:
ssize_t read(int fd, void *buf, size_t len) { return syscall(SYS_read, fd, buf, len); }
On x86_64, this will expand (after finitely many steps) to
Code:
ssize_t read(int fd, void *buf, size_t len) {
    long ret;
    /* The syscall instruction clobbers rcx and r11; volatile keeps the asm from being optimized away. */
    __asm__ __volatile__("syscall" : "=a"(ret) : "a"(SYS_read), "D"(fd), "S"(buf), "d"(len) : "rcx", "r11", "memory", "cc");
    return __syscall_ret(ret);
}
long __syscall_ret(unsigned long x) {
    if (x > -4096UL) {
        errno = -x;
        x = -1UL;
    }
    return x;
}
That second function is just for illustration. It's the same everywhere.
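If the -4096UL comparison looks odd: it treats only the top 4095 values of the unsigned range (that is, -4095 to -1 when read as signed) as error codes, so large but valid return values are left alone. Here is the same logic under a hypothetical name (demo_syscall_ret), so it can be compiled next to a real libc and poked at:

```c
#include <errno.h>

/* Same logic as __syscall_ret, renamed to avoid clashing with a real libc. */
static long demo_syscall_ret(unsigned long r)
{
    if (r > -4096UL) {      /* r is in [-4095, -1] when viewed as signed */
        errno = (int)-r;    /* unsigned negation turns -E into E */
        return -1;
    }
    return (long)r;         /* everything else is a normal result */
}
```

So a raw return of -ENOENT becomes -1 with errno set to ENOENT, while -4096 itself is passed through unchanged because it is just outside the error range.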
On the kernel side, there is a syscall table, admittedly autogenerated, that looks something like
Code:
typedef long syscall_t();
static syscall_t *const syscall_tbl[__NR_syscalls] = {
    [0 ... __NR_syscalls-1] = sys_ni_syscall,
    ...
    [SYS_read] = sys_read,
    ...
};
The code that calls this is arch specific, but then, it is interrupt handling code, so that is always arch specific. For x86_64, something like:
Code:
void handle_syscall(struct regs *regs) {
    if (regs->rax < __NR_syscalls)
        regs->rax = syscall_tbl[regs->rax](regs->rdi, regs->rsi, regs->rdx, regs->r10, regs->r8, regs->r9);
    else
        regs->rax = -ENOSYS;
}
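In order for this to work, the kernel's calling convention must tolerate a callee receiving more arguments than it declares, which in practice means the caller cleans up. Luckily, all Linux ABIs are like that. Windows stdcall, where the callee pops its own arguments, would not work here.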
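If you want to play with the table and dispatcher in userspace, something like the following works as a toy. It is not the Linux code: all names (DEMO_*, demo_tbl, demo_handle_syscall, the sys_* stubs) are invented, and I sidestep the "too many arguments" question by giving every handler the same six-long signature, so it compiles as strictly conforming C.

```c
#include <errno.h>

/* Hypothetical syscall numbers for the demo. */
enum { DEMO_answer, DEMO_unused, DEMO_add, NR_DEMO };

/* Every handler takes six longs, so the dispatcher never passes "extra" arguments. */
typedef long syscall_fn(long, long, long, long, long, long);

/* Cut-down register file: syscall number in rax, arguments in the rest. */
struct regs { unsigned long rax, rdi, rsi, rdx, r10, r8, r9; };

static long sys_ni(long a, long b, long c, long d, long e, long f)
{
    (void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
    return -ENOSYS;                 /* not implemented */
}

static long sys_answer(long a, long b, long c, long d, long e, long f)
{
    (void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
    return 42;
}

static long sys_add(long a, long b, long c, long d, long e, long f)
{
    (void)c; (void)d; (void)e; (void)f;
    return a + b;                   /* only the first two arguments matter */
}

static syscall_fn *const demo_tbl[NR_DEMO] = {
    [DEMO_answer] = sys_answer,
    [DEMO_unused] = sys_ni,
    [DEMO_add]    = sys_add,
};

/* Look up the handler by number, call it, and put the result back in rax. */
static void demo_handle_syscall(struct regs *r)
{
    if (r->rax < NR_DEMO)
        r->rax = (unsigned long)demo_tbl[r->rax](
            (long)r->rdi, (long)r->rsi, (long)r->rdx,
            (long)r->r10, (long)r->r8, (long)r->r9);
    else
        r->rax = (unsigned long)-ENOSYS;
}
```

Filling a struct regs with rax = DEMO_add, rdi = 2, rsi = 3 and calling demo_handle_syscall leaves 5 in rax; any number outside the table comes back as -ENOSYS.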
In order to marshal 64-bit arguments on 32-bit platforms, each platform has its own definitions. In general, though, the argument is likely to be split into two registers, and possibly padded. E.g. on PowerPC, such arguments are passed in an even/odd register pair, with the higher half in the even register. So userspace can define
Code:
off_t lseek(int fd, off_t off, int whence) {
#ifdef SYS__llseek
    int ret = syscall(SYS__llseek, fd, SC_LL_E(off), &off, whence);
    if (ret) off = ret;
    return off;
#else
    return syscall(SYS_lseek, fd, off, whence);
#endif
}
Where SC_LL_E is the identity on x86_64, and on PowerPC it is defined as
Code:
#define SC_LL_E(x) ((x) >> 32), (x)
#define SC_LL_O(x) 0, SC_LL_E(x)
To fix this up in kernel space, on PowerPC the syscall table has its landing pad for llseek in an arch specific function
Code:
long sys_ppc_llseek(int fd, unsigned long off_hi, unsigned long off_lo, off_t *off_ret, int whence)
{
    return sys_llseek(fd, (0ULL+off_hi) << 32 | off_lo, off_ret, whence);
}
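The split and the rejoin are easy to check in isolation. A minimal sketch, with split64 and join64 as made-up helper names and fixed-width types standing in for 32-bit registers:

```c
#include <stdint.h>

/* Userspace side: the two halves that SC_LL_E would put into two 32-bit registers. */
static void split64(uint64_t v, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(v >> 32);
    *lo = (uint32_t)v;          /* truncation keeps the low 32 bits */
}

/* Kernel side: widen the high half BEFORE shifting, which is what (0ULL+off_hi) does. */
static uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

The widening matters: shifting a 32-bit value left by 32 without converting it first is undefined behaviour in C, and would lose the high half even where it happens to "work".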
There, that is it. That is the entire extent of syscall parameter passing in Linux. The syscall dispatcher is arch-specific, as is the syscall table, and maybe a few arch-specific wrappers. But the syscalls themselves are arch-independent (unless, of course, the syscall itself only exists on one arch; modify_ldt(), for example, only exists on x86). And in userspace, you only need a couple of macros to run the syscall instruction with a variety of parameters, but they differ only in the number of arguments, not in types.
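For the curious, the "couple of macros" on the userspace side usually boil down to counting the arguments and pasting the count onto a function name. A cut-down sketch for up to three arguments follows; the names (sc0 to sc3, my_sc*, my_syscall, SC_*) are invented here, and the sc* functions are stand-ins that only report which arity was picked, where a real libc would have one inline-asm stub per arity.

```c
/* Stand-ins for the per-arity asm stubs: each returns its own arity. */
static long sc0(long n) { (void)n; return 0; }
static long sc1(long n, long a) { (void)n; (void)a; return 1; }
static long sc2(long n, long a, long b) { (void)n; (void)a; (void)b; return 2; }
static long sc3(long n, long a, long b, long c)
{
    (void)n; (void)a; (void)b; (void)c;
    return 3;
}

/* Count the arguments that follow the syscall number (0 to 3 here). */
#define SC_NARGS_X(a, b, c, d, n, ...) n
#define SC_NARGS(...) SC_NARGS_X(__VA_ARGS__, 3, 2, 1, 0, )

/* Two-level paste so SC_NARGS is expanded before ## is applied. */
#define SC_CAT_X(a, b) a##b
#define SC_CAT(a, b) SC_CAT_X(a, b)

/* Everything becomes a long, whatever its original type. */
#define SC_L(x) ((long)(x))
#define my_sc0(n)          sc0(n)
#define my_sc1(n, a)       sc1(n, SC_L(a))
#define my_sc2(n, a, b)    sc2(n, SC_L(a), SC_L(b))
#define my_sc3(n, a, b, c) sc3(n, SC_L(a), SC_L(b), SC_L(c))

/* my_syscall(nr, args...) picks my_scN according to the number of args. */
#define my_syscall(...) SC_CAT(my_sc, SC_NARGS(__VA_ARGS__))(__VA_ARGS__)
```

With this, my_syscall(0, fd, buf, len) expands to sc3(0, (long)fd, (long)buf, (long)len) regardless of whether the arguments were ints, pointers or size_t. Extending it to six arguments just means longer lists in SC_NARGS and three more wrappers.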
The problem of passing a 64-bit number directly to a syscall occurs surprisingly rarely. Most of the time, you pass such numbers through memory again.