
AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 10:35 am
by Owen
This one is probably more relevant to those of us building microkernels than monolithic kernels, on the basis that microkernels tend to require very fast IPC, and it is therefore critical to place as much data as possible in registers (to avoid having to copy it into buffers inside the kernel).

The question is simple: Why not use the SSE registers? XMM0-7 are caller-saved, so the kernel doesn't need to preserve anything in them anyway (And, additionally, highly intensive maths code rarely uses vectors), and the eight of them provide up to 128 bytes of data. They also offer another advantage over the more traditional use of the scalar registers: they reduce register pressure where it matters most (Inside the kernel you're unlikely to be doing any intensive mathematics, but you might have to look through a structure or two).

Particularly in my case, where the intention has always been to have a maximum transfer size of 128 bytes (Bigger transfers should be handled via memory mapping), it would seem like a very nice optimization.
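
For concreteness, a user-side send stub for this scheme might look roughly like the following. The syscall number, the use of RDI/RSI for destination and length, and all the names are hypothetical, made up just to illustrate the idea:

Code:
/* Hypothetical user-side send stub: pack a message of up to 128 bytes into
 * XMM0-7 before entering the kernel.  Nothing here is an established ABI. */
#include <stdint.h>
#include <string.h>

#define SYS_IPC_SEND 1       /* hypothetical syscall number */
#define IPC_MSG_MAX  128     /* 8 XMM registers * 16 bytes each */

static inline long ipc_send(uint64_t dest, const void *msg, uint64_t len)
{
    /* Stage the message in a 16-byte-aligned buffer so movdqa is safe. */
    uint8_t buf[IPC_MSG_MAX] __attribute__((aligned(16))) = {0};
    memcpy(buf, msg, len > IPC_MSG_MAX ? IPC_MSG_MAX : len);

    long ret = SYS_IPC_SEND;
    __asm__ volatile(
        "movdqa 0x00(%[m]), %%xmm0\n\t"
        "movdqa 0x10(%[m]), %%xmm1\n\t"
        "movdqa 0x20(%[m]), %%xmm2\n\t"
        "movdqa 0x30(%[m]), %%xmm3\n\t"
        "movdqa 0x40(%[m]), %%xmm4\n\t"
        "movdqa 0x50(%[m]), %%xmm5\n\t"
        "movdqa 0x60(%[m]), %%xmm6\n\t"
        "movdqa 0x70(%[m]), %%xmm7\n\t"
        "syscall"
        : "+a"(ret)                          /* syscall number in, status out */
        : "D"(dest), "S"(len), [m] "r"(buf)
        : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "rcx", "r11", "memory");
    return ret;
}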

Is there any reason not to do this that I've been too blind to see?

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 10:49 am
by Combuster
MMX/SSE are part of the FPU stack, and you normally try to save some moves by lazily switching FPU contexts. I doubt the performance gained outweighs forcing an FPU context switch with each IPC call (assuming each one causes a task switch).

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 10:53 am
by Owen
That is certainly the case with the MMX registers, but the SSE XMM ones are a separate set. There is no need for any FPU state switches (like the EMMS/FEMMS instructions) in order to use them.

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 11:49 am
by bewing
That's my understanding also -- the SSE/XMM registers are not on the FPU stack.
This does sound like an interesting technique. An OS is allowed to make a few OS-specific modifications to the standard calling mechanisms. I don't see any reason why your OS should not be able to dictate that XMM7 (or XMM15, or XMM8 - 15) are "reserved for OS use" -- and then that OS use is IPC parameter passing.

Of course, you have to take care that the IPC mechanism has no privileges -- so that malformed/intentionally "malicious" hacked-together IPC messages in the XMM registers can't do any harm -- and can perhaps be detected by the system as spoofs.

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 12:28 pm
by Brendan
Hi,
Owen wrote:That is certainly the case with the MMX registers, but the SSE XMM ones are a separate set. There is no need for any FPU state switches (like the EMMS/FEMMS instructions) in order to use them.
FPU/MMX and SSE are almost entirely separate; but they do share the "automatic FPU/MMX/SSE state saving" mechanism (the TS flag in CR0, the device-not-available exception handler, etc.), and an OS that uses this automatic state saving mechanism typically uses FXSAVE to save all FPU/MMX/SSE state (because it'd be a nightmare otherwise).

If the OS doesn't use the automatic state saving mechanism, then it always saves FPU/MMX/SSE state (even when it hasn't been used), so it won't matter how much code uses FPU/MMX/SSE.
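
As a minimal sketch of the lazy-switching mechanism described above (ring 0 code; task_t, current, fpu_owner and all other names are hypothetical):

Code:
#include <stdint.h>

typedef struct task {
    uint8_t fxsave_area[512] __attribute__((aligned(16)));  /* FXSAVE image */
    /* ... other per-task state ... */
} task_t;

static task_t *current;    /* task now running */
static task_t *fpu_owner;  /* task whose state is live in the FPU/SSE regs */

static inline void set_ts(void)    /* CR0.TS = 1: next FPU/SSE insn raises #NM */
{
    uint64_t cr0;
    __asm__ volatile("mov %%cr0, %0" : "=r"(cr0));
    __asm__ volatile("mov %0, %%cr0" :: "r"(cr0 | (1ull << 3)));
}

/* Called on every task switch: don't touch the FPU, just arm the trap. */
void fpu_switch_lazy(task_t *next)
{
    current = next;
    set_ts();
}

/* #NM (device-not-available) handler: the task really is using FPU/MMX/SSE,
 * so swap the whole state in one FXSAVE/FXRSTOR pair. */
void handle_nm(void)
{
    __asm__ volatile("clts");                /* clear CR0.TS, allow FPU use */
    if (fpu_owner == current)
        return;                              /* its state is already loaded */
    if (fpu_owner)
        __asm__ volatile("fxsave (%0)" :: "r"(fpu_owner->fxsave_area) : "memory");
    __asm__ volatile("fxrstor (%0)" :: "r"(current->fxsave_area) : "memory");
    fpu_owner = current;
}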


Cheers,

Brendan

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 3:26 pm
by Owen
I suppose it can be said that processes break down into the following groups:
  • IO-bound. Will benefit from the use of the SSE registers increasing IPC performance.
  • CPU-bound scalar (For example, compilers). Will take a minor performance hit from the requirement to save and restore FPU state; however, such code is likely to use vectors to some extent anyway, so the hit will not be overly large (Not that saving the FPU state is particularly big compared to normal context switch overhead).
  • CPU-bound vector (For example, 3D renderers, games, H.264 encoders/decoders). Will not be affected, as they need their FPU state to be saved anyway. It must be noted that many applications use SSE for scalar floating-point operations.
I think we can safely say that this should have, overall, a positive effect on some programs, and little negative effect on others. I'll have to benchmark things in more detail when I am able to do so :)

Re: AMD64 System calls & SSE registers

Posted: Tue Apr 06, 2010 10:38 pm
by Brendan
Hi,
Owen wrote:I suppose it can be said that processes break down into the following groups:
The only case that matters is tasks that do a lot of IPC. Tasks that don't do a lot of IPC won't be affected much by improvements in IPC performance or worse task switch overhead.

For tasks that do a lot of IPC, you'd want to consider how often sending/receiving causes a task switch. For example, if lots of messages are sent/received with no task switches, then improving the performance of IPC at the expense of task switch overhead could be an overall improvement. However often this isn't the case, and sending a message tends to cause the sender to block and the receiver to be unblocked (even for asynchronous IPC), and causes a task switch to occur between sender and receiver.

Also, whether or not a task uses SSE in general may not make that much difference. For example, a task might unblock, do a small amount of work without using SSE and then block again; then unblock, do a large amount of work with SSE, then block again; then unblock, do a small amount of work without using SSE and block again. In this case it might use SSE most of the time, but for most task switches SSE hasn't been used and SSE state wouldn't need to be saved/restored.

Finally, putting data in registers probably won't make it faster anyway. There are two reasons for this. First, if the sender and receiver are running on different CPUs then the data must be stored in RAM somewhere anyway. Secondly, systems that use small/fixed sized messages (where the message data could fit in registers) tend to be slow regardless of how fast the IPC is. For example, instead of sending one large message (that costs you 1000 cycles) you might end up sending 500 small messages (that cost 100 cycles each, which works out to 50 times slower).

The other thing I'd point out is that for large messages (or variable sized messages), the kernel doesn't necessarily need to copy the data. In long mode, a single "MOVSQ" instruction can shift 512 GiB of data from one virtual address space to another (by copying a PML4 entry). In the same way you could move page directory entries or page table entries (that contain message data) instead of copying the message data itself.
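
As a rough illustration of moving the mapping instead of the data (the types and names are hypothetical, and the required TLB invalidation is omitted):

Code:
#include <stdint.h>

typedef uint64_t pml4e_t;              /* one PML4 entry */

/* Transfer one PML4 slot (512 GiB of virtual address space) from the
 * sender's address space to the receiver's by copying a single 8-byte entry. */
void move_pml4_slot(pml4e_t *src_pml4, pml4e_t *dst_pml4, unsigned slot)
{
    dst_pml4[slot] = src_pml4[slot];   /* receiver now sees the pages */
    src_pml4[slot] = 0;                /* unmap them from the sender */
    /* A real kernel must also invalidate stale TLB entries on every CPU
     * that may have cached translations for this range. */
}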


Cheers,

Brendan

Re: AMD64 System calls & SSE registers

Posted: Wed Apr 07, 2010 4:17 am
by Owen
Brendan wrote:Hi,
Owen wrote:I suppose it can be said that processes break down into the following groups:
The only case that matters is tasks that do a lot of IPC. Tasks that don't do a lot of IPC won't be affected much by improvements in IPC performance or worse task switch overhead.

For tasks that do a lot of IPC, you'd want to consider how often sending/receiving causes a task switch. For example, if lots of messages are sent/received with no task switches, then improving the performance of IPC at the expense of task switch overhead could be an overall improvement. However often this isn't the case, and sending a message tends to cause the sender to block and the receiver to be unblocked (even for asynchronous IPC), and causes a task switch to occur between sender and receiver.

Also, whether or not a task uses SSE in general may not make that much difference. For example, a task might unblock, do a small amount of work without using SSE and then block again; then unblock, do a large amount of work with SSE, then block again; then unblock, do a small amount of work without using SSE and block again. In this case it might use SSE most of the time, but for most task switches SSE hasn't been used and SSE state wouldn't need to be saved/restored.
One must also consider the cost of taking the #NM exception (raised when TS is set) and handling it. Depending upon its cost (Exceptions are never cheap), it may be beneficial to use heuristics to determine whether to save and restore the state for the process at context switch time anyway.
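
One possible heuristic along those lines, sketched with hypothetical names: count how often a task actually touches the FPU/SSE after being switched in, and go eager once that is nearly every time slice.

Code:
#include <stdbool.h>
#include <stdint.h>

struct fpu_stats {
    uint32_t switches_in;   /* times this task has been switched in */
    uint32_t nm_traps;      /* times it then took the #NM trap */
    bool     eager;         /* save/restore unconditionally from now on */
};

/* Called from the #NM handler.  After a warm-up period, if the task has used
 * the FPU/SSE on roughly 90% or more of its slices, the lazy trap is pure
 * overhead, so switch it to eager saving. */
void fpu_note_trap(struct fpu_stats *s)
{
    s->nm_traps++;
    if (s->switches_in >= 32 && s->nm_traps * 10 >= s->switches_in * 9)
        s->eager = true;
}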

Additionally, for a microkernel, a significant proportion of blocking is going to be caused by IPC.
Finally, putting data in registers probably won't make it faster anyway. There are two reasons for this. First, if the sender and receiver are running on different CPUs then the data must be stored in RAM somewhere anyway. Secondly, systems that use small/fixed sized messages (where the message data could fit in registers) tend to be slow regardless of how fast the IPC is. For example, instead of sending one large message (that costs you 1000 cycles) you might end up sending 500 small messages (that cost 100 cycles each, which works out to 50 times slower).

The other thing I'd point out is that for large messages (or variable sized messages), the kernel doesn't necessarily need to copy the data. In long mode, a single "MOVSQ" instruction can shift 512 GiB of data from one virtual address space to another (by copying a PML4 entry). In the same way you could move page directory entries or page table entries (that contain message data) instead of copying the message data itself.
Putting the data into registers can save loads and stores even in the case of sending to a blocked process; the kernel doesn't need to pull the message from memory, and hopefully the sender and receiver can operate on the message directly from the registers (The ability to do this will, of course, largely depend upon the compiler's skill at SSE optimization).

As I said earlier, "bigger transfers should be handled via memory mapping". For example, stream socket IO (such as TCP, Unix Domain Sockets) would be handled by mapping two ring buffers in each process, writing data into them, then signalling the other end via a short IPC.
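
A rough sketch of the producer side of such a shared ring buffer (all names and sizes are illustrative; the buffer is mapped into both processes, and a short IPC wakes the other side when data is available):

Code:
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 65536u                 /* power of two */

struct ring {
    _Atomic uint32_t head;               /* free-running, written by producer */
    _Atomic uint32_t tail;               /* free-running, written by consumer */
    uint8_t data[RING_SIZE];
};

/* Returns the number of bytes actually written; the caller then sends a
 * short IPC to notify the reader. */
static size_t ring_write(struct ring *r, const void *buf, size_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t space = RING_SIZE - (size_t)(head - tail);

    if (len > space)
        len = space;
    for (size_t i = 0; i < len; i++)
        r->data[(head + i) & (RING_SIZE - 1)] = ((const uint8_t *)buf)[i];

    atomic_store_explicit(&r->head, head + (uint32_t)len, memory_order_release);
    return len;
}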