L4 IPC performance
Hello, I am currently researching various IPC implementations, and the L4 IPC API is one of them. I have seen that L4 allows threads to communicate with each other, but isn't this extremely slow? I have looked at several L4 servers, and they all use a single message loop like this:
for (;;) {                            /* single-threaded server loop */
    msg = ReceiveMessage();           /* block until a request arrives */
    switch (msg->tag) {               /* dispatch on the message tag */
    case MMAP: DoMmap(.......); break;
    }
}
If I understand it correctly, they use one server thread to handle all requests. However, I don't understand why they implemented it this way - it is a real bottleneck!
Wouldn't it be better to have something like this:
- Thread x sends a message to process y
- Kernel interrupts one running thread in process y or allocates a new worker thread
- Set the entry point of the interrupted thread to a message handler function
- Kernel chooses an appropriate way to pass the input to the server (registers, stack, shared memory....)
- Resume the interrupted thread to call the message handler
- When the message handler returns, copy the output data back to the sender
- Resume the previous work
Instead of just using one receiver thread, this would allow handling multiple requests at the same time (a rough sketch of the idea follows below).
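For illustration, here is a minimal user-space analogue of that model, where a dispatcher hands each incoming message to a freshly created worker thread. All names (message_t, handle_message) are invented for this sketch, and a pthread stands in for the kernel-allocated worker; this is not a real L4 interface.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int tag;   /* message type, e.g. MMAP */
    int arg;   /* payload */
} message_t;

/* The handler the kernel would upcall; here it just prints and frees. */
static void *handle_message(void *p)
{
    message_t *msg = p;
    printf("worker: tag %d, arg %d\n", msg->tag, msg->arg);
    free(msg);
    return NULL;
}

int main(void)
{
    /* Three fake messages stand in for the kernel delivering requests. */
    for (int i = 0; i < 3; i++) {
        message_t *msg = malloc(sizeof *msg);
        msg->tag = 1;
        msg->arg = i;
        pthread_t worker;   /* "allocate a new worker thread" */
        pthread_create(&worker, NULL, handle_message, msg);
        pthread_detach(worker);
    }
    pthread_exit(NULL);     /* let the workers finish before the process ends */
}

Compared with the single message loop above, the three requests here are handled concurrently; the open question is whether thread creation and the extra kernel bookkeeping cost more than they save.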
Is L4 that bad or am I getting something wrong?
Thanks in advance.
Re: L4 IPC performance
for (;;) {
    msg = ReceiveMessage();
    switch (msg->tag) {
    case MMAP: DoMmap(.......); break;
    }
}
If a trainstation is where trains stop, what is a workstation ?
Re: L4 IPC performance
OK, I agree the performance of L4 IPC is good. However, how does the whole synchronous L4 IPC scale when it has to serve thousands of requests per second? Wouldn't this result in a long waiting time for other processes which are waiting for the blocked server? And wouldn't the performance become even worse on multicore / multiprocessor systems?
Re: L4 IPC performance
xdopamine wrote: Wouldn't this result in a long waiting time for other processes which are waiting for the blocked server?

If the server is multithreaded and handles messages in priority order, how is the 'waiting time' any worse than with any other method?
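As an illustration, a multithreaded server of the kind meant here could look like the following user-space sketch: several workers block on one shared queue and always take the highest-priority message first. The queue and all names are invented for the example; in a real L4 server each worker would instead block in the IPC receive primitive on a shared endpoint.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define QSIZE    64

typedef struct { int prio, tag; } message_t;

static message_t queue[QSIZE];
static int qlen;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Take the highest-priority message; a linear scan keeps the sketch short. */
static message_t pop_highest(void)
{
    pthread_mutex_lock(&lock);
    while (qlen == 0)
        pthread_cond_wait(&nonempty, &lock);
    int best = 0;
    for (int i = 1; i < qlen; i++)
        if (queue[i].prio > queue[best].prio)
            best = i;
    message_t msg = queue[best];
    queue[best] = queue[--qlen];   /* swap-remove */
    pthread_mutex_unlock(&lock);
    return msg;
}

static void *worker(void *id)
{
    for (;;) {
        message_t msg = pop_highest();
        printf("worker %ld: tag %d (prio %d)\n", (long)id, msg.tag, msg.prio);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    /* Stand-in for clients sending requests. */
    pthread_mutex_lock(&lock);
    for (int i = 0; i < 8; i++)
        queue[qlen++] = (message_t){ .prio = i % 3, .tag = i };
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&nonempty);

    pthread_join(t[0], NULL);   /* the sketch then runs until interrupted */
}

A high-priority request never waits behind a long-running one here, because any idle worker can pick it up.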
If a trainstation is where trains stop, what is a workstation ?
Re: L4 IPC performance
xdopamine wrote: OK, I agree the performance of L4 IPC is good. However, how does the whole synchronous L4 IPC scale when it has to serve thousands of requests per second? Wouldn't this result in a long waiting time for other processes which are waiting for the blocked server? And wouldn't the performance become even worse on multicore / multiprocessor systems?

I don't really agree. First, in the single-core case I don't think there's any performance to gain from context switches between server threads if the server is busy executing code; if the server is busy waiting for hardware (or another server) to respond, then it should return and be ready to accept requests or process responses from hardware (or other servers). The only situation where it could be problematic is if there is a server doing long CPU-intensive operations, which makes me wonder what kind of server that would be, and whether it wouldn't be better to do those operations on the client side instead.
OTOH, if you're talking about the multicore (or even multiprocessor) case, there could be a point in having multiple server threads, but then they would probably best be created at server startup.
Re: L4 IPC performance
Hi,
Synchronous: A task sends a request and there's a task switch (because the sender can't continue until it receives the reply). The receiver receives the request, does something and sends a reply. Now the original thread can run, so (eventually) there's a second task switch. If a task wants to send 100 requests (and receive 100 replies) then that's a minimum of 200 task switches (and it doesn't matter how many CPUs you throw at it, it doesn't get any better).
Asynchronous: For single-CPU, a task sends 100 requests and then there might be one task switch to the receiver which processes the requests and sends 100 replies, and this is followed by an (eventual) task switch back to the original thread. That's a minimum of 2 task switches instead of 200. For multi-CPU the sender and receiver can be running on different CPUs; and if you're lucky there might not be any task switches at all.
However, synchronous behaves a little like normal function calls, which makes it a lot easier for programmers to deal with. For asynchronous you typically end up with a state machine - e.g. a main loop to get the messages and a "switch(message_type)" with plenty of "case:" statements that do some work and move the state machine from one state to another. It can easily become a large mess (especially if you're not used to it and/or don't split complex things into multiple threads with separate message handling loops), and it's not easy to port "traditional" software to it (especially if you don't want to end up emulating synchronous and missing out on the benefits of asynchronous).
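To make that concrete, here is a minimal sketch of such a state machine; the message types, states and canned input are invented for the example:

#include <stdio.h>

enum state { IDLE, WAITING_FOR_DISK, DONE };

typedef struct { int type; int data; } message_t;

static enum state state = IDLE;

static void dispatch(const message_t *msg)
{
    switch (msg->type) {
    case 0:   /* new client request: start async work, don't block */
        state = WAITING_FOR_DISK;
        printf("request accepted, disk read started\n");
        break;
    case 1:   /* reply from the disk driver: finish the request */
        if (state == WAITING_FOR_DISK) {
            state = DONE;
            printf("reply sent, data=%d\n", msg->data);
        }
        break;
    }
}

int main(void)
{
    /* A canned message sequence stands in for the real receive loop. */
    message_t msgs[] = { { 0, 0 }, { 1, 42 } };
    for (int i = 0; i < 2; i++)
        dispatch(&msgs[i]);
}

Even this toy version shows where the mess comes from: every piece of in-flight work needs its state tracked explicitly between messages, instead of living implicitly on a blocked thread's stack.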
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: L4 IPC performance
Hi Brendan,
Isn't the cost of sending a request (enter kernel, do something, resume original task) the same as a context switch (enter kernel, do something, resume other task), except for the change in memory context? If so, in a synchronous system, couldn't a client and server run on different cores and eliminate the need for so much memory context switching?
Of course the kernel would need a way of getting a message from one core to the other.
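For illustration, here is a user-space analogue of such a cross-core channel: a single-producer/single-consumer ring between two threads, built on C11 atomics. A kernel version would keep one ring per core pair and send an IPI to wake the receiving core; all names here are invented for the sketch.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING 16   /* power of two */

static long buf[RING];
static atomic_uint head, tail;

static int ring_send(long msg)   /* producer side ("core A") */
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&head, memory_order_acquire);
    if (t - h == RING) return 0;   /* ring full */
    buf[t % RING] = msg;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return 1;                      /* a kernel would send an IPI here */
}

static int ring_recv(long *msg)  /* consumer side ("core B") */
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h == t) return 0;          /* ring empty */
    *msg = buf[h % RING];
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 1;
}

static void *server(void *arg)
{
    (void)arg;
    long msg;
    for (int got = 0; got < 3; )
        if (ring_recv(&msg)) { printf("core B got %ld\n", msg); got++; }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, server, NULL);
    for (long i = 0; i < 3; i++)
        while (!ring_send(i)) ;    /* spin until there is room */
    pthread_join(t, NULL);
}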
If a trainstation is where trains stop, what is a workstation ?
Re: L4 IPC performance
gerryg400 wrote: Isn't the cost of sending a request (enter kernel, do something, resume original task) the same as a context switch (enter kernel, do something, resume other task), except for the change in memory context?

That's dependent on the CPU design. In some cases it's quite costly to switch between user and supervisor mode, while on others it's quite costly to change the memory context.

gerryg400 wrote: If so, in a synchronous system, couldn't a client and server run on different cores and eliminate the need for so much memory context switching?

That sounds stupid. Core A sends a message to Core B and then waits for Core B to complete the request, then Core B waits for Core A to send the next request. Sure, the context switching will be eliminated, but at the expense that you only use 50% of the CPU capacity.
Re: L4 IPC performance
That sounds stupid. Core A sends a message to Core B and then waits for Core B to complete the request, then Core B waits for Core A to send the next request. Sure, the context switching will be eliminated, but at the expense that you only use 50% of the CPU capacity.

Well, not necessarily. The client process, if it has something else to do, would be multithreaded. Other threads in the client process could be constructing further messages to send to the server (or other servers). The process on core B is a server process. It always sits and waits for things to do, and could serve other clients after replying to the first.
If a trainstation is where trains stop, what is a workstation ?
Re: L4 IPC performance
gerryg400 wrote: Well, not necessarily. The client process, if it has something else to do, would be multithreaded. Other threads in the client process could be constructing further messages to send to the server (or other servers). The process on core B is a server process. It always sits and waits for things to do, and could serve other clients after replying to the first.

Then it sounds more like the asynchronous case - well, maybe not in the strict sense, since the client continues executing after the request, only in another thread.
OTOH, this is only the case where the client and the server each have enough threads to keep one core busy, which is a special scheduling-optimization case that could just as well arise with two non-communicating multithreaded processes. As soon as there is significant asymmetry in the CPU load requirements, it might be better to put client threads on Core B as well (and then you get MMU context switches on Core B).
Re: L4 IPC performance
As soon as there is significant asymmetry in the CPU load requirements, it might be better to put client threads on Core B as well (and then you get MMU context switches on Core B).

That's entirely true. If you have few cores you cannot avoid memory context switches. I'm currently porting my old single-core 32-bit synchronous kernel to run on my 4-core 64-bit machine. When I first got it working, I found that my scheduler tried to run the client and server on the same core; it's just the way my simple scheduler did things. Anyway, I quickly realised that this was wasteful in memory context switches, so I'm trying to make something a bit smarter that will scale to more cores. I don't yet know how it's going to go, and I have a lot of work to do before I find out!
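For illustration only, one simple placement heuristic of the kind being hinted at might look like the sketch below: when a server thread becomes runnable, prefer the least-loaded core other than the client's. This is an invented example, not the actual scheduler described above.

#include <stdio.h>

#define NCORES 4

struct core { int load; };          /* runnable-thread count per core */
static struct core cores[NCORES];

/* Pick a core for a server thread: any core other than the client's
 * beats the client's own core, which is only used as a last resort
 * (running there would cost a memory context switch on that core). */
static int place_server(int client_core)
{
    int best = client_core;
    for (int i = 0; i < NCORES; i++) {
        if (i == client_core)
            continue;
        if (best == client_core || cores[i].load < cores[best].load)
            best = i;
    }
    return best;
}

int main(void)
{
    cores[0].load = 1;   /* the client runs here */
    cores[1].load = 2;
    cores[3].load = 1;
    printf("server placed on core %d\n", place_server(0));   /* core 2 */
}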
If a trainstation is where trains stop, what is a workstation ?