Hi,
LtG wrote:
> Brendan wrote:
> > I don't know what "light weight" really means - I tend to think of rendezvous messaging (where the "message" is tiny and fits in CPU's registers, and you switch tasks without saving or loading registers), but this is completely broken for multi-CPU.
> Why is it broken for multi-CPU? Especially if you run the recipient on the same core?
Rendezvous messaging:
- Ruins scalability (for all requests the "requester" blocks while waiting for a reply, so the "requester" and "requestee" can't do their work in parallel on different CPUs)
- Ruins cache locality and NUMA optimisations
- Ruins the scheduler's ability to effectively schedule work ("what runs where and when" becomes a consequence of the constant blocking/unblocking of threads caused by IPC; and things like scheduling policies and thread priorities have little effect)
For "single-CPU" none of these things matter much; and for early micro-kernels (from back when almost all computers were single-CPU, and had less need for CPU caches, etc. because the difference between "CPU speed" and "RAM speed" was far smaller) rendezvous messaging was vaguely beneficial. For modern computers (which are almost all multi-CPU, and need lots of caching and other tricks to overcome the difference between "CPU speed" and "RAM speed") it's incredibly bad.
LtG wrote:
> Brendan wrote:
> > In any case; you can expect that all of the stuff in the CPU's data caches, instruction caches, TLBs, branch prediction buffers and whatever else will become worthless (old stuff for the previous task that's useless for the new task); and that adds up to "expensive" (lots of cache and TLB misses, and mispredicted branches, and...) and there's almost nothing you can do about it.
> Suppose procA sends a message to the VFS and the VFS does very little work and then returns; the data and code caches should be just fine, the TLBs (if tagged) are fine, so I would imagine that it's mostly all still valid/useful.
Suppose procA sends 20 messages to "VFS running on different CPU", VFS sends a reply for 10 requests (that could be satisfied from file data caches, etc) and forwards 10 requests to three different file systems, and those file systems forward more requests to 5 different storage device drivers. Now we have a situation where all 10 processes can be doing useful work in parallel on different CPUs, potentially with zero task switches.
Suppose procA sends 20 messages to VFS, but it has to do them one at a time because the IPC is crap. It sends the first message (task switch from procA to VFS), VFS does a little work and sends a reply (task switch from VFS back to procA); procA sends the second message (task switch from procA to VFS), VFS does a little work and has to send a request to FS (task switch from VFS to FS) which also does a little work and sends a request to storage device driver (task switch from FS to driver); then driver sends reply (task switch from driver back to FS) and FS sends a reply (task switch from FS back to VFS) and VFS sends a reply (task switch from VFS back to procA).
For the same 20 messages; you're looking at 2 task switches each for ten of the requests and 6 task switches each for the other ten requests.
Instead of "10 processes can be doing useful work in parallel on different CPUs, potentially with zero task switches" you've got "only one process can do useful work at a time, with at least 80 task switches".
Now assume that (on average) your task switches are twice as expensive because the CPU's caches are full of trash (e.g. after each task switch, the new task's code and data is no longer in CPU's L1 caches and probably not in CPU's L2 caches either), and assume that there's 8 CPUs that could've been used but weren't. If task switches were free it'd be (up to) 8 times worse because you're not using all the CPUs properly, but task switches aren't free (and probably cost more than the work each task does) so it's going to be more like 50 times worse.
LtG wrote:
> Also remember that quite often when you send a message, the message contains something that was prepared just prior to sending it, so the message content is (to varying extent) in exactly the right core's caches; if you have another core pick up the work then it's guaranteed to be in the cache of the wrong core. On top of that, your design has to interrupt that other core, or sacrifice prioritization and latency (and possibly affect throughput).
For my case; thread asks kernel to fetch the message from its message queue, kernel tells CPU to prefetch the first part of the message, and CPU prefetches first part of message into cache while CPU switches back to CPL=3 and while the thread figures out who sent the message, etc. The message data may be in cache before it's accessed anyway.
For your case; kernel does a task switch that destroys everything, then returns to CPL=3 causing a cache miss at RSP and a cache miss at RIP (and probably a bunch of TLB misses too). You've already lost before the message data is accessed.
LtG wrote:
> Brendan wrote:
> > LtG wrote:
> > > I have no idea how my design will perform ultimately, so I can't say if it's a good idea or not. If I switch the task on the same core to do the "very little work" for VFS, then I get better cache locality than if I sent an IPI or some other way of communicating to the other core(s).
> > More like the opposite (if that other task's code and data is cached anywhere, then it's probably cached in a different CPU's caches, so you ruin cache locality by running it on the current CPU). Note that this gets worse once you start thinking about NUMA.
> This should get better with NUMA; again, all the data is being processed on the same core, thus in the same NUMA domain. I would think that it's always beneficial to continue work on the same core, unless there are cores that are idle, in which case the current core could stay with its "main" task instead of diverging into whatever the reason was why the message was sent.
You'd either pay a "NUMA penalty" when accessing the message data (which is typically small); or you pay a "NUMA penalty" accessing all of the task's code, stack and data (which is typically much larger). If you want to minimise the "NUMA penalties" you pay the small penalty to avoid the large penalty.
LtG wrote:
> Brendan wrote:
> > Yes, but it's a small price to pay to reduce the number of task switches (which are more expensive, regardless of what you do).
> If the message is in the CPU registers and you use TLB tagging, what's the extra price to do a task switch? Compared to an IPI?
The extra price is that the CPU has to fetch code from RIP and data from RSP; and you can expect that code will want to access things in ".data" or ".bss" or the heap; so you have poor cache locality for everything that matters more than the message data.
Note that TLB tagging (e.g. address space IDs) isn't as good as you might think. The problem is "multi-CPU TLB shootdown" - without TLB tagging there are multiple cases (e.g. a single-threaded process) where you know that the TLB entry can't be in other CPUs' TLBs and therefore don't need to send an IPI to invalidate the translation in other CPUs' TLBs; with TLB tagging you can't do this (you have to assume that other CPUs might still be caching the translation in their TLBs). This means that (with TLB tagging) you either send a lot more IPIs for "multi-CPU TLB shootdown", or you keep track of "postponed shootdowns" and do them during task switches (which is much better for scalability, assuming you're not doing an excessive number of task switches because of your IPC).
LtG wrote:
> Brendan wrote:
> > LtG wrote:
> > > So having a single thread vs one thread per core really depends on the overall design.
> > Sure, if you design the OS so that most CPUs spend most of their time trying to acquire locks/mutexes, then...
> If each core is almost entirely independent, then what locks/mutexes? I realize that your design on this aspect is radically different from mine, but I'm not planning on using almost any locks or mutexes as far as the kernel is concerned. A disk cache (VFS) thread would probably have to use a mutex when evicting a page from memory and when marking a page as currently needed, but that can be done on a page-by-page basis, so the likelihood of two cores wanting to do that should be relatively low and the mutex should be cheap (not many cycles).
If VFS, file systems and storage device drivers are all single-threaded; then there's no need to have any locks or mutexes anywhere in any of them. If VFS, file systems and storage device drivers are multi-threaded (e.g. could be running on different CPUs at the same time) then you're going to need locks/mutexes (and/or lock-free and/or block-free algorithms) everywhere.
Note that it's not really lock contention that I'd worry about - acquiring a lock has a cost even when there's no contention (there's always a cache line that ends up bouncing around between multiple CPUs). Even things like tracking statistics can become a surprising scalability problem (e.g. one CPU doing "add [total_bytes_written],rax" compared to multiple CPUs doing "lock add [total_bytes_written],rax") because of "cache line bouncing everywhere".
LtG wrote:
> If you try to view everything I've written here through the lens of your OS, then most/all of your criticism might be accurate, but again, my design is quite different from yours. When possible I try to avoid all the problems (locks/mutexes, task switching cost, etc.) and hope that eventually it all pays off... Time will tell.
I'm not saying these things because my OS does things differently; my OS does things differently because of these things.
Cheers,
Brendan