Hi,
mvanga wrote:Brendan wrote:To avoid wasting CPU time polling you can send interrupts ("Inter-Processor Interrupts") from one CPU to another. For example, a CPU could put data into a specific memory location then send an IPI to the other processor, which would receive the interrupt/IPI and check for data in that specific memory location.
Well, I'm planning to have a really small kernel doing polling + sleeping (maybe a 10:90 time split on that).
In that case you're wasting 100% of CPU time.
mvanga wrote:Brendan wrote:If a piece of data is not used often enough to remain in the cache, then it will be replaced with more important data. In this case "polling" would only make the cache less efficient for things that matter more. Also don't forget that if one CPU modifies something, then the stale data must be evicted from all other CPU's caches to guarantee that those other CPUs won't use old/wrong data - the polling would hurt performance for things that matter more, and also won't improve performance for your message passing either.
Yeah but polling ensures that the message passing areas stay in cache as they always stay hot. Correct?
80x86 CPUs use (a variation of) the MESI protocol. When a CPU is polling something that doesn't change, the data would remain in that CPU's cache. However, when anything modifies the data, the affected part of the cache has to change to the "invalid" state. When this happens, the next read will cause a cache miss.
Basically, if you poll, you get a cache miss the first time you read the data after it has changed; and if you don't poll at all, you get a cache miss the first time you read the data after it has changed. There's no difference, so polling (in an attempt to keep the data in the cache) is pointless.
Now imagine you were actually trying to do useful work (and not wasting 100% of CPU time on polling and sleeping), and no messages were being sent or received. In this case, is it better to waste part of the cache for "no messages", or is it better to let that part of the cache help the useful work get done faster?
mvanga wrote:And since I only use it for passing messages (lightweight, probably be 64 to 128 bits at most), I am actually relying on the cache coherency algorithm to help me pass my message to all the cores. Should work right?
Well, it would work, but...
If the sender needs to send 2 messages, then it'd send the first message, then wait until the receiving CPU has received it before it can send the second message. This means you'd need some way to tell the sender that the message buffer is free. For 64-bit messages you could use the highest bit for this. For example, the sender waits until the highest bit is clear, then stores a 63-bit message with the highest bit set.
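To make that concrete, here's a minimal sketch of that single-word mailbox using C11 atomics. The names (mailbox, send_message, receive_message) and the busy-wait loops are just illustrative assumptions, not a finished design:

Code:
#include <stdatomic.h>
#include <stdint.h>

#define MSG_FULL (1ULL << 63)            /* highest bit = "message present" */

static _Atomic uint64_t mailbox = 0;     /* shared between the two CPUs */

/* Sender: wait until the buffer is free, then store a 63-bit payload
   with the highest bit set. */
void send_message(uint64_t payload63)
{
    while (atomic_load(&mailbox) & MSG_FULL)
        ;                                /* spin until the receiver has taken it */
    atomic_store(&mailbox, (payload63 & ~MSG_FULL) | MSG_FULL);
}

/* Receiver: wait until a message is present, take it, then clear the
   highest bit so the sender may store the next one. */
uint64_t receive_message(void)
{
    uint64_t value;
    while (!((value = atomic_load(&mailbox)) & MSG_FULL))
        ;                                /* spin until the sender has stored one */
    atomic_store(&mailbox, 0);           /* mark the buffer as free again */
    return value & ~MSG_FULL;
}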
Now think about sending a decent amount of data. If you need to send 1000 bytes (8000 bits), then with some messy bit shifting you could break it up into 127 (63-bit) messages. Alternatively (without the messy bit shifting) you could have 334 (24-bit or 3-byte) messages instead. Either way, the sender and the receiver are going to pound the living daylights out of that cache line - each time the sender sends a message, and each time the receiver clears the highest bit (to say that it's safe to send the next message), you get a cache miss on a CPU; so at a minimum (for 64-bit messages) sending 1000 bytes costs 254 cache misses. There's a rough sketch of the 3-byte variant after the list below.
Think of it as a loop that does this 127 times:
- Sender loops waiting for the high bit to become clear (causing a cache miss when the high bit does become clear)
- Sender writes new message (causing cache line to become "invalid" in receiver's cache)
- Receiver loops waiting for the high bit to become set (causing a cache miss when the high bit does become set)
- Receiver gets the message and clears the high bit (causing cache line to become "invalid" in sender's cache)
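As a rough sketch of the sending side (reusing send_message() from the earlier snippet, and choosing the simpler 3-byte encoding), it'd look something like this; the same cache ping-pong happens no matter how the bytes are packed:

Code:
#include <stddef.h>
#include <stdint.h>

/* Send a 1000-byte buffer as ~334 tiny messages (3 payload bytes each);
   every iteration bounces the shared cache line between the two CPUs. */
void send_bytes(const unsigned char *data, size_t len)
{
    for (size_t i = 0; i < len; i += 3) {
        uint64_t payload = 0;
        for (size_t j = 0; j < 3 && i + j < len; j++)
            payload |= (uint64_t)data[i + j] << (8 * j);
        send_message(payload);           /* a cache miss on each CPU, per message */
    }
}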
Also, because both CPUs are so tightly synchronised, you've killed most of the optimisations that "out of order" CPUs are capable of. For example, after the first cache miss the CPU can't speculatively fetch the next cache line.
For an alternative, imagine if the same thing happened but with much larger (e.g. up to 4 KiB?) messages. The sender could write the entire 1000-byte message then set a "message in buffer" flag; and the receiver could read the entire 1000-byte message and then clear the "message in buffer" flag. In this case you still get cache misses; but it's 2 cache misses per cache line and not 2 cache misses per 63 bits. A cache line is typically 64 bytes; so a 1000-byte message works out to 32 cache misses instead of 254 of them (and you haven't killed most CPU optimisations - e.g. after the first few cache misses the CPU's hardware pre-fetcher would kick in and pre-fetch the remaining cache lines).
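A minimal sketch of that "whole message plus flag" approach, assuming a shared struct where a length field doubles as the "message in buffer" flag (the layout and names are assumptions for illustration):

Code:
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUFFER_SIZE 4096                 /* room for messages up to 4 KiB */

struct message_buffer {
    _Atomic size_t length;               /* 0 = empty, non-zero = message present */
    unsigned char data[BUFFER_SIZE];
};

/* Sender: wait until the buffer is empty, copy the whole message in,
   then publish its length (this acts as the "message in buffer" flag). */
void send_big(struct message_buffer *buf, const void *msg, size_t len)
{
    while (atomic_load(&buf->length) != 0)
        ;                                /* previous message not consumed yet */
    memcpy(buf->data, msg, len);
    atomic_store(&buf->length, len);     /* hand the message to the receiver */
}

/* Receiver: wait for a message, copy the whole thing out, then clear
   the flag so the sender can reuse the buffer. */
size_t receive_big(struct message_buffer *buf, void *msg)
{
    size_t len;
    while ((len = atomic_load(&buf->length)) == 0)
        ;                                /* nothing there yet */
    memcpy(msg, buf->data, len);
    atomic_store(&buf->length, 0);       /* buffer is free again */
    return len;
}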
Of course, even if you did this, if you want to send lots of small 64-bit messages then you'd still have the same "cache pounding" problem. You can fix that by sending small messages in a group. For example, combine 8 small messages into a 64-byte packet that fills an entire cache line, then send the 64-byte packet (and let the receiver break it up into 8 individual little messages). That way, for many 64-bit messages you can end up with an average of 1 cache miss for every 4 messages (rather than 2 cache misses per message).
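Something like this (the 64-byte line size, the GCC-style alignment attribute, and the names are assumptions; how send_packet() actually hands the packet over - e.g. via send_big() above - is left out):

Code:
#include <stdint.h>

#define MESSAGES_PER_PACKET 8

struct packet {
    uint64_t messages[MESSAGES_PER_PACKET];  /* 8 * 8 bytes = one 64-byte line */
} __attribute__((aligned(64)));              /* keep the packet in one cache line */

struct batcher {
    struct packet pending;               /* local to the sender, so no cache misses */
    int count;
};

/* Collect small messages locally; only when 8 have accumulated does the
   whole cache-line-sized packet get handed to the shared buffer. */
void queue_small_message(struct batcher *b, uint64_t msg,
                         void (*send_packet)(const struct packet *))
{
    b->pending.messages[b->count++] = msg;
    if (b->count == MESSAGES_PER_PACKET) {
        send_packet(&b->pending);        /* 2 cache misses for 8 messages */
        b->count = 0;
    }
}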
Now let's talk about latency. If you do useful work for 90000 cycles then poll for 10000 cycles, then a message could arrive just as you start doing useful work. In this case the worst case time it takes for a CPU to notice that a message has been received will be 90000 cycles. To improve "worst case" latency, the receiver could do no polling at all, and the sender could send an IPI. The worst case latency for an IPI is probably around 100 cycles (which is a lot better than 90000 cycles), and as a bonus you wouldn't be wasting 10% of CPU time (the IPI could interrupt the useful work if/when necessary).
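For reference, sending a fixed-delivery IPI through the local APIC's Interrupt Command Register looks roughly like this (assuming the xAPIC's default MMIO base of 0xFEE00000 is mapped, and that the chosen vector has a handler installed on the receiving CPU - treat it as a sketch, not a drop-in routine):

Code:
#include <stdint.h>

#define LAPIC_BASE     0xFEE00000UL      /* default local APIC MMIO base */
#define LAPIC_ICR_LOW  (*(volatile uint32_t *)(LAPIC_BASE + 0x300))
#define LAPIC_ICR_HIGH (*(volatile uint32_t *)(LAPIC_BASE + 0x310))

/* Send a fixed-delivery IPI with 'vector' to the CPU whose local APIC ID
   is 'dest_apic_id'. */
void send_ipi(uint8_t dest_apic_id, uint8_t vector)
{
    LAPIC_ICR_HIGH = (uint32_t)dest_apic_id << 24;  /* destination in bits 24..31 */
    LAPIC_ICR_LOW  = (1 << 14) | vector;            /* assert, fixed delivery; this write sends it */

    while (LAPIC_ICR_LOW & (1 << 12))
        ;                                /* wait for the "delivery status" bit to clear */
}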
If you combine all of the above, "larger messages with IPIs" has a lot fewer cache misses and much better worst case latency; and you end up being able to do a lot more useful work while it's happening (because you're not wasting CPU time polling, and not wasting a cache line for no reason when no messages are being sent/received).
Cheers,
Brendan