Booting up multiple instances on multiple cores

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to check whether your question is already answered in the wiki first! When in doubt, post here.
mvanga
Posts: 9
Joined: Fri May 22, 2009 10:58 pm

Booting up multiple instances on multiple cores

Post by mvanga »

Hi,

I am trying to boot two instances of a kernel on multiple cores of my x86 machine. Can anyone point me in the right direction of how to control the cores during boot?

Thanks!
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Booting up multiple instances on multiple cores

Post by Combuster »

First things first: can you detect and start up additional processors (cores) yet? You can get the processor information from either the ACPI tables or the MultiProcessor tables - the former provides more detailed information, while the latter is simpler. Once you know which processors there are, you can send them a special startup sequence of inter-processor interrupts (which are not maskable).

The MultiProcessor specification documents Intel's (former) recommended approach and outlines the basic steps; a sketch follows below. Most of the differences of opinion here are about when the startup interrupts are actually sent: Intel suggests sending two of them within a short interval, but often one is enough, so some people skip the second or send it only when the first fails to start the processor. Blindly sending interrupts works for an initial test, but is not recommended, as it might end up triggering broken or non-processor hardware.
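For a rough illustration, here is a minimal sketch of that INIT-SIPI-SIPI sequence in C. The local APIC base address, the trampoline location at 0x8000, and the delay_us() helper are all assumptions for the sake of the example, not something the specification fixes:

Code:

#include <stdint.h>

#define LAPIC_ICR_LO 0xFEE00300u   /* Interrupt Command Register, low dword  */
#define LAPIC_ICR_HI 0xFEE00310u   /* Interrupt Command Register, high dword */

extern void delay_us(unsigned microseconds);   /* hypothetical busy-wait helper */

static void lapic_write(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;
}

void start_ap(uint8_t apic_id)
{
    /* INIT IPI: put the target core into its wait-for-SIPI state. */
    lapic_write(LAPIC_ICR_HI, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LO, 0x00004500);   /* INIT, level assert */
    delay_us(10000);                         /* ~10 ms, per the MP specification */

    /* Startup IPI: vector 0x08 makes the core begin executing the
     * trampoline at physical address 0x08 * 0x1000 = 0x8000. */
    lapic_write(LAPIC_ICR_HI, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LO, 0x00004608);   /* SIPI, vector 0x08 */
    delay_us(200);

    /* Intel suggests a second SIPI; as noted above, one is often enough. */
    lapic_write(LAPIC_ICR_HI, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LO, 0x00004608);
}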
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Casm
Member
Posts: 221
Joined: Sun Oct 17, 2010 2:21 pm
Location: United Kingdom

Re: Booting up multiple instances on multiple cores

Post by Casm »

mvanga wrote:Hi,

I am trying to boot two instances of a kernel on multiple cores of my x86 machine. Can anyone point me in the right direction of how to control the cores during boot?

Thanks!
Personally, I would forget about the other cores until the boot process is complete and I have some useful work for them to do. Basically, that means once the scheduler is running. Thereafter, a core will automatically find itself executing kernel code whenever a system call is made or it processes a hardware interrupt.

You need locks, of course, to prevent multiple processors from modifying the same system variables simultaneously. The easy way to do that is with spinlocks, whereby a processor sits in a tight loop until the lock is freed. The smarter way is to take the waiting thread off the scheduler's ready queue until the lock has been released, and give the relevant core something else to do in the meantime.
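To make that concrete, here is a minimal spinlock sketch using GCC's atomic builtins; the names are illustrative, not from any particular kernel:

Code:

typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    /* Atomically set the lock to 1; keep looping as long as the old
     * value was already 1 (i.e. another processor holds the lock). */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        __builtin_ia32_pause();   /* tell the CPU we are busy-waiting */
}

static void spin_unlock(spinlock_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}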
rdos
Member
Posts: 3308
Joined: Wed Oct 01, 2008 1:55 pm

Re: Booting up multiple instances on multiple cores

Post by rdos »

Casm wrote:Personally, I would forget about the other cores until the boot process is complete and I have some useful work for them to do. Basically, that means once the scheduler is running.
Yes, although the initialization of the extra cores can be started before the scheduler is running; they just won't be doing anything useful until the scheduler is up.
Casm wrote:You need locks, of course, to prevent multiple processors from modifying the same system variables simultaneously. The easy way to do that is with spinlocks, whereby a processor sits in a tight loop until the lock is freed. The smarter way is to take the waiting thread off the scheduler's ready queue until the lock has been released, and give the relevant core something else to do in the meantime.
I think there is a need for both spinlocks and ordinary blocking locks. Spinlocks waste CPU time but are needed to synchronize cores, so they are best used inside the scheduler to implement blocking locks; the blocking locks are then used by the rest of the system, except for IRQ synchronization, which still needs spinlocks.
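As a sketch of that layering, a blocking lock can be built on top of a short-lived spinlock. Here block_current_thread() and wake_one() stand in for hypothetical scheduler calls (block_current_thread() is assumed to release the guard atomically before sleeping):

Code:

typedef volatile int spinlock_t;                 /* as in the sketch above      */
typedef struct { void *head; } wait_queue_t;     /* hypothetical scheduler type */

extern void spin_lock(spinlock_t *lock);
extern void spin_unlock(spinlock_t *lock);
extern void block_current_thread(wait_queue_t *q, spinlock_t *guard);
                             /* hypothetical: releases guard and sleeps   */
extern void wake_one(wait_queue_t *q);
                             /* hypothetical: readies one sleeping thread */

typedef struct {
    spinlock_t   guard;      /* short-lived spinlock protecting the state */
    int          held;       /* is the lock currently owned?              */
    wait_queue_t waiters;    /* threads sleeping until the lock is free   */
} mutex_t;

void mutex_lock(mutex_t *m)
{
    spin_lock(&m->guard);
    while (m->held) {
        /* Come off the ready queue instead of spinning. */
        block_current_thread(&m->waiters, &m->guard);
        spin_lock(&m->guard);
    }
    m->held = 1;
    spin_unlock(&m->guard);
}

void mutex_unlock(mutex_t *m)
{
    spin_lock(&m->guard);
    m->held = 0;
    wake_one(&m->waiters);   /* give the core something else to do */
    spin_unlock(&m->guard);
}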
mvanga
Posts: 9
Joined: Fri May 22, 2009 10:58 pm

Re: Booting up multiple instances on multiple cores

Post by mvanga »

Hey!

Thanks for the replies! Combuster's reply is what I was looking for! Some questions:

I have basically just booted off GRUB and have created multiple copies of the kernel image in memory (based on a hardcoded core count, which I guess I will replace with a value detected from the configuration). Now I want (say) 4 cores to run 4 different instances of the kernel, each with its own area of memory. (The idea is to build a user-space-based message-passing kernel.) Can you give me an idea of the multiprocessor tables (I assume you meant this document: http://www.intel.com/design/pentium/datashts/242016.HTM)? Is this a processor feature, or is the specification implemented by the BIOS/GRUB? I want to use specific memory locations to pass 64-bit messages between kernels (polling these locations from time to time so that they stay cached at all times on all cores).

Thanks! You guys are great :D
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Booting up multiple instances on multiple cores

Post by Brendan »

Hi,
mvanga wrote:Can you give me and idea of multiprocessor tables (I assume you meant this document: http://www.intel.com/design/pentium/datashts/242016.HTM)? Is this a processor feature or is the specification implemented by the BIOS/GRUB?
The BIOS creates the MultiProcessor Specification tables (and other tables) during boot. You'll find the table's contents and information on how to find the table in the datasheet you linked to.
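As a rough sketch of how the table is found: the MP Floating Pointer Structure is located by scanning for its "_MP_" signature on 16-byte boundaries. The 0x9FC00 address below assumes 640 KB of base memory; a real kernel would read the EBDA address from the BIOS Data Area and verify the checksum:

Code:

#include <stdint.h>
#include <string.h>

struct mp_float {                /* MP Floating Pointer Structure     */
    char     signature[4];       /* "_MP_"                            */
    uint32_t config_table;       /* physical address of the MP table  */
    uint8_t  length;             /* structure length in 16-byte units */
    uint8_t  revision;
    uint8_t  checksum;           /* all bytes must sum to zero        */
    uint8_t  features[5];
};

static struct mp_float *mp_scan(uintptr_t start, uintptr_t end)
{
    for (uintptr_t p = start; p < end; p += 16) {   /* 16-byte aligned */
        struct mp_float *mpf = (struct mp_float *)p;
        if (memcmp(mpf->signature, "_MP_", 4) == 0)
            return mpf;   /* a real kernel would verify the checksum too */
    }
    return 0;
}

struct mp_float *find_mp_table(void)
{
    struct mp_float *mpf;
    /* Per the specification: first KB of the EBDA, last KB of base
     * memory, then the BIOS ROM area. */
    if ((mpf = mp_scan(0x9FC00, 0xA0000)) != 0)
        return mpf;
    return mp_scan(0xF0000, 0x100000);
}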
mvanga wrote:I want to use specific memory locations to pass 64 bit messages between kernels (polling this location from time to time so that it stays cached at all times on all cores).
To avoid wasting CPU time polling you can send interrupts ("Inter-Processor Interrupts") from one CPU to another. For example, a CPU could put data into a specific memory location then send an IPI to the other processor, which would receive the interrupt/IPI and check for data in that specific memory location.
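A minimal sketch of sending such an IPI through the local APIC's Interrupt Command Register might look like this (the default local APIC base address is assumed):

Code:

#include <stdint.h>

#define LAPIC_ICR_LO 0xFEE00300u   /* default local APIC base assumed */
#define LAPIC_ICR_HI 0xFEE00310u

static void mmio_write(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;
}

/* Send a fixed-delivery IPI with the given vector to one CPU; the
 * receiving CPU's handler for that vector then checks the mailbox. */
void send_ipi(uint8_t dest_apic_id, uint8_t vector)
{
    mmio_write(LAPIC_ICR_HI, (uint32_t)dest_apic_id << 24);
    mmio_write(LAPIC_ICR_LO, vector);   /* fixed delivery, physical dest */
}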

If a piece of data is used often enough the CPU will keep it in the cache. In this case there's no need to poll to make the data stay in the cache.

If a piece of data is not used often enough to remain in the cache, then it will be replaced with more important data. In this case "polling" would only make the cache less efficient for things that matter more. Also, don't forget that if one CPU modifies something, the stale data must be evicted from all other CPUs' caches to guarantee that those CPUs won't use old/wrong data - the polling would hurt performance for things that matter more, and wouldn't improve performance for your message passing either.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
mvanga
Posts: 9
Joined: Fri May 22, 2009 10:58 pm

Re: Booting up multiple instances on multiple cores

Post by mvanga »

Hey Brendan,

Sorry for the late reply - I didn't have time during the day to check the forums.
Brendan wrote:The BIOS creates the MultiProcessor Specification tables (and other tables) during boot. You'll find the table's contents and information on how to find the table in the datasheet you linked to.
Thanks for the help :-)
Brendan wrote:To avoid wasting CPU time polling you can send interrupts ("Inter-Processor Interrupts") from one CPU to another. For example, a CPU could put data into a specific memory location then send an IPI to the other processor, which would receive the interrupt/IPI and check for data in that specific memory location.
Well, I'm planning to have a really small kernel doing polling + sleeping (maybe a 10:90 time split on that). The polling is there to give messages a chance to be received immediately; the rest of the time the core waits for an IPI. In the case of an IPI, however, the sending core interrupts every core that might have the mapping in its cache (which in my case would be all cores). Each core takes the trap and acknowledges the IPI. The trap is pretty expensive compared to polling, and the idea is to keep this overhead low.
Brendan wrote:If a piece of data is used often enough the CPU will keep it in the cache. In this case there's no need to poll to make the data stay in the cache.
(and polling ensures that this data stays hot in the cache, right?)
Brendan wrote:If a piece of data is not used often enough to remain in the cache, then it will be replaced with more important data. In this case "polling" would only make the cache less efficient for things that matter more. Also, don't forget that if one CPU modifies something, the stale data must be evicted from all other CPUs' caches to guarantee that those CPUs won't use old/wrong data - the polling would hurt performance for things that matter more, and wouldn't improve performance for your message passing either.
Yeah, but polling ensures that the message-passing areas stay in the cache, as they always stay hot. Correct? And since I only use it for passing messages (lightweight, probably 64 to 128 bits at most), I am actually relying on the cache-coherency algorithm to help me pass my message to all the cores. Should work, right? I write a message and the cache-coherency algorithm makes sure everyone gets a copy in their cache :-)

Thanks!
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Booting up multiple instances on multiple cores

Post by Combuster »

The general rule is that you don't read data when you don't need it. The caching algorithms in the processor are good at keeping the necessary data at hand; polling is not going to keep the data in your cache if another processor is modifying it, and any cache-maintenance code will take up cache space of its own.


Also, sleeping is for when there is nothing to be done. Waking a core from its sleep state is hardly a performance loss, because there was nothing to perform in the first place. Typical message-handling code will work like the following:

Code:

while(!done)
{
  if(!message_pending()) block_and_wait(); // while should not be needed!
  message = get_message();
  handle_message(message);
}
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Booting up multiple instances on multiple cores

Post by Brendan »

Hi,
mvanga wrote:
Brendan wrote:To avoid wasting CPU time polling you can send interrupts ("Inter-Processor Interrupts") from one CPU to another. For example, a CPU could put data into a specific memory location then send an IPI to the other processor, which would receive the interrupt/IPI and check for data in that specific memory location.
Well, I'm planning to have a really small kernel doing polling + sleeping (maybe a 10:90 time split on that).
In that case you're wasting 100% of CPU time.
mvanga wrote:
Brendan wrote:If a piece of data is not used often enough to remain in the cache, then it will be replaced with more important data. In this case "polling" would only make the cache less efficient for things that matter more. Also, don't forget that if one CPU modifies something, the stale data must be evicted from all other CPUs' caches to guarantee that those CPUs won't use old/wrong data - the polling would hurt performance for things that matter more, and wouldn't improve performance for your message passing either.
Yeah, but polling ensures that the message-passing areas stay in the cache, as they always stay hot. Correct?
80x86 CPUs use (a variation of) the MESI protocol. While a CPU is polling something that doesn't change, the data remains in that CPU's cache. However, when anything modifies the data, the affected part of the cache has to change to the "invalid" state. When this happens, the next read causes a cache miss.

Basically, if you poll you get a cache miss the first time you read the data after it has changed, and if you don't poll at all you get a cache miss the first time you read the data after it has changed. There's no difference, so polling (in an attempt to keep it in the cache) is pointless.

Now imagine you were actually trying to do useful work (and not wasting 100% of CPU time on polling and sleeping), and no messages were being sent or received. In this case, is it better to waste part of the cache for "no messages", or is it better to let that part of the cache help the useful work get done faster?
mvanga wrote:And since I only use it for passing messages (lightweight, probably 64 to 128 bits at most), I am actually relying on the cache-coherency algorithm to help me pass my message to all the cores. Should work, right?
Well, it would work, but...

If the sender needs to send 2 messages, then it'd send the first message, then wait until the receiving CPU has received it before it can send the second message. This means you'd need some way to tell the sender that the message buffer is free. For 64-bit messages you could use the highest bit for this. For example, the sender waits until the highest bit is clear, then stores a 63-bit message with the highest bit set.
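A sketch of that handshake for a single shared 64-bit mailbox, assuming exactly one sender and one receiver (the mailbox variable is illustrative):

Code:

#include <stdint.h>

#define MSG_PENDING (1ULL << 63)    /* the highest bit marks "buffer full" */

static volatile uint64_t mailbox;   /* assumed visible to both CPUs */

void send_message(uint64_t payload)     /* payload must fit in 63 bits */
{
    /* Wait until the receiver has drained the previous message. */
    while (__atomic_load_n(&mailbox, __ATOMIC_ACQUIRE) & MSG_PENDING)
        ;
    __atomic_store_n(&mailbox, payload | MSG_PENDING, __ATOMIC_RELEASE);
}

uint64_t receive_message(void)
{
    uint64_t m;
    /* Wait until a message arrives. */
    while (!((m = __atomic_load_n(&mailbox, __ATOMIC_ACQUIRE)) & MSG_PENDING))
        ;
    __atomic_store_n(&mailbox, 0, __ATOMIC_RELEASE);   /* buffer is free again */
    return m & ~MSG_PENDING;
}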

Now think about sending a decent amount of data. If you need to send 1000 bytes (8000 bits), then with some messy bit shifting you could break it up into 127 (63-bit) messages. Alternatively (without the messy bit shifting) you could have 334 (24-bit or 3-byte) messages instead. Either way, the sender and the receiver are going to pound the living daylights out of that cache line - each time the sender sends a message and each time the receiver clears the highest bit (to say that it's safe to send the next message) you get a cache miss on a CPU; so at a minimum (for 64-bit messages) sending 1000 bytes costs 254 cache misses.

Think of it as a loop that does this 127 times:
  • Sender loops waiting for the high bit to become clear (causing a cache miss when high bit does become clear)
  • Sender writes new message (causing cache line to become "invalid" in receiver's cache)
  • Receiver loops waiting for the high bit to become set (causing a cache miss when high bit does become set)
  • Receiver gets the message and clears the high bit (causing cache line to become "invalid" in sender's cache)
Also, because both CPUs are so tightly synchronised, you've killed most of the optimisations that "out of order" CPUs are capable of. For example, after the first cache miss the CPU can't speculatively fetch the next cache line.

For an alternative, imagine if the same thing happened but with much larger (e.g. up to 4 KiB?) messages. The sender could write the entire 1000 byte message then set a "message in buffer" flag; and the receiver could read the entire 1000-byte message and then clear the "message in buffer" flag. In this case you still get cache misses; but it's 2 cache misses per cache line and not 2 cache misses per 63-bits. A cache line is typically 64 bytes; so a 1000-byte message works out to 32 cache misses instead of 254 of them (and you haven't killed most CPU optimisations - e.g. after the first few cache misses the CPU's hardware pre-fetcher would kick in and pre-fetch the remaining cache lines).
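A sketch of that alternative, with the buffer layout and sizes chosen for illustration only:

Code:

#include <stdint.h>
#include <string.h>

struct msg_buffer {
    volatile uint32_t pending;      /* 0 = buffer free, 1 = message in buffer */
    uint32_t          length;
    uint8_t           data[4096];   /* up to 4 KiB per message */
};

void send_large(struct msg_buffer *buf, const void *msg, uint32_t len)
{
    while (__atomic_load_n(&buf->pending, __ATOMIC_ACQUIRE))
        ;                            /* wait until the buffer is free */
    memcpy(buf->data, msg, len);     /* bulk copy: ~2 misses per cache line */
    buf->length = len;
    __atomic_store_n(&buf->pending, 1, __ATOMIC_RELEASE);
}

uint32_t receive_large(struct msg_buffer *buf, void *out)
{
    while (!__atomic_load_n(&buf->pending, __ATOMIC_ACQUIRE))
        ;                            /* wait for a message */
    uint32_t len = buf->length;
    memcpy(out, buf->data, len);     /* the hardware pre-fetcher streams the lines in */
    __atomic_store_n(&buf->pending, 0, __ATOMIC_RELEASE);
    return len;
}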

Of course, if you did this and you still want to send lots of 64-bit messages, you'd have the same "cache pounding" problem. You can fix that by sending small messages in a group. For example, combine 8 small messages into a 64-byte packet that fills an entire cache line, then send the 64-byte packet (and let the receiver break it up into 8 individual little messages). That way, for many 64-bit messages you can end up with an average of 1 cache miss for every 4 messages (rather than 2 cache misses per message).
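A sketch of that batching, where send_packet() stands in for whichever hypothetical transport (e.g. the buffer-plus-flag scheme above) actually moves the packet; a real implementation would also flush a partial packet after a timeout:

Code:

#include <stdint.h>

struct msg_packet {
    uint64_t slots[8];              /* 8 x 64-bit messages = one 64-byte line */
} __attribute__((aligned(64)));

extern void send_packet(const struct msg_packet *p);   /* hypothetical transport */

static struct msg_packet pending_packet;   /* messages waiting to be shipped */
static unsigned n_pending;

void queue_small_message(uint64_t msg)
{
    pending_packet.slots[n_pending++] = msg;
    if (n_pending == 8) {           /* packet full: send the whole cache line */
        send_packet(&pending_packet);
        n_pending = 0;
    }
}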

Now let's talk about latency. If you do useful work for 90000 cycles and then poll for 10000 cycles, a message could arrive just as you start doing useful work; in that case the worst-case time it takes for the CPU to notice the message is 90000 cycles. To improve worst-case latency, the receiver could do no polling at all, and the sender could send an IPI. The worst-case latency for an IPI is probably around 100 cycles (a lot better than 90000 cycles), and as a bonus you wouldn't be wasting 10% of CPU time (the IPI can interrupt the useful work if/when necessary).

If you combine all of the above, "larger messages with IPIs" gives you far fewer cache misses and much better worst-case latency; and you end up being able to do a lot more useful work while it's happening (because you're not wasting CPU time polling, and not wasting a cache line for no reason when no messages are being sent/received).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.