Multi-core Programming

Barry · Post by **Barry** » Tue Jan 04, 2022 11:38 pm

Hello,

So I've been working recently on making my kernel support SMP, and I've read all of the pages on the wiki, which were really helpful.
I've got to the point of detecting and initialising the other cores on the computer, but I'm rather confused as to what I need to do next.
I've basically got all the cores just sat running a while(1); loop doing nothing because I don't understand where to go next.
I'll say that each core is setup running in protected mode and actually in a position to run the code in my kernel, I just don't know what to make it run.
Do I need to redo the init process I did for the IDT, GDT, PIT, etc. on each processor?
How do I get those processors to actually schedule and run tasks?
I think all I need to do is just get them to the point of starting the scheduler and everything should just fall into place with the rest of my OS.

I made the mistake of leaving SMP a bit late and I've already got a fair bit done, but just running on a single core.

Thanks,
Barry

nullplan · Post by **nullplan** » Wed Jan 05, 2022 3:30 am

Generally, you are supposed to initialize your scheduler before you initialize the other cores. Then the other cores can just ask your scheduler for what they should do next. This of course requires an MP-safe scheduler. Typically you would initialize the CPU, the memory manager, the scheduler, and the interrupt controllers. Then launch the other cores, and then go on a hardware-detection spree. Make sure to spawn new tasks appropriately so the other cores have stuff to do. For example, you can have one task discovering devices on PCI, that then spawns new tasks for each device it finds and can map a driver to. Then the drivers can at least initialize on their own, including waiting for timeouts and stuff. For USB, you can spawn one discovery task per USB device. And so on and so forth. There is a lot of waiting involved, and you want there to be other stuff the CPU could be doing.

Barry wrote:Do I need to redo the init process I did for the IDT, GDT, PIT, etc. on each processor?

You need to load IDT, GDT, and TSS, yes. Generally you can re-use the IDT. You need a different TSS for each CPU, just to store different stack pointers for each CPU. For the GDT it comes down to taste. You can either have a different GDT per CPU, or have one GDT in the system with multiple TSS pointers. While the latter requires less memory overall, you either need to know the number of CPUs ahead of time, or have an allocation scheme that allows expansion. Or statically size the GDT to the maximum supported, but that means wasting memory if there are fewer CPUs.

I chose to just have one GDT per CPU. That then just gets allocated along with everything else for the CPU, and so there is not as much waste. Yes, the GDT is mostly the same for each CPU, but we are talking about 40 bytes per CPU here.

You will also have to redo some MSR assignments (if you use MSRs). You do not have to reinitialize the PIC and the PIT; that is external hardware that is initialized only once.

Barry wrote:How do I get those processors to actually schedule and run tasks?

Call the scheduler? Depends on how you write it, but it may be best to initialize a dummy task in the new CPU and exit out of it. Then the scheduler should just hold in a loop waiting for a task to become available, and executing the idle task until then. You do have an idle task, right?

Barry · Post by **Barry** » Wed Jan 05, 2022 5:15 am

Thanks for the reply, that's cleared quite a lot up. Just a couple of questions about it:

nullplan wrote:Call the scheduler? Depends on how you write it, but it may be best to initialize a dummy task in the new CPU and exit out of it. Then the scheduler should just hold in a loop waiting for a task to become available, and executing the idle task until then. You do have an idle task, right?

Yes I have an idle task. My understanding of this is basically just calling schedule() from each CPU and just letting it work (assuming each CPU has a ready queue). Is that correct, or would there need to be some fancy IPI stuff going on?
Currently my schedule() is run by the PIT IRQ. Would this work for all CPUs, or is there something I have to do there to get that to work?

The stuff you said about spawning processes for the drivers was really helpful in terms of me understanding what's got to happen, so thanks for all of that.

Octocontrabass · Post by **Octocontrabass** » Wed Jan 05, 2022 1:14 pm

nullplan wrote:Generally you can re-use the IDT.

I wonder if this could be a bottleneck on HPC hardware. I'm sure you would want separate Tx/Rx queues and IRQs per CPU, but would separate ISRs be necessary too?

Barry wrote:Currently my schedule() is run by the PIT IRQ. Would this work for all CPUs, or is there something I have to do there to get that to work?

It should work as long as you're using the APIC instead of the PIC, but you'll probably want to use different timers in the future to get more control over when each CPU will be interrupted. (Your schedule() can run outside the PIT IRQ handler, right?)

nullplan · Post by **nullplan** » Wed Jan 05, 2022 3:27 pm

Barry wrote:Currently my schedule() is run by the PIT IRQ. Would this work for all CPUs, or is there something I have to do there to get that to work?

This will likely not work. The PIT is external to the CPU, and any interrupt it generates will be delivered to only one core. As Octo said, you need to use APIC, and then you can use the LAPIC timer interrupts to ensure a regular interruption on each core. You need to calibrate the LAPIC timer, yes, but if you have a PIT then that should be very quick and easy.

Octocontrabass wrote:I wonder if this could be a bottleneck on HPC hardware. I'm sure you would want separate Tx/Rx queues and IRQs per CPU, but would separate ISRs be necessary too?

You can still reuse the IDT, and the first-level interrupt handlers (which should push the interrupt number and then save registers before calling the second-level interrupt handler). The second-level handler can then separate out by CPU-ID (e.g. given by GS base address) if that is really needed. Or maybe the list of function pointers for the interrupts could be per CPU. In any case, there is no big need to have separate entry points for each CPU, and therefore no need to have separate IDTs, and the IDT when fully loaded in 64-bit mode is a full page in size.

You need some way to tell CPUs apart in normal code, anyway, so I just use the GS base address for that. And I really don't know any way to do it that doesn't involve having either GS or FS hold a base address in kernel mode, or else using a paging trick to have different data at the same address. But the latter option is fraught with peril, so I just use the former one, and GS is nicer to use because of SWAPGS.

Octocontrabass · Post by **Octocontrabass** » Wed Jan 05, 2022 4:08 pm

nullplan wrote:The PIT is external to the CPU, and any interrupt it generates will be delivered to only one core.

The PIT's IRQ line is attached to an IOAPIC, and the IOAPIC can route an IRQ to multiple LAPICs. I'm not saying it's a good idea, but it would work.

nullplan wrote:You can still reuse the IDT, and the first-level interrupt handlers (which should push the interrupt number and then save registers before calling the second-level interrupt handler). The second-level handler can then separate out by CPU-ID (e.g. given by GS base address) if that is really needed. Or maybe the list of function pointers for the interrupts could be per CPU.

Now that you mention it, that makes a lot of sense. I see Linux uses a per-CPU function pointer list. It seems like this would be pretty fast already, but I have to wonder if there's a point where using a separate stub for each IRQ handler function instead of a generic second-level handler would offer any improvement. (In Linux, there's some common code that has to run before and after the function from the list, so the added cache and BTB costs of duplicating that probably outweighs any benefits.)

nullplan wrote:But the latter option is fraught with peril, so I just use the former one, and GS is nicer to use because of SWAPGS.

While I agree it's nicer than per-CPU page tables, SWAPGS has its own pitfalls!

nullplan · Post by **nullplan** » Wed Jan 05, 2022 11:47 pm

Octocontrabass wrote:It seems like this would be pretty fast already, but I have to wonder if there's a point where using a separate stub for each IRQ handler function instead of a generic second-level handler would offer any improvement.

The only way a separate first-level handler could be appreciably faster than a generic one with a function pointer would be by having a first-level handler that is hard-coded to do a certain thing. That means duplicating at least some of the work from the first and second-level generic handlers, and is therefore a speed optimization costing code quality (DRY principle gets violated). And especially for the interrupt handlers, you want the code to be airtight, and that usually means you want to write it only once, so you have to verify it only once.

As I've frequently stated before, I don't really care about speed until I am made to worry about it, so I had not even considered if it could be worthwhile. But considering that Linux, which often sacrifices code quality for speed, sees no need to implement such optimization, I think it is evident that the costs outweigh the benefits here. I have made similar observations about a number of other topics. Is it really worth it to try and minimize the number of registers saved at interrupts or system calls? Apparently not.

Barry · Post by **Barry** » Thu Jan 06, 2022 6:11 am

Octocontrabass wrote:It should work as long as you're using the APIC instead of the PIC, but you'll probably want to use different timers in the future to get more control over when each CPU will be interrupted.

Is this just as simple as sending the commands through the Local APIC registers? Will this trigger IRQ 0 on the core, or is there some APIC specific method I have to use to get the interrupt to trigger schedule()? And does this also occur on the BSP, or should I keep using the PIT there?
I'm also a little confused as to what benefits I'd get from using different timers.
Other than that, you've basically answered all my questions and some I didn't even know I had.

nullplan · Post by **nullplan** » Thu Jan 06, 2022 6:57 am

Barry wrote:Is this just as simple as sending the commands through the Local APIC registers? Will this trigger IRQ 0 on the core, or is there some APIC specific method I have to use to get the interrupt to trigger schedule()? And does this also occur on the BSP, or should I keep using the PIT there?

If you are going to use multiple cores, you have to use the APIC. That means, for external devices you have to use the IOAPIC, and you can program the IOAPIC to send the interrupt to multiple cores. Or to one automatically selected core. But for that, you will need to find the correct input. Which generally means you have to read ACPI tables. Generally, while the PIT has IRQ 0 on the PIC, it has GSI 2 on the IOAPIC. You can find that out by reading the MADT, which will typically contain an entry telling you that IRQ 0 is redirected to GSI 2. And you need to read the MADT anyway to find all the other LAPICs to start all the other CPUs.

Barry wrote:I'm also a little confused as to what benefits I'd get from using different timers.

Less hardware involved and fewer things to go wrong. The LAPIC timers will always interrupt their local core (the LAPIC is attached to exactly one core), so you don't need to concern yourself with the PIT, or the IOAPIC, or multiple-node delivery modes. You have a different timer on each core, so you can also mask it out on each core separately, just by modifying local data, so no synchronization required. No weird edge cases when the timer fires at the wrong time while setting the mask and stuff like that. The LAPIC timers are just completely independent from each other, and way easier to deal with. Also, you don't need to wake all cores simultaneously, as would happen with the PIT waking all the cores. Imagine you have a threadripper CPU with 128 logical cores, and 127 of them are sleeping, and then the PIT goes off, and suddenly all 128 cores try to look at the scheduler queue at the same time. Lots of lock contention. Lots of wasted effort. If you use a dynticks system (requesting the timer only if it might be needed) and the LAPIC timers, then most of those cores could just sleep continuously and never wake up, unless something were to actually use all of them.

rdos · Post by **rdos** » Thu Jan 06, 2022 11:46 am

nullplan wrote: You need some way to tell CPUs apart in normal code, anyway, so I just use the GS base address for that. And I really don't know any way to do it that doesn't involve having either GS or FS hold a base address in kernel mode, or else using a paging trick to have different data at the same address. But the latter option is fraught with peril, so I just use the former one, and GS is nicer to use because of SWAPGS.

I've tried many different solutions, but the current one probably is the most effective (and doesn't consume any ordinary registers like fs or gs). I setup my GDT so the first few selectors are at the end of a page boundary, and then map the first part to different memory areas on different CPUs with paging. Thus, the GDT is still shared among cores and can be used for allocating shared descriptors up to the 8k limit. To get to the core area, the CPU will load a specific selector which will point directly to the core area. The core area also contains the linear address and the specific (shared) GDT selector allocated for each core.

This is for protected mode, but in long mode the CPU has to load a fixed linear address instead. Which is a bit less flexible.

rdos · Post by **rdos** » Thu Jan 06, 2022 11:59 am

nullplan wrote:The LAPIC timers will always interrupt their local core (the LAPIC is attached to exactly one core), so you don't need to concern yourself with the PIT, or the IOAPIC, or multiple-node delivery modes. You have a different timer on each core, so you can also mask it out on each core separately, just by modifying local data, so no synchronization required. No weird edge cases when the timer fires at the wrong time while setting the mask and stuff like that. The LAPIC timers are just completely independent from each other, and way easier to deal with. Also, you don't need to wake all cores simultaneously, as would happen with the PIT waking all the cores. Imagine you have a threadripper CPU with 128 logical cores, and 127 of them are sleeping, and then the PIT goes off, and suddenly all 128 cores try to look at the scheduler queue at the same time. Lots of lock contention. Lots of wasted effort. If you use a dynticks system (requesting the timer only if it might be needed) and the LAPIC timers, then most of those cores could just sleep continuously and never wake up, unless something were to actually use all of them.

I think it depends on which hardware you want to target. If you are only concerned with long mode you will always have the LAPIC so can always use it. However, if you also want to run on older hardware, then you need to have a solution that works with the PIT too. Another possibility is to use HPET.

Generally, you need both timers which can time events per core (or system) and a system time function with high precision. You often cannot implement system time with timers and so different resources are needed.

The use of timer hardware also determines if timer queues are global (PIT) or per core (LAPIC).

Another bigger issue with SMP scheduling is how to assign threads to cores. One way is to have lists of threads per core (threads are assigned to a specific core). I currently use this method. This needs to be combined with load balancing that moves threads between cores. Threads can be woken up by other means than timers, and this needs to use IPIs if the thread is not assigned to the current core.

There is also a need to use spinlocks for IRQ protection, as cli/sti will not work for this. Generally, cli/sti are never safe on SMP.

Perhaps the toughest issue to solve is how to synchronize the scheduler, particularly if it can be invoked from IRQs.

rdos · Post by **rdos** » Thu Jan 06, 2022 12:23 pm

Barry wrote:
Octocontrabass wrote:It should work as long as you're using the APIC instead of the PIC, but you'll probably want to use different timers in the future to get more control over when each CPU will be interrupted.
Is this just as simple as sending the commands through the Local APIC registers? Will this trigger IRQ 0 on the core, or is there some APIC specific method I have to use to get the interrupt to trigger schedule()? And does this also occur on the BSP, or should I keep using the PIT there?

You setup APIC entries (and MSI/MSI-X) with the core it should interrupt. There is a special mode which is supposed to deliver IRQs to the lowest priority core, but this never worked for me. Maybe it actually doesn't work, or I did it wrong.

Anyway, you also want to assign interrupt handler servers to the same core as the IRQ is delivered to, otherwise there will be extra IPIs involved. If you move the server to another core, you also need to update the APIC entry. Alternatively, servers could run on fixed cores.

initially, I assign all IRQs to the BSP, and then move those that cause the most activity to other cores. I also have an advanced scheme to discover how IRQs and threads are related so related threads can be moved with IRQs. This means I don't need to specify this as I hook IRQ handlers or start servers.

rdos · Post by **rdos** » Thu Jan 06, 2022 3:01 pm

Another important issue with multicore OSes is to handle TLB invalidations. With single-core, it's easy enough to directly invalidate areas as they are modified, but with multicore you need to inform all other cores in the system about it (with IPI). This requires keeping track of invalidated areas per core.

nullplan · Post by **nullplan** » Thu Jan 06, 2022 3:46 pm

rdos wrote:The use of timer hardware also determines if timer queues are global (PIT) or per core (LAPIC).

If you don't have LAPIC, you also can't have SMP, so I don't think the distinction matters.

rdos wrote:Another bigger issue with SMP scheduling is how to assign threads to cores. One way is to have lists of threads per core (threads are assigned to a specific core). I currently use this method. This needs to be combined with load balancing that moves threads between cores.

Same here. Although I currently pay no attention to the previously assigned CPU of a task if it had to go sleep. Yes, this likely does mean sub-par cache performance, but I haven't really noticed it so far.

rdos wrote:Threads can be woken up by other means than timers, and this needs to use IPIs if the thread is not assigned to the current core.

Currently, I have three OS-defined IPIs in use: One is for rescheduling, and causes the receiving CPU to set the timeout flag on the current task (as would an incoming timer interrupt). The second is for halting, and it just runs the HLT instruction in an infinite loop. That IPI is used for panic and shutdown. And finally, the third is the always awful TLB shootdown. There is just no way around it. I can minimize the situations I need it in, but some are just always going to remain.

rdos wrote:There is also a need to use spinlocks for IRQ protection, as cli/sti will not work for this. Generally, cli/sti are never safe on SMP.

It was this sentence that made me want to respond: Who are you writing this response to? Because if it is to me, then I already know that. And if it is to the OP, then you are abridging the truth here a bit too much.

Basically, your failure here is to distinguish between parallelism and concurrency. Parallelism is a property of the source code to both support and be safe under multiple threads of execution. Concurrency is then an attribute of the hardware to allow for the simultaneous execution of these threads. Now obviously, source code that assumes that "CLI" is the same as taking a lock is not parallel. In order to be parallel, you must protect all accesses to global variables with some kind of synchronization that ensures only one access takes place at a time.

I've made sure from day one that my kernel code was parallel. All accesses to global variables are behind atomic operations or locks. Spinlocks clear the interrupt flag, yes, but that is to avoid deadlocks from locking the same spinlock in system-call and interrupt contexts.

rdos wrote:Perhaps the toughest issue to solve is how to synchronize the scheduler, particularly if it can be invoked from IRQs.

As I've learned from the musl mailing list, fine-grained locking is fraught with its own kind of peril, so I just have a big scheduler spinlock that protects all the lists, and that seems to work well enough for now.

rdos wrote:I've tried many different solutions, but the current one probably is the most effective (and doesn't consume any ordinary registers like fs or gs). I setup my GDT so the first few selectors are at the end of a page boundary, and then map the first part to different memory areas on different CPUs with paging. Thus, the GDT is still shared among cores and can be used for allocating shared descriptors up to the 8k limit. To get to the core area, the CPU will load a specific selector which will point directly to the core area. The core area also contains the linear address and the specific (shared) GDT selector allocated for each core.

This is for protected mode, but in long mode the CPU has to load a fixed linear address instead. Which is a bit less flexible.

See, this is precisely what I meant. Now you have a complicated memory model where the same kernel-space address points to different things on different CPUs. Leaving alone that this means you need multiple kernel-space paging structures, this means you cannot necessarily share a pointer to a variable with another thread that might execute on a different core for whatever reason, and this is a complication that was just unacceptable to me.

Also, FS and GS aren't ordinary registers, they are segment registers, and their only job in long mode is to point to something program-defined. So in kernel-mode I make them point to CPU-specific data. That doesn't really consume anything that isn't there anyway and would be unused otherwise.

Ethin · Post by **Ethin** » Thu Jan 06, 2022 3:59 pm

In my kernel I do something rather different: I fill the IDT with interrupt handlers (I fill all of them). By default they don't do anything (the list of handlers is empty), but when a driver gets initialized it can just register an interrupt handler. My kernel uses locks everywhere (it uses lazy initialization too, but those are backed by locks) so I have nothing to worry about there.

I am a bit unsure on what timer subsystem(s) to use. I've switched to just using the X2APIC, but that brings in a few unwanted questions:

The X2APIC uses MSRs instead of MMIO. The intel manuals imply that once the X2APIC is used, you cannot use the MMIO addresses. Therefore, how would you set up MSI or MSI-X?
I use the HPET for my sleep function. But I don't know if I should use that or the LAPIC timer (its a global function).

OSDev.org

Multi-core Programming

Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming

Re: Multi-core Programming