Multi-core Programming

Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

So I've tried to follow as much as I could of what was explained in the thread, and I'm still having some trouble.
I've disabled the PIC (masked all of its interrupts), I've enabled the APICs (set up the spurious interrupt vector register), sent all the SIPIs to the cores, and got them to run the trampoline code, set up their own stacks and move into protected mode. I've run LIDT on each core with the interrupt table, which for now points every interrupt at a dummy handler (it just prints which interrupt number was called).
I then attempt to get each core to set up its APIC timer, configuring it with the PIT in one-shot mode. If I read the current count, I can see it going down, but I'm still not getting an actual interrupt.
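For reference, the LAPIC side of a one-shot timer setup looks roughly like this. This is a simplified sketch rather than my exact code: TIMER_VECTOR is a placeholder, the tick count would come from the PIT calibration, the register offsets are from the Intel SDM, and the LAPIC is assumed to be reachable at its default physical address.

Code:
#include <stdint.h>

#define SPURIOUS_REG   (0x0F0 / 4)   // offsets in 32-bit words
#define LVT_TIMER      (0x320 / 4)
#define INITIAL_COUNT  (0x380 / 4)
#define CURRENT_COUNT  (0x390 / 4)
#define DIVIDE_CONFIG  (0x3E0 / 4)
#define TIMER_VECTOR   0x40          // example vector

static volatile uint32_t *const lapic = (volatile uint32_t *)0xFEE00000;

void lapic_timer_oneshot(uint32_t ticks)
{
    // Software-enable the LAPIC: spurious vector plus enable bit 8.
    lapic[SPURIOUS_REG]  = 0xFF | (1u << 8);

    lapic[DIVIDE_CONFIG] = 0x3;             // divide the input clock by 16
    lapic[LVT_TIMER]     = TIMER_VECTOR;    // one-shot mode, mask bit (16) clear
    lapic[INITIAL_COUNT] = ticks;           // counts down to 0, then delivers TIMER_VECTOR
    // The interrupt still needs IF set (STI) and an IDT entry for TIMER_VECTOR.
}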
If someone can see where I'm going wrong, I'd be very grateful to know.
Octocontrabass
Member
Posts: 5563
Joined: Mon Mar 25, 2013 7:01 pm

Re: Multi-core Programming

Post by Octocontrabass »

Ethin wrote:The X2APIC uses MSRs instead of MMIO. The Intel manuals imply that once the X2APIC is used, you cannot use the MMIO addresses. Therefore, how would you set up MSI or MSI-X?
Volume 3A section 10.12.7 says you need to use the IOMMU to translate the interrupt messages into something the x2APIC can use. Aside from using a destination ID that will be translated to the actual destination by the IOMMU, setup on the PCI device is the same.
Ethin wrote:I use the HPET for my sleep function. But I don't know if I should use that or the LAPIC timer (it's a global function).
Each CPU has its own LAPIC, but there's only one HPET.
Barry wrote:If someone can see where I'm going wrong, I'd be very grateful to know.
You didn't forget STI, did you? Or the IOAPIC? Or the MADT?
Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

Octocontrabass wrote:You didn't forget STI, did you? Or the IOAPIC? Or the MADT?
I've run STI on each core. What exactly do I need to do with the IOAPIC?
I'm using the MP Config Tables rather than the MADT because my test machine is quite old. Should I switch to the MADT, or can I do it with these tables?
I know that there's going to be some work in redirecting interrupts, but I can't even get a local timer interrupt working on the same core.
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Re: Multi-core Programming

Post by rdos »

nullplan wrote:
rdos wrote:The use of timer hardware also determines if timer queues are global (PIT) or per core (LAPIC).
If you don't have LAPIC, you also can't have SMP, so I don't think the distinction matters.
SMP is just a module in my system that implements synchronization, locks and interrupts in different ways. I still run the same scheduler on both SMP and non-SMP systems, and the timer functions don't depend on having SMP or not, but rather on which timer hardware exists.
nullplan wrote: Currently, I have three OS-defined IPIs in use: One is for rescheduling, and causes the receiving CPU to set the timeout flag on the current task (as would an incoming timer interrupt). The second is for halting, and it just runs the HLT instruction in an infinite loop. That IPI is used for panic and shutdown. And finally, the third is the always awful TLB shootdown. There is just no way around it. I can minimize the situations I need it in, but some are just always going to remain.
I use the NMI IPI for panic. It will always work regardless of what the core is doing.
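For reference, sending an IPI through the xAPIC ICR looks roughly like this (a generic sketch, not my actual code; lapic is assumed to point at the mapped register window, and RESCHED_VECTOR is just an example name):

Code:
#include <stdint.h>

#define ICR_LOW            (0x300 / 4)   // offsets in 32-bit words
#define ICR_HIGH           (0x310 / 4)
#define ICR_DELIVERY_FIXED (0u << 8)
#define ICR_DELIVERY_NMI   (4u << 8)
#define ICR_BUSY           (1u << 12)    // delivery status bit

static void send_ipi(volatile uint32_t *lapic, uint32_t apic_id, uint32_t mode_and_vector)
{
    lapic[ICR_HIGH] = apic_id << 24;     // physical destination
    lapic[ICR_LOW]  = mode_and_vector;   // writing the low dword sends the IPI
    while (lapic[ICR_LOW] & ICR_BUSY)    // wait until the IPI has been accepted
        __asm__ volatile("pause");
}

// Panic:      send_ipi(lapic, other_id, ICR_DELIVERY_NMI);
// Reschedule: send_ipi(lapic, other_id, ICR_DELIVERY_FIXED | RESCHED_VECTOR);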
nullplan wrote: Basically, your failure here is to distinguish between parallelism and concurrency. Parallelism is a property of the source code to both support and be safe under multiple threads of execution. Concurrency is then an attribute of the hardware to allow for the simultaneous execution of these threads. Now obviously, source code that assumes that "CLI" is the same as taking a lock is not parallel. In order to be parallel, you must protect all accesses to global variables with some kind of synchronization that ensures only one access takes place at a time.
Maybe. In my experience, cli/sti was often used to synchronize variables shared with IRQs, and when going SMP this needs to become spinlocks. Other than that, I don't see any reason to use cli or spinlocks. If you are just protecting code paths from each other within your software, semaphores, mutexes or critical sections (or whatever you call them) are the way to achieve locking, not cli or spinlocks. The synchronization primitives themselves might need cli, spinlocks or scheduler locks in their implementation, but this should not be visible to their users.

In drivers that need to use spinlocks to synchronize with an IRQ, I have a generic function for this that translates to cli/sti on non-SMP and a spinlock for SMP.
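Something along these lines, as a minimal sketch with made-up names rather than my actual code (it assumes a GCC-style compiler and a CONFIG_SMP build switch):

Code:
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;              // initialise with ATOMIC_FLAG_INIT
} irq_lock_t;

static inline unsigned long save_and_cli(void)
{
    unsigned long flags;
    __asm__ volatile("pushf; pop %0; cli" : "=r"(flags) :: "memory");
    return flags;
}

static inline void restore_flags(unsigned long flags)
{
    __asm__ volatile("push %0; popf" :: "r"(flags) : "memory", "cc");
}

unsigned long irq_lock(irq_lock_t *l)
{
    unsigned long flags = save_and_cli();      // always block local IRQs
#ifdef CONFIG_SMP
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        __asm__ volatile("pause");             // spin only if another core holds it
#else
    (void)l;                                   // cli alone is enough on one core
#endif
    return flags;
}

void irq_unlock(irq_lock_t *l, unsigned long flags)
{
#ifdef CONFIG_SMP
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
#else
    (void)l;
#endif
    restore_flags(flags);
}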
nullplan wrote: I've made sure from day one that my kernel code was parallel. All accesses to global variables are behind atomic operations or locks. Spinlocks clear the interrupt flag, yes, but that is to avoid deadlocks from locking the same spinlock in system-call and interrupt contexts.
rdos wrote:Perhaps the toughest issue to solve is how to synchronize the scheduler, particularly if it can be invoked from IRQs.
As I've learned from the musl mailing list, fine-grained locking is fraught with its own kind of peril, so I just have a big scheduler spinlock that protects all the lists, and that seems to work well enough for now.
My kernel originally wasn't SMP aware, but it used a few synchronization primitives that were implemented in the scheduler. It also used cli/sti to synchronize with IRQs. When I switched to SMP, I just needed to modify the synchronization primitives to be SMP safe, and to replace cli/sti with the new generic spinlock function.

I have some places that use lock-free code (the physical memory manager) where I actually wrote code that is SMP safe without spinlocks or mutexes, but it's generally too burdensome to write code this way. Not to mention that it needs to be carefully evaluated to make sure it really is safe in all possible scenarios.
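For illustration, here is a minimal lock-free free-page stack built on compare-and-swap. It's a generic sketch, not my actual allocator, and it ignores the ABA problem that a real implementation has to handle:

Code:
#include <stdatomic.h>
#include <stddef.h>

// Free pages are linked through their first word. Pages stay mapped, so a
// stale read of ->next is harmless, but ABA still needs e.g. a version counter.
struct free_page {
    struct free_page *next;
};

static _Atomic(struct free_page *) free_list = NULL;

void free_page_push(struct free_page *page)
{
    struct free_page *head = atomic_load_explicit(&free_list, memory_order_relaxed);
    do {
        page->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&free_list, &head, page,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

struct free_page *free_page_pop(void)
{
    struct free_page *head = atomic_load_explicit(&free_list, memory_order_acquire);
    while (head &&
           !atomic_compare_exchange_weak_explicit(&free_list, &head, head->next,
                                                  memory_order_acquire,
                                                  memory_order_acquire))
        ;   // head is reloaded by the failed CAS
    return head;
}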
nullplan wrote: See, this is precisely what I meant. Now you have a complicated memory model where the same kernel-space address points to different things on different CPUs. Leaving alone that this means you need multiple kernel-space paging structures, this means you cannot necessarily share a pointer to a variable with another thread that might execute on a different core for whatever reason, and this is a complication that was just unacceptable to me.
Not really. Each CPU allocates its own GDT in the linear address space, and then maps the first page to its own page, which links to the core data structure. All CPUs still use the same paging structures and the same linear address space. The only difference is that if you load GDT selector 40 it will map to the unique core structure. This structure is mapped to a unique linear address and selector, which are different for each core. So, it is GDT selector 40 that is mapped to this unique address that differs between cores. Therefore, when a core wants to do something in another core's data, it can do this by loading a linear address (or a selector) from a table of available cores. The core itself can also get its own linear address (or selector) from the core data, and when loading this, it can call code that needs to operate on the unique addresses.
nullplan wrote: Also, FS and GS aren't ordinary registers, they are segment registers, and their only job in long mode is to point to something program-defined. So in kernel-mode I make them point to CPU-specific data. That doesn't really consume anything that isn't there anyway and would be unused otherwise.
Then you need some way to do this (effectively) every time the processor switches (or potentially switches) to kernel mode, something that slows down code.

In long mode, this might work well, but not in protected mode, and particularly not when a segmented memory model is used which might pass parameters in fs or gs.
Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

I've managed to solve my issues! I switched to using the MADT and moved the order of some things around.
I have just one more problem though:
The local APIC timer interrupts the APs but not the BSP. Is this normal (do I need to use the PIT for the BSP)?
Other than that, I've got interrupts working on each processor and the IO APIC configured, and I'm starting to make my scheduler run on multiple cores. It's just not running on the BSP. Somehow I'm in the exact opposite situation from where I started. :?
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Multi-core Programming

Post by nullplan »

rdos wrote:I have some places that use lock-free code (the physical memory manager) where I actually wrote code that is SMP safe without spinlocks or mutexes, but it's generally too burdensome to write code this way. Not to mention that it needs to be carefully evaluated to make sure it really is safe in all possible scenarios.
Agreed, lockless is cool but error-prone.
rdos wrote:So, it is GDT selector 40 that is mapped to this unique address that differs between cores.[...]Then you need some way to do this (effectively) every time the processor switches (or potentially switches) to kernel mode, something that slows down code.
But... if you need to load a specific segment with a special value in kernel mode, then your code has the exact same "problem". A problem you are rather overstating, anyway. Segment operations are slow, but not that slow. I can easily perform this work each time the kernel is entered without any problems in practice.
rdos wrote:In long mode, this might work well, but not in protected mode,
In long mode, you have SWAPGS, and in protected mode you have nonzero base addresses. And a segmentation logic that actually stacks. In long mode it is a bit more complicated, because SWAPGS is stateless. That said, I am not overly concerned with protected mode.
rdos wrote:and particularly not when a segmented memory model is used which might pass parameters in fs or gs.
So that's one more downside to a segmented memory model, then: diminishing your options for storing a thread pointer. My god, how glad am I that I don't have to deal with segmentation in earnest.
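For reference, the GS-based per-CPU scheme looks roughly like this in C with a little inline assembly (a simplified sketch with made-up names, not my exact kernel code):

Code:
#include <stdint.h>

// Per-CPU block reached through the GS base.
struct percpu {
    struct percpu *self;          // so "mov %gs:0, reg" yields the block's own address
    uint32_t       cpu_id;
    void          *current_task;
};

#define IA32_GS_BASE 0xC0000101u

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" :: "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

// Called once per CPU during bring-up, while running in kernel mode.
void percpu_install(struct percpu *pc, uint32_t id)
{
    pc->self   = pc;
    pc->cpu_id = id;
    wrmsr(IA32_GS_BASE, (uint64_t)(uintptr_t)pc);
    // On the way out to user mode the exit path executes SWAPGS, parking this
    // value in IA32_KERNEL_GS_BASE; the entry path executes SWAPGS again to
    // restore it before touching %gs-relative data.
}

static inline struct percpu *this_cpu(void)
{
    struct percpu *pc;
    __asm__ volatile("movq %%gs:0, %0" : "=r"(pc));   // reads percpu->self
    return pc;
}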
Barry wrote:The local APIC timer interrupts the APs but not the BSP. Is this normal (do I need to use the PIT for the BSP)?
No. No, this is not normal; it ought to work. Interrupt flag set on the BSP as well? Is the timer masked out in the LAPIC?
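A quick way to check the latter (a small sketch; lapic is assumed to point at the mapped register window):

Code:
#include <stdbool.h>
#include <stdint.h>

// Bit 16 of the LVT Timer register (offset 0x320) is the mask bit.
static bool lapic_timer_is_masked(volatile uint32_t *lapic)
{
    return (lapic[0x320 / 4] & (1u << 16)) != 0;
}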
Carpe diem!
Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

nullplan wrote:No. No, this is not normal; it ought to work. Interrupt flag set on the BSP as well? Is the timer masked out in the LAPIC?
Turns out the BSP wasn't starting the timer; the routine was only being called by the AP code. Everything seems to be working fine now in terms of sending and receiving interrupts. However, when I switch the page directory it causes the processor to reboot, which is a new development since I started using the APIC. The issue seems to be writing to CR0.
Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

I've figured out the issue I was having: the kernel is page faulting when I try to access the local APIC registers at 0xFEE00000. I just need to either identity-map them or move them to lower memory.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Multi-core Programming

Post by nullplan »

Barry wrote:I've figured out the issue I was having: the kernel is page faulting when I try to access the local APIC registers at 0xFEE00000. I just need to either identity-map them or move them to lower memory.
You need to map that address. 0xFEE00000 is a physical address, and you need to map it into your address space somehow.

You can move the APIC register window by setting some MSR, but that is rarely necessary or wise. As before said, I suggest mapping all memory from 0xffff800000000000 onward. That place in virtual memory is not needed for anything else, and you can pretty much handle up to 64 TB of physical memory like that. Which ought to be enough for your purposes, right? Anyway, this way, translating a physical to a virtual address is as simple as adding the base address.
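In code, that translation is nothing more than the following (a sketch, assuming such a direct map has been set up):

Code:
#include <stdint.h>

#define DIRECT_MAP_BASE 0xffff800000000000ull   // all of physical memory mapped here

static inline void *phys_to_virt(uint64_t phys)
{
    return (void *)(uintptr_t)(DIRECT_MAP_BASE + phys);
}

// For example, the xAPIC registers:
// volatile uint32_t *lapic = phys_to_virt(0xFEE00000);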
Carpe diem!
Barry
Member
Posts: 54
Joined: Mon Aug 27, 2018 12:50 pm

Re: Multi-core Programming

Post by Barry »

nullplan wrote:You can move the APIC register window by setting some MSR, but that is rarely necessary or wise. As before said, I suggest mapping all memory from 0xffff800000000000 onward.
That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit far out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares.
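Identity-mapping one MMIO page with 32-bit non-PAE paging looks roughly like this (a sketch with made-up names, not my exact code; it assumes the page table covering that region is already present):

Code:
#include <stdint.h>

#define PTE_PRESENT   0x001u
#define PTE_WRITABLE  0x002u
#define PTE_CACHE_DIS 0x010u   // MMIO must not be cached

void identity_map_mmio(uint32_t *page_table, uint32_t phys)
{
    uint32_t index = (phys >> 12) & 0x3FFu;   // PTE index within the table
    page_table[index] = (phys & 0xFFFFF000u) | PTE_PRESENT | PTE_WRITABLE | PTE_CACHE_DIS;
    __asm__ volatile("invlpg (%0)" :: "r"(phys) : "memory");
}

// e.g. identity_map_mmio(table_covering_0xFEE00000, 0xFEE00000);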
Everything does seem to be working now, however, so thanks for all the help.
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: Multi-core Programming

Post by xeyes »

rdos wrote:
nullplan wrote: You need some way to tell CPUs apart in normal code, anyway, so I just use the GS base address for that. And I really don't know any way to do it that doesn't involve having either GS or FS hold a base address in kernel mode, or else using a paging trick to have different data at the same address. But the latter option is fraught with peril, so I just use the former one, and GS is nicer to use because of SWAPGS.
I've tried many different solutions, but the current one is probably the most effective (and doesn't consume any ordinary registers like fs or gs). I set up my GDT so the first few selectors are at the end of a page boundary, and then map that first part to different memory areas on different CPUs with paging. Thus, the GDT is still shared among cores and can be used for allocating shared descriptors up to the 8k limit. To get to the core area, the CPU loads a specific selector which points directly to the core area. The core area also contains the linear address and the specific (shared) GDT selector allocated for each core.

This is for protected mode; in long mode the CPU has to load a fixed linear address instead, which is a bit less flexible.
My methods of telling them apart:

I read the LAPIC's ID if there's an APIC (x2APIC is an MSR read; even the xAPIC is on-chip and, as a device frequently used by mainstream kernels, its MMIO path is likely optimized).

If there's no APIC available, I read out TR and do some math; all cores' TSSs are arranged in a consecutive array with a known base and power-of-two size in my case.
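In rough code, with placeholder names and values rather than my actual ones:

Code:
#include <stdint.h>

// x2APIC: the full 32-bit APIC ID is in MSR 0x802.
static inline uint32_t x2apic_id(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(0x802));
    (void)hi;   // high dword unused for the ID
    return lo;
}

// xAPIC: bits 31:24 of the ID register at offset 0x20 in the MMIO window.
static inline uint32_t xapic_id(volatile uint32_t *lapic_mmio)
{
    return lapic_mmio[0x20 / 4] >> 24;
}

// No APIC: derive an index from TR, assuming one TSS per CPU with descriptors
// allocated contiguously from TSS_SEL_BASE (placeholder values).
#define TSS_SEL_BASE   0x40u
#define TSS_SEL_STRIDE 0x08u

static inline uint32_t cpu_index_from_tr(void)
{
    uint16_t tr;
    __asm__ volatile("str %0" : "=r"(tr));
    return (uint32_t)(tr - TSS_SEL_BASE) / TSS_SEL_STRIDE;
}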
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Re: Multi-core Programming

Post by xeyes »

Barry wrote:
nullplan wrote:You can move the APIC register window by setting some MSR, but that is rarely necessary or wise. As before said, I suggest mapping all memory from 0xffff800000000000 onward.
That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit far out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares.
Everything does seem to be working now, however, so thanks for all the help.
I was annoyed that 32-bit OSes would "eat" up to 1GB of my precious RAM 15 years ago, until I found myself in the same shoes as them and you. :oops:

Consider taking the full 1GB from C000_0000 without worrying too much about it page by page; it's probably the best trade-off among a world of different priorities, given that both Linux and Windows took essentially the same approach (I'm sure they are much more sophisticated than just taking the full 1GB from C000_0000, but the end result for the user is more or less the same, depending on devices).
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Re: Multi-core Programming

Post by rdos »

xeyes wrote: My methods of telling them apart:

I read the LAPIC's ID if there's an APIC (x2APIC is an MSR read; even the xAPIC is on-chip and, as a device frequently used by mainstream kernels, its MMIO path is likely optimized).
That would be a possibility with a flat kernel assuming a fixed APIC physical address, or if you always map the APIC to a fixed linear address. Although it won't work on older systems with no APIC.
xeyes wrote: If there's no APIC available, I read out TR and do some math; all cores' TSSs are arranged in a consecutive array with a known base and power-of-two size in my case.
Won't work for me since I have a TSS per thread, although I use this in user-mode code to identify whether a thread owns a lock or not. This is possible since TR can also be read out in user mode without faulting.
rdos
Member
Posts: 3296
Joined: Wed Oct 01, 2008 1:55 pm

Re: Multi-core Programming

Post by rdos »

xeyes wrote:
Barry wrote:
nullplan wrote:You can move the APIC register window by setting some MSR, but that is rarely necessary or wise. As before said, I suggest mapping all memory from 0xffff800000000000 onward.
That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit far out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares.
Everything does seem to be working now, however, so thanks for all the help.
I was annoyed that 32-bit OSes would "eat" up to 1GB of my precious RAM 15 years ago, until I found myself in the same shoes as them and you. :oops:

Consider taking the full 1GB from C000_0000 without worrying too much about it page by page; it's probably the best trade-off among a world of different priorities, given that both Linux and Windows took essentially the same approach (I'm sure they are much more sophisticated than just taking the full 1GB from C000_0000, but the end result for the user is more or less the same, depending on devices).
The biggest problem I have with the limits (1GB) of kernel space is that filesystem buffers are mapped there, which causes a lot of issues. In my new design (which I worked on a lot half a year ago), I store these buffers as physical addresses and only map them into the limited linear address space on demand. I think that solves most of it. The FS drivers also run in their own address space, with another 1G of shared linear address space per drive.

I've also discovered that I can handle 100GB of data in my 32-bit application just by requesting that the kernel map the data on a "block" basis. It works very well and probably has a minimal performance hit, provided the application accesses the data in a smart way.

So, if you write smart code, not having a 64-bit OS is not that much of an issue. I've dropped the idea of moving to long mode, as it doesn't seem beneficial enough.
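As a rough illustration of the block-based access (the calls here are made up for the example, not the actual RDOS API):

Code:
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (4u * 1024 * 1024)   // e.g. 4 MiB mapping granularity

// Made-up system calls for the example: map block N of a data set into the
// 32-bit address space, and release a previous mapping.
extern void *os_map_block(int handle, uint64_t block_no);
extern void  os_unmap_block(void *window);

// Reads one byte anywhere in a data set far larger than the address space,
// remapping only when the access leaves the current block. (Single window,
// not thread safe; a real application would manage several windows.)
uint8_t read_byte(int handle, uint64_t offset)
{
    static void    *window = NULL;
    static uint64_t mapped = UINT64_MAX;

    uint64_t block = offset / BLOCK_SIZE;
    if (block != mapped) {
        if (window)
            os_unmap_block(window);
        window = os_map_block(handle, block);
        mapped = block;
    }
    return ((uint8_t *)window)[offset % BLOCK_SIZE];
}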