Multi-core Programming
Re: Multi-core Programming
So I've tried to follow as much as I could of what was explained in the thread, and I'm still having some trouble.
I've disabled the PIC (masked all of its interrupts), enabled the APICs (set up the spurious interrupt vector register), sent all the SIPIs to the cores, and got them to run the trampoline code, set up their own stacks, and move into protected mode. I've run LIDT on each core with the interrupt table, which for now points every interrupt to a dummy handler that just prints which interrupt number was called.
I then attempt to get them to set up their APIC timers, calibrating them with the PIT in one-shot mode. If I read the current count I can see it going down, but I'm still not getting an actual interrupt.
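For reference, this is roughly what each core does (a simplified sketch of the setup described above; it assumes the xAPIC at the default 0xFEE00000, the register offsets from the SDM, vector 0x40 for the timer, and pit_wait_ms() standing in for the PIT one-shot delay):

Code:
#include <stdint.h>

#define LAPIC_BASE      0xFEE00000u
#define LAPIC_SVR       0x0F0u   /* spurious interrupt vector register */
#define LAPIC_LVT_TIMER 0x320u
#define LAPIC_TIMER_ICR 0x380u   /* initial count */
#define LAPIC_TIMER_CCR 0x390u   /* current count */
#define LAPIC_TIMER_DCR 0x3E0u   /* divide configuration */

extern void pit_wait_ms(unsigned ms);   /* hypothetical PIT one-shot delay */

static volatile uint32_t *const lapic = (volatile uint32_t *)LAPIC_BASE;

static inline void lapic_write(uint32_t reg, uint32_t val) { lapic[reg / 4] = val; }
static inline uint32_t lapic_read(uint32_t reg) { return lapic[reg / 4]; }

void lapic_timer_init(void)
{
    lapic_write(LAPIC_SVR, 0x100u | 0xFFu);   /* software-enable the APIC, spurious vector 0xFF */
    lapic_write(LAPIC_TIMER_DCR, 0x3u);       /* divide by 16 */

    /* Calibrate: let the timer count down for 10 ms measured by the PIT. */
    lapic_write(LAPIC_TIMER_ICR, 0xFFFFFFFFu);
    pit_wait_ms(10);
    uint32_t ticks = 0xFFFFFFFFu - lapic_read(LAPIC_TIMER_CCR);

    /* Periodic mode (bit 17), vector 0x40, mask bit (16) clear so it actually fires. */
    lapic_write(LAPIC_LVT_TIMER, (1u << 17) | 0x40u);
    lapic_write(LAPIC_TIMER_ICR, ticks);      /* one interrupt every 10 ms */

    __asm__ volatile ("sti");
}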
If someone can see where I'm going wrong, I'd be very grateful to know.
Re: Multi-core Programming
Ethin wrote:The X2APIC uses MSRs instead of MMIO. The intel manuals imply that once the X2APIC is used, you cannot use the MMIO addresses. Therefore, how would you set up MSI or MSI-X?
Volume 3A section 10.12.7 says you need to use the IOMMU to translate the interrupt messages into something the x2APIC can use. Aside from using a destination ID that will be translated to the actual destination by the IOMMU, setup on the PCI device is the same.
Ethin wrote:I use the HPET for my sleep function. But I don't know if I should use that or the LAPIC timer (its a global function).
Each CPU has its own LAPIC, but there's only one HPET.
Barry wrote:If someone can see where I'm going wrong, I'd be very grateful to know.
You didn't forget STI, did you? Or the IOAPIC? Or the MADT?
Re: Multi-core Programming
Octocontrabass wrote:You didn't forget STI, did you? Or the IOAPIC? Or the MADT?
I've run STI on each core. What exactly do I need to do with the IOAPIC?
I'm using the MP Configuration Table rather than the MADT because my test machine is quite old. Should I switch to the MADT, or can I do it with these tables?
I know that there's going to be some work in redirecting interrupts, but I can't even get a local timer interrupt working on the same core.
Re: Multi-core Programming
rdos wrote:The use of timer hardware also determines if timer queues are global (PIT) or per core (LAPIC).
nullplan wrote:If you don't have LAPIC, you also can't have SMP, so I don't think the distinction matters.
SMP is just a module in my system that implements synchronization, locks & interrupts in different ways. I still run the same scheduler on both SMP and non-SMP systems, and the timer functions don't depend on whether I have SMP, but rather on which timer hardware exists.
nullplan wrote:Currently, I have three OS-defined IPIs in use: One is for rescheduling, and causes the receiving CPU to set the timeout flag on the current task (as would an incoming timer interrupt). The second is for halting, and it just runs the HLT instruction in an infinite loop. That IPI is used for panic and shutdown. And finally, the third is the always awful TLB shootdown. There is just no way around it. I can minimize the situations I need it in, but some are just always going to remain.
I use the NMI IPI for panic. It will always work regardless of what the core is doing.
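For the panic case the send itself is tiny; a sketch for the xAPIC MMIO interface, using the "all excluding self" shorthand with NMI delivery mode:

Code:
#include <stdint.h>

/* Send an NMI IPI to every core except the caller (xAPIC MMIO, sketch). */
static void panic_ipi_all_but_self(void)
{
    volatile uint32_t *icr_lo = (volatile uint32_t *)0xFEE00300;  /* ICR low dword */
    /* bits 19:18 = 11b (all excluding self), bits 10:8 = 100b (NMI delivery) */
    *icr_lo = (3u << 18) | (4u << 8);
}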
nullplan wrote:Basically, your failure here is to distinguish between parallelism and concurrency. Parallelism is a property of the source code to both support and be safe under multiple threads of execution. Concurrency is then an attribute of the hardware to allow for the simultaneous execution of these threads. Now obviously, source code that assumes that "CLI" is the same as taking a lock is not parallel. In order to be parallel, you must protect all accesses to global variables with some kind of synchronization that ensures only one access takes place at a time.
Maybe. In my experience, cli/sti was often used to synchronize variables that were shared with IRQs, and when going SMP this needs to use spinlocks. Other than that, I don't see any reason to use cli or spinlocks. If you just need to protect code against other parts of your software, semaphores, mutexes, or critical sections (or whatever you call them) are the way to achieve locking, not cli or spinlocks. The synchronization primitives themselves might need cli, spinlocks, or scheduler locks in their implementation, but this should not be visible to their users.
For drivers that need to synchronize with an IRQ, I have a generic function that translates to cli/sti on non-SMP and to a spinlock on SMP.
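Roughly this idea, in a C sketch (not my actual code; CONFIG_SMP stands in for whatever build-time switch you use):

Code:
#include <stdatomic.h>

typedef struct {
    atomic_flag lock;   /* only contended on SMP builds */
} irq_lock_t;

static inline unsigned long save_and_cli(void)
{
    unsigned long flags;
    __asm__ volatile ("pushf; pop %0; cli" : "=r"(flags) : : "memory");
    return flags;
}

static inline void restore_flags(unsigned long flags)
{
    __asm__ volatile ("push %0; popf" : : "r"(flags) : "memory", "cc");
}

/* Acquire: always disable local interrupts; on SMP, also spin on the lock. */
static inline unsigned long irq_lock(irq_lock_t *l)
{
    unsigned long flags = save_and_cli();
#ifdef CONFIG_SMP
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        __asm__ volatile ("pause");
#else
    (void)l;
#endif
    return flags;
}

/* Release: on SMP, drop the lock first, then restore the saved interrupt flag. */
static inline void irq_unlock(irq_lock_t *l, unsigned long flags)
{
#ifdef CONFIG_SMP
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
#else
    (void)l;
#endif
    restore_flags(flags);
}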
rdos wrote:Perhaps the toughest issue to solve is how to synchronize the scheduler, particularly if it can be invoked from IRQs.
nullplan wrote:I've made sure from day one that my kernel code was parallel. All accesses to global variables are behind atomic operations or locks. Spinlocks clear the interrupt flag, yes, but that is to avoid deadlocks from locking the same spinlock in system-call and interrupt contexts. As I've learned from the musl mailing list, fine-grained locking is fraught with its own kind of peril, so I just have a big scheduler spinlock that protects all the lists, and that seems to work well enough for now.
My kernel originally wasn't SMP aware, but it used a few synchronization primitives that were implemented in the scheduler. It also used cli/sti to synchronize with IRQs. When I switched to SMP, I just needed to make the synchronization primitives SMP safe, and to remove the cli/sti pairs and replace them with the new generic spinlock function.
I have some places that use lock-free code (the physical memory manager) where I actually wrote code that is SMP safe without using spinlocks or mutexes, but it's generally too burdensome to write code this way. Not to mention that it needs to be carefully evaluated to be sure it really is safe in all possible scenarios.
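To illustrate the flavour (this is not my allocator, just a minimal sketch): claiming a physical frame from a bitmap with a single compare-and-swap per attempt can look like this:

Code:
#include <stdatomic.h>
#include <stdint.h>

#define MAX_FRAMES (1u << 20)                 /* example: 4 GiB worth of 4 KiB frames */

static _Atomic uint32_t frame_bitmap[MAX_FRAMES / 32];

/* Claim a free frame (a clear bit means free); returns the frame number or -1. */
long pmm_alloc_frame(void)
{
    for (uint32_t i = 0; i < MAX_FRAMES / 32; i++) {
        uint32_t w = atomic_load_explicit(&frame_bitmap[i], memory_order_relaxed);
        while (w != 0xFFFFFFFFu) {
            uint32_t bit = (uint32_t)__builtin_ctz(~w);      /* first clear bit = free frame */
            if (atomic_compare_exchange_weak_explicit(&frame_bitmap[i], &w, w | (1u << bit),
                                                      memory_order_acquire, memory_order_relaxed))
                return (long)(i * 32 + bit);
            /* on failure the CAS reloaded w; retry the same word */
        }
    }
    return -1;                                               /* out of physical memory */
}

/* Release a frame: atomically clear its bit. */
void pmm_free_frame(unsigned long frame)
{
    atomic_fetch_and_explicit(&frame_bitmap[frame / 32],
                              ~(1u << (frame % 32)), memory_order_release);
}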
nullplan wrote:See, this is precisely what I meant. Now you have a complicated memory model where the same kernel-space address points to different things on different CPUs. Leaving alone that this means you need multiple kernel-space paging structures, this means you cannot necessarily share a pointer to a variable with another thread that might execute on a different core for whatever reason, and this is a complication that was just unacceptable to me.
Not really. Each CPU allocates its own GDT in linear address space, and then maps the first page to its own page, which links to the core data structure. All CPUs still use the same paging structures and the same linear address space. The only difference is that if you load GDT selector 40, it will map to the unique core structure. This structure is mapped to a unique linear address and selector, which are different for each core. So it is GDT selector 40, mapped to this unique address, that differs between cores. Therefore, when a core wants to do something with another core's data, it can do so by loading a linear address (or a selector) from a table of available cores. The core itself can also get its linear address (or selector) from the core data itself, and when loading this, can call code that needs to operate on the unique addresses.
nullplan wrote:Also, FS and GS aren't ordinary registers, they are segment registers, and their only job in long mode is to point to something program-defined. So in kernel-mode I make them point to CPU-specific data. That doesn't really consume anything that isn't there anyway and would be unused otherwise.
Then you need some way to do this (effectively) every time the processor switches (or potentially switches) to kernel mode, which slows down the code.
In long mode this might work well, but not in protected mode, and particularly not when a segmented memory model is used that might pass parameters in fs or gs.
Re: Multi-core Programming
I've managed to solve my issues! I switched to using the MADT and moved the order of some things around.
I have just one more problem though:
The local APIC timer interrupts the APs but not the BSP. Is this normal (do I need to use the PIT for the BSP)?
Other than that, I've got interrupts working on each processor and the IO APIC configured, and I'm starting to make my scheduler run on multiple cores. It's just not running on the BSP. Somehow I'm in the exact opposite situation from where I started.
Re: Multi-core Programming
rdos wrote:I have some places that use lock-free code (the physical memory manager) where I actually wrote code that is SMP safe without using spinlocks or mutexes, but it's generally too burdensome to write code this way. Not to mention that it needs to be carefully evaluated to be sure it really is safe in all possible scenarios.
Agreed, lockless is cool but error-prone.
rdos wrote:So it is GDT selector 40, mapped to this unique address, that differs between cores. [...] Then you need some way to do this (effectively) every time the processor switches (or potentially switches) to kernel mode, which slows down the code.
But... if you need to load a specific segment with a special value in kernel mode, then your code has the exact same "problem". A problem you are greatly overstating, anyway. Segment operations are slow, but not that slow. I can easily perform this work each time the kernel is entered without any problems in practice.
rdos wrote:In long mode, this might work well, but not in protected mode,
In long mode you have SWAPGS, and in protected mode you have nonzero base addresses, and segmentation logic that actually stacks. In long mode it is a bit more complicated, because SWAPGS is stateless. That said, I am not overly concerned with protected mode.
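For completeness, the long-mode variant boils down to something like this (a sketch only; the MSR numbers are the architectural ones, the struct layout is just an example):

Code:
#include <stdint.h>

struct percpu {
    struct percpu *self;          /* so a single "mov %gs:0" recovers the pointer */
    uint32_t       cpu_id;
    void          *current_task;
};

#define MSR_GS_BASE        0xC0000101u
#define MSR_KERNEL_GS_BASE 0xC0000102u

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile ("wrmsr" : : "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* Boot-time, once per CPU: point GS at this CPU's block while in the kernel.
   The entry stubs then execute SWAPGS on every user/kernel transition, swapping
   GS_BASE with KERNEL_GS_BASE. */
void percpu_install(struct percpu *p)
{
    p->self = p;
    wrmsr(MSR_GS_BASE, (uint64_t)(uintptr_t)p);
    wrmsr(MSR_KERNEL_GS_BASE, 0);     /* value the user side sees after the swap */
}

static inline struct percpu *this_cpu(void)
{
    struct percpu *p;
    __asm__ ("mov %%gs:0, %0" : "=r"(p));   /* reads percpu->self */
    return p;
}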
rdos wrote:and particularly not when a segmented memory model is used that might pass parameters in fs or gs.
So that's one more downside to a segmented memory model, then: diminishing your options for storing a thread pointer. My god, how glad am I that I don't have to deal with segmentation in earnest.
Barry wrote:The local APIC timer interrupts the APs but not the BSP. Is this normal (do I need to use the PIT for the BSP)?
No. No, this is not normal; it ought to work. Is the interrupt flag set on the BSP as well? Is the timer masked out in the LAPIC?
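Both are cheap to check at runtime, e.g. (a sketch; xAPIC MMIO assumed, and kprintf stands in for whatever your kernel uses to print):

Code:
#include <stdint.h>

extern void kprintf(const char *fmt, ...);   /* hypothetical kernel print routine */

void check_timer_setup(void)
{
    uint32_t lvt = *(volatile uint32_t *)0xFEE00320;   /* LVT timer register */
    if (lvt & (1u << 16))
        kprintf("LVT timer is masked, vector=%u\n", lvt & 0xFFu);

    unsigned long flags;
    __asm__ volatile ("pushf; pop %0" : "=r"(flags));
    if (!(flags & (1ul << 9)))                          /* EFLAGS.IF */
        kprintf("interrupts are disabled on this CPU\n");
}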
Carpe diem!
Re: Multi-core Programming
nullplan wrote:No. No, this is not normal; it ought to work. Is the interrupt flag set on the BSP as well? Is the timer masked out in the LAPIC?
Turns out the BSP wasn't starting the timer; the routine was only being called by the AP code. Everything seems to be working fine now in terms of sending/receiving interrupts. However, when I switch the page directory it causes the processor to reboot, which is a new development since I started using the APIC. The issue seems to be writing to cr0.
Re: Multi-core Programming
I've figured out the issue I was having: the kernel is page faulting when I try to access the local APIC registers at 0xFEE00000. I just need to either identity map them or move them to lower memory.
Re: Multi-core Programming
Barry wrote:I've figured out the issue I was having: the kernel is page faulting when I try to access the local APIC registers at 0xFEE00000. I just need to either identity map them or move them to lower memory.
You need to map that address. 0xFEE00000 is a physical address, and you need to map it into your address space somehow.
You can move the APIC register window by setting an MSR, but that is rarely necessary or wise. As I said before, I suggest mapping all physical memory from 0xffff800000000000 onward. That place in virtual memory is not needed for anything else, and you can pretty much handle up to 64 TB of physical memory like that, which ought to be enough for your purposes, right? Anyway, this way, translating a physical address to a virtual one is as simple as adding the base address.
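In code, the translation is a one-liner (a sketch; the base constant and helper names are of course up to you):

Code:
#include <stdint.h>

#define PHYS_MAP_BASE 0xffff800000000000ull   /* start of the all-of-RAM window */

static inline void *phys_to_virt(uint64_t phys) { return (void *)(PHYS_MAP_BASE + phys); }
static inline uint64_t virt_to_phys(void *virt) { return (uint64_t)(uintptr_t)virt - PHYS_MAP_BASE; }

/* The LAPIC registers are then simply phys_to_virt(0xFEE00000). */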
Carpe diem!
Re: Multi-core Programming
nullplan wrote:You can move the APIC register window by setting an MSR, but that is rarely necessary or wise. As I said before, I suggest mapping all physical memory from 0xffff800000000000 onward.
That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares.
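In case it helps anyone else, the mapping itself is just one page-table entry; the only subtlety is the cache-disable bit, since MMIO registers must not be cached (a sketch for plain 32-bit, non-PAE paging):

Code:
#include <stdint.h>

#define PTE_PRESENT   0x001u
#define PTE_WRITABLE  0x002u
#define PTE_CACHE_DIS 0x010u   /* PCD: don't cache MMIO */

/* Identity-map one 4 KiB MMIO page (e.g. the LAPIC at 0xFEE00000).
   page_table points at the page table that covers this address. */
static void map_mmio_page(uint32_t *page_table, uint32_t phys)
{
    page_table[(phys >> 12) & 0x3FFu] =
        (phys & ~0xFFFu) | PTE_PRESENT | PTE_WRITABLE | PTE_CACHE_DIS;
    __asm__ volatile ("invlpg (%0)" : : "r"(phys) : "memory");
}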
Everything does seem to be working now however, so thanks for all the help.
Re: Multi-core Programming
nullplan wrote:You need some way to tell CPUs apart in normal code, anyway, so I just use the GS base address for that. And I really don't know any way to do it that doesn't involve having either GS or FS hold a base address in kernel mode, or else using a paging trick to have different data at the same address. But the latter option is fraught with peril, so I just use the former one, and GS is nicer to use because of SWAPGS.
rdos wrote:I've tried many different solutions, but the current one probably is the most effective (and doesn't consume any ordinary registers like fs or gs). I setup my GDT so the first few selectors are at the end of a page boundary, and then map the first part to different memory areas on different CPUs with paging. Thus, the GDT is still shared among cores and can be used for allocating shared descriptors up to the 8k limit. To get to the core area, the CPU will load a specific selector which will point directly to the core area. The core area also contains the linear address and the specific (shared) GDT selector allocated for each core. This is for protected mode, but in long mode the CPU has to load a fixed linear address instead. Which is a bit less flexible.
My methods of telling them apart:
I read the LAPIC's ID if there's an APIC (x2 is an MSR read; even x1 is on chip, and as a device frequently used by mainstream kernels its MMIO path is likely optimized).
If there's no APIC available, I read out TR and do some math; all cores' TSSs are arranged in a consecutive array with a known base and power-of-two size in my case.
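Both boil down to a couple of instructions (a sketch; the MSR and MMIO locations are the architectural ones, while TSS_SEL_BASE is a made-up constant for the selector of the first per-core TSS):

Code:
#include <stdint.h>

#define TSS_SEL_BASE 0x40u    /* hypothetical: selector of CPU 0's TSS in the GDT */

/* x2APIC: the APIC ID is a plain MSR read (IA32_X2APIC_APICID, MSR 0x802). */
static inline uint32_t cpu_id_x2apic(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdmsr" : "=a"(lo), "=d"(hi) : "c"(0x802));
    return lo;
}

/* xAPIC: the ID sits in bits 31:24 of the MMIO ID register. */
static inline uint32_t cpu_id_xapic(void)
{
    return (*(volatile uint32_t *)0xFEE00020) >> 24;
}

/* No APIC: derive the index from TR, assuming one TSS descriptor per CPU,
   allocated consecutively in the GDT (8 bytes per descriptor). */
static inline uint32_t cpu_id_from_tr(void)
{
    uint16_t tr;
    __asm__ ("str %0" : "=r"(tr));
    return (tr - TSS_SEL_BASE) >> 3;
}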
Re: Multi-core Programming
nullplan wrote:You can move the APIC register window by setting an MSR, but that is rarely necessary or wise. As I said before, I suggest mapping all physical memory from 0xffff800000000000 onward.
Barry wrote:That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares. Everything does seem to be working now however, so thanks for all the help.
Was annoyed that 32-bit OSes would "eat" up to 1 GB of my precious RAM 15 years ago, until I found myself in the same shoes as them and you.
Consider taking the full 1 GB from C000_0000 without worrying too much about it page by page; it's probably the best trade-off among a world of different priorities, if both Linux and Windows took the exact same approach (I'm sure they are much more sophisticated than just taking the full 1 GB from C000_0000, but the end result for the user is more or less the same, depending on devices).
Re: Multi-core Programming
xeyes wrote:My methods of telling them apart: I read the LAPIC's ID if there's an APIC (x2 is an MSR read; even x1 is on chip, and as a device frequently used by mainstream kernels its MMIO path is likely optimized).
That would be a possibility with a flat kernel, assuming a fixed APIC physical address, or if you always map the APIC to a fixed linear address. Although it won't work on older systems with no APIC.
xeyes wrote:If there's no APIC available, I read out TR and do some math; all cores' TSSs are arranged in a consecutive array with a known base and power-of-two size in my case.
Won't work for me, since I have a TSS per thread. Although I do use this in user-mode code to identify whether a thread owns a lock or not, which is possible since TR can also be read out in user mode without faulting.
Re: Multi-core Programming
nullplan wrote:You can move the APIC register window by setting an MSR, but that is rarely necessary or wise. As I said before, I suggest mapping all physical memory from 0xffff800000000000 onward.
Barry wrote:That would be good if I had a 64-bit OS, but I'm writing a 32-bit OS, so that's a bit out of my reach. The LAPIC and IOAPIC are only a page each, so it's no great issue where they go; I can find some space in my kernel's memory area for them. Currently I'm just identity mapping them, since 0xF0000000 to the end of memory is reserved in my kernel anyway for zero-copy memory shares. Everything does seem to be working now however, so thanks for all the help.
xeyes wrote:Was annoyed that 32-bit OSes would "eat" up to 1 GB of my precious RAM 15 years ago, until I found myself in the same shoes as them and you. Consider taking the full 1 GB from C000_0000 without worrying too much about it page by page; it's probably the best trade-off among a world of different priorities, if both Linux and Windows took the exact same approach (I'm sure they are much more sophisticated than just taking the full 1 GB from C000_0000, but the end result for the user is more or less the same, depending on devices).
The biggest problem I have with the limits (1 GB) of kernel space is that filesystem buffers are mapped there, which causes a lot of issues. In my new design (which I worked on a lot half a year ago), I store these buffers as physical addresses and only map them into the limited linear address space on demand. I think that solves most of it. The FS drivers also run in their own address spaces, with another 1 GB of shared linear address space per drive.
I've also discovered that I can handle 100 GB of data in my 32-bit application by just requesting the kernel to map the data on a "block" basis. It works very well and probably has a minimal performance hit, provided the application accesses the data in a smart way.
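The application-side pattern is basically a sliding window; a sketch with made-up call names, since the real interface naturally depends on the kernel:

Code:
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (4u << 20)   /* 4 MiB window, example value */

/* Hypothetical kernel calls: map/unmap one block of a large backing object
   into the 32-bit address space on demand. */
void *os_map_block(int handle, uint64_t offset, size_t len);
void  os_unmap_block(void *window);

/* Walk a data set far larger than 4 GiB, one block at a time. */
uint64_t checksum(int handle, uint64_t total_size)
{
    uint64_t sum = 0;
    for (uint64_t off = 0; off < total_size; off += BLOCK_SIZE) {
        uint8_t *window = os_map_block(handle, off, BLOCK_SIZE);
        sum += window[0];                 /* do some work on the mapped window */
        os_unmap_block(window);
    }
    return sum;
}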
So, if you write smart code, not having a 64-bit OS is not that much of an issue. I've dropped the idea of moving to long mode, as it doesn't seem beneficial enough.