Why SMP?

rdos · Post by **rdos** » Wed Dec 11, 2019 12:26 pm

I did much of my OS and scheduler before there were any (reasonable) SMP machines to buy, and it is a lot of hassle to convert a single-core OS to a multicore one. I eventually succeeded, but it was a lot of work, and I truly would have failed without extensive use of SVN. At one point I had hundreds of commits over several months that made the SMP core crash regularly, and I had no idea what I had done wrong. I eventually solved it with merging.

The only really significant difference between an SMP core and a single-core OS is that in an SMP core you must use spinlocks instead of cli/sti for synchronization. So, if you put spinlocks in your code from the start instead of sti/cli, then the move to SMP will be much easier.
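A minimal sketch of the kind of spinlock being described, written with C11 atomics so it can be tried in user space. The names `spinlock_t`, `spin_lock`, `spin_trylock`, `spin_unlock` and `counter_increment` are made up for illustration and are not taken from rdos's kernel. Note that a lock shared with an interrupt handler would still need interrupts disabled on the local core while it is held; that part is not shown here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names for illustration only. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static inline void spin_lock(spinlock_t *l)
{
    /* test_and_set returns the previous value: true means another core holds the lock. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        /* Busy-wait. A real kernel would typically add a CPU pause hint here. */
    }
}

static inline bool spin_trylock(spinlock_t *l)
{
    /* Succeeds only if the flag was clear before this call. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static inline void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Example use: a counter that several cores may update. */
static spinlock_t counter_lock = SPINLOCK_INIT;
static unsigned long shared_counter;

unsigned long counter_increment(void)
{
    spin_lock(&counter_lock);
    unsigned long v = ++shared_counter;   /* critical section */
    spin_unlock(&counter_lock);
    return v;
}
```

Unlike cli/sti, this keeps other cores out of the critical section, which is why code written against such an interface from the start should carry over to SMP with fewer changes.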

And I think the main motivation for making your OS able to run on multiple cores is that it is a really cool feature to have code run truly in parallel.

I would also like to one day create some kind of monitor / hard real-time extension that would run on a dedicated core. It could be used if I ever get those fast AD converters that can run at sample frequencies of up to several hundred MHz, and so would need to be read regularly in real time.

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Dec 11, 2019 3:10 pm

rdos wrote: The only really significant difference between an SMP core and a single-core OS is that in an SMP core you must use spinlocks instead of cli/sti for synchronization. So, if you put spinlocks in your code from the start instead of sti/cli, then the move to SMP will be much easier.

Fair point. There are also some 'lockless' synchronization models which work for multicore/multiprocessor systems, but they all have various requirements or limitations, and none of them are as straightforward as spinlocks IIUC - they would require a much more radical departure when starting from a lock-based single-core system than spinlocks would.

Craze Frog · Post by **Craze Frog** » Wed Dec 11, 2019 4:56 pm

Adding more cores is like adding more carriages to a train. The train doesn't go faster, in fact it could even go a bit slower, but it can carry twice the amount of cargo in nearly the same time.

Of course, it is required that the cargo can be divided into parts that don't touch each other, because it goes into separate carriages.

Korona · Post by **Korona** » Wed Dec 11, 2019 10:54 pm

Lock-free data structures are not a general replacement for locks. You can implement certain abstract data types in a lock-free way but lock-free operations do not compose (i.e., the composition is not atomic anymore).
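To illustrate the composition point, here is a small sketch in C11. Each of the two updates is atomic on its own, but the pair is not: anything that looks between the two steps sees a state in which the amount is in neither place. The `between_steps` callback is a hypothetical stand-in for "another core reads the data right now", and all names are invented for this example.

```c
#include <stdatomic.h>

static atomic_long account_a = 100;
static atomic_long account_b = 0;

/* Hypothetical hook that simulates another core observing the data. */
typedef void (*observer_fn)(void);

void transfer(long amount, observer_fn between_steps)
{
    atomic_fetch_sub(&account_a, amount);   /* atomic on its own */
    if (between_steps)
        between_steps();                    /* window in which another core may look */
    atomic_fetch_add(&account_b, amount);   /* atomic on its own */
}

long total(void)
{
    return atomic_load(&account_a) + atomic_load(&account_b);
}

/* Records what an observer would have seen in the window. */
static long observed_total;

void record_total(void)
{
    observed_total = total();
}
```

Calling `transfer(30, record_total)` leaves the final total at 100, but the total recorded in the window is 70. Making the whole transfer appear atomic needs either a lock around both steps or a purpose-built data structure.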

rdos · Post by **rdos** » Thu Dec 12, 2019 2:05 am

I would recommend the use of lock-free physical memory allocation. It's not easy, but since pagefaults that potentially will need physical memory allocation can happen almost everywhere (unless you use extensive protective methods), this will make that code a lot easier in multicore systems.
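The post does not say how this allocator is implemented, so the following is only one possible shape: a bitmap with one bit per physical page, claimed with compare-and-swap. Because no lock is ever held, it can be called from a page fault handler regardless of what the interrupted code was doing. The pool size and the names `pmem_init`, `pmem_alloc` and `pmem_free` are assumptions for illustration, and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PAGE_COUNT 256u                 /* assumed tiny pool, for illustration */
#define WORDS      (PAGE_COUNT / 64u)
#define NO_PAGE    UINT32_MAX

/* One bit per physical page: 1 = free, 0 = in use. */
static _Atomic uint64_t free_map[WORDS];

void pmem_init(void)
{
    for (uint32_t w = 0; w < WORDS; w++)
        atomic_store(&free_map[w], UINT64_MAX);
}

/* Returns a page index, or NO_PAGE if the pool is empty. Never takes a lock. */
uint32_t pmem_alloc(void)
{
    for (uint32_t w = 0; w < WORDS; w++) {
        uint64_t old = atomic_load(&free_map[w]);
        while (old != 0) {
            uint32_t bit = (uint32_t)__builtin_ctzll(old);   /* lowest free bit */
            uint64_t desired = old & ~(UINT64_C(1) << bit);
            /* On failure, 'old' is refreshed with the current word and we retry. */
            if (atomic_compare_exchange_weak(&free_map[w], &old, desired))
                return w * 64u + bit;
        }
    }
    return NO_PAGE;
}

void pmem_free(uint32_t page)
{
    /* Setting the bit again marks the page as free. */
    atomic_fetch_or(&free_map[page / 64u], UINT64_C(1) << (page % 64u));
}
```

A bitmap is used here instead of a lock-free linked list because it avoids the ABA problem that a naive CAS-based free list runs into.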

LtG · Post by **LtG** » Thu Dec 12, 2019 3:33 pm

rdos wrote:I would recommend the use of lock-free physical memory allocation. It's not easy, but since pagefaults that potentially will need physical memory allocation can happen almost everywhere (unless you use extensive protective methods), this will make that code a lot easier in multicore systems.

What do you mean "everywhere"? As in kernel? That's easy, don't let the kernel use demand paged memory.

As for lock-free physical memory allocation, that one I agree with. Though I wouldn't use lockless in the general sense, but rather divide physical memory for each core and let each core do their own PMEM allocation. If one runs out, rebalance, which should be rare and it's ok to take a tiny hit when that happens, instead of a tiny hit with every allocation.

rdos · Post by **rdos** » Thu Dec 12, 2019 3:47 pm

LtG wrote:
rdos wrote:I would recommend the use of lock-free physical memory allocation. It's not easy, but since pagefaults that potentially will need physical memory allocation can happen almost everywhere (unless you use extensive protective methods), this will make that code a lot easier in multicore systems.
What do you mean "everywhere"? As in kernel? That's easy, don't let the kernel use demand paged memory.

That means you need to validate all userdata that is passed to kernel for presence, which results in poor performance. I rely on the page fault handler in those situations instead.

LtG wrote:As for lock-free physical memory allocation, that one I agree with. Though I wouldn't use lockless in the general sense, but rather divide physical memory for each core and let each core do their own PMEM allocation. If one runs out, rebalance, which should be rare and it's ok to take a tiny hit when that happens, instead of a tiny hit with every allocation.

My HRT data acquisition tool might allocate 100GB on a 128GB machine.

eekee · Post by **eekee** » Fri Dec 13, 2019 6:02 am

Interesting thread. I shall have to brush up on spinlocks.

LtG wrote:...divide physical memory for each core and let each core do their own PMEM allocation. If one runs out, rebalance, which should be rare and it's ok to take a tiny hit when that happens, instead of a tiny hit with every allocation.

What happens when threads/processes want to share memory? Do other cores get to look into the memory owned by the core where the shared region was allocated?

LtG · Post by **LtG** » Fri Dec 13, 2019 9:28 am

rdos wrote: That means you need to validate all userdata that is passed to kernel for presence, which results in poor performance. I rely on the page fault handler in those situations instead.

Do you mean a syscall where the argument is a struct pointing to ten different places in memory, and thus you would have to validate presence to ensure that a page fault is not triggered in kernel land?

If so, then that's not an issue for me, as I don't do complex kernel operations. For those that need such structs to be passed to kernel then that would be a problem.

rdos wrote:
LtG wrote:As for lock-free physical memory allocation, that one I agree with. Though I wouldn't use lockless in the general sense, but rather divide physical memory for each core and let each core do their own PMEM allocation. If one runs out, rebalance, which should be rare and it's ok to take a tiny hit when that happens, instead of a tiny hit with every allocation.
My HRT data acquisition tool might allocate 100GB on a 128GB machine.

Shouldn't be a problem =)

LtG · Post by **LtG** » Fri Dec 13, 2019 9:34 am

eekee wrote:
LtG wrote:...divide physical memory for each core and let each core do their own PMEM allocation. If one runs out, rebalance, which should be rare and it's ok to take a tiny hit when that happens, instead of a tiny hit with every allocation.
What happens when threads/processes want to share memory? Do other cores get to look into the memory owned by the core where the shared region was allocated?

My proposal doesn't change any of that. I'm talking about ownership of the region of memory while it's owned by the kernel, so when it's free (or disk cache, which is pretty much the same thing).

So when PMEM is allocated, each core can simply consult its own free PMEM list; this is the happy path. Only when a core runs out of its own free PMEM list does it need to do synchronization, so it can "steal" free PMEM from other cores. It doesn't need to do full re-balancing either; it can just steal from one core.
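A rough sketch of the per-core list with stealing, under some loud assumptions: the core count, pool size and all names are invented, and each pool has its own small lock that the owning core also takes. On the happy path that lock is uncontended, so it is cheap, but it is not the same as no synchronization at all. A fast path with truly no synchronization would need the owning core itself to hand pages over (for example on request from another core), which is not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NCORES    4u            /* assumed core count */
#define MAX_PAGES 64u           /* assumed per-core capacity */
#define NO_PAGE   UINT32_MAX

struct core_pool {
    atomic_flag lock;               /* contended only when another core steals */
    uint32_t    pages[MAX_PAGES];   /* stack of free page numbers */
    uint32_t    count;
};

static struct core_pool pools[NCORES];

void pmem_pools_init(void)
{
    for (uint32_t c = 0; c < NCORES; c++) {
        atomic_flag_clear(&pools[c].lock);
        pools[c].count = 0;
    }
}

static void pool_lock(struct core_pool *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ;   /* spin */
}

static void pool_unlock(struct core_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Pop one page from a pool, or return NO_PAGE if it is empty. */
static uint32_t pool_pop(struct core_pool *p)
{
    uint32_t page = NO_PAGE;
    pool_lock(p);
    if (p->count > 0)
        page = p->pages[--p->count];
    pool_unlock(p);
    return page;
}

/* Return a page to the given core's list (dropped if the list is full, to keep the sketch short). */
void pmem_free_on(uint32_t core, uint32_t page)
{
    struct core_pool *p = &pools[core];
    pool_lock(p);
    if (p->count < MAX_PAGES)
        p->pages[p->count++] = page;
    pool_unlock(p);
}

uint32_t pmem_alloc_on(uint32_t core)
{
    uint32_t page = pool_pop(&pools[core]);             /* happy path: own list */
    for (uint32_t i = 1; page == NO_PAGE && i < NCORES; i++)
        page = pool_pop(&pools[(core + i) % NCORES]);   /* steal from one other core */
    return page;
}
```

The steal loop stops at the first core that has a free page, which matches the idea of stealing from one core instead of doing a full rebalance.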

I go a bit further, I treat each core separately, each having their own "kernel", instead of one kernel being called in all cores. The free PMEM list is essentially just a consequence of that.

rdos · Post by **rdos** » Fri Dec 13, 2019 2:24 pm

LtG wrote: Do you mean a syscall where the argument is a struct pointing to ten different places in memory, and thus you would have to validate presence to ensure that a page fault is not triggered in kernel land?

If so, then that's not an issue for me, as I don't do complex kernel operations. For those that need such structs to be passed to kernel then that would be a problem.

No, I've forbidden both structs and enums in syscalls.

I also forbid the implementation of ioctl.

Still, if you pass a large string or data buffer to something like a filesystem driver it will need to copy it to or from some kernel buffer, and that operation could trigger demand paging. Or if you pass audio-data to the audio driver. Of course, if you want to pass it directly to a buffer ring of some device, then you will need to get the physical pages anyway, and so validation will be relatively cheap.

loonie · Post by **loonie** » Sun Dec 15, 2019 2:23 pm

rdos wrote: The only really significant difference between an SMP core and a single-core OS is that in an SMP core you must use spinlocks instead of cli/sti for synchronization. So, if you put spinlocks in your code from the start instead of sti/cli, then the move to SMP will be much easier.

Exactly. Plus, one of the bugs that is easier to catch on a single CPU with spinlocks is when ring 3 does a syscall that accesses a buffer shared with a device driver, and all of a sudden an interrupt comes in whose handler "exits" and does "sti" to do the heavy task of populating the shared buffer with events, and then it waits until the syscall releases the lock.
On multiple CPUs this could simply manifest itself as slowdowns, which are harder and harder to catch on modern super-fast CPUs (a single x64 core is something like 5-10 times faster than in 2001, I think).

OSDev.org
