memory management with multiple cores

acccidiccc · Post by **acccidiccc** » Sun Jun 12, 2022 10:46 am

Hello, I was wondering how memory management with multiple cores would work.
My guess is that you have one master CPU which keeps track of the others.
So the master CPU tells the slaves:

1. the address of the page directory of the process
2. the address that the CPU needs to start executing

Assuming this is correct, how do the Master and Slave communicate?
over interrupts (LAPIC, IOAPIC)?


nullplan · Post by **nullplan** » Sun Jun 12, 2022 12:45 pm

acccidiccc wrote:My guess is that you have one master CPU which keeps track of the others.

Whatever for? Far better to manage your memory in such a way that any CPU can just call up the manager and request memory. The easiest (though not fastest) option for that is to just use a big allocator lock. Whenever memory is being allocated, you take a lock, look up the address in whatever data structure you have, mark it as used, and release the lock. While not the fastest option, it is still easier than doing what you wanted to do. No need to limit the allocator to a single core.
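To make the "big allocator lock" idea concrete, here is a minimal sketch in C11. It assumes a tiny fixed pool of frames tracked in a plain array; the names `frame_alloc`, `frame_free`, `NUM_FRAMES` and `NO_FRAME` are hypothetical and do not come from any post in this thread. A real kernel would probably use a denser data structure and add a pause hint inside the spin loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_FRAMES 64        /* hypothetical: number of physical frames tracked */
#define NO_FRAME   SIZE_MAX  /* returned when no frame is free */

static atomic_flag alloc_lock = ATOMIC_FLAG_INIT; /* the one "big" lock */
static bool frame_used[NUM_FRAMES];               /* protected by alloc_lock */

static void lock_acquire(void) {
    /* Spin until the flag was previously clear, i.e. we became the owner. */
    while (atomic_flag_test_and_set_explicit(&alloc_lock, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void lock_release(void) {
    atomic_flag_clear_explicit(&alloc_lock, memory_order_release);
}

/* Any CPU may call this. Returns a frame index, or NO_FRAME if none is free. */
size_t frame_alloc(void) {
    size_t result = NO_FRAME;
    lock_acquire();
    for (size_t i = 0; i < NUM_FRAMES; i++) {
        if (!frame_used[i]) {
            frame_used[i] = true;  /* mark as used while holding the lock */
            result = i;
            break;
        }
    }
    lock_release();
    return result;
}

/* Any CPU may call this to give a frame back. */
void frame_free(size_t frame) {
    assert(frame < NUM_FRAMES);
    lock_acquire();
    frame_used[frame] = false;
    lock_release();
}
```

Every CPU calls the same two functions, so no core is special; the lock simply serializes access to `frame_used`.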

nexos · Post by **nexos** » Sun Jun 12, 2022 6:50 pm

No, that leads to asymmetric (i.e., your OS will be AMP not SMP) treatment of the CPUs, which will turn into a bottleneck. Instead, I would start out by having global memory management structures (i.e., your physical memory slab / bitmap / free list, and your virtual memory structures) that all CPUs work with equally, as if there was one CPU.

One catch though: these global structures will need to have locks. If you haven't looked into locking, now's the time to do it - it comes up everywhere in multi-processor systems.

The other option would be to have per-CPU memory management structures - that would be lockless (which is faster), but would be a pain to implement right, as memory is typically a global resource, not a per-CPU resource.

One great guide to memory management is on the wiki: https://wiki.osdev.org/Brendan%27s_Memo ... ment_Guide

Barry · Post by **Barry** » Mon Jun 13, 2022 12:31 am

As the others have said, you effectively have to ensure that only one processor is accessing the structures at any one time. You can do this either by having a separate structure for each processor, or by using locks. Locks are the way most operating systems handle it. When it comes to locks, you need to strike a balance: make them fine-grained enough that contention stays low - a fast memory allocator will also help with this. If done right, there should be no need to communicate anything to the other processors; they just need to be able to check the lock.

In terms of virtual memory management, each processor should have a different MMU, and as such can have a different virtual address space. There can be some troubles if multiple processors are using the same address space, since they'll still be using separate TLBs, so any invalidated pages will need to be communicated to all processors via an IPI. Handling this is up to you and how you design your kernel, you can use IPIs per page, use an IPI to clear the entire TLB, or schedule selectively to stop processors sharing address spaces. I just thought I'd throw this in since all other posts have been about physical memory management.

acccidiccc wrote:

1. the address of the page directory of the process
2. the address that the CPU needs to start executing

It seems like you're confusing the job of the Memory Manager with the job of the Scheduler. These two things should be stored in the relevant Process Control Block. Each processor should just be able to ask the scheduler for another task, and the scheduler should switch that processor to the next (currently not running) task. This job includes saving the current task's registers and updated information, then loading the new task's address space and jumping to the new task's saved instruction pointer. The Memory Manager should just be there to allocate memory to tasks that need it.

Thanks,
Barry

acccidiccc · Post by **acccidiccc** » Mon Jun 13, 2022 9:40 am

Thanks for the answers.
So instead of having some master-slave setup, I have some global structure that implements locking (it would be the same for the scheduler).
So when a core allocates a page, it sets some lock variable to true, and all other cores have to poll, or wait for some IPI, before they can allocate memory?

Barry wrote:In terms of virtual memory management, each processor should have a different MMU, and as such can have a different virtual address space. There can be some troubles if multiple processors are using the same address space, since they'll still be using separate TLBs, so any invalidated pages will need to be communicated to all processors via an IPI. Handling this is up to you and how you design your kernel, you can use IPIs per page, use an IPI to clear the entire TLB, or schedule selectively to stop processors sharing address spaces. I just thought I'd throw this in since all other posts have been about physical memory management.

Thanks for throwing this in. The cores would use the same address space when multithreading, right?
So upon invalidating (what do you mean by that?) the pages you need to invalidate the TLB to clear it. So if the cores share a page directory, and new pages get allocated, the TLBs of the cores in question need to be updated to prevent invalid entries.

nexos wrote:One great guide to memory management is on the wiki: https://wiki.osdev.org/Brendan%27s_Memo ... ment_Guide

thanks for this link.

Barry · Post by **Barry** » Mon Jun 13, 2022 10:39 am

acccidiccc wrote:Thanks for the answers.
So instead of having some master-slave setup, I have some global structure that implements locking (it would be the same for the scheduler).
So when a core allocates a page, it sets some lock variable to true, and all other cores have to poll, or wait for some IPI, before they can allocate memory?

Look into spinlocks, and some other synchronisation primitives - you need them when you have a multi-processor system.
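As one illustration of such a primitive, below is a sketch of a ticket spinlock in C11. Waiting cores do poll, as the question guessed, but they poll a shared variable in memory rather than waiting for an IPI. The type and function names (`ticket_lock_t`, `ticket_lock_acquire`, and so on) are made up for this example, and the ticket lock is just one possible spinlock design, chosen here because it serves waiters in arrival order.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket number currently allowed in */
} ticket_lock_t;

static inline void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static inline void ticket_lock_acquire(ticket_lock_t *l) {
    /* Take a ticket atomically; each caller gets a distinct number. */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* Poll until it is our turn. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me) {
        /* busy-wait; a kernel would typically add a pause hint here */
    }
}

static inline void ticket_lock_release(ticket_lock_t *l) {
    unsigned serving = atomic_load_explicit(&l->now_serving,
                                            memory_order_relaxed);
    /* Sanity check: releasing a lock nobody holds is a bug. */
    assert(serving != atomic_load_explicit(&l->next_ticket,
                                           memory_order_relaxed));
    /* Hand the lock to the next ticket holder, if any. */
    atomic_store_explicit(&l->now_serving, serving + 1, memory_order_release);
}
```

A core would wrap each access to a shared structure in `ticket_lock_acquire` / `ticket_lock_release`; in a kernel you would usually also disable interrupts on the local core while holding the lock.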

acccidiccc wrote:Thanks for throwing this in. The cores would use the same address space when multithreading, right?
So upon invalidating (what do you mean by that?) the pages you need to invalidate the TLB to clear it. So if the cores share a page directory, and new pages get allocated, the TLBs of the cores in question need to be updated to prevent invalid entries.

Yes, multi-threading will cause that, but only if multiple threads of the same process are being run at the same time by different processors.
When a processor wants to translate a virtual address to a physical address, it checks the TLB first, if it can't find it there, then it checks the page tables. If you update the page tables, e.g. changing a page's frame or removing it, then you have to also remove that entry in the TLB so that the change is visible to the processor. This is easy on the processor that made the change, since you already know which entry to invalidate, but you have to communicate this to the other processors that are sharing the address space too, or they'll keep using their cached version of the mapping.
A very easy solution is just to have an IPI that clears the entire TLB on the processor that receives it; notifying the other processors like this is called a TLB shootdown. The same kind of flush happens when you switch address spaces.
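The stale-entry problem can be shown with a small user-space simulation. This is a toy model only: the "TLB" is an ordinary array per CPU, and the "IPI" is a direct function call, whereas a real kernel would use `invlpg` or a CR3 reload locally and send an actual inter-processor interrupt. All names here (`translate`, `remap_no_shootdown`, `remap_with_shootdown`) are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CPUS  2
#define NUM_PAGES 8   /* virtual pages in the toy address space */

/* Shared "page table": virtual page -> physical frame. */
static uint32_t page_table[NUM_PAGES];

/* Per-CPU "TLB": a cached copy of translations, one slot per page. */
typedef struct {
    bool     valid[NUM_PAGES];
    uint32_t frame[NUM_PAGES];
} tlb_t;
static tlb_t tlb[NUM_CPUS];

/* Translate as described above: TLB first, page table on a miss. */
uint32_t translate(int cpu, uint32_t vpage) {
    assert(cpu >= 0 && cpu < NUM_CPUS && vpage < NUM_PAGES);
    if (!tlb[cpu].valid[vpage]) {
        tlb[cpu].frame[vpage] = page_table[vpage];  /* "page walk" */
        tlb[cpu].valid[vpage] = true;
    }
    return tlb[cpu].frame[vpage];
}

/* Stand-in for the IPI handler: drop the whole TLB of one CPU. */
static void tlb_flush_all(int cpu) {
    for (uint32_t p = 0; p < NUM_PAGES; p++)
        tlb[cpu].valid[p] = false;
}

/* Change a mapping and invalidate only locally: other CPUs may go stale. */
void remap_no_shootdown(int cpu, uint32_t vpage, uint32_t new_frame) {
    page_table[vpage] = new_frame;
    tlb[cpu].valid[vpage] = false;  /* local invalidation of one entry */
}

/* Change a mapping, then "send the IPI" to every other CPU. */
void remap_with_shootdown(int cpu, uint32_t vpage, uint32_t new_frame) {
    remap_no_shootdown(cpu, vpage, new_frame);
    for (int other = 0; other < NUM_CPUS; other++)
        if (other != cpu)
            tlb_flush_all(other);  /* real code: send IPI and wait for ack */
}
```

After `remap_no_shootdown`, the other CPU still translates through its cached entry and sees the old frame; after `remap_with_shootdown`, both CPUs agree again.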

Thanks,
Barry

rdos · Post by **rdos** » Mon Jun 13, 2022 4:07 pm

You shouldn't think of cores as things that run code, rather as resources your scheduler can use to get things done. The scheduler should decide which threads are allocated to which cores and thus what the cores are running. So, your "master" is the scheduler, not a specific core.

devc1 · Post by **devc1** » Sat Sep 03, 2022 6:57 pm

I prefer to use locked instructions (most current processors use them automatically) and let every processor access everything as it wants. For example:

AllocatePage:
    for i = 0; i < NumPages; i++
        if page[i].allocated == 0 then
            lock bts page[i].allocated, 0   // atomically sets the bit; CF holds its previous value
            jc NextPage                     // CF = 1: another CPU set the bit between our check and the bts
            return i                        // CF = 0: the page is ours
        NextPage:
    return NoPage
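The same idea can be written in portable C11, where `atomic_fetch_or` plays the role of `lock bts`: it sets the bit and returns the previous value in one atomic step. This is a sketch under assumptions, not devc1's actual code; the bitmap layout and the names `page_alloc`, `page_free`, `NUM_PAGES` and `NO_PAGE` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PAGES     128                        /* hypothetical pool size */
#define BITS_PER_WORD 64
#define NUM_WORDS     (NUM_PAGES / BITS_PER_WORD)
#define NO_PAGE       SIZE_MAX

static _Atomic uint64_t page_bitmap[NUM_WORDS];  /* bit set = page allocated */

/* Lock-free allocation: returns a page index, or NO_PAGE if all are taken. */
size_t page_alloc(void) {
    for (size_t i = 0; i < NUM_PAGES; i++) {
        _Atomic uint64_t *word = &page_bitmap[i / BITS_PER_WORD];
        uint64_t mask = UINT64_C(1) << (i % BITS_PER_WORD);

        /* Cheap non-atomic-RMW check first, as in the pseudocode. */
        if (atomic_load_explicit(word, memory_order_relaxed) & mask)
            continue;

        /* Atomic test-and-set: returns the word as it was before the OR. */
        uint64_t old = atomic_fetch_or_explicit(word, mask,
                                                memory_order_acquire);
        if (!(old & mask))
            return i;  /* bit was clear: the page is ours */
        /* Bit was already set: another CPU won the race, try the next page. */
    }
    return NO_PAGE;
}

/* Clear the page's bit atomically so any CPU can allocate it again. */
void page_free(size_t page) {
    assert(page < NUM_PAGES);
    uint64_t mask = UINT64_C(1) << (page % BITS_PER_WORD);
    atomic_fetch_and_explicit(&page_bitmap[page / BITS_PER_WORD], ~mask,
                              memory_order_release);
}
```

The early load is only an optimization; correctness comes from checking the value returned by the atomic OR, which is what the `jc` does in the assembly version.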

rdos · Post by **rdos** » Sun Sep 04, 2022 2:14 am

devc1 wrote:I prefer to use locked instructions (most current processors use them automatically) and let every processor access everything as it wants. For example:

AllocatePage:
    for i = 0; i < NumPages; i++
        if page[i].allocated == 0 then
            lock bts page[i].allocated, 0   // atomically sets the bit; CF holds its previous value
            jc NextPage                     // CF = 1: another CPU set the bit between our check and the bts
            return i                        // CF = 0: the page is ours
        NextPage:
    return NoPage

Agreed. At least the physical memory manager should be lock-free and rely on atomic operations. That excludes the "linked list" implementation, which must have a semaphore for synchronization purposes. By using bitmaps for physical memory, it is possible to implement physical memory management without semaphores, using only locked instructions. This also means you don't need to map physical memory in the linear address space.

z0rr0 · Post by **z0rr0** » Mon Sep 05, 2022 3:53 am

Hello,

In the Toro unikernel, we allocate memory per core at boot time. Currently the logic is simple: we distribute memory in proportion to the number of cores. Each memory allocator is thus independent of the others, and locked instructions are not required. Together with the cooperative local scheduler, memory allocation does not require any protection at all. This, however, is a particular design aimed at improving the execution of parallel applications. The main idea was to leverage NUMA.
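A rough sketch of what such a boot-time split could look like is shown below. It is not Toro's code: it assumes an equal split of frames between cores and a trivial bump allocator inside each region, and the names `pools_init`, `pool_alloc`, `core_pool_t` and `MAX_CORES` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CORES 8          /* hypothetical upper bound on cores */
#define NO_FRAME  SIZE_MAX   /* returned when a core's region is exhausted */

typedef struct {
    size_t first_frame;  /* first frame owned by this core */
    size_t num_frames;   /* size of the core's private region */
    size_t next_free;    /* bump pointer, only ever touched by the owner */
} core_pool_t;

static core_pool_t pools[MAX_CORES];

/* Split total_frames evenly; the first (total % cores) cores get one extra. */
void pools_init(size_t total_frames, size_t num_cores) {
    assert(num_cores > 0 && num_cores <= MAX_CORES);
    size_t base  = total_frames / num_cores;
    size_t extra = total_frames % num_cores;
    size_t start = 0;
    for (size_t c = 0; c < num_cores; c++) {
        pools[c].first_frame = start;
        pools[c].num_frames  = base + (c < extra ? 1 : 0);
        pools[c].next_free   = 0;
        start += pools[c].num_frames;
    }
}

/* Must be called only by the owning core, so no lock or atomic is needed. */
size_t pool_alloc(size_t core) {
    assert(core < MAX_CORES);
    core_pool_t *p = &pools[core];
    if (p->next_free == p->num_frames)
        return NO_FRAME;  /* this core's share is used up */
    return p->first_frame + p->next_free++;
}
```

The absence of synchronization is only safe because each pool has exactly one user, which matches the combination of per-core allocators and a cooperative local scheduler described above.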

My two cents,

devc1 · Post by **devc1** » Sat Oct 01, 2022 1:44 pm

In my opinion, if you plan to build an OS for servers and users, then this design will not work.
Do you mean splitting 8 GB of memory across 4 CPUs and making each CPU think it has only 2 GB of RAM?

Atomic operations are fast, and you don't need to perform them millions of times in each function of your memory manager.

NUMA means that a processor accesses some memory regions faster than others, but sometimes a program will need more memory than is available nearby, so your memory manager will have to allocate some slower memory (memory that is far from the CPU). You will then be using another CPU's memory, and with a strict split this will cause a problem.

That's why you need to design a memory manager that can share all the memory between CPUs, and you must rely on locked/atomic instructions.

I understand that it will lower performance and needs some cache management and synchronization primitives, but who needs to allocate and free memory a million times per second?

nexos · Post by **nexos** » Sun Oct 02, 2022 3:49 pm

It could work, you'd just need some way to steal free pages from another CPU. Actually, that's not a bad idea.

devc1 · Post by **devc1** » Sun Oct 02, 2022 3:54 pm

And to steal from the other CPU you will need synchronization and atomic operations. Just call that full NUMA-Compatible memory management. There is no such thing called splitting memory between cpus unless you need it for a special purpose like running ultimate performance servers.

nexos · Post by **nexos** » Sun Oct 02, 2022 3:57 pm

devc1 wrote:And to steal from the other CPU you will need synchronization and atomic operations.

Right, but if we access a per-CPU cache of pages 75% of the time and use a lock to steal pages 25%, that's pretty good. Not to mention it would greatly improve NUMA spatial locality, and the steal algorithm could be tuned to grab pages from the physically closest CPU in the system.
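Here is a simplified sketch of a per-CPU page cache with stealing. To keep it obviously correct, every cache has its own small lock that both the owner and any thief take; in the common case the owner's lock is uncontended, which is cheaper than fighting over one global lock but is not the fully lockless fast path hinted at above. The victim order is a plain round-robin stand-in for "physically closest CPU first". All names (`cpu_cache_t`, `cache_put`, `page_alloc`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_CPUS  2
#define CACHE_CAP 8
#define NO_PAGE   ((size_t)-1)

typedef struct {
    atomic_flag lock;          /* taken by the owner and by thieves */
    size_t pages[CACHE_CAP];   /* stack of free page numbers */
    size_t count;
} cpu_cache_t;

static cpu_cache_t caches[NUM_CPUS];

static void cache_lock(cpu_cache_t *c) {
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void cache_unlock(cpu_cache_t *c) {
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

void caches_init(void) {
    for (int i = 0; i < NUM_CPUS; i++) {
        atomic_flag_clear(&caches[i].lock);
        caches[i].count = 0;
    }
}

/* Give a free page to a CPU's cache. Returns 0 on success, -1 if full. */
int cache_put(int cpu, size_t page) {
    assert(cpu >= 0 && cpu < NUM_CPUS);
    cpu_cache_t *c = &caches[cpu];
    int rc = -1;
    cache_lock(c);
    if (c->count < CACHE_CAP) {
        c->pages[c->count++] = page;
        rc = 0;
    }
    cache_unlock(c);
    return rc;
}

/* Pop one page from a cache, or return NO_PAGE if it is empty. */
static size_t cache_take(cpu_cache_t *c) {
    size_t page = NO_PAGE;
    cache_lock(c);
    if (c->count > 0)
        page = c->pages[--c->count];
    cache_unlock(c);
    return page;
}

/* Try the local cache first, then steal from the other CPUs in turn. */
size_t page_alloc(int cpu) {
    assert(cpu >= 0 && cpu < NUM_CPUS);
    size_t page = cache_take(&caches[cpu]);
    for (int d = 1; page == NO_PAGE && d < NUM_CPUS; d++) {
        int victim = (cpu + d) % NUM_CPUS;  /* stand-in for "nearest first" */
        page = cache_take(&caches[victim]);
    }
    return page;
}
```

Because only one cache lock is ever held at a time, two CPUs stealing from each other cannot deadlock.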

devc1 wrote:There is no such thing called splitting memory between cpus unless you need it for a special purpose like running ultimate performance servers.

Sometimes it's worth using unconventional methods to get better results.

Barry · Post by **Barry** » Fri Nov 18, 2022 8:26 pm

DragonFlyBSD does something similar where it gives each CPU a kernel heap. Each CPU can use any page for process memory or whatever, it's just per-CPU for the kernel heap memory AFAIK. This means each CPU can allocate memory without needing to synchronize with the other processors, and can use completely lockless algorithms in the majority of places. If you do need atomicity you can just disable interrupts on that processor. When a block of memory needs to be freed the processor that needs to free it has to send an IPI to the processor that allocated it, unless it is the processor that allocated it. Apparently they've got pretty good results from doing this, and I assume they're using asynchronous IPI communication.

OSDev.org
