Rpi 4, buddy allocator, MMU

Orrexon · Post by **Orrexon** » Sun Jan 31, 2021 5:49 am

Hi! I searched the forum and could not find this particular question. If there is already an answer, I would love to get a link to it.

I am building an OS to act as a platform layer for my game. It is supposed to provide memory, input, sound to the game and run it. Right now I am looking to implement the buddy allocator and later combine it with the slab allocator.

I am wondering whether activating the MMU as in this tutorial: https://github.com/bztsrc/raspi3-tutori ... tualmemory would be a suitable base to build the buddy allocator upon. If not, could somebody please point me in another direction?

The purpose is to implement some allocation function similar to the "malloc" function. Or perhaps in my case I don't need the MMU? (I suppose I'd like to have the MMU activated to support different processes running on the Pi in the future)

I am playing around with the RPi 4, and I am aware there might have been changes in this area since the RPi 3, for which this particular tutorial was written.

Very grateful for any help. I have been programming for some time, and I have recently found out that bare metal programming is the most rewarding form of programming.

fbkr · Post by **fbkr** » Sun Jan 31, 2021 9:05 pm

Enabling the MMU and having a buddy allocator are orthogonal things: you can have either without the other, or both.

However, from my experience with the RPi3, you kinda have to enable the MMU. The reason in particular is that ARMv8 allows unaligned accesses, and your compiler may emit code that assumes this. However, at least on the Raspberry Pi 3, you have to enable the MMU to get unaligned accesses working. I discovered this the very hard way.

So just enable the MMU and set up identity paging if you don't want to spend too much effort on virtual memory stuff.

(Also, if you use atomic instructions, you'll need to enable the caches to get them working.)

bzt · Post by **bzt** » Mon Feb 01, 2021 3:09 am

Hi,

Orrexon wrote:I am wondering if activating the mmu like in this tutorial: https://github.com/bztsrc/raspi3-tutori ... tualmemory , would be suitable to build the buddy allocator upon?

They are independent things.

Orrexon wrote:If not, could somebody please point me in another direction? The purpose is to implement some allocation function similar to the "malloc" function.

It's a bit more complicated than that. You'll need several layers of allocation, see here. I've also written about it here.

Even if you're not aiming for a general purpose system, just a single application, you'll need

a page allocator that keeps track of RAM (often called PMM, physical memory manager)
a virtual memory allocator that keeps track of which pages are mapped where (VMM, virtual memory manager)
and a user space library (which could be a kernel library if you're not planning on user space) that allows allocating arbitrary amounts of memory; this is what we actually call malloc.

Note that the first two layers work in pages (4096 bytes at a time), so it is the library function's job to keep track of smaller amounts, like 32-byte allocations. With identity paging, you can get away without the first two layers (but I would still recommend implementing a PMM). There are many free and open source solutions for the last layer, such as dlmalloc and ptmalloc. I have written one, called bztalloc, as well, but for a game I'd recommend jemalloc. It might be a bit harder to get working on bare metal than the others, but it will pay off in the long run, when your game engine becomes advanced enough to need concurrent threads (one for playing the music, one for calculating the physics, one for handling user input, etc.). Jemalloc is designed to be very efficient in a multithreaded environment; it was originally written as a scalable allocator for FreeBSD's libc.

Orrexon wrote:Or perhaps in my case I don't need the MMU? (I suppose I'd like to have the MMU activated to support different processes running on the Pi in the future)

You definitely need the MMU, even for a monotasking system, just like @fbkr said. Without MMU, you won't have caching (with the MMU off, AArch64 treats every data access as device memory), so memory is going to be slow, and you'll have to stick with strictly aligned accesses. With the MMU, you can set up different caching policies per mapping, and you can access data at any byte address without getting an alignment fault. Furthermore, if you're planning to have multiple processes, then you'll want separate address spaces for each process (not strictly required, but strongly encouraged), and again, for that you'll need the MMU.

Orrexon wrote:I am playing around with Rpi 4 and I am aware there might have been changes regarding this from the Rpi 3 to which this particular tutorial is written..

Nope, the basic concept of virtual memory is the same (and it is the same for all architectures). See here. Of course the bits are not like that on ARM, but everything else is the same. Take a look at the AArch64 paging figure in this post; if you scroll up a bit, you can compare it with the x86 long mode paging and see that the basics are the same.

Orrexon wrote:Very grateful for any help. I have been programming for some time, and I have recently found out that bare metal programming is the most rewarding form of programming.

And the most challenging one too

Cheers,
bzt

Orrexon · Post by **Orrexon** » Mon Feb 01, 2021 6:12 pm

Oh boy, this is so cool!

Ok so I need 3 layers then.

I think I'll stick with the Buddy as a PMM then, as a first go at this. Only because I like the "simplicity" of it.

I guess I need to read up on the VMM. Now I've got loads of material to read. Let me just see if I understand this at a high level:

First the PMM which could be implemented using the Buddy.

Then I need a VMM, which could be the MMU (?) with some help from software to translate the pages to and from the physical address (which I would get from PMM?). This is the most difficult part for me to understand.

Then, to use it in some process, you would add an extra layer, "xmalloc", which is basically an API request (to VMM or PMM?), especially from user space.

bzt wrote:There are many free and open source solutions for the last layer, such as dlmalloc and ptmalloc. I have written one, called bztalloc, as well, but for a game I'd recommend jemalloc.

I need to read more about these as well.

bzt wrote:I have written one, called bztalloc

Unfortunately, I could not see it; I got a 404 "page not found" error.

bzt wrote:Without MMU, you won't have caching

Well then that's settled, I absolutely need caching

bzt wrote:And the most challenging one too

LOL yes indeed

bzt · Post by **bzt** » Mon Feb 01, 2021 8:19 pm

Orrexon wrote:I think I'll stick with the Buddy as a PMM then, as a first go at this.

That's OK. Slab allocators are also popular, as are bitmaps.

Orrexon wrote:I guess I need to read up on the VMM. Now I've got loads of material to read. Let me just see if I understand this at a high level:

First the PMM which could be implemented using the Buddy.

Correct.
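To make the buddy idea a bit more concrete, here is a minimal sketch of the address arithmetic only (no free lists, no splitting or merging). The names `block_size`, `buddy_of` and `order_for` are made up for illustration, and offsets are assumed to be relative to the start of the managed region, which itself is assumed to be aligned to the largest block size.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u  /* 4 KiB pages (assumption, matching the page size used in this thread) */

/* Size in bytes of a block of the given order (order 0 = one page). */
static inline uint64_t block_size(unsigned order)
{
    return (uint64_t)1 << (PAGE_SHIFT + order);
}

/* Offset of the buddy of the block that starts at `offset`.
 * Two buddies differ only in the bit that equals the block size,
 * so flipping that bit moves from one to the other. */
static inline uint64_t buddy_of(uint64_t offset, unsigned order)
{
    return offset ^ block_size(order);
}

/* Smallest order whose block is large enough to hold `bytes`. */
static inline unsigned order_for(uint64_t bytes)
{
    unsigned order = 0;
    while (block_size(order) < bytes)
        order++;
    return order;
}
```

When a block is freed, the allocator would look up `buddy_of(offset, order)`; if that buddy is also free, the two can be merged into one block of `order + 1` starting at the lower of the two offsets.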

Orrexon wrote:Then I need a VMM, which could be the MMU (?)

Well, the MMU is the actual circuit that implements virtual addressing inside the CPU. There's only one active address space at a time (the currently running process's), and that's all the MMU knows about. The purpose of the VMM is to keep track of which physical pages are mapped in which address spaces. They can be shared, swapped out to disk, etc. If you go with identity mapping, you won't need this layer at all (because there'll be only one address space, so everything you have is already in the active page table).

Orrexon wrote:with some help from software to translate the pages to and from the physical address

It is the MMU's job to translate virtual addresses to physical ones. The other way around, physical to virtual, is not possible in general, because a physical page might not be mapped in any address space at all (no virtual address associated), or it could be mapped in several at the same time (multiple virtual addresses). That's why you need a VMM that does the housekeeping of the pages' mappings.

Orrexon wrote:(which I would get from PMM?). This is the most difficult part for me to understand.

Then, to use it in some process, you would add an extra layer, "xmalloc", which is basically an API request (to VMM or PMM?), especially from user space.

Yes I know. I'll try to explain.

your application / kernel calls malloc
the malloc implementation is a library that keeps track of free memory in arbitrary sizes. It tries to satisfy the request on its own if possible.
when it runs out of free space, it calls the VMM and asks for a new free page. This is typically done via a syscall, like brk() or mmap() (but could be a direct function call in the case of a kernel malloc)
then the VMM asks the PMM for a free page, and maps it for the app or the kernel (by writing the newly allocated page's physical address into the process's page tables and flushing the TLB, the MMU's translation cache)
if the PMM can't find any free pages, it either a) crashes the system, b) prints "Out of RAM" and then crashes, or c) uses some very complicated way to figure out which pages are least needed and writes those to disk, making space in memory.
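The call chain above can be sketched in a few lines of C. This is only a toy under loud assumptions: a static array stands in for RAM, the "VMM" does nothing because identity mapping is assumed, and the malloc layer is a simple bump allocator that never frees. All names (`pmm_alloc_page`, `vmm_get_page`, `toy_malloc`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 4u   /* tiny pool so that exhaustion is easy to observe */

/* Stand-in for RAM; on real hardware the PMM would track physical pages. */
static _Alignas(16) unsigned char pool[NUM_PAGES * PAGE_SIZE];
static size_t next_page;   /* PMM state: number of pages handed out so far */

/* PMM layer: hand out the next free page, or NULL when none are left. */
static void *pmm_alloc_page(void)
{
    if (next_page == NUM_PAGES)
        return NULL;
    return &pool[PAGE_SIZE * next_page++];
}

/* VMM layer: with identity mapping there is nothing to map, so just forward.
 * A real VMM would update the page tables and flush the TLB here. */
static void *vmm_get_page(void)
{
    return pmm_alloc_page();
}

/* malloc layer: bump allocator over the current page. */
static unsigned char *heap_cur, *heap_end;

static void *toy_malloc(size_t size)
{
    size = (size + 15u) & ~(size_t)15u;          /* round up to 16 bytes */
    if (size == 0 || size > PAGE_SIZE)
        return NULL;                             /* toy limit: at most one page */
    if (heap_cur == NULL || (size_t)(heap_end - heap_cur) < size) {
        unsigned char *page = vmm_get_page();    /* out of space: ask the VMM */
        if (page == NULL)
            return NULL;                         /* "Out of RAM" */
        heap_cur = page;                         /* rest of the old page is wasted */
        heap_end = page + PAGE_SIZE;
    }
    void *p = heap_cur;
    heap_cur += size;
    return p;
}
```

Small requests are served from the current page without touching the lower layers; only when the page is used up does the request travel down to the VMM and then the PMM.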

Orrexon wrote:I need to read more about these as well.

Believe me, for a game, choose jemalloc.

Orrexon wrote:Unfortunately, I could not see it; I got a 404 "page not found" error.

Oh, I moved from GitHub to GitLab a few years ago. Here it is: https://gitlab.com/bztsrc/bztalloc (my allocator is a compromise between complexity and efficiency: a lot better than dlmalloc, but worse than jemalloc. In return, it is small and easily portable.)

Cheers,
bzt

Orrexon · Post by **Orrexon** » Thu Feb 25, 2021 4:00 pm

Ok I am still trying to get this thing to work

I am not a quitter

I need to know if I have understood this correctly.

This code comes from setting up the MMU (using BZT's code, but I have replaced the physical addresses with the ones I use for the mini UART on my RPi 4, which work without problems when I access them directly. I have double and triple checked and compared those addresses):

MMIO_BASE in the rpi3: 0x3F000000
with the offset: 0x00201000

The one I use as base: 0xFE000000
have also tried legacy: 0x7E000000
both with offset: 0x00215000


// kernel L3
    paging[5*512]=(unsigned long)(MMIO_BASE+0x00201000) |   // physical address
        PT_PAGE |     // map 4k
        PT_AF |       // accessed flag
        PT_NX |       // no execute
        PT_KERNEL |   // privileged
        PT_OSH |      // outer shareable
        PT_DEV;       // device memory

Am I correct to assume that this code is what maps the address to the virtual address accessed in the main function later?

(Here I also tried adjusting the offsets to the ones I use when reading and writing the physical addresses: 0x40 for the IO register and 0x54 for the LSR register.)


#define KERNEL_UART0_DR        ((volatile unsigned int*)0xFFFFFFFFFFE00000)
#define KERNEL_UART0_FR        ((volatile unsigned int*)0xFFFFFFFFFFE00018)

void main()
{
...
...
while(*s) {
        /* wait until we can send */
        do{asm volatile("nop");}while(*KERNEL_UART0_FR&0x20);
        /* write the character to the buffer */
        *KERNEL_UART0_DR=*s++;
    }
...
...
}

If I am correct, then the actual value being set in that element of the "paging" array is actually more like MMIO_BASE+00201287 because of the OR-ed values. How does that actually work?

bzt or anybody?

Or maybe I have misunderstood the whole thing, please tell me where I am going wrong in that case

bzt · Post by **bzt** » Thu Feb 25, 2021 4:27 pm

Orrexon wrote:This code comes from setting up the MMU (using BZT's code, but I have replaced the physical addresses with the ones I use for the mini UART on my RPi 4, which work without problems when I access them directly. I have double and triple checked and compared those addresses)

Yeah, they look OK. I'd suggest using the PL011 chip instead, which gives you much more control over the serial port.

Orrexon wrote:MMIO_BASE in the rpi3: 0x3F000000
with the offset: 0x00201000

Correct. The offset is for the PL011 chip on RPi3.

Orrexon wrote:The one I use as base: 0xFE000000
have also tried legacy: 0x7E000000
both with offset: 0x00215000

About the first one, that's right. About the second one, that's the GPU's address for the peripherals on RPi3. Not sure about the RPi4. You should double check the offset too; I think the MMIO relative offsets are the same on RPi3 and RPi4, but I haven't worked with the RPi lately, so I could be remembering wrong.

Orrexon wrote:
// kernel L3
    paging[5*512]=(unsigned long)(MMIO_BASE+0x00201000) |   // physical address
        PT_PAGE |     // map 4k
        PT_AF |       // accessed flag
        PT_NX |       // no execute
        PT_KERNEL |   // privileged
        PT_OSH |      // outer shareable
        PT_DEV;       // device memory
Am I correct to assume that this code is what maps the address to the virtual address accessed in the main function later?

Yes. The index in the "paging[5*512]" table specifies at which virtual address it's going to be accessible, and the array element's value specifies the physical address and access bits. You must always map device MMIO as outer shareable and nGnRE. (Here PT_DEV is 1, which selects the 2nd attribute, and in mair_el1 that's set as nGnRE.) Here the paging table is set up in such a way that index 5*512 maps to 0xFFFFFFFFFFE00000.
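As a rough illustration of how a virtual address selects a table entry, the sketch below splits an address into its per-level indices. It assumes a 4 KiB granule with three translation levels (a 39-bit address range), which is what I believe the tutorial configures; check the TCR settings in your own copy. The helper names are made up.

```c
#include <stdint.h>

#define ENTRIES_PER_TABLE 512u   /* one 4 KiB table holds 512 eight-byte entries */

/* Assumed layout: 4 KiB granule, 3 levels, 9 index bits per level. */
static inline unsigned l1_index(uint64_t va)
{
    return (unsigned)((va >> 30) & (ENTRIES_PER_TABLE - 1));
}

static inline unsigned l2_index(uint64_t va)
{
    return (unsigned)((va >> 21) & (ENTRIES_PER_TABLE - 1));
}

static inline unsigned l3_index(uint64_t va)
{
    return (unsigned)((va >> 12) & (ENTRIES_PER_TABLE - 1));
}

/* The low 12 bits are not translated; they select the byte within the page. */
static inline unsigned page_offset(uint64_t va)
{
    return (unsigned)(va & 0xFFFu);
}
```

For 0xFFFFFFFFFFE00000 this gives L1 index 511, L2 index 511 and L3 index 0, so the mapping lives in the first entry of the kernel L3 table, which matches paging[5*512] if that table is the sixth 4 KiB page of the array. The register at 0xFFFFFFFFFFE00018 uses the same entry with page offset 0x18.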

Orrexon wrote:If I am correct, then the actual value being set in that element of the "paging" array is actually more like MMIO_BASE+00201287 because of the OR-ed values. How does that actually work?

Correct. I'd recommend reading the DDI0487 ARM spec for the layout of the paging table's bits, but basically you always use a physical address whose least significant bits are zero and whose most significant bits don't count, so that's where the ARM engineers put the access control bits. On the figure above, these are the bits labelled "Sign extend" and "Physical-Page Offset". Those bits are not needed for a page aligned physical address ("PA" inside the tables in the figure). The point is, if you mask the entry to clear the access control bits, you get a pure physical address.
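A small sketch of that OR-and-mask idea follows. The flag values are my assumption of what the tutorial defines (they follow the usual AArch64 stage 1 descriptor layout); verify them against your own source. `make_device_entry` and `entry_phys` are hypothetical helper names.

```c
#include <stdint.h>

/* Assumed flag values; check them against the definitions you actually use. */
#define PT_PAGE   0x3ULL          /* bits [1:0] = 0b11: valid 4 KiB page entry */
#define PT_DEV    (1ULL << 2)     /* attribute index 1 in MAIR (device memory) */
#define PT_KERNEL (0ULL << 6)     /* privileged access only */
#define PT_OSH    (2ULL << 8)     /* outer shareable */
#define PT_AF     (1ULL << 10)    /* accessed flag */
#define PT_NX     (1ULL << 54)    /* no execute */

/* Bits [47:12] carry the physical page address for a 4 KiB page. */
#define PT_ADDR_MASK 0x0000FFFFFFFFF000ULL

/* Combine a page aligned physical address with the attribute bits. */
static inline uint64_t make_device_entry(uint64_t phys)
{
    return (phys & PT_ADDR_MASK) | PT_PAGE | PT_AF | PT_NX
         | PT_KERNEL | PT_OSH | PT_DEV;
}

/* Clear the attribute bits again to recover the pure physical address. */
static inline uint64_t entry_phys(uint64_t entry)
{
    return entry & PT_ADDR_MASK;
}
```

With these assumed values, `make_device_entry(0x3F000000 + 0x00201000)` yields 0x004000003F201607: the low 12 bits (0x607) and bit 54 hold the flags, and `entry_phys` gives back 0x3F201000. The exact flag digits depend on the definitions in use, but the address bits and the flag bits never overlap.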

This PDF is also useful; it lists the page table bits (section 4.5, Memory attributes). It's much easier to read than the ARM spec, though not as detailed.

Cheers,
bzt
