Help with UEFI memory map before loading kernel page table

stevefan1999
Posts: 3
Joined: Sun Jan 15, 2023 4:02 am

Help with UEFI memory map before loading kernel page table

Post by stevefan1999 »

I have to create a kernel page table because my kernel resides at 0xFFFFFFFF80000000 (the same base as Linux) as a reference point. This way I can also conveniently use GDB to set breakpoints.

The kernel is compiled with the following Rust target:

Code: Select all

{
  "arch": "x86_64",
  "code-model": "kernel",
  "cpu": "x86-64",
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128",
  "disable-redzone": true,
  "eh-frame-header": true,
  "exe-suffix": ".elf",
  "executables": true,
  "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-3dnow,-3dnowa,-avx,-avx2,+soft-float",
  "linker": "rust-lld",
  "linker-flavor": "ld.lld",
  "llvm-target": "x86_64-unknown-none-elf",
  "max-atomic-width": 64,
  "os": "none",
  "panic-strategy": "unwind",
  "position-independent-executables": true,
  "relro-level": "full",
  "target-c-int-width": "32",
  "target-endian": "little",
  "target-pointer-width": "64"
}
So that's why the kernel looks very clean, with most things at absolute virtual addresses based at 0xFFFFFFFF80000000:

I have also tried to implement virtual memory mapping the way Linux does it, according to the following table:

Code: Select all


========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________
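As a sanity check on my reading of this table, the direct-map arithmetic works out; a small Rust snippet (constant names are mine, not Linux's):

```rust
// Constants taken from the Linux 4-level layout above (names are mine).
const PAGE_OFFSET_BASE: u64 = 0xffff_8880_0000_0000; // direct map start
const DIRECT_MAP_SIZE: u64 = 64 << 40; // 64 TiB
const VMALLOC_BASE: u64 = 0xffff_c900_0000_0000; // vmalloc/ioremap start

fn main() {
    // the direct map ends at ffffc87fffffffff, right before the 0.5 TiB hole
    assert_eq!(PAGE_OFFSET_BASE + DIRECT_MAP_SIZE - 1, 0xffff_c87f_ffff_ffff);
    // direct map + 0.5 TiB hole lands exactly on vmalloc_base
    assert_eq!(PAGE_OFFSET_BASE + DIRECT_MAP_SIZE + (1u64 << 39), VMALLOC_BASE);
}
```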

This is how I create a kernel page table:

Code: Select all

pub fn create_kernel_page_table(
    allocator: &mut impl FrameAllocator<Size4KiB>,
) -> (PhysFrame, OffsetPageTable<'static>) {
    let frame = allocator.allocate_frame().unwrap();

    // UEFI maps vmem with a zero offset.
    let phys_offset = VirtAddr::zero();
    let mut page_table = unsafe {
        let inner =
            &mut *((frame.start_address().as_u64() + phys_offset.as_u64()) as *mut PageTable);
        // A freshly allocated frame is not guaranteed to be zeroed.
        inner.zero();
        OffsetPageTable::new(inner, phys_offset)
    };

    // map first 64TiB of physical memory onto kernel virtual memory
    // 0xffff888000000000
    // so 0xffff888000000000 -> PA 0x0
    // 0xffff888000000004 -> PA 0x4...etc
    {
        let offset = VirtAddr::new(0xffff888000000000);
        let start = PhysFrame::containing_address(PhysAddr::zero());
        let end = PhysFrame::containing_address(PhysAddr::zero() + n_tib_bytes(64));
        for frame in PhysFrame::<Size1GiB>::range_inclusive(start, end) {
            let page = Page::from_start_address(offset + frame.start_address().as_u64()).unwrap();
            unsafe {
                page_table
                    .map_to(
                        page,
                        frame,
                        PageTableFlags::PRESENT
                            | PageTableFlags::WRITABLE
                            | PageTableFlags::HUGE_PAGE,
                        allocator,
                    )
                    .expect("failed to map page")
                    .flush();
            }
        }
    }

    (frame, page_table)
}
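(`n_tib_bytes` is a tiny helper of mine not shown above; a minimal version, assuming it just converts TiB to bytes:)

```rust
/// Number of bytes in `n` TiB (1 TiB = 2^40 bytes).
const fn n_tib_bytes(n: u64) -> u64 {
    n << 40
}

fn main() {
    assert_eq!(n_tib_bytes(1), 1u64 << 40);
    assert_eq!(n_tib_bytes(64), 64 * (1u64 << 40));
}
```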

Code: Select all

#[entry]
unsafe fn main(handle: Handle, mut st: SystemTable<Boot>) -> Status {
    uefi_services::init(&mut st).expect("unable to initialize UEFI services");
    log::set_max_level(LevelFilter::Trace);
    st.stdout().clear().unwrap();

    let bs = st.boot_services();

    let mut alloc = BootFrameAllocator::new(bs);
    let (page_table_frame, mut page_table) = frame_allocator::create_kernel_page_table(&mut alloc);
    trace!("created kernel page table at {:?}", page_table_frame);

    let kernel_elf = unsafe { NonNull::from(load_kernel(bs)).as_ref() };
    let mut loader = KernelLoader::new(kernel_elf, &mut page_table, &mut alloc);
    loader.relocate();

    let args = args
        .map(|args| args.into_boxed_str())
        .map(|args| Box::leak(args) as &'static str);

    let sizes = bs.memory_map_size();
    let mut mmap_storage = vec![0; sizes.map_size + 2 * sizes.entry_size].leak();
    let mut memory_descriptors: Vec<&'static MemoryDescriptor> =
        Vec::with_capacity(sizes.map_size / sizes.entry_size);

    let entry_point: extern "sysv64" fn(&BootData) = unsafe { mem::transmute(loader.entry()) };
    trace!("entry_point = {:p}", entry_point as *const u8);

    st.stdout().clear().unwrap();
    info!("bon voyage!");
    let (system_table, mmap_info) = st.exit_boot_services(handle, mmap_storage).unwrap();
    mmap_info.collect_into(&mut memory_descriptors);

    load_page_table_pointer(page_table_frame);
    entry_point(&BootData {
        magic:           yakern_types::boot::MAGIC.into(),
        kernel_elf_addr: kernel_elf.into(),
        boot_page_table: page_table_frame.start_address().as_u64() as usize,
        kind:            BootDataKind::Uefi(BootDataUefi {
            system_table,
            memory_descriptors: (memory_descriptors.leak() as &'static [_]).into(),
        }),
    });

    arch::drop_dead()
}
Now here's the problem:

Code: Select all

    load_page_table_pointer(page_table_frame);
    entry_point(&BootData {
        magic:           yakern_types::boot::MAGIC.into(),
        kernel_elf_addr: kernel_elf.into(),
        boot_page_table: page_table_frame.start_address().as_u64() as usize,
        kind:            BootDataKind::Uefi(BootDataUefi {
            system_table,
            memory_descriptors: (memory_descriptors.leak() as &'static [_]).into(),
        }),
    });
If I load this kernel page table into CR3, the instruction pointer, which still points into UEFI-owned memory, will clearly be invalid unless I also identity-map that region. This is why it triple-faulted after I called load_page_table_pointer: the CPU tries to fetch the next instruction, fails, and raises a page fault, but the page fault handler was installed by UEFI and its address is now invalid too, so escalating to a triple fault is exactly what should happen.

That's why I added a blanket identity-map hack to create_kernel_page_table, run before load_page_table_pointer is called:

Code: Select all

fn create_kernel_page_table(
    allocator: &mut impl FrameAllocator<Size4KiB>,
) -> (PhysFrame, OffsetPageTable<'static>) {
    let frame = allocator.allocate_frame().unwrap();

    // UEFI maps vmem with a zero offset.
    let phys_offset = VirtAddr::zero();
    let mut page_table = unsafe {
        let inner =
            &mut *((frame.start_address().as_u64() + phys_offset.as_u64()) as *mut PageTable);
        // A freshly allocated frame is not guaranteed to be zeroed.
        inner.zero();
        OffsetPageTable::new(inner, phys_offset)
    };

    // Map the first 127 TiB of physical memory as an identity mapping by default
    {
        let start = PhysFrame::containing_address(PhysAddr::zero());
        let end = PhysFrame::containing_address(PhysAddr::zero() + n_tib_bytes(127));
        for frame in PhysFrame::<Size1GiB>::range_inclusive(start, end) {
            unsafe {
                page_table
                    .identity_map(
                        frame,
                        PageTableFlags::PRESENT
                            | PageTableFlags::WRITABLE
                            | PageTableFlags::HUGE_PAGE,
                        allocator,
                    )
                    .expect("failed to map page")
                    .ignore();
            }
        }
    }
    // map first 64TiB of physical memory onto kernel virtual memory
    // 0xffff888000000000
    // so 0xffff888000000000 -> PA 0x0
    // 0xffff888000000004 -> PA 0x4...etc
    {
        let offset = VirtAddr::new(0xffff888000000000);
        let start = PhysFrame::containing_address(PhysAddr::zero());
        let end = PhysFrame::containing_address(PhysAddr::zero() + n_tib_bytes(64));
        for frame in PhysFrame::<Size1GiB>::range_inclusive(start, end) {
            let page = Page::from_start_address(offset + frame.start_address().as_u64()).unwrap();
            unsafe {
                page_table
                    .map_to(
                        page,
                        frame,
                        PageTableFlags::PRESENT
                            | PageTableFlags::WRITABLE
                            | PageTableFlags::HUGE_PAGE,
                        allocator,
                    )
                    .expect("failed to map page")
                    .flush();
            }
        }
    }

    (frame, page_table)
}
However, I think this approach is awful.

Not only does it limit the kernel to at most 127 TiB of physical memory, but it also seems like I should just map the regions from the original UEFI memory map into my kernel page table; that should be enough without this identity-mapping overkill, which could also crush TLB performance.
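To make that alternative concrete, here is a pure-arithmetic sketch (hypothetical Region type and constant names, no uefi/x86_64 crate calls) of deriving the direct-map pages from the memory map instead of blanket identity mapping:

```rust
/// Hypothetical, simplified view of a UEFI memory descriptor.
struct Region {
    phys_start: u64,
    bytes: u64,
}

const PAGE_OFFSET: u64 = 0xffff_8880_0000_0000; // direct-map base, as in Linux
const GIB: u64 = 1 << 30;

/// For each region, compute the 1 GiB-aligned (virt, phys) pairs the direct
/// map needs, instead of identity-mapping all of physical memory.
fn direct_map_pages(regions: &[Region]) -> Vec<(u64, u64)> {
    let mut pages = Vec::new();
    for r in regions {
        let start = r.phys_start & !(GIB - 1); // round down to 1 GiB
        let end = r.phys_start + r.bytes; // exclusive
        let mut pa = start;
        while pa < end {
            let entry = (PAGE_OFFSET + pa, pa);
            if !pages.contains(&entry) {
                pages.push(entry); // dedup: regions may share a gigabyte
            }
            pa += GIB;
        }
    }
    pages
}

fn main() {
    // two small regions inside the first GiB collapse into one 1 GiB mapping
    let regions = [
        Region { phys_start: 0x1000, bytes: 0x8000 },
        Region { phys_start: 0x10_0000, bytes: 0x20_0000 },
    ];
    assert_eq!(direct_map_pages(&regions), vec![(PAGE_OFFSET, 0)]);
}
```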

What is the standard way of handling these situations in UEFI? More specifically, what is the proper way of doing a kernel page table transition when the original memory map is still needed to some extent? Ideally the result stays as immutable as possible.
Octocontrabass
Member
Member
Posts: 5562
Joined: Mon Mar 25, 2013 7:01 pm

Re: Help with UEFI memory map before loading kernel page table

Post by Octocontrabass »

stevefan1999 wrote:So that's why the kernel looks very clean with most of the things being absolute virtual address on 0xFFFFFFFF80000000
Why does your kernel have a relocation table?
stevefan1999 wrote:What is the standard way of handling these situations in UEFI? More specifically, what is the more proper way of doing a kernel page table transition where the original memory map is still needed to some extent.
Have your bootloader build some page tables that include both the necessary identity-mapped memory and your kernel's higher-half mappings, then exit boot services, then switch to those page tables, then jump to your kernel in the higher half, then have your kernel either remove the identity mappings from the bootloader's page tables or build its own page tables entirely from scratch.

There's no getting out of having a temporary identity-mapped area, but it only needs to be big enough for the code that actually jumps to the higher half. Everything else will already be mapped in the higher half.
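For example, the identity-mapped window can be computed from the trampoline's own address and size; a minimal sketch (hypothetical names, plain arithmetic):

```rust
const PAGE_SIZE: u64 = 4096;

/// Given the address and size of the trampoline code that switches CR3 and
/// jumps to the higher half, return the page-aligned [start, end) range
/// that must be identity-mapped.
fn trampoline_window(trampoline_addr: u64, trampoline_len: u64) -> (u64, u64) {
    let start = trampoline_addr & !(PAGE_SIZE - 1);
    let end = (trampoline_addr + trampoline_len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
    (start, end)
}

fn main() {
    // a 100-byte trampoline straddling a page boundary needs two pages
    let (s, e) = trampoline_window(0x7f_f000 - 32, 100);
    assert_eq!(s, 0x7f_e000);
    assert_eq!(e, 0x80_0000);
}
```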

I can't read Rust very well, but I see you're using large pages. Large pages that cross effective memory type boundaries are undefined behavior, so you will almost certainly need to split some of those into smaller pages.
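In other words, before using a 1 GiB page, check that the whole gigabyte falls inside a single memory-map region (one region means one effective memory type); a language-agnostic sketch of that check, shown in Rust with hypothetical simplified types:

```rust
/// Hypothetical, simplified memory-map entry covering [start, start + bytes).
/// Each UEFI region carries a single memory type, so a page that stays
/// inside one region has a single effective memory type.
struct Region {
    start: u64,
    bytes: u64,
}

const GIB: u64 = 1 << 30;

/// A 1 GiB page at `pa` (1 GiB aligned) is safe only if the whole gigabyte
/// lies inside one region; otherwise split it into smaller pages.
fn gib_page_is_safe(pa: u64, map: &[Region]) -> bool {
    map.iter()
        .any(|r| r.start <= pa && pa + GIB <= r.start + r.bytes)
}

fn main() {
    let map = [
        Region { start: 0, bytes: 2 * GIB },       // e.g. conventional memory
        Region { start: 2 * GIB, bytes: GIB / 2 }, // e.g. MMIO
    ];
    assert!(gib_page_is_safe(0, &map));        // fully inside one region
    assert!(!gib_page_is_safe(2 * GIB, &map)); // spills past the MMIO region
}
```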