Ah, trampoline code. I've made mine as self-contained as possible. It must be copied into the low 1MB anyway, and being self-contained means multiple CPUs can be initialized at the same time.
Intel recommends two startup IPIs because of old hardware. I kept to an idea of Brendan's from back in the day: a data word inside the trampoline itself is used to communicate readiness to the booting CPU. The BSP can periodically check whether the bit has been set, and can thus perform other tasks instead of busy-waiting for the APs. Here's the simple version:
Code: Select all
.section .rodata,"a"
.code16
.global trampoline
.global trampoline_end
/* assumption: Starts with CS:IP = xx00:0000, where xx is startup IPI number */
trampoline:
jmp real_start
.align 4
cr3val: .long 0 /* CR3 for this core (must be below 4GB) */
kcode: .quad 0 /* where to go after entering long mode */
kstack: .quad 0 /* kernel stack pointer to load before going there */
kgsval: .quad 0 /* GS value */
commword: .long 0 /* communication word */
gdt: .word 0, gdt_end - gdt - 1
.long gdt - trampoline
.quad 0x00cf9a000000ffff
.quad 0x00cf92000000ffff
.quad 0x00af9a000000ffff
.quad 0x00af92000000ffff
gdt_end:
f32ptr: .long prot_start - trampoline, 0x8
f64ptr: .long long_start - trampoline, 0x18
real_start:
/* set DS to avoid CS prefixes all over the place */
movw %cs, %bx
movw %bx, %ds
/* signal readiness */
orl $1, commword - trampoline
/* no wait here (no need for it) */
/* relocate pointers */
shll $4, %ebx /* base address in ebx */
addl %ebx, gdt - trampoline + 4 /* absolute GDT base address */
addl %ebx, f32ptr - trampoline /* absolute protected mode base address */
addl %ebx, f64ptr - trampoline /* absolute long mode base address */
lidt gdt - trampoline /* load 0-length IDT to crash this processor should anything happen */
lgdt gdt - trampoline + 2 /* GDT pointer is folded into first GDT entry. */
/* enable protected mode */
movl %cr0, %eax
btsl $0, %eax
movl %eax, %cr0
/* jump to 32-bit protected mode */
ljmpl *(f32ptr - trampoline)
.code32
prot_start:
/* initialize data segment registers */
movw $0x10, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
movw %ax, %ss
leal 0x1000(%ebx), %esp /* also a stack at the end of the trampoline. */
pushl $0 /* clear all flags */
popfl
/* Enter long mode: Enable PAE */
movl %cr4, %eax
btsl $5, %eax
movl %eax, %cr4
/* Load CR3 */
movl cr3val - trampoline(%ebx), %eax
movl %eax, %cr3
/* Enable long mode */
movl $0xc0000080, %ecx /* IA32_EFER */
rdmsr
btsl $8, %eax
wrmsr
/* Enable paging */
movl %cr0, %eax
btsl $31, %eax
movl %eax, %cr0
/* jump to long mode */
ljmpl *f64ptr - trampoline(%ebx)
.code64
long_start:
/* initialize data segments again (just to be sure, it probably can't hurt) */
movw $0x20, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
movw %ax, %ss
/* clear upper half of RBX (any 32-bit op that writes %ebx zero-extends into %rbx) */
orl %ebx, %ebx
/* load GS base */
movl $0xc0000101, %ecx
movl kgsval - trampoline(%rbx), %eax
movl kgsval - trampoline + 4(%rbx), %edx
wrmsr
/* Load kernel stack */
movq kstack - trampoline(%rbx), %rsp
xorl %ebp, %ebp
movq kcode - trampoline(%rbx), %rax
movq %rbx, %rdi /* pass trampoline base as first arg. */
callq *%rax
1:
cli
hlt
jmp 1b
trampoline_end:
The debug version additionally has exception handlers for all three modes, telling the BSP when an exception has happened, where, and in what mode. That is a lot of repetitive code, and not always worth it, since the mode change and the IDT change cannot be made atomic. Anyway, this trampoline is self-contained, so anyone should be able to use it. Yes, it is tailored to my needs, but it should be easy to expand. In my case the kernel is in C, so after setting up a stack, nothing more is needed to call into C code. That routine is not expected to ever return, but I added a safety net just in case. The run-time memory allocation for the trampoline is 4 kB anyway, since startup vectors can only be placed at 4 kB-aligned addresses.
The start-up code on the BSP is quite simple: Allocate three pages for the kernel stack and map them contiguously in kernel space. Calculate the address of a struct cpu at the top of that range and initialize it. Allocate a page in the 1 MB zone and copy the trampoline there. Allocate a page in the 4 GB zone, copy the PML4 there, and ensure its first entry equals its 128th entry (this is because the first half of kernel space is a linear mapping of all RAM, so this ensures, in the simplest possible way, that the trampoline will be identity mapped when it runs). Fill in the data at the start of the trampoline (the kernel GS will be the base of the struct cpu, the kernel stack will be the same except aligned down to 16 bytes, and the kernel code will be the address of a noreturn function).
Then send an INIT IPI, wait a little (the spec says how long), send a startup IPI, and wait for the commword to become 1. If this times out, send a second startup IPI and yield to the scheduler. After a long timeout (several seconds), if the CPU still has not set the commword, send another INIT IPI (in case the CPU did start running and is running amok somewhere else), free all the memory, and log a failure. As soon as the commword is observed to be 1, all of that memory (the kernel stack, the other CPU's PML4, and the trampoline) belongs to the other CPU.
The AP landing pad will then load the real GDT, IDT, and TSS, initialize CR0 and CR4 to their final values, load whatever MSRs are still needed, free the trampoline page (low memory is precious, after all), clear the low half of the PML4, announce its presence to the scheduler (which, among other things, involves a fetch-and-add on a global variable to assign the logical CPU number), and then run the scheduler in an infinite loop. Therefore it will never return, as required.
This code is so self-contained that it can be run in multiple threads, on multiple CPUs. I don't necessarily need the BSP to do all the booting. So before reading the MADT, I initialize the scheduler (which is possible after initializing the memory managers), and then I just queue up tasks to start each CPU I find in there. Then the APs can join in starting other APs as soon as they themselves are ready.