
Re: SMP Trampoline Template

Posted: Sun Jan 03, 2021 9:54 am
by bzt
nullplan wrote:The code is copied once for every AP that starts up, and can be located at any page boundary in low memory (due to the limitations of the startup IPI only allowing the thing to start there). This might be wasteful, but it is the only way to duplicate the data section at the start
Ah, I see. I use the same code and data for all the APs, therefore it doesn't matter how many CPU cores there are. My code can use 256 cores ATM, but the BOOTBOOT protocol was designed for 65535 CPUs (with x2APIC), so there's not enough space for per core code in low mem (256*4096 > 640k).
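The arithmetic behind that last remark can be checked with a minimal sketch. It assumes one 4096-byte page per core and treats the whole 640k of conventional memory as usable, which is optimistic; the function names are made up for illustration.

```c
#include <assert.h>

#define PAGE_SIZE      4096u            /* one trampoline copy per page */
#define CONV_MEM_BYTES (640u * 1024u)   /* the 640k figure from the post */

/* Upper bound on page-sized trampoline copies that fit in conventional
 * memory, ignoring everything that already lives there (IVT, BDA, EBDA). */
static unsigned max_per_core_copies(void)
{
    return CONV_MEM_BYTES / PAGE_SIZE;
}

/* Bytes needed if each of `cores` CPUs gets its own page-sized copy. */
static unsigned long long per_core_bytes(unsigned cores)
{
    return (unsigned long long)cores * PAGE_SIZE;
}
```

Under these assumptions at most 160 copies fit, while 256 cores would already need 1 MiB, so a shared trampoline is the only option that scales.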
nullplan wrote:But why? By the time you start the APs, UEFI is long forgotten, and you could just be loading those registers with values that make sense for your OS.
Because I run "bootboot_startcore()" on all cores at once (BSP and APs alike, the last step of the trampoline is to jump there), and that function expects the same environment on the APs that UEFI has set up for the BSP. Plus I don't know what makes sense for an OS, because this is in a boot loader which can load many different kernels (there are at least 4 kernels using BOOTBOOT that I know of, but probably more, and only one of them is written by me). I expect the kernel to set up its GDT and (non-identity) paging tables as soon as possible (they do that anyway), and I provide the same env that firmware has given us.

Cheers,
bzt

Re: SMP Trampoline Template

Posted: Sun Jan 03, 2021 12:21 pm
by rdos
I use the ACPI APIC table to identify how the APIC should be configured and which cores are available and their IDs. This means that APs cannot be started until ACPI is started & the APIC is set up, which occurs in the kernel and not in the boot-loader. I also fail to see the connection with EFI or BIOS boot. The ACPI table is available in both modes and in the same format.
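As a rough illustration of reading core IDs out of that table, here is a minimal C sketch that walks a MADT image already copied into a byte buffer. It only handles type 0 (Processor Local APIC) entries, so x2APIC entries (type 9) are ignored, and the function names are hypothetical rather than taken from any kernel in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MADT_ENTRIES_OFFSET 44u /* 36-byte SDT header + LAPIC address + flags */
#define MADT_TYPE_LAPIC     0u  /* Processor Local APIC entry */
#define LAPIC_FLAG_ENABLED  1u  /* bit 0 of the entry's flags field */

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Stores the APIC IDs of enabled processors in ids[] (at most max of them)
 * and returns how many were stored. Returns 0 if the buffer is not a MADT. */
static size_t madt_collect_apic_ids(const uint8_t *madt, size_t buf_len,
                                    uint8_t *ids, size_t max)
{
    if (buf_len < MADT_ENTRIES_OFFSET || memcmp(madt, "APIC", 4) != 0)
        return 0;

    size_t len = read_le32(madt + 4);   /* table length from the SDT header */
    if (len > buf_len)
        len = buf_len;

    size_t n = 0, off = MADT_ENTRIES_OFFSET;
    while (off + 2 <= len) {
        uint8_t type = madt[off];
        uint8_t elen = madt[off + 1];
        if (elen < 2 || off + elen > len)
            break;                      /* malformed entry, stop walking */
        if (type == MADT_TYPE_LAPIC && elen >= 8 &&
            (read_le32(madt + off + 4) & LAPIC_FLAG_ENABLED)) {
            if (n < max)
                ids[n++] = madt[off + 3];   /* byte 3 of the entry: APIC ID */
        }
        off += elen;
    }
    return n;
}
```

Nothing here depends on how the firmware was entered; the caller only has to locate the table through the RSDP first.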

I also fail to see any reason why multiple cores should be started at the same time. I boot them in turn, which is much simpler. I copy GDT, IDT, CR3, CR4 and other settings to low RAM at fixed locations, and then reference these from the bootstrap code. I also have a kernel setting that limits the system to fewer cores than are available. Setting this to 1 effectively disables multicore operation.
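For readers who think in C, the block of settings copied to low RAM can be pictured as a packed structure. This is only a sketch mirroring the page_struc layout from the listing further down; it assumes the assembler inserts no padding between fields, and the C names are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)               /* no padding, to match the STRUC layout */
struct ap_params {
    uint32_t stack_offset;          /* ap_stack_offset DD ? */
    uint16_t stack_sel;             /* ap_stack_sel    DW ? */
    uint32_t cr0;                   /* ap_cr0          DD ? */
    uint32_t cr3;                   /* ap_cr3          DD ? */
    uint32_t cr4;                   /* ap_cr4          DD ? */
    uint8_t  gdt[6];                /* ap_gdt: 16-bit limit + 32-bit base */
    uint8_t  idt[6];                /* ap_idt: 16-bit limit + 32-bit base */
};
#pragma pack(pop)

_Static_assert(sizeof(struct ap_params) == 30, "layout must stay packed");

/* Writes a 6-byte lgdt/lidt operand (little-endian limit, then base). */
static void put_table_ptr(uint8_t out[6], uint16_t limit, uint32_t base)
{
    out[0] = (uint8_t)(limit & 0xFF);
    out[1] = (uint8_t)(limit >> 8);
    out[2] = (uint8_t)(base & 0xFF);
    out[3] = (uint8_t)((base >> 8) & 0xFF);
    out[4] = (uint8_t)((base >> 16) & 0xFF);
    out[5] = (uint8_t)(base >> 24);
}
```

The booting core would fill one such block before sending the startup IPI, and the AP reads it back through a segment based at the block's address.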

I also have options to boot a core for VBE detection and to boot it for long-mode realtime operation. The only difference from the normal boot code is that different real & protected mode code is copied to low memory.

When an AP core has finished booting it will disable itself and wait for activation from the scheduler. The scheduler will only activate new cores when load is above some minimum limit. I also have code to shut down cores when load decreases again, but it is disabled since it doesn't work properly.

Here is the part of the code that belongs to the APIC driver itself (16-bit mode):

Code:


; this code is loaded at 0000:0F80h

table_start:

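gdt0:                   ; selector 00h: mandatory null descriptor
    dw 0
    dd 0
    dw 0
gdt8:                   ; selector 08h: data over this table; also read by lgdt as the GDTR image
    dw 30h-1            ; limit: 6 descriptors * 8 bytes - 1
    dd 92000F80h        ; base 000F80h, access 92h (present, writable data)
    dw 0
gdt10:                  ; selector 10h: 16-bit code, base 1400h (prot_start)
    dw 0FFFFh
    dd 9A001400h
    dw 0
gdt18:                  ; selector 18h: flat 32-bit data, base 0, 4 GiB limit
    dw 0FFFFh
    dd 92000000h
    dw 0CFh
gdt20:                  ; selector 20h: 16-bit data, base 1800h (page_struc)
    dw 0FFFFh
    dd 92001800h
    dw 0
gdt28:                  ; selector 28h: 64-bit code (L bit set), base 0
    dw 0FFFFh
    dd 9A000000h
    dw 0AFh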

table_end:

; this code is loaded at 0100:0000. It should contain no near jumps!

real_start:    
    cli
    mov al,0Fh
    out 70h,al
    jmp short $+2
;
    xor al,al
    out 71h,al
    jmp short $+2
;
    xor ax,ax
    mov ds,ax
;    
    mov bx,0F88h
    lgdt fword ptr ds:[bx]
;
    mov eax,cr0
    or al,1
    mov cr0,eax
;
    db 0EAh
    dw 0
    dw 10h

real_end:

; this code is loaded at 01400h. It should contain no near jumps!
    
prot_start:
    mov ax,18h
    mov ds,ax
    mov es,ax
    mov fs,ax
    mov gs,ax
    mov ss,ax
    mov sp,0F00h
;
    mov ax,20h
    mov es,ax
;    
    mov eax,es:ap_cr4
    mov cr4,eax
;
    mov eax,es:ap_cr3
    mov cr3,eax
;    
    db 66h
    lgdt fword ptr es:ap_gdt
;    
    db 66h
    lidt fword ptr es:ap_idt
;
    mov edx,es:ap_stack_offset
    mov bx,es:ap_stack_sel
;    
    mov eax,es:ap_cr0
    mov cr0,eax
;
    db 0EAh
    dw OFFSET ApInit
    dw SEG code

prot_end:

; this data is loaded at 01800h. It is addressed relative to GDT selector 20h

page_struc  STRUC

ap_stack_offset DD ?
ap_stack_sel    DW ?
ap_cr0          DD ?
ap_cr3          DD ?
ap_cr4          DD ?
ap_gdt          DB 6 DUP(?)
ap_idt          DB 6 DUP(?)

page_struc  ENDS

ApInit:
    xor ax,ax
    mov ds,ax
    mov es,ax
    mov fs,ax
    mov gs,ax
;
    mov ss,bx
    mov esp,edx
;
    mov ax,SEG data
    mov ds,ax
    mov eax,12345678h
    mov ds:mp_processor_sign,eax
;
    ShutdownCore
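
The dw/dd/dw triples in the descriptor table at the top of the listing are compact but hard to read. The following C sketch decodes one such triple into base, limit, access byte and flags, using the standard x86 segment descriptor layout; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct seg_desc {
    uint32_t base;    /* 32-bit segment base */
    uint32_t limit;   /* 20-bit limit, before granularity scaling */
    uint8_t  access;  /* present bit, DPL, type */
    uint8_t  flags;   /* bit 3 = G, bit 2 = D/B, bit 1 = L, bit 0 = AVL */
};

/* w0: limit 15:0
 * d1: base 23:0 in the low three bytes, access byte on top
 * w2: limit 19:16 (bits 3:0), flags (bits 7:4), base 31:24 (bits 15:8) */
static struct seg_desc decode_descriptor(uint16_t w0, uint32_t d1, uint16_t w2)
{
    struct seg_desc d;
    d.base   = (d1 & 0x00FFFFFFu) | ((uint32_t)(w2 >> 8) << 24);
    d.limit  = (uint32_t)w0 | ((uint32_t)(w2 & 0x0Fu) << 16);
    d.access = (uint8_t)(d1 >> 24);
    d.flags  = (uint8_t)((w2 >> 4) & 0x0Fu);
    return d;
}
```

Decoding gdt28 this way gives flags 0Ah (G and L set), which is what makes selector 28h usable as a long mode code segment, while gdt18 comes out as flags 0Ch, a flat 32-bit data segment.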


Re: SMP Trampoline Template

Posted: Sun Jan 03, 2021 2:21 pm
by nullplan
rdos wrote:I use the ACPI APIC table to identify how the APIC should be configured and which cores are available and their IDs.
That would be the MADT I alluded to earlier. Indeed I do the same. To my knowledge, the only alternatives are to use the MP tables (which might no longer be present), or to send INIT and Startup IPIs to all cores excluding self, which basically means firing a shotgun into a dark room. So yeah, the MADT is your only reasonable hope.
rdos wrote:This means that APs cannot be started until ACPI is started & the APIC is set up, which occurs in the kernel and not in the boot-loader. I also fail to see the connection with EFI or BIOS boot. The ACPI table is available in both modes and in the same format.
Yeah, I remember now that bzt talked about BOOTBOOT having multicore support before. Essentially, BOOTBOOT will boot all available CPUs so you don't have to. The MADT requires no AML, so it is indeed reasonable for a bootloader to parse it.
rdos wrote:I also fail to see any reason why multiple cores should be started at the same time.
Because speed. Let's name the actors in this play the booter and the bootee. While the booter is waiting for signs of life from the bootee, it isn't doing anything productive. In most systems I have seen that use this mechanism, only the BSP will be a booter, and the APs will be blocked in the scheduler until something pushes tasks into the queue. That means, they aren't doing anything productive. As you can see from bzt's most recent reply, architecturally there is no reason why there can't be 65535 CPUs. If the BSP is starting all APs sequentially, and the APs don't do anything, they are going to be here a while.

Actually, 65535 is beyond my absolute task limit (32767), so the function creating the AP starter tasks should also be in its own task. Then it can block while waiting for a slot in the task queue to open up. I might have to revisit that design decision (having an absolute task limit) some time. Having more CPUs than tasks is probably not terribly sensible, so I might have to limit those.
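To put rough numbers on the speed argument, here is a small C cost model. It is a sketch under loud assumptions: the per-AP wait is a made-up parameter, and the "every running core starts one more core per round" scheme is just one possible way to use several booters, not a description of any kernel in this thread.

```c
#include <assert.h>

/* Total wait when a single booter starts every other core one by one.
 * per_ap_us is the assumed time spent waiting on each bootee. */
static unsigned long long sequential_us(unsigned total_cores, unsigned per_ap_us)
{
    if (total_cores == 0)
        return 0;
    return (unsigned long long)(total_cores - 1) * per_ap_us;
}

/* Rounds needed when every running core starts one more core per round,
 * so the number of running cores doubles each time (1, 2, 4, ...). */
static unsigned fanout_rounds(unsigned total_cores)
{
    unsigned rounds = 0;
    unsigned long long running = 1;     /* only the BSP at first */
    while (running < total_cores) {
        running *= 2;
        rounds++;
    }
    return rounds;
}
```

With a hypothetical 10 ms wait per AP, 65535 cores booted in sequence cost about 655 seconds of waiting, whereas the doubling scheme needs 16 rounds.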
rdos wrote:I boot them in turn, which is much simpler.
What, now you argue for simplicity and against performance? A week ago you argued the opposite position when discussing interrupt code. Make up your mind, please. In this case, I don't think it is much simpler. I have a block of memory I copy to an arbitrary location in low mem and then point the AP at. I don't have to hope a certain address is going to be free. I don't even have to run the starter code on the BSP, it can run on any core.
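The constraint on where that block may live comes from the startup IPI itself: its 8-bit vector is the physical page number, and the AP begins executing at CS:IP = VV00h:0000h. A minimal C sketch of the address check follows; the helper names are hypothetical, and actually writing the value to the local APIC is left out.

```c
#include <assert.h>
#include <stdint.h>

#define ICR_DELIVERY_STARTUP 0x00000600u   /* delivery mode 110b in bits 10:8 */

/* Returns 1 and stores the SIPI vector if phys is a usable trampoline
 * address: 4 KiB aligned and below 1 MiB. Returns 0 otherwise. */
static int sipi_vector_for(uint32_t phys, uint8_t *vector)
{
    if (phys & 0xFFFu)
        return 0;                   /* not page aligned */
    if (phys >= 0x100000u)
        return 0;                   /* not reachable in real mode */
    *vector = (uint8_t)(phys >> 12);
    return 1;
}

/* Low dword of the interrupt command register for a startup IPI. */
static uint32_t sipi_icr_low(uint8_t vector)
{
    return ICR_DELIVERY_STARTUP | vector;
}
```

A trampoline copied to 0x8000 would use vector 08h, and the real-mode code rdos loads at 0100:0000 corresponds to vector 01h.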
rdos wrote:I copy GDT, IDT, CR3, CR4 and other settings to low RAM at fixed locations, and then reference these from the bootstrap code.
I fail to see the point. Unless your goal is to keep running in 16-bit real mode, the GDT required as part of the trampoline will be different from the final GDT loaded in AP startup code, if for no other reason, by its address. The IDT makes no sense unless you are operating in the correct mode, but by that point, you will no longer need stuff to be in low memory to reference it. CR3, OK, that one is needed. CR4 less so, that can be done in the AP landing pad. It would also break encapsulation of the trampoline, since it is not the trampoline's place to tell you whether you should set, say, the WP bit. You should, but the trampoline is the wrong place to do it in. In 64-bit mode, you should also enable SYSCALL, yet the trampoline won't do that for you, either.

Re: SMP Trampoline Template

Posted: Sun Jan 03, 2021 3:08 pm
by rdos
nullplan wrote:Yeah, I remember now that bzt talked about BOOTBOOT having multicore support before. Essentially, BOOTBOOT will boot all available CPUs so you don't have to. The MADT requires no AML, so it is indeed reasonable for a bootloader to parse it.
Certainly, but the AP will end up as an additional core available to the scheduler of a given kernel. You cannot fix up that stuff in the bootloader only. Therefore, if you boot cores in the bootloader rather than in the kernel, you need to temporarily stop the APs at some stage and then let them continue so that they receive the configuration that the kernel & scheduler require. That's why I find it more practical to let the kernel boot AP cores and let them get their final setup in one step only.
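The "stop the APs, then let them continue" step can be sketched as a simple gate in shared memory. This is an illustrative C11 example with invented names, not code from BOOTBOOT or RDOS; a real AP would spin on ap_may_continue() with a pause instruction between polls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ap_gate {
    atomic_uint checked_in;   /* APs that have reached the parking loop */
    atomic_bool released;     /* set once the kernel's per-core setup is ready */
};

static void ap_gate_init(struct ap_gate *g)
{
    atomic_init(&g->checked_in, 0);
    atomic_init(&g->released, false);
}

/* Called by each AP when it arrives at the gate. */
static void ap_check_in(struct ap_gate *g)
{
    atomic_fetch_add_explicit(&g->checked_in, 1, memory_order_release);
}

/* Polled by a parked AP; true once the kernel has opened the gate. */
static bool ap_may_continue(struct ap_gate *g)
{
    return atomic_load_explicit(&g->released, memory_order_acquire);
}

/* Called by the kernel after it has prepared the final configuration. */
static void kernel_release_aps(struct ap_gate *g)
{
    atomic_store_explicit(&g->released, true, memory_order_release);
}

/* Number of APs currently known to be waiting at the gate. */
static unsigned aps_parked(struct ap_gate *g)
{
    return atomic_load_explicit(&g->checked_in, memory_order_acquire);
}
```

The release/acquire pairing ensures that whatever the kernel wrote before opening the gate is visible to an AP that sees the flag set.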
nullplan wrote: Because speed. Let's name the actors in this play the booter and the bootee. While the booter is waiting for signs of life from the bootee, it isn't doing anything productive. In most systems I have seen that use this mechanism, only the BSP will be a booter, and the APs will be blocked in the scheduler until something pushes tasks into the queue. That means, they aren't doing anything productive. As you can see from bzt's most recent reply, architecturally there is no reason why there can't be 65535 CPUs. If the BSP is starting all APs sequentially, and the APs don't do anything, they are going to be here a while.

Actually, 65535 is beyond my absolute task limit (32767), so the function creating the AP starter tasks should also be in its own task. Then it can block while waiting for a slot in the task queue to open up. I might have to revisit that design decision (having an absolute task limit) some time. Having more CPUs than tasks is probably not terribly sensible, so I might have to limit those.
I agree with that. I probably should move the AP initialization to a thread instead. That way it won't affect startup time, and can still be performed in sequence.
nullplan wrote: I fail to see the point. Unless your goal is to keep running in 16-bit real mode, the GDT required as part of the trampoline will be different from the final GDT loaded in AP startup code, if for no other reason, by its address. The IDT makes no sense unless you are operating in the correct mode, but by that point, you will no longer need stuff to be in low memory to reference it. CR3, OK, that one is needed. CR4 less so, that can be done in the AP landing pad. It would also break encapsulation of the trampoline, since it is not the trampoline's place to tell you whether you should set, say, the WP bit. You should, but the trampoline is the wrong place to do it in. In 64-bit mode, you should also enable SYSCALL, yet the trampoline won't do that for you, either.
One way or another you must set the correct bits in the control registers in each AP core, and initialization time is a good point to do it. Since I map my kernel high, I must let the AP core load CR3 and enable paging before it can jump to kernel code and initialize itself from there. There is a "boot-time" GDT, but the "real" GDT & IDT can just as well be loaded before you get into real kernel code, if only because it must know the base & size of the device-driver it will start executing code in. Maybe not an issue in a flat kernel where you can use the same flat selector in both the bootstrap & kernel, but it is an issue for me. As for long mode, you can use the same GDT for protected mode & long mode, but the IDT will be different, although you can still load the 64-bit IDT in the boot-loader as long as you avoid enabling interrupts.

This is how the long mode monitor is initialized (by copying different code to 0x1400). Since the assembler doesn't support 64-bit code, the long mode part of the code is hand-encoded with db directives:

Code:

; this code is loaded at 01400h. It should contain no near jumps!
    
rt_start:
    xor ax,ax
    mov ds,ax
    mov fs,ax
    mov gs,ax
;
    mov ax,18h
    mov ss,ax
    mov esp,OFFSET rt_end - rt_start + 1400h + 10h
;
    mov ax,20h
    mov es,ax
;    
    mov eax,12345678h
    xchg eax,es:ap_cr4
    or al,20h
    mov cr4,eax
;
    mov ecx,IA32_EFER
    rdmsr
    or eax,101h
    wrmsr
;
    mov eax,es:ap_cr3
    mov cr3,eax
;    
    mov eax,cr0
    or eax,80010008h
    mov cr0,eax
;
    mov edx,es:ap_stack_offset
;
    xor ax,ax
    mov es,ax
;
    db 0EAh
    dw OFFSET rt_init64 - rt_start + 1400h
    dw 28h

rt_init64:
    db 48h  ; mov rbx,0FFFFFF8000201000h
    db 0BBh
    dd 201000h
    dd 0FFFFFF80h
;  
    db 48h   ; mov rsp,rbx
    db 8Bh
    db 0E3h
;
    db 48h  ; mov rbx,0FFFFFF8000000000h
    db 0BBh
    dd 0
    dd 0FFFFFF80h
;
    db 89h  ; mov [rbx+10],edx
    db 53h
    db 0Ah
;
    db 48h   ; mov rax,[rbx+2]
    db 8Bh
    db 43h
    db 2
;
    db 48h   ; add rbx,rax
    db 3
    db 0D8h
;
    db 53h  ; push rbx
;
    db 0C3h ; ret

rt_end:
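
The magic constants ORed into CR4, EFER and CR0 in the listing above are easier to follow with the bit names spelled out. The C sketch below uses the architectural bit positions; the macro and function names are chosen here for illustration and do not come from the RDOS sources.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_TS   (1u << 3)    /* task switched: first FPU use traps */
#define CR0_WP   (1u << 16)   /* write protect applies in ring 0 too */
#define CR0_PG   (1u << 31)   /* paging enable */
#define CR4_PAE  (1u << 5)    /* required before entering long mode */
#define EFER_SCE (1u << 0)    /* SYSCALL/SYSRET enable */
#define EFER_LME (1u << 8)    /* long mode enable */

/* or al,20h on the saved CR4 value */
static uint32_t rt_cr4(uint32_t cr4)
{
    return cr4 | CR4_PAE;
}

/* or eax,101h on the low dword read from IA32_EFER */
static uint32_t rt_efer_low(uint32_t efer_low)
{
    return efer_low | EFER_SCE | EFER_LME;
}

/* or eax,80010008h on CR0; setting PG with LME set activates long mode */
static uint32_t rt_cr0(uint32_t cr0)
{
    return cr0 | CR0_PG | CR0_WP | CR0_TS;
}

/* Rebuilds the 64-bit immediate emitted as two dd directives (low, high). */
static uint64_t imm64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Read this way, the sequence is the usual long mode entry order: PAE first, then EFER.LME, then CR3, then CR0.PG, followed by the far jump through the 64-bit code selector 28h.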