multi-core initialization -- 16/32 bit issues

gmatthews · Post by **gmatthews** » Mon Jun 27, 2016 5:41 pm

Hi

I have implemented multi-core initialization and managed to get my APs (application processors) into their real mode boot code. I did this by:

a) copying the AP boot code down below 1Mb (to address 0x1000)
b) then going through the whole SIPI initialization steps

My APs all get to their real mode boot code just fine -- i.e. they are running the code I copied down to below 1Mb.

Now comes the part that has me stumped -- how to get my APs out of real mode and into protected mode (I know how to get out of real mode and into protected mode in general but not in this instance -- read on).

I am using Grub2 as my boot loader, so as far as I understand things I cannot include any 16 bit code in the OS binary (when I do include 16 bit code Grub 2 refuses to load my OS code).

So my AP real mode boot code looks like this (I have questions about the ??? parts).

.code16
startAP:
lgdt ????
movl %cr0, %eax
orl $0x1, %eax
movl %eax, %cr0
ljmp $0x8, ???

I have included this code in my 32-bit OS binary using .incbin -- so I compiled it as a 16-bit raw binary and then I include it as raw bytes using .incbin.

Usually you would put some labels in for the ??? parts, and the linker would relocate, and it would work just great. But I can't do that since my code is not linked -- it is loaded via .incbin -- effectively as data.

So I did the following:
a) copied this code below 1Mb (to address 0x1000)
b) put a descriptor table at 0x2000,
c) put the address of my protected mode code at 0x3000.

My two questions are then:

a) how do I load the gdt with 0x2000 -- the location of my AP gdt? I want to simply do something like
lgdt $0x2000
But that didn't work (using gcc as my assembler).

b) How do I far jump to the address in location 0x300?.

It feels like I might be going about this the wrong way, but I haven't programmed in assembler since my days hacking around on PDP-11 device drivers!

thanks
graham

BrightLight · Post by **BrightLight** » Mon Jun 27, 2016 11:09 pm

gmatthews wrote:how do I load the gdt with 0x2000 -- the location of my AP gdt? I want to simply do something like

lgdt (0x2000)

gmatthews wrote:How do I far jump to the address in location 0x300?.

You said 0x3000 above but 0x300 here; I'll assume you mean 0x3000, because 0x300 is in memory used by the BIOS.
To jump to address 0x3000 itself, just do a normal far jump: jmp 0x08:0x3000
To jump to the address stored at 0x3000, what about:

Code: Select all

ljmp $0x08, $pmode

.code32

pmode:
	# set up segments here, especially SS and DS

	movl 0x3000, %eax	# load the 32-bit address stored at 0x3000
	jmp *%eax		# indirect jump to that address

Combuster · Post by **Combuster** » Tue Jun 28, 2016 2:09 am

The problem with incbin'ing a snippet is that it can't actually reference symbols elsewhere. The trick, however, is that you don't even need it, as long as you write the code so that it doesn't require relocations:

Code: Select all

SECTION .rodata
BITS 16
ap_trampoline_code:
    MOV AX, 0
    MOV DS, AX
    LGDT [ap_gdtr - ap_trampoline_code + AP_TRAMPOLINE_OFFSET]
    MOV EAX, CR0
    OR AL, 1
    MOV CR0, EAX
    JMP FAR DWORD 0x08:ap_startup_code

ap_gdtr: 
    DW 0x1F
    DD gdt

ap_trampoline_end:
ap_trampoline_size EQU ap_trampoline_end - ap_trampoline_code

Velko · Post by **Velko** » Tue Jun 28, 2016 2:11 am

For GDT, you can calculate addresses from known offsets and code location at run time.

Code: Select all

boot_start16:
    /* Load DS */
    mov %cs, %eax
    mov %ax, %ds

    /* Calculate linear address of "boot_start16" */
    shl $4, %eax

    /* Load GDP */
    mov $(ap_tmp_gdp-boot_start16), %bx

    /* Patch GDP's pointer to current linear address of ap_tmp_gdt.
        Use 'or' instead of 'add' here, because it  will do no harm if
        executed multiple times. */
    or %eax, 2(%bx)

    lgdt (%bx)

/* ---- snip ---- */

ap_tmp_gdp:
    .short ap_tmp_gdt_end - ap_tmp_gdt - 1
    .long  ap_tmp_gdt - boot_start16

ap_tmp_gdt:
    /* NULL */
    .byte 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
    /* 32-bit code & data */
    .byte 0xff, 0xff, 0x00, 0x00, 0x00, 0x9a, 0xcf, 0x00
    .byte 0xff, 0xff, 0x00, 0x00, 0x00, 0x92, 0xcf, 0x00
ap_tmp_gdt_end:

Put the code at an "aligned enough" address, so that the offset bits and the linear-address bits do not overlap in the or instruction above. Loading at a page-aligned address (0x1000) should be safe. You should enter the code with IP=0, so set the reset vector to something like (100:0).

As for the ljmp part: I have not had any issues linking 16-bit code into the binary, so I simply use a label. But anyway, you can patch the code with the correct addresses after loading it into place. You can either modify the bytes of the ljmp instruction itself, or reserve some bytes at a known offset and load the jump address from there.

I know that self-modifying code is generally discouraged. But if you already load code as data and move it around, patching it does not seem too bad.
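
To make the two patching ideas above concrete, here is a minimal C sketch of what the boot CPU could do after copying the trampoline into place. The function names, the slot offset and the example addresses are hypothetical and not taken from anyone's kernel; the sketch only models the byte-level fix-ups (OR-ing the load address into the GDT pointer's base field, and storing a jump target in a reserved slot).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian 32-bit load/store, so the sketch behaves the same on any host. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* gdtr points at a 6-byte GDT pointer: 2-byte limit, then 4-byte base.
   The base initially holds an offset relative to the start of the blob.
   OR-ing in the load address is harmless when repeated, provided the
   offset bits and the load-address bits do not overlap. */
static void patch_gdtr_base(uint8_t *gdtr, uint32_t load_addr)
{
    uint32_t base = load_le32(gdtr + 2);
    /* Either not patched yet, or already patched with the same address. */
    assert((base & load_addr) == 0 || (base & load_addr) == load_addr);
    store_le32(gdtr + 2, base | load_addr);
}

/* Store a 32-bit jump target in a slot reserved at a known offset. */
static void set_jump_target(uint8_t *trampoline, size_t slot_off, uint32_t target)
{
    store_le32(trampoline + slot_off, target);
}
```

Calling patch_gdtr_base twice with the same load address leaves the base unchanged after the first call, which is the property the 'or' in the assembly above relies on.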

Brendan · Post by **Brendan** » Tue Jun 28, 2016 9:16 am

Hi,

I normally have the trampoline code at a certain address (let's call that "cs.base"); then put various values it will need at a fixed offset from "cs.base". Then I can just do (e.g.):

Code: Select all

    mov eax,PAGING_FLAG | PROTECTED_MODE_FLAG
    mov cr3, [cs:0xFFC]
    mov esp,[cs:0xFF8]
    lgdt [cs:0xFF0]
    mov cr0,eax             ;Enable paging and protected mode
    jmp far [cs:0xFF8]      ;Load 32-bit CS and jump to somewhere in kernel-space

Of course the code to start AP CPUs would allocate a stack for the CPU and set the appropriate values in the trampoline. Using a "CS override prefix" like this means that you can have several copies of the trampoline at different addresses, and start multiple CPUs at the same time (while still giving them different details - e.g. different values for ESP). Note: "cs.base" is set by the "startup IPI" that you send.

Cheers,

Brendan

gmatthews · Post by **gmatthews** » Tue Jun 28, 2016 12:47 pm

Brendan wrote:Hi,

I normally have the trampoline code at a certain address (let's call that "cs.base"); then put various values it will need at a fixed offset from "cs.base". Then I can just do (e.g.):
Code: Select all
    mov eax,PAGING_FLAG | PROTECTED_MODE_FLAG
    mov cr3, [cs:0xFFC]
    mov esp,[cs:0xFF8]
    lgdt [cs:0xFF0]
    mov cr0,eax             ;Enable paging and protected mode
    jmp far [cs:0xFF8]      ;Load 32-bit CS and jump to somewhere in kernel-space
Of course the code to start AP CPUs would allocate a stack for the CPU and set the appropriate values in the trampoline. Using a "CS override prefix" like this means that you can have several copies of the trampoline at different addresses, and start multiple CPUs at the same time (while still giving them different details - e.g. different values for ESP). Note: "cs.base" is set by the "startup IPI" that you send.

Cheers,

Brendan

First, thanks for everyone's help.

Second, I am not sure what you mean by ""cs.base" is set by the "startup IPI" that you send.". My startup IPI code does this (my AP startup code is at 0x3000, hence the choice of $0x000C4601);

mov $APIC_BASE, %ebx # APIC address in EBX
mov $0x000C4500, %eax # broadcast INIT-IPI
mov %eax, 0x300(%ebx) # to all-except-self
# do ten-millisecond delay, enough time for APs to awaken
mov $10000, %eax # ten-thousand microseconds
call delay_EAX_microseconds # execute programmed delay
mov $0x000C4601, %eax # broadcast Startup-IPI (vector 0x01)
mov %eax, 0x300(%ebx) # to all-except-self
# do ten-millisecond delay, enough time for APs to awaken
mov $10000, %eax # ten-thousand microseconds
call delay_EAX_microseconds # execute programmed delay

I realize the code may be a bit primitive (there is a lot of discussion on the net about the delays not being the best way to do this, but it is simple code that I understand, and I want to get something simple working first).

You suggest that this code has to set cs.base. But it's unclear to me how this code can communicate cs.base to the startup code. Or is the cs.base assumed to be 0x3000 since that is where the startup code is? Or perhaps I am missing part of the semantics of the SIPI -- the AP code starts running at 0x3000 -- but is it running with cs = 0 and ip = 0x3000, or cs = 0x3000 and ip = 0? Hopefully the latter

thanks
graham

Brendan · Post by **Brendan** » Wed Jun 29, 2016 1:51 am

Hi,

gmatthews wrote:
Brendan wrote:Note: "cs.base" is set by the "startup IPI" that you send.
Second, I am not sure what you mean by ""cs.base" is set by the "startup IPI" that you send.". My startup IPI code does this (my AP startup code is at 0x3000, hence the choice of $0x000C4601);

The lowest 8 bits of the Startup IPI (the "vector" field) are loaded into the highest 8 bits of the AP CPU's CS register, so if the vector field is 0x01 the AP CPU's CS register ends up being 0x0100, which means the trampoline must be at "0x0100:0x0000" (in real mode) which is 0x00001000.
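
The relationship between the vector field, CS and the trampoline address can be written down as a few lines of C. This is only a sketch of the arithmetic described above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode CS an AP starts with after a Startup IPI carrying `vector`:
   the vector becomes the high byte of CS, and IP starts at 0. */
static uint16_t sipi_start_cs(uint8_t vector)
{
    return (uint16_t)((uint16_t)vector << 8);
}

/* Physical address of the first instruction the AP executes (CS * 16 + 0). */
static uint32_t sipi_start_phys(uint8_t vector)
{
    return (uint32_t)sipi_start_cs(vector) << 4;   /* same as vector * 0x1000 */
}

/* Inverse: the vector needed to start an AP at `phys`.
   The trampoline must be 4 KiB aligned and below 1 MiB. */
static uint8_t sipi_vector_for(uint32_t phys)
{
    assert((phys & 0xFFFu) == 0 && phys < 0x100000u);
    return (uint8_t)(phys >> 12);
}
```

With the value 0x000C4601 from the code above, the vector is 0x01, so the APs start at 0x0100:0x0000 (physical 0x1000). A trampoline at 0x3000 would need vector 0x03 instead.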

Note that (for an OS) broadcasting the "INIT SIPI SIPI" sequence (e.g. "to all excluding self") is a huge mistake. The problems are:

- It can start CPUs that were disabled because they're faulty
- It can start CPUs that were disabled because the user disabled hyper-threading in the firmware options
- It makes it virtually impossible to detect when a CPU (that should start) has failed to start
- It makes it hard to give each CPU different data (e.g. a different "top of stack" address)

You must only attempt to start CPUs that are listed by ACPI tables or MultiProcessor tables. Unfortunately Intel's manual only shows example code for firmware (where broadcasting the "INIT SIPI SIPI" sequence is normal) and doesn't show example code for OSs (where broadcasting the "INIT SIPI SIPI" sequence should never be done).

Another problem is that often the AP CPU will start on the first SIPI, execute some of your code, then get "restarted" by the second Startup IPI, which can cause bugs (e.g. if the AP CPU does "total_CPUs_present++;" then it can increment the counter twice). This means that you want some sort of synchronisation between the CPU being started and the CPU that's monitoring it. For example, as soon as the CPU starts it can set an "I started" flag in the trampoline and then wait for the other CPU to see this and set a "you can continue" flag before it continues. Also, if the other CPU sees the "I started" flag was set before the second Startup IPI is sent then you can skip the second Startup IPI completely.

This means that the full sequence would be more like:

For each CPU mentioned by ACPI or MultiProcessor Specification:
- Allocate stack for that CPU
- Set info in trampoline (address of stack to use, etc) and clear the "I started" flag and the "you can continue" flag
- Send INIT IPI to that CPU only
- Wait for 10 ms
- Send first Startup IPI to that CPU only
- Wait for up to 200 us or until "I started" flag set (whichever happens first)
- If "I started" flag not set:
  - Send second Startup IPI to that CPU only
  - Wait for up to maybe 500 ms or until "I started" flag set (whichever happens first)
  - If "I started" flag not set:
    - CPU failed to start (display error message, assume the CPU is faulty and don't use it)
- Set "you can continue" flag
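
The per-CPU part of that sequence might look roughly like the following C sketch. It leaves out stack allocation and filling in the trampoline, and all names (struct smp_ops, struct ap_handshake, the sim_* stand-ins) are hypothetical; a real kernel would back the hooks with its local APIC driver and a timer, and the AP side would set the "started" flag from the trampoline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flags shared between the starting CPU and one trampoline copy. */
struct ap_handshake {
    volatile uint32_t started;       /* set by the AP as soon as it runs */
    volatile uint32_t may_continue;  /* set by the starting CPU */
};

/* Hypothetical platform hooks (local APIC driver, timer). */
struct smp_ops {
    void (*send_init)(uint32_t apic_id);
    void (*send_sipi)(uint32_t apic_id, uint8_t vector);
    void (*delay_us)(uint32_t us);
};

/* Poll the "I started" flag for at most timeout_us microseconds. */
static bool wait_started(const struct smp_ops *ops, struct ap_handshake *hs,
                         uint32_t timeout_us)
{
    for (uint32_t waited = 0; waited < timeout_us && !hs->started; waited += 10)
        ops->delay_us(10);
    return hs->started != 0;
}

/* Start one AP; returns false if it never reports in. */
static bool start_ap(const struct smp_ops *ops, struct ap_handshake *hs,
                     uint32_t apic_id, uint8_t vector)
{
    assert(ops && hs);
    hs->started = 0;
    hs->may_continue = 0;

    ops->send_init(apic_id);
    ops->delay_us(10000);                     /* 10 ms */

    ops->send_sipi(apic_id, vector);
    if (!wait_started(ops, hs, 200)) {        /* up to 200 us */
        ops->send_sipi(apic_id, vector);      /* second SIPI only if needed */
        if (!wait_started(ops, hs, 500000))   /* up to 500 ms */
            return false;                     /* treat the CPU as faulty */
    }
    hs->may_continue = 1;
    return true;
}

/* Host-side stand-ins so the sketch can be exercised without hardware:
   the fake AP "starts" once it has received sim_sipis_needed SIPIs. */
static struct ap_handshake sim_hs;
static int sim_sipis_needed, sim_sipis_sent;

static void sim_send_init(uint32_t id) { (void)id; }
static void sim_send_sipi(uint32_t id, uint8_t vector)
{
    (void)id; (void)vector;
    if (++sim_sipis_sent >= sim_sipis_needed)
        sim_hs.started = 1;
}
static void sim_delay_us(uint32_t us) { (void)us; }

static const struct smp_ops sim_ops = { sim_send_init, sim_send_sipi, sim_delay_us };
```

Passing the hooks in as function pointers is just a convenience for trying the logic on a host; in a kernel they would more likely be direct calls.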

The problem with this is that when there are a lot of CPUs it takes a minimum of 10 ms per CPU. For example, with 128 CPUs it'd take at least 1.28 seconds to start all of them. If you want the OS to boot fast, then that's a lot of time spent just starting CPUs.

To fix that (and boot faster) there's various ways to start CPUs in parallel (safely). One way is to send the INIT IPI to (up to) 4 CPUs, then wait for 10 ms once, then do the Startup IPIs one CPU at a time. In this case, with 128 CPUs it'd take at least 320 ms to start all of them. Another way is to have one CPU start another CPU, then both of those CPUs start a CPU each, then all 4 CPUs start a CPU each, and so on. In that case it'd take at least 70 ms to start 128 CPUs. These can be combined - e.g. one CPU starts 4 CPUs, then all 5 CPUs start 4 more CPUs each, then all 25 CPUs start 4 CPUs each, etc. This is the fastest (and most complicated) way, and adds up to at least 40 ms to start 128 CPUs.
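
The lower bounds quoted above follow from counting how many 10 ms INIT delays are needed. A small C sketch of that arithmetic (function names are illustrative only, and the model ignores everything except the INIT delay):

```c
#include <assert.h>
#include <stdint.h>

#define INIT_DELAY_MS 10u

/* One CPU at a time: every CPU costs one full INIT delay. */
static uint32_t serial_ms(uint32_t cpus)
{
    return cpus * INIT_DELAY_MS;
}

/* INIT sent to up to `batch` CPUs before each 10 ms wait. */
static uint32_t batched_ms(uint32_t cpus, uint32_t batch)
{
    assert(batch > 0);
    return (cpus + batch - 1) / batch * INIT_DELAY_MS;
}

/* Every running CPU starts `fanout` more CPUs per round, so the number of
   running CPUs is multiplied by (fanout + 1) each round. */
static uint32_t fanout_ms(uint32_t cpus, uint32_t fanout)
{
    assert(fanout > 0);
    uint32_t running = 1, rounds = 0;
    while (running < cpus) {
        running *= fanout + 1;
        rounds++;
    }
    return rounds * INIT_DELAY_MS;
}
```

For 128 CPUs this gives 1280 ms serially, 320 ms in batches of 4, 70 ms with each CPU starting one other CPU per round, and 40 ms with each CPU starting 4 per round.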

In any case; when you're starting CPUs in parallel (safely) you're going to want a different trampoline for each CPU. For example, if you start (up to) 4 CPUs in parallel, you're going to want 4 copies of the trampoline (with different values for "address of top of stack", separate "I started" flags, etc).

With multiple separate trampolines you need to adjust the "vector" field in the Startup IPI to tell the CPU which trampoline it should use.

Cheers,

Brendan

Velko · Post by **Velko** » Wed Jun 29, 2016 4:20 am

Brendan wrote:To fix that (and boot faster) there's various ways to start CPUs in parallel (safely). One way is to send the INIT IPI to (up to) 4 CPUs, then wait for 10 ms once, then do the Startup IPIs one CPU at a time. In this case, with 128 CPUs it'd take at least 320 ms to start all of them. Another way is to have one CPU start another CPU, then both of those CPUs start a CPU each, then all 4 CPUs start a CPU each, and so on. In that case it'd take at least 70 ms to start 128 CPUs. These can be combined - e.g. one CPU starts 4 CPUs, then all 5 CPUs start 4 more CPUs each, then all 25 CPUs start 4 CPUs each, etc. This is the fastest (and most complicated) way, and adds up to at least 40 ms to start 128 CPUs.

We had a similar discussion a few years back, but I'm still wondering if there is something wrong with my proposed routine.

Code: Select all

Set up trampoline (I use only one)
Build an array of structs (LAPIC ID and CPU_STATUS = Not_started), one for each CPU mentioned by ACPI or MultiProcessor Specification
Set boot CPU's status as Running (for convenience)
Allocate necessary number of stacks, put an array of pointers to them at a known location

For each item in array, where CPU_STATUS == Not_started:
    Send INIT IPI
    
Wait 10 ms

For each item in array, where CPU_STATUS == Not_started:
    Send Startup IPI

Wait 200 us, or until all items in array have CPU_STATUS == Running

If there are CPUs not running:
    For each item in array, where CPU_STATUS == Not_started:
        Send Startup IPI

    Wait for up to maybe 500 ms or until all items in array have CPU_STATUS == Running

If there are CPUs not running:
    Report failed CPUs or ...

Clean up

Then each AP on starting up:

Code: Select all

Loads GDT, switches mode, enables paging, etc.

Obtains next available stack from prepared stack array (using proper locking, of course)

Retrieves its own LAPIC id 

Finds corresponding item in CPU state array and Sets CPU_STATUS = Running

Signals Boot CPU that it should re-check CPU array

Wouldn't it take 10 ms + 200 us + whatever time it takes to send IPIs, regardless of number of CPUs? Are there any pitfalls?
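
For the "obtains next available stack" step, one lock-free way to hand out stacks is an atomic counter into the prepared array. The C11 sketch below only shows the logic; MAX_APS and the function names are hypothetical, and in a real trampoline this step has to run before any C code can (the AP has no stack yet), so it would be done in assembly with an equivalent atomic instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_APS 4   /* hypothetical limit for this sketch */

static void *ap_stacks[MAX_APS];   /* stack tops, filled in by the boot CPU */
static atomic_size_t next_stack;   /* index of the next unclaimed entry */

/* Boot CPU: publish one stack before any AP is started. */
static void provide_stack(size_t slot, void *stack_top)
{
    assert(slot < MAX_APS);
    ap_stacks[slot] = stack_top;
}

/* AP: claim a stack that no other AP can also receive; NULL if none left. */
static void *claim_stack(void)
{
    size_t slot = atomic_fetch_add(&next_stack, 1);
    return slot < MAX_APS ? ap_stacks[slot] : NULL;
}
```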

Brendan · Post by **Brendan** » Wed Jun 29, 2016 6:42 am

Hi,

Velko wrote:
Brendan wrote:To fix that (and boot faster) there's various ways to start CPUs in parallel (safely). One way is to send the INIT IPI to (up to) 4 CPUs, then wait for 10 ms once, then do the Startup IPIs one CPU at a time. In this case, with 128 CPUs it'd take at least 320 ms to start all of them. Another way is to have one CPU start another CPU, then both of those CPUs start a CPU each, then all 4 CPUs start a CPU each, and so on. In that case it'd take at least 70 ms to start 128 CPUs. These can be combined - e.g. one CPU starts 4 CPUs, then all 5 CPUs start 4 more CPUs each, then all 25 CPUs start 4 CPUs each, etc. This is the fastest (and most complicated) way, and adds up to at least 40 ms to start 128 CPUs.
We had a similar discussion a few years back, but I'm still wondering if there is something wrong with my proposed routine.
Code: Select all
Set up trampoline (I use only one)
Build an array of structs (LAPIC ID and CPU_STATUS = Not_started), one for each CPU mentioned by ACPI or MultiProcessor Specification
Set boot CPU's status as Running (for convenience)
Allocate necessary number of stacks, put an array of pointers to them at a known location

For each item in array, where CPU_STATUS == Not_started:
    Send INIT IPI
    
Wait 10 ms

For each item in array, where CPU_STATUS == Not_started:
    Send Startup IPI

Wait 200 us, or until all items in array have CPU_STATUS == Running

If there are CPUs not running:
    For each item in array, where CPU_STATUS == Not_started:
        Send Startup IPI

    Wait for up to maybe 500 ms or until all items in array have CPU_STATUS == Running

If there are CPUs not running:
    Report failed CPUs or ...

Clean up
Then each AP on starting up:
Code: Select all
Loads GDT, switches mode, enables paging, etc.

Obtains next available stack from prepared stack array (using proper locking, of course)

Retrieves its own LAPIC id 

Finds corresponding item in CPU state array and Sets CPU_STATUS = Running

Signals Boot CPU that it should re-check CPU array
Wouldn't it take 10 ms + 200 us + whatever time it takes to send IPIs, regardless of number of CPUs? Are there any pitfalls?

If it takes 10 us to send an IPI (e.g. before the "delivery status" flag clears and it's safe to send the next IPI) and you have 128 CPUs, how long does it take to send 127 separate Startup IPIs? In this case, the first CPU would have already waited for 1270 us before you even begin the "wait 200 us" delay.
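
As a quick check of that figure, here is the arithmetic as a C sketch. The 10 us send time is the assumption from the paragraph above, not a measured value, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Time (in us) that AP number ap_index (0-based) has already been waiting
   when the boot CPU finishes sending the last Startup IPI, assuming every
   send takes ipi_send_us and the IPIs are sent back to back. */
static uint32_t head_start_us(uint32_t ap_index, uint32_t ap_count,
                              uint32_t ipi_send_us)
{
    assert(ap_index < ap_count);
    return (ap_count - ap_index) * ipi_send_us;
}
```

With 127 APs and 10 us per IPI, the first AP has a 1270 us head start before the shared 200 us wait even begins, while the last AP has only 10 us.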

Without knowing how long it takes to send an IPI, the only thing you can know is that all the time delays may be much longer than intended.

I don't know if "time delays may be much longer than intended" can cause issues or not. Maybe it's fine on all CPUs that exist now (and maybe it's not), and maybe next year Intel will decide to do "after 400 us CPU decides it should go back to waiting for INIT IPI" and it breaks.

Cheers,

Brendan

gmatthews · Post by **gmatthews** » Wed Jul 13, 2016 3:21 pm

Brendan wrote:

jmp far [cs:0xFF8] ;Load 32-bit CS and jump to somewhere in kernel-space

How do I code that in gcc/gas? And what exactly has to be in location cs:0xFF8? Do I need to have a 4-byte absolute address or a 6-byte address with the first two bytes being my cs selector (so 0x8 or something like that), and the next 4-bytes being an offset from the base of that selector?

thanks
graham

gmatthews · Post by **gmatthews** » Thu Jul 14, 2016 12:19 am

Brendan

I have tried your suggestion re a trampoline and can't make it work. Conceptually I get it -- it's quite straightforward -- but the assembler is tripping me up (especially since I learned assembler on a machine with no stupid segments).

For your trampoline you have code like this:

Code: Select all

    mov esp,[cs:0xFF8]
    lgdt [cs:0xFF0]
    ...
    jmp far [cs:0xFF8]      ;Load 32-bit CS and jump to somewhere in kernel-space

So the way I read the first line is that:

a) we calculate a linear address A = cs * 16 + 0xFF8
b) we load the 32-bit value at A into the esp register, so esp = *A (in C-speak)

Is that correct?

If that is correct then I assume that the second line says:

a) we calculate a linear address A = cs * 16 + 0xFF0
b) A should be the linear address of 6 bytes -- the first 2 of which are a size, and the last 4 of which are the linear address of a global descriptor table (so A is the address of a gdtr)

And the final line says:

a) we calculate a linear address A = cs * 16 + 0xFF8
b) A should be the linear address of ???? -- I am not sure how we specify the new value of cs, and the offset .. I am not sure what is at address A.

I am guessing my understanding isn't correct, since I can't figure out why my trampoline doesn't work.

graham

Brendan · Post by **Brendan** » Thu Jul 14, 2016 6:23 am

Hi,

gmatthews wrote:For your trampoline you have code like this:
Code: Select all
    mov esp,[cs:0xFF8]
    lgdt [cs:0xFF0]
    ...
    jmp far [cs:0xFF8]      ;Load 32-bit CS and jump to somewhere in kernel-space
So the way I read the first line is that:

a) we calculate a linear address A = cs * 16 + 0xFF8
b) we load the 32-bit value at A into the esp register, so esp = *A (in C-speak)

Is that correct?

Yes.

gmatthews wrote:If that is correct then I assume that the second line says:

a) we calculate a linear address A = cs * 16 + 0xFF0
b) A should be the linear address of 6 bytes -- the first 2 of which are a size, and the last 4 of which are the linear address of a global descriptor table (so A is the address of a gdtr)

Yes.

gmatthews wrote:And the final line says:

a) we calculate a linear address A = cs * 16 + 0xFF8
b) A should be the linear address of ???? -- I am not sure how we specify the new value of cs, and the offset .. I am not sure what is at address A.

I am guessing my understanding isn't correct, since I can't figure out why my trampoline doesn't work.

For this case, the memory at "[cs:0xFF8]" would contain the values to load into CS and EIP (the CS and EIP to jump to).

It's a little bit like calling a function via a function pointer in C, where the function pointer contains the address of the function; except that it's a jump and not a call (so it'd be more like "goto myFunctionPointer();", which isn't something that a C compiler will appreciate), and except that it loads CS and EIP (and doesn't just load EIP).

Cheers,

Brendan

gmatthews · Post by **gmatthews** » Thu Jul 14, 2016 10:54 am

Hi Brendan

You wrote:

For this case, the memory at "[cs:0xFF8]" would contain the values to load into CS and EIP (the CS and EIP to jump to).

So if A = cs * 16 + 0xFF8, what does A actually point to? 6 bytes -- the first 2 being a 16 bit CS, the next 4 being a 32 bit EIP?

graham

Brendan · Post by **Brendan** » Thu Jul 14, 2016 2:25 pm

Hi,

gmatthews wrote:
For this case, the memory at "[cs:0xFF8]" would contain the values to load into CS and EIP (the CS and EIP to jump to).
So if A = cs * 16 + 0xFF8, what does A actually point to? 6 bytes -- the first 2 being a 16 bit CS, the next 4 being a 32 bit EIP?

80x86 is "little-endian", which means the small end goes first: the first 4 bytes would be EIP and the next 2 bytes would be CS.

Note that in 16-bit code you'd probably end up with a 16-bit far jump (with 16-bit IP) as default, and you'd have to tell the assembler that you want a 32-bit jump instead.
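
To spell out the byte layouts discussed above, here is a small C sketch that builds both 6-byte structures: the far pointer read by the indirect far jump, and the GDT pointer read by lgdt. The function names and the example values are hypothetical. In GAS syntax the 32-bit indirect far jump is usually spelled something like "ljmpl *%cs:0xFF8", but treat that spelling as an assumption, since nobody in the thread confirms it.

```c
#include <assert.h>
#include <stdint.h>

/* Far pointer with a 32-bit offset: 4-byte offset (EIP) first,
   then 2-byte selector (CS), each stored little-endian. */
static void encode_far_ptr32(uint8_t out[6], uint16_t selector, uint32_t offset)
{
    out[0] = (uint8_t)offset;
    out[1] = (uint8_t)(offset >> 8);
    out[2] = (uint8_t)(offset >> 16);
    out[3] = (uint8_t)(offset >> 24);
    out[4] = (uint8_t)selector;
    out[5] = (uint8_t)(selector >> 8);
}

/* Operand of lgdt: 2-byte limit first, then 4-byte linear base address. */
static void encode_gdtr(uint8_t out[6], uint16_t limit, uint32_t base)
{
    out[0] = (uint8_t)limit;
    out[1] = (uint8_t)(limit >> 8);
    out[2] = (uint8_t)base;
    out[3] = (uint8_t)(base >> 8);
    out[4] = (uint8_t)(base >> 16);
    out[5] = (uint8_t)(base >> 24);
}
```

For example, selector 0x08 with offset 0x00100000 is stored as the bytes 00 00 10 00 08 00, and a GDT pointer with limit 0x17 and base 0x2000 is stored as 17 00 00 20 00 00. Note that the two structures put their 16-bit field at opposite ends.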

Cheers,

Brendan

gmatthews · Post by **gmatthews** » Sun Jul 17, 2016 4:49 pm

Thanks for the help Brendan. I knew the chip was little endian but would never have thought that would extend to CS:EIP pairs. Again thanks for all the help -- works now!
graham

OSDev.org
