OSDev.org

Posted: **Thu Oct 30, 2014 6:17 am**

Hi,

I would like to start experimenting with multi-core and I don't know where to start. Also googling and looking at OSDev just lead me to explainantion and descriptions without any practical approach.

Kindly, if any one knows a well documented tutorial or any detailed documentation on how to start up AP processors, I would really appreciate it.

I have finished implementing most of James Molly's modules but in 64-bit and I would like to start experimenting with multicore.

Thanks
Karim.

Posted: **Thu Oct 30, 2014 7:35 am**

Pretty much all the fundaments and steps to take are nicely summarized in the miltiprocessor specification. The only real difference for modern computers is that ACPI now provides the tables, and the legacy ones might no longer exist.

Posted: **Fri Oct 31, 2014 7:36 pm**

Hi,

Okay I did some studying but I might still ask naive questions.

I have wrote a small kernel trying to make IPI work.

I have my own boot loader which loads a simple small kernel. I have two assembly files that are compiled each into 4KB.

The first assembly file is the main kernel which contains the IPI instructions, while the other one is the AP entry point.

The boot loader loads the first 4 KB at 0x8000 and the second one at 0x9000 and then jumps to 0x8000.

Here is the main kernel which calls the IPI

Code: Select all

ORG 0x8000
BITS 16
        mov eax,[0xFEE000F0]
	or eax, 0000000100000000b
         mov [0xFEE000F0],eax

	mov     eax, 0x000C4500
	mov     [0xFEE00300], eax
B0:	bt	    dword [0xFEE00300], 12
		jc	   B0

	mov	eax,10000
	call	delay_EAX_microseconds

        mov	  eax, 0x000C4609
        mov    [0xFEE00300], eax
B1:	bt	 dword [0xFEE00300], 12
		jc	 B1

	mov	eax,200
	call	delay_EAX_microseconds

        mov	  eax, 0x000C4609
        mov    [0xFEE00300], eax
B2:	bt	 dword [0xFEE00300], 12
		jc	 B2

	mov	eax,200
	call	delay_EAX_microseconds

hhhh:
    jmp hhhh

delay_EAX_microseconds:
	pushall

	mov	ecx,eax
	mov	eax,100000
	xor	edx, edx
	div	ecx

	mov	ecx,eax
	mov	eax,1193182
	xor	edx, edx
	div	ecx

	out	0x42,al
	xchg al, ah
	out	0x42,al

.T0:	in	al,0x61
	test al,0x20
	jz	.T0

	popall
	ret

times 4096 - ($-$$) db 0

The Code to be executed by the APs

Code: Select all

ORG 0x9000
BITS 16

mov ah,0x0
mov al,0x3
int 0x10

mov si,core_welcome_msg

core_print_bios:
    lodsb
    or al, al  ;zero=end of str
    jz core_print_done    ;get out
    mov ah, 0x0E
    int 0x10
    jmp core_print_bios
core_print_done:


halt:
    jmp halt

core_welcome_msg   db 'Welcome From Core', 13, 10, 0

times 4096 - ($-$$) db 0

I start the kernel within qemu using the following command:
qemu-system-x86_64 -m 4096 -smp cpus=2,cores=2 -fda floppy.flp -enable-kvm

I expect to see the Welcome message but I never get it.

I am sure that the bootloader loads the code correctly because when I add "jmp 0x9000" to the end of the kernel it displays the message.

I appreciate if anyone can help me to go forward with that.

Thanks
Karim.

Posted: **Fri Oct 31, 2014 9:12 pm**

Hi,

Code: Select all

ORG 0x8000
BITS 16
        mov eax,[0xFEE000F0]

For 16-bit code the assembler probably generates "mov eax,[0x00F0]" here. You should also get a "warning: word data exceeds bounds" error (if it's enabled). You'd need to use "mov eax,[dword 0xFEE000F0]" so that the assembler knows you want 32-bit addressing (instead of the default 16-bit addressing).

However, I assume you are in real mode (and not 16-bit protected mode with 4 GiB data segment limits), which means that "mov eax,[dword 0xFEE000F0]" will cause a general protection fault (due to exceeding the 64 KiB segment limit). If that's the case then you need to switch to protected mode (and I'd recommend switching to 32-bit protected mode, and not using 16-bit code).

Code: Select all

	or eax, 0000000100000000b
         mov [0xFEE000F0],eax

When enabling the local APIC you have to set the "spurious vector" part too (and have a interrupt handler for it). Note: I'd recommend using a "spurious vector" where the lowest 4 bits are set, for compatibility with older CPUs where these bits are hard-wired to 1s.

Also the reserved bits should be cleared and not left as random/unknown. Essentially these 3 instructions should be "mov dword [dword 0xFEE000F0], 0x0000001?F" (where the ? is up to you).

Code: Select all

	mov     eax, 0x000C4500
	mov     [0xFEE00300], eax

0x000C4500 translates into "INIT, physical destination, assert, all excluding self".

Do not use "all excluding self" - it sends the IPI to CPUs that were disabled (either because they were faulty or because hyper-threading was disabled) and causes problems on some machines. It also makes it impossible to detect when one of the CPUs fails to respond.

Code: Select all

delay_EAX_microseconds:
	pushall

	mov	ecx,eax
	mov	eax,100000
	xor	edx, edx
	div	ecx

	mov	ecx,eax
	mov	eax,1193182
	xor	edx, edx
	div	ecx

	out	0x42,al
	xchg al, ah
	out	0x42,al

.T0:	in	al,0x61
	test al,0x20
	jz	.T0

	popall
	ret

I have no idea what "pushall" and "popall" are. Maybe you meant "pusha" and "popa" and the assembler thought it was a label without a colon.

You're assuming the firmware left the PIT configured somehow. Some BIOSs use mode 2 ("square wave") and some use mode 2 ("rate generator"). If the BIOS was using mode 2 then channel 2 output may be high immediately. Also, I wouldn't be too surprised if the BIOS IRQ handlers are still running here; so when timer channel 2 output changes the BIOS timer IRQ handler starts and the bit you're testing (in IO port 0x61) may be clear again before you get a chance to test it (which can result in waiting forever).

Code: Select all

ORG 0x9000
BITS 16

At this point, the AP CPU/s just started. They don't have a valid stack (and all APs are using the same default SS:SP because you broadcast the IPIs to all CPUs), now...

Code: Select all

mov ah,0x0
mov al,0x3
int 0x10

..here we probably have 3 or more CPUs, all with the same (invalid) stack, all calling a BIOS function at the exact same time; even though the BIOS was not designed for multi-CPU and none of its functions are re-entrant(!).

Cheers,

Brendan

Posted: **Sat Nov 01, 2014 4:35 am**

Thanks a lot for your detailed reply.

I have some questions though that you might be able to help me with.

Is it possible that I start the APs after the BSP is in 64 bit ? If so, while the BSP is in 64-bit mode can the APs execute bios interrupts; I mean the BSP at that point have initialized the video memory segment, and using int 10 might make conflicts? or the int 10 from the AP will still be able to print on the screen ?

If I configure my QEMU to have 2 cores only, do I still need to handle mutual execution? I mean for testing pruposes only, off course the complete final code should handle that as the number of cores might change from one machine to another.

Now, I am planning to move the code for the IPI to be executed by the BSP from C code while BSP in 64-bit, if possible can you point me to sample code and tutorials that shows how to implement mutual exclusion and implemnting delays? The problem here is that I am reading the BareMetal64 OS code, and to test the code I have to run the whole thing which is to big to test and play with so I can replicate it in my own code.

Thanks a lot in advance.
Karim.

Posted: **Sat Nov 01, 2014 4:57 am**

kemosparc wrote:Is it possible that I start the APs after the BSP is in 64 bit ? If so, while the BSP is in 64-bit mode can the APs execute bios interrupts; I mean the BSP at that point have initialized the video memory segment, and using int 10 might make conflicts? or the int 10 from the AP will still be able to print on the screen ?

The APs start in real mode, and could potentially execute BIOS code. However, many BIOS requests depend on legacy settings for interrupts, which will likely arrive at the wrong core to yield a functional result. Int 0x10 may be the most likely candidate to work from an AP perspective, but it's easier and safer to write directly to video memory if you know the screen is actually still in text mode. In production code, or if you boot more than one AP at the same time, BIOS calls should be considered off-limits due to race conditions.

If I configure my QEMU to have 2 cores only, do I still need to handle mutual execution? I mean for testing pruposes only, off course the complete final code should handle that as the number of cores might change from one machine to another.

If you know that one BSP and one AP don't ever access the same resources, or their accesses can never potentially hurt the other (amongst which, write-only video), then you can initially skip the logic to handle such cases. But when you do get concurrent execution with more than one AP, or when your AP enters the same code as your BSP, you'll soon find a need for mutual exclusion. Possible races should get fixed before they wreak havoc on your debugging skills.

if possible can you point me to sample code and tutorials that shows how to implement mutual exclusion

Wikipedia has all the background information on standard lock mechanics. Delays are more problematic because you need proper timing (read: interrupt) support for that.

Posted: **Mon Nov 10, 2014 3:18 pm**

Thanks a lot for your helpful comments earlier.

I was able to startup all the cores from Long mode.

I will start on the synchronization and locking mechanisms now.

Thanks again,
Karim.

Posted: **Mon Nov 10, 2014 10:10 pm**

Hi,

kemosparc wrote:Thanks a lot for your helpful comments earlier.

I was able to startup all the cores from Long mode.

I will start on the synchronization and locking mechanisms now.

Be careful - sometimes an AP will start on the first SIPI. This means that (for example) if the AP CPUs increment some sort of "number of CPUs started" counter then they can start on the first SIPI, increment the counter, reset on the second SIPI, then increment the counter again.

It also means that if the BSP can detect that the AP has started after the first SIPI, then it can skip the second SIPI (and cancel the time delay between sending the first and second SIPIs).

Also, you probably want to allocate a stack for the AP CPU, store the "address of your stack" in the trampoline, then start the AP CPU (where the AP CPU gets the "address of your stack" from the trampoline and uses it to set ESP).

Also, it's entirely possible (and not difficult) to switch to protected mode or long mode without using a stack. This makes it easier to allocate a stack for the AP CPU, as the stack can be anywhere (e.g. above 1 MiB, and maybe even in the virtual address space).

This means that the full "AP CPU startup" sequence might go like this:

BSP allocates stack for AP CPU, and stores the "top of stack" in the trampoline
BSP sends INIT IPI to AP CPU
BSP waits for 10 ms
BSP sends first SIPI IPI to AP CPU
BSP checks an "AP CPU started" flag in a loop, with a 200 us timeout
If "AP CPU started" flag was not set:
- BSP sends second SIPI IPI to AP CPU
- BSP checks an "AP CPU started" flag in a loop, with a much longer timeout (e.g. 1 second)
- If "AP CPU started" flag was not set:
  - BSP reports "AP CPU failed to start" error and gives up on this AP
BSP sets a "AP can continue" flag (so that AP CPU knows BSP has noticed that it started)
BSP waits for AP to set an "AP ready" flag (or increment a "number of CPUs" counter), so that it knows the AP CPU doesn't need the trampoline any more (and that it's safe start the next AP CPU).

The AP CPU goes a little like this:

AP sets the "AP CPU started" flag to tell BSP its running
AP waits for BSP to set the "AP can continue" flag
AP switches to protected mode or long mode (possibly including enabling paging)
AP loads its stack from the trampoline
AP sets an "AP ready" flag (or increments a "number of CPUs" counter), so BSP knows its finished using the trampoline

When there are a large number of CPUs, it can take a while to start them all one at a time. For example, with 32 CPUs it could take about a third of a second (and with 256 CPUs it could take 2.5 seconds). There are 2 ways to reduce that.

For the first way, the 10 ms delay after sending the INIT IPI doesn't need to be that exact. This means that you could send the INIT IPI to (up to) 16 CPUs; then have one 10 ms delay; then do the remainder of the startup sequence one AP CPU at a time. This would mean that for 32 CPUs it'd take about 2*10ms + 32*400us = 32.8 ms (and for 256 CPUs it'd take about 262 ms).

For the second way; the BSP can start the first AP CPU; then the BSP and the AP CPU can start 2 more AP CPUs; then all 4 CPUs can start 4 more CPUs, then all 8 can start 8 more; and so on. This would mean that for 32 CPUs it'd take 4 steps or 4*10.4 = 41.6 ms (and for 256 CPUs it'd take 7 steps, or about 72.8 ms). The only problem here is that you need multiple trampolines (with one set of flags for synchronisation and "address for AP CPU stack" in each separate trampoline).

These methods can be combined. For example, BSP could send the INIT IPI to 15 CPUs, then start those 15 CPUs one at a time; then all 16 CPUs could all send the INIT IPI to 16 CPUs each (256 total), then each of the first 16 CPU would start its "16 more" CPUs one at a time. In this way, you'd have 16 CPUs running after about 16 ms, and have (up to) 16+256 CPUs running after another 16.4 ms (and have up to 4368 CPUs running after another 16.4 ms).

However; because of the 8-bit field in the SIPI it's impossible to use more than 256 trampolines at a time, and some of them aren't usable because they correspond to ROM, etc. This means that in practice you probably can't use more than about 128 trampolines at the same time. This means that the fastest possible sequence in practice ends up being something like this:

BSP sends INIT IPI to 15 CPUs
BSP does the remainder of the startup sequence for those 15 CPUs one at a time
-- About 16 ms passed so far; 16 CPUs running --
BSP and 15 AP CPUs send INIT IPI to 16 CPUs each
BSP and 15 AP CPUs do the remainder of startup sequence for those 256 CPUs (16 at a time)
-- About 32.4 ms passed so far; 272 CPUs running --
-- Limited by the max. number of trampolines from here on --
128 CPUs send INIT IPI to 16 CPUs each
128 CPUs do the remainder of startup sequence for those 2048 CPUs (128 at a time)
-- About 48.8 ms passed so far; 2320 CPUs running --
128 CPUs send INIT IPI to 16 CPUs each
128 CPUs do the remainder of startup sequence for those 2048 CPUs (128 at a time)
-- About 65.2 ms passed so far; 4368 CPUs running --
...

Finally; for systems with a lots of CPUs you can expect NUMA. With this in mind it may be a good idea to take NUMA into account. For example, BSP might start one logical CPU in each NUMA domain; then each of those CPUs starts the remaining CPUs in its NUMA domain.

Note: Yes, I did go a little over-board here (there aren't too many computers with more than 32 CPUs at the moment). However; next year Intel will be releasing the next version of Xeon Phi; which will be the first version that can be used as a main CPU. Each Xeon Phi will have up to 72 cores with 4 logical CPUs per core (or up to 288 logical CPUs per Xeon Phi chip). It's entirely conceivable that within 12 months we will start seeing (e.g.) computers with 4-socket motherboards that have 4 Xeon Phi chips and a total of 1152 logical CPUs.

Cheers,

Brendan

Posted: **Tue Nov 11, 2014 1:58 am**

Hi,

First of all, I would like to thank you very myuch for your time trying to help me with such detailed thorough reply, I really appreciate it.

Although, I am still crawling trying to find my way, but for sure your post will give me inlightment in my next steps. Actually I am doing this as an afternoon hoppy beside my PhD. studies so I might be a little bit slow, so I will definitely come back with questions but in a low frequency

I have a question that might seem to be naive; what is a trampoline?

I looked up the term on google with the term CPU and I found more than one thing all related to addresses. Can you please recommend a breif reading on how to implement and use them.

One last important thing that I went through on osdev.org wiki http://wiki.osdev.org/Spinlock as I was trying to implement locking so the processors can print a message on the screen and update the memory in sequence. The issue is that the variable "lock" stated does not work with nasm, and it generates an error considering it as a reserved word, but the error does not say that and it took me a while to figure it out. Just thought that if the guys responsible for the wiki could fix that and rename the variable with something else so no other beginners get stuck like me

Thanks,
Karim.

Posted: **Tue Nov 11, 2014 2:35 am**

Hi,

kemosparc wrote:I have a question that might seem to be naive; what is a trampoline?

On 80x86, when you start an AP CPU it begins in real mode, but most OSs use protected mode or long mode. Because of this, most OSs have a small piece of code to switch the CPU from real mode to protected mode or long mode. This small piece of code is called the "trampoline".

It's like the CPU lands on the trampoline and bounces up to the protected/long mode code.

Cheers,

Brendan

Posted: **Tue Nov 11, 2014 2:38 am**

Thanks

Very nice analogy

Karim.

Posted: **Tue Nov 11, 2014 3:38 pm**

Brendan's response above needs to be added to the Wiki (or at least linked) if it's not already...

Edit: Done!

Symmetric Multiprocessing

OSDev.org

IPI Question

IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question

Re: IPI Question