ACPI or MP?

Questions about which tools to use, bugs, the best way to implement a function, etc. should go here. Don't forget to see if your question is answered in the wiki first! When in doubt post here.
cyr1x
Member
Member
Posts: 207
Joined: Tue Aug 21, 2007 1:41 am
Location: Germany

Post by cyr1x »

Why not just

Code: Select all

 mp = (mp* /* or how you named it*/) address;
This is just the same as assigning the values the way you do, but is a lot cleaner.
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

Two other important questions:

1 - When I start an AP processor, must I also switch it to Protected Mode and initialize its descriptors (GDT, IDT and TSS)?

2 - With ACPI and MP, can I only do CPU detection, or other things as well?
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

cyr1x wrote:Why not just

Code: Select all

 mp = (mp* /* or how you named it*/) address;
This is just the same as assigning the values the way you do, but is a lot cleaner.
what???
User avatar
AJ
Member
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK
Contact:

Post by AJ »

MarkOS wrote:1 - When I start an AP processor, must I also switch it to Protected Mode and initialize its descriptors (GDT, IDT and TSS)?
Yes. The CPU is in real mode, so you send the Startup IPI, telling it where to start executing your 'trampoline code'. That code must reside under 1MB in physical RAM (the AP does not have paging yet and is in real mode), and must switch the processor into whatever state you want it to be in.

To do this, I use a sequence of signals between my BSP and APs (in a shared, locked memory location), which tells the trampoline code where to find my system tables and so on. As I use the same trampoline binary for PMode and Long Mode, it also tells the AP whether to attempt the switch to 64-bit mode.
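The INIT-SIPI-SIPI sequence described above can be sketched roughly as follows. This is a hedged sketch, not anyone's actual code from this thread: `lapic_write()` and `delay_us()` are assumed helper functions, and `KERNEL_BUILD` just keeps the hardware-touching part out of a hosted build. The key detail is that the SIPI "vector" is simply the physical page number of the trampoline, which is why it must sit below 1MB on a 4KB boundary.

Code: Select all

```c
#include <stdint.h>

/* Local APIC ICR register offsets (from the Intel SDM). */
#define LAPIC_ICR_LOW   0x300
#define LAPIC_ICR_HIGH  0x310

/* The SIPI "vector" field is the physical page number of the trampoline,
   so the trampoline must be 4KB-aligned and below 1MB. */
static inline uint8_t sipi_vector(uint32_t trampoline_phys)
{
    return (uint8_t)((trampoline_phys >> 12) & 0xFF);
}

#ifdef KERNEL_BUILD
/* Hypothetical sketch of waking one AP; lapic_write() and delay_us()
   are assumed helpers, not real library calls. */
static void start_ap(uint8_t apic_id, uint32_t trampoline_phys)
{
    lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LOW,  0x00004500);           /* INIT IPI */
    delay_us(10000);                                   /* wait ~10ms */
    lapic_write(LAPIC_ICR_HIGH, (uint32_t)apic_id << 24);
    lapic_write(LAPIC_ICR_LOW,  0x00004600 | sipi_vector(trampoline_phys));
    delay_us(200);                                     /* then send a second SIPI */
}
#endif
```

So a trampoline copied to physical 0x8000 would be started with SIPI vector 0x08.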
2 - With ACPI and MP, can I only do CPU detection, or other things as well?
http://www.acpi.info/
http://www.intel.com/design/pentium/datashts/242016.htm

Cheers,
Adam
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

AJ wrote:
MarkOS wrote:1 - When I start an AP processor, must I also switch it to Protected Mode and initialize its descriptors (GDT, IDT and TSS)?
Yes. The CPU is in real mode, so you send the Startup IPI, telling it where to start executing your 'trampoline code'. That code must reside under 1MB in physical RAM (the AP does not have paging yet and is in real mode), and must switch the processor into whatever state you want it to be in.

To do this, I use a sequence of signals between my BSP and APs (in a shared, locked memory location), which tells the trampoline code where to find my system tables and so on. As I use the same trampoline binary for PMode and Long Mode, it also tells the AP whether to attempt the switch to 64-bit mode.
2 - With ACPI and MP, can I only do CPU detection, or other things as well?
http://www.acpi.info/
http://www.intel.com/design/pentium/datashts/242016.htm

Cheers,
Adam

Will I have the same descriptors for all APs?
Is the PDBR shared between all APs?

Must I change anything in my scheduler or paging code?
Must I change anything in my interrupts code?
exkor
Member
Member
Posts: 111
Joined: Wed May 23, 2007 9:38 pm

Post by exkor »

You could try out this version. The search should be faster if you start from the top, because that's where I found most MP structures; the loops can be optimized further.

Code: Select all

 ;look for the MP Floating Pointer Structure first
  ;------------------------------------------
  movzx eax, word [mem_bios]  ;EAX = base memory size in KB
  shl   eax, 10               ;EAX = address of top of base memory
  mov   esi, 9ffffh           ;a0000h - where base memory ends
  sub   eax, 16
.find_MP_FPS:
  add  eax, 16                ;the structure is 16-byte aligned
  cmp  eax, esi
  jae  .range_check
  cmp  dword [rax], "_MP_"
  je   .found
  jmp  .find_MP_FPS
.range_check:
  cmp  esi, 0fffffh
  jae  .no_MP_FPS
  mov  eax, 0e0000h-16        ;0e0000h-0fffffh, BIOS ROM area
  mov  esi, 0fffffh           ;the search ends here
  jmp  .find_MP_FPS
.found:
  lea  ebx, [rax+16]
  xor  edx, edx
  .MP_FPS_checksum:           ;verify FPS checksum (all 16 bytes must sum to 0)
    dec  ebx
    movzx edi, byte [rbx]
    add  edx, edi
    cmp  ebx, eax
    jne  .MP_FPS_checksum
  test dl, dl
  jnz  .find_MP_FPS           ;checksum != 0, continue search
.ok:
  movzx ecx, byte [rax+11]    ;CL = MP Feature Byte 1
  mov  edi, [rax+4]           ;EDI = physical address of the MP config table
  test cl, cl                 ;see if MP Feature Byte 1 != 0
  jnz  .F_byte1               ;nonzero means a default configuration # and no MP config table
  test edi, edi               ;see if the table address = 0
  jz   .no_MP_FPS             ;it is 0 - no MP config table and no default config
  cmp  dword [rdi], 'PCMP'    ;check signature of the MP config table
  jne  .no_MP_FPS             ;wrong signature - default to no MP at all

  ;parse the MP config table now
  ;------------------------------------------
  movzx esi, word [rdi+22h]   ;ESI = # of entries
  add  edi, 2ch               ;EDI = address of 1st entry in the table

.entry:
  mov  rax, [rdi]
  cmp  al, 4                  ;AL = entry type (0-4)
  ja   .next_entry
  je   .4_loc_int
  cmp  al, 2
  ja   .3_io_int
  je   .2_io_apic
  jp   .1_bus                 ;parity trick: AL-2 is 0ffh (type 1, PF=1) or 0feh (type 0, PF=0)

.0_cpu:                       ;CPU entries are 20 bytes, all others 8
  add  edi, 12
  jmp  .next_entry
.4_loc_int:
  jmp  .next_entry
.1_bus:
  jmp  .next_entry
.2_io_apic:
  jmp  .next_entry
.3_io_int:

.next_entry:
  add  edi, 8
  sub  esi, 1
  jnz  .entry


  ;only the MP Floating Pointer Structure present
  ;and a default configuration present
.F_byte1:

  ;no MP of any kind
.no_MP_FPS:
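For comparison, the core of the same search in C - a hedged sketch, with illustrative names: the signature scan and 16-byte checksum follow the MP 1.4 spec, but in a real kernel `start`/`end` would be physical ranges such as the last KB of base memory and the 0E0000h-0FFFFFh ROM area.

Code: Select all

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* All 16 bytes of the floating pointer structure must sum to 0 (mod 256). */
static int mp_checksum_ok(const uint8_t *p)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < 16; i++)
        sum += p[i];
    return sum == 0;
}

/* Scan [start, end) on 16-byte boundaries for the "_MP_" signature,
   skipping candidates with a bad checksum. Returns NULL if not found. */
static const uint8_t *find_mp_fps(const uint8_t *start, const uint8_t *end)
{
    for (const uint8_t *p = start; p + 16 <= end; p += 16)
        if (memcmp(p, "_MP_", 4) == 0 && mp_checksum_ok(p))
            return p;
    return NULL;
}
```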
User avatar
AJ
Member
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK
Contact:

Post by AJ »

MarkOS wrote: Will I have the same descriptors for all APs?
Is the PDBR shared between all APs?
The GDT and IDT can be shared - although you do need a separate TSS descriptor for each core. As for the PDBR, that depends on whether you have separate process address spaces. My trampoline code loads the same PDBR value for the BSP and APs. After the scheduler is started, it is a bit more chaotic, as each core has the PDBR value of its current process.
Must I change anything in my scheduler or paging code?
Must I change anything in my interrupts code?
I have a separate scheduler for each core. After the trampoline code runs, each AP in turn runs some code initialising its unique scheduler class (I am working in C++, but if you are using C/ASM/other, I'm sure you can adapt the idea) and is put into an idle state. The BSP then starts cramming the schedulers full of the initial tasks to run, before jumping into its own scheduler.

It looks to me like there are a few ways to handle this (I still haven't made the design decision yet myself):

1) There is a global scheduler queue - when a CPU preempts, it passes the outgoing task to the back of the queue before taking the first task and running it. This requires careful locks around the global queue and has the greatest risk of one core sitting idle in a spinlock during scheduling, but has the advantage that no one CPU ends up with all the work.

2) Each CPU has its own queue - each time you run a new task, the CPU with (perhaps) the least intensive tasks adds the new task to its queue. This needs some way of moving tasks between schedulers, because if one CPU has all its tasks cancelled, you will end up with an uneven task distribution.

3) Each CPU has its own queue and there is also a global queue. When a task is preempted, it is placed into the global queue and a new one is selected from the local queue. This way, each CPU's queue gradually empties. The first CPU with an empty task queue takes charge of redistributing tasks to the local queues again.

4) There are other people on the board who I hope will chip in because they are much better at this theory than me...
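Option 1 above can be sketched as a single locked global run queue. This is a hedged illustration, not a real implementation: the task type is minimal, and the lock is only a commented stand-in for an atomic spinlock.

Code: Select all

```c
#include <stddef.h>

typedef struct task {
    struct task *next;
    int id;
} task_t;

/* One global run queue, protected by one lock (option 1).
   The lock field is a stub; a real kernel would use an atomic spinlock. */
typedef struct {
    task_t *head, *tail;
    volatile int lock;          /* stand-in for a real spinlock */
} runqueue_t;

static void rq_push(runqueue_t *rq, task_t *t)
{
    t->next = NULL;
    if (rq->tail) rq->tail->next = t; else rq->head = t;
    rq->tail = t;
}

static task_t *rq_pop(runqueue_t *rq)
{
    task_t *t = rq->head;
    if (t) {
        rq->head = t->next;
        if (!rq->head) rq->tail = NULL;
    }
    return t;
}

/* On preemption: the outgoing task goes to the back of the queue,
   and the task at the front runs next. */
static task_t *preempt(runqueue_t *rq, task_t *outgoing)
{
    /* spin_lock(&rq->lock); */
    rq_push(rq, outgoing);
    task_t *next = rq_pop(rq);
    /* spin_unlock(&rq->lock); */
    return next;
}
```

The commented lock calls mark exactly the critical section where every core contends, which is the scalability risk AJ mentions.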

Cheers,
Adam
cyr1x
Member
Member
Posts: 207
Joined: Tue Aug 21, 2007 1:41 am
Location: Germany

Post by cyr1x »

MarkOS wrote: what???
You're doing it like so:

Code: Select all

      mp.signature = *address;
      mp.config = *(address+4);
      mp.length = *(unsigned char*)(address+8);
      mp.version = *(unsigned char*)(address+9);
      mp.checksum = *(unsigned char*)(address+10);
      mp.features1 = *(unsigned char*)(address+11);
      mp.features2 = *(unsigned char*)(address+12);
      mp.features3[0] = *(unsigned char*)(address+13);
      mp.features3[1] = *(unsigned char*)(address+14);
      mp.features3[2] = *(unsigned char*)(address+15);
This is .. ehm ... bad.

You can just assign the pointer to the "mp" and it will be "filled in" (not quite the right word) automatically. Like so:

Code: Select all

MultiProcessorFloatingPointer* mp = (MultiProcessorFloatingPointer* /*or how you named it*/ )address;
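A minimal illustration of why the cast works: the struct's layout matches the bytes in memory, so no copying is needed. The struct and field names below are guesses chosen to match the thread (the actual names are whatever you declared), laid out per the MP 1.4 spec; `__attribute__((packed))` guards against compiler padding.

Code: Select all

```c
#include <stdint.h>

/* MP Floating Pointer Structure, per the MP 1.4 spec (16 bytes).
   Field names are illustrative. */
struct mp_floating_ptr {
    char     signature[4];   /* "_MP_" */
    uint32_t config;         /* physical address of the MP config table */
    uint8_t  length;         /* in 16-byte units, so 1 */
    uint8_t  version;
    uint8_t  checksum;
    uint8_t  features1;
    uint8_t  features2;
    uint8_t  features3[3];
} __attribute__((packed));

/* If this holds, the cast reads exactly the bytes at 'address'. */
_Static_assert(sizeof(struct mp_floating_ptr) == 16, "no padding allowed");
```

Then instead of ten assignments, `struct mp_floating_ptr *mp = (struct mp_floating_ptr *)address;` and every `mp->field` reads straight from the found structure.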
@AJ How about assigning threads to a specific core so they always run on the same core; if one CPU runs out of threads it calls "balance()" or something and takes some threads off other CPUs' queues. This is how it's done in Linux. I think Brendan has (again) a great theory for this.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Post by Brendan »

Hi,
cyr1x wrote:@AJ How about assigning threads to a specific core so they always run on the same core; if one CPU runs out of threads it calls "balance()" or something and takes some threads off other CPUs' queues. This is how it's done in Linux. I think Brendan has (again) a great theory for this.
I'm not too sure about the Linux scheduler's details, but the "tickless" direction it's going in (or went recently?) is well worth looking into IMHO. The basic idea is that a periodic timer IRQ is bad because it wakes up CPUs (takes them out of any power saving state, wastes power, and causes unnecessary heat & noise). My more recent schedulers have used "one-shot" timers to reduce the number of unnecessary IRQs (CPL=3 -> CPL=0 -> CPL=3 context switches) and to provide more "time slice" precision, but I didn't optimize for power management.

For my previous scheduler:
    - each CPU was entirely independent (there was no "reschedule_everything()" to simultaneously delay all CPUs in the computer)
    - there were 4 different scheduling groups, called "scheduling policies".
    - there were 256 different priorities within each scheduling policy (effectively 65536 priorities, sort of).
    - "Policy 0" was intended for very high priority threads (e.g. the device drivers for my micro-kernel), and used a "the highest priority thread that can run does run" algorithm. This was implemented as 256 separate run queues per CPU (or one run queue per priority per CPU).
    - "Policy 1" was intended for high priority threads (e.g. GUI and user-interface threads), and used a "variable frequency" algorithm (where a thread with twice as much priority got twice as many fixed-length time slices). This was implemented as one run queue per CPU. This algorithm caused problems with a large number of running threads, because the list of running threads was searched up to twice when the scheduler needed to find a new thread to give CPU time to (it wasn't "O(1)" like the other algorithms).
    - "Policy 2" was intended for normal priority threads, and used a "variable time slice" algorithm (where a thread with twice as much priority got twice as much CPU time). This was implemented as one run queue per CPU.
    - "Policy 3" was intended for idle priority threads, and also used a "variable time slice" algorithm, with one run queue per CPU.
    - when deciding which thread to give CPU time to, the scheduler would check if any Policy 0 threads could run (and choose one of them if so), then check if any Policy 1 threads could run, then any Policy 2 threads. If the scheduler didn't find anything it'd give CPU time to an idle thread (there's always at least one idle thread per CPU, which never blocks).
    - this gave me a total of 259 separate run queues per CPU (e.g. 1036 run queues for a computer with 4 CPUs). Each run queue had its own reentrancy lock, so the scheduler could be choosing a thread to run from one run queue while a different CPU modifies a different run queue. Lock contention was only possible if 2 or more CPUs were trying to modify the same run queue, and because this is all "per CPU" it should've scaled very well (more CPUs means more chance of CPUs potentially trying to acquire the same lock, but more CPUs also means more locks, so the chance of lock contention stays roughly the same regardless of how many CPUs are present).
    - each thread had a "CPU affinity" bitmask, where each bit represents a CPU ("clear" means the thread can't run on that CPU, "set" means it can).
    - when a thread is given CPU time it's removed from the run queue it was on.
    - when a thread stops using the CPU (either by preemption or blocking), or if something happens to unblock a blocked thread (e.g. its "sleep()" time expires, it receives a message while it's blocked waiting for a message, etc.), then the scheduler inserts the thread back on a run queue. To do this the scheduler first needs to decide which run queue to put the thread onto.
    - to decide which run queue to put the thread onto, the scheduler calculates a score for each CPU that is allowed (in the thread's CPU affinity bitmap) and selects the run queue (for the thread's policy and priority) whose CPU had the best score.
    - the score was calculated as "number of threads already on the queue", so the thread was scheduled to run on the CPU that had the fewest threads already in its run queue. This was a bad way to calculate the score (it was far too simplistic - more on that later).
IMHO there's 3 very important concepts here. The first is scalability - you can *never* have too much scalability. This includes minimizing the overhead of having many threads running, minimizing lock contention, minimizing cache thrashing, etc.

The next "important concept" is the CPU affinity bitmap. Each process had a "default CPU affinity", and when a new thread is spawned its CPU affinity is set to whatever the "default CPU affinity" is. When a new process is started the OS chooses the least loaded NUMA domain and sets the "default CPU affinity" for the new process to every CPU in the selected NUMA domain (this is more about making sure RAM used by all threads in the process can be "close" to all of the CPUs that the threads could be running on, to improve memory access times). The CPU affinity was also used to allow the OS to support mixtures of CPUs with different features. For example, if you've got a dual Pentium machine where one CPU supports MMX and the other doesn't, then a thread that uses MMX would have its CPU affinity adjusted so that it doesn't run on the CPU without MMX. Of course the "default CPU affinity" and the CPU affinity for each thread could be modified by something else if necessary (e.g. a thread that monitors "per NUMA domain" CPU and memory usage for the computer as a whole, and weighs the costs involved with migrating a process to a different NUMA domain against the benefits of balancing CPU and memory usage between NUMA domains).

The last "important concept" is the score used to decide which CPU a thread would be scheduled to run on. The calculation I used was a "quick and dirty" calculation that could be improved a lot (I was intending to improve the calculation after doing some extensive benchmarking, but never really got time). Ideas for a better calculation include taking into account how warm each CPU's caches are (including shared CPU caches), working out "current CPU load" better (more accurately), working out which other threads the thread talks to (to reduce IPC overhead), etc.
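The affinity-plus-score selection described above can be sketched like this. A hedged sketch: the score is just queue length, exactly the "quick and dirty" calculation admitted to be too simplistic, and the types are illustrative.

Code: Select all

```c
#include <stdint.h>

/* Pick the allowed CPU with the fewest queued threads.
   Bit N of 'affinity' set means the thread may run on CPU N. */
static int pick_cpu(uint32_t affinity, const int queue_len[], int ncpus)
{
    int best = -1;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (!(affinity & (1u << cpu)))
            continue;                       /* not allowed on this CPU */
        if (best < 0 || queue_len[cpu] < queue_len[best])
            best = cpu;
    }
    return best;                            /* -1 if the mask was empty */
}
```

A better score would fold in cache warmth, measured load, and IPC partners, as the post goes on to say; only the comparison line would change.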

Of course what I've done in the previous version isn't necessarily what I'll do in the next version... ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
AJ
Member
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK
Contact:

Post by AJ »

Thanks Brendan - I didn't want to name names earlier, but thought you might have something to say on the subject of scheduling :)

I certainly agree about the "one-shot" mechanism - local APICs are, of course, great for this purpose.

Cheers,
Adam
User avatar
bewing
Member
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

Brendan's mechanism is impressive, but kinda fails the fourth important concept of KISS.

I am intending to implement an earlier suggestion, myself. Each core has an independent scheduler, with an independent task queue. In my system, independent "jobs" own "threads" as subtasks. When a new job is created, it is assigned to the core with the least load at that time. The job then creates all its child threads on that same core.

If one core's scheduler runs out of >idle priority runnable tasks, it requests a load balance from the most loaded core. The most loaded core picks a fairly high priority job, and dumps that entire job's worth of threads all at once to the empty core - i.e. the "affinity" for the entire job gets modified all at once.

I like the basic idea of one-shot tickless systems. However, the idea scares me, too. What happens if the timer fails to deliver the next one-shot interrupt to the CPU? Nothing in a system is absolutely 100% guaranteed, including interrupt delivery. I would really want to have some sort of backup mechanism to wake up a CPU that missed a one-shot interrupt, and got stuck running the same task forever.
In a tick-based system, at least you can always count on the fact that another timer tick will come along the next time around.
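The backup mechanism worried about here could be as cheap as a deadline check that any other event (another IRQ, an IPI, a visit to the idle loop) can perform. A hypothetical sketch; the time units and margin are illustrative, not from any real system.

Code: Select all

```c
#include <stdint.h>

/* A lost one-shot interrupt leaves the scheduling deadline in the past
   forever. Any periodic visitor can detect that and force a reschedule. */
static int oneshot_missed(uint64_t now, uint64_t deadline, uint64_t margin)
{
    return now > deadline + margin;
}
```

If this ever returns nonzero, the CPU re-arms the timer and reschedules, so a single dropped interrupt costs one late time slice instead of a stuck core.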
Ready4Dis
Member
Member
Posts: 571
Joined: Sat Nov 18, 2006 9:11 am

Post by Ready4Dis »

bewing wrote:Brendan's mechanism is impressive, but kinda fails the fourth important concept of KISS.

I am intending to implement an earlier suggestion, myself. Each core has an independent scheduler, with an independent task queue. In my system, independent "jobs" own "threads" as subtasks. When a new job is created, it is assigned to the core with the least load at that time. The job then creates all its child threads on that same core.

If one core's scheduler runs out of >idle priority runnable tasks, it requests a load balance from the most loaded core. The most loaded core picks a fairly high priority job, and dumps that entire job's worth of threads all at once to the empty core - i.e. the "affinity" for the entire job gets modified all at once.

I like the basic idea of one-shot tickless systems. However, the idea scares me, too. What happens if the timer fails to deliver the next one-shot interrupt to the CPU? Nothing in a system is absolutely 100% guaranteed, including interrupt delivery. I would really want to have some sort of backup mechanism to wake up a CPU that missed a one-shot interrupt, and got stuck running the same task forever.
In a tick-based system, at least you can always count on the fact that another timer tick will come along the next time around.
Hmm.... that sounds, well... not right to me :). What's the point of multi-threading if the threads are running on a single CPU only? Say I am writing a game with AI in one thread and the graphics engine in another - wouldn't it be better if they could run on separate CPUs so both parts are done simultaneously, rather than on a single CPU where they are sharing cycles? That kind of defeats the main benefit of multithreading in a multi-CPU OS, doesn't it?

Also, transferring a high-CPU-load process is great, but now you are locking 2 queues (removal and insertion), and if that is the only process running and it is high priority, what's to stop your kernel from passing it from CPU 0 -> 1 -> 2 -> 3 and back to 0 again on each check? And if it's the only process running and it has a lot of sub-threads (web-server type applications spawn many threads to use blocking I/O for clients), having all the threads on one CPU, always moved with their parent, seems like a bit of a waste to me. While Brendan's method may seem a bit 'overkill', sometimes keeping it simple isn't the best solution.

I plan to have process balancing of sorts in my OS, where each CPU stores a number indicating how much work it's doing (based on process count and process priorities). On each task switch it checks itself against the other CPUs (so 3-6 compares with 4 CPUs), and if it finds one that differs by a certain number of points (I will play with values to find a good compromise, or have a variable in the OS for balancing), it will move a thread with a specific priority based on the difference between the two. So, for example:

idle = 1
medium = 2
high = 3

CPU 0 -> 2 threads, both idle total = 2
CPU 1 -> 5 threads, 2 idle, 1 medium, and 2 high total =10

So, when CPU 0 runs, it would check, and realize that CPU 1 is doing way more work, and the difference is 8, so 8/2 = 4.. so it would take a high priority thread...
CPU 0 -> 2 idle, 1 high total = 5
CPU 1 -> 2 idle, 1 medium, 1 high = 7

CPU 1 would run and see that it has the higher number; if the difference threshold was set really low (say 1), it would give CPU 0 an idle thread (difference of 2, 2/2 = 1), and they would be balanced at 6. Although typically that difference would be small enough not to be worth the move, and it would just start the next task.

So now they are re-balanced. My values probably won't be 1, 2, 3 - probably closer to 1, 3, 7 or something like that - but you get the idea. I will probably have the CPU with the lowest number do the balancing for the rest of them, so it has the least chance of being interrupted (the locks spinning) and the most free CPU cycles. It's not all that complex (just comparing a few numbers; each time a task is added or removed, increment or decrement a number :). Pretty simple, but it accomplishes balancing pretty well, I think. I still have a lot of planning and implementing to do, but that's what I've come up with so far - trying to keep it simple, but still useful. I may have something in place that tries to keep threads with the same parent on different CPUs so they are able to run simultaneously, but that will be for later.
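The scheme above boils down to a weighted per-CPU load number and a steal threshold. A hedged sketch: the weights 1/2/3 and the threshold are the example values from this post, not a tested design, and the function names are made up for illustration.

Code: Select all

```c
/* Priority weights from the example: idle = 1, medium = 2, high = 3.
   nthreads[p] = how many threads of priority p this CPU is running. */
static int cpu_load(const int nthreads[3], const int weight[3])
{
    int load = 0;
    for (int p = 0; p < 3; p++)
        load += nthreads[p] * weight[p];
    return load;
}

/* Nonzero if 'self' should pull a thread from 'other':
   only when the load gap exceeds the threshold. */
static int should_steal(int self_load, int other_load, int threshold)
{
    return (other_load - self_load) > threshold;
}
```

With the example's numbers (CPU 0 = two idle threads, CPU 1 = 2 idle + 1 medium + 2 high) the loads come out 2 and 10, matching the totals worked through above.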
User avatar
bewing
Member
Member
Posts: 1401
Joined: Wed Feb 07, 2007 1:45 pm
Location: Eugene, OR, US

Post by bewing »

What's the point of multi-threading if the threads are running on a single CPU only?
Each thread can block independently.
wouldn't it be better if they could run on separate CPUs so both parts are done simultaneously, rather than on a single CPU where they are sharing cycles?
If that one process was the only one running on a machine (which it NEVER is) then that would be great. If there are 500 tasks sharing cycles on 8 cores, then you gain absolutely nothing by spreading the threads between cores. In fact, you lose a tremendous amount of "affinity".

(And I did oversimplify my load balancing method description. It won't suffer any of the "obvious" errors. But it may have some subtle ones.)
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

cyr1x wrote:
This is .. ehm ... bad.

You can just assign the pointer to the "mp" and it will be "filled in"(not the right word) automaticly. Like so

Code: Select all

MultiProcessorFloatingPointer* mp = (MultiProcessorFloatingPointer* /*or how you named it*/ )address;
Oh yes. I didn't understand... However, I did it like that because I wanted to see if there was something wrong with the mp structure! Naturally I'm now using mp = (struct mp_floating_ptr*)address;
User avatar
Jeko
Member
Member
Posts: 500
Joined: Fri Mar 17, 2006 12:00 am
Location: Napoli, Italy

Post by Jeko »

Could you explain to me how to schedule/divide processes and threads among multiple physical processors?