Multiprocessor design thingy

Candy · Post by **Candy** » Sun Oct 24, 2004 1:52 pm

I was playing with a thought (way too busy to do anything substantial) and came up with the idea of each CPU having its own info struct at some fixed location that is identical on all CPUs. The page for that location would be mapped as global only the very first time, and then unmapped from all page tables. On a page fault at that location the CPU would search for the page that should go there, add a mapping for it, and set a debug flag (forgot which one). The debug trap would then fire after the instruction, and the page would be removed from the page tables again.

On the other hand, if the TLB is emptied so quickly that the entry has to be reloaded on every access, this is a bad idea. Does anybody have an idea about this?

Dreamsmith · Post by **Dreamsmith** » Sun Oct 24, 2004 2:27 pm

Well, that would certainly be a great way to slow down access to that particular data structure. What else this accomplishes, I'm a bit unclear on. Perhaps if you explained why you would want to do this?

Candy · Post by **Candy** » Sun Oct 24, 2004 4:00 pm

Dreamsmith wrote: Well, that would certainly be a great way to slow down access to that particular data structure. What else this accomplishes, I'm a bit unclear on. Perhaps if you explained why you would want to do this?

The idea was that since it was globally mapped within that processor, it would be permanently cached within the TLB of that processor, where each processor would access its own data structure through that page without switching it in each address space (since it was going to be global anyway, but only within that processor). That way you wouldn't have to change all those page tables each time a process moves to another processor, and you could run more than one thread of a process on two processors with differing mappings for that specific page but identical mappings for all other pages.

You could access the processor-specific data at a location that the processor remapped for me, and for which I didn't have to use a GS/FS prefix to address. Also, it would not impact speed because it'd be in the TLB constantly.

And that's exactly the part I'm not sure about: is it in the TLB most of the time, given that it is accessed at each timer tick and at each interrupt?

Brendan · Post by **Brendan** » Mon Oct 25, 2004 1:40 am

Hi,

Candy wrote: The idea was that since it was globally mapped within that processor, it would be permanently cached within the TLB of that processor, where each processor would access its own data structure through that page without switching it in each address space (since it was going to be global anyway, but only within that processor).

IMHO one problem with this is that other CPUs may need to access this data. E.g. (in my OS), determining which CPU has the least load, or finding all CPUs within a NUMA domain that support specific instruction sets (MMX, 3DNow!, SSE, etc - mostly as my OS is MP not SMP).

I use an array of CPU information structures, such that all information for all CPUs can be accessed by any CPU, and I also use a different GS for each CPU so it can access its own information quickly. Because the CPU information structure is padded out to 4 KB (and marked "global" if the CPU supports it) it will usually be in its associated CPU's TLB (i.e. unless another CPU needed to access the information or the CPU ran out of free TLB entries). It seems to me to have the same results, but without the page faults or access restrictions (for remote CPUs).
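A minimal sketch of such a table in C, assuming made-up field names (`apic_id`, `load`, `features`), a hypothetical `MAX_CPUS` limit and a hypothetical helper `least_loaded_cpu`; none of these come from the actual OS. It only illustrates the 4 KB padding and the fact that any CPU can scan every entry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define MAX_CPUS  8u   /* hypothetical limit for this sketch */

/* Hypothetical per-CPU record; the field names are illustrative only. */
struct cpu_fields {
    uint32_t apic_id;   /* local APIC ID of the owning CPU */
    uint32_t load;      /* some load metric, e.g. runnable threads */
    uint32_t features;  /* feature bits (MMX, SSE, ...) */
};

/* Pad each record to exactly one page so no two CPUs share a page. */
union cpu_info {
    struct cpu_fields f;
    uint8_t pad[PAGE_SIZE];
};

static_assert(sizeof(union cpu_info) == PAGE_SIZE, "one page per CPU");

/* Visible to every CPU; a real kernel would also page-align the table. */
static union cpu_info cpu_table[MAX_CPUS];

/* Any CPU may scan the whole table, e.g. to find the least-loaded CPU. */
static size_t least_loaded_cpu(size_t cpu_count)
{
    assert(cpu_count > 0 && cpu_count <= MAX_CPUS);
    size_t best = 0;
    for (size_t i = 1; i < cpu_count; i++) {
        if (cpu_table[i].f.load < cpu_table[best].f.load)
            best = i;
    }
    return best;
}
```

Each CPU's GS base would then point at its own `cpu_table[i]`, while remote CPUs go through the array index.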

Cheers,

Brendan

Candy · Post by **Candy** » Mon Oct 25, 2004 11:15 am

Brendan wrote: IMHO one problem with this is that other CPUs may need to access this data. E.g. (in my OS), determining which CPU has the least load, or finding all CPUs within a NUMA domain that support specific instruction sets (MMX, 3DNow!, SSE, etc - mostly as my OS is MP not SMP).

Well, AFAIK you can still map those same pages at a different place too, say, somewhere after it. It was just to have some very common place to access without finding out where GS is hiding, and preferably without using GS at all. Since I have written lots of stuff in C, which doesn't really like GS-relative addressing, I have started to use a function gs_base which just returns the base of GS (read from the GDT) as information for the OS.
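For illustration, a function like gs_base could be sketched as below. This is not my actual code: the names `descriptor_base` and `gs_base_from_gdt` are hypothetical, and it assumes legacy 8-byte segment descriptors read as little-endian 64-bit values, where the base is split across bits 16-39 and bits 56-63.

```c
#include <assert.h>
#include <stdint.h>

/* Reassemble the 32-bit base address from one 8-byte segment descriptor. */
static uint32_t descriptor_base(uint64_t desc)
{
    uint32_t low  = (uint32_t)((desc >> 16) & 0xFFFFFFu); /* base bits 0..23  */
    uint32_t high = (uint32_t)((desc >> 56) & 0xFFu);     /* base bits 24..31 */
    return (high << 24) | low;
}

/* Look up the base for a selector that refers to the GDT (TI bit clear). */
static uint32_t gs_base_from_gdt(const uint64_t *gdt, uint16_t selector)
{
    assert((selector & 0x4u) == 0);   /* TI = 1 would mean the LDT */
    return descriptor_base(gdt[selector >> 3]);
}
```

In a kernel, `selector` would come from reading GS and `gdt` from the address stored by SGDT; both are left to the caller here.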

NB: don't forget that user level programs can also load their own GS register.

Brendan wrote: I use an array of CPU information structures, such that all information for all CPUs can be accessed by any CPU, and I also use a different GS for each CPU so it can access its own information quickly. Because the CPU information structure is padded out to 4 KB (and marked "global" if the CPU supports it) it will usually be in its associated CPU's TLB (i.e. unless another CPU needed to access the information or the CPU ran out of free TLB entries). It seems to me to have the same results, but without the page faults or access restrictions (for remote CPUs).

It won't be removed when another CPU uses it. The TLB is no MOESI cache; an entry can only be read in and forgotten later on. It will only be invalidated by some IPI (or something else) sending an invalidation request. Since all processors have their own page for that place, they should never be freed.

The advantage is more or less nil anyway, since I might as well abuse the LDT register or something else the user can't touch. The point was not having to search for the CPU's location.

Brendan · Post by **Brendan** » Mon Oct 25, 2004 10:12 pm

Hi,

Candy wrote: NB: don't forget that user level programs can also load their own GS register.

Hmm, ok (I assumed user-level code wasn't allowed to change any segment registers). In this case you haven't got too much choice as you can't reserve a (segment or general) register for use by the kernel only.

For it to work you'd probably need re-entrancy locking - e.g.:

CPU #0 - CPU #0 information gets flushed from TLB
CPU #1 - CPU #1 information gets flushed from TLB
CPU #0 - tries to access CPU information page
CPU #0 - generates page fault
CPU #0 - maps CPU information page into page directory
CPU #1 - tries to access CPU information page
CPU #1 - physical address of wrong CPU information loaded into TLB
*** CPU #1 using wrong CPU information until TLB flushed ***
CPU #0 - returns to instruction that caused page fault
CPU #0 - instruction retry causes new page table entry to be loaded into TLB
CPU #0 - removes CPU information page from page table (leaving it in TLB)
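The interleaving above can be replayed in a small user-space model. This is only a sketch: `shared_pte`, `tlb`, `info_frame`, `access_info_page` and `run_race` are hypothetical names, the frame addresses are made up, and a "TLB" here is just one cached value per CPU.

```c
#include <assert.h>
#include <stdint.h>

#define NOT_PRESENT 0u

/* One page table entry shared by both CPUs (same address space). */
static uint32_t shared_pte = NOT_PRESENT;

/* Each CPU's private cached translation for the CPU info page. */
static uint32_t tlb[2] = { NOT_PRESENT, NOT_PRESENT };

/* Made-up physical frames holding each CPU's info structure. */
static const uint32_t info_frame[2] = { 0x100000u, 0x101000u };

/* Model an access: returns 1 on TLB hit or successful fill, 0 on page fault. */
static int access_info_page(int cpu)
{
    if (tlb[cpu] != NOT_PRESENT) return 1;   /* TLB hit */
    if (shared_pte == NOT_PRESENT) return 0; /* page fault */
    tlb[cpu] = shared_pte;                   /* hardware walk fills the TLB */
    return 1;
}

/* Replay the sequence from the post without any locking. */
static void run_race(void)
{
    tlb[0] = tlb[1] = NOT_PRESENT;   /* both CPUs lose their TLB entry */
    int ok = access_info_page(0);    /* CPU #0 touches the page and faults */
    assert(!ok);
    (void)ok;
    shared_pte = info_frame[0];      /* CPU #0 maps its own info page */
    access_info_page(1);             /* CPU #1 touches the page right now */
    access_info_page(0);             /* CPU #0 retries the instruction */
    shared_pte = NOT_PRESENT;        /* CPU #0 removes the mapping again */
}
```

After `run_race()`, `tlb[1]` holds `info_frame[0]`, i.e. CPU #1 is left with CPU #0's page until its TLB entry is flushed.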

Also, you may not need to use a debug exception to remove the CPU information page from the page table if the CPU's TLB caching is reliable (no guarantees here - extensive testing may be needed to prove it's OK due to different CPU TLB sizes, etc). The basic idea would go something like:

pageFaultHandler {
  if( (CR2 & 0xFFFFF000) == CPU_information_page) {
    phys = find_the_physical_address_for_this_CPUs_info();
    lock_spinlock(the_CPU_information_PGF_lock);
    set_page_table_entry(CR2, phys);  /* insert CPU info page into page table */
    invalidate_TLB(CR2);              /* needed for Cyrix at least */
    dummy = *(volatile char *)CR2;    /* dummy read to ensure new page table entry is in TLB */
    set_page_table_entry(CR2, NULL);  /* remove CPU info page from page table (TLB entry stays) */
    unlock_spinlock(the_CPU_information_PGF_lock);
    return;
  } else {
    ...
  }
}

Then any CPU can access data from its CPU information without any messing about. Otherwise locking and unlocking the re-entrancy lock may end up in different bits of code (e.g. lock in PGF handler and unlock in debug handler, or lock and unlock around all code that accesses the CPU info).

Depending on how often the CPU information needs to be accessed, the size of the CPU's TLBs (number of TLB misses, page faults, etc) and the number of CPUs in the computer (lock contention, data cache thrashing where the re-entrancy lock is stored), it might be better to search the array of CPU information structures each time...
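Searching the array each time could look roughly like the sketch below. The struct layout and the name `find_cpu_info` are assumptions for illustration; in a kernel the `apic_id` argument would be read from the local APIC ID register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down CPU information structure. */
struct cpu_info {
    uint32_t apic_id;  /* local APIC ID of the CPU that owns this entry */
    uint32_t load;     /* placeholder for the rest of the per-CPU data */
};

/* Linear search for the entry owned by the CPU with the given APIC ID. */
static struct cpu_info *find_cpu_info(struct cpu_info *table, size_t count,
                                      uint32_t apic_id)
{
    assert(table != NULL);
    for (size_t i = 0; i < count; i++) {
        if (table[i].apic_id == apic_id)
            return &table[i];
    }
    return NULL;  /* no entry for this APIC ID */
}
```

The cost is one APIC register read plus a scan that grows with the number of CPUs, but there are no page faults and no shared lock.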

It may also be better to use GS (user-level code can still use DS, ES, FS and SS, but GS may be trashed by the kernel unexpectedly), and inline assembly:

set_GS() {
  temp = local_APIC_ID;                     /* read the local APIC ID once */
  for(GS = first_CPU_GS; GS <= last_CPU_GS; GS = GS + 8) {
    if(GS:CPUinfoStruct.APIC_ID == temp) return;
  }
}

some_code_that_uses_CPU_info() {
  /* reload GS if it is not one of the per-CPU selectors */
  if( (GS < first_CPU_GS) || (GS > last_CPU_GS) ) set_GS();
  ...
}

Another possibility would be to use a local APIC register (e.g. the local APIC timer count register if you don't use the timer), or even one of the debugging registers (e.g. DR3). Something like:

some_code_that_uses_CPU_info() {
  CPU_info_struct *CPU_info  = DR3_or_local_APIC_register;

  something = CPU_info->foo;
}

Candy wrote: It won't be removed when another CPU uses it. The TLB is no MOESI cache; an entry can only be read in and forgotten later on. It will only be invalidated by some IPI (or something else) sending an invalidation request. Since all processors have their own page for that place, they should never be freed.

Oops, I got the TLB and the data cache mixed up.

You're right - the CPU would always have the page in its TLB (unless it runs out of free TLB entries). The page would be in the CPU's data cache(s) unless another CPU needed to modify the information or the CPU ran out of free data cache entries.

Cheers,

Brendan

OSDev.org
