Best or fastest way to determine which CPU is running (SMP)

rod
Posts: 21
Joined: Mon Feb 10, 2014 7:42 am

Post by rod »

I enabled SMP on x86_64, and now I want to know which CPU or core is running an interrupt handler at any given moment (timer, etc.) in order to make decisions (scheduling, etc.).

As far as I know, there are several methods:
  • Provide different page table mappings for different cores and store different values in the same virtual address, then read those values. This should be fast, but I've read that with HyperThreading, the 2 threads of the same core share the same page tables, so it wouldn't work in that case.
  • CPUID with eax=1 returns the initial APIC ID in bits 31:24 of ebx. But I've read that the CPUID instruction is quite slow; could it take around 100 cycles?
  • Read the local APIC's memory-mapped registers: APIC_BASE (usually 0xFEE00000) + 0x20 is the APIC ID Register, which should return the same value as CPUID. For this to work, all cores must share the same APIC_BASE address (as obtained from the corresponding bits of rdmsr(0x1B)). Is that guaranteed? Is there much latency when reading from that memory-mapped area?
  • The RDTSCP instruction that also loads IA32_TSC_AUX into ecx (that value could be used to store a per-cpu value).
  • Some value stored in the GDT.
  • Some other processor specific register that can be quickly checked.
Which one would be better or faster? Are there any other methods?
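As a rough illustration of the CPUID and memory-mapped APIC options from the list, here is a minimal sketch in C (GCC/Clang inline assembly). It assumes xAPIC mode, where the 8-bit ID sits in bits 31:24 of both CPUID.01H:EBX and the ID register at offset 0x20. The names `apic_id_from_field`, `apic_id_via_cpuid`, `apic_id_via_mmio` and the `lapic_base` parameter are hypothetical and not taken from any particular kernel.

```c
#include <stdint.h>

/* Both CPUID.01H:EBX and the xAPIC ID register keep the 8-bit APIC ID
 * in bits 31:24, so one helper decodes either value. */
static inline uint8_t apic_id_from_field(uint32_t raw)
{
    return (uint8_t)(raw >> 24);
}

/* Memory-mapped option: read the ID register at offset 0x20.
 * lapic_base is a hypothetical kernel virtual address that the kernel
 * has mapped (uncached) onto the local APIC page. */
static inline uint8_t apic_id_via_mmio(const volatile uint32_t *lapic_base)
{
    return apic_id_from_field(lapic_base[0x20 / 4]);
}

#if defined(__x86_64__)
/* CPUID option: leaf 1. CPUID is a serializing instruction, which is
 * one reason it is comparatively slow. */
static inline uint8_t apic_id_via_cpuid(void)
{
    uint32_t eax = 1, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    return apic_id_from_field(ebx);
}
#endif
```

Either value is a hardware ID rather than a dense index, so a kernel would typically translate it once at boot into its own CPU number. In x2APIC mode the ID is 32 bits wide and is read differently, so this sketch would not apply as written.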
LtG
Member
Posts: 384
Joined: Thu Aug 13, 2015 4:57 pm

Re: Best or fastest way to determine which CPU is running (S

Post by LtG »

HyperThreading doesn't cause paging to be shared; the two logical CPUs use independent page tables. However, the TLB resources may be (AFAIK will be) shared, so effectively the TLB available to each HT logical CPU is halved. That may (in practice will) impact performance, but you'll likely still get more performance from proper HT usage...

Another alternative is to use a separate IDT for each core. The ISR that runs then already knows which core it is running on, because each core has its own handler code.

Whatever you choose you'll likely need CPU/core specific data areas (the first option you listed) so that might be the easiest and most convenient option.

Edit: I don't know how slow CPUID is, but assuming your 100 cycles, it's possible (depending on your OS) that accessing memory will in practice nearly always cause a cache miss and thus be even slower, for example if your OS accesses the "core specific data area" so infrequently that it is always out of cache. However, I wouldn't optimize something this small at this point. After your OS is "complete" you can decide what gives the best performance; for now, use what makes the most sense and leave optimizations until later.
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Best or fastest way to determine which CPU is running (S

Post by iansjack »

Read the ID register in the local APIC?
xenos
Member
Posts: 1121
Joined: Thu Aug 11, 2005 11:00 pm
Libera.chat IRC: xenos1984
Location: Tartu, Estonia

Re: Best or fastest way to determine which CPU is running (S

Post by xenos »

What about reading the task register? For interrupts with privilege level change you should have one TSS per core, and so each core should have a unique TSS selector, to which the task register points.

I haven't compared the reading performance with APIC ID register, though.
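A minimal sketch of the task-register idea in C (GCC/Clang inline assembly). It assumes the per-CPU TSS descriptors sit back to back in the GDT, starting at a hypothetical slot `FIRST_TSS_GDT_INDEX`; the thread does not say how the GDT is laid out, so that constant and the function names are illustrative only.

```c
#include <stdint.h>

#define FIRST_TSS_GDT_INDEX 5u /* hypothetical: GDT slot of CPU 0's TSS */
#define TSS_DESC_SLOTS      2u /* a 64-bit TSS descriptor is 16 bytes,
                                  i.e. two 8-byte GDT slots */

/* Convert a TSS selector into a CPU index under the layout above. */
static inline unsigned cpu_from_tss_selector(uint16_t sel)
{
    unsigned index = sel >> 3; /* drop RPL (bits 1:0) and TI (bit 2) */
    return (index - FIRST_TSS_GDT_INDEX) / TSS_DESC_SLOTS;
}

#if defined(__x86_64__)
/* STR stores the current task register selector into a register. */
static inline unsigned current_cpu_via_tr(void)
{
    uint16_t sel;
    __asm__ volatile("str %0" : "=r"(sel));
    return cpu_from_tss_selector(sel);
}
#endif
```

With this layout, selector 0x28 maps to CPU 0, 0x38 to CPU 1, and so on. No memory access is needed, only the STR instruction and a little arithmetic.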
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Best or fastest way to determine which CPU is running (S

Post by Korona »

Use gs to point to cpu-specific data on x86_64. Use the swapgs instruction to swap between user-mode gs and the cpu-specific pointer in the kernel. syscall basically forces you to use gs/swapgs for this purpose, as it does not give you a stack to save your other registers on.
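A minimal sketch of the gs-based approach in C (GCC/Clang inline assembly). The `struct percpu` layout and the name `this_cpu_id` are hypothetical; the sketch assumes the kernel has already pointed the GS base of each CPU at its own `struct percpu` and has executed swapgs on entry from user mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU block; one instance per CPU, set up at boot. */
struct percpu {
    struct percpu *self;   /* lets code turn %gs:0 into a normal pointer */
    uint32_t       cpu_id; /* logical CPU index assigned at boot */
    uint64_t       user_rsp; /* scratch slot for the syscall entry path */
};

#if defined(__x86_64__)
/* One gs-relative load; only valid in kernel mode after the GS base
 * has been pointed at this CPU's struct percpu. */
static inline uint32_t this_cpu_id(void)
{
    uint32_t id;
    __asm__ volatile("movl %%gs:%c1, %0"
                     : "=r"(id)
                     : "i"(offsetof(struct percpu, cpu_id)));
    return id;
}
#endif
```

The `self` pointer is a common convenience: loading `%gs:0` yields an ordinary pointer to the structure, so C code can reach the other fields without further inline assembly.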
simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Best or fastest way to determine which CPU is running (S

Post by simeonz »

rod wrote:Provide different page table mappings for different cores and store different values in the same virtual address, then read those values.
I believe this will require one version of every process's address space for each cpu. In particular, one cpu specific page table must be created for each address translation level. And on-the-fly changes to such address space will get complicated as well.
rod wrote:Are there any other methods?
Honestly, I am mostly a spectator here (for educational purposes), but skimming over the Linux kernel sources I see that the x86-64 ISR uses the "swapgs" instruction on entry. This exchanges the GS base with a value held in an MSR (IA32_KERNEL_GS_BASE). In kernel mode the GS base points to a per-cpu structure (their ABI, you could say), which means that the kernel can store all sorts of cpu-specific information as fields in it, including a CPU id (which you want), pointers to per-cpu scheduler queues, etc. You can also get the GS base or the cpu id from your ISR stack, assuming it was configured through the interrupt stack table individually for each cpu. Essentially, you either need to get the kernel stack from the per-cpu structures or you need to get the per-cpu structures from the kernel stack. But either way, once you end up with a per-cpu state, you will have a "cache" of the cpu id as a field in the per-cpu data. The instruction is actually mentioned in the wiki.

Now, this may not actually be as reliable as some of the methods you have mentioned. The technique here assumes that the per-cpu structure is consistent between ISR invocations.
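The "per-cpu structures from the kernel stack" direction can be sketched as follows in C. This assumes, beyond what the post states, that every interrupt stack is `KSTACK_SIZE` bytes, aligned to its own size, and keeps a pointer to the owning CPU's data at its lowest address. All names here are hypothetical.

```c
#include <stdint.h>

#define KSTACK_SIZE 16384u /* hypothetical: power of two, and each stack
                              is aligned to this size */

struct percpu; /* per-CPU data, opaque for this sketch */

/* One kernel/interrupt stack. The owner pointer lives at the lowest
 * address, far from the top where the stack actually grows. */
union kstack {
    struct percpu *cpu;
    unsigned char  bytes[KSTACK_SIZE];
};

/* Mask the stack pointer down to the start of the stack it lies in. */
static inline union kstack *kstack_of(uintptr_t sp)
{
    return (union kstack *)(sp & ~(uintptr_t)(KSTACK_SIZE - 1));
}

/* Valid for any sp strictly inside the stack, which holds in an ISR
 * because the CPU has already pushed an interrupt frame. */
static inline struct percpu *percpu_from_sp(uintptr_t sp)
{
    return kstack_of(sp)->cpu;
}
```

The cost is one mask and one load, but a stack overflow can overwrite the owner pointer, so a guard page below each stack would be a sensible companion.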

Edit: Korona gave you the answer already, but I will leave my answer as well, in case there is something useful in it.
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Best or fastest way to determine which CPU is running (S

Post by iansjack »

You might want to read this note about problems with the swapgs instruction. https://www.kernel.org/doc/Documentatio ... try_64.txt If all you want to do is identify which processor a task is running on I would suggest that the local APIC is the simplest and most reliable source of information.
LtG
Member
Posts: 384
Joined: Thu Aug 13, 2015 4:57 pm

Re: Best or fastest way to determine which CPU is running (S

Post by LtG »

iansjack wrote:You might want to read this note about problems with the swapgs instruction. https://www.kernel.org/doc/Documentatio ... try_64.txt If all you want to do is identify which processor a task is running on I would suggest that the local APIC is the simplest and most reliable source of information.
Is there something that makes it _more_ reliable than some of the others (paging for instance)? Also, assuming you need CPU specific data anyway, then I don't really see it as simpler either.
tsdnz
Member
Posts: 333
Joined: Sun Jun 16, 2013 4:09 am

Re: Best or fastest way to determine which CPU is running (S

Post by tsdnz »

XenOS wrote:I haven't compared the reading performance with APIC ID register, though.
I write a handler for each core, which is very fast; reading the APIC ID register takes a few cycles that I am not willing to waste.

Ali
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Best or fastest way to determine which CPU is running (S

Post by iansjack »

I suppose it's a trade-off between code size, complexity, and speed. As far as I am concerned, interrupts typically occur at the end of a relatively lengthy pause (waiting for a key press, waiting for a network or USB frame, waiting for a disk sector read, etc.), so a clock cycle here or there isn't going to make any difference. An exception would be the timer tick, so it might be sensible to use the local APIC timer to drive separate interrupts on individual cores.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Best or fastest way to determine which CPU is running (S

Post by Brendan »

Hi,
rod wrote:Are there any other methods?
The only other method that I've heard of (that someone hasn't already mentioned) is using a debug register (e.g. DR3). This might actually be the fastest method (if you're willing to limit things like debuggers to 3 breakpoints instead of 4).
tsdnz wrote:
XenOS wrote:I haven't compared the reading performance with APIC ID register, though.
I write a handler for each core, which is very fast; reading the APIC ID register takes a few cycles that I am not willing to waste.
You'd pay for that in terms of cache misses. E.g. if L1 cache is shared by 2 CPUs, L2 cache is shared by 4 CPUs and L3 cache is shared by 8 CPUs; then using "different interrupt handler per CPU" means that other CPUs don't cause the cache line/s you need to be brought into caches you share; which means that it's more likely a CPU will have to fetch the IDT entry and the interrupt handler's code from further away (e.g. from RAM instead of L2 cache).
simeonz wrote:
rod wrote:Provide different page table mappings for different cores and store different values in the same virtual address, then read those values.
I believe this will require one version of every process's address space for each cpu. In particular, one cpu specific page table must be created for each address translation level. And on-the-fly changes to such address space will get complicated as well.
If you support multi-threaded processes (where 2 threads that belong to the same process could be running on different CPUs at the same time) it'd have to be worse than "virtual address space per process per CPU".

What I do is have a virtual address space for each thread; then patch part of the thread's virtual address space during task switches (before loading CR3 to avoid TLB invalidation) to get "per-CPU", "per-core" and "per NUMA domain" areas of kernel space. However, I'm using "virtual address space for each thread" for other reasons (to split user-space into "process space" and "thread space", and ensure one thread can't access data in a different thread's "thread space") and wouldn't do it like this if I wasn't already using "virtual address space for each thread".


Cheers,

Brendan
simeonz
Member
Posts: 360
Joined: Fri Aug 19, 2016 10:28 pm

Re: Best or fastest way to determine which CPU is running (S

Post by simeonz »

Brendan wrote:
tsdnz wrote:
XenOS wrote: I haven't compared the reading performance with APIC ID register, though.
I write a handler for each core, which is very fast; reading the APIC ID register takes a few cycles that I am not willing to waste.
You'd pay for that in terms of cache misses...
That was my first thought as well. However, couldn't you have an array of trampolines that simply call into the shared ISR code? The trampolines will effectively push the following address to the stack, which the ISR can use to determine the cpu index. It may be the fastest way actually, albeit a bit idiosyncratic.
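The trampoline idea can be sketched in C as a small address calculation. The assumptions are mine, not the post's: each CPU's IDT entry points at its own fixed-size slot in a trampoline table, every slot begins with a 5-byte `call rel32` to the shared handler, and the slot size is the hypothetical `TRAMPOLINE_STRIDE`.

```c
#include <stdint.h>

#define TRAMPOLINE_STRIDE 8u /* hypothetical slot size in bytes; must be
                                larger than the 5-byte call instruction */

/* The call at the start of slot i pushes (table_base + i*STRIDE + 5),
 * which still lies inside slot i, so integer division recovers i. */
static inline unsigned cpu_from_return_address(uintptr_t ret,
                                               uintptr_t table_base)
{
    return (unsigned)((ret - table_base) / TRAMPOLINE_STRIDE);
}
```

The shared handler would read the pushed address from the top of its stack, derive the index, and remove that extra word before returning with iretq. Only the small trampolines differ per CPU, so the bulk of the handler code stays shared in cache.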
Brendan wrote:What I do is have a virtual address space for each thread
This obviously aligns with the security goals of the OS, especially considering applications that service different clients using threads (possibly with impersonation). However, it is also interesting because the program parallelism in such a case is almost process-based. The relation to multi-threading (assuming that I understood the scheme) is that shared data pointers match in both thread-like processes and can be worked with natively, without translation. If we extrapolate the same principle, maybe multiple executables can share data in this way if the shared region is mapped at a consistent location by the OS. So multi-threading can be replaced entirely by a memory mapping/sharing API that supports a consistent inter-process layout. (I do not discuss sharing function pointers here, because of the negative security implications.)

Sorry for the off-topic.
Geri
Member
Posts: 442
Joined: Sun Jul 14, 2013 6:01 pm

Re: Best or fastest way to determine which CPU is running (S

Post by Geri »

I do it with CPUID.

Also, if you feel that you must execute CPUID again and again, to such an extent that it slows your code down, you are probably doing something terribly wrong.
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Best or fastest way to determine which CPU is running (S

Post by iansjack »

I think you may have misunderstood the question.
samiam95124
Posts: 9
Joined: Sun Sep 11, 2016 12:54 pm

Re: Best or fastest way to determine which CPU is running (S

Post by samiam95124 »

My guess:

Task register, to a unique TSS (which is usually per core in any case), then to a value in the TSS. The TSS limit is set in the descriptor, which means you can make the TSS longer than needed and store goodies in the extra space, so voila, there is a place for a core number at a simple offset. You will have more than one thread per core, but most implementations don't use hardware task switching and instead use the (fake) TSS only to supply the stack pointer. Thus all threads on the same core use the same TSS and yield the same core number from it.

Other idea: since you are not actually using the TSS to store registers, you can repurpose those fields so that you are not wasting the whole TSS per core.
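A minimal sketch of the extended-TSS layout in C (GCC/Clang attribute syntax). `struct tss64` follows the architectural 104-byte 64-bit TSS; `struct cpu_tss`, its `cpu_id` field and `CPU_TSS_LIMIT` are hypothetical names for the extra space described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Architectural 64-bit TSS: 104 bytes. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];      /* RSP0..RSP2 */
    uint64_t reserved1;
    uint64_t ist[7];      /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iopb_offset; /* I/O permission bitmap base */
} __attribute__((packed));

/* Hypothetical extension: the kernel's own data right after the
 * hardware-defined part, covered by a larger descriptor limit. */
struct cpu_tss {
    struct tss64 hw;
    uint32_t     cpu_id;
} __attribute__((packed));

/* Limit to put in the TSS descriptor (size minus one). */
#define CPU_TSS_LIMIT (sizeof(struct cpu_tss) - 1)
```

If the extra bytes follow the hardware part like this, `iopb_offset` should be set to at least `sizeof(struct cpu_tss)` so the CPU does not interpret `cpu_id` as part of an I/O permission bitmap.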

Scott Franco
San Jose