Hi,
Colonel Kernel wrote:
My only quibble is with terminology -- "domain", "sub-domain", and "slot" don't say much to me. For starters, you can re-name "slot" to "CPU core" since you've already explained that they're equivalent.
Changing "slots" to "CPU cores" would make more sense
.
I don't really understand the distinction between a domain and a sub-domain, other than the fact that domains can't have slots in them... For example, the IBM/Sequent Numa-Q topology you presented has four "domains". What are they actually? Is there a big crossbar switch connecting them together, or...?
For everything I've been able to find so far (Numa-Q, Altix, Cray, some large Compaq servers), each domain does correspond to a crossbar, master router or switch (which are IMHO just different manufacturers' names for roughly the same thing).
Also, there is no example of a domain that contains memory ranges or I/O buses. Can you give a real-life example of what this looks like?
I can't, but even if a real-life example doesn't exist now, that doesn't necessarily mean someone won't create one in the future.
I guess I'm trying to get the software representation to mimic any possible hardware, and to do it so that everything within a sub-domain has the same (or similar) access times.
AFAICT ACPI is only designed for a single layer ("domains" rather than "domains and sub-domains"), which implies OSes are expected to keep track of the "distance" between each domain and every other domain.
For a simplified example (2 memory ranges only), I'd have:
[tt]System
|_Domain 0
| |_Sub-domain 0.0
| | |_Memory range 0.0.0
| |_Sub-domain 0.1
| | |_Memory range 0.1.0
|_Domain 1
| |_Sub-domain 1.0
| | |_Memory range 1.0.0
| |_Sub-domain 1.1
| | |_Memory range 1.1.0[/tt]
While for the same system, ACPI would have:
[tt]System
|_Domain 0
| |_Memory range 0.0
|_Domain 1
| |_Memory range 1.0
|_Domain 2
| |_Memory range 2.0
|_Domain 3
| |_Memory range 3.0[/tt]
And an additional array of relative "distances" between each pair of domains (which is what ACPI's SLIT provides).
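To make the comparison concrete, here's roughly how I'd picture the two representations as C structures. All of the names and sizes here are made up for illustration - this isn't code from my kernel, and the flat version only mimics what the SLIT describes:
[tt]#include <stdint.h>

#define MAX_DOMAINS    16   /* all sizes are arbitrary for the example */
#define MAX_SUBDOMAINS  8
#define MAX_RANGES      8

struct memory_range {
    uint64_t base;
    uint64_t length;
};

/* My nested representation - "closeness" is implied by the tree itself */
struct sub_domain {
    int num_ranges;
    struct memory_range ranges[MAX_RANGES];
};

struct domain {
    int num_sub_domains;
    struct sub_domain sub_domains[MAX_SUBDOMAINS];
};

/* The ACPI-style flat representation - one level of domains, plus an
   N x N matrix of relative distances (what the SLIT describes) */
struct flat_domain {
    int num_ranges;
    struct memory_range ranges[MAX_RANGES];
};

struct flat_domain flat_domains[MAX_DOMAINS];
uint8_t distance[MAX_DOMAINS][MAX_DOMAINS];  /* distance[i][j] = relative cost */[/tt]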
IMHO my system would be simpler for the kernel to use. Consider the physical memory manager: if it runs out of RAM in a sub-domain, then it'd use RAM connected to another sub-domain within the same domain. Using the example above, if the OS ran out of RAM in memory range 0.0.0 then it'd use RAM from memory range 0.1.0, and not memory range 1.0.0 or 1.1.0.
Without distinguishing between domains and sub-domains, the OS would need to compare relative distances; otherwise the memory manager wouldn't know which memory range is the "next best".
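As a rough sketch (using the hypothetical structures above, where alloc_from_sub() is a made-up helper that returns NULL when a sub-domain has no free pages), the fallback order with the nested representation would be:
[tt]#include <stddef.h>

/* Hypothetical helper: returns a free page from one sub-domain,
   or NULL if that sub-domain is out of free pages */
void *alloc_from_sub(struct sub_domain *sub);

void *alloc_page(struct domain *domains, int num_domains,
                 int home_domain, int home_sub)
{
    void *page;

    /* First choice: the sub-domain the CPU is closest to */
    page = alloc_from_sub(&domains[home_domain].sub_domains[home_sub]);
    if (page != NULL) return page;

    /* Second choice: any other sub-domain within the same domain */
    for (int s = 0; s < domains[home_domain].num_sub_domains; s++) {
        if (s == home_sub) continue;
        page = alloc_from_sub(&domains[home_domain].sub_domains[s]);
        if (page != NULL) return page;
    }

    /* Last resort: any sub-domain in any other domain */
    for (int d = 0; d < num_domains; d++) {
        if (d == home_domain) continue;
        for (int s = 0; s < domains[d].num_sub_domains; s++) {
            page = alloc_from_sub(&domains[d].sub_domains[s]);
            if (page != NULL) return page;
        }
    }
    return NULL;  /* out of physical memory everywhere */
}[/tt]
The point is that no distance comparisons happen at all - the tree structure itself encodes which memory is "next best".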
I'm not sure what Linux is doing either. Based on the "Topology API" from http://lse.sourceforge.net/numa/topology_api/in-kernel/, it looks like they're thinking of "nested NUMA nodes" of any depth.
The main problem I'll have is topology detection - for now, I have no choice but to use ACPI (the SRAT and SLIT tables). Because of this, I'm wondering if I'd be better off dumping my plans and using domains only, with relative distances (without any attempt to make the representation correspond to the hardware).
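One possible middle ground would be to parse the SLIT as-is, then reconstruct a two-level hierarchy by grouping domains whose mutual distance is below some threshold. Something like this (a sketch only - the threshold, the names and the greedy grouping are my own guesswork, not anything ACPI defines):
[tt]#include <stdint.h>

#define MAX_DOMAINS        16
#define DISTANCE_THRESHOLD 20   /* a guess - the SLIT uses 10 for "local" */

extern uint8_t slit_distance[MAX_DOMAINS][MAX_DOMAINS];

int group_of[MAX_DOMAINS];      /* proximity domain -> top-level "domain" */

/* Simple greedy pass: each ungrouped domain starts a new group and
   pulls in every later domain that's "close enough" to it. This
   ignores non-transitive distances, so it's only a starting point. */
void build_groups(int num_domains)
{
    int next_group = 0;

    for (int i = 0; i < num_domains; i++) group_of[i] = -1;

    for (int i = 0; i < num_domains; i++) {
        if (group_of[i] != -1) continue;        /* already grouped */
        group_of[i] = next_group;
        for (int j = i + 1; j < num_domains; j++) {
            if (group_of[j] == -1 &&
                slit_distance[i][j] < DISTANCE_THRESHOLD)
                group_of[j] = next_group;
        }
        next_group++;
    }
}[/tt]
Each group would then become a "domain", and each original ACPI proximity domain would become a "sub-domain" within it.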
Looks like I need to think about this more - any opinions are welcome...
Thanks,
Brendan