Hi,
To improve my OS design's ability to work well on all different computers, I've been working on ways to represent the relationship between different parts (CPUs, memory, I/O buses) of a system. This information will be used by the OS to make scheduling and memory management decisions.
I've come up with a "System Topology Specification" for my OS, but I don't know enough about non-80x86 architectures to be confident that it's suitable for all computers.
The draft specification is at:
http://bcos.hopto.org/docs/appdev/theory/topology.html
I'm also wondering if anyone can find any problems or suggest any improvements to it...
Thanks,
Brendan
System Topology
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Colonel Kernel
- Member
- Posts: 1437
- Joined: Tue Oct 17, 2006 6:06 pm
- Location: Vancouver, BC, Canada
- Contact:
Re:System Topology
My only quibble is with terminology -- "domain", "sub-domain", and "slot" don't say much to me. For starters, you can re-name "slot" to "CPU core" since you've already explained that they're equivalent.
I don't really understand the distinction between a domain and a sub-domain, other than the fact that domains can't have slots in them... For example, the IBM/Sequent Numa-Q topology you presented has four "domains". What are they actually? Is there a big crossbar switch connecting them together, or...?
Also, there is no example of a domain that contains memory ranges or I/O buses. Can you give a real-life example of what this looks like?
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:System Topology
Hi,
Colonel Kernel wrote: My only quibble is with terminology -- "domain", "sub-domain", and "slot" don't say much to me. For starters, you can re-name "slot" to "CPU core" since you've already explained that they're equivalent.
Changing "slots" to "CPU cores" would make more sense.
Colonel Kernel wrote: I don't really understand the distinction between a domain and a sub-domain, other than the fact that domains can't have slots in them... For example, the IBM/Sequent Numa-Q topology you presented has four "domains". What are they actually? Is there a big crossbar switch connecting them together, or...?
For everything I've been able to find so far (Numa-Q, Altix, Cray, some large Compaq servers), each domain does correspond to a crossbar, master router or switch (which are IMHO just different manufacturers' names for roughly the same thing).
Colonel Kernel wrote: Also, there is no example of a domain that contains memory ranges or I/O buses. Can you give a real-life example of what this looks like?
I can't, but even if a real-life example doesn't exist now, that doesn't necessarily mean someone won't create one in the future.
I guess I'm trying to get the software representation to mimic any hardware possible, and do it so that everything within a sub-domain has the same/similar access times.
AFAICT, ACPI is only designed for a single layer ("domains" rather than "domains and sub-domains"), which implies OSs are expected to keep track of the "distance" between each domain and every other domain.
For a simplified example (2 memory ranges only), I'd have:
[tt]System
|_Domain 0
| |_Sub-domain 0.0
| | |_Memory range 0.0.0
| |_Sub-domain 0.1
|   |_Memory range 0.1.0
|_Domain 1
  |_Sub-domain 1.0
  | |_Memory range 1.0.0
  |_Sub-domain 1.1
    |_Memory range 1.1.0[/tt]
While for the same system they'd have:
[tt]System
|_Domain 0
| |_Memory range 0.0
|_Domain 1
| |_Memory range 1.0
|_Domain 2
| |_Memory range 2.0
|_Domain 3
|_Memory range 3.0[/tt]
And an additional array of relative "distances" between domains.
IMHO my system would be simpler for the kernel to use. Consider the physical memory manager - if it runs out of RAM in a sub-domain then it'd use RAM connected to another sub-domain within the same domain. Using the example above, if the OS ran out of RAM in memory range 0.0.0 then it'd use RAM from memory range 0.1.0, and not memory range 1.0.0 or 1.1.0.
Without distinguishing between domains and sub-domains, the OS would need to compare relative distances; otherwise the memory manager wouldn't know which memory range is the "next best".
I'm not sure what Linux is doing either. Based on the "Topology API" from http://lse.sourceforge.net/numa/topology_api/in-kernel/, it looks like they're thinking of "nested NUMA nodes" of any depth.
The main problem I'll have is topology detection - for now, I have no choice but to use ACPI (the SRAT and SLIT tables). Because of this I'm wondering if I'd be better to dump my plans and use domains only, with relative distances (without any attempt to make it correspond to the hardware).
Looks like I need to think about this more - any opinions welcome ...
Thanks,
Brendan
Re:System Topology
Brendan wrote: While for the same system they'd have:
[tt]System
|_Domain 0
| |_Memory range 0.0
|_Domain 1
| |_Memory range 1.0
|_Domain 2
| |_Memory range 2.0
|_Domain 3
|_Memory range 3.0[/tt]
And an additional array of relative "distances" between domains.
That is actually a more realistic and useful representation. Consider a (physical) topology that looks like this (sorry for the weirdness in the diagram... this @#)$(* thing is definitely not WYSIWYG):
[tt]
[Memory] [CPU A] [Memory] [CPU B]
| | | |
[CrossBar] [CrossBar]
| |
-------------------- --------------------
| | | | | |
[I/O]--[I/O Bridge] | | | | [I/O Bridge]--[I/O]
| | | |
[CPU Bridge] | | [CPU Bridge]
| | | |
| [CPU Bridge]--[CPU Bridge] |
| |
| [CPU Bridge]--[CPU Bridge] |
| | | |
[CPU Bridge] | | [CPU Bridge]
| | | |
[I/O]--[I/O Bridge] | | | | [I/O Bridge]--[I/O]
| | | | | |
-------------------- --------------------
| |
[CrossBar] [CrossBar]
| | | |
[Memory] [CPU C] [Memory] [CPU D]
[/tt]
In other words, each CPU/memory/I/O "cluster" is connected only to its direct neighbour and not to all other clusters. In your model, each would be a separate domain (or sub-domain, take your pick), which implies that they are all equidistant. From a performance point of view, this just isn't true.
Your model can provide accurate distance information for the above example, but it would have to be tailored for each "cluster's" view of the world. For example, in the above topology, CPU A would consider itself to be part of the same sub-domain as its local memory and I/O. It would consider CPUs B and C to be in different sub-domains of the same domain. From its point of view, CPU D is in a separate domain because it is further away than B and C. Ditto for CPU D -- CPUs B and C appear to be in the same "domain", but A does not because it is two "hops" away.
Top three reasons why my OS project died:
- Too much overtime at work
- Got married
- My brain got stuck in an infinite loop while trying to design the memory manager
Re:System Topology
Hi,
Colonel Kernel wrote: Your model can provide accurate distance information for the above example, but it would have to be tailored for each "cluster's" view of the world. For example, in the above topology, CPU A would consider itself to be part of the same sub-domain as its local memory and I/O. It would consider CPUs B and C to be in different sub-domains of the same domain. From its point of view, CPU D is in a separate domain because it is further away than B and C. Ditto for CPU D -- CPUs B and C appear to be in the same "domain", but A does not because it is two "hops" away.
You're right - for this model my representation doesn't work, and neither would "nested nodes of any depth". The only thing that would work is ACPI's representation, which doesn't try to follow the physical hardware.
This leaves me with domains, CPU cores, logical CPUs, memory and IO, with an additional table of relative distances between domains. Any domain contains CPU cores, memory and/or IO, and each CPU core contains one or more logical CPUs.
Thanks,
Brendan