How to change from non-PAE paging mode to PAE paging mode?

626788149
Member
Posts: 25
Joined: Mon Oct 26, 2015 7:24 am
Location: Guilin,China

How to change from non-PAE paging mode to PAE paging mode?

Post by 626788149 »

Since my kernel boots, I have been using non-PAE paging mode, simply mapping 0~4MB to 0~4MB and 0xC0000000 ~ 0xC0000000 + 768MB to 0 ~ 768MB using 4MB pages.
I have set the kernel PDPT up, but I just don't know how to switch to PAE paging mode.
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by iansjack »

Create the appropriate Page Table(s). Set the PAE bit in CR4. What more is there to say?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
626788149 wrote:Since my kernel boots, I have been using non-PAE paging mode, simply mapping 0~4MB to 0~4MB and 0xC0000000 ~ 0xC0000000 + 768MB to 0 ~ 768MB using 4MB pages.
I have set the kernel PDPT up, but I just don't know how to switch to PAE paging mode.
To change from "plain 32-bit paging" to PAE you'd have to set the flag in CR4 (to enable PAE) and load CR3 with the address of a correct PDPT. You can't do both at the same time; and (without disabling paging) you can't do one before the other (e.g. enable PAE then crash because CR3 is wrong, or load CR3 then crash because PAE isn't enabled).

The only sane option is to disable paging, then enable PAE and load CR3, then enable paging again. This means that you'd need a tiny piece of code that's identity mapped in "plain 32-bit paging" and also identity mapped in PAE.

Of course you'd also have to worry about an IRQ (including NMI) occurring while paging is disabled. To guard against NMI I just load the IDTR with "base = 0, limit = 0" so that an NMI guarantees triple fault (and can't cause undefined behaviour); which seems "bad" but attempting to handle NMI correctly in boot code is insane (and ignoring critical hardware errors by disabling NMI is far worse) so a guaranteed triple fault is the "least worst" option.

This means that the final sequence might look something like this:
  • Disable IRQs (if necessary)
  • Load IDTR with "base = 0, limit = 0" and jump to the tiny piece of identity mapped code (in either order)
  • Disable paging
  • Enable PAE in CR4 and load CR3 (in either order)
  • Enable paging
  • Restore IDTR and jump back to where you came from (in either order)
  • Enable IRQs (if necessary)
Note: You should be able to do this without using the stack, so you don't need to care if the stack is also identity mapped.
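
A minimal sketch of that sequence, as C with GCC inline assembly (assuming the function and its statics are linked into the tiny identity-mapped region, that pae_pdpt_phys is the physical address of a valid 32-byte aligned PDPT, and that nothing gets spilled to a non-identity-mapped stack between the CR0 writes - which is exactly why real boot code would more likely be pure assembly that avoids the stack entirely):

Code:
    #include <stdint.h>

    struct idtr { uint16_t limit; uint32_t base; } __attribute__((packed));

    static struct idtr null_idt = { 0, 0 };   /* any NMI => guaranteed triple fault */
    static struct idtr saved_idt;

    void switch_to_pae(uint32_t pae_pdpt_phys)
    {
        uint32_t cr0, cr4;

        __asm__ volatile ("cli");                         /* disable IRQs */
        __asm__ volatile ("sidt %0" : "=m" (saved_idt));
        __asm__ volatile ("lidt %0" : : "m" (null_idt));

        __asm__ volatile ("mov %%cr0, %0" : "=r" (cr0));
        cr0 &= ~0x80000000u;                              /* clear CR0.PG: disable paging */
        __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0));

        __asm__ volatile ("mov %%cr4, %0" : "=r" (cr4));
        cr4 |= 0x20u;                                     /* set CR4.PAE */
        __asm__ volatile ("mov %0, %%cr4" : : "r" (cr4));
        __asm__ volatile ("mov %0, %%cr3" : : "r" (pae_pdpt_phys));

        cr0 |= 0x80000000u;                               /* set CR0.PG: re-enable paging */
        __asm__ volatile ("mov %0, %%cr0" : : "r" (cr0));

        __asm__ volatile ("lidt %0" : : "m" (saved_idt));
        __asm__ volatile ("sti");                         /* re-enable IRQs */
    }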


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by kzinti »

The right thing to do is enable PAE when you turn on paging in the first place. Not disable existing paging and then re-enable it.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
kzinti wrote:The right thing to do is enable PAE when you turn on paging in the first place. Not disable existing paging and then re-enable it.
The right thing depends on the design of the boot code.

For example; my boot code is split into multiple parts; where the first part (boot loader) has to be designed to suit the boot device and firmware, and the second (Boot Abstraction Layer) doesn't know or care what the boot device is or what the firmware is (or which kernel will be used). The boot loader tests if the CPU is "80486 or later" extremely early (and displays a "CPU is too old" error message if it's not), then it sets up "plain 32-bit paging" (because that's all 80486 supports) and starts the BAL. The BAL can only assume "80486 or later, with plain 32-bit paging".
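
For reference, the usual "80486 or later" test is to try toggling the AC flag (bit 18) of EFLAGS, which an 80386 hardwires to zero - a sketch (assuming 32-bit protected mode and GCC inline assembly; not necessarily how Brendan's boot loader does it):

Code:
    #include <stdint.h>

    /* Returns non-zero if EFLAGS.AC can be toggled (80486+), zero on an 80386. */
    static int cpu_is_486_or_later(void)
    {
        uint32_t f1, f2;

        __asm__ volatile (
            "pushfl\n\t"
            "popl %0\n\t"              /* f1 = original EFLAGS */
            "movl %0, %1\n\t"
            "xorl $0x40000, %1\n\t"    /* flip AC (bit 18) */
            "pushl %1\n\t"
            "popfl\n\t"
            "pushfl\n\t"
            "popl %1\n\t"              /* f2 = EFLAGS after the write-back */
            "pushl %0\n\t"
            "popfl"                    /* restore the original EFLAGS */
            : "=&r" (f1), "=&r" (f2));

        return ((f1 ^ f2) & 0x40000) != 0;
    }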

Later the BAL starts other CPUs and determines the features that all CPUs support (e.g. if the boot CPU supports PAE and other CPUs don't, then not all CPUs support PAE and PAE shouldn't be used), and decides which "kernel setup module" to start. The "32-bit kernel setup module" may need to switch from plain 32-bit paging to PAE, and the "64-bit kernel setup module" must switch from plain 32-bit paging to long mode.

The funny thing is that when the firmware is 64-bit UEFI it's all the same - the BAL has to assume "80486 or later" and uses plain 32-bit paging (and the chosen "kernel setup module" may have to switch from plain 32-bit paging to PAE or long mode after) even when the boot loader uses 64-bit long mode.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by kzinti »

Brendan wrote:(e.g. if the boot CPU supports PAE and other CPUs don't, then not all CPUs support PAE and PAE shouldn't be used)
Systems with different CPUs? Have you ever seen one of these? I always understood that all CPU packages on a given motherboard must be identical. Is that not true? How often does it happen in practice?
Brendan wrote:The funny thing is that when the firmware is 64-bit UEFI it's all the same - the BAL has to assume "80486 or later" and uses plain 32-bit paging (and the chosen "kernel setup module" may have to switch from plain 32-bit paging to PAE or long mode after) even when the boot loader uses 64-bit long mode.
How do you call back to UEFI (Runtime Services) after all of this? Or do you simply not?
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
kzinti wrote:
Brendan wrote:(e.g. if the boot CPU supports PAE and other CPUs don't, then not all CPUs support PAE and PAE shouldn't be used)
Systems with different CPUs? Have you ever seen one of these? I always understood that all CPU packages on a given motherboard must be identical. Is that not true? How often does it happen in practice?
This is actually Intel's advice (from their "MultiProcessor Specification"). Specifically:
Intel wrote:B.8 Supporting Unequal Processors
Some MP operating systems that exist today do not support processors of different types, speeds, or capabilities. However, as processor lifetimes increase and new generations of processors arrive, the potential for dissimilarity among processors increases. The MP specification addresses this potential by providing an MP configuration table to help the operating system configure itself. Operating system writers should factor in processor variations, such as processor type, family, model, and features, to arrive at a configuration that maximizes overall system performance. At a minimum, the MP operating system should remain operational and should support the common features of unequal processors.
Basically; at a minimum, an OS should get the feature flags from all CPUs and AND them together to find the subset of features supported by all CPUs.
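
As a sketch of that minimum (names and the fixed-size array are illustrative; __get_cpuid() is GCC's <cpuid.h> helper):

Code:
    #include <stdint.h>
    #include <cpuid.h>

    #define MAX_CPUS 256

    static uint32_t cpu_edx[MAX_CPUS], cpu_ecx[MAX_CPUS];

    /* Run once on each CPU during boot to record its CPUID.1 feature words. */
    void record_features(int cpu)
    {
        uint32_t eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            cpu_edx[cpu] = edx;    /* e.g. bit 6 = PAE, bit 25 = SSE */
            cpu_ecx[cpu] = ecx;
        }
    }

    /* AND everything together to get the features usable on every CPU. */
    void common_features(int ncpus, uint32_t *edx_out, uint32_t *ecx_out)
    {
        uint32_t edx = ~0u, ecx = ~0u;

        for (int i = 0; i < ncpus; i++) {
            edx &= cpu_edx[i];
            ecx &= cpu_ecx[i];
        }
        *edx_out = edx;
        *ecx_out = ecx;
    }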

In practice most motherboard manufacturers only test a few variations (e.g. same family and clock frequency with different model numbers) because fully testing every possible combination can get time consuming/expensive; but that doesn't mean the motherboards won't work for other "not officially supported/tested" combinations. The only real constraint is that the externally visible signals are compatible (e.g. both chips are using the same speed and protocol for the quick-path/hyper-transport interconnect between them).

Also in practice, I wouldn't want to assume that any existing OS does anything right. For example, (on all existing OSs) I'd fully expect that if a process tests that a feature exists (using CPUID on one CPU) and starts using that feature, it could be scheduled on a different CPU that doesn't support that feature and crash. Because existing OSs are "likely broken" anyway, motherboard manufacturers don't really need to care about supporting/testing many combinations.

Basically; at a minimum an OS should get the feature flags from all CPUs and AND them together to find the subset of features supported by all CPUs, and processes should never use CPUID directly (and should ask the OS for a sanitized "CPUID combination" to avoid bugs); but in practice most OSs fail to meet the minimum, and don't even provide a way for processes to get a sanitized "CPUID combination" (which is also bad for other reasons - just because a CPU says it supports a feature doesn't mean it's usable and not affected by errata).

Of course I'm never really happy with only doing the minimum. For my previous micro-kernels; the boot code and micro-kernel would only use the common subset of features; but the scheduler itself was aware of which features processes use and only allowed a process to be scheduled on CPUs that support those features. This means that (in theory) you could have a CPU that supports SSE and another that doesn't, and some processes that use SSE and some that don't, and it'd work properly. It also means that (in theory) you could have a system with an ancient 80486 and a recent "Core i7" and everything would work properly (but it'd be limited to "plain 32-bit paging").

I should probably mention that when I first did this I was having strange ideas about distributed hyper-visors; where the hyper-visor made it look like different real computers were just different NUMA domains in the same (virtual) computer. It sounded like a great idea at the time. It wasn't until later when I did some actual research (a crude benchmark using message passing to emulate cache coherency) that I decided the distributed hyper-visor idea was "less great". ;)
kzinti wrote:
Brendan wrote:The funny thing is that when the firmware is 64-bit UEFI it's all the same - the BAL has to assume "80486 or later" and uses plain 32-bit paging (and the chosen "kernel setup module" may have to switch from plain 32-bit paging to PAE or long mode after) even when the boot loader uses 64-bit long mode.
How do you call back to UEFI (Runtime Services) after all of this? Or do you simply not?
I don't use run-time services after boot.

If you look at the run-time services, there's only 1 thing that might be useful: the ability to use UEFI variables to store a reason/cause when the kernel panics and reboots. For the rest, either they only need to be done during boot, or (for things like resetting the computer and setting a "wake time" to turn the computer on at a certain time) they can be done through ACPI (or a motherboard driver) in a way that works for UEFI, BIOS and "boot from ROM".

I do want a way to store a reason/cause when the kernel panics and reboots that is persistent; but if I can't find a way to support that without UEFI's run-time services (that works regardless of what the firmware is) I'd rather go without it. Note that run-time services aren't quite as easy to use as they might seem. For example; I doubt it'd be easy for a 32-bit kernel to use 64-bit UEFI run-time services.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by embryo2 »

Brendan wrote:In practice most motherboard manufacturers only test a few variations (e.g. same family and clock frequency with different model numbers) because fully testing every possible combination can get time consuming/expensive; but that doesn't mean the motherboards won't work for other "not officially supported/tested" combinations. The only real constraint is that the externally visible signals are compatible (e.g. both chips are using the same speed and protocol for the quick-path/hyper-transport interconnect between them).
There can be something like one PC made of many motherboards plus some common memory. If an OS is going to support all those differences then it's going to be hard not only to design it, but also to test it, and even to find a case where it can really be used.

Out of curiosity, is there a "multi-motherboard" specification? Or is it about special cases for supercomputing and the like?
Brendan wrote:Because existing OSs are "likely broken" anyway, motherboard manufacturers don't really need to care about supporting/testing many combinations.
Obviously, then, they just haven't implemented a correct processor-to-processor interaction protocol - because existing OSs are "likely broken" anyway. And, obviously, if there's no protocol, then what can such a motherboard be used for? Only some special cases. But how can you know the exact case?

Basically, it seems highly unlikely to find a system with different CPUs that is designed to work according to widely accepted standards. So, the OS should support some special cases (is there documentation available?) or just assume there's no difference among the processors.
Brendan wrote:Basically; at a minimum an OS should get the feature flags from all CPUs and AND them together to find the subset of features supported by all CPUs, and processes should never use CPUID directly (and should ask the OS for a sanitized "CPUID combination" to avoid bugs); but in practice most OSs fail to meet the minimum, and don't even provide a way for processes to get a sanitized "CPUID combination" (which is also bad for other reasons - just because a CPU says it supports a feature doesn't mean it's usable and not affected by errata).
If most OSs do not support something, then how can that "something" work? It can work if the manufacturer just spends some money on handling the special case. It can be a design which supports some standards, but it can also be some special software. The latter is most probable. The former (as you have said) can be untested or even worse. And you can spend a lot of time supporting cases that just don't work.

It's economically viable to make things as similar as possible, and economically not viable to make a mix of something "special". A more realistic approach is to invest some time in supporting the clustering of different computers. But it seems you already have that on your radar with your messaging and all.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:In practice most motherboard manufacturers only test a few variations (e.g. same family and clock frequency with different model numbers) because fully testing every possible combination can get time consuming/expensive; but that doesn't mean the motherboards won't work for other "not officially supported/tested" combinations. The only real constraint is that the externally visible signals are compatible (e.g. both chips are using the same speed and protocol for the quick-path/hyper-transport interconnect between them).
There can be something like one PC made of many motherboards plus some common memory. If an OS is going to support all those differences then it's going to be hard not only to design it, but also to test it, and even to find a case where it can really be used.
In that case it'd be separate PCs with some common memory (e.g. using something like non-transparent bridges, which are typically just used as a point-to-point network device).
embryo2 wrote:Out of curiosity, is there a "multi-motherboard" specification? Or is it about special cases for supercomputing and the like?
For things like blade servers (which are mostly just separate computers on cards/blades that plug into a back-plane/chassis that provides power and networking) I'm not sure - it seems like there's lots of different "specifications" (e.g. several different ones from each vendor) where none of them are standard.

Depending on how you define "computer"; super-computers aren't computers - they're many separate computers connected via high-speed networking. They're a bit like blade servers, just with custom-designed blades and very few peripherals (e.g. processors, memory and networking and almost nothing else) to get higher densities and lower cost.
embryo2 wrote:
Brendan wrote:Because existing OSs are "likely broken" anyway, motherboard manufacturers don't really need to care about supporting/testing many combinations.
Obviously, then, they just haven't implemented a correct processor-to-processor interaction protocol - because existing OSs are "likely broken" anyway. And, obviously, if there's no protocol, then what can such a motherboard be used for? Only some special cases. But how can you know the exact case?

Basically, it seems highly unlikely to find a system with different CPUs that is designed to work according to widely accepted standards. So, the OS should support some special cases (is there documentation available?) or just assume there's no difference among the processors.
Currently; "multi-chip" 80x86 servers with (slightly) different CPUs are rare, and for "single chip" 80x86 systems all cores are identical (mostly because OSs/software don't support it properly and hardware manufacturers' hands are tied). In the future, who knows?

If you look at things like IBM's Cell or ARM's big.LITTLE you'll see different cores in the same chip. These exist because there are logical reasons (for performance and/or power consumption) to have one or more fast core/s (for things that need good single-thread performance) in addition to smaller slower cores (for things that benefit from many CPUs and/or for when the higher power consumption isn't justified). For example; it's not entirely improbable that within the next 5 years we'll be able to get a normal "8 fast cores" Xeon chip and a "60+ slower cores" Xeon Phi chip and plug both into the same dual-socket motherboard (which isn't that much different to the existing "main CPU + Xeon Phi co-processor plugged into a PCI slot" that we have now, except it'd be using quick-path instead of PCI-E for better bandwidth). In the same way it's not entirely improbable that eventually we'll see (e.g.) "4 fast cores plus 32 slow cores" chips (which is not that different to what we already have for laptop/desktop systems, except that what we have now is "fast 80x86 cores plus lots of slower GPU cores" on the same chip).

Also note that being able to add "supports computers with extremely different CPUs" onto a list of reasons why my OS is better than existing OSs (Windows, OS X, Linux) is something I consider useful (for future marketing); regardless of whether or not it's ever actually useful in practice. ;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by embryo2 »

Brendan wrote:In that case it'd be separate PCs with some common memory (e.g. using something like non-transparent bridges, which are typically just used as a point-to-point network device).
What do you think about your OS and supercomputing? Many blades with bridges aren't rare today. The specific case for supercomputing is hardware capable of running an application server for the whole Internet (millions of requests per second). Do you plan to address scalability issues of that magnitude?
Brendan wrote:Depending on how you define "computer"; super-computers aren't computers - they're many separate computers connected via high-speed networking.
Processors are also connected via a high-speed "peer to peer" network. Even if there's some cache hierarchy, it's still a network of connected nodes. The only difference is that the nodes are on the same PCB (or even on the same silicon).
Brendan wrote:If you look at things like IBM's Cell or ARM's big.LITTLE you'll see different cores in the same chip. These exist because there are logical reasons (for performance and/or power consumption) to have one or more fast core/s (for things that need good single-thread performance) in addition to smaller slower cores (for things that benefit from many CPUs and/or for when the higher power consumption isn't justified).
Yes, the reason is there. But I see here an analogy with your computer farm of many old computers and the need to somehow use them all. First came the farm, and then came the need. Also, heterogeneous processors are the product of some extra silicon. It's just there and we can use it "somehow". Why not use it if it's already there? But there's a drawback - it's a much more complicated thing to program. If things are identical then we can schedule our thread on any processor, but if things are different (like CPU and GPU cores) we need to recompile our program first just to be able to switch it between cores. And next follows the state exchange problem. We can get some extra speed after recompilation and a tricky scheduling implementation, but why not go a bit further and split things into arrays of arithmetic units and register files, for example? If there's no need for high-load computation then just a few arithmetic units are used, but if we need some extra power then we can spread the load across many units. And the units are all the same, they are identical; there's no need for recompilation or overly tricky scheduling. Just select independent units of work and assign an arithmetic unit to each unit of work. If there are more work units, more arithmetic units are employed; if there are fewer, fewer arithmetic units are used and less power is consumed. Things like Green Arrays fit ideally into such a picture.
Brendan wrote:For example; it's not entirely improbable that within the next 5 years we'll be able to get a normal "8 fast cores" Xeon chip and a "60+ slower cores" Xeon Phi chip and plug both into the same dual-socket motherboard
The overhead of a task switch is relatively high. Maybe it can be implemented as the equivalent of one pusha on one processor and then one popa on another, but the benefit here is just smaller caches and pipelines, instead of the flexible load distribution you get with many arithmetic units. The latter approach has greater potential, as I see it.
Brendan wrote:Also note that being able to add "supports computers with extremely different CPUs" onto a list of reasons why my OS is better than existing OSs (Windows, OS X, Linux) is something I consider useful (for future marketing); regardless of whether or not it's ever actually useful in practice. ;)
Well, then you can claim "it supports even martian computers!", regardless of whether or not it's ever actually useful in practice. It's not bad for marketing, but I see it as bad for your reputation among people who are a bit clued up. But of course, the marketing is up to you.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:In that case it'd be separate PCs with some common memory (e.g. using something like non-transparent bridges, which are typically just used as a point-to-point network device).
What do you think about your OS and supercomputing? Many blades with bridges aren't rare today. The specific case for supercomputing is hardware capable of running an application server for the whole Internet (millions of requests per second). Do you plan to address scalability issues of that magnitude?
For super-computers (and possibly HPC in general) you've typically got the same (single) application running on each node, the application is custom designed for the hardware, and all you really want from an OS is for it to get out of the way. For this, my OS would mostly be inappropriate. I'm not even sure that any OS makes sense for this; and I'd be tempted to want bootable applications (with boot code and networking implemented as a library and statically linked directly into the application).
embryo2 wrote:
Brendan wrote:Depending on how you define "computer"; super-computers aren't computers - they're many separate computers connected via high-speed networking.
Processors are also connected via a high-speed "peer to peer" network. Even if there's some cache hierarchy, it's still a network of connected nodes. The only difference is that the nodes are on the same PCB (or even on the same silicon).
The difference (at least for how I define "computer") is whether or not there's a single/shared physical address space. For NUMA systems (e.g. CPUs and memory connected by links) there is a single physical address space used by all the pieces; but for separate computers on a LAN, or blades or super-computers there isn't.
embryo2 wrote:
Brendan wrote:If you look at things like IBM's Cell or ARM's big.LITTLE you'll see different cores in the same chip. These exist because there are logical reasons (for performance and/or power consumption) to have one or more fast core/s (for things that need good single-thread performance) in addition to smaller slower cores (for things that benefit from many CPUs and/or for when the higher power consumption isn't justified).
Yes, the reason is there. But I see here an analogy with your computer farm of many old computers and the need to somehow use them all. First came the farm, and then came the need. Also, heterogeneous processors are the product of some extra silicon. It's just there and we can use it "somehow". Why not use it if it's already there? But there's a drawback - it's a much more complicated thing to program. If things are identical then we can schedule our thread on any processor, but if things are different (like CPU and GPU cores) we need to recompile our program first just to be able to switch it between cores. And next follows the state exchange problem. We can get some extra speed after recompilation and a tricky scheduling implementation, but why not go a bit further and split things into arrays of arithmetic units and register files, for example? If there's no need for high-load computation then just a few arithmetic units are used, but if we need some extra power then we can spread the load across many units. And the units are all the same, they are identical; there's no need for recompilation or overly tricky scheduling. Just select independent units of work and assign an arithmetic unit to each unit of work. If there are more work units, more arithmetic units are employed; if there are fewer, fewer arithmetic units are used and less power is consumed. Things like Green Arrays fit ideally into such a picture.
I'm not interested in designing hardware (it would cost millions/billions of dollars to create hardware that has any hope of gaining market share). For a "software only" project, I think 80x86 is the best (first) choice, as there's lots of people with PCs and no need to do strange things (jail breaking, etc) just to install an OS. If Green Arrays ever get popular (and if they're ever used in standardised systems and not just "CPU with random non-standard pain" like systems containing ARM CPUs) they'd be worth considering.
embryo2 wrote:
Brendan wrote:For example; it's not entirely improbable that within the next 5 years we'll be able to get a normal "8 fast cores" Xeon chip and a "60+ slower cores" Xeon Phi chip and plug both into the same dual-socket motherboard
The overhead of a task switch is relatively high. Maybe it can be implemented as the equivalent of one pusha on one processor and then one popa on another, but the benefit here is just smaller caches and pipelines, instead of the flexible load distribution you get with many arithmetic units. The latter approach has greater potential, as I see it.
Task switching wouldn't be much different to existing 80x86 (where you already want optimisations for locality in the scheduler).
embryo2 wrote:
Brendan wrote:Also note that being able to add "supports computers with extremely different CPUs" onto a list of reasons why my OS is better than existing OSs (Windows, OS X, Linux) is something I consider useful (for future marketing); regardless of whether or not it's ever actually useful in practice. ;)
Well, then you can claim "it supports even martian computers!", regardless of whether or not it's ever actually useful in practice. It's not bad for marketing, but I see it as bad for your reputation among people who are a bit clued up. But of course, the marketing is up to you.
Bad for reputation is bad for marketing (and making false claims gets you sued). While it might not be used in practice, being able to make the (valid) claim that it does support extremely different CPUs gives consumers the impression that it's a well designed/very flexible system (which is good for marketing and reputation).

Note that for my OS it'll be easy to prove it supports extremely different CPUs. Boot code does CPU feature detection (taking into account things like CPU errata, etc) and creates "sanitized CPU info" structures for the OS (which never uses CPUID directly and only ever uses the "sanitized CPU info" structures); so it'd be trivial to modify boot code to pretend some of the CPUs don't support a bunch of features even though they do, and in this way artificially create extremely different (as far as the OS knows) CPUs.
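
For example (a sketch with illustrative names; the bits shown are CPUID.1:EDX bits 25/26), the boot code could mask features out of the sanitized info for some CPUs:

Code:
    #include <stdint.h>

    #define FEATURE_SSE   (1u << 25)   /* CPUID.1:EDX bit 25 */
    #define FEATURE_SSE2  (1u << 26)   /* CPUID.1:EDX bit 26 */

    struct cpu_info { uint32_t features; /* ...type, errata fixups, etc. */ };

    /* Pretend every odd-numbered CPU lacks SSE/SSE2, so the rest of the OS
     * (which never uses CPUID directly) sees an extremely mixed system. */
    void fake_heterogeneous(struct cpu_info *info, int cpu)
    {
        if (cpu & 1)
            info->features &= ~(FEATURE_SSE | FEATURE_SSE2);
    }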


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by embryo2 »

Brendan wrote:For super-computers (and possibly HPC in general) you've typically got the same (single) application running on each node, the application is custom designed for the hardware, and all you really want from an OS is for it to get out of the way. For this, my OS would mostly be inappropriate. I'm not even sure that any OS makes sense for this; and I'd be tempted to want bootable applications (with boot code and networking implemented as a library and statically linked directly into the application).
The common things for powerful computers include a scheduler and process/thread support, drivers, a bootloader, and networking. Maybe it is a "specific OS", but it changes too infrequently to be kept together with the application logic. Its code should be isolated from applications, if only because of the useless complexity of many points of interaction. Application maintenance can be a problem if it includes the OS code. And statically linking things like a scheduler looks a bit strange. Of course, there can be some modules and the whole thing can be called "the application", but if there's a protocol for the interaction of application logic with general hardware-related and execution-organization functionality, then that is the border where the OS starts.
Brendan wrote:The difference (at least for how I define "computer") is whether or not there's a single/shared physical address space.
It's the hardware and drivers that are responsible for processor-to-memory interaction. x86 requires little software for such interaction, but things like caches and NUMA impose some constraints, and it is more efficient just to pay attention to those constraints in software. For GPUs it is necessary to provide some software means for efficient memory management. And the next step is transparent memory allocation over the network, which is also implemented in software. Supercomputers have a need for cooperation, and that need can be satisfied with the common memory approach.

Direct addressing of foreign memory with hardware-only solutions can be seen as a border of the "ordinary computer" (as opposed to the super-computer), but I just don't know if there are super-computers with hardware-supported foreign-memory interaction.
Brendan wrote:For a "software only" project, I think 80x86 is the best (first) choice, as there's lots of people with PCs and no need to do strange things (jail breaking, etc) just to install an OS.
Jailbreaking on Android has been simplified to the level where many advanced users just install things like Cyanogen instead of the vendor's distribution, with its limited user rights and a lot of bloatware that is impossible to uninstall.
Brendan wrote:If Green Arrays ever get popular (and if they're ever used in standardised systems and not just "CPU with random non-standard pain" like systems containing ARM CPUs) they'd be worth considering.
Yes, the time of the new hardware is still many years ahead. But the current way of using silicon is inefficient; most of the silicon is idle most of the time. But every realistic OS, of course, should support x86.
Brendan wrote:Task switching wouldn't be much different to existing 80x86 (where you already want optimisations for locality in the scheduler).
But locality optimizations alone cannot outperform things like AVX, with its simultaneous operations on 32 bytes or 16 words, if your OS employs the least-common-denominator approach. And if it uses all available features then there's no way to use the less feature-rich parts of the heterogeneous system. So, much of the silicon is idle all the time. But yes, the task switching can be simple in this case. What about perfectionism?
Brendan wrote:Bad for reputation is bad for marketing (and making false claims gets you sued).
Nobody knows what a martian computer is. So, you can safely claim your OS can manage it. If a proof is required then you can invent whatever you want and call it "the martian computer".

For ordinary people, a claim like "the OS runs on some supercomputers" is very similar to the claim "it runs on every martian computer".
Brendan wrote:While it might not be used in practice, being able to make the (valid) claim that it does support extremely different CPUs gives consumers the impression that it's a well designed/very flexible system (which is good for marketing and reputation).
Yes, that's why you hate the marketing people. Don't you? My thing supports whatever you want, and there's no problem in the world - you just have no idea what "whatever" actually means. It's not a problem if you have to pay for the thing that supports "everything", because it "gives consumers the impression that it's a well designed/very flexible system".

A Swiss Army knife is less popular than an ordinary kitchen knife.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:For super-computers (and possibly HPC in general) you've typically got the same (single) application running on each node, the application is custom designed for the hardware, and all you really want from an OS is for it to get out of the way. For this, my OS would mostly be inappropriate. I'm not even sure that any OS makes sense for this; and I'd be tempted to want bootable applications (with boot code and networking implemented as a library and statically linked directly into the application).
The common things for powerful computers include a scheduler and process/thread support, drivers, a bootloader, and networking. Maybe it is a "specific OS", but it changes too infrequently to be kept together with the application logic. Its code should be isolated from applications, if only because of the useless complexity of many points of interaction. Application maintenance can be a problem if it includes the OS code. And statically linking things like a scheduler looks a bit strange. Of course, there can be some modules and the whole thing can be called "the application", but if there's a protocol for the interaction of application logic with general hardware-related and execution-organization functionality, then that is the border where the OS starts.
For a "one thread per CPU" model (where each thread/CPU is expected to be at 100% load all the time) you don't need a scheduler or much thread support (beyond maybe spinlocks). There'd only be one driver for the network card; and the networking stack could be very minimal (no need to support full TCP/IP, routing, etc - possibly just enough to implement MPI directly on top of the physical layer). There's also no need for security (including isolation between kernel and process). Yes; it would look very different to a typical general purpose OS, but that's because it's not a general purpose OS in the first place (e.g. no user/s, no file IO, no multi-tasking, etc).
embryo2 wrote:
Brendan wrote:The difference (at least for how I define "computer") is whether or not there's a single/shared physical address space.
It's the hardware and drivers that are responsible for processor-to-memory interaction. x86 requires little software for such interaction, but things like caches and NUMA impose some constraints, and it is more efficient just to pay attention to those constraints in software. For GPUs it is necessary to provide some software means for efficient memory management. And the next step is transparent memory allocation over the network, which is also implemented in software. Supercomputers have a need for cooperation, and that need can be satisfied with the common memory approach.
For "common memory"; the additional networking needed for cache coherency alone would probably triple the cost of the super-computer (or more realistically, slash the number of nodes you can afford for the same budget). On top of that you're probably going to need to find CPUs that can actually support >64-bit addresses; which means forgetting about cheap/commodity hardware (e.g. CPUs and/or GPUs designed for normal workstation/server that have been recycled). Basically; it's not even close to being viable.
embryo2 wrote:
Brendan wrote:Task switching wouldn't be much different to existing 80x86 (where you already want optimisations for locality in the scheduler).
But locality optimizations alone cannot outperform things like AVX, with its simultaneous operations on 32 bytes or 16 words, if your OS employs the least-common-denominator approach. And if it uses all available features then there's no way to use the less feature-rich parts of the heterogeneous system. So, much of the silicon is idle all the time. But yes, the task switching can be simple in this case. What about perfectionism?
For current "80x86 NUMA" hardware, you typically want to run a process on the same NUMA domain so you can allocate memory that's "close" to that NUMA domain (and avoid the penalties caused by accessing memory that's further away from those CPUs). If you migrate a process from one NUMA domain to another you end up with "worst case NUMA penalties" (or re-allocating all the memory the process uses to avoid those penalties, which isn't cheap either); so you try very hard to make sure processes aren't migrated from one NUMA domain to another. Also; for most normal OS's there's a "CPU affinity" feature that allows a process to be "pinned"/restricted to a sub-set of CPUs.

If you add "CPU feature constraints" on top of that you end up with something approximating my OS design. To do this you just use the features the process requires to pre-determine the CPU affinity for the process. E.g. if some CPUs support a feature (e.g. AVX-512) and some don't; then you set up the CPU affinity so that a process that uses that feature can't be run on a CPU that doesn't support it. Note: It's slightly more complicated than this because supporting different CPUs also affects the code that handles the "delayed FPU/MMX/SSE/AVX state saving" mechanism.

I do these things for normal 80x86; so for a theoretical/future "normal Xeon + Phi hybrid" system it'd be no different to what I already do and I doubt I'd need to change anything at all.
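
A sketch of that pre-computed affinity step (illustrative only - it assumes at most 64 CPUs and a single 32-bit "required features" word rather than a 256-bit bitfield):

Code:
    #include <stdint.h>

    /* A CPU is allowed if its feature set is a superset of what the process needs. */
    uint64_t affinity_from_features(const uint32_t cpu_features[], int ncpus,
                                    uint32_t required)
    {
        uint64_t mask = 0;

        for (int i = 0; i < ncpus; i++)
            if ((cpu_features[i] & required) == required)
                mask |= 1ull << i;

        return mask;   /* the scheduler never runs the process on other CPUs */
    }
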
embryo2 wrote:
Brendan wrote:Bad for reputation is bad for marketing (and making false claims gets you sued).
Nobody knows what a martian computer is. So, you can safely claim your OS can manage it. If a proof is required then you can invent whatever you want and call it "the martian computer".

For ordinary people, a claim like "the OS runs on some supercomputers" is very similar to the claim "it runs on every martian computer".
It's plausible that the "runs on some supercomputers" claim is valid, and (assuming it is) an OS developer can prove it. It's not plausible that the "runs on some martian computers" claim is valid, and it's impossible for an OS developer to prove it.

Note that "it runs on every martian computer" is slightly different because it's likely to be a vacuous truth (e.g. if no martian computers exist then every OS runs on every martian computer). However, it's only slightly different, because you'll probably still get sued for false advertising (using a vacuous truth to mislead people).
embryo2 wrote:
Brendan wrote:While it might not be used in practice, being able to make the (valid) claim that it does support extremely different CPUs gives consumers the impression that it's a well designed/very flexible system (which is good for marketing and reputation).
Yes, that's why you hate the marketing people. Don't you? My thing supports whatever you want, and there's no problem in the world - you just have no idea what "whatever" actually means. It's not a problem if you have to pay for the thing that supports "everything", because it "gives consumers the impression that it's a well designed/very flexible system".

A Swiss Army knife is less popular than an ordinary kitchen knife.
The theory of capitalist systems is that better products/services succeed (are more profitable) and worse products/services don't, and this (combined with a desire for businesses/companies to make profit) causes better products/services for consumers. I personally think this theory fails in practice for multiple reasons (some that are prevented by law and some that aren't); and one of the reasons is that consumers can't easily make informed choices because they're fed advertising instead of unbiased/fair comparisons. Essentially; bad products/services with excellent advertising succeed and excellent products/services with bad advertising fail; and this erodes the foundation of capitalism.

Mostly what I'm saying is that, using fair/unbiased information only, I want (at least some) consumers/users to choose my OS at some point in the future; and to do that I need to create reasons why (using fair/unbiased information only) at least some consumers/users would want to choose my OS instead of an alternative. Supporting different CPUs in the same system is just one more (relatively small/minor) way I can do that.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
embryo2
Member
Posts: 397
Joined: Wed Jun 03, 2015 5:03 am

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by embryo2 »

Brendan wrote:For a "one thread per CPU" model (where each thread/CPU is expected to be at 100% load all the time) you don't need a scheduler or much thread support (beyond maybe spinlocks). There'd only be one driver for the network card; and the networking stack could be very minimal (no need to support full TCP/IP, routing, etc - possibly just enough to implement MPI directly on top of the physical layer). There's also no need for security (including isolation between kernel and process). Yes; it would look very different to a typical general purpose OS, but that's because it's not a general purpose OS in the first place (e.g. no user/s, no file IO, no multi-tasking, etc).
I can't speak for all supercomputers, but many modern powerful clusters are used like a typical server, where there is a queue of scheduled tasks and a need for optimal task execution. Tasks are not sporadic and are planned by humans, but once they have a list of tasks, all the rest is the server's responsibility. So, the server needs a set of executables (tasks) and runs them in some order. People here need some means to control task execution and the resources it needs. So, there is a console for admins (aren't they the missing users?) and there is an environment that runs the whole thing. And at least some clusters use Unix-like OSes as that environment. Even more, if a task has bugs it can crash all the other tasks, leading to reboots and downtime of very expensive hardware. I see here a great need for an OS.

Another kind of cluster is something like Google or Facebook are using. The task there is always the same, but its parts are very different, as is the case with every internet server. So they need some means to manage the task's parts (application updates, for example) and the task's distribution across the available (clustered) computers. And because it's very hard to maintain one big application, usually there are many applications. So, we have many processes and many threads distributed across many computers. What means of communication are available to the clustered computers? It's just the same TCP/IP. Where is the data for every application kept? On the same RAID controllers with a lot of HDDs. So, again there is a need for an OS. And not for something stripped down, but for something feature rich.
Brendan wrote:For "common memory"; the additional networking needed for cache coherency alone would probably triple the cost of the super-computer
The common memory here is used for communication. It's not the goal to keep it on par with the processor's speed. It's like caches vs memory on x86.
Brendan wrote:On top of that you're probably going to need to find CPUs that can actually support >64-bit addresses;
Clusters are usually built around a lot of Intel's chips, and they deliver very high processing power. Specialized supercomputer designs aren't ready to compete with the raw numbers of similar processors. Intel's processors are fast, and doubling the speed here is a big problem, so it's easier to create a cluster with slower processors used in great quantities. It's cheap, and total performance can be as big as the task is scalable. And usually supercomputer tasks are scalable, because of the effort put into task parallelization. I doubt there are ways to substantially increase the performance of a sequential task. Then the only problem left is communication within the cluster, and here some special hardware can help. But that's not a processing-related thing.
Brendan wrote:For current "80x86 NUMA" hardware, you typically want to run a process on the same NUMA domain so you can allocate memory that's "close" to that NUMA domain (and avoid the penalties caused by accessing memory that's further away from those CPUs). If you migrate a process from one NUMA domain to another you end up with "worst case NUMA penalties" (or re-allocating all the memory the process uses to avoid those penalties, which isn't cheap either); so you try very hard to make sure processes aren't migrated from one NUMA domain to another. Also; for most normal OS's there's a "CPU affinity" feature that allows a process to be "pinned"/restricted to a sub-set of CPUs.

If you add "CPU feature constraints" on top of that you end up with something approximating my OS design. To do this you just use the features the process requires to pre-determine the CPU affinity for the process. E.g. if some CPUs support a feature (e.g. AVX-512) and some don't; then you set up the CPU affinity so that a process that uses that feature can't be run on a CPU that doesn't support it. Note: It's slightly more complicated than this because supporting different CPUs also affects the code that handles the "delayed FPU/MMX/SSE/AVX state saving" mechanism.

I do these things for normal 80x86; so for a theoretical/future "normal Xeon + Phi hybrid" system it'd be no different to what I already do and I doubt I'd need to change anything at all.
Is your OS capable of compiling bytecode for different processors? Do you use AOT? How do you decide which processor should be used by which application? How do you know if application1 needs more processing power than application2? Or are you going to run the same application on different CPUs? And to have it compiled for every CPU. And to cope with its thread interaction when threads are running on different CPUs? It isn't as simple as "I doubt I'd need to change anything at all".
Brendan wrote:The theory of capitalist systems is...
It's probably for the auto-delete forum :) But yes, it fails, I agree.
Brendan wrote:Mostly what I'm saying is that, using fair/unbiased information only, I want (at least some) consumers/users to choose my OS at some point in the future; and to do that I need to create reasons why (using fair/unbiased information only) at least some consumers/users would want to choose my OS instead of an alternative. Supporting different CPUs in the same system is just one more (relatively small/minor) way I can do that.
If it's about a rational assessment of your efforts then it's better to concentrate on something visually attractive or other "cool" stuff. But if it's about perfectionism then yes, you can try to justify your efforts with thoughts like the above, but it's just for your personal consumption. Psychologically it helps, but the [efforts]/[attracted users] ratio won't justify it.
My previous account (embryo) was accidentally deleted, so I have no chance but to use something new. But may be it was a good lesson about software reliability :)
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: How to change from non-PAE paging mode to PAE paging mode?

Post by Brendan »

Hi,
embryo2 wrote:
Brendan wrote:For a "one thread per CPU" model (where each thread/CPU is expected to be at 100% load all the time) you don't need a scheduler or much thread support (beyond maybe spinlocks). There'd only be one driver for the network card; and the networking stack could be very minimal (no need to support full TCP/IP, routing, etc - possibly just enough to implement MPI directly on top of the physical layer). There's also no need for security (including isolation between kernel and process). Yes; it would look very different to a typical general purpose OS, but that's because it's not a general purpose OS in the first place (e.g. no user/s, no file IO, no multi-tasking, etc).
I can't speak for all supercomputers, but many modern powerful clusters are used like a typical server, where there is a queue of scheduled tasks and a need for optimal task execution. Tasks are not sporadic and are planned by humans, but once they have a list of tasks, all the rest is the server's responsibility. So, the server needs a set of executables (tasks) and runs them in some order. People here need some means to control task execution and the resources it needs. So, there is a console for admins (aren't they the missing users?) and there is an environment that runs the whole thing. And at least some clusters use Unix-like OSes as that environment. Even more, if a task has bugs it can crash all the other tasks, leading to reboots and downtime of very expensive hardware. I see here a great need for an OS.

Another kind of cluster is something like Google or Facebook are using. The task there is always the same, but its parts are very different, as is the case with every internet server. So they need some means to manage the task's parts (application updates, for example) and the task's distribution across the available (clustered) computers. And because it's very hard to maintain one big application, usually there are many applications. So, we have many processes and many threads distributed across many computers. What means of communication are available to the clustered computers? It's just the same TCP/IP. Where is the data for every application kept? On the same RAID controllers with a lot of HDDs. So, again there is a need for an OS. And not for something stripped down, but for something feature rich.
As far as I know supercomputers are mostly used for scientific modelling (e.g. protein folding) where there's a single task that requires a massive amount of mathematical calculations and "floating point operations per second" is the main design goal. It's these systems I was talking about, and it's for these systems I was proposing "bootable application".

Other distributed systems (e.g. Google, Facebook) aren't considered supercomputers and have completely different usage and goals (e.g. handling many small independent tasks concurrently, without much/any processing, where things like bandwidth and storage are the main design goals and not FLOPs). I wasn't talking about these systems.
embryo2 wrote:
Brendan wrote:For "common memory"; the additional networking needed for cache coherency alone would probably triple the cost of the super-computer
The common memory here is used for communication. It's not the goal to keep it on par with the processor's speed. It's like caches vs memory on x86.
Then this is not "same physical address space shared by all nodes", and therefore doesn't meet my definition of "computer" (and is a cluster of many computers instead).
embryo2 wrote:
Brendan wrote:For current "80x86 NUMA" hardware, you typically want to run a process on the same NUMA domain so you can allocate memory that's "close" to that NUMA domain (and avoid the penalties caused by accessing memory that's further away from those CPUs). If you migrate a process from one NUMA domain to another you end up with "worst case NUMA penalties" (or re-allocating all the memory the process uses to avoid those penalties, which isn't cheap either); so you try very hard to make sure processes aren't migrated from one NUMA domain to another. Also; for most normal OS's there's a "CPU affinity" feature that allows a process to be "pinned"/restricted to a sub-set of CPUs.

If you add "CPU feature constraints" on top of that you end up with something approximating my OS design. To do this you just use the features the process requires to pre-determine the CPU affinity for the process. E.g. if some CPUs support a feature (e.g. AVX-512) and some don't; then you set up the CPU affinity so that a process that uses that feature can't be run on a CPU that doesn't support it. Note: It's slightly more complicated than this because supporting different CPUs also affects the code that handles the "delayed FPU/MMX/SSE/AVX state saving" mechanism.

I do these things for normal 80x86; so for a theoretical/future "normal Xeon + Phi hybrid" system it'd be no different to what I already do and I doubt I'd need to change anything at all.
Is your OS capable of compiling bytecode for different processors? Do you use AOT? How do you decide which processor should be used by which application? How do you know if application1 needs more processing power than application2? Or are you going to run the same application on different CPUs? And to have it compiled for every CPU.
Previous versions of my OS didn't use bytecode. For these, the executable file format included information about the CPU it was designed for (CPU type and a 256-bit bitfield of required/used CPU features) plus some other information (how much processing, memory and communication the executable does); where all this information (plus load statistics, etc) is used by the OS to determine where the executable should be run.

The current (temporary) version will be mostly the same initially. My plan is to create enough of the OS so that a native IDE and toolchain (and AOT compiler) can be implemented on top of it; and then port one piece at a time into my own language (with my own tools); so that eventually the current version will become something that does use bytecode and AOT.

For the final system; the programmer will still provide some information (how much processing, memory and communication the executable does) that's used by the OS to determine where an executable should run; but the "CPU type and features" information will be generated by the byte-code to native compiler, and the OS will cache multiple versions of the final/native executable (one for each different "CPU type and features"). When a process is being started the OS will use the processing/memory/communication information (from the byte-code) to determine the best place for the executable to be run; but then it will examine the cache of the final/native executable versions. Mostly:
  • If there's already final/native executable that suits where the OS wants to run the executable; use that. This will be the most common case.
  • Otherwise; determine if there's something "close enough" to use for now (either a final/native executable that's close enough for the best place for the executable to be run; or a final/native executable for a different "less best" place for the executable to be run); and:
    • If there is something "close enough" to use; then use it (to avoid delays caused by byte-code to native compiler) but also compile the byte-code to native to suit the "CPU type and features" in the background so that it exists next time the executable is executed.
    • If there isn't anything "close enough" (either the delays caused by byte-code to native compiler can't be avoided or the delays are preferable to slower run-time and/or worse load balancing); compile the byte-code to native to suit the "CPU type and features" and then use that.
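
A toy sketch of that lookup policy (every type and helper here is invented for illustration; "close enough" is simplified to "same CPU type, and the cached build uses no feature the target CPU lacks"):

Code:
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t cpu_type; uint32_t features; } target_t;
    typedef struct { target_t target; char path[64]; int valid; } cache_entry;

    #define CACHE_SLOTS 8
    static cache_entry cache[CACHE_SLOTS];

    static cache_entry *lookup_exact(target_t t)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (cache[i].valid && cache[i].target.cpu_type == t.cpu_type
                               && cache[i].target.features == t.features)
                return &cache[i];
        return 0;
    }

    static cache_entry *lookup_close(target_t t)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (cache[i].valid && cache[i].target.cpu_type == t.cpu_type
                               && (cache[i].target.features & ~t.features) == 0)
                return &cache[i];
        return 0;
    }

    static cache_entry *compile_for(target_t t)   /* stand-in for the AOT pass */
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (!cache[i].valid) {
                cache[i].valid = 1;
                cache[i].target = t;
                snprintf(cache[i].path, sizeof cache[i].path,
                         "native-%08x-%08x", t.cpu_type, t.features);
                return &cache[i];
            }
        return 0;
    }

    cache_entry *get_native(target_t best)
    {
        cache_entry *e = lookup_exact(best);
        if (e)
            return e;               /* the most common case */
        e = lookup_close(best);
        if (e) {
            compile_for(best);      /* would really run in the background */
            return e;               /* use the close match for now */
        }
        return compile_for(best);   /* the compile delay is unavoidable */
    }
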
embryo2 wrote:And to cope with its thread interaction when threads are running on different CPUs? It isn't as simple as "I doubt I'd need to change anything at all".
For my system an application is multiple cooperating processes. All of the stuff above (to determine where an executable/process should be run) only cares about processes, and not applications, and not threads. For a multi-threaded process, each of the process' threads use CPUs that match the process' requirements.
embryo2 wrote:
Brendan wrote:Mostly what I'm saying is that, using fair/unbiased information only, I want (at least some) consumers/users to choose my OS at some point in the future; and to do that I need to create reasons why (using fair/unbiased information only) at least some consumers/users would want to choose my OS instead of an alternative. Supporting different CPUs in the same system is just one more (relatively small/minor) way I can do that.
If it's about a rational assessment of your efforts then it's better to concentrate on something visually attractive or other "cool" stuff. But if it's about perfectionism then yes, you can try to justify your efforts with thoughts like the above, but it's just for your personal consumption. Psychologically it helps, but the [efforts]/[attracted users] ratio won't justify it.
You're underestimating what would be required to compete favourably against existing OSs. If people start switching from existing OSs to your OS because your OS looks better, then within 6 months the existing OSs will improve their appearance and you'll be screwed.

Understand that for users there are "switching costs" - the time it takes users to find/obtain/set up the applications they need, work around compatibility problems, and become so familiar with the OS and the applications that they feel comfortable/productive and usage becomes habitual (in an "I can use this software in my sleep" way). Initially these switching costs work against you - an OS doesn't just have to be better than existing OSs, it has to be so much better that it justifies the switching costs. Immediately after switching to your OS users can switch back to whatever existing OS they were using before (there's almost no "switching cost" because they're still familiar with that previous OS).

I'd estimate that it'd take a few years before the "switching costs" starts to work in your favour, and people using your OS don't want to switch to another OS (even if it's an OS they were using a few years ago, and even if the other OS is slightly better than yours) because the hassle of switching to another OS is too high.

Basically; your OS needs to be significantly better (not just better), then has to remain better (not necessarily significantly better) for multiple years (while other OSs are improving and trying to catch up). Ideally; you'd want multiple reasons why your OS is better and not just one, where at least some of those reasons are extremely hard (time consuming) for existing OSs to copy/adopt (or are impossible for existing OSs to copy/adopt - e.g. patents).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.