Re: can IPI be cancelled?
Posted: Sun Feb 14, 2021 11:06 pm
xeyes wrote:
> Things could also head a different way. Did you know 10 years ago that a normal desktop can also have dozens of CPUs (cores) these days?

The trend was already ongoing back then, yes. In the past, manufacturers improved single-core performance by cranking up clock speeds, improving their instruction decoders, and adding machinery behind the scenes to make things faster, like caches.
These days we have pretty much hit a speed barrier: beyond roughly 3.5 GHz, heat generation is so high that it requires specialized cooling solutions. I have seen a CPU running at 5 GHz, but that CPU was cooled with liquid nitrogen, and even then we are talking about less than a two-fold increase in clock speed. Instruction decoders are already heavily optimized, so any gain in that field is going to be marginal at best. Caches, too, have diminishing returns: we already have three levels of caching, with the largest level measured in megabytes.
It does not look to me as if there is a whole lot of room left to improve single-core performance. We are firmly in tradeoff territory now, where almost any gain in one area is offset by a loss somewhere else. So CPU manufacturers chose instead to head in the direction of parallelization. That started about a decade ago, and since then we have seen more and more cores, and vector units with larger and larger registers.

So large, in fact, that they cause ABI problems: with AVX512, the register file runs into the kilobytes. On a Unix-like OS, when a signal is delivered, the interrupted thread's register state, including FPU and vector registers, is saved on the signal stack. Usually the cost of that is marginal, and so OSes have historically allowed very small signal stacks and thread stacks. With AVX512, however, even one signal can eat away a large part of a small stack and leave it unable to call many functions. This does affect the ABI, since PTHREAD_STACK_MIN is a compile-time constant. And so you can get programs to start misbehaving simply by upgrading your CPU.
xeyes wrote:
> Firstly it is very hard to make MP program that are 100% correct, but much much harder to make MP program that are 100% correct and scales well beyond more than a dozen cores.

That is very true. Add to that the fact that some programs are just inherently sequential (like compilers), and there is not a whole lot you can do. You can try running more programs at the same time, but many programs have nothing to do for long stretches. So the ever-growing core counts are less and less useful outside specialized workloads. Even for something like compiling big OSS projects there are diminishing returns, as running more compilers simultaneously just clogs up the I/O bandwidth of the system. SSDs and NVMe are helping there, of course. But if you have a Threadripper and a spinning-rust hard drive, most of your cores are going to be twiddling their thumbs all the time.
xeyes wrote:
> You don't want to play with v86 mode and get a taste of hypervisor/VMM before your kernel can be a true hypervisor/VMM that boots mainstream modern OSes?

I have no interest in virtualization.
xeyes wrote:
> Do you see any parallel between "I don't want to deal with the details of memory management so I use GC" and "I don't want to deal with the details of memory ordering so I target high level C++ machine"?

That isn't what I wrote. Adding GC as an engineering decision is fine: you are merely taking stock of your tools and deciding which ones to use. Adding GC as a bugfix, however, is not. If the whole program is designed around manual memory management and you merely have a bug in one place, then adding GC is an oversized measure compared to just fixing the bug. That is what I meant by shoddy craftsmanship. It is the lazy answer, the answer of a programmer who refuses to take pride in their work.
xeyes wrote:
> I'm more than a bit surprised that what you get out of that post is something like "the guy doesn't know how to set two variables both to 1 and has bugs in such a simple program".

That is also not what I wrote. He was demonstrating memory reordering, and quite properly so. And while his fix with the manual barrier still isn't correct, because it fails to resolve the data race, at least he understood the problem. My criticism was that he presented limiting CPU affinity as a "solution". It is not; it simply hides the bug. The data race persists, and maybe in the future something will come along to unhide it again.
See, in C (and likely C++, but I haven't checked) you are not really programming the CPU you are currently on. You are programming an abstract machine, defined in the language standard. That abstract machine has weird semantics, which exist solely to allow it to be mapped onto any existing CPU in a high-performance way. That is why shifting by the operand type's width or more is undefined: it allows a bit shift to be compiled down to the CPU's native shift instruction, no matter what that instruction does with oversized shift counts. In this case, the program as shown in the blog post has a data race. A data race is undefined behavior, and undefined behavior will at one point or another rear its head and bite you.
I also don't really want to deal with the low-level semantics of my hardware, which is why I would rather just use the atomic operations my compiler provides and can translate for any CPU. That is the real solution to the problem, and the one the blog author did not mention. I cannot blame him: the post was from 2012, so the ink on C11 hadn't even dried yet, and prior to C11 the language simply gave no semantics to multithreaded access to the same memory location at all.