Do you agree with linus regarding parallel computing?
Posted: Mon Jan 19, 2015 8:26 am
Combuster wrote:
"4 cores is enough" sounds just as future-proof as "640KB is enough". Actually, we have had 4 cores in household equipment for two decades now. Did that prove to be sufficient?

Yes and no.
Combuster wrote:
"4 cores is enough" sounds just as future-proof as "640KB is enough".

I had this exact same thought.
Whatever performance gains hardware manufacturers make, software developers take away.

SpyderTL wrote:
I had this exact same thought.
Combuster wrote:
"4 cores is enough" sounds just as future-proof as "640KB is enough".
You could give Microsoft, Apple, or the Linux developers a 128-core system with 1 TB of RAM, and within a year, they would find a way to make it take 50 seconds to boot to the login screen.
More hardware just makes software developers lazier.
You do have to admit that a quad-core 4 GHz machine is better than a 4.77 MHz machine from 1985. Is it 4 thousand times better? Not really. But it is 10 times better. Perhaps that is the point that Linus is making?
dansmahajan wrote:
http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html
So what do you think?

First, let's put this into context.
dansmahajan wrote:
http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html
So what do you think?

I think that Linus has expressed his quick thoughts without any serious analysis. A spontaneous expression of feelings from past experience mostly works as advertising hype, or as the trigger for a massive flood of posts. And here we have an example of that flooding.
A "wimpy core" is just as stupid as it's user (a programmer). And "out-of-order core" is no more smart than it's programmer. If a programmer can grasp around an algorithm, then there can be some performance. But if algorithm implementation details are hidden under a great deal of caches, out-of-order things and other intervening stuff, then the performance will be just an accident.Brendan wrote:For example, rather than having an 8-core CPU at 3 GHz consuming 85 watts you could have a 32-core CPU running at 1.5 Ghz consuming the same 85 watts and be able to do up to twice as much processing.
Now, some people take this to extremes. They suggest cramming a massive number of very slow wimpy cores (e.g. without things like out-of-order execution, branch prediction, tiny caches, etc) onto a chip. For example, maybe 200 cores running at 500 Mhz. This is part of what Linus is saying is silly, and for this part I think Linus is right. Things like out-of-order execution help, even for "many core", and those wimpy cores would be stupid.
Brendan wrote:
Between the CPU and things like RAM and IO devices there's a whole pile of stuff (caches, memory controllers, etc). As the number of cores increases all of that stuff becomes a problem - twice as many CPUs fighting for the same interconnect bandwidth do not go twice as fast.

Then we should think about how to feed our cores efficiently. If the hardware supports us in this quest (with caches and with simple, predictable operations, for example), then our cores will never stall. But if we have murky hardware that lives its own life, then of course we are simply unable to supply it with data and commands, because we do not know when (and often how) the data or commands should be delivered.
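To make the interconnect point concrete, here is a rough C sketch (my own illustration, not from the thread; the thread and iteration counts are arbitrary). Four threads increment either a single shared counter, so every increment drags the same cache line across the interconnect, or their own padded per-core counters. On most multicore machines the second phase is dramatically faster even though it does the same number of additions.
[code]
/* Sketch: why "twice as many CPUs fighting for the same interconnect
 * bandwidth do not go twice as fast".  Build: cc -O2 -pthread contention.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define THREADS 4
#define ITERS   10000000L

static atomic_long shared_counter;            /* one cache line, fought over by every core */

struct padded { atomic_long n; char pad[64 - sizeof(atomic_long)]; };
static struct padded per_thread[THREADS];     /* one cache line per core */

static void *hammer_shared(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
    return NULL;
}

static void *hammer_private(void *arg) {
    struct padded *c = arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&c->n, 1, memory_order_relaxed);
    return NULL;
}

static double run(void *(*fn)(void *), int use_private) {
    pthread_t t[THREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, fn, use_private ? (void *)&per_thread[i] : NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    printf("shared counter:    %.2f s\n", run(hammer_shared, 0));
    printf("per-core counters: %.2f s\n", run(hammer_private, 1));
    return 0;
}
[/code]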
Brendan wrote:
The other (much larger?) problem is software scalability.

It is our way of thinking that scales badly. We are used to interacting with a computer in a perverse way, e.g. typing letters instead of talking, or even just thinking. To get rid of that perversion we need much more processing power. But Linus tells us that he needs no more processing power. So we are still typing and "mousing" around instead of doing those much simpler things.
Brendan wrote:
The "very fine grained actor model" is too fine grained - the overhead of communication between actors is not free and becomes a significant problem. The ideal granularity is something between those extremes. Basically; I think the ideal situation is where "actors" are about the size of threads.

But the next question is: how big should a thread be? And the answer is that there is an optimum processing-to-communication ratio. It is a very old rule, in fact: processing is limited by communication, and communication is limited by processing. These mutually interdependent parts of the same problem were identified very long ago, and the solution is an optimum - but the optimum is very hardware dependent. If we have only vague hints about the communication bottlenecks between a processor and memory, how can we find that optimum? And even the processor itself today has a lot of hidden machinery that prevents us from proving that a particular solution really is optimal. We need understandable hardware, and we need a way to manage the hardware at the lowest possible level instead of hoping for some "good" out-of-order behaviour. If the hardware is advertised as ready to "do its best" for us, then we have no way to reach an optimum; we only get "its best", with no way of knowing whether that really is the best or far worse than expected.
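The granularity trade-off discussed here can be put into numbers with a toy cost model (entirely my own, with made-up costs): if processing one item costs c and delivering one message costs o, the per-item time is c + o/n for n items per message, so the curve flattens once the batch is large enough that o/n is negligible - which is roughly the "actors about the size of threads" argument.
[code]
/* Toy cost model for actor/thread granularity (illustration only; the
 * numbers are invented, the shape of the curve is the point). */
#include <stdio.h>

int main(void) {
    const double compute_ns  = 100.0;    /* assumed cost to process one item */
    const double overhead_ns = 5000.0;   /* assumed cost to deliver one message */

    for (long batch = 1; batch <= 100000; batch *= 10) {
        double per_item   = compute_ns + overhead_ns / batch;
        double comm_share = (overhead_ns / batch) / per_item * 100.0;
        printf("items/message=%6ld  per-item=%8.2f ns  communication=%5.1f%%\n",
               batch, per_item, comm_share);
    }
    return 0;
}
[/code]
With one item per message, roughly 98% of the time goes into communication; with a few thousand items per message the overhead all but disappears, which is the "thread-sized actor" end of the scale.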
Brendan wrote:
He's saying that for most normal desktop software more cores won't help much, which is a little like saying "excluding all the cases where parallelism is beneficial, parallelism isn't beneficial".

Yes, it's really the only point of the message.
Brendan wrote:
Also note that there are people (including me) that want CPUs (with wide SIMD - e.g. AVX) to replace GPUs. For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing - partly because it'd be more able to adapt to the current load (e.g. rather than having the GPU doing nothing while CPUs are struggling with load and/or CPUs doing nothing while the GPU is struggling with load); but also partly because supporting GPUs is a massive pain in the neck due to the lack of common standards between Nvidia, ATI/AMD and Intel and poor and/or non-existent documentation; and also partly because I think there are alternative graphics pipelines that need to be explored but don't suit the "textured polygon pipeline" model GPUs force upon us.

I hope you agree that this is just yet another call for hardware simplicity and manageability. Otherwise the optimum will always slip away from us.
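For the "CPU cores with wide SIMD instead of a GPU" idea, the kind of kernel involved might look like the following AVX sketch (my own toy example, not anything proposed in the thread): blending two grayscale scanlines, eight pixels per instruction, with a scalar tail for the leftovers.
[code]
/* Sketch only: pixel work done with wide CPU SIMD instead of a GPU.
 * Build with: cc -O2 -mavx blend.c  (function names and sizes are my own) */
#include <immintrin.h>
#include <stdio.h>

static void blend_scanline(const float *src, float *dst, float alpha, int width) {
    __m256 va = _mm256_set1_ps(alpha);
    __m256 vb = _mm256_set1_ps(1.0f - alpha);
    int x = 0;
    for (; x + 8 <= width; x += 8) {            /* 8 pixels per instruction */
        __m256 s = _mm256_loadu_ps(src + x);
        __m256 d = _mm256_loadu_ps(dst + x);
        __m256 r = _mm256_add_ps(_mm256_mul_ps(s, va), _mm256_mul_ps(d, vb));
        _mm256_storeu_ps(dst + x, r);
    }
    for (; x < width; x++)                      /* scalar tail */
        dst[x] = src[x] * alpha + dst[x] * (1.0f - alpha);
}

int main(void) {
    float src[16], dst[16];
    for (int i = 0; i < 16; i++) { src[i] = 1.0f; dst[i] = 0.0f; }
    blend_scanline(src, dst, 0.25f, 16);
    printf("dst[0] = %.2f\n", dst[0]);          /* expect 0.25 */
    return 0;
}
[/code]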
OSwhatever wrote:
I don't know any language that can do this (but there are so many that I have probably overlooked it).

You can have this, today, not from the Sufficiently Smart Compiler, but from smart programming and expressive libraries.
Brendan wrote:
Basically; I think the ideal situation is where "actors" are about the size of threads. Essentially; I think the solution is threads communicating via asynchronous messaging (and not the actor model as it's typically described, and not a language like Erlang - in fact, a simple language sort of like C would be almost ideal if asynchronous messaging replaced retarded nonsense like linking).

Looks like the all-"new" and fashionable Hoare CSP model to me. So Rust or Go; they also solve the problem of GCing shared mutable data by not sharing. Haskell's Chans have also been there for years, but probably fail your simplicity test; there, there is no mutation at all. (Alef, anyone?) To be more specific: is Plan 9's "Design of a Concurrent Window System" "it"?
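Since the discussion keeps circling around "threads communicating via asynchronous messaging" - Go channels, Haskell Chans, CSP - here is what the bare mechanism looks like when hand-rolled in the "simple language sort of like C" that Brendan asks for. This is only a sketch: the channel type, its capacity, and the shutdown sentinel are all my own choices.
[code]
/* A minimal bounded "channel" between threads in plain C (pthreads):
 * a sketch of the threads-plus-asynchronous-messages style, roughly what
 * Go channels or Haskell Chans provide, done by hand. */
#include <pthread.h>
#include <stdio.h>

#define CAP 8

struct chan {
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
};

static void chan_init(struct chan *c) {
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->mu, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
}

static void chan_send(struct chan *c, int v) {   /* blocks only if the buffer is full */
    pthread_mutex_lock(&c->mu);
    while (c->count == CAP)
        pthread_cond_wait(&c->not_full, &c->mu);
    c->buf[c->tail] = v;
    c->tail = (c->tail + 1) % CAP;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->mu);
}

static int chan_recv(struct chan *c) {           /* blocks until a message arrives */
    pthread_mutex_lock(&c->mu);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->mu);
    int v = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->mu);
    return v;
}

static struct chan work;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int job = chan_recv(&work);
        if (job < 0) break;                      /* negative value = "shut down" */
        printf("worker got job %d\n", job);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    chan_init(&work);
    pthread_create(&t, NULL, worker, NULL);
    for (int i = 0; i < 5; i++)
        chan_send(&work, i);                     /* no shared mutable state, only messages */
    chan_send(&work, -1);
    pthread_join(t, NULL);
    return 0;
}
[/code]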
Brendan wrote:
For example, rather than having a modern "core i7" chip containing 8 CPU cores and 40 "GPU execution units" I'd rather have a chip containing 48 CPU cores (with no GPU at all) and use those CPU cores for both graphics and normal processing.

Did you mean: 512 cores? Seriously, 40 cores are a joke, even taking into account the weird SIMD/MIMD hybrid execution model.
Brendan wrote:
the "textured polygon pipeline" model GPUs force upon us

It is there for a reason: keeping the pipes saturated (throughput!) is easier if you know exactly how much data will flow through them, and when. Also, as you must already know, a texture sampler (or an AES or SHA round, for that matter) in specialized silicon is much smaller than a general-purpose processing unit that is equally fast at it. Sad; unsolved. Sorry, RISC, we all loved you.
Brendan wrote:
[Some people] suggest cramming a massive number of very slow wimpy cores (e.g. without out-of-order execution or branch prediction, and with only tiny caches) onto a chip. For example, maybe 200 cores running at 500 MHz.

I'm surprised to see no mention of Charles Moore's GA144; they even make absurdly expensive evaluation boards. (If it doesn't work, there should be some evidence for that, esotericity notwithstanding.)