Brendan wrote:embryo wrote:But why are there no such simple things as stable cached memory ranges or a dedicated memory bus? I mean here that we can avoid the "unknown" with some specialized hardware.
Because it's less efficient. For example, if everything is fast enough to handle the CPU's peak requirements, then the bandwidth (and power consumed to maintain it) is wasted whenever the CPU isn't using max. bandwidth.
It is the wrong way of thinking to expect that everything must be fast enough. I have tried to point to another way - a standard bottleneck identification routine. If the bottleneck is bandwidth, then the solution is obvious - increase the bandwidth. If power is the limiting factor, then make a memory bus that decreases its clock rate, or even stops its clock entirely, when there is no need for memory access. The GA144 mentioned before can stop all power consumption beyond the minimal transistor leakage when it is idle, so why can't a memory bus do the same?
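A minimal sketch of the policy I mean, assuming hypothetical gate_bus_clock() and set_bus_divider() hardware hooks (this shows only the idle-gating rule, not a real register interface):

Code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical hardware hooks - assumed for the sketch only. */
    void gate_bus_clock(bool gated);
    void set_bus_divider(uint32_t divider);

    #define IDLE_CYCLES_BEFORE_GATING 64  /* assumed tuning threshold */

    static uint32_t idle_cycles;

    /* Called once per bus cycle by the (hypothetical) bus controller. */
    void bus_tick(bool access_requested)
    {
        if (access_requested) {
            idle_cycles = 0;
            gate_bus_clock(false);
            set_bus_divider(1);        /* full speed while traffic is pending */
        } else if (++idle_cycles >= IDLE_CYCLES_BEFORE_GATING) {
            gate_bus_clock(true);      /* stop the clock; only leakage remains */
        } else {
            set_bus_divider(4);        /* short idle gap: just slow the clock */
        }
    }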
And calling it a waste of bandwidth when no memory access is needed is a bit strange, because by that logic we waste a lot of things around us - computers when we are not using them, cars when we are not going anywhere, houses while we are at work, sunlight because we are unable to wrap the whole sun and capture all of its emission. So such a "waste" should be considered normal, at least until we solve some very hard problems, like why we would need to consume all of the sun's emission in the first place.
Brendan wrote:GPUs stall just as much as CPUs do
But do they increase graphics processing speed? They do, so the goal is achieved. It is still not perfectly efficient, but the speed increase is there and is definitely very usable. So if we can get a similar speed increase with the help of poly-core enabled software, even with some cores sitting idle, then it is a really good achievement, and it more than pays for the price of having some cores stalled. Then again I should point to the initial goal - we need a speed increase for new applications, like more natural human-computer interfaces and everything above them, up to artificial intelligence. And those are just offline, user-facing services; if we also consider the need for speed of online services, then the speed increase becomes visible in the form of billions of dollars.
Brendan wrote:Intel already has the world's leading branch prediction algorithms built into their CPUs.
This world-leading algorithm is targeted at one particular combination of hardware and software. To use an analogy, it is like tailoring a suit to fit a single person and then producing that same suit for everybody on earth.
Brendan wrote:For predicting "unpredictable" memory accesses, I don't know how you think a predictor would work
If a program uses some data, then it is obvious that a predictor can look at that fact and use the information to arrange the corresponding memory access in advance. That is exactly what a smart compiler should do. If the compiler misses some information, then that is our bottleneck, and we should remove it using standard bottleneck-avoidance techniques. For example - let the programmer annotate a critical code section, or invent a new language construct, or make the compiler smarter.
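To make the annotation option concrete, here is a small sketch using the __builtin_prefetch() hint that GCC and Clang already provide; the access pattern is data-dependent, so the hardware cannot guess it, but the program knows it and can request the data ahead of time. PREFETCH_DISTANCE is only an assumed tuning parameter for the example:

Code:

    #include <stddef.h>

    #define PREFETCH_DISTANCE 16   /* assumed tuning parameter */

    /* idx[] decides which elements of data[] the loop will touch, so the
     * program itself can arrange each memory access a few iterations early. */
    double sum_gather(const double *data, const size_t *idx, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
            total += data[idx[i]];
        }
        return total;
    }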
Brendan wrote:You end up with (e.g.) "theoretical maximum FLOPS" that no software ever comes close to achieving in practice.
At least I see it working for graphics applications. It also works for many other algorithms that are currently executed on CPUs, and CPU designers keep inventing MMX, SSE and AVX to help widen this bottleneck. The end point of that evolution will be something like an "AVX48" with 1024*16 (512*32 / 256*64 / 128*128 / ...) bits of the same operations that we already have in GPUs. But what has changed on the silicon? Nothing, except that some units have moved from one side of the die (the GPU) to the other (the CPU). It means we have a certain silicon budget for computing units and can move those units wherever we wish. But why should we wait for Intel's designers to move them? We can propose a design with a reconfigurable set of basic units and move those units as we wish, when we wish. And yes, a hardware-optimized arrangement is more efficient, but if we consider how long it takes Intel's designers to move those units (decades), then giving up some efficiency is a very good price to pay for that flexibility.
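As a rough illustration of how that widening looks from the software side (not a claim about any real "AVX48"), the same elementwise add can be written at scalar, 128-bit and 256-bit widths with existing intrinsics; only the lane count changes:

Code:

    #include <immintrin.h>

    /* The same elementwise add at three widths; only the lane count grows.
     * Remainder elements are omitted for brevity; compile with -mavx. */

    void add_scalar(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    void add_sse(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i + 4 <= n; i += 4)   /* 4 floats per instruction */
            _mm_storeu_ps(&out[i],
                          _mm_add_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i])));
    }

    void add_avx(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8)   /* 8 floats per instruction */
            _mm256_storeu_ps(&out[i],
                             _mm256_add_ps(_mm256_loadu_ps(&a[i]),
                                           _mm256_loadu_ps(&b[i])));
    }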
Brendan wrote:You mean, Itanium?
I don't know a marketing name for the proposed solution, but it seems reasonable to call it something like "reconfigurable computing units". And because the command word for such a set of units can be very long, the existing abbreviation VLIW fits. And yes, the Itanium uses VLIW.
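A minimal sketch of what such a long command word would carry: four independent operations with no data dependencies between them, which a VLIW scheduler could issue together. The "bundle" in the comment is illustrative pseudocode, not real Itanium assembly:

Code:

    /* No result of the four operations below feeds another, so a VLIW
     * scheduler could pack them into one long instruction word, e.g.
     * (illustrative pseudocode):  { fadd | fmul | add | load }
     * The two final stores depend on the results and would go in a later word. */
    void independent_ops(double *d, long *l, const long *mem)
    {
        double t0 = d[0] + d[1];   /* floating-point add      */
        double t1 = d[2] * d[3];   /* floating-point multiply */
        long   t2 = l[0] + l[1];   /* integer add             */
        long   t3 = mem[0];        /* memory load             */
        d[4] = t0 + t1;
        l[2] = t2 + t3;
    }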