Future of CPUs
Re: Future of CPUs
Future of processors by 2018, according to DARPA: http://news.cnet.com/8301-13924_3-20013088-64.html (it's not the same news item where DARPA asks Nvidia to build a supercomputer out of CUDA GPUs)
Re: Future of CPUs
I guess it's more a question of what's really required that is interesting. If you think about it, which applications really require 1024-bit arithmetic? And I don't count people who use 1024 bits just because they don't understand the maths. 2^1024 is a really huge number, and 1024-bit floats means really high precision.

arkain wrote:
I wouldn't say that. As a concept for a computer architecture paper, I designed a completely new adder with 25% fewer circuits and 30% more speed than the standard carry-lookahead adder (neither circuit containing optimizations). This advantage starts at 16-bit and grows as the adder is widened. I'm fairly certain that even this design is not the best that anyone can come up with. So it's not a good idea to assume that what we currently understand to be the best will remain so in the future.

darkestkhan wrote:
You couldn't build a 1024-bit adder with appreciably better performance. Never mind multipliers or dividers - heck, division is already slow. At the scaling rate for division that current x86s manage (and most RISC machines are worse!), that division is gonna take you 1032 cycles. Multiplication will probably be 100.

That you're getting 30% more speed means nothing; I'd guess that you could use that design for 64-bit adders as well, so your 1024-bit adder has to compare with the improved 64-bit adder. One could also question whether a divider is worth it at all, since it uses valuable silicon space that will almost never be used anyway (a lot of CPUs actually do not include div instructions at all).
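For readers who haven't seen the design being compared against, here is a minimal bit-level sketch of a 4-bit carry-lookahead slice in C (my own toy code, not arkain's design): all four carries come straight out of the generate/propagate terms instead of rippling bit by bit, and widening the adder widens these equations, which is where the scaling argument comes from.

Code:
#include <stdio.h>

/* 4-bit carry-lookahead slice. cin must be 0 or 1. */
static unsigned cla4(unsigned a, unsigned b, unsigned cin, unsigned *cout)
{
    a &= 0xF;
    b &= 0xF;

    unsigned g = a & b;          /* generate:  this bit produces a carry     */
    unsigned p = a ^ b;          /* propagate: this bit passes a carry along */
    unsigned g0 = (g >> 0) & 1, g1 = (g >> 1) & 1, g2 = (g >> 2) & 1, g3 = (g >> 3) & 1;
    unsigned p0 = (p >> 0) & 1, p1 = (p >> 1) & 1, p2 = (p >> 2) & 1, p3 = (p >> 3) & 1;

    /* all carries computed directly from g/p, no ripple */
    unsigned c1 = g0 | (p0 & cin);
    unsigned c2 = g1 | (p1 & g0) | (p1 & p0 & cin);
    unsigned c3 = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & cin);
    unsigned c4 = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
                     | (p3 & p2 & p1 & p0 & cin);

    *cout = c4;
    return (p ^ ((c3 << 3) | (c2 << 2) | (c1 << 1) | cin)) & 0xF;  /* sum bits */
}

int main(void)
{
    unsigned cout;
    unsigned s = cla4(0xB, 0x6, 0, &cout);   /* 11 + 6 = 17 -> sum 1, carry out 1 */
    printf("sum=%u carry=%u\n", s, cout);
    return 0;
}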
Re: Future of CPUs
Same here - I had an i805, which was powerful enough to heat the cooler to 75C - and to get into thermal throttling - before Windows had even booted. I replaced it with a more powerful, dual-core C2D, and it can't even get a cheaper cooler up to 60C.

These P4s were the most power-hungry implementations in Intel's history. I have one lying in the closet, and you can really warm your room in winter with it.
Re: Future of CPUs
Hmm, having read through the majority of posts here, I have but one thing to say:
"Cellular Automata" ...
Anything with 1000's of cores is starting to resemble one, no?
"Cellular Automata" ...
Anything with 1000's of cores is starting to resemble one, no?
Re: Future of CPUs
I can conceive of a use for such insane precision: try a spatial navigation system. Consider that you'd have to be able to track the precise position of a myriad of stellar objects, especially while traveling at near light speed. But I will agree that for most practical purposes, at least for the conceivable future, anything beyond 256-bit precision isn't even scientifically useful.

skyking wrote:
I guess it's more of what's really required that is interesting. If you think of it, which applications really require 1024-bit arithmetic? And I don't count people that use 1024 bits just because they don't understand maths. 2^1024 is a really huge number, and 1024-bit floats means really high precision.

arkain wrote:
I wouldn't say that. As a concept for a computer architecture paper, I designed a completely new adder with 25% fewer circuits and 30% more speed than the standard carry-lookahead adder (neither circuit containing optimizations). This advantage starts at 16-bit and grows as the adder is widened. I'm fairly certain that even this design is not the best that anyone can come up with. So it's not a good idea to assume that what we currently understand to be the best will remain so in the future.

darkestkhan wrote:
You couldn't build a 1024-bit adder with appreciably better performance. Never mind multipliers or dividers - heck, division is already slow. At the scaling rate for division that current x86s manage (and most RISC machines are worse!), that division is gonna take you 1032 cycles. Multiplication will probably be 100.

That you're getting 30% more speed means nothing, I'd guess that you could use that design for 64-bit adders as well so your 1024-bit adder has to compare with the improved 64-bit adder. One could also question whether a divider is worth it at all since it uses valuable silicon space that will almost never be used anyway (a lot of CPUs actually do not include div instructions at all).
As for why one would implement division in hardware, although it's a relatively little-used function for most common applications, anyone doing heavy math (accounting, scientific, banking, data mining, statistics, etc) certainly appreciates the speed increase provided by performing division in silicon rather than software. One of the points to making transistors smaller is so that there is more room on the die for useful functions. It wouldn't hurt if someone came up with better algorithms for multiplication and division either.
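To make the hardware-vs-software point concrete, here is a minimal sketch (my own toy code, not from the thread) of the classic shift-and-subtract, i.e. restoring, division loop that software has to fall back on without a divide unit - roughly one iteration per result bit, so a 64-bit software divide costs on the order of a few hundred instructions.

Code:
#include <stdint.h>

/* Shift-and-subtract (restoring) division: one iteration per quotient bit.
   Assumes d != 0; the remainder is returned through *rem if it is non-NULL. */
uint64_t soft_divu64(uint64_t n, uint64_t d, uint64_t *rem)
{
    uint64_t q = 0, r = 0;

    for (int i = 63; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);   /* bring down the next dividend bit */
        if (r >= d) {                    /* does the trial subtraction fit?  */
            r -= d;
            q |= (uint64_t)1 << i;
        }
    }
    if (rem)
        *rem = r;
    return q;
}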
Re: Future of CPUs
I know the decimals of pi up to the 45th decimal - that's exactly the amount you need to calculate the circumference of the universe, given the width of the universe, to a precision of electron-thicknesses. That's about 150 bits of precision (rough check below). 256 bits is already stretching it - not to mention that 1024 is way more excessive.

arkain wrote:
I can conceive of a use for such insane precision: try a spatial navigation system. Consider that you'd have to be able to track the precise position of a myriad of stellar objects, especially while traveling at near light speed. But I will agree that for most practical purposes, at least for the conceivable future, anything beyond 256-bit precision isn't even scientifically useful.
As for why one would implement division in hardware, although it's a relatively little-used function for most common applications, anyone doing heavy math (accounting, scientific, banking, data mining, statistics, etc) certainly appreciates the speed increase provided by performing division in silicon rather than software. One of the points to making transistors smaller is so that there is more room on the die for useful functions. It wouldn't hurt if someone came up with better algorithms for multiplication and division either.
You can only hope to use it for pure mathematics - cryptography for instance. But if you go that direction, you're better off with variable length integers (or floats) and variable-length opcodes, or software emulation (GMP for instance).
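A rough back-of-envelope check of that 150-bit figure (my own sketch, not from the thread; the diameter and electron radius are assumed round numbers). Compile with -lm.

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* assumed round numbers: observable universe ~8.8e26 m across,
       classical electron radius ~2.8e-15 m */
    const double pi       = 3.141592653589793;
    const double diameter = 8.8e26;
    const double electron = 2.8e-15;

    /* circumference of the universe measured in electron radii */
    double steps = pi * diameter / electron;

    printf("steps ~ 1e%.0f, bits needed ~ %.0f\n",
           log10(steps), log2(steps));   /* roughly 1e42 steps, ~140 bits */
    return 0;
}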
Re: Future of CPUs
I knew you'd argue that 256 was a bit much. Now, given that your argument mentions the need for 150-bit precision, and that the trend in CPU precision follows powers of 2, to do that 150-bit calculation you'd still need a 256-bit CPU if you wanted to do it in a minimum of instructions.

Candy wrote:
I know the decimals of pi up to the 45th decimal - that's exactly the amount you need to calculate the circumference of the universe, given the width of the universe, to a precision of electron-thicknesses. That's about 150 bits of precision. 256 bits is already stretching it - not to mention that 1024 is way more excessive.

arkain wrote:
I can conceive of a use for such insane precision: try a spatial navigation system. Consider that you'd have to be able to track the precise position of a myriad of stellar objects, especially while traveling at near light speed. But I will agree that for most practical purposes, at least for the conceivable future, anything beyond 256-bit precision isn't even scientifically useful.
As for why one would implement division in hardware, although it's a relatively little-used function for most common applications, anyone doing heavy math (accounting, scientific, banking, data mining, statistics, etc) certainly appreciates the speed increase provided by performing division in silicon rather than software. One of the points to making transistors smaller is so that there is more room on the die for useful functions. It wouldn't hurt if someone came up with better algorithms for multiplication and division either.
You can only hope to use it for pure mathematics - cryptography for instance. But if you go that direction, you're better off with variable length integers (or floats) and variable-length opcodes, or software emulation (GMP for instance).
In the end, I still agree with you. Higher bit precision isn't going to get us very much, if anything. What we need is better hardware encodings for our instructions so that all instructions can be completed in fewer clock cycles. Frankly, I'd much rather have a 400 MHz CPU where all instructions take 4 clocks than a 4 GHz CPU where every instruction takes 40 clocks. The latter CPU is going to be more prone to high power consumption and to destroying itself with heat.
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
Re: Future of CPUs
...You are aware that most processors complete the vast majority of their instructions in one clock - and modern ones (e.g. x86 from around '95 onwards, ARM Cortex-A9, etc.) complete multiple at a time? Hell, x86 often does instructions in zero clocks.
There are instructions which are slow, but they are often rightfully slow (for example, division); and I don't get how you assume instruction encoding has any bearing on performance.
Re: Future of CPUs
Intel's new 256-bit extension: http://software.intel.com/en-us/avx/

Candy wrote:
I know the decimals of pi up to the 45th decimal - that's exactly the amount you need to calculate the circumference of the universe, given the width of the universe, to a precision of electron-thicknesses. That's about 150 bits of precision. 256 bits is already stretching it - not to mention that 1024 is way more excessive.

arkain wrote:
I can conceive of a use for such insane precision: try a spatial navigation system. Consider that you'd have to be able to track the precise position of a myriad of stellar objects, especially while traveling at near light speed. But I will agree that for most practical purposes, at least for the conceivable future, anything beyond 256-bit precision isn't even scientifically useful.
As for why one would implement division in hardware, although it's a relatively little-used function for most common applications, anyone doing heavy math (accounting, scientific, banking, data mining, statistics, etc) certainly appreciates the speed increase provided by performing division in silicon rather than software. One of the points to making transistors smaller is so that there is more room on the die for useful functions. It wouldn't hurt if someone came up with better algorithms for multiplication and division either.
You can only hope to use it for pure mathematics - cryptography for instance. But if you go that direction, you're better off with variable length integers (or floats) and variable-length opcodes, or software emulation (GMP for instance).
Re: Future of CPUs
Why did Intel create IA-64? Why is it a VLIW architecture with 3 instructions per encoded "instruction word"? Because instruction encoding is not a bottleneck?

Owen wrote:
... I don't get how you assume instruction encoding has any bearing on performance.

I was arguing against the 1024-bit stuff, reasoning that for all but cryptography and pure mathematics, 256 bits is just about always enough for a precise answer. Of course, you also know that Intel added AES instructions specifically for cryptography. I'm guessing we'll see 512-bit SHA3 instructions too, for the same reason.

TylerAnon wrote:
Intel's new 256-bit extension: http://software.intel.com/en-us/avx/
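Since cryptography keeps coming up as the one real consumer of 1024-bit integers, here is a minimal sketch of the software-emulation route mentioned earlier (GMP), doing a 1024-bit modular exponentiation. The values are toy placeholders of my own, not real key material; link with -lgmp.

Code:
#include <gmp.h>
#include <stdio.h>

int main(void)
{
    mpz_t base, exp, mod, result;
    mpz_inits(base, exp, mod, result, NULL);

    /* toy values; real crypto code would use proper key material */
    mpz_set_ui(base, 7);
    mpz_ui_pow_ui(exp, 2, 16);        /* exp = 2^16 */
    mpz_ui_pow_ui(mod, 2, 1024);      /* mod = 2^1024 ...                   */
    mpz_add_ui(mod, mod, 643);        /* ... plus an offset to make it odd  */

    mpz_powm(result, base, exp, mod); /* (base^exp) mod mod, all in software */
    gmp_printf("%Zx\n", result);

    mpz_clears(base, exp, mod, result, NULL);
    return 0;
}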
Re: Future of CPUs
Hi,
As far as I know, the idea was to shift instruction scheduling out of the CPU and into the compiler, to reduce the complexity of the silicon, and maybe get better performance (didn't happen), better power consumption (also didn't happen), or reduced development costs (still didn't happen). In general the idea fails. When Itanium was introduced there wasn't a compiler capable of doing the instruction scheduling well (and I'm not even sure if such a compiler exists now), so performance was worse than the theoretical maximum. The other (long-term) problem is that optimum instruction scheduling is very CPU-dependent - code tuned for one Itanium CPU (with one set of instruction latencies, etc.) isn't tuned for another Itanium CPU (with a different set of instruction latencies, etc.), and therefore even if a perfect compiler capable of generating optimum code existed, the resulting code would run well on one CPU and run poorly on other CPUs. If the CPU does the instruction scheduling (e.g. out-of-order CPUs), then there's much less need to tune code for a specific CPU.

Candy wrote:
Why did Intel create IA-64? Why is it a VLIW architecture with 3 instructions per encoded "instruction word"? Because instruction encoding is not a bottleneck?

Owen wrote:
... I don't get how you assume instruction encoding has any bearing on performance.
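A toy illustration of that static-scheduling point (plain C of my own, not IA-64 code; the unroll factor of four is an arbitrary assumption): the first loop is one long dependency chain, the second exposes four independent chains a compiler can schedule side by side - but how many chains are "right" depends on the latencies of the exact CPU the code was tuned for.

Code:
/* one long dependency chain: each add must wait for the previous one */
double sum_serial(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* four independent accumulators the scheduler can interleave */
double sum_unrolled(const double *x, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}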
The other way to "fix" the problem is to use something like hyper-threading to hide the problems caused by poor instruction scheduling. Unfortunately most Itaniums didn't support hyper-threading - Tukwila (released in 2010) is the first Itanium to support it.
The main reason Itanium was relatively successful in specific markets (e.g. for "enterprise" class hardware) was that a lot of scalability and fault-tolerance features were built into the chipsets. Basically, Itanium got where it is now despite VLIW (not because of VLIW), although there are a lot of other factors that affected both its success (in some markets) and its lack of success (in all other markets).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Re: Future of CPUs
I'd guess that's what my (or anybody else's) old grandma would want to use a computer for...

arkain wrote:
I can conceive of a use for such insane precision: try a spatial navigation system. Consider that you'd have to be able to track the precise position of a myriad of stellar objects, especially while traveling at near light speed. But I will agree that for most practical purposes, at least for the conceivable future, anything beyond 256-bit precision isn't even scientifically useful.
The question here is whether the performance improvement from putting this in dedicated hardware would be much bigger than the improvement you'd gain by using the available resources to enhance more frequently used operations (maybe you wouldn't get increased performance at all...).

arkain wrote:
As for why one would implement division in hardware, although it's a relatively little-used function for most common applications, anyone doing heavy math (accounting, scientific, banking, data mining, statistics, etc) certainly appreciates the speed increase provided by performing division in silicon rather than software.
Yes, and make them faster and reduce relative power consumption - but the question is what the best use of the smaller transistor size is. I'm not convinced that you'd get the best general-purpose performance by using the available resources for special-purpose circuitry.

arkain wrote:
One of the points to making transistors smaller is so that there is more room on the die for useful functions.
Re: Future of CPUs
Hi,
A bundle, to be more precise (a bundle is 128 bits long). Brendan pretty much covered everything.

Candy wrote:
Why is it a VLIW architecture with 3 instructions per encoded "instruction word"? Because instruction encoding is not a bottleneck?
HP's C/C++ compilers do a decent job.

Brendan wrote:
When Itanium was introduced there wasn't a compiler capable of doing the instruction scheduling well (and I'm not even sure if such a compiler exists now),
--Thomas
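For reference, a quick compile-time sanity check of the bundle arithmetic mentioned above (the 5-bit template and three 41-bit slots are the standard IA-64 numbers, quoted here from memory - treat them as assumptions to verify against the manual). Needs a C11 compiler.

Code:
#include <assert.h>

/* An IA-64 bundle: a 5-bit template field (which execution-unit types the
   slots need, and where the stops are) plus three 41-bit instruction slots. */
enum { TEMPLATE_BITS = 5, SLOT_BITS = 41, SLOTS_PER_BUNDLE = 3 };

static_assert(TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS == 128,
              "a bundle is 128 bits");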
- Owen
- Member
- Posts: 1700
- Joined: Fri Jun 13, 2008 3:21 pm
- Location: Cambridge, United Kingdom
Re: Future of CPUs
But not compared to an x86 branch predictor. As it turns out, the branch predictor knows far more about the execution paths of the code than compilers do.

Thomas wrote:
HP's C/C++ compilers do a decent job.

Brendan wrote:
When Itanium was introduced there wasn't a compiler capable of doing the instruction scheduling well (and I'm not even sure if such a compiler exists now),
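A small sketch of that asymmetry (plain C with a GCC-style hint; the function and data layout are made up for illustration): the best a compiler can do is bake in a one-time static guess, while the hardware predictor adapts to whatever the branch actually does at run time.

Code:
#include <stddef.h>

int count_negatives(const int *v, size_t n)
{
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        /* static hint: assume negatives are rare -- fixed at compile time,
           right or wrong, for every data set this code ever sees */
        if (__builtin_expect(v[i] < 0, 0))
            hits++;
    }
    return hits;
}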
Re: Future of CPUs
Who would disagree with what? AFAIK the ARM architecture does not include integer division in the instruction set...

berkus wrote:
RISC architectures starting from, perhaps, the Commodore 64 and its MOS 6502 chip and up to TI OMAP and nVidia Tegra 2 boards would disagree.

skyking wrote:
I'm not convinced that you'd get the best general purpose performance by using the available resources for special purpose circuitry.
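For what it's worth, a tiny sketch of what that means in practice (hypothetical function of mine; the exact helper name depends on the toolchain and ABI): on ARM cores without a hardware divider, the compiler can't emit a divide instruction here, so it typically expands the division into a runtime library call such as __aeabi_idiv.

Code:
/* On an ARM core without a hardware divide instruction, this single '/'
   typically becomes a call into a software divide helper (e.g. __aeabi_idiv
   on EABI toolchains), costing tens of cycles instead of one instruction. */
int average(int sum, int count)
{
    return sum / count;   /* lowered to a library call, not a div instruction */
}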