You couldn't build a 1024-bit adder with appreciably better performance. Never mind multipliers or dividers - heck, division is already slow. At the scaling rate for division that current x86s manage (and most RISC machines are worse!), that division is going to take you around 1032 cycles. Multiplication will probably be around 100.
I guess you never computed !123456 or other similar things (or math distributed-computing projects). You don't have to use such big registers - using 32- or 64-bit registers is normally sufficient.
You cannot build arithmetic structures that big. You're talking about something where the sheer size is such that a signal would take a quarter of a clock cycle to traverse it - assuming it was travelling in the metal interconnect, not through independent transistors.
So? Even if it takes 2 clock cycles, is it really that big a deal? Try doing this kind of calculation with 64-bit or even 256-bit registers - I bet it will take far more cycles than a simple mov and add (not even speaking of mul).
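For the record, here's a minimal C sketch of what the 64-bit-register approach looks like: a 1024-bit addition done as sixteen add-with-carry steps over 64-bit limbs (the limb count, names, and test values are just for illustration). Each limb costs only a couple of instructions, and a decent compiler can often map the carry handling onto the hardware's add-with-carry instruction.

    #include <stdint.h>
    #include <stdio.h>

    #define LIMBS 16  /* 16 x 64-bit limbs = 1024 bits */

    /* Add two 1024-bit numbers stored as little-endian arrays of 64-bit limbs;
       returns the carry out of the top limb. */
    static uint64_t add_1024(uint64_t dst[LIMBS], const uint64_t a[LIMBS],
                             const uint64_t b[LIMBS])
    {
        uint64_t carry = 0;
        for (int i = 0; i < LIMBS; i++) {
            uint64_t sum = a[i] + carry;
            carry = (sum < carry);        /* overflow from adding the old carry */
            dst[i] = sum + b[i];
            carry += (dst[i] < b[i]);     /* overflow from adding b[i] */
        }
        return carry;
    }

    int main(void)
    {
        uint64_t a[LIMBS] = { UINT64_MAX, UINT64_MAX }; /* low 128 bits set */
        uint64_t b[LIMBS] = { 1 };
        uint64_t r[LIMBS];
        uint64_t c = add_1024(r, a, b);   /* (2^128 - 1) + 1 = 2^128 */
        printf("r[2]=%llu r[1]=%llu r[0]=%llu carry=%llu\n",
               (unsigned long long)r[2], (unsigned long long)r[1],
               (unsigned long long)r[0], (unsigned long long)c);
        return 0;
    }

Multiplication is the same idea with 64x64->128 partial products, which is why 64-bit registers are "sufficient" for bignum work even if dedicated wide hardware would be faster per operation.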
But your complaint has nothing to do with x86, and everything to do with money. Gee whiz, I wonder why these processors are so relatively slow? Is it perhaps because I'm comparing them with chips 10x the price?
The fastest Tilera chips you can buy go for ~$400 (without a motherboard); the fastest Xeon is ~$10k. Which one is cheaper? As for speed - for single-threaded apps there's no doubt the Xeon will be faster, but Tilera is mainly targeting web servers, where the workload is relatively easy to parallelize. I don't know the price of Niagara 3.
As for GPUs: their performance is highly code-dependent, like Cell's. Processors can be placed on an axis of generality vs. efficiency, with the normal CPU at one end, Cell SPEs in the middle, and GPUs occupying the other end. GPUs are moving left, sure, but there are some things they're still really poor at - a great example is anything that involves heavy branching.
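To make the branching point concrete, here's a rough C sketch (the function names and toy workload are mine) of what lock-step SIMD/GPU lanes effectively do with a data-dependent branch: when lanes in a group diverge, the hardware executes both sides under a mask, roughly like the second function, so you pay for both paths. With two cheap paths that's harmless; with long, nested branches the cost multiplies.

    /* Branchy version: a scalar CPU with a branch predictor pays only for the
       path actually taken on each element. */
    void scale_branchy(float *x, const float *y, int n, float lo, float hi)
    {
        for (int i = 0; i < n; i++) {
            if (y[i] > 0.0f)
                x[i] = y[i] * hi;
            else
                x[i] = y[i] * lo;
        }
    }

    /* What divergent lanes effectively cost on lock-step hardware: both sides
       are computed for every element and the result is selected by a mask. */
    void scale_predicated(float *x, const float *y, int n, float lo, float hi)
    {
        for (int i = 0; i < n; i++) {
            float if_pos = y[i] * hi;
            float if_neg = y[i] * lo;
            x[i] = (y[i] > 0.0f) ? if_pos : if_neg;
        }
    }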
If Intel doesn't speed up its CPUs by a factor of 4, it will lose market share (actually not exactly - there are still no GPL'd, fully workable CUDA drivers, and proprietary software will take decades to start targeting really parallel environments (multicore CPUs, GPUs)). GPUs lately are more and more like CPUs - I bet that in one or two decades their branching instructions will be good enough to ditch CPUs altogether (if current trends hold).
As for performance... nVIDIA is currently out in the lead if you compare like for like (single-GPU device, single precision). In fact, nVIDIA's best device holds the same performance as ATI's best, despite ATI's being a dual-GPU board.
What is single precision? It is utterly worthless outside graphics. In HPC it is desirable most of the time to use at least double precision (weather prediction, anyone?), and in that nVIDIA is almost worthless. ATI also has much better string operations (cryptography...). But again, writing for ATI is almost hell, while nVIDIA has a nice CUDA compiler with support for C and Fortran. And it is free (as in beer), unlike ATI's compilers. Because of that nVIDIA will have a much bigger market share, and it will continue to grow.
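Just to illustrate the precision point (a toy example, not an HPC kernel): a long running sum in single precision drifts badly once the accumulator dwarfs the increments, while double precision stays essentially exact at this scale. That's the kind of error that makes single precision a non-starter for things like weather models.

    #include <stdio.h>

    int main(void)
    {
        /* Add 0.1 ten million times; the exact answer is 1,000,000. */
        float  fsum = 0.0f;
        double dsum = 0.0;
        for (int i = 0; i < 10000000; i++) {
            fsum += 0.1f;
            dsum += 0.1;
        }
        printf("float : %f\n", fsum);  /* noticeably off from 1000000 */
        printf("double: %f\n", dsum);  /* very close to 1000000 */
        return 0;
    }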