Future of CPUs

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Death of video cards? Huh?

As for the "specialized CPUs" comment, I have only seen this happen on the highest of high-end NICs and sound cards. These days, the GPU itself is an array of very highly specialized processors.

And more GPUs are shipping now than ever before (They're going into supercomputers - for example, for physics, nothing beats an nVIDIA GPU and CUDA)
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Future of CPUs

Post by Love4Boobies »

Owen wrote:Death of video cards? Huh?
Erm... CPU/GPU packages? As such?
Owen wrote:As for the "specialized CPUs" comment, I have only seen this happen on the highest of high-end NICs and sound cards. These days, the GPU itself is an array of very highly specialized processors.
Who cares where you've seen it? That is the direction in which we are heading - another reason is that SMP isn't very scalable (quite obviously). Few know this, but even the i7 is ccNUMA.
Owen wrote:And more GPUs are shipping now than ever before (They're going into supercomputers - for example, for physics, nothing beats an nVIDIA GPU and CUDA)
I don't have any benchmarks, but I'm not sure about that...
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Larrabee as a discrete product has been canned. If Larrabee is seen at all, it will be as a replacement for the GMA series.
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Future of CPUs

Post by Love4Boobies »

It was. It's an important step for Intel, and they can't afford to screw it up because it's the thing everyone's talking about. Arrandale (some of the Core i3, Core i5 and Core i7 CPUs) already has an integrated GPU. It's only a question of when. :)
Neolander
Member
Posts: 228
Joined: Tue Mar 23, 2010 3:01 pm
Location: Uppsala, Sweden

Re: Future of CPUs

Post by Neolander »

I don't see this happening soon, and I hope it won't happen except at the lowest end of the GPU market (GMA, you won't be missed).

1/ A graphics card is not just a chipset. That's what laptop manufacturers try to make people think, and everybody can see the horrible gaming performance of laptops. Buses and fast dedicated video memory are an important issue too, and you can't make all of this fit on a single chip, nor can you use the regular bus for video memory (it's already running out of bandwidth these days; that would be performance suicide).
2/ Heat is a major issue, too. Modern higher-end graphics cards generate so much heat that the cooler alone takes up the space of a PCI-e card. Shrinking them to the size of a regular CPU won't help much; it will only make heat harder to dissipate, due to the reduced contact surface with the cooling device.
3/ Graphics cards evolve faster than regular CPUs. If I buy a computer today, I know that I probably won't need to, nor be able to, change its CPU. On the other hand, PCI-e evolves slowly enough that I can envision buying a new graphics card within the machine's lifetime. That's why it's better for the GPU and CPU to be separate parts. Otherwise, people wanting a new GPU will have to buy a new CPU along with it (and a new motherboard, since changing the CPU means changing that nowadays), while people wanting to build a powerful PC will have to buy two GPUs: one integrated into the CPU, and one with serious performance as a separate part.

In my opinion, PC hardware manufacturers are taking consumer screwing-over lessons from Apple with this, and Intel/AMD are trying to squeeze out Nvidia, which has little to no experience in the CPU market. I don't like it.
Last edited by Neolander on Sun Apr 11, 2010 1:22 pm, edited 1 time in total.
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Future of CPUs

Post by Love4Boobies »

Don't argue with the technical aspects when it has already been done. Clearly, everything has been worked out.

As for modularity, yes, it's less modular. You get a lower total price for better performance. How often do you change your CPU and how often do you change your video card today?
Neolander
Member
Posts: 228
Joined: Tue Mar 23, 2010 3:01 pm
Location: Uppsala, Sweden

Re: Future of CPUs

Post by Neolander »

Has it already been done with a serious graphics card? Something that provides at least the power and capabilities of a five-year-old graphics card like the GeForce 7800 GT?
Last edited by Neolander on Sun Apr 11, 2010 1:16 pm, edited 1 time in total.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Current-generation CPU, maximum memory bus width: 192-bit (Core i7)
Current-generation GPU, maximum memory bus width: 384-bit (nVIDIA GTX480)

The GPU's memory bus is clocked higher, because it doesn't need to cross sockets to reach RAM.

Both of these buses are absolutely full, and the GPU's has over twice the capacity of the CPU's.

Please reconcile with a high performance GPU implemented in the same silicon as the CPU ;)
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Future of CPUs

Post by Love4Boobies »

You are right, it simply cannot be done. That's why Larrabee had a 1024-bit ring bus. :wink:
Owen wrote:Please reconcile with a high performance GPU implemented in the same silicon as the CPU ;)
Okay, but only because your posts are as enlightening as always. :wink:
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Love4Boobies wrote:You are right, it simply cannot be done. That's why Larrabee had a 1024-bit ring bus. :wink:
Owen wrote:Please reconcile with a high performance GPU implemented in the same silicon as the CPU ;)
Okay, but only because your posts are as enlightening as always. :wink:
I was under the impression you were suggesting the inclusion of GPUs on the same chip as the main processor ;)

Larrabee is another kettle of fish entirely. A highly disappointing kettle of fish.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Future of CPUs

Post by Brendan »

Hi,
Owen wrote:Current generation CPU, maximum memory bus size: 192-bit (Core i7)
Current generation GPU, maximum memory bus size: 384-bit (nVIDIA GTX480)
Most people care about bandwidth, not bus width.

The fastest RAM bandwidth for an AMD/ATI GPU listed on Wikipedia is the Radeon HD 5870 at 153.6 GiB/s. The fastest for an NVidia GPU listed on Wikipedia is the GeForce GTX 480 at 177.4 GiB/s.

For Intel's Nehalem (not Nehalem-EX), RAM bandwidth is currently about 32 GiB/s. That's about the same as an NVidia GeForce GT 330 or an ATI Radeon HD 4670.

But you wouldn't want to use all the RAM bandwidth for video and leave none for the CPU/s. If you use half for the GPU and half for the CPU, then you're left with about 16 GiB/s, which is about the same as an NVidia GeForce GT 220 or an ATI Radeon 9700.
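As a sanity check, the bandwidth figures above fall straight out of bus width times transfer rate. A quick sketch (the transfer rates - 3696 MT/s effective for the GTX 480's GDDR5 and triple-channel DDR3-1333 for Nehalem - are my assumptions, not stated in this thread):

```python
def peak_bandwidth_gb(bus_bits, transfers_per_sec):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfer rate."""
    return bus_bits / 8 * transfers_per_sec / 1e9

# GeForce GTX 480: 384-bit GDDR5 bus at ~3696 MT/s effective
print(round(peak_bandwidth_gb(384, 3696e6), 1))        # ~177.4
# Core i7 (Nehalem): three 64-bit channels of DDR3-1333
print(round(peak_bandwidth_gb(3 * 64, 1333.33e6), 1))  # ~32.0
```

Note how Owen's 384-bit vs 192-bit bus widths, combined with the GPU's higher memory clock, reproduce both numbers.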

However, how good is "good enough"? For servers, nobody cares. For office work, nobody cares. For mobile devices, nobody cares. For "moderate" PC gamers like me it's hard to say - I checked my games machine (which is plenty for every game I've tried) and realised it's using system RAM anyway (a quad-core Phenom II with onboard ATI HD3200).

The "hard core gamers" who actually would have cared all shifted to game consoles like the Xbox years ago. This is mostly because "latest release" PC games suck (there are continual compatibility problems).

Of course I should point out that the fastest bandwidth isn't RAM at all - it's Nehalem's L1/L2/L3 caches. Recently I've been wondering what sort of graphics performance you'd be able to get from highly optimised code on a 4-core/8-thread Nehalem (instead of using a GPU), using different algorithms than GPUs use (to get the most out of cache bandwidth, etc.). If you compare GFLOPS and RAM bandwidth (and ignore CPU cache), Nehalem works out roughly the same as an NVidia GeForce 9600. Next year we'll have AVX ("256-bit SSE"), which should double the CPU's GFLOPS per core.
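Brendan's GFLOPS comparison can be sanity-checked the same way: peak single-precision throughput is cores x SIMD lanes x FLOPs per lane per cycle x clock. A rough sketch (the 2.93 GHz clock and the one-multiply-plus-one-add-per-cycle issue rate are my assumptions):

```python
def peak_sp_gflops(cores, simd_floats, clock_ghz, flops_per_lane_per_cycle=2):
    """Peak single-precision GFLOPS, assuming each SIMD lane can retire
    one multiply and one add per cycle (hence the default of 2)."""
    return cores * simd_floats * flops_per_lane_per_cycle * clock_ghz

# Quad-core Nehalem at ~2.93 GHz with 128-bit SSE (4 floats wide)
sse = peak_sp_gflops(4, 4, 2.93)   # ~93.8 GFLOPS
# 256-bit AVX (8 floats wide) doubles the per-core peak
avx = peak_sp_gflops(4, 8, 2.93)   # ~187.5 GFLOPS
```

This is a theoretical peak, of course; sustained throughput depends on keeping the SIMD units fed from cache, which is exactly Brendan's point about algorithm choice.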


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Don't forget that modern GPUs have caches (and TLBs!) too ;) They also have built-in real-time texture decompression hardware. (The textures remain compressed in the cache for space efficiency - for DXT1, a 4x4 RGB + mask pixel block is compressed into 64 bits; for DXT3, DXT5 and 3DC (the latter being targeted at normal maps), a 4x4 RGBA block is compressed into 128 bits.)

I haven't seen x86 sprout any DXT1DEC/DXT3DEC/DXT5DEC/3DCDEC instructions yet ;-)

DXT and 3DC are among those things which are incredibly cheap and fast in hardware but relatively expensive in software.

Believe me, GPUs have the same bandwidth issues as CPUs, only at larger scales. Their caches are probably not much larger than the i7's, or even smaller, but they're very targeted. The fragment-scheduling hardware is very good at maximizing cache locality.
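The DXT1 block layout Owen describes can be decoded in a few lines of software, which also illustrates why it's cheap in hardware but comparatively expensive per texel fetch on a CPU. A sketch (function names are mine; the layout follows the S3TC/DXT1 format: two RGB565 base colours plus sixteen 2-bit palette indices per 4x4 block):

```python
import struct

def rgb565(c):
    """Expand a 5:6:5-packed colour to 8-bit-per-channel RGB."""
    return ((c >> 11 & 31) * 255 // 31,
            (c >> 5 & 63) * 255 // 63,
            (c & 31) * 255 // 31)

def decode_dxt1_block(block):
    """Decode one 64-bit DXT1 block into a 4x4 grid of RGB tuples."""
    c0, c1, bits = struct.unpack("<HHI", block)  # 2 base colours + 32 index bits
    a, b = rgb565(c0), rgb565(c1)
    if c0 > c1:   # opaque mode: two interpolated intermediate colours
        pal = [a, b,
               tuple((2 * x + y) // 3 for x, y in zip(a, b)),
               tuple((x + 2 * y) // 3 for x, y in zip(a, b))]
    else:         # 1-bit-alpha mode: one midpoint, index 3 = transparent black
        pal = [a, b, tuple((x + y) // 2 for x, y in zip(a, b)), (0, 0, 0)]
    # 2-bit palette index per texel, row-major, least significant bits first
    return [[pal[bits >> 2 * (4 * r + x) & 3] for x in range(4)]
            for r in range(4)]
```

Sixteen texels in 8 bytes is the compression Owen cites; running the mode test, interpolation, and index extraction above for every fetch is precisely the work the GPU's dedicated decompression hardware makes free.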
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Future of CPUs

Post by Love4Boobies »

Owen wrote:Don't forget that modern GPUs have caches (And TLBs!) too ;) They also have built in real time texture decompression hardware (The textures remain compressed in the cache for space efficiency - For DXT1, a 4x4 RGB + Mask pixel block is compressed into 64-bits. For DXT3, 5 and 3DC (The latter being targeted at normal maps) a 4x4 RGBA block is compressed into 128 bits).
Maybe someday they will add caches and TLBs to CPUs as well, eh?
Owen wrote:I haven't seen x86 sprout any DXT1DEC/DXT3DEC/DXT5DEC/3DCDEC instructions yet
Of course you haven't. They have only had CPU cores. I'm not sure what part of this conversation you don't understand. :wink:
Owen
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom

Re: Future of CPUs

Post by Owen »

Love4Boobies wrote:Of course you haven't. They have only had CPU cores. I'm not sure what part of this conversation you don't understand.
I was commenting on Brendan's post. In particular, the point about algorithms to use Nehalem's cache.
Benk
Member
Posts: 62
Joined: Thu Apr 30, 2009 6:08 am

Re: Future of CPUs

Post by Benk »

64-bit memory is enough for addressing data. Modern CPUs already do 128-bit (and, for some, 256-bit) operations in MMX/SSE instructions; e.g., the 128-bit XMM registers can be used for the types of tasks that need them.

Predictions (in other words, guesses):

A modern quad-socket board can run 48 concurrent threads (4 sockets * 6 cores * 2 with HT). In the near future core counts will go past 100, and in 10 years we could be looking at 1000-core desktops.

TLB and cache pollution are a critical problem.

Removing concurrency from the IPC design will become more important and will allow average programs to use high core counts. This suggests async IPC, a new runtime library, and perhaps no user threads at all.

OS X, Linux and NT will not handle this well, nor can they easily adapt (Intel can), which will make it interesting. Witness the cost of the minor multi-core improvements in Windows 7. A new OS designed for these things could gain a 10-fold advantage for average apps. Some languages like OCaml will blow past C on those systems, at least for a similar amount of effort, and much of the custom high-speed work built on user threading will need to be redone.

MMUs and branch prediction will start to disappear after 10 years, replaced by ~35% higher core counts and more advanced software memory management (which uses virtual addresses and patches code when needed).

Mobile CPUs and smaller will remain 32-bit.