
Re: Future of CPUs

Posted: Wed Mar 17, 2010 4:05 pm
by nedbrek
Owen wrote:I doubt it. Current CPUs are really close to the limits of performance working on 64-bit numbers; remember that for many operations doubling the word size doubles the time. For example, Phenom IIs overclock better in 32-bit than 64-bit mode.
Inside the CPU, doubling the width will usually add one layer of logic. That might be 10%, although it can often be hidden due to delays elsewhere (even on Pentium 4, the pipeline limit was in the adder-schedulers, not the actual adders).

Overclocking is a function of temperature and process margin. For the same chip, the physical pipeline is the same.
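
To put a rough number on that "one layer of logic" claim, here is a toy model (my own sketch, not tied to any real design): in a lookahead-style adder the carry tree grows one level per doubling of the width, so going 32 -> 64 bits costs a single extra layer.

Code: Select all

#include <stdio.h>

/* Toy model: carry-lookahead depth grows as ceil(log2(width)).
 * Real adders (Kogge-Stone, Han-Carlson, ...) differ in constants,
 * but the logarithmic scaling is the point. */
static int cla_levels(int width)
{
    int levels = 0;
    while ((1 << levels) < width)
        levels++;
    return levels;
}

int main(void)
{
    for (int w = 16; w <= 128; w *= 2)
        printf("%3d-bit add: %d lookahead levels\n", w, cla_levels(w));
    return 0;
}
/* Prints 4, 5, 6, 7 levels for 16/32/64/128 bits:
 * each doubling of width costs one extra layer of logic. */
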
As for Intel killing ARM, I don't see that happening. You cannot make an x86 of comparable performance with a comparable silicon budget. x86 is a very ugly architecture; ARM is not perfect but a very clean one.
Intel vs. ARM will be very interesting. ARM is also ugly, in the ways that matter for designing a high-performance chip (one word: predication). The ugliness of x86 is restricted to the instruction decoder (actually, uROM) and legacy testing (knowledge base + tools).
Secondly, Intel and ARM do not directly compete: ARM licenses its cores to companies which have a combined revenue ~4x Intel's. Thirdly, Intel is right on the EU's anti-trust watch list, and any attempt to do anything anti-competitive against ARM (such as the aforementioned selling at a loss) is going to land them more large fines.
Revenue is irrelevant; look at profit and acres of silicon shipped (that is the metric Intel uses). I haven't looked in a while, but Intel has been shipping more silicon than the next few companies put together for a long time, and brings in tens of billions of dollars in profit every year. Most other silicon manufacturers break even or lose money (they are supported by government subsidies).

It is important to note that Intel is not a computer company or design house or anything like that.

Intel takes sand and makes money.

Processor architecture, compilers, graphics and all that are just the means to the end.
You have to note that the ARM cores used in phones contain lots of peripherals, from the simple (SPI & I²C/SMBus, for talking to external devices), to the specialized (H.264 decoders), to the application-specific (Wi-Fi, UMTS and GSM basebands - often on one chip!). With Intel's current Atom platform, you require at least 4 chips to do the same (CPU + Northbridge + Southbridge + Radios); with their next, they're shrinking it to 3, but still each of those chips requires more power than an ARM which can do the same.
Intel is one generation ahead in process technology. That makes a lot of room for integration and power consumption... other parts will get bigger as performance increases, and Intel will "polish the turd" on Atom, driving it to maximum efficiency.
Intel will continue to grow, but don't expect them to get into the mobile phone market. They simply cannot compete there, and they have over half of the industry competing against them.

Oh, and the performance of ARM cores is increasing faster than that of the Atom. And this from a company with 1/1000th the revenue to do R&D with.
It is easy to increase performance when your performance is low... the interesting thing will be to see how much it costs ARM to increase more, and how much better Atom gets. I wouldn't call it either way...

Re: Future of CPUs

Posted: Wed Mar 17, 2010 5:09 pm
by Owen
nedbrek wrote:
Owen wrote:I doubt it. Current CPUs are really close to the limits of performance working on 64-bit numbers; remember that for many operations doubling the word size doubles the time. For example, Phenom IIs overclock better in 32-bit than 64-bit mode.
Inside the CPU, doubling the width will usually add one layer of logic. That might be 10%, although it can often be hidden due to delays elsewhere (even on Pentium 4, the pipeline limit was in the adder-schedulers, not the actual adders).

Overclocking is a function of temperature and process margin. For the same chip, the physical pipeline is the same.
Doubling the width makes the worst case propagation delays quite a bit worse in most adders. The Pentium 4 is a very bad example as, well, it's a very bad processor by many metrics. With more modern processors, much of the critical path is the latency in calculating the upper bits of a number.
As for Intel killing ARM, I don't see that happening. You cannot make an x86 of comparable performance with a comparable silicon budget. x86 is a very ugly architecture; ARM is not perfect but a very clean one.
Intel vs. ARM will be very interesting. ARM is also ugly, in the ways that matter for designing a high-performance chip (one word: predication). The ugliness of x86 is restricted to the instruction decoder (actually, uROM) and legacy testing (knowledge base + tools).
Predication requires no additional logic beyond that required by conditional branching. It also has a major benefit, in that it vastly reduces the amount of branching performed, and therefore reduces the required complexity of the branch prediction logic and reduces the cache pressure speculative fetches induce.

As for ugliness of x86 being restricted to the instruction decoder... Please, have you ever looked at all the cruft? The GDT? the LDT? 16-bit mode? A20 and the associated cache and TLB ugliness? The circuitry to support receiving interrupts from the 70s era PIC? How much of this is actually useful?
Secondly, Intel and ARM do not directly compete: ARM licenses its cores to companies which have a combined revenue ~4x Intel's. Thirdly, Intel is right on the EU's anti-trust watch list, and any attempt to do anything anti-competitive against ARM (such as the aforementioned selling at a loss) is going to land them more large fines.
Revenue is irrelevant; look at profit and acres of silicon shipped (that is the metric Intel uses). I haven't looked in a while, but Intel has been shipping more silicon than the next few companies put together for a long time, and brings in tens of billions of dollars in profit every year. Most other silicon manufacturers break even or lose money (they are supported by government subsidies).

It is important to note that Intel is not a computer company or design house or anything like that.

Intel takes sand and makes money.

Processor architecture, compilers, graphics and all that are just the means to the end.
ARM cores cover an order of magnitude more silicon than Intel do per year. Oh, and by the way, the biggest semiconductor manufacturers are all highly profitable, and often with higher margins than Intel. The ones which are experiencing financial difficulty at the moment are only marginally doing so, and can be expected to re-enter their previous positions as the market recovers.

And please provide me with evidence of government subsidies, or was that something you just made up? And I'm not talking about some small company here; pick one of the big 20.
You have to note that the ARM cores used in phones contain lots of peripherals, from the simple (SPI & I²C/SMBus, for talking to external devices), to the specialized (H.264 decoders), to the application-specific (Wi-Fi, UMTS and GSM basebands - often on one chip!). With Intel's current Atom platform, you require at least 4 chips to do the same (CPU + Northbridge + Southbridge + Radios); with their next, they're shrinking it to 3, but still each of those chips requires more power than an ARM which can do the same.
Intel is one generation ahead in process technology. That makes a lot of room for integration and power consumption... other parts will get bigger as performance increases, and Intel will "polish the turd" on Atom, driving it to maximum efficiency.
Intel one generation ahead? Please. The rest of the industry is operating at the same process nodes as Intel are, and the difference that a process node makes is actually smaller than you think, as much of the benefits go straight out the window with increased leakage current (For reference: Half of the power consumed by a sub-90nm chip is leakage. Half).
Intel will continue to grow, but don't expect them to get into the mobile phone market. They simply cannot compete there, and they have over half of the industry competing against them.

Oh, and the performance of ARM cores is increasing faster than that of the Atom. And this from a company with 1/1000th the revenue to do R&D with.
It is easy to increase performance when your performance is low... the interesting thing will be to see how much it costs ARM to increase more, and how much better Atom gets. I wouldn't call it either way...
The Cortex-A9 beats the best Atom on Intel's next-2-year roadmap in both performance and efficiency, and is going into production soon (expect to see it in products in 1 to 1½ years).

Re: Future of CPUs

Posted: Thu Mar 18, 2010 5:14 am
by nedbrek
Owen wrote: Doubling the width makes the worst case propagation delays quite a bit worse in most adders. The Pentium 4 is a very bad example as, well, it's a very bad processor by many metrics. With more modern processors, much of the critical path is the latency in calculating the upper bits of a number.
Sure, but the adder is not the limit in most designs. The Pentium 4 is a good example, because it ran its adders at double speed (Northwood's hit at least 6 GHz in the end, although they were 16-bit adders). The schedulers were also running at that speed, and the scheduler is a much more complicated structure. Prescott eliminated back-to-back adds, because of the difficulty in scheduling and the bypass bus (and because the performance wasn't worth it).
Predication requires no additional logic beyond that required by conditional branching. It also has a major benefit, in that it vastly reduces the amount of branching performed, and therefore reduces the required complexity of the branch prediction logic and reduces the cache pressure speculative fetches induce.
Consider the following case for an Out-of-Order machine:
p1/r1+flags = r2 - r3
(z) p2/r2 = r4 + r8
(nz) p3/r2 = r5 + r8
ld r6 = [r2/?]

When it comes time to rename r2 for the load, what physical register should the renamer return? p2 or p3? (will it be able to determine that z+nz guarantees overwrite of the old value?) The renamer is a time and space critical structure. It can be pipelined, but you must sustain the machine width in throughput each cycle.

Also, the benefits of conditional execution are mostly had by a single (simple) predicated instruction: cmov (which x86 has).
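
To make the renamer's dilemma concrete, here is a minimal C sketch (hypothetical structures and names, not any shipping design) of why a predicated write needs an extra read of the old mapping:

Code: Select all

#include <stdio.h>

#define NUM_ARCH_REGS 16

/* Hypothetical rename table: architectural register -> physical register. */
static int rename_table[NUM_ARCH_REGS];
static int next_preg = 1;

/* Unconditional write: just allocate a fresh physical register. */
static int rename_write(int arch)
{
    rename_table[arch] = next_preg++;
    return rename_table[arch];
}

/* Predicated write: the new physical register may have to carry the
 * OLD value if the predicate turns out false, so the old mapping must
 * be read as an extra input -- the extra rename-table read port (and
 * the extra dataflow edge) discussed above. */
static int rename_write_predicated(int arch, int *old_preg)
{
    *old_preg = rename_table[arch];   /* the extra read */
    rename_table[arch] = next_preg++;
    return rename_table[arch];
}

int main(void)
{
    int old;
    rename_write(2);                            /* r2 = ...           */
    int p2 = rename_write_predicated(2, &old);  /* (z)  r2 = r4 + r8  */
    int p3 = rename_write_predicated(2, &old);  /* (nz) r2 = r5 + r8  */
    /* The load of r2 must name the newest mapping (p3), but p3 only
     * holds the right value because each predicated uop forwards its
     * old input on a false predicate -- serializing the chain. */
    printf("load reads p%d (chain p%d -> p%d)\n",
           rename_table[2], p2, p3);
    return 0;
}
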
As for ugliness of x86 being restricted to the instruction decoder... Please, have you ever looked at all the cruft? The GDT? the LDT? 16-bit mode? A20 and the associated cache and TLB ugliness? The circuitry to support receiving interrupts from the 70s era PIC? How much of this is actually useful?
Sure, that takes a knowledge base of people familiar with all that junk (and lots of test cases), and some transistors off the critical path. That's why Intel is good at x86 and newcomers struggle for a while (but it can be learned, the Centaur guys do a lot with the little they have).
ARM cores cover an order of magnitude more silicon than Intel do per year. Oh, and by the way, the biggest semiconductor manufacturers are all highly profitable, and often with higher margins than Intel. The ones which are experiencing financial difficulty at the moment are only marginally doing so, and can be expected to re-enter their previous positions as the market recovers.

And please provide me with evidence of government subsidies, or was that something you just made up? And I'm not talking about some small company here; pick one of the big 20.
It's too early to dig for volume numbers, but here are the revenue numbers (not the best, but there they are). Intel(#1) ~= Samsung(#2) + Toshiba(#3) + TI(#4) (within 10%).

Samsung is supported by the Korean government, while Toshiba is supported by the Japanese government (heck, AMD's fab is mostly funded by the Germans).
Intel one generation ahead? Please. The rest of the industry is operating at the same process nodes as Intel are, and the difference that a process node makes is actually smaller than you think, as much of the benefits go straight out the window with increased leakage current (For reference: Half of the power consumed by a sub-90nm chip is leakage. Half).
Intel is at 32nm, moving to 22.
Samsung shows 45 for their "45 and below" page (although they should be close to 32 at this point)
TI is at 45, readying to move to 32
Toshiba is mostly DRAM.

Leakage is a factor everyone has been dealing with, and it is a fight at every node (90 was one of the worst). We've given up some performance at each node to get leakage under control. But it does reduce area, which increases profit.
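
Back-of-the-envelope for the area point (an idealized dies-per-wafer estimate; defects, scribe lines and reticle limits are all ignored):

Code: Select all

#include <stdio.h>
#include <math.h>

/* First-order dies-per-wafer estimate: usable wafer area divided by
 * die area, minus a term for partial dies lost at the wafer edge. */
static double dies_per_wafer(double wafer_mm, double die_mm2)
{
    const double pi = 3.141592653589793;
    double r = wafer_mm / 2.0;
    return pi * r * r / die_mm2 - pi * wafer_mm / sqrt(2.0 * die_mm2);
}

int main(void)
{
    const double wafer   = 300.0;  /* mm, standard wafer diameter     */
    const double old_die = 100.0;  /* mm^2, hypothetical die          */
    const double new_die = 50.0;   /* same design after a full shrink */

    printf("old node: %.0f dies/wafer\n", dies_per_wafer(wafer, old_die));
    printf("new node: %.0f dies/wafer\n", dies_per_wafer(wafer, new_die));
    return 0;
}
/* ~640 vs ~1320 candidate dies: a full shrink roughly doubles the
 * dies per wafer, which is where the profit comes from. */
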

Re: Future of CPUs

Posted: Thu Mar 18, 2010 12:45 pm
by bewing
nedbrek wrote:But it does reduce area, which increases profit.
Heh. The cost of the silicon is not the driving cost in CPU profitability. It's R&D.
And I agree with Brendan -- except about the embedded CPU thing.

Re: Future of CPUs

Posted: Thu Mar 18, 2010 2:16 pm
by Owen
nedbrek wrote: Consider the following case for an Out-of-Order machine:
p1/r1+flags = r2 - r3
(z) p2/r2 = r4 + r8
(nz) p3/r2 = r5 + r8
ld r6 = [r2/?]

When it comes time to rename r2 for the load, what physical register should the renamer return? p2 or p3? (will it be able to determine that z+nz guarantees overwrite of the old value?) The renamer is a time and space critical structure. It can be pipelined, but you must sustain the machine width in throughput each cycle.

Also, the benefits of conditional execution are mostly had by a single (simple) predicated instruction: cmov (which x86 has).
How is this complexity any different from the following x86 code:

Code: Select all

sub %ebx, %eax       # set the flags
jz 1f                # taken path is out of line
add %ecx, %edx       # not-taken path writes %edx
2:
mov (%edx), %esi     # the load that consumes %edx
...
1: sub %edi, %edx    # taken path writes %edx instead
jmp 2b
(Other than the ARM code being shorter, smaller, and faster)
As for ugliness of x86 being restricted to the instruction decoder... Please, have you ever looked at all the cruft? The GDT? the LDT? 16-bit mode? A20 and the associated cache and TLB ugliness? The circuitry to support receiving interrupts from the 70s era PIC? How much of this is actually useful?
Sure, that takes a knowledge base of people familiar with all that junk (and lots of test cases), and some transistors off the critical path. That's why Intel is good at x86 and newcomers struggle for a while (but it can be learned, the Centaur guys do a lot with the little they have).
It all increases the core real estate, and much of it (particularly the segmentation support and 16-bit mode) directly impinges on the critical path.

And both A20EN# and the unpredictable variable-length instruction encodings (which can reach 15 bytes - which means very long decoder chains, as many prefixes are legal in any order) impinge directly on the caches. The L1 instruction cache gets fouled with the need to pre-decode instructions, both L1 and L2 caches need to handle coherency in the face of A20 being enabled and disabled, and you have lots of internal components which can cause many loads for dependencies (a pathological case: loading a GDT entry into a segment register involving load of GDT descriptor -> load of PTE -> load of shadow PTE -> load of shadow PDE -> load of shadow PDPT -> load of shadow PML4E; load of PDE -> load of shadow PTE ..., you get the picture)
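
As a simplified illustration of the length-decode problem (a sketch only; real pre-decoders handle far more cases than legacy prefixes):

Code: Select all

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified: is this byte one of x86's legacy prefixes?
 * (lock/rep, segment overrides, operand/address size) */
static int is_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:             /* lock, repne, rep */
    case 0x2E: case 0x36: case 0x3E: case 0x26:  /* cs, ss, ds, es   */
    case 0x64: case 0x65:                        /* fs, gs           */
    case 0x66: case 0x67:                        /* opsize, addrsize */
        return 1;
    default:
        return 0;
    }
}

/* You cannot even find the opcode without scanning byte by byte, and
 * the whole instruction may still legally reach 15 bytes. This serial
 * scan is the work the L1 I-cache pre-decode bits exist to avoid
 * redoing on every fetch. */
static size_t count_prefixes(const uint8_t *insn, size_t max)
{
    size_t n = 0;
    while (n < max && is_prefix(insn[n]))
        n++;
    return n;
}

int main(void)
{
    /* operand-size + lock prefixes in front of an add */
    const uint8_t insn[] = { 0x66, 0xF0, 0x83, 0x00, 0x01 };
    printf("%zu prefix bytes before the opcode\n",
           count_prefixes(insn, sizeof insn));
    return 0;
}
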
It's too early to dig for volume numbers, but here are the revenue numbers (not the best, but there they are). Intel(#1) ~= Samsung(#2) + Toshiba(#3) + TI(#4) (within 10%).

Samsung is supported by the Korean government, while Toshiba is supported by the Japanese government (heck, AMD's fab is mostly funded by the Germans).
I asked for evidence of the subsidies you claim, but you haven't yet provided any. Does any such evidence actually exist?

And BTW, AMD are fabless. They outsource all manufacturing because doing so is cheaper (Greater economies of scale, for a start).
Intel one generation ahead? Please. The rest of the industry is operating at the same process nodes as Intel are, and the difference that a process node makes is actually smaller than you think, as much of the benefits go straight out the window with increased leakage current (For reference: Half of the power consumed by a sub-90nm chip is leakage. Half).
Intel is at 32nm, moving to 22.
Samsung shows 45 for their "45 and below" page (although they should be close to 32 at this point)
TI is at 45, readying to move to 32
Toshiba is mostly DRAM.

Leakage is a factor everyone has been dealing with, and it is a fight at every node (90 was one of the worst). We've given up some performance at each node to get leakage under control. But it does reduce area, which increases profit.
Leakage is steadily increasing. At 90nm it was ~25% of current. Today it is 50%. There is nothing you can do about it without changing the element you use as your semiconductor (carbon looks like a good option), because it is simply caused by the inability to switch transistors fully off. There are methods to combat leakage, but they are stopgaps.

And I've not even discussed the work required to mitigate quantum tunnelling ;-)

By the way, older processes are cheaper. In particular, they tend to have better yield, and don't require the exotic materials newer processes do.

Re: Future of CPUs

Posted: Thu Mar 18, 2010 4:28 pm
by nedbrek
Owen wrote:
nedbrek wrote: Consider the following case for an Out-of-Order machine:
p1/r1+flags = r2 - r3
(z) p2/r2 = r4 + r8
(nz) p3/r2 = r5 + r8
ld r6 = [r2/?]

When it comes time to rename r2 for the load, what physical register should the renamer return? p2 or p3? (will it be able to determine that z+nz guarantees overwrite of the old value?) The renamer is a time and space critical structure. It can be pipelined, but you must sustain the machine width in throughput each cycle.

Also, the benefits of conditional execution are mostly had by a single (simple) predicated instruction: cmov (which x86 has).
How is this complexity any different from the following x86 code:

Code: Select all

sub %ebx, %eax       # set the flags
jz 1f                # taken path is out of line
add %ecx, %edx       # not-taken path writes %edx
2:
mov (%edx), %esi     # the load that consumes %edx
...
1: sub %edi, %edx    # taken path writes %edx instead
jmp 2b
(Other than the ARM code being shorter, smaller, and faster)
The x86 code becomes the following (that's AT&T syntax, going by the % and (); I prefer this Itanium-ish syntax, which is clearer):

Code: Select all

1:r0/p1+flags = r0 - r3
2:jz p1/flags, 1f (predict not taken)
3:r2/p2+flags = r2 + r1
4:r6/p3 = load r2/p2
5:r2/p4 = r2/p2 - r7
6:jmp 2b
Jumps are predicted taken or not taken in the front end, and validated by the execution core. They add nothing to the dependency graph (or rather, the graph is constructed assuming they are correct).

No instruction (uop, really) needs more than 2 inputs, and all instructions produce 1 (or 0) outputs.

In the ARM case, every conditional instruction needs an additional input (the old value of the dest, on top of the flags input) in order to forward the data should the condition be false. This serializes the data flow graph (every conditional instruction must wait for all previous producers).

That's 4 clocks for ARM (totally serialized), versus 2 or 3 clocks for x86 with 2 integer ports (2 is optimal, but the machine probably won't promote the ALU op over the branches in priority for the picker).

Clock1: Execute 1 and 2
Clock2: Execute 3 and 6
Clock3: Execute 4 and 5

That's 4 (times the width of your machine) read ports in the rename table (x86 needs 2 per uop). The rename table is effectively a register file, which grows rapidly with the number of ports. Should the size exceed a single clock, you can add all the complexity of pipelining to that. It burns a lot of power.

That's the naive, expensive solution. There are several others (which either hammer performance more, or complicate implementation more)...
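
A quick way to check those clock counts is a toy dataflow simulation (my own sketch: 2 integer ports, single-cycle ops, branches perfectly predicted):

Code: Select all

#include <stdio.h>

#define PORTS 2
#define NONE  -1

struct uop { int dep0, dep1; int done_cycle; };

/* Issue up to PORTS ready uops per cycle, oldest first; every uop
 * takes one cycle. Returns the cycle the last uop completes. */
static int simulate(struct uop *u, int n)
{
    int cycle = 0, remaining = n;
    while (remaining > 0) {
        cycle++;
        int issued = 0;
        for (int i = 0; i < n && issued < PORTS; i++) {
            int d0 = u[i].dep0, d1 = u[i].dep1;
            if (u[i].done_cycle) continue;
            if ((d0 == NONE || (u[d0].done_cycle && u[d0].done_cycle < cycle)) &&
                (d1 == NONE || (u[d1].done_cycle && u[d1].done_cycle < cycle))) {
                u[i].done_cycle = cycle;
                issued++;
                remaining--;
            }
        }
    }
    return cycle;
}

int main(void)
{
    /* ARM-style predicated chain: each conditional uop also reads the
     * old value of its destination, so the four uops form one chain. */
    struct uop arm[4] = {
        { NONE, NONE, 0 },  /* subs                          */
        { 0,    NONE, 0 },  /* addeq (flags + old r2)        */
        { 1,    NONE, 0 },  /* addne (flags + old r2)        */
        { 2,    NONE, 0 },  /* load                          */
    };
    /* x86-style branchy version: the branches drop out of the graph. */
    struct uop x86[6] = {
        { NONE, NONE, 0 },  /* sub (flags)                   */
        { 0,    NONE, 0 },  /* jz                            */
        { NONE, NONE, 0 },  /* add                           */
        { 2,    NONE, 0 },  /* load                          */
        { 2,    NONE, 0 },  /* sub                           */
        { NONE, NONE, 0 },  /* jmp                           */
    };
    printf("predicated chain: %d clocks\n", simulate(arm, 4)); /* 4 */
    printf("branchy version:  %d clocks\n", simulate(x86, 6)); /* 3 */
    return 0;
}
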

Re: Future of CPUs

Posted: Fri Mar 19, 2010 2:49 am
by Solar
paolodinhqd wrote:
Solar wrote:Well... 128 bit might actually be ahead. But I did that "one bit per atom" calculation once and found that, with 128 bit, we're already talking more atoms than the entire earth (IIRC), which makes it pretty much a definite limit. ;-)
Solar, [taking more atoms than the entire earth] <-- you must be great at physics to have calculated this; are you sure about this calculation? =D>
No, I'm not, actually, as I came up with different (but no less impressive) figures when re-doing the maths just now.

I use carbon as reference element (because it removes some conversions from the logic).

The Avogadro constant states that there are 6.022 x 10^23 atoms per 12 grams of carbon.

Amorphous carbon is ~2 grams per cubic centimeter. Wikipedia: Carbon

One cubic centimeter of carbon therefore has ( 6.022 x 10^23 ) / 12 * 2 = ~1 x 10^23 atoms.

Converting from base 2 to base 10, 2^128 = ~3.4 x 10^38.

Dividing 2^128 by the number of atoms in a cubic centimeter of carbon gives ( 3.4 x 10^38 ) / ( 1 x 10^23 ) = 3.4 x 10^15 cubic centimeters.

That is, 2^128 atoms of carbon would take up 3.4 cubic kilometers. Kind of unwieldy for a memory device even if we could use every single atom in that cube to store a byte, as you will agree.

2^256 is just completely ridiculous. That's 10^77, which is larger than the age of the universe in Planck times, and close to the number of protons in the universe (see Googol).

That is why I say we might get to 2^128 once we find that 2^64 (~1.85 x 10^19) is not enough to enumerate every atom in a cubic centimeter of carbon (but why would you want that?). But I cannot imagine any reason, even in a million years, with faster-than-light travel and complete control over quantum mechanics, why 2^256 should become necessary.

Disclaimer: I am aware that there are, even today, architectures with 256-bit and even 512-bit data paths. Especially SIMD engines work with data registers as wide as possible. The calculations above are aimed at the futility of 256-bit addressing.
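
For anyone who wants to re-run the numbers, here is the same arithmetic as a small C program (same constants as above; link with -lm):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void)
{
    const double avogadro   = 6.022e23;  /* atoms per mole           */
    const double molar_mass = 12.0;      /* g/mol, carbon            */
    const double density    = 2.0;       /* g/cm^3, amorphous carbon */

    double atoms_per_cm3 = avogadro / molar_mass * density;
    double addresses     = pow(2.0, 128.0);
    double volume_cm3    = addresses / atoms_per_cm3;
    double volume_km3    = volume_cm3 / 1e15;  /* 1 km^3 = 10^15 cm^3 */

    printf("atoms per cm^3: %.3g\n", atoms_per_cm3);   /* ~1.0e23 */
    printf("2^128:          %.3g\n", addresses);       /* ~3.4e38 */
    printf("volume:         %.3g km^3\n", volume_km3); /* ~3.4    */
    return 0;
}
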

Re: Future of CPUs

Posted: Fri Mar 19, 2010 4:58 am
by Benk
paolodinhqd wrote:I'm thinking of the future for OSes, and so
I think of the future for CPUs.
CPUs are reaching their limit and breaking Moore's law.
That's why Intel and other manufacturers are trying to
add more and more cores to a single CPU.
The time for Moore's law is running out. [-X

We already have 64-bit CPUs,
and I believe in the next 100 years
there won't be such a thing as 128 bit, :roll:
except for servers processing IPv6
and some other computer systems serving research.

The memory limit of 64 bits is enough.
That's a one with twenty zeros behind it. =D>

What do you guys think? :)

Current CPUs already have 64-bit MMX and 128-bit SSE vector instructions.

The latest CPU from Intel is 6-core with HT, so that's 12 threads per CPU. This will continue, which will mean OSes become more asynchronous / message-passing.

Re: Future of CPUs

Posted: Fri Mar 19, 2010 5:43 am
by Solar
See my post above. There's a difference between the width of data registers, and the width of the address bus.
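
The distinction is easy to see from C (assuming an x86 compiler with the SSE2 headers available):

Code: Select all

#include <stdio.h>
#include <emmintrin.h>  /* SSE2: 128-bit XMM registers */

int main(void)
{
    /* Address width and data-register width are independent:
     * a 32- or 64-bit pointer happily coexists with 128-bit SIMD. */
    printf("pointer: %zu bits\n", sizeof(void *) * 8);
    printf("__m128i: %zu bits\n", sizeof(__m128i) * 8);
    return 0;
}
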

Re: Future of CPUs

Posted: Sun Apr 04, 2010 12:23 am
by TheDragon
I really think that before we get any higher, we are going to have to get everyone on 64-bit processors and operating systems first. And, yes, I do think that eventually we will get to some point where we can't make the blasted transistors any smaller, but we just might come up with something to get around that when we get there. Perhaps a three-way transistor (base, collector, emitter, other-unknown-pin). Perhaps we will switch completely to quantum computers, or chemical computers, or ternary, once we have exhausted all digital binary options. If we take the "increase the base" method, wouldn't it be fun to one day meet a computer that actually processed in base 10, just like you and I do? (Of course, that would mean they would get higher numbers, and we would have some sort of weird base-27 computer way off in the future. :shock: )

But for the time being, I think we should worry about getting the 64-bit processors out there on the shelves and running. It's all about the consumers.

Re: Future of CPUs

Posted: Fri Apr 09, 2010 4:01 am
by Neolander
As a physics student, I hope that photonic computing will catch up someday, and here are some reasons why:
-Two photons of sufficiently different wavelength don't interact significantly with each other, or with the same absorbing materials. Technically, this means that we can imagine literally putting two CPU cores at the same place in space. Now that's what I call parallelism! :twisted:
-There is very little heat generation in optical computers, so heat dissipation is no longer an issue. This means no more noisy fans, and circuitry making more use of the 3rd dimension of space (instead of using large flat motherboards and ICs, we could imagine small cubic computers).
-Photons are excellent for quantum computing, because they interact weakly with matter and hence may keep wave-function coherence over very large space scales. Optical quantum cryptography manages to send signals over hundreds of kilometers nowadays and is nearly mature enough for commercial use, and the recent factorization of 15 at Bristol was done using an optical quantum chip. So if we go into photonic computing for classical computation and improve optical technology now, we may make the move to optical quantum computing way easier later.
-Photons move faster and can transmit information over larger space scales than electrical signals. That's why optical fibers get so much love nowadays. Electrical buses are getting closer and closer to their physical limits, and Intel are already doing research on using photons for inter-core communication in their new CPUs. Now what's the point of using electric circuitry in computers to handle optical signals? Signal conversions are expensive from many points of view; all-optical information processing would let us avoid them.

Sure, optical computers will be larger in the beginning, because the diffraction limit is higher for the optical signals we can generate today than for electric signals. But doesn't all that sound interesting enough to give this technology a little chance? [-o<

Re: Future of CPUs

Posted: Fri Apr 09, 2010 1:39 pm
by 54616E6E6572
Here is the CPU of tomorrow, today.
1. Memristors -- http://www.nytimes.com/2010/04/08/science/08chips.html
2. Quantum ALU -- http://www.sciencenews.org/view/generic ... er_created
3. Optical Buses -- http://sbir.nasa.gov/SBIR/successes/ss/5-013text.html and http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf

We are already using links running at 2.5 GT/s per lane (see PCI Express; that's roughly 250 MB/s of payload per lane after 8b/10b encoding), but eventually we will have to switch to a mechanism where we can transfer data at the speed of light, so optics are the way to go. Data processing is becoming increasingly expensive, and to speed it up we must use superposition states with a quantum ALU. The amount of information in the world is increasing exponentially, so we must find a new way to store data; memristors (essentially holographic storage) are the way to go, as we could potentially store 20+ GB per cubic centimeter, or yobibytes of information in something the size of a small dresser. This is the CPU of tomorrow, today. Enjoy!
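
For the curious, the payload arithmetic for a PCI Express 1.x link (assuming the standard 2.5 GT/s lane rate and 8b/10b coding) looks like this:

Code: Select all

#include <stdio.h>

int main(void)
{
    /* PCI Express 1.x numbers: 2.5 GT/s per lane, 8b/10b line coding
     * (10 bits on the wire for every 8 bits of data). */
    const double raw_gbps   = 2.5;        /* gigatransfers/s = Gbit/s raw */
    const double coding_eff = 8.0 / 10.0; /* 8b/10b overhead              */

    for (int lanes = 1; lanes <= 16; lanes *= 2) {
        double mbytes = raw_gbps * coding_eff * lanes * 1000.0 / 8.0;
        printf("x%-2d link: %6.0f MB/s per direction\n", lanes, mbytes);
    }
    return 0;
}
/* x1 = 250 MB/s ... x16 = 4000 MB/s: electrical links still have
 * headroom, but the per-pin ceiling is what drives optics research. */
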

Re: Future of CPUs

Posted: Sat Apr 10, 2010 4:04 am
by Neolander
*______________*

I want that in my computer NOW! :mrgreen:

(However, I think that quantum computing needs to work at 300K before being ready for everyday use. Personally, I'd rather have a perfectly silent CPU of reasonable size, with an average of 3W energy consumption and the power of my 3000+, than an über-fast quantum CPU which needs 1 m^3 of phase-change and laser coolers to work ^^)

Re: Future of CPUs

Posted: Sat Apr 10, 2010 12:24 pm
by sebihepp
Well, I think in the future memory will grow enormously. The entire hard disk will be preloaded into RAM for much faster access. And the cache on the chips will grow, too.
Perhaps we will have one or more cores for each thread in future. Parallel processing will be the most important thing. Every core gets 4 GByte of memory and there will be a shared 32 GByte of memory for all cores. The applications will minimize traffic between cores, and every piece of hardware will be more intelligent; for example, the hard disk itself decides which data in RAM has to be written back. The processor only sends an interrupt to the network card and the card will handle the rest. Our CPUs will be able to perform many more instructions like dot product or matrix multiplication within one instruction, and cycle counts will be reduced. In the past most communication was parallel. Now we use mostly serial interfaces, because of much higher transfer rates. But in the future we will get back to parallel: 32 wires in, 32 wires out, and let's say 12 wires for other purposes in the future network cable. At very high traffic we will have multiple cables in parallel, and the devices will use them dynamically. Perhaps one core handles sending and another handles receiving.

That is what I dream. :D

Re: Future of CPUs

Posted: Sat Apr 10, 2010 4:33 pm
by Love4Boobies
sebihepp wrote:Well, I think in the future memory will grow enormously. The entire hard disk will be preloaded into RAM for much faster access. And the cache on the chips will grow, too.
Hmm - there are some problems with that. Right now, disks are used to create the illusion of more RAM. We have a lot of disk space because it is dirt cheap. We are far from a breakthrough; memory is getting less expensive, but disk is doing so at an even faster pace. Also, there is the problem of write-through and power/hardware failure. I think it might be more plausible to have a major breakthrough in non-volatile memory design such that it can replace volatile memory (such as RAM) completely.
Perhaps we will have one or more cores for each thread in future.
If we get more cores, we will want more threads. More than one core per thread is a silly idea. It would be nice if we came up with something different than threads for parallelism.
Parallel processing will be the most important thing.
To a point, definitely. However, some problems are sequential in nature - and even if they're not, it's enough for one part to be sequential to cause a bottleneck for the whole parallel system. We are still far from reaching that point though.
Every core gets 4 GByte of memory and there will be a shared 32 GByte of memory for all cores.
Even a 32-bit x86 CPU has access to more than 4 GiB of memory (via PAE). As for shared memory, it did cause a lot of fuss a few years ago; we've learnt a lot about the advantages of message passing since...
The applications will minimize traffic between cores, and every piece of hardware will be more intelligent; for example, the hard disk itself decides which data in RAM has to be written back. The processor only sends an interrupt to the network card and the card will handle the rest. Our CPUs will be able to perform many more instructions like dot product or matrix multiplication within one instruction, and cycle counts will be reduced.
What you're describing is mainly NUMA, and it's happening today. That's the main reason for the death of video cards. Specialized CPUs are employed for many things such as graphics/sound processing, NICs, etc.
Perhaps one core handles sending and another handles receiving.
That's usually what happens in communication (unless one endpoint is a peripheral) :wink: You don't send stuff from core A to core A, do you?