OSDev.org

Posted: **Sat Jul 31, 2010 3:12 pm**

bontanu wrote:
Owen wrote: Complicating the processor if the application needs/benefits from it is not an issue
Yes... but it is not really needed, it is just an old custom, and it does occupy a lot of area in the CPU core; as the number of cores increases, this area becomes a drag.

And there is another problem: those additional layers reduce speed by approximately 15% - 30%, and in a world where CPU frequency has reached its upper limit this is a further problem.

My estimate is that paging and memory protection as we know it today will be dropped and instead we will see many more simple and faster cores that can be configured to execute a single process into a "zone" by a master CPU.

This trend is already visible.

You vastly overestimate the impact paging has on speed: Next to none. It adds latency, yes, but does not reduce speed

bontanu wrote:
Why must the address adders be able to perform 64 bit addition? You can cut out most of the logic for the bits above 47, reduce gate delays and speed things up
I suggest that you understand electronics and CPU architecture better.

The address calculations for the instructions and operands are performed BEFORE the paging mechanism by the CPU.

Paging is only a translation mechanism that converts a logical address into a physical address.

Hence if you think a little you will observe that the adders already perform full 64-bit additions (and yeah, this does slow down 64-bit long mode).

For your example: mov rax, [rsi + 4*rbx + my_offset_32]

Here I can set up values in RSI and RBX that, added together, will generate a linear address of your choice... with whatever bits set or reset... obviously the adder must be able to perform the addition (think about LEA) even IF later on the paging mechanism will generate an address that is not canonical and eventually an exception.
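For readers following the address arithmetic in this exchange, here is a minimal Python sketch of the calculation being argued about. It assumes 48 implemented virtual address bits (so bits 63..47 must all be equal for an address to be canonical); the function names and the register values are made up for illustration and are not taken from any manual or from either poster.

```python
MASK64 = (1 << 64) - 1  # addresses wrap at 64 bits

def effective_address(base, index, scale, disp):
    """Compute base + scale*index + disp, wrapped to 64 bits."""
    return (base + scale * index + disp) & MASK64

def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all zeros or all ones."""
    top = addr >> (va_bits - 1)                  # 17 bits when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones

# Hypothetical register contents: a carry out of the low bits flips bit 47.
rsi = 0x00007FFFFFFFF000
rbx = 0x400
ea = effective_address(rsi, rbx, 4, 0)
print(hex(ea), is_canonical(ea))
```

With these values the sum carries into bit 47 while bits 63..48 stay clear, so the result is not canonical even though both inputs are small. Whether the hardware needs a full-width adder to detect that, or can detect it with cheaper logic, is exactly the point the two posters disagree on; the sketch only shows the arithmetic.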

Besides the current address limitations will most likely be removed and they will not redesign the CPU with each bit of address added.

Paging is not there to simplify the CPU. The limits on the physical address canonical form are, but they will slowly be removed.

Bits propagate through the adders via carry, and NO, you cannot cut off the bits above 47.

Please stop assuming I don't understand electronics. I do, very well.

You can set up the logic in order to detect faults without calculating the results. This can be very fast. The knowledge required to do such things is very well known in the semiconductor industry, and I would expect AMD and Intel to have it.

As for performance? 64 bit arithmetic does not affect load-store performance. Most of the maths is done in parallel with fetching the TLB entries. The other bits only need to match the latency of the TLB. You can delay taking the exception for quite a while.

bontanu wrote:
Every mode switch requires circuitry to appropriately maintain the pipelines to preserve the image of in-order execution. That's a non-trivial cost, particularly in testing
Yes but the mode change is only done once at OS startup and left in place for ever after. The circuits are bad at predicting this even today. You are required to help them a little by writing bits in control registers and performing a long jump. It is not something that is expected to happen at every instruction... not even twice in an hour ... hence not a big problem.

Stop trying to justify forcing paging on long mode from an architectural point of view. There is no logical reason to have paging forced in there... nothing other than a very small benefit and mostly because "AMD said so" when they designed long mode.

You're missing something: Intel and AMD have to test their CPUs to make sure that under no conditions does a valid instruction cause an error. In other words, that an OS generally does something only at CPU startup does not reduce testing load. Prediction does not come into this; enforcing full pipeline serialization does.

bontanu wrote:
Segmentation will only be resurrected if you kill C, C++ and Fortran.
Well you do not understand the concept of segmentation.

It is already "resurrected" in the new Cell CPUs and in Sun's CPUs. They have a "zone" that is pretty much the same as a segment (and more) reserved for the process / CPU. Yes, the compilers have to adapt a little (more for Cell, less for Sun)... Not much to change if the segments are FLAT. Practically, the application does not know or care, but the security is better.

Our applications compile in C/C++ with no problem on Sun CPU's... no change is required in code for this feature hence I think you are considering segmentation like something you have learned about the 8086 real mode in 1990 ...
those things evolve you know

Hence there is no connection with killing C/C++ or whatever HLL programming language of your choice.

Cell's design has nothing like CPUs or segmentation; nor is the SPE design general. You might see SPE style coprocessors developing, but they'll have limited niche.

The general purpose CPU rules.

As to Sun's "zones": Care to link to a product? This is certainly not a feature of any product on their roadmap. Unless you're referring to Solaris zones (which would mean Sun were being very confusing); they are a completely unrelated product.

bontanu wrote:
Well... C was the fastest growing programming language this year. I think you may be waiting a while here...
I do program every day in C ... it is my job

Even on new architectures, backwards compatibility is king.
Yes of course unless it is not... modern segmentation does not break any backward compatibility. Paging is also transparent to applications and even to ASM code hence with or without it the application land does not care...

But you do have a point in the fact that most mainstream OSes rely heavily on paging for memory management. However this can be changed underneath (as IBM does or did), and the gain in CPU speed and simplicity might be worth complicating the OS memory management code a little... and it will allow more cores per CPU... we will see.

As for my claim of unpaged access existing for backwards compatibility - you just proved my point! No mainstream OS requires it, therefore there is no reason to complicate the processor supporting it in new models.
Yes, I agree that this was probably the reason behind AMD dropping it. And yes, there is a minimal gain... a very small gain in the front end of the CPU.

But your claim that it complicates things does not stand, because this "mode" is the simple base on which the "paging" mode is built.

Hence long mode simply does not give the "user" access to the simpler mode, but the CPU still has to have this mode internally as a base for the more complicated paging mode to exist.

I will give you a much "better" argument: you can enable paging BUT make it identity mapped in order to simulate non-paging access... (of course this also has a problem, but a much smaller one)
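As an illustration of the identity-mapping idea, here is a small Python model of a single page directory filled with 2 MiB entries that map each virtual address to the same physical address. It is only a sketch: it covers the first 1 GiB, ignores the upper table levels and attributes such as caching or no-execute, and the helper names are hypothetical. The flag positions used (present = bit 0, writable = bit 1, page size = bit 7) follow the usual x86-64 entry layout.

```python
PRESENT = 1 << 0       # entry is valid
WRITABLE = 1 << 1      # writes allowed
PAGE_SIZE = 1 << 7     # entry maps a 2 MiB page directly
TWO_MIB = 2 * 1024 * 1024

def identity_page_directory(entries=512):
    """Build PD entries mapping virtual [0, entries * 2 MiB) onto itself."""
    return [(i * TWO_MIB) | PRESENT | WRITABLE | PAGE_SIZE
            for i in range(entries)]

def translate(pd, vaddr):
    """Model the last step of a walk for a 2 MiB mapping."""
    entry = pd[(vaddr >> 21) & 0x1FF]      # bits 29..21 select the entry
    if not entry & PRESENT:
        raise LookupError("page not present")
    frame = entry & ~(TWO_MIB - 1)         # strip the flag bits
    return frame | (vaddr & (TWO_MIB - 1))

pd = identity_page_directory()
print(hex(translate(pd, 0x12345678)))
```

Every lookup returns the address it was given, which is what "simulating non-paged access" amounts to; the translation hardware is still exercised on each access.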

Really, paging eats up a lot of resources, both in speed / latency / delays and in CPU die area... and this is NO longer acceptable with the upper limit we have on CPU speeds... hence in time it will disappear. The single address space research done by Microsoft with Singularity shows this trend.

In consequence I would not base my OS architecture on paging and /or on memory protection schemes of today.

Paging's effect on memory bandwidth is laughably small; CPUs have optimized TLBs, and there are various paging systems inside modern CPUs in order to maximize performance. A reliable OS cannot be built without paging unless you use managed code, and unmanaged code is still king.

As for the latency issues of paging? For most code, about 1 or 2 cycles. Maybe 10 if you have to hit L2 cache. When you're talking about memory latencies of 100s to 1000s of cycles... well, it's in the noise floor.
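A back-of-the-envelope way to look at the latency figures above is a weighted average of the hit and miss costs. The cycle counts below are the ones quoted in the post; the 99% hit rate is purely an assumed, illustrative figure, not a measurement.

```python
def average_translation_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected translation cost per access as a weighted mean."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

# 1 cycle on a hit, 10 cycles when L2 has to be consulted (from the post);
# hit rate of 0.99 is a hypothetical value chosen for the example.
print(average_translation_cycles(0.99, 1, 10))
```

Under those assumptions the average stays close to one cycle, roughly 1.09. A workload with a much lower hit rate would shift the result toward the miss cost, which is the case the other side of the argument is pointing at.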

Posted: **Sat Jul 31, 2010 4:21 pm**

In consequence I would not base my OS architecture on paging and /or on memory protection schemes of today.

In all honesty, what proportion of an OS is actually concerned with paging or memory protection? I have less than 300 lines that have anything to do with paging or memory protection. My small O/S has around 6k lines so far. Paging is 5% of that and dropping.

My understanding is that the L1 and L2 caches in x86_64 operate on virtual addresses. The TLB and other page translation hardware are only used when the L1 and L2 caches miss, and that all happens in parallel with the L3 cache lookup. Because of that I just don't understand how it matters speed-wise that we use paging.

As we get more cores, why not just make the L1 and L2 caches bigger and design our operating systems to miss the caches less often?

Posted: **Sat Jul 31, 2010 7:55 pm**

gerryg400 wrote:
In consequence I would not base my OS architecture on paging and /or on memory protection schemes of today.
In all honesty, what proportion of an OS is actually concerned with paging or memory protection? I have less than 300 lines that have anything to do with paging or memory protection. My small O/S has around 6k lines so far. Paging is 5% of that and dropping.

My understanding is that the L1 and L2 caches in x86_64 operate on virtual addresses. The TLB and other page translation hardware are only used when the L1 and L2 caches miss, and that all happens in parallel with the L3 cache lookup. Because of that I just don't understand how it matters speed-wise that we use paging.

As we get more cores, why not just make the L1 and L2 caches bigger and design our operating systems to miss the caches less often?

All x86 caches are physically tagged. Virtual tagging creates coherency problems (Must flush caches on page table change, must not have same address duplicated in cache). ARMv5s (ARM9s) had virtually addressed caches; with the memory system overhaul of ARMv6, they moved to physically addressed too.

Of course, the processor has other things to do in parallel with the TLB lookup, so it's not a big issue. (For a start, the TLB and cache lookups can proceed concurrently.)
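One common way the TLB and cache lookups can overlap is a virtually indexed, physically tagged cache: if the bits that select the cache set all fall inside the page offset, the set can be read while the TLB produces the physical tag. The check below is a simplified sketch of that condition; the cache geometries are assumed examples, not the specification of any particular CPU.

```python
def index_fits_in_page_offset(cache_bytes, ways, page_bytes=4096):
    """True if one way of the cache is no larger than a page.

    In that case the set index uses only page-offset bits, which are
    identical in the virtual and the physical address.
    """
    return cache_bytes // ways <= page_bytes

# Assumed geometries, for illustration only.
print(index_fits_in_page_offset(32 * 1024, 8))   # 4 KiB per way
print(index_fits_in_page_offset(64 * 1024, 2))   # 32 KiB per way
```

The first geometry satisfies the condition and the second does not. A cache that fails the check needs some other mechanism (or must wait for translation) to avoid the aliasing problems mentioned above.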

Posted: **Sat Jul 31, 2010 7:58 pm**

Owen wrote: You vastly overestimate the impact paging has on speed: Next to none. It adds latency, yes, but does not reduce speed

You vastly underestimate the impact that paging has on CPU speed and other operations:
a) something has to be added and checked for each instruction that references memory. You can claim that this takes "no time" and "no resources" but that is a lie.
b) TLB and cache misses do happen, and then things slow down a lot
c) the paging circuits add delays and make the signal paths longer
d) the complicated paging circuits generate heat that has to be dissipated and reduces overall CPU performance
e) the paging circuits and TLB cache occupy a lot of area on the CPU and this reduces performance or area available for other cache

Please stop assuming I don't understand electronics. I do, very well.

Oh, well if you do then you do hide it very well indeed.

You can set up the logic in order to detect faults without calculating the results.

Only in certain circumstances. This does not always work.

This can be very fast.

Not in this case.

The knowledge required to do such things is very well known in the semiconductor industry, and I would expect AMD and Intel to have it.

Irrelevant statement for our debate. You create a scenario and then you "expect something"... dreaming.

Have you checked what I was hinting at with the LEA [esi +4*ebx +offset_32] instruction?
Do you realize the consequences?

Do it again... Hint: can this instruction generate an exception or not? (Intel manual page).

a) Does it have to add 64-bit numbers that represent addresses BEFORE going into paging translation?
b) Does it have to return me a valid result?
c) Is there an exception (delayed or not) generated for an invalid address? (bit 47)

As for performance? 64 bit arithmetic does not affect load-store performance.

Arithmetic does not, but 64-bit pointers and values do hurt load/store a little... size does matter after all.

Then there is heat and die area.

Then you cannot schedule more loads/stores/fetches if the arithmetic pipeline / units are full (remember that you need to add for paging), and you cannot consume loaded data/instructions if the arithmetic units are busy / full; idle load and store units do not help speed.

The additions in 64 bits for paging are much slower than the additions in 32 bits for the same paging.

But not doing those additions for non-paged mode is the maximum speed-up, and it opens up time and slots for other actions.

Removing the paging circuits removes a lot of heat and frees up a lot of on-die space.

Most of the maths is done in parallel with fetching the TLB entries. The other bits only need to match the latency of the TLB. You can delay taking the exception for quite a while.

Right. This is the old argument that you can perform some actions with zero costs and zero resources. If you are able to believe it ...

You're missing something: Intel and AMD have to test their CPUs to make sure that under no conditions does a valid instruction cause an error. In other words, that an OS generally does something only at CPU startup does not reduce testing load.

No I do not. The testing has to be done always because the non-paging mode is a base prerequisite of paging-mode. Hence by removing the mode you do not remove any tests or you remove a very small set of tests.

In fact you do not have to perform additional tests in order to test this non-paging "mode".

If you refer to the few instructions needed (or not needed) to setup such a mode... then this is one instruction in 600 others at max and anyway the paging mode and TLB requires much more complicated testing and provides much more points of failure.

Testing is done at production time by sampling a few CPU's in the lot. Paging will steal resources and time from all of the users for the lifetime of all sold CPU's.

Besides there is no testing needed for non paging access... but there is a LOT of testing needed for paging.

Prediction does not come into this; enforcing full pipeline serialization does.

Full pipeline serialization is needed for all other mode switches, including PAGING, and for some instructions... hence it is irrelevant, since it has to be tested anyway.

Paging's effect on memory bandwidth is laughably small; CPUs have optimized TLBs, and there are various paging systems inside modern CPUs in order to maximize performance.

Not true. It is not uncommon to see 5% to 15% overall slowdown (or more in special cases) but now we can not measure it anymore on x64 can we? because they removed the "other" mode completely

The problem is with task switching, which requires loading different TLBs and paging tables every now and then. You can mitigate it a little, but it is still a big speed problem with many CPUs and many threads / processes executing.

A reliable OS cannot be built without paging unless you use managed code, and unmanaged code is still king.

I believe it is very possible to build a very reliable OS without paging and without managed code. After all paging only affects memory management techniques and the alternatives are very well known to me.

However I do agree that paging does spoil you a little

Posted: **Sat Jul 31, 2010 8:36 pm**

I believe it is very possible to build a very reliable OS without paging and without managed code. After all paging only affects memory management techniques and the alternatives are very well known to me.

Would you care to tell us how you would implement protection between tasks without using hardware (i.e. paging, segmentation) or software (i.e. managed code) protection?

Posted: **Sat Jul 31, 2010 8:37 pm**

Oh come on, you're being incredibly dense.

Adding a new mode requires only some already largely tested mode switches to be tested?

Wrong. It requires testing those mode switches with huge variants of code on either side. It requires testing every instruction in the architecture, and all of the potential corner cases and worst case scenarios of it. It is expensive beyond belief.

And you're putting words in my mouth. "Right. This is the old argument that you can perform some actions with zero costs and zero resources. If you are able to believe it ...". I never said that.

Really, paging logic is, compared to many other components of a CPU, quite tiny. It doesn't even show up on CPU floor plans, unlike the SIMD units (which are quite huge), or caches (Which are well over 50% of the die).

But the thing I find most absurd is that you expect Intel or AMD to remove features from a CPU.

Riiight. When they're selling CPUs with features to be backwards compatible with code from nineteen-freaking-seventy-nine.

Posted: **Sun Aug 01, 2010 4:47 am**

Owen wrote: Oh come on, you're being incredibly dense.

Adding a new mode requires only some already largely tested mode switches to be tested?

Wrong. It requires testing those mode switches with huge variants of code on either side. It requires testing every instruction in the architecture, and all of the potential corner cases and worst case scenarios of it. It is expensive beyond belief.

What I say is that this "non paging" mode that they removed is already there as a basis of the CPU functionality, hence it is tested anyway, and you gain very little by not exposing it to the system programmer.

Let us see what it requires:

So the logical address has to be calculated for each instruction that reads/writes memory and for opcode fetches.

The "non paging mode" only requires that you output this address from EIP directly to the address bus (of course buffered etc.). This is very simple to test.

Unfortunately the additional "paging mode" requires that you process the logical address in EIP by going through a few (3-4) levels of page tables and directories in order to obtain another physical or virtual address. This is very complicated to test (a lot of tables to set up for each test).
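To make the "few levels of page tables" concrete, here is a short Python sketch of how a 48-bit virtual address is split into the four table indices and the page offset used by the standard 4-level, 4 KiB-page long mode layout (9 bits per level, 12 bits of offset). The function name and the dictionary keys are illustrative choices only.

```python
def split_virtual_address(vaddr):
    """Split a 48-bit virtual address into 4-level paging indices."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # bits 47..39
        "pdpt": (vaddr >> 30) & 0x1FF,   # bits 38..30
        "pd": (vaddr >> 21) & 0x1FF,     # bits 29..21
        "pt": (vaddr >> 12) & 0x1FF,     # bits 20..12
        "offset": vaddr & 0xFFF,         # bits 11..0
    }

print(split_virtual_address(0x00007FFFFFFFF123))
```

Each index selects one of 512 entries in its table, and each level's entry supplies the base of the next table. That chain of dependent lookups is what a test has to set up, and what a TLB hit lets the hardware skip.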

Hence the fact that you remove "non paging mode" saves what testing?

Let us see in more details:
---------------------------
Well, in order to have this mode you do need a flip flop that is reset at power up and the output of this flip flop would enable/disable a 2:1 mux on the final address output buffers.

One entry to this 2:1 MUX is directly taken from EIP register and the other input is taken from the results of the "paging" circuits.

So what AMD "saved" with not allowing non paged access in long mode is just a flip flop and a 2:1 MUX. Everything else has to exist on die.

However those circuits have to exist for the 32-bit protected modes... and... they also have to exist for the SMM CPU mode... hence we gain almost nothing in size and testing from not having this mode exposed to the system programmer.

Hence the gain is that we do not have to test whether ONE flip-flop and a 2:1 MUX work in long mode.

But we do have to test this in 32-bit protected mode and in SMM mode and in real mode (for disabling the whole paging mechanism)...

In consequence we skip testing ONE flip-flop for ONE of the FOUR CPU modes...

Yeah, this is "a gain" but a very small gain IMHO.

Now on instruction testing:
-------------------------------
Most instructions are NOT related to this "paging" versus "non paging" mode because all of the paging translation is performed AFTER instructions are decoded and executed. Instructions are unaware of paging being on/off because most of them operate on logical addresses.

Only the instructions that would be used to setup such a mode must be tested... aka a write of a bit in one of the CRxx registers.

Then we "must" (not really) also test a few paging related instructions like INVLPG, but they can perform "nothing" or "whatever" in such a "non paged" mode... and you have to test them for 32-bit protected mode and SMM mode anyway.

You can simply write in the manual: "this instruction has an undefined behavior if used when CPU is in long non-paged mode"

and skip testing them completely.

And you're putting words in my mouth. "Right. This is the old argument that you can perform some actions with zero costs and zero resources. If you are able to believe it ...". I never said that.

Please excuse if I do so... this is how I have perceived your arguments: that you can calculate / perform the paging mechanism with "almost" zero impact on speed and delays.

I argue that this speed/delay cost is no longer so small, and we cannot ignore it anymore now that our CPUs have reached an upper speed limit.

Really, paging logic is, compared to many other components of a CPU, quite tiny. It doesn't even show up on CPU floor plans, unlike the SIMD units (which are quite huge), or caches (Which are well over 50% of the die).

Comparing with cache and SIMD is unfair. Cache is a large memory and SIMD units have many large registers (128 or 256 bits) and have to perform a lot of operations on those registers.

But I do agree that you have a valid point here regarding the surface gains by removing paging.

But paging being nested with multiple sequential levels of access (in order to reduce size) might add to the delay arguments.

And the TLB caches might be visible on the floor plan.

But the thing I find most absurd is that you expect Intel or AMD to remove features from a CPU.

To clarify: NO, I do not expect them to remove features from a CPU. I did not expect them to remove non-paging mode from x64 long mode either, but they did.

It was just a theory about what would be a possible line of evolution of the CPU's. I think that I detect seeds for this state of mind in the industry but of course I might be wrong.

The paging structures become bigger and more nested / layered with more RAM and address space in x64. Hence I also think that in the future we need a way to specify a zone (or if you like, "a segment") that has a base and a size and some translation and protection method set up for the whole "zone", instead of having it for each "page".

Riiight. When they're selling CPUs with features to be backwards compatible with code from nineteen-freaking-seventy-nine.

I also acknowledge the huge importance of backward compatibility for business success. However, innovation has its role also...

And AMD just dropped the non paging mode... hence small things are possible.

Of course, IMHO they dropped the wrong thing... but hey, it's a start!

Posted: **Fri Aug 27, 2010 6:43 am**

skyking wrote: Is there a known reason why AMD chose to do it this way

Yes: canonical addresses.

JAL


Why does long mode require paging.
