Can overlap segments achieve protection?

Brendan · Post by **Brendan** » Thu Feb 19, 2015 5:27 am

Hi,

rdos wrote:
Brendan wrote:
rdos wrote:That makes no sense as this is only part of 32-bit code where upper halves are NOT available without doing far jumps which essentially will stop out-of-order execution.
You're making the mistake of assuming most of the CPU cares if you're running 16-bit, 32-bit or 64-bit code. It doesn't.
It does because loading a 32-bit register in 64-bit mode does NOT clobber the upper half of it. If it did, there would be no sense in having 32-bit operand overrides.

Every 32-bit operation done (in 64-bit or 32-bit code) will wipe out the upper 32 bits of the destination register.

rdos wrote:
Brendan wrote:Um, what? There's no reason you can't use full 64-bit pointers, it's just slower. Fortunately it's extremely rare (I doubt I've ever seen an executable file that's larger than 4 GiB) so no sane people care that it's slower.
Which means that basically all 64-bit designs are wasting hardware and performance with 4 level paging when nobody cares for more than 2 levels. We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead. To me it seems like software developers simply are not using the hardware features, which in this case is due to high-level compilers and linkers that cannot handle the setup.

You can have a 4 GiB executable that uses 100 TiB of dynamically allocated space. For the dynamically allocated stuff you can't use "hard-coded" addresses in the instruction itself (and need to store pointers to it because you don't know the address at compile/link time) so the "+/- 2 GiB" limit for immediate addresses doesn't make any difference at all for that case.
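A rough sketch of that distinction in Python (illustrative only; `fits_in_disp32` and the example addresses are made up for this sketch and do not come from any toolchain): a signed 32-bit field can only express offsets within about +/- 2 GiB, while a pointer held in a register or in memory is a full 64-bit value.

```python
# Range of a signed 32-bit displacement/immediate, as used for
# addresses encoded directly in an instruction.
DISP32_MIN = -(1 << 31)
DISP32_MAX = (1 << 31) - 1

def fits_in_disp32(offset):
    """True if `offset` can be encoded as a sign-extended 32-bit field."""
    return DISP32_MIN <= offset <= DISP32_MAX

# Hypothetical example values (assumptions for illustration only):
code_to_data = 1 << 30     # static data 1 GiB away from the instruction
heap_address = 100 << 40   # a heap block somewhere around 100 TiB

print(fits_in_disp32(code_to_data))   # static data: encodable in the instruction
print(fits_in_disp32(heap_address))   # heap block: needs a stored 64-bit pointer
```

The heap address never has to fit in the instruction encoding, because the program loads it from a pointer at run time.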

rdos wrote:
Brendan wrote:Note: I suspect that you're trying to blame AMD because your OS is poorly designed, and I really do think it's unreasonable to blame AMD for your mistake.
Not at all. I blame the GCC team as they are the ones that haven't implemented this in a way that allows me to exploit it. The hardware is perfectly functional while the software (C compiler and linker) is not.

GCC is GNU's compiler designed for GNU's systems (just like MSVC is Microsoft's compiler designed for Microsoft's systems, and IBM XL C/C++ is IBM's compiler designed for IBM's AIX systems). Your compiler (the one you designed for your OS) doesn't exist; so if GCC's "-mcmodel=large" option doesn't work for your obscure special case then perhaps you should just write your own compiler.

rdos wrote:Tested it, and it's not sign-extension as previously claimed:

I'm not too sure who claimed that (in which context). Typically data is zero extended, but addresses are sign extended. For example, if you did "mov eax,-1" you'd end up with "RAX = 0x00000000FFFFFFFF" (zero extended) and if you did "lea rax,[-1]" (where the -1 is a 32-bit immediate) you'd end up with "RAX = 0xFFFFFFFFFFFFFFFF" (sign extended). In both cases the instruction has no dependency on the previous value of RAX, so it's not as slow as (e.g.) "mov ax,-1" would be.
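To make the three cases concrete, here is a small Python model of the register writes described above. It is only a sketch of the architectural result, not of how a CPU implements it, and the function names are invented for this example.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def write_reg32(old_rax, value):
    """Model 'mov eax, imm32' in 64-bit mode: result is zero extended,
    so the old contents of RAX do not matter."""
    return value & 0xFFFFFFFF

def write_reg16(old_rax, value):
    """Model 'mov ax, imm16': only the low 16 bits change, so the result
    depends on the old contents of RAX."""
    return (old_rax & ~0xFFFF & MASK64) | (value & 0xFFFF)

def lea_disp32(disp):
    """Model 'lea rax, [disp32]': the 32-bit displacement is sign
    extended to 64 bits."""
    disp &= 0xFFFFFFFF
    if disp & 0x80000000:
        disp -= 1 << 32
    return disp & MASK64

old = 0x1122334455667788  # arbitrary previous value of RAX
print(hex(write_reg32(old, -1)))
print(hex(write_reg16(old, -1)))
print(hex(lea_disp32(-1)))
```

Only `write_reg16` actually reads `old_rax`, which mirrors the dependency on the previous register value that makes the 16-bit write slower.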

rdos wrote:I think this does explain why 32-bit code manipulating 32-bit registers will zero upper half of the 64-bit register. This seems to be a "feature" (rather bug) of long mode.

What this essentially means is that general registers that need to be preserved and do not return values must be saved before calling an unknown handler in 32-bit mode (as it cannot save the upper half if it uses a 32-bit register), but that if a value is returned there is no need to bother about the upper half as it will automatically be cleared when the 32-bit code loads that particular register.

Who cares? It's not like you can do "push eax" in 64-bit interrupt handlers - you save/restore the 64-bit registers regardless of whether the interrupted process was 32-bit or not.

Cheers,

Brendan

Octocontrabass · Post by **Octocontrabass** » Thu Feb 19, 2015 5:39 am

rdos wrote:We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead.

Good luck running a program that malloc()'s 6GB in a single call with only 32-bit addresses.

rdos · Post by **rdos** » Thu Feb 19, 2015 6:10 am

Octocontrabass wrote:
rdos wrote:We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead.
Good luck running a program that malloc()'s 6GB in a single call with only 32-bit addresses.

Yes, but that is a special case. Most applications will not need more than 4G of data, and will not need 64-bit addresses, and thus are paying the price for the 4 level page tables and bloat created with 64-bit addresses for no good reason.

Brendan · Post by **Brendan** » Thu Feb 19, 2015 6:38 am

Hi,

rdos wrote:
Octocontrabass wrote:
rdos wrote:We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead.
Good luck running a program that malloc()'s 6GB in a single call with only 32-bit addresses.
Yes, but that is a special case. Most applications will not need more than 4G of data, and will not need 64-bit addresses, and thus are paying the price for the 4 level page tables and bloat created with 64-bit addresses for no good reason.

An extra 8 KiB of RAM (for one PDPT and one PML4) is an extremely small price to pay for the ability to support applications that want to use more than about 2 GiB of virtual space.
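The arithmetic behind that figure, as a quick Python sketch. It assumes the long-mode table format with 4 KiB pages (512 entries of 8 bytes per table), and the helper names are made up for illustration.

```python
ENTRY_BYTES = 8          # size of one long-mode paging entry
ENTRIES_PER_TABLE = 512  # entries in each paging table
OFFSET_BITS = 12         # 4 KiB pages
INDEX_BITS = 9           # log2(512), index bits consumed per level

TABLE_BYTES = ENTRY_BYTES * ENTRIES_PER_TABLE

def extra_table_bytes(extra_levels):
    """Minimum RAM for one table at each additional level."""
    return extra_levels * TABLE_BYTES

def virtual_address_bits(levels):
    """Virtual address width covered by `levels` levels of 9-bit indices."""
    return OFFSET_BITS + INDEX_BITS * levels

print(extra_table_bytes(2))       # one PDPT plus one PML4, in bytes
print(virtual_address_bits(4))    # width covered by 4-level paging
```

This is the minimum cost per address space; a process that spreads its mappings over many 512 GiB regions would need more than one PDPT.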

Cheers,

Brendan

rdos · Post by **rdos** » Thu Feb 19, 2015 8:13 am

Brendan wrote: An extra 8 KiB of RAM (for one PDPT and one PML4) is an extremely small price to pay for the ability to support applications that want to use more than about 2 GiB of virtual space.

You also need hardware for decoding two additional levels of paging, which will inevitably slow down TLB shootdowns as well as all types of TLB misses. In addition to that, storing 64-bit addresses causes code bloat.

Brendan · Post by **Brendan** » Thu Feb 19, 2015 9:41 am

Hi,

rdos wrote:
Brendan wrote: An extra 8 KiB of RAM (for one PDPT and one PML4) is an extremely small price to pay for the ability to support applications that want to use more than about 2 GiB of virtual space.
You also need hardware for decoding two additional levels of paging, which will inevitably slow down TLB shootdowns as well as all types of TLB misses. In addition to that, storing 64-bit addresses causes code bloat.

Transistors are relatively cheap now - less than $4.80 each (if you buy a pack of 25)! If you don't want to use the transistors for some esoteric philosophical reason, then you have my permission to use a Z80 CPU instead.

Cheers,

Brendan

rdos · Post by **rdos** » Thu Feb 19, 2015 4:46 pm

Brendan wrote: Transistors are relatively cheap now - less than $4.80 each (if you buy a pack of 25)!

That's rather irrelevant. Doubling the number of levels in the page translation process will slow down ALL code. That's more or less inevitable. If this was not the case, AMD could just as well have gone for 6 or 7 levels and thus could cover the entire 64-bit address space. These kinds of things are always trade-offs.

Brendan · Post by **Brendan** » Thu Feb 19, 2015 9:38 pm

Hi,

rdos wrote:
Brendan wrote: Transistors are relatively cheap now - less than $4.80 each (if you buy a pack of 25)!
That's rather irrelevant. Doubling the number of levels in the page translation process will slow down ALL code. That's more or less inevitable. If this was not the case, AMD could just as well have gone for 6 or 7 levels and thus could cover the entire 64-bit address space. These kinds of things are always trade-offs.

Going from 3 levels (e.g. PAE) to 4 levels (adding a PML4) isn't quite "doubling".

For a naive implementation, 1 more level in the page translation process means slightly slower TLB misses. To mitigate that you can increase the number of TLB entries, and also cache "higher level tables" (e.g. PDPT entries), and start doing TLB prefetching. Intel has done all of these things. This mostly means that (on average for a typical load) it's not slower and only costs more transistors.
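For reference, this is roughly how a 48-bit virtual address is carved up for a 4-level walk with 4 KiB pages. It is a Python sketch with an invented function name, not a description of any particular CPU's internals. A naive walk reads one entry per index; hardware that caches the upper-level entries can skip some of those reads.

```python
def split_virtual_address(va):
    """Return (pml4, pdpt, pd, pt, offset) for a 48-bit virtual address
    using 4 KiB pages: four 9-bit table indices plus a 12-bit page offset."""
    offset = va & 0xFFF
    pt = (va >> 12) & 0x1FF
    pd = (va >> 21) & 0x1FF
    pdpt = (va >> 30) & 0x1FF
    pml4 = (va >> 39) & 0x1FF
    return pml4, pdpt, pd, pt, offset

# Highest address in the lower half of a 48-bit address space.
print(split_virtual_address(0x00007FFFFFFFFFFF))
```

Two addresses that share the same PML4 and PDPT indices can reuse the same cached upper-level entries, which is why the extra level costs less on average than the naive count of memory reads suggests.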

Basically, there are multiple trade-offs: features vs. performance vs. transistors. The world has gone with more features and more transistors. We are software developers, we make software to suit the hardware we choose to support. Whining that your software doesn't suit the hardware you've chosen to support won't change anything.

Cheers,

Brendan

SoulofDeity · Post by **SoulofDeity** » Thu Feb 19, 2015 9:57 pm

Octocontrabass wrote:
rdos wrote:We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead.
Good luck running a program that malloc()'s 6GB in a single call with only 32-bit addresses.

What? Who in their right mind allocates 6GiB of memory in a single call? That's ignorant... That said, I agree, truncating the address space is a bad idea. I mean, it's not impossible... You could use a table of handles to memory in a larger address space or make a virtual bank-switching system, but have fun synchronizing that.

Brendan · Post by **Brendan** » Thu Feb 19, 2015 10:36 pm

Hi,

SoulofDeity wrote:
Octocontrabass wrote:
rdos wrote:We could simply reduce the address space to 32-bit and just introduce 64-bit operations to existing 32-bit mode instead.
Good luck running a program that malloc()'s 6GB in a single call with only 32-bit addresses.
What? Who in their right mind allocates 6GiB of memory in a single call? That's ignorant... That said, I agree, truncating the address space is a bad idea. I mean, it's not impossible... You could use a table of handles to memory in a larger address space or make a virtual bank-switching system, but have fun synchronizing that.

More likely is that they allocate thousands of smaller areas, which happen to add up to 6 GiB in total. Of course it's more than just "malloc()" alone - e.g. memory mapping a single large file isn't necessarily uncommon.

I'd also suggest that it's not just processes. For a simple example, a monolithic kernel running on a computer that happens to have 32 GiB of RAM might want to use 8 GiB or more of that RAM for caching file data.

Basically, it's "kernel worst case + process worst case". If the kernel might consume up to 100 GiB of space (e.g. for a very high-end server) and an SQL database engine might consume another 2 TiB of space, then you're probably going to struggle to handle the combined worst cases if you're using 32-bit virtual addresses.

Of course for embedded systems (where everything is relatively tiny), it's very likely that 64-bit virtual addressing is overkill, and also very likely that laptop/desktop/server Intel and AMD CPUs are overkill (e.g. it's like gluing a rocket engine on a skateboard and complaining the engine is too heavy). Instead, you'd probably want a cheap ARM CPU (or maybe a 32-bit Intel Atom SoC) and probably shouldn't be using any 64-bit CPU in the first place.

Cheers,

Brendan

Octocontrabass · Post by **Octocontrabass** » Fri Feb 20, 2015 3:13 am

Brendan wrote:More likely is that they allocate thousands of smaller areas, which happen to add up to 6 GiB in total. Of course it's more than just "malloc()" alone - e.g. memory mapping a single large file isn't necessarily uncommon.

No, it really does allocate huge blocks of memory. To be fair, >4GiB blocks are uncommon; I've only seen huge allocations like that when doing stupid things like trying to put more than a day's worth of high-resolution video on a single Blu-ray.

Rusky · Post by **Rusky** » Fri Feb 20, 2015 11:00 am

On the other hand, if you have 32GB of physical RAM it'd be ridiculous not to allow an application to allocate 6GB at once. There are a few legitimate use cases.

OSDev.org
