
AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 6:09 am
by Habbit
After reading the Intel and AMD manuals for IA-32e/AMD64, I'm starting to think that you can't do a simple higher half kernel on AMD64. Why? Because you can't statically (at compile/link time) know where that higher half is!

In other words:

All IA-32 processors support 32 bit virtual addresses, and either 24 bit (386SX) or 32 bit (all other IA-32) physical addresses. The higher half is thus 0x80000000-0xFFFFFFFF virtual on any IA-32 processor.

However, while the AMD64 architecture supports UP TO 64 bit virtual addresses and UP TO 52 bit physical ones, processors are not required to implement all of that: just a minimum of 48 bit virtual and 40 bit physical. All addresses must be "canonical", that is, the non-implemented bits (63-48 on a 48 bit implementation) must be copies of the most significant implemented bit (in other words, sign extension).

Thus, the address space is (not taking into account possible 52 and 60 bit implementations):
48 bit: 256 TiB
56 bit: 64 PiB (65 536 TiB)
64 bit: 16 EiB (16 777 216 TiB)
Lower half:
48 bit: 0x0000000000000000-0x00007FFFFFFFFFFF
56 bit: 0x0000000000000000-0x007FFFFFFFFFFFFF
64 bit: 0x0000000000000000-0x7FFFFFFFFFFFFFFF
Higher half:
48 bit: 0xFFFF800000000000-0xFFFFFFFFFFFFFFFF
56 bit: 0xFF80000000000000-0xFFFFFFFFFFFFFFFF
64 bit: 0x8000000000000000-0xFFFFFFFFFFFFFFFF
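
By the way, checking whether an address is canonical is trivial once you know how many virtual address bits the processor actually implements (CPUID leaf 0x80000008 reports the supported widths). A rough sketch in C, with that width passed in as "vbits":

[code]
#include <stdint.h>

/* Sketch: is 'addr' canonical on a CPU that implements 'vbits' virtual
 * address bits? Bits 63..vbits-1 must all be copies of bit vbits-1.    */
static int is_canonical(uint64_t addr, unsigned vbits)
{
    uint64_t high = ~UINT64_C(0) << (vbits - 1);  /* bits 63..vbits-1 set */
    uint64_t top  = addr & high;
    return top == 0 || top == high;               /* all clear or all set */
}
/* 'high' is also the base of the higher half: 0xFFFF800000000000 for 48 bits. */
[/code]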

I reckon this is an intelligent implementation: the "higher half" is always at the very top of the address space, expanding down with more available bits, while the lower half does just the opposite. Seamlessly scalable. However, how can I link my kernel to always use the widest possible address space? I don't think it's possible. To be honest, as I'm writing this I'm realising that even the 48-bit higher half is 128 TiB, so that should suffice: if I can't fit my microkernel in that space, I'm in trouble ;D

I'm also thinking that, because of the architecture, the OS must not "be nice" to applications and hand them part of the higher half the way my 32 bit design does (I use a 3.5 GiB / 0.5 GiB split instead of the more common 2/2), because they could get really confused by the non-canonical "hole" in the middle of the address space that appears on <64 bit implementations. So, if the OS gets the higher half and the apps get the lower one, nobody notices the hole and everyone is happy. Only on true 64 bit processors could the split be adjusted, e.g. to 12 EiB / 4 EiB, or even 15/1. An OS whose KERNEL uses 1 exbibyte, that is 1,048,576 tebibytes, looks monstrous to me, but I'm starting to think Windows 2015 will require that ;)

But, continuing with my strange brainstorm... How will current OSes manage processors with >48 bit virtual addresses? AFAIK, AMD and Intel have only defined the paging mechanisms up to the PML4, which only covers 48 bits. That probably means that 1) neither of them plans to deliver >48 bit AMD64 processors in the near future (maybe some years), and 2) all "64 bit" OSes now are really "48 bit", and if run ten years into the future on a true 64 bit processor they could still only map 256 TiB (48 bits) with their PML4, because they can't know what the paging scheme for the remaining 16 bits will look like! So, those of you running WinXP Pro x64 or Linux x86_64, you're really running WinXP Pro x48 / Linux x86_48 :P By the way, I think they could even crash if run on a >48 bit processor, unless Intel & AMD devise a clever way for the processor to recognise whether the OS is loading CR3 with a PML4 instead of the PML5, PML6 or whatever those future processors use, maybe through a setting in a new CRx. Of course, these OSes could also crash for not supporting other hardware of that time, maybe EFI2, Super Hiper Mega PCI-Express3, DDR7++, etc.

Well, enough of this. I'm returning to my cell in the mental sanatorium. Just asking for your thoughts on the matter...

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 7:34 am
by paulbarker
My plan, for both x86 & x86-64, in case you're interested, is to reserve the lowest 64MB for the kernel (identity mapped); then comes application space up to X, and from X to the top of memory is dynamic kernel memory (buffers, caches, etc). The value of X may be chosen by a config option or by some run-time calculation based on the size of the address space. X should always be at least 1GB though.

This allows a machine which will be used as a database server to be given lots of application address space (X=3.5GB) and a pure file server to be given lots of kernel space (X=1GB) for caches (these examples are for a 32-bit system).

The nice thing about this setup is that it can scale to any address space size, with the kernel always running in identity mapped memory. For processors which do not support this layout (e.g. MIPS, with its fixed kernel/user segments) another layout must be used, but this is not a problem since binaries for one layout will not be expected to run on another.
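
In rough code (the constants are only illustrative, and the real X would come from the config option or run-time calculation mentioned above):

[code]
#include <stdint.h>

#define KERNEL_STATIC_END  0x04000000UL   /* lowest 64MB: identity mapped kernel */
#define MIN_X              0x40000000UL   /* X should always be at least 1GB     */

/* Application space runs from KERNEL_STATIC_END up to X; dynamic kernel
 * memory (buffers, caches, etc) runs from X to the top of memory.        */
static uintptr_t choose_x(uintptr_t requested_x)
{
    return (requested_x < MIN_X) ? MIN_X : requested_x;
}
[/code]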

Hope this helps.

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 8:11 am
by JoeKayzA
Habbit wrote: I reckon this is an intelligent implementation: the "higher half" is always at the very top of the address space, expanding down with more available bits, while the lower half does just the opposite. Seamlessly scalable. However, how can I link my kernel to always use the widest possible address space? I don't think it's possible.
I don't think it's impossible - look at Windows 2000: the kernel image (ntoskrnl.exe) is a relocatable PE file. The boot loader offers the option to put it above either the 2GB or the 3GB mark - and relocates the image accordingly. I guess something similar could be used here as well.

Now, I *don't* say that this is straightforward, simple or desirable for a hobby OS. As you said, the higher half with 48-bit addressing is already huge, so why the hassle? Just my 2 euro-cents!

cheers Joe

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 8:45 am
by Brendan
Hi,

My thinking is that the more "levels" the paging system has the worse things get for TLB misses (e.g. loading N different pieces of RAM into the CPU just to convert a linear address into a physical address, especially when RAM itself is relatively slow to access).

IMHO it's more likely they'd invent a completely different paging system, with larger pages, more entries per page table/directory and fewer levels. For example, consider 8 MB pages, page tables and page directories (with 1048576 entries per page table/directory). This works out nicely and means you get 2 page directories per address space (one low and one high, for user-level and kernel-level perhaps, where the same upper page directory could be used in every address space). A linear address would look like this:

[tt]666655555555554444444444333333333322222222221111111111
3210987654321098765432109876543210987654321098765432109876543210
DTTTTTTTTTTTTTTTTTTTTPPPPPPPPPPPPPPPPPPPPOOOOOOOOOOOOOOOOOOOOOOO[/tt]

Where 'D' selects which page directory, 'T' selects which page table, 'P' selects which page and 'O' determines the offset in that page. For more fun, I'd use CR3 for the lower page directory and "CR5" for the higher page directory. This means for a TLB miss the CPU would only need to fetch part of the page directory and part of a page table (2 chunks of data from RAM)...
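
Splitting a linear address up under that (completely made up) scheme would look something like this:

[code]
#include <stdint.h>

/* Hypothetical 8 MB page scheme from above: 1 'D' bit, 20 'T' bits,
 * 20 'P' bits and 23 'O' bits. No real CPU does this, of course.    */
static void split_address(uint64_t vaddr, unsigned *dir,
                          uint32_t *table, uint32_t *page, uint32_t *offset)
{
    *offset = (uint32_t)(vaddr & 0x7FFFFF);         /* O: bits 22..0   */
    *page   = (uint32_t)((vaddr >> 23) & 0xFFFFF);  /* P: bits 42..23  */
    *table  = (uint32_t)((vaddr >> 43) & 0xFFFFF);  /* T: bits 62..43  */
    *dir    = (unsigned)(vaddr >> 63);              /* D: CR3 or "CR5" */
}
[/code]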

Larger page sizes also mean one TLB entry covers more of the linear address space, so it'd improve the efficiency of the TLB cache. The only disadvantage is that it'd waste more RAM, but if there's so much RAM that you need true 64 bit addressing then I don't think this will be much of a problem.

In any case, all of your OS's linear memory management code would probably need to be rewritten to suit however they decide to implement true 64 bit paging. Why not wrap the true 64 bit paging code in conditional "#if/#else/#endif" statements (along with anything else that needs to change), and then use a different linker script? This gives 2 separate kernel binaries (one for 48 bit paging and another for 64 bit paging), but would that matter?
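
Something along these lines (the names and numbers are invented, it's just to show the shape of the idea):

[code]
/* Only the paging constants (and the linker script) would differ
 * between the two kernel binaries.                                */
#ifdef TRUE_64BIT_PAGING                /* hypothetical future scheme    */
#  define PAGING_LEVELS      6
#  define KERNEL_SPACE_BASE  0x8000000000000000UL
#else                                   /* today's 4-level, 48 bit PML4  */
#  define PAGING_LEVELS      4
#  define KERNEL_SPACE_BASE  0xFFFF800000000000UL
#endif
[/code]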

Of course the other question is how long will it take before 48 bit linear addresses are too small? I'm guessing around 30 years....


Cheers,

Brendan

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 9:09 am
by Habbit
Brendan wrote: Of course the other question is how long will it take before 48 bit linear addresses are too small? I'm guessing around 30 years....
Well, twelve years ago I was happy as hell when I got my 200 MiB hard drive... I couldn't even begin to figure out what to do with so much space! Nowadays I'm cursing at my 400 GiB hard drive, of which only 1.7 GiB are left (and that's nearly 8 times the total capacity of my old drive!). Sure, the growth in disk drive capacity is nearly linear while address space growth is exponential (each step that doubles the bits gives newMaxAddress = oldMaxAddress^2), but...

128 terabytes looks like more memory than you'll ever need, but, again, the Intel engineers also thought that 4 GiB was "more than you'll ever need". Think about something that hasn't been invented yet, such as 3D video processing (I'm not talking about 3D-looking 2D images, I'm talking about true 3D, kind of like holograms, which are currently being researched with more or less success). Now think about how it could be represented in memory uncompressed (i.e. as a sequence of frames, each with hundreds of uncompressed RGB bitmaps, one bitmap per "depth slice" in the frame) and 128 TiB won't be enough, just as 4 GiB won't even let you START processing an uncompressed RGB movie (nor a losslessly compressed one such as HuffYUV, whose sizes range around 70-80 GiB for ESC 2006, by the way).

I'm not advocating for a 256 bit address space - yet ;D. But the future surprises us more often than you might reckon. Just 50 years ago, the very idea of ME (in Spain) writing to YOU (wherever in the world) nearly instantaneously was... science fiction. Look ten years into the past and you will see yourself cherishing your new 28.8K modem. And now I have a 4 MiB cable link...

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 10:34 am
by Candy
You could extend the PML4 system to a full-depth, 4-level scheme by increasing the page size only slightly.

The number of bits you cover is (pagesize) + (pagesize - 3) * 4, where pagesize is the log2 of the page size in bytes and each table entry is 8 bytes. You need this to equal 64: 5*pagesize - 12 = 64, so pagesize = 76 / 5 = 15 remainder 1. So you use 32k pages, each table contains 4k entries, and the final page is 32k. 4k * 4k * 4k * 4k * 32k = 8192 PB = 8 EB. Now add a second PML4 (CR7?) table for the kernel side of the address space (that takes care of the leftover bit) and you're all set.
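
In index terms (still completely hypothetical): 15 offset bits, then four levels of 12 index bits each, with bit 63 choosing which of the two top-level tables to walk:

[code]
#include <stdint.h>

#define OFFSET_BITS 15                    /* 32k pages              */
#define INDEX_BITS  12                    /* 4k entries of 8 bytes  */

/* Index into the table at 'level' (1 = lowest, 4 = top). */
static unsigned level_index(uint64_t vaddr, int level)
{
    return (unsigned)((vaddr >> (OFFSET_BITS + (level - 1) * INDEX_BITS)) & 0xFFF);
}

/* Bit 63 selects the user-side top table (CR3) or the kernel-side one ("CR7"). */
static unsigned top_table(uint64_t vaddr) { return (unsigned)(vaddr >> 63); }
[/code]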

AMD, are you hiring?

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 7:00 pm
by Brendan
Hi,
Habbit wrote: 128 terabytes looks like more memory than you'll ever need, but, again, the Intel engineers also thought that 4 GiB was "more than you'll ever need". Think about something that hasn't been invented yet, such as 3D video processing (I'm not talking about 3D-looking 2D images, I'm talking about true 3D, kind of like holograms, which are currently being researched with more or less success). Now think about how it could be represented in memory uncompressed (i.e. as a sequence of frames, each with hundreds of uncompressed RGB bitmaps, one bitmap per "depth slice" in the frame) and 128 TiB won't be enough, just as 4 GiB won't even let you START processing an uncompressed RGB movie (nor a losslessly compressed one such as HuffYUV, whose sizes range around 70-80 GiB for ESC 2006, by the way).
48 bit addressing is more than enough for this - if you allow half the address space for the kernel and half of the remainder for user-level code, that still leaves 46 bits for data, which is enough for a 32768 * 32768 * 16384 3D bitmap (at 32 bits per pixel). However, I would also expect at least some of the video decompression to be done in hardware - it'd be cheaper to implement than fixing the bandwidth problem (at 64 frames per second you'd need to shift data at a rate of 4096 TB/s across the "PCI" bus to the video card).
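
To spell the arithmetic out (nothing clever, just checking the figures):

[code]
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t data_bytes = UINT64_C(1) << (48 - 1 - 1);          /* half for the kernel, half of the rest for user code */
    uint64_t bitmap     = UINT64_C(32768) * 32768 * 16384 * 4;  /* 32 bits per pixel */

    printf("data space: %llu TiB\n", (unsigned long long)(data_bytes >> 40));      /* 64   */
    printf("3D bitmap:  %llu TiB\n", (unsigned long long)(bitmap >> 40));          /* 64   */
    printf("at 64 fps:  %llu TiB/s\n", (unsigned long long)((bitmap * 64) >> 40)); /* 4096 */
    return 0;
}
[/code]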

The way I see it, it became cost effective to shift away from 16/20 bit around 1985, and it started becoming cost effective to shift away from 32 bit around 2005. A linear progression would imply that it might start becoming cost effective to shift away from 48 bit around 2025, but these things are determined by demand. The shift from 16/20 bit to 32 bit was in much higher demand than the shift from 32 bit to 48 bit, which IMHO was/is a slow process that most normal consumers still don't care too much about (and was mostly a strategic move by AMD to prevent the industry shifting to Itanium, a platform that AMD doesn't have a "technology sharing" agreement for).

The other thing I'm wondering is when will 80x86 choke on its own "backward compatibility"? Sooner or later the costs involved with providing compatibility with 30 year old hardware will increase to the point where another platform will be able to deliver the same "bang per buck" with much lower "units sold". For example, if something portable like Linux became the dominant OS (or if Microsoft seriously decided to support another architecture) then it would make sense to start moving to a cleaner (cheaper to manufacture) architecture. This would be a "tipping" point - once people start to move away from 80x86 it starts losing its "high volume" advantage and the alternative architecture starts losing its "low volume" disadvantage. Interestingly, there are already things like PCI, USB, ACPI and EFI that are intended to be cross-platform, and older 80x86 software can run under emulators - it's only really AMD (and inertia) that is currently preventing the end of 80x86...


Cheers,

Brendan

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 7:36 pm
by Habbit
Brendan wrote: (and was mostly a strategic move by AMD to prevent the industry shifting to Itanium, a platform that AMD doesn't have a "technology sharing" agreement for).

The other thing I'm wondering is when will 80x86 choke on its own "backward compatibility"? Sooner or later the costs involved with providing compatibility with 30 year old hardware {...} it's only really AMD (and inertia) that is currently preventing the end of 80x86...
The end of the x86 architecture... seems so tragic. But there needn't be such a "cost of supporting 30 year old hardware". It's true that current processors boot in real mode like an ultra-fast, pipelined 8086+8087 just for the sake of compatibility with old software (mainly the BIOS). There are, however, other lights ahead and/or already over our heads.

Mind you, there are things about EFI I really don't like: the very prospect of the computer connecting to the Internet (yeah, that Internet full of viruses and spam) to update its "hardware OS-independent drivers" without the need for an OS seems creepy (can you imagine your OS-independent, super-low-level hard disk driver becoming infected with an EFI virus? It would have the same access to the disk - or even more - as those old DOS viruses that burned out floppy drives by spinning them too fast for too long).
However, at the end of the day, I would like to see it working on my box: machines with EFI could boot directly into protected mode, because no real mode would be required to run the BIOS. Hell, real mode & v8086 mode and all their quirks could be removed from AMD64 processors entirely, since they wouldn't be needed (and aren't supported in 64-bit mode anyway). And it's not such a strange vision: an old Intel wonder named the 80376 did just that.

About IA-64: I'm not going to say it's a bad architecture, but... I've tried to read the systems programmer's manual and, well, how can I put it... I couldn't understand 75% of it. It's too damn complex! Sure, explicit parallelism gives you (near) full control of what is executed where and when, but the very idea of having to synchronise the registers...
It's an architecture where programs can't realistically be written in assembly language, and that bothers me a bit. I mean, if I write a prime number finder in C and compile it with GCC -O3 on x86, and benchmark it against my hand-coded assembly version without x87, MMX, SSE or whatever, GCC beats me by 0.05% (and only at maximum optimisation; with the normal settings my code just kills the compiler's). I can't even picture the prime finder written in assembly for IA-64 (well, I can, if I just ignore that EPIC bullshit, which would be wasting the power of the architecture), and, of course, I can't even begin to imagine the complexity of a compiler that TRIES to do explicitly parallel computing...
To sum up: the Itanium architecture is, in my opinion, overpowered in a way that makes the extra power nearly unusable except in very specific environments... It's as if we used SSE instructions for everything in the x86 instruction set.

Another 90 degree turn in the thread ;) Ok ok, I know, back to my sanatorium cell before my brain raises a #GPF when trying to fetch the next instruction from a non-canonical %rIP

Re:AMD64, virtual memory and canonical addresses

Posted: Tue Aug 29, 2006 8:46 pm
by Brendan
Hi,
Habbit wrote: However, at the end of the day, I would like to see it working on my box: machines with EFI could boot directly into protected mode, because no real mode would be required to run the BIOS. Hell, real mode & v8086 mode and all their quirks could be removed from AMD64 processors entirely, since they wouldn't be needed (and aren't supported in 64-bit mode anyway). And it's not such a strange vision: an old Intel wonder named the 80376 did just that.
For the CPU, dumping real mode, protected mode, virtual 80x86, etc is just the beginning. Convert the instruction set to something that is easier to decode, increase the number of registers, completely remove segmentation (including FS & GS), remove the distinction between general registers and FPU/MMX/SSE registers, etc. I'd also make some other changes (simplifying the paging system, and giving each address space an "address space ID" so that the TLB can be tagged and old entries don't need to be flushed during context switches), use a standard way of controlling sleep states and doing debugging and performance monitoring, etc. The CPUs could be faster, easier to program and cheaper to produce...

This is only the beginning though; the chipset would be next. Serial ports, parallel ports, the PS/2 controller, the PICs, the PIT, the ISA DMA controller, the floppy controller, the A20 gate, all the memory holes, etc - all gone. Shift everything that's left to the top of the 64 bit address space (local APIC, I/O APIC, HPET, PCI devices, the BIOS, etc). The state save area for SMM can go too (ACPI makes SMM mostly obsolete anyway). Now we've got a clean slab of RAM from 0x00000000 up, with no "PCI to LPC" bridge (find a better alternative for the CMOS/RTC, which is all that's left there) and much simpler memory caching/decoding.

Then take a look at specific devices. Video is first - dump the legacy stuff and use an LFB and nothing else, and make the GPU handle video mode switches so that we can do "out videoControlPort, modeNumber" instead of messing about with obscure details in software (combined with some way of finding out the resolution and colour depth for each modeNumber) - there's no real need for a video BIOS then, so that can go. Storage devices are crap too - for each transfer the host could send an operation type, a logical address (on the device), a transfer size (in bytes), a physical address (in RAM) and a "job number". When an operation completes, the controller sets the job number and status before raising an IRQ, so the OS knows what happened (and so multiple jobs can be queued). Differences in block sizes, cylinders, heads, sectors, etc can all go, and so can the differences between hard drives, CD-ROMs, tape drives, IDE/SCSI, etc - one device driver to control them all (and no SCSI ROMs or anything either).
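
To make the storage idea a bit more concrete - the field names here are invented, it's just the sort of command block I have in mind:

[code]
#include <stdint.h>

/* The host queues one of these; the controller fills in 'status' and raises
 * an IRQ carrying 'job_number' so the OS knows which queued job completed.  */
struct storage_job {
    uint32_t op;           /* operation type (read/write/flush/...)       */
    uint32_t job_number;   /* chosen by the OS, echoed back on completion */
    uint64_t device_addr;  /* logical byte address on the device          */
    uint64_t size;         /* transfer size in bytes                      */
    uint64_t phys_addr;    /* physical address of the buffer in RAM       */
    uint32_t status;       /* filled in by the controller before the IRQ  */
};
[/code]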

I guess you see where I'm heading - hardware design, hardware manufacture and software costs (for both OSs and device drivers) can all be significantly reduced by dumping 80x86 completely. It'd be sad to see 80x86 go (but also sad to see it stay)...

Anyway, in an attempt to bring this back to "on topic", IMHO it's not worth worrying about what might happen in the future if someone decides to provide true 64 bit paging - the only thing you can do is allow for some portability in your memory management code, which is a good idea in any case...


Cheers,

Brendan