Hi,
iseyler wrote:What I am trying to get is full and complete access to all of the available RAM in the system without losing any to memory-mapped PCI devices. Also I am using 2MB pages so I don't use PTs.
Most CPUs have separate TLBs for 4 KiB pages and large pages, and if you don't use 4 KiB pages you won't be able to use most of the TLB and you'll get worse performance because of that. For example, here are the TLB details for an Intel Core 2 CPU (originally from Sandpile):
- 4 KiB Code TLBs - 128 Entries, 4-Way, LRU
- Large Code TLBs - 8 Entries (2 MiB pages only) or 4 Entries (4 MiB pages only), 4-Way, LRU
- 4 KiB Data TLBs - L0: 16 Entries, 4-Way, LRU, Loads Only; L1: 256 Entries, 4-Way, LRU
- Large Data TLBs - L0: 16 Entries, 4-Way, LRU, Loads Only; L1: 32 Entries, 4-Way, LRU
- PDC - 32 Entries, 4-Way, LRU (note: I have no idea what "PDC" is - I'm guessing it's "Page Directory Cache" that caches page directory entries instead of page table entries to reduce lookup times for TLB misses)
And here are the details for AMD's K8:
- 4 KiB Code L1 - 32 Entries, Fully, LRU, 2 Cycle Latency
- 4/2 MiB Code L1 - 8 Entries, Fully, LRU, 2 Cycle Latency
- 4 KiB Data L1 - 32 Entries, Fully, LRU -- Dual, 2 Cycle Latency
- 4/2 MiB Data L1 - 8 Entries, Fully, LRU -- Dual, 2 Cycle Latency
- 4 KiB Code L2 - 512 Entries, 4-Way, LRU, Exclusive, 10 Cycle Latency
- 4 KiB Data L2 - 512 Entries, 4-Way, LRU, Exclusive, 10 Cycle Latency
- Miscellaneous - 24 Entry PDE Cache; 32 Entry CAM Flush Filter
As you can see, in almost all cases there are between 4 and 16 times as many 4 KiB TLB entries as there are 2 MiB TLB entries. The exceptions are the L0 data TLBs in Intel Core 2 (where 4 KiB pages have the same number of entries as 2 MiB pages) and AMD K8's L2 TLBs (where there simply aren't any entries for 2 MiB pages).
Also note that performance differences will depend on data usage. For a simple example, if you reserve 2 MiB for each process's stack, then using a 2 MiB page for it will probably waste that TLB entry, because only a small part of the area is likely to be used often and you'd probably get the same number of TLB misses from a single 4 KiB TLB entry.
Basically, for maximum TLB efficiency (minimum number of TLB misses) you'd want to use large page TLB entries for large data structures (where all data in the 2 MiB area is likely to be accessed at the same time), and use 4 KiB TLB entries for everything else (e.g. when you've run out of large page TLB entries).
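To illustrate, here's a rough sketch of that kind of policy when setting up mappings. The map_4kib_page()/map_2mib_page() helpers are made up for this example (here they just print what they'd map) and would be whatever your kernel's real page table code does:

/* Sketch: map [virt, virt + size) to [phys, phys + size), using 2 MiB pages
   for the aligned middle and 4 KiB pages for the unaligned head and tail.
   Assumes size is a multiple of 4 KiB and that virt and phys have the same
   offset within a 2 MiB boundary. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_4K 0x1000ULL
#define PAGE_2M 0x200000ULL

static void map_4kib_page(uint64_t virt, uint64_t phys)
{
    printf("4 KiB page: %#llx -> %#llx\n",
           (unsigned long long)virt, (unsigned long long)phys);
}

static void map_2mib_page(uint64_t virt, uint64_t phys)
{
    printf("2 MiB page: %#llx -> %#llx\n",
           (unsigned long long)virt, (unsigned long long)phys);
}

static void map_range(uint64_t virt, uint64_t phys, uint64_t size)
{
    uint64_t end = virt + size;

    /* 4 KiB pages for the unaligned head */
    while (virt < end && (virt & (PAGE_2M - 1)) != 0) {
        map_4kib_page(virt, phys);
        virt += PAGE_4K; phys += PAGE_4K;
    }
    /* 2 MiB pages for the aligned middle */
    while (virt + PAGE_2M <= end) {
        map_2mib_page(virt, phys);
        virt += PAGE_2M; phys += PAGE_2M;
    }
    /* 4 KiB pages for the remaining tail */
    while (virt < end) {
        map_4kib_page(virt, phys);
        virt += PAGE_4K; phys += PAGE_4K;
    }
}

int main(void)
{
    /* e.g. a 4 MiB + 32 KiB region starting 16 KiB before a 2 MiB boundary */
    map_range(0x1FC000, 0x1FC000, 0x408000);
    return 0;
}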
Then there's RAM usage. For 4 KiB pages you'd need to use RAM for page tables (and this RAM could be considered wasted). For "2 MiB pages only", if you need to allocate 10 bytes for something then you'd be forced to allocate an entire 2 MiB page and you'd waste almost all of it. Regardless of what you do, RAM will be wasted for one reason or another. It's hard (impossible) to predict which method would waste more RAM unless you know in advance how many processes will be running and how much space each process will use.
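To put some rough (invented) numbers on that trade-off, here's a back-of-the-envelope comparison between the page table cost of mapping 4 GiB with 4 KiB pages and the rounding waste of a few thousand small allocations under "2 MiB pages only":

/* Invented example numbers, just to show the trade-off:
   - 4 KiB pages: one 4 KiB page table maps 2 MiB, so mapping N bytes costs
     roughly N / 512 bytes of page tables (plus a little for page directories).
   - "2 MiB pages only": each allocation gets rounded up to 2 MiB, wasting
     about 1 MiB per allocation on average. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t mapped = 4ULL << 30;            /* 4 GiB of mappings */
    uint64_t pt_overhead = mapped / 512;     /* ~8 MiB of page tables */

    uint64_t allocations = 2000;             /* e.g. 2000 small allocations */
    uint64_t avg_waste = 1ULL << 20;         /* ~1 MiB wasted per allocation */
    uint64_t rounding_waste = allocations * avg_waste;

    printf("4 KiB pages:       ~%llu MiB of page tables\n",
           (unsigned long long)(pt_overhead >> 20));
    printf("2 MiB pages only:  ~%llu MiB lost to rounding\n",
           (unsigned long long)(rounding_waste >> 20));
    return 0;
}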
iseyler wrote:What I would like is to map all of the physical RAM to start at 0xFFFF800000000000 and run the OS and apps in there
Because there's no segmentation for 64-bit code, that would mean that 64-bit processes would need to use relocatable code, and that you can't have any protection between separate processes, or between processes and the kernel (unless you use "software isolation", like Singularity).
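For what it's worth, the appeal of that scheme is that physical/virtual translation becomes a simple offset. Here's a minimal sketch (the base address comes from your question, the helper names are made up); the comment notes what you give up:

#include <stdio.h>
#include <stdint.h>

#define PHYS_MAP_BASE 0xFFFF800000000000ULL  /* base from the original question */

/* With every byte of RAM visible at PHYS_MAP_BASE + phys, translating between
   physical and virtual addresses is just an offset. The downside is that every
   process sees the same addresses, so paging can no longer isolate processes
   from each other or from the kernel. */
static inline uint64_t phys_to_virt(uint64_t phys) { return PHYS_MAP_BASE + phys; }
static inline uint64_t virt_to_phys(uint64_t virt) { return virt - PHYS_MAP_BASE; }

int main(void)
{
    uint64_t phys = 0x12345000;
    printf("phys %#llx -> virt %#llx\n",
           (unsigned long long)phys, (unsigned long long)phys_to_virt(phys));
    return 0;
}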
iseyler wrote:So if a system had 4GB of RAM and a video card with 1GB of RAM you could theoretically access 5GB of memory space.
At 32 bits per pixel, 5 GiB is enough space for a 36000 * 36000 high resolution picture of my Grandmother. 5 GiB is small when you're talking about total address space size (even though it's large when you're talking about RAM). There are lots of techniques used to make the address space seem a lot larger than available RAM (shared memory, swap space, allocation on demand, the zeroed page optimization, memory mapped files, etc).
For most 64-bit OSs (where there's a separate virtual address space for each process), you'd end up with 131072 GiB of virtual address space for each process plus 131072 GiB of virtual address space for the kernel. If the OS supports a maximum of 4294967296 processes (32-bit process IDs), then in theory you end up with a maximum of 562949953552384 GiB of total virtual address space. Of course you'd probably need to use some swap space for that...
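Here's where those numbers come from (assuming the usual 48-bit canonical address split, with a 2^47 byte lower half per process and a 2^47 byte higher half for the kernel):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t half_space_gib = (1ULL << 47) >> 30;   /* 2^47 bytes = 131072 GiB */
    uint64_t processes      = 1ULL << 32;           /* 32-bit process IDs      */

    /* one user half per process, plus one kernel half shared by all of them */
    uint64_t total_gib = half_space_gib * processes + half_space_gib;

    printf("per process:        %llu GiB\n", (unsigned long long)half_space_gib);
    printf("theoretical total:  %llu GiB\n", (unsigned long long)total_gib);
    return 0;
}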
Lastly, in the past I've always tried to encourage people to avoid identity mapping RAM in kernel space for a variety of reasons (e.g. the ability to send parts of the kernel to swap space, fault tolerance, NUMA optimizations, etc). I won't repeat it here, because in reality most people (including me) need to write several kernels before they've learnt enough to start thinking about implementing more advanced features, and by the time they're ready for this they don't really need to ask.
iseyler wrote:Is this even possible? Or am I completely wrong in how x86 memory works?
It is possible. IMHO it's also not a very good design, but also IMHO there's probably no reason to worry about whether it's a good design or not (as long as you learn from it, it probably won't matter)...
Cheers,
Brendan