Re: Virtual Memory Goals

Posted: Thu Aug 18, 2016 6:39 pm
by ~
To make things more bearable, it would probably be better to make allocations semi-static, meaning that different subsystems would use different areas of the logical/physical address space, as if it were a hard disk partition with special files and areas. For example, we would allocate memory-mapped devices in a semi-static style and format, and the same would apply to regular applications, shared data, etc...

In this way we would get to define a lot of details and structure of pages and the whole memory management system beforehand. We would then make it flexible by allowing the smallest and the biggest possible allocations in a single operation.


I also found this in a book; seems very interesting:
If you dynamically allocate your device structure at runtime, use the VMM service _HeapAllocate, which is very similar to malloc. However, if your device structure includes a large buffer (4 KB or larger), you'll want to include only a pointer to the buffer in the device structure itself, and then allocate the large buffer separately using _PageAllocate. The rule is to use _HeapAllocate for small allocations and _PageAllocate for large allocations, where small and large are relative to 4 KB.

Re: Virtual Memory Goals

Posted: Thu Aug 18, 2016 11:19 pm
by Brendan
Hi,
alexfru wrote:
LtG wrote:Out of curiosity, where'd you get the number 6?
Something like a movsd instruction crossing 3 page boundaries (1 code and 2 data).
Yes.

There are multiple cases where the CPU accesses "EIP plus 2 other addresses" (MOVSD, CMPSD, PUSH, POP); and with misaligned accesses (e.g. accessing 2 bytes at 0x0FFFFFFF) each of the 3 simultaneous accesses can be split across 2 pages (and page tables, and ..); leading to a worst case of "6 pages plus 6 page tables plus page directory = 52 KiB" (for "plain 32-bit paging"), and a worst case of "6 pages plus 6 page tables plus 6 page directories plus 6 PDPTs plus PML4 = 100 KiB" (for long mode).

Note that there are extremely rare cases involving even more simultaneous accesses (e.g. the "Debug Store Mechanism" can add an additional access to anything, hardware virtualisation with nested paging structures, etc). For this reason; if you're buying a new computer I'd recommend getting at least 256 KiB of RAM to make sure you have enough. 8)


Cheers,

Brendan

Re: Virtual Memory Goals

Posted: Fri Aug 19, 2016 12:05 am
by alexfru
Brendan wrote: There's multiple cases where CPU accesses "EIP plus 2 other addresses" (MOVSD, CMPSD, PUSH, POP); and with misaligned accesses (e.g. accessing 2 bytes at 0x0FFFFFFF) each of the 3 simultaneous accesses can be split across 2 pages (and page tables, and ..); leading to a worst case of "6 pages plus 6 page tables plus page directory = 52 KiB" (for "plain 32-bit paging"), and a worst case of "6 pages plus 6 page tables plus 6 page directories plus 6 PDPTs plus PML4 = 100 KiB" (for long mode).
Page tables and directories are only involved when the addresses aren't in the TLB yet. I guess the CPU would have to bring the addresses into the TLB *before* executing the instruction (at least the addresses for the code pages containing the instruction), and it shouldn't evict those TLB entries (the ones that are valid for the instruction) if, say, just one page out of the 6 code/data pages isn't accessible. That means you probably don't need to have 13 physical pages mapped at the same time on an 80386. IOW, here's what can happen:
  • page fault for the physical address of the first byte of the instruction (fills one TLB entry if handled)
  • page fault for the instruction fetch
  • page fault for the physical address of the remaining bytes of the instruction (fills one more TLB entry if handled)
  • page fault for the instruction fetch
  • page faults for the physical address of the data accessed by the instruction (several more TLB entries filled if handled)
  • page faults for the data accesses by the instruction
Once filled, the TLB entries should not get invalidated and then filled again when we return from the page fault handler to the instruction (unless the TLB is too small for all the activities performed by the page fault handler). Right? Wrong?

Re: Virtual Memory Goals

Posted: Fri Aug 19, 2016 8:16 am
by LtG
alexfru wrote: Page tables and directories are only involved when the addresses aren't in the TLB yet. I guess the CPU would have to bring the addresses into the TLB *before* executing the instruction (at least the addresses for the code pages containing the instruction), and it shouldn't evict those TLB entries (the ones that are valid for the instruction) if, say, just one page out of the 6 code/data pages isn't accessible. That means you probably don't need to have 13 physical pages mapped at the same time on an 80386. IOW, here's what can happen:
  • page fault for the physical address of the first byte of the instruction (fills one TLB entry if handled)
  • page fault for the instruction fetch
  • page fault for the physical address of the remaining bytes of the instruction (fills one more TLB entry if handled)
  • page fault for the instruction fetch
  • page faults for the physical address of the data accessed by the instruction (several more TLB entries filled if handled)
  • page faults for the data accesses by the instruction
Once filled, the TLB entries should not get invalidated and then filled again when we return from the page fault handler to the instruction (unless the TLB is too small for all the activities performed by the page fault handler). Right? Wrong?
I think your point is right, but I don't think it helps much, if at all. You still need those TLB entries to point to physical memory; if you're going to repurpose the backing memory (RAM) to load in whatever is needed by the code (which caused the #PF to begin with), then you're going to need to fill that RAM with something else.

Essentially you'd need to make the cache think that two different TLB (virtual address) entries are pointing to the same backing memory (the same 4KiB RAM page), yet the cache would have different contents for each virtual address.

Or did you mean that when the page dir has a not-present PDE you'd create one on the fly (incl. the page table), containing only that one PTE, allowing the #PF to be resolved and a single TLB entry to be "fed" to the TLB, and then always repurposing the same 4KiB RAM page? I think it might be possible to get that to work...

Or did you mean something else?

Brendan wrote: There's multiple cases where CPU accesses "EIP plus 2 other addresses" (MOVSD, CMPSD, PUSH, POP); and with misaligned accesses (e.g. accessing 2 bytes at 0x0FFFFFFF) each of the 3 simultaneous accesses can be split across 2 pages (and page tables, and ..); leading to a worst case of "6 pages plus 6 page tables plus page directory = 52 KiB" (for "plain 32-bit paging"), and a worst case of "6 pages plus 6 page tables plus 6 page directories plus 6 PDPTs plus PML4 = 100 KiB" (for long mode).
Might get slightly off topic, but won't you need at least the following for worst/pathological case:
  • IDT
  • #PF handler
  • Stack (#PF will push to it)
  • Current code page
  • Source data page
  • Destination data page
  • Page table for each page (assuming absolutely worst case where everything is spread so far apart that they are in separate page tables)
  • Page directory
For most of those you need to account for two pages for unaligned/page-boundary access. I'm not sure if there are some internals of the CPU you could take advantage of to reduce this list slightly, such as relying on the cache having old entries, but even if that works it would be a horrible hack.

The OS dev could try to minimize the #PF handler and thus merge it with the IDT in the same page.

So I count:
  • IDT + #PF handler; 1 page (combined, which means at most ~4KiB for #PF handler)
  • Stack (#PF will push to it); 2 pages (page boundary)
  • Current code page; 2 pages (page boundary)
  • Source data page; 2 pages (page boundary)
  • Destination data page; 2 pages (page boundary)
  • Page table for each page; 9 pages (one for each of the above pages)
  • Page directory; 1 page
Giving a total of 19 pages = 76KiB of RAM (still under Brendan's 256KiB).
I'm interested to see if anyone can come up with anything that must be added to that list or can be removed, keeping in mind the absolute worst/pathological case..

edit. Forgot to list GDT, though that could possibly be on the same page as IDT and #PF handler..

Re: Virtual Memory Goals

Posted: Fri Aug 19, 2016 8:57 am
by Brendan
Hi,
LtG wrote:Might get slightly off topic, but won't you need at least the following for worst/pathological case:
None of those things are part of a normal process (e.g. part of kernel). If you're not looking only at "minimum for normal process" then you're going to need a driver that's able to access swap space (and everything it depends on while accessing swap space).


Cheers,

Brendan