Handling physical memory schedules with paging
Posted: Sat Oct 31, 2020 4:03 am
Many modern hardware devices are built on setting up physical memory schedules, like network cards, AHCI, and USB controllers. There is no direct way of accessing physical memory when using paging with a non-unity mapping. Many of these drivers tend to need extra entries in the schedules for virtual addresses, and when having a physical address it's hard to translate it to a virtual without maping it with paging, something that is slow and inefficient.
Another problem with using physical memory is that simple mistakes can lead to devices corrupting kernel or application memory. It's easy enough to do this with malicious code in a PCI device, but even well-behaved PCI devices can corrupt things in kernel by handing them corrupt schedules.
With 32-bit paging you can use 4M pages, and with long mode and PAE paging you can use 2M pages. These are large enough to create a memory pool of descriptors and buffers typically used by PCI devices that are based on physical memory schedules. It provides easy translation between physical address and ordinary adress and the reverse by adding or subtracting a "base". It's also easy to determine if a given physical address is within the 2M area by comparing the most significant bits.
For a segmented OS (like mine), it would be a good idea to allocate a selector with the linear base and a 2M limit. This way kernel software cannot exceed the limits of the 2M area and thus cannot access linear or physical memory that doesn't belong to the object. The translation between physical address and offset would then just subtract the physical base. If the offset is above 2M, a protection fault will result rather than accessing invalid memory.
Typical scheduled devices use many objects of the same size, which typically should be aligned. I think a bitmap can be used for allocation (rather than linked lists). The bitmap could either be part of the 2M area or a separate object.
There are a few challenges though. One is that it must be possible to allocate both 2M and 4M physical pages from both below 4G and anywhere in memory. This is because some PCI devices only support 32-bit physical addresses. Another is that it also must be possible to alocate 2M and 4M aligned linear addresses. When using 32-bit paging, 4M linear addresses and 4M physical pages must be used, and in order not to waste to much memory these blocks needs to be split into two 2M pages. It might also be possible to split blocks even further for devices that are not expected to use up to 2M of memory. For devices that need more than 2M, they could allocate new 2M areas on demand. It might even be a better idea to use a 64k base size instead of 2M. Or maybe support for both could be added.
So, maybe an API like this:
CreateMemoryBlock32() and CreateMemoryBlock64(). Returns a handle and creates a header with no memory objects
AllocateMemoryBlock(handle, size). Returns a 32-bit block handle, a 48-bit segmented address (or a linear address) and a physical address
GetMemoryBlockPhysical(handle). Converts between 32-bit block handle and physical address
GetMemoryBlockAddress(handle). Converts between 32-bit block handle and 48-bit address
FindMemoryBlockPhysical(physical). Finds the 32-bit block handle for a physical address.
The 32-bit handle could be split into different parts. The higher bits would be from CreateMemoryBlock(). The lowest bits would be from the 2M memory block. If the minimum allocation size is 32 bytes then up to 64k entries could be allocated from a 2M memory block and so 16-bits will be needed for block handles within a 2M block. This would leave 16-bits for the CreateMemoryBlock handle and associating multiple memory blocks to the same handle. Maybe 10 bits could be used for the handle (which would allow up to 1024 handles to be allocated) and 6 bits for the memory block, which would allow up to 64 memory blocks with a 2M size (a total size of 128M). Altough a more realistic scenario would be to initially allocate a few 64k blocks and then 2M blocks to save memory.
Using a 32 byte minimum block limit is a bit problematic as the handles will need to be two bytes and so would consume 128k. A better approach might be to only use 64k memory blocks for allocation of 32-byte memory blocks. This will only consume 4k. Always using 4k would set a minimum limit for allocation from 2M blocks to 1k. Thus, allocations of memory objects below 1k would use 64k blocks and above would use 2M blocks. This also means that this part will only use 10 bits instead of 16. The bitmaps used could also be made smaller with this method.
Another problem with using physical memory is that simple mistakes can lead to devices corrupting kernel or application memory. It's easy enough to do this with malicious code in a PCI device, but even well-behaved PCI devices can corrupt things in kernel by handing them corrupt schedules.
With 32-bit paging you can use 4M pages, and with long mode and PAE paging you can use 2M pages. These are large enough to create a memory pool of descriptors and buffers typically used by PCI devices that are based on physical memory schedules. It provides easy translation between physical address and ordinary adress and the reverse by adding or subtracting a "base". It's also easy to determine if a given physical address is within the 2M area by comparing the most significant bits.
For a segmented OS (like mine), it would be a good idea to allocate a selector with the linear base and a 2M limit. This way kernel software cannot exceed the limits of the 2M area and thus cannot access linear or physical memory that doesn't belong to the object. The translation between physical address and offset would then just subtract the physical base. If the offset is above 2M, a protection fault will result rather than accessing invalid memory.
Typical scheduled devices use many objects of the same size, which typically should be aligned. I think a bitmap can be used for allocation (rather than linked lists). The bitmap could either be part of the 2M area or a separate object.
There are a few challenges though. One is that it must be possible to allocate both 2M and 4M physical pages from both below 4G and anywhere in memory. This is because some PCI devices only support 32-bit physical addresses. Another is that it also must be possible to alocate 2M and 4M aligned linear addresses. When using 32-bit paging, 4M linear addresses and 4M physical pages must be used, and in order not to waste to much memory these blocks needs to be split into two 2M pages. It might also be possible to split blocks even further for devices that are not expected to use up to 2M of memory. For devices that need more than 2M, they could allocate new 2M areas on demand. It might even be a better idea to use a 64k base size instead of 2M. Or maybe support for both could be added.
So, maybe an API like this:
CreateMemoryBlock32() and CreateMemoryBlock64(). Returns a handle and creates a header with no memory objects
AllocateMemoryBlock(handle, size). Returns a 32-bit block handle, a 48-bit segmented address (or a linear address) and a physical address
GetMemoryBlockPhysical(handle). Converts between 32-bit block handle and physical address
GetMemoryBlockAddress(handle). Converts between 32-bit block handle and 48-bit address
FindMemoryBlockPhysical(physical). Finds the 32-bit block handle for a physical address.
The 32-bit handle could be split into different parts. The higher bits would be from CreateMemoryBlock(). The lowest bits would be from the 2M memory block. If the minimum allocation size is 32 bytes then up to 64k entries could be allocated from a 2M memory block and so 16-bits will be needed for block handles within a 2M block. This would leave 16-bits for the CreateMemoryBlock handle and associating multiple memory blocks to the same handle. Maybe 10 bits could be used for the handle (which would allow up to 1024 handles to be allocated) and 6 bits for the memory block, which would allow up to 64 memory blocks with a 2M size (a total size of 128M). Altough a more realistic scenario would be to initially allocate a few 64k blocks and then 2M blocks to save memory.
Using a 32 byte minimum block limit is a bit problematic as the handles will need to be two bytes and so would consume 128k. A better approach might be to only use 64k memory blocks for allocation of 32-byte memory blocks. This will only consume 4k. Always using 4k would set a minimum limit for allocation from 2M blocks to 1k. Thus, allocations of memory objects below 1k would use 64k blocks and above would use 2M blocks. This also means that this part will only use 10 bits instead of 16. The bitmaps used could also be made smaller with this method.