DMA Allocators, or: Brendan lied to me!

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1618

OK, he left something out of his memory management guide, but that doesn't get clicks.

I've been reading the SeaBIOS source code for its EHCI implementation lately, and was surprised how easy everything they do is. A pipe is just a QH that is linked into either the periodic schedule or the async list, and to transfer stuff on an async pipe, you just allocate the TDs on the stack, fill them out, and link them to the QH, then wait for them to be processed. All very nice and simple.

Then I noticed that this can only possibly work with SeaBIOS because it is running in unpaged 32-bit mode, whereas my OS is running in 64-bit mode. Reading the EHCI spec, whether or not the EHCI is capable of handling 64-bit addresses is a runtime property, and even on 64-bit implementations you have the weird property that all QHs and TDs and the periodic schedule must come from the same 4GB page. And if they are 32-bit implementations, then everything, even the buffers, must come from below 4GB.
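To make the runtime property concrete, here is a minimal sketch in C. It assumes the driver has already read the HCCPARAMS capability register (bit 0 is the 64-bit Addressing Capability flag); the function names are made up for illustration and do not come from SeaBIOS or Linux.

```c
#include <stdbool.h>
#include <stdint.h>

/* HCCPARAMS bit 0: 64-bit Addressing Capability. */
#define EHCI_HCCPARAMS_64BIT (1u << 0)

/* Hypothetical helper: highest address a data buffer may have. */
static uint64_t ehci_buffer_dma_mask(uint32_t hccparams)
{
    return (hccparams & EHCI_HCCPARAMS_64BIT) ? UINT64_MAX : 0xFFFFFFFFull;
}

/* QHs, TDs and the periodic list share one upper 32-bit half
 * (the value programmed into CTRLDSSEGMENT), so two such structures
 * are compatible only if they lie in the same 4GB segment. */
static bool ehci_same_segment(uint64_t a, uint64_t b)
{
    return (a >> 32) == (b >> 32);
}

/* Hypothetical helper: value to program into CTRLDSSEGMENT for a
 * control structure located at addr. */
static uint32_t ehci_segment_of(uint64_t addr)
{
    return (uint32_t)(addr >> 32);
}
```

On a controller without the capability bit, the mask comes out as 0xFFFFFFFF and every segment value is necessarily zero.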

I thought about this problem. Obviously only the EHCI driver knows what addresses the EHC is prepared to accept. So how should USB class drivers deal with this? Does my EHCI driver need a method to allocate physical memory? More generally, do all my devices need something like that? I wondered how a more mature OS deals with this issue, so I looked at the Linux source code. And stumbled into the rabbit hole that is the world of DMA allocators.

Essentially, Linux has an abstract representation of every device (called "struct device") that associates each device with its position in a hierarchy. Each device has, among other things, a bus device and a DMA mask. That DMA mask is set by the device driver to indicate the capabilities of the device, but the bus the device is found on also adds more DMA properties. The bus can add DMA limits, or DMA windows. It is entirely possible that we need to translate between physical and DMA addresses. And that is already the crux of the issue: External hardware does not understand physical addresses, but rather its own DMA addresses. And those two things can differ by an offset, and be limited by a mask.

Actually that was one of the more profound things I learned there: Each device has its own DMA address space, so DMA addresses only make sense in connection to the device they belong to.
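A minimal sketch of the offset-and-mask idea in C. This is not Linux's code: struct toy_dev, its fields and toy_phys_to_dma() are names invented for illustration, and the sketch assumes the bus translation is a single non-negative offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device DMA description. */
struct toy_dev {
    uint64_t dma_mask;   /* highest DMA address the device can emit */
    uint64_t dma_offset; /* assumed bus translation: dma = phys + offset */
};

/* Translate a physical range into this device's DMA address space.
 * Returns false if the device cannot reach the whole range. */
static bool toy_phys_to_dma(const struct toy_dev *dev, uint64_t phys,
                            uint64_t len, uint64_t *dma_out)
{
    if (len == 0)
        return false;
    uint64_t start = phys + dev->dma_offset;
    if (start < phys)               /* offset addition wrapped */
        return false;
    uint64_t last = start + (len - 1);
    if (last < start)               /* range wrapped */
        return false;
    if (last > dev->dma_mask)       /* beyond what the device can address */
        return false;
    *dma_out = start;
    return true;
}
```

The same physical page can therefore yield different DMA addresses, or none at all, depending on which device is asking.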

Getting back to EHCI, this design with the bus hierarchy established beforehand allows the EHCI driver to not even care whether it found a PCI EHCI or a memory-mapped EHCI. The "device" structure is already filled out, and the driver just allocates the buffers accordingly. QHs, TDs and the periodic schedule come from below 4GB, but then it would allow 64-bit addresses if the code were not commented out. Oh well, whatever the issue, I am sure they will fix it any day now.

Actually, how those buffers work is also pretty clever. So there is this function called "dma_alloc_coherent()" that will allocate a physical page in the correct DMA window for the device, then link it into kernel address space, returning both the DMA address and the virtual pointer. Then there are also DMA pools which use that function to subdivide the page further (using a linked list approach to physical memory management, because they return fixed-size elements), thereby allowing a QH to only take up 128 bytes instead of 4096. Pretty nifty stuff.
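The subdivision can be sketched roughly like this in C. It is only an illustration of the free-list idea, not the Linux implementation: struct mini_pool and the mini_pool_*() functions are invented names, and the page (with its DMA address) is assumed to have been obtained elsewhere from whatever coherent allocator the OS has.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_PAGE_SIZE 4096u

/* Hypothetical pool carving one DMA-coherent page into fixed-size elements. */
struct mini_pool {
    unsigned char *virt;      /* kernel virtual address of the page */
    uint64_t       dma;       /* DMA address of the same page */
    size_t         elem_size;
    void          *free_list; /* list threaded through the free elements */
};

static void mini_pool_init(struct mini_pool *p, void *page_virt,
                           uint64_t page_dma, size_t elem_size)
{
    assert(elem_size >= sizeof(void *) && elem_size <= POOL_PAGE_SIZE);
    p->virt = page_virt;
    p->dma = page_dma;
    p->elem_size = elem_size;
    p->free_list = NULL;
    /* Push elements in reverse so the first allocation is at offset 0. */
    for (size_t i = POOL_PAGE_SIZE / elem_size; i-- > 0;) {
        void *elem = p->virt + i * elem_size;
        memcpy(elem, &p->free_list, sizeof(void *)); /* store link */
        p->free_list = elem;
    }
}

/* Returns the virtual pointer and stores the matching DMA address. */
static void *mini_pool_alloc(struct mini_pool *p, uint64_t *dma_out)
{
    void *elem = p->free_list;
    if (elem == NULL)
        return NULL;
    memcpy(&p->free_list, elem, sizeof(void *));     /* pop link */
    *dma_out = p->dma + (uint64_t)((unsigned char *)elem - p->virt);
    return elem;
}

static void mini_pool_free(struct mini_pool *p, void *elem)
{
    memcpy(elem, &p->free_list, sizeof(void *));
    p->free_list = elem;
}
```

With an element size of 128, one page yields 32 elements, and each allocation hands back both addresses a driver needs: the pointer for the CPU and the DMA address to write into the hardware structures.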

This discovery does mean that in general, it will not be possible to write arbitrary RAM to a USB stick; the I/O buffers must come from DMA memory for the stick (and thereby, its HC). And if I've understood correctly, Linux works around this by allocating the memory for its page cache from DMA memory for the device that is being cached. Meaning that a userspace read() that hits the page cache will be satisfied with just a memcpy(), and if it misses the cache, the associated drive can fill the page cache as quickly as possible.

So that was interesting. My question is, what other ways are there to deal with weird DMA requirements? What do you guys use?

thewrongchristian · **Joined:** Tue Apr 03, 2018 2:44 am **Posts:** 404

nullplan wrote:

So that was interesting. My question is, what other ways are there to deal with weird DMA requirements? What do you guys use?

My toy OS doesn't worry about it.

I don't support any ISA devices that require ISA DMA (PA<16MB).

And I don't use PAE, so I don't support memory with PA>4GB.

But I have planned for both, and my physical memory manager is zoned. When allocating physical memory, I can pass flags to say "give me ISA DMA memory" or "give me 32-bit DMA memory", or by default allocate anywhere.
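A rough sketch in C of what such a zoned interface could look like. This is not the actual code of any OS in this thread: the flag names, struct pmm and the bump-style allocation are all assumptions, and it handles only a single contiguous, frame-aligned RAM range with no freeing.

```c
#include <stdint.h>

#define FRAME_SIZE   4096ull
#define PMM_NO_FRAME UINT64_MAX

/* Hypothetical allocation flags, one per DMA capability class. */
enum pmm_flags { PMM_ANY, PMM_DMA32, PMM_ISA_DMA };

enum { ZONE_ISA, ZONE_DMA32, ZONE_HIGH, ZONE_COUNT };

/* Exclusive upper bound of each zone: 16MB, 4GB, open-ended. */
static const uint64_t zone_end[ZONE_COUNT] = {
    1ull << 24, 1ull << 32, UINT64_MAX
};

struct pmm {
    uint64_t next[ZONE_COUNT]; /* next unallocated frame in each zone */
    uint64_t end[ZONE_COUNT];  /* end of usable RAM inside each zone */
};

static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Split the RAM range [ram_start, ram_end) across the zones. */
static void pmm_init(struct pmm *pmm, uint64_t ram_start, uint64_t ram_end)
{
    uint64_t lo = 0;
    for (int z = 0; z < ZONE_COUNT; z++) {
        pmm->next[z] = clamp_u64(ram_start, lo, zone_end[z]);
        pmm->end[z]  = clamp_u64(ram_end, lo, zone_end[z]);
        lo = zone_end[z];
    }
}

/* Try the highest zone the caller can tolerate first, then fall back
 * to lower zones, so scarce low memory is used last. */
static uint64_t pmm_alloc_frame(struct pmm *pmm, enum pmm_flags flags)
{
    int highest = flags == PMM_ISA_DMA ? ZONE_ISA
                : flags == PMM_DMA32   ? ZONE_DMA32 : ZONE_HIGH;
    for (int z = highest; z >= 0; z--) {
        if (pmm->end[z] - pmm->next[z] >= FRAME_SIZE) {
            uint64_t frame = pmm->next[z];
            pmm->next[z] += FRAME_SIZE;
            return frame;
        }
    }
    return PMM_NO_FRAME;
}
```

The fallback order is a design choice: an unconstrained request only dips into the 32-bit or ISA zones once everything above them is exhausted.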

Probably the most elegant way to handle this is to use an IOMMU, so the EHCI 32-bit address space can be mapped to anywhere in the 64-bit physical address space under control of a 64-bit OS.

xeyes · **Joined:** Mon Dec 07, 2020 8:09 am **Posts:** 212

nullplan wrote:

Obviously only the EHCI driver knows what addresses the EHC is prepared to accept. So how should USB class drivers deal with this?

Software:

If the upper levels don't know about the limitations, somewhere on the lower levels, there could be a need to copy buffers within the main memory.

If you want 'zero copy' all the way, then the upper levels, including and especially the user space apps, have to know about the situation and cooperate.

Hardware:

Use IOMMU?

Me:

Hardcoded a 64KB buffer for SB16's ISA DMA and copy buffers within the main memory for it. :lol:

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1618

xeyes wrote:

If the upper levels don't know about the limitations, somewhere on the lower levels, there could be a need to copy buffers within the main memory.

If you want 'zero copy' all the way, then the upper levels, including and especially the user space apps, have to know about the situation and cooperate.

I think for now I will copy Linux's design with the general-purpose "device" structure that can be used to allocate DMA-able memory for any given device. Then the USB class drivers know to allocate their buffers that way from whatever HC that spawned them. For USB MSC (and block devices more generally), the idea of an in-kernel page cache is alluring, and it would mean that user space applications do not need to allocate their memory in a special way, since I/O system calls will transform, after finitely many steps, into copying from or to the page cache. The cache buffers themselves are backed by special DMA-able memory for the device the cache is for. The user doesn't need to know about hardware limitations.

I could conceive of a fast path, in which the user-space address and length given are divisible by the page size, in which case I can give the user a COW mapping of the page cache, but that is a special case, and I need to handle the general case, where neither is the case, gracefully. I cannot give the user a writable mapping of the page cache, because then other processes can see the changes to one process's I/O buffer before it has committed those with a write() system call. In general, I don't know how useful this fast path would be, given those limitations.

MollenOS · **Joined:** Wed Oct 26, 2011 12:00 pm **Posts:** 202

Just like some of the other replies here describe, I have implemented a 'dmabuf' interface which can be used by both applications and drivers. It's partly inspired by the Linux dma_buf interface, and allows for zero-copy DMA transactions. You can see the interface at https://github.com/Meulengracht/MollenO ... s/dmabuf.h

Drivers obviously allocate device memory using the dmabuf interface, but so does the libc in some cases. My libc has a default transfer buffer of a certain size which is used to facilitate most reads and writes, so in theory it'll be copy-once (because we can't be sure the buffer the user supplies is suitable for the underlying hardware). For larger transfers that don't fit the default transfer buffer, we allocate a new one that does. All transfer buffers are allocated in low memory to allow for zero-copy through the rest of the stack. The dmabuf interface supports a scatter-gather list to make sure we don't need contiguous physical memory allocation.
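For readers unfamiliar with scatter-gather lists, here is a generic sketch in C. It is not the MollenOS dmabuf API: struct sg_entry and sg_build() are hypothetical names, and the input is assumed to be an array of page-aligned physical page addresses backing one buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* One physically contiguous piece of a buffer. */
struct sg_entry {
    uint64_t addr; /* physical (or DMA) start address */
    uint64_t len;  /* length in bytes */
};

/* Build a scatter-gather list from the pages backing a buffer,
 * merging neighbours that happen to be physically adjacent.
 * Returns the number of entries written, or 0 if the output
 * array is too small (or there were no pages). */
static size_t sg_build(const uint64_t *pages, size_t npages,
                       uint64_t page_size,
                       struct sg_entry *out, size_t max_entries)
{
    size_t n = 0;
    for (size_t i = 0; i < npages; i++) {
        if (n > 0 && out[n - 1].addr + out[n - 1].len == pages[i]) {
            out[n - 1].len += page_size;   /* extend the previous run */
        } else {
            if (n == max_entries)
                return 0;
            out[n].addr = pages[i];
            out[n].len = page_size;
            n++;
        }
    }
    return n;
}
```

The driver then hands each entry to the hardware as a separate descriptor, so the buffer only needs to be contiguous in virtual memory.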

I hope this can clarify things a bit.

xeyes · **Joined:** Mon Dec 07, 2020 8:09 am **Posts:** 212

nullplan wrote:

xeyes wrote:

If the upper levels don't know about the limitations, somewhere on the lower levels, there could be a need to copy buffers within the main memory.

If you want 'zero copy' all the way, then the upper levels, including and especially the user space apps, have to know about the situation and cooperate.

I think for now I will copy Linux's design with the general-purpose "device" structure that can be used to allocate DMA-able memory for any given device. Then the USB class drivers know to allocate their buffers that way from whatever HC that spawned them. For USB MSC (and block devices more generally), the idea of an in-kernel page cache is alluring, and it would mean that user space applications do not need to allocate their memory in a special way, since I/O system calls will transform, after finitely many steps, into copying from or to the page cache. The cache buffers themselves are backed by special DMA-able memory for the device the cache is for. The user doesn't need to know about hardware limitations.

I'm sure it will work out with buffer copying; there are only 2 possible cases:

1. mem <= 4GB: won't have any problem whatsoever.
2. mem > 4GB: then there can be up to 3.x GB dedicated to DMA, so as good as or better than case 1.

nullplan wrote:

I could conceive of a fast path, in which the user-space address and length given are divisible by the page size, in which case I can give the user a COW mapping of the page cache, but that is a special case, and I need to handle the general case, where neither is the case, gracefully. I cannot give the user a writable mapping of the page cache, because then other processes can see the changes to one process's I/O buffer before it has committed those with a write() system call. In general, I don't know how useful this fast path would be, given those limitations.

Like in software, passing big objects around by reference is faster.

Aligning the user-visible buffer is not that important. DMAs don't necessarily require the same alignment as the CPU page sizes. On the other hand, a software interface could always deal with the unaligned parts at the beginning and the end separately while still taking advantage of 'zero copy' for the bulk of the transfer.

This probably won't mix well with a normal disk cache either way.

Even if you come up with a design to merge them together, this can effectively purge the whole disk cache in a split second and leave everyone else in the cold until the cache is slowly warmed up again.
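The head/bulk/tail split can be sketched in C as follows. The names are made up, and the sketch assumes the required alignment is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* How a transfer breaks down for a device needing 'align'-byte alignment. */
struct io_split {
    size_t head; /* unaligned bytes at the start, handled by copying */
    size_t bulk; /* aligned middle part, candidate for zero copy */
    size_t tail; /* leftover bytes at the end, handled by copying */
};

/* align must be a power of two. */
static struct io_split split_for_dma(uintptr_t addr, size_t len, size_t align)
{
    struct io_split s;
    size_t misalign = (size_t)(addr & (align - 1));
    size_t head = misalign ? align - misalign : 0;
    if (head > len)
        head = len;              /* the whole buffer fits in the head */
    size_t rest = len - head;
    s.head = head;
    s.bulk = rest & ~(align - 1);
    s.tail = rest - s.bulk;
    return s;
}
```

For a 10000-byte buffer starting 3 bytes past a 512-byte boundary, this gives a 509-byte head, a 9216-byte bulk and a 275-byte tail, so only 784 bytes need to be copied.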

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3196

The EHCI issue is not really a 32-bit vs 64-bit OS issue, rather an issue about physical addresses. A 32-bit OS using PAE can allocate 64-bit physical addresses.

Anyway, I solved it by considering the device to be a 32-bit device, and so it will always get addresses below 4GB. It's not worth the trouble of adding more complex physical memory allocators just so EHCI can get higher 4G blocks. I only have two physical address allocators: 32-bit and 64-bit. EHCI will use the 32-bit version, just like UHCI and OHCI. XHCI uses the 64-bit allocator.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1618

rdos wrote:

The EHCI issue is not really a 32-bit vs 64-bit OS issue, rather an issue about physical addresses. A 32-bit OS using PAE can allocate 64-bit physical addresses.

The issue is that different devices can accept different physical addresses. And which addresses those are is a runtime property in at least some devices. And indeed that DMA addresses are different from physical addresses, at least in principle.

rdos wrote:

Anyway, I solved it by considering the device to be a 32-bit device, and so it will always get addresses below 4GB. It's not worth the trouble of adding more complex physical memory allocators just so EHCI can get higher 4G blocks. I only have two physical address allocators: 32-bit and 64-bit. EHCI will use the 32-bit version, just like UHCI and OHCI. XHCI uses the 64-bit allocator.

So what do you do about ISA DMA? I had the idea of hardcoding zones as well, but that felt wrong somehow. There is ISA DMA, which needs 24 bit physical addresses and blocks that do not cross a 64k boundary. There is SMP initialization which requires a 20 bit address when done with the SIPI method (and a 64-bit address when done with the ACPI mailbox method). According to the Linux source code, some EHCI implementations have errata that say they do not work well with addresses above 2GB, so those need 31 bit addresses. Do I add another zone whenever a device comes along with a new weird requirement?

Plus, the zoning adds difficulties when confronted with a device that may use one or the other. As you admit when you say that you treat EHCI as a 32-bit device because the added code complexity is not worth it. With the method presented here, there are no zones, and the max address is a data item. So all devices can take advantage of all addresses (well, most of them. Again the thing with the 4GB page for the periodic list and all the QHs and TDs is just weird, and Linux will fix that page to zero, and I intend to do the same), so you do not run into a fragmentation issue where you have to return "no more memory" even though plenty is free, just not in your zone.
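A minimal sketch in C of what "the max address is a data item" could look like. The struct and function names are invented for illustration, and the alignment and boundary values are assumed to be powers of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device constraints, filled in by the driver or bus. */
struct dma_constraints {
    uint64_t max_addr; /* highest usable address, e.g. 0xFFFFFF for ISA */
    uint64_t align;    /* required alignment, power of two, at least 1 */
    uint64_t boundary; /* range must not cross a multiple of this; 0 = none */
};

static uint64_t align_up(uint64_t v, uint64_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/* Does [addr, addr + len) satisfy the constraints? */
static bool dma_range_ok(const struct dma_constraints *c,
                         uint64_t addr, uint64_t len)
{
    if (len == 0)
        return false;
    uint64_t last = addr + (len - 1);
    if (last < addr || last > c->max_addr)
        return false;
    if (addr & (c->align - 1))
        return false;
    if (c->boundary && addr / c->boundary != last / c->boundary)
        return false;
    return true;
}

/* Find the lowest suitable address inside the free range [start, end). */
static bool dma_find_in_range(const struct dma_constraints *c,
                              uint64_t start, uint64_t end,
                              uint64_t len, uint64_t *out)
{
    if (len == 0)
        return false;
    uint64_t addr = align_up(start, c->align);
    /* If the first candidate straddles a boundary, skip to the next one. */
    if (c->boundary && addr / c->boundary != (addr + len - 1) / c->boundary)
        addr = align_up((addr / c->boundary + 1) * c->boundary, c->align);
    if (addr < start || addr > end || len > end - addr)
        return false;
    if (!dma_range_ok(c, addr, len))
        return false;
    *out = addr;
    return true;
}
```

ISA DMA would be described as `{ 0xFFFFFF, 1, 0x10000 }` and an EHCI with the 2GB erratum as `{ 0x7FFFFFFF, 32, 0 }`; a new weird device means a new set of numbers rather than a new zone.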

rdos · **Joined:** Wed Oct 01, 2008 1:55 pm **Posts:** 3196

nullplan wrote:

rdos wrote:

The EHCI issue is not really a 32-bit vs 64-bit OS issue, rather an issue about physical addresses. A 32-bit OS using PAE can allocate 64-bit physical addresses.

The issue is that different devices can accept different physical addresses. And which addresses those are is a runtime property in at least some devices. And indeed that DMA addresses are different from physical addresses, at least in principle.

I think that is only in principle. There are some issues if you want to support multi-CPU setups with private address spaces, but apart from that, I think you can assume they are unity-mapped.

nullplan wrote:

rdos wrote:

Anyway, I solved it by considering the device to be a 32-bit device, and so it will always get addresses below 4GB. It's not worth the trouble of adding more complex physical memory allocators just so EHCI can get higher 4G blocks. I only have two physical address allocators: 32-bit and 64-bit. EHCI will use the 32-bit version, just like UHCI and OHCI. XHCI uses the 64-bit allocator.

So what do you do about ISA DMA?

I don't support it. I use PIO on ISA devices.

nullplan wrote:

I had the idea of hardcoding zones as well, but that felt wrong somehow. There is ISA DMA, which needs 24 bit physical addresses and blocks that do not cross a 64k boundary.

My physical memory allocator can support aligned addresses when allocating multiple pages, like 64k-aligned. I can also allocate 2M pages.

nullplan wrote:

There is SMP initialization which requires a 20 bit address when done with the SIPI method (and a 64-bit address when done with the ACPI mailbox method).

I reserve fixed addresses for this.

nullplan wrote:

According to the Linux source code, some EHCI implementations have errata that say they do not work well with addresses above 2GB, so those need 31 bit addresses. Do I add another zone whenever a device comes along with a new weird requirement?

Plus, the zoning adds difficulties when confronted with a device that may use one or the other. As you admit when you say that you treat EHCI as a 32-bit device because the added code complexity is not worth it. With the method presented here, there are no zones, and the max address is a data item. So all devices can take advantage of all addresses (well, most of them. Again the thing with the 4GB page for the periodic list and all the QHs and TDs is just weird, and Linux will fix that page to zero, and I intend to do the same), so you do not run into a fragmentation issue where you have to return "no more memory" even though plenty is free, just not in your zone.

Actually, the main issue with EHCI is whether Windows takes advantage of the 4G banking method. If it doesn't, chances are some EHCI chips are broken. The decision made in Linux points to some EHCI chips being broken.

You also need to separate the issue of QHs and TDs from the issue of whether you can map an external address to the queues, or whether you need to allocate a new block below 4G and copy the buffer. If you have any physical address above 4G in the system, then the copy method must be used (unless you do it on an address-by-address basis).

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1618

rdos wrote:

Actually, the main issue with EHCI is whether Windows takes advantage of the 4G banking method. If it doesn't, chances are some EHCI chips are broken. The decision made in Linux points to some EHCI chips being broken.

Well no. The comment there explains that it is a limitation of the Linux DMA allocator to not be able to restrict allocations to any 4GB page except the first one (it only supports a limit, not a base address). Although frankly, I would suspect most other OSes to have the same limitation. And with the possibility of the CTRLDSSEGMENT register being nonzero not being exercised, you are right that hardware designers may not have tested it thoroughly.
