Is CPU really free to do other tasks while DMA is going on?

jacks · Post by **jacks** » Fri Sep 07, 2012 9:19 am

Hi all,

While working on PCI bus master DMA, I started having doubts about what the CPU can and cannot do during DMA transfers. I hope someone here can clear things up.

If a reference architecture is required, I'm working on the x86 architecture with PCI IDE bus master transfers. No paging, no virtual memory.

Specifically, what is the use of DMA (e.g. PCI bus master DMA) when the CPU can't use the PCI bus to access any other device, and also can't access main memory (RAM), during bus master transfers?

Is it that, at the hardware level (with respect to timings etc.), DMA is always faster than programmed I/O, no matter how fast the CPU is?

I doubt it, as DMA transfers depend very much on the bus on which the device resides (e.g. PCI), while a CPU of much higher frequency can always be installed in the same system, thus increasing (at least theoretically) the speed of programmed I/O transfers.

Of course it all depends on the timings of the device concerned, i.e. whether the device is faster when using programmed I/O or DMA, but I am concerned with the capabilities the CPU has (with respect to I/O and memory accesses) during a bus master DMA transfer.

Specifically, why do we say that, unlike programmed I/O, the CPU is free to do other tasks while DMA is being performed? What are these other tasks?

What am I missing here?

If the CPU is free to do other tasks while DMA is going on, then how is cache coherency ensured?

Do we need to take extra steps to ensure cache coherency, or is it taken care of completely by the x86 CPU hardware itself, possibly with some help from the northbridge/MCH?

What am I missing?

rdos · Post by **rdos** » Fri Sep 07, 2012 12:44 pm

DMA is not always faster than programmed I/O. It depends on transfer lengths and the like. There is also always some cost for setting up DMA and the memory structures it requires.

As for what is going on: once DMA is programmed, or a bus master is enabled, it can live its own life doing the transfers, much like an independent core can. You generally set up some memory pointers that the bus master uses to know what to transfer. You can, for instance, study modern network controllers, AHCI controllers or USB controllers to get an idea of what the memory structures a bus master uses look like.
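For the PCI IDE bus master that the original question is about, the "memory pointers" take the form of a table of Physical Region Descriptors (PRDs). The following is only a sketch of how such a table might be filled for one physically contiguous buffer: the 8-byte entry layout and the 64 KiB boundary rule come from the bus master IDE programming interface, while the names `struct prd`, `PRD_EOT` and `prd_build` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One Physical Region Descriptor (PRD) of a PCI IDE bus master: 8 bytes. */
struct prd {
    uint32_t phys_base;   /* physical address of the region (word-aligned) */
    uint16_t byte_count;  /* bytes in the region; 0 encodes 64 KiB */
    uint16_t flags;       /* bit 15 set marks the last entry (end of table) */
};

#define PRD_EOT 0x8000u   /* hypothetical name for the end-of-table bit */

/*
 * Fill `table` for a physically contiguous buffer, splitting it so that no
 * region crosses a 64 KiB boundary. Returns the number of entries used,
 * or 0 if the buffer is empty or the table is too small.
 */
size_t prd_build(struct prd *table, size_t max_entries,
                 uint32_t phys, uint32_t len)
{
    size_t n = 0;

    assert((phys & 1u) == 0 && (len & 1u) == 0);  /* word granularity */
    while (len > 0) {
        uint32_t room = 0x10000u - (phys & 0xFFFFu);  /* bytes to next 64 KiB line */
        uint32_t chunk = len < room ? len : room;

        if (n == max_entries)
            return 0;
        table[n].phys_base = phys;
        table[n].byte_count = (uint16_t)chunk;  /* 0x10000 truncates to 0 = 64 KiB */
        table[n].flags = 0;
        phys += chunk;
        len -= chunk;
        n++;
    }
    if (n > 0)
        table[n - 1].flags = PRD_EOT;
    return n;
}
```

Once the table is built, the driver would write its physical address to the controller and start the transfer; after that the CPU is not involved until the completion interrupt. That register programming is left out here.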

JamesM · Post by **JamesM** » Fri Sep 07, 2012 12:48 pm

Hi,

Firstly, DMA will always be faster than port I/O [reply to rdos: assuming you're transferring more than 16 bytes of data...]. Think about it - the DMA controller reads from the PCI bus then squirts that data to main memory - it is limited by the minimum of the speed of the PCI bus and the speed of the main memory bus.

For the CPU to perform the same task, it would have to take data from the PCI bus into a register, then squirt that up to main memory. That's two bus cycles, plus the latency of one "in" and one "out" instruction; each of which is around 3 cycles if I recall correctly.

You say while a "DMA operation is in progress" - what you actually mean is while the device is processing/seeking and then during the actual burst transfer of the result. During the time in which the device is processing, the CPU can do *anything*. When the DMA engine goes into busmastering mode to perform the memory transfer, the CPU cannot access main memory.

But that is unlikely to slow it down; around 99% of memory accesses are in some level of cache.

As to cache coherency, the cache coherent interconnect is usually hidden behind the northbridge, so CCI traffic doesn't go on the same bus that the RAM sits on (the CCI clock speed is likely higher than the FSB speed). Also remember that cache coherency traffic sits behind the L3 cache anyway.

rdos · Post by **rdos** » Fri Sep 07, 2012 1:38 pm

JamesM wrote:Firstly, DMA will always be faster than port I/O [reply to rdos: assuming you're transferring more than 16 bytes of data... ] . Think about it - the DMA controller reads from the PCI bus then squirts that data to main memory - it is limited by the minimum of the speed of the PCI bus and the speed of the main memory bus.

You forget that you always need to program DMA. In a paged system, this can involve costly operations like translating between linear and physical addresses, splitting longer transfers into multiple transfers because they span pages, and so on. These things do not come for free. So I would put the break-even point a little above 16 bytes.
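To make that setup cost concrete, here is a rough sketch of the per-page work involved, assuming 4 KiB pages and 32-bit addresses. The lookup callback `virt_to_phys_fn`, the toy mapping `demo_v2p` and the function `dma_split` are all hypothetical; a real kernel would walk its own page tables instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* One physically contiguous piece of a transfer. */
struct dma_chunk {
    uint32_t phys;
    uint32_t len;
};

/* Hypothetical linear-to-physical lookup supplied by the kernel. */
typedef uint32_t (*virt_to_phys_fn)(uint32_t linear);

/* Toy mapping for illustration only: swaps each even/odd pair of pages. */
uint32_t demo_v2p(uint32_t linear)
{
    uint32_t page = linear / PAGE_SIZE;
    return 0x100000u + (page ^ 1u) * PAGE_SIZE + linear % PAGE_SIZE;
}

/*
 * Split a linear buffer into physically contiguous chunks, one page at a
 * time, merging neighbours that happen to be physically adjacent.
 * Returns the number of chunks, or 0 if `out` is too small or len is 0.
 */
size_t dma_split(uint32_t linear, uint32_t len, virt_to_phys_fn v2p,
                 struct dma_chunk *out, size_t max_chunks)
{
    size_t n = 0;

    assert(v2p != NULL);
    while (len > 0) {
        uint32_t room = PAGE_SIZE - (linear & (PAGE_SIZE - 1));
        uint32_t piece = len < room ? len : room;
        uint32_t phys = v2p(linear);       /* one translation per page */

        if (n > 0 && out[n - 1].phys + out[n - 1].len == phys) {
            out[n - 1].len += piece;       /* physically adjacent: merge */
        } else {
            if (n == max_chunks)
                return 0;
            out[n].phys = phys;
            out[n].len = piece;
            n++;
        }
        linear += piece;
        len -= piece;
    }
    return n;
}
```

Every page costs a translation, and every discontinuity costs another descriptor, which is the overhead that has to be paid back before DMA wins over a short programmed I/O transfer.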

Owen · Post by **Owen** » Fri Sep 07, 2012 1:59 pm

Cache coherency is very much an architecture specific thing. On x86, you don't need to care about it (the processor does it for you). On other architectures, things differ.

For example, on ARM, you'll need to:

1. Do a data synchronization barrier to ensure that all memory operations have reached cache
2. If the DMA controller is to read memory, flush all cache lines which are to be read to the appropriate level of shareability
3. Do another barrier to ensure that the cache operations have completed
4. Initiate the DMA transaction
5. Invalidate all cache lines covering memory the DMA controller has just written, to the appropriate level of shareability
6. Do a third barrier

Most non-x86 architectures will have similar requirements.
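The ordering above can be sketched in C roughly as follows. This is not a real ARM API: the `dma_cache_ops` hooks are placeholders for whatever barrier and cache maintenance primitives the platform provides, the `wait_dma` step is an assumption added so that the invalidate happens after the device has finished, and the `stub_*` functions only record the call order so the sketch can run on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Platform hooks; every name here is a placeholder, not a real API. */
struct dma_cache_ops {
    void (*barrier)(void);                            /* e.g. DSB on ARM */
    void (*clean_range)(void *addr, size_t len);      /* write dirty lines back */
    void (*invalidate_range)(void *addr, size_t len); /* discard cached lines */
    void (*start_dma)(void);                          /* kick the device */
    void (*wait_dma)(void);                           /* block until it is done */
};

void dma_transfer_noncoherent(const struct dma_cache_ops *ops,
                              void *to_device, size_t to_len,
                              void *from_device, size_t from_len)
{
    assert(ops != NULL);

    ops->barrier();                                   /* 1. stores reached cache */
    if (to_len > 0)
        ops->clean_range(to_device, to_len);          /* 2. flush what DMA reads */
    ops->barrier();                                   /* 3. maintenance complete */
    ops->start_dma();                                 /* 4. initiate transaction */
    ops->wait_dma();                                  /*    (assumed completion wait) */
    if (from_len > 0)
        ops->invalidate_range(from_device, from_len); /* 5. drop stale lines */
    ops->barrier();                                   /* 6. third barrier */
}

/* Host-side stubs that only record the order of calls. */
static char trace[16];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof trace - 1)
        trace[trace_len++] = c;
}

static void stub_barrier(void) { note('B'); }
static void stub_clean(void *a, size_t n) { (void)a; (void)n; note('C'); }
static void stub_invalidate(void *a, size_t n) { (void)a; (void)n; note('I'); }
static void stub_start(void) { note('S'); }
static void stub_wait(void) { note('W'); }

const struct dma_cache_ops trace_ops = {
    stub_barrier, stub_clean, stub_invalidate, stub_start, stub_wait
};
```

On x86 none of these hooks would be needed for ordinary write-back memory, which is the point of the comparison.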

Also note that very few x86 CPUs will really give the bus master control of the memory bus. Expect the bus master reads/writes to go through at least the load/store buffers (because a 33MHz PCI transfer would otherwise far exceed the maximum cycle times of the main RAM).

bewing · Post by **bewing** » Fri Sep 07, 2012 2:15 pm

I'm not sure that JamesM stressed it enough. PCI busmastering is smart about accessing the memory bus. Additionally, you do need to pay attention to the fact that the CPU has several layers of cache between it and the actual memory bus. But the PCI busmaster waits until the memory bus is not being used before initiating a transfer. The whole point of cache is that the CPU accesses the memory bus as little as possible, which means that most of the time the memory bus is completely unused and open for busmastering. So most of the time, the actual busmastering transfer happens completely invisibly, during memory bus cycles that would otherwise have been "wasted".

On rare occasions, the CPU/cache may actually want to access the memory bus while a busmastering transfer is actually taking place -- this is what CPU hyperthreading is for ... to give the CPU something else to do during a long wait for a memory transfer. But the actual burst transfer of a few Kb of data from the device to the memory happens very quickly.

As JamesM said, once you initiate a "DMA transfer" from a device, the device will spend most of its time gathering a buffer of data to send. The actual memory transfer part of the DMA transaction happens quite fast.

Yes, as Owen said, for DMA there is sometimes a coherency issue.

But you do not seem to realize quite how the IO Port Bus works. It is an 8-bit bus, where the timing is controlled by the device on the other end of the transfer. "Getting a faster CPU chip" will not have the tiniest effect on the length of time it takes to transfer one byte over the IO Port bus. And transferring one byte will often take 100ns. Please calculate that out in instruction cycles, and compare with how much the CPU could do in that time if it wasn't halted waiting for a byte to transfer over the IO Port bus.
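The calculation asked for above can be written down as a small helper. The 100 ns per access is the figure from this post; the 3 GHz clock and the 512-byte sector used in the note below are assumptions picked only to have concrete numbers, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* CPU clock cycles that elapse during `ns` nanoseconds at `cpu_hz`. */
uint64_t cycles_in_ns(uint64_t cpu_hz, uint64_t ns)
{
    return cpu_hz * ns / 1000000000ull;
}

/* Cycles spent waiting when `bytes` are moved one byte per port access. */
uint64_t pio_wait_cycles(uint64_t cpu_hz, uint64_t ns_per_access,
                         uint64_t bytes)
{
    assert(ns_per_access > 0);
    return cycles_in_ns(cpu_hz, ns_per_access * bytes);
}
```

With those assumptions, `cycles_in_ns(3000000000, 100)` gives 300 cycles per access, and `pio_wait_cycles(3000000000, 100, 512)` gives 153600 cycles for one sector moved a byte at a time. Doubling the clock doubles the cycles lost, since the access time itself is set by the target device and bus.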

Owen · Post by **Owen** » Fri Sep 07, 2012 4:05 pm

bewing wrote:But you do not seem to realize quite how the IO Port Bus works. It is an 8-bit bus, where the timing is controlled by the device on the other end of the transfer. "Getting a faster CPU chip" will not have the tiniest effect on the length of time it takes to transfer one byte over the IO Port bus. And transferring one byte will often take 100ns. Please calculate that out in instruction cycles, and compare with how much the CPU could do in that time if it wasn't halted waiting for a byte to transfer over the IO Port bus.

There's no "8-bit IO port bus". Hell, there's no IO port bus.

On x86, IO port accesses are just accesses on the same I/O busses that connect to the rest of the system. An I/O access to a PCI-Express GPU (on a recent CPU with on-board PCI-Express controller) just heads directly from the core, to the on-chip PCI-Express controller, to the GPU.

Accesses to legacy devices (i.e. the "ISA" stuff) will travel over a PCI-Express connection to the "Southbridge" or "IO hub", where they will go through a bridge to the LPC (low pin count) bus, which is a 4-bit bus pretending to be the 16-bit ISA bus.

But you're right in that all accesses have a latency determined by their target bus: pretty much every bus has a way for a device to report "I'm not ready yet", and no processor changes will make the LPC bus go faster than 33MHz.

jacks · Post by **jacks** » Sat Sep 08, 2012 12:17 am

Thanks to all of you for clarifying things from different aspects. It helped me a lot!

Owen wrote: Also note that very few x86 CPUs will really give the bus master control of the memory bus. Expect the bus master reads/writes to go through at least the load/store buffers (because a 33MHz PCI transfer would otherwise far exceed the maximum cycle times of the main RAM)

Are these load/store buffers physically present in the MCH (memory controller hub)/northbridge, or somewhere else?

Suppose the CPU/cache wants to access the main memory bus while a PCI busmastering transfer is actually taking place. This transfer may last a few nanoseconds or microseconds, but the point is: during this time frame, no matter how short it is, what will happen on the x86 architecture?

Will the CPU stall until the current bus master transfer completes? Or

Will the CPU instruct the MCH/northbridge to stall the current bus master transfer, access main memory to fetch/send its data, and then instruct the northbridge to resume the bus master transfer? But in this case, what if the CPU accesses the main memory region that is involved in the stalled bus master transfer?

Or will something else happen?

bewing · Post by **bewing** » Sat Sep 08, 2012 5:08 pm

If CPU hyperthreading is turned on, the CPU will switch to the other hyperthread.

If not, the CPU will stall until the busmastering is complete.

Owen · Post by **Owen** » Sun Sep 09, 2012 6:22 am

jacks wrote:Thanks to all of you for clarifying things from different aspects. It helped me a lot!

Owen wrote: Also note that very few x86 CPUs will really give the bus master control of the memory bus. Expect the bus master reads/writes to go through at least the load/store buffers (because a 33MHz PCI transfer would otherwise far exceed the maximum cycle times of the main RAM)
Are these load/store buffers physically present in the MCH (memory controller hub)/northbridge, or somewhere else?

Suppose the CPU/cache wants to access the main memory bus while a PCI busmastering transfer is actually taking place. This transfer may last a few nanoseconds or microseconds, but the point is: during this time frame, no matter how short it is, what will happen on the x86 architecture?

Will the CPU stall until the current bus master transfer completes? Or

Will the CPU instruct the MCH/northbridge to stall the current bus master transfer, access main memory to fetch/send its data, and then instruct the northbridge to resume the bus master transfer? But in this case, what if the CPU accesses the main memory region that is involved in the stalled bus master transfer?

Or will something else happen?

Remember: Most I/O busses are *significantly* slower than main memory (you really have to get up to PCIe x8 or such before they reach similar orders of magnitude). Additionally, the memory bus does lots of discrete transactions (e.g. "Read the 32 byte cell at 0xabcd", where 0xabcd may not even be 32 byte aligned, but the transfer will wrap at the 32 byte boundary - this enables the CPU to fetch entire cache lines, but also get the bit it is stalled on first).
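As a rough illustration of the wrapping transfer described above, the helper below lists the order in which the pieces of a line would arrive, assuming a simple sequential wrap. The 8-byte beat size in the note below is an assumption (a 64-bit data path), `wrap_burst` is a made-up name, and the exact beat order on real hardware may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write the beat addresses of a wrapping burst into `out`.
 * `line` and `beat` are sizes in bytes, both powers of two, beat <= line.
 * `out` must have room for line / beat entries, which is the return value.
 */
size_t wrap_burst(uint32_t addr, uint32_t line, uint32_t beat, uint32_t *out)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    assert(beat != 0 && (beat & (beat - 1)) == 0 && beat <= line);

    uint32_t base = addr & ~(line - 1);                  /* start of the line */
    uint32_t first = (addr & (line - 1)) & ~(beat - 1);  /* offset of wanted beat */
    size_t n = line / beat;

    for (size_t i = 0; i < n; i++)
        out[i] = base + ((first + (uint32_t)i * beat) & (line - 1));
    return n;
}
```

For a 32-byte line, 8-byte beats and the address 0xabcd from the example, the beats come out as 0xabc8, 0xabd0, 0xabd8, 0xabc0: the piece containing the requested byte arrives first, and the rest of the line follows with wrap-around.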

On-chip memory busses aren't multimaster busses in the conventional sense. Forget about things like PCI which have up to 8 devices per bus. On-chip busses are "single master, multiple slave" or even "point to point" busses joined by matrices. Additionally, they're also packetized like the memory bus.

What this means is that requests to the memory controller get interleaved. Busmastering stalls nothing (except on legacy PCI, where it stalls accesses to/by other legacy PCI devices).

And that's a good thing: most memory chips allow having multiple "pages" open, the whole interface is deeply pipelined, and when multiple "banks" are present (whether in the form of multiple DIMMs per channel, or paged DRAM ICs) you can have transactions open with multiple DRAMs at once.
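A toy model may help picture what "interleaved" means for the CPU. It assumes one request is serviced per time slot and a strict alternation between the CPU and a single bus master, which is far simpler than any real memory controller (no banks, no pipelining, no priorities). The function name is made up.

```c
#include <assert.h>

/*
 * Toy arbiter: one request is serviced per time slot, alternating between
 * the CPU and a bus master whenever both have requests pending.
 * Returns the slot (1-based) in which the CPU's last request completes.
 */
unsigned cpu_finish_slot(unsigned cpu_reqs, unsigned dma_reqs)
{
    unsigned slot = 0;
    int cpu_turn = 1;

    assert(cpu_reqs > 0);
    while (cpu_reqs > 0) {
        slot++;
        if (dma_reqs == 0 || cpu_turn) {
            cpu_reqs--;          /* CPU request serviced */
            cpu_turn = 0;
        } else {
            dma_reqs--;          /* bus master request serviced */
            cpu_turn = 1;
        }
    }
    return slot;
}
```

In this model four CPU requests finish in slot 4 with no bus master traffic and in slot 7 with four competing bus master requests. Neither side ever waits for the other's whole transfer; the CPU just sees each access take a little longer.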

jacks · Post by **jacks** » Sun Sep 09, 2012 11:19 pm

Owen wrote:
Remember: Most I/O busses are *significantly* slower than main memory (you really have to get up to PCIe x8 or such before they reach similar orders of magnitude). Additionally, the memory bus does lots of discrete transactions (e.g. "Read the 32 byte cell at 0xabcd", where 0xabcd may not even be 32 byte aligned, but the transfer will wrap at the 32 byte boundary - this enables the CPU to fetch entire cache lines, but also get the bit it is stalled on first).

On-chip memory busses aren't multimaster busses in the conventional sense. Forget about things like PCI which have up to 8 devices per bus. On-chip busses are "single master, multiple slave" or even "point to point" busses joined by matrices. Additionally, they're also packetized like the memory bus.

What this means is that requests to the memory controller get interleaved. Busmastering stalls nothing (except on legacy PCI, where it stalls accesses to/by other legacy PCI devices).

And that's a good thing: most memory chips allow having multiple "pages" open, the whole interface is deeply pipelined, and when multiple "banks" are present (whether in the form of multiple DIMMs per channel, or paged DRAM ICs) you can have transactions open with multiple DRAMs at once.

So this means that, at the hardware level, nothing is stalled; accesses are either handled in parallel or held in a queue.

In other words, if the CPU needs to access main memory while a PCI bus master transfer is going on, it sends the access request to the MCH/northbridge as if main memory were free to respond. From the CPU's perspective, main memory can be accessed at any instant, whether or not a PCI bus master transfer is going on, i.e. the CPU knows nothing about the bus master transfer?

It is at the hardware level (i.e. the MCH/northbridge plus the memory chips on the DIMMs) that memory access requests are either serviced in parallel or stored in something like a queue, so that one access request does not interfere with another.

I suspect I haven't understood it fully. What did I miss?

Owen · Post by **Owen** » Mon Sep 10, 2012 5:54 am

No, you got that perfectly. The only effect it will have on the CPU is increased latency (because the accesses are now being interleaved) and decreased bandwidth (because the bus master is using some of the bandwidth).

jacks · Post by **jacks** » Tue Sep 11, 2012 8:13 am

Owen wrote: No, you got that perfectly. The only effect it will have on the CPU is increased latency (because the accesses are now being interleaved) and decreased bandwidth (because the bus master is using some of the bandwidth).

This means that, on the x86 architecture, cache coherency is taken care of completely by the MCH/northbridge, and the CPU is not even aware of it, except of course for responding to transactions initiated by the MCH/northbridge to maintain cache coherency.

Am I right?

Owen · Post by **Owen** » Tue Sep 11, 2012 9:36 am

Correct, provided the cache settings of all CPUs are set correctly (e.g. if one processor has a block of physical memory set as UC and another as WB, then bad things will happen), and noting some oddities surrounding the write-combining cache setting.
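A minimal sketch of the consistency condition, assuming the per-CPU memory type for one physical range has already been collected into an array (how it is read back from each CPU is left out). The numeric values are the x86 memory type encodings used by the MTRRs; `mem_type_consistent` and the `MT_*` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86 memory type encodings as used by the MTRRs. */
enum mem_type {
    MT_UC = 0x00,  /* uncacheable */
    MT_WC = 0x01,  /* write combining */
    MT_WT = 0x04,  /* write through */
    MT_WP = 0x05,  /* write protected */
    MT_WB = 0x06   /* write back */
};

/*
 * `types[i]` is the memory type CPU i uses for one physical range.
 * Returns 1 if every CPU agrees, 0 otherwise.
 */
int mem_type_consistent(const uint8_t *types, size_t ncpus)
{
    assert(types != NULL && ncpus > 0);
    for (size_t i = 1; i < ncpus; i++)
        if (types[i] != types[0])
            return 0;
    return 1;
}
```

A range that one CPU treats as `MT_UC` and another as `MT_WB` fails this check, which is the "bad things will happen" case.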

jacks · Post by **jacks** » Tue Sep 11, 2012 9:01 pm

Owen wrote: Correct, provided the cache settings of all CPUs are set correctly (e.g. if one processor has a block of physical memory set as UC and another as WB, then bad things will happen), and noting some oddities surrounding the write-combining cache setting.

Are these CPU cache settings set by the MCH/northbridge, or does the BIOS have something to do with them at power-on?

OSDev.org
