
IDE DMA

Posted: Tue May 26, 2015 3:37 am
by bigbob
Hi,

Maybe "ATA DMA" would be a better subject, I don't know.
I was using ATA PIO for a long time but it is really slow (e.g. copying wav-files), so I decided to try to implement DMA.
I have read:
http://wiki.osdev.org/ATA/ATAPI_using_DMA
http://forum.osdev.org/viewtopic.php?f=1&t=24394
and several documents, including the Intel IDE Controller specification, which says that bit 3 of the BusMasterCmd register should be set to 1 if we want to WRITE. It seems that the osdev wiki page is correct, though, because setting bit 3 means READ.

Reading sectors from the hard disk with DMA works well for me. On READ the interrupt always fires (once per 64 KiB buffer, since I have one entry in the PRDT), no matter whether I read 1, 128 or 69631 consecutive sectors. I tested it on real HW (Dell D820).
The problem is that on WRITE the interrupt doesn't fire if the sector count is smaller than 128 (as if bit 7 of the low byte of the sector count mattered). So, if I try to write 140 sectors with DMA, I should get two interrupts, because I have just one entry in the PRDT (i.e. a 64 KiB DMA buffer) and I call WRITE in a loop. Unfortunately I get only the first interrupt (with sectorcnt=128, 128*512 = 64 KiB); the second one (with sectorcnt=12) never comes.
This occurs on the D820.
With Bochs it works: both interrupts fire for 140 sectors.

Note that the contents of the buffer get written to disk correctly on real HW, even if the interrupt doesn't fire.
In other words, if I WRITE just one sector, there is no interrupt and the wait times out, but the data is written to disk.

I could check if the sectorcnt is smaller than 128, and have a 1ms delay instead of a "timeout waiting for the interrupt".
Knowing that Bochs always sends the interrupt, there must be a better solution.
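
The single-PRD loop described above can be sketched in C roughly as follows. This is only an illustration, assuming 512-byte sectors and one 64 KiB buffer; the names `prd_entry`, `prd_fill`, `next_chunk` and `count_commands` are hypothetical and not taken from the poster's code. In the Bus Master IDE PRD format a byte count of 0 encodes 64 KiB, and bit 15 of the last word marks the end of the table.

```c
#include <stdint.h>

#define SECTOR_SIZE        512u
#define MAX_CHUNK_SECTORS  128u      /* 128 * 512 = 64 KiB, one PRD entry */
#define PRD_EOT            0x8000u   /* end-of-table flag in the last word */

/* One Physical Region Descriptor as the bus master reads it (8 bytes). */
struct prd_entry {
    uint32_t phys_addr;   /* physical buffer address; region must not cross a 64 KiB boundary */
    uint16_t byte_count;  /* 0 means 65536 bytes */
    uint16_t flags;       /* bit 15 set on the final entry */
};

/* Fill a single-entry PRDT for `sectors` sectors (1..128). */
static void prd_fill(struct prd_entry *prd, uint32_t phys, uint32_t sectors)
{
    uint32_t bytes = sectors * SECTOR_SIZE;

    prd->phys_addr  = phys;
    prd->byte_count = (uint16_t)(bytes & 0xFFFFu);  /* 65536 wraps to 0 */
    prd->flags      = PRD_EOT;
}

/* Number of sectors to put in the next command. */
static uint32_t next_chunk(uint32_t remaining)
{
    return remaining > MAX_CHUNK_SECTORS ? MAX_CHUNK_SECTORS : remaining;
}

/* Number of commands needed for `total` sectors with one 64 KiB buffer. */
static uint32_t count_commands(uint32_t total)
{
    uint32_t n = 0;

    while (total > 0) {
        total -= next_chunk(total);
        n++;
    }
    return n;
}
```

For 140 sectors, `count_commands` gives two commands (128 sectors, then 12), which matches the two interrupts expected above.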

Re: IDE DMA

Posted: Tue May 26, 2015 8:09 am
by bigbob
Additional info.

Reading 0x0FFF sectors with PIO takes 12s, with DMA 14s (without the last interrupt and with timeout: 16s).
So, DMA is really for multitasking, but as far as I know DMA should be faster than this.
At first I thought that UDMA was not enabled, but I checked the UDMAC-register (UltraDMA Control Register, offset 48h in PCI-space) and it was 0x05, so PrimaryDrive0 and SecondaryDrive0 are enabled. I set the register to 0x0F anyway.
Maybe getting UDMA to work requires doing something else too. I am still investigating.

This is way too slow.

Re: IDE DMA

Posted: Tue May 26, 2015 3:46 pm
by Brendan
Hi,
bigbob wrote:Reading 0x0FFF sectors with PIO takes 12s, with DMA 14s (without the last interrupt and with timeout: 16s).
So, DMA is really for multitasking, but as far as I know DMA should be faster than this.
At first I thought that UDMA was not enabled, but I checked the UDMAC-register (UltraDMA Control Register, offset 48h in PCI-space) and it was 0x05, so PrimaryDrive0 and SecondaryDrive0 are enabled. I set the register to 0x0F anyway.
Maybe getting UDMA to work requires doing something else too. I am still investigating.
There's 3 different times:
  • How quickly the drive can seek to the first sector
  • How quickly data can be transferred between the drive's internal buffer and the disk media
  • How quickly data can be transferred between RAM and the drive's internal buffer
UDMA and PIO only affect how quickly data can be transferred between RAM and the drive's internal buffer. The other 2 things depend on the disk drive.

The only thing I could find out about the disk drive is that Dell D820 comes with several options, ranging from "60GB hard disk @ 7200RPM" up to about 120 GiB (and that it's a mobile drive). I'd expect seek times to be irrelevant (e.g. less than 20 ms) and over 50 MB/s for transferring data between the drive's internal buffer and the disk media. At 12 seconds for 0xFFF sectors it's extremely unlikely that either of these is the bottleneck.

There's 8 different UDMA modes with different speeds ranging from 16.7 MB/s to 167 MB/s. There's 7 different PIO modes where speed depends on software, but max. speed ranges from 3.3 MB/s to 25 MB/s.

You don't say which PIO mode you're using or which UDMA mode you're using. Assuming 512 byte sectors, 12 seconds to transfer 0x0FFF sectors works out to 0.167 MB/s. This is far slower than the slowest PIO or UDMA mode.
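
The throughput figure can be checked with a few lines of C. This is just a sketch of the arithmetic, assuming 512-byte sectors; the function names are made up for the example.

```c
#include <stdint.h>

/* Bytes moved when transferring `sectors` sectors of `sector_size` bytes. */
static uint64_t transfer_bytes(uint32_t sectors, uint32_t sector_size)
{
    return (uint64_t)sectors * sector_size;
}

/* Average throughput in MiB/s (1 MiB = 1024 * 1024 bytes). */
static double throughput_mib_s(uint32_t sectors, uint32_t sector_size,
                               double seconds)
{
    double mib = (double)transfer_bytes(sectors, sector_size) / (1024.0 * 1024.0);

    return mib / seconds;
}
```

`throughput_mib_s(0x0FFF, 512, 12.0)` comes out at roughly 0.167 MiB/s (about 0.175 MB/s in decimal units), far below even the slowest PIO mode.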

Based on all of the above, I'd assume there's something severely wrong with both your PIO and UDMA code. For example, maybe you didn't ask the disk drive which modes it supports and you're using PIO/UDMA modes that aren't supported, and it's slow because it's trying to work in a situation that shouldn't work at all (e.g. a massive amount of transmission errors and retries happening). Also, it's possible that other problems (e.g. not getting an IRQ after a successful write in UDMA mode) might be symptoms of the same underlying problem.
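
As a rough illustration of "asking the drive which modes it supports", here is a C sketch that decodes the relevant words of the 256-word IDENTIFY DEVICE data. It assumes the buffer has already been read from the drive; the helper names are hypothetical. Word 88 holds the supported UDMA modes in bits 0..6 and the selected mode in bits 8..14 (valid when word 53 bit 2 is set), and word 64 bits 0 and 1 advertise PIO modes 3 and 4 (valid when word 53 bit 1 is set).

```c
#include <stdint.h>

/* Index of the highest set bit among bits 0..6, or -1 if none is set. */
static int highest_mode(uint16_t bits)
{
    int mode = -1;

    for (int i = 0; i < 7; i++)
        if (bits & (1u << i))
            mode = i;
    return mode;
}

/* Highest UDMA mode the drive supports, or -1 if word 88 is not valid. */
static int udma_supported_max(const uint16_t *id)
{
    if (!(id[53] & (1u << 2)))          /* word 53 bit 2: word 88 is valid */
        return -1;
    return highest_mode(id[88] & 0x7Fu);
}

/* Currently selected UDMA mode, or -1 if none is selected or word 88 is invalid. */
static int udma_selected(const uint16_t *id)
{
    if (!(id[53] & (1u << 2)))
        return -1;
    return highest_mode((id[88] >> 8) & 0x7Fu);
}

/* Highest advertised PIO mode; falls back to 2 when word 64 says nothing
 * (an assumption made for this sketch, older drives report it elsewhere). */
static int pio_supported_max(const uint16_t *id)
{
    if (!(id[53] & (1u << 1)))          /* word 53 bit 1: words 64..70 valid */
        return 2;
    if (id[64] & 0x02u)
        return 4;
    if (id[64] & 0x01u)
        return 3;
    return 2;
}
```

Comparing `udma_selected` against `udma_supported_max` shows whether the BIOS left the drive in a sensible mode.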


Cheers,

Brendan

Re: IDE DMA

Posted: Wed May 27, 2015 12:39 am
by bigbob
Brendan wrote: You don't say which PIO mode you're using or which UDMA mode you're using. Assuming 512 byte sectors, 12 seconds to transfer 0x0FFF sectors works out to 0.167 MB/s. This is far slower than the slowest PIO or UDMA mode.

Based on all of the above, I'd assume there's something severely wrong with both your PIO and UDMA code. For example, maybe you didn't ask the disk drive which modes it supports and you're using PIO/UDMA modes that aren't supported, and it's slow because it's trying to work in a situation that shouldn't work at all (e.g. a massive amount of transmission errors and retries happening). Also, it's possible that other problems (e.g. not getting an IRQ after a successful write in UDMA mode) might be symptoms of the same underlying problem.
Thanks for your help, Brendan.
DMA: I must admit that I use the values set by the BIOS, since I read that it sets up DMA correctly most of the time.
I will check which modes are set by the BIOS and will also check my algorithm.
Maybe having just one PRDT entry is not enough: I set up the PRDT, send the sector count and LBA value to the disk with PIO, then send the READExt/WriteExt command to the disk with PIO. Next, I wait for the interrupt. If I have more than 128 sectors to read/write (since I have only one 64 KiB DMA buffer), I do these steps in a loop.
On the other hand, having several PRDT-entries would just eliminate sending the PIO-commands (sectorcount, LBA, ReadExt/WriteExt) several times.
I would have to handle the interrupt after every PRDT-entry.

PIO: I implemented PIO according to the wiki and comments on osdev, but I haven't checked what PIO-mode is set.
I will do that too. And also, there can be problems with my implementation.

Re: IDE DMA

Posted: Wed May 27, 2015 2:18 am
by Brendan
Hi,
bigbob wrote:
Brendan wrote: You don't say which PIO mode you're using or which UDMA mode you're using. Assuming 512 byte sectors, 12 seconds to transfer 0x0FFF sectors works out to 0.167 MB/s. This is far slower than the slowest PIO or UDMA mode.

Based on all of the above, I'd assume there's something severely wrong with both your PIO and UDMA code. For example, maybe you didn't ask the disk drive which modes it supports and you're using PIO/UDMA modes that aren't supported, and it's slow because it's trying to work in a situation that shouldn't work at all (e.g. a massive amount of transmission errors and retries happening). Also, it's possible that other problems (e.g. not getting an IRQ after a successful write in UDMA mode) might be symptoms of the same underlying problem.
Thanks for your help, Brendan.
DMA: I must admit that I use the values set by the BIOS, since I read that it sets up DMA correctly most of the time.
I will check which modes are set by the BIOS and will also check my algorithm.
Even if the BIOS didn't set the UDMA mode and only used the slowest possible UDMA mode, it still wouldn't be that slow.
bigbob wrote:Maybe having just one PRDT entry is not enough: I set up the PRDT, send the sector count and LBA value to the disk with PIO, then send the READExt/WriteExt command to the disk with PIO. Next, I wait for the interrupt. If I have more than 128 sectors to read/write (since I have only one 64 KiB DMA buffer), I do these steps in a loop.
On the other hand, having several PRDT-entries would just eliminate sending the PIO-commands (sectorcount, LBA, ReadExt/WriteExt) several times.
I would have to handle the interrupt after every PRDT-entry.
The worst case here is that the next sector moves under the disk heads while you're setting up the next transfer, causing a "nearly full disk rotation" delay before the next transfer begins. At 7200 RPM a disk rotation takes about 8.33 ms. Splitting a 2 MiB transfer into 64 KiB pieces would cost 2048/64 * 8.33 ms, or about 267 ms extra. That is small next to the 12 seconds measured, so it's very unlikely that this is the cause of the performance problem.
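
As a rough check of the rotational-delay argument, the arithmetic can be written out in C. This is a sketch only, assuming the worst case of one lost rotation per chunk; the function names are hypothetical.

```c
/* Time for one full platter rotation, in milliseconds. */
static double rotation_ms(double rpm)
{
    return 60000.0 / rpm;
}

/* Worst-case extra delay in milliseconds when a transfer of `total_kib`
 * is split into `chunk_kib` pieces and every piece misses one rotation. */
static double worst_case_extra_ms(unsigned total_kib, unsigned chunk_kib,
                                  double rpm)
{
    unsigned chunks = (total_kib + chunk_kib - 1) / chunk_kib;  /* round up */

    return chunks * rotation_ms(rpm);
}
```

At 7200 RPM this gives about 8.33 ms per rotation and about 267 ms for the 32 chunks of a 2 MiB transfer, which does not come close to explaining a 12 second read.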
bigbob wrote:PIO: I implemented PIO according to the wiki and comments on osdev, but I haven't checked what PIO-mode is set.
I will do that too. And also, there can be problems with my implementation.
I think (not entirely sure) that "rep insw" is PIO modes 3 and 4, and "rep insd" is PIO modes 5 and 6. In any case, I'd expect your transfer to be around 10 times faster than you're currently getting with PIO.

Mostly; I'm wondering if the performance problem is caused by something completely unrelated; like leaving IRQs disabled for long periods of time, or a massive number of faulty sectors (causing the hard drive to fetch from alternative/replacement sector pool), or having an "IRQ flood" going on, or having CPU caches disabled, or...(!)

Note: It might be fun to test how quickly Windows or Linux (or maybe even BIOS) can read 2 MiB from the disk; just to get a much better idea of what sort of speeds you should be expecting from the drive.


Cheers,

Brendan

Re: IDE DMA

Posted: Wed May 27, 2015 3:13 am
by bigbob
Hi,
Brendan wrote:
Even if the BIOS didn't set the UDMA mode and only used the slowest possible UDMA mode, it still wouldn't be that slow.
The worst case here is that the next sector moves under the disk heads while you're setting up the next transfer, causing a "nearly full disk rotation" delay before the next transfer begins. At 7200 RPM a disk rotation takes about 8.33 ms. Splitting a 2 MiB transfer into 64 KiB pieces would cost 2048/64 * 8.33 ms, or about 267 ms extra. That is small next to the 12 seconds measured, so it's very unlikely that this is the cause of the performance problem.
Yes, something else is the problem here. I use the functions ata_polling and ata_polling2 (see their explanations below, where I comment on PIO), and the timeout in ata_polling could be the reason for the long read time.
Brendan wrote: I think (not entirely sure) that "rep insw" is PIO modes 3 and 4, and "rep insd" is PIO modes 5 and 6. In any case, I'd expect your transfer to be around 10 times faster than you're currently getting with PIO.
I read this article:
http://wiki.osdev.org/ATA_PIO_Mode
but I didn't implement it literally. I think this is a new article on PIO, and I implemented my code two or three years ago according to the previous article, but I can be wrong.

I use "rep insw" and call ata_polling (see its code below) after reading the data of a sector.
I call ata_polling2 after the last block of data.
I suspect that ata_polling is not correct. There is a 500ms timeout involved, and that can cause the 12s read-time in case of 0x0FFF sectors.
Currently I am rewriting the code of my pio48_read according to the osdev-wiki article.
Brendan wrote: Mostly; I'm wondering if the performance problem is caused by something completely unrelated; like leaving IRQs disabled for long periods of time, or a massive number of faulty sectors (causing the hard drive to fetch from alternative/replacement sector pool), or having an "IRQ flood" going on, or having CPU caches disabled, or...(!)
I don't know. It's possible, if rewriting my code according to that article doesn't help.
I will check the result of the IDENTIFY command too. I have a HDINFO command in my OS (it uses the result of IDENTIFY), its output is:
Serial number: MPCDN7Y4HGWT2L
Firmware version: MC40C10H
Model number: Hitachi HTS721080G9SA00
Supported: LBA LBA48 DMA
MaxLBA28: 156301488
MaxLBA48: 156301488
Max number of logical sectors per r/w multiple cmds: 16
Capacity: 74 Gb
Primary bus, master

Maybe the code that computes the size is still not accurate enough (80Gb).
Brendan wrote: Note: It might be fun to test how quickly Windows or Linux (or maybe even BIOS) can read 2 MiB from the disk; just to get a much better idea of what sort of speeds you should be expecting from the drive.
There was Debian Linux on the D820 and the file transfers were OK (e.g. a 30 MB wav-file was copied from USB pretty fast).
I know that 0x0FFF sectors is about 2 MB, but I can't test it now, because I completely ruined Linux by writing sectors to the disk. :)

There are some extra delays in the code below: instead of 400ns, it waits 2ms.

Code:

%define ATA_TIMEOUT_VAL   500

ata_polling:
			push ebx
			push edx
			xor eax, eax
			; here wait at least 400ns for BSY to be set
			mov ebx, 2
			call pit_delay
			mov DWORD [pit_ticks2], 0
			mov dx, [ata_port_base]
			add dx, ATA_PORT_STATUS
.Poll		in al, dx
			test al, ATA_ST_BSY				; wait for BUSY bit to clear
			jz	.Check
			cmp DWORD [pit_ticks2], ATA_TIMEOUT_VAL
			jna	.Poll
			mov al, ATA_TIMEOUT
			jmp .Back
.Check		in al, dx						; read STATUS
			test al, ATA_ST_ERR				; test for error
			jz	.ChkDF
			mov al, ATA_ERR
			jmp .Back
.ChkDF		test al, ATA_ST_DF				; test for device fault
			jz	.ChkDRQ
			mov al, ATA_ERR					; we could have different error ids
			jmp .Back
.ChkDRQ		test al, ATA_ST_DRQ				; test for DRQ, it should be set
			jnz	.Ok
			mov al, ATA_ERR					; we could have different error ids
			jmp .Back
.Ok			mov al, ATA_OK
.Back		pop edx
			pop ebx
			ret


ata_polling2:
			push ebx
			push edx
			xor eax, eax
			; here wait at least 400ns for BSY to be set
			mov ebx, 2
			call pit_delay
			mov DWORD [pit_ticks2], 0
			mov dx, [ata_port_base]
			add dx, ATA_PORT_STATUS
.Poll		in al, dx
			test al, ATA_ST_BSY				; wait for BUSY bit to clear
			jz	.Ok
			cmp DWORD [pit_ticks2], ATA_TIMEOUT_VAL
			jna	.Poll
			mov al, ATA_TIMEOUT
			jmp .Back
.Ok			mov al, ATA_OK
.Back		pop edx
			pop ebx
			ret


Re: IDE DMA

Posted: Wed May 27, 2015 3:31 am
by Octocontrabass
bigbob wrote:Maybe the code that computes the size is still not accurate enough (80Gb).
80GB is 74GiB. The numbers don't match because you are confusing GB and GiB.

Re: IDE DMA

Posted: Wed May 27, 2015 4:19 am
by Coomer69
bigbob wrote:There are some extra delays in the code below: instead of 400ns, it waits 2ms.
Does the code wait 2ms each sector?

Re: IDE DMA

Posted: Wed May 27, 2015 4:19 am
by bigbob
Octocontrabass wrote:
bigbob wrote:Maybe the code that computes the size is still not accurate enough (80Gb).
80GB is 74GiB. The numbers don't match because you are confusing GB and GiB.
Of course, you are right. I forgot about that.

By the way, I rewrote the polling according to that article, and now with PIO it takes about 1 second (it is difficult to measure) to read 0x0FFF sectors.
I am still testing it, but the data from the disk looks correct in RAM.
That 2ms delay during every polling was too much.
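
To put a number on it and show the usual alternative, here is a C sketch. It is not the rewritten code from this post. It assumes the common trick of reading the status register four times to get roughly 400 ns of settle time, and it takes a hypothetical `status_reader` callback so the logic can be tried off-hardware; on a real machine the callback would wrap an `in` from the (alternate) status port. `fake_dev` and `fake_status` exist only for that simulation.

```c
#include <stdint.h>

#define ATA_ST_ERR  0x01u
#define ATA_ST_DRQ  0x08u
#define ATA_ST_DF   0x20u
#define ATA_ST_BSY  0x80u

enum ata_result { ATA_OK = 0, ATA_ERR = 1, ATA_TIMEOUT = 2 };

/* Hypothetical callback that returns the current status register value. */
typedef uint8_t (*status_reader)(void *ctx);

/* Total time spent in a fixed per-sector delay, in milliseconds. */
static uint32_t delay_overhead_ms(uint32_t sectors, uint32_t delay_ms)
{
    return sectors * delay_ms;
}

/* Wait for BSY to clear, then check ERR/DF and optionally DRQ. */
static enum ata_result ata_poll(status_reader rd, void *ctx,
                                uint32_t max_polls, int want_drq)
{
    uint8_t st;
    uint32_t polls = 0;

    /* Settle time: four throwaway status reads instead of a 2 ms timer wait. */
    for (int i = 0; i < 4; i++)
        (void)rd(ctx);

    /* Spin until BSY clears, bounded by a poll count. */
    do {
        st = rd(ctx);
        if (!(st & ATA_ST_BSY))
            break;
    } while (++polls < max_polls);

    if (st & ATA_ST_BSY)
        return ATA_TIMEOUT;
    if (st & (ATA_ST_ERR | ATA_ST_DF))
        return ATA_ERR;
    if (want_drq && !(st & ATA_ST_DRQ))
        return ATA_ERR;
    return ATA_OK;
}

/* Simulated drive: reports BSY for `busy_reads` reads, then `final`. */
struct fake_dev {
    uint32_t busy_reads;
    uint8_t  final;
};

static uint8_t fake_status(void *ctx)
{
    struct fake_dev *d = ctx;

    if (d->busy_reads > 0) {
        d->busy_reads--;
        return ATA_ST_BSY;
    }
    return d->final;
}
```

`delay_overhead_ms(0x0FFF, 2)` is 8190, so the fixed 2 ms wait alone accounts for about 8.2 of the 12 seconds. Passing `want_drq = 1` corresponds to ata_polling and `want_drq = 0` to ata_polling2.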

That will also speed up DMA, but first I'll concentrate on PIO.
Hopefully that last interrupt with DMA will be fixed somehow.

EDIT: @SapphireBeauty: yes, that seems to be the problem.

Re: IDE DMA

Posted: Wed May 27, 2015 4:19 am
by Schol-R-LEA
Octocontrabass wrote:
bigbob wrote:Maybe the code that computes the size is still not accurate enough (80Gb).
80GB is 74GiB. The numbers don't match because you are confusing GB and GiB.
For those unfamiliar with this, GiB (gibibytes) is for byte counts expressed as powers of 2. For example, while one gigabyte is (strictly defined) 1,000,000,000 bytes, one gibibyte is 1,073,741,824 bytes (2^30). Hence, the same capacity always comes out as a smaller number in GiB than in GB.

Mind you, few people bother with this; gigabytes are loosely equated with GiB in most usage. However, when you actually need to express exact values, as is often the case in OS dev, the distinction becomes quite important.
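
Applied to the drive from the HDINFO output earlier in the thread (MaxLBA48 of 156301488 sectors), the two units can be computed side by side. A small C sketch, assuming 512-byte sectors; the function names are made up for the example.

```c
#include <stdint.h>

/* Capacity in bytes for a given sector count and sector size. */
static uint64_t capacity_bytes(uint64_t sectors, uint32_t sector_size)
{
    return sectors * sector_size;
}

/* Whole gigabytes (10^9 bytes), rounded down. */
static uint64_t to_gb(uint64_t bytes)
{
    return bytes / 1000000000ull;
}

/* Whole gibibytes (2^30 bytes), rounded down. */
static uint64_t to_gib(uint64_t bytes)
{
    return bytes >> 30;
}
```

For 156301488 sectors this is 80,026,361,856 bytes, i.e. 80 GB or 74 GiB, so the size computation was right and only the unit label was off.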