need opinions on what to do...

Posted: Mon Jan 05, 2009 6:19 pm
by 01000101
I recently discovered that my OS (DiNS) exposes a very nasty bug in at least two Intel PRO/1000 XX cards when the host OS uses full RX and TX checksum offloading (the NIC calculates the checksum instead of the host CPU). My hunch is that, because my OS drops corrupted (bad-checksum) packets instead of recalculating the checksum and re-sending them, it somewhat cripples Intel PRO/1000 checksum offloading: roughly 10% of the packets sent from the host had incorrect TCP checksums. I confirmed this by watching outgoing packets in Wireshark for checksum errors, and sure enough, a fair number had incorrect TCP checksums and would later be dropped by my OS when they tried to pass through. [edit]This was doubly confirmed when the problem disappeared after I disabled checksum offloading in the host OS.[/edit]

Should my OS compensate for the incorrect checksums produced by the offloading by recalculating the bad ones and sending the packets on their way, or should it leave the bug fully exposed and carry on as it does now? Or, option 3, should it go into extreme pass-through mode for those packets, letting them continue on their route with the bad checksum so that another device acts on them?

Re: need opinions on what to do...

Posted: Mon Jan 05, 2009 6:25 pm
by samoz
I'm not a huge network buff, so I probably won't be able to help much in that department, but a good rule from UNIX programming is that when things break, they should fail loudly.

In your case, maybe it would be possible to recalculate the failed checksums on the CPU and then take action, depending on whether or not the checksums actually are failures. This sounds like the safest course of action to me.

If you just passed on bad checksum packets, what would be the point of checksumming in the first place?
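
For reference, recalculating a checksum on the CPU could look roughly like the sketch below. It assumes the standard 16-bit one's-complement Internet checksum (RFC 1071) that IP, TCP and UDP use; the function names are made up for illustration and are not taken from DiNS.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def checksum_ok(data_with_checksum: bytes) -> bool:
    """Data that already contains a correct checksum sums to 0xFFFF,
    so recomputing over it yields 0."""
    return internet_checksum(data_with_checksum) == 0


words = bytes.fromhex("0001f203f4f5f6f7")  # example words from RFC 1071
print(hex(internet_checksum(words)))       # 0x220d
```

Only packets for which checksum_ok() returns False would then need any further decision (drop, fix or pass on).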

Re: need opinions on what to do...

Posted: Tue Jan 06, 2009 5:17 am
by JackScott
Can you put an option in the interface and let the user decide? If not, fail loudly, as samoz said.
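
A minimal sketch of such an option, assuming the three behaviours discussed in this thread (drop, recalculate, pass through). The enum, the logger name and the recalculate callback are hypothetical and only illustrate the idea of a user-selectable policy that still fails loudly by logging every bad packet.

```python
import logging
from enum import Enum
from typing import Callable, Optional

log = logging.getLogger("gateway")  # hypothetical logger name


class BadChecksumPolicy(Enum):
    DROP = "drop"                  # current behaviour: discard the packet
    RECALCULATE = "recalculate"    # fix the checksum and forward
    PASS_THROUGH = "pass-through"  # forward untouched, let another device decide


def handle_bad_checksum(packet: bytes,
                        policy: BadChecksumPolicy,
                        recalculate: Callable[[bytes], bytes]) -> Optional[bytes]:
    """Return the packet to forward, or None if it should be dropped."""
    # Fail loudly: every bad checksum is logged regardless of the policy.
    log.warning("bad checksum on %d-byte packet, policy=%s",
                len(packet), policy.value)
    if policy is BadChecksumPolicy.DROP:
        return None
    if policy is BadChecksumPolicy.RECALCULATE:
        return recalculate(packet)
    return packet
```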

Re: need opinions on what to do...

Posted: Tue Jan 06, 2009 5:32 am
by LoseThos
The scary thing is if it's a byte checksum, one out of 256 bad packets will go through. If you have noisy communication, I think you should reduce speed or something. Software is not able to handle too much noise. I once used an RS232 link and tried to push it as fast as possible. At first, I thought software could catch errors. Eventually, I learned software can't catch all issues and faulty communication sucks when sending files and stuff! So, I went with a lower baud rate.

Re: need opinions on what to do...

Posted: Tue Jan 06, 2009 7:08 am
by Combuster
Someone needs to get his facts straight.

Ethernet has a checksum of its own. The fact that a frame gets across the link means that at least one checksum (over a larger area!) is correct. The probability that corruption leaves the Ethernet checksum correct but the TCP checksum wrong is nowhere near 10%; it is somewhere in the digits behind the decimal point.

It also means that whatever is sending the data likely has a broken TCP implementation. And I wonder what OS that is.

Edit: +1 for the suggestion to make it a configuration option

Re: need opinions on what to do...

Posted: Tue Jan 06, 2009 7:26 pm
by 01000101
Combuster wrote: Someone needs to get his facts straight.

Ethernet has a checksum of its own. The fact that a frame gets across the link means that at least one checksum (over a larger area!) is correct. The probability that corruption leaves the Ethernet checksum correct but the TCP checksum wrong is nowhere near 10%; it is somewhere in the digits behind the decimal point.

It also means that whatever is sending the data likely has a broken TCP implementation. And I wonder what OS that is.

Edit: +1 for the suggestion to make it a configuration option
lol -1 for the insinuation that my OS is at fault for the TCP checksum error.

With or without my OS in the path, the error is still there, and when I turn off checksum offloading in the host OS (Windows or Ubuntu)... it magically goes away. =)

I know that the Ethernet layer has a CRC check on the *entire* frame to ensure sanity, but if the host OS miscalculated, or the offloading failed for a bit, then the frame will seem sane to the Ethernet layer, while upper layers will see the bad IP/TCP/UDP checksum and drop the packet once it is received.
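
The point above can be illustrated with a small sketch. It relies on zlib.crc32, which uses the same CRC-32 polynomial as the Ethernet frame check sequence; the helper names and the byte order in which the FCS is appended are simplifications for illustration, not a model of real hardware. A frame whose payload already carries a wrong TCP checksum when the FCS is computed still passes the Ethernet check, whereas corruption on the wire does not.

```python
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 over the whole frame, as the sending NIC would."""
    return frame + zlib.crc32(frame).to_bytes(4, "little")


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC-32 over the body and compare with the trailer."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return zlib.crc32(body).to_bytes(4, "little") == fcs


# The TCP checksum inside this payload was wrong *before* the FCS was added,
# so the Ethernet layer has nothing to complain about.
frame = b"headers + segment with a miscalculated TCP checksum"
on_wire = append_fcs(frame)
print(fcs_ok(on_wire))     # True

# A bit flipped in transit, by contrast, is caught by the FCS.
corrupted = bytes([on_wire[0] ^ 0x01]) + on_wire[1:]
print(fcs_ok(corrupted))   # False
```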

I'm adding the user-configuration for this as we speak (thanks JackScott), and have already implemented checksum calculations for IP/TCP/UDP and will replace bad ones (or null ones) if necessary (and the user has that on).
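
For readers wondering what "replace bad ones" involves for TCP, here is a rough sketch, assuming IPv4 and the usual pseudo-header (source address, destination address, zero byte, protocol 6, TCP length). It is not the DiNS code; the function names are hypothetical, and the segment is assumed to be a complete TCP header plus payload with the checksum field at bytes 16-17.

```python
import struct


def ones_complement_sum16(data: bytes) -> int:
    """Folded 16-bit one's-complement sum over big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum over IPv4 pseudo-header + segment with its checksum field zeroed."""
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 6, len(segment))
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    return ~ones_complement_sum16(pseudo + zeroed) & 0xFFFF


def repair_tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bytes:
    """Return the segment unchanged if its checksum is right, else a fixed copy."""
    good = tcp_checksum(src_ip, dst_ip, segment)
    if int.from_bytes(segment[16:18], "big") == good:
        return segment
    return segment[:16] + good.to_bytes(2, "big") + segment[18:]
```

UDP over IPv4 needs slightly different handling, since a zero checksum field there means the sender did not compute one, which is presumably what the "null ones" refers to.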

Re: need opinions on what to do...

Posted: Wed Jan 07, 2009 3:16 am
by jal
01000101 wrote: I'm adding the user-configuration for this as we speak (thanks JackScott), and have already implemented checksum calculations for IP/TCP/UDP and will replace bad ones (or null ones) if necessary (and the user has that on).
So is it just that specific type of card that is causing the error? Just for my information: did you find any reference to that error on the net? Usually Intel will publish bugs.


JAL

Re: need opinions on what to do...

Posted: Wed Jan 07, 2009 1:46 pm
by 01000101
I found quite a few links to similar issues with the entire family of (checksum offloading capable) cards. All of the actual solutions I found were to turn off the offloading.

I don't think it's an Intel problem; I think it's an implementor issue. But I don't have the issue anymore, even with the offloading on and my OS acting as a transparent gateway.

It wouldn't surprise me if there are a lot of bugs in the drivers for these cards, as they are both new cards and thus have very new drivers. And the code around the checksum offloading is no walk in the park either. =)