ECC and RAM faults (was "What do you hate...")
Posted: Mon Apr 09, 2012 9:28 pm
Hi,
[mini-rant]
In theory there are catastrophic failures (e.g. where RAM stops functioning at all). ECC won't help with these. I've never seen them happen though, and if they do happen you're only looking at losing whatever was in memory at the time - no worse than a triple fault caused by a software bug; and the firmware's useless little memory test actually would find the problem. Basically catastrophic failures aren't really a problem at all.
Rudster816 wrote:
Brendan wrote:
Basically, I won't trust that Windows machine for anything important because it doesn't have ECC RAM.
ECC RAM can still fail, and I wouldn't be the least bit surprised if its failure rates were not much better than non-ECC RAM.
Most RAM failures are a sticky bit somewhere, an intermittent but recurring fault, or a random error (e.g. cosmic ray). ECC will correct almost all of these faults; and if an area of RAM is so messed up that ECC can't correct the errors it will report a problem and won't let you think it's working reliably when it isn't. There is a very small chance (probably about 1 chance in 4 billion) that an error can occur and not be corrected or detected (but that is many orders of magnitude better than "zero chance of errors being corrected or detected").
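To make "correct almost all, detect most of the rest" concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code. It is an assumption-laden illustration, not what any particular memory controller does: it protects only 4 data bits with an extended Hamming code, whereas real ECC memory typically protects much wider words, and the names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit; indexes 1..7 are the usual
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    parity_broken = sum(code) & 1           # 1 if an odd number of bits flipped

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Treated as a single flipped bit: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        # This can be reported but not repaired.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` decodes back to `n` with status `corrected`, and flipping any two bits gives `uncorrectable`. Three or more flipped bits can be silently mis-corrected in this toy code, which is the same kind of rare "not corrected or detected" case mentioned above.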
The problem with normal RAM is that it can be faulty and nobody will know. The first time I got burnt everything worked fine, except Internet Explorer would crash occasionally. Of course back then, even with perfectly good hardware, Internet Explorer would crash occasionally. There wasn't any reason to suspect faulty RAM (and plenty of reason to suspect IE). Then one day I ran defrag. Every file was loaded into RAM then written back to a different place on the hard drive. About 30% of the files were corrupted due to the faulty RAM (that I had no reason to suspect was faulty). Trying to figure out which files were OK and which files were trashed was too much of a nightmare - after about an hour of trying I realised it was futile and gave up (wiped everything). In that case the BIOS's memory test said everything was fine. It took memtest about 4 hours to detect anything was wrong.
The second time (a completely different computer, many years later, this time running Linux) the only symptom was that GCC would occasionally crash. I didn't know if it was a software problem or not. I wrote my own little memory tester - 4 processes that each allocated 1.5 GiB of RAM and constantly pounded it. Eventually (after leaving it running for ages, and occasionally stopping/restarting processes to get them to test hopefully different areas of RAM) my little tester found/reported a problem. After that I booted memtest86 - first time memtest86 found the problem after about 6 hours of running. Second time it found the problem after about 10 hours of running. As far as I can tell the problem was actually caused by dust - the computer was running 24 hours a day with no air filters, and cleaning the dust out made the problem seem to go away. It was a Core2 quad-core with 8 GiB of RAM, and was less than 12 months old at the time. I haven't used that computer since (except for testing my OS).
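For anyone wondering what a "little memory tester" like that might look like, here is a rough sketch of the general idea. It is not the tester I actually used: the pattern set, the function names (`fill`, `find_mismatches`, `pound`) and the sizes are assumptions made up for illustration. It fills a big buffer with a pattern, reads it back, and reports any byte that changed.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)   # all zeros, all ones, alternating bits


def fill(buf, pattern, chunk=1 << 20):
    """Overwrite every byte of buf with pattern, one chunk at a time."""
    block = bytes([pattern]) * chunk
    for start in range(0, len(buf), chunk):
        end = min(start + chunk, len(buf))
        buf[start:end] = block[:end - start]


def find_mismatches(buf, pattern):
    """Return the offsets of all bytes in buf that differ from pattern."""
    if buf.count(pattern) == len(buf):     # fast path: nothing changed
        return []
    return [i for i, b in enumerate(buf) if b != pattern]


def pound(size_bytes, passes=1):
    """Repeatedly write and verify patterns over one large buffer.

    Returns a list of (pass_number, pattern, offset) for every bad byte seen.
    """
    buf = bytearray(size_bytes)
    errors = []
    for n in range(passes):
        for pattern in PATTERNS:
            fill(buf, pattern)
            for offset in find_mismatches(buf, pattern):
                errors.append((n, pattern, offset))
    return errors
```

Something like `pound(1536 * 1024 * 1024, passes=100)` in each of 4 processes would roughly match the setup described above. A user-space tester like this can't choose which physical RAM it gets, and small buffers may be served from the CPU caches rather than from RAM, which is part of why it took so long and why restarting the processes helped.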
Now think about the computer you're using now. Is there faulty RAM occasionally trashing your data? How do you know? Do you create backups? Are your backups corrupted? If you stop using the computer and run memtest86 on it for a few hours, then there could still be faults that memtest86 didn't find. What if you stop using it and run memtest86 on it for 4 entire days? It's very likely memtest86 would find any faults in that time; but what happens if a fault develops the day after you've tested it all properly? To make sure, you really should consider running memtest86 for at least 23 hours each day. Of course you'd probably need to buy a set of 8 computers so you can actually get 8 hours of work done each day, in between all the testing.
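One partial answer to "are your backups corrupted?" is to record a checksum for every file and compare later. The sketch below is only an illustration under assumptions: the helper names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and it can only tell you that a file changed after the manifest was made. If the RAM was already faulty when the manifest was built, the recorded checksums may be wrong too.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """Return the names in manifest whose contents differ now or are missing."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

With a manifest saved before something like a defrag, `changed_files(root, manifest)` afterwards would at least list which files were trashed, instead of having to guess.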
If you care about your data, use ECC - it's the only sane alternative. Note: I also use RAID for hard drives.
[/mini-rant]
Cheers,
Brendan