ECC and RAM faults (was "What do you hate...")

Brendan · Post by **Brendan** » Mon Apr 09, 2012 9:28 pm

Hi,

[mini-rant]

Rudster816 wrote:
Brendan wrote:Basically, I won't trust that Windows machine for anything important because it doesn't have ECC RAM.

ECC RAM can still fail, and I wouldn't be the least bit surprised if its failure rates were not much better than non-ECC RAM.

In theory there are catastrophic failures (e.g. where RAM stops functioning at all). ECC won't help with these. I've never seen them happen though and if they do happen you're only looking at losing whatever was in memory at the time - no worse than a triple fault caused by a software bug; and the firmware's useless little memory test actually would find the problem. Basically catastrophic failures aren't really a problem at all.

Most RAM failures are a sticky bit somewhere, an intermittent but recurring fault, or a random error (e.g. cosmic ray). ECC will correct almost all of these faults; and if an area of RAM is so messed up that ECC can't correct the errors it will report a problem and won't let you think it's working reliably when it isn't. There is a very small chance (probably about 1 chance in 4 billion) that an error can occur and not be corrected or detected (but that is many orders of magnitude better than "zero chance of errors being corrected or detected").
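To make "correct almost all of these faults" concrete, here is a toy sketch of the kind of code involved: a single-error-correcting, double-error-detecting (SECDED) extended Hamming code over 4 data bits. ECC DIMMs commonly protect 64 data bits with 8 check bits; the 8-bit codeword, the function names `encode`/`decode` and the status strings below are assumptions made for illustration, not how any particular memory controller is implemented.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    # Hamming positions 1..7: parity bits at 1, 2, 4; data bits at 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = sum(code) % 2             # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at 'syndrome'
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(0b1011)` and decoding gives back the original data with status "corrected"; flipping two bits is reported as uncorrectable instead of silently returning wrong data. Three or more flips can fool a code like this, which is where the small residual chance of an undetected error comes from.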

The problem with normal RAM is that it can be faulty and nobody will know. The first time I got burnt everything worked fine, except Internet Explorer would crash occasionally. Of course back then, even with perfectly good hardware, Internet Explorer would crash occasionally. There wasn't any reason to suspect faulty RAM (and plenty of reason to suspect IE). Then one day I ran defrag. Every file was loaded into RAM then written back to a different place on the hard drive. About 30% of the files were corrupted due to the faulty RAM (that I had no reason to suspect was faulty). Trying to figure out which files were OK and which files were trashed was too much of a nightmare - after about an hour of trying I realised it was futile and gave up (wiped everything). In that case the BIOS's memory test said everything was fine. It took memtest about 4 hours to detect anything was wrong.

The second time (a completely different computer, many years later, this time running Linux) the only symptom was that GCC would occasionally crash. I didn't know if it was a software problem or not. I wrote my own little memory tester - 4 processes that each allocated 1.5 GiB of RAM and constantly pounded it. Eventually (after leaving it running for ages, and occasionally stopping/restarting processes to get them to test hopefully different areas of RAM) my little tester found/reported a problem. After that I booted memtest86 - first time memtest86 found the problem after about 6 hours of running. Second time it found the problem after about 10 hours of running. As far as I can tell the problem was actually caused by dust - the computer was running 24 hours a day with no air filters, and cleaning the dust out made the problem seem to go away. It was a Core2 quad-core with 8 GiB of RAM, and was less than 12 months old at the time. I haven't used that computer since (except for testing my OS).
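For anyone curious what a "little memory tester" of that sort might look like, here is a minimal sketch in Python. It is not the original tester: the function name `pound`, the patterns and the chunk size are all assumptions. It fills a buffer with a pattern, reads it back, and records any byte that differs.

```python
def pound(size_bytes, passes=1, chunk=1 << 20):
    """Write patterns into a buffer, read them back, return any mismatches.

    Each mismatch is a tuple (offset, expected_byte, actual_byte).
    """
    buf = bytearray(size_bytes)
    bad = []
    for _ in range(passes):
        for pattern in (0x00, 0xFF, 0x55, 0xAA):
            expected = bytes([pattern]) * chunk
            # Write the whole buffer first so that, for buffers much larger
            # than the CPU caches, the read-back has to come from DRAM.
            for off in range(0, size_bytes, chunk):
                n = min(chunk, size_bytes - off)
                buf[off:off + n] = expected[:n]
            for off in range(0, size_bytes, chunk):
                n = min(chunk, size_bytes - off)
                if buf[off:off + n] != expected[:n]:
                    # Slow path: find exactly which bytes changed.
                    for i in range(n):
                        if buf[off + i] != pattern:
                            bad.append((off + i, pattern, buf[off + i]))
    return bad
```

A call such as `pound(1 << 30, passes=100)` in several processes at once approximates the "allocate a lot and pound it" approach. A user-space tester like this cannot choose which physical pages it gets, which is why stopping and restarting processes to land on different RAM can matter.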

Now think about the computer you're using now. Is there faulty RAM occasionally trashing your data? How do you know? Do you create backups? Are your backups corrupted? If you stop using the computer and run memtest86 on it for a few hours, then there could still be faults that memtest86 didn't find. What if you stop using it and run memtest86 on it for 4 entire days? It's very likely memtest86 would find any faults in that time; but what happens if a fault develops the day after you've tested it all properly? To make sure, you really should consider running memtest86 for at least 23 hours each day. Of course you'd probably need to buy a set of 8 computers so you can actually get 8 hours of work done each day, in between all the testing.

If you care about your data, use ECC - it's the only sane alternative. Note: I also use RAID for hard drives.

[/mini-rant]

Cheers,

Brendan

gravaera · Post by **gravaera** » Mon Apr 09, 2012 9:55 pm

When you put it like that, it sounds like a real horror

Brendan · Post by **Brendan** » Mon Apr 09, 2012 10:28 pm

Hi,

gravaera wrote:When you put it like that, it sounds like a real horror

It's worse than you think. Google did some research into memory errors recently. They've got a huge number of computers and gathered statistics (both corrected and uncorrected errors) from the ECC in each of them. The results?

An average DIMM experiences nearly 4000 correctable errors per year.

The computer I'm using now has 6 DIMMs. That means I can expect an average of about 24000 correctable errors per year, 2000 correctable errors per month, 65.7 correctable errors per day, 2.7 correctable errors per hour, or one correctable error every 22.2 minutes.
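The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation above; the function name `error_rates` is made up for the example. Carrying full precision gives roughly one error every 21.9 minutes; the 22.2 figure comes from dividing 60 by the already-rounded 2.7 errors per hour.

```python
def error_rates(errors_per_dimm_per_year, dimms):
    """Spread a yearly per-DIMM error count evenly over smaller time units."""
    per_year = errors_per_dimm_per_year * dimms
    per_day = per_year / 365
    per_hour = per_day / 24
    return {
        "per_year": per_year,
        "per_month": per_year / 12,
        "per_day": per_day,
        "per_hour": per_hour,
        "minutes_between": 60 / per_hour,
    }


rates = error_rates(4000, 6)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

Note that this treats errors as evenly spread across DIMMs and across time, which is an assumption of the averaging and not something the raw statistic guarantees.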

Now imagine if there was no ECC to correct those correctable errors.

Cheers,

Brendan

piranha · Post by **piranha** » Mon Apr 09, 2012 10:52 pm

Brendan wrote:Hi,

gravaera wrote:When you put it like that, it sounds like a real horror
It's worse than you think. Google did some research into memory errors recently. They've got a huge number of computers and gathered statistics (both corrected and uncorrected errors) from the ECC in each of them. The results?

An average DIMM experiences nearly 4000 correctable errors per year.

The computer I'm using now has 6 DIMMs. That means I can expect an average of about 24000 correctable errors per year, 2000 correctable errors per month, 65.7 correctable errors per day, 2.7 correctable errors per hour, or one correctable error every 22.2 minutes.

Now imagine if there was no ECC to correct those correctable errors.

Cheers,

Brendan

That doesn't take into account the fact that memory isn't constantly full. The rate of errors that would only affect a single bit would decrease if you take into account that many of the bits are irrelevant, right?
Edit: Not that I'm saying that that's not a huge number of errors...
-JL

Brendan · Post by **Brendan** » Tue Apr 10, 2012 1:18 am

Hi,

piranha wrote:
Brendan wrote:It's worse than you think. Google did some research into memory errors recently. They've got a huge number of computers and gathered statistics (both corrected and uncorrected errors) from the ECC in each of them. The results?

An average DIMM experiences nearly 4000 correctable errors per year.

The computer I'm using now has 6 DIMMs. That means I can expect an average of about 24000 correctable errors per year, 2000 correctable errors per month, 65.7 correctable errors per day, 2.7 correctable errors per hour, or one correctable error every 22.2 minutes.
That doesn't take into account the fact that memory isn't constantly full. The rate of errors that would only affect a single bit would decrease if you take into account that many of the bits are irrelevant, right?

The number of errors is the number of errors. The number of errors that actually affect running software is different, and would be less than the total number of errors.

For example, I've got 12 GiB of RAM and rarely actually use more than 4 GiB of it. Of the 4 GiB I am using, some is for things like graphics data where some corruption isn't really going to cause problems. Even for code there's parts that aren't normally used (e.g. error handling code) and parts that are only used sometimes (e.g. the compression code in gzip when you're decompressing files). Of course some are more severe - worst would be a corrupted library or kernel on disk (where one RAM error affects everything until you reinstall).

If you estimate that only 2% of RAM errors end up causing actual problems, then (without ECC) I might be able to run for an average of 18.2 hours before something somewhere gets messed up. This computer has been running 24 hours per day for about 2 years (excluding the occasional/brief reboot), so that means ECC has probably prevented almost 1000 actual problems so far.
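As a quick check of that estimate, here is the same calculation as a sketch in Python. The 2% figure is the guess from the paragraph above and the function names are invented for the example.

```python
HOURS_PER_YEAR = 365 * 24


def hours_between_problems(errors_per_year, harmful_fraction):
    """Average hours between errors that actually cause a problem."""
    harmful_per_year = errors_per_year * harmful_fraction
    return HOURS_PER_YEAR / harmful_per_year


def problems_prevented(errors_per_year, harmful_fraction, years):
    """Expected number of harmful errors over the given number of years."""
    return errors_per_year * harmful_fraction * years


print(hours_between_problems(24000, 0.02))   # about 18.25 hours
print(problems_prevented(24000, 0.02, 2))    # about 960 problems
```

Changing `harmful_fraction` shows how sensitive the result is to the guess: at 0.2% it would be about a week between problems rather than most of a day.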

Cheers,

Brendan

bluemoon · Post by **bluemoon** » Tue Apr 10, 2012 7:23 am

Brendan, thanks for the information.

I feel lucky using ECC.

btw unbuffered ECC is cheap, only buffered RAM is sky-high in price, so cost should not be much of a concern when using ECC on cheap computers.

NickJohnson · Post by **NickJohnson** » Tue Apr 10, 2012 9:21 am

So here's the question: how hard would it be (i.e. how much overhead would it create) to have a system that checksums its shared libraries on a regular basis to check for errors like that? It seems like that would avert most of the catastrophic errors (like compression library code being corrupted.)
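A rough sketch of what such a periodic check could look like, in Python. Everything here is hypothetical: the function names, the choice of SHA-256, and the idea of keeping a baseline dictionary are assumptions for illustration, not an existing tool.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large libraries need not fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def snapshot(paths):
    """Record a baseline digest for every file in 'paths'."""
    return {str(p): file_digest(p) for p in paths}


def changed_files(baseline, paths):
    """Return the files whose current digest differs from the baseline."""
    current = snapshot(paths)
    return sorted(p for p, digest in current.items() if baseline.get(p) != digest)
```

The overhead is roughly one sequential read of each library per check. Two limits are worth keeping in mind: a legitimate update changes the digest just like corruption does, and the check only covers the bytes as read at that moment, not whatever happens to a copy in RAM afterwards.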

bluemoon · Post by **bluemoon** » Tue Apr 10, 2012 9:39 am

The problem is that you load stuff into RAM and perform the checksum, it's all good, and then right before you save the data to disk, part of it magically turns into junk bytes, caused by unstable power, cosmic rays or pure bad luck.

Things can happen after you run the checksum and before you store your data...

Rudster816 · Post by **Rudster816** » Tue Apr 10, 2012 12:54 pm

Brendan wrote:Hi,

gravaera wrote:When you put it like that, it sounds like a real horror
It's worse than you think. Google did some research into memory errors recently. They've got a huge number of computers and gathered statistics (both corrected and uncorrected errors) from the ECC in each of them. The results?

An average DIMM experiences nearly 4000 correctable errors per year.

The computer I'm using now has 6 DIMMs. That means I can expect an average of about 24000 correctable errors per year, 2000 correctable errors per month, 65.7 correctable errors per day, 2.7 correctable errors per hour, or one correctable error every 22.2 minutes.

Now imagine if there was no ECC to correct those correctable errors.

Cheers,

Brendan

Under Conclusion 1 in that article, it says that 8% of the DIMMs in their fleet suffered one or more correctable errors in a given year. Saying that an average DIMM experiences 4000 correctable errors per year is grossly misleading, because it implies that it would be monstrously unlikely that a DIMM could go a year without any errors, when in fact, it has a 92% chance to last an entire year without error. The only reason that 4000 CEs per year statistic is true is because one faulty DIMM can spit out tens of thousands of CEs per hour and run for a very long time before it's replaced in a server environment.
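The gap between "average" and "typical" is easy to show with made-up numbers. In the Python sketch below, the fleet of 100 DIMMs and the 50,000 errors per faulty DIMM are hypothetical values chosen only so that 8% of DIMMs have errors and the mean comes out at 4000; they are not figures from the study.

```python
from statistics import mean, median


def fleet(total=100, faulty=8, errors_per_faulty=50_000):
    """Yearly correctable-error counts for a hypothetical fleet of DIMMs."""
    return [errors_per_faulty] * faulty + [0] * (total - faulty)


counts = fleet()
print("mean errors per DIMM:  ", mean(counts))
print("median errors per DIMM:", median(counts))
print("share with any errors: ", sum(c > 0 for c in counts) / len(counts))
```

With a distribution this skewed the mean says almost nothing about what a randomly chosen DIMM will do in a year, while the share of DIMMs with at least one error does.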

Your own faulty RAM took 4 hours to show any sign of errors with Memtest, which is certainly far from average use. If your everyday DIMM had an error every 22.2 min on average, it would be highly unlikely that any set of RAM would pass a Memtest that lasted for more than a couple of hours. In all my years of computer forum trolling, I've seen dozens of instances of people's memory passing 8+ hours of Memtest. I've also seen quite a few crazy people do 24-48 hr runs of both Memtest and Prime95\Linpack with no errors.

I've run overclocked CPUs, GPUs, and RAM for close to 6 years now, and the only time I've ever lost any data due to hardware failure was because of a 1.5TB Seagate HDD. The only reason I actually lost any data was because I couldn't afford a second HDD to do a backup with, but none of what was on it was "mission critical". Now I have two 2TB HDDs that are 1:1 mirrors (non RAID).

My current CPU\Motherboard have been to absolute hell and back about a dozen times. The CPU is a Core i7 920 and has been run at 4.0 GHz (190.5x21) for the vast majority of the time I've had it (since July 09). It's also been benched at 4.6 GHz+ with chilled water cooling, and been at 5.1 GHz with dry ice. Without fail though, it chugs on today without error, despite all of the things I've done to it that Intel would definitely not approve of. If something as fragile as a 263 mm^2 piece of silicon with 731 million transistors can go through all of that and still work flawlessly, I think you can rely on a stick of RAM that you know worked when you first bought it. If you're thinking about asking how I know my CPU still works fine, I'm not going to ramble on about exactly how I know (mainly because I don't know for sure, but neither does anyone else). I'll just say that I know it to work to the degree that it worked when I first got it.

I'm not about to go out (or recommend it to others) and buy some overpriced hunk of junk (compared to what I have now) server motherboard and equally junky (compared to what I have now) RAM based on the one in a million chance that a couple of bytes could get screwed up in one of the files I save to my disk. With a hard backup, the odds of ECC RAM preventing any type of meaningful data loss are so low that it's not even worth the 5 min you spend pondering whether or not it's worth it.

If you decide that you want all your computers to have ECC RAM, that's fine if it works for you, and there's nothing wrong with it. But implying that computers without ECC RAM are severely more prone to data loss than their ECC counterparts is wrong.

Brendan · Post by **Brendan** » Tue Apr 10, 2012 9:28 pm

Hi,

Rudster816 wrote:Saying that an average DIMM experiences 4000 correctable errors per year is grossly misleading, because it implies that it would be monstrously unlikely that a DIMM could go a year without any errors, when in fact, it has a 92% chance to last an entire year without error. The only reason that 4000 CEs per year statistic is true is because one faulty DIMM can spit out tens of thousands of CEs per hour and run for a very long time before it's replaced in a server environment.

Is a 92% chance that a DIMM will last a year without an error (and then spit out tens of thousands of errors per hour) what you'd call "good"?

If there's a 92% chance that one DIMM will last a year without an error, then the chance of 2 DIMMs both lasting a year without error is 0.92 * 0.92, or 84.64%. For a system like mine with 6 DIMMs the chance of all DIMMs lasting a year without error would be 0.92^6, or 60.6355%.
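The same calculation as a short Python sketch, for any number of DIMMs. It assumes each DIMM fails independently of the others, which is the assumption behind multiplying the probabilities; the function name is made up for the example.

```python
def all_error_free(p_single, dimms):
    """Chance that every DIMM goes a year without error, assuming independence."""
    return p_single ** dimms


for n in (1, 2, 4, 6, 8):
    p = all_error_free(0.92, n)
    print(f"{n} DIMMs: {p:.4%} all clean, {1 - p:.4%} at least one with errors")
```

For 6 DIMMs this gives about a 39% chance per year that at least one of them reports errors.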

Rudster816 wrote:Your own faulty RAM took 4 hours to show any sign of errors with Memtest, which is certainly far from average use.

Yes. For average use, about 5% of the time the second system crashed in less than 1 minute of running GCC. The first system was similar - Internet Explorer was the only thing that ever crashed (sometimes you could look at 50 web pages without a problem, sometimes it'd crash on the first web page you look at). In addition, for the second system if you attempted to reinstall Windows the installer would always crash.

Memtest tests all RAM. This means it consumes a lot of time testing RAM that is fine. For average use, I suspect that Windows and Linux allocate physical pages in a certain order, so that you could have a faulty page of RAM that is never allocated while you're using "x MiB" of RAM but when you start using "x + 1 MiB" of RAM the faulty page is allocated by a process and gets pounded by that process.

Rudster816 wrote:I've run overclocked CPUs, GPUs, and RAM for close to 6 years now, and the only time I've ever lost any data due to hardware failure was because of a 1.5TB Seagate HDD. The only reason I actually lost any data was because I couldn't afford a second HDD to do a backup with, but none of what was on it was "mission critical". Now I have two 2TB HDDs that are 1:1 mirrors (non RAID).

I have never been involved in a plane crash, therefore planes never crash.

Was it the same computer that you ran for 6 years? Was it 12 different computers that you ran for 6 months each? Given that I think the problem I had with the second system was caused by dust, can you see how the age of the system/s might make a difference?

When you say "for 6 years", do you mean one hour per week for 6 years, or running 24 hours per day for 6 years? My second system probably could have run for 4 hours per day for 6 years before it had its problem (rather than 24 hours per day for 1 year).

Rudster816 wrote:If you're thinking about asking how I know my CPU still works fine, I'm not going to ramble on about exactly how I know (mainly because I don't know for sure, but neither does anyone else).

And that is my main point. Without ECC you don't know if RAM is faulty or not, and if RAM does become faulty there's no way of guessing how much damage it will do until you actually do notice.

Rudster816 wrote:I'm not about to go out (or recommend it to others) and buy some overpriced hunk of junk (compared to what I have now) server motherboard and equally junky (compared to what I have now) RAM based on the one in a million chance that a couple of bytes could get screwed up in one of the files I save to my disk. With a hard backup, the odds of ECC RAM preventing any type of meaningful data loss are so low that it's not even worth the 5 min you spend pondering whether or not it's worth it.

Do you honestly think someone who's willing to pay twice as much (or more) for hardware specially designed for people that have no data (gamers), who typically void their hardware's warranty as soon as they get it home, is in a position to offer an opinion about "overpriced hunks of junk"?

I think you should buy more interior case lights - I heard they help. I've got some LEDs here that I paid 5 cents each for. I can put a fancy "gamer's lights" sticker on them and let you have a pack of 10 for only $65. It's a bargain! 10% discount if you buy our special "oxygen free copper" speaker cables at the same time!

Cheers,

Brendan

Rudster816 · Post by **Rudster816** » Tue Apr 10, 2012 10:18 pm

Brendan wrote: Do you honestly think someone who's willing to pay twice as much (or more) for hardware specially designed for people that have no data (gamers), who typically void their hardware's warranty as soon as they get it home, is in a position to offer an opinion about "overpriced hunks of junk"?

I think you should buy more interior case lights - I heard they help. I've got some LEDs here that I paid 5 cents each for. I can put a fancy "gamer's lights" sticker on them and let you have a pack of 10 for only $65. It's a bargain! 10% discount if you buy our special "oxygen free copper" speaker cables at the same time!

You just flat out assume I'm some stupid gamer who thinks pretty lights are cool, which is the farthest thing from the truth. For starters, I do very little gaming. The only gaming I've done all year is that for the past week and a half I've been playing Skyrim, but I don't see myself playing it for more than a couple more weeks, as my interest is already waning rather quickly. The only LEDs in my case are the ones soldered on to my motherboard. As much as I hate to admit it, I bought my speakers at Walmart.

I like to do extreme overclocking, which is why I own high end hardware, not because I think I need it to play WoW. I regarded server equipment as "overpriced junk" because it costs even more than my hardware, yet it's of a much lower quality. The reason is simple: the people who are in the market for servers have deeper pockets.

I've made the same mistake you did (misjudging someone) many times, feels weird being on the other side.

Brendan wrote: I have never been involved in a plane crash, therefore planes never crash.

I never stated, implied, or thought anything along the lines of "it works for me, so it works for everyone else". You're just assuming stuff about me and putting words in my mouth.

Brendan wrote: Was it the same computer that you ran for 6 years? Was it 12 different computers that you ran for 6 months each? Given that I think the problem I had with the second system was caused by dust, can you see how the age of the system/s might make a difference?

When you say "for 6 years", do you mean one hour per week for 6 years, or running 24 hours per day for 6 years? My second system probably could have run for 4 hours per day for 6 years before it had its problem (rather than 24 hours per day for 1 year).

My computer has basically been in a constant state of upgrade. Over the years I've upgraded parts as I've gotten the money. The only pieces that I've held on to more than two years are some HDDs and a DVD drive. One thing that's never changed though, is that my PC runs 24/7 and gets way more use than it probably should.

Brendan · Post by **Brendan** » Tue Apr 10, 2012 11:56 pm

Hi,

Rudster816 wrote:
Brendan wrote:Do you honestly think someone who's willing to pay twice as much (or more) for hardware specially designed for people that have no data (gamers), who typically void their hardware's warranty as soon as they get it home, is in a position to offer an opinion about "overpriced hunks of junk"?

I think you should buy more interior case lights - I heard they help. I've got some LEDs here that I paid 5 cents each for. I can put a fancy "gamer's lights" sticker on them and let you have a pack of 10 for only $65. It's a bargain! 10% discount if you buy our special "oxygen free copper" speaker cables at the same time!
You just flat out assume I'm some stupid gamer who thinks pretty lights are cool, which is the farthest thing from the truth.

I assumed you're someone that wastes cash for the purpose of destroying the life expectancy of their hardware. The only slightly sane reason for this that I could think of is "gamer", so I assumed that too.

I guess this means you're someone that wastes cash for the purpose of destroying the life expectancy of their hardware, for no sane reason at all?

Do you still think you're in a position to offer an opinion about "overpriced hunks of junk"?

My computer has basically been in a constant state of upgrade.

Which is probably why you haven't noticed any of the problems caused by normal aging, or pointlessly exacerbated aging.

The best computer I ever bought was a dual Pentium III server. I got it at a very nice price, second hand off of eBay. I don't know how much use it got before I bought it (I think it was from a server room somewhere and was replaced/upgraded but can't be sure). It was a bit too noisy so I disconnected a bank of "drive bay" fans when I got it; then I used it like this (24 hours per day) for about 4 years. It's still here in my pool of test machines - not a single problem with any piece of hardware, even though all of it remains unchanged from the day I got it.

Given the choice between using that old dual Pentium III server and using the machine you're currently abusing; I'd go back to that old server without a second thought.

Cheers,

Brendan

Rudster816 · Post by **Rudster816** » Fri Apr 13, 2012 5:25 pm

Brendan wrote:
I assumed you're someone that wastes cash for the purpose of destroying the life expectancy of their hardware. The only slightly sane reason for this that I could think of is "gamer", so I assumed that too.

I guess this means you're someone that wastes cash for the purpose of destroying the life expectancy of their hardware, for no sane reason at all?

Do you still think you're in a position to offer an opinion about "overpriced hunks of junk"?

My computer has basically been in a constant state of upgrade.
Which is probably why you haven't noticed any of the problems caused by normal aging, or pointlessly exacerbated aging.

The best computer I ever bought was a dual Pentium III server. I got it at a very nice price, second hand off of eBay. I don't know how much use it got before I bought it (I think it was from a server room somewhere and was replaced/upgraded but can't be sure). It was a bit too noisy so I disconnected a bank of "drive bay" fans when I got it; then I used it like this (24 hours per day) for about 4 years. It's still here in my pool of test machines - not a single problem with any piece of hardware, even though all of it remains unchanged from the day I got it.

Given the choice between using that old dual Pentium III server and using the machine you're currently abusing; I'd go back to that old server without a second thought.

Overclocking\Benchmarking hardware is a hobby, so I'm no different than a lot of other people who spend money on hobbies for rather pointless reasons. It's no different than someone going out and buying a $30k sports car and modding it, racing it, etc. Although that's a helluva lot more expensive, but probably more of a thrill too. I don't care if other people think I'm crazy for spending lots of money on computer stuff. I also don't care that you'd rather use your PIII server than my machine that I've been abusing. Would I like people to understand it better? Sure.

Also, a rundown on the warranties for my hardware:

Motherboard: Lifetime warranty, still valid.
CPU: 3 Year warranty. Would have expired in July had I not already (technically) voided it for overclocking. Full disclosure though, if it broke between now and July, I'd lie and say I didn't OC it and get it replaced.
Power Supply: 5 year warranty, still valid.
RAM: Lifetime, still valid.
HDDs: However long they last, still valid. I believe they are 5 years for my WD Greens (maybe less, I'm not sure)
GPUs: However long they last, still valid. IIRC, one of them is a lifetime, the other is a 3 year.

Good enough for me.

MJD · Post by **MJD** » Wed Apr 25, 2012 7:23 am

I'm not about to go out (or recommend it to others) and buy some overpriced hunk of junk (compared to what I have now) server motherboard and equally junky (compared to what I have now) RAM based on the one in a million chance that a couple of bytes could get screwed up in one of the files I save to my disk. With a hard backup, the odds of ECC RAM preventing any type of meaningful data loss are so low that it's not even worth the 5 min you spend pondering whether or not it's worth it.

If you decide that you want all your computers to have ECC RAM, that's fine if it works for you, and there's nothing wrong with it. But implying that computers without ECC RAM are severely more prone to data loss than their ECC counterparts is wrong.

Actually, to get ECC memory in your machine, you don't need server motherboards/server CPUs. Many products in AMD's consumer line support ECC memory just fine. I run with ECC RAM to protect myself against random errors, but I have good quality consumer pieces through and through. While you might not get high-spec ECC RAM, the stuff you can get works perfectly for a decent price.

OSDev.org
