
Determining the failing bits of an uncorrectable ECC error.

Posted: Thu Sep 19, 2019 1:45 pm
by matviy
I'm running a system with an x86 AMD CPU that has a 64-bit data bus plus 8 bits of ECC, so 72 bits total. I have nine x8 DRAMs on my DIMM.

I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
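
For anyone reading such a dump, here is a rough sketch of decoding the flag bits of a raw MCi_STATUS value. The VAL, OVER, UC, EN, MISCV, ADDRV and PCC positions are the architectural x86 machine-check layout. The CECC/UECC positions (bits 46 and 45) are an assumption taken from AMD's general machine-check documentation and should be checked against the BKDG for the exact CPU family. The function name is made up for illustration, and syndrome extraction is left out because its bit positions are family-specific.

```python
# Architectural MCi_STATUS flag bits (x86 machine-check architecture).
ARCH_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

# ASSUMPTION: AMD-specific ECC flags; verify against the BKDG for your family.
AMD_ECC_FLAGS = {
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}


def decode_mc_status(status):
    """Return a dict mapping flag name -> bool for a 64-bit MCi_STATUS value."""
    flags = {}
    for name, bit in {**ARCH_FLAGS, **AMD_ECC_FLAGS}.items():
        flags[name] = bool((status >> bit) & 1)
    return flags
```

If ADDRV is set, the matching MCi_ADDR register holds the address associated with the error, which is what any later address-to-device lookup would start from.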

However, there doesn't seem to be a way to figure out which data bits specifically are bad. The Syndrome is only valid for correctable errors, so it doesn't map to anything on the ECC syndrome lookup table.

The system uses a 4-bit-symbol BCH ECC that is single-symbol-correcting and dual-symbol-detecting. Since the error is uncorrectable, I'm getting errors across at least two symbols. In fact, I'm fairly sure that an entire DRAM (8 bits) is dying.
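
To illustrate that reasoning, here is a minimal sketch of why a dead x8 chip defeats a single-symbol-correcting code. It assumes, purely for illustration, that bus bits are numbered 0..71, that each 4-bit symbol is four consecutive bits, and that each x8 DRAM drives eight consecutive bits. The real symbol layout and board routing may well be swizzled differently, and the helper names are hypothetical.

```python
SYMBOL_BITS = 4   # ECC symbol width described above
CHIP_BITS = 8     # each DRAM is an x8 part
BUS_BITS = 72     # 64 data bits + 8 ECC bits


def symbols_and_chips(bad_bits):
    """Map failing bus-bit indices to the symbols and chips they fall in.

    ASSUMPTION: linear layout (bit i -> symbol i // 4, chip i // 8).
    """
    bad_bits = list(bad_bits)
    for b in bad_bits:
        if not 0 <= b < BUS_BITS:
            raise ValueError(f"bit index {b} is outside the {BUS_BITS}-bit bus")
    symbols = sorted({b // SYMBOL_BITS for b in bad_bits})
    chips = sorted({b // CHIP_BITS for b in bad_bits})
    return symbols, chips


def is_correctable(bad_bits):
    """Single-symbol-correcting: fixable only if at most one symbol is hit."""
    symbols, _ = symbols_and_chips(bad_bits)
    return len(symbols) <= 1
```

Under these assumptions, `symbols_and_chips(range(16, 24))` puts a fully failed third chip in symbols 4 and 5, so `is_correctable` returns False, while any error pattern confined to one nibble stays correctable.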

If the AMD CPU had a register somewhere that contained the 72 bits read from memory that caused the error, I could probably look at it to see which bits are obviously wrong. But there doesn't seem to be such a register anywhere.

The datasheet for this CPU is here: https://www.amd.com/system/files/TechDo ... h_BKDG.pdf

Any ideas how to figure out which chip is failing?

EDIT: To be specific, the reason I need to know which chip is failing is that they're all soldered to the main board.

Re: Determining the failing bits of an uncorrectable ECC err

Posted: Sun Sep 22, 2019 9:06 am
by bzt
Hi,
matviy wrote:I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
This is quite a problem. ECC is usually capable of fixing one faulty bit (or, with a symbol-based code like yours, one faulty symbol), and that happens transparently. If it can't, more bits have failed than the code can correct.
matviy wrote:However, there doesn't seem to be a way to figure out which data bits specifically are bad.
Sure, if there were a way, it would be a correctable error.
matviy wrote:Any ideas how to figure out which chip is failing?
Well, programmatically from your OS? You'll need the memory bank information for that, which is usually found in the SMBIOS and ACPI tables.

For Linux, just execute "dmidecode -t 6" and "dmidecode -t 17". They will tell you which RAM area corresponds to which memory bank (and, if you're lucky, the exact chip too). You can also read the hardware fault log if your firmware supports that.
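
If you want to process that output in a script, here is a minimal sketch that parses the text printed by "dmidecode -t 17" into one dictionary per memory device. It assumes the usual layout of blank-line-separated records with "Key: Value" lines; the function name is made up for illustration. Note that these tables describe memory devices at the module/bank level, so on a board with soldered-down DRAM they may not single out an individual chip.

```python
def parse_memory_devices(text):
    """Parse `dmidecode -t 17` output into a list of field dictionaries.

    ASSUMPTION: records are separated by blank lines and each field is
    printed as an indented "Key: Value" line.
    """
    devices = []
    for block in text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the header and any non-type-17 records
        fields = {}
        for ln in lines:
            key, sep, value = ln.partition(": ")
            if sep:
                fields[key] = value
        devices.append(fields)
    return devices
```

The text would typically be captured as root, for example with `subprocess.run(["dmidecode", "-t", "17"], capture_output=True, text=True).stdout`, and the "Locator" and "Bank Locator" fields are the ones to compare against the board layout.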

Cheers,
bzt