Determining the failing bits of an uncorrectable ECC error.

matviy
Posts: 1
Joined: Thu Sep 19, 2019 1:23 pm
Libera.chat IRC: matviy

Post by matviy »

I'm running a system with an x86 AMD CPU that has a 64-bit data bus plus 8 bits of ECC, so 72 bits total. I have nine x8 DRAMs on my DIMM.

I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
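
As a rough sketch of reading such a dump: the bit positions below are only the architectural flag bits of an x86 MCi_STATUS register. The function name and the sample value are made up for illustration, and the model-specific fields (the ECC error bits and the syndrome) are left out on purpose, since their positions depend on the CPU family and have to come from the BKDG.

```python
# Architectural flag bits of an x86 MCi_STATUS register (bits 63..57).
# Model-specific fields (ECC error bits, syndrome) are deliberately omitted:
# their positions depend on the CPU family and must be taken from the BKDG.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # MCi_MISC holds valid extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context may be corrupt
}

def decode_mci_status(status):
    """Return the names of the architectural flags set in a 64-bit status value."""
    return [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]

# Hypothetical value for illustration, not taken from a real dump.
sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 58)
print(decode_mci_status(sample))
```

If ADDRV is set, the matching MCi_ADDR register holds the address associated with the error. With nine x8 devices forming one 72-bit word, every access touches all of the chips, so the address alone is unlikely to single out one DRAM.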

However, there doesn't seem to be a way to figure out which data bits specifically are bad. The Syndrome is only valid for correctable errors, so it doesn't map to anything on the ECC syndrome lookup table.

The system uses 4-bit symbol BCH ECC that is single-symbol-correcting and dual-symbol-detecting. Since the error is uncorrectable, I'm getting errors across at least two symbols. In fact, I'm fairly sure that it's an entire DRAM (8 bits) that's completely dying.
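
To illustrate why a dying x8 part would show up as uncorrectable, here is a minimal sketch of the symbol arithmetic. It assumes a purely hypothetical layout in which chip n drives bits 8n..8n+7 of the 72-bit word and symbol k covers bits 4k..4k+3; the real mapping depends on the memory controller and the board routing, and the function names are invented for this example.

```python
SYMBOL_BITS = 4   # ECC symbol size stated in the post
CHIP_BITS = 8     # width of one x8 DRAM
BUS_BITS = 72     # 64 data bits + 8 ECC bits

def symbols_for_chip(chip):
    """ECC symbols covered by one chip, ASSUMING chip n drives bits 8n..8n+7."""
    first_bit = chip * CHIP_BITS
    return sorted({bit // SYMBOL_BITS
                   for bit in range(first_bit, first_bit + CHIP_BITS)})

def chips_for_symbols(symbols):
    """Chips that touch any of the given symbols, under the same assumed layout."""
    return sorted({(s * SYMBOL_BITS) // CHIP_BITS for s in symbols})

# Each x8 chip spans two 4-bit symbols, so one dead chip can corrupt two
# symbols at once, which is beyond single-symbol correction.
for chip in range(BUS_BITS // CHIP_BITS):
    print(chip, symbols_for_chip(chip))
```

Under this assumed layout, two bad symbols that belong to the same chip point at that chip, while two bad symbols straddling a chip boundary implicate two devices.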

If the AMD CPU had a register somewhere that contained the 72 bits read from memory that caused the error, I could probably look at it to see which bits are obviously wrong. But there doesn't seem to be such a register anywhere.

The datasheet for this CPU is here: https://www.amd.com/system/files/TechDo ... h_BKDG.pdf

Any ideas how to figure out which chip is failing?

EDIT: To be specific, the reason I need to know which chip is failing is that they're all soldered to the main board.
bzt
Member
Posts: 1584
Joined: Thu Oct 13, 2016 4:55 pm

Re: Determining the failing bits of an uncorrectable ECC err

Post by bzt »

Hi,
matviy wrote:I'm getting uncorrectable ECC errors that trigger a machine check exception and cause a reboot. I can dump the machine check registers from the CPU before the BIOS initializes to see the "uncorrectable ECC error" bit set. I can also get the error Syndrome.
This is quite a problem. ECC is usually capable of fixing one faulty bit, and that happens transparently. If it can't, that means there are more errors, an even number of them on the same row or column.
matviy wrote:However, there doesn't seem to be a way to figure out which data bits specifically are bad.
Sure, if there were a way, it would be a correctable error.
matviy wrote:Any ideas how to figure out which chip is failing?
Well, programmatically from your OS? You'll need the memory bank information for that, usually found in the SMBIOS and ACPI tables.

For Linux, just execute "dmidecode -t 6" and "dmidecode -t 17". They will tell you which RAM area corresponds to which memory bank (and, if you're lucky, the exact chip too). You can also read the hardware fault log if your firmware supports that.

Cheers,
bzt