Tuesday, March 1, 2011

EDAC: Which DIMM?

EDAC (Error Detection and Correction) messages report hardware problems with system memory. Some of the reported errors are correctable (CE), and some are uncorrectable (UE).
EDAC is documented at http://www.kernel.org/doc/Documentation/edac.txt








We run RHEL5 on HP hardware, and DIMMs in our servers often go bad, producing errors like the following in syslog:


 EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout)
 memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
 EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac
 EDAC MC1: CE - no information available: k8_edac Error Overflow set
 EDAC k8 MC1: extended error code: ECC chipkill x4 error

EDAC tries to detect and correct memory errors. Here the "CE" (correctable error) entries and the "ECC chipkill x4 error" code suggest that chipkill ECC is detecting the problem and correcting it. You may not notice any significant hardware problems in the short term; however, it is recommended to have the DIMMs checked and to replace the faulty one.

To locate the failing DIMM, go back to the EDAC error above, which I saw on my server's console. MC1 (Memory Controller 1) means CPU1; row 1 is what the Linux EDAC documentation calls csrow1 (Chip-Select Row 1); and channel 0 means memory channel 0.

EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac
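
If you have many of these lines to go through, the three interesting numbers can be pulled out with sed. This is only a rough sketch: parse_edac_ce is a name I made up, and the pattern assumes the field layout of the k8_edac line shown above, which other EDAC drivers or kernel versions may not share.

```shell
# Hypothetical helper: extract MC, csrow and channel from one EDAC "CE" line.
# Assumes the "EDAC MCn: CE ... row R, channel C," layout of the sample above.
parse_edac_ce() {
    printf '%s\n' "$1" | sed -n \
        's/.*EDAC MC\([0-9][0-9]*\): CE .*row \([0-9][0-9]*\), channel \([0-9][0-9]*\),.*/mc=\1 csrow=\2 channel=\3/p'
}

# Sample line copied from the log above; prints: mc=1 csrow=1 channel=0
parse_edac_ce 'EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac'
```

Lines without row and channel information, such as the "CE - no information available" message, produce no output.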


            Channel 0       Channel 1
    ===================================
    csrow0  | DIMM_A0       | DIMM_B0 |
    csrow1  | DIMM_A0       | DIMM_B0 |
    ===================================

    ===================================
    csrow2  | DIMM_A1       | DIMM_B1 |
    csrow3  | DIMM_A1       | DIMM_B1 |
    ===================================
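
The table can be read as a simple rule: each DIMM covers two consecutive csrows, and the channel picks the A or B side. A minimal sketch of that rule follows. dimm_label is a hypothetical helper, and the mapping is an assumption that only holds for a layout like the one in the table; single-rank DIMMs or a different board wiring would break it.

```shell
# Hypothetical helper: map a csrow and channel to a DIMM label, assuming
# two csrows per DIMM and channel 0 = A, channel 1 = B, as in the table above.
dimm_label() {
    csrow=$1
    channel=$2
    case "$channel" in
        0) side=A ;;
        1) side=B ;;
        *) echo "unknown channel: $channel" >&2; return 1 ;;
    esac
    # Integer division: csrow0/1 -> slot 0, csrow2/3 -> slot 1, and so on.
    echo "DIMM_${side}$((csrow / 2))"
}

dimm_label 1 0   # prints DIMM_A0
dimm_label 3 1   # prints DIMM_B1
```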
 
In the table above, csrow1 on channel 0 falls on DIMM_A0. Of course, there are actually two DIMM slots labeled DIMMA0 on my server (one per CPU), but the MC1 error corresponds to CPU1, and dmidecode lists the CPU under "Bank Locator":


 # dmidecode -t memory | grep DIMMA0 -B9 -A8
 Handle 0x002E, DMI type 17, 27 bytes.
 Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU0
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer: 
        Serial Number: 
        Asset Tag: 
        Part Number: 
 --
 Handle 0x003E, DMI type 17, 27 bytes.
 Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU1
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer: 
        Serial Number: 
        Asset Tag: 
        Part Number:
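
Since the same Locator shows up once per CPU, it can help to print each slot next to its Bank Locator. The awk below is a small sketch of that. list_dimm_slots is a name I picked, and it assumes that "Locator:" comes before "Bank Locator:" within each Memory Device block, as it does in the output above.

```shell
# Hypothetical helper: read "dmidecode -t memory" output on stdin and print
# one "<Bank Locator> <Locator>" pair per memory device.
list_dimm_slots() {
    awk '
        # Remember the slot name; "Bank Locator:" lines do not match this.
        /^[ \t]*Locator:/      { sub(/^[ \t]*Locator:[ \t]*/, ""); loc = $0 }
        # Bank Locator closes the pair, so print it with the saved slot name.
        /^[ \t]*Bank Locator:/ { sub(/^[ \t]*Bank Locator:[ \t]*/, ""); print $0, loc }
    '
}

# Demo on the relevant lines from the output above;
# prints "CPU0 DIMMA0" and then "CPU1 DIMMA0".
list_dimm_slots <<'EOF'
        Locator: DIMMA0
        Bank Locator: CPU0
        Locator: DIMMA0
        Bank Locator: CPU1
EOF
```

On a real server you would run it as dmidecode -t memory | list_dimm_slots (as root) and look for the "CPU1 DIMMA0" line.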
 
(On my workstation, dmidecode actually shows the Part Number and Serial Number for my DIMMs, which is very useful.)

In addition to looking at errors on the console and in the logs, you can see error counts per MC/CPU, row/csrow, and channel under /sys/devices/system/edac. In my case the errors were only on MC1, csrow1, channel 0:


# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow4/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow5/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow6/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow7/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:6941652
 /sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow3/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow4/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow5/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow6/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow7/ch1_ce_count:0
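
With this many counters, the one non-zero line is easy to miss. Here is a minimal sketch that prints only the counters above zero. nonzero_ce_counts is a hypothetical helper; it assumes the per-csrow ch*_ce_count layout shown above and takes an optional base directory so it can be tried against a copy of the tree.

```shell
# Hypothetical helper: print only the ch*_ce_count files whose value is > 0.
# The base directory defaults to the sysfs path used above.
nonzero_ce_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue      # glob matched nothing, or file unreadable
        n=$(cat "$f")
        # Non-numeric content makes the test fail quietly and is skipped.
        [ "$n" -gt 0 ] 2>/dev/null && echo "$f:$n"
    done
    return 0
}

nonzero_ce_counts
```

On the server above this should leave just the mc1/csrow1/ch0_ce_count line. Appending | grep -v ':0$' to the original grep command gets a similar result without a function.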