Monday, July 26, 2010

fmadm: Solaris Fault Manager Defined

Fault management allows system software to send telemetry data to the fmd(1m) daemon, which then diagnoses the problem and takes action (e.g., offlining a faulty component and logging an error with FMRI/UUID information to syslog) based on the type of event received. The diagnosis phase is controlled by a set of diagnosis engines, which can be viewed with the fmadm(1m) utility's “config” subcommand:

 # fmadm config
 MODULE                   VERSION STATUS  DESCRIPTION
 cpumem-diagnosis         1.6     active  CPU/Memory Diagnosis
 cpumem-retire            1.1     active  CPU/Memory Retire Agent
 disk-transport           1.0     active  Disk Transport Agent
 eft                      1.16    active  eft diagnosis engine
 etm                      1.2     active  FMA Event Transport Module
 fabric-xlate             1.0     active  Fabric Ereport Translater
 fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
 io-retire                1.0     active  I/O Retire Agent
 snmp-trapgen             1.0     active  SNMP Trap Generation Agent
 sp-monitor               1.0     active  Service Processor Monitor
 sysevent-transport       1.0     active  SysEvent Transport Agent
 syslog-msgs              1.0     active  Syslog Messaging Agent
 zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
 zfs-retire               1.0     active  ZFS Retire Agent
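With this many modules, one that is not running is easy to miss by eye. The filter below is a minimal sketch, not part of fmadm itself: the function name inactive_modules is hypothetical, and it assumes the column layout shown above (module name first, STATUS third).

```shell
# Print the name of every module whose STATUS column is not "active".
# Reads `fmadm config` output on stdin. Assumes the layout shown above:
# column 1 = MODULE, column 3 = STATUS. The header line is skipped.
inactive_modules() {
    awk '$1 != "MODULE" && NF >= 3 && $3 != "active" { print $1 }'
}

# Usage on a Solaris host (as root):
#   fmadm config | inactive_modules
```

For the listing above it prints nothing, since every module is active.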

If the fault manager daemon (fmd) detects a fault, it logs a detailed message to syslog and updates the fault manager error and fault logs. The contents of these log files can be viewed with the fmdump(1m) utility:

 # fmdump -v
 TIME                 UUID                                 SUNW-MSG-ID
 fmdump: /var/fm/fmd/fltlog is empty


 # fmdump -e -v
 TIME                 CLASS                                 ENA
 fmdump: /var/fm/fmd/errlog is empty

If a device is diagnosed as faulty, this will be indicated in the fmadm(1m) “faulty” output:


 # fmadm faulty
 STATE RESOURCE / UUID
 -------- ---------------------------------------------------------

The fault management daemon (fmd) keeps track of service events and numerous pieces of key statistical data. This information can be accessed and printed with the fmstat(1m) utility: 


 # fmstat
 module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
 cpumem-diagnosis         0       0  0.0    0.1   0   0     0     0   3.0K      0
 cpumem-retire            0       0  0.0    0.0   0   0     0     0    12b      0
 disk-transport           0       0  0.0    1.6   0   0     0     0    40b      0
 eft                      0       0  0.0    0.2   0   0     0     0   925K      0
 etm                      0       0  0.0    0.0   0   0     0     0   8.2K   144b
 fabric-xlate             0       0  0.0    0.1   0   0     0     0      0      0
 fmd-self-diagnosis     309       0  0.0    0.0   0   0     0     0      0      0
 io-retire                0       0  0.0    0.0   0   0     0     0      0      0
 snmp-trapgen             0       0  0.0    0.0   0   0     0     0      0      0
 sp-monitor               0       0  0.0   46.9   0   0     0     0    24b      0
 sysevent-transport       0       0  0.0   27.2   0   0     0     0      0      0
 syslog-msgs              0       0  0.0    0.0   0   0     0     0    32b      0
 zfs-diagnosis            8       0  0.0    0.9   0   0     0     0      0      0
 zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0
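Most rows in this table are zeros, so it can help to show only the modules that have actually received events. The following is a rough sketch under the assumption that ev_recv is the second column, as in the output above; the function name modules_with_events is made up for illustration.

```shell
# Print "module ev_recv" for every module that has received at least one event.
# Reads `fmstat` output on stdin. Assumes column 1 = module, column 2 = ev_recv.
modules_with_events() {
    awk '$1 != "module" && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1, $2 }'
}

# Usage on a Solaris host (as root):
#   fmstat | modules_with_events
```

Run against the sample output above, this would list fmd-self-diagnosis (309 events) and zfs-diagnosis (8 events).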

To clear FMA faults and error logs on Solaris:

Show faults in FMA: 
 
 # fmadm faulty

For each fault listed in the 'fmadm faulty' output, mark it as repaired, using the fault's UUID as the event ID:

 # fmadm repair event-ID
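When several faults are listed, typing each UUID by hand is error-prone. The helper below is a sketch only, assuming a POSIX shell and that the UUIDs appear in lowercase hexadecimal; the function name repair_commands is hypothetical. It does not depend on the exact layout of the 'fmadm faulty' output: it picks out anything shaped like a UUID and prints the matching repair command as a dry run, without executing it.

```shell
# Turn every UUID-shaped token on stdin into an `fmadm repair` command line.
# Dry run: the commands are printed, not executed.
repair_commands() {
    # Split the input into runs of lowercase hex digits and dashes, one per line.
    tr -cs '0-9a-f-' '\n' |
        sort -u |
        while read -r token; do
            case $token in
                *--*) ;;   # ignore separator lines made of dashes
                ????????-????-????-????-????????????)
                    echo "fmadm repair $token" ;;
            esac
        done
}

# Usage on a Solaris host (as root); review the output before running it:
#   fmadm faulty | repair_commands
```

Printing the commands rather than running them leaves a chance to confirm that each UUID really belongs to a fault you intend to clear.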

Clear the error log, fault log, eft checkpoints, and resource cache:

 # cd /var/fm/fmd
 # rm e* f* c*/eft/* r*/*

Reset the fmd serd modules:

 # fmadm config
 # fmadm reset cpumem-diagnosis
 # fmadm reset cpumem-retire
 # fmadm reset eft
 # fmadm reset io-retire
 # fmadm config

Reset or refresh any disabled modules:

 # fmadm config
(Check the output and note any module that is missing.)

Check whether the fmd service is online:

 # svcs -a fmd

Check whether the disabled module has an entry under:

 # ls /var/fm/fmd/ckpt

Clear the faulted or disabled module with:

 # fmadm repair fmd:///module/module-name

Restore and activate the disabled module:

 # svcadm disable -st fmd
 # cd /var/fm/fmd/ckpt
 # mv module-name save.module-name
 # svcadm enable fmd

Confirm that the disabled module is now active:

 # fmadm config
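The same check can be scripted so that it returns an exit status instead of requiring a visual scan. This is a minimal sketch, assuming a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk rather than the legacy /usr/bin/awk) and the column layout of the config listing shown earlier; the function name module_is_active is hypothetical.

```shell
# Succeed (exit 0) if the named module is listed with STATUS "active".
# $1 = module name; stdin = `fmadm config` output.
module_is_active() {
    awk -v name="$1" '
        $1 == name && $3 == "active" { found = 1 }
        END { exit !found }
    '
}

# Usage on a Solaris host (as root), with module-name as a placeholder:
#   if fmadm config | module_is_active module-name; then
#       echo "module-name is active"
#   else
#       echo "module-name is still missing or not active"
#   fi
```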

If you are interested in learning more about this amazingly cool technology, you can check out the following resources: 

Sun's Fault Management Presentation

