Monday, July 26, 2010
fmadm: Solaris Fault Manager Defined
Fault management allows system software to send telemetry data to the fmd(1m) daemon, which diagnoses the problem and takes action (e.g., offlining a faulty component and logging an error with FMRI/UUID information to syslog) based on the type of event received. The diagnosis phase is controlled by a set of diagnosis engines, which can be viewed with the fmadm(1m) utility's “config” option:
# fmadm config
MODULE             VERSION STATUS DESCRIPTION
cpumem-diagnosis   1.6     active CPU/Memory Diagnosis
cpumem-retire      1.1     active CPU/Memory Retire Agent
disk-transport     1.0     active Disk Transport Agent
eft                1.16    active eft diagnosis engine
etm                1.2     active FMA Event Transport Module
fabric-xlate       1.0     active Fabric Ereport Translater
fmd-self-diagnosis 1.0     active Fault Manager Self-Diagnosis
io-retire          1.0     active I/O Retire Agent
snmp-trapgen       1.0     active SNMP Trap Generation Agent
sp-monitor         1.0     active Service Processor Monitor
sysevent-transport 1.0     active SysEvent Transport Agent
syslog-msgs        1.0     active Syslog Messaging Agent
zfs-diagnosis      1.0     active ZFS Diagnosis Engine
zfs-retire         1.0     active ZFS Retire Agent
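On a system with many modules it can help to filter this listing down to anything that is not active. The following is a minimal sketch, assuming the four-column layout shown above; the function name inactive_modules and the "failed" sample line are made up for illustration and do not come from a real system.

```shell
#!/bin/sh
# Print the name and status of every module whose STATUS column is not "active".
# Reads `fmadm config` output on stdin and skips the header line.
inactive_modules() {
    awk 'NR > 1 && $3 != "active" { print $1, $3 }'
}

# Sample input: lines from the listing above, with one status changed to
# "failed" purely for illustration.
sample_config='MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.6 active CPU/Memory Diagnosis
eft 1.16 failed eft diagnosis engine
zfs-retire 1.0 active ZFS Retire Agent'

printf '%s\n' "$sample_config" | inactive_modules
```

On a live system the same filter would be fed directly, as in `fmadm config | inactive_modules`.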
If the fault manager daemon (fmd) detects a fault, it will log a detailed message to syslog, and update the fault manager error and fault logs. The contents of these logfiles can be viewed with the fmdump(1m) utility:
# fmdump -v
TIME UUID SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty
# fmdump -e -v
TIME CLASS ENA
fmdump: /var/fm/fmd/errlog is empty
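Once the fault log is no longer empty, a quick tally per message ID gives a rough picture of what has been diagnosed. This is only a sketch: it assumes each entry in plain `fmdump` output is a single line of the form month, day, time, UUID, message ID, and the sample entries (UUIDs and the MSGID-A/MSGID-B identifiers) are invented placeholders rather than real log data.

```shell
#!/bin/sh
# Count fault log entries per message ID from plain `fmdump` output on stdin.
# Assumed line layout: month day time UUID SUNW-MSG-ID (message ID in field 5).
# The header line and any "fmdump: ... is empty" notice are ignored.
count_by_msgid() {
    awk '$1 != "TIME" && $1 != "fmdump:" && NF >= 5 { n[$5]++ }
         END { for (id in n) print n[id], id }' | sort -k2
}

# Invented sample data for illustration only.
sample_fltlog='TIME UUID SUNW-MSG-ID
Jul 26 10:15:02.1234 11111111-2222-3333-4444-555555555555 MSGID-A
Jul 26 10:17:40.0042 66666666-7777-8888-9999-aaaaaaaaaaaa MSGID-A
Jul 26 11:02:11.9000 bbbbbbbb-cccc-dddd-eeee-ffffffffffff MSGID-B'

printf '%s\n' "$sample_fltlog" | count_by_msgid
```

With an empty log, as in the output above, the function prints nothing.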
If a device is diagnosed as faulty, this will be indicated in the fmadm(1m) “faulty” output:
# fmadm faulty
STATE RESOURCE / UUID
-------- ---------------------------------------------------------
The fault management daemon (fmd) keeps track of service events and numerous pieces of key statistical data. This information can be accessed and printed with the fmstat(1m) utility:
# fmstat
module             ev_recv ev_acpt  wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-diagnosis         0       0   0.0    0.1   0   0     0     0   3.0K      0
cpumem-retire            0       0   0.0    0.0   0   0     0     0    12b      0
disk-transport           0       0   0.0    1.6   0   0     0     0    40b      0
eft                      0       0   0.0    0.2   0   0     0     0   925K      0
etm                      0       0   0.0    0.0   0   0     0     0   8.2K   144b
fabric-xlate             0       0   0.0    0.1   0   0     0     0      0      0
fmd-self-diagnosis     309       0   0.0    0.0   0   0     0     0      0      0
io-retire                0       0   0.0    0.0   0   0     0     0      0      0
snmp-trapgen             0       0   0.0    0.0   0   0     0     0      0      0
sp-monitor               0       0   0.0   46.9   0   0     0     0    24b      0
sysevent-transport       0       0   0.0   27.2   0   0     0     0      0      0
syslog-msgs              0       0   0.0    0.0   0   0     0     0     32b      0
zfs-diagnosis            8       0   0.0    0.9   0   0     0     0      0      0
zfs-retire               0       0   0.0    0.0   0   0     0     0      0      0
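Most rows in this table are zeros, so it can be handy to show only the modules that have actually received events. A minimal sketch, assuming ev_recv is the second whitespace-separated column as in the output above; the function name busy_modules is hypothetical, and the sample rows are copied from the listing.

```shell
#!/bin/sh
# Print module name and ev_recv for every module that has received events.
# Reads `fmstat` output on stdin; the header line is skipped.
busy_modules() {
    awk 'NR > 1 && $2 + 0 > 0 { print $1, $2 }'
}

# Three rows taken from the fmstat output above.
sample_fmstat='module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
eft 0 0 0.0 0.2 0 0 0 0 925K 0
fmd-self-diagnosis 309 0 0.0 0.0 0 0 0 0 0 0
zfs-diagnosis 8 0 0.0 0.9 0 0 0 0 0 0'

printf '%s\n' "$sample_fmstat" | busy_modules
```

On a live system this would be run as `fmstat | busy_modules`.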
To clear FMA faults and error logs on Solaris:
Show faults in FMA:
# fmadm faulty
For each fault listed by 'fmadm faulty', run:
# fmadm repair event-ID
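When several faults are listed, the repair commands can be generated instead of typed by hand. The sketch below is a dry run: it only prints one `fmadm repair` line per UUID it finds and executes nothing. It assumes UUIDs appear as separate whitespace-delimited fields in the `fmadm faulty` output; the function name repair_cmds, the sample resource FMRI, and the sample UUID are all invented for illustration.

```shell
#!/bin/sh
# Print one `fmadm repair` command per UUID found on stdin (nothing is executed).
# A field is treated as a UUID if it is 36 characters long, consists of hex
# digits and dashes, contains at least one hex digit, and has dashes at
# positions 9, 14, 19 and 24.
repair_cmds() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 && $i ~ /^[0-9a-fA-F-]+$/ &&
                $i ~ /[0-9a-fA-F]/ &&
                substr($i, 9, 1) == "-" && substr($i, 14, 1) == "-" &&
                substr($i, 19, 1) == "-" && substr($i, 24, 1) == "-")
                print "fmadm repair " $i
    }'
}

# Invented sample output for illustration only.
sample_faulty='   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded zfs://pool=example/vdev=0123456789abcdef
         11111111-2222-3333-4444-555555555555'

printf '%s\n' "$sample_faulty" | repair_cmds
```

After reviewing the printed commands, they could be run by piping the result to sh, for example `fmadm faulty | repair_cmds | sh`.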
Clear error reports and resource cache:
# cd /var/fm/fmd
# rm e* f* c*/eft/* r*/*
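Because the rm globs are terse, it may be worth previewing what they expand to before deleting anything. The sketch below uses the pattern `e* f* c*/eft/* r*/*` against a throwaway mock directory. The mock layout (errlog, fltlog, ckpt/eft/, rsrc/, xprt/) is an assumption about what /var/fm/fmd typically contains, and the file names case1 and res1 are placeholders.

```shell
#!/bin/sh
# Build a throwaway mock of /var/fm/fmd (assumed layout, not a real system).
mock=$(mktemp -d)
trap 'rm -rf "$mock"' EXIT
mkdir -p "$mock/ckpt/eft" "$mock/rsrc" "$mock/xprt"
touch "$mock/errlog" "$mock/fltlog" "$mock/ckpt/eft/case1" "$mock/rsrc/res1"

# List what the cleanup globs would match inside directory $1, without
# removing anything. Patterns that match nothing are silently skipped.
preview_cleanup() {
    ( cd "$1" && ls -d e* f* c*/eft/* r*/* 2>/dev/null | sort )
}

preview_cleanup "$mock"
```

Running `preview_cleanup /var/fm/fmd` on the real directory shows exactly which files the rm command would remove.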
Reset the fmd serd modules:
# fmadm config
# fmadm reset cpumem-diagnosis
# fmadm reset cpumem-retire
# fmadm reset eft
# fmadm reset io-retire
# fmadm config
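The reset commands above follow one pattern, so they can be produced from a list of module names. This is a small dry-run sketch: the function name reset_cmds is hypothetical, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Print one `fmadm reset` command per module name given as an argument.
# Nothing is executed; the output is meant to be reviewed first.
reset_cmds() {
    for module in "$@"; do
        printf 'fmadm reset %s\n' "$module"
    done
}

# The four modules reset in the steps above.
reset_cmds cpumem-diagnosis cpumem-retire eft io-retire
```

Once the list looks right, the output can be piped to sh to perform the resets.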
Reset or refresh any disabled modules:
# fmadm config
(Check and confirm missing module)
Check whether the fmd service is online:
# svcs -a fmd
Check whether the disabled module has an entry under the checkpoint directory:
# ls /var/fm/fmd/ckpt
Clear the Faulted / Disabled module via:
# fmadm repair fmd:///module/module-name
Restore and activate the disabled module:
# svcadm disable -st fmd
# cd /var/fm/fmd/ckpt
# mv module-name save.module-name
# svcadm enable fmd
Confirm the disabled module is now active:
# fmadm config
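The restore steps above can be collected into a single dry-run helper that prints the command sequence for a given module. This is a sketch only: the function name restore_module_cmds is hypothetical, zfs-diagnosis is just an example argument, and nothing is executed.

```shell
#!/bin/sh
# Print the restore sequence for the module named in $1 (nothing is executed).
# The steps mirror the procedure above: stop fmd, move the module's checkpoint
# entry aside, start fmd again, then list the modules to confirm.
restore_module_cmds() {
    module=$1
    cat <<EOF
svcadm disable -st fmd
cd /var/fm/fmd/ckpt
mv $module save.$module
svcadm enable fmd
fmadm config
EOF
}

restore_module_cmds zfs-diagnosis
```

Printing the commands first makes it easy to check the module name before the checkpoint directory is touched.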
If you are interested in learning more about this amazingly cool technology, you can check out the following resources:
Sun's Fault Management Presentation