Tuesday, March 1, 2011

EDAC: Which DIMM?

EDAC (Error Detection and Correction) messages provide information about hardware problems with the system memory. Some of the reported errors are correctable, and some are uncorrectable.
EDAC is documented at http://www.kernel.org/doc/Documentation/edac.txt

On our HP hardware running RHEL5 we often see DIMMs going bad, with errors like the following in syslog:


 EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
 EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac
 EDAC MC1: CE - no information available: k8_edac Error Overflow set
 EDAC k8 MC1: extended error code: ECC chipkill x4 error

EDAC (Error Detection and Correction) tries to detect and correct hardware problems. In this case it appears that chipkill is detecting the problem and correcting it, so significant hardware problems may not be experienced in the short term; however, it is recommended to have the DIMMs checked and the faulty one replaced.

To locate which DIMM is having the issue, go back to the EDAC errors above from my server's console: MC1 (Memory Controller 1) means CPU1, row 1 is referred to as csrow1 (Chip-Select Row 1) in the Linux EDAC documentation, and channel 0 means memory channel 0.

EDAC MC1: CE page 0x103ca78, offset 0xf88, grain 8, syndrome 0x9f65, row 1, channel 0, label "": k8_edac


            Channel 0       Channel 1
    ===================================
    csrow0  | DIMM_A0       | DIMM_B0 |
    csrow1  | DIMM_A0       | DIMM_B0 |
    ===================================

    ===================================
    csrow2  | DIMM_A1       | DIMM_B1 |
    csrow3  | DIMM_A1       | DIMM_B1 |
    ===================================
 
Of course, there are actually two DIMM slots called DIMMA0 on my server (one for each CPU), but again the MC1 error corresponds to CPU1, which is listed under "Bank Locator" in the output of dmidecode:


 # dmidecode -t memory | grep DIMMA0 -B9 -A8
 Handle 0x002E, DMI type 17, 27 bytes
 Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU0
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer: 
        Serial Number: 
        Asset Tag: 
        Part Number: 
 --
 Handle 0x003E, DMI type 17, 27 bytes.
 Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 4096 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA0
        Bank Locator: CPU1
        Type: DDR2
        Type Detail: Synchronous
        Speed: 533 MHz (1.9 ns)
        Manufacturer: 
        Serial Number: 
        Asset Tag: 
        Part Number:
 
(On my workstation, dmidecode actually shows the Part Number and Serial Number for my DIMMs, which is very useful.)
In addition to looking at errors on the console and in logs, you can also see errors per MC/CPU, row/csrow, and channel by examining /sys/devices/system/edac. In my case the errors were only on MC1, csrow1, channel 0:


# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
 /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow4/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow5/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow6/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow6/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow7/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc0/csrow7/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:6941652
 /sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow3/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow3/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow4/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow5/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow6/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow6/ch1_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow7/ch0_ce_count:0
 /sys/devices/system/edac/mc/mc1/csrow7/ch1_ce_count:0
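
Since these counters are easy to read from a script, a tiny cron job can flag new correctable errors via syslog. This is only a sketch; the log priority and message wording are my own choices, not anything EDAC provides:

 #!/bin/sh
 # Report any non-zero corrected-error counters to syslog (sketch only).
 for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
     count=`cat $f`
     if [ "$count" -gt 0 ]; then
         logger -p daemon.warning "EDAC CE count $count in $f"
     fi
 done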


Monday, November 29, 2010

java: Alert on java crashes

Over the last couple of days I have noticed a number of irregular core dumps created in my system core file dump location, /var/core:

 -rw-------   1 root    root  2529790203 Nov 10 11:55 core_host1_java_1094_300_1289350401_28578
 -rw-------   1 root    root  2564932547 Nov 15 13:06 core_host1_java_1094_300_1289786684_1664
 -rw-------   1 root    root  2498732827 Nov 17 17:29 core_host1_java_9092_300_1289975232_5664
 -rw-------   1 root    root  2525420387 Nov 19 12:08 core_host3_java_1094_300_1290128885_16234


Depending on how you've set up your core file dump pattern, you can determine which process, application, and user account it's coming from just by reading the core file name, e.g.:

 # coreadm|grep pattern
    global core file pattern: /var/core/core_%n_%f_%u_%g_%t_%p
    init core file pattern: /var/core/core_%n_%f_%u_%g_%t_%p

 %n : system node name (uname -n)
 %f : executable file name
 %u : effective user ID
 %g : effective group ID
 %t : time in seconds since the epoch (January 1, 1970)
 %p : process ID
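
If your core file pattern is not already configured, something along the following lines sets the same global pattern; the path is just the one I use, and the optional "-e log" additionally logs a syslog message whenever a global core file is attempted:

 # coreadm -g /var/core/core_%n_%f_%u_%g_%t_%p -e global -e log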
 
My core dumps are coming from a Java process. Bugs can occur in a Java runtime environment, and most administrators would want to be notified.
If you need to take corrective action and diagnose further, you will need to be alerted at the time of the incident.
The Java runtime has a number of useful options that can be used for this purpose. The first option is “-XX:OnOutOfMemoryError”, which allows a command to be run when the runtime environment incurs an out of memory condition. When this option is combined with the logger command line utility:

 java -XX:OnOutOfMemoryError="logger Java process %p encountered an OOM condition" ...

Syslog entries will be generated each time an out-of-memory (OOM) condition occurs.

Another useful option is “-XX:OnError”, which allows a command to be run when the runtime environment incurs a fatal error (i.e., a hard crash). When this option is combined with the logger utility:

 java -XX:OnError="logger -p user.err Java process %p encountered a fatal condition" ...

Syslog entries will be generated when a fatal event occurs.

The options above allow you to run one or more commands when these errors are encountered, so you could chain together a utility (logger or mail) to generate alerts, and maybe a restarter script to start a new Java process.
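
As a rough sketch of that chaining idea (the restart script path and jar name below are made up for illustration), several commands can be separated with semicolons inside the option value:

 java -XX:OnError="logger -p user.err Java %p crashed; /opt/scripts/restart-app.sh %p" \
      -XX:OnOutOfMemoryError="logger -p user.err Java %p encountered an OOM condition" \
      -jar application.jar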



Monday, November 22, 2010

zones: Physical to Virtual (P2V) Migration

Since the release of Solaris 10 Update 9 I have been keen to try out its new capabilities. One of these is the ability to move an existing Oracle Solaris 10 physical system quickly and easily into a virtual container on a separate system, along with the Host ID migration feature it also provides.

In this post I will demonstrate a P2V migration of a physical Solaris system into a zone, using Host ID migration. Keep in mind that in this example both systems are built from the Solaris 10 Update 9 release and use ZFS root.

1. Collect the system information from the host you wish to migrate into a zone.
Obtain the hostname:

 # hostname

Obtain the hostid:

 # hostid
 
Obtain the root password.
Review what software is running on the system, if necessary.
Check the network configuration on the system:

 # ifconfig -a   

Review the storage utilized, for example the contents of /etc/vfstab.
Check the amount of local disk storage in use and confirm that the target host has enough space for the install.
Examine /etc/system for any specific or unique changes you may wish to keep or change once the environment has been virtualized. (A small collection script is sketched below.)
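
A quick way to capture all of the above in one go is a throwaway script along these lines (the output path is only a suggestion):

 #!/usr/bin/ksh
 # Capture the source system's identity and configuration before migration.
 OUT=/var/tmp/p2v-`hostname`.txt
 {
   echo "hostname : `hostname`"
   echo "hostid   : `hostid`"
   echo "--- ifconfig -a ---";  ifconfig -a
   echo "--- /etc/vfstab ---";  cat /etc/vfstab
   echo "--- df -k ---";        df -k
   echo "--- /etc/system ---";  cat /etc/system
 } > $OUT
 echo "System details saved to $OUT"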

2. Use the flarcreate command to create a flash archive image of the system you wish to migrate.
Log in as root to the source system and change to the root directory.

 # cd /
   
Run flarcreate using the following options:
Note: Here I have used -c to compress the archive and -L cpio to select the cpio archive method (you can use pax if you wish), then supplied the content name identifier of the archive with -n (best practice: s10u9-system-name). I have also used the -y option because this system has a separate /var dataset and is ZFS-root based, and finally supplied the destination path for the flar archive.
While the flar creation is running, be sure to monitor it and review any errors thoroughly.

 # flarcreate -c -L cpio -n name -y /var /path/to/flar
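
For example, on a host called websrv01 (the name and paths here are purely illustrative), the command might look like:

 # flarcreate -c -L cpio -n s10u9-websrv01 -y /var /var/tmp/s10u9-websrv01.flar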

3. Transfer the flar archive created to the destination target host.
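Any transfer method will do; a simple scp of the archive to the target's /var/tmp (again, the names are illustrative) is usually enough:

 # scp /var/tmp/s10u9-websrv01.flar root@targethost:/var/tmp/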

4. Create a new zone configuration on the target host:
Note: I don't want to inherit any packages from the global zone or any loop-back file systems, so I have created a whole-root zone configuration, and I have provided a new IP address so it doesn't conflict with the host being migrated, which is still alive.

You will also see that I have added a hostid entry in the zone configuration. When applications are migrated from a physical Solaris system into a zone on a new system, the hostid changes to the hostid of the new machine. In some cases applications depend on the original hostid, and it is not possible to update the application configuration; in these cases the zone can be configured to use the hostid of the original system by setting a zonecfg property, as shown below. The value used should be the output of the hostid command as run on the original system previously.


 # zonecfg -z hostname
 hostname: No such zone configured
 Use 'create' to begin configuring a new zone.
 zonecfg:hostname> create -b
 zonecfg:hostname> set autoboot=true
 zonecfg:hostname> set zonepath=/zones/hostname
 zonecfg:hostname> set bootargs="-m verbose"
 zonecfg:hostname> set hostid=84###375
 zonecfg:hostname> add net
 zonecfg:hostname:net> set physical=bge0
 zonecfg:hostname:net> set address=ip-address
 zonecfg:hostname:net> end
 zonecfg:hostname> verify
 zonecfg:hostname> commit
 zonecfg:hostname> exit


5. Install the zone on the target system using the flar archive created. Become the root user and install the configured zone using the install -a option with the path to the flar archive. Notice I have used the -p option because I want to preserve the system identity, so the zone will have the same identity as the system used to create the image. You can use -u instead to sys-unconfig the zone.
As a best practice, tail the zone installation log file and ensure no errors are found.


 # zoneadm -z hostname install -p -a /path/to/flar
 A ZFS file system has been created for this zone.
       Log File: /var/tmp/hostname.install_log.IQaGnI
     Installing: This may take several minutes...
 Postprocessing: This may take a while...
    Postprocess: Updating the zone software to match the global zone...
    Postprocess: Zone software update complete
    Postprocess: Updating the image to run within a zone

         Result: Installation completed successfully.
       Log File: /zones/hostname/root/var/log/hostname.install17462.log

6. Boot the zone into single-user mode, log in via the console (using, of course, the root password from the migrated host), and make any necessary checks.

 # zoneadm -z hostname boot -s
 # zlogin -C hostname
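
Once the zone is up, a couple of quick checks run from the global zone confirm the identity came across: the hostid should match the value collected in step 1, and the new address should be plumbed.

 # zlogin hostname hostid
 # zlogin hostname ifconfig -a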

From here you can see the P2V is complete; the hostname and data have been kept intact, including the hostid.
Now you can decide whether you need to make any further changes, such as hostname, network configuration, and so on, and of course prepare the migrated zone to boot into a live production environment.

Wednesday, November 17, 2010

news: Oracle Solaris 11 Express Download Available

Solaris 11 Express 2010.11 is now available for download.
An overview and documentation are also available on the Oracle Solaris 11 Express pages on oracle.com.

Solaris 11 Express allows administrators to test and deploy it within their enterprise environments and greatly simplifies their day-to-day operations. It contains many technology innovations that are not available in Oracle Solaris 10, such as new package management tools and utilities, built-in network virtualization, and support for the latest hardware platforms. A list of the features that are new to Oracle Solaris 11 Express is available there as well.

Oracle Solaris 11 Express is the latest release of the Oracle Solaris operating system. This release is the path forward for developers, end users, and partners using previous generations of OpenSolaris releases.
It gives administrators access to the latest technology and innovation that will form a future Oracle Solaris 11, which will be released sometime in 2011.

Here is a YouTube video tutorial if you wish to test it out in your VirtualBox setup. Enjoy:
http://www.youtube.com/watch?v=r5hlrqlQAIc

Monday, November 15, 2010

inetd: Disable inetd Connection Logging for individual Services

I noticed a large number of unwanted connection messages constantly appearing in my system messages file:

 Aug 31 18:36:39 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19080] from ip-address 45632
 Aug 31 18:36:39 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19081] from ip-address 45633
 Aug 31 18:40:35 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19288] from ip-address 48640
 Aug 31 18:40:39 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19290] from ip-address 48641
 Aug 31 18:41:05 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19333] from ip-address 48653
 Aug 31 18:41:05 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19333] from ip-address 48653
 Aug 31 18:41:05 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19334] from ip-address 48654
 Aug 31 18:45:51 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19543] from ip-address 48714
 Aug 31 18:45:52 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19544] from ip-address 48715
 Aug 31 18:50:09 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19781] from ip-address 48786
 Aug 31 18:50:09 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[19782] from ip-address 48787
 Aug 31 18:57:59 ausydwebt01 inetd[455]: [ID 317013 daemon.notice] vnetd[24199] from ip-address 48871

The above is coming from the Veritas NetBackup network connection daemon (vnetd), which constantly fills up my messages file during its nightly backup runs.

If inetd is running, its "tracing" feature can be used to log information about the source of any network connections seen by the daemon. Rather than disabling inetd tracing for all services, the administrator has the option of disabling tracing for an individual service with inetadm -m svcname tcp_trace=FALSE, where svcname is the name of the specific service that should no longer be traced.
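
If you did want to turn tracing off for every inetd-managed service instead, the -M option modifies the inetd defaults rather than a single service (per-service values, like the one set below, still override the default):

 # inetadm -M tcp_trace=FALSE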

1. The following command will display the properties for the vnetd service.


 # inetadm -l svc:/network/vnetd/tcp:default
 SCOPE    NAME=VALUE
          name="vnetd"
          endpoint_type="stream"
          proto="tcp"
          isrpc=FALSE
          wait=FALSE
          exec="/usr/openv/bin/vnetd"
          user="root"
 default  bind_addr=""
 default  bind_fail_max=-1
 default  bind_fail_interval=-1
 default  max_con_rate=-1
 default  max_copies=-1
 default  con_rate_offline=-1
 default  failrate_cnt=40
 default  failrate_interval=60
 default  inherit_env=TRUE
 default  tcp_trace=TRUE
 default  tcp_wrappers=TRUE
 default  connection_backlog=10

2. The following command will disable tracing for the vnetd service.

 # inetadm -m svc:/network/vnetd/tcp:default tcp_trace=FALSE

3. Confirm the changes using the display option again.

 
 # inetadm -l svc:/network/vnetd/tcp:default
 SCOPE    NAME=VALUE
          name="vnetd"
          endpoint_type="stream"
          proto="tcp"
          isrpc=FALSE
          wait=FALSE
          exec="/usr/openv/bin/vnetd"
          user="root"
 default  bind_addr=""
 default  bind_fail_max=-1
 default  bind_fail_interval=-1
 default  max_con_rate=-1
 default  max_copies=-1
 default  con_rate_offline=-1
 default  failrate_cnt=40
 default  failrate_interval=60
 default  inherit_env=TRUE
          tcp_trace=FALSE
 default  tcp_wrappers=TRUE
 default  connection_backlog=10
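
Should you ever want the connection logging back for this service, the same -m form re-enables it:

 # inetadm -m svc:/network/vnetd/tcp:default tcp_trace=TRUE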



Wednesday, November 10, 2010

news: Solaris 11 Express Summit

The slides are now available for the presentations from the Oracle Solaris 11 Express Summit at the LISA Conference, which was held on Tuesday, November 9th.
The event showcased Oracle Solaris 11 Express, targeting system administrators and architects.

Here are the slides available for the following presentations:
  • Slide 1 Introduction to Oracle Solaris 11 Express, Markus Flierl
  • Slide 2 Image Packaging System, Bart Smaalders
  • Slide 3 Deploying Oracle Solaris 11 in the Enterprise, Dave Miner
  • Slide 4 Advances in Solaris Networking with Crossbow and Beyond, Nicolas Droux
  • Slide 5 Oracle Solaris Containers in Oracle Solaris 11 Express, Dan Price
  • Slide 6 ZFS Features in Oracle Solaris Express, Cindy Swearingen
  • Slide 7 New Security Features in Oracle Solaris 11 Express, Glenn Faden
  • Slide 8 Deploying Applications Using SMF and Other Solaris 11 Features, Liane Praza

The video streams from the sessions have been recorded and are also available to view online.


Monday, November 8, 2010

JASS: Auditing & Controlling Output Logs

You can configure the Solaris Security Toolkit audit option to report or omit banners and messages. 

You might want to eliminate pass messages (JASS_LOG_SUCCESS variable) from the output so you can report and focus only on fail messages (JASS_LOG_FAILURE variable).

If the logging variable is set to 0, then no output is generated for messages of that type. Conversely, if the logging variable is set to 1, then messages are displayed. The default action for each of these variables is to display the output.

 JASS_LOG_BANNER   (all banner output) - Controls the display of banner messages. These messages are usually surrounded by separators made up of either equal sign ("=") or dash ("-") characters.

 JASS_LOG_ERROR    [ERR]  - Controls the display of error messages. If set to 0, no error messages will be generated.

 JASS_LOG_FAILURE  [FAIL] - Controls the display of failure messages. If set to 0, no failure messages will be generated.

 JASS_LOG_NOTICE   [NOTE] - Controls the display of notice messages. If set to 0, no notice messages will be generated.

 JASS_LOG_SUCCESS  [PASS] - Controls the display of success or passing status messages. If set to 0, no success messages will be generated.

 JASS_LOG_WARNING  [WARN] - Controls the display of warning messages. If set to 0, no warning messages will be generated.


Using these options is very useful when you only need to view specific messages. By setting these options, you can minimize output, yet still focus on areas you deem critical. For example, by setting all logging variables to 0 except for JASS_LOG_FAILURE (leave it at the default of 1), the audit reports only on failures
generated by the logFailure function.


 # JASS_LOG_FAILURE=1
 # JASS_LOG_NOTICE=0
 # JASS_LOG_SUCCESS=0
 # JASS_LOG_WARNING=0
 # export JASS_LOG_WARNING JASS_LOG_SUCCESS JASS_LOG_NOTICE JASS_LOG_FAILURE

 # ./jass-execute -a secure.driver -V 2
 update-at-deny                 [FAIL] User test is not listed in /etc/cron.d/at.deny.
 update-at-deny                 [FAIL] Audit Check Total : 1 Error(s)
 update-inetd-conf              [FAIL] Service ftp is enabled in /etc/inet/inetd.conf.
 update-inetd-conf              [FAIL] Service telnet is enabled in /etc/inet/inetd.conf.
 update-inetd-conf              [FAIL] Service rstatd is enabled in /etc/inet/inetd.conf.
 update-inetd-conf              [FAIL] Audit Check Total : 3 Error(s)

Here is a JASS auditing script that can be run weekly, monthly, or yearly, however you choose. The audit will alert on any system changes via email, sent to the address set in the MAIL_LIST variable. The script requires a Repository directory under /opt/SUNWjass.

jass-audit.sh

 #!/usr/bin/ksh
 # jass-audit.sh - run a JASS audit and mail any failures to MAIL_LIST.

 HOST=`hostname`
 TIMESTAMP=`date +%H%M.%d%m`
 SPOOL="/opt/SUNWjass"
 L_LOG="$SPOOL/Repository/Jass_Audit.$TIMESTAMP"
 L_OUT="$SPOOL/Repository/Jass_Audit.$TIMESTAMP.OUT"
 MAIL_LIST=""

 # Report failures only; suppress notice, success and warning messages.
 JASS_LOG_FAILURE=1
 JASS_LOG_NOTICE=0
 JASS_LOG_SUCCESS=0
 JASS_LOG_WARNING=0
 export JASS_LOG_WARNING JASS_LOG_SUCCESS JASS_LOG_NOTICE JASS_LOG_FAILURE

 # Run the audit and write its output into the Repository directory.
 $SPOOL/bin/jass-execute -a server-secure.driver -V 2 -o $L_LOG

 if [ -f $L_LOG ]; then
    # Count the [FAIL] entries; only send mail if there is at least one.
    ERR=`grep FAIL $L_LOG|wc -l`
     if [ $ERR -ne 0 ]; then
      echo "Solaris Security Log: AUDIT (${HOST}) $TIMESTAMP" > $L_OUT
      echo "" >> $L_OUT
      echo "File : "$L_LOG" " >> $L_OUT
      echo "========================================================"  >> $L_OUT
      echo "Failures : " >> $L_OUT
      # List the failures, excluding the "Audit Check Total" count lines.
      grep FAIL $L_LOG | egrep -v Error >> $L_OUT
      echo "========================================================"  >> $L_OUT
      # Append the run summary from the end of the audit log.
      tail -12 $L_LOG >> $L_OUT
      mailx -s "Solaris Security Toolkit Log: AUDIT (${HOST})" $MAIL_LIST < $L_OUT
     else
      exit 0
     fi
 fi
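
To run the audit weekly, a root crontab entry along these lines will do; the 2am Sunday schedule is arbitrary, and the path is simply wherever you saved the script:

 0 2 * * 0 /opt/SUNWjass/jass-audit.sh >/dev/null 2>&1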

Output Example:


 Solaris Security Log: AUDIT (host-name) 1301.0211

 File : /opt/SUNWjass/Repository/Jass_Audit.1301.0211
 ========================================================
 Failures :
 update-at-deny                 [FAIL] User test is not listed in /etc/cron.d/at.deny.
 ========================================================
 server-secure.driver           [SUMMARY] Results Summary for AUDIT run of server-secure.driver
 server-secure.driver           [SUMMARY] The run completed with a total of 84 scripts run.
 server-secure.driver           [SUMMARY] There was a Failure  in   1 Script
 server-secure.driver           [SUMMARY] There were  Errors   in   0 Scripts
 server-secure.driver           [SUMMARY] There was a Warning  in   1 Script
 server-secure.driver           [SUMMARY] There were  Notes    in  19 Scripts
 server-secure.driver           [SUMMARY] Failure Scripts listed in:
 server-secure.driver                   /var/opt/SUNWjass/run/20101102130155/jass-script-failures.txt
 server-secure.driver           [SUMMARY] Warning Scripts listed in:
 server-secure.driver                   /var/opt/SUNWjass/run/20101102130155/jass-script-warnings.txt
 server-secure.driver           [SUMMARY] Notes Scripts listed in:
 server-secure.driver                   /var/opt/SUNWjass/run/20101102130155/jass-script-notes.txt