Installing and Maintaining CA AppLogic › BFC User Guide › BFC Troubleshooting

BFC Troubleshooting

This section contains the following topics:

Deployment Failure

Problem

If you redeploy a node multiple times, or destroy and recreate a grid several times, a node boots from the utility image, but the CA AppLogic deployment fails. The following messages appear in the node console:

  EXECUTING INSTRUCTIONS...
etcrcS.dS7Orun: line 6: ./run: Permission denied
BFC utility image processing is complete. Console login is disabled.
udevd event[1000]: wait_for_sysfs: waiting for  /sys/devices/pciOOOO:OO/OOOO:
OO:
1f .Z. hostOtargetO:O:Oioerr_cnt  failed
udevd-event[99?]: wait_for_sysfs: waiting for  sysdevicespciOOOO:OOOOOO:OO:1
f.2,hostl/targetl:O:O,ioerrcnt  failed
udevd event[1005]: wait_for_sysfs: waiting for  ,sys,devices,pciOOOO:OO,OOOO:
OO:
1f .Z. hostZtargetZ:O:Oioerr_cnt  failed
[ 33.62331?] sd 0:0:0:0: Attached scsi generic sg  type 0

Reason

An incomplete cleanup of the old configuration in the BFC causes this behavior. The BFC caches the MAC addresses of the deployed servers.

Solution

The BFC also keeps a copy under the /boot_command/config directory. Look in that directory for one file with the MAC address name of the effected server. Delete that file and retry deployment.

IPMI Remote Power Management Problems

As part of discovery and inventory process, the BFC attempts to configure the servers it discovers for remote power management via IPMI. When configured successfully, the server is 'power controlled' and the BFC can intelligently control the power management operations on the server. Failure to configure the BMC (Baseboard Management Controller) for IPMI LAN access results in a server in 'manual power' mode.

You can identify a server in 'manual power' mode wherever the server status is identified in the user interface. Specifically:

When you hover the mouse over the server name to display the status. The following message displays: "Compute host has no power control and will never be shut down. If a power failure occurs and the host does not return to online status, user intervention will be required." That is, you will need to manually power cycle the server.
On the Power Type column of the Power tab in the Server properties page. The following text displays: "Manual" or “IPMI (Warning: Unable to configure user)”.

When a 'power controlled' server is not responding to remote power status or action requests it is considered 'degraded'. The 'degraded' state refers to an intermittent state resulting from sporadic communication failures or a temporarily non-responsive BMC. Typically these conditions self-correct.

However, another condition can lead to the 'degraded' state for a server, resulting from the inability to set up the BFC PowerAdmin user account for which the remote IPMI calls get authenticated.

When a server is discovered, the BFC attempts to add the BFC's user (PowerAdmin__BFC) to the power controller. If that attempt fails for some reason, the BFC then uses the system-wide IPMI password. The server is degraded, and as noted above, the user configuration failure message appears.

There are two cases in which the fallback to the system-wide IPMI password may fail:

A system-wide IPMI password is not set.
A non-BFC user/password is already specified for the server. Note: Server-specific credentials are always used irrespective of the ability to configure the PowerAdmin_BFC user.

Do one of the following to change the power state from 'degraded' to 'power controlled'. In either case, you are entering credentials for an existing user.

Specify a user for all servers. (For example, you can designate a single user name with the same password on all your power controllers. Use the Backup IPMI Authentication option on the BFC Tab of the Administration page in the BFC.)
Specify a user on a 'per-box' basis. (For example, if you have individual user/passwords on each box. On the Servers page, click the server you want to work with. On the Power tab, select the server, then click the Power Action drop-down menu and select Set Password. You are prompted to set User ID and Password credentials.)

Note: Although the BFC may attempt to put its own user on the power controller, the BFC never changes or deletes any existing users already configured on a power controller.

Grid Recovery Procedures

Grid Controller Failure

This topic covers various types of grid controller failures that need manual intervention by the grid administrator to restore grid controller operation.

Note: We strongly recommend that you read the CA AppLogic high-availability reference to familiarize yourself with CA AppLogic's high-availability capabilities; especially related to grid controller HA and the types of failures that may occur.

Types of Grid Controller Failures

CA AppLogic automatically recovers from two types of grid controller failures:

Failure of the grid controller virtual machine that runs on the grid's primary server (that is, software failure)
Failure of the primary server that is running the grid controller virtual machine (that is, hardware failure, software failure)

In certain failure scenarios, the grid controller may become inaccessible and is not automatically recoverable by CA AppLogic. Such cases as observed by the user are summarized below:

The recovery GUI is accessible and fails to restart the grid controller.
- The reason why this has occurred is specified in the recovery GUI under the Details and Messages sections. You can see at which stage the recovery failed and why it failed. If this information is not sufficient to recover the grid controller, see the next section on how to recover the grid controller.
- If the reason that the controller cannot be started is due to corrupt system volumes (boot, meta, impex), the only way to recover the grid controller is by contacting CA support (CA AppLogic can detect corrupt system volumes and displays an error message in the recovery GUI).
The grid controller is not restarted within a few minutes and there is no access to either the recovery GUI or the CA AppLogic GUI.
- Be sure that you give CA AppLogic at least 5 minutes to recover; if CA AppLogic has not recovered the grid controller within 5-10 minutes, see the next section on how to recover the grid controller.

An administrator can view the recovery GUI state information that is maintained by CA AppLogic. This information is stored in a file that is located in dom0 on the controller server (where the grid controller is going to be started). See the last topic at the bottom of this document for the location and format of the recovery GUI status file.

What to do when a grid controller is inaccessible

The following is a list of reasons why the grid controller might not have restarted on its own:

The most common reason is that one of the grid controller's system volumes are not available. This may occur if the only good stream/mirror for either the boot, meta or impex volume is on a server that is down.
Too many primary or secondary servers are down. CA AppLogic requires at least a majority (n/2+1 where n is the number of primary and secondary servers in the grid) of the primary and secondary servers to be operational to be able to start the grid controller on one of the other servers.

To restore the grid controller to operation, do the following:

Restore all of the primary and secondary servers that are down. Once these servers are restored to operation, the grid controller should be restored within roughly 5 minutes. If this does not resolve the issue, try the go to the next step.
Using the BFC, perform the following steps to designate a new primary server for the grid and start the grid controller on the new primary server.
1. Select Grids from the left Menu.
  The Grids page appears. The state of the grid can be running, stopped, failed, failed - running (grid create failed but left the servers running), needs attention, and requires reboot.
2. Select the check box next to the grid you want to work with.
3. Click the Grid Action menu and select Edit Grid Parameters.
  The Edit Grid Parameters dialog opens.
4. Enter the following grid parameter information to designate a new primary server for the grid:
  primary=srvaddr
  
  The srvaddr value is the server ID (srvNN) or address of the server to that will become the new controller. The addresses must be accessible on the Backbone LAN. If a name is specified, it must resolve to an address that is accessible on the Backbone.
  
  When this setting is used on an operational grid (with a running controller), the controller is immediately shut down and restarted on the new host. This should not affect applications on the grid, but it may disrupt GUI access to the grid controller and delay or interrupt application management commands that are in progress.
  
  It is also recommended that with the primary server, there are at least two secondary servers in the grid. If the grid does not have any secondary servers or the secondary servers are down and cannot be restored, perform the following steps to configure at least two secondary servers for the grid. If there are not enough servers available, it is recommended to add more servers to the grid for grid controller HA.
5. Enter the following grid parameter information to designate two secondary servers for the grid:
  secondary=srvaddr,srvaddr,...
  
  The srvaddr values are the server IDs or addresses of servers that are allowed to take over the role of controller host in the case of a failure of the primary controller host. This setting can be used to restrict or change the automatic assignment of secondary controller hosts. Up to seven secondary hosts may be specified. This setting can be specified by itself, or together with the primary= setting, to simultaneously re-assign the secondary hosts and move the controller to a new primary host. The secondary= setting has no effect on a disabled grid. To re-assign secondary controller hosts, first recover the grid using the primary=srvaddr grid parameter.
6. Select the reboot required option to 'Yes'.
7. Click the Save button on the Edit Grid Parameters dialog.

If any of the above suggestions do not restore the grid controller, this is a fatal problem and requires manual intervention to resolve; contact CA support immediately. Collect the following information for CA support:

The current state of the grid
If the recovery GUI is operational, at what stage it failed and the error message displayed in the message section of the GUI
Which servers in the grid are down/inoperational (if any)

Recovery GUI State File

An administrator can view the recovery GUI state information that is maintained by CA AppLogic. This information is stored in a file that is located in dom0 on the controller server (the server where the grid controller is going to be started). The file is named /usr/local/recovery/gui/chroot/data/status and contains the current status of the grid controller recovery. The information stored in this file is what is used by the controller recovery GUI to display the progress/status. This file uses the following format, encoded in JSON (example data is used below):

{
"grid_name"               : "my-grid-name",
"grid_version"            : "2.7.6",
"role"                    : "Recovery controller 2",
"status"                  : "Recovery in progress (master recovery controller is srv2)",
"recovery_start_time"     : "15:14:41 PDT (Mar 21, 2009)",
"recovery_eta"            : "15:23:54 PDT",
"recovery_remaining_time" : 278,
"current_time"            : "15:19:12 PDT",
"stage"                   : 0,
"stage_remaining_time"    : 79,
"failure_reason"          : "srv1 down (no response for 30 sec on either network)",
"known_servers"           : "srv1:down,srv2:up",
"stages"                  : [
                            "Waiting for quorum (at least 3 of the N servers to connect)",
                            "Waiting for server with controller volumes to become available",
                            "Waiting for remaining controller volume streams",
                            "Verifying both networks are present",
                            "Sharing controller volumes",
                            "Mounting controller volumes",
                            "Starting grid controller",
                            "Grid controller started"
                            ],
"msgs"                    : [
                            {
                            "time"     : "15:15:23 PDT",
                            "severity" : "alert",
                            "text"     : "My alert message"
                            },
                            {
                            "time"     : "15:16:00 PDT",
                            "severity" : "info",
                            "text"     : "My info message"
                            }
                            ]
}

Some notes on the fields above:

times reflect grid local time, except recovery_remaining_time and stage_remaining_time which are in seconds
stages is an array of stage descriptions in order of their intended occurrence while the value of stage is the array subscript of the current stage
failure_reason and msgs are optional fields; all other fields are always present
msgs contains errors or other notifications that are obtained during the recovery process; these messages relay important information to the user
severity has two valid values, alert and info

When a grid controller recovery is not in progress, only a few fields are supplied in the status file:

{
"grid_name"            : "my-grid-name",
"grid_version"         : "2.7.6",
"role"                 : "Recovery controller 2",
"status"               : "Okay",
"known_servers"        : "srv2:up,srv1:up"
}

Storage Device Failure

Overview

To help prevent data loss in the event of a storage device failure, CA AppLogic monitors the health of all storage devices that support Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART). S.M.A.R.T. is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.

Fundamentally, storage device failures fall into one of two basic classes:

Predictable failures: These types of failures happen gradually over time, such as mechanical wear and gradual degradation of storage surfaces. A monitoring device can detect these problems.
Unpredictable failures: These types of failures happen suddenly and without warning. These failures range from defective electronic components to a sudden mechanical failure. Mechanical failures account for about 60 percent of all drive failures. Most mechanical failures result from gradual wear, although an eventual failure may be catastrophic. However, before complete failure occurs, there are usually certain indications that failure is imminent. These may include increased heat output, increased noise level, problems with reading and writing of data, an increase in the number of damaged disk sectors, and so on.

Degree of Support

CA AppLogic provides support for monitoring the following types of hard disks provided that the disk itself provides S.M.A.R.T. support and successfully responds to smartctl -i:

ATA drives
SCSI drives
ATA drives sitting behind the following RAID controllers:
- 3Ware 6/7/8000 series controller
- 3Ware 9000 series controller
- HighPoint RocketRaid controller
- MegaRaid controller
- Compaq Smart Raid controller
- Areca Raid controller

Notes:

The support for disks that sit behind the above RAID controllers is a limitation of the smartmontools package used by CA AppLogic to monitor the disks.
The smartctl -i command can be used directly on the server to determine if its storage devices can be monitored by CA AppLogic. CA recommends to verify that CA AppLogic can monitor the server's storage devices before adding the server to a grid.

Determining if storage health monitoring is supported/enabled on your grid's servers

The following alerts are logged to the grid dashboard that indicate whether storage health monitoring is enabled and what storage devices are not monitored by CA AppLogic (on a per-server basis):

INFO: The following hard disks are not monitored on server server: list of storage devices
INFO: Storage failure detection is disabled on server name

In addition, to determine if storage health monitoring is supported for a particular server within a grid, execute the following command for the server:
3t srv info name --extended
and inspect the --- Disk Check Information --- section of the output. If at least one storage device can be monitored on a server, the Supported value is yes; otherwise it is no.

What to do if the "Possible storage system failure" dashboard alert is present on a grid

This section describes the action(s) that should be taken when a storage system failure alert is logged to the grid dashboard.

The following messages are critical errors and are indications of either immanent or potential storage failure.

Device DEVICE, SMART Failure: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE.
Device DEVICE, FAILED SMART self-check. BACK UP DATA NOW!
Device DEVICE, Failed SMART usage Attribute: ATTRIBUTE
Device DEVICE, Self-Test Log error count increased from M to N
Device DEVICE, not capable of SMART self-check
Device DEVICE, failed to read SMART Attribute
Device DEVICE, unable to open device
Device DEVICE, Read SMART Error Log Failed
Device DEVICE, Read SMART Self-Test Log Failed
Device DEVICE, Temperature NUM Celsius reached critical limit of NUM2 Celsius (Min/Max M/N)

If one of the alerts above are present on the grid dashboard, what can be done to save your data depends upon the state of the volumes that have streams on the failing server:

Find out if there are any degraded volumes that have their only good stream/mirror on the failing server.
- Execute vol list server=srvX and note all degraded volumes
- For each degraded volume, execute vol info volume-name and note any volume that has its only good mirror on the failing server.
If all volumes with streams/mirrors on the failing server are in the OK state (That is, they are not degraded):
- Disable the server using the srv disable command.
- Restart the applications that have components running on the server (app restart).
- After the applications have been restarted, contact your service provider to take the failing server offline for service.
- This will result in degraded volumes; however CA AppLogic will automatically repair those volumes in the background.
- Once the server is repaired, it can be brought back online and re-enabled with srv enable.
If there is at least one volume that is degraded with its only good stream/mirror on the failing server:
- The volumes must be repaired. Use the vol repair volume-name --force command to force CA AppLogic to repair these volumes as soon as possible. It is recommended to repair the volumes in order of importance of their data (in case the storage device fails during the volume repair process). It is recommended to repair the volumes in the following order:
  - application volumes
  - singleton volumes
  - catalog/class volumes
  - volcache volumes (only if the volcache volumes have been incorrectly modified from the original class)
  - global volumes
- When all volume repairs for these volumes have been completed (vol repair --status), repeat all of the steps in the first case above.

The following message are informational and may not indicate possible disk failure. However, you should contact your grid service provider for assistance in diagnoses of the particular problem.

Device DEVICE, ATA error count increased from M to N:
Device DEVICE, NUM Currently unreadable (pending) sectors:
Device DEVICE, NUM Offline uncorrectable sectors:

Clear a Locked Grid

If you find a grid in a locked state, you can clear the lock by executing a command on the control node.

When a grid is in the locked state, the description in the state icon includes the following information:

The grid <gridname> is locked by process #13421

Follow these steps:

Log in to the BFC control node as root.
Execute the following command:
```
service bfc restart
```
The lock is cleared.