If an unexpected grid failure occurs, report the failure to CA Technologies technical support. The following are examples of these types of failures:
Before submitting a bug report to CA Technologies, review the release notes to verify the problem is not already known.
With the bug report, collect all of the following logs, including backups (that is, xxx.1, xxx.2, and so on), from the grid and send them to CA Technologies (the grid and server logs require administrator access). You can use the 3tsrv utility on each server to collect the server-specific logs and information.
This log contains the output of any grid commands issued by the BFC, as well as output from the user-identifiable actions that the BFC takes. This log is typically the most useful for troubleshooting.
This log is primarily used by CA Technologies development, but may contain useful data when diagnosing issues with discovery.
This log contains the output from the installation process.
This is a good log to watch when discovering servers, because DHCP requests are logged here. If a server is powered on and its DHCP request does not appear in this log, the server is probably not configured correctly for PXE.
Contains the inventory files for servers and can help in diagnosing discovery/inventory issues.
Contains the logs created when a server is deployed into a grid. If a server is failing when being added to a grid, these files may help.
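Because the rotated backups (xxx.1, xxx.2, and so on) must be included, it can be convenient to bundle each log together with its backups using standard tools before sending them. The following is a minimal sketch; the path /var/log/mylog is a placeholder, as the actual log names and locations vary by installation:
tar czf grid-logs.tar.gz /var/log/mylog*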
In addition to the logs above, collect the following information about each server within the grid (dom0):
This topic covers various types of grid controller failures that need manual intervention by the grid administrator to restore grid controller operation.
Note: We strongly recommend that you read the CA 3Tera AppLogic high-availability reference to familiarize yourself with CA 3Tera AppLogic's high-availability capabilities, especially those related to grid controller HA and the types of failures that can occur.
CA 3Tera AppLogic automatically recovers from two types of grid controller failures:
In certain failure scenarios, the grid controller can become inaccessible and cannot be recovered automatically by CA 3Tera AppLogic. Such cases, as observed by the user, are summarized below:
An administrator can view the recovery GUI state information that is maintained by CA 3Tera AppLogic. This information is stored in a file in dom0 on the controller server (the server where the grid controller is going to be started). See the last topic in this document for the location and format of the recovery GUI status file.
The following is a list of reasons why the grid controller might not have restarted on its own:
To restore the grid controller to operation, do the following:
The Grids page appears. The state of the grid can be one of the following: running, stopped, failed, failed - running (grid create failed but left the servers running), needs attention, or requires reboot.
The Edit Grid Parameters dialog opens.
primary=srvaddr
The srvaddr value is the server ID (srvNN) or address of the server that will become the new controller. The address must be accessible on the Backbone LAN. If a name is specified, it must resolve to an address that is accessible on the Backbone.
When this setting is used on an operational grid (with a running controller), the controller is immediately shut down and restarted on the new host. This should not affect applications on the grid, but it may disrupt GUI access to the grid controller and delay or interrupt application management commands that are in progress.
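For example, to move the controller to a server with ID srv5 (a hypothetical server ID used here for illustration only), set:
primary=srv5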
In addition to the primary server, the grid should have at least two secondary servers. If the grid does not have any secondary servers, or the secondary servers are down and cannot be restored, perform the following steps to configure at least two secondary servers for the grid. If not enough servers are available, add more servers to the grid to maintain grid controller HA.
secondary=srvaddr,srvaddr,...
The srvaddr values are the server IDs or addresses of servers that are allowed to take over the role of controller host if the primary controller host fails. This setting can be used to restrict or change the automatic assignment of secondary controller hosts. Up to seven secondary hosts can be specified. This setting can be specified by itself, or together with the primary= setting to simultaneously re-assign the secondary hosts and move the controller to a new primary host. The secondary= setting has no effect on a disabled grid; to re-assign secondary controller hosts on such a grid, first recover the grid using the primary=srvaddr grid parameter.
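For example, to designate the hypothetical servers srv3 and srv4 as secondary controller hosts, set:
secondary=srv3,srv4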
If none of the above suggestions restores the grid controller, this is a fatal problem that requires manual intervention to resolve; contact CA support immediately. Collect the following information for CA support:
An administrator can view the recovery GUI state information that is maintained by CA 3Tera AppLogic. This information is stored in a file in dom0 on the controller server (the server where the grid controller is going to be started). The file is named /usr/local/recovery/gui/chroot/data/status and contains the current status of the grid controller recovery. The controller recovery GUI uses the information in this file to display the recovery progress and status. The file uses the following JSON format (example data is shown below):
{
"grid_name" : "my-grid-name",
"grid_version" : "2.7.6",
"role" : "Recovery controller 2",
"status" : "Recovery in progress (master recovery controller is srv2)",
"recovery_start_time" : "15:14:41 PDT (Mar 21, 2009)",
"recovery_eta" : "15:23:54 PDT",
"recovery_remaining_time" : 278,
"current_time" : "15:19:12 PDT",
"stage" : 0,
"stage_remaining_time" : 79,
"failure_reason" : "srv1 down (no response for 30 sec on either network)",
"known_servers" : "srv1:down,srv2:up",
"stages" : [
"Waiting for quorum (at least 3 of the N servers to connect)",
"Waiting for server with controller volumes to become available",
"Waiting for remaining controller volume streams",
"Verifying both networks are present",
"Sharing controller volumes",
"Mounting controller volumes",
"Starting grid controller",
"Grid controller started"
],
"msgs" : [
{
"time" : "15:15:23 PDT",
"severity" : "alert",
"text" : "My alert message"
},
{
"time" : "15:16:00 PDT",
"severity" : "info",
"text" : "My info message"
}
]
}
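To view this file on a live system, it can be read directly from dom0 on the controller server. The following sketch assumes a Python installation is available in dom0 for pretty-printing the JSON; a plain cat of the file also works:
cat /usr/local/recovery/gui/chroot/data/status | python -m json.tool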
Some notes on the fields above:
When a grid controller recovery is not in progress, only a few fields are supplied in the status file:
{
"grid_name" : "my-grid-name",
"grid_version" : "2.7.6",
"role" : "Recovery controller 2",
"status" : "Okay",
"known_servers" : "srv2:up,srv1:up"
}
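As a quick check of whether a recovery is in progress, the status field alone can be extracted with standard grep; it reads Okay when no recovery is in progress:
grep '"status"' /usr/local/recovery/gui/chroot/data/status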
To help prevent data loss in the event of a storage device failure, CA 3Tera AppLogic monitors the health of all storage devices that support Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART). S.M.A.R.T. is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.
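To verify that a particular disk reports S.M.A.R.T. data, the smartctl utility can be queried directly from dom0. The device path /dev/sda below is a placeholder used for illustration; substitute the actual device:
smartctl -i /dev/sda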
Fundamentally, storage device failures fall into one of two basic classes:
These types of failures happen gradually over time, for example through mechanical wear and gradual degradation of storage surfaces. A monitoring system can detect these problems.
These types of failures happen suddenly and without warning. These failures range from defective electronic components to a sudden mechanical failure. Mechanical failures account for about 60 percent of all drive failures. Most mechanical failures result from gradual wear, although an eventual failure may be catastrophic. However, before complete failure occurs, there are usually certain indications that failure is imminent. These may include increased heat output, increased noise level, problems with reading and writing of data, an increase in the number of damaged disk sectors, and so on.
CA 3Tera AppLogic provides support for monitoring the following types of hard disks provided that the disk itself provides S.M.A.R.T. support and successfully responds to smartctl -i:
Notes:
The following alerts, which indicate whether storage health monitoring is enabled and which storage devices are not monitored by CA 3Tera AppLogic (on a per-server basis), are logged to the grid dashboard:
In addition, to determine whether storage health monitoring is supported for a particular server within a grid, execute the following command for the server:
3t srv info name --extended
and inspect the --- Disk Check Information --- section of the output. If at least one storage device can be monitored on a server, the Supported value is yes; otherwise it is no.
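For example, for a hypothetical server named srv3, the relevant section can be displayed and filtered with standard grep (the -A 4 context count is approximate and may need adjusting):
3t srv info srv3 --extended | grep -A 4 'Disk Check Information'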
This section describes the action(s) that should be taken when a storage system failure alert is logged to the grid dashboard.
The following messages are critical errors and indicate either imminent or potential storage failure.
If one of the above alerts is present on the grid dashboard, what can be done to save your data depends on the state of the volumes that have streams on the failing server:
The following messages are informational and may not indicate a possible disk failure. However, contact your grid service provider for assistance in diagnosing the particular problem.