

Grid Controller Recovery

This topic covers the grid controller failures that require manual intervention by the grid administrator to restore grid controller operation.

Note the following:

Types of Grid Controller Failures

CA 3Tera AppLogic automatically recovers from two types of grid controller failures:

In certain failure scenarios, the grid controller may become inaccessible and is not automatically recoverable by CA 3Tera AppLogic. Such cases as observed by the user are summarized below:

An administrator can view the recovery GUI state information that is maintained by CA 3Tera AppLogic. This information is stored in a file that is located in dom0 on the controller server (the server where the grid controller is going to be started). See Recovery GUI State File at the end of this topic for the location and format of the status file.

What to do when a grid controller is inaccessible

The following is a list of reasons why the grid controller might not have restarted on its own:

To restore the grid controller to operation, do the following:

  1. Restore all of the primary and secondary servers that are down. Once these servers are restored to operation, the grid controller should be restored within roughly 5 minutes. If this does not resolve the issue, go to the next step.
  2. Using the BFC, perform the following steps to designate a new primary server for the grid and start the grid controller on the new primary server.
    1. Select Grids from the left Menu.

      The Grids page appears. The state of the grid can be running, stopped, failed, failed - running (grid create failed but left the servers running), needs attention, or requires reboot.

    2. Click the desired grid name in the GRID column.

      The Servers tab for the grid appears.

    3. Click the Miscellaneous tab.
    4. Click the Set button.

      The Edit Properties dialog appears.

    5. Enter the following grid parameter information to designate a new primary server for the grid:

      primary=srvaddr

      The srvaddr value is the server ID (srvNN) or address of the server that will become the new controller host. The address must be accessible on the Backbone LAN. If a name is specified, it must resolve to an address that is accessible on the Backbone LAN.

      When this setting is used on an operational grid (with a running controller), the controller is immediately shut down and restarted on the new host. This should not affect applications on the grid, but it may disrupt GUI access to the grid controller and delay or interrupt application management commands that are in progress.

      In addition to the primary server, it is recommended that the grid have at least two secondary servers. If the grid does not have any secondary servers, or the secondary servers are down and cannot be restored, perform the following steps to configure at least two secondary servers for the grid. If not enough servers are available, add more servers to the grid to provide grid controller HA.

    6. Enter the following grid parameter information to designate two secondary servers for the grid:

      secondary=srvaddr,srvaddr,...

      The srvaddr values are the server IDs or addresses of servers that are allowed to take over the role of controller host in the case of a failure of the primary controller host. This setting can be used to restrict or change the automatic assignment of secondary controller hosts. Up to seven secondary hosts may be specified. This setting can be specified by itself, or together with the primary= setting, to simultaneously re-assign the secondary hosts and move the controller to a new primary host. The secondary= setting has no effect on a disabled grid. To re-assign secondary controller hosts, first recover the grid using the primary=srvaddr grid parameter.

    7. Select the On option to reboot the grid.
    8. Click the Save button on the Edit Properties dialog.
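The primary= and secondary= values entered in steps 5 and 6 can be composed and sanity-checked before they are typed into the Edit Properties dialog. The following is a minimal Python sketch, not part of CA 3Tera AppLogic; the function name build_controller_params and the server IDs in the usage note are hypothetical. It only enforces what the text above states (a comma-separated secondary list with at most seven hosts) and does not contact the grid or the BFC.

```python
MAX_SECONDARIES = 7  # the text above allows up to seven secondary hosts


def build_controller_params(primary=None, secondaries=()):
    """Return grid parameter strings such as 'primary=srv3'.

    Hypothetical helper: it only formats and checks the values, it does
    not apply them to a grid.
    """
    secondaries = list(secondaries)
    if len(secondaries) > MAX_SECONDARIES:
        raise ValueError("at most %d secondary hosts may be specified" % MAX_SECONDARIES)

    candidates = secondaries if primary is None else [primary] + secondaries
    for addr in candidates:
        # A comma or whitespace inside a value would corrupt the list syntax.
        if not addr or "," in addr or any(ch.isspace() for ch in addr):
            raise ValueError("invalid server ID or address: %r" % (addr,))

    params = []
    if primary is not None:
        params.append("primary=" + primary)
    if secondaries:
        params.append("secondary=" + ",".join(secondaries))
    return params
```

For example, build_controller_params("srv3", ["srv4", "srv5"]) returns the two strings primary=srv3 and secondary=srv4,srv5, which can then be entered in the dialog.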

If none of the above steps restores the grid controller, the problem is fatal and requires manual intervention to resolve; contact CA support immediately. Collect the following information for CA support:

Recovery GUI State File

An administrator can view the recovery GUI state information that is maintained by CA 3Tera AppLogic. This information is stored in a file that is located in dom0 on the controller server (the server where the grid controller is going to be started). The file is named /usr/local/recovery/gui/chroot/data/status and contains the current status of the grid controller recovery. The controller recovery GUI reads this file to display the recovery progress and status. The file is encoded in JSON and uses the following format (example data is used below):

{
"grid_name"               : "my-grid-name",
"grid_version"            : "2.7.6",
"role"                    : "Recovery controller 2",
"status"                  : "Recovery in progress (master recovery controller is srv2)",
"recovery_start_time"     : "15:14:41 PDT (Mar 21, 2009)",
"recovery_eta"            : "15:23:54 PDT",
"recovery_remaining_time" : 278,
"current_time"            : "15:19:12 PDT",
"stage"                   : 0,
"stage_remaining_time"    : 79,
"failure_reason"          : "srv1 down (no response for 30 sec on either network)",
"known_servers"           : "srv1:down,srv2:up",
"stages"                  : [
                            "Waiting for quorum (at least 3 of the N servers to connect)",
                            "Waiting for server with controller volumes to become available",
                            "Waiting for remaining controller volume streams",
                            "Verifying both networks are present",
                            "Sharing controller volumes",
                            "Mounting controller volumes",
                            "Starting grid controller",
                            "Grid controller started"
                            ],
"msgs"                    : [
                            {
                            "time"     : "15:15:23 PDT",
                            "severity" : "alert",
                            "text"     : "My alert message"
                            },
                            {
                            "time"     : "15:16:00 PDT",
                            "severity" : "info",
                            "text"     : "My info message"
                            }
                            ]
}

Some notes on the fields above:

When a grid controller recovery is not in progress, only a few fields are supplied in the status file:

{
"grid_name"            : "my-grid-name",
"grid_version"         : "2.7.6",
"role"                 : "Recovery controller 2",
"status"               : "Okay",
"known_servers"        : "srv2:up,srv1:up"
}
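Because the status file is plain JSON, it can be read with a short script run in dom0 on the controller server. The following Python sketch is illustrative only and is not shipped with CA 3Tera AppLogic; the function names are hypothetical. It assumes that "stage" is a zero-based index into the "stages" list, which matches the sample data but is not confirmed by the text, and it treats any server state other than "up" as a problem.

```python
import json

# Path taken from the text above.
STATUS_PATH = "/usr/local/recovery/gui/chroot/data/status"


def load_status(path=STATUS_PATH):
    """Read and decode the recovery GUI status file."""
    with open(path) as handle:
        return json.load(handle)


def parse_known_servers(value):
    """Turn 'srv1:down,srv2:up' into {'srv1': 'down', 'srv2': 'up'}."""
    servers = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, state = item.partition(":")
        servers[name] = state
    return servers


def summarize_status(status):
    """Return a few human-readable lines describing the status dict."""
    lines = ["%s: %s" % (status.get("grid_name", "?"), status.get("status", "?"))]

    # "stage" and "stages" are present only while a recovery is in progress.
    # Assumption: "stage" is a zero-based index into "stages".
    stage = status.get("stage")
    stages = status.get("stages")
    if isinstance(stage, int) and isinstance(stages, list) and 0 <= stage < len(stages):
        lines.append("stage %d: %s" % (stage, stages[stage]))

    servers = parse_known_servers(status.get("known_servers", ""))
    not_up = sorted(name for name, state in servers.items() if state != "up")
    if not_up:
        lines.append("servers not up: " + ", ".join(not_up))
    return lines
```

Calling summarize_status(load_status()) on the controller server returns one line for the grid name and status, plus the current stage and any servers not reported as up when those fields are present.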