Automated Recovery of the Grid Controller

CA AppLogic can now tolerate failures of the grid controller with minimal to no application downtime. The grid controller is no longer a single point of failure for the grid.

This section contains the following topics:

Grid Controller Failure

Grid Controller Server Failure

Application repair upon Grid Controller Recovery

Grid failures that require manual intervention

Grid Controller Failure

CA AppLogic automatically recovers from several types of failure of the grid controller virtual machine, which runs on the grid's primary server. Each CA AppLogic grid has exactly one primary server. Recovery from a failed grid controller has no effect on the applications that are running on the grid. CA AppLogic monitors the grid controller and automatically detects and handles the following software failure conditions, any of which may lead to grid controller downtime:

Crash of the grid controller
Unexpected shutdown/reboot of the grid controller
Unresponsive grid controller
Corruption of the grid controller boot or meta volumes
Out of memory errors within the grid controller

When any of the above failures occurs, the grid controller is automatically restarted on the primary server without affecting any running applications. The grid controller is unavailable for under 5 minutes while the controller recovery is in progress. Once the grid controller has recovered from the failure, it automatically reacquires the state of the grid and continues operation as if the failure had never occurred. An alert that states why the grid controller failed is posted to the grid dashboard. See the grid controller dashboard messages for a full list of the alerts that can be posted.

As with physical server failures and appliance failures, if the grid controller fails 3 times within a 24-hour period, CA AppLogic does not automatically restart the grid controller. If this situation occurs, contact your service provider immediately.
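
The restart limit described above can be pictured as a sliding-window counter. The following is a minimal sketch, not CA AppLogic's implementation: the class name RestartPolicy, its method, and the exact handling of the 24-hour boundary are assumptions made for illustration. Only the "3 failures within 24 hours" rule comes from the text.

```python
from collections import deque
from datetime import datetime, timedelta

MAX_FAILURES = 3               # from the text: 3 failures ...
WINDOW = timedelta(hours=24)   # ... within a 24-hour period


class RestartPolicy:
    """Hypothetical helper that decides whether an automatic restart is still allowed."""

    def __init__(self):
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, when: datetime) -> bool:
        """Record a failure and return True if an automatic restart is still allowed."""
        self._failures.append(when)
        # Drop failures that have aged out of the window (boundary handling is assumed).
        while self._failures and when - self._failures[0] >= WINDOW:
            self._failures.popleft()
        # The third failure inside the window disables automatic restart.
        return len(self._failures) < MAX_FAILURES
```

In this sketch the first two failures within a window return True (restart allowed) and the third returns False, which corresponds to the point where the service provider must be contacted.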

Notes:

The automatic restart of the grid controller upon failure is supported on CA AppLogic grids of all sizes (single server, 2 servers, and so on).
No matter what happens to the grid controller, the running applications should never be affected and should continue to operate.

Grid Controller Server Failure

In addition to automatically handling both grid controller and physical server failures, CA AppLogic can also handle failures of the grid controller server (that is, the physical server where the grid controller is currently running; also known as the primary server). A controller server may fail for any number of reasons, the same as any other server within the grid.

To tolerate failures of the controller server, CA AppLogic uses server roles to define the set of servers in a grid that are able to run the grid controller in case of failures (that is, backup controller servers). CA AppLogic uses the following server roles within a grid (these are automatically configured by CA AppLogic but may also be specified by a user or grid administrator):

Primary: Server that is currently running the CA AppLogic grid controller
Secondary: Server that may run the CA AppLogic grid controller in case of a primary controller server failure
None: Server will never run the CA AppLogic grid controller and does not participate in controller high-availability

By default, every CA AppLogic grid is configured with the following server roles:

One server grids: one primary server; single server grids cannot tolerate failures of the primary server (no grid controller server high-availability)
Two server grids: one primary and one secondary server (and one reference server)
Three or more server grids: one primary and two secondary servers; the remaining servers in the grid have their role set to none and do not participate in the controller server failure recovery
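
The default role counts listed above follow a simple rule. The sketch below restates them as a function; the function name default_roles and the dictionary keys are hypothetical and are not part of any CA AppLogic interface.

```python
def default_roles(server_count: int) -> dict:
    """Return the default number of servers per role for a grid of the given size.

    Illustrative only: mirrors the defaults listed in this section.
    """
    if server_count < 1:
        raise ValueError("a grid has at least one server")
    # One primary always; up to two secondaries, limited by the remaining servers.
    secondaries = min(2, server_count - 1)
    return {
        "primary": 1,
        "secondary": secondaries,
        "none": server_count - 1 - secondaries,
        # Two-server grids additionally rely on a reference server.
        "needs_reference_server": server_count == 2,
    }
```

For example, under these defaults a five-server grid has one primary server, two secondary servers, and two servers with the role none.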

If the primary server fails (hardware/software failure, power-down, and so on), one of the secondary servers automatically takes over as the new primary server for the grid. If the old primary server is restored to operation, it automatically becomes a secondary server. The new primary server starts the grid controller, and once the controller is restored, it automatically reacquires the state of the grid. As with physical server failures, CA AppLogic also automatically restarts the appliances that were running on the failed primary server. The use of secondary controller servers and the auto-restart of appliances allow CA AppLogic to tolerate failures of the primary controller server.

A user can view the server roles that are assigned in their grid using the srv list command. The server roles may also be modified using the srv set command.

For grids with exactly 2 servers, CA AppLogic requires a 3rd reference server to properly support the grid controller HA. By default when a 2.7+ grid is installed or upgraded, the CA AppLogic installer assigns the BFC server as the reference server for the grid. The same BFC server may be used as a reference server for all grids on the same backbone.

Here are some important notes to keep in mind about CA AppLogic's grid controller high-availability:

Important: For a grid to recover from a controller server failure, there must be at least 2 secondary servers up and running at the time of the server failure. If this requirement is not met (for example, there is only one secondary server at the time of the primary server failure), the grid controller remains down and requires grid administrator intervention to restore the grid controller to an operational state. If this type of controller failure is encountered, contact your service provider for assistance.

If the primary server fails and does not come back online for at least 2 hours, CA AppLogic automatically assigns a new secondary server within the grid to maintain at least 2 secondary servers for grid controller failover.
In general, when CA AppLogic schedules appliances to start on either comp start or app start, the appliances are scheduled on servers based on both server role and available resources. CA AppLogic first tries to schedule appliances on servers with a role of none, then on the primary server, and lastly on secondary servers. The secondary servers are used as a last resort for scheduling so that there is a greater chance that resources are available to start the controller if needed.
When a secondary server takes over as the new primary server and there are not enough resources available on that server to start the grid controller, CA AppLogic moves appliances off the new primary server by restarting them on other servers within the grid, so that the grid controller can be started. This may break appliance failover groups. If CA AppLogic stops one of these appliances, it may not be able to restart the appliance on another server because there may not be enough resources to satisfy the failover group.
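
The role-based scheduling preference in the notes above (none first, then primary, then secondary) can be sketched as follows. This is an illustration under stated assumptions, not the actual scheduler: the Server fields, the resource units, and the pick_server function are hypothetical, and real scheduling also accounts for resources such as bandwidth and for failover groups.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower value = preferred for appliance scheduling (order taken from the text).
ROLE_PREFERENCE = {"none": 0, "primary": 1, "secondary": 2}


@dataclass
class Server:
    name: str
    role: str          # "primary", "secondary", or "none"
    free_cpu: float    # hypothetical unit: free CPU cores
    free_mem_mb: int   # hypothetical unit: free memory in MB


def pick_server(servers: List[Server], need_cpu: float, need_mem_mb: int) -> Optional[Server]:
    """Pick the most preferred server that has enough free resources, or None."""
    candidates = [
        s for s in servers
        if s.free_cpu >= need_cpu and s.free_mem_mb >= need_mem_mb
    ]
    if not candidates:
        return None
    # Secondary servers are chosen only when no other server fits.
    return min(candidates, key=lambda s: ROLE_PREFERENCE[s.role])
```

Keeping secondary servers last in the ordering is what leaves headroom on them for starting the grid controller after a failover.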

This section contains the following topics:

Visibility During Controller Server Recovery

Authentication

Dashboard

Visibility During Controller Server Recovery

When the primary server fails, a user may point their browser to their grid controller host name/IP and observe the controller recovery progress. Once the controller has been recovered, the user is automatically redirected to the CA AppLogic GUI for their grid.

Note: Controller Recovery progress appears only for Xen-based grids, and does not appear for ESX-based grids.

Authentication

The user must be authenticated to access the controller recovery GUI to observe the recovery progress/status. To log into the recovery GUI, click on the Login button, enter the recovery GUI password within the dialog and click the OK button.

Note: The recovery GUI password may be modified via the grid set command. A controller reboot is required for the new password to take effect.

Dashboard

After the user authenticates, they will have access to the dashboard of the controller recovery GUI.

The controller recovery GUI displays the following information:

Dashboard
- Grid Name: name of the grid
- CA AppLogic Version: version of the grid
- Status: displays that the recovery is in progress and which server is the new controller server
- Role: current role of the new controller server; for controller recovery the role is always "Secondary"
- Known Servers: list of all servers in the grid that contain the good streams for the controller system volumes
Recovery in progress
- Reason for failure: the reason why the controller server failed (may be unknown)
- Process started: day and time when the controller recovery process started (when the recovery GUI has been started by CA AppLogic)
- Current time: current time
- Estimated completion: estimated completion time when the grid controller will be recovered
- Remaining time: estimated time that is left to recover the grid controller
  Typically it takes 1-3 minutes for the recovery controller to start, and 11-13 minutes for the controller to start completely.
- Details: this is the detail of the various stages for the actual grid controller recovery
  See the grid controller recovery details for the list of detail messages that can be logged during the recovery process
Messages
- This is used to log informational messages and warnings/errors that are encountered during the grid controller recovery
- See the grid controller recovery messages for the list of messages that can be logged during the recovery process

After the grid controller is restored, an alert is posted to the grid dashboard that describes the reason why the controller had failed.

Notes:

The controller recovery GUI is also displayed during the boot of a grid so the boot process can be observed by the user (that is, grid reboot). In this case, Status would show Starting the controller on primary server srvX.
If the controller recovery fails and the grid controller is not started, contact your service provider immediately.

Application repair upon Grid Controller Recovery

When the grid controller fails, users may have been starting, stopping, or restarting applications and components at the time of the failure. Upon restoration of the grid controller, CA AppLogic helps ensure that all applications and components are restored to their expected state, based on the commands that were executing before the grid controller failed. This process of restoring the application/component state is known as repair. Both applications and components have an associated target state that is used in the repair process to help ensure that they are properly restored.

As an example, if an application was in the middle of an application restart (app restart) and right before the grid controller failure the application was stopping, CA AppLogic automatically verifies that the application is properly restarted. In this case, the application's target state is RESTART_STOPPING to indicate that the application was stopping as part of an app restart. The target state for an application can be obtained by executing app info (the target state is only displayed for non-stopped applications).

Applications that are under repair after a grid controller restart may be in one of the following states:

REPAIRING: application is currently being repaired by CA AppLogic (components are being stopped/started as needed to restore the application to the appropriate target state)
RESTART_STOPPING: application is currently being stopped as part of an app restart as in the example above

While the application repair is in progress, the following alert is posted on the grid dashboard:

Grid recovery in progress: There were N active application(s) when the controller went down. M application(s) have been recovered. The state of P application(s) has been reacquired. Recovering Q application(s).

After the application repair is complete, the previous alert is destroyed and the following alert is posted on the grid dashboard (assuming everything was recovered successfully):

Grid recovery completed on time: There were N active application(s) when the grid controller went down. N application(s) have been recovered. The state of P application(s) has been reacquired.

If there was a failure recovering the applications, the following alert is posted on the grid dashboard:

Grid recovery completed on time: There were N active application(s) when the grid controller went down. M application(s) failed to be recovered.

If an application fails to be recovered, use the list log command to view the controller log for details regarding the failure. Usually applications fail to be recovered for one or more of the following reasons:

Not enough resources in the grid (cpu, memory, bandwidth)
Application/appliance volumes in an ERROR state (this is possible when one or more servers are down)
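
When the dashboard alerts quoted above are collected by a monitoring script, the counts can be extracted from the alert text. The sketch below assumes that N, M, P, and Q appear as plain decimal integers and that the wording matches the alerts quoted in this section; the function name parse_recovery_alert and the result keys are hypothetical.

```python
import re

# One pattern per count; wording taken from the alerts quoted in this section.
_COUNT_PATTERNS = {
    "active": re.compile(r"There were (\d+) active application\(s\)"),
    "recovered": re.compile(r"(\d+) application\(s\) have been recovered"),
    "reacquired": re.compile(r"The state of (\d+) application\(s\) has been reacquired"),
    "recovering": re.compile(r"Recovering (\d+) application\(s\)"),
    "failed": re.compile(r"(\d+) application\(s\) failed to be recovered"),
}


def parse_recovery_alert(text: str) -> dict:
    """Return the counts found in a grid recovery alert, keyed by meaning."""
    counts = {}
    for key, pattern in _COUNT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            counts[key] = int(match.group(1))
    return counts
```

A "failed" key in the result indicates that at least one application needs attention in the controller log.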

Note: During the automated application repair process, CA AppLogic does not allow the user/grid-administrator to execute destructive CLI commands. This includes any command that affects the state of the grid or any server, application, component, class, catalog or volume. The following error message is displayed if a destructive command is executed during application repair:

Cannot execute command at this time - the grid controller is currently busy recovering from a failure.

Important: Applications are repaired by CA AppLogic using the app repair command. This command is valid only for applications that are in a FAILED state. Users may execute this command directly to repair applications that may have failed (that is, to restore an application where the user has completed the debugging of failed components).

Grid failures that require manual intervention

Some grid failures can occur in which the grid controller is not automatically restarted by CA AppLogic. These cases, as observed by the user, are summarized below. If any of the following situations is encountered, contact your service provider immediately.

The grid controller is not restarted within a few minutes and there is no access to either the recovery GUI or the CA AppLogic GUI
The recovery GUI is accessible and fails to restart the grid controller; in this case the reason why this has happened is specified in the recovery GUI

Partners only: See the grid controller recovery topic for more information about how to recover your grid if one of these failures occurs.