CA AppLogic can now tolerate failures of the grid controller with minimal to no application downtime. The grid controller is no longer a single point of failure for the grid.
This section contains the following topics:
Grid Controller Server Failure
Application repair upon Grid Controller Recovery
Grid failures that require manual intervention
CA AppLogic automatically recovers from various types of failures of the grid controller virtual machine that runs on the grid's primary server (the primary server is the server that runs the grid controller virtual machine; each CA AppLogic grid has one and only one primary server). Recovery from a failed grid controller has no effect on the applications that are running on the grid. CA AppLogic monitors the grid controller and automatically detects and handles any of the following software failure conditions that may lead to grid controller downtime:
When any of the above failures occurs, the grid controller is automatically restarted on the primary server without affecting any of the running applications. From a visibility standpoint, the grid controller is unavailable for under 5 minutes while the controller recovery is in progress. Once the grid controller has recovered from the failure, it automatically reacquires the state of the grid and continues operation as if the failure never occurred. An alert is posted to the grid dashboard that states why the grid controller failed. See the grid controller dashboard messages for a full list of the alerts that can be posted.
As with physical server failures and appliance failures, if the grid controller fails 3 times within a 24-hour period, CA AppLogic stops restarting the grid controller automatically. If this situation occurs, contact your service provider immediately.
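The restart limit described above can be pictured as a sliding-window counter. The following is a minimal sketch, not CA AppLogic's implementation: the class name `RestartPolicy` and its method are hypothetical, and treating the 24-hour period as a sliding window (rather than a fixed one) is an assumption.

```python
from collections import deque

MAX_FAILURES = 3             # from the text: 3 failures ...
WINDOW_SECONDS = 24 * 3600   # ... within a 24-hour period


class RestartPolicy:
    """Hypothetical helper: decides whether an automatic restart is still allowed."""

    def __init__(self, max_failures=MAX_FAILURES, window=WINDOW_SECONDS):
        self.max_failures = max_failures
        self.window = window
        self._failures = deque()  # timestamps (in seconds) of recent failures

    def record_failure(self, now):
        """Record a failure at time `now`; return True if auto-restart is allowed."""
        self._failures.append(now)
        # Forget failures that have fallen out of the sliding window (assumption).
        while self._failures and now - self._failures[0] >= self.window:
            self._failures.popleft()
        # The failure that reaches the limit is the one that blocks the restart.
        return len(self._failures) < self.max_failures
```

In this sketch, the first two failures inside the window still allow an automatic restart; the third returns `False`, which corresponds to the point where the service provider must be contacted.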
Notes:
In addition to automatically handling both grid controller and physical server failures, CA AppLogic can also handle failures of the grid controller server (that is, the physical server where the grid controller is currently running; also known as the primary server). A controller server may fail for any number of reasons, the same as any other server within the grid.
To tolerate failures of the controller server, CA AppLogic uses server roles to define a set of servers in a grid that are able to run the grid controller in case of failures (that is, backup controller servers). CA AppLogic uses the following server roles within a grid (these are automatically configured by CA AppLogic but may also be specified by a user or grid administrator):
By default, every CA AppLogic grid is configured with the following server roles:
If the primary server fails (hardware or software failure, power-down, and so on), one of the secondary servers automatically takes over as the new primary server for the grid. If the old primary server is restored to operation, it automatically becomes a non-primary server (secondary server). The new primary server starts the grid controller, and once the controller is restored, it automatically reacquires the state of the grid. As with physical server failures, CA AppLogic also automatically restarts appliances that were running on the failed primary server. Together, the secondary controller servers and the auto-restart of appliances allow CA AppLogic to tolerate failures of the primary controller server.
A user can view the server roles that are assigned in their grid using the srv list command. The server roles may also be modified using the srv set command.
For grids with exactly 2 servers, CA AppLogic requires a third server, the reference server, to properly support grid controller HA. By default, when a 2.7+ grid is installed or upgraded, the CA AppLogic installer assigns the BFC server as the reference server for the grid. The same BFC server may be used as a reference server for all grids on the same backbone.
Here are some important notes to keep in mind about CA AppLogic's grid controller high-availability:
Important: For a grid to recover from a controller server failure, there must be at least 2 secondary servers up and running at the time of the server failure. If this requirement is not met (for example, there is only one secondary server at the time of the primary server failure), the grid controller remains down and requires grid administrator intervention to restore the grid controller to an operational state. If this type of controller failure is encountered, contact your service provider for assistance.
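The failover rule above can be sketched as a small role-reassignment function. This is an illustration under stated assumptions, not the product's election logic: the `Server` type and `fail_over` function are hypothetical, the choice of which secondary is promoted (here, the first one found) is an assumption, and the reference server used by 2-server grids is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    role: str        # "primary" or "secondary" (role names as used in the text)
    up: bool = True


def fail_over(servers, min_secondaries=2):
    """Promote a secondary after the primary fails; return the new primary or None.

    None models the case where the grid controller stays down until a
    grid administrator intervenes.
    """
    candidates = [s for s in servers if s.role == "secondary" and s.up]
    if len(candidates) < min_secondaries:
        return None  # fewer than 2 secondaries up: no automatic recovery
    new_primary = candidates[0]  # selection order is an assumption
    for s in servers:
        if s.role == "primary":
            s.role = "secondary"  # old primary rejoins later as a secondary
            s.up = False          # it is down at the time of the failover
    new_primary.role = "primary"
    return new_primary
```

With one primary and two running secondaries, the sketch promotes a secondary and demotes the failed primary; with only one secondary it returns `None` and leaves the roles unchanged.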
When the primary server fails, a user may point their browser to their grid controller host name/IP and observe the controller recovery progress. Once the controller has been recovered, the user is automatically redirected to the CA AppLogic GUI for their grid.
Note: Controller Recovery progress appears only for Xen-based grids, and does not appear for ESX-based grids.
The user must be authenticated to access the controller recovery GUI to observe the recovery progress/status. To log into the recovery GUI, click on the Login button, enter the recovery GUI password within the dialog and click the OK button.
Note: The recovery GUI password may be modified via the grid set command. A controller reboot is required for the new password to take effect.
After the user authenticates, they will have access to the dashboard of the controller recovery GUI.
The controller recovery GUI displays the following information:
Typically it takes 1-3 minutes for the recovery controller to start, and 11-13 minutes for the controller to start completely.
See the grid controller recovery details for the list of detail messages that can be logged during the recovery process.
After the grid controller is restored, an alert is posted to the grid dashboard that describes why the controller failed.
Notes:
When the grid controller fails, users may have been starting, stopping, or restarting applications and components at the time of the failure. Upon restoration of the grid controller, CA AppLogic helps ensure that all applications and components are restored to their expected state, based on the commands that were executing before the grid controller failed. This process of restoring the application or component state is known as repair. Both applications and components have an associated target state that is used in the repair process to help ensure that they are properly restored.
As an example, if an application was in the middle of an application restart (app restart) and right before the grid controller failure the application was stopping, CA AppLogic automatically verifies that the application is properly restarted. In this case, the application's target state is RESTART_STOPPING to indicate that the application was stopping as part of an app restart. The target state for an application can be obtained by executing app info (the target state is only displayed for non-stopped applications).
Applications that are under repair after a grid controller restart may be in one of the following states:
While the application repair is in progress, the following alert is posted on the grid dashboard:
Grid recovery in progress: There were N active application(s) when the controller went down. M application(s) have been recovered. The state of P application(s) has been reacquired. Recovering Q application(s).
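For scripts that watch the dashboard, the counters in this alert can be pulled out with a regular expression. The sketch below assumes that the placeholders N, M, P, and Q are rendered as plain decimal integers and that the wording matches the text above exactly; the function name `parse_progress` and the dictionary keys are hypothetical.

```python
import re

# Pattern built from the alert wording shown above; the numeric format is an assumption.
PROGRESS_RE = re.compile(
    r"Grid recovery in progress: "
    r"There were (?P<active>\d+) active application\(s\) "
    r"when the controller went down\. "
    r"(?P<recovered>\d+) application\(s\) have been recovered\. "
    r"The state of (?P<reacquired>\d+) application\(s\) has been reacquired\. "
    r"Recovering (?P<recovering>\d+) application\(s\)\."
)


def parse_progress(alert):
    """Return the four counters as integers, or None if the text does not match."""
    match = PROGRESS_RE.fullmatch(alert.strip())
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

Returning `None` for text that does not match keeps the caller from acting on a different alert, such as the completion alert.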
After the application repair is complete, the previous alert is destroyed and the following alert is posted on the grid dashboard (assuming everything was recovered successfully):
Grid recovery completed on time: There were N active application(s) when the grid controller went down. N application(s) have been recovered. The state of P application(s) has been reacquired.
If there was a failure recovering the applications, the following alert is posted on the grid dashboard:
If an application fails to be recovered, use the list log command to view the controller log for details regarding the failure. Usually applications fail to be recovered for one or more of the following reasons:
Note: During the automated application repair process, CA AppLogic does not allow the user/grid-administrator to execute destructive CLI commands. This includes any command that affects the state of the grid or any server, application, component, class, catalog or volume. The following error message is displayed if a destructive command is executed during application repair:
Cannot execute command at this time - the grid controller is currently busy recovering from a failure.
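The behavior in the note above amounts to a guard in front of every state-changing command. The following sketch shows the idea only; `run_command`, `ControllerBusyError`, and the `destructive` flag are hypothetical names, and how CA AppLogic actually classifies commands is not described here.

```python
BUSY_MESSAGE = (
    "Cannot execute command at this time - the grid controller "
    "is currently busy recovering from a failure."
)


class ControllerBusyError(Exception):
    """Hypothetical error raised while application repair is in progress."""


def run_command(action, destructive, repair_in_progress):
    """Run `action` unless it is destructive and a repair is in progress.

    `action` is any zero-argument callable; `destructive` is supplied by the
    caller because the real classification of commands is not known here.
    """
    if destructive and repair_in_progress:
        raise ControllerBusyError(BUSY_MESSAGE)
    return action()
```

Read-only commands pass through the guard unchanged, while commands flagged as destructive fail with the error message quoted above until the repair finishes.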
Important: Applications are repaired by CA AppLogic using the app repair command. This command is valid only for applications that are in a FAILED state. Users may execute this command directly to repair applications that may have failed (that is, to restore an application where the user has completed the debugging of failed components).
Certain grid failures can occur in which the grid controller is not automatically restarted by CA AppLogic. These cases, as observed by the user, are summarized below. If any of the following situations is encountered, contact your service provider immediately.
Partners only: See the grid controller recovery topic for more information about how to recover your grid if one of these failures occurs.
Copyright © 2013 CA Technologies.
All rights reserved.