

Automated Recovery of Applications and Services

This section contains the following topics:

Server Failure

Appliance Failure

Network Failures

Server Failure

CA AppLogic can automatically recover from the loss of one or more physical servers. A physical server that is part of a CA AppLogic grid may fail for any of the following reasons:

To help tolerate server failures, CA AppLogic mirrors all volumes across the servers of a grid (by default, each volume has two mirror copies). Volume mirroring allows appliances to sustain operation through a physical server failure, unless the appliance itself was running on the failed server.

CA AppLogic detects a failed server by the loss of the server's network connection to the grid controller (typically within 3 minutes of the server failure). When the failure is detected, any appliances that were running on that server are automatically scheduled to run on other servers in the grid. Appliances can be restarted only if there are enough available resources in the grid. CA AppLogic displays an alert on the grid dashboard if there are not enough available resources to restart the appliances after a server failure. If this alert is present on the grid dashboard, contact your service provider so that additional servers can be added to the grid.

There are not enough available resources to restart components running on n server(s) [list_of_servers]. 

Upon server failure and the automatic restart of appliances, CA AppLogic posts recovery alerts on the grid dashboard. As an example, a user will see the following alerts upon the failure of srv3 in their grid (assuming that there were appliances running on srv3 and there are enough available grid resources to restart the appliances):

When an appliance is successfully restarted after the server failure, the previous two alerts are destroyed and the following alert is posted on the grid dashboard:

If CA AppLogic is unable to restart one or more appliances, one of the following alerts is posted on the grid dashboard for each failed appliance. Use the list log command to view the controller log for details on exactly why the appliance failed to be started.
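
For example, a user can review the controller log from the CLI to find the reason for a failed restart. Only the list log command itself is taken from this section; filtering or paging of its output is left to the user:

list log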

Server Flapping

If a physical server fails three times within a 24-hour period (known as server flapping), CA AppLogic automatically disables that server (using the srv disable command). This prevents resources from being scheduled on the server, because it is likely to fail again. The server can be re-enabled using the srv enable command. Once re-enabled, the server must again fail three times within a 24-hour period before it is automatically disabled again.
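
For example, after the underlying hardware problem has been diagnosed and fixed, a grid administrator can return the server to service. The server name srv3 is a placeholder borrowed from the example above:

srv enable srv3    # srv3 is a placeholder server name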

The following alert is posted to the grid dashboard when CA AppLogic detects that a server is flapping:

Server server was disabled on date because it has gone down too often within the specified time period.

Degraded Volumes

After a server failure, CA AppLogic volumes that had a mirror on the failed server become degraded (assuming those volumes have mirrors on other available servers). CA AppLogic automatically repairs degraded volumes.

CA AppLogic volumes are automatically repaired in the following priority order, ensuring that the most important volumes are repaired first and reducing the risk of application or grid downtime:

  1. system volumes (boot, meta, impex)
  2. application user volumes
  3. application local catalog volumes
  4. global catalog and _GLOBAL volumes
  5. volcache volumes
  6. all other volumes

After a server failure, CA AppLogic waits 4 hours before it attempts to repair the volumes that were degraded as a result of the failure. This gives the server a chance to be recovered so that the degraded volumes can be repaired on the same server to which their streams were originally assigned. If the failed server is not recovered within 4 hours, CA AppLogic repairs the degraded volumes using the other servers in the grid. A user can initiate an immediate repair of a volume that needs it right after a server failure by using the vol repair vol_name --force command.
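
For example, a user who does not want to wait out the 4-hour grace period can force an immediate repair. The volume name my_app_vol is a placeholder used only for illustration:

vol repair my_app_vol --force    # my_app_vol is a placeholder volume name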

The automated volume repair runs once every 6 hours to collect the list of degraded volumes and initiate repairs on them. Users do not have to do anything to repair degraded volumes; volume repair is automatic. A user can instruct CA AppLogic to retrieve the current list of degraded volumes by executing vol check; this helps ensure that all currently degraded volumes are scheduled for repair.
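
A minimal example of refreshing the degraded-volume list ahead of the next scheduled repair pass; no arguments beyond those shown in this section are assumed:

vol check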

If there are particular volumes that should not be repaired, or whose repair should be delayed, you can suspend the repair of a single volume or of all volumes by using the vol suspend command mentioned below. Volume repairs can be suspended for a maximum of 1 week. The maximum volume suspend time can be configured by the grid administrator. See the automated volume repair configuration topic for more information (this topic is accessible only to grid administrators).

In addition to automatic volume repair, CA AppLogic allows the user to execute the following volume maintenance operations:

Note: The suspend, resume and status operations may also be executed over all volumes by omitting the volume name. See the vol repair CLI reference for more information.
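
As a sketch only, suspending the repair of one volume and then resuming repairs for all volumes might look like the following. The exact syntax is an assumption here and should be confirmed against the vol repair CLI reference; my_app_vol is a placeholder volume name:

vol suspend my_app_vol    # assumed syntax; my_app_vol is a placeholder volume name
vol resume                # assumed syntax; omitting the volume name applies the operation to all volumes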

Server Rebooting and Power Control

Servers can reboot under the control of CA AppLogic in the following three cases:

If a grid is configured to use server management (that is, power control), CA AppLogic automatically power-cycles servers that lose their connection to the grid controller, as stated above. This enables faster recovery of appliances in the case of a server failure. In addition, the user can take advantage of server management to execute any of the following server operations:

See the CLI reference for more information about the server power control commands.

Notes:

Appliance Failure

CA AppLogic automatically restarts appliances that have crashed or shut down unexpectedly.

The appliance is restarted on the same server where it was running, with exactly the same resources and settings. Currently, CA AppLogic detects a failed appliance when the appliance's virtual machine disappears from the server, which occurs if the appliance crashes or is shut down or rebooted. Typically, the failure is detected and the appliance is restarted within 1 minute. The total restart time also depends on how long the appliance itself takes to boot.

When an appliance failure is detected, the following alert is posted on the grid dashboard:

When an appliance is successfully restarted after failure, the previous alert is destroyed and the following alert is posted on the grid dashboard:

If CA AppLogic is unable to restart one or more appliances, the following alert is posted on the grid dashboard for each failed appliance. Use the list log command to view the controller log for details on exactly why the appliance failed to be started.

Failed to restart appliance 'comp_name' on date after appliance failure. 

Notes:

Appliance Flapping

If an appliance fails three times within a 24-hour period (known as appliance flapping), CA AppLogic does not automatically restart that appliance. The appliance is left in the STANDBY state and can be started manually by a user using comp start. Once restarted, the appliance must again fail three times within a 24-hour period before CA AppLogic stops restarting it automatically.
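
For example, after investigating the cause of the flapping, a user can start the appliance manually. The name comp_name is a placeholder matching the alert template shown earlier:

comp start comp_name    # comp_name is a placeholder appliance name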

The following alert is posted to the grid dashboard when CA AppLogic detects that an appliance is flapping:

Note: Appliance flapping does not apply to appliances that use field engineering code 64 (appliance reboots/shutdowns are ignored by CA AppLogic).

Network Failures

CA AppLogic can tolerate failures in the external (public) or backbone (private) network with no application downtime, provided that the grid's hardware setup is configured for Network High Availability and the network is not already degraded. A user can verify that network HA is enabled and available on their grid by executing the grid info command.
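
For example, a quick check of the grid's network HA status from the CLI; the command is taken directly from this section, and the reported fields depend on the grid configuration:

grid info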

When CA AppLogic detects a network failure, it reports the issue to the user by posting a message to the grid dashboard. In addition, the HA state of the affected network can be viewed by inspecting the grid HA state available on the grid dashboard or via the grid info command. CA AppLogic will detect a network failure on the external or backbone network when one of the following occurs:

Provided that the above failures occur when the affected network is not already in a degraded state, there should not be any application downtime, although there might be a brief interruption in network connectivity (a few seconds) during recovery. If network HA is already degraded and an additional network failure occurs, the entire grid and all applications might be affected, depending on the type of failure.

Note: If you are using a network HA configuration and an external network failure occurs, applications/appliances that use external interfaces may become inaccessible for up to 5 minutes. This appears to be caused by the external router caching MAC addresses. Waiting for the router to flush its ARP cache, or sending an ARP response with arping from the application, restores operation. Only the external network is affected; the backbone network is not.
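
A sketch of the arping workaround, assuming the iputils arping tool is available inside the appliance. The interface name eth0 and the address 203.0.113.10 (the appliance's external IP) are placeholders, and the available flags depend on the arping variant installed:

arping -U -c 3 -I eth0 203.0.113.10    # announce the appliance's external IP; eth0 and 203.0.113.10 are placeholders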

Administrators can view information related to network HA, including a description of the network topology, via the grid info --verbose and srv info --extended commands.
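
For example, both commands below are taken directly from this section; any server-selection arguments that srv info may require are omitted here:

grid info --verbose
srv info --extended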

CA AppLogic provides grid administrators with the following commands, which can be used to dynamically configure network HA:

In addition, a server command gives grid administrators the ability to identify servers and NICs by having the specified NIC on the server blink its LED for a minute. The format of the command is:

srv identify <server name> nic=<NIC name>
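
For example, to flash the LED of a NIC on a particular server (srv3 and eth0 are placeholder names used only for illustration):

srv identify srv3 nic=eth0    # srv3 and eth0 are placeholders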

See the Grid CLI reference and CLI reference for more information about the grid and server network HA configuration commands respectively.