

Automated Recovery of Applications and Services

This section contains the following topics:

Server Failure

Appliance Failure

Network Failures

Server Failure

CA AppLogic can automatically recover from the loss of one or more physical servers. A physical server that is part of a CA AppLogic grid may fail for any of the following reasons:

To help tolerate server failures, CA AppLogic mirrors all volumes across the servers of a grid (by default, each volume has two mirror copies). Volume mirroring allows appliances to sustain operation through a physical server failure, unless the appliance itself was running on the failed server.

CA AppLogic detects a failed server by the loss of the server's network connection to the grid controller (typically within 3 minutes of the server failure). When the failure is detected, any appliances that were running on that server are automatically scheduled to run on other servers in the grid. Appliances can be restarted only if there are enough available resources in the grid. CA AppLogic displays an alert on the grid dashboard if there are not enough available resources to restart the appliances after a server failure. If this alert is present on the grid dashboard, contact your service provider so that additional servers can be added to the grid.

There are not enough available resources to restart components running on n server(s) [list_of_servers]. 

Upon server failure and the automatic restart of appliances, CA AppLogic posts recovery alerts on the grid dashboard. As an example, a user will see the following alerts upon the failure of srv3 in their grid (assuming that there were appliances running on srv3 and there are enough available grid resources to restart the appliances):

When an appliance is successfully restarted after the server failure, the previous two alerts are destroyed and the following alert is posted on the grid dashboard:

If CA AppLogic is unable to restart one or more appliances, one of the following alerts is posted on the grid dashboard for each failed appliance. Use the list log command to view the controller log for details on exactly why the appliance failed to be started.
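
For example, a user can review the controller log from the CLI to find the reason for a failed restart. Only the list log command itself is taken from this section; filtering or paging of its output is left to the user:

list log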

Server Flapping

If a physical server fails three times within a 24-hour period (known as server flapping), CA AppLogic automatically disables that server (using the srv disable command). This prevents resources from being scheduled on the server, because it is likely to fail again. The server can be re-enabled using the srv enable command. Once re-enabled, the server must again fail three times within a 24-hour period before it is automatically disabled again.
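
For example, after the underlying hardware problem has been diagnosed and fixed, a grid administrator can return the server to service. The server name srv3 is a placeholder borrowed from the example above:

srv enable srv3    # srv3 is a placeholder server name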

The following alert is posted to the grid dashboard when CA AppLogic detects that a server is flapping:

Server server was disabled on date because it has gone down too often within the specified time period.

Degraded Volumes

After a server failure, CA AppLogic volumes that had a mirror on the failed server become degraded (assuming those volumes have mirrors on other available servers). CA AppLogic automatically repairs degraded volumes.

CA AppLogic volumes are automatically repaired in the following priority order, ensuring that the most important volumes are repaired first and reducing the risk of application or grid downtime:

  1. system volumes (boot, meta, impex)
  2. application user volumes
  3. application local catalog volumes
  4. global catalog and _GLOBAL volumes
  5. volcache volumes
  6. all other volumes

After a server failure, CA AppLogic waits 4 hours before it attempts to repair the volumes that were degraded as a result of the failure. This gives the server a chance to be recovered so that the degraded volumes can be repaired on the same server to which their streams were originally assigned. If the failed server is not recovered within 4 hours, CA AppLogic repairs the degraded volumes using the other servers in the grid. A user can initiate an immediate repair of a volume that needs it right after a server failure by using the vol repair vol_name --force command.
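
For example, a user who does not want to wait out the 4-hour grace period can force an immediate repair. The volume name my_app_vol is a placeholder used only for illustration:

vol repair my_app_vol --force    # my_app_vol is a placeholder volume name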

The automated volume repair runs once every 6 hours to collect the list of degraded volumes and initiate repairs on them. Users do not have to do anything to repair degraded volumes; volume repair is automatic. A user can instruct CA AppLogic to retrieve the current list of degraded volumes by executing vol check; this helps ensure that all currently degraded volumes are scheduled for repair.
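
A minimal example of refreshing the degraded-volume list ahead of the next scheduled repair pass; no arguments beyond those shown in this section are assumed:

vol check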

If there are particular volumes that should not be repaired, or whose repair should be delayed, you can suspend the repair of a single volume or of all volumes by using the vol suspend command mentioned below. Volume repairs can be suspended for a maximum of 1 week. The maximum volume suspend time can be configured by the grid administrator. See the automated volume repair configuration topic for more information (this topic is accessible only to grid administrators).

In addition to automatic volume repair, CA AppLogic allows the user to execute the following volume maintenance operations:

Note: The suspend, resume and status operations may also be executed over all volumes by omitting the volume name. See the vol repair CLI reference for more information.
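
As a sketch only, suspending the repair of one volume and then resuming repairs for all volumes might look like the following. The exact syntax is an assumption here and should be confirmed against the vol repair CLI reference; my_app_vol is a placeholder volume name:

vol suspend my_app_vol    # assumed syntax; my_app_vol is a placeholder volume name
vol resume                # assumed syntax; omitting the volume name applies the operation to all volumes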

Server Rebooting and Power Control

Servers can reboot under the control of CA AppLogic in the following three cases:

If a grid is configured to use server management (that is, power control), CA AppLogic automatically power-cycles servers that lose their connection to the grid controller, as stated above. This enables faster recovery of appliances in the case of a server failure. In addition, the user can take advantage of server management to execute any of the following server operations:

See the CLI reference for more information about the server power control commands.

Notes:

Appliance Failure

CA AppLogic automatically restarts appliances that have crashed or shut down unexpectedly.

The appliance is restarted on the same server where it was running, with exactly the same resources and settings. Currently, CA AppLogic detects a failed appliance when the appliance's virtual machine disappears from the server, which occurs if the appliance crashes or is shut down or rebooted. Typically, the failure is detected and the appliance is restarted within 1 minute. The total restart time also depends on how long the appliance itself takes to boot.

When an appliance failure is detected, the following alert is posted on the grid dashboard:

When an appliance is successfully restarted after failure, the previous alert is destroyed and the following alert is posted on the grid dashboard:

If CA AppLogic is unable to restart one or more appliances, the following alert is posted on the grid dashboard for each failed appliance. Use the list log command to view the controller log for details on exactly why the appliance failed to be started.

Failed to restart appliance 'comp_name' on date after appliance failure. 

Notes:

Appliance Flapping

If an appliance fails three times within a 24-hour period (known as appliance flapping), CA AppLogic does not automatically restart that appliance. The appliance is left in the STANDBY state and can be started manually by a user using comp start. Once restarted, the appliance must again fail three times within a 24-hour period before CA AppLogic stops restarting it automatically.
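
For example, after investigating the cause of the flapping, a user can start the appliance manually. The name comp_name is a placeholder matching the alert template shown earlier:

comp start comp_name    # comp_name is a placeholder appliance name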

The following alert is posted to the grid dashboard when CA AppLogic detects that an appliance is flapping:

Note: Appliance flapping does not apply to appliances that use field engineering code 64 (appliance reboots/shutdowns are ignored by CA AppLogic).

Network Failures

CA AppLogic can tolerate failures in the external (public) or backbone (private) network with no application downtime, provided that the grid's hardware setup is configured for Network High Availability and the network is not already degraded. A user can verify that network HA is enabled and available on their grid by executing the grid info command.
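
For example, a quick check of the grid's network HA status from the CLI; the command is taken directly from this section, and the reported fields depend on the grid configuration:

grid info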

When CA AppLogic detects a network failure, it reports the issue to the user by posting a message to the grid dashboard. In addition, the HA state of the affected network can be viewed by inspecting the grid HA state available on the grid dashboard or via the grid info command. CA AppLogic will detect a network failure on the external or backbone network when one of the following occurs:

Provided that the above failures occur when the affected network is not already in a degraded state, there should not be any application downtime, although there might be a brief interruption in network connectivity (a few seconds) during recovery. If network HA is already degraded and an additional network failure occurs, the entire grid and all applications might be affected, depending on the type of failure.

Note: If you are using a network HA configuration and an external network failure occurs, applications/appliances that use external interfaces may become inaccessible for up to 5 minutes. This appears to be caused by the external router caching MAC addresses. Waiting for the router to flush its ARP cache, or sending an ARP response with arping from the application, restores operation. Only the external network is affected; the backbone network is not.
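
A sketch of the arping workaround, assuming the iputils arping tool is available inside the appliance. The interface name eth0 and the address 203.0.113.10 (the appliance's external IP) are placeholders, and the available flags depend on the arping variant installed:

arping -U -c 3 -I eth0 203.0.113.10    # announce the appliance's external IP; eth0 and 203.0.113.10 are placeholders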

Administrators can view information related to network HA, including a description of the network topology, via the grid info --verbose and srv info --extended commands.
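
For example, both commands below are taken directly from this section; any server-selection arguments that srv info may require are omitted here:

grid info --verbose
srv info --extended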

CA AppLogic provides grid administrators with the following commands, which can be used to dynamically configure network HA:

In addition, a server command gives grid administrators the ability to identify servers and NICs by having the specified NIC on the server blink its LED for a minute. The format of the command is:

srv identify <server name> nic=<NIC name>
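
For example, to flash the LED of a NIC on a particular server (srv3 and eth0 are placeholder names used only for illustration):

srv identify srv3 nic=eth0    # srv3 and eth0 are placeholders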

See the Grid CLI reference and CLI reference for more information about the grid and server network HA configuration commands respectively.