Detection of Potential Application and Grid Issues

CA AppLogic proactively detects several types of issues with running applications, the grid, and its servers, and alerts the user or grid administrator to potential problems that could endanger the availability of running applications. This section describes the issues CA AppLogic detects and what the user or grid administrator can do to avoid unnecessary application and grid downtime.

Note: See the Dashboard Notification Messages Guide for a full list of the alerts that can be posted to a grid dashboard.

This section contains the following topics:

Application Recovery Issues

Grid Controller Recovery Issues and Improper Grid Configuration

Grid Issues

Server Hardware Failure Detection

Application Recovery Issues

CA AppLogic detects the following problems, each of which can cause application downtime if one of the servers fails:

Insufficient grid resources; there are not enough available resources in the grid to restart one or more running appliances
- Dashboard alert: There are not enough available resources to restart components running on n server(s) [list]
- What to do: Contact your service provider to add more servers to your grid so there are enough resources to recover from server failures.
Use of degraded volumes (volumes that are not in the OK state); downtime results if the failed server holds the only stream for one or more of the application's volumes
- Dashboard alert: n running applications have degraded volumes [list]
- What to do: No action is required; CA AppLogic automatically repairs these volumes in the background.
Use of a degraded shared catalog volume; this can cause widespread downtime because the volume is shared across all class instances
- Dashboard alert: n catalog class(es) with shared volumes have degraded volumes [list]
- What to do: No action is required; CA AppLogic automatically repairs these volumes in the background.
All servers in the grid are disabled, which will cause application downtime upon a server failure
- Dashboard alert: HA is unavailable due to the grid having no enabled servers.
- What to do: Enable one or more of the disabled servers using the srv enable command.
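
The "insufficient grid resources" condition above can be illustrated with a small capacity check. The following is a minimal sketch, not CA AppLogic's actual algorithm: the Server class, the single memory dimension, and the first-fit-decreasing placement are all assumptions made for illustration. For each server, it asks whether the appliances running there would fit into the free memory of the remaining servers.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical model of a grid server (not a CA AppLogic type)."""
    name: str
    memory_mb: int                                      # memory usable by appliances
    appliances_mb: list = field(default_factory=list)   # memory of each running appliance

    @property
    def free_mb(self):
        return self.memory_mb - sum(self.appliances_mb)


def servers_without_failover_room(servers):
    """Return names of servers whose appliances could not all be placed elsewhere."""
    at_risk = []
    for failed in servers:
        # Free memory on every surviving server.
        free = [s.free_mb for s in servers if s is not failed]
        # First-fit decreasing: try to place the largest appliances first.
        for need in sorted(failed.appliances_mb, reverse=True):
            for i, room in enumerate(free):
                if room >= need:
                    free[i] -= need
                    break
            else:
                # No surviving server has room for this appliance.
                at_risk.append(failed.name)
                break
    return at_risk


grid = [
    Server("srv1", 8192, [4096, 2048]),
    Server("srv2", 8192, [6144]),
    Server("srv3", 8192, [4096]),
]
print(servers_without_failover_room(grid))  # ['srv2', 'srv3']
```

First-fit decreasing is a heuristic, so this sketch can occasionally flag a server even though a cleverer placement exists; a real check would also have to account for CPU and any other resources the grid schedules.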

Grid Controller Recovery Issues and Improper Grid Configuration

CA AppLogic detects the following grid controller recovery issues, which could cause the grid controller to become inaccessible if a grid controller server fails:

Grid does not have controller HA due to one or more of the grid controller servers being down
- Dashboard alert: Grid does not have controller HA. X of Y controller servers are down. To restore controller HA, Z of the following controller servers have to be brought back online: list of servers
- What to do: Bring the specified servers back online or add new servers to the grid. Contact your service provider for assistance.
Improper grid configuration for controller HA
- Dashboard alert: The grid is not configured for controller HA; a secondary controller server needs to be assigned or else the grid cannot recover from grid controller server failures. Assign one of the running servers as a secondary controller server to enable controller HA on the grid.
- What to do: No server is assigned as a secondary (backup) grid controller. Contact your service provider immediately.
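
For administrators who collect dashboard alerts in their own tooling, the "X of Y controller servers are down" alert above can be broken into its parts. This is a hedged sketch: the function name is hypothetical, the wording is taken from the alert text quoted in this section, and the assumption that the trailing server list is comma-separated is not confirmed by this guide.

```python
import re

# Pattern built from the alert wording quoted in this section.
ALERT_RE = re.compile(
    r"Grid does not have controller HA\. "
    r"(?P<down>\d+) of (?P<total>\d+) controller servers are down\. "
    r"To restore controller HA, (?P<needed>\d+) of the following controller "
    r"servers have to be brought back online: (?P<servers>.+)"
)


def parse_controller_ha_alert(message):
    """Return the counts and server names from the alert, or None if it does not match."""
    m = ALERT_RE.search(message)
    if m is None:
        return None
    return {
        "down": int(m["down"]),
        "total": int(m["total"]),
        "needed": int(m["needed"]),
        # Assumption: server names are separated by commas.
        "servers": [s.strip() for s in m["servers"].split(",") if s.strip()],
    }


example = (
    "Grid does not have controller HA. 1 of 2 controller servers are down. "
    "To restore controller HA, 1 of the following controller servers "
    "have to be brought back online: srv2"
)
print(parse_controller_ha_alert(example))
```

The "needed" count is the useful part for planning: it says how many of the listed servers must come back before controller HA is restored, which may be fewer than the number that are down.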

CA AppLogic detects the following improper grid configurations, which could cause grid failures or application downtime:

Single-server grids do not have HA features
- Dashboard alert: HA is unavailable due to the grid being a single server grid.
- What to do: Most of CA AppLogic's HA features require at least two servers. Contact your service provider to add at least one more server to your grid to take advantage of the HA features described in this document.
Grid is not configured with the appropriate amount of controller memory, controller CPU, or server memory
- Dashboard alert: Grid resources are not configured correctly. This may lead to degradation in grid performance or grid instability. Update the following grid resources on your grid or contact technical support: controller memory | controller CPU | server memory
- What to do: Contact your service provider immediately. The grid must be reconfigured to use the correct amount of resources; otherwise the grid might become unstable, which may affect the uptime of running applications.

Grid Issues

CA AppLogic detects several types of grid issues that may cause application start failures or other problems:

CA AppLogic failed to clean up a volume mount on one of the servers.
- Dashboard alert: Failed to destroy mount 'volume name'. Unable to stop device 'mount device'. Contact technical support.
- What to do: Contact your service provider immediately. This issue may cause application start failures.
CA AppLogic failed to clean up a volume share on one of the servers.
- Dashboard alert: Failed to unshare volume stream 'volume name'. Unable to detach volume from 'loop device'. Contact technical support.
- What to do: Contact your service provider immediately. This issue may cause application start failures.
The NTP daemon that is used to synchronize the time between all servers, appliances and the grid controller has been restarted.
- Dashboard alert: The NTP daemon was found not to be running on the server, but has been successfully restarted.
- What to do: This is only a warning message. The NTP daemon crashed or stopped, so CA AppLogic restarted it. Contact your service provider.
The NTP daemon that is used to synchronize the time between all servers and the grid controller is not running.
- Dashboard alert: The NTP daemon was found not to be running on the server and could not be restarted. The time on the server and the time in the appliances running on the server will no longer be synchronized with the clock on the grid controller. Contact technical support for assistance.
- What to do: Contact your service provider immediately. The clocks on the servers, appliances, and grid controller may eventually drift out of sync. An administrator needs to restart the NTP daemon manually.
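
The grid issues above differ in urgency: the restarted NTP daemon is described as a warning only, while the other three call for contacting the service provider immediately. A minimal sketch of that triage follows, assuming alerts are available as plain strings. The rule table, the severity labels "urgent" and "warning", and the function name are illustrative and are not part of CA AppLogic.

```python
# (text fragment from the dashboard alert, severity label) pairs.
# Fragments come from the alert wording in this section; labels are assumptions.
GRID_ISSUE_RULES = [
    ("Failed to destroy mount", "urgent"),
    ("Failed to unshare volume stream", "urgent"),
    ("but has been successfully restarted", "warning"),
    ("could not be restarted", "urgent"),
]


def triage(alert):
    """Return the severity label for a grid issue alert, or 'unknown' if no rule matches."""
    for fragment, severity in GRID_ISSUE_RULES:
        if fragment in alert:
            return severity
    return "unknown"


print(triage(
    "The NTP daemon was found not to be running on the server, "
    "but has been successfully restarted."
))  # warning
```

Matching on short fragments rather than whole messages keeps the rules stable when the quoted volume or device names change from alert to alert.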

Server Hardware Failure Detection

CA AppLogic detects the following hardware issues on the servers within a grid:

The hard disk on a server is beginning to fail
- Dashboard alert: Possible storage system failure on Server server. Error: Device: device, failure message
- What to do: The hard disk on the specified server is likely to fail, which can cause data loss. CA AppLogic automatically disables such a server so that it is not used for appliances or volumes. Contact your service provider immediately. The volumes and appliances need to be migrated off the server, and its hard disk needs to be replaced.