Storage Health Monitoring

Overview

To help prevent data loss in the event of a storage device failure, CA 3Tera AppLogic monitors the health of all storage devices that support Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART). S.M.A.R.T. is a monitoring system for computer hard disks that detects and reports on various indicators of reliability so that failures can be anticipated.

Fundamentally, storage device failures fall into one of two basic classes:

Predictable failures: These types of failures happen gradually over time, such as mechanical wear and gradual degradation of storage surfaces. A monitoring device can detect these problems.
Unpredictable failures: These types of failures happen suddenly and without warning. These failures range from defective electronic components to a sudden mechanical failure.

Mechanical failures account for about 60 percent of all drive failures. Most mechanical failures result from gradual wear, although an eventual failure may be catastrophic. However, before complete failure occurs, there are usually certain indications that failure is imminent. These may include increased heat output, increased noise level, problems with reading and writing of data, an increase in the number of damaged disk sectors, and so on.

Degree of Support

CA 3Tera AppLogic provides support for monitoring the following types of hard disks provided that the disk itself provides S.M.A.R.T. support and successfully responds to smartctl -i:

ATA drives
SCSI drives
ATA drives sitting behind the following RAID controllers:
- 3Ware 6/7/8000 series controller
- 3Ware 9000 series controller
- HighPoint RocketRaid controller
- MegaRaid controller
- Compaq Smart Raid controller
- Areca Raid controller

Notes:

Support for disks behind RAID controllers is restricted to the controllers listed above. This is a limitation of the smartmontools package, which CA 3Tera AppLogic uses to monitor the disks.
The smartctl -i command can be used directly on the server to determine whether its storage devices can be monitored by CA 3Tera AppLogic. CA recommends verifying that CA 3Tera AppLogic can monitor the server's storage devices before adding the server to a grid.
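
The pre-check described in the notes above can be scripted. The following is a minimal sketch, not the check CA 3Tera AppLogic itself performs: it assumes that smartctl -i prints "SMART support is:" lines in the form recent smartmontools releases use for ATA disks (other device types or older releases may word this differently), and the function names are hypothetical.

```python
import subprocess


def smart_support_from_info(info_text):
    """Return (available, enabled) parsed from `smartctl -i` output text.

    Assumption: the output contains lines such as
    "SMART support is: Available - device has SMART capability." and
    "SMART support is: Enabled".
    """
    available = False
    enabled = False
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("SMART support is:"):
            continue
        value = line.split(":", 1)[1].strip()
        if value.startswith("Available"):
            available = True
        elif value.startswith("Enabled"):
            enabled = True
    return available, enabled


def check_device(device):
    """Run `smartctl -i DEVICE` and parse the result.

    Requires smartmontools on the server and usually root privileges.
    """
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes as status bits
    )
    return smart_support_from_info(result.stdout)
```

For example, check_device("/dev/sda") could be run for each disk on a candidate server before it is added to a grid; the device path is only an illustration.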

Determining if storage health monitoring is supported/enabled on your grid's servers

The following alerts, logged to the grid dashboard on a per-server basis, indicate whether storage health monitoring is enabled and which storage devices are not monitored by CA 3Tera AppLogic:

INFO: The following hard disks are not monitored on server server: list of storage devices
INFO: Storage failure detection is disabled on server name

In addition, to determine if storage health monitoring is supported for a particular server within a grid, execute the following command for the server:
3t srv info name --extended
and inspect the --- Disk Check Information --- section of the output. If at least one storage device can be monitored on a server, the Supported value is yes; otherwise it is no.

What to do if the "Possible storage system failure" dashboard alert is present on a grid

This section describes the action(s) that should be taken when a storage system failure alert is logged to the grid dashboard.

The following messages are critical errors and indicate either imminent or potential storage failure.

Device DEVICE, SMART Failure: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE.
Device DEVICE, FAILED SMART self-check. BACK UP DATA NOW!
Device DEVICE, Failed SMART usage Attribute: ATTRIBUTE
Device DEVICE, Self-Test Log error count increased from M to N
Device DEVICE, not capable of SMART self-check
Device DEVICE, failed to read SMART Attribute
Device DEVICE, unable to open device
Device DEVICE, Read SMART Error Log Failed
Device DEVICE, Read SMART Self-Test Log Failed
Device DEVICE, Temperature NUM Celsius reached critical limit of NUM2 Celsius (Min/Max M/N)

If one of the alerts above is present on the grid dashboard, what can be done to save your data depends upon the state of the volumes that have streams on the failing server:

Find out if there are any degraded volumes that have their only good stream/mirror on the failing server.
- Execute vol list server=srvX and note all degraded volumes
- For each degraded volume, execute vol info volume-name and note any volume that has its only good mirror on the failing server.
If all volumes with streams/mirrors on the failing server are in the OK state (that is, they are not degraded):
- Disable the server using the srv disable command.
- Restart the applications that have components running on the server (app restart).
- After the applications have been restarted, contact your service provider to take the failing server offline for service.
- Taking the server offline will result in degraded volumes; however, CA 3Tera AppLogic will automatically repair those volumes in the background.
- Once the server is repaired, it can be brought back online and re-enabled with srv enable.
If there is at least one volume that is degraded with its only good stream/mirror on the failing server:
- The volumes must be repaired. Use the vol repair volume-name --force command to force CA 3Tera AppLogic to repair these volumes as soon as possible. Because the storage device may fail during the repair process, it is recommended to repair the volumes in order of the importance of their data, as follows:
  - application volumes
  - singleton volumes
  - catalog/class volumes
  - volcache volumes (only if the volcache volumes have been incorrectly modified from the original class)
  - global volumes
- When all repairs for these volumes have completed (check with vol repair --status), follow all of the steps given above for the case in which all volumes are in the OK state.
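
The decision procedure above can be summarized as a small planning function. This is an illustrative sketch only: the Volume fields, the kind labels, and the wording of the non-repair steps are assumptions made for the example, and the function only builds a list of steps; it does not invoke any CA 3Tera AppLogic command. The values would be filled in by hand from the output of vol list and vol info.

```python
from dataclasses import dataclass

# Repair priority taken from the recommended order above (lower = earlier).
REPAIR_PRIORITY = {
    "application": 0,
    "singleton": 1,
    "catalog/class": 2,
    "volcache": 3,
    "global": 4,
}


@dataclass
class Volume:
    """Hypothetical record describing one volume on the failing server."""
    name: str
    kind: str  # one of the REPAIR_PRIORITY keys
    degraded: bool
    only_good_mirror_on_failing_server: bool = False


def plan_recovery(volumes):
    """Return the ordered recovery steps for a failing server."""
    # Volumes whose only good mirror is on the failing server come first.
    at_risk = [
        v for v in volumes
        if v.degraded and v.only_good_mirror_on_failing_server
    ]
    # sort() is stable, so volumes of the same kind keep their input order.
    at_risk.sort(key=lambda v: REPAIR_PRIORITY[v.kind])

    steps = [f"vol repair {v.name} --force" for v in at_risk]
    if at_risk:
        steps.append("vol repair --status (wait until the repairs complete)")
    steps += [
        "srv disable (the failing server)",
        "app restart (each application with components on the server)",
        "contact the service provider to take the server offline",
        "srv enable (after the server is repaired and back online)",
    ]
    return steps
```

Given one at-risk global volume and one at-risk application volume, the plan lists the application volume's repair first, then the global volume's repair, and only then the steps for disabling the server.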

The following messages are informational and may not indicate possible disk failure. However, you should contact your grid service provider for assistance in diagnosing the particular problem.

Device DEVICE, ATA error count increased from M to N:
Device DEVICE, NUM Currently unreadable (pending) sectors:
Device DEVICE, NUM Offline uncorrectable sectors:
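
When reviewing a long list of dashboard alerts, it can help to sort them into the critical and informational groups described in this section. The sketch below does this by matching the fixed parts of the message texts listed above. It assumes that alerts are available as plain strings and that placeholders such as DEVICE and NUM are replaced by real values (the device paths shown in the usage note are illustrative); the function name is hypothetical.

```python
import re

# Fixed fragments of the critical messages listed in this section.
CRITICAL_PATTERNS = [
    r"SMART Failure",
    r"FAILED SMART self-check",
    r"Failed SMART usage Attribute",
    r"Self-Test Log error count increased",
    r"not capable of SMART self-check",
    r"failed to read SMART Attribute",
    r"unable to open device",
    r"Read SMART Error Log Failed",
    r"Read SMART Self-Test Log Failed",
    r"Temperature \d+ Celsius reached critical limit",
]

# Fixed fragments of the informational messages listed in this section.
INFORMATIONAL_PATTERNS = [
    r"ATA error count increased",
    r"Currently unreadable \(pending\) sectors",
    r"Offline uncorrectable sectors",
]


def classify_alert(message):
    """Return 'critical', 'informational', or 'unknown' for one alert."""
    if any(re.search(p, message) for p in CRITICAL_PATTERNS):
        return "critical"
    if any(re.search(p, message) for p in INFORMATIONAL_PATTERNS):
        return "informational"
    return "unknown"
```

For instance, classify_alert("Device /dev/sda, FAILED SMART self-check. BACK UP DATA NOW!") yields "critical", while an alert reporting pending sectors yields "informational". Any alert that matches neither list is reported as "unknown" rather than guessed at.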