Major Grid Crash Recovery, Controller Boot Volume Corrupted

Recovery after a major grid crash in which the controller VM and the primary server showed filesystem corruption.

Issue: The grid servers crashed, including the primary. After several reboots the grid was up and running, but the primary showed disk errors in the controller boot volume. Server 1 (the primary) also exhibited filesystem errors on its own disk.

The following procedure was used to correct the disk errors on the primary and to move the controller to a secondary server.

SSHed into srv1 of the grid
Executed 'xm list' to see if the grid controller was running
Executed 'xm console controller' to open the console of the grid controller
From the console, saw filesystem errors for the grid controller’s boot volume
Stopped the grid controller by executing '/usr/local/apl-srv/bin/ctlb_ctl2.sh stop'
Executed 'sdinit cmd=read' to get the location of the streams for the grid controller’s boot volume
Saw that the one good stream for the grid controller’s boot volume was on srv2
SSHed into srv2 of the grid
Executed 'ps aux | grep hoop' to get a list of hoop devices
Executed 'hosetup' on each hoop device to find out which one was associated with the controller’s boot volume stream
Assembled an md device with that hoop device using 'mdadm --assemble'
Executed 'fsck' on the assembled md device
Answered 'y' to all questions
Stopped the md device using 'mdadm --stop'
Rebooted all servers
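
The repair steps run on srv2 can be sketched as a small dry-run helper that only prints the commands for review before anything is executed as root. This is an illustration under assumptions, not the exact invocation used in the incident: the function name plan_repair and the device names /dev/md0 and /dev/hoop3 are hypothetical placeholders, the --run option is an assumed way to start the array from the single good mirror member, and 'fsck -y' is used as the non-interactive equivalent of answering 'y' to every question. The 'hosetup' arguments are not shown because the article does not give them.

```shell
#!/bin/sh
# Sketch only: prints the srv2 repair commands instead of running them.
# Device names are placeholders; use the hoop device identified with 'hosetup'.

plan_repair() {
    md_dev=$1      # md device to assemble (hypothetical, e.g. /dev/md0)
    hoop_dev=$2    # hoop device backing the good boot-volume stream (hypothetical)

    if [ -z "$md_dev" ] || [ -z "$hoop_dev" ]; then
        echo "usage: plan_repair <md-device> <hoop-device>" >&2
        return 1
    fi

    # Assumption: --run asks mdadm to start the array even though only
    # one member of the mirror is supplied.
    echo "mdadm --assemble --run $md_dev $hoop_dev"

    # -y answers 'yes' to every repair prompt, matching the manual step.
    echo "fsck -y $md_dev"

    # Release the md device once the filesystem check has finished.
    echo "mdadm --stop $md_dev"
}

# Example with placeholder device names.
plan_repair /dev/md0 /dev/hoop3
```

Review the printed commands against the output of 'sdinit cmd=read' and 'hosetup' on the affected grid before running any of them by hand.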