

Grid Controller and Grid Node Troubleshooting

Basic things to check on the Grid controller and server nodes:

  1. Log in to the Grid controller and review /var/log/messages for any errors related to the issue (see the grep sketch after this list).
  2. The error output will name the node on which the mount failed; any message that reports a failure to mount on 127.0.0.1 indicates a stuck mount on the controller itself.
  3. To list any stuck md devices on the Grid controller: cat /proc/mdstat
  4. Any md devices listed on the controller can be stopped with: mdadm --stop /dev/mdX (see the sketch after this list).
  5. Verify whether there are any network connectivity issues between the grid servers.
  6. Run 3tsrv bd list --all to view any stuck volumes on the Grid nodes.

    For example, to verify a stuck mounted volume, review the hoop devices listed by 3tsrv bd list --all. The error messages will look roughly like the following (taken from a previous case, so the exact text will differ):

    messages:Jun 2 20:47:46 RSGRID4-srv4 AppLogic: Failed to mount /var/applogic/volumes/vols/v-ebb55ea5-dbf4-4138-accb-480bf70891d6 on /dev/hoop10

    messages:Jun 2 20:47:46 RSGRID4-srv4 VRM: VL_CTL.c(716): m_vol_share():Server srv4 - status 47 : Share create request for volume vol.srv4.v-ebb55ea5-dbf4-4138-accb-480bf70891d6 failed

    --- hoop Devices ---
    Name     Volume                                   Shared  Port
    hoop0    v-3475134b-b62b-4102-b3bd-ba1f87696df2   Y       63001
    hoop1    v-108f94ef-d9be-4ab8-9e30-3c391c88f2fd   Y       63002
    hoop2    v-ab19b4af-59ee-4deb-a56c-f3a19b1596b7   Y       63003
    hoop3    v-9301b19f-3e5e-4970-bac5-370a76a6ac6d   Y       63004
    hoop4    v-c5ce7932-02b1-433e-b675-59b8ae2babb7   Y       63005
    hoop5    v-5dc34277-09cc-4be7-9c8b-6a7d164a9d24   Y       63006
    hoop6    v-37e3b7ad-662c-4a58-ae7f-dc46b7f18e5d   Y       63007
    hoop7    v-fd4c5cbe-6e21-4416-baa2-0f548d771e7f   Y       63008
    hoop8    v-e2c2d50c-dd0e-4c37-b7e9-fbc9abcbaf42   Y       63009
    hoop9    v-06668ea6-9bd2-4777-8258-5772d5d378e6   Y       63010
    hoop10   v-3475134b-b62b-4102-b3bd-ba1f87696df2   N       n/a
    hoop100  v-3475134b-b62b-4102-b3bd-ba1f87696df2   N       n/a

    The hoop10 and hoop100 devices are not shared and have no port defined, so these are the stuck volumes on srv4; each was cleared with hoosetup -d /dev/hoopX (see the sketch after this list).

  7. Also check 3tsrvctl list mounts to view any mounted /dev/hoopX devices.
  8. On the Grid controller, run 3t server list --map to view which applications are running on each Grid node (see the diagnostics sketch after this list).
  9. Run xm list to see which Grid node the controller VM is running on (illustrative output after this list).
  10. On the Grid controller, go to the /var/log directory and run grep controller * to review all log messages that mention the controller (example after this list).
  11. Log in to any grid node and run 3tsrv sd get; this displays where the volume streams are located, their sync status, and which nodes hold the primary and secondary roles.
  12. The boot volume and the impex volume are not critical and can be replaced if needed; the meta volume, however, is extremely important. If only one stream of the meta volume is available and it is marked as error, make a copy of that volume stream before doing anything else (see the hedged copy sketch after this list). It is still possible to recover a grid without this volume, but it is an extremely tedious process that can take many weeks even for a small grid.
  13. If you need to recover the controller on the secondary node, first stop the heartbeat service, then run the following command on the node you want to make primary: 3tsrv set role=primary --recover (see the sequence after this list).
  14. To stop the heartbeat service, run: service heartbeat stop

    This command can take a few minutes to complete
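
Example command sketches

The following sketches illustrate several of the steps above. The command names come from the list itself; host names, device names, volume IDs, file paths, and output formats are illustrative assumptions and will differ on a real grid.

For steps 1 and 2, a quick way to pull the relevant mount errors out of /var/log/messages on the controller:

    # Look for failed volume mounts and failed share requests (message text as in the excerpt above)
    grep -iE "failed to mount|share create request" /var/log/messages | tail -n 50

    # Failures reported against 127.0.0.1 point at a stuck mount on the controller itself
    grep -i "failed to mount" /var/log/messages | grep "127.0.0.1"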
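
For steps 3 and 4, assuming the controller uses standard Linux software RAID tooling (mdadm), a minimal sequence:

    # List md devices currently assembled on the controller
    cat /proc/mdstat

    # Optionally inspect a suspect device before stopping it (replace mdX with the real device)
    mdadm --detail /dev/mdX

    # Stop the stuck md device
    mdadm --stop /dev/mdX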
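
For step 6, a hypothetical filter over the 3tsrv bd list --all output shown above: hoop devices with Shared = N and no port are candidates for stuck mounts, and each confirmed one is cleared with hoosetup -d as described. The awk field positions assume the column layout in the example and may need adjusting:

    # List hoop devices that are not shared (third column = N)
    3tsrv bd list --all | awk '$1 ~ /^hoop/ && $3 == "N" {print $1}'

    # Clear each confirmed stuck device by hand, for example:
    hoosetup -d /dev/hoop10
    hoosetup -d /dev/hoop100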
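
For steps 7, 8, and 11, it can help to capture the relevant views in one pass; the commands are the ones named in the list, simply redirected to scratch files (the paths are arbitrary):

    # On the affected Grid node: record mounted /dev/hoopX devices and volume stream placement
    3tsrvctl list mounts > /tmp/$(hostname)-mounts.txt
    3tsrv sd get > /tmp/$(hostname)-streams.txt

    # On the Grid controller: record which applications map to which Grid node
    3t server list --map > /tmp/server-map.txt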
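
For step 9, xm list on a Grid node shows the Xen domains running there; the output below is illustrative only, and the controller VM's actual name will differ per grid:

    xm list
    Name                          ID   Mem(MiB) VCPUs State   Time(s)
    Domain-0                       0        512     4 r-----  48231.7
    controller                     3       1024     1 -b----   3621.4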
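
For step 10, as written in the list, with the output paged so it stays readable:

    cd /var/log
    # Review every log entry that mentions the controller (case-insensitive), a page at a time
    grep -i controller * | less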
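
For step 12, before touching a grid whose only meta volume stream is marked as error, copy that stream aside first. The paths below are hypothetical, modeled on the volume path that appears in the log excerpt above (/var/applogic/volumes/vols/...); substitute the real location of the meta volume stream on your node:

    # Hypothetical backup of the sole remaining meta volume stream (adjust both paths)
    mkdir -p /root/meta-backup
    cp -a /var/applogic/volumes/vols/v-<meta-volume-id> /root/meta-backup/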
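
For steps 13 and 14, the recovery sequence on the node being promoted, using the commands given above (stopping heartbeat can take a few minutes):

    # Stop the heartbeat service first; this may take a few minutes to complete
    service heartbeat stop

    # Then promote this node's controller and trigger recovery
    3tsrv set role=primary --recover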