

Grid Controller and Grid Node Troubleshooting

Basic things to check on the Grid controller and server nodes:

  1. Log in to the Grid controller and review /var/log/messages for any errors related to the issue (see the grep sketch after this list).
  2. The error output will name the node on which the mount failed; any message that reports a failure to mount on 127.0.0.1 indicates a stuck mount on the controller itself.
  3. To list any stuck md devices on the Grid controller: cat /proc/mdstat
  4. Any md devices listed on the controller can be stopped with: mdadm --stop /dev/mdX (see the sketch after this list).
  5. Verify whether there are any network connectivity issues between the grid servers.
  6. Run 3tsrv bd list --all to view any stuck volumes on the Grid nodes.

    For example, to verify a stuck mounted volume, review the hoop devices listed by 3tsrv bd list --all. The error messages will look roughly like the following (taken from a previous case, so the exact text will differ):

    messages:Jun 2 20:47:46 RSGRID4-srv4 AppLogic: Failed to mount /var/applogic/volumes/vols/v-ebb55ea5-dbf4-4138-accb-480bf70891d6 on /dev/hoop10

    messages:Jun 2 20:47:46 RSGRID4-srv4 VRM: VL_CTL.c(716): m_vol_share():Server srv4 - status 47 : Share create request for volume vol.srv4.v-ebb55ea5-dbf4-4138-accb-480bf70891d6 failed

    --- hoop Devices ---
    Name     Volume                                   Shared  Port
    hoop0    v-3475134b-b62b-4102-b3bd-ba1f87696df2   Y       63001
    hoop1    v-108f94ef-d9be-4ab8-9e30-3c391c88f2fd   Y       63002
    hoop2    v-ab19b4af-59ee-4deb-a56c-f3a19b1596b7   Y       63003
    hoop3    v-9301b19f-3e5e-4970-bac5-370a76a6ac6d   Y       63004
    hoop4    v-c5ce7932-02b1-433e-b675-59b8ae2babb7   Y       63005
    hoop5    v-5dc34277-09cc-4be7-9c8b-6a7d164a9d24   Y       63006
    hoop6    v-37e3b7ad-662c-4a58-ae7f-dc46b7f18e5d   Y       63007
    hoop7    v-fd4c5cbe-6e21-4416-baa2-0f548d771e7f   Y       63008
    hoop8    v-e2c2d50c-dd0e-4c37-b7e9-fbc9abcbaf42   Y       63009
    hoop9    v-06668ea6-9bd2-4777-8258-5772d5d378e6   Y       63010
    hoop10   v-3475134b-b62b-4102-b3bd-ba1f87696df2   N       n/a
    hoop100  v-3475134b-b62b-4102-b3bd-ba1f87696df2   N       n/a

    The hoop10 and hoop100 devices are not shared and have no port defined, so these are the stuck volumes on srv4; each was cleared with hoosetup -d /dev/hoopX (see the sketch after this list).

  7. Also check 3tsrvctl list mounts to view any mounted /dev/hoopX devices.
  8. On the Grid controller, run 3t server list --map to view which applications are running on each Grid node (see the diagnostics sketch after this list).
  9. Run xm list to see which Grid node the controller VM is running on (illustrative output after this list).
  10. On the Grid controller, go to the /var/log directory and run grep controller * to review all log messages that mention the controller (example after this list).
  11. Log in to any grid node and run 3tsrv sd get; this displays where the volume streams are located, their sync status, and which nodes hold the primary and secondary roles.
  12. The boot volume and the impex volume are not critical and can be replaced if needed; the meta volume, however, is extremely important. If only one stream of the meta volume is available and it is marked as error, make a copy of that volume stream before doing anything else (see the hedged copy sketch after this list). It is still possible to recover a grid without this volume, but it is an extremely tedious process that can take many weeks even for a small grid.
  13. If you need to recover the controller on the secondary node, first stop the heartbeat service, then run the following command on the node you want to make primary: 3tsrv set role=primary --recover (see the sequence after this list).
  14. To stop the heartbeat service, run: service heartbeat stop

    This command can take a few minutes to complete
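
Example command sketches

The following sketches illustrate several of the steps above. The command names come from the list itself; host names, device names, volume IDs, file paths, and output formats are illustrative assumptions and will differ on a real grid.

For steps 1 and 2, a quick way to pull the relevant mount errors out of /var/log/messages on the controller:

    # Look for failed volume mounts and failed share requests (message text as in the excerpt above)
    grep -iE "failed to mount|share create request" /var/log/messages | tail -n 50

    # Failures reported against 127.0.0.1 point at a stuck mount on the controller itself
    grep -i "failed to mount" /var/log/messages | grep "127.0.0.1"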
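
For steps 3 and 4, assuming the controller uses standard Linux software RAID tooling (mdadm), a minimal sequence:

    # List md devices currently assembled on the controller
    cat /proc/mdstat

    # Optionally inspect a suspect device before stopping it (replace mdX with the real device)
    mdadm --detail /dev/mdX

    # Stop the stuck md device
    mdadm --stop /dev/mdX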
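
For step 6, a hypothetical filter over the 3tsrv bd list --all output shown above: hoop devices with Shared = N and no port are candidates for stuck mounts, and each confirmed one is cleared with hoosetup -d as described. The awk field positions assume the column layout in the example and may need adjusting:

    # List hoop devices that are not shared (third column = N)
    3tsrv bd list --all | awk '$1 ~ /^hoop/ && $3 == "N" {print $1}'

    # Clear each confirmed stuck device by hand, for example:
    hoosetup -d /dev/hoop10
    hoosetup -d /dev/hoop100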
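
For steps 7, 8, and 11, it can help to capture the relevant views in one pass; the commands are the ones named in the list, simply redirected to scratch files (the paths are arbitrary):

    # On the affected Grid node: record mounted /dev/hoopX devices and volume stream placement
    3tsrvctl list mounts > /tmp/$(hostname)-mounts.txt
    3tsrv sd get > /tmp/$(hostname)-streams.txt

    # On the Grid controller: record which applications map to which Grid node
    3t server list --map > /tmp/server-map.txt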
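
For step 9, xm list on a Grid node shows the Xen domains running there; the output below is illustrative only, and the controller VM's actual name will differ per grid:

    xm list
    Name                          ID   Mem(MiB) VCPUs State   Time(s)
    Domain-0                       0        512     4 r-----  48231.7
    controller                     3       1024     1 -b----   3621.4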
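
For step 10, as written in the list, with the output paged so it stays readable:

    cd /var/log
    # Review every log entry that mentions the controller (case-insensitive), a page at a time
    grep -i controller * | less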
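
For step 12, before touching a grid whose only meta volume stream is marked as error, copy that stream aside first. The paths below are hypothetical, modeled on the volume path that appears in the log excerpt above (/var/applogic/volumes/vols/...); substitute the real location of the meta volume stream on your node:

    # Hypothetical backup of the sole remaining meta volume stream (adjust both paths)
    mkdir -p /root/meta-backup
    cp -a /var/applogic/volumes/vols/v-<meta-volume-id> /root/meta-backup/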
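
For steps 13 and 14, the recovery sequence on the node being promoted, using the commands given above (stopping heartbeat can take a few minutes):

    # Stop the heartbeat service first; this may take a few minutes to complete
    service heartbeat stop

    # Then promote this node's controller and trigger recovery
    3tsrv set role=primary --recover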