

Guide to Troubleshooting Stuck Volumes - Extended

This guide is meant to be used with AppLogic version 2.9.9+. Some or all of the commands used in this guide were not available in previous versions.

This guide assumes the appropriate checks have already been done to ensure the volume is not currently in use by any running appliance, filer or other process such as vol repair/migrate. Removing a volume stream or its associated device while in use can lead to data loss or corruption.
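
A quick way to run those checks on a node, using the utilities described below:

[root]# 3tsrvctl list mounts
[root]# 3tsrv bd list --all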

(work in progress)

Utilities and Commands used in this guide
3tsrv

3tsrv has many commands; the one we are interested in is 'bd list', which lists all block devices in use on a node.

3tsrv bd list usage
3tsrv <cmd>
 usage:
   bd list [--md|--hoop|--nbd|--all] [--batch] -- list active block devices on server
Example usage
[root]# 3tsrv bd list --all
--- md Devices ---
Name       Attached Devices
----------------------------------
md1        nbd1, hoop0
md2        nbd3, hoop1
md3        nbd5, hoop2
 
--- hoop Devices ---
Name       Volume                                        Shared    Port   
------------------------------------------------------------------------
hoop0      v-e69315e2-fff3-473c-97dc-af309466134a        Y         63001  
hoop1      v-1b0ecc0e-782c-4ddb-b8cb-b7981560fd3e        Y         63002  
hoop2      v-f821f907-4ae0-4044-87ea-6dd39cefe9fd        Y         63003  
hoop3      v-cf98dc46-03dc-45cd-ae2f-a5999705ab87        Y         63004  
hoop255    swap.img                                      N         n/a   
 
--- nbd Devices ---
Name       Remote IP            Remote Port    
-------------------------------------------
nbd1       192.168.2.1          63021          
nbd3       192.168.2.1          63020          
nbd5       192.168.2.1          63022
3tsrvctl

3tsrvctl also has many commands; the ones we will be using are list, info, and destroy. The list and info commands display the logical mounts on the node and which appliance they belong to, while destroy removes a logical mount.

3tsrvctl commands

3tsrvctl <cmd> [<entity>] [<prop>=<val>]* [--batch] [--verbose]
 
Logical Mount Commands (<entity> is mount, mounts, mnt, or mnts)
   list      - list Logical mounts
   info      - get logical mount information
   destroy   - destroy logical mount
Example usage:
[root]# 3tsrvctl list mounts
Name                                                         # Volumes  Mount ID       
mnt.srv2.SYSTEM:_sys.boot                                    2          /dev/md1      
mnt.srv2.SYSTEM:_sys.meta                                    2          /dev/md2      
mnt.srv2.SYSTEM:_sys.impex                                   2          /dev/md3
mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs              2          /dev/md6
[root]#
 
 
[root]# 3tsrvctl info mount mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs 
Name            : mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs
Mount ID        : /dev/md6
Mount Attributes: exclusive
# Volumes       : 2
Volumes         :
   State
   synchronized
   synchronized
[root]#
 
 
[root]# 3tsrvctl destroy mount mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs
[root]#
hosetup

hosetup is used to create and destroy hoop devices.

hosetup usage
usage:
  hosetup loop_device                                      # give info
  hosetup -d loop_device                                   # delete
  hosetup [ -e encryption ] [ -o offset ] loop_device file # setup
Example usage
[root]# hosetup /dev/hoop58 /var/applogic/volumes/vols/v-d9478e8d-a317-4e4a-a77e-7377f5a65693 
[root]#
 
 
[root]# hosetup /dev/hoop58
/dev/hoop58: [fd00]:42319875 (/var/applogic/volumes/vols/v-d9478e8d-a317-4e4a-a77e-7377f5a65*)
[root]#
 
 
[root]# hosetup -d /dev/hoop58
[root]#
nbd-server / nbd-client
nbd-server is used to set up a share over the network; nbd-client connects to an nbd-server.
nbd-server usage
This is nbd-server version 2.8.8
Usage: port file_to_export [size][kKmM] [-l authorize_file] [-r] [-m] [-c] [-a timeout_sec]
nbd-client usage
nbd-client version 2.8.8
Usage: nbd-client [bs=blocksize] host port nbd_device [-swap] [-persist]
Or   : nbd-client -d nbd_device
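Example usage
A sketch of the pair used together; the port, device names, and IPs below are illustrative assumptions, not taken from a live grid:
[root@srv1]# nbd-server 63005 /dev/hoop3
[root@srv2]# nbd-client 192.168.2.1 63005 /dev/nbd8
[root@srv2]# nbd-client -d /dev/nbd8
The first command exports /dev/hoop3 on port 63005 from srv1, the second attaches that export on srv2 as /dev/nbd8, and the last disconnects the client.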

Clearing Stuck Devices

Hoop Devices

A hoop device is essentially a performance-enhanced version of the loop device; it performs the same function a normal loop device would.

The most common stuck hoop device scenario is where the hoop device is unused (not part of an assembled md device) but a volume is still attached to it. These show up in the output of '3tsrv bd list' with n/a listed as the port number and an 'N' in the Shared column. They can be cleared by running 'hosetup -d /dev/hoopX' on the device.

Example of an unused /dev/hoop1 device:
[root]# 3tsrv bd list --hoop
Name       Volume                                        Shared    Port   
------------------------------------------------------------------------
hoop0      v-e69315e2-fff3-473c-97dc-af309466134a        Y         63001  
hoop1      v-1b0ecc0e-782c-4ddb-b8cb-b7981560fd3e        N         n/a 
hoop2      v-f821f907-4ae0-4044-87ea-6dd39cefe9fd        Y         63003  
[root]#
 
[root]# hosetup -d /dev/hoop1
[root]# 3tsrv bd list --hoop
Name       Volume                                        Shared    Port   
------------------------------------------------------------------------
hoop0      v-e69315e2-fff3-473c-97dc-af309466134a        Y         63001  
hoop2      v-f821f907-4ae0-4044-87ea-6dd39cefe9fd        Y         63003  
[root]#
In some cases the hoop device may fail to delete, giving an error that it is currently busy.
Failed hoop cleanup
[root]# hosetup -d /dev/hoop14
ioctl: LOOP_CLR_FD: Device or resource busy
[root]#
This means that the hoop device is still being used by something. We should check 'bd list --all' output and ensure nothing is listed as using this hoop device.
[root]# 3tsrv bd list --all | grep hoop14
md15       nbd29, hoop14
hoop14     v-e52af9a3-a45c-4a7b-b53a-ebeea9fc696d        N         n/a

In this case it looks like md15 still has hoop14 attached but not in use. To fix this, we need to stop the md device first; only then will hoop14 release. See the section on md devices for more info on this process, and the sketch below.
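
A minimal sketch of that cleanup, assuming md15 is confirmed to be unused by any appliance:

[root]# mdadm --stop /dev/md15
[root]# hosetup -d /dev/hoop14
[root]#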

nbd Devices

The nbd devices have two parts: a server on the node that holds the volume stream, and a client on the node that the VM is running on. The nbd-server is used to share out a hoop device over the backend network. The nbd-client connects to the nbd-server and provides the shared hoop device as a /dev/nbd device on the node where the appliance is running.

It is very rare (I have yet to see this) to see a stuck nbd device, since whenever the nbd-server is killed the client is automatically closed. There are occasions where the md device will be left assembled, though, and so the nbd device sometimes must be cleaned up manually after stopping the md device. To clean them up, simply kill the nbd-server process on the remote node.

Example of shutting down nbd-server

[root@srv2]# 3tsrv bd list --md | grep md5
--- md Devices ---
Name       Attached Devices
----------------------------------
md5        hoop13, nbd8
[root@srv2]# mdadm --stop /dev/md5
[root@srv2]# 3tsrv bd list --nbd | grep nbd8
Name       Remote IP            Remote Port    
-------------------------------------------   
nbd8       192.168.2.1          63005     
[root@srv2]# 
[root@srv2]# ssh 192.168.2.1 
 
[root@srv1]#
[root@srv1]# ps -ef | grep 63005
root     32182     1  0 10:36 ?        00:00:00 /usr/bin/nbd-server 63005 /dev/hoop3
root     32203 32182  1 10:36 ?        00:03:21 /usr/bin/nbd-server 63005 /dev/hoop3
[root@srv1]# kill 32182
[root@srv1]# exit
 
[root@srv2]#
[root@srv2]# 3tsrv bd list --nbd | grep nbd8
[root@srv2]#

In the above example, md5 is stuck in use with no appliance associated with it. After stopping md5 we check the 'bd list' output to determine where the nbd device is being shared from. The remote IP here is 192.168.2.1 (srv1), so we ssh into srv1 and search the processes for the port the client is connected on, 63005. We kill the parent process, which shuts down the nbd-server on srv1 and the nbd-client on srv2.

md Devices

The md device is made up of two or more volume streams assembled into a raid1 array. In the 'bd list' output you will sometimes see 'raid1' listed in the attached-devices column, like the md5 device below.

md5        raid1, nbd9, hoop4
md6        nbd22, hoop6

The md5 device corresponds to a read-only volume. I am not really sure why this is, since they are all raid1; maybe someone else can explain it. md6 is the typical output for a read-write volume.
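
The general manual-stop procedure for a stuck md device looks like the following ('/dev/mdX' is a placeholder; confirm via /proc/mdstat, '3tsrvctl list mounts' and '3tsrv bd list' that nothing is using the device first):

[root]# cat /proc/mdstat
[root]# mdadm --stop /dev/mdX
[root]#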

Troubleshooting Examples

  1. Stuck mount on grid node

    In the controller /var/log/messages we see the following:

    Mar 13 21:04:51 grid3 applogic: [511:error s:47]: srv6: Share create request for volume vol.srv6.v-19bbae8f-86ec-4688-a6e8-146c7e53f851 failed
    This indicates that srv6 has failed to attach the volume to a hoop device. We know this because it is trying to mount the v-xxx-xxx-xxx stream, which only mounts to a hoop device. On srv6 you would see the additional errors:
    Mar 13 21:04:51 grid3-srv6 AppLogic: Failed to mount /var/applogic/volumes/vols/v-19bbae8f-86ec-4688-a6e8-146c7e53f851 on /dev/hoop18
    Mar 13 21:04:51 grid3-srv6 VRM: VL_CTL.c(716): m_vol_share(): Server srv6 - status 47 : Share create request for volume vol.srv6.v-19bbae8f-86ec-4688-a6e8-146c7e53f851 failed
    

    This can indicate either that the volume stream is bad (highly unlikely unless there are also drive errors) or that hoop18 is already in use (AppLogic thinks it is free). To fix this we would need to log into srv6 and stop the hoop18 device, as sketched below.
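
    A minimal sketch of that fix on srv6, assuming the 'bd list' output confirms hoop18 is not part of any assembled md device (the host prompt shown is illustrative):

    [root@grid3-srv6]# 3tsrv bd list --all | grep hoop18
    [root@grid3-srv6]# hosetup -d /dev/hoop18
    [root@grid3-srv6]#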

  2. Stuck mount on controller

    In the controllers /var/log/messages:

    Sep 22 09:58:30 grid1 applogic: Failed to mount _cat.system.NAS.boot: id='/dev/md2', cmd='mdadm --assemble /dev/md2 --force --run /dev/nbd2 /dev/nbd4'
    Sep 22 09:58:30 grid1 ctld: MT_LMC.c(565): m_mnt_create(): Server 127.0.0.1 - status 47 : Mount create request for mount '_cat.system.NAS.boot' failed
    Sep 22 09:58:31 grid1 applogic: [67336:error s:47]: _cat.system.NAS.boot: failed to mount volume
    In this example the m_mnt_create command is failing on server 127.0.0.1, which is the loopback IP. Anytime you see the failure occur on 127.0.0.1, it is a problem on the controller. In the first line we see that the command it was trying to run was 'mdadm --assemble /dev/md2'. On the controller, running 'cat /proc/mdstat' will show:
    [root]# cat /proc/mdstat 
    Personalities : [raid1] 
    md2 : active raid1 nbd4[0]
          20971520 blocks [1/1] [U]
    

    So in this case /dev/md2 is already in use on the controller. To fix this would require stopping /dev/md2 on the controller, as sketched below.
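
    A minimal sketch of that fix, assuming 'cat /proc/mdstat' and '3tsrvctl list mounts' confirm nothing is using the device:

    [root]# mdadm --stop /dev/md2
    [root]#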

  3. Stuck mount on grid node but no log entry specifying which device

    In the controllers /var/log/messages we see the following when trying to start an application:

    Jun 20 10:04:29 test applogic: [101:error s:47]: srv5: Mount create request for mount 'mnt.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot' failed
    Jun 20 10:04:29 test applogic: [67336:error s:47]: QA_9.class.kanakcqa43.boot: failed to mount volume
    Jun 20 10:04:29 test applogic: [66583:error s:47]: QA_9: vol.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot: failed to create mount
    Jun 20 10:04:29 test applogic: [66568:error s:47]: QA_9:main.kanakcqa43: failed to allocate resources for component
    Jun 20 10:04:46 test applogic: [66310:error s:47]: QA_9: failed to allocate resources
    

    From the first line we can tell that srv5 is the grid node with the problem. This line indicates it is trying to mount an assembled volume (it has a vol name rather than a volume stream v-xxx-xxx-xxx), so we most likely have an issue with an md device. Let's check the logs on srv5:

    Jun 20 10:04:29 test-srv5 VRM: MT_LMC.c(565): m_mnt_create(): Server srv5 - status 47 : Mount create request for mount 'mnt.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot' failed
    

    This single log entry is all we get, and it does not mention which specific device is failing. So our next step is to check the output of 'bd list --all' for any inconsistencies, with special attention to the md devices, since this is what we suspect to be the issue.

    [root@test-srv5 ~]# 3tsrv bd list --all
    --- md Devices ---
    Name       Attached Devices
    ----------------------------------
    md4        nbd8
    md5        nbd9, nbd10
    md6        nbd16, nbd12
    md7        nbd13, nbd14
    md11       nbd15, hoop35
    md12       nbd23, nbd24
    md13       nbd26, nbd27
    md16       nbd25, hoop36
    md17       nbd33, nbd34
    --- hoop Devices ---
    Name       Volume    
    

In this case everything looks good for all the devices. The only inconsistency is md4, but this could simply be a degraded volume, so we need to do more checking. Looking at the output of '3tsrvctl list mounts', we notice that no currently running VM is using md4, nor is there an nbd8 process:

[root@test-srv5 ~]# 3tsrvctl list mounts
Name                                                         # Volumes  Mount ID       
mnt.srv5.KCEngDev237.main.KCEngDev237:KCEngDev237.class.kanakcdevtmpA.boot 2          /dev/md11     
mnt.srv5.KCEngDev237.main.KCEngDev237:KCEngDev237.class.kanakcdevtmpA.swap 2          /dev/md12     
mnt.srv5.KCEngDev240.main.KCEngDev240:KCEngDev240.class.kanakcdevtmpA.boot 2          /dev/md16     
mnt.srv5.KCEngDev240.main.KCEngDev240:KCEngDev240.class.kanakcdevtmpA.swap 2          /dev/md17     
mnt.srv5.Dev_2.main.kanakcdev6:Dev_2.class.kanakcdev6.boot   2          /dev/md6      
mnt.srv5.Dev_2.main.kanakcdev6:Dev_2.class.kanakcdev6.swap   2          /dev/md7      
mnt.srv5.CI_3.main.kanakcci11:CI_3.class.kanakcci11.boot     2          /dev/md5      
mnt.srv5.CI_3.main.kanakcci11:CI_3.class.kanakcci11.swap     2          /dev/md13   
 
 
[root@test-srv5 ~]# ps -ef | grep nbd8
root     11024  6897  0 10:49 pts/3    00:00:00 grep nbd8
[root@test-srv5 ~]#

So now we can conclusively say that md4 is the culprit and needs to be shut down manually. All that is needed here is to stop the md4 device and try to start the application again, as shown below.
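
Following the same mdadm procedure as in the earlier examples, assuming the checks above are conclusive:

[root@test-srv5 ~]# mdadm --stop /dev/md4
[root@test-srv5 ~]#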