This guide is meant to be used with AppLogic version 2.9.9+. Some or all of the commands used in this guide were not available in previous versions.
This guide assumes the appropriate checks have already been done to ensure the volume is not currently in use by any running appliance, filer or other process such as vol repair/migrate. Removing a volume stream or its associated device while in use can lead to data loss or corruption.
(work in progress)
3tsrv has many commands; the one we are interested in is 'bd list', which lists all block devices in use on a node.
3tsrv bd list --all

3tsrv <cmd> usage:
  bd list [--md|--hoop|--nbd|--all] [--batch] -- list active block devices on server

Example usage

[root]# 3tsrv bd list --all
--- md Devices ---
Name     Attached Devices
----------------------------------
md1      nbd1, hoop0
md2      nbd3, hoop1
md3      nbd5, hoop2

--- hoop Devices ---
Name      Volume                                    Shared   Port
------------------------------------------------------------------------
hoop0     v-e69315e2-fff3-473c-97dc-af309466134a    Y        63001
hoop1     v-1b0ecc0e-782c-4ddb-b8cb-b7981560fd3e    Y        63002
hoop2     v-f821f907-4ae0-4044-87ea-6dd39cefe9fd    Y        63003
hoop3     v-cf98dc46-03dc-45cd-ae2f-a5999705ab87    Y        63004
hoop255   swap.img                                  N        n/a

--- nbd Devices ---
Name     Remote IP      Remote Port
-------------------------------------------
nbd1     192.168.2.1    63021
nbd3     192.168.2.1    63020
nbd5     192.168.2.1    63022
3tsrvctl also has many commands; the ones we will be using are list, info and destroy. These commands display the logical mounts on the node and show which appliance each belongs to.
3tsrvctl commands
3tsrvctl <cmd> [<entity>] [<prop>=<val>]* [--batch] [--verbose]

Logical Mount Commands (<entity> is mount, mounts, mnt, or mnts)
  list     - list logical mounts
  info     - get logical mount information
  destroy  - destroy logical mount

Example usage:

[root]# 3tsrvctl list mounts
Name                                                  # Volumes   Mount ID
mnt.srv2.SYSTEM:_sys.boot                             2           /dev/md1
mnt.srv2.SYSTEM:_sys.meta                             2           /dev/md2
mnt.srv2.SYSTEM:_sys.impex                            2           /dev/md3
mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs       2           /dev/md6
[root]#
[root]# 3tsrvctl info mount mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs
Name             : mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs
Mount ID         : /dev/md6
Mount Attributes : exclusive
# Volumes        : 2
Volumes          :   State
                     synchronized
                     synchronized
[root]#
[root]# 3tsrvctl mount destroy mnt.srv2.js_lamp1.main.content:js_lamp1.user.fs
[root]#
hosetup is used to create and destroy hoop devices.
hosetup usage

usage:
  hosetup loop_device                                        # give info
  hosetup -d loop_device                                     # delete
  hosetup [ -e encryption ] [ -o offset ] loop_device file   # setup

Example usage

[root]# hosetup /dev/hoop58 /var/applogic/volumes/vols/v-d9478e8d-a317-4e4a-a77e-7377f5a65693
[root]#
[root]# hosetup /dev/hoop58
/dev/hoop58: [fd00]:42319875 (/var/applogic/volumes/vols/v-d9478e8d-a317-4e4a-a77e-7377f5a65*)
[root]#
[root]# hosetup -d /dev/hoop58
[root]#

nbd-server / nbd-client

nbd-server is used to set up a share over the network. nbd-client connects to an nbd-server.

nbd-server usage

This is nbd-server version 2.8.8
Usage: port file_to_export [size][kKmM] [-l authorize_file] [-r] [-m] [-c] [-a timeout_sec]
nbd-client usage

nbd-client version 2.8.8
Usage: nbd-client [bs=blocksize] host port nbd_device [-swap] [-persist]
Or   : nbd-client -d nbd_device
Clearing Stuck Devices
A hoop device is essentially a performance-enhanced version of a loop device. It performs exactly the same function as a normal loop device.
The most common stuck hoop device scenario is where the hoop device is unused (not part of an assembled md device) but still has a volume attached. These show up in the output of '3tsrv bd list' with n/a listed as the port number and 'N' in the Shared column. They can be cleared by running 'hosetup -d /dev/hoopX' on the device.
Example of unused /dev/hoop1 device

[root]# 3tsrv bd list --hoop
Name      Volume                                    Shared   Port
------------------------------------------------------------------------
hoop0     v-e69315e2-fff3-473c-97dc-af309466134a    Y        63001
hoop1     v-1b0ecc0e-782c-4ddb-b8cb-b7981560fd3e    N        n/a
hoop2     v-f821f907-4ae0-4044-87ea-6dd39cefe9fd    Y        63003
[root]#
[root]# hosetup -d /dev/hoop1
[root]# 3tsrv bd list --hoop
Name      Volume                                    Shared   Port
------------------------------------------------------------------------
hoop0     v-e69315e2-fff3-473c-97dc-af309466134a    Y        63001
hoop2     v-f821f907-4ae0-4044-87ea-6dd39cefe9fd    Y        63003
[root]#

In some cases the hoop device may fail to delete, giving an error that it is currently busy.

fail to clean up hoop

[root]# hosetup -d /dev/hoop14
ioctl: LOOP_CLR_FD: Device or resource busy
[root]#

This means that the hoop device is still being used by something. We should check the 'bd list --all' output and ensure nothing is listed as using this hoop device.

[root]# 3tsrv bd list --all | grep hoop14
md15      nbd29, hoop14
hoop14    v-e52af9a3-a45c-4a7b-b53a-ebeea9fc696d    N        n/a
In this case it looks like md15 still has hoop14 attached, even though nothing is using it. To fix this, we first need to stop the md device before hoop14 can be detached. See the section on md devices for more detail on this process.
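As a minimal sketch of that cleanup (assuming nothing else is legitimately using md15 or hoop14), stop the md device, detach the hoop device, and confirm both are gone:

[root]# mdadm --stop /dev/md15
[root]# hosetup -d /dev/hoop14
[root]# 3tsrv bd list --all | grep hoop14
[root]#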
The nbd devices have two parts: a server on the node that holds the volume stream, and a client on the node where the VM is running. The nbd-server is used to share out a hoop device over the backend network. The nbd-client connects to the nbd-server and exposes the shared hoop device as a /dev/nbd device on the node the appliance is running on.
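As a rough illustration of how the two halves pair up (AppLogic normally sets this up automatically; the addresses, port and device names below are only examples reused from the session later in this guide):

# On the node holding the volume stream, the hoop device is exported on a port:
[root@srv1]# nbd-server 63005 /dev/hoop3

# On the node running the appliance, a client attaches to that port and exposes it locally:
[root@srv2]# nbd-client 192.168.2.1 63005 /dev/nbd8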
It is very rare (I have yet to see this) to find a stuck nbd device; whenever the nbd-server is killed, the client is closed automatically. There are occasions, however, where the md device is left assembled, so the nbd device must sometimes be cleaned up manually after stopping the md device. To clean them up, simply kill the nbd-server process on the remote node.
Example of shutting down nbd-server
[root@srv2]# 3tsrv bd list --md | grep md5
--- md Devices ---
Name     Attached Devices
----------------------------------
md5      hoop13, nbd8
[root@srv2]# mdadm --stop /dev/md5
[root@srv2]# 3tsrv bd list --nbd | grep nbd8
Name     Remote IP      Remote Port
-------------------------------------------
nbd8     192.168.2.1    63005
[root@srv2]#
[root@srv2]# ssh 192.168.2.1
[root@srv1]#
[root@srv1]# ps -ef | grep 63005
root     32182     1  0 10:36 ?        00:00:00 /usr/bin/nbd-server 63005 /dev/hoop3
root     32203 32182  1 10:36 ?        00:03:21 /usr/bin/nbd-server 63005 /dev/hoop3
[root@srv1]# kill 32182
[root@srv1]# exit
[root@srv2]#
[root@srv2]# 3tsrv bd list --nbd | grep nbd8
[root@srv2]#
In the above example, md5 is stuck in use with no appliance associated with it. After stopping md5 we check the 'bd list' output to determine where the nbd device is being shared from. The remote IP here is 192.168.2.1 (srv1), so we ssh into srv1 and search the processes for the port the client is connected on, 63005. Killing the parent process shuts down the nbd-server on srv1 and the nbd-client on srv2.
An md device is made up of 2 or more volume streams assembled into a raid1. In the 'bd list' output you will sometimes see 'raid1' listed in the Attached Devices column, as with the md5 device below.
md5      raid1, nbd9, hoop4
md6      nbd22, hoop6
The md5 device corresponds to a read-only volume. I am not really sure why this is, since they are all raid1; maybe someone else can explain this. md6 is the typical output for a read-write volume.
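If you want to confirm how a particular md device is put together, standard mdadm tooling can be used alongside the 3tsrv commands (a hedged example; md5 here is just the device from the output above):

[root]# mdadm --detail /dev/md5
[root]# cat /proc/mdstat

mdadm --detail reports the RAID level and member devices, and a read-only array will typically appear in /proc/mdstat as 'active (read-only)' or 'active (auto-read-only)'.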
Troubleshooting Examples
In the controller's /var/log/messages we see the following:
Mar 13 21:04:51 grid3 applogic: [511:error s:47]: srv6: Share create request for volume vol.srv6.v-19bbae8f-86ec-4688-a6e8-146c7e53f851 failed

This indicates that srv6 has failed to attach the volume to a hoop device. We know this because it is trying to mount the v-xxx-xxx-xxx stream, which only mounts to a hoop device. On srv6 you would see the additional errors:

Mar 13 21:04:51 grid3-srv6 AppLogic: Failed to mount /var/applogic/volumes/vols/v-19bbae8f-86ec-4688-a6e8-146c7e53f851 on /dev/hoop18
Mar 13 21:04:51 grid3-srv6 VRM: VL_CTL.c(716): m_vol_share(): Server srv6 - status 47 : Share create request for volume vol.srv6.v-19bbae8f-86ec-4688-a6e8-146c7e53f851 failed
This can indicate that either the volume stream is bad (highly unlikely unless there are also drive errors) or hoop18 is already in use while AppLogic thinks it is free. To fix this we would need to log into srv6 and stop the hoop18 device.
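A hedged example of that cleanup on srv6, reusing the commands shown earlier in this guide (device names taken from the log above):

[root@srv6]# 3tsrv bd list --all | grep hoop18     # confirm nothing is actually using hoop18
[root@srv6]# hosetup -d /dev/hoop18                # detach the stale hoop device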
In the controller's /var/log/messages:
Sep 22 09:58:30 grid1 applogic: Failed to mount _cat.system.NAS.boot: id='/dev/md2', cmd='mdadm --assemble /dev/md2 --force --run /dev/nbd2 /dev/nbd4'
Sep 22 09:58:30 grid1 ctld: MT_LMC.c(565): m_mnt_create(): Server 127.0.0.1 - status 47 : Mount create request for mount '_cat.system.NAS.boot' failed
Sep 22 09:58:31 grid1 applogic: [67336:error s:47]: _cat.system.NAS.boot: failed to mount volume
In this example the m_mnt_create command is failing on server 127.0.0.1, which is the loopback IP. Any time you see the failure occur on 127.0.0.1, it is a problem on the controller. In the first line we see that the command it was trying to run was 'mdadm --assemble /dev/md2'. On the controller, running 'cat /proc/mdstat' will show:
[root]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 nbd4[0]
20971520 blocks [1/1] [U]
So in this case /dev/md2 is already in use on the controller. To fix this would require stopping /dev/md2 on the controller.
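Assuming nothing on the controller legitimately owns /dev/md2, the cleanup would look like this (stop the array, then confirm it is gone):

[root]# mdadm --stop /dev/md2
[root]# cat /proc/mdstat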
In the controller's /var/log/messages we see the following when trying to start an application:
Jun 20 10:04:29 test applogic: [101:error s:47]: srv5: Mount create request for mount 'mnt.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot' failed
Jun 20 10:04:29 test applogic: [67336:error s:47]: QA_9.class.kanakcqa43.boot: failed to mount volume
Jun 20 10:04:29 test applogic: [66583:error s:47]: QA_9: vol.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot: failed to create mount
Jun 20 10:04:29 test applogic: [66568:error s:47]: QA_9:main.kanakcqa43: failed to allocate resources for component
Jun 20 10:04:46 test applogic: [66310:error s:47]: QA_9: failed to allocate resources
From the first line we can tell that srv5 is the grid node with the problem. The line indicates it is trying to mount an assembled volume (it has a volume name rather than a volume stream v-xxx-xxx-xxx), so we most likely have an issue with an md device. Let's check the logs on srv5:
Jun 20 10:04:29 test-srv5 VRM: MT_LMC.c(565): m_mnt_create(): Server srv5 - status 47 : Mount create request for mount 'mnt.srv5.QA_9.main.kanakcqa43:QA_9.class.kanakcqa43.boot' failed
This single log entry is all we get, and it does not mention the specific device that is failing. So our next step is to check the output of 'bd list --all' for any inconsistencies, paying special attention to the md devices since that is where we suspect the issue to be.
[root@test-srv5 ~]# 3tsrv bd list --all
--- md Devices ---
Name     Attached Devices
----------------------------------
md4      nbd8
md5      nbd9, nbd10
md6      nbd16, nbd12
md7      nbd13, nbd14
md11     nbd15, hoop35
md12     nbd23, nbd24
md13     nbd26, nbd27
md16     nbd25, hoop36
md17     nbd33, nbd34

--- hoop Devices ---
Name      Volume
In this case everything looks good for all the devices. The only inconsistency is md4, but this could simply be a degraded volume, so we need to do more checking. Looking at the output of '3tsrvctl list mounts' we notice that there is no VM currently running that uses md4, nor is there an nbd8 process:
[root@test-srv5 ~]# 3tsrvctl list mounts
Name                                                                          # Volumes   Mount ID
mnt.srv5.KCEngDev237.main.KCEngDev237:KCEngDev237.class.kanakcdevtmpA.boot    2           /dev/md11
mnt.srv5.KCEngDev237.main.KCEngDev237:KCEngDev237.class.kanakcdevtmpA.swap    2           /dev/md12
mnt.srv5.KCEngDev240.main.KCEngDev240:KCEngDev240.class.kanakcdevtmpA.boot    2           /dev/md16
mnt.srv5.KCEngDev240.main.KCEngDev240:KCEngDev240.class.kanakcdevtmpA.swap    2           /dev/md17
mnt.srv5.Dev_2.main.kanakcdev6:Dev_2.class.kanakcdev6.boot                    2           /dev/md6
mnt.srv5.Dev_2.main.kanakcdev6:Dev_2.class.kanakcdev6.swap                    2           /dev/md7
mnt.srv5.CI_3.main.kanakcci11:CI_3.class.kanakcci11.boot                      2           /dev/md5
mnt.srv5.CI_3.main.kanakcci11:CI_3.class.kanakcci11.swap                      2           /dev/md13
[root@test-srv5 ~]# ps -ef | grep nbd8
root     11024  6897  0 10:49 pts/3    00:00:00 grep nbd8
[root@test-srv5 ~]#
So now we can conclusively say that md4 is the culprit and needs to be shut down manually. All that is needed here is to stop the md4 device and try to start the application again.
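A minimal sketch of that final cleanup on srv5 (stop the stale array, confirm it is gone, then restart the application from the controller):

[root@test-srv5 ~]# mdadm --stop /dev/md4
[root@test-srv5 ~]# 3tsrv bd list --md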