Correcting Application States on Controller Boot

When the controller boots after an unplanned event or a planned grid or system halt, it tries to contact the servers in the grid to determine the status of their applications. This is done by the 3tgridctl.pl script in /usr/local/applogic/scripts. The script scans the servers in the grid for the states of their domU applications and reports them back to the controller. Typical entries in the controller's messages log look like this:

Apr  3 04:35:22 loda31tg2 3tgridctl: starting grid recovery
...
Apr  3 04:35:48 loda31tg2 3tgridctl: Grid recovery in progress: There were 8 active application(s) when the grid controller went down. 8 application(s) failed to be recovered.
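
To check these entries quickly, the log can be filtered for 3tgridctl lines. The following is a minimal sketch, assuming the log is /var/log/messages on the controller and that the summary line has exactly the wording shown above. The function names gridctl_messages and recovery_counts are hypothetical helpers, not part of AppLogic.

```shell
# Print every 3tgridctl line from a syslog-style file.
# The default path is an assumption; pass another file as the first argument.
gridctl_messages() {
    grep '3tgridctl' "${1:-/var/log/messages}"
}

# Print "<active> <failed>" taken from the most recent recovery summary line.
# Prints nothing if no summary line with the expected wording is found.
recovery_counts() {
    gridctl_messages "$1" |
        sed -n 's/.*There were \([0-9][0-9]*\) active application(s).* \([0-9][0-9]*\) application(s) failed to be recovered.*/\1 \2/p' |
        tail -n 1
}
```

For example, running recovery_counts without arguments on the controller would print the two counts from the last recovery attempt, if one was logged.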

After determining the status of the applications on the servers, the controller tries to start any application that was running when the controller went down and is now reported as stopped.

Both steps are part of the High Availability processes offered by AppLogic.

There are several situations in which we may need to manually correct the state of the applications:

Sometimes applications may fail to start on controller boot, or we may want to prevent them from starting automatically even if they were stopped when the controller went down.
Applications may show in an unknown state when the controller boots. When this happens, the application cannot be stopped (app stop reports it as already stopped) and it cannot be started, because the controller responds that the application is in an unknown state.

This article explains how to handle and correct these situations so that applications return to their desired state.

Resolution

The most likely cause for applications showing in the unknown state is that the controller was unable to contact the servers to determine the status of the applications. If so, the following message will show up in /var/log/messages on the controller at boot:

Apr  3 13:59:29 ICT-Grid01 3tgridctl: 3tgridctl failed, unable to enumerate servers
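
A quick way to test for this condition is to search the log for that message. This is a minimal sketch, assuming the default log path /var/log/messages; enumeration_failed is a hypothetical helper name.

```shell
# Return 0 if the log records a server enumeration failure, 1 if it does not.
# The default path is an assumption; pass another file as the first argument.
enumeration_failed() {
    grep -q '3tgridctl failed, unable to enumerate servers' "${1:-/var/log/messages}"
}
```

It could be used as: if enumeration_failed; then ls -l /var/applogic/state/ha/srvs; fi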

In this case, check the contents of the /var/applogic/state/ha/srvs directory. It contains one srvX.desc file for each server in your grid. The typical contents of one of those files are:

server srv1
   {
   disabled = 0
   maintenance = 0
   last_known_state = up
   mgmt_id = "10.250.10.55:PowerAdmin__BFC:0025901531139428"
   }

The parameters are self-explanatory. The last entry, mgmt_id, holds the server's IPMI IP address in the BFC together with its internal ID. If any of these files is corrupt or contains wrong information, 3tgridctl will be unable to enumerate the servers and will report applications in the unknown state. A corrupted file can be recreated by hand. If you do not know the correct mgmt_id value, leave it blank (mgmt_id = ""); AppLogic will fill it in properly afterwards.
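
Recreating a file by hand can be sketched as follows. This is only an illustration: write_srv_desc is a hypothetical helper, and the values for disabled, maintenance and last_known_state are copied from the sample above as an assumption, so adjust them to the real state of the server. It overwrites any existing file of the same name, so copy the old one somewhere outside the srvs directory first.

```shell
# Write a minimal <server_name>.desc file with a blank mgmt_id.
# Usage: write_srv_desc <srvs_dir> <server_name>
# The field values mirror the sample file and are assumptions, not defaults
# documented by AppLogic.
write_srv_desc() {
    dir=$1
    name=$2
    cat > "$dir/$name.desc" <<EOF
server $name
   {
   disabled = 0
   maintenance = 0
   last_known_state = up
   mgmt_id = ""
   }
EOF
}
```

For example: write_srv_desc /var/applogic/state/ha/srvs srv1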

Once the srvX.desc files are correct, the 3tgridctl script can be run manually to set the application states properly. To do so, run the following commands:

cd /usr/local/applogic/scripts
perl 3tgridctl.pl start

The controller determines the state of each application from the contents of /var/applogic/state/ha/apps. That directory holds a subdirectory for each application that was fully or partially running in the grid when it went down. For instance:

ls -al apps/
 
total 40
drwxr-xr-x  10 root     root     4096 Apr  2 17:43 .
drwxr-xr-x   7 applogic applogic 4096 Mar 20 05:30 ..
drwxr-xr-x   3 root     root     4096 Mar 27 14:16 chera08_Win0864S
drwxr-xr-x   5 root     root     4096 Mar 28 08:41 fc15-arka
drwxr-xr-x   3 root     root     4096 Mar 27 14:28 fc15-jun
drwxr-xr-x   3 root     root     4096 Mar 27 14:18 fedora14_abdul
drwxr-xr-x   4 root     root     4096 Mar 27 14:18 fedora_arka
drwxr-xr-x   3 root     root     4096 Mar 27 14:22 gilmi06_test_oni_perf
drwxr-xr-x   4 root     root     4096 Apr  2 15:12 josh
drwxr-xr-x   3 root     root     4096 Apr  2 17:45 jun_vds_win03e

One way to prevent an application from starting on the next controller boot is to delete its directory. For instance:

rm -rf gilmi06_test_oni_perf

The structure of the applications directory closely mimics that of the srvs directory. For instance, if we examine the josh application (composed of one Linux main appliance connected to a net appliance), its directory contains the following:

ls -la josh/
total 20
drwxr-xr-x   4 root root 4096 Apr  2 15:12 .
drwxr-xr-x  10 root root 4096 Apr  2 17:43 ..
-rw-r--r--   1 root root  206 Apr  2 15:12 app.desc
drwxr-xr-x   2 root root 4096 Apr  2 14:59 main.LINUX64
drwxr-xr-x   2 root root 4096 Apr  2 14:59 main.NET

The .desc files indicate the status of the application or component for HA purposes. For instance in this case:

more app.desc
application josh
   {
   target_state = "running"
   cpu          = 0%
   mem          = 0
   bw           = 0
   sched        = ""
   cap_cpu      = 0
   debug        = 1
   ts_started   = 1364940777
   }

This simply indicates that the application was in the running state. Each component directory contains another .desc file indicating whether the component is in standby:

cd main.NET
ls -la
total 12
drwxr-xr-x  2 root root 4096 Apr  2 14:59 .
drwxr-xr-x  4 root root 4096 Apr  2 15:12 ..
-rw-r--r--  1 root root   45 Apr  2 14:59 comp.desc
more comp.desc
component main.NET
   {
   standby = 0
   }

which indicates that the component is not in standby.
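
To review all applications at once instead of opening each file, the .desc files can be summarized with a short script. This is a minimal sketch that only reads the files, assuming the layout and the key = value format shown above; desc_value and ha_summary are hypothetical helper names.

```shell
# Print the value of a "key = value" line from a .desc file, quotes removed.
# Usage: desc_value <file> <key>
desc_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tr -d '"' | head -n 1
}

# List each application with its target_state, followed by the standby flag
# of each of its components.
# Usage: ha_summary [apps_dir]   (the default path is taken from the article)
ha_summary() {
    apps=${1:-/var/applogic/state/ha/apps}
    for app in "$apps"/*/; do
        [ -f "${app}app.desc" ] || continue
        echo "$(basename "$app"): target_state=$(desc_value "${app}app.desc" target_state)"
        for comp in "$app"*/; do
            [ -f "${comp}comp.desc" ] || continue
            echo "  $(basename "$comp"): standby=$(desc_value "${comp}comp.desc" standby)"
        done
    done
}
```

Running ha_summary on the controller would print one line per application and one indented line per component, which makes it easier to spot the entry that needs correcting.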

Note that if an application causes problems during controller boot (for instance, because the controller goes down or gets stuck in a reboot loop), its directory must be deleted from these directories, or its state changed in them, quickly, before the grid recovery process begins.