Explanation of High Availability in AppLogic

There are several aspects of High Availability that AppLogic maintains:

  1. With respect to volumes: as long as you have more than one server, volumes are mirrored (unless the volume is created as non-mirrored); use vol info to check whether a volume is mirrored and whether the mirror is in sync.
  2. With respect to applications/appliances: as long as you have more than one server and available resources, appliances are restarted if a server fails (especially if you also configure IPMI during grid setup). The "HA check failed" message you received indicates that there are not enough resources to restart components, or that volume mirrors are broken.
  3. With respect to the grid controller: AppLogic automatically assigns up to two secondary servers. A minimum of 3 servers is required for proper controller failover, provided IPMI power control is configured.
  4. With respect to network high availability: AppLogic enables this automatically during installation if you provide the needed number of NICs and switches. In general, it is not necessary to use the "grid set" commands to configure network HA (these commands are for troubleshooting).

These links might help in this scenario:

http://forum.3tera.com/showthread.php?p=15661

http://forum.3tera.com/showthread.php?t=4181

One of the most important features of AppLogic's high availability is automated application/grid recovery from various types of failures. For applications that are running on AppLogic, automated recovery is provided for AppLogic appliances that may fail due to either appliance software failure or server failure (either physical server failure or a software bug that leads to a server failure). In addition, an AppLogic grid is able to tolerate failures of its grid controller. If the AppLogic grid controller fails, it is restarted either on the same server (if possible) or on a different server, while minimizing the impact on running applications.

There is one small caveat to the algorithm we mentioned: if a server goes down, the components are restarted using the default way of starting applications in AppLogic, which is the PACK option, always leaving the controller for last. That is, all other servers are filled up before the controller is used.

With the pack option, applications are started first on the server with the least available resources, in the priority order described below, so this also influences the way that applications get started.

  1. pack servers: schedule appliances on the server with the least amount of available resources. When using the pack scheduling mode, servers are chosen in priority order based on their assigned role and then on the least amount of available resources. AppLogic will always use servers with a role of none first, secondary second and primary last, without regard to the available resources on those servers (i.e., the secondary servers won't be used until all of the none servers are used, and the primary server won't be used until all secondary and none servers are used). A minimal sketch of this ordering is shown after this list.

    The application restart will fail if --debug is specified and the application has field engineering code 16 set.

    AppLogic 2.9: http://doc.3tera.com/AppLogic29/AdvFieldCodes.html

    AppLogic 3.0: http://doc.3tera.com/AppLogic30/Developer_Guide/2003342.html

    AppLogic 3.1: http://doc.3tera.com/AppLogic31/en/Developer_Guide/1639993.html

    AppLogic 3.5: http://doc.3tera.com/AppLogic35/en/2023840.html

    Restarting an application resets the component flapping counters that are maintained by AppLogic. A component's flapping counter is the number of times that component has failed within the last 24 hours (each component has its own flapping counter). Once a component fails 3 times in 24 hours, the component is no longer restarted by AppLogic. comp start/restart or app start may also be used to reset a component's flapping counter.
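
Below is a minimal Python sketch of the pack scheduling order and the flapping-counter rule described above. The data layout (plain dicts holding the available cpu/mem/bw of each server), the function names and the tie-breaking order are assumptions made for illustration, not AppLogic's implementation.

    # Illustrative sketch (not AppLogic source) of the "pack" scheduling order:
    # servers are grouped by role (none first, secondary second, primary last)
    # and, within each role, the server with the least available resources is
    # tried first.  Dict fields and tie-breaking are assumptions.

    ROLE_PRIORITY = {"none": 0, "secondary": 1, "primary": 2}   # primary (controller) used last

    def pack_order(servers):
        """Return the servers in the order the pack scheduler would try them.

        Each server is a dict of available resources, e.g.
        {"name": "srv6", "role": "none", "cpu": 1.25, "mem": 6335, "bw": 650}.
        """
        return sorted(
            servers,
            key=lambda s: (ROLE_PRIORITY[s["role"]], s["cpu"], s["mem"], s["bw"]),
        )

    # Flapping-counter rule described above: a component that fails 3 times
    # within 24 hours is no longer restarted automatically; comp start/restart
    # or app start resets the counter.
    def should_auto_restart(failures_in_last_24h):
        return failures_in_last_24h < 3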

In order to tolerate failures of the controller server, AppLogic uses server roles to define a set of servers in a grid that are able to run the grid controller in case of failures (i.e., backup controller servers). These roles are configured automatically by AppLogic, but may also be specified by a user or grid maintainer.

By default, every AppLogic grid is configured with the following server roles:

  1. primary - the server that currently runs the grid controller.
  2. secondary - backup controller servers (AppLogic automatically assigns up to two) that can take over the grid controller if the primary server fails.
  3. none - all remaining servers; they take no part in controller failover, and appliances are scheduled on them first.

If the primary server fails (hardware/software failure, powered down, etc.), one of the secondary servers automatically takes over as the new primary server for the grid. If the old primary server is restored to operation, it automatically becomes a non-primary server (secondary server). The new primary server starts the grid controller and, once the controller is restored, the controller automatically reacquires the state of the grid. Just as for physical server failures, AppLogic also automatically restarts appliances that were running on the failed primary server. The use of secondary controller servers and the auto-restart of appliances allow AppLogic to tolerate failures of the primary controller server.

In order for a grid to recover from a controller server failure, there must be at least 2 secondary servers up and running at the time of the server failure. If this requirement is not met (e.g., there is only one secondary server at the time of the primary server failure), the grid controller remains down and requires grid maintainer intervention in order to restore the grid controller to an operational state. If this type of controller failure is encountered, please contact your service provider for assistance.

If the primary server fails and does not come back online for at least 2 hours, AppLogic automatically assigns a new secondary server within the grid in order to maintain at least 2 secondary servers for grid controller failover.
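
The controller failover behaviour just described can be summarised in a short, hedged sketch. The data layout (dicts with role/up fields) and the function names are assumptions for illustration; only the role transitions, the 2-secondary requirement and the 2-hour re-assignment rule come from the text above.

    # Minimal sketch of the controller failover rules described above.
    # The dict layout ("role"/"up") is an assumption for illustration.

    def fail_over_controller(servers):
        """Promote a secondary server to primary when the primary server fails."""
        secondaries = [s for s in servers if s["role"] == "secondary" and s["up"]]
        if len(secondaries) < 2:
            # Per the text: the grid controller remains down and grid maintainer
            # intervention is required to restore it.
            raise RuntimeError("controller remains down: fewer than 2 secondary servers up")
        new_primary = secondaries[0]
        new_primary["role"] = "primary"
        return new_primary

    def reintegrate_old_primary(old_primary):
        """A restored former primary automatically becomes a secondary server."""
        old_primary["role"] = "secondary"
        old_primary["up"] = True

    def maintain_secondaries(servers, hours_primary_down):
        """After 2 hours without the failed primary, promote a 'none' server to
        secondary so that 2 secondaries remain available for controller failover."""
        secondaries = [s for s in servers if s["role"] == "secondary" and s["up"]]
        if hours_primary_down >= 2 and len(secondaries) < 2:
            for s in servers:
                if s["role"] == "none" and s["up"]:
                    s["role"] = "secondary"
                    break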

In general when AppLogic schedules appliances to start on either comp start or app start, the appliances are scheduled on servers based on both their role and available resources. AppLogic first tries to schedule appliances to run on servers with a role of none, then the primary server and lastly secondary servers. The secondary servers are used as a last resort for scheduling so there is a greater chance that there are available resources to start the controller if needed.

When a secondary server takes over as the new primary server, if there are not enough resources available on the server to start the grid controller, AppLogic restarts appliances which are running on the new primary server on other servers within the grid so the grid controller can be started on the new primary server. Note that this may break appliance failover groups. If AppLogic stops one of these appliances it may not be able to restart the appliance on another server since there may not be enough resources to satisfy the failover group.
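
The two behaviours just described, the start-time role preference and the clearing of the new primary server so the controller can start, can be sketched as follows. The priority values, dict fields and helper callables are assumptions for illustration, not AppLogic internals.

    # 1. Start-time scheduling preference described above: role "none" first,
    #    then the primary server, secondaries last, so that secondaries keep
    #    room for the grid controller.
    START_ROLE_PRIORITY = {"none": 0, "primary": 1, "secondary": 2}

    def start_order(servers):
        return sorted(servers, key=lambda s: START_ROLE_PRIORITY[s["role"]])

    # 2. If the grid controller does not fit on the new primary server,
    #    appliances running there are restarted on other servers (which may
    #    break appliance failover groups).  controller_fits and
    #    restart_elsewhere are hypothetical callables supplied by the caller.
    def make_room_for_controller(new_primary, controller_fits, restart_elsewhere):
        while new_primary["appliances"] and not controller_fits(new_primary):
            restart_elsewhere(new_primary["appliances"].pop())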

Question: What is the meaning of the "HA check failed: There are not enough available resources to restart components running on 2 servers [srv2,srv3]" messages in the controller?

Answer:

Sometimes messages of the type "HA check failed: There are not enough available resources to restart components running on 2 servers [srvX, srvY, srvZ…]" are observed in the controller logs. These messages indicate that the resources available across the grid servers are not sufficient to restart all of the components running on any one of the listed servers, should that server go down for any reason.

One of the main features of AppLogic is Application High Availability. This implies that if one of the grid servers goes down, the remaining servers need to be able to take over its workload and restart the components that were formerly running on that server.

Even though the total amount of CPU, Memory and Bandwidth is usually what is quoted when referencing the amount of resources a grid has, each node contributes a specific amount to that global figure, and each component has its own requirements. Therefore, the ability to restart certain components if a server goes down is constrained by:

  1. The amount of resources available on each server node, i.e. CPU, Memory and Bandwidth.

  2. The amount of resources required by each component. A component needs to fit in its entirety on a single node (e.g. it is not possible to allocate CPU on one node and Bandwidth on another for a given component).

  3. The scheduling algorithm. In the event of a given server going down, AppLogic will try to allocate its applications to the rest of the servers using a pack scheduling algorithm (the servers are ordered by their available resources, and those with the fewest resources are filled first), leaving the controller as the last server on which appliances are started.

As a result, even though globally resources may be available, and even at node-level, HA may not be possible.
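
The check can be approximated with the simplified sketch below, which applies the three constraints above: each component of the failed server must be placed whole on a single surviving server, servers are tried in order of least available resources with the controller last, and any component that fits nowhere makes the check fail. The first-fit placement, dict fields and tie-breaking order are assumptions for illustration, not AppLogic source code.

    # Simplified sketch of the HA check described above (an illustration only).
    def try_restart_all(failed, servers, components):
        """Return the components of the failed server that cannot be placed anywhere.

        servers    - dicts of *available* resources: {"name", "role", "cpu", "mem", "bw"}
        components - dicts of required resources:    {"name", "cpu", "mem", "bw"}
        """
        # Remaining servers in pack order: least available resources first,
        # the controller (primary) server last.
        candidates = sorted(
            (dict(s) for s in servers if s["name"] != failed),
            key=lambda s: (s["role"] == "primary", s["cpu"], s["mem"], s["bw"]),
        )
        unplaced = []
        for comp in components:
            for srv in candidates:
                if (srv["cpu"] >= comp["cpu"] and srv["mem"] >= comp["mem"]
                        and srv["bw"] >= comp["bw"]):
                    srv["cpu"] -= comp["cpu"]    # the whole component lands on this server
                    srv["mem"] -= comp["mem"]
                    srv["bw"] -= comp["bw"]
                    break
            else:
                unplaced.append(comp["name"])    # no single server can hold this component
        return unplaced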

Let's consider an example. Imagine a grid with the following distribution of resources (each pair of figures is allocated/available):

server srv1 : role primary, state up(enabled), 4.25/3.65 cpu, 12797/10142 MB mem, 801/1199 Mbps bw

server srv2 : role secondary, state up(enabled), 7.00/1.00 cpu, 21468/2239 MB mem, 1800/200 Mbps bw

server srv3 : role secondary, state up(enabled), 6.00/2.00 cpu, 12270/11437 MB mem, 911/1089 Mbps bw

server srv4 : role none, state up(enabled), 7.95/0.05 cpu, 22164/1543 MB mem, 1651/349 Mbps bw

server srv5 : role none, state up(enabled), 8.00/0.00 cpu, 16384/7323 MB mem, 1411/589 Mbps bw

server srv6 : role none, state up(enabled), 6.75/1.25 cpu, 17372/6335 MB mem, 1350/650 Mbps bw

server srv7 : role none, state up(enabled), 6.00/2.00 cpu, 20592/3115 MB mem, 1780/220 Mbps bw

And the list of applications running on srv2 is the following:

1 AP1: running, 0.50 cpu, 1536 MB, 500 Mbps

2 AP2: running, 1.00 cpu, 6144 MB, 500 Mbps

3 AP3: running, 0.25 cpu, 750 MB, 100 Mbps

4 AP4: running, 3.00 cpu, 6144 MB, 300 Mbps

5 AP5: running, 2.00 cpu, 6144 MB, 300 Mbps

6 AP6: running, 0.25 cpu, 750 MB, 100 Mbps

In this particular case srv2 requires 7.00 CPU, 21468 MB and 1800 Mbps, so in theory there are globally enough resources in the grid to accommodate its components. However, the message

HA check failed: There are not enough available resources to restart components running on 1 servers [srv2]

will be raised in the controller logs (more servers may have the same problem, but this is just an example for explanatory purposes).

In this case, if srv2 fails, AppLogic will try to allocate its components starting with the server with the least resources available, srv6, then srv7, then srv3 and finally the controller, srv1. So:

  1. Servers srv4 and srv5 cannot be used to restart any application.

  2. AP1 and AP3 would be started on srv6. After this, srv6 would still have 0.50 CPU, 4049 MB Mem and 50 Mbps of bandwidth free, but this is not enough to accommodate any other component.

  3. AP6 would be started on srv7. After this, srv7 would still have 1.75 CPU, 2365 MB Mem and 120 Mbps, but it would not be able to accommodate any more applications from srv2, as it has very little bandwidth left.

  4. AP2 would be started on srv3, after which srv3 would have 1.00 CPU, 5293 MB Mem and 589 Mbps of bandwidth. No more applications could be started on any of these servers.

  5. AP4 and AP5 still need to be restarted, but the only server left is the controller itself, srv1, and it can accommodate only one of them, not both.

Hence, in this example, HA cannot be ensured if srv2 goes down. In general it is recommended to keep at least one node with almost no applications running available in the grid, so that it can accommodate a number of them in case one or several of the other nodes fail. Grids should be provisioned with enough spare capacity to make sure they are not running at the limit of their resources.
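
For illustration only, the figures from this example (available resources, bandwidth in Mbps) can be fed into the try_restart_all sketch shown earlier. Because the first-fit order used by the sketch is an assumption, the real scheduler might leave AP4 rather than AP5 unplaced, or vice versa, but either way one of the two components ends up without a server.

    # Available resources of the surviving servers from the example above.
    servers = [
        {"name": "srv1", "role": "primary",   "cpu": 3.65, "mem": 10142, "bw": 1199},
        {"name": "srv3", "role": "secondary", "cpu": 2.00, "mem": 11437, "bw": 1089},
        {"name": "srv4", "role": "none",      "cpu": 0.05, "mem": 1543,  "bw": 349},
        {"name": "srv5", "role": "none",      "cpu": 0.00, "mem": 7323,  "bw": 589},
        {"name": "srv6", "role": "none",      "cpu": 1.25, "mem": 6335,  "bw": 650},
        {"name": "srv7", "role": "none",      "cpu": 2.00, "mem": 3115,  "bw": 220},
    ]
    # Components that were running on srv2.
    apps_on_srv2 = [
        {"name": "AP1", "cpu": 0.50, "mem": 1536, "bw": 500},
        {"name": "AP2", "cpu": 1.00, "mem": 6144, "bw": 500},
        {"name": "AP3", "cpu": 0.25, "mem": 750,  "bw": 100},
        {"name": "AP4", "cpu": 3.00, "mem": 6144, "bw": 300},
        {"name": "AP5", "cpu": 2.00, "mem": 6144, "bw": 300},
        {"name": "AP6", "cpu": 0.25, "mem": 750,  "bw": 100},
    ]

    print(try_restart_all("srv2", servers, apps_on_srv2))   # -> ['AP5']: HA check fails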

User Guide for High Availability

AppLogic 2.9: http://doc.3tera.com/AppLogic29/High-Availability.html

AppLogic 3.0: http://doc.3tera.com/AppLogic30/en/User_Guide/HighAvailability.html

AppLogic 3.1: http://doc.3tera.com/AppLogic31/en/User_Guide/HighAvailability.html

AppLogic 3.5: http://doc.3tera.com/AppLogic35/en/HighAvailability.html