

Potential Problem When Running CA Process Automation on a VMware Server Using the E1000 Network Interface

Symptom:

The root cause of this problem is rare, sporadic socket I/O failures, which can leave the calling software waiting indefinitely for a read to complete.
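The following is a minimal sketch of the general failure mode described above, not CA Process Automation code and not a reproduction of the E1000 fault itself. It uses only standard java.net classes to show that a blocking socket read waits for data that never arrives, and that the wait ends only because a read timeout is set. The class name BlockedReadDemo and the method readTimesOut are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class BlockedReadDemo {

    /**
     * Connects to a local peer that accepts the connection but never sends data,
     * then attempts a read. Returns true if the read ended with a timeout.
     * With a timeout of 0 (the java.net default) the read would wait forever.
     */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket silentPeer = new ServerSocket(0);   // any free local port
             Socket client = new Socket("127.0.0.1", silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {         // accepted, but never written to

            client.setSoTimeout(timeoutMillis);               // bound the wait for this demo
            InputStream in = client.getInputStream();
            try {
                in.read();                                    // blocks: no data ever arrives
                return false;
            } catch (SocketTimeoutException e) {
                return true;                                  // the read did not complete in time
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

In a thread dump, a thread waiting in a read like this one appears as runnable inside the socket read call, which is why the problem produces no error of its own.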

From the user's perspective, the most typical symptom is the unexpected hanging of processes that normally complete without issue. These processes resume and complete as expected after the CA Process Automation Orchestrator is restarted. The problem can affect a small subset of processes or all running processes. It has no correlation with Orchestrator uptime, and may manifest shortly after a restart or after days, weeks, or months of otherwise flawless Orchestrator operation.

This problem has been seen only in environments running high volumes of CA Process Automation processes. In most environments where the E1000 NIC is installed, the problem has never occurred, or has occurred so infrequently that it has not been detected.

Solution:

This problem is very difficult to confirm. When it occurs, the CA Process Automation thread is often stuck on a socket read and no relevant errors are written to the log files. Confirmation requires reviewing a series of Java thread dumps taken while the problem is occurring, to verify that the operator remains stuck on a socket read from one dump to the next.
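The review of thread dumps described above can be partly automated. The following is a minimal sketch, not a CA Technologies tool, that reports the threads whose topmost stack frame is a socket read in every dump supplied. It assumes each dump is plain text in the HotSpot jstack format (thread header lines begin with the quoted thread name, frames begin with "at"), one dump per file, and that a blocked read appears as java.net.SocketInputStream.socketRead0, which is the frame name on Java 8 and earlier and differs on newer JDKs. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class StuckSocketReadFinder {

    // Assumption: frame shown for a blocking java.net socket read on Java 8 and earlier.
    static final String SOCKET_READ_FRAME = "java.net.SocketInputStream.socketRead0";

    /** Returns the names of threads whose topmost frame in one dump is a socket read. */
    static Set<String> threadsInSocketRead(String dump) {
        Set<String> result = new TreeSet<>();
        String currentThread = null;
        boolean topFrameSeen = false;
        for (String line : dump.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Thread header: the name is the text between the first pair of quotes.
                int end = trimmed.indexOf('"', 1);
                currentThread = end > 0 ? trimmed.substring(1, end) : null;
                topFrameSeen = false;
            } else if (currentThread != null && !topFrameSeen && trimmed.startsWith("at ")) {
                // Only the first frame after the header is the top of the stack.
                topFrameSeen = true;
                if (trimmed.contains(SOCKET_READ_FRAME)) {
                    result.add(currentThread);
                }
            }
        }
        return result;
    }

    /** Returns the threads that are in a socket read in every one of the given dumps. */
    static Set<String> stuckInAllDumps(List<String> dumps) {
        Set<String> stuck = null;
        for (String dump : dumps) {
            Set<String> inRead = threadsInSocketRead(dump);
            if (stuck == null) {
                stuck = new TreeSet<>(inRead);
            } else {
                stuck.retainAll(inRead);   // keep only threads seen in every dump so far
            }
        }
        return stuck == null ? new TreeSet<String>() : stuck;
    }

    public static void main(String[] args) throws Exception {
        List<String> dumps = new ArrayList<>();
        for (String path : args) {   // one thread dump file per argument
            dumps.add(new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8));
        }
        System.out.println("Threads in a socket read in every dump: " + stuckInAllDumps(dumps));
    }
}
```

A thread listed by this sketch is only a candidate. Threads that are legitimately idle while waiting for network data also sit in a socket read, so the full stack of each candidate still has to be reviewed to see whether it belongs to a hung operator.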

When errors are observed in relation to this problem, they tend to indicate generic connection errors which could have other legitimate and unrelated causes. The following is such an example:

2013-07-24 18:55:23,219 WARN  [org.hibernate.jdbc.AbstractBatcher] [nPool Worker-23] exception clearing maxRows/queryTimeout
com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
                at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(Unknown Source)
                at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(Unknown Source)
                at com.microsoft.sqlserver.jdbc.SQLServerStatement.checkClosed(Unknown Source)
                at com.microsoft.sqlserver.jdbc.SQLServerStatement.getMaxRows(Unknown Source)
                at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.getMaxRows(CachedPreparedStatement.java:367)
                at org.jboss.resource.adapter.jdbc.WrappedStatement.getMaxRows(WrappedStatement.java:378)
                at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:272)
                at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:209)

. . . and so on.

In these cases, identification of the problem is tentative, and other causes of communication failure must be excluded.

Frequent process failure, or a repeatable failure of an individual operator or operators, likely indicates other, unrelated problems in the process design or in Orchestrator functionality.

At sites where this problem has been confirmed, reconfiguring the VMware server from the E1000 network interface card (NIC) driver to the VMXNET3 NIC driver has proven to be a very effective mitigation.

CA Technologies is hesitant to declare this a complete resolution because incidents are very rare and the time between occurrences, even with the E1000 NIC, can be quite long.

If verification of the issue is required prior to making this change, please contact Support for assistance setting up the logging and Java thread dumps required to troubleshoot and verify this particular issue.