Manage high availability when messaging engines fail to start

If an attempt to start a messaging engine on a server is unsuccessful, that server is disabled as a location for that messaging engine to run. After you resolve the problem that prevented the messaging engine from starting, manually re-enable the server to maintain the high availability environment.
In a high availability environment, a messaging engine can run on multiple application servers. If an attempt to start a messaging engine on a server is unsuccessful, or the server hosting a running messaging engine stops, the high availability manager restarts the messaging engine on another eligible server. Any server on which the high availability manager cannot start the messaging engine is disabled as a location for that messaging engine to run, and the following message is produced in the JVM logs for that server:
CWSID0039E: HAManager-initiated activation has failed, messaging engine messaging_engine_name will be disabled
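
To find out which messaging engines have been disabled on a server, you can search that server's JVM logs for the message above. The following is a minimal sketch, not a product tool: the function name find_disabled_engines is hypothetical, and the pattern relies only on the message text quoted above, making no assumption about the rest of the log line layout.

```python
import re

# Pattern built from the CWSID0039E message text quoted above. Anything that
# precedes the message on the log line (timestamp, thread ID, and so on) is
# ignored rather than assumed.
DISABLED_RE = re.compile(
    r"CWSID0039E: HAManager-initiated activation has failed, "
    r"messaging engine (\S+) will be disabled"
)


def find_disabled_engines(log_lines):
    """Return the messaging engine names reported as disabled, in first-seen order."""
    seen = []
    for line in log_lines:
        match = DISABLED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Pass the function any iterable of log lines, for example an open file object for the server's log file.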

In some situations, the messaging engine can repeatedly fail to start. In the following example, a messaging engine, hosted in a cluster of three servers, is configured to use a data store. The cluster is started before the database that is hosting the data store. The messaging engine attempts to start on server1, and tries to connect to the data store for up to 15 minutes by default.

Because the database has not been started, the messaging engine cannot connect to the data store. The messaging engine fails to start and server1 is disabled for high availability. The messaging engine fails over to server2, and again attempts to start and connect to the data store.

If the database is still not started, the messaging engine fails to start and server2 is disabled for high availability. The messaging engine fails over to server3, and again attempts to start and connect to the data store.

If the database is still not running, the messaging engine fails to start and server3 is disabled for high availability. All servers in the cluster are now disabled for high availability, and the messaging engine cannot start until you start the database and re-enable at least one server.
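
The failover sequence in this example can be modeled in a few lines. This is an illustrative simulation only, not product code: simulate_startup and the database_available callable are hypothetical names, and the connection retry period is not modeled.

```python
def simulate_startup(servers, database_available):
    """Walk the failover sequence described in the example.

    servers is the ordered list of cluster members to try.
    database_available is a hypothetical callable that takes the attempt
    number (starting at 0) and returns True if the database is running
    at the time of that attempt.

    Returns a tuple (hosting_server, disabled_servers), where hosting_server
    is None if the messaging engine could not start anywhere.
    """
    disabled = []
    for attempt, server in enumerate(servers):
        if database_available(attempt):
            # The data store is reachable, so the engine starts here.
            return server, disabled
        # The start attempt failed, so this server is disabled for the engine.
        disabled.append(server)
    return None, disabled


cluster = ["server1", "server2", "server3"]

# Database never started: every server ends up disabled.
host, disabled = simulate_startup(cluster, lambda attempt: False)

# Database started before the second attempt: only server1 is disabled.
host2, disabled2 = simulate_startup(cluster, lambda attempt: attempt >= 1)
```

In the first call no server hosts the messaging engine and all three servers are disabled, which is the situation that requires a manual re-enable or a server restart.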
After you fix the cause of the messaging engine's failure to start, re-enable the servers for high availability either by restarting the servers, or by following the steps in this task to enable them using the administrative console.

1. Navigate to the high availability groups panel in the administrative console to display a list of high availability groups. Refer to "View high availability group information" for details.
2. Find and click the relevant high availability group in the list. To find the relevant group, look for the bus and messaging engine names contained as name-value pairs within the group name. For example, the group with the following name contains messaging engine MyCluster.000-MyBus, running on bus MyBus on cluster MyCluster:
   IBM_hc=MyCluster,WSAF_SIB_BUS=MyBus,WSAF_SIB_MESSAGING_ENGINE=MyCluster.000-MyBus,type=WSAF_SIB

   The panel for that group appears, showing the high availability state associated with each running server in the messaging engine cluster. If a server is in the disabled state (indicated by a red square), the high availability of the environment is compromised because the messaging engine cannot start on that server. If all servers are in the disabled state, the messaging engine cannot start until you enable at least one server.
3. Select any members that are in the disabled state, and click Enable.
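
When the list of high availability groups is long, it can help to split each group name into its name-value pairs before comparing. The following sketch assumes only the comma-separated name=value layout shown in the example group name. The function names parse_ha_group_name and is_messaging_engine_group are hypothetical and are not part of any product API.

```python
def parse_ha_group_name(group_name):
    """Split a high availability group name into a dict of name-value pairs."""
    pairs = {}
    for part in group_name.split(","):
        part = part.strip()  # tolerate stray spaces around the commas
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("not a name=value pair: %r" % part)
        pairs[key.strip()] = value.strip()
    return pairs


def is_messaging_engine_group(group_name, bus, engine):
    """Return True if the group belongs to the given bus and messaging engine."""
    pairs = parse_ha_group_name(group_name)
    return (
        pairs.get("type") == "WSAF_SIB"
        and pairs.get("WSAF_SIB_BUS") == bus
        and pairs.get("WSAF_SIB_MESSAGING_ENGINE") == engine
    )


name = ("IBM_hc=MyCluster,WSAF_SIB_BUS=MyBus,"
        "WSAF_SIB_MESSAGING_ENGINE=MyCluster.000-MyBus,type=WSAF_SIB")
matches = is_messaging_engine_group(name, "MyBus", "MyCluster.000-MyBus")
```

Here matches is True for the example group name, and the same check returns False for a group that belongs to a different bus or messaging engine.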

Next steps

When a messaging engine that uses a data store fails over to another application server, it might attempt to start before the database server has detected the loss of the network connection to the original application server. Because the database server has not detected the loss of the connection, the data store table locks are not released and the messaging engine cannot start. In this situation, the messaging engine can fail to start on all servers in the cluster. To avoid this problem, tune the system to detect the loss of the connection more quickly.