High availability environment troubleshooting tips

Message HMGR0218I is not displayed after a Java virtual machine starts

In a properly set up high availability environment, a high availability manager can reassess the environment it is managing and accept new components as they are added to the environment. For example, when a JVM is added to the infrastructure, a discovery process begins. During startup the JVM tries to contact the other members of the core group. When it finds another running JVM, it initiates a join process with that JVM that determines whether or not the JVM can join the core group. If the new JVM is accepted as a member of the core group, all of the JVMs, including the new one, log message HMGR0218I. This message is also displayed on the administrative console. Message HMGR0218I indicates the number of application servers in the core group that are currently online.
If this message is not displayed after a JVM starts, either a configuration problem or a communication problem has occurred. To fix this situation, verify that the application server is running on a current configuration: either use the deployment manager to tell the node agent to synchronize, or use the syncNode command to perform the synchronization manually. If the JVM still cannot join the core group, a network configuration problem exists.
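
One way to check whether the message was logged after a JVM start is to search the server log for the message ID. The following Python sketch shows the idea. The log path and the sample lines are placeholders rather than real product output, and the helper names are hypothetical. The only assumption about the log format is that the message ID appears as plain text on the line.

```python
def find_message(lines, message_id="HMGR0218I"):
    """Return every line that contains the given message ID."""
    return [line.rstrip("\n") for line in lines if message_id in line]


def log_contains_message(log_path, message_id="HMGR0218I"):
    """Return True if the log file has at least one line with the message ID.

    log_path is supplied by the caller; this sketch does not assume
    any particular log location.
    """
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return bool(find_message(handle, message_id))


if __name__ == "__main__":
    # Placeholder lines for illustration only, not real log output.
    sample = [
        "<timestamp> <placeholder entry>\n",
        "<timestamp> HMGR0218I: <placeholder message text>\n",
    ]
    for entry in find_message(sample):
        print(entry)
```

If no matching line is found for the most recent start, that suggests the JVM has not joined the core group, and the synchronization steps described above are the next thing to try.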

Message HMGR0123I appears in the system log file

Message HMGR0123I might appear in the system log file if the status of core group members changes at the same time as the active coordinator changes. For example, this message might be issued when a core group member restarts and becomes the active coordinator. This informational message usually does not indicate a serious problem. Even if the message appears in the system log file, the new active coordinator receives the updated group status. To minimize the occurrences of this message, select a core group member that does not frequently restart as the preferred core group coordinator.

CPU starvation messages in the system log file

CPU starvation detected error messages are displayed in the system log file whenever there is not enough physical memory available to allow the high availability manager threads to have consistent runtimes. When the CPU is spending the majority of its time trying to load swapped-out processes while processing incoming work, thread starvation might occur. The high availability manager detects this condition, and logs these error messages informing you that threads are not getting the required runtime. To achieve good performance and avoid receiving these error messages, IBM recommends allocating at least 512 MB of RAM for each Java process running on a single machine.
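
The sizing guideline above is simple arithmetic, sketched here in Python. The function names are hypothetical, and the calculation covers only the 512 MB per Java process figure from the text. It makes no allowance for the operating system or for other processes on the machine, so treat the result as a lower bound rather than a sizing recommendation.

```python
MIN_MB_PER_JAVA_PROCESS = 512  # figure quoted in the guideline above


def minimum_ram_mb(java_process_count):
    """Lower bound on RAM (in MB) for the given number of Java processes."""
    if java_process_count < 0:
        raise ValueError("java_process_count must not be negative")
    return java_process_count * MIN_MB_PER_JAVA_PROCESS


def meets_guideline(physical_ram_mb, java_process_count):
    """Return True if the machine has at least the guideline amount of RAM."""
    return physical_ram_mb >= minimum_ram_mb(java_process_count)


if __name__ == "__main__":
    # Hypothetical machine: 6 Java processes, 2048 MB of physical memory.
    print(minimum_ram_mb(6))         # 3072
    print(meets_guideline(2048, 6))  # False
```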

High CPU usage in a large cell configuration when security is enabled

With certain configurations and states, the amount of time spent in discovery becomes substantial.

If a large number of processes are defined within a core group, a proportionally large number of connections must be established to support these processes.
If a large number of inactive processes are defined within a core group, a proportionally large number of connections are attempted during each discovery interval.
If administrative security is enabled, the DCS connections are secured, and the impact of opening a connection greatly increases.

Use the Discovery and failure detection page in the administrative console to increase the length of time that the Discovery Protocol waits before it calculates the set of unconnected core group members and attempts to open connections to those members. Increasing the length of time between consecutive discovery periods decreases the amount of CPU time that is spent in discovery. See Configure the Discovery Protocol for a core group for more information.
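
To see why a longer discovery period helps, consider a deliberately simplified model, sketched in Python below. It assumes one connection attempt per unconnected core group member in each discovery period. That assumption is not taken from the product documentation, and the member count and period values are hypothetical. The model ignores the per-connection cost, which the text notes is much higher when administrative security is enabled.

```python
def connection_attempts_per_hour(unconnected_members, discovery_period_secs):
    """Estimate connection attempts per hour under the simplified model.

    Assumes one attempt per unconnected member per discovery period.
    """
    if discovery_period_secs <= 0:
        raise ValueError("discovery_period_secs must be positive")
    periods_per_hour = 3600 / discovery_period_secs
    return unconnected_members * periods_per_hour


if __name__ == "__main__":
    # Hypothetical core group with 40 inactive members.
    print(connection_attempts_per_hour(40, 30))   # 4800.0
    print(connection_attempts_per_hour(40, 120))  # 1200.0
```

Under this model, quadrupling the discovery period cuts the number of attempts to a quarter. The trade-off is that a newly started member can take longer to be discovered.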

Transient high availability heartbeat failures under heavy load

Under a heavy workload, transient heartbeat failure conditions might occur between replication partners in a high availability configuration, even though both replication partners appear to be running properly.

For Linux operating systems, this problem might be caused by TCP connection issues on the replication channel between the replication partners. These connection issues occur because the TCP buffer is not large enough to support the high volume of replication data being exchanged. To avoid these spurious heartbeat failure conditions, IBM recommends tuning the TCP buffer sizes as described in the Linux kernel tuning section of the topic that describes how to tune SIP servlets for Linux.
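
Before changing anything, it can help to record the current TCP buffer settings. The Python sketch below reads four common Linux kernel settings from /proc/sys. Choosing these four is an assumption; the tuning topic referenced above is the authority on which settings to change and what values to use. The function name is hypothetical.

```python
from pathlib import Path

# Common Linux TCP buffer settings, as paths relative to /proc/sys.
# This selection is an assumption, not a list taken from the tuning topic.
TCP_BUFFER_SETTINGS = (
    "net/core/rmem_max",
    "net/core/wmem_max",
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
)


def read_tcp_buffer_settings(proc_sys="/proc/sys"):
    """Return a dict that maps each setting name to a list of integers.

    A setting that cannot be read or parsed (for example, on a
    non-Linux system) maps to None.
    """
    settings = {}
    for relative_path in TCP_BUFFER_SETTINGS:
        name = relative_path.replace("/", ".")
        try:
            text = (Path(proc_sys) / relative_path).read_text()
            settings[name] = [int(value) for value in text.split()]
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, values in read_tcp_buffer_settings().items():
        print(name, values)
```

The tcp_rmem and tcp_wmem settings each hold three values (minimum, default, and maximum size in bytes), which is why every setting is returned as a list.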

Related:

Core group coordinator
Configure core group preferred coordinators
Configure the default discovery protocol for a core group
Tune SIP servlets for Linux
syncNode command