Core group discovery and failure detection protocols
When a core group member starts, no connections to other core group members exist. If a core group is configured to run with either the default Discovery and Failure Detection Protocols or an alternative protocol provider, the corresponding tasks start as part of the process startup procedure. At regularly scheduled intervals, for as long as the core group member is active, these tasks establish connectivity to other core group members, monitor this connectivity, and handle connectivity failures for this core group member.
The default Discovery Protocol
New feature: Beginning in WAS v8.0, you can configure the server to use the High Performance Extensible Logging (HPEL) log and trace infrastructure instead of using SystemOut.log, SystemErr.log, trace.log, and activity.log files or native z/OS logging facilities. If you are using HPEL, you can access all of your log and trace information using the LogViewer command-line tool from your server profile bin directory. See the information about using HPEL to troubleshoot applications for more information on using HPEL.
The default Discovery Protocol establishes network connectivity with the other members of the core group.
To establish this connectivity, the Discovery Protocol retrieves the list of core group members and the associated network information from the product configuration settings. The Discovery Protocol then attempts to open network connections to all of the other core group members. At periodic intervals, the Discovery Protocol recalculates the set of unconnected members and attempts to open connections to those members.
When a connection is made to another core group member, the Discovery Protocol notifies the View Synchrony Protocol, and logs this event as an informational message, similar to the following message, in the SystemOut.log file.
DCSV1032I: DCS Stack DefaultCoreGroup at Member MyCell\anzio\nodeagent: Connected a defined member MyCell\anzioCellManager\dmgr.
Connections can fail at any time for a variety of reasons. The Failure Detection Protocol detects connection failures and notifies the Discovery Protocol. The Discovery Protocol then attempts to open a new network connection to that member at the next scheduled interval.
The number of CPU cycles that the Discovery Protocol task consumes is proportional to the number of core group members that are stopped or unreachable. At the default settings, this consumption is negligible.
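The periodic recalculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the function name discovery_pass, the try_connect callback, and the member names are all hypothetical.

```python
from typing import Callable, Iterable, Set


def discovery_pass(
    configured: Iterable[str],
    connected: Set[str],
    self_name: str,
    try_connect: Callable[[str], bool],
) -> Set[str]:
    """Run one discovery interval and return the members still unconnected."""
    # Recalculate the unconnected set: every configured member except this
    # process and the members that already have an open connection.
    unconnected = set(configured) - connected - {self_name}
    for member in sorted(unconnected):
        # One connection attempt per unconnected member, so the cost of a
        # pass grows with the number of stopped or unreachable members.
        if try_connect(member):
            connected.add(member)
    return unconnected - connected


# Hypothetical core group of four members, seen from the node agent.
members = ["dmgr", "nodeagent", "ServerA", "ServerB"]
open_connections = {"dmgr"}
attempts = []


def fake_connect(member: str) -> bool:
    """Stand-in for a real connection attempt; pretends ServerB is stopped."""
    attempts.append(member)
    return member != "ServerB"


still_down = discovery_pass(members, open_connections, "nodeagent", fake_connect)
```

In this sketch only ServerA and ServerB are attempted, because dmgr is already connected. That is why the work per interval tracks the number of unreachable members rather than the size of the core group.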
The default Failure Detection Protocol
The Failure Detection Protocol monitors the core group network connections that the Discovery Protocol establishes. When the Failure Detection Protocol detects a failed network connection, it reports the failure to the View Synchrony Protocol and the Discovery Protocol. The View Synchrony Protocol adjusts the view to exclude the failed member. The Discovery Protocol attempts to reestablish a network connection with the failed member. This task runs as long as the member is active.
The Failure Detection Protocol uses two distinct mechanisms to find failed members:
It looks for connections that closed because the underlying socket was closed.
When a core group member stops normally in response to an administration command, the core group transport for that member also stops, and the socket that is associated with the transport closes. If a core group member terminates abnormally, the underlying operating system normally closes the sockets that the process opened, including the socket that is associated with the core group transport.
For either type of termination, core group members that have an open connection to the terminated member are notified that the connection is no longer usable. The core group member that receives the socket closed notification considers the terminated member a failed member.
When a failed member is detected because of the socket closing mechanism, one or more of the following messages are logged in the SystemOut.log file for the surviving members:
DCSV1113W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected another member because the outgoing connection to the other member was closed. Suspected member is anzioCell01\nettuno\ServerB. DCS logical channel is View|Ptp.
DCSV1111W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected another member because the outgoing connection from the other member was closed. Suspected members is anzioCell01\nettuno\ServerB. DCS logical channel is Connected|Ptp.
The closed socket mechanism is the way that failed members are typically discovered. TCP settings in the underlying operating system, such as FIN_WAIT, affect how quickly socket closing events are received.
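The socket closing notification is general stream-socket behavior and can be observed outside the product. The following Python sketch assumes two locally connected sockets that stand in for the two ends of a core group connection. It does not use the core group transport itself, and the variable names are hypothetical.

```python
import socket

# socketpair() returns two connected sockets within one process; they stand
# in for the two ends of a core group connection in this illustration.
survivor, terminated = socket.socketpair()

terminated.sendall(b"ping")
first_read = survivor.recv(16)   # data arrives while the peer is still up

terminated.close()               # the peer stops and its socket closes
second_read = survivor.recv(16)  # an empty read signals that the peer closed

# The surviving side treats the closed connection as a failed member.
peer_failed = second_read == b""
survivor.close()
```

The surviving end learns about the closure from the operating system the next time it reads, without waiting for any heartbeat timeout.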
It listens for active heartbeats from the core group members.
The active heartbeat mechanism is analogous to the TCP keepalive function. At regularly scheduled intervals, each core group member sends a ping packet on every open core group connection. The interval at which the packet is sent is called the heartbeat transmission period.
Each core group member expects to receive a packet on each open connection from the core group member on the other end of the connection. If no packets are received over an open connection within the time length specified for the heartbeat timeout period, then the member on the other end of the connection is marked as failed.
The heartbeat timeout period must be a whole number that is a multiple of the heartbeat transmission period. The heartbeat timeout period must also be at least twice as large as the heartbeat transmission period.
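The two constraints and the timeout check can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's code: the function names are hypothetical, the 30000 millisecond transmission period is an assumed value chosen for the example, and treating a silence exactly equal to the timeout as a failure is also an assumption.

```python
from typing import Dict, List


def validate_heartbeat_settings(transmission_ms: int, timeout_ms: int) -> bool:
    """Check that the timeout is a multiple of, and at least twice, the period."""
    if transmission_ms <= 0 or timeout_ms <= 0:
        return False
    is_multiple = timeout_ms % transmission_ms == 0
    at_least_double = timeout_ms >= 2 * transmission_ms
    return is_multiple and at_least_double


def timed_out_members(
    last_received_ms: Dict[str, int], now_ms: int, timeout_ms: int
) -> List[str]:
    """Return members whose connection has been silent for the timeout or longer."""
    return sorted(
        member
        for member, last_seen in last_received_ms.items()
        if now_ms - last_seen >= timeout_ms  # assumption: equality counts as failed
    )


TRANSMISSION_MS = 30000  # assumed value, for illustration only
TIMEOUT_MS = 180000      # hypothetical timeout, six times the period

settings_ok = validate_heartbeat_settings(TRANSMISSION_MS, TIMEOUT_MS)

# Hypothetical timestamps (milliseconds) of the last packet on each connection.
last_packet = {"ServerA": 170000, "ServerB": 10000}
failed = timed_out_members(last_packet, now_ms=200000, timeout_ms=TIMEOUT_MS)
```

Here ServerA was heard from 30000 milliseconds ago and stays in the view, while ServerB has been silent for 190000 milliseconds and is marked as failed.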
When a member is marked as failed, the following message is sent to the error log file:
DCSV1112W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected member anzioCell01\nettuno\ServerB because of heartbeat timeout. Configured Timeout is 180000 milliseconds. DCS logical channel is Connected|Ptp.
Active heartbeats are most useful for detecting core group members that are unreachable because the network is stopped. Active heartbeats consume some CPU. The amount of CPU that is consumed is proportional to the number of active members in the core group. The default configuration for active heartbeats is a balance of CPU usage and timely failed member detection.
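When troubleshooting, it can help to pull the suspected member and the detection mechanism out of these messages. The following Python sketch is one possible approach. The regular expression is derived only from the three sample messages shown in this topic, so it is an assumption about their layout rather than a documented message format, and the names SUSPECT_PATTERN, MECHANISM, and parse_suspect are hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the DCSV1111W, DCSV1112W, and DCSV1113W samples above.
SUSPECT_PATTERN = re.compile(
    r"(DCSV111[123]W): DCS Stack (\S+) at Member (\S+): "
    r".*?Suspected members? (?:is )?(\S+?)(?:\.(?:\s|$)| because )"
)

# Mapping of message ID to the detection mechanism described in this topic.
MECHANISM = {
    "DCSV1111W": "socket closed",
    "DCSV1113W": "socket closed",
    "DCSV1112W": "heartbeat timeout",
}


def parse_suspect(line: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (message ID, core group, reporting member, suspected member)."""
    match = SUSPECT_PATTERN.search(line)
    return match.groups() if match else None


# Raw string, so the backslashes in the member names are kept literally.
sample = (
    r"DCSV1112W: DCS Stack DefaultCoreGroup at Member "
    r"anzioCell01\anzioCellManager01\dmgr: Suspected member "
    r"anzioCell01\nettuno\ServerB because of heartbeat timeout. "
    r"Configured Timeout is 180000 milliseconds. "
    r"DCS logical channel is Connected|Ptp."
)
parsed = parse_suspect(sample)
mechanism = MECHANISM[parsed[0]] if parsed else None
```

Because the pattern uses search rather than match, it also tolerates any prefix that a log file places before the message ID.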
We can use the admin console or wsadmin.sh to configure the heartbeat transmission period and heartbeat timeout period. Read the topic Configure the Failure Detection Protocol for a core group for a description of how to use the admin console to change these settings.
Alternative protocol providers
Currently, no alternative protocol providers are available for the IBM i and distributed platforms.
We can use an alternative protocol provider instead of the default Discovery Protocol and Failure Detection Protocol to monitor and manage communication between core group members. In general, alternative protocol providers, such as the z/OS Cross-system Coupling Facility (XCF)-based provider, use fewer system resources than the default Discovery Protocol and Failure Detection Protocol, especially during times when the core group members are idle. An alternative protocol provider generally uses fewer system resources because it does not perform the member-to-member TCP/IP pinging that the default protocol providers use to determine if a core group member is still active.
Before reconfiguring a specific core group to use an alternative protocol provider, verify that the core group meets the following requirements. If the core group does not meet all of these requirements, continue to use the default Discovery Protocol and the default Failure Detection Protocol with this core group.
- The core group is homogeneous. This means that the core group processes must all reside on the same platform. For example, the core group cannot contain a mixture of z/OS and distributed processes.
- If the core group needs to be bridged to another core group, using the core group bridge service, then all of the core groups that are bridged to this core group are also homogeneous core groups.
- All members of the core group must be at v7.x of the product. If any members of the core group are running at a v6.x level of the product, you must update them to v7.x before you can switch to the alternative protocol provider.
Core groups (high availability domains)
Configure the default Discovery Protocol for a core group
Configure the default Failure Detection Protocol for a core group
Related
Core group custom properties