High Availability considerations for the Tivoli Enterprise Monitoring Server

In general, the Tivoli Monitoring components are highly resilient. The components tolerate network and communications failures, attempting to reconnect to other components and retrying communication until they succeed. The functions described in Monitor functions place requirements on the various components; see Table 1 for the failover options available for each monitoring component.

Other options are available to achieve high availability, such as installing multiple Tivoli Enterprise Portal Servers and using the migrate-export and migrate-import commands to synchronize their customization.
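For example, on a Linux or UNIX portal server the export and import run through the portal server (cq) component. The following is a minimal sketch, assuming a default $CANDLEHOME installation; exact paths vary by platform:

  # On the currently active portal server: export customizations to a SQL file.
  # The export is written as saveexport.sql in the portal server sqllib directory.
  cd $CANDLEHOME/bin
  ./itmcmd execute cq "runscript.sh migrate-export.sh"

  # Copy saveexport.sql to the sqllib directory of the other portal server,
  # then import it there:
  ./itmcmd execute cq "runscript.sh migrate-import.sh"

On Windows, the equivalent migrate-export.bat and migrate-import.bat scripts are run from the portal server's CNPS directory.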


Table 1. Options for Tivoli Monitoring component resiliency

| Component | Potential single point of failure? | Cluster failover available? | Hot standby failover available? |
|---|---|---|---|
| Hub monitoring server | Yes | Yes | Yes |
| Portal server | Yes | Yes | No |
| Tivoli Data Warehouse database | Yes | Yes | No |
| Warehouse Proxy Agent | Yes, if a single Warehouse Proxy Agent is in the environment | Yes | No |
| Summarization and Pruning Agent | Yes | Yes | No |
| Remote monitoring server | No. Another monitoring server can assume the role of a remote monitoring server for connected agents; this is known as "agent failover" (see the configuration sketch following this table). | N/A | N/A |
| Agent | Not a single point of failure for the whole monitoring solution, but a single point of failure for the specific resource being monitored | Yes | No |
| Tivoli Enterprise Monitoring Automation Server | No. The automation server at the peer hot standby hub takes over publishing OSLC resource registrations and responding to metric requests. | No | Yes¹ |

¹ Hot standby failover is only available for this component when the co-located hub monitoring server is configured for hot standby and the Tivoli Enterprise Monitoring Automation Server is installed at each hub in the hot standby environment. This component does not support hot standby failover independent of the hub monitoring server. When the automation server is configured in a hot standby environment, the Registry Services component of Jazz for Service Management must be at version 1.1.0.1 or later. For more information about configuring the automation server for hot standby support, see the IBM Tivoli Monitoring Installation and Setup Guide.
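Both agent failover and hot standby depend on agents and remote monitoring servers being configured with a secondary connection to the hub. As an illustration, on Linux and UNIX systems the CT_CMSLIST environment variable in an agent's configuration file lists the monitoring servers to try, in order; this is a sketch with hypothetical host names:

  # Example entry in an agent environment file (for instance,
  # $CANDLEHOME/config/lz.config for the Linux OS agent).
  # The agent tries hub-primary first, then fails over to hub-standby.
  CT_CMSLIST='IP.PIPE:#hub-primary.example.com;IP.PIPE:#hub-standby.example.com'

The same secondary connection can be set interactively by reconfiguring the agent (itmcmd config -A <product code> on Linux and UNIX, or Manage Tivoli Enterprise Monitoring Services on Windows) and specifying an optional secondary monitoring server.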

For the resiliency characteristics of each option, see Table 2.


Table 2. Resiliency characteristics of IBM Tivoli Monitoring components and features

Hub monitoring server
  Hub cluster failover: The hub monitoring server is restarted as soon as the cluster manager detects the failure.
  Hub hot standby failover: A communication failure between hubs causes the standby hub to begin processing to establish itself as the master, or primary, hub server.

Portal server
  Hub cluster failover: The portal server reconnects to the hub monitoring server as soon as the hub is restarted.
  Hub hot standby failover: The portal server must be reconfigured to point to the new hub.

Tivoli Data Warehouse database
  Hub cluster failover: No relationship to the hub.
  Hub hot standby failover: No relationship to the hub.

Warehouse Proxy Agent
  Hub cluster failover: As an agent, the Warehouse Proxy Agent reconnects to its hub and continues to export data from agents to the Tivoli Data Warehouse.
  Hub hot standby failover: As an agent configured with a secondary connection to the hub server, the Warehouse Proxy Agent connects to its secondary hub and continues to export data from agents to the Tivoli Data Warehouse.

Summarization and Pruning Agent
  Hub cluster failover: As an agent, the Summarization and Pruning Agent reconnects to its hub and continues to summarize and prune data in the Tivoli Data Warehouse.
  Hub hot standby failover: As an agent configured with a secondary connection to the hub server, the Summarization and Pruning Agent connects to its secondary hub and continues to summarize and prune data in the Tivoli Data Warehouse.

Remote monitoring server
  Hub cluster failover: The remote monitoring server detects the hub restart and tries to reconnect, synchronizing with the hub.
  Hub hot standby failover: When configured with a secondary connection to the hub server, the remote monitoring server retries the connection with the primary hub and, if unsuccessful, tries to connect to the secondary hub. When the new hub has been promoted to master, the remote monitoring server detects the hub restart and tries to reconnect, synchronizing with the hub.

Agent
  Hub cluster failover: All agents directly connected to the hub reconnect to the hub after restart and begin synchronization.
  Hub hot standby failover: When configured with a secondary connection to the hub server (as in the CT_CMSLIST sketch following Table 1), agents directly connected to the hub perceive the loss of connection and retry. With the first hub down, an agent connects to the second hub and begins synchronization, which includes restarting all situations.

Event data
  Hub cluster failover: Agents resample all polled situation conditions and reassert all that are still true. Situation history is preserved.
  Hub hot standby failover: Agents resample all polled situation conditions and reassert all that are still true. Previous situation history is not replicated to the failover hub server and is therefore lost. To persist historical event data, forward events to Tivoli Netcool/OMNIbus or the Tivoli Enterprise Console.

Hub failback (failback is the process of moving resources back to their original node after the failed node comes back online)
  Hub cluster failover: Available through cluster manager administration and configuration.
  Hub hot standby failover: The secondary hub must be stopped so that the primary hub can become master again (see the failback sketch following this table).

Time for failover
  Hub cluster failover: The detection of a failed hub and the subsequent hub restart are quick, and both can be configured through the cluster manager. The synchronization process continues until all situations are restarted and the whole environment is operational; the time this takes depends on the size of the environment, including the number of agents and distributed situations.
  Hub hot standby failover: The detection of a failed hub is quick. There is no restart of the hub, but the connection of remote monitoring servers and agents to the standby hub requires at least one more heartbeat interval because they try the primary hub before trying the secondary. The synchronization process continues until all situations are restarted and the whole environment is operational; the time this takes depends on the size of the environment, including the number of agents and distributed situations.

z/OS environments
  Hub cluster failover: The clustered solution on a z/OS hub has not yet been tested and is therefore not a supported configuration. Remote monitoring servers on z/OS systems are supported.
  Hub hot standby failover: Hot standby is fully supported on z/OS systems, for both remote and local hubs.

Data available on failover hub
  Hub cluster failover: All data is shared through disk or replication.
  Hub hot standby failover: All Enterprise Information Base data is replicated through the mirror synchronization process, except data for the following components:

    • Situation status history

    • Publishing of any Tivoli Universal Agent metadata and versioning

    • Remote deployment depot

Manageability of failover
  Hub cluster failover: Failover can be automatic or directed through cluster administration. You control which hub is currently the master hub server and the current state of the cluster. A clustered hub monitoring server must be shut down completely for maintenance.
  Hub hot standby failover: Failover can be directed by stopping the hub; the starting order controls which hub is the master hub server. In a hot standby environment, you can apply a patch one node at a time. For further information about the primary hub monitoring server and its configuration, see The clustering of IBM Tivoli Monitoring components.
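As a concrete illustration of hot standby failback and one-node-at-a-time maintenance, the acting hub is simply stopped so that the other hub becomes master; the monitoring server names below are hypothetical placeholders:

  # Hot standby failback sketch (Linux/UNIX).
  # With the original primary hub back online and acting as standby,
  # stop the currently acting (secondary) hub so the primary becomes master again:
  ./itmcmd server stop HUB_SECONDARY
  # Apply maintenance to the stopped hub if needed, then restart it;
  # it rejoins the pair in the standby role:
  ./itmcmd server start HUB_SECONDARY

Because each hub can be stopped and patched in turn like this, a hot standby pair can take maintenance one node at a time, whereas a clustered hub must be shut down as a unit.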


Parent topic:

Monitor functions and architecture
