Configure for agent and remote monitoring server high availability and disaster recovery

IBM Tivoli Monitoring, Version 6.3 Fix Pack 2

All agents can be defined with a primary and secondary monitoring server, which allows the agent to connect to the secondary monitoring server if the primary is unavailable. Failover to the secondary monitoring server occurs automatically if the agent fails to communicate with the primary monitoring server.
If no other communication occurs between the agent and the monitoring server, the longest interval it should take for the failover to occur is the heartbeat interval, which defaults to 10 minutes.
The primary concern when building a high availability and disaster recovery configuration for the agents and remote monitoring servers is to determine how many agents to connect to each remote monitoring server. For Tivoli Monitoring V6.3, no more than 1500 monitoring agents should connect to each remote monitoring server.
The following information is important when planning your agents and remote monitoring servers:

Ensure that failover does not result in many more than 1500 monitoring agents reporting to a single remote monitoring server. There are two strategies users typically take to avoid this situation.

The first and preferred strategy involves having a spare remote monitoring server. By default, the spare remote monitoring server has no agents connected. The monitoring agents that report to a primary remote monitoring server are configured to use the spare remote monitoring server as their secondary monitoring server. Over time, network and server anomalies cause some agents to migrate to the spare.
To manage this environment, write a situation that monitors how many agents are connected to the spare remote monitoring server. You can then use the situation to trigger a Take Action command that forces the agents back to their primary remote monitoring server by restarting them. Restarting the agents causes them to reconnect to their primary monitoring server. Ideally, migrate the agents back to their primary remote monitoring server when more than 20 agents are connected to the spare monitoring server.
The disadvantage to using a spare remote monitoring server is that you must dedicate a spare server to be the spare remote monitoring server. Some users choose to co-locate this server with the Warehouse Proxy Agent or run in a virtualized environment to minimize the extra hardware required.
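
The decision logic behind such a situation can be sketched in a few lines of Python. This is only an illustration of the threshold check, not a Tivoli Monitoring situation definition or Take Action command; the server name RTEMS_SPARE, the agent names, and the shape of the connections mapping are assumptions made for the example.

```python
SPARE = "RTEMS_SPARE"      # hypothetical name for the spare remote monitoring server
MIGRATE_THRESHOLD = 20     # threshold suggested in the text above


def agents_to_restart(connections, spare=SPARE, threshold=MIGRATE_THRESHOLD):
    """Return the agents that should be restarted to move them off the spare.

    connections maps an agent name to the monitoring server it currently
    reports to. The result is empty unless MORE than `threshold` agents
    are connected to the spare.
    """
    on_spare = sorted(agent for agent, server in connections.items() if server == spare)
    return on_spare if len(on_spare) > threshold else []


# Illustrative data: 21 agents have drifted to the spare, one is on its primary.
connections = {f"agent_{i:02d}": SPARE for i in range(21)}
connections["agent_99"] = "RTEMS_1"
print(len(agents_to_restart(connections)))  # 21 agents exceed the threshold of 20
```

Returning an empty list below the threshold keeps the restart action from firing for a handful of agents that may have moved only briefly.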

The second strategy is to distribute the agents evenly so that they fail over to different remote monitoring servers, which ensures that no remote monitoring server becomes overloaded. The following example has four remote monitoring servers, and one-third of the agents on each remote monitoring server are configured to fail over to each of the other three. Review the following scenario:
RTEMS_1, RTEMS_2, RTEMS_3, and RTEMS_4 each have 1125 agents.
A third of RTEMS_1’s agents (375) fail over to RTEMS_2, a third fail over to RTEMS_3, and a third fail over to RTEMS_4.
If RTEMS_1 fails, each of the other three servers then carries 1125 + 375 = 1500 agents, so none of the remote monitoring servers becomes overloaded. The drawback of this strategy is that it requires considerable planning and tracking to keep all of the remote monitoring servers well balanced.
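
The arithmetic in this scenario can be checked with a short Python sketch. It assumes that a failed server's agents are split as evenly as possible across the surviving servers and that only one server fails at a time; the function names are hypothetical and not part of Tivoli Monitoring. Under those assumptions, the 1125 figure is consistent with the largest primary load that keeps four servers at or below 1500 agents after a single failure.

```python
LIMIT = 1500  # per-server agent ceiling stated in this guide for V6.3


def max_primary_load(server_count, limit=LIMIT):
    """Largest per-server primary load that stays within `limit` after one failure.

    With k servers of a agents each, a failure adds a / (k - 1) agents to
    each survivor, so a + a / (k - 1) <= limit gives a <= limit * (k - 1) / k.
    """
    return limit * (server_count - 1) // server_count


def load_after_failure(primary_counts, failed):
    """Return the agent count on each surviving server if `failed` goes down."""
    survivors = [name for name in primary_counts if name != failed]
    share, extra = divmod(primary_counts[failed], len(survivors))
    loads = {}
    for index, name in enumerate(survivors):
        # The first `extra` survivors absorb one additional agent each.
        loads[name] = primary_counts[name] + share + (1 if index < extra else 0)
    return loads


counts = {"RTEMS_1": 1125, "RTEMS_2": 1125, "RTEMS_3": 1125, "RTEMS_4": 1125}
print(max_primary_load(4))                 # 1125
for failed in counts:
    worst = max(load_after_failure(counts, failed).values())
    print(failed, "down: worst-case load", worst, "within limit:", worst <= LIMIT)
```

Running the same check against the real agent counts after each configuration change is one way to reduce the tracking effort that this strategy requires.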

If you want your agents to fail over to a remote monitoring server in another data center, ensure that you have good network throughput and low latency between the data centers.

Connect a very small number of agents to the hub monitoring server. Typically, only the Warehouse Proxy Agent, Summarization and Pruning Agent, and any OS agents that are monitoring the monitoring server are connected to the hub monitoring server.
Use the Tivoli Monitoring heartbeat capabilities to ensure that agents are running and accessible. The default heartbeat interval is 10 minutes. If an agent does not contact the monitoring server, a status of MS_Offline is seen at the monitoring server. An event can be generated when an agent goes offline. An administrator can evaluate whether the agent is having problems or whether there is another root cause. In addition, there is a solution posted on the Tivoli Integrated Service Management Library Web site that leverages the MS_Offline status and attempts to ping the server to determine if the server is down or whether the agent is offline. You can find more information by searching for "Perl Ping Monitoring Solution" or navigation code "1TW10TM0F" in the IBM Integrated Service Management Library.
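
The triage idea behind that solution can be sketched in Python as follows. This is not the Perl Ping Monitoring Solution itself, only an outline of the same decision: once an agent is reported offline, probe its host to separate "host unreachable" from "agent offline on a reachable host". The TCP connection attempt is a stand-in for an ICMP ping, and the port number and function names are assumptions made for the example.

```python
import socket


def tcp_reachable(host, port=22, timeout=3.0):
    """Stand-in for a ping: True if a TCP connection to host:port succeeds.

    Port 22 is an arbitrary assumption; use a port known to be open on the host.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage_offline(host, probe=tcp_reachable):
    """Classify an agent that the monitoring server already reports as offline."""
    if probe(host):
        return "agent offline, host reachable"
    return "host unreachable"
```

Passing the probe as a parameter allows the classification to be exercised without network access, for example triage_offline("myhost", probe=lambda host: False).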

