
IBM Tivoli Monitoring, Version 6.3 Fix Pack 2


Configure for agent and remote monitoring server high availability and disaster recovery

All agents can be defined with a primary and secondary monitoring server, which allows the agent to connect to the secondary monitoring server if the primary is unavailable. Failover to the secondary monitoring server occurs automatically if the agent fails to communicate with the primary monitoring server.

If no other communication occurs between the agent and the monitoring server, failover should take no longer than one heartbeat interval, which defaults to 10 minutes.
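The failover order described above can be sketched in a few lines of Python. This is only an illustration of the primary-then-secondary selection logic, not the agent's actual implementation. The function name `select_monitoring_server`, the `is_reachable` callback, and the host names are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Default heartbeat interval stated in this topic; used here only to
# express the worst-case time before a silent agent notices a failure.
DEFAULT_HEARTBEAT_MINUTES = 10


def select_monitoring_server(
    servers: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first reachable server, trying the primary first.

    `servers` is ordered by priority: primary, then secondary.
    `is_reachable` is a hypothetical callback that reports whether the
    agent can currently communicate with the given server.
    """
    for server in servers:
        if is_reachable(server):
            return server
    return None  # neither monitoring server can be reached


if __name__ == "__main__":
    # Hypothetical host names; the primary is simulated as unavailable.
    configured = ["rtems-primary", "rtems-secondary"]
    chosen = select_monitoring_server(configured, lambda s: s != "rtems-primary")
    print(chosen)
```

Passing the reachability check in as a callback keeps the selection logic separate from any particular network protocol, which also makes the sketch easy to exercise without a live monitoring server.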

The primary concern when building a high availability and disaster recovery configuration for the agents and remote monitoring servers is to determine how many agents to connect to each remote monitoring server. For Tivoli Monitoring V6.3, no more than 1500 monitoring agents should connect to each remote monitoring server.
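As a rough planning aid, the 1500-agent guideline can be turned into a small calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the "one spare server" rule assumes agents are spread evenly and that the agents of a failed server are redistributed evenly across the remaining servers, which this guide does not state.

```python
MAX_AGENTS_PER_REMOTE_SERVER = 1500  # guideline for Tivoli Monitoring V6.3


def min_remote_servers(agent_count: int,
                       limit: int = MAX_AGENTS_PER_REMOTE_SERVER) -> int:
    """Smallest number of remote monitoring servers that keeps every
    server at or below `limit` agents when all servers are running."""
    if agent_count < 0 or limit <= 0:
        raise ValueError("agent_count must be >= 0 and limit must be > 0")
    return (agent_count + limit - 1) // limit  # integer ceiling division


def remote_servers_with_failover_headroom(
        agent_count: int,
        limit: int = MAX_AGENTS_PER_REMOTE_SERVER) -> int:
    """Server count that stays within `limit` even with one server down.

    Assumption (not from the guide): the failed server's agents are
    spread evenly over the survivors, so one extra server is enough.
    """
    base = min_remote_servers(agent_count, limit)
    return base + 1 if base > 0 else 0


if __name__ == "__main__":
    agents = 4000  # hypothetical environment size
    print(min_remote_servers(agents))
    print(remote_servers_with_failover_headroom(agents))
```

In practice the secondary server assigned to each agent determines where load lands during a failure, so the even-redistribution assumption should be checked against the actual primary and secondary assignments.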

The following information is important when planning your agents and remote monitoring servers:

Connect a very small number of agents to the hub monitoring server. Typically, only the Warehouse Proxy Agent, Summarization and Pruning Agent, and any OS agents that are monitoring the monitoring server are connected to the hub monitoring server.

Use the Tivoli Monitoring heartbeat capabilities to ensure that agents are running and accessible. The default heartbeat interval is 10 minutes. If an agent does not contact the monitoring server, the monitoring server reports a status of MS_Offline for that agent. An event can be generated when an agent goes offline, so that an administrator can evaluate whether the agent itself is having problems or whether there is another root cause. In addition, a solution posted on the Tivoli Integrated Service Management Library website uses the MS_Offline status and pings the server to determine whether the server is down or only the agent is offline. You can find more information by searching for "Perl Ping Monitoring Solution" or navigation code "1TW10TM0F" in the IBM Integrated Service Management Library.
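The idea of following up an MS_Offline status with a ping can be sketched as follows. This is not the Perl Ping Monitoring Solution itself, only a minimal illustration of the same triage step. The function names, the result strings, and the host name are hypothetical, and the sketch assumes a `ping` command is available on the system path.

```python
import platform
import subprocess
from typing import Callable


def ping_host(host: str, timeout_s: float = 5.0) -> bool:
    """Send a single ping and report whether the host answered.

    Assumes a system `ping` command; the packet-count flag is -n on
    Windows and -c on most other platforms.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # ping missing or no answer within the timeout
    return result.returncode == 0


def classify_offline_agent(host: str,
                           probe: Callable[[str], bool] = ping_host) -> str:
    """Triage an agent that the monitoring server reports as MS_Offline."""
    if probe(host):
        # The host answers, so the problem is likely the agent itself.
        return "agent_offline_host_reachable"
    # No answer: the host or the network path to it may be down.
    return "host_unreachable"


if __name__ == "__main__":
    # Hypothetical host name of a system whose agent shows MS_Offline.
    print(classify_offline_agent("appserver01.example.com"))
```

A failed ping does not prove that the host is down, because firewalls often block ICMP, so the result is best treated as a hint for the administrator rather than a final diagnosis.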


