
Failing over

The acting hub might become unavailable for a number of reasons: it might be shut down for scheduled maintenance, the computer on which it is running might be shut down or might have stopped, or it might be experiencing networking problems.

When the standby hub discovers that the acting hub is unavailable, it takes over the role of the acting hub and issues messages indicating that it has taken over that role.

The primary hub is now the standby hub and the secondary hub is the acting hub, as depicted in Figure 1:

Figure 1. Configuration after failover

The automation server that is co-located with the secondary hub determines that the secondary hub is now the acting hub. Therefore, this automation server assumes the acting role for OSLC resource registration. As part of the failover processing, it sends a request to Registry Services to update resource URLs to point to the new acting automation server and begins registering new and changed resources. Because the resource URLs were updated in Registry Services, the new acting automation server receives and responds to resource requests from OSLC clients. If any OSLC clients cached the old resource URLs, the automation server co-located with the standby hub redirects the client requests to the new acting automation server.
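The redirect behavior described above can be pictured with a minimal sketch like the following. This is illustrative only and is not product code; the new acting automation server's host name and the listening port are hypothetical placeholders.

# Illustrative sketch only (not product code): an automation server that is no
# longer acting answers requests for cached resource URLs with an HTTP redirect
# to the new acting automation server. Host name and port are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_ACTING_AUTOMATION_SERVER = "http://hub2.example.com:16311"  # hypothetical

class RedirectToActingServer(BaseHTTPRequestHandler):
    def do_GET(self):
        # Preserve the requested resource path so the client reaches the same
        # resource on the new acting automation server.
        self.send_response(307)
        self.send_header("Location", NEW_ACTING_AUTOMATION_SERVER + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectToActingServer).serve_forever()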

As the remote monitoring servers and agents connected to the previous acting hub discover that the primary hub is no longer available, they switch and reconnect to the new acting hub. Because these components are in various states of processing and communication with the hub monitoring server, the discovery and reconnection with the new hub is not synchronized.

All remote monitoring servers and agents now report to the new acting hub. There is no mechanism available to switch them back to the standby hub while the acting hub is still running. The only way to switch them to the standby hub is to shut down the acting hub.

The processing that takes place after reconnection is similar to reconnection processing in an environment without a hot standby server. The following processing applies to situations and policies:

  1. Pure events that occurred before the failover are not visible. Subsequent pure events are reported when they occur.

  2. Sampled situations are reevaluated and are reported again if they are still true.

  3. A Master Reset Event is sent to the Tivoli Enterprise Console when the failover occurs. Events that result from situations being reevaluated are resent to the Tivoli Enterprise Console if the monitoring server has been configured to send events to the Tivoli Enterprise Console.

  4. Policies are restarted.

The Tivoli Enterprise Portal Server must be reconfigured to point to the new acting hub and then restarted. All portal clients reconnect to the portal server after its restart.
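On Windows systems, the portal server is typically reconfigured through the Manage Tivoli Enterprise Monitoring Services window, as described below. On Linux or UNIX systems, a minimal sketch of the reconfigure-and-restart sequence might look like the following; the installation path and the use of the itmcmd command with the portal server product code cq are assumptions based on a typical installation, and the configuration step prompts interactively for the new hub host name.

# Minimal sketch, assuming a Linux/UNIX installation under /opt/IBM/ITM.
# Reconfigures the Tivoli Enterprise Portal Server (product code "cq") to point
# to the new acting hub, then restarts it.
import subprocess

ITM_BIN = "/opt/IBM/ITM/bin"  # assumed installation path

def reconfigure_and_restart_portal_server():
    # Interactive reconfiguration: supply the new acting hub host name when prompted.
    subprocess.run([f"{ITM_BIN}/itmcmd", "config", "-A", "cq"], check=True)
    # Restart the portal server so it reconnects to the new acting hub.
    subprocess.run([f"{ITM_BIN}/itmcmd", "agent", "stop", "cq"], check=True)
    subprocess.run([f"{ITM_BIN}/itmcmd", "agent", "start", "cq"], check=True)

if __name__ == "__main__":
    reconfigure_and_restart_portal_server()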

When you reconfigure the portal server on Windows systems to point to a different monitoring server, a window is displayed asking whether a snapshot of the portal server data should be taken. No is the correct response when reconfiguring the portal server for a hot standby monitoring server, because the same portal server data is relevant to both the primary and the hot standby monitoring server.

When Yes is selected as the response to the dialog, a snapshot of the portal server data is taken through the "migrate-export" process. The data is saved in a file called saveexport.sql and is placed in the %CANDLE_HOME%\CNPS\CMS\HOSTNAME:Port directory, where HOSTNAME:Port is the current monitoring server host name and connection port number.

Then, if no snapshot exists for the monitoring server that is being switched to, a new set of portal server data is used and none of the customizations are included. To restore them for use on the new monitoring server, run a "migrate-import" using the saveexport.sql file created from the snapshot.
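As a rough illustration of the snapshot layout described above, the following sketch builds the expected snapshot location for a given monitoring server and reports whether a saveexport.sql file is present. The host name, port, and default installation path are hypothetical; the directory convention is the one described in the preceding paragraphs.

# Illustrative sketch of the snapshot location described above (Windows layout).
# The host name, port, and default CANDLE_HOME below are hypothetical.
import os

def snapshot_file(hostname: str, port: int) -> str:
    candle_home = os.environ.get("CANDLE_HOME", r"C:\IBM\ITM")  # assumed default
    # Snapshots are stored per monitoring server under CNPS\CMS\HOSTNAME:Port.
    return os.path.join(candle_home, "CNPS", "CMS", f"{hostname}:{port}", "saveexport.sql")

path = snapshot_file("hubtems1.example.com", 1918)
if os.path.exists(path):
    print(f"Snapshot found: {path}; migrate-import can restore these customizations.")
else:
    print(f"No snapshot exists yet for this monitoring server: {path}")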

When reconfiguring the portal server to switch back to the previous monitoring server, answering Yes causes the previous snapshot to be loaded automatically, restoring the customizations. Respond No when switching between the primary hub monitoring server and the hot standby monitoring server, because the same portal server data is relevant to both.

The new acting hub, which is the secondary hub, retains its role even after the primary hub Tivoli Enterprise Monitoring Server becomes operational again. The primary hub monitoring server now becomes the standby hub. When the new standby hub starts, it checks the Enterprise Information Base of the new acting hub for updates and replicates updates to its own Enterprise Information Base if necessary. The two hub Tivoli Enterprise Monitoring Servers also start monitoring connections with each other to ensure that the other hub is running.

If a remote monitoring server or agent experiences a transient communication problem with the acting hub and switches over to the standby hub, the standby hub instructs it to retry the connection with the acting hub because the standby hub knows that the acting hub is still available.
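The decision that the standby hub makes in this case can be pictured with the following illustrative sketch, which is not product code: if the acting hub is still reachable when a remote monitoring server or agent tries to switch over, the standby hub tells the component to retry its connection to the acting hub. The host name, port, and function names are hypothetical.

# Purely illustrative sketch of the decision described above; not product code.
# acting_hub_is_reachable() stands in for the monitoring that the standby hub
# performs against the acting hub. Host name and port are hypothetical.
import socket

ACTING_HUB = ("hubtems1.example.com", 1918)

def acting_hub_is_reachable(timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection(ACTING_HUB, timeout=timeout):
            return True
    except OSError:
        return False

def handle_switch_request(component: str) -> str:
    if acting_hub_is_reachable():
        # The acting hub is still up, so the problem is transient on the
        # component's side: instruct it to retry the acting hub.
        return f"{component}: retry connection to the acting hub"
    # The acting hub really is down: accept the connection as part of failover.
    return f"{component}: connection accepted by the standby hub"

print(handle_switch_request("remote_tems_1"))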

The environment continues to operate with the configuration shown in Figure 1 until the acting hub is shut down or until the computer on which the acting hub is running becomes unavailable. Each time the acting hub becomes unavailable, the failover scenario described in this section is repeated.


Parent topic:

Failover scenario
