Tune the detection of database connection loss
If a messaging engine is configured to use a data store and cannot connect to its data store, for example because the database containing the data store is not running, the messaging engine does not start. We can tune the system to increase the chance of a successful start of the messaging engine.
In a single-server environment, when we start the application server the messaging engine attempts to start. If the database is unavailable for more than 15 minutes, the messaging engine might enter the stopped state and need to be started manually.
In a high availability environment, a messaging engine starts either as part of the server or cluster startup, or as part of the failover process. During messaging engine startup, the messaging engine attempts to connect to the data store, for up to 15 minutes by default. If one of the following statements remains true during that time, the messaging engine cannot start on the server, and the server is disabled for high availability:
- The database is unavailable or not running.
- In a failover situation, the database does not detect the loss of the network connection to the original application server, and therefore does not release the locks on the data store.
This disabled state can propagate to all members of the cluster. We must manually re-enable the servers to maintain your high availability environment.
We can increase the chance of the messaging engine starting successfully by configuring various parameters, such as the 15 minute default timeout, on the database server or application server.
Tasks
- On the database server, configure the operating system to minimize the amount of time taken to detect the loss of a network connection to an application server. Refer to the documentation for the operating system for details. For example, the following table lists the relevant parameters for Windows and AIX operating systems:
We can calculate the total amount of time taken for the database server to detect the failure of the connection to the application server, using the following formula:
Parameter name on Windows operating systems Parameter name on AIX operating systems Description KeepAliveTime tcp_keepidle The amount of time (in milliseconds on Windows operating systems and in 0.5 seconds on AIX operating systems) to wait before sending a keepalive request for an inactive connection. KeepAliveInterval tcp_keepintvl The amount of time (in milliseconds on Windows operating systems and in 0.5 seconds on AIX operating systems) to wait for a response. TCPMaxDataRetransmissions tcp_keepcnt The number of requests to send before ending the connection. time to detect connection failure = keep alive time + (keep alive interval x number of requests)
For example, for a Windows system with the parameters set according to the following table, the total amount of time taken for the database server to detect the failure of the connection to the application server is 350 seconds.
Your database product might also have relevant parameters that we can configure, for example, the IDLE THREAD TIMEOUT parameter in DB2 for z/OS .
Parameter Value KeepAlive 300000 milliseconds KeepAliveInterval 10000 milliseconds TCPMaxDataRetransmissions 5 When the database server detects the loss of the connection to the application server, the database releases the locks on the data store. The messaging engine can now access the data store and can therefore start successfully.
- On the application server, tune the messaging engine to wait for an appropriate amount of time for the data store to become available. By default, the messaging engine will attempt to connect to the data store every 2 seconds for 15 minutes. Complete the rest of this step to adjust these timings.
- Click Service integration -> Buses -> bus_name -> [Topology] Messaging engines -> engine_name -> [Additional Properties] Custom properties to navigate to the custom properties panel for the messaging engine.
- Click New.
- Type sib.msgstore.jdbcInitialDatasourceWaitTimeout in the Name field and an appropriate value in the Value field. This property is the time, in milliseconds, to wait for the data store to become available. The default is 900000 (15 minutes). This time includes the time required to establish a connection to the database and to obtain the required table locks.
Ensure that the value of this property is greater than the total time taken for the database server to detect the loss of a network connection, as configured in step 1.
- Click OK.
- Click New.
- Type sib.msgstore.jdbcStaleConnectionRetryDelay in the Name field and an appropriate value in the Value field. This property is the time, in milliseconds, to wait between attempts to connect to the data store. The default is 2000 (2 seconds). For example, if we set the sib.msgstore.jdbcInitialDatasourceWaitTimeout property to 600000, and the sib.msgstore.jdbcStaleConnectionRetryDelay property to 3000, the messaging engine will attempt to connect every 3 seconds until 10 minutes has passed.
- Click OK.
- Save changes to the master configuration.
- Restart the application server.
- If we have a cluster, repeat the previous steps to add these properties for every messaging engine in the cluster.
By configuring these parameters and custom properties, you minimize the amount of time taken for the database server to detect the loss of a network connection, and ensure that the messaging engine waits for a reasonable amount of time for the database connection to recover before attempting to start.
What to do next
We might want to configure the messaging engine and server to restart in the event of a database connection failure. This behavior reduces the risk of the messaging engine being in an inconsistent state when the database connection is restored.
Related:
Data store life cycle Manage high availability when messaging engines fail to start Microsoft support article: TCP/IP and NBT configuration parameters for Windows XP Configure messaging engine and server behavior when a data store connection is lost Service integration custom properties