Tuning the detection of database connection loss

If a messaging engine is configured to use a data store and cannot connect to its data store, for example because the database that contains the data store is not running, the messaging engine does not start. We can tune the system to increase the chance of a successful start of the messaging engine.

In a single-server environment, when you start the application server the messaging engine attempts to start. If the database is unavailable for more than 15 minutes, the messaging engine might enter the stopped state and need to be started manually. In a high availability environment, a messaging engine starts either as part of the server or cluster startup, or as part of the failover process. During messaging engine startup, the messaging engine attempts to connect to the data store, for up to 15 minutes by default. If one of the following statements remains true during that time, the messaging engine cannot start on the server, and the server is disabled for high availability:

This disabled state can propagate to all members of the cluster. You must manually re-enable the servers to maintain the high availability environment.

We can increase the chance of the messaging engine starting successfully by configuring various parameters, such as the 15 minute default timeout, on the database server or appserver.

 

  1. On the database server, configure the operating system to minimize the amount of time taken to detect the loss of a network connection to an appserver. Refer to the documentation for the operating system for details. For example, the following table lists the relevant parameters for Windows and AIX operating systems:

    Table 1. TCP/IP parameters
    Parameter name on Windows operating systems Parameter name on AIX operating systems Description
    KeepAliveTime tcp_keepidle The amount of time (in milliseconds on Windows operating systems and in 0.5 seconds on AIX operating systems) to wait before sending a keepalive request for an inactive connection.
    KeepAliveInterval tcp_keepintvl The amount of time (in milliseconds on Windows operating systems and in 0.5 seconds on AIX operating systems) to wait for a response.
    TCPMaxDataRetransmissions tcp_keepcnt The number of requests to send before ending the connection.

    We can calculate the total amount of time taken for the database server to detect the failure of the connection to the application server, by using the following formula:

    time to detect connection failure = keep alive time + (keep alive interval x number of requests) For example, for a Windows system with the parameters set according to the following table, the total amount of time taken for the database server to detect the failure of the connection to the appserver is 350 seconds.

    Table 2. Example parameter values
    Parameter Value
    KeepAlive 300000 milliseconds
    KeepAliveInterval 10000 milliseconds
    TCPMaxDataRetransmissions 5

    Your database product might also have relevant parameters that we can configure, for example, the IDLE THREAD TIMEOUT parameter in DB2 for z/OS.

    When the database server detects the loss of the connection to the appserver, the database releases the locks on the data store. The messaging engine can now access the data store and can therefore start successfully.

  2. On the appserver, tune the messaging engine to wait for an appropriate amount of time for the data store to become available. By default, the messaging engine will attempt to connect to the data store every 2 seconds for 15 minutes. Complete the rest of this step if you want to adjust these timings.

    1. Click Service integration > Buses > bus_name > [Topology] Messaging engines > engine_name > [Additional Properties] Custom properties to navigate to the custom properties panel for the messaging engine.

    2. Click New.

    3. Type sib.msgstore.jdbcInitialDatasourceWaitTimeout in the Name field and an appropriate value in the Value field. Is the time, in milliseconds, to wait for the data store to become available. The default value is 900000 (15 minutes). This time includes the time required to establish a connection to the database and to obtain the required table locks.

      Ensure that the value of this property is greater than the total time taken for the database server to detect the loss of a network connection, as configured in step 1.

    4. Click OK.

    5. Click New.

    6. Type sib.msgstore.jdbcStaleConnectionRetryDelay in the Name field and an appropriate value in the Value field. Is the time, in milliseconds, to wait between attempts to connect to the data store. The default value is 2000 (2 seconds). For example, if you set the sib.msgstore.jdbcInitialDatasourceWaitTimeout property to 600000, and the sib.msgstore.jdbcStaleConnectionRetryDelay property to 3000, the messaging engine will attempt to connect every 3 seconds until 10 minutes has passed.

    7. Click OK.

    8. Save the changes to the master configuration.

    9. Restart the appserver.

    10. If we have a cluster, repeat the previous steps to add these properties for every messaging engine in the cluster.

 

Results

By configuring these parameters and custom properties, you minimize the amount of time taken for the database server to detect the loss of a network connection, and verify the messaging engine waits for a reasonable amount of time for the database connection to recover before attempting to start.

 

What to do next

[Fix Pack 1 or later]

We might want to configure the messaging engine and server to restart in the event of a database connection failure. This behavior reduces the risk of the messaging engine being in an inconsistent state when the database connection is restored.    



Last updated Nov 10, 2010 8:23:07 PM CST