Detecting hung threads in J2EE applications

A common error in J2EE applications is a hung thread. A hung thread can result from a simple software defect (such as an infinite loop) or a more complex cause (for example, a resource deadlock). A hung thread can consume system resources, such as CPU time, when it runs an unbounded code path such as an infinite loop. Alternatively, a system can become unresponsive even though all resources are idle, as in a deadlock scenario. Unless an end user or a monitoring tool reports the problem, the system may remain in this degraded state indefinitely.

The hang detection option for WAS is turned on by default. You can configure a hang detection policy to suit your applications and environment so that potential hangs are reported, providing earlier detection of failing servers. When a hung thread is detected, WAS notifies you so that you can troubleshoot the problem.

Using the hang detection policy, you can specify a time that is too long for a unit of work to complete. The thread monitor checks all managed threads in the system (for example, Web container threads and object request broker (ORB) threads). Unmanaged threads, which are threads created by applications, are not monitored.
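
The check described above can be pictured with a minimal sketch. This is not the WAS thread monitor itself; the class name HangMonitorSketch, its methods, and the idea of passing the current time in as a parameter are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: a hypothetical monitor, not the WAS implementation. */
public class HangMonitorSketch {
    // Time (ms) a unit of work may run before it is considered too long.
    private final long thresholdMillis;
    // Managed thread name -> time (ms) at which its current unit of work began.
    private final Map<String, Long> activeSince = new ConcurrentHashMap<>();

    public HangMonitorSketch(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Called when a managed thread starts a unit of work. */
    public void workStarted(String threadName, long nowMillis) {
        activeSince.put(threadName, nowMillis);
    }

    /** Called when the unit of work completes. */
    public void workCompleted(String threadName) {
        activeSince.remove(threadName);
    }

    /** One monitor pass: names of threads active longer than the threshold. */
    public List<String> findPossiblyHung(long nowMillis) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Long> e : activeSince.entrySet()) {
            if (nowMillis - e.getValue() > thresholdMillis) {
                hung.add(e.getKey());
            }
        }
        return hung;
    }
}
```

Only threads that register their work with the monitor are ever examined, which mirrors why unmanaged, application-created threads go unreported.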

When WAS detects that a thread has been active longer than the time defined by the thread monitor threshold, the application server takes the following actions:

  • Logs a warning in the WAS log that indicates the name of the thread that is hung and how long it has already been active. The following message is written to the log:

    WSVR0605W: Thread threadname has been active for 
    hangtime and may be hung.  There are totalthreads 
    threads in total in the server that may be hung.
    
    where threadname is the name that appears in a JVM thread dump, hangtime is the approximate time that the thread has been active, and totalthreads is the number of threads in the server that may be hung.

  • Issues a Java Management Extensions (JMX) notification. This notification enables third-party tools to catch the event and take appropriate action, such as triggering a JVM thread dump of the server, or issuing an electronic page or e-mail. The following JMX notification events are defined in the com.ibm.websphere.management.NotificationConstants class:

    • TYPE_THREAD_MONITOR_THREAD_HUNG This event is triggered by the detection of a (potentially) hung thread.

    • TYPE_THREAD_MONITOR_THREAD_CLEAR This event is triggered if a thread that was previously reported as hung completes its work. See False Alarms.

  • Triggers changes in the performance monitoring infrastructure (PMI) data counters. These PMI data counters are used by various tools, such as the Tivoli Performance Viewer, to provide a performance analysis.
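
A third-party tool could react to the JMX notifications listed above with a listener along the following lines. This is a sketch that uses only the standard javax.management API. The two type strings are placeholders invented for this example; a real listener would compare against NotificationConstants.TYPE_THREAD_MONITOR_THREAD_HUNG and NotificationConstants.TYPE_THREAD_MONITOR_THREAD_CLEAR. Registering the listener with the server is not shown.

```java
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Notification;
import javax.management.NotificationListener;

/** Illustrative listener that tracks how many threads are currently suspected hung. */
public class HangNotificationListener implements NotificationListener {
    // ASSUMPTION: placeholder values for this sketch, not the real WAS constants.
    public static final String HUNG_TYPE = "example.thread.hung";
    public static final String CLEAR_TYPE = "example.thread.clear";

    private final AtomicInteger suspectedHangs = new AtomicInteger();

    @Override
    public void handleNotification(Notification notification, Object handback) {
        String type = notification.getType();
        if (HUNG_TYPE.equals(type)) {
            // A (potentially) hung thread was detected; a real tool might
            // request a thread dump or page an operator here.
            suspectedHangs.incrementAndGet();
        } else if (CLEAR_TYPE.equals(type)) {
            // A previously reported thread completed its work (false alarm).
            suspectedHangs.decrementAndGet();
        }
    }

    /** Number of hang reports not yet matched by a clear notification. */
    public int suspectedHangs() {
        return suspectedHangs.get();
    }
}
```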

 

False Alarms

If the work actually completes, a second set of messages, notifications and PMI events is produced to identify the false alarm. The following message is written to the log:

WSVR0606W: Thread threadname was previously reported to be 
hung but has completed. It was active for approximately hangtime. 
There are totalthreads threads in total in the server that still 
may be hung.
where threadname is the name that appears in a JVM thread dump, hangtime is the approximate time that the thread was active, and totalthreads is the number of threads in the server that may still be hung.
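
When reviewing a log, it can help to pair each WSVR0605W report with a later WSVR0606W for the same thread. The sketch below does this with regular expressions built from the message templates shown above. It assumes the thread name sits between "Thread" and the fixed wording that follows; real log lines may carry extra fields, so treat the patterns as a starting point. The class name HangLogScanner is hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative scanner that pairs hang reports with their clear messages. */
public class HangLogScanner {
    // ASSUMPTION: patterns are derived from the message templates only.
    private static final Pattern HUNG =
        Pattern.compile("WSVR0605W: Thread (.+?) has been active for");
    private static final Pattern CLEARED =
        Pattern.compile("WSVR0606W: Thread (.+?) was previously reported to be");

    // Threads reported as hung and not yet cleared, in order of first report.
    private final Set<String> stillHung = new LinkedHashSet<>();
    private int falseAlarms;

    /** Feed one log line; lines without either message ID are ignored. */
    public void accept(String logLine) {
        Matcher m = HUNG.matcher(logLine);
        if (m.find()) {
            stillHung.add(m.group(1));
            return;
        }
        m = CLEARED.matcher(logLine);
        if (m.find() && stillHung.remove(m.group(1))) {
            falseAlarms++; // a hang/clear pair counts as one false alarm
        }
    }

    public Set<String> stillHung() {
        return Collections.unmodifiableSet(stillHung);
    }

    public int falseAlarms() {
        return falseAlarms;
    }
}
```

Threads left in stillHung() after the whole log has been read are the ones worth investigating first.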

 

Automatic adjustment of the hang time threshold

If the thread monitor determines that too many false alarms are issued (determined by the number of pairs of hang and clear messages), it can automatically adjust the threshold. When this adjustment occurs, the following message is written to the log:

WSVR0607W: Too many thread hangs have been falsely reported.  The hang 
threshold is now being set to thresholdtime.
where thresholdtime is the time (in seconds) for which a thread can be active before it is considered hung.

You can prevent WebSphere Application Server from automatically adjusting the hang time threshold. See Configuring the hang detection policy.
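
The adjustment behavior can be sketched as a counter of hang/clear pairs that raises the threshold once a limit is reached. Everything specific here is an assumption for illustration: the class name ThresholdAdjusterSketch, the falseAlarmLimit and growthFactor parameters, and the convention that a limit of zero disables adjustment. The values and mechanism that WAS actually uses are set through the hang detection policy.

```java
/** Illustrative only: a hypothetical model of automatic threshold adjustment. */
public class ThresholdAdjusterSketch {
    private final int falseAlarmLimit;  // hypothetical; 0 disables adjustment in this sketch
    private final double growthFactor;  // hypothetical multiplier applied on each adjustment
    private long thresholdSeconds;
    private int falseAlarms;

    public ThresholdAdjusterSketch(long initialThresholdSeconds,
                                   int falseAlarmLimit,
                                   double growthFactor) {
        this.thresholdSeconds = initialThresholdSeconds;
        this.falseAlarmLimit = falseAlarmLimit;
        this.growthFactor = growthFactor;
    }

    /**
     * Records one hang/clear pair.
     * @return true if the threshold was raised as a result
     */
    public boolean recordFalseAlarm() {
        if (falseAlarmLimit <= 0) {
            return false; // automatic adjustment is turned off
        }
        falseAlarms++;
        if (falseAlarms < falseAlarmLimit) {
            return false;
        }
        thresholdSeconds = Math.round(thresholdSeconds * growthFactor);
        falseAlarms = 0; // start counting again against the new threshold
        return true;
    }

    public long thresholdSeconds() {
        return thresholdSeconds;
    }
}
```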

 

See also


Adjusting the hang detection policy of a running server
Configuring the hang detection policy