Hung threads in Java Platform, Enterprise Edition applications

WebSphere Application Server (WAS) monitors thread activity and performs diagnostic actions when a thread appears to be hung. When WAS detects that a thread has been active longer than the time defined by the thread monitor threshold, the application server takes the following actions:

Logs a warning in the WAS log that indicates the name of the thread that may be hung and how long it has already been active.
WSVR0605W: Thread threadname has been active for hangtime and may be hung. There are totalthreads threads in total in the server that may be hung.
where: threadname is the name that appears in a JVM thread dump, hangtime is approximately how long the thread has been active, and totalthreads is the total number of threads in the server that may be hung.
Issues a JMX notification.
This notification enables third-party tools to catch the event and take appropriate action, such as triggering a JVM thread dump of the server, or issuing an electronic page or email.

TYPE_THREAD_MONITOR_THREAD_HUNG
This event is triggered by the detection of a (potentially) hung thread.
TYPE_THREAD_MONITOR_THREAD_CLEAR
This event is triggered if a thread that was previously reported as hung completes its work. Consult the section on false alarms for more information.

Triggers changes in the performance monitoring infrastructure (PMI) data counters. These PMI data counters are used by various tools, such as the Tivoli Performance Viewer, to provide a performance analysis.
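
The JMX notifications described above can be consumed by a standard javax.management listener. The following is a minimal sketch, not WebSphere code: the class name HungThreadListener is hypothetical, and the two notification type strings are passed in by the caller because their literal values are not given in this topic. It keeps a running count of threads that are currently reported as potentially hung.

```java
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.Notification;
import javax.management.NotificationListener;

// Hypothetical listener: counts threads currently reported as possibly hung.
final class HungThreadListener implements NotificationListener {
    // The type strings are supplied by the caller (assumption: they are the
    // values behind TYPE_THREAD_MONITOR_THREAD_HUNG and
    // TYPE_THREAD_MONITOR_THREAD_CLEAR in the product's API).
    private final String hungType;
    private final String clearType;
    private final AtomicInteger suspectedHung = new AtomicInteger();

    HungThreadListener(String hungType, String clearType) {
        this.hungType = hungType;
        this.clearType = clearType;
    }

    @Override
    public void handleNotification(Notification notification, Object handback) {
        String type = notification.getType();
        if (hungType.equals(type)) {
            // A thread exceeded the threshold: this is where a tool could
            // request a thread dump or send an alert.
            suspectedHung.incrementAndGet();
        } else if (clearType.equals(type)) {
            // A previously reported thread finished its work (false alarm).
            suspectedHung.updateAndGet(count -> Math.max(0, count - 1));
        }
    }

    int suspectedHungCount() {
        return suspectedHung.get();
    }
}
```

Registering the listener with the server's MBean infrastructure is left out because it depends on how the monitoring tool connects to the server.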

False alarms

If a thread that was reported as hung eventually completes its work, a second set of messages, notifications, and PMI events is produced to identify the false alarm. The following message is written to the log:
WSVR0606W: Thread threadname was previously reported to be hung but has completed. It was active for approximately hangtime. There are totalthreads threads in total in the server that still may be hung.
where: threadname is the name that appears in a JVM thread dump, hangtime is approximately how long the thread was active, and totalthreads is the total number of threads in the server that still may be hung.
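
Pairing hang and clear messages is a convenient way to separate false alarms from threads that are still suspect. The following sketch scans log lines for the two message templates quoted above. The class name HangLogTracker is hypothetical, and the patterns assume that log lines follow the templates literally; real log lines may decorate the thread name differently, in which case the patterns need adjusting.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pairs WSVR0605W (hung) with WSVR0606W (completed).
final class HangLogTracker {
    // Patterns mirror the message templates in this topic (assumption).
    private static final Pattern HUNG = Pattern.compile(
        "WSVR0605W: Thread (.+?) has been active for (.+?) and may be hung\\.");
    private static final Pattern CLEARED = Pattern.compile(
        "WSVR0606W: Thread (.+?) was previously reported to be hung but has completed\\.");

    private final Set<String> stillHung = new LinkedHashSet<>();
    private int falseAlarms;

    // Feed one log line; lines that match neither message are ignored.
    void accept(String line) {
        Matcher hung = HUNG.matcher(line);
        if (hung.find()) {
            stillHung.add(hung.group(1));
            return;
        }
        Matcher cleared = CLEARED.matcher(line);
        if (cleared.find() && stillHung.remove(cleared.group(1))) {
            falseAlarms++; // a complete hang/clear pair
        }
    }

    Set<String> stillHung() {
        return stillHung;
    }

    int falseAlarms() {
        return falseAlarms;
    }
}
```

Threads that remain in stillHung() after the whole log has been read were reported as hung and never cleared, so they are the ones worth examining in a thread dump.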

Automatic adjustment of the hang time threshold

If the thread monitor determines that too many false alarms are issued (determined by the number of pairs of hang and clear messages), it can automatically adjust the threshold. When this adjustment occurs, the following message is written to the log:
WSVR0607W: Too many thread hangs have been falsely reported. The hang threshold is now being set to thresholdtime.
where: thresholdtime is the time (in seconds) in which a thread can be active before it is considered hung.
We can prevent WAS from automatically adjusting the hang time threshold. See Configure the hang detection policy.
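
The adjustment logic can be pictured with a small model. This is a sketch under stated assumptions, not the server's implementation: the class HangThresholdPolicy, the false alarm limit, and the growth factor are all illustrative parameters chosen by the caller, and the convention that a limit of zero disables adjustment belongs to this sketch only.

```java
// Hypothetical model of a self-adjusting hang threshold.
final class HangThresholdPolicy {
    private final int falseAlarmLimit;  // assumption: pairs tolerated before adjusting
    private final double growthFactor;  // assumption: multiplier applied on adjustment
    private long thresholdSeconds;
    private int falseAlarms;

    HangThresholdPolicy(long initialThresholdSeconds, int falseAlarmLimit, double growthFactor) {
        this.thresholdSeconds = initialThresholdSeconds;
        this.falseAlarmLimit = falseAlarmLimit;
        this.growthFactor = growthFactor;
    }

    // A thread is suspect once it has been active longer than the threshold.
    boolean isSuspectedHung(long activeSeconds) {
        return activeSeconds > thresholdSeconds;
    }

    // Record one hang/clear pair. Returns true if the threshold was raised,
    // which corresponds to the point where WSVR0607W would be logged.
    boolean recordFalseAlarm() {
        falseAlarms++;
        if (falseAlarmLimit > 0 && falseAlarms >= falseAlarmLimit) {
            thresholdSeconds = Math.round(thresholdSeconds * growthFactor);
            falseAlarms = 0;
            return true;
        }
        return false; // limit not reached, or adjustment disabled (limit 0)
    }

    long thresholdSeconds() {
        return thresholdSeconds;
    }
}
```

For example, with an initial threshold of 600 seconds, a limit of 2, and a factor of 1.5, the second false alarm raises the threshold to 900 seconds, after which a thread that has been active for 700 seconds is no longer reported.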

System alarms

An application server monitors the activity of threads on which system alarms execute. When a system alarm thread has been active longer than the time defined by the alarm thread monitor threshold, the application server logs the following warning in the system log. This message indicates the name of the thread that is not responding, the length of time that the thread has already been active, and the exception stack of the thread, which identifies the system component.

UTLS0008W: The alarm thread threadname has been active for n milliseconds and may be hung. totalthreads threadstack

In this message, threadname is the name that appears in a JVM thread dump, n is approximately how long the thread was active, totalthreads is an overall assessment of the system threads, and threadstack is the exception stack of the thread.

If the alarm work eventually completes, the following message is written to the system log. This message identifies the thread that produced the false alarm.
UTLS0009W: Alarm Thread threadname was previously reported to be hung but has completed. It was active for approximately n milliseconds.
In this message, threadname is the name that appears in a JVM thread dump, and n is approximately how long the thread was active.

Typically, system alarms do not process heavy loads because such activity might slow the processing of later system alarms, which in turn might impact server behavior. The UTLS0008W message is intended to help IBM Support personnel investigate problems potentially caused by system alarm behavior.

All system alarms share a common alarm thread pool. The properties that govern the monitoring of this thread pool can be tuned using the administrative console. We can reduce the frequency at which WebSphere generates alarm hung thread messages by adjusting the alarm thread monitor check interval or threshold. See the topic Configure the hang detection policy for a description of how to change these settings.

Configure the hang detection policy
Tivoli Performance Viewer
Example: Adjust hang detection policy