Troubleshooting
- Overview
- WAS Logs
- Server log files
- JVM log interpretation
- Logging modes
- High Performance Extensible Logging
- Cross Component Trace
- Sensitive log and trace guard
- Javacores and Heapdumps
- HTTP Plug-in Log
- Tools for collecting and analyzing diagnostic data
- Hang detection policy
- Memory leak detection policy
- MustGather for troubleshooting
- IBM Support Assistant
- Troubleshooting scenarios
- Hung threads
- High CPU
- Out of Memory exceptions in WAS
- Web material
Overview
Two basic actions to solve a problem:
- Data collection
Reproduce the problem and collect diagnostic data. Data collection includes...
- Java log files
- trace logs
- thread dumps
- heap dumps
- system dumps
- Analysis and isolation
Analyze both the results and the diagnostic data until the root cause is found and isolated.
Questions to ask...
- What are the symptoms of the problem? For example, application server start failures.
- Where does the problem occur? For example, the web server plug-in.
- When does the problem occur? For example, during peak system load.
- Has the problem begun after a configuration change? For example, a new application deployment.
- Can the problem be reproduced? If so, enable tracing and collect diagnostic data.
WAS Logs
- JVM logs
Created by redirecting the System.out and System.err streams of the JVM to independent log files:
- One set of the JVM logs for each application server and all of its applications is located by default in...
<PROFILE_HOME>/<profile_name>/logs/<server_name>
- SystemOut.log and SystemErr.log
- Process logs
Contain two output streams (stdout and stderr) that are accessible to native code running in the process:
- One set for each application server
- native_stderr.log and native_stdout.log
- IBM Service logs
Contain the WAS messages that are written to the System.out stream. The log also contains some special messages that hold extended service information that is normally not of interest but can be important when analyzing problems.
- One per profile (node)
- activity.log
Server log files
WAS log files are located in the following folder: <was_root>/profiles/<profile_name>/logs
Each application server instance has its own log folder containing its own set of log and trace files. This folder is found using the following location syntax:
<was_root>/profiles/<profile_name>/logs/<appserver>
Under each application server's folder are several logs. Here are the available logs:
- SystemOut.log and SystemErr.log: Standard JVM output and error logs.
- startServer.log and stopServer.log: Information logged by the server as it starts and shuts down.
- trace.log: Output from a diagnostic trace, if tracing is enabled.
- native_stdout.log and native_stderr.log: Native process output and error logs, used to record out-of-memory exceptions and verbose garbage collection data if verbose garbage collection is enabled.
Configuring JVM logs
The JVM logs can be configured from the dmgr console. For both log files, SystemOut.log and SystemErr.log, we can specify the following attributes:
- The path to their location
- File formatting (basic or advanced)
- Log file rotation (by file size or time interval)
- Maximum number of historical files to store on the file system
To access the JVM logs configuration, from the dmgr console, navigate to
Troubleshooting | Logs and Trace | server_name | JVM Logs.
Enable verbose garbage collection
Verbose garbage collection is an option provided by the JVM run time. It is often recommended that verbose garbage collection be enabled permanently in production.
The overhead cost is quite small and the benefits of having it enabled when issues happen are considerable.
The verbose garbage collection feature must be enabled for each application server. From the dmgr console, navigate to...
Servers | All servers | server_name | Server Infrastructure | Java and Process Management | Process definition | JVM | Verbose Garbage Collection
Save the configuration, synchronize the nodes, and restart the modified application servers. The following is a sample of the verbose garbage collection output, which the IBM JVM writes to native_stderr.log:
<exclusive-start id="38" timestamp="2012-06-28T19:29:58.285" intervalms="718.340">
  <response-info timems="0.008" idlems="0.008" threads="0" lastid="0A01CE00" lastname="main" />
</exclusive-start>
<af-start id="39" totalBytesRequested="40" timestamp="2012-06-28T19:29:58.285" intervalms="718.303" />
<cycle-start id="40" type="scavenge" contextid="0" timestamp="2012-06-28T19:29:58.286" intervalms="718.304" />
<gc-start id="41" type="scavenge" contextid="40" timestamp="2012-06-28T19:29:58.286">
  <mem-info id="42" free="198926216" total="234881024" percent="84">
    <mem type="nursery" free="0" total="33554432" percent="0" />
    <mem type="tenure" free="198926216" total="201326592" percent="98">
      <mem type="soa" free="188860296" total="191260672" percent="98" />
      <mem type="loa" free="10065920" total="10065920" percent="100" />
    </mem>
    <remembered-set count="26568" />
  </mem-info>
</gc-start>
<allocation-stats totalBytes="22519864" >
  <allocated-bytes non-tlh="839696" tlh="21680168" />
  <largest-consumer threadName="main" threadId="0A01CE00" bytes="22421696" />
</allocation-stats>
<gc-op id="43" type="scavenge" timems="57.326" contextid="40" timestamp="2012-06-28T19:29:58.343">
  <scavenger-info tenureage="9" tiltratio="50" />
  <memory-copied type="nursery" objects="301744" bytes="13055172" bytesdiscarded="3240" />
  <finalization candidates="654" enqueued="74" />
  <references type="soft" candidates="3390" cleared="0" enqueued="0" dynamicThreshold="32" maxThreshold="32" />
  <references type="weak" candidates="14601" cleared="0" enqueued="0" />
</gc-op>
<heap-resize id="44" type="expand" space="nursery" amount="3014656" count="1" timems="0.050" reason="excessive time being spent scavenging" timestamp="2012-06-28T19:29:58.343" />
<gc-end id="45" type="scavenge" contextid="40" durationms="57.509" timestamp="2012-06-28T19:29:58.343">
  <mem-info id="46" free="220507040" total="236388352" percent="93">
    <mem type="nursery" free="21580824" total="35061760" percent="61" />
    <mem type="tenure" free="198926216" total="201326592" percent="98">
      <mem type="soa" free="188860296" total="191260672" percent="98" />
      <mem type="loa" free="10065920" total="10065920" percent="100" />
    </mem>
    <pending-finalizers system="73" default="1" reference="0" classloader="0" />
    <remembered-set count="24218" />
  </mem-info>
</gc-end>
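Because the verbose GC output is structured XML, it can be scanned with a small amount of code. The following is a minimal sketch, assuming the IBM J9 verbosegc format shown above, that reports heap occupancy after each collection; the class name and the use of native_stderr.log as input are illustrative.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcEndScanner {
    // Matches the <mem-info ...> element that follows a <gc-end> element.
    private static final Pattern MEM_INFO =
            Pattern.compile("<mem-info[^>]*free=\"(\\d+)\"[^>]*total=\"(\\d+)\"[^>]*percent=\"(\\d+)\"");

    public static void main(String[] args) throws IOException {
        // args[0] is the verbose GC file, for example native_stderr.log
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        boolean afterGcEnd = false;
        for (String line : lines) {
            if (line.contains("<gc-end")) {
                afterGcEnd = true;      // the next mem-info describes the heap after this collection
            } else if (afterGcEnd) {
                Matcher m = MEM_INFO.matcher(line);
                if (m.find()) {
                    long free = Long.parseLong(m.group(1));
                    long total = Long.parseLong(m.group(2));
                    System.out.printf("after GC: %,d of %,d bytes free (%s%%)%n", free, total, m.group(3));
                }
                afterGcEnd = false;
            }
        }
    }
}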
Diagnostic trace
Diagnostic trace provides detailed information about how the application server components run within a managed process. This trace allows administrators to examine processes in the application server and diagnose various issues. Tracing can be started while the server is running, using Runtime Diagnostic Trace, or when the server is started, using Configuration Diagnostic Trace. The trace output can be directed to either a file or a memory buffer.
Tracing has a significant impact on performance and must only be enabled temporarily for problem determination purposes.
To enable diagnostic traces, from the dmgr console, navigate to Troubleshooting | Application servers | server_name | Diagnostic trace service.
We can configure the following parameters for diagnostic tracing:
Trace Output:
- None
- Memory buffer
- File
Trace Output Format:
- Basic
- Advanced
- Log analyzer
Change log detail levels. These are used to set up trace strings.
JVM log interpretation
The JVM logs contain print data written by applications. They can be written in the following formats:
- Basic Format: The format used in earlier versions of WAS.
- Advanced Format: Extends the basic format by adding information about an event, when possible.
Both basic and advanced formats use many of the same fields and formatting techniques.
The following is a list of those formats and the various fields associated with those formats:
- TimeStamp: Formatted using the locale of the process where it is generated. It includes a fully qualified date (for example, YYMMDD), 24-hour time with millisecond precision, and a time zone.
- ThreadId: An eight-character hexadecimal value generated from the hash code of the thread that issued the message.
- ThreadName: The name of the Java thread that issued the message or trace event.
- ShortName: The abbreviated name of the logging component that issued the message or trace event. This is typically the class name for WAS internal components, but can be some other identifier for user applications.
- LongName: The full name of the logging component that issued the message or trace event. This is typically the fully qualified class name for WAS internal components, but can be some other identifier for user applications.
- EventType: A one-character field that indicates the type of message or trace event. Message types are in uppercase:
  - F: Fatal message
  - E: Error message
  - W: Warning message
  - A: Audit message
  - I: Informational message
  - C: Configuration message
  - D: Detail message
  - O: Message written directly to System.out by a user application or internal components
  - R: Message written directly to System.err by a user application or internal components
  - Z: Placeholder to indicate that the type was not recognized
- ClassName: The class that issued the message or trace event.
- MethodName: The method that issued the message or trace event.
- Organization: The organization that owns the application that issued the message or trace event.
- Product: The product that issued the message or trace event.
- Component: The component within the product that issued the message or trace event.
Basic format structure...
<timestamp><threadId><shortName><eventType>[className][methodName]<message>
[6/28/12 10:10:27:777 EDT] 00001135 impl I CWLRB5604I: [06/28/12 10:10:27:777 EDT] Freeing IVTStep2 batch data stream: inputStream
[6/28/12 10:10:27:778 EDT] 00001135 impl I CWLRB5604I: [06/28/12 10:10:27:778 EDT] Freeing IVTStep2 batch data stream: outputStream
[6/28/12 10:10:27:778 EDT] 00001135 impl I CWLRB5854I: [06/28/12 10:10:27:778 EDT] Job Step [XDCGIVT:00006,IVTStep2]: Metric = clock Value = 00:0 0:00:006
[6/28/12 10:10:27:779 EDT] 00001135 impl I CWLRB5854I: [06/28/12 10:10:27:779 EDT] Job Step [XDCGIVT:00006,IVTStep2]: Metric = retry Value = 0
[6/28/12 10:10:27:780 EDT] 00001135 impl I CWLRB5844I: [06/28/12 10:10:27:780 EDT] Job Step Batch Data Stream [XDCGIVT:00006,IVTStep2,outputStream]: Metric = skip Value = 0
Advanced format structure...
<timestamp><threadId><eventType><UOW><source=longName>[className][methodName]<Organization><Product><Component>[thread=threadName]<message>
[6/29/12 16:32:31:318 EDT] 00000001 I UOW=null source=com.ibm.ws.config.ModelMgr org=IBM prod=WebSphere component=Application Server thread=[main] WSVR0801I: Initializing all server configuration models
[6/29/12 16:32:34:391 EDT] 00000001 I UOW=null source=com.ibm.ws.xd.agent.websphere.runtime.component.XDAComponentImpl org=IBM prod=WebSphere component=Application Server thread=[main] CWXDA0001I: XDA service XDAComponentImpl initialized successfully.
[6/29/12 16:32:34:449 EDT] 00000001 I UOW=null source=com.ibm.ws.grid.endpointselector.GAPAgentComponent org=IBM prod=WebSphere component=Application Server thread=[main] CWLRS6000I: GAP (Grid Application Placement) Component has initialized successfully on process ManagedProcess.
[6/29/12 16:32:34:513 EDT] 00000001 I UOW=null source=com.ibm.ws.sib.utils.ras.SibMessage org=IBM prod=WebSphere component=Application Server thread=[main] [:] CWSIU0000I: Release: WAS85.SIB Level: gm1216.02
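To show where these fields come from, the following is a minimal, hypothetical application snippet (class and logger names are illustrative): the java.util.logging calls produce I and W records whose ShortName and LongName derive from the logger name, while the direct System.out.println produces an O record in SystemOut.log.

import java.util.logging.Logger;

public class LogDemo {
    // The logger name becomes the LongName; the basic format shows an abbreviated ShortName.
    private static final Logger LOG = Logger.getLogger(LogDemo.class.getName());

    public static void main(String[] args) {
        LOG.info("Application initialized");         // logged as an I (informational) record
        LOG.warning("Connection pool nearly full");  // logged as a W (warning) record
        System.out.println("Plain println output");  // logged as an O record (direct System.out)
    }
}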
Logging modes
Two modes of logging and tracing exist in WAS V8.5:
- Basic mode
The default mode. This is the existing log and trace framework from prior releases of WAS.
- HPEL mode
This is a new log and trace framework. HPEL mode must be explicitly enabled. After HPEL mode is enabled, the following files are no longer written:
- SystemOut.log
- SystemErr.log
- trace.log
- activity.log
Log and trace content is written to a log or trace data repository in a proprietary binary format. This content can also be written to a text log file...
- TextLog_<yy.mm.dd>_<hh.mm.ss>
<yy.mm.dd> is a date (year, month, date) of the first record in the file, and <hh.mm.ss> is the time (hour, minute, second).
For example...
TextLog_12.07.16_09.02.09.log
Disabling the writing of this same text log file results in the largest possible performance benefit of HPEL. A log viewing tool, LogViewer, is provided to allow for viewing, filtering, monitoring, and formatting the log and trace data in the repositories.
A disadvantage is that, unless the text log is enabled, we can no longer use a text editor to view logging information.
When enabled, the HPEL text log file stores content from...
- Java trace
- Java logs
- System.out
- System.err
We can disable the HPEL text log in cases where it is not needed.
High Performance Extensible Logging
HPEL is a log and trace facility provided as a part of WAS V8.5.
HPEL provides a convenient mechanism for storing and accessing log, trace, System.err, and System.out information produced by the application server or the applications. It is an alternative to the basic log and trace facility, which provided the JVM logs, diagnostic trace, and service log files commonly named SystemOut.log/SystemErr.log, trace.log, and activity.log.
View log content using the LogViewer tool
HPEL provides a log data repository, a trace data repository, and a text log file.
All the data written to the log and trace repositories is parsed and formatted for storage in the text log file. To enhance server performance, disable the text log file.
The log data repository is a storage facility for log records. Log data is typically intended to be reviewed by administrators. This includes any information applications or the server write to System.out, System.err, or java.util.logging at level Detail or higher (including Detail, Config, Info, Audit, Warning, Severe, Fatal, and some custom levels).
The trace data repository is a storage facility for trace records. Trace data is typically intended for use by application programmers or by the WAS support team. Trace records include any information that applications or the server write to java.util.logging at levels below Detail (including Fine, Finer, Finest, and some custom levels).
HPEL text log
The text log file is a plain text file for log and trace records. It is provided for convenience, primarily so that log content can be read without having to run the LogViewer command-line tool to convert the log data repository content to plain text.
The text log file does not contain any content not also stored in either the log data repository or trace data repository. We can disable the text log to enhance server performance. The text log can be configured to record trace content for debugging convenience.
Log and trace performance
HPEL was designed and tested to significantly outperform the existing basic log and trace facility. One result is that the application server can run with trace enabled while causing less impact to performance than tracing the same components using basic logging. Another result is that applications that frequently write to the logs might run faster with HPEL. A number of factors contribute to the overall performance of HPEL logging and tracing.
Log and trace repositories are not shared across processes
Synchronizing activities between processes causes a degradation in performance to all processes involved. With HPEL, each server process has its own log data repository, trace data repository, and text log file. Because these files are not shared across processes, the server runtime environment does not need to synchronize with other processes when writing to these destinations.
Log and trace data is buffered before being written to disk
Writing large blocks of data to a disk is more efficient than writing the same amount of data in small blocks. HPEL buffers log and trace data before writing it to disk. By default, log and trace data is stored in an 8 KB buffer before being written to disk. If the buffer is filled within 10 seconds, the buffer is written to disk. If the buffer is not filled within that time, it is automatically written to disk anyway so that the logs have the most current information.
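The following is a conceptual sketch of the size-and-time buffering idea described above, not HPEL's actual implementation; the 8 KB size and 10-second age are taken from the description, and for simplicity the age check happens on the next write rather than in a background thread as a real logging system would do.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TimedBufferedLog implements AutoCloseable {
    private static final int BUFFER_SIZE = 8 * 1024;   // 8 KB, as described above
    private static final long MAX_AGE_MS = 10000;      // 10 seconds

    private final BufferedOutputStream out;
    private long lastFlush = System.currentTimeMillis();
    private int buffered = 0;

    public TimedBufferedLog(String path) throws IOException {
        this.out = new BufferedOutputStream(new FileOutputStream(path, true), BUFFER_SIZE);
    }

    public synchronized void write(String record) throws IOException {
        byte[] bytes = (record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        buffered += bytes.length;
        long now = System.currentTimeMillis();
        if (buffered >= BUFFER_SIZE || now - lastFlush >= MAX_AGE_MS) {
            out.flush();                                // large, infrequent writes are cheaper
            buffered = 0;
            lastFlush = now;
        }
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}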
Administration of log and trace
Administrators can configure how much disk space to dedicate to logs or trace, or how long to retain log and trace records, and leave the management of log and trace content up to the server. As another example, all log, trace, System.out, and System.err content can be accessed using one easy-to-use command (LogViewer), avoiding any possible confusion over which file to access for certain content.
Enable and disable HPEL
To activate HPEL logging and tracing, in the dmgr console, click...
Troubleshooting | Logs and trace | <your_server> | Switch to HPEL Mode
To disable HPEL logging and tracing, in the dmgr console, click...
Troubleshooting | Logs and trace | <your_server> | Change log and trace mode | Switch to Basic Mode
A server restart is needed after enabling or disabling HPEL logging and tracing.
HPEL log and trace settings
HPEL log settings can only be accessed when the server is configured to use HPEL log and trace mode. To configure HPEL log settings, from the dmgr console, click...
Troubleshooting | Logs and trace | server_name | Configure HPEL logging
At this point, the following options will be available for configuration:
- Directory path: Directory to which log files are written.
- Enable log record buffering: The logging system avoids writing to disk each time a log record is created. It creates a buffer that can hold a number of log records and writes the buffered events when the buffer is full. The logging system also writes the buffered events after a few seconds have passed, even if the buffer is not full.
- Start new log file daily at (time): Enables the logging framework to close the log file and start a new file at the specified time of day.
- Begin cleanup of oldest records: Log cleanup settings used to automatically purge the oldest log records, or log records that no longer fit in the configured space, from the log repository.
- Log record age limit: Lifespan, in hours, that log records can remain in the log repository before they can be automatically deleted by the server.
- Maximum log size: Maximum total size, in megabytes, that the server allows the log repository to reach. When the log repository approaches this size limit, the server deletes the oldest records to make space for new log records.
- Out of space action:
  - Stop server: The server stops when it is unable to write to the log repository.
  - Purge old records: The server continues to run. The oldest log records are removed when the server is unable to write to the log repository.
  - Stop logging: The server continues to run, but it stops writing to the log when it is unable to write to the log repository.
To configure HPEL trace settings, from the dmgr console, click...
Troubleshooting | Logs and trace | server_name | Configure HPEL trace
HPEL output format:
- Trace to a directory
The same set of options that were described for configuring HPEL logging are applicable to HPEL trace settings when this option is selected.
- Trace to a memory buffer
Specify the memory buffer size in megabytes (MB) and the directory to use for dumping the memory buffer.
To configure the HPEL text log settings, from the dmgr console, click...
Troubleshooting | Logs and trace | server_name | Configure HPEL text log
Options...
- Enable Text Log
In addition to writing log and trace records in binary format, the logging system writes them in a text format as well:
- When enabled, the same set of options that were described for configuring HPEL logging applies to the text log.
- Text Output Format
Specifies the format to use in the text log file:
- Select Basic to specify a shorter, one-line-per-record format.
- Select Advanced to specify a longer format using full logger name and more details about each record.
- Include trace records
Specifies whether trace records are included in the text log file in addition to log records.
Reading from the log data and trace data repositories
With HPEL, the data is stored in a proprietary binary format instead of text as in basic logging (the only exception is the text log). In this way, the following benefits are realized:
- There is no more text parsing.
- More data is available because truncation is not necessary.
- Data is not formatted unless it is needed.
- No need to clear log files before server start, for example, to diagnose a problem.
- Trace speed is improved and more data can be made available, with about half the impact of basic tracing.
- Provides a common solution between z/OS and distributed platforms.
- Applications running with HPEL run faster than running the same application with basic logging.
To read the log and trace records in this new format, a new command-line tool called logViewer was introduced. It reads the data from the repositories, formats it, and displays it to the administrator.
Viewing data from HPEL repositories
After configuring HPEL, the data can be viewed using the logViewer command
...
[6/28/12 15:27:30:618 EDT] 0000003e CoordinatorIm I HMGR0228I: The Coordinator is not an Active Coordinator for core group DefaultCoreGroup. The active coordinator set is [saw211-RHEL3Cell01\saw211-RHEL3CellManager01\dmgr].
[6/28/12 15:27:30:620 EDT] 0000003e CoordinatorIm I HMGR0218I: A new core group view has been installed. The core group is DefaultCoreGroup. The view identifier is (185:0.saw211-RHEL3Cell01\saw211-RHEL3CellManager01\dmgr). The number of members in the new view is 7.
[6/28/12 15:27:30:621 EDT] 0000003e CoreGroupMemb I DCSV8050I: DCS Stack DefaultCoreGroup at Member saw211-RHEL3Cell01\saw211-RHEL3Node01\swatJVM01: New view installed, identifier (185:0.saw211-RHEL3Cell01\saw211-RHEL3CellManager01\dmgr), view size is 7 (AV=7, CD=7, CN=7, DF=8)
[6/28/12 15:27:30:692 EDT] 0000004e ViewReceiver I DCSV1033I: DCS Stack DefaultCoreGroup at Member saw211-RHEL3Cell01\saw211-RHEL3Node01\swatJVM01: Confirmed all new view members in view identifier (185:0.saw211-RHEL3Cell01\saw211-RHEL3CellManager01\dmgr). View channel type is View|Ptp.
[6/28/12 15:27:35:992 EDT] 00000058 NGUtil$Server I ASND0002I: Detected server swatJVM02 started on node saw211-RHEL4Node01
[6/28/12 15:27:38:066 EDT] 00000091 ServerStopped I ODCF0009I: Marked server /cell/saw211-RHEL3Cell01/node/saw211-RHEL4Node01/server/swatJVM02 STARTED due to transaction: remote - initial contribution from saw211-RHEL3Cell01\saw211-RHEL4Node01\swatJVM02
Operation Complete
Processed 346 records in 0.229 seconds (1,510.917 records per second).
Messages from the most recent server run, obtained by issuing the following command on a Linux test environment:
logViewer.sh -latestInstance
Several options can be used with the logViewer command. The following list demonstrates some possibilities:
- Show messages starting at a specific level or higher: logViewer -minlevel <message_level>
- Show messages from log and trace for a specific thread: logViewer -Thread <thread_id>
- Show messages from log and trace in advanced format: logViewer -format advanced
Advanced format view
...
[6/28/12 15:27:19:554 EDT] 00000062 I UOW= source=com.ibm.son.mesh.Peer org=IBM prod=WebSphere component=Application Server thread=[sonInThreadPool : 3]
ODCF8532I: Added neighbor ip=9.42.171.209 udp=11008 tcp=11010 ID=e2c0f4018a2ee0d59673e03a38768e08b18444e0
version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_STARTUP_TIME=1340911627497, MEMBER_VERSION=4, memberName=saw211-RHEL3Cell01\saw211-RHEL4Node01\swatJVM02, epoch=1340911630430, inOdc=1, ODC_PUBLISHER_ONLY=false}, neighbor set is now 6 nodes 0 ip=9.42.171.209 udp=11008 tcp=11010 ID=e2c0f4018a2ee0d59673e03a38768e08b18444e0 version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_STARTUP_TIME=1340911627497, MEMBER_VERSION=4, memberName=saw211-RHEL3Cell01\saw211-RHEL4Node01\swatJVM02, epoch=1340911630430, inOdc=1, ODC_PUBLISHER_ONLY=false} 1 ip=9.42.170.178 udp=11008 tcp=11010 ID=888ac4b6fef1afa4e7e79f786c1d70f4fcc9b1a1 version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_STARTUP_TIME=1340911443846, MEMBER_VERSION=4, memberName=saw211-RHEL3Cell01\saw211-RHEL5Node01\swatJVM03, epoch=1340911448237, inOdc=1, ODC_PUBLISHER_ONLY=false} 2 ip=9.42.171.242 udp=11003 tcp=11004 ID=8ac1cdb57102f65d8b662a66e7af796010f62bcf version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_VERSION=4, MEMBER_STARTUP_TIME=1340844138854, memberName=saw211-RHEL3Cell01\saw211-RHEL3Node01\nodeagent, inOdc=1, epoch=1340844140738, ODC_PUBLISHER_ONLY=false} 3 ip=9.42.171.209 udp=11003 tcp=11004 ID=6e2777bc38d54873c49bce229a574ac0d4e32305 version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_STARTUP_TIME=1340832422435, MEMBER_VERSION=4, memberName=saw211-RHEL3Cell01\saw211-RHEL4Node01\nodeagent, epoch=1340832424261, inOdc=1, ODC_PUBLISHER_ONLY=false} 4 ip=9.42.170.178 udp=11003 tcp=11004 ID=3fc77268a35c9a1bf21d154aef896b48cb902cee version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_STARTUP_TIME=1340832433243, MEMBER_VERSION=4, memberName=saw211-RHEL3Cell01\saw211-RHEL5Node01\nodeagent, epoch=1340832435110, inOdc=1, ODC_PUBLISHER_ONLY=false} 5 ip=9.42.171.242 udp=11005 tcp=11006 ID=fb3955437989332fbe4478b7aba8bfd84b191a52 version=0;
cellName=saw211-RHEL3Cell01;
bridgedCells=[];
properties={MEMBER_VERSION=4, MEMBER_STARTUP_TIME=1340827441504, memberName=saw211-RHEL3Cell01\saw211-RHEL3CellManager01\dmgr, inOdc=1, epoch=1340827464976, ODC_PUBLISHER_ONLY=false}.
...
[6/28/12 15:27:35:992 EDT] 00000058 I UOW=source=com.ibm.ws.xd.nodedetect.NGUtil$ServerStatusBBSubscriber org=IBM prod=WebSphere component=Application Server thread=[BBSon : 0] ASND0002I: Detected server swatJVM02 started on node saw211-RHEL4Node01
[6/28/12 15:27:38:066 EDT] 00000091 I UOW= source=com.ibm.ws.odc.nd.ServerStoppedListener org=IBM prod=WebSphere component=Application Server thread=[Thread-61]
ODCF0009I: Marked server /cell/saw211-RHEL3Cell01/node/saw211-RHEL4Node01/server/swatJVM02 STARTED due to
transaction: remote - initial contribution from saw211-RHEL3Cell01\saw211-RHEL4Node01\swatJVM02
Operation Complete
Processed 346 records in 0.227 seconds (1,524.229 records per second).
- Show messages from log and trace for a specific time range: logViewer -startDate <date/time/timezone> -stopDate <date/time/timezone>
Another way to monitor logs and traces is using the WebSphere console. Click
Troubleshooting | Logs and trace | <your_server> | View HPEL logs and trace
Cross Component Trace
Cross Component Trace (XCT) annotates the logs so that log entries related to a request serviced by more than one thread, process, or even server are identified as belonging to the same unit of work. XCT helps identify the root cause of problems across components, which provides the following benefits:
- Enables administrators and support teams to follow the flow of a request from end to end, as it traverses thread or process boundaries and travels between stack products and WAS.
- Helps to resolve questions about which component is responsible for a request that fails.
XCT is a function built into the WAS log and trace framework. When enabled, XCT annotates the logs so that log entries related to a request serviced by more than one thread, process, or even server are identified as belonging to the same unit of work. Applications built using distributed architectures, such as Service Oriented Architecture, will benefit from XCT. XCT helps facilitate problem determination across multiple services on different systems.
The following list identifies the different XCT modes:
- Fully disabled.
- With XCT request IDs added to existing log and trace records.
- With XCT request IDs added to existing log and trace records and XCT log records added to log files.
- With XCT request IDs added to existing log and trace records, XCT log records added to log files, and data snapshots captured.
The following list identifies the various XCT request IDs:
- XCT request IDs are identifiers added to log and trace records. These records are produced by the server when the server is configured to use HPEL.
- XCT adds the same request ID to every log or trace record that is part of the same request. The request ID is added regardless of which thread or JVM produces the log or trace entry.
- When XCT is used with the HPEL log and trace infrastructure, we can view request IDs with the logViewer tool when logs are output in advanced format.
Log record with an XCT request ID
[3/18/11 14:50:17:391 EDT] 00000018 W UOW= source=com.ibm.somelogger.QuickLogTest org= prod= component= thread=[WebContainer : 1] requestID=BJrcVPo+Yk4-AAAAAAA8zAA hello world
XCT log records are...
- Typically added to the logs to:
- Demarcate the beginning and ending of work for a particular request on a particular thread.
- Demarcate when work is about to be transferred to another thread or process, or to indicate when work returned from another thread or process.
- Demarcate when work moves from major component to major component, even if work continues on the same thread. For example, to show transfer of control from application server code to application code.
XCT log record in the log file...
[3/18/11 14:50:17:391 EDT] 00000031 XCT I BEGIN BJrcVPo+Yk4-AAAAAAA8zAA 00000000000-cccccccccc2 HTTPCF(OutboundRequest /index.html RemoteAddress(127.0.0.1) RequestContext(36001645))
- Composed of:
- XCT type (BEGIN / END)
- XCT parent correlator ID (for example, 00000000000-cccccccccc2)
- XCT current correlator ID (for example, BJrcVPo+Yk4-AAAAAAA8zAA)
- XCT annotations (for example, HTTPCF(OutboundRequest /index.html RemoteAddress(127.0.0.1) RequestContext(36001645)))
XCT tools:
- HPEL logViewer tool
Filter log and trace records by request ID.
- XCT Log Viewer
Take advantage of XCT log records or XCT request IDs, or both, when rendering log and trace content. Available as a tool add-on for the IBM Support Assistant.
Use Cross-Component Trace to troubleshoot applications
Use log and trace files to determine whether applications are running correctly.
Multiple threads within an application server might be used to handle requests. Some requests might be handled by more than one application server, such as when one application server makes a request to another application server for a web services request.
Use XCT to augment log and trace files with correlation information. This correlation information clarifies which threads and which application server processes participated in the handling of each request.
To enable XCT, from the dmgr console, click...
Troubleshooting | Logs and trace | server_name | Change log detail levels
Options...
- Disable logging and tracing of potentially sensitive data
The application server has a list of loggers which are known to potentially write sensitive information to the log and trace when enabled. For example, enabling certain HTTP related loggers at FINEST level can result in confidential user-specified information from HTTP requests being stored in the trace files. If you want the server to avoid enabling these loggers at levels which are known to be used for potentially sensitive information, check the Disable logging and tracing of potentially sensitive data option. When the server is started, or when the log detail level specification is modified at run time, the server will compare the list of loggers and levels specified in the log detail level specification to the list of loggers and levels in the sensitive logger list. At that point, the server will update the log detail level specification as needed.
- Change Log Detail Levels
Enter a log detail level that specifies the components, packages, or groups to trace.
If you select the Runtime tab, and expand Components and Groups, the list of components, packages, and groups are displayed. This display contains all the components that are registered on the running application server and in the static list.
- Enable log and trace correlation
Enables or disables log and trace correlation to be tracked for the application server.
Options include:
- Include request IDs in log and trace records
Include request IDs in log and trace files. This enables seeing which log and trace entries, in all threads and application server processes, are related to the same request.
- Include request IDs in log and trace records and correlation log records
Create correlation log records to show how requests branch between threads and processes. We can also see extra information about each request.
- Include request IDs in log and trace records, create correlation log records, and capture data snapshots as appropriate.
Capture data snapshots to store entire request and response bodies on the file system.
After XCT is enabled, we can use XCT request ID information to track requests.
Ensure you are using HPEL log and trace mode, as basic mode log and trace does not store request IDs.
Filter logs to look for important information, such as errors or warnings, using the HPEL LogViewer command-line tool. Output the logs using the advanced format so that we can see the request ID information in the logs. For example (on Linux):
logViewer.sh -minLevel WARNING -format advanced
When you find the log entries that are of interest to you, note the request ID associated with those entries.
Filter logs by request ID using the HPEL LogViewer command-line tool and using the request IDs you noted in the previous step:
logViewer.sh -includeExtensions -requestID=<requestID>
Sensitive log and trace guard
The sensitive log and trace guard is a feature that helps administrators prevent sensitive information from being exposed in log and trace files.
The sensitive log and trace guard uses an internal list of allowable levels for sensitive loggers. It specifies the lowest level at which listed loggers can generate log or trace data without containing potentially sensitive data. We can also add our own loggers to the list of loggers that the sensitive log and trace guard blocks.
To enable the sensitive log and trace guard feature, from the dmgr console, navigate to Troubleshooting | Logs and trace | server_name.
Click Change log detail levels, and select the Disable logging and tracing of potentially sensitive data option to enable sensitive log and trace guard.
Javacores and Heapdumps
We can use the Java runtime environment to create dump and core files to help with troubleshooting. We can use the dmgr console to trigger the creation of these dumps and core files.
The JVM is capable of producing Java dump and core files to aid in troubleshooting. We can use heap dump and system dump files to help diagnose memory-related problems, such as memory leaks. Use Java core files to help diagnose problems where the CPU is persistently 100% busy, threads are hanging, or threads are in a deadlock.
The process of generating dump and core files can have a noticeable performance impact on the system that can last for many seconds or minutes. If necessary, use your test and development environments to better understand the impact of generating dump and core files.
The following procedure describes how you use the deployment manager's administrative console to generate javacores and heapdumps:
- Select server(s) for which to collect a system, core, or heap dump...
Troubleshooting | Java dumps and cores | server | [System Dump | Java Core | Heap Dump]
The system dump, heap dump, or Java core is created and stored in the profile root directory of the server from which you requested the dump or core.
Generating a heap dump or a system dump is not supported for a non-IBM JVM.
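When running on an IBM JDK, the same dumps can also be triggered programmatically, which is sometimes convenient in test environments. A minimal sketch using the IBM-specific com.ibm.jvm.Dump class follows; it does not compile or run on non-IBM JVMs, and the class name DumpTrigger is illustrative.

public class DumpTrigger {
    public static void main(String[] args) {
        // Each call writes its file to the working or configured dump directory of the JVM.
        com.ibm.jvm.Dump.JavaDump();    // javacore: thread stacks, monitors, loaded classes
        com.ibm.jvm.Dump.HeapDump();    // heapdump: snapshot of live objects on the Java heap
        com.ibm.jvm.Dump.SystemDump();  // system (core) dump: full process image
    }
}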
HTTP Plug-in Log
The HTTP plug-in records messages to a log file stored under the WebSphere HTTP Plug-in directory on the Web server machine. Default location:
PLUGIN_ROOT/logs/web_server_name/http_plugin.log
The log level can be set to different values depending on the amount of detail you want written to the log. Setting the log level to Trace causes all steps in the HTTP request process to be logged.
Setting the plug-in log level to Trace produces a large amount of output and can impact web server performance. Input/output happens very frequently and can also cause the plug-in log file to grow quickly, even causing file truncation.
Tools for collecting and analyzing diagnostic data
WAS provides features that assist in detecting problems and capturing information for problem determination. Moreover, IBM has a comprehensive set of instructions on how to gather data for nearly all types of issues that can impact a WebSphere environment. IBM also provides the IBM Support Assistant, a tool of great value for problem determination analysis.
Hang detection policy
Hang detection policy provides early detection of failing servers. The hang detection option for WAS is turned on by default.
A hung thread can result from a simple software defect, such as an infinite loop, or a more complex cause, such as a resource deadlock. System resources, such as CPU time, might be consumed by this hung transaction when threads run unbounded code paths, such as when the code is running in an infinite loop. Alternately, a system can become unresponsive even though all resources are idle, as in a deadlock scenario. Unless a user or a monitoring tool reports the problem, the system might remain in this degraded state indefinitely.
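As an illustration of the resource deadlock case just described, the following hypothetical snippet acquires two locks in opposite orders on two threads. Once both threads block, their units of work never complete even though the CPU is idle, which is exactly the situation the hang detection policy is designed to report.

public class DeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        new Thread(new Runnable() {
            public void run() { lockInOrder(LOCK_A, LOCK_B); }
        }, "worker-1").start();
        new Thread(new Runnable() {
            public void run() { lockInOrder(LOCK_B, LOCK_A); }
        }, "worker-2").start();
    }

    private static void lockInOrder(Object first, Object second) {
        synchronized (first) {
            try {
                Thread.sleep(1000);          // give the other thread time to take its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            synchronized (second) {          // both threads block here forever: a resource deadlock
                System.out.println(Thread.currentThread().getName() + " acquired both locks");
            }
        }
    }
}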
Using the hang detection policy, we can specify a time that is too long for a unit of work to complete. The thread monitor checks all managed threads in the system, such as web container threads and object request broker (ORB) threads. Unmanaged threads, which are threads created by applications, are not monitored. When WebSphere detects that a thread has been active longer than the time defined by the thread monitor threshold, the application server takes the following actions:
- Logs a warning in the WAS log that indicates the name of the hung thread and how long it has been active. The following message is written to the log:
[6/29/12 14:39:22:921 EDT] 0000009d ThreadMonitor W WSVR0605W: Thread "WebContainer : 1" (00000099) has been active for 139,950 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
- Issues a JMX notification.
This notification enables third-party tools to catch the event and take appropriate action. Some common actions are triggering a JVM thread dump of the server or issuing an electronic page or email.
- Triggers changes in the PMI data counters.
In some situations, false alarms for hung threads can also happen. If the work actually completes, a second set of messages, notifications, and PMI events is produced to identify the false alarm. The following message is written to the log:
WSVR0606W: Thread "WebContainer : 1" (00000099) was previously reported to be hung but has completed. It was active for approximately 139,950 milliseconds. There are 0 threads in total in the server that still may be hung.
If the thread monitor determines that too many false alarms are issued (determined by the number of pairs of hang and clear messages), it can automatically adjust the threshold.
The thread hang detection option is enabled by default. We can take the following steps to adjust the hang detection policy values or to disable hang detection completely:
- From the dmgr console, click...
Servers | Application Servers | server_name | Server Infrastructure | Administration | Custom Properties | New
- Add the following properties:
- com.ibm.websphere.threadmonitor.interval: The frequency, in seconds, at which managed threads in the selected application server are interrogated.
- com.ibm.websphere.threadmonitor.threshold: The length of time, in seconds, that a thread can be active before it is considered hung.
- com.ibm.websphere.threadmonitor.false.alarm.threshold: The number of times that false alarms can occur before the threshold is automatically increased.
- com.ibm.websphere.threadmonitor.dump.java: Set to true to cause a javacore to be created when a hung thread is detected and a WSVR0605W message is printed.
- com.ibm.websphere.threadmonitor.dump.stack: Set to true to cause a stack trace to be printed when a hung thread is detected and a WSVR0605W message is printed.
- Click Apply.
- Save the changes.
Make sure a file synchronization is performed before restarting the servers.
- Restart the Application Server for the changes to take effect.
Memory leak detection policy
The leak detection policy for the WAS is turned off by default. You can configure a leak detection, prevention, and action policy to accommodate your applications and environment so that potential memory leaks are reported and acted upon. Leak detection, prevention, and proactive fixing provides protection and resiliency for servers that face persistent out of memory errors.
Memory leak types include...
- Class loader
Generally caused by the application code or JRE triggered code. Retaining a reference to a single object from a web application pins every class loaded by the web application. These references often remain after a web application reload. With each reload, more classes are pinned which leads to an out of memory error.
- JRE triggered
Memory leaks occur when JRE code uses the context class loader to load an application singleton. These singletons can be threads or other objects that are loaded by the JRE using the context class loader.
- Application
If the web application code triggers the initialization of a singleton or a static initializer, the following conditions apply:
- The context class loader becomes the web application class loader.
- A reference is created to the web application class loader. This reference is never garbage collected.
- This reference pins the class loader, and all the classes loaded by it, in memory.
- Application triggered leaks
Application code sometimes retains references to objects that are no longer required or used. These objects therefore continue to exist and take up memory space, and they cannot be reclaimed by the garbage collector.
Application leak types include the following (a simplified sketch of two of these patterns follows the list)...
- Custom ThreadLocal class
- Web application class instance as ThreadLocal value
- Web application class instance indirectly held through a ThreadLocal value
- ThreadLocal pseudo-leak
- ContextClassLoader and threads created by web applications
- ContextClassLoader and threads created by classes loaded by the common class loader
- Static class variables
- JDBC driver registration
- RMI targets
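The following hypothetical fragment sketches two of the patterns listed above, a custom ThreadLocal holding an application class instance and a static class variable that only grows. Both keep objects, and through them web application classes, reachable after the application is stopped or reloaded; all names are illustrative.

import java.util.ArrayList;
import java.util.List;

public class LeakyPatterns {
    // Static class variable leak: entries are never removed, so this list grows without bound
    // and keeps every added object reachable for the lifetime of the class loader.
    private static final List<Object> REQUEST_HISTORY = new ArrayList<Object>();

    // Custom ThreadLocal leak: pooled container threads outlive the web application, so the
    // value (an application object) remains referenced unless remove() is called.
    private static final ThreadLocal<StringBuilder> PER_THREAD_BUFFER = new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder();
        }
    };

    public void handleRequest(Object request) {
        REQUEST_HISTORY.add(request);             // never cleared: classic static variable leak
        PER_THREAD_BUFFER.get().append(request);  // never calls PER_THREAD_BUFFER.remove()
    }
}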
Approaches to memory leak detection typically involve examination of counters...
- JVM Tool Interface (JVMTI)
- performance monitoring infrastructure (PMI)
These are used to watch for slow growth in Java or native heap usage.
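A minimal sketch of this counter-watching idea, using the standard java.lang.management API rather than JVMTI or PMI, samples heap usage so that slow growth can be spotted over time; the one-minute interval and class name are arbitrary.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            // A value that keeps rising sample after sample suggests a leak.
            System.out.printf("heap used: %d MB of %d MB committed%n",
                    heap.getUsed() / (1024 * 1024), heap.getCommitted() / (1024 * 1024));
            Thread.sleep(60000);   // sample once per minute
        }
    }
}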
WAS v8.5 provides a top down pattern to handle memory leak situations. The application server watches for suspect patterns in application code at run time, and operates in the following manner:
- Detection
WAS checks for known causes of memory leaks and issues warnings when an application leak is detected.
- Prevention (on by default)
Applies only to JRE triggered leaks. These leaks are prevented by initializing singletons at server startup, when the application server class loader is the context class loader.
- Action
Take proactive action to fix memory leaks. These actions have reasonable defaults and are configured on a case-by-case basis.
WAS has some means of protection against memory leaks when stopping or redeploying applications. If leak detection, prevention, and action are enabled, WAS monitors application and module activity and performs diagnostic actions to detect and fix leaks when an application or an individual module stops.
This feature helps increase application uptime by allowing frequent application redeployments without cycling the server.
Therefore, we can configure WAS to detect, prevent, and take action, if possible, on classloader memory leaks using the memory leak detection policy. If a classloader memory leak is detected, WAS notifies you with informational messages in the log and by taking JVM heapdumps so that we can troubleshoot the problem. Optionally, you might also choose to have WAS mitigate, and if possible, fix the memory leak using reflection and other techniques.
Configure the memory leak detection policy
MustGather for troubleshooting
The IBM MustGather website for WAS is an essential guide for troubleshooting. For problems with IBM WAS, it provides a complete set of instructions for collecting troubleshooting data.
You will find instructions for collecting data by component, for all WAS releases between 5.1 and 8.5, including the following:
- Classloader
- Crash
- DB Connections/Connection Pooling
- High Availability (HA)
- Java SDK
- Performance, hang, or high CPU
- Out of Memory
The information available on the MustGather website will help you gather data from problematic environments and help familiarize you with the troubleshooting process.
This process greatly improves the accuracy of data analysis, as it collects a complete set of information from problematic servers and applications. Visit (and bookmark) the MustGather for WAS website at the following address: http://www-01.ibm.com/support/docview.wss?uid=swg21145599
IBM Support Assistant
IBM Support Assistant is a free troubleshooting application that helps you research, analyze, and resolve problems using various support features and tools. IBM Support Assistant enables you to find solutions yourself using the same troubleshooting techniques used by the IBM Support team. It also allows you to organize and transfer your troubleshooting efforts between members of your team or to IBM for further support. IBM Support Assistant helps you diagnose problems with applications deployed to WAS. We can track and organize troubleshooting activity, automate the collection of MustGather files, and use tools that analyze artifacts to understand the root cause of errors.
We can customize IBM Support Assistant by installing the problem determination tools needed to troubleshoot various problems. For example, when the JVM runs out of memory, application threads hang, the JVM crashes, or if performance declines, we can use analysis tools to diagnose these problems.
Among the many options available from the IBM Support Assistant, you will find the following tools:
- Java Troubleshooting
- IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer
Used for analyzing system dumps and Java heap dumps:
- Troubleshoot memory leaks
- Understand the architecture of your Java application through footprint analysis
- Improve performance by tuning memory footprint and optimizing Java collections and cache usage
- Customize analysis with additional plug-ins and reports
- IBM Monitoring and Diagnostic Tools for Java - Garbage Collection and Memory Visualizer (GCMV)
Used for analyzing and visualizing verbose GC logs:
- Monitor and fine-tune heap size and garbage collection performance
- Flag possible memory leaks
- Size the Java heap correctly
- Select the best garbage collection policy
- IBM Monitoring and Diagnostic Tools for Java - Dump Analyzer
Used for analyzing system dumps produced by IBM JVMs to diagnose typical problems:
- Out of Memory
- Deadlocks
- JVM crashes
- Java Native Interface (JNI) crashes
- IBM Monitoring and Diagnostic Tools for Java - Health Center
Used for monitoring the status of a running IBM JVM:
- Identify if native or heap memory is leaking
- Discover which methods are taking most time to run
- Pin down I/O bottlenecks
- Visualize and tune garbage collection
- View lock contentions
- Analyze unusual WebSphere Real Time events
- IBM Thread and Monitor Dump Analyzer for Java (TMDA)
Used for analyzing Java core files to help identify threading problems:
- Hangs
- Deadlocks
- Resource contention
- Bottlenecks
- WAS Troubleshooting Tools
- Database Connection Pool Analyzer for IBM WAS
Used for identifying JDBC problems:
- Connectivity problems
- JDBC connection leaks
- IBM Trace and Request Analyzer for WAS
Used for analyzing HTTP server plug-in traces and WAS traces:
- Identify delays between HTTP requests and responses
- Identify delays and hangs during application execution
- IBM Web Server Plug-in Analyzer for WAS
Used for analyzing WAS plug-in configurations to help find improper or ill-advised settings that can result in runtime problems:
- Detecting configurations that can cause outages or performance degradation
- Identifying request or response failures
- Tracking HTTP return codes and URI failures
- Tracking clusters and cluster members
- Graphically visualizing runtime environment topologies
- IBM Web Services Validation Tool for WSDL and SOAP
Used for validating SOAP messages and WSDL schemas to identify potential problems prior to deployment.
- WAS extensions for Dump Analyzer
Used for analyzing WAS JVM system dumps to diagnose WebSphere specific problems.
Troubleshooting scenarios
Now that we have seen the most common issues and how to collect information to diagnose them, this section presents simulated scenarios of common issues and helps you understand the steps needed to identify the root cause of a problem.
Hung threads
Hung threads can occur for many reasons, and WAS monitors long-running threads to warn about possible issues. After an issue is detected, either by the warning messages in the log files or by other monitoring tools, the first step is to take Java cores. A Java core is a snapshot of all running threads in one specific application server, along with the stack trace for each of them.
To create a Java core file from the command line...
kill -3 PID
where PID is the process ID of the Java process for the application server. On UNIX and Linux systems, we can run ps -ef | grep java to locate it.
To create a Java core file from the administration console...
Troubleshooting | Java dumps and cores | application_server | Java core
Core file location...
PROFILE_HOME/javacore.<YEAR><MONTH><DAY>.<TIME>.<PID>.<SEQUENTIAL>.txt
...for example...
/opt/WAS/AppServer/profiles/AppSrv01/javacore.20120628.171251.5424.0001.txt
The location is shown in the native_stderr.log file...
JVMDUMP039I Processing dump event "user", detail "" at 2012/06/28 17:12:51 - wait.
JVMDUMP032I JVM requested Java dump using '/opt/WAS/AppServer/profiles/AppSrv01/javacore.20120628.171251.5424.0001.txt' in response to an event
JVMDUMP010I Java dump written to /opt/WAS/AppServer/profiles/AppSrv01/javacore.20120628.171251.5424.0001.txt
JVMDUMP013I Processed dump event "user", detail "".
Create at least three Java cores, at 2-minute intervals.
Diagnose hung thread issues
- After a thread is detected hung or if you suspect that something is hung, the first step is to immediately take java cores.
The recommended minimum number of Java cores to take is three (with an interval of 2 minutes between each).
- Identify from the SystemOut.log file, what thread is reported as hung.
For example, in the following, thread WebContainer : 0 is hung...
[6/28/12 18:28:19:227 EDT] 00000070 ThreadMonitor W WSVR0605W: Thread "WebContainer : 0" (000000a1) has been active for 146,658 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
- Download the Java core files to a computer with IBM Support Assistant (ISA) installed.
- Install IBM Thread and Monitor Dump Analyzer for Java (TMDA) into ISA.
- From ISA, click...
Launch Activity | Analyze problem | Tools Catalog | IBM Thread and Monitor Dump Analyzer for Java (TMDA)
- In TMDA, open the Java core files...
File | Open Thread Dumps | Java core file | Open | Java core file | Thread Detail icon
- From the list of threads, select the thread identified as hung from step 2.
On the right side of the window, you will see the thread details.
- From the window, we can see the stack trace of the hung thread and identify the component causing it.
In this case, this is a sample application that issued a sleep call.
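For reference, a hung thread of this kind can be reproduced with something as simple as the following hypothetical servlet, which holds the web container thread far longer than the thread monitor threshold and therefore triggers the WSVR0605W warning shown earlier.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SleepServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            Thread.sleep(10 * 60 * 1000L);   // hold the WebContainer thread for 10 minutes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        resp.getWriter().println("done");
    }
}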
High CPU
This section describes how to troubleshoot high CPU usage issues. With this kind of issue, the CPU usage for the process stays high over a long period of time, indicating some abnormality in the infrastructure or in the application.
High CPU usage analysis is done by identifying which thread or threads are consuming most of the CPU time and then correlating those threads to the Java code they were executing. By doing this, we can pinpoint which parts of the code are consuming the CPU. High CPU usage can also be caused by excessive execution of the Java garbage collector, most likely because the collection algorithm or the JVM memory size needs tuning.
In this example, we use the Linux platform.
To collect the information for diagnosis:
- If not already enabled, enable verbose garbage collection logging.
- Download the linperf.sh script from the MustGather website under Collect Data Manually
- Upload the linperf.sh script to the server you are diagnosing and give it execution permission.
- Use the top command to identify the PID of the top CPU consumer java process.
- As root user, run the script against that PID using linperf.sh PID.
Run the script from a writeable directory. Verify that there is enough file system space to accommodate the three Java core files created during the execution of the script.
- The linperf.sh script creates a packaged file with the information it collected from the server.
linperf_RESULTS.tar.gz
- Download linperf_RESULTS.tar.gz to the computer with ISA. Also download the Java core files created by the script.
- Extract the linperf_RESULTS.tar.gz, and open the file topdashH.PID.out.
What you see is a sequence of top outputs, which include the thread information (top -H).
- Locate the top CPU consumer threads by looking at the PIDs in the sequence of top iterations that consistently show high CPU usage.
Take note of their PIDs (the first column). Alternatively, use the command line...
grep -A 2 "PID USER" topdashH.PID.out
...to filter and get the first two lines of all top iterations.
- Convert the thread PID numbers from decimal to hexadecimal using a simple calculator, such as the Windows or Linux calculator. Put the calculator in programmer view, paste the decimal value that you noted down, and click Hex to convert it (a short Java alternative appears after this procedure).
- With the top consumer thread PIDs converted to hexadecimal, investigate the Java core files to identify what those threads were doing at the time the Java core was created. Use IBM Thread and Monitor Dump Analyzer for Java (TMDA) in a similar way as for diagnosing hung threads.
- Open the Java core in TMDA following the same instructions.
- From the list, sort using the NativeID column and locate the threads you noted down earlier.
After you select the thread, the stack trace shows on the right side of the window, which gives you a hint of what is taking most of the CPU processing.
Execute linperf.sh while the process is consuming high CPU; otherwise, the Java core files taken do not reflect the problem occurrence.
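As an alternative to a desktop calculator, the decimal-to-hexadecimal conversion can be done with one line of Java; the thread ID below is the one used later in this example, and the class name is illustrative.

public class PidToHex {
    public static void main(String[] args) {
        int threadPid = 15811;                               // decimal thread ID from the top -H output
        System.out.println(Integer.toHexString(threadPid));  // prints 3dc3
    }
}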
In a simulated scenario, we identified the PID 9777 as the top CPU consumer and it is a Java process.
The linperf.sh script was executed against that PID and displayed the following output.
linperf.sh execution
[root@server tmp]# ./linperf.sh 9777
Fri Jun 29 14:40:59 EDT 2012 MustGather>> linperf.sh script starting...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Script version: 2011.05.03.
Fri Jun 29 14:40:59 EDT 2012 MustGather>> PROBLEMATIC_PID is: 9777
Fri Jun 29 14:40:59 EDT 2012 MustGather>> SCRIPT_SPAN = 240
Fri Jun 29 14:40:59 EDT 2012 MustGather>> JAVACORE_INTERVAL = 1.0
Fri Jun 29 14:40:59 EDT 2012 MustGather>> TOP_INTERVAL = 60
Fri Jun 29 14:40:59 EDT 2012 MustGather>> TOP_DASH_H_INTERVAL = 5
Fri Jun 29 14:40:59 EDT 2012 MustGather>> VMSTAT_INTERVAL = 5
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collecting user authority data...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collection of user authority data complete.
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Creating output files...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Output files created:
Fri Jun 29 14:40:59 EDT 2012 MustGather>> vmstat.out
Fri Jun 29 14:40:59 EDT 2012 MustGather>> ps.out
Fri Jun 29 14:40:59 EDT 2012 MustGather>> top.out
Fri Jun 29 14:40:59 EDT 2012 MustGather>> topdashH.9777.out
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Starting collection of top data...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collection of top data started.
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Starting collection of top dash Hdata...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collection of top dash H data started for PID 9777.
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Starting collection of vmstat data...
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collection of vmstat data started.
Fri Jun 29 14:40:59 EDT 2012 MustGather>> Collecting a ps snapshot...
Fri Jun 29 14:41:00 EDT 2012 MustGather>> Collected a ps snapshot.
Fri Jun 29 14:41:00 EDT 2012 MustGather>> Collecting a javacore...
Fri Jun 29 14:41:00 EDT 2012 MustGather>> Collected a javacore for PID 9777.
Fri Jun 29 14:41:00 EDT 2012 MustGather>> Continuing to collect data for 120 seconds...
...
Fri Jun 29 14:45:05 EDT 2012 MustGather>> Collecting other data. This may take a few moments...
Fri Jun 29 14:45:05 EDT 2012 MustGather>> Collected other data.
Fri Jun 29 14:45:05 EDT 2012 MustGather>> Preparing for packaging and cleanup...
Fri Jun 29 14:45:12 EDT 2012 MustGather>> Compressing output files into linperf_RESULTS.tar.gz vmstat.out ps.out top.out screen.out dmesg.out whoami.out df-hk.out topdashH.9777.out
Fri Jun 29 14:45:25 EDT 2012 MustGather>> Cleaning up...
Fri Jun 29 14:45:25 EDT 2012 MustGather>> Clean up complete.
Fri Jun 29 14:45:25 EDT 2012 MustGather>> linperf.sh script complete.
Fri Jun 29 14:45:25 EDT 2012 MustGather>> Output files are contained within ----> linperf_RESULTS.tar.gz. <----
Fri Jun 29 14:45:25 EDT 2012 MustGather>> The javacores that were created are NOT included in the linperf_RESULTS.tar.gz.
Fri Jun 29 14:45:25 EDT 2012 MustGather>> Check the <PROFILE_HOME> for the javacores.
Fri Jun 29 14:45:25 EDT 2012 MustGather>> Be sure to submit linperf_RESULTS.tar.gz, the javacores, and the server logs as noted in the MustGather.
[root@server tmp]#
The topdashH.9777.out file consistently showed PID 15811 as the top CPU consumer, as shown in the following example.
top thread CPU consumers
PID   USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
15811 root 25  0 792m 590m 93m R 87.5 14.8 1:05.56 WebContainer :
 9780 root 18  0 792m 590m 93m S  6.4 14.8 0:00.64 Signal Reporter
--
PID   USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
15811 root 25  0 792m 591m 93m R 97.7 14.8 1:10.45 WebContainer :
 9777 root 15  0 792m 591m 93m S  0.0 14.8 0:00.00 java
--
PID   USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
15811 root 25  0 792m 592m 93m R 90.9 14.8 1:15.00 WebContainer :
15918 root 18  0 792m 592m 93m S  6.8 14.8 0:00.34 WebContainer :
--
PID   USER PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
15811 root 25  0 792m 592m 93m R 96.9 14.8 1:19.85 WebContainer :
 9781 root 15  0 792m 592m 93m S  0.2 14.8 0:25.29 JIT Compilation
After converting the PID number 15811 to hexadecimal (3DC3) and opening the Java core files in TMDA, we can see the result. It shows the code from the sample application being executed at the time the Java core was created.
The conclusion of the analysis in this example is that the code in com.ibm.DoComplicatedStuff.doWork is the most likely culprit for the high CPU usage.
Out of Memory exceptions in WAS
An Out of Memory exception can have numerous causes. A memory leak can be indicated by a steady increase in memory usage without de-allocation. It is expected that the Java heap usage fluctuates up and down, because the de-allocation of memory is only triggered when the heap usage reaches certain levels.
The resulting chart resembles a sawtooth, with a slow increase in memory (object allocation) followed by a sudden decrease (garbage collection). If, in every cycle, the heap usage after the garbage collector runs shows a steady increase, it can also indicate a memory leak condition.
Another type of out of memory issue can be caused by heap fragmentation. Heap fragmentation is caused by the allocation and de-allocation of large objects in a short amount of time. The heap has enough total free space to accommodate new large objects, but that space is not contiguous, which causes an Out of Memory exception. In such cases, a steady increase pattern is not seen in the chart, so the analysis tools help determine whether heap fragmentation is the problem.
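For the steady-increase pattern described above, a quick command-line look at the verbose garbage collection output can complement the graphical tools. This is only a rough sketch: it assumes an IBM J9-style XML verbose GC log written to native_stderr.log whose per-collection entries carry a free="<bytes>" attribute; attribute names vary by JVM level, so check your log and adjust the pattern accordingly:

# Sample the reported free-heap values. A downward trend across many
# collections suggests a leak; plenty of free heap combined with
# OutOfMemoryError can instead point to fragmentation.
grep -o 'free="[0-9]*"' native_stderr.log | cut -d'"' -f2 | tail -40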
An out of memory condition is most commonly detected when an Out Of Memory exception is shown in the SystemErr.log file. Out Of Memory exceptions are sometimes referred to as OOM.
Out Of Memory exception in SystemErr.log
[6/29/12 18:40:31:919 EDT] 0000008f SystemErr R java.lang.OutOfMemoryError: Java heap space
[6/29/12 18:40:31:922 EDT] 0000008f SystemErr R at com.ibm.AllocateObject.doWork(AllocateObject.java:45)
[6/29/12 18:40:31:924 EDT] 0000008f SystemErr R at com.ibm.BaseServlet.service(BaseServlet.java:72)
[6/29/12 18:40:31:925 EDT] 0000008f SystemErr R at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
[6/29/12 18:40:31:927 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1214)
[6/29/12 18:40:31:928 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:774)
[6/29/12 18:40:31:929 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:456)
[6/29/12 18:40:31:929 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
[6/29/12 18:40:31:929 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1027)
[6/29/12 18:40:31:933 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.webapp.WebApp.handleRequest(WebApp.java:3703)
[6/29/12 18:40:31:936 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.webapp.WebGroup.handleRequest(WebGroup.java:304)
[6/29/12 18:40:31:939 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.WebContainer.handleRequest(WebContainer.java:962)
[6/29/12 18:40:31:943 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.WSWebContainer.handleRequest(WSWebContainer.java:1662)
[6/29/12 18:40:31:946 EDT] 0000008f SystemErr R at com.ibm.ws.webcontainer.channel.WCChannelLink.ready(WCChannelLink.java:195)
[6/29/12 18:40:32:031 EDT] 0000008f SystemErr R at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleDiscrimination(HttpInboundLink.java:458)
[6/29/12 18:40:32:032 EDT] 0000008f SystemErr R at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleNewRequest(HttpInboundLink.java:522)
[6/29/12 18:40:32:032 EDT] 0000008f SystemErr R at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.processRequest(HttpInboundLink.java:311)
[6/29/12 18:40:32:032 EDT] 0000008f SystemErr R at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.ready(HttpInboundLink.java:282)
[6/29/12 18:40:32:599 EDT] 0000008f SystemErr R at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
[6/29/12 18:40:32:600 EDT] 0000008f SystemErr R at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
[6/29/12 18:40:32:601 EDT] 0000008f SystemErr R at com.ibm.ws.tcp.channel.impl.AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:165)
[6/29/12 18:40:32:602 EDT] 0000008f SystemErr R at com.ibm.io.async.AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:217)
[6/29/12 18:40:32:603 EDT] 0000008f SystemErr R at com.ibm.io.async.AsyncChannelFuture$1.run(AsyncChannelFuture.java:205)
[6/29/12 18:40:32:604 EDT] 0000008f SystemErr R at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1783)
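As a quick first check, the SystemErr.log files of a profile can be scanned for OutOfMemoryError entries from the command line. A minimal sketch, assuming the installation and profile path used in the example that follows (/opt/WAS/AppServer/profiles/AppSrv01); adjust the path to your environment:

# Report, per server log, how many OutOfMemoryError entries were written.
grep -c "java.lang.OutOfMemoryError" /opt/WAS/AppServer/profiles/AppSrv01/logs/*/SystemErr.log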
After an Out Of Memory exception is detected, WAS by default creates a set of dump files: one Java core, one portable heap dump (or simply heap dump), and one system dump (native platform core dump). As the following example shows, a snap trace can also be written.
native_stderr.log showing an Out Of Memory exception and dumps
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2012/06/29 18:39:48 - please wait.
JVMDUMP032I JVM requested System dump using '/opt/WAS/AppServer/profiles/AppSrv01/core.20120629.183948.19153.0001.dmp' in response to an event
JVMDUMP010I System dump written to /opt/WAS/AppServer/profiles/AppSrv01/core.20120629.183948.19153.0001.dmp
JVMDUMP032I JVM requested Heap dump using '/opt/WAS/AppServer/profiles/AppSrv01/heapdump.20120629.183948.19153.0002.phd' in response to an event
JVMDUMP010I Heap dump written to /opt/WAS/AppServer/profiles/AppSrv01/heapdump.20120629.183948.19153.0002.phd
JVMDUMP032I JVM requested Java dump using '/opt/WAS/AppServer/profiles/AppSrv01/javacore.20120629.183948.19153.0003.txt' in response to an event
JVMDUMP010I Java dump written to /opt/WAS/AppServer/profiles/AppSrv01/javacore.20120629.183948.19153.0003.txt
JVMDUMP032I JVM requested Snap dump using '/opt/WAS/AppServer/profiles/AppSrv01/Snap.20120629.183948.19153.0004.trc' in response to an event
JVMDUMP010I Snap dump written to /opt/WAS/AppServer/profiles/AppSrv01/Snap.20120629.183948.19153.0004.trc
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
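Before moving the data for analysis, it can be useful to confirm from the command line that the dump artifacts named in the JVMDUMP messages exist and have a plausible size, and to package the ones to transfer. A minimal sketch using the paths from the example above (the archive name is arbitrary):

# Check the dump files written to the profile root.
cd /opt/WAS/AppServer/profiles/AppSrv01
ls -lh core.20120629.183948.19153.*.dmp heapdump.20120629.183948.19153.*.phd javacore.20120629.183948.19153.*.txt Snap.20120629.183948.19153.*.trc

# Package the heap dump and javacore for the analysis workstation.
tar -czf oom_dumps.tar.gz heapdump.20120629.183948.19153.*.phd javacore.20120629.183948.19153.*.txt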
Using these files together with the native_stderr.log file, which contains the verbose garbage collection output (if enabled), we can take a first look. These files and logs help diagnose what is accumulating in memory and causing the Out of Memory exception.
The following steps help identify the likely cause of the Out of Memory Exception:
- Download the portable heap dump file (.phd) and the native_stderr.log file to a computer with ISA installed. The system dump file (.dmp) is only needed if the Out Of Memory exception is caused by something outside the Java heap, such as the JVM itself or native libraries.
- Open the native_stderr.log file in the ISA tool IBM Monitoring and Diagnostic Tools for Java - Garbage Collection and Memory Visualizer.
- The tool provides views of the heap usage and reports possible issues and recommendations.
- To understand which objects are filling up the memory, use a tool that processes the contents of the portable heap dump and allows navigation and drill-down, such as IBM Heap Analyzer or IBM Monitoring and Diagnostic Tools for Java - Memory Analyzer.
For more information about out of memory analysis, see the IBM developerWorks resources listed in the Online resources section below.
Web material
Web material is available from the IBM Redbooks web server.
The web material includes the following files:
- SG248056.zip: Compressed code samples
- ITSOWebCustomerCredit.war: Sample web application
- WLP85_server1.zip: Compressed, sample Liberty profile server runtime
- ClassloaderTestV1.ear
- ClassloaderTestV2.ear
- ClassloaderTestV3.ear
- VersionCheckerV2.jar
Online resources
- Command assistance for administrative scripting
- Properties based configuration
- Sample Scripts for WAS
- IBM HTTP Server performance tuning
- WebSphere Application Server V8.5 Information Center
- Java Diagnostics Guide
- IBM PMAT
- IBM Monitoring and Diagnostic tools for Java
- Case study: Performance tuning for WebSphere Application Server V7 and V8
- Planning for Availability in the Enterprise
- IBM File System Locking Protocol Test for WAS
- IBM Support Assistant
- Complete list of ISA available add-ons
- IBM Education Assistant module for the IBM Garbage Collection and Memory Visualizer
- Garbage collection in WAS V8, Part 1
- Garbage collection in WAS V8, Part 2
- ITCAM for WAS 7.2
- Classify the Application Control Region in WLM OMVS rules
- Java Information Center
- IBM Extended Cache Monitor for IBM WAS technology
- WebSphere z/OS Optimized Local Adapters
- Introducing the IBM Support Assistant for WAS on z/OS
- JSR 289 SIP Servlet API 1.1 Specification
- Overview of the purpose of OSGi
- OSGi home page
- IBM SDK Java Technology Edition V7 Information Center
- Demystifying class loading problems
- Samples and Tutorials with SCA using IBM Rational Application Developer
- OSGi Alliance
- Tool to build your OSGi applications, Ant
- Tool to build your OSGi applications, Maven
- IBM AIX Toolbox for Linux Applications official download website
- IBM developerWorks, Transactional high availability and deployment considerations in WAS V6