Throughput degradation analysis and solution


When encountering a throughput problem in performance testing, follow these steps to analyze and solve it. First, identify whether the problem is a gradual throughput degradation or a consistently low throughput.

The main identification method is to check whether the throughput charts in the test reports show a downward trend, or whether the response time shows an upward trend, during the test. If so, it is a gradual throughput degradation problem, as shown in Figure 24-18. Otherwise, it is a low throughput problem, which may or may not be solved by performance tuning alone; you may need to scale your hardware.

You can start analyzing the problem using this detailed process:

  1. Check to see whether all WebSphere application commands degrade when compared to the baseline result.

    You can do this by checking the average response time of all the commands in the test report. If only certain commands are slow, it usually indicates a design or code issue, and you can use the RAD profiler or an equivalent Java profiler to pinpoint the culprit in the code.

  2. Check to see whether throughput degradation exists after restarting WebSphere Application Server.

    If the problem is resolved after a server restart, it is probably not database-specific. It may be a memory problem, such as a memory leak, heap fragmentation, or large object allocation. In some cases, the problem may be caused by a WebSphere Application Server defect, in which case you need to involve its service team. However, if restarting the server does not solve the problem, go to step 3 to continue the analysis.

  3. Check to see whether the database has been optimized.

    For DB2, check whether runstats has been run on the database server. If not, run runstats. Runstats is important for improving DB2 performance when the data volume is large or the system has been running for a long time. Runstats also helps DB2 optimize its access plans, which makes DB2 more efficient.

    This article mainly uses DB2 and runstats as its example.
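
    As an illustration only, runstats can be invoked from the DB2 command line; the database, schema, and table names below are placeholders, not part of the original text:

      # collect table and index statistics, including distribution statistics
      db2 connect to MYDB
      db2 "RUNSTATS ON TABLE MYSCHEMA.ORDERS WITH DISTRIBUTION AND DETAILED INDEXES ALL"
      # repeat for the other large, frequently accessed tables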

  4. Check to see whether database tuning has been done. If not, try to tune DB2 parameters.

    The available tuning objectives include the following:

    • bufferpools
    • sortheap
    • locks
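
    As a sketch only, these parameters can be adjusted from the DB2 command line; the buffer pool name is the DB2 default, and the database name and values are placeholders, not recommendations:

      # enlarge the default buffer pool (size is in pages)
      db2 "ALTER BUFFERPOOL IBMDEFAULTBP SIZE 50000"
      # increase the sort heap and the lock-related limits for the database
      db2 "UPDATE DB CFG FOR MYDB USING SORTHEAP 2048"
      db2 "UPDATE DB CFG FOR MYDB USING LOCKLIST 4096 MAXLOCKS 60"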

    If the problem persists after DB tuning, there are two possibilities:

    • If the throughput degradation is not gradual, examine the DB2 snapshot file to analyze the status of the top SQL queries (the number of executions and the cost of each query) and to find which query is causing the problem.

    • If the throughput degradation is gradual, go to step 5 to continue the analysis.

  5. Check to see whether the data is evenly distributed. If not, fix the data problem.

    Unevenly distributed data is created by improper warm-up or unbalanced operations during the test. For example, if the tester uses a few fixed users to place orders during the warm-up, thousands of orders related to those users accumulate in the database. Likewise, if the tester uses only a few fixed users to place orders during the formal test, the corresponding data accumulates. This causes DB2 queries to use table scans instead of index scans and degrades database performance. The better approach is to omit the warm-up users from the formal test and to select users at random from a larger user pool, for example, selecting 20 users at random out of 400 for both the warm-up and the formal test.
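
    One way to spot this kind of skew, sketched here with placeholder schema, table, and column names, is to count the rows that belong to each user:

      # list the users that own a disproportionate share of the order rows
      db2 "SELECT MEMBER_ID, COUNT(*) AS ORDER_COUNT FROM MYSCHEMA.ORDERS
           GROUP BY MEMBER_ID ORDER BY ORDER_COUNT DESC FETCH FIRST 20 ROWS ONLY"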

  6. Use divide and conquer: try to reproduce the problem with the smallest possible set of scenarios.

    This narrows down the scope of possible causes of the throughput degradation. To accomplish this, divide the test scenarios into different groups and test them separately, find the groups that cause throughput degradation, and then divide those groups again and again until you narrow down to the minimum scenario group that causes the problem.

  7. Run the scenarios confirmed in step 6 for a long time and take multiple snapshots during the test.

    For example, take a 10-hour snapshot on the first day and another 10-hour snapshot on the second day, and then compare these two snapshots.
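
    A minimal sketch of capturing such snapshots with the DB2 command line processor, assuming a placeholder database name MYDB and illustrative file names:

      # make sure statement monitoring is on so the snapshot contains per-SQL statistics
      db2 "UPDATE MONITOR SWITCHES USING STATEMENT ON"
      # at the end of each 10-hour window, dump the dynamic SQL statistics to a file
      db2 "GET SNAPSHOT FOR DYNAMIC SQL ON MYDB" > snapshot_day1.txt
      # reset the counters so the next measurement window starts from zero
      db2 "RESET MONITOR FOR DATABASE MYDB"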

  8. Compare multiple snapshots to see whether the cost/SQL and the number of executions of the SQL queries are growing.

    If yes, purge the data and tune the indexes, if needed. By comparing these files, you can identify the queries whose cost is growing fastest. The cost can be one of the following metrics: execution time per query execution, user CPU time per query execution, or system CPU time per query execution. Note that other costs, such as fetch time, are not included in the execution time reported by the snapshot. Be sure to look up the same SQL query in multiple snapshots to see which of its metrics are growing.
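
    As a sketch, and assuming your DB2 release provides the SYSIBMADM.SNAPDYN_SQL administrative view with the column names used here (otherwise read the same counters from the text snapshots taken in step 7), the statements can be ranked by cost per execution like this:

      # rank dynamic SQL statements by average execution time per execution (cost/SQL)
      db2 "SELECT NUM_EXECUTIONS, ROWS_READ, TOTAL_EXEC_TIME, SUBSTR(STMT_TEXT, 1, 80) AS STMT
           FROM SYSIBMADM.SNAPDYN_SQL WHERE NUM_EXECUTIONS > 0
           ORDER BY (TOTAL_EXEC_TIME * 1.0) / NUM_EXECUTIONS DESC FETCH FIRST 20 ROWS ONLY"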

  9. If the cost/SQL entries are all constant, compare the snapshots to see whether rows read/execution is growing for some SQL queries.

    If so, try to tune the corresponding indexes to improve performance. If not, analyze the access plan (for example, using the DB2 Explain utilities) to see what can be improved.
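
    A sketch of using the DB2 Explain utilities from the command line; the statement, database name, and output file are placeholders, and the explain tables must already exist (they can be created with the EXPLAIN.DDL script shipped with DB2):

      # capture the access plan for a suspect statement
      db2 connect to MYDB
      db2 "EXPLAIN PLAN FOR SELECT * FROM MYSCHEMA.ORDERS WHERE MEMBER_ID = ?"
      # format the captured plan into a readable report
      db2exfmt -d MYDB -1 -o orders_plan.txt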

  10. In steps 8 and 9, you can identify the top queries that have growing cost/SQL or growing rows read per execution.

    These two characteristics of the identified queries are usually the main indicators of performance degradation. To solve these problems, drop extraneous indexes, add new indexes, or periodically clean out the large volume of obsolete data in some tables, as illustrated below. If no SQL queries show growing cost in the snapshots, analyze the access plans to see what can be improved based on the accumulated data.
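
    Purely as an illustration with placeholder table, column, and index names, these kinds of changes look as follows; verify each one against your own schema and access plans first:

      # add an index that matches the predicates of a growing query
      db2 "CREATE INDEX MYSCHEMA.IX_ORDERS_MEMBER ON MYSCHEMA.ORDERS (MEMBER_ID, STATUS)"
      # drop an index that is maintained but never used
      db2 "DROP INDEX MYSCHEMA.IX_ORDERS_OLD"
      # periodically purge obsolete rows so the affected tables stay small
      db2 "DELETE FROM MYSCHEMA.ORDERS WHERE STATUS = 'X' AND LASTUPDATE < CURRENT TIMESTAMP - 90 DAYS"
      # refresh statistics after large changes
      db2 "RUNSTATS ON TABLE MYSCHEMA.ORDERS AND INDEXES ALL"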

Figure 24-20 is an example that shows the rows read per execution statistics for each SQL query, obtained by comparing snapshots. We made this chart by comparing two snapshots from a four-day test: a 10-hour snapshot taken on day 2 and a 10-hour snapshot taken on day 4. The numbers in red indicate queries with growing costs that need to be amended.