Throughput degradation | Database maintenance
Example of throughput degradation analysis
This section presents an example of throughput degradation that we encountered while testing a sample WebSphere Commerce application. The test ran on an AIX platform in a 1-node test environment (that is, DB2, IBM HTTP Server, and WebSphere Application Server were all installed on the same machine). The throughput degradation could be observed in a 3-hour stress comparison test. Based on the test result from the previous code release, we set the throughput baseline at 11,580 transactions per hour and estimated a 10% performance improvement in the new release, so the test target for this case was set at 12,738 transactions per hour.
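As a quick check, the target figure follows directly from the baseline and the estimated improvement:

```python
# Derive the test target from the baseline throughput and the
# estimated improvement, using the figures quoted above.
baseline_tph = 11_580   # transactions per hour, previous release
improvement = 0.10      # estimated 10% gain in the new release

target_tph = baseline_tph * (1 + improvement)
print(f"Test target: {target_tph:.0f} transactions per hour")  # 12738
```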
Finding throughput degradation
In the first run, we checked the report from our stress test tool. The throughput curve looked fairly flat and the degradation was not obvious at first; the rate of degradation was only around 5% per hour. However, at that rate, within a few days the application would be unable to handle any load with reasonable response times. Figure 24-21 shows the throughput chart in the stress test report.
Figure 24-21 Find throughput degradation in stress test tool report
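To see why a seemingly mild 5% hourly loss is fatal in the long run, project the compound degradation over a few days. This is a simplified model that assumes the loss compounds hourly at a constant rate:

```python
# Project throughput under a constant 5% per-hour degradation rate.
# Simplified model: the loss compounds hourly and never changes.
baseline_tph = 11_580
hourly_loss = 0.05

for hours in (3, 24, 48, 72):
    remaining = baseline_tph * (1 - hourly_loss) ** hours
    print(f"after {hours:2d} h: {remaining:7.0f} tph "
          f"({remaining / baseline_tph:.1%} of baseline)")
```

After three days, less than 3% of the baseline throughput remains, which is why even a gradual degradation must be treated as a defect.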
This was a gradual throughput degradation (GTD) problem. We followed steps 1 to 3 to analyze the problem and got the following results:
- All WebSphere Commerce commands were degraded, especially commands corresponding to orders.
- When analyzing native_stderr.log, we found that the GC cycles were fine and there were no memory issues such as memory leaks, heap fragmentation, or large-object allocation.
- After restarting the application server, the throughput degradation still existed.
- Runstats and common DB2 tuning had been done.
In step 4, we doubled the SORTHEAP database configuration parameter to 2048 to avoid sort overflows, because some queries needed to create temporary tables.
In step 5, when checking the database, we found some data issues. For example, running the query select member_id, count(*) from orders group by member_id having count(*) > 50 returned the result shown in Figure 24-23.
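The skew check can be reproduced against a toy dataset. In this sketch, sqlite3 stands in for DB2 and the row counts are invented for illustration: four "hot" users get many orders, the rest get few.

```python
import sqlite3

# In-memory toy database standing in for DB2. Members 1-4 are the
# "hot" warm-up users with 200 orders each; the rest have 5 each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (member_id INTEGER)")

rows = [(m,) for m in range(1, 201) for _ in range(200 if m <= 4 else 5)]
conn.executemany("INSERT INTO orders VALUES (?)", rows)

# The same query used in step 5 to spot over-represented users.
hot = conn.execute(
    "SELECT member_id, COUNT(*) FROM orders "
    "GROUP BY member_id HAVING COUNT(*) > 50"
).fetchall()
print(hot)  # the four warm-up users stand out
```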
The data in Figure 24-23 indicates that four users had far more orders than the other users. After reviewing our test steps, we found two causes of this problem:
- In the warm-up, we often used four fixed users to run scenarios.
- There were 200 users in the database, but only 20 users were selected to run scenarios in the formal test.
These two factors made the data distribution in the database uneven, so our solution had two parts:
- Increase the number of users from 200 to 400.
- In the warm-up and the formal test, randomly select each virtual user from a non-overlapping set of 20 users. After these changes, the data issues no longer occurred.
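The revised user-selection policy can be sketched as follows. The pool size and naming are illustrative rather than taken from the actual test tooling; random.sample guarantees that the picks within each set are distinct:

```python
import random

# Build the enlarged pool of 400 registered users (illustrative names).
all_users = [f"user{i:03d}" for i in range(400)]

# Randomly pick 20 distinct users for the warm-up, then pick the
# formal-test users from the remainder so the two sets never overlap.
warmup_users = random.sample(all_users, 20)
remaining = [u for u in all_users if u not in set(warmup_users)]
test_users = random.sample(remaining, 20)

assert not set(warmup_users) & set(test_users)  # non-overlapping sets
```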
In step 6, we narrowed the scenarios to the order shopping flow.
In step 7, we kept the test running for four days and took two 10-hour snapshots, one on day 2 and one on day 4.
In step 8, after comparing the two snapshots, we found no query with a growing cost per execution.
In step 9, after comparing the two snapshots, we identified the top SQL queries with growing rows read per execution.
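The step 9 comparison boils down to flagging statements whose rows read per execution grew between the two snapshots. A minimal sketch follows; the statement texts and numbers are invented, and real values would come from the DB2 snapshot monitor output:

```python
# Rows read per execution from the two snapshots (invented data).
day2 = {
    "SELECT * FROM ORDERS WHERE MEMBER_ID = ?": 120.0,
    "SELECT * FROM BUSEVENT WHERE CHECKED = ?": 4_000.0,
    "SELECT * FROM CATENTRY WHERE CATENTRY_ID = ?": 3.0,
}
day4 = {
    "SELECT * FROM ORDERS WHERE MEMBER_ID = ?": 480.0,
    "SELECT * FROM BUSEVENT WHERE CHECKED = ?": 16_500.0,
    "SELECT * FROM CATENTRY WHERE CATENTRY_ID = ?": 3.0,
}

# Keep only the statements that grew noticeably between snapshots.
growing = {
    stmt: (day2[stmt], day4[stmt])
    for stmt in day2
    if day4[stmt] > 1.5 * day2[stmt]
}
for stmt, (before, after) in growing.items():
    print(f"{stmt:45s} {before:9.0f} -> {after:9.0f} rows read/exec")
```

Statements whose rows read per execution hold steady (the CATENTRY lookup here) are filtered out, leaving only the candidates worth investigating.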
Identifying these growing SQL queries gave us clues to solving the throughput degradation problem. Throughput improved after the following changes:
- We dropped the extraneous index MEMBER_ID+TYPE+STOREENT_ID on the ORDERS table so that queries would use the correct index MEMBER_ID+STATUS+STOREENT_ID.
- We created the CHECKED index for the BUSEVENT table.
- We periodically cleaned the CTXMGMT/BUSEVENT table in the test.
Figure 24-24 shows the status of the same SQL queries after we applied the modifications just described. Most of the growing rows read per execution had been resolved.
After fixing the growing SQL costs, throughput appeared stable for the first three hours. However, the degradation still existed over longer runs.
Then we analyzed the access plan and decided to run runstats and rebind the database during a long-running test. Figure 24-27 and Figure 24-28 show how runstats helped optimize the access plan. In this example, the overall access plan costs do not include the actual fetch time.
After we ran runstats and rebound the database in the middle of a long-running test, the overall throughput stabilized again and the throughput degradation problem was solved. Therefore, run runstats and rebind the database regularly to ensure that table and index statistics do not become stale.