

Example of throughput degradation analysis



This section introduces an example of throughput degradation that we encountered while testing a sample WebSphere Commerce application. The test was run on an AIX platform in a 1-node test environment (that is, DB2, IBM HTTP Server, and WebSphere Application Server were installed on the same machine). The throughput degradation could be observed in a 3-hour stress comparison test. Based on the test results from the previous code release, we set the throughput baseline at 11,580 transactions per hour and estimated a 10% performance improvement in the new release, so the test target for this case was set at 12,738 transactions per hour.


Finding throughput degradation

In the first run, we checked the report of our stress test tool. The throughput curve looked fairly straight and the degradation was not obvious at first; the rate of degradation was only around 5% per hour. However, at such a rate the application would be unable to handle any load with reasonable response times within a few days. Figure 24-21 shows the throughput chart from the stress test report.
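
For example, assuming the observed 5% per hour loss compounds, after 48 hours the throughput would fall to roughly 0.95^48, or about 8%, of its starting value, which is why a degradation rate that looks small in a 3-hour run still makes the application unusable within a few days.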

Figure 24-21 Find throughput degradation in stress test tool report

This was a gradual throughput degradation (GTD) problem. We followed steps 1 to 3 to analyze the problem and got the following results:

In step 4, we doubled SORTHEAP to 2048 to avoid sort overflows because some queries needed to create temporary tables.
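
A minimal sketch of how a change like this can be checked and applied from the DB2 command line follows; the database name WCSDB is a placeholder we introduce for illustration, and the sort counters are only populated when the sort monitor switch is enabled.

    # Check sort activity; a growing "Sort overflows" count suggests
    # that SORTHEAP is too small (WCSDB is a placeholder database name).
    db2 "GET SNAPSHOT FOR DATABASE ON WCSDB" | grep -i sort

    # Double SORTHEAP from 1024 to 2048 pages, as described above.
    db2 "UPDATE DATABASE CONFIGURATION FOR WCSDB USING SORTHEAP 2048"

    # Verify the new value; it applies to new connections, or after the
    # database is deactivated and reactivated, depending on the release.
    db2 "GET DATABASE CONFIGURATION FOR WCSDB" | grep -i SORTHEAP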

In step 5, when checking the database, we found some data issues. For example, running the following query returned the result shown in Figure 24-23:

    select member_id, count(*) from orders group by member_id having count(*) > 50

The data in Figure 24-23 indicates that four users had more orders than the other users. After reviewing our test steps, we identified the causes of this uneven data distribution.

In step 6, we narrowed the test scenarios to the order shopping flow.

In step 7, we kept the test running for four days and took two 10-hour snapshots, one on day 2 and one on day 4.
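
A hedged sketch of how such snapshots can be captured with the DB2 command line processor follows; the database name WCSDB and the output file name are assumptions, and the dynamic SQL snapshot requires the statement monitor switch to be enabled.

    # Reset the monitor counters at the start of each 10-hour window.
    db2 "RESET MONITOR FOR DATABASE WCSDB"

    # ... let the stress test run for 10 hours ...

    # Capture the dynamic SQL snapshot for later comparison
    # (repeated on day 2 and day 4).
    db2 "GET SNAPSHOT FOR DYNAMIC SQL ON WCSDB" > snapshot_day2.txt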

In step 8, after comparing the two snapshots, we found no growth in cost per execution for any query.

In step 9, after comparing the two snapshots, we identified that the top SQL queries had growing rows read per execution.
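
A hedged example of how rows read per execution can be ranked directly in SQL, using the SYSIBMADM.SNAPDYN_SQL administrative view available in DB2 9.x; the top-10 ordering is our choice for illustration, not the exact query used in the test.

    -- Rank dynamic SQL statements by rows read per execution;
    -- statements whose ratio keeps growing between snapshots are suspects.
    SELECT SUBSTR(STMT_TEXT, 1, 80)               AS STMT,
           NUM_EXECUTIONS,
           ROWS_READ,
           ROWS_READ / NULLIF(NUM_EXECUTIONS, 0)  AS ROWS_READ_PER_EXEC
    FROM   SYSIBMADM.SNAPDYN_SQL
    ORDER BY ROWS_READ_PER_EXEC DESC
    FETCH FIRST 10 ROWS ONLY;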

Identifying these growing SQL statements gave us the clues we needed to solve the throughput degradation problem. Throughput improved after we made the following changes (a hedged SQL sketch follows the list):

  1. We dropped the extraneous index MEMBER_ID+TYPE+STOREENT_ID on the ORDERS table so that queries would use the correct index MEMBER_ID+STATUS+STOREENT_ID.

  2. We created the CHECKED index for the BUSEVENT table.

  3. We periodically cleaned the CTXMGMT/BUSEVENT tables during the test.
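
The SQL below is a minimal sketch of changes of this kind; the schema name WCSADM, the index names, and the cleanup predicates are illustrative assumptions rather than the exact statements used in the test.

    -- Drop the extraneous index so the optimizer falls back to the
    -- MEMBER_ID+STATUS+STOREENT_ID index (index names are assumed).
    DROP INDEX WCSADM.I_ORDERS_MBR_TYPE_STORE;

    -- Index the CHECKED column of BUSEVENT (hypothetical index name).
    CREATE INDEX WCSADM.I_BUSEVENT_CHECKED ON WCSADM.BUSEVENT (CHECKED);

    -- Periodic test-time cleanup of processed rows; the predicate and
    -- tables match the description above but are illustrative only.
    DELETE FROM WCSADM.BUSEVENT WHERE CHECKED = 1;
    DELETE FROM WCSADM.CTXMGMT;
    COMMIT;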

Figure 24-24 shows the status of the same SQL statements after we applied the modifications just mentioned. Most of the growing rows-read-per-execution problems were resolved.

After fixing the growing SQL costs, the throughput seemed stable for the first three hours.

However, the degradation still existed in the long run.

Then we analyzed the access plan and decided to run runstats and rebind the database during a long-running test. Figure 24-27 and Figure 24-28 show how runstats helped to optimize the access plan. In this example, the overall access plan cost does not include the actual fetch time.

After we ran runstats and rebound the database in the middle of a long-running test, the overall throughput stabilized again and the throughput degradation problem was solved. Therefore, run runstats and rebind the database regularly to ensure that the catalog statistics used by the optimizer do not become stale.
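
A hedged sketch of this maintenance from the DB2 command line follows; the database name WCSDB and schema WCSADM are assumptions, and in practice runstats would be repeated for every heavily used table.

    # Refresh catalog statistics for a heavily used table.
    db2 "RUNSTATS ON TABLE WCSADM.ORDERS WITH DISTRIBUTION AND DETAILED INDEXES ALL"

    # Recompile cached dynamic SQL so new access plans use the fresh statistics.
    db2 "FLUSH PACKAGE CACHE DYNAMIC"

    # Rebind all static packages against the refreshed statistics.
    db2rbind WCSDB -l rbind.log all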