Data Load utility performance tuning

Scheduled Data Load utility jobs can affect WebSphere Commerce performance. We can reduce the impact of this process by tuning the Data Load process appropriately for our implementation.

When we are considering how to tune the Data Load process, ensure that we review the Data Load summary reports that are generated after the Data Load utility runs. We can use these reports to identify which elements of the Data Load process require tuning to improve performance.

Before beginning

Ensure that we are familiar with and understand the following concepts and tasks related to the Data Load utility:

To improve the performance of the Data Load process, we can tune the following processes, parameters, and caches:


Data Load mode

The Data Load mode parameter sets the type of load process that the Data Load utility runs. We can set this mode to Insert, Replace, or Delete in the wc-dataload.xml file for the data that we are loading. Typically, Replace is used; however, Insert and Delete can run faster because these modes do not require as many IDs to be resolved with the ID resolver. When we use Insert or Delete, ensure that these actions are the only database operations that our CSV file requires.
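For example, the mode is commonly declared as an attribute of the load order element in wc-dataload.xml. The following sketch assumes the standard `<_config:LoadOrder>` element; verify the attribute name against the configuration reference for your WebSphere Commerce version:

```xml
<!-- wc-dataload.xml: dataLoadMode can be Insert, Replace, or Delete -->
<_config:LoadOrder commitCount="100" batchSize="100" dataLoadMode="Insert">
  <!-- load items for the CSV files being loaded -->
</_config:LoadOrder>
```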


File difference preprocessing

We can run a file difference preprocess for routine data loads to improve Data Load utility performance for these files. By using this preprocessor, we can compare two input files, such as a previously loaded file and a new version of that file. The preprocessor generates a difference file that contains only the records in the new file that are not in the old file or that changed from the records in the old file. The Data Load utility can then load this difference file. If our routinely loaded files contain many previously loaded records, running this file difference can result in shorter load times. Running a file difference can reduce the time required to load routine updates to the WebSphere Commerce database, reduce server usage time, and improve server performance.

We can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. We can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.

For more information about this preprocessor, see Data Load file difference preprocessing.
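As an illustrative sketch only (the element and attribute names here are assumptions; confirm them against the Data Load file difference preprocessing documentation for your version), the comparison might be declared in the load configuration like this:

```xml
<!-- Hypothetical example: compare a previously loaded file with a new version
     and produce a difference file for the Data Load utility to load -->
<_config:FileDifference oldFile="catalog-entries-old.csv"
                        newFile="catalog-entries-new.csv"
                        diffFile="catalog-entries-diff.csv" />
```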


Validation options

Configuring the Data Load utility to validate the data that we are loading can affect Data Load performance. The validation of the data that we are loading is performed against the WebSphere Commerce database. If we are validating many records, or the validation process encounters many invalid records, this validation can affect performance. By default, the following validation options are available as configurable properties for the Data Load utility:

If the data that we are loading does not require any of the validation processes to occur, ensure that the corresponding configurable property is set to false so that no validation is performed.
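For instance, validation properties are typically set on the business object mediator in the business object configuration file. The property name below is a hypothetical placeholder; substitute the actual validation property for your mediator:

```xml
<!-- Hypothetical example: disable one validation step for the data being loaded.
     Replace "validateAttribute" with the actual property name for your mediator. -->
<_config:property name="validateAttribute" value="false" />
```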


ID resolver cache

If the ID resolver cache size is large enough, the cache can store all of the required IDs for a database table that data is being loaded into. If the size of this cache is not large enough to store all of the required IDs, then none of the IDs are cached. When the IDs are not cached, the data load process must resolve the IDs directly against the database. When we are configuring the cache setting, consider the following behavior:

To tune the ID resolver cache, set an initial value that ensures that all of the tables that we are loading data into can be cached. Then, set up a second Data Load scenario with the ID resolver cache set to 0. In both cases, pass in the parameter -Dcom.ibm.commerce.foundation.dataload.idresolve.level=FINE when you call the Data Load utility. This parameter adds the resolving times to the trace, including the time required to populate the cache (if the cache size is not set to 0) and the time required to resolve the IDs against the database or cache. With these times, we can identify whether to increase or decrease the ID caching and reduce the time required to resolve the IDs for the data that we are loading. The size of the ID resolver cache is set in the Data Load environment configuration file. The following is a sample declaration of a cache that can store 1 million records:
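A declaration of this form in the Data Load environment configuration file sets the cache size. The class name shown is assumed to be the default ID resolver; verify it against your installation:

```xml
<!-- ID resolver cache sized for 1 million records; a cacheSize of 0 disables caching -->
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl"
                    cacheSize="1000000" />
```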

To help tune the ID resolver cache, review the data load summary report after a data load completes. This summary includes the total time that is necessary for the ID resolver to resolve and check the IDs for the objects that are loaded. The Data Load utility can also be configured to output more information about the data load ID resolver process. This addition to the summary report includes the following information:

With this information, we can identify whether it is more efficient to resolve and check IDs for a table against the ID resolver cache or directly against the database. We can then adjust the size of the ID resolver cache as needed, or exclude the IDs for certain tables from the cache. For more information about the ID resolver information that can be included in the data load summary report, see Verifying the results of the data load.


Batch size and commit count

Change the Data Load utility batch size and commit count parameters to reduce the effect of network latency and reduce the processing load on the Transaction Server. The commit count must be a multiple of the batch size. The commit count parameter specifies the number of rows that are flushed in a single transaction before a commit is issued. Database transactions are loaded in batches. These batches are kept in Java memory until enough rows are stored for a flush. Then, the batch contents are sent to the database as a single packet of data. The batches are stored in the database until the next commit occurs and the database commits the changes.

By increasing the batch size, we can reduce the effect that network latency can have on the Data Load process. Increasing the batch size can reduce the number of batches required to be sent to the database. The wait time for the database response can also be reduced by increasing the batch size.

By increasing the commit count, we can reduce the processing load on the Transaction Server, although doing so increases the load on the database. Increasing the commit count causes more records to be committed to the database in a single transaction. This increase results in less uncommitted data that remains stored on the Transaction Server and fewer overall transactions to commit the data. The values for the batch size and commit count parameters are set in the load order configuration file and are typically in the 500 - 1000 range. The following is a sample declaration of the batch size and commit count parameters:
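In the load order configuration file (wc-dataload.xml), the parameters are declared as attributes of the load order element. The values below follow the typical 500 - 1000 range and the rule that the commit count must be a multiple of the batch size:

```xml
<!-- Flush 500 rows per batch; commit after every 1000 rows (a multiple of the batch size) -->
<_config:LoadOrder commitCount="1000" batchSize="500">
  <!-- load items for the CSV files being loaded -->
</_config:LoadOrder>
```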


Java virtual machine (JVM) heap size

When we are tuning the ID resolver cache, we can adjust the JVM heap allocation. The JVM heap size must be proportionate to the ID resolver cache. For instance, if the ID resolver cache is large, specify a large JVM heap size. When the ID resolver cache is large enough, the Data Load utility does not need to resolve IDs directly from the database; however, the cache might use much of the JVM heap memory. For a 1 GB JVM heap size, set the ID resolver cache size to less than 2 million to prevent a Java out-of-memory error. If you encounter an out-of-memory exception during the data load process, the allocated JVM heap size might be too small. Ensure that the JVM heap size is sufficient to accommodate the ID resolver cache and the batches that are stored in memory. Set the value for the JVM heap size in the following parameters within the Data Load utility dataload.sh file: -Xms1024m -Xmx4096m
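As a sketch, the heap settings are standard JVM options passed on the Java invocation inside dataload.sh. The variable name below is an assumption for illustration; edit the actual java command line in your copy of the script:

```shell
# Illustrative fragment of dataload.sh (JVM_OPTIONS name is an assumption):
# -Xms sets the initial heap to 1 GB, -Xmx caps the heap at 4 GB
JVM_OPTIONS="-Xms1024m -Xmx4096m"
echo "$JVM_OPTIONS"
```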


Network latency

If the environment is configured with a large physical distance between servers, the data load process can be affected. Network latency can increase the flush time when the Data Load utility runs. The flush time can be viewed in the Data Load utility summary report. The flush time includes the time required to transport a batch, including the time required for the database to respond; this time also includes network latency. If we are experiencing a long flush time, the system can be experiencing poor database performance, poor network performance, or both. If we tune the environment and Data Load performance is still not within the required range, installing WebSphere Commerce on a local system might be necessary to improve Data Load utility performance.


Database tuning

By tuning the database, we can improve the performance of the Data Load utility by reducing the time required to commit records. We can view the time required to commit loaded data records in the Data Load utility summary report. There are many performance tuning tools available for improving database performance. For more information about database performance tuning, see:


Related concepts
Overview of the Data Load utility


Related tasks
Configuring and running the Data Load utility


Related reference
Data Load utility best practices