Data Load utility performance tuning
Scheduled Data Load utility jobs can affect WebSphere Commerce performance. We can reduce the impact of this process by tuning the data load performance appropriately for our implementation.
When we are considering how to tune the Data Load process, ensure that we review the Data Load summary reports that are generated after we run the Data Load utility. We can use these reports to identify which elements of the Data Load process require tuning to improve performance.
Before beginning
Ensure that we are familiar with and understand the following concepts and tasks related to the Data Load utility:
- Running the Data Load utility
- Configuring the Data Load utility
To tune your Data Load utility process, we can tune the following processes, parameters, and caches:
- Data Load mode
- File difference preprocessing
- Validation options
- ID resolver cache
- Batch size and commit count
- Java virtual machine (JVM) heap size
- Network latency
- Database tuning
Data Load mode
The Data Load mode parameter sets the type of load process that the Data Load utility runs. We can set this mode to Insert, Replace, or Delete in the wc-dataload.xml file for the data that we are loading. Typically, Replace is used; however, Insert and Delete can run faster because these modes do not require as many IDs to be resolved with the ID resolver. When we are using Insert or Delete, ensure that these actions are the only database operations that your CSV file requires.
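As an illustration, the mode can be declared on the load order element, matching the sample declaration that is shown later in this topic. The commit count and batch size values here are illustrative, not recommendations:

```xml
<!-- Illustrative load order fragment: Insert mode for a CSV file that
     contains only new records. Attribute values are examples only. -->
<_config:LoadOrder commitCount="1000" batchSize="500" dataLoadMode="Insert">
    <!-- Load items for the files to load go here -->
</_config:LoadOrder>
```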
File difference preprocessing
We can run a file difference preprocess for routine data loads to improve the Data Load utility performance for loading these files. By using this preprocessor, we can compare two input files, such as a previously loaded file and a new version of that file. The preprocessor generates a difference file that contains only the records in the new file that are not within the old file or that are changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previously loaded records, then running this file difference can result in shorter load times. Running a file difference can reduce the time required to load your routine updates to the WebSphere Commerce database, reduce server usage time, and improve server performance.
We can configure the Data Load utility file difference preprocessor to compare files by the values in each column, instead of entire records, to identify the changed records. We can also configure the file difference preprocessor to ignore specific columns when the process is comparing files.
For more information about this preprocessor, see Data Load file difference preprocessing.
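The shape of such a configuration might resemble the following sketch. The element and attribute names here are hypothetical and for illustration only; see Data Load file difference preprocessing for the exact syntax:

```xml
<!-- Hypothetical sketch only: element and attribute names are illustrative.
     The preprocessor compares the old and new files and writes a difference
     file that the Data Load utility then loads. -->
<_config:FileDifference
    oldFile="catalog-entries-previous.csv"
    newFile="catalog-entries-current.csv"
    diffFile="catalog-entries-diff.csv" />
```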
Validation options
Configuring the Data Load utility to validate the data that we are loading can affect Data Load performance. The validation of the data that we are loading is performed against the WebSphere Commerce database. If we are validating many records, or the validation process encounters many invalid records, this validation process can affect performance. By default, the following validation options are available as configurable properties for the Data Load utility:
- attributeValueValidation
- Indicates whether to validate the attribute value. The attribute value is mandatory except within a product and defining attribute relationship.
- validateAttribute
- Validates whether a SKU and a product have compatible defining attributes when the SKU is moved under the product. The validation logic determines whether the attributes or allowed values that are to be created, updated, or deleted belong to the current store.
- validateCatalog
- Validates whether more than one master catalog is being created for a store. If the store supports sales catalogs, the validation checks whether a catalog entry belongs to more than one master category. The validation also checks whether an attribute allowed value can be set to default in the current store.
- validateCatalogEntry
- Validates whether to check the types of the SKU and product when the Data Load adds a SKU under a product. This check is to make sure that the SKU is really a SKU and the product is really a product.
- validateCatalogGroup
- Validates whether a catalog group belongs to a specified catalog.
- validateUniqueDN
- Validates the uniqueness of the distinguished name (DN) to identify a user in the CSV file. By default, to optimize data load performance, users in the CSV file are identified by the logon ID instead of the distinguished name.
If the data that we are loading does not require any of the validation processes to occur, set the corresponding configurable property to false so that no validation is performed.
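For example, a validation property can be disabled with a property declaration such as the following fragment. The placement of the property within your Data Load configuration files can differ; the property names are the ones that are listed above:

```xml
<!-- Illustrative fragment: disable a validation that the loaded data
     does not require. Placement within the configuration file may vary. -->
<_config:property name="validateCatalogEntry" value="false" />
```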
ID resolver cache
If the ID resolver cache size is large enough, the cache can store all of the required IDs for a database table that data is being loaded into. If the size of this cache is not large enough to store all of the required IDs, then none of the IDs are cached. When the IDs are not cached, the data load process must resolve the IDs directly against the database. When we are configuring the cache setting, consider the following behavior:
- To cache all of the IDs for a large table, the ID resolver cache can require a significant amount of time to populate. If we are loading few records into a table with many individual records and IDs to cache, directly resolving the IDs against the database can require less time than caching all of the IDs for the entire table and resolving the IDs against the cache. We can configure the cache size so that the cache is too small to store the IDs for large tables, but still large enough to cache smaller tables, which then resolve their IDs against the cache. By reducing the size of this cache, we can reduce the time that is spent caching IDs for an entire table that we are loading only a few records into.
- The ID resolver cache is cleared after a load item is completed in the load order configuration file. If we are running multiple CSV files, then the cache must be repopulated after each item completes. If multiple CSV files load data into the same tables, consider merging the files wherever possible to reduce the caching time for repopulating the same table ID data.
To tune your ID resolver cache, set an initial value that ensures that all of the tables that we are loading data into can be cached. Then, run a second Data Load scenario with the ID resolver cache size set to 0. In both instances, pass in the parameter -Dcom.ibm.commerce.foundation.dataload.idresolve.level=FINE when you call the Data Load utility. This parameter adds the resolving times to the trace, including the time required to populate the cache (if the size is not set to 0), and the time required to resolve the IDs against the database or cache. With these times, we can identify whether to increase or decrease the caching of IDs and reduce the time required to resolve the IDs for the data that we are loading. The size of the ID resolver cache is set in the Data Load environment configuration file. The following is a sample declaration of a cache that can store 1 million records:
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl" cacheSize="1000000"/>
To help tune the ID resolver cache, review the data load summary report after a data load completes. This summary includes the total time that is necessary for the ID resolver to resolve and check the IDs for the objects that are loaded. The Data Load utility can also be configured to output more information about the data load ID resolver process. This addition to the summary report includes the following information:
- The time that it takes to fetch and load the IDs for a table into the ID resolver cache
- The number of entries per table that are stored in the cache
- The number of hits to the cache per table to resolve IDs
- The time that it takes to resolve IDs for a table directly from the database table
- The number of hits to the database to resolve IDs for a table
With this information, we can identify whether it would be more efficient to resolve and check IDs for a table against the ID resolver cache or directly against the database. We can then adjust the size of your ID resolver cache as needed, or exclude IDs from certain tables from being included within the cache. For more information about the ID resolver information that can be included in the data load summary report, see Verifying the results of the data load.
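The two tuning runs that are described earlier, one with a cache that is large enough for all loaded tables and one with caching disabled, can be declared as follows. The className value matches the sample declaration in this topic; the nonzero cacheSize value is illustrative:

```xml
<!-- Run 1: cache sized to hold the IDs for all tables being loaded
     (the value 5000000 is an example only). -->
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl" cacheSize="5000000"/>

<!-- Run 2: caching disabled; all IDs resolve directly against the database. -->
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl" cacheSize="0"/>
```

Comparing the trace times from the two runs shows whether the cache population cost is repaid by faster ID resolution for your data volumes.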
Batch size and commit count
Change the Data Load utility batch size and commit count parameters to reduce the effect of network latency and reduce the processing load on the Transaction Server. The commit count must be a multiple of the batch size. The commit count parameter specifies the number of rows that are flushed in a single transaction before a commit is issued. Database operations are grouped into batches. These batches are kept in Java memory until enough rows are stored for a flush. Then, the batch contents are sent to the database as a single packet of data. The batches are held in the database until the next commit occurs and the database commits the changes.
By increasing the batch size, we can reduce the effect that network latency can have on the Data Load process. Increasing the batch size can reduce the number of batches required to be sent to the database. The wait time for the database response can also be reduced by increasing the batch size.
By increasing the commit count, we can reduce the processing load on the Transaction Server. Increasing the commit count increases the load on the database. Increasing the commit count causes more records to be committed to the database in a single transaction. This increase results in less uncommitted data that remains stored on your Transaction Server and fewer overall transactions required to commit the data. The values for the batch size and commit count parameters are set in the load order configuration file and are typically in the 500-1000 range. The following is a sample declaration of the batch size and commit count parameters:
<_config:LoadOrder commitCount="1000" batchSize="500" dataLoadMode="Replace">
Java virtual machine (JVM) heap size
When we are tuning the ID resolver cache, we can adjust the JVM heap allocation. The JVM heap size must be proportionate to the ID resolver cache. For instance, if the ID resolver cache is large, specify a large JVM heap size. When the ID resolver cache is large, the Data Load utility does not resolve the IDs from the database directly; however, the ID resolver cache might use much of the JVM heap memory. For a 1 GB JVM heap size, set the ID resolver cache size to less than 2 million to prevent a Java out of memory error. If you encounter an out of memory exception during the data load process, the allocated JVM heap size might be too small. Ensure that the JVM heap size is sufficient to accommodate the ID resolver cache and the batches that are stored in memory. Set the value for the JVM heap size in the following parameters within the Data Load utility dataload.sh file: -Xms1024m -Xmx4096m
Network latency
If the environment is configured with a large physical distance between servers, the data load process can be affected. Network latency can affect the flush time when the Data Load utility runs. We can view the flush time in the Data Load utility summary report. The flush time includes the time required to transport a batch and the time required for the database to respond; it also includes network latency. If the flush time is large, your system can be experiencing poor database performance, poor network performance, or both. If you tune the environment and the Data Load performance is still not within the required range, then installing WebSphere Commerce on a local system might be necessary to improve Data Load utility performance.
Database tuning
By tuning the database, we can improve the performance of the Data Load utility by reducing the time required to commit records. We can view the time required to commit loaded data records in the Data Load utility summary report. There are many performance tuning tools available for improving database performance. For more information about database performance tuning, see:
- (DB2) Database (DB2) performance considerations
- (Oracle) Database (Oracle) performance considerations
Related concepts
Overview of the Data Load utility
Related tasks
Configuring and running the Data Load utility
Related reference
Data Load utility best practices