General data load best practices
The following best practices are recommended when using the Data Load utility to load data.
Configuration for the initial loads
For more information about recommended configurations during initial loads, see Scenario: Initial load.
Configuration for the delta loads
For more information about recommended configuration during delta loads, see Scenario: Delta load.
Running the data load script file
If you run the Data Load utility from the Utilities container, consider mounting a directory of the host system into a directory within the container. Place all your Data Load files in a subdirectory of the mounted directory. This approach ensures that the files are not lost when the container is overwritten by another Docker image. It also makes it easier to edit the files directly on the host system with your favorite editor. When running the utility, consider the following command-line options:
- -DXmlValidation=false
- Turn off XML validation if you are using variable substitution for integer attributes.
- -D.level=FINEST
- Turn on more tracing in the utilities_root\logs\wc-dataload.log file. For large loads, specifying the FINEST trace level causes too much tracing in the log file. You can instead turn on tracing for one package, as shown in the following examples.
- -Dcom.ibm.commerce.catalog.dataload.level=FINER
- Finer log level for catalog information.
- -Dcom.ibm.commerce.foundation.dataload.database.level=FINE
- Log SQL issues.
- -Dcom.ibm.commerce.foundation.dataload.level=CONFIG
- Log Data Load performance issues.
- -Dcom.ibm.commerce.price.dataload.level=FINER
- Log price-related load issues.
- -Dcom.ibm.commerce.inventory.dataload.level=FINER
- Log inventory-related load issues.
- -Dcom.ibm.commerce.member.dataload.level=FINER
- Log member-related load issues.
- -Dcom.ibm.commerce.marketing.dataload.level=FINER
- Log marketing-related load issues.
- -Dcom.ibm.commerce.promotion.dataload.level=FINER
- Log promotion-related load issues.
- -Dcom.ibm.commerce.pagelayout.dataload.level=FINER
- Log Commerce Composer-related load issues.
- Customize the Java logging configuration file, utilities_root\ts.ear\XML\config\dataload\logging.properties. For example, you can change the log file path, the maximum log file size, and how many log files to cycle through. By default, you have one log file, and the log file is overwritten every time that you run the Data Load utility.
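For example, several of these trace options can be combined on a single invocation. The configuration file path below is an illustrative assumption; point it at your own load order file:

```shell
# Run from the utilities_root/bin directory inside the Utilities container.
# The load order file path is an example; substitute your own.
./dataload.sh ../dataload/wc-dataload.xml \
  -DXmlValidation=false \
  -Dcom.ibm.commerce.catalog.dataload.level=FINER \
  -Dcom.ibm.commerce.foundation.dataload.database.level=FINE
```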
File difference preprocessing
You can run a file difference preprocess for routine data loads to improve Data Load utility performance when loading these files. This preprocessor compares two input files, such as a previously loaded file and a new version of that file. The preprocessor generates a difference file that contains only the records in the new file that are not in the old file, or that are changed from the records in the old file. The Data Load utility can then load this difference file. If your routinely loaded files contain many previously loaded records, running this file difference can result in shorter load times. Running a file difference can reduce the time that is required to load your routine updates to the WebSphere Commerce database, reduce server usage time, and improve server performance.
You can configure the file difference preprocessor to compare files by the values in each column, instead of by entire records, to identify the changed records. You can also configure the preprocessor to ignore specific columns when it compares files.
For more information about this preprocessor, see Data Load file difference preprocessing.
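As a rough illustration, the file difference preprocess is set up in the load order configuration. The element and attribute names in this sketch are assumptions for illustration only; see Data Load file difference preprocessing for the exact syntax that your version supports:

```xml
<!-- Illustrative sketch only: element and attribute names are assumptions. -->
<_config:LoadItem name="Catalog" businessObjectConfigFile="wc-loader-catalog.xml">
  <!-- Compare the previously loaded file with the new file, producing a difference file. -->
  <_config:FileDifference oldFile="catalog-old.csv" newFile="catalog-new.csv"
      diffFile="catalog-diff.csv" />
  <!-- Load only the difference file instead of the full new file. -->
  <_config:DataSourceLocation location="catalog-diff.csv" />
</_config:LoadItem>
```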
Data Load utility configuration files
There are three types of data load configuration files:
- Load order configuration file (wc-dataload.xml)
- You can have either multiple load order configuration files or one load order configuration file that includes all the load items. To load only a few load items, use the following command-line option when you run the Data Load utility:
-DLoadOrder="loadItemName1, loadItemName2, loadItemName3"
- Environment configuration file (wc-dataload-env.xml)
- You need only one copy of this configuration file.
- Business object configuration files
- Typically, one business object configuration file corresponds to one type of input data that loads one type of business object. The file name convention is wc-loader-business_object.xml. This file defines the DataReader, BusinessObjectBuilder, and BusinessObjectMediator that are used for the data load. It also defines all the customization configuration options.
Keep all your data load configuration files relative to the wc-dataload.xml file, and ensure that the configuration files that are specified in the wc-dataload.xml file use relative paths. Relative paths make it easy to move the configuration files from one workstation to another.
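A minimal load order file might look like the following sketch. The file names and load item name are illustrative, and the namespace declaration is omitted for brevity; verify the exact elements against your version's documentation:

```xml
<!-- wc-dataload.xml: sketch of a load order file (names are illustrative). -->
<_config:DataLoadConfiguration>
  <!-- The single environment configuration file, referenced by relative path. -->
  <_config:DataLoadEnvironment configFile="wc-dataload-env.xml" />
  <_config:LoadOrder commitCount="100" batchSize="100" dataLoadMode="Replace">
    <!-- One load item per business object configuration file, also by relative path. -->
    <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
      <_config:DataSourceLocation location="catalog-entries.csv" />
    </_config:LoadItem>
  </_config:LoadOrder>
</_config:DataLoadConfiguration>
```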
Configure the data load order file (wc-dataload.xml)
Consider the following configurations:
- Specify commitCount, batchSize, and dataLoadMode at the LoadOrder level so that you do not need to specify them at each LoadItem level.
- Specify a commitCount that is greater than or equal to the batchSize, and make the commitCount a multiple of the batchSize.
- To minimize the impact on your production environment, set the commitCount and batchSize to 1. Specifying a larger commitCount and batchSize improves data load performance. However, a large commitCount and batchSize might affect the database and result in more database tables and rows being locked for a longer time.
- For easier debugging of SQL errors, set the batchSize to 1 and turn on database tracing. These settings can help you determine which SQL statement or input line caused the SQL error. If the batchSize is greater than 1, the JDBC batch update is enabled. With batch update enabled, relating the SQL error to the input line or SQL statement that caused it can be difficult.
- Running the Data Load utility in insert mode can improve the performance of initial loads of large amounts of data. However, when you run the Data Load utility in insert mode, the utility does not check the database before it loads your input data. The utility does not determine whether the data objects that you are creating already exist in the database before it attempts to load and create them. This behavior can cause the load process to fail when the utility attempts to create data objects that already exist in the database. For example, suppose that your input file is a CSV file that contains a line that creates a product and a product description, and a second line that loads the same product with a description in a different language. Because the utility creates the product and description from the first line, when the utility encounters the second line, it attempts to create the product again. Since the product already exists, the load fails. To load this data, you can use one of the following methods:
- Run the Data Load utility in replace mode. In replace mode, the Data Load utility generates insert, update, or delete SQL statements, depending on the data that you are loading. This mode replaces existing data in the database with your input data.
- Run the Data Load utility in insert mode, but load your product and description information separately. Use one input file to load your product information, and a different input file to load all of your descriptions.
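To illustrate the insert-mode conflict described above, consider a hypothetical CSV file that creates the same product twice, once per description language. The column names and values here are invented for this example:

```csv
PartNumber,Name,ShortDescription,Language
SHIRT-001,Cotton Shirt,A classic cotton shirt,en_US
SHIRT-001,Chemise en coton,Une chemise classique en coton,fr_FR
```

In insert mode, the second line fails because the product SHIRT-001 was already created by the first line. In replace mode, or with the descriptions split into a separate input file, both descriptions load.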
Configure the data load environment configuration file (wc-dataload-env.xml)
Consider the following configurations:
- Do not specify the <_config:FilePath /> element if all your configuration files are relative to the wc-dataload.xml file.
- Encrypt the database password in your configuration file. To obtain the encrypted ASCII string, run the wcs-encrypt utility as follows:
- (Linux) Open a command-line shell in the Utilities container and go to the utilities_root/bin directory. See Running utilities from the Utility server Docker container. Then run: ./wcs-encrypt.sh <plain password>
- (Developer) Open a command-line shell and go to the WCDE_installdir\bin directory. Then run: wcs-encrypt.bat <plain password>
- You can also leave the setting for the database password empty. When the password is empty, you are prompted to enter the password when you run the Data Load utility. You must enter the plain, unencrypted password; for security purposes, your password is not echoed on the console.
- The IDResolver configuration is optional. You can specify the ID resolver cache size. For an initial load, specify a large cache size, such as 1 million. For a delta load, specify a cache size of 0. If you do not specify the ID resolver configuration, the default cache size is 0.
- Instead of hardcoding the data writer in each data load business object configuration file, specify the default data writer class in this configuration file.
- Set the ID resolver cache size to 0 when you run frequent loads that can include the same information in repeated loads, for example, when you run scheduled updates of catalog data that override existing data. For more information about configuring this cache size and this file, see Configure the data load environment settings.
- If an environment configuration file is specified in the command line, it takes precedence over the element that exists in the data load environment configuration file.
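For example, an ID resolver cache that is sized for an initial load might be configured in wc-dataload-env.xml roughly as follows. The class name and attribute names in this sketch follow the pattern used by the Data Load utility, but verify them against your version's documentation:

```xml
<!-- Sketch: a large ID resolver cache for an initial load; use cacheSize="0" for delta loads. -->
<_config:IDResolver className="com.ibm.commerce.foundation.dataload.idresolve.IDResolverImpl"
    cacheSize="1000000" />
```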
Configure the data load business object configuration file
Consider the following configurations:
- If you are loading files in CSV format, instead of hardcoding the CSV column names in the <_config:Data> node of the <_config:DataReader> element, you can place all the column names in the first line of the CSV file. Then, specify firstLineIsHeader="true" in the <_config:DataReader> element.
Note: This configuration applies only when using the CSVReader interface as your data reader.
- When you define the mapping or table configuration in the <_config:BusinessObjectBuilder> element, if the value is from the input data, you do not need to specify the valueFrom attribute. The default is valueFrom="InputData".
- If you do not want the Data Load utility to update all columns of a table when you load data, you can configure a column exclusion list. Configuring a column exclusion list causes the Data Load utility to avoid loading data into every column of a table. This exclusion configuration lets you prevent columns that are already populated with data from being overwritten.
See Configure a column exclusion list.
When you configure an exclusion list, you can configure the list with an optional parameter, forUpdateOnly="true". If you set this parameter to true and you update a table record with an input file that includes changes for an excluded column, the utility does not update the value of the excluded column with the input file value. If the same load operation includes data that creates records in the table, the utility ignores the exclusion list and inserts the input file values into all columns of the new records.
- If you are loading CSV files that contain leading whitespace, you can avoid errors by configuring the data load CSV reader to trim the tokens. To do so, set the trimTokenWhiteSpace property to true. For example:
```xml
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="false" useHeaderAsColumnName="false">
  <_config:property name="trimTokenWhiteSpace" value="true" />
</_config:DataReader>
```
CSV input files
Consider the following tips when you edit or maintain your CSV files:
- For clarity and readability, use the first line of the CSV file as a header.
```xml
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" >
```
You can use spreadsheet software to open the CSV file to check whether your data matches the column headings.
- For flexibility in using and omitting optional columns, or in rearranging the columns, use the header as the column names instead of hardcoding the columns:
```xml
<_config:DataReader className="com.ibm.commerce.foundation.dataload.datareader.CSVReader" firstLineIsHeader="true" useHeaderAsColumnName="true" >
```
instead of:
```xml
<_config:Data>
  <_config:column number="1" name="Identifier" />
  <_config:column number="2" name="Name" />
  <_config:column number="3" name="ShortDescription" />
  <_config:column number="4" name="LongDescription" />
  <_config:column number="5" name="Thumbnail" />
  <_config:column number="6" name="FullImage" />
  <_config:column number="7" name="Delete" />
</_config:Data>
```
- To use spreadsheet software to edit and save the CSV file, ensure that you use an editor that can save in UTF-8 format, for example, OpenOffice Calc. Otherwise, the spreadsheet software might reformat your CSV file and invalidate your data for the CSVReader. For example, if you have a time stamp column in the CSV file, not saving in UTF-8 format can reformat the time stamp data according to your locale setting. The reformatted time stamp data might not be valid for use with your data load. By default, data load supports only the Java standard time stamp format "yyyy-MM-dd hh:mm:ss.nnnnnn". The Data Load utility can also support a customized time stamp format. If you have a customized time stamp format in your CSV file, you must specify the format in the load item configuration in the wc-dataload.xml file. For example, you can specify the following row in the <_config:LoadItem> element:
```xml
<_config:property name="timestampPattern" value="yyyy-MM-dd HH.mm.ss" />
```
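For instance, a CSV input file that matches the hardcoded column layout shown earlier could look like the following. The values are invented for illustration:

```csv
Identifier,Name,ShortDescription,LongDescription,Thumbnail,FullImage,Delete
PROD-100,Denim Jacket,Classic denim jacket,A durable denim jacket with brass buttons,images/thumb/prod-100.jpg,images/full/prod-100.jpg,0
```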
Loading by unique ID
Specifying the unique ID is optional when you use the Data Load utility. However, if you specify the unique ID, you save the processing time that is required to resolve the ID, and performance is improved.
Reversing a data load
To reverse a load, you can run the same data load again with dataLoadMode="Delete" specified in the <_config:LoadOrder> element of your wc-dataload.xml data load configuration file. When you reverse a load, specify the following configuration row within the <_config:LoadOrder> element:
```xml
<_config:property name="actionOnError" value="1" />
```
This configuration row ensures that the Data Load utility continues processing when it encounters a soft delete error, which can occur because dependent child records no longer exist due to cascade deletes.
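Putting both settings together, the reversal configuration in wc-dataload.xml might look like the following sketch. The load item details are illustrative:

```xml
<!-- Sketch: reverse a previous load by re-running it in Delete mode. -->
<_config:LoadOrder commitCount="100" batchSize="1" dataLoadMode="Delete">
  <!-- Continue past soft delete errors caused by cascade deletes. -->
  <_config:property name="actionOnError" value="1" />
  <_config:LoadItem name="CatalogEntry" businessObjectConfigFile="wc-loader-catalog-entry.xml">
    <_config:DataSourceLocation location="catalog-entries.csv" />
  </_config:LoadItem>
</_config:LoadOrder>
```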
Tuning the Data Load utility
To reduce any performance impact from running the Data Load utility, you can adjust the ID resolver cache size and other parameters that are related to the utility. See Data Load utility performance tuning.