Replication and extract, transform, and load (ETL) processes

WebSphere Commerce Analyzer copies transactional data from the WebSphere Commerce database to the WebSphere Commerce Analyzer data mart using DB2 replication technology.

DB2 replication is composed of two steps:

Capture

During WebSphere Commerce Analyzer configuration, ASN control tables are created on the source database (that is, the WebSphere Commerce database), and the source tables are then registered for replication. As each table is registered, a changed data (CD) table is created for it (by default, REPL.CD_tablename).

When started, the Capture program (ASNCAP) captures the changed data (inserts, updates, and deletes) from the source database and moves it into the CD table that was created for each registered table.
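For example, after the Capture program has been running, you can see what changed data has accumulated for a registered table by querying its CD table. The following query is a sketch only; it assumes the default CD table naming convention and uses ORDERS as an illustrative source table:

    SELECT IBMSNAP_OPERATION, COUNT(*) AS CHANGE_COUNT
      FROM REPL.CD_ORDERS
     GROUP BY IBMSNAP_OPERATION;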

Each CD table is stored in a shared tablespace.

If you are capturing data from more than one source database, run the Capture program on each database. You must set up and run the Capture program before you can run replication.
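As a minimal sketch, the Capture program could be started from the command line as follows, assuming the WebSphere Commerce database is named MALL and the default ASN capture schema is used (both assumptions):

    asncap capture_server=MALL capture_schema=ASN startmode=warmsi

The warmsi start mode warm starts the Capture program, switching to a cold start only if this is the first time Capture is started.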

If Capture is not running, changes to the registered source tables are not captured from the replication source systems.

A Capture program is not required when running replication on a WebSphere Commerce Oracle database. Instead, triggers handle the function of the Capture program.
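The following sketch shows the general shape of such trigger-based capture on Oracle. It is illustrative only: the actual triggers are generated during configuration, and the table and column names used here are assumptions.

    CREATE OR REPLACE TRIGGER CD_ADDRESS_TRG
    AFTER INSERT OR UPDATE OR DELETE ON ADDRESS
    FOR EACH ROW
    BEGIN
      IF INSERTING THEN
        INSERT INTO REPL.CD_ADDRESS (ADDRESS_ID, IBMSNAP_OPERATION)
          VALUES (:NEW.ADDRESS_ID, 'I');
      ELSIF UPDATING THEN
        INSERT INTO REPL.CD_ADDRESS (ADDRESS_ID, IBMSNAP_OPERATION)
          VALUES (:NEW.ADDRESS_ID, 'U');
      ELSE
        INSERT INTO REPL.CD_ADDRESS (ADDRESS_ID, IBMSNAP_OPERATION)
          VALUES (:OLD.ADDRESS_ID, 'D');
      END IF;
    END;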

Detailed information about running the Capture program can be found in the SQL Replication Guide and Reference, which is available from the following URL: www.ibm.com/software/data/db2/udb/support/manualsv8.html

Apply

The Apply program copies the database tables identified for replication from the WebSphere Commerce database to the staging area of the data mart on the WebSphere Commerce Analyzer server.

The Apply program is located on the WebSphere Commerce Analyzer system.

The Apply program can propagate only a limited number of rows at a time. The number of minutes that you specify determines the size of this data block; the value is stored in the MAX_SYNCH_MINUTES column of the subscription set table. If the changed data is greater than the size of the data block, the Apply program automatically converts the single subscription cycle into several mini-subscriptions that run in mini-cycles, reducing the backlog to manageable pieces. The Apply program also reduces the size of the data block to match the available system resources. If replication fails during a mini-cycle, the Apply program retries the subscription set, beginning with the last successful mini-cycle. To change the MAX_SYNCH_MINUTES value, see Updating the MAX_SYNCH_MINUTES column.
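As a sketch, the value could be updated with an SQL statement such as the following, assuming the Apply control tables use the default ASN schema and the subscription set is named WCA_SET (both assumptions):

    UPDATE ASN.IBMSNAP_SUBS_SET
       SET MAX_SYNCH_MINUTES = 5
     WHERE SET_NAME = 'WCA_SET';

A smaller value produces smaller data blocks and therefore more mini-cycles.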

When you start replication, you can choose to replicate data continuously. If you choose continuous replication, a new replication cycle starts as soon as the previous one finishes. If you choose not to replicate continuously, replication occurs only when you start it manually or as scheduled.

Extract, transform, and load (ETL) processes

After replication finishes, the extract, transform, and load (ETL) processes move the data from one area of the data mart (the staging area) to another area of the data mart, where it can be accessed by the WebSphere Commerce Accelerator to create reports.

The replication and ETL processes are made up of flows. A flow is a series of steps linked together. For more detailed information about these flows, see WebSphere Commerce Analyzer flows.

Note: IBM recommends that you install and configure WebSphere Commerce Analyzer and complete the ETL process when the WebSphere Commerce site goes live.

Staging tables

WebSphere Commerce Analyzer staging tables are replicas of the WebSphere Commerce data source. The staging database and tables in WebSphere Commerce Analyzer are not related to the WebSphere Commerce staging server. WebSphere Commerce Analyzer replicates data using DB2 replication and then populates the staging tables with that data.

WebSphere Commerce Analyzer replicates numerous WebSphere Commerce tables into its staging area. The replicated data is stored in staging tables on the WebSphere Commerce Analyzer machine. The replicated tables are later used as the source for the extract, transform, and load (ETL) process. The ETL steps use the staging tables to populate data into the WebSphere Commerce Analyzer data mart.
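As an illustration of how an ETL step draws on a staging table, a single load step could resemble the following statement. The target table WCA.FACT_ADDRESS is hypothetical, ADDRESS_R is a staging table named according to the convention described below, and the LOAD_STATUS filter (also described below) selects rows not yet processed:

    INSERT INTO WCA.FACT_ADDRESS (ADDRESS_ID, CITY, COUNTRY)
      SELECT ADDRESS_ID, CITY, COUNTRY
        FROM ADDRESS_R
       WHERE LOAD_STATUS = -1;

The real ETL steps are the predefined flows described in WebSphere Commerce Analyzer flows.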

All staging tables are based on the DB2 condensed replication option. A condensed staging table contains only the most current value for each row from the source table.

The name of a staging table is the same as, or closely associated with, the name of its source table. For example, the staging table for the ADDRESS table is ADDRESS_R.

Each staging table has a LOAD_STATUS field. The value of LOAD_STATUS indicates whether the data has already been processed by the WebSphere Commerce Analyzer ETL process. A value of -1 means that the row has not been processed.
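For example, a sketch of checking how many rows in a staging table still await processing, assuming the ADDRESS_R staging table from the example above:

    SELECT COUNT(*) AS UNPROCESSED_ROWS
      FROM ADDRESS_R
     WHERE LOAD_STATUS = -1;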

During replication, the following fields are added to each staging table:

IBMSNAP_INTENTSEQ

The log or journal record sequence number that uniquely identifies a change. This value is globally ascending.

IBMSNAP_OPERATION

A character value of I, U, or D, indicating an insert, update, or delete record, respectively. During the ETL steps, rows whose commit time (IBMSNAP_LOGMARKER) falls between time_cut_off_previous and time_cut_off and whose operation is I or U represent inserted or updated data; see the sketch after this list.

IBMSNAP_LOGMARKER

The approximate commit time at the source server. This column is always NULL after a full refresh. Incremental data processing is based on the time information of this field.

IBMSNAP_COMMITSEQ

The log record sequence number of the captured COMMIT statement.
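Taken together, these fields allow the ETL steps to read changes incrementally. The following query is a sketch of such a read, again assuming the ADDRESS_R staging table and treating time_cut_off_previous and time_cut_off as host variables supplied by the ETL step:

    SELECT *
      FROM ADDRESS_R
     WHERE IBMSNAP_LOGMARKER > :time_cut_off_previous
       AND IBMSNAP_LOGMARKER <= :time_cut_off
       AND IBMSNAP_OPERATION IN ('I', 'U');

Rows with an IBMSNAP_OPERATION value of D in the same window represent deletions and would be handled by a corresponding delete step.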