The indexing process

The indexing process involves adding Documents to an IndexWriter. The searching process involves retrieving Documents from an index using an IndexSearcher. Solr can index both structured and unstructured content.

Structured content is organized into predefined fields. For example, a product description has predefined fields such as title, manufacturer name, description, and color.

Unstructured content, in contrast, lacks structure and organization. For example, it can consist of PDF files or content from external sources (such as tweets) that do not follow any predefined patterns.


Data Import Handler

The Data Import Handler (DIH) can perform full imports or delta imports. When the full-import command is run, the DIH stores the start time of the operation in the dataimport.properties file, which is in the same directory as the solrconfig.xml file. For example:
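As a hedged illustration, the following Python sketch reads the stored start time back out of a dataimport.properties file. The sample contents, the last_index_time key, and the timestamp format are assumptions based on how Solr's DIH commonly writes this file; they are not copied from a WebSphere Commerce installation, and the helper names are hypothetical.

```python
from datetime import datetime

# Illustrative file contents (assumed, not taken from a real installation).
# The file uses Java properties syntax, so colons in the value are escaped.
SAMPLE_PROPERTIES = """\
#Mon Jan 01 12:00:00 UTC 2018
last_index_time=2018-01-01 12\\:00\\:00
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping comments and blank lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Undo the Java properties escaping of ':' characters.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def last_index_time(text):
    """Return the recorded import start time as a datetime (hypothetical helper)."""
    raw = parse_properties(text)["last_index_time"]
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")


start = last_index_time(SAMPLE_PROPERTIES)
```

A delta import can compare this timestamp with row modification times to decide which rows changed since the last import.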


Fetching, reading, and processing data

The wc-data-config.xml file defines the following behaviors:

For example, the solrhome\MC_10001\en_US\CatalogEntry\conf\wc-data-config.xml file contains the following content:

Where two data sources are used:

In addition, the file contains the following types of content by default:

The following three documents exist: one for CatalogEntry, one for bundle, and one for dynamic kit.

The CatalogEntry document contains the following entities: Product and attachment_content. The Product entity contains the following parameters:

The wc-data-config.xml file also contains column-to-field mappings that specify the relationship between index field names and database column names. For example:

Where: CATENTRY_ID is the database column name and catentry_id is the index field name.
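To make the column-to-field relationship concrete, the following Python sketch extracts the mappings from a small data configuration fragment. The fragment is illustrative only: it assumes the standard DIH field element with column and name attributes, the PARTNUMBER/partNumber pair is a hypothetical second mapping, and column_to_field_map is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment (assumed); a real wc-data-config.xml contains far more.
SAMPLE_ENTITY = """
<entity name="Product">
  <field column="CATENTRY_ID" name="catentry_id" />
  <field column="PARTNUMBER" name="partNumber" />
</entity>
"""


def column_to_field_map(entity_xml):
    """Map each database column name to its index field name."""
    root = ET.fromstring(entity_xml)
    return {field.get("column"): field.get("name") for field in root.iter("field")}


mapping = column_to_field_map(SAMPLE_ENTITY)
```

Each key in the resulting dictionary is a database column name, and each value is the index field that the column populates.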


Crawling unstructured content

For unstructured content, the Solr ExtractingRequestHandler uses Apache Tika to allow users to upload binary files and unstructured data to Solr. Solr then extracts and indexes the content. WebSphere Commerce uses the Droid site content crawler to crawl the web and write the content to a file, namely the file at the unstructuretmpfile path that is specified in the wc-data-source.xml file within the CatalogEntry index. Tika then parses this file, and the DIH indexes the information.

Unstructured data comes from two sources: the database and the crawler. The unstructured index contains two data configuration files: the wc-data-config.xml file contains product attachments, such as PDF files, while the wc-web-data-config.xml file contains web content.
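As a rough sketch of the upload path through the ExtractingRequestHandler, the following Python code builds the request URL to which a binary file could be posted. It assumes the handler is registered at /update/extract, which is its conventional path in Solr; the host, port, core name, and document ID are hypothetical, and the code only constructs the URL without contacting a server.

```python
from urllib.parse import urlencode


def extract_url(solr_core_url, doc_id, commit=True):
    """Build a URL for posting a file to the extracting handler (assumed path)."""
    params = {
        "literal.id": doc_id,  # sets the ID field of the extracted document
        "commit": "true" if commit else "false",
    }
    return f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"


# Hypothetical Solr core URL and document ID, for illustration only.
url = extract_url("http://localhost:8983/solr/example_core", "doc1")
```

The file itself would be sent as the body of an HTTP POST to this URL; Tika then extracts the text on the Solr side.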

Note: Unstructured content must not be encrypted; otherwise, it cannot be crawled and indexed correctly.

For example, the solrconfig.xml file contains the following content:

Where:

For more information about unstructured content, see Unstructured and site content.

For more information about the WebSphere Commerce index schema, see WebSphere Commerce Search index schema and WebSphere Commerce Search index schema definition.