Unstructured and site content

WebSphere Commerce Search can search both structured and unstructured site content. Unstructured site content includes documents that do not adhere to a specific data model, such as product attachments in various formats. For example, user manuals and warranty information are considered unstructured content. The elements, construction, and organization of such content are typically unknown and can vary depending on the file type.
Important: By default, WebSphere Commerce Search indexes unstructured data only in decrypted form. That is, processing encrypted data with WebSphere Commerce Search is not supported. When you work with search index types, unstructured content is categorized under the catalog entry search index.
Although the WebSphere Commerce database might not store the unstructured content, the content can still be indexed and retrieved. For example, when a search for laptop is submitted, the search results can include unstructured content, such as attachments in .pdf or .doc format, that contains the laptop keyword.

Site content

When you work with search index types, site content is categorized under the catalog entry search index.
Site content includes HTML and other site files from WebSphere Commerce starter stores. It is fetched and crawled by the site content crawler.
By default, WebSphere Commerce provides sample static HTML files that the site content crawler fetches and crawls to help populate the site content search index. You can configure the site content crawler to fetch extra content from WebSphere Commerce starter stores.
See Indexing site content with WebSphere Commerce Search.

Supported file types

WebSphere Commerce Search uses parser libraries to detect and extract metadata and structured text content from documents. The following file types are supported by default:
Microsoft Office
Excel 97-2003 (.xls).
Excel 2007 (.xlsx).
Outlook documents (.msg).
PowerPoint 97-2003 (.ppt).
PowerPoint 2007 (.pptx).
Visio (.vsd).
Word 97-2003 (.doc).
Word 2007 (.docx).
Java
Classes (.class).
JAR files (.jar).
Documents and text
OpenDocument (.odt, .odp, .ods).
Plain text (.txt).
Portable Document Format (.pdf).
Rich Text Format (.rtf).
The following Tika version is provided with WebSphere Commerce Search by default for parsing unstructured documents:

Tika 1.7
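
The supported file types listed above can be turned into a simple pre-check before attachments are submitted for indexing. The following is a minimal sketch, not part of WebSphere Commerce Search: the helper name is_supported_attachment and the sample paths are hypothetical, and the check looks only at the file extension, whereas the parser libraries inspect the document itself.

```python
from pathlib import PurePath

# Extensions taken from the default supported file types listed above.
SUPPORTED_EXTENSIONS = frozenset({
    ".xls", ".xlsx", ".msg", ".ppt", ".pptx", ".vsd", ".doc", ".docx",  # Microsoft Office
    ".class", ".jar",                                                    # Java
    ".odt", ".odp", ".ods", ".txt", ".pdf", ".rtf",                      # Documents and text
})


def is_supported_attachment(path: str) -> bool:
    """Hypothetical helper: True if the extension is in the default list."""
    return PurePath(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_attachment("manuals/laptop-warranty.PDF"))  # True
print(is_supported_attachment("images/laptop.jpeg"))           # False
```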

Unstructured content schema

WebSphere Commerce Search can directly extract metadata and content from the unstructured data source. Differing unstructured data formats might contain varying metadata information. For example, Microsoft Word files contain metadata such as creator, company, and created date, whereas JPEG image files contain metadata such as width and height.
Solr Cell provides a mechanism to add a prefix to the generated metadata fields. Because of this behavior, a typical schema for unstructured content must contain at least one dynamic field, such as tika_*, to store all metadata information. The main difference between structured and unstructured content is that the names and total number of fields for one unstructured document might differ from those of another unstructured document.
WebSphere Commerce Search manages unstructured content by requesting Tika to parse the documents first. The parsed output is then sent to the WebSphere Commerce Search server for indexing.
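
To illustrate why a single dynamic field is enough, the following minimal sketch prefixes the metadata of two differently shaped documents and checks that every resulting field name matches the tika_* pattern. The helper prefix_metadata and the sample metadata values are hypothetical; the sketch only models the naming scheme and does not call Solr Cell or Tika.

```python
from fnmatch import fnmatchcase

DYNAMIC_FIELD_PATTERN = "tika_*"  # dynamic field pattern named in the text


def prefix_metadata(metadata: dict, prefix: str = "tika_") -> dict:
    """Hypothetical helper: prefix every extracted metadata name."""
    return {f"{prefix}{name}": value for name, value in metadata.items()}


# Hypothetical metadata for two documents of different types.
word_doc = {"creator": "J. Smith", "company": "Example Corp", "created": "2015-03-01"}
jpeg_image = {"width": "1024", "height": "768"}

for doc in (word_doc, jpeg_image):
    fields = prefix_metadata(doc)
    # Every field matches the same dynamic field, whatever the source format.
    assert all(fnmatchcase(name, DYNAMIC_FIELD_PATTERN) for name in fields)
    print(sorted(fields))
# ['tika_company', 'tika_created', 'tika_creator']
# ['tika_height', 'tika_width']
```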

Schema changes for related structured and unstructured content

When structured content is related to unstructured content, the structured schema.xml file must contain a new field that represents the unstructured information. This new field makes it possible to query the structured objects by their unstructured content. For example, to search for products by the content of their attachments, the new field definition resembles the following form:
<field name="unstructure" type="wc_text" indexed="true" stored="false" multiValued="true" />
Where stored="false" means that the unstructured content in this field can be searched but is not returned in query results.
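
The effect of indexed="true" combined with stored="false" can be modeled with a small sketch: a keyword matches on the field, but the field itself is left out of what comes back. This is a toy model only, not how the search server is implemented; the product document, the search function, and the assumption that name is a stored field are all hypothetical.

```python
import xml.etree.ElementTree as ET

FIELD_XML = ('<field name="unstructure" type="wc_text" indexed="true" '
             'stored="false" multiValued="true" />')
field = ET.fromstring(FIELD_XML)
indexed = field.get("indexed") == "true"
stored = field.get("stored") == "true"

# Hypothetical product document with text extracted from its attachment.
product = {"name": "Laptop A", "unstructure": ["user manual for the laptop"]}


def search(doc, keyword):
    """Toy model: match on the indexed field, return only stored fields."""
    matches = indexed and any(keyword in value.split() for value in doc["unstructure"])
    if not matches:
        return None
    # 'name' is assumed to be stored; 'unstructure' is returned only if stored.
    return {k: v for k, v in doc.items() if k != "unstructure" or stored}


print(search(product, "laptop"))  # {'name': 'Laptop A'}
```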

Unstructured content indexing and handling
The information in unstructured content can reside in several locations, including the WebSphere Commerce database, the file systems of servers, and the internet. Therefore, the indexing process for unstructured content uses a hybrid of data sources to create indexing information with the existing WebSphere Commerce Search indexing framework.
Unstructured content in the storefront
Searching for unstructured content requires two queries because unstructured content is indexed in a different core. The first query gets the related IDs by searching the unstructure field of the structured content. The second query searches the unstructured index by using the keywords, scoped to the IDs that the first query returned.
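
The two-query flow described above can be sketched with in-memory stand-ins for the two cores. This is a minimal illustration under stated assumptions, not the storefront implementation: the field names catentry_id, attachment_id, and content, the sample documents, and both function names are hypothetical, and plain word matching stands in for the real text analysis.

```python
# Hypothetical in-memory stand-ins for the structured and unstructured cores.
STRUCTURED_INDEX = [
    {"catentry_id": "10001", "name": "Laptop A",
     "unstructure": ["laptop user manual battery care"]},
    {"catentry_id": "10002", "name": "Desk lamp",
     "unstructure": ["lamp warranty information"]},
]
UNSTRUCTURED_INDEX = [
    {"attachment_id": "a1", "catentry_id": "10001", "content": "laptop user manual battery care"},
    {"attachment_id": "a2", "catentry_id": "10002", "content": "lamp warranty information"},
    {"attachment_id": "a3", "catentry_id": "10003", "content": "laptop sleeve care guide"},
]


def find_related_ids(keyword):
    """Query 1: search the unstructure field of the structured content."""
    return {
        doc["catentry_id"]
        for doc in STRUCTURED_INDEX
        if any(keyword in text.split() for text in doc["unstructure"])
    }


def find_attachments(keyword, id_scope):
    """Query 2: search the unstructured index, limited to the IDs from query 1."""
    return [
        doc["attachment_id"]
        for doc in UNSTRUCTURED_INDEX
        if doc["catentry_id"] in id_scope and keyword in doc["content"].split()
    ]


ids = find_related_ids("laptop")
print(find_attachments("laptop", ids))  # ['a1']
```

Attachment a3 also contains the keyword, but it is excluded because its ID is outside the scope that the first query produced.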
Enable search on additional unstructured content types
You can enable searching on more unstructured content types so that custom attachment data can be processed by search and retrieved in store search results.

Related tasks
Crawling WebSphere Commerce site content