Unstructured and site content

WebSphere Commerce Search can search for both structured and unstructured site content. Unstructured site content includes documents that do not adhere to a specific data model, such as product attachments in various formats. For example, user manuals and warranty information are considered unstructured content. The elements, construction, and organization of such content are typically unknown and can vary depending on the file type.

Important: By default, WebSphere Commerce Search indexes only unencrypted unstructured data. Processing encrypted data with WebSphere Commerce Search is not supported. When you work with search index types, unstructured content is categorized under the catalog entry search index.

Although the WebSphere Commerce database might not store the unstructured content, that content can still be indexed and retrieved. For example, when a shopper searches for laptop, the search results can include unstructured content, such as attachments in .pdf or .doc format, that contains the keyword laptop.
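
The following Java sketch illustrates the general idea of retrieving attachments by keyword once their text has been extracted. It is a toy inverted index, not the WebSphere Commerce Search implementation. The class name AttachmentIndexSketch, its methods, and the sample attachment names are hypothetical, and the sketch assumes that the text of each attachment has already been extracted by a parser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not the WebSphere Commerce Search implementation.
class AttachmentIndexSketch {

    // Maps each lowercase keyword to the names of the attachments that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexes text that is assumed to have been extracted from an attachment already.
    void add(String attachmentName, String extractedText) {
        String[] tokens = extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(attachmentName);
            }
        }
    }

    // Returns the attachments whose text contains the keyword, in alphabetical order.
    List<String> search(String keyword) {
        Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        AttachmentIndexSketch index = new AttachmentIndexSketch();
        // Hypothetical attachments; only their extracted text is indexed.
        index.add("user-manual.pdf", "Setting up your laptop for the first time");
        index.add("warranty.doc", "This warranty covers the laptop battery for one year");
        index.add("returns-policy.pdf", "Items can be returned within 30 days");
        System.out.println(index.search("laptop")); // prints [user-manual.pdf, warranty.doc]
    }
}
```

The sketch keeps only the extracted keywords and the attachment names, which mirrors the point above: the content itself does not need to be stored in the database to be found by a keyword search.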


Site content

When you work with search index types, site content is categorized under the catalog entry search index.

Site content includes HTML and other site files from WebSphere Commerce starter stores. It is fetched and crawled by the site content crawler.

By default, WebSphere Commerce provides sample static HTML files that the site content crawler fetches and crawls to help populate the site content search index. You can configure the site content crawler to fetch extra content from WebSphere Commerce starter stores.

See Indexing site content with WebSphere Commerce Search.


Supported file types

WebSphere Commerce Search uses parser libraries to detect and extract metadata and structured text content from documents. The following file types are supported by default: