Enable search on additional unstructured content types

We can enable searching on more unstructured content types so that custom attachment data is processed by WebSphere Commerce Search and returned in store search results.

Important: By default, WebSphere Commerce Search indexes unstructured data only in decrypted form. Processing encrypted data with WebSphere Commerce Search is not supported.


Before beginning

Ensure that you complete the following tasks:


Procedure

  1. Create a parser for the new file type.

    WebSphere Commerce supports extra parsers to enable searching on more file types.

    1. Prepare for the extension.

      Before you implement the logic for the new file type, select the MIME types that the new parser must handle.

      1. Open the tika-mimetypes.xml file. The file is in the tika-core-0.4.jar file, under org/apache/tika/mime.

      2. Enter the MIME type to implement. For example, for media of type application/vnd.rn-realmedia:

          <mime-type type="application/vnd.rn-realmedia">
            <magic priority="50">
              <match value=".RMF" type="string" offset="0" />
            </magic>
            <glob pattern="*.rm"/>
          </mime-type>

      3. Find a reader that understands the file format so that it can be parsed successfully.

      4. If the parser must support more MIME types, select them as well. You need every selected MIME type when you implement the logic.
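
The MIME type entry in the preparation steps combines two detection rules: a magic rule (the string .RMF at offset 0) and a file name glob (*.rm). The following is a minimal sketch of what those two rules mean, written in plain Java without Tika. The RealMediaSniffer class and its method names are hypothetical, the glob is treated as case-sensitive for simplicity, and the real detection logic in Tika (for example, how the priority attribute is applied) is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; not part of Tika or WebSphere Commerce.
public class RealMediaSniffer {
    // Bytes from the rule <match value=".RMF" type="string" offset="0"/>.
    private static final byte[] MAGIC = ".RMF".getBytes(StandardCharsets.US_ASCII);

    private static final String REALMEDIA = "application/vnd.rn-realmedia";
    private static final String UNKNOWN = "application/octet-stream";

    /** Returns true when the header starts with the bytes ".RMF" at offset 0. */
    public static boolean matchesMagic(byte[] header) {
        return header != null
                && header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOfRange(header, 0, MAGIC.length), MAGIC);
    }

    /** Returns true when the file name matches the glob *.rm (assumed case-sensitive). */
    public static boolean matchesGlob(String fileName) {
        return fileName != null && fileName.endsWith(".rm");
    }

    /** Simplified detection: either rule is enough to report the RealMedia type. */
    public static String detect(byte[] header, String fileName) {
        if (matchesMagic(header) || matchesGlob(fileName)) {
            return REALMEDIA;
        }
        return UNKNOWN;
    }
}
```

For example, RealMediaSniffer.detect(null, "clip.rm") reports application/vnd.rn-realmedia through the glob rule alone, while a file named clip.rmvb with an unrelated header falls through to the generic type.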

    2. Implement the extension logic.

      1. Create a class that implements the org.apache.tika.parser.Parser interface, such as com.ibm.commerce.tika.parser.video.VideoParser. Its getSupportedTypes(ParseContext) method must return the set of supported media types. For example:

          // Media types that this parser handles.
          private static final Set<MediaType> SUPPORTED_TYPES =
                  Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                          MediaType.application("vnd.rn-realmedia"))));

          // Returns the set of supported media types.
          public Set<MediaType> getSupportedTypes(ParseContext context) {
              return SUPPORTED_TYPES;
          }

        The call MediaType.application("vnd.rn-realmedia") produces the media type application/vnd.rn-realmedia, which matches the previously selected MIME type.

      2. The com.ibm.commerce.tika.parser.video.VideoParser.parse(InputStream, ContentHandler, Metadata, ParseContext) method must handle the content of the media, which is passed as the InputStream parameter. It must also populate the metadata container of the media, which is passed as the Metadata parameter. For example:

          // Set the content type and add sample metadata values.
          metadata.set(Metadata.CONTENT_TYPE, "application/vnd.rn-realmedia");
          metadata.add(Metadata.PUBLISHER, "Publisher");
          metadata.add(Metadata.LANGUAGE, "RM_language");
          metadata.add(Metadata.COMPANY, "IBM Commerce");
          // Start and end an XHTML document without writing any body content.
          XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
          xhtml.startDocument();
          xhtml.endDocument();

        When this method returns, the metadata contains the extra publisher, language, and company values. However, no content is extracted, because the example only starts and ends the XHTML document without writing any text to it.

    3. Assemble the logic and enable WebSphere Commerce Search to recognize it. A service registry file makes the new parser known to the WebSphere Commerce Search framework.

      1. Create the following file:

        • META-INF/services/org.apache.tika.parser.Parser

      2. Insert the parser's full class name into the file. For example:

          com.ibm.commerce.tika.parser.video.VideoParser

      3. Export the code and the registry file into a JAR file and save it in the same directory as the tika-parser-version.jar file.
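
The registry file name, META-INF/services/org.apache.tika.parser.Parser, appears to follow the Java service-provider convention, in which the file is named after the interface and lists one fully qualified implementation class per line, with # starting a comment. The following is a minimal sketch, under that assumption, of how such a file is read. The ServiceRegistryReader class is hypothetical and only illustrates the file format; it is not how WebSphere Commerce Search or Tika loads parsers internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; illustrates the registry file format only.
public class ServiceRegistryReader {

    /**
     * Extracts provider class names from the text of a registry file.
     * Text after '#' is treated as a comment, and blank lines are skipped.
     */
    public static List<String> providerNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }
}
```

A registry file that contains a comment line followed by com.ibm.commerce.tika.parser.video.VideoParser yields a single-entry list with that class name. A typo in the class name or the file name is a common reason for a parser not being picked up, so it is worth checking both before you export the JAR file.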

  2. Confirm the results in WebSphere Commerce Search.

    WebSphere Commerce Search automatically finds the proper parser for the file content. For example, if a RealMedia file is in the extracting request, WebSphere Commerce Search returns the result of the new parser. Solr Cell uses the result to compose a new document and sends it to the search server for create and update commands. To confirm, we can check the index content, where the result resembles the following snippet:

      content_type:=>application/vnd.rn-realmedia
      tika_company:=>IBM Commerce
      tika_publisher:=>Publisher
      tika_language:=>RM_language
      tika_stream_size:=>614135
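
To make the check repeatable, the field listing can be compared against the values that the parser sets. The following is a minimal sketch, assuming that the index content is available as text in the name:=>value layout shown in the snippet. The IndexSnippetCheck class and its method names are hypothetical, and the expected values are the sample values from the parser example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes the "name:=>value" layout of the snippet.
public class IndexSnippetCheck {

    /** Parses lines of the form name:=>value into an ordered map. */
    public static Map<String, String> parseFields(String snippet) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : snippet.split("\\R")) {
            int sep = line.indexOf(":=>");
            if (sep < 0) {
                continue; // not a field line
            }
            String name = line.substring(0, sep).trim();
            String value = line.substring(sep + 3).trim();
            fields.put(name, value);
        }
        return fields;
    }

    /** Returns true when the fields carry the sample values set by the parser. */
    public static boolean hasParserMetadata(Map<String, String> fields) {
        return "application/vnd.rn-realmedia".equals(fields.get("content_type"))
                && "IBM Commerce".equals(fields.get("tika_company"))
                && "Publisher".equals(fields.get("tika_publisher"))
                && "RM_language".equals(fields.get("tika_language"));
    }
}
```

If hasParserMetadata returns false for a document that was indexed from a RealMedia attachment, the new parser was probably not selected for that file.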


What to do next

After we enable searching on more unstructured content types by creating a new parser, we can search the storefront to confirm that the search results contain our custom unstructured content types.